## I. Introduction

### 1. Identify a domain-specific area:

Bioinformatics: healthcare and medicine, specifically diabetes.

### 2. Dataset:

Temporary toy dataset: `datasets.load_diabetes()` from `sklearn.datasets` (https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.html)

Dataset to use for the real submission: https://www.kaggle.com/code/adamhertelendi/diabetes-ds-data-analysis

### 3. Objectives of the Project

Diabetes is responsible for {insert number here} deaths every year. Certain measurable markers can be used to estimate the likelihood of diabetes developing.

## II. Implementation

### 4. Convert / store dataset locally and preprocess the data

In [1]:
from sklearn import datasets

In [2]:
# load diabetes dataset from the sklearn dataset library
diabetes = datasets.load_diabetes()

# check type of dataset
print(type(diabetes))

<class 'sklearn.utils.Bunch'>


In [3]:
print(diabetes.DESCR)

.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attribute Information:
      - Age
      - Sex
      - Body mass index
      - Average blood pressure
      - S1
      - S2
      - S3
      - S4
      - S5
      - S6

Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times `n_samples` (i.e. the sum of squares of each column totals 1).

Source URL:
https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

For more information see:
Bra

##### The first 10 columns are numeric predictive values, and the 11th column is the target ('y') variable that will be predicted

In [4]:
# print out the columns
print(diabetes.feature_names)

['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']


In [5]:
# assign the x and y variables
x = diabetes.data
y = diabetes.target

In [6]:
print(diabetes.data)

[[ 0.03807591  0.05068012  0.06169621 ... -0.00259226  0.01990842
  -0.01764613]
 [-0.00188202 -0.04464164 -0.05147406 ... -0.03949338 -0.06832974
  -0.09220405]
 [ 0.08529891  0.05068012  0.04445121 ... -0.00259226  0.00286377
  -0.02593034]
 ...
 [ 0.04170844  0.05068012 -0.01590626 ... -0.01107952 -0.04687948
   0.01549073]
 [-0.04547248 -0.04464164  0.03906215 ...  0.02655962  0.04452837
  -0.02593034]
 [-0.04547248 -0.04464164 -0.0730303  ... -0.03949338 -0.00421986
   0.00306441]]
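The dataset description notes that each feature column has been mean centered and scaled so that its sum of squares totals 1, which explains the small values printed above. A minimal sketch of how this could be checked with NumPy (the tolerance values are arbitrary choices for illustration):

```python
import numpy as np
from sklearn import datasets

diabetes = datasets.load_diabetes()
x = diabetes.data

# Mean of each feature column; expected to be approximately zero
col_means = x.mean(axis=0)

# Sum of squares of each feature column; expected to be approximately one
col_sum_sq = (x ** 2).sum(axis=0)

print("Columns centered:", np.allclose(col_means, 0.0, atol=1e-8))
print("Columns unit sum of squares:", np.allclose(col_sum_sq, 1.0, atol=1e-6))
```

Because every column is on the same scale, the fitted regression coefficients are not expressed in the original units of the measurements.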


In [7]:
# check the rows and columns of x and y
x.shape, y.shape

((442, 10), (442,))

#### Identify key series of the dataset and provide statistical summary of the data, including:
 - measures of central tendency
 - measures of spread
 - type of distribution
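
The three items listed above could be computed along the following lines. This is a sketch only, assuming pandas is available; the `DataFrame` layout, the column name `target`, and the choice of statistics (mean and median for central tendency, standard deviation, range and interquartile range for spread, skewness and kurtosis as rough indicators of distribution shape) are illustrative choices, not part of the original notebook.

```python
import pandas as pd
from sklearn import datasets

diabetes = datasets.load_diabetes()

# Put features and target into one table ("target" is a name chosen here)
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
df["target"] = diabetes.target

# Measures of central tendency
central = df.agg(["mean", "median"]).T

# Measures of spread
spread = df.agg(["std", "min", "max"]).T
spread["iqr"] = df.quantile(0.75) - df.quantile(0.25)

# Rough indicators of distribution shape
shape = pd.DataFrame({"skew": df.skew(), "kurtosis": df.kurt()})

# One row per column, one column per statistic
summary = pd.concat([central, spread, shape], axis=1)
print(summary.round(3))
```

Note that the ten feature columns are already centered and scaled, so their means are close to zero by construction; only the `target` row is in its original units.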

##### Import module necessary for data split

In [8]:
from sklearn.model_selection import train_test_split

In [9]:
# split the data into 80% training and 20% test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

In [10]:
x_train.shape, y_train.shape

((353, 10), (353,))

In [11]:
x_test.shape, y_test.shape

((89, 10), (89,))
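
`train_test_split` shuffles the rows before splitting, so without a fixed seed each run yields a different split and therefore different coefficients and scores. A minimal sketch of a reproducible split, assuming a seed of `42` (an arbitrary value chosen for illustration, not the one used to produce the outputs in this notebook):

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

x, y = datasets.load_diabetes(return_X_y=True)

# random_state fixes the shuffle; 42 is an arbitrary illustrative seed
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)

# Repeating the call with the same seed reproduces the same split
x_train_again, x_test_again, _, _ = train_test_split(
    x, y, test_size=0.2, random_state=42
)

print(x_train.shape, x_test.shape)
print("Same split:", np.array_equal(x_test, x_test_again))
```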

##### Import modules necessary for creating the regression model

In [12]:
from sklearn.metrics import mean_squared_error, r2_score
from sklearn import linear_model

Define the regression model

In [13]:
model = linear_model.LinearRegression()

Fit the model on the training data

In [14]:
model.fit(x_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Use the trained model to make predictions on the test set

In [15]:
y_pred = model.predict(x_test)

#### Create dictionary of independent variables (features) and their respective coefficients

In [16]:
# map each feature name to its fitted coefficient
coeff_dict = dict(zip(diabetes.feature_names, model.coef_))


In [17]:
coeff_dict

{'age': -16.708573184617272,
 'sex': -315.70760706964575,
 'bmi': 509.51366240837945,
 'bp': 366.0843301185073,
 's1': -601.4487110952564,
 's2': 359.76314464720457,
 's3': -21.998359254461842,
 's4': 198.41134185212601,
 's5': 593.6049221611734,
 's6': 83.65121228090881}
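
Since all ten features share the same scale, the magnitudes of the coefficients can be compared with each other. The sketch below ranks them by absolute value, using the values printed above copied into a literal dictionary (rounded to two decimals for brevity). The ranking should be read with caution: several of the serum measurements may be correlated with each other, which can make individual coefficients unstable from one split to the next.

```python
# Coefficients copied from the output above, rounded to two decimals
coeff_dict = {
    "age": -16.71, "sex": -315.71, "bmi": 509.51, "bp": 366.08,
    "s1": -601.45, "s2": 359.76, "s3": -22.00, "s4": 198.41,
    "s5": 593.60, "s6": 83.65,
}

# Sort features by the absolute size of their coefficient, largest first
ranked = sorted(coeff_dict.items(), key=lambda item: abs(item[1]), reverse=True)

for name, coeff in ranked:
    print(f"{name:>4}: {coeff:9.2f}")
```

For this particular split, `s1`, `s5` and `bmi` have the largest coefficients in absolute terms, while `age` has the smallest.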

##### Print out the model performance

In [18]:
print(f"Intercept: {model.intercept_}\n")
print(f"Mean Squared Error: {mean_squared_error(y_test,y_pred)}\n")
print(f"Coefficient of determination (R^2): {r2_score(y_test, y_pred)}\n")
print(f"Coefficients: {coeff_dict}")

Intercept: 152.83493271594782

Mean Squared Error: 4151.383234806006

Coefficient of determination (R^2): 0.3947742849661636

Coefficients: {'age': -16.708573184617272, 'sex': -315.70760706964575, 'bmi': 509.51366240837945, 'bp': 366.0843301185073, 's1': -601.4487110952564, 's2': 359.76314464720457, 's3': -21.998359254461842, 's4': 198.41134185212601, 's5': 593.6049221611734, 's6': 83.65121228090881}
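
To make the two metrics concrete: the mean squared error is the average of the squared residuals, and R^2 is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of `y_test`. The sketch below recomputes both by hand and compares them with the scikit-learn functions. It uses `random_state=0` (an arbitrary seed chosen for illustration), so its numbers will differ from the outputs above.

```python
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

x, y = datasets.load_diabetes(return_X_y=True)

# random_state=0 is an arbitrary illustrative seed
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)
model = linear_model.LinearRegression().fit(x_train, y_train)
y_pred = model.predict(x_test)

# Residuals: actual minus predicted values
residuals = y_test - y_pred

# Mean squared error: average squared residual
mse_manual = np.mean(residuals ** 2)

# R^2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print("MSE matches:", np.isclose(mse_manual, mean_squared_error(y_test, y_pred)))
print("R^2 matches:", np.isclose(r2_manual, r2_score(y_test, y_pred)))
```

Taking the square root of the mean squared error gives an error in the units of the target; for the value of 4151.38 reported above this is roughly 64.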


#### Import module for creating plot

In [19]:
import seaborn as sns

In [20]:
# pass the values as keyword arguments: scatterplot() accepts at most one positional argument (data)
sns.scatterplot(x=y_test, y=y_pred)
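
A plot of predicted against actual values is easier to read with axis labels and a diagonal reference line, on which a perfect prediction would fall. The following is a sketch under a few assumptions: the seed `random_state=0`, the axis label text, and the non-interactive `Agg` backend are illustrative choices (inside a notebook the `matplotlib.use` line can be dropped).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split

x, y = datasets.load_diabetes(return_X_y=True)

# random_state=0 is an arbitrary illustrative seed
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)
y_pred = linear_model.LinearRegression().fit(x_train, y_train).predict(x_test)

fig, ax = plt.subplots(figsize=(5, 5))

# x and y must be passed as keyword arguments
sns.scatterplot(x=y_test, y=y_pred, ax=ax)

# Diagonal reference line: points on it are predicted exactly
low = min(y_test.min(), y_pred.min())
high = max(y_test.max(), y_pred.max())
ax.plot([low, high], [low, high], linestyle="--", color="grey")

ax.set_xlabel("Actual disease progression")
ax.set_ylabel("Predicted disease progression")
```

Points far from the dashed line correspond to patients whose disease progression the model predicts poorly.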