# Diabetes Data Set

Dataset file: 'diabetes.data'  
Reference link for description of dataset: https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

### Preview of the Data Set

Load the data set.

a) Analyse the data set. Print the number of features, feature names, data types of the features, number of data points and the values of the first 10 data points.

In [1]:
# Importing libraries
import pandas as pd
import numpy as np
# accessing the data
data = pd.read_csv('diabetes.data', sep='\t')
# printing first 10 rows of the data
data.head(10)

Unnamed: 0,AGE,SEX,BMI,BP,S1,S2,S3,S4,S5,S6,Y
0,59,2,32.1,101.0,157,93.2,38.0,4.0,4.8598,87,151
1,48,1,21.6,87.0,183,103.2,70.0,3.0,3.8918,69,75
2,72,2,30.5,93.0,156,93.6,41.0,4.0,4.6728,85,141
3,24,1,25.3,84.0,198,131.4,40.0,5.0,4.8903,89,206
4,50,1,23.0,101.0,192,125.4,52.0,4.0,4.2905,80,135
5,23,1,22.6,89.0,139,64.8,61.0,2.0,4.1897,68,97
6,36,2,22.0,90.0,160,99.6,50.0,3.0,3.9512,82,138
7,66,2,26.2,114.0,255,185.0,56.0,4.55,4.2485,92,63
8,60,2,32.1,83.0,179,119.4,42.0,4.0,4.4773,94,110
9,29,1,30.0,85.0,180,93.4,43.0,4.0,5.3845,88,310


In [2]:
# dataset dimensions (the last column, Y, is the response, not a feature)
print("The size of the dataset: ", data.shape)
print("The no. of samples is: ", data.shape[0])
print("The no. of features is: ", data.shape[1] - 1)

The size of the dataset:  (442, 11)
The no. of samples is:  442
The no. of features is:  10


In [3]:
# name of the available features
features = data.columns
print("The features available are: ", features)

The features available are:  Index(['AGE', 'SEX', 'BMI', 'BP', 'S1', 'S2', 'S3', 'S4', 'S5', 'S6', 'Y'], dtype='object')


In [4]:
# getting the datatype of features 
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 442 entries, 0 to 441
Data columns (total 11 columns):
AGE    442 non-null int64
SEX    442 non-null int64
BMI    442 non-null float64
BP     442 non-null float64
S1     442 non-null int64
S2     442 non-null float64
S3     442 non-null float64
S4     442 non-null float64
S5     442 non-null float64
S6     442 non-null int64
Y      442 non-null int64
dtypes: float64(6), int64(5)
memory usage: 38.1 KB


In [5]:
# descriptive statistics of the columns of the data
data.describe()

# There are no missing values in the dataset.
# The features are of different scales.

Unnamed: 0,AGE,SEX,BMI,BP,S1,S2,S3,S4,S5,S6,Y
count,442.0,442.0,442.0,442.0,442.0,442.0,442.0,442.0,442.0,442.0,442.0
mean,48.5181,1.468326,26.375792,94.647014,189.140271,115.43914,49.788462,4.070249,4.641411,91.260181,152.133484
std,13.109028,0.499561,4.418122,13.831283,34.608052,30.413081,12.934202,1.29045,0.522391,11.496335,77.093005
min,19.0,1.0,18.0,62.0,97.0,41.6,22.0,2.0,3.2581,58.0,25.0
25%,38.25,1.0,23.2,84.0,164.25,96.05,40.25,3.0,4.2767,83.25,87.0
50%,50.0,1.0,25.7,93.0,186.0,113.0,48.0,4.0,4.62005,91.0,140.5
75%,59.0,2.0,29.275,105.0,209.75,134.5,57.75,5.0,4.9972,98.0,211.5
max,79.0,2.0,42.2,133.0,301.0,242.4,99.0,9.09,6.107,124.0,346.0


### Training and Testing Data Sets

b) Split the data set into training and testing data set with a 80:20 ratio.

(Hint: What precautions must you take before you split the data set?)

In [6]:
# splitting the dataset 
from sklearn.model_selection import train_test_split
X = data.drop(['Y'], axis = 1)
Y = data['Y']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.20, random_state = 42)
print("The dimension of the training data is: ", X_train.shape)
print("The dimension of the test data is: ", X_test.shape)

The dimension of the training data is:  (353, 10)
The dimension of the test data is:  (89, 10)
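
One way to act on the hint in b) is to check for missing values and to shuffle the rows before splitting, so that any ordering in the file cannot leak into the partition. The following is a minimal sketch using only NumPy and pandas; `shuffled_split` and the toy frame are hypothetical names introduced for illustration and are not part of the notebook. `train_test_split` already shuffles by default, so the sketch only makes the steps explicit.

```python
import numpy as np
import pandas as pd


def shuffled_split(df, target, test_fraction=0.2, seed=42):
    """Hypothetical helper: validate, shuffle, then split a DataFrame."""
    # Precaution 1: refuse to split data that still contains missing values.
    if df.isna().any().any():
        raise ValueError("Handle missing values before splitting.")
    # Precaution 2: shuffle row positions with a fixed seed for reproducibility.
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(df))
    n_test = int(np.ceil(test_fraction * len(df)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    X, y = df.drop(columns=[target]), df[target]
    return X.iloc[train_idx], X.iloc[test_idx], y.iloc[train_idx], y.iloc[test_idx]


# Toy stand-in with the same number of rows as the diabetes data (442).
toy = pd.DataFrame({"BMI": np.linspace(18.0, 42.2, 442),
                    "BP": np.linspace(62.0, 133.0, 442),
                    "Y": np.arange(442)})
X_tr, X_te, y_tr, y_te = shuffled_split(toy, target="Y")
print(X_tr.shape, X_te.shape)
```

A further precaution, relevant to part d), is to compute any scaling statistics on the training rows only and then apply them to the test rows.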


### Linear Regression

c) Using linear regression, seek a model for the response of interest ($Y$), as a function of the baseline variables such as age, sex, body mass index, etc. Compute the training error and testing error.

In [7]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

lr = LinearRegression()
fit1 = lr.fit(X_train, Y_train)

y_pred_train = lr.predict(X_train)
mse_train = mean_squared_error(Y_train, y_pred_train)

y_pred_test = lr.predict(X_test)
mse_test = mean_squared_error(Y_test, y_pred_test)

print("The training error is: {}\n ".format(mse_train))
print("The testing error is: {}\n ".format(mse_test))
print("The coefficients of the variables in the regression model: \n",  fit1.coef_)

The training error is: 2868.549702835577
 
The testing error is: 2900.193628493484
 
The coefficients of the variables in the regression model: 
 [  0.13768782 -23.06446772   5.84636265   1.19709252  -1.28168474
   0.81115203   0.60165319  10.15953917  67.1089624    0.20159907]
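
The mean squared errors above are in squared units of Y, which makes them hard to read. A small sketch, using only NumPy, of two companion metrics: the root mean squared error (same units as Y) and the coefficient of determination. The helper names `rmse` and `r_squared` are illustrative assumptions, not functions used in the notebook.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as y."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r_squared(y_true, y_pred):
    """Fraction of the variance of y explained by the predictions."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Square roots of the training and testing MSE printed above.
print(np.sqrt(2868.549702835577), np.sqrt(2900.193628493484))
```

Both square roots come out at roughly 54, against a standard deviation of about 77 for Y in the `describe()` table, so the model explains part of the variation in the response but far from all of it.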


### Data Preprocessing

d) Normalize the data set and perform linear regression again. Compute the training error and testing error. Comment.

In [15]:
from sklearn.preprocessing import normalize
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# center each feature column, then apply sklearn's normalize, which by default rescales each row (sample) to unit L2 norm;
# it does not give the columns unit standard deviation. Y is left unchanged here.
X1 = data.drop(['Y'], axis = 1)
Y1 = data['Y']
X1 = X1 - X1.mean()
X1 = normalize(X1)
data_n = pd.DataFrame(X1, columns = ['AGE', 'SEX', 'BMI', 'BP', 'S1', 'S2', 'S3', 'S4', 'S5', 'S6'])
X_train_n, X_test_n, Y_train_n, Y_test_n = train_test_split(X1, Y1, test_size = 0.20, random_state = 42)

# fitting the linear regression model
lr2 = LinearRegression()
fit2 = lr2.fit(X_train_n, Y_train_n)
y_pred_train2 = lr2.predict(X_train_n)
mse_train2 = mean_squared_error(Y_train_n, y_pred_train2)
y_pred_test2 = lr2.predict(X_test_n)
mse_test2 = mean_squared_error(Y_test_n, y_pred_test2)

print("The training error(after normalizing) is: {} \n".format(mse_train2))
print("The testing error(after normalizing) is: {} \n".format(mse_test2))
print("The coefficients of the variables in the regression model: \n",  fit2.coef_)

The training error(after normalizing) is: 3107.5711888936744 

The testing error(after normalizing) is: 2972.4675510998663 

The coefficients of the variables in the regression model: 
 [  11.46979459 -755.47760298  200.11327228   42.10253798  -20.97777326
   14.39406043   12.29039893  519.07141803 1830.71250724   10.40641672]


In [16]:
from sklearn.preprocessing import normalize
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# sklearn's normalize rescales each row (sample) to unit L2 norm; it does not standardize the columns.
# Here Y is rescaled together with the features, so the errors below are not in the original units of Y.
data_n = normalize(data)
data_n = pd.DataFrame(data_n, columns = ['AGE', 'SEX', 'BMI', 'BP', 'S1', 'S2', 'S3', 'S4', 'S5', 'S6', 'Y'])
X1 = data_n.drop(['Y'], axis = 1)
Y1 = data_n['Y']
X_train_n, X_test_n, Y_train_n, Y_test_n = train_test_split(X1, Y1, test_size = 0.20, random_state = 42)

# fitting the linear regression model
lr2 = LinearRegression()
fit2 = lr2.fit(X_train_n, Y_train_n)
y_pred_train2 = lr2.predict(X_train_n)
mse_train2 = mean_squared_error(Y_train_n, y_pred_train2)
y_pred_test2 = lr2.predict(X_test_n)
mse_test2 = mean_squared_error(Y_test_n, y_pred_test2)

print("The training error(after normalizing) is: {} \n".format(mse_train2))
print("The testing error(after normalizing) is: {} \n".format(mse_test2))
print("The coefficients of the variables in the regression model: \n",  fit2.coef_)

The training error(after normalizing) is: 0.0015952614578293252 

The testing error(after normalizing) is: 0.0014785441098964435 

The coefficients of the variables in the regression model: 
 [-0.30611552 -2.45582069 -0.57179961 -0.57533418 -1.24546868 -0.55533014
 -0.61160414 -2.71190732  8.10427218 -0.68586854]
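
To support the comment asked for in d), it helps to contrast row-wise scaling with column-wise standardization. The sketch below uses only NumPy and reproduces by hand what, to my understanding, `sklearn.preprocessing.normalize` does by default (each row scaled to unit L2 norm) and what `StandardScaler` does (each column scaled to zero mean and unit variance). The three rows are AGE, BMI and BP from the preview table; `ols_predictions` and the synthetic data are assumptions made for illustration.

```python
import numpy as np

# AGE, BMI, BP for the first three rows of the preview table.
X = np.array([[59.0, 32.1, 101.0],
              [48.0, 21.6, 87.0],
              [72.0, 30.5, 93.0]])

# Row-wise: each sample is divided by its own L2 norm.
row_scaled = X / np.linalg.norm(X, axis=1, keepdims=True)

# Column-wise: each feature is centred and divided by its own standard deviation.
col_standardized = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.linalg.norm(row_scaled, axis=1))  # row lengths after row-wise scaling
print(col_standardized.std(axis=0))        # column spreads after standardization


def ols_predictions(A, y):
    """Fitted values of least squares with an intercept (hypothetical helper)."""
    A1 = np.column_stack([np.ones(len(A)), A])
    beta, *_ = np.linalg.lstsq(A1, y, rcond=None)
    return A1 @ beta


# Synthetic features on very different scales, to mimic the diabetes columns.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 3)) * np.array([1.0, 10.0, 100.0])
y = A @ np.array([2.0, -0.5, 0.03]) + rng.normal(size=50)
A_std = (A - A.mean(axis=0)) / A.std(axis=0)

# Column-wise rescaling changes the coefficients but not the fitted values.
same_fit = np.allclose(ols_predictions(A, y), ols_predictions(A_std, y))
print(same_fit)
```

Ordinary least squares with an intercept is unaffected by column-wise rescaling, so standardization alone should leave the errors unchanged. Row-wise normalization is a different, nonlinear transformation of the inputs, which is why the errors in the two cells above differ from the unscaled fit.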


### Feature Reduction

e) Rank the features in order of importance (based on the study in d)). Comment.

In [17]:
# Finding the correlation matrix and finding the correlation of other variables with Y
data_n = pd.DataFrame(X1, columns = ['AGE', 'SEX', 'BMI', 'BP', 'S1', 'S2', 'S3', 'S4', 'S5', 'S6'])
data_n['Y'] = Y1
print("The Rank based on correlation with the output: \n", abs(data_n.corr()['Y']).sort_values(ascending = False).drop(labels = ['Y']))

The Rank based on correlation with the output: 
 S1     0.896540
S2     0.636091
S3     0.604019
S6     0.515001
S5     0.449214
BP     0.405009
AGE    0.316734
SEX    0.290446
BMI    0.224424
S4     0.089442
Name: Y, dtype: float64


In [18]:
# ranking as per the regression model coefficients
coef_rank = pd.Series(fit2.coef_, index = ['AGE', 'SEX', 'BMI', 'BP', 'S1', 'S2', 'S3', 'S4', 'S5', 'S6'])
print("The Rank based on regression coefficients: \n",abs(coef_rank).sort_values(ascending = False))

The Rank based on regression coefficients: 
 S5     8.104272
S4     2.711907
SEX    2.455821
S1     1.245469
S6     0.685869
S3     0.611604
BP     0.575334
BMI    0.571800
S2     0.555330
AGE    0.306116
dtype: float64


In [19]:
# ranking of features using the sklearn feature selection module
from sklearn.feature_selection import f_regression
F, pval = f_regression(X1,Y1)
F_rank = pd.Series(F, index = ['AGE', 'SEX', 'BMI', 'BP', 'S1', 'S2', 'S3', 'S4', 'S5', 'S6'])
print("The Rank based on F_score: \n", abs(F_rank).sort_values(ascending = False))

pval_rank = pd.Series(pval, index = ['AGE', 'SEX', 'BMI', 'BP', 'S1', 'S2', 'S3', 'S4', 'S5', 'S6'])
print("The Rank based on p_value: \n", abs(pval_rank).sort_values(ascending = False))

The Rank based on F_score: 
 S1     1802.422333
S2      299.013912
S3      252.737525
S6      158.823950
S5      111.235341
BP       86.336188
AGE      49.063162
SEX      40.537536
BMI      23.336458
S4        3.548324
dtype: float64
The Rank based on p_value: 
 S4      6.026481e-02
BMI     1.880971e-06
SEX     4.857360e-10
AGE     9.336098e-12
BP      7.068198e-19
S5      2.449895e-23
S6      2.625735e-31
S3      2.702055e-45
S2      1.702265e-51
S1     1.068849e-157
dtype: float64
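
The F-score ranking reproduces the correlation ranking exactly, and this is expected: for a single regressor the F statistic is a monotone function of the absolute correlation, $F = \frac{r^2}{1 - r^2}(n - 2)$. The sketch below checks this relation against the printed values; it assumes this is the formula behind `f_regression` with its default centering, and `f_from_correlation` is a hypothetical helper name.

```python
import numpy as np


def f_from_correlation(r, n_samples):
    """Univariate F statistic implied by a Pearson correlation r."""
    r2 = np.asarray(r, dtype=float) ** 2
    return r2 / (1.0 - r2) * (n_samples - 2)


# Absolute correlations with Y, copied from the ranking printed above.
abs_corr = {"S1": 0.896540, "S2": 0.636091, "S4": 0.089442}
f_scores = {name: float(f_from_correlation(r, 442)) for name, r in abs_corr.items()}
print(f_scores)
```

The values agree with the `f_regression` output to within rounding of the printed correlations, so the correlation, F-score and p-value rankings carry the same information. The coefficient-based ranking differs because each coefficient is estimated jointly with all other features.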



### Polynomial Regression

f) Repeat the exercise in d) with quadratic features. List the features you would add to the existing data set. Compute the training error and the testing error. Comment.
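
The quadratic features to add can be listed without fitting anything. A small sketch using only the standard library; the naming scheme (`A^2`, `A*B`) is an assumption chosen for readability and is not the naming scikit-learn uses.

```python
from itertools import combinations_with_replacement

features = ['AGE', 'SEX', 'BMI', 'BP', 'S1', 'S2', 'S3', 'S4', 'S5', 'S6']

# Every unordered pair (with repetition) gives one degree-2 term:
# a pair of identical names is a square, otherwise an interaction.
quadratic_terms = [
    a + "^2" if a == b else a + "*" + b
    for a, b in combinations_with_replacement(features, 2)
]

# Bias column + original features + quadratic terms.
n_columns = 1 + len(features) + len(quadratic_terms)

print(len(quadratic_terms), n_columns)
print(quadratic_terms[:12])
```

This gives 10 squares and 45 pairwise interactions, 55 new terms in total. Together with the bias column and the 10 original features that is 66 columns, which as far as I know matches what `PolynomialFeatures(degree = 2)` produces with its default settings.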

In [20]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

pol = PolynomialFeatures(degree = 2)
X_train_new = pol.fit_transform(X_train_n)
X_test_new = pol.transform(X_test_n)  # reuse the transformer fitted on the training data
lr3 = LinearRegression()
fit3 = lr3.fit(X_train_new, Y_train_n)

y_pred_train3 = lr3.predict(X_train_new)
mse_train3 = mean_squared_error(Y_train_n, y_pred_train3)

y_pred_test3 = lr3.predict(X_test_new)
mse_test3 = mean_squared_error(Y_test_n, y_pred_test3)

print("The training error(after normalizing and using quadratic features) is: {} \n".format(mse_train3))
print("The testing error(after normalizing and quadratic features) is: {} \n".format(mse_test3))
print("The coefficients of the variables in the polynomial regression model: \n",  fit3.coef_)

The training error(after normalizing and using quadratic features) is: 0.00048399914905966244 

The testing error(after normalizing and quadratic features) is: 0.0005318401443962595 

The coefficients of the variables in the polynomial regression model: 
 [ 1.77020163e+13  9.58339324e-01  9.48280816e+00  1.86090673e+00
  3.08994432e+00  4.37934831e+00  3.89439865e-01  1.52263423e+00
  3.25122546e+01 -2.15450483e+01  7.58095159e-02  6.79014923e-01
  6.37415093e+00 -6.92637556e+00 -6.36674311e-01  2.72813795e+00
 -2.52354413e+00 -3.79924085e+00 -4.42855003e+01 -4.76343556e+01
  1.44985405e+00  1.45335537e+03  2.61942396e+01  3.47046832e+01
 -3.41682495e+00 -6.62818049e+00 -1.07067824e+02 -9.34555604e+02
 -1.97902388e+02  6.00639266e-01  3.68570986e+01 -1.96713019e+01
 -1.40942179e+01  1.55987634e+00  7.10076916e+00  6.91172048e+01
 -2.68304811e+01  1.51634637e+01 -8.98568643e-01 -3.90202738e+00
  1.00669913e+00  1.67356313e+00 -9.04831030e+00 -2.66476350e+01
  1.83144979e+00 -4.21976471e