# This exercise explores the NCSU diabetes dataset with an OLS linear regression model. The model is used to predict the progression of diabetes one year after baseline.

### The source of the data can be found at the "Diabetes Data" website: https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html
### The tab-delimited data file can be found at: https://www4.stat.ncsu.edu/~boos/var.select/diabetes.tab.txt
### The Diabetes Data website notes:
#### From Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) "Least Angle Regression," Annals of Statistics (with discussion), 407-499, we have "Ten baseline variables, age, sex, body mass index, average blood pressure, and six blood serum measurements were obtained for each of n = 442 diabetes patients, as well as the response of interest, a quantitative measure of disease progression one year after baseline."

#### We start the exercise by importing the necessary libraries.

In [11]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import ols
from scipy.stats import norm
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn import datasets

### Next, we import the tab-delimited data from the URL in the description above.

In [12]:
# Read the diabetes dataset from the URL into a pandas dataframe, "df".
# The file is tab-delimited rather than comma-delimited, so sep="\t" is passed to read_csv.
df = pd.read_csv('https://www4.stat.ncsu.edu/~boos/var.select/diabetes.tab.txt', sep='\t')
print(df)

     AGE  SEX   BMI      BP   S1     S2    S3    S4      S5   S6    Y
0     59    2  32.1  101.00  157   93.2  38.0  4.00  4.8598   87  151
1     48    1  21.6   87.00  183  103.2  70.0  3.00  3.8918   69   75
2     72    2  30.5   93.00  156   93.6  41.0  4.00  4.6728   85  141
3     24    1  25.3   84.00  198  131.4  40.0  5.00  4.8903   89  206
4     50    1  23.0  101.00  192  125.4  52.0  4.00  4.2905   80  135
..   ...  ...   ...     ...  ...    ...   ...   ...     ...  ...  ...
437   60    2  28.2  112.00  185  113.8  42.0  4.00  4.9836   93  178
438   47    2  24.9   75.00  225  166.0  42.0  5.00  4.4427  102  104
439   60    2  24.9   99.67  162  106.6  43.0  3.77  4.1271   95  132
440   36    1  30.0   95.00  201  125.2  42.0  4.79  5.1299   85  220
441   36    1  19.6   71.00  250  133.2  97.0  3.00  4.5951   92   57

[442 rows x 11 columns]
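
To illustrate how `sep="\t"` parses tab-delimited text without needing network access, here is a minimal sketch that reads the first two rows shown above from an in-memory string. The names `sample` and `demo` are hypothetical and used only for this illustration.

```python
import io

import pandas as pd

# Two rows copied from the printed dataframe above, joined by tab characters.
sample = (
    "AGE\tSEX\tBMI\tBP\tS1\tS2\tS3\tS4\tS5\tS6\tY\n"
    "59\t2\t32.1\t101\t157\t93.2\t38\t4\t4.8598\t87\t151\n"
    "48\t1\t21.6\t87\t183\t103.2\t70\t3\t3.8918\t69\t75\n"
)

# sep="\t" tells read_csv to split fields on tabs instead of commas.
demo = pd.read_csv(io.StringIO(sample), sep="\t")
print(demo.shape)
print(demo.columns.tolist())
```

Without `sep="\t"`, each line would be read as a single field, because the default delimiter is a comma.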


### The pandas info() method is used to inspect the column names, non-null counts, and Dtypes.

In [13]:
# Print basic information about the dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 442 entries, 0 to 441
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   AGE     442 non-null    int64  
 1   SEX     442 non-null    int64  
 2   BMI     442 non-null    float64
 3   BP      442 non-null    float64
 4   S1      442 non-null    int64  
 5   S2      442 non-null    float64
 6   S3      442 non-null    float64
 7   S4      442 non-null    float64
 8   S5      442 non-null    float64
 9   S6      442 non-null    int64  
 10  Y       442 non-null    int64  
dtypes: float64(6), int64(5)
memory usage: 38.1 KB


### The 'SEX' column is stored as an int64 Dtype; however, it is a category code rather than a quantity, so we convert it to a categorical Dtype.

In [14]:
# Convert the 'SEX' of the person to a categorical variable
categorical_var = ['SEX']
df[categorical_var] = df[categorical_var].astype('category')
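
As a rough illustration of what the conversion does, the sketch below applies the same `astype('category')` call to a small hypothetical dataframe (`demo`) and then builds a 0/1 indicator with `pd.get_dummies`. The indicator step is an assumption added for illustration: the formula interface used later performs its own coding of categorical columns, which shows up as the `SEX[T.2]` term in the fitted parameters.

```python
import pandas as pd

# Hypothetical four-row example; 1 and 2 are the two SEX codes in the dataset.
demo = pd.DataFrame({"SEX": [2, 1, 2, 1]})
demo["SEX"] = demo["SEX"].astype("category")

print(demo["SEX"].dtype)                 # category
print(list(demo["SEX"].cat.categories))  # [1, 2]

# drop_first=True keeps a single indicator column for level 2,
# leaving level 1 as the reference level.
dummies = pd.get_dummies(demo["SEX"], prefix="SEX", drop_first=True)
print(dummies)
```

Treating the codes as categories prevents the model from interpreting 2 as "twice" 1.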

### Now let's check the dataframe information again to confirm the Dtype is now category.

In [15]:
# Print basic information about the dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 442 entries, 0 to 441
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   AGE     442 non-null    int64   
 1   SEX     442 non-null    category
 2   BMI     442 non-null    float64 
 3   BP      442 non-null    float64 
 4   S1      442 non-null    int64   
 5   S2      442 non-null    float64 
 6   S3      442 non-null    float64 
 7   S4      442 non-null    float64 
 8   S5      442 non-null    float64 
 9   S6      442 non-null    int64   
 10  Y       442 non-null    int64   
dtypes: category(1), float64(6), int64(4)
memory usage: 35.2 KB


### Examine the dataframe to get a better understanding of the data

In [16]:
# The pandas describe() method is used to summarize each column of the dataframe.
# include="all" reports every column, including the categorical 'SEX' column; statistics that do not apply to a column's Dtype are shown as NaN.
dfDescription = df.describe(include="all")
print(dfDescription)

               AGE    SEX         BMI          BP          S1          S2  \
count   442.000000  442.0  442.000000  442.000000  442.000000  442.000000   
unique         NaN    2.0         NaN         NaN         NaN         NaN   
top            NaN    1.0         NaN         NaN         NaN         NaN   
freq           NaN  235.0         NaN         NaN         NaN         NaN   
mean     48.518100    NaN   26.375792   94.647014  189.140271  115.439140   
std      13.109028    NaN    4.418122   13.831283   34.608052   30.413081   
min      19.000000    NaN   18.000000   62.000000   97.000000   41.600000   
25%      38.250000    NaN   23.200000   84.000000  164.250000   96.050000   
50%      50.000000    NaN   25.700000   93.000000  186.000000  113.000000   
75%      59.000000    NaN   29.275000  105.000000  209.750000  134.500000   
max      79.000000    NaN   42.200000  133.000000  301.000000  242.400000   

                S3          S4          S5          S6           Y  
count 

### Next, to train and test an OLS model, we split the full dataset into a 70% training subset and a 30% test subset.

In [17]:
# The train_test_split() function from sklearn.model_selection is used to split the dataset.
# The test set is 30% of the rows (test_size=0.3); the remaining 70% form the training set.
# The random_state is set to 42 so that the same split is produced on every run.
# The training and test sets are called "df_train" and "df_test".
df_train, df_test = train_test_split(df, test_size=0.3, random_state=42)
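
The following sketch shows what the split does to row counts and why `random_state` matters. It uses a hypothetical placeholder dataframe (`demo`) with the same number of rows as the diabetes data rather than the real data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder dataframe with 442 rows, matching the size of the diabetes data.
demo = pd.DataFrame({"x": np.arange(442)})

# Two calls with the same random_state return the same partition.
train_a, test_a = train_test_split(demo, test_size=0.3, random_state=42)
train_b, test_b = train_test_split(demo, test_size=0.3, random_state=42)

print(len(train_a), len(test_a))              # 309 133
print(train_a.index.equals(train_b.index))    # True
```

The 309 training rows agree with the "No. Observations" value reported in the regression summaries.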

### Fit the Multiple Linear OLS Regression Model Using the Training Dataset and Print the Summary.

In [18]:
# The multiple linear OLS regression model is fit to the training dataset (df_train) and the summary is printed below.
# Note: BP is not among the predictors in this formula.
est_train = ols(formula="Y ~ AGE + SEX + BMI + S1 + S2 + S3 + S4 + S5 + S6", data=df_train).fit()
print(est_train.summary())

                            OLS Regression Results                            
Dep. Variable:                      Y   R-squared:                       0.485
Model:                            OLS   Adj. R-squared:                  0.469
Method:                 Least Squares   F-statistic:                     31.23
Date:                Wed, 21 Aug 2024   Prob (F-statistic):           2.82e-38
Time:                        00:20:50   Log-Likelihood:                -1683.9
No. Observations:                 309   AIC:                             3388.
Df Residuals:                     299   BIC:                             3425.
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   -295.5125     81.152     -3.641      0.0

### Inspection of the OLS Regression Results shows that some variables are not significant, with P>|t| values greater than 0.05.
### The non-significant variables are removed and the model is refit.

In [19]:
# Remove the non-significant variables and fit the model again on the training dataset (df_train)
# The new model uses only the SEX, BMI, S3, and S5 variables to predict Y
est_train = ols(formula="Y ~ SEX + BMI + S3 + S5", data=df_train).fit()
print(est_train.params)
print(est_train.summary())

Intercept   -176.648928
SEX[T.2]     -17.185273
BMI            7.377660
S3            -1.065873
S5            41.824183
dtype: float64
                            OLS Regression Results                            
Dep. Variable:                      Y   R-squared:                       0.470
Model:                            OLS   Adj. R-squared:                  0.463
Method:                 Least Squares   F-statistic:                     67.34
Date:                Wed, 21 Aug 2024   Prob (F-statistic):           9.47e-41
Time:                        00:20:50   Log-Likelihood:                -1688.3
No. Observations:                 309   AIC:                             3387.
Df Residuals:                     304   BIC:                             3405.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.9
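
The screening step described above, keeping only predictors with P>|t| below 0.05, can also be done programmatically from the fitted model's `pvalues` attribute. The sketch below is a minimal illustration on synthetic data, not the procedure used in this notebook; the names `demo`, `x1`, `x2`, and `y` are hypothetical.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic data: y depends strongly on x1, while x2 is pure noise.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
demo["y"] = 3.0 * demo["x1"] + rng.normal(size=n)

# Fit the full model and read the p-value of each term except the intercept.
full_fit = ols("y ~ x1 + x2", data=demo).fit()
pvals = full_fit.pvalues.drop("Intercept")

# Keep the terms below the 0.05 threshold and refit with only those.
kept = pvals[pvals < 0.05].index.tolist()
reduced_fit = ols("y ~ " + " + ".join(kept), data=demo).fit()

print(pvals)
print(kept)
```

For a categorical predictor the term name includes the level, as in `SEX[T.2]`, so it would need to be mapped back to the column name before rebuilding the formula.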

### The out-of-sample (OOS) R-squared value shows how well the model performs on the test dataset.
### The trained model is used to predict Y for the test dataset, and the OOS R^2 value is then calculated from those predictions.

In [20]:
# Predict Y for the test dataset with the trained model, then compute the out-of-sample R^2.
test_prediction = est_train.predict(df_test)
r2 = r2_score(df_test['Y'], test_prediction)

print('OOS R-squared: ' + str(r2))

OOS R-squared: 0.4851185328484515
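
For reference, `r2_score` computes one minus the ratio of the residual sum of squares to the total sum of squares, where the total is measured around the mean of the observed values passed in (here, the test set). The sketch below checks this by hand. The observed values are the first five Y values printed earlier, and the predictions are hypothetical numbers chosen only for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# First five Y values from the dataframe printed earlier.
y_true = np.array([151.0, 75.0, 141.0, 206.0, 135.0])
# Hypothetical predictions, not output from the fitted model.
y_pred = np.array([160.0, 90.0, 150.0, 180.0, 140.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual)
print(np.isclose(r2_manual, r2_score(y_true, y_pred)))
```

Because the baseline is the mean of the evaluation data, an OOS R-squared can be negative when the model predicts worse than that mean.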
