
In this scenario, the goal is to develop a multiple regression model to understand the relationship between various health indicators and diabetes progression.

#### Diabetes dataset

Ten baseline variables, age, sex, body mass index, average blood pressure, and six blood serum measurements were obtained for each of n = 442 diabetes patients, as well as the response of interest, a quantitative measure of disease progression one year after baseline.

Data Set Characteristics

Number of Instances: 442

Number of Attributes: First 10 columns are numeric predictive values

Target: Column 11 is a quantitative measure of disease progression one year after baseline

Attribute Information

- age: age in years
- sex: sex
- bmi: body mass index
- bp: average blood pressure
- s1: tc, total serum cholesterol
- s2: ldl, low-density lipoproteins
- s3: hdl, high-density lipoproteins
- s4: tch, total cholesterol / HDL
- s5: ltg, possibly log of serum triglycerides level
- s6: glu, blood sugar level

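Note: In the scikit-learn copy of this dataset (`sklearn.datasets.load_diabetes`), each of these 10 feature variables has been mean centered and scaled by the standard deviation times the square root of n_samples (i.e. the sum of squares of each column totals 1). The tab-delimited file loaded in this notebook contains the original, unscaled values.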
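
The scaling described in the note can be sketched in plain Python. This is a minimal illustration rather than scikit-learn's own code: the helper `center_and_scale` is a hypothetical name, and the five BMI values are taken from the raw file only to have something to scale.

```python
import math


def center_and_scale(values):
    """Mean-center a column and scale it so that its squares sum to 1."""
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    # Population standard deviation times sqrt(n) equals the square root of
    # the centered sum of squares, so dividing by it gives unit sum of squares.
    std = math.sqrt(sum(c * c for c in centered) / n)
    scale = std * math.sqrt(n)
    return [c / scale for c in centered]


# Five raw BMI values from the dataset, used purely for illustration.
bmi_sample = [32.1, 21.6, 30.5, 25.3, 23.0]
scaled = center_and_scale(bmi_sample)
sum_of_squares = sum(s * s for s in scaled)

print(scaled)
print(sum_of_squares)
```

After scaling, the column has mean zero and its squared values add up to 1 (up to floating-point error), which is the property the note describes.
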

Source URL: https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

For more information see: Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) "Least Angle Regression," Annals of Statistics (with discussion), 407-499. (https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)


Data URL: https://www4.stat.ncsu.edu/~boos/var.select/diabetes.tab.txt
Note: The data URL above was obtained from the source URL, which provides detailed information about the dataset and its variables, along with reference links and the link to the data file.



In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import ols
from scipy.stats import norm
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn import datasets

In [3]:
# Read the data
df = pd.read_csv('https://www4.stat.ncsu.edu/~boos/var.select/diabetes.tab.txt', delimiter = '\t')

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 442 entries, 0 to 441
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   AGE     442 non-null    int64  
 1   SEX     442 non-null    int64  
 2   BMI     442 non-null    float64
 3   BP      442 non-null    float64
 4   S1      442 non-null    int64  
 5   S2      442 non-null    float64
 6   S3      442 non-null    float64
 7   S4      442 non-null    float64
 8   S5      442 non-null    float64
 9   S6      442 non-null    int64  
 10  Y       442 non-null    int64  
dtypes: float64(6), int64(5)
memory usage: 38.1 KB


In [12]:
df.head()

   AGE  SEX   BMI     BP   S1     S2    S3   S4      S5  S6    Y
0   59    2  32.1  101.0  157   93.2  38.0  4.0  4.8598  87  151
1   48    1  21.6   87.0  183  103.2  70.0  3.0  3.8918  69   75
2   72    2  30.5   93.0  156   93.6  41.0  4.0  4.6728  85  141
3   24    1  25.3   84.0  198  131.4  40.0  5.0  4.8903  89  206
4   50    1  23.0  101.0  192  125.4  52.0  4.0  4.2905  80  135


In [5]:
# Converting SEX to categorical
df['SEX'] = df['SEX'].astype('category')

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 442 entries, 0 to 441
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   AGE     442 non-null    int64   
 1   SEX     442 non-null    category
 2   BMI     442 non-null    float64 
 3   BP      442 non-null    float64 
 4   S1      442 non-null    int64   
 5   S2      442 non-null    float64 
 6   S3      442 non-null    float64 
 7   S4      442 non-null    float64 
 8   S5      442 non-null    float64 
 9   S6      442 non-null    int64   
 10  Y       442 non-null    int64   
dtypes: category(1), float64(6), int64(4)
memory usage: 35.2 KB


In [7]:
# Using pandas' describe function to peek into the dataframe.

dfDescription = df.describe(include='all')
print(dfDescription)

               AGE    SEX         BMI          BP          S1          S2  \
count   442.000000  442.0  442.000000  442.000000  442.000000  442.000000   
unique         NaN    2.0         NaN         NaN         NaN         NaN   
top            NaN    1.0         NaN         NaN         NaN         NaN   
freq           NaN  235.0         NaN         NaN         NaN         NaN   
mean     48.518100    NaN   26.375792   94.647014  189.140271  115.439140   
std      13.109028    NaN    4.418122   13.831283   34.608052   30.413081   
min      19.000000    NaN   18.000000   62.000000   97.000000   41.600000   
25%      38.250000    NaN   23.200000   84.000000  164.250000   96.050000   
50%      50.000000    NaN   25.700000   93.000000  186.000000  113.000000   
75%      59.000000    NaN   29.275000  105.000000  209.750000  134.500000   
max      79.000000    NaN   42.200000  133.000000  301.000000  242.400000   

                S3          S4          S5          S6           Y  
count 

In [8]:
# Splitting the data into train and test sets.

df_train, df_test = train_test_split(df, test_size = 0.3, random_state = 42)
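
The sizes of the two sets follow from `test_size = 0.3` and the 442 rows. The arithmetic below is a small sketch of that split; it assumes scikit-learn rounds the test count up for a fractional `test_size` and gives the remaining rows to the training set.

```python
import math

n_samples = 442   # rows in the dataset
test_size = 0.3   # fraction passed to train_test_split

# Assumption: with a float test_size, the test count is rounded up and the
# remaining rows form the training set.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

The resulting training count agrees with the "No. Observations: 309" line in the regression summary printed by the model fit.
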

In [9]:
# Creating a linear regression model using statsmodels (the formula omits BP)
model = ols(formula='Y ~ AGE + SEX + BMI + S1 + S2 + S3 + S4 + S5 + S6', data=df_train)
est_train = model.fit()
print(est_train.summary())

                            OLS Regression Results                            
Dep. Variable:                      Y   R-squared:                       0.485
Model:                            OLS   Adj. R-squared:                  0.469
Method:                 Least Squares   F-statistic:                     31.23
Date:                Mon, 16 Sep 2024   Prob (F-statistic):           2.82e-38
Time:                        21:55:36   Log-Likelihood:                -1683.9
No. Observations:                 309   AIC:                             3388.
Df Residuals:                     299   BIC:                             3425.
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   -295.5125     81.152     -3.641      0.0
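
Several header values in the summary can be reproduced from one another, which is a useful sanity check when reading the table. The sketch below uses the standard formulas for adjusted R-squared, AIC, and BIC with the rounded figures printed above, so small rounding differences are expected; the variable names are illustrative.

```python
import math

# Values copied from the summary header (already rounded there).
n_obs = 309              # No. Observations
df_model = 9             # Df Model: predictor terms, excluding the intercept
r_squared = 0.485        # R-squared
log_likelihood = -1683.9 # Log-Likelihood

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - df_model - 1)

# Information criteria count the intercept as an estimated parameter.
k = df_model + 1
aic = 2 * k - 2 * log_likelihood
bic = k * math.log(n_obs) - 2 * log_likelihood

print(round(adj_r_squared, 3), round(aic), round(bic))
```
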

In [10]:
# Create a model that only includes the significant features
model = ols(formula='Y ~ SEX + BMI + S5', data=df_train)
est_train = model.fit()

print(est_train.params)

Intercept   -283.234481
SEX[T.2]      -8.840695
BMI            8.152330
S5            48.064774
dtype: float64
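
To see how these coefficients turn into a prediction, the fitted equation can be evaluated by hand. The sketch below hard-codes the parameters printed above; `predict_progression` is a hypothetical helper, not part of statsmodels. `SEX[T.2]` is treated as an indicator that is 1 when SEX equals 2 and 0 otherwise, with SEX = 1 as the baseline level absorbed into the intercept.

```python
# Coefficients copied from est_train.params above.
params = {
    "Intercept": -283.234481,
    "SEX[T.2]": -8.840695,
    "BMI": 8.152330,
    "S5": 48.064774,
}


def predict_progression(sex, bmi, s5):
    """Evaluate the fitted linear equation for a single patient."""
    # Indicator for the non-baseline category of SEX.
    sex_is_2 = 1.0 if sex == 2 else 0.0
    return (
        params["Intercept"]
        + params["SEX[T.2]"] * sex_is_2
        + params["BMI"] * bmi
        + params["S5"] * s5
    )


# First row of the dataset: SEX=2, BMI=32.1, S5=4.8598 (observed Y is 151).
prediction = predict_progression(sex=2, bmi=32.1, s5=4.8598)
print(prediction)
```

Each coefficient is the expected change in Y for a one-unit change in that variable with the others held fixed, so the BMI term alone contributes about 8.15 points per BMI unit.
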


In [11]:
# Calculating the out-of-sample R-squared
pred = est_train.predict(df_test)
r2 = r2_score(df_test['Y'], pred)
print('OOS R-squared: '+ str(r2))

OOS R-squared: 0.4825821522407042
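
The out-of-sample R-squared compares the model's squared errors on the test set with those of a baseline that always predicts the test-set mean. A minimal plain-Python sketch of that definition follows; the function name and the toy predictions are illustrative, and only the five Y values come from the dataset.

```python
def r_squared(y_true, y_pred):
    """Return 1 - SS_res / SS_tot for paired observations and predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


# Five observed Y values from the dataset, used purely for illustration.
y_true = [151, 75, 141, 206, 135]
mean_y = sum(y_true) / len(y_true)

perfect = r_squared(y_true, y_true)                      # exact predictions
baseline = r_squared(y_true, [mean_y] * len(y_true))     # always predict the mean
poor = r_squared(y_true, [0] * len(y_true))              # worse than the mean

print(perfect, baseline, poor)
```

Perfect predictions score 1, predicting the mean scores 0, and anything worse than the mean goes negative, so an out-of-sample value close to the in-sample R-squared suggests the reduced model is not overfitting the training data.
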
