## <a id='toc1_1_'></a>[Gradient Descent Methods](#toc0_)

In this notebook, we will explore various loss functions and apply gradient descent methods to optimize these functions. Our focus will be on the Diabetes dataset from the scikit-learn library, a well-regarded dataset in the machine learning community. This dataset consists of medical diagnostic measurements from numerous patients and is specifically designed to study diabetes progression. We will use these data points to predict the quantitative measure of disease progression one year after baseline, thus practicing the application of regression analysis in a medical context.

## <a id='toc1_2_'></a>[Authors](#toc0_)
* **Alireza Arbabi**
* **Hadi Babalou**
* **Ali Padyav**
* **Kasra Hajiheidari**

## <a id='toc1_3_'></a>[Table of Contents](#toc0_)

- [Gradient Descent Methods](#toc1_1_)    
  - [Authors](#toc1_2_)    
  - [Table of Contents](#toc1_3_)    
  - [Setting Up the Environment](#toc1_4_)    
  - [Data Preparation](#toc1_5_)    
    - [Dataset Description](#toc1_5_1_)    
    - [Loading the Dataset](#toc1_5_2_)    
    - [Preprocessing](#toc1_5_3_)    
      - [Missing Values](#toc1_5_3_1_)    
      - [Duplicates](#toc1_5_3_2_)    
      - [Type Conversion](#toc1_5_3_3_)    
      - [Normalization](#toc1_5_3_4_)    
      - [Train-Test Split](#toc1_5_3_5_)    
  - [Loss Functions](#toc1_6_)    
    - [Mean Squared Error (MSE)](#toc1_6_1_)    
    - [Mean Absolute Error (MAE)](#toc1_6_2_)    
    - [Root Mean Squared Error (RMSE)](#toc1_6_3_)    
    - [R² Score (Coefficient of Determination)](#toc1_6_4_)    
    - [Ordinary Least Squares (OLS)](#toc1_6_5_)    
  - [Regression Model](#toc1_7_)    
    - [Model](#toc1_7_1_)    
    - [Training](#toc1_7_2_)    
  - [Evaluation](#toc1_8_)    
    - [Results Summary Table](#toc1_8_1_)    
  - [Questions](#toc1_9_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

## <a id='toc1_4_'></a>[Setting Up the Environment](#toc0_)

In [1]:
# !pip install numpy
# !pip install pandas
# !pip install seaborn
# !pip install matplotlib
# !pip install tqdm
# !pip install scipy
# !pip install scikit-learn
# !pip install statsmodels

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
from scipy import stats
import tqdm


import warnings
warnings.filterwarnings("ignore")

## <a id='toc1_5_'></a>[Data Preparation](#toc0_)

### <a id='toc1_5_1_'></a>[Dataset Description](#toc0_)

The sklearn diabetes dataset is a built-in sample dataset included with scikit-learn for regression tasks. It contains information about patients with diabetes.


Origin: The data were introduced by Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani in their paper "Least Angle Regression" (Annals of Statistics, 2004).

Contents: The dataset consists of ten baseline variables (age, sex, body mass index (BMI), average blood pressure, and six blood serum measurements) for 442 diabetes patients. The response of interest is a quantitative measure of disease progression one year after baseline.

Features:
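- Age: Age in years
- Sex: Sex of the patient
- BMI: Body mass index, a measure of body fat based on height and weight
- BP: Average blood pressure
- S1 to S6: Six blood serum measurements (per the scikit-learn documentation: total serum cholesterol, low-density lipoproteins, high-density lipoproteins, total cholesterol / HDL ratio, possibly the log of serum triglycerides level, and blood sugar level)

Note that `load_diabetes` returns these features already mean-centered and scaled so that each column's sum of squares equals 1. This is why `sex` appears below as two small floating-point values rather than a 0/1 code, and why `age` is not expressed in years.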

Target Variable (Disease Progression): A quantitative measure of diabetes progression after one year from baseline.

Purpose: The dataset is commonly used for regression tasks, particularly in predictive modeling and assessing the efficacy of various algorithms in predicting diabetes progression.

Size: It contains data for 442 patients.

### <a id='toc1_5_2_'></a>[Loading the Dataset](#toc0_)

In [3]:
diabetes = datasets.load_diabetes(as_frame=True)
df = diabetes.frame
df.head(10)

,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,target
0,0.038076,0.05068,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019907,-0.017646,151.0
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.068332,-0.092204,75.0
2,0.085299,0.05068,0.044451,-0.00567,-0.045599,-0.034194,-0.032356,-0.002592,0.002861,-0.02593,141.0
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022688,-0.009362,206.0
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031988,-0.046641,135.0
5,-0.092695,-0.044642,-0.040696,-0.019442,-0.068991,-0.079288,0.041277,-0.076395,-0.041176,-0.096346,97.0
6,-0.045472,0.05068,-0.047163,-0.015999,-0.040096,-0.0248,0.000779,-0.039493,-0.062917,-0.038357,138.0
7,0.063504,0.05068,-0.001895,0.066629,0.09062,0.108914,0.022869,0.017703,-0.035816,0.003064,63.0
8,0.041708,0.05068,0.061696,-0.040099,-0.013953,0.006202,-0.028674,-0.002592,-0.01496,0.011349,110.0
9,-0.0709,-0.044642,0.039062,-0.033213,-0.012577,-0.034508,-0.024993,-0.002592,0.067737,-0.013504,310.0


### <a id='toc1_5_3_'></a>[Preprocessing](#toc0_)

In [4]:
print(df.dtypes)

age       float64
sex       float64
bmi       float64
bp        float64
s1        float64
s2        float64
s3        float64
s4        float64
s5        float64
s6        float64
target    float64
dtype: object


#### <a id='toc1_5_3_1_'></a>[Missing Values](#toc0_)

In [5]:
print(df.isnull().sum())

age       0
sex       0
bmi       0
bp        0
s1        0
s2        0
s3        0
s4        0
s5        0
s6        0
target    0
dtype: int64


#### <a id='toc1_5_3_2_'></a>[Duplicates](#toc0_)

There is no duplicate data in the dataset.

In [6]:
print(df.duplicated().sum())

0


#### <a id='toc1_5_3_3_'></a>[Type Conversion](#toc0_)

As all the features are numerical, we do not need to convert any data types.

#### <a id='toc1_5_3_4_'></a>[Normalization](#toc0_)

StandardScaler is a preprocessing technique in machine learning used to standardize the features by removing the mean and scaling them to unit variance. This ensures that each feature has a mean of 0 and a standard deviation of 1.

Standardization is often performed on numerical features in datasets before training machine learning models. It helps in situations where the features have different scales or units, ensuring that each feature contributes equally to the analysis and preventing features with larger scales from dominating the model's training process.
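As a minimal sketch of what standardization computes, the snippet below reproduces the `(x - mean) / std` transform with plain NumPy on made-up data. The arrays and the 80/20 split are assumptions for illustration only. The statistics are taken from the training rows alone and then reused on the test rows, which is a stricter variant than fitting the scaler on the full frame (target included) before splitting, as the next cell does; that ordering lets test-set statistics influence the transform.

```python
import numpy as np

# Synthetic two-feature data on very different scales (illustrative only).
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=[50.0, 0.5], scale=[10.0, 0.1], size=(100, 2))
X_demo_train, X_demo_test = X_demo[:80], X_demo[80:]

# Estimate the statistics on the training rows only.
mu = X_demo_train.mean(axis=0)
sigma = X_demo_train.std(axis=0)  # population std (ddof=0), the convention StandardScaler uses

# Apply the same training statistics to both splits.
X_demo_train_std = (X_demo_train - mu) / sigma
X_demo_test_std = (X_demo_test - mu) / sigma

print(X_demo_train_std.mean(axis=0))  # approximately 0 for each column
print(X_demo_train_std.std(axis=0))   # approximately 1 for each column
```

The test split will generally not have exactly zero mean or unit variance after this transform, which is expected: it is scaled with the training statistics, not its own.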

In [7]:
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)
df_scaled = pd.DataFrame(df_scaled, columns=df.columns)
df_scaled.head()

,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,target
0,0.8005,1.065488,1.297088,0.459841,-0.929746,-0.732065,-0.912451,-0.054499,0.418531,-0.370989,-0.014719
1,-0.039567,-0.938537,-1.08218,-0.553505,-0.177624,-0.402886,1.564414,-0.830301,-1.436589,-1.938479,-1.001659
2,1.793307,1.065488,0.934533,-0.119214,-0.958674,-0.718897,-0.680245,-0.054499,0.060156,-0.545154,-0.14458
3,-1.872441,-0.938537,-0.243771,-0.77065,0.256292,0.525397,-0.757647,0.721302,0.476983,-0.196823,0.699513
4,0.113172,-0.938537,-0.764944,0.459841,0.082726,0.32789,0.171178,-0.054499,-0.672502,-0.980568,-0.222496


#### <a id='toc1_5_3_5_'></a>[Train-Test Split](#toc0_)

In [8]:
X = df_scaled.drop('target', axis=1)
y = df_scaled['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05, random_state=42)
print(f"Count of instances in\n Whole Dataset: {X.shape[0]}\n Training set: {X_train.shape[0]}\n Testing set: {X_test.shape[0]}")


Count of instances in
 Whole Dataset: 442
 Training set: 419
 Testing set: 23


# <a id='toc1_6_'></a>[Main Task](#toc0_)

## <a id='toc1_6_'></a>[Loss Functions](#toc0_)

### <a id='toc1_6_1_'></a>[Mean Squared Error (MSE)](#toc0_)

In [9]:
def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

### <a id='toc1_6_2_'></a>[Mean Absolute Error (MAE)](#toc0_)

In [10]:
def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

### <a id='toc1_6_3_'></a>[Root Mean Squared Error (RMSE)](#toc0_)

In [11]:
def rmse(y_true, y_pred):
    return np.sqrt(mse(y_true, y_pred))

### <a id='toc1_6_4_'></a>[R² Score (Coefficient of Determination)](#toc0_)

In [12]:
def r2(y_true, y_pred):
    return 1 - mse(y_true, y_pred) / mse(y_true, np.mean(y_true))
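
The ratio of MSE values used in `r2` matches the usual definition of the coefficient of determination, because the $1/n$ factors in the numerator and denominator cancel. As a sketch of that identity, writing $\hat{y}_i$ for the predictions and $\bar{y}$ for the mean of the true values (notation introduced here, not in the original cells):

$$
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
    = 1 - \frac{\mathrm{MSE}(y, \hat{y})}{\mathrm{MSE}(y, \bar{y})}
$$

A value of 1 corresponds to perfect predictions, 0 to a model that does no better than always predicting $\bar{y}$, and negative values to a model that does worse than that baseline.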

### <a id='toc1_6_5_'></a>[Ordinary Least Squares (OLS)](#toc0_)

In [13]:
def ols(y_true, y_pred):
    # Residual sum of squares: the quantity that ordinary least squares minimizes.
    return np.sum((y_true - y_pred) ** 2)
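
To make the five definitions concrete, here is a small example that can be checked by hand. The functions are restated so the snippet runs on its own; the toy arrays are made up for illustration and do not come from the diabetes data.

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    return np.sqrt(mse(y_true, y_pred))

def r2(y_true, y_pred):
    return 1 - mse(y_true, y_pred) / mse(y_true, np.mean(y_true))

def ols(y_true, y_pred):
    return np.sum((y_true - y_pred) ** 2)

# Toy values (illustrative only): the errors are 0.5, -0.5, 0.0 and -1.0.
y_true_demo = np.array([3.0, -0.5, 2.0, 7.0])
y_pred_demo = np.array([2.5, 0.0, 2.0, 8.0])

print(mse(y_true_demo, y_pred_demo))   # 0.375
print(mae(y_true_demo, y_pred_demo))   # 0.5
print(rmse(y_true_demo, y_pred_demo))  # sqrt(0.375)
print(ols(y_true_demo, y_pred_demo))   # 1.5, i.e. n * MSE with n = 4
print(r2(y_true_demo, y_pred_demo))    # 1 - 1.5 / 29.1875
```

Note that `ols` and `mse` differ only by the factor `n`, so they rank models identically on a fixed dataset; `ols` simply grows with the number of samples, which is why its train and test values are not directly comparable.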

## <a id='toc1_7_'></a>[Regression Model](#toc0_)

### <a id='toc1_7_1_'></a>[Model](#toc0_)

Linear Regression is a fundamental algorithm in machine learning and statistics. It's a supervised learning technique that predicts a continuous outcome variable (Y) based on one or more predictor variables (X).

The goal of Linear Regression is to find the best fit line that can accurately predict the output for the continuous response variable. This line is represented by the equation:

Y = a + bX + e

Where:
- Y is the dependent variable (output/outcome/prediction/estimation)
- X is the independent variable (input/feature)
- a is the Y-intercept (the value of Y when X=0)
- b is the slope of the line (the change in Y that comes with a one-unit change in X)
- e is the error term (difference between actual and predicted values)

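The algorithm uses the method of least squares to estimate 'a' and 'b': it chooses the values that minimize the sum of the squared residuals (the differences between the observed and predicted values), which defines the 'line of best fit'. With several predictors, as with the ten features used here, the same criterion yields one slope coefficient per feature.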
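
The notebook's title refers to gradient descent, while the cells below rely on direct least-squares solvers. As a minimal sketch of how the same objective could be minimized iteratively, the snippet below runs batch gradient descent on the MSE of a linear model and compares the result with NumPy's least-squares solver. The function name `gradient_descent_linreg`, the learning rate, the iteration count, and the synthetic data are all assumptions chosen for illustration.

```python
import numpy as np

def gradient_descent_linreg(X, y, lr=0.1, n_iters=2000):
    """Batch gradient descent on the MSE of the linear model y ~ X @ w + b."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iters):
        residual = X @ w + b - y                      # prediction error per sample
        grad_w = 2.0 / n_samples * (X.T @ residual)   # gradient of MSE w.r.t. w
        grad_b = 2.0 / n_samples * residual.sum()     # gradient of MSE w.r.t. b
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Synthetic regression problem (illustrative only).
rng = np.random.default_rng(0)
X_gd = rng.normal(size=(200, 3))
y_gd = X_gd @ np.array([1.5, -2.0, 0.5]) + 0.7 + 0.1 * rng.normal(size=200)

w_gd, b_gd = gradient_descent_linreg(X_gd, y_gd)

# Direct least-squares solution for comparison (first column is the intercept).
A = np.column_stack([np.ones(len(X_gd)), X_gd])
theta, *_ = np.linalg.lstsq(A, y_gd, rcond=None)

print(b_gd, w_gd)
print(theta[0], theta[1:])
```

On standardized features a fixed learning rate like this tends to converge without trouble; on unscaled features the step size usually has to be much smaller, which is one practical reason for the normalization step earlier in the notebook.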

Linear Regression assumptions:
- Linearity: The relationship between X and the mean of Y is linear.
- Homoscedasticity: The variance of the residuals is the same for any value of X.
- Independence: Observations are independent of each other.
- Normality: For any fixed value of X, Y is normally distributed.

### <a id='toc1_7_1_'></a>[Sklearn Linear Regression Model](#toc0_)

In [14]:
model = LinearRegression()

#### <a id='toc1_7_2_'></a>[Training](#toc0_)

In [15]:
model.fit(X_train, y_train)

#### <a id='toc1_8_'></a>[Evaluation](#toc0_)

In [16]:
y_pred = model.predict(X_test)

mean_squared_error = mse(y_test, y_pred)
mean_absolute_error = mae(y_test, y_pred)
root_mean_squared_error = rmse(y_test, y_pred)
r_squared = r2(y_test, y_pred)

In [17]:
y_pred_train = model.predict(X_train)

mean_squared_error_train = mse(y_train, y_pred_train)
mean_absolute_error_train = mae(y_train, y_pred_train)
root_mean_squared_error_train = rmse(y_train, y_pred_train)
r_squared_train = r2(y_train, y_pred_train)

#### <a id='toc1_8_1_'></a>[Results Summary Table](#toc0_)

In [18]:
results_summary_data = [
    ['', 'MSE', 'MAE', 'RMSE', 'R2 score'],
    ['Train Set', mean_squared_error_train, mean_absolute_error_train, root_mean_squared_error_train, r_squared_train],
    ['Test Set', mean_squared_error, mean_absolute_error, root_mean_squared_error, r_squared]
]
# The first row holds the column names, so only the remaining rows are passed as data.
results_summary_df = pd.DataFrame(results_summary_data[1:], columns=results_summary_data[0])
results_summary_df


,,MSE,MAE,RMSE,R2 score
0,Train Set,0.477449,0.55793,0.690977,0.513588
1,Test Set,0.592948,0.63993,0.770031,0.516379


### <a id='toc1_7_1_'></a>[Ordinary Least Squares Regression Model](#toc0_)

In [22]:
X_train_ols = sm.add_constant(X_train)
X_test_ols = sm.add_constant(X_test)

ols_model = sm.OLS(y_train, X_train_ols)

#### <a id='toc1_7_2_'></a>[Training](#toc0_)

In [23]:
ols_results = ols_model.fit()
ols_results.summary()

Dep. Variable:,target,R-squared (uncentered):,0.514
Model:,OLS,Adj. R-squared (uncentered):,0.502
Method:,Least Squares,F-statistic:,43.18
Date:,"Sat, 20 Apr 2024",Prob (F-statistic):,5.41e-58
Time:,23:05:43,Log-Likelihood:,-439.74
No. Observations:,419,AIC:,899.5
Df Residuals:,409,BIC:,939.8
Df Model:,10,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
age,0.0078,0.038,0.205,0.837,-0.067,0.082
sex,-0.1589,0.039,-4.076,0.000,-0.236,-0.082
bmi,0.3290,0.042,7.750,0.000,0.246,0.412
bp,0.2101,0.041,5.138,0.000,0.130,0.290
s1,-0.5231,0.260,-2.008,0.045,-1.035,-0.011
s2,0.2977,0.211,1.408,0.160,-0.118,0.713
s3,0.0851,0.133,0.641,0.522,-0.176,0.346
s4,0.1519,0.103,1.477,0.141,-0.050,0.354
s5,0.4370,0.108,4.060,0.000,0.225,0.649

Omnibus:,0.806,Durbin-Watson:,1.853
Prob(Omnibus):,0.668,Jarque-Bera (JB):,0.883
Skew:,0.034,Prob(JB):,0.643
Kurtosis:,2.786,Cond. No.,21.3


#### <a id='toc1_8_'></a>[Evaluation](#toc0_)


#### <a id='toc1_8_1_'></a>[Results Summary Table](#toc0_)

,,MSE,MAE,RMSE,R2 score,OLS
0,Train Set,0.477449,0.55793,0.690977,0.513588,200.051266
1,Test Set,0.592948,0.63993,0.770031,0.516379,13.637794


## <a id='toc1_9_'></a>[Questions](#toc0_)

**Analyze and evaluate the values in Table (1).**

**Review the R² and Adjusted R² values obtained in part 4. Explain what these values indicate and what the implications of high or low values might be.  
Also, discuss the differences between these two metrics.**
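
As a reference point for this question, the usual definition of adjusted R² for a model with an intercept is sketched below, with $n$ the number of observations and $p$ the number of predictors (symbols introduced here):

$$
R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
$$

Unlike R², this quantity penalizes every additional predictor, so it rises only when a new feature improves the fit by more than its cost in degrees of freedom. statsmodels uses slightly different degrees of freedom when no intercept is fitted, which is the case whenever the summary labels the values "uncentered".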

**Review the p-values obtained in part 4 for each column of data and explain what these values indicate. Discuss what an appropriate value for p-values is and which columns currently have suitable values.**

**Based on the results obtained in part 4, assess and analyze the importance of each feature in the dataset for predicting an individual's diabetic condition.**