<h1 style="color:white; background-color:#4CAF50; padding:10px; border-radius:10px; text-align:center;">
  Diabetes Dataset: A Data Analysis Project with Explanations PART2
</h1>

<h1 style="color:white; background-color:red; padding:10px; border-radius:10px; text-align:center;">
  Model Development (Fit Model)
</h1>

### Model Development and Improvement Process

The model development and improvement process follows the same broad steps for classification and regression problems, but the models, metrics, and optimization methods differ by problem type. The **Diabetes dataset** is usually treated as a classification problem because its target variable (`Outcome`) has two classes: 0 (non-diabetic) and 1 (diabetic). The same data can be framed as a regression problem by choosing a continuous target instead, such as **BMI** (Body Mass Index), in which case a regression model is the appropriate choice.

### Common Steps for Classification and Regression Problems

- **Data Preprocessing:** This involves cleaning the data, handling missing values, and preparing the dataset for modeling.
- **Model Selection:** Depending on whether it is a classification or regression problem, different models will be chosen.
- **Hyperparameter Optimization:** Optimizing the model parameters to improve performance.
- **Cross-Validation:** Ensuring the model's generalization ability by validating it on different subsets of the data.
- **Ensemble Methods:** Combining multiple models to improve predictive performance.
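
The cross-validation step in the list above can be made concrete with a small, library-free sketch. It assumes a plain contiguous k-fold partition with no shuffling or stratification; the helper name `kfold_indices` is hypothetical, and scikit-learn's own splitters offer more options. The sample count of 614 is the size of the training split used later in this notebook.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, val_idx) pairs for a contiguous k-fold partition."""
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        # Everything outside the validation block is used for training.
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

folds = list(kfold_indices(n_samples=614, n_splits=5))
fold_sizes = [len(val_idx) for _, val_idx in folds]
print(fold_sizes)  # [123, 123, 123, 123, 122]
```

Each sample lands in exactly one validation fold, so every row is used for validation once and for training in the other four rounds.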

### Differences in Models and Metrics

- **Classification Models:** Logistic Regression, Random Forest, Decision Trees, Support Vector Machines (SVM), etc.
- **Regression Models:** Linear Regression, Ridge/Lasso Regression, Decision Tree Regressor, Random Forest Regressor, etc.
  
- **Classification Metrics:** Accuracy, Precision, Recall, F1 Score, ROC-AUC.
- **Regression Metrics:** Mean Squared Error (MSE), Mean Absolute Error (MAE), R-squared (R²).

### Model Improvement Techniques

Model performance can be improved through the steps listed above: data preprocessing, model selection, hyperparameter optimization, cross-validation, and ensemble methods. At each step, choices should follow from the structure of the dataset and the problem type, and several candidate models should be compared.

Applied carefully and iteratively, these steps improve performance in both classification and regression problems.


## Import Libraries

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

# Import Machine Learning Algorithms
from sklearn.model_selection import train_test_split  # Split data into training and testing sets for model evaluation
from sklearn.linear_model import LogisticRegression  # Logistic Regression for classification tasks
from sklearn.neighbors import KNeighborsClassifier  # KNN for classification

# Import Ensemble Learning Algorithms (combining multiple models for better performance)
from sklearn.ensemble import RandomForestClassifier  # Random Forest for robust classification
from sklearn.ensemble import GradientBoostingClassifier  # Gradient Boosting for decision tree-based learning with improved accuracy

# Evaluation Metrics (assessing model performance)
from sklearn.metrics import confusion_matrix  # Visualize model predictions vs. true labels
from sklearn.metrics import accuracy_score, recall_score, f1_score  # Calculate common performance metrics

In [2]:
df = pd.read_csv('Diabetes.csv')
print(df.head())  # Preview the first five rows

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   
3            1       89             66             23       94  28.1   
4            0      137             40             35      168  43.1   

   DiabetesPedigreeFunction  Age  Outcome  
0                     0.627   50        1  
1                     0.351   31        0  
2                     0.672   32        1  
3                     0.167   21        0  
4                     2.288   33        1  



In [3]:
df.columns


Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

## Data Pre-Processing:

##### How is the Test Size Chosen?
##### The test set size is chosen to balance two needs: enough data to train the model and enough held-out data to evaluate it reliably. The choice typically depends on the following factors:

##### 1. Size of the Dataset
##### Small datasets: If the dataset is very small (for example, a few hundred samples), keep the test set small so that enough data remains for training. A test size of 20-30% is typically preferred.
##### Large datasets: With more data, a larger share can go to the test set without starving the training set, so a test size of 20-40% is common. Note that in very large datasets even a small percentage already yields many test samples.
##### 2. Model's Generalization Ability
##### Larger test set: A larger test set gives a more reliable estimate of how the model performs on unseen data.
##### Larger training set: A larger training set gives the model more data to learn from, at the cost of a noisier test estimate. This matters most when data is limited.
##### 3. Type of Problem
##### The appropriate test size also depends on the problem. In time series analysis the split should respect time order (train on earlier data, test on later data), and with limited data the split needs extra care. A test size of 10-30% is typical in these cases.
##### 4. Use of CV (Cross-Validation)
##### With k-fold cross-validation, each training sample is used for both fitting and validation across the folds, so a large separate test set is less critical. A smaller test set (for example, 20%) is generally sufficient.
##### In General:
##### 70% training - 30% test is a common approach.
##### If data is very limited, ratios such as 80% training - 20% test or 90% training - 10% test keep more data for training.
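
To see what these ratios mean in absolute numbers, the sketch below converts a test fraction into row counts for the 768 rows of this dataset. The helper `split_sizes` is hypothetical, and rounding the test count up is an assumption; it is consistent with the 614/154 shapes printed later in this notebook for `test_size=0.2`.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test size."""
    # Assumption: the test count is rounded up and the rest goes to training.
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_rows = 768  # 614 training rows + 154 test rows in this notebook
sizes = {ts: split_sizes(n_rows, ts) for ts in (0.1, 0.2, 0.3)}
for ts, (n_train, n_test) in sizes.items():
    print(f"test_size={ts}: {n_train} train rows, {n_test} test rows")
```

With only 768 rows, moving from a 30% to a 10% test set adds 154 training rows but leaves just 77 rows for evaluation, which makes the test estimate noticeably noisier.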

## a.) Classification Problem

In [4]:
# Defining Features and Target Variable:
X = df[['Glucose', 'BMI', 'Age']]  
y = df['Outcome']

# Splitting Data into Training and Testing Sets:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


### 1.) Model Selection and Training

##### Here, we use a Logistic Regression model, which is commonly used for binary classification problems like this one. The model is trained using the training data (X_train and y_train). The fit() function allows the model to learn the relationship between the features and the target variable.

In [5]:
from sklearn.linear_model import LogisticRegression

# Choosing the Logistic Regression model
model = LogisticRegression(max_iter=1000)

# Training the model on the training data
model.fit(X_train, y_train)


### 2.) Making Predictions

##### After training, we use the test set (X_test) to make predictions. The predicted values (y_pred) will be compared with the actual values (y_test) to evaluate the model's performance.



In [6]:
y_pred = model.predict(X_test)

### 3.) Evaluating Model Performance

#### We assess the model's performance using metrics like accuracy, precision, recall, and F1-score. The confusion matrix provides a detailed breakdown of correct and incorrect predictions. Evaluating the model is crucial to understanding its strengths and weaknesses.



In [7]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Accuracy score
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: ", accuracy)

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)

# Other metrics (precision, recall, f1-score)
print(classification_report(y_test, y_pred))


Accuracy:  0.7467532467532467
Confusion Matrix:
 [[80 19]
 [20 35]]
              precision    recall  f1-score   support

           0       0.80      0.81      0.80        99
           1       0.65      0.64      0.64        55

    accuracy                           0.75       154
   macro avg       0.72      0.72      0.72       154
weighted avg       0.75      0.75      0.75       154



### Performance Metrics Explanation

#### Precision:
- **Class 0:** 0.80 (This means the model correctly predicted 80 out of 100 instances that it classified as class 0).
- **Class 1:** 0.65 (This means the model correctly predicted 35 out of 54 instances that it classified as class 1).

#### Recall:
- **Class 0:** 0.81 (The model correctly identified 81% of the actual instances of class 0).
- **Class 1:** 0.64 (The model correctly identified 64% of the actual instances of class 1).

#### F1 Score:
- **Class 0:** 0.80 (The harmonic mean of precision and recall for class 0).
- **Class 1:** 0.64 (The harmonic mean of precision and recall for class 1).

#### Support:
- There are 99 instances for class 0 and 55 instances for class 1 in the test set.

#### Macro and Weighted Average:
- The macro average is the unweighted mean of the per-class metrics, so both classes count equally regardless of their size.
- The weighted average weights each class's metric by its support (the number of true instances), so the larger class 0 contributes more to the result.

These metrics are valuable for understanding how well the model performs across different classes and help identify areas for improvement in the model's performance.
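
The figures in the classification report can be reproduced by hand from the confusion matrix printed above. The sketch below is a plain-Python check, not a replacement for `classification_report`; the helper name `per_class_metrics` is hypothetical. Rows of the matrix are actual classes and columns are predicted classes.

```python
# Confusion matrix from the output above: rows = actual, columns = predicted
cm = [[80, 19],
      [20, 35]]

def per_class_metrics(cm, k):
    """Return (precision, recall, f1, support) for class k."""
    tp = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column sum: everything predicted as k
    actual_k = sum(cm[k])                    # row sum: everything that really is k
    precision = tp / predicted_k
    recall = tp / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1, actual_k

metrics = [per_class_metrics(cm, k) for k in range(2)]
total = sum(m[3] for m in metrics)

accuracy = sum(cm[k][k] for k in range(2)) / total
macro_precision = sum(m[0] for m in metrics) / len(metrics)
weighted_precision = sum(m[0] * m[3] for m in metrics) / total

for k, (p, r, f1, support) in enumerate(metrics):
    print(f"class {k}: precision={p:.2f} recall={r:.2f} f1={f1:.2f} support={support}")
print(f"accuracy={accuracy:.2f} macro={macro_precision:.2f} weighted={weighted_precision:.2f}")
```

The 20 diabetic patients predicted as non-diabetic (bottom-left cell) are what pull the class 1 recall down to 0.64.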


### 4.) Hyperparameter Tuning

##### Hyperparameter tuning improves the model's performance by searching for the best combination of parameters. Grid Search tries every combination of the listed values (here `C` and `solver`) and selects the one with the highest mean cross-validated score, which is accuracy by default for a classifier.



In [8]:
from sklearn.model_selection import GridSearchCV

# Parameter grid
param_grid = {'C': [0.1, 1, 10, 100], 'solver': ['lbfgs', 'liblinear']}

# Grid search to find the best parameters
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
grid.fit(X_train, y_train)

# Best parameters
print("Best parameters: ", grid.best_params_)


Best parameters:  {'C': 0.1, 'solver': 'lbfgs'}


#### Best Parameters: This output shows the hyperparameter combination with the best cross-validated score:

#### C: 0.1. `C` is the inverse of the regularization strength, so this smallest value in the grid applies the strongest regularization, which helps prevent overfitting.
#### Solver: 'lbfgs', scikit-learn's default solver for Logistic Regression, which works well on a dataset of this size.
#### These results suggest that tuning can improve the model's accuracy. Because `GridSearchCV` refits the best combination on the full training set by default, the tuned model is available as `grid.best_estimator_` and should be evaluated on the test set in the next step.
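
To make the cost of the search visible, the sketch below enumerates the combinations that an exhaustive grid search over this parameter grid has to evaluate. It only lists candidates and counts fits; it does not train any model, and the variable names are illustrative.

```python
from itertools import product

# Same grid and fold count as in the GridSearchCV call above
param_grid = {'C': [0.1, 1, 10, 100], 'solver': ['lbfgs', 'liblinear']}
cv_folds = 5

# Cartesian product of all parameter values, one dict per candidate
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_fits = len(candidates) * cv_folds
print(len(candidates), "candidates,", n_fits, "cross-validation fits")
for candidate in candidates:
    print(candidate)
```

The number of fits grows multiplicatively with every parameter added to the grid, which is why grids are usually kept coarse at first and refined around the best region.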

## b.) Regression Problem

### To treat this dataset as a regression problem, we predict a continuous variable instead of the binary target (Outcome). For example, we can predict BMI (Body Mass Index) or Glucose (blood sugar) in diabetes patients. Regression performance is evaluated with metrics such as R² (R-squared) and Mean Squared Error (MSE).

In [12]:
# Selecting features and target variable for the regression task
X_reg = df.drop(columns=['BMI'])  # Features
y_reg = df['BMI']  # Target variable (BMI)

# Splitting the data into training and testing sets
from sklearn.model_selection import train_test_split
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)

# Checking the shapes of the datasets
print(X_train_reg.shape, X_test_reg.shape, y_train_reg.shape, y_test_reg.shape)


(614, 8) (154, 8) (614,) (154,)


#### Training a Linear Regression Model

In [13]:
from sklearn.linear_model import LinearRegression

# Creating the Linear Regression model
linear_reg_model = LinearRegression()

# Training the model on the training data
linear_reg_model.fit(X_train_reg, y_train_reg)

# Making predictions on the test data
y_pred_reg = linear_reg_model.predict(X_test_reg)


### Evaluating the Model

In [14]:
# Evaluating the model with R^2 and Mean Squared Error
from sklearn.metrics import r2_score, mean_squared_error

# R^2 score and Mean Squared Error
r2 = r2_score(y_test_reg, y_pred_reg)
mse = mean_squared_error(y_test_reg, y_pred_reg)

print("R² score:", r2)
print("Mean Squared Error:", mse)


R² score: 0.26201406132012617
Mean Squared Error: 52.46005874215569
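
Both metrics follow directly from their definitions, which the sketch below spells out in plain Python. The toy BMI values are made up for illustration and the helper name `r2_and_mse` is hypothetical; only `reported_mse` is taken from the output above.

```python
import math

def r2_and_mse(y_true, y_pred):
    """Return (R^2, MSE) computed directly from their definitions."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return 1 - ss_res / ss_tot, ss_res / n

# Made-up BMI values, for illustration only
toy_true = [30.0, 25.0, 35.0, 40.0]
toy_pred = [28.0, 27.0, 33.0, 38.0]
toy_r2, toy_mse = r2_and_mse(toy_true, toy_pred)
print("Toy R²:", toy_r2, "Toy MSE:", toy_mse)

# Taking the square root puts the reported MSE back into BMI units (RMSE)
reported_mse = 52.46005874215569
reported_rmse = math.sqrt(reported_mse)
print("RMSE:", round(reported_rmse, 2))
```

An RMSE of about 7.24 means the linear model's BMI predictions are typically off by roughly 7 BMI units, which is easier to interpret than the squared figure.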


### Let's first implement Ridge Regression

#### For this dataset, Ridge Regression seems more appropriate than Lasso because:
#### It assumes that all features contribute to predicting BMI and keeps them in the model.
### How Ridge Regression Works:
##### Ridge Regression uses L2 regularization, which adds a penalty proportional to the sum of the squared coefficients.
##### This penalty shrinks all coefficients toward zero, but unlike Lasso it does not set any of them exactly to zero, so no feature is dropped.
##### The key benefit of Ridge is that limiting the size of the coefficients helps prevent overfitting and makes the model less sensitive to noise in the data.
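
The shrinking effect is easiest to see in the simplest possible case. The sketch below assumes a single centered feature and no intercept, where minimizing the squared error plus `alpha * w**2` gives the closed form `w = sum(x*y) / (sum(x*x) + alpha)`. The data values and the helper name `ridge_weight_1d` are made up for illustration; this is not how scikit-learn's `Ridge` is implemented internally.

```python
def ridge_weight_1d(x, y, alpha):
    """Closed-form ridge coefficient for one feature without an intercept."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    # alpha = 0 gives ordinary least squares; larger alpha shrinks w toward zero.
    return sxy / (sxx + alpha)

# Made-up, centered toy data with a true slope close to 2
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-4.1, -1.9, 0.2, 2.1, 3.9]

alphas = [0, 0.01, 0.1, 1, 10, 100]
weights = [ridge_weight_1d(x, y, a) for a in alphas]
for a, w in zip(alphas, weights):
    print(f"alpha={a}: w={w:.3f}")
```

The coefficient gets smaller as `alpha` grows but never reaches exactly zero, which is the behaviour described above.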


In [15]:
from sklearn.linear_model import Ridge

# Creating and training the Ridge Regression model
ridge_model = Ridge(alpha=1.0)  # The alpha parameter controls the regularization strength
ridge_model.fit(X_train_reg, y_train_reg)

# Making predictions with Ridge Regression
y_pred_ridge = ridge_model.predict(X_test_reg)

# Evaluating the Ridge model
ridge_r2 = r2_score(y_test_reg, y_pred_ridge)
ridge_mse = mean_squared_error(y_test_reg, y_pred_ridge)

print("Ridge Regression R²:", ridge_r2)
print("Ridge Regression MSE:", ridge_mse)


Ridge Regression R²: 0.26242456629280875
Ridge Regression MSE: 52.43087781356049


### Code for Tuning Ridge Parameters with GridSearchCV:

In [17]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge

# Defining the range of alpha values to test
alpha_range = {'alpha': [0.01, 0.1, 1, 10, 100]}

# Creating the Ridge model
ridge_model = Ridge()

# Using GridSearchCV to find the best alpha
grid_search = GridSearchCV(ridge_model, alpha_range, scoring='r2', cv=5)
grid_search.fit(X_train_reg, y_train_reg)

# The best alpha value
best_alpha = grid_search.best_params_['alpha']
print("Best alpha:", best_alpha)

# Retraining the Ridge model with the best alpha
best_ridge_model = Ridge(alpha=best_alpha)
best_ridge_model.fit(X_train_reg, y_train_reg)

# Making predictions and evaluating the tuned model
y_pred_best_ridge = best_ridge_model.predict(X_test_reg)
tuned_r2 = r2_score(y_test_reg, y_pred_best_ridge)
tuned_mse = mean_squared_error(y_test_reg, y_pred_best_ridge)

print("Tuned Ridge Regression R²:", tuned_r2)
print("Tuned Ridge Regression MSE:", tuned_mse)


Best alpha: 10
Tuned Ridge Regression R²: 0.26549203732038695
Tuned Ridge Regression MSE: 52.212825271008704


### Conclusion:
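After tuning the **alpha** parameter (best value: 10), the Ridge model is slightly:

- **More explanatory** (R² rises from 0.2624 to 0.2655),
- **More accurate** (MSE falls from 52.43 to 52.21).

The gain is small, and an R² of about 0.27 means the model explains roughly a quarter of the variance in BMI. Ridge regularization gives a marginally better balance between **underfitting** and **overfitting**, which suggests that overfitting was not the main limitation of the linear model.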

