<a href="https://www.kaggle.com/code/ajmalkalikavu/diabetes-prediction?scriptVersionId=220748157" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# **Problem statement**
# **Diabetes Prediction**

-- The dataset contains diabetes test results.

-- This dataset has 768 rows & 9 columns.

-- All attributes are numeric variables.

**Column Heads:**

1. Pregnancies
2. Glucose
3. BloodPressure
4. SkinThickness
5. Insulin
6. BMI
7. DiabetesPedigreeFunction
8. Age
9. Outcome (has diabetes or not)

* **The target variable in this dataset is 'Outcome'.**

**Goals:**

* The primary goal of this dataset is to predict whether a patient has diabetes or not based on diagnostic measurements.

* **Early Detection and Diagnosis:** By using this dataset to build a predictive model, healthcare professionals can potentially identify individuals at risk of diabetes even before they develop noticeable symptoms. Early detection allows for timely intervention and management, which can significantly improve patient outcomes.

* **Personalized Risk Assessment:** The model can help assess an individual's risk of developing diabetes based on their specific characteristics (pregnancies, glucose level, blood pressure, etc.). This personalized risk assessment can empower individuals to make informed lifestyle choices and take preventive measures to reduce their risk.

* **Improving Treatment Strategies:** The insights gained from analyzing this dataset can contribute to the development of more effective treatment strategies for diabetes. By understanding the factors that influence the disease's progression, healthcare providers can tailor treatment plans to individual needs, potentially leading to better blood sugar control and reduced complications.

* **Public Health Planning:** The dataset can be used to study the prevalence and distribution of diabetes within specific populations (e.g., Pima Indian women in this case). This information is valuable for public health planning and resource allocation, allowing for targeted interventions and prevention programs.

* **Research and Development:** The dataset serves as a valuable resource for researchers studying diabetes. It can be used to explore new diagnostic methods, identify potential drug targets, and gain a deeper understanding of the disease's underlying mechanisms.

* **Overall aim:** Improve the prediction, diagnosis, treatment, and prevention of diabetes. By enabling early detection, personalized risk assessment, and more effective treatment strategies, models built on this dataset can help reduce the burden of this chronic disease.

---

# Libraries

In [None]:
import pandas as pd
import numpy as np

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns

# Encoding
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()

# Train/test split
from sklearn.model_selection import train_test_split

#Scaling
from sklearn.preprocessing import StandardScaler

# Classification algorithms
from sklearn.linear_model import LogisticRegression,RidgeClassifier,Perceptron,PassiveAggressiveClassifier,SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier,ExtraTreesClassifier,AdaBoostClassifier,GradientBoostingClassifier,BaggingClassifier,VotingClassifier,StackingClassifier
from sklearn.naive_bayes import GaussianNB,MultinomialNB,BernoulliNB
from sklearn.svm import SVC,LinearSVC
from sklearn.neural_network import MLPClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis,LinearDiscriminantAnalysis
import xgboost as xgb
import lightgbm as lgb

# Regression metrics (MSE, MAE, R^2); not used by the classification workflow below
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from math import sqrt

# Classification metrics
from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix
from tabulate import tabulate
import warnings

# Cross-validation
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Loading Dataset

In [None]:
df=pd.read_csv('/kaggle/input/diabetes-dataset/diabetes.csv')
df

# ***Understanding the data***

In [None]:
df.info()

* All attributes are numeric variables.

In [None]:
df.shape

This dataset has 768 rows & 9 columns.

In [None]:
df.columns

In [None]:
df.describe().T

In [None]:
df.isnull().sum()

There are no null (missing) values in the dataset.
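
Note that `isnull()` only flags explicit NaN entries. In measurement columns such as Glucose, BloodPressure or BMI, a value of 0 is physiologically implausible and may stand in for a missing reading. The sketch below counts such zeros on a small made-up frame; the column choice and the toy values are assumptions for illustration, not results from this notebook.

```python
import pandas as pd

# Toy frame standing in for the real data; the values are made up.
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 0],
    "BMI": [33.6, 26.6, 23.3, 0.0],
})

# Columns where a zero is implausible (an assumption, not taken from the notebook).
suspect_cols = ["Glucose", "BloodPressure", "BMI"]

null_counts = toy.isnull().sum()               # what isnull() reports
zero_counts = (toy[suspect_cols] == 0).sum()   # zeros that may hide missing values

print(null_counts)
print(zero_counts)
```

Running the same zero count on `df` would show whether this concern applies to the file used here.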

In [None]:
df.duplicated().sum()

There are no duplicated rows in the dataset.

In [None]:
df.dtypes

All attributes are numeric variables.

# Data Visualization

**Histogram for entire dataset**

In [None]:
df.hist(figsize=(15,15))
plt.show()

**Pie chart for Outcome**

In [None]:
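# value_counts() sorts by frequency, so take the labels from its index (0 = non-diabetic, 1 = diabetic)
outcome_counts=df['Outcome'].value_counts()
plt.pie(outcome_counts,autopct='%1.1f%%',labels=list(outcome_counts.index.map({0:'Non-diabetic',1:'Diabetic'})),colors=['Grey','g'])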
plt.title('Outcome')
plt.legend()
plt.show()

* In this dataset 65.1% of individuals are non-diabetic (Outcome = 0) & 34.9% are diabetic (Outcome = 1).
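
The class shares can also be read directly from `value_counts(normalize=True)`, which keeps each percentage attached to its class label, so a hand-written label list cannot end up on the wrong slice. A minimal sketch on a made-up Outcome column, assuming 1 means diabetic and 0 means non-diabetic:

```python
import pandas as pd

# Toy Outcome column (made-up values): 1 = diabetic, 0 = non-diabetic.
outcome = pd.Series([0, 0, 1, 0, 1, 0, 0, 1, 0, 0], name="Outcome")

# Share of each class in percent, indexed by the class label itself.
share = outcome.value_counts(normalize=True).mul(100).round(1)

# Replace the numeric labels with readable names via an explicit mapping.
share.index = share.index.map({0: "Non-diabetic", 1: "Diabetic"})
print(share)
```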

# Outlier detection

In [None]:
sns.boxplot(data=df)
plt.show()

In [None]:
df1=df.copy()

In [None]:
def outlier(column):
  Q1=column.quantile(0.25)
  Q3=column.quantile(0.75)
  IQR=Q3-Q1
  Lp=Q1-1.5*IQR
  Up=Q3+1.5*IQR
  return (column<Lp)|(column>Up)

OIQR=(df1.select_dtypes(include='number')).apply(outlier)

In [None]:
print(OIQR.sum())

In [None]:
df1=df1[~(OIQR.any(axis=1))]
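
To make the IQR rule used by `outlier()` concrete, here is a small worked example on made-up numbers. It applies the same fences, Q1 − 1.5·IQR and Q3 + 1.5·IQR, to a single series.

```python
import pandas as pd

# Made-up values; 300 is deliberately far from the rest.
values = pd.Series([80, 85, 90, 95, 100, 105, 110, 300])

q1 = values.quantile(0.25)   # 88.75
q3 = values.quantile(0.75)   # 106.25
iqr = q3 - q1                # 17.5
lower = q1 - 1.5 * iqr       # 62.5
upper = q3 + 1.5 * iqr       # 132.5

is_outlier = (values < lower) | (values > upper)
print(lower, upper)                  # 62.5 132.5
print(values[is_outlier].tolist())   # [300]
```

Because the notebook drops a row when any column flags it, the number of removed rows can be larger than the outlier count of any single column.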

In [None]:
df2=df1.copy()

In [None]:
print((df2.select_dtypes(include='number')).skew())

In [None]:
df2['p_Insulin']=np.log1p(df1['Insulin'])
df2['p_Age']=np.log1p(df1['Age'])
df2['p_Pregnancies']=np.log1p(df1['Pregnancies'])
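
`np.log1p` computes log(1 + x), so it stays defined for columns that contain zeros. It compresses large values more than small ones, which usually reduces right skew. A quick check on made-up numbers (not the notebook's data) compares skewness before and after:

```python
import numpy as np
import pandas as pd

# Made-up right-skewed values: mostly small, with a few large ones.
raw = pd.Series([0, 10, 20, 30, 40, 50, 60, 80, 200, 500])

transformed = np.log1p(raw)   # log(1 + x), defined at x = 0

skew_before = raw.skew()
skew_after = transformed.skew()
print(round(skew_before, 2), round(skew_after, 2))
```

Comparing `df2.skew()` before and after the transform in the same way shows whether each new `p_` column is actually less skewed than its source column.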

# Correlation Analysis:

In [None]:
c=df2.corr()
c

In [None]:
plt.figure(figsize=(12,8))
sns.heatmap(c,annot=True,cmap='Greens')
plt.title('Correlation heatmap')
plt.show()

In [None]:
s=df2.corr()['Outcome']
s.sort_values(ascending=False)

# Feature Selection

In [None]:
df2=df2.drop(['p_Insulin','p_Pregnancies','Age'],axis=1)

In [None]:
x=df2.drop('Outcome',axis=1)
y=df2['Outcome']

# Data Splitting

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=12)
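
`StandardScaler` is imported at the top but not applied in the cells shown here. If scaling is wanted, a common pattern is to fit the scaler on the training rows only and reuse those statistics for the test rows. The sketch below uses synthetic data and hypothetical variable names; `stratify=y` is an optional extra that is not part of the split above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and target (not the real data).
rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(100, 3))
y = rng.integers(0, 2, size=100)

# stratify=y keeps the class ratio similar in both parts (optional).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics
```

One caveat: `MultinomialNB` expects non-negative features, so it would need unscaled inputs or would have to be left out if standardized data were used.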

# Initial Modeling:

In [None]:
# List of Classification models to apply
models = {
    'Logistic Regression': LogisticRegression(),
    'LDA': LinearDiscriminantAnalysis(),
    'Ridge Classifier': RidgeClassifier(),
    'Perceptron': Perceptron(),
    'Passive Aggressive': PassiveAggressiveClassifier(),
    'SGD Classifier': SGDClassifier(),
    'SVC': SVC(probability=True),
    'Linear SVC': LinearSVC(),
    'KNN': KNeighborsClassifier(),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'Extra Trees': ExtraTreesClassifier(),
    'Gradient Boosting': GradientBoostingClassifier(),
    'AdaBoost': AdaBoostClassifier(),
    'XGBoost': xgb.XGBClassifier(),
    'LightGBM': lgb.LGBMClassifier(),
    # 'CatBoost': CatBoostClassifier(verbose=0),
    'Gaussian NB': GaussianNB(),
    'Multinomial NB': MultinomialNB(),
    'Bernoulli NB': BernoulliNB(),
    'MLP': MLPClassifier(),
    'Quadratic Discriminant Analysis': QuadraticDiscriminantAnalysis(),
    'Gaussian Process Classifier': GaussianProcessClassifier()
}

skf = StratifiedKFold(n_splits=5)
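
`skf` is defined here but not used in the cells that follow. One way it could be used is to pass it as `cv` to `cross_val_score`, which gives one accuracy per fold instead of a single test-split number. A minimal sketch on synthetic data, with hypothetical names (`X_demo`, `y_demo`, `skf_demo`) and a single model for brevity:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary problem standing in for x_train / y_train.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Each fold keeps roughly the same class ratio as the full data.
skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=skf_demo, scoring="accuracy"
)
print(scores.mean(), scores.std())
```

The spread across folds gives a rough idea of how much a single train/test split can move the ranking of models.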

In [None]:
result={}
for name,model in models.items():
  model.fit(x_train,y_train)
  y_pred=model.predict(x_test)

  accuracy=accuracy_score(y_test,y_pred)
  precision=precision_score(y_test,y_pred,average='weighted')
  recall=recall_score(y_test,y_pred,average='weighted')
  f1=f1_score(y_test,y_pred,average='weighted')
  # roc_auc=roc_auc_score(y_test,y_pred,average='weighted',multi_class='ovr')
  CM=confusion_matrix(y_test,y_pred)
  report=classification_report(y_test,y_pred,output_dict=True)

  metrics={
      # 'Model':name,
      'Accuracy':accuracy,
      'Precision':precision,
      'Recall':recall,
      'F1 Score':f1,
      # 'ROC AUC':roc_auc,
      'Confusion Matrix':CM,
      'Classification Report':report
  }
  result[name] = metrics

results_df = pd.DataFrame(result).T
results_df
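
The ROC AUC line in the loop above is commented out. For a binary target, `roc_auc_score` is usually given a continuous score for the positive class, not the hard 0/1 predictions. The following is a sketch on synthetic data with hypothetical names, not the notebook's own evaluation:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem standing in for the real features and target.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=12)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Probability of the positive class (column 1), not the 0/1 predictions.
pos_scores = clf.predict_proba(X_te)[:, 1]
auc = roc_auc_score(y_te, pos_scores)
print(auc)
```

Models without `predict_proba`, such as `RidgeClassifier` or `LinearSVC`, expose `decision_function`, whose output can be passed to `roc_auc_score` in the same way.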

Selected as the three best models: Gradient Boosting | Ridge Classifier | Linear SVC

# Voting

Hard voting

In [None]:
voting_clf_hard = VotingClassifier(estimators=[
    ('Gradient Boosting',GradientBoostingClassifier()),
    ('Ridge Classifier',RidgeClassifier()),
    ('Linear SVC',LinearSVC())
], voting='hard')

In [None]:
voting_clf_hard.fit(x_train, y_train)

In [None]:
y_pred_hard = voting_clf_hard.predict(x_test)

In [None]:
accuracy_hard = accuracy_score(y_test, y_pred_hard)
print(f'Hard Voting Classifier Accuracy: {accuracy_hard:.4f}')
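
Hard voting takes the class label predicted by each estimator and returns the majority label for every sample. The toy example below reproduces that rule with NumPy on made-up predictions from three classifiers.

```python
import numpy as np

# Made-up 0/1 predictions from three classifiers for five samples.
preds = np.array([
    [1, 0, 1, 0, 1],   # classifier A
    [1, 1, 0, 0, 1],   # classifier B
    [0, 1, 1, 0, 0],   # classifier C
])

# With binary labels and three voters, the majority label is 1 when at least two vote 1.
majority = (preds.sum(axis=0) >= 2).astype(int)
print(majority.tolist())  # [1, 1, 1, 0, 1]
```

Since only predicted labels are needed, hard voting works with `RidgeClassifier` and `LinearSVC`, which do not provide `predict_proba`. Soft voting would require probability estimates from every estimator.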

---

# Documentation and Reporting


## Diabetes Prediction Project Report

**1. Problem Statement:**
- This dataset contains diagnostic measurements from diabetes tests.

- The primary objective is to build a model that accurately predicts whether an individual is diabetic or not, based on these features.

**2. Libraries:**
- Import the required libraries.

**3. Data Understanding:**
- The dataset has 768 rows & 9 columns.
- Columns: Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age & Outcome.
- Target Variable: Outcome.

**4. Data Cleaning:**
- No duplicated rows were found.
- No missing (null) values were found.

**5. Encoding:**
- No encoding was required because all attributes are numeric.

**6. Data Visualization:**
- Histograms were created for the entire dataset to understand the distributions.
- A pie chart was created for Outcome to show the percentage of each class.
- Skewness of Insulin, Age and Pregnancies was handled using the log1p method (applied after outlier removal).

**7. Outlier detection:**
- Outliers were identified in the numeric columns using a box plot and the IQR rule (values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR).
- Rows containing an outlier in any column were removed.

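**8. Correlation Analysis:**
- A heatmap of the correlation matrix showed the relationships between the variables.
- Correlations with Outcome were sorted to compare the features.

**9. X-Y splitting:**
- The columns p_Insulin, p_Pregnancies and Age were dropped; the log-transformed p_Age was kept.
- The target column is stored in the y variable.
- The other features are stored in the x variable.
- The data was split into x_train, x_test, y_train and y_test (test size 0.2).

**10. Data Scaling:**
- StandardScaler was imported, but the cells above fit the models on unscaled x_train and x_test.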

**11. Initial Modeling:**
- Multiple classification models were tested:-
* Logistic Regression.
* Linear and Quadratic Discriminant Analysis.
* Ridge Classifier.
* Perceptron, Passive Aggressive and SGD classifiers.
* SVC and Linear SVC.
* K-Nearest Neighbors.
* Decision Tree, Random Forest and Extra Trees.
* Gradient Boosting, AdaBoost, XGBoost and LightGBM.
* Gaussian, Multinomial and Bernoulli Naive Bayes.
* MLP (neural network).
* Gaussian Process Classifier.

-- The 3 best models were selected (Gradient Boosting, Ridge Classifier, Linear SVC).

-- A hard voting classifier built from these three was taken as the final model.

-- It reached an accuracy of approximately 0.8047.

**THANK YOU**