## Imports :

In [70]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

## Loading the data :

In [53]:
dataframe_of_dataset = pd.read_csv('Dataset Folder/diabetes.csv')

In [54]:
dataframe_of_dataset

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.340,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1


A few questions that I want to ask myself at this point :-

-> What is the purpose of my project? What is the end result that I am trying to achieve?

Ans. I am trying to build a predictor that predicts whether a person has diabetes or not.


-> Why do you have to make a predictor yourself? We've got doctors!

Ans. We do have doctors, but a predictor can make their job easier and can also help people check themselves and start taking precautions and care. That is why I want to make one.


-> How will you make a predictor? 

Ans. I will make a predictor using a few techniques/tricks, and I will see which trick works best for predicting whether one has diabetes or not.


-> What are the examples of tricks?

Ans. Maybe the KNN trick, the Logistic Regression trick, the SVM trick, the Decision Tree Classifier trick, the RandomForest Classifier trick, the xgboost trick, or one of the others.


-> Do you know those tricks? 

Ans. Not all of them. In particular, I need to go through Logistic Regression and the Decision Tree Classifier once, study the SVM trick and xgboost fully, and maybe study RandomForest fully as well.


-> Can you apply those tricks directly on the existing data regarding diabetes that you have?

Ans. No! First, the data needs to be inspected, checked, and processed. Things like: how many diabetic and non-diabetic datapoints there are, how many missing values each column has (and what to do about them), how many continuous and how many categorical features there are, correlations, and maybe more (i.e. data preprocessing).


-> After that, we can understand each trick and apply the tricks one by one!


-> Thereafter, use performance metrics such as accuracy, precision, F-score, etc. to evaluate the performance of each algorithm.

-> Think a bit about which performance metrics suit which situations.


-> Refer to the reports published by different people on diabetes prediction and see whether anything useful comes up.
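
The point above about picking metrics for the situation can be made concrete with a small sketch. It computes accuracy, precision, recall, and F1 by hand so that the definitions are visible. The `y_true` and `y_pred` lists are made up for illustration and are not taken from the diabetes data; `sklearn.metrics` offers the same quantities through `accuracy_score`, `precision_score`, `recall_score`, and `f1_score`.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels: 3 positive cases out of 10, only one of them detected.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

metrics = binary_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

In this made-up case the accuracy is 0.70 even though only one of the three positive cases is found (recall of about 0.33). For a task like diabetes prediction, where a missed positive case matters, accuracy alone can therefore look better than the model really is.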

## Data Exploration :

In [55]:
# Number of Diabetic datapoints and Non-diabetic datapoints :-

diabetic_dataframe = dataframe_of_dataset[dataframe_of_dataset['Outcome']==1]

non_diabetic_dataframe = dataframe_of_dataset[dataframe_of_dataset['Outcome']==0]

print(f'Number of Diabetic datapoints/rows in the dataset are : {diabetic_dataframe.shape[0]}')
print(f'Number of Non-Diabetic datapoints/rows in the dataset are : {non_diabetic_dataframe.shape[0]}')

Number of Diabetic datapoints/rows in the dataset are : 268
Number of Non-Diabetic datapoints/rows in the dataset are : 500
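
The two counts printed above show that the classes are imbalanced, which gives a useful reference point: a model that always predicts "non-diabetic" is already right for every non-diabetic row. A quick calculation using only those two counts (computed over the whole dataset, so the figure for a particular test split may differ slightly):

```python
# Counts taken from the output above.
n_diabetic = 268
n_non_diabetic = 500
n_total = n_diabetic + n_non_diabetic

# Share of the positive (diabetic) class in the whole dataset.
positive_share = n_diabetic / n_total

# Accuracy of a trivial model that always predicts the majority class (non-diabetic).
majority_baseline_accuracy = n_non_diabetic / n_total

print(f"Share of diabetic rows: {positive_share:.2%}")
print(f"Accuracy of always predicting non-diabetic: {majority_baseline_accuracy:.2%}")
```

Any accuracy score reported later is easier to judge against this baseline than against 50%.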


In [7]:
# Renaming the columns for convenience when working with the dataframe :-

dataframe_of_dataset.rename(
    columns={
        "Pregnancies": "preg",
        "Glucose": "glu",
        "BloodPressure": "bp",
        "SkinThickness": "skinThickness",
        "Insulin": "insulin",
        "BMI": "bmi",
        "DiabetesPedigreeFunction": "dpf",
        "Age": "age",
        "Outcome": "outcome",
    },
    inplace=True,
)


dataframe_of_dataset

Unnamed: 0,preg,glu,bp,skinThickness,insulin,bmi,dpf,age,outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.340,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1


In [8]:
# Let's have a look at the null('NaN') values in the dataframe :-

for column in dataframe_of_dataset.columns:
    null_count = dataframe_of_dataset[column].isnull().sum()
    print(f"Number of NaN(null) values in '{column}' column are : {null_count}")

Number of NaN(null) values in 'preg' column are : 0
Number of NaN(null) values in 'glu' column are : 0
Number of NaN(null) values in 'bp' column are : 0
Number of NaN(null) values in 'skinThickness' column are : 0
Number of NaN(null) values in 'insulin' column are : 0
Number of NaN(null) values in 'bmi' column are : 0
Number of NaN(null) values in 'dpf' column are : 0
Number of NaN(null) values in 'age' column are : 0
Number of NaN(null) values in 'outcome' column are : 0


### Now we know there are no 'null' values in the entire dataframe. However, there could be '0' values in the dataset, and a '0' in any column other than 'preg' and 'outcome' is effectively missing information that can affect the performance of the algorithms. Hence, we should count the '0' values in those columns.

### A '0' in 'preg' or 'outcome' is not missing information: a person can have had 0 pregnancies, and an 'outcome' of '0' indicates a non-diabetic patient.

In [9]:
# Let's have a look at the missing values in the dataframe

columns_missing_values = ['glu', 'bp', 'skinThickness', 'insulin', 'bmi', 'dpf']


for missing_values_column in columns_missing_values:
    # A value of 0 in these columns is treated as a missing measurement.
    zero_count = (dataframe_of_dataset[missing_values_column] == 0).sum()
    print(f'Number of missing values in {missing_values_column} are : {zero_count}')

Number of missing values in glu are : 5
Number of missing values in bp are : 35
Number of missing values in skinThickness are : 227
Number of missing values in insulin are : 374
Number of missing values in bmi are : 11
Number of missing values in dpf are : 0
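
One common way to make these zeros explicit is to convert them to `NaN`, so that pandas itself treats them as missing. The sketch below uses a tiny made-up dataframe (`toy`, not the real dataset) with the renamed column names. Whether every zero in these columns really is a missing measurement is an assumption.

```python
import numpy as np
import pandas as pd

# Tiny made-up dataframe standing in for the real dataset.
toy = pd.DataFrame({
    "preg": [6, 1, 0],
    "glu": [148, 0, 137],
    "insulin": [0, 0, 168],
    "bmi": [33.6, 26.6, 0.0],
    "outcome": [1, 0, 1],
})

# Columns in which a 0 is assumed to mean "not measured".
# 'preg' and 'outcome' are left out because 0 is a valid value there.
zero_means_missing = ["glu", "insulin", "bmi"]

# Count the zeros per column in one vectorized step.
zero_counts = (toy[zero_means_missing] == 0).sum()

# Work on a copy and mark the zeros as NaN.
toy_marked = toy.copy()
toy_marked[zero_means_missing] = toy_marked[zero_means_missing].replace(0, np.nan)

print(zero_counts)
print(toy_marked.isnull().sum())
```

After this step `isnull().sum()` reports the same counts as the zero check, and the zeros in 'preg' and 'outcome' stay untouched.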


## Before any sort of manipulation of the missing data, let's apply the techniques and find out how well they predict.

In [10]:
# First of all, splitting the dataset into training and testing sets :

X = dataframe_of_dataset.iloc[:, :-1]

y = dataframe_of_dataset.iloc[:, -1]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)
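
Because the two classes are not equally frequent (268 vs. 500 rows), a purely random split can give the training and test sets slightly different class proportions. `train_test_split` accepts a `stratify` argument that keeps the proportions close to equal. The sketch below demonstrates this on synthetic labels with the same class counts; the feature matrix `X_demo` is random filler, not the diabetes data, and the split in the cell above does not use stratification.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 268 positive and 500 negative labels, random features.
rng = np.random.default_rng(1)
y_demo = np.array([1] * 268 + [0] * 500)
X_demo = rng.normal(size=(len(y_demo), 3))

# stratify=y_demo keeps the positive/negative ratio similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=1, stratify=y_demo
)

print(f"Positive share overall : {y_demo.mean():.3f}")
print(f"Positive share in train: {y_tr.mean():.3f}")
print(f"Positive share in test : {y_te.mean():.3f}")
```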

## i.e. Before Anything :

### Unweighted KNN Technique :-

In [29]:
neigh = KNeighborsClassifier(n_neighbors=17)

neigh.fit(X_train, y_train)

y_pred = neigh.predict(X_test)

knn_accuracy_score = accuracy_score(y_test, y_pred)*100

print(f"{knn_accuracy_score}% is the accuracy score in percentage form with knn.")

78.57142857142857% is the accuracy score in percentage form with knn.




### Weighted KNN Technique :-

In [69]:
neigh = KNeighborsClassifier(n_neighbors=15, weights="distance")

neigh.fit(X_train, y_train)

y_pred = neigh.predict(X_test)

knn_accuracy_score = accuracy_score(y_test, y_pred)*100

print(f"{knn_accuracy_score}% is the accuracy score in percentage form with knn.")

77.92207792207793% is the accuracy score in percentage form with knn.
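
KNN classifies by distance, so features with large values (such as insulin, which reaches into the hundreds) can dominate features with small values (such as dpf). A common remedy is to standardize the features before fitting. The sketch below is not what the cells above do. It shows one way to add scaling with a pipeline, using synthetic data from `make_classification` as a stand-in for the diabetes features, and it makes no claim about how the accuracy on the real data would change.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with 8 features, like the diabetes dataset.
X_demo, y_demo = make_classification(
    n_samples=768, n_features=8, n_informative=4, random_state=1
)
# Inflate one feature so that it would dominate unscaled distances.
X_demo[:, 0] *= 100.0

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=1
)

# The scaler is fitted on the training data only, inside the pipeline,
# and then applied to the test data when predicting.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=17))
scaled_knn.fit(X_tr, y_tr)

scaled_accuracy = scaled_knn.score(X_te, y_te)
print(f"{scaled_accuracy * 100}% is the accuracy score with scaled knn on synthetic data.")
```

Putting the scaler inside the pipeline means the same object can later be passed to cross-validation utilities without leaking test-set statistics into the scaling step.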




### Logistic Regression :-

In [50]:
logistic = LogisticRegression(max_iter=200)

logistic.fit(X_train, y_train)

# Predicting the X_test datapoints :
y_pred = logistic.predict(X_test)

logistic_accuracy_score = accuracy_score(y_test, y_pred)*100

print(f"{logistic_accuracy_score}% is the accuracy score in percentage form with LogisticRegression.")

77.92207792207793% is the accuracy score in percentage form with LogisticRegression.


### Decision Trees :-

In [81]:
decision_tree = DecisionTreeClassifier(max_depth=2)

decision_tree.fit(X_train, y_train)

# Predicting the X_test :
y_pred = decision_tree.predict(X_test)

decision_accuracy_score = accuracy_score(y_test, y_pred)*100

print(f"{decision_accuracy_score}% is the accuracy score in percentage form with DecisionTreesTrick.")

79.87012987012987% is the accuracy score in percentage form with DecisionTreesTrick.


In [12]:
# --------------------------------------------------------------------------------------------------------------

In [13]:
# Non-diabetic rows whose glucose value is 0 (i.e. missing) :-
dataframe_of_dataset[(dataframe_of_dataset['glu']==0) & (dataframe_of_dataset['outcome']==0)]

Unnamed: 0,preg,glu,bp,skinThickness,insulin,bmi,dpf,age,outcome
75,1,0,48,20,0,24.7,0.14,22,0
182,1,0,74,20,23,27.7,0.299,21,0
342,1,0,68,35,0,32.0,0.389,22,0


In [14]:
# Column means for non-diabetic rows, excluding the rows where glucose is 0 :-
dataframe_of_dataset[(dataframe_of_dataset['outcome']==0) & (dataframe_of_dataset['glu']!=0)].mean()

preg               3.311871
glu              110.643863
bp                68.213280
skinThickness     19.631791
insulin           69.160966
bmi               30.317304
dpf                0.430662
age               31.247485
outcome            0.000000
dtype: float64

In [15]:
# Column means for diabetic rows (rows where glucose is 0 are not excluded here) :-
dataframe_of_dataset[dataframe_of_dataset['outcome']==1].mean()

preg               4.865672
glu              141.257463
bp                70.824627
skinThickness     22.164179
insulin          100.335821
bmi               35.142537
dpf                0.550500
age               37.067164
outcome            1.000000
dtype: float64

In [16]:
# Going to perform different types of imputation tricks based on various ideas and use ML algorithms.
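
As one possible starting point for the imputation step, the sketch below replaces zeros with the median of the non-zero values in the same column. It is only an illustration on made-up dataframes (`toy_train` and `toy_test` are hypothetical), and median imputation is an assumption here, not necessarily the approach this notebook goes on to use. The medians are computed on the training rows only and then reused for the test rows, so that no information from the test set enters the preprocessing.

```python
import numpy as np
import pandas as pd

# Made-up training and test rows; 0 stands for a missing measurement.
toy_train = pd.DataFrame({"glu": [148, 0, 183, 89], "bmi": [33.6, 26.6, 0.0, 28.1]})
toy_test = pd.DataFrame({"glu": [0, 137], "bmi": [43.1, 0.0]})

columns_to_impute = ["glu", "bmi"]

# Medians of the non-zero training values, one per column.
train_medians = toy_train[columns_to_impute].replace(0, np.nan).median()


def impute_zeros(frame, medians):
    """Return a copy of `frame` with zeros in the `medians` columns replaced by those medians."""
    result = frame.copy()
    columns = list(medians.index)
    result[columns] = result[columns].replace(0, np.nan).fillna(medians)
    return result


train_imputed = impute_zeros(toy_train, train_medians)
test_imputed = impute_zeros(toy_test, train_medians)

print(train_medians)
print(test_imputed)
```

Other ideas, such as using class-specific means like the ones computed above, would follow the same pattern, but they need more care because they use the label during preprocessing.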