## Introduction
Greetings from the Kaggle bot! This is an automatically-generated kernel with starter code demonstrating how to read in the data and begin exploring. Click the blue "Edit Notebook" or "Fork Notebook" button at the top of this kernel to begin editing.

## Exploratory Analysis
To begin this exploratory analysis, first import the libraries and define functions for plotting the data with `matplotlib`. Depending on the data, not all plots will be made. (Hey, I'm just a kerneling bot, not a Kaggle Competitions Grandmaster!)

In [None]:
from mpl_toolkits.mplot3d import Axes3D
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt # plotting
import numpy as np # linear algebra
import os # accessing directory structure
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Regular EDA (exploratory data analysis) and plotting libraries
# (matplotlib.pyplot is already imported above as plt)
import seaborn as sns

# We want our plots to appear inside the notebook
%matplotlib inline

# Models from scikit-learn
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

# Model Evaluations
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score
import warnings
warnings.filterwarnings("ignore")

There is 1 csv file in the current version of the dataset:


In [None]:
print(os.listdir('../input'))

Now you're ready to read in the data and use the plotting functions to visualize the data.

### Let's check 1st file: ../input/diabetes.csv

In [None]:
nRowsRead = 1000 # set to None to read the whole file
df = pd.read_csv('../input/diabetes.csv', delimiter=',', nrows=nRowsRead)
df.dataframeName = 'diabetes.csv'
nRow, nCol = df.shape
print(f'There are {nRow} rows and {nCol} columns')

Let's take a quick look at what the data looks like:

In [None]:
df.head(5)

In [None]:
df.tail()

Class balance of the `Outcome` column:

In [None]:
df["Outcome"].value_counts()

In [None]:
df["Outcome"].value_counts().plot(kind="bar",color=["salmon","lightblue"])

In [None]:
df.info()

In [None]:
df.describe()

### Diabetes Related to Age

In [None]:
# Create another figure
plt.figure(figsize=(10,6))

# Scatter with positive examples
plt.scatter(df.Age[df.Outcome==1],
           df.Pregnancies[df.Outcome==1],c="salmon")

# Scatter with negative examples
plt.scatter(df.Age[df.Outcome==0],
           df.Pregnancies[df.Outcome==0],
           c="lightblue")

# Add some helpful info
plt.title("Diabetics in Pregnant women")
plt.xlabel("Age")
plt.ylabel("Pregnancies")
plt.legend(["Diabetics","No Diabetics"]);

In [None]:
data = pd.read_csv('../input/diabetes.csv', delimiter=',', nrows = nRowsRead) 
df = pd.DataFrame(data, columns= ['Age','Pregnancies','BMI','DiabetesPedigreeFunction','Outcome'])



In [None]:
data.query('Age > 70',inplace = True)

print(data)

In [None]:
# Create a plot of crosstab: one row per age, one column per outcome
pd.crosstab(df.Age, df.Outcome).plot(kind="bar",
                                     figsize=(10,6),
                                     color=["lightblue","salmon"])
plt.title("Diabetics frequency related to Age")
plt.xlabel("Age")
plt.ylabel("Count")
# Columns are ordered 0 (no diabetes), 1 (diabetes)
plt.legend(["No Diabetics","Diabetics"]);
plt.xticks(rotation=90);

In [None]:
df.Age.plot.hist();

### Diabetes related to BMI

In [None]:
df.BMI.plot.hist();

In [None]:
df.corr()

In [None]:
corr_matrix = df.corr()
fig,ax = plt.subplots(figsize =(15,10))
ax = sns.heatmap(corr_matrix,
                annot = True,
                linewidth=0.5,
                fmt=".2f",
                cmap="YlGnBu");
bottom,top = ax.get_ylim()
ax.set_ylim(bottom + 0.5,top-0.5)

In [None]:
df.head()

In [None]:
#Split data into X and Y
X = df.drop("Outcome",axis =1)
y = df["Outcome"]

In [None]:
# Split data into train and test sets
np.random.seed(42)

# Split into train & test set
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.2)

### Introducing different Models
1. Logistic Regression - The probabilities of the possible outcomes of a single trial are modelled using a logistic function.
2. Naive Bayes - This model is based on Bayes' theorem, with the assumption of independence between every pair of features given the class.
3. Stochastic Gradient Descent - A simple and very efficient approach to fitting linear models.
4. K-Nearest Neighbours - Classification is computed from a simple majority vote of the k nearest neighbours of each point.
5. Decision Tree - Given data with attributes and their classes, a decision tree produces a sequence of rules that can be used to classify the data.
6. Random Forest - A meta-estimator that fits a number of decision trees on various sub-samples of the dataset and averages them to improve predictive accuracy and control over-fitting.
7. Support Vector Machine - Represents the training data as points in space, separated into categories by a gap that is as wide as possible.

Let us find out which model fits this problem best.
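
For the first model in the list, the "logistic function" can be sketched in a few lines of plain Python. This is only an illustration of the idea, not scikit-learn's implementation: the weights, bias, and sample values below are made up, and `predict_proba` here is a hypothetical helper, not the scikit-learn method of the same name.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class for one sample under a logistic model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical weights and a made-up sample, for illustration only
weights = [0.03, 0.1]   # one weight per feature
bias = -2.0
sample = [50, 2]        # e.g. Age = 50, Pregnancies = 2

p = predict_proba(sample, weights, bias)
print(f"P(Outcome = 1) = {p:.3f}")
```

A probability above 0.5 would be classified as `Outcome = 1`, which corresponds to a positive value of the linear score `z`.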

In [None]:
# Put models in a dictionary

models = { "Logistic Regression": LogisticRegression(),
            "Naive Bayes": GaussianNB(),
            "Stochastic Gradient": SGDClassifier(),
            "KNeighbors Classifier": KNeighborsClassifier(),
            "DecisionTree Classifier": DecisionTreeClassifier(),
            "RandomForest Classifier": RandomForestClassifier(),
            "Support Vector Machine": SVC()}
# Create a function to fit and score models

def fit_and_score(models, X_train,X_test,y_train,y_test):
    """
    Fits and evaluates given machine learning models.
    models: a dict of different scikit-learn machine learning models
    X_train: training data (no labels)
    X_test: testing data (no labels)
    y_train: training labels
    y_test: test labels
    """
    # Set random seed
    np.random.seed(42)
    # Make a dictionary to keep model scores
    model_scores = {}
    # Loop through models
    for name, model in models.items():
        # Fit the model to the data
        model.fit(X_train,y_train)
        # Evaluate the model and append its score to model scores
        model_scores[name] = model.score(X_test,y_test)
    return model_scores

In [None]:
model_scores = fit_and_score(models=models,
                             X_train= X_train,
                             X_test= X_test,
                            y_train=y_train,
                            y_test=y_test)
model_scores

In [None]:
model_compare = pd.DataFrame(model_scores, index=["accuracy"])
model_compare.T.plot.bar();

### Now we've got baseline models, and a model's first predictions are always what we should base our next steps on.

What should we do?

Let's look at the following:

 * Hyperparameter tuning
 * Feature importance
 * Confusion matrix
 * Cross-validation
 * Precision
 * Recall
 * F1 score
 * Classification report
 * ROC curve
 * Area under the curve (AUC)

### Hyperparameter Tuning

In [None]:
from sklearn.neighbors import KNeighborsClassifier
# Let's tune KNN
train_scores=[]
test_scores= [] 

# Create a list of different values for n_neighbors
neighbors = range(1,21)

# Setup KNN instance
knn = KNeighborsClassifier()

# Loop through different n_neighbors
for i in neighbors:
    knn.set_params(n_neighbors=i)
    
    # Fit the algorithm
    knn.fit(X_train,y_train)
    
    # update the training scores list
    train_scores.append(knn.score(X_train,y_train))
    
    # update the test scores list
    test_scores.append(knn.score(X_test,y_test))

In [None]:
plt.plot(neighbors,train_scores, label = "Train score")
plt.plot(neighbors,test_scores, label = "Test score")
plt.xticks(np.arange(1, 21, 1))
plt.xlabel("Number of neighbors")
plt.ylabel("Model score")
plt.legend()

print(f"Maximum KNN score on the test data: {max(test_scores)*100:.2f}%")

### Hyperparameter tuning with RandomizedSearchCV

We're going to tune:
 * LogisticRegression()
 * RandomForestClassifier()

...using RandomizedSearchCV.

In [None]:
# Create a hyperparameter grid for LogisticRegression

log_reg_grid = {"C":np.logspace(-4, 4, 20),
               "solver":["liblinear"]}

# Create a hyperparameter grid for RandomForestClassifier
rf_grid = {"n_estimators": np.arange(10, 1000, 50),
           "max_depth": [None, 3, 5, 10],
           "min_samples_split": np.arange(2, 20, 2),
           "min_samples_leaf": np.arange(1, 20, 2)}

### Now that we've got hyperparameter grids set up for each of our models, let's tune them using RandomizedSearchCV
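
Before running the searches, it helps to see how much of each grid `n_iter=20` can cover. The sketch below counts the combinations in the two grids defined above, using plain `range()` in place of `np.arange` and the known length of the `C` list; the dictionary names `log_reg_sizes` and `rf_sizes` are introduced here for illustration only.

```python
from math import prod

# Number of candidate values per hyperparameter, mirroring the grids above
log_reg_sizes = {"C": 20, "solver": 1}
rf_sizes = {
    "n_estimators": len(range(10, 1000, 50)),
    "max_depth": len([None, 3, 5, 10]),
    "min_samples_split": len(range(2, 20, 2)),
    "min_samples_leaf": len(range(1, 20, 2)),
}

n_iter = 20  # same value passed to RandomizedSearchCV below
for name, sizes in [("LogisticRegression", log_reg_sizes),
                    ("RandomForestClassifier", rf_sizes)]:
    total = prod(sizes.values())
    tried = min(n_iter, total)
    print(f"{name}: {total} combinations, {tried} sampled ({tried / total:.1%})")
```

The logistic regression grid has only 20 combinations, so 20 iterations should cover all of it, while the random forest grid has 7200 combinations, of which 20 iterations sample about 0.3%. With `cv=5`, each search fits 20 x 5 = 100 models.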

In [None]:
# Tune LogisticRegression

np.random.seed(42)

# Setup random hyperparameter search for LogisticRegression
rs_log_reg = RandomizedSearchCV(LogisticRegression(),
                                param_distributions=log_reg_grid,
                                cv=5,
                                n_iter=20,
                                verbose=True)

# Fit random hyperparameter search model for LogisticRegression
rs_log_reg.fit(X_train, y_train)

In [None]:
rs_log_reg.best_params_

In [None]:
rs_log_reg.score(X_test, y_test)

In [None]:
# Tune RandomForestClassifier
# Setup random seed
np.random.seed(42)

# Setup random hyperparameter search for RandomForestClassifier
rs_rf = RandomizedSearchCV(RandomForestClassifier(), 
                           param_distributions=rf_grid,
                           cv=5,
                           n_iter=20,
                           verbose=True)

# Fit random hyperparameter search model for RandomForestClassifier()
rs_rf.fit(X_train, y_train)

In [None]:
# Find the best hyperparameters
rs_rf.best_params_

In [None]:
# Evaluate the randomized search RandomForestClassifier model
rs_rf.score(X_test, y_test)

### Hyperparameter Tuning with GridSearchCV
Since our LogisticRegression model provides the best scores so far, we'll try and improve them again using GridSearchCV...

In [None]:

# Different hyperparameters for our LogisticRegression model
log_reg_grid = {"C": np.logspace(-4, 4, 30),
                "solver": ["liblinear"]}

# Setup grid hyperparameter search for LogisticRegression
gs_log_reg = GridSearchCV(LogisticRegression(),
                          param_grid=log_reg_grid,
                          cv=5,
                          verbose=True)

# Fit grid hyperparameter search model
gs_log_reg.fit(X_train, y_train);

In [None]:
# Check the best hyperparameters
gs_log_reg.best_params_

In [None]:
# Evaluate the grid search LogisticRegression model
gs_log_reg.score(X_test, y_test)

### Evaluating our tuned machine learning classifier, beyond accuracy
1. ROC curve and AUC score
2. Confusion matrix
3. Classification report
4. Precision
5. Recall
6. F1-score

Cross-validation should be used where possible.
To make comparisons and evaluate our trained model, we first need to make predictions.

In [None]:
# Make predictions with tuned model
y_preds = gs_log_reg.predict(X_test)

In [None]:
# Confusion matrix (computed after y_preds is defined)
print(confusion_matrix(y_test, y_preds))
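
The ROC curve and AUC score are listed for this evaluation but are not computed anywhere in this notebook. As a minimal sketch of what they measure, the plain-Python functions below compute the ROC points and the AUC for a handful of toy labels and scores. The data is made up, and `roc_auc` and `roc_points` are hypothetical helper names, not scikit-learn functions.

```python
def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def roc_points(y_true, y_score):
    """(false positive rate, true positive rate) at each distinct score threshold."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(y_score), reverse=True):
        tp = sum(1 for t, s in zip(y_true, y_score) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, y_score) if s >= thr and t == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

# Toy labels and predicted probabilities, made up for illustration
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]

print(roc_points(y_true, y_score))
print(f"AUC = {roc_auc(y_true, y_score):.3f}")
```

In this notebook the scores would presumably come from `gs_log_reg.predict_proba(X_test)[:, 1]`, and `sklearn.metrics.roc_auc_score` computes the same quantity far more efficiently than the pairwise loop above.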

In [None]:
# Import Seaborn
import seaborn as sns

# Increase font size
sns.set(font_scale=1.5) 
def plot_conf_mat(y_test, y_preds):
    """
    Plots a confusion matrix using Seaborn's heatmap().
    """
    fig, ax = plt.subplots(figsize=(3, 3))
    ax = sns.heatmap(confusion_matrix(y_test, y_preds),
                     annot=True, # Annotate the boxes
                     cbar=False)
    plt.xlabel("Predicted label") # predictions go on the x-axis
    plt.ylabel("True label") # true labels go on the y-axis 
    
plot_conf_mat(y_test, y_preds)

In [None]:
print(classification_report(y_test, y_preds))
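
The precision, recall and F1 columns of the classification report all derive from three counts in the confusion matrix: true positives, false positives and false negatives. The sketch below shows that arithmetic for the positive class on made-up labels; `binary_metrics` is a hypothetical helper, not part of scikit-learn.

```python
def binary_metrics(y_true, y_pred):
    """Precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

# Made-up labels and predictions: 3 true positives, 1 false positive, 1 false negative
toy_true = [1, 1, 1, 0, 0, 0, 0, 1]
toy_pred = [1, 0, 1, 0, 1, 0, 0, 1]

print(binary_metrics(toy_true, toy_pred))
```

Precision answers "of the patients predicted diabetic, how many were?", while recall answers "of the diabetic patients, how many were found?". F1 is the harmonic mean of the two.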

In [None]:
# Create a new classifier with best parameters
clf = LogisticRegression(C=0.20433597178569418,
                         solver="liblinear")

In [None]:
# Cross-validated accuracy
cv_acc = cross_val_score(clf,
                         X,
                         y,
                         cv=5,
                         scoring="accuracy")
cv_acc

In [None]:
cv_acc = np.mean(cv_acc)
cv_acc
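
With `cv=5`, the data is split into five folds, and each fold takes one turn as the test set while the other four are used for training; the mean above averages those five scores. The sketch below shows the index bookkeeping for plain, unstratified folds. This is a simplification: for a classifier and an integer `cv`, scikit-learn uses stratified folds that preserve class proportions. `k_fold_indices` is a hypothetical helper name.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds; yield (train_idx, test_idx)."""
    # The first (n_samples % k) folds get one extra sample
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(n_samples=12, k=5))
for i, (train_idx, test_idx) in enumerate(folds):
    print(f"fold {i}: test={test_idx}, train size={len(train_idx)}")
```

Every sample appears in exactly one test fold, so each row contributes to the cross-validated score exactly once.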

In [None]:
# Cross-validated precision
cv_precision = cross_val_score(clf,
                         X,
                         y,
                         cv=5,
                         scoring="precision")
cv_precision=np.mean(cv_precision)
cv_precision

In [None]:
# Cross-validated recall
cv_recall = cross_val_score(clf,
                         X,
                         y,
                         cv=5,
                         scoring="recall")
cv_recall = np.mean(cv_recall)
cv_recall

In [None]:
# Cross-validated f1-score
cv_f1 = cross_val_score(clf,
                         X,
                         y,
                         cv=5,
                         scoring="f1")
cv_f1 = np.mean(cv_f1)
cv_f1

In [None]:
# Visualize cross-validated metrics
cv_metrics = pd.DataFrame({"Accuracy": cv_acc,
                           "Precision": cv_precision,
                           "Recall": cv_recall,
                           "F1": cv_f1},
                          index=[0])

cv_metrics.T.plot.bar(title="Cross-validated classification metrics",
                      legend=False);

In [None]:
# Fit an instance of LogisticRegression
clf = LogisticRegression(C=0.20433597178569418,
                         solver="liblinear")

clf.fit(X_train, y_train);

In [None]:
# Match coefficients to the feature columns (X excludes the "Outcome" target)
feature_dict = dict(zip(X.columns, list(clf.coef_[0])))
feature_dict

In [None]:
# Visualize feature importance
feature_df = pd.DataFrame(feature_dict, index=[0])
feature_df.T.plot.bar(title="Feature Importance",legend = False);
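
One caveat when reading this chart: logistic regression coefficients depend on the units of each feature, so a feature with a small numeric range can receive a large coefficient for the same effect. `StandardScaler` is imported at the top of this notebook but never applied. As a sketch of what standardization does, the plain-Python function below z-scores a column using the population standard deviation; the column values are made up and `standardize` is a hypothetical helper.

```python
from statistics import mean, pstdev

def standardize(values):
    """Z-score a column: subtract the mean, divide by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Made-up columns on very different scales
age = [21, 35, 50, 62]
pedigree = [0.2, 0.5, 0.9, 1.6]

age_z = standardize(age)
pedigree_z = standardize(pedigree)
print([round(v, 2) for v in age_z])
print([round(v, 2) for v in pedigree_z])
```

After this transformation both columns have mean 0 and standard deviation 1, so coefficients fitted on them can be compared more fairly.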

### Sources:
    https://rpubs.com/ikodesh/53189#:~:text=According%20to%20http%3A%2F%2Fwww,family%20history%20to%20predict%20how

### Conclusion:

This study suggests that, besides Age, BMI and number of pregnancies, the "Diabetes Pedigree Function" should be considered an important factor in a woman's chance of having diabetes. It reflects the history of diabetes mellitus in relatives and the genetic relationship of those relatives to the patient, which may be a contributing cause. This measure of genetic influence gives an idea of the hereditary risk one might have of the onset of diabetes mellitus. DiabetesPedigreeFunction should therefore be given importance as a feature in further analysis for the prediction of diabetes.