# Logistic Regression Model 

We will use logistic regression to model the "Pima Indians Diabetes" data set. The model predicts whether a patient has diabetes.


The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Import Libraries 

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt      
%matplotlib inline 
import seaborn as sns

## Load and Review Data

In [None]:
df = pd.read_csv('/kaggle/input/pima-indians-diabetes-database/diabetes.csv')
df.shape

In [None]:
df.head()

In [None]:
# Check whether the data set contains any null values

df.isnull().values.any()

## Histograms of the first 8 columns (excluding the Outcome column)

In [None]:
columns = list(df)[0:-1]  # every column except the last one, 'Outcome'
# 8 histograms arranged in a 4 x 2 grid
df[columns].hist(bins=100, figsize=(12,16), layout=(4,2));

## Identify Correlation in data

In [None]:
df.corr()

### The correlations are easier to read as a plot, so the function below draws the correlation matrix

In [None]:
def plot_corr(df, size=11):
    """Draw the correlation matrix of df as a colour-coded grid."""
    corr = df.corr()
    fig, ax = plt.subplots(figsize=(size, size))
    ax.matshow(corr)
    # Label both axes with the column names
    plt.xticks(range(len(corr.columns)), corr.columns)
    plt.yticks(range(len(corr.columns)), corr.columns)

plot_corr(df)

#### In the plot above, yellow marks the highest correlation and dark blue/purple the lowest (with matplotlib's default colour map).
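
Colours make it hard to rank features precisely. To see which features track the outcome most closely, the correlations with `Outcome` can be sorted instead. A minimal sketch on a made-up frame (the `toy` values are illustrative only; on the notebook's data the same expression would be applied to `df`):

```python
import pandas as pd

# Made-up rows with the same kind of columns as the real data
toy = pd.DataFrame({
    "Glucose": [85, 168, 99, 150, 110, 180],
    "Age": [22, 50, 31, 45, 29, 33],
    "Outcome": [0, 1, 0, 1, 0, 1],
})

# Correlation of every feature with Outcome, strongest first
outcome_corr = toy.corr()["Outcome"].drop("Outcome").sort_values(ascending=False)
print(outcome_corr)
```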

## Bivariate Plots

In [None]:
sns.pairplot(df,diag_kind='kde')

## Calculate the ratio of True/False diabetes cases from the Outcome variable

In [None]:
n_true = len(df.loc[df['Outcome'] == True])
n_false = len(df.loc[df['Outcome'] == False])
print("Number of true cases: {0} ({1:2.2f}%)".format(n_true, (n_true / (n_true + n_false)) * 100 ))
print("Number of false cases: {0} ({1:2.2f}%)".format(n_false, (n_false / (n_true + n_false)) * 100))

#### So 34.90% of the people in the current data set have diabetes and the remaining 65.10% do not. This is a reasonable balance of True/False diabetes cases in the data.
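
The same proportions can be read off in one call, and they also give a baseline: a model that always predicts the majority class is right exactly as often as that class occurs, so a useful classifier should beat it. A minimal sketch on a made-up label column (the `toy_outcome` values are illustrative, not the real data):

```python
import pandas as pd

# Hypothetical 0/1 labels standing in for df['Outcome'] (7 positives, 13 negatives)
toy_outcome = pd.Series([1] * 7 + [0] * 13, name="Outcome")

# Share of each class, as fractions that sum to 1
class_share = toy_outcome.value_counts(normalize=True)

# Accuracy of always predicting the most common class
baseline_accuracy = class_share.max()

print(class_share)
print("Majority-class baseline accuracy: {0:0.2f}%".format(baseline_accuracy * 100))
```

On the real data this baseline would be the 65.10% share of non-diabetic cases reported above.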

## Splitting the data
* ### We will use 70% of the data for training and 30% for testing.

In [None]:
from sklearn.model_selection import train_test_split

X = df.drop('Outcome',axis=1)  # Predictor feature columns (m rows x 8 columns)
Y = df['Outcome']              # Predicted class (1=True, 0=False), one value per row (m)

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=1)
# random_state=1 fixes the random seed so the split is reproducible; any integer works

x_train.head()

## Let's check the split of the data

In [None]:
print("{0:0.2f}% data is in training set".format((len(x_train)/len(df.index)) * 100))
print("{0:0.2f}% data is in test set".format((len(x_test)/len(df.index)) * 100))

## Now let's check the diabetes True/False ratio in the split data

In [None]:
# Report the class balance in the full data set and in each split
for name, y in [("Original", df['Outcome']), ("Training", y_train), ("Test", y_test)]:
    for label, value in [("True", 1), ("False", 0)]:
        count = (y == value).sum()
        title = "{0} Diabetes {1} Values".format(name, label).ljust(33)
        print("{0}: {1} ({2:0.2f}%)".format(title, count, (count / len(y)) * 100))
    print("")
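
A plain random split only keeps the class ratio approximately. If the ratios printed above drift further apart than you would like, `train_test_split` accepts a `stratify` argument that preserves the ratio in both parts. A minimal sketch on synthetic labels (the arrays `X_toy` and `y_toy` are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 35% positive, roughly the class balance of this data set
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 35 + [0] * 65)

# stratify=y_toy keeps the share of positives (almost) identical in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=1, stratify=y_toy
)

print("Train positives: {0:0.2f}%".format(y_tr.mean() * 100))
print("Test positives : {0:0.2f}%".format(y_te.mean() * 100))
```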

## Data Preparation

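* The earlier check found no null values, but many entries are 0. For measurements such as glucose, blood pressure, skin thickness, insulin and BMI, a 0 is not a plausible reading and most likely stands for a missing value, so we need to take care of those as well.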

In [None]:
x_train.head()

#### We can see many entries with a value of 0 above.
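
`head()` only shows the first five rows, so it can miss columns where zeros are rarer. Counting the zeros per column gives the full picture. A small sketch on a made-up frame (`toy` and its values are illustrative; on the notebook's data the same expression would be applied to `x_train`):

```python
import pandas as pd

# Made-up rows with the same kind of columns as the real data
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 0],
    "Age": [50, 31, 32, 21],
})

# (toy == 0) is a True/False table; summing it counts the zeros in each column
zero_counts = (toy == 0).sum()
print(zero_counts)
```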

## Replace 0s with the column mean

In [None]:
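from sklearn.impute import SimpleImputer

# Treat every 0 as a missing value and replace it with the mean of its column
rep_0 = SimpleImputer(missing_values=0, strategy="mean")
cols = x_train.columns
# The imputer is fit separately on each split, so each split is filled with its own column means
x_train = pd.DataFrame(rep_0.fit_transform(x_train), columns=cols)
x_test = pd.DataFrame(rep_0.fit_transform(x_test), columns=cols)

x_train.head()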
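
One thing to be aware of in the cell above: the imputer is fit a second time on the test set, so the test rows are filled with test-set means. A common alternative is to learn the means from the training data only and reuse them for the test data. The sketch below shows that variant on tiny made-up arrays (`train` and `test` are hypothetical); it illustrates the idea and is not the code that produced the results in this notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up feature arrays; 0 stands for a missing reading
train = np.array([[100.0, 30.0],
                  [0.0, 20.0],
                  [140.0, 0.0]])
test = np.array([[0.0, 0.0],
                 [90.0, 40.0]])

imputer = SimpleImputer(missing_values=0, strategy="mean")

# Learn the column means from the training rows only
train_filled = imputer.fit_transform(train)

# Reuse those training means for the test rows (transform, not fit_transform)
test_filled = imputer.transform(test)

print(imputer.statistics_)
print(test_filled)
```

A count such as Pregnancies can legitimately be 0, so it may also be worth applying the imputer only to the measurement columns where 0 cannot be a real reading.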

## Logistic Regression Model

In [None]:
from sklearn import metrics

from sklearn.linear_model import LogisticRegression

# Fit the model on the training set
model = LogisticRegression(solver="liblinear")
model.fit(x_train, y_train)
# Predict on the test set
y_predict = model.predict(x_test)

# Show the fitted coefficient for each feature, plus the intercept
coef_df = pd.DataFrame(model.coef_, columns=x_train.columns)
coef_df['intercept'] = model.intercept_
print(coef_df)
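
The printed coefficients and intercept fully describe the fitted model: the predicted probability of diabetes is the logistic (sigmoid) function of the intercept plus the weighted sum of the features. A minimal sketch with made-up numbers (the weights, intercept and patient values below are hypothetical, not the fitted ones):

```python
import math

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, coefs, intercept):
    """Probability of the positive class for one row of features."""
    z = intercept + sum(c * x for c, x in zip(coefs, features))
    return sigmoid(z)

# Hypothetical two-feature model and patient, for illustration only
coefs = [0.03, 0.08]
intercept = -5.0
patient = [120.0, 30.0]

p = predict_proba(patient, coefs, intercept)
print("Predicted probability: {0:0.3f}".format(p))
print("Predicted class:", int(p >= 0.5))
```

With the real model, `model.predict_proba(x_test)` returns these probabilities, and `model.predict` assigns class 1 when the probability exceeds 0.5.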

### Model Score (accuracy on the test set)

In [None]:
model_score = model.score(x_test, y_test)
print(model_score)

## The Confusion Matrix

In [None]:
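# labels=[1, 0] puts the positive class first: rows are actual, columns are predicted
cm = metrics.confusion_matrix(y_test, y_predict, labels=[1, 0])

df_cm = pd.DataFrame(cm, index=["Actual 1", "Actual 0"],
                     columns=["Predict 1", "Predict 0"])
plt.figure(figsize=(7, 5))
sns.heatmap(df_cm, annot=True, fmt="d")  # fmt="d" prints the counts as whole numbers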


## Evaluation

* True Positives (TP): people we correctly predicted to have diabetes = 48

* True Negatives (TN): people we correctly predicted not to have diabetes = 132

* False Positives (FP): people we predicted to have diabetes who do not have it (a "Type I error") = 14

* False Negatives (FN): people we predicted not to have diabetes who do have it (a "Type II error") = 37
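
The four counts above can be turned into the usual summary metrics. The sketch below uses only the numbers listed in this evaluation and the standard definitions; the variable names are chosen here for illustration.

```python
# Counts reported in the evaluation above
tp, tn, fp, fn = 48, 132, 14, 37

total = tp + tn + fp + fn
accuracy = (tp + tn) / total      # share of all predictions that are correct
precision = tp / (tp + fp)        # of those predicted diabetic, the share that are
recall = tp / (tp + fn)           # of those actually diabetic, the share found
specificity = tn / (tn + fp)      # of those actually non-diabetic, the share found
f1 = 2 * precision * recall / (precision + recall)

for name, value in [("Accuracy", accuracy), ("Precision", precision),
                    ("Recall", recall), ("Specificity", specificity), ("F1", f1)]:
    print("{0:<12}: {1:0.3f}".format(name, value))
```

Recall comes out well below accuracy here, because the 37 false negatives are a large share of the people who actually have diabetes. For a screening task that is usually the number to watch.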


#### I am a beginner on Kaggle, so any comment or tip is helpful. If you like my kernel, please give it an upvote.