## Logistic Regression Modeling for Early Stage Diabetes Risk Prediction

### Logistic regression
Logistic regression, like linear regression, represents the model as an equation.

Input values (x) are combined linearly using weights or coefficients (w) to predict an output value (y). The key difference from linear regression is that the linear combination is passed through the sigmoid function, so the model outputs a probability between 0 and 1 that is thresholded to a binary class (0 or 1) rather than a continuous value.

$$\hat{y}(w, x) = \frac{1}{1 + e^{-(w_0 + w_1 x_1 + \dots + w_p x_p)}}$$
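
As a minimal sketch of the formula above, the snippet below evaluates the sigmoid of a weighted sum in plain Python. The weights, bias, and sample values are hypothetical numbers chosen for illustration, not coefficients learned from the diabetes data.

```python
import math

def sigmoid(z):
    """Map a real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, x):
    """Probability of class 1 for one sample x: sigmoid(w_0 + sum(w_i * x_i))."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)

# Hypothetical coefficients (w_1, w_2) and intercept w_0, for illustration only
weights = [0.8, -0.5]
bias = 0.1
sample = [1.0, 2.0]

p = predict_proba(weights, bias, sample)  # z = 0.1 + 0.8 - 1.0 = -0.1
label = int(p >= 0.5)                     # threshold the probability at 0.5
print(f"probability={p:.4f}, predicted class={label}")
```

A score of exactly zero maps to a probability of 0.5, so the sign of the linear combination decides the predicted class.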

#### Dataset
The dataset is available at <strong>"data/diabetes_data.csv"</strong> in the respective challenge's repo.<br>
<strong>Original Source:</strong> http://archive.ics.uci.edu/ml/machine-learning-databases/00529/diabetes_data_upload.csv. The dataset was released in July 2020.

#### Features (X)

1. Age                - Values ranging from 16-90
2. Gender             - Binary value (Male/Female)
3. Polyuria           - Binary value (Yes/No)
4. Polydipsia         - Binary value (Yes/No)
5. sudden weight loss - Binary value (Yes/No)
6. weakness           - Binary value (Yes/No)
7. Polyphagia         - Binary value (Yes/No)
8. Genital thrush     - Binary value (Yes/No)
9. visual blurring    - Binary value (Yes/No)
10. Itching           - Binary value (Yes/No)
11. Irritability      - Binary value (Yes/No)
12. delayed healing   - Binary value (Yes/No)
13. partial paresis   - Binary value (Yes/No)
14. muscle stiffness  - Binary value (Yes/No)
15. Alopecia          - Binary value (Yes/No)
16. Obesity           - Binary value (Yes/No)

#### Output/Target (Y)
17. class - Binary class (Positive/Negative)

#### Objective
To learn logistic regression and practice handling both numerical and categorical features.

#### Tasks
- Download and load the data, then print the first 5 and last 5 rows
- Transform categorical features into numerical features using label encoding or any other suitable preprocessing technique
- Because Age spans a much wider range than the binary features, normalize the Age column to a smaller scale (such as 0 to 1) by scaling, standardizing, or any other suitable preprocessing technique (for example, the sklearn.preprocessing.MinMaxScaler class)
- Define the X matrix (independent features) and the y vector (target feature)
- Split the dataset into 60% for training and the remaining 40% for testing (sklearn.model_selection.train_test_split function)
- Train a logistic regression model on the training set using the built-in class (sklearn.linear_model.LogisticRegression)
- Use the trained model to predict on the testing set
- Print the accuracy obtained on the testing set (sklearn.metrics.accuracy_score function)
- Print other classification metrics such as:
    - classification report (sklearn.metrics.classification_report),
    - confusion matrix (sklearn.metrics.confusion_matrix),
    - precision, recall and f1 scores (sklearn.metrics.precision_recall_fscore_support)

#### Further fun (will not be evaluated)
- Plot loss curve (Loss vs number of iterations)
- Preprocess the data with different feature scaling methods (e.g. scaling, normalization, standardization) and observe the accuracies on both X_train and X_test
- Train the model on different train-test splits such as 60-40, 50-50, 70-30, 80-20, 90-10, and 95-5, and observe the accuracies on both X_train and X_test
- Shuffle the training samples with different *random seed values* in the train_test_split function, and check the model error on the testing data for each setup
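
For the loss-curve item, one possible approach is a hand-rolled gradient descent that records the log loss at every iteration. This is a sketch on synthetic data, not scikit-learn's own solver: the helper names (`fit_logistic_gd`, `log_loss`), the learning rate, and the generated data are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y_true, p, eps=1e-12):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

def fit_logistic_gd(X, y, lr=0.5, n_iters=200):
    """Batch gradient descent on the log loss, recording the loss per iteration."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    losses = []
    for _ in range(n_iters):
        p = sigmoid(X @ w + b)
        losses.append(log_loss(y, p))
        error = p - y                       # gradient of the loss w.r.t. the score
        w -= lr * (X.T @ error) / n_samples
        b -= lr * error.mean()
    return w, b, losses

# Synthetic data (hypothetical, not the diabetes dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y_demo = (sigmoid(X_demo @ true_w) > rng.uniform(size=200)).astype(float)

w, b, losses = fit_logistic_gd(X_demo, y_demo)
print(f"first loss={losses[0]:.4f}, last loss={losses[-1]:.4f}")
```

Passing `losses` to `plt.plot` and labelling the axes as iteration and log loss gives the curve. The first recorded value equals ln 2, because zero weights predict a probability of 0.5 for every sample.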


#### Helpful links
- Scikit-learn documentation for logistic regression: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
- How Logistic Regression works: https://machinelearningmastery.com/logistic-regression-for-machine-learning/
- Feature Scaling: https://scikit-learn.org/stable/modules/preprocessing.html
- Training testing splitting: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
- Classification metrics in sklearn: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics
- Use slack for doubts: https://join.slack.com/t/deepconnectai/shared_invite/zt-givlfnf6-~cn3SQ43k0BGDrG9_YOn4g

In [209]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import seaborn as sns

In [210]:
df=pd.read_csv('Diabetes_data.csv')
df.head()

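| | Age | Gender | Polyuria | Polydipsia | sudden weight loss | weakness | Polyphagia | Genital thrush | visual blurring | Itching | Irritability | delayed healing | partial paresis | muscle stiffness | Alopecia | Obesity | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | Male | No | Yes | No | Yes | No | No | No | Yes | No | Yes | No | Yes | Yes | Yes | Positive |
| 1 | 58 | Male | No | No | No | Yes | No | No | Yes | No | No | No | Yes | No | Yes | No | Positive |
| 2 | 41 | Male | Yes | No | No | Yes | Yes | No | No | Yes | No | Yes | No | Yes | Yes | No | Positive |
| 3 | 45 | Male | No | No | Yes | Yes | Yes | Yes | No | Yes | No | Yes | No | No | No | No | Positive |
| 4 | 60 | Male | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Positive |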


In [211]:
df.shape

(520, 17)

In [212]:
df.dtypes

Age                    int64
Gender                object
Polyuria              object
Polydipsia            object
sudden weight loss    object
weakness              object
Polyphagia            object
Genital thrush        object
visual blurring       object
Itching               object
Irritability          object
delayed healing       object
partial paresis       object
muscle stiffness      object
Alopecia              object
Obesity               object
class                 object
dtype: object

In [213]:
df.Age.describe()

count    520.000000
mean      48.028846
std       12.151466
min       16.000000
25%       39.000000
50%       47.500000
75%       57.000000
max       90.000000
Name: Age, dtype: float64

In [214]:
df1=df.copy()

In [215]:
from sklearn.preprocessing import MinMaxScaler
min_max=MinMaxScaler()
df[['Age']]=min_max.fit_transform(df[['Age']])
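
To make the scaling step concrete, the sketch below applies the min-max formula, (value - min) / (max - min), by hand. The minimum of 16 and maximum of 90 come from the `df.Age.describe()` output above, and the five ages are the first five rows of the dataset; the helper name `min_max_scale` is an assumption for illustration.

```python
def min_max_scale(value, lo, hi):
    """Rescale value from the range [lo, hi] to [0, 1]."""
    return (value - lo) / (hi - lo)

age_min, age_max = 16, 90          # min and max reported by df.Age.describe()
first_ages = [40, 58, 41, 45, 60]  # Age values of the first five rows

scaled = [round(min_max_scale(a, age_min, age_max), 6) for a in first_ages]
print(scaled)
```

These values should agree with the scaled Age column printed by `X.head()` further down, for example 40 becomes 24/74, roughly 0.324324.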

In [216]:
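# Define X (all feature columns) and y (the target column)
# The columns= keyword replaces the positional axis argument, which newer pandas versions no longer accept
X = df.drop(columns='class')
y = df[['class']]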

In [217]:
X.shape

(520, 16)

In [218]:
y.shape

(520, 1)

In [219]:
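from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# Encode every categorical column as integers; Age is already numeric and scaled, so skip it
for col in X.columns:
    if col != 'Age':
        X[col] = le.fit_transform(X[col])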
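
`LabelEncoder` assigns integer codes in sorted order of the class labels, so "No" becomes 0 and "Yes" becomes 1. An alternative that makes the direction of each encoding explicit is a per-column dictionary with pandas `Series.map`. The sketch below uses a tiny hypothetical frame rather than the real dataset, and the `binary_maps` dictionary is an assumption for illustration.

```python
import pandas as pd

# Tiny hypothetical frame with the same kind of values as the dataset
demo = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Polyuria": ["No", "Yes", "Yes"],
    "class": ["Positive", "Negative", "Positive"],
})

# Explicit mappings; these match the sorted order LabelEncoder would use
binary_maps = {
    "Gender": {"Female": 0, "Male": 1},
    "Polyuria": {"No": 0, "Yes": 1},
    "class": {"Negative": 0, "Positive": 1},
}

encoded = demo.copy()
for column, mapping in binary_maps.items():
    # Values missing from the mapping become NaN, which exposes typos in the data
    encoded[column] = demo[column].map(mapping)

print(encoded)
```

The trade-off is a little more typing in exchange for an encoding that can be read directly from the code.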

In [220]:
X.head()

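| | Age | Gender | Polyuria | Polydipsia | sudden weight loss | weakness | Polyphagia | Genital thrush | visual blurring | Itching | Irritability | delayed healing | partial paresis | muscle stiffness | Alopecia | Obesity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.324324 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 1 |
| 1 | 0.567568 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 2 | 0.337838 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 |
| 3 | 0.391892 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 0.594595 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |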


# Splitting Data

In [221]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.4, random_state=100)
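
This notebook scales Age on the full dataset before splitting. A common variant is to split first and fit the scaler on the training rows only, so the test rows do not influence the scaling range. The sketch below shows that ordering on synthetic stand-in data; the frame, the variable names, and the random values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the real data (hypothetical values, not the diabetes CSV)
rng = np.random.default_rng(100)
demo_X = pd.DataFrame({
    "Age": rng.integers(16, 91, size=50).astype(float),
    "Polyuria": rng.integers(0, 2, size=50),
})
demo_y = pd.Series(rng.integers(0, 2, size=50), name="class")

# Split first, so the test rows play no part in choosing the scaling range
X_tr, X_te, y_tr, y_te = train_test_split(
    demo_X, demo_y, test_size=0.4, random_state=100
)
X_tr, X_te = X_tr.copy(), X_te.copy()

scaler = MinMaxScaler()
X_tr["Age"] = scaler.fit_transform(X_tr[["Age"]]).ravel()  # learn min/max on train only
X_te["Age"] = scaler.transform(X_te[["Age"]]).ravel()      # reuse the train min/max

print(X_tr["Age"].min(), X_tr["Age"].max())
```

With this ordering, scaled test ages can fall slightly outside the 0 to 1 range when a test row is younger or older than every training row.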

In [222]:
# Initialize the model from sklearn
model = LogisticRegression()

In [223]:
# Fit the model; y_train is a one-column DataFrame, so flatten it to the 1-D array scikit-learn expects
model.fit(X_train, y_train.values.ravel())

LogisticRegression()

In [224]:
# Predict on testing set X_test
y_pred = model.predict(X_test)
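
The prediction step ties back to the sigmoid formula at the top of this notebook: for a binary problem, the probability reported by `predict_proba` is the sigmoid of the linear score built from `coef_` and `intercept_`. The sketch below checks this on synthetic data from `make_classification`; the data and variable names are assumptions for illustration, not the diabetes model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (hypothetical, not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression().fit(X_demo, y_demo)

# Linear score w_0 + w_1 * x_1 + ... + w_p * x_p for every sample
z = X_demo @ clf.coef_.ravel() + clf.intercept_[0]
manual_proba = 1.0 / (1.0 + np.exp(-z))

# Probability of the positive class (second column) as reported by scikit-learn
sklearn_proba = clf.predict_proba(X_demo)[:, 1]

print(np.allclose(manual_proba, sklearn_proba))
```

`predict` then assigns the positive class wherever this probability exceeds 0.5, which is the same as the linear score being positive.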

In [225]:
#df2=pd.DataFrame(columns=[y_test,y_pred])
#df2

# Accuracy

In [226]:
# Print Accuracy on testing set
test_accuracy_sklearn = accuracy_score(y_test, y_pred)

print(f"\nAccuracy on testing set with scaling: {test_accuracy_sklearn}")


Accuracy on testing set with scaling: 0.9423076923076923


# Classification Report

In [227]:
from sklearn.metrics import classification_report
# print() renders the report's line breaks instead of showing the raw string
report = classification_report(y_test, y_pred)
print(report)

              precision    recall  f1-score   support

    Negative       0.91      0.93      0.92        74
    Positive       0.96      0.95      0.95       134

    accuracy                           0.94       208
   macro avg       0.94      0.94      0.94       208
weighted avg       0.94      0.94      0.94       208

# Confusion Matrix

In [228]:
from sklearn.metrics import confusion_matrix
cm=confusion_matrix(y_test,y_pred)
cm
#sns.heatmap(cm, annot=True)

array([[ 69,   5],
       [  7, 127]], dtype=int64)
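
The headline metrics can be recomputed from this matrix by hand. In scikit-learn's layout, rows are true classes and columns are predicted classes, with labels in sorted order (Negative, then Positive). The sketch below treats Positive as the class of interest; the variable names are illustrative.

```python
# Confusion matrix from the cell above: rows = true class, columns = predicted class
cm = [[69, 5],
      [7, 127]]

tn, fp = cm[0]  # true Negative rows: predicted Negative, predicted Positive
fn, tp = cm[1]  # true Positive rows: predicted Negative, predicted Positive

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of all predicted Positive, how many were right
recall = tp / (tp + fn)      # of all actual Positive, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

The accuracy works out to 196/208, matching the value printed by `accuracy_score` earlier, and the other three correspond to the Positive row of the classification report.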

# Other scores

In [229]:
from sklearn.metrics import precision_recall_fscore_support
score=precision_recall_fscore_support(y_test,y_pred)
score

(array([0.90789474, 0.96212121]),
 array([0.93243243, 0.94776119]),
 array([0.92      , 0.95488722]),
 array([ 74, 134], dtype=int64))

# Without Age Scaling

In [230]:
X1 = df1.drop(columns='class')
y1 = df1[['class']]

In [231]:
# Encode the categorical columns as before; Age keeps its original, unscaled values
for col in X1.columns:
    if col != 'Age':
        X1[col] = le.fit_transform(X1[col])

In [232]:
X_train, X_test, y_train, y_test = train_test_split(X1,y1,test_size=0.4, random_state=100)

In [233]:
# Refit the same model object on the unscaled data; flatten y_train to the 1-D array scikit-learn expects
model.fit(X_train, y_train.values.ravel())

LogisticRegression()

In [234]:
y_pred = model.predict(X_test)
test_accuracy_sklearn = accuracy_score(y_test, y_pred)

print(f"\nAccuracy on testing set without scaling: {test_accuracy_sklearn}")


Accuracy on testing set without scaling: 0.9326923076923077
