## Logistic Regression Modeling for Early Stage Diabetes Risk Prediction

## Part 2.1: Getting familiar with linear algebraic functions

#### Tasks
- Create a 10×10 matrix of random integers
- Compute the following linear algebraic operations on the matrix using built-in functions from NumPy, SciPy, etc.
  - Find the inverse of the matrix and print it
  - Calculate the dot product of the matrix with its transpose, A·Aᵀ
  - Decompose the original matrix using eigendecomposition and print the eigenvalues and eigenvectors
  - Calculate the Jacobian matrix
  - Calculate the Hessian matrix
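
The Jacobian and Hessian are defined for functions rather than for a matrix on its own, so the last two tasks need a function built from the matrix. The sketch below is one possible reading, not the intended solution: it assumes the linear map f(x) = A·x (whose Jacobian is A) and the quadratic form g(x) = xᵀ·A·x (whose Hessian is A + Aᵀ), and checks both with central finite differences. The helper names `numerical_jacobian` and `numerical_hessian` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.integers(0, 101, size=(10, 10)).astype(float)

def f(x):
    """Linear map f(x) = A x; its Jacobian is A."""
    return A @ x

def g(x):
    """Quadratic form g(x) = x^T A x; its Hessian is A + A^T."""
    return x @ A @ x

def numerical_jacobian(func, x, eps=1e-3):
    """Central-difference Jacobian: entry [j, i] approximates d func_j / d x_i."""
    x = np.asarray(x, dtype=float)
    columns = []
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        columns.append((func(x + step) - func(x - step)) / (2 * eps))
    return np.stack(columns, axis=-1)

def numerical_hessian(func, x, eps=1e-3):
    """Hessian of a scalar function, taken as the Jacobian of its gradient."""
    gradient = lambda v: numerical_jacobian(func, v, eps)
    return numerical_jacobian(gradient, x, eps)

x0 = rng.random(10)
J = numerical_jacobian(f, x0)   # shape (10, 10)
H = numerical_hessian(g, x0)    # shape (10, 10)
print(np.allclose(J, A, atol=1e-4), np.allclose(H, A + A.T, atol=1e-3))
```

Because f is linear and g is quadratic, central differences are exact up to rounding, which makes these two functions convenient for checking the helpers.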

In [14]:
import random
import numpy as np
A = [[random.randint(0,100) for x in range(10)] for y in range(10)]
A=np.array(A)
A

array([[ 87,  65,  75,   3,  50,  71,  19,  27,  86,  95],
       [ 21,  64,   7,  82,  44,   9,  70,  46,   5,  58],
       [ 54,  11,  35,  43,   7,  46,   8,  21,  86,  39],
       [ 25, 100,  70,  84,   2,  38,  45,  32,  29,  13],
       [ 11,  12,  10,  53,  90,   1,  17,  33,  88,  77],
       [ 44,  92,  61,  47,  67,  51,  89,  78,  15,  14],
       [ 53, 100,  23,   6,  16,  68,  47,  25,  31,  80],
       [ 27,  28,  30,  73,  68,  75,  79,  64,  52,  51],
       [ 62,  79,  76,  45,  90,  30,   0,  60,  43,  47],
       [ 24,  58,  57,  37,   8,  90,   1,  63,  32,  54]])

In [15]:
AT=A.T
AT

array([[ 87,  21,  54,  25,  11,  44,  53,  27,  62,  24],
       [ 65,  64,  11, 100,  12,  92, 100,  28,  79,  58],
       [ 75,   7,  35,  70,  10,  61,  23,  30,  76,  57],
       [  3,  82,  43,  84,  53,  47,   6,  73,  45,  37],
       [ 50,  44,   7,   2,  90,  67,  16,  68,  90,   8],
       [ 71,   9,  46,  38,   1,  51,  68,  75,  30,  90],
       [ 19,  70,   8,  45,  17,  89,  47,  79,   0,   1],
       [ 27,  46,  21,  32,  33,  78,  25,  64,  60,  63],
       [ 86,   5,  86,  29,  88,  15,  31,  52,  43,  32],
       [ 95,  58,  39,  13,  77,  14,  80,  51,  47,  54]])

In [16]:
np.dot(A,AT)

array([[42480, 18109, 23603, 22423, 23314, 27912, 30316, 27909, 32777,
        26636],
       [18109, 23732, 10549, 20254, 16998, 25205, 18717, 23914, 20511,
        15071],
       [23603, 10549, 17698, 14307, 15431, 14545, 14952, 18318, 17613,
        15905],
       [22423, 20254, 14307, 28088, 12219, 27708, 20909, 22467, 23648,
        20625],
       [23314, 16998, 15431, 12219, 26326, 17255, 14351, 22955, 22288,
        13371],
       [27912, 25205, 14545, 27708, 17255, 37846, 25475, 30923, 30290,
        22973],
       [30316, 18717, 14952, 20909, 14351, 25475, 28449, 22552, 23277,
        21787],
       [27909, 23914, 18318, 22467, 22955, 30923, 22552, 33633, 26294,
        22506],
       [32777, 20511, 17613, 23648, 22288, 30290, 23277, 26294, 34544,
        23181],
       [26636, 15071, 15905, 20625, 13371, 22973, 21787, 22506, 23181,
        24632]])

In [None]:
from numpy import linalg as LA
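
The remaining tasks (inverse and eigendecomposition) can be sketched with the `LA` alias imported above. This is a minimal, self-contained example that builds its own random matrix instead of reusing `A`, so the printed values will differ from the outputs shown earlier. Note that `LA.inv` raises `LinAlgError` for a singular matrix, and that a non-symmetric matrix generally has complex eigenvalues.

```python
import numpy as np
from numpy import linalg as LA

rng = np.random.default_rng(0)
A = rng.integers(0, 101, size=(10, 10)).astype(float)

# Inverse (raises numpy.linalg.LinAlgError if A is singular)
A_inv = LA.inv(A)
print(A_inv)

# Eigendecomposition: column k of `eigenvectors` belongs to eigenvalues[k]
eigenvalues, eigenvectors = LA.eig(A)
print(eigenvalues)
print(eigenvectors)

# Sanity checks: A @ A_inv is the identity, and A v = lambda v for every pair
print(np.allclose(A @ A_inv, np.eye(10), atol=1e-6))
print(np.allclose(A @ eigenvectors, eigenvectors * eigenvalues))
```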

## Part 2.2: Logistic Regression using Newton's method

### Logistic regression
Logistic regression uses an equation as the representation, very much like linear regression.

Input values (x) are combined linearly using weights or coefficient values (referred to as W) to predict an output value (y). A key difference from linear regression is that the output value being modeled is a binary value (0 or 1) rather than a continuous value.<br>

###  $\hat{y}(w, x) = \frac{1}{1+e^{-(w_0 + w_1 x_1 + \dots + w_p x_p)}}$
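
As a small illustration of the formula, the sketch below evaluates $\hat{y}$ for a single sample. The function name `predict_probability` and the weight and feature values are made up for the example.

```python
import numpy as np

def predict_probability(w, x):
    """Evaluate y_hat = 1 / (1 + e^{-(w_0 + w_1*x_1 + ... + w_p*x_p)})."""
    z = w[0] + np.dot(w[1:], x)      # w[0] is the intercept w_0
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.0, 1.0, -1.0])       # hypothetical weights (w_0, w_1, w_2)
x = np.array([2.0, 2.0])             # hypothetical features (x_1, x_2)

# Here z = 0 + 1*2 - 1*2 = 0, so the predicted probability is exactly 0.5
print(predict_probability(w, x))
```

A positive z pushes the output towards 1 and a negative z towards 0, which is what makes the output usable as a class probability.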

#### Dataset
The dataset is available at <strong>"data/diabetes_data.csv"</strong> in the respective challenge's repo.<br>
<strong>Original Source:</strong> http://archive.ics.uci.edu/ml/machine-learning-databases/00529/diabetes_data_upload.csv. The dataset was released in July 2020.<br><br>

#### Features (X)

1. Age                - Values ranging from 16-90
2. Gender             - Binary value (Male/Female)
3. Polyuria           - Binary value (Yes/No)
4. Polydipsia         - Binary value (Yes/No)
5. sudden weight loss - Binary value (Yes/No)
6. weakness           - Binary value (Yes/No)
7. Polyphagia         - Binary value (Yes/No)
8. Genital thrush     - Binary value (Yes/No)
9. visual blurring    - Binary value (Yes/No)
10. Itching           - Binary value (Yes/No)
11. Irritability      - Binary value (Yes/No)
12. delayed healing   - Binary value (Yes/No)
13. partial paresis   - Binary value (Yes/No)
14. muscle stiffness  - Binary value (Yes/No)
15. Alopecia          - Binary value (Yes/No)
16. Obesity           - Binary value (Yes/No)

#### Output/Target (Y)
17. class - Binary class (Positive/Negative)

#### Objective
To learn logistic regression and practice handling of both numerical and categorical features

#### Tasks
- Download, load the data and print first 5 and last 5 rows
- Transform categorical features into numerical features. Use label encoding or any other suitable preprocessing technique
- Since the age feature is in larger range, age column can be normalized into smaller scale (like 0 to 1) using different methods such as scaling, standardizing or any other suitable preprocessing technique (Example - sklearn.preprocessing.MinMaxScaler class)
- Define X matrix (independent features) and y vector (target feature)
- Split the dataset into 60% for training and the remaining 40% for testing (sklearn.model_selection.train_test_split function)
- Train Logistic Regression Model on the training set (sklearn.linear_model.LogisticRegression class)
- Use the trained model to predict on testing set
- Print 'Accuracy' obtained on the testing dataset i.e. (sklearn.metrics.accuracy_score function)
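
The two preprocessing tasks can be sketched on a toy frame. The rows below are invented for illustration and are not taken from the dataset. `LabelEncoder` assigns integer codes in alphabetical order of the labels, so `Female` becomes 0 and `Male` becomes 1. Any consistent 0/1 encoding works for a binary feature.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Toy rows (hypothetical values) with the same kinds of columns as the dataset
toy = pd.DataFrame({
    "Age": [16, 53, 90],
    "Gender": ["Male", "Female", "Male"],
    "Polyuria": ["No", "Yes", "Yes"],
    "class": ["Negative", "Positive", "Positive"],
})

# Label-encode each categorical column (codes follow alphabetical label order)
for column in ["Gender", "Polyuria", "class"]:
    toy[column] = LabelEncoder().fit_transform(toy[column])

# Rescale Age to [0, 1]: (age - min) / (max - min)
toy["Age"] = MinMaxScaler().fit_transform(toy[["Age"]]).ravel()

print(toy)
```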

#### Further fun (will not be evaluated)
- Plot loss curve (Loss vs number of iterations)
- Preprocess data with different feature scaling methods (i.e. scaling, normalization, standardization, etc) and observe accuracies on both X_train and X_test
- Training model on different train-test splits such as 60-40, 50-50, 70-30, 80-20, 90-10, 95-5 etc. and observe accuracies on both X_train and X_test
- Shuffling of training samples with different *random seed values* in the train_test_split function. Check the model error for the testing data for each setup.
- Print other classification metrics such as:
    - classification report (sklearn.metrics.classification_report),
    - confusion matrix (sklearn.metrics.confusion_matrix),
    - precision, recall and f1 scores (sklearn.metrics.precision_recall_fscore_support)
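
The metrics listed above can be tried on a handful of hand-written labels before applying them to the real predictions. The label vectors here are invented for the example.

```python
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    precision_recall_fscore_support,
)

# Hypothetical true labels and predictions
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
y_hat = [1, 1, 0, 0, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes (label order 0, 1)
cm = confusion_matrix(y_true, y_hat)
print(cm)

# Precision, recall and F1 for the positive class (label 1)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_hat, average="binary"
)
print(precision, recall, f1)

# Per-class summary of the same quantities
print(classification_report(y_true, y_hat))
```

In this toy case there are 3 true negatives, 1 false positive, 1 false negative and 3 true positives, so precision, recall and F1 all equal 0.75.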

#### Helpful links
- Scikit-learn documentation for logistic regression: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
- How Logistic Regression works: https://machinelearningmastery.com/logistic-regression-for-machine-learning/
- Feature Scaling: https://scikit-learn.org/stable/modules/preprocessing.html
- Training testing splitting: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
- Classification metrics in sklearn: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics
- Use slack for doubts: https://join.slack.com/t/deepconnectai/shared_invite/zt-givlfnf6-~cn3SQ43k0BGDrG9_YOn4g

In [18]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


In [20]:
# Download the dataset from the source
!wget "http://archive.ics.uci.edu/ml/machine-learning-databases/00529/diabetes_data_upload.csv"

--2020-08-30 10:38:28--  http://archive.ics.uci.edu/ml/machine-learning-databases/00529/diabetes_data_upload.csv
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 34682 (34K) [application/x-httpd-php]
Saving to: ‘diabetes_data_upload.csv’


2020-08-30 10:38:28 (498 KB/s) - ‘diabetes_data_upload.csv’ saved [34682/34682]

FINISHED --2020-08-30 10:38:28--
Total wall clock time: 0.3s
Downloaded: 1 files, 34K in 0.07s (498 KB/s)


In [21]:
# NOTE: DO NOT CHANGE THE VARIABLE NAME(S) IN THIS CELL
# Load the data
data = pd.read_csv("diabetes_data_upload.csv")

In [22]:
data.head()

Unnamed: 0,Age,Gender,Polyuria,Polydipsia,sudden weight loss,weakness,Polyphagia,Genital thrush,visual blurring,Itching,Irritability,delayed healing,partial paresis,muscle stiffness,Alopecia,Obesity,class
0,40,Male,No,Yes,No,Yes,No,No,No,Yes,No,Yes,No,Yes,Yes,Yes,Positive
1,58,Male,No,No,No,Yes,No,No,Yes,No,No,No,Yes,No,Yes,No,Positive
2,41,Male,Yes,No,No,Yes,Yes,No,No,Yes,No,Yes,No,Yes,Yes,No,Positive
3,45,Male,No,No,Yes,Yes,Yes,Yes,No,Yes,No,Yes,No,No,No,No,Positive
4,60,Male,Yes,Yes,Yes,Yes,Yes,No,Yes,Yes,Yes,Yes,Yes,Yes,Yes,Yes,Positive


In [23]:
data.tail()

Unnamed: 0,Age,Gender,Polyuria,Polydipsia,sudden weight loss,weakness,Polyphagia,Genital thrush,visual blurring,Itching,Irritability,delayed healing,partial paresis,muscle stiffness,Alopecia,Obesity,class
515,39,Female,Yes,Yes,Yes,No,Yes,No,No,Yes,No,Yes,Yes,No,No,No,Positive
516,48,Female,Yes,Yes,Yes,Yes,Yes,No,No,Yes,Yes,Yes,Yes,No,No,No,Positive
517,58,Female,Yes,Yes,Yes,Yes,Yes,No,Yes,No,No,No,Yes,Yes,No,Yes,Positive
518,32,Female,No,No,No,Yes,No,No,Yes,Yes,No,Yes,No,No,Yes,No,Negative
519,42,Male,No,No,No,No,No,No,No,No,No,No,No,No,No,No,Negative


In [None]:
# Handle categorical/binary columns

In [47]:
# List the categorical/binary columns (every column except the numeric 'Age')
data.columns.drop('Age')

Index(['Gender', 'Polyuria', 'Polydipsia', 'sudden weight loss', 'weakness',
       'Polyphagia', 'Genital thrush', 'visual blurring', 'Itching',
       'Irritability', 'delayed healing', 'partial paresis',
       'muscle stiffness', 'Alopecia', 'Obesity', 'class'],
      dtype='object')

In [57]:
col_name = data.columns
# Encode the categorical/binary columns as integers (not strings) so they can
# be used directly in numerical computations
mapping = {"Yes": 1, "No": 0, "Male": 0, "Female": 1, "Negative": 0, "Positive": 1}
binary_cols = col_name.drop('Age')
data[binary_cols] = data[binary_cols].replace(mapping).astype(int)

In [58]:
data

Unnamed: 0,Age,Gender,Polyuria,Polydipsia,sudden weight loss,weakness,Polyphagia,Genital thrush,visual blurring,Itching,Irritability,delayed healing,partial paresis,muscle stiffness,Alopecia,Obesity,class
0,40,0,0,1,0,1,0,0,0,1,0,1,0,1,1,1,1
1,58,0,0,0,0,1,0,0,1,0,0,0,1,0,1,0,1
2,41,0,1,0,0,1,1,0,0,1,0,1,0,1,1,0,1
3,45,0,0,0,1,1,1,1,0,1,0,1,0,0,0,0,1
4,60,0,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
515,39,1,1,1,1,0,1,0,0,1,0,1,1,0,0,0,1
516,48,1,1,1,1,1,1,0,0,1,1,1,1,0,0,0,1
517,58,1,1,1,1,1,1,0,1,0,0,0,1,1,0,1,1
518,32,1,0,0,0,1,0,0,1,1,0,1,0,0,1,0,0


In [None]:
# Normalize the age feature

In [62]:
from sklearn.preprocessing import MinMaxScaler
MMS=MinMaxScaler()
data['Age']=MMS.fit_transform(data[['Age']])

In [63]:
# Define your X and y
X = data.iloc[:,:-1].values
y = data.iloc[:,-1].values

In [64]:
# Split the dataset into training and testing here
# test_size=0.4 gives the 60% training / 40% testing split asked for in the tasks
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)

In [None]:
def predict(X, weights):
    '''Predict class for X.
    For the given dataset, predicted vector has only values 0/1
    Args:
        X : Numpy array (num_samples, num_features)
        weights : Model weights for logistic regression
    Returns:
        Binary predictions : (num_samples,)
    '''

    ### START CODE HERE ###
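    z = np.asarray(X, dtype=float) @ weights   # linear combination w·x for every sample
    logits = sigmoid(z)                        # predicted probabilities in (0, 1)
    y_pred = (logits >= 0.5).astype(int)       # threshold at 0.5 to get 0/1 labels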
    ### END CODE HERE ###
    
    return y_pred

In [None]:
def sigmoid(z):
        '''Sigmoid function: f:R->(0,1)
        Args:
            z : A numpy array (num_samples,)
        Returns:
            A numpy array where sigmoid function applied to every element
        '''
        ### START CODE HERE
        sig_z = 1.0 / (1.0 + np.exp(-z))
        ### END CODE HERE
        
        assert (z.shape==sig_z.shape), 'Error in sigmoid implementation. Check carefully'
        return sig_z

In [None]:
def cross_entropy_loss(y_true, y_pred):
    '''Calculate cross entropy loss
    Note: Cross entropy is defined for multiple classes/labels as well
    but for this dataset we only need binary cross entropy loss
    Args:
        y_true : Numpy array of true values (0/1) of size (num_samples,)
        y_pred : Numpy array of predicted values (probabilities) of size (num_samples,)
    Returns:
        Cross entropy loss: A scalar value
    '''
    # Fix 0 values in y_pred
    y_pred = np.maximum(np.full(y_pred.shape, 1e-7), np.minimum(np.full(y_pred.shape, 1-1e-7), y_pred))
    
    ### START CODE HERE
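    # Mean binary cross entropy over all samples
    ce_loss = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))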
    ### END CODE HERE
    
    return ce_loss
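
To get a feel for the scale of the loss, the sketch below evaluates the mean binary cross entropy, $J = -\frac{1}{n}\sum_i \left[y_i \log p_i + (1-y_i)\log(1-p_i)\right]$, on two hand-made prediction vectors. The function name `binary_cross_entropy` and the numbers are illustrative only.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross entropy; probabilities are clipped away from 0 and 1."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

y_true = np.array([1, 0, 1, 0])
confident = np.array([0.9, 0.1, 0.9, 0.1])   # confident and correct
unsure = np.array([0.5, 0.5, 0.5, 0.5])      # no information

loss_confident = binary_cross_entropy(y_true, confident)   # equals -log(0.9)
loss_unsure = binary_cross_entropy(y_true, unsure)         # equals log(2)
print(loss_confident, loss_unsure)
```

Confident correct predictions give a loss near 0, while predicting 0.5 everywhere gives log(2) ≈ 0.693, a useful baseline when reading the loss curve.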

In [None]:
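def newton_optimization(X, y, max_iterations=25):
    '''Implement Newton's method for optimizing weights
    Args:
        X : Numpy array (num_samples, num_features)
        y : Numpy array of true labels (0/1) of size (num_samples,)
        max_iterations : Max iterations to update the weights
    Returns:
        Optimal weights (num_features,) and the list of losses per iteration
    '''
    # Make sure features and labels are numeric before doing linear algebra
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    num_samples = X.shape[0]
    num_features = X.shape[1]
    # Initialize weights to zeros
    weights = np.zeros(num_features,)
    # Initialize losses
    losses = []
    
    # Newton Method
    for i in range(max_iterations):
        # Predict/Calculate probabilities using sigmoid function
        y_p = sigmoid(X @ weights)
        
        # Define gradient for J (cost function) i.e. cross entropy loss
        gradient = X.T @ (y_p - y) / num_samples
        
        # Define hessian matrix for cross entropy loss
        hessian = (X.T * (y_p * (1 - y_p))) @ X / num_samples
        
        # Update the model using hessian matrix and gradient computed
        # (the pseudo-inverse guards against a singular hessian)
        weights = weights - np.linalg.pinv(hessian) @ gradient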
        
        # Calculate cross entropy loss
        loss = cross_entropy_loss(y, y_p)
        # Append it
        losses.append(loss)

    return weights, losses
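
The gradient and Hessian named in the comments above follow from the mean cross entropy loss. Writing $p_i = \sigma(x_i^\top w)$ for the predicted probability of sample $i$ and assuming the loss is averaged over the $n$ training samples:

$$J(w) = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log p_i + (1-y_i)\log(1-p_i)\right]$$

Using $\sigma'(z) = \sigma(z)\,(1-\sigma(z))$, the gradient and Hessian are

$$\nabla J(w) = \frac{1}{n} X^\top (p - y), \qquad H(w) = \frac{1}{n} X^\top S X, \quad S = \mathrm{diag}\big(p_i(1-p_i)\big)$$

and one Newton step is

$$w \leftarrow w - H(w)^{-1}\,\nabla J(w)$$

The factor $\frac{1}{n}$ cancels in the update, so using the summed loss instead of the mean gives the same sequence of weights.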

In [None]:
# Train weights
weights, losses = newton_optimization(X_train, y_train)

In [None]:
# Plot the loss curve
import matplotlib.pyplot as plt

plt.plot([i+1 for i in range(len(losses))], losses)
plt.title("Loss curve")
plt.xlabel("Iteration num")
plt.ylabel("Cross entropy loss")
plt.show()

In [None]:
our_model_test_accuracy = accuracy_score(y_test, predict(X_test, weights))

print(f"\nAccuracy in testing set by our model: {our_model_test_accuracy}")

#### Compare with the scikit learn implementation

In [None]:
# Initialize the model
model = LogisticRegression(solver='newton-cg', verbose=1)

In [None]:
# Fit the model. Wait! We will complete this step for you ;)
model.fit(X_train, y_train)

In [None]:
# Predict on testing set X_test
y_pred = model.predict(X_test)

In [None]:
# Print Accuracy on testing set
sklearn_test_accuracy = accuracy_score(y_test, y_pred)

print(f"\nAccuracy in testing set by sklearn model: {sklearn_test_accuracy}")
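
When comparing the two accuracies, keep in mind that `LogisticRegression` applies L2 regularization (`C=1.0`) and fits an intercept by default, so its weights need not match an unregularized Newton solution without an intercept. The sketch below uses synthetic data (not the diabetes dataset) and assumes a very large `C` together with `fit_intercept=False` to make the two setups comparable.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, non-separable data drawn from a known logistic model
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
true_w = np.array([1.5, -2.0, 0.5])
prob = 1.0 / (1.0 + np.exp(-X_demo @ true_w))
y_demo = (rng.random(500) < prob).astype(int)

# Plain Newton's method: no intercept, no regularization
w = np.zeros(3)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-X_demo @ w))
    gradient = X_demo.T @ (p - y_demo)
    hessian = (X_demo.T * (p * (1 - p))) @ X_demo
    w = w - np.linalg.solve(hessian, gradient)

# scikit-learn with regularization made negligible and no intercept
sk_model = LogisticRegression(
    solver="newton-cg", C=1e8, fit_intercept=False, tol=1e-8, max_iter=1000
)
sk_model.fit(X_demo, y_demo)

print("Newton weights :", w)
print("sklearn weights:", sk_model.coef_.ravel())
```

With the default `C=1.0` the scikit-learn weights are shrunk towards zero, so small differences in test accuracy between the two models are expected.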