## Logistic Regression Modelling for Early-Stage Diabetes Risk Prediction

### Logistic regression
Logistic regression uses an equation as the representation, very much like linear regression.

Input values (x) are combined linearly using weights or coefficient values (referred to as w) to predict an output value (y). A key difference from linear regression is that the output being modeled is a binary value (0 or 1) rather than a continuous numeric value: the linear combination is passed through the logistic (sigmoid) function, which yields the probability of class 1.<br>

##### $\hat{y}(w, x) = \dfrac{1}{1 + e^{-(w_0 + w_1 x_1 + \dots + w_p x_p)}}$
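As a rough illustration of the formula above, the following sketch evaluates $\hat{y}$ with NumPy. The weights and inputs are hypothetical values picked only to show the computation, and the 0.5 decision threshold is a common convention rather than something the formula itself prescribes.

```python
import numpy as np


def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def predict_proba(X, w0, w):
    """Return y_hat = sigmoid(w0 + w_1 * x_1 + ... + w_p * x_p) for each row of X."""
    return sigmoid(w0 + X @ w)


# Hypothetical intercept, weights and inputs (not learned from any data).
w0 = -1.0
w = np.array([2.0, -0.5])
X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [1.0, 2.0]])

y_hat = predict_proba(X, w0, w)       # probabilities of class 1
labels = (y_hat >= 0.5).astype(int)   # assumed threshold of 0.5
print(y_hat)
print(labels)
```

Each row of `X` is one sample; the intercept `w0` plays the role of $w_0$ and `w` holds $w_1, \dots, w_p$.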

#### Dataset
The dataset is available at http://archive.ics.uci.edu/ml/machine-learning-databases/00529/diabetes_data_upload.csv. The dataset was released in July 2020.<br>

#### Features (X)

1. Age: 20-65
2. Sex: 1. Male, 2. Female
3. Polyuria: 1. Yes, 2. No
4. Polydipsia: 1. Yes, 2. No
5. sudden weight loss: 1. Yes, 2. No
6. weakness: 1. Yes, 2. No
7. Polyphagia: 1. Yes, 2. No
8. Genital thrush: 1. Yes, 2. No
9. visual blurring: 1. Yes, 2. No
10. Itching: 1. Yes, 2. No
11. Irritability: 1. Yes, 2. No
12. delayed healing: 1. Yes, 2. No
13. partial paresis: 1. Yes, 2. No
14. muscle stiffness: 1. Yes, 2. No
15. Alopecia: 1. Yes, 2. No
16. Obesity: 1. Yes, 2. No

#### Output target (Y) 
17. Class: 1. Positive, 2. Negative
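Since almost every feature and the target take one of two text values, one simple option is to map each value to 0 or 1 by hand. The sketch below does this on a tiny hand-made sample; the rows are invented for illustration, the column names follow the list above (the actual CSV headers may differ), and the choice of which value becomes 1 is an assumption.

```python
import pandas as pd

# Tiny hand-made sample for illustration only; these rows are NOT from the real dataset.
sample = pd.DataFrame({
    "Age": [40, 58, 41],
    "Sex": ["Male", "Male", "Female"],
    "Polyuria": ["No", "Yes", "Yes"],
    "Obesity": ["Yes", "No", "No"],
    "Class": ["Positive", "Positive", "Negative"],
})

# Explicit dictionaries keep the encoding easy to audit (assumed convention: Yes/Positive -> 1).
yes_no = {"Yes": 1, "No": 0}
encoded = sample.copy()
for col in ["Polyuria", "Obesity"]:
    encoded[col] = encoded[col].map(yes_no)
encoded["Sex"] = encoded["Sex"].map({"Male": 1, "Female": 0})
encoded["Class"] = encoded["Class"].map({"Positive": 1, "Negative": 0})

print(encoded)
```

Any value missing from a mapping becomes `NaN` after `map`, so checking `encoded.isna().sum()` is a cheap way to catch unexpected spellings.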

#### Objective
To learn logistic regression and practice handling both numerical and categorical features.

#### Tasks
- Download and load the data, then print the first 5 rows and the last 5 rows
- Transform categorical features into numerical features. Use one-hot encoding, label encoding, or any other suitable preprocessing technique.
- Because the age feature spans a much larger range than the other features, rescale the age column to a smaller range (such as 0 to 1) using scaling, standardizing, or any other suitable preprocessing technique. Use the built-in scikit-learn functions from sklearn.preprocessing.
- Define the X matrix (independent features) and the y vector (target feature)
- Split the dataset into 60% for training and the remaining 40% for testing. You can use the built-in train_test_split function in sklearn for this task.
- Train a logistic regression model (sklearn.linear_model.LogisticRegression class) on the training set
- Run the model on the testing set
- Print the accuracy obtained on the testing set (sklearn.metrics.accuracy_score function)
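The steps above could be wired together roughly as follows. This is only a sketch under stated assumptions: it runs on a small synthetic stand-in table rather than the real CSV, the labelling rule is invented so that the toy problem is learnable, and `MinMaxScaler`, `OneHotEncoder(drop="if_binary")` and `random_state=42` are one possible set of choices among the options the tasks allow.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Synthetic stand-in data (NOT the real dataset), with the same kind of columns.
rng = np.random.default_rng(0)
n = 200
polyuria = rng.choice(["Yes", "No"], size=n)
polydipsia = rng.choice(["Yes", "No"], size=n)
toy = pd.DataFrame({
    "Age": rng.integers(20, 66, size=n),
    "Polyuria": polyuria,
    "Polydipsia": polydipsia,
})
# Hypothetical labelling rule, used only to make the toy problem learnable.
toy["Class"] = np.where((polyuria == "Yes") | (polydipsia == "Yes"),
                        "Positive", "Negative")

print(toy.head())  # first 5 rows
print(toy.tail())  # last 5 rows

# X matrix (independent features) and y vector (target encoded as 1/0).
X = toy.drop(columns="Class")
y = (toy["Class"] == "Positive").astype(int)

# 60% training, 40% testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42
)

# Scale Age to [0, 1] and one-hot encode the Yes/No columns (one column each).
preprocess = ColumnTransformer([
    ("age", MinMaxScaler(), ["Age"]),
    ("cat", OneHotEncoder(drop="if_binary"), ["Polyuria", "Polydipsia"]),
])
X_train_t = preprocess.fit_transform(X_train)  # fit on training data only
X_test_t = preprocess.transform(X_test)

model = LogisticRegression()
model.fit(X_train_t, y_train)
y_pred = model.predict(X_test_t)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy on the testing set: {accuracy}")
```

Fitting the transformer on the training split only is a design choice that keeps information from the test rows out of the scaling step; transforming the whole table before splitting, as the task order suggests, is also a common approach for an exercise like this.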


#### Further fun (will not be evaluated)
- Implement logistic regression from scratch for this dataset
- Plot loss curve (Loss vs number of iterations)
- Test whether label encoding or one-hot encoding of the categorical features gives better results
- Run the model with different feature scaling methods (scaling, normalization, standardization, etc., using sklearn)
- Train the model with different train/test split sizes such as 60-40, 50-50, 70-30, 80-20, 90-10, 95-5, etc.
- Shuffle the training samples with different random seed values in the train_test_split function and check the model error on the testing data for each setup
- Print other classification metrics such as the classification report (sklearn.metrics.classification_report), the confusion matrix (sklearn.metrics.confusion_matrix), and precision, recall and F1 scores (sklearn.metrics.precision_recall_fscore_support)
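For the first two items, one possible starting point is plain batch gradient descent on the mean log loss, recording the loss at every iteration so it can be plotted afterwards. The sketch below is an assumption-laden illustration, not a reference solution: the function name `fit_logistic_gd`, the learning rate, the iteration count and the synthetic binary data are all made up for the example.

```python
import numpy as np


def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic_gd(X, y, lr=0.5, n_iters=500):
    """Batch gradient descent on the mean log loss; returns (w0, w, losses)."""
    n_samples, n_features = X.shape
    w0, w = 0.0, np.zeros(n_features)
    losses = []
    eps = 1e-12  # guards against log(0)
    for _ in range(n_iters):
        p = sigmoid(w0 + X @ w)
        losses.append(-np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))
        error = p - y                          # gradient of the loss w.r.t. the linear output
        w0 -= lr * error.mean()                # intercept update
        w -= lr * (X.T @ error) / n_samples    # weight update
    return w0, w, losses


# Synthetic binary features and an invented labelling rule (NOT the real dataset).
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(200, 3)).astype(float)
y = (X[:, 0] + X[:, 1] >= 1).astype(float)

w0, w, losses = fit_logistic_gd(X, y)
y_pred = (sigmoid(w0 + X @ w) >= 0.5).astype(float)
print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
print(f"training accuracy: {(y_pred == y).mean():.3f}")
```

The `losses` list holds one value per iteration, so passing it to a plotting call such as matplotlib's `plt.plot(losses)` would give the loss curve.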

#### Helpful links
- Scikit-learn documentation for logistic regression: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
- How Logistic Regression works: https://machinelearningmastery.com/logistic-regression-for-machine-learning/
- Feature Scaling: https://scikit-learn.org/stable/modules/preprocessing.html
- Training testing splitting: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
- Classification metrics in sklearn: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics
- Use Slack for questions: https://join.slack.com/t/deepconnectai/shared_invite/zt-givlfnf6-~cn3SQ43k0BGDrG9_YOn4g

In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# IF sklearn.compose.ColumnTransformer is used for feature transformation, then below import will help to infer features
# from helper.utils import get_column_names_from_ColumnTransformer

In [None]:
# Download the dataset from the source
!wget URL

In [None]:
# NOTE: DO NOT CHANGE THE VARIABLE NAME(S) IN THIS CELL
# Load the data
data = 

In [None]:
# You may need the original list of columns to interpret the features after transformation
orig_cols = data.columns

In [None]:
# Handle categorical values

In [None]:
# Normalize the age feature

In [None]:
# Define your X and y
X = 
y = 

In [None]:
# Split the dataset into training and testing here
X_train, X_test, y_train, y_test = 

In [None]:
# Initialize the model
model = 

In [None]:
# Fit the model. Wait! We will complete this step for you ;)
model.fit(X_train, y_train)

In [None]:
# Predict on testing set X_test

y_pred = 

In [None]:
# Print Accuracy on testing set
accuracy = 

print(f"\nAccuracy in testing set: {accuracy}")