## Machine Learning Project Cycle
<br>
Let's look at how a data scientist or machine learning engineer would approach a project.
<br>
<img src="MLCycle.JPG">
<br>
<center><caption>**Machine Learning Project Cycle**</caption></center>
<br>
<br>
- Understand the problem statement and decide which category the problem falls into: supervised (classification, prediction) or unsupervised (clustering).
<br>
- Look for a dataset. If the customer does not provide one, we start by looking for data on open-source platforms, scrape data from related websites, and modify it for the task.
<br>
- Read the data in the specified programming language, which is most often Python, although sometimes the customer needs another language.
<br>
- Prepare the data: if it is in an unstructured format, convert it into a structured format, and **[normalize](https://www.geeksforgeeks.org/data-normalization-in-data-mining/)** (scale) the data.
<br>
- Split the data into training data and evaluation data so that we can check how well the model performs on blind data (data on which it has not been trained).
<br>
<img src = "DataSplit.JPG">
<br>
***The split percentages depend on how much data we have. With a dataset of 10 lakh (1 million) images, holding out 10% for validation and 20% for testing would mean 1 lakh validation images and 2 lakh test images; we certainly do not want to withhold 3 lakh images from training just for evaluation.***
<br>
- Based on the task at hand, whether supervised or unsupervised, we then choose the algorithm to apply to the data. For example, for prediction we will use *Linear Regression*, and for classification we will use *Logistic Regression*.
<br>
- For linear regression, we use the error (for example, mean squared error) and R-squared to see how well the model is performing; for classification, we look at accuracy, precision and recall.
<br>
- After hypothesis testing, if the model does not perform well, try to improve it with different algorithms and feature engineering (transforming raw data into more appropriate features that better represent the underlying problem).
<br>
- Once the model is ready, deploy it using Django or Flask so that it can be integrated with a website or mobile app.
<br>
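
The train/validation/test split described in the list above can be sketched as follows. This is a minimal illustration with NumPy only; the helper name `three_way_split` and the fractions used are assumptions made for the example, not part of the original workflow.

```python
import numpy as np

def three_way_split(n_samples, val_frac=0.1, test_frac=0.2, seed=0):
    """Return shuffled index arrays for the train, validation and test sets.

    Hypothetical helper: the name and default fractions are illustrative only.
    """
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)          # shuffle once, then slice
    n_val = int(round(n_samples * val_frac))
    n_test = int(round(n_samples * test_frac))
    val_idx = indices[:n_val]
    test_idx = indices[n_val:n_val + n_test]
    train_idx = indices[n_val + n_test:]
    return train_idx, val_idx, test_idx

# With 10 lakh (1,000,000) samples, small hold-out fractions still leave
# thousands of samples for evaluation while keeping most data for training.
train_idx, val_idx, test_idx = three_way_split(1_000_000, val_frac=0.01, test_frac=0.01)
print(len(train_idx), len(val_idx), len(test_idx))
```

The index arrays can then be used to select rows from the feature array and the label array in the same order.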
<br>
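
For the regression metrics mentioned in the list above, here is a minimal sketch with NumPy. It assumes that "error" means mean squared error (the list does not name a specific metric), and `regression_metrics` is a hypothetical helper name.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (mean squared error, R squared) for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)
    ss_res = np.sum(residuals ** 2)                   # variation left unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total variation in the target
    r2 = 1.0 - ss_res / ss_tot
    return mse, r2

mse, r2 = regression_metrics([3, 5, 7, 9], [2.5, 5.5, 7, 9.5])
print(mse, r2)
```

An R-squared close to 1 means the model explains most of the variation in the target; a value near 0 means it does little better than always predicting the mean.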
<font size=3>***Training time***</font>
<img src="TrainingCycle.JPG">
<br>
<center><caption><font size = 3>**Step by step process for training**</font></caption></center>
<br>
<br>
<font size = 3>***Production time***</font>
<img src="ProductionCycle.JPG">
<br>
<center><caption><font size=3>**Step by step process for production**</font></caption></center>

### Classification Algorithm
<br>
### Logistic Regression
<br>
***Logistic Regression*** (also called Logit Regression) is commonly used to estimate the probability that an instance belongs to a particular class (e.g., what is the probability that this email is spam?). If the estimated probability is greater than a chosen threshold (commonly 50%), then the model predicts that the instance belongs to that class (called the positive class, labeled “1”); otherwise it predicts that it does not (i.e., it belongs to the negative class, labeled “0”). This makes it a binary classifier.
<br>
<br>
***Binary classification***
<img src = "BinaryClassification.JPG">
<br>
<caption><center>**Explaining the selection of class for binary classifier**</center></caption>
<br>
***Multi-class classification***
<img src="MultiClassClassification.JPG">
<br>
<caption><center>**Explaining the selection of class for multi-class classifier**</center></caption>
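
The two selection rules can be sketched in a few lines. This is an illustration with made-up probabilities; the 0.5 threshold is an assumed default, and the multi-class rule shown assumes the class with the highest estimated probability is picked.

```python
import numpy as np

# Binary case: one probability per instance (probability of the positive class).
binary_probs = np.array([0.91, 0.35, 0.50, 0.08])
threshold = 0.5  # assumed default; it can be tuned for the problem
binary_pred = (binary_probs > threshold).astype(int)
print(binary_pred)  # [1 0 0 0]

# Multi-class case: one probability per class for each instance (rows sum to 1).
multi_probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
])
multi_pred = np.argmax(multi_probs, axis=1)  # index of the most probable class
print(multi_pred)  # [0 2 1]
```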

In [2]:
# Let's work with data now
# Start by importing the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

In [35]:
#Reading the data file
data = pd.read_csv("diabetes.csv")
data.head()

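| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |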


The data above is from the National Institute of Diabetes and Digestive and Kidney Diseases (https://www.kaggle.com/uciml/pima-indians-diabetes-database). It records the diabetes outcome based on different parameters, where 1 denotes ***Diabetic*** and 0 denotes ***Not Diabetic***.

In [36]:
data.iloc[580]

Pregnancies                   0.000
Glucose                     151.000
BloodPressure                90.000
SkinThickness                46.000
Insulin                       0.000
BMI                          42.100
DiabetesPedigreeFunction      0.371
Age                          21.000
Outcome                       1.000
Name: 580, dtype: float64

In [4]:
# Separate the dependent variable (label) from the independent variables before training
label = data['Outcome']
data.drop("Outcome",axis=1,inplace=True)

In [5]:
data.head()

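| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 |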


In [7]:
label.value_counts()

0    500
1    268
Name: Outcome, dtype: int64

In [9]:
"""Split the data into train and test"""
train_data, test_data, train_label, test_label = train_test_split(data,label, test_size=0.25)
print(train_data.shape, test_data.shape, train_label.shape, test_label.shape)

(576, 8) (192, 8) (576,) (192,)


In [11]:
test_label.value_counts()

0    133
1     59
Name: Outcome, dtype: int64
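
The split above is random, so the class proportions in the test set differ slightly from the full dataset (59 of 192 is about 31% positives, versus 268 of 768, about 35%). One option, sketched here on synthetic data rather than the diabetes file, is to pass `stratify` and `random_state` to `train_test_split`; the seed value 42 is arbitrary and the random features are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same class counts as the dataset (500 vs 268).
rng = np.random.default_rng(0)
features = rng.normal(size=(768, 8))
labels = np.array([0] * 500 + [1] * 268)

X_train, X_test, y_train, y_test = train_test_split(
    features, labels,
    test_size=0.25,
    stratify=labels,     # keep the 0/1 ratio the same in both splits
    random_state=42,     # arbitrary seed so the split is reproducible
)
print(X_train.shape, X_test.shape)
print(y_test.mean(), labels.mean())  # share of positives: test set vs full data
```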

### Training the model

In [24]:
lr = LogisticRegression()
model = lr.fit(train_data, train_label)
train_pred = model.predict(train_data)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
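
The warning above means the solver stopped before it converged. A minimal sketch of the two remedies it suggests, feature scaling and a higher `max_iter`, using a scikit-learn pipeline on synthetic data. The feature scales and `max_iter=1000` are illustrative assumptions, not values from this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales, loosely mimicking raw clinical data.
rng = np.random.default_rng(0)
X = rng.normal(loc=[120.0, 70.0, 30.0, 0.5], scale=[30.0, 12.0, 7.0, 0.3], size=(500, 4))
# Labels depend on the first and third feature plus some noise.
signal = (X[:, 0] - 120) / 30 + (X[:, 2] - 30) / 7
y = (signal + rng.normal(scale=0.5, size=500) > 0).astype(int)

# StandardScaler rescales each feature to zero mean and unit variance before fitting.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X, y)
print(clf.score(X, y))  # mean accuracy on the training data
```

Keeping the scaler inside the pipeline means the same scaling learned from the training data is applied automatically at prediction time.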


In [25]:
train_pred

array([1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
       0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0,
       0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1,
       0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1,
       0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1,
       1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1,
       0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0,

In [30]:
print(classification_report(train_label,train_pred))

              precision    recall  f1-score   support

           0       0.80      0.88      0.84       367
           1       0.74      0.62      0.67       209

    accuracy                           0.78       576
   macro avg       0.77      0.75      0.76       576
weighted avg       0.78      0.78      0.78       576



In [27]:
print(confusion_matrix(train_label, train_pred))

[[322  45]
 [ 80 129]]
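
To connect this matrix with the report above: scikit-learn's `confusion_matrix` puts true classes in rows and predicted classes in columns, so for labels 0/1 the layout is `[[TN, FP], [FN, TP]]`. A short check using the numbers printed above:

```python
import numpy as np

# Training confusion matrix from the output above: rows = true, columns = predicted.
cm = np.array([[322, 45],
               [80, 129]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)        # of those predicted diabetic, how many are diabetic
recall = tp / (tp + fn)           # of those actually diabetic, how many were found
accuracy = (tp + tn) / cm.sum()

print(round(precision, 2), round(recall, 2), round(accuracy, 2))  # 0.74 0.62 0.78
```

These match the class `1` row and the accuracy line of the classification report.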


In [33]:
train_data.head(1)

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 580 | 0 | 151 | 90 | 46 | 0 | 42.1 | 0.371 | 21 |


In [34]:
data.iloc[580]

Pregnancies                   0.000
Glucose                     151.000
BloodPressure                90.000
SkinThickness                46.000
Insulin                       0.000
BMI                          42.100
DiabetesPedigreeFunction      0.371
Age                          21.000
Name: 580, dtype: float64

In [15]:
test_pred = model.predict(test_data)
print(classification_report(test_label, test_pred))
print(confusion_matrix(test_label, test_pred))

              precision    recall  f1-score   support

           0       0.80      0.87      0.83       133
           1       0.64      0.51      0.57        59

    accuracy                           0.76       192
   macro avg       0.72      0.69      0.70       192
weighted avg       0.75      0.76      0.75       192



### Mathematics behind (Logistic Regression)
<br>
***Logistic regression*** is part of the regression family because, internally, it computes the same linear equation that linear regression uses.
<br>
<br>
<center><font size=5>$\hat{y}= w_1x_1+w_2x_2+w_3x_3+w_4x_4+w_5x_5+w_6x_6+w_7x_7+w_8x_8 +b$</font></center>
<br>
<br>
Now let us see how it differs from *Linear Regression*: the linear output $\hat{y}$ is passed through the sigmoid function.
<br>
<br>
<center><font size=5>$\sigma(\hat{y}) = \frac{1}{1+e^{-\hat{y}}}$</font></center>
<br>
$\sigma(\hat{y})$: the sigmoid, also known as the logistic function. It is a nonlinear function that maps any real number to a value between 0 and 1, and it is used in both machine learning and deep learning.

In [16]:
def linear_reg(x):
    """This function will implement Linear Equation which is y_hat = wx+b
    with randomly initialised weights and bias
    parameters
    x: independent variables in the form of array of shape (m, n)
    
    output
    y_hat: array of shape (1, m), one value per sample, computed over n features"""
    m,n = x.shape
    # random initial weights (one per feature) and a single bias term
    w = np.random.randn(1, n)
    b = np.random.randn(1,1)
    # (1, n) dot (n, m) -> (1, m); b is broadcast across all samples
    return np.dot(w,x.T)+b

def sigmoid(y_hat):
    """This function will implement sigmoid function
    parameters
    y_hat: y_hat calculated after linear equation
    
    output
    s_y_hat: sigmoid of y_hat, with every value between 0 and 1
    """
    s_y_hat = 1/(1+np.exp(-y_hat))
    return s_y_hat

In [20]:
train_pred= linear_reg(np.array(train_data))

In [21]:
sigmoid(train_pred)

array([[1.00000000e+00, 1.00000000e+00, 1.00000000e+00, 1.00000000e+00,
        1.00000000e+00, 1.00000000e+00, 1.00000000e+00, 1.00000000e+00,
        1.00000000e+00, 1.00000000e+00, 1.00000000e+00, 1.00000000e+00,
        1.00000000e+00, 1.00000000e+00, 1.00000000e+00, 1.00000000e+00,
        1.00000000e+00, 1.00000000e+00, 1.00000000e+00, 1.00000000e+00,
        1.00000000e+00, 1.00000000e+00, 1.00000000e+00, 1.00000000e+00,
        1.00000000e+00, 1.00000000e+00, 1.00000000e+00, 1.00000000e+00,
        1.00000000e+00, 1.00000000e+00, 1.00000000e+00, 1.00000000e+00,
        1.00000000e+00, 1.00000000e+00, 1.00000000e+00, 1.00000000e+00,
        1.00000000e+00, 1.00000000e+00, 1.00000000e+00, 1.00000000e+00,
        1.00000000e+00, 1.00000000e+00, 1.00000000e+00, 1.00000000e+00,
        1.00000000e+00, 1.00000000e+00, 1.00000000e+00, 1.00000000e+00,
        1.00000000e+00, 1.00000000e+00, 1.00000000e+00, 1.00000000e+00,
        1.00000000e+00, 1.00000000e+00, 1.00000000e+00, 1.000000

In [17]:
y_hat = linear_reg(np.array(data))
sigmoid(y_hat)



array([[2.42561201e-139, 4.25315951e-080, 3.47380844e-103,
        0.00000000e+000, 0.00000000e+000, 4.87684003e-069,
        0.00000000e+000, 1.35880516e-116, 0.00000000e+000,
        6.51538975e-072, 5.48507949e-067, 3.51040889e-112,
        2.25657096e-112, 0.00000000e+000, 0.00000000e+000,
        4.82603620e-103, 0.00000000e+000, 9.64368917e-075,
        0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
        3.04975065e-092, 3.97049086e-116, 1.88105001e-110,
        0.00000000e+000, 0.00000000e+000, 2.58330060e-106,
        0.00000000e+000, 0.00000000e+000, 2.89021517e-076,
        3.00011716e-125, 0.00000000e+000, 9.38194897e-222,
        1.14228036e-049, 1.69293127e-124, 0.00000000e+000,
        1.65246554e-100, 1.17915673e-126, 1.04179565e-100,
        0.00000000e+000, 0.00000000e+000, 1.12094232e-093,
        2.01737160e-093, 0.00000000e+000, 5.57193199e-103,
        1.22364907e-126, 1.49372233e-079, 2.79511889e-067,
        5.71687643e-113, 1.14010644e-073, 1.69545868e-29
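
The outputs above are pinned at 1 or at values near 0 because the raw features are large (glucose and insulin values are in the tens or hundreds), which makes $|\hat{y}|$ large and pushes the sigmoid to its extremes. A minimal sketch of the effect on synthetic data, assuming fixed weights of 0.5 (instead of random ones) so the result is reproducible, and standardisation as the scaling method:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
# Synthetic features on raw scales (illustrative values, not the diabetes data).
x_raw = rng.normal(loc=[120.0, 70.0, 30.0], scale=[30.0, 12.0, 7.0], size=(200, 3))
# Standardise: zero mean and unit variance per feature.
x_scaled = (x_raw - x_raw.mean(axis=0)) / x_raw.std(axis=0)

w = np.full((1, 3), 0.5)   # fixed weights, assumed here for reproducibility
b = 0.0

p_raw = sigmoid(np.dot(w, x_raw.T) + b)
p_scaled = sigmoid(np.dot(w, x_scaled.T) + b)

print(p_raw.min(), p_raw.max())        # raw features: values pinned near 1
print(p_scaled.min(), p_scaled.max())  # scaled features: values spread inside (0, 1)
```

This is one reason the data preparation step includes scaling before a model is trained.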

### Cross-entropy loss
<br>
Instead of ***Mean Squared Error***, we will use ***Cross-entropy Loss***, which is also known as log loss.
<br>
<br>
<center><font size=5>$J(w,b)=-\frac{1}{m}\sum_{i=1}^m\left[y_i\log(\sigma(\hat{y}_i))+(1-y_i)\log(1-\sigma(\hat{y}_i))\right]$</font></center>

In [22]:
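def cost_func(y_hat, y):
    """We will calculate the cost function for logistic regression
    parameters
    y_hat: predicted probabilities, i.e. the sigmoid of the linear equation
    y: dependent variable (true labels) in the form of array
    output
    cost: computed cost value for data
    """
    y_hat = np.ravel(y_hat)
    y = np.ravel(y)
    # clip so that log() never receives exactly 0 or 1
    y_hat = np.clip(y_hat, 1e-12, 1 - 1e-12)
    cost = -np.mean(y * np.log(y_hat) + (1-y) * np.log(1-y_hat))
    
    return cost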
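
To see how the loss behaves, here is a small self-contained sketch. The function name `cross_entropy` and the clipping constant `eps` are assumptions added for the example (clipping avoids `log(0)` when a prediction is exactly 0 or 1).

```python
import numpy as np

def cross_entropy(y, p, eps=1e-12):
    """Mean cross-entropy (log loss) between labels y and predicted probabilities p."""
    y = np.asarray(y, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)  # keep log() finite
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = [1, 0, 1, 0]
confident_right = [0.9, 0.1, 0.9, 0.1]
confident_wrong = [0.1, 0.9, 0.1, 0.9]

print(cross_entropy(y, confident_right))  # about 0.105
print(cross_entropy(y, confident_wrong))  # about 2.303
```

Confident but wrong predictions are penalised far more heavily than confident and correct ones are rewarded.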

### Optimising Loss

Let us look at what we get when we take the derivatives of the sigmoid and the cost function.
<br>
***Derivative of sigmoid:***
<center><font size = 5>$\sigma^{'}(\hat{y}) = \sigma(\hat{y})(1-\sigma(\hat{y}))$</font></center>
<br>
***Derivative of cost function:***
<br>
**For parameter w**
<br>
<center><font size = 5>$\frac{\partial}{\partial w}J(w,b) = \frac{1}{m}\sum_{i=1}^m x_i(\sigma(\hat{y}_i) - y_i)$</font></center>
<br>
**For parameter b**
<br>
<center><font size = 5>$\frac{\partial}{\partial b}J(w,b) = \frac{1}{m}\sum_{i=1}^m(\sigma(\hat{y}_i) - y_i)$</font></center>
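
These expressions can be sanity-checked numerically. The sketch below compares the analytic gradient for $w$ with a finite-difference estimate on a tiny random dataset; the names `loss` and `analytic_grad_w`, the step size and the data are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(w, b, x, y):
    """Cross-entropy loss for weights w (n,), bias b, inputs x (m, n), labels y (m,)."""
    p = sigmoid(x @ w + b)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def analytic_grad_w(w, b, x, y):
    """Gradient from the formula: (1/m) * sum_i x_i * (sigma(y_hat_i) - y_i)."""
    p = sigmoid(x @ w + b)
    return x.T @ (p - y) / len(y)

rng = np.random.default_rng(0)
x = rng.normal(size=(20, 3))
y = rng.integers(0, 2, size=20).astype(float)
w = rng.normal(size=3)
b = 0.1

# Central finite differences: perturb one weight at a time.
h = 1e-6
numeric = np.zeros_like(w)
for j in range(len(w)):
    step = np.zeros_like(w)
    step[j] = h
    numeric[j] = (loss(w + step, b, x, y) - loss(w - step, b, x, y)) / (2 * h)

print(np.max(np.abs(numeric - analytic_grad_w(w, b, x, y))))  # close to zero
```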

In [28]:
def optimisation(x,y_hat,y,w,b):
    """This function will calculate the derivative of sigmoid and cost with respect to w and b
    parameters
    x: independent variables in the form of array of shape (m, n)
    y_hat: sigmoid of linear equation
    y: dependent variable in the form of array
    w, b: weights and bias (kept in the signature; the gradients do not need them directly)
    
    output 
    ds: derivative of sigmoid
    dw: derivative of cost with respect to w, one value per feature
    db: derivative of cost with respect to b"""
    m = x.shape[0]
    # flatten so that (1, m), (m, 1) and (m,) inputs all line up
    y_hat = np.ravel(y_hat)
    y = np.ravel(y)
    ds = y_hat * (1-y_hat)
    error = y_hat - y
    dw = np.dot(x.T, error) / m
    db = np.mean(error)
    return ds, dw, db
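
The gradients are used to update the parameters repeatedly (gradient descent): $w \leftarrow w - \alpha\,\frac{\partial J}{\partial w}$ and $b \leftarrow b - \alpha\,\frac{\partial J}{\partial b}$. The notebook stops at computing the derivatives, so the loop below is a sketch under assumptions: the learning rate `alpha`, the iteration count, the function name `train_logistic` and the synthetic data are all illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train_logistic(x, y, alpha=0.1, n_iter=1000):
    """Fit weights and bias by batch gradient descent on the cross-entropy loss."""
    m, n = x.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(n_iter):
        p = sigmoid(x @ w + b)      # predicted probabilities, shape (m,)
        error = p - y
        dw = x.T @ error / m        # gradient with respect to w
        db = np.mean(error)         # gradient with respect to b
        w -= alpha * dw
        b -= alpha * db
    return w, b

# Synthetic, already-scaled data: the label depends on the sum of two features.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))
y = (x[:, 0] + x[:, 1] > 0).astype(float)

w, b = train_logistic(x, y)
pred = (sigmoid(x @ w + b) > 0.5).astype(float)
print("training accuracy:", np.mean(pred == y))
```

Starting from zero weights rather than random ones is a simplification here; either works for this convex loss.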

### Hypothesis Testing
<br>
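We evaluate the model by first creating a confusion matrix, which counts the true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN).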
<img src="confusionMatrix.JPG">

**Precision:** <font size=3>$\frac{TP}{TP+FP}$</font>
<br>
<br>
**Recall:** <font size=3>$\frac{TP}{TP+FN}$</font>
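
As a minimal sketch of how the four cells of the confusion matrix are counted and how the two formulas follow from them, using made-up labels (`confusion_counts` is a hypothetical helper, not a library function):

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels where 1 is the positive class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    return tp, fp, fn, tn

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(tp, fp, fn, tn)      # 3 1 1 3
print(tp / (tp + fp))      # precision: 0.75
print(tp / (tp + fn))      # recall: 0.75
```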