<div style="text-align:center">
    <img src="../files/monolearn-logo.png" height="150px">
    <h1>ML course</h1>
    <h3>Session 05: Logistic Regression, Confusion Matrix, Data Preprocessing</h3>
    <h4><a href="https://amzenterprise.ir/">Ali Momenzadeh</a></h4>
</div>

### Logistic Regression

Logistic regression estimates the probability of an event occurring, such as voted or didn't vote, based on a given dataset of independent variables. Since the outcome is a probability, the dependent variable is bounded between 0 and 1.

<img src ="../files/5/Linear-Regression-vs-Logistic-Regression.webp" width=50%>

Linear Regression is used to determine the value of a continuous dependent variable.
Logistic Regression is generally used for classification purposes. Unlike `Linear Regression`, the dependent variable can take only a limited number of values, i.e., it is categorical. When there are only two possible outcomes, it is called Binary Logistic Regression.

<h4>Sigmoid function</h4>
<img src = "../files/5/0__5zUFVIAXwzBSgAR.png" style="background-color:white" width=50% >

* In Linear Regression, the output is the weighted sum of inputs. Logistic Regression is a generalized Linear Regression in the sense that we don’t output the weighted sum of inputs directly; instead, we pass it through a function that maps any real value into the range between 0 and 1.

<img src = "../files/5/cbimd9voqvaf.png" style="background-color:white">
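
As a minimal sketch of the idea above, the snippet below computes a weighted sum and passes it through the sigmoid function. The weights, bias, and feature values are hypothetical numbers chosen only for illustration, and the 0.5 threshold is the usual convention rather than something fixed by the model.

```python
import math

def sigmoid(z):
    """Map any real number z to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(weights, bias, features):
    """Weighted sum of the inputs, passed through the sigmoid."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical weights and inputs (not fitted on any real data)
p = predict_probability(weights=[0.8, -0.4], bias=0.1, features=[2.0, 1.0])

# Turn the probability into a class label with the conventional 0.5 threshold
label = 1 if p >= 0.5 else 0
print(round(p, 3), label)
```

A weighted sum of exactly 0 gives a probability of 0.5, which is why 0.5 is the natural decision boundary.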

### Logistic Regression

##### About Iris Dataset

We will use the well-known Iris data set. It contains 3 classes of 50 instances each, where each class refers to a type of iris plant. The code below keeps all four feature columns and all three classes, so the model is trained as a multi-class classifier.

<img src = "../files/5/iris-machinelearning.png">

#### Import libraries

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn

import warnings
warnings.filterwarnings('ignore')

#### Load and prepare data

For a synthetic one-feature dataset such as the one pictured below, logistic regression would classify values as either 0 or 1 (class one or class two) using the logistic curve.

<img src = "../files/5/0_2c7voFri9cIXGrc4.jpg">

#### EDA

In [None]:
iris = pd.read_csv('Iris.csv')

In [None]:
iris.head()

In [None]:
iris['Species'].value_counts()

In [None]:
iris.info()

In [None]:
iris.drop("Id", axis=1, inplace = True)

### Storytelling - Visualization

In [None]:
g = sns.pairplot(iris, hue='Species', markers='+')
plt.show()

In [None]:
iris.shape

In [None]:
# Species is a text column, so restrict the correlation to the numeric columns
corr = iris.corr(numeric_only=True)
plt.figure(figsize=(10,8)) 
sns.heatmap(corr, cmap='viridis', annot=True)

#### Train and test (Classification)

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

In [None]:
iris

In [None]:
# X = iris.iloc[:,0:4]
X = iris.drop(['Species'], axis=1)
y = iris['Species']

# print(X.head())
print(X.shape)
# print(y.head())
print(y.shape)

In [None]:
X.head()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

In [None]:
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

In [None]:
X_test

In [None]:
y_pred

In [None]:
y_test

In [None]:
print("The accuracy of Logistic Regression is: ", (metrics.accuracy_score(y_test, y_pred)))

<img src="../files/5/One.png" style="background-color:white" width=50% />

In [None]:
pd.crosstab(logreg.predict(X),y)

<hr/>

### Confusion Matrix

<img src = "../files/5/1_MmnBnKKENiD1iW_83b0ZeQ.png" width=75%>

#### What is Confusion Matrix?

A confusion matrix summarizes the performance of a classifier in tabular form by counting the correct and incorrect predictions for each class.

<img src = "../files/5/1_n2im9rDJdRQMBNZ3pPMKXw.png" width=80%>

* Positive (P): Observation is positive.
* Negative (N): Observation is not positive.
* True Positive (TP): Outcome where the model correctly predicts the positive class.
* True Negative (TN): Outcome where the model correctly predicts the negative class.
* False Positive (FP): Also called a type 1 error, an outcome where the model incorrectly predicts the positive class when it is actually negative.
* False Negative (FN): Also called a type 2 error, an outcome where the model incorrectly predicts the negative class when it is actually positive.
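
The definitions above can be sketched as a small counting function. The helper name `confusion_counts` and the label lists are hypothetical and serve only to illustrate the four outcomes.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP and FN for a binary problem."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1  # predicted positive, actually positive
        elif predicted == positive:
            fp += 1  # predicted positive, actually negative (type 1 error)
        elif actual == positive:
            fn += 1  # predicted negative, actually positive (type 2 error)
        else:
            tn += 1  # predicted negative, actually negative
    return tp, tn, fp, fn

# Hypothetical labels, for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
```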

<img src = "../files/5/1_prg8nKHYwm2NQBgP-sqS8g.jpg" width=50%>

The number of correct predictions for a class goes into the expected row for that class and the predicted column for that same class.

In the same way, the number of incorrect predictions for a class goes into the expected row for that class and the predicted column for the class that was wrongly predicted.

The diagonal elements represent the number of points for which the predicted label is equal to the true label, while off-diagonal elements are those that are mislabelled by the classifier. The higher the diagonal values of the confusion matrix the better, indicating many correct predictions.
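
To make the role of the diagonal concrete, here is a short sketch that derives the overall accuracy from a confusion matrix. The 3x3 matrix is hypothetical; rows are assumed to hold the true labels and columns the predicted labels.

```python
# Hypothetical confusion matrix: rows = true class, columns = predicted class
matrix = [
    [5, 1, 0],
    [2, 6, 1],
    [0, 1, 4],
]

# Correct predictions sit on the diagonal
correct = sum(matrix[i][i] for i in range(len(matrix)))

# Every cell is one or more predictions, so the grand total is the sample count
total = sum(sum(row) for row in matrix)

# Everything off the diagonal was mislabelled by the classifier
misclassified = total - correct

accuracy = correct / total
print(correct, misclassified, accuracy)
```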

<img src = "../files/5/1_YV7zy1NGN1-HGQxY56nc_Q.png" width=75%>

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn

Logistic regression would, in the synthetic dataset below, classify values as either 0 or 1 (class one or class two) using the logistic curve.

<img src = "../files/5/1_TWXtKH_4trfKz7sexoadiw.png">

In [None]:
data = {
    'y_Actual':    [0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0],
    'y_Predicted': [0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0]
}

In [None]:
df = pd.DataFrame(data, columns=['y_Actual','y_Predicted'])
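# Rows are predicted labels and columns are actual labels here
# (the transpose of the layout returned by sklearn.metrics.confusion_matrix).
# The table is named cm so that it is not overwritten by the sklearn import below.
cm = pd.crosstab(df['y_Predicted'], df['y_Actual'], rownames=['Predicted'], colnames=['Actual'])

In [None]:
df

In [None]:
cm

In [None]:
sns.heatmap(cm, annot=True)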
plt.show()

In [None]:
# Confusion Matrix
from sklearn.metrics import confusion_matrix
confusion_matrix(df["y_Actual"], df["y_Predicted"])

In [None]:
# Accuracy
from sklearn.metrics import accuracy_score
accuracy_score(df["y_Actual"], df["y_Predicted"])

In [None]:
# Recall
from sklearn.metrics import recall_score
recall_score(df["y_Actual"], df["y_Predicted"])

In [None]:
# Precision
from sklearn.metrics import precision_score
precision_score(df["y_Actual"], df["y_Predicted"])

<img src="../files/5/Accuracy_2.webp">

####  Recall is a useful metric in cases where False Negative is a higher concern than False Positive.

Example: COVID-19 testing, where missing an infected person (a false negative) is worse than raising a false alarm.

####  Precision is a useful metric in cases where False Positive is a higher concern than False Negatives.

Precision is important in music or video recommendation systems, e-commerce websites, etc. Wrong results could lead to customer churn and be harmful to the business.

<img src="../files/5/1_7J08ekAwupLBegeUI8muHA.png">

<img src = "../files/5/1_5_ZAlFhlCk8llhnYWD5PXw.png" width=40%>

In [None]:
# Method 1: sklearn
from sklearn.metrics import f1_score
f1_score(df["y_Actual"], df["y_Predicted"])

In [None]:
# Method 2: Manual Calculation
recall = recall_score(df["y_Actual"], df["y_Predicted"])
precision = precision_score(df["y_Actual"], df["y_Predicted"])

F1 = 2 * (precision * recall) / (precision + recall)
F1

In [None]:
# Method 3: Classification report
from sklearn.metrics import classification_report
print(classification_report(df["y_Actual"], df["y_Predicted"]))

#### Confusion Matrix in a nutshell

<img src="../files/5/NzSnD.jpg" width=60% />

#### Confusion Matrix for Multi-Class Classification

<img src="../files/5/812455.jpg" width=45% />

1. Let us calculate the TP, TN, FP, and FN values for the Setosa class using the tricks above:

TP: The actual value and the predicted value are the same. For the Setosa class, the value of cell 1 is the TP value.
* TP = (cell 1) = 16

FN: The sum of the values in the corresponding row, except the TP value.
* FN = (cell 2 + cell 3) = 0 + 0 = 0

FP: The sum of the values in the corresponding column, except the TP value.
* FP = (cell 4 + cell 7) = 0 + 0 = 0

TN: The sum of the values in all rows and columns except those of the class we are calculating for.
* TN = (cell 5 + cell 6 + cell 8 + cell 9) = 17 + 1 + 0 + 11 = 29

> Accuracy = (TP + TN) / N = (16 + 29) / 45 = 1

> Precision = TP / (TP + FP) = 16 / (16 + 0) = 1

> Recall = TP / (TP + FN) = 16 / (16 + 0) = 1

> F1-score = 2 * ((Precision * Recall) / (Precision + Recall)) = 2 * ((1 * 1) / (1 + 1)) = 2 * (1/2) = 1


2. Similarly, for the Versicolor class the values and metrics are calculated as below:

* TP = 17 (cell 5)
* FN = 0 + 1 = 1 (cell 4 + cell 6)
* FP = 0 + 0 = 0 (cell 2 + cell 8)
* TN = 16 + 0 + 0 + 11 = 27 (cell 1 + cell 3 + cell 7 + cell 9)

> Accuracy = (TP + TN) / N = (17 + 27) / 45 = 0.978

> Precision = TP / (TP + FP) = 17 / (17 + 0) = 1

> Recall = TP / (TP + FN) = 17 / (17 + 1) = 0.944

> F1-score = 2 * ((Precision * Recall) / (Precision + Recall)) = 2 * ((1 * 0.944) / (1 + 0.944)) = 2 * (0.944/1.944) = 0.971
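
The hand calculation above can be sketched in a few lines of plain Python. The cell values are the ones used in this example, with rows taken as the actual class and columns as the predicted class. The helper names are hypothetical, and the label of the third class (Virginica) is an assumption, since only Setosa and Versicolor are worked through in the text.

```python
def one_vs_rest_counts(matrix, k):
    """TP, FN, FP, TN for class index k (rows = actual, columns = predicted)."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                    # rest of the row
    fp = sum(row[k] for row in matrix) - tp     # rest of the column
    tn = total - tp - fn - fp                   # everything else
    return tp, fn, fp, tn

def safe_div(a, b):
    """Avoid a ZeroDivisionError when a class is never predicted or never present."""
    return a / b if b else 0.0

labels = ["Setosa", "Versicolor", "Virginica"]  # third label is an assumption
matrix = [
    [16, 0, 0],   # cells 1-3
    [0, 17, 1],   # cells 4-6
    [0, 0, 11],   # cells 7-9
]

for k, name in enumerate(labels):
    tp, fn, fp, tn = one_vs_rest_counts(matrix, k)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    print(f"{name}: TP={tp} FN={fn} FP={fp} TN={tn} "
          f"accuracy={accuracy:.3f} precision={precision:.3f} "
          f"recall={recall:.3f} F1={f1:.3f}")
```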

<hr/>

### Data Preprocessing (dealing with missing values, independent variable encoding and feature scaling)

<img src="../files/5/0_1-i9w0e4kklVQl5B.jpg" width=75% />

<img src="../files/5/text-data-task-framework-preprocessing.png" width=75% />

#### Import libraries

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn

#### Import dataset

In [None]:
df = pd.read_csv('Data.csv')

In [None]:
df

In [None]:
df.info()

In [None]:
df.columns

In [None]:
X = df[['Country', 'Age', 'Salary']].values
y = df['Purchased'].values

In [None]:
X

In [None]:
y

### Missing values

<img src="../files/5/How-to-Handle-Missing-Values-with-Python.jpg" width=50% />

#### Sources of Missing Values

Before we dive into code, it’s important to understand the sources of missing data. Here are some typical reasons why data is missing:

* User forgot to fill in a field.
* Data was lost while transferring manually from a legacy database.
* There was a programming error.
* Users chose not to fill out a field tied to their beliefs about how the results would be used or interpreted.

In [None]:
df.isnull().sum()

##### Solution 1 - dropna

In [None]:
df_dropna = df.copy()

In [None]:
# summarize the shape of the raw data
print("Before:",df_dropna.shape)

# drop rows with missing values
df_dropna.dropna(inplace=True)

# summarize the shape of the data with missing rows removed
print("After:",df_dropna.shape)

##### Solution 2 - fillna

In [None]:
df_fillna = df.copy()

In [None]:
df_fillna

In [None]:
# fill missing values with mean column values
# (numeric_only=True skips text columns such as Country, which have no mean)
df_fillna.fillna(df_fillna.mean(numeric_only=True), inplace=True)

# count the number of NaN values in each column
print(df_fillna.isnull().sum())

df_fillna

##### Solution 3 - Scikit-learn

The scikit-learn library provides the SimpleImputer pre-processing class that can be used to replace missing values.

It is a flexible class that allows you to specify the value to replace (it can be something other than NaN) and the technique used to replace it (such as mean, median, or mode). SimpleImputer accepts NumPy arrays as well as DataFrames and, by default, returns a NumPy array, which is why the example here works on the array `X` rather than on `df`.
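
To show what these replacement techniques do to a single column, here is a plain-Python sketch. The `impute` helper and the sample ages are hypothetical. The strategy names mirror the `mean`, `median`, and `most_frequent` options of SimpleImputer, but this is an illustration of the idea, not the library's implementation.

```python
import math
import statistics

def impute(column, strategy="mean"):
    """Replace NaN entries with a statistic computed from the observed values."""
    observed = [v for v in column if not math.isnan(v)]
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "most_frequent":
        fill = statistics.mode(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if math.isnan(v) else v for v in column]

# Hypothetical age column with one missing entry
ages = [44.0, 27.0, math.nan, 38.0, 27.0]

print(impute(ages, "mean"))
print(impute(ages, "median"))
print(impute(ages, "most_frequent"))
```

The mean is sensitive to outliers, so the median is often a safer choice for skewed columns such as salaries.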

In [None]:
X

In [None]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
#imputer.fit_transform(X[:, 1:3])

In [None]:
print(X)

### Encoding the independent variable

##### Pandas - get_dummies

In [None]:
df

In [None]:
pd.get_dummies(df,columns=["Country"])

In [None]:
pd.get_dummies(df)

<img src="../files/5/One-Hot_Encoding_print.png" width=75% />

##### Scikit-learn - OneHotEncoder

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
X = np.array(ct.fit_transform(X))

In [None]:
df

In [None]:
print(X)

##### Scikit-learn - LabelEncoder

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)

In [None]:
print(y)

#### Train and test

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [None]:
print(X_train)

In [None]:
X_train.shape

In [None]:
print(X_test)

In [None]:
print(y_train)

In [None]:
print(y_test)

### Feature scaling

Feature scaling is a data preprocessing step applied to the independent variables (features) of a dataset. It brings the features onto a comparable range, and it can also speed up the computations of some algorithms.

<img src="../files/5/1_yR54MSI1jjnf2QeGtt57PA.png" width=50% />

1) Min Max Scaler (Normalization)

2) Standard Scaler

##### What is Normalization?

Normalization is a scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1. It is also known as Min-Max scaling.

<img src="../files/5/Normalization-Formula.jpg" width=50% />
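
A minimal sketch of Min-Max scaling for one column follows. The function name and the salary values are hypothetical, and mapping a constant column to 0.0 is an assumption made here to avoid dividing by zero.

```python
def min_max_scale(values):
    """Rescale values to the range [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column carries no information, map it to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical salary column
salaries = [48000, 54000, 61000, 72000, 83000]
scaled = min_max_scale(salaries)
print(scaled)
```

The smallest value always maps to 0 and the largest to 1, so a single extreme outlier can squeeze all other values into a narrow band.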

##### What is Standardization?

Standardization is another scaling technique where the values are centered around the mean with a unit standard deviation. This means that the mean of the attribute becomes zero and the resultant distribution has a unit standard deviation.

<img src="../files/5/standardisation.jpg">

<img src="../files/5/Standard-deviation-formula.jpg">

<img src="../files/5/1__783tuRRVcTUwyFWB8VG0g.png" style="background-color:white" width=60% />
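
Standardization can be sketched the same way for one column. The function name and the age values are hypothetical. The population standard deviation (dividing by N) is used here, which is also what scikit-learn's StandardScaler does.

```python
import statistics

def standardize(values):
    """Center values on the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std dev (divides by N)
    if std == 0:
        # Assumption: a constant column is mapped to zeros
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

# Hypothetical age column
ages = [27, 30, 35, 38, 44, 50]
z = standardize(ages)

# After standardization the mean is (numerically) 0 and the std dev is 1
print(statistics.fmean(z), statistics.pstdev(z))
```

Unlike Min-Max scaling, the result is not bounded to a fixed range.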

Another reason to apply feature scaling is that some algorithms, such as those optimized with gradient descent, converge much faster with it than without it.

<img src="../files/5/1_yi0VULDJmBfb1NaEikEciA.png" width=50% />

Read more here: https://www.geeksforgeeks.org/normalization-vs-standardization/

##### Solution 1 - MinMaxScaler

In [None]:
from sklearn.preprocessing import MinMaxScaler
mm = MinMaxScaler()
X_train[:, 3:] = mm.fit_transform(X_train[:, 3:])
X_test[:, 3:] = mm.transform(X_test[:, 3:])

In [None]:
print(X_train)

In [None]:
print(X_test)

##### Solution 2 - StandardScaler

In [None]:
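from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# Note: if the MinMaxScaler cell above has already run, X_train and X_test hold
# min-max-scaled values; re-run the train/test split first to standardize the raw data.
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
X_test[:, 3:] = sc.transform(X_test[:, 3:])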

In [None]:
print(X_train)

In [None]:
print(X_test)

##### When to use feature Scaling ...

1. Gradient Descent:

Machine learning algorithms like linear regression, logistic regression, and neural networks that use gradient descent as an optimization technique generally work better with scaled data. Take a look at the formula for gradient descent below:

<img src="../files/5/gradient-descent.webp" width=35% />

The presence of feature value X in the formula will affect the step size of the gradient descent. The difference in ranges of features will cause different step sizes for each feature. To ensure that the gradient descent moves smoothly towards the minima and that the steps for gradient descent are updated at the same rate for all the features, we scale the data before feeding it to the model.

> Having features on a similar scale can help the gradient descent converge more quickly towards the minima.
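
The effect on step sizes can be illustrated with a rough sketch. It assumes a linear model without intercept and the usual mean-squared-error gradient, where the gradient for weight j is the average of error times x_j. The rows, targets, and helper names are hypothetical.

```python
def gradient(weights, rows, targets):
    """Gradient of the (halved) mean squared error for a linear model without intercept."""
    m = len(rows)
    grads = [0.0] * len(weights)
    for x, y in zip(rows, targets):
        error = sum(w * xi for w, xi in zip(weights, x)) - y
        for j, xj in enumerate(x):
            grads[j] += error * xj / m  # the feature value x_j scales the step
    return grads

def min_max_columns(rows):
    """Min-max scale every column (assumes no column is constant)."""
    columns = list(zip(*rows))
    scaled = [[(v - min(c)) / (max(c) - min(c)) for v in c] for c in columns]
    return [list(r) for r in zip(*scaled)]

# Hypothetical rows of (age, salary) and binary targets
raw = [[25, 40000], [35, 60000], [45, 80000]]
targets = [0, 1, 1]
weights = [0.0, 0.0]

g_raw = gradient(weights, raw, targets)
g_scaled = gradient(weights, min_max_columns(raw), targets)

print("raw gradient:   ", g_raw)     # the salary component dominates
print("scaled gradient:", g_scaled)  # both components are comparable
```

With a single learning rate, a step that suits the salary weight is far too small for the age weight on the raw data, which is what slows convergence.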

2. Distance-Based Algorithms:

Distance-based algorithms like KNN, K-means, and SVM are the most affected by the range of features, because behind the scenes they use distances between data points to determine their similarity.

When features have different scales, features with a higher magnitude tend to receive a higher weight. This hurts the performance of the machine learning algorithm, and we do not want our algorithm to be biased towards one feature.

> Therefore, we scale our data before employing a distance based algorithm so that all the features contribute equally to the result.
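
A small sketch shows how scaling can change which point counts as the nearest neighbour. The three (age, salary) points and the assumed feature ranges are hypothetical.

```python
import math

def scale(point, mins, maxs):
    """Min-max scale one point given per-feature minimum and maximum values."""
    return [(v - lo) / (hi - lo) for v, lo, hi in zip(point, mins, maxs)]

# Hypothetical (age, salary) points
a = [25, 50000]
b = [60, 50500]   # very different age, almost the same salary
c = [26, 52000]   # almost the same age, slightly different salary

# On raw values the salary difference dominates the Euclidean distance
raw_ab = math.dist(a, b)
raw_ac = math.dist(a, c)

# Assumed feature ranges of the full dataset (hypothetical)
mins, maxs = [18, 20000], [70, 120000]
sa, sb, sc = (scale(p, mins, maxs) for p in (a, b, c))
scaled_ab = math.dist(sa, sb)
scaled_ac = math.dist(sa, sc)

print("raw:   ", "b" if raw_ab < raw_ac else "c", "is closer to a")
print("scaled:", "b" if scaled_ab < scaled_ac else "c", "is closer to a")
```

Before scaling, the 35-year age gap between `a` and `b` is almost invisible next to the salary values; after scaling, both features contribute on a comparable footing.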

3. Tree-Based Algorithms: 

Tree-based algorithms, on the other hand, are fairly insensitive to the scale of the features. A decision tree splits a node on a single feature at a time, choosing the feature that increases the homogeneity of the node. This split on a feature is not influenced by other features.

So, there is virtually no effect of the remaining features on the split. This is what makes them invariant to the scale of the features!
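
The scale invariance can be checked with a simplified sketch: a single threshold split on one feature, chosen by weighted Gini impurity for binary labels. The helper name and the data are hypothetical, and real decision tree implementations are more involved than this.

```python
def best_split_partition(values, labels):
    """Return (left, right) row indices of the threshold split with the lowest
    weighted Gini impurity. Assumes binary 0/1 labels and at least two distinct values."""
    def gini(idx):
        if not idx:
            return 0.0
        p = sum(labels[i] for i in idx) / len(idx)
        return 2 * p * (1 - p)

    best = None
    ordered = sorted(set(values))
    for lo, hi in zip(ordered, ordered[1:]):
        threshold = (lo + hi) / 2  # candidate threshold between neighbouring values
        left = [i for i, v in enumerate(values) if v <= threshold]
        right = [i for i, v in enumerate(values) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(values)
        if best is None or score < best[0]:
            best = (score, left, right)
    return best[1], best[2]

# Hypothetical salary feature and purchase labels
salary = [30000, 42000, 58000, 61000, 79000, 83000]
purchased = [0, 0, 0, 1, 1, 1]

# Min-max scaling changes the threshold value but not the ordering of the rows
lo, hi = min(salary), max(salary)
salary_scaled = [(s - lo) / (hi - lo) for s in salary]

print(best_split_partition(salary, purchased))
print(best_split_partition(salary_scaled, purchased))
```

Because scaling preserves the order of the values, the same rows end up on each side of the split.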

<img src="../files/5/Decision_Tree_2.webp" width=40% />