# Module 03: Classification

<details style='font-size:16px'><summary style='font-size:22px'>Learning Objectives:</summary>

- Understand different classification methods.
- Apply classification algorithms to various datasets to solve real-world problems.
- Understand evaluation methods in classification.
    
</details>

In [1]:
# Inject CSS so that a caption can be overlaid on each image
from IPython.display import HTML

HTML("""
<style>
.imgHolder {
    position: relative;
}
.imgHolder span {
    position: absolute;
    right: 10px;
    top: 10px;
    color: Black;
    font-weight: bold;
    font-size: 18px;
}
</style>
"""
)

## (A) Introduction to Classification

### What is classification?

- A *supervised learning approach*.

- It categorizes unknown items into a discrete set of categories, or "classes".

- The target attribute is a **categorical variable**.

###  How does classification work?


#### ***Binary Classification:***

> **Classification** determines the class label for an unlabeled test case.
<img src="Modules_images/classification_idea.png" style='float:right;'>
- Example of binary classification:
    + Suppose a bank is concerned about the potential for loans not to be repaid.

    + Previous loan default data can be used to predict which customers are likely to have problems repaying loans. These bad-risk customers can either have their loan application declined or be offered alternative products.

    + The goal of a loan default predictor is to use existing loan default data, which holds information about the customers such as income and age, to predict whether a new customer is likely to default.

#### ***Multi-class Classification***:
<img src="Modules_images/classification_multi.png" style='float:right;'>
- Example:

    + Imagine having collected data about a set of patients.

    + All of them suffered from the same illness.

    + During their course of treatment, each patient responded to one of three medications. We can use this labeled dataset with a classification algorithm to build a classification model.

### Classification use cases

- Which category does a customer belong to?

- Will a customer switch to another provider or brand?

- Will a customer respond to a particular advertising campaign?

### Classification algorithms in Machine Learning

- Decision Tree (ID3, C4.5, C5.0)

- Naive Bayes

- Linear Discriminant Analysis

- K-Nearest Neighbor

- Logistic Regression

- Neural Networks

- Support Vector Machines (SVM)

Imagine that a telecommunication provider has segmented its customer base by service usage patterns, categorizing the customers into four groups.

---

## (B) Naive Bayes

Naive Bayes is a simple but powerful algorithm used for classification tasks in machine learning. It's based on Bayes' theorem, a result from probability theory that describes how to update our beliefs based on new evidence.

$$
p(B|A) = \frac{\overbrace{p(A|B)}^{\text{likelihood}} \overbrace{p(B)}^{\text{prior}}}{\underbrace{p(A)}_{\text{evidence}}}
$$

- $p(B|A)$ is the posterior probability of event $B$ given the occurrence of event $A$.
- $p(A|B)$ is the likelihood, representing the probability of event $A$ given the occurrence of event $B$.
- $p(B)$ is the prior probability of event $B$, expressing our initial belief about its occurrence before considering the evidence.
- $p(A)$ is the evidence, also known as the marginal likelihood or the probability of event $A$, integrating over all possible occurrences of event $B$.

The Naive Bayes assumption comes into play when we express the likelihood as a product of individual feature probabilities, assuming independence among features given the class label.

Here is how this looks in mathematical terms.

Let $ \mathcal{X} = (x_1, x_2, \ldots, x_n) $ be the features, and $ y $ be the class label.

Starting from Bayes' theorem and applying the conditional independence assumption to the likelihood, the posterior is:

$$p(y|\mathcal{X}) = \frac{p(y) \prod_{i=1}^{n} p(x_i | y)}{p(\mathcal{X})}$$

The evidence $ p(\mathcal{X}) $ does not depend on the class label $ y $, so it is the same constant for every class:

$$p(y|\mathcal{X}) = \frac{p(y) \prod_{i=1}^{n} p(x_i | y)}{\text{constant}}$$

Because this constant does not change which class has the highest posterior, it can be dropped when comparing classes. For each candidate class $ y_j $, the posterior is proportional to the prior times the product of the per-feature likelihoods:

$$p(y_j|\mathcal{X}) \propto p(y_j) \prod_{i=1}^{n} p(x_i | y_j)$$

The predicted class is the $ y_j $ with the largest value of this expression.


In the context of Naive Bayes, "naive" refers to the assumption that the features used to describe an observation are conditionally independent of each other given the class label. Even though this assumption may not always hold true in real-world scenarios, the simplicity of Naive Bayes makes it computationally efficient and often surprisingly effective.

Let's break it down with an example. Suppose you want to classify emails as either spam or not spam (ham). Naive Bayes would analyze the occurrence of different words in both spam and ham emails. If an email contains words like "free," "discount," and "limited-time offer," the algorithm might lean towards classifying it as spam based on the probability of these words being associated with spam emails.
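The spam example can be sketched by hand in a few lines. This is a minimal illustration only: the priors and per-word probabilities below are hypothetical numbers, not estimates from a real corpus, and the `posterior` helper is a name introduced here.

```python
import math

# Hypothetical numbers chosen only for illustration (not from a real corpus).
priors = {"spam": 0.4, "ham": 0.6}
# p(word appears | class); words are assumed independent given the class
word_probs = {
    "spam": {"free": 0.30, "discount": 0.20, "meeting": 0.02},
    "ham": {"free": 0.03, "discount": 0.02, "meeting": 0.25},
}

def posterior(words):
    """Return p(class | words) for each class using the Naive Bayes rule."""
    # Unnormalised score: prior times the product of per-word likelihoods
    scores = {
        label: priors[label] * math.prod(word_probs[label][w] for w in words)
        for label in priors
    }
    evidence = sum(scores.values())  # p(words), identical for every class
    return {label: score / evidence for label, score in scores.items()}

result = posterior(["free", "discount"])
print(result)
```

With these assumed numbers the email is assigned to the spam class, because the spam score (prior times likelihoods) dominates the ham score.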

Let's take a dataset of students as an example:
<img src="Modules_images/nb_studentdataset.png" style='width:800px; height:441px'>


Suppose we want to compute the class of a particular observation. How can we calculate it with the Naive Bayes approach?

<img src="Modules_images/nb_studentdataset_solution.png" style='width:800px; height:441px'>


### Implement Naive Bayes in Python

In [None]:
# Import necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics

# Generate some example data (you'd typically load real-world data)
# Here, we have two features (X) and a binary target variable (y)
X = [[1, 2], [2, 3], [3, 4], [4, 5], [1, 3], [2, 4], [3, 5], [4, 6]]
y = [0, 0, 0, 0, 1, 1, 1, 1]  # 0 represents one class, and 1 represents another

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Initialize the Naive Bayes classifier
model = GaussianNB()

# Train the model on the training data
model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = model.predict(X_test)

# Evaluate the accuracy of the model
accuracy = metrics.accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

---

## (C) Introduction to Decision Trees

### What's a decision tree?

> The basic intuition behind a decision tree is to map out all possible decision paths in the form of a tree. "*Narendra Nath Joshi*"

<img src="Modules_images/decision_tree_2.png" style='float:left; width:800px; height:441px'>
<img src="Modules_images/decision_teee.png"  style='float:right; width:900px; height:441px'>


### Decision Tree Learning algorithm
<img src="Modules_images/decision_tree_3.png" style='float:right;'>

> A decision tree can be constructed by considering the attributes one by one.

1. Choose an attribute from your dataset.

2. Calculate **the significance** of the attribute in splitting the data. (*We will explain how to calculate the significance of an attribute "column" to see if it's effective or not!*)

3. Split the data based on the value of the best attribute.

4. Go to step 1 and repeat for each branch.

---

### Building Decision Trees

> Decision trees are built using recursive partitioning to classify the data.

In the last section, we talked about how an attribute can affect the data when you branch the tree on that specific attribute.

#### Scenario 1. 
<img src="Modules_images/attribute_1.png" style='float:right;'>

- As we can see, we have 14 cases.
- With this attribute, we can't decide which of Drug `A` or Drug `B` will be effective for `Normal` or `High` cholesterol, due to the **high impurity** of the class labels in each branch.

#### Scenario 2. 
<img src="Modules_images/attribute_2.png" style='float:right;'>

- We use the same 14 cases.

- This time we pick the `Sex` attribute of the patients and split the data into two branches, `Male` and `Female`.

    + As we can see, if the patient is `female`, we can say that `Drug B` might be suitable for her with **high certainty** (due to lower impurity).

    + However, we don't have sufficient evidence to determine whether `Drug A` or `Drug B` is suitable for a `male` patient.

    + Even so, it is a better choice than the `Cholesterol` attribute because the resulting nodes are **more pure**.

    + From what we're seeing, we can say that the `Sex` attribute is more significant than `Cholesterol`. In other words, it's more predictive than the other attribute.

    + Indeed, predictiveness is based on the decrease in **impurity of nodes**: we're looking for the best feature to decrease the impurity of the patients in the leaves.

    + Impurity is measured with **entropy**. So, **what's entropy?** 

### Entropy

> Measure of **randomness** or **uncertainty**.
<div class="imgHolder">
    <img src="Modules_images/entropy.png" style='float:right;'>
    <span>Entropy Evaluation</span>
</div>

<div class="imgHolder">
    <img src="Modules_images/entropy_1.png" style='float:right;'>
    <span style='right:525px;'>Entropy Before Splitting</span>
</div>

**The lower the entropy, the less uniform the distribution, the purer the node.**

<span style='font-size: 18px'>

<br>
<br>
<br>
<br>

*Entropy* = $-p(A)\log_2 p(A) - p(B)\log_2 p(B)$

</span>
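The entropy formula can be checked numerically with a small helper. This is a sketch that assumes base-2 logarithms (entropy in bits); the `entropy` function name and the class counts are illustrative choices, not values read from the figures.

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a node, given the count of each class."""
    total = sum(counts)
    # Classes with a zero count contribute nothing (0 * log 0 is taken as 0)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

print(entropy([7, 7]))   # evenly mixed node: maximum uncertainty
print(entropy([14, 0]))  # pure node: no uncertainty
print(entropy([9, 5]))   # somewhere in between
```

An evenly mixed node has the highest entropy and a pure node has zero entropy, which matches the statement that lower entropy means a purer node.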

### Entropy Evaluation

<div class="imgHolder">
    <img src="Modules_images/entropy_cal1.png" style='float:right;' width='650px'>
    <span>Entropy Cholesterol Evaluation</span>
</div>

<div class="imgHolder">
    <img src="Modules_images/entropy_cal2.png" style='float:left;'>
    <span style='right:825px'>Entropy Sex Evaluation</span>
</div>

As we can see, the two attributes give different **entropy** values. So, how can we decide which one is the best attribute? We can do that using something called **information gain**.

### What's information gain?

<img src="Modules_images/information_gain.png" style='float:right;'>

> **Information gain** is the information that increases the level of certainty after splitting.

<br>
<br>
<br>

*Information Gain* = (Entropy before splitting) - (weighted entropy after splitting)

<br>


As you can see, **the lower the *weighted entropy* after splitting, the higher the *information gain***.
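The information gain formula can be sketched as follows. The class counts are hypothetical `[Drug A, Drug B]` counts for 14 patients, chosen for illustration and not read from the figures; `entropy` and `information_gain` are helper names introduced here.

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a node, given the count of each class."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Entropy before the split minus the weighted entropy after it."""
    n = sum(parent)
    weighted = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical [Drug A, Drug B] counts; assumptions for illustration only.
parent = [5, 9]
split_mixed = [[2, 6], [3, 3]]  # both branches stay fairly mixed
split_purer = [[1, 6], [4, 3]]  # one branch is much purer

gain_mixed = information_gain(parent, split_mixed)
gain_purer = information_gain(parent, split_purer)
print(f"mixed split: {gain_mixed:.3f}, purer split: {gain_purer:.3f}")
```

The split that leaves purer branches has the lower weighted entropy and therefore the higher information gain, so it would be preferred.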

### Which attribute is the best?
<center>
<img src="Modules_images/information_gain_1.png">
</center>

Now that we have selected the `Sex` attribute, what is the next attribute to branch on?
<img src="Modules_images/build_next.png" style='float:right;'>
<br>
<br>
<br>
<br>
<br>
<br>
<br>


- As we mentioned before, a decision tree is built by recursive partitioning: the data is split again after each partition until the full tree is built, with the most informative attribute at each node.
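In practice the recursive partitioning is usually delegated to a library. The sketch below uses scikit-learn's `DecisionTreeClassifier` with `criterion="entropy"`; the eight-row dataset and its 0/1 encoding are hypothetical and only stand in for the patient data.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: [sex (0 = female, 1 = male), cholesterol (0 = normal, 1 = high)]
X = [[0, 0], [0, 1], [0, 0], [0, 1], [1, 0], [1, 0], [1, 1], [1, 1]]
y = ["B", "B", "B", "B", "A", "A", "B", "B"]

# criterion="entropy" makes the tree choose splits by information gain
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X, y)

# Print the learned splits as text
print(export_text(tree, feature_names=["sex", "cholesterol"]))

# Predict the drug for a male patient with normal cholesterol
pred = tree.predict([[1, 0]])
print(pred[0])
```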

---

## (D) Introduction to Logistic Regression

<h3 style='color: cyan; font-weight:bold'>What's Logistic Regression?</h3>

> *A classification algorithm for **categorical variables**.*

The data we have here is based on a telecommunication dataset that we would like to analyze in order to understand which customers might leave next month.

- This is *historical customer data* where each row represents one customer.

- We use this *historical record* to predict future `churn` within the customer group.

<center>
<img src="Modules_images/Logistic_reg.png">
</center>

<h3 style='color: cyan; font-weight:bold'>In which situations do we use logistic regression?</h3>

We use logistic regression to build a model for predicting customer `churn` from the given features: **one or more independent variables** are used to predict an outcome such as `churn`, which we call the **dependent variable**, representing whether or not customers will stop using the service.

**Difference between *Linear Regression* and *Logistic Regression***:

- In **Linear Regression**, 

    + we might try to predict a *continuous variable* such as <u>the price of a house, the blood pressure of a patient, or the fuel consumption of a car.</u>


- In **Logistic Regression**, 
    
    + we predict a variable which is *binary*, such as <u>yes/no, true/false, successful/not successful, pregnant/not pregnant,</u> all of which can be coded as zero or one.

    + **Independent variables** should be continuous. **Categorical variables** should be *dummy* or *indicator coded*, which means you have to *transform* them into **numeric values**.
    
    + Please note that logistic regression can be used for both *binary classification* and *multi-class classification*.

### Logistic Regression Applications

- Predicting the probability of a person having a heart attack.

- Predicting the mortality in injured patients.

- Predicting a customer's propensity to purchase a product or halt a subscription.

- Predicting the probability of failure of a given process or product.

- Predicting the likelihood of a homeowner defaulting on a mortgage.

<h3 style='color: cyan; font-weight:bold'> When is the Logistic Regression suitable?</h3>


<img src="Modules_images/logistic_linearly_separate.png" style='float: right'>
<br>
<br>
<br>

- If your data is binary:

    + 0/1, yes/no, true/false, churn/not churn, positive/negative, and so on.

- If you need probabilistic results:

    + For example, if you want to know the probability of a customer buying a product.

    + Logistic regression returns a probability score between zero and one for each sample of data.

- If your data is *linearly separable*, that is, when you need a linear decision boundary.
    
- If you need to understand the impact of a feature. (You can select features based on the statistical significance of the logistic regression coefficients, something like correlation in linear regression.)


### Building a model for customer churn

<img src="Modules_images/churn_model.png" style='float: right'>

<br>
<br>
<br>
<br>
<br>

- We define the independent variables as X and the dependent variable as y.

- The goal of logistic regression is to build a model that predicts the class of each sample, which in this case is a customer, as well as the probability of each sample belonging to a class.

- To do that, let's formalize the problem:

    + $X \in \mathbb{R}^{m \times n}$: a real-valued matrix of m (dimensions or features) by n (rows, records, or observations).

    + $y \in \{0, 1\}$: the class we want to predict, which can be either 0 or 1.

    + $\hat{y} = P(y=1|x)$: the predicted probability that a customer belongs to class 1, given x.


### Logistic Regression Vs. Linear Regression

For a better understanding, we'll look at each of them and build both models on the same `Telecommunication` dataset.

### Linear regression `income` predicting:
<center>
<img src="Modules_images/linear_predicting_1.png">
</center>


Let's predict `income`, which is a continuous value.

Suppose we select an **independent variable** such as `age` and predict the **dependent variable** `income`.

We can plot the data with `age` as the **independent variable** and `income` as the **predicted variable**.

With **linear regression**, you can *fit* a line or polynomial through the data.

We find this line by training our model, or by calculating it mathematically from the sample set.

The line is given by the equation $\hat{y} = a + b x_1$

Now we can use it to *predict* a continuous value: the `income` of an unknown customer based on his/her `age`.
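A line of the form $\hat{y} = a + b x_1$ can be fitted with NumPy's `polyfit`. The ages and incomes below are hypothetical values (income in thousands), not rows from the telecommunication dataset.

```python
import numpy as np

# Hypothetical sample: age in years, income in thousands
age = np.array([25, 30, 35, 40, 45, 50], dtype=float)
income = np.array([30, 38, 45, 52, 61, 68], dtype=float)

# For degree 1, polyfit returns the slope first and then the intercept
b, a = np.polyfit(age, income, deg=1)

# Predict the income of an unseen customer aged 42
predicted = a + b * 42
print(f"a = {a:.2f}, b = {b:.2f}, predicted income = {predicted:.2f}")
```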

### Linear regression `Churn` predicting:

> **After all these equations, what is the probability that this value belongs to `class 0`?**

<center>
<img src="Modules_images/linear_predicting_2.png">
</center>
According to the results from the step function, <u>no matter how big the value is, as long as it is greater than `0.5`, the output simply equals one, and vice versa.</u>

<center>
<img src="Modules_images/linear_predicting_3.png">
</center>

Regardless of how small the value `y` is, <u>the output is zero if it is less than `0.5`</u>.

In the end, we need a method that also gives us the probability of falling in the class. **What is the scientific solution here?**

Instead of using $\theta^TX$ directly, we pass it through a specific function called **sigmoid**:

- **Sigmoid** gives the probability that a point belongs to the class, instead of the raw value of $\theta^TX$.

<center>
<img src="Modules_images/sigmoid.png">
</center>

**So, What's Sigmoid Function?**

### Sigmoid function in logistic regression

- **Logistic Function**
>*It resembles the step function and is given by the following expression in logistic regression.*

<center>
<img src="Modules_images/sigmoid_function.png">
</center>

**Notice that in sigmoid function**

-> When the value of $\theta^TX$ gets <u>very big</u>, 
- $e^{-\theta^TX}$ in the denominator of the fraction becomes almost `0`, 
-  and the value of the **sigmoid function** gets closer to 1.

-> When the value of $\theta^TX$ gets <u>very small</u> (a large negative number), 
- $e^{-\theta^TX}$ in the denominator of the fraction becomes very large,
- and the value of the **sigmoid function** gets closer to 0.

So, when the output of the **sigmoid function** gets closer to `1`, the probability P(y=1|x) <u>goes up</u>.

In contrast, when the **sigmoid function** value is closer to `0`, the probability P(y=1|x) <u>is very small</u>.
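The limiting behaviour of the sigmoid is easy to confirm numerically. This is a minimal sketch of $\sigma(z) = \frac{1}{1 + e^{-z}}$, where `z` stands for the value of $\theta^TX$; the sample inputs are arbitrary.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Large positive inputs approach 1, large negative inputs approach 0
for z in (-10, -1, 0, 1, 10):
    print(z, round(sigmoid(z), 5))
```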

### Classification of the customer churn model

**What is the output of our model when we use the sigmoid function**?

- P(Y=1 | X)
- P(Y=0 | X) = 1 - P(Y=1 | X)

For example;

If, 

- P(Churn=1| income, age) = 0.8 (for instance)

Then,

- P(Churn=0 | income, age) = 1 - 0.8 = 0.2

<center style='font-size:21px;'>
$\sigma(\theta^TX)\; \to\;P(Y=1 | x)$
</center>

<center style='font-size:21px;'>
$1- \sigma(\theta^TX)\; \to\;P(Y=0 | x)$
</center>
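The two complementary probabilities can be read directly from a fitted model. The sketch below uses scikit-learn's `LogisticRegression` and its `predict_proba` method; the six (income, age) rows and their churn labels are hypothetical, scaled-down values made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical (income, age) rows on a small scale, with a churn label
X = np.array([[2.0, 5.0], [1.5, 4.0], [3.0, 6.0],
              [6.0, 2.0], [7.0, 1.5], [8.0, 3.0]])
y = np.array([1, 1, 1, 0, 0, 0])

model = LogisticRegression().fit(X, y)

# predict_proba returns one column per class, ordered as in model.classes_ (0, then 1)
p_stay, p_churn = model.predict_proba([[2.5, 5.5]])[0]
print(f"P(Churn=1 | x) = {p_churn:.3f}, P(Churn=0 | x) = {p_stay:.3f}")
```

The two values always sum to one, which is the relationship P(Y=0 | x) = 1 - P(Y=1 | x).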

<br>

### The Training Process &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;$\sigma(\theta^TX)\; \to\; P(Y=1 | x)$

1. <span style='color:yellow'>Initialize $\theta$ with random values, as with most machine learning algorithms. &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; $\theta$ = [-1, 2]</span>

2. Calculate the model output, $\hat{y} = \sigma(\theta^TX)$, for a customer (2, 5), which represents (income, age). &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; $\hat{y} = \sigma([-1, 2] \cdot [2, 5]) = \sigma(8) \approx 0.9997$

3. Compare the output $\hat{y}$ with the actual label of the customer, `y`, and record the difference as the error.&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&ensp;Error = 1 - 0.9997 = 0.0003
    + This is the error for only one customer out of all the customers in the training set.


4. Calculate the error for all customers. &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; Cost = $J(\theta)$

    + <u>The total error is the cost of your model, and it is calculated by the model's cost function</u>. **The cost function** represents how to calculate the error of the model, which is the difference between **the actual** and the **model-predicted values**. 

    + The cost shows how poorly the model is estimating the customers' labels. Therefore, the lower the cost, the better the model is at estimating the customers' labels correctly.

    + What we want to do is <u>try to minimize this cost.</u>
    
5. Change the $\theta$ to reduce the cost. &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;$\theta_{new}$

6. <span style='color:yellow'>Go Back to step 2.</span>

There are two important questions:

1. How can we change the values of $\theta$ so that *the cost* is reduced across iterations?

    + There are different ways to change the values of $\theta$, but one of the most popular is **gradient descent**.
<br>
<br>
2. When should we stop the iterations?

    + There are various ways to stop the iterations, but essentially you stop training by calculating **the accuracy of your model** and stopping when it is satisfactory.

---

### Logistic Regression Training

##### 1. General cost function

$\sigma(\theta^TX)\; \to\; P(y=1|x)$

- Change the weight ($\theta$) -> Reduce the cost

- **Cost Function**:

<center style='font-size:21px'>
Using the derivative of the cost function, we can find how to change the parameters to reduce the cost. Let's dive in to see how it works.
<br>
$Cost(\hat{y},\; y) = \frac{1}{2} (\sigma(\theta^TX)\; -\; y)^2$ (<strong>hard to minimize: it is not convex in $\theta$</strong>)
    
</center>
<br>
<br>
<center style='font-size:21px'>
$J(\theta) = \frac{1}{m} \sum_{i=1}^{m}{Cost(\hat{y},\; y)}$
</center>


### Plotting the cost function of the model

<img src="Modules_images/cost_function.png" style='float: right'>
<br>
<br>
<br>
<br>

- Model output: $\hat{y}$


- Actual value: y = 1 or 0 (*we want to find a simple cost function for our model*)

- If y=1 and $\hat{y}=1\; \to\; cost=0$ (*if the actual value is `1` and the predicted value is `1` or near `1`, the cost is `0` or near `0`*).

- If y=1 and $\hat{y}=0\; \to\; cost=large$ (*if the actual value is `1` and the predicted value is `0` or near `0`, the cost is **high***).

- The **minus log function** gives us such a cost function. If the actual value is `1` and the model also predicts `1`, the minus log function returns `zero cost`.
- If the prediction is smaller than `1`, the **minus log function** returns <u>a larger value</u>. So, we can use the minus log function for calculating <u>the cost of our logistic regression</u>.

### Finally, Logistic regression cost function

So, we will replace the cost function with:

<center>
<img src="Modules_images/cost_function_main.png">
</center>
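Assuming the cost in the figure is the standard log loss, $-\log(\hat{y})$ when $y=1$ and $-\log(1-\hat{y})$ when $y=0$, its behaviour can be sketched per sample as follows. The `sample_cost` name and the probabilities used are illustrative.

```python
import math

def sample_cost(y_hat, y):
    """Log-loss for one sample: -log(y_hat) if y == 1, else -log(1 - y_hat)."""
    return -math.log(y_hat) if y == 1 else -math.log(1.0 - y_hat)

# Actual label is 1: a confident correct prediction costs almost nothing,
# a confident wrong prediction is penalised heavily.
print(sample_cost(0.99, 1))
print(sample_cost(0.01, 1))
```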

#### Let's recap logistic regression cost function:

> *Our objective was to find a model that best estimates the actual labels.*

Finding the best model means finding the best parameters $\theta$ for that model. So, the first question was:

- How do we find the best parameters for our model?

    + Minimize the cost function. In other words, minimize ***J($\theta$)***.


- How do we minimize the cost function?

    + Using **Gradient Descent**.


- What is **Gradient Descent**?

    + A technique that uses the derivative of a cost function to change the parameter values in order to <u>minimize the cost</u>.

### Using gradient descent to minimize the cost 

**Gradient descent** changes the parameter values so as to minimize the cost. (*This means adjusting the values of $\theta_1$ and $\theta_2$ until the cost reaches its minimum.*)

First, we need to minimize the cost function ***J***, which is a function of the variables $\theta_1$ and $\theta_2$. Let's add a dimension for the observed cost or error, ***J($\theta_1$, $\theta_2$)***.

With these dimensions, we get the surface shown in the picture. <u>It represents the error value for different values of the parameters. This surface, the **error** as a function of the parameters, is called the **error curve** or **error bowl**.</u>

<img src="Modules_images/GD_minimize_1.png" style='float:right; width:950px'>
<img src="Modules_images/tangent_line_cost_function.png" style='float:left; width:750px;height:547px'>


Now, the question here is: **which point is the best point for the cost function?** (*A very useful reference: [Logistic Regression Cost Error Solvers](https://stackoverflow.com/questions/38640109/logistic-regression-python-solvers-defintions)*)

- To find the lowest position on the error curve, you have to find the minimum value of the cost by changing the parameters. 

    + First, we need to locate a <span style='color:orange'><b>point</b></span> on the bowl.

    + Then you change the parameters by **($\Delta\theta_1$, $\Delta\theta_2$)** and take a step on the surface.
    
    + As long as we're going downwards, we can take one more step. The steeper the slope, the further we can step. We keep taking steps as we approach the lowest point, until we reach the flat surface.

    + This is the minimum point of our curve and the **optimum ($\theta_1$, $\theta_2$)**.

- What are these steps, really? In which direction should we take them to make sure we descend? And how big should the steps be?

    + To find the direction and the size of these steps, in other words, to find how to update the parameters, you should calculate the gradient of the cost function at that point.
    
    + The **gradient** is the slope of the surface at every point, and the direction of the gradient is the direction of the greatest uphill.

        * For example, at this <span style='color:orange'><b>point</b></span>, if we take the partial derivative of *J*($\theta$) with respect to each parameter, it gives the slope of the move for each parameter at that point. To guarantee that we go down the <u>error curve</u> and decrease *J*, we move in the opposite direction of the gradient.

            <img src="Modules_images/derivative_J.png" style='float: center'>

    + The **gradient value** also indicates <u>how big a step to take</u>. 

        * If the slope is <u>large</u>, we should take a <u>large step</u> because we're far from the minimum.

           <img src="Modules_images/slope_steps.png" style='float: center'>

        * If the slope is <u>small</u>, we should take a <u>smaller step</u>.

+ **Gradient descent** takes increasingly smaller steps towards the minimum with each iteration. The **partial derivative** of the cost function J with respect to each parameter is written as <span style='font-size:24px;color:red'>$\frac{\partial{J}}{\partial\theta_1}$</span>

+ This expression returns the slope at that point, and we should update the parameters in the opposite direction.

+ The vector of all these slopes is the gradient vector <span style='font-size:20px;color:red'>$\nabla J$</span>, and we can use this vector to change or update all the parameters.
    - Take the previous values of the parameters and subtract the error derivative. This results in the new parameters: <span style='font-size:20px;color:cyan'>$New_\theta = Old_\theta - \nabla J$</span>


+ This gives new values for $\theta$ that we know will <u>decrease</u> the cost. We also multiply the **gradient value** by a constant <span style='font-size:20px;color:yellow'>$\eta$</span> (eta), so the update becomes $New_\theta = Old_\theta - \eta \nabla J$.

+ <span style='font-size:20px;color:yellow;font-weight:bold'>$\eta$</span> is the **learning rate**, which gives us additional control over <u>how fast we move on the surface</u>.

    - **Gradient descent** is like taking steps in the current direction of the slope, and the **learning rate** is like the length of the step you take. These are our new parameters. Notice that it's an iterative operation: in each iteration we update the parameters and reduce the cost, until the algorithm converges on an **acceptable minimum**.

<center>
<img src="Modules_images/GD_minimize.png">
</center>
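The whole loop (compute the output, measure the cost, step against the gradient) can be sketched from scratch with NumPy. This is a minimal illustration that assumes the log-loss cost, a hypothetical six-row dataset with a bias column, a zero starting point instead of random values (for reproducibility), and an arbitrary learning rate `eta = 0.5` with a fixed 200 iterations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """Average log-loss J(theta) over all samples."""
    y_hat = sigmoid(X @ theta)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def gradient(theta, X, y):
    """Gradient of J with respect to theta."""
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

# Hypothetical data: a bias column of ones plus a single feature
X = np.array([[1, 0.5], [1, 1.0], [1, 1.5], [1, 3.0], [1, 3.5], [1, 4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

theta = np.zeros(2)  # starting point (zeros here, for reproducibility)
eta = 0.5            # learning rate, an assumed value
history = []
for _ in range(200):
    history.append(cost(theta, X, y))
    theta = theta - eta * gradient(theta, X, y)  # New = Old - eta * grad J

print(f"cost: {history[0]:.4f} -> {history[-1]:.4f}, theta = {theta}")
```

A fixed iteration count is used as the stopping rule only to keep the sketch short; stopping when the cost or the accuracy stops improving is a common alternative.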

+ This is how the **derivative concept** is expressed in `Leibniz notation`. (*Ref: [Khan Academy](https://youtu.be/N2PpRnFqnqY)*)
    - Delta ($\Delta$) means "change in value"; we use the change in value to calculate the slope of a line.
<center>
<img src="Modules_images/derivative_concept.png" width=1250px>
</center>

### Training algorithm recap
<center>
<img src="Modules_images/logistic_training_recap.png">
</center>

---

## (E) Evaluation Metrics in Classification

### Classification Accuracy
<img src="Modules_images/classification_acc.png" style='float:right'>

#### How are we calculating the accuracy of classification?

- We train the model, and then calculate the accuracy using the test set.

- We pass the test set to our model and obtain the predicted labels.

- Now, the question is, "How accurate is the model?".

- Basically, we **compare** the <u>actual values</u> `y` in the test set with the <u>values $\hat{y}$ predicted</u> by the model to calculate the accuracy of the model.

There are different model evaluation metrics, but we cover just three of them here:

1. Jaccard index.

2. F1-score.

3. Log loss.

### 1- Jaccard Index (Jaccard similarity coefficient)
<img src="Modules_images/Jaccard_index.png" style='float:right'>

> It's the size of the intersection divided by the size of the union of the **actual values y** and the **predicted values $\hat{y}$**.

y: Actual labels

<img src="Modules_images/Venn_diagram.jpg" style='float:right'>

$\hat{y}$: Predicted labels

<span style='font-size:24px;'>
$J(y,\hat{y}) = \frac{| y\; \cap\; \hat{y} |}{| y\; \cup\; \hat{y} |} = \frac{| y\; \cap\; \hat{y} |}{|y| + |\hat{y}|\; -\; | y\; \cap\; \hat{y} |}$
</span>

#### Example:

<span style='font-size:20px;'>

- y: [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

- $\hat{y}$: [1, 1, 0, 0, 0, 1, 1, 1, 1, 1]

- $J(y,\hat{y}) = \frac{8}{(10+10) - 8} = \frac{8}{12} \approx 0.67$
</span>
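The worked example can be reproduced in code. The sketch follows the definition used above, where the intersection is the number of positions at which the actual and predicted labels agree; `jaccard_matching` is a helper name introduced here.

```python
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_hat = [1, 1, 0, 0, 0, 1, 1, 1, 1, 1]

def jaccard_matching(y_true, y_pred):
    """Matching labels divided by |y| + |y_hat| - matching labels."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / (len(y_true) + len(y_pred) - matches)

score = jaccard_matching(y, y_hat)
print(round(score, 3))
```

Note that scikit-learn's `jaccard_score` uses a different, per-class definition (TP / (TP + FP + FN) for the positive class), so it would generally not return the same number for these labels.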

### 2- F1-score
<img src="Modules_images/f1_score.png" style='float:right'>
<br><br><br>

> A **confusion matrix** separates the predictions into sections, which makes it easier to check the accuracy of the predicted labels against the true labels.

Assume our test set has only 40 customers. This matrix shows the correct and wrong predictions $\hat{y}$ in comparison with the actual labels `y`:

- The first row is for customers whose actual churn value in the test set is **1**. Out of 40 customers, the churn value of 15 of them is 1. 
    
    + Predicted correctly, in agreement with the **true label**: <u>6 out of 15</u>.
    + Predicted wrongly: <u>9 out of 15</u>.


- The second row is for customers whose actual churn value in the test set is **0**. Out of 40 customers, the churn value of 25 of them is 0.
    
    + Predicted correctly, in agreement with the **true label**: <u>24 out of 25</u>.
    + Predicted wrongly: <u>1 out of 25</u>.
    

<img src="Modules_images/f1_score_2.png" style='float:right'>

<span style='font-size: 18px'>

1. **Precision**:  <u>Is a measure of accuracy, provided that a class label has been predicted.</u>

    - Precision = TP / (TP + FP)


2. **Recall**: <u>Is the True positive rate.</u>

    - Recall = TP / (TP + FN)


3. **F1-score**: 
<span style='font-size: 21px'>

   - $2\; \times\; \frac{(\; Precision\; \times\; Recall\; )} {(\; Precision\; +\; Recall\; )}$

</span>
</span>
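Using the counts from the confusion matrix described above, and treating churn = 1 as the positive class (an assumption about which class is reported), the three formulas work out as follows.

```python
# Counts taken from the 40-customer confusion matrix, with churn = 1 as positive
tp = 6   # actual 1, predicted 1
fn = 9   # actual 1, predicted 0
fp = 1   # actual 0, predicted 1
tn = 24  # actual 0, predicted 0

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (precision * recall) / (precision + recall)

print(f"precision = {precision:.3f}, recall = {recall:.3f}, F1 = {f1:.3f}")
```

The model is precise when it predicts churn but misses most of the customers who actually churn, which is why the F1-score is pulled down by the low recall.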

### 3- Log Loss (Logarithmic loss)
<img src="Modules_images/Log_loss.png" style='float:right'>

> Sometimes, the output of a classifier is the probability of a class label instead of prediction of the label, itself.


<h4> In our example, in logistic regression, the output can be the probability of customer churn, a value between 0 and 1.</h4>

<b>The predicted output</b> is a probability value between 0 and 1. Predicting a probability of 0.13 when the <b>actual value</b> is 1 would be bad and would result in a <u>high log loss</u>.


Then, we calculate <u>the average log loss</u> across all rows of the test set. Better classifiers have progressively smaller values of log loss, and an ideal classifier has a log loss of 0.


<u>A classifier with lower **log loss** gives more **accurate** probability estimates.</u>
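The average log loss can be sketched in a few lines, assuming the standard definition with natural logarithms. The `log_loss` helper is written out here for clarity, and the two probability lists are hypothetical predictions for the same four customers.

```python
import math

def log_loss(y_true, y_prob):
    """Average of -[y*log(p) + (1-y)*log(1-p)] over all rows."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# The single bad prediction from the text: probability 0.13 for an actual 1
print(round(log_loss([1], [0.13]), 3))

# Hypothetical predictions from two classifiers on the same labels
y_true = [1, 0, 1, 0]
good = log_loss(y_true, [0.9, 0.1, 0.8, 0.2])
poor = log_loss(y_true, [0.6, 0.5, 0.13, 0.4])
print(round(good, 3), round(poor, 3))
```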

---