Unlike regression, classification does not apply when the target variable is continuous. It applies when the target is a category that each record must be assigned to. The target can be binary (two classes) or non-binary (more than two classes):

    -   Binary Classification examples
        -   True
        -   False
    -   Non Binary Classification examples
        -   small
        -   medium
        -   large
        -   extra large 

Different models and algorithms suit different kinds of classification:

    -   Models for Binary Classification
        -   Naive Bayes
        -   Logistic Regression
        -   K-Nearest Neighbors
        -   Support Vector Machine
        -   Decision Tree
        -   Random Forest

    -   Models for Non-Binary (Multi-Class) Classification
        -   Naive Bayes
        -   Gradient Boosting
        -   K-Nearest Neighbors
        -   Decision Tree
        -   Random Forest
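
As a rough illustration of the two lists above, the sketch below fits one of the listed models (a decision tree) once on a two-class target and once on a target with more than two classes. The single-feature toy data and the variable names are made up for this example and are not part of the original notebook.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: one numeric feature per record (assumed for illustration).
X = [[1.0], [2.0], [3.0], [10.0], [11.0], [12.0], [20.0], [21.0], [22.0]]

# Binary target: only two classes (0 = False, 1 = True).
binary_labels = [0, 0, 0, 1, 1, 1, 1, 1, 1]

# Non-binary target: more than two classes.
multi_labels = ["small"] * 3 + ["medium"] * 3 + ["large"] * 3

# The same estimator API is used in both cases; only the target changes.
binary_model = DecisionTreeClassifier(random_state=0).fit(X, binary_labels)
multi_model = DecisionTreeClassifier(random_state=0).fit(X, multi_labels)

binary_pred = binary_model.predict([[2.5], [15.0]])
multi_pred = multi_model.predict([[2.5], [11.5], [25.0]])

print(list(binary_pred))
print(list(multi_pred))
```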

These models are commonly evaluated with the following metrics:

    -   Accuracy
    -   Precision
    -   Recall
    -   F1 Score
    -   ROC AUC
    -   Confusion Matrix

#### Accuracy

Classification results are described with a few standard terms, one of which is True Positive. Consider an example. Suppose our model predicts whether it will rain today in different areas of a city, so there are 2 possible outcomes: "Rain" and "No Rain". Suppose our data has 10 records, one per area.

    -    Model prediction
        -   4 Rain
        -   6 No Rain
        
    -   Actual Status
        -   6 Rain
        -   4 No Rain
    
Out of the 4 "Rain" predictions, only 3 turned out to be correct, so our true positive count is 3.

A True Negative is a record where the prediction and the actual value are both "No Rain". The totals alone do not give this number; the two columns have to be compared record by record. In the data below, 3 of the 6 "No Rain" predictions match an actual "No Rain", so our true negative count is 3. The code below works through both counts.

In [202]:
# importing libraries
import pandas as pd
import numpy as np

In [203]:
rain_df = pd.DataFrame(data = [["area 1","yes","Yes"],
                               ["area 2","No","Yes"],
                               ["area 3","No","No"],
                               ["area 4","yes","Yes"],
                               ["area 5","yes","No"],
                               ["area 6","No","Yes"],
                               ["area 7","No","Yes"],
                               ["area 8","No","No"],
                               ["area 9","No","No"],
                               ["area 10","yes","Yes"]],columns = ["Area Code","Prediction","Actual"])

In [204]:
rain_df.head()

|   | Area Code | Prediction | Actual |
|---|---|---|---|
| 0 | area 1 | yes | Yes |
| 1 | area 2 | No | Yes |
| 2 | area 3 | No | No |
| 3 | area 4 | yes | Yes |
| 4 | area 5 | yes | No |


In [205]:
# convert yes/no into 1/0 with LabelEncoder
from sklearn.preprocessing import LabelEncoder

In [206]:
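# Lowercase first so "yes" and "Yes" share one label; LabelEncoder then sorts
# the labels alphabetically, giving no -> 0 and yes -> 1 in both columns.
le = LabelEncoder()
rain_df["Prediction"] = le.fit_transform(rain_df["Prediction"].str.lower())
rain_df["Actual"] = le.fit_transform(rain_df["Actual"].str.lower())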


In [207]:
# Let's look into our new dataframe
rain_df.head()

|   | Area Code | Prediction | Actual |
|---|---|---|---|
| 0 | area 1 | 1 | 1 |
| 1 | area 2 | 0 | 1 |
| 2 | area 3 | 0 | 0 |
| 3 | area 4 | 1 | 1 |
| 4 | area 5 | 1 | 0 |


In [208]:
# A record where both Prediction and Actual are 1 is a true positive;
# a record where both are 0 is a true negative.
rain_df["TP Status"] = np.where((rain_df["Prediction"] == 1) & (rain_df["Actual"] == 1), "true positive", "other")
rain_df["TN Status"] = np.where((rain_df["Prediction"] == 0) & (rain_df["Actual"] == 0), "true negative", "other")

In [209]:
rain_df.head()

|   | Area Code | Prediction | Actual | TP Status | TN Status |
|---|---|---|---|---|---|
| 0 | area 1 | 1 | 1 | true positive | other |
| 1 | area 2 | 0 | 1 | other | other |
| 2 | area 3 | 0 | 0 | other | true negative |
| 3 | area 4 | 1 | 1 | true positive | other |
| 4 | area 5 | 1 | 0 | other | other |


In [210]:
# Store the total number of records in the data in a variable
total_record = len(rain_df)
print(total_record)

10


In [211]:
# Count the records whose status is true positive, then those that are true negative
tp_count = len(rain_df[rain_df["TP Status"]=="true positive"])
print(tp_count)


3


In [212]:
tn_count = len(rain_df[rain_df["TN Status"]=="true negative"])
print(tn_count)

3


In [213]:
# Accuracy = (true positives + true negatives) / total records
accuracy = (tp_count+tn_count)/total_record
print(f"The accuracy is {accuracy}")

The accuracy is 0.6


In [214]:
# Now that we have computed the accuracy by hand, let's compare it with the sklearn metric
from sklearn.metrics import accuracy_score

In [215]:
# sklearn metrics take the true labels first, then the predictions
accuracy_sklearn = accuracy_score(rain_df['Actual'], rain_df['Prediction'])
print(f"the accuracy based on sklearn metrics is {accuracy_sklearn}")

the accuracy based on sklearn metrics is 0.6


The manual calculation matches the sklearn result.
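
As a small cross-check, `accuracy_score` also accepts `normalize=False`, in which case it returns the number of correct predictions instead of the fraction. The sketch below re-types the encoded rain labels by hand so that it runs on its own (the list names are chosen for this example); the count it reports should equal true positives plus true negatives, 3 + 3.

```python
from sklearn.metrics import accuracy_score

# Encoded labels re-typed from the rain data (1 = rain, 0 = no rain).
actual = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]
prediction = [1, 0, 0, 1, 1, 0, 0, 0, 0, 1]

# normalize=False returns the count of correct predictions (TP + TN).
correct = int(accuracy_score(actual, prediction, normalize=False))

# The default (normalize=True) divides that count by the number of records.
accuracy = accuracy_score(actual, prediction)

print(f"correct predictions: {correct} out of {len(actual)}")
print(f"accuracy: {accuracy}")
```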

Now let's look at false positives and false negatives:

    -   A False Positive is a record where the prediction says "Rain" but the actual data says "No Rain"
    -   A False Negative is a record where the prediction says "No Rain" but the actual data says "Rain"

In [216]:
rain_df["FP Status"] = np.where((rain_df["Prediction"] == 1) & (rain_df["Actual"] == 0) ,"false positive","other")
rain_df["FN Status"] = np.where((rain_df["Prediction"] == 0) & (rain_df["Actual"] == 1) ,"false negative","other")

In [217]:
rain_df.head(10)

|   | Area Code | Prediction | Actual | TP Status | TN Status | FP Status | FN Status |
|---|---|---|---|---|---|---|---|
| 0 | area 1 | 1 | 1 | true positive | other | other | other |
| 1 | area 2 | 0 | 1 | other | other | other | false negative |
| 2 | area 3 | 0 | 0 | other | true negative | other | other |
| 3 | area 4 | 1 | 1 | true positive | other | other | other |
| 4 | area 5 | 1 | 0 | other | other | false positive | other |
| 5 | area 6 | 0 | 1 | other | other | other | false negative |
| 6 | area 7 | 0 | 1 | other | other | other | false negative |
| 7 | area 8 | 0 | 0 | other | true negative | other | other |
| 8 | area 9 | 0 | 0 | other | true negative | other | other |
| 9 | area 10 | 1 | 1 | true positive | other | other | other |
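
The four status columns in this table correspond to the four cells of a confusion matrix, one of the metrics listed earlier. As a sketch, sklearn's `confusion_matrix` can produce the same counts in one call. The encoded labels are re-typed by hand here so the block runs on its own; the list names are chosen for this example.

```python
from sklearn.metrics import confusion_matrix

# Encoded labels re-typed from the rain table (1 = rain, 0 = no rain).
actual = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]
prediction = [1, 0, 0, 1, 1, 0, 0, 0, 0, 1]

# confusion_matrix takes the true labels first, then the predictions.
# For 0/1 labels the layout is [[TN, FP], [FN, TP]].
cm = confusion_matrix(actual, prediction)
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
```

Rows of the matrix are the actual classes and columns are the predicted classes, so the false negatives sit in the bottom-left cell.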


#### Precision

Precision is calculated with the formula below:

    -   TruePositives / (TruePositives + FalsePositives) 

In [218]:
# Count the false positives and false negatives so we can apply the formula above
fp_count = len(rain_df[rain_df["FP Status"]=="false positive"])
fn_count = len(rain_df[rain_df["FN Status"]=="false negative"])

In [219]:
print(f"the true positive is {tp_count}")
print(f"the true negative is {tn_count}")
print(f"the false positive is {fp_count}")
print(f"the false negative is {fn_count}")

the true positive is 3
the true negative is 3
the false positive is 1
the false negative is 3


In [224]:
# Precision as per the formula
precision = tp_count/(tp_count+fp_count)
print(f"the precision is {precision}")

the precision is 0.75


In [225]:
# Now that we have computed the precision by hand, let's compare it with the sklearn metric
from sklearn.metrics import precision_score

In [226]:
# precision_score expects the true labels first, then the predictions
precision_sklearn = precision_score(rain_df['Actual'], rain_df['Prediction'])
print(f"the precision based on sklearn metrics is {precision_sklearn}")

the precision based on sklearn metrics is 0.75
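
The order of the arguments matters for `precision_score`. sklearn's metric functions take the true labels first and the predictions second, and unlike accuracy, precision is not symmetric in the two. The sketch below re-types the encoded rain labels by hand so that it runs on its own (the list names are chosen for this example). Passing the actual values first gives 0.75, matching the formula; passing the predictions first gives 0.5, which is the recall of the predictions rather than their precision.

```python
from sklearn.metrics import precision_score, recall_score

# Encoded labels re-typed from the rain table (1 = rain, 0 = no rain).
actual = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]
prediction = [1, 0, 0, 1, 1, 0, 0, 0, 0, 1]

# Signature is precision_score(y_true, y_pred): true labels first.
precision_correct = precision_score(actual, prediction)

# Swapping the arguments treats the predictions as the ground truth, which
# computes TP / (TP + FN), i.e. the recall of the original predictions.
precision_swapped = precision_score(prediction, actual)
recall = recall_score(actual, prediction)

print(f"precision (actual first): {precision_correct}")       # 0.75
print(f"precision (arguments swapped): {precision_swapped}")  # 0.5
print(f"recall (actual first): {recall}")                     # 0.5
```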
