### Learning Objective:

At the end of the experiment, you will be able to:

*  understand the performance metrics used to evaluate a Decision Tree classifier


In [None]:
#@title Experiment Walkthrough Video
from IPython.display import HTML

HTML("""<video width="420" height="340" controls>
  <source src="https://cdn.exec.talentsprint.com/content/performance_metrics.mp4">
</video>
""")

## Dataset

### History

Social network advertising, also called social media targeting, is a group of terms used to describe forms of online advertising that focus on social networking services. One of the major benefits of this type of advertising is that advertisers can use the users’ demographic information to target their ads appropriately. Other advantages are that advertisers can reach users who are interested in their products, detailed analysis and reporting are possible, the information gathered is real rather than derived from statistical projections, and the users’ IP addresses are not accessed.

### Description

The dataset chosen for this experiment is Social Network Ads. It contains 400 records with the following 5 columns:


**User ID** - A unique ID that identifies each person.

**Gender** - The gender of the person, male or female.

**Age** - The age of the person.

**EstimatedSalary** - The estimated salary of the person.

**Purchased** - Takes the value ‘0’ (not purchased) or ‘1’ (purchased). This is our target variable.

In [None]:
!wget https://cdn.iiith.talentsprint.com/aiml/Experiment_related_data/social_advertising.csv
    

### Importing required packages

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report, confusion_matrix
from mlxtend.plotting import plot_confusion_matrix

### Data preprocessing

#### Loading the Data

In [None]:
adv = pd.read_csv("social_advertising.csv")
adv.head()

In [None]:
adv = adv.drop(["User ID", "Gender"], axis = 1)
adv.head()

#### Extracting the features and labels

In [None]:
X = adv.iloc[:, [0, 1]].values # Age and estimated salary
y = adv.iloc[:, 2].values # Purchased

### Split the data into train and test set

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [None]:
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

### Model Classification

#### Training a Decision Tree Classifier

In [None]:
clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
# Fitting the data to the model
clf.fit(X_train, y_train)
# Get the predictions on the test dataset
y_pred = clf.predict(X_test)

In [None]:
accuracy_score(y_test, y_pred)

### Model Evaluation 

To evaluate the performance of a classification model, the following metrics are used:

* Confusion matrix
  * Accuracy
  * Precision
  * Recall
  * F1-Score


#### Confusion Matrix

* **Confusion matrix:** a table that describes the performance of a classification model on a set of test data for which the true values are known. Each prediction falls into one of four categories:

  * **True positive:** the correct label is positive, and the classifier also predicts positive.
  * **False positive:** the correct label is negative, but the classifier incorrectly predicts positive.
  * **True negative:** the correct label is negative, and the classifier also predicts negative.
  * **False negative:** the correct label is positive, but the classifier incorrectly predicts negative.

* **Accuracy:** the ratio of the number of correct predictions (true positives plus true negatives) to the total number of input samples.
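The four outcomes and the accuracy can be counted by hand on a small example. The sketch below is illustrative only: `y_true_demo` and `y_hat_demo` are made-up labels, not values from the Social Network Ads data, and class 1 is assumed to be the positive class.

```python
# Hypothetical labels for illustration only (1 = purchased, 0 = not purchased)
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_hat_demo = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(y_true_demo, y_hat_demo))

# Count each of the four outcomes, treating 1 as the positive class
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

# Accuracy: correct predictions divided by all predictions
accuracy = (tp + tn) / len(pairs)
print(tp, fp, tn, fn, accuracy)
```

Every prediction falls into exactly one of the four counts, so they always sum to the number of samples.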


In [None]:
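# Creating a confusion matrix (rows: true labels, columns: predicted labels)
cm = confusion_matrix(y_test, y_pred)
# mlxtend labels the rows "true label" and the columns "predicted label",
# which matches scikit-learn's layout, so the matrix is passed without transposing
fig, ax = plot_confusion_matrix(conf_mat=cm, cmap=plt.cm.RdPu)
plt.show()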

#### Classification Report

A classification report is used to measure the quality of predictions from a classification algorithm. More specifically, the counts of true positives, false positives, true negatives and false negatives are used to compute the metrics of a classification report as shown below.

In [None]:
print(classification_report(y_test, y_pred))

#### Precision-Recall Metrics

* **Precision:** the ratio of the number of positive samples correctly classified to the total number of samples classified as positive (either correctly or incorrectly).

    Precision = $\mathbf{\frac{TruePositive}{TruePositive + FalsePositive}}$

* **Recall:** the ratio of the number of positive samples correctly classified to the total number of samples that are actually positive. It tells us how many of the actual positives were found by our model.

    Recall = $\mathbf{\frac{TruePositive}{TruePositive + FalseNegative}}$

* **F1-score:** the harmonic mean of precision and recall, which combines them into a single score that balances both concerns. It is also called the F-score or the F-measure.

    F1-score = $\mathbf{\frac{2*Precision*Recall}{Precision+Recall}}$
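The three formulas above can be checked with a few lines of plain Python. This is a minimal sketch: the helper `precision_recall_f1` and the counts passed to it are hypothetical, and returning 0.0 when a denominator is zero is a choice made for this example.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) from raw counts; 0.0 when a ratio is undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical counts: 6 true positives, 2 false positives, 4 false negatives
precision, recall, f1 = precision_recall_f1(tp=6, fp=2, fn=4)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Because F1 is a harmonic mean, it lies between precision and recall and is pulled toward the lower of the two.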

#### Precision

In [None]:
from sklearn.metrics import precision_score
precision_score(y_test, y_pred, average="macro") 

#### Recall

In [None]:
from sklearn.metrics import recall_score
recall_score(y_test, y_pred, average="macro") 

#### F1-score

In [None]:
from sklearn.metrics import f1_score
f1_score(y_test, y_pred, average="macro")
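The calls above pass `average="macro"`, which computes the metric separately for each class and then takes the unweighted mean of the per-class values. The sketch below reproduces that idea for precision in plain Python. The helper `per_class_precision` and the demo labels are hypothetical and are not taken from the experiment's data.

```python
def per_class_precision(y_true, y_hat, label):
    """Precision for one class: correct predictions of `label` / all predictions of `label`."""
    predicted = sum(1 for p in y_hat if p == label)
    correct = sum(1 for t, p in zip(y_true, y_hat) if t == label and p == label)
    return correct / predicted if predicted else 0.0

# Hypothetical labels for illustration only
y_true_demo = [1, 0, 1, 1, 0, 0, 0, 0]
y_hat_demo = [1, 0, 0, 1, 0, 1, 0, 0]

# Treat each class in turn as the "positive" class
precision_0 = per_class_precision(y_true_demo, y_hat_demo, label=0)
precision_1 = per_class_precision(y_true_demo, y_hat_demo, label=1)

# Macro average: every class counts equally, regardless of how many samples it has
macro_precision = (precision_0 + precision_1) / 2
print(precision_0, precision_1, macro_precision)
```

Macro averaging gives the minority class the same weight as the majority class, which can matter when the two classes are imbalanced.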