<a href="https://colab.research.google.com/github/AsmaaMahmoudSaeed/machine_learning_1/blob/master/Day2_Notebook2_Classification_Algorithm_Performance_Evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Objective: Algorithm Performance Evaluation

In this notebook, we evaluate several classification algorithms and report each algorithm's performance across a common set of metrics.


- Metrics measured: Accuracy, Precision, Recall, F1 Score, ROC-AUC, and Log Loss.



The goal is to assess each algorithm's effectiveness in predicting outcomes. By comparing their performance across these metrics, we identify strengths and areas for improvement. This guides algorithm selection based on specific evaluation criteria.


## Importing Libraries

In [10]:
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.multiclass import OneVsRestClassifier
from sklearn import tree


In [11]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [12]:
DF=pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Data/student_dropout.csv')
DF.head()

Unnamed: 0,Marital status,Application mode,Application order,Course,Daytime/evening attendance,Previous qualification,Nacionality,Mother's qualification,Father's qualification,Mother's occupation,...,Curricular units 2nd sem (credited),Curricular units 2nd sem (enrolled),Curricular units 2nd sem (evaluations),Curricular units 2nd sem (approved),Curricular units 2nd sem (grade),Curricular units 2nd sem (without evaluations),Unemployment rate,Inflation rate,GDP,Target
0,1,8,5,2,1,1,1,13,10,6,...,0,0,0,0,0.0,0,10.8,1.4,1.74,Dropout
1,1,6,1,11,1,1,1,1,3,4,...,0,6,6,6,13.666667,0,13.9,-0.3,0.79,Graduate
2,1,1,5,5,1,1,1,22,27,10,...,0,6,0,0,0.0,0,10.8,1.4,1.74,Dropout
3,1,8,2,15,1,1,1,23,27,6,...,0,6,10,5,12.4,0,9.4,-0.8,-3.12,Graduate
4,2,12,1,3,0,1,1,22,28,10,...,0,6,6,6,13.0,0,13.9,-0.3,0.79,Graduate


In [13]:
DF.shape

(4424, 35)

In [14]:
DF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4424 entries, 0 to 4423
Data columns (total 35 columns):
 #   Column                                          Non-Null Count  Dtype  
---  ------                                          --------------  -----  
 0   Marital status                                  4424 non-null   int64  
 1   Application mode                                4424 non-null   int64  
 2   Application order                               4424 non-null   int64  
 3   Course                                          4424 non-null   int64  
 4   Daytime/evening attendance                      4424 non-null   int64  
 5   Previous qualification                          4424 non-null   int64  
 6   Nacionality                                     4424 non-null   int64  
 7   Mother's qualification                          4424 non-null   int64  
 8   Father's qualification                          4424 non-null   int64  
 9   Mother's occupation                      

In [15]:
DF_dtype_counts = DF.dtypes.value_counts()

print(DF_dtype_counts)

int64      29
float64     5
object      1
Name: count, dtype: int64


In [16]:
# Checking null values
DF.isnull().sum()

Unnamed: 0,0
Marital status,0
Application mode,0
Application order,0
Course,0
Daytime/evening attendance,0
Previous qualification,0
Nacionality,0
Mother's qualification,0
Father's qualification,0
Mother's occupation,0


In [17]:
# Split the dataset into features and target
X = DF.drop('Target', axis=1)
y = DF['Target']

In [18]:
y.unique()

array(['Dropout', 'Graduate', 'Enrolled'], dtype=object)


### <span style="color:green">Precision</span>  
is about how trustworthy your model's positive predictions are. It measures how many of the positive predictions your model made were actually correct. For example, if your model predicts that 100 emails are spam, and 90 of those emails are actually spam, then your precision is 0.9.

### <span style="color:green">Recall</span>
is about how complete your model is. It measures how many of the actual positive instances your model predicted as positive. For example, if there are 100 actual spam emails, and your model predicts that 90 of them are spam, then your recall is 0.9.

### <span style="color:green">F1 score</span>  
is the harmonic mean of precision and recall: F1 = 2 × (precision × recall) / (precision + recall). Because the harmonic mean is pulled toward the smaller of the two values, a high F1 score requires both precision and recall to be high.

In general, a higher value of precision indicates that the classifier is good at avoiding false positives, while a higher value of recall indicates that the classifier is good at avoiding false negatives. The F1 score balances the two in a single number.
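To make the three definitions concrete, here is a minimal sketch that computes them from raw counts. The function name `precision_recall_f1` and the counts are hypothetical, chosen only to extend the spam example above.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) for the positive class from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical spam filter: 90 correct spam flags, 10 false alarms, 30 missed spam emails
precision, recall, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

With these counts, precision stays at 0.9 while recall drops to 0.75, and F1 lands between the two, closer to the lower value.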

### Some examples


| Application              | Metric       | Why?                                                                                                      |
|-------------------------|--------------|-----------------------------------------------------------------------------------------------------------|
| Fraud detection         | Precision    | False positives in fraud detection can lead to innocent people being penalized.                         |
| Medical diagnosis       | Recall       | False negatives in medical diagnosis can lead to patients not receiving the treatment they need.         |
| Image classification   | Both precision and recall | Both precision and recall are important in image classification, as it's crucial to identify objects correctly and avoid false positives and false negatives.  |
| Natural language processing | Both precision and recall | Both precision and recall are crucial in natural language processing to accurately understand text and minimize both false positives and false negatives. |
| Speech recognition     | Both precision and recall | Both precision and recall are vital in speech recognition to transcribe speech accurately and minimize false positives and false negatives. |
| Machine translation    | Both precision and recall | Both precision and recall are essential in machine translation to accurately translate text and minimize false positives and false negatives. |
| Recommendation systems | Precision    | Precision is often preferred in recommendation systems to avoid suggesting irrelevant products or services. |
| Targeted advertising   | Precision    | Precision is often favored in targeted advertising to avoid displaying ads to disinterested individuals. |
| Search engines         | Both precision and recall | Both precision and recall are important in search engines to provide relevant results and minimize false positives and false negatives. |
| Social media filtering | Precision    | Precision is often favored in social media filtering to avoid removing content that might not be harmful. |



In [19]:
# Initialize the Decision Tree model inside a scaling pipeline
model = make_pipeline(StandardScaler(), tree.DecisionTreeClassifier())

# Perform k-fold cross-validation
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
y_pred = cross_val_predict(model, X, y, cv=5)

# Calculate evaluation metrics

# We use 'macro' averaging for precision, recall, and F1-score, and the 'ovr' (one-vs-rest) strategy for the multiclass AUC-ROC calculation.
# This way, we can handle multiclass classification while obtaining meaningful evaluation metrics.

accuracy = np.mean(scores)
precision = precision_score(y, y_pred, average='macro')
recall = recall_score(y, y_pred, average='macro')
f1 = f1_score(y, y_pred, average='macro')

#  Calculate ROC-AUC using One-vs-Rest strategy
y_probabilities = cross_val_predict(model, X, y, cv=5, method='predict_proba')
# 'macro' means that the ROC-AUC scores for each class are calculated independently,  and then their unweighted mean is taken. Each class contributes equally to the final score,
# regardless of the class's frequency or imbalance.
roc_auc = roc_auc_score(y, y_probabilities, average='macro', multi_class='ovr')

# Calculate log loss
logloss = -np.mean(cross_val_score(model, X, y, cv=5, scoring='neg_log_loss'))

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")
print(f"AUC-ROC: {roc_auc:.4f}")
print(f"Log Loss: {logloss:.4f}")

Accuracy: 0.6829
Precision: 0.6245
Recall: 0.6260
F1-Score: 0.6252
AUC-ROC: 0.7291
Log Loss: 11.1209
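As a rough illustration of what `average='macro'` does in the cell above, the sketch below computes per-class precision and recall by hand and then takes their unweighted mean. The helper `macro_precision_recall` and the six toy labels are hypothetical and are not taken from the dataset.

```python
def macro_precision_recall(y_true, y_pred):
    """Unweighted mean of per-class precision and recall (macro averaging)."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls = [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precisions.append(tp / (tp + fp) if (tp + fp) else 0.0)
        recalls.append(tp / (tp + fn) if (tp + fn) else 0.0)
    return sum(precisions) / len(labels), sum(recalls) / len(labels)


# Hypothetical toy labels using the three target classes
toy_true = ["Dropout", "Dropout", "Graduate", "Graduate", "Graduate", "Enrolled"]
toy_pred = ["Dropout", "Graduate", "Graduate", "Graduate", "Enrolled", "Enrolled"]

macro_p, macro_r = macro_precision_recall(toy_true, toy_pred)
print(f"macro precision={macro_p:.4f} macro recall={macro_r:.4f}")
```

Each class contributes one third of the final value here, even though `Graduate` has three samples and `Enrolled` only one.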


In [20]:
print(f"score for each fold is {scores} (total 5 folds)\
\n\n Predictions = {y_pred} ,\t Total predictions ={len(y_pred)}\
\n\nProbabilities of predictions {y_probabilities}\
\n\n Total Prob of preds = {len(y_probabilities)}\t  & their Range = {np.amin(y_probabilities)} to {np.amax(y_probabilities)}\
\n\n accuracy = {accuracy} \n precision = {precision}\n recall = {recall}\n f1 ={f1}\
\n\n roc score = {roc_auc} \t logloss = {logloss}")

score for each fold is [0.68587571 0.67909605 0.67344633 0.67457627 0.70135747] (total 5 folds)

 Predictions = ['Enrolled' 'Graduate' 'Dropout' ... 'Dropout' 'Graduate' 'Graduate'] ,	 Total predictions =4424

Probabilities of predictions [[1. 0. 0.]
 [0. 0. 1.]
 [1. 0. 0.]
 ...
 [1. 0. 0.]
 [0. 0. 1.]
 [0. 0. 1.]]

 Total Prob of preds = 4424	  & their Range = 0.0 to 1.0

 accuracy = 0.6828703632691668 
 precision = 0.6244982655110909
 recall = 0.6259835615396883
 f1 =0.6252041936811518

 roc score = 0.7291106030764688 	 logloss = 11.120881929294054


In [21]:
from sklearn.ensemble import RandomForestClassifier

# Initialize the Random Forest model
rf_model = RandomForestClassifier()

# Perform k-fold cross-validation
rf_scores = cross_val_score(rf_model, X, y, cv=5, scoring='accuracy')
rf_y_pred = cross_val_predict(rf_model, X, y, cv=5)

# Calculate evaluation metrics
rf_accuracy = np.mean(rf_scores)
rf_precision = precision_score(y, rf_y_pred, average='macro')
rf_recall = recall_score(y, rf_y_pred, average='macro')
rf_f1 = f1_score(y, rf_y_pred, average='macro')

# Calculate ROC-AUC using One-vs-Rest strategy
rf_y_probabilities = cross_val_predict(rf_model, X, y, cv=5, method='predict_proba')
rf_roc_auc = roc_auc_score(y, rf_y_probabilities, average='macro', multi_class='ovr')

# Calculate log loss
rf_logloss = -np.mean(cross_val_score(rf_model, X, y, cv=5, scoring='neg_log_loss'))

print("Random Forest Classifier Metrics:")
print(f"Accuracy: {rf_accuracy:.4f}")
print(f"Precision: {rf_precision:.4f}")
print(f"Recall: {rf_recall:.4f}")
print(f"F1-Score: {rf_f1:.4f}")
print(f"AUC-ROC: {rf_roc_auc:.4f}")
print(f"Log Loss: {rf_logloss:.4f}")


Random Forest Classifier Metrics:
Accuracy: 0.7728
Precision: 0.7330
Recall: 0.6866
F1-Score: 0.6980
AUC-ROC: 0.8867
Log Loss: 0.6189


## Learnings

---
### ROC-AUC Averaging Strategies

- <span style="color:blue">**average='macro':**</span>
  The ROC-AUC scores for each class are calculated independently, and then their unweighted mean is taken (e.g., (1 + 2 + 3) / 3 = 2). Each class contributes equally to the final score, regardless of the class's frequency or imbalance.
- <span style="color:blue">**average='micro':**</span>
    If you use 'micro' averaging, the ROC-AUC scores are calculated globally by considering all the true positives, false positives, true negatives, and false negatives together.
- <span style="color:blue">**average='weighted':**</span>
    If you use 'weighted' averaging, the ROC-AUC scores are weighted by the number of samples in each class.


### Choose what?

  - Choose <span style="color:#3fba02">macro</span> if you want to give equal importance to every class's performance and provide a balanced evaluation across all classes.
  - Choose <span style="color:#3fba02">micro</span> when overall performance across all classes is critical, especially in the presence of class imbalances.
  - Choose <span style="color:#3fba02">weighted</span> when you want to consider class distribution and give more weight to larger or more important classes.
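The difference between macro and weighted averaging can be sketched with a few lines of arithmetic. The per-class AUC values and class counts below are hypothetical, as are the helper names `macro_average` and `weighted_average`.

```python
def macro_average(scores):
    """Unweighted mean: every class counts equally."""
    return sum(scores.values()) / len(scores)


def weighted_average(scores, support):
    """Mean weighted by the number of samples in each class."""
    total = sum(support.values())
    return sum(scores[c] * support[c] for c in scores) / total


# Hypothetical per-class ROC-AUC values and class sizes
per_class_auc = {"Dropout": 0.90, "Enrolled": 0.70, "Graduate": 0.95}
class_counts = {"Dropout": 300, "Enrolled": 100, "Graduate": 600}

macro_auc = macro_average(per_class_auc)
weighted_auc = weighted_average(per_class_auc, class_counts)
print(f"macro={macro_auc:.2f} weighted={weighted_auc:.2f}")
```

In this made-up case the weak minority class drags the macro value down more than the weighted one, which is why macro is often preferred when minority-class performance matters.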

### Multiclass Strategy Options

- <span style="color:blue">**multi_class parameter: OVR**</span>

  - In the OvR strategy, you build a binary classifier for each class against all the other classes combined.
  - For example, if you have 5 classes (A, B, C, D, E), you would train classifiers for:
    - Classifier for Class A vs. Classes (B, C, D, E)
    - Classifier for Class B vs. Classes (A, C, D, E)
    - Classifier for Class C vs. Classes (A, B, D, E)
    - Classifier for Class D vs. Classes (A, B, C, E)
    - Classifier for Class E vs. Classes (A, B, C, D)
  - Each classifier outputs a "confidence" score or probability for its own class, and the class whose classifier gives the highest score is chosen as the final prediction.

- <span style="color:blue">**multi_class parameter: OVO**</span>
  
  - In the OvO strategy, you build a binary classifier for every pair of classes.
  - For example, if you have 5 classes (A, B, C, D, E), you would train classifiers for AB, AC, AD, AE, BC, BD, BE, CD, CE, and DE.
  - Each binary classifier decides between two classes, and the class with the most "votes" across all classifiers is chosen as the final prediction.
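The two decompositions described above can be enumerated directly. This is only a sketch of how the binary problems are formed for the five example classes; it does not train any model.

```python
from itertools import combinations

classes = ["A", "B", "C", "D", "E"]

# One-vs-Rest: one binary problem per class (that class against all the others)
ovr_problems = [(c, [other for other in classes if other != c]) for c in classes]

# One-vs-One: one binary problem per unordered pair of classes
ovo_problems = list(combinations(classes, 2))

for positive, rest in ovr_problems:
    print(f"OvR: {positive} vs. {''.join(rest)}")
print("OvO:", ", ".join(a + b for a, b in ovo_problems))
print(f"{len(ovr_problems)} OvR problems, {len(ovo_problems)} OvO problems")
```

For n classes, OvR needs n binary problems while OvO needs n(n-1)/2, so the gap widens quickly as the number of classes grows.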

### Choose what?

 - Choose <span style="color:#3fba02">OVR</span> when you have a large number of classes and computational efficiency matters. It simplifies the problem but might result in imbalanced binary classification problems. This is a common and practical strategy.
  

- Choose <span style="color:#3fba02">OVO</span> when you have a moderate number of classes and you want to evaluate each class's pairwise performance. It provides a more fine-grained evaluation but requires more models to be trained.
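When `multi_class='ovr'` is passed to `roc_auc_score`, no extra classifiers are trained; each class's predicted probability is simply scored against a "this class vs. the rest" labeling. The following is a minimal pure-Python sketch of that idea with macro averaging. The helper names and the toy probabilities are hypothetical, and the pairwise AUC formula is used only because it is easy to read, not because it is how scikit-learn computes it internally.

```python
def binary_auc(is_positive, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for flag, s in zip(is_positive, scores) if flag]
    neg = [s for flag, s in zip(is_positive, scores) if not flag]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ovr_macro_auc(y_true, proba, classes):
    """Score each class against the rest, then take the unweighted mean."""
    aucs = []
    for k, cls in enumerate(classes):
        is_positive = [t == cls for t in y_true]
        class_scores = [row[k] for row in proba]
        aucs.append(binary_auc(is_positive, class_scores))
    return sum(aucs) / len(aucs)


# Hypothetical toy data: columns follow the order of toy_classes
toy_classes = ["Dropout", "Enrolled", "Graduate"]
toy_y = ["Dropout", "Dropout", "Enrolled", "Graduate", "Graduate"]
toy_proba = [
    [0.7, 0.2, 0.1],
    [0.4, 0.4, 0.2],
    [0.3, 0.5, 0.2],
    [0.1, 0.3, 0.6],
    [0.5, 0.1, 0.4],
]

toy_auc = ovr_macro_auc(toy_y, toy_proba, toy_classes)
print(f"OvR macro AUC: {toy_auc:.4f}")
```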
  
### Log Loss (Logarithmic Loss):

- Log loss, also known as logarithmic loss or cross-entropy loss, is a widely used evaluation metric for classification problems. It quantifies how well a classifier's predicted probabilities match the actual class labels. In simple terms, it measures how confident and accurate the model's predictions are.

- In machine learning, the goal is to minimize log loss. A model with low log loss is making confident and accurate predictions, while a model with high log loss is less reliable.

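- Log loss is used here as an evaluation metric, but it is also a common training objective: models such as logistic regression and neural networks minimize it directly, whereas tree-based models typically split on other criteria such as Gini impurity. Either way, it is a way to assess how well your classifier's predicted probabilities align with the true classes.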
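A small hand computation may help show why confident mistakes are punished so heavily. The sketch below assumes probabilities are clipped to `[eps, 1 - eps]` before taking the logarithm; the value `eps=1e-15`, the function name `log_loss_manual`, and the toy probabilities are all assumptions made for illustration.

```python
import math


def log_loss_manual(true_idx, proba, eps=1e-15):
    """Mean negative log of the probability assigned to the true class."""
    total = 0.0
    for idx, row in zip(true_idx, proba):
        p = min(max(row[idx], eps), 1 - eps)  # clip to avoid log(0)
        total += -math.log(p)
    return total / len(true_idx)


true_idx = [0, 1, 2, 0]  # index of the true class for each sample

# Soft probabilities: the last sample is uncertain but not confidently wrong
soft = [[0.7, 0.2, 0.1], [0.2, 0.6, 0.2], [0.1, 0.3, 0.6], [0.4, 0.4, 0.2]]

# Hard 0/1 probabilities: the last sample is confidently wrong
hard = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]

soft_loss = log_loss_manual(true_idx, soft)
hard_loss = log_loss_manual(true_idx, hard)
print(f"soft={soft_loss:.4f} hard={hard_loss:.4f}")
```

A single hard misclassification dominates the average. This likely explains the large gap between the two models above: the decision tree's predicted probabilities are all 0 or 1, so every wrong prediction incurs the maximum penalty, while the random forest averages over many trees and produces softer probabilities.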

---
