# Evaluation metrics

## 1. Introduction

Evaluation metrics are quantitative measures used to assess the performance of a model or algorithm. They provide a standardized way to measure how well the model is performing on a specific task or dataset. The choice of evaluation metrics depends on the problem at hand and the goals of the project.

Here are some commonly used evaluation metrics and the concepts behind them:

1. Accuracy: Accuracy is a widely used metric for classification tasks. It calculates the percentage of correctly predicted instances out of the total instances. Accuracy is straightforward, but it can be misleading on imbalanced datasets where the classes are not equally represented: if 95% of the instances belong to one class, a model that always predicts that class reaches 95% accuracy without learning anything useful.

2. Precision and Recall: Precision and recall are metrics commonly used in binary classification tasks, particularly when the classes are imbalanced. Precision measures the proportion of correctly predicted positive instances out of the total instances predicted as positive. Recall, also known as sensitivity or true positive rate, measures the proportion of correctly predicted positive instances out of the total actual positive instances. Precision focuses on minimizing false positives, while recall focuses on minimizing false negatives.

3. F1-score: The F1-score is the harmonic mean of precision and recall: F1 = 2 × (precision × recall) / (precision + recall). Because the harmonic mean is pulled toward the smaller of the two values, a model scores a high F1 only when both precision and recall are high. This makes it a useful single metric when false positives and false negatives both matter.

4. Mean Squared Error (MSE): MSE is a common metric used in regression tasks. It calculates the average squared difference between the predicted and actual values. MSE gives higher weight to larger errors, making it sensitive to outliers.

5. Mean Absolute Error (MAE): MAE is another metric used in regression tasks. It calculates the average absolute difference between the predicted and actual values. MAE is less sensitive to outliers compared to MSE because it doesn't square the errors.

6. Area Under the Curve (AUC): AUC is commonly used in binary classification tasks to evaluate a model's predicted probabilities or scores. It measures the area under the Receiver Operating Characteristic (ROC) curve, which plots the true positive rate against the false positive rate at various threshold settings. AUC therefore summarizes the model's ability to distinguish between positive and negative instances across all thresholds: a value of 0.5 corresponds to random ranking, and 1.0 to perfect separation of the two classes.
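One way to build intuition for AUC is its ranking interpretation: it equals the fraction of (positive, negative) pairs in which the positive instance receives the higher score, with ties counted as half. The sketch below computes AUC this way in plain Python. The function name `roc_auc` and the toy labels and scores are made up for illustration, and the pairwise loop is meant for small examples only, not as a replacement for a library routine.

```python
def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative instance")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (illustrative): two negatives, two positives
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(y_true, y_score)
print(f"AUC: {auc:.2f}")  # 3 of the 4 pairs are ranked correctly -> 0.75
```

Only the ordering of the scores matters here, which is why AUC does not depend on any single classification threshold.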

These are just a few examples of evaluation metrics, and there are many others depending on the specific problem and the nature of the data. The choice of metric should align with the objectives of the project and take into consideration factors such as class imbalance, sensitivity to errors, and the context of the problem. It's important to understand the underlying concepts of these metrics to correctly interpret the performance of your model and make informed decisions.
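To make the point about sensitivity to errors concrete, here is a minimal sketch that compares MSE and MAE on two hypothetical sets of predictions: one where every prediction is off by 1, and one where three predictions are exact and a single prediction is off by 10. The helper names `mse` and `mae` and all the numbers are illustrative assumptions.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error: average of absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [10.0, 12.0, 14.0, 16.0]
y_close = [11.0, 11.0, 15.0, 15.0]    # every prediction is off by 1
y_outlier = [10.0, 12.0, 14.0, 26.0]  # three exact, one off by 10

print("Small errors : MSE =", mse(y_true, y_close), " MAE =", mae(y_true, y_close))
print("One outlier  : MSE =", mse(y_true, y_outlier), " MAE =", mae(y_true, y_outlier))
# Small errors : MSE = 1.0   MAE = 1.0
# One outlier  : MSE = 25.0  MAE = 2.5
```

The single large error multiplies MSE by 25 but MAE by only 2.5, so the two metrics can rank the same pair of models differently depending on how costly large errors are in the application.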

![Calculation-of-Precision-Recall-and-Accuracy-in-the-confusion-matrix.png](attachment:Calculation-of-Precision-Recall-and-Accuracy-in-the-confusion-matrix.png)
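The calculations in the confusion matrix can be sketched directly from its four counts. The example below assumes a hypothetical imbalanced test set with 5 positive and 95 negative instances. The function name `classification_metrics` and the counts are illustrative, and returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero (no predicted positives / no actual positives)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1


# Hypothetical counts: the model finds only 1 of the 5 actual positives
accuracy, precision, recall, f1 = classification_metrics(tp=1, fp=1, fn=4, tn=94)

print(f"Accuracy:  {accuracy:.2f}")   # 0.95
print(f"Precision: {precision:.2f}")  # 0.50
print(f"Recall:    {recall:.2f}")     # 0.20
print(f"F1-score:  {f1:.2f}")         # 0.29
```

Accuracy looks excellent at 95%, yet the model misses four of the five positives. Recall and F1 expose this, which is why they are preferred when the classes are imbalanced.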

## 2. Example

In [4]:
import warnings
# Ignore all warnings
warnings.filterwarnings("ignore")
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the Wine dataset
data = load_wine()
X = data.data
y = data.target

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

# Print the evaluation metrics
print("Accuracy: {:.2f}%".format(accuracy * 100))
print("Precision: {:.2f}%".format(precision * 100))
print("Recall: {:.2f}%".format(recall * 100))
print("F1-score: {:.2f}".format(f1))

Accuracy: 97.22%
Precision: 97.41%
Recall: 97.22%
F1-score: 0.97


In this example, we load the Wine dataset using load_wine() from scikit-learn. We then split the data into training and testing sets using an 80-20 split. Next, we train a logistic regression model on the training set and make predictions on the test set. Finally, we calculate the evaluation metrics including accuracy, precision, recall, and F1-score using the accuracy_score(), precision_score(), recall_score(), and f1_score() functions from scikit-learn.

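Note that the Wine dataset has three classes, so this is a multi-class classification task. By default, precision_score(), recall_score(), and f1_score() use average='binary', which raises an error when the target has more than two classes, so the average parameter must be set explicitly. With average='weighted', each metric is computed per class and the per-class values are then averaged, weighted by the number of true instances of each class.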
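The difference between averaging strategies can be illustrated without any library. The sketch below computes per-class recall on a small hypothetical three-class example and then combines it in two ways: an unweighted mean (the idea behind average='macro' in scikit-learn) and a mean weighted by class size (the idea behind average='weighted'). The helper name `per_class_recall` and the toy labels are assumptions made for this example.

```python
def per_class_recall(y_true, y_pred):
    """Return (recalls, supports), one entry per class in sorted label order."""
    recalls, supports = [], []
    for label in sorted(set(y_true)):
        support = sum(1 for t in y_true if t == label)
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        recalls.append(tp / support)
        supports.append(support)
    return recalls, supports


# Toy labels (illustrative): class 0 is common, class 2 is rare
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 0, 1]

recalls, supports = per_class_recall(y_true, y_pred)

# Macro: every class counts equally, regardless of its size
macro_recall = sum(recalls) / len(recalls)
# Weighted: each class counts in proportion to its number of true instances
weighted_recall = sum(r * s for r, s in zip(recalls, supports)) / sum(supports)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"Macro recall:    {macro_recall:.2f}")     # 0.56
print(f"Weighted recall: {weighted_recall:.2f}")  # 0.80
print(f"Accuracy:        {accuracy:.2f}")         # 0.80
```

Weighted recall always equals accuracy, because weighting each class's recall by its size simply counts the correct predictions overall. This is why Accuracy and Recall show the same value in the output above. Macro averaging, in contrast, is pulled down by the rare class that the model never predicts correctly.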