# Part 3: Understanding evaluation metrics for classification
In this section, you will explore different metrics that can be used for classification. For this purpose, we will be studying the [Pima Indians Diabetes Dataset](https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database). 

The goal is to train a classifier to diagnose (predict) diabetes given a set of input features.

You will use the evaluation metrics you implemented to assess the quality of your model.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
import matplotlib.pyplot as plt
import numpy as np

# We will be using the metrics that you implemented in part 1
import metrics

As before, let's load the dataset and then do some exploration of the data. You may add further analyses if you consider it necessary.

In [None]:
# Let's load the dataset
diabetes = pd.read_csv('../data/diabetes.csv', sep=",")
diabetes.head()

In [None]:
diabetes.hist(figsize=(12, 10), bins=30, edgecolor="black")
plt.subplots_adjust(hspace=0.7, wspace=0.4)

**Question**: What can you say about the distribution of the different features?

**Answer**: 

Now, let's move into training a classifier. 

As in part 2, we will train a single model without focusing on the task of model selection. In real-life problems, you cannot do this, as you will need to explore different options that can lead to the best model possible (the one that generalizes best).

We define a function `fit_and_test` that trains a linear discriminant analysis (LDA) model on a training set and returns that model's predictions for a test set.

*Note:* Any other classifier could have been used. You are free to test other classifier algorithms already covered in the course.

In [None]:
def fit_and_test(X_train, y_train, X_test):
    # Create an LDA model
    model = LinearDiscriminantAnalysis()

    # Train the model
    model.fit(X_train, y_train)

    # Make predictions on the testing set
    y_pred = model.predict(X_test)

    return y_pred

Now, let's train and test. As we are not going to do model selection, in this lab we will split the data only once, into training and testing sets. The training data will not be further split into training and validation sets.

**Remember this should not be done when solving a machine learning problem.**
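
For reference, a three-way split could look roughly like the sketch below. It is a library-free illustration, not part of the lab: the function name `three_way_split`, the 60/20/20 proportions, and the sample count of 100 are all assumptions chosen for the example.

```python
import random


def three_way_split(n_samples, val_fraction=0.2, test_fraction=0.2, seed=42):
    """Return disjoint, shuffled index lists for train, validation and test.

    Hypothetical helper for illustration only; the fractions are assumptions.
    """
    indices = list(range(n_samples))
    # A seeded generator keeps the shuffle reproducible
    random.Random(seed).shuffle(indices)

    n_test = int(n_samples * test_fraction)
    n_val = int(n_samples * val_fraction)

    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    return train_idx, val_idx, test_idx


# 100 is a made-up number of rows, used only to show the resulting sizes
train_idx, val_idx, test_idx = three_way_split(100)
print(len(train_idx), len(val_idx), len(test_idx))
```

With a pandas DataFrame, index lists like these could be passed to `.iloc` to select the corresponding rows. The validation set would then be used to compare candidate models, and the test set only for the final estimate.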

In [None]:
# Storing inputs and output into X and y
X = diabetes.drop('Outcome', axis=1)
y = diabetes['Outcome']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# We call fit_and_test to first train the model and then predict the outcome for the test set
y_pred = fit_and_test(X_train, y_train, X_test)

### How good is this model?

Use the metrics that you implemented in part one to evaluate the model. Use the cell below for your experiments:
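
As a reminder, and assuming the usual convention that diabetic (`Outcome = 1`) is the positive class, the most common classification metrics can be written in terms of the counts of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). Your part 1 implementations may use different names or conventions.

$$
\begin{aligned}
\text{accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{precision} &= \frac{TP}{TP + FP} \\
\text{recall} &= \frac{TP}{TP + FN} \\
F_1 &= 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
\end{aligned}
$$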

In [None]:
## YOUR CODE HERE


# Print the results

**Question:** Which metrics did you choose? Justify your answer.

**Answer:** 


**Question:** Analyze the specific values of the chosen metrics in the context of the problem to judge how well the model performs. Is it a good or a bad model? Provide a detailed justification of your answer.

**Answer:**

**Question:** What do precision, recall, and the F1 score each tell you about the model's diabetes predictions?

**Answer:**

### Imbalanced data
Now we will repeat the exercise while simulating a highly imbalanced dataset. In healthcare applications, for instance, healthy cases commonly far outnumber pathological ones.

We will simulate this scenario by removing 80% of the rows that belong to diabetic patients. Then, we will see how this affects the evaluation metrics.
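
To see why imbalance matters for evaluation, here is a toy illustration using made-up labels (not the diabetes data) and a trivial "classifier" that always predicts the healthy class. The helper `confusion_counts` is hypothetical and written only for this sketch; it is not the `metrics` module from part 1.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


# Made-up labels: 95 healthy patients (0) and 5 diabetic patients (1)
y_true = [0] * 95 + [1] * 5
# A trivial "model" that predicts healthy for everyone
y_pred = [0] * 100

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
# Guard against division by zero when there are no positive cases
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(f"accuracy = {accuracy:.2f}")
print(f"recall   = {recall:.2f}")
```

In this toy case the accuracy looks high even though not a single diabetic patient is detected, which is worth keeping in mind when interpreting the results of the experiment below.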

In [None]:
# Identify rows with target label 1
target_1_indices = diabetes[diabetes['Outcome'] == 1].index

# Randomly select 80% of the indices
num_to_remove = int(len(target_1_indices) * 0.8)
indices_to_remove = np.random.choice(target_1_indices, num_to_remove, replace=False)

# Remove the selected rows from the DataFrame
diabetes = diabetes.drop(indices_to_remove)


In [None]:
# Split the data into features and target
X = diabetes.drop('Outcome', axis=1)
y = diabetes['Outcome']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# We call fit_and_test to first train the model and then predict the outcome for the test set
y_pred = fit_and_test(X_train, y_train, X_test)

Use the metrics that you implemented in part one to evaluate the model. Use the cell below for your experiments:

In [None]:
## YOUR CODE HERE

# Print the results

**Question:** What can you say about this model?

**Answer:**

**Question:** Which metrics provide the most useful information in this imbalanced setting? Investigate and propose a solution.

**Answer:**