# 2024-06-18 Third call (second part, 60 minutes, 20 points)

During the exam, you can browse **only the following Websites**:

-  [Python](https://docs.python.org/3/)
-  [Numpy](https://numpy.org)
-  [Scipy](https://docs.scipy.org/)
-  [Pandas](https://pandas.pydata.org/)
-  [Scikit-Learn](https://scikit-learn.org/stable/)
-  [Matplotlib](https://matplotlib.org/)
-  [Seaborn](http://seaborn.pydata.org/)

In [1]:
#import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# The following line lets Jupyter Notebook display the plots produced by
# matplotlib directly below the code cell that generated them.
#%matplotlib inline
import seaborn as sns

EPSILON = .0000001 # tiny tolerance for managing subtle differences resulting from floating point operations

DATASET_FILE = "dataset_full.tsv"

## Rules

Comment out the line in each exercise that reads "raise NotImplementedError()". 
It is a placeholder used to check whether the exercise has been attempted. 
**It will be considered a mistake if you do not comment it out.**

## Exercise 1 (3 points)

Load the data from the file "dataset_full.tsv" (tab-separated values), which describes 721 Pokemon by their characteristics (12 characteristics in total), plus an additional binary variable that indicates whether a Pokemon is "legendary" (1) or not (0).
Use the first column named "index" as the index of the dataframe.
Save the dataframe in the variable "pokemon_full".

In [2]:
# YOUR CODE HERE
pokemon_full = pd.read_csv(DATASET_FILE, delimiter="\t", index_col="index")

# leave this code to check if you loaded the data correctly
print("Loaded `Pokemon` dataset into a dataframe of size ({} x {})".format(pokemon_full.shape[0], pokemon_full.shape[1]))
pokemon_full.head()

Loaded `Pokemon` dataset into a dataframe of size (721 x 13)


index,id,name,type_1,type_2,total,hp,attack,defense,special_attack,special_defense,speed,generationPok,is_legendaryPok
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,0
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,0
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,0
3,4,Charmander,Fire,,309,39,52,43,60,50,65,1,0
4,5,Charmeleon,Fire,,405,58,64,58,80,65,80,1,0
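
As a hedged illustration of the loading step, the sketch below reads a small tab-separated table from memory instead of from `dataset_full.tsv`. The rows and the subset of columns are copied from the output above; the names `sample_tsv` and `sample` are introduced here only for illustration.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for the TSV file (a few rows and columns only).
sample_tsv = (
    "index\tid\tname\ttype_1\ttype_2\n"
    "0\t1\tBulbasaur\tGrass\tPoison\n"
    "1\t2\tIvysaur\tGrass\tPoison\n"
    "3\t4\tCharmander\tFire\t\n"
)

# sep="\t" selects the tab as field separator;
# index_col="index" turns that column into the row index.
sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t", index_col="index")

print(sample.shape)                    # rows x columns, index excluded
print(sample.index.name)               # name of the index column
print(sample["type_2"].isna().sum())   # empty fields are read as NaN
```

The empty `type_2` field of Charmander is read as a missing value, which is why that cell appears blank in the table above.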


## Exercise 2 (3 points)

Delete the column <code><b>id</b></code> from the Pokemon dataframe using the DataFrame method *drop* with the option *inplace* set to True.

In [3]:
# YOUR CODE HERE
pokemon_full.drop(labels="id", inplace=True, axis="columns")

# check data
pokemon_full.head()

index,name,type_1,type_2,total,hp,attack,defense,special_attack,special_defense,speed,generationPok,is_legendaryPok
0,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,0
1,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,0
2,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,0
3,Charmander,Fire,,309,39,52,43,60,50,65,1,0
4,Charmeleon,Fire,,405,58,64,58,80,65,80,1,0
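
A minimal sketch of what the *inplace* option changes, using a tiny dataframe (`demo`, a name assumed here for illustration) rather than the exam dataset:

```python
import pandas as pd

demo = pd.DataFrame(
    {"id": [1, 2], "name": ["Bulbasaur", "Ivysaur"], "total": [318, 405]}
)

# Without inplace, drop returns a new dataframe and leaves `demo` untouched.
without_id = demo.drop(columns="id")
columns_before = list(demo.columns)

# With inplace=True, `demo` itself is modified and the call returns None.
result = demo.drop(labels="id", axis="columns", inplace=True)

print(list(without_id.columns))
print(columns_before)
print(result)
print(list(demo.columns))
```

Because the in-place call returns `None`, assigning its result back to the dataframe variable would lose the data.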


## Exercise 4 (14 points)

You need to implement a *simple hold-out* approach for training and testing a classifier that will be able to predict whether a Pokemon is legendary or not.

For this experiment, you need to choose the features that describe the Pokemon.
In particular:
1. You must not take into consideration the name of the Pokemon (it is not useful for the prediction!)
2. You need to have at least one numerical feature and one categorical feature in the dataset.
3. You need to split the dataset with an 80%-20% training-test proportion, use a stratified sampling approach, and set the random state (seed) equal to 42.
4. You need to compare the results of two classifiers of your choice using both the F1-measure and accuracy metrics.
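
The stratified 80%-20% split required in item 3 can be sketched on synthetic labels. The class counts below (10 positives out of 100) are illustrative assumptions, not values taken from the Pokemon dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 10 positives and 90 negatives (illustrative only).
labels = np.array([1] * 10 + [0] * 90)
features = np.arange(100).reshape(-1, 1)

# stratify=labels keeps the class proportions equal in both partitions;
# random_state=42 makes the shuffle reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)

print(len(y_tr), len(y_te))        # partition sizes
print(y_tr.mean(), y_te.mean())    # share of positives in each partition
```

Without `stratify`, a rare class can end up under-represented in the test set, or even missing from it, which makes the test metrics unreliable.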

In [4]:
# YOUR CODE HERE
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score

# Goal: predict whether a Pokemon is legendary or not.

# Features: the primary type of the Pokemon (categorical) and the sum of all
# its statistics (numerical).
X = pokemon_full[["type_1", "total"]]
# Select the target as a 1-D Series: passing a one-column DataFrame to fit()
# raises a DataConversionWarning.
y = pokemon_full["is_legendaryPok"]
# One-hot encode the categorical feature; "total" is left unchanged.
X = pd.get_dummies(X)

# No imputation step is needed: the encoded features contain no missing
# values (LogisticRegression would otherwise refuse to fit).

# Stratified 80%-20% hold-out split with a fixed seed.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)

kn = KNeighborsClassifier(n_neighbors=3)
kn.fit(X_train, y_train)

def evaluate(y_pred, y_true):
    """Print the F1 score and the accuracy of the predictions."""
    print(f"f1 score: {f1_score(y_true, y_pred)}")
    print(f"accuracy: {accuracy_score(y_true, y_pred)}")

print("Evaluating LogisticRegression")
out = lr.predict(X_test)
evaluate(out, y_test)

print("Evaluating KNN")
out = kn.predict(X_test)
evaluate(out, y_test)


Evaluating LogisticRegression
f1 score: 0.7
accuracy: 0.9586206896551724
Evaluating KNN
f1 score: 0.625
accuracy: 0.9586206896551724
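
The two classifiers reach the same accuracy but different F1 scores. A hedged sketch of why accuracy alone can hide such differences when one class is rare: the labels and predictions below are synthetic (5 positives out of 100) and are not derived from the Pokemon results.

```python
from sklearn.metrics import accuracy_score, f1_score

# Synthetic ground truth: 5 positives followed by 95 negatives.
y_true = [1] * 5 + [0] * 95

# Baseline that always predicts the majority (negative) class.
always_negative = [0] * 100

# Classifier that finds 3 of the 5 positives and raises 2 false alarms.
partial = [1, 1, 1, 0, 0] + [1, 1] + [0] * 93

acc_base = accuracy_score(y_true, always_negative)
# zero_division=0 avoids a warning when no positive is predicted.
f1_base = f1_score(y_true, always_negative, zero_division=0)

acc_part = accuracy_score(y_true, partial)
f1_part = f1_score(y_true, partial)

print(f"majority baseline: accuracy={acc_base:.2f}, f1={f1_base:.2f}")
print(f"partial detector:  accuracy={acc_part:.2f}, f1={f1_part:.2f}")
```

The accuracies of the two predictors are almost identical, while the F1 score separates a predictor that never detects the positive class from one that detects most of it.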


