# Task for Today  

***

## Legendary Pokémon Classification  

Use a feedforward neural network to predict whether a given Pokémon is **legendary** or not, based on its *features*.


<img src="https://wallpapers.com/images/hd/legendary-pokemon-pictures-7yo7x0f1l2b2tu0r.jpg" width="800" height="500" alt="legendaries">

Data available at: https://gist.github.com/armgilles/194bcff35001e7eb53a2a8b441e8b2c6

Download the zip file, extract it, and upload the `pokemon.csv` file to the Files section of Colab.

# Challenge

TAs want to battle!

<img src="https://pokemongohub.net/wp-content/uploads/2023/06/grunts-1.jpg" width="400" height="300" alt="TAs">

Rules of the challenge:

- Gotta catch 'em all! ...But give priority to the legendaries.
- F1-score is usually the measure of choice for imbalanced datasets; however, in this case we particularly want to avoid missing legendaries. They're so rare that you might not get another chance to catch 'em if they flee...
- In ML terms, recall (the share of actual legendaries the model catches) matters more than precision (the share of predicted legendaries that really are legendary) for this task. Check the whiteboard if these terms are unfamiliar.
- F2-score (i.e., [F-$\beta$-score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.fbeta_score.html) with $\beta = 2$) is hence used as the main evaluation metric for your model.

- **TAs achieved an F2-score of 0.7692. Can you beat them?!**
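
To make the role of $\beta$ in the rules above concrete, here is a minimal sketch that computes the F-$\beta$-score directly from confusion-matrix counts. The helper name `fbeta_from_counts` and the counts are made up for illustration; they do not come from the Pokémon data.

```python
def fbeta_from_counts(tp, fp, fn, beta=2.0):
    """F-beta from true positives, false positives and false negatives."""
    if tp == 0:
        return 0.0  # nothing caught: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts: same number of hits, but the errors are swapped.
many_false_alarms = dict(tp=8, fp=6, fn=2)  # misses only 2 legendaries
many_misses = dict(tp=8, fp=2, fn=6)        # misses 6 legendaries

for name, counts in [("many_false_alarms", many_false_alarms),
                     ("many_misses", many_misses)]:
    f1 = fbeta_from_counts(beta=1.0, **counts)
    f2 = fbeta_from_counts(beta=2.0, **counts)
    print(f"{name}: F1={f1:.4f}  F2={f2:.4f}")
```

Both hypothetical models get the same F1, but F2 ranks the one with fewer missed legendaries higher, which is exactly the behaviour the challenge asks for.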

# Imports and dataset

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split

import torch
from torch import nn
import torch.optim as optim
import platform


In [6]:
_ = torch.manual_seed(42) # for a fair comparison, don't change the seed!

In [7]:
data = pd.read_csv('pokemon.csv')

In [8]:
data

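| | # | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
| 1 | 2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
| 2 | 3 | Venusaur | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
| 3 | 3 | VenusaurMega Venusaur | Grass | Poison | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
| 4 | 4 | Charmander | Fire | NaN | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 795 | 719 | Diancie | Rock | Fairy | 600 | 50 | 100 | 150 | 100 | 150 | 50 | 6 | True |
| 796 | 719 | DiancieMega Diancie | Rock | Fairy | 700 | 50 | 160 | 110 | 160 | 110 | 110 | 6 | True |
| 797 | 720 | HoopaHoopa Confined | Psychic | Ghost | 600 | 80 | 110 | 60 | 150 | 130 | 70 | 6 | True |
| 798 | 720 | HoopaHoopa Unbound | Psychic | Dark | 680 | 80 | 160 | 60 | 170 | 130 | 80 | 6 | True |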


In [9]:
data_raw = data.copy() # usually, if memory allows it, it's a good idea to keep a raw version of your data

# Pre-processing / encoding

In [10]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800 entries, 0 to 799
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   #           800 non-null    int64 
 1   Name        800 non-null    object
 2   Type 1      800 non-null    object
 3   Type 2      414 non-null    object
 4   Total       800 non-null    int64 
 5   HP          800 non-null    int64 
 6   Attack      800 non-null    int64 
 7   Defense     800 non-null    int64 
 8   Sp. Atk     800 non-null    int64 
 9   Sp. Def     800 non-null    int64 
 10  Speed       800 non-null    int64 
 11  Generation  800 non-null    int64 
 12  Legendary   800 non-null    bool  
dtypes: bool(1), int64(9), object(3)
memory usage: 75.9+ KB


In [11]:
data.isna().sum()

#               0
Name            0
Type 1          0
Type 2        386
Total           0
HP              0
Attack          0
Defense         0
Sp. Atk         0
Sp. Def         0
Speed           0
Generation      0
Legendary       0
dtype: int64

In [12]:
data = data.drop(['#', 'Name', 'Type 2'], axis=1)

In [13]:
data['Legendary'] = data['Legendary'].astype(int)

In [14]:
data.dtypes

Type 1        object
Total          int64
HP             int64
Attack         int64
Defense        int64
Sp. Atk        int64
Sp. Def        int64
Speed          int64
Generation     int64
Legendary      int64
dtype: object

Categorical variables are one-hot encoded

In [15]:
def onehot_encode(df, column, prefix):
    df = df.copy()
    dummies = pd.get_dummies(df[column], prefix=prefix)
    df = pd.concat([df, dummies], axis=1)
    df = df.drop(column, axis=1)
    return df

In [16]:
data = onehot_encode(data, 'Type 1', 't')
data = onehot_encode(data, 'Generation', 'g')

In [17]:
data.shape

(800, 32)

## Splitting and Scaling

In [18]:
data.columns # note that only the first 7 features are continuous now

Index(['Total', 'HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed',
       'Legendary', 't_Bug', 't_Dark', 't_Dragon', 't_Electric', 't_Fairy',
       't_Fighting', 't_Fire', 't_Flying', 't_Ghost', 't_Grass', 't_Ground',
       't_Ice', 't_Normal', 't_Poison', 't_Psychic', 't_Rock', 't_Steel',
       't_Water', 'g_1', 'g_2', 'g_3', 'g_4', 'g_5', 'g_6'],
      dtype='object')

In [19]:
y = data['Legendary']
X = data.drop('Legendary', axis=1)

In [20]:
scaler = StandardScaler()

X_scaled = scaler.fit_transform(X.iloc[:,:7])
X = np.concatenate((X_scaled, np.array(X.iloc[:,7:])), axis=1)

In [23]:
# Keep these split proportions and the seed of 42 unchanged: we want a fair fight!
# Note: valid_size is applied to the 40% held out by the first split,
# so the overall split is 60% train / 16% validation / 24% test.

train_size = 0.6
valid_size = 0.4
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=train_size, random_state=42)
X_valid, X_test, y_valid, y_test = train_test_split(X_test, y_test, train_size=valid_size, random_state=42)

# Model definition

In [26]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
if platform.system() == "Darwin":
    device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
    
print(f"Using {device} device")

Using cpu device


### Define your model:

Choose for yourself:
- number of hidden layers
- number of neurons per layer (careful with the input and output sizes: these are fixed by the data and the task, not a choice)
- activation functions
- any other possible component among those seen so far in theory.
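
As a starting point for the choices listed above, here is one possible sketch of a feedforward classifier, not a reference solution. The class name `LegendaryClassifier`, the hidden sizes, the ReLU activations, and the dropout rate are all assumptions chosen for illustration. Only the input size (31 features, i.e. the 32 columns of `data` minus `Legendary`) and the single output come from the notebook.

```python
import torch
from torch import nn


class LegendaryClassifier(nn.Module):  # hypothetical name
    """Small MLP: 31 input features -> hidden layers -> 1 logit."""

    def __init__(self, in_features=31, hidden=(64, 32), dropout=0.2):
        super().__init__()
        layers = []
        prev = in_features
        for width in hidden:  # hidden sizes are an arbitrary choice
            layers += [nn.Linear(prev, width), nn.ReLU(), nn.Dropout(dropout)]
            prev = width
        # One raw logit per sample; apply the sigmoid in the loss or at prediction time.
        layers.append(nn.Linear(prev, 1))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)


model = LegendaryClassifier()
print(model)

dummy = torch.randn(4, 31)  # a fake batch of 4 Pokémon
print(model(dummy).shape)
```

Returning a raw logit rather than a probability assumes a loss that applies the sigmoid internally; if you prefer a `nn.Sigmoid()` output layer, pair it with a loss that expects probabilities.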

In [20]:
# TODO

Instantiate your model and print it out

In [None]:
# TODO

### Hyperparameters:

Choose your hyperparameters carefully:
- learning rate (this is usually the most important hyperparameter to get right, but some optimizers are more forgiving than others)
- batch size
- number of epochs
- other hyperparameters that you might need

In [22]:
# TODO

### Loss function and optimizer:

- What's the appropriate loss function for the task?
- Decide which optimizer you want to use ([Documentation](https://pytorch.org/docs/stable/optim.html))
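
One common pairing for binary classification is sketched below, assuming the model outputs a single raw logit. The stand-in `nn.Linear` model, the `pos_weight` value of 5.0, the Adam optimizer, and the learning rate are illustrative assumptions, not the TAs' settings.

```python
import torch
from torch import nn
import torch.optim as optim

model = nn.Linear(31, 1)  # stand-in for your own network

# Binary cross-entropy on raw logits. pos_weight > 1 makes a missed
# positive (a legendary) cost more than a false alarm; 5.0 is a made-up value.
pos_weight = torch.tensor([5.0])
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

# Adam with a hypothetical learning rate.
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# Tiny hand-made check: one positive and one negative example.
logits = torch.tensor([[2.0], [-1.0]])
targets = torch.tensor([[1.0], [0.0]])
loss = criterion(logits, targets)
print(loss.item())
```

Weighting the positive class is one way to push the model towards recall, in line with the F2 objective; leaving `pos_weight` out gives the plain binary cross-entropy.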

In [32]:
# TODO

### Dataset and loaders:

Define your TensorDatasets and DataLoaders; remember to use the appropriate dtype for your tensors.
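
A minimal sketch of this step follows. It uses random arrays in place of `X_train`, `y_train`, `X_valid`, and `y_valid` so that it runs on its own; the helper `make_loader` and the batch size of 16 are assumptions. The loader names `trainloader` and `validloader` match the ones used in the next cell.

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset, DataLoader

# Random stand-ins for the real splits (31 features per sample).
rng = np.random.default_rng(0)
X_train_demo = rng.normal(size=(48, 31))
y_train_demo = rng.integers(0, 2, size=48)
X_valid_demo = rng.normal(size=(16, 31))
y_valid_demo = rng.integers(0, 2, size=16)


def make_loader(X, y, batch_size, shuffle):  # hypothetical helper
    # float32 features match the default dtype of nn.Linear weights.
    X_t = torch.tensor(np.asarray(X), dtype=torch.float32)
    # Float targets of shape (N, 1) assume a BCE-style loss on one logit.
    y_t = torch.tensor(np.asarray(y), dtype=torch.float32).unsqueeze(1)
    return DataLoader(TensorDataset(X_t, y_t), batch_size=batch_size, shuffle=shuffle)


trainloader = make_loader(X_train_demo, y_train_demo, batch_size=16, shuffle=True)
validloader = make_loader(X_valid_demo, y_valid_demo, batch_size=16, shuffle=False)

xb, yb = next(iter(trainloader))
print(xb.shape, xb.dtype, yb.shape, yb.dtype)
```

With the real data, pass the actual splits to the helper instead of the `_demo` arrays; `np.asarray` also accepts the pandas Series holding the labels.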

In [24]:
# TODO

In [25]:
# Keep track of training and validation losses during training
train_loss_list = []
valid_loss_list = []

train_length = len(trainloader)
valid_length = len(validloader)

### Training

Implement your training and evaluation (for the validation set) loops
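
The overall shape of such loops can be sketched as follows. This is a toy, self-contained version on random data, running on CPU for brevity: the small `nn.Sequential` model, the learning rate, the 5 epochs, and the seed of 0 are assumptions for the demo only (keep the notebook's seed of 42 in your actual solution).

```python
import torch
from torch import nn
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(0)  # seed for this toy demo only

# Toy data: the label depends on the sign of the first feature.
X = torch.randn(64, 31)
y = (X[:, 0] > 0).float().unsqueeze(1)
trainloader = DataLoader(TensorDataset(X[:48], y[:48]), batch_size=16, shuffle=True)
validloader = DataLoader(TensorDataset(X[48:], y[48:]), batch_size=16)

model = nn.Sequential(nn.Linear(31, 16), nn.ReLU(), nn.Linear(16, 1))
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

train_loss_list, valid_loss_list = [], []
for epoch in range(5):
    # Training pass: gradients on, weights updated after every batch.
    model.train()
    running = 0.0
    for xb, yb in trainloader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
        running += loss.item()
    train_loss_list.append(running / len(trainloader))

    # Validation pass: no gradients, no weight updates.
    model.eval()
    running = 0.0
    with torch.no_grad():
        for xb, yb in validloader:
            running += criterion(model(xb), yb).item()
    valid_loss_list.append(running / len(validloader))

    print(f"epoch {epoch + 1}: train={train_loss_list[-1]:.4f} "
          f"valid={valid_loss_list[-1]:.4f}")
```

Dividing the summed batch losses by `len(trainloader)` and `len(validloader)` gives one average loss per epoch, which is what the two lists are meant to collect for plotting. When training on a GPU, move the model and each batch to `device` first.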

In [None]:
# TODO

# Results

### Plotting

Plot out the training and validation losses over the epochs

In [None]:
plt.plot(..., label='train') # TODO
plt.plot(..., label='valid') # TODO
plt.legend(loc="best")
plt.grid(True)
plt.show()

### Metrics

Print out appropriate metrics for the task
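
As an illustration of how these metrics are called, the sketch below evaluates hand-made predictions. The labels, the probabilities, and the 0.5 decision threshold are invented for the example; with your model, the probabilities would come from applying a sigmoid to its outputs on the test set.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, fbeta_score, recall_score

# Made-up ground truth (4 legendaries out of 10) and predicted probabilities.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
probs = np.array([0.1, 0.2, 0.7, 0.3, 0.6, 0.1, 0.9, 0.8, 0.4, 0.7])

# Turn probabilities into hard 0/1 predictions (0.5 is an assumed threshold).
y_pred = (probs >= 0.5).astype(int)

cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class
recall = recall_score(y_true, y_pred)
f2 = fbeta_score(y_true, y_pred, beta=2)

print(cm)
print(f"recall={recall:.4f}  F2={f2:.4f}")
```

In the confusion matrix, the bottom-left entry counts the legendaries that got away (false negatives), which is the number the F2-score penalizes most.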

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, fbeta_score, classification_report

# TODO

Did you manage to catch them all?