# Gotta classify 'em all!

### Predict the types of Pokémon based on their attributes.

In the following, you have access to the vital statistics for the first 6 generations of Pokémon. Your (very open-ended!) task is to see to what extent it is possible to predict the primary type (the `Type_1` field) of a Pokémon given its other vital statistics. This is a multiclass classification problem.

### Hints and tips:

* The `sklearn` package is the industry standard for ML algorithms that can be used quickly out of the box, so you should use it: https://scikit-learn.org/stable/supervised_learning.html#supervised-learning
    * Beginning with something simple like Logistic Regression or a Decision Tree is encouraged to establish a performance baseline.
* The dataset features a large number of columns, many of which are likely redundant and will not contain any predictive power for the task at hand. 
    * Consider preselecting a small number of 'obvious' columns which your misspent youth (sorry, 'domain expertise') tells you are likely to contain a lot of predictive power.
    * Get something up and running with these first (fewer features -> less time feature engineering) and then circle back to incorporate more features into your model once this is done.
    * Doing this iteratively rather than 'all at once' is good from a client-facing point of view (having some kind of concrete result to discuss early is always a good thing), and will also help you see what is going on more quickly.
* The dataset consists of a mixture of categorical and continuous variables. 
    * Categorical variables will need to be converted to numeric indicator values before being passed into `sklearn` classifiers; you can use the `pandas` function `pd.get_dummies` for this.
    * Dummying lots of categorical features with lots of possible values can quickly result in a very large, sparse, feature space. Beginning with only a small subset of features will help here.
* Some of the columns in your dataset have null values in them:
    * It's wise to ignore these at a first pass in order to get up and running quickly, but they might contain information that could improve your model.
    * In this situation, does it make more sense to impute these null values with something like the mean or mode value of the non-null entries, to convert them to a dummy variable, or to incorporate them into another column somehow?
* There are `18` different `Type_1` values. Are these values distributed evenly across the dataset? It might make sense to focus your initial efforts on identifying only a subset of these in order to avoid getting bogged down with small data issues at the start.
* In order to evaluate the performance of your model, you will need to perform a train/test split (use `sklearn.model_selection.train_test_split`).
    * Research what it means to stratify your train/test split with respect to the target variable. It is a good idea to do so here in order to guarantee that the performance metrics you quote are relatively stable. You should be aware that doing this in situations where you can't be sure that the class balance breakdown 'in the wild' is the same as in your dataset will result in biased estimates of the performance of your model.
    * Train/test splits feature a frustrating tradeoff: you want as much data as possible in your train set to build the best possible model, but you also want lots of data in your test set to evaluate its performance accurately. Cross validation is a computationally intensive way to have your cake and eat it in this scenario: see `sklearn.model_selection.cross_val_score`.
* You will need to consider how to evaluate the performance of your classifier: the accuracy score is the simplest metric to quote, but plotting a full confusion matrix will give you significantly more insight into how your model is performing and where it could be improved.
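
As a rough illustration of the hints on categorical columns and null values, the sketch below compares two ways of handling a categorical column that contains nulls before dummying. The toy values are invented, and `Egg_Group_2` is used here as a hypothetical example of a column with missing entries; check which columns in the real dataset actually contain nulls.

```python
import pandas as pd

# Toy data invented for illustration only. "Egg_Group_2" is an assumed
# column name standing in for any categorical column with nulls.
toy = pd.DataFrame(
    {
        "HP": [45, 60, 80, 39],
        "Egg_Group_1": ["Monster", "Bug", "Monster", "Field"],
        "Egg_Group_2": ["Grass", None, "Dragon", None],
    }
)

# Option 1: treat "missing" as a category of its own, then dummy as usual.
filled = toy.fillna({"Egg_Group_2": "Missing"})
encoded = pd.get_dummies(filled, dtype=float)

# Option 2: ask get_dummies to add an explicit indicator column for nulls.
encoded_nan = pd.get_dummies(toy, dummy_na=True, dtype=float)

print(encoded.columns.tolist())
print(encoded_nan.columns.tolist())
```

Either way, no rows are dropped and the fact that a value was missing becomes a feature the model can use. Whether that is more sensible than imputing the mode depends on what a null means in that column.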

In [None]:
# Feel free to import more packages (e.g. numpy, other sklearn modules) as required.
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Magic command to make plots render inline/underneath cells in Jupyter notebooks.
%matplotlib inline

In [None]:
fpath = "https://s3-eu-west-1.amazonaws.com/faculty-client-teaching-materials/non-linear-algorithms/pokemon.csv"

In [None]:
df = pd.read_csv(fpath)
# list(df)

In [None]:
df.head()

In [None]:
df["Type_1"].unique()

In [None]:
mostcommon = df["Type_1"].value_counts()
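
To check whether the types are evenly distributed, it can help to look at the counts as fractions and to see how much of the data the most common types cover. The sketch below uses a made-up label series in place of `df["Type_1"]`, and `top_k` is a hypothetical cut-off, so the numbers are illustrative only.

```python
import pandas as pd

# Made-up labels standing in for df["Type_1"]; the real counts will differ.
types = pd.Series(
    ["Water"] * 5 + ["Normal"] * 4 + ["Bug"] * 3 + ["Ice"] * 1, name="Type_1"
)

counts = types.value_counts()                # most frequent first
shares = types.value_counts(normalize=True)  # the same, as fractions of rows

top_k = 3  # hypothetical cut-off: keep only the top_k most common types
top_types = counts.index[:top_k]
mask = types.isin(top_types)

print(counts)
print(f"Top {top_k} types cover {mask.mean():.0%} of rows")
```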

In [None]:
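# Keep the N most common types. There are 18 types in total, so 18 keeps them all;
# lower this number to focus on the best-represented classes first.
my_types = mostcommon.iloc[:18].index
# Starting features: base stats plus the primary egg group.
# ("hasMegaEvolution" is left out of this first pass.)
my_features = [
    "Total",
    "HP",
    "Attack",
    "Defense",
    "Sp_Atk",
    "Sp_Def",
    "Egg_Group_1",
]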

In [None]:
df.loc[df["Type_1"].isin(my_types), ["Type_1"] + my_features]

In [None]:
subset = df["Type_1"].isin(my_types)

In [None]:
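X = df.loc[subset, my_features]
# One-hot encode the categorical columns (here, Egg_Group_1).
X = pd.get_dummies(X, dtype=float)
# Drop rows with missing values, then build y from the rows that remain
# so that X and y stay aligned.
X = X.dropna()
y = df.loc[X.index, "Type_1"]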

In [None]:
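# Stratify on y so that the train and test sets share the same class balance.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0, stratify=y
)
# The default max_iter=100 may not be enough to converge on unscaled features.
logisticRegr = LogisticRegression(max_iter=1000)
logisticRegr.fit(x_train, y_train)
y_pred = logisticRegr.predict(x_test)
# For a classifier, score() returns the mean accuracy on the test set.
score = logisticRegr.score(x_test, y_test)
print(score)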
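
A single 90/10 split yields one accuracy figure from a small test set, which can be noisy. A minimal sketch of the cross-validation alternative mentioned in the hints follows. It uses synthetic data from `make_classification` rather than the Pokémon features, and the `StandardScaler` step is an addition that the notebook above does not use.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data, standing in for the real X and y.
X_toy, y_toy = make_classification(
    n_samples=200,
    n_features=6,
    n_informative=4,
    n_redundant=0,
    n_classes=3,
    random_state=0,
)

# Scaling inside the pipeline means it is re-fitted on each training fold,
# so no information leaks from the held-out fold.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified folds keep the class balance roughly equal across folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X_toy, y_toy, cv=cv, scoring="accuracy")

print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

With the real data, note that stratified folds need every class to have at least `n_splits` examples, which is another reason to begin with the most common types.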

In [None]:
print(classification_report(y_test, y_pred))
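
The classification report summarises per-class precision and recall, but a confusion matrix shows which types are being mistaken for which. A small sketch with hand-written labels (not real model output) is below. In the notebook you would pass `y_test` and `y_pred` instead.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix

# Hand-written labels for illustration; replace with y_test and y_pred.
y_true_toy = ["Water", "Water", "Water", "Bug", "Bug", "Normal", "Normal", "Normal"]
y_pred_toy = ["Water", "Water", "Normal", "Bug", "Water", "Normal", "Normal", "Bug"]

labels = ["Bug", "Normal", "Water"]
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=labels)

# Rows are the true types, columns are the predicted types.
cm_df = pd.DataFrame(
    cm,
    index=[f"true_{name}" for name in labels],
    columns=[f"pred_{name}" for name in labels],
)
print(cm_df)
print("accuracy:", accuracy_score(y_true_toy, y_pred_toy))
```

Off-diagonal cells are the misclassifications. For a plotted version, recent versions of `sklearn` provide `sklearn.metrics.ConfusionMatrixDisplay.from_predictions`.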