# Ricky's Final Project - Part 2

### Pokedex Upgrade - Predicting Catch Rates of Unknown Pokemon

**1) Problem and Hypothesis**

**Problem Statement**

Every year since 1999, thousands of young aspiring Pokemon masters leave their homes in an effort to "catch 'em all" with nothing more than the clothes on their back and a Pokedex. However, as time passes there seems to be an ever-growing number of species, and it can be daunting to encounter an unknown species for the first time. This project explores using predictive models to help a Pokedex classify unknown Pokemon and predict the probability of catching the new creature. As a further step, it asks whether the new species can be classified as a "Legendary" type.

**How does this tie into machine learning?**

There are quite a few machine learning methods that can be used on this data. To first predict a Pokemon's catch rate, we can fit a multiple linear regression, which returns a continuous numeric result. To then predict whether a Pokemon is a Legendary type, we can convert the boolean **isLegendary** feature to binary (0/1) and run a logistic regression. Both will require feature selection methods to reduce the number of variables that come into play.
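
The two model types described above can be sketched roughly as follows. This is only an illustration: it uses scikit-learn (an assumption, since the notebook imports `statsmodels`), and the arrays are synthetic stand-ins, not the Pokemon data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for three numeric predictors (NOT the real Pokemon data).
X = rng.normal(size=(200, 3))

# Continuous target for the regression task (plays the role of Catch_Rate).
y_rate = 100 - 30 * X[:, 0] + rng.normal(scale=5, size=200)

# Boolean target for the classification task (plays the role of isLegendary).
is_legendary = (X[:, 0] + rng.normal(scale=0.5, size=200)) > 1.0
y_legendary = is_legendary.astype(int)  # True/False -> 1/0

# Multiple linear regression: several predictors, one continuous output.
rate_model = LinearRegression().fit(X, y_rate)

# Logistic regression: several predictors, one probability of the positive class.
legend_model = LogisticRegression().fit(X, y_legendary)

predicted_rate = rate_model.predict(X[:1])                    # a continuous number
legend_probability = legend_model.predict_proba(X[:1])[:, 1]  # a value between 0 and 1
```

The regression returns a number on the catch-rate scale, while the logistic model returns a probability that can be thresholded into a yes/no Legendary label.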

Additionally, the Pokemon universe does a great job of classifying species by color, body type, attribute type, etc. It would be interesting to apply clustering techniques to identify new groupings of Pokemon beyond the existing attributes.
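
As a rough sketch of the clustering idea, k-means could be run on the scaled base stats. The choice of k-means, the number of clusters, and the synthetic data below are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the six base stats (NOT the real data):
# three loose groups of 50 rows each, centred on weak, medium, and strong stats.
centers = np.array([[40] * 6, [80] * 6, [120] * 6])
stats = np.vstack([c + rng.normal(scale=5, size=(50, 6)) for c in centers])

# Scale first so that no single stat dominates the distance calculation.
scaled = StandardScaler().fit_transform(stats)

# n_clusters=3 is an arbitrary choice for this sketch.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scaled)
cluster_labels = kmeans.labels_
```

On the real data the number of clusters is unknown, so it would need to be chosen by comparing several values, for example with an elbow plot or silhouette scores.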

**What do you think will have the most impact in predicting the value you are interested in solving for?**

I believe the **Total** battle points feature will play a large role in predicting the **Catch_Rate**. I expect a negative relationship: as strength increases, the catch rate will decrease.
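
One quick way to check this hypothesis is the Pearson correlation between the two columns. The sketch below uses a synthetic frame in which the negative relationship is built in by construction, so it only shows the mechanics; the same `.corr()` call on the real data would show whether the hypothesis holds.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in (NOT real values): Catch_Rate is constructed to fall as Total rises.
total = rng.integers(180, 721, size=300)
catch_rate = np.clip(330 - 0.45 * total + rng.normal(scale=25, size=300), 3, 255)
demo = pd.DataFrame({"Total": total, "Catch_Rate": catch_rate})

# Pearson correlation: values near -1 indicate a strong negative linear relationship.
correlation = demo["Total"].corr(demo["Catch_Rate"])
```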

**2) Data Set Details**

The data set was compiled on kaggle.com -> https://www.kaggle.com/abcsds/pokemon using the sources below from the video game series:

* pokemon.com
* pokemondb
* bulbapedia

The data set includes 23 variables for each of the 721 Pokemon.

**Data Dictionary**

* Number = Pokémon ID in the Pokédex (integer)
* Name = Name of the Pokémon (string)
* Type_1 = Primary type (string)
* Type_2 = Second type, in case the Pokémon has it (string)
* Total = Sum of all the base stats (Health Points, Attack, Defense, Special Attack, Special Defense, and Speed) (integer)
* HP = Base Health Points (integer)
* Attack = Base Attack (integer)
* Defense = Base Defense (integer)
* Sp_Atk = Base Special Attack (integer)
* Sp_Def = Base Special Defense (integer)
* Speed = Base Speed (integer)
* Generation = Number of the generation when the Pokémon was introduced (integer)
* isLegendary = Indicates whether the Pokémon is Legendary or not (boolean)
* Color = Color of the Pokémon according to the Pokédex (string)
* hasGender = Indicates if the Pokémon can be classified as female or male (boolean)
* Pr_Male = In case the Pokémon has Gender, the probability of its being male (float)
* Egg_Group_1 = Egg Group of the Pokémon (string)
* Egg_Group_2 = Second Egg Group of the Pokémon, in case it has two (string)
* hasMegaEvolution = Indicates whether the Pokémon is able to Mega-evolve or not (boolean)
* Height_m = Height of the Pokémon, in meters (float)
* Weight_kg = Weight of the Pokémon, in kilograms (float)
* Catch_Rate = Catch Rate (integer)
* Body_Style = Body Style of the Pokémon according to the Pokédex (string)


In [54]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

pd.set_option('display.max_rows', 10)

%matplotlib inline
plt.style.use('ggplot')

In [55]:
poke_data = pd.read_csv(os.path.join('.', 'pokemon_stats_data.csv'))

poke_data.head()

| | Number | Name | Type_1 | Type_2 | Total | HP | Attack | Defense | Sp_Atk | Sp_Def | ... | Color | hasGender | Pr_Male | Egg_Group_1 | Egg_Group_2 | hasMegaEvolution | Height_m | Weight_kg | Catch_Rate | Body_Style |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | ... | Green | True | 0.875 | Monster | Grass | False | 0.71 | 6.9 | 45 | quadruped |
| 1 | 2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | ... | Green | True | 0.875 | Monster | Grass | False | 0.99 | 13.0 | 45 | quadruped |
| 2 | 3 | Venusaur | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | ... | Green | True | 0.875 | Monster | Grass | True | 2.01 | 100.0 | 45 | quadruped |
| 3 | 4 | Charmander | Fire | NaN | 309 | 39 | 52 | 43 | 60 | 50 | ... | Red | True | 0.875 | Monster | Dragon | False | 0.61 | 8.5 | 45 | bipedal_tailed |
| 4 | 5 | Charmeleon | Fire | NaN | 405 | 58 | 64 | 58 | 80 | 65 | ... | Red | True | 0.875 | Monster | Dragon | False | 1.09 | 19.0 | 45 | bipedal_tailed |


In [56]:
poke_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 721 entries, 0 to 720
Data columns (total 23 columns):
Number              721 non-null int64
Name                721 non-null object
Type_1              721 non-null object
Type_2              350 non-null object
Total               721 non-null int64
HP                  721 non-null int64
Attack              721 non-null int64
Defense             721 non-null int64
Sp_Atk              721 non-null int64
Sp_Def              721 non-null int64
Speed               721 non-null int64
Generation          721 non-null int64
isLegendary         721 non-null bool
Color               721 non-null object
hasGender           721 non-null bool
Pr_Male             644 non-null float64
Egg_Group_1         721 non-null object
Egg_Group_2         191 non-null object
hasMegaEvolution    721 non-null bool
Height_m            721 non-null float64
Weight_kg           721 non-null float64
Catch_Rate          721 non-null int64
Body_Style          721 non-

**3) Domain Knowledge**

**Feature Creation**

My domain knowledge covers roughly only the first generation of the Pokemon series. Fortunately, not much has changed since then. This will allow me to use the existing information to create some additional features that could improve the performance of my models, and to make sense of existing features and patterns.

* Sub_type = binary feature that indicates whether a Pokemon has a second attribute type different from its primary
* Offensive_Power = continuous feature (sum of **Attack**, **Sp_Atk**, **Speed**)
* Defensive_Power = continuous feature (sum of **HP**, **Sp_Def**, **Defense**)
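
A minimal sketch of how these three features could be derived with pandas. The two rows reuse values from the `head()` output above; **Speed** is hidden in that output, so it is reconstructed here as **Total** minus the other five stats. The name `sample` is a placeholder for the full `poke_data` frame.

```python
import numpy as np
import pandas as pd

# Two rows copied from the head() output; Speed = Total - (the other five base stats).
sample = pd.DataFrame({
    "Name": ["Bulbasaur", "Charmander"],
    "Type_1": ["Grass", "Fire"],
    "Type_2": ["Poison", np.nan],
    "HP": [45, 39],
    "Attack": [49, 52],
    "Defense": [49, 43],
    "Sp_Atk": [65, 60],
    "Sp_Def": [65, 50],
    "Speed": [45, 65],
})

# 1 if a second type exists and differs from the primary type, else 0.
has_second_type = sample["Type_2"].notna() & (sample["Type_2"] != sample["Type_1"])
sample["Sub_type"] = has_second_type.astype(int)

# Row-wise sums of the offensive and defensive base stats.
sample["Offensive_Power"] = sample[["Attack", "Sp_Atk", "Speed"]].sum(axis=1)
sample["Defensive_Power"] = sample[["HP", "Sp_Def", "Defense"]].sum(axis=1)
```

Because the two new sums together cover all six base stats, they add up to **Total** for every row, which is worth keeping in mind when selecting features.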

**Existing Work**

There is some previous work on Kaggle that explores the battle metrics and runs basic analysis on them. After combing through that published kernel, I concluded that battle metrics alone would not be the best indicator of a Pokemon's attribute type. This helped me reframe where I wanted to focus my effort in terms of creating a predictive model.



**4) Project Concerns**

**Risks**

One concern I have is the low number of observations relative to the wide range of features to select from. I fear not having enough observations to train my model and ending up overfitting it, which would take away its ability to generalize to future entries.

Additionally, only 5% of the entire Pokemon universe (721 observations) is categorized as a Legendary type. This imbalance may hurt the model's ability to handle the second task of identifying a Legendary type.
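
Two common ways to soften this kind of imbalance are a stratified train/test split and class weighting. The sketch below assumes scikit-learn and uses synthetic data with 36 positives out of 721 rows (roughly 5%) in place of the real **isLegendary** column.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in (NOT the real data): 36 of 721 rows are the rare class.
n_rows, n_legendary = 721, 36
X = rng.normal(size=(n_rows, 4))
y = np.zeros(n_rows, dtype=int)
y[:n_legendary] = 1
X[y == 1] += 1.5  # shift the rare class so there is something to learn

# stratify=y keeps the ~5% share of positives in both the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# class_weight="balanced" weights errors on the rare class more heavily.
clf = LogisticRegression(class_weight="balanced").fit(X_train, y_train)
```

With a class this rare, plain accuracy can look good even for a model that never predicts Legendary, so precision and recall on the positive class are worth checking as well.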

**Finding the best way to eliminate redundant features**

Some of the features might be highly correlated. For example, the **Total** feature encapsulates all the other battle metrics. Intuitively, I can eliminate a few features that I don't think will play much of a role in my model, but from looking at the data I am not sure which technique is best for eliminating redundant features. I plan on exploring lasso, random forest importance scores, and PCA to maximize predictive power.
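
The three techniques named above could be tried along these lines. This is a sketch assuming scikit-learn; the data is synthetic, with **Total** built as the exact sum of the six base stats so that the redundancy is present by construction, and the `alpha` and variance threshold are arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
features = ["HP", "Attack", "Defense", "Sp_Atk", "Sp_Def", "Speed", "Total"]

# Synthetic stand-in (NOT the real data); the last column is the sum of the first six.
base = rng.integers(20, 150, size=(300, 6)).astype(float)
X = np.column_stack([base, base.sum(axis=1)])
y = 300 - 0.4 * X[:, -1] + rng.normal(scale=20, size=300)

X_scaled = StandardScaler().fit_transform(X)

# Lasso: the L1 penalty can shrink coefficients of redundant features to exactly zero.
lasso = Lasso(alpha=1.0).fit(X_scaled, y)
lasso_kept = [name for name, coef in zip(features, lasso.coef_) if coef != 0]

# Random forest: impurity-based importance scores, which sum to 1.
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
importances = dict(zip(features, forest.feature_importances_))

# PCA: keep as many components as needed to explain 95% of the variance.
pca = PCA(n_components=0.95).fit(X_scaled)
n_components_kept = pca.n_components_
```

Lasso and the random forest keep the original feature names, which makes the result easier to explain; PCA replaces them with combinations, trading interpretability for compactness.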

**Data Assumptions**

Roughly half of the values under the **Type_2** feature are missing (only 350 of 721 are non-null). Instead of removing those rows, I will set their secondary type to match their primary type, since they are single-type Pokemon.
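
In pandas this fill can be done in one line. The sketch below uses two rows from the `head()` output; `sample` is a placeholder for the full `poke_data` frame.

```python
import numpy as np
import pandas as pd

# Two rows from the head() output: Charmander has no second type.
sample = pd.DataFrame({
    "Name": ["Bulbasaur", "Charmander"],
    "Type_1": ["Grass", "Fire"],
    "Type_2": ["Poison", np.nan],
})

# Where Type_2 is missing, copy the primary type from the same row.
sample["Type_2"] = sample["Type_2"].fillna(sample["Type_1"])
```

After this step a single-type Pokemon is recognizable by **Type_2** being equal to **Type_1**.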

It appears that the battle points are already on similar scales. However, I will need to standardize some of the other features that will be introduced into the model, such as height and weight.
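
A minimal sketch of that scaling step, assuming scikit-learn's `StandardScaler` and using the five height and weight values from the `head()` output:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Height and weight of the first five Pokemon, taken from the head() output.
sample = pd.DataFrame({
    "Height_m": [0.71, 0.99, 2.01, 0.61, 1.09],
    "Weight_kg": [6.9, 13.0, 100.0, 8.5, 19.0],
})

# Rescale each column to zero mean and unit variance.
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(sample), columns=sample.columns)
```

In the actual workflow the scaler would be fit on the training split only and then applied to the test split, so that no information leaks from the test data.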



**5) Outcome**

**Expected Output**

After fitting, training and testing the model, I expect the final product to be a single line of output stating the catch rate.

For my target audience, I would expect the catch rate to be converted into a probability, as that would be more intuitive than the catch rate scale used by the current Pokedex.
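
One very simple way to do this conversion is to rescale the predicted catch rate to the 0-1 range. The sketch below assumes the scale tops out at 255, which the data set does not state, and it is a naive normalization rather than the capture formula the games use. The function name is hypothetical.

```python
MAX_CATCH_RATE = 255  # assumption: upper bound of the catch rate scale


def catch_rate_to_probability(catch_rate, max_rate=MAX_CATCH_RATE):
    """Rescale a catch rate to the 0-1 range, clipping out-of-range predictions."""
    return min(max(catch_rate / max_rate, 0.0), 1.0)


# 45 is the Catch_Rate shown for Bulbasaur in the head() output.
probability = catch_rate_to_probability(45)

# Single-line output for the trainer.
message = f"Estimated chance to catch: {probability:.0%}"
```

The clipping matters because a linear regression can predict values below zero or above the top of the scale.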

**Defining Success**

The model itself does not need to be overly complex. I am hoping for a 70% accuracy in the testing stage. 

**What gain do you expect from your most important features?**

I believe that for each model some features will carry more weight than others. For example, the majority of Legendary Pokemon are considered to be gender-less. While some regular Pokemon may also be gender-less, that feature will play a larger role in the prediction process.

**What happens if it is a bust? What are the implications?**

If the model is a flop, then the young trainer will have to catch the unknown Pokemon the old-fashioned way or stay out of the tall grass.