# Investigate - analyse your dataset

## 1. Define the goal of your analysis

*How can you define your goal as a question that you can answer with data analysis?*

Main: Which starter pokemon should you pick in each generation?
- Assumption: Team combination should cover a multitude of types
- Higher stats are better
- Types have weaknesses and strengths that should be considered

*Which sub questions do you need to answer this question?*
- What types are rarer in the generations?
- How many types do the starters have?
- Which starter has the most weaknesses and strengths because of its type(s)?
- Are there differences in the starters stats?

## 2. Data Handling
*Reflect on your data set. How was it collected? Where does it come from? What implications does that have for answering your question?*

The data comes from [Michael Lomuscio (Kaggle)](https://www.kaggle.com/mlomuscio/pokemon). It is an updated version of [Alberto Barradas's dataset](https://www.kaggle.com/abcsds/pokemon). The data itself comes from official pokemon/pokedex sites. It was not collected through, e.g., surveys of a sample group, where sampling bias would have to be considered; it simply records the pokemon data as published at the time the dataset was compiled.


*Load your data and observe what state your data is in.*

- The dataframe has 12 columns and 800 rows. 
- The columns are: Num, Name, Type1, Type2, HP, Attack, Defense, SpAtk, SpDef, Speed, Generation, Legendary
- Just from looking at the first 5 rows, we can see that there are NaNs in the Type2 column

In [12]:
import pandas as pd
pokemon = pd.read_csv('./PokemonData.csv')
display(pokemon.shape)
pokemon.head()

(800, 12)

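|   | Num | Name | Type1 | Type2 | HP | Attack | Defense | SpAtk | SpDef | Speed | Generation | Legendary |
|---|-----|------|-------|-------|----|--------|---------|-------|-------|-------|------------|-----------|
| 0 | 1 | Bulbasaur | Grass | Poison | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
| 1 | 2 | Ivysaur | Grass | Poison | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
| 2 | 3 | Venusaur | Grass | Poison | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
| 3 | 3 | VenusaurMega Venusaur | Grass | Poison | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
| 4 | 4 | Charmander | Fire | NaN | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |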


*Describe your dataset. What do the rows mean? What information do you have in the columns? Data types...*
- A row is one pokemon, or one form of a pokemon: mega evolutions such as Mega Venusaur get their own row and share the Num of the base pokemon
- The columns give you: 
    - Num: int with the pokedex number of the pokemon
    - Name: object with the pokemon's name
    - Type1: object with the first type of the pokemon
    - Type2: object with the second type of the pokemon (NaN for single type pokemon)
    - HP: int that gives you the base HP of the pokemon
    - Attack: int that gives you the base Attack of the pokemon
    - Defense: int that gives you the base Defense of the pokemon
    - SpAtk: int that gives you the base Special Attack of the pokemon
    - SpDef: int that gives you the base Special Defense of the pokemon
    - Speed: int that gives you the base Speed of the pokemon
    - Generation: int that gives you the Generation of the pokemon
    - Legendary: Boolean that shows whether a pokemon is legendary or not


In [9]:
pokemon.dtypes

Num            int64
Name          object
Type1         object
Type2         object
HP             int64
Attack         int64
Defense        int64
SpAtk          int64
SpDef          int64
Speed          int64
Generation     int64
Legendary       bool
dtype: object

*Identify things that need to be cleaned and preprocessed and make a plan for which processing steps you need.*

The only column with missing values is the Type2 column. The value is missing if the pokemon is not a dual type pokemon and therefore doesn't have a second type.
For now I see no reason to drop or convert these missing values. They are actually useful, e.g. when comparing how many pokemon are single type vs dual type.

For now all columns seem interesting. 
- The number and name to id the pokemon easily
- The types to compare pokemon based on type(s)
- The stats to compare pokemon based on stats
- The generation to group pokemon based on generation
- Legendary seems to be the least interesting so far, but it could be used in two ways: to compare starters with legendary pokemon, or to take legendary pokemon out of the dataframe. Legendary pokemon are supposed to be rare and more powerful, so they are likely edge cases that could distort any calculations we make with the stats.

In [11]:
pokemon.isna().any()

Num           False
Name          False
Type1         False
Type2          True
HP            False
Attack        False
Defense       False
SpAtk         False
SpDef         False
Speed         False
Generation    False
Legendary     False
dtype: bool
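
The single type vs dual type comparison mentioned above can be read straight off the missing values. The following is a minimal sketch, assuming the column names shown earlier. The `sample` frame only reuses the five rows printed by `pokemon.head()` so that the snippet runs without the CSV, and `n_types` is a hypothetical helper column that is not part of the dataset.

```python
import pandas as pd

# Five rows copied from the head() output above, so the snippet runs without the CSV.
sample = pd.DataFrame({
    "Name": ["Bulbasaur", "Ivysaur", "Venusaur", "VenusaurMega Venusaur", "Charmander"],
    "Type1": ["Grass", "Grass", "Grass", "Grass", "Fire"],
    "Type2": ["Poison", "Poison", "Poison", "Poison", None],
})

# A missing Type2 means the pokemon is single type, so counting the
# non-missing type columns per row gives 1 (single) or 2 (dual).
sample["n_types"] = sample[["Type1", "Type2"]].notna().sum(axis=1)  # hypothetical helper column

type_counts = sample["n_types"].value_counts().sort_index()
print(type_counts)
```

On the full dataset the same two lines would be applied to `pokemon` instead of `sample`.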

## 3. Methods
- *Identify which parts of the dataset are most relevant to answering your questions*
    - *Which features are most relevant?*
    - *What can you say about the quality of these features? How will it influence your analysis?*
- *Select which methods you want to use to answer your questions.*
    - *Which results will you generate?*
    - *How do these results answer your question?*
    - *What are threats to validity?*
- *Example Methods:*
    - *Descriptive statistics: eg. median, quartiles, mean, standard deviation.*
    - *Data visualizations, eg. countplots for categorical data; histograms and distributions for typical values of continuous data; scatterplots for relations between two numerical variables, etc. These visualizations should be relevant to the question you are trying to answer.*


Different parts of the dataset will be important for different parts of the analysis I would like to do.

### Types
The types are important for comparing pokemon by type, both in terms of variety and in terms of strengths and weaknesses.

The variety could be visualized by a bar plot.

I have not decided how (or whether) to best visualize the strengths and weaknesses.
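
One way to prepare the type variety for a bar plot is to count how often each type occurs per generation. This is only a sketch under a few assumptions: `type_counts_by_generation` is a hypothetical helper, a dual type pokemon is counted once for each of its types, and the `sample` frame again reuses the five rows from `pokemon.head()`.

```python
import pandas as pd

# Five rows copied from the head() output above, so the snippet runs without the CSV.
sample = pd.DataFrame({
    "Name": ["Bulbasaur", "Ivysaur", "Venusaur", "VenusaurMega Venusaur", "Charmander"],
    "Type1": ["Grass", "Grass", "Grass", "Grass", "Fire"],
    "Type2": ["Poison", "Poison", "Poison", "Poison", None],
    "Generation": [1, 1, 1, 1, 1],
})


def type_counts_by_generation(df):
    """Count type occurrences per generation (hypothetical helper)."""
    # Stack Type1 and Type2 into one long "Type" column and drop the
    # missing second types of single type pokemon.
    long = df.melt(
        id_vars=["Name", "Generation"],
        value_vars=["Type1", "Type2"],
        value_name="Type",
    ).dropna(subset=["Type"])
    # One row per generation, one column per type.
    return long.groupby(["Generation", "Type"]).size().unstack(fill_value=0)


counts = type_counts_by_generation(sample)
print(counts)
# With matplotlib installed, counts.loc[1].sort_values().plot.bar()
# would draw the bar plot for generation 1.
```

Mega forms are separate rows in this dataset, so they are counted as well. Whether that is desirable for judging rarity is a choice to make before plotting.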

### Stats
The stats are important for comparing pokemon by stats. Here we can do a lot of fun calculations. Are starter pokemon generally better or worse than other pokemon (comparing the mean of different stats or the sum of all stats)? Are there notable differences between the starter pokemon?
Good visualizations here could be line plots.
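
The comparison of starters with other pokemon could start from a total over all six stats. The sketch below rests on two assumptions: the dataset has no starter flag, so starters are matched by name through a hand-written `STARTERS` list (only generation 1 here), and `Total` and `IsStarter` are hypothetical helper columns. The `sample` frame reuses the five rows from `pokemon.head()`.

```python
import pandas as pd

STATS = ["HP", "Attack", "Defense", "SpAtk", "SpDef", "Speed"]

# Assumption: starters are identified by name, because the dataset has no
# starter column. Only the generation 1 starters are listed; extend as needed.
STARTERS = ["Bulbasaur", "Charmander", "Squirtle"]

# Five rows copied from the head() output above, so the snippet runs without the CSV.
sample = pd.DataFrame({
    "Name": ["Bulbasaur", "Ivysaur", "Venusaur", "VenusaurMega Venusaur", "Charmander"],
    "HP": [45, 60, 80, 80, 39],
    "Attack": [49, 62, 82, 100, 52],
    "Defense": [49, 63, 83, 123, 43],
    "SpAtk": [65, 80, 100, 122, 60],
    "SpDef": [65, 80, 100, 120, 50],
    "Speed": [45, 60, 80, 80, 65],
})

# Sum of all six stats per pokemon (hypothetical helper column).
sample["Total"] = sample[STATS].sum(axis=1)
# Flag the rows whose name is in the starter list (hypothetical helper column).
sample["IsStarter"] = sample["Name"].isin(STARTERS)

# Mean total of starters (True) vs all other pokemon (False).
mean_total = sample.groupby("IsStarter")["Total"].mean()
print(sample[["Name", "Total", "IsStarter"]])
print(mean_total)
```

On these five rows the non-starters are all evolved forms, so the gap says little by itself. The same steps on the full `pokemon` frame give the comparison the question asks for.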

### Extra
A few other things to consider would be: 
- Grouping by generation (or focussing on one generation)
- Dropping legendary pokemon to get a better understanding of e.g. the average stats of non-legendary pokemon
- Is there a relation between types and stats, e.g. do water types generally have more HP?
- Is there a relation between stats, e.g. do pokemon with higher attack have lower special attack?
- Do the two points above translate to our starter pokemon?
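
The points above could be approached roughly as follows. This is a sketch, not a finished analysis. It reuses the five rows from `pokemon.head()` and adds one legendary row (Mewtwo) typed in by hand for illustration, which is an assumption and was not copied from the output above. The variable names are hypothetical.

```python
import pandas as pd

STATS = ["HP", "Attack", "Defense", "SpAtk", "SpDef", "Speed"]

# Five rows from the head() output plus one hand-typed legendary row (assumption).
sample = pd.DataFrame({
    "Name": ["Bulbasaur", "Ivysaur", "Venusaur", "VenusaurMega Venusaur", "Charmander", "Mewtwo"],
    "Type1": ["Grass", "Grass", "Grass", "Grass", "Fire", "Psychic"],
    "HP": [45, 60, 80, 80, 39, 106],
    "Attack": [49, 62, 82, 100, 52, 110],
    "Defense": [49, 63, 83, 123, 43, 90],
    "SpAtk": [65, 80, 100, 122, 60, 154],
    "SpDef": [65, 80, 100, 120, 50, 90],
    "Speed": [45, 60, 80, 80, 65, 130],
    "Generation": [1, 1, 1, 1, 1, 1],
    "Legendary": [False, False, False, False, False, True],
})

# Drop legendary pokemon so they do not pull the averages up.
non_legendary = sample[~sample["Legendary"]]

# Average stats per generation, legendaries excluded.
mean_by_gen = non_legendary.groupby("Generation")[STATS].mean()

# Relation between type and a stat: mean HP per primary type.
mean_hp_by_type = non_legendary.groupby("Type1")["HP"].mean()

# Relation between stats: pairwise Pearson correlation matrix.
stat_corr = non_legendary[STATS].corr()

print(mean_by_gen)
print(mean_hp_by_type)
print(stat_corr.round(2))
```

Grouping by `Type1` alone ignores the second type. For dual type pokemon a long format with one row per type would be an alternative.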



