# Pokemon Data Analysis
---
Introduction to Data Analysis in Python using `pandas`, `numpy`, and `matplotlib`.

All data is drawn from Kaggle's "Complete Pokemon Dataset" for Generations 1 - 7 and a web scrape of Serebii.net for Gen 8.
The original datasets are listed below:
- [Generations 1 - 7](https://www.kaggle.com/rounakbanik/pokemon/version/1)
- [Generation 8](https://github.com/yaylinda/serebii-parser)  
  
![Pokeball](https://image.businessinsider.com/5dcee8473afd37158f6c8ab9?width=1100&format=jpeg&auto=webp)

## What Questions Should We Ask?
---
I think a large part of data analysis is simply learning to ask the right questions. Questions like "how many Pokemon are there?" are easy to answer, but do little to provide insight into the minutiae of the dataset. Instead, if focusing on quantity, we might ask "how many new Pokemon were created in each generation?". If the numbers seem largely similar from generation to generation, we might infer that the Pokemon Company sets numeric goals for how many new Pokemon must be introduced with each release. If that is the case, we could ask further questions like "What is the distribution of typing in each release? Is it similar? Could this affect the balance of power in the game?" If it is not similar, could the distribution of typing per generation be based on the in-game environment? Does generation 7, which is set in an archipelago, contain a higher number of water and ground type Pokemon than other generations? If, on the other hand, the generational numbers vary widely, we could ask questions about the company itself, such as "For generations where fewer new Pokemon were released, was the company's budget or timeline any different from generations with larger releases?".  
  
These are the kinds of questions that grant us further insight into the nature of the games and the company behind their development. However, before we can run, we must learn to walk. In the interest of learning basic navigation and analysis with the libraries involved, let's start with something more readily attainable.  
  
## Basic Questions
--- 
- What is the strongest Pokemon in terms of overall stats?
- What is the most effective move typing?
    - Which is the least effective?
- Which Pokemon has the highest base stat for:
    - Attack
    - Special Attack 
    - Defense
    - Special Defense
    - Speed
    - HP
- Which typing has the highest average base stats for:
    - Attack
    - Special Attack
    - Defense
    - Special Defense
    - Speed
    - HP
- Which typing has the most Pokemon with a base stat total over 600?
- Does typing have any bearing on base stats?
    - Do fire types have stronger than average special attack?
    - Do dragon types have stronger than average attack?
    - Do rock, ground, steel types have stronger than average defense?
- What single typing has the fewest weaknesses?
- What single typing has the most weaknesses?
- What about dual typings? (Not all dual typings are currently in the game)
- Are there correlations between base stats and certain non-stat characteristics?
    - Weight and defense/hp?
    - Weight and overall total stats? (Do legendaries weigh more?)
    - Total stats and experience gain?
    - Total stats and steps-to-hatch?
  
## What Data Do We Need?
--- 
- Basic information:
    - Number 
    - Name 
- Statistically Relevant Information:
    - Typing 
    - Stats:
        - Attack
        - Special Attack 
        - Defense
        - Special Defense
        - Speed
        - HP
        - Base stat total
    - Weight
    - Experience gain
    - Steps to hatch 


## Installing Dependencies
--- 
First, we need to make sure that we have access to the libraries we need. There are a couple of ways to do this, one of which involves conda, but I like `pip`. As Python's standard package installer, `pip` solves this pretty easily in a few lines.

In [1]:
# Install a pip package in the current Jupyter kernel
import sys
!{sys.executable} -m pip install pandas # the library used for handling the dataset
!{sys.executable} -m pip install matplotlib # the library used to visualize our data
!{sys.executable} -m pip install xlrd # dependency we need to read from Excel files

Collecting pandas
  Downloading https://files.pythonhosted.org/packages/ab/ba/f97030b7e8ec0a981abdca173de4e727b3a7b4ed5dba492f362ba87d59a2/pandas-1.0.1-cp37-cp37m-macosx_10_9_x86_64.whl (9.8MB)
Collecting pytz>=2017.2 (from pandas)
  Downloading https://files.pythonhosted.org/packages/e7/f9/f0b53f88060247251bf481fa6ea62cd0d25bf1b11a87888e53ce5b7c8ad2/pytz-2019.3-py2.py3-none-any.whl (509kB)
Collecting numpy>=1.13.3 (from pandas)
  Downloading https://files.pythonhosted.org/packages/2f/5b/2cc2b9285e8b2ca8d2c1e4a2cbf1b12d70a2488ea78170de1909bca725f2/numpy-1.18.1-cp37-cp37m-macosx_10_9_x86_64.whl (15.1MB)
Installing collected packages: pytz, numpy, pandas
Successfully installed numpy-1.18.1 pandas-1.0.1 pytz-2019.3
You are using pip version 19.0.3, however version 20.0.2 is av
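
If you want to confirm that the installs worked before moving on, a quick check along these lines can help. This is only a sketch: the `required` list is my own summary of the packages this notebook leans on, and `importlib.util.find_spec` simply reports whether Python can locate each one.

```python
import importlib.util

# Packages this notebook relies on; numpy arrives as a pandas dependency.
required = ["pandas", "numpy", "matplotlib", "xlrd"]

# find_spec returns None when a top-level package cannot be located.
missing = [name for name in required if importlib.util.find_spec(name) is None]

if missing:
    print("Still missing:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

Anything listed as missing can be installed by re-running the `pip` cell above.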

## Importing The Proper Libraries
---
For this analysis, I'm going to be using `pandas` to interact with the data. Below, we import the libraries we need under their conventional aliases: "`pd`" for pandas, "`np`" for numpy, and "`plt`" for matplotlib's plotting interface. Python's regular expression module, "`re`", is also imported for some more complex sorting, but more on that later.

In [2]:
# import pandas and numpy for data analysis, matplotlib for plotting, and re for regular expressions
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import re 

## Reading Data Into The Notebook
---
Pandas has a handy family of functions named `read_*` (for example, `read_csv` and `read_excel`) that import your data into a workable format called a dataframe. Both of my source files are CSVs, but the file read below, `pokedex.xlsx`, is an Excel workbook, so it is loaded with `read_excel`. Plenty of other formats are supported as well, including plain delimited text files.

In [4]:
pokemon = pd.read_excel("pokedex.xlsx")
pokemon.head()

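| Unnamed: 0 | Pokedex Number | Name | Type 1 | Type 2 | HP | Attack | Defense | Special Attack | Special Defense | Speed | ... | Against Ground | Against Ice | Against Normal | Against Poison | Against Psychic | Against Rock | Against Steel | Against Water | Steps to Hatch | Experience Growth |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Bulbasaur | Grass | Poison | 45 | 49 | 49 | 65 | 65 | 45 | ... | 1.0 | 2.0 | 1.0 | 1.0 | 2 | 1.0 | 1.0 | 0.5 | 5120.0 | 1059860.0 |
| 1 | 2 | Ivysaur | Grass | Poison | 60 | 62 | 63 | 80 | 80 | 60 | ... | 1.0 | 2.0 | 1.0 | 1.0 | 2 | 1.0 | 1.0 | 0.5 | 5120.0 | 1059860.0 |
| 2 | 3 | Venusaur | Grass | Poison | 80 | 100 | 123 | 122 | 120 | 80 | ... | 1.0 | 2.0 | 1.0 | 1.0 | 2 | 1.0 | 1.0 | 0.5 | 5120.0 | 1059860.0 |
| 3 | 4 | Charmander | Fire |  | 39 | 52 | 43 | 60 | 50 | 65 | ... | 2.0 | 0.5 | 1.0 | 1.0 | 1 | 2.0 | 0.5 | 2.0 | 5120.0 | 1059860.0 |
| 4 | 5 | Charmeleon | Fire |  | 58 | 64 | 58 | 80 | 65 | 80 | ... | 2.0 | 0.5 | 1.0 | 1.0 | 1 | 2.0 | 0.5 | 2.0 | 5120.0 | 1059860.0 |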
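
The `Unnamed: 0` column in the output above looks like a row index that was written out when the file was saved, although that is my assumption rather than something the data confirms. The sketch below reproduces the effect with a small hypothetical frame and `read_csv`, using an in-memory string in place of a file on disk.

```python
import io
import pandas as pd

# Hypothetical sample data; only the column names are borrowed from the real file.
sample = pd.DataFrame({
    "Name": ["Bulbasaur", "Charmander", "Squirtle"],
    "Type 1": ["Grass", "Fire", "Water"],
    "HP": [45, 39, 44],
})

# Saving with the default index=True writes the row index as an unnamed first column.
csv_text = sample.to_csv()

# Reading it back naively turns that index into a column called "Unnamed: 0".
naive = pd.read_csv(io.StringIO(csv_text))
print(naive.columns.tolist())

# index_col=0 tells pandas to use the first column as the row index instead.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(clean.columns.tolist())
```

`read_excel` accepts the same `index_col` argument, so the same idea should carry over to an Excel workbook.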


In [6]:
# Creates a df containing only pokemon that are at least partially grass type 
grass = pokemon.loc[(pokemon["Type 1"] == "Grass") | (pokemon["Type 2"] == "Grass")]
grass

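| Unnamed: 0 | Pokedex Number | Name | Type 1 | Type 2 | HP | Attack | Defense | Special Attack | Special Defense | Speed | ... | Against Ground | Against Ice | Against Normal | Against Poison | Against Psychic | Against Rock | Against Steel | Against Water | Steps to Hatch | Experience Growth |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Bulbasaur | Grass | Poison | 45 | 49 | 49 | 65 | 65 | 45 | ... | 1.0 | 2.0 | 1.0 | 1.0 | 2 | 1.0 | 1.0 | 0.50 | 5120.0 | 1059860.0 |
| 1 | 2 | Ivysaur | Grass | Poison | 60 | 62 | 63 | 80 | 80 | 60 | ... | 1.0 | 2.0 | 1.0 | 1.0 | 2 | 1.0 | 1.0 | 0.50 | 5120.0 | 1059860.0 |
| 2 | 3 | Venusaur | Grass | Poison | 80 | 100 | 123 | 122 | 120 | 80 | ... | 1.0 | 2.0 | 1.0 | 1.0 | 2 | 1.0 | 1.0 | 0.50 | 5120.0 | 1059860.0 |
| 49 | 43 | Oddish | Grass | Poison | 45 | 50 | 55 | 75 | 65 | 30 | ... | 1.0 | 2.0 | 1.0 | 1.0 | 2 | 1.0 | 1.0 | 0.50 | 5120.0 | 1059860.0 |
| 50 | 44 | Gloom | Grass | Poison | 60 | 65 | 70 | 85 | 75 | 40 | ... | 1.0 | 2.0 | 1.0 | 1.0 | 2 | 1.0 | 1.0 | 0.50 | 5120.0 | 1059860.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 859 | 829 | Gossifleur | Grass |  | 40 | 40 | 60 | 40 | 60 | 10 | ... | 0.5 | 2.0 | 1.0 | 2.0 | 1 | 1.0 | 1.0 | 0.50 |  |  |
| 860 | 830 | Eldegoss | Grass |  | 60 | 50 | 90 | 80 | 120 | 60 | ... | 0.5 | 2.0 | 1.0 | 2.0 | 1 | 1.0 | 1.0 | 0.50 |  |  |
| 870 | 840 | Applin | Grass | Dragon | 40 | 40 | 80 | 40 | 40 | 20 | ... | 0.5 | 4.0 | 1.0 | 2.0 | 1 | 1.0 | 1.0 | 0.25 |  |  |
| 871 | 841 | Flapple | Grass | Dragon | 70 | 110 | 80 | 95 | 60 | 70 | ... | 0.5 | 4.0 | 1.0 | 2.0 | 1 | 1.0 | 1.0 | 0.25 |  |  |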
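
To see what the grass filter is doing under the hood, here is a minimal sketch on a hypothetical four-row frame (the `df`, `mask`, and `mask_alt` names are mine). Each comparison produces a boolean Series, `|` combines them row by row, and `.loc` keeps the rows where the result is `True`.

```python
import pandas as pd

# Hypothetical mini-Pokedex; Type 2 is None for single-type Pokemon.
df = pd.DataFrame({
    "Name": ["Bulbasaur", "Charmander", "Applin", "Paras"],
    "Type 1": ["Grass", "Fire", "Grass", "Bug"],
    "Type 2": ["Poison", None, "Dragon", "Grass"],
})

# Same pattern as the cell above: each comparison yields a boolean Series,
# and | combines them element-wise (the parentheses are required).
mask = (df["Type 1"] == "Grass") | (df["Type 2"] == "Grass")

# Equivalent alternative: compare both columns at once,
# then keep rows where either column matched.
mask_alt = df[["Type 1", "Type 2"]].eq("Grass").any(axis=1)

grass_names = df.loc[mask, "Name"].tolist()
print(grass_names)
```

The second form scales a little more gracefully if you ever need to check more than two columns, but both select the same rows here.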


## Basic Navigation  
---
Notice that this dataset is quite large: gens 1 - 7 alone account for over 800 rows and 41 columns. As such, not all of it is shown in the output cell. Fortunately, `pandas` has a couple of helpful methods that we can use to view the parts we want to see.
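
Before reaching for those methods, it can help to check how big a dataframe really is and to widen the display temporarily. The sketch below uses a hypothetical all-zeros frame called `wide` as a stand-in for the Pokedex, along with the `.shape` attribute and `pd.option_context`.

```python
import numpy as np
import pandas as pd

# Hypothetical wide frame standing in for the Pokedex: 100 rows, 41 columns.
wide = pd.DataFrame(np.zeros((100, 41)), columns=[f"col_{i}" for i in range(41)])

# .shape reports (rows, columns) without printing any data.
print(wide.shape)

# option_context changes display settings only inside the with-block,
# so every column is printed here without altering the notebook's defaults.
with pd.option_context("display.max_columns", None):
    print(wide.head(2))
```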

First up, `.head()`. By default, this displays the first 5 rows of the dataframe, but you can change that by passing the number of rows you wish to view.  

Now, if you're thinking, "hey, that's useful, but is there another optional argument to reverse it?", the answer is technically no, but there is another method, aptly named `.tail()`. It works exactly like `.head()`, except that it shows the *last* 5 rows of a dataframe. Again, you can change this with an optional numeric argument.
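
As a quick illustration of both methods, here is a sketch on a hypothetical ten-row frame named `numbers`:

```python
import pandas as pd

# Hypothetical frame with Pokedex numbers 1 through 10.
numbers = pd.DataFrame({"Pokedex Number": range(1, 11)})

# With no argument, head() returns the first 5 rows.
default_rows = numbers.head()

# A numeric argument overrides the default for either method.
first_three = numbers.head(3)
last_two = numbers.tail(2)

print(first_three)
print(last_two)
```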

We can use the `.columns` attribute to display the column headers of our dataframe.
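
A short sketch of `.columns` on a hypothetical one-row frame follows. Note that it is an attribute rather than a method, so it takes no parentheses.

```python
import pandas as pd

# Hypothetical single-row frame using a few of the real column names.
df = pd.DataFrame({"Name": ["Bulbasaur"], "Type 1": ["Grass"], "HP": [45]})

# .columns returns an Index object holding the header labels.
print(df.columns)

# tolist() converts it to a plain Python list, which is easier to rearrange.
headers = df.columns.tolist()
print(headers)

# Membership tests work directly on the Index.
has_hp = "HP" in df.columns
print(has_hp)
```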

Now, let's use these tools to reorganize our data. Currently, the dataframe headers are arranged alphabetically. While this might be useful in some cases, it isn't in ours. So, let's move things around to resemble a more traditional pokedex.
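
One common way to reorder columns is to index the dataframe with a list of labels in the order you want. The sketch below is only an illustration on a hypothetical frame; the `front` list is my own choice of leading columns, not necessarily the ordering used in this notebook.

```python
import pandas as pd

# Hypothetical frame whose headers are in alphabetical order.
df = pd.DataFrame({
    "Attack": [49, 52],
    "HP": [45, 39],
    "Name": ["Bulbasaur", "Charmander"],
    "Pokedex Number": [1, 4],
    "Type 1": ["Grass", "Fire"],
})

# Columns to put first, in pokedex-style order (an assumed ordering).
front = ["Pokedex Number", "Name", "Type 1"]

# Everything else keeps its current relative order.
rest = [col for col in df.columns if col not in front]

# Indexing with a list of labels returns a new frame in that column order.
reordered = df[front + rest]
print(reordered.columns.tolist())
```

Because indexing returns a new dataframe, the original `df` is left untouched until you assign the result back to it.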