### POKEMON STATS - Using python to analyse Pokemon!!! 

Hello Ed - this is a Jupyter Notebook - the best way to use and learn python.   
It's super cool - because you can just run things one chunk at a time   
And not worry about making mistakes.     


The first thing about notebooks is that they have 2 different types of cell (these box thingies).....    
* The first one is called Markdown - where you can just write whatever you want - like instructions.    
* The second one is Code - where you can run python code     

The code boxes should be run in order - by pressing the execute triangle to the left of the box.    

Ok - first thing you need to do is make sure the Pokemon file is in the same directory as this notebook - like in your 'My Documents' directory. 

Then you need to install the correct libraries to allow you to run this code correctly........ 

In [None]:
### Remember any line with '#' at the start is just a comment - not real code... And the text should turn green 
# Next run this box by clicking the execute cell 'triangle' 
# These commands install new packages (a bit like DLC for games) onto Python  
! pip install pandas        
! pip install numpy
! pip install matplotlib
! pip install seaborn
! pip install pokemon

### Let's get started with some Basic Analysis

The first thing here is to import the stuff you've just downloaded into your notebook: 

In [1]:
import pandas as pd   #importing all the important packages
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')

Next - you're going to read the Pokemon.csv file into memory.    
You're going to store it in something called a dataframe - a huge grid of stuff - in this case a huge grid of pokemon information........      
Your dataframe is going to be called df .....  because why not        
The command df.head(10) shows you the first 10 rows of the grid that you've just made - and the kind of information that's in there ......

In [None]:
df = pd.read_csv('../input/Pokemon.csv')  #read the csv file and save it into a variable (use 'Pokemon.csv' here if the file sits next to this notebook)
df.head(n=10)                    #print the first 10 rows of the table

This next cell changes the column headers - puts them in capitals - and removes any underscores. 

In [None]:
df.columns = df.columns.str.upper().str.replace('_', '') #change into upper case
df.head()

Next - you're going to see how to reference a column     
df['LEGENDARY'] is the Legendary column of df     
Here you're showing the first 5 rows where the legendary is true .....    
Change to head(10) if you want to see more   

In [None]:
df[df['LEGENDARY']==True].head(5)  #Showing the legendary pokemons
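
If you want to see what is going on inside those square brackets, here is a small sketch. It uses a tiny made-up table called `toy` (an assumption for illustration, not the real Pokemon file). The comparison makes a column of True/False values (often called a mask), and the dataframe keeps only the rows where the mask is True.

```python
import pandas as pd

# A tiny made-up table, just to show the idea (not the real Pokemon.csv)
toy = pd.DataFrame({
    'NAME': ['Pidgey', 'Mewtwo', 'Rattata', 'Articuno'],
    'LEGENDARY': [False, True, False, True],
})

mask = toy['LEGENDARY'] == True      # one True/False value per row
print(mask.tolist())                 # [False, True, False, True]

legendary_only = toy[mask]           # keep only the rows where the mask is True
print(legendary_only['NAME'].tolist())   # ['Mewtwo', 'Articuno']
```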

Next you're going to change the Index - in other words how python looks up your rows   
It's going to change the index to NAME   

In [5]:
df = df.set_index('NAME') #change and set the index to the name attribute

### CLEANING THE DATAFRAME

The thing that's really annoying about datasets you get on the internet is that they are full of stuff you don't want....     
So here we'll tidy it up a little bit    

In [None]:
## The index of Mega Pokemons contained extra and unneeded text. Removed all the text before "Mega"  
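df.index = df.index.str.replace(".*(?=Mega)", "", regex=True)  #regex=True is needed in newer pandas so the pattern is treated as a regular expression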
df.head(10)

Next we'll drop the column called #

In [7]:
df=df.drop(['#'],axis=1) #drop the columns with axis=1;axis=0 is for rows

The dataframe has lots of things that you can use to analyse it    
All you have to do is add a dot after the df, followed by the name of the thing you want     
So for example df.columns lists all the column names     
And df.shape will give the number of rows and columns of the dataframe       

In [None]:
print('The columns of the dataset are: ',df.columns) #show the dataframe columns
print('The shape of the dataframe is: ',df.shape)    #shape of the dataframe

Another funky thing is that some of the columns have dodgy values in them - called NaNs    
Which basically stands for 'Not A Number' - it means the value is missing     
So here - we find the NaNs in Type 2 and fill them with the value from Type 1 

In [9]:
#some values in TYPE2 are empty and thus they have to be filled or deleted
df['TYPE 2'] = df['TYPE 2'].fillna(df['TYPE 1']) #fill NaN values in Type 2 with corresponding values of Type 1
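
Here is a small sketch of the same trick on a made-up three-row table called `toy` (an assumption for illustration), so you can see exactly which cells get filled in. Only the empty cells change. The ones that already had a value are left alone.

```python
import pandas as pd
import numpy as np

# A tiny made-up table with two missing TYPE 2 values
toy = pd.DataFrame({
    'TYPE 1': ['Grass', 'Fire', 'Water'],
    'TYPE 2': ['Poison', np.nan, np.nan],
})

print(toy['TYPE 2'].isna().sum())    # 2 missing values to start with

# Fill each missing TYPE 2 with the TYPE 1 value from the same row
toy['TYPE 2'] = toy['TYPE 2'].fillna(toy['TYPE 1'])
print(toy['TYPE 2'].tolist())        # ['Poison', 'Fire', 'Water']
```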

Next we can see how to look at individual information...   
We can use the pokemon index we just made (using the pokemon names) to view the rows for that Pokemon    
We do that using df.loc    
df.iloc does the same thing - BUT it uses the row number rather than the index name    
 

In [None]:
print(df.loc['Bulbasaur'])      #retrieves complete row data from index with value Bulbasaur
print(df.iloc[0])               #retrieves complete row data from position 0 ; integer version of loc
print(df.loc['Kakuna'])         #another lookup by name
#loc works on labels in the index.
#iloc works on the positions in the index (so it only takes integers).
#the older df.ix (a mix of loc and iloc) was removed in pandas 1.0, so use loc or iloc instead.
#in order to find details about any pokemon, just specify its name
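
A quick sketch of the difference, using a made-up three-row table called `toy` (an assumption for illustration). The same cell can be reached by its name with loc or by its position with iloc.

```python
import pandas as pd

# A tiny table indexed by name, like df after set_index('NAME')
toy = pd.DataFrame(
    {'HP': [45, 60, 80]},
    index=['Bulbasaur', 'Ivysaur', 'Venusaur'],
)

by_name = toy.loc['Ivysaur', 'HP']   # look up using the labels
by_position = toy.iloc[1, 0]         # look up using row 1, column 0 (counting starts at 0)

print(by_name, by_position)          # 60 60
```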

Next we can look at stuff that links together ....    
So here the | is an OR    
Therefore this statement is looking for anything in the df where    
the TYPE 1 is Fire OR Dragon    
AND    
the TYPE 2 is Dragon OR Fire   

In [None]:
#filtering pokemons using logical operators
df[((df['TYPE 1']=='Fire') | (df['TYPE 1']=='Dragon')) & ((df['TYPE 2']=='Dragon') | (df['TYPE 2']=='Fire'))].head(3)

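Next we are going to print out the Pokemon with the highest HP     
Using idxmax - which gives the index name (the Pokemon) of the biggest value    

In [None]:
print("Max HP:",df['HP'].idxmax())                  #returns the pokemon with highest HP
print("Max DEFENCE:",(df['DEFENSE']).idxmax())      #same idea for DEFENSE; note that argmax() in newer pandas gives the row number, not the name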
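
Here is a small sketch of idxmax next to argmax, on a made-up three-value column called `hp` (an assumption for illustration). In current pandas idxmax gives back the label and argmax gives back the position.

```python
import pandas as pd

# Three HP values, indexed by name
hp = pd.Series([45, 250, 160], index=['Bulbasaur', 'Chansey', 'Snorlax'])

print(hp.idxmax())   # Chansey  (the index label of the biggest value)
print(hp.argmax())   # 1        (the position of the biggest value, counting from 0)
```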

Next - we can sort any column in the df using df.sort_values      
By using ascending=False - we put the largest value at the top    
Again, by adding on .head(10) we only show the top 10 of the sort    


In [None]:
df.sort_values('TOTAL',ascending=False).head(10)  #this arranges the pokemons in the descending order of the Totals.
#sort_values() is used for sorting and ascending=False is making it in descending order

df.unique - does exactly what it says on the tin - it lists all the individual values from a column    
But only one of each      

In [None]:
print('The unique  pokemon types are',df['TYPE 1'].unique()) #shows all the unique types in column
print('The number of unique types is',df['TYPE 1'].nunique()) #shows count of unique values 

df.value_counts will count all the different types of something in a column   
df.sum adds up everything in a column.   
In this case - it's adding up all the 'Bugs'  

In [None]:
print(df['TYPE 1'].value_counts(), '\n' ,df['TYPE 2'].value_counts())       #count different types of pokemons
df.groupby(['TYPE 1']).size()                                               #same as above
(df['TYPE 1']=='Bug').sum()                                                 #counts for a single value
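
Why does .sum() count things here? A small sketch, using a made-up list of types called `types` (an assumption for illustration). The comparison gives True/False values, and when you add them up each True counts as 1.

```python
import pandas as pd

# A made-up column of types
types = pd.Series(['Bug', 'Water', 'Bug', 'Fire', 'Bug'])

is_bug = types == 'Bug'              # True where the type is Bug
bug_count = int(is_bug.sum())        # each True adds 1, each False adds 0
print(bug_count)                     # 3

# value_counts() gives the same number, alongside the counts for every other type
print(int(types.value_counts()['Bug']))   # 3
```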

Last of all - df.describe gives a statistical summary     
So next time you need to do any statistics - ask to do it on python, as it takes milliseconds :)     

In [None]:
df_summary = df.describe() #summary of the pokemon dataframe
df_summary

## VISUALISATIONS

##### The attack distribution for the pokemons across all the generations: 
So this is a distribution plot - basically showing how many pokemon have got what strength of attack .....     
Example - only 5 or 6 pokemon have got an attack between 0-25     


Anything that has plt.  at the start of it is using the matplotlib library    
This is a graphical library that can plot any kind of graph you can think of ......    
In a notebook, all the information you put in one cell, will go into the plot    
So titles, axes, colours can all be defined inside this one cell for the plot      

In [None]:
bins=range(0,200,20)                    #they act as bin containers - for the different ranges
plt.hist(df["ATTACK"],bins,histtype="bar",rwidth=1.2,color='#0ff0ff') #hist() is used to plot a histogram
plt.xlabel('Attack')                    #set the xlabel name
plt.ylabel('Count')                     #set the ylabel name
plt.plot()
### You can use df.mean to calculate the average down any column 
plt.axvline(df['ATTACK'].mean(),linestyle='dashed',color='red') #draw a vertical line showing the average Attack value
plt.show()

Above is a Histogram showing the distribution of attacks for the Pokemons. The average value is between 75-77

### Fire Vs Water

Next is a scatter plot!!!     
This is actually 2 scatter plots - one for Fire types and one for Water types  
You'll notice that it only plots the first 50 of each - you can easily change that .....     
 


In [None]:
fire=df[(df['TYPE 1']=='Fire') | ((df['TYPE 2'])=="Fire")]              #fire contains all fire pokemons
water=df[(df['TYPE 1']=='Water') | ((df['TYPE 2'])=="Water")]           #all water pokemons into the same variable
plt.scatter(fire.ATTACK.head(50),fire.DEFENSE.head(50),color='r',label='Fire',marker="*",s=50)      #scatter plot for Fire (single-letter colours must be lower case in newer matplotlib)
plt.scatter(water.ATTACK.head(50),water.DEFENSE.head(50),color='b',label="Water",s=25)              #scatter plot for Water
plt.xlabel("Attack")
plt.ylabel("Defence")
plt.legend()
plt.plot()
fig=plt.gcf()                   #get the current figure using .gcf()
fig.set_size_inches(12,6)       #set the size for the figure
plt.show()

This shows that, generally, fire type pokemons have a better attack than water type pokemons but have a lower defence than water type.

### Strongest Pokemons By Types

Here - we use df.sort_values to sort by the 'TOTAL' column      
This is put into a whole new dataframe called 'strong'    
Then - all the duplicates from the 'TYPE 1' column are dropped (or removed)

In [None]:
strong=df.sort_values(by='TOTAL', ascending=False) #sorting the rows in descending order
strong.drop_duplicates(subset=['TYPE 1'],keep='first') #since the rows are now sorted in descending order
#thus we take the first row for every new type of pokemon i.e the table will check TYPE 1 of every pokemon
#The first pokemon of that type is the strongest for that type
#so we just keep the first row
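
Here is a small sketch of the sort-then-drop-duplicates trick on a made-up table called `toy` (the names and numbers are assumptions for illustration). Unlike the cell above, it saves the result into a variable so it can be reused.

```python
import pandas as pd

# Four made-up pokemon, two of each type
toy = pd.DataFrame({
    'NAME': ['A', 'B', 'C', 'D'],
    'TYPE 1': ['Fire', 'Water', 'Fire', 'Water'],
    'TOTAL': [300, 500, 600, 400],
}).set_index('NAME')

# Biggest TOTAL first, then keep only the first row seen for each type
strongest = (
    toy.sort_values(by='TOTAL', ascending=False)
       .drop_duplicates(subset=['TYPE 1'], keep='first')
)
print(strongest.index.tolist())      # ['C', 'B']
```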

## Distribution of various pokemon types

Time for a pie chart!!!     
The cheat here is that the 'sizes' are manually calculated.         
Explode pushes out a wedge a little (in this case by '0.1')    
You can change the colours if you want - there are soooo many different colours to choose from     

In [None]:
labels = 'Water', 'Normal', 'Grass', 'Bug', 'Psychic', 'Fire', 'Electric', 'Rock', 'Other'
sizes = [112, 98, 70, 69, 57, 52, 44, 44, 175]
colors = ['y', 'b', '#00ff00', 'c', 'r', 'g', 'silver', 'white', 'm']  # single-letter colours must be lower case in newer matplotlib
explode = (0, 0, 0.1, 0, 0, 0, 0, 0, 0)  # only "explode" the 3rd slice 
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=90)
plt.axis('equal')
plt.title("Percentage of Different Types of Pokemon")
plt.plot()
fig=plt.gcf()
fig.set_size_inches(7,7)
plt.show()
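
If you don't want to work out the sizes by hand, value_counts can do it for you. This is only a sketch on a made-up list of types called `types` (an assumption for illustration): it keeps the two most common types and lumps everything else into 'Other'. The resulting `labels` and `sizes` could then be passed to plt.pie in the same way as above.

```python
import pandas as pd

# A made-up column of types
types = pd.Series(['Water', 'Water', 'Water', 'Normal', 'Normal', 'Bug', 'Fire'])

counts = types.value_counts()        # most common type first
top = counts.iloc[:2]                # the two biggest groups
other = counts.iloc[2:].sum()        # everything else added together

labels = list(top.index) + ['Other']
sizes = [int(n) for n in top] + [int(other)]
print(labels)                        # ['Water', 'Normal', 'Other']
print(sizes)                         # [3, 2, 2]
```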

## All stats analysis of the pokemons

Ok - some charts that you've probably not used before ....    
This one is a boxplot - it shows the median (the middle value) but also the range of values      

In [None]:
df2=df.drop(['GENERATION','TOTAL'],axis=1)
sns.boxplot(data=df2)
plt.ylim(0,300)  #change the scale of y axis
plt.show()

This is another one showing attack strength for different pokemon types ........ 

In [None]:
plt.subplots(figsize = (15,5))
plt.title('Attack by Type1')
sns.boxplot(x = "TYPE 1", y = "ATTACK",data = df)
plt.ylim(0,200)
plt.show()

Another box plot - this one showing the Type 2 attack ...   


In [None]:
plt.subplots(figsize = (15,5))
plt.title('Attack by Type2')
sns.boxplot(x = "TYPE 2", y = "ATTACK",data=df)
plt.show()

So - what do these boxplots tell you about the Attacks?????? 

Another boxplot - this one for defence    

In [None]:
plt.subplots(figsize = (15,5))
plt.title('Defence by Type')
sns.boxplot(x = "TYPE 1", y = "DEFENSE",data = df)
plt.show()

This shows that steel type pokemons have the highest defence but normal type pokemons have the lowest defence - you see????   

### Now let's see the same stats in violinplot

Violin Plots are very similar to boxplots - but they show you how the data varies from the mean.    
This is called the distribution - and it shows how 'bunched up' or 'spread out' all the values are ....     
So a long thin violin means that all the data is spread out over a big range of values     
Short and fat ones mean that the data is bunched together - usually close to the mean or median value.      

In [None]:
plt.subplots(figsize = (20,10))
plt.title('Attack by Type1')
sns.violinplot(x = "TYPE 1", y = "ATTACK",data = df)
plt.ylim(0,200)
plt.show()

For example - if you make the next plot - the flying Type has a bulge, or fat, bit around a defence of 80.    
This means that lots of Flying types have a defence around that value.    
While the Steel types have a huge range of different defences      

In [None]:
plt.subplots(figsize = (20,10))
plt.title('Defence by Type1')
sns.violinplot(x = "TYPE 1", y = "DEFENSE",data = df)
plt.ylim(0,200)
plt.show()

You can plot violin plots for any range of values       
Here you can compare the different Generations         

In [None]:
plt.subplots(figsize = (15,5))
plt.title('Strongest Generation')
sns.violinplot(x = "GENERATION", y = "TOTAL",data = df)
plt.show()

This shows that generation 3  has the better pokemons

### Strong Pokemons By Type

Aha - a swarm-plot     
This puts dots for every different TOTAL value - grouped by TYPE1     
In a funky style - you can highlight the LEGENDARY in a different colour or 'hue'        
And put the average value for all the TOTAL values across the plot as a lovely dotted red line!!          

In [None]:
plt.figure(figsize=(12,6))
top_types=df['TYPE 1'].value_counts()[:10] #take the top 10 Types
df1=df[df['TYPE 1'].isin(top_types.index)] #take the pokemons of the type with highest numbers, top 10
sns.swarmplot(x='TYPE 1',y='TOTAL',data=df1,hue='LEGENDARY') # this plot shows the points belonging to individual pokemons
# It is distributed by Type
plt.axhline(df1['TOTAL'].mean(),color='red',linestyle='dashed')
plt.show()

 Legendary Pokemons are mostly taking the top spots in the Strongest Pokemons


### Finding any Correlation between the attributes

Heatmaps are well cool     
They tell you how strongly different variables (such as HP, ATTACK etc ) are related to each other ......     
The higher the number the higher the relationship ...    
So high values of SP. ATK usually mean high values of TOTAL     

In [None]:
plt.figure(figsize=(10,6)) #manage the size of the plot
sns.heatmap(df.corr(numeric_only=True),annot=True) #df.corr() makes a correlation matrix (numeric_only=True skips the text columns) and sns.heatmap is used to show the correlations heatmap
plt.show()
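
To get a feel for what the numbers in the heatmap mean, here is a small sketch with a made-up table called `toy` (the values are assumptions picked to give extreme results). A value near 1 means two columns go up together, and a value near -1 means one goes up while the other goes down.

```python
import pandas as pd

# Made-up numbers: TOTAL rises with ATTACK, SPEED falls as ATTACK rises
toy = pd.DataFrame({
    'ATTACK': [10, 20, 30, 40],
    'TOTAL':  [100, 200, 300, 400],
    'SPEED':  [40, 30, 20, 10],
})

corr = toy.corr()                    # correlation between every pair of columns
attack_total = round(corr.loc['ATTACK', 'TOTAL'], 2)
attack_speed = round(corr.loc['ATTACK', 'SPEED'], 2)
print(attack_total)                  # 1.0
print(attack_speed)                  # -1.0
```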

### Number of Pokemons by Type And Generation

Next - let's draw some line plots     
This shows which type you get for each Generation .....     
Notice the funky colour names   
This is called HEX CODE  
#FFA500 is actually just orange - and if you typed orange in the '' then you would get the same colour    
If you type HEX CODE colour picker into Google - it has a built in slider for telling you what the HEX CODE is for different colours         
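
You can also ask matplotlib itself which hex code belongs to a colour name. This is a small sketch using `matplotlib.colors`; the colour name used by matplotlib here is 'orange'.

```python
from matplotlib import colors

# Turn a colour name into its hex code
orange_hex = colors.to_hex('orange')
print(orange_hex)                            # #ffa500

# Check that the name and the hex code are the same colour
print(colors.same_color('orange', '#FFA500'))   # True
```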

### Type 1

In [None]:
a=df.groupby(['GENERATION','TYPE 1']).count().reset_index()
a=a[['GENERATION','TYPE 1','TOTAL']]
a=a.pivot(index='GENERATION',columns='TYPE 1',values='TOTAL') #newer pandas needs the argument names spelled out
a[['Water','Fire','Grass','Dragon','Normal','Rock','Flying','Electric']].plot(color=['b','r','g','#FFA500','brown','#6666ff','#001012','y'],marker='o')
fig=plt.gcf()
fig.set_size_inches(12,6)
plt.show()
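
The groupby and pivot steps are easier to follow on a tiny example. This sketch uses a made-up five-row table called `toy` (an assumption for illustration) and counts rows with .size() rather than .count(), which is a slightly different route to the same kind of grid: one row per generation, one column per type.

```python
import pandas as pd

# Five made-up pokemon
toy = pd.DataFrame({
    'GENERATION': [1, 1, 1, 2, 2],
    'TYPE 1': ['Water', 'Water', 'Fire', 'Water', 'Fire'],
})

# Count how many rows there are for each (generation, type) pair
counts = toy.groupby(['GENERATION', 'TYPE 1']).size().reset_index(name='COUNT')

# Reshape: generations down the side, types across the top
table = counts.pivot(index='GENERATION', columns='TYPE 1', values='COUNT')
print(int(table.loc[1, 'Water']))    # 2
print(int(table.loc[2, 'Fire']))     # 1
```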

We can see that water pokemons had the highest numbers in the 1st Generation. However the number has decreased with passing generations. Similarly Grass type pokemons showed an increase in their numbers till generation 5.

Let's do the same for TYPE 2 

In [None]:
a=df.groupby(['GENERATION','TYPE 2']).count().reset_index()
a=a[['GENERATION','TYPE 2','TOTAL']]
a=a.pivot(index='GENERATION',columns='TYPE 2',values='TOTAL') #newer pandas needs the argument names spelled out
a[['Water','Fire','Grass','Dragon','Normal','Rock','Flying','Electric']].plot(color=['b','r','g','#FFA500','brown','#6666ff','#001012','y'],marker='o')
fig=plt.gcf()
fig.set_size_inches(12,6)
plt.show()

This graph shows that the number of Type2 Grass Pokemons has been steadily increasing. The same is the case for the Dragon Type Pokemons. For other Types the trends are somewhat uneven.

### And that's that!!!!! 
There are lots of different plots we've not covered - but this gives you a look at Matplotlib and python..... 