<h1>Pandas and matplotlib</h1>
Pandas is arguably the most widely used library in data science, and most visualisations in Python are made with matplotlib. Both are also very important to machine learning, because they let you explore the data (and formulate questions based on it) and clean and prepare it. It is often said, with little exaggeration, that 80-90% of a data scientist's work is cleaning data.

[pandas](https://pandas.pydata.org/) is a powerful library that allows one to work with data. 

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

To open our dataset, let us use [pd.read_csv](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).

In [None]:
pokemon_df = pd.read_csv('Pokemon.csv', sep=',')

Question: df stands for dataframe but what is a dataframe?

Question: how do we print the dataframe? When is that convenient? 

In [None]:
print(pokemon_df)

Being able to see the entire dataset on screen is a double-edged sword. On the one hand, the person working on the dataset can inspect it visually at any time. On the other hand, merely displaying the data can quickly become a burden on processing time. For large datasets a different approach is more practical.

There are a few tools we can use to inspect only part of the data, for example [head](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html) and [tail](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.tail.html).

In [None]:
# Displaying the first 5 rows of the dataframe.
pokemon_df.head()

In [None]:
# Displaying the bottom 5 rows of the dataframe.
pokemon_df.tail()
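
As a rough sketch of how to keep the output small, the example below uses a tiny made-up table that only borrows the column layout of the dataset (the names `toy_df`, `preview_df` and all values are assumptions for illustration). It shows `shape`, the optional row count of `head` and `tail`, and the `nrows` argument of `pd.read_csv`, which reads only the first rows of a file.

```python
import io

import pandas as pd

# Made-up rows that only mimic the column layout of Pokemon.csv.
csv_text = """Name,Type 1,HP,Attack
Mon A,Grass,45,49
Mon B,Fire,39,52
Mon C,Water,44,48
Mon D,Bug,45,30
Mon E,Fire,78,84
Mon F,Water,79,83
"""

toy_df = pd.read_csv(io.StringIO(csv_text))

# shape is a (rows, columns) tuple, so nothing large is printed.
print(toy_df.shape)

# head and tail accept the number of rows to show.
print(toy_df.head(2))
print(toy_df.tail(3))

# nrows stops reading after the given number of data rows.
preview_df = pd.read_csv(io.StringIO(csv_text), nrows=3)
print(len(preview_df))
```

Reading only a few rows with `nrows` can be a convenient first look at a file that is too large to load comfortably.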

Another very powerful tool is [describe](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html)

In [None]:
pokemon_df.describe()

Question: what are we seeing above? How is that useful? How is that NOT useful?

It is also possible to query individual columns. Selecting a single column works much like getting an item from a dictionary:

In [None]:
# Getting only the items that are in the HP column.
pokemon_df['HP']

In [None]:
pokemon_df.HP

In [None]:
# A list of columns can also be selected.
pokemon_df[['HP','Attack','Speed']]
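
The three selection styles above do not all return the same kind of object. The sketch below uses a small made-up dataframe (`toy_df` and its values are assumptions, not rows from the real file) to show that a single label gives a Series while a list of labels gives a DataFrame.

```python
import pandas as pd

# Made-up data; only the column names follow the dataset.
toy_df = pd.DataFrame({
    'Name': ['Mon A', 'Mon B', 'Mon C'],
    'Type 1': ['Grass', 'Fire', 'Water'],
    'HP': [45, 39, 44],
    'Attack': [49, 52, 48],
})

hp_series = toy_df['HP']    # single label: a one-dimensional Series
hp_frame = toy_df[['HP']]   # list of labels: a DataFrame with one column

print(type(hp_series).__name__)
print(type(hp_frame).__name__)

# Attribute access returns the same column as bracket access.
print(toy_df.HP.equals(toy_df['HP']))

# A name containing a space, such as 'Type 1', cannot be written as an
# attribute (toy_df.Type 1 is a syntax error), so brackets are needed.
type_series = toy_df['Type 1']
```

Bracket access works for every column name, which is why it is usually the safer habit.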

To list all the keys (column names) of the dataframe, for instance to have something you can quickly copy and paste, you can use

In [None]:
# This also works on dictionaries!
pokemon_df.keys()

In [None]:
pokemon_df.columns

For non-numerical columns, other tools are more appropriate for understanding the distribution of values, such as

In [None]:
# This shows all the unique entries in the dataset for that column.
pokemon_df['Type 1'].unique()

In [None]:
# This counts how many times each unique entry occurs in that column.
pokemon_df['Type 1'].value_counts()
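
Counts are often easier to read as proportions. The sketch below, on a made-up Series named `types` (an assumption for illustration), shows the `normalize` argument of `value_counts` and the related `nunique` method.

```python
import pandas as pd

# Made-up values standing in for a categorical column.
types = pd.Series(['Fire', 'Water', 'Fire', 'Grass'], name='Type 1')

counts = types.value_counts()                # absolute counts per entry
shares = types.value_counts(normalize=True)  # fractions that sum to 1
n_unique = types.nunique()                   # number of distinct entries

print(counts)
print(shares)
print(n_unique)
```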

Exercises: how many pokemon does each generation have? Also, how many legendary pokemon are there in total?

This is how we save our dataframe.

In [None]:
#Exporting data.
pokemon_df.to_csv('filename.csv', sep=',')
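
One detail worth knowing when exporting: by default `to_csv` also writes the row index. The sketch below writes a made-up dataframe to an in-memory buffer instead of a file (the names `toy_df`, `with_index` and `without_index` are assumptions for illustration) and reads it back to show the effect of `index=False`.

```python
import io

import pandas as pd

toy_df = pd.DataFrame({'Name': ['Mon A', 'Mon B'], 'HP': [45, 39]})

# Default behaviour: the row index is written as an unnamed first column.
with_index = io.StringIO()
toy_df.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)
print(list(reloaded_with_index.columns))  # ['Unnamed: 0', 'Name', 'HP']

# index=False writes only the data columns.
without_index = io.StringIO()
toy_df.to_csv(without_index, index=False)
without_index.seek(0)
reloaded = pd.read_csv(without_index)
print(list(reloaded.columns))  # ['Name', 'HP']
```

If the index carries no information, `index=False` avoids an extra `Unnamed: 0` column the next time the file is loaded.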

Describing the dataset.

In [None]:
# Getting the description.
pokemon_df.describe()

Question: what is all this information above?

In [None]:
# Combining methods.
pokemon_df.describe().to_csv('description.csv')

Now, to operate on a single column, the arithmetic operators we've seen so far work:

In [None]:
pokemon_df['HP'] + 5

Question: Did the operation above change the actual values in the dataframe?

Creating a new column in a dataframe is quite simple,

In [None]:
pokemon_df['Irrelevant'] = np.nan
pokemon_df['HP+100'] = pokemon_df['HP'] + 100

In [None]:
pokemon_df['Atk+Def+HP'] = pokemon_df['Attack'] + pokemon_df['Defense'] + pokemon_df['HP']

In [None]:
pokemon_df.head()

Dropping those columns with [drop](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) is just as simple:

In [None]:
# Getting the column names for a copy paste
pokemon_df.keys()

In [None]:
# Listing the columns to drop. 
# axis = 1 means we're dropping it column-wise.
pokemon_df.drop(['Irrelevant','HP+100'], axis=1)

In [None]:
pokemon_df.head()

In [None]:
# Modifying a specific datapoint.
pokemon_df.at[0,'Irrelevant'] = 10

Question: why are the columns still there? How do we actually get rid of them?

Exercise: create a new column where the Attack of each pokemon is divided by the maximum value on the attack column, and a new column containing the division of the attack and defense stats.

Filtering is another task on which a data scientist can spend countless hours. With pandas the process is quite simple.

In [None]:
# Showcase the pokemons with HP larger than 80.
pokemon_df[pokemon_df['HP'] > 80]

In [None]:
# Notice that the condition doesn't need to be numerical.
pokemon_df[pokemon_df['Type 1'] == 'Fire']

In [None]:
# Conditions can be combined with &; each one needs its own parentheses.
pokemon_df[(pokemon_df['Type 1'] == 'Ice')&(pokemon_df['Type 2'] == 'Psychic')]
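
The filters above can be combined in other ways as well. The sketch below uses a made-up dataframe (`toy_df` and its values are assumptions for illustration) to show the element-wise operators `&` (and), `|` (or) and `~` (not), plus `isin` for matching several values at once.

```python
import pandas as pd

# Made-up data; only the column names follow the dataset.
toy_df = pd.DataFrame({
    'Name': ['Mon A', 'Mon B', 'Mon C', 'Mon D'],
    'Type 1': ['Fire', 'Ice', 'Fire', 'Water'],
    'HP': [39, 90, 85, 44],
})

# Each condition is a boolean Series; naming them keeps filters readable.
is_fire = toy_df['Type 1'] == 'Fire'
high_hp = toy_df['HP'] > 80

fire_and_high = toy_df[is_fire & high_hp]  # both conditions hold
fire_or_high = toy_df[is_fire | high_hp]   # at least one holds
not_fire = toy_df[~is_fire]                # condition negated

# isin tests membership in a list of values.
fire_or_ice = toy_df[toy_df['Type 1'].isin(['Fire', 'Ice'])]

print(fire_and_high)
print(fire_or_high)
print(not_fire)
print(fire_or_ice)
```

Python's `and`, `or` and `not` do not work element-wise on a Series, which is why the symbols are used. When conditions are written inline, the parentheses are required because `&` and `|` bind more tightly than comparisons such as `==` and `>`.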

Exercise: how do we count how many entries satisfy the condition above? That is, what is the length of the filtered dataframe? [hint](https://www.w3schools.com/python/gloss_python_string_length.asp)

Examples on how to sort the dataframe.

In [None]:
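# Requires the 'Attack/Defense' column from the exercise above, e.g.
# pokemon_df['Attack/Defense'] = pokemon_df['Attack'] / pokemon_df['Defense']
# Sorted in ascending order, the first 10 rows have the smallest ratios.
pokemon_df.sort_values(by=['Attack/Defense'])[0:10]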

In [None]:
pokemon_df.sort_values(by=['Attack/Defense'], ascending=False)[0:10]
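
Sorting is not limited to a single column. The sketch below, on made-up data (`toy_df` and its values are assumptions for illustration), sorts by two columns with a different direction for each, and uses `nlargest` as a shortcut for "sort descending and keep the top rows".

```python
import pandas as pd

# Made-up data; only the column names follow the dataset.
toy_df = pd.DataFrame({
    'Name': ['Mon A', 'Mon B', 'Mon C', 'Mon D'],
    'Type 1': ['Fire', 'Water', 'Fire', 'Water'],
    'Attack': [52, 48, 84, 83],
})

# Sort by type alphabetically, then by Attack from high to low within a type.
by_type_then_attack = toy_df.sort_values(
    by=['Type 1', 'Attack'], ascending=[True, False]
)
print(by_type_then_attack)

# The two rows with the highest Attack.
top_two = toy_df.nlargest(2, 'Attack')
print(top_two)
```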

Exercise: how many pokemons don't have a second type? [hint](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isna.html)

What fraction of the dataset do they represent?

Numbers on their own can be confusing, and their significance is limited when no context or comparison is offered. Pandas dataframes therefore come with many built-in functions that can help you. For example:

In [None]:
pokemon_df['Attack'].describe()

In [None]:
pokemon_df.hist()

Exploring the built-in histogram method.

In [None]:
# This plots the histogram of the attack column
pokemon_df.hist('Attack')

Exploring a line plot.

In [None]:
pokemon_df['Attack'].plot()

In [None]:
# Plotting the first 20 points only.
pokemon_df['Attack'][0:20].plot()

Question: is this plot appropriate for this data?

Now look at the following scatter plot.

In [None]:
pokemon_df.plot.scatter(x='#',y='Attack')

What's the difference between a line plot and a scatter plot?

In [None]:
pokemon_df.plot.scatter(x='Attack',y='Defense')

What relevant information can we extract from this sort of plot? What is the use of plotting one column against another?

Further modifying figures with matplotlib.

In [None]:
# Create figure object and defining axes.
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(10, 6))

fig.suptitle('This is the title!')

# Above all, notice the ax=axes[0,0] argument: it selects the panel to draw on.
pokemon_df.plot.scatter(x='Attack',y='Defense',s=1, c='black', ax=axes[0,0])
pokemon_df.plot.scatter(x='Attack',y='Defense', ax=axes[1,0])
pokemon_df.plot.scatter(x='Attack',y='Defense',s=1, c='black', ax=axes[0,1])
pokemon_df.plot.scatter(x='Attack',y='Defense', s=1, ax=axes[1,1])
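
The same pattern extends to titles and labels on individual panels. The sketch below is a minimal example on made-up data (`toy_df` and its values are assumptions; the `matplotlib.use('Agg')` line is only there so the snippet also runs without a display and can be dropped inside a notebook). Note that with a single row of panels, `axes` is one-dimensional and is indexed with one number.

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data; only the column names follow the dataset.
toy_df = pd.DataFrame({
    'Attack': [49, 52, 48, 30, 84],
    'Defense': [49, 43, 65, 35, 78],
})

# One row, two columns: axes has shape (2,), so use axes[0] and axes[1].
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))

toy_df.plot.scatter(x='Attack', y='Defense', ax=axes[0])
axes[0].set_title('Scatter')

toy_df['Attack'].plot.hist(ax=axes[1], bins=5)
axes[1].set_title('Histogram')
axes[1].set_xlabel('Attack')

fig.suptitle('Toy data')
fig.tight_layout()  # reduces overlap between titles and labels
```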


Exercise: for every pokemon type make a scatter plot of attack vs defense (and do colour them differently).

In [None]:
# My solution.
fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(10, 6))

for key in pokemon_df['Type 1'].unique():
    print(key)

    # Select the rows of this type once and reuse them for both axes.
    subset = pokemon_df[pokemon_df['Type 1'] == key]
    axes.scatter(x=subset['Attack'], y=subset['Defense'], s=5, label=key)

# Draw the legend once, after every type has been plotted.
axes.legend()

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(10, 6))

axes.scatter(x=pokemon_df[pokemon_df['Type 1'] == 'Fire']['Attack'],
             y=pokemon_df[pokemon_df['Type 1'] == 'Fire']['Defense'],
             s=5, 
             color='red')

axes.scatter(x=pokemon_df[pokemon_df['Type 1'] == 'Grass']['Attack'],
             y=pokemon_df[pokemon_df['Type 1'] == 'Grass']['Defense'],
             s=5, 
             color='green')



A complete guide to the plotting methods available on a dataframe object can be found [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html).

Exercise: now do the histogram of the HP, defense and speed columns.

In [None]:
pokemon_df['HP'].plot.hist()

In [None]:
fig,axes = plt.subplots(nrows=1,ncols=3,figsize=(12,6))
pokemon_df['HP'].plot.hist(ax=axes[0], bins=50, alpha=0.5, color='red')
axes[0].set_title('HP')

# Defense and Attack share the middle panel, so it gets one title and a legend.
pokemon_df['Defense'].plot.hist(ax=axes[1], bins=50, alpha=.5, color='red')
pokemon_df['Attack'].plot.hist(ax=axes[1], bins=50, alpha=.5, color='blue')
axes[1].set_title('Defense and Attack')
axes[1].legend()

pokemon_df['Speed'].plot.hist(ax=axes[2], bins=50, alpha=0.5, color='green')
axes[2].set_title('Speed')

axes[1].set_ylabel('')
axes[2].set_ylabel('')

In [None]:
fig,axes = plt.subplots(nrows=1,ncols=3,figsize=(12,6))
pokemon_df['HP'].plot.hist(ax=axes[0], bins=50, alpha=0.5, color='red')
axes[0].set_title('HP')

pokemon_df['Defense'].plot.hist(ax=axes[1], bins=50, alpha=0.5, color='grey')
axes[1].set_title('Defense')

pokemon_df['Speed'].plot.hist(ax=axes[2], bins=50, alpha=0.5, color='green')
axes[2].set_title('Speed')

NumPy can also be used in conjunction with pandas to study dataframes. For example:

In [None]:
# Returns the maximum value of the pokemons HP.
np.max(pokemon_df['HP'])

In [None]:
# Returns the mean value of the pokemons attack.
np.mean(pokemon_df['Attack'])
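
One subtlety when mixing the two libraries is how missing values are treated. The sketch below uses a made-up Series (`hp` and its values are assumptions for illustration). As far as I am aware, calling `np.mean` on a Series hands the work to the Series' own `mean` method, which skips missing values, whereas calling it on the underlying NumPy array does not.

```python
import numpy as np
import pandas as pd

# Made-up values with one missing entry.
hp = pd.Series([45.0, np.nan, 39.0], name='HP')

series_mean = hp.mean()                  # pandas skips NaN by default
numpy_on_series = np.mean(hp)            # delegates to the Series method
numpy_on_array = np.mean(hp.to_numpy())  # plain array: NaN propagates
nan_aware = np.nanmean(hp.to_numpy())    # NumPy's NaN-ignoring variant

print(series_mean, numpy_on_series, numpy_on_array, nan_aware)
```

When a column may contain missing entries, it helps to know which of these behaviours you are getting before comparing results.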

Exercise: Find out how many pokemons have their Attack stat above the average value for the entire set.

Missing entries are also very common, and there are different strategies for dealing with them (see, for [example](https://pandas.pydata.org/docs/user_guide/missing_data.html), the pandas user guide). Using [dropna](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html) we can drop all the rows that have a missing value:

In [None]:
pokemon_df.dropna()

Why was the entire dataframe wiped out?

Columns can be individually dropped using [drop](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html)

In [None]:
pokemon_df.keys()

In [None]:
# For example:
pokemon_df.drop(['Irrelevant','HP+100'], axis=1)

In [None]:
pokemon_df = pokemon_df.drop(['Irrelevant','HP+100'], axis=1)
# Now the NaNs are no longer in every row. See the result.
pokemon_df.dropna()
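
Dropping every row with any missing value is only one of the available strategies. The sketch below uses a made-up dataframe (`toy_df`, its values and the placeholder string `'No second type'` are assumptions for illustration) to show three alternatives: restricting `dropna` to chosen columns with `subset`, dropping columns that are entirely empty, and filling gaps with `fillna`.

```python
import numpy as np
import pandas as pd

# Made-up data with gaps in 'Type 2' and a completely empty column.
toy_df = pd.DataFrame({
    'Name': ['Mon A', 'Mon B', 'Mon C'],
    'Type 1': ['Grass', 'Fire', 'Water'],
    'Type 2': ['Poison', np.nan, np.nan],
    'Irrelevant': [np.nan, np.nan, np.nan],
})

# Every row has a NaN in 'Irrelevant', so nothing survives.
all_dropped = toy_df.dropna()

# Only look at 'Type 2' when deciding which rows to drop.
only_type2 = toy_df.dropna(subset=['Type 2'])

# Drop columns (axis=1) in which every value is missing.
no_empty_columns = toy_df.dropna(axis=1, how='all')

# Replace missing values in one column with a placeholder.
filled = toy_df.fillna({'Type 2': 'No second type'})

print(len(all_dropped), len(only_type2))
print(list(no_empty_columns.columns))
print(filled['Type 2'].tolist())
```

Which strategy is appropriate depends on what a missing value means in the column at hand.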

Exercise: compare the lengths of the dataframe before and after dropping the missing values. 

Question: When is it appropriate to drop the missing values?

Exercise: make a pie chart showcasing the distribution of pokemon types.

Exercise: which is the strongest pokemon?

Exercise: what is a boxplot?

Exercise: make a box plot showcasing the different pokemon attacks.

If you want to dive into pandas and machine learning I recommend using [Kaggle](https://www.kaggle.com/). This platform contains a multitude of interactive tutorials that can facilitate your learning process.

In [None]:
import seaborn as sns

sns.set()

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(10, 6))

axes.scatter(x=pokemon_df[pokemon_df['Type 1'] == 'Fire']['Attack'],
             y=pokemon_df[pokemon_df['Type 1'] == 'Fire']['Defense'],
             s=1, 
             color='red')
axes.scatter(x=pokemon_df[pokemon_df['Type 1'] == 'Grass']['Attack'],
             y=pokemon_df[pokemon_df['Type 1'] == 'Grass']['Defense'],
             s=1, 
             color='green')

In [None]:
pokemon_df.describe()

In [None]:
sns.boxplot(x=pokemon_df['Attack'])


In [None]:
# Count each type first, then draw the counts as a pie chart.
pokemon_df['Type 1'].value_counts().plot.pie()

In [None]:
pokemon_df.head()

In [None]:
pokemon_df['Type 1'].value_counts()

In [None]:
pokemon_df['Type 1'].value_counts().plot(kind='pie')

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20, 10))

pokemon_df['Type 1'].value_counts().plot(kind='pie', ax=axes[0])

pokemon_df['Type 2'].value_counts().plot(kind='pie', ax=axes[1])

axes[0].set_ylabel('')
axes[1].set_ylabel('')
fig.suptitle('Test')
plt.tight_layout()
