## Jupyter Notebook Assessment Task

### *Learn to analyse data with Pokémon!*

[Repository Link](https://github.com/TurnipGuy30/Jupyter)

This program will do the following:

- Take data from a CSV file
- Clean and export the data
- Analyse and visualise the data

---

### Setup

The first thing we should do is set up the modules we plan to use in the program. These will be used to access our database and create diagrams from the data.

In [None]:
# import modules
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# update matplotlib configuration for later use
plt.rcParams.update(
	{
		'font.size': 20,
		'figure.figsize': (10, 8)
	}
)

---

Now that the environment is set up, we can import our file, `pokemon.csv`.

In [None]:
# read CSV file and save to variable as DataFrame
pokemon = pd.read_csv('in/pokemon.csv')

In [None]:
# output DataFrame as table
pokemon

---

We can see that the DataFrame is now displayed on screen as a table.

Running `pokemon` shows us just the start and the end of the file, but this already tells us a few things about the formatting of the data that are specific to a Pokémon database:

- Some Pokémon have alternate forms with the same `Number`.

Of course, these entries have different `index` values.

Looking at the original `pokemon.sql` file, the `PRIMARY KEY` property is given to the `Name` column. In SQL, the Primary Key property is given to a column which the programmer knows will never have duplicate values.

Because this property was given to the `Name` column, we should set that column to our `index` column.

- Some Pokémon only have one type.

There are two columns for Pokémon types: `Type_1` and `Type_2`. This is because Pokémon can have either one or two types.

However, this means that some Pokémon will have `NaN` `Type_2` values. This shouldn't make a difference to our program.

It is worth noting that for Pokémon with two types, the first type is more heavily weighted when it comes to game calculations.

- Each Pokémon is grouped into a `Generation` category.

This database contains Pokémon up to Gen VI, meaning that each entry will have a `Generation` value of `1` through `6`.

- Each Pokémon has six different statistics.

Hit Points, Physical Attack, Physical Defense, Special Attack, Special Defense, and Speed are used in game calculations. In general, higher stats mean a stronger Pokémon.

These stats are combined in the `Total` column.

Note: in the Pokémon community, the 6 stats are abbreviated to `HP`, `ATK`, `DEF`, `SPA`, `SPD`, and `SPE`.

- Some Pokémon are `Legendary` Pokémon.

Legendary Pokémon will generally have higher stats than non-Legendary Pokémon.
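The observations above can be checked in code rather than by eye. The following is a minimal sketch, assuming the original column names before any renaming; the `sample` DataFrame is a handful of hand-typed rows for illustration, not the contents of `pokemon.csv`.

```python
import pandas as pd

# Hand-typed sample rows for illustration only (not read from pokemon.csv).
sample = pd.DataFrame({
    'Number': [3, 3, 4, 7],
    'Name': ['Venusaur', 'Mega Venusaur', 'Charmander', 'Squirtle'],
    'Type_1': ['Grass', 'Grass', 'Fire', 'Water'],
    'Type_2': ['Poison', 'Poison', None, None],
    'Total': [525, 625, 309, 314],
    'HP': [80, 80, 39, 44],
    'Attack': [82, 100, 52, 48],
    'Defense': [83, 123, 43, 65],
    'Sp_Atk': [100, 122, 60, 50],
    'Sp_Def': [100, 120, 50, 64],
    'Speed': [80, 80, 65, 43],
})

# Alternate forms share a Number, so Number cannot act as a unique index...
number_is_unique = sample['Number'].is_unique
# ...but Name can, which matches its PRIMARY KEY role in the SQL file.
name_is_unique = sample['Name'].is_unique

# Single-type Pokémon have a missing Type_2.
single_type_count = sample['Type_2'].isna().sum()

# Total should equal the sum of the six individual stats.
stat_columns = ['HP', 'Attack', 'Defense', 'Sp_Atk', 'Sp_Def', 'Speed']
total_matches = (sample[stat_columns].sum(axis=1) == sample['Total']).all()

print(number_is_unique, name_is_unique, single_type_count, total_matches)
```

Running the same checks on the full DataFrame would confirm whether `Name` really is safe to use as the index.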

---

### Cleaning and organisation

Now that we better understand our data, we can start performing the required cleaning.

Let's rename the columns to be more conventional.

In [None]:
# rename columns
pokemon.rename(columns={
	'Number': 'Dex',
	'Attack': 'ATK',
	'Defense': 'DEF',
	'Sp_Atk': 'SPA',
	'Sp_Def': 'SPD',
	'Speed': 'SPE',
	'Generation': 'Gen'
}, inplace=True)

In [None]:
# make column names lowercase
pokemon.columns = [col.lower() for col in pokemon]

In [None]:
# set index column by name
pokemon.set_index('name', inplace=True)
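The three cleaning cells above can also be written as a single chain that avoids `inplace=True`. This is only a sketch of an alternative style, not a required change; `raw` is a hand-typed stand-in for the imported DataFrame and `new_names` is a name introduced here for illustration.

```python
import pandas as pd

# Hand-typed sample with the original column names (illustrative only).
raw = pd.DataFrame({
    'Number': [1, 4],
    'Name': ['Bulbasaur', 'Charmander'],
    'Attack': [49, 52],
    'Defense': [49, 43],
    'Sp_Atk': [65, 60],
    'Sp_Def': [65, 50],
    'Speed': [45, 65],
    'Generation': [1, 1],
})

new_names = {
    'Number': 'Dex', 'Attack': 'ATK', 'Defense': 'DEF',
    'Sp_Atk': 'SPA', 'Sp_Def': 'SPD', 'Speed': 'SPE', 'Generation': 'Gen',
}

# Each step returns a new DataFrame, so `raw` is left untouched.
cleaned = (
    raw
    .rename(columns=new_names)   # conventional short names
    .rename(columns=str.lower)   # lowercase every column name
    .set_index('name')           # use the unique name column as the index
)

print(cleaned.columns.tolist())
```

Keeping `raw` unchanged makes it easier to re-run a cell without getting errors from columns that have already been renamed.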

---

Let's see how it's changed.

In [None]:
# output DataFrame as table
pokemon

Looking good! Let's move on.

---

It's always a good idea to check for empty data.

In [None]:
# show sum of NaN data
pokemon.isnull().sum()

We can see that the only column containing empty data is `type_2`. This is the expected result.
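Leaving `NaN` in `type_2` is usually harmless, because a comparison against a missing value simply evaluates to `False`. The sketch below illustrates this on a hand-typed sample; the `single_type` column is a hypothetical addition and is not part of the original file.

```python
import pandas as pd

# Hand-typed sample rows for illustration only.
sample = pd.DataFrame(
    {
        'type_1': ['Grass', 'Fire', 'Water'],
        'type_2': ['Poison', None, None],
    },
    index=pd.Index(['Bulbasaur', 'Charmander', 'Squirtle'], name='name'),
)

# Count missing values per column, as in the cell above.
missing = sample.isnull().sum()

# A boolean column makes "single-type" explicit without overwriting the NaN values.
sample['single_type'] = sample['type_2'].isna()

# Missing values never compare equal, so single-type rows are simply False here.
is_poison = sample['type_2'] == 'Poison'

print(missing['type_2'], is_poison.tolist())
```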

Now that the data is organised, we can export the file before moving on to visualisation.

In [None]:
# export DataFrame to file
pokemon.to_csv('out/pokemon.csv')
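When the exported file is read back later, the index has to be restored explicitly. The following sketch shows a round trip under the assumption that the index column is called `name`; it writes to a temporary folder instead of `out/` so that it can run anywhere.

```python
import os
import tempfile

import pandas as pd

# Hand-typed stand-in for the cleaned DataFrame (illustrative only).
cleaned = pd.DataFrame(
    {'dex': [1, 4], 'type_2': ['Poison', None]},
    index=pd.Index(['Bulbasaur', 'Charmander'], name='name'),
)

# Write to a temporary folder here; the notebook itself writes to out/pokemon.csv.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'pokemon.csv')
    cleaned.to_csv(path)

    # Without index_col, 'name' would come back as an ordinary column.
    reloaded = pd.read_csv(path, index_col='name')

print(reloaded.index.name)
print(reloaded['type_2'].isna().sum())
```

Empty cells in the CSV are read back as `NaN`, so the missing `type_2` values survive the export.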

---

### Visualisation

Let's start visualising pieces of data.

Visualisation can help to point out certain parts of a database. Look at the following diagram.

In [None]:
# correlation between columns
fig = plt.figure(figsize=(10, 7))
sns.regplot(x=pokemon['atk'], y=pokemon['def']);
plt.title('Attack v Defense');

The diagram suggests a positive relationship between the two stats: Pokémon with higher Attack tend to have higher Defense as well.
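A regression plot is easier to interpret alongside numbers. As a rough sketch, the correlation and the fitted line can be computed directly; the values below are made up so that the arithmetic is easy to follow and do not come from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up values chosen for easy arithmetic (not real game data).
sample = pd.DataFrame({
    'atk': [40, 60, 80, 100, 120],
    'def': [50, 60, 70, 80, 90],
})

# Pearson correlation: strength of the linear relationship, from -1 to 1.
correlation = sample['atk'].corr(sample['def'])

# Least-squares line def = slope * atk + intercept, the same kind of fit regplot draws.
slope, intercept = np.polyfit(sample['atk'], sample['def'], deg=1)

print(round(correlation, 2), round(slope, 2), round(intercept, 2))
```

The slope says how many points of Defense are gained per point of Attack along the fitted line, which is a different question from how the two averages compare.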

I will now find the Pokémon with Attack of at least 175 and Defense of at least 100.

In [None]:
# search for multiple conditions
pokemon[
	(pokemon['atk'] >= 175) &
	(pokemon['def'] >= 100)
]

This can be useful information for Pokémon players who want to form the best team for Pokémon battles. High-Attack Pokémon are valuable, and the Pokémon listed above have the best physical stats.

A Pokémon player could then use this information on top of their prior knowledge to form a team of strong Pokémon.
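The search above works by building one boolean mask per condition and combining them. Here is a small sketch of the same idea with the masks given names; the rows and the placeholder names `A` to `D` are made up for illustration.

```python
import pandas as pd

# Made-up rows with placeholder names (illustration only).
sample = pd.DataFrame(
    {'atk': [180, 190, 100, 175], 'def': [120, 60, 230, 100]},
    index=pd.Index(['A', 'B', 'C', 'D'], name='name'),
)

# Each comparison yields a boolean Series with one True/False per row.
strong_attack = sample['atk'] >= 175
strong_defense = sample['def'] >= 100

# `&` combines the Series element-wise; Python's `and` would raise a ValueError here.
both = sample[strong_attack & strong_defense]

print(both.index.tolist())
```

When the comparisons are written inline, each one needs its own parentheses because `&` binds more tightly than `>=`.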

I will now demonstrate this by doing it myself, though I'll need some more information.

Note: a Pokémon team is made up of up to six individual Pokémon.

In [None]:
# correlation data from columns
fig = plt.figure(figsize=(10, 7))
sns.regplot(x=pokemon['spa'], y=pokemon['spd']);
plt.title('Special Attack v Special Defense');

Look at the diagram below.

In [None]:
# boxplot by column
sns.boxplot(x=pokemon['legendary'], y=pokemon['total'], width=0.1);

This diagram compares the total stats of Legendary and non-Legendary Pokémon. Legendary Pokémon sit noticeably higher, so if we want a team with the highest stats, we should generally use as many Legendary Pokémon as possible.
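The same comparison can be expressed as a small summary table, which gives exact figures to go with the box plot. This is a sketch on made-up totals, assuming a boolean `legendary` column as in the cleaned DataFrame.

```python
import pandas as pd

# Made-up totals for illustration; the real values come from pokemon.csv.
sample = pd.DataFrame({
    'legendary': [False, False, False, True, True],
    'total': [300, 400, 500, 600, 680],
})

# One row per group: how many entries, and where their totals sit.
summary = sample.groupby('legendary')['total'].agg(['count', 'median', 'min', 'max'])

print(summary)
```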

In [None]:
# search for many conditions
pokemon[
	(
		(pokemon['atk'] >= 175) &
		(pokemon['def'] >= 100)
	) |
	(
		(pokemon['spa'] >= 175) &
		(pokemon['spd'] >= 100)
	)
].head()

According to this data, Mewtwo, Heracross, Kyogre, and Groudon are among the best choices. This team covers some of its own weaknesses, but seems vulnerable to Psychic-Type Pokémon. Therefore, it could benefit from having a Dark-Type Pokémon included.
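Note that `.head()` returns the first five matching rows in their existing order rather than the five strongest. A minimal sketch of ranking the matches by `total` first is shown below, using made-up rows and placeholder names.

```python
import pandas as pd

# Made-up rows with placeholder names (illustration only).
sample = pd.DataFrame(
    {
        'atk': [180, 100, 185, 90],
        'def': [100, 90, 110, 80],
        'spa': [60, 180, 70, 194],
        'spd': [70, 100, 80, 120],
        'total': [600, 620, 700, 680],
    },
    index=pd.Index(['A', 'B', 'C', 'D'], name='name'),
)

physical = (sample['atk'] >= 175) & (sample['def'] >= 100)
special = (sample['spa'] >= 175) & (sample['spd'] >= 100)

# Keep rows that satisfy either pair of conditions, then rank by total stats.
ranked = sample[physical | special].sort_values('total', ascending=False)
top_two = ranked.head(2)

print(top_two.index.tolist())
```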

In [None]:
# search by condition and sort
pokemon[
	(pokemon['type_1'] == 'Dark')
].sort_values('total', ascending=False).head()

Yveltal seems like a good fit, but is vulnerable to Electric-Type Pokémon. However, this is covered by Groudon being a Ground-Type.
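Filtering on `type_1` alone skips Pokémon whose second type is Dark. If those should be considered too, both columns can be checked; the sketch below assumes the cleaned column names and uses a few hand-typed rows rather than the full file.

```python
import pandas as pd

# Hand-typed sample rows for illustration only.
sample = pd.DataFrame(
    {
        'type_1': ['Dark', 'Water', 'Rock', 'Fire'],
        'type_2': ['Flying', 'Dark', 'Dark', None],
        'total': [680, 460, 600, 309],
    },
    index=pd.Index(['Yveltal', 'Sharpedo', 'Tyranitar', 'Charmander'], name='name'),
)

# Match the type in either column; a missing type_2 compares as False.
is_dark = (sample['type_1'] == 'Dark') | (sample['type_2'] == 'Dark')
dark_ranked = sample[is_dark].sort_values('total', ascending=False)

print(dark_ranked.index.tolist())
```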

I'll add a Steel-Type Pokémon because they don't have many weaknesses.

In [None]:
# search by condition and sort
pokemon[
	(pokemon['type_1'] == 'Steel')
].sort_values('total', ascending=False).head()

I'll choose Metagross. Now our team looks like this.

In [None]:
# locate multiple entries
pokemon.loc[
	['Mewtwo', 'Heracross', 'Kyogre', 'Groudon', 'Yveltal', 'Metagross']
]

This is a great result; I myself would happily use this team. However, there are always improvements that could be made.
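One thing to watch with `.loc` and a list of names is that a single misspelt or absent name raises a `KeyError`. As a hedged sketch, `isin` can be used to look up whichever names exist and report the rest; the sample rows are hand-typed and `'Missingno'` is deliberately not in the index.

```python
import pandas as pd

# Hand-typed sample rows for illustration only.
sample = pd.DataFrame(
    {'total': [680, 600, 309]},
    index=pd.Index(['Mewtwo', 'Metagross', 'Charmander'], name='name'),
)
team = ['Mewtwo', 'Metagross', 'Missingno']  # last name is deliberately absent

# sample.loc[team] would raise a KeyError because 'Missingno' is not in the index.
found = sample[sample.index.isin(team)]
missing = [name for name in team if name not in sample.index]

# A simple team-level figure: the average of the members' total stats.
average_total = found['total'].mean()

print(found.index.tolist(), missing, average_total)
```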

In [None]:
# sort by column and show the top entry
pokemon.sort_values('total', ascending=False).head(1)

The Pokémon with the highest stats is actually Mega Rayquaza, so why don't we have a Rayquaza?

The answer is that there are many ways to formulate a Pokémon team. We have looked at the statistics, using mathematics as a basis for decisions.

But when it comes to Pokémon, there are many other details to account for that are irrelevant to this project.

This was just one example of practical data analysis.

---

### Extra charts

The following charts were made for my own entertainment.

Gen / Stat

In [None]:
sns.boxplot(x=pokemon['gen'], y=pokemon['total'], width=0.5);

In [None]:
sns.boxplot(x=pokemon['gen'], y=pokemon['hp'], width=0.5);

In [None]:
sns.boxplot(x=pokemon['gen'], y=pokemon['atk'], width=0.5);

In [None]:
sns.boxplot(x=pokemon['gen'], y=pokemon['def'], width=0.5);

In [None]:
sns.boxplot(x=pokemon['gen'], y=pokemon['spa'], width=0.5);

In [None]:
sns.boxplot(x=pokemon['gen'], y=pokemon['spd'], width=0.5);

In [None]:
sns.boxplot(x=pokemon['gen'], y=pokemon['spe'], width=0.5);
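The cells above repeat the same call with a different stat each time, so they could be drawn in a loop instead. This is a sketch of that idea on made-up values, with only three of the stats to keep it short; the titles are an addition for readability.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up values for illustration only.
sample = pd.DataFrame({
    'gen': [1, 1, 2, 2, 3, 3],
    'hp': [45, 60, 50, 70, 40, 95],
    'atk': [49, 62, 65, 80, 45, 100],
    'def': [49, 63, 64, 85, 35, 90],
})
stats = ['hp', 'atk', 'def']

# One row of axes, one box plot per stat, all grouped by generation.
fig, axes = plt.subplots(1, len(stats), figsize=(5 * len(stats), 4))
for ax, stat in zip(axes, stats):
    sns.boxplot(data=sample, x='gen', y=stat, width=0.5, ax=ax)
    ax.set_title(f'{stat} by gen')
```

Placing the plots side by side also makes it easier to compare stats across generations at a glance.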

Gen / Type

In [None]:
sns.boxplot(x=pokemon['gen'], y=pokemon['type_1']);

In [None]:
sns.boxplot(x=pokemon['gen'], y=pokemon['type_2']);
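Because both generation and type are categories, a table of counts can be a useful companion to these plots. The following sketch uses `pd.crosstab` on made-up rows to count how many Pokémon of each type appear in each generation.

```python
import pandas as pd

# Made-up rows for illustration only.
sample = pd.DataFrame({
    'gen': [1, 1, 1, 2, 2],
    'type_1': ['Fire', 'Water', 'Fire', 'Water', 'Grass'],
})

# Rows are types, columns are generations, cells are counts of Pokémon.
counts = pd.crosstab(sample['type_1'], sample['gen'])

print(counts)
```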

Stat / Type 1

In [None]:
sns.boxplot(x=pokemon['total'], y=pokemon['type_1']);

In [None]:
sns.boxplot(x=pokemon['hp'], y=pokemon['type_1']);

In [None]:
sns.boxplot(x=pokemon['atk'], y=pokemon['type_1']);

In [None]:
sns.boxplot(x=pokemon['def'], y=pokemon['type_1']);

In [None]:
sns.boxplot(x=pokemon['spa'], y=pokemon['type_1']);

In [None]:
sns.boxplot(x=pokemon['spd'], y=pokemon['type_1']);

In [None]:
sns.boxplot(x=pokemon['spe'], y=pokemon['type_1']);

Stat / Type 2

In [None]:
sns.boxplot(x=pokemon['total'], y=pokemon['type_2']);

In [None]:
sns.boxplot(x=pokemon['hp'], y=pokemon['type_2']);

In [None]:
sns.boxplot(x=pokemon['atk'], y=pokemon['type_2']);

In [None]:
sns.boxplot(x=pokemon['def'], y=pokemon['type_2']);

In [None]:
sns.boxplot(x=pokemon['spa'], y=pokemon['type_2']);

In [None]:
sns.boxplot(x=pokemon['spd'], y=pokemon['type_2']);

In [None]:
sns.boxplot(x=pokemon['spe'], y=pokemon['type_2']);