# Basic Statistics in Python

Dataset from Kaggle : **"Pokemon with stats"** by *Alberto Barradas*  
Source: https://www.kaggle.com/abcsds/pokemon (requires login)

---

### Essential Libraries

Let us begin by importing the essential Python Libraries.

> NumPy : Library for Numeric Computations in Python  
> Pandas : Library for Data Acquisition and Preparation  
> Matplotlib : Low-level library for Data Visualization  
> Seaborn : Higher-level library for Data Visualization  

In [None]:
# Basic Libraries
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt # we only need pyplot
sb.set() # set the default Seaborn style for graphics

---

### Import the Dataset

The dataset is in CSV format; hence we use the `read_csv` function from Pandas.  
Immediately after importing, take a quick look at the data using the `head` function.

In [None]:
pkmndata = pd.read_csv('pokemonData.csv')
pkmndata.head()

The description of the dataset, as available on Kaggle, is as follows.  
Learn more : https://en.wikipedia.org/wiki/List_of_Pok%C3%A9mon

> **\#** : ID for each Pokemon (runs from 1 to 721)  
> **Name** : Name of each Pokemon  
> **Type 1** : Each Pokemon has a basic Type, which determines weakness/resistance to attacks  
> **Type 2** : Some Pokemon are dual-type and have a Type 2 value (set to `NaN` otherwise)  
> **Total** : Sum of all stats of a Pokemon, a general guide to how strong a Pokemon is  
> **HP** : Hit Points, defines how much damage a Pokemon can withstand before fainting  
> **Attack** : The base modifier for normal attacks by the Pokemon (e.g., scratch, punch)  
> **Defense** : The base damage resistance of the Pokemon against normal attacks  
> **Sp. Atk** : Special Attack, the base modifier for special attacks (e.g., fire blast, bubble beam)  
> **Sp. Def** : Special Defense, the base damage resistance against special attacks  
> **Speed** : Determines which Pokemon attacks first each round  
> **Generation** : Each Pokemon belongs to a certain Generation  
> **Legendary** : Legendary Pokemon are powerful, rare, and hard to catch
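
Since **Type 2** is missing for single-type Pokemon, it helps to know how Pandas reports missing values. The following is a minimal sketch on a small hand-typed table (the `sample` frame is an illustration, not the Kaggle file); the same calls should apply to the full dataset once it is loaded.

```python
import numpy as np
import pandas as pd

# Small hand-typed sample that mimics three columns of the dataset
sample = pd.DataFrame({
    "Name": ["Bulbasaur", "Charmander", "Squirtle", "Pidgey"],
    "Type 1": ["Grass", "Fire", "Water", "Normal"],
    "Type 2": ["Poison", np.nan, np.nan, "Flying"],
})

# Count the missing entries in every column
missing_counts = sample.isna().sum()
print(missing_counts)

# Select the rows where Type 2 is missing (single-type Pokemon)
single_type = sample[sample["Type 2"].isna()]
print(single_type["Name"].tolist())
```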

---

Check the vital statistics of the dataset using the built-in `type` function and the `shape` attribute.

In [None]:
print("Data type : ", type(pkmndata))
print("Data dims : ", pkmndata.shape)

Check the variables (and their types) in the dataset using the `dtypes` attribute.

In [None]:
print(pkmndata.dtypes)

---

### Extract a Single Variable

We will start by analyzing a single variable from the dataset, **HP**.  
This variable defines how much damage a Pokemon can withstand.  
Extract the variable and its associated data as a Pandas `DataFrame`.

In [None]:
hp = pd.DataFrame(pkmndata['HP'])
print("Data type : ", type(hp))
print("Data dims : ", hp.shape) # shape gives (rows, columns); size would give the element count
hp.head()

---

### Uni-Variate Statistics

Check the Summary Statistics of Uni-Variate Series using `describe`.

In [None]:
hp.describe()
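
To see what each row of `describe` means, the sketch below recomputes the same numbers one at a time. The `values` Series holds a few illustrative HP-like numbers, not the real column, and the name `manual` is introduced here only for the comparison.

```python
import numpy as np
import pandas as pd

# A few illustrative HP-like values (not the full dataset)
values = pd.Series([45, 60, 80, 39, 58, 78, 44, 59], name="HP")

summary = values.describe()

# Recompute every entry of the summary by hand
manual = {
    "count": values.count(),
    "mean": values.mean(),
    "std": values.std(ddof=1),       # sample standard deviation
    "min": values.min(),
    "25%": values.quantile(0.25),    # first quartile
    "50%": values.median(),          # second quartile (median)
    "75%": values.quantile(0.75),    # third quartile
    "max": values.max(),
}

for key, val in manual.items():
    print(f"{key:>5} : {val:8.2f}   (describe: {summary[key]:8.2f})")
```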

Check the Summary Statistics visually using a standard `boxplot`.

In [None]:
f, axes = plt.subplots(1, 1, figsize=(24, 4))
sb.boxplot(hp, orient = "h")
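
The box spans the first to the third quartile, and points beyond the whiskers are drawn as outliers. Assuming the common 1.5 × IQR whisker rule (which we believe is the Seaborn default), the quantities behind the plot can be sketched with NumPy on illustrative values:

```python
import numpy as np

# Illustrative values with one obvious extreme point
values = np.array([45, 60, 80, 39, 58, 78, 44, 59, 250], dtype=float)

# Quartiles and inter-quartile range (the box)
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Fences under the 1.5 * IQR rule
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points inside the fences
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything outside the fences is flagged as an outlier
outliers = values[(values < lower_fence) | (values > upper_fence)]

print("Box      :", q1, "to", q3)
print("Whiskers :", whisker_low, "to", whisker_high)
print("Outliers :", outliers)
```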

Extend the summary to visualize the complete distribution of the Series.  
The first visualization is a simple Histogram with automatic bin sizes.

In [None]:
f, axes = plt.subplots(1, 1, figsize=(24, 12))
# histplot replaces distplot, which is deprecated since Seaborn 0.11
sb.histplot(hp["HP"], color = "red")
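
A histogram is just a count of values per bin. As a rough sketch of what "automatic bin sizes" can mean, the snippet below uses NumPy's `bins="auto"` rule on illustrative values; the rule Seaborn applies may differ, so the bin edges here need not match the plot.

```python
import numpy as np

# Illustrative HP-like values (not the full dataset)
values = np.array([45, 60, 80, 39, 58, 78, 44, 59, 65, 70, 35, 90], dtype=float)

# Let NumPy choose the number of bins, then count values per bin
counts, edges = np.histogram(values, bins="auto")

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"[{left:6.2f}, {right:6.2f}) : {n}")
```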

The second visualization is a simple Kernel Density Estimate (KDE).

In [None]:
f, axes = plt.subplots(1, 1, figsize=(24, 12))
# kdeplot replaces distplot(hist = False), which is deprecated since Seaborn 0.11
sb.kdeplot(hp["HP"], color = "red")
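
A KDE places a small smooth bump on every data point and averages the bumps. The following is a minimal sketch with a Gaussian kernel; the helper `gaussian_kde_1d` and the bandwidth of 10 are assumptions made for illustration, whereas Seaborn chooses its bandwidth automatically.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # z has shape (len(grid), len(samples))
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    return kernels.mean(axis=1) / bandwidth

# Illustrative HP-like values (not the full dataset)
samples = np.array([45, 60, 80, 39, 58, 78, 44, 59], dtype=float)
grid = np.linspace(-100.0, 250.0, 3501)
density = gaussian_kde_1d(samples, grid, bandwidth=10.0)

# A density should integrate to roughly 1 over a wide enough grid
area = np.sum(density) * (grid[1] - grid[0])
print("Area under the curve :", round(area, 3))
```

A smaller bandwidth gives a spikier curve and a larger one gives a smoother curve, which is why the choice of bandwidth matters when reading a KDE.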

`histplot` with `kde = True` produces both the Histogram and the KDE.

In [None]:
f, axes = plt.subplots(1, 1, figsize=(24, 12))
sb.histplot(hp["HP"], kde = True, stat = "density", color = "red")

Finally, the **Violin Plot** combines a boxplot with a kernel density estimate.

In [None]:
f, axes = plt.subplots(1, 1, figsize=(24, 12))
sb.violinplot(hp)

---

### Extract Two Variables

Next, we will analyze two variables from the dataset, **HP** vs **Attack**.  
Extract the two variables and their associated data as a Pandas `DataFrame`.

In [None]:
hp = pd.DataFrame(pkmndata['HP'])
attack = pd.DataFrame(pkmndata['Attack'])

---

### Bi-Variate Statistics

We can of course check the uni-variate Summary Statistics for each variable.

In [None]:
# Summary Statistics for HP
hp.describe()

In [None]:
# Summary Statistics for Attack
attack.describe()

And visualize the uni-variate Distributions of each variable independently.

In [None]:
# Set up matplotlib figure with a 2 x 3 grid of subplots
f, axes = plt.subplots(2, 3, figsize=(24, 12))

# Plot the basic uni-variate figures for HP
sb.boxplot(hp, orient = "h", ax = axes[0,0])
sb.histplot(hp["HP"], ax = axes[0,1]) # histplot replaces the deprecated distplot
sb.violinplot(hp, ax = axes[0,2])

# Plot the basic uni-variate figures for Attack
sb.boxplot(attack, orient = "h", ax = axes[1,0], color = 'g')
sb.histplot(attack["Attack"], ax = axes[1,1], color = 'g')
sb.violinplot(attack, ax = axes[1,2], color = 'g')

However, it will be more interesting to visualize them together in a `jointplot`.

In [None]:
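# Pass column names with data, since x and y expect 1-D vectors rather than DataFrames
sb.jointplot(data = pkmndata, x = "Attack", y = "HP", height = 8)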

The joint plot tells us something about the **Correlation** between the two variables, which we can also compute directly.

In [None]:
# Create a joint dataframe by concatenating the two variables
# (join_axes was removed in pandas 1.0; reindex keeps the rows of attack instead)
jointDF = pd.concat([attack, hp], axis = 1).reindex(attack.index)

# Calculate the correlation between the two columns/variables
jointDF.corr()
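
By default, `corr` reports the Pearson correlation coefficient: the covariance of the two variables divided by the product of their standard deviations. The sketch below recomputes it by hand on a handful of illustrative values (not the full dataset) and compares it with Pandas.

```python
import numpy as np
import pandas as pd

# A handful of illustrative Attack / HP pairs
df = pd.DataFrame({
    "Attack": [49, 62, 82, 52, 64, 84],
    "HP":     [45, 60, 80, 39, 58, 78],
})

x = df["Attack"].to_numpy(dtype=float)
y = df["HP"].to_numpy(dtype=float)

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
dx, dy = x - x.mean(), y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum())

# The same number, read off the Pandas correlation matrix
r_pandas = df.corr().loc["Attack", "HP"]

print("By hand :", r_manual)
print("Pandas  :", r_pandas)
```

A value close to +1 or -1 indicates a strong linear relationship, while a value near 0 indicates little linear relationship.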

One may visualize the correlation matrix as a `heatmap` to gain a better insight.

In [None]:
sb.heatmap(jointDF.corr(), vmin = -1, vmax = 1, annot = True, fmt=".2f")

---

### Multi-Variate Statistics

Similarly, we may analyze the six base-stat variables in the original dataset together.

In [None]:
# Extract only the numeric data variables
numDF = pd.DataFrame(pkmndata[["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]])

# Summary Statistics for all Variables
numDF.describe()
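
Instead of listing column names by hand, one could also let Pandas pick the numeric columns by dtype. This is only a sketch on a hypothetical mini-table; on the real dataset the same call would also keep columns such as **#**, **Total** and **Generation**, which is one reason to list the six stats explicitly as above.

```python
import pandas as pd

# Hypothetical mini-table with mixed column types
sample = pd.DataFrame({
    "Name": ["Bulbasaur", "Charmander", "Squirtle"],
    "HP": [45, 39, 44],
    "Attack": [49, 52, 48],
    "Legendary": [False, False, False],
})

# Keep only columns with a numeric dtype (text and boolean columns are dropped)
numeric_only = sample.select_dtypes(include="number")
print(numeric_only.columns.tolist())
```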

In [None]:
# Draw the Boxplots of all variables
f, axes = plt.subplots(1, 1, figsize=(24, 12))
sb.boxplot(data = numDF, orient = "h")

In [None]:
# Draw the distributions of all variables
f, axes = plt.subplots(6, 2, figsize=(12, 24))

for count, var in enumerate(numDF):
    # histplot with kde = True replaces the deprecated distplot
    sb.histplot(numDF[var], kde = True, stat = "density", ax = axes[count,0])
    sb.violinplot(x = numDF[var], ax = axes[count,1])

In [None]:
# Calculate the complete correlation matrix
numDF.corr()
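
With several variables, the correlation matrix quickly becomes hard to scan by eye. A minimal sketch of pulling out the most strongly correlated pair is shown below; the `sample` frame and the names `pairs` and `strongest` are illustrative assumptions, not part of the dataset.

```python
from itertools import combinations

import pandas as pd

# Illustrative values for three of the stats (not the full dataset)
sample = pd.DataFrame({
    "HP":     [45, 60, 80, 39, 58, 78],
    "Attack": [49, 62, 82, 52, 64, 84],
    "Speed":  [45, 60, 80, 65, 80, 100],
})

corr = sample.corr()

# Collect each unordered pair of distinct variables once
pairs = {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}

# Rank by absolute value, so strong negative correlations count too
strongest = max(pairs, key=lambda p: abs(pairs[p]))

for (a, b), r in pairs.items():
    print(f"{a:>6} vs {b:<6} : {r:+.2f}")
print("Strongest pair :", strongest)
```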

In [None]:
# Heatmap of the Correlation Matrix
f, axes = plt.subplots(1, 1, figsize=(12, 8))
sb.heatmap(numDF.corr(), vmin = -1, vmax = 1, annot = True, fmt = ".2f")

In [None]:
# Draw pairs of variables against one another
sb.pairplot(data = numDF)