### Exploratory Data Analysis (EDA)

Visualization is an essential method in any data scientist's toolbox and a key first step in exploring most datasets. <i>The process of exploring data visually, using simple summary statistics and a visualization toolkit, is known as</i> <b>Exploratory Data Analysis (EDA)</b>.

### Let's get started.
Fork the repository to run the notebook on your own computer.

#### Import the required packages

In this tutorial we will work with powerful Python packages like Pandas, Matplotlib and Seaborn. 

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
%matplotlib inline

The IPython magic command <i>%matplotlib inline</i> enables the display of graphics inline with the Python code. Without this command, your graphs may not be displayed inside the notebook, depending on your Jupyter setup.

#### Load and examine the dataset

The function shown in the cell below loads the data from the <i>.csv</i> file.
Execute the code in this cell to load the dataset into your notebook. <b>Make sure you have the .csv file in your working directory.</b>

In [None]:
def read_dataset(fileName="pokemon.csv"):
    cwd = os.getcwd()
    dataset_path = os.path.join(cwd, fileName)
    data = pd.read_csv(dataset_path)
    return data
pokemon = read_dataset()   

Let's take a first look at the dataset using the Pandas <i>head</i> method.

In [None]:
pokemon.head()

You can see that both numerical and string (categorical) variable types are present in the dataset. As a next step, examine some summary statistics of the numeric columns using the Pandas <i>describe</i> method.

In [None]:
pokemon.describe()

To find the dimensions of the dataset, use the Pandas <i>shape</i> attribute.

In [None]:
pokemon.shape

The dataset has <i>801 rows</i> and <i>41 columns</i>.

Use the Pandas <i>info</i> method to learn more about the dataset.

In [None]:
pokemon.info()

The <i>height_m</i>, <i>percentage_male</i>, <i>type2</i> and <i>weight_kg</i> columns contain null/missing values.

Count the number of null/missing values.

In [None]:
print("'height_m' column has {} null values".format(sum(pokemon['height_m'].isnull())))
print("'percentage_male' column has {} null values".format(sum(pokemon['percentage_male'].isnull())))
print("'type2' column has {} null values".format(sum(pokemon['type2'].isnull())))
print("'weight_kg' column has {} null values".format(sum(pokemon['weight_kg'].isnull())))
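
The per-column counts above can also be obtained for every column in a single call. The following is a minimal sketch on a small made-up frame; the values are illustrative assumptions and are not taken from pokemon.csv.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; these values are not from pokemon.csv.
toy = pd.DataFrame({
    "height_m": [0.7, np.nan, 2.0],
    "type2": ["poison", None, None],
    "weight_kg": [6.9, 13.0, 100.0],
})

# isnull() marks each missing cell as True; sum() then counts them per column.
null_counts = toy.isnull().sum()

# Keep only the columns that actually contain missing values.
print(null_counts[null_counts > 0])
```

On the real dataset the same pattern would be `pokemon.isnull().sum()`.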

Drop the rows where at least one element is missing.

In [None]:
pokemon.dropna(inplace=True)
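
Keep in mind that `dropna` without arguments discards a row if any of its columns is missing, which can remove a large share of the data when one column is sparsely filled. The sketch below, again on made-up toy data, compares this with restricting the check to chosen columns through the `subset` parameter. It is an optional alternative, not what the rest of this notebook uses.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; these values are not from pokemon.csv.
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "type2": ["poison", None, None, "flying"],
    "weight_kg": [6.9, 13.0, np.nan, 90.5],
})

# Drop a row if ANY column is missing.
dropped_any = toy.dropna()

# Drop a row only if 'weight_kg' is missing; a missing 'type2' is tolerated.
dropped_subset = toy.dropna(subset=["weight_kg"])

print(len(toy), len(dropped_any), len(dropped_subset))  # 4 2 3
```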

Now, re-examine the dataset using the Pandas <i>info</i> method.

In [None]:
pokemon.info()

We removed all the rows with missing values.

Now the <i>shape</i> of the dataset is:

In [None]:
pokemon.shape

List all the features of the dataset.

In [None]:
pokemon.columns

In [None]:
# or pass the DataFrame to list() to get the column names
list(pokemon)

To see specific rows, use the ```.iloc[<row-selection>, <column-selection>]``` indexer. Note the square brackets: <i>iloc</i> is indexed, not called.

In [None]:
pokemon.iloc[0:6] # displays the first six rows (positions 0 to 5)

In [None]:
pokemon.iloc[0:6, 0:5]

In [None]:
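pokemon.iloc[[5]] # displays the row at position 5 (the sixth row)

You may be wondering why we use <b>[[ ]]</b> with iloc. Passing a list of positions returns a DataFrame, whereas passing a single integer returns a Series.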

In [None]:
print(type(pokemon.iloc[[5]])) # Returns a dataframe

In [None]:
print(type(pokemon.iloc[5])) # Returns a series
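
The difference between the two calls above is easier to see on a small frame. This sketch uses made-up values, not rows from pokemon.csv.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({"name": ["a", "b", "c"], "hp": [45, 60, 80]})

# A list of positions keeps the two-dimensional shape: one row, two columns.
row_as_frame = toy.iloc[[1]]

# A single position collapses the row into a Series indexed by column name.
row_as_series = toy.iloc[1]

print(type(row_as_frame).__name__, row_as_frame.shape)    # DataFrame (1, 2)
print(type(row_as_series).__name__, row_as_series.shape)  # Series (2,)
```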

#### Visualization of the dataset

#### 1. Scatter Plot with Matplotlib

Scatter plots show the relationship between two variables in the form of dots on the plot. In simple terms, the values of one variable along the horizontal axis are plotted against the values of another variable along the vertical axis.

Matplotlib is the base of most Python plotting packages. Some basic understanding of Matplotlib will help you achieve better control of your graphics.

Let's start by making a simple scatter plot. Our recipe is simple:
* Import matplotlib.pyplot
* Use the plot function
* Specify the values to plot on the x and y axes.
* Specify that we want red dots using a type of 'ro'. If you do not specify a type, you will get a line plot, which is the default.

In [None]:
plt.plot(pokemon['height_m'], pokemon['weight_kg'], 'ro') # 'ro' draws red circles instead of a line

You can also draw the same graph using Pandas functions.

In [None]:
pokemon.plot.scatter(x = 'height_m', y = 'weight_kg', figsize = (10,5), fontsize = 20, title = "Height Vs Weight")

In the plot above, we can see that most Pokemon are short and light.

#### Exercise: 
In the cell below, create and execute the code required to display an improved version of the plot you created above. 

Draw a scatter plot with the 'speed' feature on the x-axis and the 'weight_kg' feature on the y-axis.

#### 2. Line Plots

Line plots are similar to point plots. In line plots, the discrete points are connected by lines.

In [None]:
# HP is kind of a measure of your Pokemon's stamina
pokemon['hp'].value_counts().sort_index()

In [None]:
ax = (pokemon['hp'].value_counts().sort_index().plot.line(figsize=(10,5), fontsize=20, color='red'))
ax.set_title("Line Chart of Pokemon's Stamina", fontsize=20)
ax.set_xlabel("HP of Pokemons", fontsize=20)
ax.set_ylabel("Number of Pokemons", fontsize=20)

The plot above shows how stamina (HP) varies across Pokemon. More than 35 Pokemon have an HP value near 60.

#### Exercise:
Draw a line chart using 'generation' as the feature.

#### 3. Bar Plots

Bar plots are used to display the counts of unique values of a categorical variable. The height of the bar represents the count for each unique category of the variable.

In [None]:
pokemon['type1'].value_counts()

In [None]:
ax = (pokemon['type1'].value_counts().plot.bar(figsize=(10,5), fontsize=20))
ax.set_title("Pokemon Type", fontsize=20) # set the fontsize of title
ax.set_xlabel("Type1",fontsize=20)
ax.set_ylabel("Number of Pokemons",fontsize=20)
sns.despine(bottom=True, left=True) # used to remove the black border

In the plot above, we see that the most common <i>type1</i> values are <i>bug</i> and <i>water</i>.

#### Exercise:
Draw a bar plot using 'is_legendary' as the feature.

#### 4. Pie Charts

A pie chart is a type of graph in which a circle is divided into sectors that each represent a proportion of the whole.

In [None]:
ax = (pokemon['is_legendary'].value_counts().plot.pie(figsize=(10,10), fontsize=20))
ax.set_title("Legendary Percentage", fontsize=20)

#### Exercise:

Draw a pie chart using 'type2' as the feature.

#### 5. Histogram

Histograms are related to bar plots, but are used for numeric variables. Whereas a bar plot shows the counts of unique categories, a histogram shows the number of data values that fall within each bin. The bins divide the range of the variable into equal segments. The vertical axis of the histogram shows the count of data values within each bin.
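
The binning behind a histogram can be reproduced without drawing anything. This is a minimal sketch using `numpy.histogram` on made-up weights; the values, the number of bins and the range are assumptions chosen for illustration.

```python
import numpy as np

# Made-up weights in kg, for illustration only.
weights = np.array([2.0, 6.9, 8.5, 13.0, 19.5, 29.5, 30.0, 55.0, 85.5, 90.5])

# Split the range 0-100 into 4 equal-width bins and count the values in each.
counts, edges = np.histogram(weights, bins=4, range=(0, 100))

print(edges)   # bin boundaries: 0, 25, 50, 75, 100
print(counts)  # values per bin: 5, 2, 1, 2
```

The bar heights of a histogram drawn from the same data with the same bins would equal `counts`.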

In [None]:
pokemon[pokemon['weight_kg'] < 200]['weight_kg']

In [None]:
ax = (pokemon[pokemon['weight_kg'] < 200]['weight_kg'].plot.hist(figsize=(10,5), fontsize=20))
ax.set_title("Pokemon's weight less than 200", fontsize=20)
ax.set_xlabel("weight in kg", fontsize=20)
ax.set_ylabel("Frequency", fontsize=20)
sns.despine(bottom=True, left=True)

In the plot above, we can see that most Pokemon weigh less than 25 kg.

#### Exercise:

Draw a histogram of the Pokemon with 'weight_kg' > 200.

#### 6. Box Plots

Box plots, also known as box-and-whisker plots, were introduced by John Tukey in 1970. Box plots are another way to visualize the distribution of data values. In this respect, box plots are comparable to histograms, but are quite different in presentation. On a box plot, the median value is shown with a dark bar. The inner two quartiles of data values are contained within the 'box'. The 'whiskers' enclose the majority of the data (with Seaborn's default settings, they extend up to 1.5 * the interquartile range beyond the edges of the box). Outliers are shown by symbols beyond the whiskers. Several box plots can be placed along an axis for comparison. The data are divided using a 'group by' operation, and the box plots for each group are stacked next to each other. In this way, the box plot allows you to display two dimensions of your dataset.
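
The quantities a box plot draws can be computed by hand. The sketch below assumes the common Tukey convention, which is the default in Matplotlib and Seaborn, where the whiskers reach at most 1.5 times the interquartile range beyond the quartiles. The data values are made up for illustration.

```python
import numpy as np

# Made-up values for illustration only.
values = np.array([5, 40, 45, 50, 55, 60, 65, 70, 150])

# The box spans the first to the third quartile; the bar marks the median.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Points outside these fences are drawn individually as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3)            # 45.0 55.0 65.0
print(lower_fence, upper_fence)  # 15.0 95.0
print(outliers)                  # [  5 150]
```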

In [None]:
# Set the default size for all the figures below
sns.set(rc={'figure.figsize':(10,6)})

In [None]:
ax = (sns.boxplot(x = 'is_legendary', y = 'attack', data = pokemon))
ax.set_title("is_legendary vs attack", fontsize=20)
ax.set_xlabel("is_legendary", fontsize=20)
ax.set_ylabel("attack", fontsize=20)

In the plot above, attack values vary more among non-legendary Pokemon than among legendary Pokemon.

#### Exercise:

Draw the box plot of 'is_legendary' vs 'defense'.

#### 7. Kernel Density Plots

Kernel density plots are similar in concept to a histogram. A kernel density plot displays the values of a smoothed density curve of the data values. In other words, the kernel density plot is a smoothed version of a histogram.
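
To make the idea of a "smoothed histogram" concrete, here is a minimal hand-written Gaussian kernel density estimate using NumPy only. The sample values and the bandwidth are assumptions picked by hand for this sketch; Seaborn chooses its bandwidth automatically, so its curve will generally differ.

```python
import numpy as np

# Made-up HP-like values, for illustration only.
samples = np.array([45.0, 50.0, 60.0, 60.0, 65.0, 80.0])
bandwidth = 5.0  # smoothing width, chosen by hand for this sketch

# Points at which the density curve is evaluated (step of 0.5).
grid = np.linspace(20, 110, 181)

# Place one Gaussian bump on every sample and average the bumps.
z = (grid[:, None] - samples[None, :]) / bandwidth
density = np.exp(-0.5 * z**2).sum(axis=1) / (len(samples) * bandwidth * np.sqrt(2 * np.pi))

# A density curve encloses a total area of approximately 1.
area = density.sum() * (grid[1] - grid[0])
peak = grid[np.argmax(density)]
print(round(area, 3), peak)
```

A larger `bandwidth` gives a smoother, flatter curve; a smaller one follows the individual samples more closely.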

In [None]:
# Bivariate KDE of hp against attack; the x/y keywords assume seaborn >= 0.11
ax = sns.kdeplot(data=pokemon, x='hp', y='attack')
ax.set_title("KDE Plot of hp Vs attack", fontsize = 20)
ax.set_xlabel("HP", fontsize = 20)
ax.set_ylabel("attack", fontsize = 20)

#### Exercise:
Draw the KDE plot of hp and defense.

#### 8. Violin Plots

A violin plot combines attributes of boxplots and a kernel density estimation plot. Like a box plot, the violin plots can be stacked, with a 'group by' operation. Additionally, the violin plot provides a kernel density estimate for each group. As with the box plot, violin plots allow you to display two dimensions of your dataset.

In [None]:
ax = (sns.violinplot(x='is_legendary', y='attack', data = pokemon))
ax.set_title("Violin plot of is_legendary Vs attack", fontsize=20)
ax.set_xlabel("is_legendary", fontsize =20)
ax.set_ylabel("attack", fontsize=20)

#### Exercise:
Draw the violin plot of is_legendary vs defense.

#### 9. Stacked Plot

In [None]:
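# Select the numeric columns before averaging; recent Pandas versions raise an error when mean() meets string columns
pokemon_stats_legendary = pokemon.groupby(['is_legendary', 'generation'])[['attack', 'defense']].mean()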
pokemon_stats_legendary

In [None]:
ax = (pokemon_stats_legendary.plot.bar(stacked=True, figsize=(10,5), fontsize=20))
ax.set_title("Stacked Plot of means of Attack and Defense", fontsize=20)
ax.set_xlabel("is_legendary, generation", fontsize=20)
ax.set_ylabel("mean", fontsize=20)

In [None]:
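# Select the numeric columns before averaging; recent Pandas versions raise an error when mean() meets string columns
pokemon_stats_by_generation = pokemon.groupby('generation')[['hp', 'attack', 'defense', 'sp_attack', 'sp_defense', 'speed']].mean()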
pokemon_stats_by_generation

In [None]:
ax = (pokemon_stats_by_generation.plot.line(figsize=(10,5), fontsize=20))
ax.set_title("Line Plot of mean stats by generation", fontsize=20)
ax.set_xlabel("generation", fontsize=20)
ax.set_ylabel("mean", fontsize=20)

#### 10. Subplotting

Subplotting is a technique for creating multiple plots that live side-by-side in one overall figure.

#### Countplot

In [None]:
fig, axarr = plt.subplots(2, 2, figsize=(12, 12))

pokemon['generation'].value_counts().plot.bar(
    ax=axarr[0][0], fontsize=12, color='mediumvioletred'
)
axarr[0][0].set_title("Generation", fontsize=18)

pokemon['type1'].value_counts().plot.bar(
    ax=axarr[1][0], fontsize=12, color='mediumvioletred')
axarr[1][0].set_title("Type1", fontsize=18)

pokemon['type2'].value_counts().plot.bar(
    ax=axarr[1][1], fontsize=12, color='mediumvioletred')
axarr[1][1].set_title("Type2", fontsize=18)

pokemon['is_legendary'].value_counts().plot.bar(
    ax=axarr[0][1], fontsize=12, color='mediumvioletred')
axarr[0][1].set_title("Legendary", fontsize=18)

plt.subplots_adjust(hspace=.3)


sns.despine()
