# Pandas visualisation - bonus materials

***You will only start understanding your data objectively when you start visualising it***

* Pandas has built-in methods for visualising data, which help you understand your data better

* Pandas plotting is built on top of Matplotlib, so we still need to import matplotlib (for example, to call `plt.show()`)
* Pandas' plotting capabilities are less flexible than those of Matplotlib, Seaborn or Plotly
* You typically use Pandas plotting for quick visualisations
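
As a minimal sketch of the relationship between the two libraries: `.plot()` returns a regular Matplotlib `Axes` object, so a quick Pandas plot can still be refined with Matplotlib methods. The `toy` DataFrame and its column names are made up for illustration and are not part of the lesson data.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data, made up purely for illustration
toy = pd.DataFrame({"day": [1, 2, 3, 4], "sales": [10, 12, 9, 15]})

# .plot() draws with Matplotlib and returns a Matplotlib Axes object
ax = toy.plot(x="day", y="sales", title="Toy sales per day")

# Because it is a regular Axes, Matplotlib methods can refine the quick plot
ax.set_ylabel("units sold")
ax.grid(True)

plt.show()
```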

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## `.plot()`

You can use `.plot()` to visualise data in Pandas. See the `DataFrame.plot` entry in the Pandas documentation for the full list of arguments.

A key argument is **kind**, which determines the kind of plot to produce. The options include:

* 'line': line plot (default)

* 'bar': vertical bar plot
* 'barh': horizontal bar plot
* 'hist': histogram
* 'box': boxplot
* 'kde': Kernel Density Estimation plot
* 'area': area plot
* 'pie': pie plot
* 'scatter': scatter plot
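
To illustrate how `kind` changes the output, here is a small sketch that draws the same data twice, once with `kind='bar'` and once with `kind='barh'`. The `toy` DataFrame is hypothetical and only serves the example.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data, made up purely for illustration
toy = pd.DataFrame({"species": ["A", "B", "C"], "count": [5, 8, 3]})

# Two side-by-side axes so both kinds can be compared
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))

# Same data, two different values of `kind`
toy.plot(kind="bar", x="species", y="count", ax=axes[0], title="kind='bar'")
toy.plot(kind="barh", x="species", y="count", ax=axes[1], title="kind='barh'")

plt.tight_layout()
plt.show()
```

Each kind is also available as a method on the plot accessor, so `toy.plot.bar(x="species", y="count")` is an alternative spelling of the first call.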

In [None]:
df = sns.load_dataset('penguins')
df = df.head(50)
df.head()

### Histogram (example)

* A histogram is effective for visualising how numerical data is distributed. It groups values into bins and displays the count of data points that fall into each bin

* We want to see how body mass is distributed

* We set `kind='hist'` and `y='body_mass_g'`
* Choosing `bins` is a "trial and error" exercise; start with a number and refine it until the visualisation is clear
* We also set `figsize` and `title`
* Visually, the majority of the data lies within the range of 3300 to 4500, with a peak at around 4000. The data does not look normally distributed. In a normal distribution, data points cluster around a central value with no bias to the left or the right; it is often called a "bell curve" because of its shape

In [None]:
df.plot(kind='hist', y='body_mass_g', bins=75, figsize=(10,6), title='body mass distribution')
plt.show()
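
One way to approach the "trial and error" choice of `bins` is to draw the same column with several bin counts side by side. The sketch below uses synthetic, seeded random values rather than the penguins data, and the candidate bin counts in `bin_options` are arbitrary.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic values standing in for a numerical column (not the penguins data)
rng = np.random.default_rng(seed=0)
toy = pd.DataFrame({"value": rng.normal(loc=4000, scale=400, size=50)})

bin_options = [5, 15, 50]  # arbitrary candidate bin counts to compare
fig, axes = plt.subplots(nrows=1, ncols=len(bin_options), figsize=(15, 4))

# Draw one histogram per candidate bin count
for ax, n_bins in zip(axes, bin_options):
    toy.plot(kind="hist", y="value", bins=n_bins, ax=ax, title=f"bins={n_bins}")

plt.tight_layout()
plt.show()
```

Too few bins tend to hide the shape of the distribution, while too many bins on a small sample tend to produce a spiky plot with many empty bins.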

* We can plot more than one distribution in one plot

* Suppose we want to see the distributions of `bill_length_mm` and `bill_depth_mm`
* We notice that the values don't overlap and that the two distributions have different shapes.

In [None]:
df.plot(kind='hist', y=['bill_length_mm', 'bill_depth_mm'], bins=50, figsize=(10,6))
plt.show()

### Box Plot (example)

A boxplot graphs data based on its quartiles. The first, second and third quartiles divide the data into approximately equal-sized quarters. A boxplot displays the distribution of data using five metrics that summarise a numerical distribution:
* minimum (min)
* first quartile (Q1)
* median
* third quartile (Q3)
* maximum (max)


<p align="center">
	<img src="../assets/img/box_plot.jpg" width="500">
</p>

* The range between the first and third quartiles, known as the interquartile range (IQR = Q3 - Q1), covers the middle 50% of your data
* The min and max show your data range (these points may or may not be outliers). In the plot above, the min and max values are outliers
* Outliers are data points that are dramatically different from the other data points
* The upper and lower boundaries beyond which a data point is considered an outlier are Q3 + 1.5 x IQR and Q1 - 1.5 x IQR
* You can check how tightly your data is grouped: the smaller the box, the more tightly grouped the data (lower variance)
* We will study these terms in more detail
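
The quantities above can be computed directly with Pandas. The following is a minimal sketch on a hypothetical series of values, chosen so that one point is clearly an outlier; `.quantile()` uses linear interpolation by default, so other quartile conventions may give slightly different numbers.

```python
import pandas as pd

# Hypothetical values, chosen so that one point (30) is clearly an outlier
values = pd.Series([2, 4, 4, 5, 6, 7, 8, 30], name="toy_values")

# Quartiles (linear interpolation is the Pandas default)
q1 = values.quantile(0.25)
median = values.quantile(0.50)
q3 = values.quantile(0.75)

# Interquartile range and the 1.5 x IQR boundaries
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Points outside the boundaries are flagged as outliers
outliers = values[(values < lower_bound) | (values > upper_bound)]

print(f"min={values.min()}, Q1={q1}, median={median}, Q3={q3}, max={values.max()}")
print(f"IQR={iqr}, bounds=({lower_bound}, {upper_bound})")
print(f"outliers={outliers.tolist()}")
```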

In [None]:
df = sns.load_dataset('iris')
df = df.head(50)
df.head()

In [None]:
df.plot(kind='box', y=['sepal_width'], figsize=(10, 7))
plt.show()

In [None]:
df.plot(kind='box', y=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'], figsize=(10, 7))
plt.show()

* Let's plot the same information with histograms

In [None]:
df.plot(kind='hist', bins=50, y=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'],
        figsize=(10, 7), alpha=0.7)
plt.show()

The chapter is based on the Predictive Analytics module by [Code Institute](https://codeinstitute.net/)

---

### In-class exercise
Select one dataset from the list in **lesson_4.1** and plot histograms and box plots of selected *numerical* features. Identify and drop N/A data, if necessary.
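
As a starting point for the "identify and drop N/A" step, here is a minimal sketch on a hypothetical DataFrame. The column names are made up; with a real dataset you would replace `toy` with your own data.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few missing values, made up for illustration
toy = pd.DataFrame({
    "length": [1.2, np.nan, 3.4, 2.2],
    "width": [0.5, 0.7, np.nan, 0.9],
    "label": ["a", "b", "c", "d"],
})

# Identify: count missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop: remove rows with a missing value in any numerical column
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
clean = toy.dropna(subset=numeric_cols)
print(clean)
```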