# Pandas Built-in Data Visualization

Author and instructor: ***Dr. Junaid Qazi, PhD***

Welcome to [pandas for Data Visualization](https://pandas.pydata.org/pandas-docs/stable/visualization.html)! In the Data Analysis section, we explored the power of pandas for data analysis. Now let's talk about pandas' built-in data visualization capabilities. We can create decent-looking plots directly from pandas; this functionality is built on matplotlib and embedded into pandas for quick use. <br>

Let's take a look!

```Python
# let's do some imports first
import numpy as np
import pandas as pd
%matplotlib inline
```

Let's set up two datasets now! <br>
`df1` has dates as its index and is read from a provided CSV file; `df2` has a sequential index and is generated from random numbers.

*We could use seaborn's built-in datasets as well, but to keep this lecture separate and independent, I am creating new data frames here.*

### Let's read a provided dataframe.

```Python
df1=pd.read_csv('data_for_pandas_viz.csv', index_col = 'date') 
# notice, I am asking pandas to set date col as index
```
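If `data_for_pandas_viz.csv` is not at hand, a stand-in for `df1` can be built from random numbers. This is only a sketch: the column names `A` to `D` are taken from the plots later in this lecture, while the name `df1_demo`, the start date, the number of rows and the normal distribution are assumptions, not properties of the provided file.

```python
import numpy as np
import pandas as pd

# Assumed shape: 500 daily rows and four numeric columns named A to D
rng = np.random.default_rng(42)
dates = pd.date_range("2000-01-01", periods=500, freq="D", name="date")

df1_demo = pd.DataFrame(
    rng.standard_normal((500, 4)),  # random values, NOT the real CSV content
    index=dates,
    columns=["A", "B", "C", "D"],
)
df1_demo.head()
```

Assigning `df1 = df1_demo` lets the remaining cells run, although the plots will of course differ from those produced by the provided file.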

```Python
df1.head()
```

```Python
df1.info()
```

### Generating `df2`

```Python
# rand(20, 3) returns a 20-by-3 2D array of uniform random numbers in [0, 1)
data2 = np.random.rand(20,3)
data2
```

```Python
# these are the column names for my dataframe
col = ['X', 'Y', 'Z']
```

```Python
df2 = pd.DataFrame(data = data2, columns=col) # passing data and col names!
```

```Python
df2.info()
```

## So, I have two dataframes, and I will be using them for a range of plots now!
**Let's learn by doing!**

Let's start with a simple histogram. We can get a histogram in two ways!
* `DataFrame.hist()`
* `DataFrame.plot(kind = 'hist')`

Let's try both one by one first. Later on, we will talk about style sheets and discuss the range of plotting options in detail!

```Python
df1['A'].hist(bins=30)
# Note that it is using matplotlib under the hood
```

```Python
# we can also plot using kind
df1['A'].plot(kind='hist',bins=30)
# For whole dataset
#df1.plot(kind='hist',bins=30)
```
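To compare the two calls side by side, here is a small sketch. It uses a made-up random Series named `s` instead of `df1['A']`, and passes an explicit `ax` to each call so both histograms land in one figure.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for df1['A']
s = pd.Series(np.random.default_rng(0).standard_normal(300), name="A")

fig, (left, right) = plt.subplots(1, 2, figsize=(10, 3))

# Both calls draw on the Axes we pass in and hand that Axes back
ax1 = s.hist(bins=30, ax=left)
ax2 = s.plot(kind="hist", bins=30, ax=right)

left.set_title("Series.hist()")
right.set_title("Series.plot(kind='hist')")
```

Because both calls return a matplotlib `Axes`, any further tweaking (titles, labels, limits) works the same way for either one.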

## Style Sheets

Matplotlib has [style sheets](http://matplotlib.org/gallery.html#style_sheets). We can use these to make our plots look a little nicer. Available styles include `bmh`, `fivethirtyeight`, `ggplot` and more. They basically create a set of style rules that our plots follow. It's good to use them because they give all our plots the same look and make them feel more professional. <br>
Let's call the style first!

```Python
import matplotlib.pyplot as plt
plt.style.use('ggplot')
```
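If you are not sure which style names exist in your installation, matplotlib can list them. The sketch below also shows `plt.style.context`, which applies a style to a single block of code only. The toy numbers passed to `hist` are made up for illustration.

```python
import matplotlib.pyplot as plt

# Names accepted by plt.style.use(); the exact list depends on the matplotlib version
styles = sorted(plt.style.available)
print(styles)

# Apply a style to one figure only, leaving the global style untouched
with plt.style.context("ggplot"):
    fig, ax = plt.subplots()
    ax.hist([1, 2, 2, 3, 3, 3], bins=3)
```

`plt.style.use` changes the style for everything that follows, while the `with` block is handy when only one plot should look different.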

Now re-plot the same histogram plot!

```Python
df1['A'].hist()
```

```Python
# Let's use another style, "bmh"
plt.style.use('bmh')
df1['A'].hist()
```

```Python
# we can use dark_background 
plt.style.use('dark_background')
df1['A'].hist()
```

```Python
# another example
plt.style.use('fivethirtyeight')
df1['A'].hist()
```

**Let's stick with the `ggplot` style and explore how to utilize pandas built-in plotting capabilities!**

```Python
plt.style.use('ggplot')
```

# Plot Types

There are several plot types built into pandas; most of them are statistical plots by nature:

* `df.plot.area`     
* `df.plot.barh`     
* `df.plot.hist`     
* `df.plot.line`     
* `df.plot.scatter`
* `df.plot.bar`      
* `df.plot.box`      
* `df.plot.hexbin`   
* `df.plot.kde` 
* `df.plot.density` 
* `df.plot.pie`

We can also just call `df.plot(kind='hist')` and replace the `kind` argument with any of the key terms shown in the list above (e.g. `'box'`, `'barh'`, etc.).<br>
**Let's go through these plots one-by-one using our data frames `df1` and `df2`!**
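Before going one by one, here is a quick sketch of how the `kind` argument can drive several plot types from a loop. The dataframe `demo` is a made-up stand-in shaped like `df2`, and the selection of kinds is just an example.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for df2: 20 rows, columns X, Y, Z
demo = pd.DataFrame(np.random.default_rng(1).random((20, 3)),
                    columns=["X", "Y", "Z"])

kinds = ["line", "bar", "barh", "hist", "box", "area"]
fig, axes = plt.subplots(2, 3, figsize=(12, 6))

# One subplot per kind; the string is all that changes between calls
for kind, ax in zip(kinds, axes.ravel()):
    demo.plot(kind=kind, ax=ax, title=kind, legend=False)

fig.tight_layout()
```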

### Area plot

```Python
#df2.plot.area(alpha=0.5)
df2.plot(kind = 'area', alpha=0.5)
#df2['X'].plot(kind = 'area', alpha=0.5)
```

### Bar plots

```Python
df2.plot.bar()
```

```Python
# Horizontal bars
df2.plot.barh()
```

```Python
# we can stack them on top of each other 
df2.plot.bar(stacked=True)
```

### Histograms

```Python
df1['A'].plot.hist(bins=30)
```

***We can get all the columns on the same plot!***

```Python
df1.plot.hist(alpha = 0.5)
```

A histogram can be **stacked** using `stacked=True`. This is not what we use often but it is good to know!

```Python
df1.plot.hist(stacked = True)
```

We can pass other keywords supported by matplotlib's `hist`. For example, horizontal and cumulative histograms can be drawn with `orientation='horizontal'` and `cumulative=True`.<br>
Let's pass `orientation='horizontal'`.

```Python
df1.plot.hist(orientation='horizontal',alpha = 0.5)#, cumulative=True)
```
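The commented-out `cumulative=True` option can be sketched on its own. The Series `values` below is made-up random data, not `df1`. In a cumulative histogram each bar includes everything to its left, so the last bar reaches the total number of observations.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data: 200 random values
values = pd.Series(np.random.default_rng(2).standard_normal(200), name="A")

fig, ax = plt.subplots(figsize=(6, 3))

# cumulative=True is forwarded to matplotlib's hist
values.plot.hist(bins=20, cumulative=True, alpha=0.5, ax=ax)

# The tallest (last) bar counts every observation
tallest = max(patch.get_height() for patch in ax.patches)
print(tallest)
```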

### Line plots

```Python
df1.plot.line(y='B',figsize=(12,3),lw=1)
```

### Scatter plots
A scatter plot can be drawn by using the `DataFrame.plot.scatter()` method. A scatter plot requires numeric columns for the x and y axes. These can be specified with the `x` and `y` keywords.<br>
Let's try a scatter plot with df1 here!

```Python
df1.plot.scatter(x='A',y='B')
```

To plot multiple column groups on a single axes, we need to repeat the `plot` call and specify the target `ax`. It is recommended to specify the `color` and `label` keywords to distinguish each group.

```Python
ax = df1.plot.scatter(x='A',y='B', 
                 color='DarkBlue', label='Group 1')
df1.plot.scatter(x='C',y='D', 
                 color='DarkGreen', label='Group 2', ax = ax)
```

The keyword `c` may be given as the name of a column to provide colors for each point. `cmap` can be used to indicate colormap. <br>
For all the colormaps, [click here](http://matplotlib.org/users/colormaps.html)

```Python
df1.plot.scatter(x='A',y='B',c='C',cmap='coolwarm')
```

```Python
# We can use 's' to indicate size
df1.plot.scatter(x='A',y='B',c='C',s=50,cmap='coolwarm')
```

We can use **`s`** to indicate size of the bubble based on another column. <br>
**`s`** parameter needs to be an array, not just the name of a column:

```Python
df1.plot.scatter(x='A',y='B',s=df1['D']*100)
# s is the marker area; negative values in D are not valid sizes and can trigger a sqrt warning
```
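One way to avoid negative sizes is to rescale the column into a positive range before passing it to `s`. The sketch below uses min-max scaling on a made-up dataframe `demo`; the target range of 10 to 300 is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with negative values in D, similar in spirit to df1
rng = np.random.default_rng(3)
demo = pd.DataFrame(rng.standard_normal((100, 3)), columns=["A", "B", "D"])

# Min-max scale D into [10, 300] so every marker gets a positive area
d = demo["D"]
sizes = 10 + 290 * (d - d.min()) / (d.max() - d.min())

ax = demo.plot.scatter(x="A", y="B", s=sizes, alpha=0.6)
```

The smallest value of `D` maps to size 10 and the largest to size 300, so the ordering of the bubbles is preserved.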

## BoxPlots
Box plots visualize the distribution of values within each column.

```Python
df2.plot.box()
```

```Python
df2.plot.box(vert=False)
```

A box plot can be colorized by passing the `color` keyword. We can pass a dict whose keys are `boxes`, `whiskers`, `medians` and `caps`. If some keys are missing in the dict, default colors are used for the corresponding artists.

```Python
color = dict(boxes='DarkGreen', whiskers='DarkOrange',
             medians='DarkBlue', caps='Gray')
```

```Python
df2.plot.box(color=color)
```

We can also pass a **`by`** argument for groupby in the box plots.<br>
Let's add another column with A and B (two choices) in df2

```Python
df2['group'] = np.random.choice(['A', 'B'], size=20, ) # notice random choice
```

```Python
df2.head()
```

`DataFrame.boxplot` can still be used to draw box plots as well.<br>
Let's grab the `X` and `Y` columns and pass `by = 'group'`.

```Python
df2.boxplot(column =['X','Y'], by = 'group');
```
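To see the numbers behind a grouped box plot, the quartiles can be computed per group with `groupby`. This is a sketch on a made-up dataframe `demo`; unlike the random choice used above, the groups here are assigned deterministically (ten rows each) so the output always contains both.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df2 with a group column
rng = np.random.default_rng(4)
demo = pd.DataFrame(rng.random((20, 2)), columns=["X", "Y"])
demo["group"] = np.repeat(["A", "B"], 10)  # first 10 rows A, last 10 rows B

# Per group: 25% and 75% correspond to the box edges, 50% to the median line
quartiles = demo.groupby("group")[["X", "Y"]].quantile([0.25, 0.5, 0.75])
print(quartiles)
```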

I am going to drop the `group` column because I don't need it anymore!

```Python
df2.drop('group', axis=1, inplace=True)
df2.head()
```

## Hexagonal Bin Plot

Hexbin plots can be a useful alternative to scatter plots if your data are too dense to plot each point individually.

```Python
df1.plot.hexbin(x='A',y='B')#,gridsize=25,cmap='Oranges')
```

A useful keyword argument is `gridsize`; it controls the number of hexagons in the x-direction, and defaults to 100. A larger `gridsize` means more, smaller bins. Let's pass `gridsize = 25` and see what the plot above looks like!

```Python
df1.plot.hexbin(x='A',y='B',gridsize=25,cmap='Oranges')
```
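The effect of `gridsize` can be checked by drawing two grids side by side and counting the hexagons in each. This is a sketch on made-up data in `demo`; reading the counts back through `collections[0].get_array()` relies on matplotlib's hexbin drawing a single collection per plot.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data: 1000 random points
rng = np.random.default_rng(5)
demo = pd.DataFrame(rng.standard_normal((1000, 2)), columns=["A", "B"])

fig, (left, right) = plt.subplots(1, 2, figsize=(10, 4))
demo.plot.hexbin(x="A", y="B", gridsize=10, cmap="Oranges",
                 ax=left, colorbar=False)
demo.plot.hexbin(x="A", y="B", gridsize=25, cmap="Oranges",
                 ax=right, colorbar=False)

# The collection's array holds one value (a point count) per drawn hexagon
n_coarse = len(left.collections[0].get_array())
n_fine = len(right.collections[0].get_array())
print(n_coarse, n_fine)
```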

## Kernel Density Estimation plot (KDE)

```Python
df2['X'].plot.kde()
#df2['X'].plot.density()
```

We can use `density()` as well!<br>
Let's try the complete dataframe `df2` with density!

```Python
df2.plot.density()
```

## Pie Plot
We can create a pie plot with `DataFrame.plot.pie()` or `Series.plot.pie()`. If your data includes any NaN, they will be automatically filled with 0. A ValueError will be raised if there are any negative values in your data.<br>
Let's try pie plot with df2.<br>

We need to specify a target column with the `y` argument. When `y` is specified, a pie plot of the selected column will be drawn.

```Python
# let's use the X column only
df2.plot.pie(y='X')#,figsize=(12, 4),legend=False)
```

The plot above does not look nice. Let's remove the legend and set the figure size!

```Python
#df2.plot.pie('X', figsize=(4, 4), legend=False) # for older pandas versions
df2.plot.pie(y = 'X', figsize=(4, 4), legend=False) #<shift tab> for doc string
```

Instead of passing `y`, we can pass in `subplots=True`. This will generate a pie plot for each column as a subplot. A legend will be drawn in each pie plot by default; specify `legend=False` to hide it.

```Python
df2.plot.pie(subplots=True, figsize=(12, 4),legend=False) # subplot parameter 
```
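The rule about negative values mentioned at the start of this section can be tried out directly. The two small Series below are made up for illustration: the first plots fine, the second contains a negative number and is rejected.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data: three positive shares
shares = pd.Series([3, 1, 2], index=["a", "b", "c"], name="share")
fig, ax = plt.subplots(figsize=(4, 4))
shares.plot.pie(ax=ax)

# The same data with one negative value
with_negative = pd.Series([3, -1, 2], index=["a", "b", "c"], name="share")
try:
    with_negative.plot.pie()
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)
```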

That was all about data visualization using pandas. You can see how convenient pandas' plotting capabilities are during exploratory data analysis (EDA): they balance ease of use with control over the figure. Many of the plot calls also accept additional arguments of the underlying matplotlib call. <br>
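As a last sketch of that control, matplotlib keywords such as `lw` and `alpha` can be passed straight through, and the returned `Axes` can be customized further. The dataframe `demo` and the title and label texts are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df2
demo = pd.DataFrame(np.random.default_rng(6).random((20, 3)),
                    columns=["X", "Y", "Z"])

# lw and alpha are passed through to matplotlib; the return value is an Axes
ax = demo.plot.line(figsize=(8, 3), lw=1, alpha=0.8)

# From here on it is plain matplotlib
ax.set_title("Random values")
ax.set_xlabel("row index")
ax.set_ylabel("value")
ax.figure.tight_layout()
```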

Let's move on to the exercises section now, where we will try to solve some tasks related to pandas data visualization. See you in the next lecture, Good Luck!

# Great Job!