# EXPLORATORY DATA ANALYSIS (EDA)

## Basic chart types
Now that we have loaded and had a first look at the data, let's get to work making some charts.

There are innumerable chart types used for data exploration. In this tutorial we focus on the most commonly used ones:

- Scatter plots
- Line plots
- Bar plots
- Histograms
- Box plots
- Kernel density estimation plots
- Violin plots

### Scatter plots
Scatter plots show the relationship between two variables as dots on the plot. In simple terms, the values of one variable along the horizontal axis are plotted against the values of a second variable on the vertical axis.

#### Scatter plots with Matplotlib
Matplotlib is at the base of most Python plotting packages. Some basic understanding of Matplotlib will help you achieve better control of your graphics.

Let's start by making a simple scatter plot. Our recipe is simple:

1. Import matplotlib.pyplot.
2. Use the plot function.
3. Specify the values to plot on the x and y axes.
4. Specify that we want green dots using the format string 'go'. If you do not specify a format string, you get a line plot, which is the default.

Execute the code in the cells below to create a scatter plot of record_id and report_year.

Note: The IPython magic command %matplotlib inline enables the display of graphics inline with the Python code. If you do not include this command, your graphs may not be displayed in the notebook.
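
To make the 'go' format string from the recipe concrete, here is a minimal sketch on made-up numbers (the lists `years` and `costs` are illustrative and are not taken from fuel_dataset.csv). It compares the line object produced with 'go' against the one produced with no format string.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs outside a notebook
import matplotlib.pyplot as plt

# Made-up illustrative values, not from the fuel data set
years = [1994, 1995, 1996, 1997]
costs = [18.6, 19.1, 17.8, 20.3]

fig, ax = plt.subplots()

# 'go' = color 'g' (green) + marker 'o' (circle); no line style is given, so no line is drawn
(dots,) = ax.plot(years, costs, 'go')

# Without a format string we get the default: a solid line
(line,) = ax.plot(years, costs)

print(dots.get_color(), dots.get_marker(), dots.get_linestyle())
print(line.get_linestyle())
```

Other one-letter colors and markers combine the same way, for example 'rs' for red squares or 'b^' for blue triangles.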


In [1]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline 

In [2]:
df = pd.read_csv('fuel_dataset.csv')

In [3]:
# df['fuel_qty_burned'] = df['fuel_qty_burned'].astype(str)

In [4]:
# df.dtypes

In [5]:
# astype(int('fuel_qty_burned'))

In [6]:
df.head()

Unnamed: 0,record_id,utility_id_ferc1,report_year,plant_name_ferc1,fuel_type_code_pudl,fuel_unit,fuel_qty_burned,fuel_mmbtu_per_unit,fuel_cost_per_unit_burned,fuel_cost_per_unit_delivered,fuel_cost_per_mmbtu
0,f1_fuel_1994_12_1_0_7,1,1994,rockport,coal,ton,5377489.0,16.59,18.59,18.53,1.121
1,f1_fuel_1994_12_1_0_10,1,1994,rockport total plant,coal,ton,10486945.0,16.592,18.58,18.53,1.12
2,f1_fuel_1994_12_2_0_1,2,1994,gorgas,coal,ton,2978683.0,24.13,39.72,38.12,1.65
3,f1_fuel_1994_12_2_0_7,2,1994,barry,coal,ton,3739484.0,23.95,47.21,45.99,1.97
4,f1_fuel_1994_12_2_0_10,2,1994,chickasaw,gas,mcf,40533.0,1.0,2.77,2.77,2.57


In [None]:
%matplotlib inline
plt.plot(df['record_id'], df['report_year'], 'go')


#### Scatter plots with Pandas
While you can create almost any visualization with Matplotlib, given enough code, you may want a simpler approach if your data are in a Pandas data frame. The Pandas package contains a number of useful plot methods that operate on data frames. The simple recipe for plotting from Pandas data frames is:

1. Use the plot method, specifying the kind argument, or use a chart-specific plot method.
2. Specify the columns with the values for the x and y axes.
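
The two options in the first step are interchangeable. As a small sketch on a made-up data frame (`toy`, with hypothetical columns `qty` and `cost`), the kind argument and the chart-specific method produce the same kind of plot:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs outside a notebook
import pandas as pd

# Made-up data frame for illustration only
toy = pd.DataFrame({'qty': [1.0, 2.0, 3.0, 4.0],
                    'cost': [2.5, 2.7, 3.1, 3.0]})

# Option 1: generic plot method with the kind argument
ax_kind = toy.plot(kind='scatter', x='qty', y='cost')

# Option 2: chart-specific method
ax_method = toy.plot.scatter(x='qty', y='cost')

# Both calls return a Matplotlib Axes labelled with the column names
print(ax_kind.get_xlabel(), ax_kind.get_ylabel())
print(ax_method.get_xlabel(), ax_method.get_ylabel())
```

Because each call returns a Matplotlib Axes, you can keep customizing the result with the usual Axes methods.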

In [None]:
df.plot(kind = 'scatter', x = 'fuel_qty_burned', y = 'fuel_cost_per_unit_burned')

In [None]:
fig = plt.figure(figsize=(6, 6)) # define plot area
ax = fig.gca() # define axis                   
df.plot(kind = 'scatter', x = 'report_year', y = 'fuel_qty_burned', ax = ax)
ax.set_title('Scatter plot of report_year vs fuel_qty_burned') # Give the plot a main title
ax.set_xlabel('report_year') # Set text for the x axis
ax.set_ylabel('fuel_qty_burned')# Set text for y axis

In [None]:
# x = list(range(100))
# y = [z * z for z in range(100)]
# t = pd.DataFrame({'x':x, 'y':y})

In [None]:
# fig = plt.figure(figsize=(10, 10)) # define plot area
# ax = fig.gca() # define axis                   
# t.plot(x = 'x', y = 'y', ax = ax) ## line is the default plot type
# ax.set_title('Line plot of x^2 vs. x') # Give the plot a main title
# ax.set_xlabel('x') # Set text for the x axis
# ax.set_ylabel('x^2')# Set text for y axis
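
The commented-out cells above outline a line plot, which is the default kind for the pandas plot method. A runnable version on generated data (x against x squared, with no fuel data involved) might look like this:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Generated data: x and x squared
x = list(range(100))
y = [z * z for z in x]
t = pd.DataFrame({'x': x, 'y': y})

fig = plt.figure(figsize=(10, 10))  # define plot area
ax = fig.gca()                      # define axis
t.plot(x='x', y='y', ax=ax)         # line is the default plot kind
ax.set_title('Line plot of x^2 vs. x')
ax.set_xlabel('x')
ax.set_ylabel('x^2')
```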

#### Bar plots
Bar plots are used to display the counts of unique values of a categorical variable. The height of the bar represents the count for each unique category of the variable.

It is unlikely that your pandas data frame already includes counts by category of a variable. Thus, the first step in making a bar plot is to compute the counts. Fortunately, pandas has a value_counts method. The code below uses this method to create a new Series containing the number of records for each plant name.
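
As a quick illustration of what value_counts returns, here is a sketch on a made-up Series of plant names (the names are borrowed from the rows shown earlier, but the repetitions are invented):

```python
import pandas as pd

# Made-up plant names for illustration only
plants = pd.Series(['barry', 'gorgas', 'barry', 'rockport', 'barry', 'gorgas'])

# Result is a Series: the index holds the unique values, the values hold the counts,
# sorted from most to least frequent
toy_counts = plants.value_counts()
print(toy_counts)
```

The index of the result supplies the bar labels and the values supply the bar heights when you call plot.bar on it.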

In [None]:
counts = df['plant_name_ferc1'].value_counts()
counts

In [None]:
df[['plant_name_ferc1']]

In [None]:
fig = plt.figure(figsize=(10,10)) # define plot area
ax = fig.gca() # define axis    
counts.plot.bar(ax = ax) # Use the plot.bar method on the counts Series
ax.set_title('Number of records per plant') # Give the plot a main title
ax.set_xlabel('Plant name') # Set text for the x axis
ax.set_ylabel('Number of records')# Set text for y axis

### Histograms
Histograms are related to bar plots, but they are used for numeric variables. Whereas a bar plot shows the counts of unique categories, a histogram shows the number of data values that fall within each bin. The bins divide the range of the variable into equal-width segments. The vertical axis of the histogram shows the count of data values within each bin.
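
The binning step can be sketched directly with NumPy on a handful of made-up values (not taken from the fuel data). `np.histogram` returns the count per bin together with the bin edges:

```python
import numpy as np

# Made-up numeric values for illustration only
values = np.array([1.0, 1.5, 2.0, 2.2, 3.7, 4.1, 4.4, 9.0])

# Split the range [min, max] into 4 equal-width bins and count the values in each
counts, edges = np.histogram(values, bins=4)
widths = np.diff(edges)

print(edges)   # bin boundaries
print(counts)  # number of values per bin
print(widths)  # every bin has the same width
```

The pandas plot.hist method does the same counting before drawing the bars, using 10 bins unless you pass a different bins argument.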

The code below follows the same basic recipe to create a histogram of fuel_mmbtu_per_unit. Notice, however, that the column of the data frame we wish to plot is selected by name as df['fuel_mmbtu_per_unit']. A histogram needs a numeric column; a categorical column such as fuel_type_code_pudl calls for a bar plot of its counts instead.

In [None]:
fig = plt.figure(figsize=(10,10)) # define plot area
ax = fig.gca() # define axis    
df['fuel_mmbtu_per_unit'].plot.hist(ax = ax) # Use the plot.hist method on a numeric column of the data frame
ax.set_title('Histogram of fuel_mmbtu_per_unit') # Give the plot a main title
ax.set_xlabel('fuel_mmbtu_per_unit') # Set text for the x axis
ax.set_ylabel('Number of records')# Set text for y axis

### Box plots
Box plots, also known as box-and-whisker plots, were introduced by John Tukey in 1970. Box plots are another way to visualize the distribution of data values. In this respect, box plots are comparable to histograms, but they are quite different in presentation.

On a box plot the median value is shown with a bar across the box. The inner two quartiles of data values, from the first quartile to the third, are contained within the 'box'. The 'whiskers' enclose the majority of the data: by default they extend to the most extreme values within 1.5 * interquartile range of the box edges. Outliers are shown by symbols beyond the whiskers.
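
The quantities a box plot draws can be computed by hand. The sketch below uses made-up numbers and the whisker factor of 1.5 that Matplotlib and pandas use by default (the `whis` parameter); the variable names are illustrative.

```python
import numpy as np

# Made-up values with one obvious outlier
data = np.array([2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 30.0])

# Box: first quartile, median, third quartile
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1  # interquartile range = height of the box

# Whiskers reach the most extreme data values that lie inside these limits
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Anything outside the limits is drawn as an individual outlier symbol
outliers = data[(data < lower_limit) | (data > upper_limit)]

print(q1, median, q3, iqr)
print(lower_limit, upper_limit)
print(outliers)
```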

Several box plots can be stacked along an axis for comparison. The data are divided using a 'group by' operation, and the box plots for each group are stacked next to each other. In this way, the box plot allows you to display two dimensions of your data set.
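
The 'group by' step can be sketched separately from the drawing. On a made-up data frame (`toy`, with hypothetical columns `fuel` and `cost`), each group gets its own summary, which is what each box in a grouped box plot represents:

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    'fuel': ['coal', 'coal', 'coal', 'gas', 'gas', 'gas'],
    'cost': [18.5, 39.7, 47.2, 2.8, 3.1, 2.5],
})

# One row of summary statistics per group; a grouped box plot draws one box per row
summary = toy.groupby('fuel')['cost'].agg(['min', 'median', 'max'])
print(summary)
```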

The code in the cell below generally follows the recipe we have been using. The data frame is subset to two columns. One column contains the numeric values to plot and the other column is the group by variable. In this case, the group by variable is specified with the by = 'fuel_type_code_pudl' argument.

In [None]:
fig = plt.figure(figsize=(10,10)) # define plot area
ax = fig.gca() # define axis    
df[['fuel_type_code_pudl','fuel_qty_burned']].boxplot(by = 'fuel_type_code_pudl', ax = ax) # Use the boxplot method on the two-column subset, grouped by fuel type
ax.set_title('Box plots of fuel quantity burned by fuel type') # Give the plot a main title
ax.set_xlabel('Fuel type') # Set text for the x axis
ax.set_ylabel('Fuel quantity burned')# Set text for y axis

### Kernel density plots and introduction to Seaborn

In [None]:
import seaborn as sns
sns.set_style("whitegrid")
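sns.kdeplot(df['fuel_cost_per_unit_burned'].dropna()) # KDE needs a numeric column; fuel_unit holds text labels such as 'ton' and 'mcf'

A KDE plot shows the same kind of information as a histogram, but as a smooth curve instead of binned counts. Look for the same features you would look for in a histogram: where the bulk of the values lies, whether the distribution is skewed, and whether a long tail points to a few records with unusually large values.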
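
To see what a kernel density estimate computes, here is a minimal NumPy sketch with a Gaussian kernel. The function name `gaussian_kde_1d`, the sample values, and the bandwidth of 0.5 are all illustrative assumptions; Seaborn chooses a bandwidth automatically and this is not its implementation.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps of width `bandwidth`, one centred on each sample."""
    samples = np.asarray(samples, dtype=float)
    # Distance of every grid point from every sample, in units of the bandwidth
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return bumps.mean(axis=1)

# Made-up values: a cluster around 1 to 2.5 and one distant point
samples = [1.0, 1.2, 1.9, 2.1, 2.3, 6.0]
grid = np.linspace(-5.0, 12.0, 1701)
density = gaussian_kde_1d(samples, grid, bandwidth=0.5)

# A density integrates to (approximately) 1 over the grid
area = float(np.sum(density) * (grid[1] - grid[0]))
peak_location = float(grid[np.argmax(density)])
print(area, peak_location)
```

A smaller bandwidth gives a bumpier curve that follows individual points, while a larger one gives a smoother curve that can hide detail.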

As we did with pandas plots, we control the properties of a Seaborn plot by specifying axes. The code in the cell below extends our simple recipe for Seaborn plots:

1. Define a figure.
2. Define one or more axes on the figure.
3. Set a style for the plot grid.
4. Define the plot type and columns to be plotted.
5. Use methods on the axes to control attributes like titles and axis labels.

In [None]:
fig = plt.figure(figsize=(10,10)) # define plot area
ax = fig.gca() # define axis 
sns.set_style("whitegrid")
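sns.kdeplot(df['fuel_cost_per_unit_burned'].dropna(), ax = ax) # KDE needs a numeric column; fuel_unit is categorical
ax.set_title('KDE plot of fuel_cost_per_unit_burned') # Give the plot a main title
ax.set_xlabel('fuel_cost_per_unit_burned') # Set text for the x axis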
ax.set_ylabel('Density')# Set text for y axis

In [None]:
fig = plt.figure(figsize=(10,10)) # define plot area
ax = fig.gca() # define axis 
sns.set_style("whitegrid")
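sns.kdeplot(data = df, x = 'fuel_mmbtu_per_unit', y = 'fuel_cost_per_unit_burned', ax = ax, cmap="Blues_d") # Two numeric columns give a 2-D KDE; the x and y keywords assume Seaborn 0.11 or later
ax.set_title('KDE plot of fuel_mmbtu_per_unit and fuel_cost_per_unit_burned') # Give the plot a main title
ax.set_xlabel('fuel_mmbtu_per_unit') # Set text for the x axis
ax.set_ylabel('fuel_cost_per_unit_burned')# Set text for y axis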

#### Violin plots
Now, we will use Seaborn to create a violin plot. A violin plot combines attributes of a box plot and a kernel density estimation plot. Like box plots, violin plots can be stacked with a group by operation. In addition, the violin plot provides a kernel density estimate for each group. As with the box plot, violin plots allow you to display two dimensions of your data set.

The code in the cell below follows the recipe we have laid out for Seaborn plotting. The sns.violinplot function takes more arguments than the kdeplot function:

- The columns used for the plot are set with the x and y arguments.
- The x column is the group by variable.
- The data argument specifies the pandas data frame that contains those columns.

In [None]:
fig = plt.figure(figsize=(8,8)) # define plot area
ax = fig.gca() # define axis 
sns.set_style("whitegrid")
sns.violinplot(x = 'fuel_unit', y = 'fuel_cost_per_unit_burned', data = df, ax = ax)
ax.set_title('Violin plots of fuel cost by fuel unit') # Give the plot a main title
ax.set_xlabel('Fuel unit') # Set text for the x axis
ax.set_ylabel('fuel cost')# Set text for y axis