# Tutorial 2: Data Visualization
Hi everyone, thank you for taking the time to learn some Python! We will now introduce you to more advanced content. This tutorial covers how to create visualizations: scatter plots, line plots, bar graphs, and histograms. Additionally, we will learn how to add titles, axis labels, and color-coding to visualizations.


# Lesson 1: Scatter Plot
For our first lesson, we will learn how to take a pandas dataframe and visualize its contents as a scatter plot. <br>
Let us first discuss how to create a `scatter plot` and introduce you to `matplotlib.pyplot`.

First, we must import the required packages:

In [None]:
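import numpy as np                    # NumPy provides arrays and numerical tools such as random number generation
from dplython import diamonds         # the diamonds data frame used later in this lesson
import matplotlib.pyplot as plt       # pyplot is Matplotlib's plotting interface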

## Create your own Scatter Plot from random data

In [None]:
# This is an example of a Scatter Plot
x = np.random.rand(5)
y = np.random.rand(5)

# The function 'scatter()' below creates the Scatter Plot!
examplePlot = plt.scatter(x, y)

# The show() function displays the new plot; it does not take the plot as an argument
plt.show()

The above plot is a good example of a simple scatter plot. You must add `plt.` before calling the function `scatter()` because it is part of the `matplotlib.pyplot` package, which we imported as `plt`!

In [None]:
# Write your scatter plot recreation in this chunk
myScatterPlot = 

Now to check if your input was correct. If you were right, "Correct" will be printed! Hints will print if you need help getting to the correct answer.

In [None]:
# Run this chunk to check your answer
check.answer_3(myScatterPlot)

## Scatter plot using data from the `diamonds` data frame

In [None]:
plt.scatter(x=diamonds['carat'], y=diamonds['price'])
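
Titles, axis labels, and color-coding can be added to a scatter plot like the one above. The following is a minimal sketch, not the tutorial's official solution: the small data frame `df` and its values are made up for illustration (so the cell runs without `dplython`), and the choice of the `depth` column for color-coding is an assumption.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for the diamonds data frame (made-up values).
df = pd.DataFrame({
    "carat": [0.23, 0.31, 0.70, 1.01, 1.50],
    "price": [326, 544, 2757, 4500, 9800],
    "depth": [61.5, 59.8, 62.4, 63.3, 60.9],
})

fig, ax = plt.subplots()

# c= supplies the values used for color-coding; cmap= picks the color map.
points = ax.scatter(df["carat"], df["price"], c=df["depth"], cmap="viridis")

# Title and axis labels.
ax.set_title("Price vs. carat")
ax.set_xlabel("carat")
ax.set_ylabel("price")

# The colorbar explains what the colors mean.
fig.colorbar(points, ax=ax, label="depth")
```

The same calls should work with the real `diamonds` data frame by replacing `df` with `diamonds`.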

# Lesson 2: Line Plots
For our second lesson, we will learn how to take a pandas dataframe and visualize its contents as a line plot. A line plot connects successive data points with line segments, which makes it useful for showing how a value changes along a continuous axis such as time. It works best for data that has a single defined y value for each x value.

Here is an example where we start by creating a figure and an axes. In their simplest form, a figure and axes can be created as follows:

In [None]:
fig = plt.figure()
ax = plt.axes()

In Matplotlib, the figure (an instance of the class `plt.Figure`) can be thought of as a single container that contains all the objects representing axes, graphics, text, and labels. The axes (an instance of the class `plt.Axes`) is what we see above: a bounding box with ticks and labels, which will eventually contain the plot elements that make up our visualization. Throughout this tutorial, we'll commonly use the variable name `fig` to refer to a figure instance, and `ax` to refer to an axes instance or group of axes instances.
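
The container relationship between a figure and its axes can be checked directly. This is a small sketch that uses `plt.subplots()`, a convenience function not used elsewhere in this tutorial, to create both objects in one call:

```python
import matplotlib.pyplot as plt

# Create one figure that contains one axes.
fig, ax = plt.subplots()

# The figure is the container; the axes lives inside it.
print(isinstance(fig, plt.Figure))  # is fig a Figure instance?
print(isinstance(ax, plt.Axes))     # is ax an Axes instance?
print(fig.axes == [ax])             # the figure lists the axes it contains
print(ax.figure is fig)             # the axes knows which figure owns it
```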

Once we have created an axes, we can use the `ax.plot` function to plot data. Let's start with a simple sinusoid:

In [None]:
fig = plt.figure()
ax = plt.axes()

x = np.linspace(0, 10, 1000)
ax.plot(x, np.sin(x));

Alternatively, we can use the pyplot interface and let the figure and axes be created in the background:

In [None]:
plt.plot(x, np.sin(x));

If we want to create the same figure with multiple lines, we can call the plot function multiple times:

In [None]:
plt.plot(x, np.sin(x))
plt.plot(x, np.cos(x));

The first adjustment you might wish to make to a plot is to control the line colors and styles. The plt.plot() function takes arguments that can be used to specify these. To adjust the color, you can use the color keyword, which accepts a string argument representing virtually any imaginable color. The color can be specified in different ways:

In [None]:
plt.plot(x, np.sin(x - 0), color='blue')        # specify color by name
plt.plot(x, np.sin(x - 1), color='g')           # short color code (rgbcmyk)
plt.plot(x, np.sin(x - 2), color='0.75')        # Grayscale between 0 and 1
plt.plot(x, np.sin(x - 3), color='#FFDD44')     # Hex code (RRGGBB from 00 to FF)
plt.plot(x, np.sin(x - 4), color=(1.0,0.2,0.3)) # RGB tuple, values 0 to 1
plt.plot(x, np.sin(x - 5), color='chartreuse'); # all HTML color names supported

If no color is chosen for the graph, Matplotlib will cycle through a set of default colors for multiple lines.
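
To see which default colors Matplotlib cycles through, you can read them from its settings. The sketch below assumes the default style is active; the variable names are only illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

# The default color cycle is stored in Matplotlib's rcParams.
default_colors = plt.rcParams["axes.prop_cycle"].by_key()["color"]

x = np.linspace(0, 10, 100)
fig, ax = plt.subplots()

# Plot three lines without choosing a color; ax.plot returns a list of lines.
lines = [ax.plot(x, np.sin(x - shift))[0] for shift in range(3)]

# Each line should pick up the next color in the cycle.
for line, cycle_color in zip(lines, default_colors):
    print(line.get_color(), cycle_color)
```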

Similarly, the line style can be adjusted using the linestyle keyword:

In [None]:
plt.plot(x, x + 0, linestyle='solid')
plt.plot(x, x + 1, linestyle='dashed')
plt.plot(x, x + 2, linestyle='dashdot')
plt.plot(x, x + 3, linestyle='dotted');

# For short, you can use the following codes:
plt.plot(x, x + 4, linestyle='-')  # solid
plt.plot(x, x + 5, linestyle='--') # dashed
plt.plot(x, x + 6, linestyle='-.') # dashdot
plt.plot(x, x + 7, linestyle=':');  # dotted
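
The short linestyle codes can also be combined with the single-letter color codes into one format string passed as the third positional argument. A minimal sketch (the variable names are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
fig, ax = plt.subplots()

# Format string = linestyle code followed by a color code.
(solid_green,) = ax.plot(x, x + 0, "-g")     # solid green
(dashed_cyan,) = ax.plot(x, x + 1, "--c")    # dashed cyan
(dashdot_black,) = ax.plot(x, x + 2, "-.k")  # dashdot black
(dotted_red,) = ax.plot(x, x + 3, ":r")      # dotted red

print(dashed_cyan.get_linestyle(), dashed_cyan.get_color())
```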

Matplotlib is great at choosing default axes limits for your plot, but sometimes it's nice to have finer control. The most basic way to adjust axis limits is to use the `plt.xlim()` and `plt.ylim()` functions:

In [None]:
plt.plot(x, np.sin(x))

plt.xlim(-1, 11)
plt.ylim(-1.5, 1.5);

To display either axis in reverse, pass the limits in reverse order:

In [None]:
plt.plot(x, np.sin(x))

plt.xlim(10, 0)
plt.ylim(1.2, -1.2);
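
The same reversal can be done on an axes object. In this sketch, `ax.set_xlim()` takes reversed limits, while `ax.invert_yaxis()` flips the y-axis without naming any limits; using the object-oriented calls here is our choice, not something the cell above requires.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 1000)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x))

ax.set_xlim(10, 0)   # reversed limits, like plt.xlim(10, 0)
ax.invert_yaxis()    # flip the y-axis, keeping its automatic limits

# Both axes now report that they are inverted.
print(ax.xaxis_inverted(), ax.yaxis_inverted())
```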

Titles and axis labels are the simplest labels to add to a plot, and there are functions that can be used to quickly set them:

In [None]:
plt.plot(x, np.sin(x))
plt.title("A Sine Curve")
plt.xlabel("x")
plt.ylabel("sin(x)");
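
When working with an axes object instead of the `plt` functions, the names change slightly: `plt.title()` becomes `ax.set_title()`, `plt.xlabel()` becomes `ax.set_xlabel()`, and so on. As a sketch, the `ax.set()` method can set several of these properties in one call:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 1000)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x))

# Set limits, labels, and title together on the axes object.
ax.set(xlim=(0, 10), ylim=(-2, 2),
       xlabel="x", ylabel="sin(x)",
       title="A Sine Curve")
```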

When multiple lines are being shown within a single axes, it is useful to create a plot legend that labels each line. Matplotlib has a built-in way of creating a legend. It is done via the plt.legend() method. Though there are several valid ways of using this, I find it easiest to specify the label of each line using the label keyword of the plot function:

In [None]:
plt.plot(x, np.sin(x), '-g', label='sin(x)')
plt.plot(x, np.cos(x), ':b', label='cos(x)')
plt.axis('equal')

plt.legend();

# Lesson 3: Bar Graphs
A bar graph presents categorical data with rectangular bars whose heights or lengths are proportional to the values that they represent. Bar graphs can be plotted vertically or horizontally.

Matplotlib provides the `bar()` function in both its MATLAB-style interface (`plt.bar()`) and its object-oriented API (`ax.bar()`). When called on an axes object, `bar()` has the following signature:

`ax.bar(x, height, width, bottom, align)`

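The parameters in this signature are the following:

x: sequence of scalars representing the x coordinates of the bars.

height: scalar or sequence of scalars representing the heights of the bars.

width: scalar or sequence of scalars representing the widths of the bars; optional, default 0.8.

bottom: scalar or sequence of scalars representing the y coordinates of the bases of the bars; optional, default 0.

align: 'center' or 'edge', controlling whether each bar is centered on its x coordinate or has its left edge placed there; optional, default 'center'.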
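
To see how these parameters map onto the drawn rectangles, here is a small sketch with made-up values. It passes every parameter explicitly and then reads the geometry back from the first bar:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

# Made-up values, chosen only to make each parameter easy to spot.
bars = ax.bar(
    x=[0, 1, 2],        # x coordinates of the bars
    height=[3, 5, 2],   # bar heights
    width=0.5,          # every bar is 0.5 units wide
    bottom=1,           # every bar starts at y = 1 instead of y = 0
    align="edge",       # the left edge of each bar sits on its x coordinate
)

first = bars[0]
print(first.get_x(), first.get_y(), first.get_width(), first.get_height())
```

With `align="center"` (the default), the first bar would instead start at `x - width / 2`.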

The following is a simple example of a Matplotlib bar plot. It shows the number of students enrolled in various courses offered at an institute.

In [None]:
#import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
langs = ['C', 'C++', 'Java', 'Python', 'PHP']
students = [23,17,35,29,12]
ax.bar(langs,students)
plt.show()
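
Bar graphs can also be drawn horizontally. As a sketch, the same enrollment data can be passed to `ax.barh()`, where the value sets the length of each bar instead of its height (the axis label text is our own addition):

```python
import matplotlib.pyplot as plt

langs = ['C', 'C++', 'Java', 'Python', 'PHP']
students = [23, 17, 35, 29, 12]

fig, ax = plt.subplots()

# barh draws one horizontal bar per category; the value is the bar length.
bars = ax.barh(langs, students)
ax.set_xlabel("Number of students")
```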

Several bar series can be displayed side by side by adjusting the thickness and position of each bar. The data variable contains three series of four values. The following script draws three series of four bars each. The bars have a thickness of 0.25 units, and each series is shifted 0.25 units from the previous one. The data object is a nested list containing the number of students passed in three branches of an engineering college over the last four years.

In [None]:
data = [[30, 25, 50, 20],
        [40, 23, 51, 17],
        [35, 22, 45, 19]]
X = np.arange(4)
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
ax.bar(X + 0.00, data[0], color = 'b', width = 0.25)
ax.bar(X + 0.25, data[1], color = 'g', width = 0.25)
ax.bar(X + 0.50, data[2], color = 'r', width = 0.25)
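
A grouped bar chart is easier to read with tick labels and a legend. The sketch below reuses the same data; the branch names and year labels are hypothetical, since the tutorial does not name them.

```python
import numpy as np
import matplotlib.pyplot as plt

data = [[30, 25, 50, 20],
        [40, 23, 51, 17],
        [35, 22, 45, 19]]
branches = ["Branch A", "Branch B", "Branch C"]   # hypothetical names
years = ["Year 1", "Year 2", "Year 3", "Year 4"]  # hypothetical labels

X = np.arange(4)
width = 0.25

fig, ax = plt.subplots()
for i, (series, name) in enumerate(zip(data, branches)):
    # Shift each series by one bar width and label it for the legend.
    ax.bar(X + i * width, series, width=width, label=name)

# Put each tick under the middle bar of its group.
ax.set_xticks(X + width)
ax.set_xticklabels(years)
ax.legend()
```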

The optional `bottom` parameter of the `bar()` function allows you to specify a starting value for a bar. Instead of running from zero to a value, it will go from the bottom to the value. In the example below, the first call to `ax.bar()` plots the red bars. The second call to `ax.bar()` plots the blue bars, with the bottom of the blue bars being at the top of the red bars.

In [None]:
N = 5
menMeans = (20, 35, 30, 35, 27)
womenMeans = (25, 32, 34, 20, 25)
ind = np.arange(N) # the x locations for the groups
width = 0.35
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
ax.bar(ind, menMeans, width, color='r')
ax.bar(ind, womenMeans, width,bottom=menMeans, color='b')
ax.set_ylabel('Scores')
ax.set_title('Scores by group and gender')
ax.set_xticks(ind, ('G1', 'G2', 'G3', 'G4', 'G5'))
ax.set_yticks(np.arange(0, 81, 10))
ax.legend(labels=['Men', 'Women'])
plt.show()

# Lesson 4: Histograms
A histogram is a chart that showcases rectangles whose area is proportional to the frequency of a variable and whose width is equal to the class interval.

The following example shows how a basic histogram is built:

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8-white')  # this style was named 'seaborn-white' before Matplotlib 3.6

data = np.random.randn(1000)

In [None]:
plt.hist(data);

The `hist()` function has many options to tune both the calculation and the display; here's an example of a more customized histogram:

In [None]:
# density=True replaces the 'normed' argument, which was removed from Matplotlib
plt.hist(data, bins=30, density=True, alpha=0.5,
         histtype='stepfilled', color='steelblue',
         edgecolor='none');
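
Normalizing a histogram rescales the bar heights so that the total area of the bars equals 1. The following sketch checks this numerically, assuming a Matplotlib version that provides the `density` keyword of `hist()`:

```python
import numpy as np
import matplotlib.pyplot as plt

data = np.random.randn(1000)

fig, ax = plt.subplots()

# hist returns the bar heights, the bin edges, and the drawn patches.
heights, bin_edges, patches = ax.hist(data, bins=30, density=True)

# Area of each bar = height * bin width; the areas should sum to 1.
total_area = np.sum(heights * np.diff(bin_edges))
print(total_area)
```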

The plt.hist docstring has more information on other customization options that are available to the user. The combination of histtype='stepfilled' along with some transparency alpha is useful when comparing histograms of several distributions:

In [None]:
x1 = np.random.normal(0, 0.8, 1000)
x2 = np.random.normal(-2, 1, 1000)
x3 = np.random.normal(3, 2, 1000)

kwargs = dict(histtype='stepfilled', alpha=0.3, density=True, bins=40)

plt.hist(x1, **kwargs)
plt.hist(x2, **kwargs)
plt.hist(x3, **kwargs);

If you only want to compute the histogram (that is, count the number of points in each bin) without displaying it, use the `np.histogram()` function:

In [None]:
counts, bin_edges = np.histogram(data, bins=5)
print(counts)
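
The two arrays returned by `np.histogram()` are related in a simple way: there is always one more bin edge than there are counts, and the counts add up to the number of data points. A short sketch with freshly generated data (the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, for repeatability
data = rng.standard_normal(1000)

counts, bin_edges = np.histogram(data, bins=5)

# 5 bins need 6 edges; every data point lands in exactly one bin.
print(len(counts), len(bin_edges))
print(counts.sum())
```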

Just as we create one-dimensional histograms by dividing the number line into bins, we can also create two-dimensional histograms by dividing points among two-dimensional bins. We'll start by defining some data: `x` and `y` arrays drawn from a multivariate Gaussian distribution:



In [None]:
mean = [0, 0]
cov = [[1, 1], [1, 2]]
x, y = np.random.multivariate_normal(mean, cov, 10000).T

One straightforward way to plot a two-dimensional histogram is to use Matplotlib's `plt.hist2d` function:

In [None]:
plt.hist2d(x, y, bins=30, cmap='Blues')
cb = plt.colorbar()
cb.set_label('counts in bin')
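
Just as `np.histogram()` computes a one-dimensional histogram without plotting it, `np.histogram2d()` does the same in two dimensions. A minimal sketch, regenerating similar data with an arbitrary seed:

```python
import numpy as np

mean = [0, 0]
cov = [[1, 1], [1, 2]]
rng = np.random.default_rng(0)   # arbitrary seed, for repeatability
x, y = rng.multivariate_normal(mean, cov, 10000).T

# counts[i, j] is the number of points in x-bin i and y-bin j.
counts, xedges, yedges = np.histogram2d(x, y, bins=30)

print(counts.shape)
print(counts.sum())
```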

The two-dimensional histogram creates a tessellation of squares across the axes. Another natural shape for such a tessellation is the regular hexagon. For this purpose, Matplotlib provides the `plt.hexbin` routine, which represents a two-dimensional dataset binned within a grid of hexagons:

In [None]:
plt.hexbin(x, y, gridsize=30, cmap='Blues')
cb = plt.colorbar(label='count in bin')

Another method for evaluating densities in multiple dimensions is kernel density estimation (KDE). KDE "smears out" the points in space and adds up the result to obtain a smooth function. One quick and simple KDE implementation exists in the `scipy.stats` package. Here is a quick example of using the KDE on this data:

In [None]:
from scipy.stats import gaussian_kde

# fit an array of size [Ndim, Nsamples]
data = np.vstack([x, y])
kde = gaussian_kde(data)

# evaluate on a regular grid
xgrid = np.linspace(-3.5, 3.5, 40)
ygrid = np.linspace(-6, 6, 40)
Xgrid, Ygrid = np.meshgrid(xgrid, ygrid)
Z = kde.evaluate(np.vstack([Xgrid.ravel(), Ygrid.ravel()]))

# Plot the result as an image
plt.imshow(Z.reshape(Xgrid.shape),
           origin='lower', aspect='auto',
           extent=[-3.5, 3.5, -6, 6],
           cmap='Blues')
cb = plt.colorbar()
cb.set_label("density")