# Project 2 : Visualization

## Instructions

### Description

In this project, you will look at three bad visualizations, then:

1. Identify what makes them bad
1. Use the same data to make a better chart
1. Explain an interesting pattern you noticed.

Some helpful questions to determine if a visualization is bad:

1. What is the visualization trying to show? Ex. Comparison? Relationship? Change over time?
2. Is this the right visualization to use?
3. Does the visualization have the correct labels and axes limits?
4. Is there too much being shown in one visualization? Should it be split?

Some helpful questions to find patterns in a visualization:

1. How do different data points compare? Are there significant differences? Are there any outliers?
2. If comparing data/series, how do they rank? Is there a significant difference between rankings?
3. If looking at data over time, is there any seasonality? How do the values compare to the mean and/or median? How do the values change over time? Ex. Ups and downs? Always up? Always down?

### Getting Started

The lecture on data visualization (available in the usual places) has a lot of code examples.  Also don't forget the matplotlib documentation available from the Help menu in the notebook.

Also, this is the first assignment we've given where we ask you to provide text answers and not just code. You don't have to get fancy, but you'll want to use Markdown to write up your answers.  There is Markdown help available from the Help menu as well.

### Grading

For grading purposes, we will clear all outputs from all your cells and then run them all from the top.  Please test your notebook in the same fashion before turning it in.

### Submitting Your Solution

To submit your notebook, first clear all the cells (this won't matter too much this time, but for larger data sets in the future, it will make the file smaller).  Then use the File->Download As->Notebook to obtain the notebook file.  Finally, submit the notebook file on Canvas.

### Credits

Many thanks to Saad Elbeleidy for this assignment!

### Setup Code

In [None]:
## Imports
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import numpy as np
import pandas as pd
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

### Problem 1: Bad line chart (15 points)

To get you started, we'll walk through 1 bad visualization.

![Bad Line Chart](http://imgur.com/kB6uNZC.png)

In [None]:
# Bad line chart data & names
badLineNames = ["2016 Q1", "2016 Q2", "2016 Q3", "2016 Q4"]
badLineProduct1 = [240, 300, 280, 400]
badLineProduct2 = [300, 320, 150, 160]
badLineProduct3 = [120, 140, 180, 160]
badLineProduct4 = [380, 400, 450, 500]

**What makes this visualization bad?**

**1. What is the visualization trying to show? Ex. Comparison? Relationship? Change over time?**

This visualization tries to show data over time.

**2. Is this the right visualization to use?**

Yes, we should be using a line chart to show data over time.

**3. Does the visualization have the correct labels and axes limits?**

There are no axis labels and no title. The chart could also use more space between the minimum and maximum data points and the axis limits.

**4. Is there too much being shown in one visualization? Should it be split?**

Yes, it's quite difficult to follow each series, so it should be split.


Since the chart type is the correct one, it seems all we need to do is add labels and split the lines into panels. Before we do that, we can probably also improve the design. We covered how to improve a `matplotlib` plot in class using different styles. Select a `style` and apply it below.

In [None]:
## Apply your chosen style 
plt.style.use('bmh')


An example of how to set up subplots can be found in lecture 05, slide 13.

Now we need to plot the data over different panels. We can use [`plt.subplots`](http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.subplots) to create multiple panels. Since we have 4 products, we're going to need 4 panels on top of each other. `subplots` returns two variables, the figure object and an array of axes. What we can do is loop through each Axes object and create a plot for that product. The first Axes object should plot the first product, the second should plot the second product and so on.
<br><br>
**Step 1: Buffers** 
<br>Each subplot will have its own y axis, but to make sure the scale is the same for all subplots, create a buffer between the lowest value among all products and the start of the y axis as well as between the highest value and the top of the y axis. These buffers are simply integers that extend the y axis above the highest value in the products, and below the lowest.
<br><br>
**Step 2: Mean**
<br>Calculate the mean of the entire data set -- the mean of the individual product means. To calculate this easily, first put the products in a list, create a list of the individual means using a comprehension over the product list, then find the mean of the list of individual means.
<br><br>
**Step 3: Colors**
<br>Choose a color for the plot of each product. This can be done by filling a list with each color's matplotlib name. Available colors can be found [here](https://matplotlib.org/2.0.2/api/colors_api.html)
<br><br>
**Step 4: Subplots** 
<br>Now that those numbers are calculated, create 4 line charts on top of each other, each plotting one of the products.
<br>For each subplot:
<br>
1) plot the product with `plot(data, color)`
<br>
2) Set the y scale using `set_ylim(bottomBuffer, topBuffer)`
<br>
3) Add an x label if this panel is not the bottom one using `set_xticklabels(list of labels)`
<br>
4) Add a title to the subplot with the product number using `set_title(title)`
<br>
5) Add a dashed line with the value of the mean using `plot(mean, args)` or `axhline(mean, args)`. More info [here](https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.axhline.html)
<br><br>
**Step 5: Beautify**
<br>
1) Add a title to the whole diagram (the superplot) with `fig.suptitle(title)`
<br>
2) Rearrange subplots, if necessary, with `fig.subplots_adjust(left, bottom, right, top, wspace, hspace)`
<br><br>
**Notes:**
<br>
- Many of the arguments to these functions have default values, meaning they're optional. If there's an argument you don't need or want, try leaving it out. For example, if you only want to change the hspace of the subplots, you can call `fig.subplots_adjust(hspace=0.6)` without specifying the other arguments. Just be sure to name the argument you want to specify. Don't just write `fig.subplots_adjust(0.6)`.
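
The note about naming arguments can be tried out with a minimal sketch. It assumes a throwaway figure with 4 stacked panels (the figure size and the value 0.6 are illustrative, not part of the assignment) and reads the spacing back from `fig.subplotpars` to confirm that only `hspace` changed.

```python
import matplotlib.pyplot as plt

# Throwaway figure with 4 stacked panels (the size is illustrative)
fig, axes = plt.subplots(nrows=4, ncols=1, figsize=(6, 8))

# Remember the current bottom margin so we can confirm it is left untouched
default_bottom = fig.subplotpars.bottom

# Change only the vertical spacing; every other argument keeps its current value
fig.subplots_adjust(hspace=0.6)

hspace_after = fig.subplotpars.hspace
bottom_after = fig.subplotpars.bottom
print("hspace:", hspace_after)
print("bottom unchanged:", bottom_after == default_bottom)
```

Since `left` is the first parameter of `subplots_adjust`, an unnamed `0.6` would be read as the left margin rather than as `hspace`, which is why naming the argument matters.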

In [None]:
# Step 1
minimumBuffer = min(badLineProduct1 + badLineProduct2 + badLineProduct3 + badLineProduct4) - 100
maximumBuffer = max(badLineProduct1 + badLineProduct2 + badLineProduct3 + badLineProduct4) + 100
# print("minbuffer = ", minimumBuffer, "  maxBuffer = ", maximumBuffer)
# Step 2
products = [badLineProduct1, badLineProduct2, badLineProduct3, badLineProduct4 ]
meanTotal = np.mean(products)
means = [np.mean(badLineProduct1), np.mean(badLineProduct2), np.mean(badLineProduct3), np.mean(badLineProduct4)]
print("The total mean for all the data is ", meanTotal)
for index,i in enumerate(products):
    print("The mean of Product {}'s y values is {}.".format(index+1, np.mean(i)))
# Step 3
colors = ['b', 'g', 'm', 'k']
fig, pltAxes = plt.subplots(ncols=1, nrows=4, figsize=(10,10))
#print(pltAxes)
axes = pltAxes.ravel()
# print(axes)

# Step 4
for i, graph in enumerate(axes):
    graph.plot(badLineNames, products[i], color = colors[i])
    graph.set_ylim(minimumBuffer, maximumBuffer)
    graph.set_xticklabels(badLineNames)
    graph.set_title("Product {}".format(i+1), fontsize=8)   # Add the title for the dataset
    graph.axhline(y = means[i] , color = 'r', linestyle = '--')   # Dashed line at this product's mean
# Step 5
fig.suptitle("2016 Product Sales")
fig.subplots_adjust(hspace=0.2)
#plt.legend([products[0], products[1], products[2], products[3], means[0]], ['Product 1 Sales', 'Product 2 Sales', 'Product 3 Sales', 'Product 4 Sales', 'Mean of Sales'])
plt.show()
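
Step 2 describes the overall mean as the mean of the individual product means. A short sketch of why that works here: every product has the same number of quarters, so the two calculations agree. The unequal-length series at the end is a made-up example, not assignment data, included only to show where the shortcut would stop working.

```python
import numpy as np

# Same quarterly values as the bad line chart data
products = [
    [240, 300, 280, 400],
    [300, 320, 150, 160],
    [120, 140, 180, 160],
    [380, 400, 450, 500],
]

product_means = [np.mean(p) for p in products]      # one mean per product
mean_of_means = np.mean(product_means)              # Step 2's approach
overall_mean = np.mean(np.concatenate(products))    # mean of all 16 values

print("Per-product means:", product_means)
print("Mean of means:", mean_of_means, "| overall mean:", overall_mean)

# Hypothetical series of unequal length: the two calculations no longer agree
uneven = [[10, 20], [30, 40, 50, 60]]
uneven_mean_of_means = np.mean([np.mean(s) for s in uneven])
uneven_overall_mean = np.mean(np.concatenate(uneven))
print("Uneven example:", uneven_mean_of_means, "vs", uneven_overall_mean)
```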

Now that you've created a better plot, try to describe a pattern in the dataset. Use the following questions as a reference:

1. How do different data points compare? Are there significant differences? Are there any outliers?
2. If comparing data/series, how do they rank? Is there a significant difference between rankings?
3. If looking at data over time, is there any seasonality? How do the values compare to the mean and/or median? How do the values change over time? Ex. Ups and downs? Always up? Always down?

    1. The data is fairly linear with few outliers. Whether any point counts as an outlier is debatable, but Product 2 seems to have the worst ones. Product 1 is the closest to the mean of all the products. Otherwise, Product 4 had the highest sales, Product 3 had the lowest, and Products 1 and 2 seem to be in the middle. For all the products except Product 2, the general trend for the long run looks like it's increasing.

    2. Product 4 had the highest sales, Product 3 had the lowest, and Products 1 and 2 seem to be in the middle. Product 1 seems to have sold more than Product 2 because Product 2's sales decreased toward the end of the year.

    3. The general trend for the data is increasing (besides Product 2). This suggests that sales improve as the year progresses. At the individual level, every product saw sales increase from the first quarter to the second. From the second quarter to the third, Products 1 and 2 decreased while Products 3 and 4 increased. From the third quarter to the fourth, sales increased for all except Product 3. Products 1 and 2 seem to deviate more from their means than Products 3 and 4.
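
The observations above can be cross-checked numerically. This is a minimal sketch using the same product values; the variable names are illustrative, and the standard deviation is used as one possible measure of how far a product strays from its own mean.

```python
import numpy as np

sales = np.array([
    [240, 300, 280, 400],   # Product 1
    [300, 320, 150, 160],   # Product 2
    [120, 140, 180, 160],   # Product 3
    [380, 400, 450, 500],   # Product 4
])

totals = sales.sum(axis=1)          # yearly total per product
spread = sales.std(axis=1)          # deviation of each product from its own mean
changes = np.diff(sales, axis=1)    # quarter-over-quarter change per product

# Rank products from highest to lowest yearly total (1-based product numbers)
ranking = [int(i) + 1 for i in np.argsort(totals)[::-1]]

print("Ranking by total sales:", ranking)
for i, row in enumerate(changes):
    print(f"Product {i + 1}: changes {row.tolist()}, std {spread[i]:.1f}")
```

A negative entry in `changes` marks a quarter where that product's sales fell, which makes the "ups and downs" question easy to answer at a glance.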
       

Next, look through the following bad visualizations and apply the above workflow to:

1. Determine what makes them bad
1. Create a better visualization
1. Describe a pattern in the data

### Problem 2: Bad pie chart (20 points)

Explain why this visualization is a bad one:

![Bad Pie Chart](http://imgur.com/Wg9DOZd.png)

This is a bad graph because it has no percentages or title, which makes it impossible to tell what the graph is trying to display. Without that knowledge, it is difficult to know whether a pie chart is a good fit for this type of data. Pie charts in general also make it hard to compare values.
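
One further check, offered as a possible reason rather than something stated in the assignment: a pie chart implies that the slices are parts of one whole. A quick sketch with the same values as the data cell for this problem suggests they are not, since they add up to more than 1.

```python
# Values copied from the pie chart data
badPieNames = ["Golden", "Boulder", "Denver", "Colo Springs"]
badPieValues = [0.37, 0.4, 0.5, 0.35]

total = sum(badPieValues)
print(f"Sum of values: {total:.2f}")

# When the values sum to more than 1, a pie chart draws each slice as a
# fraction of the total, so no slice matches its original value
shares = [v / total for v in badPieValues]
for name, value, share in zip(badPieNames, badPieValues, shares):
    print(f"{name}: value {value:.2f}, drawn as {share:.1%} of the pie")
```

A bar chart avoids this problem because each bar is drawn at its own value on a shared axis.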

In [None]:
badPieNames = ["Golden", "Boulder", "Denver", "Colo Springs"]
badPieValues = [0.37, 0.4, 0.5, 0.35]

In [None]:
# Plot a better chart using a bar chart
x = np.arange(len(badPieNames))

plt.bar(badPieNames, badPieValues, align='center', alpha=0.5)
plt.ylim(0, 0.6)
plt.xticks(x, badPieNames)
plt.ylabel('Value')
plt.xlabel('City')
plt.title('City VS Value')
plt.show()




Tell a story or describe a pattern using your new visualization.

It seems like the capital has a higher percentage and the surrounding cities in the metro area have roughly the same percentage. This could be a reference to location or population density, but without more information, that isn't known.

### Problem 3: Bad bar chart 1 (20 points)

Explain why this visualization is a bad one:

![Bad Bar Chart](http://imgur.com/AkLyM9I.png)

This is a bad bar chart because the axes aren't labelled and there is no title on the graph. This makes the data useless since we don't know what we are comparing. The starting point of the y axis is also misleading.
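
To see why the starting point of the y axis matters, compare bar heights measured from zero with bar heights measured from a truncated baseline. The baseline of 230 below is an assumption for illustration; the actual lower limit of the original chart is not given.

```python
# Values copied from the bar chart data
badBarNames = ["A", "B", "C"]
badBarValues = [240, 232, 251]

assumed_baseline = 230  # hypothetical axis start, not read from the original chart

# The visible bar height is the value minus wherever the axis starts
heights_from_zero = [v - 0 for v in badBarValues]
heights_truncated = [v - assumed_baseline for v in badBarValues]

# Compare the tallest bar (C) with the shortest bar (B) under each axis
true_ratio = max(heights_from_zero) / min(heights_from_zero)
apparent_ratio = max(heights_truncated) / min(heights_truncated)

print(f"C vs B with the axis starting at 0: {true_ratio:.2f}x")
print(f"C vs B with the axis starting at {assumed_baseline}: {apparent_ratio:.2f}x")
```

Under this assumed baseline, a difference of a few percent looks like a difference of several times, which is the kind of distortion that starting the axis at zero removes.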

In [None]:
badBarNames = ["A", "B", "C"]
badBarValues = [240, 232, 251]

In [None]:
#badBarNames.clear()
#badBarNames = ['Python', 'C++', 'Java']
x = np.arange(len(badBarNames))

 
plt.bar(badBarNames, badBarValues, align='center', alpha=0.5)
plt.ylim(0, 350)
plt.xticks(x, badBarNames)
plt.ylabel('Value')
plt.xlabel('Letter')
plt.title('Letter VS Value')
 
plt.show()


Tell a story or describe a pattern using your new visualization.

The data is relatively similar. The range of the data is only 19 (from 232 to 251). The original graph made the values seem far apart, but when you look at the data as a whole in the new graph, the values are actually close together.

### Problem 4: Bad bar chart 2 (20 points)

Explain why this visualization is a bad one:

![Bad Bar Chart](http://imgur.com/Ns3lgyp.png)

The graph is bad because the axes are not labeled and there is no title, which makes the data useless since you don't know what you are comparing.

In [None]:
badBar2Names = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
badBar2Values = [240, 320, 360, 280, 290, 300, 500, 410, 390, 200, 220, 240]

In [None]:
x = np.arange(len(badBar2Names))

plt.plot(badBar2Names, badBar2Values)
plt.ylim(0, 600)
plt.xticks(x, badBar2Names)
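plt.xlabel('Month')   # Label the axes so the reader knows what is being compared
plt.ylabel('Value')
plt.title("Months VS Value", fontsize=11)   # Add the title for the dataset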
 
plt.show()

Tell a story or describe a pattern using your new visualization.

The data seems to be seasonal. Sales increase as spring and fall come around, but as soon as those months (April and October) start, the values drop and then start increasing again until the next 'seasonal' change.
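
A small sketch for checking the seasonal description against the numbers. It uses the same monthly values and `np.diff` to list month-over-month changes; the variable names other than the two data lists are illustrative.

```python
import numpy as np

badBar2Names = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
                "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
badBar2Values = [240, 320, 360, 280, 290, 300, 500, 410, 390, 200, 220, 240]

values = np.array(badBar2Values)
changes = np.diff(values)   # change from each month to the next

peak_month = badBar2Names[int(values.argmax())]
# changes[i] is the move from month i to month i + 1, so shift the index by one
largest_drop_month = badBar2Names[int(changes.argmin()) + 1]
largest_rise_month = badBar2Names[int(changes.argmax()) + 1]

for name, change in zip(badBar2Names[1:], changes):
    print(f"{name}: {int(change):+d}")
print("Peak month:", peak_month)
print("Largest drop going into:", largest_drop_month)
print("Largest rise going into:", largest_rise_month)
```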

### Questionnaire
1) How long did you spend on this assignment?
<br><br>
This assignment took me 2.5 hours.

2) What did you like about it? What did you not like about it?
<br><br>
I liked that there were a bunch of different types of graphs, and now I have a quick reference if I ever want to know how to plot a certain type of graph. The whole "Tell a story" thing is confusing as to what you are looking for, though. For example, are we supposed to create a story and names for the values, or are we just generalizing to all data with only labels and values?

3) Did you find any errors or is there anything you would like changed?
<br><br>
There isn't anything I would like to change except for maybe some better clarification on what "Tell a story" means for the description of the new visualization.