# INFO 98: Data Science Skills, Spring 2019
## Lecture 05: Data Visualization

---

## Table of Contents
* [Setup](#setup)
* [Demo](#demo)
    * Histogram
    * Bar Plot
    * Box Plot
    * Line Graph
    * Scatter Plot
    * Violin Plot
    * Contour Plot
    * Heat Map
* [Customization](#customization)

<a id='setup'></a>
# Setup
____

In [None]:
# Comment out !pip install statements if you have those packages installed.

!pip install numpy
!pip install pandas
!pip install matplotlib
!pip install seaborn

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re

<a id='demo'></a>
# Demo
___

**NOTE**: Download the .csv file from https://drive.google.com/file/d/18uiFsYakr2-GytJiWXMlsz3QxmGFA2CB/view?usp=sharing and place it in the same folder as this IPython notebook.

In [None]:
dataset = pd.read_csv('gun-violence-data.csv')

In [None]:
dataset.columns

In [None]:
dataset

In [None]:
dataset = dataset.dropna()

In [None]:
# Drop irrelevant columns
dataset = dataset.drop(columns = ['incident_url', 'source_url', 'sources'])

In [None]:
dataset

### Histogram

A bar plot with 50 bars, one per state, would be hard to read. Instead, let's create a plot that visualizes this data effectively.

In [None]:
# Total number of people killed in each state
state_grouping = dataset.groupby('state').agg({'n_killed': 'sum'})

#state_grouping.head()
states = dataset['state'].unique()
killed = state_grouping['n_killed']
plt.hist(killed)

#seaborn implementation of the same histogram (sns.histplot in seaborn 0.11 and later)
#sns.distplot(killed)

plt.ylabel('Number of states (frequency)')
plt.xlabel('Number killed in a state from 2013-2018')
plt.title('A histogram of the state-level distribution of number of people killed by guns')

What does this histogram show? How does it transform the data compared with a per-state bar plot?

### Bar Plot

Let's say that we wanted to plot the number of gun deaths per year in California. What plot do you think will best accomplish this?

In [None]:
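# .copy() gives us an independent DataFrame, so that adding or changing columns later
# does not trigger pandas' SettingWithCopyWarning.
california = dataset[dataset['state'] == 'California'].copy()
california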

In [None]:
california['date'] = pd.to_datetime(california['date'])
# pd.to_datetime converts the date strings into pandas datetime objects. This lets you take dates that appear in your data
# and turn them into parsable objects that you can mine for the year, month, day, and more.
# Documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html
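
As a minimal sketch of what `pd.to_datetime` gives you, the cell below parses a few made-up date strings (illustrative values, not rows from the dataset) and reads the year and month of every element through the `.dt` accessor:

```python
import pandas as pd

# Hypothetical date strings, made up for illustration (not taken from the dataset).
raw_dates = pd.Series(['2013-01-01', '2015-07-04', '2018-03-31'])
parsed = pd.to_datetime(raw_dates)

# The .dt accessor exposes the date components of every element at once.
years = parsed.dt.year.tolist()
months = parsed.dt.month.tolist()
print(years, months)
```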

In [None]:
# Each datetime object has year, month, and day attributes.
california['date'].iloc[0].year

In [None]:
# Collect the year of every California incident in a list. Print ca_split_by_year to inspect it!
ca_split_by_year = california['date'].dt.year.tolist()

In [None]:
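# Splitting/aggregating the data to get the number of gun deaths in each year.
# We add up n_killed per year; counting rows instead would give the number of incidents, not deaths.
deaths_agg_by_year = {}
for year, killed in zip(ca_split_by_year, california['n_killed']):
    deaths_agg_by_year[year] = deaths_agg_by_year.get(year, 0) + killed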

In [None]:
deaths_agg_by_year
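
pandas can also do this kind of per-year aggregation without an explicit loop. The cell below is a minimal sketch on a small made-up DataFrame (the `toy` frame and its values are assumptions for illustration, not rows from the dataset). It groups by the year of the `date` column and sums `n_killed` within each year:

```python
import pandas as pd

# Toy data, made up for illustration: three incidents across two years.
toy = pd.DataFrame({
    'date': pd.to_datetime(['2014-02-01', '2014-09-15', '2015-05-20']),
    'n_killed': [1, 2, 4],
})

# Group rows by the year of their date, then sum n_killed within each year.
killed_per_year = toy.groupby(toy['date'].dt.year)['n_killed'].sum()
totals = killed_per_year.to_dict()
print(totals)
```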

In [None]:
# Convert the dictionary views to lists so that matplotlib receives plain sequences.
plt.bar(list(deaths_agg_by_year.keys()), list(deaths_agg_by_year.values()))
plt.ylabel('Number of gun deaths in that year')
plt.xlabel('Year')
plt.title('Number of gun deaths by year in California')

### Box Plot

But what if we wanted to know something more interesting about the data, like the gender distribution of the participants in gun crime?

To do that, we need to manipulate the data some more, and then think of a nice visualization to use. Any ideas?

In [None]:
# We clean and split the data from the participant_gender column using a regular expression.
gender = [re.findall(r"Male|Female", s) for s in dataset['participant_gender']]
gender

In [None]:
# Number of males in each gun crime
num_male = [sum(1 for x in i if x == 'Male') for i in gender]

In [None]:
# Number of females in each gun crime
num_female = [sum(1 for x in i if x == 'Female') for i in gender]
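
To see what the regular expression above extracts, here is a minimal sketch on a single made-up string. The `0::Male||1::Male||2::Female` layout is an assumption for illustration; check a few real values of `participant_gender` to confirm the format.

```python
import re

# Hypothetical participant_gender value, made up for illustration.
sample = '0::Male||1::Male||2::Female'

# findall returns every non-overlapping match, in order of appearance.
found = re.findall(r"Male|Female", sample)
n_male = found.count('Male')
n_female = found.count('Female')
print(found, n_male, n_female)
```

The pattern is case-sensitive, so the lowercase "male" inside "Female" is not counted as a separate match.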

In [None]:
plt.figure(figsize = (3, 8))
plt.boxplot(num_male)
#sns.boxplot(data = num_male, notch = True)
plt.ylabel('Number of males involved')
plt.xlabel('Male')
plt.title('A boxplot of the number of males involved in gun crimes in the dataset')

In [None]:
plt.figure(figsize = (3, 8))
plt.boxplot(num_female)
#sns.boxplot(data = num_female, notch = True)
plt.ylabel('Number of females involved')
plt.xlabel('Female')
plt.title('A boxplot of the number of females involved in gun crimes in the dataset')

What are some things that you can conclude with respect to gender from these boxplots?

### Line Graph

Let's say that we wanted to create a time series analysis of the data to see if there are any big spikes in crime on certain days. We suspect that some phenomena cause more deaths on certain days than others.
<br>
<br>
What are some tools that we can use to achieve this?

In [None]:
time_grouping = dataset.groupby('date').agg({'n_killed': sum, 'n_injured': sum})

In [None]:
plt.plot(time_grouping.index, time_grouping['n_killed'])

What is wrong with this plot? How do we fix this?

In [None]:
plt.figure(figsize = (30, 8))
plt.plot(time_grouping.index, time_grouping['n_killed'])
# Only label every 20th date so that the tick labels do not overlap
plt.xticks(time_grouping.index[::20], time_grouping.index[::20], rotation = 'vertical')
#sns.lineplot(x = time_grouping.index, y = time_grouping['n_killed'])
plt.xlabel('Date')
plt.ylabel('Number of people killed by guns that day')
plt.title('The number of people killed by guns in a day over time')

### Scatter Plot

In [None]:
# We create points where the number killed is on the x axis and the number injured is on the y axis
points = dataset[['n_killed', 'n_injured']].values.tolist()

In [None]:
points_x = np.array([p[0] for p in points])
points_y = np.array([p[1] for p in points])
plt.scatter(points_x, points_y)
#sns.scatterplot(x = points_x, y = points_y)
plt.xlabel('Number of people killed')
plt.ylabel('Number of people injured')
plt.title('Scatter plot comparing the number of people killed and injured in each gun incident')

Does this scatter plot seem too good to be true? With so many points, how is it possible that we get such a clean-looking plot?

### Violin Plot

In [None]:
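# Newer seaborn versions require x and y as keyword arguments (0.13 and later also rename scale to density_norm).
sns.violinplot(x = points_x, y = points_y, scale = "width")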
plt.xlabel('Number of people killed')
plt.ylabel('Number of people injured')
plt.title('A violin plot of the number of people killed vs the number of people injured')

Compare this violin plot against the scatter plot of the same data above.

### Contour Plot

In [None]:
plt.figure(figsize = (10, 8))
sns.kdeplot(x = points_x, y = points_y, cbar = True, cmap = "OrRd")
plt.xlim(-1, 2.5)
plt.ylim(-1, 2.5)
plt.ylabel('injured')
plt.xlabel('killed')
plt.title('Bivariate Kernel Density Estimate of the number of people killed vs number of people injured')

### Heat Map

The dataset is not "rectangularizable", i.e. its features cannot be coherently arranged as a matrix of values, so to demonstrate the properties of the heatmap and how to create one, we will first create random data.

In [None]:
# Creating a 10x12 matrix and populating it with data randomly sampled from a uniform(0,1) distribution.
# Documentation for np.random.rand: https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.rand.html
ten_by_twelve = np.random.rand(10, 12)
ax = sns.heatmap(ten_by_twelve)

As we can see, this is not a meaningful example.

In [None]:
# sns has built-in example datasets available for data analysis: https://seaborn.pydata.org/generated/seaborn.load_dataset.html
# Using one of these example datasets, we can create an example with meaning
flights = sns.load_dataset("flights")
# Keyword arguments are required by recent pandas versions
flights = flights.pivot(index = "month", columns = "year", values = "passengers")
ax = sns.heatmap(flights)

<a id='customization'></a>
# Customization
____

As you saw above, there was a decent amount of customization in the plots that we've created. 
So how do you generally think about customization?

<font color='red'> [ Your-Response-Here ] </font>

**Note:** seaborn is built on top of matplotlib, so we can apply matplotlib's customization functions to seaborn plots. So, essentially, matplotlib allows you to customize matplotlib <b>and</b> seaborn plots!
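
As a minimal sketch of this idea, the cell below draws a seaborn heat map from random data (made up for illustration) and then labels it with ordinary matplotlib calls:

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Random 5x5 matrix, used only to have something to draw.
data = np.random.rand(5, 5)

# seaborn draws the plot and returns the matplotlib axes it used...
ax = sns.heatmap(data)

# ...so the usual matplotlib functions apply to it.
plt.title('Random 5x5 heat map')
plt.xlabel('Column index')
plt.ylabel('Row index')
```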

**Setting x and y ticks manually:** 
Use plt.xticks and plt.yticks to set the locations and labels of the ticks on the x and y axes.
* xticks Documentation: https://matplotlib.org/api/_as_gen/matplotlib.pyplot.xticks.html
* yticks Documentation: https://matplotlib.org/api/_as_gen/matplotlib.pyplot.yticks.html

That's how we were able to reduce the number of dates being plotted above in the time series analysis.
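
A minimal sketch of the same technique on made-up data: the cell below plots 100 points but keeps only every 20th position as a tick. The values and the `day N` labels are assumptions for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up daily values, used only to have something to plot.
days = np.arange(100)
values = np.sin(days / 10)

fig, ax = plt.subplots()
ax.plot(days, values)

# Keep every 20th position and give each one an explicit label.
tick_positions = days[::20]
tick_labels = [f'day {d}' for d in tick_positions]
plt.xticks(tick_positions, tick_labels, rotation = 'vertical')
```

The first argument sets where the ticks go, and the second sets the text shown at each of those positions.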

**Color:**
Most plotting functions accept a color argument: 
* As a color map (for multiple plots on the same figure, you need to specify different colors)
* As an argument for a single plot (usually the c or color parameter, which you pass a string to)

Here are some common colors that you can use: https://matplotlib.org/2.0.2/api/colors_api.html
There is a large variety of color maps that you can specify: https://matplotlib.org/tutorials/colors/colormaps.html
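
Both options can be sketched side by side. The data below is made up for illustration: the left plot takes a single color name as a string, and the right plot maps each point's value to a color through the `OrRd` color map.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 1, 50)
fig, (left, right) = plt.subplots(1, 2, figsize = (10, 4))

# Single plot: pass a color name as a string.
line, = left.plot(x, x ** 2, color = 'red')

# Color map: each point is colored according to the value passed in c.
scatter = right.scatter(x, x ** 2, c = x, cmap = 'OrRd')
fig.colorbar(scatter, ax = right)
```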

**Legend:**
If you have multiple plots on the same figure, it's usually good practice to provide a legend that denotes which plot is which.

Documentation: https://matplotlib.org/api/_as_gen/matplotlib.pyplot.legend.html

This function usually takes two arguments: a list of handles and a list of labels.
 * Handles: these are the plot objects themselves
 * Labels: what do we want to label these plots as?
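
A minimal sketch of the handles-and-labels form, using two made-up curves (the sine and cosine data are assumptions for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)
fig, ax = plt.subplots()

# plot returns a list of line objects; these serve as the handles.
sine_line, = ax.plot(x, np.sin(x), color = 'blue')
cosine_line, = ax.plot(x, np.cos(x), color = 'orange')

# First argument: the handles. Second argument: the label for each handle.
legend = plt.legend([sine_line, cosine_line], ['sin(x)', 'cos(x)'])
```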

There's a lot more that you can do to customize your plots apart from what is listed here. The best way to figure out a customization scheme is to experiment on a given dataset to determine the best possible way to describe the data.