<a href="https://colab.research.google.com/github/Rossel/DataQuest_Courses/blob/master/022__Line_Charts_with_Matplotlib.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# COURSE 2/6: EXPLORATORY DATA VISUALIZATION

# MISSION 1: Line Charts

*Learn the basics of data visualization.*

## 1. Representation Of Data


Data represented as tables (CSV, Excel, pandas) can be difficult to explore for patterns. Data visualization is a discipline that focuses on the visual representation of data: it transforms data from table representations into visual ones and lets us find patterns more quickly.

In this course, named Exploratory Data Visualization, we will learn data visualization techniques to explore datasets and help us uncover patterns. In this mission, we'll use a specific type of data visualization to understand U.S. unemployment data.

## 2. Introduction To The Data

![BLS logo](https://www.nccaom.org/wp-content/uploads/2016/12/BLS-Timeline-Main.jpg)

The United States [Bureau of Labor Statistics (BLS)](https://www.bls.gov/) surveys and calculates the monthly unemployment rate. The unemployment rate is the percentage of individuals in the labor force without a job. You can read more about how the BLS calculates the unemployment rate [here](http://www.bls.gov/cps/cps_htgm.htm).

The BLS releases monthly unemployment data available for download as an Excel file, with the `.xlsx` file extension. While the pandas library can read in XLSX files, it relies on an external library for actually parsing the format. Let's instead download the same dataset as a CSV file [here](https://drive.google.com/file/d/1ccblpyB_BGKKtkAL8XbwOpJWtEBqdn8p/view?usp=sharing) or [here](https://github.com/Rossel/DataQuest_Courses/blob/master/datasets/unrate.csv). 

The dataset contains the monthly unemployment rate as a CSV from January 1948 to August 2016 and is saved as `unrate.csv`. Before we get into visual representations of data, let's first read this CSV file into pandas to explore the table representation of this data. The dataset we'll be working with is a [time series](https://en.wikipedia.org/wiki/Time_series) dataset, which means the data points (monthly unemployment rates) are ordered by time.







When we read the dataset into a DataFrame, pandas will treat the `DATE` column as text. Because of how pandas reads in strings internally, this column is given a data type of `object`. We need to convert this column to the `datetime` type using the `pandas.to_datetime()` [function](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html), which returns a Series object with the `datetime` data type that we can assign back to the DataFrame:
```
import pandas as pd
df['col'] = pd.to_datetime(df['col'])
```
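As a minimal sketch of that conversion, the snippet below builds a tiny DataFrame from an inline CSV string and checks the column's data type before and after. The inline string is only a stand-in for `unrate.csv` (it holds the first three rows of the dataset), and the names `csv_text` and `sample` are illustrative.

```python
import io
import pandas as pd

# Inline stand-in for unrate.csv (first three rows of the dataset)
csv_text = "DATE,VALUE\n1948-01-01,3.4\n1948-02-01,3.8\n1948-03-01,4.0\n"
sample = pd.read_csv(io.StringIO(csv_text))

# As read from the CSV, DATE is a text column (its exact dtype name
# depends on the pandas version)
dtype_before = sample['DATE'].dtype

# Convert the text column and assign the result back to the DataFrame
sample['DATE'] = pd.to_datetime(sample['DATE'])
dtype_after = sample['DATE'].dtype

print(dtype_before, '->', dtype_after)
```

Once the column holds `datetime` values, date parts such as `sample['DATE'].dt.month` become available.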



Let's start by importing the libraries we need and reading the dataset into pandas using Google Colab.

In [10]:
# Run code to read csv file into Colaboratory:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [11]:
# Insert file id from Google Drive shareable link:
# https://drive.google.com/file/d/1ccblpyB_BGKKtkAL8XbwOpJWtEBqdn8p/view?usp=sharing
# (named file_id so that it does not shadow Python's built-in id())
file_id = "1ccblpyB_BGKKtkAL8XbwOpJWtEBqdn8p"

In [12]:
# Download the dataset and save it locally as unrate.csv
downloaded = drive.CreateFile({'id': file_id})
downloaded.GetContentFile('unrate.csv')

In [13]:
# Import pandas library and read csv
import pandas as pd
unrate = pd.read_csv('unrate.csv')

In [14]:
# Convert the DATE column into a series of datetime values
unrate['DATE'] = pd.to_datetime(unrate['DATE'])

In [15]:
# Display the first 12 rows of the dataframe
unrate.head(12)

| | DATE | VALUE |
|---|---|---|
| 0 | 1948-01-01 | 3.4 |
| 1 | 1948-02-01 | 3.8 |
| 2 | 1948-03-01 | 4.0 |
| 3 | 1948-04-01 | 3.9 |
| 4 | 1948-05-01 | 3.5 |
| 5 | 1948-06-01 | 3.6 |
| 6 | 1948-07-01 | 3.6 |
| 7 | 1948-08-01 | 3.9 |
| 8 | 1948-09-01 | 3.8 |
| 9 | 1948-10-01 | 3.7 |


In [16]:
# Print dataframe dimensions
print(unrate.shape)

(824, 2)


## 3. Table Representation

The dataset contains 2 columns:

- `DATE`: date, always the first of the month. Here are some examples:
  - `1948-01-01`: January 1, 1948.
  - `1948-02-01`: February 1, 1948.
  - `1948-03-01`: March 1, 1948.
  - `1948-12-01`: December 1, 1948.
- `VALUE`: the corresponding unemployment rate, in percent.
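The "always the first of the month" property can be checked with the `.dt` accessor. The sketch below runs only on the four example dates listed above; the variable names are illustrative, and on the full dataset the same check would be applied to `unrate['DATE']`.

```python
import pandas as pd

# The four example dates listed above, converted to datetime values
dates = pd.to_datetime(
    pd.Series(['1948-01-01', '1948-02-01', '1948-03-01', '1948-12-01'])
)

# True only if every date falls on day 1 of its month
all_first_of_month = bool((dates.dt.day == 1).all())

# Month names, to confirm how each date string is interpreted
months = dates.dt.month_name().tolist()

print(all_first_of_month, months)
```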

## 4. Observations From The Table Representation


We can make the following observations from the table:

- In 1948:
  - monthly unemployment rate ranged between `3.4` and `4.0`.
  - highest unemployment rate was reached in both March and December.
  - lowest unemployment rate was reached in January.
- From January to March, unemployment rate trended up.
- From March to May, unemployment rate trended down.
- From May to August, unemployment rate trended up.
- From August to October, unemployment rate trended down.
- From October to December, unemployment rate trended up.
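Some of these observations can also be derived programmatically. The sketch below is limited to the ten rows displayed earlier (January to October 1948), so it does not cover November or December; the `CHANGE` column and the variable names are introduced here for illustration only.

```python
import pandas as pd

# First ten rows of the dataset, as displayed earlier (January to October 1948)
values = [3.4, 3.8, 4.0, 3.9, 3.5, 3.6, 3.6, 3.9, 3.8, 3.7]
sample = pd.DataFrame({
    'DATE': pd.date_range('1948-01-01', periods=10, freq='MS'),
    'VALUE': values,
})

# Rows holding the lowest and highest rate within these ten months
lowest = sample.loc[sample['VALUE'].idxmin()]
highest = sample.loc[sample['VALUE'].idxmax()]
print('Lowest: ', lowest['DATE'].month_name(), lowest['VALUE'])
print('Highest:', highest['DATE'].month_name(), highest['VALUE'])

# Month-over-month change: positive means the rate went up,
# negative means it went down (rounded to hide floating-point noise)
sample['CHANGE'] = sample['VALUE'].diff().round(1)
print(sample)
```

The sign of `CHANGE` flips every few rows, which matches the alternating up and down trends listed above.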

Because the table only contained the data from 1948, it didn't take too much time to make these observations. If we scaled the table up to include all 824 rows, it would be very time-consuming and painful to understand. Tables shine at presenting information precisely at the intersection of rows and columns, and they allow quick lookups when we know the row and column we're interested in. In addition, problems that involve comparing values between adjacent rows or columns are well suited for tables. Unfortunately, many problems you'll encounter in data science require comparisons that aren't possible with just tables.

For example, one thing we learned from looking at the monthly unemployment rates for 1948 is that every few months, the unemployment rate switches between trending up and trending down. It's not switching direction every month, however, and this could mean that there's a seasonal effect. Seasonality is when a pattern is observed on a regular, predictable basis for a specific reason. A simple example of seasonality would be a large increase in textbook purchases every August. Many schools start their terms in August, and this spike in textbook sales is directly linked to that.

We first need to find out whether there's any seasonality by comparing the unemployment trends across many years, so we can decide whether to investigate it further. The faster we can assess our data, the sooner we can perform high-level analysis. If we rely on just the table to figure this out, we won't be able to perform a high-level test quickly. Let's see how a visual representation of the same information can be more helpful than the table representation.
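To make the "compare across many years" idea concrete, here is a minimal sketch that rearranges monthly data so each year is a row and each month is a column. The values are hypothetical placeholders generated by a formula, not real BLS figures, and the names `demo`, `YEAR`, `MONTH`, and `by_year` are assumptions made for this example.

```python
import pandas as pd

# Hypothetical values for illustration only (NOT real BLS figures):
# two years of monthly data that repeat the same within-year pattern
dates = pd.date_range('2000-01-01', periods=24, freq='MS')
values = [5.0 + 0.1 * (i % 12) for i in range(24)]
demo = pd.DataFrame({'DATE': dates, 'VALUE': values})

# Split each date into its year and month
demo['YEAR'] = demo['DATE'].dt.year
demo['MONTH'] = demo['DATE'].dt.month

# One row per year, one column per month, so the same month
# in different years lines up vertically
by_year = demo.pivot(index='YEAR', columns='MONTH', values='VALUE')
print(by_year.round(1))
```

Even in this layout, spotting a pattern means scanning many numbers, and with 824 monthly values the table would span dozens of rows.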

## 5. Visual Representation

## 6. Introduction to Matplotlib

## 7. Adding Data

## 8. Fixing Axis Ticks

## 9. Adding Axis Labels And A Title