# CME538 - Introduction to Data Science
## Assignment 4 - Exploratory Data Analysis

**Learning Objectives**
After completing this assignment, you should be comfortable:

- Using `matplotlib` and `seaborn` for data visualization.
- Using more advanced `Pandas` grouping and aggregating methods.
- Resampling DateTime indices.
- Working with datetime columns in `Pandas`.
- Removing outliers.
- Investigating missing data.

You are free to add new cells to use as a scratch pad, but make sure to clean your code up and present your answer in the cell indicated with `# Write your code here`.

**Marking Breakdown**

Question | Points
--- | ---
Question 1a | 1
Question 1b | 1
Question 1c | 1
Question 1d | 1
Question 1e | 1
Question 2a | 1
Question 2b | 1
Question 2c | 1
Question 3a | 1
Question 3b | 1
Question 4a | 1
Question 4b | 1
Question 4c | 1
Question 5 | 1
Question 6 | 1
Question 7a | 1
Question 7b | 1
Question 7c | 1
Question 7d | 1
Question 7e | 1
Question 7f | 1
Question 7g | 1
Question 8a | 1
Question 8b | 1
Question 8c | 1
Question 8d | 1
Question 8e | 1
Total | 27

One of the marks below will be added to the **Total** above.

### Code Quality

| Rank | Points | Description |
| :-- | :-- | :-- |
| Youngling | 1 | Code is unorganized, variable names are not descriptive, redundant, memory-intensive, computationally-intensive, uncommented, error-prone, difficult to understand. |
| Padawan | 2 | Code is organized, variable names are descriptive, satisfactory utilization of memory and computational resources, satisfactory commenting, readable. |
| Jedi | 3 | Code is organized, easy to understand, efficient, clean, a pleasure to read. #cleancode |

## Setup Notebook

In [None]:
# Import standard libraries
import os
import warnings

# Import 3rd party libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Configure Notebook
%matplotlib inline
plt.style.use('fivethirtyeight')
sns.set_context("notebook")
warnings.filterwarnings('ignore')

# Overview
You've just been hired by the City of Toronto. Congratulations! Toronto has been collecting data on its bike share program since 2017 and the 2019 data has just become available. The city has implemented some new initiatives to try to increase ridership numbers, such as **Free Ride Wednesdays** in the month of September and the addition of new bike lanes. Your manager has asked you to:
1. Merge the bike share data with local weather data from the TORONTO CITY CENTRE weather station.
2. Investigate the effect of temperature on ridership numbers.   
3. Explore different consumer behaviours between Annual Members and Casual Members.


# 1. Prepare Weather Data
## Question 1a
First, let's check to see what weather files are available in the assignment directory. Weather file names have the following structure `en_climate_hourly_ON_6158359_01-2017_P1H.csv`. All weather file names contain the number `6158359`, which is the `TORONTO CITY CENTRE` weather station ID. Create a variable called `weather_filenames` and assign a list containing all weather file names to it.
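
As a rough illustration of filtering a directory listing by a substring, the sketch below creates a few empty files in a temporary folder and keeps only the matching names. The file names and the `0000000` placeholder ID are made up for the example and are not the assignment's files.

```python
import os
import tempfile

# Hypothetical file names, created in a throwaway directory for illustration.
example_names = [
    "en_climate_hourly_ON_0000000_01-2019_P1H.csv",
    "en_climate_hourly_ON_0000000_02-2019_P1H.csv",
    "notes.txt",
]

with tempfile.TemporaryDirectory() as folder:
    for name in example_names:
        open(os.path.join(folder, name), "w").close()

    # Keep only the names containing the chosen substring, sorted for a stable order.
    matching = sorted(name for name in os.listdir(folder) if "0000000" in name)

print(matching)
```

Sorting is optional, but `os.listdir()` makes no promise about ordering, so a sorted list is easier to inspect.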

In [None]:
# Write your code here.
weather_filenames = ...

# Print file names
print(weather_filenames[0:5])

## Question 1b
`weather_filenames` contains 12 files containing monthly weather data for 2019. Create a variable `weather_data` and assign a DataFrame to it that contains the data from all 12 `.csv` files. Hint: `pd.concat()` might be helpful.

Check out this [glossary](https://climate.weather.gc.ca/glossary_e.html#windChill) to get a better understanding of what the column names refer to.
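
To show how `pd.concat()` stacks several tables with the same columns, here is a minimal sketch that reads two tiny in-memory CSVs. The values are invented and `io.StringIO` simply stands in for files on disk.

```python
import io

import pandas as pd

# Two tiny in-memory "files" standing in for monthly CSVs (made-up values).
january = io.StringIO("Date/Time,Temp (°C)\n2019-01-01 00:00,-1.0\n2019-01-01 01:00,-1.5\n")
february = io.StringIO("Date/Time,Temp (°C)\n2019-02-01 00:00,-3.0\n")

# Read each source into its own DataFrame, then stack them row-wise.
frames = [pd.read_csv(source) for source in (january, february)]
combined = pd.concat(frames, ignore_index=True)

print(combined)
```

`ignore_index=True` gives the result a fresh 0..n-1 index instead of repeating each file's own row numbers.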

In [None]:
# Write your code here.
weather_data = ...

# View DataFrame
weather_data.head()

## Question 1c
A column called `'Date/Time'` contains hourly datetime stamps in the format `YYYY-MM-DD HH:MM`. Use `pd.DatetimeIndex()` to set the `'Date/Time'` column as the index of `weather_data`. Now the index of `weather_data` is composed of `Timestamps`. Hint: `weather_data.columns` should no longer contain `'Date/Time'`.
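
A small sketch of the idea on a toy DataFrame (invented values): build a `DatetimeIndex` from a string column, then drop that column so it no longer appears among the columns.

```python
import pandas as pd

# Toy table with timestamps stored as plain strings.
toy = pd.DataFrame({
    "Date/Time": ["2019-01-01 00:00", "2019-01-01 01:00"],
    "Temp (°C)": [-1.0, -1.5],
})

# Build the index from the string column, then drop the now-redundant column.
toy.index = pd.DatetimeIndex(toy["Date/Time"])
toy = toy.drop(columns="Date/Time")

print(toy.index)
```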

In [None]:
# Write your code here.
...

# View DataFrame
weather_data.head()

## Question 1d
The index of `weather_data` (`weather_data.index`) should be a series of Timestamps (e.g. `Timestamp('2019-01-01 00:00:00')`).

Are these Timestamps localized to a time zone? If so, which one?

*Type your answer here, replacing this text.*

If the Timestamps are not localized, localize them to Toronto's time zone (Eastern Standard Time - `EST`).
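
The following sketch, on a toy index, shows one way to check whether timestamps carry a time zone and how `tz_localize()` attaches one. The DataFrame and its `value` column are placeholders.

```python
import pandas as pd

naive = pd.DataFrame(
    {"value": [1, 2]},
    index=pd.DatetimeIndex(["2019-01-01 00:00", "2019-01-01 01:00"]),
)

# A value of None here means the index is timezone-naive.
print(naive.index.tz)

# Attach a time zone to the index without shifting the clock times.
localized = naive.tz_localize("EST")
print(localized.index[0])
```

Note that `tz_localize()` labels existing clock times with a zone, whereas `tz_convert()` translates already-localized times into another zone.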

In [None]:
# Write your code here.
...

# View DataFrame
weather_data.head()

## Question 1e
Next, plot temperature as a function of the datetime index. Your plot should look something like this.

<br>
<img src="images/temp_2019.png" alt="drawing" width="600"/>
<br>

In [None]:
# Write your code here.
...

# 2. Import Bike Share Data
The assignment folder contains data about bike share trips in the city of Toronto for 2019 where there is one `.csv` file for each month. File names have the structure `bike_share_YYYY-MM.csv`. 

In [None]:
# Get bike share file names
trips_filenames = [filename for filename in os.listdir() if 'bike_share' in filename]

# Print file names
print(trips_filenames[0:5])

## Question 2a
Create a variable `trips_data` and assign a DataFrame to it that contains the bike share data from all 12 `.csv` files.

In [None]:
# Write your code here.
trips_data = ...

# Let's remove double spaces from the column names
trips_data.columns = [' '.join(col.split()) for col in trips_data.columns]              

# View DataFrame
trips_data.head()

## Question 2b
Next, convert columns `'Start Time'` and `'End Time'` to datetimes. Then, localize `'Start Time'` and `'End Time'` to Eastern Standard Time (EST). This might take a minute or two.    
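
For columns (rather than an index), the same steps go through `pd.to_datetime()` and the `.dt` accessor. This is a sketch on made-up strings; the actual timestamp format in the bike share files may differ.

```python
import pandas as pd

# Made-up start times stored as strings.
toy_trips = pd.DataFrame({"Start Time": ["2019-01-01 00:08", "2019-01-01 00:10"]})

# Parse the strings, then localize the resulting datetimes.
toy_trips["Start Time"] = pd.to_datetime(toy_trips["Start Time"]).dt.tz_localize("EST")

print(toy_trips.dtypes)
```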

In [None]:
# Write your code here.
...

# View DataFrame
trips_data.head()

## Question 2c
To check that these datetime conversions were done correctly, generate a plot of daily ride counts. Your plot should look something like this. Hint: Check out `.resample()` and consider making a new variable.

<br>
<img src="images/trips_2019.png" alt="drawing" width="600"/>
<br> 
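
As a hedged sketch of how `.resample()` can turn individual events into daily counts, the example below uses three invented start times and a helper Series of ones (an assumption of this sketch, not a required approach).

```python
import pandas as pd

# Three made-up ride start times; note that no ride starts on 2019-01-02.
start_times = pd.to_datetime(["2019-01-01 08:00", "2019-01-01 17:30", "2019-01-03 09:15"])

# One row per ride, each contributing a count of 1.
rides = pd.Series(1, index=start_times)

# Sum the ones within each calendar day; empty days come out as 0.
daily_counts = rides.resample("D").sum()

print(daily_counts)
```

Calling `daily_counts.plot()` would then draw the counts against the date index.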

In [None]:
# Write your code here.
...

# 3. Clean Bike Share Data
## Question 3a - Missing Data
Large datasets are rarely complete (free of missing values), and it's always a good idea to evaluate whether there is missing data and for which fields.

First, check for missing values in `weather_data`. Create a DataFrame named `weather_data_missing` where the index is the column names of `weather_data` and there is one column named `'count'` which contains the number of missing values for a particular column.
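
A toy version of the target shape, with invented columns `a` and `b`: one row per original column and a single `'count'` column holding the number of missing values.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", None, None]})

# isnull() marks missing cells, sum() counts them per column,
# and to_frame() turns the resulting Series into a one-column DataFrame.
missing = toy.isnull().sum().to_frame("count")

print(missing)
```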

In [None]:
# Write your code here.
weather_data_missing = ...

# View DataFrame
weather_data_missing

Next, check for missing values in `trips_data`. Create a DataFrame named `trips_data_missing` where the index is the column names of `trips_data` and there is one column named `'count'` which contains the number of missing values for a particular column.

In [None]:
# Write your code here.
trips_data_missing = ...

# View DataFrame
trips_data_missing

We can see that some columns have missing values. However, having missing data does not necessarily mean that something is wrong with an entry. For example, the `Weather` column contains the following unique values:

In [None]:
weather_data['Weather'].unique().tolist()

You can see that only non-normal/clear weather events are listed. So, when `weather_data['Weather']` is `NaN`, the conditions are clear. Therefore, we would never want to remove rows where `weather_data['Weather']` is `NaN`.
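
One practical caveat when selecting such rows: `NaN` never compares equal to anything, including itself, so an `==` test finds nothing. The sketch below, on a made-up Series, contrasts that with `.isna()`.

```python
import numpy as np
import pandas as pd

weather = pd.Series(["Rain", np.nan, "Fog", np.nan])

# Always 0: NaN never compares equal, even to NaN.
equals_nan = (weather == np.nan).sum()

# Counts the entries that are actually missing.
is_missing = weather.isna().sum()

print(equals_nan, is_missing)
```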

We can see that the first 8 columns of `weather_data_missing` have no missing data, so we can leave `weather_data` and address the missingness on a case-by-case basis depending on which columns we're analyzing.

For `trips_data`, we can see that `'End Station Id'` and `'End Station Name'` have 454 missing values, which is only 0.01% of the dataset. This might suggest corruption and given the small number of missing values, we can safely drop these rows. 

## Question 3b - Missing Data
Drop any rows of `trips_data` with missing values.

In [None]:
# Write your code here.
trips_data = ...

# View DataFrame
trips_data.head()

## Question 4a - Outliers
Outliers in your datasets can be both good and bad. On the one hand, they may contain important information, while on the other hand, they skew your visualizations and may bias your models.

As a simple first pass, let's look at the summary statistics for `trips_data` using `.describe()` (remember, it only works for numeric data).

In [None]:
trips_data.describe()

Right away we notice something a bit funny with `'Trip Duration'`. The min and max values seem implausible. A trip cannot last `0 seconds` (you'd have to be biking at the speed of light!) and it's unlikely that a trip lasted for `1.240378e+07 seconds`. `1.240378e+07 seconds` is roughly 4.78 months, which would be quite the ride and cost tens of thousands of dollars. We can see that the average `'Trip Duration'` is roughly 17 minutes.

We've been told by Bike Share Toronto that trips lasting less than 1 minute can be considered false trips. Remove all trips from `trips_data` with a duration less than 60 seconds.
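
Filtering rows by a threshold is usually done with a boolean mask. A minimal sketch on invented durations:

```python
import pandas as pd

toy = pd.DataFrame({"Trip Duration": [0, 45, 60, 900]})

# The comparison yields a True/False Series; indexing with it keeps the True rows.
kept = toy[toy["Trip Duration"] >= 60]

print(kept)
```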

In [None]:
# Write your code here
trips_data = ...

# View DataFrame
trips_data.head()

## Question 4b - Outliers
Next, remove any `'Trip Duration'` values less than `Q1 - 1.5 * IQR` or greater than `Q3 + 1.5 * IQR`.

- Q1: The first quartile (`.quantile(0.25)`)
- Q3: The third quartile (`.quantile(0.75)`)
- IQR: The interquartile range (`Q3 - Q1`)
<br>
<img src="images/probability_density.png" alt="drawing" width="450"/>
<br> 
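
To make the fence arithmetic concrete, here is a worked sketch on nine invented values, where Q1 = 300, Q3 = 700, IQR = 400, and the fences sit at -300 and 1300.

```python
import pandas as pd

durations = pd.Series([100, 200, 300, 400, 500, 600, 700, 800, 10_000])

q1 = durations.quantile(0.25)
q3 = durations.quantile(0.75)
iqr = q3 - q1

# Values outside [lower, upper] are treated as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
within_fences = durations[durations.between(lower, upper)]

print(q1, q3, iqr, lower, upper)
print(within_fences.tolist())
```

`Series.between()` includes both endpoints by default, so values exactly on a fence are kept.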

In [None]:
# Write your code here
trips_data = ...

# View DataFrame
trips_data.head()

## Question 4c - Outliers
Plot a histogram + density plot using `sns.distplot()` of the `'Trip Duration'`. Ensure that `'Trip Duration'` is displayed in minutes. Your plot should look something like this.
<br>
<img src="images/trip_durations.png" alt="drawing" width="450"/>
<br> 

In [None]:
# Write your code here.
...

## Question 5 - Duplicates
Remove any entries from `trips_data` which have the same `'Trip Id'`.
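
A sketch of `drop_duplicates()` on a toy table with a repeated ID. Whether one copy should survive is a judgment call: `keep="first"` retains a single copy of each ID, while `keep=False` removes every row whose ID is repeated.

```python
import pandas as pd

toy = pd.DataFrame({
    "Trip Id": [1, 2, 2, 3],
    "Trip Duration": [300, 400, 400, 500],
})

# Keep the first row for each 'Trip Id' and discard later repeats.
deduplicated = toy.drop_duplicates(subset="Trip Id", keep="first")

# Alternative: discard every row whose 'Trip Id' appears more than once.
without_repeated_ids = toy.drop_duplicates(subset="Trip Id", keep=False)

print(len(deduplicated), len(without_repeated_ids))
```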

In [None]:
# Write your code here
trips_data = ...

# View DataFrame
trips_data.head()

# 4. Merge Datasets
To facilitate an analysis of the effect of weather on ridership, we must merge two DataFrames (`weather_data` and `trips_data`).

## Question 6
Use the `.merge()` function to combine `weather_data` and `trips_data` using datetime information and set the output to a new variable called `data_merged`. In `trips_data` there are two time stamps corresponding to the start and end of the ride. Use the `'Start Time'` of the rides to merge. 

`trips_data` datetimes contain information down to the minute, while `weather_data` is reported every hour. Thus, we must merge based on a common year, month, day, hour. Hint: create a new column in `trips_data` called `'merge_time'` and set it equal to `trips_data['Start Time']` rounded to the nearest hour.
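
The rounding-then-merging idea can be sketched on two toy tables (invented values). Using `how="left"` and merging against the weather index are assumptions of this sketch, not requirements of the question.

```python
import pandas as pd

trips = pd.DataFrame({
    "Trip Id": [1, 2, 3],
    "Start Time": pd.to_datetime(
        ["2019-01-01 08:10", "2019-01-01 08:40", "2019-01-01 10:05"]
    ),
})
weather = pd.DataFrame(
    {"Temp (°C)": [-1.0, -0.5, 0.0]},
    index=pd.to_datetime(
        ["2019-01-01 08:00", "2019-01-01 09:00", "2019-01-01 10:00"]
    ),
)

# Round each start time to the nearest hour so it lines up with the hourly weather rows.
trips["merge_time"] = trips["Start Time"].dt.round("h")

# Match the rounded time in trips against the datetime index of weather.
merged = trips.merge(weather, left_on="merge_time", right_index=True, how="left")

print(merged)
```

Both sides must agree on time zone awareness: merging a localized column against a naive index (or the reverse) raises an error.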

In [None]:
# Write your code here
data_merged = ...

# View DataFrame
data_merged.head()

# 5. Analysis of 'User Type'
## Question 7a
First, we'll explore the daily number of rides for Annual Members and Casual Members. Casual Members pay on a per-ride basis while Annual Members pay a monthly subscription fee. The DataFrame `data_merged` has a temporal resolution of a minute. Therefore, in order to look at daily numbers, we'll need to convert `data_merged` so that every row corresponds to a day. Create a new DataFrame called `data_days` with four columns:
- rides: The total number of rides for a particular day.
- annual_members: Number of rides by Annual Members.
- casual_members: Number of rides by Casual Members.
- workday: Was this day a workday (True) or a weekend day (False).

Your DataFrame should look something like this.

<br>
<img src="images/data_days.png" alt="drawing" width="500"/>
<br>

As a quick sanity check you can check that the number of rows in `data_merged` is equal to the sum of `data_days['rides']`.

Hint: You can use the `.groupby()` method and the `agg()` method to compute this transformation in a single line of code.
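
As an illustration of named aggregation with `.groupby()` and `.agg()`, the sketch below counts rides per day on three invented rows. The helper columns `is_annual` and `is_casual` are hypothetical names introduced for this example, and the workday flag here is based purely on the day of the week.

```python
import pandas as pd

toy = pd.DataFrame({
    "Start Time": pd.to_datetime(
        ["2019-01-01 08:10", "2019-01-01 17:40", "2019-01-05 11:00"]
    ),
    "User Type": ["Annual Member", "Casual Member", "Casual Member"],
})

# Hypothetical helper columns: True where the ride belongs to each user type.
toy["is_annual"] = toy["User Type"] == "Annual Member"
toy["is_casual"] = toy["User Type"] == "Casual Member"

# Group rides by calendar day; each keyword names an output column
# and maps it to (input column, aggregation).
per_day = toy.groupby(toy["Start Time"].dt.floor("D")).agg(
    rides=("User Type", "count"),
    annual_members=("is_annual", "sum"),
    casual_members=("is_casual", "sum"),
)

# Monday is 0 and Sunday is 6, so values below 5 are Monday to Friday.
per_day["workday"] = per_day.index.dayofweek < 5

print(per_day)
```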

In [None]:
# Write your code here
data_days = ...

# View DataFrame
data_days.head()

## Question 7b
Use `sns.distplot()` to create a plot showing the distributions of daily ride counts from `data_days` for Casual Members and Annual Members. Your plot should look something like this. 

<br>
<img src="images/ride_count_histogram.png" alt="drawing" width="600"/>
<br>

In [None]:
# Write your code here.
...

## Question 7c
Use `sns.scatterplot()` to create a scatter plot showing the relationship between daily ride counts from `data_days` for Casual Members and Annual Members. Your plot should look something like this. 

<br>
<img src="images/ride_count_scatter.png" alt="drawing" width="600"/>
<br>

In [None]:
# Write your code here.
...

## Question 7d
Looking at the figure you've generated for **Question 7c**, some interesting outliers have appeared. In particular, there are some `workday` data points that appear to follow the `non-workday` trend. What could explain these outliers and what additional information could be collected to address them? 

*Type your answer here, replacing this text.*

## Question 7e
Let's examine the hourly ride counts for `Annual Members` and `Casual Members`. First thing we have to do is create a new DataFrame called `data_hours`. `data_hours` should have its index set to hours (0 to 23) using the `'Start Time'` column and three columns `'rides', 'annual_members', 'casual_members'`. These should be average hourly values. 
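
One possible reading of "average hourly values" is: count the rides in each (date, hour) pair, then average those counts across dates for each hour of the day. The sketch below follows that reading on four invented start times; it is an interpretation, not necessarily the intended one.

```python
import pandas as pd

start = pd.Series(pd.to_datetime([
    "2019-01-01 08:05", "2019-01-01 08:30", "2019-01-01 17:00",
    "2019-01-02 08:15",
]))

# Step 1: number of rides in each (date, hour) combination.
per_day_hour = start.groupby(
    [start.dt.date.rename("date"), start.dt.hour.rename("hour")]
).size()

# Step 2: average those counts over all dates, separately for each hour.
average_by_hour = per_day_hour.groupby(level="hour").mean()

print(average_by_hour)
```

With this approach, an hour with no rides on a given date is simply absent from step 1 rather than counted as zero, which slightly inflates averages for quiet hours.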

In [None]:
# Write your code here.
data_hours = ...

# View DataFrame
data_hours.head()

## Question 7f
Use `data_hours` to create a plot showing the average number of hourly rides for `Annual Members` and `Casual Members`. Your plot should look something like this.

<br>
<img src="images/hourly_rides.png" alt="drawing" width="600"/>
<br>

In [None]:
# Write your code here.
...

## Question 7g
What can you observe from the plot? Hypothesize about the meaning of the peaks for casual and annual membership riders.

*Type your answer here, replacing this text.*

# 6. Analysis of 'Weather'
In this section, we'll be looking at the influence of weather conditions, such as temperature and precipitation, on ridership activity.

First, let's take a look at the missingness for `data_merged`. 

In [None]:
data_merged.isnull().sum(axis=0).to_frame('count')

We can see that the `'Weather'` column has 2,141,573 missing values. Now let's take a look at the unique labels in the `'Weather'` column and how many entries contain each one.

In [None]:
data_merged.groupby('Weather')['Trip Id'].count().sort_values(ascending=False)

We can see that the most common `'Weather'` labels are `'Rain'`, `'Fog'`, and `'Rain,Fog'`. There is no label for **clear** condition, which suggests that the 2,141,573 NaN values correspond to **clear** conditions.

## Question 8a
The first thing we have to do is transform `data_merged` to contain aggregated values for each hour. Remember, `data_merged`'s granularity is at the ride level. Each row corresponds to one ride with a temporal resolution of one minute. Therefore, there can be multiple entries for the same minute.

Create a new variable called `hourly_rides_and_weather` and assign a DataFrame to it containing the following information:
- Index: DatetimeIndex with a resolution of 1 hour (2019-01-01 10:00:00, 2019-01-01 11:00:00, 2019-01-01 12:00:00, 2019-01-01 13:00:00, etc.). Use `'Start Time'` to generate this index.
- Column 1 `'rides'`: How many rides were recorded during a particular hour.
- Column 2 `'annual_members'`: How many `'Annual Member'` rides were recorded during a particular hour.
- Column 3 `'casual_members'`: How many `'Casual Member'` rides were recorded during a particular hour.
- Column 4 `'workday'`: Does this hour correspond to a workday or a weekend day (True, False). 
- Column 5 `'temp'`: Reported temperature from the `'Temp (°C)'` column. 
- Column 6 `'weather'`: Reported weather conditions from the `'Weather'` column. 

<br>
<img src="images/hourly_rides_and_weather_1.png" alt="drawing" width="600"/>
<br>

Hints:
1. Use `.groupby()` and `.agg()`.
2. This is an example of how you can use `.agg()` to compute Column 1 `'rides'`: `.agg(rides=('rides', 'sum'))`.
3. Use `data_merged['Start Time'].dt.floor('H')` to group by hour.

In [None]:
# Write your code here
hourly_rides_and_weather = ...

# View DataFrame
hourly_rides_and_weather.head(10)

## Question 8b
Next, let's transform `hourly_rides_and_weather` from hourly to daily sampling. As we saw for **Question 7g**, there are strong trends within each day, which could complicate our initial analysis. Therefore, by aggregating by day, we'll remove some of this trend.

Modify `hourly_rides_and_weather` to include the following information:
- Index: DatetimeIndex with a resolution of 1 day (2019-01-01 00:00:00, 2019-01-02 00:00:00, 2019-01-03 00:00:00, 2019-01-04 00:00:00, etc.). Use `'Start Time'` to generate this index.
- Column 1 `'rides'`: How many rides were recorded during a particular day.
- Column 2 `'annual_members'`: How many `'Annual Member'` rides were recorded during a particular day.
- Column 3 `'casual_members'`: How many `'Casual Member'` rides were recorded during a particular day.
- Column 4 `'workday'`: Is this a workday or a weekend day (True, False). 
- Column 5 `'temp'`: The maximum temperature recorded for a particular day. 
- Column 6 `'weather'`: This column should contain one of two values (`'Clear'` or `'Precipitation'`). `'Clear'` should be assigned to days where 50% or more of the hours of that day had no precipitation events (`'Rain'`, `'Fog'`, `'Snow'`, `'Rain,Fog'`, etc.). Remember, `hourly_rides_and_weather['weather']` contains a `NaN` value when there was no precipitation event. When more than 50% of the hours of a day had a precipitation event, assign `'Precipitation'`.

<br>
<img src="images/hourly_rides_and_weather_2.png" alt="drawing" width="600"/>
<br>
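
The 50% rule for the `'weather'` column can be sketched on 48 invented hourly labels: compute the fraction of hours per day with a non-missing label, then map that fraction to one of the two categories. The toy Series and the label spelling are assumptions of this example.

```python
import numpy as np
import pandas as pd

# Two days of hourly labels; NaN means no precipitation event was reported.
hours = pd.date_range("2019-01-01", periods=48, freq="h")
hourly_weather = pd.Series(np.nan, index=hours, dtype=object)
hourly_weather.iloc[:6] = "Rain"      # day 1: 6 of 24 hours with an event
hourly_weather.iloc[24:40] = "Snow"   # day 2: 16 of 24 hours with an event

# notna() is True for hours with an event; the daily mean of True/False
# values is the fraction of hours with an event.
event_fraction = hourly_weather.notna().resample("D").mean()

# More than half of the hours with an event -> 'Precipitation', otherwise 'Clear'.
daily_label = event_fraction.gt(0.5).map({True: "Precipitation", False: "Clear"})

print(daily_label)
```

The fraction is taken over the hours present in the data, so days with missing hourly rows are judged on fewer than 24 hours.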

In [None]:
# Write your code here
hourly_rides_and_weather = ...

# View DataFrame
hourly_rides_and_weather.head(10)

## Question 8c
Let's investigate the relationship between weather conditions and ridership numbers. Create a violin plot using `sns.violinplot()` that looks something like the figure below.

<br>
<img src="images/weather_daily_rides_1.png" alt="drawing" width="600"/>
<br>

In [None]:
# Write your code here
...

## Question 8d
Let's investigate the relationship between the maximum daily temperature and ridership numbers. Create a scatter plot using `sns.scatterplot()` that looks something like the figure below.

<br>
<img src="images/temp_daily_rides.png" alt="drawing" width="600"/>
<br>

In [None]:
# Write your code here
...

## Question 8e
Reflect on the figures you've generated for **Question 8c** and **Question 8d**. What trends can you identify from these plots and can you suggest any potential issues with them or modifications you'd suggest to improve them?

*Type your answer here, replacing this text.*

**Congratulations, you're done Assignment 4. Review your answers and clean up that code before submitting on Quercus. `#cleancode`**