# Table of Contents
### - Correlation heatmap (matplotlib)
### - Correlation heatmap (seaborn)
### - Scatterplots
### - Pair plots
### - Correlation analysis
### - Current hypotheses

# Notebook Setup

In [None]:
# Import related libraries
import pandas as pd
import numpy as np
import os
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib

In [None]:
# Import cleaned dataset
path = r'C:\Users\mmreg\OneDrive\Desktop\Data Analytics Course Work\Data Immersion\Tasks\08-2022 Exploratory Analytics Project\02 Data'

In [None]:
df = pd.read_csv(os.path.join(path, 'Prepared', 'citibike_clean.csv'), index_col = False)

In [None]:
df.head()

In [None]:
matplotlib.__version__

In [None]:
%matplotlib inline

# Question 2

## Use the questions you defined in the previous task to pick out variables from your data set suitable for your exploratory visual analysis.

In [None]:
# Create dataframe with columns removed for analysis
# We will not need the trip_id, bike_id, start_station_name, end_station_name, start_station_id, end_station_id, end_station_latitude, end_station_longitude, weekday, subscriber, customer_volume or end_time
df_vis = df.drop(columns = ['trip_id', 'bike_id', 'start_station_name', 'end_station_name', 'start_station_id', 'end_station_id', 'end_station_latitude', 'end_station_longitude', 'weekday', 'subscriber', 'end_time', 'customer_volume'])

In [None]:
# Ensure successful column drop
df_vis.head()

In [None]:
df_vis = df_vis.drop(columns = ['Unnamed: 0'])

In [None]:
df_vis.shape
# After dropping the 12 columns above plus 'Unnamed: 0' (13 in total), the shape should be (42856, 8). We are good to go.
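
The shape check above can also be made automatic. The following is a minimal sketch, assuming a small made-up frame (`toy`) in place of the real data; the values are hypothetical and only the column names echo the dataset.

```python
import pandas as pd

# Hypothetical miniature frame standing in for the Citi Bike data
toy = pd.DataFrame({
    "trip_id": [1, 2, 3],
    "bike_id": [10, 11, 12],
    "start_hour": [8, 17, 23],
    "trip_duration": [600, 1200, 300],
})

cols_to_drop = ["trip_id", "bike_id"]

# Fail early if a column name is misspelled or has already been dropped
missing = [c for c in cols_to_drop if c not in toy.columns]
assert not missing, f"Columns not found: {missing}"

toy_vis = toy.drop(columns=cols_to_drop)

# Rows are unchanged; the column count shrinks by exactly the number dropped
assert toy_vis.shape == (toy.shape[0], toy.shape[1] - len(cols_to_drop))
print(toy_vis.shape)
```

Deriving the expected shape from the drop list avoids counting columns by hand.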

In [None]:
df_vis.rename(columns = {'start_hour' : 'Start Hour', 'start_time' : 'Start Time', 
                     'start_station_latitude': 'Latitude', 'start_station_longitude': 'Longitude',
                     'trip_duration': 'Trip Length', 'birth_year' : 'Birth Year', 'age' : 'Age', 'gender' : 'Gender' },
                      inplace = True)

In [None]:
df_vis.head()

# Question 3

## Create a correlation matrix heatmap (colored).

In [None]:
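# View correlation table to ensure data is ready for heatmap
# Restrict to numeric columns so text columns such as 'Start Time' do not raise an error
df_vis.select_dtypes(include='number').corr()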

In [None]:
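# Create heatmap using matplotlib
# Compute the matrix once from the numeric columns so the tick labels match its rows and columns
corr_matrix = df_vis.select_dtypes(include='number').corr()
f = plt.figure(figsize=(10, 10))
plt.matshow(corr_matrix, fignum=f.number)
plt.xticks(range(corr_matrix.shape[1]), corr_matrix.columns, fontsize=12, rotation=45)
plt.yticks(range(corr_matrix.shape[0]), corr_matrix.columns, fontsize=12)
cb = plt.colorbar()
cb.ax.tick_params(labelsize=14)
plt.title('Correlation Matrix', fontsize=14)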

In [None]:
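# Create a figure and axes with matplotlib, then draw the heatmap with seaborn
f, ax = plt.subplots(figsize=(10, 10))

# The `annot` argument places the correlation coefficients onto the heatmap.
# Note: `corr` holds the Axes returned by seaborn, not the correlation matrix.
corr = sns.heatmap(df_vis.select_dtypes(include='number').corr(), annot = True, ax = ax)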

In [None]:
# Save visual for presentation
corr.figure.savefig(os.path.join(path, 'corr_chart.png'))

### We can see that most of the correlations between variables are weak at best. The strongest are longitude/latitude, trip_duration/gender, birth_year/gender, and birth_year/start_hour; all others are too weak to be considered relevant. For the purpose of this task, I will treat any coefficient with an absolute value above 0.05 as a weak relationship, which gives me variables to work with for the rest of the task. The relationships are defined below:
#### - Longitude/latitude: This correlation is weak and positive. It simply tells us that the locations fall within a very small portion of the globe: New York City. Longitude and latitude will clearly have something in common and thus a correlation, since they describe locations within a single city, a small area on a world scale, and are likely to be close to one another.
#### - trip_duration/gender: This is a weak, negative correlation. Once we define the gender scale of the data (0: undefined, 1: male, 2: female), it tells an interesting story. According to the heatmap, the longer the trip duration, the more likely the gender of the traveler is male or undefined.
#### - birth_year/gender: This is a weak, positive correlation. Again we use the numerical coding of the gender column to interpret it. According to the heatmap, the younger customers who use this service are more likely to be female.
#### - birth_year/start_hour: This is a weak, positive correlation, from which we can hypothesize that the later the trip starts, the younger the customer on the trip is.
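
The ranking described above can also be read off programmatically rather than by eye. This is a minimal sketch, not the method used in this notebook: `ranked_pairs` is a hypothetical helper, the 0.05 threshold is the one chosen above, and the `toy` frame holds made-up values purely for illustration.

```python
from itertools import combinations

import pandas as pd


def ranked_pairs(frame, threshold=0.05):
    """Return (col_a, col_b, r) for pairs with |r| above threshold, strongest first."""
    corr = frame.select_dtypes(include="number").corr()
    pairs = [
        (a, b, float(corr.loc[a, b]))
        for a, b in combinations(corr.columns, 2)  # each pair once, no diagonal
        if abs(corr.loc[a, b]) > threshold
    ]
    # Sort by absolute value so strong negative correlations are not pushed to the end
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)


# Made-up data: y tracks x exactly, z runs against x, w is unrelated to x and y
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],
    "z": [5, 3, 4, 1, 2],
    "w": [1, -1, 0, -1, 1],
    "label": list("abcde"),  # non-numeric column, ignored by select_dtypes
})

result = ranked_pairs(toy)
for a, b, r in result:
    print(f"{a}/{b}: {r:+.3f}")
```

Applied to `df_vis`, the same helper would list the variable pairs in order of strength, with the sign showing whether each relationship is positive or negative.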

# Question 4
## Create a scatterplot (or plots) for the variables with the strongest correlations and examine the nature of their relationships.

In [None]:
# Create scatterplot for the longitude/latitude relationship
# (df_vis columns were renamed above, so the new names are used here)
sns.lmplot(x = 'Longitude', y = 'Latitude', data = df_vis)

### While this plot may not seem to tell us much, it actually provides some useful information. We can see three distinct clusters in the scatterplot, and since longitude and latitude describe position, we can deduce that there are three distinct areas where Citi Bikes are most commonly used. This could be useful for deploying bikes, particularly once weekday and hour are incorporated, but the linear regression line itself tells us little.
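
One rough way to check the clusters numerically is to round the coordinates and count rides per rounded location. The sketch below assumes made-up coordinates and an arbitrary rounding of two decimal places (roughly a kilometre); both are illustrative choices, not values taken from the dataset.

```python
import pandas as pd

# Hypothetical start coordinates; the real data would come from the start-station columns
toy = pd.DataFrame({
    "lat": [40.7127, 40.7131, 40.7129, 40.7502, 40.7498, 40.6901],
    "lon": [-74.0059, -74.0061, -74.0060, -73.9903, -73.9899, -73.9902],
})

# Round to two decimals so nearby stations fall into the same bucket (assumed granularity)
rounded = toy.round({"lat": 2, "lon": 2})

# Count rides per bucket, busiest first
counts = rounded.groupby(["lat", "lon"]).size().sort_values(ascending=False)
print(counts)
```

Buckets with high counts that sit apart from one another would support the idea of distinct usage zones.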

In [None]:
# Create scatterplot for the trip_duration/gender relationship
# (columns were renamed above to 'Trip Length' and 'Gender')
sns.lmplot(x = 'Trip Length', y = 'Gender', data = df_vis)

In [None]:
# Create variable for trip duration in minutes to try and clear up visual
# ('Trip Length' is the renamed trip_duration column)
df_vis['trip_duration_min'] = df_vis['Trip Length'] / 60

In [None]:
df_vis.head()

In [None]:
# Try visualization with new variable ('Gender' is the renamed gender column)
sns.lmplot(x = 'Gender', y = 'trip_duration_min', data = df_vis)

### The plot is less a scatter and more a set of tightly packed points in three distinct bands, one per gender code, but it does suggest that the longer the trip, the more likely the customer's gender is undefined or male. However, the descriptive statistics of the gender variable show that more men use the service than women and undefined customers combined, so I assume this large gender disparity is what causes the skew.
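
The disparity can be put into numbers with a grouped summary. This is a minimal sketch on made-up values, assuming the gender coding described earlier (0: undefined, 1: male, 2: female); the column names follow the raw dataset rather than the renamed `df_vis` columns.

```python
import pandas as pd

# Hypothetical rides: four male, two female, one undefined
toy = pd.DataFrame({
    "gender": [1, 1, 1, 1, 2, 2, 0],
    "trip_duration_min": [8.0, 10.0, 12.0, 14.0, 15.0, 25.0, 30.0],
})

# Ride count and median duration per gender code
summary = toy.groupby("gender")["trip_duration_min"].agg(
    rides="count", median_min="median"
)

# Share of all rides, to make the imbalance between groups explicit
summary["share"] = summary["rides"] / summary["rides"].sum()
print(summary)
```

Comparing medians alongside each group's share of rides helps separate a genuine difference in trip length from an effect of one group simply having far more observations.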

# Question 5
## Create a pair plot of the entire data set.

In [None]:
# Create pair plot of entire data set
g = sns.pairplot(df)

### This seems rather ridiculous, but it was asked for, so I included it. I will create another pair plot with only the subset used for the previous heatmap visualization.

In [None]:
g_vis = sns.pairplot(df_vis)

### Overall, there are some compelling things to observe here. The gender disparity is good information to go on and will be immediately useful for the company. One thing we will need to look into is the birth years: I see a lot of points for customers born in the 1900s. I also see that customers with a 1900s birth year in their profile are listed with undefined gender, which suggests they simply did not fill out the customer profile. My guess is that this is a default value supplied by the app, but I would need to contact stakeholders to confirm. Later on, I would like to look at trip duration and start hour in relation to gender and subscriber status, as this could give us valuable insight into the customer base and how to better serve them.
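
The suspected link between early birth years and undefined gender could be checked with a cross-tabulation. The sketch below uses made-up rows, and the 1910 cutoff is an assumption chosen only to isolate "1900s" birth years; it would need adjusting to whatever the real data shows.

```python
import numpy as np
import pandas as pd

# Hypothetical profiles; gender 0 means undefined
toy = pd.DataFrame({
    "birth_year": [1900, 1900, 1985, 1992, 1900, 1978],
    "gender":     [0,    0,    1,    2,    1,    0],
})

# Assumed cutoff: treat birth years before 1910 as likely placeholders
year_flag = np.where(toy["birth_year"] < 1910, "before 1910", "1910 or later")
gender_flag = np.where(toy["gender"] == 0, "undefined", "defined")

# Rows: birth-year group; columns: whether gender was filled in
table = pd.crosstab(year_flag, gender_flag, rownames=["birth_year"], colnames=["gender"])
print(table)
```

If nearly all of the early birth years fall in the "undefined" column, that would be consistent with an unfilled profile, though confirming the app's default would still require the stakeholders.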

# Question 6
## Create a categorical plot and interpret the results.

### I will first plot the distribution of trip start hours, then make a categorical plot of trip duration (in minutes) by start hour, and interpret the results.

In [None]:
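# Create histogram of start hour (one bin per hour) to see when the service is busiest
# ('Start Hour' is the renamed start_hour column)
g_4 = sns.histplot(df_vis['Start Hour'], bins = 24, kde = True)
# Hours 8 and 16-19: high volume / hours 7, 9-15 and 20-21: normal / all others: low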

### From the visualization, we can start to flag busy hours of operation for the service. Hour 8 (8 a.m.) and hours 16-19 (4 p.m. through 7 p.m.) are high volume; hour 7, hours 9-15, and hours 20-21 are normal volume; all other hours are low volume. I am going to create a variable that categorizes these times for future analysis.

In [None]:
# Insert new 'customer_volume' variable
# Hours 22 and 23 are low volume (hours never exceed 23, so the cutoff must be 21)
df.loc[df['start_hour'] > 21, 'customer_volume'] = 'Low Volume'

In [None]:
df.loc[df['start_hour'] < 7, 'customer_volume'] = 'Low Volume'

In [None]:
df.loc[df['start_hour'] == 7 , 'customer_volume'] = 'Normal Volume'

In [None]:
df.loc[(df['start_hour'] >= 9) & (df['start_hour'] < 16), 'customer_volume'] = 'Normal Volume'

In [None]:
df.loc[(df['start_hour'] >= 20) & (df['start_hour'] <= 21), 'customer_volume'] = 'Normal Volume'

In [None]:
df.loc[(df['start_hour'] >= 16) & (df['start_hour'] <= 19), 'customer_volume'] = 'High Volume'

In [None]:
df.loc[df['start_hour'] == 8 , 'customer_volume'] = 'High Volume'

In [None]:
df['customer_volume'].value_counts(dropna = False)

In [None]:
# Confirm addition of derived variable
df.head()
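
The cell-by-cell assignments above can also be expressed in a single step. This is a sketch of an alternative, not the approach used in this notebook: it uses `numpy.select` on a made-up frame with one row per hour, and it assumes hours 22 and 23 belong in the low-volume group along with the early-morning hours.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one ride per hour of the day
toy = pd.DataFrame({"start_hour": range(24)})
hour = toy["start_hour"]

# Conditions are checked in order; Series.between is inclusive on both ends
conditions = [
    (hour == 8) | hour.between(16, 19),                         # high volume
    (hour == 7) | hour.between(9, 15) | hour.between(20, 21),   # normal volume
]
labels = ["High Volume", "Normal Volume"]

# Any hour matching neither condition falls through to the default
toy["customer_volume"] = np.select(conditions, labels, default="Low Volume")
print(toy["customer_volume"].value_counts())
```

Because every hour receives the default unless a condition matches, no rows are left without a label.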

In [None]:
# Create categorical plot of trip_duration_min and start hour, with color categorization of gender
# ('Start Hour' and 'Gender' are the renamed start_hour and gender columns)
sns.set(style="ticks")
g_3 = sns.catplot(x="Start Hour", y="trip_duration_min", hue="Gender", data=df_vis)

### This is an interesting visual. The density of the points gives us a small glimpse into the popular times for the service. What is more interesting, however, is that even though men use the service much more often than women, their trips are on average much shorter. The "Undefined" gender shows no clear pattern or correlation.

# Question 7
## Revisit the questions you generated in the previous task and write answers to those you can based on the exploration you’ve conducted so far. Add any new questions that may have arisen based on the early findings in your visual exploration.

### Some of the questions from the previous task that have been answered:
#### "Are there certain areas within New York that see the most usage of the service?" Yes. As seen in the longitude/latitude scatterplot, there are three distinct 'zones' within which customers start their rides. Further analysis will need to look at where they finish and whether this zone pattern changes, but we can determine almost precisely where the bikes are used most.
#### "What is the age demographic of the most common customers?" We can see that the most common customers are 70 and younger, with very few older than this.
#### "Is there a correlation between age and how long the trip is?" No strong correlation can be found; however, we do see that older customers tend to use the service less, especially after the age of 60. This could be useful information.
### Some new questions that have arisen:
#### - Why do women tend to use the bikes for longer periods of time?
#### - Do the customer volume ranges change depending on day of the week?
#### - There seem to be some pockets within the longitude and latitude charts when correlated with start hour. Why is this?


# Question 8
## Define any hypotheses that you can at this point.

### There are a couple hypotheses that I can think of from the data provided:
### - If you are a subscriber, then your trip lengths will be shorter
### - If you are male, you will have shorter trips
### - If it is a weekday, you will have shorter trip duration

In [None]:
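# Export dataset for future use
# index=False avoids writing the row index, which would come back as an 'Unnamed: 0' column on the next read
df.to_csv(os.path.join(path, 'Prepared', 'citibike_clean.csv'), index = False)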