# Case study
### Google Data Analytics Capstone - Coursera

How Does a Bike-Share Navigate Speedy Success?

### Scenario
You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of
marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your
team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team
will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve
your recommendations, so they must be backed up with compelling data insights and professional data visualizations.

### Business task
Design marketing strategies aimed at converting casual riders into annual members.
### My assignment
How do annual members and casual riders use Cyclistic bikes differently?

### Set up the environment

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

### Discover datasets
There are 2 types of datasets:
<ul>
    <li>Per quarter -> one dataset covers one quarter of one year -> 13 attributes for 426 887 entries</li>
    <li>Per month -> one dataset covers one month -> 13 attributes for 84 776 entries</li>
</ul>
Let's compare the 2 types to spot differences and decide which one to use.
<br>
Both datasets have the same number of attributes and the same attribute names, but a different number of entries.

In [None]:
df_2020_Q1 = pd.read_csv('../input/cyclistic-dataset-google-certificate-capstone/Divvy_Trips_2020_Q1.csv')
df_202004 = pd.read_csv('../input/cyclistic-dataset-google-certificate-capstone/202004-divvy-tripdata.csv')

In [None]:
df_2020_Q1.head()

In [None]:
df_202004.head()

In [None]:
df_2020_Q1.info()

In [None]:
df_202004.info()
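The comparison of the two dataset types can also be checked programmatically instead of by eye. This is a minimal sketch on toy frames (the rows are made up and only stand in for `df_2020_Q1` and `df_202004`); it checks column names, dtypes, and row counts.

```python
import pandas as pd

# Toy frames standing in for the quarterly and monthly datasets (made-up rows).
quarterly = pd.DataFrame({"ride_id": ["A1", "A2", "A3"],
                          "member_casual": ["member", "casual", "member"]})
monthly = pd.DataFrame({"ride_id": ["B1", "B2"],
                        "member_casual": ["casual", "member"]})

# Same column names, in the same order?
same_columns = list(quarterly.columns) == list(monthly.columns)
# Same dtype for every column?
same_dtypes = quarterly.dtypes.equals(monthly.dtypes)
# Row counts are expected to differ between the two types.
row_counts = (len(quarterly), len(monthly))

print(same_columns, same_dtypes, row_counts)
```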

Our assignment is to compare annual members with casual members. Let's study the column <i>member_casual</i>.
<br>
There are 2 types of member_casual:
<ul>
    <li>member -> annual members</li>
    <li>casual -> casual members</li>
</ul>
That's perfect. From there we know that we can group by member type to gain insights.

In [None]:
df_2020_Q1['member_casual'].unique()

### Prepare

We know what the datasets are like. Now we can start preparing the data. The important question is: how many months or quarters do we take for the analysis?
<br>
Taking every single entry of every single month would make the dataset too big to analyze. We could select a sample instead.
<br>
<br>
Let's start by filtering the whole year of 2020 and choose a sample size from it.

The new data frame has 13 attributes and 3 541 683 entries, which is waaaaaay too many entries.
<br>
Let's take a sample size (https://www.surveymonkey.com/mp/sample-size-calculator/)

I reload the datasets here so as not to destroy the analysis of the 2 datasets made previously.
<br>
<br>
According to our assignment, the only attributes that are going to give us useful insights are: member_casual, started_at, and ended_at. Why? Because according to the company's pricing plan, prices are defined in 3 ways:
<ul>
    <li>single_ride pass</li>
    <li>full_day pass</li>
    <li>annual membership</li>
</ul>
So, what differs between casual and annual members will be the time spent on the bike, or trip time.

In [None]:
df_2020_Q1 = pd.read_csv('../input/cyclistic-dataset-google-certificate-capstone/Divvy_Trips_2020_Q1.csv', usecols = ['ride_id', 'started_at', 'ended_at', 'member_casual'])
df_202004 = pd.read_csv('../input/cyclistic-dataset-google-certificate-capstone/202004-divvy-tripdata.csv', usecols = ['ride_id', 'started_at', 'ended_at', 'member_casual'])
df_202005 = pd.read_csv('../input/cyclistic-dataset-google-certificate-capstone/202005-divvy-tripdata.csv', usecols = ['ride_id', 'started_at', 'ended_at', 'member_casual'])
df_202006 = pd.read_csv('../input/cyclistic-dataset-google-certificate-capstone/202006-divvy-tripdata.csv', usecols = ['ride_id', 'started_at', 'ended_at', 'member_casual'])
df_202007 = pd.read_csv('../input/cyclistic-dataset-google-certificate-capstone/202007-divvy-tripdata.csv', usecols = ['ride_id', 'started_at', 'ended_at', 'member_casual'])
df_202008 = pd.read_csv('../input/cyclistic-dataset-google-certificate-capstone/202008-divvy-tripdata.csv', usecols = ['ride_id', 'started_at', 'ended_at', 'member_casual'])
df_202009 = pd.read_csv('../input/cyclistic-dataset-google-certificate-capstone/202009-divvy-tripdata.csv', usecols = ['ride_id', 'started_at', 'ended_at', 'member_casual'])
df_202010 = pd.read_csv('../input/cyclistic-dataset-google-certificate-capstone/202010-divvy-tripdata.csv', usecols = ['ride_id', 'started_at', 'ended_at', 'member_casual'])
df_202011 = pd.read_csv('../input/cyclistic-dataset-google-certificate-capstone/202011-divvy-tripdata.csv', usecols = ['ride_id', 'started_at', 'ended_at', 'member_casual'])
df_202012 = pd.read_csv('../input/cyclistic-dataset-google-certificate-capstone/202012-divvy-tripdata.csv', usecols = ['ride_id', 'started_at', 'ended_at', 'member_casual'])

In [None]:
#create a new dataframe for the year 2020
frame_2020 = [df_2020_Q1, df_202004, df_202005, df_202006, df_202007, df_202008, df_202009, df_202010, df_202011, df_202012]
df_2020 = pd.concat(frame_2020)
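The ten near-identical `read_csv` calls could also be written as a loop. This is only a sketch: `load_trips` and `KEEP` are hypothetical names, and in-memory CSV text with made-up rows stands in for the real files so the example runs on its own.

```python
import io
import pandas as pd

# Columns kept for the analysis (same four as in the cells above).
KEEP = ["ride_id", "started_at", "ended_at", "member_casual"]

def load_trips(sources, usecols=KEEP):
    """Read each CSV source, keep only `usecols`, and stack the results."""
    frames = [pd.read_csv(src, usecols=usecols) for src in sources]
    # ignore_index=True gives the combined frame a fresh 0..n-1 index
    return pd.concat(frames, ignore_index=True)

# In-memory CSVs standing in for the real files (made-up rows).
header = "ride_id,rideable_type,started_at,ended_at,member_casual\n"
csv_a = io.StringIO(header + "A1,docked_bike,2020-01-01 08:00:00,2020-01-01 08:10:00,member\n")
csv_b = io.StringIO(header + "B1,docked_bike,2020-04-01 09:00:00,2020-04-01 09:30:00,casual\n")

trips = load_trips([csv_a, csv_b])
print(trips.shape)  # (2, 4)
```

With the real data, `sources` would be the list of file paths. Unlike the plain `pd.concat` used in the notebook, `ignore_index=True` avoids repeated index labels in the combined frame.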

Create a new column that will display the length of each trip.
<br>
ended_at - started_at = length_trip

In [None]:
#the 2 columns started_at and ended_at were object type 
#let's turn them into a date and time format
df_2020['started_at'] = pd.to_datetime(df_2020['started_at'])
df_2020['ended_at'] = pd.to_datetime(df_2020['ended_at'])

We create a new column which will give the length of each trip (as a timedelta).

In [None]:
df_2020['length_trip'] = df_2020['ended_at'] - df_2020['started_at']

We create a column that will give the day of the week when the ride started.
<br>
<br>
/!\
<ul>
    <li>0 = Monday</li>
    <li>1 = Tuesday</li>
    <li>2 = Wednesday</li>
    <li>3 = Thursday</li>
    <li>4 = Friday</li>
    <li>5 = Saturday</li>
    <li>6 = Sunday</li>
</ul>

In [None]:
df_2020['day_started'] = df_2020['started_at'].dt.dayofweek
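To avoid remembering the number-to-day mapping, pandas can also return the day name directly. A small sketch on three made-up timestamps (not taken from the dataset):

```python
import pandas as pd

starts = pd.Series(pd.to_datetime([
    "2020-04-06 08:00:00",   # a Monday
    "2020-04-11 14:30:00",   # a Saturday
    "2020-04-12 10:15:00",   # a Sunday
]))

day_number = starts.dt.dayofweek   # 0 = Monday ... 6 = Sunday
day_name = starts.dt.day_name()    # readable label, handy for plot axes

print(list(day_number), list(day_name))
```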

Create a new column with the month extracted from the started_at column, so that a seasonal analysis can be performed.

In [None]:
df_2020['month_started'] = df_2020['started_at'].dt.month

Now, our dataset is a giant dataframe of 7 attributes for 3.5 million entries. This is a big chunk to study. That's why it would be better to select a sample of it. However, in the case study, stakeholders don't specify the confidence level and the margin of error wanted.
<br>
<br>
After trying to plot some distributions, my laptop transformed itself into a rocket and took ages to plot one histogram (truly I don't know how long, I stopped the kernel because it was too long).
<br>
<br>
I can play Destiny 2 on my laptop but can't plot a little graph with 3.5 million entries. LOL
<br>
<br>
I have no choice but to choose a sample size. (https://www.surveymonkey.com/mp/sample-size-calculator/)
<br>
<br>
I'll choose the following parameters:
<ul>
    <li>Confidence level: 99%</li>
    <li>Margin of error: 5%</li>
</ul>
Sample size : 666
<br>
Let's round that up to 700, so that if there are rows to remove, the analysis will still be good.
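For reference, the figure from the calculator can be approximated with the standard sample size formula. This is a sketch under assumptions: I don't know the calculator's exact internals, `sample_size` is a hypothetical helper, and it assumes a z-score of 2.58 for 99% confidence and the most conservative proportion p = 0.5.

```python
import math

def sample_size(population, z=2.58, margin=0.05, p=0.5):
    """Cochran's formula followed by a finite population correction."""
    n0 = (z ** 2) * p * (1 - p) / margin ** 2   # size for an infinite population
    n = n0 / (1 + (n0 - 1) / population)        # correct for the finite population
    return math.ceil(n)

print(sample_size(3_541_683))  # 666
```

With the more precise z-score of 2.576 the same formula gives 664, so either way 700 leaves a small safety margin.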

<br>
I'll draw the sample randomly and quickly check it to avoid unfair conclusions.
<br>
<br>
The ratio of casual to annual members is almost the same in the entire dataset (0.63) and in the sample (0.63). It seems that the sample is fair.

In [None]:
df_2020_sampled = df_2020.sample(n = 700)

In [None]:
df_2020_sampled.shape
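The random draw above returns different rows each time the kernel runs. If a repeatable sample were wanted, one option (an assumption here, the notebook itself does not fix a seed) is to pass `random_state` to `sample`. A toy illustration:

```python
import pandas as pd

# Toy population standing in for df_2020.
population = pd.DataFrame({"ride_id": range(1000)})

# The same seed yields the same rows on every run.
first = population.sample(n=5, random_state=42)
second = population.sample(n=5, random_state=42)

print(first.equals(second))  # True
```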

In [None]:
#compare proportion member type in the sample size
prop = df_2020_sampled.groupby('member_casual')['member_casual'].count()
prop

In [None]:
#compare proportion member type in the entire dataset
prop2 = df_2020.groupby('member_casual')['member_casual'].count()
prop2

In [None]:
#proportion sample, between casual and annual members
prop.iloc[0]/prop.iloc[1]

In [None]:
#proportion entire, between casual and annual members
prop2.iloc[0]/prop2.iloc[1]
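The same fairness check can be done in one step with `value_counts(normalize=True)`, which returns the share of each member type directly. A sketch on a made-up series (60 members, 40 casual riders), not on the real data:

```python
import pandas as pd

# Toy column standing in for df_2020['member_casual'].
full = pd.Series(["member"] * 60 + ["casual"] * 40, name="member_casual")
sample = full.sample(n=20, random_state=0)

# Share of each member type, side by side for the full data and the sample.
shares = pd.DataFrame({
    "full": full.value_counts(normalize=True),
    "sample": sample.value_counts(normalize=True),
})
print(shares)
```

If the two columns are close, the sample keeps roughly the same mix of member types as the full dataset.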

### Process
Check if there are mistakes in the data.

In [None]:
df_2020_sampled.info()

Check for NaN or null values

In [None]:
df_2020_sampled.isna().sum()

In [None]:
df_2020_sampled.isnull().sum()

Start with the day_started column. Days have to be in the range 0 to 6. Seems to be good.

In [None]:
df_2020_sampled['day_started'].unique()

Then the length of the trip. All values should be positive.
<br>
The min value is negative, so we have a problem in the length_trip column.
<br>
<br>
There are only 3 negative values. This type of error might be a human error: the values of started_at and ended_at might have been switched.
<br>
3 possibilities to fix it:
<ol>
    <li>Remove these 3 rows. But the sample size won't be good anymore. </li>
    <li>Switch the started_at and ended_at values for these 3 errors.</li>
    <li>Or, simply make the 3 values in length_trip_sec positive.</li>
</ol>
Let's go with the 3rd possibility.
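As far as the trip length is concerned, possibilities 2 and 3 lead to the same result. A minimal sketch on two made-up rides (the second one is deliberately reversed) illustrates this:

```python
import pandas as pd

rides = pd.DataFrame({
    "started_at": pd.to_datetime(["2020-05-01 10:00:00", "2020-05-01 12:30:00"]),
    "ended_at":   pd.to_datetime(["2020-05-01 10:20:00", "2020-05-01 12:00:00"]),
})

# Possibility 2: swap the two timestamps on the reversed rows.
reversed_rows = rides["ended_at"] < rides["started_at"]
start = rides["started_at"].where(~reversed_rows, rides["ended_at"])
end = rides["ended_at"].where(~reversed_rows, rides["started_at"])
secs_swap = (end - start).dt.total_seconds()

# Possibility 3: keep the timestamps and take the absolute duration.
secs_abs = (rides["ended_at"] - rides["started_at"]).dt.total_seconds().abs()

print(list(secs_swap), list(secs_abs))
```

The difference is that possibility 3 leaves started_at and ended_at untouched, so day_started and month_started still come from the original (possibly wrong) start time.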

In [None]:
#to make it easy to detect them, store the length_trip in seconds in a new column
#(total_seconds() returns plain floats, so the comparison with 0 below works)
df_2020_sampled['length_trip_sec'] = df_2020_sampled['length_trip'].dt.total_seconds()

In [None]:
df_2020_sampled[df_2020_sampled['length_trip_sec'] < 0].count()

In [None]:
df_2020_sampled.loc[df_2020_sampled['length_trip_sec'] < 0]

In [None]:
#replace each value by its absolute value
df_2020_sampled['length_trip_sec'] = df_2020_sampled['length_trip_sec'].abs()

Another problem detected is the maximum length_trip. Some length_trip values are huge. We can directly treat them as outliers. Let's look into it.
<br>
The box plot and histogram show us that some length_trip_sec values are gigantic.
<br>
Most of the outliers are with the casual members.

In [None]:
sns.boxplot(x = df_2020_sampled['length_trip_sec'])

In [None]:
sns.histplot(df_2020_sampled['length_trip_sec'])

Let's detect the outliers

In [None]:
#calculate the interquartile range
q25, q50, q75 = np.percentile(df_2020_sampled['length_trip_sec'], [25, 50, 75])
iqr = q75 - q25
iqr

In [None]:
#define the min and max limits beyond which a point is considered an outlier
mini = q25 - 1.5*iqr
maxi = q75 + 1.5*iqr

maxi

In [None]:
#identify the points to remove
points = [x for x in df_2020_sampled['length_trip_sec'] if x > maxi]
print('the largest outlier is', max(points))
print('the smallest outlier is', min(points))
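The usual interquartile rule (Tukey's fences) places the lower limit at Q1 - 1.5·IQR and the upper limit at Q3 + 1.5·IQR. A self-contained sketch on made-up durations; `tukey_fences` is a hypothetical helper name:

```python
import numpy as np

def tukey_fences(values, k=1.5):
    """Return the (lower, upper) limits of the interquartile outlier rule."""
    q25, q75 = np.percentile(values, [25, 75])
    iqr = q75 - q25
    return q25 - k * iqr, q75 + k * iqr

# Made-up trip lengths in seconds, with one obviously extreme value.
durations = np.array([300, 420, 480, 600, 660, 720, 900, 12000])
low, high = tukey_fences(durations)
outliers = durations[(durations < low) | (durations > high)]

print(low, high, outliers)
```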

According to the outlier detection, all values of length_trip_sec above the upper limit (1049 seconds in the run recorded here) are considered outliers.
<br>
<b>/!\ The outliers here are not something to erase and forget. They tell us a lot about how a member uses a bike. In this case it appears that casual members are causing these huge outliers.</b>
<br>
<br>
<i>Here again, this looks like a human error. The day in the timestamp seems to be wrong, but I can't be 100% sure.</i>
<br>
<br>
Let's remove the most extreme values (above 10 000 seconds) to make the boxplot easier to read.

In [None]:
df_2020_sampled[df_2020_sampled['length_trip_sec'] > 10000].count()

In [None]:
df_2020_sampled = df_2020_sampled[df_2020_sampled['length_trip_sec'] <= 10000]

We can see the outliers better now, and there are a lot of them. In this case study, outliers are not negligible, so I'll keep them for the sake of the business task.

In [None]:
sns.boxplot(x = df_2020_sampled['length_trip_sec'])
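The claim that casual members cause most of the long trips can be quantified by computing, per member type, the share of rides above a threshold. This is a sketch on made-up rows; the 1200-second `threshold` is a hypothetical value, not the limit computed above.

```python
import pandas as pd

rides = pd.DataFrame({
    "member_casual": ["member", "member", "member", "casual", "casual", "casual"],
    "length_trip_sec": [400, 650, 900, 700, 2500, 5400],
})

threshold = 1200  # hypothetical cut-off in seconds
rides["is_long"] = rides["length_trip_sec"] > threshold

# The mean of a boolean column is the share of True values in each group.
long_share = rides.groupby("member_casual")["is_long"].mean()
print(long_share)
```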

### Analysis with sampled dataset

##### Display the distribution of length_trip

In [None]:
fig, ax = plt.subplots(figsize = (10,7))
sns.histplot(df_2020_sampled, x = 'length_trip_sec', hue = 'member_casual', ax=ax, kde = True) #plot distribution
plt.title('Distribution of the length of a trip in 2020, annual vs casual member')
plt.show()

In [None]:
pivot = df_2020_sampled.groupby('member_casual')['length_trip_sec'].agg(['mean','max', 'min'])
pivot = pivot.reset_index()
pivot

##### Create a summary table grouped by member types and days

In [None]:
summary = df_2020_sampled.groupby(['member_casual', 'day_started'])['length_trip_sec'].agg(['mean','max', 'min'])
summary = summary.reset_index()
summary

##### Visualize the average length of a trip per day and per member type.

In [None]:
summary['day_started'] = summary['day_started'].apply(str) #turn date into string for the plot bar

In [None]:
sns.catplot(data = summary, kind = 'bar', x = 'day_started', y = 'mean', hue = 'member_casual', height = 7, aspect = 1.2)
plt.title('Average length of a trip per day and per member type')
plt.show()

##### Check which day is the busiest.

In [None]:
day = df_2020_sampled.groupby(['member_casual','day_started'])['day_started'].agg(['count'])
day = day.reset_index()
day
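The same counts can be laid out as a member type by day table with `pd.crosstab`, which makes the two groups easier to compare side by side. A sketch on made-up rows:

```python
import pandas as pd

rides = pd.DataFrame({
    "member_casual": ["member", "member", "casual", "casual", "member"],
    "day_started":   [0, 5, 5, 6, 0],
})

# Rows: member type, columns: day of the week, cells: number of rides.
counts = pd.crosstab(rides["member_casual"], rides["day_started"])
print(counts)
```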

In [None]:
fig, ax = plt.subplots(figsize = (10,7))
sns.histplot(df_2020_sampled, x = 'day_started', hue = 'member_casual', ax=ax, kde = True) #plot distribution
plt.title('Distribution of the number of rides per day of the week')
plt.show()

##### Check the average trip length per month

In [None]:
summary2 = df_2020_sampled.groupby(['member_casual', 'month_started'])['length_trip_sec'].agg(['mean','max', 'min'])
summary2 = summary2.reset_index()
summary2

In [None]:
summary2['month_started'] = summary2['month_started'].apply(str) #turn date into string for the plot bar

In [None]:
sns.catplot(data = summary2, kind = 'bar', x = 'month_started', y = 'mean', hue = 'member_casual', height = 7, aspect = 1.2)
plt.title('Average length of a trip per month and per member type')
plt.show()

##### Which month is the busiest

In [None]:
df_2020_sampled.groupby(['member_casual', 'month_started'])['month_started'].agg(['count'])

In [None]:
fig, ax = plt.subplots(figsize = (10,7))
sns.histplot(df_2020_sampled, x = 'month_started', hue = 'member_casual', ax=ax, kde = True) #plot distribution
plt.title('Distribution of the number of rides per month of the year')
plt.show()

### Conclusion

In the conclusion no numbers will appear since the sample is drawn randomly. If the entire kernel is run again, different values and different plots will be displayed.
<br>
<br>
But, in general, the analysis tells us the following:
<br>
<br>
<b>The average time spent on a bike per ride is clear:</b>
<ul>
    <li>Casual members spend on average more time on their bike than annual members, but their average varies inconsistently across the days of the week.</li>
    <li>Annual members seem to be more predictable, with a more consistent average time spent on their bike.</li>
    <li>In both cases, casual and annual members seem to spend more time on a bike during the weekends.</li>
</ul>
<b>While the number of rides is clearly on the annual members' side:</b>
<ul>
    <li>Annual members have more rides in a day than casual members</li>
    <li>The weekend (Saturday and Sunday) is the period when casual members ride most often</li>
    <li>Same remark as before, the number of rides stays consistent throughout the week for annual members</li>
</ul>
<b>Across seasons, behaviors are distinct:</b>
<ul>
    <li>Spring and summer are the best seasons, both in terms of number of rides and average time per ride</li>
    <li>However, casual members are still unpredictable, with the average time spent on a bike varying inconsistently from month to month.</li>
</ul>
<br>
In short, casual members spend on average more time on a bike than annual members. It appears that casual members create the most and the biggest outliers for the ride time. However, annual members make good use of their benefits, since they take more rides in a day than casual members.