# Cyclistic Case Study 

### Deliverables
* A clear summary of the business task
* A description of all data sources used
* Documentation of any cleaning or manipulation of data
* A summary of the analysis
* Supporting visualizations and key findings
* Top high-level content recommendations based on the analysis

### Business Task

Identify trends and usage patterns among Cyclistic's users in order to increase the company's total revenue.

### Data Sources Used

**Divvy Trip Data**

The data can be found at the following link: https://divvy-tripdata.s3.amazonaws.com/index.html

**Acknowledgement:**

Lyft Bikes and Scooters, LLC (“Bikeshare”)
https://ride.divvybikes.com/data-license-agreement


**More about this Data Set:**

* It is a second-party data set collected by “Bikeshare”
* It is a public data set
* The data was collected with the authorization of the City of Chicago

#### Data Preparation

In [1]:
# Importing the libraries used to clean and manipulate the data.

import pandas as pd 
import numpy as np
import os 

In [2]:
# Getting the current working directory so that files and data are easy to locate.

cwd = os.getcwd()
cwd

'c:\\Users\\Roger\\Desktop\\Cyclist Case Study'

In [3]:
# Listing the files inside the folder that holds the historical data.

files = os.listdir(cwd + "\\Cyclist Data")
files

['202101-divvy-tripdata.csv',
 '202102-divvy-tripdata.csv',
 '202103-divvy-tripdata.csv',
 '202104-divvy-tripdata.csv',
 '202105-divvy-tripdata.csv',
 '202106-divvy-tripdata.csv',
 '202107-divvy-tripdata.csv',
 '202108-divvy-tripdata.csv',
 '202109-divvy-tripdata.csv',
 '202110-divvy-tripdata.csv',
 '202111-divvy-tripdata.csv',
 '202112-divvy-tripdata.csv']

In [4]:
# Loop that concatenates all the data from the historical files and saves it into a new file. 

""" all_data = pd.DataFrame()

for file in files: 
    df_m = pd.read_csv(cwd + "\\Cyclist Data\\" + file)
    all_data = pd.concat([all_data, df_m])

all_data.to_csv("Cyclistic_Trip_Data_2021.csv")
all_data """

# Note: Because I was working on a personal computer, the directories and files refer to that machine. The paths can be changed to suit any other computer.


In [5]:
# Reading the new file with all the data. 

all_data = pd.read_csv("C:\\Users\\Roger\\Desktop\\Cyclist Case Study\\Cyclistic_Trip_Data_2021.csv")
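The concatenation step in cell 4 can also be written without hard-coded Windows paths. The following is a minimal sketch, not the code that produced this notebook's file: the function name `combine_monthly_csvs` and its parameters are hypothetical. It saves the file with `index=False`, so reloading it would not produce the extra `Unnamed: 0` column that appears in the outputs of this notebook.

```python
from pathlib import Path

import pandas as pd


def combine_monthly_csvs(data_dir, output_path):
    """Concatenate every CSV file in data_dir and save the result to output_path.

    Hypothetical helper for illustration; data_dir must contain at least one CSV.
    """
    data_dir = Path(data_dir)
    # Sort the file names so the months are appended in chronological order.
    csv_files = sorted(data_dir.glob("*.csv"))
    # Read every file first and concatenate once instead of growing a frame in a loop.
    frames = [pd.read_csv(csv_file) for csv_file in csv_files]
    combined = pd.concat(frames, ignore_index=True)
    # index=False keeps the row index out of the file.
    combined.to_csv(output_path, index=False)
    return combined
```

A call such as `combine_monthly_csvs("Cyclist Data", "Cyclistic_Trip_Data_2021.csv")` would then replace the loop, assuming the monthly files sit in a folder with that name next to the notebook.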

#### Data Cleaning

In [6]:
# Verifying the data types
all_data.dtypes

Unnamed: 0              int64
ride_id                object
rideable_type          object
started_at             object
ended_at               object
start_station_name     object
start_station_id       object
end_station_name       object
end_station_id         object
start_lat             float64
start_lng             float64
end_lat               float64
end_lng               float64
member_casual          object
dtype: object

In [7]:
# Converting the start and end timestamp columns to datetime.

all_data[["started_at", "ended_at"]] = all_data[["started_at", "ended_at"]].apply(pd.to_datetime, format= '%Y-%m-%d %H:%M:%S')

In [8]:
# Creating a new column for the month, so it is easy to query data based on this condition.

all_data["Month"] = all_data["started_at"].dt.month

In [9]:
# Creating a column with the time elapsed between the start of the trip and the end of it.

all_data["Time_Elapsed"] = (all_data["ended_at"] - all_data["started_at"]).dt.total_seconds() / 60
all_data

Unnamed: 0.1,Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual,Month,Time_Elapsed
0,0,E19E6F1B8D4C42ED,electric_bike,2021-01-23 16:14:19,2021-01-23 16:24:44,California Ave & Cortez St,17660,,,41.900341,-87.696743,41.890000,-87.720000,member,1,10.416667
1,1,DC88F20C2C55F27F,electric_bike,2021-01-27 18:43:08,2021-01-27 18:47:12,California Ave & Cortez St,17660,,,41.900333,-87.696707,41.900000,-87.690000,member,1,4.066667
2,2,EC45C94683FE3F27,electric_bike,2021-01-21 22:35:54,2021-01-21 22:37:14,California Ave & Cortez St,17660,,,41.900313,-87.696643,41.900000,-87.700000,member,1,1.333333
3,3,4FA453A75AE377DB,electric_bike,2021-01-07 13:31:13,2021-01-07 13:42:55,California Ave & Cortez St,17660,,,41.900399,-87.696662,41.920000,-87.690000,member,1,11.700000
4,4,BE5E8EB4E7263A0B,electric_bike,2021-01-23 02:24:02,2021-01-23 02:24:45,California Ave & Cortez St,17660,,,41.900326,-87.696697,41.900000,-87.700000,casual,1,0.716667
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5595058,247535,847431F3D5353AB7,electric_bike,2021-12-12 13:36:55,2021-12-12 13:56:08,Canal St & Madison St,13341,,,41.882289,-87.639752,41.890000,-87.610000,casual,12,19.216667
5595059,247536,CF407BBC3B9FAD63,electric_bike,2021-12-06 19:37:50,2021-12-06 19:44:51,Canal St & Madison St,13341,Kingsbury St & Kinzie St,KA1503000043,41.882123,-87.640053,41.889106,-87.638862,member,12,7.016667
5595060,247537,60BB69EBF5440E92,electric_bike,2021-12-02 08:57:04,2021-12-02 09:05:21,Canal St & Madison St,13341,Dearborn St & Monroe St,TA1305000006,41.881956,-87.639955,41.880254,-87.629603,member,12,8.283333
5595061,247538,C414F654A28635B8,electric_bike,2021-12-13 09:00:26,2021-12-13 09:14:39,Lawndale Ave & 16th St,362.0,,,41.860000,-87.720000,41.850000,-87.710000,member,12,14.216667


In [10]:
# After exploring the data further, I realized that some records had negative or very short durations.
# Those could definitely affect the analysis, so I drop every trip that lasted 0.9 minutes or less.

all_data = all_data.loc[~(all_data["Time_Elapsed"] <= 0.9)]
all_data.reset_index(drop=True, inplace= True)
all_data.head()

Unnamed: 0.1,Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual,Month,Time_Elapsed
0,0,E19E6F1B8D4C42ED,electric_bike,2021-01-23 16:14:19,2021-01-23 16:24:44,California Ave & Cortez St,17660,,,41.900341,-87.696743,41.89,-87.72,member,1,10.416667
1,1,DC88F20C2C55F27F,electric_bike,2021-01-27 18:43:08,2021-01-27 18:47:12,California Ave & Cortez St,17660,,,41.900333,-87.696707,41.9,-87.69,member,1,4.066667
2,2,EC45C94683FE3F27,electric_bike,2021-01-21 22:35:54,2021-01-21 22:37:14,California Ave & Cortez St,17660,,,41.900313,-87.696643,41.9,-87.7,member,1,1.333333
3,3,4FA453A75AE377DB,electric_bike,2021-01-07 13:31:13,2021-01-07 13:42:55,California Ave & Cortez St,17660,,,41.900399,-87.696662,41.92,-87.69,member,1,11.7
4,5,5D8969F88C773979,electric_bike,2021-01-09 14:24:07,2021-01-09 15:17:54,California Ave & Cortez St,17660,,,41.900409,-87.696763,41.94,-87.71,casual,1,53.783333
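Before dropping records, it can help to count how many fall into each problem category. The sketch below uses a small hypothetical sample rather than the real data. It applies the same 0.9-minute threshold as cell 10, with one difference worth knowing: a plain `>` comparison also drops rows whose duration is NaN, whereas `~(x <= 0.9)` keeps them.

```python
import pandas as pd

# Hypothetical sample of trip durations in minutes, including bad records.
sample = pd.DataFrame({"Time_Elapsed": [-2.5, 0.0, 0.716667, 1.333333, 10.416667]})

MIN_MINUTES = 0.9  # same threshold as the filter in cell 10

# Count the records in each category before removing anything.
negative = (sample["Time_Elapsed"] < 0).sum()
too_short = sample["Time_Elapsed"].between(0, MIN_MINUTES).sum()
print(f"negative: {negative}, between 0 and {MIN_MINUTES} minutes: {too_short}")

# .copy() makes the result an independent frame, so later column assignments
# do not trigger a SettingWithCopyWarning.
cleaned = sample.loc[sample["Time_Elapsed"] > MIN_MINUTES].copy().reset_index(drop=True)
print(cleaned)
```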


In [11]:
# Checking whether the data frame has NaN values that could affect the calculations and analysis.

all_data["end_lng"].isnull().values.any() 

True

In [12]:
# Dropping the rows with missing end coordinates.

all_data = all_data.dropna(subset=["end_lat"])
all_data.reset_index(drop=True, inplace= True)
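The check above looks at one column and the drop uses another. A slightly more thorough variant, sketched here on a hypothetical sample, counts the missing values in every column and drops a row when either end coordinate is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with missing station names and coordinates.
sample = pd.DataFrame({
    "ride_id": ["A", "B", "C"],
    "end_station_name": [None, "Canal St & Madison St", None],
    "end_lat": [41.89, np.nan, 41.90],
    "end_lng": [-87.72, np.nan, -87.70],
})

# Number of missing values in every column, not just one.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Drop a row if either end coordinate is missing; missing station names are kept.
cleaned = sample.dropna(subset=["end_lat", "end_lng"]).reset_index(drop=True)
print(cleaned)
```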

In [13]:
# Renaming columns so they are easier to identify during the analysis

all_data.rename(columns={"member_casual":"member_type", "rideable_type":"type_of_bike"}, inplace= True)

#### Analysis and Visualizations

***How do annual members and casual riders use Cyclistic bikes differently?***

First, I wanted to summarize the number of rides taken by annual members and by casual riders.

In [14]:
# Creating another data frame with the number of rides grouped by month and member type

count_of_members = all_data.groupby(["Month", "member_type"]).count()["ride_id"].reset_index(name="Count")
count_of_members

Unnamed: 0,Month,member_type,Count
0,1,casual,17834
1,1,member,77541
2,2,casual,9915
3,2,member,38573
4,3,casual,83085
5,3,member,142439
6,4,casual,134836
7,4,member,197518
8,5,casual,253197
9,5,member,270020


In [15]:
import plotly.express as px

# Add the total for each year
# Change the pop-up label
# Maybe change the colors
# Show every month on the x-axis

fig1 = px.bar(count_of_members, x="Month", y="Count", color="member_type", title="Type of Members by Month", labels= {"Count": "Count of Members", "member_type": "Member Type"}, barmode= "group")
fig1.show()

As shown in the chart, the number of rides increases for both types of users from May to August.

This trend is worth analyzing further. My hypothesis is:

**During those months people use bikes more because it is summer.**

Temperature is likely another influence on this trend. After the summer, ridership falls along with the temperature, which is why winter has the lowest numbers on this chart, even for annual members.

If this hypothesis holds, some users might ride only during that period and use a different type of transportation for the rest of the year.
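One way to put a number on the seasonal pattern is to compute the share of rides taken by casual users in each month. The sketch below is only an illustration: it rebuilds `count_of_members` from the five months visible in the output above, and the `casual_share` column is a name introduced here.

```python
import pandas as pd

# Counts copied from the count_of_members output shown above (months 1 to 5 only).
count_of_members = pd.DataFrame({
    "Month": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "member_type": ["casual", "member"] * 5,
    "Count": [17834, 77541, 9915, 38573, 83085, 142439, 134836, 197518, 253197, 270020],
})

# One row per month, one column per member type.
by_month = count_of_members.pivot(index="Month", columns="member_type", values="Count")

# Fraction of each month's rides that were taken by casual users.
by_month["casual_share"] = by_month["casual"] / (by_month["casual"] + by_month["member"])
print(by_month.round(3))
```

For the months shown, the casual share rises from roughly 19% in January to roughly 48% in May, which fits the idea that casual riding depends more on the season than member riding does.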




Now, let's take a look at the typical duration of trips for each type of user to find out more about their behaviour.

In [16]:
# Creating a new data frame with the typical duration of trips.

# Note: Because a data set this large includes extreme values, I used the median instead of the mean, since the median is less affected by outliers.
# Selecting the column before calling median() avoids aggregating the non-numeric columns.
avg_duration = all_data.groupby("member_type")["Time_Elapsed"].median().reset_index(name="Average Duration Of Trip")
avg_duration     

Unnamed: 0,member_type,Average Duration Of Trip
0,casual,16.166667
1,member,9.75
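The choice of the median over the mean can be illustrated with a few made-up durations. In this hypothetical sample, one bike is kept for a full day (1440 minutes), which is enough to pull the mean far away from what a typical trip looks like.

```python
import pandas as pd

# Hypothetical trip durations in minutes; the last value is an extreme outlier.
durations = pd.Series([8.0, 9.5, 10.0, 11.0, 12.5, 1440.0])

mean_duration = durations.mean()      # strongly affected by the outlier
median_duration = durations.median()  # stays close to the typical trip

print(f"mean: {mean_duration}, median: {median_duration}")
```

Here the mean is 248.5 minutes while the median is 10.5 minutes, so the median describes the usual trip much better.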


In [17]:
# Fix the titles
# Change the width of the chart
# Change the colors
# Add a title


fig2 = px.bar(avg_duration,x="member_type",y="Average Duration Of Trip", title="Duration of Trips")
fig2.show()

We can see that casual users spend more time on their trips than annual members.

Although this result comes from the annual data, the same tendency appears in every month, as shown below:

In [18]:
# Creating another data frame with the median trip duration grouped by member type and month.

avg_month_duration = all_data.groupby(["member_type","Month"])["Time_Elapsed"].median().reset_index(name="Average Duration Of Trip")
avg_month_duration

Unnamed: 0,member_type,Month,Average Duration Of Trip
0,casual,1,12.433333
1,casual,2,16.266667
2,casual,3,18.916667
3,casual,4,18.266667
4,casual,5,19.016667
5,casual,6,17.6
6,casual,7,16.916667
7,casual,8,16.283333
8,casual,9,15.533333
9,casual,10,13.883333


In [19]:
fig3 = px.bar(avg_month_duration, x="Month", y="Average Duration Of Trip", color="member_type", title="Average Duration of Trips", barmode= "group")
fig3.show()

An interesting detail in this visualization is that annual members show a consistent trip duration.

About 10 minutes appears to be the typical duration for this type of member. I assume that annual members use bikes as part of their daily routine.

As noted before, casual users seem to take longer trips than annual members.
They might use bikes in a different way, perhaps for a specific activity.

Now, I would like to calculate the distance traveled on every trip to gain insights that complement the previous results.

In [20]:
from geopy.distance import great_circle

# I used another library for geolocation.
# geopy calculates the distance between two given points, in this case the start and end coordinates in the data frame.
# To reduce processing time, I used the "great circle" method instead of "geodesic"; great circle is less accurate but faster.
# Note: this is the straight-line distance between the start and end points, not the route actually ridden.
# Columns are selected by name, since positional access such as row[9] is deprecated in recent pandas versions.

all_data['Distance_KM'] = all_data.apply(lambda row: great_circle((row["start_lat"], row["start_lng"]), (row["end_lat"], row["end_lng"])).kilometers, axis=1)
all_data.head()

Unnamed: 0.1,Unnamed: 0,ride_id,type_of_bike,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_type,Month,Time_Elapsed,Distance_KM
0,0,E19E6F1B8D4C42ED,electric_bike,2021-01-23 16:14:19,2021-01-23 16:24:44,California Ave & Cortez St,17660,,,41.900341,-87.696743,41.89,-87.72,member,1,10.416667,2.242247
1,1,DC88F20C2C55F27F,electric_bike,2021-01-27 18:43:08,2021-01-27 18:47:12,California Ave & Cortez St,17660,,,41.900333,-87.696707,41.9,-87.69,member,1,4.066667,0.556328
2,2,EC45C94683FE3F27,electric_bike,2021-01-21 22:35:54,2021-01-21 22:37:14,California Ave & Cortez St,17660,,,41.900313,-87.696643,41.9,-87.7,member,1,1.333333,0.280032
3,3,4FA453A75AE377DB,electric_bike,2021-01-07 13:31:13,2021-01-07 13:42:55,California Ave & Cortez St,17660,,,41.900399,-87.696662,41.92,-87.69,member,1,11.7,2.248213
4,5,5D8969F88C773979,electric_bike,2021-01-09 14:24:07,2021-01-09 15:17:54,California Ave & Cortez St,17660,,,41.900409,-87.696763,41.94,-87.71,casual,1,53.783333,4.536549
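Applying a Python function to several million rows one at a time is slow. As an alternative sketch, not the method used in this notebook, the great-circle distance can be computed for whole columns at once with the haversine formula in NumPy. The function name `haversine_km` is introduced here for illustration, and the Earth radius constant is an assumed mean value.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.009  # assumed mean Earth radius in kilometers


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in km between points given in decimal degrees.

    Accepts scalars, lists, NumPy arrays or pandas Series of equal length.
    """
    lat1, lng1, lat2, lng2 = (
        np.radians(np.asarray(values, dtype=float))
        for values in (lat1, lng1, lat2, lng2)
    )
    dlat = lat2 - lat1
    dlng = lng2 - lng1
    # Haversine formula: a is the square of half the chord length between the points.
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlng / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Two trips taken from the tables in this notebook: one regular trip and one round trip.
distances = haversine_km(
    [41.900341, 41.899368], [-87.696743, -87.64848],
    [41.89, 41.899368], [-87.72, -87.64848],
)
print(distances)
```

With the data frame from this notebook, the call would be `all_data["Distance_KM"] = haversine_km(all_data["start_lat"], all_data["start_lng"], all_data["end_lat"], all_data["end_lng"])`. The first sample distance should come out close to the 2.24 km shown in the table above.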


Now that we know the distance of each trip, we can perform some calculations to find out more about the behaviour of Cyclistic's users.

In [21]:
# Selecting the column before calling median() avoids aggregating the non-numeric columns.
all_data.groupby("member_type")["Distance_KM"].median()

# Note: As before, I used the median instead of the mean because it is less affected by extreme values.

member_type
casual    1.754479
member    1.583348
Name: Distance_KM, dtype: float64

As we can see, the median distances for the two member types are quite similar, but casual users again show the higher value: they not only take longer trips in time but also ride longer distances.

Combining those insights supports another hypothesis:

**Casual users ride for a different purpose, one that involves more time and longer distances than the trips of annual members.**

There could be plenty of reasons for this preference, but I think it can be related to:

* Spend less money and time on transportation
* Quick rides around an area
* Tourism 

We can also take into consideration the highest Distance records of the Data Frame.

In [22]:
all_data.groupby("member_type")["Distance_KM"].nlargest(5)

member_type         
casual       5209815    114.383732
             887444      33.800227
             2064834     32.210766
             2949419     31.906523
             4188359     31.008518
member       1377420     32.022889
             1907117     31.559257
             815931      28.735410
             1331181     27.690401
             2921496     26.344355
Name: Distance_KM, dtype: float64

Although these values are very high compared with the rest, they do not represent typical user behaviour.

To show that those values are unusual, we can use percentiles to determine what share of the data they represent.

In [23]:
ordered_data = all_data.sort_values(by ="Distance_KM")
np.percentile(ordered_data["Distance_KM"],99)

9.483551768984926

The result shows that 99% of the values in the "Distance_KM" column are less than or equal to about 9.5 km.

This means that records above roughly 9.5 km represent at most 1% of the data.
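The same reasoning can be checked directly by counting the share of records above the percentile. This sketch uses a hypothetical array of 200 evenly spaced distances instead of the real column. Note that `np.percentile` does not need the data to be sorted first.

```python
import numpy as np

# Hypothetical distances in km: 0.1, 0.2, ..., 20.0 (200 values, deliberately unsorted).
distances = np.arange(1, 201) / 10
rng = np.random.default_rng(seed=0)
rng.shuffle(distances)

# Several percentiles can be requested in a single call.
p50, p95, p99 = np.percentile(distances, [50, 95, 99])

# Fraction of records strictly above the 99th percentile.
share_above_p99 = (distances > p99).mean()
print(p50, p95, p99, share_above_p99)
```

With the real data, replacing `distances` with `all_data["Distance_KM"]` should give a share of about 1% or less above the 99th percentile.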

Also, we can analyze the lowest records.

In [24]:
all_data.groupby("member_type")["Distance_KM"].nsmallest(5)

member_type      
casual       230     0.0
             1508    0.0
             4304    0.0
             4369    0.0
             4370    0.0
member       140     0.0
             231     0.0
             1507    0.0
             2737    0.0
             3590    0.0
Name: Distance_KM, dtype: float64

As we can see, 0 km is the lowest value in the distance column. But that doesn't mean the data is wrong.

Let's take a look at those records in order to understand them.

In [25]:
all_data.loc[all_data["Distance_KM"] == 0].head()

Unnamed: 0.1,Unnamed: 0,ride_id,type_of_bike,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_type,Month,Time_Elapsed,Distance_KM
140,143,614F73576809533D,classic_bike,2021-01-14 15:32:16,2021-01-14 15:34:08,Halsted St & North Branch St,KA1504000117,Halsted St & North Branch St,KA1504000117,41.899368,-87.64848,41.899368,-87.64848,member,1,1.866667,0.0
230,233,8527740375523C06,docked_bike,2021-01-25 17:04:53,2021-01-25 17:41:25,California Ave & Cortez St,17660,California Ave & Cortez St,17660,41.900363,-87.696704,41.900363,-87.696704,casual,1,36.533333,0.0
231,234,DFCD4D55C73B88AB,classic_bike,2021-01-07 07:51:06,2021-01-07 08:27:53,California Ave & Cortez St,17660,California Ave & Cortez St,17660,41.900363,-87.696704,41.900363,-87.696704,member,1,36.783333,0.0
1507,1511,B4BBE7DD0E518C6A,classic_bike,2021-01-17 19:50:29,2021-01-17 20:02:36,Rush St & Hubbard St,KA1503000044,Rush St & Hubbard St,KA1503000044,41.890173,-87.626185,41.890173,-87.626185,member,1,12.116667,0.0
1508,1512,86764570E4F76AC0,docked_bike,2021-01-14 12:44:31,2021-01-14 13:01:08,Michigan Ave & 8th St,623,Michigan Ave & 8th St,623,41.872773,-87.623981,41.872773,-87.623981,casual,1,16.616667,0.0


Although the distance is 0 km, the time elapsed between the start and end of the trip must be taken into consideration.

Apparently, users take trips and then return the bikes to the station where they started, which is why the distance is 0.

We can go further and find the hours when users ride the most.
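To see how common these round trips are for each member type, the start and end station IDs can be compared directly. The sketch below runs on a small hypothetical sample, and the `round_trip` column is a name introduced here.

```python
import pandas as pd

# Hypothetical sample; the last trip has no recorded end station.
sample = pd.DataFrame({
    "member_type": ["casual", "casual", "member", "member", "member"],
    "start_station_id": ["17660", "623", "17660", "KA1504000117", "13341"],
    "end_station_id": ["17660", "623", "17660", "TA1305000006", None],
})

# True when the bike was returned to the station where the trip started.
# A missing end station compares as False, so it is not counted as a round trip.
sample["round_trip"] = sample["start_station_id"] == sample["end_station_id"]

# The mean of a boolean column is the fraction of True values.
round_trip_share = sample.groupby("member_type")["round_trip"].mean()
print(round_trip_share)
```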

In [26]:
# Creating a new Column with the hour of the records and then grouping them into a new Data Frame

all_data["Hour"] = all_data["started_at"].dt.hour
hours = all_data.groupby(["Hour","member_type"]).count()["ride_id"].reset_index(name="Count")
hours.head()


Unnamed: 0,Hour,member_type,Count
0,0,casual,53010
1,0,member,32495
2,1,casual,38579
3,1,member,21449
4,2,casual,25099


In [27]:

fig4 = px.line(hours, x='Hour', y='Count', color="member_type")
fig4.show()

As can be seen, bike demand is higher at certain hours.

This is most evident for **annual members**, who show two relevant peaks during the day.
The first is between 7:00 and 8:00 and the second, which is the highest, runs from 15:00 to 17:00.
After that, demand decreases noticeably.

Given those hours, the peaks appear to coincide with people going to work and returning home.
**If that is correct, it supports the hypothesis that annual members use the bikes only as part of their routine.**

**Casual users**, in contrast, show a steady increase during the day. As noted before, this could mean that they use bikes for sporadic activities.
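The commuting hypothesis could be tested further by separating weekdays from weekends, since commute peaks should appear mainly on weekdays. This is a sketch on a hypothetical sample, not part of the original analysis. The `day_type` column is a name introduced here.

```python
import pandas as pd

# Hypothetical sample of trip start times.
sample = pd.DataFrame({
    "started_at": pd.to_datetime([
        "2021-06-07 08:05:00",  # Monday
        "2021-06-07 17:10:00",  # Monday
        "2021-06-08 17:30:00",  # Tuesday
        "2021-06-12 13:00:00",  # Saturday
        "2021-06-13 14:20:00",  # Sunday
    ]),
    "member_type": ["member", "member", "member", "casual", "casual"],
})

sample["Hour"] = sample["started_at"].dt.hour
# dayofweek is 0 for Monday through 6 for Sunday, so 5 and 6 are the weekend.
sample["day_type"] = sample["started_at"].dt.dayofweek.map(
    lambda day: "weekend" if day >= 5 else "weekday"
)

# Rides per hour, split by day type and member type.
hourly = sample.groupby(["day_type", "member_type", "Hour"]).size().reset_index(name="Count")
print(hourly)
```

Plotting `hourly` with `facet_col="day_type"` in the same way as the charts in this notebook would show whether the morning and afternoon peaks disappear on weekends.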


Finally, we can also get some insights about the type of bikes users prefer during the day. 

In [28]:
# Creating a new Data Frame with the count of Type of Member, Type of Bike and Hour of the day. 

bikes = all_data.groupby(["Hour","type_of_bike","member_type"]).count()["ride_id"].reset_index(name="Count")
bikes

Unnamed: 0,Hour,type_of_bike,member_type,Count
0,0,classic_bike,casual,25294
1,0,classic_bike,member,18450
2,0,docked_bike,casual,6730
3,0,electric_bike,casual,20986
4,0,electric_bike,member,14045
...,...,...,...,...
116,23,classic_bike,casual,35861
117,23,classic_bike,member,29162
118,23,docked_bike,casual,8929
119,23,electric_bike,casual,28169


In [29]:
fig5 = px.line(bikes, x='Hour', y='Count', color="type_of_bike", facet_col="member_type")
fig5.show()

This visualization is consistent with the usage pattern shown above.
It also shows users' preferences regarding the type of bike.

Classic bikes are the most popular among both types of users. They are practical, easy to use, and a good option for users who just need a quick ride.

Electric bikes are in second place for both types of users. My guess is that this type of bike is used for longer trips.

Finally, docked bikes do not seem to be used by many people.

Only casual users use this type of bike. Perhaps demand from both annual and casual users for classic and electric bikes causes a shortage of them, so some casual users end up using docked bikes.
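The bike-type preferences can also be expressed as shares per member type, which makes the two groups easier to compare than raw counts. The following sketch uses a hypothetical sample with the column names from this notebook.

```python
import pandas as pd

# Hypothetical sample of rides.
sample = pd.DataFrame({
    "member_type": ["casual"] * 4 + ["member"] * 4,
    "type_of_bike": [
        "classic_bike", "classic_bike", "docked_bike", "electric_bike",
        "classic_bike", "classic_bike", "classic_bike", "electric_bike",
    ],
})

# normalize="index" turns each row into fractions that sum to 1 per member type.
bike_shares = pd.crosstab(sample["member_type"], sample["type_of_bike"], normalize="index")
print(bike_shares)
```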

#### Summary of the Analysis

After analyzing the data, a few insights tell us more about users' preferences and can inform the company's decisions:

1. In specific months *(May to August)* the number of rides increases. This can have several causes, but I think the main one is temperature: the colder it is, the fewer rides users take.
This can directly affect casual users' decisions. If they are only going to use bikes during that period, then getting the annual membership might not be the best option for them.

2. Casual users take rides that require more time and cover longer distances than those of annual members.
    * As shown above, annual members seem to use bikes in a routine way. My hypothesis is that annual members use bikes mostly for commuting to work. According to the visualizations, the hours with the highest demand are also the ones when people typically go to work and return home.
    * Casual users behave more sporadically. Their rides increase throughout the day, but they do not appear to use bikes routinely or for a specific purpose. Instead, they might have other reasons to use the service, such as:
    
        * Spend less money and time on transportation
        * Quick rides around an area
        * Tourism 

#### Recommendations:

Finally, I would like to add some recommendations for improving the data so that future analyses can be more complete:

1. It would be interesting to see how users pay for their memberships or full-day passes. Annual members and casual users might prefer different payment methods. If the annual membership accepts only certain methods, that could give us another insight into why casual users do not choose the annual plan.

2. An identifier for every user would be beneficial. With an ID, we could learn more about users' behaviour: whether they live in the city, whether they ride on certain days of the week, their age, gender, etc. That information could be used to create different segments and analyze them with more precision.

3. Finally, additional information about the service itself would be helpful. The prices of memberships and full-day passes would help the analyst understand more about the users, as would the characteristics of every bike.