# **Exploratory Data Analysis on Chicago Bicycle Rent Usage**
_An analysis aimed at finding key insights that will help increase the usage of rental cycles._

![img](https://i.imgur.com/55wXYIq.jpg)
## **Introduction**
Cycles have been around for a very long time and inspired the development of more advanced vehicles such as motorbikes. This simple machine was first invented in 1817 by the German inventor Karl von Drais. It has evolved a lot since then and is still used for a variety of activities today.

Some of the reasons why cycles are so special are as follows:
- Eco Friendly
- Easy to use
- Cheaper to buy
- Less maintenance
- Easy to place (compactness)

With advances in technology, new features such as **geotracking** and **electric bicycles** have been developed. All of this has made renting cycles to the public a promising business opportunity. One such company is **Cyclistic**, a Chicago-based company that provides cycle rental services. **Cyclistic** offers both casual daily passes and annual memberships. The main motivation of this EDA is to find insights into the behaviour of cyclists and to come up with strategies to increase cycle usage, using the Kaggle Cyclistic dataset.

**The following steps will be followed in the analysis :**

- Downloading the [Cyclistic](https://www.kaggle.com/datasets/gunnarn/chicago-bicycle-rent-usage?select=202207-divvy-tripdata.csv) Dataset from Kaggle.
- Installing and Importing essential libraries for analysis
- Preprocessing and data cleaning
- Asking and Answering Questions 
- Exploratory Data Analysis (EDA)
- Writing a conclusion

## **Downloading the Cyclistic Dataset from Kaggle**

In [None]:
# Installing the opendatasets python library
!pip install opendatasets --quiet

[opendatasets](https://pypi.org/project/opendatasets/#:~:text=opendatasets%20is%20a%20Python%20library,using%20a%20simple%20Python%20command.) is a Python library used for downloading datasets from Kaggle and Google Drive.

In [None]:
import opendatasets as od


In [None]:
url='https://www.kaggle.com/datasets/gunnarn/chicago-bicycle-rent-usage?select=202207-divvy-tripdata.csv'
od.download(url)

All the csv files have been downloaded from Kaggle. For this analysis we will use the latest data, which is from July 2022.

In [None]:
file_path='./chicago-bicycle-rent-usage/202207-divvy-tripdata.csv'

### **About the Data**

The complete dataset consists of 28 .csv files ranging from April 2020 to July 2022. For this analysis we will be using the latest (July 2022) file.

The dataset consists of 13 columns.
- `ride_id : ID of the ride`
- `rideable_type : Type of bike used`
- `started_at : Starting time`
- `ended_at : Ending time`
- `start_station_name : Name of the starting station`
- `start_station_id : Id of the starting station`
- `end_station_name : Name of the ending station`
- `end_station_id : Id of the ending station`
- `start_lat : Latitude of the starting station`
- `start_lng : Longitude of the starting station`
- `end_lat : Latitude of the ending station`
- `end_lng : Longitude of the ending station`
- `member_casual : Type of the user(member/casual user)`

Classification of data:

- `'started_at'` and `'ended_at'` are **time series data**
- `'start_lat'`, `'start_lng'`, `'end_lat'`, `'end_lng'` are **geographical data**
- `'rideable_type'` and `'member_casual'` are **categorical data**

## **Installing and Importing Essential Libraries for Analysis**

In [None]:
!pip install pandas numpy matplotlib plotly seaborn pyarrow wordcloud --quiet

In [None]:
from collections import defaultdict
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

**Uses of the installed libraries**
- For Data Manipulation and Querying
    - Pandas
    - Numpy
- For Visualisation
    - Seaborn
    - Matplotlib
    - Plotly.express

## **Preprocessing and Data Cleaning**

Reading the .csv file using pandas. Since `started_at` and `ended_at` are time-based data, we can parse them as dates while loading the dataset.

In [None]:
raw_bike_df=pd.read_csv(file_path, parse_dates=['started_at', 'ended_at'])

In [None]:
raw_bike_df

In [None]:
raw_bike_df.head()

In [None]:
raw_bike_df.info()

As we can see, pandas has loaded `started_at` and `ended_at` as datetime columns, and `rideable_type` and `member_casual` as object columns.

In [None]:
raw_bike_df['rideable_type'].unique()

As we can see, there are 3 types of bikes used by the riders.

In [None]:
raw_bike_df['member_casual'].unique()

Similarly, `member_casual` has only a few distinct values.
So let's convert `rideable_type` and `member_casual` to the categorical data type.

In [None]:
raw_bike_df['rideable_type']=raw_bike_df['rideable_type'].astype('category')
raw_bike_df['member_casual']=raw_bike_df['member_casual'].astype('category')

In [None]:
raw_bike_df.info()

As we can see, `rideable_type` and `member_casual` have now been converted into categorical data.
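
One practical reason for the conversion is memory: a categorical column stores each label once and keeps small integer codes per row. The following is a minimal sketch on a made-up toy column (not the Cyclistic data) that compares the two representations with `memory_usage(deep=True)`.

```python
import pandas as pd

# Toy column: three labels repeated many times (made-up data, not the real dataset).
toy = pd.DataFrame(
    {"rideable_type": ["classic_bike", "electric_bike", "docked_bike"] * 10_000}
)

# deep=True also counts the memory used by the Python strings themselves.
as_object = toy["rideable_type"].memory_usage(deep=True)
as_category = toy["rideable_type"].astype("category").memory_usage(deep=True)

print(f"original dtype: {as_object:,} bytes")
print(f"category dtype: {as_category:,} bytes")
```

The exact byte counts depend on the pandas version, so none are quoted here; the categorical version should be the smaller of the two.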

### **Analysing null and duplicate values**

#### **Handling Duplicate Data**

In [None]:
raw_bike_df.duplicated().sum()

The dataset does not have any duplicate data.

#### **Handling Missing Values**

Let's use the `isna()` function to check for missing values.

In [None]:
raw_bike_df.isna().sum()

As we can see, the missing values are in `start_station_name`, `start_station_id`, `end_station_name` and `end_station_id`, and a few rows also lack `end_lat` and `end_lng`.

To avoid confusion in the analysis, let's go ahead and drop the rows that have missing data.

In [None]:
raw_bike_df=raw_bike_df.dropna()
raw_bike_df.isna().sum()
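
Before dropping rows it can be useful to know how much data is lost. The sketch below uses a small made-up frame (the values are illustrative, only the column names mirror the dataset) to count missing values per column and the share of rows that `dropna()` removes.

```python
import numpy as np
import pandas as pd

# Made-up rows; only the column names are taken from the dataset.
toy = pd.DataFrame({
    "start_station_name": ["A", None, "C", "D"],
    "end_station_name": ["B", "B", None, "E"],
    "end_lat": [41.9, 41.9, np.nan, 41.8],
})

missing_per_column = toy.isna().sum()      # missing values in each column
cleaned = toy.dropna()                     # keep only fully populated rows
share_dropped = 1 - len(cleaned) / len(toy)

print(missing_per_column)
print(f"Rows dropped: {share_dropped:.0%}")
```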

**Let's create a copy of the raw dataframe before continuing with the analysis.**

In [None]:
bike_df=raw_bike_df.copy()

### **Normalizing the Geographical Data based on the Start Location and End Location**

Let's get all the distinct start and end stations in the dataset and store them in lists.

In [None]:
start_station_list=bike_df['start_station_name'].unique().tolist()
end_station_list=bike_df['end_station_name'].unique().tolist()

In [None]:
start_station_list[0]

To analyse the problem, a sample start station is taken and stored in the `sample_start_station` variable.

In [None]:
sample_start_station=start_station_list[0]

In [None]:
bike_df[bike_df['start_station_name']==sample_start_station]['start_lat'].nunique()

As we can see, there are 438 distinct latitudes recorded for `"Ashland Ave & Blackhawk St"`.


To solve this problem we have to make sure that **every occurrence of a particular station has the same latitude and longitude.**

In [None]:
for start_station in start_station_list:
  df=bike_df[bike_df['start_station_name']==start_station][['start_lat','start_lng']]
  start_lat=df.start_lat.mean()
  start_lng=df.start_lng.mean()
  bike_df.loc[bike_df['start_station_name']==start_station, ['start_lat','start_lng']]=[start_lat,start_lng]

In [None]:
for end_station in end_station_list:
  df=bike_df[bike_df['end_station_name']==end_station][['end_lat','end_lng']]
  end_lat=df.end_lat.mean()
  end_lng=df.end_lng.mean()
  bike_df.loc[bike_df['end_station_name']==end_station, ['end_lat','end_lng']]=[end_lat,end_lng]

In the above blocks of code, the rows are grouped by start and end station by iterating through _start_station_list_ and _end_station_list_, and every row of a station is assigned the mean latitude and longitude of that station.
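
The same per-station averaging can also be written without an explicit loop. This is a minimal sketch on made-up coordinates, using `groupby(...).transform("mean")` as an alternative technique to the loops above; on a large frame it usually avoids re-filtering the whole dataframe once per station.

```python
import pandas as pd

# Made-up coordinates: station "A" has two slightly different readings.
toy = pd.DataFrame({
    "start_station_name": ["A", "A", "B", "B"],
    "start_lat": [41.90, 41.92, 41.80, 41.80],
    "start_lng": [-87.60, -87.62, -87.70, -87.70],
})

coords = ["start_lat", "start_lng"]
# transform("mean") returns one row per original row, holding its group's mean.
toy[coords] = toy.groupby("start_station_name")[coords].transform("mean")

# Every station should now have exactly one latitude.
print(toy.groupby("start_station_name")["start_lat"].nunique())
```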

Let's verify this by checking the number of unique latitudes for the _sample_start_station_ (`"Ashland Ave & Blackhawk St"`).

In [None]:
bike_df[bike_df['start_station_name']==sample_start_station]['start_lat'].nunique()

As we can see, unlike before there is now only 1 unique latitude for the _sample_start_station_ (`"Ashland Ave & Blackhawk St"`).

### **Saving the Intermediate Results**

In [None]:
bike_df.to_csv('bike_df.csv')

In [None]:
bike_df.head()

## **Exploratory Data Analysis (EDA)**

Before we get started with asking and answering questions, let's explore the columns in the dataset and how they relate to one another.

### `rideable_type`
The `rideable_type` column describes the type of bike used by the rider.

In [None]:
bike_df['rideable_type'].unique()

The `rideable_type` column has three unique values: classic bike, docked bike and electric bike. Let's visualize the share of each bike type in the dataset using a pie chart.

In [None]:
# rename_axis + reset_index(name=...) gives the same column names in pandas 1.x and 2.x
ride_df=bike_df['rideable_type'].value_counts().rename_axis('bike_type').reset_index(name='freq')

In [None]:
px.pie(ride_df,values='freq',names='bike_type')

Out of all rides, classic bikes account for the largest share at 58 %, followed by electric bikes at 37 % and docked bikes at 4.76 %.

### `member_casual`
The `member_casual` column tells us whether the rider is a member or a casual customer.

In [None]:
mem_ca=bike_df['member_casual'].value_counts()
mem_ca

There are a total of 331,002 rides by members and 311,678 rides by casual riders. Let's visualize this using a histogram.

In [None]:
px.histogram(bike_df['member_casual'], histnorm='percent')

Members account for around 51 % of the rides and casual riders for around 49 %.

### `start_station_name`

The name of the station where the riders have started their ride. 

In [None]:
top_start=bike_df['start_station_name'].value_counts().head(10)
top_start

Let's visualize the top 10 most popular starting stations.

In [None]:
fig=px.bar(x=top_start.index, y=top_start)
fig.update_layout(title="Top 10 Starting Stations")

The top 10 starting stations used by the riders are shown in the bar chart above.

### `end_station_name`
Name of the station where the riders have ended their journey.

In [None]:
top_end=bike_df['end_station_name'].value_counts().head(10)
top_end

Let's visualize the top 10 most popular ending stations.

In [None]:
fig=px.bar(x=top_end.index, y=top_end)
fig.update_layout(title="Top 10 Ending Stations")

The top 10 ending stations are almost the same as the top 10 starting stations, except for Clark St & Armitage Ave and Clark St & Lincoln Ave in 9th and 10th place.

### `started_at`
Date and time when the rider started the ride.

In [None]:
bike_df['started_at'].head()

Let's visualize and analyze the hours of the day at which riders started their journeys.

In [None]:
time_df=pd.DataFrame()
time_df['starting_hr']=bike_df.started_at.dt.hour
time_df['ending_hr']=bike_df.ended_at.dt.hour
time_df['member_casual']=bike_df['member_casual']
time_df[['starting_hr','member_casual']].head()

In [None]:
sns.histplot(time_df['starting_hr'], bins=24, kde=True, kde_kws=dict(bw_method=0.09)).set(title="Distribution of Starting Time of Riders");
plt.xlabel('Starting Hour')
plt.ylabel('No. of Traffic')

The 17:00 hour has the maximum traffic at the starting stations, and 4:00 has the least.

### `ended_at`
Date and time when the rider ended the ride.

In [None]:
bike_df['ended_at'].head()

Let's visualize and analyze the hours of the day at which riders completed their journeys.

In [None]:
time_df[['ending_hr','member_casual']].head()

In [None]:
sns.histplot(time_df['ending_hr'], bins=24, kde=True, kde_kws=dict(bw_method=0.09)).set(title='Distribution of Ending Time of Riders');
plt.xlabel('Ending Hour')
plt.ylabel('No. of Traffic')

The above distribution is similar to that of the starting time: the highest traffic is at 17:00 and the lowest is at 4:00.

### **Introducing a new column**

- `distance_km`


### `distance_km`
Let's go ahead and create a new column `distance_km` that holds the distance between the starting and ending stations for further analysis.

The `haversine` function calculates the great-circle distance between two points from their starting and ending latitude and longitude.


In [None]:
from math import radians, cos, sin, asin, sqrt
def haversine(row):

    lat1, lon1, lat2, lon2 = map(radians, [row['start_lat'], row['start_lng'], row['end_lat'], row['end_lng']])

    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    r = 6371 
    return c * r
    
bike_df['distance_km']=bike_df.apply(haversine, axis=1)
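
Row-wise `apply` can be slow on several hundred thousand rows. As a hedged alternative, the same haversine formula can be evaluated on whole columns with NumPy. In the sketch below the helper name `haversine_km` and the toy frame are illustrative; the first row reuses the two station coordinates that appear later in this notebook's map cell, and the second row is a made-up round trip.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # same mean Earth radius as used in the cell above

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; accepts scalars, arrays or pandas Series."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

toy = pd.DataFrame({
    "start_lat": [41.892288, 41.0],
    "start_lng": [-87.612082, -87.0],
    "end_lat": [41.881005, 41.0],
    "end_lng": [-87.61679, -87.0],
})

# One vectorized call instead of one Python function call per row.
toy["distance_km"] = haversine_km(
    toy["start_lat"], toy["start_lng"], toy["end_lat"], toy["end_lng"]
)
print(toy["distance_km"])
```

A round trip (identical start and end coordinates) yields a distance of 0 km, which is worth keeping in mind when reading the distance distribution.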

Let's now visualize the distance covered by the riders using a distribution plot.

In [None]:
sns.histplot(bike_df["distance_km"], bins=30, kde=True,kde_kws=dict(bw_method=0.2));
plt.xlabel('Distance(km)')
plt.ylabel('No. of riders')

From the distribution plot, the distance covered by most riders lies between 1 and 2 km.

## **Asking and Answering Questions**

1. What is the relative distribution of the types of bikes used by customers of **Cyclistic**?
2. What are the top 50 start stations?
    - Show the popular stations with a word cloud.
3. What is the total number of routes taken by riders where the start and end stations are not the same?
    - Plot the top 50 of these routes on a map.
4. How is the membership of the customer related to the destination?
5. Which bike is preferred most? Analyse it under various scenarios.
6. How is the distance covered related to the membership of the customers?
7. Which route with a different destination has the most casual riders? Plot it on a map.
8. What is the relationship between the type of rider and the type of bike used?
9. What are the top 3 routes with the maximum distance?
10. At what time of the day is there the most traffic at the starting stations, and how does it compare across rider types (member/casual)?

### **1. What is the relative distribution of types of bikes used by customers of Cyclistic ?**

In [None]:
# Fixed order, so the labels line up with the percentages computed in the next cell
rideable_types=['classic_bike','electric_bike','docked_bike']

In [None]:
classic_relative_count= (bike_df[bike_df['rideable_type']=='classic_bike']['rideable_type'].count()/bike_df['rideable_type'].count())*100
electric_relative_count=(bike_df[bike_df['rideable_type']=='electric_bike']['rideable_type'].count()/bike_df['rideable_type'].count())*100
docked_relative_count=(bike_df[bike_df['rideable_type']=='docked_bike']['rideable_type'].count()/bike_df['rideable_type'].count())*100
relative_frequency=[classic_relative_count,electric_relative_count,docked_relative_count]

In [None]:
fig=px.bar(x=rideable_types, y=relative_frequency)
fig.update_layout(title='Relative Distribution of Bikes Used')
fig.update_xaxes(title_text='Type of Bike')
fig.update_yaxes(title_text='Percentage of Usage')

##### **Insights**:
- Classic bikes are the most popular, accounting for 57 percent of usage.
- They are followed by electric bikes with 37 percent and docked bikes with only 4 percent.
- Docked bikes are used far less than classic and electric bikes.
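
Percentages like these can also be obtained in one step with `value_counts(normalize=True)`, which keeps each label attached to its share. A minimal sketch on a made-up series (the counts are illustrative, not taken from the dataset):

```python
import pandas as pd

# Made-up rides: 6 classic, 3 electric, 1 docked.
toy = pd.Series(
    ["classic_bike"] * 6 + ["electric_bike"] * 3 + ["docked_bike"],
    name="rideable_type",
)

# normalize=True returns fractions; multiply by 100 for percentages.
share_pct = toy.value_counts(normalize=True).mul(100).round(1)
print(share_pct)
```

Because the result is indexed by bike type, it can be passed to a bar chart without maintaining a separate list of labels.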

### **2. What are top 50 start stations used ?**

In [None]:
# rename_axis + reset_index(name=...) gives the same column names in pandas 1.x and 2.x
top_50_start_df=bike_df['start_station_name'].value_counts().head(50).rename_axis('start_station_name').reset_index(name='frequency')

In [None]:
start_50=top_50_start_df['start_station_name'].tolist()

In [None]:
mask=bike_df['start_station_name'].isin(start_50)
start_50_full_df=bike_df[mask]


In [None]:
data=start_50_full_df['start_station_name']
fig=px.histogram(data, histfunc='count')
fig = fig.update_layout(title='Top 50 Start Stations',barmode='overlay', yaxis=dict(title='Frequency'), xaxis=dict(categoryorder='total descending'))
fig.update_xaxes(title_text='Start Stations')

fig

- #### **Show the popular stations with a word cloud.**

In [None]:
from PIL import Image
from wordcloud import WordCloud
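# Join all station names into one string; str() on a large array would truncate it with '...'
text = ' '.join(bike_df['start_station_name'].astype(str))
wordcloud = WordCloud(regexp=r'\b\w+\b',width=800, height=400, background_color='white').generate(text)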

plt.figure( figsize=(20,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show();

#### **Insights From Histogram and Word Cloud**
- The histogram shows the top 50 start stations used by the riders, with `Streeter Dr & Grand Ave` as the most popular start station at 13877 rides.
- `DuSable Lake Shore Dr & North Blvd` is second with 8177 rides.
- `Michigan Ave & Oak St` is third with 7429 rides.

### **3. What is total number of routes taken by the riders where start and end stations are not the same ?**


In [None]:
bike_df['start_end_station']=bike_df['start_station_name']+' --> '+bike_df['end_station_name']
bike_df[['start_station_name','end_station_name','start_end_station']].head(3)

In [None]:
distinct_routes=bike_df[bike_df['start_station_name']!=bike_df['end_station_name']]
answer=distinct_routes['start_end_station'].nunique()
print(f"There are a total of {answer} routes taken by the customers where start and end stations are not the same")

In [None]:
# rename_axis + reset_index(name=...) gives the same column names in pandas 1.x and 2.x
distinct_routes_temp=distinct_routes['start_end_station'].value_counts().head(50).rename_axis('start_end_station').reset_index(name='frequency')
distinct_routes_list=distinct_routes_temp['start_end_station'].tolist()
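
The route count can also be derived by grouping on the two station columns directly, which avoids building a concatenated string key. The following sketch uses made-up station names to show the idea; it is an alternative formulation, not the approach used in the cells above.

```python
import pandas as pd

# Made-up trips between stations A, B and C.
toy = pd.DataFrame({
    "start_station_name": ["A", "A", "B", "A", "C"],
    "end_station_name":   ["B", "B", "A", "A", "B"],
})

# Keep only trips that end at a different station than they started from.
one_way = toy[toy["start_station_name"] != toy["end_station_name"]]

# One row per (start, end) pair, sorted by how often the route was ridden.
route_counts = (
    one_way.groupby(["start_station_name", "end_station_name"])
    .size()
    .sort_values(ascending=False)
)
n_routes = len(route_counts)
print(n_routes)
print(route_counts.head())
```

Note that `A -> B` and `B -> A` count as two different routes here, as they do with the `start_end_station` key.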

- #### **Plot 50 Distinct Routes in a Map**

In [None]:
!pip install folium --quiet
dict_se={}
dict_l={}

for route in distinct_routes_list:
  index=distinct_routes['start_end_station'].eq(route).idxmax()
  start_lat=distinct_routes['start_lat'].loc[index]
  start_lng=distinct_routes['start_lng'].loc[index]
  end_lat=distinct_routes['end_lat'].loc[index]
  end_lng=distinct_routes['end_lng'].loc[index]
  dict_se[route]=((start_lat,start_lng),(end_lat,end_lng))

In [None]:
dict_l={}
for loc in dict_se:
    locs=loc.split(' --> ')
    if locs[0] not in dict_l:
        dict_l[locs[0]]=dict_se[loc][0]
    if locs[1] not in dict_l:
        dict_l[locs[1]]=dict_se[loc][1]

In [None]:
import folium

m = folium.Map(location=[41.892288,-87.612082], zoom_start=13.3)

for loc in dict_se.values():
  route = folium.PolyLine(
      locations=loc,
      color='red',
      weight=5,
      opacity=0.7,
      dash_array=(7,7)
  )
  route.add_to(m)


for loc, coor in dict_l.items():
  marker = folium.Marker(location=coor, popup=loc)
  m.add_child(marker)


m

##### **Insights from Querying and Folium Map**

- There are about 89,702 distinct routes taken by customers where the start and end stations are not the same.
- From the map it is visible that the `Streeter Dr & Grand Ave` station has the most connections with other stations, making it the most popular station.

### **4. How is the membership of the customer related to the destination?**

In [None]:
# Function to add a new column destination
# Inserts same if the destination is same and not_same if the destination is different

def new_col(row):
  if row['start_station_name']==row['end_station_name']:
    return 'same'
  else:
    return 'not_same'

In [None]:
bike_df['destination']=bike_df.apply(new_col, axis=1)

In [None]:
data=bike_df
fig=px.histogram(data['member_casual'], color=data['destination'],histnorm='percent', facet_col=data['destination'])
fig.update_layout(title='Relative Frequency Distribution of Customer Membership based on Routes Taken')


##### **Insights from Histograms**
- Casual riders are in the majority when the starting and ending stations are the same.
- When the destination is different, members are slightly ahead, by about 7 % compared to casual riders.


### **5. Which is the most preferable bike used, analyse it based on various scenarios ?**

In [None]:
data=bike_df['rideable_type']
fig=px.histogram(data)
fig.update_layout(title='Distribution of Bikes used by Riders')

Let's analyse this under various scenarios.

#### **Analysing the bike usage based on distance covered**

In [None]:
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(14, 10))


data1=bike_df[(bike_df['distance_km']>1) & (bike_df['distance_km']<5)]
fig1=sns.histplot(data=data1, x='rideable_type', ax=axes[0, 0])
axes[0, 0].set_title("1 km < Distance < 5 km")

data2=bike_df[(bike_df['distance_km']>5) & (bike_df['distance_km']<10)]
fig2=sns.histplot(data=data2, x='rideable_type', ax=axes[0, 1])
axes[0, 1].set_title("5 km < Distance < 10 km")

data3=bike_df[(bike_df['distance_km']>10) & (bike_df['distance_km']<15)]
fig3=sns.histplot(data=data3,x='rideable_type', ax=axes[1, 0])
axes[1, 0].set_title("10 km < Distance < 15 km")

data4=bike_df[(bike_df['distance_km']>15) & (bike_df['distance_km']<20)]
fig4=sns.histplot(data=data4, x='rideable_type', ax=axes[1, 1])
axes[1, 1].set_title("15 km < Distance < 20 km")
fig.subplots_adjust(wspace=0.3, hspace=0.3)

for rect in fig1.patches:
    x = rect.get_x() + rect.get_width() / 2
    y = rect.get_height()
    fig1.text(x, y, f"{y:.0f}", ha='center', va='bottom')

for rect in fig2.patches:
    x = rect.get_x() + rect.get_width() / 2
    y = rect.get_height()
    fig2.text(x, y, f"{y:.0f}", ha='center', va='bottom')

for rect in fig3.patches:
    x = rect.get_x() + rect.get_width() / 2
    y = rect.get_height()
    fig3.text(x, y, f"{y:.0f}", ha='center', va='bottom')

for rect in fig4.patches:
    x = rect.get_x() + rect.get_width() / 2
    y = rect.get_height()
    fig4.text(x, y, f"{y:.0f}", ha='center', va='bottom')

fig.suptitle("Bike used based on Distance Covered")
plt.show()

##### **Insights from Histograms**

- The usage of classic bikes is higher when the distance is between 1 and 10 km.
- The share of electric bikes increases as the distance increases.
- Between 10 and 15 km, the usage of classic and electric bikes is almost the same.
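
The four distance bands can also be produced in one pass with `pd.cut` and summarised with `pd.crosstab`. This is a minimal sketch on made-up distances; the band labels are illustrative. One difference to be aware of: `pd.cut` uses right-closed intervals by default, so a ride of exactly 5 km falls into the first band, whereas strict `>` and `<` filters would drop it.

```python
import pandas as pd

# Made-up rides with a distance and a bike type.
toy = pd.DataFrame({
    "distance_km": [1.5, 3.0, 6.0, 7.5, 12.0, 16.0],
    "rideable_type": ["classic_bike", "classic_bike", "classic_bike",
                      "electric_bike", "electric_bike", "electric_bike"],
})

bins = [1, 5, 10, 15, 20]                                   # band edges in km
labels = ["1-5 km", "5-10 km", "10-15 km", "15-20 km"]      # illustrative names
toy["distance_band"] = pd.cut(toy["distance_km"], bins=bins, labels=labels)

# Rows: distance band, columns: bike type, values: number of rides.
band_counts = pd.crosstab(toy["distance_band"], toy["rideable_type"])
print(band_counts)
```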


Creating a new column `duration_mins` which contains the travel time in minutes.

In [None]:
# total_seconds() keeps the day component, so rides longer than 24 hours are not truncated
bike_df['duration_mins']=(bike_df['ended_at']-bike_df['started_at']).dt.total_seconds()/60
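
Timedelta columns expose both `.dt.seconds` and `.dt.total_seconds()`, and the two differ for rides that span more than a day: `.dt.seconds` returns only the seconds component (0 to 86399), while `.dt.total_seconds()` returns the full elapsed time. The sketch below uses two made-up rides to show the difference.

```python
import pandas as pd

# Two made-up rides: 30 minutes, and 1 day plus 30 minutes.
toy = pd.DataFrame({
    "started_at": pd.to_datetime(["2022-07-01 10:00:00", "2022-07-01 10:00:00"]),
    "ended_at": pd.to_datetime(["2022-07-01 10:30:00", "2022-07-02 10:30:00"]),
})

elapsed = toy["ended_at"] - toy["started_at"]
toy["mins_seconds_attr"] = elapsed.dt.seconds / 60           # ignores whole days
toy["mins_total_seconds"] = elapsed.dt.total_seconds() / 60  # full duration

print(toy[["mins_seconds_attr", "mins_total_seconds"]])
```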

#### **Analysing the bike usage based on travel duration**

In [None]:
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(14, 10))


data1=bike_df[(bike_df['distance_km']>1) & (bike_df['distance_km']<10) &(bike_df['duration_mins']>1) &(bike_df['duration_mins']<5)]
fig1=sns.histplot(data=data1, x='rideable_type', ax=axes[0, 0])
axes[0, 0].set_title("1 to 5 Minutes")

data1=bike_df[(bike_df['distance_km']>1) & (bike_df['distance_km']<10)&(bike_df['duration_mins']>5) &(bike_df['duration_mins']<10)]
fig2=sns.histplot(data=data1, x='rideable_type', ax=axes[0, 1])
axes[0, 1].set_title("5 to 10 Minutes")

data1=bike_df[(bike_df['distance_km']>1) & (bike_df['distance_km']<10)&(bike_df['duration_mins']>10) &(bike_df['duration_mins']<15)]
fig3=sns.histplot(data=data1,x='rideable_type', ax=axes[1, 0])
axes[1, 0].set_title("10 to 15 Minutes")

data1=bike_df[(bike_df['distance_km']>1) & (bike_df['distance_km']<10)&(bike_df['duration_mins']>15) &(bike_df['duration_mins']<20)]
fig4=sns.histplot(data=data1, x='rideable_type', ax=axes[1, 1])
axes[1, 1].set_title("15 to 20 Minutes")
fig.subplots_adjust(wspace=0.3, hspace=0.3)



for rect in fig1.patches:
    x = rect.get_x() + rect.get_width() / 2
    y = rect.get_height()
    fig1.text(x, y, f"{y:.0f}", ha='center', va='bottom')

for rect in fig2.patches:
    x = rect.get_x() + rect.get_width() / 2
    y = rect.get_height()
    fig2.text(x, y, f"{y:.0f}", ha='center', va='bottom')

for rect in fig3.patches:
    x = rect.get_x() + rect.get_width() / 2
    y = rect.get_height()
    fig3.text(x, y, f"{y:.0f}", ha='center', va='bottom')

for rect in fig4.patches:
    x = rect.get_x() + rect.get_width() / 2
    y = rect.get_height()
    fig4.text(x, y, f"{y:.0f}", ha='center', va='bottom')


fig.suptitle("Bike used based on Travel Duration (distance between 1-10 km)")
plt.show()

##### **Insights from Histograms with Fixed distance of 1-10 km**

- Electric bikes are used more when the travel duration is between 1 and 5 minutes.
- As the travel time increases, the usage of electric bikes declines.
- When the travel time is longer, riders tend to use classic bikes more.

### **6. How is distance covered related to the membership of the customers ?**

In [None]:
# Set the figure size and style first so that they apply to the plot drawn below
sns.set(rc={'figure.figsize':(10,8)})
sns.set_style("darkgrid")

sns.kdeplot(data=bike_df, x="distance_km", hue="member_casual", palette=["#F63366", "#6FC8CE"])
sns.despine()
plt.title('Frequency Polygon by Customer Membership')
plt.xlabel('Distance(km)')
plt.ylabel('Density')

plt.show()


In [None]:

fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(14, 10))

sns.histplot(bike_df[(bike_df['distance_km']>1) & (bike_df['distance_km']<5)]['member_casual'], ax=axes[0, 0])
axes[0, 0].set_title("1 km < Distance < 5 km")

sns.histplot(bike_df[(bike_df['distance_km']>5) & (bike_df['distance_km']<10)]['member_casual'], ax=axes[0, 1])
axes[0, 1].set_title("5 km < Distance < 10 km")

sns.histplot(bike_df[(bike_df['distance_km']>10) & (bike_df['distance_km']<15)]['member_casual'], ax=axes[1, 0])
axes[1, 0].set_title("10 km < Distance < 15 km")

sns.histplot(bike_df[(bike_df['distance_km']>15) & (bike_df['distance_km']<20)]['member_casual'], ax=axes[1, 1])
axes[1, 1].set_title("15 km < Distance < 20 km")
fig.subplots_adjust(wspace=0.3, hspace=0.3)

fig.suptitle("Membership based on Distance Covered")
plt.show()


##### **Insights from the Frequency Polygon and Histogram**

These insights assume that members take a fixed route whenever they travel.
- From the frequency polygon it is visible that members are in the majority when the distance covered is short.
- The number of members tends to decrease as the distance covered increases.



### **7. Which route with different destination has more casual riders, plot it in a map ?**

In [None]:
max_casual_riders=distinct_routes[distinct_routes['member_casual']=='casual'].groupby(['start_end_station'])['member_casual'].size().idxmax()
print(f"The route {max_casual_riders} has the most casual riders when the destination is different.")

In [None]:
bike_df[bike_df['start_station_name']=='DuSable Lake Shore Dr & Monroe St'][['start_lat','start_lng']]

In [None]:
import folium

# Create map object
m = folium.Map(location=[np.array([41.892288,41.881005]).mean(),np.array([-87.612082,-87.61679]).mean()], zoom_start=14)
# Create polyline object


route = folium.PolyLine(
    locations=((41.892288,-87.612082),(41.881005,-87.61679)),
    color='red',
    weight=5,
    opacity=0.7
)
marker=folium.Marker(location=(41.881005,-87.61679),popup='DuSable Lake Shore Dr & Monroe St')
m.add_child(marker)

marker=folium.Marker(location=(41.892288,-87.612082),popup='Streeter Dr & Grand Ave')
m.add_child(marker)

route.add_to(m)
m

#### **Insights from the Folium Map**
- The route `DuSable Lake Shore Dr & Monroe St --> Streeter Dr & Grand Ave` has the most casual riders when the destination is different.

### **8. Find the relationship between Type of Riders vs. Types of Bikes used ?**

In [None]:
fig=px.histogram(bike_df['member_casual'],histnorm='percent', color=bike_df['rideable_type'])
fig.update_layout(title="Type of Riders vs. Types of Bikes")
fig

#### **Insights from the Histogram**
- Casual riders use electric bikes more than members do.
- Members, in contrast, use classic bikes more.
- Docked bikes are used only by casual riders.
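
The same comparison can be read off a table. The sketch below builds a row-normalised cross-tabulation with `pd.crosstab(..., normalize="index")` on made-up rides, so each row shows how one rider type splits across bike types; the numbers are illustrative only.

```python
import pandas as pd

# Made-up rides: 4 by members, 4 by casual riders.
toy = pd.DataFrame({
    "member_casual": ["member"] * 4 + ["casual"] * 4,
    "rideable_type": ["classic_bike", "classic_bike", "classic_bike", "electric_bike",
                      "classic_bike", "electric_bike", "electric_bike", "docked_bike"],
})

# normalize="index" makes every row sum to 1; multiply by 100 for percentages.
mix_pct = pd.crosstab(
    toy["member_casual"], toy["rideable_type"], normalize="index"
).mul(100)
print(mix_pct)
```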

### **9. Find top 3 routes with max distance ?**


In [None]:
indexes=bike_df['distance_km'].nlargest(3).index
data=bike_df.loc[indexes][['start_end_station','distance_km', 'rideable_type']]
fig=px.bar(data, x='start_end_station', y='distance_km').update_layout(title="Top 3 routes with Long Distance")
fig


In [None]:
data

#### **Insights from the bar chart:**
- From the bar chart, `Clark St & Elmdale Ave to Walden Pkwy & 100th St` with 30.93 km, `Museum of Science and Industry to Benson Ave` with 29.7 km and `Benson Ave & Church St to Lake Park Ave & 56t` with 26.44 km are the top 3 longest routes.

### **10. What time of the day has more traffic in the starting stations, compare it with the type of riders(member/casual)?**

In [None]:
sns.histplot(data=time_df, x='starting_hr', bins=24, kde=True, kde_kws=dict(bw_method=0.09), hue='member_casual').set(title="Distribution of Starting Time of Riders");
plt.xlabel('Starting Hour');
plt.ylabel('No. of Traffic');

#### **Insights from the KDE Plot**
- The period from around 16:00 to 17:00 has the overall maximum traffic, with more members than casual riders.
- From 5:00 to 10:00 in the morning, more members than casual riders use the bikes.
- From 12:00 to 15:00 in the afternoon and from 1:00 to 3:00 in the early morning, casual riders outnumber members.
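
To back observations like these with numbers, the rides can be counted per starting hour and rider type. A minimal sketch on made-up timestamps, grouping by the hour extracted with `.dt.hour`:

```python
import pandas as pd

# Made-up rides at 08:xx, 13:xx and 17:xx.
toy = pd.DataFrame({
    "started_at": pd.to_datetime([
        "2022-07-01 08:05", "2022-07-01 08:40", "2022-07-01 17:10",
        "2022-07-01 17:30", "2022-07-01 17:55", "2022-07-01 13:20",
    ]),
    "member_casual": ["member", "member", "member", "casual", "casual", "casual"],
})

# Rows: starting hour, columns: rider type, values: number of rides.
hourly = (
    toy.groupby([toy["started_at"].dt.hour.rename("starting_hr"), "member_casual"])
    .size()
    .unstack(fill_value=0)
)
peak_hour = hourly.sum(axis=1).idxmax()  # hour with the most rides overall

print(hourly)
print(peak_hour)
```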


## **Conclusion**

### The following conclusions can be drawn from the visualizations above:

- Classic bikes are the most used type, with 57% of rides. The standard of classic bikes should therefore be raised for better customer satisfaction and **more usage**.
- The most popular start stations are `Streeter Dr & Grand Ave`, `DuSable Lake Shore Dr & North Blvd` and `Michigan Ave & Oak St`. These stations should always have extra bikes and other facilities such as a cafeteria or a park to attract more customers.
- There are about 89702 unique routes where the start and end stations are not the same. The `Streeter Dr & Grand Ave` station has the most connections on the map, so this station should be well maintained and provide good service to customers.
- Customers tend to use electric bikes more when the distance is more than 15 km. The electric bikes should therefore be well tuned and maintained for long distances.
- Similarly, customers tend to use electric bikes to travel faster. The mechanics and the motor should therefore be well designed and able to handle higher speeds without excessive wear and tear.
- Most members travel only short distances, and the number of members decreases as the distance increases. Offers and discounts on the annual membership should therefore be introduced for casual riders who travel long distances.
- The route `DuSable Lake Shore Dr & Monroe St --> Streeter Dr & Grand Ave` has the most casual riders when the destination is different, so this route has to be well maintained.
- `Clark St & Elmdale Ave to Walden Pkwy & 100th St`, `Museum of Science and Industry to Benson Ave` and `Benson Ave & Church St to Lake Park Ave & 56t` are the top 3 longest routes used by the riders. Riders using these routes should be given discounts on the annual membership.



## **Future Work**

- Categorizing all start and end stations by the streets and areas they are in.
- Comparing the usage of bikes with the previous year's data.
- Introducing new datasets, such as popular locations in `Chicago`, to analyse why there is more traffic in certain areas.
- Building an ML model to predict future bike usage and which type of bike has more scope.

