# CYCLISTIC-BIKE SHARE : CASE STUDY
## SCENARIO
You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your
team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team
will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve
your recommendations, so they must be backed up with compelling data insights and professional data visualizations.

## STATEMENT OF BUSINESS TASK
> How do annual members and casual riders use Cyclistic bikes differently?

## PROGRAMMING LANGUAGES USED 
* Python - Modules used: NumPy, pandas, Matplotlib, and Seaborn
* SQL

## TOOLS USED
* pgAdmin 4 (PostgreSQL tool) - For processing the data for exploration.
* Jupyter Notebook - For exploratory data analysis and visualization.

Note: As Kaggle does not support SQL, the SQL queries cannot be run here, but the process of creating "FINAL_DATA.CSV" is documented in detail below, along with the queries used.

## PROCESSING DATA FOR EXPLORATION
    
   - Copied the original data into a new folder and imported the copy into pgAdmin 4 (PostgreSQL tool), so that I can return to the original data if the working copy gets corrupted or deleted.
   - Merged all datasets containing data from 1 Apr 2020 to 30 Apr 2021 using UNION, so that all the data is in one place, which makes further manipulation easier and less time-consuming.
       
      ## Cleaning and Transforming the data
   
   - Missing values: There are many missing values in the columns start_station_name, end_station_name, start_station_id and end_station_id, but they can be ignored because those columns play no significant role in answering the business question. No missing values were found in the other columns.
   - Format check: Checked the formats of started_at and ended_at, as they contain both date and time.
   - Duplicate entries: Ensured that the column "ride_id" has unique values and that there are no repeated entries.
   - Errors: Queried the columns started_at and ended_at to check for out-of-range or invalid dates (such as 30 Feb 2020), and the column "member_casual" to check for any entries other than member and casual.
   - Data Manipulation: 
      - Renamed the column "member_casual" to "rider_type" to make it more readable.
      - Added a new column "ride_length_min", which is the difference between "ended_at" and "started_at". It gives the duration of each ride in minutes.
      - Added columns "start_day" and "end_day", which hold the name of the weekday on which the ride started and ended, respectively.
      - Created a new table ("FINAL_DATA.CSV", which is used for data exploration in the Python code below) from all trip data, containing the columns "ride_id, rideable_type, started_at, ended_at, start_day, end_day, ride_length_min, and rider_type".
      - Error check for new columns: Queried the data to find out whether the column "ride_length_min" has any negative values and deleted such rows, as the length of a ride cannot be negative.
               
        ## SQL Queries:  
           * SQL query for combining data:
                    create table all_tripdata as (
                    select * from cyclistic_trip_data_2020_04
                    union
                    select * from cyclistic_trip_data_2020_05
                    union
                    select * from cyclistic_trip_data_2020_06
                    union
                    select * from cyclistic_trip_data_2020_07
                    union
                    select * from cyclistic_trip_data_2020_08
                    union
                    select * from cyclistic_trip_data_2020_09
                    union
                    select * from cyclistic_trip_data_2020_10
                    union
                    select * from cyclistic_trip_data_2020_11
                    union
                    select * from cyclistic_trip_data_2020_12
                    union
                    select * from cyclistic_trip_data_2021_01
                    union
                    select * from cyclistic_trip_data_2021_02
                    union
                    select * from cyclistic_trip_data_2021_03
                    union
                    select * from cyclistic_trip_data_2021_04)
                
           * SQL query to create new table:
              
                    create table FINAL_DATA as(
                    select ride_id,rideable_type,started_at, ended_at,
                    to_char(started_at,'Day') as start_day,to_char(ended_at,'Day') as end_day,
                    round(cast(extract(epoch from ended_at-started_at)/60 as numeric),2) as ride_length_min,
                    member_casual as rider_type
                    from all_tripdata)
                    
               
               Note: Further analysis is done in Jupyter Notebook (given below) on the new table ("FINAL_DATA.CSV") that was created earlier.
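
The cleaning checks listed under "Cleaning and Transforming the data" were run in SQL and are not all shown in the queries above. As a rough illustration only, the same checks could be sketched in pandas. The `trips` frame holds made-up sample rows, and `clean_trips` is a hypothetical helper, not part of the original workflow.

```python
import pandas as pd


def clean_trips(trips: pd.DataFrame) -> pd.DataFrame:
    """Apply the described checks to a trip table (hypothetical helper)."""
    # Duplicate check: keep one row per ride_id.
    trips = trips.drop_duplicates(subset="ride_id")
    # Category check: keep only the two expected rider types.
    trips = trips[trips["member_casual"].isin(["member", "casual"])]
    # Date check: invalid dates such as 30 Feb become NaT and are dropped.
    trips = trips.assign(
        started_at=pd.to_datetime(trips["started_at"], errors="coerce"),
        ended_at=pd.to_datetime(trips["ended_at"], errors="coerce"),
    ).dropna(subset=["started_at", "ended_at"])
    # Ride length in minutes; a negative length is impossible, so drop it.
    minutes = (trips["ended_at"] - trips["started_at"]).dt.total_seconds() / 60
    trips = trips.assign(ride_length_min=minutes.round(2))
    return trips[trips["ride_length_min"] >= 0].reset_index(drop=True)


# Made-up sample rows, for illustration only.
trips = pd.DataFrame({
    "ride_id": ["a1", "a1", "b2", "c3", "d4"],
    "started_at": ["2020-04-01 10:00:00", "2020-04-01 10:00:00",
                   "2020-04-02 09:00:00", "2020-02-30 08:00:00",
                   "2020-04-03 12:00:00"],
    "ended_at": ["2020-04-01 10:30:00", "2020-04-01 10:30:00",
                 "2020-04-02 08:50:00", "2020-04-03 08:10:00",
                 "2020-04-03 12:15:00"],
    "member_casual": ["member", "member", "casual", "casual", "casual"],
})
cleaned = clean_trips(trips)
print(cleaned[["ride_id", "ride_length_min"]])
```

In this sample, the duplicate `a1` row, the row with the invalid start date, and the row with a negative ride length are removed, leaving two rides.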

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sqlalchemy as sql
import plotly.express as px

In [None]:
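# error_bad_lines was removed in pandas 2.0; on_bad_lines='skip' (pandas >= 1.3) skips malformed rows instead
data=pd.read_csv('..//input//processed-final-data//FINAL_DATA.CSV',on_bad_lines='skip')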

### DESCRIPTION OF FINAL DATASET
    ride_id : Unique ID given for every ride
    rideable_type : Types of bikes
    started_at : Date and Time when Rider started the ride
    ended_at : Date and Time when Rider ended the ride
    start_day : Name of the WeekDay when rider started the ride
    end_day : Name of the WeekDay when rider ended the ride
    ride_length_min : Length of ride taken by the rider in minutes
    rider_type : Type of Rider either casual or member

In [None]:
data.head()

In [None]:
data.describe()

As we can see in the above description of the ride length data, 75% of rides last less than 26.43 minutes, but the maximum ride length is 58720 minutes. So, if we want to plot a distribution, we need to remove the outliers first. Since a distribution plot is not needed for this analysis, I'm skipping that part.
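
For reference, one common way to trim such outliers before plotting a distribution is the upper fence of Tukey's rule (Q3 + 1.5 × IQR). This is only a sketch on made-up values, not a step that was applied to the dataset in this notebook.

```python
import pandas as pd

# Made-up ride lengths in minutes; the last value mimics an extreme outlier.
ride_length_min = pd.Series([5.0, 8.0, 12.0, 15.0, 20.0, 26.0, 40.0, 58720.0])

# Upper fence of Tukey's rule: values above Q3 + 1.5 * IQR count as outliers.
q1, q3 = ride_length_min.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
trimmed = ride_length_min[ride_length_min <= upper_fence]

print(f"upper fence: {upper_fence:.2f} minutes")
print(f"kept {len(trimmed)} of {len(ride_length_min)} rides")
```

The same filter could be applied to `data['ride_length_min']` before calling a histogram or density plot.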

In [None]:
data.started_at=pd.to_datetime(data.started_at)
data.info()

### ANALYSING RIDE LENGTH ON DAILY BASIS 

In [None]:
avg_ridelength_day=data.groupby([pd.Grouper(key='started_at',freq='D'),'rider_type'])['ride_length_min'].mean().reset_index(name='Average ride_length')

In [None]:
avg_ridelength_day.head()

In [None]:
# pivot arguments are passed by keyword, as required in pandas 2.0 and later
avg_ridelength_day.pivot(index='started_at',columns='rider_type',values='Average ride_length').sort_values(by='started_at').plot()
plt.ylabel('Average Ride Length(in minutes)')
plt.xlabel('Date')
plt.title('Average Ride Length per day')
plt.show()

In the above figure, the average ride length per day varies little for both types of riders, except for a few exceptional days in Apr 2020 and Feb 2021 for casual riders. To make the plot smoother, we can plot it on a monthly or weekly basis.

### ANALYSING RIDE LENGTH ON MONTHLY BASIS

In [None]:
avg_ridelength_month=data.groupby([pd.Grouper(key='started_at',freq='M'),'rider_type'])['ride_length_min'].mean().reset_index(name='Average ride_length')

In [None]:
avg_ridelength_month.head()

In [None]:
avg_ridelength_month.pivot(index='started_at',columns='rider_type',values='Average ride_length').sort_values(by='started_at').plot()
plt.ylabel('Average Ride Length(in minutes)')
plt.xlabel('Month')
plt.title("Average Ride Length per month")
plt.show()

As we can see, the monthly plot is smoother and makes the ride lengths easier to understand. The graph shows that the average ride length decreased over the months for both types of riders, and that casual riders rode for longer than members.
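
To put a number on how much longer casual riders ride, the monthly averages can be pivoted into one column per rider type and divided. The sketch below uses made-up averages in the same long format as `avg_ridelength_month`. The column name `casual_to_member_ratio` is an illustrative choice.

```python
import pandas as pd

# Made-up monthly averages, shaped like avg_ridelength_month.
monthly = pd.DataFrame({
    "started_at": pd.to_datetime(
        ["2020-04-30", "2020-04-30", "2020-05-31", "2020-05-31"]
    ),
    "rider_type": ["casual", "member", "casual", "member"],
    "Average ride_length": [60.0, 20.0, 45.0, 15.0],
})

# One row per month, one column per rider type.
wide = monthly.pivot(
    index="started_at", columns="rider_type", values="Average ride_length"
)
# How many times longer the average casual ride is than the average member ride.
wide["casual_to_member_ratio"] = wide["casual"] / wide["member"]
print(wide)
```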

### ANALYSING RIDE LENGTH ON WEEKLY BASIS(OPTIONAL)

In [None]:
avg_ridelength_week=data.groupby([pd.Grouper(key='started_at',freq='W'),'rider_type'])['ride_length_min'].mean().reset_index(name='Average ride_length')

In [None]:
avg_ridelength_week.head()

In [None]:
avg_ridelength_week.pivot(index='started_at',columns='rider_type',values='Average ride_length').sort_values(by='started_at').plot()
plt.ylabel('Average Ride Length(in minutes)')
plt.xlabel('Week')
plt.title('Average Ride Length per week')
plt.show()

### ANALYSING RIDE LENGTH ON WEEKDAYS

In [None]:
avg_ridelength_weekday= data.groupby([data['started_at'].dt.day_name(),'rider_type'])['ride_length_min'].mean().reset_index(name='Average ride_length')

In [None]:
avg_ridelength_weekday.pivot(index='started_at',columns='rider_type',values='Average ride_length').plot(kind='bar')
plt.ylabel('Average Ride Length(in minutes)')
plt.xlabel('Weekdays')
plt.title("Average Ride length on weekdays")
plt.legend(bbox_to_anchor=(1.05,1), loc='upper left')
plt.show()

The above plot shows the average ride length on different weekdays. As expected, rides on weekends are longer than on other days.
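
Because the weekday names are plain strings, the pivoted table is ordered alphabetically rather than from Monday to Sunday, which makes the weekend harder to spot in a bar chart. A minimal sketch of putting the rows in calendar order with `reindex` is shown below. The numbers are made up and `weekday_order` is an illustrative name.

```python
import pandas as pd

weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

# Made-up averages, in the same long format as avg_ridelength_weekday.
days = ["Friday", "Monday", "Saturday", "Sunday",
        "Thursday", "Tuesday", "Wednesday"]
avg = pd.DataFrame({
    "started_at": days * 2,
    "rider_type": ["casual"] * 7 + ["member"] * 7,
    "Average ride_length": [40.0, 38.0, 50.0, 52.0, 37.0, 36.0, 35.0,
                            15.0, 14.0, 17.0, 18.0, 14.5, 14.0, 13.5],
})

wide = avg.pivot(index="started_at", columns="rider_type",
                 values="Average ride_length")
# Reorder the rows from Monday to Sunday before plotting.
ordered = wide.reindex(weekday_order)
print(ordered)
```

Calling `ordered.plot(kind='bar')` would then draw the bars in calendar order.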

### ANALYSING USAGE OF RIDEABLE TYPES OF CASUAL RIDERS AND MEMBERS BASED ON RIDE LENGTH

In [None]:
bikeride_ridelength=data.groupby([pd.Grouper(key='started_at',freq='M'),'rideable_type','rider_type'])['ride_length_min'].mean().reset_index(name='Average ride_length')

In [None]:
riders_type=['casual','member']
for rider in riders_type:
    df=bikeride_ridelength[bikeride_ridelength['rider_type']==rider]
    df.pivot(index='started_at',columns='rideable_type',values='Average ride_length').plot()
    plt.ylabel('Average ride_length')
    plt.xlabel('Month')
    plt.title(rider+' Riders(on Ride Length Basis)')
    plt.show()

From the above two graphs, it appears that casual riders preferred docked bikes for longer rides, whereas members chose docked bikes initially but, as the months passed, moved to electric and classic bikes as they became available.

### ANALYSING NUMBER OF RIDES ON DAILY BASIS OF CASUAL RIDERS AND MEMBERS

In [None]:
Num_rides_day=data.groupby([pd.Grouper(key='started_at',freq='D'),'rider_type'])['ride_id'].count().reset_index(name='Number of rides')

In [None]:
Num_rides_day.pivot(index='started_at',columns='rider_type',values='Number of rides').plot()
plt.ylabel('Number of rides')
plt.xlabel('Date')
plt.title("Number of rides per day")
plt.show()

The graph is highly volatile, which makes it hard to analyse. So we can plot the number of rides per month or week, as we did with ride length.

### ANALYSING NUMBER OF RIDES ON MONTHLY BASIS OF CASUAL RIDERS AND MEMBERS

In [None]:
Num_rides_month=data.groupby([pd.Grouper(key='started_at',freq='M'),'rider_type'])['ride_id'].count().reset_index(name='Number of rides')

In [None]:
Num_rides_month.pivot(index='started_at',columns='rider_type',values='Number of rides').plot()
plt.ylabel('Number of rides')
plt.xlabel('Month')
plt.title("Number of rides per month")
plt.show()

As expected, we get a smoother and more understandable curve. Looking at the details, members used the bikes more often than casual riders. Another observation is that the number of rides gradually increased from April to August 2020 and from February 2021 to April 2021, but decreased steadily between August 2020 and February 2021.
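
The rise and fall described here can also be read off as month-over-month percentage changes. The sketch below uses made-up monthly totals, not the values from the dataset.

```python
import pandas as pd

# Made-up monthly ride totals, for illustration only.
rides = pd.Series(
    [100, 150, 200, 180, 120, 140],
    index=["2020-06", "2020-07", "2020-08", "2020-09", "2020-10", "2020-11"],
    name="Number of rides",
)

# Percentage change relative to the previous month (first month has no predecessor).
change = rides.pct_change().mul(100).round(1)
print(change)
```

Positive values mark months where the number of rides grew, and negative values mark months where it fell.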

### ANALYSING NUMBER OF RIDES ON WEEKLY BASIS OF CASUAL RIDERS AND MEMBERS(OPTIONAL)

In [None]:
#Number of rides per week
Num_rides_week=data.groupby([pd.Grouper(key='started_at',freq='W'),'rider_type'])['ride_id'].count().reset_index(name='Number of rides')

In [None]:
Num_rides_week.pivot(index='started_at',columns='rider_type',values='Number of rides').plot()
plt.ylabel('Number of rides')
plt.xlabel('Week')
plt.title('Number of Rides per week')
plt.show()

### ANALYSING NUMBER OF RIDES ON WEEKDAYS OF CASUAL RIDERS AND MEMBERS

In [None]:
Num_rides_weekday= data.groupby([data['started_at'].dt.day_name(),'rider_type'])['ride_id'].count().reset_index(name='Number of rides')

In [None]:
Num_rides_weekday.pivot(index='started_at',columns='rider_type',values='Number of rides').plot(kind='bar')
plt.ylabel('Number of rides')
plt.xlabel('Weekdays')
plt.title("Number of rides on weekdays")
plt.legend(bbox_to_anchor=(1.05,1), loc='upper left')
plt.show()

If we inspect the bar graph of the number of rides on different weekdays, we can see that casual rides peak on Saturday, which is a weekend day, whereas members used the bikes at an almost constant rate throughout the week.
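
One way to back this observation with a single figure is the share of rides that start on a weekend, per rider type. The sketch below uses a handful of made-up rides. The `is_weekend` column is an illustrative addition and not part of FINAL_DATA.CSV.

```python
import pandas as pd

# Made-up rides: casual rides fall on Sat, Sun, Mon, Sat; member rides on Mon, Tue, Wed, Sat.
rides = pd.DataFrame({
    "started_at": pd.to_datetime([
        "2021-04-03 10:00", "2021-04-04 11:00",
        "2021-04-05 08:00", "2021-04-10 15:00",
        "2021-04-05 08:30", "2021-04-06 17:00",
        "2021-04-07 09:00", "2021-04-10 12:00",
    ]),
    "rider_type": ["casual"] * 4 + ["member"] * 4,
})

# dayofweek runs from Monday=0 to Sunday=6, so 5 and 6 are the weekend.
rides["is_weekend"] = rides["started_at"].dt.dayofweek >= 5
# The mean of a boolean column is the fraction of True values.
weekend_share = rides.groupby("rider_type")["is_weekend"].mean().mul(100)
print(weekend_share)
```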

### ANALYSING USAGE OF TYPE OF BIKE OF CASUAL RIDERS AND MEMBERS BASED ON NUMBER OF RIDES

In [None]:
bike_type_month=data.groupby([pd.Grouper(key='started_at',freq='M'),'rideable_type','rider_type'])['ride_id'].count().reset_index(name='Number of rides')

In [None]:
riders_type=['casual','member']
for rider in riders_type:
    df=bike_type_month[bike_type_month['rider_type']==rider]
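    # figsize is passed to plot(); calling plt.figure() afterwards would only open a new empty figure
    df.pivot(index='started_at',columns='rideable_type',values='Number of rides').plot(figsize=(20,20))
    plt.ylabel('Number of rides')
    plt.xlabel('Month')
    plt.title(rider+' Riders(on Number of rides basis)')
    plt.show()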

Usage of docked bikes decreased drastically once the new bike types were introduced, and currently both types of riders prefer classic bikes.
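
The shift between bike types is easier to compare as a percentage share per month than as raw counts. The sketch below uses made-up counts shaped like `bike_type_month` for a single rider type. The bike type labels `docked_bike`, `electric_bike` and `classic_bike` are assumed here and should be replaced by the values actually present in `rideable_type`.

```python
import pandas as pd

# Made-up monthly counts for one rider type, for illustration only.
counts = pd.DataFrame({
    "started_at": pd.to_datetime(["2020-11-30"] * 2 + ["2020-12-31"] * 3),
    "rideable_type": ["docked_bike", "electric_bike",
                      "classic_bike", "docked_bike", "electric_bike"],
    "Number of rides": [80, 20, 50, 10, 40],
})

# One row per month, one column per bike type; a missing type means zero rides.
wide = counts.pivot(index="started_at", columns="rideable_type",
                    values="Number of rides").fillna(0)
# Divide each row by its total to get the percentage share of every bike type.
share = wide.div(wide.sum(axis=1), axis=0).mul(100).round(1)
print(share)
```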