# 1. Business Task
Cyclistic is a company looking for ways to keep growing. Its business is based on three bike rental plans:

- Single-ride pass.

- Full-day pass.

- Annual membership.

We will analyse how annual members (annual membership) and casual riders (single-ride pass and full-day pass) use Cyclistic differently, in order to help the organization design marketing strategies aimed at converting casual riders into annual members.

Annual members are more profitable than casual riders, so stakeholders believe that increasing the number of annual members is key to Cyclistic's future growth.

# 2. Data Preparation

The data consists of monthly trip records, collected internally by Cyclistic from its users, covering August 2021 to July 2022. The company stores the data as .zip files on AWS S3 cloud object storage.

https://divvy-tripdata.s3.amazonaws.com/index.html

The data is anonymized. For each ride it records an identifier, the start and end times, the start and end stations, and the user type.

The data is structured (tabular). However, some columns have missing values.


#### Storing the data
1. The zip files were uncompressed, each time period into its own folder, inside the 'csv files' folder.

2. The csv files containing the bike trip information were copied to the 'raw data' folder.



## Cleaning the data



#### Understanding the data
The "Trips" table contained the following columns:

- ride_id: Alphanumeric identifier of the trip.

- rideable_type: Bike category used during the trip.

- started_at: Date and time at which the trip started.

- ended_at: Date and time at which the trip ended.

- start_station_name: Name of the starting point station of the trip.

- start_station_id: Alphanumeric identifier of the starting point station of the trip.

- end_station_name: Name of the ending point station of the trip.

- end_station_id: Alphanumeric identifier of the ending point station of the trip.

- start_lat: Latitude point of the starting station of the trip.

- start_lng: Longitude point of the starting station of the trip.

- end_lat: Latitude point of the ending station of the trip.

- end_lng: Longitude point of the ending station of the trip.

- member_casual: User type for the trip. The 'casual' category covers single-ride pass and full-day pass users. The 'member' category covers users with annual memberships.
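The column list above can also be written down as an explicit schema when reading a file. The sketch below is one possible approach, not the method used in this notebook: it parses a single sample row (taken from the August 2021 data) from an in-memory string, and the `TRIP_DTYPES` mapping is a name introduced here purely for illustration.

```python
import io

import pandas as pd

# One header line plus one sample row from the August 2021 file.
CSV_TEXT = (
    "ride_id,rideable_type,started_at,ended_at,start_station_name,"
    "start_station_id,end_station_name,end_station_id,start_lat,start_lng,"
    "end_lat,end_lng,member_casual\n"
    "C67017767EED2251,electric_bike,2021-08-13 14:52:35,2021-08-13 14:58:16,"
    ",,Clark St & Grace St,TA1307000127,41.94,-87.64,41.950874,-87.659146,member\n"
)

# Hypothetical schema: identifiers stay text, low-cardinality columns become categories.
TRIP_DTYPES = {
    "ride_id": "string",
    "rideable_type": "category",
    "start_station_name": "string",
    "start_station_id": "string",
    "end_station_name": "string",
    "end_station_id": "string",
    "start_lat": "float64",
    "start_lng": "float64",
    "end_lat": "float64",
    "end_lng": "float64",
    "member_casual": "category",
}

trips_sample = pd.read_csv(
    io.StringIO(CSV_TEXT),
    dtype=TRIP_DTYPES,
    parse_dates=["started_at", "ended_at"],  # parse timestamps while reading
)
print(trips_sample.dtypes)
```

Declaring station IDs as text avoids pandas guessing a numeric type for IDs such as `13155` while others such as `TA1307000127` remain strings.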

#### Data cleaning and formatting with Python and Pandas
The following changes were made to the data:

In [1]:
#Import the necessary libraries
import pandas as pd
import numpy as np

In [4]:
#Load the data with pandas for each month (From August 2021 to July 2022)
trips_2021_08 = pd.read_csv('data/raw data/202108-divvy-tripdata.csv')
trips_2021_09 = pd.read_csv('data/raw data/202109-divvy-tripdata.csv')
trips_2021_10 = pd.read_csv('data/raw data/202110-divvy-tripdata.csv')
trips_2021_11 = pd.read_csv('data/raw data/202111-divvy-tripdata.csv')
trips_2021_12 = pd.read_csv('data/raw data/202112-divvy-tripdata.csv')
trips_2022_01 = pd.read_csv('data/raw data/202201-divvy-tripdata.csv')
trips_2022_02 = pd.read_csv('data/raw data/202202-divvy-tripdata.csv')
trips_2022_03 = pd.read_csv('data/raw data/202203-divvy-tripdata.csv')
trips_2022_04 = pd.read_csv('data/raw data/202204-divvy-tripdata.csv')
trips_2022_05 = pd.read_csv('data/raw data/202205-divvy-tripdata.csv')
trips_2022_06 = pd.read_csv('data/raw data/202206-divvy-tripdata.csv')
trips_2022_07 = pd.read_csv('data/raw data/202207-divvy-tripdata.csv')
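#NOTE: this line reloads the January 2022 file (202201) under the name trips_2022_08,
#  so January's trips are counted twice in the merged data shown below
trips_2022_08 = pd.read_csv('data/raw data/202201-divvy-tripdata.csv')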
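
The twelve monthly files follow one naming pattern, so the file list can also be generated rather than typed by hand, which makes it harder to load the wrong month. This is a minimal sketch under that assumption. The helper names `monthly_paths` and `load_trips` are hypothetical, and `load_trips` is defined but not called here because it needs the files on disk.

```python
from pathlib import Path

import pandas as pd

RAW_DIR = Path("data/raw data")  # folder used in the cell above


def monthly_paths(start="2021-08", end="2022-07", raw_dir=RAW_DIR):
    """Return one CSV path per month in the inclusive range start..end."""
    months = pd.period_range(start=start, end=end, freq="M")
    return [raw_dir / f"{m.strftime('%Y%m')}-divvy-tripdata.csv" for m in months]


def load_trips(paths):
    """Read every monthly file and stack them into a single dataframe."""
    frames = [pd.read_csv(p) for p in paths]
    return pd.concat(frames, ignore_index=True)


paths = monthly_paths()
for p in paths:
    print(p.name)
```

With the files in place, `load_trips(paths)` would return the merged dataframe in one step.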

In [6]:
#Looking inside the data from August 2021 we can see there are rows with missing values in several columns
trips_2021_08.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 804352 entries, 0 to 804351
Data columns (total 13 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   ride_id             804352 non-null  object 
 1   rideable_type       804352 non-null  object 
 2   started_at          804352 non-null  object 
 3   ended_at            804352 non-null  object 
 4   start_station_name  715894 non-null  object 
 5   start_station_id    715894 non-null  object 
 6   end_station_name    710237 non-null  object 
 7   end_station_id      710237 non-null  object 
 8   start_lat           804352 non-null  float64
 9   start_lng           804352 non-null  float64
 10  end_lat             803646 non-null  float64
 11  end_lng             803646 non-null  float64
 12  member_casual       804352 non-null  object 
dtypes: float64(4), object(9)
memory usage: 79.8+ MB


The info() output shows that the datetime columns have no missing values in August 2021, but the station and location data is incomplete: of 804,352 rides, 88,458 lack a start station name and ID, 94,115 lack an end station name and ID, and 706 lack end coordinates.
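
The same information can be obtained as numbers rather than read off the info() listing. The sketch below uses a small made-up dataframe (the values are illustrative, not taken from the real files) to show how `isna()` gives the count and share of missing values per column.

```python
import numpy as np
import pandas as pd

# Tiny illustrative sample, not real Cyclistic data.
sample = pd.DataFrame(
    {
        "started_at": ["2021-08-10 17:15:49", "2021-08-13 14:52:35", "2021-08-21 02:34:23"],
        "start_station_name": [None, "Clark St & Grace St", None],
        "end_lat": [41.77, np.nan, 41.97],
    }
)

missing_count = sample.isna().sum()   # number of missing values per column
missing_share = sample.isna().mean()  # fraction of missing values per column

print(pd.DataFrame({"missing": missing_count, "share": missing_share.round(2)}))
```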

In [8]:
#Collect the monthly dataframes in a list so they can later be concatenated into a single dataframe
trips_by_month = [
    trips_2021_08,
    trips_2021_09,
    trips_2021_10,
    trips_2021_11,
    trips_2021_12,
    trips_2022_01,
    trips_2022_02,
    trips_2022_03,
    trips_2022_04,
    trips_2022_05,
    trips_2022_06,
    trips_2022_07,
    trips_2022_08
    ]

In [13]:
#Check that all the dataframes have the same number of columns
for month in trips_by_month:
    print(month.shape[1])

13
13
13
13
13
13
13
13
13
13
13
13
13
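
Equal column counts do not guarantee that the columns have the same names or order. A stricter check compares the column labels themselves. The sketch below shows the idea on small made-up dataframes; `same_columns` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd


def same_columns(frames):
    """Return True if every dataframe has the same column names in the same order."""
    reference = list(frames[0].columns)
    return all(list(frame.columns) == reference for frame in frames)


# Illustrative frames: b matches a, c has the same column count but a different name.
a = pd.DataFrame({"ride_id": ["A1"], "member_casual": ["member"]})
b = pd.DataFrame({"ride_id": ["B2"], "member_casual": ["casual"]})
c = pd.DataFrame({"ride_id": ["C3"], "usertype": ["casual"]})

print(same_columns([a, b]))
print(same_columns([a, b, c]))
```

In the notebook this could be called as `same_columns(trips_by_month)`.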


In [14]:
#Merged all the data into a single dataframe

trips_2021_2022 = pd.concat(trips_by_month)
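
By default `pd.concat` keeps each input's own row index, so index values repeat in the merged dataframe. If a month is loaded twice, its rides are also duplicated. The sketch below, on made-up data, shows one way to get a clean index and to detect and remove repeated `ride_id` values. It assumes `ride_id` is unique per trip, which this report does not verify.

```python
import pandas as pd

# Illustrative monthly frames; "jan" is deliberately included twice.
jan = pd.DataFrame({"ride_id": ["A1", "B2"], "member_casual": ["member", "casual"]})
feb = pd.DataFrame({"ride_id": ["C3"], "member_casual": ["member"]})

# ignore_index=True renumbers rows 0..n-1 instead of repeating each frame's index.
combined = pd.concat([jan, feb, jan], ignore_index=True)

# Count rides whose ride_id already appeared earlier in the data.
n_duplicates = int(combined["ride_id"].duplicated().sum())

# Keep the first occurrence of every ride_id.
deduplicated = combined.drop_duplicates(subset="ride_id").reset_index(drop=True)

print(len(combined), n_duplicates, len(deduplicated))
```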

In [15]:
trips_2021_2022.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6005233 entries, 0 to 103769
Data columns (total 13 columns):
 #   Column              Dtype  
---  ------              -----  
 0   ride_id             object 
 1   rideable_type       object 
 2   started_at          object 
 3   ended_at            object 
 4   start_station_name  object 
 5   start_station_id    object 
 6   end_station_name    object 
 7   end_station_id      object 
 8   start_lat           float64
 9   start_lng           float64
 10  end_lat             float64
 11  end_lng             float64
 12  member_casual       object 
dtypes: float64(4), object(9)
memory usage: 641.4+ MB


In [17]:
trips_2021_2022.head(10)

Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual
0,99103BB87CC6C1BB,electric_bike,2021-08-10 17:15:49,2021-08-10 17:22:44,,,,,41.77,-87.68,41.77,-87.68,member
1,EAFCCCFB0A3FC5A1,electric_bike,2021-08-10 17:23:14,2021-08-10 17:39:24,,,,,41.77,-87.68,41.77,-87.63,member
2,9EF4F46C57AD234D,electric_bike,2021-08-21 02:34:23,2021-08-21 02:50:36,,,,,41.95,-87.65,41.97,-87.66,member
3,5834D3208BFAF1DA,electric_bike,2021-08-21 06:52:55,2021-08-21 07:08:13,,,,,41.97,-87.67,41.95,-87.65,member
4,CD825CB87ED1D096,electric_bike,2021-08-19 11:55:29,2021-08-19 12:04:11,,,,,41.79,-87.6,41.77,-87.62,member
5,612F12C94A964F3E,electric_bike,2021-08-19 12:41:12,2021-08-19 12:47:47,,,,,41.81,-87.61,41.8,-87.6,member
6,C7435946FDFFA9B7,electric_bike,2021-08-19 12:21:50,2021-08-19 12:37:31,,,,,41.77,-87.62,41.81,-87.61,member
7,C67017767EED2251,electric_bike,2021-08-13 14:52:35,2021-08-13 14:58:16,,,Clark St & Grace St,TA1307000127,41.94,-87.64,41.950874,-87.659146,member
8,ABC4532F2B4983AB,electric_bike,2021-08-17 18:23:55,2021-08-17 18:24:13,,,,,41.92,-87.66,41.92,-87.66,member
9,82437E52DC3B9A8A,electric_bike,2021-08-04 12:50:53,2021-08-04 13:08:20,,,,,41.74,-87.53,41.74,-87.53,member


In [33]:
#We know there's a lot of missing data regarding station names, and some regarding end coordinates.
#Since we will not use geographical data to map each trip,
#  we can drop the start_lat, start_lng, end_lat, end_lng columns.
#Note: drop() returns a new dataframe and leaves trips_2021_2022 unchanged
#  unless the result is assigned back.

trips_2021_2022.drop(columns=['start_lat', 'start_lng', 'end_lat', 'end_lng'])


Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,member_casual,trip_duration
0,99103BB87CC6C1BB,electric_bike,2021-08-10 17:15:49,2021-08-10 17:22:44,,,,,member,0 days 00:06:55
1,EAFCCCFB0A3FC5A1,electric_bike,2021-08-10 17:23:14,2021-08-10 17:39:24,,,,,member,0 days 00:16:10
2,9EF4F46C57AD234D,electric_bike,2021-08-21 02:34:23,2021-08-21 02:50:36,,,,,member,0 days 00:16:13
3,5834D3208BFAF1DA,electric_bike,2021-08-21 06:52:55,2021-08-21 07:08:13,,,,,member,0 days 00:15:18
4,CD825CB87ED1D096,electric_bike,2021-08-19 11:55:29,2021-08-19 12:04:11,,,,,member,0 days 00:08:42
...,...,...,...,...,...,...,...,...,...,...
103765,8788DA3EDE8FD8AB,electric_bike,2022-01-18 12:36:48,2022-01-18 12:46:19,Clinton St & Washington Blvd,WL-012,,,casual,0 days 00:09:31
103766,C6C3B64FDC827D8C,electric_bike,2022-01-27 11:00:06,2022-01-27 11:02:40,Racine Ave & Randolph St,13155,,,casual,0 days 00:02:34
103767,CA281AE7D8B06F5A,electric_bike,2022-01-10 16:14:51,2022-01-10 16:20:58,Broadway & Waveland Ave,13325,Clark St & Grace St,TA1307000127,casual,0 days 00:06:07
103768,44E348991862319B,electric_bike,2022-01-19 13:22:11,2022-01-19 13:24:27,Racine Ave & Randolph St,13155,,,casual,0 days 00:02:16
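
Because `drop` returns a new dataframe, the columns only disappear for good when the result is assigned back. The sketch below demonstrates the difference on a one-row made-up dataframe.

```python
import pandas as pd

GEO_COLUMNS = ["start_lat", "start_lng", "end_lat", "end_lng"]

trips = pd.DataFrame(
    {
        "ride_id": ["A1"],
        "start_lat": [41.77],
        "start_lng": [-87.68],
        "end_lat": [41.77],
        "end_lng": [-87.63],
        "member_casual": ["member"],
    }
)

# Without assignment the original dataframe is untouched.
trips.drop(columns=GEO_COLUMNS)
columns_before = list(trips.columns)

# Assigning the result back makes the change persistent.
trips = trips.drop(columns=GEO_COLUMNS)
columns_after = list(trips.columns)

print(columns_before)
print(columns_after)
```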


In [37]:
#Since we're interested in the behaviour of the users,
# we drop any rows with empty data in the 'started_at' or 'ended_at' columns
trips_2021_2022 = trips_2021_2022.dropna(subset=['started_at', 'ended_at'])

In [24]:
#Converting the 'started_at' and 'ended_at' columns to datetime format
trips_2021_2022['started_at'] = pd.to_datetime(trips_2021_2022['started_at'])
trips_2021_2022['ended_at'] = pd.to_datetime(trips_2021_2022['ended_at'])
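
When converting text to datetimes it can help to state the expected format and decide what happens to values that do not match. The sketch below is one option, not what the cell above does: the format string is inferred from the timestamps displayed in this report, and `errors="coerce"` turns unparseable values into `NaT` instead of raising an error.

```python
import pandas as pd

# Two valid timestamps in the layout seen in the data, one bad value, one missing value.
raw = pd.Series(
    ["2021-08-10 17:15:49", "2021-08-10 17:22:44", "not a timestamp", None]
)

parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S", errors="coerce")

# Values that failed to parse (or were missing) show up as NaT.
n_unparsed = int(parsed.isna().sum())
print(parsed)
print(n_unparsed)
```

Counting `NaT` values after the conversion gives a quick check that no timestamps were silently lost.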

In [31]:
#Now we've changed the data type of the time columns
trips_2021_2022.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6005233 entries, 0 to 103769
Data columns (total 14 columns):
 #   Column              Dtype          
---  ------              -----          
 0   ride_id             object         
 1   rideable_type       object         
 2   started_at          datetime64[ns] 
 3   ended_at            datetime64[ns] 
 4   start_station_name  object         
 5   start_station_id    object         
 6   end_station_name    object         
 7   end_station_id      object         
 8   start_lat           float64        
 9   start_lng           float64        
 10  end_lat             float64        
 11  end_lng             float64        
 12  member_casual       object         
 13  trip_duration       timedelta64[ns]
dtypes: datetime64[ns](2), float64(4), object(7), timedelta64[ns](1)
memory usage: 687.2+ MB


In [28]:
#Add a new column, 'trip_duration', as the difference between the end and start times
trips_2021_2022['trip_duration'] = trips_2021_2022['ended_at'] - trips_2021_2022['started_at']

In [29]:
#Checking the new column
trips_2021_2022.head(10)

Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual,trip_duration
0,99103BB87CC6C1BB,electric_bike,2021-08-10 17:15:49,2021-08-10 17:22:44,,,,,41.77,-87.68,41.77,-87.68,member,0 days 00:06:55
1,EAFCCCFB0A3FC5A1,electric_bike,2021-08-10 17:23:14,2021-08-10 17:39:24,,,,,41.77,-87.68,41.77,-87.63,member,0 days 00:16:10
2,9EF4F46C57AD234D,electric_bike,2021-08-21 02:34:23,2021-08-21 02:50:36,,,,,41.95,-87.65,41.97,-87.66,member,0 days 00:16:13
3,5834D3208BFAF1DA,electric_bike,2021-08-21 06:52:55,2021-08-21 07:08:13,,,,,41.97,-87.67,41.95,-87.65,member,0 days 00:15:18
4,CD825CB87ED1D096,electric_bike,2021-08-19 11:55:29,2021-08-19 12:04:11,,,,,41.79,-87.6,41.77,-87.62,member,0 days 00:08:42
5,612F12C94A964F3E,electric_bike,2021-08-19 12:41:12,2021-08-19 12:47:47,,,,,41.81,-87.61,41.8,-87.6,member,0 days 00:06:35
6,C7435946FDFFA9B7,electric_bike,2021-08-19 12:21:50,2021-08-19 12:37:31,,,,,41.77,-87.62,41.81,-87.61,member,0 days 00:15:41
7,C67017767EED2251,electric_bike,2021-08-13 14:52:35,2021-08-13 14:58:16,,,Clark St & Grace St,TA1307000127,41.94,-87.64,41.950874,-87.659146,member,0 days 00:05:41
8,ABC4532F2B4983AB,electric_bike,2021-08-17 18:23:55,2021-08-17 18:24:13,,,,,41.92,-87.66,41.92,-87.66,member,0 days 00:00:18
9,82437E52DC3B9A8A,electric_bike,2021-08-04 12:50:53,2021-08-04 13:08:20,,,,,41.74,-87.53,41.74,-87.53,member,0 days 00:17:27
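
The `trip_duration` column is a timedelta, which is convenient to display but awkward to average or plot. A common follow-up is to express it in minutes and to flag implausibly short rides. The sketch below uses two rides from the output above. The `trip_minutes` column and the one-minute threshold are assumptions made for illustration, not choices stated in this report.

```python
import pandas as pd

# Rows 0 and 8 from the output above.
trips = pd.DataFrame(
    {
        "ride_id": ["99103BB87CC6C1BB", "ABC4532F2B4983AB"],
        "started_at": pd.to_datetime(["2021-08-10 17:15:49", "2021-08-17 18:23:55"]),
        "ended_at": pd.to_datetime(["2021-08-10 17:22:44", "2021-08-17 18:24:13"]),
    }
)

trips["trip_duration"] = trips["ended_at"] - trips["started_at"]

# Convert the timedelta to a plain number of minutes.
trips["trip_minutes"] = trips["trip_duration"].dt.total_seconds() / 60

# Hypothetical rule: treat rides under one minute as suspicious.
is_short = trips["trip_duration"] < pd.Timedelta(minutes=1)

print(trips[["ride_id", "trip_duration", "trip_minutes"]])
print(is_short.tolist())
```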


In [None]:
#Show all columns together with their non-null counts
trips_2021_2022.info(verbose=True, show_counts=True)