


# Part I - Ford GoBike Data Exploration
## by Iniobong Nwa

## Introduction
> This dataset contains information about individual rides made through a bike-sharing company in San Francisco.



## Preliminary Wrangling


In [None]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import os
import datetime as dt


%matplotlib inline

> Load in your dataset and describe its properties through the questions below. Try and motivate your exploration goals through this section.


In [None]:
#read the csv file
fordbikes = pd.read_csv('fordgobike-tripdata.csv')


In [None]:
fordbikes.head(20)

In [None]:
fordbikes.describe()

In [None]:
fordbikes.shape

In [None]:
fordbikes.info()

In [None]:
df_bike = fordbikes.copy()

## Data Cleaning

### Issue 1: Missing values in start_station_id, start_station_name, end_station_id, end_station_name, member_birth_year, member_gender


#### Define: Drop rows with missing values

In [None]:
# view the null values
df_bike.isnull().sum()

#### Code

In [None]:
#drop null values
df_bike.dropna(inplace= True)

#### Test

In [None]:
df_bike.info()
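
Reading the `info()` output works, but the same check can be written as an explicit assertion. The sketch below uses a small made-up frame (`toy` and its values are hypothetical, not taken from the dataset) to show how many rows `dropna` removes and to fail loudly if any missing value survives.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for df_bike
toy = pd.DataFrame({
    'start_station_id': [3.0, np.nan, 21.0],
    'member_gender': ['Male', 'Female', None],
})

rows_before = len(toy)
toy_clean = toy.dropna()  # returns a new frame; toy itself is unchanged
rows_dropped = rows_before - len(toy_clean)

# Total number of missing cells left in the cleaned frame
remaining_nulls = int(toy_clean.isnull().sum().sum())
assert remaining_nulls == 0
print(f'dropped {rows_dropped} of {rows_before} rows')
```

Comparing the row count before and after also shows how much data the cleaning step costs.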

### Issue 2: Wrong data type for start_time and end_time.

#### Define: Convert start_time and end_time to datetime

#### Code

In [None]:
#convert to datetime data type
df_bike['start_time'] = pd.to_datetime(df_bike['start_time'])
df_bike['end_time'] = pd.to_datetime(df_bike['end_time'])

#### Test

In [None]:
df_bike.info()
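
The dtype check can also be done in code rather than by eye. This is a minimal sketch on made-up timestamps (the `toy` frame is hypothetical) using `pandas.api.types.is_datetime64_any_dtype`.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

# Hypothetical timestamps in the same column layout as df_bike
toy = pd.DataFrame({
    'start_time': ['2019-02-28 17:32:10', '2019-02-28 18:53:21'],
    'end_time': ['2019-03-01 08:01:55', '2019-03-01 06:42:03'],
})

time_cols = ['start_time', 'end_time']
for col in time_cols:
    toy[col] = pd.to_datetime(toy[col])

# True only if every listed column now has a datetime dtype
datetime_ok = all(is_datetime64_any_dtype(toy[col]) for col in time_cols)
assert datetime_ok

# The .dt accessor only works on datetime columns
start_hours = toy['start_time'].dt.hour.tolist()
print(start_hours)
```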

### Issue 3: Wrong data type for bike_id, start_station_id and end_station_id

#### Define: Convert bike id, start_station_id and end_station_id column to string data type

#### Code

In [None]:
#convert to object data type
df_bike['bike_id'] = df_bike['bike_id'].astype(str)
df_bike['start_station_id'] = df_bike['start_station_id'].astype(str)
df_bike['end_station_id'] = df_bike['end_station_id'].astype(str)

#### Test

In [None]:
df_bike.info()
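
One detail to watch when converting ids to strings: a numeric column that had missing values when it was loaded is stored as float, so a direct `astype(str)` keeps the trailing `.0`. The sketch below, on made-up ids, shows the difference; casting through `int` first is one possible workaround and assumes the missing rows have already been dropped.

```python
import pandas as pd

# Hypothetical station ids stored as floats
toy = pd.DataFrame({'start_station_id': [13.0, 86.0, 375.0]})

# Direct conversion keeps the float formatting
direct = toy['start_station_id'].astype(str).tolist()

# Casting to int first gives clean id strings (fails if NaN is still present)
via_int = toy['start_station_id'].astype(int).astype(str).tolist()

print(direct)
print(via_int)
```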

### Issue 4: Wrong data type for user_type, bike_share_for_all_trip and member_gender

#### Define: Convert user_type, bike_share_for_all_trip and member_gender to categorical data

#### Code

In [None]:
#convert to categorical data
df_bike['user_type'] = df_bike['user_type'].astype('category')
df_bike['bike_share_for_all_trip'] = df_bike['bike_share_for_all_trip'].astype('category')
df_bike['member_gender'] = df_bike['member_gender'].astype('category')

#### Test

In [None]:
df_bike.info()

In [None]:
# derive the trip duration in whole minutes (fractional minutes are dropped)
df_bike['duration_min'] = (df_bike['duration_sec'] // 60).astype(int)

### What is the structure of your dataset?

> After cleaning the dataset, I have 174952 bike rides and 16 features.

### What is/are the main feature(s) of interest in your dataset?

> I would like to find out what type of user is more frequent and the duration of most trips.

### What features in the dataset do you think will help support your investigation into your feature(s) of interest?

> To do this, I will work with the duration_sec, user_type and bike_share_for_all_trip columns.

## Univariate Exploration

> In this section, investigate distributions of individual variables. If
you see unusual points or outliers, take a deeper look to clean things up
and prepare yourself to look at relationships between variables.




In [None]:
#split the columns into hour, day and month datetime columns
df_bike['start_hour'] = df_bike['start_time'].dt.hour
df_bike['start_day'] = df_bike['start_time'].dt.day_name()
df_bike['start_month'] = df_bike['start_time'].dt.month_name()

df_bike['end_hour'] = df_bike['end_time'].dt.hour
df_bike['end_day'] = df_bike['end_time'].dt.day_name()
df_bike['end_month'] = df_bike['end_time'].dt.month_name()

In [None]:
print(df_bike.shape)
df_bike.head()

In [None]:
#save as csv file
df_bike.to_csv('df_bike.csv', index=False)  # index=False avoids writing an extra unnamed index column
df = pd.read_csv('df_bike.csv')

In [None]:
df_bike['start_day'].value_counts()

In [None]:
df_bike['start_month'].value_counts()

In [None]:
#bike rides for each day in the week
base_color = sb.color_palette()[0]
days_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
plt.figure(figsize=[15, 6])
sb.countplot(data=df_bike, x='start_day', color=base_color, order=days_order)
plt.title('Daily Bike Rides', fontsize=14, fontweight='bold')
plt.ylabel('Number of Bike trips', fontsize=15)
plt.xlabel('Days', fontsize=15)
days_count = df_bike['start_day'].value_counts()
total_trips = days_count.sum()
plt.yticks(size=13)
locs, labels = plt.xticks(size=14)
# annotate each bar with its share of all trips
for loc, label in zip(locs, labels):
    count = days_count[label.get_text()]
    string = '{:0.1f}%'.format(100*count/total_trips)
    plt.text(loc, count-10000, string, ha='center', color='white', fontsize=13)

From the plot, it's clear that the most rides are recorded on Thursday. In general, there are more daily rides on weekdays and fewer rides on weekends.
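
Because there are five weekdays and only two weekend days, it helps to compare rides per day rather than totals. This is a rough sketch on made-up day labels (the `toy` frame is hypothetical) of how that comparison could be computed.

```python
import pandas as pd

# Hypothetical start days; the real column is df_bike['start_day']
toy = pd.DataFrame({'start_day': ['Monday', 'Thursday', 'Thursday', 'Saturday',
                                  'Sunday', 'Friday', 'Tuesday', 'Thursday']})

is_weekend = toy['start_day'].isin(['Saturday', 'Sunday'])

# Share of all rides that start on a weekend
weekend_share = is_weekend.mean()

# Average rides per day, so 5 weekdays and 2 weekend days are comparable
avg_per_weekday = (~is_weekend).sum() / 5
avg_per_weekend_day = is_weekend.sum() / 2

print(weekend_share, avg_per_weekday, avg_per_weekend_day)
```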

In [None]:
#bike rides for each gender group
print(df_bike['member_gender'].value_counts())

sb.countplot(data=df_bike, x='member_gender', color=base_color)

gender_count = df_bike['member_gender'].value_counts()
total_gender = gender_count.sum()
plt.yticks(size=13)
plt.ylabel('Number of Bike trips', fontsize=15)
plt.xlabel('Member gender', fontsize=15)
plt.title('Bike rides by gender', fontsize=14, fontweight='bold')
locs, labels = plt.xticks(size=14)
for loc, label in zip(locs, labels):
    count = gender_count[label.get_text()]
    pct_string = '{:0.1f}%'.format(100*count/total_gender)
    if count < 40000:
        # short bar: put the label above it in black
        plt.text(loc, count+1000, pct_string, ha='center', color='black', fontsize=13)
    else:
        # tall bar: put the label inside it in white
        plt.text(loc, count-25000, pct_string, ha='center', color='white', fontsize=13)

We have a higher percentage of male riders than female riders.

In [None]:
#bike sharing for users type
print(df_bike['user_type'].value_counts())
sb.countplot(data= df_bike , x='user_type' , color= base_color)

usertype_count = df_bike['user_type'].value_counts()
users= usertype_count.sum()
plt.yticks(size=13)
plt.title('Bike Rides By Different Users', fontsize=14, fontweight='bold')
plt.ylabel('Number of rides', fontsize=14)
plt.xlabel('User Type', fontsize=14)
locs, labels = plt.xticks( size=14); 
for loc, label in zip(locs, labels):
    count = usertype_count[label.get_text()]
    string = '{:0.1f}%'.format(100*count/users)
    plt.text(loc, count-15000, string, ha = 'center', color = 'white',fontsize = 13)
plt.show()

Most rides are taken by subscribers, who account for over 90% of all trips.
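
The percentages written on the bars can also be computed directly with `value_counts(normalize=True)`. The sketch below uses a made-up series (`toy_users` is hypothetical, chosen only to give round numbers).

```python
import pandas as pd

# Hypothetical user types: 9 subscriber rides and 1 customer ride
toy_users = pd.Series(['Subscriber'] * 9 + ['Customer'], name='user_type')

# normalize=True returns proportions instead of raw counts
share_pct = toy_users.value_counts(normalize=True).mul(100).round(1)

print(share_pct)
```

On the real data the same call would be applied to `df_bike['user_type']`.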

In [None]:
# no of rides for bike sharing
print(df_bike['bike_share_for_all_trip'].value_counts())
sb.countplot(data= df_bike, x='bike_share_for_all_trip', color= base_color)

bike_sharing_count = df_bike['bike_share_for_all_trip'].value_counts()
bike_sharing= bike_sharing_count.sum()
plt.yticks(size=13)
plt.title('Bike sharing for all Rides', fontsize=14, fontweight='bold')
plt.xlabel('Bike sharing for all rides', fontsize=14)
plt.ylabel('Number of bike rides', fontsize=14)
locs, labels = plt.xticks( size=14); 
for loc, label in zip(locs, labels):
    count = bike_sharing_count[label.get_text()]
    string = '{:0.1f}%'.format(100*count/bike_sharing)
    plt.text(loc, count-15000, string, ha = 'center', color = 'white',fontsize = 13)
    


About 90% of the trips are not flagged as bike_share_for_all_trip.

In [None]:
#hourly rides for all trips
fig, ax = plt.subplots(2, figsize=[15, 15])

base_color = sb.color_palette()[0]
sb.countplot(data=df_bike, x='start_hour', color=base_color, ax=ax[0])
sb.countplot(data=df_bike, x='end_hour', color=base_color, ax=ax[1])
# label each subplot; plt.title/xlabel/ylabel would only affect the last axes
ax[0].set_title('Hourly Bike Rides (start hour)', fontsize=14, fontweight='bold')
ax[1].set_title('Hourly Bike Rides (end hour)', fontsize=14, fontweight='bold')
for axis in ax:
    axis.set_xlabel('Time in hours', fontsize=14)
    axis.set_ylabel('Number of Rides')

Rides peak between 8am-9am and again between 4pm-5pm.
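
The busiest hours in the hourly count plots can also be read off numerically. This is a small sketch on made-up hours (`toy_hours` is hypothetical) using `value_counts` and `nlargest`.

```python
import pandas as pd

# Hypothetical start hours; the real column is df_bike['start_hour']
toy_hours = pd.Series([8, 8, 8, 17, 17, 12, 9, 17, 8, 23], name='start_hour')

# Rides per hour, ordered by hour of day
rides_per_hour = toy_hours.value_counts().sort_index()

# The two busiest hours and their ride counts
top_two = rides_per_hour.nlargest(2)

print(top_two)
```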




### Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

> I split the start_time and end_time columns into hour, day and month.

### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

> I had to add some more columns to support the time-based analysis.
> I had to change the data types of some columns.

## Bivariate Exploration

> In this section, investigate relationships between pairs of variables in your
data. Make sure the variables that you cover here have been introduced in some
fashion in the previous section (univariate exploration).

In [None]:
fig, ax = plt.subplots(3,figsize = [14,14])
days_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday','Friday','Saturday', 'Sunday']
sb.countplot(data= df_bike, x = 'start_day', hue = 'user_type',order= days_order,ax=ax[0])
sb.countplot(data= df_bike, x = 'start_day', hue = 'bike_share_for_all_trip', order= days_order,ax=ax[1])
sb.countplot(data= df_bike, x = 'start_day', hue = 'member_gender', order= days_order, ax=ax[2])

From the plots above, we have more subscribers than customers. Very few trips are bike sharing, and more males are involved than females.

In [None]:
sb.regplot(data = df_bike, x='start_hour', y= 'duration_sec')

In [None]:
days_count = df_bike.groupby(['start_day', 'user_type']).size()
memgen_count = df_bike.groupby(['start_day', 'member_gender']).size()
share_count = df_bike.groupby(['start_day', 'bike_share_for_all_trip']).size()

In [None]:
days_count = days_count.reset_index(name= 'count')
memgen_count = memgen_count.reset_index(name= 'count')
share_count = share_count.reset_index(name= 'count')

In [None]:
days_count

In [None]:
binsize = 100
bin_edges = np.arange(0, df_bike['duration_sec'].max()+binsize, binsize)

plt.figure(figsize=[8, 5])
plt.hist(data = df_bike, x = 'duration_sec', bins = bin_edges)
plt.xlabel('Duration in seconds')
plt.ylabel('Number of trips')
plt.xlim(0,5000) # set up a limit due to the outliers
plt.show()

The above distribution is right-skewed: most trips are short, with a long tail of much longer rides.
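
The skew can be quantified, and log-spaced bins are one common way to view a long-tailed variable without cutting the axis. The sketch below uses made-up durations (`toy_duration` is hypothetical); the bin step of 0.25 in log10 units is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical trip durations in seconds, with one very long ride
toy_duration = pd.Series([120, 300, 360, 420, 480, 540, 600, 900, 3600, 84000],
                         name='duration_sec')

# A positive skewness means a long right tail
skewness = toy_duration.skew()

# Log-spaced bin edges from the power of ten below the minimum
# to the power of ten above the maximum
log_min = np.floor(np.log10(toy_duration.min()))
log_max = np.ceil(np.log10(toy_duration.max()))
bin_edges = 10 ** np.arange(log_min, log_max + 0.25, 0.25)

# Count how many trips fall into each log-spaced bin
counts, _ = np.histogram(toy_duration, bins=bin_edges)

print(round(skewness, 2), counts.sum())
```

The same `bin_edges` could be passed to `plt.hist` together with `plt.xscale('log')`.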

In [None]:
# trips per duration in whole minutes, limited to durations under 24 minutes
minute_order = np.arange(0, 24)

plt.figure(figsize=(10,6))
plt.title('Trip Duration in Minutes', fontsize=15)
ax = sb.countplot(data=df_bike, x='duration_min', order=minute_order, color=base_color)



This plot is also right-skewed.

### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

> More subscribers than customers take longer trips, and more males than females or other gender groups take longer trips.

### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

> Most users who take long trips do not bike share.

## Multivariate Exploration


In [None]:
# plot for trip duration across gender and customer type
fig = plt.figure(figsize = [8,6])
ax = sb.pointplot(data = df_bike, x ='member_gender', y = 'duration_sec', hue = 'user_type',
           palette = 'Blues', linestyles = '', dodge = 0.4)
plt.title('Trip Duration across gender and customer type')
plt.ylabel('Average Trip Duration (Secs)')

From the plot above, I observe that customers take longer trips than subscribers, and riders in the other gender group take longer trips than female and male riders.
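
By default `sb.pointplot` draws the mean of `y` for each group, so the plotted points can be cross-checked with a `groupby`. This is a minimal sketch on made-up values (the `toy` frame and its durations are hypothetical and do not reflect the real results).

```python
import pandas as pd

# Hypothetical rides; the column names match df_bike
toy = pd.DataFrame({
    'member_gender': ['Male', 'Male', 'Female', 'Female', 'Other', 'Other'],
    'user_type': ['Subscriber', 'Customer', 'Subscriber', 'Customer',
                  'Subscriber', 'Customer'],
    'duration_sec': [600, 1200, 660, 1500, 900, 1800],
})

# Mean duration per gender and user type, with user types as columns.
# observed=True keeps only combinations that occur when the keys are categorical.
mean_duration = (toy.groupby(['member_gender', 'user_type'], observed=True)
                    ['duration_sec'].mean()
                    .unstack('user_type'))

print(mean_duration)
```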

In [None]:
# plot for trip duration across gender and bike sharing for all trips
fig = plt.figure(figsize = [8,6])
base_color = sb.color_palette()[0]
ax = sb.pointplot(data = df_bike, x ='member_gender', y = 'duration_sec', hue = 'bike_share_for_all_trip', 
            color= base_color, linestyles = '', dodge = 0.4)
plt.title('Trip Duration across gender and bike_share_for_all_trip')
plt.ylabel('Average Trip Duration (Secs)')
ax.set_yticklabels([],minor = True)

On average, the other gender group has the longest trips among bike sharing rides.

In [None]:
# plot for trip duration across customer type and week days
fig = plt.figure(figsize = [8,6])
ax = sb.pointplot(data = df_bike, x ='start_day', y = 'duration_sec', hue = 'user_type',
           palette = 'Blues', linestyles = '', order= days_order, dodge = 0.4)
plt.title('Trip Duration across week days and customer type')
plt.ylabel('Average Trip Duration (Secs)')
ax.set_yticklabels([],minor = True)
plt.show();

Customers take relatively longer trips than subscribers. Both customers and subscribers record longer trip durations on weekends.

In [None]:
# count of rides by user type, split by bike_share_for_all_trip
sb.countplot(data = df_bike, x = 'user_type', hue = 'bike_share_for_all_trip', order=df_bike.user_type.value_counts().index)

Only subscribers bike share for all trips; customers don't bike share.

In [None]:
sb.countplot(data = df_bike, x = 'user_type', hue = 'member_gender',order=df_bike.user_type.value_counts().index)

Subscribers include other gender groups, but customers include only female and male riders.
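
Claims of the form "only group A has value B" are easy to misread from a clustered bar chart when one bar is very small, so a count table is a useful cross-check. The sketch below uses `pd.crosstab` on made-up rows (the `toy` frame is hypothetical).

```python
import pandas as pd

# Hypothetical rides; the column names match df_bike
toy = pd.DataFrame({
    'user_type': ['Subscriber', 'Subscriber', 'Subscriber', 'Customer', 'Customer'],
    'bike_share_for_all_trip': ['Yes', 'No', 'No', 'No', 'No'],
})

# Rows are user types, columns are the bike share flag, cells are ride counts
table = pd.crosstab(toy['user_type'], toy['bike_share_for_all_trip'])

# Number of customer rides flagged 'Yes' (0 if that column is absent)
customers_sharing = table.loc['Customer'].get('Yes', 0)

print(table)
```

The same table built on `member_gender` instead of `bike_share_for_all_trip` would check the gender observation.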

### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

> The longest trips are taken on weekends and by the other gender group.

### Were there any interesting or surprising interactions between features?

> I discovered that Customers are only male or female, while the other gender groups appear only among Subscribers. Customers do not bike share. The longest trips are recorded on weekends.

## Conclusions
> Customers travel the same duration regardless of the day of the week.

> Only subscribers bike share for all trips.

> Over 90% of the trips were by subscribers.

> About 90% of the trips were not bike share for all trips.

> We have the most trips on weekdays and fewer trips on weekends.

> Across the week, we have the most rides on Thursdays and the fewest on Saturdays and Sundays.

> Most trips occur between 8am-9am and 4pm-5pm.