# Ford GoBike System Data Exploration
> **BY ABDULRAHEEM BASHIR**

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#pre_wra">Preliminary Wrangling</a></li>
<li><a href="#data_clean">Cleaning Data</a></li>
</ul>

<a id='intro'></a>
## Introduction
> This data set includes information about individual rides made in a bike-sharing system covering the greater San Francisco Bay area.

In [90]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt

%matplotlib inline

In [91]:
# Reading the Ford GoBike System Data
# saving it as a dataframe with the name bike_df

bike_df = pd.read_csv('bike.csv')

In [92]:
# displaying the first few rows of the bike_df dataframe

bike_df.head()

Unnamed: 0,duration_sec,start_time,end_time,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,bike_id,user_type,member_birth_year,member_gender,bike_share_for_all_trip
0,52185,2019-02-28 17:32:10.1450,2019-03-01 08:01:55.9750,21.0,Montgomery St BART Station (Market St at 2nd St),37.789625,-122.400811,13.0,Commercial St at Montgomery St,37.794231,-122.402923,4902,Customer,1984.0,Male,No
1,42521,2019-02-28 18:53:21.7890,2019-03-01 06:42:03.0560,23.0,The Embarcadero at Steuart St,37.791464,-122.391034,81.0,Berry St at 4th St,37.77588,-122.39317,2535,Customer,,,No
2,61854,2019-02-28 12:13:13.2180,2019-03-01 05:24:08.1460,86.0,Market St at Dolores St,37.769305,-122.426826,3.0,Powell St BART Station (Market St at 4th St),37.786375,-122.404904,5905,Customer,1972.0,Male,No
3,36490,2019-02-28 17:54:26.0100,2019-03-01 04:02:36.8420,375.0,Grove St at Masonic Ave,37.774836,-122.446546,70.0,Central Ave at Fell St,37.773311,-122.444293,6638,Subscriber,1989.0,Other,No
4,1585,2019-02-28 23:54:18.5490,2019-03-01 00:20:44.0740,7.0,Frank H Ogawa Plaza,37.804562,-122.271738,222.0,10th Ave at E 15th St,37.792714,-122.24878,4898,Subscriber,1974.0,Male,Yes


<a id='pre_wra'></a>
## Preliminary Wrangling

In [93]:
# displaying some information about dataframe

bike_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 183412 entries, 0 to 183411
Data columns (total 16 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   duration_sec             183412 non-null  int64  
 1   start_time               183412 non-null  object 
 2   end_time                 183412 non-null  object 
 3   start_station_id         183215 non-null  float64
 4   start_station_name       183215 non-null  object 
 5   start_station_latitude   183412 non-null  float64
 6   start_station_longitude  183412 non-null  float64
 7   end_station_id           183215 non-null  float64
 8   end_station_name         183215 non-null  object 
 9   end_station_latitude     183412 non-null  float64
 10  end_station_longitude    183412 non-null  float64
 11  bike_id                  183412 non-null  int64  
 12  user_type                183412 non-null  object 
 13  member_birth_year        175147 non-null  float64
 14  memb

### What is the structure of your dataset?

> The dataset contains 183,412 entries, including information about when and where each trip began and finished, the duration of each trip in seconds, and some user information. The dataset has 16 columns, though the start_time and end_time columns are stored as object (string) data rather than as datetimes.
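
One way to avoid the `object` dtype for the two timestamp columns is to parse them while loading. The sketch below is only an illustration: it reads a two-row in-memory sample copied from the `head()` output above instead of `bike.csv`, and the names `sample_csv` and `sample_df` are hypothetical.

```python
import io
import pandas as pd

# Two rows copied from the head() output, trimmed to three columns.
sample_csv = io.StringIO(
    "duration_sec,start_time,end_time\n"
    "52185,2019-02-28 17:32:10.1450,2019-03-01 08:01:55.9750\n"
    "1585,2019-02-28 23:54:18.5490,2019-03-01 00:20:44.0740\n"
)

# parse_dates asks read_csv to convert the listed columns to datetimes.
sample_df = pd.read_csv(sample_csv, parse_dates=["start_time", "end_time"])

print(sample_df.dtypes)
```

Converting the columns after loading, as done later in this notebook, gives the same result; parsing at load time simply removes one cleaning step.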

### What is/are the main feature(s) of interest in your dataset?

> My main features of interest in this dataset are duration_sec, start_station_name, end_station_name, and bike_share_for_all_trip.

### What features in the dataset do you think will help support your investigation into your feature(s) of interest?

> The features that will help in this investigation are user_type, member_gender, member_birth_year, start_time, and end_time.

<a id='data_clean'></a>
## Cleaning Data
> In this section, the data will be cleaned so that it is suitable for analysis.

In [94]:
# Checking for duplicates in the dataframe

bike_df.duplicated().sum()

0

The output above demonstrates that there are no duplicates in the dataframe.

In [95]:
# Checking the dataframe's columns for the sum of missing data.

bike_df.isnull().sum()

duration_sec                  0
start_time                    0
end_time                      0
start_station_id            197
start_station_name          197
start_station_latitude        0
start_station_longitude       0
end_station_id              197
end_station_name            197
end_station_latitude          0
end_station_longitude         0
bike_id                       0
user_type                     0
member_birth_year          8265
member_gender              8265
bike_share_for_all_trip       0
dtype: int64

In [96]:
# Checking for the total amount of missing data

bike_df.isnull().sum().sum()

17318

In [97]:
# Dividing the total number of missing values by the number of rows, as a percentage

(bike_df.isnull().sum().sum() / bike_df.shape[0]) * 100

9.44213028591368

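The output above shows 17,318 missing values in the dataframe. Dividing that count by the number of rows gives 9.4%, which is an upper bound on the share of affected rows, because a single row can be missing several fields. Since the affected share is small, it is appropriate to drop the rows with missing data.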
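
The ratio computed above counts missing cells, not rows with missing data. A minimal sketch of the difference, using a small made-up frame (`toy_df` and its values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame: row 1 is missing two fields, row 3 is missing one.
toy_df = pd.DataFrame({
    "member_birth_year": [1984.0, np.nan, 1972.0, 1989.0],
    "member_gender": ["Male", np.nan, "Male", np.nan],
})

# Counts every empty cell, so a row with two gaps is counted twice.
missing_cells = int(toy_df.isnull().sum().sum())

# Counts each row at most once, however many fields it is missing.
rows_with_missing = int(toy_df.isnull().any(axis=1).sum())

cell_ratio_pct = missing_cells / len(toy_df) * 100
row_share_pct = rows_with_missing / len(toy_df) * 100

print(missing_cells, rows_with_missing, cell_ratio_pct, row_share_pct)
```

The row-based figure is the one that matches what `dropna()` removes, since `dropna()` drops whole rows.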

In [98]:
# dropping the missing data

bike_df.dropna(inplace=True)

In [99]:
# Checking for the total amount of missing data
# to determine whether the required change has been made

bike_df.isnull().sum().sum()

0

In [100]:
# Changing the data type of the start_time and end_time columns to datetime

bike_df['start_time'] = pd.to_datetime(bike_df['start_time'])
bike_df['end_time'] = pd.to_datetime(bike_df['end_time'])
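
Once both columns are datetimes, they can be subtracted to cross-check `duration_sec`. The sketch below does this for two rows copied from the `head()` output; `check_df` and `elapsed_sec` are illustrative names, and the agreement is shown only for these two rows, not for the whole dataset.

```python
import pandas as pd

# Two rows copied from the head() output shown earlier.
check_df = pd.DataFrame({
    "duration_sec": [52185, 1585],
    "start_time": ["2019-02-28 17:32:10.1450", "2019-02-28 23:54:18.5490"],
    "end_time": ["2019-03-01 08:01:55.9750", "2019-03-01 00:20:44.0740"],
})
check_df["start_time"] = pd.to_datetime(check_df["start_time"])
check_df["end_time"] = pd.to_datetime(check_df["end_time"])

# Subtracting two datetime columns yields a timedelta column.
elapsed = check_df["end_time"] - check_df["start_time"]

# astype(int) truncates the fractional seconds, which equals flooring
# for positive durations.
elapsed_sec = elapsed.dt.total_seconds().astype(int)

print((elapsed_sec == check_df["duration_sec"]).all())
```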

In [103]:
# displaying information about dataframe
# to determine whether the required change has been made

bike_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 174952 entries, 0 to 183411
Data columns (total 16 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   duration_sec             174952 non-null  int64         
 1   start_time               174952 non-null  datetime64[ns]
 2   end_time                 174952 non-null  datetime64[ns]
 3   start_station_id         174952 non-null  float64       
 4   start_station_name       174952 non-null  object        
 5   start_station_latitude   174952 non-null  float64       
 6   start_station_longitude  174952 non-null  float64       
 7   end_station_id           174952 non-null  float64       
 8   end_station_name         174952 non-null  object        
 9   end_station_latitude     174952 non-null  float64       
 10  end_station_longitude    174952 non-null  float64       
 11  bike_id                  174952 non-null  int64         
 12  user_type       

In [104]:
# Checking for the unique years
# in the start_time column

bike_df['start_time'].dt.year.unique()

array([2019], dtype=int64)

In [105]:
# Checking for the unique years
# in the end_time column

bike_df['end_time'].dt.year.unique()

array([2019], dtype=int64)

The above output indicates that the activities in this dataset take place in 2019. This information, along with the member's birth year, can assist us in determining the member's age. This will necessitate the creation of a new column called age.

In [106]:
# Using the year found above (2019), add a new column named age
# and cast it to integer.

bike_df['age'] = (2019 - bike_df['member_birth_year']).astype(int)
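
As an alternative to the literal `2019`, the year can be taken from each ride's own `start_time`, which keeps working if the data ever spans more than one year. This is a sketch on a small made-up frame (`rides` is a hypothetical name); like the cell above, `.astype(int)` assumes that rows with a missing birth year have already been dropped.

```python
import pandas as pd

# Small illustrative frame with the two columns the calculation needs.
rides = pd.DataFrame({
    "start_time": pd.to_datetime(["2019-02-28 17:32:10", "2019-02-28 12:13:13"]),
    "member_birth_year": [1984.0, 1972.0],
})

# Subtract the birth year from the year of each ride, then cast to integer.
rides["age"] = (rides["start_time"].dt.year - rides["member_birth_year"]).astype(int)

print(rides["age"].tolist())
```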

In [107]:
# dropping columns that are unnecessary for this analysis

cols_to_drop = ['start_station_id', 'start_station_latitude', 'start_station_longitude',
                'end_station_id', 'end_station_latitude', 'end_station_longitude',
                'bike_id', 'member_birth_year']
bike_df.drop(cols_to_drop, axis=1, inplace=True)

In [112]:
# displaying the column names
# to determine whether the required change has been made

bike_df.columns

Index(['duration_sec', 'start_time', 'end_time', 'start_station_name',
       'end_station_name', 'user_type', 'member_gender',
       'bike_share_for_all_trip', 'age'],
      dtype='object')

In [113]:
# Checking for the unique gender type

bike_df['member_gender'].unique()

array(['Male', 'Other', 'Female'], dtype=object)

In [114]:
# displaying some descriptive statistics about the data

bike_df.describe()

Unnamed: 0,duration_sec,age
count,174952.0,174952.0
mean,704.002744,34.196865
std,1642.204905,10.118731
min,61.0,18.0
25%,323.0,27.0
50%,510.0,32.0
75%,789.0,39.0
max,84548.0,141.0


The statistics above show some outliers in the age column (the maximum is 141). The next section looks at these first and at how to deal with them.
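
One simple way to flag such values is a fixed cutoff. The sketch below uses the five summary values from the `describe()` output as a stand-in for the full column; the cutoff of 100 is an assumption chosen for illustration, not a value taken from the dataset or from this analysis.

```python
import pandas as pd

# min, quartiles, and max of age from the describe() output above.
ages = pd.Series([18, 27, 32, 39, 141], name="age")

# Assumed cutoff for a plausible rider age (illustrative only).
max_plausible_age = 100

# Boolean mask marking the values above the cutoff.
is_outlier = ages > max_plausible_age

# Keep only the values at or below the cutoff.
plausible_ages = ages[~is_outlier]

print(ages[is_outlier].tolist())
```

A quantile-based rule is another option; the right choice depends on how the age distribution looks once it is plotted.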

## Univariate Exploration

> In this section, investigate distributions of individual variables. If
you see unusual points or outliers, take a deeper look to clean things up
and prepare yourself to look at relationships between variables.



### Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

> Your answer here!

### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

> Your answer here!

## Bivariate Exploration

> In this section, investigate relationships between pairs of variables in your
data. Make sure the variables that you cover here have been introduced in some
fashion in the previous section (univariate exploration).

### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

> Your answer here!

### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

> Your answer here!

## Multivariate Exploration

> Create plots of three or more variables to investigate your data even
further. Make sure that your investigations are justified, and follow from
your work in the previous sections.

### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

> Your answer here!

### Were there any interesting or surprising interactions between features?

> Your answer here!

## Conclusions
>You can write a summary of the main findings and reflect on the steps taken during the data exploration.




