# UDACITY PROJECT 5 - FINAL PROJECT
## Data Exploration - Ford GoBike System Data Analysis
### *Jhonatan Nagasako*
#### *28-FEB-2021*

<hr size="5"/>

<a id='contents'></a>
# Table of Contents (click link to section)

<a href="#intro">A. INTRODUCTION</a><br><br>
<a href="#gather">1. GATHERING DATA</a><br><br>
<a href="#assess">2. ASSESSING DATA</a><br><br>
<a href="#clean">3. CLEANING DATA</a><br><br>
<a href="#store">4. STORING DATA</a><br>

<a href="#explore">5. DATA EXPLORATION</a>    
* 5.1 <a href="#preliminary1">[Exploration - Preliminary Review]</a><br>
* 5.2 <a href="#univariate1">[Exploration - Univariate Exploration]</a><br>
* 5.3 <a href="#bivariate1">[Exploration - Bivariate Exploration]</a><br>
* 5.4 <a href="#multivariate1">[Exploration - Multivariate Exploration]</a><br>

<a href="#discussion">6. DISCUSSION</a><br> 
* 6.1 <a href="#preliminary2">[Discussion - Preliminary Review]</a><br>
* 6.2 <a href="#univariate2">[Discussion - Univariate Exploration]</a><br>
* 6.3 <a href="#bivariate2">[Discussion - Bivariate Exploration]</a><br>
* 6.4 <a href="#multivariate2">[Discussion - Multivariate Exploration]</a><br>  

<a href="#conclusion">7. CONCLUSION</a>



<hr size="5"/>

<a id='intro'></a>
# A. INTRODUCTION

This notebook focuses on the **data EXPLORATION** of the Ford GoBike System data. The data source is available via the [Udacity-provided link to Google Docs/Drive](https://docs.google.com/document/d/e/2PACX-1vQmkX4iOT6Rcrin42vslquX2_wQCjIa_hbwD0xmxrERPSOJYDtpNc_3wwK_p9_KpOsfA6QVyEHdxxq7/pub?embedded=True).

This data set includes information about individual rides made in a bike-sharing system covering the greater San Francisco Bay area.
* Note that this dataset will require some data wrangling in order to make it tidy for analysis. There are multiple cities covered by the linked system, and multiple data files will need to be joined together if a full year’s coverage is desired.
* Depending on scope of questions explored, additional data sources from other cities may be explored. Data can be accessed via [this page](https://www.google.com/url?q=https://www.bikeshare.com/data/&sa=D&source=editors&ust=1614518054096000&usg=AOvVaw38y_cueV0lTerb59CY7YsD) or [this page](https://www.google.com/url?q=https://github.com/BetaNYC/Bike-Share-Data-Best-Practices/wiki/Bike-Share-Data-Systems&sa=D&source=editors&ust=1614518054097000&usg=AOvVaw2OskG9ApXPoPZlezrpwmXp).
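
If a full year's coverage is wanted, the monthly files mentioned above can be stacked into a single frame. The following is a minimal sketch, assuming every monthly file shares the same column layout; the helper name `load_trip_files` and the two in-memory sample files are hypothetical and not part of the original notebook.

```python
import io

import pandas as pd


def load_trip_files(sources):
    """Read several trip CSVs (paths or file-like objects) and stack them row-wise."""
    frames = [pd.read_csv(src) for src in sources]
    # ignore_index=True rebuilds a clean 0..n-1 index across all files
    return pd.concat(frames, ignore_index=True)


# Tiny in-memory stand-ins for two monthly files (hypothetical values)
jan = io.StringIO("duration_sec,bike_id,user_type\n600,1,Customer\n")
feb = io.StringIO("duration_sec,bike_id,user_type\n900,2,Subscriber\n300,3,Customer\n")

trips = load_trip_files([jan, feb])
print(trips.shape)
```

With real data, the list passed to `load_trip_files` would hold the paths of the downloaded monthly CSV files instead of the in-memory samples.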

<a href="#contents">[Table of Contents]</a>

<hr size="5"/>
<h5><center>📚 Gathering START -- Project START 📚</center></h5>        
<hr size="5"/>

<a id='gather'></a>
# 1. GATHERING DATA

<font color=blue>

<a href="#contents">[Table of Contents]</a>

In [1]:
# import statements for all of the packages used for analysis

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns
import statsmodels.api as sm

# Remember to include a 'magic word' so that your visualizations are plotted
#   inline with the notebook. See this page for more:
#   http://ipython.readthedocs.io/en/stable/interactive/magics.html

%matplotlib inline

In [2]:
# gather .csv file
df = pd.read_csv('201902-fordgobike-tripdata.csv')
df.head(3)

| | duration_sec | start_time | end_time | start_station_id | start_station_name | start_station_latitude | start_station_longitude | end_station_id | end_station_name | end_station_latitude | end_station_longitude | bike_id | user_type | member_birth_year | member_gender | bike_share_for_all_trip |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52185 | 2019-02-28 17:32:10.1450 | 2019-03-01 08:01:55.9750 | 21.0 | Montgomery St BART Station (Market St at 2nd St) | 37.789625 | -122.400811 | 13.0 | Commercial St at Montgomery St | 37.794231 | -122.402923 | 4902 | Customer | 1984.0 | Male | No |
| 1 | 42521 | 2019-02-28 18:53:21.7890 | 2019-03-01 06:42:03.0560 | 23.0 | The Embarcadero at Steuart St | 37.791464 | -122.391034 | 81.0 | Berry St at 4th St | 37.77588 | -122.39317 | 2535 | Customer | NaN | NaN | No |
| 2 | 61854 | 2019-02-28 12:13:13.2180 | 2019-03-01 05:24:08.1460 | 86.0 | Market St at Dolores St | 37.769305 | -122.426826 | 3.0 | Powell St BART Station (Market St at 4th St) | 37.786375 | -122.404904 | 5905 | Customer | 1972.0 | Male | No |


In [3]:
# high-level overview of data shape and composition
print('\n----Shape of File (row, column)---\n')
print(df.shape)
print('\n----Data Types---\n')
print(df.dtypes)
print('\n----Unique Values---\n')
print(df.nunique())
print('\n----Number of missing values---\n')
print(df.isnull().sum())


----Shape of File (row, column)---

(183412, 16)

----Data Types---

duration_sec                 int64
start_time                  object
end_time                    object
start_station_id           float64
start_station_name          object
start_station_latitude     float64
start_station_longitude    float64
end_station_id             float64
end_station_name            object
end_station_latitude       float64
end_station_longitude      float64
bike_id                      int64
user_type                   object
member_birth_year          float64
member_gender               object
bike_share_for_all_trip     object
dtype: object

----Unique Values---

duration_sec                 4752
start_time                 183401
end_time                   183397
start_station_id              329
start_station_name            329
start_station_latitude        334
start_station_longitude       335
end_station_id                329
end_station_name              329
end_station_latitude       

<hr size="5"/>
<h5><center>🔎 Gathering END ➜ Assessing START 🔎</center></h5>        
<hr size="5"/>

<a id='assess'></a>
# 2. ASSESSING DATA

<font color=blue>

<a href="#contents">[Table of Contents]</a>

<font color='red'>

<a id='todo'></a>
## QUALITY (click hyperlink question to go to section in notebook!)

<a href="#quality-section">[Go to QUALITY section]</a>
    
1. <a href="#Q1">[Remove rows with ```NULL``` in ```end_station_id``` column (197 items)]</a>
    
    
2. <a href="#Q2">[The item above may already fix this, but remove rows with ```NULL``` in the following columns (197 items each)]</a>

    * start_station_id
    * start_station_name
    * end_station_name
    
    
3. <a href="#Q3">[Remove rows with inconsistent coordinates in the ```start_station_latitude``` and ```start_station_longitude``` columns]</a>
    
    * unique counts should change from 334 latitudes and 335 longitudes to 329, matching the 329 unique start stations
    
    
4. <a href="#Q4">[Review ```member_birth_year``` and determine a good cut-off year, then remove rows accordingly]</a>
    
    * E.g., someone born in 1900 rented a bike in 2019, which would make that person over 100 years old!
    * consider the average age and the oldest person; such entries could be errors that can be removed

    
5. <a href="#Q5">[Convert ```duration_sec``` to minutes, hours, and/or days]</a>
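
The quality items above could be approached along these lines. This is only a sketch on a hypothetical mini-sample (`df_demo`), not the cleaning code of this notebook; the cut-off year `MIN_BIRTH_YEAR = 1920` and the choice to keep rows with no birth year are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-sample that mimics the columns named in the list above
df_demo = pd.DataFrame({
    "duration_sec": [600, 1200, 3600, 90],
    "start_station_id": [21.0, np.nan, 86.0, 3.0],
    "start_station_name": ["A", None, "C", "D"],
    "end_station_id": [13.0, np.nan, 3.0, 21.0],
    "end_station_name": ["B", None, "D", "A"],
    "member_birth_year": [1984.0, 1990.0, 1900.0, np.nan],
})

station_cols = ["start_station_id", "start_station_name",
                "end_station_id", "end_station_name"]

# Items 1-2: drop rows where any station field is missing
dfc_demo = df_demo.dropna(subset=station_cols).copy()

# Item 4: keep riders born on or after an ASSUMED cut-off year;
# rows with no birth year are kept here, which is also an assumption
MIN_BIRTH_YEAR = 1920
birth_year = dfc_demo["member_birth_year"]
keep = birth_year.isna() | (birth_year >= MIN_BIRTH_YEAR)
dfc_demo = dfc_demo[keep].copy()

# Item 5: derive duration in minutes and hours from seconds
dfc_demo["duration_min"] = dfc_demo["duration_sec"] / 60
dfc_demo["duration_hr"] = dfc_demo["duration_sec"] / 3600

print(dfc_demo[["duration_sec", "duration_min", "duration_hr"]])
```

Keeping the original `duration_sec` column alongside the derived ones makes it easy to switch units later without recomputing from scratch.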
    
<br>
    
>**Tips for Common Data Quality Issues**
>1. Missing data
>2. Invalid data (e.g., a negative height, or other datatype validation errors such as str vs int vs float; there can only be 2 people in a room, not 2.54 people in a room... *unless there are ghosts lol*)
>3. Inaccurate data (e.g., specifying a foot = 5 inches, which is WRONG. A foot = 12 inches)
>4. Inconsistent data (e.g., mixing up units, some data captured as cm instead of inches)
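
The mismatch between 329 station ids and 334/335 unique coordinates noted in quality item 3 is an example of inconsistent data. One way to locate the affected stations is sketched below on a hypothetical sample (`stations_demo`); the values are made up so that one station appears with two slightly different latitudes.

```python
import pandas as pd

# Hypothetical sample: station 21 appears with two slightly different coordinates
stations_demo = pd.DataFrame({
    "start_station_id": [21.0, 21.0, 23.0, 23.0],
    "start_station_latitude": [37.789625, 37.789700, 37.791464, 37.791464],
    "start_station_longitude": [-122.400811, -122.400811, -122.391034, -122.391034],
})

coord_cols = ["start_station_id", "start_station_latitude", "start_station_longitude"]

# Count distinct (latitude, longitude) pairs per station id
coord_counts = (
    stations_demo
    .drop_duplicates(subset=coord_cols)
    .groupby("start_station_id")
    .size()
)

# Stations with more than one coordinate pair are the inconsistent ones
inconsistent_ids = coord_counts[coord_counts > 1].index.tolist()
print(inconsistent_ids)
```

Once the inconsistent ids are known, the rows can be reviewed before deciding whether to drop them or to overwrite their coordinates with the most frequent pair.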



## TIDINESS
    
<a href="#tidy-section">[Go to TIDY section]</a>
    
1. <a href="#T1">[Create columns to break apart ```start_time``` to ```date``` and ```time```]</a>

    
2. <a href="#T2">[Create columns to break apart ```end_time``` to ```date``` and ```time```]</a>

<br>
    
>**Tips for Tidying**
>1. Each variable forms a column.
>2. Each observation forms a row.
>3. Each type of observational unit forms a table.
>
>*Reference for [tidy data here](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html)*
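
The two tidiness items above (splitting ```start_time``` and ```end_time``` into date and time parts) could look roughly like this. It is a sketch on a one-row sample copied from the `df.head(3)` output; the new column names (`start_date`, `start_time_of_day`, and so on) are hypothetical choices, not names used elsewhere in this notebook.

```python
import pandas as pd

# One-row sample using the timestamp format seen in the raw file
times_demo = pd.DataFrame({
    "start_time": ["2019-02-28 17:32:10.1450"],
    "end_time": ["2019-03-01 08:01:55.9750"],
})

for col in ["start_time", "end_time"]:
    # Parse the text timestamps into datetime64 values
    parsed = pd.to_datetime(times_demo[col])
    prefix = col.split("_")[0]  # "start" or "end"
    # Store the calendar date and the clock time in separate columns
    times_demo[f"{prefix}_date"] = parsed.dt.date
    times_demo[f"{prefix}_time_of_day"] = parsed.dt.time

print(times_demo[["start_date", "start_time_of_day", "end_date", "end_time_of_day"]])
```

Keeping a parsed datetime column as well may be worthwhile, since pandas can then derive hour of day or day of week directly through the `.dt` accessor.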

<br>
    
## FEATURE ENGINEERING
1. Test
    
<a href="#assess">[Assessing Data Requirements]</a> <a href="#contents">[Table of Contents]</a>

<hr size="5"/>
<h5><center>🧹 Assessing END ➜ Cleaning START 🧹</center></h5>        
<hr size="5"/>

<a id='clean'></a>
# 3. CLEANING DATA

<font color=blue>
    
   
<a href="#todo">[Cleaning and Tidying To-do List]</a> 
<a href="#contents">[Table of Contents]</a>

<a id='quality-section'></a>

<font color='red'>
    
## QUALITY ISSUES ADDRESSED -- note that ```df``` will change to ```dfc``` to indicate cleaned data
    
<a href="#todo">[Cleaning and Tidying To-do List]</a>

<a id='Q1'></a>

<font color='red'>

### ✔️ 1. all dataframes --> Remove missing data column ```Unnamed: 0```

<a href="#todo">[Cleaning and Tidying To-do List]</a>

<a id='tidy-section'></a>

<font color='red'>

## TIDINESS ISSUES ADDRESSED -- note that ```df``` will change to ```dfc``` to indicate cleaned data

<a href="#clean">[Cleaning Data Requirements]</a> 
<a href="#todo">[Cleaning and Tidying To-do List]</a> 
<a href="#contents">[Table of Contents]</a>

<hr size="5"/>
<h5><center>💾 Cleaning END ➜ Storing START 💾</center></h5>        
<hr size="5"/>

<a id='store'></a>
# 4. STORING

<font color=blue>
    

<a href="#contents">[Table of Contents]</a>

<hr size="5"/>
<h5><center>🗺️ Storing END ➜ Data Exploration START 🗺️</center></h5>        
<hr size="5"/>

<a id='explore'></a>
# 5. DATA EXPLORATION

<font color=blue>
    
<a href="#contents">[Table of Contents]</a>

<a href="#explore">[#5 Data Exploration]</a><br>
--- <a href="#preliminary1">[Preliminary Review]</a>
--- <a href="#univariate1">[Univariate Exploration]</a>
--- <a href="#bivariate1">[Bivariate Exploration]</a>
--- <a href="#multivariate1">[Multivariate Exploration]</a> 

<a href="#discussion">[#6 Discussion]</a><br>
--- <a href="#preliminary2">[Preliminary Review]</a>
--- <a href="#univariate2">[Univariate Exploration]</a>
--- <a href="#bivariate2">[Bivariate Exploration]</a>
--- <a href="#multivariate2">[Multivariate Exploration]</a> 

<a id='preliminary1'></a>
## 5.1 Exploration - Preliminary Review

<a href="#contents">[Table of Contents]</a>

<a href="#explore">[#5 Data Exploration]</a><br> 

<a href="#discussion">[#6 Discussion]</a><br>
--- <a href="#preliminary2">[6.1 Discussion - Preliminary Review]</a>

<a id='univariate1'></a>
## 5.2 Exploration - Univariate Exploration
<a href="#contents">[Table of Contents]</a>

<a href="#explore">[#5 Data Exploration]</a><br> 

<a href="#discussion">[#6 Discussion]</a><br>
--- <a href="#univariate2">[6.2 Discussion - Univariate Exploration]</a>

<a id='bivariate1'></a>
## 5.3 Exploration - Bivariate Exploration
<a href="#contents">[Table of Contents]</a>

<a href="#explore">[#5 Data Exploration]</a><br>

<a href="#discussion">[#6 Discussion]</a><br>
--- <a href="#bivariate2">[6.3 Discussion - Bivariate Exploration]</a>

<a id='multivariate1'></a>
## 5.4 Exploration - Multivariate Exploration
<a href="#contents">[Table of Contents]</a>

<a href="#explore">[#5 Data Exploration]</a><br> 

<a href="#discussion">[#6 Discussion]</a><br>
--- <a href="#multivariate2">[6.4 Discussion - Multivariate Exploration]</a> 

<hr size="5"/>
<h5><center>📊 Data Exploration END ➜ Discussion START 📊</center></h5>        
<hr size="5"/>

<a id='discussion'></a>
# 6. DISCUSSION

<font color=blue>
    
<a href="#contents">[Table of Contents]</a>

<a href="#explore">[#5 Data Exploration]</a><br>
--- <a href="#preliminary1">[Preliminary Review]</a>
--- <a href="#univariate1">[Univariate Exploration]</a>
--- <a href="#bivariate1">[Bivariate Exploration]</a>
--- <a href="#multivariate1">[Multivariate Exploration]</a> 

<a href="#discussion">[#6 Discussion]</a><br>
--- <a href="#preliminary2">[Preliminary Review]</a>
--- <a href="#univariate2">[Univariate Exploration]</a>
--- <a href="#bivariate2">[Bivariate Exploration]</a>
--- <a href="#multivariate2">[Multivariate Exploration]</a> 

<a id='preliminary2'></a>
## 6.1 Discussion - Preliminary Review
<a href="#contents">[Table of Contents]</a>

<a href="#explore">[#5 Data Exploration]</a><br>
--- <a href="#preliminary1">[5.1 Exploration - Preliminary Review]</a>

<a href="#discussion">[#6 Discussion]</a><br>
    
### What is the structure of your dataset?

There are 183,412 rides in the dataset with 16 features (duration_sec, start_time, end_time, start_station_id, start_station_name, start_station_latitude, start_station_longitude, end_station_id, end_station_name, end_station_latitude, end_station_longitude, bike_id, user_type, member_birth_year, member_gender, and bike_share_for_all_trip). Nine variables are numeric (int64 or float64), and the remaining seven, including the start_time and end_time timestamps, are stored as text (object).

### What is/are the main feature(s) of interest in your dataset?

I'm most interested in figuring out what features are best for predicting the price of the diamonds in the dataset.

### What features in the dataset do you think will help support your investigation into your feature(s) of interest?

I expect that carat will have the strongest effect on each diamond's price: the larger the diamond, the higher the price. I also think that the other big "C"s of diamonds: cut, color, and clarity, will have effects on the price, though to a much smaller degree than the main effect of carat.

<a id='univariate2'></a>
##  6.2 Discussion - Univariate Exploration

<a href="#contents">[Table of Contents]</a>

<a href="#explore">[#5 Data Exploration]</a><br>
--- <a href="#univariate1">[5.2 Exploration - Univariate Exploration]</a>

<a href="#discussion">[#6 Discussion]</a><br>

I'll start by looking at the distribution of the main variable of interest: price.

### Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

The price variable took on a large range of values, so I looked at the data using a log transform. Under the transformation, the data looked bimodal, with one peak between \$500 and \$1000, and another just below \$5000.

### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

When investigating the x, y, and z size variables, a number of outlier points were identified. Overall, these points can be characterized by an inconsistency between the recorded value of depth, and the value that would be derived from using x, y, and z. For safety, all of these points were removed from the dataset to move forwards.

<a id='bivariate2'></a>
## 6.3 Discussion - Bivariate Exploration

<a href="#contents">[Table of Contents]</a>

<a href="#explore">[#5 Data Exploration]</a><br>
--- <a href="#bivariate1">[5.3 Exploration - Bivariate Exploration]</a>

<a href="#discussion">[#6 Discussion]</a><br>

To start off with, I want to look at the pairwise correlations present between features in the data.

### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Price had a surprisingly high amount of correlation with the diamond size, even before transforming the features. An approximately linear relationship was observed when price was plotted on a log scale and carat was plotted with a cube-root transform. The scatterplot that came out of this also suggested that there was an upper bound on the diamond prices available in the dataset, since the range of prices for the largest diamonds was much narrower than would have been expected, based on the price ranges of smaller diamonds.

There was also an interesting relationship observed between price and the categorical features. For all of cut, color, and clarity, lower prices were associated with increasing quality. One of the potentially major interacting factors is the fact that improved quality levels were also associated with smaller diamonds. This will have to be explored further in the next section.

### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Expected relationships were found in the association between the 'x', 'y', and 'z' measurements of diamonds to the other linear dimensions as well as to the 'carat' variable. A small negative correlation was observed between table size and depth, but neither of these variables show a strong correlation with price, so they won't be explored further. There was also a small interaction in the categorical quality features. Diamonds of lower clarity appear to have slightly better cut and color grades.

<a id='multivariate2'></a>
## 6.4 Discussion - Multivariate Exploration

<a href="#contents">[Table of Contents]</a>

<a href="#explore">[#5 Data Exploration]</a><br>
--- <a href="#multivariate1">[5.4 Exploration - Multivariate Exploration]</a> 

<a href="#discussion">[#6 Discussion]</a><br>

The main thing I want to explore in this part of the analysis is how the three categorical measures of quality play into the relationship between price and carat.

### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

I extended my investigation of price against diamond size in this section by looking at the impact of the three categorical quality features. The multivariate exploration here showed that there indeed is a positive effect of increased quality grade on diamond price, but in the dataset, this is initially hidden by the fact that higher grades were more prevalent in smaller diamonds, which fetch lower prices overall. Controlling for the carat weight of a diamond shows the effect of the other C's of diamonds more clearly. This effect was clearest for the color and clarity variables, with less systematic trends for cut.

### Were there any interesting or surprising interactions between features?

Looking back on the point plots, there doesn't seem to be a systematic interaction effect between the three categorical features, although the features also aren't fully independent. It is interesting that in something like the 1-carat plot for prices against cut and clarity, the shape of the 'cut' dots is fairly similar for the SI2 through VVS2 clarity levels.

<hr size="5"/>
<h5><center>🍻 Discussion END ➜ Conclusion START 🍻</center></h5>        
<hr size="5"/>

<a id='conclusion'></a>
# 7. CONCLUSION


<font color=blue>
    
<a href="#contents">[Table of Contents]</a>

<a href="#explore">[#5 Data Exploration]</a><br>
--- <a href="#preliminary1">[Preliminary Review]</a>
--- <a href="#univariate1">[Univariate Exploration]</a>
--- <a href="#bivariate1">[Bivariate Exploration]</a>
--- <a href="#multivariate1">[Multivariate Exploration]</a> 

<a href="#discussion">[#6 Discussion]</a><br>
--- <a href="#preliminary2">[Preliminary Review]</a>
--- <a href="#univariate2">[Univariate Exploration]</a>
--- <a href="#bivariate2">[Bivariate Exploration]</a>
--- <a href="#multivariate2">[Multivariate Exploration]</a> 

<hr size="5"/>
<h5><center>🏁 Conclusion END -- Project FINALE 🏁</center></h5>        
<hr size="5"/>

# End of Data Project!

Made with ❤️ by Jhon!

Further improvements include...
1. Validation statements using ```assert``` clauses to confirm that data manipulation is correct during the cleaning stage
2. More plots exploring data
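
The ```assert``` idea in item 1 might be sketched as follows. The function name `validate_cleaned`, the sample frame, and the default cut-off year are all hypothetical; the checks only illustrate the kind of expectations that could be verified after cleaning.

```python
import pandas as pd

# Hypothetical cleaned sample standing in for `dfc`
dfc_demo = pd.DataFrame({
    "duration_sec": [600, 90],
    "end_station_id": [13.0, 21.0],
    "member_birth_year": [1984.0, 1972.0],
})


def validate_cleaned(frame, min_birth_year=1920):
    """Raise AssertionError if a cleaning expectation is violated; return True otherwise."""
    assert frame["end_station_id"].notna().all(), "null end_station_id remains"
    assert (frame["duration_sec"] > 0).all(), "non-positive duration found"
    # Missing birth years are skipped; only recorded years are checked (an assumption)
    recorded = frame["member_birth_year"].dropna()
    assert (recorded >= min_birth_year).all(), "birth year below cut-off"
    return True


print(validate_cleaned(dfc_demo))
```

Running such a check right after each cleaning step would surface a failed expectation immediately instead of during plotting.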

<a href="#contents">[Table of Contents]</a>