<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Lab 3.02: Statistical Modeling and Model Validation

> Authors: Tim Book, Matt Brems

---

## Objective
The goal of this lab is to guide you through the modeling workflow to produce the best model you can. In this lab, you will follow best practices when splitting your data and validating your model.

## Imports

In [1]:
# Import everything you need here.
# You may want to return to this cell to import more things later in the lab.
# DO NOT COPY AND PASTE FROM OUR CLASS SLIDES!
# Muscle memory is important!

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, cross_val_score


from sklearn.linear_model import LinearRegression

## Read Data
The `citibike` dataset consists of Citi Bike ridership data for over 224,000 rides in February 2014.

In [25]:
# Read in the citibike data in the data folder in this repository.
df = pd.read_csv('data/citibike_feb2014.csv')

In [3]:
df.head()

Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender
0,382,2014-02-01 00:00:00,2014-02-01 00:06:22,294,Washington Square E,40.730494,-73.995721,265,Stanton St & Chrystie St,40.722293,-73.991475,21101,Subscriber,1991,1
1,372,2014-02-01 00:00:03,2014-02-01 00:06:15,285,Broadway & E 14 St,40.734546,-73.990741,439,E 4 St & 2 Ave,40.726281,-73.98978,15456,Subscriber,1979,2
2,591,2014-02-01 00:00:09,2014-02-01 00:10:00,247,Perry St & Bleecker St,40.735354,-74.004831,251,Mott St & Prince St,40.72318,-73.9948,16281,Subscriber,1948,2
3,583,2014-02-01 00:00:32,2014-02-01 00:10:15,357,E 11 St & Broadway,40.732618,-73.99158,284,Greenwich Ave & 8 Ave,40.739017,-74.002638,17400,Subscriber,1981,1
4,223,2014-02-01 00:00:41,2014-02-01 00:04:24,401,Allen St & Rivington St,40.720196,-73.989978,439,E 4 St & 2 Ave,40.726281,-73.98978,19341,Subscriber,1990,1


## Explore the data
Use this space to familiarize yourself with the data.

Convince yourself there are no issues with the data. If you find any issues, clean them here.

In [4]:
df.isnull().sum()

tripduration               0
starttime                  0
stoptime                   0
start station id           0
start station name         0
start station latitude     0
start station longitude    0
end station id             0
end station name           0
end station latitude       0
end station longitude      0
bikeid                     0
usertype                   0
birth year                 0
gender                     0
dtype: int64

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 224736 entries, 0 to 224735
Data columns (total 15 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   tripduration             224736 non-null  int64  
 1   starttime                224736 non-null  object 
 2   stoptime                 224736 non-null  object 
 3   start station id         224736 non-null  int64  
 4   start station name       224736 non-null  object 
 5   start station latitude   224736 non-null  float64
 6   start station longitude  224736 non-null  float64
 7   end station id           224736 non-null  int64  
 8   end station name         224736 non-null  object 
 9   end station latitude     224736 non-null  float64
 10  end station longitude    224736 non-null  float64
 11  bikeid                   224736 non-null  int64  
 12  usertype                 224736 non-null  object 
 13  birth year               224736 non-null  object 
 14  gend
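
`df.isnull().sum()` reports no missing values, yet `df.info()` lists `birth year` as `object`, which hints that some entries are not numbers. A minimal sketch of how such hidden placeholders could be surfaced is below. It uses a small made-up Series (`birth_year` here is a toy stand-in, not the real column), and `pd.to_numeric(..., errors='coerce')` is one possible approach, not necessarily the one intended by the lab.

```python
import pandas as pd

# Toy stand-in for the `birth year` column (assumed values, for illustration only).
birth_year = pd.Series(['1991', '1979', '\\N', '1948'], name='birth year')

# Coerce to numeric; anything that cannot be parsed becomes NaN.
parsed = pd.to_numeric(birth_year, errors='coerce')

# Entries that failed to parse are placeholders that isnull() did not flag.
bad_values = birth_year[parsed.isna()].unique()
n_bad = int(parsed.isna().sum())
print(bad_values, n_bad)
```

Running the same check on the real column would show which strings need cleaning before the column can be used to compute an age.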

## Is average trip duration different by gender?

Conduct a hypothesis test that checks whether or not the average trip duration is different for `gender=1` and `gender=2`. Be sure to specify your null and alternative hypotheses, and to state your conclusion carefully and correctly!

Null hypothesis: the mean trip duration is the same for `gender=1` and `gender=2` ($H_0: \mu_1 = \mu_2$).
Alternative hypothesis: the mean trip duration differs between `gender=1` and `gender=2` ($H_A: \mu_1 \neq \mu_2$).


In [6]:
df['gender'].value_counts()

1    176526
2     41479
0      6731
Name: gender, dtype: int64

In [7]:
df.groupby('gender')['tripduration'].mean()

gender
0    1740.830932
1     814.032409
2     991.361074
Name: tripduration, dtype: float64

In [8]:
df.groupby('gender')['tripduration'].std()

gender
0    5566.110472
1    5020.576128
2    7114.753227
Name: tripduration, dtype: float64

Because both groups have far more than 30 observations, the Central Limit Theorem lets us use a two-sample z-test.
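
For reference, the statistic computed in the next few cells can be written as follows. The symbols are notation introduced here for illustration: $\bar{x}_g$, $s_g$, and $n_g$ are the sample mean, sample standard deviation, and count for `gender=g`.

$$
z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}
$$

Under the null hypothesis, $z$ is approximately standard normal, so a two-sided test at the 0.05 level rejects when $|z| > 1.96$.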

In [9]:
s = (((5020.576128**2)/176526) + ((7114.753227**2)/41479)) ** 0.5

In [10]:
z_score = (814.032409-991.361074)/s

In [11]:
z_score

-4.802922146172092

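The z-score of about -4.80 is less than -1.96, the critical value for a two-sided test at the 0.05 significance level (equivalently, $|z| > 1.96$). Therefore we reject the null hypothesis: the data provide evidence that the mean trip duration differs between `gender=1` and `gender=2`.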
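
The same test can be sketched in one self-contained block that also reports a two-sided p-value. The summary statistics are copied from the `groupby` outputs above. The variable names are hypothetical, and using `math.erfc` for the normal tail probability is a choice made here to avoid extra dependencies.

```python
import math

# Summary statistics copied from the groupby outputs above.
mean_1, std_1, n_1 = 814.032409, 5020.576128, 176526   # gender = 1
mean_2, std_2, n_2 = 991.361074, 7114.753227, 41479    # gender = 2

# Standard error of the difference in means (unpooled variances).
se = math.sqrt(std_1**2 / n_1 + std_2**2 / n_2)

# Two-sample z statistic.
z = (mean_1 - mean_2) / se

# Two-sided p-value under a standard normal: P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
p_value = math.erfc(abs(z) / math.sqrt(2))

print(f"z = {z:.4f}, p-value = {p_value:.2e}")
```

Comparing `p_value` to 0.05 gives the same decision as comparing `abs(z)` to 1.96.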

## What numeric columns shouldn't be treated as numeric?

**Answer:** gender, bikeid, start station id, end station id

## Dummify the `start station id` Variable

In [26]:
df['start station id'].describe()

count    224736.000000
mean        439.203479
std         335.723861
min          72.000000
25%         305.000000
50%         403.000000
75%         490.000000
max        3002.000000
Name: start station id, dtype: float64

In [44]:
df['start station id'] = df['start station id'].astype(str)

In [45]:
df.head()

Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender
0,382,2014-02-01 00:00:00,2014-02-01 00:06:22,294,Washington Square E,40.730494,-73.995721,265,Stanton St & Chrystie St,40.722293,-73.991475,21101,Subscriber,1991,1
1,372,2014-02-01 00:00:03,2014-02-01 00:06:15,285,Broadway & E 14 St,40.734546,-73.990741,439,E 4 St & 2 Ave,40.726281,-73.98978,15456,Subscriber,1979,2
2,591,2014-02-01 00:00:09,2014-02-01 00:10:00,247,Perry St & Bleecker St,40.735354,-74.004831,251,Mott St & Prince St,40.72318,-73.9948,16281,Subscriber,1948,2
3,583,2014-02-01 00:00:32,2014-02-01 00:10:15,357,E 11 St & Broadway,40.732618,-73.99158,284,Greenwich Ave & 8 Ave,40.739017,-74.002638,17400,Subscriber,1981,1
4,223,2014-02-01 00:00:41,2014-02-01 00:04:24,401,Allen St & Rivington St,40.720196,-73.989978,439,E 4 St & 2 Ave,40.726281,-73.98978,19341,Subscriber,1990,1


In [46]:
dums = pd.get_dummies(df['start station id'])  # pass the column itself, not its name as a string

In [None]:
#df = df.join(dums)

This last line worked earlier but no longer runs, so the station dummies are left out of the model below.
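
A minimal sketch of dummy-encoding a station ID column on a toy frame follows. The frame `toy`, the `station` prefix, and the use of `drop_first=True` are assumptions made for illustration. The prefix keeps the new column names distinct from existing ones, which matters because `DataFrame.join` raises an error when column names overlap (for example, if the join cell is run twice).

```python
import pandas as pd

# Toy data standing in for the real DataFrame (values are illustrative).
toy = pd.DataFrame({
    'start station id': [294, 285, 294, 247],
    'tripduration': [382, 372, 591, 583],
})

# Treat the IDs as categories, then one-hot encode them.
# drop_first=True drops one level to avoid perfectly collinear columns.
dummies = pd.get_dummies(
    toy['start station id'].astype(str),
    prefix='station',
    drop_first=True,
)

# Attach the dummy columns alongside the original ones.
toy_encoded = pd.concat([toy, dummies], axis=1)
print(toy_encoded.columns.tolist())
```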

## Engineer a feature called `age` that shares how old the person would have been in 2014 (at the time the data was collected).

- Note: you will need to clean the data a bit.

In [17]:
df['birth year'].value_counts()

1985    9305
1984    9139
1983    8779
1981    8208
1986    8109
        ... 
1910       4
1917       3
1927       2
1921       1
1913       1
Name: birth year, Length: 78, dtype: int64

In [51]:
# Convert year strings to int, leaving the '\\N' placeholder untouched (got help in office hours).
df['birth year'] = df['birth year'].apply(lambda x: x if x == '\\N' else int(x))


In [19]:
df['birth year'].unique()

array([1991, 1979, 1948, 1981, 1990, 1978, 1944, 1983, 1969, 1986, 1962,
       1965, 1942, 1989, 1980, 1957, 1951, 1992, 1971, 1982, 1968, 1984,
       '\\N', 1956, 1987, 1985, 1996, 1975, 1988, 1974, 1972, 1959, 1973,
       1977, 1976, 1953, 1993, 1970, 1963, 1967, 1966, 1960, 1961, 1994,
       1958, 1955, 1946, 1964, 1900, 1995, 1954, 1952, 1949, 1947, 1941,
       1938, 1950, 1945, 1997, 1934, 1940, 1939, 1936, 1943, 1935, 1937,
       1922, 1932, 1907, 1926, 1899, 1901, 1917, 1910, 1933, 1921, 1927,
       1913], dtype=object)

In [52]:
df['birth year'] = df['birth year'].replace('\\N', 1985).astype(int)  # impute with 1985, the mode of the data

In [53]:
df['age'] = 2014-df['birth year']
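
The cleaning steps above can also be sketched as a single pass on toy data. The frame `toy` and its values are made up, and `pd.to_numeric` with `errors='coerce'` is an alternative to the `apply`/`replace` approach used in this notebook, not a correction of it.

```python
import pandas as pd

# Toy stand-in for the real column (illustrative values only).
toy = pd.DataFrame({'birth year': ['1991', '1979', '\\N', '1979', '1948']})

# Parse years; the '\N' placeholder becomes NaN.
birth_year = pd.to_numeric(toy['birth year'], errors='coerce')

# Impute missing years with the mode, computed from the data rather than hard-coded.
mode_year = int(birth_year.mode().iloc[0])
toy['birth year'] = birth_year.fillna(mode_year).astype(int)

# Age at the time of data collection (2014).
toy['age'] = 2014 - toy['birth year']
print(toy)
```

Computing the mode in code keeps the imputed value in sync with the data if the file ever changes.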

In [49]:
dum2 = pd.get_dummies(df['usertype'])

In [50]:
df = df.join(dum2)

## Split your data into train/test data

Look at the size of your data. What is a good proportion for your split? **Justify your answer.**

Use the `tripduration` column as your `y` variable.

For your `X` variables, use `age`, `usertype`, `gender`, and the dummy variables you created from `start station id`. (Hint: You may find the Pandas `.drop()` method helpful here.)

**NOTE:** When doing your train/test split, please use random seed 123.

In [54]:
features = ['age', 'gender', 'Subscriber']
X = df[features]
y = df['tripduration']

In [55]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
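
One way to reason about the split proportion is to look at how many rows each side receives. The sketch below uses placeholder arrays (`X_demo`, `y_demo`) with the same row count that `df.info()` reports, so the sizes match what `test_size=0.2` would give on the real data. The interpretation that roughly 45,000 test rows is enough to estimate test error is a judgment call offered here, not a result from the lab.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 224736  # row count reported by df.info()

# Placeholder arrays with the same length as the real data.
X_demo = np.arange(n_rows).reshape(-1, 1)
y_demo = np.arange(n_rows)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=123
)

print(len(X_tr), len(X_te))
```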

## Fit a Linear Regression model in `sklearn` predicting `tripduration`.

In [56]:
lr = LinearRegression()
lr.fit(X_train, y_train);

## Evaluate your model
Look at some evaluation metrics for **both** the training and test data. 
- How did your model do? Is it overfit, underfit, or neither?
- Does this model outperform the baseline? (e.g. setting $\hat{y}$ to be the mean of our training `y` values.)

In [57]:
lr.score(X_train, y_train)

0.0011228686054517434

In [58]:
lr.score(X_test, y_test)

0.0005368640691153503

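The model performs very poorly: it explains roughly 0.1% of the variation in trip duration on the training data and about 0.05% on the test data. Because both scores are close to zero, the model is underfit (high bias), and it is only marginally better than the baseline of predicting the training mean, whose $R^2$ is about 0.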
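
The baseline comparison can be made explicit in code. The sketch below runs on synthetic data (`X_demo` and `y_demo` are made up), uses `DummyRegressor(strategy='mean')` as the mean-prediction baseline, and adds `cross_val_score`, which is imported at the top of this lab. It illustrates the comparison only and does not reproduce the scores above.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

# Synthetic data: a weak signal buried in a lot of noise (illustrative only).
rng = np.random.default_rng(123)
X_demo = rng.normal(size=(1000, 3))
y_demo = 800 + 5 * X_demo[:, 0] + rng.normal(scale=500, size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=123
)

# Baseline: always predict the mean of the training target.
baseline = DummyRegressor(strategy='mean').fit(X_tr, y_tr)
baseline_train_r2 = baseline.score(X_tr, y_tr)
baseline_test_r2 = baseline.score(X_te, y_te)

# Candidate model.
lr_demo = LinearRegression().fit(X_tr, y_tr)
model_train_r2 = lr_demo.score(X_tr, y_tr)
model_test_r2 = lr_demo.score(X_te, y_te)

# 5-fold cross-validated R^2 on the training data.
cv_scores = cross_val_score(LinearRegression(), X_tr, y_tr, cv=5)

print(baseline_train_r2, baseline_test_r2)
print(model_train_r2, model_test_r2, cv_scores.mean())
```

The mean baseline has a training $R^2$ of 0 by construction, so any model $R^2$ can be read as improvement over that baseline.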

## Fit a Linear Regression model in `statsmodels` predicting `tripduration`.

In [None]:
import statsmodels.api as sm # https://www.statsmodels.org/stable/regression.html

In [None]:
model = sm.OLS(y, X)  # note: sm.OLS fits no intercept unless one is added, e.g. with sm.add_constant(X)

In [None]:
res = model.fit()

In [None]:
print(res.summary())

The model remains bad.

## Using the `statsmodels` summary, test whether or not `age` has a significant effect when predicting `tripduration`.
- Be sure to specify your null and alternative hypotheses, and to state your conclusion carefully and correctly **in the context of your model**!

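Null hypothesis: $H_0: \beta_{\text{age}} = 0$, i.e. holding `gender` and `Subscriber` constant, `age` has no linear effect on `tripduration`. Alternative hypothesis: $H_A: \beta_{\text{age}} \neq 0$. The p-value for `age` is far less than 0.05, so we reject the null hypothesis: in this model, `age` has a statistically significant effect on `tripduration` at the 0.05 level.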

## Citi Bike is attempting to market to people who they think will ride their bike for a long time. Based on your modeling, what types of individuals should Citi Bike market toward?

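They should market toward older subscribers, since both `age` and `Subscriber` have positive coefficients in the model. Because the model explains so little of the variation in trip duration, this recommendation should be treated as weak evidence.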