# Regression

Regression is a supervised Machine Learning method that predicts the mean of a quantitative dependent variable. For this dataset we are going to use Linear Regression, a supervised statistical model. This model fits the equation: $$Y = \beta_{0}+\beta_{1}X_{1}+...+\beta_{p}X_{p}$$ for $\beta = (\beta_{0},\beta_{1},...,\beta_{p})$ using the least squares method, where $Y$ is the dependent variable, $X_{i}$ are the independent variables (or features) for $i = 1,...,p$, and $\beta_{0}$ is the intercept, which scikit-learn fits by default. Scikit-learn provides the class "linear_model.LinearRegression", which performs this task.
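To make the least squares step concrete, here is a minimal sketch that recovers known coefficients from synthetic data using only NumPy. The data, the coefficient values, and the variable names are assumptions made up for illustration; they are not part of the bike-share dataset, and this is not necessarily how scikit-learn computes the solution internally.

```python
import numpy as np

# Synthetic data (hypothetical): y = 2 + 3*x1 - 1*x2 + small noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
true_beta = np.array([2.0, 3.0, -1.0])  # intercept, beta_1, beta_2
y_demo = true_beta[0] + X_demo @ true_beta[1:] + rng.normal(scale=0.1, size=200)

# Prepend a column of ones so the intercept is estimated with the slopes
X_design = np.column_stack([np.ones(len(X_demo)), X_demo])

# Least squares: find the beta that minimises ||y - X_design @ beta||^2
beta_hat, *_ = np.linalg.lstsq(X_design, y_demo, rcond=None)
print(beta_hat)  # should be close to true_beta
```

Because the noise is small, the estimated coefficients should land close to the ones used to generate the data.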


Before using it, we need to prepare the data. LinearRegression only accepts numeric input and does not accept missing values (NaN), which is why we have to clean the dataset first. First of all, we load the dataset:

In [5]:
import pandas as pd
import numpy as np

df2 = pd.read_csv("data/201801-fordgobike-tripdata.csv",parse_dates=True)
df3 = pd.read_csv("data/201802-fordgobike-tripdata.csv",parse_dates=True)
df4 = pd.read_csv("data/201803-fordgobike-tripdata.csv",parse_dates=True)
df5 = pd.read_csv("data/201804-fordgobike-tripdata.csv",parse_dates=True)
df6 = pd.read_csv("data/201805-fordgobike-tripdata.csv",parse_dates=True)
df7 = pd.read_csv("data/201806-fordgobike-tripdata.csv",parse_dates=True)
df8 = pd.read_csv("data/201807-fordgobike-tripdata.csv",parse_dates=True)

df = pd.concat([df2, df3, df4, df5, df6, df7, df8], axis=0)
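The seven monthly files follow the same naming pattern, so the loading step can also be written as a loop. The sketch below is one possible alternative, not the code used in this notebook: the helper name `load_months` is hypothetical, it omits `parse_dates`, and it passes `ignore_index=True`, which renumbers the rows from 0 instead of keeping each file's own index.

```python
import pandas as pd

def load_months(paths):
    """Read each monthly CSV and stack the rows into one DataFrame."""
    frames = [pd.read_csv(path) for path in paths]
    # ignore_index=True gives the combined frame a fresh 0..n-1 index
    return pd.concat(frames, axis=0, ignore_index=True)

# File names follow the pattern used above (January to July 2018)
paths = [f"data/2018{month:02d}-fordgobike-tripdata.csv" for month in range(1, 8)]
```

With the files in place, `df = load_months(paths)` would give one DataFrame containing the rows of all seven months.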

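Then we handle the missing values: we fill in missing birth years with the median and missing genders with 'Other', and we drop the trips that have no end station: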

In [7]:
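# Fill missing birth years with the median birth year
mediana = df['member_birth_year'].median()
df['member_birth_year'] = df['member_birth_year'].fillna(mediana)
# Treat a missing gender as 'Other'
df['member_gender'] = df['member_gender'].fillna('Other')
# Keep only the trips that have an end station id
df = df[np.isfinite(df['end_station_id'])]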
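A quick way to confirm that the cleaning worked is to count the missing values that remain in each column. The sketch below applies the same three steps to a tiny toy frame with made-up values (an assumption for illustration, not rows from the real dataset) and then checks the result.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with the three columns cleaned above
toy = pd.DataFrame({
    "member_birth_year": [1986.0, np.nan, 1996.0],
    "member_gender": ["Male", None, "Female"],
    "end_station_id": [285.0, 15.0, np.nan],
})

# Same cleaning steps: median imputation, 'Other' for gender, drop rows without end station
toy["member_birth_year"] = toy["member_birth_year"].fillna(toy["member_birth_year"].median())
toy["member_gender"] = toy["member_gender"].fillna("Other")
toy = toy[toy["end_station_id"].notna()]

# Number of missing values left in each column; every count should be zero
missing = toy.isna().sum()
print(missing)
```

On the real data, the same check would be `df.isna().sum()` on the columns that are later passed to the model.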

Now we encode the qualitative variables as numeric dummy (one-hot) columns:

In [9]:
dummies_user_type = pd.get_dummies(df.user_type)
merged_user_type = pd.concat([df, dummies_user_type],axis='columns')
df_temporal = merged_user_type.drop(['user_type'], axis='columns')
dummies_member_gender = pd.get_dummies(df.member_gender)
dummies_bike_share = pd.get_dummies(df.bike_share_for_all_trip)
merged_gender = pd.concat([df_temporal, dummies_member_gender,dummies_bike_share],axis='columns')
df_temporal2 = merged_gender.drop(['member_gender',"end_time","start_time","bike_share_for_all_trip","end_station_name","start_station_name"], axis='columns')
df_temporal2.head()

| | duration_sec | start_station_id | start_station_latitude | start_station_longitude | end_station_id | end_station_latitude | end_station_longitude | bike_id | member_birth_year | Customer | Subscriber | Female | Male | Other | No | Yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 75284 | 120.0 | 37.76142 | -122.426435 | 285.0 | 37.783521 | -122.431158 | 2765 | 1986.0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 1 | 85422 | 15.0 | 37.795392 | -122.394203 | 15.0 | 37.795392 | -122.394203 | 2815 | 1985.0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
| 2 | 71576 | 304.0 | 37.348759 | -121.894798 | 296.0 | 37.325998 | -121.87712 | 3039 | 1996.0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
| 3 | 61076 | 75.0 | 37.773793 | -122.421239 | 47.0 | 37.780955 | -122.399749 | 321 | 1985.0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
| 4 | 39966 | 74.0 | 37.776435 | -122.426244 | 19.0 | 37.788975 | -122.403452 | 617 | 1991.0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
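To see what the dummy encoding does in isolation, here is a small sketch on a toy column with made-up values. Two details are worth noting as assumptions about the pandas version: passing `dtype=int` forces 0/1 output (recent pandas versions return booleans by default), and `drop_first=True` is an optional variant that this notebook does not use.

```python
import pandas as pd

# Toy column (hypothetical values) with the two user types
toy = pd.DataFrame({"user_type": ["Subscriber", "Customer", "Customer"]})

# One column per category, with a 1 marking the category of each row
dummies = pd.get_dummies(toy["user_type"], dtype=int)
print(dummies)

# Every row has exactly one 1, so each column is fully determined by the others
row_sums = dummies.sum(axis=1)

# Optional variant: drop the first category to remove that redundancy
reduced = pd.get_dummies(toy["user_type"], drop_first=True, dtype=int)
print(reduced)
```

Since the dummy columns of one variable always sum to 1, they are collinear with the intercept; dropping one category per variable is a common way to avoid this, although the predictions of an ordinary least squares fit are not affected by it.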


Now that the dataset is clean, we are ready to use LinearRegression. First we choose the feature duration_sec as the dependent variable and the rest of the features as independent variables. Notice that in the code above we dropped other features like end_time and start_time, because if we had those variables in real life the regression would not be necessary: the duration is simply the difference between them. The station names were also dropped because we already have their longitude and latitude.

In [10]:
y = df_temporal2.pop('duration_sec')
X = df_temporal2

Then we split the data into a training set with 70% of the observations and a test set with the remaining 30%:

In [12]:
from sklearn import linear_model
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.70)
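The split above is random, so each run of the notebook produces a different training set and slightly different results. If reproducibility matters, one option is to fix the seed with the `random_state` parameter. The sketch below shows this on small made-up arrays; the seed value 42 is an arbitrary choice and was not used to produce the outputs in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Small made-up arrays standing in for X and y
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# The same random_state yields the same split every time
split_a = train_test_split(X_demo, y_demo, train_size=0.70, random_state=42)
split_b = train_test_split(X_demo, y_demo, train_size=0.70, random_state=42)

X_train_a, X_test_a, y_train_a, y_test_a = split_a
print(len(X_train_a), len(X_test_a))
```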




Now we are ready to fit the model:

In [13]:
clf = linear_model.LinearRegression()
clf.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

To evaluate the model's prediction accuracy we use the Residual Sum of Squares (RSS), given by: $$\sum (y_{i}-\hat{y}_{i})^{2}$$ where the $y_{i}$ are the observed values of the dependent variable in y_test and the $\hat{y}_{i}$ are the corresponding predictions obtained with the model:

In [40]:
y_pred = clf.predict(X_test)
print(y_pred)
# Convert the test targets to a NumPy array so they align with y_pred by position
y_test = np.asarray(y_test)
print(y_test)
# Residual sum of squares on the test set
RSS = np.square(y_pred - y_test).sum()
RSS

[840.69481602 732.78684658 633.82434353 ... 553.23563219 643.98843201
 687.67178229]
[ 508  424 1273 ...  317 1327 1286]


1879170662645.216
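The RSS is a raw sum, so its size depends on the number of test observations and on the units of the target. Two related quantities that are easier to read are the root mean squared error (RMSE), which is in seconds, and the coefficient of determination $R^{2}$. The sketch below computes them with NumPy; as an illustration it uses only the six prediction/observation pairs visible in the printed output above, which is an assumption made for brevity and is not representative of the full test set.

```python
import numpy as np

# Only the six pairs visible in the printed output (not the full test set)
y_pred_shown = np.array([840.69481602, 732.78684658, 633.82434353,
                         553.23563219, 643.98843201, 687.67178229])
y_test_shown = np.array([508, 424, 1273, 317, 1327, 1286], dtype=float)

# Residual sum of squares, as defined above
rss = np.square(y_test_shown - y_pred_shown).sum()

# RMSE: typical size of an error, in the units of the target (seconds)
rmse = np.sqrt(rss / len(y_test_shown))

# R^2 compares the RSS with the squared spread of the observations around their mean
tss = np.square(y_test_shown - y_test_shown.mean()).sum()
r2 = 1.0 - rss / tss

print(rmse, r2)
```

On the full data, the same formulas can be applied directly to y_pred and y_test.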

A low RSS means the model is well fitted, since it is the sum of the squares of the differences between the actual values and the predicted values. The RSS obtained here is very large, and the printed predictions differ from the observed durations by hundreds of seconds, so the model is poorly fitted and its predictions are not good. To improve the predictive power of this model you can try feature selection, and you can use cross-validation to evaluate candidate models more reliably than with a single train/test split.
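As a rough idea of what cross-validation looks like in scikit-learn, the sketch below runs 5-fold cross-validation of a linear regression on synthetic data. The data, the coefficients, and the choice of 5 folds are assumptions for illustration only; on the real problem, X and y from above would take the place of the synthetic arrays.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data (hypothetical): three features and a noisy linear target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=300)

# 5-fold cross-validation: each fold is used once as the validation set.
# scikit-learn reports the negated MSE so that a higher score is always better.
scores = cross_val_score(LinearRegression(), X_demo, y_demo,
                         cv=5, scoring="neg_mean_squared_error")

# Convert back to an RMSE per fold
rmse_per_fold = np.sqrt(-scores)
print(rmse_per_fold.mean(), rmse_per_fold.std())
```

Comparing the mean RMSE across folds for different feature subsets gives a steadier basis for feature selection than one random split.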