# Multiple Linear Regression
Multiple independent variables in Linear Regression

In [2]:
from sklearn.linear_model import LinearRegression 
import pandas as pd

In [7]:
# Read csv file
car = pd.read_csv('./files/auto-mpg.csv', header=None, names=['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model year', 'origin', 'name'])
car.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin,name
0,18.0,8,307.0,130.0,3504.0,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693.0,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150.0,3436.0,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150.0,3433.0,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140.0,3449.0,10.5,70,1,ford torino


In [10]:
# Set x and y data
x = car[['weight', 'cylinders']]
y = car[['mpg']]

In [11]:
# Import the train/test split helper
from sklearn.model_selection import train_test_split

In [12]:
# Split into train & test sets and confirm the resulting shapes
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.85, random_state=0)
x_train.shape, x_test.shape, y_train.shape, y_test.shape

((338, 2), (60, 2), (338, 1), (60, 1))
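
The shapes above follow from simple arithmetic on the row count. A minimal sketch, assuming `train_test_split` rounds the training share down when `train_size` is a float and assigns the remainder to the test set (this matches the output shown; the helper name `split_sizes` is hypothetical):

```python
import math

def split_sizes(n_samples, train_size):
    """Return (n_train, n_test) for a fractional train_size."""
    n_train = math.floor(train_size * n_samples)  # training share, rounded down
    n_test = n_samples - n_train                  # everything else is test data
    return n_train, n_test

# 338 + 60 = 398 rows in the car data
print(split_sizes(398, 0.85))
```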

In [13]:
# Set model object
lm = LinearRegression()

In [14]:
# Fit data
lm.fit(x_train, y_train)

LinearRegression()

In [21]:
# Predict y_train_hat with linear model already fitted
y_train_hat = lm.predict(x_train)
y_train_hat[:5]

array([[19.31457821],
       [27.04576281],
       [24.18176499],
       [23.90895452],
       [12.1042136 ]])

In [26]:
# Compare with actual y_train value
y_train.iloc[:5]

Unnamed: 0,mpg
200,18.0
206,26.5
107,18.0
15,22.0
106,12.0
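
To put a number on the comparison, the five predictions and actual values shown above can be subtracted directly. A small sketch in plain Python, with the values copied from the two outputs (variable names are illustrative):

```python
# First five fitted values and the matching actual mpg values, copied from the outputs above
y_hat = [19.31457821, 27.04576281, 24.18176499, 23.90895452, 12.1042136]
y_true = [18.0, 26.5, 18.0, 22.0, 12.0]

# Residual = actual - predicted; a negative value means the model overestimates mpg
residuals = [actual - predicted for actual, predicted in zip(y_true, y_hat)]

# Mean absolute error over these five rows only (not the whole training set)
mae = sum(abs(r) for r in residuals) / len(residuals)

print([round(r, 2) for r in residuals])
print(round(mae, 2))
```

All five residuals are negative, so the model overestimates mpg on each of these rows, most noticeably on the third.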


In [15]:
# Get coefficient & intercept
lm.coef_, lm.intercept_

(array([[-0.00620024, -0.73757226]]), array([45.89966215]))

The output above corresponds to the following fitted equation:
* **y (mpg) = -0.00620024 * (weight) - 0.73757226 * (cylinders) + 45.89966215**
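
The fitted equation can be checked by hand. A minimal sketch that hard-codes the coefficients and intercept reported by `lm.coef_` and `lm.intercept_` above (the function name `predict_mpg` is hypothetical):

```python
# Values copied from the lm.coef_ and lm.intercept_ output
COEF_WEIGHT = -0.00620024
COEF_CYLINDERS = -0.73757226
INTERCEPT = 45.89966215

def predict_mpg(weight, cylinders):
    """Evaluate the fitted linear equation for one car."""
    return COEF_WEIGHT * weight + COEF_CYLINDERS * cylinders + INTERCEPT

# First row of the dataset: weight 3504, 8 cylinders, actual mpg 18.0
print(round(predict_mpg(3504, 8), 2))
```

For that car the equation gives roughly 18.3 mpg against an actual value of 18.0. Both coefficients are negative, so heavier cars and cars with more cylinders are predicted to have lower mpg.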

In [17]:
# R-squared for train data
lm.score(x_train, y_train)

0.6897269031891438

In [16]:
# R-squared for whole data with already fitted linear model
# R-squared of self.predict(x) and y
lm.score(x, y)

0.6960790024886645

In [27]:
# R-squared for test data with already fitted linear model
lm.score(x_test, y_test)

0.7259419804282132
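
`score` on a regressor returns the coefficient of determination R². As a sketch of what that number measures, it can be computed from predictions and actual values as 1 minus the ratio of the residual sum of squares to the total sum of squares (the helper `r_squared` below is illustrative, not part of scikit-learn):

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total variation around the mean
    return 1 - ss_res / ss_tot

y = [18.0, 15.0, 18.0, 16.0, 17.0]                 # mpg values from car.head()
print(r_squared(y, y))                             # perfect predictions
print(r_squared(y, [sum(y) / len(y)] * len(y)))    # always predicting the mean
```

A score of 1.0 means perfect predictions and 0.0 means no better than always predicting the mean, so the values of about 0.69 to 0.73 above indicate that weight and cylinders explain roughly 70% of the variation in mpg.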

### Dump Linear Model into Pickle
This answers the question "How can I save the fitted model and use it later without re-importing the data and re-running everything?"

In [28]:
import pickle

In [29]:
with open('./storage/car_lm.pkl', 'wb') as f:   # the with-block closes the file handle
    pickle.dump(lm, f)

### Load Pickle of Linear Model

In [30]:
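with open('./storage/car_lm.pkl', 'rb') as f:   # the with-block closes the file handle
    lm_loaded = pickle.load(f)
lm_loaded

LinearRegression()

The output looks like a bare LinearRegression() object, but the loaded model carries the same fitted coefficients and intercept as the model above.

In [32]:
lm_loaded.predict(x_train)[:5]     # The outcome is exactly the same as the previous one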

array([[19.31457821],
       [27.04576281],
       [24.18176499],
       [23.90895452],
       [12.1042136 ]])
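
A self-contained sketch of the same dump and load round trip, using a plain dictionary as a stand-in for the fitted model and a temporary directory instead of `./storage` (both are assumptions for illustration):

```python
import os
import pickle
import tempfile

# Stand-in for the fitted model: the coefficients and intercept reported above
model_params = {"coef": [-0.00620024, -0.73757226], "intercept": 45.89966215}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "car_lm.pkl")

    # Write the object to disk in binary mode
    with open(path, "wb") as f:
        pickle.dump(model_params, f)

    # Read it back; the result is an equal but separate object
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == model_params)
print(restored is model_params)
```

The restored object compares equal to the original but is a new object in memory, which is why a loaded model predicts the same values as the one that was saved. Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.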

---
# Linear Regression Exercise with Kaggle Data
The dataset is from the Kaggle competition "New York City Taxi Trip Duration": https://www.kaggle.com/c/nyc-taxi-trip-duration/data?select=train.zip
#### Data fields
* id - a unique identifier for each trip
* vendor_id - a code indicating the provider associated with the trip record
* pickup_datetime - date and time when the meter was engaged
* dropoff_datetime - date and time when the meter was disengaged
* passenger_count - the number of passengers in the vehicle (driver entered value)
* pickup_longitude - the longitude where the meter was engaged
* pickup_latitude - the latitude where the meter was engaged
* dropoff_longitude - the longitude where the meter was disengaged
* dropoff_latitude - the latitude where the meter was disengaged
* store_and_fwd_flag - This flag indicates whether the trip record was held in vehicle memory before sending to the vendor because the vehicle did not have a connection to the server - Y=store and forward; N=not a store and forward trip
* trip_duration - duration of the trip in seconds

**Objectives**: Based on individual trip attributes, participants should predict the duration of each trip in the test set.

In [54]:
# Read csv of train data
nyc = pd.read_csv('./files/train.csv')

In [55]:
nyc.tail()

Unnamed: 0,id,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration
1458639,id2376096,2,2016-04-08 13:31:04,2016-04-08 13:44:02,4,-73.982201,40.745522,-73.994911,40.74017,N,778
1458640,id1049543,1,2016-01-10 07:35:15,2016-01-10 07:46:10,1,-74.000946,40.747379,-73.970184,40.796547,N,655
1458641,id2304944,2,2016-04-22 06:57:41,2016-04-22 07:10:25,1,-73.959129,40.768799,-74.004433,40.707371,N,764
1458642,id2714485,1,2016-01-05 15:56:26,2016-01-05 16:02:39,1,-73.982079,40.749062,-73.974632,40.757107,N,373
1458643,id1209952,1,2016-04-05 14:44:25,2016-04-05 14:47:43,1,-73.979538,40.78175,-73.972809,40.790585,N,198


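The **id** and **vendor_id** columns are unlikely to be predictive of **trip_duration**. The information in **pickup_datetime** and **dropoff_datetime** is already captured by trip_duration, which is the difference between the two (although the pickup month might still be relevant). **store_and_fwd_flag** only records whether the trip record was held in vehicle memory before being sent to the vendor, so it is left out as well. The remaining columns are extracted for analysis, with **id** kept only as a row identifier.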
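
To illustrate why the two datetime columns add little beyond trip_duration, the duration can be recovered by subtracting them. A short sketch using one row from the `nyc.tail()` output above and only the standard library (variable names are illustrative):

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Row id2376096 from the output above; its trip_duration is 778
pickup = datetime.strptime("2016-04-08 13:31:04", FMT)
dropoff = datetime.strptime("2016-04-08 13:44:02", FMT)

# dropoff - pickup is a timedelta; total_seconds() gives the duration in seconds
duration = int((dropoff - pickup).total_seconds())

# The pickup month is a simple derived feature that could still carry signal
pickup_month = pickup.month

print(duration, pickup_month)
```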

In [56]:
# Extract only relevant columns into new dataframe
taxi = nyc[['id', 'passenger_count', 'pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude', 'trip_duration']]
taxi.head()

Unnamed: 0,id,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,trip_duration
0,id2875421,1,-73.982155,40.767937,-73.96463,40.765602,455
1,id2377394,1,-73.980415,40.738564,-73.999481,40.731152,663
2,id3858529,1,-73.979027,40.763939,-74.005333,40.710087,2124
3,id3504673,1,-74.01004,40.719971,-74.012268,40.706718,429
4,id2181028,1,-73.973053,40.793209,-73.972923,40.78252,435


In [57]:
# Check dtypes & null values
taxi.info()               # no need to convert dtypes

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1458644 entries, 0 to 1458643
Data columns (total 7 columns):
 #   Column             Non-Null Count    Dtype  
---  ------             --------------    -----  
 0   id                 1458644 non-null  object 
 1   passenger_count    1458644 non-null  int64  
 2   pickup_longitude   1458644 non-null  float64
 3   pickup_latitude    1458644 non-null  float64
 4   dropoff_longitude  1458644 non-null  float64
 5   dropoff_latitude   1458644 non-null  float64
 6   trip_duration      1458644 non-null  int64  
dtypes: float64(4), int64(2), object(1)
memory usage: 77.9+ MB


In [63]:
# Set x and y: they should be reasonable independent & dependent variables
# Approximate the pickup-to-dropoff distance as the squared Euclidean distance in degrees
taxi = taxi.copy()        # explicit copy, so pandas does not raise SettingWithCopyWarning
taxi['distance'] = (taxi['dropoff_longitude'] - taxi['pickup_longitude'])**2 + (taxi['dropoff_latitude'] - taxi['pickup_latitude'])**2
taxi


Unnamed: 0,id,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,trip_duration,distance
0,id2875421,1,-73.982155,40.767937,-73.964630,40.765602,455,0.000313
1,id2377394,1,-73.980415,40.738564,-73.999481,40.731152,663,0.000418
2,id3858529,1,-73.979027,40.763939,-74.005333,40.710087,2124,0.003592
3,id3504673,1,-74.010040,40.719971,-74.012268,40.706718,429,0.000181
4,id2181028,1,-73.973053,40.793209,-73.972923,40.782520,435,0.000114
...,...,...,...,...,...,...,...,...
1458639,id2376096,4,-73.982201,40.745522,-73.994911,40.740170,778,0.000190
1458640,id1049543,1,-74.000946,40.747379,-73.970184,40.796547,655,0.003364
1458641,id2304944,1,-73.959129,40.768799,-74.004433,40.707371,764,0.005826
1458642,id2714485,1,-73.982079,40.749062,-73.974632,40.757107,373,0.000120


In [None]:
# 'pickup' is not a column in taxi; using the computed 'distance' here is an assumption
x = taxi[['passenger_count', 'distance']]

In [51]:
import numpy as np
np.array([[1, 2]]).shape

(1, 2)

In [45]:
# dir() returns a plain Python list of attribute names, so .size is not available on its result
dir(np.array([[1, 2]])).size

AttributeError: 'list' object has no attribute 'size'