#Let's Build our own Linear Regression Model!
In this notebook, we will build and train a linear regression model that takes features describing a given day (day of the week, holiday flag, daylight hours, rainfall, temperature) and predicts the number of bikers for that day.

**We will use numpy to build the model**; scikit-learn is used only to split and scale the data.

##About the Dataset

The dataset covers Seattle's Fremont Bridge and is provided as a CSV file.
Each row describes a single day: the day of the week, whether it was a holiday, daylight hours, rainfall, temperature, and whether the day was dry (see the dataframe preview below for details). Each row also records how many bikers were observed crossing the bridge that day.

#Import Important Libraries

In [33]:
from IPython.display import clear_output
import pandas as pd
import numpy as np
clear_output()

In [34]:
# Download the CSV file.
!gdown 1_eJU8Y-31_l0oq1sSJT6pROJyo-ufuvD

Downloading...
From: https://drive.google.com/uc?id=1_eJU8Y-31_l0oq1sSJT6pROJyo-ufuvD
To: /content/bikers_data.csv
100% 213k/213k [00:00<00:00, 70.6MB/s]


#Data Exploration

In [35]:
data_df = pd.read_csv('bikers_data.csv')

In [36]:
data_df.head()

Unnamed: 0,Date,Number of bikers,Mon,Tue,Wed,Thu,Fri,Sat,Sun,holiday,daylight_hrs,Rainfall (in),Temp (F),dry day
0,2012-10-03,14084.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,11.277359,0.0,56.0,1
1,2012-10-04,13900.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,11.219142,0.0,56.5,1
2,2012-10-05,12592.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,11.161038,0.0,59.5,1
3,2012-10-06,8024.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,11.103056,0.0,60.5,1
4,2012-10-07,8568.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,11.045208,0.0,60.5,1


In [37]:
data_df.describe()

Unnamed: 0,Number of bikers,Mon,Tue,Wed,Thu,Fri,Sat,Sun,holiday,daylight_hrs,Rainfall (in),Temp (F),dry day
count,2646.0,2646.0,2646.0,2646.0,2646.0,2646.0,2646.0,2646.0,2646.0,2646.0,2646.0,2646.0,2646.0
mean,10972.597128,0.142857,0.142857,0.142857,0.142857,0.142857,0.142857,0.142857,0.027967,11.907412,0.117305,54.285714,0.568405
std,5479.641291,0.349993,0.349993,0.349993,0.349993,0.349993,0.349993,0.349993,0.164909,2.615865,0.264038,10.875798,0.495392
min,152.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.218894,0.0,25.0,0.0
25%,7105.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.360658,0.0,46.0,0.0
50%,10308.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,11.812303,0.0,53.5,1.0
75%,15274.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,14.463207,0.11,63.0,1.0
max,25712.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,15.781095,3.25,82.0,1.0


In [38]:
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2646 entries, 0 to 2645
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Date              2646 non-null   object 
 1   Number of bikers  2646 non-null   float64
 2   Mon               2646 non-null   float64
 3   Tue               2646 non-null   float64
 4   Wed               2646 non-null   float64
 5   Thu               2646 non-null   float64
 6   Fri               2646 non-null   float64
 7   Sat               2646 non-null   float64
 8   Sun               2646 non-null   float64
 9   holiday           2646 non-null   float64
 10  daylight_hrs      2646 non-null   float64
 11  Rainfall (in)     2646 non-null   float64
 12  Temp (F)          2646 non-null   float64
 13  dry day           2646 non-null   int64  
dtypes: float64(12), int64(1), object(1)
memory usage: 289.5+ KB


In [39]:
data_df.isnull().sum()

Date                0
Number of bikers    0
Mon                 0
Tue                 0
Wed                 0
Thu                 0
Fri                 0
Sat                 0
Sun                 0
holiday             0
daylight_hrs        0
Rainfall (in)       0
Temp (F)            0
dry day             0
dtype: int64

In [40]:
data_df.nunique()

Date                2646
Number of bikers    2048
Mon                    2
Tue                    2
Wed                    2
Thu                    2
Fri                    2
Sat                    2
Sun                    2
holiday                2
daylight_hrs        2362
Rainfall (in)        133
Temp (F)             110
dry day                2
dtype: int64

##Feature Engineering

In [41]:
data_df = data_df.drop("Date",axis=1)

##Split the dependent variable from the independent variables

In [42]:
data_y = data_df['Number of bikers'] # target
data_x = data_df.drop(['Number of bikers'], axis=1) # input features

In [43]:
data_x.head()

Unnamed: 0,Mon,Tue,Wed,Thu,Fri,Sat,Sun,holiday,daylight_hrs,Rainfall (in),Temp (F),dry day
0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,11.277359,0.0,56.0,1
1,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,11.219142,0.0,56.5,1
2,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,11.161038,0.0,59.5,1
3,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,11.103056,0.0,60.5,1
4,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,11.045208,0.0,60.5,1


In [44]:
data_y.head()

0    14084.0
1    13900.0
2    12592.0
3     8024.0
4     8568.0
Name: Number of bikers, dtype: float64

#Split the data into train and test sets

In [45]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [46]:
# Manual splitting (unused alternative: takes the first 80% of rows, no shuffling)
# data_y = data_y.values.reshape(-1, 1)
# train_size = int(0.8 * len(data_x))
# X_train, X_test = data_x[:train_size], data_x[train_size:]
# y_train, y_test = data_y[:train_size], data_y[train_size:]
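A manual split that shuffles the rows first can be sketched as follows. This is a minimal illustration on toy arrays, not the notebook's actual split: the arrays `X` and `y` stand in for `data_x` and `data_y`, and the seed is arbitrary. Slicing without shuffling would instead give a split that follows the row order of the file.

```python
import numpy as np

rng = np.random.default_rng(4)  # arbitrary seed, chosen only for this sketch

# Toy stand-ins for data_x and data_y (10 rows, 2 features).
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10, dtype=float)

# Shuffle the row indices, then cut at the 80% mark.
idx = rng.permutation(len(X))
train_size = int(0.8 * len(X))
train_idx, test_idx = idx[:train_size], idx[train_size:]

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```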


In [47]:
#Split using Sklearn
X_train, X_test, y_train, y_test = train_test_split(
    data_x,
    data_y,
    test_size=0.20,
    random_state=4
)


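# Feature scaling (standardization): fit on the training set only,
# then apply the same transform to the test set.
x_scaler = StandardScaler()
X_train = x_scaler.fit_transform(X_train)
X_test = x_scaler.transform(X_test)

# Use a separate scaler for the target so the feature scaler is not overwritten.
y_scaler = StandardScaler()
y_train = y_scaler.fit_transform(y_train.values.reshape(-1, 1))
y_test = y_scaler.transform(y_test.values.reshape(-1, 1))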




# Add a column of ones so the bias (intercept) B0 is learned as the first entry of theta
X_train = np.c_[np.ones(X_train.shape[0]), X_train]
X_test = np.c_[np.ones(X_test.shape[0]), X_test]
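To see why prepending a column of ones works, the sketch below compares the explicit form `X @ w + b` with the augmented form `X_aug @ theta` on made-up numbers. The values of `X`, `w`, and `b` are purely illustrative and are not taken from the bikers dataset.

```python
import numpy as np

# Toy data (hypothetical values, not from the bikers dataset).
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
w = np.array([0.5, -1.0])  # feature weights
b = 2.0                    # bias (intercept)

# Prepend a column of ones so the bias becomes the first entry of theta.
X_aug = np.c_[np.ones(X.shape[0]), X]
theta = np.r_[b, w]

# Both forms produce the same predictions.
y_explicit = X @ w + b
y_augmented = X_aug @ theta
print(np.allclose(y_explicit, y_augmented))  # True
```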

In [48]:
X_train

array([[ 1.        , -0.40881085, -0.40249087, ...,  4.73768486,
        -0.06275111, -1.14347039],
       [ 1.        , -0.40881085, -0.40249087, ...,  0.16669565,
        -0.15505653, -1.14347039],
       [ 1.        , -0.40881085,  2.4845284 , ..., -0.44276958,
         1.46028835,  0.87453073],
       ...,
       [ 1.        , -0.40881085, -0.40249087, ..., -0.44276958,
         1.82951004,  0.87453073],
       [ 1.        , -0.40881085,  2.4845284 , ..., -0.44276958,
        -0.15505653,  0.87453073],
       [ 1.        ,  2.44611904, -0.40249087, ...,  0.01432935,
        -1.49348515, -1.14347039]])

##Calculate theta (w,b)

In [49]:
# Normal equation: theta = (X^T X)^(-1) X^T y
theta = np.linalg.inv(X_train.T @ X_train) @ (X_train.T @ y_train)
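The cell above solves the normal equation, theta = (X^T X)^(-1) X^T y, with an explicit inverse. One caveat: the seven day-of-week indicators always sum to one, so they are linearly dependent with the bias column (and stay linearly dependent after standardization), which can make X^T X singular or badly conditioned. A possible alternative, shown here only as a sketch on synthetic data, is `np.linalg.lstsq`, which avoids forming the inverse and returns a minimum-norm solution when the matrix is rank deficient. The names `true_theta`, `theta_inv`, and `theta_lstsq` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, well-conditioned design matrix with a leading bias column.
X = np.c_[np.ones(50), rng.normal(size=(50, 3))]
true_theta = np.array([1.0, 2.0, -3.0, 0.5])
y = X @ true_theta + 0.01 * rng.normal(size=50)

# Normal equation with an explicit inverse, as in the notebook cell.
theta_inv = np.linalg.inv(X.T @ X) @ (X.T @ y)

# Least-squares solver: no explicit inverse is formed.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# On well-conditioned data the two approaches agree.
print(np.allclose(theta_inv, theta_lstsq))
```

On data like this the two results match; they may differ when the design matrix is rank deficient, in which case the least-squares solver is generally the safer choice.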

#Find predictions using the formula y_hat = Xw + b for the train and test sets

In [50]:
y_train_pred=X_train @ theta

In [51]:
y_test_pred=X_test @ theta

##Root Mean Square Error Function

In [52]:
def rmse(y, yhat):
    """Root mean squared error between targets y and predictions yhat."""
    return np.sqrt(np.mean((y - yhat) ** 2))
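As a quick sanity check of the formula, here is a small worked example with made-up numbers (the arrays `y_true` and `y_pred` are illustrative). The errors are 1, 0 and -2, so the squared errors are 1, 0 and 4, their mean is 5/3, and the RMSE is sqrt(5/3), about 1.291.

```python
import numpy as np

def rmse(y, yhat):
    """Root mean squared error between targets y and predictions yhat."""
    return np.sqrt(np.mean((y - yhat) ** 2))

# Hypothetical targets and predictions.
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

result = rmse(y_true, y_pred)
print(np.isclose(result, np.sqrt(5 / 3)))  # True
```

Both arguments should have the same shape: subtracting an array of shape `(n, 1)` from one of shape `(n,)` broadcasts to `(n, n)` and silently gives a wrong value.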

In [53]:
rmse_train =rmse(y_train, y_train_pred)

In [54]:
rmse_test = rmse(y_test, y_test_pred)

##Print train and test losses

In [55]:
print(rmse_train)

0.40737238103548434


In [56]:
print(rmse_test)

0.4214967658562719
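Because the target was standardized before fitting, the RMSE values above are in units of the training targets' standard deviation, not in bikers. Standardization is a linear rescaling, so multiplying a scaled RMSE by that standard deviation (which `StandardScaler` stores in its `scale_` attribute) should recover the error in original units. The sketch below checks this with plain numpy on toy values; `y`, `y_pred`, `mu`, and `sigma` are illustrative names, not variables from the notebook.

```python
import numpy as np

def rmse(y, yhat):
    return np.sqrt(np.mean((y - yhat) ** 2))

# Toy targets and predictions in original units (hypothetical values).
y = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 320.0, 380.0])

# Standardize both with the targets' mean and (population) standard deviation,
# which is what StandardScaler does.
mu, sigma = y.mean(), y.std()
rmse_scaled = rmse((y - mu) / sigma, (y_pred - mu) / sigma)
rmse_original = rmse(y, y_pred)

# Multiplying by sigma converts the scaled RMSE back to original units.
print(np.isclose(rmse_scaled * sigma, rmse_original))  # True
```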
