## Task: Predict number of bikers on a given day using linear regression

You are provided with a dataset about Seattle's Fremont Bridge in the form of a CSV file.
The data contains details about each day, such as weather, temperature, and other factors (see the dataframe preview below), along with the number of bikers observed crossing the bridge that day.

You are provided with the code to download and load the csv file.

Your task is to train a linear regression model which takes in the parameters of the day (you can drop the columns that you think you don't need) and predicts the number of bikers according to those parameters.

In [None]:
from IPython.display import clear_output

In [None]:
# Don't modify this code


%pip install gdown==4.5


clear_output()

In [None]:
# Download the CSV file.
!gdown 1_eJU8Y-31_l0oq1sSJT6pROJyo-ufuvD

Downloading...
From: https://drive.google.com/uc?id=1_eJU8Y-31_l0oq1sSJT6pROJyo-ufuvD
To: /content/bikers_data.csv
  0% 0.00/213k [00:00<?, ?B/s]100% 213k/213k [00:00<00:00, 80.5MB/s]


In [None]:
import pandas as pd
import numpy as np

In [None]:
data_df = pd.read_csv('bikers_data.csv')

In [None]:
data_df.head()

Unnamed: 0,Date,Number of bikers,Mon,Tue,Wed,Thu,Fri,Sat,Sun,holiday,daylight_hrs,Rainfall (in),Temp (F),dry day
0,2012-10-03,14084.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,11.277359,0.0,56.0,1
1,2012-10-04,13900.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,11.219142,0.0,56.5,1
2,2012-10-05,12592.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,11.161038,0.0,59.5,1
3,2012-10-06,8024.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,11.103056,0.0,60.5,1
4,2012-10-07,8568.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,11.045208,0.0,60.5,1


In [None]:
data_y = data_df['Number of bikers'] # target
data_x = data_df.drop(['Number of bikers'], axis=1) # input features

In [None]:
data_x.head()

Unnamed: 0,Date,Mon,Tue,Wed,Thu,Fri,Sat,Sun,holiday,daylight_hrs,Rainfall (in),Temp (F),dry day
0,2012-10-03,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,11.277359,0.0,56.0,1
1,2012-10-04,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,11.219142,0.0,56.5,1
2,2012-10-05,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,11.161038,0.0,59.5,1
3,2012-10-06,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,11.103056,0.0,60.5,1
4,2012-10-07,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,11.045208,0.0,60.5,1


In [None]:
data_y

0       14084.0
1       13900.0
2       12592.0
3        8024.0
4        8568.0
         ...   
2641     4552.0
2642     3352.0
2643     3692.0
2644     7212.0
2645     4568.0
Name: Number of bikers, Length: 2646, dtype: float64

In [None]:
# Drop rows with missing values (dropna returns a new DataFrame, so the result must be reassigned)
data_df = data_df.dropna()

# Instead of one column per weekday, build a single column that holds a different number for each day
day_mapping = {
    'Mon': 1,
    'Tue': 2,
    'Wed': 3,
    'Thu': 4,
    'Fri': 5,
    'Sat': 6,
    'Sun': 7
}
# Create a new column "Day" and map the values using the dictionary
data_df['Day'] = data_df[['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']].idxmax(axis=1).map(day_mapping)
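
The `idxmax(axis=1).map(...)` step above can be hard to read at a glance. The sketch below replays it on a tiny made-up frame (the `toy` rows and the three-day `toy_mapping` are illustrative assumptions, not the bridge data) to show what each half of the expression returns.

```python
import pandas as pd

# Toy one-hot weekday frame (made-up rows, not the Fremont Bridge data).
toy = pd.DataFrame(
    {
        "Mon": [1.0, 0.0, 0.0],
        "Tue": [0.0, 0.0, 1.0],
        "Wed": [0.0, 1.0, 0.0],
    }
)
toy_mapping = {"Mon": 1, "Tue": 2, "Wed": 3}

# idxmax(axis=1) returns, for each row, the label of the column holding the
# largest value, i.e. the weekday whose indicator is 1.
toy_day_names = toy.idxmax(axis=1)

# map() then replaces each weekday label with its integer code.
toy_day_codes = toy_day_names.map(toy_mapping)

print(toy_day_names.tolist())  # ['Mon', 'Wed', 'Tue']
print(toy_day_codes.tolist())  # [1, 3, 2]
```

Note that integer codes impose an ordering (Mon < Tue < ... < Sun) that a linear model treats as a numeric trend, whereas the original one-hot columns do not.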

In [None]:
target = data_df['Number of bikers'] # target
features = data_df.drop(['Number of bikers','Date','Mon','Tue','Wed','Thu','Fri','Sat','Sun','dry day'], axis=1)

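# The one-hot weekday columns are dropped because the new 'Day' column already encodes the weekday.
# 'Date' is dropped because we want to predict bikers from the day's conditions, not from a specific calendar date.
# 'dry day' is dropped because it can be derived from the rainfall column (Rainfall == 0).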
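
The claim that `dry day` is redundant with the rainfall column can be checked rather than assumed. The sketch below shows one way to do that on a small made-up frame (the four `toy` rows are an assumption for illustration); the same comparison could be run on `data_df` before dropping the column.

```python
import pandas as pd

# Made-up rows that mimic the two columns of interest.
toy = pd.DataFrame(
    {
        "Rainfall (in)": [0.0, 0.11, 0.0, 3.25],
        "dry day": [1, 0, 1, 0],
    }
)

# Derive the dry-day flag from rainfall: 1 when no rain fell, else 0.
derived_dry = (toy["Rainfall (in)"] == 0).astype(int)

# The column is redundant only if the derived flag matches it on every row.
is_redundant = bool((derived_dry == toy["dry day"]).all())
print(is_redundant)  # True for these toy rows
```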

In [None]:
# Inspect the summary statistics to see each feature's range and spot outliers before normalizing

features.describe()

Unnamed: 0,holiday,daylight_hrs,Rainfall (in),Temp (F),Day
count,2646.0,2646.0,2646.0,2646.0,2646.0
mean,0.027967,11.907412,0.117305,54.285714,4.0
std,0.164909,2.615865,0.264038,10.875798,2.000378
min,0.0,8.218894,0.0,25.0,1.0
25%,0.0,9.360658,0.0,46.0,2.0
50%,0.0,11.812303,0.0,53.5,4.0
75%,0.0,14.463207,0.11,63.0,6.0
max,1.0,15.781095,3.25,82.0,7.0


In [None]:
features.head()

Unnamed: 0,holiday,daylight_hrs,Rainfall (in),Temp (F),Day
0,0.0,11.277359,0.0,56.0,3
1,0.0,11.219142,0.0,56.5,4
2,0.0,11.161038,0.0,59.5,5
3,0.0,11.103056,0.0,60.5,6
4,0.0,11.045208,0.0,60.5,7


In [None]:
# Normalization: rescale every feature to the [0, 1] range

from sklearn import preprocessing
x = features.values #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
features = pd.DataFrame(x_scaled)
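
The cell above fits the scaler on all rows before the train/test split. A common alternative, sketched below on made-up numbers, is to fit `MinMaxScaler` on the training rows only and reuse those minima and maxima for the test rows, so that no test-set information reaches the model. The arrays `toy_train` and `toy_test` are assumptions for illustration; this is not what the notebook itself does.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up data: two features, three training rows, one test row.
toy_train = np.array([[25.0, 0.0], [50.0, 1.0], [75.0, 3.0]])
toy_test = np.array([[100.0, 1.5]])

# Fit on the training rows only, then apply the same transform to both sets.
scaler = MinMaxScaler().fit(toy_train)
scaled_train = scaler.transform(toy_train)
scaled_test = scaler.transform(toy_test)

# Min-max scaling is (x - min) / (max - min), computed per column.
col_min = toy_train.min(axis=0)
col_max = toy_train.max(axis=0)
manual_train = (toy_train - col_min) / (col_max - col_min)

print(np.allclose(scaled_train, manual_train))  # True
print(scaled_test)  # [[1.5 0.5]]
```

Test values that lie outside the training range map outside [0, 1], as the first test feature does here; that is expected with this approach.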

In [None]:
features = features.values
target = target.values

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features,target,test_size = 0.2)

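# Prepend a column of ones so that theta[0] acts as the intercept
X_train = np.c_[np.ones(X_train.shape[0]), X_train]
X_test = np.c_[np.ones(X_test.shape[0]), X_test]

# Normal equation (closed-form least squares): theta = (X^T X)^(-1) X^T y
theta = np.linalg.inv(X_train.T @ X_train) @ (X_train.T @ y_train)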

print(theta)

[  8428.16140452  -5588.45712024   3369.11439303 -14809.59283173
  11540.60326459  -8625.04772693]
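
Explicitly inverting `X.T @ X` works here, but it can be numerically fragile when features are strongly correlated. As a hedged cross-check, the sketch below fits synthetic data (the names `X_toy`, `y_toy`, and `true_theta` are assumptions, not notebook variables) with both the normal equation and `np.linalg.lstsq`, which solves the same least-squares problem without forming the inverse.

```python
import numpy as np

# Synthetic regression problem with a known answer.
rng = np.random.default_rng(0)
X_toy = np.c_[np.ones(50), rng.normal(size=(50, 2))]  # intercept column + 2 features
true_theta = np.array([2.0, -1.0, 0.5])
y_toy = X_toy @ true_theta + 0.01 * rng.normal(size=50)  # small noise

# Normal equation, as used in the cell above.
theta_inv = np.linalg.inv(X_toy.T @ X_toy) @ (X_toy.T @ y_toy)

# Least-squares solver; it returns (solution, residuals, rank, singular values).
theta_lstsq, *_ = np.linalg.lstsq(X_toy, y_toy, rcond=None)

print(np.allclose(theta_inv, theta_lstsq))  # True: both give the same fit
```

Passing `random_state` to `train_test_split` would additionally make `theta` and the reported losses reproducible between runs.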


In [None]:
predictions = X_test @ theta

In [None]:
# Mean squared error
def mean_square_error(y_true, y_pred):
    return ((y_pred - y_true) ** 2).mean()

trainLoss = mean_square_error(y_train, X_train @ theta)
print(f"Loss in Training: {trainLoss}" )
testLoss = mean_square_error(y_test,predictions)
print(f"Loss in testing: {testLoss}")

Loss in Training: 8688575.37105325
Loss in testing: 7067335.018854274
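
MSE values in the millions are hard to interpret because they are in squared bikers. Taking the square root gives an error in the same unit as the target: roughly 2,950 bikers on the training split and 2,660 on the test split for the run shown above. The sketch below does that conversion and also defines a small R² helper; `r2_score_manual` is a hypothetical name introduced here, and the arrays used to exercise it are made up.

```python
import numpy as np

# MSE values copied from the output above.
train_mse = 8688575.37105325
test_mse = 7067335.018854274

# RMSE is the square root of the MSE, expressed in bikers per day.
train_rmse = np.sqrt(train_mse)
test_rmse = np.sqrt(test_mse)
print(f"RMSE in training: {train_rmse:.1f}")
print(f"RMSE in testing: {test_rmse:.1f}")


def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = ((y_true - y_pred) ** 2).sum()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot


# Sanity checks on made-up values: a perfect fit scores 1, predicting the mean scores 0.
toy_true = np.array([1.0, 2.0, 3.0])
perfect_r2 = r2_score_manual(toy_true, toy_true)
mean_r2 = r2_score_manual(toy_true, np.full(3, toy_true.mean()))
```

In the notebook, `r2_score_manual(y_test, predictions)` would report the share of variance in the daily counts that the model explains.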
