## Task: Predict number of bikers on a given day using linear regression

You are provided with a dataset about Seattle's Fremont Bridge in the form of a csv file.
The data contains different details about a given day, like weather, temperature and other factors (see the dataframe preview below) for more details. The data also contains how many bikers were observed crossing the bridge that day.

You are provided with the code to download and load the csv file.

Your task is to **train a linear regression model** which takes in the parameters of the day (you can drop the columns that you think you don't need) and predicts the number of bikers according to those parameters.

**Divide the data** into a train (80%) and a validation data set (20%).

**Print train and validation losses**.

You are **not allowed** to use LinearRegression from sklearn.linear_model.

In [None]:
from IPython.display import clear_output

In [None]:
# Download the CSV file.
!gdown 1_eJU8Y-31_l0oq1sSJT6pROJyo-ufuvD

Downloading...
From: https://drive.google.com/uc?id=1_eJU8Y-31_l0oq1sSJT6pROJyo-ufuvD
To: /content/bikers_data.csv
  0% 0.00/213k [00:00<?, ?B/s]100% 213k/213k [00:00<00:00, 3.79MB/s]


In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from numpy.linalg import inv



In [None]:
data_df = pd.read_csv('bikers_data.csv')

In [None]:
data_df.head()

Unnamed: 0,Date,Number of bikers,Mon,Tue,Wed,Thu,Fri,Sat,Sun,holiday,daylight_hrs,Rainfall (in),Temp (F),dry day
0,2012-10-03,14084.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,11.277359,0.0,56.0,1
1,2012-10-04,13900.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,11.219142,0.0,56.5,1
2,2012-10-05,12592.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,11.161038,0.0,59.5,1
3,2012-10-06,8024.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,11.103056,0.0,60.5,1
4,2012-10-07,8568.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,11.045208,0.0,60.5,1


In [None]:
data_y = data_df['Number of bikers'] # target
data_x = data_df.drop(['Number of bikers'], axis=1) # input features
data_x = data_df.drop(['Date'],axis=1)

data_x = data_x.values
data_y = data_y.values

In [None]:
# data_x.head()


In [None]:
data_y

array([14084., 13900., 12592., ...,  3692.,  7212.,  4568.])

In [None]:
data_x #here is the features (head)

array([[1.4084e+04, 0.0000e+00, 0.0000e+00, ..., 0.0000e+00, 5.6000e+01,
        1.0000e+00],
       [1.3900e+04, 0.0000e+00, 0.0000e+00, ..., 0.0000e+00, 5.6500e+01,
        1.0000e+00],
       [1.2592e+04, 0.0000e+00, 0.0000e+00, ..., 0.0000e+00, 5.9500e+01,
        1.0000e+00],
       ...,
       [3.6920e+03, 0.0000e+00, 0.0000e+00, ..., 1.0000e-02, 4.5500e+01,
        0.0000e+00],
       [7.2120e+03, 1.0000e+00, 0.0000e+00, ..., 4.0000e-02, 4.5500e+01,
        0.0000e+00],
       [4.5680e+03, 0.0000e+00, 1.0000e+00, ..., 2.6000e-01, 4.9500e+01,
        0.0000e+00]])

In [None]:
print(data_df.columns)

Index(['Date', 'Number of bikers', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat',
       'Sun', 'holiday', 'daylight_hrs', 'Rainfall (in)', 'Temp (F)',
       'dry day'],
      dtype='object')


In [None]:

data_x.shape

(2646, 13)

In [None]:
#----Train-----
X_train, X_test, y_train, y_test = train_test_split(data_x,data_y,test_size = 0.2)
# X_train = np.c_[np.ones(X_train.shape[0]), X_train]
# X_test = np.c_[np.ones(X_test.shape[0]), X_test]
#if i add a column then an error will accured in the training becaues the matrix won't multiply by a diffrente matrix that's why i put it but i comment it

In [None]:
theta = np.linalg.solve(data_x.T@data_x, data_x.T@data_y)#equation to help us predict


In [None]:
y_train_pred = X_train @ theta


In [None]:
y_test_pred = X_test @ theta


In [None]:
def mean_squared_error(y, yhat):#calculate error (known equation)
  return np.mean((y - yhat) ** 2)

In [None]:
mse_train = mean_squared_error(y_train, y_train_pred)

In [None]:
mse_test = mean_squared_error(y_test, y_test_pred)

In [None]:
print(mse_train)

2.9276020759572632e-21


In [None]:
print(mse_test)

2.771909892736239e-21
