## Task: Predict number of bikers on a given day using linear regression

You are provided with a bike-sharing dataset in the form of a CSV file (the UCI Bike Sharing dataset cited below).
Each row describes a given day: weather, temperature and other factors (see the dataframe preview below for more details). The data also contains how many bikes were rented that day (the `cnt` column).

You are provided with the code to download and load the csv file.

Your task is to train a linear regression model which takes in the parameters of the day (you can drop the columns that you think you don't need) and predicts the number of bikers according to those parameters.

You can find more details about the data: https://archive.ics.uci.edu/dataset/275/bike+sharing+dataset

Fanaee-T, H. (2013). Bike Sharing [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5W894.


- For extra practice, try changing some integers to strings and setting some cells to null.
- Then start with the preprocessing (fixing labels and filling missing cells) before training the model.
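
One way to practise the optional exercise above is sketched here on a small hand-made frame, so it does not depend on the download. The column names match `day.csv`, but the values, the corrupted positions, the `label_map` dictionary and the median-fill strategy are all illustrative assumptions, not a prescribed solution.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame; the values are made up for the demo.
demo = pd.DataFrame({
    "season": [1, 1, 2, 3],
    "temp": [0.34, 0.36, 0.20, 0.50],
    "cnt": [985, 801, 1349, 1562],
})

# Step 1: corrupt the data on purpose.
corrupted = demo.astype({"season": object})
corrupted.loc[1, "season"] = "one"   # an integer label turned into a string
corrupted.loc[2, "temp"] = np.nan    # a missing cell

# Step 2: repair it.
label_map = {"one": 1, "two": 2, "three": 3, "four": 4}  # hypothetical mapping
cleaned = corrupted.copy()
# Map string labels back to integers; leave values that are already integers alone.
cleaned["season"] = cleaned["season"].map(lambda v: label_map.get(v, v)).astype(int)
# Fill the missing temperature with the column median (one of several reasonable choices).
cleaned["temp"] = cleaned["temp"].fillna(cleaned["temp"].median())

print(cleaned)
```

The same two steps should carry over to the full dataframe once it is loaded; which columns to corrupt and how to impute them is up to you.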

In [1]:
from IPython.display import clear_output

In [None]:
# In case you run this notebook outside Colab (where the libraries aren't pre-installed)

%pip install pandas
%pip install numpy
%pip install scikit-learn

clear_output()

In [2]:
# Download and unzip the dataset archive.
!wget https://archive.ics.uci.edu/static/public/275/bike+sharing+dataset.zip
!unzip bike+sharing+dataset.zip
# Only the daily file (day.csv) is used, so the hourly file is removed.
!rm hour.csv

--2025-01-06 06:16:36--  https://archive.ics.uci.edu/static/public/275/bike+sharing+dataset.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified
Saving to: ‘bike+sharing+dataset.zip’

bike+sharing+datase     [   <=>              ] 273.43K   529KB/s    in 0.5s    

2025-01-06 06:16:37 (529 KB/s) - ‘bike+sharing+dataset.zip’ saved [279992]

Archive:  bike+sharing+dataset.zip
  inflating: Readme.txt              
  inflating: day.csv                 
  inflating: hour.csv                


In [3]:
import pandas as pd
import numpy as np

In [58]:
df = pd.read_csv("/content/day.csv")

# To shuffle
# df = df.sample(frac = 1)
df.head(10)

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,2011-01-01,1,0,1,0,6,0,2,0.344167,0.363625,0.805833,0.160446,331,654,985
1,2,2011-01-02,1,0,1,0,0,0,2,0.363478,0.353739,0.696087,0.248539,131,670,801
2,3,2011-01-03,1,0,1,0,1,1,1,0.196364,0.189405,0.437273,0.248309,120,1229,1349
3,4,2011-01-04,1,0,1,0,2,1,1,0.2,0.212122,0.590435,0.160296,108,1454,1562
4,5,2011-01-05,1,0,1,0,3,1,1,0.226957,0.22927,0.436957,0.1869,82,1518,1600
5,6,2011-01-06,1,0,1,0,4,1,1,0.204348,0.233209,0.518261,0.089565,88,1518,1606
6,7,2011-01-07,1,0,1,0,5,1,2,0.196522,0.208839,0.498696,0.168726,148,1362,1510
7,8,2011-01-08,1,0,1,0,6,0,2,0.165,0.162254,0.535833,0.266804,68,891,959
8,9,2011-01-09,1,0,1,0,0,0,1,0.138333,0.116175,0.434167,0.36195,54,768,822
9,10,2011-01-10,1,0,1,0,1,1,1,0.150833,0.150888,0.482917,0.223267,41,1280,1321


In [59]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 731 entries, 0 to 730
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   instant     731 non-null    int64  
 1   dteday      731 non-null    object 
 2   season      731 non-null    int64  
 3   yr          731 non-null    int64  
 4   mnth        731 non-null    int64  
 5   holiday     731 non-null    int64  
 6   weekday     731 non-null    int64  
 7   workingday  731 non-null    int64  
 8   weathersit  731 non-null    int64  
 9   temp        731 non-null    float64
 10  atemp       731 non-null    float64
 11  hum         731 non-null    float64
 12  windspeed   731 non-null    float64
 13  casual      731 non-null    int64  
 14  registered  731 non-null    int64  
 15  cnt         731 non-null    int64  
dtypes: float64(4), int64(11), object(1)
memory usage: 91.5+ KB


In [60]:
y = df["cnt"]
display(y)

Unnamed: 0,cnt
0,985
1,801
2,1349
3,1562
4,1600
...,...
726,2114
727,3095
728,1341
729,1796


In [61]:
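# Drop the row counter and the target from the features.
# Caution: casual + registered == cnt, so keeping these two columns leaks the
# target into the features; consider dropping them as well.
x = df.drop(["instant", "cnt"], axis=1)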

x.head(5)

Unnamed: 0,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered
0,2011-01-01,1,0,1,0,6,0,2,0.344167,0.363625,0.805833,0.160446,331,654
1,2011-01-02,1,0,1,0,0,0,2,0.363478,0.353739,0.696087,0.248539,131,670
2,2011-01-03,1,0,1,0,1,1,1,0.196364,0.189405,0.437273,0.248309,120,1229
3,2011-01-04,1,0,1,0,2,1,1,0.2,0.212122,0.590435,0.160296,108,1454
4,2011-01-05,1,0,1,0,3,1,1,0.226957,0.22927,0.436957,0.1869,82,1518


In [62]:
df.tail(5)

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
726,727,2012-12-27,1,1,12,0,4,1,2,0.254167,0.226642,0.652917,0.350133,247,1867,2114
727,728,2012-12-28,1,1,12,0,5,1,2,0.253333,0.255046,0.59,0.155471,644,2451,3095
728,729,2012-12-29,1,1,12,0,6,0,2,0.253333,0.2424,0.752917,0.124383,159,1182,1341
729,730,2012-12-30,1,1,12,0,0,0,1,0.255833,0.2317,0.483333,0.350754,364,1432,1796
730,731,2012-12-31,1,1,12,0,1,1,2,0.215833,0.223487,0.5775,0.154846,439,2290,2729


In [63]:
# Keep only the day of the month from the YYYY-MM-DD date string.
x["dteday"] = x["dteday"].apply(lambda date: int(date.split("-")[2]))

x.head(5)

Unnamed: 0,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered
0,1,1,0,1,0,6,0,2,0.344167,0.363625,0.805833,0.160446,331,654
1,2,1,0,1,0,0,0,2,0.363478,0.353739,0.696087,0.248539,131,670
2,3,1,0,1,0,1,1,1,0.196364,0.189405,0.437273,0.248309,120,1229
3,4,1,0,1,0,2,1,1,0.2,0.212122,0.590435,0.160296,108,1454
4,5,1,0,1,0,3,1,1,0.226957,0.22927,0.436957,0.1869,82,1518


In [64]:
x.tail(5)

Unnamed: 0,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered
726,27,1,1,12,0,4,1,2,0.254167,0.226642,0.652917,0.350133,247,1867
727,28,1,1,12,0,5,1,2,0.253333,0.255046,0.59,0.155471,644,2451
728,29,1,1,12,0,6,0,2,0.253333,0.2424,0.752917,0.124383,159,1182
729,30,1,1,12,0,0,0,1,0.255833,0.2317,0.483333,0.350754,364,1432
730,31,1,1,12,0,1,1,2,0.215833,0.223487,0.5775,0.154846,439,2290
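
The string split above keeps only the day of the month. An alternative sketch, assuming the `YYYY-MM-DD` format seen in the preview, is to parse the column with `pd.to_datetime` and read components from the `.dt` accessor. The small frame below is a stand-in built from three dates in the preview, and the extra `dayofyear` column is an optional illustration rather than something the notebook uses.

```python
import pandas as pd

# Stand-in for the dteday column; the dates come from the previews above.
dates = pd.DataFrame({"dteday": ["2011-01-01", "2011-01-02", "2012-12-31"]})

# Parse once, then extract whichever components are useful as features.
parsed = pd.to_datetime(dates["dteday"], format="%Y-%m-%d")
dates["day"] = parsed.dt.day              # same value the string split produces
dates["dayofyear"] = parsed.dt.dayofyear  # optional: position within the year

print(dates)
```

Parsing also fails loudly on malformed dates, which can help if you deliberately corrupt cells for the preprocessing exercise.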


In [65]:
# Add a column of ones so the model can learn an intercept (bias) term.
x["bias"] = np.ones(len(x))
x.head(5)

Unnamed: 0,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,bias
0,1,1,0,1,0,6,0,2,0.344167,0.363625,0.805833,0.160446,331,654,1.0
1,2,1,0,1,0,0,0,2,0.363478,0.353739,0.696087,0.248539,131,670,1.0
2,3,1,0,1,0,1,1,1,0.196364,0.189405,0.437273,0.248309,120,1229,1.0
3,4,1,0,1,0,2,1,1,0.2,0.212122,0.590435,0.160296,108,1454,1.0
4,5,1,0,1,0,3,1,1,0.226957,0.22927,0.436957,0.1869,82,1518,1.0


In [56]:
# Manual sequential split: first 80% for training, last 20% for testing.
# Kept for reference; train_test_split is used in the next cell instead.
# n = len(x)

# split = int(n * 0.8)

# x_train = x.iloc[:split,:]
# x_test = x.iloc[split:,:]
# y_train = y.iloc[:split]
# y_test = y.iloc[split:]
# print(len(x_train), len(x_test), len(y_train), len(y_test))

In [66]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)


In [67]:
display(X_train)

Unnamed: 0,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,bias
682,13,4,1,11,0,2,1,2,0.343333,0.323225,0.662917,0.342046,327,3767,1.0
250,8,3,0,9,0,4,1,3,0.633913,0.555361,0.939565,0.192748,153,1689,1.0
336,3,4,0,12,0,6,0,1,0.299167,0.310604,0.612917,0.095783,706,2908,1.0
260,18,3,0,9,0,0,0,1,0.507500,0.490537,0.695000,0.178483,1353,2921,1.0
543,27,3,1,6,0,3,1,1,0.697500,0.640792,0.360000,0.271775,1077,6258,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
71,13,1,0,3,0,0,0,1,0.384348,0.380091,0.527391,0.270604,982,1435,1.0
106,17,2,0,4,0,0,0,1,0.456667,0.445696,0.479583,0.303496,1558,2186,1.0
270,28,4,0,9,0,3,1,2,0.635000,0.575158,0.848750,0.148629,480,3427,1.0
435,11,1,1,3,0,0,0,1,0.361739,0.359670,0.476957,0.222587,1658,3253,1.0


In [68]:
# Convert to NumPy arrays for the matrix algebra below.
x_train = X_train.to_numpy()
x_test = X_test.to_numpy()
y_train = y_train.to_numpy()
y_test = y_test.to_numpy()

In [73]:
# Closed-form least squares (normal equations): theta = (X^T X)^-1 X^T y
theta = np.linalg.inv(x_train.T @ x_train) @ (x_train.T @ y_train)
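
The cell above solves the normal equations by explicit matrix inversion. As a hedged alternative, `np.linalg.lstsq` solves the same least-squares problem without forming the inverse, which tends to be more robust when feature columns are nearly collinear. The data below is synthetic (random features and hypothetical coefficients in `true_theta`) and is only meant to show that both routes agree on a well-conditioned problem.

```python
import numpy as np

# Synthetic, well-conditioned problem; not the bike data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X = np.hstack([X, np.ones((100, 1))])           # bias column, as in the notebook
true_theta = np.array([2.0, -1.0, 0.5, 3.0])    # hypothetical coefficients
y = X @ true_theta + rng.normal(scale=0.1, size=100)

# Route 1: normal equations with an explicit inverse.
theta_inv = np.linalg.inv(X.T @ X) @ (X.T @ y)

# Route 2: least-squares solver, no explicit inverse.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(theta_inv, theta_lstsq))
```

In the notebook, the equivalent call would be `np.linalg.lstsq(x_train, y_train, rcond=None)`.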

In [74]:
y_pred_train = x_train @ theta

In [75]:
# Mean squared error on the training set.
loss_train = np.mean((y_pred_train - y_train)**2)

print(loss_train)

1.6932223411929971e-18


In [76]:
y_pred_test = x_test @ theta

In [77]:
# Mean squared error on the held-out test set.
loss_test = np.mean((y_pred_test - y_test)**2)

print(loss_test)

1.4117894480068751e-18
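
Losses this close to zero point to target leakage rather than a very good model: in the preview rows, `casual + registered` equals `cnt` (for example 331 + 654 = 985), so a linear model can reproduce the target exactly from those two columns. A sketch of the check and one possible fix follows. It uses the first five preview rows as a stand-in for the full frame, and dropping the two columns is one reasonable choice, not the only one.

```python
import pandas as pd

# First five rows from the df.head() preview (only the relevant columns).
preview = pd.DataFrame({
    "temp": [0.344167, 0.363478, 0.196364, 0.2, 0.226957],
    "casual": [331, 131, 120, 108, 82],
    "registered": [654, 670, 1229, 1454, 1518],
    "cnt": [985, 801, 1349, 1562, 1600],
})

# Check: does the target equal the sum of the two user-type counts?
leaks = (preview["casual"] + preview["registered"] == preview["cnt"]).all()
print("casual + registered == cnt for every row:", leaks)

# Fix: drop the leaking columns together with the target.
features = preview.drop(columns=["casual", "registered", "cnt"])
print(features.columns.tolist())
```

On the full data, the same idea would be `x = df.drop(["instant", "casual", "registered", "cnt"], axis=1)` before the split. The resulting train and test losses should then be far from zero and give a more honest picture of how well the day's parameters predict the count.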
