# Bike Sharing Dataset
## Data Cleaning & Vector Space Representation

In [34]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler

## Load Dataset

See first rows of the dataset:

In [35]:

df = pd.read_csv("bike_sharing_dataset/hour.csv")
df.head()


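|   | instant | dteday | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2011-01-01 | 1 | 0 | 1 | 0 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.81 | 0.0 | 3 | 13 | 16 |
| 1 | 2 | 2011-01-01 | 1 | 0 | 1 | 1 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0.0 | 8 | 32 | 40 |
| 2 | 3 | 2011-01-01 | 1 | 0 | 1 | 2 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0.0 | 5 | 27 | 32 |
| 3 | 4 | 2011-01-01 | 1 | 0 | 1 | 3 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 3 | 10 | 13 |
| 4 | 5 | 2011-01-01 | 1 | 0 | 1 | 4 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 0 | 1 | 1 |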


## Initial Inspection

### Explore data

In [36]:
df.columns

Index(['instant', 'dteday', 'season', 'yr', 'mnth', 'hr', 'holiday', 'weekday',
       'workingday', 'weathersit', 'temp', 'atemp', 'hum', 'windspeed',
       'casual', 'registered', 'cnt'],
      dtype='object')

The data describe a bike rental (sharing) system.

There are 17 columns:
- **instant:** record index
- **dteday:** date
- **season:** 1:winter, 2:spring, 3:summer, 4:fall
- **yr:** year (0: 2011, 1: 2012)
- **mnth:** month (1 to 12)
- **hr:** hour (0 to 23)
- **holiday:** whether the day is a holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)
- **weekday:** day of the week
- **workingday:** 1 if the day is neither a weekend nor a holiday, otherwise 0
- **weathersit:**
    - 1: Clear, Few clouds, Partly cloudy
    - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    - 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
- **temp:** Normalized temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-8, t_max=+39 (only in hourly scale)
- **atemp:** Normalized feeling temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-16, t_max=+50 (only in hourly scale)
- **hum:** Normalized humidity. The values are divided by 100 (max)
- **windspeed:** Normalized wind speed. The values are divided by 67 (max)
- **casual:** count of casual users
- **registered:** count of registered users
- **cnt:** count of total rental bikes including both casual and registered
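
The normalization formulas for `temp` and `atemp` can be inverted to recover approximate values in degrees Celsius. The snippet below is a minimal sketch that assumes the bounds quoted in the list above are exact; the helper name `denormalize` and the `*_celsius` column names are illustrative and not part of the dataset.

```python
import pandas as pd

# Bounds quoted in the column description above (degrees Celsius).
TEMP_MIN, TEMP_MAX = -8.0, 39.0
ATEMP_MIN, ATEMP_MAX = -16.0, 50.0


def denormalize(values: pd.Series, lower: float, upper: float) -> pd.Series:
    """Invert (t - t_min) / (t_max - t_min) to recover the original scale."""
    return values * (upper - lower) + lower


# First two `temp` / `atemp` values from the dataset preview.
sample = pd.DataFrame({"temp": [0.24, 0.22], "atemp": [0.2879, 0.2727]})
sample["temp_celsius"] = denormalize(sample["temp"], TEMP_MIN, TEMP_MAX)
sample["atemp_celsius"] = denormalize(sample["atemp"], ATEMP_MIN, ATEMP_MAX)
print(sample.round(2))
```

For example, a normalized `temp` of 0.24 corresponds to 0.24 * 47 - 8 = 3.28 °C.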

------------------------------------------------------------------------------------
Get more information about the dataset. No column should contain null values; the `Non-Null Count` column of `df.info()` is used to check this.

In [37]:

df.info()
df.describe()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17379 entries, 0 to 17378
Data columns (total 17 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   instant     17379 non-null  int64  
 1   dteday      17379 non-null  object 
 2   season      17379 non-null  int64  
 3   yr          17379 non-null  int64  
 4   mnth        17379 non-null  int64  
 5   hr          17379 non-null  int64  
 6   holiday     17379 non-null  int64  
 7   weekday     17379 non-null  int64  
 8   workingday  17379 non-null  int64  
 9   weathersit  17379 non-null  int64  
 10  temp        17379 non-null  float64
 11  atemp       17379 non-null  float64
 12  hum         17379 non-null  float64
 13  windspeed   17379 non-null  float64
 14  casual      17379 non-null  int64  
 15  registered  17379 non-null  int64  
 16  cnt         17379 non-null  int64  
dtypes: float64(4), int64(12), object(1)
memory usage: 2.3+ MB


|   | instant | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 |
| mean | 8690.0 | 2.50164 | 0.502561 | 6.537775 | 11.546752 | 0.02877 | 3.003683 | 0.682721 | 1.425283 | 0.496987 | 0.475775 | 0.627229 | 0.190098 | 35.676218 | 153.786869 | 189.463088 |
| std | 5017.0295 | 1.106918 | 0.500008 | 3.438776 | 6.914405 | 0.167165 | 2.005771 | 0.465431 | 0.639357 | 0.192556 | 0.17185 | 0.19293 | 0.12234 | 49.30503 | 151.357286 | 181.387599 |
| min | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.02 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 25% | 4345.5 | 2.0 | 0.0 | 4.0 | 6.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.34 | 0.3333 | 0.48 | 0.1045 | 4.0 | 34.0 | 40.0 |
| 50% | 8690.0 | 3.0 | 1.0 | 7.0 | 12.0 | 0.0 | 3.0 | 1.0 | 1.0 | 0.5 | 0.4848 | 0.63 | 0.194 | 17.0 | 115.0 | 142.0 |
| 75% | 13034.5 | 3.0 | 1.0 | 10.0 | 18.0 | 0.0 | 5.0 | 1.0 | 2.0 | 0.66 | 0.6212 | 0.78 | 0.2537 | 48.0 | 220.0 | 281.0 |
| max | 17379.0 | 4.0 | 1.0 | 12.0 | 23.0 | 1.0 | 6.0 | 1.0 | 4.0 | 1.0 | 1.0 | 1.0 | 0.8507 | 367.0 | 886.0 | 977.0 |


- **count** - the number of non-null values (every column has 17 379, which matches the number of rows, so the dataset has no NaN values)
- **mean** - the average of the values in each column
- **std** - the standard deviation (how spread out the data are -> low value: points lie close to the mean, high value: points are spread out more widely)
- **min** - the smallest value in each column
- **25%** - the 25th percentile (first quartile): a quarter of the values lie at or below it
- **50%** - the 50th percentile (median): half of the values lie at or below it
- **75%** - the 75th percentile (third quartile): three quarters of the values lie at or below it
- **max** - the largest value in each column
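
The percentile rows can be reproduced with `Series.quantile`. The sketch below uses a small made-up series (not values from the dataset) so the arithmetic is easy to check by hand. By default pandas interpolates linearly between neighbouring values, which is why a percentile such as the 25% of `instant` (4345.5) need not be a value that occurs in the column.

```python
import pandas as pd

# Toy values (not taken from the dataset) to keep the arithmetic easy to follow.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

# describe() reports count, mean, std, min, quartiles and max in one call.
summary = values.describe()

# quantile() returns the same quartiles when asked for them explicitly.
quartiles = values.quantile([0.25, 0.50, 0.75])

print(summary)
print(quartiles)
```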

------------------------------------------------------------------
Number of rows and columns:

In [38]:
df.shape

(17379, 17)

**Rows:** 17 379

**Columns:** 17

------------------------------------------------------------------
### Analyze Data Structure
Count the distinct values in each column:

In [39]:
df.nunique()

instant       17379
dteday          731
season            4
yr                2
mnth             12
hr               24
holiday           2
weekday           7
workingday        2
weathersit        4
temp             50
atemp            65
hum              89
windspeed        30
casual          322
registered      776
cnt             869
dtype: int64

-------------------------------------
***.isna()***: creates a table of booleans, where **True** marks a **NaN**; we then sum them per column (***.sum()***)

In [40]:
# Missing values
sum_nan_values = df.isna().sum()
print(sum_nan_values)
print("\n### Missing values (%): ###")
print((sum_nan_values / len(df) * 100).round(2))

# Duplicates
print(f"\n### Total duplicate rows: {df.duplicated().sum()}")
print(f"### Duplicates (keep first): {df.duplicated(keep='first').sum()}")

# Remove duplicates if needed
df = df.drop_duplicates(keep='first')

instant       0
dteday        0
season        0
yr            0
mnth          0
hr            0
holiday       0
weekday       0
workingday    0
weathersit    0
temp          0
atemp         0
hum           0
windspeed     0
casual        0
registered    0
cnt           0
dtype: int64

### Missing values (%): ###
instant       0.0
dteday        0.0
season        0.0
yr            0.0
mnth          0.0
hr            0.0
holiday       0.0
weekday       0.0
workingday    0.0
weathersit    0.0
temp          0.0
atemp         0.0
hum           0.0
windspeed     0.0
casual        0.0
registered    0.0
cnt           0.0
dtype: float64

### Total duplicate rows: 0
### Duplicates (keep first): 0


The output shows that there are no missing values and no duplicate rows. Overall, the data quality looks clean.
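
Because the real dataset is already clean, the checks above never report anything. The following sketch runs the same calls on a hypothetical toy frame (the values are invented for illustration) that contains one missing value and one repeated row, to show what a non-zero result looks like.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value and one repeated row.
toy = pd.DataFrame({
    "hr": [0, 1, 1, 2],
    "temp": [0.24, 0.22, 0.22, np.nan],
})

# isna() marks missing cells with True; sum() counts them per column.
missing_per_column = toy.isna().sum()

# duplicated(keep="first") flags every repeat after the first occurrence.
duplicate_rows = toy.duplicated(keep="first").sum()
deduplicated = toy.drop_duplicates(keep="first")

print(missing_per_column)
print(f"Duplicate rows: {duplicate_rows}")
print(f"Rows before/after: {len(toy)}/{len(deduplicated)}")
```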

## Data Cleaning

Drop non-informative columns:
- **instant:** just a sequence index, no predictive value
- **dteday:** date string; `yr`, `mnth`, `weekday` and `hr` already capture the temporal patterns

In [41]:
df_clean = df.drop(columns=["instant", "dteday"])

print(f"Columns after cleaning: {df_clean.columns.tolist()}")
print(f"Shape: {df_clean.shape}")

Columns after cleaning: ['season', 'yr', 'mnth', 'hr', 'holiday', 'weekday', 'workingday', 'weathersit', 'temp', 'atemp', 'hum', 'windspeed', 'casual', 'registered', 'cnt']
Shape: (17379, 15)


Validate data consistency: for every row, casual users + registered users should equal the total count.

In [42]:
validation_error = (df_clean["casual"] + df_clean["registered"] - df_clean["cnt"]).abs().sum()

print(f"Data integrity check (should be 0): {validation_error}")

if validation_error == 0:
    print("Data is consistent: casual + registered = cnt")
else:
    print("Data is inconsistent!")

Data integrity check (should be 0): 0
Data is consistent: casual + registered = cnt
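
The summed absolute error says whether any inconsistency exists, but not where. A row-level variant is sketched below on a hypothetical toy frame whose second row is deliberately wrong; the names `mismatch` and `bad_rows` are illustrative.

```python
import pandas as pd

# Hypothetical toy frame; the second row is deliberately inconsistent.
toy = pd.DataFrame({
    "casual": [3, 8, 5],
    "registered": [13, 32, 27],
    "cnt": [16, 41, 32],
})

# Boolean mask: True where casual + registered does not equal cnt.
mismatch = toy["casual"] + toy["registered"] != toy["cnt"]
bad_rows = toy[mismatch]

print(f"Inconsistent rows: {int(mismatch.sum())}")
print(bad_rows)
```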


## Encode Categorical Variables

***pd.get_dummies():*** converts a **categorical** column into multiple columns with boolean values, one per category (also called one-hot encoding)
- **season:** the codes 1, 2, 3, 4 become the columns spring, summer, fall, winter
- **yr:** the codes 0 and 1 become the specific years
- **weekday:** the numbers become day names
- **weathersit:** the numbers become weather descriptions
- the other categorical columns (`mnth` and `hr`) are kept as single integer columns

In [43]:
season_labels = {1: "spring", 2: "summer", 3: "fall", 4: "winter"}
year_labels = {0: "2011", 1: "2012"}
weekday_labels = {0: "Sunday", 1: "Monday", 2: "Tuesday", 3: "Wednesday", 4: "Thursday", 5: "Friday", 6: "Saturday"}
weathersit_labels = {1: "Clear/Few clouds", 2: "Mist/Cloudy", 3: "Light Snow/Rain", 4: "Heavy Rain/Snow"}

season_dummies = pd.get_dummies(df_clean["season"].map(season_labels), prefix="season")
year_dummies = pd.get_dummies(df_clean["yr"].map(year_labels), prefix="year")
weekday_dummies = pd.get_dummies(df_clean["weekday"].map(weekday_labels), prefix="weekday")
weathersit_dummies = pd.get_dummies(df_clean["weathersit"].map(weathersit_labels), prefix="weathersit")

df_encoded = pd.concat([
    df_clean.drop(columns=["season", "weathersit", "weekday", "yr"]),
    season_dummies,
    year_dummies,
    weekday_dummies,
    weathersit_dummies
], axis=1)

print(f"Columns after encoding: {df_encoded.columns.tolist()}")
print(f"Shape after encoding: {df_encoded.shape}")
print(f"\nData types:\n{df_encoded.dtypes}")

Columns after encoding: ['mnth', 'hr', 'holiday', 'workingday', 'temp', 'atemp', 'hum', 'windspeed', 'casual', 'registered', 'cnt', 'season_fall', 'season_spring', 'season_summer', 'season_winter', 'year_2011', 'year_2012', 'weekday_Friday', 'weekday_Monday', 'weekday_Saturday', 'weekday_Sunday', 'weekday_Thursday', 'weekday_Tuesday', 'weekday_Wednesday', 'weathersit_Clear/Few clouds', 'weathersit_Heavy Rain/Snow', 'weathersit_Light Snow/Rain', 'weathersit_Mist/Cloudy']
Shape after encoding: (17379, 28)

Data types:
mnth                             int64
hr                               int64
holiday                          int64
workingday                       int64
temp                           float64
atemp                          float64
hum                            float64
windspeed                      float64
casual                           int64
registered                       int64
cnt                              int64
season_fall                       bool
season_spr
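
The mapping-then-encoding step can be seen in isolation on a toy column. The sketch below reuses the `season_labels` mapping from the cell above. The `drop_first=True` and `dtype=int` arguments are optional variations that the notebook itself does not use: the first drops one redundant column per category, the second returns 0/1 integers instead of booleans.

```python
import pandas as pd

# Same label mapping as in the encoding cell, applied to a toy column.
season_labels = {1: "spring", 2: "summer", 3: "fall", 4: "winter"}
toy = pd.DataFrame({"season": [1, 2, 3, 4, 1]})

labelled = toy["season"].map(season_labels)

# Full one-hot encoding: one column per category, exactly one flag set per row.
full = pd.get_dummies(labelled, prefix="season")

# Variation: drop the first (alphabetical) category and use 0/1 integers.
reduced = pd.get_dummies(labelled, prefix="season", drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

Columns are ordered alphabetically by label, which matches the `season_fall`, `season_spring`, `season_summer`, `season_winter` order in the output above.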

## Feature Scaling

Standardize the numerical features to zero mean and unit variance. This step is optional.
> Note: `hr` is kept unscaled because it is cyclical (other columns can also be left unscaled if scaling them is judged unnecessary)

In [44]:
scaler = StandardScaler()

numerical_cols = [
    "temp", "atemp", "hum", "windspeed",
    "cnt", "casual", "registered"
]

df_encoded[numerical_cols] = scaler.fit_transform(
    df_encoded[numerical_cols]
)

print("Scaling verification (should be ~0 mean, ~1 std):")
print(f"\nMeans:\n{df_encoded[numerical_cols].mean().round(4)}")
print(f"\nStd Devs:\n{df_encoded[numerical_cols].std().round(4)}")
print(f"\nNote: 'hr' column kept unscaled (range 0-23)")

Scaling verification (should be ~0 mean, ~1 std):

Means:
temp          0.0
atemp        -0.0
hum           0.0
windspeed     0.0
cnt          -0.0
casual        0.0
registered   -0.0
dtype: float64

Std Devs:
temp          1.0
atemp         1.0
hum           1.0
windspeed     1.0
cnt           1.0
casual        1.0
registered    1.0
dtype: float64

Note: 'hr' column kept unscaled (range 0-23)
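
Since `hr` is cyclical, one common option (not applied in this notebook) is to project it onto a circle with sine and cosine, so that hour 23 ends up next to hour 0. The sketch below shows the idea on a toy frame; the column names `hr_sin` and `hr_cos` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame covering the start, middle and end of a day.
toy = pd.DataFrame({"hr": [0, 6, 12, 18, 23]})

# Map the 24-hour clock onto the unit circle.
angle = 2 * np.pi * toy["hr"] / 24
toy["hr_sin"] = np.sin(angle)
toy["hr_cos"] = np.cos(angle)

print(toy.round(3))
```

In this representation the distance between hours 23 and 0 is small, whereas on the raw 0-23 scale they are the two most distant values.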


## Final Dataset Check

In [45]:
# Final cleaned and processed dataset
print("Dataset Overview:")
print(f"Shape: {df_encoded.shape}")
print(f"\nFirst 5 rows:")
print(df_encoded.head())
print(f"\nData Info:")
df_encoded.info()
print(f"\nBasic Statistics:")
print(df_encoded.describe().round(3))

Dataset Overview:
Shape: (17379, 28)

First 5 rows:
   mnth  hr  holiday  workingday      temp     atemp       hum  windspeed  \
0     1   0        0           0 -1.334648 -1.093281  0.947372  -1.553889   
1     1   1        0           0 -1.438516 -1.181732  0.895539  -1.553889   
2     1   2        0           0 -1.438516 -1.181732  0.895539  -1.553889   
3     1   3        0           0 -1.334648 -1.093281  0.636370  -1.553889   
4     1   4        0           0 -1.334648 -1.093281  0.636370  -1.553889   

     casual  registered  ...  weekday_Monday  weekday_Saturday  \
0 -0.662755   -0.930189  ...           False              True   
1 -0.561343   -0.804655  ...           False              True   
2 -0.622190   -0.837690  ...           False              True   
3 -0.662755   -0.950010  ...           False              True   
4 -0.723603   -1.009474  ...           False              True   

   weekday_Sunday  weekday_Thursday  weekday_Tuesday  weekday_Wednesday  \
0           F

## Vector Space Representations

Create feature vectors; each one focuses on a different aspect of behavior.


In [46]:
# Weather Impact Vector:
X_weather = df_encoded[["temp", "atemp", "hum", "windspeed", "cnt"]].values

# User Segmentation Vector
X_users = df_encoded[["hr", "casual", "registered"]].values

print(f"X_weather shape: {X_weather.shape} - Weather impact on bike usage")
print(f"X_users shape: {X_users.shape} - User type segmentation by hour")

X_weather shape: (17379, 5) - Weather impact on bike usage
X_users shape: (17379, 3) - User type segmentation by hour
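
Once rows are expressed as vectors, they can be compared geometrically. As a minimal illustration (this is an assumption about how such vectors might be used, not a step the notebook performs), the sketch below computes Euclidean distances between a few invented standardized rows.

```python
import numpy as np

# Hypothetical standardized feature vectors (rows = hours, columns = features).
X = np.array([
    [-1.33, -1.09, 0.95, -1.55],
    [-1.44, -1.18, 0.90, -1.55],
    [1.20, 1.10, -0.80, 0.40],
])

# Euclidean distance from the first row to every row (including itself).
distances = np.linalg.norm(X - X[0], axis=1)

# Position 0 of the sorted order is the row itself, so take position 1.
nearest = int(np.argsort(distances)[1])

print(distances.round(3))
print(f"Nearest neighbour of row 0: row {nearest}")
```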


## Repeat for day.csv

In [47]:

df2 = pd.read_csv("bike_sharing_dataset/day.csv")
# Missing values
sum_nan_values = df2.isna().sum()
print(sum_nan_values)
print("\nMissing values (%):")
print((sum_nan_values / len(df2) * 100).round(2))

# Duplicates
print(f"\nTotal duplicate rows: {df2.duplicated().sum()}")
print(f"Duplicates (keep first): {df2.duplicated(keep='first').sum()}")

# Remove duplicates if needed
df2 = df2.drop_duplicates(keep='first')

# Drop non-informative columns
# instant: just a sequence index, no predictive value
# dteday: date string; yr, mnth and weekday already capture temporal patterns (day.csv has no hr column)

df2_clean = df2.drop(columns=["instant", "dteday"])

print(f"Columns after cleaning: {df2_clean.columns.tolist()}")
print(f"Shape: {df2_clean.shape}")

# Validate data integrity for data consistency issues: casual users + registered users should equal total count
validation_error = (df2_clean["casual"] + df2_clean["registered"] - df2_clean["cnt"]).abs().sum()

print(f"Data integrity check (should be 0): {validation_error}")

if validation_error == 0:
    print("Data is consistent: casual + registered = cnt")
else:
    print("Data is inconsistent!")

# Encode categorical variables to numerical format
season_labels = {1: "spring", 2: "summer", 3: "fall", 4: "winter"}
year_labels = {0: "2011", 1: "2012"}
weekday_labels = {0: "Sunday", 1: "Monday", 2: "Tuesday", 3: "Wednesday", 4: "Thursday", 5: "Friday", 6: "Saturday"}
weathersit_labels = {1: "Clear/Few clouds", 2: "Mist/Cloudy", 3: "Light Snow/Rain", 4: "Heavy Rain/Snow"}

season_dummies = pd.get_dummies(df2_clean["season"].map(season_labels), prefix="season")
year_dummies = pd.get_dummies(df2_clean["yr"].map(year_labels), prefix="year")
weekday_dummies = pd.get_dummies(df2_clean["weekday"].map(weekday_labels), prefix="weekday")
weathersit_dummies = pd.get_dummies(df2_clean["weathersit"].map(weathersit_labels), prefix="weathersit")

df2_encoded = pd.concat([
    df2_clean.drop(columns=["season", "weathersit", "weekday", "yr"]),
    season_dummies,
    year_dummies,
    weekday_dummies,
    weathersit_dummies
], axis=1)

print(f"Columns after encoding: {df2_encoded.columns.tolist()}")
print(f"Shape after encoding: {df2_encoded.shape}")
print(f"\nData types:\n{df2_encoded.dtypes}")

numerical_cols = [
    "temp", "atemp", "hum", "windspeed",
    "cnt", "casual", "registered"
]

df2_encoded[numerical_cols] = scaler.fit_transform(
    df2_encoded[numerical_cols]
)

# Report on the daily dataset (df2_encoded), not the hourly one
print("Dataset Overview:")
print(f"Shape: {df2_encoded.shape}")
print(f"\nFirst 5 rows:")
print(df2_encoded.head())
print(f"\nData Info:")
df2_encoded.info()
print(f"\nBasic Statistics:")
print(df2_encoded.describe().round(3))

instant       0
dteday        0
season        0
yr            0
mnth          0
holiday       0
weekday       0
workingday    0
weathersit    0
temp          0
atemp         0
hum           0
windspeed     0
casual        0
registered    0
cnt           0
dtype: int64

Missing values (%):
instant       0.0
dteday        0.0
season        0.0
yr            0.0
mnth          0.0
holiday       0.0
weekday       0.0
workingday    0.0
weathersit    0.0
temp          0.0
atemp         0.0
hum           0.0
windspeed     0.0
casual        0.0
registered    0.0
cnt           0.0
dtype: float64

Total duplicate rows: 0
Duplicates (keep first): 0
Columns after cleaning: ['season', 'yr', 'mnth', 'holiday', 'weekday', 'workingday', 'weathersit', 'temp', 'atemp', 'hum', 'windspeed', 'casual', 'registered', 'cnt']
Shape: (731, 14)
Data integrity check (should be 0): 0
Data is consistent: casual + registered = cnt
Columns after encoding: ['mnth', 'holiday', 'workingday', 'temp', 'atemp', 'hum', 'wind