### Group Project - London Bike Rentals

In this project, you will work with the London Bikes dataset, which records daily bike rentals in the city along with key variables such as dates, weather conditions, and seasonality.

The goal is to apply the full data analytics workflow:

- Clean and prepare the dataset.

- Explore the data through visualisation.

- Construct and interpret confidence intervals.

- Build a regression model to explain variation in bike rentals.

By the end, you will have connected statistical concepts with practical Python analysis.

In [2]:
## Import libraries and data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# load the bikes dataset
bikes = pd.read_csv("../Data/london_bikes.csv")
bikes


Unnamed: 0,date,bikes_hired,year,wday,month,week,cloud_cover,humidity,pressure,radiation,precipitation,snow_depth,sunshine,mean_temp,min_temp,max_temp,weekend
0,2010-07-30T00:00:00Z,6897,2010,Fri,Jul,30,6.0,65.0,10147.0,157.0,22.0,,31.0,17.7,12.3,25.1,False
1,2010-07-31T00:00:00Z,5564,2010,Sat,Jul,30,5.0,70.0,10116.0,184.0,0.0,,47.0,21.1,17.0,23.9,True
2,2010-08-01T00:00:00Z,4303,2010,Sun,Aug,30,7.0,63.0,10132.0,89.0,0.0,,3.0,19.3,14.6,23.4,True
3,2010-08-02T00:00:00Z,6642,2010,Mon,Aug,31,7.0,59.0,10168.0,134.0,0.0,,20.0,19.5,15.6,23.6,False
4,2010-08-03T00:00:00Z,7966,2010,Tue,Aug,31,5.0,66.0,10157.0,169.0,0.0,,39.0,17.9,12.1,20.1,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4929,2024-01-27T00:00:00Z,16959,2024,Sat,Jan,4,4.0,,10331.0,39.0,0.0,0.0,21.0,4.5,,12.2,True
4930,2024-01-28T00:00:00Z,15540,2024,Sun,Jan,4,3.0,,10230.0,63.0,0.0,0.0,59.0,6.6,,12.5,True
4931,2024-01-29T00:00:00Z,22839,2024,Mon,Jan,5,8.0,,10222.0,18.0,0.0,0.0,0.0,8.8,,8.8,False
4932,2024-01-30T00:00:00Z,22303,2024,Tue,Jan,5,8.0,,10277.0,19.0,0.0,0.0,0.0,8.3,,12.0,False


**1. Data Cleaning**

Check for missing values across columns. How would you handle them?

Inspect the date column and ensure it is correctly formatted as datetime. Extract useful features (year, month, day, day of week, season).

Convert categorical variables (e.g., season, weather) to appropriate categories in Python.

Ensure numeric columns (e.g., bikes rented, temperature) are in the right format.

In [17]:
## Your code goes here
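# Count missing values in each column (only a cell's last expression is displayed,
# so wrap this in print() to see it)
bikes.isna().sum()
# Percentage of missing values per column; rows are dropped where this is below 5%
missing_percent = bikes.isna().mean() * 100
missing_percent
# cloud_cover, humidity, pressure and radiation have missing values but fall under
# the 5% threshold, so rows missing any of them are dropped.
# .copy() makes the result an independent frame, so adding columns later does not
# trigger pandas' SettingWithCopyWarning.
bikes = bikes.dropna(subset=['cloud_cover','humidity','pressure','radiation']).copy()
# Double check that no missing values remain in those columns
print(bikes[['cloud_cover','humidity','pressure','radiation']].isna().sum())
# All four columns now report 0 missing values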

cloud_cover    0
humidity       0
pressure       0
radiation      0
dtype: int64


In [18]:
# inspect the date column
print(bikes["date"].dtype)
# make date column datetime
bikes["date"] = pd.to_datetime(bikes["date"])
print(bikes["date"].dtype)
print(bikes.head())

# extract features
bikes['year'] = bikes['date'].dt.year
bikes['month'] = bikes['date'].dt.month
bikes['day'] = bikes['date'].dt.day
bikes['weekday'] = bikes['date'].dt.day_name()

def get_season(month):
    if month in [12, 1, 2]:
        return 'Winter'
    elif month in [3, 4, 5]:
        return 'Spring'
    elif month in [6, 7, 8]:
        return 'Summer'
    else:
        return 'Autumn'

bikes['season'] = bikes['month'].apply(get_season)

print(bikes[['date','year','month','day','weekday','season']].head())


datetime64[ns, UTC]
datetime64[ns, UTC]
                       date  bikes_hired  year wday  month  week  cloud_cover  \
0 2010-07-30 00:00:00+00:00         6897  2010  Fri      7    30          6.0   
1 2010-07-31 00:00:00+00:00         5564  2010  Sat      7    30          5.0   
2 2010-08-01 00:00:00+00:00         4303  2010  Sun      8    30          7.0   
3 2010-08-02 00:00:00+00:00         6642  2010  Mon      8    31          7.0   
4 2010-08-03 00:00:00+00:00         7966  2010  Tue      8    31          5.0   

   humidity  pressure  radiation  precipitation  snow_depth  sunshine  \
0      65.0   10147.0      157.0           22.0         NaN      31.0   
1      70.0   10116.0      184.0            0.0         NaN      47.0   
2      63.0   10132.0       89.0            0.0         NaN       3.0   
3      59.0   10168.0      134.0            0.0         NaN      20.0   
4      66.0   10157.0      169.0            0.0         NaN      39.0   

   mean_temp  min_temp  max_temp  
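
The cleaning tasks also ask for categorical conversion and a check on numeric formats. A minimal sketch of one way to do both is given here, using a tiny made-up frame called `demo` (an assumption for illustration, not the real data) that borrows the dataset's column names. The category orderings are also a choice made for this sketch.

```python
import pandas as pd

# Tiny stand-in frame with the bikes column names; the values are made up.
demo = pd.DataFrame({
    "bikes_hired": ["6897", "5564", "4303", "6642"],  # stored as text on purpose
    "mean_temp": [17.7, 21.1, 19.3, 4.5],
    "season": ["Summer", "Summer", "Summer", "Winter"],
    "wday": ["Fri", "Sat", "Sun", "Mon"],
})

# Ordered categories keep seasons and weekdays in calendar order in tables and plots.
season_order = ["Winter", "Spring", "Summer", "Autumn"]
wday_order = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
demo["season"] = pd.Categorical(demo["season"], categories=season_order, ordered=True)
demo["wday"] = pd.Categorical(demo["wday"], categories=wday_order, ordered=True)

# Coerce numeric columns; anything that cannot be parsed becomes NaN instead of raising.
for col in ["bikes_hired", "mean_temp"]:
    demo[col] = pd.to_numeric(demo[col], errors="coerce")

print(demo.dtypes)
```

On the real data, the same calls would be applied to `bikes` in place of `demo`.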

**2. Exploratory Data Analysis (EDA)**

Plot the distribution of bikes rented.

Explore how rentals vary by season and month.

Investigate the relationship between temperature and bikes rented.

**Deliverables:**

At least 3 clear visualisations with captions.

A short written interpretation of key patterns (seasonality, weather effects, etc.).



In [None]:
## Your code goes here
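
As a possible starting point for the three plots requested above, here is a hedged sketch. It runs on synthetic data (`demo`, generated below with made-up values and the dataset's column names) so that it is self-contained; the plot types, bin count, and figure size are illustrative choices, not requirements.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the cleaned `bikes` frame: made-up values, real column names.
rng = np.random.default_rng(0)
dates = pd.date_range("2022-01-01", "2022-12-31", freq="D")
doy = dates.dayofyear.to_numpy()
temp = 11 + 8 * np.sin(2 * np.pi * (doy - 110) / 365) + rng.normal(0, 2, len(dates))
demo = pd.DataFrame({
    "date": dates,
    "mean_temp": temp,
    "bikes_hired": np.round(20000 + 900 * temp + rng.normal(0, 3000, len(dates))),
})
demo["month"] = demo["date"].dt.month

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# 1. Distribution of daily rentals
axes[0].hist(demo["bikes_hired"], bins=30)
axes[0].set(title="Distribution of daily rentals", xlabel="Bikes hired", ylabel="Number of days")

# 2. Rentals by month: one box per month, drawn at positions 1..12
by_month = [demo.loc[demo["month"] == m, "bikes_hired"].to_numpy() for m in range(1, 13)]
axes[1].boxplot(by_month)
axes[1].set(title="Daily rentals by month", xlabel="Month", ylabel="Bikes hired")

# 3. Temperature against rentals
axes[2].scatter(demo["mean_temp"], demo["bikes_hired"], s=8, alpha=0.5)
axes[2].set(title="Rentals vs mean temperature", xlabel="Mean temperature", ylabel="Bikes hired")

fig.tight_layout()
```

Swapping `demo` for `bikes` should give the real plots; grouping on `season` instead of `month` follows the same pattern.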

**3. Confidence Intervals**

Construct 95% confidence intervals for the mean number of bikes rented per season.

Repeat the calculation per month.

Interpret the result:

What range of values do you expect the true mean to lie in?

Which seasons/months have higher or lower average demand?

Are there overlaps in the intervals, and what does that mean?

**Deliverables:**

A table or plot showing the mean and confidence intervals.

A short interpretation.

In [None]:
## Your code goes here
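
One way to compute the intervals is the t-based formula, mean ± t × (standard deviation / √n), applied within each group. The sketch below assumes that approach and uses synthetic data (`demo`, made-up values); the helper `mean_ci` is a hypothetical name introduced here. It treats days as independent draws, which is a simplification for daily time-series data.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for the cleaned data: 90 made-up days per season.
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "season": np.repeat(["Winter", "Spring", "Summer", "Autumn"], 90),
    "bikes_hired": np.concatenate([
        rng.normal(18000, 4000, 90),
        rng.normal(26000, 5000, 90),
        rng.normal(33000, 5000, 90),
        rng.normal(27000, 5000, 90),
    ]),
})

def mean_ci(df, group_col, value_col="bikes_hired", level=0.95):
    """Mean and t-based confidence interval of value_col within each group."""
    out = df.groupby(group_col)[value_col].agg(["mean", "std", "count"])
    out["se"] = out["std"] / np.sqrt(out["count"])  # standard error of the mean
    # Two-sided critical value with n - 1 degrees of freedom per group
    t_crit = stats.t.ppf(1 - (1 - level) / 2, df=out["count"] - 1)
    out["ci_low"] = out["mean"] - t_crit * out["se"]
    out["ci_high"] = out["mean"] + t_crit * out["se"]
    return out

season_ci = mean_ci(demo, "season")
print(season_ci.round(0))
```

Calling `mean_ci(bikes, "month")` on the real data would repeat the calculation per month. Two groups whose intervals do not overlap have clearly different estimated means; overlapping intervals leave the comparison inconclusive at a glance.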

**4. Regression Analysis**

What variables influence the number of bikes rented (y) and how? Build a regression model that best explains the variability in bikes rented.

**Interpret:**

Which predictors are significant?

What do the coefficients mean (in practical terms)?

How much of the variation in bike rentals is explained (R²)?

**Deliverables:**

Regression output table.

A short discussion of which factors matter most for predicting bike rentals.

In [None]:
### Your code goes here
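
A minimal sketch of an ordinary least squares fit is shown here, assuming the `statsmodels` formula API is available (it is not imported elsewhere in this notebook). The data are synthetic (`demo`, with made-up values and effects), and the choice of predictors is illustrative only; selecting the best model for the real data is part of the task.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in: made-up predictors and a made-up linear relationship plus noise.
rng = np.random.default_rng(2)
n = 400
demo = pd.DataFrame({
    "mean_temp": rng.normal(12, 6, n),
    "precipitation": rng.exponential(2, n),
    "weekend": rng.random(n) < 2 / 7,
})
demo["bikes_hired"] = (
    20000
    + 800 * demo["mean_temp"]
    - 300 * demo["precipitation"]
    - 2500 * demo["weekend"]
    + rng.normal(0, 3000, n)
)

# C(...) marks weekend as categorical, so it enters the model as a dummy variable.
model = smf.ols("bikes_hired ~ mean_temp + precipitation + C(weekend)", data=demo).fit()

print(model.summary())          # regression output table
print(model.params.round(1))    # coefficients: change in rentals per unit change in each predictor
print(model.pvalues.round(4))   # small p-values indicate statistically significant predictors
print(round(model.rsquared, 3)) # share of variation in rentals explained by the model
```

With the real data, `demo` would be replaced by `bikes`, and categorical columns such as `season` can be added to the formula with `C(season)`.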

## Deliverables
A knitted HTML file. One person per group should submit it.