## Regression based on only datetime information
_GOAL:_ Predict the value of pm10 based on datetime information only (e.g. day of the week, hour of the day, month of the year, year)
- Train once with year and once without year
- Train 2-3 models with a random train-validate split and with a time block split

Regression models:
1. Decision Tree
2. Random Forest
3. KNN
4. Linear Regression  



**This notebook contains the decision tree model for pm10 prediction based on datetime information only**  

#### **_PREPARATION_**

In [1]:
# GET ALL THE JSONS INTO ONE DATAFRAME
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import os
import json
import glob

In [2]:
# Set the search path for files (assuming the directory is relative to the current script)
file_path_mc124 = os.path.join("..", "mc124_data", "*.json")
files = glob.glob(file_path_mc124)

# Create empty list to store dataframes
li_all_files = []

# Loop through list of files and read each one into a dataframe and append to list
for f in files:
    # Read in json
    temp_df = pd.read_json(f)
    # Append df to list
    li_all_files.append(temp_df)

# Optionally concatenate all dataframes into one if needed
if li_all_files:
    combined_df = pd.concat(li_all_files)
    print(f'Combined dataframe shape: {combined_df.shape}')
else:
    print('No dataframes were created.')

Combined dataframe shape: (542555, 6)
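
The shape above counts the rows of all files together, but `pd.concat` keeps each file's own row labels by default, so the combined index repeats (the `info()` output further down reports 542555 entries labelled only 0 to 3654). A minimal sketch of the difference, assuming two small made-up frames in place of the real JSON files:

```python
import pandas as pd

# Two tiny stand-in frames (hypothetical values, not the real station files)
df_a = pd.DataFrame({"core": ["pm10", "no"], "value": [18.0, 11.0]})
df_b = pd.DataFrame({"core": ["pm10", "nox"], "value": [7.0, 146.0]})

# Default concat keeps each frame's labels, so 0 and 1 each appear twice
dup_index = pd.concat([df_a, df_b])

# ignore_index=True renumbers the rows from 0 to n-1
unique_index = pd.concat([df_a, df_b], ignore_index=True)

print(dup_index.index.tolist())
print(unique_index.index.tolist())
```

A unique index is not required for the steps in this notebook, but it avoids surprises with label-based selection later on.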


In [3]:
combined_df.sample()

Unnamed: 0,datetime,station,core,component,period,value
1764,2020-04-16 07:00:00+02:00,mc124,nox,nox_1h,1h,146.0


In [4]:
# KEEP ONLY THE DATETIME, STATION, CORE AND VALUE COLUMNS; COMPONENT AND PERIOD ARE DROPPED (THE FILTER BY PARTICLE HAPPENS LATER VIA core)
df_reduced = combined_df[['datetime', 'station', 'core', 'value']]

# CUT OFF THE TIMEZONE OFFSET FROM THE DATETIME STRING TO AVOID CONVERSION ISSUES DUE TO THE TIME CHANGE IN MARCH AND OCTOBER
df_reduced.loc[:, 'datetime'] = df_reduced['datetime'].astype(str).str.slice(0, 19)
df_reduced['datetime'] = pd.to_datetime(df_reduced['datetime'], format='mixed')
# The parsed values are already timezone-naive; this call only makes that explicit
df_reduced.loc[:, 'datetime'] = df_reduced['datetime'].dt.tz_localize(None)
df_reduced.info()

<class 'pandas.core.frame.DataFrame'>
Index: 542555 entries, 0 to 3654
Data columns (total 4 columns):
 #   Column    Non-Null Count   Dtype         
---  ------    --------------   -----         
 0   datetime  542555 non-null  datetime64[ns]
 1   station   542555 non-null  object        
 2   core      542555 non-null  object        
 3   value     539422 non-null  float64       
dtypes: datetime64[ns](1), float64(1), object(2)
memory usage: 20.7+ MB


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_reduced['datetime'] = pd.to_datetime(df_reduced['datetime'], format='mixed')
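
The warning above appears because `df_reduced` is a column subset of `combined_df`, so pandas cannot tell whether the assignment is meant to reach the original frame. One common way to make the intent explicit is to take a `.copy()` of the subset before modifying it. The sketch below assumes a two-row stand-in for `combined_df` with made-up values, and the names `combined_demo` and `reduced_demo` are hypothetical:

```python
import pandas as pd

# Small stand-in for combined_df (hypothetical values)
combined_demo = pd.DataFrame({
    "datetime": ["2020-04-16 07:00:00+02:00", "2020-10-25 02:00:00+01:00"],
    "station": ["mc124", "mc124"],
    "core": ["nox", "pm10"],
    "period": ["1h", "1h"],
    "value": [146.0, 18.0],
})

# .copy() makes the subset an independent frame, so later assignments are unambiguous
reduced_demo = combined_demo[["datetime", "station", "core", "value"]].copy()

# Keep the local wall-clock time: drop the UTC offset, then parse the remaining string
reduced_demo["datetime"] = pd.to_datetime(
    reduced_demo["datetime"].str.slice(0, 19), format="%Y-%m-%d %H:%M:%S"
)

print(reduced_demo.dtypes)
```

The original frame is left untouched, and the parsed column is timezone-naive without a separate `tz_localize(None)` step.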


In [5]:
df_reduced.sample(3)

Unnamed: 0,datetime,station,core,value
1448,2024-04-18 22:00:00,mc124,no,11.0
610,2012-04-22 12:00:00,mc124,no,27.0
911,2023-08-24 09:00:00,mc124,pm2,15.0


In [6]:
# Derive hour, day, month and year from the datetime column (strftime returns zero-padded strings)
df_reduced['hour'] = df_reduced['datetime'].dt.strftime('%H')  # Hour (00-23)
df_reduced['day'] = df_reduced['datetime'].dt.strftime('%d')  # Day of the month (01-31)
df_reduced['month'] = df_reduced['datetime'].dt.strftime('%m')  # Month (01-12)
df_reduced['year'] = df_reduced['datetime'].dt.strftime('%Y')  # Year (four digits)
df_reduced.sample(3)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_reduced['hour'] = df_reduced['datetime'].dt.strftime('%H')  # Hour (00-23)


Unnamed: 0,datetime,station,core,value,hour,day,month,year
395,2013-06-25 12:00:00,mc124,nox,222.0,12,25,6,2013
1666,2009-08-08 20:00:00,mc124,no,2.0,20,8,8,2009
3628,2018-12-01 18:00:00,mc124,no,76.0,18,1,12,2018
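
Since `strftime` produces strings, the four new columns end up with dtype `object` and have to be cast before a regressor can use them. As an alternative sketch, assuming a small demo frame with made-up timestamps, the `.dt` accessor returns the same fields directly as integers:

```python
import pandas as pd

# Hypothetical demo frame with an already parsed datetime column
demo = pd.DataFrame(
    {"datetime": pd.to_datetime(["2013-06-25 12:00:00", "2018-12-01 18:00:00"])}
)

# The .dt accessor yields integer columns, so no later astype(int) is needed
demo["hour"] = demo["datetime"].dt.hour
demo["day"] = demo["datetime"].dt.day
demo["month"] = demo["datetime"].dt.month
demo["year"] = demo["datetime"].dt.year

print(demo.dtypes)
```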


In [7]:
df_reduced.info()

<class 'pandas.core.frame.DataFrame'>
Index: 542555 entries, 0 to 3654
Data columns (total 8 columns):
 #   Column    Non-Null Count   Dtype         
---  ------    --------------   -----         
 0   datetime  542555 non-null  datetime64[ns]
 1   station   542555 non-null  object        
 2   core      542555 non-null  object        
 3   value     539422 non-null  float64       
 4   hour      542555 non-null  object        
 5   day       542555 non-null  object        
 6   month     542555 non-null  object        
 7   year      542555 non-null  object        
dtypes: datetime64[ns](1), float64(1), object(6)
memory usage: 37.3+ MB


In [8]:
# reduce to pm10
df_pm10 = df_reduced[(df_reduced['core'] == 'pm10')]
df_pm10.sample(4)

Unnamed: 0,datetime,station,core,value,hour,day,month,year
465,2021-09-27 02:00:00,mc124,pm10,18.0,2,27,9,2021
3415,2019-11-02 12:00:00,mc124,pm10,7.0,12,2,11,2019
2505,2020-03-11 01:00:00,mc124,pm10,9.0,1,11,3,2020
1985,2017-04-14 10:00:00,mc124,pm10,15.0,10,14,4,2017


In [9]:
# use library to get the day of week based on the datetime
import calendar

days = {
    0: "Monday",
    1: "Tuesday",
    2: "Wednesday",
    3: "Thursday",
    4: "Friday",
    5: "Saturday",
    6: "Sunday",
}

df_daytime = df_reduced
# Convert the 'day', 'month', and 'year' columns to integers
df_daytime['day'] = df_reduced['day'].astype(int)
df_daytime['month'] = df_reduced['month'].astype(int)
df_daytime['year'] = df_reduced['year'].astype(int)

# Function to determine the day of the week
def get_day_of_week(row):
    day_number = calendar.weekday(row['year'], row['month'], row['day'])
    return days[day_number]

# Apply the function to create the new column
df_daytime['day_of_week'] = df_daytime.apply(get_day_of_week, axis=1)
df_daytime.info()

<class 'pandas.core.frame.DataFrame'>
Index: 542555 entries, 0 to 3654
Data columns (total 9 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   datetime     542555 non-null  datetime64[ns]
 1   station      542555 non-null  object        
 2   core         542555 non-null  object        
 3   value        539422 non-null  float64       
 4   hour         542555 non-null  object        
 5   day          542555 non-null  int32         
 6   month        542555 non-null  int32         
 7   year         542555 non-null  int32         
 8   day_of_week  542555 non-null  object        
dtypes: datetime64[ns](1), float64(1), int32(3), object(4)
memory usage: 35.2+ MB


In [10]:
df_daytime.sample(3)

Unnamed: 0,datetime,station,core,value,hour,day,month,year,day_of_week
1353,2017-11-19 17:00:00,mc124,no,28.0,17,19,11,2017,Sunday
1686,2012-01-08 13:00:00,mc124,no2,53.0,13,8,1,2012,Sunday
3239,2018-07-05 00:00:00,mc124,nox,161.0,0,5,7,2018,Thursday
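
The row-wise `apply` with `calendar.weekday` can also be written with the vectorized `.dt` accessor, which avoids the hand-written lookup dictionary and is usually much faster on several hundred thousand rows. A minimal sketch, assuming a demo frame with three made-up timestamps:

```python
import pandas as pd

# Hypothetical demo frame: a Monday, a Saturday and a Sunday
demo = pd.DataFrame(
    {"datetime": pd.to_datetime(["2021-09-27 02:00:00",
                                 "2019-11-02 12:00:00",
                                 "2019-11-03 12:00:00"])}
)

# English weekday names, computed for the whole column at once
demo["day_of_week"] = demo["datetime"].dt.day_name()

# Numeric weekday (0 = Monday ... 6 = Sunday), convenient as a model feature
demo["day_of_week_num"] = demo["datetime"].dt.dayofweek

print(demo)
```

The numeric variant uses the same 0-to-6 convention as `calendar.weekday`, so the two approaches are interchangeable.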


In [11]:
# Select the pm10 rows from df_daytime and keep only the target and the datetime features.
# df_pm10 cannot be used here: it was created before 'day_of_week' was added,
# which raised KeyError: "['day_of_week'] not in index".
df_daytime_pm10 = df_daytime.loc[df_daytime['core'] == 'pm10',
                                 ['value', 'hour', 'day', 'month', 'year', 'day_of_week']]

# Rename the 'value' column to 'pm10_value'
df_daytime_pm10 = df_daytime_pm10.rename(columns={'value': 'pm10_value'})

df_daytime_pm10.sample(5)

#### **_DECISION TREE_** - 1
Target: pm10_value  
Features: hour, day, month, year, day_of_week  
Split: Random (Cross-validation)
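
A minimal sketch of what this variant could look like with scikit-learn, not the notebook's actual implementation. It assumes synthetic hourly data in place of `df_daytime_pm10` (the pm10 values are random, so the score itself is meaningless), a numeric weekday column, and arbitrary choices for `max_depth` and the number of folds:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic hourly stand-in for df_daytime_pm10 (random values, not station data)
rng = np.random.default_rng(0)
timestamps = pd.date_range("2020-01-01", periods=2000, freq=pd.Timedelta(hours=1))
demo = pd.DataFrame({
    "pm10_value": rng.gamma(shape=2.0, scale=10.0, size=len(timestamps)),
    "hour": timestamps.hour,
    "day": timestamps.day,
    "month": timestamps.month,
    "year": timestamps.year,
    "day_of_week": timestamps.dayofweek,  # 0 = Monday, numeric so the tree can split on it
})

X = demo.drop(columns=["pm10_value"])  # also drop "year" for the variant without year
y = demo["pm10_value"]

# max_depth=8 is an arbitrary illustrative value, not a tuned setting
model = DecisionTreeRegressor(max_depth=8, random_state=0)

# Shuffled folds correspond to a random train-validate split
cv = KFold(n_splits=5, shuffle=True, random_state=0)
neg_mae = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_absolute_error")
mae_per_fold = -neg_mae

print(mae_per_fold.mean())
```

With shuffled folds, neighbouring hours of the same day can land in both training and validation data, which is one reason to compare against a time-based split.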

#### **_DECISION TREE_** - 2
Target: pm10_value  
Features: hour, day, month, day_of_week  
Split: Random (Cross-validation)

#### **_DECISION TREE_** - 3
Target: pm10_value  
Features: hour, day, month, year, day_of_week  
Split: Time Blocks
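
One possible reading of "time blocks" is scikit-learn's `TimeSeriesSplit`, where every validation block lies after the data used for training. The sketch below is an assumption about how such a split could be set up, not the notebook's method. It again uses synthetic hourly data with random pm10 values, and `max_depth` and `n_splits` are arbitrary:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit
from sklearn.tree import DecisionTreeRegressor

# Synthetic hourly stand-in for the pm10 data (random values, not station data)
rng = np.random.default_rng(0)
timestamps = pd.date_range("2020-01-01", periods=2000, freq=pd.Timedelta(hours=1))
demo = pd.DataFrame({
    "datetime": timestamps,
    "pm10_value": rng.gamma(shape=2.0, scale=10.0, size=len(timestamps)),
    "hour": timestamps.hour,
    "day": timestamps.day,
    "month": timestamps.month,
    "year": timestamps.year,
    "day_of_week": timestamps.dayofweek,
})

# TimeSeriesSplit works on row positions, so the rows must be in time order
demo = demo.sort_values("datetime").reset_index(drop=True)
X = demo.drop(columns=["datetime", "pm10_value"])  # also drop "year" for the variant without year
y = demo["pm10_value"]

tscv = TimeSeriesSplit(n_splits=4)
fold_mae = []
blocks_in_order = []
for train_idx, val_idx in tscv.split(X):
    model = DecisionTreeRegressor(max_depth=8, random_state=0)
    model.fit(X.iloc[train_idx], y.iloc[train_idx])
    pred = model.predict(X.iloc[val_idx])
    fold_mae.append(mean_absolute_error(y.iloc[val_idx], pred))
    # Record that the validation block starts after the training data ends
    blocks_in_order.append(train_idx.max() < val_idx.min())

print(fold_mae)
```

Because the training window grows with each fold, the folds are not equally sized, so the per-fold errors are best read individually rather than only as an average.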

#### **_DECISION TREE_** - 4
Target: pm10_value  
Features: hour, day, month, day_of_week  
Split: Time Blocks