
**IMPORTS**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math
import plotly.express as px
# modelling and evaluation
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.ar_model import AutoReg
from sklearn.metrics import mean_absolute_error

In [None]:
def wrangle(filepath):
    """Read the sensor CSV and return the `P2` (PM2.5) readings indexed by local time."""
    # reading the csv file
    df = pd.read_csv(filepath,
                     delimiter = ";",
                     index_col = "timestamp")
    # changing timestamp object to datetime64
    df.index = pd.to_datetime(df.index)
    # converting to Nairobi local time (requires timezone-aware timestamps)
    df.index = df.index.tz_convert("Africa/Nairobi")
    # setting a one-minute period frequency ("min" replaces the deprecated "T" alias)
    df.index = df.index.to_period("min")
    # subsetting `P2` (PM2.5) observations only
    df = df[df["value_type"] == "P2"]
    return df

In [None]:
df = wrangle("/content/tmpfajv_dqd.csv")
df.index[:3]

This data set contains particulate matter (PM), temperature, and humidity readings taken with low-cost sensors. The sensors measure the concentration of airborne particles with diameters of at most 1 micrometer (PM1), 2.5 micrometers (PM2.5), and 10 micrometers (PM10). Each record includes the sensor type, date, time, and location of the reading, along with the measured values for temperature (°C), humidity (%), PM1, PM2.5, and PM10. The data set suits researchers and individuals interested in air quality and in the use of low-cost sensors for PM measurement. It is stored in CSV format and can be opened in spreadsheet editors such as Microsoft Excel, Google Sheets, or LibreOffice Calc. Note that in the data `P0` represents PM1, `P2` represents PM2.5, and `P1` represents PM10.
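
The code-to-pollutant mapping in the note above can be written down explicitly. This is only a minimal sketch: the `PM_LABELS` dictionary and the tiny `sample` frame (including its `humidity` row) are illustrative assumptions, not values taken from the data set.

```python
import pandas as pd

# Hypothetical lookup built from the note above: P0 -> PM1, P2 -> PM2.5, P1 -> PM10
PM_LABELS = {"P0": "PM1", "P2": "PM2.5", "P1": "PM10"}

# Made-up rows, only to show the mapping in action
sample = pd.DataFrame({
    "value_type": ["P0", "P1", "P2", "humidity"],
    "value": [4.0, 12.5, 8.1, 61.0],
})

# Codes that are not PM readings get no label (NaN)
sample["pollutant"] = sample["value_type"].map(PM_LABELS)

# Selecting PM2.5 by label is equivalent to filtering on value_type == "P2"
pm25 = sample[sample["pollutant"] == "PM2.5"]
```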

In [None]:
df.head()

In [None]:
# creating a boxplot for [p2] items
fig, ax = plt.subplots(figsize = (15,6))
df["value"].plot(kind = "box",vert = False,
                 title = "P2 values box plot",
                 ax = ax);

The P2 values do not appear to contain outliers, so we can proceed.
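
The visual check on the box plot can be backed up with a count. The sketch below applies the 1.5 × IQR whisker rule that box plots use by default; `count_iqr_outliers` is a hypothetical helper and the two short series are synthetic stand-ins for `df["value"]`.

```python
import pandas as pd

def count_iqr_outliers(values: pd.Series, k: float = 1.5) -> int:
    """Count points outside [Q1 - k*IQR, Q3 + k*IQR], the box-plot whisker rule."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    outside = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return int(outside.sum())

# Synthetic examples (not the Juja readings)
clean = pd.Series([8.0, 9.0, 10.0, 11.0, 12.0])
spiky = pd.Series([8.0, 9.0, 10.0, 11.0, 12.0, 60.0])

n_clean = count_iqr_outliers(clean)   # nothing beyond the whiskers
n_spiky = count_iqr_outliers(spiky)   # the 60.0 reading is flagged
```

On the notebook's data, `count_iqr_outliers(df["value"])` returning 0 would agree with the reading of the box plot.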

In [None]:
# timeseries plot
fig, ax = plt.subplots(figsize = (15,6))
df["value"].plot(title = "Time series plot for the P2 values",
                 ax = ax);

The plot covers readings taken in Juja from 9:15 pm to 10:35 pm.

In [None]:
# resampling to know more
resamp = df["value"].resample("H").mean().to_frame()
fig,ax = plt.subplots(figsize = (15,6))
resamp.plot(title  = "One hour resampled plot", ax = ax);

In [None]:
# 12-observation rolling mean to smooth the series
rolling = df["value"].rolling(12).mean()
fig, ax = plt.subplots(figsize = (15,6))
rolling.plot(title = "12-period rolling average plot",
             ylabel = "PM2.5",
             ax = ax);
plt.show()
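
To see what the rolling mean does, here is a small sketch with a window of 3 instead of 12, on a made-up series. The first `window - 1` positions have no full window yet, so they come out as `NaN`.

```python
import pandas as pd

# Made-up series, only for illustration
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Each value is the mean of the current and the two previous observations
rolled = s.rolling(3).mean()

# Positions 0 and 1 are NaN; position 2 is mean(1, 2, 3) = 2.0, and so on
n_missing = int(rolled.isna().sum())
smoothed = list(rolled.dropna())
```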

Creating a lag-1 feature by shifting the series one step

In [None]:
df["L1"] = df["value"].shift(1)
df[["value","L1"]].corr()

There is a very high positive correlation of 0.97. This shows a strong relationship between each reading and the one before it.
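
The same lag-1 correlation can be obtained in two ways, which is a useful cross-check. This sketch uses a synthetic random walk in place of the sensor readings; the names `frame`, `lag1_from_matrix`, and `lag1_direct` are illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic, strongly persistent series (a random walk), not the Juja data
rng = np.random.default_rng(0)
s = pd.Series(rng.normal(size=200).cumsum(), name="value")

# Route 1: build the lag column and read the correlation matrix
frame = s.to_frame()
frame["L1"] = frame["value"].shift(1)   # first row has no previous value, so NaN
lag1_from_matrix = frame[["value", "L1"]].corr().loc["value", "L1"]

# Route 2: pandas' built-in lag-k autocorrelation
lag1_direct = s.autocorr(lag=1)
```

Both routes drop the first row, where the lagged value is missing, before computing the Pearson correlation.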

In [None]:
# scatter plot of each reading against the previous reading (lag 1)
fig, ax = plt.subplots(figsize = (15,6))
ax.scatter(x = df["L1"],y = df["value"], label = "Observations")
# reference line y = x: points on it mean the reading did not change
ax.plot([5,16],[5,16],linestyle = "dashdot",color = "orange", label = "y = x")
ax.set_xlabel("Lag 1 value [PM2.5]")
ax.set_ylabel("Value [PM2.5]")
ax.set_title("Correlation between PM2.5 values and lag 1")
ax.legend()
plt.show()

In [None]:
# acf plot
fig, ax = plt.subplots(figsize = (15,6))
plot_acf(df["value"],ax = ax)
plt.show();

As we can see, the autocorrelation decreases gradually, which suggests that an AR model is appropriate. Let us look at the partial autocorrelation to validate this.

In [None]:
fig, ax = plt.subplots(figsize = (15,6))
plot_pacf(df["value"], ax = ax)
plt.show();

The PACF plot shows only one significant lag, so a single lag is enough for the model; the remaining lags are statistically insignificant.
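
With one lag, the model has the form y[t] = c + phi * y[t-1] + noise. As a rough sketch of what fitting it involves, the code below estimates `c` and `phi` by ordinary least squares with NumPy on a simulated series. `fit_ar1` is a hypothetical helper and the parameters 2.0 and 0.8 are arbitrary assumptions; this is a hand-rolled illustration, not the statsmodels `AutoReg` code used in the notebook.

```python
import numpy as np

def fit_ar1(y):
    """Least-squares fit of y[t] = c + phi * y[t-1]; returns (c, phi)."""
    y = np.asarray(y, dtype=float)
    # Design matrix: a column of ones (intercept) and the lagged values
    X = np.column_stack([np.ones(len(y) - 1), y[:-1]])
    coef, *_ = np.linalg.lstsq(X, y[1:], rcond=None)
    return coef[0], coef[1]

# Simulate an AR(1) series with known (assumed) parameters
rng = np.random.default_rng(42)
true_c, true_phi = 2.0, 0.8
y = np.empty(2000)
y[0] = true_c / (1 - true_phi)          # start at the long-run mean
for t in range(1, len(y)):
    y[t] = true_c + true_phi * y[t - 1] + rng.normal(scale=0.5)

c_hat, phi_hat = fit_ar1(y)             # should land near 2.0 and 0.8
```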


In [None]:
# chronological split point: first 80% for training, last 20% for testing
target = "value"
y = df[target]
cutoff = int(len(y) * 0.8)
cutoff

In [None]:
# y_train, y_test split
y_train, y_test = y.iloc[:cutoff],y.iloc[cutoff:]
# sanity check: the two pieces cover the whole series
assert len(y_train) + len(y_test) == len(y)
y_train.head()

In [None]:
# fitting an AR(1) model on the training data
model = AutoReg(y_train, lags = 1).fit()

In [None]:
# training predictions
y_pred = model.predict().dropna()
y_pred[:3]

In [None]:
# calculating residuals
y_train_resid = model.resid
y_train_resid.head()

In [None]:
# plotting the training residuals over time
fig, ax = plt.subplots(figsize = (15,6))
y_train_resid.plot(ax = ax)
plt.title("AR(1) training residuals over time")
plt.ylabel("Residuals")
plt.xlabel("Time (t)")
plt.show();

In [None]:
# a histogram
fig, ax = plt.subplots(figsize = (12,6))
y_train_resid.hist(ax = ax)
plt.title("A histogram of the Residuals")
plt.ylabel("Frequency")
plt.xlabel("Residuals")
plt.show();

In [None]:
# model evaluation: multi-step forecast over the whole test period
y_pred_test = model.predict(y_test.index.min(),
                            y_test.index.max())
y_pred_test_align = y_pred_test.loc[y_test.index]
print("aligned predictions are: ",y_pred_test_align)
print("test data are: ", y_test)
test_mae = mean_absolute_error(y_test, y_pred_test_align)
print(f"The mean absolute error is: {test_mae}")
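
For reference, the mean absolute error is just the average of the absolute differences between actual and predicted values. The sketch below computes it by hand on made-up numbers and compares it with a constant prediction, which is one common reference point for judging whether a model adds anything. The `mae` helper, the sample values, and `train_mean` are all assumptions for illustration.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: average of |actual - predicted|."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Made-up values, not results from the notebook
y_true = [10.0, 12.0, 9.0, 11.0]
y_hat = [11.0, 11.0, 10.0, 13.0]

model_mae = mae(y_true, y_hat)                      # (1 + 1 + 1 + 2) / 4

# Constant prediction using an assumed training mean of 10.0
train_mean = 10.0
baseline_mae = mae(y_true, [train_mean] * len(y_true))   # (0 + 2 + 1 + 1) / 4
```

In this toy case the constant prediction has the lower error, which is exactly the situation where a model would need rethinking.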

In [None]:
df_pred_test = pd.DataFrame(
    {"y_test": y_test, "y_pred": y_pred_test_align}, index=y_test.index
)
df_pred_test.head()

In [None]:
fig, ax = plt.subplots(figsize = (15,6))
df_pred_test["y_test"].plot(label = "y_test",ax = ax)
df_pred_test["y_pred"].plot(label = "y_pred",ax = ax)
plt.ylabel("P2 values")
plt.xlabel("time (t)")
plt.title("Test prediction line graphs")
plt.legend();

In [None]:
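%%capture
# walk-forward validation: refit on everything seen so far, forecast one step,
# then add the true observation for that step to the history
wfv_preds = []
history = y_train.copy()
for i in range(len(y_test)):
    # build model
    model = AutoReg(history, lags = 1).fit()
    # next prediction (one step ahead)
    next_pred = model.forecast()
    wfv_preds.append(next_pred)
    # append only the observation just forecast, not the whole test set
    history = pd.concat([history, y_test.iloc[[i]]])
# Series.append was removed in pandas 2.0, so the pieces are joined with pd.concat
y_pred_wfv = pd.concat(wfv_preds)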
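
The walk-forward loop is easier to follow on a stripped-down example. The sketch below uses plain NumPy and a simulated series: at each step it refits a one-lag model on the history so far, forecasts one value, and then reveals only that one true observation. `ar1_one_step` and `walk_forward` are hypothetical helpers, and the least-squares fit stands in for `AutoReg` purely for illustration.

```python
import numpy as np

def ar1_one_step(history):
    """Fit y[t] = c + phi * y[t-1] by least squares and forecast the next value."""
    h = np.asarray(history, dtype=float)
    phi, c = np.polyfit(h[:-1], h[1:], deg=1)   # returns slope first, then intercept
    return c + phi * h[-1]

def walk_forward(train, test):
    """One-step-ahead forecasts, growing the history by one observation per step."""
    history = list(train)
    preds = []
    for actual in test:
        preds.append(ar1_one_step(history))
        history.append(actual)                  # only the value just forecast
    return np.array(preds)

# Simulated AR(1) series with assumed parameters (not the Juja data)
rng = np.random.default_rng(1)
series = np.empty(300)
series[0] = 10.0
for t in range(1, len(series)):
    series[t] = 2.0 + 0.8 * series[t - 1] + rng.normal(scale=0.5)

train, test = series[:240], series[240:]
preds = walk_forward(train, test)
wfv_mae = float(np.mean(np.abs(test - preds)))
```

Appending the whole test set at every step would leak future values into the model, so each iteration adds exactly one observation.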

In [None]:
# collecting the walk-forward predictions next to the true test values,
# indexed by the test timestamps
df_pred = pd.DataFrame(
    {"y_test": y_test.values, "y_pred": y_pred_wfv.values},
    index = y_test.index,
)
df_pred

In [None]:
test_mae = mean_absolute_error(df_pred["y_test"],
                               df_pred["y_pred"])
print("Test MAE (walk forward validation):", round(test_mae, 2))

In [None]:
fig, ax = plt.subplots(figsize = (15,6))
df_pred["y_pred"].plot(ax = ax);