In this project we connect to MongoDB and use time series data to build a linear model. Along the way we localize the timezone, use a rolling average to smooth spiky time series data, and decide how far back to look when predicting the future (the lag). We also calculate autocorrelation, that is, how a single variable correlates with itself over time. Last but not least, we apply a train-test split to time series data to build and test our model.

In [3]:
#  Import libraries

from pprint import PrettyPrinter
import pandas as pd
import pymongo
from pymongo import MongoClient
import matplotlib.pyplot as plt
import plotly.express as px
import pytz
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

Connect to MongoDB server

In [4]:
client = MongoClient(host="localhost", port=27017)
db = client["air-quality"]
nairobi = db["nairobi"]

Complete wrangle function

The wrangle function reads the results of the database query into the DataFrame df and applies the cleaning steps described in the sections that follow.

In [5]:
def wrangle(collection):
    results = collection.find(
        {"metadata.site": 29, "metadata.measurement": "P2"},
        projection={"P2": 1, "timestamp": 1, "_id": 0},
    )

    df = pd.DataFrame(results).set_index("timestamp")
    
    #Localize timezone
    df.index = df.index.tz_localize("UTC").tz_convert("Africa/Nairobi")
    
    #Remove outliers
    df = df[df["P2"] < 500]
    
    # Resample to 1H windows and forward-fill missing values
    # (.ffill() replaces fillna(method="ffill"), which is deprecated in recent pandas)
    df = df["P2"].resample("1H").mean().ffill().to_frame()
    
    #Add lag feature
    df["P2.L1"] = df["P2"].shift(1)
    
    #drop NaN rows
    df.dropna(inplace=True)
    
    return df
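
To see what the timezone step does in isolation, here is a minimal sketch on made-up data. The two timestamps and readings are hypothetical, and the sketch assumes (as the wrangle function does) that the stored timestamps are naive UTC values.

```python
import pandas as pd

# Hypothetical naive timestamps, assumed to have been recorded in UTC
idx = pd.to_datetime(["2018-09-01 00:00:00", "2018-09-01 00:30:00"])
toy = pd.DataFrame({"P2": [10.0, 12.0]}, index=idx)

# First declare the timestamps as UTC, then convert them to Nairobi local time
toy.index = toy.index.tz_localize("UTC").tz_convert("Africa/Nairobi")

print(toy.index[0])  # midnight UTC becomes 03:00 in Nairobi (UTC+3)
```

tz_localize only attaches a timezone to naive timestamps, while tz_convert shifts the clock time to another zone, which is why both calls are needed.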

Import data

In [None]:
df = wrangle(nairobi)
print(df.shape)
df.head()

![1.png](attachment:1.png)

Explore

Creating a boxplot of the "P2" readings in df.

In [None]:
fig, ax = plt.subplots(figsize=(15, 6))
df["P2"].plot(kind="box", vert=False, title="Distribution of PM2.5 Readings", ax=ax);

![boxplot%20using%20Pandas.png](attachment:boxplot%20using%20Pandas.png)

The boxplot shows that there are outliers in our data, so the wrangle function keeps only "P2" readings below 500 and drops the rest from the dataset.

Time series plot

This plot shows us how data moves over time.

In [None]:
fig, ax = plt.subplots(figsize=(15, 6))
df["P2"].plot(xlabel="Time", ylabel="PM2.5", title="PM2.5 Time Series", ax=ax);

![Line%20plot%20using%20Pandas.png](attachment:Line%20plot%20using%20Pandas.png)

In this plot we can see that there are gaps where data is missing. We deal with them in the wrangle function by resampling and forward-filling the missing values.

Resample Time series

In this project we predict PM 2.5. Another question we must answer is at what interval we want our predictions. We want hourly predictions, so we have to adjust the interval of our readings to match.

To do that, the wrangle function resamples df to provide the mean "P2" reading for each hour.
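
The resampling step can be sketched on a tiny made-up series. The three readings are hypothetical, and "60min" is used here as an hourly frequency string that works across pandas versions.

```python
import pandas as pd

# Hypothetical irregular readings: two in hour 0, none in hour 1, one in hour 2
idx = pd.to_datetime([
    "2018-09-01 00:05",
    "2018-09-01 00:35",
    "2018-09-01 02:10",
])
raw = pd.Series([10.0, 20.0, 30.0], index=idx, name="P2")

# Mean reading per hour; the empty hour becomes NaN and is then forward-filled
hourly = raw.resample("60min").mean().ffill()
print(hourly)
```

Hour 0 holds the mean of its two readings, hour 1 has no readings of its own and inherits the value from hour 0, and hour 2 keeps its single reading.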


Rolling average

Rolling average is a good way to smooth time series data.

Plot the rolling average of the "P2" readings in df, using a window size of 168 (the number of hours in a week).

In [None]:
fig, ax = plt.subplots(figsize=(15, 6))
df["P2"].rolling(168).mean().plot(ax=ax, ylabel="PM2.5", title="Weekly Rolling Average")

![Rolling%20Average%20plot.png](attachment:Rolling%20Average%20plot.png)
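
As a small illustration of how the rolling window behaves, the sketch below uses a window of 3 on five made-up values instead of 168 on the real data.

```python
import pandas as pd

# Hypothetical values; a window of 3 stands in for the weekly window of 168
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Each output value is the mean of the current value and the two before it
rolled = s.rolling(3).mean()
print(rolled)
```

The first two entries are NaN because a full window is not yet available. In the same way, a window of 168 leaves the first 167 hours of the weekly rolling average empty.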

Create lag feature

In this project we have only one column of time series data. How can we engineer a feature from it to predict PM 2.5?
The answer is a lag feature. The wrangle function creates a column called "P2.L1" that contains the mean "P2" reading from the previous hour.
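
A minimal sketch of the lag step on made-up hourly readings (the values and dates are hypothetical):

```python
import pandas as pd

# Hypothetical hourly readings
idx = pd.date_range("2018-09-01", periods=4, freq="60min")
toy = pd.DataFrame({"P2": [5.0, 7.0, 6.0, 9.0]}, index=idx)

# shift(1) moves every value down one row, so each row sees the previous hour
toy["P2.L1"] = toy["P2"].shift(1)

# The first row has no previous hour, so its lag is NaN and the row is dropped
toy = toy.dropna()
print(toy)
```

Each remaining row now pairs the current reading with the reading one hour earlier, which is the shape a supervised model needs.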

Correlation matrix

In this project our target is the PM 2.5 reading at a given timestamp, and our feature is the reading from the previous hour. Measuring how a variable correlates with its own earlier values is called autocorrelation.

Creating a correlation matrix for df

In [None]:
df.corr()

![2.png](attachment:2.png)
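
The off-diagonal entry of this matrix is the lag-1 autocorrelation of "P2". As a rough check on made-up data (the readings are hypothetical), pandas also offers Series.autocorr, which should give the same number as correlating the series with its shifted copy:

```python
import pandas as pd

# Hypothetical hourly readings
s = pd.Series([10.0, 12.0, 15.0, 14.0, 18.0, 20.0, 19.0, 23.0], name="P2")

# Built-in lag-1 autocorrelation
lag1 = s.autocorr(lag=1)

# The same quantity computed by hand: correlate the series with its lag
manual = s.corr(s.shift(1))

print(lag1, manual)
```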

Autocorrelation plot

Creating a scatter plot that shows PM 2.5 mean reading for each hour as a function of the mean reading from the previous hour. In other words, "P2.L1" should be on the x-axis, and "P2" should be on the y-axis.

In [None]:
fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(x= df["P2.L1"], y=df["P2"])
ax.plot([0,120], [0,120], linestyle="--", color="orange")
plt.xlabel("P2.L1")
plt.ylabel("P2")
plt.title("PM2.5 Autocorrelation");

![Autocorrelation%20plot%20using%20Matplotlib.png](attachment:Autocorrelation%20plot%20using%20Matplotlib.png)

According to the plot, what happened an hour ago has predictive power for what is happening now.

Vertical split

Split the DataFrame df into the feature matrix X and the target vector y. The target is "P2".

In [None]:
target = "P2"
y = df[target]
X = df.drop(columns=target)

Train-test split from time series

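Because this is time series data, we cannot shuffle the rows at random for the train-test split. Instead we use a cutoff, so that the model is trained on earlier observations and tested on later ones.

Split X and y into training and test sets. The first 80% of the data goes into the training set.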

In [None]:
cutoff = int(len(X)* 0.8)

X_train, y_train = X.iloc[:cutoff] , y.iloc[:cutoff]
X_test, y_test = X.iloc[cutoff:] , y.iloc[cutoff:]
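
The sketch below repeats the cutoff split on ten made-up hourly rows and checks that every training timestamp comes before every test timestamp. The names X_toy, y_toy, X_tr and X_te are hypothetical and chosen so they do not clash with the notebook's own variables.

```python
import pandas as pd

# Hypothetical hourly data with ten rows
idx = pd.date_range("2018-09-01", periods=10, freq="60min")
X_toy = pd.DataFrame({"P2.L1": range(10)}, index=idx)
y_toy = pd.Series(range(1, 11), index=idx, name="P2")

# First 80% of rows for training, the rest for testing
cutoff_toy = int(len(X_toy) * 0.8)
X_tr, y_tr = X_toy.iloc[:cutoff_toy], y_toy.iloc[:cutoff_toy]
X_te, y_te = X_toy.iloc[cutoff_toy:], y_toy.iloc[cutoff_toy:]

# No test observation is older than any training observation
print(X_tr.index.max() < X_te.index.min())
```

A positional split like this relies on the index being sorted by time, which resampling in the wrangle function guarantees.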

Build model

Baseline model
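Calculating the baseline mean absolute error for our model. The baseline always predicts the mean "P2" reading of the training set, so it gives us a score that any useful model has to beat.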


In [None]:
y_pred_baseline = [y_train.mean()] * len(y_train)
mae_baseline = mean_absolute_error(y_train, y_pred_baseline)

print("Mean P2 Reading:", round(y_train.mean(), 2))
print("Baseline MAE:", round(mae_baseline, 2))

![3.png](attachment:3.png)
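
To make the baseline number less abstract, here is the same calculation done by hand on three made-up readings. The manual formula is meant to mirror what mean_absolute_error computes; the values are hypothetical.

```python
import pandas as pd

# Hypothetical target values
y_toy = pd.Series([10.0, 20.0, 30.0], name="P2")

# The baseline predicts the mean (20.0) for every observation
baseline_pred = pd.Series([y_toy.mean()] * len(y_toy))

# Mean absolute error: average of |actual - predicted| = (10 + 0 + 10) / 3
mae_manual = (y_toy - baseline_pred).abs().mean()
print(round(mae_manual, 2))
```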

Iterate

Instantiate a LinearRegression model named model, and fit it to our training data.

In [None]:
# Build Model
model = LinearRegression()
# Fit model
model.fit(X_train, y_train)


![4.png](attachment:4.png)

Evaluate model

To see how our model performs, we calculate the training and test mean absolute error.

In [None]:
training_mae = mean_absolute_error(y_train, model.predict(X_train))
test_mae = mean_absolute_error(y_test, model.predict(X_test))
print("Training MAE:", round(training_mae, 2))
print("Test MAE:", round(test_mae, 2))

![5.png](attachment:5.png)

As you can see, our model beats the baseline, so we are happy with this model.

Communicate results

Extracting the intercept and coefficient from our model.

In [None]:
intercept = model.intercept_.round(2)
coefficient = model.coef_.round(2)[0]
print(f"P2 = {intercept} + ({coefficient} * P2.L1)")

![6.png](attachment:6.png)
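
An equation of this form can be applied by hand to make a prediction. The sketch below uses made-up data that follows an exact line, and it swaps in np.polyfit for LinearRegression to stay dependency-light; with a single feature both fit an ordinary least squares line. The numbers are hypothetical and are not the coefficients of the model above.

```python
import numpy as np

# Hypothetical data: the current reading is exactly 2.0 + 0.5 * previous reading
lag = np.array([10.0, 20.0, 30.0, 40.0])
now = 2.0 + 0.5 * lag

# Degree-1 least squares fit; polyfit returns the slope first, then the intercept
coefficient, intercept = np.polyfit(lag, now, deg=1)

# Apply the fitted equation to a new previous-hour reading
new_lag = 50.0
prediction = intercept + coefficient * new_lag
print(f"P2 = {intercept:.2f} + ({coefficient:.2f} * P2.L1)")
print(round(prediction, 2))
```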

Prediction DataFrame for our plot

In [None]:
df_pred_test = pd.DataFrame(
    {
      "y_test": y_test,
      "y_pred": model.predict(X_test)
    }
)
df_pred_test.head()

![7.png](attachment:7.png)

Plot time series prediction

In [None]:
fig = px.line(df_pred_test, labels={"value":"P2"})
fig.show()


![Line%20plot%20time%20series%20prediction.png](attachment:Line%20plot%20time%20series%20prediction.png)

The red line shows the predicted values and the blue line shows the true values. The model appears to do a good job of predicting the data, so looking back one hour helps us predict the current reading.