### Linear Regression: Predicting Ozone (O3) from a Single Feature (Supervised Machine Learning)
The same prediction model is fitted to two differently cleansed datasets:
1. Rows with zero or null values removed (`data_drop.csv`)
2. Zero values replaced with the median of their respective columns (`data_med.csv`)
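
As a rough illustration of the two cleansing strategies listed above, the sketch below applies both to a small made-up DataFrame. The toy values, the names `raw_df`, `data_drop` and `data_med`, and the choice to compute each median from the non-zero values are assumptions for illustration only; the actual cleansing code is not part of this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; zeros and NaN stand in for missing readings.
raw_df = pd.DataFrame(
    {
        "TEMP": [1.0, 0.0, 3.0, 5.0, np.nan],
        "O3": [10.0, 20.0, 0.0, 40.0, 50.0],
    }
)

# Strategy 1: drop every row that contains a zero or a null value.
has_zero_or_null = (raw_df == 0).any(axis=1) | raw_df.isna().any(axis=1)
data_drop = raw_df[~has_zero_or_null].reset_index(drop=True)

# Strategy 2: replace zeros with the column median (nulls are left untouched).
# Assumption: the median is computed from the non-zero values of each column.
data_med = raw_df.copy()
for col in data_med.columns:
    nonzero = data_med.loc[data_med[col] != 0, col]
    data_med[col] = data_med[col].replace(0, nonzero.median())

print(data_drop)
print(data_med)
```

Strategy 1 keeps fewer rows but only observed values; strategy 2 keeps every row but fills the gaps with one repeated value per column.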

Note: All outputs have been cleared except for the Data 1 and Data 2 evaluation cells.
The outputs, particularly the plots, made the notebook file very large.

In [None]:
# Import Modules and Packages. 
import numpy as np
import pandas as pd
import hvplot.pandas
from pathlib import Path
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from warnings import simplefilter
simplefilter(action='ignore', category=FutureWarning)

In [None]:
# Read in cleansed data file 'data_drop.csv' from AWS S3 Bucket
url="https://project-4-group-6-air-quality.s3.us-east-2.amazonaws.com/data_drop.csv"
air_data_df = pd.read_csv(
    url,
    sep=',',
    encoding='utf-8',
)
air_data_df.head()

In [None]:
air_data_df.info()

In [None]:
# Drop the 'wd' string column and the date columns, and save the result to a new DataFrame
air_df = air_data_df.drop(['year', 'month', 'day', 'hour', 'wd'], axis=1)

# Rename the 'Unnamed: 0' column; rename() returns a new DataFrame, so assign it back
air_df = air_df.rename(columns={"Unnamed: 0": "Number"})

In [None]:
# Create a scatter plot
air_plot = air_df.hvplot.scatter(
    x="TEMP",
    y="O3",
    title="Expected Ozone Measures based on Temperature"
)
air_plot

Linear Regression Model to predict Ozone based on Temperature. 

In [None]:
# Create the X set by using the `reshape` function to format the TEMP data as a single column array.
X = air_df["TEMP"].values.reshape(-1, 1)
# Display sample data
X[:5]
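
scikit-learn estimators expect the feature matrix as a 2-D array of shape `(n_samples, n_features)`, which is why the 1-D TEMP values are reshaped into a single column. A minimal sketch with made-up values (the names `temps` and `X_demo` are hypothetical):

```python
import numpy as np

# Hypothetical temperature values, not taken from the dataset.
temps = np.array([-0.5, 1.2, 3.4, 7.8])

# -1 lets NumPy infer the number of rows; 1 fixes a single feature column.
X_demo = temps.reshape(-1, 1)

print(temps.shape)   # (4,)
print(X_demo.shape)  # (4, 1)
```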

In [None]:
# Create an array for the dependent variable y with the O3 data
y = air_df["O3"]

Linear Regression Model with scikit-learn

In [None]:
# Create a model with scikit-learn
model = LinearRegression()

In [None]:
# Fit the model to the data
model.fit(X, y)

In [None]:
# Display the slope
print(f"Model's slope: {model.coef_}")

In [None]:
# Display the y-intercept
print(f"Model's y-intercept: {model.intercept_}")

In [None]:
# Display the model's best fit line formula
print(f"Model's formula: y = {model.intercept_} + {model.coef_[0]}X")
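
For a single feature, the ordinary least squares line has a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means. `LinearRegression` fits by ordinary least squares, so its `coef_[0]` and `intercept_` should agree with these values up to floating-point error. The sketch below uses made-up data that lies exactly on a known line; `x`, `y_toy`, `slope` and `intercept` are hypothetical names.

```python
import numpy as np

# Hypothetical toy data lying exactly on y = 2 + 3x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_toy = 2.0 + 3.0 * x

# Closed-form simple linear regression.
x_centered = x - x.mean()
slope = np.sum(x_centered * (y_toy - y_toy.mean())) / np.sum(x_centered ** 2)
intercept = y_toy.mean() - slope * x.mean()

print(f"y = {intercept:.2f} + {slope:.2f}X")  # y = 2.00 + 3.00X
```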

Plot of Best Fit Line for the Prediction Model

In [None]:
# Make predictions using the X set
predicted_y_values = model.predict(X)

In [None]:
# Create a copy of the original data
air_ozone_predicted = air_df.copy()

# Add a column with the predicted ozone values
air_ozone_predicted["Ozone_predicted"] = predicted_y_values

# Display sample data
air_ozone_predicted.head()

In [None]:
# Create a line plot of temperature versus the predicted ozone values
best_fit_line = air_ozone_predicted.hvplot.line(
    x = "TEMP",
    y = "Ozone_predicted",
    color = "orange"
)
best_fit_line

Plot of Initial Scatter Plot and Best Fit Line Plot

In [None]:
# Superpose the original data and the best fit line
air_plot * best_fit_line

Manual Predictions

In [None]:
# Display the formula to predict ozone at a TEMP value of 100
print(f"Model's formula: y = {model.intercept_} + {model.coef_[0]} * 100")

# Predict ozone at a TEMP value of 100
y_100 = model.intercept_ + model.coef_[0] * 100

# Display the prediction
print(f"Predicted ozone (O3) at TEMP = 100: {y_100:.2f}")

Predictions Using the `predict` Method

In [None]:
# Create an array to predict ozone levels for -0.5, -1.0, -1.5, -2.0, and -2.5 Temps
X_temps = np.array([-0.5, -1.0, -1.5, -2.0, -2.5])

# Format the array as a one-column array
X_temps = X_temps.reshape(-1,1)

# Display sample data
X_temps

In [None]:
# Predict ozone for temp values
predicted_ozone_temp = model.predict(X_temps)

In [None]:
# Create a DataFrame for the predicted ozone levels
df_predicted_ozone_temp = pd.DataFrame(
    {
        "temp": X_temps.reshape(1, -1)[0],
        "predicted_ozone": predicted_ozone_temp
    }
)

# Display data
df_predicted_ozone_temp

Data 1. Evaluating the Linear Regression Model 

In [41]:
# Compute metrics for the linear regression model: score, r2, mse, rmse, std
score = model.score(X, y, sample_weight=None)
r2 = r2_score(y, predicted_y_values)
mse = mean_squared_error(y, predicted_y_values)
rmse = np.sqrt(mse)
std = np.std(y)

# Print relevant metrics.
print(f"The score is {score}.")
print(f"The r2 is {r2}.")
print(f"The mean squared error is {mse}.")
print(f"The root mean squared error is {rmse}.")
print(f"The standard deviation is {std}.")

The score is 0.34434693063035227.
The r2 is 0.34434693063035227.
The mean squared error is 2043.6992288229917.
The root mean squared error is 45.20729176607455.
The standard deviation is 55.83049022103166.
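
The printed metrics are linked: `np.std` returns the population standard deviation by default, so R² equals 1 - MSE / std². This is also why `score` and `r2` are identical, since `model.score` returns R² for a regressor. The short check below recomputes R² from the MSE and standard deviation printed above; the variable names are hypothetical.

```python
# Values copied from the Data 1 evaluation output above.
mse = 2043.6992288229917
std = 55.83049022103166
r2_reported = 0.34434693063035227

# R^2 = 1 - (mean squared error) / (variance of y)
r2_from_mse = 1 - mse / std**2

# RMSE relative to the spread of y; a value near 1 means little improvement
# over always predicting the mean.
rmse_over_std = mse**0.5 / std

print(f"R2 recomputed from MSE and std: {r2_from_mse:.6f}")
print(f"RMSE / std: {rmse_over_std:.4f}")
```

Read this way, an R² of about 0.34 means temperature alone explains roughly a third of the variance in O3.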


### Process repeated using cleansed dataset 2, 'data_med.csv'

In [None]:
# Read in cleansed data file 'data_med.csv' from AWS S3 Bucket

url="https://project-4-group-6-air-quality.s3.us-east-2.amazonaws.com/data_med.csv"
air2_data2_df = pd.read_csv(
    url,
    sep=',',
    encoding='utf-8',
)
# Display sample data
air2_data2_df.head()

In [None]:
# Drop the 'wd' string column and the date columns, and save the result to a new DataFrame
air_data2_df = air2_data2_df.drop(['year', 'month', 'day', 'hour', 'wd'], axis=1)

# Rename the 'Unnamed: 0' column; rename() returns a new DataFrame, so assign it back
air_data2_df = air_data2_df.rename(columns={"Unnamed: 0": "Number"})

Data 2. Scatter Plot with Ozone (O3) and Temperature (TEMP)

In [None]:
# Create a scatter plot
air_plot = air_data2_df.hvplot.scatter(
    x="TEMP",
    y="O3",
    title="Expected Ozone Measures based on Temperature with Dataset2"
)
air_plot

In [None]:
# Create the X set by using the `reshape` function to format the TEMP data as a single column array.
X = air_data2_df["TEMP"].values.reshape(-1, 1)

# Display sample data
X[:5]

In [None]:
# Create an array for the dependent variable y with the O3 data
y = air_data2_df["O3"]

Data 2. Linear Regression Model with scikit-learn

In [None]:
# Create a model with scikit-learn
model = LinearRegression()

In [None]:
# Fit the model to the data
model.fit(X, y)

In [None]:
# Display the slope
print(f"Model's slope: {model.coef_}")

In [None]:
# Display the y-intercept
print(f"Model's y-intercept: {model.intercept_}")

In [None]:
# Display the model's best fit line formula
print(f"Model's formula: y = {model.intercept_} + {model.coef_[0]}X")

Data 2. Plot of Best Fit Line for the Prediction Model

In [None]:
# Make predictions using the X set
predicted_y_values = model.predict(X)

In [None]:
# Create a copy of the original data
ozone_data2_predicted = air_data2_df.copy()

# Add a column with the predicted ozone values
ozone_data2_predicted["Ozone_data2_predicted"] = predicted_y_values

# Display sample data
ozone_data2_predicted.head()

In [None]:
# Create a line plot of temperature versus the predicted ozone values
best_fit_line = ozone_data2_predicted.hvplot.line(
    x = "TEMP",
    y = "Ozone_data2_predicted",
    color = "orange"
)
best_fit_line

Data 2. Plot of Initial Scatter Plot and Best Fit Line Plot

In [None]:
# Superpose the original data and the best fit line
air_plot * best_fit_line

Data 2. Manual Predictions

In [None]:
# Display the formula to predict ozone at a TEMP value of 100
print(f"Model's formula: y = {model.intercept_} + {model.coef_[0]} * 100")

# Predict ozone at a TEMP value of 100
y_100 = model.intercept_ + model.coef_[0] * 100

# Display the prediction
print(f"Predicted ozone (O3) at TEMP = 100 with data2: {y_100:.2f}")

Data 2. Predictions Using the `predict` Method

In [None]:
# Create an array to predict ozone levels for -0.5, -1.0, -1.5, -2.0, and -2.5 Temps
X_temps = np.array([-0.5, -1.0, -1.5, -2.0, -2.5])

# Format the array as a one-column array
X_temps = X_temps.reshape(-1,1)

# Display sample data
X_temps

In [None]:
# Predict ozone for temp values
predicted_ozone_data2 = model.predict(X_temps)

In [None]:
# Create a DataFrame for the predicted ozone levels
df_predicted_ozone_data2 = pd.DataFrame(
    {
        "temp": X_temps.reshape(1, -1)[0],
        # Use the predictions from the dataset 2 model, not the dataset 1 results
        "predicted_ozone": predicted_ozone_data2
    }
)

# Display data
df_predicted_ozone_data2

Data 2. Assessing the Linear Regression Model

In [42]:
# Compute metrics for the linear regression model: score, r2, mse, rmse, std
score = model.score(X, y, sample_weight=None)
r2 = r2_score(y, predicted_y_values)
mse = mean_squared_error(y, predicted_y_values)
rmse = np.sqrt(mse)
std = np.std(y)

# Print relevant metrics.
print(f"The score is {score}.")
print(f"The r2 is {r2}.")
print(f"The mean squared error is {mse}.")
print(f"The root mean squared error is {rmse}.")
print(f"The standard deviation is {std}.")

The score is 0.34434693063035227.
The r2 is 0.34434693063035227.
The mean squared error is 2043.6992288229917.
The root mean squared error is 45.20729176607455.
The standard deviation is 55.83049022103166.


### Conclusion
* Dataset 1 (`data_drop.csv`) was chosen for further modelling.
* Dataset 2, in which zero values are replaced with the median of the respective column, adds a lot of bias to the result.
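
One way median replacement can distort a dataset is that every replaced cell receives the same value, which piles observations onto a single point and shrinks the column's spread. The sketch below illustrates that effect on synthetic data only; the gamma-distributed readings, the 30% missing rate, and all variable names are assumptions, and it does not reproduce the project's datasets or establish that this is the bias observed here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical positive readings; about 30% are flagged as recorded zeros.
true_values = rng.gamma(shape=2.0, scale=30.0, size=1000)
is_zero = rng.random(1000) < 0.3

# Strategy 1: drop the flagged readings.
dropped = true_values[~is_zero]

# Strategy 2: replace the flagged readings with the median of the rest.
imputed = true_values.copy()
imputed[is_zero] = np.median(dropped)

print(f"std after dropping: {np.std(dropped):.2f}")
print(f"std after median replacement: {np.std(imputed):.2f}")
print(f"share of rows equal to the median: {np.mean(imputed == np.median(dropped)):.2%}")
```

The imputed column always has the smaller standard deviation in this setup, because the added points all sit at the median of the retained values.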