# Modelling with Multiple Linear Regression

In this final notebook we look to answer the two key questions set out in our problem statement:
1. What is the extent of impact that each pollutant gas has on bee colony numbers?
2. Which pollutant gases should be prioritised for removal in order to maximise bee colony numbers?

We answer these two questions inferentially through the coefficients of a multiple linear regression model. We measure the model's performance with two metrics: root mean squared error (RMSE, the square root of the average squared difference between the values predicted by our model and the actual values) and the R-squared score (the proportion of the variation in our target variable, bee colony numbers, that our model accounts for).
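To make the two metrics concrete, here is a minimal sketch that computes them by hand on a small set of made-up numbers. The helper names `rmse` and `r_squared` and the toy values are illustrative assumptions, not part of this project; the modelling cells below use scikit-learn's `root_mean_squared_error` and `r2_score` instead.

```python
import math

def rmse(y_true, y_pred):
    """Square root of the mean squared difference between actual and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def r_squared(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares around the mean)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Toy values chosen for easy arithmetic (hypothetical, not project data)
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]

print(f'RMSE: {rmse(y_true, y_pred):.4f}')            # RMSE: 0.7071
print(f'R-squared: {r_squared(y_true, y_pred):.4f}')  # R-squared: 0.9000
```

RMSE is expressed in the same units as the target (here, bee colonies), whereas R-squared is unitless, so an R-squared of 0.9 would mean that 90% of the variation in the target is accounted for.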

---

# Imports 

In [48]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import root_mean_squared_error, r2_score

In [51]:
data = pd.read_csv('./data/modelling_dataframe.csv')

# Defining pollutants
pollutants = ['Days CO', 'Days NO2', 'Days Ozone', 'Days PM2.5', 'Days PM10']

In [53]:
# Features and target variable
X = data[pollutants]
y = data['Bee Colonies']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Model evaluation
rmse = root_mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Root Mean Squared Error: {rmse}')
print(f'R-squared: {r2}')

# Building a table of coefficients, one row per pollutant
coefficients = pd.DataFrame({'Pollutant': pollutants, 'Coefficient': model.coef_})
coefficients = coefficients.sort_values(by='Coefficient', ascending=False)
print("Pollutants impact on Bee Colonies (sorted by impact):")
print(coefficients)

# Reporting the three pollutants with the largest coefficients
print("Based on the model, the pollutants affecting bee colony populations the most (in order of impact) are:")
print(coefficients['Pollutant'].iloc[:3].values)

Root Mean Squared Error: 316651.77884869935
R-squared: 0.01718034841386784
Pollutants impact on Bee Colonies (sorted by impact):
    Pollutant  Coefficient
3  Days PM2.5   415.753639
2  Days Ozone   355.479014
4   Days PM10   249.542403
1    Days NO2    59.869507
0     Days CO    19.553487
Based on the model, the pollutants affecting bee colony populations the most (in order of impact) are:
['Days PM2.5' 'Days Ozone' 'Days PM10']


---
Although the R-squared score is extremely low (our model explains only about 1.7% of the variation in bee colony numbers), we must take into account that this project investigates air quality alone as a factor contributing to bee populations, [disregarding other factors that may affect bee populations to a greater extent](https://www.europarl.europa.eu/topics/en/article/20191129STO67758/what-s-behind-the-decline-in-bees-and-other-pollinators-infographic). The model also works from aggregate measures for the entire US across the period for which we have recorded data, so we focus above all on making inferences from the coefficients in order to address our problem statement.

Based on the model, the pollutants affecting bee colony populations the most (in order of impact) are 'Days PM2.5', 'Days Ozone' and 'Days PM10'.
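In a multiple linear regression, each coefficient is usually read as the expected change in the target for a one-unit increase in that feature, with the other features held fixed. The sketch below ranks the coefficients reported above by absolute size and also reports their sign. The helper `rank_by_magnitude` is a hypothetical name introduced here for illustration; the values are copied from the model output.

```python
# Coefficients copied from the model output above
coefficients = {
    'Days PM2.5': 415.753639,
    'Days Ozone': 355.479014,
    'Days PM10': 249.542403,
    'Days NO2': 59.869507,
    'Days CO': 19.553487,
}

def rank_by_magnitude(coefs):
    """Return (name, coefficient, direction) tuples, largest absolute value first."""
    ranked = sorted(coefs.items(), key=lambda item: abs(item[1]), reverse=True)
    result = []
    for name, value in ranked:
        if value > 0:
            direction = 'positive'
        elif value < 0:
            direction = 'negative'
        else:
            direction = 'zero'
        result.append((name, value, direction))
    return result

for name, value, direction in rank_by_magnitude(coefficients):
    print(f'{name}: {value:.2f} ({direction})')
```

Ranking by absolute value gives the same order as the sorted table here because every coefficient is positive; the two orderings would differ if any coefficient were negative.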