# Introduction
Plastic consumption has quadrupled over the last 30 years, and roughly 353 million tonnes of plastic waste were produced globally in 2019 [1]. While the general population can mitigate some mismanaged plastic waste by reducing, reusing, and recycling, far more plastic waste is produced at commercial scale. The purpose of this investigation is to use the Environmental, Social, and Governance (ESG) scores of companies, along with other factors such as maritime trading volumes and GDP per capita, to predict the amount of mismanaged plastic pollution in different countries. Models for the analysis were chosen with the [following](#model) in mind. One feature that might have a stronger influence is GDP, as countries with a higher GDP might manage plastic waste more carefully and so generate less pollution.

# Setting up
The code below contains the steps needed to set up our machine learning environment. Key features are described in the comments.

In [None]:
import warnings 
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns # data visualization
from scipy import stats # statistical analysis (e.g. Pearson's r correlation)
from sklearn.preprocessing import StandardScaler # data standardisation (chosen over MinMaxScaler due to presence of outliers)
from sklearn.model_selection import train_test_split # train test split

#importing models
from sklearn.neighbors import KNeighborsRegressor 
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV # cross-validation for hyperparameter tuning

from sklearn.metrics import mean_squared_error as MSE # MSE for error analysis

# suppressing warning messages (warnings is already imported at the top of this cell)
warnings.filterwarnings("ignore")

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Data Collection and Preparation
The data used included maritime trading data, country GDP data, plastic pollution data, and company ESG rating data. All of these datasets except the plastic pollution data provide insight into the commercial activity of each country. Using this data, we hope to find a way to predict pollution from commercial activity data.


In order to work with the data, we must first read it into dataframes and clean it for our use.



In [None]:
pd.set_option('display.max_rows', None)

# Setting dataset paths
m_trading_path = '../input/maritime-trading-volumes/maritime_volume.csv'
p_pollution_path = '../input/plastic-pollution/plastic-pollution.csv'
#esg_path = '../input/esg-scores-and-ratings/sustainability_scores.csv'
gdp_path = '../input/world-country-gdp-19602021/world_country_gdp_usd.csv'
code_path = '../input/iso-country-codes-global/wikipedia-iso-country-codes.csv'
region_path = '../input/country-mapping-iso-continent-region/continents2.csv'
pop_path = '../input/world-population-dataset/world_population.csv'


# Reading datasets
m_trading = pd.read_csv(m_trading_path) #Maritime trading data
p_pollution = pd.read_csv(p_pollution_path) #Plastic pollution by country
#esg = pd.read_csv(esg_path) #Environmental, social, governance score by company
gdp = pd.read_csv(gdp_path) #GDP data in USD
c_code = pd.read_csv(code_path) #Country code data
region = pd.read_csv(region_path) #Country region data
pop = pd.read_csv(pop_path) #Population data


Datasets are merged using country code and region as keys.
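
Because the default inner merge silently drops rows whose keys do not match, it can help to check key coverage before merging. The sketch below is only an illustration: the frames `left` and `right` and their values are hypothetical stand-ins, not the datasets loaded above. It uses the `indicator` option of `DataFrame.merge` to label where each key was found.

```python
import pandas as pd

# Hypothetical toy frames standing in for the GDP and pollution tables
left = pd.DataFrame({"Country Code": ["AAA", "BBB", "CCC"], "gdp": [1.0, 2.0, 3.0]})
right = pd.DataFrame({"Code": ["AAA", "CCC", "DDD"], "waste": [0.1, 0.3, 0.4]})

# An outer merge with indicator=True labels each row by where its key was found
check = left.merge(right, left_on="Country Code", right_on="Code",
                   how="outer", indicator=True)

# Keys present on only one side are the ones an inner merge would drop
unmatched = check[check["_merge"] != "both"]
n_matched = int((check["_merge"] == "both").sum())

print(unmatched[["Country Code", "Code", "_merge"]])
print("matched rows:", n_matched)
```

Running the same kind of check on the real tables would show how many countries are lost at each merge step.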

In [None]:
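# Subsetting and preparing data
gdp = gdp.iloc[15694:15960,:] # hard-coded row range; depends on the exact row order of the GDP file
m_trading = m_trading.iloc[:,[0,1,51]] # region name, statistic label, and the trading-volume column at position 51
m_trading = m_trading.set_axis(["Region","stat","trading"],axis = 1) 
m_trading = m_trading.iloc[1: , :] # dropping the first row
m_trading["Region"] = m_trading["Region"].str.strip() # getting rid of extra spaces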

# extracting total trading by region
m_trading_l = m_trading[m_trading["stat"] == "Total goods loaded"]
m_trading_d = m_trading[m_trading["stat"] == "Total goods discharged"]
m_trading_merged =  m_trading_l.merge(m_trading_d, left_on = "Region",right_on = "Region")
m_trading_merged["Total Trading"] = m_trading_merged["trading_x"]+m_trading_merged["trading_y"]
m_trading_merged = m_trading_merged[["Region","Total Trading"]]

pop = pop[["CCA3","2020 Population"]]
#esg["Country"] = esg["Country"].replace(dict(zip(c_code["Alpha-2 code"],c_code["Alpha-3 code"])))

#mean esg score for all companies by country
#esg_score = esg.groupby(["Country"]).mean().reset_index()

# Merging data
merged = gdp.merge(p_pollution,left_on = "Country Code",right_on = "Code").drop(["Entity","Code","Year"],axis = 1)
merged = merged.merge(region.iloc[:,[2,5,6]],left_on = "Country Code",right_on = "alpha-3")
merged = merged.merge(m_trading_merged, left_on = "sub-region",right_on = "Region").drop(["region","alpha-3","sub-region"],axis = 1)
merged = merged.merge(pop, left_on = "Country Code",right_on = "CCA3").drop("CCA3",axis = 1)
merged = merged.dropna()
merged["Total mismanaged plastic waste (kg per year)"] = merged["Mismanaged plastic waste to ocean per capita (kg per year)"] * merged["2020 Population"]
#merged = merged.merge(esg_score,left_on = "Country Code", right_on = "Country").drop("Country", axis = 1)

#m_trading_merged.head(300)
#p_pollution.head()
#esg_score.head()
#gdp.head()
merged

# Data Visualisation

In [None]:
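region = merged.groupby(by="Region").mean(numeric_only=True) # regional means of the numeric columns
# both axes use the regional means so the x and y values line up
sns.scatterplot(x=region["GDP_per_capita_USD"], y=region["Mismanaged plastic waste to ocean per capita (kg per year)"])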

In [None]:
region = merged.groupby(by="Region").mean(numeric_only=True) # regional means of the numeric columns
sns.scatterplot(x=region["2020 Population"], y=region["Mismanaged plastic waste to ocean per capita (kg per year)"])

### Splitting into Training and Testing Data (80% / 20%)


In [None]:
merged_y = merged["Total mismanaged plastic waste (kg per year)"] # getting dependent variable
merged = merged.drop(["Total mismanaged plastic waste (kg per year)","Country Name", "Country Code", "year", "Region","Mismanaged plastic waste to ocean per capita (kg per year)"], axis = 1) # dropping non-numerical columns, the dependent variable, and the per-capita column it was derived from
x_scaler = StandardScaler() # standardising features to zero mean and unit variance
y_scaler = StandardScaler() # separate scaler for the target so each can be inverted later
merged = x_scaler.fit_transform(merged)
merged_y = y_scaler.fit_transform(np.reshape(merged_y.values,(-1,1))).ravel() # 1-D target, as the regressors expect
X_train, X_test, y_train, y_test = train_test_split(merged,merged_y,test_size = 0.2, random_state = 42 )
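
The cell above scales the full dataset before splitting it, which lets the test rows influence the scaling statistics. A common alternative is to split first and fit the scalers on the training split only. The sketch below shows that ordering on synthetic data; `X_demo`, `y_demo`, and the other names are hypothetical and are not the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical, not the notebook's features)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y_demo = rng.normal(loc=1000.0, scale=200.0, size=100)

# Split first so the test rows never influence the scaling statistics
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Separate scalers for features and target, fitted on the training split only
x_scaler = StandardScaler().fit(X_tr)
y_scaler = StandardScaler().fit(y_tr.reshape(-1, 1))

# The test split is transformed with the training statistics
X_tr_s = x_scaler.transform(X_tr)
X_te_s = x_scaler.transform(X_te)
y_tr_s = y_scaler.transform(y_tr.reshape(-1, 1)).ravel()
y_te_s = y_scaler.transform(y_te.reshape(-1, 1)).ravel()

print(X_tr_s.mean(axis=0).round(6), X_tr_s.std(axis=0).round(6))
```

With this ordering the training features have exactly zero mean and unit variance, while the test features are only approximately standardised, which is expected.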

# Testing Different Models <a id="results"></a>
### Model selection <a id="model"></a>
Because the data visualisations showed no clear relationship, more complex models seemed likely to work better. There were also no clear splits in the data, making regression seem like a better choice than classification.
### Hyperparameter tuning
Hyperparameters were tuned using grid search cross-validation. By tuning the hyperparameters we were able to reduce the error of each model, therefore increasing performance.
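
As a minimal sketch of grid search cross-validation, the example below tunes a `KNeighborsRegressor` on synthetic data. The data, the grid values, and the choice of `scoring` are assumptions made for illustration. When `scoring` is not given, `GridSearchCV` ranks regressors by R², so the scoring is set explicitly here to select on RMSE, the metric reported in this notebook.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data (hypothetical stand-in for X_train / y_train)
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(120, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.1 * rng.normal(size=120)

# Small illustrative grid; the values are arbitrary examples
param_grid = {"n_neighbors": [1, 3, 5, 10, 20], "p": [1, 2]}

# The score is negated RMSE because GridSearchCV always maximises the score
search = GridSearchCV(KNeighborsRegressor(), param_grid, cv=5,
                      scoring="neg_root_mean_squared_error")
search.fit(X_demo, y_demo)

best_cv_rmse = -search.best_score_
print("best params:", search.best_params_)
print("cross-validated RMSE:", round(best_cv_rmse, 3))
```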
### Test results
Test-set RMSE on the standardised target, expressed as a percentage of one standard deviation:

1) Random Forest Regressor 54.8% error

2) KNeighbors Regressor 40.8% error

3) Gradient Boosting Regressor 36-38% error

4) Support Vector Regressor 9.09% error
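
One way to put these error figures in context is to compare them against a baseline that always predicts the training mean. On a target standardised to unit variance, such a baseline has an RMSE close to 1 (100% on the scale used above). The sketch below illustrates this on synthetic data; every name and value in it is hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Synthetic target, standardised to zero mean and unit variance
rng = np.random.default_rng(1)
y_all = rng.normal(loc=5.0, scale=3.0, size=200)
y_std = (y_all - y_all.mean()) / y_all.std()

# The baseline ignores its features, so a constant placeholder column is enough
X_placeholder = np.zeros((200, 1))

# Fit on the first 160 rows, evaluate on the remaining 40
baseline = DummyRegressor(strategy="mean").fit(X_placeholder[:160], y_std[:160])
baseline_pred = baseline.predict(X_placeholder[160:])
baseline_rmse = mean_squared_error(y_std[160:], baseline_pred) ** 0.5

print("baseline RMSE:", round(baseline_rmse, 3))
```

A model is only useful to the extent that its RMSE falls clearly below this baseline value.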



In [None]:
#Random Forest Regressor (using train data prediction to check if overfitting)
model = RandomForestRegressor(n_estimators = 2000, max_depth = 30, random_state = 18)
model.fit(X_train, y_train)
prediction = model.predict(X_test)
prediction2 = model.predict(X_train)
error = MSE(y_test,prediction)
error2 = MSE(y_train,prediction2)
rmse = error**.5
rmse2 = error2**.5
print("test RMSE:", rmse)
print("train RMSE:", rmse2)



In [None]:
# KNeighbors Regressor CV
leaf_size = list(range(1,50))
n_neighbors = list(range(1,30))
p=[1,2]
param = dict(leaf_size=leaf_size, n_neighbors=n_neighbors, p=p)
knr = KNeighborsRegressor()
clf = GridSearchCV(knr, param, cv=10)
best = clf.fit(X_train, y_train)
print('Best leaf_size:', best.best_estimator_.get_params()['leaf_size'])
print('Best p:', best.best_estimator_.get_params()['p'])
print('Best n_neighbors:', best.best_estimator_.get_params()['n_neighbors'])

In [None]:
# KNeighbors Regressor
model2 = KNeighborsRegressor(leaf_size = 1, p = 2, n_neighbors=27)
model2.fit(X_train, y_train)
pred = model2.predict(X_test)
error3 = MSE(y_test,pred)
rmse3 = error3**.5
print(rmse3)

In [None]:
#Gradient Boosting Regressor
errors = 0
for i in range(100):
    model3 = GradientBoostingRegressor(learning_rate= 0.02, max_depth= 6, n_estimators= 100, subsample= 0.1)
    model3.fit(X_train, y_train)
    predict = model3.predict(X_test)
    error = MSE(y_test,predict)
    rmse = error**.5
    errors += rmse
print(f"average rmse :{errors/100}") #Average error over 100 trials because the row subsampling (subsample < 1) is random.

In [None]:
#Gradient Boosting Regressor CV 
parameters = {'learning_rate': [0.01,0.02,0.03,0.04],
                  'subsample'    : [0.9, 0.5, 0.2, 0.1],
                  'n_estimators' : [100,500,1000, 1500],
                  'max_depth'    : [4,6,8,10]
              }
GBR = GradientBoostingRegressor()

grid_GBR = GridSearchCV(estimator=GBR, param_grid = parameters, cv = 2, n_jobs=-1)
grid_GBR.fit(X_train, y_train)
print(" Results from Grid Search " )
print("\n The best estimator across ALL searched params:\n",grid_GBR.best_estimator_)
print("\n The best score across ALL searched params:\n",grid_GBR.best_score_)
print("\n The best parameters across ALL searched params:\n",grid_GBR.best_params_)

In [None]:
#Support Vector Regressor 
model = SVR(C= 10, epsilon= 0.1, gamma= 1e-07, kernel= 'rbf')
model.fit(X_train, y_train)
prediction = model.predict(X_test)
error = MSE(y_test,prediction)
rmse = error**.5
print(rmse)


In [None]:
#Support Vector Regressor CV 
parameters = {'kernel': ('linear', 'rbf','poly'), 'C':[1.5, 10],'gamma': [1e-7, 1e-4],'epsilon':[0.1,0.2,0.5,0.3]}
SVR_Grid = GridSearchCV(SVR(), parameters) # fresh estimator, so this cell does not depend on `model` from another cell
SVR_Grid.fit(X_train,y_train)
SVR_Grid.best_params_

# Conclusion
The purpose of this investigation was to use different aspects of a country's economic activity to predict the amount of mismanaged plastic waste it produces. Through this investigation, we were able to predict a country's mismanaged plastic waste with a test RMSE of about 0.091 (9.1%). Because the target was standardised with a standard scaler to zero mean and unit variance (not rescaled to between 0 and 1), RMSE values are in units of the target's standard deviation, so the percentages reported here are percentages of one standard deviation rather than percentage errors in the original units. The final model chosen is a Support Vector Regression model. This model seems to perform reasonably well for the amount of data it is trained with, and much better than the other models that were [tested](#results). This could be due to the model being sophisticated enough to fit the data, therefore reducing bias and improving its predictions.
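
To express an RMSE measured on a standardised target in the original units, it can be multiplied by the standard deviation that the scaler divided by. The sketch below assumes a separately fitted target scaler; `y_raw` and its values are made up for illustration and are not the notebook's data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical target values in kg per year (made up for illustration)
y_raw = np.array([1.0e6, 2.5e6, 4.0e6, 8.0e6, 1.2e7]).reshape(-1, 1)
y_scaler = StandardScaler().fit(y_raw)

rmse_scaled = 0.0909  # an RMSE measured on the standardised target

# StandardScaler divides by the standard deviation, which it stores in scale_
rmse_kg = rmse_scaled * y_scaler.scale_[0]

print(f"RMSE in original units: {rmse_kg:,.0f} kg per year")
```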


   

# Future Work
Although this model is able to predict with a relatively low error of 9%, it can still be improved. One way to improve the model is to add more data. The number of variables used to train the model is quite small. It would also help to test for statistically significant variables before training the model. With a greater understanding of how the hyperparameters for Support Vector Regression work, each model could also be tuned in more detail in order to increase accuracy.
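
One simple way to screen variables before training is Pearson's r with its p-value, using `scipy.stats.pearsonr`. The sketch below runs on synthetic data, and the column names `feature_a`, `feature_b`, and `target` are hypothetical. Pearson's r only captures linear association, so this is a first check rather than a complete test of relevance.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic data: feature_a drives the target, feature_b is unrelated noise
rng = np.random.default_rng(0)
n = 60
df = pd.DataFrame({
    "feature_a": rng.normal(size=n),
    "feature_b": rng.normal(size=n),
})
df["target"] = 2.0 * df["feature_a"] + rng.normal(scale=0.5, size=n)

# Correlation coefficient and p-value of each candidate feature against the target
rows = []
for col in ["feature_a", "feature_b"]:
    r, p_value = stats.pearsonr(df[col], df["target"])
    rows.append({"feature": col, "r": r, "p_value": p_value})

screen = pd.DataFrame(rows).set_index("feature")
print(screen)
```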



# References
1) OECD. (n.d.). Plastic pollution is growing relentlessly as waste management and recycling fall short, says OECD. Retrieved November 15, 2022, from https://www.oecd.org/newsroom/plastic-pollution-is-growing-relentlessly-as-waste-management-and-recycling-fall-short.htm#:~:text=There%20is%20now%20an%20estimated

Python and Machine Learning Courses on [Datacamp.com](http://www.datacamp.com/)
- [Introduction to Python](https://app.datacamp.com/learn/courses/intro-to-python-for-data-science)
- [Intermediate Python](https://app.datacamp.com/learn/courses/intermediate-python)
- [Supervised Learning with scikit-learn](https://app.datacamp.com/learn/courses/supervised-learning-with-scikit-learn)
- [Machine Learning with Tree-Based Models in Python](https://app.datacamp.com/learn/courses/machine-learning-with-tree-based-models-in-python)



