# Data Cleaning and Preparation
#### Two of the datasets used in this process are taken from [Our World in Data](https://ourworldindata.org/).
The main article in use is [Plastic Pollution](https://ourworldindata.org/plastic-pollution). 
The goal of this study is to compute, for each country in 2010, the total plastic waste generated and the total plastic waste mismanaged.


In [1]:
import pandas as pd
import numpy as np

### The following dataset contains plastic waste per person in kg/day

In [2]:
df = pd.read_csv('datasets/per-capita-plastic-waste-vs-gdp-per-capita.csv')

In [3]:
# shortening the long column names
df.rename(columns={'GDP per capita, PPP (constant 2011 international $)': 'GDP per capita in PPP',
                   'Total population (Gapminder, HYDE & UN)': 'Total Population',
                   'Per capita plastic waste (kg/person/day)': 'Waste per person(kg/day)'}, inplace=True)

In [4]:
# removing rows where both the population and the GDP per capita are missing
incomplete_data_index = df[(df['Total Population'].isna()) & (df['GDP per capita in PPP'].isna())].index
df.drop(incomplete_data_index, inplace=True)
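
The mask above uses `&`, so a row is dropped only when both values are missing; a row missing just one of them is kept. A minimal sketch of the difference between `&` and `|`, using a made-up toy frame (the entities and numbers are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values, using the renamed column names from above
toy = pd.DataFrame({
    'Entity': ['A', 'B', 'C', 'D'],
    'Total Population': [1000.0, np.nan, 2000.0, np.nan],
    'GDP per capita in PPP': [5000.0, 7000.0, np.nan, np.nan],
})

both_missing = toy['Total Population'].isna() & toy['GDP per capita in PPP'].isna()
either_missing = toy['Total Population'].isna() | toy['GDP per capita in PPP'].isna()

kept_with_and = toy.loc[~both_missing, 'Entity'].tolist()    # drops only D
kept_with_or = toy.loc[~either_missing, 'Entity'].tolist()   # drops B, C and D
print(kept_with_and, kept_with_or)
```

This is why an entity with a population but no GDP figure can still appear in the cleaned data.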

In [5]:
# new dataframe that takes in the required data (by year 2010)
data = df[df['Year'] == 2010]
data = data.drop(columns='Continent')

In [6]:
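# retrieving continent names (from 2015 data), matched by entity rather than by row position
continent_by_entity = df[df['Year'] == 2015].set_index('Entity')['Continent']
data['Continent'] = data['Entity'].map(continent_by_entity)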

In [7]:
# dropping rows with missing Continent values using index
miss_index = data[data['Continent'].isna()].index
data.drop(miss_index, inplace=True)

In [8]:
# dropping rows with missing per person waste generation values
data = data[data['Waste per person(kg/day)'].notna()]

waste_gener = data.reset_index(drop=True)

In [9]:
waste_gener.head(3)

Unnamed: 0,Entity,Code,Year,Waste per person(kg/day),GDP per capita in PPP,Total Population,Continent
0,Albania,ALB,2010,0.069,9927.181841,2948000.0,Europe
1,Algeria,DZA,2010,0.144,12870.602699,35977000.0,Africa
2,Angola,AGO,2010,0.062,5897.682841,23356000.0,Africa


### The following dataset contains mismanaged plastic waste per person in kg/day

In [10]:
df2 = pd.read_csv('datasets/per-capita-mismanaged-plastic-waste-vs-gdp-per-capita.csv')

In [11]:
df2.rename(columns={'Per capita mismanaged plastic waste': 'Mismanaged waste per person(kg/day)',
                     'GDP per capita, PPP (constant 2011 international $)': 'GDP per capita in PPP',
                     'Total population (Gapminder, HYDE & UN)': 'Total Population'}, inplace=True)
df2.drop('Continent', axis=1, inplace=True)

In [12]:
# new dataframe for the required data
data2 = df2[df2['Year'] == 2010]

In [13]:
# dropping rows with missing mismanaged waste values
data2 = data2[data2['Mismanaged waste per person(kg/day)'].notna()]

waste_misma = data2.reset_index(drop=True)

In [14]:
# joining both dataframes into one (no `on` given, so the join uses every shared column)
plastic_waste = pd.merge(waste_gener, waste_misma, how='inner')
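
Since `pd.merge` is called without `on`, it joins on every column the two frames have in common, which here includes the GDP and population columns as well as `Entity`, `Code` and `Year`. If those numeric columns ever differed between the two files, the inner join would silently drop the row. A hedged sketch of an alternative that joins on identifying columns only; the toy frames and their values are made up for illustration:

```python
import pandas as pd

# Toy stand-ins (made-up values) for waste_gener and waste_misma
gener_toy = pd.DataFrame({
    'Entity': ['A', 'B'], 'Code': ['AAA', 'BBB'], 'Year': [2010, 2010],
    'Waste per person(kg/day)': [0.10, 0.20],
    'Total Population': [1000.0, 2000.0],
})
misma_toy = pd.DataFrame({
    'Entity': ['A', 'B'], 'Code': ['AAA', 'BBB'], 'Year': [2010, 2010],
    'Mismanaged waste per person(kg/day)': [0.05, 0.08],
    'Total Population': [1000.0, 2000.5],  # B differs slightly between the files
})

# Default behaviour: join on all shared columns, so B is lost
implicit = pd.merge(gener_toy, misma_toy, how='inner')

# Explicit keys: bring over only the new column and join on identifiers
keys = ['Entity', 'Code', 'Year']
explicit = pd.merge(gener_toy,
                    misma_toy[keys + ['Mismanaged waste per person(kg/day)']],
                    on=keys, how='inner', validate='one_to_one')
print(len(implicit), len(explicit))
```

The `validate='one_to_one'` argument makes pandas raise an error if a key appears more than once on either side.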

In [15]:
# rearranging columns in the dataframe
plastic_waste.columns.tolist()
col_list = ['Entity','Code','Year','Waste per person(kg/day)','Mismanaged waste per person(kg/day)',
           'GDP per capita in PPP','Total Population','Continent']
plastic_waste = plastic_waste[col_list]

# rounding the per-person values to 2 decimals
per_person_cols = ['Waste per person(kg/day)', 'Mismanaged waste per person(kg/day)']
plastic_waste[per_person_cols] = plastic_waste[per_person_cols].round(2)


#### Generating Total waste and Total mismanaged waste by country
Total waste is the product of the waste generated per person per day and the total population of the country.
Total mismanaged waste is the product of the mismanaged waste per person per day and the total population of the country.

Both products are then multiplied by 365 to convert the daily figure to an annual one.
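
As a worked check of this arithmetic, the sketch below reproduces the two totals for Albania by hand. It uses the per-person values after rounding to two decimals (0.07 and 0.03), since those are what the next cell multiplies; the variable names are illustrative only.

```python
# Albania, 2010: per-person values after rounding to 2 decimals
waste_per_person = 0.07        # kg/person/day
mismanaged_per_person = 0.03   # kg/person/day
population = 2948000

total_waste = round(waste_per_person * population * 365)
total_mismanaged = round(mismanaged_per_person * population * 365)

# Same calculation with the unrounded per-person value (0.069) for comparison
total_waste_unrounded = round(0.069 * population * 365)

print(total_waste, total_mismanaged, total_waste_unrounded)
```

Rounding the per-person rate before multiplying shifts Albania's annual total by roughly a million kilograms relative to the unrounded rate, so it may be preferable to round only for display.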

In [16]:
# rounding to the nearest kg before casting; int64 avoids overflow for large countries
plastic_waste['Total waste(kgs/year)'] = ((plastic_waste['Waste per person(kg/day)'] * 
                                    plastic_waste['Total Population']) * 365).round().astype('int64')
plastic_waste['Total waste mismanaged(kgs/year)'] = ((plastic_waste['Mismanaged waste per person(kg/day)'] * 
                                    plastic_waste['Total Population']) * 365).round().astype('int64')

In [17]:
plastic_waste.head()

Unnamed: 0,Entity,Code,Year,Waste per person(kg/day),Mismanaged waste per person(kg/day),GDP per capita in PPP,Total Population,Continent,Total waste(kgs/year),Total waste mismanaged(kgs/year)
0,Albania,ALB,2010,0.07,0.03,9927.181841,2948000.0,Europe,75321400,32280600
1,Algeria,DZA,2010,0.14,0.09,12870.602699,35977000.0,Africa,1838424700,1181844450
2,Angola,AGO,2010,0.06,0.04,5897.682841,23356000.0,Africa,511496400,340997600
3,Anguilla,AIA,2010,0.25,0.01,,13000.0,North America,1186250,47450
4,Antigua and Barbuda,ATG,2010,0.66,0.05,19212.720131,88000.0,North America,21199200,1606000


In [18]:
# creating a CSV file of the cleaned data
# note: to_csv overwrites any existing file at this path

plastic_waste.to_csv('datasets/plastic_waste.csv')
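
If the intent is to leave an existing file untouched, the write has to be guarded explicitly. A minimal sketch, assuming a hypothetical helper named `save_if_absent` (not part of pandas) and a made-up demo frame written to a temporary directory:

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_if_absent(frame, path):
    """Write frame to path only when no file exists there; return True if written."""
    path = Path(path)
    if path.exists():
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    frame.to_csv(path)
    return True


# Demo with made-up data in a temporary directory
demo = pd.DataFrame({'Entity': ['A'], 'Total waste(kgs/year)': [100]})
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / 'datasets' / 'plastic_waste.csv'
    first = save_if_absent(demo, target)    # file is created
    second = save_if_absent(demo, target)   # file exists, nothing is written
print(first, second)
```

In the notebook this would be called as `save_if_absent(plastic_waste, 'datasets/plastic_waste.csv')`.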