# Municipal waste generated by countries

### Content
<ol>
    <li>Introduction</li>
    <li>Data description and objectives</li>
    <li>Data manipulation and validation</li>
    <ol>
        <li> Data cleaning & shaping </li>
        <li>Missing values</li>
    </ol>
<li>Web-Scraping for missing values</li>
</ol>

## 1. Introduction

The world generates millions of tonnes of solid waste every year. Our daily lives involve many things that we eat, break, or use and then throw away once they are no longer needed or useful. In this way, humanity creates mountains of waste right next to the places where people live.

Municipal waste management has become increasingly complex over the last decade: the amount of municipal solid waste grows rapidly every year, while countries recycle only a small portion of it. The aim of this project is therefore to provide guidelines on the scope and coverage of municipal waste for municipal waste collection.

This investigation is based on statistical data and focuses on how much waste humanity generates.

(https://en.wikipedia.org/wiki/Municipal_solid_waste)

## 2. Data description and objectives

##### Description

The dataset chosen for this investigation contains country-level data on the municipal waste generated by each country. It was found on the World Bank website (https://datacatalog.worldbank.org/dataset/what-waste-global-database/resource/de74cc68-e796-4e42-9793-f140719c91ac#{}). There are 217 rows and 51 columns in total. However, only the columns that are necessary and meaningful for this project are used.
Out of these 51 columns, 11 were chosen.
They are:
* Country - name of the country
* GDP ($M) - GDP of the country (the column keeps the `$M` label, but after the conversion in section 3.1 its values are in billions of US dollars)
* Food_Organic (%) - percentage of food and organic waste
* Glass (%) - percentage of glass waste
* Metal (%) - percentage of metal waste
* Other (%) - percentage of waste that does not belong to any of the large categories
* Paper (%) - percentage of paper waste
* Plastic (%) - percentage of plastic waste
* Population (millions) - number of people living in the country, in millions
* Total Waste (tonnes) - mass of all categories of waste in tonnes
* Recycling (%) - percentage of waste that is recycled



##### Objectives

1. Observe how GDP per capita relates to the amount of waste generated per person
2. Using a world map, investigate which countries produce the most waste
3. Consider the relationship between a country's GDP and the percentage of waste it recycles
4. Investigate which category of waste (Food/Organic, Glass, Metal, Other, Paper, Plastic) is the largest
5. Investigate the factors that affect the production of municipal waste in countries

## 3. Data manipulation and validation

In [1]:
#import libraries
import pandas as pd
import requests
from matplotlib import pyplot as plt
import numpy as np
from bs4 import BeautifulSoup

In [2]:
#reading a csv file
waste = pd.read_csv('country_level_data_0.csv')
waste

Unnamed: 0,iso3c,region_id,country_name,income_id,gdp,composition_food_organic_waste_percent,composition_glass_percent,composition_metal_percent,composition_other_percent,composition_paper_cardboard_percent,...,waste_treatment_controlled_landfill_percent,waste_treatment_incineration_percent,waste_treatment_landfill_unspecified_percent,waste_treatment_open_dump_percent,waste_treatment_other_percent,waste_treatment_recycling_percent,waste_treatment_sanitary_landfill_landfill_gas_system_percent,waste_treatment_unaccounted_for_percent,waste_treatment_waterways_marine_percent,where_where_is_this_data_measured
0,ABW,LCN,Aruba,HIC,,,,,,,...,,,,,,11.0,,89.0,,
1,AFG,SAS,Afghanistan,LIC,2.141361e+10,,,,,,...,,,,,,,,,,Other
2,AGO,SSF,Angola,LMC,1.030423e+11,51.800000,6.700000,4.400000,11.500000,11.900000,...,,,,,,,,,,
3,ALB,ECS,Albania,UMC,1.347108e+10,51.400000,4.500000,4.800000,15.210000,9.900000,...,,,,,,,,,,Some disposal sites
4,AND,ECS,Andorra,HIC,3.319880e+09,31.200000,8.200000,2.600000,11.600000,35.100000,...,,52.1,,,,,,47.9,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
212,XKX,ECS,Kosovo,LMC,7.129272e+09,42.000000,6.000000,6.000000,20.000000,8.000000,...,66.43,,,33.57,,,,,,
213,YEM,MEA,"Yemen, Rep.",LIC,1.192703e+10,65.000000,1.000000,6.000000,6.000000,7.000000,...,12.00,,,25.00,,8.0,,47.0,8.0,Other
214,ZAF,SSF,South Africa,UMC,4.212087e+11,16.381655,5.200216,16.910461,45.020646,9.396918,...,72.00,,,,,28.0,,,,
215,ZMB,SSF,Zambia,LMC,2.703717e+10,,,,,,...,,,,,,,,,,


In [3]:
#shape of the file
waste.shape

(217, 51)

### 3.1 Data Cleaning and Shaping

#### Deleting unnecessary columns in the "waste" dataset

In [4]:
waste = waste.drop(['iso3c', 'income_id','waste_treatment_controlled_landfill_percent',
                   'waste_treatment_incineration_percent',
                   'waste_treatment_landfill_unspecified_percent',
                   'waste_treatment_open_dump_percent',
                   'waste_treatment_other_percent',
                   'waste_treatment_sanitary_landfill_landfill_gas_system_percent',
                   'waste_treatment_unaccounted_for_percent'], axis=1)

In [5]:
waste = waste.drop(['waste_collection_coverage_total_percent_of_waste', 
                   'waste_collection_coverage_urban_percent_of_households',
                   'waste_collection_coverage_urban_percent_of_geographic_area',
                   'waste_collection_coverage_urban_percent_of_population',
                   'waste_collection_coverage_urban_percent_of_waste',
                   'waste_treatment_anaerobic_digestion_percent',
                   'waste_treatment_compost_percent',
                   'waste_treatment_waterways_marine_percent',
                   'where_where_is_this_data_measured'], axis=1)



In [6]:
waste = waste.drop(['waste_collection_coverage_rural_percent_of_geographic_area', 
                   'waste_collection_coverage_rural_percent_of_households',
                   'waste_collection_coverage_rural_percent_of_population',
                   'waste_collection_coverage_rural_percent_of_waste',
                   'waste_collection_coverage_total_percent_of_geographic_area',
                   'waste_collection_coverage_total_percent_of_households',
                   'waste_collection_coverage_total_percent_of_population'], axis=1)

In [7]:
waste = waste.drop(['special_waste_hazardous_waste_tons_year', 
                   'special_waste_e_waste_tons_year',
                   'special_waste_construction_and_demolition_waste_tons_year',
                   'other_information_summary_of_key_solid_waste_information_made_available_to_the_public'], axis=1)



In [8]:
waste = waste.drop(['region_id', 
                   'other_information_information_system_for_solid_waste_management',
                   'other_information_national_agency_to_enforce_solid_waste_laws_and_regulations',
                   'other_information_national_law_governing_solid_waste_management_in_the_country',
                   'other_information_ppp_rules_and_regulations'], axis=1)

In [9]:
waste = waste.drop(['special_waste_agricultural_waste_tons_year', 
                   'special_waste_industrial_waste_tons_year',
                   'special_waste_medical_waste_tons_year'], axis=1)

In [10]:
waste = waste.drop(['composition_rubber_leather_percent', 
                   'composition_wood_percent',
                   'composition_yard_garden_green_waste_percent'], axis=1)

In [11]:
#showing a new shape of the table
waste.shape

(217, 11)
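The drop calls above remove 40 columns by name. As a possible alternative, the columns to keep can be listed instead. The sketch below uses a small toy frame, not the real file; the `keep` list is hypothetical and only uses column names that are visible in the output above.

```python
import pandas as pd

# Toy stand-in for the raw table; only a few of the 51 columns are reproduced.
raw = pd.DataFrame({
    "iso3c": ["AGO", "ALB"],
    "region_id": ["SSF", "ECS"],
    "country_name": ["Angola", "Albania"],
    "gdp": [1.030423e11, 1.347108e10],
    "composition_glass_percent": [6.7, 4.5],
    "waste_treatment_recycling_percent": [None, None],
})

# Hypothetical keep-list: name the columns to retain instead of the ones to drop.
keep = ["country_name", "gdp", "composition_glass_percent",
        "waste_treatment_recycling_percent"]
selected = raw[keep]
print(selected.shape)
```

With a keep-list, a column added to the source file later is ignored automatically, whereas a drop-list would let it through.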

#### Rename columns appropriately, with units in brackets

In [12]:
waste.columns = ["Country", "GDP ($M)", "Food_Organic (%)", "Glass (%)", "Metal (%)" , "Other (%)", "Paper (%)", "Plastic (%)",
                "Population (millions)", "Total Waste (tonnes)", "Recycling (%)"]
waste

Unnamed: 0,Country,GDP ($M),Food_Organic (%),Glass (%),Metal (%),Other (%),Paper (%),Plastic (%),Population (millions),Total Waste (tonnes),Recycling (%)
0,Aruba,,,,,,,,103187.00,8.813202e+04,11.0
1,Afghanistan,2.141361e+10,,,,,,,34656032.00,5.628525e+06,
2,Angola,1.030423e+11,51.800000,6.700000,4.400000,11.500000,11.900000,13.500000,25096150.00,4.213644e+06,
3,Albania,1.347108e+10,51.400000,4.500000,4.800000,15.210000,9.900000,9.600000,2880703.00,1.142964e+06,
4,Andorra,3.319880e+09,31.200000,8.200000,2.600000,11.600000,35.100000,11.300000,82431.00,4.300000e+04,
...,...,...,...,...,...,...,...,...,...,...,...
212,Kosovo,7.129272e+09,42.000000,6.000000,6.000000,20.000000,8.000000,11.000000,1801800.00,3.190000e+05,
213,"Yemen, Rep.",1.192703e+10,65.000000,1.000000,6.000000,6.000000,7.000000,10.000000,27584213.00,4.836820e+06,8.0
214,South Africa,4.212087e+11,16.381655,5.200216,16.910461,45.020646,9.396918,7.090104,51729345.36,1.845723e+07,28.0
215,Zambia,2.703717e+10,,,,,,,14264756.00,2.608268e+06,
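Assigning to `waste.columns` relies on the 11 labels being in exactly the same order as the columns. A minimal sketch of a name-based alternative, on a toy frame with three of the columns, is `rename` with a mapping; the `labels` dictionary here is an illustration, not the full mapping.

```python
import pandas as pd

toy = pd.DataFrame({"country_name": ["Angola"], "gdp": [1.030423e11],
                    "composition_glass_percent": [6.7]})

# Map each original name to its new label; unlisted columns are left unchanged.
labels = {"country_name": "Country", "gdp": "GDP ($M)",
          "composition_glass_percent": "Glass (%)"}
renamed = toy.rename(columns=labels)
print(list(renamed.columns))
```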


#### Display all values in the table in fixed-point rather than scientific notation

In [13]:
pd.set_option('display.float_format', lambda x: '%.2f' % x)
waste

Unnamed: 0,Country,GDP ($M),Food_Organic (%),Glass (%),Metal (%),Other (%),Paper (%),Plastic (%),Population (millions),Total Waste (tonnes),Recycling (%)
0,Aruba,,,,,,,,103187.00,88132.02,11.00
1,Afghanistan,21413614653.32,,,,,,,34656032.00,5628525.37,
2,Angola,103042328743.66,51.80,6.70,4.40,11.50,11.90,13.50,25096150.00,4213643.58,
3,Albania,13471082475.18,51.40,4.50,4.80,15.21,9.90,9.60,2880703.00,1142964.00,
4,Andorra,3319880351.13,31.20,8.20,2.60,11.60,35.10,11.30,82431.00,43000.00,
...,...,...,...,...,...,...,...,...,...,...,...
212,Kosovo,7129271791.57,42.00,6.00,6.00,20.00,8.00,11.00,1801800.00,319000.00,
213,"Yemen, Rep.",11927030389.45,65.00,1.00,6.00,6.00,7.00,10.00,27584213.00,4836820.00,8.00
214,South Africa,421208662669.69,16.38,5.20,16.91,45.02,9.40,7.09,51729345.36,18457232.00,28.00
215,Zambia,27037168289.17,,,,,,,14264756.00,2608268.00,
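Note that `display.float_format` only changes how floats are printed; the stored values keep their full precision. The sketch below illustrates this on a two-value toy series and uses `pd.option_context` so that the format applies only inside the `with` block (an assumption about what is wanted; `pd.set_option` as above changes it for the whole session).

```python
import pandas as pd

s = pd.Series([2.141361e10, 51.8])

# option_context applies the display format only inside the with-block.
with pd.option_context("display.float_format", "{:.2f}".format):
    shown = s.to_string()

print(shown)
# The stored value is unchanged; only its printed form differs.
stored = s.iloc[0]
```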


#### Convert GDP to billions of dollars (the column keeps its `$M` label)

In [14]:
waste['GDP ($M)'] = waste['GDP ($M)'].div(1000000000)
waste

Unnamed: 0,Country,GDP ($M),Food_Organic (%),Glass (%),Metal (%),Other (%),Paper (%),Plastic (%),Population (millions),Total Waste (tonnes),Recycling (%)
0,Aruba,,,,,,,,103187.00,88132.02,11.00
1,Afghanistan,21.41,,,,,,,34656032.00,5628525.37,
2,Angola,103.04,51.80,6.70,4.40,11.50,11.90,13.50,25096150.00,4213643.58,
3,Albania,13.47,51.40,4.50,4.80,15.21,9.90,9.60,2880703.00,1142964.00,
4,Andorra,3.32,31.20,8.20,2.60,11.60,35.10,11.30,82431.00,43000.00,
...,...,...,...,...,...,...,...,...,...,...,...
212,Kosovo,7.13,42.00,6.00,6.00,20.00,8.00,11.00,1801800.00,319000.00,
213,"Yemen, Rep.",11.93,65.00,1.00,6.00,6.00,7.00,10.00,27584213.00,4836820.00,8.00
214,South Africa,421.21,16.38,5.20,16.91,45.02,9.40,7.09,51729345.36,18457232.00,28.00
215,Zambia,27.04,,,,,,,14264756.00,2608268.00,
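To make the units explicit: dividing a value in US dollars by 1,000,000,000 gives billions, while dividing by 1,000,000 gives millions. The short sketch below compares the two on GDP figures taken from the table above.

```python
import pandas as pd

# GDP in US dollars, copied from the output of cell In [13].
gdp_usd = pd.Series([21413614653.32, 421208662669.69],
                    index=["Afghanistan", "South Africa"])

gdp_billions = gdp_usd.div(1_000_000_000)  # what cell In [14] computes
gdp_millions = gdp_usd.div(1_000_000)      # what a "$M" label would imply

print(gdp_billions.round(2).tolist())
print(gdp_millions.round(2).tolist())
```

The values shown in the notebook output (21.41 for Afghanistan, 421.21 for South Africa) match the billions version.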


#### Convert the population column to millions of people

In [15]:
waste['Population (millions)'] = waste['Population (millions)'].div(1000000)

In [16]:
#round total waste to one decimal place (the display format still shows two decimals)
waste['Total Waste (tonnes)'] =  waste['Total Waste (tonnes)'].round(1)
waste

Unnamed: 0,Country,GDP ($M),Food_Organic (%),Glass (%),Metal (%),Other (%),Paper (%),Plastic (%),Population (millions),Total Waste (tonnes),Recycling (%)
0,Aruba,,,,,,,,0.10,88132.00,11.00
1,Afghanistan,21.41,,,,,,,34.66,5628525.40,
2,Angola,103.04,51.80,6.70,4.40,11.50,11.90,13.50,25.10,4213643.60,
3,Albania,13.47,51.40,4.50,4.80,15.21,9.90,9.60,2.88,1142964.00,
4,Andorra,3.32,31.20,8.20,2.60,11.60,35.10,11.30,0.08,43000.00,
...,...,...,...,...,...,...,...,...,...,...,...
212,Kosovo,7.13,42.00,6.00,6.00,20.00,8.00,11.00,1.80,319000.00,
213,"Yemen, Rep.",11.93,65.00,1.00,6.00,6.00,7.00,10.00,27.58,4836820.00,8.00
214,South Africa,421.21,16.38,5.20,16.91,45.02,9.40,7.09,51.73,18457232.00,28.00
215,Zambia,27.04,,,,,,,14.26,2608268.00,
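With these units, the per-person quantities needed for objective 1 can be derived as follows. This is only a sketch on two rows copied from the table above; the names of the two derived columns are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "Country": ["Albania", "Andorra"],
    "GDP ($M)": [13.47, 3.32],               # billions of US dollars after In [14]
    "Population (millions)": [2.88, 0.08],
    "Total Waste (tonnes)": [1142964.0, 43000.0],
})

# Convert the population back to a head count before dividing.
people = toy["Population (millions)"] * 1_000_000
# Hypothetical column names for the derived per-person quantities.
toy["Waste per capita (tonnes)"] = toy["Total Waste (tonnes)"] / people
toy["GDP per capita ($)"] = toy["GDP ($M)"] * 1_000_000_000 / people
print(toy[["Country", "Waste per capita (tonnes)", "GDP per capita ($)"]])
```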


### 3.2 Missing values

#### Drop rows in which every value is NaN

In [17]:
nan_df = waste[waste.isna().any(axis=1)]

waste = waste.dropna(how='all')
waste.head(70)

Unnamed: 0,Country,GDP ($M),Food_Organic (%),Glass (%),Metal (%),Other (%),Paper (%),Plastic (%),Population (millions),Total Waste (tonnes),Recycling (%)
0,Aruba,,,,,,,,0.10,88132.00,11.00
1,Afghanistan,21.41,,,,,,,34.66,5628525.40,
2,Angola,103.04,51.80,6.70,4.40,11.50,11.90,13.50,25.10,4213643.60,
3,Albania,13.47,51.40,4.50,4.80,15.21,9.90,9.60,2.88,1142964.00,
4,Andorra,3.32,31.20,8.20,2.60,11.60,35.10,11.30,0.08,43000.00,
...,...,...,...,...,...,...,...,...,...,...,...
65,Faeroe Islands,,,,,,,,0.05,61000.00,67.00
66,"Micronesia, Fed. Sts.",0.30,23.83,7.07,16.73,9.32,13.30,26.17,0.10,26039.60,
67,Gabon,18.91,,,,,,,1.09,238102.30,
68,United Kingdom,2757.62,16.70,2.20,3.50,28.20,18.90,20.20,65.13,31567000.00,27.25
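The `how` argument decides which rows `dropna` removes, which is why rows such as Aruba are still present above. A small sketch on toy data (the third row is an artificial, fully empty record):

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({
    "Country": ["Aruba", "Angola", None],
    "GDP ($M)": [np.nan, 103.04, np.nan],
    "Glass (%)": [np.nan, 6.7, np.nan],
})

all_dropped = toy.dropna(how="all")   # removes only the fully empty third row
any_dropped = toy.dropna(how="any")   # removes every row with at least one NaN
print(len(all_dropped), len(any_dropped))
```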


#### Drop all records where the percentage of any waste category is NaN, or where the country is NaN

In [18]:
waste = waste[waste['Food_Organic (%)'].notna()
              & waste['Glass (%)'].notna()
              & waste['Metal (%)'].notna()
              & waste['Other (%)'].notna()
              & waste['Paper (%)'].notna()
              & waste['Plastic (%)'].notna()]
waste = waste[waste['Country'].notna()]
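The chained `notna()` conditions can also be written with `dropna(subset=...)`. The sketch below checks on toy data that both forms keep the same rows; the `required` list is shortened to three columns for illustration.

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({
    "Country": ["Aruba", "Angola", "Albania"],
    "Glass (%)": [np.nan, 6.7, 4.5],
    "Metal (%)": [np.nan, 4.4, np.nan],
    "Recycling (%)": [11.0, np.nan, np.nan],
})

required = ["Country", "Glass (%)", "Metal (%)"]
# Keep a row only if none of the required columns is NaN.
kept = toy.dropna(subset=required)
# The same filter written as a boolean mask, as in the cell above.
mask_version = toy[toy["Glass (%)"].notna()
                   & toy["Metal (%)"].notna()
                   & toy["Country"].notna()]
print(kept["Country"].tolist())
```

A NaN in a column outside `required` (here `Recycling (%)`) does not cause the row to be dropped.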
        

#### Reset the indices after deleting the rows with missing values above

In [19]:
waste = waste.reset_index(drop = True)
waste

Unnamed: 0,Country,GDP ($M),Food_Organic (%),Glass (%),Metal (%),Other (%),Paper (%),Plastic (%),Population (millions),Total Waste (tonnes),Recycling (%)
0,Angola,103.04,51.80,6.70,4.40,11.50,11.90,13.50,25.10,4213643.60,
1,Albania,13.47,51.40,4.50,4.80,15.21,9.90,9.60,2.88,1142964.00,
2,Andorra,3.32,31.20,8.20,2.60,11.60,35.10,11.30,0.08,43000.00,
3,United Arab Emirates,384.22,39.00,4.00,3.00,10.00,25.00,19.00,9.27,5413453.40,20.00
4,Argentina,447.52,38.74,3.16,1.84,15.36,13.96,14.61,42.98,17910550.00,6.00
...,...,...,...,...,...,...,...,...,...,...,...
155,Samoa,0.74,42.60,2.20,8.80,19.40,7.20,13.00,0.19,27399.10,36.00
156,Kosovo,7.13,42.00,6.00,6.00,20.00,8.00,11.00,1.80,319000.00,
157,"Yemen, Rep.",11.93,65.00,1.00,6.00,6.00,7.00,10.00,27.58,4836820.00,8.00
158,South Africa,421.21,16.38,5.20,16.91,45.02,9.40,7.09,51.73,18457232.00,28.00


#### Finding the indices where the GDP of a country is NaN

In [20]:
# find all missing values in the GDP column
# in order to get their indices for the web-scraping step
column = pd.isna(waste['GDP ($M)'])
gdp_nan = []
# start at row 0 so that no record is skipped
for row in range(len(waste)):
    if column[row]:
        gdp_nan.append(row)
print(gdp_nan)

# the result is a list of row indices where the country has no GDP value

[21, 35, 36, 56, 87, 104, 119, 132, 134, 135, 151, 152]


#### List of all countries in the `waste` table

In [21]:
# collect all country names in row order
countries = waste['Country'].tolist()
print(countries)

['Angola', 'Albania', 'Andorra', 'United Arab Emirates', 'Argentina', 'Armenia', 'American Samoa', 'Antigua and Barbuda', 'Australia', 'Austria', 'Azerbaijan', 'Burundi', 'Belgium', 'Benin', 'Burkina Faso', 'Bangladesh', 'Bulgaria', 'Bahrain', 'Bahamas, The', 'Belarus', 'Belize', 'Bermuda', 'Bolivia', 'Brazil', 'Barbados', 'Brunei Darussalam', 'Bhutan', 'Canada', 'Switzerland', 'Chile', 'China', 'Cameroon', 'Colombia', 'Comoros', 'Costa Rica', 'Cuba', 'Cayman Islands', 'Cyprus', 'Czech Republic', 'Germany', 'Dominica', 'Denmark', 'Dominican Republic', 'Algeria', 'Ecuador', 'Egypt, Arab Rep.', 'Spain', 'Estonia', 'Ethiopia', 'Finland', 'Fiji', 'France', 'Micronesia, Fed. Sts.', 'United Kingdom', 'Georgia', 'Ghana', 'Gibraltar', 'Guinea', 'Greece', 'Grenada', 'Greenland', 'Guam', 'Guyana', 'Hong Kong SAR, China', 'Honduras', 'Croatia', 'Haiti', 'Hungary', 'Indonesia', 'Ireland', 'Iran, Islamic Rep.', 'Iraq', 'Israel', 'Italy', 'Jamaica', 'Jordan', 'Japan', 'Kazakhstan', 'Kenya', 'Cambodi

#### Retrieve into a new list only those countries whose GDP value is NaN

In [22]:
# look up the country name for each row index with a missing GDP
countries_nan = [countries[row] for row in gdp_nan]
print(countries_nan)

# the result lists the countries that have no GDP value

['Bermuda', 'Cuba', 'Cayman Islands', 'Gibraltar', 'Liechtenstein', 'New Caledonia', 'French Polynesia', 'Sint Maarten (Dutch part)', 'Syrian Arab Republic', 'Turks and Caicos Islands', 'British Virgin Islands', 'Virgin Islands (U.S.)']
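The same two lists can be obtained without explicit loops by using a boolean mask. A minimal sketch on a four-row toy frame (the row positions are those of the toy frame, not of the real table):

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({"Country": ["Angola", "Bermuda", "Albania", "Cuba"],
                    "GDP ($M)": [103.04, np.nan, 13.47, np.nan]})

mask = toy["GDP ($M)"].isna()
gdp_nan = toy.index[mask].tolist()                  # row indices with missing GDP
countries_nan = toy.loc[mask, "Country"].tolist()   # matching country names
print(gdp_nan)
print(countries_nan)
```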


## 4. Web-Scraping for missing values

In this step, I scrape the Wikipedia page (https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)) to find the GDP of the countries with missing values (the `countries_nan` list built above). I then want to join the two tables on the countries that appear in both.
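One way such a join could look is a left merge followed by `fillna`. This is only a sketch under stated assumptions: the `lookup` frame is a hypothetical stand-in for the scraped table, its GDP value is made up, and it is assumed to be in millions of US dollars, so it is converted to billions to match the existing column.

```python
import pandas as pd
import numpy as np

# Toy rows: GDP here is in billions of US dollars, as produced by In [14].
waste_toy = pd.DataFrame({"Country": ["Angola", "Cuba"],
                          "GDP ($M)": [103.04, np.nan]})
# Hypothetical lookup table in millions of US dollars; the value is made up.
lookup = pd.DataFrame({"Country/Territory": ["Cuba"],
                       "GDP(US$million)": [50000]})

merged = waste_toy.merge(lookup, how="left",
                         left_on="Country", right_on="Country/Territory")
# Convert millions to billions before filling, so units match the existing column.
merged["GDP ($M)"] = merged["GDP ($M)"].fillna(merged["GDP(US$million)"] / 1000)
filled = merged[["Country", "GDP ($M)"]]
print(filled)
```

A left merge keeps every row of the waste table, so countries without a match simply stay NaN.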

In [49]:
#a status code of 200 means the request to this web page succeeded
wikiGDP = requests.get('https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)')
wikiGDP.status_code

200

In [1]:
#BeautifulSoup must be imported in the current session; without the import this cell raises a NameError
from bs4 import BeautifulSoup
#find table number 8 on this page (that has necessary data for this project)
soup = BeautifulSoup(wikiGDP.text, 'html.parser')
table = soup.find_all('table')[8]

In [77]:
#getting all information from this web-page
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)')[8]
df

Unnamed: 0,Rank,Country/Territory,GDP(US$million)
0,,World,87751541
1,1,United States,21427700
2,2,China[n 5],14342903
3,3,Japan,5081770
4,4,Germany,3845630
...,...,...,...
186,181,Palau (2018),284
187,182,Marshall Islands (2018),221
188,183,Kiribati,195
189,184,Nauru,118


In [78]:
#drop rows that contain a NaN; this removes the first record (World), which has no rank
df = df.dropna(how='any')

In [79]:
# set Rank as an index column
df = df.set_index('Rank')

In [82]:
df

Unnamed: 0_level_0,Country/Territory,GDP(US$million)
Rank,Unnamed: 1_level_1,Unnamed: 2_level_1
1,United States,21427700
2,China[n 5],14342903
3,Japan,5081770
4,Germany,3845630
5,India,2875142
...,...,...
181,Palau (2018),284
182,Marshall Islands (2018),221
183,Kiribati,195
184,Nauru,118


As can be seen, the dataframe has 189 rows, but the last Rank is 184. This means that the Rank values do not number the rows consecutively, so Rank is not a reliable row index.

In [84]:
#reset the index (dropping Rank) so that the rows are numbered consecutively
#the dataframe now has 189 rows, indexed 0 to 188
df = df.reset_index(drop = True)
df

Unnamed: 0,Country/Territory,GDP(US$million)
0,United States,21427700
1,China[n 5],14342903
2,Japan,5081770
3,Germany,3845630
4,India,2875142
...,...,...
185,Palau (2018),284
186,Marshall Islands (2018),221
187,Kiribati,195
188,Nauru,118
