![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

<a href="https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fcallysto%2Fhackathon&branch=master&subPath=ColonizingMars/ChallengeTemplates/challenge-option-2-how-could-we-colonize-mars.ipynb&depth=1" target="_parent"><img src="https://raw.githubusercontent.com/callysto/curriculum-notebooks/master/open-in-callysto-button.svg?sanitize=true" width="123" height="24" alt="Open in Callysto"/></a>

# Data Scientist Challenge: Humanity must build a new home on Mars.


You’re a data scientist on a team of newly-arrived humans. While you were on Earth, you figured out how you could make the planet habitable. From growing food to clothing needs, you need to start building the framework for sustaining life on the red planet. 

In this notebook, we have decided to focus on **Plastics** as a key resource that needs to be produced and properly managed.

![plastic image](https://github.com/callysto/hackathon/blob/sustainable-society/SustainabilityOnMars/ChallengeExamples/plastic-example-image.jpg?raw=true)

### Section I: Problem background

Plastics are useful for generating fibers (textiles, rope), sheets (bags, wraps, windows), and 3D structures. On Earth, one million plastic drinking bottles are purchased every minute, and up to 5 trillion single-use plastic bags are used worldwide every year. In total, half of all plastic produced is designed to be used only once — and then thrown away (source: UN Environment). 

As data scientists, we can help researchers understand where plastic is found and look for innovative ways to recycle plastics and reduce plastic pollution. There is no such thing as a sustainable product in unsustainable packaging.

In this notebook, we want to help scientists reduce what is not needed, so we want to answer the question: **where does the majority of plastic waste come from?** We can answer this in many ways, e.g. by sector or by country. We will also explore the extent of plastic waste coming from lollipops 🍭🍭🍭.

### Section II: The data you used

The data in this notebook was downloaded from Citizen Science Cloud: https://hub.cscloud.host/app/ec-2020-plastics. 

It comes from **three** different projects. 

1. **Marine Debris Monitoring and Assessment Project**: a National Oceanic and Atmospheric Administration-coordinated citizen science initiative that engages volunteers to survey and record the amount and types of marine debris on shorelines. 

2. **Ocean Conservancy TIDES Database**: a public, global ocean trash data set, all collected by volunteers.

3. **European Environment Agency's Marine LitterWatch**: data gathered by Marine LitterWatch during both clean-ups and monitoring events.

Below are some of the fields available for each record, which we will use to answer the questions we are interested in:

**CountryName_FromSource** - country name\
**TotalClassifiedItems_EC2020** - sum of the columns from SUM_Hard_PlasticBeverageBottle to SUM_OtherPlasticDebris\
**PCT_PlasticAndFoam** - percentage of plastic and foam among the total classified items\
**PCT_Glass_Rubber_Lumber_Metal** - percentage of glass, rubber, lumber and metal among the total classified items\
**CONTINENT** - continent\
**SUM_Hard_PlasticBeverageBottle** - number of plastic beverage bottles collected
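
The definition of the total column can be illustrated with a small sketch. The values below are made up and only three item columns are used; the column names are the only part taken from the dataset, so treat this as an illustration of the row-wise sum rather than a check on the real data.

```python
import pandas as pd

# Toy data with made-up counts; only the column names mirror the real dataset.
toy = pd.DataFrame({
    "SUM_Hard_PlasticBeverageBottle": [3, 0, 5],
    "SUM_Soft_Bag": [2, 1, 0],
    "SUM_OtherPlasticDebris": [1, 4, 2],
})

# Select every column from the first item column to the last one (label slice),
# then add the counts up row by row.
item_columns = toy.loc[:, "SUM_Hard_PlasticBeverageBottle":"SUM_OtherPlasticDebris"]
toy["TotalClassifiedItems_EC2020"] = item_columns.sum(axis=1)

print(toy["TotalClassifiedItems_EC2020"].tolist())  # [6, 5, 7]
```

A label slice with `.loc` includes both end columns, which matches the "from ... to ..." wording of the field description.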


In [1]:
# importing libraries
import pandas as pd
import matplotlib.pyplot as plt
import cufflinks as cf

In [2]:
# reading the data 
# we have csv file stored in the cloud
url = "https://raw.githubusercontent.com/callysto/hackathon/sustainable-society/SustainabilityOnMars/ChallengeExamples/plastics-data-2015-2018.csv"

# read csv file from url and save it as dataframe
plastics = pd.read_csv(url, index_col=0, low_memory=False)

# print first 5 rows
plastics.head()

| | UniqueID | SourceID | LocationFreqID | Dataset | Organization | CountryName_FromSource | Longitude1 | Latitude1 | Longitude2 | Latitude2 | ... | SUM_Foam_OtherPlasticDebris | SUM_OtherPlasticDebris | NAME | COUNTRY | ISO_CODE | ISO_CC | ISO_SUB | COUNTRYAFF | CONTINENT | LAND_TYPE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | MDP-349 | 40-3153 | Blackpoint Beach (Lon -123.4355847 Lat 38.6905... | NOAA MDMAP Accumulation Survey | California Coast National Monument Task Force | United States | -123.435585 | 38.690549 | -123.432939 | 38.689234 | ... | 13 | 1.0 | California | United States | USCA | US | CA | United States | North America | Primary land |
| 2 | MDP-351 | 37-3164 | Dune Drift Beach (Lon -123.4844062 Lat 38.7287... | NOAA MDMAP Accumulation Survey | California Coast National Monument Task Force | United States | -123.484406 | 38.728707 | -123.487692 | 38.733347 | ... | 9 | 0.0 | California | United States | USCA | US | CA | United States | North America | Primary land |
| 3 | MDP-354 | 59-3175 | Ohlson Beach (Lon -123.4564 Lat 38.7132) | NOAA MDMAP Accumulation Survey | California Coast National Monument Task Force | United States | -123.4564 | 38.7132 | -123.4551 | 38.7106 | ... | 0 | 0.0 | California | United States | USCA | US | CA | United States | North America | Primary land |
| 4 | MDP-358 | 41-3191 | Walk On Beach (Lon -123.490915 Lat 38.735105) | NOAA MDMAP Accumulation Survey | California Coast National Monument Task Force | United States | -123.490915 | 38.735105 | -123.489614 | 38.731897 | ... | 0 | 0.0 | California | United States | USCA | US | CA | United States | North America | Primary land |
| 5 | MDP-360 | 11-3195 | Rocky Point (Lon -124.4621 Lat 42.7149) | NOAA MDMAP Accumulation Survey | Redfish Rocks | United States | -124.4621 | 42.7149 | -124.462 | 42.7139 | ... | 850 | 0.0 | Oregon | United States | USOR | US | OR | United States | North America | Primary land |


In [3]:
# how many rows and columns does the dataframe have?
plastics.shape

(54388, 51)

### Section III: Data Analysis and Visualization

We want to explore the major sources of plastic waste.

Let's look at what plastic items were collected and counted. We do this by checking the column names of the 'plastics' table. 

In [7]:
# print column names
plastics.columns

Index(['UniqueID', 'SourceID', 'LocationFreqID', 'Dataset', 'Organization',
       'CountryName_FromSource', 'Longitude1', 'Latitude1', 'Longitude2',
       'Latitude2', 'TotalWidth_m', 'TotalLength_m', 'ShorelineName',
       'EventType', 'TotalVolunteers', 'MonthYear', 'Year', 'MonthNum',
       'Month', 'TotalClassifiedItems_EC2020', 'PCT_PlasticAndFoam',
       'PCT_Glass_Rubber_Lumber_Metal', 'SUM_Hard_PlasticBeverageBottle',
       'SUM_Hard_OtherPlasticBottle', 'SUM_HardOrSoft_PlasticBottleCap',
       'SUM_PlasticOrFoamFoodContainer', 'SUM_Hard_BucketOrCrate',
       'SUM_Hard_Lighter', 'SUM_OtherHardPlastic',
       'SUM_PlasticOrFoamPlatesBowlsCup', 'SUM_HardSoft_PersonalCareProduc',
       'SUM_HardSoftLollipopStick_EarBu', 'SUM_Soft_Bag',
       'SUM_Soft_WrapperOrLabel', 'SUM_Soft_Straw', 'SUM_Soft_OtherPlastic',
       'SUM_Soft_CigaretteButts', 'SUM_Soft_StringRingRibbon', 'Fishing_Net',
       'SUM_FishingLineLureRope', 'Fishing_BuoysAndFloats',
       'SUM_Foam_OtherPl

Columns starting with the prefix '**SUM_**' each hold the count of one type of plastic item that was collected, for example beverage bottles, bottle caps, food containers, buckets, crates, lighters, plates, bowls, cups, personal care products, wrappers, labels, straws, cigarette butts, strings, rings, ribbons, and fishing lines, lures and ropes. Fishing nets, buoys and floats are counted in the 'Fishing_Net' and 'Fishing_BuoysAndFloats' columns.

The column '**TotalClassifiedItems_EC2020**' contains the total number of items that were collected and then classified.
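
As a minimal sketch of how the item-count columns could be picked out by their prefix, the example below uses a toy table with made-up values; only the column names are borrowed from the dataset.

```python
import pandas as pd

# Toy frame with made-up values; column names are borrowed from the dataset.
toy = pd.DataFrame({
    "CountryName_FromSource": ["Ghana", "Canada"],
    "SUM_Soft_Straw": [12, 3],
    "SUM_Soft_Bag": [7, 9],
    "Fishing_Net": [1, 0],
})

# Keep only the columns whose name starts with "SUM_".
sum_columns = [name for name in toy.columns if name.startswith("SUM_")]

print(sum_columns)  # ['SUM_Soft_Straw', 'SUM_Soft_Bag']
```

Note that a prefix filter like this leaves out 'Fishing_Net', which holds an item count but does not carry the 'SUM_' prefix, so such columns would need to be added by name.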

Let's first get the highest number of plastic items that were collected. We do this by finding the largest number in the column 'TotalClassifiedItems_EC2020' using the max() function. 

In [7]:
# find largest number of plastic items collected and classified
print("The maximum number of plastic items collected and classified is - " , plastics['TotalClassifiedItems_EC2020'].max(), 'item')

The maximum number of plastic items collected and classified is -  26420613 item


In [9]:
# let's look at the data associated with this number
plastics.iloc[plastics['TotalClassifiedItems_EC2020'].argmax()]

UniqueID                                                                   TID-51979
SourceID                                                                       66013
LocationFreqID                     TIDES (Lon -0.18885935584342 Lat 5.6160064497862)
Dataset                                               Oecan Conservancy TIDES Report
Organization                                             SNFYVF- Accra, Volta Region
CountryName_FromSource                                                         Ghana
Longitude1                                                                  -0.18886
Latitude1                                                                    5.61601
Longitude2                                                                       NaN
Latitude2                                                                        NaN
TotalWidth_m                                                                     NaN
TotalLength_m                                                    

Which country had the maximum number of plastic waste items collected and classified?

#### Observations
We can see here that the record with the highest number of plastic waste items collected comes from Ghana, in Africa.

Does this mean that Ghana has the highest plastic pollution? Think about it, then check the conclusion section for some insights.

Let's break down this number and look at how much each type of plastic item contributed to Ghana having the highest number of collected plastic items. We will draw a bar plot to visualise the collected plastic items and their counts.

In [10]:
# let's first select the columns and rows we are interested in
# select the columns containing the required data
maximum_plastics_waste = plastics.loc[50185, 'SUM_Hard_PlasticBeverageBottle':'SUM_OtherPlasticDebris']
maximum_plastics_waste = maximum_plastics_waste.to_frame()

# rename column name to reflect the data within
maximum_plastics_waste.columns = ['plastic_waste_sources']
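
The row label 50185 in the cell above is typed in by hand. A possible alternative, sketched here on a toy table with made-up totals and arbitrary index labels, is to let pandas look the label up with `idxmax()`, which returns the index label of the largest value.

```python
import pandas as pd

# Toy data with made-up totals; the index labels 10, 20, 30 are arbitrary.
toy = pd.DataFrame(
    {
        "CountryName_FromSource": ["Canada", "Ghana", "Spain"],
        "TotalClassifiedItems_EC2020": [120, 950, 430],
    },
    index=[10, 20, 30],
)

# idxmax returns the index label of the largest value,
# which can be passed straight to .loc.
top_label = toy["TotalClassifiedItems_EC2020"].idxmax()
top_row = toy.loc[top_label]

print(top_label, top_row["CountryName_FromSource"])  # 20 Ghana
```

Unlike `argmax()`, which gives a position for use with `.iloc`, `idxmax()` gives a label for use with `.loc`, so the same value can be reused when slicing columns by name.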

In [10]:
maximum_plastics_waste

| | plastic_waste_sources |
|---|---|
| SUM_Hard_PlasticBeverageBottle | 4976.0 |
| SUM_Hard_OtherPlasticBottle | 3514.0 |
| SUM_HardOrSoft_PlasticBottleCap | 26771.0 |
| SUM_PlasticOrFoamFoodContainer | 4775.0 |
| SUM_Hard_BucketOrCrate | 0.0 |
| SUM_Hard_Lighter | 763.0 |
| SUM_OtherHardPlastic | 409.0 |
| SUM_PlasticOrFoamPlatesBowlsCup | 247666.0 |
| SUM_HardSoft_PersonalCareProduc | 3230.0 |
| SUM_HardSoftLollipopStick_EarBu | 0.0 |


We will create a bar plot to visualise the number of each collected plastic item in Ghana.

In [24]:
# maximum_plastics_waste.plot(kind = 'bar')
cf.go_offline()
maximum_plastics_waste.iplot(kind='bar', title='What are the major sources of plastic waste in Ghana?', xTitle='type of plastic item', yTitle='number of plastic items collected')

#### Observations
We can see here that the biggest sources of plastic waste in Ghana are other plastic debris and foam debris. Cigarette butts and straws also contribute to plastic waste in Ghana.

We also observe that the values in 'SUM_OtherPlasticDebris' and 'SUM_Foam_OtherPlasticDebris' are significantly larger than the rest, and that these categories are so general that we cannot tell which types of plastic item they contain.

What happens when we recreate the bar plot without these two categories?

Let's redraw the bar plot without 'SUM_OtherPlasticDebris' and 'SUM_Foam_OtherPlasticDebris'.

In [39]:
# first we drop the two rows we are not interested in (they were columns in 'plastics')
# errors='ignore' lets this cell be re-run after the rows are already gone
maximum_plastics_waste = maximum_plastics_waste.drop(['SUM_OtherPlasticDebris', 'SUM_Foam_OtherPlasticDebris'], axis = 0, errors='ignore')

In [26]:
# then we redraw the bar plot
maximum_plastics_waste.iplot(kind='bar', title='What are the major sources of plastic waste in Ghana?', xTitle='type of plastic item', yTitle='number of plastic items collected')

#### Observations
We observe that soft cigarette butts, soft straws, plates, bowls and cups are the major contributors to plastic waste in Ghana, Africa.

### How about lollipop sticks? Do we have to give up lollipops on Mars?

We noted that lollipop sticks and ear buds were counted and classified among the plastic items that contribute to plastic waste.

We would like to answer the question: how much do lollipops actually contribute to plastic waste?

To answer this question, let's look at the distribution of lollipop stick waste across continents.

We will draw a line plot of the mean count of lollipop sticks and ear buds in every continent using the columns 'SUM_HardSoftLollipopStick_EarBu' and 'CONTINENT'.

In [37]:
# plot the mean count of lollipop sticks and ear buds per continent
col = "SUM_HardSoftLollipopStick_EarBu"
x = plastics.groupby("CONTINENT")[col].mean()

x.iplot(kind='line', title='How much do lollipops actually contribute to plastic waste?', xTitle='continent', yTitle='mean number of lollipop sticks and ear buds')

#### Observations
We observe that lollipop sticks (and ear buds) show up as a source of plastic waste in Europe, while the other continents do not seem to like lollipops as much.

Does this make sense? Does this mean that people on every continent but Europe do not eat any lollipops?

Of course NOT! People all over the globe eat lollipops.
Remember that this is an aggregated dataset from three different sources, so let's look at the counts of lollipop sticks and ear buds from each of the three datasets.

In [38]:
# plot mean counts of lollipop sticks and ear buds by data source
col = "SUM_HardSoftLollipopStick_EarBu"
x = plastics.groupby("Dataset")[col].mean()

x.iplot(kind='bar', title='How many lollipop sticks and ear buds were collected in each dataset?', yTitle='mean lollipop stick and ear bud count', xTitle='dataset')
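
A mean close to zero does not say whether a dataset found no lollipop sticks or simply never recorded them. The sketch below, on a toy table with made-up values and hypothetical dataset names "A", "B" and "C", shows two ways this could be checked. It assumes that a dataset which did not track the item stores a missing value; if it stores zeros instead, only the second check is informative.

```python
import pandas as pd

# Toy data with made-up values; None marks surveys that did not record the item.
# The dataset names "A", "B", "C" are placeholders, not the real source names.
toy = pd.DataFrame({
    "Dataset": ["A", "A", "B", "B", "C"],
    "SUM_HardSoftLollipopStick_EarBu": [None, None, 4.0, 0.0, None],
})
col = "SUM_HardSoftLollipopStick_EarBu"

# count() ignores missing values, so it shows how many records
# in each dataset actually carry a number for this item.
recorded = toy.groupby("Dataset")[col].count()

# Number of records per dataset with at least one item found.
with_items = (toy[col] > 0).groupby(toy["Dataset"]).sum()

print(recorded.to_dict())    # {'A': 0, 'B': 2, 'C': 0}
print(with_items.to_dict())  # {'A': 0, 'B': 1, 'C': 0}
```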

### Section IV: Conclusion 

We used the data to find that the highest number of collected plastic waste items was recorded in Ghana. While this could imply that Ghana has the most plastic waste of all countries, it could also be a limitation of this particular dataset: the number of volunteers could have been higher there, or there could be more plastic waste on particular dates (e.g. holidays).

We have also seen that lollipop sticks and ear buds were only collected and counted by the 'European Environment Agency's Marine LitterWatch' project, which falsely implied that lollipop sticks are a source of plastic waste only in Europe.
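
One way to account for the number of volunteers, sketched here with made-up values, is to divide the item total by the 'TotalVolunteers' column before comparing records. This is an illustration of the idea and not an analysis of the real data.

```python
import pandas as pd

# Toy data with made-up values; column names are taken from the dataset.
toy = pd.DataFrame({
    "CountryName_FromSource": ["Ghana", "Canada", "Spain"],
    "TotalClassifiedItems_EC2020": [9000, 300, 1200],
    "TotalVolunteers": [3000, 50, 300],
})

# Items collected per volunteer: a rough way to compare events of different sizes.
toy["items_per_volunteer"] = (
    toy["TotalClassifiedItems_EC2020"] / toy["TotalVolunteers"]
)

ranking = toy.sort_values("items_per_volunteer", ascending=False)
print(ranking["CountryName_FromSource"].tolist())  # ['Canada', 'Spain', 'Ghana']
```

In this toy example the record with the largest raw total ends up last once the event size is taken into account. With real data, records where the volunteer count is zero or missing would need to be handled before dividing.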

### What's next?

Several other questions can be answered with this dataset, such as how the sources of plastic waste vary across years, months or regions.

📌 In this **example notebook** you have seen how data can give us insights and lead to questions and solutions that are data-driven. Now go to the **hackathon template** and start solving your own challenge for sustaining life on Mars!
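
As a starting point for the question about years, a yearly total could be computed with `groupby`. The sketch below uses made-up counts; 'Year' and 'SUM_Soft_Straw' are column names from the dataset.

```python
import pandas as pd

# Toy data with made-up counts; column names are taken from the dataset.
toy = pd.DataFrame({
    "Year": [2015, 2015, 2016, 2016, 2017],
    "SUM_Soft_Straw": [10, 20, 5, 15, 40],
})

# Add up the straw counts of all records that share the same year.
yearly_total = toy.groupby("Year")["SUM_Soft_Straw"].sum()

print(yearly_total.to_dict())  # {2015: 30, 2016: 20, 2017: 40}
```

Replacing "Year" with 'MonthNum' or 'CONTINENT' would give the same kind of summary by month or by region.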