## Data Explanation

For the Endangered Species Dataset, we have data from the IUCN Red List (Table 7) from the years 2007 to 2021 with these columns:

*   Scientific Name
*   Common Name
*   Species Type
*   IUCN Red List Category
*   Reason For Change
*   Year

#### Explanation of the categories and reasons for change: <br>

**IUCN Red List Categories:** EX - Extinct, EW - Extinct in the Wild, CR - Critically Endangered [CR(PE) - Critically Endangered (Possibly Extinct), CR(PEW) - Critically Endangered (Possibly Extinct in the Wild)], EN - Endangered, VU - Vulnerable, LR/cd - Lower Risk/conservation dependent, NT - Near Threatened (includes LR/nt - Lower Risk/near threatened), DD - Data Deficient, LC - Least Concern (includes LR/lc - Lower Risk, least concern).

**Reasons for change:** G - Genuine status change (genuine improvement or deterioration in the species' status); N - Non-genuine status change (i.e., status changes due to new information, improved knowledge of the criteria, incorrect data used previously, taxonomic revision, etc.); E - Previous listing was an Error. <br>

For the World Production Dataset, we have data from 2006 to 2022 from the United Nations Industrial Development Organization with these columns:

*   Table Code
*   Table Description
* 	Country Code
*   Country Description
*   Year
*   ISIC
*   ISIC Description
*   ISIC Combination
* 	Value
*  	Table Definition Code
*	Table Description
* 	Source Code
*	Unit

#### Explanation of the ISIC Description and value: <br>

**ISIC Description:** The categorization of the production of the country in the year based on ISIC code. <br>

**Value:** The value of the production of the country in the year based on ISIC code.




## Data Cleaning Process

We will be cleaning the data in the following ways:

*   Removing unnecessary rows, columns, and data
*   Removing duplicate data and columns
*   Renaming columns
*   Changing data types
*   Splitting columns



In [1]:
import pandas as pd
import numpy as np

### Data Cleaning: Endangered Species Dataset

In [2]:
endangered = pd.read_csv('data/endangered_species.csv')
endangered.head()


Unnamed: 0,Scientific Name,Common Name,Species Type,IUCN Red List Category,Reason For Change,Year
0,Cephalophus spadix,Abbott’s Duiker,Mammal,EN,G,2007
1,Gazella spekei,Speke’s Gazelle,Mammal,EN,G,2007
2,Gorilla gorilla,Western Gorilla,Mammal,CR,G,2007
3,Lipotes vexillifer,Baiji,Mammal,CR(PE),G,2007
4,Mazama chunyi,Dwarf Brocket Deer,Mammal,VU,N,2007


In [3]:
# prints out the rows that have blank values in the scientific/common name column
original_data_endangered = endangered.shape[0]
print("The total amount of data originally is", original_data_endangered)
# drops the rows that have blank values in both of these columns
endangered = endangered.dropna(subset=['Scientific Name', 'Common Name'], how='all')
print("The total amount of data after removing rows with no species name is", endangered.shape[0])
# removes all rows that have blank values in the 'Year' column
endangered = endangered.dropna(subset=['Year'], how='all')
new_data_endangered = endangered.shape[0]
print("The total amount of data after removing rows with blank year values is", new_data_endangered)
print("The total amount of data removed is", original_data_endangered - new_data_endangered)

The total amount of data originally is 10013
The total amount of data after removing rows with no species name is 10003
The total amount of data after removing rows with blank year values is 9995
The total amount of data removed is 18


In [4]:
# convert the years that have the format 2014-3 into 2014

def convert_year(year):
	'''
	Converts the year into a consistent format (for example: from 2014-3 to 2014)
	
	year (string): the year to be converted
	return (string): the converted year
	'''
	if '-' in year:
		year = year.split('-')[0]
		year = str(int(year))
		return year
	elif '.' in year:
		year = year.split('.')[0]
		year = str(int(year))
		return year
	elif '‐' in year:
		year = year.split('‐')[0]
		year = str(int(year))
		return year
	else:
		return year

print(endangered['Year'].tail())

endangered['Year'] = endangered['Year'].apply(convert_year)
endangered['Year'].tail()





10008    2021-1
10009    2021-2
10010    2021-1
10011    2021-3
10012    2021-3
Name: Year, dtype: object


10008    2021
10009    2021
10010    2021
10011    2021
10012    2021
Name: Year, dtype: object

In [5]:
print("The summary statistics of the data set are:")
endangered.describe()

The summary statistics of the data set are:


Unnamed: 0,Scientific Name,Common Name,Species Type,IUCN Red List Category,Reason For Change,Year
count,9977,6421,9995,9992,9776,9995
unique,9635,6098,40,12,3,15
top,Labeo seeberi,Polynesian Tree Snail,Amphibian,LC,N,2020
freq,3,13,1812,2858,8625,2130


In [6]:
# the following code groups the data by the IUCN Red List Category and counts the number of species in each category

endangered.groupby('IUCN Red List Category')['Scientific Name'].count()

IUCN Red List Category
CR         1057
CR(PE)      117
CR(PEW)       2
DD          537
EN         2165
EW           15
EX           67
En            1
LC         2856
LR/nt         1
NT         1454
VU         1702
Name: Scientific Name, dtype: int64

In [7]:
# the following code displays the number of species that are endangered per year
endangered_year = endangered.groupby('Year').count()
endangered_year = endangered_year.reset_index()
endangered_year = endangered_year[['Year', 'Scientific Name']]
endangered_year = endangered_year.rename(columns={'Scientific Name': 'Species Count'})
endangered_year


Unnamed: 0,Year,Species Count
0,2007,141
1,2008,215
2,2009,179
3,2010,483
4,2011,346
5,2012,412
6,2013,422
7,2014,400
8,2015,291
9,2016,754


In [8]:
# the following code displays the number of species that are critically endangered per year
# based on the IUCN Red List Categories contained within the sublist (endangered and above)
endangered_sublist = ['EX', 'EW', 'CR', 'CR(PE)', 'CR(PEW)' 'EN']

endangered_critically_endangered = endangered[endangered['IUCN Red List Category'].isin(endangered_sublist)]
endangered_critically_endangered = endangered_critically_endangered.groupby('Year').count()
endangered_critically_endangered = endangered_critically_endangered.reset_index()
endangered_critically_endangered = endangered_critically_endangered[['Year', 'Scientific Name']]
endangered_critically_endangered = endangered_critically_endangered.rename(columns={'Scientific Name': 'Species Count'})
endangered_critically_endangered

Unnamed: 0,Year,Species Count
0,2007,18
1,2008,33
2,2009,29
3,2010,64
4,2011,44
5,2012,70
6,2013,31
7,2014,69
8,2015,28
9,2016,88


### Data Cleaning: World Production Dataset

In [9]:
world_production = pd.read_csv('data/world_production.csv')
world_production.head()


Unnamed: 0,Table Code,Table Description,Country Code,Country Description,Year,ISIC,ISIC Description,ISIC Combination,Value,Table Definition Code,Table Description.1,Source Code,Unit
0,51,Seasonally adjusted index,8,Albania,2006 Q1,10,Food products,10,65.9,51,Seasonally adjusted index,1,I
1,51,Seasonally adjusted index,8,Albania,2006 Q1,11,Beverages,11,65.9,51,Seasonally adjusted index,1,I
2,51,Seasonally adjusted index,8,Albania,2006 Q1,12,Tobacco products,12,65.9,51,Seasonally adjusted index,1,I
3,51,Seasonally adjusted index,8,Albania,2006 Q1,13,Textiles,13,58.0,51,Seasonally adjusted index,1,I
4,51,Seasonally adjusted index,8,Albania,2006 Q1,14,Wearing apparel,14,58.0,51,Seasonally adjusted index,1,I


In [10]:
# the following code removes all unnecessary columns from the data set
world_production = world_production[['Country Code', 'Country Description', 'Year', 'ISIC', 'ISIC Description', 'Value']]
world_production.head()

Unnamed: 0,Country Code,Country Description,Year,ISIC,ISIC Description,Value
0,8,Albania,2006 Q1,10,Food products,65.9
1,8,Albania,2006 Q1,11,Beverages,65.9
2,8,Albania,2006 Q1,12,Tobacco products,65.9
3,8,Albania,2006 Q1,13,Textiles,58.0
4,8,Albania,2006 Q1,14,Wearing apparel,58.0


### The following code takes about 2 minutes to run, which is why it's commented out
### To run it, uncomment out the following code and comment the last 2 lines of code in this cell

In [11]:
# split up the year and quarter and make a new column for the quarter


# def split_year_quarter(year_quarter):
# 	'''
# 	Splits the year and quarter into two separate columns
	
# 	year_quarter (string): the year and quarter to be split
# 	return (list): a list containing the year and quarter
# 	'''
# 	split_value = 'Q'
# 	if "Y" in year_quarter:
# 		split_value = 'Y'
	
# 	year = year_quarter.split(split_value)[0].strip()
# 	quarter = year_quarter.split(split_value)[1].strip()
# 	if quarter == "":
# 		quarter = "Y"
# 	return [year, quarter]

# world_production[['Year', 'Quarter']] = world_production['Year'].apply(split_year_quarter).apply(pd.Series)
# world_production.to_csv('data/world_production_new.csv', index=False)
# world_production.head()

world_production = pd.read_csv('data/world_production_new.csv')
world_production.head()

Unnamed: 0,Country Code,Country Description,Year,ISIC,ISIC Description,Value,Quarter
0,8,Albania,2006,10,Food products,65.9,1
1,8,Albania,2006,11,Beverages,65.9,1
2,8,Albania,2006,12,Tobacco products,65.9,1
3,8,Albania,2006,13,Textiles,58.0,1
4,8,Albania,2006,14,Wearing apparel,58.0,1


In [12]:
# the following code removes all rows that contains year 2006 or 2022 and reset the indexes
original_data_prod = world_production.shape[0]
print("The total amount of data originally is", original_data_prod)
world_production = world_production[world_production['Year'] != 2006]
world_production = world_production[world_production['Year'] != 2022].reset_index(drop=True)
new_data_prod = world_production.shape[0]
print("The total amount of data after removing rows with the year 2006 or 2022 is", new_data_prod)
print("The total amount of data removed is", original_data_prod - new_data_prod)
world_production.head()

The total amount of data originally is 350636
The total amount of data after removing rows with the year 2006 or 2022 is 323396
The total amount of data removed is 27240


Unnamed: 0,Country Code,Country Description,Year,ISIC,ISIC Description,Value,Quarter
0,8,Albania,2007,10,Food products,60.7,1
1,8,Albania,2007,11,Beverages,60.7,1
2,8,Albania,2007,12,Tobacco products,60.7,1
3,8,Albania,2007,13,Textiles,60.2,1
4,8,Albania,2007,14,Wearing apparel,60.2,1


In [13]:
print("The summary of the data set is:")
world_production.describe()

The summary of the data set is:


Unnamed: 0,Country Code,Year,Value
count,323396.0,323396.0,323396.0
mean,425.8557,2014.222987,111.686101
std,253.487541,4.503588,957.64895
min,8.0,2005.0,-6.7
25%,196.0,2011.0,90.4
50%,428.0,2015.0,100.3
75%,643.0,2018.0,111.9
max,858.0,2021.0,318802.7


In [14]:
# group by the year and ISIC description and sum the values
world_production_sum = world_production.groupby(['Year', 'ISIC Description']).sum()
world_production_sum = world_production_sum.reset_index()
world_production_sum = world_production_sum[['Year', 'ISIC Description', 'Value']]
world_production_sum = world_production_sum.rename(columns={'Value': 'Production Value'})
world_production_sum


Unnamed: 0,Year,ISIC Description,Production Value
0,2005,Basic metals,53892.2
1,2005,Beverages,46945.1
2,2005,Chemicals and chemical products,44470.5
3,2005,Coke and refined petroleum products,40910.2
4,2005,"Computer, electronic and optical products",49861.2
...,...,...,...
443,2021,Tobacco products,72644.0
444,2021,Total manufacturing,132593.0
445,2021,"Water supply; sewerage, waste management",74334.8
446,2021,Wearing apparel,98500.5


In [15]:
# group by isic description and average the values
world_production_avg = world_production.groupby(['ISIC Description'])
world_production_avg = world_production.groupby(['ISIC Description']).mean()
world_production_avg = world_production_avg.reset_index()
world_production_avg = world_production_avg[['ISIC Description', 'Value']]
world_production_avg = world_production_avg.rename(columns={'Value': 'Production Value'})
world_production_avg

Unnamed: 0,ISIC Description,Production Value
0,Basic metals,104.065912
1,Beverages,101.62006
2,Chemicals and chemical products,102.875333
3,Coke and refined petroleum products,127.348899
4,"Computer, electronic and optical products",110.732981
5,Electrical equipment,105.420031
6,"Electricity, gas, steam & air conditioning",103.452861
7,"Fabricated metal products, except machinery",104.437081
8,Food products,101.478381
9,Furniture,104.919756
