# Appendix: 2950 Project Data Cleaning

First, we imported our CSV file, "Life Expectancy Data.csv", which we downloaded from Kaggle (https://www.kaggle.com/kumarajarshi/life-expectancy-who). A copy can be viewed on Google Drive (https://drive.google.com/file/d/1uWotsPMAGiU_x2lEgssgJe6DtITk4r1I/view?usp=sharing).

In [1]:
import pandas as pd
import numpy as np
from scipy import stats
LED = pd.read_csv("Life Expectancy Data.csv")
LED.head()

,Country,Year,Status,Life expectancy,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,...,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
0,Afghanistan,2015,Developing,65.0,263.0,62,0.01,71.279624,65.0,1154,...,6.0,8.16,65.0,0.1,584.25921,33736494.0,17.2,17.3,0.479,10.1
1,Afghanistan,2014,Developing,59.9,271.0,64,0.01,73.523582,62.0,492,...,58.0,8.18,62.0,0.1,612.696514,327582.0,17.5,17.5,0.476,10.0
2,Afghanistan,2013,Developing,59.9,268.0,66,0.01,73.219243,64.0,430,...,62.0,8.13,64.0,0.1,631.744976,31731688.0,17.7,17.7,0.47,9.9
3,Afghanistan,2012,Developing,59.5,272.0,69,0.01,78.184215,67.0,2787,...,67.0,8.52,67.0,0.1,669.959,3696958.0,17.9,18.0,0.463,9.8
4,Afghanistan,2011,Developing,59.2,275.0,71,0.01,7.097109,68.0,3013,...,68.0,7.87,68.0,0.1,63.537231,2978599.0,18.2,18.2,0.454,9.5


In [2]:
print(f"Our dataset has {LED.shape[0]} rows and {LED.shape[1]} columns.")

Our dataset has 2938 rows and 22 columns.


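Next, we cleaned the column names of our dataset by stripping leading and trailing spaces, converting them to lowercase, and replacing spaces with underscores. One column name contained a double space, which left a double underscore, so we renamed 'thinness__1-19_years' to 'thinness_1-19_years'.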

In [3]:
# Normalize column names: strip outer whitespace, lowercase, replace spaces with underscores.
LED.columns = [name.strip().lower().replace(' ', '_') for name in LED.columns]
# A double space in one raw column name leaves a double underscore; fix it by hand.
LED = LED.rename(columns={'thinness__1-19_years': 'thinness_1-19_years'})
LED.columns

Index(['country', 'year', 'status', 'life_expectancy', 'adult_mortality',
       'infant_deaths', 'alcohol', 'percentage_expenditure', 'hepatitis_b',
       'measles', 'bmi', 'under-five_deaths', 'polio', 'total_expenditure',
       'diphtheria', 'hiv/aids', 'gdp', 'population', 'thinness_1-19_years',
       'thinness_5-9_years', 'income_composition_of_resources', 'schooling'],
      dtype='object')
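
As an aside, the same renaming can be sketched with a regular expression that collapses any run of whitespace into a single underscore, which would make the manual fix for the double underscore unnecessary. This is only an illustrative alternative, not what we ran: the helper name `normalize_column_name` and the toy headers below are assumptions made for the example.

```python
import re

import pandas as pd


def normalize_column_name(name: str) -> str:
    """Strip outer whitespace, lowercase, and turn each whitespace run into one underscore."""
    return re.sub(r"\s+", "_", name.strip().lower())


# Toy frame whose headers mimic irregular spacing (made-up examples, not the real file).
toy = pd.DataFrame(columns=["Life expectancy ", " thinness  1-19 years", "HIV/AIDS"])
toy.columns = [normalize_column_name(c) for c in toy.columns]
print(list(toy.columns))
```

Because the whitespace run is handled in one step, no column-specific `rename` is needed afterwards.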

Next, we removed the columns we do not plan to use in our analysis using the drop() function. These columns are 'infant_deaths', 'alcohol', 'population', 'total_expenditure', 'gdp', 'hepatitis_b', 'measles', 'bmi', 'under-five_deaths', 'polio', 'diphtheria', 'thinness_1-19_years', and 'thinness_5-9_years'.

In [4]:
LED= LED.drop(['infant_deaths', 'alcohol', 'population', 'total_expenditure', 'gdp', 'hepatitis_b', 'measles', 'bmi', 'under-five_deaths', 'polio', 'diphtheria', 'thinness_1-19_years', 'thinness_5-9_years'], axis=1)
LED.head()

,country,year,status,life_expectancy,adult_mortality,percentage_expenditure,hiv/aids,income_composition_of_resources,schooling
0,Afghanistan,2015,Developing,65.0,263.0,71.279624,0.1,0.479,10.1
1,Afghanistan,2014,Developing,59.9,271.0,73.523582,0.1,0.476,10.0
2,Afghanistan,2013,Developing,59.9,268.0,73.219243,0.1,0.47,9.9
3,Afghanistan,2012,Developing,59.5,272.0,78.184215,0.1,0.463,9.8
4,Afghanistan,2011,Developing,59.2,275.0,7.097109,0.1,0.454,9.5
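
Dropping a long list of columns is equivalent to selecting the short list of columns to keep. The sketch below shows both routes on a toy frame, with a check that fails early if a column name is misspelled. The toy values and the `keep` list are made up for illustration and are not our actual data.

```python
import pandas as pd

# Toy frame with a few of the cleaned column names; the values are made up.
toy = pd.DataFrame(
    {
        "country": ["A", "B"],
        "year": [2015, 2015],
        "gdp": [1.0, 2.0],
        "schooling": [10.1, 12.3],
        "polio": [6.0, 90.0],
    }
)

keep = ["country", "year", "schooling"]

# Fail early if a name in the keep-list is misspelled or missing.
missing = [c for c in keep if c not in toy.columns]
if missing:
    raise KeyError(f"columns not found: {missing}")

kept = toy[keep]                              # route 1: select what to keep
dropped = toy.drop(columns=["gdp", "polio"])  # route 2: drop what is not needed

# Both routes give the same frame on this toy input.
print(kept.equals(dropped))
```

A keep-list can be easier to read when most columns are discarded, as in our case.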


Finally, we converted the categorical variable status to a number (0 for Developing, 1 for Developed), grouped our data by country by averaging each column over all years, and removed the year column.

In [5]:
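# Encode status numerically so it can be averaged: Developing -> 0, Developed -> 1.
LED["status"] = LED["status"].map({"Developing": 0, "Developed": 1})
# Average every column over the years available for each country.
LED = LED.groupby("country").mean()
# The mean of year is not meaningful, so drop it.
LED = LED.drop(columns=["year"])
LED.sample(10)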

country,status,life_expectancy,adult_mortality,percentage_expenditure,hiv/aids,income_composition_of_resources,schooling
Jordan,0.0,72.9875,114.3125,273.599534,0.1,0.729,13.2375
Georgia,0.0,73.50625,114.9375,99.525714,0.1,0.677937,12.675
Netherlands,1.0,81.13125,61.625,3805.687048,0.1,0.89975,17.05625
Democratic People's Republic of Korea,0.0,69.19375,160.8125,0.0,0.1,,
Republic of Moldova,0.0,69.98125,210.3125,0.0,0.1,,
Mauritius,0.0,72.7125,163.5625,422.239524,0.10625,0.722625,13.53125
Chad,0.0,50.3875,227.75,32.27732,4.3375,0.316625,6.0875
Namibia,0.0,60.4,313.1875,359.930079,13.64375,0.588313,11.55
Nauru,0.0,,,15.606596,0.1,,9.6
Argentina,0.0,75.15625,106.0,773.038981,0.1,0.794125,16.50625
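
Some countries in the sample above show empty cells after averaging. A minimal sketch of why, on made-up data: `groupby(...).mean()` skips missing values by default, so a country's average is missing only when every yearly value for that column is missing.

```python
import numpy as np
import pandas as pd

# Toy data: country A has one observed schooling value, country B has none.
toy = pd.DataFrame(
    {
        "country": ["A", "A", "B", "B"],
        "year": [2014, 2015, 2014, 2015],
        "status": ["Developing", "Developing", "Developed", "Developed"],
        "schooling": [10.0, np.nan, np.nan, np.nan],
    }
)

# Same steps as in the cell above: encode status, average by country, drop year.
toy["status"] = toy["status"].map({"Developing": 0, "Developed": 1})
by_country = toy.groupby("country").mean().drop(columns=["year"])
print(by_country)
```

Here country A keeps a schooling average based on its single observed year, while country B ends up with a missing value.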


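We then dropped every country that still had a missing value in any of the remaining columns, which left 173 countries.

In [6]: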
LED = LED.dropna()
len(LED)

173
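
Before dropping rows, it can help to see how many values are missing per column and which countries would be removed. The following is a sketch on made-up data, not output from our data set; the variable names are chosen for the example.

```python
import numpy as np
import pandas as pd

# Toy country-level frame with some missing values (made-up numbers).
toy = pd.DataFrame(
    {"life_expectancy": [72.9, np.nan, 81.1], "schooling": [13.2, 9.6, np.nan]},
    index=pd.Index(["A", "B", "C"], name="country"),
)

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Countries with at least one missing value, i.e. the rows dropna() will remove.
dropped_countries = toy.index[toy.isna().any(axis=1)].tolist()

clean = toy.dropna()
print(missing_per_column.to_dict(), dropped_countries, len(clean))
```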

In [7]:
LED.to_csv("WHOclean-final.csv")
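
Because the cleaned frame is indexed by country, `to_csv` writes the country names as the first column. A sketch of reading such a file back with the country column restored as the index is shown below. It uses an in-memory buffer and made-up values instead of the real file so that it runs on its own.

```python
import io

import pandas as pd

# Toy stand-in for the cleaned frame (made-up values), indexed by country.
toy = pd.DataFrame(
    {"schooling": [13.25, 17.5]},
    index=pd.Index(["Jordan", "Netherlands"], name="country"),
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# index_col restores the country column as the index on load.
restored = pd.read_csv(buffer, index_col="country")
print(restored)
```

Without `index_col`, the country names would come back as an ordinary column next to a new integer index.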

# Data Description Questions

**What are the observations (rows) and the attributes (columns)?**

Each row of the raw data set is one country in one year. After our cleaning, each row is one country, averaged over all of its years. The columns of the raw data set are:

- country: country name
- year: year this data was collected
- status: developed or developing country
- life_expectancy: average life expectancy in years
- adult_mortality: probability of dying between 15 and 60 years of age, per 1000 population
- infant_deaths: number of infant deaths per 1000 population
- alcohol: recorded alcohol consumption per capita, in litres of pure alcohol
- percentage_expenditure: expenditure on health as a percentage of GDP per capita
- hepatitis_b: Hepatitis B immunization coverage among 1-year-olds, as a percentage
- measles: number of reported cases per 1000 population
- bmi: average Body Mass Index
- under-five_deaths: number of under-five deaths per 1000 population
- polio: Polio immunization coverage among 1-year-olds, as a percentage
- total_expenditure: general government expenditure on health as a percentage of total government expenditure
- diphtheria: diphtheria tetanus toxoid and pertussis (DTP3) immunization coverage among 1-year-olds, as a percentage
- hiv/aids: deaths from HIV/AIDS per 1000 live births (0-4 years)
- gdp: Gross Domestic Product per capita, in USD
- population: population of the country
- thinness_1-19_years: prevalence of thinness among children and adolescents for ages 10 to 19, as a percentage
- thinness_5-9_years: prevalence of thinness among children for ages 5 to 9, as a percentage
- income_composition_of_resources: Human Development Index in terms of income composition of resources (index ranging from 0 to 1)
- schooling: average number of years of schooling

**Why was this data set created?**

Both the Global Health Observatory (GHO) data repository under the World Health Organization (WHO) and the United Nations track health statistics as well as many other related factors (economic, social, etc.) for all countries. The creators of this data set combined the data from the repository with that of the UN. They observed that in the past 15 years, there have been significant developments in the health sector specifically in terms of decreasing human mortality rates. They created this data set to compare this decreasing mortality rate to a number of other economic, social, and medical factors to analyze how they contribute.

**Who funded the creation of the data set?**

This project was funded by the Industrial Engineering Department at Georgia Institute of Technology.

**What processes might have influenced what data was observed and recorded and what was not?**

In terms of the WHO collecting data, it is possible that more developed countries have more resources and are thus more likely to have more data. With this said, in order to compare each country equitably, the data collectors must collect the same data across all countries. This means that it is possible that the data collection was limited by what data variables were available across every country.

In terms of the creation of the actual data set, the creators mentioned a few factors they used to limit the data set:

- The creators noted that certain countries originally in the data set had a lot of missing data (they named lesser-known countries like Vanuatu, Tonga, Togo, Cabo Verde, etc.). They shared that finding the data for these countries was difficult, so they decided to exclude these countries from the final data set of 193 countries.
- The creators also noted that “among all categories of health-related factors only those critical factors were chosen which are more representative”. This means that they omitted other factors that they deemed less important or less strongly correlated with mortality.

How the creators cleaned the data: Of all health-related categories in the WHO data set, only the critical factors (those with high correlations to mortality rates) were chosen to be included. The creators combined these data sets for the years 2000-2015 for 193 countries using the merge function. They handled missing values in R using the Missmap command.

**What preprocessing was done, and how did the data come to be in the form that you are using?**

The preprocessing done by the creators of the data set is outlined in our answer to the previous question. Our own preprocessing is shown throughout the Jupyter notebook. To summarize, we cleaned the column names by stripping leading and trailing spaces, converting them to lowercase, and replacing spaces with underscores; dropped the columns we do not plan to analyze; encoded status as 0 (Developing) or 1 (Developed); averaged the data for each country over all years and removed the year column; and removed countries with missing values, leaving 173 countries.

**If people are involved, were they aware of the data collection and if so, what purpose did they expect the data to be used for?**

The WHO notes that all of its data sets "represent the best estimates using methodologies for specific indicators that aim for comparability across countries and time; they are updated as more recent or revised data become available, or when there are changes to the methodology being used". The people involved in the data collection are aware of their data being collected and understand the WHO's intent to use their data to analyze global health trends to serve the greater health of humanity.

**Where can your raw source data be found, if applicable? Provide a link to the raw data (hosted in a Cornell Google Drive or Cornell Box).**

https://drive.google.com/file/d/1uWotsPMAGiU_x2lEgssgJe6DtITk4r1I/view?usp=sharing 

**List specific questions for your peer reviewers and project mentor to answer in giving you feedback on this phase:**

- Are there too many or too few variables (columns) being evaluated?
- Are our research questions too broad or too narrow?