<a href="https://colab.research.google.com/github/Cullen-Kendrick/Data-Experience/blob/main/Collection_and_Cleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Research Question**
---
I am deeply interested in well-being. I want to figure out what makes people happy on a large scale, and which factors of life affect people's well-being. To do this, I collected data from across the world and across time to see how different places are affected by different factors. Answering this question can guide work that improves people's lives in meaningful ways. It is folly to expect that a solution that works in one part of the world will work across the globe. By attempting to answer the question of world happiness, I hope we can better understand what we can do to help.

# Data Collection

In [None]:
import pandas as pd

regions = pd.read_csv("https://raw.githubusercontent.com/Cullen-Kendrick/Data-219/main/world-happiness-report-2021.csv")
df = pd.read_csv("https://raw.githubusercontent.com/Cullen-Kendrick/Data-219/main/world-happiness-report.csv")

*Getting the Data*
---
Luckily, Kaggle had both datasets. Both come from the World Happiness Report, whose survey measures are drawn primarily from the Gallup World Poll. The difference between the two is that the 2021 dataset is much cleaner and covers only one year, whereas the second set is panel data across multiple years. I will be using the second dataset because having more data points provides more evidence and allows for more confidence in the conclusions drawn from it.
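
To make the distinction between the two datasets concrete, here is a minimal sketch using made-up rows. The country labels, the `year` column name, and the values are assumptions for illustration and are not taken from either file. A panel has one row per country per year, while a single-year dataset has one row per country.

```python
import pandas as pd

# Toy rows for illustration only; the numbers are invented, not real report values.
panel = pd.DataFrame({
    "Country name": ["A", "A", "A", "B", "B"],
    "year": [2018, 2019, 2020, 2019, 2020],
    "Life Ladder": [5.1, 5.3, 5.2, 6.8, 6.9],
})

# A panel has several observations per country, one for each year surveyed.
rows_per_country = panel.groupby("Country name")["year"].count()

# Keeping only the latest year per country collapses the panel to a cross-section.
latest = panel.sort_values("year").groupby("Country name").tail(1)

print(rows_per_country)
print(latest)
```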

# Data Cleaning

In [None]:
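# Keep only the columns needed to attach a region to each country.
regions = regions[["Country name", "Regional indicator"]]
# Drop rows with any missing value.
df = df.dropna()
# Unique countries remaining in the panel data.
count = df["Country name"].unique()
regions = regions.sort_values(by=['Country name'])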

*Merging the Data Sets*
---
The important aspect of the first data set that I wanted to capture was the work done to categorize the countries by region. This would allow me to separate the set so that I could analyze how reported happiness differs between world regions.

In [None]:
df = df.merge(regions, how='left', on='Country name')
reg = df['Regional indicator'].unique()
df['Region'] = df['Regional indicator']
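
A left merge keeps every row of the base table, so any country missing from the lookup table ends up with an empty region. One way to list those countries is the `indicator` option of `merge`. The sketch below uses toy frames with placeholder country names rather than the real `df` and `regions`.

```python
import pandas as pd

# Toy frames standing in for `df` and `regions`; the country names are placeholders.
base = pd.DataFrame({
    "Country name": ["X", "Y", "Z"],
    "Life Ladder": [5.0, 6.0, 4.5],
})
lookup = pd.DataFrame({
    "Country name": ["X", "Y"],
    "Regional indicator": ["Region 1", "Region 2"],
})

# indicator=True adds a `_merge` column recording where each row was found.
merged = base.merge(lookup, how="left", on="Country name", indicator=True)

# Rows marked 'left_only' had no match in the lookup table, so their region is NaN.
unmatched = merged.loc[merged["_merge"] == "left_only", "Country name"].unique()
print(sorted(unmatched))
```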

*Manual Labeling*
---
Part of the problem in merging the two data sets was that the base set I wanted to use had more, and different, countries than the data set I was taking the regional indicator from. To fix this, I had to label by hand every country that showed up in the base data set without a regional indicator.

In [None]:
mask = df['Country name'] == 'Angola'
df.loc[mask, 'Region'] = 'Sub-Saharan Africa'
mask = df['Country name'] == 'Central African Republic'
df.loc[mask, 'Region'] = 'Sub-Saharan Africa'
mask = df['Country name'] == 'Congo (Kinshasa)'
df.loc[mask, 'Region'] = 'Sub-Saharan Africa'
mask = df['Country name'] == 'Djibouti'
df.loc[mask, 'Region'] = 'Sub-Saharan Africa'
# Guyana and Suriname are in South America, not Africa.
mask = df['Country name'] == 'Guyana'
df.loc[mask, 'Region'] = 'Latin America and Caribbean'
mask = df['Country name'] == 'Suriname'
df.loc[mask, 'Region'] = 'Latin America and Caribbean'

In [None]:
mask = df['Country name'] == 'Qatar'
df.loc[mask, 'Region'] = 'Middle East and North Africa'
mask = df['Country name'] == 'Sudan'
df.loc[mask, 'Region'] = 'Middle East and North Africa'
mask = df['Country name'] == 'Syria'
df.loc[mask, 'Region'] = 'Middle East and North Africa'

In [None]:
mask = df['Country name'] == 'Bhutan'
df.loc[mask, 'Region'] = 'South Asia'

mask = df['Country name'] == 'Trinidad and Tobago'
df.loc[mask, 'Region'] = 'Latin America and Caribbean'
mask = df['Country name'] == 'Belize'
df.loc[mask, 'Region'] = 'Latin America and Caribbean'
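
The same hand labeling could also be expressed as a single lookup dictionary, which may be easier to review than repeated mask assignments. This is only a sketch: `manual_regions` and `demo` are hypothetical names, the dictionary lists just a few of the countries from the cells above, and unlike those cells it fills only missing regions instead of overwriting existing ones.

```python
import pandas as pd

# Hypothetical lookup; only a few entries from the cells above are listed here.
manual_regions = {
    "Angola": "Sub-Saharan Africa",
    "Qatar": "Middle East and North Africa",
    "Bhutan": "South Asia",
    "Belize": "Latin America and Caribbean",
}

# Toy frame standing in for the merged `df`; 'Finland' already has a region.
demo = pd.DataFrame({
    "Country name": ["Angola", "Finland", "Bhutan"],
    "Region": [None, "Western Europe", None],
})

# Fill only the missing regions, leaving existing labels untouched.
demo["Region"] = demo["Region"].fillna(demo["Country name"].map(manual_regions))

# Any country still missing a region needs an entry in the dictionary.
still_missing = demo.loc[demo["Region"].isna(), "Country name"].tolist()
print(demo)
print(still_missing)
```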

*Variable Discussion*
---
In this section I reduce the number of columns I have to work with and rename the remaining ones to make the workflow of this project easier. Additionally, I want to quickly cover what each column represents. The 'Life Ladder' represents the average reported happiness of a country on a scale from 0 to 10. GDP per capita represents the economic prosperity of a country, and it has been logged because that compresses the large differences that emerge when comparing high-income countries, like the USA, with lower-income countries, like Panama. Social support, Life expectancy, Freedom to make life choices, Generosity, and Perceptions of corruption are all self-explanatory.

In [None]:
df = df.drop(columns=['Regional indicator', 'Positive affect', 'Negative affect'])
df = df.rename(columns={
    'Life Ladder': 'happy',
    'Log GDP per capita': 'GDP',
    'Social support': 'support',
    'Healthy life expectancy at birth': 'life_expec',
    'Freedom to make life choices': 'agency',
    'Generosity': 'gen',
    'Perceptions of corruption': 'corrup',
})
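
As a rough illustration of why a log scale helps with GDP per capita, the sketch below uses hypothetical dollar figures chosen only to span a wide range, and assumes a natural log. On the raw scale the values differ by a factor of 100, while on the log scale equal ratios turn into equal-sized steps.

```python
import numpy as np

# Hypothetical GDP-per-capita figures in dollars, not values from the dataset.
gdp = np.array([1_000.0, 10_000.0, 100_000.0])
log_gdp = np.log(gdp)

# On the raw scale the largest value is 100 times the smallest.
raw_ratio = gdp.max() / gdp.min()

# On the log scale each tenfold increase is the same step, log(10).
log_steps = np.diff(log_gdp)

print(raw_ratio)
print(log_steps)
```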


In [None]:
df.to_csv('clean_data.csv', index=False)