---
---
Recitation 5: Pandas I

Applied Data Science in Python for Social Scientists

New York University, Abu Dhabi

Dated: 26th Sept 2023

---
---
# Start Here
## Learning Goals
### General Goals
- Get hands-on experience with data processing
- Learn the Pandas library

### Specific Goals
- Accessing Data in DataFrames/Series
- Loading and indexing DataFrames/Series
- Vectorization and Broadcasting
- Querying DataFrames using Boolean Masks

## Distribution of Class Materials
These problem sets and recitations are the intellectual property of NYUAD. Please do **not** distribute them or their solutions to students who have not signed up for this class or who intend to sign up in the future. Please also do not post these problem sets and recitations online or on any public platform.

## Submission
You will submit all your code as a Python Notebook through [Brightspace](https://brightspace.nyu.edu/) as **R5_YOUR NETID.ipynb**.

---




# General Instructions
This recitation is worth 50 points and has 5 parts. All parts must be completed in the Jupyter (Colab) Notebook attached to this handout.



# How happy are you, and why?

Think of a ladder as a way of picturing your life. The top of the ladder represents the best possible life for you. The bottom rung of the ladder represents the worst possible life for you. If the ladder has 10 steps, on which step would you place your current life?

What factors explain that score: your income, your freedom to make life choices, your health, the corruption around you, your social support, or something else?

How do you think this score differs across countries?

![split-apply-combine](https://drive.google.com/uc?id=1QC2rFCO54XfDo51qwH1Am_T-ijmQW9C-)

This score is known as the **Cantril ladder** or **ladder score**, named after the psychologist *Hadley Cantril* of Princeton.

*Gallup* is an American analytics and advisory company that conducts worldwide surveys and polls every year. An annual report that draws on its polling data, the **World Happiness Report**, *ranks* countries based on this subjective ladder score. To understand which factors contribute to the score, the report also compares it with the following variables:

1. *GDP per capita*: Computed using the World Development Indicators (WDI) released by the World Bank.

2. *Healthy life expectancy*: Based on data from the World Health Organization (WHO) Global Health Observatory data repository.

3. *Social support*: The national average of the binary responses (either 0 or 1) to the Gallup World Poll (GWP) question **"If you were in trouble, do you have relatives or friends you can count on to help you whenever you need them, or not?"**

4. *Freedom to make life choices*: The national average of binary responses to the GWP question **"Are you satisfied or dissatisfied with your freedom to choose what you do with your life?"**

5. *Perceptions of corruption*: The average of binary answers to two questions: **"Is corruption widespread throughout the government or not?"** and **"Is corruption widespread within businesses or not?"**

For all the variables above, a larger value is better, except for *Perceptions of corruption*, where a smaller value is better.

The report is typically based on surveying about 1000 subjects per country. For this recitation, we will assume the number of subjects surveyed per country was *exactly* 1000.
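
Because the survey-based variables are national averages of 0/1 answers, multiplying such an average by the assumed sample size gives an approximate count of respondents who answered 1. The following is a minimal sketch of that broadcasting step. It uses three social support values copied from the sample rows shown in Part I; the names `social_support`, `SUBJECTS_PER_COUNTRY`, and `approx_yes_answers` are illustrative and not part of the assignment.

```python
import pandas as pd

# Assumption taken from this handout: exactly 1000 subjects per country
SUBJECTS_PER_COUNTRY = 1000

# Social support values for three countries, copied from the Part I sample rows
social_support = pd.Series(
    [0.906912, 0.910269, 0.942082],
    index=["Luxembourg", "Singapore", "Ireland"],
)

# Broadcasting: the scalar is applied to every element, so no explicit loop is needed
approx_yes_answers = (social_support * SUBJECTS_PER_COUNTRY).round().astype(int)
print(approx_yes_answers)
```

The same pattern should work on a full DataFrame column once the file is loaded.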

We have provided this data as a CSV file called *happiness.csv*.

In this recitation, you will use your knowledge of pandas from class to do some data exploration using this file.

## Part I: Warming Up (5 points)

Load the *happiness.csv* file as a DataFrame, and view the top 10 rows.

In [57]:
# Write your code below this line

######### SOLUTION #########
import pandas as pd

happiness_df = pd.read_csv('./happiness.csv') # Please use the appropriate path to the dataset on your machine
happiness_df.head(10) # Display the first 10 rows of the dataset

######### SOLUTION END #########

Unnamed: 0,Country name,Regional indicator,Ladder score,Logged GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
0,Luxembourg,Western Europe,7.2375,11.450681,0.906912,72.599998,0.905636,-0.004621,0.367084
1,Singapore,Southeast Asia,6.3771,11.395521,0.910269,76.804581,0.926645,0.029879,0.109784
2,Ireland,Western Europe,7.0937,11.160978,0.942082,72.300789,0.886983,0.145785,0.357184
3,United Arab Emirates,Middle East and North Africa,6.7908,11.109999,0.849181,67.082787,0.941346,0.123451,0.594502
4,Kuwait,Middle East and North Africa,6.1021,11.089825,0.846475,66.767647,0.872366,-0.100185,0.760849
5,Norway,Western Europe,7.488,11.087804,0.952487,73.200783,0.95575,0.134533,0.263218
6,Switzerland,Western Europe,7.5599,10.979933,0.942847,74.102448,0.921337,0.105911,0.303728
7,Hong Kong S.A.R. of China,East Asia,5.5104,10.934671,0.845969,76.771706,0.779834,0.13498,0.420607
8,United States,North America and ANZ,6.9396,10.925769,0.914219,68.2995,0.84262,0.149892,0.699715
9,Netherlands,Western Europe,7.4489,10.812712,0.939139,72.300919,0.908548,0.207612,0.364717
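
The `Unnamed: 0` column in the output above is the name pandas gives to a column whose header field is empty; here it most likely holds a saved row index. If you would rather use it as the index, `read_csv` accepts `index_col=0`. The sketch below illustrates the difference on a two-row, in-memory excerpt (only two of the data columns are kept, and `csv_text` is a stand-in for the real file). The outputs in the rest of this handout assume the default loading.

```python
import io
import pandas as pd

# Two-row excerpt mimicking the layout of happiness.csv (empty first header field)
csv_text = """,Country name,Ladder score
0,Luxembourg,7.2375
1,Singapore,6.3771
"""

default_df = pd.read_csv(io.StringIO(csv_text))               # keeps an 'Unnamed: 0' column
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)  # uses that column as the index

print(default_df.columns.tolist())
print(indexed_df.columns.tolist())
```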


### Rubric

- +5 points for correctness

## Part II: Dropping Data (10 points)

There is an extra column in the csv file with the name *Generosity* that we would like to drop. We would also like to drop all the rows with NaNs. Write code below to do that using Pandas.

Your code should be no more than 2 lines of code.

In [58]:
# Write your code below this line

######### SOLUTION #########
happiness_df.drop('Generosity', axis=1, inplace=True) # Drop the Generosity column
happiness_df.dropna(inplace=True) # Drop all rows with missing values

######### SOLUTION END #########
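
The solution above modifies `happiness_df` in place. One possible alternative is to chain the two calls, since both `drop` and `dropna` return a new DataFrame by default. The sketch below assumes a small toy DataFrame: the first two rows are copied from the Part I output, while the third row ("Exampleland") is hypothetical and exists only to demonstrate a missing value.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "Country name": ["Luxembourg", "Singapore", "Exampleland"],  # last row is hypothetical
    "Ladder score": [7.2375, 6.3771, np.nan],
    "Generosity": [-0.004621, 0.029879, 0.1],
})

# Chained, non-mutating version: each call returns a new DataFrame
clean_df = toy_df.drop(columns="Generosity").dropna()
print(clean_df)
```

Here `toy_df` is left untouched, which can make it easier to re-run a notebook cell without side effects.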

### Rubric

- +5 points for correctness
- +3 points for conciseness
- +2 points for comments

## Part III: Renaming Columns (5 points)

For ease of access, we would like to convert all the column names in the DataFrame to lowercase. Write code below to accomplish that.

Your code should be no more than 3 lines of code.

In [59]:
# Write your code below this line

######### SOLUTION #########
# Make all columns lowercase
happiness_df.columns = [column.lower() for column in happiness_df.columns]

######### SOLUTION END #########
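
The list comprehension above works well. As a sketch of a vectorized alternative, the column labels expose string methods through `.str`, so the whole rename can be written without an explicit loop. The empty `demo_df` below is a stand-in with three of the original column names.

```python
import pandas as pd

# Empty stand-in DataFrame carrying three of the original column names
demo_df = pd.DataFrame(columns=["Country name", "Ladder score", "Logged GDP per capita"])

# Index.str applies the string method to every column label at once
demo_df.columns = demo_df.columns.str.lower()
print(demo_df.columns.tolist())
```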

### Rubric

- +5 points for correctness
- +3 points for conciseness
- +2 points for comments

## Part IV: Where should you migrate? (10 points)

Everybody wants to live a happy life, so you have decided to move, after graduation, to a country where you can live one. Your criteria are:

1. high ladder score (> 6.0)
2. high logged GDP per capita (> 11.0)
3. high social support (> 0.80)
4. high healthy life expectancy (> 65.0)
5. high freedom to make life choices (> 0.90)
6. low perceived corruption (<= 0.60)

Which countries qualify for migration? Are you living in a happy country? :)

Your code should be no more than three lines of code.

In [60]:
# Write your code below this line

######### SOLUTION #########
# Keep only the countries that satisfy all six criteria (each condition is a boolean mask, combined with &)
happiness_df[(happiness_df['ladder score'] > 6.0) & (happiness_df['logged gdp per capita'] > 11.0) & (happiness_df['social support'] > 0.80) & (happiness_df['healthy life expectancy'] > 65.0) & (happiness_df['freedom to make life choices'] > 0.90) & (happiness_df['perceptions of corruption'] <= 0.60)]

######### SOLUTION END #########

Unnamed: 0,country name,regional indicator,ladder score,logged gdp per capita,social support,healthy life expectancy,freedom to make life choices,perceptions of corruption
0,Luxembourg,Western Europe,7.2375,11.450681,0.906912,72.599998,0.905636,0.367084
1,Singapore,Southeast Asia,6.3771,11.395521,0.910269,76.804581,0.926645,0.109784
3,United Arab Emirates,Middle East and North Africa,6.7908,11.109999,0.849181,67.082787,0.941346,0.594502
5,Norway,Western Europe,7.488,11.087804,0.952487,73.200783,0.95575,0.263218
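
To see how the combined mask in the solution is built, it can help to name the individual masks first. The sketch below applies only two of the six criteria to three rows copied from the Part I output; the variable names `high_freedom`, `low_corruption`, and `qualifying` are illustrative.

```python
import pandas as pd

# Three rows copied from the Part I output, restricted to two of the six criteria
sample_df = pd.DataFrame({
    "country name": ["Luxembourg", "Ireland", "Kuwait"],
    "freedom to make life choices": [0.905636, 0.886983, 0.872366],
    "perceptions of corruption": [0.367084, 0.357184, 0.760849],
})

# Each comparison yields a boolean Series with one True/False per row
high_freedom = sample_df["freedom to make life choices"] > 0.90
low_corruption = sample_df["perceptions of corruption"] <= 0.60

# & combines the masks element-wise; Python's `and` would raise a ValueError here
qualifying = sample_df[high_freedom & low_corruption]
print(qualifying["country name"].tolist())
```

When the comparisons are written inline, as in the solution, each one needs its own parentheses because `&` binds more tightly than `>` and `<=`.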


### Rubric

- +5 points for correctness
- +3 points for conciseness
- +2 points for comments

## Part V: Happy regions (20 points)

As you may have noticed, the data you loaded has a column indicating regions. We would like to compute the mean ladder score of each region. To accomplish that:

1. Identify all the unique regions in your data.
2. Compute the average ladder score per region.
3. Create a new dataframe with two columns **"region"** and **"ladder score"**.

Which region is the happiest as per the average ladder score?

In the next class, we will see a much more efficient method to accomplish this task.

Our reference solution is no more than 7 lines of code.


In [61]:
# Write your code below this line

######### SOLUTION #########
unique_regions = happiness_df['regional indicator'].unique() # identify the unique regions
mean_ladder_score = happiness_df.groupby('regional indicator')['ladder score'].mean() # compute the mean ladder score per region
mean_ladder_score_df = pd.DataFrame(mean_ladder_score) # create a dataframe from the mean ladder score

print(mean_ladder_score_df)
happiest_region, highest_ladder_score = mean_ladder_score.idxmax(), mean_ladder_score.max() # identify the happiest region and its mean score
print(f'The happiest region is {happiest_region} with a mean ladder score of {highest_ladder_score}')

######### SOLUTION END #########

                                    ladder score
regional indicator                              
Central and Eastern Europe              5.853844
Commonwealth of Independent States      5.358342
East Asia                               5.714850
Latin America and Caribbean             5.981786
Middle East and North Africa            5.227159
North America and ANZ                   7.173525
South Asia                              4.475443
Southeast Asia                          5.383367
Sub-Saharan Africa                      4.373358
Western Europe                          6.899219
The happiest region is North America and ANZ with a mean ladder score of 7.17352497575
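
The reference solution relies on `groupby`. The three listed steps can also be followed literally using only `unique()` and boolean masks, which yields a DataFrame with the two requested columns. The sketch below is one possible version, run on four rows copied from the Part I output; the means it prints describe that four-row subset only, not the full data.

```python
import pandas as pd

# Four rows copied from the Part I output; results below describe this subset only
subset_df = pd.DataFrame({
    "regional indicator": ["Western Europe", "Western Europe",
                           "Middle East and North Africa", "Middle East and North Africa"],
    "ladder score": [7.2375, 7.0937, 6.7908, 6.1021],
})

# Step 1: identify the unique regions
regions = subset_df["regional indicator"].unique()

# Step 2: build one boolean mask per region and average the matching rows
means = [subset_df.loc[subset_df["regional indicator"] == region, "ladder score"].mean()
         for region in regions]

# Step 3: assemble a DataFrame with the two requested columns
region_df = pd.DataFrame({"region": regions, "ladder score": means})
print(region_df)

# Region in the row with the highest mean ladder score
happiest = region_df.loc[region_df["ladder score"].idxmax(), "region"]
print(happiest)
```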


### Rubric

- +10 points for correctness
- +8 points for conciseness
- +2 points for comments
