# Course project guidelines

Your assignment for the course project is to formulate and answer a question of your choosing based on one of the following datasets:

1. ClimateWatch historical emissions data: greenhouse gas emissions by U.S. state 1990-present
2. World Happiness Report 2023: indices related to happiness and wellbeing by country 2008-present
3. Any dataset from the class assignments or mini projects

A good question is one that you want to answer. It should be a question with contextual meaning, not a purely technical matter. It should be clear enough to answer, but not so specific or narrow that your analysis is a single line of code. It should require you to do some nontrivial exploratory analysis, descriptive analysis, and possibly some statistical modeling. You aren't required to use any specific methods, but it should take a bit of work to answer the question. There may be multiple answers or approaches to contrast based on different ways of interpreting the question or different ways of analyzing the data. If your question is answerable in under 15 minutes, or your answer only takes a few sentences to explain, the question probably isn't nuanced enough.

## Deliverable

Prepare and submit a Jupyter notebook that summarizes your work. Your notebook should contain the following sections/contents:

* **Data description**: write up a short summary of the dataset you chose to work with following the conventions introduced in previous assignments. Cover the sampling if applicable and data semantics, but focus on providing high-level context and not technical details; don't report preprocessing steps or describe tabular layouts, etc.
* **Question of interest**: motivate and formulate your question; explain what a satisfactory answer might look like.
* **Data analysis**: provide a walkthrough with commentary of the steps you took to investigate and answer the question. This section can and should include code cells and text cells, but you should try to focus on presenting the analysis clearly by organizing cells according to the high-level steps in your analysis so that it is easy to skim. For example, if you fit a regression model, include formulating the explanatory variable matrix and response, fitting the model, extracting coefficients, and perhaps even visualization all in one cell; don't separate these into 5-6 substeps.
* **Summary of findings**: answer your question by interpreting the results of your analysis, referring back as appropriate. This can be a short paragraph or a bulleted list.

## Evaluation

Your work will be evaluated on the following criteria:

1. Thoughtfulness: does your question reflect some thoughtful consideration of the dataset and its nuances, or is it more superficial?
2. Thoroughness: is your analysis an end-to-end exploration, or are there a lot of loose ends or unexplained choices?
3. Mistakes or oversights: is your work free from obvious errors or omissions, or are there mistakes and things you've overlooked?
4. Clarity of write-up: is your report well-organized with commented code and clear writing, or does it require substantial effort to follow?

In [20]:
import numpy as np
import pandas as pd
import altair as alt
import matplotlib.pyplot as plt
import statsmodels.api as sm
# disable row limit for plotting
alt.data_transformers.disable_max_rows()
# uncomment to ensure graphics display with pdf export
# alt.renderers.enable('mimetype')

DataTransformerRegistry.enable('default')

In [21]:
# Load the World Happiness Report 2023 CSV file
world_happiness = pd.read_csv('data/world_happiness/whr-2023.csv')
world_happiness.shape

(2199, 11)

In [22]:
# Inspect the columns of world_happiness
world_happiness.columns

Index(['Country name', 'year', 'Life Ladder', 'Log GDP per capita',
       'Social support', 'Healthy life expectancy at birth',
       'Freedom to make life choices', 'Generosity',
       'Perceptions of corruption', 'Positive affect', 'Negative affect'],
      dtype='object')

Since `world_happiness` doesn't contain a region column, I decided to bring in a second dataset, a country mapping, and use it to assign each country to a region.

In [23]:
#csv file for country_mapping 
country_mapping = pd.read_csv('data/world_happiness/country_mapping.csv')
country_mapping.shape

(249, 11)

In [24]:
# Inspect the columns of country_mapping
country_mapping.columns 

Index(['name', 'alpha-2', 'alpha-3', 'country-code', 'iso_3166-2', 'region',
       'sub-region', 'intermediate-region', 'region-code', 'sub-region-code',
       'intermediate-region-code'],
      dtype='object')

In [25]:
# Columns to keep from the country mapping
country_mapping_col = ['name','region','intermediate-region']

# Keep only the country name, region, and intermediate region
region = country_mapping[country_mapping_col].copy()
region.columns

Index(['name', 'region', 'intermediate-region'], dtype='object')

In [26]:
#rename some of the columns in world happiness and region
world_happiness = world_happiness.rename({'Country name': 'country',
                                          'Life Ladder' : 'life_ladder',
                                         'Log GDP per capita' : 'GDP',
                                          'Social support' : 'social_support',
                                         'Healthy life expectancy at birth': 'life_expectancy',
                                         'Freedom to make life choices' : 'freedom',
                                          'Generosity' : 'generosity',
                                         'Perceptions of corruption' : 'corruption',
                                         'Positive affect': 'positive',
                                         'Negative affect' : 'negative'}, axis = 1)
region = region.rename({'name': 'country',
                       'region': 'region',
                        'intermediate-region': 'sub_region'}, axis = 1)

#Display
print(world_happiness.head())
print(region.head())

       country  year  life_ladder    GDP  social_support  life_expectancy   
0  Afghanistan  2008        3.724  7.350           0.451             50.5  \
1  Afghanistan  2009        4.402  7.509           0.552             50.8   
2  Afghanistan  2010        4.758  7.614           0.539             51.1   
3  Afghanistan  2011        3.832  7.581           0.521             51.4   
4  Afghanistan  2012        3.783  7.661           0.521             51.7   

   freedom  generosity  corruption  positive  negative  
0    0.718       0.168       0.882     0.414     0.258  
1    0.679       0.191       0.850     0.481     0.237  
2    0.600       0.121       0.707     0.517     0.275  
3    0.496       0.164       0.731     0.480     0.267  
4    0.531       0.238       0.776     0.614     0.268  
          country   region       sub_region
0     Afghanistan     Asia              NaN
1   Åland Islands   Europe              NaN
2         Albania   Europe              NaN
3         Algeria  

In [27]:
# Merge the two datasets on their shared country names
continents_happy = pd.merge(world_happiness, region, 
                 on = 'country', how = 'left')
continents_happy.head()
                 

| | country | year | life_ladder | GDP | social_support | life_expectancy | freedom | generosity | corruption | positive | negative | region | sub_region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 2008 | 3.724 | 7.35 | 0.451 | 50.5 | 0.718 | 0.168 | 0.882 | 0.414 | 0.258 | Asia | NaN |
| 1 | Afghanistan | 2009 | 4.402 | 7.509 | 0.552 | 50.8 | 0.679 | 0.191 | 0.85 | 0.481 | 0.237 | Asia | NaN |
| 2 | Afghanistan | 2010 | 4.758 | 7.614 | 0.539 | 51.1 | 0.6 | 0.121 | 0.707 | 0.517 | 0.275 | Asia | NaN |
| 3 | Afghanistan | 2011 | 3.832 | 7.581 | 0.521 | 51.4 | 0.496 | 0.164 | 0.731 | 0.48 | 0.267 | Asia | NaN |
| 4 | Afghanistan | 2012 | 3.783 | 7.661 | 0.521 | 51.7 | 0.531 | 0.238 | 0.776 | 0.614 | 0.268 | Asia | NaN |


In [28]:
# Count missing values in each column
continents_happy.isnull().sum()

country               0
year                  0
life_ladder           0
GDP                  20
social_support       13
life_expectancy      54
freedom              33
generosity           73
corruption          116
positive             24
negative             16
region              296
sub_region         1413
dtype: int64
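
The 296 missing `region` values most likely come from country names in the happiness data that have no exact match in the mapping file. A minimal sketch of how one might list those names before dropping them, assuming small toy frames in place of `world_happiness` and `region` (the rows, the `name_fixes` dictionary, and the spellings in it are hypothetical):

```python
import pandas as pd

# Toy stand-ins for world_happiness and region (values are illustrative only)
happiness = pd.DataFrame({
    'country': ['Algeria', 'Algeria', 'Turkiye', 'Kosovo'],
    'year': [2010, 2011, 2010, 2010],
    'life_ladder': [5.5, 5.3, 5.5, 5.2],
})
mapping = pd.DataFrame({
    'country': ['Algeria', 'Turkey'],
    'region': ['Africa', 'Asia'],
})

# indicator=True adds a _merge column recording where each row was found
checked = happiness.merge(mapping, on='country', how='left', indicator=True)

# Country names present in the happiness data but absent from the mapping
unmatched = sorted(checked.loc[checked['_merge'] == 'left_only', 'country'].unique())
print(unmatched)

# Hypothetical fix: harmonize a mismatched spelling, then merge again
name_fixes = {'Turkiye': 'Turkey'}
happiness['country'] = happiness['country'].replace(name_fixes)
rechecked = happiness.merge(mapping, on='country', how='left', indicator=True)
still_unmatched = sorted(
    rechecked.loc[rechecked['_merge'] == 'left_only', 'country'].unique()
)
print(still_unmatched)
```

Listing the unmatched names first makes it possible to decide, country by country, whether to fix a spelling or to accept that the row will be excluded from the regional comparison.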

In [29]:
#Group happiness mean by region and life Ladder(happiness level)
mean_region_ladder = continents_happy.groupby('region')[['life_ladder']].mean()
mean_region_ladder = mean_region_ladder.sort_values(by=['life_ladder'],
                                      ascending = False)
print("==========================")
print(mean_region_ladder)
print("==========================")

#Group happiness mean by region and Positive affect  
mean_region_positive = continents_happy.groupby('region')[['positive']].mean()
mean_region_positive = mean_region_positive.sort_values(by=['positive'],
                                      ascending = False)
print(mean_region_positive)
print("==========================")

#Group happiness mean by region and Social support
mean_region_social = continents_happy.groupby('region')[['social_support']].mean()
mean_region_social = mean_region_social.sort_values(by=['social_support'],
                                      ascending = False)
print(mean_region_social)
print("==========================")

#Group happiness mean by region and Perceptions of corruption 
mean_region_corruption = continents_happy.groupby('region')[['corruption']].mean()
mean_region_corruption = mean_region_corruption.sort_values(by=['corruption'],
                                      ascending = False)
print(mean_region_corruption)
print("==========================")


#Group happiness mean by region and Negative affect
mean_region_negative = continents_happy.groupby('region')[['negative']].mean()
mean_region_negative = mean_region_negative.sort_values(by=['negative'],
                                      ascending = False)
print(mean_region_negative)
print("==========================")




          life_ladder
region               
Oceania      7.267250
Europe       6.221421
Americas     6.090898
Asia         5.285452
Africa       4.388771
          positive
region            
Americas  0.763287
Oceania   0.759156
Europe    0.644474
Africa    0.637625
Asia      0.613504
          social_support
region                  
Oceania         0.949125
Europe          0.894482
Americas        0.854224
Asia            0.788653
Africa          0.707536
          corruption
region              
Africa      0.787349
Americas    0.764198
Asia        0.758783
Europe      0.707026
Oceania     0.346125
          negative
region            
Africa    0.288689
Americas  0.282290
Asia      0.269830
Europe    0.252111
Oceania   0.208750
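
The five regional summaries repeat the same group-then-average step. As a sketch, they could be computed in a single `groupby` call and then sorted by whichever measure is of interest; the toy frame below stands in for `continents_happy` and its values are illustrative only:

```python
import pandas as pd

# Toy stand-in for continents_happy (values are illustrative only)
toy = pd.DataFrame({
    'region': ['Africa', 'Africa', 'Europe', 'Europe'],
    'life_ladder': [4.0, 5.0, 6.0, 7.0],
    'positive': [0.6, 0.7, 0.6, 0.8],
    'negative': [0.3, 0.2, 0.2, 0.3],
})

measures = ['life_ladder', 'positive', 'negative']

# One groupby call yields every regional mean in a single table
region_means = toy.groupby('region')[measures].mean()

# Sort by the column of interest, highest first
by_ladder = region_means.sort_values('life_ladder', ascending=False)
print(by_ladder)
```

Keeping the means in one table also makes it easier to compare measures side by side for the same region.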


In [30]:
# Drop rows whose country has no region assigned after the merge
continents_happy = continents_happy.dropna(subset=['region'])

#Plot the comparisons of continents
fig_1 = alt.Chart(
    continents_happy.reset_index()
).mark_boxplot(
    outliers=True,
    size=7,
).encode(
    x=alt.X('life_ladder:Q', title = 'Happiness Levels'),
    y=alt.Y('region:N', 
            sort=['Oceania', 'Europe', 'Americas', 'Asia', 'Africa'],
            title = 'Region'), color='region:N'
).properties(
    width=600
).configure_axis(
    labelFontSize=16,
    titleFontSize=16
).configure_legend(
    labelFontSize=16,
    titleFontSize=16
).configure_title(
    fontSize=16
)

fig_1

In [31]:
# Subset the data to African countries
african_dataset = continents_happy[continents_happy["region"] == "Africa"]
african_dataset.head()

| | country | year | life_ladder | GDP | social_support | life_expectancy | freedom | generosity | corruption | positive | negative | region | sub_region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 29 | Algeria | 2010 | 5.464 | 9.306 | NaN | 65.5 | 0.593 | -0.21 | 0.618 | NaN | NaN | Africa | Northern Africa |
| 30 | Algeria | 2011 | 5.317 | 9.316 | 0.81 | 65.6 | 0.53 | -0.185 | 0.638 | 0.503 | 0.255 | Africa | Northern Africa |
| 31 | Algeria | 2012 | 5.605 | 9.33 | 0.839 | 65.7 | 0.587 | -0.177 | 0.69 | 0.54 | 0.23 | Africa | Northern Africa |
| 32 | Algeria | 2014 | 6.355 | 9.355 | 0.818 | 65.9 | NaN | NaN | NaN | 0.558 | 0.177 | Africa | Northern Africa |
| 33 | Algeria | 2016 | 5.341 | 9.383 | 0.749 | 66.1 | NaN | NaN | NaN | 0.565 | 0.377 | Africa | Northern Africa |


In [32]:
# Mean life ladder score for each African country, highest first
mean_african_ladder = african_dataset.groupby('country')[['life_ladder']].mean()
mean_african_ladder = mean_african_ladder.sort_values(by=['life_ladder'],
                                      ascending = False)
mean_african_ladder


| country | life_ladder |
|---|---|
| Mauritius | 5.859667 |
| Libya | 5.545667 |
| Algeria | 5.3774 |
| Somalia | 5.183333 |
| Morocco | 5.01 |
| Nigeria | 4.968929 |
| South Africa | 4.918375 |
| Djibouti | 4.8225 |
| Ghana | 4.756118 |
| Mozambique | 4.7387 |


In [33]:
# Create horizontal box plots using Altair for african sub regions
boxplot_a = alt.Chart(african_dataset).mark_boxplot().encode(
    y=alt.Y('sub_region:N',
            sort=['Northern Africa', 'Southern Africa', 'Middle Africa', 'Western Africa', 'Eastern Africa'], 
            title='Sub Region'),
    x=alt.X('GDP:Q',title='GDP'),
    color=alt.Color('sub_region:N',
                    sort=['Northern Africa', 'Southern Africa', 'Middle Africa', 'Western Africa', 'Eastern Africa'],
                    title='Sub Region')
).properties(
    width=200, height=200,
    title='Box Plot of GDP by Sub Region'
)

boxplot_b = alt.Chart(african_dataset).mark_boxplot().encode(
    y=alt.Y('sub_region:N',
            sort=['Northern Africa', 'Southern Africa', 'Middle Africa', 'Western Africa', 'Eastern Africa'],  
            title='Sub Region'),
    x=alt.X('social_support:Q',title='Social Support'),
    color=alt.Color('sub_region:N',
                    sort=['Northern Africa', 'Southern Africa', 'Middle Africa', 'Western Africa', 'Eastern Africa'],
                    title='Sub Region')
).properties(
    width=200, height=200,
    title='Box Plot of Social Support by Sub Region'
)

boxplot_c = alt.Chart(african_dataset).mark_boxplot().encode(
    y=alt.Y('sub_region:N',
            sort=['Northern Africa', 'Southern Africa', 'Middle Africa', 'Western Africa', 'Eastern Africa'], 
            title='Sub Region'),
    x=alt.X('life_expectancy:Q', title='Life Expectancy'),
    color=alt.Color('sub_region:N',
                    sort=['Northern Africa', 'Southern Africa', 'Middle Africa', 'Western Africa', 'Eastern Africa'],
                    title='Sub Region')
).properties(
    width=200, height=200,
    title='Box Plot of Life Expectancy by Sub Region'
)

boxplot_d = alt.Chart(african_dataset).mark_boxplot().encode(
    y=alt.Y('sub_region:N', 
            sort=['Northern Africa', 'Southern Africa', 'Middle Africa', 'Western Africa', 'Eastern Africa'],
            title='Sub Region'),
    x=alt.X('freedom:Q', title='Freedom'),
    color=alt.Color('sub_region:N', 
                    sort=['Northern Africa', 'Southern Africa', 'Middle Africa', 'Western Africa', 'Eastern Africa'],
                    title='Sub Region')
).properties(
    width=200, height=200,
    title='Box Plot of Freedom by Sub Region'
)

boxplot_e = alt.Chart(african_dataset).mark_boxplot().encode(
    y=alt.Y('sub_region:N', 
            sort=['Northern Africa', 'Southern Africa', 'Middle Africa', 'Western Africa', 'Eastern Africa'],
            title='Sub Region'),
    x=alt.X('generosity:Q',title='Generosity'),
    color=alt.Color('sub_region:N',
                    sort=['Northern Africa', 'Southern Africa', 'Middle Africa', 'Western Africa', 'Eastern Africa'],
                    title='Sub Region')
).properties(
    width=200, height=200,
    title='Box Plot of Generosity by Sub Region'
)

boxplot_f = alt.Chart(african_dataset).mark_boxplot().encode(
    y=alt.Y('sub_region:N',
            sort=['Northern Africa', 'Southern Africa', 'Middle Africa', 'Western Africa', 'Eastern Africa'], 
            title='Sub Region'),
    x=alt.X('corruption:Q',title='Corruption'),
    color=alt.Color('sub_region:N',
                    sort=['Northern Africa', 'Southern Africa', 'Middle Africa', 'Western Africa', 'Eastern Africa'],
                    title='Sub Region')
).properties(
    width=200, height=200,
    title='Box Plot of Corruption by Sub Region'
)


# Combine the boxplot subplots in a 3x2 grid
fig_3 = alt.hconcat(
    alt.vconcat(boxplot_a, boxplot_b, boxplot_c, spacing=20),
    alt.vconcat(boxplot_d, boxplot_e, boxplot_f, spacing=20),
    spacing=40
)

fig_3

In [34]:
#boxplot of sub regions
fig_2 = alt.Chart(
    african_dataset
).mark_boxplot(
    outliers=True,
    size=7,
).encode(
    x=alt.X('life_ladder:Q', title='Happiness Levels'),
    y=alt.Y('sub_region:N', 
            sort=['Northern Africa', 'Southern Africa', 'Middle Africa', 'Western Africa', 'Eastern Africa'], 
            title='Sub Region'),
    color=alt.Color('sub_region:N', legend=None)
).properties(
    width=600
).configure_axis(
    labelFontSize=16,
    titleFontSize=16
).configure_title(
    fontSize=16
)
fig_2

In [35]:
# Formulate the explanatory variable matrix X and response variable y
X = african_dataset[['GDP', 'freedom', 'social_support', 'generosity', 'life_expectancy', 'corruption']]
y = african_dataset['life_ladder']

# Drop rows with missing values in X and y
valid_indices = (~X.isnull()).all(axis=1) & (~y.isnull())
X = X[valid_indices]
y = y[valid_indices]

# Add a constant term to the explanatory variables
X = sm.add_constant(X)

# Fit the regression model
model = sm.OLS(y, X)
results = model.fit()

# Extract the coefficients
coefficients = results.params[1:]

# Create a coefficient table
coef_tbl = pd.DataFrame({
    'estimate': coefficients.values,
    'std err': results.bse[1:]
})
coef_tbl.loc['Error variance', 'estimate'] = results.scale

# Display the coefficient table
coef_tbl


| | estimate | std err |
|---|---|---|
| GDP | 0.260259 | 0.047772 |
| freedom | 0.76257 | 0.231819 |
| social_support | 1.233546 | 0.242708 |
| generosity | 1.027552 | 0.260486 |
| life_expectancy | 0.018879 | 0.006247 |
| corruption | 0.321213 | 0.22215 |
| Error variance | 0.292441 | NaN |
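
To judge which estimates are distinguishable from zero, each can be compared with its standard error. The sketch below reuses the reported estimates and standard errors and forms approximate 95% intervals as estimate ± 1.96 standard errors. The normal approximation is an assumption made here for simplicity; the fitted statsmodels results object also offers `conf_int()`, which uses the t distribution.

```python
import pandas as pd

# Estimates and standard errors copied from the coefficient table above
coef = pd.DataFrame(
    {
        'estimate': [0.260259, 0.76257, 1.233546, 1.027552, 0.018879, 0.321213],
        'std_err': [0.047772, 0.231819, 0.242708, 0.260486, 0.006247, 0.22215],
    },
    index=['GDP', 'freedom', 'social_support', 'generosity',
           'life_expectancy', 'corruption'],
)

# Normal-approximation 95% interval: estimate +/- 1.96 standard errors
z = 1.96
coef['lower'] = coef['estimate'] - z * coef['std_err']
coef['upper'] = coef['estimate'] + z * coef['std_err']

# Flag the intervals that do not contain zero
coef['excludes_zero'] = (coef['lower'] > 0) | (coef['upper'] < 0)
print(coef.round(3))
```

Under this approximation, the interval for `corruption` spans zero while the other five do not, so the corruption estimate should be interpreted with more caution.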


In [36]:
# Define a color scale with a range of colors
color_scale = alt.Scale(scheme='set3')

# Visualize the coefficients; drop the 'Error variance' row, which is not a coefficient
chart = alt.Chart(coef_tbl.drop(index='Error variance').reset_index()).mark_bar().encode(
    x=alt.X('estimate:Q', axis=alt.Axis(title='Estimate')),
    y=alt.Y('index:N', axis=alt.Axis(title='Variable'), sort='-x'),
    color=alt.Color('index:N', scale=color_scale)
)
chart.configure_axis(
    labelFontSize=12,
    titleFontSize=14
).configure_legend(
    labelFontSize=12,
    titleFontSize=14
).configure_title(
    fontSize=16
).properties(
    width=400,
    height=300
)
