This notebook contains the exploratory data analysis (EDA) for Project 2 of DATASCI 200, which examines the influence of geographic and socioeconomic factors on reported happiness. The report draws on two datasets: 
1\) 2005-Present World Happiness Report 
2\) S&P 500 Index Market Trend Data

We start by loading the two datasets:

In [38]:
# Load in both datasets as pandas data frames

# Before we do that, we must import the proper libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load in the World Happiness Report dataset
world_happiness = pd.read_csv('WorldHappinessReport.csv')

# Load in the S&P500 Index Trend Data
SP_Data = pd.read_csv('S&P500Data.csv')

In [39]:
# Check the size of the World Happiness report data
print(world_happiness.shape)

(2199, 13)


In [40]:
# Examine the World Happiness report data
print(world_happiness.head())

  Country Name Regional Indicator  Year  Life Ladder  Log GDP Per Capita  \
0  Afghanistan         South Asia  2008     3.723590            7.350416   
1  Afghanistan         South Asia  2009     4.401778            7.508646   
2  Afghanistan         South Asia  2010     4.758381            7.613900   
3  Afghanistan         South Asia  2011     3.831719            7.581259   
4  Afghanistan         South Asia  2012     3.782938            7.660506   

   Social Support  Healthy Life Expectancy At Birth  \
0        0.450662                         50.500000   
1        0.552308                         50.799999   
2        0.539075                         51.099998   
3        0.521104                         51.400002   
4        0.520637                         51.700001   

   Freedom To Make Life Choices  Generosity  Perceptions Of Corruption  \
0                      0.718114    0.167652                   0.881686   
1                      0.678896    0.190809                   0.

In [41]:
# Check the size of the S&P500 dataset
print(SP_Data.shape)

(2541, 6)


In [42]:
# Examine the S&P500 Index Trend Data
print(SP_Data.head())

         Date  Close/Last Volume     Open     High      Low
0  04/14/2023     4137.64     --  4140.11  4163.19  4113.20
1  04/13/2023     4146.22     --  4100.04  4150.26  4099.40
2  04/12/2023     4091.95     --  4121.72  4134.37  4086.94
3  04/11/2023     4108.94     --  4110.29  4124.26  4102.61
4  04/10/2023     4109.11     --  4085.20  4109.50  4072.55


In [43]:
# The Volume column holds only the placeholder '--' in the rows shown, so we remove that column
SP_Data = SP_Data.drop('Volume', axis=1)

# Verify that column was dropped
print(SP_Data.head())

         Date  Close/Last     Open     High      Low
0  04/14/2023     4137.64  4140.11  4163.19  4113.20
1  04/13/2023     4146.22  4100.04  4150.26  4099.40
2  04/12/2023     4091.95  4121.72  4134.37  4086.94
3  04/11/2023     4108.94  4110.29  4124.26  4102.61
4  04/10/2023     4109.11  4085.20  4109.50  4072.55
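
The preview only shows five rows, so it may be worth confirming that the column is empty everywhere before dropping it. The following is a minimal sketch on a hypothetical three-row stand-in for the S&P 500 data (the `sample` frame and the `volume_is_empty` name are illustrative, not part of the notebook); it assumes that `'--'` is the only placeholder used.

```python
import pandas as pd

# Hypothetical stand-in for the first rows of the S&P 500 data shown above
sample = pd.DataFrame({
    'Date': ['04/14/2023', '04/13/2023', '04/12/2023'],
    'Close/Last': [4137.64, 4146.22, 4091.95],
    'Volume': ['--', '--', '--'],
})

# True only if every entry in the column is the '--' placeholder
volume_is_empty = bool((sample['Volume'] == '--').all())
print(volume_is_empty)

# Drop the column only when it carries no information
if volume_is_empty:
    sample = sample.drop(columns='Volume')
print(list(sample.columns))
```

On the real data, the same check would be run against `SP_Data['Volume']` before the `drop` call.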


We notice that the two datasets have different numbers of rows. This does not necessarily mean that their date ranges differ (although we should check this), because the World Happiness Report is tabulated by country and by year, whereas the S&P 500 data has one row per trading day. 
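
To illustrate why the row count alone says little about the date range, here is a small sketch on a hypothetical country-by-year table. The `toy_happiness` frame and its values are made up for illustration; only the column names `Country Name`, `Year`, and `Life Ladder` come from the dataset preview above.

```python
import pandas as pd

# Hypothetical miniature of the World Happiness layout: one row per country per year
toy_happiness = pd.DataFrame({
    'Country Name': ['Country A', 'Country A', 'Country B', 'Country B', 'Country B'],
    'Year': [2008, 2009, 2008, 2009, 2010],
    'Life Ladder': [4.0, 4.2, 5.1, 5.3, 5.0],
})

# Number of rows recorded for each year (one per reporting country)
rows_per_year = toy_happiness.groupby('Year').size()

# Number of distinct countries and the span of years covered
n_countries = toy_happiness['Country Name'].nunique()
year_range = (int(toy_happiness['Year'].min()), int(toy_happiness['Year'].max()))

print(rows_per_year)
print(n_countries, year_range)
```

Applied to `world_happiness`, the same three summaries would show how the 2199 rows are distributed across countries and years.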

In [44]:
# Now we will filter the data so that the date ranges are consistent across the two datasets. 
# To do so, we first convert the Date column of the S&P 500 data to datetime values:
SP_Data['Date'] = pd.to_datetime(SP_Data['Date'])
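
The dates in the preview above appear to be in MM/DD/YYYY form. As a hedged alternative to letting pandas infer the layout, the format can be stated explicitly, which guards against day and month being swapped. The sketch below uses a hypothetical `raw_dates` series; the `errors='coerce'` step is an optional extra that turns unparseable entries into `NaT` so they can be counted.

```python
import pandas as pd

# Hypothetical sample of date strings in the layout shown in the preview
raw_dates = pd.Series(['04/14/2023', '04/13/2023', '04/12/2023'])

# Parse with an explicit month/day/year format
parsed = pd.to_datetime(raw_dates, format='%m/%d/%Y')
print(parsed.dt.year.tolist())

# With errors='coerce', entries that do not match the format become NaT
messy_dates = pd.Series(['04/14/2023', 'not a date'])
checked = pd.to_datetime(messy_dates, format='%m/%d/%Y', errors='coerce')
n_unparsed = int(checked.isna().sum())
print(n_unparsed)
```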

In [45]:
# Since the World Happiness dataset starts in 2005, we keep only the S&P 500 rows dated on or after January 1, 2005:
cutoff_date = pd.to_datetime('2005-01-01')
SP_Data = SP_Data.loc[SP_Data['Date'] >= cutoff_date]
print(SP_Data.shape)

(2541, 5)


In [46]:
# Since the filter in the previous cell did not change the size of the data, the S&P 500 dataset must start later than 2005.
# We will first identify the earliest date contained in the S&P 500 dataset
print(SP_Data['Date'].min())

2013-04-16 00:00:00
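
One way to turn this observation into a cutoff is to work out the first and last full calendar years covered by the daily series. The sketch below is illustrative only: the `sp_dates` series is a hypothetical stand-in built from the earliest and latest dates printed in this notebook, and the one-week margins around the new year are an assumed rule of thumb (markets are closed on January 1).

```python
import pandas as pd

# Hypothetical stand-in holding the earliest and latest dates seen in the S&P 500 data
sp_dates = pd.Series(pd.to_datetime(['2013-04-16', '2018-06-01', '2023-04-14']))

first_date = sp_dates.min()
last_date = sp_dates.max()

# Treat the first year as partial if the data starts after the first week of January
starts_mid_year = first_date > pd.Timestamp(year=first_date.year, month=1, day=7)
first_full_year = first_date.year + 1 if starts_mid_year else first_date.year

# Treat the last year as partial if the data ends before the last week of December
ends_mid_year = last_date < pd.Timestamp(year=last_date.year, month=12, day=24)
last_full_year = last_date.year - 1 if ends_mid_year else last_date.year

print(first_full_year, last_full_year)
```

The same reasoning suggests checking the end of the range as well as the start when full-year intervals matter.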


In [47]:
# Since the earliest date in the S&P 500 dataset is in April 2013, which is not the start of a year, we will remove all data prior to 2014 
# from both datasets so that both cover the same timeframe in full-year intervals:
cutoff_date_new = pd.to_datetime('2014-01-01')
SP_Data = SP_Data.loc[SP_Data['Date'] >= cutoff_date_new]
print(SP_Data.shape)

(2355, 5)


We see that this reduced the size of the dataset. Now we will filter the World Happiness data to cover the same range:

In [50]:
# We will remove all rows of the World Happiness dataset that fall before 2014:
cutoff_year = 2014
world_happiness = world_happiness.loc[world_happiness['Year'] >= cutoff_year]
print(world_happiness.shape)

(1210, 13)


We see that this has significantly reduced the size of the dataset, from 2199 rows to 1210. Both datasets now start in 2014, so we can compare them more effectively. Next we will explore the data from both datasets in greater detail: 
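
Because the S&P 500 data is daily and the World Happiness data is yearly, one possible way to line them up is to aggregate the index to one value per year and join on `Year`. This is a minimal sketch on made-up numbers, not the approach the notebook necessarily takes: the `daily`, `happiness`, `yearly`, and `combined` frames and the `Mean Close` column name are hypothetical, and the annual mean is only one of several reasonable summaries.

```python
import pandas as pd

# Hypothetical daily index values (made-up numbers)
daily = pd.DataFrame({
    'Date': pd.to_datetime(['2014-01-02', '2014-12-31', '2015-01-02', '2015-12-31']),
    'Close/Last': [100.0, 110.0, 108.0, 120.0],
})

# Hypothetical country-by-year happiness scores (made-up numbers)
happiness = pd.DataFrame({
    'Country Name': ['Country A', 'Country A', 'Country B'],
    'Year': [2014, 2015, 2014],
    'Life Ladder': [5.0, 5.2, 6.1],
})

# Collapse the daily series to one mean closing value per calendar year
yearly = (
    daily.assign(Year=daily['Date'].dt.year)
         .groupby('Year', as_index=False)['Close/Last']
         .mean()
         .rename(columns={'Close/Last': 'Mean Close'})
)

# Attach the yearly index value to every country-year row
combined = happiness.merge(yearly, on='Year', how='left')
print(combined)
```

A left join keeps every country-year row even if a year has no index data, which makes gaps visible as missing values instead of silently dropping rows.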
