# COGS 108 - Final Project: Exploring Happiness in the United States

# Overview

For this project we were aiming to analyze the different factors that affect happiness in the US. We wanted to determine if happiness can be measured reasonably well using just 5 parameters: 

1) GDP per capita
2) Life expectancy
3) Employment
4) Social progress
5) Gini coefficient (income distribution)

We obtained data for each parameter using web scraping and data wrangling. Next, we used simple linear regression to compare each parameter with the WalletHub happiness index (the Happy State Index, HSI), which was used as a reference point. Using the coefficient of determination (r^2) for each parameter, we built a weighted formula A * GDP + B * Life_Expectancy + C * Employment + D * Social_Progress + E * Gini_Coefficient, where the weights A, B, C, D, E are the r^2 values for each parameter, normalized to sum to 1. Finally, we regressed the HSI on our index in order to determine how close/far our model was from the HSI. In the end, we got a result of r^2 = 0.65, which indicates a moderately well-fitting model, with some discrepancies. 

# Names

- Max Kazakov
- Shrimant Singh
- Egbert Doan
- Eric Salguero

# Group Members IDs

- A14931159
- A15312953
- A14243004
- A15316485

(respectively)

# Research Question

Our research question can be summarized as follows: 

### "Are 5 parameters (GDP per capita, life expectancy, employment, social progress, and Gini coefficient (income distribution)) sufficient to produce a *reasonable* model of happiness across the 50 United States of America?" 

Please note that in our view a *reasonable* model is one that has an r^2 value >= 0.75. One may argue that this cut-off level is too low; however, for the purpose of the present analysis we deemed it adequate, since we are looking for a compromise between the model's simplicity and predictive capacity. 

## Background and Prior Work

For background work we started out looking at the [World Happiness Index](https://worldhappiness.report/); however, we realized that to narrow down our research project it would be better to limit ourselves to the US. Essentially, we attempted to recreate a country-by-country comparison on a state-by-state level. We nevertheless explored the methodology of the World Happiness Report, which inspired us to use some of the parameters, e.g. GDP per capita. 

In terms of existing work on the subject matter, we were able to find this [report](https://wallethub.com/edu/happiest-states/6959/) by WalletHub. Below is an excerpt from their methodology.  

"In this study, WalletHub drew upon the findings of “happiness” research to determine which environmental factors are linked to a person’s overall well-being and satisfaction with life. Previous studies have found that good economic, emotional, physical and social health are all key to a well-balanced and fulfilled life.

To determine where Americans exhibit the best combination of these factors, we examined the 50 states across 31 key metrics, ranging from depression rate to sports participation rate to income growth. Read on for our findings, additional insight from a panel of experts and a full description of our methodology."

What is key here is that WalletHub used 31 metrics across three main dimensions:

1) Emotional & Physical Well-Being
2) Work Environment 
3) Community & Environment

Our goal was to come up with a Happiness Index that would only use 5 metrics, roughly 6 times simpler than WalletHub's. However, we cross-referenced WalletHub's index along the way in order to determine the coefficients for our model, as well as in the very end when verifying our end result.  

# Hypothesis

Our hypothesis was that *a model consisting of 5 metrics can produce an r^2 value of 0.75 or higher*. Therefore we expect a positive answer to the research question. Namely, that the five selected metrics (GDP per capita, life expectancy, employment, social progress, and Gini coefficient (income distribution)) are enough to produce a model that yields an r^2 >= 0.75 when a simple linear regression is performed with the Happy State Index as the dependent variable.

We believe that our model can perform this well because the selected parameters cover a wide range of factors affecting one's perception of happiness. Moreover, the 5 parameters were not selected at random; they were chosen through both our common-sense intuition and background research of other happiness indices (of global and national scale). 

The opposite outcome would be if the model fails to reach the 0.75 threshold. That would mean that either 5 parameters do not sufficiently explain the state of happiness in the US, or that our model is overly simple. The latter would merit a reevaluation of the model and better fine-tuning of the coefficients in possible future iterations of the research project. 

# Datasets

- Dataset Name: List of U.S. states by GDP per capita
- Link to the dataset: https://en.wikipedia.org/wiki/List_of_U.S._states_by_GDP_per_capita
- Number of observations: 51

This dataset shows the GDP per capita of the 50 U.S. states and also includes the District of Columbia. For our observations we have omitted the District of Columbia to only observe the 50 U.S. states.

This dataset will be used alongside four other datasets to create a weighted formula for linear regression. This will allow us to see what is affecting happiness in the US. We expect GDP per capita to affect happiness in a state, since it reflects the state's economic output per resident.

- Dataset Name: Happiest States in the U.S.
- Link to the dataset: https://wallethub.com/edu/happiest-states/6959/
- Number of observations: 50

This dataset shows the Happiness Index used by WalletHub for the 50 U.S. states. Using this metric we will define our own index and compare.

This dataset will be compared to our linear regression model to see how accurate our model is with the factors and weights we have calculated.

- Dataset Name: List of U.S. states and territories by life expectancy
- Link to the dataset: https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_life_expectancy
- Number of observations: 56

This dataset shows the life expectancy of the 50 U.S. states and also includes U.S. territories. For our observations we have omitted the U.S. territories to only observe the 50 U.S. states.

This dataset will be used alongside four other datasets to create a weighted formula for linear regression. We believe that life expectancy will be a factor in happiness, as having a longer life can be said to improve happiness.

- Dataset Name: Unemployment Rates for States, 2018 Annual Averages
- Link to the dataset: https://www.bls.gov/lau/lastrk18.htm
- Number of observations: 50

This dataset shows the unemployment rates of the 50 U.S. states. We will use 100 - unemployment from this dataset to get values for the employment rate (the share of the labor force that is employed).

This dataset will be used alongside four other datasets to create a weighted formula for linear regression. Using 100 - unemployment gives us the employment rate as a factor in our model. The two rates carry the same information, but unemployment is negatively correlated with happiness while employment is positively correlated. Since our index is a sum of positively weighted terms, the positively correlated form is the one we need.

- Dataset Name: US Social Progress Index Results Public
- Link to the dataset: https://socialprogress.blog/2018-social-progress-index-us-states/
- Number of observations: 52

This dataset shows another index that takes into account social aspects like crime rates, medical care and other quality of life factors.

This dataset will be used alongside four other datasets to create a weighted formula for linear regression. This index takes into account a variety of social aspects that can weigh on happiness. Aspects like clean water and broadband connection availability are included in this index.

- Dataset Name: List of U.S. states by Gini coefficient
- Link to the dataset: https://en.wikipedia.org/wiki/List_of_U.S._states_by_Gini_coefficient
- Number of observations: 52

The Gini coefficient measures the income gap in a state by comparing the low income class with the high income class. This dataset shows the Gini coefficient of the 50 U.S. states and also includes the District of Columbia and the United States as a whole. For our observations we have omitted the District of Columbia and the nationwide entry to only observe the 50 U.S. states.

This dataset will be used alongside four other datasets to create a weighted formula for linear regression. This dataset looks at the income gap, which can affect happiness negatively for the low income class and positively for the high income class.

# Data Analysis

### Data Cleaning & Pre-processing

Apart from the standard data science toolkit, sklearn was used for Linear Regression models, and seaborn was used for better visualizations. Plotly was used in offline mode for choropleth plots of geospatial data. Patsy and the statsmodels API were used to carry out the final OLS testing and determine the r^2 value between our model and the Happy State Index. Scipy was used for descriptive statistics.

In [1]:
import pandas as pd
import numpy as np
import requests
import bs4
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
import sklearn
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression
import seaborn as sns
# plotly.plotly moved to the separate chart_studio package in plotly 4 and is unused here, so it is not imported.
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import patsy
import statsmodels.api as sm
from scipy import stats

Initiating Plotly notebook mode for offline Plotly plots (used for Data Visualization with Choropleth plots).

In [2]:
init_notebook_mode(connected=True)

Set seaborn aesthetic parameters for notebook.

In [3]:
sns.set()

## Web scraping and data wrangling 

### 1. GDP per capita 2018

For GDP, the Wikipedia page was scraped with little data cleaning required, since Wikipedia tables are usually fairly clean as-is.

In [4]:
wiki = 'https://en.wikipedia.org/wiki/List_of_U.S._states_by_GDP_per_capita'
page = requests.get(wiki)
soup = BeautifulSoup(page.content, 'html.parser')

In [5]:
title_page = soup.title.string
assert 'GDP' in title_page

In [6]:
right_table = soup.find("table", class_='wikitable sortable')
assert isinstance(right_table, bs4.element.Tag)

In [7]:
states, GDP = [], []

for row in right_table.findAll('tr'):
    
    cells = row.findAll('td')
    

    if len(cells) == 0 or '—' in cells[0]:
        continue

    states.append(cells[1].find('a').text)
    GDP.append(cells[2].find(text=True).replace('\n', ''))
    
assert len(states) == 50
assert len(GDP) == 50

In [8]:
GDP_df = pd.DataFrame(list(zip(states, GDP)), 
            columns = ['State', 'GDP 2018'])

GDP_df.set_index('State', inplace=True)

### Happy State Index

Similarly to the Wikipedia web scraping, the Happy State Index was fairly clean. However, a small problem was encountered along the way, namely the website would block our IP after multiple requests. We had to use a VPN in order to circumvent this problem (the same is recommended for anyone interested in replicating our results). 

In [9]:
whub = 'https://wallethub.com/edu/happiest-states/6959/'
page = requests.get(whub)
soup = BeautifulSoup(page.content, 'html.parser')

In [10]:
title_page = soup.title.string
title_page

'IP Block'

In [11]:
right_table = soup.find("table", class_='cardhub-edu-table center-aligned sortable')

In [12]:
states, HSI = [], []

for row in right_table.findAll('tr'):
    
    cells = row.findAll('td')
    

    if len(cells) == 0:
        continue
    

    states.append(cells[1].find(text=True))
    HSI.append(cells[2].find(text=True))
    
assert len(states) == 50
assert len(HSI) == 50

AttributeError: 'NoneType' object has no attribute 'findAll'
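
The `AttributeError` above is a symptom of the 'IP Block' page: the blocked response contains no matching table, so `soup.find(...)` returned `None`. A minimal sketch of a guard that fails with a clearer message is given below. `require_table` is a hypothetical helper introduced here for illustration; it is not part of the original notebook.

```python
def require_table(table, page_title):
    """Return `table` unchanged, or raise a descriptive error if it is missing.

    `table` is whatever soup.find(...) returned (None when nothing matched);
    `page_title` is soup.title.string, used only in the error message.
    """
    if table is None:
        raise RuntimeError(
            f"Expected table not found; page title was {page_title!r}. "
            "The site may have blocked the request."
        )
    return table


# Example using the values seen in the failed run above.
try:
    require_table(None, "IP Block")
except RuntimeError as err:
    message = str(err)
    print(message)
```

In the notebook this could be called as `right_table = require_table(right_table, soup.title.string)` right after the `soup.find(...)` cell, so a blocked request is reported before the row loop runs.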

In [None]:
HSI_df = pd.DataFrame(list(zip(states, HSI)), 
            columns = ['State', 'Happy State Index'])

HSI_df.set_index('State', inplace=True)

In [None]:
df = pd.merge(HSI_df, GDP_df, on='State', how='inner')

In [None]:
df['GDP 2018'] = df['GDP 2018'].str.replace(',', '')

The data needs to be converted to float. Additionally, GDP was divided by 1000 (so values are in thousands of USD) and rounded to 2 decimal places using an anonymous function.

In [None]:
df['Happy State Index'] = df['Happy State Index'].astype(float)
df['GDP 2018'] = df['GDP 2018'].apply(lambda x : round(int(x) / 1000, 2)) 

### 2. Life Expectancy

The procedure for life expectancy was slightly more complicated than the rest of the Wikipedia web scraping due to some of the hyperlinks containing "(state)" and other unnecessary substrings. This was basically the only data cleaning that was carried out. 

In [None]:
wiki = 'https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_life_expectancy'
page = requests.get(wiki)
soup = BeautifulSoup(page.content, 'html.parser')

In [None]:
title_page = soup.title.string
assert 'life' in title_page

In [None]:
right_table = soup.find("table", class_='wikitable sortable')
assert isinstance(right_table, bs4.element.Tag)

In [None]:
states, LE = [], []

for row in right_table.findAll('tr'):
    
    cells = row.findAll('td')
    
    if len(cells) == 0:
        continue
        
    # Special handling for Georgia 
    if cells[1].find('a') != None:
        st = cells[1].find('a')['title']
        states.append(st)
        LE.append(cells[2].find(text=True).replace('\n', ''))
   

In [None]:
LE_df = pd.DataFrame(list(zip(states, LE)), 
            columns = ['State', 'Life Expectancy'])

Anonymous functions were used to clean the entries that contained redundant substrings.

In [None]:
LE_df['State'] = LE_df['State'].apply(lambda x: x.replace(' (state)', ''))
LE_df['State'] = LE_df['State'].apply(lambda x: x.replace(' (U.S. state)', ''))
LE_df.set_index('State', inplace=True)

The Wikipedia dataset contained U.S. territories as well, which had to be filtered out.

In [None]:
LE_df = LE_df[LE_df.index.isin(df.index.tolist())]

In [None]:
df = pd.merge(df, LE_df, on='State', how='inner')

In [None]:
df['Life Expectancy'] = df['Life Expectancy'].astype(float)

In [None]:
df.head()

### 3. Employment Rate

For employment rate we used data from the Bureau of Labor Statistics, again obtained by web scraping.

In [None]:
url = 'https://www.bls.gov/lau/lastrk18.htm'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

In [None]:
right_table = soup.find(id='lastrk18')
assert isinstance(right_table, bs4.element.Tag)

In [None]:
states, EMP = [], []

for row in right_table.findAll('tr'):

    th = row.findAll('th')
    td = row.findAll('td')
    
    if len(th) == 0 or len(td) == 0:
        continue
        
    states.append(th[0].find(text=True))
    EMP.append(td[0].find(text=True))
    
assert len(states) == 52

In [None]:
EMP_df = pd.DataFrame(list(zip(states, EMP)), 
            columns = ['State', 'Employment'])

In [None]:
EMP_df.set_index('State', inplace=True)


"District of Columbia" had to be filtered out.

In [None]:
EMP_df = EMP_df[EMP_df.index.isin(df.index.tolist())]

In [None]:
df = pd.merge(df, EMP_df, on='State', how='inner')

A simple arithmetic conversion (100 - unemployment) was used to get from unemployment % to employment %.

In [None]:
df['Employment'] = 100 - df['Employment'].astype(float)
df.head()

### 4. Social Progress

For the social progress score we only used the column which contained the Index itself. The remaining columns were dropped since they are essentially the different components of the social progress index. In future iterations and improvements of the project, other parameters can be added from this list, e.g. Water and Sanitation.

In [None]:
social_scores_df = pd.read_excel('/Users/kazak/Downloads/us-social-progress-index-results-public-4.xlsx')
social_scores_df.head()

Only State, Code, and Social Progress Index columns were used. Code will come in handy at a later stage, when plotting the choropleths. 

In [None]:
df_social_progress_code = social_scores_df.iloc[:, [0, 1, 2]]
df_social_progress_code.head()

In [None]:
df = pd.merge(df, df_social_progress_code, on='State', how='inner')

The cell below re-orders the columns for convenience.

In [None]:
df = df[['State', 'Code', 'Happy State Index', 'GDP 2018', 'Life Expectancy', 'Employment', 'Social Progress Index']]

In [None]:
df.head()

### 5. GINI coefficient (statistical income distribution)

Gini data was also scraped from Wikipedia without issue.

In [None]:
wiki = 'https://en.wikipedia.org/wiki/List_of_U.S._states_by_Gini_coefficient'
page = requests.get(wiki)
soup = BeautifulSoup(page.content, 'html.parser')

In [None]:
title_page = soup.title.string
assert 'Gini' in title_page

In [None]:
right_table = soup.find("table", class_='wikitable sortable')

In [None]:
states, GINI = [], []

for row in right_table.findAll('tr'):
    
    cells = row.findAll('td')
    

    if len(cells) == 0:
        continue
    
    states.append(cells[1].find(text=True))
    GINI.append(cells[2].find(text=True))

assert len(states) == 52
assert len(GINI) == 52

Data entries for the US overall and the District of Columbia had to be dropped.

In [None]:
GINI_df = pd.DataFrame(list(zip(states, GINI)), 
            columns = ['State', 'GINI'])

GINI_df.set_index('State', inplace=True)
GINI_df = GINI_df.drop(['The United States Of America', 'District of Columbia'])

assert GINI_df.size == 50
GINI_df.head()

1 - GINI was used since the Gini coefficient lies between 0 and 1, with 0 being perfectly equal income and 1 being maximally unequal. After this flip, higher values mean more equality, so we can simply add the result to the index without worrying about the sign.

In [None]:
df = pd.merge(df, GINI_df, on='State', how='inner')
df['GINI'] = 1 - df['GINI'].astype(float)
df.head()

End of Web Scraping and Data Collection.
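
Every merge above uses `how='inner'`, which silently drops any state whose name is spelled differently in the two tables (the "(state)" suffixes in the life expectancy data are one example). A minimal sketch of a check for such mismatches follows. The toy frames and their values are made up for illustration and are not taken from the scraped data.

```python
import pandas as pd

# Toy frames; the values are invented, and one state name deliberately mismatches.
left = pd.DataFrame({"State": ["Alabama", "Alaska", "Georgia"],
                     "GDP": [1.0, 2.0, 3.0]})
right = pd.DataFrame({"State": ["Alabama", "Alaska", "Georgia (U.S. state)"],
                      "LE": [75.0, 78.0, 77.0]})

# An outer merge with indicator=True keeps every row and records its origin
# in a "_merge" column ("both", "left_only" or "right_only").
check = pd.merge(left, right, on="State", how="outer", indicator=True)

# Rows that are not present in both tables would be lost by an inner merge.
unmatched = check.loc[check["_merge"] != "both", "State"].tolist()
print(sorted(unmatched))
```

An empty `unmatched` list, together with an assertion such as `len(df) == 50` after the last merge, would confirm that no state was lost along the way.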

## Data Visualization and Exploratory Data Analysis (EDA)

### Histograms to study the near-normality of the distribution of variables

Scipy's "describe" was used for a summary of main statistical measurements.

The Happy State Index is essentially unimodal and near-normally distributed.
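
Besides the histogram, the skewness and kurtosis reported by `stats.describe` give a numeric hint about near-normality: both are close to 0 for a normal distribution. The sketch below computes the same two quantities with NumPy, assuming the biased moment estimators and excess (Fisher) kurtosis, which to our understanding are the scipy defaults. The helper name and the toy input are illustrative only.

```python
import numpy as np


def skew_and_excess_kurtosis(values):
    """Return (skewness, excess kurtosis) from biased central moments."""
    x = np.asarray(values, dtype=float)
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)  # second central moment (biased variance)
    m3 = np.mean(centered ** 3)  # third central moment
    m4 = np.mean(centered ** 4)  # fourth central moment
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0  # 0 for a normal distribution
    return skewness, excess_kurtosis


# Toy, perfectly symmetric sample: skewness is 0, tails are lighter than normal.
skew, kurt = skew_and_excess_kurtosis([1, 2, 3, 4, 5])
print(skew, kurt)
```

For a column such as `df['Happy State Index']`, values of both statistics near 0 would support the "near-normal" reading of the histogram, while large magnitudes would argue against it.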

In [None]:
sns.distplot(df['Happy State Index'])
plt.show()
stats.describe(df['Happy State Index'])

GDP is also near-normal.

In [None]:
sns.distplot(df['GDP 2018'])
stats.describe(df['GDP 2018'])

In [None]:
sns.distplot(df['Life Expectancy'])
stats.describe(df['Life Expectancy'])

In [None]:
sns.distplot(df['Employment'])
stats.describe(df['Employment'])

In [None]:
sns.distplot(df['Social Progress Index'])
stats.describe(df['Social Progress Index'])

In [None]:
sns.distplot(df['GINI'])
stats.describe(df['GINI'])

### Barplot for the Happy State Index

This shows the gradual, near-linear decrease of happiness across US states (as per the Happy State Index).

In [None]:
plt.figure(figsize=(16, 6))
plt.bar(x=df['State'], height=df['Happy State Index'], width=0.8)
plt.title('Happy State Index')
plt.xticks(df['State'], rotation=90)

plt.show()

### Choropleth plots to capture the geospatial relationships between variables

Choropleth maps were appropriate since the data is predominantly geospatial. The colors were used to represent high intensity in red and low in blue, as for a basic heatmap. Plotly was chosen due to its friendly UI and the ability to interact with the plots.

Choropleth map tutorial from https://plot.ly/python/choropleth-maps/
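
The six choropleth cells that follow differ only in the column plotted, the title and the colorbar label. A minimal sketch of a helper that builds the same figure as a plain dictionary is shown below. `choropleth_figure` is a hypothetical name, and the sketch assumes that `iplot` accepts a figure given as a dict, which is our understanding of the Plotly offline API.

```python
def choropleth_figure(codes, values, title, colorbar_title):
    """Build a Plotly figure dict for a US-states choropleth."""
    trace = dict(
        type="choropleth",
        colorscale="RdBu",
        locations=list(codes),           # two-letter state codes
        z=[float(v) for v in values],    # value that drives the color
        locationmode="USA-states",
        marker=dict(line=dict(color="rgb(255,255,255)", width=2)),
        colorbar=dict(title=colorbar_title),
    )
    layout = dict(
        title=dict(text=f"{title} by State<br>(Hover for breakdown)"),
        geo=dict(scope="usa", projection=dict(type="albers usa"), showlakes=False),
    )
    return dict(data=[trace], layout=layout)


# Toy call with two made-up values, only to show the resulting structure.
fig = choropleth_figure(["CA", "NY"], [70.1, 65.3], "Happy State Index", "0-100")
print(fig["layout"]["title"]["text"])
```

With such a helper, each map could be drawn with one call, for example `iplot(choropleth_figure(df['Code'], df['GDP 2018'], 'GDP per capita 2018', 'Thousands USD'))`.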

In [None]:
data = [go.Choropleth(
    colorscale = 'RdBu',
    locations = df['Code'],
    z = df['Happy State Index'].astype(float),
    locationmode = 'USA-states',
    marker = go.choropleth.Marker(
        line = go.choropleth.marker.Line(
            color = 'rgb(255,255,255)',
            width = 2
        )),
    colorbar = go.choropleth.ColorBar(
        title = "0-100")
)]

layout = go.Layout(
    title = go.layout.Title(
        text = 'Happy State Index by State<br>(Hover for breakdown)'
    ),
    geo = go.layout.Geo(
        scope = 'usa',
        projection = go.layout.geo.Projection(type = 'albers usa'),
        showlakes = False),
)

fig = go.Figure(data=data, layout=layout)
iplot(fig)

In [None]:
data = [go.Choropleth(
    colorscale = 'RdBu',
    locations = df['Code'],
    z = df['GDP 2018'].astype(float),
    locationmode = 'USA-states',
    marker = go.choropleth.Marker(
        line = go.choropleth.marker.Line(
            color = 'rgb(255,255,255)',
            width = 2
        )),
    colorbar = go.choropleth.ColorBar(
        title = "Thousands USD")
)]

layout = go.Layout(
    title = go.layout.Title(
        text = 'GDP per capita 2018 by State<br>(Hover for breakdown)'
    ),
    geo = go.layout.Geo(
        scope = 'usa',
        projection = go.layout.geo.Projection(type = 'albers usa'),
        showlakes = False),
)

fig = go.Figure(data=data, layout=layout)
iplot(fig)

In [None]:
data = [go.Choropleth(
    colorscale = 'RdBu',
    locations = df['Code'],
    z = df['Life Expectancy'].astype(float),
    locationmode = 'USA-states',
    marker = go.choropleth.Marker(
        line = go.choropleth.marker.Line(
            color = 'rgb(255,255,255)',
            width = 2
        )),
    colorbar = go.choropleth.ColorBar(
        title = "Years")
)]

layout = go.Layout(
    title = go.layout.Title(
        text = 'Life Expectancy by State<br>(Hover for breakdown)'
    ),
    geo = go.layout.Geo(
        scope = 'usa',
        projection = go.layout.geo.Projection(type = 'albers usa'),
        showlakes = False),
)

fig = go.Figure(data=data, layout=layout)
iplot(fig)

In [None]:
data = [go.Choropleth(
    colorscale = 'RdBu',
    locations = df['Code'],
    z = df['Employment'].astype(float),
    locationmode = 'USA-states',
    marker = go.choropleth.Marker(
        line = go.choropleth.marker.Line(
            color = 'rgb(255,255,255)',
            width = 2
        )),
    colorbar = go.choropleth.ColorBar(
        title = "%")
)]

layout = go.Layout(
    title = go.layout.Title(
        text = 'Employment % by State<br>(Hover for breakdown)'
    ),
    geo = go.layout.Geo(
        scope = 'usa',
        projection = go.layout.geo.Projection(type = 'albers usa'),
        showlakes = False),
)

fig = go.Figure(data=data, layout=layout)
iplot(fig)

In [None]:
data = [go.Choropleth(
    colorscale = 'RdBu',
    locations = df['Code'],
    z = df['Social Progress Index'].astype(float),
    locationmode = 'USA-states',
    marker = go.choropleth.Marker(
        line = go.choropleth.marker.Line(
            color = 'rgb(255,255,255)',
            width = 2
        )),
    colorbar = go.choropleth.ColorBar(
        title = "0-100")
)]

layout = go.Layout(
    title = go.layout.Title(
        text = 'Social Progress Index by State<br>(Hover for breakdown)'
    ),
    geo = go.layout.Geo(
        scope = 'usa',
        projection = go.layout.geo.Projection(type = 'albers usa'),
        showlakes = False),
)

fig = go.Figure(data=data, layout=layout)
iplot(fig)

In [None]:
data = [go.Choropleth(
    colorscale = 'RdBu',
    locations = df['Code'],
    z = df['GINI'].astype(float),
    locationmode = 'USA-states',
    marker = go.choropleth.Marker(
        line = go.choropleth.marker.Line(
            color = 'rgb(255,255,255)',
            width = 2
        )),
    colorbar = go.choropleth.ColorBar(
        title = "0-1")
)]

layout = go.Layout(
    title = go.layout.Title(
        text = 'GINI Coefficient by State<br>(Hover for breakdown)'
    ),
    geo = go.layout.Geo(
        scope = 'usa',
        projection = go.layout.geo.Projection(type = 'albers usa'),
        showlakes = False),
)

fig = go.Figure(data=data, layout=layout)
iplot(fig)

## Linear Regression to determine coefficients for each variable

Simple linear regression models were used in order to determine the r^2 value in each category. A summative interpretation will be offered in a later section.
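
For reference, the r^2 value returned by `model.score(X, y)` is the coefficient of determination, 1 - SS_res / SS_tot. The sketch below computes it directly with NumPy on made-up toy data. For a simple (one-predictor) linear regression it equals the squared Pearson correlation. `simple_r2` is an illustrative helper, not part of the original notebook.

```python
import numpy as np


def simple_r2(x, y):
    """r^2 of the least-squares line y ~ slope * x + intercept."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)  # ordinary least-squares line
    residuals = y - (slope * x + intercept)
    ss_res = np.sum(residuals ** 2)             # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)        # total variation
    return 1.0 - ss_res / ss_tot


# Toy data, invented for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r2 = simple_r2(x, y)
pearson_r = np.corrcoef(x, y)[0, 1]
print(round(r2, 4), round(pearson_r ** 2, 4))  # the two values agree
```

The same helper could replace the five near-identical fitting steps below with a single loop over the parameter columns, leaving only the plotting code in each cell.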

In [None]:
plt.scatter(df['GDP 2018'], df['Happy State Index'])
plt.title('GDP 2018 vs Happy State Index')
plt.ylabel('Happy State Index (0-100)')
plt.xlabel('GDP per capita in $1000 (2018)')

X = np.array(df['GDP 2018']).reshape((-1, 1))
y = df['Happy State Index']

model = LinearRegression().fit(X, y)
r_sq_gdp = model.score(X, y)
print(round(r_sq_gdp, 3))

y_pred = model.predict(X)
plt.plot(X, y_pred, color='red', linewidth=0.5)

plt.show()

In [None]:
plt.scatter(df['Life Expectancy'], df['Happy State Index'])
plt.title('Life Expectancy vs Happy State Index')
plt.ylabel('Happy State Index (0-100)')
plt.xlabel('Life Expectancy (years)')


X = np.array(df['Life Expectancy']).reshape((-1, 1))
y = df['Happy State Index']

model = LinearRegression().fit(X, y)

r_sq_le = model.score(X, y)
print(round(r_sq_le, 3))

y_pred = model.predict(X)
plt.plot(X, y_pred, color='red', linewidth=0.5)

plt.show()

In [None]:
plt.scatter(df['Employment'], df['Happy State Index'])
plt.title('Employment vs Happy State Index')
plt.ylabel('Happy State Index (0-100)')
plt.xlabel('Employment Rate (%)')

X = np.array(df['Employment']).reshape((-1, 1))
y = df['Happy State Index']

model = LinearRegression().fit(X, y)
r_sq_emp = model.score(X, y)
print(round(r_sq_emp, 3))

y_pred = model.predict(X)
plt.plot(X, y_pred, color='red', linewidth=0.5)

plt.show()

In [None]:
plt.scatter(df['Social Progress Index'], df['Happy State Index'])
plt.title('Social Progress Index vs Happy State Index')
plt.ylabel('Happy State Index (0-100)')
plt.xlabel('Social Progress Index (0-100)')

X = np.array(df['Social Progress Index']).reshape((-1, 1))
y = df['Happy State Index']

model = LinearRegression().fit(X, y)
r_sq_soc = model.score(X, y)
print(round(r_sq_soc, 3))

y_pred = model.predict(X)
plt.plot(X, y_pred, color='red', linewidth=0.5)

plt.show()

In [None]:
plt.scatter(df['GINI'], df['Happy State Index'])
plt.title('Gini Coefficient vs Happy State Index')
plt.ylabel('Happy State Index (0-100)')
plt.xlabel('Gini Coefficient (0-1)')

X = np.array(df['GINI']).reshape((-1, 1))
y = df['Happy State Index']

model = LinearRegression().fit(X, y)
r_sq_gini = model.score(X, y)
print(round(r_sq_gini, 3))

y_pred = model.predict(X)
plt.plot(X, y_pred, color='red', linewidth=1.5)

plt.show()

In [None]:
coef = [r_sq_gdp, r_sq_le, r_sq_emp, r_sq_soc, r_sq_gini]

The most strongly correlated category was Life Expectancy, which makes sense intuitively, since it is often considered conventional wisdom that a happy life is a long life.

In [None]:
coef

The r^2 values were normalized so that the resulting weights sum to 1.

In [None]:
norm = [x / sum(coef) for x in coef]

In [None]:
norm

A vertical barplot was used to look at the normalized r^2 weight of each parameter. Life expectancy had the largest weight, followed by social progress, then employment, GDP and GINI.

In [None]:
labels =['GDP 2018', 'Life Expectancy', 'Employment', 'Social Progress', 'Gini']
plt.figure(figsize=(10, 5))
plt.bar(x=labels, height=norm, width=0.8)
plt.title('Weights of each Category')
plt.xticks(labels, labels, rotation=30)

plt.show()

Below is the correlation matrix. Life expectancy correlates the most with the HSI (the reasons were explained above).

In [None]:
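# Restrict to numeric columns; State and Code are strings and cannot be correlated.
df.select_dtypes(include='number').corr()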

In [None]:
df.head()

Each parameter was normalized onto a 0-100 scale (min-max scaling) so that the final index is scaled appropriately.
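
A minimal sketch of this min-max scaling on a toy column is given below. The values are made up, and `min_max_scale` is an illustrative helper that adds a guard for the constant-column case, where the denominator would be zero.

```python
import pandas as pd


def min_max_scale(series, upper=100.0):
    """Linearly map a numeric Series onto the range [0, upper]."""
    lo, hi = series.min(), series.max()
    if hi == lo:
        # All values identical: the scaling formula would divide by zero.
        raise ValueError("constant column cannot be min-max scaled")
    return (series - lo) / (hi - lo) * upper


# Toy column: the smallest value maps to 0, the largest to 100.
toy = pd.Series([40.0, 55.0, 70.0], index=["A", "B", "C"])
scaled = min_max_scale(toy)
print(scaled.tolist())  # [0.0, 50.0, 100.0]
```

Note that the scaled values are relative to the 50 states in the sample: the lowest state on a parameter receives 0 and the highest receives 100, whatever the absolute gap between them.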

In [None]:
for param in df.columns[3:8]:
    col_name = param + ' Norm' # add "Norm" for normalized
    df[col_name] = ( df[param] - min(df[param]) ) / ( max(df[param]) - min(df[param]) ) * 100
    

In [None]:
df.head()

### Our_Index = A * GDP + B * Life_Expectancy + C * Employment + D * Social_Progress + E * Gini_Coefficient
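
Written out with the normalization steps used in this notebook, the formula in the heading can be stated as follows. The symbols $w_i$ and $\tilde{x}_i$ are notation introduced here for illustration: $w_1, \dots, w_5$ correspond to A through E (`norm[0]` to `norm[4]` in the code), and $\tilde{x}_i$ is the min-max scaled version of parameter $x_i$.

$$
\text{Our\_Index} = \sum_{i=1}^{5} w_i \, \tilde{x}_i, \qquad
w_i = \frac{r_i^2}{\sum_{j=1}^{5} r_j^2}, \qquad
\tilde{x}_i = 100 \cdot \frac{x_i - \min(x_i)}{\max(x_i) - \min(x_i)}
$$

Since every $w_i \ge 0$, the weights sum to 1, and each $\tilde{x}_i$ lies in $[0, 100]$, the index is a weighted average and therefore also lies in $[0, 100]$.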

In [None]:
df['Our Index'] = df['GDP 2018 Norm'] * norm[0] + df['Life Expectancy Norm'] * norm[1] + df['Employment Norm'] * norm[2] + \
              df['Social Progress Index Norm'] * norm[3] + df['GINI Norm'] * norm[4]

In [None]:
df.head()

In [None]:
plt.scatter(df['Our Index'], df['Happy State Index'])
plt.title('Our Index vs Happy State Index')
plt.ylabel('Happy State Index (0-100)')
plt.xlabel('Our Index (0-100)')

X = np.array(df['Our Index']).reshape((-1, 1))
y = df['Happy State Index']

model = LinearRegression().fit(X, y)
r_sq_index = model.score(X, y)  # r^2 of Our Index vs HSI (kept separate from r_sq_gini)
print(round(r_sq_index, 3))

y_pred = model.predict(X)
plt.plot(X, y_pred, color='red', linewidth=1.5)

plt.show()

The printed value is the r^2 for the regression of the Happy State Index (used as a benchmark) on Our Index. An r^2 of 0.65 can be called "moderately correlated", although it is below the 0.75 threshold.

In [None]:
plt.scatter(df['Our Index'], df['Happy State Index'])
plt.title('Our Index vs Happy State Index')
plt.ylabel('Happy State Index (0-100)')
plt.xlabel('Our Index (0-100)')  # label matches the column actually plotted

X = np.array(df['Our Index']).reshape((-1, 1))
y = df['Happy State Index']

model = LinearRegression().fit(X, y)
r_sq_index = model.score(X, y)  # r^2 of Our Index vs HSI (kept separate from r_sq_gini)
print(round(r_sq_index, 3))

y_pred = model.predict(X)
plt.plot(X, y_pred, color='red', linewidth=1.5)

plt.show()

Columns had to be renamed without whitespace so that they can be used in the patsy formula for OLS model fitting.

In [None]:
df = df.rename(columns={'Happy State Index': 'HSI', 'Our Index': 'OI'})

In [None]:
df.head()

In [None]:
outcome, predictors = patsy.dmatrices('HSI ~ OI', df)

In [None]:
OLS_model = sm.OLS(outcome, predictors)

Below is a summary of the Ordinary Least Squares model. The most useful statistic is the r^2 value in the upper right corner. 

### R^2 = 0.654

In [None]:
res = OLS_model.fit()
print(res.summary())

This concludes the data analysis section. As seen above, r^2 = 0.654 falls short of the 0.75 threshold, so we have to reject our hypothesis. A more thorough explanation of the implications of this will be provided in the conclusion section below. 

# Ethics & Privacy

Ethical Considerations:
-	Any information that could identify a person will be removed from our data, keeping everything anonymous.
-	Data, unless publicly available, will only be obtained with owners’ consent.
-	Data collected will be analyzed for author biases.
-	The research will be carried out in a manner that does not cause any undue harm to anyone.
-	The process and/or findings of the present research will not serve to propagate bias against any individual or group.
-	Analysis will be conducted honestly and in a manner that ensures data anonymity. 
-	Analysis will be documented well enough to be reproducible in the future.
-	Metric selection will be carried out thoughtfully, eliminating personal bias toward the selections.
-	Explanations of the model's features and their justifications will be understandable and clear.

With these points in mind, our team tried to minimize selection bias by brainstorming dozens of candidate variables related to our hypothesis and placing them in a random-selection tool. Before any variable was picked for research, every teammate had to agree that each candidate was equally valid and carried the same conceptual importance; only then did we search for datasets. 

After picking a few randomized variable concepts relating to our hypothesis, we began our search for datasets and found sets that did not include any personal information. Beyond this, we checked the sources of each dataset to make sure that the data could not be used to reverse-engineer the location or personal information of any individual or group of individuals. 
With these data sets free of personal information, we then checked for author biases and tried to use data sets that were open-source and constantly monitored by the site publishers' automated tools. Although these tools are not true AI, they provide diligent and usually unbiased protection against malicious or false alterations to the data sets. 

After cleaning up the data sets, we tried to implement security measures to prevent unwanted users from altering the Jupyter Notebook. Unfortunately, the result was not as secure as we had hoped, so we abandoned the effort. That said, the models contain no personal data, and their features are produced by imported Python modules. This means that a user who altered the code would still have to rely on those public module functions in order to hide undesired alterations to our models. 
For the metric selection, since we had randomly chosen the data sets from equally weighted concepts, we had already ensured that the selection did not contain biases from our teammates. Although we did select which columns to include, this was only to ensure relevance to our hypothesis.

Lastly, each of us re-read the explanations of the models, which provided multiple feedback loops to ensure that we were not targeting any individual or group with the features described in our models. With this system of checks, we did our best to ensure that the privacy of individuals was protected and that the models generated from our analysis would not harm any individual or group.
 
After completing the analysis, we discussed possible unpredictable harms that might arise from our models. The only one that came to mind was hurt feelings in the states that ranked lower on our “Happy State Index”.


# Conclusion & Discussion

Our Happiness Index, built from GDP per capita, life expectancy, employment, social progress and the Gini coefficient, reasonably encompasses a wide range of socio-economic factors that may ‘define’ happiness across the 50 states. 

By conducting a linear regression of the Happy State Index on each metric, we were able to ascertain how strongly each metric is associated with the overall score. These R-squared values, once normalized, were taken as ‘weights’ that led to the creation of our own happiness index formula. Later, as per our hypothesis, we regressed the Happy State Index on the newly devised happiness index to see how closely their respective values matched for each state. 

The resulting R-squared value of 0.65 fell short of the hypothesized 0.75, so the hypothesis was rejected, but the gap is small enough that it does not discredit the utility of our index. This leads us to believe that our attempt to quantify happiness with our own metrics came reasonably close to the Happy State Index, while leaving room for improvement by considering more variables that can lead to a more nuanced definition of happiness.   
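
One possible direction for such an improvement, which was not part of the analysis above, would be to fit the five weights jointly by multiple linear regression instead of deriving them from separate r^2 values. The sketch below uses synthetic data only (random numbers standing in for the five normalized parameters), so it says nothing about the real results. It merely illustrates that, on the same sample, a joint least-squares fit can never have a lower r^2 than a fixed-weight index followed by a simple regression.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: 50 "states", 5 parameters scaled to 0-100 (not real data).
X = rng.uniform(0, 100, size=(50, 5))
true_w = np.array([0.10, 0.40, 0.20, 0.25, 0.05])  # arbitrary illustrative weights
y = X @ true_w + rng.normal(0, 5, size=50)          # synthetic "benchmark" index


def r_squared(y_true, y_hat):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_hat) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Joint fit: an intercept plus one coefficient per parameter.
A = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
r2_joint = r_squared(y, A @ beta)

# Fixed-weight index (equal weights here), then a simple regression on it.
index = X @ np.full(5, 0.2)
slope, intercept = np.polyfit(index, y, deg=1)
r2_fixed = r_squared(y, slope * index + intercept)

print(round(r2_joint, 3), round(r2_fixed, 3))
```

The trade-off is that jointly fitted weights are tuned to the benchmark on the same 50 states, so any gain in r^2 would need to be checked on held-out data before being trusted.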