# The World Happiness Report Analysis

##### Aly Boolani

##### A written blog post can be found [here](https://www.alyboolani.com/blogs/analyzing-the-worlds-happiness-report).

## Table of Contents:

1. **[Overview of the World Happiness Report](##Overview)**
    * [Context](##)
    * [Data Acknowledgment](##)
    
    
2. **[Importing Python Packages](#Imports)**


3. **[First look at the Data](##)**
     * [Individual DataFrames](##)
     * [Data Description, Information & Data Types](##)
     * [Nulls & Missing Values](##)
     * [Early Observations](##)


4. **[Feature Engineering](##)**
    * [Aligning the DataFrames](##)
        * [Renaming columns](##)
        * [Adding columns](##)
        
    * [Combining the DataFrames](##)
    * [Additions to DataFrame](##)
        * [Capital Cities](##)
        * [Capital Latitude & Longitude](##)
  
    
5. **[Insights & Findings](##)**
    * [World Map Visualization](##)
       
    * [Top 10 happiest countries overall](##)
        * [2020](##)
        * [2019](##)
        * [2018](##)
        * [2017](##)
        * [2016](##)
        * [2015](##)
    * [Least 10 happiest countries overall](##)
        * [2020](##)
        * [2019](##)
        * [2018](##)
        * [2017](##)
        * [2016](##)
        * [2015](##)
        
        
6. **[Scatter plots - Relationship between Happiness Score and:](##)**
    * [Region - encoded](##)
    * [GDP Per Capita](##)
    * [Social Support](##)
    * [Health / Life Expectancy](##)
    * [Freedom to make choices](##)
    * [Family Safety](##)
    * [Trust in Government Entities](##)
    * [Generosity](##)
    * [Dystopia Residual](##) 
        

7. **[Using Machine Learning to understand major influencing factors](##)**
    * [Linear Regression](##)
    * [Logistic Regression](##)
    * [Support Vector Machines - Regressor](##)
    * [Scalers](##)
        * [Standard Scaler](##)
        * [Min-Max Scaler](##)
    * [Principal Component Analysis (PCA)](##)
    * [Grid Search with Cross Validation](##)
    

8. **[Final Verdict](##)**

---
# Overview of the World Happiness Report! 
---
The **World Happiness Report** is a landmark survey of the state of global happiness. The first report was published in 2012, the second in 2013, the third in 2015, and the fourth with the 2016 update. **The World Happiness Report 2017**, which ranks 155 countries by their happiness levels, was released at the United Nations at an event celebrating the International Day of Happiness. The report continues to gain global recognition as governments, organizations and civil society increasingly use happiness indicators to inform their policy-making decisions. Leading experts across fields (economics, psychology, survey analysis, national statistics, health, public policy and more) describe how measurements of well-being can be used effectively to assess the progress of nations. The reports review the state of happiness in the world today and show how the new science of happiness explains personal and national variations in happiness. 

The happiness scores and rankings use data from the Gallup World Poll. The scores are based on answers to the main life evaluation question asked in the poll. This question, known as the Cantril ladder, asks respondents to think of a ladder with the best possible life for them being a 10 and the worst possible life being a 0, and to rate their own current lives on that scale. The scores are from nationally representative samples for the years 2013-2016 and use the Gallup weights to make the estimates representative. The columns following the happiness score estimate the extent to which each of six factors (economic production, social support, life expectancy, freedom, absence of corruption, and generosity) contributes to making life evaluations higher in each country than they are in Dystopia, a hypothetical country that has values equal to the world's lowest national averages for each of the six factors. They have no impact on the total score reported for each country, but they do help explain why scores vary between countries.

The files contain the Happiness Score for 153 countries, along with the factors used to explain the score, for the years 2015 to 2020.

Lastly, **The Happiness Score** is a national average of the responses to the main life evaluation question asked in the Gallup World Poll (GWP), which uses the Cantril Ladder. It can be explained by the following factors:
- Access to Social Support
- Economy (GDP per Capita)
- Family
- Health / Life Expectancy
- Freedom to make choices
- Trust in Government Entities
- Generosity
- Dystopia Residual
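
To make the decomposition above a bit more concrete, here is a minimal sketch of how the factor columns relate to the score: in the report's decomposition, the score is approximately the sum of the factor contributions plus the Dystopia Residual. All numbers below are made up for illustration and do not belong to any real country.

```python
# Hypothetical factor contributions for one made-up country (not real data)
contributions = {
    "Economy (GDP per Capita)": 1.30,
    "Social Support": 1.25,
    "Health / Life Expectancy": 0.90,
    "Freedom to make choices": 0.60,
    "Trust in Government Entities": 0.30,
    "Generosity": 0.25,
}
dystopia_residual = 2.40  # Dystopia baseline plus the unexplained residual

# The score is approximately the sum of all contributions and the residual
happiness_score = sum(contributions.values()) + dystopia_residual
print(f"Reconstructed Happiness Score: {happiness_score:.2f}")

# Share of the score attributed to each factor
for factor, value in contributions.items():
    print(f"{factor}: {value / happiness_score:.1%}")
```

The factors therefore describe where a score comes from rather than changing it.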


### Data Acknowledgment
Before we begin, I want to quickly acknowledge that the data has been sourced from Kaggle and can be found [here](https://www.kaggle.com/mathurinache/world-happiness-report). The original authors & editors are: 

**Editors:** 

John Helliwell, Richard Layard, Jeffrey D. Sachs, and Jan Emmanuel De Neve, Co-Editors; Lara Aknin, Haifang Huang and Shun Wang, Associate Editors; and Sharon Paculor, Production Editor

**Citation:**

Helliwell, John F., Richard Layard, Jeffrey Sachs, and Jan-Emmanuel De Neve, eds. 2020. World Happiness Report 2020. New York: Sustainable Development Solutions Network

---

# Imports

In [None]:
# Importing all the packages
# Analysis Libraries
import pandas as pd
import numpy as np

# Visualization Libraries
# Matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

#Seaborn
import seaborn as sns

# Importing plotly objects
import plotly.graph_objects as go

# For sounds when code blocks complete
import os

# For saving the models later. 
import joblib

# Displaying DataFrames - side by side function 
from IPython.display import display_html

# GeoPy Package - Extensions for getting country latitude and longitudes (pip install GeoPy)
from geopy.exc import GeocoderTimedOut 
from geopy.geocoders import Nominatim 

# For making training and testing splits prior to modelling
from sklearn.model_selection import train_test_split

# Importing Scaler for Scaling Data
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

# Machine Learning Imports 
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVR

# For building up a pipeline
from sklearn.pipeline import Pipeline

# For a cross-validated grid search
from sklearn.model_selection import GridSearchCV

# Stopping warnings
import warnings
warnings.filterwarnings('ignore')

---
# First look at the Data
---
This will include taking a look at the following:
- Individual DataFrames
- Data Description, Information & Data Types
- Nulls & Missing Values
- Early Observations

In [None]:
# Reading in our data and setting dataframes to a variable
Year_2015 = pd.read_csv('2015.csv')
Year_2016 = pd.read_csv('2016.csv')
Year_2017 = pd.read_csv('2017.csv')
Year_2018 = pd.read_csv('2018.csv')
Year_2019 = pd.read_csv('2019.csv')
Year_2020 = pd.read_csv('2020.csv')

#### Individual DataFrames - First 5 and last 5 rows!

In [None]:
# Printing out the dataframe for the year 2015
Year_2015

In [None]:
# Printing out the dataframe for the year 2016
Year_2016

In [None]:
# Printing out the dataframe for the year 2017
Year_2017

In [None]:
# Printing out the dataframe for the year 2018
Year_2018

In [None]:
# Printing out the dataframe for the year 2019
Year_2019

In [None]:
# Printing out the dataframe for the year 2020
Year_2020

#### Data Description, Information & Data Types

###### Descriptions

In [None]:
# Description of DataFrame = Year 2015 
Year_2015.describe()

In [None]:
# Description of DataFrame = Year 2016 
Year_2016.describe()

In [None]:
# Description of DataFrame = Year 2017 
Year_2017.describe()

In [None]:
# Description of DataFrame = Year 2018 
Year_2018.describe()

In [None]:
# Description of DataFrame = Year 2019 
Year_2019.describe()

In [None]:
# Description of DataFrame = Year 2020 
Year_2020.describe()

##### Data Points

In [4]:
# Printing out the number of rows and the number of unique countries, in case of repetition
print(f"Countries in 2015: {len(Year_2015)} & unique countries are: {Year_2015['Country'].nunique()}")
print(f"Countries in 2016: {len(Year_2016)} & unique countries are: {Year_2016['Country'].nunique()}")
print(f"Countries in 2017: {len(Year_2017)} & unique countries are: {Year_2017['Country'].nunique()}")
print(f"Countries in 2018: {len(Year_2018)} & unique countries are: {Year_2018['Country'].nunique()}")
print(f"Countries in 2019: {len(Year_2019)} & unique countries are: {Year_2019['Country'].nunique()}")
print(f"Countries in 2020: {len(Year_2020)} & unique countries are: {Year_2020['Country'].nunique()}")


##### Duplicates

In [None]:
# Printing out the number of duplicates
print(f"Year 2015 duplicates: {Year_2015.duplicated().sum()}")
print(f"Year 2016 duplicates: {Year_2016.duplicated().sum()}")
print(f"Year 2017 duplicates: {Year_2017.duplicated().sum()}")
print(f"Year 2018 duplicates: {Year_2018.duplicated().sum()}")
print(f"Year 2019 duplicates: {Year_2019.duplicated().sum()}")
print(f"Year 2020 duplicates: {Year_2020.duplicated().sum()}")


##### Nulls & Missing Values

In [3]:
# Printing out the total number of null values
# (isnull().sum() gives one count per column, so a second sum() totals them)
print(f"Year 2015 null values: {Year_2015.isnull().sum().sum()}")
print(f"Year 2016 null values: {Year_2016.isnull().sum().sum()}")
print(f"Year 2017 null values: {Year_2017.isnull().sum().sum()}")
print(f"Year 2018 null values: {Year_2018.isnull().sum().sum()}")
print(f"Year 2019 null values: {Year_2019.isnull().sum().sum()}")
print(f"Year 2020 null values: {Year_2020.isnull().sum().sum()}")


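
A quick sketch of how missing values can be counted, using a toy DataFrame with made-up values rather than the real files. `isnull().sum()` returns one count per column, so taking its length gives the number of columns, while summing it again gives the number of missing cells.

```python
import numpy as np
import pandas as pd

# Toy DataFrame with made-up values; two cells are missing
toy = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Happiness Score": [7.1, np.nan, 5.3],
    "Generosity": [0.2, 0.1, np.nan],
})

nulls_per_column = toy.isnull().sum()      # Series with one count per column
number_of_columns = len(nulls_per_column)  # counts columns, not nulls
total_nulls = int(nulls_per_column.sum())  # counts the missing cells

print(nulls_per_column)
print(f"Columns: {number_of_columns}, total null values: {total_nulls}")
```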

##### Quick Visualizations while we explore the data

In [None]:
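# Plotting the top 20 happiest countries of 2020
# (assumes 'Country' and 'Happiness Score' exist, i.e. the renaming step further below has been run)
top_20_2020 = Year_2020.sort_values(by = 'Happiness Score', ascending = False).head(20)

plt.figure(figsize = (12, 6))
plt.bar(top_20_2020['Country'], top_20_2020['Happiness Score'])
plt.xticks(rotation = 90)
plt.ylabel('Happiness Score')
plt.show()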

---
# Feature Engineering
---

#### Aligning the DataFrames

In [None]:
# Let's take a look at the shape of our dataframes
print(f"World Happiness Report 2015: {Year_2015.shape}")
print(f"World Happiness Report 2016: {Year_2016.shape}")
print(f"World Happiness Report 2017: {Year_2017.shape}")
print(f"World Happiness Report 2018: {Year_2018.shape}")
print(f"World Happiness Report 2019: {Year_2019.shape}")
print(f"World Happiness Report 2020: {Year_2020.shape}")

#### Renaming columns

#### Adding columns

1. Report Year
2. Country Capitals
3. Latitude & Longitude
4. Happiness Brackets

**1. Report Year**

In [None]:
# Adding a first column to the 2015 DataFrame with value 2015
Year_2015.insert(0 , column = 'Report Year', value = '2015')

# Adding a first column to the 2016 DataFrame with value 2016
Year_2016.insert(0 , column = 'Report Year', value = '2016')

# Adding a first column to the 2017 DataFrame with value 2017
Year_2017.insert(0 , column = 'Report Year', value = '2017')

# Adding a first column to the 2018 DataFrame with value 2018
Year_2018.insert(0 , column = 'Report Year', value = '2018')

# Adding a first column to the 2019 DataFrame with value 2019
Year_2019.insert(0 , column = 'Report Year', value = '2019')

# Adding a first column to the 2020 DataFrame with value 2020
Year_2020.insert(0 , column = 'Report Year', value = '2020')
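
As a side note, the same column could be added in a loop. This is only a sketch on toy stand-ins for the yearly DataFrames (made-up values), and it stores the year as an integer instead of a string, which is my own assumption about what is convenient for sorting and filtering later.

```python
import pandas as pd

# Toy stand-ins for the yearly DataFrames (made-up values)
yearly_frames = {
    2015: pd.DataFrame({"Country": ["A", "B"], "Happiness Score": [7.0, 5.0]}),
    2016: pd.DataFrame({"Country": ["A", "B"], "Happiness Score": [7.1, 5.2]}),
}

for year, frame in yearly_frames.items():
    # Insert the year as the first column; an int keeps it sortable and comparable
    frame.insert(0, column="Report Year", value=year)

print(yearly_frames[2016])
```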

**2. Country Capitals**

In [None]:
# Reading in a new dataframe with country names and capital cities
country_capitals = pd.read_csv('Country list.csv')
country_capitals.rename(columns = {'Countries' : 'Country'}, inplace = True)
country_capitals.head()

**3. Latitude & Longitude**

In [None]:
# Creating empty lists to store the latitudes and longitudes of the cities
longitude = [] 
latitude = [] 
   
# Creating a function to find the coordinates of a given city
def findGeocode(city): 
       
    # try/except is used to overcome the GeocoderTimedOut exception 
    # thrown by the geolocator 
    try: 
          
        # Specify the user agent as 'your_app_name'; it shouldn't be empty
        geolocator = Nominatim(user_agent="your_app_name") 
        
        # Looks up the city and returns its location (None if nothing is found)
        return geolocator.geocode(city) 
      
    except GeocoderTimedOut: 
          
        # On a timeout, try the same city again (note: this retries without a limit)
        return findGeocode(city)     
  
# Each value from the 'Capital' column is fetched and sent to findGeocode

# Looping over the capitals row by row and fetching the lat/long per city
for city in country_capitals['Capital']: 
      
    # Look the city up once and reuse the result
    loc = findGeocode(city) 
      
    if loc is not None: 
           
        # The coordinates returned by the function are stored in the two lists 
        latitude.append(loc.latitude) 
        longitude.append(loc.longitude) 
       
    # If the coordinates for a city are not found, 
    # insert NaN to indicate a missing value  
    else: 
        latitude.append(np.nan) 
        longitude.append(np.nan) 
        
# Let's finally put these into the country_capitals dataframe
country_capitals['Latitude'] = latitude
country_capitals['Longitude'] = longitude
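
Retrying on a timeout can go on forever if the service never answers, so one option is to cap the number of attempts. The sketch below shows the idea without any network access: `FakeTimeout`, `flaky_geocode` and `geocode_with_retry` are hypothetical names made up for this example, standing in for `GeocoderTimedOut` and the geolocator's `geocode` call.

```python
import time

class FakeTimeout(Exception):
    """Hypothetical stand-in for geopy.exc.GeocoderTimedOut."""

def geocode_with_retry(geocode, city, max_retries=3, wait_seconds=0.0):
    """Call geocode(city), retrying on a timeout up to max_retries attempts.

    Returns None if every attempt times out.
    """
    for _ in range(max_retries):
        try:
            return geocode(city)
        except FakeTimeout:
            time.sleep(wait_seconds)  # short pause before the next attempt
    return None

# Fake geocoder that times out twice before answering
calls = {"count": 0}

def flaky_geocode(city):
    calls["count"] += 1
    if calls["count"] < 3:
        raise FakeTimeout()
    return (10.0, 20.0)  # made-up coordinates

def always_failing_geocode(city):
    raise FakeTimeout()

result = geocode_with_retry(flaky_geocode, "Sampletown")
gave_up = geocode_with_retry(always_failing_geocode, "Sampletown")
print(result, calls["count"], gave_up)
```

With the real geolocator, a city that keeps timing out would then end up as NaN instead of blocking the loop.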


In [None]:
# Let's quickly take a look at the new dataframe 
country_capitals.head()

In [None]:
# Let's also take a look to see if there are any nulls within these
print(f"Null values in Country_Capitals: {country_capitals.isnull().sum().sum()}")


**Cleaning up the dataframes before moving on to combining them**

In [None]:
# Changing the columns for Year 2015
# Renaming the columns and grouping together 
clean_2015 = {'Country' : 'Country', 'Region' : 'Region', 'Happiness Rank' : 'Happiness Rank',
              'Happiness Score' : 'Happiness Score', 'Economy (GDP per Capita)' : 'Economy (GDP per Capita)',
              'Family' : 'Family Safety','Health (Life Expectancy)' : 'Health / Life Expectancy',
              'Freedom' : 'Freedom to make choices',
              'Trust (Government Corruption)' : 'Trust in Government Entities',
              'Generosity' : 'Generosity', 'Dystopia Residual' : 'Dystopia Residual'}

clean_2016 = {'Country' : 'Country', 'Region' : 'Region', 'Happiness Rank' : 'Happiness Rank',
              'Happiness Score' : 'Happiness Score', 'Economy (GDP per Capita)' : 'Economy (GDP per Capita)',
              'Family' : 'Family Safety','Health (Life Expectancy)' : 'Health / Life Expectancy',
              'Freedom' : 'Freedom to make choices',
              'Trust (Government Corruption)' : 'Trust in Government Entities',
              'Generosity' : 'Generosity', 'Dystopia Residual' : 'Dystopia Residual'}

clean_2017 = {'Country' : 'Country', 'Region' : 'Region', 'Happiness.Rank' : 'Happiness Rank',
              'Happiness.Score' : 'Happiness Score', 'Economy..GDP.per.Capita.' : 'Economy (GDP per Capita)',
              'Family' : 'Family Safety','Health..Life.Expectancy.' : 'Health / Life Expectancy',
              'Freedom' : 'Freedom to make choices',
              'Trust..Government.Corruption.' : 'Trust in Government Entities',
              'Generosity' : 'Generosity', 'Dystopia.Residual' : 'Dystopia Residual'}

clean_2018 = {'Country or region' : 'Country', 'Overall rank' : 'Happiness Rank', 'Social support' : 'Social Support',
              'Score' : 'Happiness Score', 'GDP per capita' : 'Economy (GDP per Capita)',
              'Family' : 'Family Safety','Healthy life expectancy' : 'Health / Life Expectancy',
              'Freedom to make life choices' : 'Freedom to make choices',
              'Perceptions of corruption' : 'Trust in Government Entities',
              'Generosity' : 'Generosity', 'Dystopia Residual' : 'Dystopia Residual'}

clean_2019 = {'Country or region' : 'Country', 'Overall rank' : 'Happiness Rank', 'Social support' : 'Social Support',
              'Score' : 'Happiness Score', 'GDP per capita' : 'Economy (GDP per Capita)',
              'Family' : 'Family Safety','Healthy life expectancy' : 'Health / Life Expectancy',
              'Freedom to make life choices' : 'Freedom to make choices',
              'Perceptions of corruption' : 'Trust in Government Entities',
              'Generosity' : 'Generosity', 'Dystopia Residual' : 'Dystopia Residual'}

clean_2020 = {'Country' : 'Country', 'Regional indicator' : 'Region', 'Happiness Rank' : 'Happiness Rank',
              'Ladder score' : 'Happiness Score', 'Economy (GDP per Capita)' : 'Economy (GDP per Capita)',
              'Family' : 'Family Safety','Healthy life expectancy' : 'Health / Life Expectancy',
              'Freedom to make life choices' : 'Freedom to make choices', 'Social support' : 'Social Support',
              'Perceptions of corruption' : 'Trust in Government Entities',
              'Generosity' : 'Generosity', 'Dystopia Residual' : 'Dystopia Residual'}


# Assigning the new column names to ensure they align 
Year_2015.rename(columns = clean_2015, inplace = True)
Year_2016.rename(columns = clean_2016, inplace = True)
Year_2017.rename(columns = clean_2017, inplace = True)
Year_2018.rename(columns = clean_2018, inplace = True)
Year_2019.rename(columns = clean_2019, inplace = True)
Year_2020.rename(columns = clean_2020, inplace = True)
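
One thing worth knowing here: by default `rename` silently skips any key in the mapping that a DataFrame does not contain, so a typo in a column name goes unnoticed. A small sketch of how the alignment could be checked afterwards, using toy frames with made-up values and a shortened, hypothetical mapping:

```python
import pandas as pd

# Toy frames mimicking two differently named yearly layouts (made-up values)
frame_a = pd.DataFrame({"Country": ["A"], "Happiness.Score": [7.0], "Family": [1.2]})
frame_b = pd.DataFrame({"Country or region": ["A"], "Score": [7.1], "Social support": [1.3]})

mapping = {
    "Country or region": "Country",
    "Happiness.Score": "Happiness Score",
    "Score": "Happiness Score",
}

# rename() skips mapping keys that a frame does not contain
frame_a = frame_a.rename(columns=mapping)
frame_b = frame_b.rename(columns=mapping)

# Compare the resulting column sets to see what lines up and what does not
shared = set(frame_a.columns) & set(frame_b.columns)
only_a = set(frame_a.columns) - shared
only_b = set(frame_b.columns) - shared
print(sorted(shared), sorted(only_a), sorted(only_b))
```

Columns that show up on only one side are the ones that still need a mapping entry or a decision on how to treat them.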

In [None]:
# Adding the Happiness Ranks and Happiness Scores for 2015
# (rename returns a new DataFrame here, which avoids modifying a slice of the original in place)
Average_Happiness_Rank_Scores_2015 = Year_2015[['Country', 'Happiness Rank', 'Happiness Score']].rename(
    columns = {'Happiness Rank' : 'Happiness Rank_2015', 'Happiness Score': 'Happiness Score_2015'})

# Adding the Happiness Ranks and Happiness Scores for 2016
Average_Happiness_Rank_Scores_2016 = Year_2016[['Country', 'Happiness Rank', 'Happiness Score']].rename(
    columns = {'Happiness Rank' : 'Happiness Rank_2016', 'Happiness Score': 'Happiness Score_2016'})

# Adding the Happiness Ranks and Happiness Scores for 2017
Average_Happiness_Rank_Scores_2017 = Year_2017[['Country', 'Happiness Rank', 'Happiness Score']].rename(
    columns = {'Happiness Rank' : 'Happiness Rank_2017', 'Happiness Score': 'Happiness Score_2017'})

# Adding the Happiness Ranks and Happiness Scores for 2018
Average_Happiness_Rank_Scores_2018 = Year_2018[['Country', 'Happiness Rank', 'Happiness Score']].rename(
    columns = {'Happiness Rank' : 'Happiness Rank_2018', 'Happiness Score': 'Happiness Score_2018'})

# Adding the Happiness Ranks and Happiness Scores for 2019
Average_Happiness_Rank_Scores_2019 = Year_2019[['Country', 'Happiness Rank', 'Happiness Score']].rename(
    columns = {'Happiness Rank' : 'Happiness Rank_2019', 'Happiness Score': 'Happiness Score_2019'})

# Adding the Happiness Ranks and Happiness Scores for 2020
Average_Happiness_Rank_Scores_2020 = Year_2020[['Country', 'Happiness Rank', 'Happiness Score']].rename(
    columns = {'Happiness Rank' : 'Happiness Rank_2020', 'Happiness Score': 'Happiness Score_2020'})


In [None]:
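# Let's take a look at the average Happiness Score across all countries for each year
for year, data in zip(range(2015, 2021), [Year_2015, Year_2016, Year_2017, Year_2018, Year_2019, Year_2020]):
    print(f"Average Happiness Score across all countries in {year}: {data['Happiness Score'].mean():.3f}")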

In [None]:
# Create a list to get the sum of all happiness scores and divide that

Before we move on to 

**4. Happiness Bracket and Average Happiness**

This will help us see the distribution of happiness in our data

In [None]:
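# Plotting subplots to see the happiness score distribution for each year 
yearly_data = {2020: Year_2020,
               2019: Year_2019,
               2018: Year_2018,
               2017: Year_2017,
               2016: Year_2016,
               2015: Year_2015}

# Six years, so a 3 x 2 grid gives one histogram per year
fig, axes = plt.subplots(figsize = (10,10),
                         nrows = 3,
                         ncols = 2)

for ax, (year, data) in zip(axes.flatten(), yearly_data.items()):
    ax.hist(data['Happiness Score'], bins = 20)
    ax.set_title(f'Happiness Score distribution in {year}')
    ax.set_xlabel('Happiness Score')
    ax.set_ylabel('Number of countries')

plt.tight_layout()
plt.show()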

In [None]:
# Plotting subplots to understand the relationship between the Happiness Score and the rest 
plt.subplots(figsize = (20,20), 
             nrows = 4, 
             ncols = 2)

# Scatter subplot for Happiness Score & Economy (GDP per Capita)
plt.subplot(4,2,1)
sns.regplot(x = Year_2015['Happiness Score'], 
            y = Year_2015['Economy (GDP per Capita)'],
            color = 'red', marker = '*')
plt.title('')
plt.xlabel('Happiness Score')
plt.ylabel('Economy (GDP per Capita)')
plt.xlim(0,10)

# Scatter subplot for Happiness Score & Social Support
plt.subplot(4,2,2)
sns.regplot(x = Year_2015['Happiness Score'],
            y = Year_2015['Social Support'],
            color = 'green', marker = 'o')
plt.title('')
plt.xlabel('Happiness Score')
plt.ylabel('Social Support')
plt.xlim(0,10)

# Scatter subplot for Happiness Score & Health / Life Expectancy
plt.subplot(4,2,3)
sns.regplot(x = Year_2015['Happiness Score'],
            y = Year_2015['Health / Life Expectancy'],
            color = 'blue', marker = '*')
plt.title('')
plt.xlabel('Happiness Score')
plt.ylabel('Health / Life Expectancy')
plt.xlim(0,10)

# Scatter subplot for Happiness Score & Freedom to make choices
plt.subplot(4,2,4)
sns.regplot(x = Year_2015['Happiness Score'], 
            y = Year_2015['Freedom to make choices'],
            color = 'orange', marker = 'o')
plt.title('')
plt.xlabel('Happiness Score')
plt.ylabel('Freedom to make choices')
plt.xlim(0,10)

# Scatter subplot for Happiness Score & Family Safety
plt.subplot(4,2,5)
sns.regplot(x = Year_2015['Happiness Score'],
            y = Year_2015['Family Safety'],
            color = 'red', marker = '*')
plt.title('')
plt.xlabel('Happiness Score')
plt.ylabel('Family Safety')
plt.xlim(0,10)

# Scatter subplot for Happiness Score & Trust in Government Entities
plt.subplot(4,2,6)
sns.regplot(x = Year_2015['Happiness Score'],
            y = Year_2015['Trust in Government Entities'],
            color = 'green', marker = 'o')
plt.title('')
plt.xlabel('Happiness Score')
plt.ylabel('Trust in Government Entities')
plt.xlim(0,10)

# Scatter subplot for Happiness Score & Generosity
plt.subplot(4,2,7)
sns.regplot(x = Year_2015['Happiness Score'],
            y = Year_2015['Generosity'],
            color = 'green', marker = '*')
plt.title('')
plt.xlabel('Happiness Score')
plt.ylabel('Generosity')
plt.xlim(0,10)

# Scatter subplot for Happiness Score & Dystopia Residual
plt.subplot(4,2,8)
sns.regplot(x = Year_2015['Happiness Score'],
            y = Year_2015['Dystopia Residual'],
            color = 'green', marker = '*')
plt.title('')
plt.xlabel('Happiness Score')
plt.ylabel('Dystopia Residual')
plt.xlim(0,10)

plt.tight_layout()
plt.show()

## Combining our dataframe altogether

**Before we move on to putting the dataframes together, let's make sure the combined dataframe will have the following columns:**

- `Country`
- `Region`
- `Capital`
- `Latitude`
- `Longitude`
- `Average Happiness Score (2015-2020)`


- Happiness Ranks as follows:
    - `Happiness Rank_2020`
    - `Happiness Rank_2019`
    - `Happiness Rank_2018`
    - `Happiness Rank_2017`
    - `Happiness Rank_2016`
    - `Happiness Rank_2015`
    
    
- Happiness Scores as follows:
    - `Happiness Score_2020`
    - `Happiness Score_2019`
    - `Happiness Score_2018`
    - `Happiness Score_2017`
    - `Happiness Score_2016`
    - `Happiness Score_2015`
    

- `Economy (GDP per Capita)`
- `Social Support`
- `Health / Life Expectancy`
- `Freedom to make choices`
- `Trust in Government Entities`
- `Generosity`
- `Dystopia Residual`
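
A rough sketch of how the per-year rank and score columns in this layout could be brought together: the per-year tables are merged on `Country` and the average is taken across the score columns. The frames below are toy data with made-up values and only two years, so the average column is named accordingly.

```python
from functools import reduce
import pandas as pd

# Toy per-year tables (made-up values) standing in for the yearly rank/score tables
scores = {2015: [7.0, 5.0], 2016: [7.2, 5.4]}
per_year = []
for year, values in scores.items():
    per_year.append(pd.DataFrame({
        "Country": ["A", "B"],
        f"Happiness Rank_{year}": [1, 2],
        f"Happiness Score_{year}": values,
    }))

# Outer-merge on Country so a country missing in one year is kept with NaN
wide = reduce(lambda left, right: left.merge(right, on="Country", how="outer"), per_year)

# Average across all the per-year score columns (NaN values are skipped by mean)
score_columns = [c for c in wide.columns if c.startswith("Happiness Score_")]
wide["Average Happiness Score (2015-2016)"] = wide[score_columns].mean(axis=1)
print(wide)
```

An outer merge is used so that a country that only appears in some of the reports is not dropped.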


In [None]:
# Code WILL CHANGE
# Creating a combined DataFrame by stacking the yearly DataFrames
dataframes = [Year_2015, Year_2016, Year_2017, Year_2018, Year_2019, Year_2020]

# Columns to keep; reindex fills a column with NaN for any year that lacks it
combined_columns = ['Report Year', 'Country', 'Region', 'Happiness Rank', 'Happiness Score',
                    'Economy (GDP per Capita)', 'Social Support', 'Family Safety',
                    'Health / Life Expectancy', 'Freedom to make choices',
                    'Trust in Government Entities', 'Generosity', 'Dystopia Residual']

# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
combined_df = pd.concat([data.reindex(columns = combined_columns) for data in dataframes],
                        ignore_index = True)

# Adding the capital cities and their coordinates from country_capitals
combined_df = combined_df.merge(country_capitals[['Country', 'Capital', 'Latitude', 'Longitude']],
                                on = 'Country', how = 'left')

combined_df.head()

---
# Insights & Findings
---
This will include the following: 
- World Map Visualization
- Top 10 happiest countries overall
- Least 10 happiest countries overall

##### World Map Visualization

In [None]:
# Importing plotly objects
import plotly.graph_objects as go

# Setting up the dataframe: capitals with coordinates plus the 2020 Happiness Score
# (country_capitals has no score column, so the 2020 scores are merged in on 'Country')
df = country_capitals.merge(Year_2020[['Country', 'Happiness Score']], on = 'Country', how = 'inner')
df['Visualization Content'] = (df['Capital'] + ', ' 
                                + df['Country'] + ', ' 
                                + df['Happiness Score'].round(3).astype(str))


# Setting up the figure (Scattergeo is the trace type that accepts lon/lat)
fig = go.Figure(data = go.Scattergeo(
        lon = df['Longitude'],
        lat = df['Latitude'],
        text = df['Visualization Content'],
        mode = 'markers',
        marker = dict(
            size = 8,
            opacity = 0.5,
            reversescale = True,
            line = dict(
                width = 1, 
                color = 'rgba(102,102,102)'
            ),
            color = df['Happiness Score'],
            colorscale = 'Blues',
            cmin = 0,
            cmax = 10,
            colorbar_title = 'Happiness Score'
        )))

# Updating the figure
fig.update_layout(width = 1000,
                  height = 1000,
                  title_text = 'World Happiness Map',
                  title_font_size = 30,
                  xaxis_title = '', yaxis_title = ''
)
                           

fig.show()

**A function to display dataframes side by side**

In [None]:
# Defining the function
def display_side_by_side(*args):
    """
    This function takes any number of dataframes and displays them side by side. It should be called as follows:
    display_side_by_side(dataframe_1, dataframe_2)
    """
    html_str=''
    for df in args:
        html_str+=df.to_html()
    display_html(html_str.replace('table','table style="display:inline"'),raw=True)
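
For anyone curious about what happens under the hood, here is a small sketch that only builds the HTML string, so it can be inspected outside of a notebook. `side_by_side_html` is a hypothetical helper name for this example, and the frames hold made-up values. It adds the inline style to the opening `<table` tag only.

```python
import pandas as pd

def side_by_side_html(*frames):
    """Return one HTML string in which each DataFrame table is rendered inline."""
    parts = []
    for frame in frames:
        # Only the opening tag gets the style; the closing </table> is left untouched
        parts.append(frame.to_html().replace("<table", '<table style="display:inline"', 1))
    return "".join(parts)

left = pd.DataFrame({"Country": ["A"], "Happiness Score": [7.0]})
right = pd.DataFrame({"Country": ["B"], "Happiness Score": [3.0]})

html = side_by_side_html(left, right)
print(html.count('style="display:inline"'))
```

In a notebook the resulting string could then be passed to `display_html(html, raw=True)`.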

#### Overall & Year-by-Year

##### Top 10 happiest countries overall

In [None]:
# Printing out the 10 happiest countries in the past 6 years


##### Least 10 happiest countries overall

In [None]:
# Printing out the 10 least happiest countries in the past 6 years


**Top 10 happiest & least happiest countries in 2020**

In [None]:
# Top 10 happiest countries in 2020
Top_10_2020 = pd.DataFrame(Year_2020[['Country', 
                                      'Happiness Score', 
                                      'Happiness Rank']].sort_values(by = 'Happiness Rank',
                                                                     ascending = True).head(10))

# 10 least happy countries in 2020
Bottom_10_2020 = pd.DataFrame(Year_2020[['Country', 
                                      'Happiness Score', 
                                      'Happiness Rank']].sort_values(by = 'Happiness Rank',
                                                                     ascending = True).tail(10))

# Showing the DataFrame side by side
display_side_by_side(Top_10_2020, Bottom_10_2020)

**Top 10 happiest & least happiest countries in 2019**

In [None]:
# Top 10 happiest countries in 2019
Top_10_2019 = pd.DataFrame(Year_2019[['Country', 
                                      'Happiness Score', 
                                      'Happiness Rank']].sort_values(by = 'Happiness Rank',
                                                                     ascending = True).head(10))

# 10 least happy countries in 2019
Bottom_10_2019 = pd.DataFrame(Year_2019[['Country', 
                                      'Happiness Score', 
                                      'Happiness Rank']].sort_values(by = 'Happiness Rank',
                                                                     ascending = True).tail(10))

# Showing the DataFrame side by side
display_side_by_side(Top_10_2019, Bottom_10_2019)

**Top 10 happiest & least happiest countries in 2018**

In [None]:
# Top 10 happiest countries in 2018
Top_10_2018 = pd.DataFrame(Year_2018[['Country', 
                                      'Happiness Score', 
                                      'Happiness Rank']].sort_values(by = 'Happiness Rank',
                                                                     ascending = True).head(10))
# 10 least happy countries in 2018
Bottom_10_2018 = pd.DataFrame(Year_2018[['Country', 
                                      'Happiness Score', 
                                      'Happiness Rank']].sort_values(by = 'Happiness Rank',
                                                                     ascending = True).tail(10))

# Showing the DataFrame side by side
display_side_by_side(Top_10_2018, Bottom_10_2018)

**Top 10 happiest & least happy countries in 2017**

In [None]:
Top_10_2017 = pd.DataFrame(Year_2017[['Country', 
                                      'Happiness Score', 
                                      'Happiness Rank']].sort_values(by = 'Happiness Rank',
                                                                     ascending = True).head(10))
Bottom_10_2017 = pd.DataFrame(Year_2017[['Country', 
                                      'Happiness Score', 
                                      'Happiness Rank']].sort_values(by = 'Happiness Rank',
                                                                     ascending = True).tail(10))

display_side_by_side(Top_10_2017, Bottom_10_2017)

**Top 10 happiest & least happy countries in 2016**

In [None]:
Top_10_2016 = pd.DataFrame(Year_2016[['Country', 
                                      'Happiness Score', 
                                      'Happiness Rank']].sort_values(by = 'Happiness Rank',
                                                                     ascending = True).head(10))
Bottom_10_2016 = pd.DataFrame(Year_2016[['Country', 
                                      'Happiness Score', 
                                      'Happiness Rank']].sort_values(by = 'Happiness Rank',
                                                                     ascending = True).tail(10))

display_side_by_side(Top_10_2016, Bottom_10_2016)

**Top 10 happiest & least happy countries in 2015**

In [None]:
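# Top 10 happiest countries in 2015 (needed by the display call below)
Top_10_2015 = pd.DataFrame(Year_2015[['Country', 
                                      'Happiness Score', 
                                      'Happiness Rank']].sort_values(by = 'Happiness Rank',
                                                                     ascending = True).head(10))
Bottom_10_2015 = pd.DataFrame(Year_2015[['Country', 
                                      'Happiness Score', 
                                      'Happiness Rank']].sort_values(by = 'Happiness Rank',
                                                                     ascending = True).tail(10))

display_side_by_side(Top_10_2015, Bottom_10_2015)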
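
The yearly cells above all repeat the same sort-then-slice pattern. As a minimal sketch, the pattern could be wrapped in one helper. The function name `top_and_bottom` and the small `demo` frame are hypothetical and only stand in for the `Year_20XX` DataFrames.

```python
import pandas as pd

def top_and_bottom(year_df, n=10, rank_col='Happiness Rank'):
    """Return the n best-ranked and n worst-ranked rows of one year's data."""
    cols = ['Country', 'Happiness Score', rank_col]
    ranked = year_df[cols].sort_values(by=rank_col, ascending=True)
    return ranked.head(n), ranked.tail(n)

# Tiny made-up frame standing in for one of the Year_20XX DataFrames
demo = pd.DataFrame({
    'Country': ['A', 'B', 'C', 'D', 'E', 'F'],
    'Happiness Score': [7.5, 7.1, 6.0, 5.2, 4.0, 3.1],
    'Happiness Rank': [1, 2, 3, 4, 5, 6],
})

top_2, bottom_2 = top_and_bottom(demo, n=2)
print(top_2)
print(bottom_2)
```

With the real data, the two returned frames could be passed straight to `display_side_by_side`.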

---
# Scatter Plots - Understanding the Relationships
---
This section plots Happiness Score against each of the following factors:
- Economy (GDP per Capita)
- Social Support
- Health / Life Expectancy
- Freedom to make choices
- Family Safety
- Trust in Government Entities
- Generosity
- Dystopia Residual

In [None]:
# Plotting subplots to understand the relationship between Happiness Score and each of the other factors
plt.subplots(figsize = (20,20), 
             nrows = 4, 
             ncols = 2)

# Scatter subplot for Happiness Score & Economy (GDP per Capita)
plt.subplot(4,2,1)
sns.regplot(x = Year_2015['Happiness Score'], 
            y = Year_2015['Economy (GDP per Capita)'],
            color = 'red', marker = '*')
plt.title('')
plt.xlabel('Happiness Score')
plt.ylabel('Economy (GDP per Capita)')
plt.xlim(0,10)

# Scatter subplot for Happiness Score & Social Support
plt.subplot(4,2,2)
sns.regplot(x = Year_2015['Happiness Score'],
            y = Year_2015['Social Support'],
            color = 'green', marker = 'o')
plt.title('')
plt.xlabel('Happiness Score')
plt.ylabel('Social Support')
plt.xlim(0,10)

# Scatter subplot for Happiness Score & Health / Life Expectancy
plt.subplot(4,2,3)
sns.regplot(x = Year_2015['Happiness Score'],
            y = Year_2015['Health / Life Expectancy'],
            color = 'blue', marker = '*')
plt.title('')
plt.xlabel('Happiness Score')
plt.ylabel('Health / Life Expectancy')
plt.xlim(0,10)

# Scatter subplot for Happiness Score & Freedom to make choices
plt.subplot(4,2,4)
sns.regplot(x = Year_2015['Happiness Score'], 
            y = Year_2015['Freedom to make choices'],
            color = 'orange', marker = 'o')
plt.title('')
plt.xlabel('Happiness Score')
plt.ylabel('Freedom to make choices')
plt.xlim(0,10)

# Scatter subplot for Happiness Score & Family Safety
plt.subplot(4,2,5)
sns.regplot(x = Year_2015['Happiness Score'],
            y = Year_2015['Family Safety'],
            color = 'red', marker = '*')
plt.title('')
plt.xlabel('Happiness Score')
plt.ylabel('Family Safety')
plt.xlim(0,10)

# Scatter subplot for Happiness Score & Trust in Government Entities
plt.subplot(4,2,6)
sns.regplot(x = Year_2015['Happiness Score'],
            y = Year_2015['Trust in Government Entities'],
            color = 'green', marker = 'o')
plt.title('')
plt.xlabel('Happiness Score')
plt.ylabel('Trust in Government Entities')
plt.xlim(0,10)

# Scatter subplot for Happiness Score & Generosity
plt.subplot(4,2,7)
sns.regplot(x = Year_2015['Happiness Score'],
            y = Year_2015['Generosity'],
            color = 'green', marker = '*')
plt.title('')
plt.xlabel('Happiness Score')
plt.ylabel('Generosity')
plt.xlim(0,10)

# Scatter subplot for Happiness Score & Dystopia Residual
plt.subplot(4,2,8)
sns.regplot(x = Year_2015['Happiness Score'],
            y = Year_2015['Dystopia Residual'],
            color = 'green', marker = '*')
plt.title('')
plt.xlabel('Happiness Score')
plt.ylabel('Dystopia Residual')
plt.xlim(0,10)


---
# Using Machine Learning to better understand the major influencing factors affecting World Happiness! 
---
This will include ...

### Scaled & Unscaled Data

#### Standard Scaler

#### Min-Max Scaler
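
A minimal sketch, on made-up numbers rather than the happiness data, of how the two scalers named above differ. `X_train_demo` and `X_test_demo` are hypothetical arrays. Each scaler is fitted on the training rows only and then reused on the test rows, which is the usual way to avoid leaking test information.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up feature matrix: two columns on very different scales
X_train_demo = np.array([[1.0, 100.0],
                         [2.0, 200.0],
                         [3.0, 300.0],
                         [4.0, 400.0]])
X_test_demo = np.array([[2.5, 250.0]])

# StandardScaler: each training column ends up with mean 0 and unit variance
standard = StandardScaler().fit(X_train_demo)
X_train_standard = standard.transform(X_train_demo)
X_test_standard = standard.transform(X_test_demo)

# MinMaxScaler: each training column is rescaled to the range [0, 1]
minmax = MinMaxScaler().fit(X_train_demo)
X_train_minmax = minmax.transform(X_train_demo)
X_test_minmax = minmax.transform(X_test_demo)

print(X_train_standard.mean(axis=0), X_train_standard.std(axis=0))
print(X_train_minmax.min(axis=0), X_train_minmax.max(axis=0))
print(X_test_standard, X_test_minmax)
```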

---
## Modeling
---

Regression analysis estimates the relationship between two or more variables. Its main benefits are as follows:
- It indicates the **significant relationships** between the dependent variable and the independent variables.
- It indicates the **strength of impact** of multiple independent variables on a dependent variable.
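
To illustrate the "strength of impact" point, here is a minimal sketch on synthetic data (not the happiness data). The target is built so that the first feature matters much more than the second, and the fitted coefficients recover that ordering. All names and numbers below are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic features: 200 rows, 2 columns
X_demo = rng.normal(size=(200, 2))

# Target built so the first feature has a much larger effect than the second
y_demo = 3.0 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X_demo, y_demo)

# Coefficients estimate the strength of each feature's impact;
# R^2 summarises how much of the variation in y the model explains
print(model.coef_)
print(model.score(X_demo, y_demo))
```

Note that scikit-learn's `LinearRegression` reports coefficients and R² but no p-values, so judging statistical significance would need a separate tool.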

### Linear Regression

In [None]:
# Instantiate the model
LinR_model = LinearRegression()

# Fit the model
LinR_model.fit(X_train, y_train)

# Test the model
LinR_model.score(X_test, y_test)

# Print extra reports 

### Linear Regression - Scaled

In [1]:
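# Import needed if the imports cell has not been run in this session
from sklearn.linear_model import LinearRegression

# Instantiate the model
LinR_model = LinearRegression()

# Fit the model on the scaled training data
LinR_model.fit(X_train_scaled, y_train_scaled)

# Test the model on test data scaled the same way
# (assumes X_test_scaled and y_test_scaled were created alongside the training versions)
LinR_model.score(X_test_scaled, y_test_scaled)

# Print extra reports 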

### Logistic Regression

In [None]:
# Instantiate the model
# Note: LogisticRegression is a classifier, so y_train must hold discrete class labels
LogR_model = LogisticRegression()

# Fit the model
LogR_model.fit(X_train, y_train)

# Test the model
LogR_model.score(X_test, y_test)

The Logistic Regression model can handle various types of relationships and is most widely used for classification problems. This is because it applies a non-linear log transformation to the predicted odds. 

What we must keep in mind is that the independent variables should not be strongly correlated with each other, i.e. there should be no multicollinearity. This weakens our model's findings, because it could be argued that many of the variables are interrelated!
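
One simple way to check for multicollinearity is to look at pairwise correlations between the features. The sketch below uses synthetic columns, not the happiness data. The column names and the 0.8 threshold are arbitrary assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 300

# Synthetic stand-ins: 'health' is built to track 'gdp' closely,
# while 'generosity' is generated independently
gdp = rng.normal(size=n)
health = 0.9 * gdp + rng.normal(scale=0.2, size=n)
generosity = rng.normal(size=n)

features = pd.DataFrame({'gdp': gdp, 'health': health, 'generosity': generosity})

# Pairwise Pearson correlations between the features
corr = features.corr()
print(corr.round(2))

# Flag feature pairs whose absolute correlation exceeds the chosen threshold
threshold = 0.8
high_pairs = [(a, b)
              for i, a in enumerate(corr.columns)
              for b in corr.columns[i + 1:]
              if abs(corr.loc[a, b]) > threshold]
print(high_pairs)
```

Pairs flagged this way are candidates for dropping or combining before the coefficients are interpreted.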


### Support Vector Machines

In [None]:
# Instantiate the model
SVM_model = LinearSVR()

# Fit the model
SVM_model.fit(X_train, y_train)

# Test the model
SVM_model.score(X_test, y_test)

### Principal Component Analysis

### Grid Search with Cross Validation

Steps for setting up a Grid Search with Cross Validation are as follows:
- Split the data into training and testing sets
- Print out the shape of the data sets
- Instantiate the Pipeline with the proper steps & parameters
- Set up the `param_grid` before moving on to the Grid Search
- Finally, instantiate the Grid Search with cross-validation & fit on the training set
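
The steps above can be sketched end to end on synthetic data. This is a minimal illustration under stated assumptions, not the notebook's final setup: the data, the `C` values, and the choice of 5 folds are all made up for the example. The point is the shape of the objects: pipeline step names (`scaler`, `model`) must match the keys in the grid, and every grid value is a list.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import LinearSVR

# Synthetic regression data standing in for the happiness features and target
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(150, 4))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=150)

# Step 1: split into training and testing sets
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.3, random_state=101)

# Step 2: print the shapes
print(X_tr.shape, X_te.shape)

# Step 3: pipeline whose step names are reused as keys in the grid
pipe = Pipeline([('scaler', StandardScaler()),
                 ('model', LinearRegression())])

# Step 4: one dict per model family; every value is a list of candidates
grid = [
    {'scaler': ['passthrough', MinMaxScaler(), StandardScaler()],
     'model': [LinearRegression()]},
    {'scaler': [MinMaxScaler(), StandardScaler()],
     'model': [LinearSVR(max_iter=10000, random_state=0)],
     'model__C': [0.1, 1.0, 10.0]},
]

# Step 5: grid search with cross-validation, fitted on the training set only
search = GridSearchCV(pipe, param_grid=grid, cv=5)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(search.score(X_te, y_te))
```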


**Step 1**: Split the Data into training and testing sets

In [None]:
# Splitting our data again into features and target variables
X = dataframe['']
y = dataframe['Average Happiness Score (2015-2020)']
                
# Note: stratify is omitted because the target is continuous; stratified splitting needs class labels
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y, 
                                                    test_size = 0.3, 
                                                    random_state = 101)

**Step 2**: Printing the shape of the data set

In [None]:
# Some print statements to see what the data looks like prior to splitting
print('Total Data Set Size')
print(f"Shape of our features predicting happiness: {X.shape}")
print(f"Shape of our target variable predicting happiness: {y.shape}")

# Some print statements to see what the training data looks like
print('Training set size:')
print(f"Features shape: {X_train.shape}")
print(f"Target shape: {y_train.shape}")

# Some print statements to see what the testing data looks like
print('Testing set size:')
print(f"Features shape: {X_test.shape}")
print(f"Target shape: {y_test.shape}")

**Step 3**: Instantiating and setting up our pipeline parameters

In [None]:
# Setting up our pipeline to scale the data and apply modeling
# (step names must match the keys used in the parameter grids: 'scaler' and 'model')
pipeline = Pipeline([('scaler', StandardScaler()),
                     ('model', LinearRegression())])


**Step 4**: Setting up the `param_grid` before moving on to the Grid Search

In [None]:
# Below we create a list of parameters we will be using for our GridSearchCV
# (every value in a param_grid must be a list of candidate settings)
scalers = [None, MinMaxScaler(), StandardScaler()]

n_jobs = -1

# Parameters for Linear Regression
param_grid_LinR = {'scaler' : scalers, 
                   'model' : [LinearRegression()], 
                   'model__n_jobs': [n_jobs]}

# Parameters for Logistic Regression
param_grid_LogR = {'scaler' : scalers, 
                   'model' : [LogisticRegression()], 
                   'model__n_jobs': [n_jobs]}

# Parameters for Principal Component Analysis (PCA has no n_jobs parameter)
param_grid_PCA = {'scaler' : scalers, 
                   'model' : [PCA()]}

# Parameters for Support Vector Machines (LinearSVR has no n_jobs parameter)
param_grid_SVM = {'scaler' : scalers, 
                   'model' : [LinearSVR()]}

# Setting up the parameter grid as a list 
param_grids = [param_grid_LinR, 
               param_grid_LogR, 
               param_grid_PCA, 
               param_grid_SVM]

**Step 5**: Finally, instantiate the Grid Search with 10-fold cross-validation & fit on the training set. 

We choose 10 folds because the data set is already quite small, so validating across more folds may give a more reliable estimate of performance. 
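
To make the effect of the fold count concrete, the sketch below counts how many rows each model trains on for a few values of `k`. The sample count of 150 is a made-up number, not the size of the actual data set. With more folds, each fit sees a larger share of a small data set.

```python
import numpy as np
from sklearn.model_selection import KFold

# Made-up sample count used only for illustration
n_samples = 150
X_demo = np.arange(n_samples).reshape(-1, 1)

# Record the training-set size per fold for several fold counts
train_size_by_k = {}
for k in (3, 5, 10):
    kfold = KFold(n_splits=k, shuffle=True, random_state=0)
    sizes = {len(train_idx) for train_idx, _ in kfold.split(X_demo)}
    # 150 divides evenly by 3, 5 and 10, so every fold has the same size
    train_size_by_k[k] = sizes.pop()

print(train_size_by_k)
```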

In [None]:
%%time
# Setting up / Instantiating the gridsearch
grid_model = GridSearchCV(estimator = pipeline,
                          param_grid = param_grids,
                          cv = 10,
                          n_jobs = -1,
                          verbose = 1)

# Fitting the data on the grid
grid_model.fit(X_train, y_train)



In [None]:
# Checking for best parameters
print("Best parameters set found on development set:")
print(grid_model.best_params_)
print()

# Checking for best score
print()
print("Grid best score:")
print(grid_model.best_score_)
print()

# Checking for best model
print()
print("Best model is:")
print(grid_model.best_estimator_)

Talk about what some of the top factors are:

# Final Verdict