# Predicting the Happiness of a Nation Based on its Development
This project aims to explore the link between the happiness of a nation and its economic development.

In [None]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import os
import csv
import sqlite3 

from tqdm.notebook import tqdm

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler

In [None]:
print(os.listdir("../input/world-happiness"))

# Data Cleaning and Combining
The data for this project comes from two datasets, so the country names need to be cleaned and reconciled before the two can be combined. One dataset contains the World Happiness Report from 2015; the other contains economic development indicators.

In [None]:
#reading the happiness data into a data frame
HappinessDF=pd.read_csv("../input/world-happiness/2015.csv")
print(HappinessDF.dtypes)
HappinessDF.head()

The development data can be read from CSV files, but I will read it from the SQLite database to demonstrate my SQL knowledge.

The database stores various development indicators in the Indicators table, along with the country code, year and indicator code. The countries in the happiness data may have different names from those in the indicator database, so it makes sense to first find which countries in the happiness dataset do not appear in the indicators dataset.

In [None]:
connection = sqlite3.connect("../input/world-development-indicators/database.sqlite")
cursor=connection.cursor()

def DataFrameTable(query):
    return pd.read_sql_query(query,connection)

In [None]:
#first read the country names into a DataFrame
query="""
    SELECT ShortName  
    FROM Country
"""
DevelCountryNames=DataFrameTable(query)
DevelCountryNames

In [None]:
# Now list the non-matching countries
DevelNames=set(DevelCountryNames["ShortName"])
NonMatches=[country for country in HappinessDF["Country"] if country not in DevelNames]
print(NonMatches)
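
The same check can also be sketched with a pandas merge, which additionally shows the names that exist only on the development side. The two small frames below are made-up stand-ins for `HappinessDF` and `DevelCountryNames`, not the real data.

```python
import pandas as pd

# Toy stand-ins for the two tables (hypothetical values, not the real data).
happiness = pd.DataFrame({"Country": ["Norway", "Laos", "Slovakia"]})
development = pd.DataFrame({"ShortName": ["Norway", "Lao PDR", "Slovak Republic"]})

# An outer merge with indicator=True labels every row by the side it came from.
merged = happiness.merge(
    development,
    left_on="Country",
    right_on="ShortName",
    how="outer",
    indicator=True,
)

# Names that appear only in the happiness data.
only_happiness = sorted(merged.loc[merged["_merge"] == "left_only", "Country"])
# Names that appear only in the development data.
only_development = sorted(merged.loc[merged["_merge"] == "right_only", "ShortName"])

print(only_happiness)
print(only_development)
```

Seeing both sides at once can make it easier to spot likely pairs such as Laos and Lao PDR.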

Some of these countries are not universally recognized, such as **'Palestinian Territories'**, **'Somaliland region'**, **'North Cyprus'** and **'Taiwan'**, but some do go by alternative names like **Palestine**, the **Turkish Republic of Northern Cyprus** and the **Republic of China**. **'Hong Kong'** does not tend to go by any other name, but it is now technically part of China, so it has likely not been included separately. **'Congo (Kinshasa)'** appears to represent the **Democratic Republic of the Congo**, while **'Congo (Brazzaville)'** represents the **Republic of the Congo**. The **'Ivory Coast'** may go by its French name, **Côte d'Ivoire**. Kyrgyzstan's official name is the **Kyrgyz Republic**. Laos's official name is the **Lao People's Democratic Republic**. Slovakia's official name is the **Slovak Republic**.

In [None]:
SearchTerms=["Congo","Ivoire","Kyrgyzstan","Kyrgyz","Syria","Lao","Slovak","Palestine","Somaliland","Cyprus","China","Korea"]
#join one LIKE clause per search term with OR
SearchQuery="SELECT ShortName FROM Country WHERE "+" OR ".join("ShortName LIKE '%"+SearchTerm+"%'" for SearchTerm in SearchTerms)
print(SearchQuery)
    

DataFrameTable(SearchQuery)
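
Building the query by string concatenation works here because the search terms are fixed and contain no quotes. As a possible alternative, the sketch below passes the patterns as query parameters instead. The in-memory table and its four rows are hypothetical stand-ins for the real Country table.

```python
import sqlite3
import pandas as pd

# In-memory toy database standing in for the real Country table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Country (ShortName TEXT)")
conn.executemany(
    "INSERT INTO Country VALUES (?)",
    [("Lao PDR",), ("Côte d'Ivoire",), ("Norway",), ("Slovak Republic",)],
)

search_terms = ["Lao", "Ivoire", "Slovak"]

# One "ShortName LIKE ?" clause per term; the values travel separately
# from the SQL text, so quotes inside a term need no manual escaping.
where = " OR ".join("ShortName LIKE ?" for _ in search_terms)
sql = "SELECT ShortName FROM Country WHERE " + where + " ORDER BY ShortName"
patterns = ["%" + term + "%" for term in search_terms]

matches = pd.read_sql_query(sql, conn, params=patterns)
print(matches)
```

With placeholders, a name such as Côte d'Ivoire can be passed as a value without doubling its apostrophe.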

Now we can assign the countries in the happiness data that didn't directly match countries in the development data.

| Happiness Data | Development Data |
|----------------|------------------|
| Taiwan         |      N/A         |
| Slovakia       | Slovak Republic  |
| South Korea    | Korea            |
| North Cyprus   | N/A              |
| Hong Kong      | Hong Kong SAR, China |
| Kyrgyzstan     | Kyrgyz Republic  |
| Somaliland region | N/A           |
| Laos           | Lao PDR          |
| Palestinian Territories| N/A      |
| Congo (Kinshasa) | Dem. Rep. Congo|
| Congo (Brazzaville) | Congo       |
| Ivory Coast    | Côte d'Ivoire    |
| Syria          | Syrian Arab Republic|



Now we need to add all the matching countries to a labeled DataFrame together with the development factors. The development factors are stored in the Indicators table, which contains the columns CountryName, CountryCode, IndicatorName, IndicatorCode, Year and Value. The Country table allows us to turn a CountryCode into a ShortName. The year also needs to be reasonably recent, since the happiness data is from 2015. We must first create a list of the countries to query as they appear in the ShortName column of the Country table, then find their country codes and query the Indicators table for the indicator names and values. All this information can then be added to a table with columns for Country, Happiness and the Indicators.



In [None]:
#first create the list of countries 
CountrysList = HappinessDF["Country"].tolist()

CountrysToRemove=["Taiwan","North Cyprus","Somaliland region","Palestinian Territories"]

CountrysToReplace={
    "Slovakia":"Slovak Republic",
    "South Korea":"Korea",
    "Hong Kong":"Hong Kong SAR, China",
    "Kyrgyzstan":"Kyrgyz Republic",
    "Laos":"Lao PDR",
    "Congo (Kinshasa)":"Dem. Rep. Congo",
    "Congo (Brazzaville)":"Congo",
    "Ivory Coast":"Côte d''Ivoire", #apostrophe doubled so that it is escaped inside the SQL string literal
    "Syria":"Syrian Arab Republic"} 
    #This can be used to convert names when needed


for Country in CountrysToRemove:
    CountrysList.remove(Country)
        
for Country in CountrysToReplace.keys():
    CountrysList[CountrysList.index(Country)]=CountrysToReplace[Country]

print(CountrysList)

There are 1344 indicators in total so we must pick the ones most likely to affect happiness. 

In [None]:

IndicatorQuery="""  
SELECT IndicatorName,max(Value),min(Value)
FROM Indicators
GROUP BY IndicatorName
""" #max and min values are included to inspect each indicator's range
Indicators=DataFrameTable(IndicatorQuery)


In [None]:
Indicators.head()


In [None]:
pd.set_option("display.max_colwidth", 120)
pd.set_option("display.max_rows",1400)

#used to pick the indicators of interest; uncomment to print them all 
#print(Indicators)

Let's start with some general statistics that are likely to affect happiness, or whose effect on happiness would be interesting to see.

In [None]:
IndicatorsList=[
"Access to electricity (% of population)"
,"Adjusted net enrolment rate, primary, both sexes (%)"
,"Adolescent fertility rate (births per 1,000 women ages 15-19)"
,"Adult literacy rate, population 15+ years, both sexes (%)"
,"Arable land (hectares per person)"
,"Average precipitation in depth (mm per year)"
,"Bribery incidence (% of firms experiencing at least one bribe payment request)"
,"Central government debt, total (% of GDP)"
,"Community health workers (per 1,000 people)"
,"Currency composition of PPG debt, U.S. dollars (%)"
,"Death rate, crude (per 1,000 people)"
,"Droughts, floods, extreme temperatures (% of population, average 1990-2009)"
,"Emigration rate of tertiary educated (% of total tertiary educated population)"
,"Expenditure on education as % of total government expenditure (%)"
,"Fixed broadband subscriptions (per 100 people)"
,"GDP per capita (constant 2005 US$)"
,"GDP per capita growth (annual %)"
,"Improved sanitation facilities (% of population with access)"
,"Income share held by highest 10%"
,"Income share held by highest 20%"
,"Internet users (per 100 people)"
,"Life expectancy at birth, total (years)"
,"Long-term unemployment (% of total unemployment)"
,"Mobile cellular subscriptions (per 100 people)"
,"Net enrolment rate, secondary, both sexes (%)"
,"Net migration"
,"Percentage of students in secondary education who are female (%)"
,"Population density (people per sq. km of land area)"
,"Population, total"
,"Poverty gap at $3.10 a day (2011 PPP) (%)"
,"Refugee population by country or territory of origin"
,"Tax revenue (% of GDP)"
,"Urban population (% of total)"
]


To start with, I will use the most recent data for each indicator and see how many data points are missing.

In [None]:
def ListToString(List):
    tup="("
    for x in List:
        tup+="'"+str(x)+"',"
    tup=tup[:-1]+")"
    return tup
    
# takes a long time to run; there is no pivot function in SQLite, so instead I will pivot with pandas
query = """
SELECT data.CountryName,data.IndicatorName,data.Year,I.Value
FROM
(SELECT Country.ShortName AS CountryName,Indicators.CountryCode,IndicatorName,IndicatorCode,MAX(Year) AS Year
FROM Indicators,Country
WHERE Indicators.CountryCode = Country.CountryCode
AND Country.ShortName IN """+ListToString(CountrysList)+"""
AND Indicators.IndicatorName IN """+ListToString(IndicatorsList)+"""
GROUP BY Country.ShortName,Indicators.CountryCode,IndicatorName,IndicatorCode) AS data
LEFT JOIN Indicators I ON data.CountryCode = I.CountryCode AND data.IndicatorCode = I.IndicatorCode and I.Year = data.Year
;
"""

IndicatorData=DataFrameTable(query)
IndicatorData.head()
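
The "most recent year per country and indicator" pattern used in this query can be tried out on a toy in-memory database. The table contents and the short indicator codes below are made up for illustration; only the shape of the query mirrors the one above.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Indicators "
    "(CountryCode TEXT, IndicatorCode TEXT, Year INTEGER, Value REAL)"
)
# Hypothetical rows: two years of one indicator for NOR, one row each otherwise.
conn.executemany(
    "INSERT INTO Indicators VALUES (?, ?, ?, ?)",
    [
        ("NOR", "GDP", 2012, 1.0),
        ("NOR", "GDP", 2014, 2.0),
        ("NOR", "POP", 2013, 5.0),
        ("LAO", "GDP", 2010, 0.5),
    ],
)

# The subquery finds the latest year per (country, indicator);
# joining back to the table recovers the value recorded in that year.
sql = """
SELECT latest.CountryCode, latest.IndicatorCode, latest.Year, I.Value
FROM (
    SELECT CountryCode, IndicatorCode, MAX(Year) AS Year
    FROM Indicators
    GROUP BY CountryCode, IndicatorCode
) AS latest
JOIN Indicators I
  ON I.CountryCode = latest.CountryCode
 AND I.IndicatorCode = latest.IndicatorCode
 AND I.Year = latest.Year
ORDER BY latest.CountryCode, latest.IndicatorCode
"""
latest_values = pd.read_sql_query(sql, conn)
print(latest_values)
```

Each group contributes exactly one row, so the older 2012 value for NOR is discarded in favour of the 2014 one.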

In [None]:
IndicatorData=IndicatorData.pivot(index="CountryName",columns="IndicatorName",values="Value")
IndicatorData.columns.name = None 
IndicatorData.head()
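
As a small illustration of what the pivot does, the sketch below reshapes a made-up long table (one row per country and indicator) into a wide one (one row per country). The indicator names and values are placeholders.

```python
import pandas as pd

# Long format: one row per (country, indicator) pair. Values are made up.
long_form = pd.DataFrame(
    {
        "CountryName": ["Norway", "Norway", "Lao PDR"],
        "IndicatorName": ["Indicator A", "Indicator B", "Indicator A"],
        "Value": [1.0, 2.0, 3.0],
    }
)

# Wide format: one row per country, one column per indicator.
wide = long_form.pivot(index="CountryName", columns="IndicatorName", values="Value")
wide.columns.name = None
print(wide)

# Combinations absent from the long table become NaN in the wide table.
n_missing = int(wide.isna().sum().sum())
print(n_missing)
```

Any country and indicator combination that the query did not return shows up as a missing value after the pivot, which is what the missing-data count in the next cell measures.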

In [None]:
print("There are",len(IndicatorData)-len(IndicatorData.dropna()),"countries with missing data out of",len(IndicatorData))


In [None]:
IndicatorData.describe()

Let's find the average year of the data points if we select the most recent non-null data point for each country and indicator.

In [None]:
query = """
SELECT AVG(data.Year) AS AverageMostRecentNonNullYearForEachFeature
FROM
(SELECT MAX(Year) AS Year
FROM Indicators,Country
WHERE Indicators.CountryCode = Country.CountryCode
AND Country.ShortName IN """+ListToString(CountrysList)+"""
AND Indicators.IndicatorName IN """+ListToString(IndicatorsList)+"""
AND Indicators.Value IS NOT NULL
GROUP BY Country.ShortName,Indicators.CountryCode,IndicatorName,IndicatorCode) AS data
;
"""

AverageMostRecentNonNullYear=DataFrameTable(query)
display(AverageMostRecentNonNullYear.head())

query = """
SELECT Count(data.Year) AS NumberOfNonNullValues
FROM
(SELECT MAX(Year) AS Year
FROM Indicators,Country
WHERE Indicators.CountryCode = Country.CountryCode
AND Country.ShortName IN """+ListToString(CountrysList)+"""
AND Indicators.IndicatorName IN """+ListToString(IndicatorsList)+"""
AND Indicators.Value IS NOT NULL
GROUP BY Country.ShortName,Indicators.CountryCode,IndicatorName,IndicatorCode) AS data
;
"""
NumberOfNonNulls=DataFrameTable(query)
display(NumberOfNonNulls.head())

print("There should be",len(IndicatorData.index)*len(IndicatorData.columns),"values.")

These data points are recent enough on average and there aren't many missing, so I will use these instead.

In [None]:
query = """
SELECT data.CountryName,data.IndicatorName,data.Year,I.Value
FROM
(SELECT Country.ShortName AS CountryName,Indicators.CountryCode,IndicatorName,IndicatorCode,MAX(Year) AS Year
FROM Indicators,Country
WHERE Indicators.CountryCode = Country.CountryCode
AND Country.ShortName IN """+ListToString(CountrysList)+"""
AND Indicators.IndicatorName IN """+ListToString(IndicatorsList)+"""
AND Indicators.Value IS NOT NULL
GROUP BY Country.ShortName,Indicators.CountryCode,IndicatorName,IndicatorCode) AS data
LEFT JOIN Indicators I ON data.CountryCode = I.CountryCode AND data.IndicatorCode = I.IndicatorCode and I.Year = data.Year
;
"""

IndicatorData=DataFrameTable(query)
IndicatorData=IndicatorData.pivot(index="CountryName",columns="IndicatorName",values="Value")
IndicatorData.columns.name = None 
display(IndicatorData.head(5))
display(IndicatorData.describe())

In [None]:
IndicatorData.to_csv("IndicatorData.csv")

<a href="IndicatorData.csv"> Download File </a>

### (Can run notebook from here)

In [None]:
import pandas as pd
import numpy as np 


IndicatorData=pd.read_csv("../input/development-data/IndicatorData.csv")
IndicatorData=IndicatorData.set_index(["CountryName"])
display(IndicatorData.head())
IndicatorData.describe()

Let's keep the columns that have at least 147 entries.

In [None]:
Data=IndicatorData[[col for col in IndicatorData.columns if IndicatorData[col].count()>=147]]
print("We now have",len(Data.columns),"features instead of",len(IndicatorData.columns))
print("If we drop countries with missing data we have",len(Data.dropna()),"countries out of",len(Data))

We don't lose too many countries if we just drop the ones with missing data, so let's do that.

In [None]:
Data=Data.dropna()
IndicatorDF=Data.copy() #copy so that new columns can be added without modifying Data
display(Data.columns)

Now we need to adjust some values to per capita values. After that we can begin exploratory data analysis.

Factors that need adjusting to something per person are:
* Refugee population by country or territory of origin

Factors needed for modification that then need to be removed are:
* Population, total

In [None]:
IndicatorDF["Refugee Rate"]=IndicatorDF["Refugee population by country or territory of origin"]/IndicatorDF["Population, total"]
IndicatorDF=IndicatorDF.drop(columns=["Refugee population by country or territory of origin"])

In [None]:
HappinessFileString="../input/world-happiness/2015.csv"
HappinessDF=pd.read_csv(HappinessFileString)

In [None]:
IndicatorDF.head()

We need to replace the country names in the index so that we can join the datasets on country names.

In [None]:
CountrysToReplace={
    "Slovakia":"Slovak Republic",
    "South Korea":"Korea",
    "Hong Kong":"Hong Kong SAR, China",
    "Kyrgyzstan":"Kyrgyz Republic",
    "Laos":"Lao PDR",
    "Congo (Kinshasa)":"Dem. Rep. Congo",
    "Congo (Brazzaville)":"Congo",
    "Ivory Coast":"Côte d'Ivoire", #single apostrophe: the index holds the plain name, no SQL escaping needed
    "Syria":"Syrian Arab Republic"} 

Index=HappinessDF["Country"].tolist()
Index=[CountrysToReplace.get(country,country) for country in Index]
HappinessDF.index=Index
HappinessDF=HappinessDF[["Happiness Score"]]
HappinessDF.head()

In [None]:
DataDF=IndicatorDF.join(HappinessDF,how="inner")
DataDF.head()
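
An inner join silently drops any country whose name does not match on both sides, so it can be worth listing what was lost. The sketch below shows one way to do that on made-up data; the scores, the indicator column and the shortened name map are all hypothetical.

```python
import pandas as pd

# Toy indicator table indexed by development-data names (made-up values).
indicators = pd.DataFrame(
    {"Some indicator": [1.0, 2.0, 3.0]},
    index=["Norway", "Lao PDR", "Côte d'Ivoire"],
)
# Toy happiness table using happiness-data names (made-up scores).
happiness = pd.DataFrame(
    {
        "Country": ["Norway", "Laos", "Ivory Coast", "Taiwan"],
        "Happiness Score": [7.0, 5.0, 4.0, 6.0],
    }
)

# Translate happiness names to development names, then index by them.
name_map = {"Laos": "Lao PDR", "Ivory Coast": "Côte d'Ivoire"}
happiness.index = happiness["Country"].replace(name_map)
scores = happiness[["Happiness Score"]]

joined = indicators.join(scores, how="inner")

# Countries with a score that did not survive the join.
dropped = sorted(set(scores.index) - set(joined.index))
print(len(joined), dropped)
```

In this toy case only Taiwan is dropped, which matches the countries the notebook removes on purpose; an unexpected name in `dropped` would point to a mapping that failed.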

# Exploratory Data Analysis

In [None]:
DataDF.info()

In [None]:
DataDF.describe()

In [None]:
DataDF.hist(figsize=(40,40),bins=50, xlabelsize=10, ylabelsize=10)

In [None]:
import seaborn as sns

for i in range(0, len(DataDF.columns), 5):
    sns.pairplot(data=DataDF,x_vars=DataDF.columns[i:i+5],y_vars=["Happiness Score"],height=5)


Most of these features appear to have either no relationship or a linear relationship with happiness; however, some, such as GDP per capita, appear to have a higher-order relationship. These non-linear relationships will likely not be captured well by linear models.

In [None]:
CorMatrix=DataDF.corr()
CorMatrix.style.background_gradient(cmap='coolwarm')

Lots of these features are strongly correlated with one another, so it will likely be worthwhile to try ridge and lasso regression.
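
Reading strongly correlated pairs off a coloured matrix can be slow with this many features. As a sketch, the pairs can be listed directly; the three synthetic columns and the 0.8 cut-off below are arbitrary choices for illustration, not values from this dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)
frame = pd.DataFrame(
    {
        "a": base,
        "b": base + 0.1 * rng.normal(size=200),  # nearly a copy of "a"
        "c": rng.normal(size=200),               # unrelated to both
    }
)

corr = frame.corr()
# Keep only the strict upper triangle so that each pair is listed once
# and the diagonal (always 1) is ignored.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()
strong_pairs = pairs[pairs.abs() > 0.8]
print(strong_pairs)
```

Applied to `CorMatrix`, the same few lines would give a flat list of feature pairs sorted however is convenient.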

# **Model Creation**
I will use cross-validation as there is not a huge amount of data. Hyperparameters will be tuned on the same dataset that is used to score the models; this isn't good practice because it leaks information from the scoring data into model selection (data leakage), but there isn't much data.
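
For reference, one common way to avoid this leakage without a separate hold-out set is nested cross-validation, where the hyperparameter search runs inside each outer training fold. The sketch below uses synthetic data and an arbitrary alpha grid; it is not what the rest of this notebook does.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Synthetic regression data standing in for the real features and scores.
X_demo, y_demo = make_regression(
    n_samples=120, n_features=10, noise=10.0, random_state=0
)

inner = KFold(n_splits=5, shuffle=True, random_state=0)  # used for tuning
outer = KFold(n_splits=5, shuffle=True, random_state=1)  # used for scoring

# The search object is itself an estimator: fitting it tunes alpha.
search = GridSearchCV(Ridge(), {"alpha": np.logspace(-3, 2, 10)}, cv=inner)

# Each outer fold tunes alpha on its own training part only,
# then scores on data the search never saw.
outer_scores = cross_val_score(search, X_demo, y_demo, cv=outer)
print(outer_scores.mean())
```

The outer scores estimate how the whole tune-then-fit procedure generalises, at the cost of running the search once per outer fold.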
### Linear Regression
We begin with a simple linear regression model. 

In [None]:
X=DataDF.drop(columns=["Happiness Score"]).copy()
Y=DataDF["Happiness Score"].copy()

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression

model=LinearRegression(normalize=True)
params={}

clf = GridSearchCV(model,params,cv=5,n_jobs=-1,iid=True)
clf.fit(X,Y)
results=pd.DataFrame(data=clf.cv_results_)
BestScore = clf.best_score_
print("The mean cross-validated score is",BestScore)
print("The model's parameters are",clf.best_params_)

### Lasso Regression

In [None]:
from sklearn.linear_model import Lasso
import matplotlib.pyplot as plt

model=Lasso(normalize=True)
params = {
    "alpha":np.logspace(-5,1)
}

clf = GridSearchCV(model,params,cv=5,n_jobs=-1,iid=True)
clf.fit(X,Y)
results=pd.DataFrame(data=clf.cv_results_)
BestScore = clf.best_score_
print("The mean cross-validated score is",BestScore)
print("The model's parameters are",clf.best_params_)


for param,values in params.items():
    plt.plot( results["param_"+param],results["mean_test_score"])
    plt.xscale("log")
    plt.ylabel(r"$R^2$")
    plt.xlabel(param)
    plt.gca().spines['top'].set_visible(False)
    plt.gca().spines['right'].set_visible(False)
    plt.show()

Lasso regression does not improve the results; the best value of alpha is the one that makes it most like linear regression.

### Ridge Regression 

In [None]:
from sklearn.linear_model import Ridge

model=Ridge(normalize=True)
params = {
    "alpha":np.logspace(-6,1)
}

clf = GridSearchCV(model,params,cv=5,n_jobs=-1,iid=True)
clf.fit(X,Y)
results=pd.DataFrame(data=clf.cv_results_)
BestScore = clf.best_score_
print("The mean cross-validated score is",BestScore)
print("The model's parameters are",clf.best_params_)

for param,values in params.items():
    plt.plot( results["param_"+param],results["mean_test_score"])
    plt.xscale("log")
    plt.ylabel(r"$R^2$")
    plt.xlabel(param)
    plt.gca().spines['top'].set_visible(False)
    plt.gca().spines['right'].set_visible(False)
    plt.show()
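
Recent scikit-learn releases have removed the `normalize` argument from the linear models and the `iid` argument from `GridSearchCV`. A roughly comparable setup on such versions is sketched below with a `Pipeline` on synthetic data. Note that `StandardScaler` scales by the standard deviation whereas `normalize=True` scaled by the L2 norm, so the alpha values are not directly comparable with the ones found above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for X and Y.
X_demo, y_demo = make_regression(
    n_samples=150, n_features=8, noise=15.0, random_state=0
)

# Scaling lives inside the pipeline, so it is re-fitted on the
# training part of every fold rather than on the full dataset.
pipeline = Pipeline([("scale", StandardScaler()), ("ridge", Ridge())])

# Pipeline parameters are addressed as <step name>__<parameter>.
param_grid = {"ridge__alpha": np.logspace(-3, 3, 13)}

search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_demo, y_demo)
print(search.best_params_, search.best_score_)
```

The step names `scale` and `ridge` and the alpha range are arbitrary choices for this sketch.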

Ridge regression has given a slight improvement. Let's try some non-linear models to see if we can improve the results.

In [None]:
from sklearn.preprocessing import PolynomialFeatures
transformer= PolynomialFeatures(degree=2)
PolyX = transformer.fit_transform(X.copy())

In [None]:
from sklearn.model_selection import cross_validate

model=LinearRegression(normalize=True)

results=pd.DataFrame(data=cross_validate(model,PolyX,Y,cv=5,return_train_score=True),index=range(1,6))
results.index=results.index.rename("Fold")
display(results)

Clearly we are over-fitting and need to try Ridge and Lasso regression.
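
Part of the reason is simply the number of columns: a degree 2 expansion of n features produces (n+1)(n+2)/2 columns, counting the constant column. The sketch below checks this on random data; the sizes of 100 rows and 10 features are arbitrary and are not the sizes of `X`.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Arbitrary sizes for illustration, not those of the notebook's data.
n_samples, n_features = 100, 10
X_demo = np.random.default_rng(0).normal(size=(n_samples, n_features))

poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X_demo)

# Constant term + linear terms + squares and pairwise products.
expected = (n_features + 1) * (n_features + 2) // 2
print(X_poly.shape[1], expected)
```

When the column count approaches the number of rows, an unregularised linear model has enough freedom to fit the training folds almost exactly.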

### Polynomial Lasso Regression

In [None]:
from sklearn.linear_model import Lasso

model=Lasso(normalize=True)
params = {
    "alpha":np.logspace(-3.5,0)
}

clf = GridSearchCV(model,params,cv=5,n_jobs=-1,iid=True)
clf.fit(PolyX,Y)
results=pd.DataFrame(data=clf.cv_results_)
BestScore = clf.best_score_
print("The mean cross-validated score is",BestScore)
print("The model's parameters are",clf.best_params_)


for param,values in params.items():
    plt.plot( results["param_"+param],results["mean_test_score"])
    plt.xscale("log")
    plt.ylabel(r"$R^2$")
    plt.xlabel(param)
    plt.gca().spines['top'].set_visible(False)
    plt.gca().spines['right'].set_visible(False)
    plt.show()



The model appears to be fitting something close to a linear model.

### Polynomial Ridge Regression

In [None]:

model=Ridge(normalize=True)
params = {
    "alpha":np.logspace(-1,2)
}

clf = GridSearchCV(model,params,cv=5,n_jobs=-1,iid=True)
clf.fit(PolyX,Y)
results=pd.DataFrame(data=clf.cv_results_)
BestScore = clf.best_score_
print("The mean cross-validated score is",BestScore)
print("The model's parameters are",clf.best_params_)

for param,values in params.items():
    plt.plot( results["param_"+param],results["mean_test_score"])
    plt.xscale("log")
    plt.ylabel(r"$R^2$")
    plt.xlabel(param)
    plt.gca().spines['top'].set_visible(False)
    plt.gca().spines['right'].set_visible(False)
    plt.show()

These polynomial models are only marginally more accurate, if at all, than their linear counterparts. Lasso and Ridge regression appear to be forcing to zero and shrinking most of the polynomial coefficients, respectively.

Let's see if a Support Vector Machine or Random Forest can better capture these relationships.
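
The difference between the two penalties can be seen by counting coefficients that are exactly zero. The sketch below does this on synthetic data in which only 3 of 20 features carry signal; the data and both alpha values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 features, only 3 of which influence the target.
X_demo, y_demo = make_regression(
    n_samples=100, n_features=20, n_informative=3, noise=5.0, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

lasso = Lasso(alpha=2.0).fit(X_demo, y_demo)
ridge = Ridge(alpha=2.0).fit(X_demo, y_demo)

# The L1 penalty sets coefficients exactly to zero;
# the L2 penalty only shrinks them towards zero.
lasso_zeros = int(np.sum(lasso.coef_ == 0))
ridge_zeros = int(np.sum(ridge.coef_ == 0))
print(lasso_zeros, ridge_zeros)
```

The same count on the fitted polynomial models would show how many of the expanded terms Lasso actually kept.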

### Support Vector Machine

In [None]:
# first we need to normalize the data 
from sklearn.preprocessing import StandardScaler
transformer=StandardScaler()
NormX = transformer.fit_transform(X)

In [None]:
from sklearn.svm import SVR 

model=SVR()
params={
    "C"       : np.logspace(-2,2),
    "epsilon" : np.logspace(-2,2),
    "kernel"  : ["rbf","poly","sigmoid"]
}

clf = GridSearchCV(model,params,cv=5,n_jobs=-1,iid=True,verbose=1)
clf.fit(NormX,Y)
results=pd.DataFrame(data=clf.cv_results_)
BestScore = clf.best_score_
print("The mean cross-validated score is",BestScore)

BestParams = clf.best_params_
print("The model's parameters are",BestParams)
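
This search is by far the most expensive one in the notebook. The sketch below counts the candidates in the same grid, using scikit-learn's `ParameterGrid` helper and the fact that `np.logspace` returns 50 points by default.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same grid as in the SVR cell above.
params = {
    "C": np.logspace(-2, 2),
    "epsilon": np.logspace(-2, 2),
    "kernel": ["rbf", "poly", "sigmoid"],
}

# Every combination is fitted once per fold (cv=5 in the cell above).
n_candidates = len(ParameterGrid(params))
n_fits = n_candidates * 5
print(n_candidates, n_fits)
```

If that is too slow, `RandomizedSearchCV` can sample a fixed number of candidates from the same ranges instead of trying all of them.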

### Random Forest

In [None]:
from sklearn.ensemble import RandomForestRegressor

model= RandomForestRegressor()
params={
    "n_estimators" : [int(x) for x in np.linspace(1,1000,8)],
    "max_depth"    : [int(round(x)) for x in np.logspace(0,3,8)]+[None] #tree depth must be an integer
}

clf = GridSearchCV(model,params,cv=5,n_jobs=-1,iid=True,verbose=1)
clf.fit(NormX,Y)
results=pd.DataFrame(data=clf.cv_results_)
BestScore = clf.best_score_
print("The mean cross-validated score is",BestScore)

BestParams = clf.best_params_
print("The model's parameters are",BestParams)

### Linear Ridge Regression Evaluation 

The best model was the linear ridge regression model, which will now be retrained and evaluated.

In [None]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

model=Ridge(normalize=True)
params = {
    "alpha":np.logspace(-1.6,-0.2)
}

clf = GridSearchCV(model,params,cv=5,n_jobs=-1,iid=True)
clf.fit(X,Y)
results=pd.DataFrame(data=clf.cv_results_)
BestScore = clf.best_score_
print("The mean cross-validated score is",BestScore)
print("The model's parameters are",clf.best_params_)

for param,values in params.items():
    plt.plot( results["param_"+param],results["mean_test_score"])
    plt.xscale("log")
    plt.ylabel(r"$R^2$")
    plt.xlabel(param)
    plt.gca().spines['top'].set_visible(False)
    plt.gca().spines['right'].set_visible(False)
    plt.show()

In [None]:
model=Ridge(normalize=True,alpha=0.11406249238513208)
model.fit(X,Y)
Predictions=X.copy()
Predictions["Prediction"]=Predictions.apply(
    lambda row: model.predict([[(row[col]-Predictions[col].mean())/Predictions[col].std() for col in Predictions.columns]])[0],
    axis=1
)

Predictions = Predictions.join(pd.DataFrame(Y),how="inner")
pd.options.display.max_columns=200
display(Predictions[["Prediction","Happiness Score"]].T)


Most of these predictions look reasonable; however, some are not even possible. For instance, Afghanistan has a predicted happiness of -37 out of 10. Let's clip these to the range 0 to 10 and use the leave-one-out method to assess the accuracy.

In [None]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold
from sklearn.metrics import r2_score

# Leave one out
folds = KFold(n_splits=len(X))

predictions=[]
true=[]
for train,test in folds.split(NormX):
    model=Ridge(normalize=True,alpha=0.11406249238513208)
    X_train, X_test = NormX[train], NormX[test]
    y_train, y_test = Y.values[train], Y.values[test]
    model.fit(X_train,y_train)
    predictions.append(model.predict(X_test))
    true.append(y_test)
    
predictions=[x[0] if x>0 else 0 for x in predictions]
predictions=[x if 10>x else 10 for x in predictions]
true= [x[0] for x in true]
print("The new R squared value is",r2_score(true,predictions))

display(pd.DataFrame({"Prediction":predictions,"Happiness Score":true},index=X.index).T)
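
The same leave-one-out procedure can be written more compactly with scikit-learn's `LeaveOneOut` and `cross_val_predict`, with `np.clip` handling the 0 to 10 limits. The sketch below runs on synthetic data rescaled to roughly that range; the data and the alpha value are placeholders, not the notebook's.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict

X_demo, y_demo = make_regression(
    n_samples=60, n_features=5, noise=10.0, random_state=0
)
# Rescale the synthetic target to sit around 5 with a spread of 2,
# roughly like a score on a 0 to 10 scale.
y_demo = 5 + 2 * (y_demo - y_demo.mean()) / y_demo.std()

# Each sample is predicted by a model trained on all the other samples.
raw = cross_val_predict(Ridge(alpha=0.1), X_demo, y_demo, cv=LeaveOneOut())

# Force predictions into the valid range of the score.
clipped = np.clip(raw, 0, 10)
print(r2_score(y_demo, clipped))
```

`cross_val_predict` returns the predictions in the original row order, so they can be placed next to the true scores without extra bookkeeping.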

In [None]:
Coeficients=[(col,coef) for col,coef in zip(X.columns,model.coef_)]
Coeficients = sorted(Coeficients,key=lambda x:np.abs(x[1]),reverse=True)
Coeficients={col:coef for col,coef in Coeficients}
pd.DataFrame(index=Coeficients.keys(),data={"Coefficient":list(Coeficients.values())}).T

Some of these features have surprisingly high coefficients, such as *Average precipitation in depth (mm per year)*. One might expect a country where it rains a lot to be unhappy; however, if there is a lot of rain it is easier to create a food surplus. It is interesting to note that *Fixed broadband subscriptions (per 100 people)* has a strong positive effect on happiness, but *Mobile cellular subscriptions (per 100 people)* has the smallest effect of any feature. The feature with the strongest negative effect is, unsurprisingly, *Refugee Rate*, as a high *Refugee Rate* indicates a war in the country. Unsurprisingly, *GDP per capita* has the largest positive effect on happiness. Very surprisingly, *Life expectancy at birth, total (years)* has a very small negative effect on happiness.