<a href="https://colab.research.google.com/github/Espanta/handson-ml/blob/master/People_Satisfaction_and_GDP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Predict People Satisfaction Across the Globe

Problem Statement:

We would like to build a model that predicts the life satisfaction score of people in different countries from their country's GDP per capita.

# Download Dataset

Download the Better Life Index data (the latest edition at the time of writing is 2017) from the [OECD’s website](http://homl.info/4), as well as GDP per capita statistics from the [IMF’s website](http://homl.info/5). Then join the two tables and sort the result by GDP per capita. Table 1-1 shows an excerpt of what you get.

# Import Dataset to Google Colab

1. Download the CSV and XLS files to your computer.
2. Upload them to your Google Drive.
3. Open the files with Google Sheets so that Google creates a copy of each dataset in Google Sheets format.
4. You can then remove the original CSV and XLS files from your Drive.
5. Follow the step-by-step guide [here](https://colab.research.google.com/notebooks/io.ipynb#scrollTo=vz-jH8T_Uk2c) and scroll down to the **Google Sheets** cell to import the data into a DataFrame.

NOTE: After creating the Google Sheet in your Drive, make sure you convert the 2015 column to the 0.00 number format before importing it into Colab. Otherwise Google will import it as a string, and you will have a hard time cleaning the data.
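
If the column does arrive as text anyway, it can still be converted after the import. The following is a minimal sketch of that cleanup. The sample strings, including the thousands separators and the `n/a` marker, are made up for illustration and are not taken from the IMF file.

```python
import pandas as pd

# Hypothetical sample of a GDP column that was imported as text
raw = pd.Series(["599.99", "3,995.38", "n/a", "14,414.30"], name="2015")

# Strip thousands separators, then convert; unparseable entries become NaN
cleaned = pd.to_numeric(raw.str.replace(",", "", regex=False), errors="coerce")

print(cleaned.dtype)         # numeric dtype instead of object
print(cleaned.isna().sum())  # number of entries that could not be parsed
```

Using `errors="coerce"` keeps the conversion from failing on stray markers, at the cost of silently turning them into missing values, so it is worth checking the count of `NaN` entries afterwards.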




In [0]:
# Run the line below once to install gspread, then comment it out again for later runs
#!pip install --upgrade -q gspread
from google.colab import auth
auth.authenticate_user()

import gspread
from oauth2client.client import GoogleCredentials

gc = gspread.authorize(GoogleCredentials.get_application_default())

# Use gc to open Google Sheet Datasets

In [3]:
#Open given sheet
worksheet = gc.open('BLI_30012019054825599').sheet1

# Read all cell values from the sheet (gspread returns them as strings)
bli_rows = worksheet.get_all_values()

# Convert to a DataFrame; the first row holds the column names
import pandas as pd
bli = pd.DataFrame.from_records(bli_rows[1:], columns=bli_rows[0])

# Keep only the rows where INEQUALITY is TOT
bli = bli[bli["INEQUALITY"]=="TOT"]

# Pivot so that each indicator becomes a column, with one row per country
bli = bli.pivot(index="Country", columns="Indicator", values="Value")

bli.head()

Country,Air pollution,Dwellings without basic facilities,Educational attainment,Employees working very long hours,Employment rate,Feeling safe walking alone at night,Homicide rate,Household net adjusted disposable income,Household net financial wealth,Housing expenditure,...,Personal earnings,Quality of support network,Rooms per person,Self-reported health,Stakeholder engagement for developing regulations,Student skills,Time devoted to leisure and personal care,Voter turnout,Water quality,Years in education
Australia,5,1.1,80,13.2,72,63.6,1.0,33417,57462,20,...,52063,94,2.3,85,2.7,502,14.35,91,92,21.2
Austria,16,1.0,85,6.78,72,80.7,0.4,32544,59574,21,...,48295,92,1.6,70,1.3,492,14.55,75,93,17.1
Belgium,15,2.3,75,4.31,62,70.7,1.0,29968,104084,21,...,49587,92,2.2,75,2.2,503,15.77,89,84,18.2
Brazil,10,6.7,49,7.15,64,37.3,27.6,12227,7102,20,...,14024,90,0.8,70,2.2,395,14.45,79,72,15.9
Canada,7,0.2,91,3.73,73,80.9,1.4,29850,85758,22,...,48403,93,2.5,88,3.0,523,14.41,68,91,16.7
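
To see why the `INEQUALITY` filter has to come before the pivot, here is a small sketch on hand-built records. The `TOT` values are copied from this notebook's outputs; the `MN` row and its value are placeholders invented for the example.

```python
import pandas as pd

# Long-format records shaped like the BLI export: one row per
# (country, indicator, inequality group).
records = pd.DataFrame(
    [
        ("Australia", "Air pollution", "TOT", "5"),
        ("Australia", "Life satisfaction", "TOT", "7.3"),
        ("Australia", "Life satisfaction", "MN", "0"),  # placeholder row
        ("Austria", "Air pollution", "TOT", "16"),
        ("Austria", "Life satisfaction", "TOT", "7"),
    ],
    columns=["Country", "Indicator", "INEQUALITY", "Value"],
)

# Without the filter, (Australia, Life satisfaction) appears twice and
# pivot raises a ValueError about duplicate entries.
totals = records[records["INEQUALITY"] == "TOT"]
wide = totals.pivot(index="Country", columns="Indicator", values="Value")
print(wide)
```

After the filter each (country, indicator) pair is unique, which is what `pivot` needs in order to build one cell per pair.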


# Import WEO data

In [4]:
#Open given sheet
worksheet = gc.open('WEO_Data').sheet1

# Read all cell values from the sheet
WEO_rows = worksheet.get_all_values()

# Convert to a DataFrame; the first row holds the column names,
# so it is excluded from the data rows
weo = pd.DataFrame.from_records(WEO_rows[1:], columns=WEO_rows[0])

# Keep only the Country and 2015 columns,
# then rename 2015 to "GDP per capita"
weo = weo[['Country','2015']].rename(columns={'2015':'GDP per capita'})

# Set Country as the index column
# inplace=True modifies weo directly instead of returning a new DataFrame
weo.set_index("Country", inplace=True)

#weo.drop_duplicates(inplace=True)
# Show the top 5 rows
weo.head()

Country,GDP per capita
Afghanistan,599.99
Albania,3995.38
Algeria,4318.14
Angola,4100.32
Antigua and Barbuda,14414.3


In [5]:
bli["Life satisfaction"].head()

Country
Australia    7.3
Austria        7
Belgium      6.9
Brazil       6.6
Canada       7.3
Name: Life satisfaction, dtype: object

# Merge/Join dataset

In [6]:
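# Join the two tables on their shared Country index
df = pd.merge(left=weo, right=bli, left_index=True, right_index=True)
# NOTE: "GDP per capita" still holds strings here, so this sort is
# lexicographic ("101994.09" comes before "12239.89"), as the output shows
df.sort_values(by="GDP per capita", inplace=True)
df.head()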

Country,GDP per capita,Air pollution,Dwellings without basic facilities,Educational attainment,Employees working very long hours,Employment rate,Feeling safe walking alone at night,Homicide rate,Household net adjusted disposable income,Household net financial wealth,...,Personal earnings,Quality of support network,Rooms per person,Self-reported health,Stakeholder engagement for developing regulations,Student skills,Time devoted to leisure and personal care,Voter turnout,Water quality,Years in education
Luxembourg,101994.09,12,0.0,79,3.76,66,72.0,0.6,41317,74141,...,62636,92,2.0,70,1.5,483,15.15,91,85,15.1
Hungary,12239.89,19,4.3,83,3.05,67,50.7,1.2,16821,23289,...,21711,84,1.2,56,1.2,474,15.06,62,76,16.6
Poland,12495.33,22,2.7,91,6.68,65,66.3,0.8,18906,14997,...,25921,89,1.1,58,2.6,504,14.42,55,80,17.7
Chile,13340.91,16,9.4,65,10.06,62,51.1,4.5,16588,21409,...,28434,84,1.9,57,1.5,443,14.9,49,69,17.3
Latvia,13618.57,11,12.9,89,2.09,69,60.7,6.6,15269,17105,...,22389,86,1.2,46,2.4,487,13.83,59,77,17.9
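
The order above, with Luxembourg ahead of Hungary, suggests that "GDP per capita" was sorted as text rather than as numbers. The sketch below reproduces the effect on four GDP values copied from this notebook's outputs and shows one way to get the numeric order. Applying the same conversion to `df` would be an assumption about the author's intent, and it would change which rows the positional indices used later select.

```python
import pandas as pd

# GDP values copied from the outputs in this notebook, stored as strings
gdp = pd.Series(
    ["101994.09", "12239.89", "8670.0", "80675.31"],
    index=["Luxembourg", "Hungary", "Brazil", "Switzerland"],
    name="GDP per capita",
)

# Sorting strings compares them character by character
text_order = gdp.sort_values().index.tolist()
print(text_order)

# Converting to float first gives the numeric order
numeric_order = gdp.astype(float).sort_values().index.tolist()
print(numeric_order)
```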


In [0]:
# Positional row indices held out for testing; the remaining rows are used for training
test_indices = [0, 1, 6, 8, 33, 34, 35]
train_indices = sorted(set(range(36)) - set(test_indices))

train = df[["GDP per capita", 'Life satisfaction']].iloc[train_indices]
test = df[["GDP per capita", 'Life satisfaction']].iloc[test_indices]



In [8]:
test

Unnamed: 0_level_0,GDP per capita,Life satisfaction
Country,Unnamed: 1_level_1,Unnamed: 2_level_1
Luxembourg,101994.09,6.9
Hungary,12239.89,5.3
Czech Republic,17256.92,6.6
Greece,18064.29,5.2
Switzerland,80675.31,7.5
Brazil,8670.0,6.6
Mexico,9009.28,6.6


In [9]:
# Code example
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn
from sklearn.linear_model import LinearRegression

# Prepare the data
X = np.c_[train["GDP per capita"]]
y = np.c_[train["Life satisfaction"]]

# Visualize the data
#df.plot(kind='scatter', x="GDP per capita", y='Life satisfaction')
#plt.show()

# Select a linear model
model = sklearn.linear_model.LinearRegression()

# Train the model
model.fit(X, y)

# Make a prediction for Greece
X_new = [[18064.29]]  # GDP per capita listed for Greece in the test set
print(model.predict(X_new)) # outputs [[5.95199478]]

[[5.95199478]]


In [10]:
# Make a prediction for our test data
model.predict(test['GDP per capita'].values.reshape(-1,1))

array([[9.16992716],
       [5.72868285],
       [5.9210396 ],
       [5.95199478],
       [8.35254892],
       [5.59181055],
       [5.60481881]])
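
To judge these predictions, they can be compared with the actual life satisfaction values of the same seven test countries. This is a small sketch that copies both sets of numbers from the outputs above; the choice of mean absolute error and root mean squared error as metrics is mine, not the notebook's.

```python
import numpy as np

# Actual life satisfaction of the seven test countries (from the test table)
y_true = np.array([6.9, 5.3, 6.6, 5.2, 7.5, 6.6, 6.6])

# Predictions printed by model.predict above, in the same row order
y_pred = np.array([9.16992716, 5.72868285, 5.9210396, 5.95199478,
                   8.35254892, 5.59181055, 5.60481881])

errors = y_pred - y_true
mae = np.mean(np.abs(errors))          # mean absolute error
rmse = np.sqrt(np.mean(errors ** 2))   # root mean squared error
worst = int(np.argmax(np.abs(errors))) # row with the largest absolute error

print(f"MAE:  {mae:.3f}")
print(f"RMSE: {rmse:.3f}")
print("largest error at row", worst)
```

The largest error belongs to the first row, Luxembourg, where the linear model overestimates life satisfaction by more than two points.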

In [0]:
from sklearn.linear_model import Ridge

In [0]:
def ridge_regression(X, y, alpha, models_to_plot={}):
    # Fit a ridge model with regularization strength alpha.
    # NOTE: the normalize parameter was removed in scikit-learn 1.2,
    # so this call needs an older release.
    ridgereg = Ridge(alpha=alpha, normalize=True)
    ridgereg.fit(X, y)
    y_pred = ridgereg.predict(X)

    # Plotting for the alphas listed in models_to_plot (currently disabled)
    # if alpha in models_to_plot:
    #     plt.subplot(models_to_plot[alpha])
    #     plt.tight_layout()
    #     plt.plot(X['gdp'], y_pred)
    #     plt.plot(X['gdp'], y, '.')
    #     plt.title('Plot for alpha: %.3g' % alpha)

    # Return the result as [rss, intercept, coefficients]
    rss = sum((y_pred - y)**2)
    ret = [rss]
    ret.extend([ridgereg.intercept_])
    ret.extend(ridgereg.coef_)
    return ret
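
For intuition about what `alpha` does, ridge regression can also be written in closed form. The sketch below is a plain NumPy version on synthetic data, not the GDP data, and without the `normalize=True` rescaling used above; the helper name `ridge_closed_form` is hypothetical.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Ridge fit on centered data: w = (X^T X + alpha * I)^-1 X^T y."""
    X_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - X_mean, y - y_mean
    n_features = X.shape[1]
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)
    intercept = y_mean - X_mean @ w
    return intercept, w

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=30)

alphas = [0.0, 1.0, 10.0, 100.0]
norms, rss_values = [], []
for alpha in alphas:
    intercept, w = ridge_closed_form(X, y, alpha)
    rss = np.sum((X @ w + intercept - y) ** 2)
    norms.append(np.linalg.norm(w))
    rss_values.append(rss)
    print(f"alpha={alpha:>6}: ||w||={norms[-1]:.3f}, rss={rss:.3f}")
```

As `alpha` grows, the coefficient norm shrinks and the residual sum of squares on the training data rises, which is the trade-off the alpha sweep in this notebook explores.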


In [13]:
train_X =train['GDP per capita'].values.astype('float64')
train_X_df = pd.DataFrame(train_X)
train_X_df.columns=['gdp']
print(train_X_df[:5])

train_y = train['Life satisfaction'].values.astype('float64')
print(train_y)

        gdp
0  12495.33
1  13340.91
2  13618.57
3  15991.74
4  17288.08
[6.  6.7 5.9 6.1 5.6 5.2 5.8 6.4 5.9 5.9 5.9 7.2 7.3 6.4 6.9 7.  7.5 7.3
 7.4 7.  6.7 7.3 7.5 7.3 7.  7.5 6.9 4.8 7.5]


In [14]:
for i in range(2,16):  # the first power (gdp itself) is already there
    colname = 'gdp_%d'%i      # new column holds gdp raised to the power i
    train_X_df[colname] = train_X_df['gdp']**i

train_X_df[:5]




Unnamed: 0,gdp,gdp_2,gdp_3,gdp_4,gdp_5,gdp_6,gdp_7,gdp_8,gdp_9,gdp_10,gdp_11,gdp_12,gdp_13,gdp_14,gdp_15
0,12495.33,156133300.0,1950937000000.0,2.43776e+16,3.046061e+20,3.806154e+24,4.755915e+28,5.9426730000000006e+32,7.425566e+36,9.278489999999999e+40,1.1593779999999999e+45,1.4486810000000001e+49,1.8101750000000002e+53,2.261873e+57,2.826285e+61
1,13340.91,177979900.0,2374414000000.0,3.167684e+16,4.225978e+20,5.63784e+24,7.521391e+28,1.003422e+33,1.338656e+37,1.785889e+41,2.382539e+45,3.1785240000000003e+49,4.24044e+53,5.657133e+57,7.54713e+61
2,13618.57,185465400.0,2525774000000.0,3.439743e+16,4.684438e+20,6.379535e+24,8.688015e+28,1.183183e+33,1.611327e+37,2.194396e+41,2.988454e+45,4.069847e+49,5.54255e+53,7.54816e+57,1.0279510000000001e+62
3,15991.74,255735700.0,4089660000000.0,6.540077e+16,1.045872e+21,1.672532e+25,2.6746689999999998e+29,4.277261e+33,6.840084999999999e+37,1.0938490000000001e+42,1.749254e+46,2.797362e+50,4.473468e+54,7.153854e+58,1.1440260000000001e+63
4,17288.08,298877700.0,5167022000000.0,8.932789e+16,1.544308e+21,2.669811e+25,4.615591e+29,7.979471000000001e+33,1.379497e+38,2.384886e+42,4.12301e+46,7.127893000000001e+50,1.232276e+55,2.130368e+59,3.682998e+63
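
Raw powers of GDP span dozens of orders of magnitude (up to about 1e63 in the table above), which makes the feature matrix numerically hard to work with. The sketch below compares the condition number of raw and standardized polynomial features. It uses synthetic GDP-like values and only powers 1 to 5 to keep the example small; both choices are assumptions made for illustration.

```python
import numpy as np

# Synthetic GDP-like values, for illustration only
gdp = np.linspace(10_000, 100_000, 29)
powers = np.arange(1, 6)  # powers 1..5

# Raw polynomial features: one column per power of gdp
raw = gdp[:, None] ** powers

# Standardize GDP first, then raise it to the same powers
z = (gdp - gdp.mean()) / gdp.std()
scaled = z[:, None] ** powers

cond_raw = np.linalg.cond(raw)
cond_scaled = np.linalg.cond(scaled)
print(f"condition number, raw:    {cond_raw:.2e}")
print(f"condition number, scaled: {cond_scaled:.2e}")
```

A very large condition number means that small rounding errors can change the fitted coefficients substantially, so rescaling the input before building the powers is a common precaution.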


In [15]:
#Initialize predictors to be set of 15 powers of x
predictors=['gdp']
predictors.extend(['gdp_%d'%i for i in range(2,16)])

#Set the different values of alpha to be tested
alpha_ridge = [1e-15, 1e-10, 1e-8, 1e-4, 1e-3,1e-2, 1, 5, 10, 20]

#Initialize the dataframe for storing coefficients.
col = ['rss','intercept'] + ['coef_x_%d'%i for i in range(1,16)]
ind = ['alpha_%.2g'%alpha_ridge[i] for i in range(0,10)]
coef_matrix_ridge = pd.DataFrame(index=ind, columns=col)

models_to_plot = {1e-15:231, 1e-10:232, 1e-4:233, 1e-3:234, 1e-2:235, 5:236}
for i in range(10):
    #coef_matrix_ridge.iloc[i,] =
    print(ridge_regression(train_X_df,train_y.reshape(-1,1), alpha_ridge[i], models_to_plot))

[array([2.57059159]), array([34.04820492]), array([-1.67921389e-02,  3.74236690e-06, -4.22665900e-10,  2.70061363e-14,
       -9.98975334e-19,  1.97964176e-23, -1.29799304e-28, -1.72144191e-33,
        1.42297204e-38,  3.91925350e-43, -7.82185655e-49, -8.69558646e-53,
       -1.25575469e-58,  1.77494923e-62, -1.08335527e-67])]
[array([2.82134472]), array([-3.85147564]), array([ 2.65163564e-03, -2.54891511e-07,  1.10874475e-11, -2.17505223e-16,
        1.08201121e-21,  1.95294745e-26, -1.01817490e-31, -2.65429437e-36,
       -4.14797201e-42,  3.24199523e-46,  3.84556474e-51, -9.85535721e-57,
       -8.43710221e-61, -8.68536829e-66,  1.60989719e-70])]
[array([2.87368406]), array([-0.68646999]), array([ 1.55024559e-03, -1.22112486e-07,  3.76644189e-12, -2.96863757e-17,
       -4.15236320e-22,  1.23696331e-27,  6.32083390e-32,  6.63675381e-37,
        1.62807975e-42, -6.71936576e-47, -1.49877499e-51, -1.87494914e-56,
       -1.25385701e-61,  1.04227394e-66,  5.86602491e-71])]
[array([4.213

Ill-conditioned matrix detected. Result is not guaranteed to be accurate.
Reciprocal condition number4.566151e-17
  overwrite_a=True).T
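
The printed results show that `ridge_regression` returns three arrays (RSS, intercept, and a 15-element coefficient array) rather than 17 scalars, which is presumably why the assignment into `coef_matrix_ridge` is commented out. One possible way to store a result is to flatten it first. The sketch below does that for the first result, with its values copied from the output above.

```python
import numpy as np
import pandas as pd

# First result printed above: [rss, intercept, coefficients]
result = [
    np.array([2.57059159]),
    np.array([34.04820492]),
    np.array([-1.67921389e-02, 3.74236690e-06, -4.22665900e-10,
              2.70061363e-14, -9.98975334e-19, 1.97964176e-23,
              -1.29799304e-28, -1.72144191e-33, 1.42297204e-38,
              3.91925350e-43, -7.82185655e-49, -8.69558646e-53,
              -1.25575469e-58, 1.77494923e-62, -1.08335527e-67]),
]

# Flatten the three arrays into one vector of 17 numbers
flat = np.concatenate([np.ravel(part) for part in result])

# Same column layout as coef_matrix_ridge, with a single row for alpha=1e-15
col = ['rss', 'intercept'] + ['coef_x_%d' % i for i in range(1, 16)]
coef_matrix = pd.DataFrame(index=['alpha_%.2g' % 1e-15], columns=col)
coef_matrix.loc['alpha_1e-15', :] = flat

print(coef_matrix.iloc[:, :4])
```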


In [16]:
print(train_X_df.shape)
print (train_y.reshape(-1,1).shape)

(29, 15)
(29, 1)
