In [6]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import preprocessing

In [7]:
# Task: predict the happiness scores of different countries using skills practiced in the Titanic and Housing ML workbooks, with unsupervised machine learning as part of the workflow.
# Unsupervised machine learning is a type of machine learning where the algorithm is trained on data without explicit supervision or labeled outcomes. 
# In unsupervised learning, the algorithm tries to find patterns, relationships, or structures within the data without being explicitly told what to look for. 
# The primary goal is to explore the inherent structure and properties of the dataset.
# Potential uses: dimensionality reduction, clustering, association rule learning, anomaly detection, and feature learning. 

# df.describe().T transposes the summary so that each feature is a row and each statistic is a column, which makes it a bit more readable.

# Version 1 = Create a regression model to predict happiness score from dataset.
# Version 2 = Refine the regression model with cross-validation, RFE, K-means, grid search, redo with a different regression model, etc.

# Objective: produce a plan written in pseudo-code or markdown for Vaish to review by EOD. Do not execute plan yet.  

In [8]:
df_2015 = pd.read_csv('2015.csv')
df_2016 = pd.read_csv('2016.csv')
df_2017 = pd.read_csv('2017.csv')
df_2018 = pd.read_csv('2018.csv')
df_2019 = pd.read_csv('2019.csv')
df_bank = pd.read_csv('bank_marketing_dataset.csv')

# I'm going to have a quick look at some heads just to get the gist of these dataframes and their contents as a starting point. 

In [9]:
df_2015.head(3)

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
0,Switzerland,Western Europe,1,7.587,0.03411,1.39651,1.34951,0.94143,0.66557,0.41978,0.29678,2.51738
1,Iceland,Western Europe,2,7.561,0.04884,1.30232,1.40223,0.94784,0.62877,0.14145,0.4363,2.70201
2,Denmark,Western Europe,3,7.527,0.03328,1.32548,1.36058,0.87464,0.64938,0.48357,0.34139,2.49204


In [10]:
df_2016.head(3)

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Lower Confidence Interval,Upper Confidence Interval,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
0,Denmark,Western Europe,1,7.526,7.46,7.592,1.44178,1.16374,0.79504,0.57941,0.44453,0.36171,2.73939
1,Switzerland,Western Europe,2,7.509,7.428,7.59,1.52733,1.14524,0.86303,0.58557,0.41203,0.28083,2.69463
2,Iceland,Western Europe,3,7.501,7.333,7.669,1.42666,1.18326,0.86733,0.56624,0.14975,0.47678,2.83137


In [11]:
df_2017.head(3)

Unnamed: 0,Country,Happiness.Rank,Happiness.Score,Whisker.high,Whisker.low,Economy..GDP.per.Capita.,Family,Health..Life.Expectancy.,Freedom,Generosity,Trust..Government.Corruption.,Dystopia.Residual
0,Norway,1,7.537,7.594445,7.479556,1.616463,1.533524,0.796667,0.635423,0.362012,0.315964,2.277027
1,Denmark,2,7.522,7.581728,7.462272,1.482383,1.551122,0.792566,0.626007,0.35528,0.40077,2.313707
2,Iceland,3,7.504,7.62203,7.38597,1.480633,1.610574,0.833552,0.627163,0.47554,0.153527,2.322715


In [12]:
df_2018.head(3)

Unnamed: 0,Overall rank,Country or region,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
0,1,Finland,7.632,1.305,1.592,0.874,0.681,0.202,0.393
1,2,Norway,7.594,1.456,1.582,0.861,0.686,0.286,0.34
2,3,Denmark,7.555,1.351,1.59,0.868,0.683,0.284,0.408


In [13]:
df_2019.head(3)

Unnamed: 0,Overall rank,Country or region,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
0,1,Finland,7.769,1.34,1.587,0.986,0.596,0.153,0.393
1,2,Denmark,7.6,1.383,1.573,0.996,0.592,0.252,0.41
2,3,Norway,7.554,1.488,1.582,1.028,0.603,0.271,0.341


In [14]:
df_bank.head(3)

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,subscribed
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


# The Grand Plan # 

The goal: predict *happiness score*, which is a numerical (float) variable. My first thought is therefore a multiple linear regression, so that I can observe how each of the other numerical features relates to happiness score. That is a supervised ML model, though, and you want an unsupervised one, so I'll think more about this later on.

Initial obs: I've got multiple dataframes, each containing different columns, rows, etc. Should I join/concatenate them?

Answer: comparing the dataframes, I can see that the columns are not uniform. Some have different names, some represent completely different features, and some sit in different positions, so joining them might require renaming and reordering.
This is especially true for 2015, 2016, and 2017, so I will examine these three first. I need to ensure uniformity among the 2015-2019 datasets to make joining possible without ruining them. There may also be a lot of superfluous data in these dataframes that is not worth including in our analysis and should be dropped or modified.
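
A minimal sketch of how the renaming and joining could go. The short names on the right of `COLUMN_MAP`, the `harmonise` helper, and the added `year` column are my own hypothetical choices, and treating `Family` as the same feature as `Social support` is an assumption I would need to confirm. The two toy frames reuse the first row of the 2015 and 2018 heads shown above.

```python
import pandas as pd

# Hypothetical common schema: map each year's column names onto one set of short names.
# Assumption: "Family" (2015-2017) and "Social support" (2018-2019) are the same feature.
COLUMN_MAP = {
    "Country": "country",
    "Country or region": "country",
    "Happiness Score": "score",
    "Happiness.Score": "score",
    "Score": "score",
    "Economy (GDP per Capita)": "gdp",
    "Economy..GDP.per.Capita.": "gdp",
    "GDP per capita": "gdp",
    "Family": "social_support",
    "Social support": "social_support",
    "Health (Life Expectancy)": "life_expectancy",
    "Health..Life.Expectancy.": "life_expectancy",
    "Healthy life expectancy": "life_expectancy",
    "Freedom": "freedom",
    "Freedom to make life choices": "freedom",
    "Generosity": "generosity",
    "Trust (Government Corruption)": "corruption",
    "Trust..Government.Corruption.": "corruption",
    "Perceptions of corruption": "corruption",
}
COMMON_COLUMNS = ["country", "year", "score", "gdp", "social_support",
                  "life_expectancy", "freedom", "generosity", "corruption"]


def harmonise(df, year):
    """Rename to the common schema, tag the year, and keep only shared columns."""
    out = df.rename(columns=COLUMN_MAP)
    out["year"] = year
    return out[COMMON_COLUMNS]


# Toy frames built from the first rows of the 2015 and 2018 heads above.
df_2015_toy = pd.DataFrame({
    "Country": ["Switzerland"], "Region": ["Western Europe"],
    "Happiness Rank": [1], "Happiness Score": [7.587],
    "Standard Error": [0.03411], "Economy (GDP per Capita)": [1.39651],
    "Family": [1.34951], "Health (Life Expectancy)": [0.94143],
    "Freedom": [0.66557], "Trust (Government Corruption)": [0.41978],
    "Generosity": [0.29678], "Dystopia Residual": [2.51738],
})
df_2018_toy = pd.DataFrame({
    "Overall rank": [1], "Country or region": ["Finland"], "Score": [7.632],
    "GDP per capita": [1.305], "Social support": [1.592],
    "Healthy life expectancy": [0.874], "Freedom to make life choices": [0.681],
    "Generosity": [0.202], "Perceptions of corruption": [0.393],
})

# Stack the harmonised frames into one dataframe with a fresh index.
combined = pd.concat(
    [harmonise(df_2015_toy, 2015), harmonise(df_2018_toy, 2018)],
    ignore_index=True,
)
```

Selecting `COMMON_COLUMNS` at the end also fixes the column order, so the differing positions across years stop mattering. Columns that only exist in some years (for example `Region` or `Dystopia Residual`) are dropped here and would need a separate decision.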

I could use feature selection & extraction, couldn't I? Or I could do an L1 regularization... Or I could drop the happiness score and run an unsupervised ML model, like PCA or K-means clustering, on the input features on their own to figure out which ones to put into my multiple linear regression... so many options! Whichever I choose, I would need to standardise the data first so that the differing scales of the features don't throw off my PCA or my later model.
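
A rough sketch of the standardise-then-PCA idea, using synthetic data as a stand-in for the input features (the happiness score is assumed to have been dropped already). The choice of two components and the deliberately mismatched feature scales are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: 50 rows, 4 features on very different scales.
X = rng.normal(size=(50, 4)) * np.array([1.0, 10.0, 100.0, 0.1])

# Scale each feature to zero mean and unit variance, then project onto 2 components.
pipeline = make_pipeline(StandardScaler(), PCA(n_components=2))
components = pipeline.fit_transform(X)

pca = pipeline.named_steps["pca"]
explained = pca.explained_variance_ratio_  # share of variance per component
loadings = pca.components_                 # weight of each original feature per component
```

Without the scaler, the feature on the largest scale would dominate the first component. The `loadings` array is what I would inspect to see which original features contribute most, bearing in mind that PCA produces combinations of features rather than a direct keep/drop list.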

As for the bank_marketing dataframe, I am unsure if this is relevant to the current activity or if this is something I'll utilise later because the data contained within is so different to the rest, so I'll focus on the 2015-2019 csvs for now. I don't see any predictive value in the banking dataset for ascertaining which features can predict Happiness Score the best. 

# EDA #

Following this, I presume there's some way to join all the dataframes together into one new dataframe, much like I was able to produce a new column for my Logistic Regression from two other columns. That's what I will do, so that I can proceed with EDA using just one dataframe instead of five.

I will then perform univariate EDA to get the gist of each feature, using `describe()`, `value_counts()`, histograms, bar charts, and so on. This will also let me see the distributions, so I can judge which imputation method to use if there are any nulls that need filling. I might also do some correlation heatmaps to eyeball pairwise correlations, which could help me avoid multicollinearity when it comes to producing my prediction model. If there are strong correlations between the input features, L1 regularization might sort that out for me, or I might need to drop some. Or if I start with PCA, the component loadings should give me a good indication of which features to carry into the final dataframe.
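
A small sketch of the null check, summary, and correlation steps on toy data. The column names, the 0.8 threshold, and the `correlated_pairs` helper are all hypothetical; `life_expectancy` is deliberately constructed to correlate with `gdp` so that there is something to flag.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
gdp = rng.normal(size=100)

# Toy data: life_expectancy is built to track gdp closely; generosity is independent.
toy = pd.DataFrame({
    "gdp": gdp,
    "life_expectancy": 0.9 * gdp + rng.normal(scale=0.1, size=100),
    "generosity": rng.normal(size=100),
})

null_counts = toy.isna().sum()  # nulls per column, to decide whether imputation is needed
summary = toy.describe().T      # one row per feature, one column per statistic
corr = toy.corr()               # pairwise Pearson correlations


def correlated_pairs(corr_matrix, threshold=0.8):
    """Return feature pairs whose absolute correlation meets the threshold."""
    cols = list(corr_matrix.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if abs(corr_matrix.loc[a, b]) >= threshold:
                pairs.append((a, b))
    return pairs


flagged = correlated_pairs(corr)
```

The same `corr` matrix could be passed to `sns.heatmap(corr, annot=True)` for the visual version; the helper just gives a list to act on when deciding what to drop.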

# Model building #

As discussed, I think I will start by joining my dataframes as described above, then do EDA, then PCA. Once PCA has given me a good idea of which features to retain and which ones to drop, I will proceed with building a multiple linear regression using statsmodels and scikit-learn. I can then monitor the performance metrics of this model and try different optimizations.
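
A minimal sketch of the regression step with cross-validation, using scikit-learn only and synthetic data in place of the real features. The coefficients used to generate `y`, the five folds, and the R² scoring are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)

# Synthetic stand-in: 3 input features and a target that is linear in them plus noise.
X = rng.normal(size=(120, 3))
y = 5.0 + X @ np.array([0.8, 0.5, 0.2]) + rng.normal(scale=0.1, size=120)

# Scaling inside the pipeline means each CV fold is scaled using its own training split.
model = make_pipeline(StandardScaler(), LinearRegression())
scores = cross_val_score(model, X, y, cv=5, scoring="r2")

# Fit on all the data to inspect the coefficients.
model.fit(X, y)
coefficients = model.named_steps["linearregression"].coef_
```

Keeping the scaler inside the pipeline avoids leaking information from the validation folds into the training folds. A statsmodels OLS fit on the same features would add p-values and confidence intervals for the coefficients.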


Build regression model as overall aim: use unsupervised learning (e.g. clustering) for feature engineering as part of EDA. Then build model from that. 
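
One way the clustering-as-feature-engineering idea might look, sketched on synthetic data. The two clusters, the blob locations, and appending the label as an extra column are assumptions, not a settled design.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)

# Synthetic stand-in: two well-separated groups of 30 rows with 2 features each.
X = np.vstack([
    rng.normal(loc=0.0, size=(30, 2)),
    rng.normal(loc=5.0, size=(30, 2)),
])

# K-means is distance-based, so scale the features before clustering.
X_scaled = StandardScaler().fit_transform(X)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)

# Append the cluster label as an extra engineered feature for the regression.
X_with_cluster = np.column_stack([X, labels])
```

With more than two clusters the label should probably be one-hot encoded before going into a linear regression, since the cluster numbers have no natural order.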
