# Introduction



The World Happiness Report is a landmark survey of the state of global happiness.

The data lets us analyze which countries report higher happiness or life satisfaction, and which come closest to Dystopia, an imaginary country with the world's lowest incomes, lowest life expectancy, lowest generosity, most corruption, least freedom, and least social support.

The dataset also provides other attributes that give further insight, and we will look at the effect of each of them.

In [None]:
# Main libraries usage
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Sklearn libraries
from sklearn.model_selection import (train_test_split, cross_val_score, GridSearchCV,
                                     RandomizedSearchCV, StratifiedShuffleSplit)
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
%matplotlib inline


import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
# Main functions


def stratified_splitting(data, split_attr):
    """
    Split the data into train and test sets, stratified on split_attr,
    so that the categories appear in both sets in similar proportions.
    """
    split = StratifiedShuffleSplit(n_splits=1, test_size=.2, random_state=42)
    # split() yields positional indices, so select rows with iloc rather than loc
    tr_index, tes_index = next(split.split(data, data[split_attr]))
    return data.iloc[tr_index], data.iloc[tes_index]


def compare_random_stratified_split(data, random_set, stratified_set, split_attr):
    """Compare the category proportions of split_attr in each split with the full data."""
    overall          = data[split_attr].value_counts() / len(data)
    random_split     = random_set[split_attr].value_counts() / len(random_set)
    stratified_split = stratified_set[split_attr].value_counts() / len(stratified_set)

    # Absolute difference from the overall proportions, aligned on category name;
    # a category missing from a split counts as proportion 0
    overall_Vs_random_error     = np.abs(overall.sub(random_split, fill_value=0))
    overall_Vs_stratified_error = np.abs(overall.sub(stratified_split, fill_value=0))

    # Collect the pandas Series into one DataFrame, most frequent categories first
    dict_result = {"overall": overall, "random_split": random_split, "stratified_split": stratified_split,
                   "overall_Vs_random_error": overall_Vs_random_error,
                   "overall_Vs_stratified_error": overall_Vs_stratified_error}
    error_result = pd.DataFrame(dict_result)
    return error_result.sort_values("overall", ascending=False)


def visualize_plot(data, plot_kind, x_axis, y_axis, point_size=50, color="gray", colorbar=False):
    """Plot y_axis against x_axis; color may be a color name or a column name."""
    plt.style.use('fivethirtyeight')
    # Pass figsize to DataFrame.plot, which creates its own figure
    data.plot(kind=plot_kind, x=x_axis, y=y_axis, s=point_size,
              c=color, cmap="jet", colorbar=colorbar, figsize=(10, 10))

    return True

def split_seprate(data):
	happiness_report_num = data.drop(['Happiness Score', 'Region', 'Country'], axis=1)
	happiness_report_cat = data[['Region']]
	happiness_report_labels = data[['Happiness Score']]

	return happiness_report_num, happiness_report_cat, happiness_report_labels


def prepare_data(data, target='Happiness Score', number_of_instances=10):

	some_data = data.iloc[:number_of_instances].copy()
	some_data_labeled = some_data[[target]]
	some_data = some_data.drop(target, axis=1)

	return some_data,  some_data_labeled



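class AddCombinedAttributes(BaseEstimator, TransformerMixin):
    """Append the ratio of GDP per capita to (Family + 1) as an extra column."""

    def __init__(self, add_GDP_for_family=True):
        self.add_GDP_for_family = add_GDP_for_family

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        if self.add_GDP_for_family:
            # Columns 2 and 3 of the numeric array are assumed to be
            # Economy (GDP per Capita) and Family; adding 1 avoids division by zero
            GDP_for_family = X[:, 2] / (X[:, 3] + 1)
            X = np.c_[X, GDP_for_family]
        return X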


# Display first 5 rows

The first step is to display a few rows of the data to see what the attributes look like, then move on to discover more about them.


In [None]:
world_happiness_report = pd.read_csv("../input/world-happiness/2015.csv")
world_happiness_report.head()


# Some information about the data and attributes

None of the attributes contain NaN values. Even so, we need to store a fallback value for each attribute: once the system goes live, some attribute values may not be provided, and in that case the saved value is used instead.

We can also see that most of the attributes are numeric, except for Country and Region.

Two attributes are closely related: Happiness Rank and Happiness Score. We will use the score as our target variable, since we are treating this as a regression problem.

Also, the dataset is small enough to fit comfortably in memory.


In [None]:
world_happiness_report.info()

## Some statistics about the numeric attributes

Since most of the columns (attributes or features) are numeric, it is helpful to display some statistics for them: the mean, the median, the largest and smallest values, and the values below which 25%, 50%, and 75% of the data fall (the 25% and 75% points are the first and third quartiles).

Insights you can get from this:
- You may need to apply feature scaling, because the attributes have different ranges.
- Some numeric attributes may behave like categorical ones, because they take only a few discrete values.


Most attribute values lie close to their mean relative to the standard deviation, although this may simply be because the attributes span small ranges; for example, 25% of the **Happiness Rank** values are below 40.

In [None]:
world_happiness_report.describe()

## Discover some attributes unique values

Some attributes are categorical, and others may effectively be categorical even though they are displayed as numbers, like **Happiness Rank**. We therefore need to know the unique values of each attribute and how often each one occurs, which also helps later when splitting the data.

Some attributes may be better removed from the dataset because they can mislead the learning. **Country** is one example: it just holds names and carries no further information.

**Region**, on the other hand, may hold some information, such as differences in life satisfaction between people who live in Africa and those who live in Europe, so we can train with and without it to check its effect on learning.

**Happiness Rank** takes a different value for almost every row, so it is better treated as a numeric attribute.

In [None]:
world_happiness_report['Country'].value_counts()[:10]

In [None]:
world_happiness_report['Region'].value_counts()

In [None]:
world_happiness_report['Happiness Rank'].value_counts()

# Important Note !

The **Region** attribute can help us when splitting the data into train and test sets, by making sure every category is represented in both sets in similar proportions.

## Histogram

A simple graph you can use is the histogram, which displays the ranges of the data together with their frequencies and helps you understand the data you are dealing with.

We can apply it to the whole dataset or just to the attributes we are interested in.

The histogram shows which attributes are roughly normally distributed and which are skewed to the right or left. This helps us spot attributes with outliers, which may mislead the learning; for such attributes we should also check whether the outliers have an effect on the target.
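Skew can also be checked numerically. The following is a minimal sketch on synthetic data (the `toy` DataFrame and its column names are made up for illustration, not taken from the dataset), using pandas' `skew()`: values near 0 suggest a symmetric distribution, while clearly positive values suggest a long right tail.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one symmetric column and one right-skewed column
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "symmetric": rng.normal(loc=5.0, scale=1.0, size=1000),
    "right_skewed": rng.exponential(scale=1.0, size=1000),
})

# Sample skewness per column: near 0 for symmetric data, positive for a right tail
skewness = toy.skew()
print(skewness)
```

The same call on the real DataFrame, restricted to its numeric columns, would give a quick list of attributes worth inspecting in the histograms.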

All of these steps help us build insight and intuition about the data we are dealing with.

As we can see from the graphs below:

- Economy (GDP per Capita) ranges from 0 to about 1.5.

It is worth asking the data owner about these values, since no real income is as small as 1.5. Knowing the scaling factor (for example, whether we should multiply by 10,000) lets us recover the original values.

Even though machine learning models usually work best with small ranges of numbers, we still need a full overview of the data and its attributes.

- Happiness Score ranges from roughly 2 to 7.

We can also ask about these values and about predictions once the system goes live: could they go beyond this range, or should we treat 7 as the maximum? Questions like these help us understand the data in depth.

In [None]:
world_happiness_report.hist(bins=50, figsize=(15, 12))

## Create a Test set

After what we have discussed and presented, we need to go deeper in our discovery and visualization. First, though, we set aside part of the data for testing and use the rest to explore and analyze further.

We will split the data using two methods, then compare each split with the whole dataset to check how well the categories are represented in the test set as well as in the training set.

The Region attribute can help us here, by making sure each group is represented.
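As an illustration of what stratification does, here is a minimal sketch on made-up data (the `toy` DataFrame and its three regions are hypothetical). It uses the `stratify` argument of `train_test_split`, which is an alternative to calling `StratifiedShuffleSplit` directly.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: three regions with ten rows each
toy = pd.DataFrame({
    "Region": ["A"] * 10 + ["B"] * 10 + ["C"] * 10,
    "value": range(30),
})

# stratify keeps the Region proportions the same in both parts
train_part, test_part = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["Region"]
)

test_counts = test_part["Region"].value_counts()
print(test_counts)
```

With equal-sized groups, each region contributes the same number of rows to the test part. Note that stratification requires at least two rows per category, which matters for very small regions.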

In [None]:
world_happiness_report['Region'].value_counts()

In [None]:
# test_size is the fraction of the data used for the test set
# random_state makes the split reproducible each time the code runs
random_train_set, random_test_set = train_test_split(world_happiness_report, test_size=.2, random_state=42)

In [None]:
# display number of instances per set
print(len(random_train_set), len(random_test_set))

In [None]:
# stratified_splitting is defined in the main functions cell
tr_stratified_set, tes_stratified_set = stratified_splitting(world_happiness_report, "Region")
# Both methods give the same number of instances per set
print(len(tr_stratified_set), len(tes_stratified_set))

In [None]:

# this compare_random_stratified_split function from main functions cell
error_result = compare_random_stratified_split(world_happiness_report, random_test_set, 
                                              tes_stratified_set, "Region")
error_result.fillna(0, inplace=True)
error_result

# Important Note !

Apart from showing how far the random and stratified splits differ from the overall data on the **Region** attribute, this comparison suggests that we should collect more instances for North America and for Australia and New Zealand, as each has just 2 instances.

## Discover and visualize to get insights

The histogram is helpful, but to get more insight we need to go deeper. Visualization helps because the brain is very good at capturing information from images and graphs. Instead of looking at each attribute on its own, it is now time to discover the relationships between attributes, and between attributes and the target.

We now have a train set and a test set, so we keep the test set aside, untouched, until we decide to launch the model. We also take a copy of the training set to explore in depth while keeping the original data intact.

In [None]:
happiness_report = tr_stratified_set.copy()
happiness_report.head()

## Some Assumptions

When trying to get insights from the data, it helps to start from a few assumptions, which can lead to a good result in the end.

As the graph below shows, the **Happiness Score** increases together with **Economy (GDP per Capita)** and **Health (Life Expectancy)**, which are themselves strongly positively correlated.

**Freedom** may also be related to **Dystopia Residual**. The Dystopia Residual is measured against the happiness of the imaginary Dystopia, so we might expect it to increase as freedom increases.


Finally, there are two similar attributes, **Happiness Rank** and **Happiness Score**. Exploring them may give more insight, and, as the later plots show, they have a strong negative correlation.


In [None]:
_ = visualize_plot(happiness_report, "scatter", "Health (Life Expectancy)", "Economy (GDP per Capita)",
                  50, "Happiness Score", True)

## Correlation

The plots give a lot of insight into which attributes have a linear effect on the target variable, ranging from strong positive to strong negative correlation. We gain more by computing numbers that represent the correlation between each attribute and the target.

We can see that GDP per Capita and Life Expectancy have the strongest positive effect on the Happiness Score, while Happiness Rank has a strong negative correlation with it. Others, like **Standard Error**, may need to be removed because they have only a small negative effect on the target, so we can try training the model with and without them.
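For reference, the numbers produced by `corr()` are Pearson correlation coefficients by default. A minimal sketch of that formula on made-up arrays (`x`, `y_up`, and `y_down` are hypothetical, and `pearson` is a helper written only for this illustration):

```python
import numpy as np

# Hypothetical toy data: y_up rises with x, y_down falls with x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
y_down = -y_up


def pearson(a, b):
    """Pearson correlation: covariance divided by the product of the spreads."""
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    return (a_centered * b_centered).sum() / np.sqrt(
        (a_centered ** 2).sum() * (b_centered ** 2).sum()
    )


r_up = pearson(x, y_up)      # close to +1: strong positive correlation
r_down = pearson(x, y_down)  # close to -1: strong negative correlation
print(r_up, r_down)
```

The coefficient only captures linear relationships, so a value near 0 does not rule out a nonlinear dependence.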

In [None]:
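# numeric_only skips the text columns (Country, Region), which recent pandas versions reject
correlation_matrix = happiness_report.corr(numeric_only=True)
correlation_matrix['Happiness Score'].sort_values(ascending=False)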

## Scatter Matrix

The graphs above give us a lot of insight, but we want more. Plotting every attribute against every other one by hand is not practical: it takes a lot of time, and we would rather have one figure that shows everything. With 10 numeric attributes, a full scatter matrix would contain 10*10 plots, so instead we display only the most influential attributes identified by the numbers above.

In [None]:
print(happiness_report.columns)
attributes = ["Happiness Score", "Happiness Rank", "Health (Life Expectancy)", "Economy (GDP per Capita)"]


plt.style.use('default')
pd.plotting.scatter_matrix(happiness_report[attributes], figsize=(16,10))

## Note!!

**We can see a strong positive relationship between the target and both *Health (Life Expectancy)* and *Economy (GDP per Capita)*, while *Happiness Rank* has a strong negative correlation with the *Happiness Score*.**

## Attribute Combination

It is worth trying to combine two attributes to derive a new one. However, even when several attributes have a strong positive or negative relationship, that does not necessarily give us information from which to derive a useful new attribute.

One combination I will try is the ratio of **Economy (GDP per Capita)** to **Family**.



In [None]:
happiness_report['GDP_for_family'] =  happiness_report['Economy (GDP per Capita)'] / happiness_report['Family']

# Now let's look at the correlations again (numeric columns only)
happiness_corr = happiness_report.corr(numeric_only=True)
happiness_corr['Happiness Score'].sort_values(ascending=False)

## Note !

The new attribute **GDP_for_family** has a positive correlation of *.35* with **Happiness Score**, so we can train the model both with and without it.

# Prepare the Data for Machine Learning Algorithms

We have gone through different ways of exploring and analyzing the data to understand it and to find its most representative attributes. We have also added a new attribute to consider when training our model.

From this step on, we need to automate our work as much as we can. It will not run only on the data we have: it will also run on the test set we kept aside and on new data once the system goes live, and some of the functions may be reused for other, similar datasets.

# Notes !

First, we need to separate the learning attributes from the target variable.

Second, we separate the categorical attributes from the numeric attributes.


In [None]:
happiness_report = tr_stratified_set.copy()
happiness_report_num, happiness_report_cat, happiness_report_labels = split_seprate(happiness_report)

# Numeric Attributes handling

We do not have any missing values in our training set: all 126 instances have a value for every attribute. That will not always be the case, though. The test set may include missing values, and once the system goes live some attributes may arrive without a value, so we need to handle this case and save the value we will substitute whenever one is missing.

Also, some attributes, like **Happiness Rank**, have a very different range from most of the others, and some models work best with data in a specific range. We will therefore apply Min-Max feature scaling so that all values lie between 0 and 1.
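As a sketch of what Min-Max scaling computes, each column is transformed as (x - min) / (max - min). The array `X` below is made up for illustration; the manual result is compared with scikit-learn's `MinMaxScaler`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: a wide-range column and a narrow-range column
X = np.array([[1.0, 0.2],
              [50.0, 0.8],
              [158.0, 1.4]])

# Manual Min-Max scaling, column by column
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# The same transformation with scikit-learn
scaled = MinMaxScaler().fit_transform(X)
print(scaled)
```

Because the minimum and maximum are learned during `fit`, new data transformed later can fall slightly outside the 0 to 1 range.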

Some options for dealing with missing values:

- Remove the attribute itself
- Remove the corresponding rows (instances)
- Replace the missing value with the mean, the median, or a constant such as 0, as the case requires
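The three options above can be sketched with plain pandas. The `toy` DataFrame is hypothetical and only borrows two column names from the dataset for readability.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing Family value
toy = pd.DataFrame({
    "Family": [1.2, np.nan, 0.9, 1.1],
    "Freedom": [0.6, 0.5, 0.4, 0.7],
})

# Option 1: remove the attribute itself
without_attribute = toy.drop(columns=["Family"])

# Option 2: remove the rows where the attribute is missing
without_rows = toy.dropna(subset=["Family"])

# Option 3: replace the missing value with a saved statistic (here the mean)
family_mean = toy["Family"].mean()
filled = toy.fillna({"Family": family_mean})
print(filled)
```

Whichever statistic is used, it should be computed on the training set and saved, so that the same value can be applied to the test set and to live data.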

In [None]:
imputer = SimpleImputer(strategy='mean')
imputer.fit(happiness_report_num.values)
imputer.statistics_

In [None]:
# Transform the same kind of input (a NumPy array) that the imputer was fitted on
X_training = imputer.transform(happiness_report_num.values)

X_training.shape # numpy array

In [None]:
# Feature Scaling
min_max_scaling = MinMaxScaler()
min_max_scaling.fit(X_training)
# Scale the imputed array, not the original DataFrame
X_training = min_max_scaling.transform(X_training)

In [None]:

# Return to pandas data frame with imputed data

happiness_report_num = pd.DataFrame(X_training, columns=happiness_report_num.columns,
                                   index=happiness_report_num.index)
happiness_report_num.head()

# Categorical attributes

Only one attribute is categorical: **Region**. Unlike Country, which has a single instance per value, **Region** takes 10 values overall and may carry information, for example that people living in one region have a higher happiness score than those in another.

That is not the main point, though. Most models accept only numbers, so we need to convert this attribute into numbers the model can work with. There are different ways to do this, and the choice depends on the kind of categorical attribute: whether it is ordinal, or unordered, in which case an arbitrary ordering may mislead the model.

**Region** is not ordinal, so we will use the **One Hot** method. Another option is to use an embedding.
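A minimal sketch of one-hot encoding on made-up rows (the `toy_cat` DataFrame is hypothetical; only two region names are borrowed from the dataset): each category becomes its own 0/1 column, so no artificial order is introduced.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with two distinct regions
toy_cat = pd.DataFrame({
    "Region": ["Western Europe", "Sub-Saharan Africa", "Western Europe"]
})

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix; toarray() converts it for display
encoded = encoder.fit_transform(toy_cat).toarray()

print(encoder.categories_)  # categories are stored in sorted order
print(encoded)
```

Each row has exactly one 1, in the column of its region.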

In [None]:
# happiness_report_cat = pd.DataFrame(happiness_report_cat, columns=['Region'])
one_hot_encode = OneHotEncoder()
happiness_report_cat_1hot = one_hot_encode.fit_transform(happiness_report_cat)
print(happiness_report_cat_1hot.shape)
print(one_hot_encode.categories_)
happiness_report_cat_1hot

# Sparse matrix

Because most entries of this matrix are 0, sklearn stores only the locations of the **nonzero** entries to save memory. To get back a NumPy array, just use the **toarray** method of the object.
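To illustrate the idea, here is a small sketch using SciPy's CSR format on a made-up one-hot matrix (the `dense` array is hypothetical). Only the nonzero entries are stored, and `toarray()` recovers the full array.

```python
import numpy as np
from scipy import sparse

# Hypothetical one-hot style matrix: one nonzero entry per row
dense = np.array([[0.0, 1.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0]])

compact = sparse.csr_matrix(dense)

print(compact.nnz)        # number of stored nonzero entries
print(compact.toarray())  # back to a regular NumPy array
```

The saving is negligible for a matrix this small, but it grows with the number of categories and rows.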

In [None]:
happiness_report_cat_1hot.toarray()

# Attribute Combination class

We discussed attribute combination earlier, but here we need to automate the process. Some attributes may also be worth removing from the dataset, like **Standard Error**, which has little effect on the target variable, as we saw.

We also want this step to work alongside sklearn's functionality.

In [None]:
print(happiness_report_num.values.shape)
attr_adder = AddCombinedAttributes()
add_extra_attr = attr_adder.transform(happiness_report_num.values)
add_extra_attr.shape

# PipeLine

We have moved through several stages, explaining each one and first applying it by hand. Most of the pieces are now in place, but it is helpful to go through these steps again as a simple pipeline that handles the **numeric** and **categorical** attributes separately.

## Numeric Pipeline

We apply these stages to the numeric attributes, in this order:
- SimpleImputer
- AddCombinedAttributes
- MinMaxScaler

## Categorical Pipeline

Only one step is needed here: converting each category to numbers with the one-hot encoder.

## Combine the Two Pipelines

Since we have one pipeline for numbers and one for categories, we introduce a single transformer that combines them.

In [None]:
happiness_report = tr_stratified_set.copy()
happiness_report = happiness_report.drop('Happiness Score' , axis=1)
num_attr_names = happiness_report_num.columns
cat_attr_name = ['Region']

In [None]:
numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('attr_combination', AddCombinedAttributes()),
    ('min_max_scaler', MinMaxScaler()),
])

In [None]:
full_pipeline = ColumnTransformer([
    ("num_pipeline", numeric_pipeline,  num_attr_names),
    ("cat_pipeline", OneHotEncoder(),  cat_attr_name)
])

happiness_report_prepared = full_pipeline.fit_transform(happiness_report)

# Note !

We have 9 numeric attributes, and AddCombinedAttributes brings this to 10. The categorical attribute has 10 discrete values, so its one-hot encoding is a matrix with 10 columns and one row per instance. The ColumnTransformer concatenates the output of each pipeline and returns one matrix, so we expect 20 columns in total.
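The column arithmetic can be checked on a small example. This is a sketch on made-up data (the `toy` DataFrame is hypothetical, and the custom attribute step is left out for brevity): 2 numeric columns plus 3 one-hot columns should give 5 output columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical toy data: 2 numeric attributes and 1 categorical with 3 values
toy = pd.DataFrame({
    "Family": [1.2, np.nan, 0.9, 1.1],
    "Freedom": [0.6, 0.5, 0.4, 0.7],
    "Region": ["A", "B", "C", "A"],
})

num_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", MinMaxScaler()),
])

prep = ColumnTransformer([
    ("num", num_pipe, ["Family", "Freedom"]),
    ("cat", OneHotEncoder(), ["Region"]),
])

prepared = prep.fit_transform(toy)
print(prepared.shape)  # rows stay the same; columns = numeric + one-hot
```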

In [None]:
happiness_report_prepared.shape

# Select & Train the model

Things are now simpler than we might assume. We have looked at the dataset from different perspectives, such as correlation, different graphical representations, and attribute combination, so it is time to check how a model performs on the training data.

Let's try a Linear Regression model, since we are dealing with a continuous target variable.

In [None]:
lin_reg = LinearRegression()
lin_reg.fit(happiness_report_prepared, happiness_report_labels)

# Now let's check the result on some data points, passing them through the pipeline first
some_data,  some_data_labeled = prepare_data(tr_stratified_set)
some_data_prepared = full_pipeline.transform(some_data)
predict_some_data = lin_reg.predict(some_data_prepared)

print("Predict Values\n", predict_some_data)

print("Actual", some_data_labeled)

In [None]:
happiness_report_predict = lin_reg.predict(happiness_report_prepared)
# scikit-learn's convention is (y_true, y_pred)
lin_mse = mean_squared_error(happiness_report_labels, happiness_report_predict)
lin_rmse = np.sqrt(lin_mse)
lin_rmse

In [None]:
# Decision Tree

tree_reg = DecisionTreeRegressor()
tree_reg.fit(happiness_report_prepared, happiness_report_labels)
happiness_report_predict = tree_reg.predict(happiness_report_prepared)
tree_mse = mean_squared_error(happiness_report_labels, happiness_report_predict)
tree_rmse = np.sqrt(tree_mse)
tree_rmse

# Better Evaluation

There is no big difference in error between the two models, but it seems that we have overfitted the training set. This may be due to the small number of instances we have, but either way we need a better evaluation than the one we have so far.
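To see why training error alone can be misleading, here is a sketch on synthetic data (all arrays are made up; nothing here comes from the happiness dataset). An unconstrained decision tree can memorize noisy training data, so its training RMSE is essentially zero while its error on held-out data is not.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical noisy data: a linear signal plus random noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(120, 3))
y = 2.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.3, size=120)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=42)

# No depth limit: the tree can grow until it fits every training point
tree = DecisionTreeRegressor(random_state=42).fit(X_tr, y_tr)

train_rmse = np.sqrt(mean_squared_error(y_tr, tree.predict(X_tr)))
val_rmse = np.sqrt(mean_squared_error(y_val, tree.predict(X_val)))
print(train_rmse, val_rmse)
```

A large gap between the two numbers is the usual sign of overfitting.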

## Cross Validation

Cross-validation is a helpful method that lets us train on one part of the training set and evaluate on another part of it. It also gives several evaluations instead of one: the same model is trained over several iterations, each time using a different part for evaluation and the rest for training. By the end, every instance in the training set has been used both for training and for evaluation.
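The procedure can be written out by hand. The sketch below uses synthetic data from `make_regression` (not the happiness dataset) and loops over the folds with `KFold`, then compares the result with `cross_val_score`, which for a regressor and an integer `cv` uses the same unshuffled folds.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical synthetic regression data
X, y = make_regression(n_samples=90, n_features=4, noise=10.0, random_state=0)

manual_rmse = []
for train_idx, val_idx in KFold(n_splits=3).split(X):
    # Train on two thirds, evaluate on the remaining third
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    fold_mse = mean_squared_error(y[val_idx], model.predict(X[val_idx]))
    manual_rmse.append(np.sqrt(fold_mse))

# The library call returns negative MSE, so negate before taking the root
library_rmse = np.sqrt(-cross_val_score(
    LinearRegression(), X, y, scoring="neg_mean_squared_error", cv=3
))
print(manual_rmse, library_rmse)
```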

In [None]:
scores = cross_val_score(tree_reg, happiness_report_prepared, happiness_report_labels,
                        scoring="neg_mean_squared_error", cv=3)
tree_rmse_scores = np.sqrt(-scores)
tree_rmse_scores

In [None]:
scores = cross_val_score(lin_reg, happiness_report_prepared, happiness_report_labels,
                        scoring="neg_mean_squared_error", cv=3)
# These scores belong to the linear model, so name them accordingly
lin_rmse_scores = np.sqrt(-scores)
lin_rmse_scores

# Summary

Both models appear to give good results, but LinearRegression looks better than DecisionTreeRegressor. The overfitting we suspected does not seem to be present for it, since the training and validation results are close to each other. One last check remains, on what we have kept aside: the **Test Set**.

In [None]:
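# Use every test instance; prepare_data keeps only the first 10 rows by default
test_data, test_labeled = prepare_data(tes_stratified_set,
                                       number_of_instances=len(tes_stratified_set))
test_data_prepared = full_pipeline.transform(test_data)
predict_test_data = lin_reg.predict(test_data_prepared)
lin_mse = mean_squared_error(test_labeled, predict_test_data)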
lin_rmse = np.sqrt(lin_mse)
lin_rmse

# Note !

The error on the test set is small, which suggests that the model is ready to be launched in our system.
