Hello! For my first kernel ever, I will try some data visualization and then try to predict happiness from the given columns. There's a first time for everything, and after reading a lot about data science, it is time to do some actual work. Let's go!

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn

%pylab inline
pylab.rcParams['figure.figsize'] = (10, 7)

df = pd.read_csv("../input/happiness-and-alcohol-consumption/HappinessAlcoholConsumption.csv", sep=",", engine="python")

Let's have a look at the data first.

In [None]:
print("{} countries".format(len(df)))
df.head()

Let's see the happiness distribution, along with the happiest and least happy countries and a few other extremes.

In [None]:
X = df["HappinessScore"]
plt.hist(X, bins=30)
plt.title("Happiness Score Histogram")
plt.xlabel("Happiness Score")
plt.ylabel("# of rows")
plt.show()

In [None]:
print("Most happy country:")
max_happiness = df["HappinessScore"].max()
df[df["HappinessScore"] == max_happiness]

In [None]:
print("Least happy country:")
min_happiness = df["HappinessScore"].min()
df[df["HappinessScore"] == min_happiness]

In [None]:
print("Country with the highest GDP per capita:")
max_GDP = df["GDP_PerCapita"].max()
df[df["GDP_PerCapita"] == max_GDP]

In [None]:
print("Country with the highest beer consumption per capita:")
max_beer = df["Beer_PerCapita"].max()
df[df["Beer_PerCapita"] == max_beer]
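
The cells above filter on the exact maximum or minimum, so each one returns only the top row (or rows, if there is a tie). To look at a few countries at each end, `nlargest` and `nsmallest` are an alternative. A minimal sketch on a made-up frame (the country names and scores are invented for illustration, not taken from the dataset):

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D", "E"],
    "HappinessScore": [7.5, 3.1, 6.2, 4.8, 5.5],
})

# Top 2 and bottom 2 rows by HappinessScore.
happiest = toy.nlargest(2, "HappinessScore")
unhappiest = toy.nsmallest(2, "HappinessScore")

print(happiest)
print(unhappiest)
```

On the real data, the same calls would be made on `df` with whichever column is of interest.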

Now let's check whether there is any correlation between the columns. This can be useful for choosing features later.

In [None]:
pylab.rcParams['figure.figsize'] = (7, 6)
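# Correlate the numeric columns only; recent pandas raises on text columns such as Country
sn.heatmap(df.select_dtypes("number").corr(), annot=True)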
plt.title("Correlation matrix")
plt.show()

Some curves to check whether there is an obvious correlation between happiness and a few interesting columns.

In [None]:
pylab.rcParams['figure.figsize'] = (10, 8)

df_sorted = df.sort_values(by=["Beer_PerCapita"])
X, y = df_sorted["Beer_PerCapita"], df_sorted["HappinessScore"]

plt.subplot(5, 1, 1)
plt.plot(X, y)
plt.xlabel("Beer_PerCapita")
plt.ylabel("HappinessScore")
plt.title("Relation between beer consumption and happiness")


df_sorted = df.sort_values(by=["GDP_PerCapita"])
X, y = df_sorted["GDP_PerCapita"], df_sorted["HappinessScore"]

plt.subplot(5, 1, 3)
plt.plot(X, y)
plt.xlabel("GDP_PerCapita")
plt.ylabel("HappinessScore")
plt.title("Relation between GDP and happiness")


df_sorted = df.sort_values(by=["HDI"])
X, y = df_sorted["HDI"], df_sorted["HappinessScore"]

plt.subplot(5, 1, 5)
plt.plot(X, y)
plt.xlabel("HDI")
plt.ylabel("HappinessScore")
plt.title("Relation between HDI and happiness")

plt.show()

As we can see, there seems to be a small correlation between HDI, GDP and happiness, but beer consumption does not appear to be related to happiness (and I'm pretty sorry about that, as I wanted to predict happiness using beer consumption alone).

OK, just for kicks, let's look at the mean values of the data grouped by Region.

In [None]:
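# Average the numeric columns only; text columns such as Country cannot be averaged
df_group_by = df.groupby(["Region"]).mean(numeric_only=True)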
df_group_by

Enough is enough, let's predict the happiness score of a country.
We first try without any location feature. For the rest of the kernel, we will only use a LinearRegression model.

In [None]:
# Predicting happiness using linear model
from sklearn import linear_model
from sklearn.model_selection import train_test_split
reg = linear_model.LinearRegression()

X_columns = ["HDI", "GDP_PerCapita", "Beer_PerCapita", "Wine_PerCapita", "Spirit_PerCapita"]

train = df.sample(frac=0.8,random_state=200)
test = df.drop(train.index)

X_train = train[X_columns]
y_train = train["HappinessScore"]
X_test = test[X_columns]
y_test = test["HappinessScore"]

reg.fit(X_train, y_train)
print("R^2 score:", reg.score(X_test, y_test))
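
For reference, the score printed above is the coefficient of determination, R^2 = 1 - SS_res / SS_tot: 1.0 is a perfect fit, 0.0 is no better than always predicting the mean of the test targets, and negative values are worse than that. A small by-hand sketch with made-up numbers (the helper name `r2_score_by_hand` is hypothetical, not part of scikit-learn):

```python
import numpy as np

def r2_score_by_hand(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, the quantity LinearRegression.score reports."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # squared error left after the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of predicting the mean
    return 1.0 - ss_res / ss_tot

# Made-up scores, for illustration only.
y_true = [4.0, 5.0, 6.0, 7.0]

perfect = r2_score_by_hand(y_true, y_true)                 # 1.0
baseline = r2_score_by_hand(y_true, [5.5, 5.5, 5.5, 5.5])  # 0.0, the mean predictor
worse = r2_score_by_hand(y_true, [7.0, 6.0, 5.0, 4.0])     # -3.0, worse than the mean

print(perfect, baseline, worse)
```

Read this way, a test R^2 of 0.6 means the model leaves about 40% of the squared error that a mean-only predictor would make on the test rows.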

Well, I know 0.6 isn't a good value, but I don't know how bad it is. Maybe I don't have enough features? Let's try adding some polynomial features.

In [None]:
# Predicting happiness using linear model, adding polynomial features
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
reg = linear_model.LinearRegression()
poly = PolynomialFeatures(2)

X_columns = ["HDI", "GDP_PerCapita", "Beer_PerCapita", "Wine_PerCapita", "Spirit_PerCapita"]

train = df.sample(frac=0.8,random_state=200)
test = df.drop(train.index)

X_train = train[X_columns]
y_train = train["HappinessScore"]
X_test = test[X_columns]
y_test = test["HappinessScore"]
X_train = poly.fit_transform(X_train)
X_test = poly.transform(X_test)  # reuse the expansion fitted on the training rows

reg.fit(X_train, y_train)
print("R^2 score with polynomial features:", reg.score(X_test, y_test))
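
One way to check whether the polynomial features overfit is to compare R^2 on the training rows with R^2 on the test rows: `PolynomialFeatures(2)` turns 5 input columns into 21, and a training score far above the test score is the usual symptom. A minimal sketch on synthetic data (the row counts, the coefficients and the noise level are assumptions made for this example, not properties of the happiness dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Synthetic stand-in for the data: 5 input columns, 100 training and 25 test rows.
X_train = rng.normal(size=(100, 5))
X_test = rng.normal(size=(25, 5))
true_coef = np.array([1.0, 0.5, 0.0, 0.0, 0.0])
y_train = X_train @ true_coef + rng.normal(scale=1.0, size=100)
y_test = X_test @ true_coef + rng.normal(scale=1.0, size=25)

# Degree-2 expansion: bias + 5 linear terms + 15 squares and pairwise products.
poly = PolynomialFeatures(2)
X_train_poly = poly.fit_transform(X_train)  # fit on the training rows only
X_test_poly = poly.transform(X_test)        # reuse the same expansion
n_poly_features = X_train_poly.shape[1]

linear = LinearRegression().fit(X_train, y_train)
quadratic = LinearRegression().fit(X_train_poly, y_train)

# (train R^2, test R^2) for each model.
scores = {
    "linear": (linear.score(X_train, y_train), linear.score(X_test, y_test)),
    "quadratic": (quadratic.score(X_train_poly, y_train), quadratic.score(X_test_poly, y_test)),
}
print("polynomial feature count:", n_poly_features)
for name, (train_r2, test_r2) in scores.items():
    print(f"{name}: train R^2 = {train_r2:.3f}, test R^2 = {test_r2:.3f}")
```

The training R^2 of the larger model can never be lower than that of the linear one, since the linear columns are a subset of the polynomial ones. Only the test R^2 says whether the extra columns help.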

OK, so now we are obviously overfitting. Let's forget about polynomial features and try adding a one-hot encoding of the Region values instead.
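
For readers new to one-hot encoding, here is what `pd.get_dummies` does on a tiny frame. This is only a sketch: the country names are placeholders, and the two region names are borrowed from the dataset's Region column.

```python
import pandas as pd

# Toy frame; country names are placeholders.
toy = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Region": ["Western Europe", "Sub-Saharan Africa", "Western Europe"],
})

# Each distinct Region value becomes its own indicator column named region_<value>.
encoded = pd.get_dummies(toy, prefix="region", columns=["Region"])
print(encoded)
```

The original Region column is replaced by one indicator column per region. Depending on the pandas version, the indicators are stored as 0/1 integers or as booleans.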

In [None]:
# Adding region to X_columns
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
reg = linear_model.LinearRegression()

df_one_hot = pd.get_dummies(df, prefix="region", columns=["Region"])
# Build a list, not a set: recent pandas versions reject a set as a column indexer
X_columns = [c for c in df_one_hot.columns if c not in {"Country", "HappinessScore", "Hemisphere"}]

train = df_one_hot.sample(frac=0.8,random_state=200)
test = df_one_hot.drop(train.index)

X_train = train[X_columns]
y_train = train["HappinessScore"]
X_test = test[X_columns]
y_test = test["HappinessScore"]

reg.fit(X_train, y_train)
print("R^2 score with region infos:", reg.score(X_test, y_test))

0.7 is better! Maybe a one-hot encoding of both the Region and Hemisphere features helps?

In [None]:
# Adding region and hemisphere to X_columns
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
reg = linear_model.LinearRegression()

df_one_hot = pd.get_dummies(df, prefix="region", columns=["Region"])
df_one_hot["Hemisphere"] = df_one_hot["Hemisphere"].replace("noth", "north")  # Correcting a misspelling in the data
df_one_hot = pd.get_dummies(df_one_hot, prefix="hemisphere", columns=["Hemisphere"])
# Build a list, not a set: recent pandas versions reject a set as a column indexer
X_columns = [c for c in df_one_hot.columns if c not in {"Country", "HappinessScore"}]

train = df_one_hot.sample(frac=0.8,random_state=200)
test = df_one_hot.drop(train.index)

X_train = train[X_columns]
y_train = train["HappinessScore"]
X_test = test[X_columns]
y_test = test["HappinessScore"]

reg.fit(X_train, y_train)
print("R^2 score with region and hemisphere infos:", reg.score(X_test, y_test))

Hemisphere does not really add much. Could we select a few features just by looking at the correlation matrix?

In [None]:
# New correlation matrix
df_one_hot = pd.get_dummies(df, prefix="region", columns=["Region"])
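# Keep HappinessScore and the region indicators; Hemisphere is text, so it cannot be correlated
excluded = {"Country", "Hemisphere", "HDI", "GDP_PerCapita", "Beer_PerCapita", "Spirit_PerCapita", "Wine_PerCapita"}
select_columns = [c for c in df_one_hot.columns if c not in excluded]
df_corr = df_one_hot[select_columns]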

pylab.rcParams['figure.figsize'] = (7, 6)
sn.heatmap(df_corr.corr(), annot=True)
plt.title("Correlation matrix")
plt.show()
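
Instead of reading the strongest correlations off the heatmap, the ranking can also be computed directly. A minimal sketch on a made-up frame, with values chosen so the correlations are easy to verify by hand (columns `a`, `b` and `c` are hypothetical):

```python
import pandas as pd

# Made-up columns, for illustration only.
toy = pd.DataFrame({
    "HappinessScore": [1, 2, 3, 4, 5],
    "a": [2, 4, 6, 8, 10],   # perfectly correlated with the target (r = 1.0)
    "b": [5, 3, 4, 1, 2],    # strongly anti-correlated (r = -0.8)
    "c": [1, -1, 0, -1, 1],  # uncorrelated (r = 0.0)
})

# Pearson correlation of every column with the target, ranked by absolute value.
corr_with_target = toy.corr()["HappinessScore"].drop("HappinessScore")
ranking = corr_with_target.abs().sort_values(ascending=False)
print(ranking)
```

Ranking by absolute value matters here: a strongly negative correlation, like `b`, is as useful to a linear model as a strongly positive one.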

OK, so now we can try one last time, with only the HDI, GDP, Sub-Saharan Africa and Western Europe features (the ones most correlated with happiness).

In [None]:
# Final try, with selected columns, and polynomial features
# columns are selected according to the correlation matrices
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
reg = linear_model.LinearRegression()
poly = PolynomialFeatures(2)

df_one_hot = pd.get_dummies(df, prefix="region", columns=["Region"])
train = df_one_hot.sample(frac=0.8,random_state=200)
test = df_one_hot.drop(train.index)

X_columns = ["HDI", "GDP_PerCapita", "region_Sub-Saharan Africa", "region_Western Europe"]
X_train = train[X_columns]
y_train = train["HappinessScore"]
X_test = test[X_columns]
y_test = test["HappinessScore"]
X_train = poly.fit_transform(X_train)
X_test = poly.transform(X_test)  # reuse the expansion fitted on the training rows

reg.fit(X_train, y_train)
print("R^2 score with selected polynomial features:", reg.score(X_test, y_test))
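
A caveat when comparing the scores in this kernel: each one comes from a single 80/20 split with `random_state=200`, so part of the difference between them may come from which rows happened to land in the test set. K-fold cross-validation averages over several splits. A hedged sketch on synthetic data (the row count, the coefficients and the number of folds are assumptions for this example):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Synthetic stand-in for the data: 120 rows, 4 input columns.
X = rng.normal(size=(120, 4))
y = X @ np.array([1.0, 0.5, -0.5, 0.0]) + rng.normal(scale=0.5, size=120)

# The pipeline refits the polynomial expansion and the regression inside each fold.
model = make_pipeline(PolynomialFeatures(2), LinearRegression())
cv = KFold(n_splits=5, shuffle=True, random_state=200)
fold_scores = cross_val_score(model, X, y, cv=cv, scoring="r2")

print("R^2 per fold:", np.round(fold_scores, 3))
print("mean:", fold_scores.mean(), "std:", fold_scores.std())
```

The spread across folds gives a rough idea of how much a single-split R^2 can move by chance.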

Well, 0.78 is better, but I guess this is not the 0.99 I usually find in books. A better model or better data preparation might increase the R², but I feel that these features alone cannot completely determine the happiness score.
Anyway, it is interesting to see that we ended up predicting the happiness score without using alcohol consumption!
Maybe we could combine all the alcohol columns into one (for example a sum weighted by alcohol content?)
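
A minimal sketch of that last idea. The alcohol fractions below are placeholder assumptions, not values from the dataset, and the helper `add_weighted_alcohol` and the `Alcohol_Weighted` column are hypothetical names. The sketch also assumes the three columns are measured in comparable volume units, which has not been checked here.

```python
import pandas as pd

# Assumed alcohol fraction per drink type; placeholder values, not from the dataset.
ASSUMED_ALCOHOL_FRACTION = {
    "Beer_PerCapita": 0.05,
    "Wine_PerCapita": 0.12,
    "Spirit_PerCapita": 0.40,
}

def add_weighted_alcohol(frame, weights=ASSUMED_ALCOHOL_FRACTION):
    """Return a copy of frame with a hypothetical Alcohol_Weighted column."""
    out = frame.copy()
    # Weighted sum of the alcohol columns, one weight per column.
    out["Alcohol_Weighted"] = sum(out[col] * w for col, w in weights.items())
    return out

# Made-up rows, for illustration only.
toy = pd.DataFrame({
    "Beer_PerCapita": [100, 0],
    "Wine_PerCapita": [50, 10],
    "Spirit_PerCapita": [10, 200],
})
result = add_weighted_alcohol(toy)
print(result)
```

The new column could then be added to `X_columns` in place of the three separate alcohol columns to see whether it changes the R^2.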