In [1]:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
import pandas as pd
from sklearn.model_selection import train_test_split

In [2]:
df = pd.read_csv('data/tobacco_data.csv')

In [3]:
df.head()

Unnamed: 0,Country,Year,Tobac_Use_M,Tobac_Use_F,Tax_2015,Happiness_Score,Afford_2015,Ban_Score_Dir_Ads,Ban_Score_Indr_Ads,Ban_Score_add_indir_ads,Warn_Score,Ban_Score_places
0,Albania,2015,51.2,7.6,65.195,4.959,3.92,8,8,3,50,8
1,Argentina,2015,29.5,18.4,75.045,6.574,1.31,7,10,5,50,8
2,Armenia,2015,52.3,1.5,34.165,4.35,3.945,5,2,0,50,3
3,Australia,2015,16.7,13.1,58.515,7.284,2.285,6,2,0,83,6
4,Austria,2015,35.5,34.8,74.835,7.2,1.225,7,8,5,65,2


In [4]:
df.drop(['Country', 'Year'] , axis=1, inplace=True)

In [5]:
df = df.replace('^', 0)
df.head()

Unnamed: 0,Tobac_Use_M,Tobac_Use_F,Tax_2015,Happiness_Score,Afford_2015,Ban_Score_Dir_Ads,Ban_Score_Indr_Ads,Ban_Score_add_indir_ads,Warn_Score,Ban_Score_places
0,51.2,7.6,65.195,4.959,3.92,8,8,3,50,8
1,29.5,18.4,75.045,6.574,1.31,7,10,5,50,8
2,52.3,1.5,34.165,4.35,3.945,5,2,0,50,3
3,16.7,13.1,58.515,7.284,2.285,6,2,0,83,6
4,35.5,34.8,74.835,7.2,1.225,7,8,5,65,2


# Correlation Matrix

In [8]:
corr = df.corr()
corr.style.background_gradient(cmap='coolwarm')

Unnamed: 0,Tobac_Use_M,Tobac_Use_F,Tax_2015,Happiness_Score,Afford_2015,Ban_Score_Dir_Ads,Ban_Score_Indr_Ads,Ban_Score_add_indir_ads,Ban_Score_places
Tobac_Use_M,1.0,0.0847056,0.0967064,-0.327438,-0.0985019,-0.0689886,-0.203158,-0.187177,0.0125115
Tobac_Use_F,0.0847056,1.0,0.656236,0.420648,-0.42563,-0.00518897,-0.0535025,-0.0825792,0.00868903
Tax_2015,0.0967064,0.656236,1.0,0.441964,-0.487305,0.0459263,-0.00311081,-0.10922,0.16135
Happiness_Score,-0.327438,0.420648,0.441964,1.0,-0.626922,-0.0272592,-0.00420253,-0.0360306,-0.156774
Afford_2015,-0.0985019,-0.42563,-0.487305,-0.626922,1.0,-0.155646,-0.000443755,0.0193821,0.0304187
Ban_Score_Dir_Ads,-0.0689886,-0.00518897,0.0459263,-0.0272592,-0.155646,1.0,0.611921,0.557217,0.207777
Ban_Score_Indr_Ads,-0.203158,-0.0535025,-0.00311081,-0.00420253,-0.000443755,0.611921,1.0,0.801185,0.230062
Ban_Score_add_indir_ads,-0.187177,-0.0825792,-0.10922,-0.0360306,0.0193821,0.557217,0.801185,1.0,0.164753
Ban_Score_places,0.0125115,0.00868903,0.16135,-0.156774,0.0304187,0.207777,0.230062,0.164753,1.0


#### Explanation: 
The correlation coefficient lies in the range [-1, 1]. A positive value indicates that two indicators tend to increase together, and a negative value that one tends to decrease as the other increases. Values close to the boundaries (-1 and 1) indicate a strong linear relationship between indicators, while values close to zero indicate a weak or negligible one.
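
To make the range concrete, here is a minimal sketch using `np.corrcoef` on made-up arrays. The arrays and the helper name `pearson` are assumptions for illustration and are not taken from the tobacco data.

```python
import numpy as np

# Toy data for illustration only; not taken from tobacco_data.csv.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = 2.0 * x + 1.0                              # increases exactly linearly with x
y_down = -3.0 * x + 10.0                          # decreases exactly linearly with x
y_flat = np.array([2.0, -1.0, 0.0, -1.0, 2.0])    # symmetric around x = 3, no linear trend


def pearson(a, b):
    """Pearson correlation coefficient of two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


r_up = pearson(x, y_up)      # at the upper boundary of the range
r_down = pearson(x, y_down)  # at the lower boundary of the range
r_flat = pearson(x, y_flat)  # no linear relationship
print(r_up, r_down, r_flat)
```

Note that `y_flat` clearly depends on `x` (it is U-shaped), yet its correlation is zero: the coefficient only measures linear association.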

#### Analytics:
As the correlation matrix shows, tobacco use among men is negatively correlated with happiness (-0.33) and with the advertising restrictions (-0.07 to -0.20), which makes sense. However, no indicator is strongly correlated with men's tobacco consumption.
Among the positive correlations, tax is the largest (0.10), which runs against what governments claim about it, but the effect is not significant. The only other indicator positively correlated with men's tobacco use is the restriction on places where smoking is allowed (0.01), and this effect is negligible.

Women, on the other hand, appear to be more influenced by these indicators: tax (0.66) and happiness (0.42) are both positively correlated with women's tobacco use. Although this points to different habits among women, it supports our hypothesis that increasing the tax does not reduce tobacco use. It also suggests that tobacco use among women is mostly recreational rather than a regular habit.

However, the negative correlation between affordability and women's tobacco use (-0.43) suggests that women who can more easily afford tobacco (and who may belong to a wealthier class) are less likely to smoke, possibly because wealthier women have more alternative forms of entertainment.
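
The comparison of indicators above can be reproduced by ranking them by the absolute value of their correlation with the target. The sketch below copies the `Tobac_Use_M` row from the correlation matrix into a plain dictionary (leaving out `Tobac_Use_F`, which is a target rather than an indicator); the variable names are illustrative assumptions.

```python
# Correlations with Tobac_Use_M, copied from the correlation matrix above.
corr_with_men = {
    "Tax_2015": 0.0967064,
    "Happiness_Score": -0.327438,
    "Afford_2015": -0.0985019,
    "Ban_Score_Dir_Ads": -0.0689886,
    "Ban_Score_Indr_Ads": -0.203158,
    "Ban_Score_add_indir_ads": -0.187177,
    "Ban_Score_places": 0.0125115,
}

# Sort by strength of the linear relationship, ignoring the sign.
ranked = sorted(corr_with_men.items(), key=lambda kv: abs(kv[1]), reverse=True)

for name, r in ranked:
    print(f"{name:26s} {r:+.3f}")
```

With the notebook's DataFrame in memory, the same ranking could be obtained from `corr['Tobac_Use_M']` directly instead of a hand-copied dictionary.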

# Linear Regression
https://realpython.com/linear-regression-in-python/


https://towardsdatascience.com/let-us-understand-the-correlation-matrix-and-covariance-matrix-d42e6b643c22

In [10]:
y_f = np.array(df['Tobac_Use_F'])
y_m = np.array(df['Tobac_Use_M'])

In [11]:
lr_df = df.drop(['Tobac_Use_M', 'Tobac_Use_F'] , axis=1)

In [12]:
x = lr_df.to_numpy()

In [13]:
model_m = LinearRegression().fit(x, y_m)
r_sq_m = model_m.score(x, y_m) 
r_sq_m

0.3348470070477083

In [14]:
print('slope:', model_m.coef_)

slope: [ 0.1203863  -8.21750376 -1.34606392 -0.25634906 -0.50224731 -0.21917918
 -0.01711739 -0.28092335]


In [15]:
model_f = LinearRegression().fit(x, y_f)
r_sq_f = model_f.score(x, y_f) 
r_sq_f

0.4625619100103973

#### Explanation: 
Neither model maps tobacco use across countries satisfactorily: R² is about 0.33 for men and 0.46 for women, and both values are computed on the same data the models were fitted on.


#### R squared definition:
http://www.fairlynerdy.com/what-is-r-squared/
Any R squared value greater than zero means that the regression analysis did better than just using a horizontal line through the mean value.  In the rare cases you get a negative r squared value, you should probably rethink your regression analysis, especially if you are forcing an intercept.
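
As a rough illustration of this definition, R squared can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The sketch below uses made-up numbers, and the helper name `r_squared` is an assumption for illustration, not part of the analysis.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """R squared = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # error of the horizontal mean line
    return float(1.0 - ss_res / ss_tot)


# Toy targets for illustration only.
y = np.array([10.0, 20.0, 30.0, 40.0])

r2_mean = r_squared(y, np.full_like(y, y.mean()))  # always predicting the mean
r2_perfect = r_squared(y, y)                       # perfect predictions
r2_bad = r_squared(y, y[::-1])                     # predictions worse than the mean line
print(r2_mean, r2_perfect, r2_bad)
```

Predicting the mean gives exactly zero, and any prediction whose squared error exceeds that of the mean line gives a negative value.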

In [16]:
print('slope:', model_f.coef_)
print('intercept:', model_f.intercept_)

slope: [ 0.2770912   0.933964   -0.18411967 -0.13775466 -0.30140579  0.3357009
  0.01268157 -0.2118413 ]
intercept: -6.314735516844612


# Polynomial Regression

https://towardsdatascience.com/polynomial-regression-bbe8b9d97491

In [17]:
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
poly_features = PolynomialFeatures(degree=2, include_bias=True)
X_train, X_test, Y_train, Y_test = train_test_split(x, y_m, test_size=0.3, random_state=1)

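X_train_poly = poly_features.fit_transform(X_train)
# Reuse the transformer fitted on the training data; do not refit it on the test set
X_test_poly = poly_features.transform(X_test)
# Fit the scaler on the training features only.
# Note: the scaler is fitted but never applied, so the model below uses unscaled features.
scaler.fit(X_train_poly)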

StandardScaler(copy=True, with_mean=True, with_std=True)
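
One way to keep the expansion, scaling and regression steps consistent between training and test data is to chain them in a scikit-learn pipeline, so that every step is fitted on the training split only. The following is a minimal sketch on synthetic data; the data, the `_demo` names and the choice of `include_bias=False` (the regression already fits an intercept) are assumptions for illustration, not the notebook's setup.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (not the tobacco data): a quadratic signal plus small noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 3))
y_demo = X_demo[:, 0] ** 2 - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=60)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1
)

# Each step is fitted on the training split only when pipe.fit is called;
# pipe.score and pipe.predict then apply the already fitted transforms.
pipe = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    LinearRegression(),
)
pipe.fit(X_tr, y_tr)

r2_tr = pipe.score(X_tr, y_tr)
r2_te = pipe.score(X_te, y_te)
print(r2_tr, r2_te)
```

Because the synthetic target really is quadratic and the sample is not tiny relative to the number of terms, both scores come out high here; that is a property of the toy data, not a prediction for the tobacco data.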

In [18]:
# fit the transformed features to Linear Regression
poly_model = LinearRegression()
poly_model.fit(X_train_poly, Y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [19]:
# predicting on training data-set
y_train_predicted = poly_model.predict(X_train_poly)

In [20]:
# predicting on test data-set
y_test_predict = poly_model.predict(X_test_poly)

In [21]:
# evaluating the model on training dataset
rmse_train = np.sqrt(mean_squared_error(Y_train, y_train_predicted))
r2_train = r2_score(Y_train, y_train_predicted)

In [22]:
# evaluating the model on test dataset
rmse_test = np.sqrt(mean_squared_error(Y_test, y_test_predict))
r2_test = r2_score(Y_test, y_test_predict)

In [23]:
print("The model performance for the training set")
print("-------------------------------------------")
print("RMSE of training set is {}".format(rmse_train))
print("R2 score of training set is {}".format(r2_train))

print("\n")

print("The model performance for the test set")
print("-------------------------------------------")
print("RMSE of test set is {}".format(rmse_test))
print("R2 score of test set is {}".format(r2_test))

The model performance for the training set
-------------------------------------------
RMSE of training set is 5.284467783625755
R2 score of training set is 0.8148437985479489


The model performance for the test set
-------------------------------------------
RMSE of test set is 16.075558315017837
R2 score of test set is -0.29386735670008424


#### Explanation: 
The results show overfitting (R² of 0.81 on the training set versus -0.29 on the test set), which is likely caused by fitting a more complex model, with degree-2 polynomial features, to a small data set.
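
To get a feel for how quickly the model grows, the number of polynomial terms can be counted directly. The sketch below assumes 8 input columns, as the 8 coefficients printed for the linear models suggest; the helper name `n_polynomial_terms` is hypothetical.

```python
from math import comb


def n_polynomial_terms(n_features, degree, include_bias=True):
    """Number of monomials of total degree <= degree in n_features variables."""
    total = comb(n_features + degree, degree)
    return total if include_bias else total - 1


# 8 inputs, degree 2: 1 bias + 8 linear + 8 squared + 28 interaction terms.
n_terms = n_polynomial_terms(8, 2)
print(n_terms)
```

Under that assumption the degree-2 expansion has 45 columns, so the regression has many more coefficients to fit than the plain linear model, which makes overfitting on a small training split much easier.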

# SVM
https://realpython.com/linear-regression-in-python/

In [216]:
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn import metrics

In [223]:
X_train, X_test, y_train, y_test = train_test_split(x, y_m, test_size=0.2, random_state=1)

In [224]:
svr = SVR(kernel='linear', epsilon=0.1)  # linear kernel (the default is 'rbf'); epsilon=0.1 is the default
svr.fit(X_train,y_train)
y_pred_train = svr.predict(X_train)
y_pred_test = svr.predict(X_test)
rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_test))
rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_train))
print(svr.score(X_train,y_train))
print(r2_score(y_test,y_pred_test))
print(rmse_train)
print(rmse_test)
# print('Accuracy Score:', svr.score(X_train, y_pred_train, sample_weight=None))

0.2674419488353734
-1.6162520347341136
10.956854494026805
20.845230399345212


R squared interpretation: https://www.datasciencecentral.com/profiles/blogs/regression-analysis-how-do-i-interpret-r-squared-and-assess-the

#### Explanation: 
The linear SVR does not give a promising result either: R² is 0.27 on the training set and -1.62 on the test set.

In [219]:
# # evaluating the model on training dataset
# rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_train))
# r2_train = r2_score(y_train, y_pred_train)
# print(rmse_train, r2_train)

In [220]:
# # evaluating the model on testing dataset
# rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_test))
# r2_test = r2_score(y_test, y_pred_test)
# print(rmse_test, r2_test)