<a href="https://colab.research.google.com/github/Snayderr/data_science/blob/main/People_Satisfaction_and_GDP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# How to enable code completion:

Open the Tools menu ==> Settings ==> Editor ==> enable "Automatically trigger code completions"




# Instructions to create a copy of this notebook for yourself

You do not have write access to this notebook.

* From the menu bar, go to File.
* Select "Save a copy in my Drive".
* Navigate to Google Drive.
* Find the folder named "Colab Notebooks" and open it to find your notebook.
* Rename it and start making changes.

**Note:** If your code needs to read a file, copy that file from the instructor's folder to your own Google Drive by following the steps below:

* Right-click the file name
* Select "Make a copy"
* Click the new file
* Move it to the desired folder, preferably the one that holds your notebook

# Predict People Satisfaction Across the Globe

Problem Statement:

We would like to build a model that predicts the life satisfaction score of a country's population from that country's GDP per capita.

# Download Dataset

Download the Better Life Index data (the latest edition at the time of writing is 2017) from the [OECD’s website](http://homl.info/4) and the GDP per capita statistics from the [IMF’s website](http://homl.info/5). Then join the two tables and sort the result by GDP per capita.

# Import Dataset to Google Colab

1. Download the CSV and XLS files to your computer
2. Upload them to your Google Drive
3. Open each file with Google Sheets so that Google creates a Google Sheets version of the dataset
4. You can then remove the CSV and XLS files from your Drive
5. Follow the step-by-step guide [here](https://colab.research.google.com/notebooks/io.ipynb#scrollTo=vz-jH8T_Uk2c) and scroll down to the **"Google Sheets"** cell to import the data into a DataFrame

NOTE: After creating the Google Sheet in your Drive, make sure you convert the 2015 column to the 0.00 number format before importing it into Colab. Otherwise Google will import it as a string and the data will be hard to clean.
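
If the 2015 column does reach pandas as text, it can also be cleaned on the pandas side. The following is a minimal sketch, assuming the values use comma thousands separators and `n/a` for missing entries. The rows are made up for illustration and are not taken from the IMF file.

```python
import pandas as pd

# Hypothetical rows imitating a sheet column that was imported as text
raw = pd.DataFrame({
    "Country": ["Country A", "Country B", "Country C"],
    "2015": ["9,054.91", "37,675.01", "n/a"],
})

# Strip the thousands separators, then convert; unparseable entries become NaN
cleaned = pd.to_numeric(raw["2015"].str.replace(",", "", regex=False), errors="coerce")
print(cleaned)
```

Using `errors="coerce"` is a design choice: it keeps the conversion from failing on placeholders such as `n/a`, at the cost of introducing missing values that must be handled later.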




In [None]:
# Run the line below once to install gspread, then comment it out again
#!pip install --upgrade -q gspread
from google.colab import auth
auth.authenticate_user()

import gspread
from google.auth import default

# oauth2client is deprecated; google.auth supplies the credentials instead
creds, _ = default()
gc = gspread.authorize(creds)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Use gc to open Google Sheet Datasets

In [None]:
# Open the sheet by its title (gc.open takes the spreadsheet title, not a Drive path)
worksheet = gc.open('Cópia de BLI_30012019054825599').sheet1

# Read all cell values from the sheet
bli_rows = worksheet.get_all_values()

# Convert to a DataFrame and render.
import pandas as pd
bli  = pd.DataFrame.from_records(bli_rows, columns = bli_rows[0])

# Remove rows where inequality has values other than TOT
bli = bli[bli["INEQUALITY"]=="TOT"]

# Reformat data based on "indicator column"
bli = bli.pivot(index="Country", columns="Indicator", values="Value")
#bli.head()
bli["Life satisfaction"].head()
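
The pivot step above turns a long table (one row per country and indicator) into a wide one (one row per country, one column per indicator). Here is a small sketch on made-up rows; the country names and values are placeholders, not OECD data.

```python
import pandas as pd

# Made-up long-format rows: one row per (Country, Indicator) pair
long_df = pd.DataFrame({
    "Country": ["Country A", "Country A", "Country B", "Country B"],
    "Indicator": ["Life satisfaction", "Employment rate"] * 2,
    "Value": [7.3, 72.0, 6.9, 64.0],
})

# pivot turns each distinct Indicator into its own column, indexed by Country
wide_df = long_df.pivot(index="Country", columns="Indicator", values="Value")
print(wide_df)
```

`pivot` raises an error when the same (Country, Indicator) pair appears more than once, which is presumably one reason the rows are filtered to `INEQUALITY == "TOT"` first.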

In [None]:
# Same steps as above, opening the sheet under its original title
worksheet = gc.open('BLI_30012019054825599').sheet1
bli_rows = worksheet.get_all_values()
import pandas as pd
bli = pd.DataFrame.from_records(bli_rows, columns=bli_rows[0])
bli = bli[bli["INEQUALITY"] == 'TOT']
bli = bli.pivot(index="Country", columns="Indicator", values="Value")


In [None]:
#Open given sheet
worksheet = gc.open('WEO_Data').sheet1

# Read all cell values from the sheet
WEO_rows = worksheet.get_all_values()

# Convert to a DataFrame and render.
import pandas as pd
weo  = pd.DataFrame.from_records(WEO_rows, columns = WEO_rows[0])

# Drop the header row from data
weo = weo.reindex(weo.index.drop(0))

# 1- Select only Country name and 2015 
# 2- then rename it to GDP Per capita
weo = weo[['Country','2015']].rename(columns={'2015':'GDP per capita'})

# Set Country as the index column
# inplace=True modifies weo directly instead of returning a new DataFrame
weo.set_index("Country", inplace=True)

#weo.drop_duplicates(inplace=True)
#Print top 5 rows
weo.head()

# Merge/Join dataset

In [None]:
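# Now merge the BLI and WEO datasets on their shared Country index
df = pd.merge(left = weo, right = bli, left_index=True, right_index=True)
# Sheet values arrive as strings; convert so the sort is numeric, not alphabetical
df["GDP per capita"] = pd.to_numeric(df["GDP per capita"])
df.sort_values(by="GDP per capita", inplace=True)
df.head()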


In [None]:
df.iloc[3]

## Split dataset into Train & Test

In [None]:
# Below is the most basic way of splitting data. It is for illustration only
test_indices = [0, 1, 6, 8, 33, 34, 35]
train_indices = list(set(range(36)) - set(test_indices))

train = df[["GDP per capita", 'Life satisfaction']].iloc[train_indices]
test = df[["GDP per capita", 'Life satisfaction']].iloc[test_indices]

In [None]:
train.head()
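
The hand-picked indices above are for illustration. A common alternative is scikit-learn's `train_test_split`, which holds out a random share of the rows. This is a minimal sketch on a synthetic stand-in for `df`; the `demo` names and the 20% test share are assumptions, not part of this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the merged DataFrame (36 rows, made-up values)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "GDP per capita": rng.uniform(5_000, 60_000, size=36),
    "Life satisfaction": rng.uniform(4.5, 7.5, size=36),
})

# Hold out 20% of the rows at random; random_state makes the split repeatable
demo_train, demo_test = train_test_split(demo, test_size=0.2, random_state=42)
print(len(demo_train), len(demo_test))
```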

In [None]:
# Code example
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn.linear_model

# Prepare the data
X = np.c_[train["GDP per capita"]]
y = np.c_[train["Life satisfaction"]]

In [None]:
type(y)

In [None]:
# Let's look at what is inside X and y: print the first 5 records
X[:5], y[:5]

# Note 

### So far everything was data processing in Python! Nothing advanced! 

### From this cell onwards, we will have Machine Learning!

## Define a model with default values

In [None]:
# Select a basic linear model without setting any parameters (nothing inside the parentheses below)
model = sklearn.linear_model.LinearRegression()

# See the model for yourself
model

## Start training the model using X and y

In [None]:
# Train the model
model.fit(X, y)
#model.coef_

## Do prediction on test data

In [None]:
# Make a prediction for Cyprus
X_new = [[17770]]  # Cyprus' GDP per capita
print(model.predict(X_new)) # outputs [[5.95199478]]
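
A fitted linear model predicts with `intercept + slope * x`. The sketch below checks this against `predict` on made-up points that lie exactly on the line y = 2x + 1; the `demo` names are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up points lying exactly on the line y = 2x + 1
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([[1.0], [3.0], [5.0], [7.0]])

demo_model = LinearRegression().fit(X_demo, y_demo)

# With a 2-D y, coef_ has shape (1, 1) and intercept_ has shape (1,)
slope = demo_model.coef_[0, 0]
intercept = demo_model.intercept_[0]

x_new = 4.0
manual = intercept + slope * x_new
predicted = demo_model.predict([[x_new]])[0, 0]
print(manual, predicted)
```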

In [None]:
# Make a prediction for our test data
pred = model.predict(test['GDP per capita'].values.reshape(-1,1))

In [None]:
pred

# Now, let's make it better!

Use the test dataset to predict life satisfaction, then compare the predictions with the actual values.

# Evaluate Model

In [None]:
# Let's create the train and test datasets, so we can use the train dataset to train the model
# and the test dataset to evaluate model performance
X_train = np.c_[train["GDP per capita"]]
y_train = np.c_[train["Life satisfaction"]]

X_test = np.c_[test["GDP per capita"]]
y_test = np.c_[test["Life satisfaction"]]

model = model.fit(X_train,y_train)

#Now apply the prediction on test dataset
y_pred_test = model.predict(X_test)

In [None]:
# See predictions for yourself
y_pred_test

In [None]:
from sklearn.metrics import mean_squared_error

#MSE
mean_squared_error(y_test, y_pred_test)


In [None]:
#RMSE
from math import sqrt
sqrt(mean_squared_error(y_test, y_pred_test))
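
RMSE is the square root of the mean squared difference between actual and predicted values. The following sketch computes it by hand and compares it with the scikit-learn route used above, on made-up numbers shaped like the `(n, 1)` arrays in this notebook.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up targets and predictions, shaped like the (n, 1) arrays used in this notebook
y_true_demo = np.array([[6.0], [7.0], [5.0]])
y_pred_demo = np.array([[5.5], [7.5], [5.0]])

# RMSE = sqrt(mean((y_true - y_pred)^2))
rmse_manual = np.sqrt(np.mean((y_true_demo - y_pred_demo) ** 2))
rmse_sklearn = np.sqrt(mean_squared_error(y_true_demo, y_pred_demo))
print(rmse_manual, rmse_sklearn)
```

Because of the square root, RMSE is expressed in the same units as the target, here life satisfaction points.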


# Question:

### What would you expect if we normalize data and train the model again?

# Now, normalize data before prediction

In [None]:
from sklearn.preprocessing import MinMaxScaler
# Define Scaling technique
scaler = MinMaxScaler()

In [None]:
# Fit the scaler on the training data (fit returns the scaler itself)
X_train_escaler = scaler.fit(X_train)

# Apply scaling model to the data
X_train_escaled = X_train_escaler.transform(X_train)
X_train_escaled[:5]

# Normalize Test Dataset

In [None]:
# Apply the scaler fitted on the training data to the test data
X_test_escaled = X_train_escaler.transform(X_test)
X_test_escaled
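
`MinMaxScaler` computes `(x - min) / (max - min)` using the minimum and maximum it saw during `fit`. Since the scaler is fitted on the training data only, test values outside the training range land outside [0, 1]. This is a small sketch with made-up values; the names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up values: the test set contains points outside the training range
X_tr = np.array([[10.0], [20.0], [30.0]])
X_te = np.array([[5.0], [25.0], [40.0]])

# The scaler learns min=10 and max=30 from the training data only
demo_scaler = MinMaxScaler().fit(X_tr)

# Same transformation by hand: (x - min) / (max - min)
manual_te = (X_te - X_tr.min(axis=0)) / (X_tr.max(axis=0) - X_tr.min(axis=0))
scaled_te = demo_scaler.transform(X_te)
print(scaled_te.ravel())
```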

# Train Models using Scaled Data

## Start training the model using X and y

In [None]:
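# Retrain the same model on the scaled training features
model.fit(X_train_escaled, y_train)

# Predictions on the training data (replaced by the test predictions in the next cell)
y_pred_escaled = model.predict(X_train_escaled)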


## Do prediction on test data

In [None]:
y_pred_escaled = model.predict(X_test_escaled)

In [None]:
#RMSE
from math import sqrt
sqrt(mean_squared_error(y_test, y_pred_escaled))


# What can you conclude by comparing the RMSE from the normalized and non-normalized datasets?
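
One way to explore this question independently of the OECD/IMF data is to repeat the experiment on synthetic data. The sketch below assumes a noisy linear relationship with made-up coefficients, fits `LinearRegression` once on raw features and once on min-max scaled features, and prints both test RMSE values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import MinMaxScaler

# Synthetic data: a noisy linear relationship (made-up numbers, not the OECD/IMF data)
rng = np.random.default_rng(0)
X_all = rng.uniform(5_000, 60_000, size=(40, 1))
y_all = 5.0 + 4e-5 * X_all + rng.normal(0, 0.3, size=(40, 1))
X_a, X_b = X_all[:30], X_all[30:]
y_a, y_b = y_all[:30], y_all[30:]

# Fit on the raw features and score on the held-out part
raw_model = LinearRegression().fit(X_a, y_a)
rmse_raw = np.sqrt(mean_squared_error(y_b, raw_model.predict(X_b)))

# Fit on min-max scaled features (scaler fitted on the training part only)
sc = MinMaxScaler().fit(X_a)
scaled_model = LinearRegression().fit(sc.transform(X_a), y_a)
rmse_scaled = np.sqrt(mean_squared_error(y_b, scaled_model.predict(sc.transform(X_b))))

print(rmse_raw, rmse_scaled)
```

Comparing the two printed values is one way to test your expectation before drawing a conclusion from the real data.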