## Overview of the Scikit-Learn Python Library ##





<kbd>
<center><img src="machine learning examples.png" alt="drawing" width="800"/><Center>

<center> ref : https://wordstream-files-prod.s3.amazonaws.com/s3fs-public/machine-learning.png <center>
    </kbd>

### What is Scikit-learn? <sup>[1][2][3]</sup>

Scikit-learn is a library for machine learning in Python. It started as a project in 2007 and made its first public release in 2010. An international community of experts has been leading its development. According to its website, it is used by various international companies, from Spotify to Booking.com, right down to me, a data analyst student, for my model project.

Scikit-learn is built on NumPy, SciPy and matplotlib, and is often used alongside pandas. These are the tools for data analysis and data mining, where:

 - NumPy is used for array vectorization
 - pandas for accessing dataframes
 - matplotlib for plotting
 - SciPy for scientific computing


Scikit-learn provides simple and efficient tools for predictive data analysis.

To start at the very beginning, a learning problem considers a set of n samples of data and then tries to predict properties of unknown data. Each of the n samples can have several attributes or features, i.e. the data is multivariate.

Machine learning evaluates an algorithm by splitting the data set into two :

- training set - properties are learned
- testing set - learned properties are tested 


The scikit-learn library focuses on modelling data from learning problems, which fall into the following categories:

#### **1. Supervised learning - used in the majority of machine learning**

This is where you have labelled input (x) and output (y) variables and you use an algorithm to learn the mapping function from the input to the output.

The main aim is to approximate this mapping function so well that, when you have new input data, you can predict the output variables for that data. It is called supervised because the process of an algorithm learning from the training dataset can be thought of as a teacher supervising the learning process: the correct answers are known, so the algorithm can be corrected whenever its predictions are wrong.

Supervised learning can be further grouped into :

- Classification - the output variable (y) is a category
- Regression - the output variable (y) is a real value 
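
As a rough illustration of the two groups, the sketch below fits one classifier and one regressor through scikit-learn's common `fit`/`predict` interface. The toy arrays and variable names are made up for this example and are not taken from the happiness dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

# Toy input: a single feature with four samples (hypothetical values).
toy_X = np.array([[1.0], [2.0], [8.0], [9.0]])

# Classification: the output variable is a category.
toy_category = np.array(["low", "low", "high", "high"])
classifier = KNeighborsClassifier(n_neighbors=1).fit(toy_X, toy_category)
predicted_category = classifier.predict(np.array([[1.5], [8.5]]))

# Regression: the output variable is a real value (here y = 2x + 1 exactly).
toy_value = 2.0 * toy_X[:, 0] + 1.0
regressor = LinearRegression().fit(toy_X, toy_value)
predicted_value = regressor.predict(np.array([[5.0]]))

print(predicted_category)
print(predicted_value)
```

Both estimators are trained on labelled pairs of inputs and outputs; only the type of the output differs.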

##### **Popular applications:**

 - Predictive analytics (house prices, stock exchange prices, etc.) 
 - Text recognition 
 - Spam detection 
 - Customer sentiment analysis 
 - Object detection (e.g. face detection)

#### **2. Unsupervised learning**

Here you only have unlabelled input data (x) and no corresponding output variables.

The aim of this type of learning is to model the underlying structure or distribution of the data in order to learn more about it. It is called unsupervised because, unlike the supervised learning above, there are no correct answers and no teacher monitoring the process. Algorithms discover and learn the inherent structure of the input data with no guidance.

Unsupervised learning can be further grouped into :

- Clustering - a problem where you want to discover the inherent groupings in the data, e.g. grouping by purchasing power
- Association - a problem where you want to discover the rules that describe large portions of your data, such as people that buy X also tend to buy Y

##### **Popular applications:**

 - Anomaly detection
 - Customer behaviour prediction
 - Noise removal from the dataset

#### **Other modelling examples:**

Cross Validation − It is used to check the accuracy of supervised models on unseen data.

Dimensionality Reduction − It is used for reducing the number of attributes in data which can be further used for summarisation, visualisation and feature selection.

Ensemble methods − As the name suggests, it is used for combining the predictions of multiple supervised models.

Feature extraction − It is used to extract the features from data to define the attributes in image and text data.

Feature selection − It is used to identify useful attributes to create supervised models.
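
As one example from this list, cross-validation can be sketched with scikit-learn's `cross_val_score`. The synthetic data and the choice of `LinearRegression` below are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data (an assumption for illustration): y depends linearly on two features plus small noise.
cv_rng = np.random.default_rng(0)
cv_X = cv_rng.normal(size=(100, 2))
cv_y = 3.0 * cv_X[:, 0] - 2.0 * cv_X[:, 1] + cv_rng.normal(scale=0.1, size=100)

# 5-fold cross-validation: the model is fitted five times, each time scored on the fold it did not see.
cv_scores = cross_val_score(LinearRegression(), cv_X, cv_y, cv=5)

print(cv_scores)         # one R^2 score per fold
print(cv_scores.mean())  # average score on the unseen folds
```

Averaging the per-fold scores gives a steadier estimate of performance on unseen data than a single train and test split.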


## Dataset : The World Happiness Report <sup>[4][5][6]</sup>

The 2021 World Happiness Report (the 9th edition) is based on a wide variety of data, with the most important source being the Gallup World Poll. 
The life evaluations from this poll provide the basis for the annual happiness rankings. The report acknowledges that COVID-19 has posed unique problems in data collection. 
        
        
The purpose of the World Happiness Report 2021 was to focus on the effects of COVID-19 and how people all over the world have fared. 

  - The pandemic's worst effect has been the 2 million deaths from COVID-19 in 2020. A rise of nearly 4% in the annual number of deaths worldwide represents a serious social welfare loss.
  - For the living there has been greater economic insecurity, anxiety, disruption of every aspect of life, and, for many people, stress and challenges to mental and physical health.

### Variables : <sup>[6]</sup>

NB: We will only be concentrating on the following 7 for the purposes of this analysis:

**Ladder score:** also called the **Happiness score**. It is the national average response to the life evaluation question: respondents imagine a ladder with steps numbered 0 to 10, where 0 represents the worst possible life and 10 the best possible life.

**GDP per capita:** gross domestic product per person, i.e. the monetary value of all finished goods and services made within a country during a specific time period, divided by its population. The dataset stores it on a log scale as `Logged GDP per capita`.

**Healthy Life Expectancy:** based on data from the WHO Global Health Observatory data repository.

**Social Support:** whether a person has family or friends they can count on in times of trouble. It is measured as binary responses (0 for no, 1 for yes).

**Freedom to make life choices:** the national average of responses to the GWP (Gallup World Poll) question "Are you satisfied or not with your freedom to choose what you do with your life?"

**Generosity:** the residual of regressing the national average of responses to the GWP question "Have you donated money to charity in the last month?" on GDP per capita.

**Corruption Perception:** the national average of the survey responses to two questions: is corruption widespread throughout the government, and is it widespread within businesses? The overall perception is measured as the average of the 0 or 1 responses.

## SETUP ##

## Step 1: Importing Packages ##

In [None]:
# Numerical arrays.
import numpy as np

# Data frames.
import pandas as pd

# Plotting.
import matplotlib.pyplot as plt

# Linear models (linear and logistic regression).
import sklearn.linear_model as lm

# K nearest neighbours.
import sklearn.neighbors as nei

# Helper functions.
import sklearn.model_selection as mod

# Fancier, statistical plots.
import seaborn as sns

from sklearn.preprocessing import StandardScaler, normalize
from sklearn.cluster import KMeans
import plotly.express as px
import plotly.graph_objects as go
from chart_studio.plotly import plot, iplot
from plotly.offline import iplot

from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_squared_error
from math import sqrt

import sklearn.datasets as datasets
import sklearn.preprocessing as preprocessing
import sklearn.model_selection as model_selection
import sklearn.metrics as metrics
import sklearn.linear_model as linear_model
from sklearn.metrics import mean_squared_error
from sklearn.metrics import max_error


# import statistic library
from scipy import stats
import statsmodels.api as sm


## Step 2. Read in the Dataset <sup>[7]</sup>

In [None]:
# Load in the datasets Using pandas  - then check if the data works ok with the scikit learn
import warnings
warnings.filterwarnings('ignore') 

happiness = pd.read_csv(r'C:\Users\User\Desktop\repo\Machine-learning-and-Statistics\scikit learn\Happiness 2021 2.csv')
happiness

In [None]:
happiness.describe()

The mean Ladder score (happiness score) is 5.53, which is in the middle of the 0 to 10 range.

In [None]:
#Printing information on the dataset 
happiness.info()

In [None]:
happiness.shape
# 20 columns with 149 entries

#### Show a world overview of the ladder score using plotly and a choropleth map <sup>[8][9]</sup>

Plotly can be used to create an interactive map. 

In [None]:
import plotly.express as px
fig = px.choropleth(data_frame=happiness,
                    locations="Country name",
                    locationmode="country names",
                    color="Ladder score",
                    title="Happiness score per Country")
fig.show()

## Step 3 :PERFORMING EXPLORATORY ANALYSIS 

In [None]:
### Selecting which features to keep or drop ###
# There are 20 columns and we do not need all of them.
# Use drop() to keep only the variables we want to visualise in the heatmap.
happiness = happiness.drop(columns=happiness.columns[12:])
happiness = happiness.drop(columns=['upperwhisker', 'lowerwhisker', 'Standard error of ladder score'])

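#### **1. Correlations** <sup>[10]</sup>

In [None]:
# Correlation matrix of the numeric columns, shown as an annotated heatmap.
# numeric_only=True skips the text columns (country name and region), which recent pandas versions refuse to correlate.
corr = happiness.corr(numeric_only=True)
f, ax = plt.subplots(figsize=(15, 10))
sns.heatmap(corr, annot=True, fmt='.2f', ax=ax);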

In [None]:
# Look at the correlation of each numeric column with the ladder score
correlation_matrix = happiness.corr(numeric_only=True)
correlation_matrix["Ladder score"]

#### **Heatmap Correlation conclusions:**

Using the correlation information from the heatmap above, we can decide what the target variables can be. I think the region is a better candidate than the country, as each country appears only once. The other variables of interest are ladder score, social support, healthy life expectancy, logged GDP per capita and freedom to make life choices.

In [None]:
#Looking at the data again after dropping columns
happiness.info()

In [None]:
# Use a filter to locate the information on Ireland's happiness
happiness[happiness['Country name']== 'Ireland']

In [None]:
import matplotlib.pyplot as plt
happiness["Ladder score"].hist(bins=20)
plt.show()

The histogram shows ladder scores spread between about 3 and 8, with a dip around 5.

## Step 4: PERFORM DATA VISUALIZATION 

1. Pairplot visualization <sup>[11]</sup>

A pairplot visualizes the pairwise relationships between variables, which can be continuous or categorical. Using the correlation matrix above, we determined which variables are closely related to the ladder score (happiness). Using the seaborn library, we created a pairplot of them.

In the pairplot below, there appears to be a linear relationship between the ladder score and logged GDP per capita, healthy life expectancy and social support.

In [None]:

# Plot the pair plot. sns.pairplot creates its own figure, so a separate plt.figure() call is not needed.
pairplot_columns = ['Ladder score', 'Logged GDP per capita', 'Social support', 'Healthy life expectancy',
                    'Freedom to make life choices', 'Generosity', 'Perceptions of corruption']
sns.pairplot(happiness[pairplot_columns])
             

2. Distplot visualization <sup>[12]</sup>

We used a distplot to visualize the distribution of each variable.

Conclusions:

- Ladder score - the distribution looks somewhat bimodal, with peaks around 6 and around 4.5
- Logged GDP per capita - bimodal distribution, with the majority of the data in the range 9 to 12
- Social support - a clear peak in the high range, between 0.8 and 1.0
- Healthy life expectancy - the peak is clearly in the 70s
- Freedom to make life choices - bimodal distribution
- Generosity - bimodal distribution
- Perceptions of corruption - a normal distribution


In [None]:
# Plot the distribution of each column of interest.
# sns.distplot is deprecated in recent seaborn releases; sns.histplot with kde=True draws the same
# combination of a histogram and a kernel density estimate. A for loop creates one subplot per column.

columns = ['Ladder score','Logged GDP per capita','Social support','Healthy life expectancy','Freedom to make life choices','Generosity','Perceptions of corruption']
plt.figure(figsize = (20, 50))
for i, column in enumerate(columns):
  plt.subplot(8, 2, i+1)
  sns.histplot(happiness[column], kde = True, color = 'r')
  plt.title(column)

plt.tight_layout()

3. Scatter visualization <sup>[13]</sup>

The plotly package makes the plot interactive, so you can hover over a point to relate the country to its score and healthy life expectancy.

In [None]:
# Plot the relationship between score, Healthy life expectancy and country
fig = px.scatter(happiness, x = 'Healthy life expectancy', y = 'Ladder score', color = 'Country name', trendline = "ols", hover_name = "Country name")
fig.update_layout ( title_text = 'Ladder score vs Healthy life expectancy')

fig.show()

## SciKit-Learn Algorithms

----------------------------------------------------------------------------------------------------------------------------------------------------

## SciKit-Learn Algorithm - K-Means Clustering <sup>[14][15]</sup>


An example of an unsupervised algorithm. It groups data points with similar attribute values together (clustering) by measuring the Euclidean distance (the straight-line distance between two points) between them.


Aims : 

- To train an unsupervised machine learning algorithm known as k-means clustering to cluster countries based on features such as economic production, social support, life expectancy, freedom, absence of corruption and generosity

How it works : 

1. Choose the number of clusters "k"
2. Select k random points to be the initial centroids, one for each cluster
3. Assign each data point to the nearest centroid, which creates "k" clusters
4. Calculate a new centroid for each cluster
5. Reassign each data point to the new closest centroid
6. Go to step 4 and repeat until the assignments stop changing
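
The loop described above can be sketched in a few lines of NumPy. This is a simplified illustration written for this notebook, not scikit-learn's own implementation; the function name `kmeans_sketch` and the toy points are hypothetical.

```python
import numpy as np

def kmeans_sketch(points, k, n_iterations=100, seed=0):
    """Plain k-means on an array of shape (n_samples, n_features)."""
    rng = np.random.default_rng(seed)
    # Pick k distinct data points at random as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iterations):
        # Assign every point to its nearest centroid (Euclidean distance).
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Recompute each centroid as the mean of the points assigned to it
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated groups of made-up points.
demo_points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                        [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
demo_labels, demo_centroids = kmeans_sketch(demo_points, k=2)
print(demo_labels)
print(demo_centroids)
```

The first three points end up in one cluster and the last three in the other; which cluster is numbered 0 depends on the random starting centroids.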




#### **Step 1: Data preparation for feeding into the clustering model** <sup>[16]</sup>

In [None]:
# To create clusters without the use of Ladder score to see which countries fall under similar clusters
# Removing the following columns from the dataset - Country name, regional indicator and Ladder score  so that all we have is table of inputs with no relation to a target class or score

df_happiness= happiness.drop(columns = ['Country name', 'Regional indicator', 'Ladder score'])
df_happiness

#### **Step 2: Scale the data** <sup>[17]</sup>

In [None]:
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df_happiness)

#### **Step 3: Finding the correct no. of clusters** <sup>[18]</sup>

In [None]:
# Find the number of clusters with the elbow method, so called because the line chart looks like a bent elbow.
# It uses the Within-Cluster Sum of Squares (WCSS): the sum of squared distances between each point and its centroid, for different values of k.

WCSS = []
range_values = range(1, 20)  # try k = 1 to 19

# Fit k-means for each value of k and record its WCSS (exposed by scikit-learn as inertia_).
for i in range_values:
    kmeans = KMeans(n_clusters=i)  # KMeans imported from sklearn.cluster
    kmeans.fit(scaled_data)
    WCSS.append(kmeans.inertia_)

# Plot WCSS against k itself so the x-axis starts at 1 rather than at index 0.
plt.plot(range_values, WCSS, 'bx-')
plt.title('Finding right number of clusters')
plt.xlabel('Clusters')
plt.ylabel('WCSS')
plt.show()

#### **Step 4: Applying the K-Means Method** <sup>[19]</sup>

In [None]:
# The number of clusters chosen from the elbow plot is 3.

kmeans = KMeans(n_clusters=3)
kmeans.fit(scaled_data)

In [None]:
labels = kmeans.labels_

In [None]:
kmeans.cluster_centers_.shape

In [None]:
# Put the cluster centres in a DataFrame. These values are still on the standardized scale.

cluster_centers = pd.DataFrame(data = kmeans.cluster_centers_, columns = df_happiness.columns)
cluster_centers      

In [None]:
# To understand what these numbers mean, apply the inverse transformation to scale them back to the original units.
cluster_centers = scaler.inverse_transform(kmeans.cluster_centers_)
cluster_centers = pd.DataFrame(data = cluster_centers, columns = df_happiness.columns)
cluster_centers


Cluster 0: countries that have GDP in the range of 0.6 to 1.4 and have high social support. These countries have medium life expectancy and high freedom to make life choices. They have low generosity and a high perception of corruption.

Cluster 1: countries that have very high GDP, high social support and high life expectancy. These countries have high freedom to make life choices, medium generosity and a low perception of corruption.

Cluster 2: countries that have low GDP, average life expectancy and average social support. These countries have low freedom to make life choices, high generosity and a medium perception of corruption.

In [None]:
labels.shape # Labels associated to each data point

In [None]:
# Apply the scaled inputs to the trained k-means model to determine which cluster each country belongs to.
# predict() reuses the fitted centroids; fit_predict() would refit the model and could renumber the clusters.

y_kmeans = kmeans.predict(scaled_data)
y_kmeans

In [None]:
# concatenate the clusters labels to our original dataframe
happy_df_cluster = pd.concat([happiness, pd.DataFrame({'cluster':labels})], axis = 1)
happy_df_cluster

#### **Visualisation of the clusters in our happiness dataset**

In [None]:

# Plot the histogram of various clusters - showing the different distributions 
for i in df_happiness.columns:
  plt.figure(figsize = (35, 10))
  for j in range(3):
    plt.subplot(1,3,j+1)
    cluster = happy_df_cluster[happy_df_cluster['cluster'] == j]
    cluster[i].hist(bins = 20)
    plt.title('{}    \nCluster {} '.format(i, j))
  
  plt.show()

In [None]:
# Plot the relationship between cluster and score 

fig = px.scatter(happy_df_cluster, x = 'cluster', y = "Ladder score",
            color = "Country name", hover_name = "Regional indicator")
          

fig.update_layout(
    title_text = 'Happiness Score vs Cluster'
)
fig.show()

In [None]:
# Plot the relationship between cluster and GDP

fig = px.scatter(happy_df_cluster, x='cluster', y='Logged GDP per capita',
            color = "Country name", hover_name = "Regional indicator")
       

fig.update_layout(
    title_text='GDP vs Clusters'
)
fig.show()

In [None]:
# Visualizing the clusters with respect to economy, corruption, life expectancy and their happiness scores

from bubbly.bubbly import bubbleplot

figure = bubbleplot(dataset=happy_df_cluster, 
    x_column='Logged GDP per capita', y_column='Perceptions of corruption', bubble_column='Regional indicator',  
    color_column='cluster', z_column='Healthy life expectancy', size_column='Ladder score',
    x_title="Logged GDP per capita", y_title="Corruption", z_title="Life Expectancy",
    title='Clusters based Impact of Economy, Corruption and Life expectancy on Happiness Scores of Nations',
    colorbar_title='Cluster', marker_opacity=1, colorscale='Portland',
    scale_bubble=0.8, height=650)

# The plotly config key is spelled 'scrollZoom' (capital Z).
iplot(figure, config={'scrollZoom': True})

In [None]:
# Visualizing the clusters geographically
data = dict(type = 'choropleth', 
           locations = happy_df_cluster["Country name"],
           locationmode = 'country names',
           colorscale='RdYlGn',
           z = happy_df_cluster['cluster'], 
           text = happy_df_cluster["Regional indicator"],
           colorbar = {'title':'Clusters'})

layout = dict(title = 'Geographical Visualization of Clusters', 
              geo = dict(showframe = True, projection = {'type': 'azimuthal equal area'}))

choromap3 = go.Figure(data = [data], layout=layout)
iplot(choromap3)

## Data Preparation and Preprocessing: For KNN and Linear Regression Algorithms

## Test and Train Split 
Split the data into training and test sets in Python using scikit-learn’s built-in train_test_split():

The test_size parameter sets the share of observations that go into the test data. If you specify a test_size of 0.2, the test set will be 20 percent of the original data, leaving the other 80 percent as training data.

The random_state parameter lets you obtain the same results every time the code is run. train_test_split() makes a random split of the data, which is problematic for reproducing results, so it is common to set random_state to a fixed value.
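
The effect of both parameters can be checked on a small made-up array. The `demo_` names below are hypothetical and chosen so that they do not clash with the variables used for the happiness data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples with two features each, plus a numeric target.
demo_X = np.arange(20).reshape(10, 2)
demo_y = np.arange(10)

# test_size=0.2 holds back 20 percent of the rows (2 of 10) for testing.
demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=12345)
print(demo_X_train.shape, demo_X_test.shape)

# Repeating the call with the same random_state reproduces the same split.
_, _, _, demo_y_test_again = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=12345)
print(np.array_equal(demo_y_test, demo_y_test_again))
```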


---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [None]:
#creating 2 objects that now contain data :x and y . 

x = happiness.drop(columns =['Ladder score',"Country name","Regional indicator"])
y = happiness['Ladder score']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x,y, test_size=0.2, random_state=12345)


In [None]:
X_train.index


In [None]:
X_test.index.size

In [None]:
y_train.index

In [None]:
y_test.index.size

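### Scaling of the data: Normalization <sup>[20]</sup>

Scaling is a crucial part of the data preprocessing stage. The code below applies min-max normalization (MinMaxScaler), which rescales every feature to the range 0 to 1. Standardization (StandardScaler), which rescales each feature to zero mean and unit variance, is the usual alternative when the features are roughly Gaussian. KNN is a distance-based algorithm, so it is strongly affected by the range of the features and needs the data to be scaled before the model is fitted. Ordinary least squares linear regression does not strictly require scaling, but scaled features make the coefficients easier to compare.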
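
As a small illustration of what scaling does, the sketch below applies the two common formulas, min-max normalization and standardization, by hand with NumPy. The feature values are made up; scikit-learn's `MinMaxScaler` and `StandardScaler` apply the same formulas column by column.

```python
import numpy as np

# A made-up feature column (values are illustrative only).
feature = np.array([50.0, 60.0, 70.0, 80.0])

# Min-max normalization: rescale to the range 0 to 1.
min_max_scaled = (feature - feature.min()) / (feature.max() - feature.min())

# Standardization: subtract the mean and divide by the standard deviation.
standardized = (feature - feature.mean()) / feature.std()

print(min_max_scaled)  # smallest value maps to 0, largest to 1
print(standardized)    # mean 0, standard deviation 1
```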

In [None]:
# Scaling the data so that all features can contribute equally to the result. 

from sklearn.preprocessing import MinMaxScaler
sc=MinMaxScaler(feature_range=(0, 1))
X_train=sc.fit_transform(X_train)
X_test=sc.transform(X_test)

## SciKit-Learn Algorithm - KNN <sup>[21][22]</sup>

An example of a supervised algorithm, used for classification and regression:

 - Classification - learn how to classify any new observation - target variable is categorical
 - Regression - prediction made on numerous independent variables - target variable is numeric

#### **Aim:**

The goal is to develop a model that can predict the happiness score based purely on the other independent variables of the dataset, applying kNN to get predictions as close as possible to the real scores.

kNN uses data that already has the answers to train its algorithm. It takes a new data point and looks at the existing data points that neighbour it. For classification, the new data point is then categorised according to the majority of its neighbours; for regression, as here, its value is predicted as the average of its neighbours' values. The key is to determine the number of neighbours to look at, k.
We have two types of variables at the same time: a target variable (y) and independent variables (x).
The target variable is what we want to predict, and it depends on the independent variables.
kNN is a non-linear learning algorithm: it uses an approach other than a straight line to separate cases.

An advantage of kNN is that it is easy to interpret and to understand what is happening in the model, and it can be very quick to develop. It is good for cases that do not require highly complex techniques. However, there are techniques, such as bagging, that improve the kNN algorithm for projects that are highly complex.
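
A kNN prediction for a numeric target can be sketched by hand as follows. This is a simplified illustration, not scikit-learn's implementation; the function name `knn_regress` and the toy data are hypothetical.

```python
import numpy as np

def knn_regress(train_X, train_y, query, k):
    """Predict a numeric target as the mean of the k nearest training targets."""
    distances = np.linalg.norm(train_X - query, axis=1)  # Euclidean distance to every training point
    nearest = np.argsort(distances)[:k]                  # indices of the k closest points
    return train_y[nearest].mean()

# Made-up training data: one feature and a numeric target.
knn_demo_X = np.array([[1.0], [2.0], [3.0], [10.0]])
knn_demo_y = np.array([1.0, 2.0, 3.0, 10.0])

print(knn_regress(knn_demo_X, knn_demo_y, np.array([2.1]), k=1))  # uses only the closest point
print(knn_regress(knn_demo_X, knn_demo_y, np.array([2.1]), k=3))  # averages the three closest points
print(knn_regress(knn_demo_X, knn_demo_y, np.array([2.1]), k=4))  # the distant point at 10 pulls the average up
```

The last call shows why the choice of k matters: once k is large enough to include distant points, they start to distort the prediction.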


In [None]:
# To see what the happiness score ranges we expect , we visualise though a histogram, the range goes from below 3 up to 8
import matplotlib.pyplot as plt
happiness["Ladder score"].hist(bins=15)
plt.show()

### Splitting Data Into Training and Test Sets for Model Evaluation (see the data preparation section of this notebook)

- Training data - used to fit the model. For KNN, the training data will be used as neighbours
- Test data - used to evaluate the model. For KNN, make predictions of the ladder score for each country in the test data and compare them to the real data
 


### Evaluation of the Model Fit

The model needs to be evaluated, and the RMSE (root mean squared error) is a common way to do it. We also look at a range of k values between 1 and 20: with a single neighbour, a prediction can change strongly from one point to the next, perhaps because of an outlier, whereas averaging over several neighbours lessens this impact.


RMSE is computed as follows:

1. Compute the difference between each data point’s actual value and predicted value.
2. For each difference, take the square of this difference.
3. Sum all the squared differences and divide by the number of data points to get the mean.
4. Take the square root of this mean.
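
RMSE is the square root of the mean of the squared differences, which can be written out with the standard library alone. The function name `rmse_by_hand` and the scores below are made up for illustration; the notebook itself relies on scikit-learn's `mean_squared_error`.

```python
from math import sqrt

def rmse_by_hand(actual, predicted):
    """Root mean squared error of two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]  # difference, then square
    mean_of_squares = sum(squared_errors) / len(squared_errors)        # average the squares
    return sqrt(mean_of_squares)                                       # square root of the mean

# Made-up ladder scores and predictions.
actual_scores = [5.0, 6.0, 7.0]
predicted_scores = [5.5, 6.0, 6.5]
print(rmse_by_hand(actual_scores, predicted_scores))
```

Because the result is in the same units as the target, it can be read directly as a typical error in ladder score points.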


In [None]:
### Evaluation of the prediction error on the training data

rmse_val = []  # to store RMSE values for different k
for K in range(1, 21):
    knn_model = KNeighborsRegressor(n_neighbors=K)  # create an unfitted model
    knn_model.fit(X_train, y_train)
    train_preds = knn_model.predict(X_train)
    mse = mean_squared_error(y_train, train_preds)
    rmse = sqrt(mse)
    rmse_val.append(rmse)  # store RMSE values
    print('RMSE value for k= ' , K , 'is:', rmse)

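For a very low value of k (say k=1) the model can overfit the training data: every training point is its own nearest neighbour, so the training error is close to zero while the error on unseen data is higher.
For a high value of k the model can also perform badly, because it averages over neighbours that are too far away, so the error rates above need to be read with both effects in mind.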

In [None]:
# Plot the RMSE values against the k values (1 to 20)
curve = pd.DataFrame(rmse_val, index=range(1, 21), columns=['RMSE'])  # elbow curve
curve.plot()

Evaluate the predictive performance on the test set in the same way as before: 

In [None]:
rmse_val2 = []  # to store RMSE values for different k
for K in range(1, 21):
    knn_model = KNeighborsRegressor(n_neighbors=K)
    knn_model.fit(X_train, y_train)

    test_preds = knn_model.predict(X_test)
    mse = mean_squared_error(y_test, test_preds)
    rmse = sqrt(mse)
    rmse_val2.append(rmse)  # store RMSE values
    print('RMSE value for k= ' , K , 'is:', rmse)


In [None]:
# Plot the RMSE values against the k values (1 to 20)
curve = pd.DataFrame(rmse_val2, index=range(1, 21), columns=['RMSE'])  # elbow curve
curve.plot()

I have evaluated the error on data that was not yet known to the model. This more realistic RMSE is slightly higher than before. The RMSE measures the average error of the predicted ladder score, so you can interpret this as an average error of 0.4875 on the training data. Whether the increase from 0.4875 on the training data to 0.56266 on the test data is acceptable is case specific.

There is a relatively large difference between the RMSE on the training data and the RMSE on the test data. This means that the model suffers from overfitting on the training data: It does not generalize well.

We can see if we can overcome this by optimizing the prediction error or test error using various tuning methods.

In [None]:
### Visualisation of the Model Fit

## Plotting the Fit of Your Model

To understand what the model has learned, you can visualise how the predictions were made using a matplotlib scatter plot, with a colour map from the seaborn package, of two columns in X_test: social support and healthy life expectancy, which were closely correlated with the ladder score (happiness score). We use c to set the colour of each point and s to set its size.
Each point on the plot is a country in the test set, and its colour indicates the level of predicted happiness (ladder score). Social support is on the x axis and healthy life expectancy is on the y axis. The higher the life expectancy and social support, the deeper the colour of the point and therefore the higher the ladder score, which suggests the model has learned a sensible relationship.


In [None]:
#Scatter Graph 1 : looking at the predicted values ( Test_preds) , used as the color bar. 
#Looking at Social support and Healthy life expectancy variables and ladder score

import seaborn as sns
cmap = sns.cubehelix_palette(as_cmap=True)
f, ax = plt.subplots()
points = ax.scatter(X_test[:,1], X_test[:,2], c=test_preds, s=100, cmap=cmap)
f.colorbar(points)
plt.show()

In [None]:
# by changing the c value to y_test , we can check where the above trend observed exists in the actual values of the dataset. 

import seaborn as sns
cmap = sns.cubehelix_palette(as_cmap=True)
f, ax = plt.subplots()
points = ax.scatter(X_test[:,1], X_test[:, 2], c= y_test, s=100, cmap=cmap)
f.colorbar(points)
plt.show()

## Tune and Optimize kNN in Python Using scikit-learn

Can the predictive performance of kNN be improved?

From the earlier calculations, we determined that the best k value was 9.

Another way to determine the best value for k is to use GridSearchCV: 
    

In [None]:
from sklearn.model_selection import GridSearchCV

# Search k = 1 to 19 with cross-validation on the training set.
parameters = {"n_neighbors": range(1, 20)}
gridsearch = GridSearchCV(KNeighborsRegressor(), parameters)
gridsearch.fit(X_train, y_train)

In [None]:
gridsearch.best_params_

In [None]:
# Now we know what the best nearest neighbour is we can see how it affects the train and test analysis

train_preds_grid = gridsearch.predict(X_train)
train_mse = mean_squared_error(y_train, train_preds_grid)
train_rmse = sqrt(train_mse)
train_rmse

In [None]:
test_preds_grid = gridsearch.predict(X_test)
test_mse = mean_squared_error(y_test, test_preds_grid)
test_rmse = sqrt(test_mse)
test_rmse

### Adding Weighted Average of Neighbors Based on Distance
Below, you’ll test whether the performance of your model will be any better when predicting using a weighted average instead of a regular average. This means that neighbors that are further away will less strongly influence the prediction.

You can do this by setting the weights hyperparameter to the value of "distance". However, setting this weighted average could have an impact on the optimal value of k. Therefore, you’ll again use GridSearchCV to tell you which type of averaging you should use:
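
To make the difference between the two weighting schemes concrete, the sketch below compares a regular average with an inverse-distance weighted average for three made-up neighbours. With `weights="distance"`, scikit-learn weights each neighbour by the inverse of its distance; the numbers here are purely illustrative.

```python
import numpy as np

# Three neighbours of a query point (made-up distances and target values).
neighbour_distances = np.array([0.5, 1.0, 2.0])
neighbour_targets = np.array([6.0, 5.0, 3.0])

# "uniform": every neighbour counts equally.
uniform_prediction = neighbour_targets.mean()

# "distance": each neighbour is weighted by the inverse of its distance.
inverse_weights = 1.0 / neighbour_distances
distance_prediction = np.sum(inverse_weights * neighbour_targets) / np.sum(inverse_weights)

print(uniform_prediction)
print(distance_prediction)
```

The weighted prediction sits closer to the target of the nearest neighbour, because that neighbour carries the largest weight.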

In [None]:
parameters = {"n_neighbors": range(1, 20), "weights": ["uniform", "distance"]}
gridsearch = GridSearchCV(KNeighborsRegressor(), parameters)
gridsearch.fit(X_train, y_train)
print(gridsearch.best_params_)  # only the last expression in a cell is displayed, so print this one
test_preds_grid = gridsearch.predict(X_test)
test_mse = mean_squared_error(y_test, test_preds_grid)
test_rmse = sqrt(test_mse)
test_rmse

### Further Improving on kNN in scikit-learn With Bagging

As a third step for kNN tuning, you can use bagging. Bagging is an ensemble method, or a method that takes a relatively straightforward machine learning model and fits a large number of those models with slight variations in each fit. Bagging often uses decision trees, but kNN works perfectly as well.

Ensemble methods are often more performant than single models. One model can be wrong from time to time, but the average of a hundred models should be wrong less often. The errors of different individual models are likely to average each other out, and the resulting prediction will be less variable.
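
The idea behind bagging can be sketched by hand: fit the same simple model on many bootstrap samples and average the predictions. The base model, data and names below are assumptions made for illustration, and the loop is much simpler than scikit-learn's `BaggingRegressor`.

```python
import numpy as np

bag_rng = np.random.default_rng(0)

# Made-up noisy data: y is roughly 2x.
bag_X = np.linspace(0.0, 10.0, 50)
bag_y = 2.0 * bag_X + bag_rng.normal(scale=1.0, size=50)

def nearest_neighbour_predict(train_X, train_y, query):
    """A deliberately simple base model: return the target of the closest training point."""
    return train_y[np.abs(train_X - query).argmin()]

bag_query = 5.0
bag_predictions = []
for _ in range(100):
    # Bootstrap: draw a sample of the same size, with replacement.
    sample = bag_rng.integers(0, len(bag_X), size=len(bag_X))
    bag_predictions.append(nearest_neighbour_predict(bag_X[sample], bag_y[sample], bag_query))

# The bagged prediction is the average over all the fitted models.
bagged_prediction = np.mean(bag_predictions)
print(bagged_prediction)
```

Each bootstrap sample leaves out some points, so the individual models disagree slightly, and averaging them smooths out the noise of any single neighbour.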

You can use scikit-learn to apply bagging to your kNN regression using the following steps. First, create the KNeighborsRegressor with the best choices for k and weights that you got from GridSearchCV:

In [None]:
best_k = gridsearch.best_params_["n_neighbors"]
best_weights = gridsearch.best_params_["weights"]
bagged_knn = KNeighborsRegressor( n_neighbors=best_k, weights=best_weights)

In [None]:
from sklearn.ensemble import BaggingRegressor
bagging_model = BaggingRegressor(bagged_knn, n_estimators=100)

In [None]:
from sklearn.ensemble import BaggingRegressor
bagging_model = BaggingRegressor(bagged_knn, n_estimators=100).fit(X_train, y_train)
test_preds_grid = bagging_model.predict(X_test)
test_mse = mean_squared_error(y_test, test_preds_grid)
test_rmse = sqrt(test_mse)
test_rmse           

Conclusions : 
    
Predictive performance of the algorithm : 
    
    Model                              RMSE (prediction error)
    Arbitrary k                        0.5626661247145013
    GridSearchCV for k                 0.5735040119011083
    GridSearchCV for k and weights     0.5690768514853327
    Bagging and GridSearchCV           0.5706925692674394
    

Tuning and optimizing kNN made little difference to the test errors achieved: none of the tuned models beat the arbitrary choice of k.

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

## SciKit-Learn Algorithm - Linear Regression <sup>[23][24]</sup>

Linear regression is an example of a supervised learning algorithm. Supervised learning is used for both classification and regression:

Classification - learns how to classify any new observation. The target variable is categorical, and accuracy, precision and recall are used to measure the performance of the model.

Regression - predicts a continuous target variable (here, the ladder score) from several independent variables, using the coefficient of determination (R<sup>2</sup>), RMSE and MSE to measure the performance of the model. It identifies the equation that produces the smallest sum of squared differences between the observed values and their fitted values; these differences are called residuals.
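To make the regression metrics concrete, here is a minimal sketch that computes the residuals, MSE, RMSE and R<sup>2</sup> by hand and compares them with scikit-learn's own functions. The arrays `y_obs` and `y_fit` are made-up values for illustration, not data from the happiness dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical observed and fitted values, for illustration only.
y_obs = np.array([3.0, 5.0, 7.0, 9.0])
y_fit = np.array([2.5, 5.5, 7.0, 8.0])

residuals = y_obs - y_fit                      # observed minus fitted values
mse = np.mean(residuals ** 2)                  # mean squared error
rmse = np.sqrt(mse)                            # root mean squared error
ss_res = np.sum(residuals ** 2)                # residual sum of squares
ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot                       # coefficient of determination

print("MSE:", mse, "RMSE:", rmse, "R^2:", r2)
print("matches scikit-learn:",
      np.isclose(mse, mean_squared_error(y_obs, y_fit)),
      np.isclose(r2, r2_score(y_obs, y_fit)))
```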

The five basic steps : 

1. Import the packages and classes you need.
2. Provide data to work with and eventually do appropriate transformations.
3. Create a regression model and fit it with existing data.
4. Check the results of model fitting to know whether the model is satisfactory.
5. Apply the model for predictions.
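The five steps above can be sketched end to end on synthetic data. This is an illustration only: the data are generated from an assumed relationship (y = 2·x1 − x2 + 1 plus a little noise), and the names `X_demo`, `y_demo` and `demo_model` are chosen so they do not clash with the variables used for the happiness data.

```python
# Step 1: import the packages and classes you need.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Step 2: provide data (synthetic here) and split it 80:20 into train and test sets.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 2))
y_demo = 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + 1.0 + rng.normal(0, 0.1, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Step 3: create a regression model and fit it with the training data.
demo_model = LinearRegression().fit(X_tr, y_tr)

# Step 4: check the results of the fit.
print("coefficients:", demo_model.coef_, "intercept:", demo_model.intercept_)
print("R^2 on test data:", demo_model.score(X_te, y_te))

# Step 5: apply the model for predictions on new inputs.
new_points = np.array([[0.0, 0.0], [1.0, 1.0]])
print("predictions:", demo_model.predict(new_points))
```

Because the synthetic data are almost perfectly linear, the fitted coefficients should land close to the values used to generate them.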

In [None]:
# create lm as an instance of LinearRegression and fit it to the training data
# (X_train and y_train come from the earlier train/test split)

lm = linear_model.LinearRegression()
lm.fit(X_train, y_train)

#### Examining the coefficient of each variable to determine which variable carries the most weight<sup>[25]</sup>

In [None]:
print('Coefficients:\n Social support, Healthy life, Freedom, Generosity, Perceptions of corruption \n',lm.coef_)
print('Intercept:',lm.intercept_)

The variable that carries the least weight is perceptions of corruption, while all the others carry nearly the same weight.

In [None]:
y_train_pred = lm.predict(X_train)
y_test_pred = lm.predict(X_test)

In [None]:
# compare the real data and the predicted data for the first 10 entries
print('Real Data')
print(y_test[:10])
print('\n Predicted Data')
print(y_test_pred[:10])
print('\n Diff')
print(y_test[:10]-y_test_pred[:10])


In [None]:
#Visualisation of linearity
plt.scatter(y_test,y_test_pred)
plt.xlabel('Real data')
plt.ylabel('predicted data')
plt.title('Relationship between predicted and real data')
plt.show()

In [None]:
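# check the distribution of the residuals with a histogram and kernel density estimate
# (sns.distplot is deprecated; sns.histplot with kde=True is the suggested replacement)
sns.histplot(y_test - y_test_pred, kde=True)
plt.title('Residuals', size=18)
plt.show()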

In [None]:
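# Check the distribution of the residuals with two normality tests
residual = (y_test - y_test_pred)
sw = stats.shapiro(residual)
# Note: kstest with 'norm' compares against the standard normal N(0, 1)
ks = stats.kstest(residual, 'norm')

print('Shapiro-Wilk test ---- statistic: {}, p-value: {}'.format(sw[0],sw[1]))
print('Kolmogorov-Smirnov test ---- statistic: {}, p-value: {}'.format(ks[0],ks[1]))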

Both p-values are greater than 0.05, so we fail to reject the null hypothesis of normality: the residuals are consistent with a normal distribution. This does not prove normality, it only means the tests found no evidence against it.

In [None]:
# Evaluate regression model - R squared

# This value shows how well the regression function fits the data: the closer the value is to 1, the better
print('R^2 score:',lm.score(X_train, y_train))

In [None]:
# Evaluate regression model - RMSE

# RMSE is the standard deviation of the residuals (prediction errors): a measure of how far the data points are from the regression line.
# The lower the RMSE, the better the model is at making predictions.
# The square root is taken explicitly because the squared=False argument was deprecated in scikit-learn 1.4.
from math import sqrt

rmse_training = sqrt(mean_squared_error(y_true=y_train, y_pred=y_train_pred))
rmse_test = sqrt(mean_squared_error(y_true=y_test, y_pred=y_test_pred))

print('RMSE Training Data: {}'.format(rmse_training))
print('RMSE Test Data: {}'.format(rmse_test))

Conclusions : 

R^2 score: 0.772241705136514

An R<sup>2</sup> of about 0.77 means the model explains roughly 77% of the variance in the ladder score on the training data. The plot of predicted data against real data shows that the predictions follow the real values reasonably well.

In [None]:
# Compare performance between models

from math import sqrt

list_model = [['Ridge', linear_model.Ridge()],
              ['Lasso', linear_model.Lasso()],
              ['BayesianRidge', linear_model.BayesianRidge()]]
performance_result = {}

for model_name, regression_model in list_model:
    regression_model.fit(X_train, y_train)
    y_train_pred = regression_model.predict(X_train)
    y_test_pred = regression_model.predict(X_test)

    # RMSE on the training and test sets (square root of the MSE)
    rmse_training = sqrt(mean_squared_error(y_true=y_train, y_pred=y_train_pred))
    rmse_test = sqrt(mean_squared_error(y_true=y_test, y_pred=y_test_pred))

    # R^2 is computed on the training data
    r_score = regression_model.score(X_train, y_train)

    performance_result[model_name] = {'training': rmse_training, 'test': rmse_test, 'R^2 score': r_score}

performance_result

##### Various model Performance results <sup>[26][27][28]</sup>

Looking at three types of regression analysis:

 - Ridge - a method of estimating the coefficients of multiple-regression models in scenarios where the independent variables are highly correlated.
 - Lasso - Least Absolute Shrinkage and Selection Operator, which performs both variable selection and regularization in order to improve the prediction accuracy of the model, shrinking some coefficients to exactly zero where possible.
 - Bayesian Ridge - an approach to linear regression in which the statistical analysis is undertaken within the context of Bayesian inference.
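The difference between Ridge and Lasso described in the list above can be seen in a small sketch on synthetic data. The data, the names `X_sim`, `y_sim`, `ridge_demo` and `lasso_demo`, and the `alpha` values are all assumptions chosen for illustration, not settings taken from the happiness analysis.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X_sim = rng.normal(size=(200, 5))
# Assumption: only the first two features influence the target; the other three are noise.
y_sim = 3.0 * X_sim[:, 0] - 2.0 * X_sim[:, 1] + rng.normal(0, 0.1, size=200)

ridge_demo = Ridge(alpha=1.0).fit(X_sim, y_sim)   # shrinks coefficients towards zero
lasso_demo = Lasso(alpha=0.5).fit(X_sim, y_sim)   # can set coefficients exactly to zero

print("Ridge coefficients:", np.round(ridge_demo.coef_, 3))
print("Lasso coefficients:", np.round(lasso_demo.coef_, 3))
print("Lasso coefficients set to zero:", int(np.sum(lasso_demo.coef_ == 0)))
```

Lasso also shrinks the coefficients it keeps, and the strength of that shrinkage is controlled by `alpha`. A heavy penalty may be part of the reason an untuned Lasso can fit worse than Ridge, although that would need to be checked by tuning `alpha` on the actual data.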

| Model          | RMSE (training) | RMSE (test) | R<sup>2</sup> (training) |
|----------------|-----------------|-------------|--------------------------|
| Ridge          | 0.51            | 0.61        | 0.77                     |
| Lasso          | 1.05            | 1.13        | 0.58                     |
| Bayesian Ridge | 0.50            | 0.62        | 0.77                     |


### Conclusions : 

I took the World Happiness Report dataset and used the following machine learning models to see if I could predict the happiness score.

I examined the dataset and then visualised the data, trying to understand the relationships between the variables examined and the happiness score for each of the countries.

I split the data into training and testing sets using an 80:20 ratio.

From there I chose the following algorithm models:

 - K-Means Clustering - training this algorithm to classify the happiness score through the determined cluster groups, based on features such as economic production, social support, life expectancy, freedom, absence of corruption and generosity. This model was accurate in its classification.
 - KNN - using RMSE to evaluate the model predictions. The test RMSE was higher than the training RMSE, and the tuning methods made almost no difference.
 - Linear Regression - training this algorithm to see if there is a linear relationship between the scores and the other variables. There was, with an R<sup>2</sup> of about 0.77. The calculated error for the training and test data was also low, meaning the model is good at making predictions.

## References

1. https://scikit-learn.org/stable/index.html
2. https://www.tutorialspoint.com/scikit_learn/scikit_learn_modelling_process.htm
3. https://www.analyticsvidhya.com/blog/2020/04/supervised-learning-unsupervised-learning, by Alakh Sethi, April 6, 2020
4. https://worldhappiness.report/ed/2021/
5. https://happiness-report.s3.amazonaws.com/2021/DataForFigure2.1WHR2021C2.xls - link for dataset used
6. https://happiness-report.s3.amazonaws.com/2021/Appendix1WHR2021C2.pdf - Details on the variables examined in analysis
7. https://realpython.com/pandas-python-explore-dataset/, by Reka Horvath  Jan 06, 2020
8. https://plotly.com/python/choropleth-maps/
9. https://chart-studio.plotly.com/~dermi222/2/figure1-world-happiness-index-map-2016/#code
10. https://seaborn.pydata.org/generated/seaborn.heatmap.html
11. https://seaborn.pydata.org/generated/seaborn.pairplot.html
12. https://seaborn.pydata.org/generated/seaborn.displot.html
13. https://plotly.com/python/line-and-scatter/
14. https://en.wikipedia.org/wiki/K-means_clustering
15. https://www.coursera.org/projects/clustering-world-happiness-report
16. https://medium.com/@evgen.ryzhkov/5-stages-of-data-preprocessing-for-k-means-clustering-b755426f9932, by Evgeniy Ryzhkov, July 6, 2020
17. https://medium.com/analytics-vidhya/why-is-scaling-required-in-knn-and-k-means-8129e4d88ed7, by Pulkit Sharma, August 25, 2019
18. https://en.wikipedia.org/wiki/Elbow_method_(clustering)
19. https://www.geeksforgeeks.org/elbow-method-for-optimal-value-of-k-in-kmeans/, by Alind Gupta, 09 Feb, 2021
20. https://www.analyticsvidhya.com/blog/2020/04/feature-scaling-machine-learning-normalization-standardization/, by Aniruddha Bhandari, April 3, 2020
21. https://realpython.com/knn-python/, by Joos Korstanje, Apr 07, 2021
22. https://www.analyticsvidhya.com/blog/2018/08/k-nearest-neighbor-introduction-regression-python/, by Aishwarya Singh, August 22, 2018
23. https://www.kaggle.com/hafidzjnp/model-machine-learning-to-predict-happiness, by HAFIDZ NDP, August 2021
24. https://realpython.com/linear-regression-in-python/#simple-linear-regression-with-scikit-learn, by Mirko Stojiljković  Apr 15, 2019
25. https://statisticsbyjim.com/regression/interpret-r-squared-regression/, Jim Frost 
26. https://en.wikipedia.org/wiki/Ridge_regression
27. https://en.wikipedia.org/wiki/Lasso_(statistics)#:~:text=In%20statistics%20and%20machine%20learning,of%20the%20resulting%20statistical%20model.
28. https://en.wikipedia.org/wiki/Bayesian_inference