<h1>Data Preprocessing and Machine Learning Tutorial using Jupyter Notebook</h1>

<h2>Introduction</h2>

<p>This tutorial shows how to get started with data preprocessing and a simple machine learning model in a Jupyter Notebook.</p>

<h2>Setting up</h2>
First, Python and Jupyter Notebook must be installed. Linux is recommended, but Windows works fine as well.

Install Python for Linux <a href="https://docs.aws.amazon.com/cli/latest/userguide/install-linux-python.html">tutorial</a>

Install Python for Windows <a href="https://www.ics.uci.edu/~pattis/common/handouts/pythoneclipsejava/python.html">tutorial</a>


Install Jupyter Notebook for Linux <a href="https://www.digitalocean.com/community/tutorials/how-to-set-up-jupyter-notebook-with-python-3-on-ubuntu-18-04">tutorial</a>

Install Jupyter Notebook for Windows <a href="http://www.calvin.edu/~sld33/InstallPython.html">tutorial</a>



<h2> What is Data Preprocessing?</h2>
Data preprocessing is a data mining technique that transforms raw data into an understandable format. Real-world data is often incomplete, inconsistent, lacking in certain behaviors or trends, and likely to contain errors. Data preprocessing is a proven method of resolving such issues. This definition comes from this <a href="https://hackernoon.com/what-steps-should-one-take-while-doing-data-preprocessing-502c993e1caa">hackernoon article</a>.


<h3>Steps in Data Preprocessing</h3>

Step 1 : Importing Libraries

Step 2 : Importing the Dataset

Step 3 : Checking for missing data

Step 4 : Splitting the data-set into Training and Test Set

Step 5 : Feature Scaling
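
Of the steps listed above, feature scaling (Step 5) is the easiest to show in isolation. The following is a minimal sketch, assuming scikit-learn's StandardScaler is an acceptable choice of scaler; the feature values and variable names are made up for illustration and do not come from the tutorial's dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature values on very different scales (illustrative only)
features = np.array([[1.0, 1000.0],
                     [2.0, 2000.0],
                     [3.0, 3000.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(features)  # each column now has mean 0 and unit variance

print(scaled.mean(axis=0))
print(scaled.std(axis=0))
```

In practice the scaler is usually fitted on the training set only and then applied to the test set, so that no information from the test data leaks into training.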


<h2>Importing Libraries</h2>

You must run these commands at least once to install the libraries so that they can be imported.

In [None]:
!pip install numpy
!pip install pandas
!pip install matplotlib
!pip install seaborn
!pip install plotly
!pip install scikit-learn

Now import the libraries so they can be used.

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly
import plotly.plotly as py  # in plotly 4 and later this module lives in the chart_studio package (chart_studio.plotly)
import plotly.graph_objs as go
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from scipy.stats import pearsonr
import warnings
warnings.filterwarnings('ignore')

<h3>NumPy</h3>
NumPy is the fundamental package for scientific computing with Python.

<h3>Pandas</h3>
Pandas is a library for data manipulation and analysis.

<h3>Matplotlib</h3>
Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hard copy formats and interactive environments across platforms.

<h3>Seaborn</h3>
Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

<h2>Importing the Dataset</h2>

In [5]:
dataset = pd.read_csv('Datasets/Emissions_By_Year.csv')

In [12]:
dataset.head(10)

Unnamed: 0.1,Unnamed: 0,1990,1991,1992,1993,1994,1995,1996,1997,1998,...,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015
0,EU,5716.363618,5614.246902,5442.687808,5346.861754,5327.973194,5381.359446,5494.518703,5398.601884,5360.346343,...,5338.3274,5293.764173,5179.481366,4803.585497,4909.517548,4758.664597,4693.239771,4598.845193,4423.738394,4451.812564
1,Austria,79.70019192,83.635426,77.013749,77.120123,77.699872,81.156933,84.51364,84.036589,83.385042,...,91.868496,89.270917,89.127913,82.161949,87.130046,84.888189,82.132485,82.146518,78.378903,81.000488
2,Belgium,148.7916296,147.997207,147.886025,146.881344,154.710942,157.26726,161.393677,153.143198,158.773249,...,146.277851,142.839348,143.154073,129.912567,136.642483,126.270029,123.113672,123.267066,118.149613,121.641894
3,Bulgaria,104.3729352,82.495723,78.279916,77.57032,73.785921,75.315561,75.17786,71.590812,67.475877,...,64.803832,68.69616,67.383784,58.203311,60.811327,66.130293,61.013377,55.812534,58.021721,62.02112
4,Croatia,31.65207719,24.464016,22.594299,22.846915,22.012017,22.556255,23.149193,24.401346,24.772157,...,29.870867,31.35111,30.205669,28.222112,27.625511,27.240935,25.453141,24.290347,23.418728,23.857803
5,Cyprus,6.361042968,6.995686,7.408277,7.578991,7.841882,7.85666,8.117842,8.23461,8.603574,...,10.428668,10.75302,10.966449,10.697872,10.417301,10.144035,9.607426,8.831174,9.206543,9.188771
6,Czech Republic,198.4770378,179.594358,172.885198,164.946525,157.169654,157.615607,159.593943,155.922967,149.936925,...,150.323537,151.796115,147.290178,138.825986,140.558693,138.820648,135.357597,131.421609,127.499071,128.820666
7,Denmark,72.10435727,82.659668,76.722383,79.022296,83.077968,80.123446,93.307672,83.812711,79.916882,...,76.661475,72.095532,68.452977,65.234394,65.642605,60.478962,55.65677,57.488708,53.508799,50.98362
8,Estonia,40.51040972,37.352561,27.290511,21.365639,22.062924,20.267321,20.966145,20.60983,19.00003,...,18.545405,22.32821,20.083086,16.777844,21.255803,21.28527,20.231245,21.943499,21.204451,18.114852
9,Finland,72.30693499,70.015021,68.468269,70.651314,76.194815,72.710021,78.564232,77.186085,73.539786,...,82.198104,80.867135,72.975161,69.021466,77.321492,69.679638,64.293888,65.161066,61.062802,57.538897


This is the dataset used in this tutorial. It records how much greenhouse gas each country in the European Union produces per year, in million tonnes.

<h2>Checking for missing data</h2>

In [14]:
dataset["1991"]

0     5614.246902
1       83.635426
2      147.997207
3       82.495723
4       24.464016
5        6.995686
6      179.594358
7       82.659668
8       37.352561
9       70.015021
10     581.872241
11    1215.674633
12     105.305850
13      87.572156
14      57.885570
15     526.768490
16      24.518539
17      50.470342
18      13.792129
19       2.634106
20     234.175737
21     456.719535
22      62.933917
23     203.748718
24      64.845640
25      17.313691
26     301.548394
27      72.849750
28     818.407796
29            NaN
30            NaN
Name: 1991, dtype: float64

Looking at the 1991 column of the dataset, we can see that there are two NaN values. Here they are due to a formatting error and not an error in the data. But there could be NaN, null or 0 values in your own data. These are known as dirty data, and it is best to remove them.

If you have a large dataset and wish to check the entire set, then running:

dataset.isnull().sum()

returns the number of null values in each column.
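
As a small illustration of what that call returns, here is a sketch on a made-up three-row frame (the values and the name `sample` are illustrative, not the tutorial's dataset):

```python
import numpy as np
import pandas as pd

# Small made-up frame with two missing values in column "1991"
sample = pd.DataFrame({
    "1990": [5716.4, 79.7, 148.8],
    "1991": [5614.2, np.nan, np.nan],
})

# One count of missing values per column
missing_per_column = sample.isnull().sum()
print(missing_per_column)
```

The result is a Series indexed by column name, so a single column's count can be read with `missing_per_column["1991"]`.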

In [6]:
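def clean_data(dataset):
    # Keep the first 29 rows (the EU total and the individual countries)
    dataset = dataset[0:29]
    # Use the country names as the index
    dataset = dataset.set_index(["Unnamed: 0"])
    # Treat ':' entries as missing values
    dataset = dataset.replace(':', np.nan)
    # Backfill: same as fillna(method='backfill'), which newer pandas deprecates
    dataset = dataset.bfill()
    return dataset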

dataset = clean_data(dataset)

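This is a simple function that keeps the first 29 rows (dropping the two trailing NaN rows), sets the country names as the index, and replaces any colon in the dataset with NaN. Backfilling then fills the remaining holes, giving each missing entry the next valid value in its column.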
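
To make the replace-then-backfill behaviour concrete, here is a minimal sketch on a made-up single-column frame. The data and the name `sample` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up column where ':' marks a missing value
sample = pd.DataFrame({"1990": [10.0, ":", 30.0]})

sample = sample.replace(":", np.nan)  # ':' becomes NaN
sample = sample.bfill()               # NaN takes the next valid value below it

print(sample["1990"].tolist())
```

Note that backfilling cannot fill a gap in the last row of a column, because there is no later value to copy from.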

In [16]:
dataset["1991"]

Unnamed: 0
EU                5614.246902
Austria             83.635426
Belgium            147.997207
Bulgaria            82.495723
Croatia             24.464016
Cyprus               6.995686
Czech Republic     179.594358
Denmark             82.659668
Estonia             37.352561
Finland             70.015021
France             581.872241
Germany           1215.674633
Greece             105.305850
Hungary             87.572156
Ireland             57.885570
Italy              526.768490
Latvia              24.518539
Lithuania           50.470342
Luxembourg          13.792129
Malta                2.634106
Netherlands        234.175737
Poland             456.719535
Portugal            62.933917
Romania            203.748718
Slovakia            64.845640
Slovenia            17.313691
Spain              301.548394
Sweden              72.849750
United Kingdom     818.407796
Name: 1991, dtype: float64

Running the same code again, we can see that the NaN values are gone.

<h2>Graphing the dataset</h2>
For this graph we will gather a list of the dataset's columns, then slice the row for the EU out of the dataset and store its values in a list.

In [7]:
cols = list(dataset.columns)
dataset = list(dataset.loc["EU"])

Using this data and the list of columns, we can construct our line graph through Plotly.

Creating an account at https://plot.ly/ gives you access to their graphing library, which is a handy tool for graphing your dataset.

In [8]:
plotly.tools.set_credentials_file(username='Fakken', api_key='YOUR_API_KEY')

You can create an API key in your account settings; use your own username and key in place of the values shown here. This line saves your credentials locally so that the Plotly tools can authenticate with your account.

In [10]:
trace1 = go.Scatter(x = cols,
                    y = dataset,
                    mode='lines',
                    name = 'EU')

data = [trace1]
layout = go.Layout(title ='Yearly EU Emissions since 1990', 
                   yaxis=dict(title="Greenhouse Gas Emissions (Million Tonnes)"))
fig = dict(data=data, layout=layout)
py.iplot(fig, validate=True)

High five! You successfully sent some data to your account on plotly. View your plot in your browser at https://plot.ly/~Fakken/0 or inside your plot.ly account where it is named 'plot from API'
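
If you would rather not create a Plotly account, a similar line graph can be drawn locally with Matplotlib, which was imported earlier. This is only a sketch of an alternative, not the approach the tutorial itself uses; it plots the first four EU values from the table shown above, and the axis labels are assumptions.

```python
import matplotlib.pyplot as plt

# First four EU values from the table shown earlier in this tutorial
years = ["1990", "1991", "1992", "1993"]
eu_emissions = [5716.363618, 5614.246902, 5442.687808, 5346.861754]

fig, ax = plt.subplots()
ax.plot(years, eu_emissions, label="EU")
ax.set_title("Yearly EU Emissions since 1990")
ax.set_xlabel("Year")
ax.set_ylabel("Emissions (million tonnes)")
ax.legend()
# In a notebook the figure is drawn when the cell finishes running
```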


<h2>Splitting the data-set into Training and Test Set for Machine Learning</h2>
In any machine learning model we split the dataset into two different sets: the training dataset and the test dataset.

Using a linear regression model, we will attempt to predict future values of the EU's yearly emissions.

In [11]:
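def clean_data(dataset):
    # Keep the first 29 rows (the EU total and the individual countries)
    dataset = dataset[0:29]
    # Use the country names as the index
    dataset = dataset.set_index(["Unnamed: 0"])
    # Treat ':' entries as missing values
    dataset = dataset.replace(':', np.nan)
    # Backfill: same as fillna(method='backfill'), which newer pandas deprecates
    dataset = dataset.bfill()
    return dataset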

emissions_data = pd.read_csv('Datasets/Emissions_By_Year.csv')
emissions_data = clean_data(emissions_data)
renewable_data = pd.read_csv('Datasets/Renewable_Consumption-By_Country.csv')
renewable_data = clean_data(renewable_data)

To create a linear regression model we need another dataset to train it against; for this example we will use the renewable energy consumption of each EU country.

After importing and cleaning the datasets, we need to take a subsection of each so that the years match up. This is done by slicing the list of columns and then creating a new DataFrame from the sliced columns.

In [12]:
emissions_cols = list(emissions_data.columns)[17:]
renewable_cols = list(renewable_data.columns)[:-1]
emissions_data = emissions_data[emissions_cols]
renewable_data = renewable_data[renewable_cols]
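
The slice positions depend on which years each file covers, so it is worth checking that the two column lists really match after slicing. Here is a minimal sketch of the same positional slicing on two made-up frames (the years, values and frame names are assumptions for illustration):

```python
import pandas as pd

# Made-up frames: emissions covers 2003-2006, renewables covers 2005-2007
emissions = pd.DataFrame([[1.0, 2.0, 3.0, 4.0]], index=["EU"],
                         columns=["2003", "2004", "2005", "2006"])
renewables = pd.DataFrame([[5.0, 6.0, 7.0]], index=["EU"],
                          columns=["2005", "2006", "2007"])

# Slice the column lists by position, as in the cell above
emissions_cols = list(emissions.columns)[2:]    # drop the first two years
renewable_cols = list(renewables.columns)[:-1]  # drop the last year

emissions = emissions[emissions_cols]
renewables = renewables[renewable_cols]

# Both frames should now cover the same years
print(emissions_cols == renewable_cols)
```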

After creating the new DataFrames we need to turn them into arrays and reshape them so the regression model can use them. Selecting the values of a row returns a NumPy ndarray, which we then reshape into a 2D array.

In [13]:
emis_list = emissions_data.loc["EU"].values
emis_list = emis_list.reshape(-1, 1)
rene_list = renewable_data.loc["EU"].values
rene_list = rene_list.astype(float)
rene_list = rene_list.reshape(-1, 1)
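
The effect of reshape(-1, 1) is easiest to see from the array shapes. In this small sketch with made-up values, the -1 tells NumPy to infer the number of rows, and the 1 asks for a single column, which is the (n_samples, n_features) layout scikit-learn expects for its inputs.

```python
import numpy as np

values = np.array([5716.4, 5614.2, 5442.7])  # 1D array, shape (3,)
column = values.reshape(-1, 1)               # 2D array, shape (3, 1)

print(values.shape, column.shape)
```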

Following this we split our data into training and testing sets. We set test_size to 0.40 because we have so little data: anything less would leave only two or three test points.

In [14]:
X_train, X_test, y_train, y_test = train_test_split(rene_list, emis_list, test_size=0.40)
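
To see how test_size translates into set sizes, here is a small sketch on ten made-up samples. The random_state argument is an addition that is not used in the cell above; it only makes the shuffle repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples with one feature each
X = np.arange(10).reshape(-1, 1)
y = np.arange(10) * 2.0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.40, random_state=0)  # random_state makes the split repeatable

print(len(X_train), len(X_test))
```

Without random_state the rows are shuffled differently on every run, so the fitted model and the plot will vary slightly between runs.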

Now we select the model, fit it to the training data, and generate predictions for our y value, which in this case is yearly emissions.

In [15]:
model = linear_model.LinearRegression()
model.fit(X_train, y_train)
prediction = model.predict(X_test)
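
Once a model is fitted, its slope, intercept and score can be inspected directly. Here is a minimal sketch on made-up points that lie exactly on the line y = 3x + 2, so the expected result is known in advance; the data are illustrative and not the tutorial's emissions figures.

```python
import numpy as np
from sklearn import linear_model

# Made-up points lying exactly on y = 3x + 2
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 3.0 * X.ravel() + 2.0

model = linear_model.LinearRegression()
model.fit(X, y)

print(model.coef_, model.intercept_)  # slope and intercept of the fitted line
print(model.score(X, y))              # R^2 on the data used here
```

On real data, calling score on the test set gives a quick measure of how well the fitted line generalises.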

We can now take our prediction and the testing data and graph the result to observe the predicted course.

In [17]:
trace1 = go.Scatter(
                x = X_test.ravel(),  # flatten the (n, 1) arrays into 1D sequences for plotting
                y = y_test.ravel(),
                mode='markers',
                name = '{} - Test Values'.format("EU"))

trace2 = go.Scatter(
                x = X_test.ravel(),
                y = prediction.ravel(),
                mode='lines',
                name = '{} - Predictions'.format("EU"))

data = [trace1, trace2]
layout = go.Layout(title ='Predicted future Emissions - EU', 
                   xaxis=dict(title="Renewable Energy Consumption"),
                   yaxis=dict(title="Greenhouse Gas Emissions (Million Tonnes)"))

fig = dict(data=data, layout=layout)
py.iplot(fig, validate=True)