<h1>European Energy Insight</h1>

The primary goal of this project is to provide insight into the current usage of renewable energy within the EU and its member states, as well as recent trends in greenhouse gas emissions.<br>
In addition, we will investigate the potentially correlative relationship between a country's or region's usage of renewable energy and several factors within that country's or region's society.<br><br>

In [1]:
import pandas as pd
import numpy as np
import plotly
import plotly.plotly as py
import plotly.graph_objs as go
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from scipy.stats import pearsonr

plotly.tools.set_credentials_file(username='VMunt12', api_key='oxHWpnzEydNC1JoTXNqR')

<h2>Section One : Oversight</h2><br>
In this section we will investigate both the current usage of renewable energies and the past trends of greenhouse gas emissions.<br>
First we will look at the current usage of renewable energies across the EU as a whole, followed by a breakdown of usage by individual member state.

In [2]:
hydro_data = pd.read_csv('Datasets/Hydro_Consumption-By_Country.csv')
solar_data = pd.read_csv('Datasets/Solar_Consumption-By_Country.csv')
thermal_data = pd.read_csv('Datasets/Thermal_Consumption-By_Country.csv')
wind_data = pd.read_csv('Datasets/Wind_Consumption-By_Country.csv')

After loading the required datasets, we clean them by slicing out the rows we need, re-indexing so that the country/region is the index, replacing the ':' placeholder used for missing values with NaN, and backfilling any missing values that remain.

In [3]:
def clean_data(dataset):
    dataset = dataset[1:29]
    dataset = dataset.set_index(["Unnamed: 0"])
    dataset = dataset.replace(':', np.nan)
    dataset = dataset.fillna(method='backfill')
    return dataset

hydro_data = clean_data(hydro_data)
solar_data = clean_data(solar_data)
thermal_data = clean_data(thermal_data)
wind_data = clean_data(wind_data)
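
To make the cleaning steps concrete, here is a minimal sketch on a made-up table. The row names, years and figures are hypothetical, and `pd.to_numeric(..., errors="coerce")` is swapped in for the `replace(':', np.nan)` step, so this is an illustration of the idea rather than the notebook's exact function.

```python
import pandas as pd

# Hypothetical toy table; the real CSVs are assumed to look broadly like this,
# with a header-like first row and ':' marking missing values.
toy = pd.DataFrame({
    "Unnamed: 0": ["GEO/TIME", "Belgium", "Bulgaria", "Czechia"],
    "2015": ["2015", "30", ":", "170"],
    "2016": ["2016", "32", "340", ":"],
})

def clean_data_sketch(dataset, first_row=1, last_row=29):
    """Slice the wanted rows, index by region, and backfill missing values."""
    dataset = dataset[first_row:last_row]
    dataset = dataset.set_index("Unnamed: 0")
    # Swapped-in technique: to_numeric turns ':' (or any non-number) into NaN.
    dataset = dataset.apply(pd.to_numeric, errors="coerce")
    # bfill() works down each column, so a gap takes the value from the next row.
    return dataset.bfill()

cleaned = clean_data_sketch(toy)
print(cleaned)
```

One thing the sketch makes visible: because backfilling runs down the rows, a missing value for one country is filled from the next country in the table rather than from that country's next year, and a gap in the last row stays missing. If filling across years is what is wanted, `bfill(axis=1)` may be closer to the intent.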

After cleaning the data we create a dictionary to store the totals returned by our get_sum() function, which adds up the final column (the most recent year) across all countries for each renewable energy source.

In [5]:
def get_sum(data):
    col_len = len(list(data.columns))
    data_sum = 0 
    for index, row in data.iterrows():
        data_sum += float(row.iloc[col_len-1])
    return data_sum

sum_dict = {}
sum_dict["Hydro"] = get_sum(hydro_data) 
sum_dict["Solar"] = get_sum(solar_data)
sum_dict["Thermal"] = get_sum(thermal_data)
sum_dict["Wind"] = get_sum(wind_data)
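
As an aside, the same total can be obtained without looping over rows. The sketch below uses a hypothetical two-country table and the made-up name `get_sum_sketch`; it assumes, as above, that the last column holds the most recent year.

```python
import pandas as pd

# Hypothetical figures; values are strings, as they would be after reading the CSV.
toy = pd.DataFrame(
    {"2015": ["10", "20"], "2016": ["12.5", "30"]},
    index=["Belgium", "Bulgaria"],
)

def get_sum_sketch(data):
    """Total of the final column, converted to float first."""
    return float(data.iloc[:, -1].astype(float).sum())

total = get_sum_sketch(toy)
print(total)
```

Selecting the column with `iloc[:, -1]` also avoids having to compute the column count by hand.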

We'll then split our dict into two lists and graph them using Plotly.

In [6]:
x_vals = []
y_vals = []

for k, v in sum_dict.items():
    x_vals.append(k)
    y_vals.append(v)

In [10]:
trace1 = go.Bar(x = x_vals, y = y_vals)
data = [trace1]
layout = go.Layout(title = "Overall Renewable Energy Usage in the EU - 2016", 
                   yaxis=dict(title="Thousand Tonnes of Oil Equivalent"))
fig = dict(data=data, layout=layout)
py.iplot(fig, validate=False)

From this fairly basic graph we can see that Hydro Electric and Wind Power are clear leaders within the EU, with Hydro Electric producing energy equal to over 30 million tonnes of oil equivalent.<br>
Next we'll break these figures down by individual country to see which, if any, are leaders in this area.<br><br>
We'll start by importing a fresh copy of the datasets and cleaning them just as before.

In [11]:
hydro_data = pd.read_csv('Datasets/Hydro_Consumption-By_Country.csv')
solar_data = pd.read_csv('Datasets/Solar_Consumption-By_Country.csv')
thermal_data = pd.read_csv('Datasets/Thermal_Consumption-By_Country.csv')
wind_data = pd.read_csv('Datasets/Wind_Consumption-By_Country.csv')

def clean_data(dataset):
    dataset = dataset[1:29]
    dataset = dataset.set_index(["Unnamed: 0"])
    dataset = dataset.replace(':', np.nan)
    dataset = dataset.fillna(method='backfill')
    return dataset

hydro_data = clean_data(hydro_data)
solar_data = clean_data(solar_data)
thermal_data = clean_data(thermal_data)
wind_data = clean_data(wind_data)

Instead of using a single dictionary this time, we'll use one dictionary per type of renewable energy, each linking every country with its associated value.

In [12]:
def each_type(data):
    val_dict = {}
    col_len = len(list(data.columns))
    for index, row in data.iterrows():
        val_dict[index] = float(row.iloc[col_len-1])
    return val_dict

hydro_dict = each_type(hydro_data)
solar_dict = each_type(solar_data)
thermal_dict = each_type(thermal_data)
wind_dict = each_type(wind_data)

After gathering the required information in our dictionaries, we will use a split_vals() function to split up our dictionary values so that they can be later graphed.

In [13]:
def split_vals(data, opt):
    if(opt == 1):
        y_vals = []
        for k, v in data.items():
            y_vals.append(v)
        return y_vals
    else:
        x_vals = []
        y_vals = []
        for k, v in data.items():
            x_vals.append(k)
            y_vals.append(v)
        return x_vals, y_vals
    
x_vals, hy_y_vals = split_vals(hydro_dict, 0)
so_y_vals = split_vals(solar_dict, 1)
th_y_vals = split_vals(thermal_dict, 1)
wi_y_vals = split_vals(wind_dict, 1)
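
The splitting step can also be sketched with the dictionary methods directly. The two dictionaries below are hypothetical stand-ins for `hydro_dict` and `solar_dict`; looking each value up by country is an extra safeguard, not something the original function does.

```python
# Hypothetical per-country values standing in for the real dictionaries.
hydro = {"Belgium": 30.0, "Bulgaria": 340.0}
solar = {"Bulgaria": 120.0, "Belgium": 280.0}  # deliberately in a different order

# keys() and values() iterate in the same (insertion) order.
countries = list(hydro.keys())
hydro_vals = list(hydro.values())

# Looking values up by country keeps every trace aligned with the x axis,
# even if another dictionary was built in a different order.
solar_vals = [solar[country] for country in countries]

print(countries, hydro_vals, solar_vals)
```

In the notebook all four dictionaries are built from identically sliced datasets, so their order is expected to match; the lookup only matters if that assumption stops holding.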

In [14]:
trace1 = go.Bar(x = x_vals,
                y = hy_y_vals,
                name = 'Hydro')

trace2 = go.Bar(x = x_vals,
                y = so_y_vals,
                name = 'Solar')

trace3 = go.Bar(x = x_vals,
                y = th_y_vals,
                name = 'Thermal')

trace4 = go.Bar(x = x_vals,
                y = wi_y_vals,
                name = 'Wind')

data = [trace1, trace2, trace3, trace4]

layout = go.Layout(title ='Renewable Energy Breakdown per Country - 2016', 
                   yaxis=dict(title="Thousand Tonnes of Oil Equivalent"),
                  barmode='stack')
fig = dict(data=data, layout=layout)
py.iplot(fig, validate=False)

From the graph it's clear that there are leaders within the area of renewable energy, with Germany and Spain ahead of the curve.<br>
In Section Two we will use the information gathered in these two graphs to further investigate potentially correlative relationships and potential future trends.<br><br>
In the next graph we will investigate the trend of EU emissions.<br>
As with every new graph, we will start by loading a fresh dataset and cleaning it.

In [31]:
emissions_data = pd.read_csv('Datasets/Emissions_By_Year.csv')

In [32]:
def clean_data(dataset):
    dataset = dataset[0:29]
    dataset = dataset.set_index(["Unnamed: 0"])
    dataset = dataset.replace(':', np.nan)
    dataset = dataset.fillna(method='backfill')
    return dataset

emissions_data = clean_data(emissions_data)

For this graph we gather a list of the dataset's columns, then slice the row for the EU out of the dataset and store its values in a list.

In [33]:
cols = list(emissions_data.columns)
emissions_data = list(emissions_data.loc["EU"])

Using this data and the list of columns, we can construct our line graph through Plotly.

In [34]:
trace1 = go.Scatter(x = cols,
                    y = emissions_data,
                    mode='lines',
                    name = 'EU')

data = [trace1]
layout = go.Layout(title ='Yearly EU Emissions since 1990', 
                   yaxis=dict(title="Thousand Tonnes of Oil Equivalent"))
fig = dict(data=data, layout=layout)
py.iplot(fig, validate=False)

As would be expected, yearly emissions have steadily dropped over time for the EU as a whole. However, just as the earlier breakdown showed clear leaders in renewable energy usage, we would expect some member states to lead in emissions reduction, so we will investigate which member states have been the most successful in reducing their emissions since 1990.<br>
As usual we will start off with a fresh and clean dataset.

In [35]:
def clean_data(dataset):
    dataset = dataset[1:29]
    dataset = dataset.set_index(["Unnamed: 0"])
    dataset = dataset.replace(':', np.nan)
    dataset = dataset.fillna(method='backfill')
    return dataset

emissions_data = pd.read_csv('Datasets/Emissions_By_Year.csv')
emissions_data = clean_data(emissions_data)

As with the previous graph we will get a list of columns; this time, however, we use it to find the final entry within the dataset.<br>
We'll pass this value, along with the dataset, to the get_diff() function to find the difference between the emissions in 1990 and the most recent emissions data for each country.

In [36]:
def get_diff(data, pos):
    x_vals = []
    y_vals = []
    for index, row in data.iterrows():
        x_vals.append(index)
        diff = float(row.iloc[pos-1]) - float(row.iloc[0])
        y_vals.append(diff)
        
    return x_vals, y_vals

col_len = len(list(emissions_data.columns))
x_vals, y_vals = get_diff(emissions_data, col_len)
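
The difference between the first and last year can also be computed column-wise. This is a sketch on hypothetical figures, assuming the first column is 1990 and the last column is the most recent year.

```python
import pandas as pd

# Hypothetical emissions figures, stored as strings as they come out of the CSV.
toy = pd.DataFrame(
    {"1990": ["100", "50"], "2016": ["80", "65"]},
    index=["Germany", "Spain"],
)

numeric = toy.astype(float)
# Last column minus first column: negative means emissions have fallen.
change = numeric.iloc[:, -1] - numeric.iloc[:, 0]

x_vals = list(change.index)
y_vals = list(change)
print(x_vals, y_vals)
```

A negative bar in the resulting chart therefore means a reduction since 1990, and a positive bar an increase.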

In [37]:
trace1 = go.Bar(x = x_vals,
                y = y_vals)

data = [trace1]
layout = go.Layout(title ='Change in Emissions since 1990', 
                   yaxis=dict(title="Thousand Tonnes of Oil Equivalent"))
fig = dict(data=data, layout=layout)
py.iplot(fig, validate=False)

Interestingly, although Spain was one of the leaders in renewable energy, it still has higher emissions than it did in 1990. Additionally, the United Kingdom has had a drastic reduction in emissions, almost on par with Germany, despite having only roughly half of Germany's renewable energy use.

<h2>Section Two : Insight</h2><br>
Throughout this section we will investigate the potentially correlative relationship between a region's usage of renewable energy and several other factors.<br>
Within this section we'll perform the investigations on two different regions:<br>
<ul>
  <li>The EU</li>
  <li>Germany</li>
</ul>
<br><br>

<h3>Percentage of GDP spent on Research and Development</h3>
<h4>EU</h4>
First we need to import and clean our datasets; following this we'll prepare our data for graphing.

In [38]:
renewable_data = pd.read_csv('Datasets/Renewable_Consumption-By_Country.csv')
rnd_data = pd.read_csv('Datasets/R&D-By_Country.csv')

In [39]:
def clean_data(dataset):
    dataset = dataset[0:29]
    dataset = dataset.set_index(["Unnamed: 0"])
    dataset = dataset.replace(':', np.nan)
    dataset = dataset.fillna(method='backfill')
    return dataset

renewable_data = clean_data(renewable_data)
rnd_data = clean_data(rnd_data)

In [41]:
cols = list(renewable_data.columns)
rnd_data = rnd_data[cols]

x_vals = []
for item in cols:
    x_vals.append(int(item))

To prepare the data, we first retrieve the selected region's or country's row from the dataset. If the div parameter is True, each value is divided by a fixed factor to bring it down to a scale that is easier to work with; otherwise the values are appended to the list unchanged.<br>
In this case we divide the renewable energy data down to smaller numbers to make the graph easier to interpret.

In [42]:
def data_prep(country, df, div):
    df_vals = pd.DataFrame(data=df.loc[country])
    df_vals = df_vals.transpose()
    vals = []
    
    if(div == True):
        for index, row in df_vals.iterrows():
            for item in row:
                vals.append((float(item)/100000))
        return vals
    else:
        for index, row in df_vals.iterrows():
            for item in row:
                vals.append(float(item))
        return vals
    
ren_vals = data_prep("EU", renewable_data, True)
rnd_vals = data_prep("EU", rnd_data, False)
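
Since the divisor changes between sections, one option is to pass it in as a parameter rather than redefining the function each time. The sketch below is hypothetical: `data_prep_sketch`, its `scale` parameter and the figures are illustrative and not part of the original notebook.

```python
import pandas as pd

# Hypothetical values, stored as strings as in the cleaned datasets.
toy = pd.DataFrame(
    {"2015": ["150000", "30000"], "2016": ["160000", "32000"]},
    index=["EU", "Germany"],
)

def data_prep_sketch(country, df, scale=1.0):
    """Return one region's row as a list of floats, divided by `scale`."""
    return (df.loc[country].astype(float) / scale).tolist()

eu_vals = data_prep_sketch("EU", toy, scale=100000)
de_vals = data_prep_sketch("Germany", toy, scale=10000)
print(eu_vals, de_vals)
```

Leaving `scale` at its default of 1.0 would play the role of `div=False`.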

With the data successfully retrieved and prepared we can move on to graphing.

In [43]:
trace1 = go.Scatter(
            x = x_vals,
            y = rnd_vals,
            mode='lines+markers',
            name = '{} - R&D'.format("EU"))
    
trace2 = go.Scatter(
            x = x_vals,
            y = ren_vals,
            mode='lines+markers',
            name = '{} - TTOE'.format("EU"))

data = [trace1, trace2]
layout = go.Layout(title ='Renewable Energy vs Percentage GDP spent on R&D - EU')
fig = dict(data=data, layout=layout)
py.iplot(fig, validate=False)

*TTOE = Trillion Tonnes of Oil Equivalent<br><br>
From looking at the graph it certainly appears that there is a correlation between the EU's overall spending on R&D and the renewable energy consumption.<br>
Following this we'll calculate a Pearson correlation between two lists of the corresponding values.

In [44]:
ren_list = list(ren_vals)
rnd_list = list(rnd_vals)
print(pearsonr(ren_list, rnd_list))

(0.9321142924348524, 8.555900858960648e-05)
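
The first number in the tuple is the correlation coefficient r and the second is the p-value. As a rough illustration of what the coefficient measures, here is a minimal pure-Python sketch of the standard Pearson formula on made-up numbers; `pearson_r` is a hypothetical helper and not how SciPy implements it.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up series: one pair rising together, one pair moving in opposite directions.
r_up = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
r_down = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
print(r_up, r_down)
```

Values near 1 or -1 indicate a strong linear relationship. Note that two series which both simply rise over the same years will tend to score highly, so a large r on its own does not show that one drives the other.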


<h4>Germany</h4>

In [46]:
renewable_data = pd.read_csv('Datasets/Renewable_Consumption-By_Country.csv')
rnd_data = pd.read_csv('Datasets/R&D-By_Country.csv')

def clean_data(dataset):
    dataset = dataset[0:29]
    dataset = dataset.set_index(["Unnamed: 0"])
    dataset = dataset.replace(':', np.nan)
    dataset = dataset.fillna(method='backfill')
    return dataset

renewable_data = clean_data(renewable_data)
rnd_data = clean_data(rnd_data)

cols = list(renewable_data.columns)
rnd_data = rnd_data[cols]

x_vals = []
for item in cols:
    x_vals.append(int(item))

As we're switching from the entire EU to a single country, we can reduce the amount we have to divide down by to reach a suitable scale for the graph.

In [48]:
def data_prep(country, df, div):
    df_vals = pd.DataFrame(data=df.loc[country])
    df_vals = df_vals.transpose()
    vals = []
    
    if(div == True):
        for index, row in df_vals.iterrows():
            for item in row:
                vals.append((float(item)/10000))
        return vals
    else:
        for index, row in df_vals.iterrows():
            for item in row:
                vals.append(float(item))
        return vals
    
ren_vals = data_prep("Germany", renewable_data, True)
rnd_vals = data_prep("Germany", rnd_data, False)

trace1 = go.Scatter(
            x = x_vals,
            y = rnd_vals,
            mode='lines+markers',
            name = '{} - R&D'.format("Germany"))
    
trace2 = go.Scatter(
            x = x_vals,
            y = ren_vals,
            mode='lines+markers',
            name = '{} - MTOE'.format("Germany"))

data = [trace1, trace2]
layout = go.Layout(title ='Renewable Energy vs Percentage GDP spent on R&D - Germany')
fig = dict(data=data, layout=layout)
py.iplot(fig, validate=False)

*MTOE = Million Tonnes of Oil Equivalent<br><br>
As before, there seems to be a reasonable correlation between investment in R&D and the growth in usage of renewable energy.

In [49]:
ren_list = list(ren_vals)
rnd_list = list(rnd_vals)
print(pearsonr(ren_list, rnd_list))

(0.8916032490708725, 0.0005289350672805719)


<br><h3>Yearly Emissions</h3>
<h4>EU</h4>

In [83]:
def clean_data(dataset):
    dataset = dataset[0:29]
    dataset = dataset.set_index(["Unnamed: 0"])
    dataset = dataset.replace(':', np.nan)
    dataset = dataset.fillna(method='backfill')
    return dataset

renewable_data = pd.read_csv('Datasets/Renewable_Consumption-By_Country.csv')
emissions_data = pd.read_csv('Datasets/Emissions_By_Year.csv')
renewable_data = clean_data(renewable_data)
emissions_data = clean_data(emissions_data)


In [84]:
cols = list(renewable_data.columns)[:-1]
emissions_data = emissions_data[cols]
x_vals = []
for item in cols:
    x_vals.append(int(item))

In [85]:
def data_prep(country, df, div):
    df_vals = pd.DataFrame(data=df.loc[country])
    df_vals = df_vals.transpose()
    vals = []
    
    if(div == True):
        for index, row in df_vals.iterrows():
            for item in row:
                vals.append(float(item)/1000)
        return vals
    
    else:
        for index, row in df_vals.iterrows():
            for item in row:
                vals.append(int(item))
        return vals
    
ren_vals = data_prep("EU", renewable_data, True)
emissions_vals = data_prep("EU" , emissions_data, True)    

In [86]:
trace1 = go.Scatter(
            x = x_vals,
            y = emissions_vals,
            mode='lines+markers',
            name = '{} - Emissions TTOE'.format("EU"))
    
trace2 = go.Scatter(
            x = x_vals,
            y = ren_vals,
            mode='lines+markers',
            name = '{} - Renewable MTOE'.format("EU"))
    
data = [trace1, trace2]
layout = go.Layout(title ='Renewable Energy vs Yearly Emissions - EU')
fig = dict(data=data, layout=layout)
py.iplot(fig, validate=False)

In [87]:
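ren_list = list(ren_vals)
# Compare against emissions_vals; rnd_vals is a leftover from the R&D section
# and is the series that produced the output recorded below.
emis_list = list(emissions_vals)
print(pearsonr(ren_list, emis_list))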

(-0.17034137523022694, 0.6380033688097215)


<h4>Germany</h4>

In [95]:
def clean_data(dataset):
    dataset = dataset[0:29]
    dataset = dataset.set_index(["Unnamed: 0"])
    dataset = dataset.replace(':', np.nan)
    dataset = dataset.fillna(method='backfill')
    return dataset

renewable_data = pd.read_csv('Datasets/Renewable_Consumption-By_Country.csv')
emissions_data = pd.read_csv('Datasets/Emissions_By_Year.csv')
renewable_data = clean_data(renewable_data)
emissions_data = clean_data(emissions_data)

In [96]:
cols = list(renewable_data.columns)[:-1]
emissions_data = emissions_data[cols]
x_vals = []
for item in cols:
    x_vals.append(int(item))

In [97]:
def data_prep(country, df, div):
    df_vals = pd.DataFrame(data=df.loc[country])
    df_vals = df_vals.transpose()
    vals = []
    
    if(div == True):
        for index, row in df_vals.iterrows():
            for item in row:
                vals.append(float(item)/1000)
        return vals
    
    else:
        for index, row in df_vals.iterrows():
            for item in row:
                vals.append(int(item))
        return vals
    
ren_vals = data_prep("Germany", renewable_data, True)
emissions_vals = data_prep("Germany" , emissions_data, False)    

In [98]:
trace1 = go.Scatter(
            x = x_vals,
            y = emissions_vals,
            mode='lines+markers',
            name = '{} - Emissions TTOE'.format("Germany"))
    
trace2 = go.Scatter(
            x = x_vals,
            y = ren_vals,
            mode='lines+markers',
            name = '{} - Renewable MTOE'.format("Germany"))
    
data = [trace1, trace2]
layout = go.Layout(title ='Renewable Energy vs Yearly Emissions - Germany')
fig = dict(data=data, layout=layout)
py.iplot(fig, validate=False)

In [99]:
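ren_list = list(ren_vals)
# Compare against emissions_vals; rnd_vals is a leftover from the R&D section
# and is the series that produced the output recorded below.
emis_list = list(emissions_vals)
print(pearsonr(ren_list, emis_list))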

(-0.33222637641569275, 0.3482962983586983)


<h3>Mean Income</h3>
<h4>EU</h4>

In [101]:
def clean_data(dataset):
    dataset = dataset[0:29]
    dataset = dataset.set_index(["Unnamed: 0"])
    dataset = dataset.replace(':', np.nan)
    dataset = dataset.fillna(method='backfill')
    return dataset

renewable_data = pd.read_csv('Datasets/Renewable_Consumption-By_Country.csv')
income_data = pd.read_csv('Datasets/Mean_Income-By_Country.csv')
renewable_data = clean_data(renewable_data)
income_data = clean_data(income_data)

In [102]:
cols = list(income_data.columns)[:-1]
renewable_data = renewable_data[cols]
income_data = income_data[cols]

In [103]:
for item in cols:
    income_data[item] = income_data[item].str.replace(",","").astype(float)

x_vals = []
for item in cols:
    x_vals.append(int(item))

In [105]:
def data_prep(country, df, div):
    df_vals = pd.DataFrame(data=df.loc[country])
    df_vals = df_vals.transpose()

    vals = []
    
    if(div == True):
        for index, row in df_vals.iterrows():
            for item in row:
                vals.append(float(item)/10000)
        return vals
    
    else:
        for index, row in df_vals.iterrows():
            for item in row:
                vals.append(float(item)/10000)
        return vals
    
ren_vals = data_prep("EU", renewable_data, True)
income_data = data_prep("EU" , income_data, False)

In [106]:
trace1 = go.Scatter(
            x = x_vals,
            y = income_data,
            mode='lines+markers',
            name = '{} - Mean Income in Thousand Euros'.format("EU"))
    
trace2 = go.Scatter(
            x = x_vals,
            y = ren_vals,
            mode='lines+markers',
            name = '{} - TTOE'.format("EU"))

data = [trace1, trace2]
layout = go.Layout(title ='Renewable Energy vs Mean Income - EU')
fig = dict(data=data, layout=layout)
py.iplot(fig, validate=False)

<h4>Germany</h4>

In [108]:
def clean_data(dataset):
    dataset = dataset[0:29]
    dataset = dataset.set_index(["Unnamed: 0"])
    dataset = dataset.replace(':', np.nan)
    dataset = dataset.fillna(method='backfill')
    return dataset

renewable_data = pd.read_csv('Datasets/Renewable_Consumption-By_Country.csv')
income_data = pd.read_csv('Datasets/Mean_Income-By_Country.csv')
renewable_data = clean_data(renewable_data)
income_data = clean_data(income_data)

In [109]:
cols = list(income_data.columns)[:-1]
renewable_data = renewable_data[cols]
income_data = income_data[cols]

In [110]:
for item in cols:
    income_data[item] = income_data[item].str.replace(",","").astype(float)

x_vals = []
for item in cols:
    x_vals.append(int(item))

In [111]:
def data_prep(country, df, div):
    df_vals = pd.DataFrame(data=df.loc[country])
    df_vals = df_vals.transpose()

    vals = []
    
    if(div == True):
        for index, row in df_vals.iterrows():
            for item in row:
                vals.append(float(item)/1000)
        return vals
    
    else:
        for index, row in df_vals.iterrows():
            for item in row:
                vals.append(float(item)/1000)
        return vals
    
ren_vals = data_prep("Germany", renewable_data, True)
income_data = data_prep("Germany" , income_data, False)

In [115]:
plotly.tools.set_credentials_file(username='VMunt12', api_key='oxHWpnzEydNC1JoTXNqR')

In [116]:
trace1 = go.Scatter(
            x = x_vals,
            y = income_data,
            mode='lines+markers',
            name = '{} - Mean Income in Thousand Euros'.format("Germany"))
    
trace2 = go.Scatter(
            x = x_vals,
            y = ren_vals,
            mode='lines+markers',
            name = '{} - TOE'.format("Germany"))

data = [trace1, trace2]
layout = go.Layout(title ='Renewable Energy vs Mean Income - Germany')
fig = dict(data=data, layout=layout)
py.iplot(fig, validate=False)



While there does appear to be a slight correlation between mean income and usage of renewable energy in Germany, without more data it could quite simply be a case of mean income rising as the country develops further.<br>
While the data provides an interesting theory, without further data to properly test it the result unfortunately has to be ruled inconclusive.

<h2>Section Three : Linear Regression</h2><br>
Throughout this section we will attempt to predict the future values for both yearly emissions and usage of renewable energy using Linear Regression models based on the previous values supplied for the EU.


<h3>Yearly Emissions</h3>

In [118]:
def clean_data(dataset):
    dataset = dataset[0:29]
    dataset = dataset.set_index(["Unnamed: 0"])
    dataset = dataset.replace(':', np.nan)
    dataset = dataset.fillna(method='backfill')
    return dataset

emissions_data = pd.read_csv('Datasets/Emissions_By_Year.csv')
emissions_data = clean_data(emissions_data)
renewable_data = pd.read_csv('Datasets/Renewable_Consumption-By_Country.csv')
renewable_data = clean_data(renewable_data)

After getting the datasets and cleaning them we need to take a subsection of each so that the years match up. This is done by slicing the list of columns and then creating a new instance of the dataframe using these sliced columns.

In [119]:
emissions_cols = list(emissions_data.columns)[17:]
renewable_cols = list(renewable_data.columns)[:-1]
emissions_data = emissions_data[emissions_cols]
renewable_data = renewable_data[renewable_cols]
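
Positional slices such as `[17:]` depend on knowing exactly where each dataset's years start. An alternative sketch, assuming both datasets use the same year labels as column names, is to keep only the years they have in common. The frames below are hypothetical.

```python
import pandas as pd

# Hypothetical frames covering overlapping but different year ranges.
emissions = pd.DataFrame({"2005": [5.0], "2006": [4.8], "2007": [4.7]}, index=["EU"])
renewables = pd.DataFrame({"2006": [1.1], "2007": [1.2], "2008": [1.3]}, index=["EU"])

# Keep the years present in both, in the order they appear in the first frame.
common_years = [year for year in emissions.columns if year in renewables.columns]

emissions_aligned = emissions[common_years]
renewables_aligned = renewables[common_years]
print(common_years)
```

This way the two series stay matched year for year even if a dataset later gains or loses a column.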

After creating the new dataframes we need to turn them into arrays and reshape them so they can be used by the regression model. Selecting the values from the dataset returns a NumPy ndarray, which we then reshape into a 2D array.

In [121]:
emis_list = emissions_data.loc["EU"].values
emis_list = emis_list.reshape(-1, 1)
rene_list = renewable_data.loc["EU"].values
rene_list = rene_list.astype(float)
rene_list = rene_list.reshape(-1, 1)
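
To show what the conversion and reshape do, here is a small sketch on a made-up array of string values, which is roughly what `.values` is assumed to return for a row of the cleaned dataset.

```python
import numpy as np

# Hypothetical row of values, still stored as strings.
row = np.array(["1.5", "2.5", "3.5"], dtype=object)

values = row.astype(float)      # 1D array of floats, shape (3,)
column = values.reshape(-1, 1)  # 2D array with one column, shape (3, 1)

print(values.shape, column.shape)
```

The `-1` asks NumPy to work out the number of rows itself, so the same call works whatever the number of years. scikit-learn estimators expect the feature array in this samples-by-features layout.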

Following this we split our data into training and testing sets. We set test_size to 0.40 because, with so little data, anything smaller would only produce two or three test points.

In [124]:
X_train, X_test, y_train, y_test = train_test_split(rene_list, emis_list, test_size=0.40)
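
Because the split is random, each run of the notebook can put different years in the test set. The sketch below shows the general idea of a seeded shuffle split in plain Python; `shuffle_split` is a hypothetical helper written for illustration and is not scikit-learn's implementation.

```python
import random

def shuffle_split(xs, ys, test_size=0.4, seed=0):
    """Shuffle paired observations, then hold back a fraction for testing."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)  # same seed, same shuffle
    n_test = max(1, round(len(xs) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    pick = lambda seq, idx: [seq[i] for i in idx]
    return pick(xs, train_idx), pick(xs, test_idx), pick(ys, train_idx), pick(ys, test_idx)

# Hypothetical data: ten paired observations where y is always 10 * x.
xs = list(range(10))
ys = [10 * x for x in xs]

x_train, x_test, y_train, y_test = shuffle_split(xs, ys)
print(len(x_train), len(x_test))
```

With ten observations and a test size of 0.4, four points are held back. In scikit-learn, passing `random_state` to `train_test_split` gives the same kind of repeatability.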

Now we select the model, fit the training data and generate predictions for our Y value, which in this case is yearly emissions.

In [125]:
model = linear_model.LinearRegression()
model.fit(X_train, y_train)
prediction = model.predict(X_test)
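
With a single input feature, a linear regression amounts to fitting a straight line. As a rough illustration, the sketch below fits one with `np.polyfit` on made-up numbers; this is a swapped-in technique for demonstration, not the scikit-learn model used above.

```python
import numpy as np

# Made-up observations lying exactly on the line y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Degree-1 least-squares fit returns the slope first, then the intercept.
slope, intercept = np.polyfit(x, y, 1)

# Predicting for an x value outside the observed range.
predicted = slope * 5.0 + intercept
print(slope, intercept, predicted)
```

Note that calling `predict` on `X_test` scores the held-back observations rather than future years. Projecting forward would mean supplying input values beyond the observed range, as in the last step of the sketch.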

We can now take our prediction and the testing data and graph the result to observe the predicted course.

In [130]:
trace1 = go.Scatter(
                x = X_test,
                y = y_test,
                mode='markers',
                name = '{} - Test Values'.format("EU"))
        
trace2 = go.Scatter(
                x = X_test,
                y = prediction,
                mode='lines',
                name = '{} - Predictions'.format("EU"))

    
data = [trace1, trace2]
layout = go.Layout(title ='Predicted future Emissions - EU', 
                   xaxis=dict(title="Renewable Energy Consumption"),
                   yaxis=dict(title="Thousand Tonnes of Oil Equivalent"))

fig = dict(data=data, layout=layout)
py.iplot(fig, validate=False)

<h3>Renewable Energy Usage</h3>

The process is virtually the same as the previous prediction, so I won't be commenting on each step.

In [134]:
def clean_data(dataset):
    dataset = dataset[0:29]
    dataset = dataset.set_index(["Unnamed: 0"])
    dataset = dataset.replace(':', np.nan)
    dataset = dataset.fillna(method='backfill')
    return dataset

rnd_data = pd.read_csv('Datasets/R&D-By_Country.csv')
rnd_data = clean_data(rnd_data)
renewable_data = pd.read_csv('Datasets/Renewable_Consumption-By_Country.csv')
renewable_data = clean_data(renewable_data)

In [135]:
rnd_cols = list(rnd_data.columns)[2:]
renewable_cols = list(renewable_data.columns)
rnd_data = rnd_data[rnd_cols]
renewable_data = renewable_data[renewable_cols]

In [136]:
rnd_list = rnd_data.loc["EU"].values
rnd_list = rnd_list.reshape(-1, 1)
rene_list = renewable_data.loc["EU"].values
rene_list = rene_list.astype(float)
rene_list = rene_list.reshape(-1, 1)

In [137]:
X_train, X_test, y_train, y_test = train_test_split(rnd_list, rene_list, test_size=0.40)

In [138]:
model = linear_model.LinearRegression()
model.fit(X_train, y_train)
prediction = model.predict(X_test)

In [139]:
trace1 = go.Scatter(
                x = X_test,
                y = y_test,
                mode='markers',
                name = '{} - Test Values'.format("EU"))
        
trace2 = go.Scatter(
                x = X_test,
                y = prediction,
                mode='lines',
                name = '{} - Predictions'.format("EU"))

    
data = [trace1, trace2]
layout = go.Layout(title ='Predicted future Renewable Energy Consumption - EU', 
                   xaxis=dict(title="Research & Development share of GDP"),
                   yaxis=dict(title="Renewable Energy Consumption"))

fig = dict(data=data, layout=layout)
py.iplot(fig, validate=False)