# Predicting a country's power consumption

The provided training data contains several economic indicators and the total power consumption for 14 countries. The goal is to learn the relationship between those indicators and power consumption in kWh per capita, and then to predict the power consumption of two new countries from their economic data alone.
Since the data varies so much between countries, it makes sense not only to do a simple linear regression but also to determine which known country each new country most resembles.
Therefore this program trains 14 regression models, one for each country. Upon receiving the data for a new country, it performs a multi-class classification to determine which of the 14 countries the new one most resembles economically. The classifier outputs a number from 1 to 14 that identifies the matching country. A linear regression is then run on the new country, using the model of the country the classifier selected.
For the classification, a Support Vector Machine is used.
For the regression, a multiple linear regression model is used.
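
The two-stage idea can be sketched end to end on made-up data. Everything in this sketch is an assumption for illustration: the synthetic features, the helper names `fit_two_stage` and `predict_two_stage`, and the majority vote used to reduce the per-row class predictions to a single country. It is not the code the notebook runs below.

```python
import numpy as np
from sklearn import svm, linear_model


def fit_two_stage(features_by_label, targets_by_label):
    """Fit one SVM over all countries plus one linear regression per country.

    Both arguments are dicts keyed by integer country label (hypothetical layout).
    """
    labels = sorted(features_by_label)
    X_all = np.vstack([features_by_label[k] for k in labels])
    y_all = np.concatenate([[k] * len(features_by_label[k]) for k in labels])
    classifier = svm.SVC(kernel="rbf").fit(X_all, y_all)
    regressors = {
        k: linear_model.LinearRegression().fit(features_by_label[k], targets_by_label[k])
        for k in labels
    }
    return classifier, regressors


def predict_two_stage(classifier, regressors, X_new):
    """Classify each row, keep the most common label, then regress with that model."""
    votes = classifier.predict(X_new)
    values, counts = np.unique(votes, return_counts=True)
    label = int(values[np.argmax(counts)])
    return label, regressors[label].predict(X_new)


# Synthetic stand-in data: two "countries" with well-separated features.
rng = np.random.RandomState(0)
features = {
    1: rng.normal(0.0, 1.0, size=(13, 3)),
    2: rng.normal(10.0, 1.0, size=(13, 3)),
}
# Targets are an exact linear function of the features, scaled per country.
targets = {k: v.sum(axis=1) * (100.0 * k) for k, v in features.items()}

clf, regs = fit_two_stage(features, targets)
X_new = rng.normal(10.0, 1.0, size=(5, 3))
label, predicted = predict_two_stage(clf, regs, X_new)
print(label, predicted.shape)
```

The majority vote is one possible way to turn several per-year predictions into a single country; in the runs below every row gets the same label, so the choice does not matter there.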

In [27]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import sklearn
from sklearn import svm
from sklearn import linear_model

# Regressors
This section imports the data for each country separately and stores each one in a pandas DataFrame for later use in training the regression models.

In [28]:
excelAustralia = pd.ExcelFile("/Users/Qonzor/PowerConsumption/regression/Australia.xls")
australia = excelAustralia.parse("tab_GSU")
excelAustria = pd.ExcelFile("/Users/Qonzor/PowerConsumption/regression/Austria.xls")
austria = excelAustria.parse("tab_GSU")
excelBelgium = pd.ExcelFile("/Users/Qonzor/PowerConsumption/regression/Belgium.xls")
belgium = excelBelgium.parse("tab_GSU")
excelCanada = pd.ExcelFile("/Users/Qonzor/PowerConsumption/regression/Canada.xls")
canada = excelCanada.parse("tab_GSU")
excelFinland = pd.ExcelFile("/Users/Qonzor/PowerConsumption/regression/Finland.xls")
finland = excelFinland.parse("tab_GSU")
excelGermany = pd.ExcelFile("/Users/Qonzor/PowerConsumption/regression/Germany.xls")
germany = excelGermany.parse("tab_GSU")
excelIceland = pd.ExcelFile("/Users/Qonzor/PowerConsumption/regression/Iceland.xls")
iceland = excelIceland.parse("tab_GSU")
excelIreland = pd.ExcelFile("/Users/Qonzor/PowerConsumption/regression/Ireland.xls")
ireland = excelIreland.parse("tab_GSU")
excelLuxembourg = pd.ExcelFile("/Users/Qonzor/PowerConsumption/regression/Luxembourg.xls")
luxembourg = excelLuxembourg.parse("tab_GSU")
excelNetherlands = pd.ExcelFile("/Users/Qonzor/PowerConsumption/regression/Netherlands.xls")
netherlands = excelNetherlands.parse("tab_GSU")
excelNorway = pd.ExcelFile("/Users/Qonzor/PowerConsumption/regression/Norway.xls")
norway = excelNorway.parse("tab_GSU")
excelSweden = pd.ExcelFile("/Users/Qonzor/PowerConsumption/regression/Sweden.xls")
sweden = excelSweden.parse("tab_GSU")
excelSwitzerland = pd.ExcelFile("/Users/Qonzor/PowerConsumption/regression/Switzerland.xls")
switzerland = excelSwitzerland.parse("tab_GSU")
excelUS = pd.ExcelFile("/Users/Qonzor/PowerConsumption/regression/US.xls")
us = excelUS.parse("tab_GSU")
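
The fourteen near-identical load statements could also be written as a loop. The sketch below is an alternative under assumptions, not a replacement for the cell above: `COUNTRY_NAMES`, `load_country_frames`, and the `reader` parameter are hypothetical names, and it returns a dictionary instead of one variable per country. It also assumes an Excel engine that can read `.xls` files is installed.

```python
import os
import pandas as pd

# File stems as they appear in the load statements above.
COUNTRY_NAMES = [
    "Australia", "Austria", "Belgium", "Canada", "Finland", "Germany", "Iceland",
    "Ireland", "Luxembourg", "Netherlands", "Norway", "Sweden", "Switzerland", "US",
]


def load_country_frames(base_dir, names=COUNTRY_NAMES, sheet="tab_GSU",
                        reader=pd.read_excel):
    """Return {country name: DataFrame}, reading <base_dir>/<name>.xls for each name.

    `reader` is injectable so the function can be exercised without real files.
    """
    frames = {}
    for name in names:
        path = os.path.join(base_dir, name + ".xls")
        frames[name] = reader(path, sheet_name=sheet)
    return frames
```

With the real files in place, `load_country_frames("/Users/Qonzor/PowerConsumption/regression")["US"]` would hold the same data as the `us` variable above.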

# Classifiers
This section imports the combined data that will be used for the classification. The names of the countries have been encoded as integers so that they can serve as class labels for the SVM. Three SVMs are created with three different kernels, since I don't know the structure of the data. All three can be fit and tested for accuracy, but for now I will go with the RBF kernel. It's important to go into the fitting process understanding that if a linear kernel is used and the data is not close to linearly separable, the fit can run for a very long time with no helpful feedback.

In [29]:
excelClassifier = pd.ExcelFile("/Users/Qonzor/PowerConsumption/ClassificationOrig.xls")
classifierData = excelClassifier.parse("tab_GSU")
excelClassifierTesting = pd.ExcelFile("/Users/Qonzor/PowerConsumption/ClassificationTesting.xls")
classifierTestingData = excelClassifierTesting.parse("tab_GSU")
classifierOneX = np.array(classifierData.drop('Country', axis = 1)).astype(float)
classifierOneY = np.array(classifierData['Country']).astype(int)
classifierTwoX = np.array(classifierTestingData.drop('Country', axis = 1)).astype(float)
classifierTwoY = np.array(classifierTestingData['Country']).astype(int)
linear_classifierOne = svm.SVC(kernel = 'linear')
rbf_classifierOne = svm.SVC(kernel = 'rbf')
poly_classifierOne = svm.SVC(kernel = 'poly')

In [30]:
rbf_classifierOne.fit(classifierOneX, classifierOneY)
rbf_classifierOne.score(classifierOneX, classifierOneY, sample_weight = None)

1.0
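
A score of 1.0 on the data the model was fit to says little about how it generalizes. The testing arrays `classifierTwoX` and `classifierTwoY` are loaded above but not used, and a call such as `rbf_classifierOne.score(classifierTwoX, classifierTwoY)` would give a held-out accuracy. The sketch below shows that comparison for all three kernels on synthetic data. The data, the `make_split` helper, and the `StandardScaler` step are my assumptions; scaling is added because economic indicators tend to sit on very different scales, which SVM kernels are sensitive to.

```python
import numpy as np
from sklearn import svm
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def make_split(n_per_class, rng):
    """Synthetic 3-class data; columns are deliberately on very different scales."""
    centers = np.array([[0.0, 0.0, 0.0], [4.0, 4.0, 0.0], [0.0, 4.0, 4.0]])
    scales = np.array([1.0, 1e3, 1e6])
    X = np.vstack([c + rng.normal(0.0, 0.5, size=(n_per_class, 3)) for c in centers])
    y = np.repeat([1, 2, 3], n_per_class)
    return X * scales, y


rng = np.random.RandomState(0)
X_train, y_train = make_split(20, rng)
X_test, y_test = make_split(10, rng)

scores = {}
for kernel in ("linear", "rbf", "poly"):
    # Standardize features, then fit an SVM with the given kernel.
    model = make_pipeline(StandardScaler(), svm.SVC(kernel=kernel))
    model.fit(X_train, y_train)
    scores[kernel] = (model.score(X_train, y_train), model.score(X_test, y_test))

for kernel, (train_acc, test_acc) in scores.items():
    print(kernel, train_acc, test_acc)
```

Comparing the second number in each pair (held-out accuracy) is a fairer basis for choosing a kernel than the training score alone.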

# Prediction

The classifier is first run on the data for the first new country. It outputs a number from 1 to 14 for each row. The regression is then fit to the data of the country that number specifies, and the new country's data is run through that regression model. It should output a value in kWh per capita for each year.

In [31]:
# Sanity check: run the classifier on the Belgium test file to make sure it is classified correctly
excel = pd.ExcelFile("/Users/Qonzor/PowerConsumption/regression/BelgiumTest.xls")
belgiumTestFrame = excel.parse("tab_GSU")
testBelgium = np.array(belgiumTestFrame).astype(float)
guessBelgium = rbf_classifierOne.predict(testBelgium)
print(guessBelgium)

[3 3 3 3 3 3 3 3 3 3 3 3 3]


In [32]:
# It correctly identified Belgium (label 3), so we can move on to the new country's data
excelTest = pd.ExcelFile("/Users/Qonzor/PowerConsumption/SubmissionFile.xlsx")
test = excelTest.parse("SubmittalFile")
testA = np.array(test).astype(float)
guess = rbf_classifierOne.predict(testA)
print(guess)

[14 14 14 14 14 14 14 14 14 14 14 14 14]


The classifier labels every row as 14, so it seems to think the new country most resembles the United States. The data is therefore run through the U.S. regressor.
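
Reading the label off by eye works here, but the lookup can also be done in code. This is a minimal sketch that assumes the labels are 1-based positions in an alphabetical list of the 14 countries. That assumption fits the two outputs above (3 for Belgium, 14 for the United States), but the actual encoding lives in the classification spreadsheet and is not shown here. `COUNTRY_NAMES` and `majority_country` are hypothetical names.

```python
from collections import Counter

# Assumed encoding: label k refers to the k-th name in this alphabetical list.
COUNTRY_NAMES = [
    "Australia", "Austria", "Belgium", "Canada", "Finland", "Germany", "Iceland",
    "Ireland", "Luxembourg", "Netherlands", "Norway", "Sweden", "Switzerland", "US",
]


def majority_country(predicted_labels, names=COUNTRY_NAMES):
    """Return (label, name) for the most frequent predicted label (1-based)."""
    label, _count = Counter(int(v) for v in predicted_labels).most_common(1)[0]
    return label, names[label - 1]


print(majority_country([14] * 13))  # (14, 'US')
print(majority_country([3] * 13))   # (3, 'Belgium')
```

Taking the most frequent label also gives a defined answer if the classifier ever disagrees with itself across years.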

In [33]:
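# Features are every column except the target; the target is power consumption per capita
X = np.array(us.drop('Electric power consumption (kWh per capita)', axis = 1)).astype(float)
y = np.array(us['Electric power consumption (kWh per capita)']).astype(float)
linReg = linear_model.LinearRegression(fit_intercept = True)
linReg.fit(X, y)
# R^2 on the same data the model was fit to
print(linReg.score(X, y))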

0.734410447778
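
The score above is R² on the training data, which tends to be optimistic when there are only a handful of rows. One way to get a more cautious estimate is leave-one-out cross-validation. The sketch below runs it on synthetic data, since the real `X` and `y` are not reproduced here. The 13 rows and 4 features are assumptions chosen only for illustration (13 matches the length of the prediction arrays printed above).

```python
import numpy as np
from sklearn import linear_model
from sklearn.metrics import r2_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Synthetic stand-in for one country's yearly data: 13 rows, 4 features.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(13, 4))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=13)

# R^2 on the data the model was fit to (what LinearRegression.score reports above).
train_r2 = linear_model.LinearRegression().fit(X_demo, y_demo).score(X_demo, y_demo)

# Each row is predicted by a model fit on the other 12 rows.
held_out = cross_val_predict(
    linear_model.LinearRegression(), X_demo, y_demo, cv=LeaveOneOut()
)
loo_r2 = r2_score(y_demo, held_out)

print(train_r2, loo_r2)
```

For ordinary least squares the leave-one-out R² can never exceed the training R², so a large gap between the two is a hint that the model is fitting noise.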


In [34]:
# The R^2 score on the training data (about 0.73) is good enough for our purpose, so we'll go ahead and predict for the new country
linReg.predict(testA)

array([  6112.3247894 ,   5728.23164361,   5830.5953575 ,   8524.9133655 ,
        10142.501412  ,  10487.90667488,  10955.57294795,  12907.26147713,
        14659.17954533,  12118.49712168,  11745.78261714,  12604.36836457,
        10297.81502701])