<h1>Polynomial Linear Regression</h1>
<h2>This notebook uses the "Relative CPU Performance" dataset from the UCI Machine Learning Repository
<a href="https://archive.ics.uci.edu/ml/datasets/Computer+Hardware">Link</a></h2>

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [2]:
dataset = pd.read_csv('machine.csv')

In [3]:
dataset.head()

Unnamed: 0,adviser,32/60,125,256,6000,256.1,16,128,198,199
0,amdahl,470v/7,29,8000,32000,32,8,32,269,253
1,amdahl,470v/7a,29,8000,32000,32,8,32,220,253
2,amdahl,470v/7b,29,8000,32000,32,8,32,172,253
3,amdahl,470v/7c,29,8000,16000,32,8,16,132,132
4,amdahl,470v/b,26,8000,32000,64,8,32,318,290
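
The column labels in the output above (`adviser`, `32/60`, `125`, ...) look like a data record rather than names, which suggests the file has no header row and `read_csv` consumed its first record as labels. A minimal sketch of loading such a file with explicit names follows. The column names are taken from the UCI dataset description (an assumption, since the notebook never names them), and the two inline records stand in for `machine.csv` so the example runs on its own.

```python
import io

import pandas as pd

# Column names as listed in the UCI "Computer Hardware" description
# (an assumption here; the notebook itself never names the columns).
columns = ["vendor", "model", "MYCT", "MMIN", "MMAX",
           "CACH", "CHMIN", "CHMAX", "PRP", "ERP"]

# Two records copied from the output above, standing in for machine.csv.
sample = io.StringIO(
    "adviser,32/60,125,256,6000,256,16,128,198,199\n"
    "amdahl,470v/7,29,8000,32000,32,8,32,269,253\n"
)

# header=None keeps the first record as data instead of using it as labels.
df = pd.read_csv(sample, header=None, names=columns)
print(df.shape)
```

With the real file the equivalent call would be `pd.read_csv('machine.csv', header=None, names=columns)`. It keeps one more row than the load above, so the outputs shown in the rest of this notebook would differ slightly.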


In [63]:
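# Features: vendor (column 0) plus the seven numeric columns 2-8; the model name (column 1) is dropped.
X = dataset.iloc[:,[0,2,3,4,5,6,7,8]].values
# Target: the last column.
y = dataset.iloc[:,-1].values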

In [64]:
X

array([['amdahl', 29, 8000, ..., 8, 32, 269],
       ['amdahl', 29, 8000, ..., 8, 32, 220],
       ['amdahl', 29, 8000, ..., 8, 32, 172],
       ...,
       ['sratus', 125, 2000, ..., 2, 14, 52],
       ['wang', 480, 512, ..., 0, 0, 67],
       ['wang', 480, 1000, ..., 0, 0, 45]], dtype=object)

<h2>Data Preprocessing</h2>
<h3>Dummy Encoding the first attribute</h3>

In [66]:
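from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the vendor column (index 0); the numeric columns pass through unchanged.
# OneHotEncoder's categorical_features argument was removed in scikit-learn 0.22,
# so the column is selected with ColumnTransformer instead.
ct = ColumnTransformer([('vendor', OneHotEncoder(), [0])],
                       remainder='passthrough', sparse_threshold=0)
X = ct.fit_transform(X).astype(float)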

In [67]:
X

array([[  1.,   0.,   0., ...,   8.,  32., 269.],
       [  1.,   0.,   0., ...,   8.,  32., 220.],
       [  1.,   0.,   0., ...,   8.,  32., 172.],
       ...,
       [  0.,   0.,   0., ...,   2.,  14.,  52.],
       [  0.,   0.,   0., ...,   0.,   0.,  67.],
       [  0.,   0.,   0., ...,   0.,   0.,  45.]])

In [68]:
# Drop the first dummy column so the remaining vendor indicators are not redundant.
X = X[:,1:]
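
Dropping one dummy column, as `X[:,1:]` does, avoids keeping a redundant indicator alongside the model's intercept. As a sketch on made-up data, scikit-learn 0.21 or later (an assumption about the installed version) can encode and drop in one step with `drop='first'`. The vendor values below are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy vendor column (illustrative values, not the full dataset).
vendors = np.array([["amdahl"], ["wang"], ["ibm"], ["amdahl"]])

# drop='first' removes the indicator of the first category in sorted order.
encoder = OneHotEncoder(drop="first")
dummies = encoder.fit_transform(vendors).toarray()

print(encoder.categories_[0])  # categories in sorted order
print(dummies)                 # one column per remaining category
```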

<h3>Splitting into training set and test set</h3>

In [70]:
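# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in model_selection.
from sklearn.model_selection import train_test_split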

X_train, X_test, y_train, y_test = train_test_split(X,y)
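
Called with only `X` and `y`, `train_test_split` shuffles the rows and holds out 25% of them by default, so each run of the cell above produces a different split and different predictions. A sketch of a reproducible split follows. The toy arrays and the seed value 0 are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 rows, 2 features (hypothetical, not the CPU dataset).
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

# test_size=0.25 matches the library default; random_state fixes the shuffle.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)
print(X_tr.shape, X_te.shape)
```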

<h2>Applying Regression Models</h2>

<h3>Linear Regression</h3>

In [71]:
from sklearn.linear_model import LinearRegression
lin_regrs = LinearRegression()
lin_regrs.fit(X_train,y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [72]:
y_pred = lin_regrs.predict(X_test)

In [81]:
y_pred

array([ 82.69981422,  14.70398836,  29.58137242, 103.84705501,
        33.93268772,  72.70709426, 118.05285602,   6.28797553,
        23.94877436,  66.71995766,  19.02358401, 307.95921198,
        31.73295684,  30.74646185,   6.75535245,  89.26569335,
        65.31569262,  24.9484754 ,  52.79752917,  66.55664408,
       377.81722564, -14.16777238,  55.46513876, 412.59331998,
       197.17722089,  35.69875754,  38.1380225 , 124.28573592,
        15.48014294,  29.04503994,  34.88952011,  16.42544843,
        -3.63133705,   5.65224075,  51.29014187, 136.79968842,
        14.39109178,  34.390149  , 188.36101616,  36.48452479,
        54.92361175,  75.10882589,  37.90149004,  22.46717596,
        99.82903453, 187.69855662,  32.04356626,  14.60560992,
        22.35287835,  44.58899037,  14.02223964,  29.62568487])

In [82]:
y_test

array([ 74,  28,  35, 116,  28,  67,  72,  22,  46,  80,  25, 253,  20,
        26,  22,  88,  75,  24,  34,  47, 382,  19,  52, 381, 181,  33,
        33, 120,  25,  24,  30,  20,  15,  21,  48, 117,  29,  34, 175,
        65,  18,  62,  34,  42,  81, 136,  53,  28,  23,  28,  27,  36])

<p>The <b>Linear Regression</b> predictions follow y_test only roughly, and some are negative (for example -14.17 where the true value is 19), so next we apply</p>
<h3>Polynomial Regression</h3>

In [83]:
from sklearn.preprocessing import PolynomialFeatures

In [124]:
poly = PolynomialFeatures(degree = 3)

In [125]:
X_poly = poly.fit_transform(X_train)
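
`PolynomialFeatures(degree=3)` expands the input into every product of features up to total degree 3, plus a constant column. For n input features and degree d that gives C(n+d, d) columns, which grows quickly. A small check of that count, using a hypothetical 5-feature input rather than the notebook's `X_train`:

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features, degree = 5, 3  # hypothetical sizes for illustration

# Fit on a dummy array; only the number of columns matters here.
expander = PolynomialFeatures(degree=degree)
expanded = expander.fit_transform(np.ones((2, n_features)))

# Closed-form count of monomials of total degree <= degree (bias included).
expected = comb(n_features + degree, degree)
print(expanded.shape[1], expected)
```

When the expanded column count exceeds the number of training rows, an unregularized linear fit can match the training data exactly and still predict poorly on new rows.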

In [126]:
poly_regrs = LinearRegression()
poly_regrs.fit(X_poly,y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [127]:
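# Reuse the expansion fitted on the training set; transform (not fit_transform) is the idiom for test data.
y_pred_poly = poly_regrs.predict(poly.transform(X_test))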
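
The expansion has to be applied to the test rows in the same way as to the training rows. One way to keep the two steps together is a scikit-learn `Pipeline`. The sketch below uses synthetic data and degree 2 (both assumptions for illustration), not the notebook's arrays.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data following y = 1 + 2x + 3x^2 exactly (illustrative only).
x = np.linspace(-2, 2, 30).reshape(-1, 1)
y_quad = 1 + 2 * x[:, 0] + 3 * x[:, 0] ** 2

# fit() expands and fits; predict() reuses the fitted expansion on new rows.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y_quad)

x_new = np.array([[3.0]])
prediction = model.predict(x_new)
print(prediction)
```

With the notebook's data the same pattern would be `make_pipeline(PolynomialFeatures(degree=3), LinearRegression())` fitted on `X_train, y_train` and then called as `model.predict(X_test)`.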

In [128]:
y_pred_poly

array([  75.30997408,   30.03104043,   35.92920039,  111.78110774,
         56.44875718,   68.78448468,   67.59046548,   22.45903038,
         46.82883614,   82.42486623,   25.9435421 ,  256.78567415,
         20.37970985,   47.23642168,   20.84466288,   88.95695915,
         62.03780306,   24.60887296,   34.17691853,   62.11449081,
        222.66091659,   20.45991548,   41.97071023,  309.01861875,
        246.63039384,   34.6269452 ,  -77.50967663,  127.98925456,
         27.98435858,   24.66207436,   19.73162579,   20.52095959,
         16.50362063,   22.40371452,   41.72834644,  132.8512892 ,
          2.30940188,   33.99989763, -147.76570084,   62.83660602,
         10.23262476,   69.837918  ,   34.00051117,   44.95366763,
         76.13707924,  157.3466641 ,   52.84695242,   39.00402934,
         19.30427628,   27.35412545,   28.14752185,   32.83590601])

In [129]:
y_test

array([ 74,  28,  35, 116,  28,  67,  72,  22,  46,  80,  25, 253,  20,
        26,  22,  88,  75,  24,  34,  47, 382,  19,  52, 381, 181,  33,
        33, 120,  25,  24,  30,  20,  15,  21,  48, 117,  29,  34, 175,
        65,  18,  62,  34,  42,  81, 136,  53,  28,  23,  28,  27,  36])
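
Comparing two printed arrays by eye is error-prone: many polynomial predictions above are close to `y_test`, but others are far off (for example -147.77 where the true value is 175). A sketch of a numeric comparison using mean absolute error and R^2 follows. The helper `summarize` is hypothetical, and the four demo values are only the first entries of the arrays above, so a real comparison should pass the full `y_test`, `y_pred`, and `y_pred_poly`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score


def summarize(name, y_true, y_hat):
    """Print and return MAE and R^2 for one set of predictions."""
    mae = mean_absolute_error(y_true, y_hat)
    r2 = r2_score(y_true, y_hat)
    print(f"{name}: MAE={mae:.2f}, R^2={r2:.3f}")
    return mae, r2


# First four entries of the arrays printed above, rounded; a stand-in only.
y_true_demo = np.array([74, 28, 35, 116])
lin_demo = np.array([82.70, 14.70, 29.58, 103.85])
poly_demo = np.array([75.31, 30.03, 35.93, 111.78])

lin_scores = summarize("linear", y_true_demo, lin_demo)
poly_scores = summarize("polynomial", y_true_demo, poly_demo)
```

In the notebook this would be called as `summarize("linear", y_test, y_pred)` and `summarize("polynomial", y_test, y_pred_poly)`.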

<p>Now we can see that the <b>Polynomial Regression</b> model yields better test outputs