## Computer Hardware Data Set 

Estimate relative CPU performance from hardware attributes such as cycle time, memory size and channel count.

#### Attribute Information
- vendor: vendor name
- Model: model name (many unique symbols)
- MYCT: machine cycle time in nanoseconds (integer) 
- MMIN: minimum main memory in kilobytes (integer) 
- MMAX: maximum main memory in kilobytes (integer)
- CACH: cache memory in kilobytes (integer) 
- CHMIN: minimum channels in units (integer) 
- CHMAX: maximum channels in units (integer) 
- PRP: published relative performance (integer) 
- ERP: estimated relative performance from the original article (integer)

### Import Required Modules

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

### Reading Data File

In [2]:
columns=['vendor', 'Model', 'MYCT', 'MMIN', 'MMAX', 'CACH', 'CHMIN', 'CHMAX', 'PRP', 'ERP']
cpuDf = pd.read_csv('machine.data', sep=",", names=columns)
cpuDf.head()

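| | vendor | Model | MYCT | MMIN | MMAX | CACH | CHMIN | CHMAX | PRP | ERP |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | adviser | 32/60 | 125 | 256 | 6000 | 256 | 16 | 128 | 198 | 199 |
| 1 | amdahl | 470v/7 | 29 | 8000 | 32000 | 32 | 8 | 32 | 269 | 253 |
| 2 | amdahl | 470v/7a | 29 | 8000 | 32000 | 32 | 8 | 32 | 220 | 253 |
| 3 | amdahl | 470v/7b | 29 | 8000 | 32000 | 32 | 8 | 32 | 172 | 253 |
| 4 | amdahl | 470v/7c | 29 | 8000 | 16000 | 32 | 8 | 16 | 132 | 132 |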


In [3]:
cpuDf.shape

(209, 10)

Use isnull() to check for null values in the data set. Chaining any() reports, for each column, whether it contains at least one null.

In [4]:
cpuDf.isnull().any()

vendor    False
Model     False
MYCT      False
MMIN      False
MMAX      False
CACH      False
CHMIN     False
CHMAX     False
PRP       False
ERP       False
dtype: bool
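
When a column does contain nulls, a per-column count is usually more informative than a True/False flag. The following is a minimal sketch on a hypothetical toy frame (the values and the name `toy` are made up for illustration and are not part of the data set):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the real data set) with two missing entries.
toy = pd.DataFrame({
    "MYCT": [125, 29, np.nan],
    "MMIN": [256, 8000, 8000],
    "CACH": [256, np.nan, 32],
})

# any() only says whether a column has at least one null ...
has_null = toy.isnull().any()
# ... while sum() counts the nulls in each column.
null_counts = toy.isnull().sum()

print(has_null)
print(null_counts)
```

On the real data set the same call would be cpuDf.isnull().sum(); since every column reports False above, all counts would be zero.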

In [5]:
cpuDf.vendor.value_counts()

ibm             32
nas             19
sperry          13
honeywell       13
ncr             13
siemens         12
amdahl           9
cdc              9
burroughs        8
hp               7
harris           7
dg               7
magnuson         6
c.r.d            6
ipl              6
dec              6
formation        5
cambex           5
prime            5
gould            3
perkin-elmer     3
nixdorf          3
bti              2
wang             2
apollo           2
basf             2
four-phase       1
microdata        1
sratus           1
adviser          1
Name: vendor, dtype: int64

### Label Encoding

We will use sklearn's built-in methods to preprocess the data and apply the machine learning algorithms.

LabelEncoder converts a categorical variable into integer codes, one code per distinct category.

In [17]:
from sklearn import preprocessing
encode_vendor = preprocessing.LabelEncoder()
cpuDf['vendor'] = encode_vendor.fit_transform(cpuDf['vendor'])
cpuDf.head(10)

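| | vendor | Model | MYCT | MMIN | MMAX | CACH | CHMIN | CHMAX | PRP | ERP |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 32/60 | 125 | 256 | 6000 | 256 | 16 | 128 | 198 | 199 |
| 1 | 1 | 470v/7 | 29 | 8000 | 32000 | 32 | 8 | 32 | 269 | 253 |
| 2 | 1 | 470v/7a | 29 | 8000 | 32000 | 32 | 8 | 32 | 220 | 253 |
| 3 | 1 | 470v/7b | 29 | 8000 | 32000 | 32 | 8 | 32 | 172 | 253 |
| 4 | 1 | 470v/7c | 29 | 8000 | 16000 | 32 | 8 | 16 | 132 | 132 |
| 5 | 1 | 470v/b | 26 | 8000 | 32000 | 64 | 8 | 32 | 318 | 290 |
| 6 | 1 | 580-5840 | 23 | 16000 | 32000 | 64 | 16 | 32 | 367 | 381 |
| 7 | 1 | 580-5850 | 23 | 16000 | 32000 | 64 | 16 | 32 | 489 | 381 |
| 8 | 1 | 580-5860 | 23 | 16000 | 64000 | 64 | 16 | 32 | 636 | 749 |
| 9 | 1 | 580-5880 | 23 | 32000 | 64000 | 128 | 32 | 64 | 1144 | 1238 |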
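
The codes are assigned by the sorted order of the labels, which is why adviser becomes 0 and amdahl becomes 1. A linear model reads these integers as ordered quantities, although the ordering of vendors is arbitrary. The sketch below, on a handful of vendor names taken from the counts above, shows how the mapping can be inspected and reversed, and shows one-hot encoding with pd.get_dummies as a possible alternative. It is an illustration only, not the approach used in the rest of this notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# A few vendor names from the value counts above (illustrative subset).
vendors = pd.Series(["ibm", "amdahl", "adviser", "ibm", "nas"], name="vendor")

encoder = LabelEncoder()
codes = encoder.fit_transform(vendors)

# classes_ holds the sorted unique labels; a label's position is its code.
print(list(encoder.classes_))
print(list(codes))

# inverse_transform maps the codes back to the original names.
print(list(encoder.inverse_transform(codes)))

# Alternative: one indicator column per vendor, so no ordering is implied.
one_hot = pd.get_dummies(vendors, prefix="vendor")
print(one_hot.columns.tolist())
```

One-hot encoding adds one column per vendor (30 here), so whether it helps on a data set of 209 rows is something to check rather than assume.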


In [18]:
X = cpuDf.drop(['Model', 'ERP'], axis=1)
y = cpuDf['ERP']

In [19]:
X.head()

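| | vendor | MYCT | MMIN | MMAX | CACH | CHMIN | CHMAX | PRP |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 125 | 256 | 6000 | 256 | 16 | 128 | 198 |
| 1 | 1 | 29 | 8000 | 32000 | 32 | 8 | 32 | 269 |
| 2 | 1 | 29 | 8000 | 32000 | 32 | 8 | 32 | 220 |
| 3 | 1 | 29 | 8000 | 32000 | 32 | 8 | 32 | 172 |
| 4 | 1 | 29 | 8000 | 16000 | 32 | 8 | 16 | 132 |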


In [20]:
y.head()

0    199
1    253
2    253
3    253
4    132
Name: ERP, dtype: int64

### Train Test Split

train_test_split splits arrays or matrices into random train and test subsets. With test_size=0.2, 20% of the rows are held out for testing.

In [21]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print("The shape of Training Data is : ", x_train.shape, y_train.shape)
print("The shape of Testing Data is : ", x_test.shape, y_test.shape)

The shape of Training Data is :  (167, 8) (167,)
The shape of Testing Data is :  (42, 8) (42,)
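
Because the split is random, rerunning the cell above selects different rows each time, so the coefficients and scores later in the notebook will vary between runs. A minimal sketch of making the split reproducible with the random_state parameter follows; the arrays `X_demo` and `y_demo` are hypothetical stand-ins with the same shape as X and y, and the seed 42 is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins with the same shape as X and y (209 rows, 8 features).
X_demo = np.arange(209 * 8).reshape(209, 8)
y_demo = np.arange(209)

# random_state fixes the shuffle, so repeated calls give the same split.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

x_tr, x_te, y_tr, y_te = split_a
print(x_tr.shape, x_te.shape)

# True when both calls selected exactly the same test rows.
print(np.array_equal(x_te, split_b[1]))
```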


### Regression Model

LinearRegression fits a linear model with coefficients w = (w1, …, wp) to minimize the residual sum of squares between the observed targets in the dataset, and the targets predicted by the linear approximation.

In [22]:
from sklearn.linear_model import LinearRegression
clf = LinearRegression()
clf.fit(x_train, y_train)

print(clf.coef_)
print(clf.intercept_)

# For a regressor, score() returns R^2, not classification accuracy
train_accuracy = clf.score(x_train, y_train)
test_accuracy = clf.score(x_test, y_test)

print(train_accuracy)
print(test_accuracy)

[0.17775423 0.03838502 0.00578821 0.00329183 0.06711135 0.38348825
 0.29392683 0.5658995 ]
-35.565108798376386
0.9580907109636116
0.9611648566928219
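
To make the least-squares objective concrete, the sketch below fits LinearRegression on synthetic data and solves the same problem directly with np.linalg.lstsq. The data, the names `X_demo`, `y_demo` and `true_w`, and the noise level are assumptions made up for illustration; this is not how the numbers above were produced.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 2*x1 - 1*x2 + 0.5*x3 + 4 plus a little noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
true_w = np.array([2.0, -1.0, 0.5])
y_demo = X_demo @ true_w + 4.0 + rng.normal(scale=0.1, size=50)

model = LinearRegression().fit(X_demo, y_demo)

# Same problem by hand: append a column of ones for the intercept
# and minimize the residual sum of squares ||A @ w - y||^2.
A = np.column_stack([X_demo, np.ones(len(X_demo))])
solution, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

print(model.coef_, model.intercept_)
print(solution)  # first three entries: coefficients, last entry: intercept
```

Both approaches should agree up to floating-point error, since they minimize the same quantity.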


For a regressor, score returns the coefficient of determination R^2, computed with multioutput='uniform_average'.

R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0, and the score can be negative because a model can be arbitrarily worse than the baseline. A constant model that always predicts the mean of y, disregarding the input features, gets an R^2 score of 0.0.
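
The definition can be checked by hand. The sketch below computes u and v for a small set of hypothetical values (not taken from the data set) and compares the result with sklearn.metrics.r2_score; the helper name `r_squared` is made up for this example.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical targets and predictions, for illustration only.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

def r_squared(y_true, y_pred):
    """R^2 = 1 - u/v, following the definition in the text."""
    u = ((y_true - y_pred) ** 2).sum()         # residual sum of squares
    v = ((y_true - y_true.mean()) ** 2).sum()  # total sum of squares
    return 1 - u / v

manual = r_squared(y_true, y_pred)
# A constant model that always predicts the mean of y_true.
constant = r_squared(y_true, np.full_like(y_true, y_true.mean()))
# A deliberately bad model: predictions with the wrong sign.
worse = r_squared(y_true, -y_true)

print(manual, r2_score(y_true, y_pred))
print(constant)
print(worse)
```

The constant model lands at 0.0 and the sign-flipped model goes negative, matching the two boundary cases described above.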

### Prediction

In [25]:
# Predictions on the training data, side by side with the true values
train_predict = clf.predict(x_train)
train_predictions = pd.DataFrame({'true_CPU_performence': y_train, 'predicted_CPU_performence': train_predict})

# Predictions on the held-out test data
test_predict = clf.predict(x_test)
test_predictions = pd.DataFrame({'true_CPU_performence': y_test, 'predicted_CPU_performence': test_predict})

print("-------Training Predictions--------")
print(train_predictions.head())

print("-------Testing Predictions--------")
print(test_predictions.head())

-------Training Predictions--------
     true_CPU_performence  predicted_CPU_performence
52                     72                 110.477004
163                    62                  59.652537
8                     749                 648.532691
83                     20                   2.165868
30                    102                 158.527856
-------Testing Predictions--------
     true_CPU_performence  predicted_CPU_performence
23                     45                  72.019429
189                   149                 192.636535
62                     19                   6.940289
165                    80                  65.841514
27                     22                   5.954895
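
R^2 is unitless, so it can help to look at the errors in the units of the target as well. The sketch below copies the five test rows printed above into a small frame and computes residuals, mean absolute error and root mean squared error. Five rows are far too few to characterise the model; the point is only to show the calculation, and the names `shown`, `mae` and `rmse` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# The five test rows printed above, copied by hand (index = original row number).
shown = pd.DataFrame(
    {
        "true_CPU_performence": [45, 149, 19, 80, 22],
        "predicted_CPU_performence": [72.019429, 192.636535, 6.940289, 65.841514, 5.954895],
    },
    index=[23, 189, 62, 165, 27],
)

# Residual = true value minus predicted value.
shown["residual"] = shown["true_CPU_performence"] - shown["predicted_CPU_performence"]

mae = shown["residual"].abs().mean()            # mean absolute error
rmse = np.sqrt((shown["residual"] ** 2).mean())  # root mean squared error

print(shown)
print(mae, rmse)
```

On the full test set the same quantities could be computed from test_predictions, or with sklearn.metrics.mean_absolute_error and mean_squared_error.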


### Assignment

Plot the regression line against the actual values.