Load necessary modules

In [2]:
from sklearn import linear_model
import numpy as np
import matplotlib.pyplot as plt

Download the computer hardware data set

In [3]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/cpu-performance/machine.data

--2021-03-09 12:33:10--  https://archive.ics.uci.edu/ml/machine-learning-databases/cpu-performance/machine.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8726 (8,5K) [application/x-httpd-php]
Saving to: ‘machine.data’


2021-03-09 12:33:11 (1,57 MB/s) - ‘machine.data’ saved [8726/8726]



Let's have a look at the data

In [4]:
!cat machine.data

adviser,32/60,125,256,6000,256,16,128,198,199
amdahl,470v/7,29,8000,32000,32,8,32,269,253
amdahl,470v/7a,29,8000,32000,32,8,32,220,253
amdahl,470v/7b,29,8000,32000,32,8,32,172,253
amdahl,470v/7c,29,8000,16000,32,8,16,132,132
amdahl,470v/b,26,8000,32000,64,8,32,318,290
amdahl,580-5840,23,16000,32000,64,16,32,367,381
amdahl,580-5850,23,16000,32000,64,16,32,489,381
amdahl,580-5860,23,16000,64000,64,16,32,636,749
amdahl,580-5880,23,32000,64000,128,32,64,1144,1238
apollo,dn320,400,1000,3000,0,1,2,38,23
apollo,dn420,400,512,3500,4,1,6,40,24
basf,7/65,60,2000,8000,65,1,8,92,70
basf,7/68,50,4000,16000,65,1,8,138,117
bti,5000,350,64,64,0,1,4,10,15
bti,8000,200,512,16000,0,4,32,35,64
burroughs,b1955,167,524,2000,8,4,15,19,23
burroughs,b2900,143,512,5000,0,7,32,28,29
burroughs,b2925,143,1000,2000,0,5,16,31,22
burroughs,b4955,110,5000,5000,142,8,64,120,124
burroughs,b5900,143,1500,6300,0,5,32,30,35
burroughs,b5920,143,3100,6200,0,5,20,33,39
burroughs,b6900,143,2300,6200,0,6,6

Load all available quantitative inputs (columns 2 to 7) and the measured, i.e. non-predicted, output (column 8)

In [5]:
X = np.loadtxt('machine.data', delimiter=',', usecols=(2, 3, 4, 5, 6, 7))
y = np.loadtxt('machine.data', delimiter=',', usecols=8)
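
For readability, the numeric columns can also be addressed by name. The sketch below does this for the first three rows of the listing above. The names MYCT to CHMAX are the ones used later in this notebook; PRP and ERP for the last two columns are assumed from the UCI data set description, and `columns` is a hypothetical helper dictionary.

```python
import io
import numpy as np

# First three rows of machine.data, copied from the listing above.
sample = io.StringIO(
    "adviser,32/60,125,256,6000,256,16,128,198,199\n"
    "amdahl,470v/7,29,8000,32000,32,8,32,269,253\n"
    "amdahl,470v/7a,29,8000,32000,32,8,32,220,253\n"
)

# Names for columns 2..9; PRP and ERP are assumed names, not taken from this notebook.
names = ["MYCT", "MMIN", "MMAX", "CACH", "CHMIN", "CHMAX", "PRP", "ERP"]

# Columns 0 and 1 (vendor, model) are text and are skipped via usecols.
data = np.loadtxt(sample, delimiter=",", usecols=tuple(range(2, 10)))
columns = {name: data[:, i] for i, name in enumerate(names)}

print(columns["MYCT"])  # column 2
print(columns["PRP"])   # column 8, used as y in this notebook
```

For the full file, replacing `sample` with `'machine.data'` should work the same way.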

Train the linear model by least squares estimation

In [7]:
lin_regr = linear_model.LinearRegression()
lin_regr.fit(X, y)

LinearRegression()
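
To make explicit what the least squares fit computes, here is a minimal NumPy sketch on made-up toy data (not the hardware data). It assumes the default behaviour of `LinearRegression`, which fits an intercept; appending a column of ones and calling `np.linalg.lstsq` minimises the same sum of squared residuals.

```python
import numpy as np

# Toy data, made up for illustration: y = 1 + 2*x1 - 3*x2 exactly.
X_toy = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 5.0], [4.0, 2.0]])
y_toy = 1.0 + 2.0 * X_toy[:, 0] - 3.0 * X_toy[:, 1]

# Append a column of ones so that the last coefficient acts as the intercept.
A = np.hstack([X_toy, np.ones((X_toy.shape[0], 1))])

# Solve min ||A w - y||^2 in the least squares sense.
w, residuals, rank, _ = np.linalg.lstsq(A, y_toy, rcond=None)

coef, intercept = w[:-1], w[-1]
print(coef, intercept)
```

With scikit-learn, the corresponding quantities are available as `lin_regr.coef_` and `lin_regr.intercept_` after fitting.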

Evaluate the predictor on the first 10 training samples

In [8]:
lin_regr.predict(X[:10,:])

array([337.18564802, 311.94899965, 311.94899965, 311.94899965,
       199.08720932, 332.32728062, 452.35843075, 452.35843075,
       630.64290195, 959.4871324 ])

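Hm... quite obviously, this predictor is far off from the model that the underlying paper claims to have built: its predictions differ markedly from the estimates in the last column of the data set (e.g. 337.19 versus 199 for the first sample).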
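
To quantify the mismatch, the predictions can be put side by side with the last column of the data set. The sketch below uses only numbers already shown in this notebook (the prediction output and the last column of the first 10 rows); the variable names `pred`, `reference` and `abs_dev` are chosen here for illustration.

```python
import numpy as np

# Predictions for the first 10 samples, copied from the output above.
pred = np.array([337.18564802, 311.94899965, 311.94899965, 311.94899965,
                 199.08720932, 332.32728062, 452.35843075, 452.35843075,
                 630.64290195, 959.4871324])

# Last column of the first 10 rows of machine.data, copied from the listing.
reference = np.array([199, 253, 253, 253, 132, 290, 381, 381, 749, 1238],
                     dtype=float)

# Absolute deviation per sample, plus a summary value.
abs_dev = np.abs(pred - reference)
for p, r, d in zip(pred, reference, abs_dev):
    print(f"{p:8.2f}  {r:8.0f}  {d:8.2f}")
print("mean absolute deviation:", abs_dev.mean())
```

For all samples, the reference column could be loaded with `np.loadtxt('machine.data', delimiter=',', usecols=9)` and compared with `lin_regr.predict(X)` in the same way.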

According to the paper, available at https://doi.org/10.1145/32232.32234, the following inputs are used to construct the linear model.

In [9]:
CACH = np.loadtxt('machine.data',delimiter=',',usecols=(5))*0.1
CHMIN = np.loadtxt('machine.data',delimiter=',',usecols=(6))
CHMAX = np.loadtxt('machine.data',delimiter=',',usecols=(7))
CHAVG = np.round(CHMIN+CHMAX)+1
MYCT= np.loadtxt('machine.data',delimiter=',',usecols=(2))
SPEED = 1.0/MYCT
CHCAP = CHAVG*SPEED*10
MMIN = np.loadtxt('machine.data',delimiter=',',usecols=(3))
MMAX = np.loadtxt('machine.data',delimiter=',',usecols=(4))
MAVG = ((MMIN+MMAX)/2.0) * 1.0e-3
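
As a sanity check, the same formulas can be evaluated by hand for a single machine. The sketch below transcribes the cell above as written (it does not verify the formulas against the paper) and applies them to the first row of the listing, `adviser,32/60`; `CACH_raw` and `features` are names introduced here for illustration.

```python
# Raw values of the first row (adviser 32/60), copied from the listing.
MYCT, MMIN, MMAX, CACH_raw, CHMIN, CHMAX = 125.0, 256.0, 6000.0, 256.0, 16.0, 128.0

# Same steps as in the cell above, applied to scalars instead of arrays.
CACH = CACH_raw * 0.1
CHAVG = round(CHMIN + CHMAX) + 1
SPEED = 1.0 / MYCT
CHCAP = CHAVG * SPEED * 10
MAVG = ((MMIN + MMAX) / 2.0) * 1.0e-3

# One row of the input matrix, in the order (MAVG, CACH, CHCAP).
features = (MAVG, CACH, CHCAP)
print(features)  # approximately (3.128, 25.6, 11.6)
```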

Stack all inputs into one input matrix

In [10]:
X = np.column_stack((MAVG, CACH, CHCAP))
print(X.shape)

(209, 3)


Let's now try to build the proper linear model...

In [11]:
lin_regr = linear_model.LinearRegression()

lin_regr.fit(X, y)

LinearRegression()

... and use it for the first 10 training samples.

In [12]:
lin_regr.predict(X[:10,:])

array([259.05805636, 317.54110043, 317.54110043, 317.54110043,
       191.81261777, 350.18816216, 439.1899086 , 439.1899086 ,
       586.76996887, 891.60336049])

Again, the model's predictions do not match the predictions that the paper claims to have obtained, which are given in the last column of the data set. What's wrong?

(If you like, you can try to find the issue...) 

(Remark by the instructor: There is no "reference solution" for this.)
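
One possible starting point for the search, sketched on made-up toy data: if the values in the last column came from a linear model in the constructed inputs, then regressing that column itself on X should leave residuals near zero. A clearly non-zero residual would hint that the inputs were constructed differently or that the target was transformed. This is only a diagnostic idea, not a solution; `max_abs_residual` is a hypothetical helper and the toy targets are invented for illustration.

```python
import numpy as np

def max_abs_residual(X, target):
    """Largest absolute residual of a least squares fit with intercept.

    Hypothetical helper, not part of the original notebook.
    """
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    w, *_ = np.linalg.lstsq(A, target, rcond=None)
    return np.max(np.abs(A @ w - target))

# Toy inputs standing in for the (209, 3) matrix X.
rng = np.random.default_rng(0)
X_toy = rng.uniform(0.0, 10.0, size=(50, 3))

# A target that is exactly linear in the inputs, and a nonlinear one.
linear_target = 5.0 + X_toy @ np.array([1.0, 2.0, 3.0])
squared_target = linear_target ** 2

print(max_abs_residual(X_toy, linear_target))   # close to zero
print(max_abs_residual(X_toy, squared_target))  # clearly non-zero
```

On the real data, the call would be `max_abs_residual(X, erp)`, with `erp` loaded from column 9 of `machine.data`.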