
<h1 id="Classification">Classification<a class="anchor-link" href="#Classification">¶</a></h1>



<p>In this problem we'll set up a simple classifier using sklearn.</p>



<p>For this exercise, we'll use the Polish bankruptcy data, available here:
<a href="https://archive.ics.uci.edu/ml/datasets/Polish+companies+bankruptcy+data">https://archive.ics.uci.edu/ml/datasets/Polish+companies+bankruptcy+data</a></p>



<p>First, download the data and save it to your local machine. Then we open the file, whose data section we will read as CSV:</p>


In [None]:

f = open("/home/lizhaoyi/datasets/bankruptcy/5year.arff", 'r')




<p>The data has a slightly odd format (try opening the above file in a text editor to see), so we have to skip the first few lines before we get to the actual csv data.</p>


In [None]:

while '@data' not in f.readline(): # Skip header lines, up to and including the '@data' marker
    pass




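<p>Other than that, we read the file like a regular CSV file, doing some simple data processing as we go: we skip rows with missing entries, prepend a constant offset feature, and convert the label to a boolean.</p>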


In [None]:

dataset = []
for l in f:
    if '?' in l: # Missing entry
        continue
    l = l.split(',')
    values = [1] + [float(x) for x in l] # Prepend a constant offset feature
    values[-1] = values[-1] > 0 # Convert to bool
    dataset.append(values)
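
<p>The same steps can be tried without the downloaded file. The sketch below runs them on a small in-memory string; the sample contents and the helper name "parse_arff_rows" are made up for illustration, and it additionally skips blank lines, which the cell above does not do.</p>

```python
import io

# Hypothetical miniature ARFF file: header lines, an '@data' marker, then CSV rows
SAMPLE_ARFF = """@relation toy
@attribute Attr1 numeric
@attribute class {0,1}
@data
0.5,1.2,0
?,3.4,1
-0.1,0.7,1
"""

def parse_arff_rows(f):
    # Skip the header: consume lines until the '@data' marker (or end of file)
    for line in f:
        if '@data' in line:
            break
    rows = []
    for line in f:
        line = line.strip()
        if not line or '?' in line:  # Skip blank lines and rows with missing entries
            continue
        values = [1] + [float(x) for x in line.split(',')]  # Prepend the offset feature
        values[-1] = values[-1] > 0  # Convert the label to bool
        rows.append(values)
    return rows

toy_dataset = parse_arff_rows(io.StringIO(SAMPLE_ARFF))
print(toy_dataset)
```

<p>Using a "for" loop to skip the header means the loop simply ends if the marker is never found, rather than spinning forever at the end of the file.</p>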



In [None]:

len(dataset)




<p>Looking at just the last entry (i.e., the label), how many positive values are there?</p>


In [None]:

sum([x[-1] for x in dataset])




<p>Generate the feature matrix and label vector from the data (in this case, we're just using the features directly from the CSV).</p>


In [None]:

X = [values[:-1] for values in dataset]



In [None]:

y = [values[-1] for values in dataset]




<p>We'll use the "linear_model" module from sklearn, which has a LogisticRegression class.</p>


In [None]:

from sklearn import linear_model



In [None]:

model = linear_model.LogisticRegression()



In [None]:

model.fit(X, y)




<p>Once the model has finished training, we can use it to make predictions. Here we just use our training data to make predictions, but we could also use new (i.e., previously unseen) datapoints.</p>



<h1 id="Simple-diagnostics---model-accuracy">Simple diagnostics - model accuracy<a class="anchor-link" href="#Simple-diagnostics---model-accuracy">¶</a></h1>


In [None]:

predictions = model.predict(X)




<p>The predictions made by the model are just a vector of "True" and "False" values, one for each datapoint.</p>


In [None]:

predictions




<p>Next we compare the predictions to the labels, giving an array that indicates which of the predictions were correct.</p>


In [None]:

correctPredictions = predictions == y



In [None]:

correctPredictions




<p>To compute the accuracy of the model, we just compute the fraction of "True" values in the above vector.</p>


In [None]:

sum(correctPredictions) / len(correctPredictions)
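
<p>The comparison above works element by element because "model.predict" returns a NumPy array. With two plain Python lists, "==" would instead return a single True or False for the whole list. A minimal pure-Python sketch of the same accuracy computation, using made-up predictions and labels and a hypothetical helper named "accuracy", might look like this:</p>

```python
def accuracy(predictions, labels):
    # Pair each prediction with its label and record whether they agree
    correct = [p == l for (p, l) in zip(predictions, labels)]
    # True counts as 1 and False as 0, so the sum is the number of correct predictions
    return sum(correct) / len(correct)

# Made-up values, for illustration only
toy_predictions = [False, False, True, False]
toy_labels = [False, True, True, False]

toy_accuracy = accuracy(toy_predictions, toy_labels)
print(toy_accuracy)
```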




<h1 id="Classification-with-separate-training/test-splits">Classification with separate training/test splits<a class="anchor-link" href="#Classification-with-separate-training/test-splits">¶</a></h1>



<p>In the previous example, we calculated accuracy, but used the same data for evaluation as we used for training. A more robust evaluation would consist of evaluating the model on <em>new</em> data.</p>



<p>Again, we start by reading in the csv data as before.</p>


In [None]:

f = open("/home/lizhaoyi/datasets/bankruptcy/5year.arff", 'r')



In [None]:

while '@data' not in f.readline(): # Skip header lines, up to and including the '@data' marker
    pass



In [None]:

dataset = []
for l in f:
    if '?' in l:
        continue
    l = l.split(',')
    values = [1] + [float(x) for x in l]
    values[-1] = values[-1] > 0 # Convert to bool
    dataset.append(values)




<p>We start by splitting the data into training and testing portions. It is important to randomly shuffle the data first, so that these two portions are similar samples of the data.</p>


In [None]:

import random



In [None]:

random.shuffle(dataset)



In [None]:

X = [values[:-1] for values in dataset]



In [None]:

y = [values[-1] for values in dataset]




<p>Here we split the data so that the first half (of the shuffled data) is used for training and the rest is used for testing.</p>


In [None]:

N = len(X)
X_train = X[:N//2]
X_test = X[N//2:]
y_train = y[:N//2]
y_test = y[N//2:]



In [None]:

len(X), len(X_train), len(X_test)
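
<p>Because "random.shuffle" gives a different order on every run, the accuracies reported below will vary from run to run. One possible way to make the split repeatable is to shuffle with a fixed seed. The sketch below assumes rows laid out as above (features first, label last); the helper "shuffled_split" and its parameters are hypothetical names, not part of the original exercise.</p>

```python
import random

def shuffled_split(rows, train_fraction=0.5, seed=0):
    rows = list(rows)  # Copy, so the caller's list is left untouched
    random.Random(seed).shuffle(rows)  # Fixed seed, so the split is repeatable
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

# Made-up rows: offset feature, one numeric feature, boolean label
toy_rows = [[1, float(i), i % 2 == 0] for i in range(10)]

train_rows, test_rows = shuffled_split(toy_rows)
# Features and labels are separated only after shuffling, so they stay aligned
X_toy_train = [r[:-1] for r in train_rows]
y_toy_train = [r[-1] for r in train_rows]
print(len(train_rows), len(test_rows))
```

<p>Shuffling whole rows, as the cells above also do, keeps each label attached to its features; shuffling X and y independently would scramble that pairing.</p>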




<p>Again we'll use the same Logistic Regression model for training, though this time we'll only train <em>on the training data</em>.</p>


In [None]:

from sklearn import linear_model



In [None]:

model = linear_model.LogisticRegression()



In [None]:

model.fit(X_train, y_train)




<p>Below we make predictions in the same way as before, but now do so separately for the training and test sets.</p>


In [None]:

predictionsTrain = model.predict(X_train)
predictionsTest = model.predict(X_test)



In [None]:

correctPredictionsTrain = predictionsTrain == y_train
correctPredictionsTest = predictionsTest == y_test




<p>Training accuracy:</p>


In [None]:

sum(correctPredictionsTrain) / len(correctPredictionsTrain)




<p>Test accuracy:</p>


In [None]:

sum(correctPredictionsTest) / len(correctPredictionsTest)




<h1 id="Gradient-descent-(regression)">Gradient descent (regression)<a class="anchor-link" href="#Gradient-descent-(regression)">¶</a></h1>



<p>Here we'll show how to implement regression using gradient descent. Of course this is not necessary when using a library, though the purpose of this exercise is to demonstrate how gradient-descent-based learning operates in general.</p>



<p>This exercise is based on the UCI PM2.5 (Air Quality) dataset, available here:
<a href="https://archive.ics.uci.edu/ml/datasets/Beijing+PM2.5+Data">https://archive.ics.uci.edu/ml/datasets/Beijing+PM2.5+Data</a></p>
<p>As usual, this data should first be saved to your local disk.</p>



<p>We first read the data as usual, and extract the pm2.5 measurements as outputs to be predicted:</p>


In [None]:

path = "datasets/PRSA_data_2010.1.1-2014.12.31.csv"
f = open(path, 'r')



In [None]:

dataset = []
header = f.readline().strip().split(',')
for line in f:
    line = line.split(',')
    dataset.append(line)



In [None]:

header.index('pm2.5')



In [None]:

dataset = [d for d in dataset if d[5] != 'NA']
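
<p>The filter above hard-codes column 5, which is the value "header.index" reports for this file. A variant that looks the column up by name is sketched below on an in-memory sample. The rows are made up, and the column names are an assumption about the file's header chosen to be consistent with the indices used in this exercise.</p>

```python
import io

# Hypothetical excerpt; the header layout is assumed, the values are invented
SAMPLE_CSV = """No,year,month,day,hour,pm2.5,DEWP,TEMP,PRES,cbwd,Iws,Is,Ir
1,2010,1,1,0,NA,-21,-11,1021,NW,1.79,0,0
2,2010,1,2,0,129,-16,-4,1020,SE,1.79,0,0
3,2010,1,2,1,148,-15,-4,1020,SE,2.68,0,0
"""

sample_file = io.StringIO(SAMPLE_CSV)
sample_header = sample_file.readline().strip().split(',')
pm_index = sample_header.index('pm2.5')  # Look the column up by name

sample_rows = [line.strip().split(',') for line in sample_file]
# Drop rows whose measurement is missing
sample_rows = [r for r in sample_rows if r[pm_index] != 'NA']
y_toy = [float(r[pm_index]) for r in sample_rows]
print(pm_index, y_toy)
```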




<p>For demonstration we'll build a simple feature vector consisting of only a single feature (temperature), plus an offset. Thus our goal is to predict air quality as a function of temperature.</p>


In [None]:

def feature(datum):
    feat = [1, float(datum[7])] # Temperature
    return feat



In [None]:

X = [feature(d) for d in dataset]
y = [float(d[5]) for d in dataset]



In [None]:

X[0]




<p>Below we initialize our parameter vector (theta).</p>


In [None]:

K = len(X[0])
K



In [None]:

theta = [0.0]*K




<p>It can help to initialize the model to a solution that's "close to" the optimal value. A good initial guess for the offset term is to initialize it to the average value of the label (i.e., the average air quality measurement)</p>


In [None]:

theta[0] = sum(y) / len(y)
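
<p>To see why this helps: with the offset set to the mean label and every other parameter at zero, the model predicts the mean for every datapoint, so the starting MSE equals the variance of the labels rather than their mean square. A small sketch with made-up numbers (the helper "mse" is a hypothetical name) illustrates the difference:</p>

```python
def mse(X, y, theta):
    # Prediction error for each datapoint: inner product of features and theta, minus the label
    errors = [sum(a * b for (a, b) in zip(x, theta)) - label
              for (x, label) in zip(X, y)]
    return sum(e * e for e in errors) / len(errors)

# Made-up data: an offset feature plus a single input feature
X_toy = [[1, 0.0], [1, 1.0], [1, 2.0], [1, 3.0]]
y_toy = [10.0, 12.0, 14.0, 16.0]

mse_zero = mse(X_toy, y_toy, [0.0, 0.0])                      # All-zeros start
mse_mean = mse(X_toy, y_toy, [sum(y_toy) / len(y_toy), 0.0])  # Offset set to the mean label
print(mse_zero, mse_mean)
```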




<p>Below are a few utility functions to compute the inner product and the squared norm of a vector (the function called "norm" returns the sum of squares, without taking a square root). These are also implemented in various libraries (e.g. numpy), and using the library functions would likely result in a more efficient implementation, but the functions are written here just for demonstration purposes.</p>


In [None]:

def inner(x,y):
    return sum([a*b for (a,b) in zip(x,y)])



In [None]:

def norm(x):
    return sum([a*a for a in x]) # Squared L2 norm; equivalently, inner(x,x)
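
<p>For comparison, the sketch below separates the sum of squares from the Euclidean length. The names "squared_norm" and "l2_norm" are hypothetical and are not used elsewhere in this exercise.</p>

```python
import math

def inner(x, y):
    return sum(a * b for (a, b) in zip(x, y))

def squared_norm(x):
    return inner(x, x)  # Sum of squares; this is what norm() above returns

def l2_norm(x):
    return math.sqrt(squared_norm(x))  # Euclidean length

v = [3.0, 4.0]
print(squared_norm(v), l2_norm(v))
```

<p>Either quantity can serve as a convergence measure, since both shrink to zero together, but a threshold chosen for one does not mean the same thing for the other.</p>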




<h3 id="Derivative-function:">Derivative function:<a class="anchor-link" href="#Derivative-function:">¶</a></h3>



<p>Most important is our function to compute the derivative. The expression being computed here is the derivative of the objective (the Mean Squared Error) with respect to the parameters (theta), at our current estimate of theta.</p>


In [None]:

def derivative(X, y, theta):
    dtheta = [0.0]*len(theta) # Initialize the derivative vector to be a vector of zeros
    K = len(theta)
    N = len(X)
    MSE = 0 # Compute the MSE as we go, just to print it for debugging
    for i in range(N):
        error = inner(X[i],theta) - y[i]
        for k in range(K):
            dtheta[k] += 2*X[i][k]*error/N # See the lectures to understand how this expression was derived
        MSE += error*error/N
    return dtheta, MSE
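
<p>For reference, the expression accumulated in the inner loop follows from differentiating the MSE term by term. Here x_i denotes row i of X, x_ik its k-th entry, and N the number of datapoints, matching the code; the notation is an assumption of this note and may differ from the lectures.</p>

```latex
\begin{aligned}
\mathrm{MSE}(\theta) &= \frac{1}{N} \sum_{i=1}^{N} \left( x_i \cdot \theta - y_i \right)^2 \\
\frac{\partial\, \mathrm{MSE}}{\partial \theta_k} &= \frac{1}{N} \sum_{i=1}^{N} 2\, x_{ik} \left( x_i \cdot \theta - y_i \right)
\end{aligned}
```

<p>The term in parentheses is the variable "error" in the code, and the factor 2*X[i][k]/N multiplies it exactly as in the update to dtheta[k].</p>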



In [None]:

learningRate = 0.003




<p>Now we iteratively call our derivative function to improve our estimate of theta, by following the direction of the derivative. The details of this function are quite difficult to get right, e.g. if the learning rate or the convergence criteria are not set well the function may not produce a good solution.</p>


In [None]:

while (True):
    dtheta,MSE = derivative(X, y, theta)
    m = norm(dtheta)
    print("norm(dtheta) = " + str(m) + " MSE = " + str(MSE))
    for k in range(K):
        theta[k] -= learningRate * dtheta[k]
    if m < 0.01: break
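
<p>Since a badly chosen learning rate can keep the loop above from ever meeting its stopping criterion, one common safeguard is to cap the number of iterations. The sketch below shows this on made-up data that lies exactly on a line; the function "gradient_descent", its "tolerance" and "max_iterations" parameters, and the chosen values are illustrative assumptions, not settings tuned for the air quality data.</p>

```python
def inner(x, y):
    return sum(a * b for (a, b) in zip(x, y))

def derivative(X, y, theta):
    # Gradient of the MSE with respect to theta
    N = len(X)
    dtheta = [0.0] * len(theta)
    for i in range(N):
        error = inner(X[i], theta) - y[i]
        for k in range(len(theta)):
            dtheta[k] += 2 * X[i][k] * error / N
    return dtheta

def gradient_descent(X, y, theta, learning_rate, tolerance=1e-8, max_iterations=10000):
    for iteration in range(max_iterations):
        dtheta = derivative(X, y, theta)
        if inner(dtheta, dtheta) < tolerance:  # Squared gradient norm is small: converged
            return theta, iteration
        theta = [t - learning_rate * d for (t, d) in zip(theta, dtheta)]
    return theta, max_iterations  # Gave up without meeting the tolerance

# Made-up data lying exactly on the line y = 2 + 3x
X_toy = [[1, 0.0], [1, 1.0], [1, 2.0], [1, 3.0]]
y_toy = [2.0, 5.0, 8.0, 11.0]

theta_toy, iterations_used = gradient_descent(X_toy, y_toy, [0.0, 0.0], learning_rate=0.05)
print(theta_toy, iterations_used)
```

<p>Returning the iteration count alongside theta makes it easy to tell a converged run from one that merely hit the cap.</p>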




<p>Once gradient descent has converged, we can examine the learned value of theta:</p>


In [None]:

theta




<h1 id="Gradient-descent-in-Tensorflow">Gradient descent in Tensorflow<a class="anchor-link" href="#Gradient-descent-in-Tensorflow">¶</a></h1>



<p>Although the above gradient descent code produced an acceptable solution, it was (a) difficult to implement; (b) difficult to tune the optimization process (e.g. learning rates and stopping criteria); and (c) inefficient (i.e., slow). Using libraries like tensorflow can help to alleviate these issues.</p>



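<p>First you will need to set up a working install of tensorflow. See installation instructions here:
<a href="https://www.tensorflow.org/install">https://www.tensorflow.org/install</a></p>
<p>Note that the code below uses the tensorflow 1.x graph API ("tf.Session", "tf.train.AdamOptimizer", "tf.global_variables_initializer"). Under tensorflow 2.x these calls are only available under "tf.compat.v1", and eager execution must be disabled first with "tf.compat.v1.disable_eager_execution()".</p>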


In [None]:

import tensorflow as tf




<p>Apart from that, we read in the data as before, this time using three features (temperature, pressure, and wind speed) plus an offset.</p>


In [None]:

path = "datasets/PRSA_data_2010.1.1-2014.12.31.csv"
f = open(path, 'r')



In [None]:

dataset = []
header = f.readline().strip().split(',')
for line in f:
    line = line.split(',')
    dataset.append(line)



In [None]:

header.index('pm2.5')



In [None]:

dataset = [d for d in dataset if d[5] != 'NA']



In [None]:

def feature(datum):
    feat = [1, float(datum[7]), float(datum[8]), float(datum[10])] # Temperature, pressure, and wind speed
    return feat



In [None]:

X = [feature(d) for d in dataset]
y = [float(d[5]) for d in dataset]



In [None]:

y = tf.constant(y, shape=[len(y),1])



In [None]:

K = len(X[0])




<p>The main advantage of tensorflow is that we don't have to compute the gradient - tensorflow will compute it for us. Instead, we have to implement our <em>objective</em> (i.e., the MSE) in terms of tensorflow operations:</p>


In [None]:

def MSE(X, y, theta):
  return tf.reduce_mean((tf.matmul(X,theta) - y)**2)




<p>Next we tell tensorflow that theta is our vector of variables to be optimized (we also specify its shape and initial values).</p>


In [None]:

theta = tf.Variable(tf.constant([0.0]*K, shape=[K,1]))




<p>Here we select an optimizer, which is essentially a specific gradient descent implementation. The parameter passed to the optimizer is a learning rate.</p>


In [None]:

optimizer = tf.train.AdamOptimizer(0.01)




<p>Then we tell tensorflow that our MSE function is the objective to be optimized.</p>


In [None]:

objective = MSE(X,y,theta)




<p>Then we tell tensorflow that this objective should be minimized (i.e., we are trying to minimize an error, rather than maximizing an accuracy), and initialize the session.</p>


In [None]:

train = optimizer.minimize(objective)



In [None]:

init = tf.global_variables_initializer()



In [None]:

sess = tf.Session()
sess.run(init)




<p>Finally we run 1000 iterations of gradient descent. Note how fast this is compared to our own implementation!</p>


In [None]:

for iteration in range(1000):
  cvalues = sess.run([train, objective])
  print("objective = " + str(cvalues[1]))




<p>Once gradient descent has converged, we can print out the results (i.e., the final MSE and theta).</p>


In [None]:

with sess.as_default():
  print(MSE(X, y, theta).eval())
  print(theta.eval())

