# PA 4

This program will build sample classifiers for the pre-processed automobile dataset created for PA1. 

Missing values are resolved with mean value replacement before the classifiers described in the following steps are built.

## Step 1 - Random Instances: Linear Regression

This step will create a classifier that predicts MPG values using least squares linear regression.

Our predictor attribute will be vehicle weight. The algorithm will take a set of instances, predict their MPG values, and then discretize these values based on the DOE classifications given in PA2.

To test our classifier, we will select 5 random instances from the dataset, and then compare our predicted MPG classification with the actual classification from the data.

To begin, we will first define several helper functions in order to create and clean our dataset.

* `read_data()`
    * Reads the text file that the dataset will be built from
    * **Parameters**
        * `filename`: The name of the file to read data from
    * **Returns** 
        * A string containing the text from the given file
* `create_dataset()`
    * Turns a formatted string into a 2D dataset array
    * **Parameters**
        * `data`: A string containing the data to build the dataset from
    * **Returns**
        * The data in the dataset as a 2D array
* `resolve_missing_values()`
    * Resolves all instances of "NA" by replacing them with the mean of that attribute, in-place
    * **Parameters**
        * `data`: The dataset to clean

In [1]:
def read_data(filename):
    # The context manager closes the file even if reading fails
    with open(filename, 'r') as f:
        return f.read()

def create_dataset(data):
    data_r = data.splitlines()
    dataset = []
    for line in data_r:
        instance = line.split(',')
        dataset.append(instance)
    for instance in dataset:
        for i in range(10):
            try:
                instance[i] = float(instance[i])
            except ValueError:
                # Non-numeric fields (car name, "NA") stay as strings
                pass
    return dataset

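def resolve_missing_values(data):
    for i in range(10):
        if i != 8:  # index 8 is the car name, which has no mean
            sum_i = 0
            count_i = 0
            for instance in data:
                if instance[i] != "NA":
                    try:
                        sum_i += instance[i]
                        count_i += 1
                    except TypeError:
                        # Report any non-numeric value other than "NA"
                        print(instance[i])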
            if count_i == 0:
                continue
            mean = sum_i / count_i
            for instance in data:
                if instance[i] == "NA":
                    instance[i] = mean
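
To make the effect of mean value replacement concrete, here is a minimal, self-contained sketch on a made-up three-row sample. The names `parse` and `fill_with_column_mean` are hypothetical stand-ins for the helpers above, and the sketch assumes every column is numeric apart from "NA" markers.

```python
# Made-up sample data; "NA" marks a missing value as in the dataset.
raw = "18.0,8,3504\nNA,4,2372\n30.0,4,NA"

def parse(text):
    """Split comma-separated lines and convert numeric fields to float."""
    rows = []
    for line in text.splitlines():
        row = []
        for field in line.split(","):
            try:
                row.append(float(field))
            except ValueError:
                row.append(field)  # keep non-numeric fields such as "NA"
        rows.append(row)
    return rows

def fill_with_column_mean(rows, missing="NA"):
    """Replace each missing value with the mean of its column, in place."""
    for col in range(len(rows[0])):
        known = [row[col] for row in rows if row[col] != missing]
        if not known:
            continue  # no known values to average in this column
        mean = sum(known) / len(known)
        for row in rows:
            if row[col] == missing:
                row[col] = mean

sample = parse(raw)
fill_with_column_mean(sample)
print(sample)
# [[18.0, 8.0, 3504.0], [24.0, 4.0, 2372.0], [30.0, 4.0, 2938.0]]
```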

Now that we have these functions defined, we will call them on the "auto-data.txt" file to create the dataset for this project.

In [2]:
dataset = create_dataset(read_data("auto-data.txt"))
resolve_missing_values(dataset)

Finally, we have a dataset that we can use for the rest of this project.

Now, we will define functions for linear regression. To do this, we will copy the helper functions written for PA3, outlined here for clarity.

* `compute_total_instances()`  
    * **Params**: 
        * `data` = the dataset to query
    * **Returns**: The number of instances in that dataset
* `sum_attribute()`
    * **Params**:
        * `data` = the dataset to query
        * `index` = the index of the attribute to query
    * **Returns**:
        * The sum of all values for the given index in the dataset
* `mean_attribute()`
    * **Params**:
        * `data` = the dataset to query
        * `index` = the index of the attribute to query
    * **Returns**:
        * The mean of all values of the given attribute in the dataset
* `linear_regression_slope()`
    * **Params**:
        * `data` = the dataset to query
        * `x_index` = the index of the attribute to use for the x values
        * `y_index` = the index of the attribute to use for the y values
    * **Returns**:
        * The slope of the linear regression line
* `linear_regression_intercept()`
    * **Params**:
        * `data` = the dataset to query
        * `x_index` = the index of the attribute to use for the x values
        * `y_index` = the index of the attribute to use for the y values
    * **Returns**:
        * The intercept of the linear regression line
* `linear_regression_correlation()`
    * **Params**:
        * `data` = the dataset to query
        * `x_index` = the index of the attribute to use for the x values
        * `y_index` = the index of the attribute to use for the y values
    * **Returns**:
        * The correlation coefficient of the linear regression line
* `linear_regression_std_error()`
    * **Params**:
        * `data` = the dataset to query
        * `y_index` = the index of the attribute to use for the y values
    * **Returns**:
        * The Standard Error of the linear regression line
        
        
Additionally, we will use these formulas for linear regression:
* **Linear Regression**:  $y = mx + b$
* **Slope**:  $m = \frac{\sum_{i = 1}^{n} (x_{i}-\overline{x})(y_{i}-\overline{y})}{\sum_{i = 1}^{n} (x_{i}-\overline{x})^{2}}$
* **Intercept**: $b = \overline{y} - m\overline{x}$
* **Correlation Coefficient**: $r = \frac{\sum_{i = 1}^{n} (x_{i}-\overline{x})(y_{i}-\overline{y})}{\sqrt{\sum_{i = 1}^{n} (x_{i}-\overline{x})^{2} \sum_{i = 1}^{n}(y_{i}-\overline{y})^{2}}}$
* **Standard Error**: $\sqrt{\frac{\sum_{i = 1}^{n}(y_{i} - \overline{y})^{2}}{n}}$
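
As a quick sanity check of these formulas, the sketch below evaluates them on five made-up (x, y) pairs using only the standard library. The data and the variable names (`s_xy`, `s_xx`, `s_yy`, `spread_y`) are illustrative assumptions, not part of the assignment code; the last quantity follows the standard error formula exactly as written above.

```python
from math import sqrt

# Made-up (x, y) pairs used only to illustrate the formulas.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(xs)
mean_x = sum(xs) / n  # 3.0
mean_y = sum(ys) / n  # 4.0

# Sums of products of deviations from the means
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))  # 6.0
s_xx = sum((x - mean_x) ** 2 for x in xs)                        # 10.0
s_yy = sum((y - mean_y) ** 2 for y in ys)                        # 6.0

m = s_xy / s_xx                # slope
b = mean_y - m * mean_x        # intercept
r = s_xy / sqrt(s_xx * s_yy)   # correlation coefficient
spread_y = sqrt(s_yy / n)      # "standard error" as defined above

print(f"m = {m:.3f}, b = {b:.3f}, r = {r:.3f}, spread = {spread_y:.3f}")
# m = 0.600, b = 2.200, r = 0.775, spread = 1.095
```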

In [3]:
import numpy as np

def compute_total_instances(data):
    return len(data)

def sum_attribute(data, index):
    # Named `total` so the built-in sum() is not shadowed
    total = 0
    for instance in data:
        total += instance[index]
    return total

def mean_attribute(data, index):
    mean = sum_attribute(data, index) / compute_total_instances(data)
    return mean
    
    
def linear_regression_slope(data, x_index, y_index):
    mean_x = mean_attribute(data, x_index)
    mean_y = mean_attribute(data, y_index)
    sum_dividend = 0
    sum_divisor = 0
    for i in range(len(data)):
        sum_dividend += (data[i][x_index] - mean_x)*(data[i][y_index] - mean_y)
        sum_divisor += (data[i][x_index] - mean_x)**2
    m = sum_dividend / sum_divisor
    return m

def linear_regression_intercept(data, x_index, y_index):
    mean_x = mean_attribute(data, x_index)
    mean_y = mean_attribute(data, y_index)
    m = linear_regression_slope(data, x_index, y_index)
    b = mean_y - (mean_x * m)
    return b

def linear_regression_correlation(data, x_index, y_index):
    mean_x = mean_attribute(data, x_index)
    mean_y = mean_attribute(data, y_index)
    sum_dividend = 0
    sum_divisor_x = 0
    sum_divisor_y = 0
    for i in range(len(data)):
        sum_dividend += (data[i][x_index] - mean_x)*(data[i][y_index] - mean_y)
        sum_divisor_x += ((data[i][x_index] - mean_x)**2)
        sum_divisor_y += ((data[i][y_index] - mean_y)**2)
    sum_divisor = sum_divisor_x * sum_divisor_y
    divisor = np.sqrt(sum_divisor)
    r = sum_dividend / divisor
    return r

def linear_regression_std_error(data, y_index):
    mean_y = mean_attribute(data, y_index)
    sum_dividend = 0
    for i in range(len(data)):
        sum_dividend += (data[i][y_index] - mean_y) ** 2
    std_sqr = sum_dividend / compute_total_instances(data)
    std_error = np.sqrt(std_sqr)
    return std_error


For the DOE Classifications, we will use these values, defined in PA2:

| Rating | MPG   |
|--------|-----  |
|   10   | ≥ 45  |
|   9    | 37-44 |
|   8    | 31-36 |
|   7    | 27-30 |
|   6    | 24-26 |
|   5    | 20-23 |
|   4    | 17-19 |
|   3    | 15-16 |
|   2    |   14  |
|   1    | ≤ 13  |
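
Because a regression prediction is rarely a whole number, the table has to be read as a set of lower cutoffs: a value such as 44.9 falls below the cutoff for rating 10 and is therefore rated 9. One compact way to express that reading is a sorted list of cutoffs searched with `bisect`. This is only a sketch; `LOWER_BOUNDS` and `doe_rating` are hypothetical names, not part of the assignment code.

```python
from bisect import bisect_right

# Lower MPG bound of ratings 2 through 10, taken from the table above.
LOWER_BOUNDS = [14, 15, 17, 20, 24, 27, 31, 37, 45]

def doe_rating(mpg):
    """Return the DOE rating (1-10) for a possibly fractional MPG value."""
    # bisect_right counts how many cutoffs are <= mpg; rating 1 has no cutoff
    return bisect_right(LOWER_BOUNDS, mpg) + 1

for mpg in (12.0, 14.0, 16.9, 23.5, 44.9, 45.0):
    print(mpg, doe_rating(mpg))
# 12.0 1
# 14.0 2
# 16.9 3
# 23.5 5
# 44.9 9
# 45.0 10
```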

To make our predictions, we will define our classifier as follows:
* `classifier()`
    * **Parameters**
        * `data`: The dataset to use for defining our linear regression
        * `test`: A list of test instances to make predictions on
    * **Returns**
        * A list of the DOE classification for the predicted MPG values of each test instance

In [4]:
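def classifier(data, test):
    # Attribute 4 is vehicle weight (x); attribute 0 is MPG (y)
    m = linear_regression_slope(data, 4, 0)
    b = linear_regression_intercept(data, 4, 0)
    predictions = []
    for instance in test:
        x = instance[4]
        mpg = m*x + b
        # Discretize the predicted MPG into a DOE rating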
        if mpg >= 45:
            y = 10
        elif mpg >= 37:
            y = 9
        elif mpg >= 31:
            y = 8
        elif mpg >= 27:
            y = 7
        elif mpg >= 24:
            y = 6
        elif mpg >= 20:
            y = 5
        elif mpg >= 17:
            y = 4
        elif mpg >= 15:
            y = 3
        elif mpg >= 14:
            y = 2
        else:
            y = 1
        predictions.append(y)
    return predictions

To test this, we will select five random instances from our dataset to predict on, generated using the following helper function:

* `generate_random_instances()`
    * **Parameters**
        * `data`: The dataset to pull random instances from
        * `n`: The number of instances to generate
    * **Returns**
        * A list of _n_ random instances from the dataset

In [5]:
from random import randint

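def generate_random_instances(data, n):
    # Without this check the loop below never ends when n > len(data)
    if n > len(data):
        raise ValueError("n cannot exceed the number of instances")
    test_indices = []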
    while len(test_indices) < n:
        index = randint(0, len(data) - 1)
        if index not in test_indices:
            test_indices.append(index)
    instances = []
    for i in test_indices:
        instances.append(data[i])
    return instances

test_instances = generate_random_instances(dataset, 5)
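
As an aside, the standard library's `random.sample` also draws distinct elements without replacement, so the same selection could be sketched in a few lines. The function name `sample_instances`, the `seed` parameter, and the toy rows below are assumptions made for illustration; passing a seed simply makes the draw repeatable.

```python
import random

def sample_instances(data, n, seed=None):
    """Return n distinct instances chosen at random from data."""
    rng = random.Random(seed)   # private generator; a fixed seed repeats the draw
    return rng.sample(data, n)  # raises ValueError if n > len(data)

# Toy stand-in for the dataset: 20 rows of [mpg, name]
rows = [[float(i), f"car {i}"] for i in range(20)]
picked = sample_instances(rows, 5, seed=0)
print(len(picked))
```

Like `generate_random_instances()`, this returns the same row objects that are stored in the dataset rather than copies, so modifying a selected instance also modifies the dataset.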

The following cell will run the classifier function on the test instances and display the output.

In [6]:
predictions = classifier(dataset, test_instances)

def print_output_1(test_instances, predictions):
    print("===========================================")
    print("STEP 1: Linear Regression MPG Classifier")
    print("===========================================")
    for i in range(len(predictions)):
        print("instance: ", end="")
        # Read MPG without modifying the instance, which is shared with the dataset
        mpg = test_instances[i][0]
        if mpg >= 45:
            actual = 10
        elif mpg >= 37:
            actual = 9
        elif mpg >= 31:
            actual = 8
        elif mpg >= 27:
            actual = 7
        elif mpg >= 24:
            actual = 6
        elif mpg >= 20:
            actual = 5
        elif mpg >= 17:
            actual = 4
        elif mpg >= 15:
            actual = 3
        elif mpg >= 14:
            actual = 2
        else:
            actual = 1
        # Join the remaining attributes without a trailing separator
        instance_string = ", ".join(str(attribute) for attribute in test_instances[i][1:])
        print(instance_string)
        print("class:", predictions[i], end="")
        print(", actual:", actual)

print_output_1(test_instances, predictions)

STEP 1: Linear Regression MPG Classifier
instance: 8.0, 350.0, 165.0, 4142.0, 11.5, 70.0, 1.0, "chevrolet chevelle concours (sw)", 3210.0
class: 2, actual: 5
instance: 6.0, 173.0, 115.0, 2595.0, 11.3, 79.0, 1.0, "chevrolet citation", 4112.573033707865
class: 6, actual: 7
instance: 4.0, 134.0, 95.0, 2560.0, 14.2, 78.0, 3.0, "toyota corona", 4574.0
class: 6, actual: 7
instance: 8.0, 307.0, 130.0, 3504.0, 12.0, 70.0, 1.0, "chevrolet chevelle malibu", 2881.0
class: 4, actual: 4
instance: 6.0, 225.0, 110.0, 3620.0, 18.7, 78.0, 1.0, "dodge aspen", 3911.0
class: 4, actual: 4
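
One rough way to summarize this run is to count how often the predicted rating matches the actual one, and how often it is off by at most one rating. The sketch below hard-codes the five (predicted, actual) pairs from the output above; with only five randomly chosen instances, these counts describe this single run rather than the classifier's overall accuracy.

```python
# (predicted, actual) rating pairs copied from the output above.
results = [(2, 5), (6, 7), (6, 7), (4, 4), (4, 4)]

exact = sum(1 for predicted, actual in results if predicted == actual)
within_one = sum(1 for predicted, actual in results if abs(predicted - actual) <= 1)

print(f"exact matches: {exact}/{len(results)}")
print(f"within one rating: {within_one}/{len(results)}")
# exact matches: 2/5
# within one rating: 4/5
```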


## Step 2 - Random Instances: kNN

For this step, we will instead create a nearest neighbor classifier for mpg instead of 