# Task 3: Dimensionality Reduction

This notebook documents the process by which the classifiers for this task were built.
It is long and detailed. Refer to the short version of this notebook for a simpler, more direct solution to the task.

# TOC
* [Preface](#pre)
* [Generator](#gen)
* [Visualising the dataset](#vis)
* [Classification](#class)
    - [Logistic Regression](#lr)
    - [K-Nearest Neighbours](#knn)
    - [Decision Trees](#dt)
    - [Random Forest](#rf)
    - [RBF Support Vector Machine](#rbf)
    - [Naive Bayes Classifier](#nb)
    - [Artificial Neural Network](#ann)
    - [AdaBoost](#ada)
* [Concluding thoughts](#conc)

# **Preface** <a class="anchor" id="pre"></a>

Initially, I considered using scikit-learn to run all of the classifiers listed above. After extracting the dataset, however, I realised that I don't have enough memory to load it in full. This is a problem, because most scikit-learn estimators expect the whole training set to be in memory when `fit` is called. To work around this, I implemented my own generator that yields the data one line at a time, and wrote my own classifiers to work with it. I also used a custom plotting function to generate the ROC curve.

This is just a starting point, though. There are a couple of issues with the method I've adopted. For one, it reads rows from the file sequentially. This is a problem wherever we use batches, because we cannot guarantee that each batch contains a balanced mix of rows from each class, which might hurt accuracy. One possible mitigation is random access: use `seek()` to jump to the byte offset of a specific line, and read randomly chosen lines to build each training batch. This should make the batches more uniform on average.
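The random-access idea can be sketched as follows. This is only an illustration and is not used elsewhere in this notebook: `build_line_index` and `random_batch` are hypothetical helper names, and the sketch assumes the file is a plain comma-separated file with one numeric row per line. Since `seek()` works on byte offsets rather than line numbers, the file is scanned once to record where each line starts. That index costs one integer per row, which is far smaller than the data itself.

```python
import numpy as np

def build_line_index(path):
    """Scan the file once and record the byte offset where each non-empty line starts."""
    offsets = []
    with open(path, 'rb') as fh:
        pos = fh.tell()
        line = fh.readline()
        while line:
            if line.strip():
                offsets.append(pos)
            pos = fh.tell()
            line = fh.readline()
    return np.array(offsets, dtype=np.int64)

def random_batch(path, offsets, size, seed=None):
    """Read `size` distinct, randomly chosen rows by seeking to their offsets."""
    rng = np.random.default_rng(seed)
    chosen = rng.choice(len(offsets), size=size, replace=False)
    rows = []
    with open(path, 'rb') as fh:
        # Visit the chosen offsets in file order so the disk reads move forward only.
        # The rows in the batch are therefore in file order, not in sampling order.
        for pos in np.sort(offsets[chosen]):
            fh.seek(int(pos))
            line = fh.readline().decode('ascii')
            rows.append([float(v) for v in line.split(',')])
    return np.array(rows)
```

With the index built once via `offsets = build_line_index('../../HIGGS_6M.csv')`, each call to `random_batch('../../HIGGS_6M.csv', offsets, 64)` would return a fresh 64-row array drawn from across the whole file, so a batch is no longer tied to the order in which rows happen to be stored.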

In [4]:
# Import statements
import matplotlib.pyplot as plt
import numpy as np

# **Generator** <a class="anchor" id="gen"></a>

As mentioned above, I don't have enough memory to load the whole dataset. So, the logical thing to do is to load one row at a time.

We open the file inside the `getRow` generator, which yields one row of the dataset at a time.

In [None]:
# Generator, because I don't have enough memory to load the dataset fully

# Converts a line read from the file into a NumPy array for further processing
def processLine(l):
    '''
    Input:
        l : line to convert into array
        
    Output:
        l : numpy array, after conversion
    '''
    l = l.split(',')
    l = [float(i) for i in l]
    l = np.array(l)
    return l

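# Generator to yield one row at a time
def getRow():
    with open('../../HIGGS_6M.csv', 'r') as fh:
        # Iterating over the file handle reads one line at a time and stops at EOF
        for line in fh:
            # Skip blank lines (e.g. a trailing newline at the end of the file)
            if not line.strip():
                continue
            yield processLine(line)

# Generator to yield batches of SIZE rows, each as a 2-D numpy array
def getBatch(SIZE):
    batch = []
    for row in getRow():
        batch.append(row)
        if len(batch) == SIZE:
            yield np.array(batch)
            batch = []
    # Yield the last batch, which may hold fewer than SIZE rows
    if batch:
        yield np.array(batch)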
            
# Test generator to see if it works
x = getRow()
print(next(x))
print(next(x))
# Close the generator so that the file handle it keeps open is released
x.close()

# Test batch generator to see if it works
x = getBatch(10)
print(next(x))
print(next(x))
# Close the batch generator as well
x.close()