# Task 3 : Dimensionality Reduction
***
This notebook documents the process by which the classifiers were built for this task.
It is long and detailed. Refer to the short version of this notebook for a simpler, direct solution to the task at hand.
***
# TOC
* [Preface](#pre)
* [Generator](#gen)
* [Understanding the dataset](#vis)
* [Classification](#class)
    - [Logistic Regression](#lr)
    - [K-Nearest Neighbours](#knn)
    - [Decision Trees](#dt)
    - [Random Forest](#rf)
    - [RBF Support Vector Machine](#rbf)
    - [Naive Bayes Classifier](#nb)
    - [Artificial Neural Network](#ann)
    - [AdaBoost](#ada)
* [Concluding thoughts](#conc)
***
# **Preface** <a class="anchor" id="pre"></a>

Initially, I considered using scikit-learn to run all of the above classifiers. But after extracting the dataset, I realised that I don't have enough memory to load it in full. This is a problem, because most scikit-learn estimators expect the entire training set to be loaded into memory. To circumvent this, I implemented my own generator to yield the data one line at a time, and wrote my own classifiers to work with it. I also used a custom plotting function to generate the ROC curves.

This is just a starting point, though. There are a couple of issues with the method I've adopted. For one, it loads elements into memory sequentially. This is a problem wherever we use batches, because we cannot guarantee a good, uniform split of elements from each class in a batch, which might hurt accuracy. It can be mitigated with random access on the file: use `seek()` to jump to a specific line, and read lines in random order to build the training set. This makes it more likely that each batch is representative of the whole dataset.
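
The random-access idea could be sketched as follows. This is a minimal sketch, not the method used in this notebook: `build_line_index` and `read_random_rows` are hypothetical helper names, and the demo runs on a small temporary CSV with made-up values instead of the real dataset. One pass records the byte offset at which each line starts, after which `seek()` can jump straight to any line.

```python
import os
import random
import tempfile

def build_line_index(path):
    '''Return a list with the byte offset at which each line of the file starts.'''
    offsets = []
    # Binary mode, so that the offsets are plain byte positions
    with open(path, 'rb') as fh:
        while True:
            pos = fh.tell()
            line = fh.readline()
            # readline() returns an empty bytes object at end of file
            if not line:
                break
            offsets.append(pos)
    return offsets

def read_random_rows(path, offsets, n, seed=None):
    '''Read n distinct, randomly chosen rows using seek().'''
    rng = random.Random(seed)
    chosen = rng.sample(offsets, n)
    rows = []
    with open(path, 'rb') as fh:
        for pos in chosen:
            # Jump to the start of the chosen line and parse it
            fh.seek(pos)
            line = fh.readline().decode()
            rows.append([float(v) for v in line.split(',')])
    return rows

# Demo on a tiny stand-in file (made-up values, not the real dataset)
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as tmp:
    for i in range(10):
        tmp.write(f"{i % 2}.0,{i}.5\n")
    demo_path = tmp.name

demo_offsets = build_line_index(demo_path)
demo_rows = read_random_rows(demo_path, demo_offsets, n=4, seed=0)
print(demo_rows)
os.remove(demo_path)
```

The index holds one integer per row, so it still has to fit in memory, but it is much smaller than the parsed rows themselves.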

In [1]:
# Import statements
import matplotlib.pyplot as plt
from time import time

***
# **Generator** <a class="anchor" id="gen"></a>

As mentioned above, I don't have enough memory to load the whole dataset. So, the logical thing to do is to load one row at a time.

We open the file in the `getRow` generator, which yields one row of the dataset at a time. We also define a generator `getBatch` to yield batches of `SIZE` rows.

The code for these functions can be found below.

In [2]:
# Generator, because I don't have enough memory to load the dataset fully

# Converts the read line into a list for further processing
def processLine(l):
    '''
    Input:
        l : line to convert, a string of comma-separated values
        
    Output:
        l : list of floats, after conversion
    '''
    l = l.split(',')
    l = [float(i) for i in l]
    return l

# Generator to yield one row at a time
def getRow():
    '''
    Output:
        Generator object, yields each row as a list of floats
    '''
    with open('../../HIGGS_6M.csv', 'r') as fh:
        # Iterating over the file handle reads one line at a time
        # and stops cleanly at the end of the file
        for line in fh:
            # Skip blank lines, e.g. a trailing newline at the end of the file
            if line.strip():
                yield processLine(line)

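# Generator to yield a batch at a time
def getBatch(SIZE):
    '''
    Input:
        SIZE : number of rows per batch

    Output:
        Generator object, yields lists of up to SIZE rows
    '''
    batch = []
    # Create the row generator only once, so that each batch
    # continues from where the previous one stopped
    for row in getRow():
        batch.append(row)
        if len(batch) == SIZE:
            yield batch
            batch = []
    # End of file reached: yield the last, possibly smaller, batch
    if batch:
        yield batch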

In [3]:
# Test generator to see if it works
x = getRow()
print(next(x))
print(next(x))
# Delete the generator so that it does not keep the file open
del x

[1.0, 0.869293212890625, -0.6350818276405334, 0.22569026052951813, 0.327470064163208, -0.6899932026863098, 0.7542022466659546, -0.24857313930988312, -1.0920639038085938, 0.0, 1.3749921321868896, -0.6536741852760315, 0.9303491115570068, 1.1074360609054565, 1.138904333114624, -1.5781983137130737, -1.046985387802124, 0.0, 0.657929539680481, -0.010454569943249226, -0.0457671694457531, 3.101961374282837, 1.353760004043579, 0.9795631170272827, 0.978076159954071, 0.9200048446655273, 0.7216574549674988, 0.9887509346008301, 0.8766783475875854]
[1.0, 0.9075421094894409, 0.3291472792625427, 0.3594118654727936, 1.4979698657989502, -0.3130095303058624, 1.09553062915802, -0.5575249195098877, -1.588229775428772, 2.1730761528015137, 0.8125811815261841, -0.2136419266462326, 1.2710145711898804, 2.214872121810913, 0.4999939501285553, -1.2614318132400513, 0.7321561574935913, 0.0, 0.39870089292526245, -1.138930082321167, -0.0008191101951524615, 0.0, 0.3022198975086212, 0.8330481648445129, 0.985699653625488

In [4]:
# Test batch generator to see if it works
x = getBatch(2)
print(next(x))
# Delete generator
del x

[[1.0, 0.869293212890625, -0.6350818276405334, 0.22569026052951813, 0.327470064163208, -0.6899932026863098, 0.7542022466659546, -0.24857313930988312, -1.0920639038085938, 0.0, 1.3749921321868896, -0.6536741852760315, 0.9303491115570068, 1.1074360609054565, 1.138904333114624, -1.5781983137130737, -1.046985387802124, 0.0, 0.657929539680481, -0.010454569943249226, -0.0457671694457531, 3.101961374282837, 1.353760004043579, 0.9795631170272827, 0.978076159954071, 0.9200048446655273, 0.7216574549674988, 0.9887509346008301, 0.8766783475875854], [1.0, 0.9075421094894409, 0.3291472792625427, 0.3594118654727936, 1.4979698657989502, -0.3130095303058624, 1.09553062915802, -0.5575249195098877, -1.588229775428772, 2.1730761528015137, 0.8125811815261841, -0.2136419266462326, 1.2710145711898804, 2.214872121810913, 0.4999939501285553, -1.2614318132400513, 0.7321561574935913, 0.0, 0.39870089292526245, -1.138930082321167, -0.0008191101951524615, 0.0, 0.3022198975086212, 0.8330481648445129, 0.9856996536254

Now that we've got the generator ready, let's move on to visualising the features in the dataset.
***
# **Understanding the Dataset** <a class="anchor" id="vis"></a>

First things first, we check that all 6M rows are present. Then, we get the range of values in each column, and compute the mean of each column in the same pass. There are 29 columns in the dataset. The first column is the class label, the next 21 columns are basic features, and the final 7 columns are features derived from those 21 (as mentioned in the question).
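
To make the column layout concrete, a row can be split into its three parts as shown here. This is a minimal sketch: `splitRow` is a hypothetical helper that is not used elsewhere in this notebook, and the row is a placeholder list of 29 numbers, not real data.

```python
def splitRow(row):
    '''
    Input:
        row : list of 29 floats, as yielded by a row generator

    Output:
        label : class of the element (column 1)
        basic : the 21 basic features (columns 2 to 22)
        derived : the 7 derived features (columns 23 to 29)
    '''
    label = row[0]
    basic = row[1:22]
    derived = row[22:29]
    return label, basic, derived

# Placeholder row, 29 values long (not real data)
demo_row = [float(i) for i in range(29)]
label, basic, derived = splitRow(demo_row)
print(label, len(basic), len(derived))
```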

In [5]:
t1 = time()
x = getRow()
rows = 0
# Lists to store the properties for each column
maximum = [0] * 29
minimum = [0] * 29
total = [0] * 29

try:
    # While loop to iterate through the dataset
    while True:
        r = next(x)
        rows += 1
        # Iterate through each element in the row to see if we need to update any of the above properties
        for i in range(len(r)):
            # If element is less than minimum
            if r[i] < minimum[i]:
                minimum[i] = r[i]
            # If element is greater than maximum
            if r[i] > maximum[i]:
                maximum[i] = r[i]
            # Add element to total, to find average
            total[i] += r[i]
# next() raises StopIteration once the generator is exhausted
except StopIteration:
    print("Iterated through ", rows, "rows")

# Calculate average
average = [i/rows for i in total]

for i in range(29):
    print("-----------------------------")
    print(i+1, "th column properties")
    print("Maximum element: ", maximum[i])
    print("Minimum element: ", minimum[i])
    print("Sum of all elements: ", total[i])
    print("Average value of column: ", average[i])
print("-----------------------------")
print("Execution took ", time()-t1, "s")

Iterated through  6000000 rows
-----------------------------
1 th column properties
Maximum element:  1.0
Minimum element:  0
Sum of all elements:  3178345.0
Average value of column:  0.5297241666666667
-----------------------------
2 th column properties
Maximum element:  10.396014213562012
Minimum element:  0
Sum of all elements:  5946832.861814141
Average value of column:  0.9911388103023568
-----------------------------
3 th column properties
Maximum element:  2.4348678588867188
Minimum element:  -2.434976100921631
Sum of all elements:  -37.90278963888704
Average value of column:  -6.317131606481174e-06
-----------------------------
4 th column properties
Maximum element:  1.7432359457015991
Minimum element:  -1.7425082921981812
Sum of all elements:  2179.9233033626515
Average value of column:  0.0003633205505604419
-----------------------------
5 th column properties
Maximum element:  15.396821022033691
Minimum element:  0
Sum of all elements:  5989140.586218525
Average value of c

As we can see here, it takes almost one and a half minutes to iterate through the dataset once.
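
One caveat about the numbers above: a minimum printed as the integer `0` (columns 1, 2 and 5) is the value `minimum` was initialised with. It means that no element of that column was below zero, so the true minimum may be larger. The same would happen to `maximum` for a column with no positive values. A minimal sketch of one way to avoid this, assuming we start from infinities instead of zeros (`columnRange` is a hypothetical helper, and the rows are made-up values):

```python
def columnRange(rows, n_cols):
    '''
    Input:
        rows : iterable of rows, each a list of n_cols floats
        n_cols : number of columns

    Output:
        minimum, maximum : per-column minimum and maximum
    '''
    # Any real value is smaller than +inf and larger than -inf,
    # so the first row always replaces these starting values
    minimum = [float('inf')] * n_cols
    maximum = [float('-inf')] * n_cols
    for r in rows:
        for i in range(n_cols):
            if r[i] < minimum[i]:
                minimum[i] = r[i]
            if r[i] > maximum[i]:
                maximum[i] = r[i]
    return minimum, maximum

# Made-up rows: the first column is all positive, the second all negative
demo_rows = [[0.5, -3.0], [2.0, -1.0], [1.25, -2.0]]
lo, hi = columnRange(demo_rows, 2)
print(lo, hi)
```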

Now that we have the mean of each column, the next step is to calculate the variance and standard deviation of each column.
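
Since the data cannot be held in memory, the variance has to be computed in a streaming fashion. One option, sketched here under that assumption, is Welford's online algorithm, which updates a running mean and a running sum of squared deviations one row at a time. `columnVariance` is a hypothetical helper, it returns the population variance (dividing by the number of rows), and the demo rows are made-up values.

```python
from math import sqrt

def columnVariance(rows, n_cols):
    '''
    Input:
        rows : iterable of rows, each a list of n_cols floats
        n_cols : number of columns

    Output:
        mean, variance, std : per-column statistics (population variance)
    '''
    count = 0
    mean = [0.0] * n_cols
    # m2[i] accumulates the sum of squared deviations from the running mean
    m2 = [0.0] * n_cols
    for r in rows:
        count += 1
        for i in range(n_cols):
            delta = r[i] - mean[i]
            mean[i] += delta / count
            # Uses the deviation from the old mean and from the updated mean
            m2[i] += delta * (r[i] - mean[i])
    if count == 0:
        raise ValueError("no rows to process")
    variance = [m / count for m in m2]
    std = [sqrt(v) for v in variance]
    return mean, variance, std

# Made-up rows, two columns
demo_rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
mean, variance, std = columnVariance(demo_rows, 2)
print(mean, variance, std)
```

With the generator from this notebook, the call would look like `columnVariance(getRow(), 29)`, at the cost of one more pass over the file.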
***
*NOTE:*
1. *We don't know whether any missing data is encoded in the table as zeros*
2. *If such data exists, we don't know how to distinguish the missing elements from elements whose value is actually zero*

*So, we assume the zeros present in the dataset are elements where the value is actually zero, and not cases of missing values.*
***
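
To get a feel for how much this assumption matters, one could count how often each column is exactly zero. A minimal sketch, with `zeroFraction` as a hypothetical helper and made-up rows:

```python
def zeroFraction(rows, n_cols):
    '''
    Input:
        rows : iterable of rows, each a list of n_cols floats
        n_cols : number of columns

    Output:
        List with the fraction of exact zeros in each column
    '''
    zeros = [0] * n_cols
    count = 0
    for r in rows:
        count += 1
        for i in range(n_cols):
            if r[i] == 0.0:
                zeros[i] += 1
    if count == 0:
        raise ValueError("no rows to process")
    return [z / count for z in zeros]

# Made-up rows, three columns
demo_rows = [[1.0, 0.0, 0.5], [0.0, 0.0, 1.5], [1.0, 2.0, 2.5], [1.0, 0.0, 3.5]]
fractions = zeroFraction(demo_rows, 3)
print(fractions)
```

A column where a large share of the values is exactly zero would be a candidate for a closer look. This count does not by itself tell missing values apart from true zeros.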