In [23]:
import numpy
import scipy
import scipy.sparse
import sklearn.metrics.pairwise

# Data cleaning

The first step of nearly any machine learning project is data cleaning. Cleaning gives downstream models a predictable input, so that irregularities (here, special characters such as quotation marks, the literal token '[comma]', and other punctuation) do not disturb the model. The following code loads the data, cleans it, and then stores all cleaned descriptions in an array. Comments in the code explain each step.

### Cleaning

In [24]:
## Collect the cleaned descriptions in a list

descriptions = []

with open('descriptions.txt', encoding="utf8") as f:
    for line in f:
        text = line.lower()                                       ## Lowercase all characters
        text = text.replace("[comma]", " ")                       ## Replace the literal token "[comma]" with a space
        for ch in text:
            if ch < "0" or (ch < "a" and ch > "9") or ch > "z":   ## Keep only digits and a-z; replace every other character with a space
                text = text.replace(ch, " ")
        text = ' '.join(text.split())                             ## Collapse runs of whitespace (this also drops the trailing newline)
        descriptions.append([text])                               ## One-element list per line, so the array has shape (n, 1)
dataSet = numpy.array(descriptions)                               ## The with-block has already closed the file

The data is now cleaned and stored in an array with one description per row, which we can treat as a vector of text samples.
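
As a point of comparison, the same cleaning rule (lowercase, replace '[comma]', keep only digits and a-z, collapse whitespace) can be sketched with a regular expression. This is a minimal sketch, not the notebook's own method: the helper name `clean_description` and the pattern `_NON_ALNUM` are hypothetical names introduced here for illustration.

```python
import re

# Hypothetical pattern: any run of characters that are not digits or a-z.
_NON_ALNUM = re.compile(r"[^0-9a-z]+")

def clean_description(line: str) -> str:
    """Clean one raw description line (illustrative helper, not from the notebook)."""
    text = line.lower()                   # lowercase all characters
    text = text.replace("[comma]", " ")   # drop the literal "[comma]" token
    text = _NON_ALNUM.sub(" ", text)      # each run of other characters becomes one space
    return text.strip()                   # remove leading and trailing spaces

print(clean_description('Round face[comma] "short" & overweight!\n'))
```

Because each run of removed characters collapses to a single space, no separate whitespace-normalisation step is needed, and the line is scanned once instead of once per distinct special character.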

In [25]:
print('The size of our data set: ', dataSet.size)
print('The dimensions of our data set are: ', dataSet.shape)
print('\n')
print('-- 0th element of our dataSet --', '\n', dataSet[0])
print('\n')
print('-- 1st element of our dataSet --', '\n', dataSet[1])

The size of our data set:  1480
The dimensions of our data set are:  (1480, 1)


-- 0th element of our dataSet -- 
 ['round face short and overweight likes to wear jeans and sweaters drinks wine at dinner short liberal overweight short hair eats at whole foods does not work our very much']


-- 1st element of our dataSet -- 
 ['jug ears mustache and beard and long sideburns stylish hair no laugh lines eyes are clear no drugs or alcohol confident a little overweight from double chin']
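
The shape (1480, 1) reported above arises because every description is wrapped in its own one-element row. If a later step expects a flat, one-dimensional sequence of strings, the array can be flattened. The sketch below uses a small toy array (`toy_data`, a hypothetical stand-in for `dataSet`) so that it runs without `descriptions.txt`.

```python
import numpy

# Toy stand-in for dataSet: two rows, each holding a single cleaned string.
toy_data = numpy.array([
    ["round face short and overweight"],
    ["jug ears mustache and beard"],
])

flat = toy_data.ravel()   # flatten (n, 1) to (n,)

print(toy_data.shape)     # (2, 1)
print(flat.shape)         # (2,)
print(flat[0])            # round face short and overweight
```

Whether flattening is needed depends on the model that consumes the data; this is only one possible way to reshape it.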


Now that the input is clean, different models can be trained on it. We expect the cleaned input to yield higher accuracy than a benchmark run.