In [19]:
import numpy
import scipy
import scipy.sparse
import sklearn
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.cluster import KMeans
import sklearn.metrics.pairwise

# Data cleaning

The first step of nearly any machine learning project is data cleaning. Cleaning gives every downstream model a predictable input, so that irregularities (in this case special characters such as quotation marks, the literal token '[comma]', and others) do not disturb the model. The following code loads the data, cleans it, and then stores the cleaned descriptions in an array. Comments in the code explain each step.

### Cleaning

In [2]:
## Set an empty list variable

descriptions = []

with open('descriptions.txt', encoding = "utf8") as f:
    for line in f:
        text = line.lower()                                       ## Lowercase all characters
        text = text.replace("[comma]"," ")                        ## Replace the literal '[comma]' token with a space
        for ch in text:
            if ch < "0" or (ch < "a" and ch > "9") or ch > "z":   ## Anything outside 0-9 and a-z counts as a special character
                text = text.replace(ch," ")                       ## Replace it with a space (this also removes the trailing newline)
        text = ' '.join(text.split())                             ## Collapse repeated whitespace into single spaces
        descriptions.append(text)
dataSet = numpy.array(descriptions)
#print('After running first results, the following sentence was found : ')
##line 496 is the weird one
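
The character-by-character loop above can also be written with a regular expression. The following is a minimal sketch, not the notebook's own code: `clean_description` and `NON_ALNUM` are hypothetical names introduced here, and the sample line is made up for illustration.

```python
import re

# Any run of characters outside 0-9 and a-z (hypothetical helper pattern)
NON_ALNUM = re.compile(r"[^0-9a-z]+")

def clean_description(line):
    """Lowercase a raw line, drop '[comma]' tokens and special characters."""
    text = line.lower().replace("[comma]", " ")
    text = NON_ALNUM.sub(" ", text)      # special characters become a space
    return " ".join(text.split())        # collapse repeated whitespace

# Made-up input line, only to show the effect
sample = 'Round face[comma] "short" & overweight!\n'
cleaned = clean_description(sample)
print(cleaned)
```

Applying `clean_description` to every line of `descriptions.txt` should yield the same kind of array as the loop, while scanning each line once instead of once per special character.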


In [20]:
#numpy.save("descriptions_cleaned_array.npy",dataSet)
dataSet = numpy.load("descriptions_cleaned_array.npy")
#dataSet = numpy.load("coco_val.npy")

The data is now cleaned and stored in an array. The following code prints some basic information about the cleaned array.

In [3]:
print('The size of our data set: ', dataSet.size)
print('The dimension of our dataset are: ', dataSet.shape)
print('\n')
print('-- 0th element of our dataSet --', '\n', dataSet[0])
print('\n')
print('-- 1st element of our dataSet --', '\n', dataSet[1])

The size of our data set:  1479
The dimension of our dataset are:  (1479,)


-- 0th element of our dataSet -- 
 round face short and overweight likes to wear jeans and sweaters drinks wine at dinner short liberal overweight short hair eats at whole foods does not work our very much


-- 1st element of our dataSet -- 
 jug ears mustache and beard and long sideburns stylish hair no laugh lines eyes are clear no drugs or alcohol confident a little overweight from double chin


### Representation

Now that the input is clean, it can be turned into a numerical representation on which models can be trained and evaluated. Conceptually this takes two steps. First, scikit-learn's CountVectorizer counts the occurrences of each word in every description. Then TF-IDF (term frequency, inverse document frequency) weighting is applied, which down-weights non-informative words that appear in most descriptions, such as 'the', 'and', and 'a'. Scikit-learn's TfidfVectorizer performs both steps in a single class.

In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
TfIdf_dataSet = vectorizer.fit_transform(dataSet)
print("What our Tf-Idf looks like: ")
print()
print(TfIdf_dataSet[1:2])

vectorVocab = vectorizer.vocabulary_   ## Mapping from each term to its column index in the Tf-Idf matrix

What our Tf-Idf looks like: 

  (0, 228)	0.0999230235196548
  (0, 2737)	0.16083047896043273
  (0, 1776)	0.06254106376813398
  (0, 2119)	0.3161753448886334
  (0, 1247)	0.14464356086569088
  (0, 2538)	0.19064057405466286
  (0, 419)	0.18076247495377387
  (0, 2292)	0.1401975901762004
  (0, 3445)	0.21120111824703844
  (0, 3749)	0.2492839547283478
  (0, 2604)	0.3123113475808705
  (0, 2188)	0.2417063139284673
  (0, 2262)	0.264108063599575
  (0, 1420)	0.07045478691841404
  (0, 286)	0.11292286212515924
  (0, 772)	0.21289776024949109
  (0, 1217)	0.2324320311261342
  (0, 2694)	0.1235469355267728
  (0, 180)	0.27051486760744353
  (0, 878)	0.20649095624162253
  (0, 2276)	0.14173141580413545
  (0, 1605)	0.1515561961580996
  (0, 1191)	0.3161753448886334
  (0, 740)	0.1644468421153148
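
To make the relation between counting and TF-IDF weighting concrete, the sketch below compares the two-step route (CountVectorizer followed by TfidfTransformer) with TfidfVectorizer on a tiny corpus. The three sentences in `toy_docs` are made up for illustration and are not taken from the data set.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Made-up corpus: 'the' occurs in every document, 'wine' in only one
toy_docs = [
    "the man likes wine",
    "the woman likes jeans",
    "the child likes the dog",
]

# Two-step route: raw counts first, then TF-IDF weighting
counts = CountVectorizer().fit_transform(toy_docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route, as used in the notebook
one_step_vectorizer = TfidfVectorizer()
one_step = one_step_vectorizer.fit_transform(toy_docs)

same_result = np.allclose(two_step.toarray(), one_step.toarray())

# Inverse document frequency of a common word versus a rare word
vocab = one_step_vectorizer.vocabulary_
idf_the = one_step_vectorizer.idf_[vocab["the"]]
idf_wine = one_step_vectorizer.idf_[vocab["wine"]]
print(same_result, idf_the, idf_wine)
```

With default settings both routes give the same matrix. A word that occurs in every document keeps the lowest possible IDF, so it is down-weighted relative to rarer words rather than removed.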


### Cosine similarity

Now we can compute the cosine similarity between every pair of documents. After sorting each row, the 5 most similar documents are kept for every document.

In [14]:
cosineSimilarity = sklearn.metrics.pairwise.cosine_similarity(TfIdf_dataSet)
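cosineSimilaritySorted = numpy.argsort(cosineSimilarity, axis=1)   ## Document indices per row, least similar first
top5Similar = cosineSimilaritySorted[:,-6:-1]                      ## Skip the last column (normally the document itself); the most similar document is the rightmost column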

print(top5Similar)

[[  66  406 1084   65 1453]
 [ 144 1372  797  548  555]
 [1173 1270 1050  523  342]
 ...
 [ 590  446 1170  205 1371]
 [ 406  178  831  966 1063]
 [1033 1143  540  499  279]]
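
The slice `[:,-6:-1]` relies on each document being its own nearest neighbour in the last column. A slightly more explicit variant is sketched below: it masks the diagonal before sorting and returns neighbours from most to least similar. `top_k_similar` is a hypothetical helper and `toy_docs` is made-up data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def top_k_similar(matrix, k=5):
    """Return, per row, the indices of the k most similar other rows."""
    sims = cosine_similarity(matrix)        # dense (n_docs, n_docs) array
    np.fill_diagonal(sims, -np.inf)         # a document never matches itself
    order = np.argsort(-sims, axis=1)       # most similar first
    return order[:, :k]

# Made-up corpus with two obvious pairs
toy_docs = [
    "red apple pie",
    "red apple tart",
    "blue car engine",
    "blue car wheel",
]
toy_matrix = TfidfVectorizer().fit_transform(toy_docs)
nearest = top_k_similar(toy_matrix, k=1)
print(nearest)
```

On the notebook's data the call would be `top_k_similar(TfIdf_dataSet, k=5)`. The columns then run from most to least similar, which is the reverse of `top5Similar`.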


### Interpret the cosine similarity

According to the cosine similarity, sentence 1159 in our dataSet is closest to sentence 1251. Let's see what they both look like:

In [6]:
print('Sentence 1159 in the dataSet: ')
print(dataSet[1159])
print()
print('Sentence 1251 in the dataSet: ')
print(dataSet[1251])

Sentence 1159 in the dataSet: 
somewhat athletic fair skin average weight normal appearance no tattoos parents have a fair amount of money has sexually assaulted a female has never had a job

Sentence 1251 in the dataSet: 
probably skinny they may be going a little bald possibly taller than average possibly hyper energetic they seem like they are a bit mess unkept


### KMeans clustering

Besides finding similar documents by cosine similarity, the following code applies KMeans clustering. This is more meaningful here, because the sentences are known to come in groups of 5 that belong together: with 1479 sentences, that gives 1479 / 5 ≈ 296 clusters. Clustering also allows for topic extraction, which can be interpreted as finding the most important words for each cluster.

In [22]:
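kmeansModel = sklearn.cluster.KMeans(n_clusters=296)   ## Named kmeansModel so the imported KMeans class is not overwritten

In [23]:
KmeansFit = kmeansModel.fit(TfIdf_dataSet)
labels = kmeansModel.predict(TfIdf_dataSet)            ## Cluster label for every document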

In [24]:
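## Note: cluster_centers_ has shape (n_clusters, n_terms), so argsort along axis 1 ranks vocabulary (term) indices, not documents
a = numpy.argsort(KmeansFit.cluster_centers_, axis = 1)   ## Term indices per cluster, lowest centroid weight first
top5SimilarKMeans = a[:,1:6]                              ## Columns 1 to 5 are among the lowest-weighted terms of each cluster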
print(top5SimilarKMeans)

[[2912 2913 2914 2915 2916]
 [2898 2899 2900 2901 2902]
 [2878 2879 2880 2881 2882]
 ...
 [2905 2906 2907 2908 2909]
 [2911 2912 2913 2914 2915]
 [2889 2890 2891 2892 2893]]
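
To obtain the documents that belong together, the cluster label of each document can be used instead of the centroid coordinates. The sketch below groups document indices by label. `documents_per_cluster` is a hypothetical helper, and both the hand-written labels and `toy_docs` are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def documents_per_cluster(labels):
    """Map each cluster label to the list of document indices assigned to it."""
    groups = {}
    for doc_index, label in enumerate(labels):
        groups.setdefault(int(label), []).append(doc_index)
    return groups

# Deterministic illustration with hand-written labels
fixed_groups = documents_per_cluster(np.array([1, 0, 1, 2, 0]))
print(fixed_groups)

# The same helper applied to KMeans labels on a made-up corpus
toy_docs = [
    "red apple pie",
    "red apple tart",
    "blue car engine",
    "blue car wheel",
]
toy_matrix = TfidfVectorizer().fit_transform(toy_docs)
toy_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(toy_matrix)
toy_groups = documents_per_cluster(toy_labels)
print(toy_groups)
```

In the notebook, `documents_per_cluster(labels)` would give one list of sentence indices per cluster. KMeans does not guarantee exactly 5 sentences per cluster, so the lists may differ in length.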


<span style="color:red">I think we still need to order by the first column's value so that we get the lines in a row. Then we save as a CSV as noted in code line below</span>

In [25]:
##save results to results file
numpy.savetxt("results.csv", top5SimilarKMeans, fmt="%d", delimiter=",")   ## Write integer indices, comma-separated

In [10]:
print("Top terms per cluster:")
order_centroids = KmeansFit.cluster_centers_.argsort()[:, ::-1]   ## Term indices per cluster, highest weight first
terms = vectorizer.get_feature_names_out()                        ## scikit-learn >= 1.0; older releases use get_feature_names()
for i in range(3):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind])

Top terms per cluster:
Cluster 0:
 denny
 complextion
 tint
 waitress
 150
 mom
 20
 early
 brown
 one
Cluster 1:
 hang
 with
 light
 out
 likes
 friends
 skin
 brown
 weekend
 working
Cluster 2:
 she
 her
 to
 weighing
 has
 everything
 wear
 keep
 active
 way


### Results
To get a better estimate of how well our text-similarity approach works, several tests are performed on a test set.