In [1]:
import numpy
import scipy
import scipy.sparse
import sklearn
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.cluster import KMeans
import sklearn.metrics.pairwise

# Data cleaning

The first step of nearly any machine learning project is data cleaning. Cleaning gives every downstream model a predictable input, so that exceptions (in this case special characters such as quotation marks, the literal token '[comma]' and others) do not disturb the model. The following code loads the data, cleans it, and stores the cleaned sentences in an array. Comments in the code explain each step.

### Cleaning

In [2]:
## Set an empty list variable

descriptions = []

with open('descriptions.txt', encoding = "utf8") as f:
    for line in f:
        text = line.lower()                                       ## Lowercase all characters
        text = text.replace("[comma]"," ")                        ## Replace the '[comma]' token with a space
        for ch in set(text):                                      ## Visit each distinct character once
            if ch < "0" or (ch < "a" and ch > "9") or ch > "z":   ## Keep only 0-9 and a-z; every other character becomes a space
                text = text.replace(ch," ")
        text = ' '.join(text.split())                             ## Collapse repeated whitespace into single spaces
        descriptions.append(text)
dataSet = numpy.array(descriptions)
#print('After running first results, the following sentence was found : ')
##line 496 is the weird one
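
The same filter can also be written with a regular expression. The sketch below is an alternative to the character loop above, not the code that produced the results in this notebook; the function name `clean_description` and the sample sentence are made up for illustration.

```python
import re

def clean_description(line):
    """Lowercase, drop the '[comma]' token, keep only 0-9 and a-z, collapse spaces."""
    text = line.lower().replace("[comma]", " ")
    # Every run of characters outside 0-9 and a-z becomes a single space.
    text = re.sub(r"[^0-9a-z]+", " ", text)
    # Remove leading/trailing spaces and any remaining double spacing.
    return " ".join(text.split())

# Hypothetical raw line, including quotes, the '[comma]' token and a newline.
sample = 'A child holding a "flowered" umbrella[comma] petting a yak.\n'
cleaned = clean_description(sample)
print(cleaned)
```

Wrapping the cleaning step in a function makes it easy to try on a single sentence before running it over the whole file.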


In [3]:
#numpy.save("descriptions_cleaned_array.npy",dataSet)
#dataSet = numpy.load("descriptions_cleaned_array.npy")
dataSet = numpy.load("coco_val.npy")

The data is now cleaned and stored in an array (note that the cell above loads the array saved in coco_val.npy rather than reusing the one just built). The following code prints some basic information about the array.

In [4]:
dataSet = dataSet[0:50]
print('The size of our data set: ', dataSet.size)
print('The dimension of our dataset are: ', dataSet.shape)
print('\n')
print('-- 0th element of our dataSet --', '\n', dataSet[0])
print('\n')
print('-- 1st element of our dataSet --', '\n', dataSet[1])

The size of our data set:  50
The dimension of our dataset are:  (50,)


-- 0th element of our dataSet -- 
 a child holding a flowered umbrella and petting a yak


-- 1st element of our dataSet -- 
 a young man holding an umbrella next to a herd of cattle


### Representation

Now that the input is clean, different representations can be built from it, on which models can then be trained and evaluated. Conceptually this takes two steps. First, scikit-learn's CountVectorizer counts how often each word occurs in each sentence. Afterwards a Tf-Idf (term frequency, inverse document frequency) weighting is applied, which puts less importance on non-informative words such as 'the', 'and' and 'a'. Scikit-learn's TfidfVectorizer performs both steps in a single call.
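
To make the weighting concrete, here is a minimal pure-Python sketch of the Tf-Idf computation. It assumes scikit-learn's documented defaults (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row) and a simplified tokenizer that splits on spaces and drops one-character words. The toy sentences and the helper name `tfidf_rows` are illustrative only.

```python
import math
from collections import Counter

# Hypothetical, already cleaned sentences.
toy_docs = [
    "a child holding a flowered umbrella",
    "a young man holding an umbrella",
    "a kitchen with white appliances",
]

def tfidf_rows(docs):
    """Return one {term: weight} dict per document (smoothed idf, L2-normalised)."""
    # Simplified tokenizer: split on spaces, keep words of two or more characters.
    tokenised = [[w for w in d.split() if len(w) >= 2] for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    rows = []
    for tokens in tokenised:
        counts = Counter(tokens)
        # Term count times smoothed inverse document frequency.
        raw = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0)
               for t, c in counts.items()}
        # Scale the row to unit Euclidean length.
        norm = math.sqrt(sum(v * v for v in raw.values()))
        rows.append({t: v / norm for t, v in raw.items()})
    return rows

rows = tfidf_rows(toy_docs)
for term, weight in sorted(rows[0].items()):
    print(term, round(weight, 3))
```

Words that occur in only one sentence (such as 'child') receive a higher weight than words shared between sentences (such as 'holding'), which is the effect described above.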

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
TfIdf_dataSet = vectorizer.fit_transform(dataSet)
print("What our Tf-Idf looks like: ")
print()
print(TfIdf_dataSet[0:1])

vectorVocab = vectorizer.vocabulary_   ## Fitted mapping from each term to its column index

What our Tf-Idf looks like: 

  (0, 24)	0.43611951897636925
  (0, 64)	0.28136432388711935
  (0, 48)	0.43611951897636925
  (0, 155)	0.32308280994485267
  (0, 2)	0.23590395467851702
  (0, 110)	0.43611951897636925
  (0, 169)	0.43611951897636925


### Cosine similarity

Now we can compute the cosine similarity between every pair of documents. After sorting each row by descending similarity, the 5 most similar documents per sentence are kept. The first column of the printed matrix holds the index of the 'base' sentence itself. The columns that follow hold the indices of the sentences most similar to that 'base' sentence.
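
As a small illustration of the ranking step, the sketch below computes cosine similarity by hand with NumPy and picks the nearest neighbour of each row. The 4 x 3 matrix is a hypothetical stand-in for the Tf-Idf matrix, and the variable names are not taken from the notebook.

```python
import numpy

# Hypothetical weight matrix: 4 documents (rows) described by 3 terms (columns).
toy_vectors = numpy.array([
    [1.0, 1.0, 0.0],
    [1.0, 0.8, 0.0],
    [0.0, 0.2, 1.0],
    [0.1, 0.0, 1.0],
])

# Cosine similarity is the dot product of the L2-normalised rows.
unit = toy_vectors / numpy.linalg.norm(toy_vectors, axis=1, keepdims=True)
similarity = unit @ unit.T

# Negate so that argsort (ascending) yields descending similarity per row.
ranking = numpy.argsort(-similarity, axis=1)

# Column 0 is the document itself, column 1 its nearest neighbour.
nearest = ranking[:, 1]
print(nearest)
```

Documents 0 and 1 point in almost the same direction, as do documents 2 and 3, so each pair ends up as mutual nearest neighbours.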

In [6]:
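cosineSimilarity = sklearn.metrics.pairwise.cosine_similarity(TfIdf_dataSet)
numpy.fill_diagonal(cosineSimilarity,1)                                   ## Force self-similarity to exactly 1 so each sentence ranks first in its own row
cosineSimilaritySorted = numpy.argsort((-1*(cosineSimilarity)),axis=1)    ## Negate to sort each row by descending similarity
top5similar = (cosineSimilaritySorted[:,0:6])                             ## Column 0 is the sentence itself, columns 1-5 its top 5 matches
print(top5similar)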

[[ 0  4  1  2 12  3]
 [ 1  2  4  3 12 10]
 [ 2  3  1  4 12  0]
 [ 3  2  1  4 18  9]
 [ 4  1  2  3 10  0]
 [ 5  6  7  8 43 17]
 [ 6  5  7  8 28 39]
 [ 7  5  6  8  9 17]
 [ 8  9 43  5 34  6]
 [ 9 18  8 12  3 26]
 [10 11 12 13  4  1]
 [11 10 49 26 48 12]
 [12 13 14  9  1 26]
 [13 12 14 47 10 46]
 [14 12 13 21 26 10]
 [15 30 32 35 16 48]
 [16 15 18 49 45 48]
 [17 19 43 34 16  5]
 [18  9 45 44 15 26]
 [19 17 18 31 15 30]
 [20 23 21 22 37 15]
 [21 20 23 14 41 22]
 [22 23 20 21 24 43]
 [23 22 20 21 24 43]
 [24 46 23 36 22 35]
 [25 29 27 18 15 32]
 [26 29 27 28 18 12]
 [27 26 28 25 36 29]
 [28 26 29 27 39 36]
 [29 26 28 25 27 15]
 [30 32 15 44 35 40]
 [31 18 19  9 15 26]
 [32 30 15 35 40 44]
 [33 43 34 15 30 24]
 [34 43 44 15 40 33]
 [35 36 37 15 32 39]
 [36 35 39 37 38 28]
 [37 35 36 39 20  1]
 [38 39 36 28 27 46]
 [39 38 36 28 35 37]
 [40 30 32 43 44 34]
 [41 49 47 21 40  3]
 [42 44 43 30 40 32]
 [43 34 44 42 15 40]
 [44 43 30 18 15 34]
 [45 46 47 49 18 48]
 [46 47 45 48 49 13]
 [47 46 45 48

### Interpret the cosine similarity

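According to the cosine metric, the first sentence in our data set is closest to the 1455th sentence in our data set. (In the 50-sentence subset used above, the printed matrix instead lists sentence 4 as the nearest neighbour of sentence 0, and index 1454 does not exist in that subset.) Let's see what they both look like: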

In [7]:
# print('Sentence 1 in the dataSet: ')
# print(dataSet[0])
# print()
# print('Sentence 1455 in the dataSet: ')
# print(dataSet[1454])

### KMeans clustering

Besides finding similar documents by cosine similarity, the following code applies KMeans clustering. This is more meaningful here, since it is known that the sentences come in groups of 5 that belong together. The number of clusters is therefore the number of sentences divided by 5 (296 in the original setup; the code below uses 10, which matches the 50-sentence subset). Clustering also allows for topic extraction, which can be interpreted as the most important words for each cluster.
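
The choice of the number of clusters can be sketched on a toy example. The six sentences below are invented, and the group size of 3 is an assumption made to keep the example short. `random_state` and `n_init` are set only to make the sketch repeatable; they are not settings used elsewhere in this notebook.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical cleaned sentences: two groups of three that belong together.
toy_docs = [
    "a kitchen with white appliances",
    "white appliances in a small kitchen",
    "a kitchen with a stove and white appliances",
    "cars parked next to parking meters",
    "parking meters on a street with cars",
    "a street with cars and parking meters",
]

# One cluster per group of sentences that belong together.
sentences_per_group = 3
n_clusters = len(toy_docs) // sentences_per_group

# Same pipeline as in the notebook: Tf-Idf weights, then KMeans.
weights = TfidfVectorizer().fit_transform(toy_docs)
model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(weights)
labels = model.labels_
print(labels)
```

The cluster numbers themselves are arbitrary, so only the grouping matters: sentences from the same group should share a label.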

In [8]:
kmeansModel = KMeans(n_clusters=10)   ## Named kmeansModel so the imported KMeans class is not shadowed

In [9]:
KmeansFit = kmeansModel.fit(TfIdf_dataSet)
print(KmeansFit.labels_)

[5 5 5 5 5 0 0 0 4 7 9 3 6 6 6 9 7 4 7 7 8 8 8 8 8 7 3 3 3 3 9 7 9 4 4 2 2
 2 2 2 9 9 4 4 4 1 1 1 1 1]


In [10]:
cosineSimilarityK = sklearn.metrics.pairwise.cosine_similarity(KmeansFit.cluster_centers_)
numpy.fill_diagonal(cosineSimilarityK,1)
cosineSimilaritySortedK = numpy.argsort((-1*(cosineSimilarityK)),axis=0)
print(cosineSimilaritySortedK)
top5similarK = (cosineSimilaritySortedK)

[[0 1 2 3 4 5 6 7 8 9]
 [4 7 3 7 9 9 5 9 7 4]
 [7 9 1 2 7 6 3 4 4 7]
 [2 3 9 1 0 7 1 3 2 5]
 [1 2 8 6 8 1 7 1 9 1]
 [8 5 5 9 1 2 9 5 5 2]
 [3 4 7 5 2 3 8 8 0 3]
 [5 6 0 8 5 8 4 0 1 8]
 [6 0 4 0 6 4 0 2 6 6]
 [9 8 6 4 3 0 2 6 3 0]]


In [11]:
##save results to results file
numpy.savetxt("results.csv",top5similarK)

In [12]:
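print("Top terms per cluster:")
## Sort the term indices of each centroid by descending weight
order_centroids = KmeansFit.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names_out()   ## get_feature_names() was removed in scikit-learn 1.2
for i in range(8):                            ## Only the first 8 of the 10 clusters are printed
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :5]:        ## The 5 highest-weighted terms of cluster i
        print(' %s' % terms[ind])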

Top terms per cluster:
Cluster 0:
 appliances
 kitchen
 with
 and
 white
Cluster 1:
 helmet
 toilet
 red
 yellow
 it
Cluster 2:
 parking
 cars
 meters
 street
 meter
Cluster 3:
 riding
 street
 bike
 her
 down
Cluster 4:
 bathroom
 bathtub
 pedestal
 sink
 claw
Cluster 5:
 umbrella
 holding
 an
 boy
 young
Cluster 6:
 cat
 girl
 small
 holding
 shirt
Cluster 7:
 the
 bathroom
 is
 door
 to


### Results
To get a better estimate of how well our text similarity works, several tests are performed on a test set.