# Loading dependencies

In [1]:
import numpy
import sklearn
import sklearn.metrics.pairwise
from sklearn.metrics.pairwise import pairwise_distances
import string
import collections
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from nltk import word_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from pprint import pprint
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/rickmackenbach/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/rickmackenbach/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

# Data cleaning

The first step of nearly any machine learning project is data cleaning. Cleaning gives every model a predictable input, so that exceptions (in this case special characters such as quotation marks, the literal token '[comma]', and others) do not disturb the model. The following code loads the data, cleans it, and stores the cleaned descriptions in an array. Comments in the code explain each step. An NLTK-based helper for tokenizing and stemming is defined as well.

### Cleaning

In [2]:
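def process_text(text, stem=True):
    """ Remove punctuation, tokenize the text, and optionally stem each token """
    ## str.translate needs a translation table; this one deletes every punctuation character
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = word_tokenize(text)
 
    if stem:
        stemmer = PorterStemmer()
        tokens = [stemmer.stem(t) for t in tokens]
 
    return tokens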

In [3]:
## Set an empty list variable

descriptions = []

with open('descriptions.txt', encoding = "utf8") as f:
    for line in f:
        text = line.lower()                                       ## Lowercase all characters
        text = text.replace("[comma]"," ")                        ## Replace the literal token [comma] with a space
        for ch in text:
            if ch < "0" or (ch < "a" and ch > "9") or ch > "z":   ## The cleaning operation happens here, remove all special characters
                text = text.replace(ch," ")
        text = ' '.join(text.split())                             ## Remove double spacing from sentences
        descriptions.append(text)
dataSet = numpy.array(descriptions)

print('After running first results, the following sentence was found : ')
print()
print(dataSet[496])
print()
print()
print('Although this sentence is not meaningful, it will remain in the dataset in order to have consistent results.')

After running first results, the following sentence was found : 

rwevfderdcerfadwerfdewadrwevfderdcerfadwerfdewadrwevfderdcerfadwerfdewadrwevfderdcerfadwerfdewadrwevfderdcerfadwerfdewadrwevfderdcerfadwerfdewadrwevfderdcerfadwerfdewadrwevfderdcerfadwerfdewadrwevfderdcerfadwerfdewadrwevfderdcerfadwerfdewadrwevfderdcerfadwerfdewadrwevfderdcerfadwerfdewadrwevfderdcerfadwerfdewad rwevfderdcerfadwerfdewad rwevfderdcerfadwerfdewad rwevfderdcerfadwerfdewad rwevfderdcerfadwerfdewad rwevfderdcerfadwerfdewad rwevfderdcerfadwerfdewad rwevfderdcerfadwerfdewad rwevfderdcerfadwerfdewad rwevfderdcerfadwerfdewad rwevfderdcerfadwerfdewad rwevfderdcerfadwerfdewad rwevfderdcerfadwerfdewadrwevfderdcerfadwerfdewadrwevfderdcerfadwerfdewadrwevfderdcerfadwerfdewadrwevfderdcerfadwerfdewadrwevfderdcerfadwerfdewadrwevfderdcerfadwerfdewadrwevfderdcerfadwerfdewadrwevfderdcerfadwerfdewadrwevfderdcerfadwerfdewadrwevfderdcerfadwerfdewadrwevfderdcerfadwerfdewadrwevfderdcerfadwerfdewadrwevfderdcerfadwerfdewadrwevfderdc
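
The character-by-character loop above keeps only the digits 0-9 and the letters a-z. As a minimal sketch, assuming the same rules are wanted, the cleaning of one line can also be written with a single regular expression. The function name `clean_line` and the sample input are made up for illustration and are not part of the original notebook.

```python
import re

def clean_line(line):
    """Hypothetical helper: clean one raw description line."""
    text = line.lower()                      # Lowercase all characters
    text = text.replace("[comma]", " ")      # Replace the literal token [comma] with a space
    text = re.sub(r"[^0-9a-z]", " ", text)   # Replace everything except 0-9 and a-z with a space
    return " ".join(text.split())            # Collapse repeated whitespace

# Made-up sample line, not taken from descriptions.txt
sample = 'Round face[comma] "short" & overweight!\n'
cleaned = clean_line(sample)
print(cleaned)
```

A regular expression scans each line once, whereas the loop calls `text.replace` again for every special character it meets.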

### Representation

Now that the input is 'clean', it can be turned into a numerical representation that can be used to compare and cluster the sentences. One option is to first run scikit-learn's CountVectorizer (which counts the occurrences of each word) on the cleaned dataset and then apply a Tf-Idf (term frequency, inverse document frequency) weighting, which puts less importance on words that occur in many documents and are therefore less informative. Scikit-learn provides both steps in a single class, namely TfidfVectorizer. Very common words such as 'the', 'and', and 'a' are dropped entirely here through stop_words='english'.

In [4]:
vectorizer = TfidfVectorizer(stop_words='english')
TfIdf_dataSet = vectorizer.fit_transform(dataSet)
print("What our Tf-Idf looks like: ")
print()
print(TfIdf_dataSet[0:1])

vectorVocab = vectorizer.vocabulary_    ## Mapping from each term to its column index

What our Tf-Idf looks like: 

  (0, 3085)	0.18016293756155294
  (0, 1351)	0.10858700066557808
  (0, 3257)	0.3001624823648848
  (0, 2585)	0.34043262183046635
  (0, 2139)	0.11306416211646501
  (0, 4024)	0.21854146699056293
  (0, 1981)	0.25581194154091114
  (0, 3626)	0.31679149916873117
  (0, 1152)	0.24315753998842082
  (0, 4070)	0.2684663430934014
  (0, 1073)	0.30413709761624086
  (0, 2120)	0.2684663430934014
  (0, 1687)	0.06619086894558779
  (0, 1200)	0.26383180239523174
  (0, 1473)	0.31679149916873117
  (0, 1117)	0.16480321090672406
  (0, 4092)	0.16167588241502212
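
Each line of the output above reads as (row, column) followed by the Tf-Idf weight, where the column is the index of a term in the vocabulary. The sketch below uses a made-up toy corpus (`toy_docs` is not part of the dataset). It checks that TfidfVectorizer gives the same matrix as CountVectorizer followed by TfidfTransformer, and maps the column indices of one row back to their terms.

```python
import numpy
from sklearn.feature_extraction.text import (CountVectorizer,
                                             TfidfTransformer,
                                             TfidfVectorizer)

# Made-up toy corpus, for illustration only
toy_docs = ["round face short hair",
            "short beard long hair",
            "likes wine at dinner"]

# One step: counts and Tf-Idf weighting together
tfidf = TfidfVectorizer(stop_words='english')
direct = tfidf.fit_transform(toy_docs)

# Two steps: raw counts first, Tf-Idf weighting afterwards
counts = CountVectorizer(stop_words='english').fit_transform(toy_docs)
two_step = TfidfTransformer().fit_transform(counts)

same = numpy.allclose(direct.toarray(), two_step.toarray())
print("Same matrix:", same)

# vocabulary_ maps term -> column index; invert it to read a sparse row
index_to_term = {idx: term for term, idx in tfidf.vocabulary_.items()}
row0 = direct[0]
for col, weight in zip(row0.indices, row0.data):
    print(index_to_term[col], round(float(weight), 3))
```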


The data is now cleaned and stored as a sparse Tf-Idf matrix. The following code prints some basic information about the cleaned text array, dataSet.

In [5]:
print('The size of our data set: ', dataSet.size)
print('The dimension of our dataset are: ', dataSet.shape)
print('\n')
print('-- 0th element of our dataSet --', '\n', dataSet[0])
print('\n')
print('-- 1st element of our dataSet --', '\n', dataSet[1])

The size of our data set:  1480
The dimension of our dataset are:  (1480,)


-- 0th element of our dataSet -- 
 round face short and overweight likes to wear jeans and sweaters drinks wine at dinner short liberal overweight short hair eats at whole foods does not work our very much


-- 1st element of our dataSet -- 
 jug ears mustache and beard and long sideburns stylish hair no laugh lines eyes are clear no drugs or alcohol confident a little overweight from double chin


# Distance metrics

## Cosine similarity

Now we can compute the similarity between every pair of documents. After sorting, the 5 highest-ranked documents per row are shown as indices into dataSet. The result should be read per row: the first element of each row is the 'base' sentence itself, and the elements that follow are the sentences most similar to it, in decreasing order of similarity. For example, the second element of the first row is the sentence most similar to the first element of the first row.

In [6]:
## Make use of SKlearn cosine similarity
cosineSimilarity = sklearn.metrics.pairwise.cosine_similarity(TfIdf_dataSet)
print(cosineSimilarity)

[[1.         0.06419899 0.01454109 ... 0.00629138 0.06862054 0.07042177]
 [0.06419899 1.         0.06743771 ... 0.04726818 0.06877843 0.04734656]
 [0.01454109 0.06743771 1.         ... 0.00565678 0.065617   0.02817174]
 ...
 [0.00629138 0.04726818 0.00565678 ... 1.         0.00482428 0.00497692]
 [0.06862054 0.06877843 0.065617   ... 0.00482428 1.         0.07985904]
 [0.07042177 0.04734656 0.02817174 ... 0.00497692 0.07985904 1.        ]]


In [7]:
## Set the diagonal above the maximum similarity of 1 so each sentence ranks first in its own row, then sort every row in descending order
numpy.fill_diagonal(cosineSimilarity,1.1)
cosineSimilaritySorted = numpy.argsort((-1*(cosineSimilarity)),axis=1)
top5similar = (cosineSimilaritySorted[:,0:5])
print()
print(top5similar)


[[   0 1454   65   66  406]
 [   1  556  549 1373  944]
 [   2  342    4  288  379]
 ...
 [1477 1372  210  902  681]
 [1478  967  706  669 1084]
 [1479 1341 1144  500  773]]
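
Because the first column of every row is the sentence itself, only four of the five columns are real neighbours. As a sketch of an alternative, assuming the self-match is not wanted in the result, the diagonal can be pushed to the bottom of the ranking instead of the top. The helper name `top_k_neighbours` and the toy vectors are hypothetical.

```python
import numpy
from sklearn.metrics.pairwise import cosine_similarity

def top_k_neighbours(vectors, k):
    """Hypothetical helper: indices of the k most similar rows, excluding the row itself."""
    sim = cosine_similarity(vectors)
    numpy.fill_diagonal(sim, -numpy.inf)     # A row can never be its own neighbour
    order = numpy.argsort(-sim, axis=1)      # Most similar first
    return order[:, :k]

# Made-up 2-dimensional vectors: rows 0 and 1 point the same way, as do rows 2 and 3
toy = numpy.array([[1.0, 0.0],
                   [0.9, 0.1],
                   [0.0, 1.0],
                   [0.1, 0.9]])
neighbours = top_k_neighbours(toy, 1)
print(neighbours)
```

Applied to the notebook's matrix, this would be called as `top_k_neighbours(TfIdf_dataSet, 5)`.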


### Interpretation of the cosine similarity

According to the cosine metric, the first sentence in our dataSet is closest to sentence 1455 (index 1454). Let's see what they both look like:

In [8]:
print('Sentence 1 in the dataSet: ')
print(dataSet[0])
print()
print('Sentence 1455 in the dataSet: ')
print(dataSet[1454])

Sentence 1 in the dataSet: 
round face short and overweight likes to wear jeans and sweaters drinks wine at dinner short liberal overweight short hair eats at whole foods does not work our very much

Sentence 1455 in the dataSet: 
she is older looking and has some wrinkles she has a round face and a round nose has 2 kids has 1 grandkid not very happy likes to drink wine


## Euclidean distance

In [9]:
euclid = pairwise_distances(TfIdf_dataSet, metric='euclidean')
euclidSorted = numpy.argsort(euclid, axis=1)
## Column 0 is normally the sentence itself (distance 0), so 0:6 keeps the 5 nearest neighbours
top5SimilarEuclidean = euclidSorted[:,0:6]
print(top5SimilarEuclidean)

[[   0 1454   65   66  406 1085]
 [   1  556  549 1373  944  144]
 [   2  342    4  288  379 1417]
 ...
 [1477 1372  210  902  681  392]
 [1478  967  706  669 1084  625]
 [1479 1341 1144  500  773  541]]


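Interpretation of the Euclidean distance: for the rows shown, the first five columns are identical to the cosine result, so the sentence closest to sentence 1 is again sentence 1455.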

In [10]:
print('Sentence 1 in the dataSet: ')
print(dataSet[0])
print()
print('Sentence 1455 in the dataSet: ')
print(dataSet[1454])

Sentence 1 in the dataSet: 
round face short and overweight likes to wear jeans and sweaters drinks wine at dinner short liberal overweight short hair eats at whole foods does not work our very much

Sentence 1455 in the dataSet: 
she is older looking and has some wrinkles she has a round face and a round nose has 2 kids has 1 grandkid not very happy likes to drink wine
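
The two metrics agreeing is expected here. TfidfVectorizer scales every row to unit length by default (norm='l2'), and for unit vectors a and b the squared Euclidean distance equals 2 - 2 * cos(a, b). A larger cosine similarity therefore always means a smaller Euclidean distance. The sketch below checks this identity numerically on a made-up toy corpus (`toy_docs` is an assumption, not part of the dataset).

```python
import numpy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, pairwise_distances

# Made-up toy corpus, for illustration only
toy_docs = ["round face short hair",
            "short beard long hair",
            "likes red wine with dinner",
            "round nose likes wine"]

X = TfidfVectorizer().fit_transform(toy_docs)   # Rows are L2-normalised by default

cos = cosine_similarity(X)
euc = pairwise_distances(X, metric='euclidean')

# For unit-length rows: ||a - b||^2 == 2 - 2 * cos(a, b)
identity_holds = numpy.allclose(euc ** 2, 2 - 2 * cos, atol=1e-6)
print("Identity holds:", identity_holds)
```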


# KMeans clustering

Besides finding similar documents by cosine similarity, the following code applies KMeans clustering. Clustering fits this dataset well, since it is known that the sentences come in groups of 5 that belong together, which sets the number of clusters to 1480 / 5 = 296. It also allows for topic extraction, which can be interpreted as finding the most important words for each cluster.
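
As a minimal sketch of what topic extraction could look like, the terms with the largest weights in each cluster centre can be read as the most important words of that cluster. The toy sentences, the variable names, and the fixed `random_state` are assumptions made for illustration; the notebook itself does not run this step.

```python
import numpy
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up toy sentences: two about appearance, two about wine
toy_sentences = ["round face and short hair",
                 "short hair and a round nose",
                 "likes red wine at dinner",
                 "drinks red wine at dinner"]

vec = TfidfVectorizer(stop_words='english')
X = vec.fit_transform(toy_sentences)

km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)

# vocabulary_ maps term -> column index; invert it to name the columns
index_to_term = {idx: term for term, idx in vec.vocabulary_.items()}

# For every cluster centre, take the 3 columns with the largest weight
top_terms = {}
for cluster_id, centre in enumerate(km.cluster_centers_):
    best = numpy.argsort(centre)[::-1][:3]
    top_terms[cluster_id] = [index_to_term[j] for j in best]

for cluster_id, terms in top_terms.items():
    print(cluster_id, terms)
```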

## Cleaning the dataset

In [11]:
sentences = []

with open('descriptions.txt', encoding = "utf8") as f:
    for line in f:
        text = line.lower()                                       ## Lowercase all characters
        text = text.replace("[comma]"," ")                        ## Replace the literal token [comma] with a space
        for ch in text:
            if ch < "0" or (ch < "a" and ch > "9") or ch > "z":   ## The cleaning operation happens here, remove all special characters
                text = text.replace(ch," ")
        text = ' '.join(text.split())                             ## Remove double spacing from sentences
        sentences.append(text)
        
#sentences = sentences[0:100]
nclusters = int(len(sentences)/5)

print(len(sentences), nclusters)


1480 296


In [12]:
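## Build the stop word set once; fetching the list again for every token is slow
english_stopwords = set(stopwords.words('english'))

def word_tokenizer(text):
    ## Tokenizes the text, drops stop words, and stems the remaining tokens
    tokens = word_tokenize(text)
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(t) for t in tokens if t not in english_stopwords]
    return tokens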

We transform the dataset into a Tf-Idf matrix so it can be used with KMeans.

In [13]:
def cluster_sentences(sentences, nb_of_clusters):
    tfidf_vectorizer = TfidfVectorizer(tokenizer=word_tokenizer,
                                       stop_words=stopwords.words('english'),
                                       max_df=0.9,
                                       min_df=0.1,
                                       lowercase=True)
    ## Build a tf-idf matrix for the sentences
    tfidf_matrix = tfidf_vectorizer.fit_transform(sentences)
    kmeans = KMeans(n_clusters=nb_of_clusters)

    ## fit_predict fits the model and returns one cluster label per sentence,
    ## so a separate call to fit is not needed
    kmeans_array = kmeans.fit_predict(tfidf_matrix)

    ## Group the sentence indices by cluster label
    clusters = collections.defaultdict(list)
    for i, label in enumerate(kmeans_array):
        clusters[label].append(i)

    return dict(clusters), kmeans_array

We train the KMeans in the next lines of code.

In [14]:
clusters, kmeans_predict = cluster_sentences(sentences, nclusters)

In [15]:
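output = []

## For every sentence i, list i followed by all other sentences in the same cluster
for i in range(len(kmeans_predict)):
    temp = [i]
    for j in range(len(kmeans_predict)):       ## Start at 0 so sentence 0 can appear as a neighbour
        if j != i and kmeans_predict[i] == kmeans_predict[j]:
            temp += [j]
    output.append(temp)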
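
The nested loop above compares every pair of sentences. As a sketch of a cheaper alternative, assuming only the cluster labels are needed, the sentences can first be grouped by label and each row can then be read from its group. The helper name `neighbours_from_labels` and the toy labels are hypothetical, and rows shorter than `width` are left unpadded here.

```python
import collections

def neighbours_from_labels(labels, width=5):
    """Hypothetical helper: for each index, the index itself followed by
    the other members of its cluster, cut off at `width` entries."""
    members = collections.defaultdict(list)
    for i, label in enumerate(labels):
        members[label].append(i)             # Group indices by cluster label

    rows = []
    for i, label in enumerate(labels):
        others = [j for j in members[label] if j != i]
        rows.append(([i] + others)[:width])
    return rows

# Made-up labels: indices 0, 2, 3 share one cluster, indices 1, 4 another
toy_labels = [0, 1, 0, 0, 1]
rows = neighbours_from_labels(toy_labels)
for row in rows:
    print(row)
```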


In [16]:
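## Force every row to exactly 5 entries
for index, line in enumerate(output):
    if len(line) > 5:
        output[index] = line[0:5]                ## Keep the first 5 entries
    elif len(line) > 1:
        line += [line[1]]*(5-len(line))          ## Pad with the first neighbour
    else:
        line += [line[0]]*4                      ## No neighbours: pad with the sentence itself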

for line in output:
    print(line)

[0, 158, 353, 844, 1305]
[1, 1405, 1405, 1405, 1405]
[2, 811, 969, 1418, 811]
[3, 48, 352, 623, 636]
[4, 809, 1050, 809, 809]
[5, 930, 954, 979, 1314]
[6, 216, 799, 846, 1038]
[7, 520, 596, 642, 699]
[8, 383, 420, 1323, 1421]
[9, 849, 946, 1321, 849]
[10, 30, 59, 74, 91]
[11, 79, 89, 112, 515]
[12, 300, 507, 614, 1416]
[13, 101, 204, 306, 662]
[14, 510, 549, 867, 1048]
[15, 314, 366, 480, 557]
[16, 71, 333, 782, 898]
[17, 266, 273, 555, 669]
[18, 122, 294, 347, 1274]
[19, 319, 562, 319, 319]
[20, 506, 959, 1458, 506]
[21, 22, 524, 588, 700]
[22, 21, 524, 588, 700]
[23, 65, 447, 622, 65]
[24, 50, 676, 876, 1033]
[25, 540, 540, 540, 540]
[26, 146, 547, 810, 848]
[27, 90, 187, 304, 456]
[28, 142, 281, 901, 1026]
[29, 284, 423, 784, 820]
[30, 10, 59, 74, 91]
[31, 718, 761, 1139, 1140]
[32, 263, 459, 731, 1074]
[33, 237, 260, 439, 1133]
[34, 274, 607, 906, 949]
[35, 42, 1053, 42, 42]
[36, 1039, 1087, 1039, 1039]
[37, 481, 568, 734, 1344]
[38, 129, 1066, 1392, 129]
[39, 236, 1424, 236, 236]


[1358, 225, 264, 1061, 1067]
[1359, 641, 687, 711, 878]
[1360, 199, 1202, 199, 199]
[1361, 530, 768, 783, 1093]
[1362, 132, 701, 1335, 132]
[1363, 3, 48, 352, 623]
[1364, 120, 513, 690, 791]
[1365, 44, 224, 355, 412]
[1366, 192, 368, 581, 818]
[1367, 414, 780, 894, 414]
[1368, 86, 538, 617, 1027]
[1369, 370, 1291, 370, 370]
[1370, 556, 674, 899, 1030]
[1371, 599, 654, 888, 1157]
[1372, 110, 259, 446, 673]
[1373, 610, 723, 807, 931]
[1374, 322, 377, 491, 952]
[1375, 861, 861, 861, 861]
[1376, 349, 1031, 349, 349]
[1377, 115, 145, 168, 419]
[1378, 102, 152, 202, 246]
[1379, 43, 469, 914, 965]
[1380, 1017, 1457, 1017, 1017]
[1381, 490, 635, 838, 907]
[1382, 558, 594, 765, 1412]
[1383, 249, 532, 249, 249]
[1384, 449, 877, 1143, 449]
[1385, 360, 519, 649, 996]
[1386, 203, 1110, 203, 203]
[1387, 626, 991, 1295, 626]
[1388, 57, 97, 156, 233]
[1389, 93, 424, 1004, 1032]
[1390, 410, 961, 410, 410]
[1391, 984, 984, 984, 984]
[1392, 38, 129, 1066, 38]
[1393, 502, 552, 502, 502]
[1394, 94, 96, 393

## Interpretation of the KMeans results

In [17]:
print('The first sentence:')
print(dataSet[0])
print()
print('The 159th sentence:')
print(dataSet[158])

The first sentence:
round face short and overweight likes to wear jeans and sweaters drinks wine at dinner short liberal overweight short hair eats at whole foods does not work our very much

The 159th sentence:
f physical appearance that are fairly constant during an encounter i e excluding emotional expression are partly biologically their emotional state as shown by the drabness or brightness of their clothes life is too short for shitty wine you if you can feel people pulling away you should ask them about it
