## Text Classification

### Part 1) Similarity analysis

Below are 3 different texts. Calculate the level of similarity between them and find out which of the texts are most similar to each other.

Steps:

- create the feature vector of each text
- make all vectors comparable, meaning that the same position in each vector represents the same word. To achieve this we have to:
    - create a unified vocabulary from the feature vectors
    - recreate the feature vectors, ensuring that each position in the vectors represents the same word

A feature vector is a numerical representation of a text, e.g., the frequencies of a set of features/words in the text. This type of vector representation is commonly used for tasks like text classification, clustering, or similarity analysis.
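
As a minimal sketch of the two steps on toy data: the sentences `doc_a` and `doc_b` are made up for illustration, and tokens are split on whitespace instead of using a real tokenizer. Sorting the vocabulary is an assumption added here so that the positions are the same on every run.

```python
from collections import Counter

# Two toy "texts", already lower-cased; split on whitespace for simplicity.
doc_a = "the bat is a white bat".split()
doc_b = "the bat is gray".split()

# Step 1: one feature vector (word -> frequency) per text.
vec_a = Counter(doc_a)
vec_b = Counter(doc_b)

# Step 2: unified vocabulary, sorted so every position has a fixed meaning.
vocabulary = sorted(set(vec_a) | set(vec_b))

# Step 3: rebuild both vectors over the shared vocabulary (0 = word absent).
aligned_a = [vec_a.get(word, 0) for word in vocabulary]
aligned_b = [vec_b.get(word, 0) for word in vocabulary]

print(vocabulary)  # ['a', 'bat', 'gray', 'is', 'the', 'white']
print(aligned_a)   # [1, 2, 0, 1, 1, 1]
print(aligned_b)   # [0, 1, 1, 1, 1, 0]
```

Position 1 stands for `bat` in both lists, so the two vectors can now be compared element by element.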

In [10]:
text1 = "The Jaccard coefficient or Jaccard index, also known as Intersection over Union, named after the Swiss botanist Paul Jaccard (1868–1944), is a measure of the similarity of sets. It is often referred to by its definition as IoU (Intersection over Union)."
text2 = "The Isabelline Ghost Bat (Diclidurus isabella or Diclidurus isabellus) is a bat species found in northern South America, belonging to the genus of American Ghost Bats. This species is the only representative of the subgenus Depanycteris, which was temporarily classified as a separate genus. Despite the isabelline-colored body parts, Oldfield Thomas may have referred to a person named Isabell in his initial description."
text3  = "The American Ghost Bats (Diclidurus) are a genus of bats from the family of Smooth-tailed Free-tailed Bats. Along with the White Bat, they are among the only bats that are light gray to white in color. They are native to the tropical regions of Central and South America."

In [11]:
# creating feature vector - frequencies of individual words or tokens

from collections import Counter
import nltk

stopwords = nltk.corpus.stopwords.words('english')

def count_text(text):
    tokens = nltk.word_tokenize(text, language='english') # tokenize the text
    tokens = [e for e in tokens if e.lower() not in stopwords] # remove stop words (punctuation tokens are kept)

    return Counter(tokens)

feature_vec1 = count_text(text1)
feature_vec2 = count_text(text2)
feature_vec3 = count_text(text3)

print('feature_vec1 ----------:', len(feature_vec1))
print(feature_vec1)

print('feature_vec2 ----------:', len(feature_vec2))
print(feature_vec2)

print('feature_vec3 ----------:', len(feature_vec3))
print(feature_vec3)

feature_vec1 ----------: 23
Counter({'Jaccard': 3, ',': 3, 'Intersection': 2, 'Union': 2, '(': 2, ')': 2, '.': 2, 'coefficient': 1, 'index': 1, 'also': 1, 'known': 1, 'named': 1, 'Swiss': 1, 'botanist': 1, 'Paul': 1, '1868–1944': 1, 'measure': 1, 'similarity': 1, 'sets': 1, 'often': 1, 'referred': 1, 'definition': 1, 'IoU': 1})
feature_vec2 ----------: 39
Counter({',': 3, '.': 3, 'Ghost': 2, 'Diclidurus': 2, 'species': 2, 'genus': 2, 'Isabelline': 1, 'Bat': 1, '(': 1, 'isabella': 1, 'isabellus': 1, ')': 1, 'bat': 1, 'found': 1, 'northern': 1, 'South': 1, 'America': 1, 'belonging': 1, 'American': 1, 'Bats': 1, 'representative': 1, 'subgenus': 1, 'Depanycteris': 1, 'temporarily': 1, 'classified': 1, 'separate': 1, 'Despite': 1, 'isabelline-colored': 1, 'body': 1, 'parts': 1, 'Oldfield': 1, 'Thomas': 1, 'may': 1, 'referred': 1, 'person': 1, 'named': 1, 'Isabell': 1, 'initial': 1, 'description': 1})
feature_vec3 ----------: 27
Counter({'.': 3, 'Bats': 2, 'bats': 2, 'American': 1, 'Ghost': 
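
The counts above still contain punctuation tokens such as `,`, `.`, `(` and `)`, and these are shared by almost any pair of texts. If that is not wanted, one possible filter is sketched below. The helper `count_tokens`, the token list and the tiny stop-word set are hypothetical and only serve the illustration; they are not part of the cells above.

```python
import string
from collections import Counter

def count_tokens(tokens, stopwords):
    """Count tokens, skipping stop words and tokens made only of punctuation."""
    kept = [
        t for t in tokens
        if t.lower() not in stopwords
        and not all(ch in string.punctuation for ch in t)
    ]
    return Counter(kept)

# Hand-written tokens standing in for the output of a tokenizer.
tokens = ["The", "Jaccard", "index", ",", "also", "known", "as", "IoU",
          "(", "Intersection", "over", "Union", ")", "."]
toy_stopwords = {"the", "as", "over"}  # assumption: a tiny stand-in stop-word list

counts = count_tokens(tokens, toy_stopwords)
print(counts)
```

Tokens that mix letters and punctuation, such as `isabelline-colored`, are kept because only tokens consisting entirely of ASCII punctuation are dropped.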

In [12]:
# make all vectors comparable
vocabulary=[]

for feature in [feature_vec1, feature_vec2, feature_vec3]:
    for word in feature:
        vocabulary.append(word)

vocabulary = set(vocabulary)
print('vocabulary ----------:', len(vocabulary))
print(vocabulary)

feature_vec1_values = [feature_vec1.get(word, 0) for word in vocabulary]
feature_vec2_values = [feature_vec2.get(word, 0) for word in vocabulary]
feature_vec3_values = [feature_vec3.get(word, 0) for word in vocabulary]

print('feature_vec1 ----------:', len(feature_vec1_values))
print(feature_vec1_values)

print('feature_vec2 ----------:', len(feature_vec2_values))
print(feature_vec2_values)

print('feature_vec3 ----------:', len(feature_vec3_values))
print(feature_vec3_values)

vocabulary ----------: 71
{'body', 'native', 'family', 'Oldfield', 'Thomas', 'person', '1868–1944', 'Isabell', 'South', 'separate', 'American', 'Intersection', 'definition', 'index', 'Isabelline', 'Depanycteris', 'White', 'among', 'description', 'known', 'referred', 'northern', 'sets', ')', 'botanist', 'Bats', 'Central', 'bat', 'representative', '(', 'also', '.', 'Free-tailed', 'IoU', 'coefficient', 'temporarily', 'bats', 'color', 'Smooth-tailed', 'similarity', 'gray', 'Ghost', 'classified', 'named', 'found', 'species', 'isabelline-colored', 'Union', 'tropical', 'subgenus', 'Jaccard', 'Paul', 'Bat', 'Diclidurus', 'Along', 'regions', 'isabellus', 'may', 'belonging', 'initial', 'light', 'white', 'Swiss', 'America', 'isabella', ',', 'measure', 'often', 'parts', 'Despite', 'genus'}
feature_vec1 ----------: 71
[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 2, 1, 0, 0, 0, 0, 2, 1, 2, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 3, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,

Now that all feature vectors are comparable, we can do similarity analysis using, among others, the methods below.

#### a) Jaccard coefficient
The Jaccard coefficient is one of the simplest ways to measure how similar two texts are. It counts the number of unique words that the two texts have in common and divides it by the total number of unique words across both texts.

            jac(A, B) = |A ∩ B| / |A ∪ B|

It produces values between 0 and 1, where 1 indicates perfect similarity and 0 indicates no similarity.
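
A small sketch of the formula using Python sets directly. The function name `jaccard_sets` and the word lists are made up for illustration, and returning 0.0 for two empty inputs is an assumption, since the formula is undefined in that case.

```python
def jaccard_sets(words_a, words_b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two collections of words."""
    a, b = set(words_a), set(words_b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty texts count as "no similarity"
    return len(a & b) / len(union)

text_a = ["ghost", "bat", "genus"]
text_b = ["ghost", "bat", "species", "america"]

# Intersection: {ghost, bat} -> 2 words; union: 5 unique words.
print(jaccard_sets(text_a, text_b))  # 0.4
```

Because sets discard duplicates, how often a word occurs has no influence on the result.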

#### b) Manhattan distance metric
Unlike the Jaccard coefficient, Manhattan distance does not only consider whether a word occurs in both texts; it also takes into account the number of times the word is mentioned in each text.

                Manhattan Distance(A, B) = Σ |Ai - Bi| for i=1 to n
                        - Ai and Bi are the frequencies of each word in the texts / feature vectors
                        - n is the total number of features/words

Manhattan distance metric produces non-negative integer values, where 0 indicates perfect similarity and higher values suggest increasing dissimilarity.
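
A worked example on two short, already aligned frequency vectors. The vectors and the helper name `manhattan` are illustrative only.

```python
def manhattan(vec_a, vec_b):
    """Sum of absolute differences between two aligned frequency vectors."""
    return sum(abs(a - b) for a, b in zip(vec_a, vec_b))

a = [1, 2, 0, 1]
b = [0, 1, 1, 1]

# |1-0| + |2-1| + |0-1| + |1-1| = 1 + 1 + 1 + 0
print(manhattan(a, b))  # 3
print(manhattan(a, a))  # 0, identical vectors
```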

#### c) Cosine similarity
Manhattan distance is largely affected by the length of the text. To overcome this, cosine similarity measures the angle between the two feature vectors rather than the difference in their magnitudes:

        cos(A, B) = Σ (Ai · Bi) / (√(Σ Ai²) · √(Σ Bi²)) for i=1 to n
            
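Cosine similarity is beneficial when you want to compare the word distribution of texts regardless of their length, because it depends only on the direction of the feature vectors and not on their magnitude.

In general it produces values between -1 and 1, where 1 indicates perfect similarity and -1 indicates perfect dissimilarity. Since word frequencies are never negative, the values for the feature vectors used here lie between 0 and 1.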
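
The following sketch illustrates the length argument on toy vectors. `long_text` holds the same word frequencies as `short_text`, only doubled, as if the text had been written out twice. The helper names are illustrative, and returning 0.0 for an all-zero vector is an assumption, since the formula is undefined there.

```python
import math

def manhattan(vec_a, vec_b):
    return sum(abs(a - b) for a, b in zip(vec_a, vec_b))

def cosine(vec_a, vec_b):
    """Cosine of the angle between two aligned frequency vectors."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty text is similar to nothing
    return dot / (norm_a * norm_b)

short_text = [1, 2, 0, 1]
long_text = [2, 4, 0, 2]   # same proportions, twice the length
unrelated = [0, 0, 3, 0]   # shares no word with short_text

print(manhattan(short_text, long_text))  # 4, the length difference counts as distance
print(cosine(short_text, long_text))     # approximately 1.0, same direction
print(cosine(short_text, unrelated))     # 0.0, no shared words
```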

In [13]:
# Jaccard coefficient between text1, text2, and text3
def jaccard(vector1, vector2):
    interc = 0  # words present in both texts
    union = 0   # words present in at least one text
    for v1, v2 in zip(vector1, vector2):
        if v1 > 0 and v2 > 0:
            interc += 1
        if v1 > 0 or v2 > 0:  # shared words belong to the union too
            union += 1

    return interc / union

In [14]:
jac_text1_vs_text2 = jaccard(feature_vec1_values, feature_vec2_values)
jac_text1_vs_text3 = jaccard(feature_vec1_values, feature_vec3_values)
jac_text2_vs_text3 = jaccard(feature_vec2_values, feature_vec3_values)

print('Jaccard coefficient between text1 and text2 =', jac_text1_vs_text2)
print('Jaccard coefficient between text1 and text3 =', jac_text1_vs_text3)
print('Jaccard coefficient between text2 and text3 =', jac_text2_vs_text3)

Jaccard coefficient between text1 and text2 = 0.10714285714285714
Jaccard coefficient between text1 and text3 = 0.08695652173913043
Jaccard coefficient between text2 and text3 = 0.2222222222222222


In [15]:
# Manhattan distance between text1, text2, and text3
manhattan_text1_text2 = sum([abs(v1-v2) for v1, v2 in zip(feature_vec1_values, feature_vec2_values)])
manhattan_text1_text3 = sum([abs(v1-v2) for v1, v2 in zip(feature_vec1_values, feature_vec3_values)])
manhattan_text2_text3 = sum([abs(v1-v2) for v1, v2 in zip(feature_vec2_values, feature_vec3_values)])

print('Manhattan distance between text1 and text2 =', manhattan_text1_text2)
print('Manhattan distance between text1 and text3 =', manhattan_text1_text3)
print('Manhattan distance between text2 and text3 =', manhattan_text2_text3)

Manhattan distance between text1 and text2 = 61
Manhattan distance between text1 and text3 = 53
Manhattan distance between text2 and text3 = 50


In [16]:
# Cosine similarity between text1, text2, and text3
import math

def cosine(vector1, vector2):
    aibi = sum([v1*v2 for v1, v2 in zip(vector1, vector2)])
    sqrt_ai2 = math.sqrt(sum([v**2 for v in vector1]))
    sqrt_bi2 = math.sqrt(sum([v**2 for v in vector2]))  

    return (aibi) / (sqrt_ai2 * sqrt_bi2)

cosine_text1_text2 = cosine(feature_vec1_values, feature_vec2_values)
cosine_text1_text3 = cosine(feature_vec1_values, feature_vec3_values)
cosine_text2_text3 = cosine(feature_vec2_values, feature_vec3_values)

print('Cosine similarity between text1 and text2 =', cosine_text1_text2)
print('Cosine similarity between text1 and text3 =', cosine_text1_text3)
print('Cosine similarity between text2 and text3 =', cosine_text2_text3)

Cosine similarity between text1 and text2 = 0.3491282676376715
Cosine similarity between text1 and text3 = 0.27628324232744655
Cosine similarity between text2 and text3 = 0.4960712045370879


As seen above, all three measures identify text-2 and text-3 as the most similar pair: they have the highest Jaccard coefficient, the lowest Manhattan distance (50), and the highest cosine similarity (0.50). By comparison, text-1 and text-2 have a Manhattan distance of 61 and a cosine similarity of 0.35, and text-1 and text-3 have a Manhattan distance of 53 and a cosine similarity of 0.28.

___
            Note to self. All in one:
___


In [17]:
from collections import Counter
import nltk
import math

def simalarity(text1, text2, type = 'Jaccard', remove_stop_words = True, language = 'english'):

    # create feature vectors
    text1_tokens = nltk.word_tokenize(text1, language=language)
    text2_tokens = nltk.word_tokenize(text2, language=language)

    if remove_stop_words:
        stopwords = nltk.corpus.stopwords.words(language)
        text1_tokens = [e for e in text1_tokens if e.lower() not in stopwords]
        text2_tokens = [e for e in text2_tokens if e.lower() not in stopwords]

    feature_vec1 = Counter(text1_tokens)
    feature_vec2 = Counter(text2_tokens)

    # make feature vectors comparable / normalize
    vocabulary=[]
    for feature in [feature_vec1, feature_vec2]:
        for word in feature:
            vocabulary.append(word)
    vocabulary = set(vocabulary)
    feature_vec1_values = [feature_vec1.get(word, 0) for word in vocabulary]
    feature_vec2_values = [feature_vec2.get(word, 0) for word in vocabulary]

    # similarity analysis 
    if type == 'Jaccard':
        interc = 0  # words present in both texts
        union = 0   # words present in at least one text
        for v1, v2 in zip(feature_vec1_values, feature_vec2_values):
            if v1 > 0 and v2 > 0:
                interc += 1
            if v1 > 0 or v2 > 0:  # shared words belong to the union too
                union += 1

        return interc / union
    
    elif type == 'Manhattan':
        return sum([abs(v1-v2) for v1, v2 in zip(feature_vec1_values, feature_vec2_values)])

    elif type == "Cosine":
        aibi = sum([v1*v2 for v1, v2 in zip(feature_vec1_values, feature_vec2_values)])
        sqrt_ai2 = math.sqrt(sum([v**2 for v in feature_vec1_values]))
        sqrt_bi2 = math.sqrt(sum([v**2 for v in feature_vec2_values]))

        return (aibi) / (sqrt_ai2 * sqrt_bi2)
    
    else:
        return 'Please make sure you selected the correct type for similarity analysis.'

In [18]:
t = 'Jaccard'
print(simalarity(text1, text2, type = t))
print(simalarity(text1, text3, type = t))
print(simalarity(text2, text3, type = t))

0.10714285714285714
0.08695652173913043
0.2222222222222222


In [19]:
t = 'Manhattan'
print(simalarity(text1, text2, type = t))
print(simalarity(text1, text3, type = t))
print(simalarity(text2, text3, type = t))

61
53
50


In [20]:
t = 'Cosine'
print(simalarity(text1, text2, type = t))
print(simalarity(text1, text3, type = t))
print(simalarity(text2, text3, type = t))

0.3491282676376715
0.27628324232744655
0.4960712045370879


In [21]:
t = 'xyz'
print(simalarity(text1, text2, type = t))
print(simalarity(text1, text3, type = t))
print(simalarity(text2, text3, type = t))

Please make sure you selected the correct type for similarity analysis.
Please make sure you selected the correct type for similarity analysis.
Please make sure you selected the correct type for similarity analysis.
