<h1 align="center">Machine Learning for NLP</h1>
    <h2 align="center">Text Feature Extraction</h2>
    <h3 align="center">Zahra Amini</h3>
<div style="width: 100%; text-align: center;">
    <table>
        <tr>
            <td>
                <a class="link" href="https://t.me/Zahraamini_ai">Telegram</a><br>
                <a class="link" href="https://www.linkedin.com/in/zahraamini-ai/">LinkedIn</a><br>
                <a class="link" href="https://www.youtube.com/@AcademyHobot">YouTube</a><br>
            </td>
            <td>
                <a class="link" href="https://github.com/aminizahra">GitHub</a><br>
                <a class="link" href="https://www.kaggle.com/aminizahra">Kaggle</a><br>
                <a class="link" href="https://www.instagram.com/zahraamini_ai/">Instagram</a><br>
            </td>
        </tr>
    </table>
</div>

## Libraries

In [3]:
# pip install unidecode

In [4]:
import pandas as pd
import numpy as np
import re

from unidecode import unidecode

import nltk
from nltk.tokenize import word_tokenize

from nltk.tokenize import sent_tokenize

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import Word2Vec
from sklearn.metrics.pairwise import cosine_similarity

In [44]:
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

## Load Data

In [6]:
file_path = 'preprocessedTweet.csv'
data = pd.read_csv(file_path)

In [7]:
data.head()

Unnamed: 0,target,text,sentiment,sentence_tokens,word_tokens
0,0,switchfoot awww thats a bummer you shoulda ...,negative,['switchfoot awww thats a bummer you should...,"['switchfoot', 'awww', 'thats', 'a', 'bummer',..."
1,0,is upset that he cant update his facebook by t...,negative,['is upset that he cant update his facebook by...,"['is', 'upset', 'that', 'he', 'cant', 'update'..."
2,0,kenichan i dived many times for the ball manag...,negative,['kenichan i dived many times for the ball man...,"['kenichan', 'i', 'dived', 'many', 'times', 'f..."
3,0,my whole body feels itchy and like its on fire,negative,['my whole body feels itchy and like its on fi...,"['my', 'whole', 'body', 'feels', 'itchy', 'and..."
4,0,nationwideclass no its not behaving at all im ...,negative,['nationwideclass no its not behaving at all i...,"['nationwideclass', 'no', 'its', 'not', 'behav..."


## Bag of Words (BoW)
<div style="background-color: #dafbe1; border-left: 5px solid #34c759; padding: 10px;">

The **Bag of Words (BoW)** model is a method for extracting features from text for use in machine learning algorithms. In this approach, a text is represented as the frequency of words within it, ignoring grammar and order of words. Each unique word in the text corpus is considered a feature, and its frequency count in each document forms the feature vector.
</div>

In [9]:
vectorizer = CountVectorizer()
bow_features = vectorizer.fit_transform(data['text'])

## Term Frequency-Inverse Document Frequency (TF-IDF)
<div style="background-color: #dafbe1; border-left: 5px solid #34c759; padding: 10px;">

The** Term Frequency-Inverse Document Frequency (TF-IDF**) combines two metrics: Term Frequency (TF) and Inverse Document Frequency (IDF). The formula for calculating TF-IDF for a word \( t \) in a document \( d \) is:

$$
\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)
$$

Where:

- **Term Frequency (TF)** measures how frequently a word appears in a document and is calculated as:

$$
\text{TF}(t, d) = \frac{\text{Number of times } t \text{ appears in } d}{\text{Total number of words in } d}
$$

- **Inverse Document Frequency (IDF)** measures how unique a word is across all documents in the corpus and is calculated as:

$$
\text{IDF}(t) = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing } t}\right)
$$

Words with high TF-IDF scores are important because they appear frequently in a document but not across many documents in the corpus, making them unique and meaningful features.
</div>


In [11]:
tfidf_vectorizer = TfidfVectorizer()
tfidf_features = tfidf_vectorizer.fit_transform(data['text'])

## Word Embedding | Word2Vec
<div style="background-color: #dafbe1; border-left: 5px solid #34c759; padding: 10px;">

The **Word2Vec** model is a neural network-based method for learning vector representations of words in a continuous vector space. It captures semantic meanings by grouping similar words close to each other in the vector space. Here is an explanation of each parameter used in the code:<br>

- **sentences**: This parameter represents the input data, which should be a list of tokenized sentences (each sentence is a list of words). In this example, <code>tokenized_text</code> is used as the input.<br>
- **vector_size**: Defines the dimensionality of the word vectors. A higher value like 100 means each word is represented by a vector with 100 dimensions, capturing more details about word meanings.<br>
- **window**: Specifies the maximum distance between the current and predicted word within a sentence. A value of 5 means the model considers up to 5 words before and after the target word, capturing more contextual information.<br>
- **min_count**: Ignores all words with a frequency lower than this value. Setting it to 1 ensures that even words appearing once in the corpus are included in the model.<br>
- **workers**: Sets the number of worker threads used for training. A higher number, such as 4, speeds up the training process on multi-core systems.
</div>

In [13]:
tokenized_text = data['word_tokens']
word2vec_model = Word2Vec(sentences=tokenized_text, vector_size=100, window=5, min_count=1, workers=4)

In [14]:
def get_sentence_vector(sentence_tokens, model):
    word_vectors = [model.wv[word] for word in sentence_tokens if word in model.wv]
    if word_vectors:
        return np.mean(word_vectors, axis=0)
    else:
        return np.zeros(model.vector_size)

In [15]:
word2vec_features = [get_sentence_vector(sentence, word2vec_model) for sentence in tokenized_text]

## GloVe
<div style="background-color: #dafbe1; border-left: 5px solid #34c759; padding: 10px;">

**GloVe (Global Vectors for Word Representation)** is a pre-trained word embedding method that captures semantic relationships between words. Unlike Word2Vec, which relies on local context windows, GloVe is trained on word co-occurrence statistics across the entire corpus. This means it considers the global word relationships, making it especially useful for capturing semantic similarities.<br><br>

- **Pre-trained Embeddings**: GloVe provides pre-trained word vectors, typically trained on large datasets like Wikipedia or Common Crawl. These embeddings are loaded into the model, and words in your text are represented using these vectors.<br>
- **Dimensionality**: GloVe embeddings come in various dimensions (e.g., 50, `100`, 200, 300). A higher dimension allows the embedding to capture more semantic detail but requires more computational resources.<br>
- **Usage**: You can load GloVe embeddings using a library like Gensim or manually by mapping each word to its vector representation. For instance, each word in your vocabulary is represented by a pre-trained GloVe vector, which can be used in downstream tasks like text classification, clustering, or semantic analysis.
</div>

In [17]:
glove_path = r'C:\PC\Zahraamini_ai\GitHub\NLP_FeatureExtraction\glove.6B\glove.6B.100d.txt' 
# download from https://nlp.stanford.edu/projects/glove/

In [18]:
glove_embeddings = {}
with open(glove_path, 'r', encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], dtype='float32')
        glove_embeddings[word] = vector

In [19]:
def get_sentence_embedding(sentence):
    words = sentence.split()
    word_vectors = [glove_embeddings[word] for word in words if word in glove_embeddings]
    if word_vectors:
        return np.mean(word_vectors, axis=0)
    else:
        return np.zeros(100)

In [20]:
glove_features = [get_sentence_embedding(sentence) for sentence in data['text']]

## Cosine Similarity
<div style="background-color: #dafbe1; border-left: 5px solid #34c759; padding: 10px;">
**
Cosine similarit**y is a measure used to determine how similar two vectors are by calculating the cosine of the angle between them. It’s commonly used in text analysis to compare the similarity between two documents or sentences represented as vectors. The cosine similarity score ranges from -1 to 1, where 1 indicates identical vectors, 0 indicates orthogonal (no similarity), and -1 indicates opposite directions.<br><br>

The formula for cosine similarity between two vectors \( A \) and \( B \) is:

$$
\text{Cosine Similarity} = \frac{A \cdot B}{\|A\| \|B\|}
$$

Where:

- \( A \cdot B \) is the dot product of the vectors \( A \) and \( B \).

- \( \|A\| \) and \( \|B\| \) are the magnitudes (norms) of vectors \( A \) and \( B \), calculated as:

$$
\|A\| = \sqrt{\sum_{i=1}^{n} A_i^2}
$$

$$
\|B\| = \sqrt{\sum_{i=1}^{n} B_i^2}
$$

The result will be a number between -1 and 1, with higher values indicating greater similarity.
</div>


<div style="background-color: #dafbe1; border-left: 5px solid #34c759; padding: 10px;">

In text feature extraction methods, **cosine similarity** quantifies the similarity between two text representations, typically after transforming texts into vectors using techniques like Bag of Words, TF-IDF, or Word Embeddings (e.g., Word2Vec, GloVe). By calculating the cosine of the angle between two vectors, cosine similarity reveals how semantically similar two texts are. A cosine similarity score close to 1 indicates high similarity, suggesting the texts are likely to share similar content or context.
</div>

In [23]:
cosine_sim_bow = np.mean(cosine_similarity(bow_features[:10000]))

In [24]:
cosine_sim_tfidf = np.mean(cosine_similarity(tfidf_features[:10000]))

In [25]:
cosine_sim_word2vec = np.mean(cosine_similarity(word2vec_features[:10000]))

In [26]:
cosine_sim_glove = np.mean(cosine_similarity(glove_features[:10000]))

In [27]:
print("Cosine Similarity for BoW:\n", cosine_sim_bow)

Cosine Similarity for BoW:
 0.04784706731270371


<div style="background-color: #dafbe1; border-left: 5px solid #34c759; padding: 10px;">

This value is relatively low and indicates that the two texts have little similarity when using the Bag of Words (BoW) method. This result is often due to the fact that BoW only considers the frequency of words and does not take into account the order or **semantic relationships** of the words. Therefore, if the texts use different words, the BoW similarity **will naturally be low**.
</div>

In [29]:
print("Cosine Similarity for TF-IDF:\n", cosine_sim_tfidf)

Cosine Similarity for TF-IDF:
 0.01337352036602397


<div style="background-color: #dafbe1; border-left: 5px solid #34c759; padding: 10px;">

This value is even lower than that of BoW. The TF-IDF method focuses on words that hold greater significance in each text by considering the relative importance of words based on their frequency across the entire corpus. A low similarity score indicates that the key important words in these two texts differ significantly, suggesting that the texts may be thematically different or use different words altogether.
</div>

In [31]:
print("Cosine Similarity for Word2Vec:\n", cosine_sim_word2vec)

Cosine Similarity for Word2Vec:
 0.996087


<div style="background-color: #dafbe1; border-left: 5px solid #34c759; padding: 10px;">

A high value (close to 1) indicates that the texts are semantically very similar. The Word2Vec model, by considering the semantic relationships between words, can identify similarities between words with similar meanings, even if the exact same words are not used. Therefore, this result suggests that the two texts likely share similar meanings or themes.
</di>


In [33]:
print("Cosine Similarity for GloVe:\n", cosine_sim_glove)

Cosine Similarity for GloVe:
 0.8267009158948537


<div style="background-color: #dafbe1; border-left: 5px solid #34c759; padding: 10px;">

This value is even lower than that of BoW. The TF-IDF method focuses on words that hold greater significance in each text by considering the relative importance of words based on their frequency across the entire corpus. A low similarity score indicates that the key important words in these two texts differ significantly, suggesting that the texts may be thematically different or use different words altogether.
</div>

<div style="background-color: #dafbe1; border-left: 5px solid #34c759; padding: 10px;">

The results indicate that Word2Vec and GloVe, both of which utilize the semantic relationships of words for feature representation, show a higher similarity between the two texts, whereas BoW and TF-IDF demonstrate a very low similarity due to their reliance on the frequency and relative importance of words. This suggests that if the goal is to measure semantic similarity between texts, using word embeddings like Word2Vec and GloVe is preferable to simpler methods like BoW and TF-IDF.
</di>
