<a href="https://colab.research.google.com/github/Ashikagade333/Ashikagade_INFO5371_Fall2023/blob/main/ashika.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Question 1 (10 points): Describe an interesting text classification or text mining task and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features.

In [None]:
One interesting text classification task is sentiment analysis, where the goal is to determine the sentiment or emotion expressed in a given text, such as a product review, social media post, or customer feedback. Businesses can use sentiment analysis to understand customer opinions, gauge product satisfaction, and make decisions informed by public sentiment. Here are six types of features that could be valuable for building a sentiment analysis machine learning model:

Bag of Words (BoW) Features:
Word Frequency: Count the frequency of each word in the text. Positive and negative sentiment words can be given different weights to capture sentiment polarity.
N-grams: Include bigrams or trigrams to capture word combinations and phrases that carry sentiment, such as "not good" or "very happy."

TF-IDF (Term Frequency-Inverse Document Frequency) Features:
TF-IDF Vector: Compute a TF-IDF value for each word in the text. TF-IDF gives more weight to words that occur often in a document but rarely across the rest of the collection, so distinctive words count for more than words that appear everywhere.

Word Embeddings:
Pre-trained Word Embeddings (e.g., Word2Vec, GloVe): Represent words as dense vectors in a continuous vector space, where words with similar meanings lie close together. Average or concatenate the word embeddings to form a document-level embedding.

Sentiment Lexicons:
Sentiment Lexicon Scores: Assign sentiment scores (e.g., positive, negative, or neutral) to words in the text based on a sentiment lexicon. Calculate an overall sentiment score by aggregating the word scores.

Part-of-Speech (POS) Features:
POS Tag Frequencies: Count the frequency of different POS tags (e.g., verbs, adjectives, adverbs) in the text. Analyze how the distribution of these tags correlates with sentiment.

Emoticon and Emoji Analysis:
Emoticon and Emoji Presence: Count the occurrences of emoticons and emojis in the text. Assign sentiment scores to commonly used emoticons and emojis.

Together, these features provide a rich representation of text data, allowing a machine learning model to capture both explicit and nuanced expressions of sentiment. Combining them can improve the accuracy of sentiment analysis models.
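
As a rough illustration of the first two feature types, the sketch below counts unigrams and bigrams and then applies one common TF-IDF variant, tf * log(N / df). The sample sentences and the helper names `ngrams`, `bow_counts`, and `tfidf` are invented for this example; in practice a library such as scikit-learn would likely be used instead.

```python
import math
from collections import Counter

# Hypothetical sample reviews, invented for illustration only
docs = [
    "the movie was not good",
    "the movie was very good",
    "very happy with the product",
]

def ngrams(tokens, n):
    """Return the n-grams (joined by a space) found in a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bow_counts(text, max_n=2):
    """Count all 1-grams up to max_n-grams in a whitespace-tokenized text."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts

def tfidf(doc_counts):
    """Weight each term by tf * log(N / df), one common TF-IDF variant."""
    n_docs = len(doc_counts)
    # Document frequency: in how many documents each term appears
    df = Counter(term for counts in doc_counts for term in counts)
    return [
        {term: tf * math.log(n_docs / df[term]) for term, tf in counts.items()}
        for counts in doc_counts
    ]

doc_counts = [bow_counts(d) for d in docs]
weights = tfidf(doc_counts)

# "not good" occurs only in the first review, "the" occurs in all three
print(doc_counts[0]["not good"])
print(round(weights[0]["not good"], 4), weights[0]["the"])
```

With this weighting, a term that appears in every document (here "the") gets weight zero, while the bigram "not good" keeps a positive weight because it is specific to one review.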


Question 2 (10 points): Write python code to extract these features you discussed above. You can collect a few sample text data for the feature extraction.

In [None]:
import pandas as pd

# Initialize an empty list to store the decoded data
ashika_list = []

# Local path to the data file
file_path = "D:\\Users\\Ashika\\Desktop\\Synonyms.txt"

# Open the file and read its content
with open(file_path, 'r', encoding='utf-8') as file:
    for line in file:
        # Attempt to strip Windows line endings; text mode already yields '\n', so the trailing newline is kept
        decoded_data = line.replace('\r\n', '')
        if decoded_data:
            ashika_list.append(decoded_data)

# Create a DataFrame using pandas with a column named 'Text'
ashika_df = pd.DataFrame(ashika_list, columns=['Text'])

# Display the DataFrame
print("Original DataFrame:")
print(ashika_df)

Original DataFrame:
                                                  Text
0                   Synonyms of about (Entry 1 of 3)\n
1                                 1having to do with\n
2    a POIGANT story about a young man who goes off...
3                                 Synonyms for about\n
4                                                   \n
..                                                 ...
117                                                 \n
118  asleep, dormant, dozing, napping, resting, sle...
119  dozy, drowsy, nodding, SLEEPY, slumberous (or ...
120                                      dreaming ME\n
121                             hypnotized, MESMERIZED

[122 rows x 1 columns]
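
The rows above still end in `\n`. When a file is opened in text mode with the default newline handling, Python converts `\r\n` to `\n`, so `replace('\r\n', '')` finds nothing to remove and blank lines are kept as rows. A minimal sketch of one way to strip the line endings, using an in-memory stand-in (`fake_file`, with made-up contents) rather than the original file:

```python
import io

# Hypothetical stand-in for the text file; the contents are illustrative only
fake_file = io.StringIO("Synonyms for about\n\nasleep, dormant\n")

rows = []
for line in fake_file:
    cleaned = line.strip()   # removes the trailing '\n' and surrounding spaces
    if cleaned:              # skip lines that are empty after stripping
        rows.append(cleaned)

print(rows)
```

Applying this to the real file would change the row count and the derived feature values, so the outputs recorded in this notebook correspond to the unstripped text.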


In [None]:
# Function to find the number of words in a text
def ashika_num_words(x):
    return len(str(x).split(" "))

# Apply the function to the 'Text' column and create a new column 'Num of Words'
ashika_df['Num of Words'] = ashika_df['Text'].apply(ashika_num_words)

# Display the DataFrame after adding 'Num of Words' column
print("\nDataFrame after adding 'Num of Words' column:")
print(ashika_df)


DataFrame after adding 'Num of Words' column:
                                                  Text  Num of Words
0                   Synonyms of about (Entry 1 of 3)\n             7
1                                 1having to do with\n             4
2    a POIGANT story about a young man who goes off...            12
3                                 Synonyms for about\n             3
4                                                   \n             1
..                                                 ...           ...
117                                                 \n             1
118  asleep, dormant, dozing, napping, resting, sle...             8
119  dozy, drowsy, nodding, SLEEPY, slumberous (or ...             8
120                                      dreaming ME\n             2
121                             hypnotized, MESMERIZED             2

[122 rows x 2 columns]
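
Rows 4 and 117 contain only a newline yet report one word. This follows from `split(" ")`, which splits on single spaces only and returns the whole string when no space is present. The short sketch below (with made-up sample strings) compares it with `split()`, which splits on any run of whitespace:

```python
# Made-up sample strings: a normal line, a blank line, and a double space
samples = ["Synonyms for about\n", "\n", "two  spaces"]

# For each sample: (count with split(" "), count with split())
counts = [(len(text.split(" ")), len(text.split())) for text in samples]

for text, (by_space, by_whitespace) in zip(samples, counts):
    print(repr(text), by_space, by_whitespace)
```

Which behavior is preferable depends on whether blank lines should count as containing a word; the recorded outputs here reflect `split(" ")`.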


In [None]:
# Create a new column 'Num of Char' to store the number of characters in each text
ashika_df['Num of Char'] = ashika_df['Text'].str.len()

# Display the DataFrame after adding 'Num of Char' column
print("\nDataFrame after adding 'Num of Char' column:")
print(ashika_df)


DataFrame after adding 'Num of Char' column:
                                                  Text  Num of Words  \
0                   Synonyms of about (Entry 1 of 3)\n             7   
1                                 1having to do with\n             4   
2    a POIGANT story about a young man who goes off...            12   
3                                 Synonyms for about\n             3   
4                                                   \n             1   
..                                                 ...           ...   
117                                                 \n             1   
118  asleep, dormant, dozing, napping, resting, sle...             8   
119  dozy, drowsy, nodding, SLEEPY, slumberous (or ...             8   
120                                      dreaming ME\n             2   
121                             hypnotized, MESMERIZED             2   

     Num of Char  
0             33  
1             19  
2             54  
3            

In [None]:
# Function to find the average word length in a text
def ashika_avg_word_length(x):
    words = x.split()
    if words:
        return sum(len(word) for word in words) / len(words)
    else:
        return None

# Apply the function to the 'Text' column and create a new column 'Avg Word Length'
ashika_df['Avg Word Length'] = ashika_df['Text'].apply(ashika_avg_word_length)

# Display the DataFrame after adding 'Avg Word Length' column
print("\nDataFrame after adding 'Avg Word Length' column:")
print(ashika_df)


DataFrame after adding 'Avg Word Length' column:
                                                  Text  Num of Words  \
0                   Synonyms of about (Entry 1 of 3)\n             7   
1                                 1having to do with\n             4   
2    a POIGANT story about a young man who goes off...            12   
3                                 Synonyms for about\n             3   
4                                                   \n             1   
..                                                 ...           ...   
117                                                 \n             1   
118  asleep, dormant, dozing, napping, resting, sle...             8   
119  dozy, drowsy, nodding, SLEEPY, slumberous (or ...             8   
120                                      dreaming ME\n             2   
121                             hypnotized, MESMERIZED             2   

     Num of Char  Avg Word Length  
0             33         3.714286  
1            

In [None]:
# Function to count special characters: anything that is not a letter, a digit, or whitespace
def ashika_num_special_characters(x):
    return sum(
        not char.isalpha() and not char.isdigit() and not char.isspace()
        for char in x
    )

# Apply the function to the 'Text' column and create a new column 'Num of spec char'
ashika_df['Num of spec char'] = ashika_df['Text'].apply(ashika_num_special_characters)

# Display the DataFrame after adding 'Num of spec char' column
print("\nDataFrame after adding 'Num of spec char' column:")
print(ashika_df)


DataFrame after adding 'Num of spec char' column:
                                                  Text  Num of Words  \
0                   Synonyms of about (Entry 1 of 3)\n             7   
1                                 1having to do with\n             4   
2    a POIGANT story about a young man who goes off...            12   
3                                 Synonyms for about\n             3   
4                                                   \n             1   
..                                                 ...           ...   
117                                                 \n             1   
118  asleep, dormant, dozing, napping, resting, sle...             8   
119  dozy, drowsy, nodding, SLEEPY, slumberous (or ...             8   
120                                      dreaming ME\n             2   
121                             hypnotized, MESMERIZED             2   

     Num of Char  Avg Word Length  Num of spec char  
0             33         3.714

In [None]:
# Count the whitespace-separated tokens that consist only of digits
ashika_df['Num of num'] = ashika_df['Text'].apply(lambda x: len([word for word in x.split() if word.isdigit()]))

# Display the DataFrame after adding 'Num of num' column
print("\nDataFrame after adding 'Num of num' column:")
print(ashika_df)


DataFrame after adding 'Num of num' column:
                                                  Text  Num of Words  \
0                   Synonyms of about (Entry 1 of 3)\n             7   
1                                 1having to do with\n             4   
2    a POIGANT story about a young man who goes off...            12   
3                                 Synonyms for about\n             3   
4                                                   \n             1   
..                                                 ...           ...   
117                                                 \n             1   
118  asleep, dormant, dozing, napping, resting, sle...             8   
119  dozy, drowsy, nodding, SLEEPY, slumberous (or ...             8   
120                                      dreaming ME\n             2   
121                             hypnotized, MESMERIZED             2   

     Num of Char  Avg Word Length  Num of spec char  Num of num  
0             33        

In [None]:

# Count the whitespace-separated tokens written entirely in uppercase
ashika_df['Num of upper case words'] = ashika_df['Text'].apply(lambda x: len([word for word in x.split() if word.isupper()]))

# Display the DataFrame after adding 'Num of upper case words' column
print("\nDataFrame after adding 'Num of upper case words' column:")
print(ashika_df)


DataFrame after adding 'Num of upper case words' column:
                                                  Text  Num of Words  \
0                   Synonyms of about (Entry 1 of 3)\n             7   
1                                 1having to do with\n             4   
2    a POIGANT story about a young man who goes off...            12   
3                                 Synonyms for about\n             3   
4                                                   \n             1   
..                                                 ...           ...   
117                                                 \n             1   
118  asleep, dormant, dozing, napping, resting, sle...             8   
119  dozy, drowsy, nodding, SLEEPY, slumberous (or ...             8   
120                                      dreaming ME\n             2   
121                             hypnotized, MESMERIZED             2   

     Num of Char  Avg Word Length  Num of spec char  Num of num  \
0         
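
The cells above compute surface statistics of the text. The lexicon and emoticon features described in Question 1 could be sketched along the following lines. The lexicon, the emoticon table, the sample sentence, and the function names are all tiny hand-made assumptions for illustration, not a published resource or part of the original notebook.

```python
# Tiny hand-made lexicon and emoticon table (illustrative assumptions only)
LEXICON = {"good": 1, "happy": 1, "great": 1, "bad": -1, "sad": -1, "poor": -1}
EMOTICONS = {":)": 1, ":D": 1, ":(": -1}

def lexicon_score(text):
    """Sum lexicon polarities over lowercase tokens stripped of punctuation."""
    tokens = [tok.strip(".,!?;:\"'()").lower() for tok in text.split()]
    return sum(LEXICON.get(tok, 0) for tok in tokens)

def emoticon_features(text):
    """Return (number of emoticons found, summed emoticon polarity)."""
    hits = [tok for tok in text.split() if tok in EMOTICONS]
    return len(hits), sum(EMOTICONS[tok] for tok in hits)

# Hypothetical review: one positive word, one negative word, one sad emoticon
sample = "Great phone, but the battery is bad :("
print(lexicon_score(sample), emoticon_features(sample))
```

In this example the positive and negative words cancel out in the lexicon score, while the emoticon feature still signals negative sentiment, which is one reason to keep the two features separate.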

Question 3 (10 points): Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above, rank the features based on their importance in the descending order.

Rank of most important features here:
1. Number of sentences
2. Number of Words
3. Number of Characters
4. Stopwords
5. Lowercase
6. Removal of Punctuation


Question 4 (10 points): Write python code to rank the text based on text similarity. Based on the text data you used for question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarity in descending order.

In [None]:
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity
import torch

# Sample documents about 'Ashika' that form the collection to search
ashika = [
    "Ashika is a software engineer specializing in web development and cloud computing.",
    "Outside of work, Ashika enjoys exploring new programming languages and building innovative projects.",
    "Ashika is dedicated to creating scalable and efficient solutions for complex technical challenges.",
    "Currently, Ashika is involved in a project focused on optimizing serverless architectures for better performance."
]

# Query to match against the documents
query_ashika = "What technologies is Ashika currently exploring?"

# Load pre-trained BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Function to get BERT embeddings for a given text
def get_bert_embeddings_ashika(text):
    input_ids = tokenizer.encode(text, return_tensors='pt')
    with torch.no_grad():
        outputs = model(input_ids)
    embeddings = outputs.pooler_output.numpy()
    return embeddings

# Get BERT embeddings for the query and each document
query_embedding_ashika = get_bert_embeddings_ashika(query_ashika)
ashika_embeddings = [get_bert_embeddings_ashika(doc) for doc in ashika]

# Calculate cosine similarity between the query and each document
similarities_ashika = [cosine_similarity(query_embedding_ashika, doc_embedding)[0][0] for doc_embedding in ashika_embeddings]

# Rank documents based on similarity in descending order
ranked_ashika = sorted(zip(ashika, similarities_ashika), key=lambda x: x[1], reverse=True)

# Display the ranked documents with the name 'ashika'
print("Ranked Documents based on Cosine Similarity for 'Ashika':")
for document, similarity in ranked_ashika:
    print(f"Similarity: {similarity:.4f}\n{document}\n")


Ranked Documents based on Cosine Similarity for 'Ashika':
Similarity: 0.9834
Currently, Ashika is involved in a project focused on optimizing serverless architectures for better performance.

Similarity: 0.9820
Ashika is dedicated to creating scalable and efficient solutions for complex technical challenges.

Similarity: 0.9804
Outside of work, Ashika enjoys exploring new programming languages and building innovative projects.

Similarity: 0.9517
Ashika is a software engineer specializing in web development and cloud computing.
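
The ranking step above reduces to computing dot(u, v) / (||u|| * ||v||) between the query vector and each document vector, then sorting. A minimal sketch of that computation with made-up 3-dimensional vectors standing in for the BERT embeddings (the names `doc_a`, `doc_b`, and `doc_c` are hypothetical):

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (||u|| * ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors standing in for the query and document embeddings
query = [1.0, 0.0, 1.0]
docs = {
    "doc_a": [1.0, 0.0, 1.0],   # same direction as the query
    "doc_b": [1.0, 1.0, 0.0],   # partly overlapping
    "doc_c": [0.0, 1.0, 0.0],   # orthogonal to the query
}

# Sort document names by similarity to the query, highest first
ranked = sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)
print(ranked)
```

Because cosine similarity depends only on direction, vectors that all point in nearly the same direction produce scores close to 1, as in the narrow 0.95 to 0.98 range recorded above.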

