https://www.kaggle.com/datasets/nelgiriyewithana/emotions/data

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
#modules and libraries here
import pandas as pd
import numpy as np
import re
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [3]:
#file path and loading the dataset
file_path = '/content/drive/My Drive/text.csv'
emo_data = pd.read_csv(file_path)
#sample 10,000 rows because working with the full dataset
#(more than 400,000 rows) would be computationally intensive
emo_data = emo_data.sample(n=10000, random_state=42)  # Use a fixed seed for reproducibility
#print the first few rows to confirm the data loaded and sampled correctly
print(emo_data.head())

        Unnamed: 0                                               text  label
36130        36130  id say maybe made them feel foolish but that w...      0
138065      138065  i joined the lds church i admit to feeling som...      0
146440      146440  i must admit i didnt feel like hugging him not...      3
103337      103337  i hate that i can still feel if any nerve is d...      0
315528      315528                  im actually feeling a little smug      1


In [4]:
#check the shape of the dataset to understand the scale
#of the data I am working with, so that I can plan
#how to handle it in subsequent processing and analysis tasks.
print("Dataset shape:", emo_data.shape)
#summary of the DataFrame; it helps me quickly identify columns that might
#need type conversion (for example, converting dates stored as strings to datetime
#objects) and spot columns with missing data.
emo_data.info()
#any missing values?
print("Missing values in each column:\n", emo_data.isnull().sum())

Dataset shape: (10000, 3)
<class 'pandas.core.frame.DataFrame'>
Index: 10000 entries, 36130 to 282137
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  10000 non-null  int64 
 1   text        10000 non-null  object
 2   label       10000 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 312.5+ KB
Missing values in each column:
 Unnamed: 0    0
text          0
label         0
dtype: int64


**Data Pre-processing**
First, let's focus on data cleaning, which serves two goals: better model performance and more accurate analysis. Noisy data can lead models to "memorize" irrelevant patterns specific to my training set, which hurts their performance on new tweets. Inconsistent data can also bias the analysis by skewing the results toward or away from particular emotions. Cleaning the data reduces both risks and helps surface true linguistic patterns, so that the analysis is more representative of the emotions people express in tweets.


In [5]:
#cleaning data
def clean_text(text):
    #removing URLs
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
    #removing user mentions
    text = re.sub(r'@\w+', '', text)
    #removing the '#' symbol from hashtags (but keeping the text)
    text = re.sub(r'#', '', text)
    #replacing non-word characters (punctuation, apostrophes, symbols) with spaces
    text = re.sub(r'\W', ' ', text)
    #removing numbers
    text = re.sub(r'\d+', '', text)
    #collapsing extra spaces
    text = re.sub(r'\s+', ' ', text).strip()
    #converting to lowercase
    text = text.lower()
    return text

The clean_text function I've developed is tailored for preprocessing text data from tweets, particularly useful for tasks such as sentiment or emotion classification. Initially, the function strips out URLs and Twitter-specific elements like user mentions and the '#' symbol from hashtags, retaining only the textual content. These components are generally irrelevant for analyzing the emotional content of the text.

Following this, the function removes all special characters, numbers, and extra spaces, which helps in cleaning up the text by eliminating unnecessary punctuation and numerical data that do not contribute to sentiment analysis. This is done through a series of regular expressions that substitute non-word characters and digits with spaces or remove them entirely, and also consolidate multiple whitespace into single spaces. Additionally, all text is converted to lowercase to ensure consistency, as the case of letters can affect the processing in NLP tasks.

By performing these cleaning steps, the function prepares the tweet text for further analysis by making it more uniform and focused on meaningful words only. This preprocessing is crucial as clean and standardized data often leads to more accurate classifications in machine learning models.
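
To make the effect of each step concrete, here is a small self-contained sketch that applies the same regular expressions to a made-up tweet (the example string is an assumption, not taken from the dataset). It also shows a side effect worth knowing about: apostrophes are non-word characters, so "Can't" ends up as "can t".

```python
import re

def clean_text(text):
    # Same steps as the function above, condensed for the demo
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)  # URLs
    text = re.sub(r'@\w+', '', text)           # user mentions
    text = re.sub(r'#', '', text)              # '#' symbol only, hashtag text is kept
    text = re.sub(r'\W', ' ', text)            # punctuation and apostrophes -> space
    text = re.sub(r'\d+', '', text)            # digits
    text = re.sub(r'\s+', ' ', text).strip()   # collapse whitespace
    return text.lower()

# Hypothetical example tweet
sample = "Can't wait!! Check https://t.co/abc @user #Happy 2024"
cleaned = clean_text(sample)
print(cleaned)
```

Text that is already lowercase and free of punctuation, like the rows in this dataset, passes through unchanged.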

In [6]:
!pip install contractions

Collecting contractions
  Downloading contractions-0.1.73-py2.py3-none-any.whl (8.7 kB)
Collecting textsearch>=0.0.21 (from contractions)
  Downloading textsearch-0.0.24-py2.py3-none-any.whl (7.6 kB)
Collecting anyascii (from textsearch>=0.0.21->contractions)
  Downloading anyascii-0.3.2-py3-none-any.whl (289 kB)
Collecting pyahocorasick (from textsearch>=0.0.21->contractions)
  Downloading pyahocorasick-2.1.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (110 kB)
Installing collected packages: pyahocorasick, anyascii, textsearch, contractions
Successfully installed anyascii-0.3.2 contractions-0.1.73 pyahocorasick-2.1.0 textsearch-0.0.24


In [7]:
import contractions
#function for expanding contractions
def expand_text(text):
    expanded_text = contractions.fix(text)
    return expanded_text
emo_data['cleaned_text'] = emo_data['text'].apply(clean_text)
emo_data['expanded_text'] = emo_data['cleaned_text'].apply(expand_text)
print(emo_data[['text', 'cleaned_text', 'expanded_text']].head())


                                                     text  \
36130   id say maybe made them feel foolish but that w...   
138065  i joined the lds church i admit to feeling som...   
146440  i must admit i didnt feel like hugging him not...   
103337  i hate that i can still feel if any nerve is d...   
315528                  im actually feeling a little smug   

                                             cleaned_text  \
36130   id say maybe made them feel foolish but that w...   
138065  i joined the lds church i admit to feeling som...   
146440  i must admit i didnt feel like hugging him not...   
103337  i hate that i can still feel if any nerve is d...   
315528                  im actually feeling a little smug   

                                            expanded_text  
36130   id say maybe made them feel foolish but that w...  
138065  i joined the lds church i admit to feeling som...  
146440  i must admit i did not feel like hugging him n...  
103337  i hate that i can 

First, I import the contractions module, which expands contractions in text so that it is clearer for analysis. For example, it transforms "can't" into "cannot", which is useful in text preprocessing for NLP tasks. I then define a function expand_text that applies this module to any given text, standardizing contractions throughout the dataset.

Next, I apply two preprocessing steps to my data stored in the emo_data DataFrame. Initially, I use the clean_text function to remove URLs, user mentions, hashtags, special characters, numbers, and extra spaces, and I convert all text to lowercase. The cleaned text is stored in a new column cleaned_text. Afterwards, I use the expand_text function on the cleaned_text column to expand any contractions, storing the results in another new column expanded_text.

Finally, I print the first few rows of the original text along with their cleaned and expanded versions to verify the transformations. In the sample, "didnt" is expanded to "did not", but "id" in the first row is left unchanged, so the library catches only some of the apostrophe-free contractions in this dataset. Standardizing the text this way makes the dataset more uniform before further analysis such as emotion classification.

In [8]:
#custom contraction dictionary, because the library did not expand every contraction in this dataset
contraction_mapping = {
    "can't": "cannot",
    "won't": "will not",
    "i'd've": "i would have",
    "i'd": "i would",
    "you're": "you are",
    "she's": "she is",
    "it's": "it is",
    "didn't": "did not",
    "we've": "we have",
    "isn't": "is not",
    "aren't": "are not",
    "wasn't": "was not",
    "weren't": "were not",
    "haven't": "have not",
    "hasn't": "has not",
    "hadn't": "had not",
    "wouldn't": "would not",
    "don't": "do not",
    "doesn't": "does not",
    "shouldn't": "should not",
    "mightn't": "might not",
    "mustn't": "must not",
    "could've": "could have",
    "would've": "would have",
    "might've": "might have",
    "must've": "must have",
    "how's": "how is",
    "what's": "what is",
    "where's": "where is",
    "who's": "who is",
    "when's": "when is",
    "why's": "why is",
    "let's": "let us",
    "y'all": "you all",
    "y'know": "you know",
    "gonna": "going to",
    "gotta": "got to",
    "wanna": "want to",
    "ain't": "is not",
    "that'll": "that will",
    "there's": "there is",
    "there're": "there are",
    "here's": "here is"
}


In [9]:
import re
def expand_contractions(text, contraction_dict):
    #regular expression pattern for finding contractions (case-insensitive);
    #note: there are no word boundaries, so a key can also match inside a longer word
    contractions_pattern = re.compile('({})'.format('|'.join(contraction_dict.keys())), flags=re.IGNORECASE)
    def expand_match(contraction):
        match = contraction.group(0)
        #retrieving the expanded form from the dictionary (keys are lowercase)
        expanded_contraction = contraction_dict.get(match.lower(), match)
        #restoring a capital first letter, e.g. "You're" -> "You are"
        if match[0].isupper():
            expanded_contraction = expanded_contraction[0].upper() + expanded_contraction[1:]
        return expanded_contraction
    expanded_text = contractions_pattern.sub(expand_match, text)
    return expanded_text

In [10]:
#let's see whether our function works
texts = [
    "I'd've liked to see that",
    "You're going to need to run faster",
    "He can't join us",
]

for text in texts:
    print(f"Original: {text}")
    print(f"Expanded: {expand_contractions(text, contraction_mapping)}\n")

Original: I'd've liked to see that
Expanded: I would have liked to see that

Original: You're going to need to run faster
Expanded: You are going to need to run faster

Original: He can't join us
Expanded: He cannot join us



The function expands the contractions in these test sentences correctly, so now it is time to apply it to my DataFrame:

In [11]:
emo_data['expanded_text'] = emo_data['cleaned_text'].apply(lambda x: expand_contractions(x, contraction_mapping))

# Check the first few entries to verify the changes
print(emo_data[['cleaned_text', 'expanded_text']].head())

                                             cleaned_text  \
36130   id say maybe made them feel foolish but that w...   
138065  i joined the lds church i admit to feeling som...   
146440  i must admit i didnt feel like hugging him not...   
103337  i hate that i can still feel if any nerve is d...   
315528                  im actually feeling a little smug   

                                            expanded_text  
36130   id say maybe made them feel foolish but that w...  
138065  i joined the lds church i admit to feeling som...  
146440  i must admit i didnt feel like hugging him not...  
103337  i hate that i can still feel if any nerve is d...  
315528                  im actually feeling a little smug  


Now the problem is clear: nothing was expanded. The contractions in my text either use non-standard characters such as typographic (curly) apostrophes or are missing their apostrophes altogether (for example "didnt" and "im" in the sample above), so the keys in the contraction dictionary never match and nothing is replaced.
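
A quick way to confirm this diagnosis is to count apostrophe characters directly. The sketch below is a minimal check, assuming a hypothetical helper named apostrophe_counts; the first list holds shortened rows from the sample output above, and the second is a made-up raw string for contrast.

```python
from collections import Counter

STRAIGHT = "'"
CURLY = "\u2019"  # right single quotation mark, the usual "curly" apostrophe

def apostrophe_counts(texts):
    """Count straight and curly apostrophes across an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts["straight"] += text.count(STRAIGHT)
        counts["curly"] += text.count(CURLY)
    return counts

# Shortened rows from the sample output above
dataset_rows = [
    "i must admit i didnt feel like hugging him",
    "im actually feeling a little smug",
]
# Made-up raw text containing both apostrophe styles
raw_example = ["I can\u2019t and I won't"]

print(apostrophe_counts(dataset_rows))
print(apostrophe_counts(raw_example))
```

If both counts are zero for the real column, as they are for these rows, only apostrophe-free keys can ever match.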

In [12]:
# Existing contraction dictionary with standard contractions
contraction_mapping = {
    "can't": "cannot",
    "won't": "will not",
    "i'd've": "i would have",
    "i'd": "i would",
    "you're": "you are",
    "she's": "she is",
    "it's": "it is",
    "didn't": "did not",
    "we've": "we have",
    "isn't": "is not",
    "aren't": "are not",
    "wasn't": "was not",
    "weren't": "were not",
    "haven't": "have not",
    "hasn't": "has not",
    "hadn't": "had not",
    "wouldn't": "would not",
    "don't": "do not",
    "doesn't": "does not",
    "shouldn't": "should not",
    "mightn't": "might not",
    "mustn't": "must not",
    "could've": "could have",
    "would've": "would have",
    "might've": "might have",
    "must've": "must have",
    "how's": "how is",
    "what's": "what is",
    "where's": "where is",
    "who's": "who is",
    "when's": "when is",
    "why's": "why is",
    "let's": "let us",
    "y'all": "you all",
    "y'know": "you know",
    "gonna": "going to",
    "gotta": "got to",
    "wanna": "want to",
    "ain't": "is not",
    "that'll": "that will",
    "there's": "there is",
    "there're": "there are",
    "here's": "here is"
}

#new entries without apostrophes; I will merge the two dictionaries at the end
new_contractions = {
    "cant": "cannot",
    "wont": "will not",
    "idve": "i would have",
    "id": "i would",
    "youre": "you are",
    "shes": "she is",
    "its": "it is",
    "didnt": "did not",
    "weve": "we have",
    "isnt": "is not",
    "arent": "are not",
    "wasnt": "was not",
    "werent": "were not",
    "havent": "have not",
    "hasnt": "has not",
    "hadnt": "had not",
    "wouldnt": "would not",
    "dont": "do not",
    "doesnt": "does not",
    "shouldnt": "should not",
    "mightnt": "might not",
    "mustnt": "must not",
    "couldve": "could have",
    "wouldve": "would have",
    "mightve": "might have",
    "mustve": "must have",
    "hows": "how is",
    "whats": "what is",
    "wheres": "where is",
    "whos": "who is",
    "whens": "when is",
    "whys": "why is",
    "lets": "let us",
    "yall": "you all",
    "yknow": "you know",
    "gonna": "going to",
    "gotta": "got to",
    "wanna": "want to",
    "aint": "is not",
    "thatll": "that will",
    "theres": "there is",
    "therere": "there are",
    "heres": "here is"
}

#updating the existing dictionary with the new entries, i.e. merging the two
contraction_mapping.update(new_contractions)

#the dictionary now handles both the apostrophe and the apostrophe-free versions

In [13]:
emo_data['expanded_text'] = emo_data['cleaned_text'].apply(lambda x: expand_contractions(x, contraction_mapping))
#the output
print(emo_data[['cleaned_text', 'expanded_text']].head())

                                             cleaned_text  \
36130   id say maybe made them feel foolish but that w...   
138065  i joined the lds church i admit to feeling som...   
146440  i must admit i didnt feel like hugging him not...   
103337  i hate that i can still feel if any nerve is d...   
315528                  im actually feeling a little smug   

                                            expanded_text  
36130   i would say maybe made them feel foolish but t...  
138065  i joined the lds church i admit to feeling som...  
146440  i must admit i did not feel like hugging him n...  
103337  i hate that i can still feel if any nerve is d...  
315528                  im actually feeling a little smug  


To address contractions that are missing their apostrophes, which prevents standard contraction dictionaries from matching them, I've implemented a custom approach:

I've defined a function named expand_contractions that takes a custom dictionary, contraction_mapping, covering both the standard forms and the apostrophe-free forms of each contraction. I then apply this function to each entry in the cleaned_text column of my emo_data DataFrame, using a lambda function that passes each text entry to expand_contractions along with contraction_mapping.

After applying the function, I print the first few rows of both the cleaned_text and the expanded_text columns using print(emo_data[['cleaned_text', 'expanded_text']].head()). This lets me compare the original and processed text: "id" is now expanded to "i would" and "didnt" to "did not".
The approach has limits, though. "im" is not in the dictionary, so it is left unchanged in the last row. Curly apostrophes are not dictionary keys either; clean_text has already replaced every apostrophe, straight or curly, with a space. And because the pattern has no word boundaries, short keys such as "id" can also match inside longer words; the sample printed at the end of this notebook shows "said" turned into "sai would".
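
One possible refinement is to anchor the pattern on word boundaries so that keys only match whole words. The sketch below is an illustration under assumptions, not the function used to produce the outputs in this notebook: expand_contractions_bounded and demo_mapping are hypothetical names, and the mapping is a small subset chosen for the demo.

```python
import re

def build_pattern(contraction_dict):
    # Longest keys first, so a longer key is tried before any key it contains
    keys = sorted(contraction_dict, key=len, reverse=True)
    alternation = "|".join(re.escape(key) for key in keys)
    # \b on both sides restricts matches to whole words
    return re.compile(r"\b(?:" + alternation + r")\b", flags=re.IGNORECASE)

def expand_contractions_bounded(text, contraction_dict):
    pattern = build_pattern(contraction_dict)

    def expand_match(m):
        matched = m.group(0)
        expanded = contraction_dict[matched.lower()]
        # Keep a capitalized first letter if the original had one
        if matched[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded

    return pattern.sub(expand_match, text)

# Hypothetical subset of the mapping, for illustration only
demo_mapping = {"id": "i would", "didnt": "did not", "its": "it is", "ain't": "is not"}

print(expand_contractions_bounded("when it said im feeling artistic", demo_mapping))
print(expand_contractions_bounded("id say i didnt", demo_mapping))
```

With boundaries in place, "said" is left alone while a standalone "id" is still expanded. Ambiguous keys such as "its" (possessive versus "it is") remain a judgment call that word boundaries cannot resolve.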

**Why handling contractions more thoroughly is beneficial for my sentiment analysis project:**


1. A contraction and its expanded form ("don't" and "do not", "can't" and "cannot") carry the same meaning and the same emotional valence, but they produce different tokens. By expanding contractions, I ensure my machine learning models treat both forms as the same features, leading to more accurate learning.

2. If I don't expand contractions, my stop word removal might miss parts of them. For instance, tokenizing "I'm" can leave fragments such as "I" and "m", and whether each fragment is removed depends on the stop word list. Expanding the contraction first allows me to remove the full form cleanly.

3. If I have many variations of contractions ("haven't", "hasn't", "hadn't"), each one becomes a less frequent feature. Expanding them consolidates the occurrences (all three share "not"), making it easier for the machine learning model to recognize patterns.
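
The consolidation effect can be seen with a simple token count. This is a minimal sketch on a made-up three-sentence corpus with a tiny illustrative mapping (both are assumptions for the demo, not project data).

```python
import re
from collections import Counter

# Tiny illustrative mapping and corpus, both made up for this demo
mapping = {"dont": "do not", "don't": "do not", "cant": "cannot", "can't": "cannot"}
corpus = ["i dont like mondays", "i don't like rain", "i do not like noise"]

keys = sorted(mapping, key=len, reverse=True)
pattern = re.compile(r"\b(?:" + "|".join(re.escape(k) for k in keys) + r")\b")

def expand(text):
    # Replace each whole-word contraction with its expanded form
    return pattern.sub(lambda m: mapping[m.group(0)], text)

def token_counts(texts):
    # Whitespace tokenization is enough for this lowercase demo corpus
    return Counter(token for text in texts for token in text.split())

before = token_counts(corpus)
after = token_counts(expand(text) for text in corpus)

print("before:", before["dont"], before["don't"], before["not"])
print("after: ", after["dont"], after["don't"], after["not"])
```

Before expansion the negation is spread over three different tokens; afterwards every sentence contributes to the same "not" feature.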

In [14]:
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [15]:
#Tokenization function
def tokenize_text(text):
    tokens = word_tokenize(text)
    return tokens
#'expanded_text' is ready after contraction expansion
if 'expanded_text' in emo_data.columns:
    emo_data['tokens'] = emo_data['expanded_text'].apply(tokenize_text)
    print(emo_data[['expanded_text', 'tokens']].head())
else:
    print("The expected column does not exist in the DataFrame.")

                                            expanded_text  \
36130   i would say maybe made them feel foolish but t...   
138065  i joined the lds church i admit to feeling som...   
146440  i must admit i did not feel like hugging him n...   
103337  i hate that i can still feel if any nerve is d...   
315528                  im actually feeling a little smug   

                                                   tokens  
36130   [i, would, say, maybe, made, them, feel, fooli...  
138065  [i, joined, the, lds, church, i, admit, to, fe...  
146440  [i, must, admit, i, did, not, feel, like, hugg...  
103337  [i, hate, that, i, can, still, feel, if, any, ...  
315528           [im, actually, feeling, a, little, smug]  


I am incorporating text tokenization, which is a fundamental step in natural language processing (NLP). Here's an overview of how my script operates:

First, I import the Natural Language Toolkit (nltk), specifically importing word_tokenize from nltk.tokenize. This function is widely used in NLP for breaking down a string (text) into individual words or tokens, which are the basic units for text analysis.
Then, I make sure the necessary NLTK resources are available by downloading them with nltk.download('punkt'). word_tokenize needs the 'punkt' model because it first splits the text into sentences before splitting each sentence into words.
The function tokenize_text is defined to apply the word_tokenize method to any given text. This function processes the input text to create a list of tokens (words and punctuation), making the text easier to analyze and manipulate at a granular level.
I then check if the expanded_text column exists in the emo_data DataFrame. Assuming that this column contains text that has already been cleaned and had contractions expanded, I apply the tokenize_text function to each entry in this column. The resulting tokens for each text entry are stored in a new column named tokens.
Finally, I print the first few rows of the expanded_text and tokens columns to inspect the tokenization results. If the expanded_text column does not exist, a message is printed indicating that the expected column is missing from the DataFrame.
This process of tokenization is critical as it prepares the text data for deeper NLP tasks such as sentiment analysis, emotion recognition, or even machine learning modeling by breaking down complex strings of text into manageable, analyzable components.
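
Because the text has already been cleaned to lowercase words separated by single spaces, plain whitespace splitting reproduces the token list shown above for the last sample row. The sketch below is a dependency-free stand-in (simple_tokenize is a hypothetical name), not a replacement for word_tokenize.

```python
def simple_tokenize(text):
    """Whitespace tokenizer: a stand-in for word_tokenize on already-cleaned text."""
    return text.split()

tokens = simple_tokenize("im actually feeling a little smug")
print(tokens)

# On raw text the two approaches differ: punctuation stays attached here,
# whereas word_tokenize would split it off into a separate token.
print(simple_tokenize("i feel great!"))
```

The difference matters only if tokenization runs before cleaning, which is one reason to keep the order of the preprocessing steps fixed.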

**Generally, it's a good idea to address refinement opportunities like negation handling and domain-specific term retention before lemmatization. Here's why:**

Lemmatization might not always handle negation gracefully. For instance, if the negation in "doesn't feel good" is lost during normalization, the phrase could end up as "do feel good", completely reversing the sentiment. Handling negation beforehand ensures correct sentiment interpretation downstream.
Also, lemmatization could transform domain-specific terms into their base forms, potentially losing valuable information for my task.
Finally, there is computational efficiency: since lemmatization is slightly more computationally expensive than steps like negation tagging or replacing domain-specific terms, doing it after refinement means I won't spend resources processing words I might ultimately discard.
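
The ordering argument is easiest to see with stop word removal, a normalization step where the negation can silently disappear. The sketch below uses a small hand-written stop word set as an assumption for the demo (NLTK's English list likewise contains "not" and "no") and a reduced negation pattern.

```python
import re

# Illustrative subset of a stop word list (an assumption for this demo)
STOP_WORDS = {"i", "do", "not", "no", "the", "a"}
NEGATION = re.compile(r"\b(not|no|never)\s+(\w+)")

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

text = "i do not feel good"

# Order A: stop word removal first, so the negation is dropped
plain = remove_stop_words(text.split())

# Order B: join the negation to the next word first, then remove stop words
tagged = remove_stop_words(NEGATION.sub(r"\1_\2", text).split())

print(plain)
print(tagged)
```

In order A the remaining tokens read as a positive statement; in order B the joined token keeps the negation through the later steps.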


**Implementing Special N-grams for Negation Patterns**
To effectively capture negation patterns in my preprocessing, I can adjust my tokenization process to treat common negation terms and their immediate successors as single tokens (n-grams). This helps preserve the sentiment context that negations impart on the phrases. Here's how to implement this:

In [16]:
#re-importing the libraries so that this cell can run on its own
import pandas as pd
import re
from nltk.tokenize import TweetTokenizer
from nltk.sentiment import SentimentIntensityAnalyzer
from transformers import pipeline
import nltk
nltk.download('vader_lexicon')
#initializing the sentiment analysis tools
sid = SentimentIntensityAnalyzer()
bert_classifier = pipeline('sentiment-analysis')
tokenizer = TweetTokenizer(preserve_case=False)

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [17]:
def negation_handling(text):
    #pattern for a negation word followed by the next word; the apostrophe forms
    #only match raw text, because clean_text has already removed apostrophes
    negation_pattern = re.compile(r'\b(not|no|never|none|nothing|nobody|neither|nor|cannot|can\'t|don\'t|doesn\'t|didn\'t|won\'t|wouldn\'t|shan\'t|shouldn\'t)\s+(\w+)', re.IGNORECASE)
    #replacing the space between the negation and the following word with an underscore
    text = negation_pattern.sub(r'\1_\2', text)
    #tokenizing with the TweetTokenizer initialized above
    return tokenizer.tokenize(text)

By combining special n-grams for negation patterns during tokenization with dedicated sentiment analysis tools, I plan to capture the effect of negations on sentiment accurately and to analyze the linguistic features that contribute to it. I expect this combined approach to improve the accuracy of my sentiment classification and to deepen my understanding of how linguistic features like negation influence sentiment in tweets.
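
The joining step can be tried on its own, without NLTK. The sketch below uses a reduced version of the negation pattern (apostrophe-free forms only, which is an assumption matching the cleaned text) and a hypothetical helper named join_negations.

```python
import re

# Reduced version of the negation pattern (apostrophe-free forms only)
NEGATION_PATTERN = re.compile(
    r"\b(not|no|never|none|nothing|nobody|neither|nor|cannot)\s+(\w+)",
    re.IGNORECASE,
)

def join_negations(text):
    """Join each negation word to the word that follows it with an underscore."""
    return NEGATION_PATTERN.sub(r"\1_\2", text)

first = join_negations("i must admit i did not feel like hugging him")
second = join_negations("he will not think its very cool")
print(first)
print(second)
```

Only the single word right after the negation is captured. In the second sentence the n-gram is "not_think", while "cool", the word that actually carries the sentiment, stays outside it.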

**Combination Strategy for tools handling negation in sentiment analysis**
For a research project focusing on the emotional content of tweets and exploring linguistic characteristics, including how negation affects sentiment, a combined approach using VADER and a transformer-based model like BERT could be ideal:

I am going to use VADER for initial assessments and to quickly process large volumes of data where the context is less critical. It’s also beneficial for its built-in sensitivity to both negations and intensifiers, which are common in social media text.
I will use BERT or a similar model for deep analysis where context matters significantly. This is especially useful in cases where the sentiment of the text is subtle or depends heavily on the overall context rather than on specific words.

In [18]:
pip install transformers



In [19]:
#maximum width of each column to None, which means unlimited
pd.set_option('display.max_colwidth', None)
#adjusting the overall display width of the DataFrame
pd.set_option('display.width', 1000)

In [20]:
def analyze_sentiment(text):
    tokens = negation_handling(text)
    processed_tweet = ' '.join(tokens)  #tokens into a single string for analysis
    vader_scores = sid.polarity_scores(processed_tweet)
    #BERT for deeper analysis if VADER is uncertain
    if abs(vader_scores['compound']) < 0.1:
        bert_result = bert_classifier(processed_tweet)
        final_sentiment = bert_result
    else:
        final_sentiment = vader_scores

    return final_sentiment
#selecting a random sample of tweets to analyze (unseeded, so each run draws different tweets)
random_tweets = emo_data.sample(n=3)
random_tweets['sentiment_analysis'] = random_tweets['cleaned_text'].apply(analyze_sentiment)
print(random_tweets[['cleaned_text', 'sentiment_analysis']])

                                                                                                                                                      cleaned_text                                               sentiment_analysis
88271                         i used to have stretches of time when i was feeling ok and not showing that i could actually forget that you re in there but no more    {'neg': 0.06, 'neu': 0.823, 'pos': 0.117, 'compound': 0.1027}
395658                                                                                                                    i feel like these things are just stupid  {'neg': 0.312, 'neu': 0.459, 'pos': 0.229, 'compound': -0.2263}
233184  i was am feeling delicate i thought a simple one would be best so i went for one of several options in maddhur jaffreys curry easy red lentils with ginger     {'neg': 0.0, 'neu': 0.692, 'pos': 0.308, 'compound': 0.8316}


I've focused on sentiment analysis of tweet texts by combining the negation-aware preprocessing with two sentiment analysis tools. Here's how my implementation works:

First, I configure pandas to display the entire content of each DataFrame column without truncation by setting display.max_colwidth to None, so that no part of the tweet text is cut off when inspecting or debugging data. I also set the overall display width to 1000, which makes wide outputs easier to view without line wrapping.
Then, I define a function analyze_sentiment that combines the custom negation handling with sentiment analysis. The function starts by processing the input text with negation_handling to build the special n-grams that preserve the context around negations. After tokenization, the tokens are rejoined into a single string, because both sentiment tools expect text input rather than token lists. I use the SentimentIntensityAnalyzer (VADER) from NLTK, whose polarity_scores method returns 'neg', 'neu', and 'pos' proportions plus a 'compound' score, a single normalized value between -1 and 1 that summarizes the overall sentiment. If VADER's compound score is close to zero (less than 0.1 in absolute value, which I treat as uncertain), I pass the text to the transformers pipeline instead; as the log above shows, it defaults to a DistilBERT model fine-tuned on SST-2.
Depending on the VADER result, the function therefore returns either the VADER score dictionary or the classifier output, so the two branches return different formats.
I then select a random sample of three tweets from my dataset (emo_data) and apply the analyze_sentiment function to each, storing the result in a new column sentiment_analysis in the random_tweets DataFrame.
Finally, I print the cleaned tweet texts alongside their analyzed sentiments to inspect the results of the pipeline.
This method aims to capture more nuanced expressions, especially around negations; the examples below show how far it succeeds.


The three tweets discussed below come from a different run of the cell than the output printed above; the sample is not seeded, so each run draws different tweets.

Tweet 1: "i watched a sappy movie last night and feel weepy and sad today"
Sentiment: Negative ({'neg': 0.481, 'neu': 0.519, 'pos': 0.0, 'compound': -0.7506})
Analysis: Strong negative sentiment, likely due to words like "weepy" and "sad."

Tweet 2: "i have a feeling he will not think its very cool for his mom to plan his party anymore"
Sentiment: Positive ({'neg': 0.0, 'neu': 0.657, 'pos': 0.343, 'compound': 0.6997})
Analysis: Despite the negation "not think its very cool," the overall score is positive, possibly because of other elements in the sentence.

Tweet 3: "i feel like people wake up and already determine that they are going to have a crappy day without giving the day a chance because it is the easy way out"
Sentiment: Slightly positive ({'neg': 0.103, 'neu': 0.686, 'pos': 0.211, 'compound': 0.4215})
Analysis: This is interesting, as the presence of "crappy" suggests negative sentiment, but the overall score is positive, which might reflect a nuanced reading or an error in sentiment detection.

**Observations**

VADER seems quite effective in capturing sentiments that are clearly expressed through straightforward language, as observed in the first tweet about feeling "weepy and sad". VADER's scores are directly reflective of the negative emotions present, highlighted by a significant negative compound score.

Besides, the second tweet presents an interesting case where negation plays a crucial role. The tweet includes a negation ("will not think its very cool"), which could potentially invert the sentiment of the sentence. The positive score suggests that either the negation was not handled as expected, impacting the sentiment analysis, or other elements in the sentence outweighed the negative phrase. This indicates a potential area for refining how negation is addressed in the sentiment analysis process.

Finally, the third tweet's sentiment analysis demonstrates the complexity of handling mixed sentiments within a single statement. The tweet contains phrases that suggest negativity ("crappy day"), but the overall sentiment score is slightly positive. This could be due to the phrasing that might have been interpreted positively by the sentiment analyzer or the nuanced context that BERT (if it was triggered) managed to understand better than VADER.

The threshold for involving BERT (absolute compound score below 0.1) is a deliberate choice. BERT is a context-aware model and can potentially give a more accurate reading of complex sentences, but it is computationally intensive, so calling it only when VADER is uncertain is practical. In the examples above, Tweets 2 and 3 have compound scores far from zero, so BERT would not have been invoked and the reported sentiment is VADER's alone. Both tweets also show a limit of this design: VADER can be confidently wrong, and a threshold on the absolute compound score never routes those cases to BERT. Logging how often BERT is activated, and how its label compares with VADER's, would show what the fallback actually contributes. The threshold therefore needs careful calibration so that BERT is used where it helps without unnecessary computation, and the balance between accuracy and efficiency should be re-evaluated for production use or larger-scale analysis.
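
To make the calibration point concrete, here is a minimal sketch of how I could measure how often the fallback fires at different thresholds. The helper names (needs_fallback, fallback_rate) are hypothetical and not part of the notebook, and the last two compound values are made up for illustration; only the first two come from Tweets 2 and 3 above.

```python
from collections import Counter

def needs_fallback(compound, threshold=0.1):
    """Return True when a VADER compound score counts as 'uncertain'."""
    return abs(compound) < threshold

def fallback_rate(compounds, threshold=0.1):
    """Fraction of texts that would be routed to the second model."""
    counts = Counter(needs_fallback(c, threshold) for c in compounds)
    total = len(compounds)
    return counts[True] / total if total else 0.0

# 0.6997 and 0.4215 are the scores of Tweets 2 and 3; the other two are invented
example_compounds = [0.6997, 0.4215, 0.05, -0.02]

for threshold in (0.05, 0.1, 0.3):
    rate = fallback_rate(example_compounds, threshold)
    print(f"threshold={threshold}: {rate:.0%} of texts routed to the fallback")
```

Running the same function on the real vader_compound column would show what share of the 10,000 tweets a given threshold sends to BERT, which is the number that drives the compute cost.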

In [21]:
random_tweets = emo_data.sample(n=5)
random_tweets

Unnamed: 0.1,Unnamed: 0,text,label,cleaned_text,expanded_text,tokens
414382,414382,i clicked a scrolling button on the google home page when it said im feeling artistic,1,i clicked a scrolling button on the google home page when it said im feeling artistic,i clicked a scrolling button on the google home page when it sai would im feeling artistic,"[i, clicked, a, scrolling, button, on, the, google, home, page, when, it, sai, would, im, feeling, artistic]"
101523,101523,i feel it is sooo cute,1,i feel it is sooo cute,i feel it is sooo cute,"[i, feel, it, is, sooo, cute]"
117558,117558,i feel completely disheartened and overwhelmed by our warring ways i meditate,0,i feel completely disheartened and overwhelmed by our warring ways i meditate,i feel completely disheartened and overwhelmed by our warring ways i meditate,"[i, feel, completely, disheartened, and, overwhelmed, by, our, warring, ways, i, meditate]"
407722,407722,i feel honoured to join into the teachers discussion which truly shows that im a professional teacher now,1,i feel honoured to join into the teachers discussion which truly shows that im a professional teacher now,i feel honoured to join into the teachers discussion which truly show is that im a professional teacher now,"[i, feel, honoured, to, join, into, the, teachers, discussion, which, truly, show, is, that, im, a, professional, teacher, now]"
213497,213497,i was at school tonight and saw a really pretty yellow moon and started feeling all romantic and junk,2,i was at school tonight and saw a really pretty yellow moon and started feeling all romantic and junk,i was at school tonight and saw a really pretty yellow moon and started feeling all romantic and junk,"[i, was, at, school, tonight, and, saw, a, really, pretty, yellow, moon, and, started, feeling, all, romantic, and, junk]"


Breakdown of Text Processing Steps
Cleaning Text
The cleaned_text column contains text that is presumably ready for further processing. In these five samples it is identical to text, which suggests that the raw tweets were already lowercase and free of punctuation, so the cleaning step had nothing to change.
Observation: I should make sure that potential noise such as URLs, special characters, or HTML tags is also covered by the cleaning process, especially if I scale to more diverse datasets.
Expansion of Contractions
The expanded_text column is not identical to cleaned_text in every row. In the first sample "said" has become "sai would", and in the fourth "shows" has become "show is". Both look like false positives from contraction rules that match a trailing "d" or "s" even when no apostrophe is present. Meanwhile, genuine contractions written without apostrophes, such as "im", are left unexpanded.
Observation: The contraction handling needs to be verified in both directions, by checking that known contractions are expanded as expected and that ordinary words ending in "d" or "s" are left alone.
Tokenization
The tokens column shows that the text has been tokenized into individual words. This is evident from the list format containing each word as a separate element, which is what I would expect after the tokenization step. Because the tokens are built from expanded_text, they inherit the expansion errors ("sai", "would").
Observation: The tokenization appears to be thorough, with each word separated; punctuation handling cannot be judged from these samples because none is present. For advanced processing, I might consider whether stemming or lemmatization (reducing words to their base or root form) would be beneficial, depending on my end goals.
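
Since the expansion code itself is not shown in this part of the notebook, the following is only a sketch of one safer approach, under the assumption that the tweets contain no apostrophes: expand a small whitelist of whole words instead of matching suffixes. The CONTRACTIONS dictionary and the expand_contractions function are hypothetical names introduced here for illustration.

```python
import re

# Hypothetical whitelist of apostrophe-free contractions; extend as needed.
# Ambiguous forms such as "id" (I'd vs. id) or "well" (we'll vs. well)
# are deliberately left out.
CONTRACTIONS = {
    "im": "i am",
    "ive": "i have",
    "didnt": "did not",
    "wont": "will not",
    "wasnt": "was not",
}

# \b anchors make the pattern match whole words only,
# so "said" or "shows" can never be split.
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, CONTRACTIONS)) + r")\b")

def expand_contractions(text):
    """Replace whitelisted contractions with their expanded form."""
    return _PATTERN.sub(lambda match: CONTRACTIONS[match.group(1)], text)

print(expand_contractions("when it said im feeling artistic"))
print(expand_contractions("which truly shows that im a professional teacher now"))
```

The trade-off is coverage: a whitelist only expands what it lists, but it cannot corrupt ordinary words the way a suffix rule can.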

In [22]:
nltk.download('wordnet')  # WordNet lemmatizer resource
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [23]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

The tokens column shows that the text has been broken down into individual words and symbols effectively, which is crucial for accurate further processing. The lemmatized_tokens column aims to reduce each word to its base or dictionary form. Lemmatization is generally more sophisticated than stemming because it uses lexical knowledge bases to derive the correct root forms.

For example, the verb "was" is lemmatized to "wa", which is incorrect; "was" should be lemmatized to "be". This happens because WordNetLemmatizer treats every word as a noun unless a part-of-speech tag is passed, so "was" is handled like a plural noun and loses its final "s".

Most words appear unchanged from tokens to lemmatized_tokens. Some are already in their base form, but others are verbs or adjectives that the noun-only default cannot reduce, so I need to make sure the lemmatizer is given the part of speech of each token.
**Problems I am having:**
As noted, "was" becoming "wa" is an error. It results from a missing POS tag before lemmatization: lemmatization algorithms rely on correct POS tags to pick the right base form.
Handling of Contractions and Punctuation: The tokenization step should separate punctuation from words unless they belong together (as in contractions or specific abbreviations). I need to check that my tokenization process is tuned to the characteristics of my data.

 **POS tagging**
 To maximize lemmatization accuracy, consider adding part-of-speech tagging before lemmatization. This can help the lemmatizer more accurately determine the correct base form.


In [24]:
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [25]:
!pip install spacy
!python -m spacy download en_core_web_sm


Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [26]:
import spacy
#the spaCy English tokenizer
nlp = spacy.load("en_core_web_sm")
def improved_pos_tagging(text):
    doc = nlp(text)
    return [(token.text, token.lemma_, token.pos_) for token in doc]
#example usage to see if the function is working or not
sample_text = "I felt happy because I saw the others were happy and because I knew I should feel happy, but I wasn’t really happy."
tags = improved_pos_tagging(sample_text)
print(tags)


[('I', 'I', 'PRON'), ('felt', 'feel', 'VERB'), ('happy', 'happy', 'ADJ'), ('because', 'because', 'SCONJ'), ('I', 'I', 'PRON'), ('saw', 'see', 'VERB'), ('the', 'the', 'DET'), ('others', 'other', 'NOUN'), ('were', 'be', 'AUX'), ('happy', 'happy', 'ADJ'), ('and', 'and', 'CCONJ'), ('because', 'because', 'SCONJ'), ('I', 'I', 'PRON'), ('knew', 'know', 'VERB'), ('I', 'I', 'PRON'), ('should', 'should', 'AUX'), ('feel', 'feel', 'VERB'), ('happy', 'happy', 'ADJ'), (',', ',', 'PUNCT'), ('but', 'but', 'CCONJ'), ('I', 'I', 'PRON'), ('was', 'be', 'AUX'), ('n’t', 'not', 'PART'), ('really', 'really', 'ADV'), ('happy', 'happy', 'ADJ'), ('.', '.', 'PUNCT')]


In [27]:
#Loading the spaCy English model
nlp = spacy.load("en_core_web_sm")
#initializing the WordNet lemmatizer
lemmatizer = WordNetLemmatizer()
#helper function to convert spaCy POS tags to WordNet POS tags
def get_wordnet_pos(spacy_token):
    if spacy_token.pos_ in ('VERB', 'AUX'):
        return wordnet.VERB
    elif spacy_token.pos_ == 'NOUN':
        return wordnet.NOUN
    elif spacy_token.pos_ == 'ADJ':
        return wordnet.ADJ
    elif spacy_token.pos_ == 'ADV':
        return wordnet.ADV
    else:
        return wordnet.NOUN  #default

#function to perform POS tagging and lemmatization
def lemmatize_text(text):
    doc = nlp(text)
    lemmatized = [lemmatizer.lemmatize(token.text, get_wordnet_pos(token)) for token in doc]
    return lemmatized

# let's set up an example usage on a sample text
#to see if it is really working
sample_text = "I felt happy because I saw the others were happy and because I knew I should feel happy, but I wasn’t really happy."
lemmatized_text = lemmatize_text(sample_text)
print(lemmatized_text)

['I', 'felt', 'happy', 'because', 'I', 'saw', 'the', 'others', 'be', 'happy', 'and', 'because', 'I', 'know', 'I', 'should', 'feel', 'happy', ',', 'but', 'I', 'be', 'n’t', 'really', 'happy', '.']


In [28]:
emo_data['lemmatized_tokens'] = emo_data['cleaned_text'].apply(lemmatize_text)
#check the new lemmatized tokens
print(emo_data[['cleaned_text', 'lemmatized_tokens']].head())


        cleaned_text    lemmatized_tokens
36130   id say maybe made them feel foolish but that would be reeeeeeally reaching    [i, d, say, maybe, make, them, feel, foolish, but, that, would, be, reeeeeeally, reach]
138065  i joined the lds church i admit to feeling somewhat ashamed of my family background in light of the mormon ideal that presented itself to 

In [29]:
#An earlier attempt at applying a function to the DataFrame took very long
#to run, so it is left commented out:
#emo_data['lemmatized_tokens'] = emo_data['tokens'].apply(lemmatize_series)

My script combines Part-of-Speech (POS) tagging with lemmatization, using both the NLTK and spaCy libraries, to prepare the text for detailed analysis. Here's a detailed look at my approach:

nlp: A spaCy English language model (en_core_web_sm), which is lightweight but effective for many NLP tasks including tokenization, lemmatization, and POS tagging.
WordNetLemmatizer: An NLTK lemmatizer that uses WordNet's lexical database to find lemmas of words.
POS Tagging with spaCy: improved_pos_tagging uses spaCy's nlp to tokenize the text and generate tokens with attributes like text, lemma, and POS tag. This function returns these details for each token, aiding in comprehensive text analysis.
lemmatize_text function also uses spaCy's model to tokenize the text but goes further to lemmatize each token using the WordNetLemmatizer from NLTK. It converts spaCy POS tags to WordNet's format using the get_wordnet_pos helper function, which helps in selecting the correct lemma for each word based on its POS tag.
I first test these functions on a sample text to observe the output and ensure they function as expected.
Application to DataFrame: I apply lemmatize_text across the 'cleaned_text' column of my emo_data DataFrame, adding a new column 'lemmatized_tokens' that contains the lemmatized text. This process aims to reduce words to their base or dictionary form, helping in standardizing the dataset for subsequent analysis like sentiment analysis or thematic studies.
After processing, I print the initial cleaned text alongside the new lemmatized tokens for the first few entries in the DataFrame to verify the transformations.
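By combining spaCy's tokenization and POS tagging with NLTK's WordNet lemmatizer, the script reduces many inflected forms to a dictionary form: in the sample output "were" and "was" both become "be", and "knew" becomes "know". The same output also shows a limitation: "felt" and "saw" are left unchanged, most likely because WordNet lists both as verbs in their own right (to felt, to saw), whereas spaCy's own lemmas for these tokens are "feel" and "see". Using spaCy's token.lemma_ directly would avoid this and remove the need for the tag conversion. Consistent base forms matter for sentiment analysis and other NLP applications where variants of the same word should be counted together.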

I successfully implemented DataFrame display settings that show the full content of the columns, and I reviewed the top 10 rows to observe the results of my text preprocessing steps. This lets me see how the text is transformed at each stage: from the original text (text) to the cleaned text (cleaned_text), through expansion of contractions (expanded_text), tokenization (tokens), stop word filtering (filtered_tokens), and finally lemmatization (lemmatized_tokens).

In [30]:
#randomly sampling 15 tweets for review
sampled_tweets = emo_data.sample(n=15, random_state=42) #fixed seed so the same sample is drawn on every run
print(sampled_tweets[['text', 'cleaned_text', 'tokens', 'lemmatized_tokens']])

        text    cleaned_text    tokens  \
283253

Feature Extraction

In [31]:
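from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(ngram_range=(1, 2))  #limit to unigrams and bigrams
ngrams = vectorizer.fit_transform(emo_data['cleaned_text'])
#caution: toarray() materializes the full dense matrix (one column per n-gram),
#which can take several GB of memory; the sparse `ngrams` matrix is much smaller
df_ngrams = pd.DataFrame(ngrams.toarray(), columns=vectorizer.get_feature_names_out())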

In [32]:
sid = SentimentIntensityAnalyzer()
bert_classifier = pipeline('sentiment-analysis', model='nlptown/bert-base-multilingual-uncased-sentiment')
def get_sentiment_scores(text):
    # VADER Sentiment Analysis
    vader_scores = sid.polarity_scores(text)
    # Optional: BERT Sentiment Analysis (for deeper analysis or if VADER is uncertain)
    bert_scores = bert_classifier(text)[0] if abs(vader_scores['compound']) < 0.1 else None
    return vader_scores, bert_scores

config.json:   0%|          | 0.00/953 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/669M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/39.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/872k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

In [33]:
results = emo_data['text'].apply(lambda x: get_sentiment_scores(x))
#splitting results into separate columns
emo_data['vader_neg'] = results.apply(lambda x: x[0]['neg'])
emo_data['vader_neu'] = results.apply(lambda x: x[0]['neu'])
emo_data['vader_pos'] = results.apply(lambda x: x[0]['pos'])
emo_data['vader_compound'] = results.apply(lambda x: x[0]['compound'])
#BERT label and confidence are extracted the same way; both are None when BERT was not invoked
emo_data['bert_score'] = results.apply(lambda x: x[1]['label'] if x[1] else None)
emo_data['bert_confidence'] = results.apply(lambda x: x[1]['score'] if x[1] else None)

In [34]:
#to check the new columns
print(emo_data.head())
#save the DataFrame to a CSV file
emo_data.to_csv('tweets_with_sentiment_scores.csv', index=False)


        Unnamed: 0    text    label    cleaned_text    expanded_text    tokens  \
36130        36130    id say maybe made them feel foolish but that would be reeeeeeally reaching 

In [35]:
emo_data['word_count'] = emo_data['cleaned_text'].apply(lambda x: len(x.split()))
emo_data['char_count'] = emo_data['cleaned_text'].apply(len)
emo_data['exclamation_marks'] = emo_data['cleaned_text'].apply(lambda x: x.count('!'))
emo_data['question_marks'] = emo_data['cleaned_text'].apply(lambda x: x.count('?'))

In [36]:
pip install dask[complete]  # This installs Dask along with common dependencies like distributed for parallel execution

Collecting lz4>=4.3.2 (from dask[complete])
  Downloading lz4-4.3.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Installing collected packages: lz4
Successfully installed lz4-4.3.3


In [37]:
!pip install dask_ml

Collecting dask_ml
  Downloading dask_ml-2024.4.4-py3-none-any.whl (149 kB)
Collecting dask-glm>=0.2.0 (from dask_ml)
  Downloading dask_glm-0.3.2-py2.py3-none-any.whl (13 kB)
Collecting sparse>=0.7.0 (from dask-glm>=0.2.0->dask_ml)
  Downloading sparse-0.15.1-py2.py3-none-any.whl (116 kB)
Installing collected packages: sparse, dask-glm, dask_ml
Successfully installed dask-glm-0.3.2 dask_ml-2024.4.4 sparse-0.15.1


In [38]:
pip install dask[distributed]  # Includes Dask and additional features for distributed computing



In [39]:
print(emo_data.columns)

Index(['Unnamed: 0', 'text', 'label', 'cleaned_text', 'expanded_text', 'tokens', 'lemmatized_tokens', 'vader_neg', 'vader_neu', 'vader_pos', 'vader_compound', 'bert_score', 'bert_confidence', 'word_count', 'char_count', 'exclamation_marks', 'question_marks'], dtype='object')


In [40]:
from sklearn.feature_extraction.text import CountVectorizer
#Direct application without Dask
vectorizer = CountVectorizer(ngram_range=(1, 2))
try:
    ngrams = vectorizer.fit_transform(emo_data['cleaned_text'])
    print("N-grams shape:", ngrams.shape)
except Exception as e:
    print("Error during vectorization:", e)


N-grams shape: (10000, 88184)
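
To see why this matrix should stay sparse, here is a rough back-of-the-envelope estimate. The shape is the one printed above; everything else is an assumption: 8-byte integer counts, 4-byte CSR indices, and a made-up figure of about 40 distinct n-grams per tweet (the real number of stored values would come from ngrams.nnz). The helper names dense_bytes and csr_bytes are hypothetical.

```python
rows, cols = 10_000, 88_184  # shape printed above

def dense_bytes(n_rows, n_cols, itemsize=8):
    """Memory for a dense matrix with `itemsize` bytes per entry."""
    return n_rows * n_cols * itemsize

def csr_bytes(n_rows, nnz, value_size=8, index_size=4):
    """Approximate CSR memory: one value and one column index per stored
    entry, plus one row pointer per row (and one extra)."""
    return nnz * (value_size + index_size) + (n_rows + 1) * index_size

# ASSUMPTION: roughly 40 distinct unigrams/bigrams per tweet (illustrative only)
assumed_nnz = rows * 40

print(f"dense : {dense_bytes(rows, cols) / 2**30:.2f} GiB")
print(f"sparse: {csr_bytes(rows, assumed_nnz) / 2**20:.2f} MiB")
```

Under these assumptions the dense form is three orders of magnitude larger than the sparse one, which is the practical reason to avoid toarray() on the full matrix.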


In [41]:
import dask.dataframe as dd
from sklearn.feature_extraction.text import CountVectorizer
from dask import delayed
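#now converting to Dask DataFrame
ddf = dd.from_pandas(emo_data, npartitions=10)
#is 'cleaned_text' accessible before applying operations?
print(ddf['cleaned_text'].head())
#proceeding with the Dask operation; the vectorizer was already fitted on the
#full column above, so each partition is only transformed. Calling fit_transform
#per partition would give every partition its own, incompatible vocabulary.
transformed = ddf['cleaned_text'].map_partitions(lambda part: vectorizer.transform(part), meta='object')
ngrams = transformed.compute()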


81                                                                                                                                                                                                                     realizing that school will soon be over
85                                                                                                                                                                                                        im the customer i wont feel welcomed but intimidated
117                                                                                                                                                                                                                     ive been feeling hopeless and helpless
123                                                                                                                                                                                      i am a very different person now much more confide



In [42]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(ngram_range=(1, 2))
ngrams = vectorizer.fit_transform(emo_data['cleaned_text'])  # This is a sparse matrix already
#Checking the type to confirm it's a sparse matrix
print(type(ngrams))

<class 'scipy.sparse._csr.csr_matrix'>


In [43]:
pip install sparse



In [44]:
from scipy.sparse import hstack
df_features = ngrams

In [45]:
from sklearn.feature_extraction.text import CountVectorizer
#Initializing the vectorizer
vectorizer = CountVectorizer(ngram_range=(1, 2))
#fitting and transforming the vectorizer on the entire cleaned_text column
ngrams = vectorizer.fit_transform(emo_data['cleaned_text'])

In [46]:
import dask.array as da
from scipy.sparse import csr_matrix
#ensuring the resulting matrix is in CSR format for compatibility with Dask Array
if not isinstance(ngrams, csr_matrix):
    ngrams = csr_matrix(ngrams)
# Converting to Dask array
dask_sparse_array = da.from_array(ngrams, chunks=(1000, ngrams.shape[1]))  # Adjust chunk sizes appropriately

In [47]:
import dask.dataframe as dd
import pandas as pd
#to convert Dask array blocks to DataFrames
def convert_to_dataframe(blocks, feature_names):
    #now Converting sparse blocks to DataFrame
    return pd.DataFrame.sparse.from_spmatrix(blocks, columns=feature_names)

#use map_blocks to convert each block of the Dask array to a DataFrame block
ddf_ngrams = dask_sparse_array.map_blocks(convert_to_dataframe, dtype='float', feature_names=vectorizer.get_feature_names_out())

In [48]:
ddf = dd.from_pandas(emo_data.drop(columns=['cleaned_text']), npartitions=10)
#Combining with n-grams Dask DataFrame
final_ddf = dd.concat([ddf, ddf_ngrams], axis=1)

In [49]:
ddf = dd.from_pandas(emo_data.drop(columns=['cleaned_text']), npartitions=10)
#Merge or concatenate `ddf_ngrams` with `ddf`
final_ddf = dd.concat([ddf, ddf_ngrams], axis=1)
#Checking the combined DataFrame
print(final_ddf.head())
#save the DataFrame
final_ddf.to_parquet('final_dataset.parquet', engine='pyarrow')

     Unnamed: 0    text    label    expanded_text    tokens  \
81           81

The above demonstrates my approach to feature extraction from text data, combining n-gram counts with sentiment scores from a lexicon-based method (VADER) and a deep learning model (BERT). Here's a breakdown of the major steps and implementations in my code:

I use CountVectorizer from scikit-learn to convert the text into a matrix of token counts, covering unigrams and bigrams. This lets a model consider not only the frequency of individual words but also pairs of consecutive words, which captures some local textual context.
This transformed data is converted into a DataFrame where each column corresponds to a token and the entries are the counts of these tokens in each document (text entry). With 88,184 n-gram columns this dense form is very large, which is why the later cells keep the matrix sparse.
I initialize two sentiment analysis tools: NLTK's SentimentIntensityAnalyzer (VADER) and a BERT-based classifier from the transformers library. The former gives quick sentiment scores based on a predefined lexicon; the latter is a deep learning model that can pick up more nuance.
VADER scores are computed for every text. BERT is called only when VADER is uncertain, that is, when the absolute compound score is below 0.1.
Applying the Sentiment Analysis:
I apply the sentiment analysis to the sampled texts, extracting the VADER results and the optional BERT results, and split them into separate DataFrame columns for easy access and interpretation.
Besides n-grams, I calculate basic text features: word count, character count, and counts of exclamation and question marks. Punctuation counts can reflect the intensity and emotional state of the writer, but here they are computed on cleaned_text, and the tweets shown in this notebook contain no punctuation, so these two columns are probably all zeros for this dataset.
For handling potentially large datasets efficiently, I use Dask to run operations in parallel across chunks of the DataFrame. This includes vectorizing the text partition by partition, which is useful when working with data that doesn't fit into memory. For the resulting columns to line up, the vocabulary has to be fitted once and shared by all partitions.
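
The point about partitions and a shared vocabulary can be sketched without Dask or scikit-learn. The toy functions below (fit_vocabulary, transform) are hypothetical stand-ins for CountVectorizer's fit and transform steps, and the four-sentence corpus is invented; the sketch only illustrates why fitting per partition produces blocks that cannot be stacked.

```python
from collections import Counter

def fit_vocabulary(texts):
    """Build one token -> column index mapping from the given texts."""
    tokens = sorted({tok for text in texts for tok in text.split()})
    return {tok: idx for idx, tok in enumerate(tokens)}

def transform(texts, vocabulary):
    """Count tokens per text using a fixed vocabulary; unknown tokens are ignored."""
    rows = []
    for text in texts:
        counts = Counter(tok for tok in text.split() if tok in vocabulary)
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows

corpus = ["i feel sad", "i feel happy", "feeling happy today", "so sad today"]
partitions = [corpus[:2], corpus[2:]]  # stand-in for Dask partitions

# Fit once on everything, then transform each partition with the same mapping
vocabulary = fit_vocabulary(corpus)
blocks = [transform(part, vocabulary) for part in partitions]
shared_widths = {len(row) for block in blocks for row in block}

# Fitting separately per partition gives differently sized vocabularies
separate_widths = [len(fit_vocabulary(part)) for part in partitions]

print("shared vocabulary widths  :", shared_widths)
print("per-partition vocabularies:", separate_widths)
```

With a shared vocabulary every block has the same columns in the same order, so the blocks can be concatenated row-wise; with per-partition fitting the widths differ and column j means a different token in each block.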

Finally, I save the processed data with all features and sentiment scores for further use, such as training a machine learning model or detailed analysis: emo_data is written to a CSV file and the combined Dask DataFrame to Parquet. This ensures that the results of all preprocessing steps are stored and can be easily accessed or reproduced.
My script forms a text analysis pipeline that integrates text preprocessing, feature extraction, and sentiment analysis, and that can scale to larger emotion or sentiment analysis tasks.

Modelling

In [50]:
print(emo_data.columns)

Index(['Unnamed: 0', 'text', 'label', 'cleaned_text', 'expanded_text', 'tokens', 'lemmatized_tokens', 'vader_neg', 'vader_neu', 'vader_pos', 'vader_compound', 'bert_score', 'bert_confidence', 'word_count', 'char_count', 'exclamation_marks', 'question_marks'], dtype='object')


In [51]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
#using 'cleaned_text' as the input text and 'label' as the target variable
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(emo_data['cleaned_text'])
y = emo_data['label']
#Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [52]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score
#lets initialize and train the Logistic Regression model
model = LogisticRegression(max_iter=1000)  # Increase max_iter if convergence issues occur
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.85      0.92      0.88       572
           1       0.82      0.93      0.87       708
           2       0.84      0.64      0.72       154
           3       0.89      0.78      0.83       281
           4       0.84      0.70      0.77       214
           5       0.79      0.48      0.60        71

    accuracy                           0.84      2000
   macro avg       0.84      0.74      0.78      2000
weighted avg       0.84      0.84      0.84      2000

Accuracy: 0.842


My code for modeling emotion with logistic regression is structured and straightforward, integrating feature extraction, model training, and evaluation in a few concise steps. Here's a detailed walkthrough of each part of the process:

Feature Extraction with CountVectorizer

I use CountVectorizer to transform the cleaned_text column of my emo_data DataFrame into a numeric format that machine learning models can process. By setting ngram_range=(1, 2), I include both unigrams and bigrams, which helps capture more contextual information than unigrams alone.
The resultant matrix, X, is sparse and contains frequencies of each token (unigram or bigram) appearing in the dataset.
Preparing the Data:
The target variable, y, is taken from the label column of emo_data, which holds the integer emotion class (0 to 5) for each text.
The data is then split into training and test sets using train_test_split, allocating 20% of the data to the test set and setting a random state for reproducibility of the results.
Model Training:
A Logistic Regression model is initialized with an increased number of maximum iterations (max_iter=1000) to ensure convergence given the potentially large feature space created by the n-gram vectorization.
The model is then trained on the training data (X_train, y_train), fitting the logistic regression to the emotion labels.
Model Evaluation:
After training, the model's performance is assessed on the test set (X_test). Predictions (y_pred) are made and compared against the actual labels (y_test).
I utilize classification_report from scikit-learn to get detailed performance metrics such as precision, recall, and F1-score for each class. These metrics are crucial for understanding how well the model performs across different emotion categories, not just overall accuracy.
The overall accuracy of the model is also calculated and printed. Accuracy provides a quick insight into the proportion of total correct predictions but should be interpreted in the context of the class balance and other metrics provided in the classification report.
This structured approach ensures that I have a robust assessment of the logistic regression model's capability to classify emotions based on text. It's a strong baseline model for text classification tasks and often performs well with balanced datasets and clear distinctions between categories. If further refinement is needed, I could consider experimenting with different n-gram ranges, tweaking regularization parameters in logistic regression, or exploring more complex models like SVMs or neural networks.
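
One further refinement I could try, given the class imbalance, is a stratified split, so that rare classes appear in the training and test sets in the same proportion. scikit-learn's train_test_split accepts a stratify argument for this; the pure-Python function below is only a sketch of what stratification does, with a hypothetical name (stratified_split) and invented toy labels.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with each class split in the same proportion."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # fixed seed for a reproducible shuffle
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Toy labels with one rare class, loosely mimicking the imbalance in the report
labels = [0] * 50 + [1] * 40 + [5] * 10
train_idx, test_idx = stratified_split(labels)

print("test set class counts :", Counter(labels[i] for i in test_idx))
print("train set class counts:", Counter(labels[i] for i in train_idx))
```

Without stratification, a random split can leave a small class with very few test examples, which makes its precision and recall estimates noisy.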

Overall Model Performance

Accuracy: The overall accuracy of 0.842 (or 84.2%) indicates that the model correctly predicts the emotional label of a tweet about 84% of the time. This is a strong performance, especially considering the complexity of natural language processing tasks.
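
To put the 84.2% in context, it helps to compare it with the accuracy of always predicting the most frequent class. The sketch below uses only the per-class support values from the classification report above; the variable names are my own.

```python
# Per-class test-set support, copied from the classification report above
support = {0: 572, 1: 708, 2: 154, 3: 281, 4: 214, 5: 71}

total = sum(support.values())
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / total

print(f"test examples      : {total}")
print(f"majority class     : {majority_class}")
print(f"majority baseline  : {baseline_accuracy:.3f}")
print(f"model accuracy     : 0.842")
```

A model that always predicted the largest class would be right only about a third of the time on this test set, so the logistic regression result is well above the trivial baseline.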
Detailed Performance by Class
Precision: Measures the accuracy of positive predictions. For example, when the model predicts class 0 (sadness in this dataset's label mapping: 0 sadness, 1 joy, 2 love, 3 anger, 4 fear, 5 surprise), it is correct about 85% of the time. Precision is fairly uniform across classes (0.79 to 0.89), so when the model does predict an emotion, the prediction tends to be reliable.
Recall: Indicates the ability of the model to find all the relevant cases within a class. For instance, class 1 (joy) has a recall of 0.93, meaning the model identifies 93% of all actual instances of this class in the test set. Recall varies much more than precision, from 0.93 for class 1 down to 0.48 for class 5, so the main weakness is missed instances of the rarer classes.
F1-Score: The F1-score is the harmonic mean of precision and recall, providing a single score that balances both the concerns of precision and recall. Class 2 and class 5 have the lowest F1-scores (0.72 and 0.60 respectively), which might indicate these emotions are more complex or less distinctly defined in the data, making them harder to predict accurately.
Class-Specific Observations
Class 0, 1, and 3: Show strong performance across both precision and recall, suggesting these emotions are well-represented and distinct within the dataset, allowing the model to learn their patterns effectively.
Class 2 and 5: Lower recall, especially for class 5, and the lowest F1-scores indicate challenges. This could be due to fewer examples: class 5 has only 71 instances in the test set and class 2 has 154, so with an 80/20 split both are presumably rare in the training data as well, which often makes it hard to generalize. The model struggles to detect these emotions consistently in new data, possibly because their features overlap with other emotions or because of inherent complexity in how they are expressed in text.
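
The gap between the macro and weighted averages in the report follows directly from these class sizes. As a check, the sketch below recomputes both recall averages and one F1-score from the rounded per-class numbers in the report; because the inputs are rounded to two decimals, the results can differ from the report in the last digit.

```python
# (precision, recall, support) per class, copied from the classification report
report = {
    0: (0.85, 0.92, 572),
    1: (0.82, 0.93, 708),
    2: (0.84, 0.64, 154),
    3: (0.89, 0.78, 281),
    4: (0.84, 0.70, 214),
    5: (0.79, 0.48, 71),
}

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

recalls = [recall for _, recall, _ in report.values()]
supports = [support for _, _, support in report.values()]

# Macro average: every class counts equally, however small
macro_recall = sum(recalls) / len(recalls)
# Weighted average: each class counts in proportion to its support
weighted_recall = sum(r * s for r, s in zip(recalls, supports)) / sum(supports)

print(f"macro recall    : {macro_recall:.2f}")
print(f"weighted recall : {weighted_recall:.2f}")
print(f"F1 for class 5  : {f1(0.79, 0.48):.2f}")
```

The weighted recall is dominated by the two large classes and is essentially the overall accuracy, while the macro average gives classes 2 and 5 the same weight as the others and therefore exposes their lower recall.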
