#### The "Twitter Sentiment Extraction" project aims to leverage the power of Natural Language Processing (NLP) and Machine Learning to analyze Twitter data and extract sentiments from tweets. Twitter, being a prominent social media platform, is a treasure trove of opinions, emotions, and insights expressed by users worldwide. Understanding the sentiments and emotions embedded in these short messages can provide valuable insights for businesses, marketers, researchers, and policymakers.

#### The primary goal of this project is to build a sentiment analysis model capable of determining whether a tweet carries a positive, negative, or neutral sentiment. Sentiment analysis is a subfield of NLP that focuses on understanding and categorizing the emotions conveyed through textual data.

#### However, we don't stop at just determining the sentiment. This project goes a step beyond standard sentiment analysis: it extracts the most critical words or phrases contributing to the sentiment of a tweet. By doing so, we gain deeper insights into why a particular tweet is classified with a specific sentiment, enabling us to identify what aspects or topics are driving the emotions expressed by Twitter users.

### Key Objectives:

- Sentiment Analysis: Develop a sentiment analysis model that accurately categorizes tweets as positive, negative, or neutral based on their textual content.

- Sentiment Extraction: Go beyond traditional sentiment analysis by extracting the most important words or phrases that contribute to the overall sentiment of each tweet.

- Data Exploration: Understand the characteristics of the Twitter dataset, explore the distribution of sentiments, and identify patterns or trends in the data.

- Model Evaluation: Evaluate the performance of the sentiment analysis model using various metrics, ensuring its effectiveness in predicting sentiments.

- Interpretation and Visualization: Visualize the extracted words or phrases in an interpretable and insightful manner to gain a deeper understanding of the sentiments expressed in the tweets.

## Importing Libraries:

In [None]:
import os
os.getcwd()

In [None]:
os.listdir('/root/.kaggle')

In [None]:
!mv kaggle.json ~/.kaggle/

In [None]:
!chmod 600 ~/.kaggle/kaggle.json

In [None]:
!kaggle competitions download -c tweet-sentiment-extraction

In [None]:
# Standard library
import os
import random
import re
import string
import warnings
from collections import Counter

# Data handling and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from tqdm import tqdm

# NLP libraries
import nltk
from nltk.corpus import stopwords
import spacy
from spacy.util import compounding
from spacy.util import minibatch

warnings.filterwarnings("ignore")

## Importing Data:

In [None]:
!unzip tweet-sentiment-extraction.zip -d /content/

In [None]:
train = pd.read_csv('/content/train.csv')
test = pd.read_csv('/content/test.csv')
ss = pd.read_csv('/content/sample_submission.csv')

#### Let's look at the data first:

In [None]:
print(train.shape)
print(test.shape)

print(train.head())
print(test.head())

In [None]:
train.info()

In [None]:
train.dropna(inplace = True)

In [None]:
test.info()

In [None]:
print(train.isnull().sum())
print(test.isnull().sum())

## EDA:

### 1. Exploring sentiment class distribution:

In [None]:
sns.countplot(x = 'sentiment', data = train)
plt.title('Sentiment Class Distribution')
plt.show()

### 2. Text Analysis:

In [None]:
# Calculate tweet lengths (number of characters and words)
train['char_count'] = train['text'].apply(len)
train['word_count'] = train['text'].apply(lambda x: len(x.split()))

# Visualize tweet length distributions
sns.histplot(data=train, x='char_count', hue='sentiment', bins=30, kde=True)
plt.title('Tweet Length Distribution by Sentiment')
plt.show()

sns.histplot(data=train, x='word_count', hue='sentiment', bins=30, kde=True)
plt.title('Word Count Distribution by Sentiment')
plt.show()

# Create Word Clouds for each sentiment class
def plot_wordcloud(sentiment):
    text = ' '.join(train[train['sentiment'] == sentiment]['text'])
    wordcloud = WordCloud(stopwords=STOPWORDS, background_color='white', width=800, height=400).generate(text)
    plt.figure(figsize=(10, 5))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title(f'Word Cloud for {sentiment} Tweets')
    plt.show()

plot_wordcloud('positive')
plot_wordcloud('negative')
plot_wordcloud('neutral')

### 3. Word Frequency Analysis:

##### The "stopwords" data can be downloaded directly with `nltk.download('stopwords')`, as in the next cell.

If the download fails because of network issues, the stopwords file can be downloaded manually and read from disk instead.
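
If the stopwords file has been downloaded manually, it can be read from disk. The following is a minimal sketch, assuming a plain-text file with one word per line; the function name `load_stopwords` and the demo file path are hypothetical and not part of the original notebook.

```python
import os
import tempfile

def load_stopwords(path):
    """Read a stopword list from a plain-text file with one word per line."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

# Demo: a tiny temporary file stands in for the manually downloaded list.
demo_path = os.path.join(tempfile.gettempdir(), "english_stopwords_demo.txt")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write("the\nand\nis\n")

stop_words = load_stopwords(demo_path)
print(sorted(stop_words))
```

The returned set could then be used in place of `set(nltk.corpus.stopwords.words('english'))` in the functions that follow.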

In [None]:
import nltk
# Download NLTK stopwords
nltk.download('stopwords')

In [None]:
from collections import Counter

def word_frequency_analysis(data):

    # Build the sets of stopwords and punctuation tokens to filter out
    stop_words = set(nltk.corpus.stopwords.words('english'))
    exclude = set(string.punctuation)
    words = []
    for tweet in data['text']:
        tweet_words = tweet.lower().split()
        for word in tweet_words:
            if word not in stop_words and word not in exclude:
                words.append(word)

    word_counts = Counter(words)
    most_common = word_counts.most_common(20)

    print(most_common)

    plt.figure(figsize=(10, 5))
    plt.bar(range(len(most_common)), [x[1] for x in most_common], color='blue')
    plt.xticks(range(len(most_common)), [x[0] for x in most_common])
    plt.title('Word Frequency')
    plt.show()

# Assuming you have the 'train' DataFrame
word_frequency_analysis(train)

#### Some insights: Before we start analyzing the data, let's take a look at some things that we already know about it. This will help us to gain more insights from the data.

- We know that selected_text is a subset of text: it is a part of the tweet, not necessarily all of it.

- We know that selected_text is a continuous segment of text; it does not jump between separate parts of the tweet. For example, if the text is "Spent the entire morning in a meeting with a vendor, and my boss was not happy with them. Lots of fun. I had other plans for my morning", then selected_text could be "my boss was not happy with them. Lots of fun", but it could not be "Morning, vendor and my boss".

- We know that neutral tweets have a Jaccard similarity of 97 percent between text and selected_text. For neutral tweets, the selected text is therefore almost identical to the entire text.

- We know that there are rows where selected_text starts in the middle of a word. This makes the data harder to analyze, because the first token of the selected text is then only a word fragment.

- We do not know whether the test set labels contain the same discrepancies as the training set.
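
As a quick illustration of the word-level Jaccard similarity mentioned above, here is a minimal sketch. The function name `word_jaccard` and the sample strings are illustrative and not part of the original notebook.

```python
def word_jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two strings (case-insensitive)."""
    set_a = set(a.lower().split())
    set_b = set(b.lower().split())
    union = set_a | set_b
    if not union:
        # Both strings are empty: define the similarity as 0 to avoid dividing by zero
        return 0.0
    return len(set_a & set_b) / len(union)

text = "my boss was not happy with them. Lots of fun"
selected = "not happy with them."

# 4 shared words out of 10 distinct words
print(word_jaccard(text, selected))
# selected_text is a contiguous substring of text
print(selected in text)
```

A score of 1.0 means the two strings contain exactly the same set of words, which is close to what is observed for neutral tweets.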

### Let's look at some meta features:

In [None]:
def jaccard_distance(row):
    # Convert the text and selected_text to sets of words
    text_words = set(row['text'].lower().split())
    selected_text_words = set(row['selected_text'].lower().split())

    # Calculate the intersection and union of the sets
    intersection = len(text_words.intersection(selected_text_words))
    union = len(text_words.union(selected_text_words))

    # Calculate the Jaccard similarity (0 if both sets are empty) and the distance
    similarity = intersection / union if union != 0 else 0.0

    return 1 - similarity

In [None]:
train['jaccard_distance'] = train.apply(jaccard_distance, axis=1)

In [None]:
train['Num_words_ST'] = train['selected_text'].apply(lambda x:len(str(x).split())) #Number Of words in Selected Text
train['Num_word_text'] = train['text'].apply(lambda x:len(str(x).split())) #Number Of words in main text
train['difference_in_words'] = train['Num_word_text'] - train['Num_words_ST'] #Difference in Number of words text and Selected Text

In [None]:
train.head()

In [None]:
def jaccard_score(row):
    # Convert the text and selected_text to sets of words
    text_words = set(row['text'].lower().split())
    selected_text_words = set(row['selected_text'].lower().split())

    # Calculate the intersection and union of the sets
    intersection = len(text_words.intersection(selected_text_words))
    union = len(text_words.union(selected_text_words))

    # Calculate the Jaccard score
    if union != 0:
        jaccard_score = intersection / union
    else:
        jaccard_score = 0.0

    return jaccard_score

train['jaccard_score'] = train.apply(jaccard_score, axis=1)

In [None]:
train.head()

In [None]:
# Plot histogram for Number of words in Selected Text
plt.figure(figsize=(10, 5))
sns.histplot(data=train, x='Num_words_ST', bins=20, kde=True)
plt.title('Distribution of Number of Words in Selected Text')
plt.xlabel('Number of Words')
plt.ylabel('Count')
plt.show()

# Plot histogram for Number of words in main text
plt.figure(figsize=(10, 5))
sns.histplot(data=train, x='Num_word_text', bins=20, kde=True)
plt.title('Distribution of Number of Words in Main Text')
plt.xlabel('Number of Words')
plt.ylabel('Count')
plt.show()

# Plot histogram for Difference in Number of words text and Selected Text
plt.figure(figsize=(10, 5))
sns.histplot(data=train, x='difference_in_words', bins=20, kde=True)
plt.title('Distribution of Difference in Number of Words (Text - Selected Text)')
plt.xlabel('Difference in Number of Words')
plt.ylabel('Count')
plt.show()

# Plot histogram for Jaccard Score
plt.figure(figsize=(10, 5))
sns.histplot(data=train, x='jaccard_score', bins=20, kde=True)
plt.title('Distribution of Jaccard Score')
plt.xlabel('Jaccard Score')
plt.ylabel('Count')
plt.show()

#### We can infer the following:

* Number of Words in Selected Text: The histogram for the number of words in the selected text shows that a significant portion of selected text contains around 1 to 4 words. This suggests that the majority of selected text segments are concise and often represent short phrases or single words extracted from the main text.
* Number of Words in Main Text: The histogram for the number of words in the main text indicates that the main text of tweets generally consists of around 5 to 25 words. This suggests that tweets are generally of moderate length, with the majority containing a few sentences or phrases.
* Difference in Number of Words: The histogram for the difference in the number of words between the main text and selected text shows that most selected text segments are shorter than the main text. This is consistent with our previous understanding that the selected text is usually a subset of the main text, capturing a specific segment that represents the sentiment.

In [None]:
k = train[train['Num_word_text']<=2]
# Select the numeric column before aggregating so non-numeric columns are not averaged
k.groupby('sentiment')['jaccard_score'].mean()

In [None]:
k[k['sentiment']=='positive']

### Cleaning text:

In [None]:
import re
import string

def clean_text(text):
    # Remove text in square brackets
    text = re.sub(r'\[.*?\]', '', text)

    # Remove links
    text = re.sub(r'http\S+', '', text)

    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Remove words containing numbers
    text = re.sub(r'\w*\d\w*', '', text)

    # Make text lowercase
    text = text.lower()

    return text

In [None]:
train['text'] = train['text'].apply(lambda x:clean_text(x))
train['selected_text'] = train['selected_text'].apply(lambda x:clean_text(x))

In [None]:
train.head()

### Let's now find the most common words in our target feature: "Selected-Text"

In [None]:
def remove_stopwords(text):
    # Tokenize the text into words
    words = text.split()

    stop_words = set(nltk.corpus.stopwords.words('english'))
    words = [word for word in words if word.lower() not in stop_words]

    # Join the words back into a single string
    cleaned_text = ' '.join(words)
    return cleaned_text

# Apply remove_stopwords to the 'selected_text' column in the train DataFrame
train['cleaned_selected_text'] = train['selected_text'].apply(remove_stopwords)

In [None]:
# Concatenate all the selected texts into a single string
all_selected_texts = ' '.join(train['cleaned_selected_text'])

# Split the text into words and count their frequencies
word_counts = Counter(all_selected_texts.split())

# Get the most common words
most_common_words = word_counts.most_common(10)
words, counts = zip(*most_common_words)

print(most_common_words)

In [None]:
import plotly.express as px

# Create a DataFrame for the most common words
df_most_common = pd.DataFrame({'words': words, 'counts': counts})

# Plot the treemap
fig = px.treemap(df_most_common, path=['words'], values='counts')
fig.update_layout(title='Top 10 Most Common Words in Cleaned Selected Text (Treemap)')
fig.show()

#### Similarly, let's find the most common words in "Text"

In [None]:
train['cleaned_text'] = train['text'].apply(remove_stopwords)

In [None]:
# Concatenate all the texts into a single string
all_texts = ' '.join(train['cleaned_text'])

# Split the text into words and count their frequencies
word_counts = Counter(all_texts.split())

# Get the most common words
most_common_words = word_counts.most_common(10)
words, counts = zip(*most_common_words)

print(most_common_words)

In [None]:
# Create a DataFrame for the most common words
df_most_common = pd.DataFrame({'words': words, 'counts': counts})

# Plot the treemap
fig = px.treemap(df_most_common, path=['words'], values='counts')
fig.update_layout(title='Top 10 Most Common Words in Cleaned Text (Treemap)')
fig.show()

### Clearly, the most common words in both "selected text" and "text" are similar.

In [None]:
# Filter the data by sentiment categories
positive_texts = ' '.join(train[train['sentiment'] == 'positive']['cleaned_selected_text'])
negative_texts = ' '.join(train[train['sentiment'] == 'negative']['cleaned_selected_text'])
neutral_texts = ' '.join(train[train['sentiment'] == 'neutral']['cleaned_selected_text'])

# Split the text into words and count their frequencies for each sentiment
positive_word_counts = Counter(positive_texts.split())
negative_word_counts = Counter(negative_texts.split())
neutral_word_counts = Counter(neutral_texts.split())

# Get the most common words for each sentiment
most_common_positive = positive_word_counts.most_common(10)
most_common_negative = negative_word_counts.most_common(10)
most_common_neutral = neutral_word_counts.most_common(10)

print("Most Common Words in Positive Sentiment:")
print(most_common_positive)

print("\nMost Common Words in Negative Sentiment:")
print(most_common_negative)

print("\nMost Common Words in Neutral Sentiment:")
print(most_common_neutral)

In [None]:
# Define the data for the heatmap
data = {
    'Word': [word for word, _ in most_common_positive + most_common_negative + most_common_neutral],
    'Sentiment': ['Positive'] * len(most_common_positive) + ['Negative'] * len(most_common_negative) + ['Neutral'] * len(most_common_neutral),
    'Frequency': [count for _, count in most_common_positive + most_common_negative + most_common_neutral]
}

df = pd.DataFrame(data)
# Convert the 'Frequency' column to integers
df['Frequency'] = df['Frequency'].astype(int)

# Create the heatmap using seaborn
plt.figure(figsize=(10, 6))
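# pivot takes keyword arguments; words missing for a sentiment show up as empty cells
sns.heatmap(df.pivot(index='Sentiment', columns='Word', values='Frequency'), cmap='YlGnBu', annot=True, fmt='.0f', cbar_kws={'label': 'Frequency'})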
plt.title('Most Common Words Sentiments-Wise')
plt.xlabel('Word')
plt.ylabel('Sentiment')
plt.show()

In [None]:
# Get the most common positive words
positive_words, positive_counts = zip(*most_common_positive)

# Plot the most common positive words using a bar chart
plt.figure(figsize=(10, 6))
plt.bar(positive_words, positive_counts, color='blue')
plt.title('Most Common Positive Words')
plt.xlabel('Word')
plt.ylabel('Frequency')
plt.xticks(rotation=45, ha='right')
plt.show()

# Get the most common negative words
negative_words, negative_counts = zip(*most_common_negative)

# Plot the most common negative words using a bar chart
plt.figure(figsize=(10, 6))
plt.bar(negative_words, negative_counts, color='red')
plt.title('Most Common Negative Words')
plt.xlabel('Word')
plt.ylabel('Frequency')
plt.xticks(rotation=45, ha='right')
plt.show()

# Get the most common neutral words
neutral_words, neutral_counts = zip(*most_common_neutral)

# Plot the most common neutral words using a bar chart
plt.figure(figsize=(10, 6))
plt.bar(neutral_words, neutral_counts, color='green')
plt.title('Most Common Neutral Words')
plt.xlabel('Word')
plt.ylabel('Frequency')
plt.xticks(rotation=45, ha='right')
plt.show()

### Let's Look at Unique Words in each Segment

In [None]:
# Find unique words in each segment
unique_positive_words = set(positive_word_counts.keys())
unique_negative_words = set(negative_word_counts.keys())
unique_neutral_words = set(neutral_word_counts.keys())

# Find words unique to each sentiment
words_unique_to_positive = unique_positive_words - unique_negative_words - unique_neutral_words
words_unique_to_negative = unique_negative_words - unique_positive_words - unique_neutral_words
words_unique_to_neutral = unique_neutral_words - unique_positive_words - unique_negative_words

print("Words Unique to Positive Sentiment:")
print(words_unique_to_positive)

print("\nWords Unique to Negative Sentiment:")
print(words_unique_to_negative)

print("\nWords Unique to Neutral Sentiment:")
print(words_unique_to_neutral)

In [None]:
# Get the most common unique words in positive tweets, keeping their real frequencies
most_common_unique_positive = Counter({w: positive_word_counts[w] for w in words_unique_to_positive}).most_common(20)
print("The Top 20 Unique Words in Positive Tweets are:")
print(most_common_unique_positive)

# Get the most common unique words in negative tweets
most_common_unique_negative = Counter({w: negative_word_counts[w] for w in words_unique_to_negative}).most_common(20)
print("The Top 20 Unique Words in Negative Tweets are:")
print(most_common_unique_negative)

# Get the most common unique words in neutral tweets
most_common_unique_neutral = Counter({w: neutral_word_counts[w] for w in words_unique_to_neutral}).most_common(20)
print("The Top 20 Unique Words in Neutral Tweets are:")
print(most_common_unique_neutral)

In [None]:
# Create a DataFrame for each sentiment category
df_positive = pd.DataFrame(most_common_unique_positive, columns=['word', 'count'])
df_negative = pd.DataFrame(most_common_unique_negative, columns=['word', 'count'])
df_neutral = pd.DataFrame(most_common_unique_neutral, columns=['word', 'count'])

# Plot treemaps for each sentiment category
fig_positive = px.treemap(df_positive, path=['word'], values='count', title='Top 20 Unique Words in Positive Tweets')
fig_positive.show()

fig_negative = px.treemap(df_negative, path=['word'], values='count', title='Top 20 Unique Words in Negative Tweets')
fig_negative.show()

fig_neutral = px.treemap(df_neutral, path=['word'], values='count', title='Top 20 Unique Words in Neutral Tweets')
fig_neutral.show()

## WordClouds:

#### We will be building wordclouds in the following order:

#### - WordCloud of Positive Tweets
#### - WordCloud of Negative Tweets
#### - WordCloud of Neutral Tweets

In [None]:
from wordcloud import WordCloud

# Function to create WordCloud with custom parameters
def create_wordcloud(text, title):
    wordcloud = WordCloud(width=1000, height=600, background_color='white',
                          colormap='viridis', max_words=100, contour_width=3, contour_color='steelblue',
                          random_state=42).generate(text)

    plt.figure(figsize=(12, 8))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.title(title, fontsize=20)
    plt.axis('off')
    plt.show()

# Join the text of each sentiment category into a single string
positive_texts = ' '.join(train[train['sentiment'] == 'positive']['cleaned_selected_text'])
negative_texts = ' '.join(train[train['sentiment'] == 'negative']['cleaned_selected_text'])
neutral_texts = ' '.join(train[train['sentiment'] == 'neutral']['cleaned_selected_text'])

# Create WordCloud for each sentiment category
create_wordcloud(positive_texts, 'Word Cloud for Positive Tweets')
create_wordcloud(negative_texts, 'Word Cloud for Negative Tweets')
create_wordcloud(neutral_texts, 'Word Cloud for Neutral Tweets')

## Finally time for "Modelling":

#### To build the sentiment extraction model, we will use spaCy, an NLP library that provides pre-trained models and lets us train our own. We frame the task as named entity recognition (NER): the selected text is treated as a single entity, so the model learns to predict its start and end positions in the original tweet. Separate models are trained for positive and negative tweets.
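
To make this framing concrete, the sketch below shows how one (text, selected_text) pair might be turned into the character-offset annotation used for NER training. The helper name `to_ner_example` and the sample tweet are illustrative assumptions; the label `selected_text` matches the one used in the training cells.

```python
def to_ner_example(text, selected_text, label="selected_text"):
    """Return (text, {"entities": [(start, end, label)]}), or None if the span cannot be located."""
    if not selected_text:
        return None
    start = text.find(selected_text)
    if start == -1:
        return None
    end = start + len(selected_text)
    return (text, {"entities": [(start, end, label)]})

sample_text = "had a really great day today"
example = to_ner_example(sample_text, "really great")
print(example)

# The offsets recover the selected span exactly
start, end, _ = example[1]["entities"][0]
print(sample_text[start:end])
```

Returning `None` for spans that cannot be found is one possible design choice; it keeps rows with inconsistent labels out of the training data.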

In [None]:
train['Num_words_text'] = train['text'].apply(lambda x: len(str(x).split()))

In [None]:
train = train[train['Num_words_text']>3]

In [None]:
def save_model(output_dir, nlp, new_model_name):
    '''Save the model to the given output directory.'''
    # Check for None before touching the path; the path is used as given
    if output_dir is not None:
        os.makedirs(output_dir, exist_ok=True)
        nlp.meta["name"] = new_model_name
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)

output_dir = '/content'

In [None]:
import spacy
import random
from tqdm import tqdm
from spacy.util import minibatch
from spacy.training.example import Example

def train_ner(train_data, output_dir, n_iter=20, model=None):
    """Load the model, set up the pipeline and train the entity recognizer."""
    if model is not None:
        nlp = spacy.load(output_dir)  # load existing spaCy model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank("en")  # create blank Language class
        print("Created blank 'en' model")

    # In spaCy v3, add_pipe takes the component name and returns the component,
    # so labels are added to the NER that is actually in the pipeline
    if "ner" not in nlp.pipe_names:
        ner = nlp.add_pipe("ner", last=True)
    # otherwise, get it so we can add labels
    else:
        ner = nlp.get_pipe("ner")

    # add labels
    for _, annotations in train_data:
        for ent in annotations.get("entities"):
            ner.add_label(ent[2])

    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
    with nlp.select_pipes(disable=other_pipes):  # only train NER
        if model is None:
            nlp.initialize()
        else:
            nlp.resume_training()

        for itn in tqdm(range(n_iter)):
            random.shuffle(train_data)
            losses = {}
            # batch up the examples using spaCy's minibatch
            for batch in minibatch(train_data, size=compounding(4.0, 500.0, 1.001)):
                # convert each (text, annotations) pair into an Example
                examples = [
                    Example.from_dict(nlp.make_doc(text), annotations)
                    for text, annotations in batch
                ]
                # update the model once per batch
                nlp.update(examples, drop=0.5, losses=losses)
            print("Losses", losses)
    save_model(output_dir, nlp, 'st_ner')

In [None]:
def get_model_out_path(sentiment):
    '''
    Returns Model output path
    '''
    model_out_path = None
    if sentiment == 'positive':
        model_out_path = '/content/models/model_pos'
    elif sentiment == 'negative':
        model_out_path = '/content/models/model_neg'
    return model_out_path

In [None]:
def get_training_data(sentiment):
    '''
    Returns training data in the format needed to train spaCy NER
    '''
    train_data = []
    for index, row in train.iterrows():
        if row.sentiment == sentiment:
            selected_text = row.selected_text
            text = row.text
            start = text.find(selected_text)
            # Skip rows where the span is empty or cannot be located in the text
            if start == -1 or not selected_text:
                continue
            end = start + len(selected_text)
            train_data.append((text, {"entities": [[start, end, 'selected_text']]}))
    return train_data

Training models for Positive and Negative Tweets

In [None]:
sentiment = 'positive'

train_data = get_training_data(sentiment)
model_path = get_model_out_path(sentiment)
# For demo purposes only 3 iterations are used; increase n_iter to train for longer
train_ner(train_data, model_path, n_iter=3, model=None)

In [None]:
sentiment = 'negative'

train_data = get_training_data(sentiment)
model_path = get_model_out_path(sentiment)

train_ner(train_data, model_path, n_iter=3, model=None)

Predicting with the trained Model

In [None]:
def predict_entities(text, model):
    doc = model(text)
    ent_array = []
    for ent in doc.ents:
        start = text.find(ent.text)
        end = start + len(ent.text)
        new_int = [start, end, ent.label_]
        if new_int not in ent_array:
            ent_array.append([start, end, ent.label_])
    selected_text = text[ent_array[0][0]: ent_array[0][1]] if len(ent_array) > 0 else text
    return selected_text

In [None]:
!kaggle datasets download -d rohitsingh9990/tse-spacy-model

In [None]:
!unzip tse-spacy-model.zip

In [None]:
selected_texts = []

# Load the models from the saved paths (each variable must point at its own sentiment's model)
model_pos = spacy.load('/working/models/model_pos/ner/model')
model_neg = spacy.load('/working/models/model_neg/ner/model')

for index, row in test.iterrows():
    text = row.text
    output_str = ""
    if row.sentiment == 'neutral' or len(text.split()) <= 2:
        selected_texts.append(text)
    elif row.sentiment == 'positive':
        selected_texts.append(predict_entities(text, model_pos))
    else:
        selected_texts.append(predict_entities(text, model_neg))

test['selected_text'] = selected_texts
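
Once `test['selected_text']` is filled in, the predictions can be written to a submission file. The following is a minimal sketch, assuming the submission needs only the columns `textID` and `selected_text` and that the file is named `submission.csv`; the small `test_demo` DataFrame stands in for the real `test` data and `make_submission` is a hypothetical helper.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the notebook's `test` DataFrame after prediction (assumed columns)
test_demo = pd.DataFrame({
    "textID": ["a1", "b2"],
    "text": ["i love this song", "just landed"],
    "sentiment": ["positive", "neutral"],
    "selected_text": ["love", "just landed"],
})

def make_submission(df, path):
    """Write the two columns assumed to be required for submission to a CSV file."""
    submission = df[["textID", "selected_text"]].copy()
    submission.to_csv(path, index=False)
    return submission

out_path = os.path.join(tempfile.gettempdir(), "submission.csv")
submission = make_submission(test_demo, out_path)
print(submission)
```

In the notebook itself, the call would presumably be `make_submission(test, 'submission.csv')`.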

### Why Twitter Sentiment Extraction Matters:

- Business Insights: For companies, understanding the sentiments of customers on Twitter can offer invaluable insights into brand perception, customer satisfaction, and areas of improvement.

- Social Media Marketing: Marketers can utilize sentiment analysis to gauge the success of their marketing campaigns and adjust strategies accordingly.

- Market Research: Sentiment analysis can aid market researchers in understanding consumer opinions, preferences, and trends.

- Public Opinion Analysis: Policymakers and researchers can analyze public opinion on social issues, events, or political matters.

### Conclusion:

The "Twitter Sentiment Extraction" project combines sentiment analysis and text extraction techniques to gain in-depth insights from Twitter data. By building an accurate sentiment analysis model and extracting key words or phrases, this project aims to unlock valuable information hidden within the vast landscape of tweets. The outcomes can empower businesses, marketers, researchers, and decision-makers with actionable insights to improve their strategies, products, and services based on the sentiments of the Twitterverse.

### Acknowledgements:

1. https://www.kaggle.com/aashita/word-clouds-of-various-shapes (word cloud function)
2. https://www.kaggle.com/rohitsingh9990/ner-training-using-spacy-0-628-lb (how to train spaCy NER on custom inputs)
3. https://www.kaggle.com/code/tanulsingh077/twitter-sentiment-extaction-analysis-eda-and-model/notebook