# Section 1

1. Choose some dataset with social media comments that is tagged with its sentiment. You can choose, for example, the Twitter dataset, available on Kaggle. On the selected dataset, do the following:

   **a.** Perform an exploratory analysis of your data to understand the distribution and characteristics of the comments.

   **b.** Preprocess your data, including its cleaning, tokenization, uppercasing it, deleting the stop words, ...


**a.** Perform an exploratory analysis of your data to understand the distribution and characteristics of the comments.

---

**Chosen dataset**: [Reddit Comments Sentiment Dataset](https://www.kaggle.com/datasets/cosmos98/twitter-and-reddit-sentimental-analysis-dataset?select=Reddit_Data.csv)

Reddit_Data.csv has around 37k comments, each with a sentiment label. The dataset has two columns: the first (`clean_comment`) holds the cleaned comment text and the second (`category`) holds the sentiment label.

**Context**: These tweets and comments were made about Narendra Modi and other leaders, and about people's opinions on the next Prime Minister of India (in the context of the 2019 Indian general election).

**Sentiment label values**:

*   0 indicates a neutral tweet/comment
*   1 indicates a positive tweet/comment
*   -1 indicates a negative tweet/comment
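A minimal sketch of the exploratory step for part **a.**, assuming the two columns `clean_comment` and `category` used in the cells of this notebook. The helper name `summarize_comments` and the toy DataFrame are hypothetical; on the real data the function would be called on the DataFrame loaded from `Reddit_Data.csv`.

```python
import pandas as pd

def summarize_comments(df: pd.DataFrame) -> dict:
    """Basic exploratory summary: size, missing text, class balance, comment length."""
    text = df["clean_comment"].fillna("").astype(str)
    word_counts = text.str.split().str.len()  # whitespace-separated words per comment
    return {
        "n_rows": len(df),
        "n_missing": int(df["clean_comment"].isna().sum()),
        "class_counts": df["category"].value_counts().sort_index().to_dict(),
        "mean_words_by_class": word_counts.groupby(df["category"]).mean().round(2).to_dict(),
    }

# Hypothetical toy rows standing in for Reddit_Data.csv
toy = pd.DataFrame({
    "clean_comment": ["modi is great", "bad policy", None, "election next year maybe"],
    "category": [1, -1, 0, 0],
})
summary = summarize_comments(toy)
print(summary)
```

The class counts show whether the three labels are balanced, and the per-class word counts give a first impression of whether comment length differs by sentiment.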






**b.** Preprocess your data, including its cleaning, tokenization, uppercasing it, deleting the stop words, ...

---

In [1]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string

# Step 1: Load the dataset
file_path = '/content/Reddit_Data.csv'
reddit_data = pd.read_csv(file_path)

# Step 2: Download necessary NLTK resources (tokenizer, stop words)
nltk.download('punkt')
nltk.download('stopwords')

# Step 3: Define stop words and punctuation for removal
stop_words = set(stopwords.words('english'))  # English stop words
punctuation = set(string.punctuation)  # Punctuation marks

# Step 4: Preprocessing function
def preprocess_comment(comment):
    """
    Function to preprocess each comment:
    - Tokenization: Break comment into individual words (tokens)
    - Uppercasing: Convert all text to uppercase
    - Stop word & punctuation removal: Remove common stop words and punctuation
    """
    # Tokenize the comment (splitting it into words)
    tokens = word_tokenize(comment)

    # Convert tokens to uppercase
    tokens = [word.upper() for word in tokens]

    # Remove stop words and punctuation from the tokens
    tokens = [word for word in tokens if word not in stop_words and word not in punctuation]

    return tokens

# Step 5: Handle missing or non-string values
# Convert any missing values or non-strings into empty strings to avoid errors in tokenization
reddit_data['clean_comment'] = reddit_data['clean_comment'].fillna('').astype(str)

# Step 6: Apply the preprocessing function to the 'clean_comment' column
reddit_data['processed_comment'] = reddit_data['clean_comment'].apply(preprocess_comment)

# Step 7: Display the first few rows of the processed data to verify results
print(reddit_data[['clean_comment', 'processed_comment', 'category']].head())

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


                                       clean_comment  \
0   family mormon have never tried explain them t...   
1  buddhism has very much lot compatible with chr...   
2  seriously don say thing first all they won get...   
3  what you have learned yours and only yours wha...   
4  for your own benefit you may want read living ...   

                                   processed_comment  category  
0  [FAMILY, MORMON, HAVE, NEVER, TRIED, EXPLAIN, ...         1  
1  [BUDDHISM, HAS, VERY, MUCH, LOT, COMPATIBLE, W...         1  
2  [SERIOUSLY, DON, SAY, THING, FIRST, ALL, THEY,...        -1  
3  [WHAT, YOU, HAVE, LEARNED, YOURS, AND, ONLY, Y...         0  
4  [FOR, YOUR, OWN, BENEFIT, YOU, MAY, WANT, READ...         1  
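The `processed_comment` column above still contains words such as HAVE, THE, and AND. The NLTK English stop-word list is lowercase, while the tokens are uppercased before the membership test, so no stop word ever matches. A minimal sketch of one way to address this, comparing in lowercase before uppercasing; `STOP_WORDS` here is a small hypothetical stand-in for `stopwords.words('english')`, and `filter_tokens` is an illustrative name:

```python
import string

# Hypothetical stand-in for set(stopwords.words('english')), which is all lowercase
STOP_WORDS = {"have", "the", "and", "for", "you", "your", "what", "all", "they"}
PUNCTUATION = set(string.punctuation)

def filter_tokens(tokens, stop_words=STOP_WORDS):
    """Drop stop words (case-insensitively) and punctuation, then uppercase the rest."""
    kept = [t for t in tokens if t.lower() not in stop_words and t not in PUNCTUATION]
    return [t.upper() for t in kept]

filtered = filter_tokens(["family", "mormon", "have", "never", "tried", ",", "the"])
print(filtered)
```

Applying a filter like this would change the N-Gram counts reported later in the notebook, since most of the frequent bigrams there contain stop words.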


# Section 2

2. Implement and evaluate an N-Gram model to predict words and analyze sentiments. You have to:

  **a.** Generate the N-Grams for the comments.

  **b.** Perform a frequency analysis of the N-Grams that you have generated.

  **c.** Fit a model to predict words based on N-Grams.

  **d.** Perform a sentiment analysis using the N-Grams. Predict the sentiments and evaluate the performance of your model.





**a.** Generate the N-Grams for the comments.

&

**b.** Perform a frequency analysis of the N-Grams that you have generated.

---

In [2]:
from collections import Counter
from nltk.util import ngrams

# Function to generate N-Grams
def generate_ngrams(tokens_list, n):
    """
    Generate N-Grams from a list of tokenized comments.
    Args:
    - tokens_list: List of tokenized comments
    - n: The 'N' in N-Gram (e.g., 2 for bigrams, 3 for trigrams)

    Returns:
    - List of N-Grams
    """
    ngrams_list = []
    for tokens in tokens_list:
        ngrams_list.extend(ngrams(tokens, n))
    return ngrams_list

# Generate bigrams and trigrams
bigrams = generate_ngrams(reddit_data['processed_comment'], 2)
trigrams = generate_ngrams(reddit_data['processed_comment'], 3)

# Perform frequency analysis (count the N-Grams)
bigram_freq = Counter(bigrams)
trigram_freq = Counter(trigrams)

# Display the most common bigrams and trigrams
most_common_bigrams = bigram_freq.most_common(10)
most_common_trigrams = trigram_freq.most_common(10)

most_common_bigrams, most_common_trigrams

([(('THE', 'SAME'), 2453),
  (('FOR', 'THE'), 1782),
  (('AND', 'THE'), 1709),
  (('THAT', 'THE'), 1396),
  (('THEY', 'ARE'), 1358),
  (('WITH', 'THE'), 1197),
  (('ALL', 'THE'), 983),
  (('FROM', 'THE'), 929),
  (('YOU', 'ARE'), 904),
  (('HAS', 'BEEN'), 896)],
 [(('THE', 'FREE', 'ENCYCLOPEDIA'), 623),
  (('FREE', 'ENCYCLOPEDIA', 'THE'), 604),
  (('ENCYCLOPEDIA', 'THE', 'TEAM'), 598),
  (('LOT', 'THE', 'SAME'), 413),
  (('THE', 'BEST', 'OVERALL'), 375),
  (('THE', 'TEAM', 'REACHED'), 322),
  (('THE', 'SAME', 'THING'), 317),
  (('GOOD', 'GOOD', 'GOOD'), 305),
  (('THE', 'FACT', 'THAT'), 238),
  (('THE', 'SAME', 'LOT'), 227)])

**N-Gram Frequency Analysis Results**

Here are the most frequent N-Grams generated from the dataset:

- **Top 10 Bigrams** (2-Grams):
  1. ('THE', 'SAME') – 2453 occurrences
  2. ('FOR', 'THE') – 1782 occurrences
  3. ('AND', 'THE') – 1709 occurrences
  4. ('THAT', 'THE') – 1396 occurrences
  5. ('THEY', 'ARE') – 1358 occurrences
  6. ('WITH', 'THE') – 1197 occurrences
  7. ('ALL', 'THE') – 983 occurrences
  8. ('FROM', 'THE') – 929 occurrences
  9. ('YOU', 'ARE') – 904 occurrences
  10. ('HAS', 'BEEN') – 896 occurrences

- **Top 10 Trigrams** (3-Grams):
  1. ('THE', 'FREE', 'ENCYCLOPEDIA') – 623 occurrences
  2. ('FREE', 'ENCYCLOPEDIA', 'THE') – 604 occurrences
  3. ('ENCYCLOPEDIA', 'THE', 'TEAM') – 598 occurrences
  4. ('LOT', 'THE', 'SAME') – 413 occurrences
  5. ('THE', 'BEST', 'OVERALL') – 375 occurrences
  6. ('THE', 'TEAM', 'REACHED') – 322 occurrences
  7. ('THE', 'SAME', 'THING') – 317 occurrences
  8. ('GOOD', 'GOOD', 'GOOD') – 305 occurrences
  9. ('THE', 'FACT', 'THAT') – 238 occurrences
  10. ('THE', 'SAME', 'LOT') – 227 occurrences



**c.** Fit a model to predict words based on N-Grams.

---

In [3]:
from collections import defaultdict

# Function to build an N-Gram model for word prediction (bigram model example)
def build_ngram_model(ngrams_list):
    """
    Build an N-Gram model to predict the next word based on the previous word(s).
    Args:
    - ngrams_list: List of generated N-Grams

    Returns:
    - Dictionary where key is the N-1 Gram and value is a dictionary of possible next words with their counts
    """
    model = defaultdict(lambda: defaultdict(int))  # Nested dictionary to store counts

    for ngram in ngrams_list:
        # Last word as the target (to predict)
        context = ngram[:-1]
        target = ngram[-1]
        # Increment the count of the target word given the context
        model[context][target] += 1

    return model

# Build a bigram prediction model (predict next word given one word)
bigram_model = build_ngram_model(bigrams)

# Example: Predict the next word for a given context ('THE',)
context = ('THE',)
predicted_next_words = bigram_model[context]

# Sort predictions by frequency
predicted_next_words_sorted = sorted(predicted_next_words.items(), key=lambda item: item[1], reverse=True)

predicted_next_words_sorted[:10]  # Display top 10 predicted next words for the context 'THE'

[('SAME', 2453),
 ('BEST', 812),
 ('BJP', 759),
 ('COUNTRY', 687),
 ('TEAM', 685),
 ('FREE', 653),
 ('GOVERNMENT', 649),
 ('MOST', 597),
 ('WORLD', 537),
 ('PEOPLE', 528)]
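The counts above can be turned into conditional probabilities, P(next | context) = count(context, next) / count(context), which makes predictions comparable across contexts. A minimal sketch on a toy corpus; `build_bigram_counts` and `predict_next` are illustrative names, not functions from the notebook:

```python
from collections import Counter, defaultdict

def build_bigram_counts(token_lists):
    """Count, for each one-word context, how often each next word follows it."""
    counts = defaultdict(Counter)
    for tokens in token_lists:
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[(prev,)][nxt] += 1
    return counts

def predict_next(counts, context, k=3):
    """Return the k most likely next words with their relative frequencies."""
    followers = counts.get(context)
    if not followers:
        return []  # unseen context: no prediction without smoothing or back-off
    total = sum(followers.values())
    return [(word, count / total) for word, count in followers.most_common(k)]

# Hypothetical toy corpus of already-tokenized comments
toy = [["THE", "SAME", "THING"], ["THE", "SAME", "LOT"], ["THE", "BEST", "TEAM"]]
counts = build_bigram_counts(toy)
predictions = predict_next(counts, ("THE",))
print(predictions)
```

An unseen context returns an empty list here; a fuller model would back off to unigram frequencies or apply smoothing.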

**d.** Perform a sentiment analysis using the N-Grams. Predict the sentiments and evaluate the performance of your model.

---

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Step 1: Create the N-Gram vectorizer (unigrams and bigrams)
vectorizer = CountVectorizer(ngram_range=(1, 2))  # ngram_range=(1, 2) -> unigrams and bigrams

# Step 2: Prepare the features and labels
X = vectorizer.fit_transform(reddit_data['clean_comment'])  # Fit and transform the comments into N-Gram features
y = reddit_data['category']  # Sentiment labels (positive, neutral, negative)

# Step 3: Split the data into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: Train a Naive Bayes classifier
model = MultinomialNB()
model.fit(X_train, y_train)

# Step 5: Predict the sentiments on the test set
y_pred = model.predict(X_test)

# Step 6: Evaluate the performance
accuracy = accuracy_score(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

# Step 7: Display the results clearly
print(f"Accuracy: {accuracy:.4f}")
print("\nClassification Report:")
print(classification_rep)

Accuracy: 0.5852

Classification Report:
              precision    recall  f1-score   support

          -1       0.58      0.31      0.40      1667
           0       0.90      0.34      0.49      2615
           1       0.53      0.94      0.68      3168

    accuracy                           0.59      7450
   macro avg       0.67      0.53      0.52      7450
weighted avg       0.67      0.59      0.55      7450



**Sentiment Analysis Results Using N-Grams**

- **Accuracy**: The Naive Bayes model achieved an accuracy of **58.5%**.
- **Detailed Classification Report**:

  <table>
    <thead>
      <tr>
        <th>Sentiment Label</th>
        <th>Precision</th>
        <th>Recall</th>
        <th>F1-Score</th>
        <th>Support</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>Negative (-1)</td>
        <td>0.58</td>
        <td>0.31</td>
        <td>0.40</td>
        <td>1667</td>
      </tr>
      <tr>
        <td>Neutral (0)</td>
        <td>0.90</td>
        <td>0.34</td>
        <td>0.49</td>
        <td>2615</td>
      </tr>
      <tr>
        <td>Positive (1)</td>
        <td>0.53</td>
        <td>0.94</td>
        <td>0.68</td>
        <td>3168</td>
      </tr>
    </tbody>
  </table>

- **Macro Average F1-Score**: 0.52
- **Weighted Average F1-Score**: 0.55

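Analysis:
- The model over-predicts the positive class: recall for positive comments is high (0.94), but precision is only 0.53, so almost half of the comments predicted as positive actually belong to another class.
- Negative and neutral comments are mostly missed (recall 0.31 and 0.34). When the model does predict neutral it is usually right (precision 0.90), but it does so too rarely.
- The accuracy of 58.5% suggests that while the N-Gram model captures some patterns, there is room for improvement, particularly in separating negative and neutral comments from positive ones.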
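As a sanity check on the table, the per-class F1 scores and both averages can be recomputed from the reported precision, recall, and support. This sketch uses only the rounded figures printed in the classification report, so small rounding differences are possible in general:

```python
# Per-class figures copied from the classification report above
report = {
    -1: {"precision": 0.58, "recall": 0.31, "f1": 0.40, "support": 1667},
    0: {"precision": 0.90, "recall": 0.34, "f1": 0.49, "support": 2615},
    1: {"precision": 0.53, "recall": 0.94, "f1": 0.68, "support": 3168},
}

def f1_score_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

recomputed_f1 = {
    label: round(f1_score_from(row["precision"], row["recall"]), 2)
    for label, row in report.items()
}

total_support = sum(row["support"] for row in report.values())
# Macro average: every class counts equally
macro_f1 = sum(row["f1"] for row in report.values()) / len(report)
# Weighted average: every class counts in proportion to its support
weighted_f1 = sum(row["f1"] * row["support"] for row in report.values()) / total_support

print(recomputed_f1, round(macro_f1, 2), round(weighted_f1, 2))
```

The weighted average is higher than the macro average because the best-scoring class (positive) is also the largest.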


# Section 3

3. Implement a Hidden Markov Model to tag the comments, using POS tag.

  **a.** Use some tagged corpus - like the Treebank on NLTK - to train your HMM model. Remember to split the data into train and test.

  **b.** Evaluate the precision of your model.

  **c.** Use your model to tag words in your comments.
  
  **d.** Comment on the test results you have obtained: take a look at a small part of your tagged data and see if your results make sense.



**a.** Use some tagged corpus - like the Treebank on NLTK - to train your HMM model. Remember to split the data into train and test.

---

In [5]:
import nltk
from sklearn.model_selection import train_test_split
from nltk.tag import hmm

# Download the necessary corpus (Treebank)
nltk.download('treebank')

# Load the Treebank corpus and prepare the data
tagged_sentences = nltk.corpus.treebank.tagged_sents()

# Split the tagged sentences into training and testing data (80% train, 20% test)
train_data, test_data = train_test_split(tagged_sentences, test_size=0.2, random_state=42)

# Train the HMM model using the training data
trainer = hmm.HiddenMarkovModelTrainer()
hmm_model = trainer.train(train_data)

[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Package treebank is already up-to-date!


**b.** Evaluate the precision of your model.


---

In [6]:
# Evaluate the HMM model on the test data
test_accuracy = hmm_model.accuracy(test_data)

# Display the test accuracy
print(f"Test Accuracy: {test_accuracy:.4f}")


Test Accuracy: 0.4737
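An accuracy of 0.4737 is low for a tagger evaluated on held-out sentences from its own corpus. One plausible cause (an assumption, not verified in this notebook) is that `HiddenMarkovModelTrainer` defaults to unsmoothed maximum-likelihood estimates, so any word unseen in training receives zero emission probability for every tag. A minimal sketch of supervised training with Lidstone smoothing, shown on a hypothetical toy corpus so that it runs without downloads; the smoothing value 0.1 is an arbitrary choice:

```python
from nltk.probability import LidstoneProbDist
from nltk.tag import hmm

# Hypothetical toy training data in the same (word, tag) format as treebank.tagged_sents()
train_data = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("a", "DT"), ("dog", "NN"), ("sleeps", "VBZ")],
]

trainer = hmm.HiddenMarkovModelTrainer()
# Lidstone smoothing adds a small pseudo-count to every outcome,
# so unseen words and transitions keep a non-zero probability
smoothed_model = trainer.train_supervised(
    train_data,
    estimator=lambda fdist, bins: LidstoneProbDist(fdist, 0.1, bins),
)

sentence = ["the", "mormon", "sleeps"]  # "mormon" does not occur in the training data
tagged = smoothed_model.tag(sentence)
print(tagged)
```

With the real data, the same `estimator` argument could be passed when training on the Treebank split, and the test accuracy compared against the unsmoothed model.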


**c.** Use your model to tag words in your comments.

---

In [7]:
# Re-load the Reddit dataset (make sure it's loaded in your environment)
import pandas as pd
reddit_data = pd.read_csv('/content/Reddit_Data.csv')

# Tokenize the comments
reddit_comments = reddit_data['clean_comment'].astype(str).apply(nltk.word_tokenize)

# A function to safely tag a sentence with the HMM model, handling out-of-vocabulary (OOV) words
def safe_tagging(sentence, model):
    """
    Safely tags a sentence using the HMM model. If a word is out-of-vocabulary (OOV),
    it will be tagged with a special tag 'OOV'.
    """
    try:
        return model.tag(sentence)
    except IndexError:
        return [(word, 'OOV') for word in sentence]

# Apply the tagging function to a small sample (to avoid long execution times)
reddit_tagged_comments_sample = reddit_comments.head(10).apply(lambda comment: safe_tagging(comment, hmm_model))

# Display the tagged sample
for i, comment in enumerate(reddit_tagged_comments_sample):
    print(f"Comment {i+1}: {comment}")


Comment 1: [('family', 'NN'), ('mormon', 'NNP'), ('have', 'NNP'), ('never', 'NNP'), ('tried', 'NNP'), ('explain', 'NNP'), ('them', 'NNP'), ('they', 'NNP'), ('still', 'NNP'), ('stare', 'NNP'), ('puzzled', 'NNP'), ('from', 'NNP'), ('time', 'NNP'), ('time', 'NNP'), ('like', 'NNP'), ('some', 'NNP'), ('kind', 'NNP'), ('strange', 'NNP'), ('creature', 'NNP'), ('nonetheless', 'NNP'), ('they', 'NNP'), ('have', 'NNP'), ('come', 'NNP'), ('admire', 'NNP'), ('for', 'NNP'), ('the', 'NNP'), ('patience', 'NNP'), ('calmness', 'NNP'), ('equanimity', 'NNP'), ('acceptance', 'NNP'), ('and', 'NNP'), ('compassion', 'NNP'), ('have', 'NNP'), ('developed', 'NNP'), ('all', 'NNP'), ('the', 'NNP'), ('things', 'NNP'), ('buddhism', 'NNP'), ('teaches', 'NNP')]
Comment 2: [('buddhism', 'NNP'), ('has', 'NNP'), ('very', 'NNP'), ('much', 'NNP'), ('lot', 'NNP'), ('compatible', 'NNP'), ('with', 'NNP'), ('christianity', 'NNP'), ('especially', 'NNP'), ('considering', 'NNP'), ('that', 'NNP'), ('sin', 'NNP'), ('and', 'NNP'), (

**d.** Comment on the test results you have obtained: take a look at a small part of your tagged data and see if your results make sense.

---

In [8]:
# Manually inspect a few tagged comments from the sample to evaluate the results
for i, comment in enumerate(reddit_tagged_comments_sample):
    print(f"Comment {i+1}: {comment}")

Comment 1: [('family', 'NN'), ('mormon', 'NNP'), ('have', 'NNP'), ('never', 'NNP'), ('tried', 'NNP'), ('explain', 'NNP'), ('them', 'NNP'), ('they', 'NNP'), ('still', 'NNP'), ('stare', 'NNP'), ('puzzled', 'NNP'), ('from', 'NNP'), ('time', 'NNP'), ('time', 'NNP'), ('like', 'NNP'), ('some', 'NNP'), ('kind', 'NNP'), ('strange', 'NNP'), ('creature', 'NNP'), ('nonetheless', 'NNP'), ('they', 'NNP'), ('have', 'NNP'), ('come', 'NNP'), ('admire', 'NNP'), ('for', 'NNP'), ('the', 'NNP'), ('patience', 'NNP'), ('calmness', 'NNP'), ('equanimity', 'NNP'), ('acceptance', 'NNP'), ('and', 'NNP'), ('compassion', 'NNP'), ('have', 'NNP'), ('developed', 'NNP'), ('all', 'NNP'), ('the', 'NNP'), ('things', 'NNP'), ('buddhism', 'NNP'), ('teaches', 'NNP')]
Comment 2: [('buddhism', 'NNP'), ('has', 'NNP'), ('very', 'NNP'), ('much', 'NNP'), ('lot', 'NNP'), ('compatible', 'NNP'), ('with', 'NNP'), ('christianity', 'NNP'), ('especially', 'NNP'), ('considering', 'NNP'), ('that', 'NNP'), ('sin', 'NNP'), ('and', 'NNP'), (

**Comments on the Results:**
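- The tagging results do not look sensible: apart from the first token ('family', tagged NN), every token in the sample is tagged NNP, including function words such as 'the', 'and', and 'have' that are common in the Treebank.
- This is consistent with the low test accuracy (0.4737). A likely cause is that the HMM was trained with NLTK's default unsmoothed (maximum-likelihood) estimates, so a single word unseen in training (for example 'mormon') gives every tag sequence zero probability and the tagger falls back to an uninformative tag for the rest of the sentence.
- The domains also differ: the Reddit comments are lowercased and stripped of punctuation, whereas the Treebank corpus is formal, cased newswire text. A smoothed model, or a more domain-specific POS tagger (such as a Twitter POS tagger), would likely yield better results for social media data.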

# Section 4

4. Perform the Semantic and Syntactic analysis of your comments.

  **a.** Use a parser to analyze the grammatical structure of the comments.
  
  **b.** Try to find some polysemic word in your comments and disambiguate its meaning using the Lesk Algorithm. If you can't find any, please provide some examples on sentences you invented.

**a.** Use a parser to analyze the grammatical structure of the comments.

---

In [9]:
import spacy

# Load spaCy's small English model for dependency parsing
nlp = spacy.load('en_core_web_sm')

# Let's use one comment from the Reddit dataset for parsing
# Assuming reddit_data['clean_comment'] contains the cleaned comments
sample_comment = reddit_data['clean_comment'].iloc[10]  # For example, take the comment at index 10 (the 11th row)

# Parse the comment
doc = nlp(sample_comment)

# Print out the dependency relations for each word in the comment
print(f"Original comment: {sample_comment}")
print("\nDependency Parsing:")
for token in doc:
    print(f"{token.text} ({token.pos_}) -> {token.dep_} -> {token.head.text}")

# Optionally visualize the dependencies (for Jupyter notebooks)
from spacy import displacy
displacy.render(doc, style="dep", jupyter=True)


Original comment:  recently told family that buddhist live the bible belt this whole ordeal involved leaving the baptist church and everything been pretty rough but those who really care about have been open and accepting they seen the good has created life and relationships with others fact there are handful christians who have lovely conversations with and that truly respect someone else suggested living buddha living christ great one read about the important dialogue between buddhists and christians also welcome you message 

Dependency Parsing:
  (SPACE) -> dep -> told
recently (ADV) -> advmod -> told
told (VERB) -> ccomp -> suggested
family (NOUN) -> dobj -> told
that (SCONJ) -> mark -> live
buddhist (NOUN) -> nsubj -> live
live (VERB) -> ccomp -> told
the (DET) -> det -> belt
bible (ADJ) -> amod -> belt
belt (NOUN) -> dobj -> live
this (DET) -> det -> ordeal
whole (ADJ) -> amod -> ordeal
ordeal (NOUN) -> nsubj -> involved
involved (VERB) -> relcl -> belt
leaving (VERB) -> xcomp -
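The dependency parse above treats the whole comment as one sentence, since the cleaned text has no punctuation. As a complementary view, a constituency parse can be sketched with NLTK's chart parser. The grammar below is a hypothetical toy grammar written for one invented sentence; real comments would need a far broader grammar or a trained parser.

```python
import nltk

# Hypothetical toy grammar covering a single invented sentence
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the'
N -> 'family' | 'church'
V -> 'left'
""")

parser = nltk.ChartParser(grammar)
tokens = "the family left the church".split()
trees = list(parser.parse(tokens))  # all parse trees licensed by the grammar

for tree in trees:
    print(tree)
```

Each tree groups the tokens into nested phrases (noun phrase, verb phrase), whereas the dependency output lists head-dependent pairs.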

**b.** Try to find some polysemic word in your comments and disambiguate its meaning using the Lesk Algorithm. If you can't find any, please provide some examples on sentences you invented.

---

In [10]:
import nltk
from nltk.corpus import wordnet as wn
from nltk.tokenize import word_tokenize
from nltk.wsd import lesk

nltk.download('wordnet')  # WordNet supplies the sense inventory used by lesk

# Let's find a comment in the dataset that contains the word "leader"
# Filtering for comments with the word 'leader'
# Use na=False to handle NaN values
leader_comments = reddit_data[reddit_data['clean_comment'].str.contains('leader', na=False)]

# Pick a sample comment containing 'leader' in a political context
if not leader_comments.empty:
    sample_leader_comment = leader_comments['clean_comment'].iloc[0]  # First comment as an example

    # Tokenize the sample political comment
    tokens_leader = word_tokenize(sample_leader_comment)

    # Apply the Lesk algorithm to disambiguate the meaning of the word "leader"
    sense_leader = lesk(tokens_leader, 'leader')

    # Display the original comment and the disambiguated result
    print(f"Comment: {sample_leader_comment}")
    print(f"Disambiguated sense for 'leader': {sense_leader}")
    if sense_leader:
        print(f"Definition: {sense_leader.definition()}")
else:
    print("No comments found with the word 'leader'.")

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Comment: who the most charismatic leader after modi who has worked the ground level for the upliftment gujarat state let know the people who have worked immensely for the development gujarat but are not quite popular the national media 
Disambiguated sense for 'leader': Synset('leader.n.01')
Definition: a person who rules or guides or inspires others
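To make the mechanism behind Lesk concrete, and to cover the invented-sentence option from the task, here is a simplified version of the algorithm: pick the sense whose gloss shares the most words with the context. The sense inventory, glosses, and stop-word list below are hypothetical hand-written stand-ins for WordNet, and `simplified_lesk` is an illustrative name rather than the NLTK implementation.

```python
# Hypothetical two-sense inventory standing in for WordNet glosses
SENSES = {
    "bank": {
        "bank.financial": "an institution that accepts deposits of money and makes loans",
        "bank.river": "sloping land beside a body of water such as a river",
    }
}
STOP = {"a", "an", "the", "of", "and", "that", "such", "as", "at", "on", "we", "she"}

def simplified_lesk(context_tokens, word, inventory=SENSES):
    """Return the sense whose gloss overlaps most with the context words."""
    context = {t.lower() for t in context_tokens} - STOP - {word}
    best_sense, best_overlap = None, -1
    for sense, gloss in inventory[word].items():
        gloss_tokens = set(gloss.lower().split()) - STOP
        overlap = len(context & gloss_tokens)
        if overlap > best_overlap:  # ties keep the first sense listed
            best_sense, best_overlap = sense, overlap
    return best_sense

sense_money = simplified_lesk("she deposited money at the bank".split(), "bank")
sense_river = simplified_lesk("we sat on the bank of the river".split(), "bank")
print(sense_money, sense_river)
```

Because the decision rests on literal word overlap, short or noisy contexts such as social media comments often give the algorithm little to work with.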


# Section 6

6. Write a conclusion to your task.


### Conclusion:

In this task, we explored various aspects of Natural Language Processing (NLP), including data preprocessing, tokenization, model building, and semantic/syntactic analysis. Here's a summary of the key exercises and the overall learning points:

#### **Exercises 1-4: Implemented**
1. **Preprocessing the Data**:
   - We successfully cleaned and tokenized the Reddit dataset, converting raw text into a format suitable for machine learning models.
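   - Tokenization, uppercasing, and punctuation removal prepared the data for the subsequent N-Gram and sentiment analysis. Note that the printed output still contains stop words such as HAVE and THE, so the stop-word filter did not take effect as written.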

2. **N-Gram Model & Sentiment Analysis**:
   - Using N-Grams, we performed frequency analysis and built a Naive Bayes model for sentiment classification.
   - The classifier reached 58.5% accuracy. It captured certain patterns in the text but over-predicted the positive class (recall 0.94, precision 0.53) and missed most negative and neutral comments (recall 0.31 and 0.34), which may partly reflect the complexity of political discussions.

3. **Hidden Markov Model for POS Tagging**:
   - We trained a Hidden Markov Model (HMM) on the NLTK Treebank corpus to tag parts of speech (POS) in the comments. The model reached only 47.4% accuracy on the held-out Treebank sentences, and on the Reddit sample it tagged almost every token as NNP, so in its current form it does not capture the syntactic structure of the comments.
   - POS tagging helps in understanding the grammatical structure, which can be useful in more complex NLP tasks such as parsing or machine translation.

4. **Semantic and Syntactic Analysis**:
   - By using a parser, we analyzed the grammatical structure of sentences. This demonstrated how parsing helps extract relationships between words and identify grammatical dependencies.
   - We also disambiguated polysemous words using the Lesk Algorithm, a classic approach to word sense disambiguation. This step highlighted the challenges of word sense disambiguation in context, particularly when handling political or ambiguous terms in text.

#### **Exercise 5: Machine Translation with Seq2Seq LSTM - Not Completed**
   **Limitations**:
   - **Data Size and Complexity**: The translation task required handling a large parallel corpus of English-Spanish sentences, which led to significant computational resource demands. Google Colab's memory limits and hardware constraints impacted the model's ability to process the large dataset efficiently. This hindered the ability to train a Seq2Seq LSTM model for machine translation.
   
   - **Training Challenges**: Training a deep learning model like LSTM for machine translation requires considerable computational resources, especially when dealing with large vocabulary sizes and long sequences. The dataset contained numerous unique tokens, increasing the complexity of training. This resulted in memory and shape mismatch errors that were difficult to resolve within the Colab environment due to its limitations.
   
   - **Model Architecture**: Building and training a Seq2Seq LSTM model for machine translation, while conceptually straightforward, requires careful tuning of input and output sequence lengths, padding, and vocabulary management. The large model size (over 21 million parameters) further compounded the resource constraints, making it infeasible to train the model successfully in this environment.
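To illustrate how vocabulary size drives the size of such a model, the parameter count of a basic encoder-decoder can be worked out by hand. The sketch below assumes one embedding layer per language, one LSTM layer each for encoder and decoder, and a dense output layer over the target vocabulary. The example sizes are hypothetical and are not the configuration that produced the 21 million parameters mentioned above.

```python
def embedding_params(vocab_size, embed_dim):
    """One embed_dim-sized vector per vocabulary entry."""
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, output_dim):
    """Weight matrix plus one bias per output."""
    return input_dim * output_dim + output_dim

def seq2seq_params(src_vocab, tgt_vocab, embed_dim, units):
    """Total for: two embeddings, encoder LSTM, decoder LSTM, output projection."""
    return (
        embedding_params(src_vocab, embed_dim)
        + embedding_params(tgt_vocab, embed_dim)
        + 2 * lstm_params(embed_dim, units)
        + dense_params(units, tgt_vocab)
    )

# Hypothetical sizes: 20,000-word vocabularies, 256-dim embeddings, 256 LSTM units
total = seq2seq_params(src_vocab=20_000, tgt_vocab=20_000, embed_dim=256, units=256)
print(f"{total:,} parameters")
```

In this configuration the embeddings and the output projection account for most of the parameters, so capping the vocabulary is usually the most direct way to shrink the model.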

### **Overall Conclusion**:
The tasks provided valuable insights into the complexities of NLP, particularly in sentiment analysis, POS tagging, and machine translation. While we successfully implemented various models for analyzing and processing text, the limitations of cloud-based environments like Google Colab prevented us from completing the machine translation task.

**Future Improvements**:
- **Use of Pretrained Models**: For machine translation, pretrained models such as those in the **Hugging Face Transformers** library could offer a more efficient approach by leveraging existing sequence-to-sequence models (e.g., MarianMT, T5) that are trained for or can be applied to translation tasks.
- **Resource Optimization**: To complete the machine translation task, a more powerful environment with greater computational resources, such as a dedicated GPU or cloud computing instance, would be necessary.
- **Attention Mechanism**: Future implementations should explore using attention mechanisms or Transformer-based models for translation, which are more effective than traditional LSTMs in handling long sequences.

In conclusion, while the task revealed the potential of NLP techniques in various domains, it also highlighted the importance of computational resources and sophisticated architectures when tackling complex tasks like machine translation.