<a href="https://colab.research.google.com/github/SfurtiR/Natural-Language-Processing/blob/main/Sentiment_Analysis_using_LSTM_(Deep_Learning).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## ***Sentiment Analysis NLP project***

We will:
*   Preprocess text with NLTK, `contractions`, and `emoji`.
*   Train a small LSTM classifier with Keras for binary sentiment classification.
*   Compare it with a pretrained BERT sentiment model from Hugging Face.


**Step 1: Install and Import Required Libraries**





In [None]:
pip install tensorflow keras numpy pandas nltk contractions emoji scikit-learn

Collecting contractions
  Downloading contractions-0.1.73-py2.py3-none-any.whl.metadata (1.2 kB)
Collecting emoji
  Downloading emoji-2.14.1-py3-none-any.whl.metadata (5.7 kB)
Collecting textsearch>=0.0.21 (from contractions)
  Downloading textsearch-0.0.24-py2.py3-none-any.whl.metadata (1.2 kB)
Collecting anyascii (from textsearch>=0.0.21->contractions)
  Downloading anyascii-0.3.2-py3-none-any.whl.metadata (1.5 kB)
Collecting pyahocorasick (from textsearch>=0.0.21->contractions)
  Downloading pyahocorasick-2.1.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (13 kB)
Downloading contractions-0.1.73-py2.py3-none-any.whl (8.7 kB)
Downloading emoji-2.14.1-py3-none-any.whl (590 kB)
Downloading textsearch-0.0.24-py2.py3-none-any.whl (7.6 kB)
Downloading anyascii-0.3.2-py3-none-any.whl (289 kB)

In [None]:
import numpy as np
import pandas as pd
import re
import string
import contractions
import emoji
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder


**Step 2: Preprocessing Function for Sentiment Analysis**


In [None]:
# Download necessary NLTK datasets
nltk.download('punkt_tab')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [None]:
# Load spaCy NLP model (optional: `nlp` is not used by the preprocessing function in this notebook)
import spacy
nlp = spacy.load("en_core_web_sm")


In [None]:
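def preprocess_text(text):
    """Clean and normalize raw text for sentiment analysis."""

    text = text.lower()
    # Expand contractions, e.g. "it's" -> "it is"
    text = contractions.fix(text)
    # Replace emojis with their text names, padded by spaces
    text = emoji.demojize(text, delimiters=(" ", " "))
    # Remove ASCII punctuation (this also strips the underscores in emoji names)
    text = text.translate(str.maketrans('', '', string.punctuation))
    words = word_tokenize(text)
    # Drop English stop words (note: the NLTK list includes negations such as "not")
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word not in stop_words]
    # Reduce each word to its WordNet lemma
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(word) for word in words]
    return " ".join(words)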
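
Two side effects of this pipeline are easy to miss, and the stdlib-only sketch below illustrates them under stated assumptions. First, `string.punctuation` contains the underscore, so a demojized name such as `smiling_face_with_smiling_eyes` (assumed here to be what `emoji.demojize` yields for 😊) collapses into a single token. Second, NLTK's English stop-word list includes negations such as "not"; the small `STOP_WORDS` set below is a hypothetical stand-in for that list, not the real one.

```python
import string

# Assumed result of emoji.demojize(..., delimiters=(" ", " ")) on a lowercased review.
demojized = "it is amazing smiling_face_with_smiling_eyes"

strip_punct = str.maketrans("", "", string.punctuation)

# Same punctuation step as preprocess_text: underscores vanish, words fuse together.
collapsed = demojized.translate(strip_punct)

# One possible workaround: turn underscores into spaces before stripping punctuation.
separated = demojized.replace("_", " ").translate(strip_punct)

# Hypothetical stand-in for stopwords.words('english'), which also contains "not".
STOP_WORDS = {"is", "it", "not", "the", "but", "be"}
review = "not bad but could be better"
kept = [word for word in review.split() if word not in STOP_WORDS]

print(collapsed)   # emoji name fused into one token
print(separated)   # emoji name kept as separate words
print(kept)        # the negation is gone
```

If negation matters for the task, one option is to remove words such as "not" and "no" from the stop-word set before filtering.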


**Step 3: Load Sentiment Data**



We’ll use a small in-memory sample dataset of eight reviews. The sentiment labels start as strings and are encoded as integers (0 for negative, 1 for positive).


In [None]:
# Example dataset
data = {
    "text": [
        "I love this product! It's amazing 😊",
        "This is the worst experience ever. I hate it!",
        "I'm so happy with my purchase. It's perfect!",
        "The quality is terrible and I'm very disappointed.",
        "Absolutely wonderful! Will buy again.",
        "Not worth the money. Poor quality.",
        "Great customer service and fast delivery!",
        "This was a waste of my time and money."
    ],
    "sentiment": ["positive", "negative", "positive", "negative", "positive", "negative", "positive", "negative"]
}

df = pd.DataFrame(data)
df['clean_text'] = df['text'].apply(preprocess_text)

# Encode sentiment labels (0 = Negative, 1 = Positive)
label_encoder = LabelEncoder()
df['sentiment'] = label_encoder.fit_transform(df['sentiment'])

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(df['clean_text'], df['sentiment'], test_size=0.2, random_state=42)
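
As a sanity check on the encoding and the split sizes, here is a stdlib-only sketch. It relies on two scikit-learn behaviours: `LabelEncoder` assigns integer codes in sorted class order, and `train_test_split` with a float `test_size` rounds the test count up. The helper name `encode_labels` is hypothetical and only mimics the encoder.

```python
import math

def encode_labels(labels):
    """Mimic LabelEncoder: integer codes follow the sorted order of the classes."""
    classes = sorted(set(labels))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[label] for label in labels], mapping

# Same label pattern as the sample dataset: alternating positive / negative.
labels = ["positive", "negative"] * 4
encoded, mapping = encode_labels(labels)

# test_size=0.2 on 8 samples: the test count is rounded up, the rest is training data.
n_samples = len(labels)
n_test = math.ceil(0.2 * n_samples)
n_train = n_samples - n_test

print(mapping)
print(encoded)
print(n_train, n_test)
```

With only two held-out reviews, validation accuracy can take just three values: 0%, 50%, or 100%.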


**Step 4: Convert Text Data into Numerical Features**

We use the Keras `Tokenizer` to map each word to an integer index, then pad the resulting sequences to a common length so they can be fed to the embedding layer.

In [None]:
# Tokenize text
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(X_train)

# Convert text to sequences
X_train_seq = tokenizer.texts_to_sequences(X_train)
X_test_seq = tokenizer.texts_to_sequences(X_test)

# Padding to ensure equal length
max_length = max(len(seq) for seq in X_train_seq)
X_train_pad = pad_sequences(X_train_seq, maxlen=max_length, padding='post')
X_test_pad = pad_sequences(X_test_seq, maxlen=max_length, padding='post')
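
To make the tokenize-and-pad step concrete, the following pure-Python sketch reproduces the idea on two toy sentences. It is a conceptual approximation, not the Keras implementation: `build_word_index`, `to_sequences`, and `pad_post` are hypothetical helpers. It assumes, as in Keras, that indices start at 1 in order of descending frequency, that 0 is reserved for padding, and that words unseen during fitting are dropped because no `oov_token` is set.

```python
from collections import Counter

def build_word_index(texts):
    """Assign indices 1, 2, ... by descending word frequency (0 is kept for padding)."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index):
    """Map words to indices, silently dropping words that were never seen in training."""
    return [[word_index[w] for w in text.split() if w in word_index] for text in texts]

def pad_post(sequences, maxlen):
    """Append zeros up to maxlen; longer sequences keep only their last maxlen items."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]
        padded.append(seq + [0] * (maxlen - len(seq)))
    return padded

train_texts = ["love product amazing", "worst experience ever hate"]
word_index = build_word_index(train_texts)
train_seq = to_sequences(train_texts, word_index)
max_len = max(len(seq) for seq in train_seq)

train_pad = pad_post(train_seq, max_len)
test_pad = pad_post(to_sequences(["love terrible product"], word_index), max_len)

print(word_index)
print(train_pad)
print(test_pad)
```

The unseen word "terrible" disappears from the test sequence, which is one reason a vocabulary built from six training reviews generalizes poorly to new text.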


**Step 5: Build LSTM Model**


In [None]:
# Define LSTM model
model = Sequential([
    Embedding(input_dim=5000, output_dim=128),  # input_length is deprecated in recent Keras and not needed
    LSTM(128, return_sequences=True),
    LSTM(64),
    Dropout(0.5),
    Dense(32, activation='relu'),
    Dense(1, activation='sigmoid')  # Sigmoid for binary classification
])

# Compile model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(X_train_pad, y_train, epochs=5, batch_size=4, validation_data=(X_test_pad, y_test))


Epoch 1/5
2/2 ━━━━━━━━━━━━━━━━━━━━ 6s 607ms/step - accuracy: 0.1944 - loss: 0.6941 - val_accuracy: 0.0000e+00 - val_loss: 0.6949
Epoch 2/5
2/2 ━━━━━━━━━━━━━━━━━━━━ 1s 81ms/step - accuracy: 1.0000 - loss: 0.6894 - val_accuracy: 1.0000 - val_loss: 0.6916
Epoch 3/5
2/2 ━━━━━━━━━━━━━━━━━━━━ 0s 89ms/step - accuracy: 0.6944 - loss: 0.6901 - val_accuracy: 1.0000 - val_loss: 0.6922
Epoch 4/5
2/2 ━━━━━━━━━━━━━━━━━━━━ 0s 142ms/step - accuracy: 1.0000 - loss: 0.6863 - val_accuracy: 1.0000 - val_loss: 0.6909
Epoch 5/5
2/2 ━━━━━━━━━━━━━━━━━━━━ 0s 97ms/step - accuracy: 1.0000 - loss: 0.6827 - val_accuracy: 1.0000 - val_loss: 0.6882


<keras.src.callbacks.history.History at 0x7b0e35fe7650>
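
For a sense of scale, the sketch below counts the trainable parameters of the architecture above using the standard formulas for embedding, LSTM, and dense layers (assuming the default Keras LSTM, with one bias vector per gate). The function names are hypothetical; `model.summary()` remains the authoritative source for the real numbers.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary entry.
    return vocab_size * dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units

layers = {
    "embedding": embedding_params(5000, 128),
    "lstm_128": lstm_params(128, 128),
    "lstm_64": lstm_params(128, 64),
    "dense_32": dense_params(64, 32),
    "dense_1": dense_params(32, 1),
}
total = sum(layers.values())

for name, count in layers.items():
    print(f"{name:>10}: {count:,}")
print(f"{'total':>10}: {total:,}")
```

Under these assumptions the model has over 800,000 parameters, fitted here to six training reviews, so the accuracy figures in the log say little about how it would behave on real data.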

**Step 6: Test the Model on New Data**

In [None]:
def predict_sentiment_lstm(text):
    """Predict sentiment using LSTM model"""
    clean_text = preprocess_text(text)
    seq = tokenizer.texts_to_sequences([clean_text])
    pad_seq = pad_sequences(seq, maxlen=max_length, padding='post')
    prediction = model.predict(pad_seq)[0][0]
    return "Positive" if prediction > 0.5 else "Negative"

# Test Predictions
test_texts = [
    "I absolutely love this! It's fantastic.",
    "Worst product ever. I regret buying it!",
    "Not bad, but could be better.",
    "The customer service was very helpful!"
]

for text in test_texts:
    print(f"Text: {text} --> Sentiment: {predict_sentiment_lstm(text)}")


1/1 ━━━━━━━━━━━━━━━━━━━━ 1s 605ms/step
Text: I absolutely love this! It's fantastic. --> Sentiment: Positive
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 72ms/step
Text: Worst product ever. I regret buying it! --> Sentiment: Positive
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 87ms/step
Text: Not bad, but could be better. --> Sentiment: Negative
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 51ms/step
Text: The customer service was very helpful! --> Sentiment: Positive
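
The training log reports losses of about 0.69, and the second review above ("Worst product ever...") is labelled Positive. The short sketch below, using only the standard library, suggests why that loss value means the model has barely moved from chance: binary cross-entropy equals ln 2 ≈ 0.693 when every prediction is 0.5. `binary_cross_entropy` is a hypothetical helper written for illustration, not the Keras implementation.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy; predictions are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

y_true = [1, 0, 1, 0, 1, 0]

# A model that outputs 0.5 for everything (pure guessing).
chance = binary_cross_entropy(y_true, [0.5] * len(y_true))

# A model that is right and fairly confident on every example.
confident = binary_cross_entropy(y_true, [0.9, 0.1, 0.9, 0.1, 0.9, 0.1])

print(f"chance-level loss: {chance:.4f}")
print(f"confident loss:    {confident:.4f}")
```

A loss that stays near the chance-level value after five epochs is a sign that the predictions sit close to 0.5, so small shifts can flip a label either way.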


**Option 2: Sentiment Analysis using BERT (Transformer)**

In [None]:
from transformers import BertTokenizer, BertForSequenceClassification
import torch


**Load Pretrained BERT Model**

In [None]:
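# Load tokenizer and model.
# Note: these assignments rebind `tokenizer` and `model`, so predict_sentiment_lstm
# from Step 6 will no longer work once this cell has run.
tokenizer = BertTokenizer.from_pretrained("nlptown/bert-base-multilingual-uncased-sentiment")
model = BertForSequenceClassification.from_pretrained("nlptown/bert-base-multilingual-uncased-sentiment")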




**Predict Sentiment using BERT**

In [None]:
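def predict_sentiment_bert(text):
    """Predict sentiment using BERT"""
    tokens = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
    # Inference only, so skip gradient tracking
    with torch.no_grad():
        output = model(**tokens)
    prediction = torch.argmax(output.logits, dim=1).item()

    # The model predicts a 1-5 star rating (class index 0 = 1 star, ..., 4 = 5 stars);
    # the names below are this notebook's own labels for those ratings.
    sentiments = {0: "Very Negative", 1: "Negative", 2: "Neutral", 3: "Positive", 4: "Very Positive"}
    return sentiments[prediction]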

# Test Predictions
test_texts = [
    "I absolutely love this! It's fantastic.",
    "Worst product ever. I regret buying it!",
    "Not bad, but could be better.",
    "The customer service was very helpful!"
]

for text in test_texts:
    print(f"Text: {text} --> Sentiment: {predict_sentiment_bert(text)}")



Text: I absolutely love this! It's fantastic. --> Sentiment: Very Positive
Text: Worst product ever. I regret buying it! --> Sentiment: Very Negative
Text: Not bad, but could be better. --> Sentiment: Neutral
Text: The customer service was very helpful! --> Sentiment: Very Positive
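
`torch.argmax` returns only the winning class. To also see how confident the model is, the logits can be passed through a softmax. The sketch below does this in plain Python on made-up logits (not real model output), and it assumes, following the model's description on the Hugging Face Hub, that class index `i` corresponds to a rating of `i + 1` stars. The `softmax` helper is written here for illustration.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the five rating classes of a single review.
logits = [-2.1, -1.3, 0.2, 1.9, 2.4]
probs = softmax(logits)

best = max(range(len(probs)), key=probs.__getitem__)
stars = best + 1  # class index 0 means 1 star

print([round(p, 3) for p in probs])
print(f"{stars} stars with probability {probs[best]:.2f}")
```

With the real model, the same probabilities are available via `torch.softmax(output.logits, dim=1)`, which can help separate a clear "Very Positive" from a near tie between two adjacent ratings.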
