[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Nepal-College-of-Information-Technology/AI-Data-Science-Workshop-2024/blob/main/Day%2010%20%3A%20Natural%20Language%20Processing/Part2_Sentiment_Analysis_Model.ipynb)


# Part 2: Building a Sentiment Analysis Model

In this notebook, we will build a simple **Sentiment Analysis Model** using a small, hand-written dataset of movie reviews. The model will classify each review as either **positive** or **negative**.

We will use the following steps to build the model:
- Load and explore the dataset.
- Preprocess the text data (tokenization, stopword removal, lemmatization).
- Convert the text data into a numerical format (vectorization).
- Train a machine learning model to classify sentiment.
- Evaluate the model's performance.

---

### Step 1: Import Libraries and Load Dataset

First, we will import the required libraries and create the dataset. To keep things simple, we'll use ten hand-written movie reviews, five positive and five negative.

In [6]:
# Import necessary libraries
import nltk
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# Create a simple dataset of movie reviews and their sentiments
data = pd.DataFrame({
    'review': [
        "This movie was fantastic! I loved it.",
        "The film was a complete waste of time.",
        "What a brilliant and captivating story.",
        "I hated every minute of this film.",
        "Absolutely wonderful! A masterpiece.",
        "It was so boring, I fell asleep halfway.",
        "A delightful movie that brought tears to my eyes.",
        "Terrible. I wouldn't recommend this to anyone.",
        "I thoroughly enjoyed the plot and the characters.",
        "This is the worst movie I have ever seen."
    ],
    'sentiment': [
        'positive', 'negative', 'positive', 'negative', 'positive',
        'negative', 'positive', 'negative', 'positive', 'negative'
    ]
})

# Display the first few rows of the dataset
data.head()

|   | review | sentiment |
|---|--------|-----------|
| 0 | This movie was fantastic! I loved it. | positive |
| 1 | The film was a complete waste of time. | negative |
| 2 | What a brilliant and captivating story. | positive |
| 3 | I hated every minute of this film. | negative |
| 4 | Absolutely wonderful! A masterpiece. | positive |


### Step 2: Text Preprocessing

We’ll preprocess the text by tokenizing, removing stopwords, and lemmatizing the words. This step cleans the text before it is passed to the model.

In [7]:
# Download required NLTK datasets
# (newer NLTK releases may also need nltk.download('punkt_tab') for word_tokenize)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Build these once rather than on every call
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Preprocessing function
def preprocess_text(text):
    # Tokenize the lowercased text
    tokens = word_tokenize(text.lower())

    # Remove stopwords
    tokens = [word for word in tokens if word not in stop_words]

    # Lemmatize each token (default part of speech: noun)
    tokens = [lemmatizer.lemmatize(word) for word in tokens]

    # Note: punctuation tokens such as "!" and "." are kept
    return " ".join(tokens)

# Apply preprocessing to the reviews
data['processed_review'] = data['review'].apply(preprocess_text)

# Show the original and processed text
data[['review', 'processed_review']].head()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


|   | review | processed_review |
|---|--------|------------------|
| 0 | This movie was fantastic! I loved it. | movie fantastic ! loved . |
| 1 | The film was a complete waste of time. | film complete waste time . |
| 2 | What a brilliant and captivating story. | brilliant captivating story . |
| 3 | I hated every minute of this film. | hated every minute film . |
| 4 | Absolutely wonderful! A masterpiece. | absolutely wonderful ! masterpiece . |
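The processed text still contains punctuation tokens such as `!` and `.`. They do no harm here, because `CountVectorizer`'s default tokenizer skips them later, but they can also be dropped during preprocessing. Below is a minimal sketch of that idea. It assumes a tiny hand-written stopword set (`DEMO_STOPWORDS`, a hypothetical name) instead of NLTK's list and a simple regex instead of `word_tokenize`, so it runs without any downloads; it does not lemmatize.

```python
import re

# Hypothetical, tiny stopword set for illustration only;
# NLTK's English stopword list is much longer.
DEMO_STOPWORDS = {"this", "was", "i", "it", "the", "a", "of", "and"}

def simple_preprocess(text):
    # Lowercase, then keep only runs of letters (drops punctuation and digits)
    tokens = re.findall(r"[a-z]+", text.lower())
    # Remove stopwords
    tokens = [tok for tok in tokens if tok not in DEMO_STOPWORDS]
    return " ".join(tokens)

print(simple_preprocess("This movie was fantastic! I loved it."))
# prints: movie fantastic loved
```

Note that the letters-only regex also splits contractions such as "wouldn't" into two pieces, which is one reason a real tokenizer is usually preferable.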


### Step 3: Convert Text Data to Numerical Format (Vectorization)

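We’ll first split the data into training and test sets, then use `CountVectorizer` to convert the processed text into word-count vectors the model can train on. The vectorizer is fitted on the training set only, so the test set is transformed with the training vocabulary.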

In [8]:
# Split the data into training and test sets
X = data['processed_review']
y = (data['sentiment'] == 'positive').astype(int)  # Encode labels (1 for positive, 0 for negative)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Convert text data to feature vectors using CountVectorizer
vectorizer = CountVectorizer(max_features=1000)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Display the feature vectors shape
print(f"Feature Vectors Shape: {X_train_vec.shape}")

Feature Vectors Shape: (8, 28)
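To make the shape above less abstract, here is a rough pure-Python sketch of the bag-of-words idea behind `CountVectorizer`: collect a sorted vocabulary, then count how often each vocabulary word occurs in each document. The two documents and the helper names (`tokenize`, `to_counts`) are illustrative assumptions, not part of the notebook. The regex is the default token pattern of `CountVectorizer`, which keeps only words of two or more characters and so ignores punctuation.

```python
import re
from collections import Counter

# Two processed reviews from the table above
docs = ["movie fantastic ! loved .", "hated every minute film ."]

# Default CountVectorizer token pattern: words of 2+ characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(doc):
    return TOKEN_PATTERN.findall(doc.lower())

# Vocabulary: each unique token mapped to a column index, in sorted order
all_tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
vocabulary = {tok: idx for idx, tok in enumerate(all_tokens)}

def to_counts(doc):
    # One row of the document-term matrix: a count per vocabulary word
    counts = Counter(tokenize(doc))
    return [counts.get(tok, 0) for tok in all_tokens]

matrix = [to_counts(doc) for doc in docs]
print(all_tokens)  # 7 vocabulary words
print(matrix)      # 2 rows x 7 columns
```

In the notebook, the same construction over the eight training reviews yields 28 columns, one per vocabulary word.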


### Step 4: Train a Sentiment Analysis Model

Now we will train a multinomial Naive Bayes classifier on the training vectors and evaluate it on the two held-out test reviews.
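As a rough illustration of what a multinomial Naive Bayes classifier computes, the sketch below scores a document for each class as the log prior plus the sum of smoothed log word likelihoods, then picks the highest score. It is a simplified pure-Python version, not scikit-learn's implementation; the toy training data and the names `log_score` and `predict` are assumptions made for the example. The smoothing value `alpha=1.0` matches the default of `MultinomialNB`.

```python
import math
from collections import Counter

# Toy training data (hypothetical): token lists with labels 1 = positive, 0 = negative
train = [
    (["movie", "fantastic", "loved"], 1),
    (["brilliant", "captivating", "story"], 1),
    (["hated", "every", "minute", "film"], 0),
    (["boring", "fell", "asleep"], 0),
]

vocab = sorted({tok for tokens, _ in train for tok in tokens})
classes = sorted({label for _, label in train})

# Per-class word counts and per-class document counts
word_counts = {c: Counter() for c in classes}
doc_counts = Counter()
for tokens, label in train:
    word_counts[label].update(tokens)
    doc_counts[label] += 1

def log_score(tokens, c, alpha=1.0):
    # log prior + sum of log likelihoods with add-alpha (Laplace) smoothing
    total = sum(word_counts[c].values())
    score = math.log(doc_counts[c] / len(train))
    for tok in tokens:
        if tok in vocab:  # words outside the vocabulary contribute nothing
            score += math.log((word_counts[c][tok] + alpha) / (total + alpha * len(vocab)))
    return score

def predict(tokens):
    # max() returns the first class on a tie, here class 0
    return max(classes, key=lambda c: log_score(tokens, c))

print(predict(["loved", "story"]))   # prints: 1
print(predict(["boring", "film"]))   # prints: 0
```

One consequence of this scoring is that a document made up entirely of unseen words is decided by the class priors alone.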

In [9]:
# Train a Naive Bayes classifier
model = MultinomialNB()
model.fit(X_train_vec, y_train)

# Make predictions on the test data
y_pred = model.predict(X_test_vec)

# Evaluate the model
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.50      1.00      0.67         1
           1       0.00      0.00      0.00         1

    accuracy                           0.50         2
   macro avg       0.25      0.50      0.33         2
weighted avg       0.25      0.50      0.33         2
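The numbers in the report above can be reproduced by hand. The report implies that the test set holds one review per class and that both were predicted as class 0, so class 1 is never predicted and its precision is undefined (0/0), which the report shows as 0.00. The sketch below recomputes precision and recall under that reading; the order of `y_true` and the helper name `precision_recall` are assumptions.

```python
# Labels consistent with the report: one review per class, both predicted as 0
y_true = [1, 0]
y_pred = [0, 0]

def precision_recall(y_true, y_pred, label):
    # True positives: samples of this class that were predicted as this class
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    predicted = sum(1 for p in y_pred if p == label)  # times the class was predicted
    actual = sum(1 for t in y_true if t == label)     # times the class really occurs
    # A class that is never predicted has precision 0/0; report it as 0.0
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall

for label in (0, 1):
    p, r = precision_recall(y_true, y_pred, label)
    print(f"class {label}: precision={p:.2f} recall={r:.2f}")
# prints:
# class 0: precision=0.50 recall=1.00
# class 1: precision=0.00 recall=0.00
```

With only two test reviews, each prediction moves accuracy by 50 percentage points, so these scores say very little about how the model would behave on more data.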





### Step 5: Test the Model with New Reviews

You can now test the model with a new review to see whether it predicts the sentiment correctly. With only eight training reviews, expect mistakes on wording the model has not seen.

In [10]:
# Test the model with a new review
new_review = ["The plot was amazing and the acting was superb!"]
new_review_processed = [preprocess_text(review) for review in new_review]
new_review_vec = vectorizer.transform(new_review_processed)

# Predict sentiment
prediction = model.predict(new_review_vec)
sentiment = "positive" if prediction[0] == 1 else "negative"
print(f"Sentiment: {sentiment}")

Sentiment: negative
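The model labels this clearly positive review as negative. A likely reason is that its sentiment-bearing words never occur in the tiny dataset, so the classifier has no evidence to work with. The sketch below checks this with a simplified tokenizer (the default `CountVectorizer` token pattern, without stopword removal or lemmatization). It builds the vocabulary from all ten reviews, which is a superset of the actual training vocabulary, so any word missing here is also missing from the trained model.

```python
import re

reviews = [
    "This movie was fantastic! I loved it.",
    "The film was a complete waste of time.",
    "What a brilliant and captivating story.",
    "I hated every minute of this film.",
    "Absolutely wonderful! A masterpiece.",
    "It was so boring, I fell asleep halfway.",
    "A delightful movie that brought tears to my eyes.",
    "Terrible. I wouldn't recommend this to anyone.",
    "I thoroughly enjoyed the plot and the characters.",
    "This is the worst movie I have ever seen.",
]
new_review = "The plot was amazing and the acting was superb!"

def tokenize(text):
    # Words of 2+ characters, lowercased
    return re.findall(r"\b\w\w+\b", text.lower())

# Every word that appears anywhere in the ten reviews
known = {tok for review in reviews for tok in tokenize(review)}

# Words of the new review that the dataset has never seen
unseen = [tok for tok in tokenize(new_review) if tok not in known]
print(unseen)
# prints: ['amazing', 'acting', 'superb']
```

The words "amazing", "acting", and "superb" are all out of vocabulary, so `vectorizer.transform` drops them and the prediction rests on whatever little remains. A larger and more varied training set is the usual remedy.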


### Conclusion:

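In this simplified notebook, we manually created a dataset of ten movie reviews, preprocessed the text, and trained a Naive Bayes classifier to label each review as positive or negative. With only eight training reviews, the model reached 0.50 accuracy on the two test reviews and misclassified the new positive review, so treat it as a walkthrough of the workflow rather than a usable classifier. You can now experiment with larger datasets, different preprocessing techniques, and other models.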

---