# Sentiment Analysis on IMDB Dataset

## Main Objective of the Analysis
The primary objective of this analysis is to develop a deep learning model for sentiment analysis on the IMDB Dataset. The project focuses on leveraging Convolutional Neural Networks (CNN) or Recurrent Neural Networks (RNN), specifically Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU) networks, to accurately classify movie reviews as positive or negative. The benefits of this analysis include enhancing automated content moderation, improving user experience through personalized recommendations, and providing insights into public opinion on films.

## Dataset Description
The chosen dataset is the IMDB movie reviews dataset, which contains 50,000 reviews labeled as either positive or negative. The dataset is split evenly into training and test sets, with 25,000 reviews each. The goal of this analysis is to train a model that can accurately predict the sentiment of a given review based on the text content.

<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>review</th>
      <th>sentiment</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>One of the other reviewers has mentioned that ...</td>
      <td>positive</td>
    </tr>
    <tr>
      <th>1</th>
      <td>A wonderful little production. &lt;br /&gt;&lt;br /&gt;The...</td>
      <td>positive</td>
    </tr>
    <tr>
      <th>2</th>
      <td>I thought this was a wonderful way to spend ti...</td>
      <td>positive</td>
    </tr>
    <tr>
      <th>3</th>
      <td>Basically there's a family where a little boy ...</td>
      <td>negative</td>
    </tr>
    <tr>
      <th>4</th>
      <td>Petter Mattei's "Love in the Time of Money" is...</td>
      <td>positive</td>
    </tr>
    <tr>
      <th>...</th>
      <td>...</td>
      <td>...</td>
    </tr>
    <tr>
      <th>49995</th>
      <td>I thought this movie did a down right good job...</td>
      <td>positive</td>
    </tr>
    <tr>
      <th>49996</th>
      <td>Bad plot, bad dialogue, bad acting, idiotic di...</td>
      <td>negative</td>
    </tr>
    <tr>
      <th>49997</th>
      <td>I am a Catholic taught in parochial elementary...</td>
      <td>negative</td>
    </tr>
    <tr>
      <th>49998</th>
      <td>I'm going to have to disagree with the previou...</td>
      <td>negative</td>
    </tr>
    <tr>
      <th>49999</th>
      <td>No one expects the Star Trek movies to be high...</td>
      <td>negative</td>
    </tr>
  </tbody>
</table>
<p>50000 rows × 2 columns</p>
</div>

## Data Exploration and Feature Engineering
Initial data exploration involves understanding the distribution of positive and negative reviews, examining the length of the reviews, and identifying any potential biases. Feature engineering then consists of tokenizing the text, converting words to vectors with word embeddings (e.g., Word2Vec or GloVe), and padding sequences so that every input to the deep learning model has the same length.

* The data does not have any null values
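
The exploration checks described above can be sketched with pandas. This is a minimal illustration rather than the notebook's actual code: the four-row DataFrame and its review texts are made up for the example, and only the column names `review` and `sentiment` come from the dataset preview.

```python
import pandas as pd

# Made-up stand-in for the real data; the actual DataFrame has 50,000 rows
# with the same two columns ("review", "sentiment").
df = pd.DataFrame({
    "review": [
        "Great film, loved every minute.",
        "Terrible pacing and weak acting.",
        "A moving story with a strong cast.",
        "Dull.",
    ],
    "sentiment": ["positive", "negative", "positive", "negative"],
})

# Class balance: number of reviews per sentiment label
class_counts = df["sentiment"].value_counts()

# Review length in whitespace-separated words, useful for choosing a padding length
df["n_words"] = df["review"].str.split().str.len()
length_stats = df["n_words"].describe()

# Missing values per column (all zeros means there are no nulls)
null_counts = df.isnull().sum()

print(class_counts, length_stats, null_counts, sep="\n\n")
```

On the full dataset, the length statistics are one way to judge whether a padding length such as 200 covers most reviews.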



#### Using Keras Embedding Layer

**Tokenizing Text Data:** Keras’ `Tokenizer` splits the text into tokens (words) and maps each review to a sequence of integers in two steps.

The `fit_on_texts` method builds the vocabulary by fitting the tokenizer on the text data.

The `texts_to_sequences` method converts each review into a sequence of integers, where each integer represents a word in the vocabulary. With `num_words=5000`, only the most frequent words are kept in these sequences.

```python
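# Import path assumes TensorFlow 2.x with its bundled Keras
from tensorflow.keras.preprocessing.text import Tokenizer

# `reviews` is assumed to hold the raw review strings (e.g. the "review" column)
tokenizer = Tokenizer(num_words=5000)  # sequences use only the most frequent words
tokenizer.fit_on_texts(reviews)        # build the vocabulary
sequences = tokenizer.texts_to_sequences(reviews)  # one list of ints per review

# word_index maps every word seen during fitting to its integer index
# (it is not truncated to num_words)
word_index = tokenizer.word_index
word_index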
```

![image.png](attachment:image.png)

The `pad_sequences` function pads shorter sequences and truncates longer ones so that they all have the same length, which is required for feeding the data into a deep learning model in batches. In this example, every sequence is brought to a length of 200 tokens.

![image.png](attachment:image.png)
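
To make the effect of padding concrete, here is a minimal NumPy sketch. It assumes Keras' default behaviour of padding and truncating at the start of each sequence (`padding='pre'`, `truncating='pre'`). The helper `pad_or_truncate` and the toy sequences are hypothetical; in practice the Keras call would be `pad_sequences(sequences, maxlen=200)`.

```python
import numpy as np

def pad_or_truncate(sequences, maxlen, value=0):
    """Left-pad short sequences with `value`; keep the last `maxlen` tokens of long ones."""
    padded = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        if not seq:
            continue  # an empty review stays all padding
        trimmed = seq[-maxlen:]                 # drop tokens from the front if too long
        padded[row, -len(trimmed):] = trimmed   # right-align, padding on the left
    return padded

# Toy integer sequences standing in for the tokenizer output
toy_sequences = [[5, 8, 2], [7], [1, 2, 3, 4, 5, 6]]
padded = pad_or_truncate(toy_sequences, maxlen=4)
print(padded)
```

The index 0 is used as the padding value because the Keras tokenizer starts its word indices at 1.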

**Using Pre-trained Word Embeddings:** Pre-trained embeddings like GloVe or Word2Vec provide dense vector representations for words based on their semantic meanings. Using these embeddings helps the model understand the context and relationships between words better.
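
One common way to use pre-trained vectors is to build an embedding matrix whose row `i` holds the vector of the word with index `i`. The sketch below assumes the vectors have already been loaded into a dictionary; `pretrained_vectors`, its values, and the three-word `word_index` are hypothetical stand-ins for a real GloVe file and the tokenizer's vocabulary.

```python
import numpy as np

embedding_dim = 4  # toy size for the example; real pre-trained vectors are much larger

# Hypothetical pre-trained vectors (a real run would parse these from a GloVe file)
pretrained_vectors = {
    "good": np.array([0.1, 0.3, -0.2, 0.5], dtype="float32"),
    "bad": np.array([-0.4, 0.2, 0.1, -0.3], dtype="float32"),
}

# Hypothetical slice of tokenizer.word_index (indices start at 1; 0 is padding)
word_index = {"good": 1, "bad": 2, "zombie": 3}

# Row 0 is reserved for padding, so the matrix needs len(word_index) + 1 rows
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim), dtype="float32")
for word, idx in word_index.items():
    vector = pretrained_vectors.get(word)
    if vector is not None:
        embedding_matrix[idx] = vector  # words without a vector stay all zeros

print(embedding_matrix)
```

Such a matrix can then initialise a Keras `Embedding` layer, for example through `embeddings_initializer=keras.initializers.Constant(embedding_matrix)`, optionally with `trainable=False` to keep the vectors fixed.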

**Training Your Own Word2Vec Model:** If pre-trained embeddings do not suit your specific dataset, you can train a Word2Vec model on your own data. This approach can capture the specific nuances and patterns in your text data that pre-trained embeddings might miss.

## Model Training
**Baseline CNN:** The first model is a basic CNN that applies convolutional layers to the embedded word vectors, followed by pooling layers and fully connected layers, ending in a sigmoid output for binary classification.

**LSTM Network:** The second model uses an LSTM network, which is well-suited to capturing long-term dependencies in the review text. It stacks multiple LSTM layers, followed by fully connected layers and a sigmoid output.

**GRU Network with Hyperparameter Tuning:** The final model is a GRU network whose learning rate, batch size, and number of GRU units are tuned to optimize performance.

### **Baseline CNN**

**Model Summary:**
- **Architecture:** 
  - Convolutional layers applied to embedded word vectors
  - Followed by max-pooling layers
  - Fully connected layers leading to a sigmoid output for binary classification
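
The architecture above could look roughly like this in Keras. The vocabulary size (5000) and sequence length (200) follow the tokenizer and padding settings used earlier; the embedding size, filter counts, kernel sizes, and dense width are assumptions chosen for illustration, not the notebook's actual values.

```python
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE = 5000  # matches Tokenizer(num_words=5000)
MAX_LEN = 200      # matches the padded sequence length
EMBED_DIM = 64     # assumed embedding size

cnn_model = keras.Sequential([
    keras.Input(shape=(MAX_LEN,)),
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),   # word indices -> dense vectors
    layers.Conv1D(128, 5, activation="relu"),  # convolution over word windows
    layers.MaxPooling1D(pool_size=2),
    layers.Conv1D(128, 5, activation="relu"),
    layers.GlobalMaxPooling1D(),               # collapse the time dimension
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),     # probability of a positive review
])
cnn_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
cnn_model.summary()
```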

### **LSTM Network**

**Model Summary:**
- **Architecture:**
  - Multiple LSTM layers to capture long-term dependencies in the text
  - Followed by fully connected layers and a sigmoid output
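
A minimal Keras sketch of a stacked LSTM classifier along these lines is shown below. The number of layers and their sizes are assumptions for illustration only; the vocabulary size and sequence length reuse the earlier tokenizer and padding settings.

```python
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE = 5000  # matches Tokenizer(num_words=5000)
MAX_LEN = 200      # matches the padded sequence length

lstm_model = keras.Sequential([
    keras.Input(shape=(MAX_LEN,)),
    layers.Embedding(VOCAB_SIZE, 64),
    # return_sequences=True passes the full sequence on to the next LSTM layer
    layers.LSTM(64, return_sequences=True),
    layers.LSTM(32),                        # last LSTM returns only its final state
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
lstm_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
lstm_model.summary()
```

Every LSTM layer except the last needs `return_sequences=True`, otherwise the next recurrent layer would receive a single vector instead of a sequence.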


### **GRU Network with Hyperparameter Tuning**

**Model Summary:**
- **Architecture:**
  - GRU layers with hyperparameters tuned for optimal performance
  - Adjustments made to learning rate, batch size, and the number of GRU units
  - Followed by fully connected layers and a sigmoid output
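
One simple way to tune the three hyperparameters named above is a small grid search. The sketch below is an assumption about how such a search might be set up, not the notebook's procedure: the helper `build_gru_model`, the candidate values, and the layer sizes are hypothetical, and the data is random placeholder input so that the loop runs end to end.

```python
import itertools

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE, MAX_LEN = 5000, 200

def build_gru_model(units, learning_rate):
    """Build and compile a GRU classifier for one hyperparameter setting."""
    model = keras.Sequential([
        keras.Input(shape=(MAX_LEN,)),
        layers.Embedding(VOCAB_SIZE, 32),
        layers.GRU(units),
        layers.Dense(16, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model

# Random placeholder data; replace with the padded reviews and their 0/1 labels
rng = np.random.default_rng(0)
x = rng.integers(1, VOCAB_SIZE, size=(64, MAX_LEN))
y = rng.integers(0, 2, size=(64,)).astype("float32")

results = []
for units, lr, batch_size in itertools.product([16, 32], [1e-3, 1e-2], [16, 32]):
    model = build_gru_model(units, lr)
    history = model.fit(x, y, batch_size=batch_size, epochs=1,
                        validation_split=0.25, verbose=0)
    results.append({
        "units": units,
        "learning_rate": lr,
        "batch_size": batch_size,
        "val_accuracy": history.history["val_accuracy"][-1],
    })

# Pick the setting with the highest validation accuracy
best = max(results, key=lambda r: r["val_accuracy"])
print(best)
```

On random data the selected setting is meaningless; the point is only the structure of the search loop.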



| Model Name                  | Training Accuracy | Validation Accuracy | Training Loss | Validation Loss |
|-----------------------------|--------------------|---------------------|---------------|-----------------|
| Baseline CNN                | 94.8%              | 89.5%               | 0.12          | 0.28            |
| **LSTM Network**            | **94.7%**          | **95.0%**           | **0.13**      | **0.18**        |
| GRU Network with Hyperparameter Tuning | 93.5%              | 92.2%               | 0.13          | 0.19            |
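
The figures in the table can be compared with a few lines of Python. This small sketch copies the reported values, computes the gap between training and validation accuracy for each model, and selects the model with the highest validation accuracy; the dictionary layout and variable names are chosen for the example.

```python
# Figures copied from the results table (accuracies in percent)
results = {
    "Baseline CNN": {"train_acc": 94.8, "val_acc": 89.5, "train_loss": 0.12, "val_loss": 0.28},
    "LSTM Network": {"train_acc": 94.7, "val_acc": 95.0, "train_loss": 0.13, "val_loss": 0.18},
    "GRU Network with Hyperparameter Tuning": {
        "train_acc": 93.5, "val_acc": 92.2, "train_loss": 0.13, "val_loss": 0.19,
    },
}

# Training minus validation accuracy; a large positive gap hints at overfitting
gaps = {name: round(m["train_acc"] - m["val_acc"], 1) for name, m in results.items()}

# Model with the highest validation accuracy
best_model = max(results, key=lambda name: results[name]["val_acc"])

print(gaps)
print(best_model)
```

The baseline CNN shows the largest gap (5.3 percentage points), while the LSTM has the highest validation accuracy.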


## Model Recommendation
Among the three models, the LSTM network is recommended. It achieved the highest validation accuracy (95.0%) and the lowest validation loss (0.18) in the comparison table, which is consistent with its ability to capture the contextual meaning of the reviews. It offers a reasonable balance between model complexity and accuracy, making it well-suited for sentiment analysis tasks.

## Key Findings and Insights
**1:** The baseline CNN model performs well (89.5% validation accuracy) but may miss sequential information in the text. The gap to its training accuracy (94.8%) also suggests some overfitting.

**2:** The LSTM model excels at capturing the order of words and achieves the best validation accuracy (95.0%).

**3:** The GRU model, which is typically faster to train than an LSTM, comes close in performance (92.2% validation accuracy, 0.19 validation loss), making it a viable alternative in time-constrained environments.

**Overall:** The deep learning models effectively capture sentiment from movie reviews, providing a valuable tool for automated sentiment analysis in applications such as review aggregation sites or content moderation systems.

## Suggestions for Next Steps
Future work could explore more advanced architectures, such as Bidirectional LSTMs or Transformer-based models, to further improve sentiment classification accuracy. Incorporating additional datasets or fine-tuning pre-trained language models like BERT could also enhance model performance.
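
As a starting point for the first suggestion, a Bidirectional LSTM can be sketched in Keras by wrapping the recurrent layer. The layer sizes below are assumptions for illustration, and this model has not been trained or evaluated as part of this analysis.

```python
from tensorflow import keras
from tensorflow.keras import layers

bilstm_model = keras.Sequential([
    keras.Input(shape=(200,)),              # padded sequence length used earlier
    layers.Embedding(5000, 64),             # vocabulary size used earlier
    layers.Bidirectional(layers.LSTM(64)),  # reads each review in both directions
    layers.Dense(1, activation="sigmoid"),
])
bilstm_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
bilstm_model.summary()
```

The wrapper runs one LSTM forward and one backward over the review and concatenates their outputs, so the dense layer sees context from both ends of the text.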