
# Task
Build, train, and evaluate a combined CNN and RNN (RCNN/C-LSTM) model for sentiment analysis using the `Tweets.csv` dataset.

## Load and Inspect Data

### Subtask:
Load the `Tweets.csv` dataset into a pandas DataFrame and perform an initial inspection to understand its structure, columns, and identify the text and target variables.


**Reasoning**:
I will load the `Tweets.csv` dataset into a pandas DataFrame, display its head, information, and descriptive statistics to understand its structure, columns, and variable types, which is essential for identifying text and target variables.
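
Beyond `head()`, `info()`, and `describe()`, two quick checks are often useful before modelling: the class balance of the target and the share of missing values per column. The sketch below is only an illustration. It uses a toy frame, hypothetically named `sample`, that mirrors two columns of the first five rows printed in this section; on the real data the same calls would run on `df`.

```python
import pandas as pd

# Toy frame mirroring the first five rows shown in this section.
# `sample` is a hypothetical name; replace it with the loaded `df`.
sample = pd.DataFrame({
    "airline_sentiment": ["neutral", "positive", "neutral", "negative", "negative"],
    "negativereason": [None, None, None, "Bad Flight", "Can't Tell"],
})

# Number of rows per sentiment class (the target variable)
class_counts = sample["airline_sentiment"].value_counts()

# Fraction of missing values in each column
missing_share = sample.isna().mean()

print(class_counts)
print(missing_share)
```

A strongly skewed `class_counts` would suggest reporting per-class metrics in addition to accuracy.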



In [1]:
import pandas as pd

# Load the Tweets.csv file into a pandas DataFrame named df
df = pd.read_csv('/content/Tweets.csv')

print("First 5 rows of the DataFrame:")
# Display the first 5 rows of the DataFrame
print(df.head())

print("\nDataFrame Info:")
# Print a summary of the DataFrame
df.info()

print("\nDescriptive statistics of the DataFrame:")
# Generate descriptive statistics of the numerical columns
print(df.describe())

First 5 rows of the DataFrame:
             tweet_id airline_sentiment  airline_sentiment_confidence  \
0  570306133677760513           neutral                        1.0000   
1  570301130888122368          positive                        0.3486   
2  570301083672813571           neutral                        0.6837   
3  570301031407624196          negative                        1.0000   
4  570300817074462722          negative                        1.0000   

  negativereason  negativereason_confidence         airline  \
0            NaN                        NaN  Virgin America   
1            NaN                     0.0000  Virgin America   
2            NaN                        NaN  Virgin America   
3     Bad Flight                     0.7033  Virgin America   
4     Can't Tell                     1.0000  Virgin America   

  airline_sentiment_gold        name negativereason_gold  retweet_count  \
0                    NaN     cairdin                 NaN              0   
1

## Text Preprocessing

### Subtask:
Clean and preprocess the text data from the tweets, including tokenization, converting text to sequences, and padding sequences to a uniform length, and encode the target variable.


**Reasoning**:
I will extract the text and target variables, encode the target variable using `LabelEncoder`, and then preprocess the text data by tokenizing, converting to sequences, and padding them to a uniform length using `Tokenizer` and `pad_sequences` from Keras, following the instructions provided.
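
To make the three preprocessing steps concrete, here is a simplified pure-Python sketch of what label encoding, tokenization, and post-padding do conceptually. It is not the Keras or scikit-learn implementation: the helper names (`tokenize`, `build_word_index`, `to_sequences`, `pad_post`) and the regex-based tokenizer are assumptions made for illustration only.

```python
import re
from collections import Counter

def tokenize(text):
    # Simplified tokenizer (assumption): lowercase, keep letters, digits, apostrophes
    return re.findall(r"[a-z0-9']+", text.lower())

def build_word_index(texts, oov_token="<unk>"):
    # Index 0 is left for padding, index 1 for out-of-vocabulary words,
    # then words are ranked from most to least frequent.
    counts = Counter(word for text in texts for word in tokenize(text))
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def to_sequences(texts, word_index, num_words):
    # Unknown words and words ranked at or beyond num_words map to the OOV index 1
    sequences = []
    for text in texts:
        ids = [word_index.get(word, 1) for word in tokenize(text)]
        sequences.append([i if i < num_words else 1 for i in ids])
    return sequences

def pad_post(sequences, maxlen):
    # Append zeros to short sequences; keep the last maxlen tokens of long ones
    return [seq[-maxlen:] + [0] * (maxlen - len(seq[-maxlen:])) for seq in sequences]

texts = ["the flight was late", "the flight was great", "late again"]
word_index = build_word_index(texts)
sequences = to_sequences(texts, word_index, num_words=6)
padded = pad_post(sequences, maxlen=max(len(s) for s in sequences))

# Label encoding: classes sorted alphabetically, then mapped to 0..n-1
labels = ["neutral", "positive", "neutral", "negative", "negative"]
classes = sorted(set(labels))
encoded = [classes.index(label) for label in labels]

print(padded)
print(classes, encoded)
```

Padding every tweet to the longest sequence keeps all tokens; a shorter cap, such as a high percentile of the length distribution, is a common alternative when a few outliers are very long.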



In [2]:
import numpy as np
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# 1. Extract the 'text' column for the input features and the 'airline_sentiment' column for the target variable
X = df['text']
y = df['airline_sentiment']

print(f"Original X shape: {X.shape}")
print(f"Original y shape: {y.shape}")

# 2. Initialize a LabelEncoder and fit it to the 'airline_sentiment' column
label_encoder = LabelEncoder()
encoded_labels = label_encoder.fit_transform(y)

print(f"Encoded labels shape: {encoded_labels.shape}")
print(f"Original labels: {y.head().tolist()}")
print(f"Encoded labels: {encoded_labels[:5].tolist()}")
print(f"Classes: {label_encoder.classes_}")

# 3. Initialize a Tokenizer
num_words = 10000 # You can adjust this number based on your vocabulary size
oov_token = "<unk>"
tokenizer = Tokenizer(num_words=num_words, oov_token=oov_token)

# 4. Fit the tokenizer on the extracted text data
tokenizer.fit_on_texts(X)

print(f"\nTokenizer vocabulary size: {len(tokenizer.word_index)}")
print("Top 10 words in vocabulary:", list(tokenizer.word_index.items())[:10])

# 5. Convert the text data into sequences of integers
text_sequences = tokenizer.texts_to_sequences(X)

print(f"\nFirst 5 original texts: {X.head().tolist()}")
print(f"First 5 text sequences: {text_sequences[:5]}")

# 6. Determine a suitable maximum sequence length for padding
max_sequence_length = max(len(seq) for seq in text_sequences)
print(f"\nMaximum sequence length: {max_sequence_length}")

# 7. Pad the text_sequences
padded_sequences = pad_sequences(text_sequences, maxlen=max_sequence_length, padding='post')

print(f"\nPadded sequences shape: {padded_sequences.shape}")
print(f"First padded sequence: {padded_sequences[0].tolist()}")

Original X shape: (14640,)
Original y shape: (14640,)
Encoded labels shape: (14640,)
Original labels: ['neutral', 'positive', 'neutral', 'negative', 'negative']
Encoded labels: [1, 2, 1, 0, 0]
Classes: ['negative' 'neutral' 'positive']

Tokenizer vocabulary size: 15769
Top 10 words in vocabulary: [('<unk>', 1), ('to', 2), ('the', 3), ('i', 4), ('a', 5), ('united', 6), ('you', 7), ('for', 8), ('flight', 9), ('on', 10)]

First 5 original texts: ['@VirginAmerica What @dhepburn said.', "@VirginAmerica plus you've added commercials to the experience... tacky.", "@VirginAmerica I didn't today... Must mean I need to take another trip!", '@VirginAmerica it\'s really aggressive to blast obnoxious "entertainment" in your guests\' faces &amp; they have little recourse', "@VirginAmerica and it's a really big bad thing about it"]
First 5 text sequences: [[82, 63, 6687, 227], [82, 559, 591, 1160, 2537, 2, 3, 202, 6688], [82, 4, 208, 103, 806, 592, 4, 77, 2, 157, 151, 194], [82, 90, 137, 3793, 2, 470

## Build Text Embedding Layer

### Subtask:
Create an embedding layer for the preprocessed text data. This layer will convert numerical sequences into dense vector representations that the CNN and RNN layers can process effectively.


**Reasoning**:
I will import the `Embedding` layer, define the `embedding_dim`, and instantiate the embedding layer using the previously calculated `num_words` and `max_sequence_length`, as per the instructions.
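
Conceptually, an embedding layer is a trainable lookup table: each integer token id selects one row of a `(vocabulary size, embedding dimension)` matrix. The NumPy sketch below illustrates only that lookup, with tiny made-up sizes and random weights; it is not the Keras layer itself, and the names `embedding_matrix` and `padded` are hypothetical.

```python
import numpy as np

vocab_size, embedding_dim = 10, 4  # tiny illustrative sizes, not the notebook's values
rng = np.random.default_rng(0)

# One row of weights per token id (random here; learned during training in practice)
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

# Two padded sequences of token ids; 0 is the padding id
padded = np.array([[2, 3, 4, 5],
                   [5, 1, 0, 0]])

# Integer indexing performs the lookup for every position at once
embedded = embedding_matrix[padded]

print(embedded.shape)  # (batch, sequence length, embedding_dim)
```

Note that the padding id 0 also receives a vector (row 0 of the matrix) unless masking is configured.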



In [3]:
from tensorflow.keras.layers import Embedding

# 1. Define the embedding dimension
embedding_dim = 100  # You can experiment with different dimensions

# 2. Create the Embedding layer
# The input_dim is num_words (vocabulary size)
# The output_dim is embedding_dim
# The input_length is max_sequence_length (length of padded sequences)
embedding_layer = Embedding(input_dim=num_words,
                            output_dim=embedding_dim,
                            input_length=max_sequence_length,
                            name="embedding_layer")

print(f"Embedding layer created with input_dim={num_words}, output_dim={embedding_dim}, input_length={max_sequence_length}")
print(f"Embedding layer output shape: (None, {max_sequence_length}, {embedding_dim})")

Embedding layer created with input_dim=10000, output_dim=100, input_length=36
Embedding layer output shape: (None, 36, 100)




**Reasoning**:
The previous code generated a UserWarning because the `input_length` argument for the `Embedding` layer is deprecated. I will remove this argument as suggested by the warning.



In [4]:
from tensorflow.keras.layers import Embedding

# 1. Define the embedding dimension
embedding_dim = 100  # You can experiment with different dimensions

# 2. Create the Embedding layer
# The input_dim is num_words (vocabulary size)
# The output_dim is embedding_dim
embedding_layer = Embedding(input_dim=num_words,
                            output_dim=embedding_dim,
                            name="embedding_layer")

print(f"Embedding layer created with input_dim={num_words}, output_dim={embedding_dim}")
print(f"Embedding layer output shape: (None, None, {embedding_dim})")

Embedding layer created with input_dim=10000, output_dim=100
Embedding layer output shape: (None, None, 100)


## Define RCNN/C-LSTM Model Architecture

### Subtask:
Construct the combined CNN and RNN (LSTM) model architecture, defining convolutional layers, pooling layers, LSTM layers, and combining them in an RCNN or C-LSTM fashion.


**Reasoning**:
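I will import the necessary Keras layers and the `Model` class, define the input layer, apply the embedding, and build two parallel branches over the embedded sequence: a CNN branch (`Conv1D` followed by global max pooling) and an LSTM branch. Their outputs are concatenated and passed through dense layers to produce class probabilities. Note that this is a parallel-branch hybrid; in the C-LSTM design, by contrast, the convolutional features are fed into the LSTM sequentially.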



In [5]:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Conv1D, GlobalMaxPooling1D, LSTM, Dense, concatenate

# Get the number of unique classes for the output layer
num_classes = len(label_encoder.classes_)

# 1. Define the input layer
# The input shape is max_sequence_length because padded_sequences has this length
input_layer = Input(shape=(max_sequence_length,), name="input_layer")

# 2. Apply the previously created embedding_layer to the input
embedded_sequences = embedding_layer(input_layer)

# 3. Create a convolutional branch (CNN)
conv_branch = Conv1D(filters=128, kernel_size=5, activation='relu', name="conv_layer")(embedded_sequences)
conv_branch = GlobalMaxPooling1D(name="global_max_pooling_layer")(conv_branch)

# 4. Create an RNN branch (LSTM)
rnn_branch = LSTM(128, name="lstm_layer")(embedded_sequences)

# 5. Concatenate the outputs of the CNN branch and the RNN branch
merged_output = concatenate([conv_branch, rnn_branch], name="concatenate_layer")

# 6. Add one or more Dense layers for feature learning
dense_layer = Dense(64, activation='relu', name="dense_layer_1")(merged_output)

# 7. Add the final output Dense layer
output_layer = Dense(num_classes, activation='softmax', name="output_layer")(dense_layer)

# 8. Instantiate the Model
model = Model(inputs=input_layer, outputs=output_layer, name="rcnn_clstm_model")

# Print the model summary to verify the architecture
model.summary()
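
The summary output is not reproduced here, so the shapes and parameter counts implied by the code above can be worked out by hand. The following is a minimal sketch of that arithmetic, assuming Keras defaults for `Conv1D` (`padding='valid'`, stride 1) and a standard `LSTM` with one bias vector per gate; the variable names are illustrative.

```python
# Hyperparameters taken from the cells above
max_len, vocab, emb_dim = 36, 10000, 100
filters, kernel, lstm_units, dense_units, n_classes = 128, 5, 128, 64, 3

# 'valid' convolution shortens the sequence by kernel - 1 steps
conv_steps = max_len - kernel + 1

# Global max pooling keeps one value per filter; the LSTM returns its last state
merged_dim = filters + lstm_units

params = {
    "embedding": vocab * emb_dim,
    "conv": kernel * emb_dim * filters + filters,
    # four gates, each with input weights, recurrent weights, and a bias
    "lstm": 4 * ((emb_dim + lstm_units) * lstm_units + lstm_units),
    "dense": merged_dim * dense_units + dense_units,
    "output": dense_units * n_classes + n_classes,
}
total_params = sum(params.values())

print(f"Conv1D output: (None, {conv_steps}, {filters})")
print(f"Concatenated features: (None, {merged_dim})")
for name, count in params.items():
    print(f"{name:>9}: {count:,}")
print(f"    total: {total_params:,}")
```

The embedding table accounts for most of the parameters, which is typical for small text models with a 10,000-word vocabulary.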


**Reasoning**:
The model architecture has been defined. The next step is to compile the model by specifying the optimizer, loss function, and metrics, which is crucial before training the model.



In [6]:
from tensorflow.keras.optimizers import Adam

# 9. Compile the model
# Use Adam optimizer and sparse_categorical_crossentropy loss function (since labels are integer encoded)
model.compile(optimizer=Adam(learning_rate=0.001),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

print("Model compiled successfully with Adam optimizer, sparse_categorical_crossentropy loss, and accuracy metric.")

Model compiled successfully with Adam optimizer, sparse_categorical_crossentropy loss, and accuracy metric.
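
The choice of `sparse_categorical_crossentropy` follows from the integer-encoded labels: the loss is the mean negative log-probability assigned to the true class, which equals ordinary categorical cross-entropy on one-hot labels. The NumPy sketch below illustrates that equivalence with made-up probabilities; the function is a hand-written stand-in, not the Keras implementation.

```python
import numpy as np

def sparse_categorical_crossentropy(y_true, probs, eps=1e-7):
    # Pick the predicted probability of the true class for each sample
    probs = np.clip(probs, eps, 1.0)
    picked = probs[np.arange(len(y_true)), y_true]
    return float(-np.log(picked).mean())

# Made-up softmax outputs for two samples over three classes
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])
y_int = np.array([0, 2])  # integer labels, as produced by LabelEncoder

sparse_loss = sparse_categorical_crossentropy(y_int, probs)

# Same loss computed from one-hot labels
one_hot = np.eye(3)[y_int]
dense_loss = float(-(one_hot * np.log(probs)).sum(axis=1).mean())

print(sparse_loss, dense_loss)
```

Using the sparse variant avoids materialising the one-hot matrix, which is why the labels can stay as the integers returned by `LabelEncoder`.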


## Compile and Train Model

### Subtask:
Compile the defined deep learning model with an appropriate optimizer, loss function, and metrics. Then, train the model using the preprocessed text data and corresponding target labels.


**Reasoning**:
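The model was already compiled in the previous cell, so this step covers splitting and training. I will split the preprocessed data into training and test sets (80/20) with `train_test_split`, then train the model with `fit`, passing the test set as `validation_data`. Because the same held-out set is used for per-epoch validation, its metrics should not also be read as an unbiased final test score.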
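
A fixed epoch count leaves open when training should stop. One common approach is patience-based early stopping on the validation loss. The sketch below shows only the decision logic in plain Python, applied to the first five `val_loss` values reported in this section's training log; the function name and the patience of 2 are assumptions, and in Keras the same idea is available through the `EarlyStopping` callback.

```python
def best_epoch_with_patience(val_losses, patience=2):
    # Track the lowest validation loss and stop after `patience`
    # consecutive epochs without improvement.
    best_loss = float("inf")
    best_epoch = 0
    waited = 0
    stopped_at = len(val_losses)
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                stopped_at = epoch
                break
    return best_epoch, stopped_at

# Validation losses for epochs 1 to 5, copied from the training log in this section
val_losses = [0.5000, 0.4831, 0.5417, 0.7226, 0.8510]

best_epoch, stopped_at = best_epoch_with_patience(val_losses, patience=2)
print(f"best epoch: {best_epoch}, training would stop after epoch {stopped_at}")
```

On these values the validation loss stops improving after the second epoch while training accuracy keeps rising, a pattern usually read as overfitting.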



In [7]:
from sklearn.model_selection import train_test_split

# 1. Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(padded_sequences, encoded_labels, test_size=0.2, random_state=42)

print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

# 2. Train the model
epochs = 10
batch_size = 32

history = model.fit(X_train, y_train,
                    epochs=epochs,
                    batch_size=batch_size,
                    validation_data=(X_test, y_test))

print("Model training complete.")

X_train shape: (11712, 36)
X_test shape: (2928, 36)
y_train shape: (11712,)
y_test shape: (2928,)
Epoch 1/10
366/366 ━━━━━━━━━━━━━━━━━━━━ 36s 85ms/step - accuracy: 0.6783 - loss: 0.7602 - val_accuracy: 0.8036 - val_loss: 0.5000
Epoch 2/10
366/366 ━━━━━━━━━━━━━━━━━━━━ 40s 83ms/step - accuracy: 0.8535 - loss: 0.3802 - val_accuracy: 0.8077 - val_loss: 0.4831
Epoch 3/10
366/366 ━━━━━━━━━━━━━━━━━━━━ 31s 86ms/step - accuracy: 0.9274 - loss: 0.2062 - val_accuracy: 0.8016 - val_loss: 0.5417
Epoch 4/10
366/366 ━━━━━━━━━━━━━━━━━━━━ 31s 85ms/step - accuracy: 0.9774 - loss: 0.0901 - val_accuracy: 0.7964 - val_loss: 0.7226
Epoch 5/10
366/366 ━━━━━━━━━━━━━━━━━━━━ 30s 83ms/step - accuracy: 0.9888 - loss: 0.0472 - val_accuracy: 0.7917 - val_loss: 0.8510
Epoch 6/10
366/366 ━━━━━━━━━━━━━━━━━━━━ 31s 84ms