#### Problem statement

Predict the political party from the tweet text and the Twitter handle.

#### Data description
The dataset has three columns: label (party name), Twitter handle, and tweet text.


#### Problem Description:

Design a feedforward deep neural network to predict the political party using PyTorch or TensorFlow.
Build two models:

1. Without using the handle

2. Using the handle


#### Deliverables

- Report the performance on the test set.

- Try multiple models with different hyperparameters. Present the results of each model on the test set. There is no need to create a dev set.

- Experiment with:
    - L2 and dropout regularization techniques
    - SGD, RMSProp, and Adam optimization techniques



- Creating a fixed-size vocabulary: Give a unique id to each word in your selected vocabulary and use these ids as the input to the network.

    - Option 1: Feedforward networks can only handle fixed-size inputs, so you can take a fixed number K of words from the tweet text (e.g. the first K words, or K randomly selected words). K can be a hyperparameter.

    - Option 2: Choose the top N (e.g. N=1000) most frequent words in the dataset and use an N-sized input layer. If a word is present in a tweet, pass its id; otherwise pass 0.
    
    -  Clearly state your design choices and assumptions. Think about the pros and cons of each option.
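
The two input options above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: tokens are split on whitespace, id 0 is reserved for padding and unknown words, and the helper names (`build_vocab`, `encode_first_k`, `encode_top_n`) are hypothetical, not part of the assignment.

```python
from collections import Counter

def build_vocab(tweets, max_size=1000):
    """Map the max_size most frequent whitespace tokens to ids 1..max_size (0 is reserved)."""
    counts = Counter(word for tweet in tweets for word in tweet.split())
    return {word: idx + 1 for idx, (word, _) in enumerate(counts.most_common(max_size))}

def encode_first_k(tweet, vocab, k):
    """Option 1: ids of the first k tokens, right-padded with 0; unknown words also map to 0."""
    ids = [vocab.get(word, 0) for word in tweet.split()[:k]]
    return ids + [0] * (k - len(ids))

def encode_top_n(tweet, vocab):
    """Option 2: length-N vector; slot i holds id i + 1 if that vocabulary word occurs, else 0."""
    present = set(tweet.split())
    vec = [0] * len(vocab)
    for word, idx in vocab.items():
        if word in present:
            vec[idx - 1] = idx
    return vec

# Toy corpus (not the assignment data), chosen so every word has a distinct count
tweets = ["vote vote vote", "the bill the", "bill the vote now"]
vocab = build_vocab(tweets, max_size=3)

print(encode_first_k("pass the bill", vocab, k=5))
print(encode_top_n("pass the bill", vocab))
```

Option 1 keeps word order but truncates long tweets and depends on K. Option 2 sees every in-vocabulary word of the tweet but discards order and repetition, and its input width grows with N.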



<b> Tabulate your results, either at the end of the code file or in the text box on the submission page. The final result should have:</b>

1. Experiment description

2. Hyperparameters used and their values

3. Performance on the test set
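
One possible way to collect the three columns listed above is sketched below. It assumes a caller-supplied `run_experiment` callback that trains a model and returns its test accuracy. That callback, the helper names, and the example hyperparameter values are all hypothetical. The stub used here returns `None`, so no results are implied.

```python
from itertools import product

def tabulate_experiments(run_experiment, handle_options, optimizers, hyperparams):
    """Run every (use_handle, optimizer) combination and collect one row per experiment."""
    rows = []
    for use_handle, opt in product(handle_options, optimizers):
        accuracy = run_experiment(use_handle=use_handle, optimizer=opt, **hyperparams)
        rows.append({
            "experiment": f"{'With' if use_handle else 'Without'} handle ({opt})",
            "hyperparameters": dict(hyperparams, optimizer=opt),
            "test_accuracy": accuracy,
        })
    return rows

def to_markdown(rows):
    """Render the collected rows as a markdown table."""
    lines = ["| Experiment | Hyperparameters | Test accuracy |", "|---|---|---|"]
    for row in rows:
        hp = ", ".join(f"{k}={v}" for k, v in row["hyperparameters"].items())
        acc = "n/a" if row["test_accuracy"] is None else f"{row['test_accuracy']:.4f}"
        lines.append(f"| {row['experiment']} | {hp} | {acc} |")
    return "\n".join(lines)

# Placeholder callback: a real one would train and evaluate a fresh model
def stub_run(**kwargs):
    return None

rows = tabulate_experiments(stub_run, [True, False], ["Adam", "SGD"], {"dropout": 0.5})
print(to_markdown(rows))
```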



In [None]:
import pandas as pd
from collections import Counter
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
from sklearn.metrics import accuracy_score


In [None]:
# Loading the dataset
train_df = pd.read_csv('/content/train.csv')
test_df = pd.read_csv('/content/test.csv')

# Making sure the Tweet column is of string type
train_df['Tweet'] = train_df['Tweet'].astype(str)
test_df['Tweet'] = test_df['Tweet'].astype(str)

# Handling the missing values in the Handle column
train_df['Handle'] = train_df['Handle'].fillna('')
test_df['Handle'] = test_df['Handle'].fillna('')

# Function to create a vocabulary from tweets
def create_vocabulary(tweets, max_size=1000):
    words = ' '.join(tweets).split()
    freq = Counter(words)
    vocab = {word: idx + 1 for idx, (word, _) in enumerate(freq.most_common(max_size))}  # Start indexing from 1
    return vocab

# Creating vocabulary from the train dataset
vocab = create_vocabulary(train_df['Tweet'])
vocab_size = len(vocab) + 1  # +1 for padding

# Function to encode tweets and handles into numerical IDs
def encode_tweet(tweet, vocab, max_length):
    tweet_ids = [vocab.get(word, 0) for word in tweet.split()]
    return tweet_ids[:max_length] + [0] * (max_length - len(tweet_ids))

def encode_handle(handle, max_length):
    handle_ids = [min(ord(char), 255) for char in handle]  # character codes, clamped to fit the 256-entry handle embedding
    return handle_ids[:max_length] + [0] * (max_length - len(handle_ids))

max_length = 10  # Adjusting this to our desired fixed size

# Encoding the tweets and handles
train_df['Encoded'] = train_df['Tweet'].apply(lambda x: encode_tweet(x, vocab, max_length))
train_df['Encoded_Handle'] = train_df['Handle'].apply(lambda x: encode_handle(x, max_length))
test_df['Encoded'] = test_df['Tweet'].apply(lambda x: encode_tweet(x, vocab, max_length))
test_df['Encoded_Handle'] = test_df['Handle'].apply(lambda x: encode_handle(x, max_length))

# Converting the labels to numerical format
label_mapping = {label: idx for idx, label in enumerate(train_df['Party'].unique())}
train_df['Party'] = train_df['Party'].map(label_mapping)

print(train_df.head())
print(test_df.head())


  Unnamed: 0  Party         Handle  \
0          0      0  RepDarrenSoto   
1          1      0  RepDarrenSoto   
2          2      0  RepDarrenSoto   
3          3      0  RepDarrenSoto   
4          4      0  RepDarrenSoto   

                                               Tweet  \
0  Today, Senate Dems vote to #SaveTheInternet. P...   
1  RT @WinterHavenSun: Winter Haven resident / Al...   
2  RT @NBCLatino: .@RepDarrenSoto noted that Hurr...   
3  RT @NALCABPolicy: Meeting with @RepDarrenSoto ...   
4  RT @Vegalteno: Hurricane season starts on June...   

                                   Encoded  \
0  [108, 197, 0, 148, 2, 0, 155, 2, 63, 0]   
1          [8, 0, 0, 0, 0, 0, 0, 0, 0, 10]   
2       [8, 0, 0, 0, 16, 0, 0, 34, 879, 0]   
3     [8, 0, 0, 12, 0, 876, 67, 7, 304, 1]   
4         [8, 0, 0, 0, 0, 9, 0, 0, 859, 0]   

                                   Encoded_Handle  
0  [82, 101, 112, 68, 97, 114, 114, 101, 110, 83]  
1  [82, 101, 112, 68, 97, 114, 114, 101, 110, 83]  
2

In [None]:
# Defining the Dataset class for tweets with handles
class TweetWithHandleDataset(Dataset):
    def __init__(self, encodings, handles, labels):
        self.encodings = encodings
        self.handles = handles
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return {
            'input_ids': torch.tensor(self.encodings[idx], dtype=torch.long),
            'handles': torch.tensor(self.handles[idx], dtype=torch.long),
            'labels': torch.tensor(self.labels[idx], dtype=torch.long)
        }

# Defining the Feedforward Neural Network that includes handle processing
class FeedForwardNNWithHandle(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, dropout_rate):
        super(FeedForwardNNWithHandle, self).__init__()
        self.tweet_embedding = nn.Embedding(vocab_size, embedding_dim)
        self.handle_embedding = nn.Embedding(256, embedding_dim)
        self.fc1 = nn.Linear(max_length * embedding_dim * 2, hidden_dim)
        self.dropout = nn.Dropout(dropout_rate)
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, input_ids, handle_ids):
        tweet_embeds = self.tweet_embedding(input_ids)
        handle_embeds = self.handle_embedding(handle_ids)
        x = torch.cat((tweet_embeds.view(tweet_embeds.size(0), -1), handle_embeds.view(handle_embeds.size(0), -1)), dim=1)
        x = self.fc1(x)
        x = torch.relu(x)
        x = self.dropout(x)
        x = self.fc2(x)
        return x

# Defining the Feedforward Neural Network that does NOT include handle processing
class FeedForwardNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, dropout_rate):
        super(FeedForwardNN, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.fc1 = nn.Linear(max_length * embedding_dim, hidden_dim)
        self.dropout = nn.Dropout(dropout_rate)
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, input_ids):
        embeds = self.embedding(input_ids)
        x = embeds.view(embeds.size(0), -1)
        x = self.fc1(x)
        x = torch.relu(x)
        x = self.dropout(x)
        x = self.fc2(x)
        return x


In [None]:
# Preparing the training and test datasets
train_dataset_with_handle = TweetWithHandleDataset(
    list(train_df['Encoded']),
    list(train_df['Encoded_Handle']),
    list(train_df['Party'])
)
train_dataset_without_handle = TweetWithHandleDataset(
    list(train_df['Encoded']),
    [[] for _ in range(len(train_df))],
    list(train_df['Party'])
)

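# Use the true test labels, mapped with the same label_mapping as the training set
# (assumes test.csv also has a 'Party' column). Placeholder labels such as
# [0] * len(test_df) would make the accuracy computed in evaluate_model meaningless.
test_dataset = TweetWithHandleDataset(
    list(test_df['Encoded']),
    list(test_df['Encoded_Handle']),
    list(test_df['Party'].map(label_mapping))
)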

# DataLoader
train_loader_with_handle = DataLoader(train_dataset_with_handle, batch_size=32, shuffle=True)
train_loader_without_handle = DataLoader(train_dataset_without_handle, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)


In [None]:
# Initializing Hyperparameters
embedding_dim = 64
hidden_dim = 128
output_dim = len(label_mapping)  # No. of unique political parties
dropout_rate = 0.5

# Initializing the models with and without the handle
model_with_handle = FeedForwardNNWithHandle(vocab_size, embedding_dim, hidden_dim, output_dim, dropout_rate)
model_without_handle = FeedForwardNN(vocab_size, embedding_dim, hidden_dim, output_dim, dropout_rate)

# Loss function
criterion = nn.CrossEntropyLoss()

# Defining optimizers with L2 regularization
optimizer_with_handle_adam = optim.Adam(model_with_handle.parameters(), lr=0.001, weight_decay=1e-4)
optimizer_without_handle_adam = optim.Adam(model_without_handle.parameters(), lr=0.001, weight_decay=1e-4)

optimizer_with_handle_sgd = optim.SGD(model_with_handle.parameters(), lr=0.01, weight_decay=1e-4)
optimizer_without_handle_sgd = optim.SGD(model_without_handle.parameters(), lr=0.01, weight_decay=1e-4)

optimizer_with_handle_rmsprop = optim.RMSprop(model_with_handle.parameters(), lr=0.001, weight_decay=1e-4)
optimizer_without_handle_rmsprop = optim.RMSprop(model_without_handle.parameters(), lr=0.001, weight_decay=1e-4)


In [None]:
# Function to train the model
def train_model(model, train_loader, criterion, optimizer, epochs=5, use_handle=True):
    model.train()
    for epoch in range(epochs):
        total_loss = 0
        for batch in train_loader:
            optimizer.zero_grad()
            if use_handle:
                input_ids = batch['input_ids']
                handle_ids = batch['handles']
                outputs = model(input_ids, handle_ids)
            else:
                input_ids = batch['input_ids']
                outputs = model(input_ids)
            loss = criterion(outputs, batch['labels'])
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        print(f'Epoch {epoch + 1}, Loss: {total_loss / len(train_loader)}')

# Function to evaluate the model
def evaluate_model(model, test_loader, use_handle=True):
    model.eval()
    all_predictions = []
    all_labels = []
    with torch.no_grad():
        for batch in test_loader:
            if use_handle:
                input_ids = batch['input_ids']
                handle_ids = batch['handles']
                outputs = model(input_ids, handle_ids)
            else:
                input_ids = batch['input_ids']
                outputs = model(input_ids)
            _, predicted = torch.max(outputs, 1)
            all_predictions.extend(predicted.numpy())
            all_labels.extend(batch['labels'].numpy())

    return accuracy_score(all_labels, all_predictions)


In [None]:
# Train and assess the model with the handle using the Adam optimizer
train_model(model_with_handle, train_loader_with_handle, criterion, optimizer_with_handle_adam)
accuracy_with_handle_adam = evaluate_model(model_with_handle, test_loader, use_handle=True)
print(f'Accuracy with handle (Adam): {accuracy_with_handle_adam:.4f}')

# Train and assess the model without handle using the Adam optimizer
train_model(model_without_handle, train_loader_without_handle, criterion, optimizer_without_handle_adam, use_handle=False)
accuracy_without_handle_adam = evaluate_model(model_without_handle, test_loader, use_handle=False)
print(f'Accuracy without handle (Adam): {accuracy_without_handle_adam:.4f}')

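# Train and evaluate both models (with and without handle) using the SGD and RMSProp optimizers.
# Note: the same model instances are reused, so each run continues from the previously trained weights.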
train_model(model_with_handle, train_loader_with_handle, criterion, optimizer_with_handle_sgd)
accuracy_with_handle_sgd = evaluate_model(model_with_handle, test_loader, use_handle=True)
print(f'Accuracy with handle (SGD): {accuracy_with_handle_sgd:.4f}')

train_model(model_without_handle, train_loader_without_handle, criterion, optimizer_without_handle_sgd, use_handle=False)
accuracy_without_handle_sgd = evaluate_model(model_without_handle, test_loader, use_handle=False)
print(f'Accuracy without handle (SGD): {accuracy_without_handle_sgd:.4f}')

train_model(model_with_handle, train_loader_with_handle, criterion, optimizer_with_handle_rmsprop)
accuracy_with_handle_rmsprop = evaluate_model(model_with_handle, test_loader, use_handle=True)
print(f'Accuracy with handle (RMSProp): {accuracy_with_handle_rmsprop:.4f}')

train_model(model_without_handle, train_loader_without_handle, criterion, optimizer_without_handle_rmsprop, use_handle=False)
accuracy_without_handle_rmsprop = evaluate_model(model_without_handle, test_loader, use_handle=False)
print(f'Accuracy without handle (RMSProp): {accuracy_without_handle_rmsprop:.4f}')


Epoch 1, Loss: 0.10704672720244447
Epoch 2, Loss: 0.02268048529416957
Epoch 3, Loss: 0.021294670939854255
Epoch 4, Loss: 0.019458702005206052
Epoch 5, Loss: 0.018997185375999696
Accuracy with handle (Adam): 0.4850
Epoch 1, Loss: 0.6880309031337984
Epoch 2, Loss: 0.6544161152587921
Epoch 3, Loss: 0.6271105544240896
Epoch 4, Loss: 0.6161232080771123
Epoch 5, Loss: 0.6094470341143711
Accuracy without handle (Adam): 0.5067
Epoch 1, Loss: 0.016528541567201686
Epoch 2, Loss: 0.016291438572187837
Epoch 3, Loss: 0.01617888709261888
Epoch 4, Loss: 0.015949408589658945
Epoch 5, Loss: 0.015779534401468266
Accuracy with handle (SGD): 0.4896
Epoch 1, Loss: 0.5869291955109373
Epoch 2, Loss: 0.5836974203953699
Epoch 3, Loss: 0.5814611516856854
Epoch 4, Loss: 0.5802821217525178
Epoch 5, Loss: 0.5794321566466135
Accuracy without handle (SGD): 0.4776
Epoch 1, Loss: 0.020346484979414756
Epoch 2, Loss: 0.019321491015070965
Epoch 3, Loss: 0.019143940945155483
Epoch 4, Loss: 0.019100866733100377
Epoch 5, Lo

In [None]:
# Results
results = {
    "Model": ["With Handle (Adam)", "Without Handle (Adam)",
              "With Handle (SGD)", "Without Handle (SGD)",
              "With Handle (RMSProp)", "Without Handle (RMSProp)"],
    "Accuracy": [accuracy_with_handle_adam, accuracy_without_handle_adam,
                 accuracy_with_handle_sgd, accuracy_without_handle_sgd,
                 accuracy_with_handle_rmsprop, accuracy_without_handle_rmsprop]
}

results_df = pd.DataFrame(results)
print(results_df)


                      Model  Accuracy
0        With Handle (Adam)  0.484992
1     Without Handle (Adam)  0.506703
2         With Handle (SGD)  0.489582
3      Without Handle (SGD)  0.477561
4     With Handle (RMSProp)  0.506338
5  Without Handle (RMSProp)  0.422847


### Description

Two models were designed to predict the political party based on tweets:
1. **Model with Handle**: Incorporates both the tweet content and the handle associated with the tweet.
2. **Model without Handle**: Uses only the tweet content for prediction.

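Both models were trained for 5 epochs with each of three optimizers (Adam, SGD, and RMSProp), using L2 regularization (weight decay) and dropout to reduce overfitting. The optimizers were applied in sequence to the same model instance without re-initialization, so the SGD and RMSProp results reflect continued training from the previous optimizer's weights rather than independent runs. Tweets are encoded following Option 1: the first K = 10 whitespace-separated tokens, mapped to ids from a vocabulary of the 1000 most frequent training words.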

### Hyperparameters

The hyperparameters used across all experiments are as follows:

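- **Vocabulary size**: 1000 most frequent training words (id 0 is shared by padding and out-of-vocabulary words)
- **Input length**: first 10 tokens of each tweet and first 10 characters of each handle
- **Embedding dimension**: 64
- **Hidden dimension**: 128
- **Dropout rate**: 0.5
- **Batch size**: 32
- **Epochs**: 5 per optimizer
- **Output dimension**: number of unique political parties in the training set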
- **Learning rate**:
    - Adam: 0.001
    - SGD: 0.01
    - RMSProp: 0.001
- **L2 regularization (weight decay)**: 1e-4
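
As a rough illustration of what the weight decay and dropout settings above do, the sketch below writes both out in plain Python. It assumes the convention used by `torch.optim.SGD` and `nn.Dropout`: weight decay adds `weight_decay * w` to each gradient, and dropout rescales the kept activations by `1 / (1 - rate)` during training. The function names are illustrative and not part of the notebook.

```python
import random

def sgd_step_with_weight_decay(weights, grads, lr=0.01, weight_decay=1e-4):
    """One plain SGD update where L2 regularization adds weight_decay * w to each gradient."""
    return [w - lr * (g + weight_decay * w) for w, g in zip(weights, grads)]

def inverted_dropout(activations, rate=0.5, rng=None):
    """Training-time dropout: zero each unit with probability `rate`, rescale the rest."""
    rng = rng or random.Random(0)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

# Toy values (not taken from the trained models)
new_weights = sgd_step_with_weight_decay([1.0, -2.0], [0.5, 0.0], lr=0.1, weight_decay=0.1)
dropped = inverted_dropout([1.0] * 8, rate=0.5, rng=random.Random(0))
print(new_weights)
print(dropped)
```

With a zero gradient, the second weight still shrinks toward zero, which is the regularizing effect of the L2 term. At evaluation time dropout is disabled, which is what `model.eval()` does in the notebook.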

### Performance on the Test Set

| Model                    | Accuracy  |
|---------------------------|-----------|
| With Handle (Adam)         | 0.484992  |
| Without Handle (Adam)      | 0.506703  |
| With Handle (SGD)          | 0.489582  |
| Without Handle (SGD)       | 0.477561  |
| With Handle (RMSProp)      | 0.506338  |
| Without Handle (RMSProp)   | 0.422847  |

### Summary of Results

- The **Without Handle (Adam)** model achieved the highest test accuracy, **0.506703**.
- The **With Handle (RMSProp)** model was nearly identical at **0.506338**.
- The **With Handle (Adam)** and **With Handle (SGD)** models performed comparably, at **0.484992** and **0.489582**, respectively, and **Without Handle (SGD)** reached **0.477561**.
- The **Without Handle (RMSProp)** model had the lowest accuracy at **0.422847**.
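
Accuracies like these are easier to interpret next to a trivial baseline. The sketch below is a minimal example that uses toy labels rather than the notebook's data. It computes accuracy and the score of always predicting the most frequent training label. The helper names are hypothetical.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def majority_baseline(y_train, y_test):
    """Test accuracy of always predicting the most frequent training label."""
    majority_label = Counter(y_train).most_common(1)[0][0]
    return accuracy(y_test, [majority_label] * len(y_test))

# Toy labels for illustration only
toy_train = [0, 0, 1]
toy_test = [0, 1, 0, 0]
print(accuracy(toy_test, [0, 1, 1, 0]))
print(majority_baseline(toy_train, toy_test))
```

A model whose test accuracy does not clearly exceed this baseline has learned little beyond the class prior.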
