# Getting Hot

## Problem Statement

In a world where information is power, the evil APOCALYPSE organization has harnessed the might of large language models (LLMs) to spread fake news and manipulate public opinion. These nefarious actors have weaponized cutting-edge AI technology to undermine trust in legitimate sources of information and sow discord among the population.

But there is hope: a team of dedicated researchers and data scientists is working tirelessly to build a machine learning model that can detect LLM-generated content and flag it as potentially unreliable.

This cutting-edge technology analyzes not only the content of the text, but also the temperature associated with it. LLM-generated text tends to have a distinct temperature signature, which the model can use to distinguish it from genuine human-generated content.

You are given Base64-encoded sentences and their associated temperatures in train.csv, but the temperatures for the Base64-encoded sentences in test.csv are missing. Help us build a model to find the temperatures so we can stand up to the APOCALYPSE organization and their campaign of misinformation.

## Solution

We use TF-IDF vectorizers to transform the text, then build and train a PyTorch regression model. The dataset and model are fairly resource-intensive, so this version was run on a more powerful server (RTX 4090 and 64 GB RAM). A much more computationally efficient solution that runs on Google Colab and scores close to this model is described in the second solution further down.

In [0]:
!pip install pandas numpy scikit-learn torch nltk tqdm

In [None]:
!tar -xf GettingHot.tar.xz

In [124]:
import pandas as pd
import numpy as np
import base64
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from tqdm import tqdm
from nltk.tokenize import RegexpTokenizer
from nltk.stem.porter import PorterStemmer

In [125]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

Using device: cuda


In [126]:
def process(text):
    # Decode a Base64 sentence to text; fall back to "" if decoding fails
    try:
        return base64.b64decode(text).decode()
    except Exception:
        return ""
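
As a quick sanity check of the decoding step, the sketch below round-trips a made-up sentence and shows the fallback for bad input. The sample strings are illustrative only and do not come from the dataset.

```python
import base64

def process(text):
    # Decode a Base64 sentence to text; fall back to "" if decoding fails
    try:
        return base64.b64decode(text).decode()
    except Exception:
        return ""

# Hypothetical sample sentence, encoded the same way the dataset is assumed to be
encoded = base64.b64encode("The weather is warm today.".encode()).decode()
decoded = process(encoded)
print(decoded)

# Incorrect padding raises inside b64decode, so process() returns ""
bad_padding = process("abc")

# Valid Base64 whose bytes are not valid UTF-8 also falls back to ""
bad_utf8 = process(base64.b64encode(b"\xff\xfe").decode())
```

Rows that fail to decode become empty strings, so they still reach the vectorizers but contribute all-zero TF-IDF features.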

In [127]:
df = pd.read_csv("package/train.csv")
df['sentence'] = df['sentence'].map(process)

# Split data into features and target
X = df['sentence']
y = df['temperature']

In [129]:
tokenizer = RegexpTokenizer(r'\w+')
porter_stemmer = PorterStemmer()

def text_process(text):
    text_processed=tokenizer.tokenize(text)
    text_processed = [porter_stemmer.stem(word) for word in text_processed]
    return text_processed

In [130]:
char_vec = TfidfVectorizer(analyzer='char', ngram_range=(3, 3), max_features=5000).fit(X)
word_vec = TfidfVectorizer(analyzer='word', ngram_range=(1, 1), tokenizer=text_process, max_features=5000).fit(X)
bigram_vec = TfidfVectorizer(analyzer='word', ngram_range=(2, 2), tokenizer=text_process, max_features=2500).fit(X)



In [131]:
# Define custom vectorizer with multiple transformers
class CustomVectorizer(nn.Module):
    def __init__(self):
        super(CustomVectorizer, self).__init__()

    def forward(self, x):
        char_features = torch.tensor(char_vec.transform(x).toarray(), dtype=torch.float32)
        word_features = torch.tensor(word_vec.transform(x).toarray(), dtype=torch.float32)
        bigram_features = torch.tensor(bigram_vec.transform(x).toarray(), dtype=torch.float32)
        return torch.cat((char_features, word_features, bigram_features), dim=1)

# Vectorize the text into a single dense feature matrix
vectorizer = CustomVectorizer()
X = vectorizer(X)

# Convert to PyTorch tensors
y = torch.tensor(y.values, dtype=torch.float32)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
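
The `.toarray()` calls turn the sparse TF-IDF output into dense tensors, which is likely a large part of why this version needs so much memory. Here is a rough back-of-the-envelope estimate. The row count is an assumption inferred from the training log further down (1280 batches of 64 for the 80% training split), so treat the result as approximate.

```python
# Feature width: char + word + bigram max_features
n_features = 5000 + 5000 + 2500

# Assumption: about 1280 * 64 training rows, which is 80% of the data
train_rows = 1280 * 64
total_rows = train_rows * 5 // 4

bytes_per_float32 = 4
dense_gib = total_rows * n_features * bytes_per_float32 / 2**30
print(f"{total_rows} rows x {n_features} features = {dense_gib:.2f} GiB")  # about 4.77 GiB
```

This counts only the final concatenated matrix. The intermediate arrays created by `toarray()` and `torch.cat` push peak usage higher.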

In [147]:
# Define the model architecture
class RegressionModel(nn.Module):
    def __init__(self, input_dim):
        super(RegressionModel, self).__init__()
        self.fc1 = nn.Linear(input_dim, 2048)
        self.fc2 = nn.Linear(2048, 128)
        self.fc3 = nn.Linear(128, 1)
        self.dropout = nn.Dropout(p=0.2)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.dropout(x)
        x = torch.relu(self.fc2(x))
        x = self.dropout(x)
        x = self.fc3(x)
        return x

# Initialize the model
input_dim = X_train.shape[1]
model = RegressionModel(input_dim)

In [148]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

RegressionModel(
  (fc1): Linear(in_features=12500, out_features=2048, bias=True)
  (fc2): Linear(in_features=2048, out_features=128, bias=True)
  (fc3): Linear(in_features=128, out_features=1, bias=True)
  (dropout): Dropout(p=0.2, inplace=False)
)
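
For a sense of scale, the layer sizes printed above let us count the trainable parameters by hand. This is plain arithmetic (weights plus biases per `Linear` layer), not output from the original notebook.

```python
# (in_features, out_features) for fc1, fc2, fc3, as printed above
layer_shapes = [(12500, 2048), (2048, 128), (128, 1)]

# Each Linear layer has in*out weights plus out biases
param_counts = [n_in * n_out + n_out for n_in, n_out in layer_shapes]
total_params = sum(param_counts)

print(param_counts)                    # [25602048, 262272, 129]
print(total_params)                    # 25864449
print(param_counts[0] / total_params)  # fc1 holds about 99% of the parameters
```

Almost all of the model's capacity sits in the first layer, so the size of the TF-IDF vocabulary drives the model size.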

In [None]:
# Define loss function and optimizer
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Define DataLoader
train_dataset = TensorDataset(X_train, y_train)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)

# Train the model
model.train()
for epoch in range(10):
    running_loss = 0.0
    with tqdm(total=len(train_loader), desc=f'Epoch {epoch+1}/10', unit='batch') as pbar:
        for batch_idx, (inputs, labels) in enumerate(train_loader):
            inputs, labels = inputs.to(device), labels.to(device)
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels.unsqueeze(1))
            loss.backward()
            optimizer.step()
            running_loss += loss.item() * inputs.size(0)
            pbar.set_postfix({'loss': running_loss / ((batch_idx + 1) * train_loader.batch_size)})
            pbar.update()

# Evaluate the model
model.eval()
with torch.no_grad():
    inputs, labels = X_test.to(device), y_test.to(device)
    outputs = model(inputs)
    mse = criterion(outputs, labels.unsqueeze(1)).item()
    print("Mean Squared Error:", mse)

Epoch 1/10: 100%|██████████| 1280/1280 [00:05<00:00, 242.47batch/s, loss=0.0243]
Epoch 2/10: 100%|██████████| 1280/1280 [00:05<00:00, 244.31batch/s, loss=0.015] 
Epoch 3/10: 100%|██████████| 1280/1280 [00:05<00:00, 243.24batch/s, loss=0.0111]
Epoch 4/10: 100%|██████████| 1280/1280 [00:05<00:00, 252.23batch/s, loss=0.00853]
Epoch 5/10: 100%|██████████| 1280/1280 [00:05<00:00, 245.98batch/s, loss=0.00729]
Epoch 6/10:  52%|█████▏    | 671/1280 [00:02<00:02, 241.12batch/s, loss=0.00642]
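
A note on `labels.unsqueeze(1)` in the loop above: the model outputs have shape `(N, 1)` while the labels have shape `(N,)`. Without the extra dimension, broadcasting would compare every prediction with every label. The sketch below illustrates this with NumPy on made-up numbers. PyTorch follows the same broadcasting rules, but the example is a simplified stand-in for `nn.MSELoss`, not the notebook's code.

```python
import numpy as np

# Hypothetical predictions with shape (3, 1) and labels with shape (3,)
toy_outputs = np.array([[0.1], [0.5], [0.9]])
toy_labels = np.array([0.1, 0.5, 0.9])

# Mismatched shapes broadcast to a (3, 3) matrix of all pairwise differences
pairwise = toy_outputs - toy_labels
wrong_mse = (pairwise ** 2).mean()

# Adding the trailing dimension (like unsqueeze(1)) keeps the comparison element-wise
right_mse = ((toy_outputs - toy_labels[:, None]) ** 2).mean()

print(pairwise.shape)  # (3, 3)
print(wrong_mse)       # about 0.213, although the predictions are perfect
print(right_mse)       # 0.0
```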

In [137]:
test_df = pd.read_csv("package/test.csv")
test_df['sentence'] = test_df['sentence'].map(process)
test_X_torch = vectorizer(test_df['sentence'])
test_dataset = TensorDataset(test_X_torch)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

In [None]:
# Predict temperatures for the test set
model.eval()
outputs_array = []
with torch.no_grad():
    for inputs in tqdm(test_loader, desc="Testing", unit="batch"):
        inputs = inputs[0].to(device)  # Extracting inputs from DataLoader
        outputs = model(inputs)
        outputs_array.extend(outputs.cpu().numpy().flatten())

sub_df = pd.read_csv('package/submission.csv')
sub_df['temperature'] = outputs_array
sub_df.to_csv('package/submission.csv', index=False)

This model obtains a score of **89**.

## Solution (Using sklearn models)

The concept is the same as in the more computationally expensive model above, except that it uses scikit-learn's Ridge regression model.

In [ ]:
from google.colab import drive
drive.mount('/content/drive')

In [ ]:
import pandas as pd
import numpy as np
import base64
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [ ]:
def process(text):
    # Decode a Base64 sentence to text; fall back to "" if decoding fails
    try:
        return base64.b64decode(text).decode()
    except Exception:
        return ""

In [ ]:
df = pd.read_csv("package/train.csv")
df['processed'] = df['sentence'].map(process)
# process() returns "" rather than NaN on failure, so this is only a safeguard
df = df.dropna(subset=['processed'])

# Split data into features and target
X = df['processed']
y = df['temperature']

In [ ]:
# Convert text data into TF-IDF features
f_union = FeatureUnion(
    transformer_list=[
        ('char', Pipeline([
            ('tfidf', TfidfVectorizer(analyzer='char', ngram_range=(3, 3), max_features=5000)),
        ])),
        ('text', Pipeline([
            ('tfidf', TfidfVectorizer(analyzer='word', ngram_range=(1, 1), max_features=5000)),
        ])),
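        # Note: ngram_range=(1, 2) below covers unigrams and bigrams, whereas the
        # PyTorch version above uses (2, 2) for bigrams only
        ('word_bigrams', Pipeline([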
            ('tfidf', TfidfVectorizer(analyzer='word', ngram_range=(1, 2), max_features=2500)),
        ])),
    ],
)

model = Ridge(alpha=1.0)

pipeline = Pipeline([
    ('union', f_union),
    ('clf', model)
])

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
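
`FeatureUnion` fits each vectorizer on the same input and stacks their outputs side by side. The sketch below checks this on a made-up corpus; the `toy_` names and sentences are illustrative assumptions. It also shows that the combined matrix stays sparse, which `Ridge` accepts directly and which probably explains much of why this version is so much lighter.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion

# Hypothetical toy corpus, only for illustration
toy_corpus = ["the cat sat on the mat", "the dog sat on the log", "a bird flew away"]

toy_union = FeatureUnion([
    ('char', TfidfVectorizer(analyzer='char', ngram_range=(3, 3))),
    ('word', TfidfVectorizer(analyzer='word', ngram_range=(1, 1))),
])
combined = toy_union.fit_transform(toy_corpus)

# Fit the same vectorizers separately to compare widths
char_only = TfidfVectorizer(analyzer='char', ngram_range=(3, 3)).fit_transform(toy_corpus)
word_only = TfidfVectorizer(analyzer='word', ngram_range=(1, 1)).fit_transform(toy_corpus)

# The union's width is the sum of the individual widths
print(combined.shape, char_only.shape, word_only.shape)

# No dense conversion is needed at any point
print(sparse.issparse(combined))
```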

In [ ]:
pipeline.fit(X_train, y_train)

# Predict on test data
y_pred = pipeline.predict(X_test)

# Calculate mean squared error
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

In [ ]:
df = pd.read_csv("package/test.csv")
df['processed'] = df['sentence'].map(process)
X = df['processed']

pred = pipeline.predict(X)
df = pd.read_csv('package/submission.csv')
df['temperature'] = pred
df.to_csv('package/submission.csv', index=False)

This model yields a score of **~85**.