<a href="https://colab.research.google.com/github/Nandeesh-U/Deep-learning/blob/main/Nandeesh_Group_8_Kaggle_competition_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## Dataset

The challenge in this competition is to predict whether a question asked on a well known public forum/platform is Toxic/inappropriate or not.

A toxic/inappropriate question is defined as a question intended to make a statement and not with a purpose of looking for helpful/meaningful answers. The following are some of the characteristics that can signify that a question is irrelevant/inappropriate:

* Based on false information, or contains absurd assumptions
* Has a non-neutral tone
* Has an exaggerated tone to underscore a point about a group of people
* Is rhetorical and meant to imply a statement about a group of people
* Is disparaging or inflammatory against an individual or a group of people
* Uses sexual content (such as incest, pedophilia), and not to seek genuine answers
* Suggests a discriminatory idea against a protected class of people, or seeks confirmation of a stereotype
* Based on an unrealistic premise about a group of people
* Is not grounded in reality

The training dataset includes the 1044897 questions that were asked, and whether each was identified as toxic/inappropriate (target = 1) or as relevant/appropriate (target = 0). The test dataset consists of approximately 261000 questions.

The training data might be imbalanced or noisy. They are not guaranteed to be perfect. Please take the necessary actions/steps while building the model.


## Description

This dataset has the following information:

1. **qid** - unique question identifier
2. **question_text** - the text of the question asked in the well known public forum/platform
3. **target** - a question labeled "toxic/inappropriate" has a value of 1, otherwise 0



## Problem Statement

To classify approximately 261000 questions asked on a well known public forum as 'toxic/inappropriate' or 'relevant/appropriate' using deep neural networks such as RNN/CNN/BERT/LSTM.

### 3. Upload your kaggle.json file using the following snippet in a code cell:



In [None]:
from google.colab import files
files.upload()

In [None]:
#If successfully uploaded in the above step, the 'ls' command here should display the kaggle.json file.
%ls

### 4. Install the Kaggle API using the following command


In [None]:
#!pip install urllib3

In [None]:
#!pip install -U -q kaggle==1.5.8

### 5. Move the kaggle.json file into ~/.kaggle, which is where the API client expects your token to be located:



In [None]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/

In [None]:
# Execute the following command to verify whether the kaggle.json is stored in the appropriate location: ~/.kaggle/kaggle.json
!ls ~/.kaggle

In [None]:
!chmod 600 /root/.kaggle/kaggle.json # run this command to ensure your Kaggle API token is secure on colab

### 6. Now download the Test Data from Kaggle

In [None]:
#If you get a forbidden link, you have most likely not joined the competition.
!kaggle competitions download -c irrelevant-questions-classification

In [None]:
'''import zipfile
zip_ref = zipfile.ZipFile('/content/irrelevant-questions-classification.zip', 'r')
zip_ref.extractall('/content/drive/MyDrive/colab_data_files/kaggle_1_dataset')
zip_ref.close()'''

In [None]:
!unzip /content/irrelevant-questions-classification.zip

In [None]:
#!pip install transformers

In [None]:
import pandas as pd
import numpy as np
from sklearn.utils import shuffle
import tensorflow as tf
import nltk
from sklearn.metrics import f1_score
nltk.download('stopwords')
from nltk.corpus import stopwords
import string
from gensim.utils import simple_preprocess

import keras
from keras.preprocessing.text import Tokenizer
from keras.utils import pad_sequences
from sklearn.preprocessing import LabelEncoder

from keras.layers import Input, Embedding, Dense, Bidirectional, Dropout, GRU, LSTM
from keras.models import Sequential   # the model
import matplotlib.pyplot as plt

##   **Stage 1**:  Data Loading and Perform Exploratory Data Analysis (1 Points)

Extracting data from a stored folder to maintain consistency across the project

In [None]:
df_train = pd.read_csv("/content/drive/MyDrive/colab_data_files/kaggle_1_dataset/train_dataset.csv")
df_test = pd.read_csv("/content/drive/MyDrive/colab_data_files/kaggle_1_dataset/test_dataset.csv")

In [None]:
df_train.head()

In [None]:
#Let's check the distribution of the target class
df_train['target'].value_counts()

The data is heavily skewed. Only 6% of the samples are toxic/inappropriate
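
One way to quantify the skew, and optionally compensate for it during training, is to compute the class shares and "balanced" class weights. The sketch below uses a small made-up label vector (`toy_target`, not the competition data). Passing the resulting dictionary to Keras through the `class_weight` argument of `model.fit` is an option that this notebook does not use.

```python
import numpy as np
import pandas as pd
from sklearn.utils.class_weight import compute_class_weight

# Toy labels standing in for df_train['target']: 94 appropriate, 6 toxic
toy_target = pd.Series([0] * 94 + [1] * 6)

# Share of each class
class_share = toy_target.value_counts(normalize=True).sort_index()
print(class_share)

# 'balanced' weights are n_samples / (n_classes * class_count)
weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=toy_target.to_numpy())
class_weight = {0: float(weights[0]), 1: float(weights[1])}
print(class_weight)
```

With these weights the rare class contributes more to the loss per sample, which may or may not help the final F1 score; it would need to be checked on a held-out split.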

In [None]:
df_train.isnull().values.any()

No null values exist.

In [None]:
df_train.dtypes

The columns are in the right format

In [None]:
df_train.shape

In [None]:
df_test.shape

In [None]:
df_train.tail()

In [None]:
df_test.head()

##   **Stage 2**: Data Pre-Processing  (1 Points)

####  Clean and Transform the data into a specified format


In [None]:
df_train['question_text'] = df_train['question_text'].apply(lambda x:simple_preprocess(x, max_len=30))

In [None]:
df_test['question_text'] = df_test['question_text'].apply(lambda x:simple_preprocess(x, max_len=30))

In [None]:
df_test.shape

In [None]:
df_train.tail()

In [None]:
# Remove stop words
stop_words = set(stopwords.words('english'))

df_train['question_text'] = df_train['question_text'].apply(lambda x: [w for w in x if w not in stop_words])
df_test['question_text'] = df_test['question_text'].apply(lambda x: [w for w in x if w not in stop_words])

In [None]:
# checking the max word count
tmp_list= [len(i) for i in df_train['question_text']]
print(max(tmp_list))

In [None]:
plt.hist(tmp_list)
plt.show()

In [None]:
plt.boxplot(tmp_list)
plt.show()

The maximum length of a question_text is 81 tokens. Looking at the box plot, a maximum sequence length of 40 covers nearly all the questions, so let's select 40 as our max token count per question_text.
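
The choice of cutoff can also be checked numerically instead of by reading the box plot. The following is a minimal sketch with made-up token counts (`toy_lengths` stands in for `tmp_list`); `coverage_at` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def coverage_at(lengths, max_len):
    """Fraction of sequences whose token count is <= max_len."""
    lengths = np.asarray(lengths)
    return float(np.mean(lengths <= max_len))

# Made-up token counts standing in for tmp_list
toy_lengths = [3, 5, 7, 8, 10, 12, 15, 20, 38, 81]

# Share of questions that would fit without truncation at 40 tokens
print(coverage_at(toy_lengths, 40))

# Alternatively, look at a high percentile of the length distribution
print(np.percentile(toy_lengths, 99))
```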

##   **Stage 3**: Build the Word Embeddings using pretrained Word2vec/Glove (Text Representation) (1 Point)



In [None]:
# Hyperparameters
MAX_SENT_LEN = 40   # Number of words to consider from each question_text
MAX_VOCAB_SIZE = 165000   # Max vocabulary size
BATCH_SIZE = 128
N_EPOCHS = 15

Let's do padding to maintain a consistent input length
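
To make the effect of `padding='post'` and `truncating='post'` concrete, here is a small pure-Python sketch of the same idea. `pad_post` is a hypothetical helper written for illustration; the notebook itself relies on Keras `pad_sequences`.

```python
def pad_post(seq, max_len, pad_value=0):
    """Keep the first max_len items, then right-pad with pad_value."""
    clipped = list(seq[:max_len])
    return clipped + [pad_value] * (max_len - len(clipped))

# A short sequence is padded at the end
print(pad_post([4, 9, 2], 5))           # [4, 9, 2, 0, 0]

# A long sequence loses its trailing items
print(pad_post([4, 9, 2, 7, 1, 8], 5))  # [4, 9, 2, 7, 1]
```

The value 0 is safe to use as padding because the Keras `Tokenizer` starts word indices at 1.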

In [None]:
tokenizer = Tokenizer(num_words=MAX_VOCAB_SIZE)
tokenizer.fit_on_texts([' '.join(seq[:MAX_SENT_LEN]) for seq in df_train['question_text']])

print("Number of words in vocabulary:", len(tokenizer.word_index))

In [None]:
# Convert the sequence of words to a sequence of indices
X = tokenizer.texts_to_sequences([' '.join(seq[:MAX_SENT_LEN]) for seq in df_train['question_text']])
X = pad_sequences(X, maxlen=MAX_SENT_LEN, padding='post', truncating='post')

y = df_train['target']

In [None]:
#deleting dataframes to save RAM
del df_train

In [None]:
len(X[0])

In [None]:
print(X)

In [None]:
print(y)

Split the training data into a training set and a held-out test set

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42, train_size=0.9)

In [None]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

In [None]:
del X

Let's load the 300 dimensional GloVe embedding

In [None]:
#!wget https://nlp.stanford.edu/data/glove.840B.300d.zip
#!wget https://nlp.stanford.edu/data/glove.6B.zip

In [None]:
#!unzip glove*.zip

In [None]:
'''import zipfile
zip_ref = zipfile.ZipFile('/content/glove.6B.zip', 'r')
zip_ref.extractall('/content/drive/MyDrive/colab_data_files')
zip_ref.close()'''

In [None]:
'''
embeddings_index = {}
# Loading the 300-dimensional vector of the model
f = open('/content/drive/MyDrive/colab_data_files/glove.6B.300d.txt')
count=0
for line in f:
  try:
    #print(line)
    print(count)
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
  except:
    pass
  count+=1
f.close()

print('Found %s word vectors.' % len(embeddings_index))
'''

In [None]:
import pickle
'''
pickle.dump({'embeddings_index' : embeddings_index } , open('/content/drive/MyDrive/colab_data_files/glove_embeddings_unpacked.pkl', 'wb'))
'''

Load the embedding dictionary already stored as a pickle file. Each key is a word and each value is its embedding vector stored as an array.

In [None]:
with open("/content/drive/MyDrive/colab_data_files/glove_embeddings_unpacked_6b_300d.pkl", "rb") as file_to_read:
  loaded_dict = pickle.load(file_to_read)

In [None]:
type(loaded_dict)

In [None]:
embeddings_index=loaded_dict['embeddings_index']

In [None]:
len(embeddings_index)

In [None]:
tokenizer

In [None]:
tokenizer.word_index

In [None]:
print(len(tokenizer.word_index))

Create an embedding matrix where each row represents a word from the vocabulary we obtained from the training data and the columns represent the embedding dimensions

In [None]:
# Adding 1 because index 0 is reserved (used for padding)
words_not_found = []
vocab_size = len(tokenizer.word_index) + 1
print('Loaded %s word vectors.' % len(embeddings_index))

embedding_dim = 300

# Create a weight matrix for words in the training data
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, i in tokenizer.word_index.items():
  if i >= vocab_size:
    continue
  embedding_vector = embeddings_index.get(word)
  if (embedding_vector is not None) and len(embedding_vector) > 0:
    embedding_matrix[i] = embedding_vector
  else:
    words_not_found.append(word)

In [None]:
len(words_not_found)

64103 words out of 160000 in the questions do not exist in our GloVe embeddings. This seems to be a poor embedding for this problem, but we were not able to use larger pre-trained embeddings due to compute limits.
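
The count of missing distinct words can overstate the problem if those words are rare. A complementary check is the share of token occurrences that have no embedding. The sketch below uses made-up data; `oov_report` is a hypothetical helper, and in the notebook its inputs would be the tokenized questions and `embeddings_index`.

```python
from collections import Counter

def oov_report(tokenized_questions, known_words):
    """Return (share of distinct words missing, share of token occurrences missing)."""
    counts = Counter(tok for question in tokenized_questions for tok in question)
    missing = [w for w in counts if w not in known_words]
    vocab_share = len(missing) / len(counts)
    token_share = sum(counts[w] for w in missing) / sum(counts.values())
    return vocab_share, token_share

# Made-up data: 'brexit' is the only word absent from the toy embedding set
toy_questions = [['why', 'is', 'sky', 'blue'], ['why', 'brexit'], ['is', 'sky', 'blue']]
toy_known = {'why', 'is', 'sky', 'blue'}

vocab_share, token_share = oov_report(toy_questions, toy_known)
print(vocab_share, token_share)
```

If the token-level share turns out to be small, the zero vectors assigned to missing words affect relatively few input positions.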

##   **Stage 4**: Build and Train the Deep networks model using Pytorch/Keras (5 Points)



Let's first build a bidirectional GRU to establish the baseline

In [None]:
# Build a sequential model by stacking neural net units
model = Sequential()
embedding_layer = Embedding(vocab_size,
                            embedding_dim,
                            weights = [embedding_matrix],
                            input_length = MAX_SENT_LEN,
                            trainable=False)
model.add(embedding_layer)
model.add(Bidirectional(GRU(128, return_sequences=True, dropout=0.50, name='first_gru_layer')))
model.add(Dropout(0.5))
model.add(Bidirectional(GRU(64, name='second_gru_layer')))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid', name='output_layer'))

In [None]:
print('Summary of the built model...')
model.summary()

Let's train the model

In [None]:
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

In [None]:
model.fit(X_train, y_train,
          batch_size=BATCH_SIZE,
          epochs=N_EPOCHS,
          validation_data=(X_test, y_test))

In [None]:
model.save('/content/drive/MyDrive/colab_data_files/models/bigru_128_64_64.keras')

##   **Stage 5**: Evaluate the Model and get model predictions on the test dataset (2 Points)








Let's evaluate the model's accuracy on the held-out test split (X_test)

We have saved the model for reuse. So loading it here from there.

In [None]:
model = tf.keras.models.load_model('/content/drive/MyDrive/colab_data_files/models/bigru_128_64_64.keras')


In [None]:
print('Testing...')
model.evaluate(X_test, y_test)

In [None]:
# model predictions on the test data
preds = model.predict(X_test)

In [None]:
preds.shape

In [None]:
preds[:10]

Since the data is highly skewed, the default threshold of 0.5 might not be the right way to turn probabilities into classes. There are many ways to locate a threshold with a good balance between the false positive and true positive rates.

* Sensitivity = True Positive Rate
* Specificity = 1 – False Positive Rate

The Geometric Mean or G-Mean is a metric for imbalanced classification that, if optimized, seeks a balance between the sensitivity and the specificity.

G-Mean = sqrt(Sensitivity * Specificity)

One approach is to test the model with each threshold returned from the call roc_curve() and select the threshold with the largest G-Mean value.
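
As a worked example of the G-Mean formula at a single threshold, the sketch below computes it from confusion-matrix counts. The counts are made up and `g_mean` is a hypothetical helper used only for illustration.

```python
from math import sqrt

def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity and specificity from confusion counts."""
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # 1 - false positive rate
    return sqrt(sensitivity * specificity)

# Made-up counts: 6 toxic questions and 94 appropriate ones
print(round(g_mean(tp=5, fn=1, tn=80, fp=14), 3))  # 0.842
```

Because both factors are rates, the G-Mean treats the two classes symmetrically regardless of how rare the positive class is.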

In [None]:
from sklearn.metrics import roc_curve
from numpy import sqrt
from numpy import argmax

In [None]:
# calculate roc curves
fpr, tpr, thresholds = roc_curve(y_test, preds)
# plot the roc curve for the model
plt.plot([0,1], [0,1], linestyle='--', label='No Skill')
plt.plot(fpr, tpr, marker='.', label='BiGRU')
# axis labels
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
# show the plot
plt.show()

In [None]:
tpr

In [None]:
fpr

In [None]:
# calculate the g-mean for each threshold
gmeans = sqrt(tpr * (1-fpr))

Let's look at the scatter plot of threshold vs G-Mean

In [None]:
plt.scatter(thresholds, gmeans)
plt.show()

In [None]:
# locate the index of the largest g-mean
ix = argmax(gmeans)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds[ix], gmeans[ix]))

Although the threshold of 0.047 gives the highest G-Mean on the test split, the curve is not smooth near 0, so it would be prudent to pick a threshold in the smoother part of the curve. Let's repeat the analysis on y_train, which is a larger dataset.

In [None]:
preds_train = model.predict(X_train)

In [None]:
# calculate roc curves
fpr, tpr, thresholds = roc_curve(y_train, preds_train)
# plot the roc curve for the model
plt.plot([0,1], [0,1], linestyle='--', label='No Skill')
plt.plot(fpr, tpr, marker='.', label='BiGRU')
# axis labels
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
# show the plot
plt.show()

In [None]:
# calculate the g-mean for each threshold
gmeans = sqrt(tpr * (1-fpr))

In [None]:
plt.scatter(thresholds, gmeans)
plt.show()

In [None]:
# locate the index of the largest g-mean
ix = argmax(gmeans)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds[ix], gmeans[ix]))

When we used a threshold of 0.15, which looked best on the scatter plot, the score on the Kaggle public dataset decreased. G-Mean is probably not a good metric here. Let's use the F1 score directly as our metric and perform a grid search over the threshold values.

In [None]:
#Let's get the list of threshold values to search over
t = [round(i,1) for i in thresholds]
t = sorted(set(t))   # sort so the F1 scores below line up with increasing thresholds
t = [i for i in t if i <= 1]

In [None]:
t

In [None]:
f1scores=[]
for threshold in t:
  y_pred = [1 if i>threshold else 0 for i in preds]
  f1scores.append(f1_score(y_test, y_pred))

In [None]:
f1scores

In [None]:
#Let's check the same on train data
f1scores=[]
for threshold in t:
  y_pred = [1 if i>threshold else 0 for i in preds_train]
  f1scores.append(f1_score(y_train, y_pred))

In [None]:
f1scores

A threshold value of 0.3 seems to be the optimum cutoff for the best F1 score on both the test and train datasets
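
The grid above moves in steps of 0.1. A finer search is possible with scikit-learn's `precision_recall_curve`, which evaluates every distinct predicted score as a candidate threshold. The sketch below uses made-up labels and scores, and `best_f1_threshold` is a hypothetical helper. Note that `precision_recall_curve` treats a score equal to the threshold as positive (`>=`), whereas the loops above use `>`.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_f1_threshold(y_true, y_prob):
    """Return (threshold, f1) maximising F1 over all candidate thresholds."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
    # The final precision/recall pair has no associated threshold, so drop it
    precision, recall = precision[:-1], recall[:-1]
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom,
                   out=np.zeros_like(denom), where=denom > 0)
    best = int(np.argmax(f1))
    return float(thresholds[best]), float(f1[best])

# Made-up labels and predicted probabilities
toy_y = np.array([0, 0, 0, 0, 1, 0, 1, 1])
toy_p = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.70, 0.90])

best_thr, best_f1 = best_f1_threshold(toy_y, toy_p)
print(best_thr, best_f1)
```

A threshold tuned this closely on one split can overfit to it, so it is worth confirming the choice on a second split, as done above with the train data.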

Let's use the learnt model to make predictions on df_test

In [None]:
# Function to get the predictions
def predictions(df_test, threshold):

  questions_list_idx = tokenizer.texts_to_sequences([' '.join(seq[:MAX_SENT_LEN]) for seq in df_test['question_text']])

  # Pad the sequences of the data
  questions_list_idx = pad_sequences(questions_list_idx, maxlen=MAX_SENT_LEN, padding='post', truncating='post')

  # Get the predicted probabilities from the GRU model
  question_preds = model.predict(questions_list_idx)

  # Add the predicted probabilities to the test data
  df_test['predictions'] = question_preds

  # Apply the threshold to turn probabilities into class labels
  pred_target = np.array(list(map(lambda x : 1 if x > threshold else 0, question_preds)))

  # Add the predicted labels to the test data
  df_test['predicted target'] = pred_target

  return df_test

In [None]:
#df_test = predictions(df_test, 0.15)

In [None]:
df_test = predictions(df_test, 0.3)

In [None]:
df_test.shape

In [None]:
df_test.to_csv('/content/drive/MyDrive/colab_data_files/results/df_test_3.csv', index=False)

In [None]:
df_test.head()

In [None]:
def df_test_to_submission(df_test):
  result=df_test.rename(columns = {'predicted target':'target'})
  result= result[['qid', 'target']]
  return result

In [None]:
submission = df_test_to_submission(df_test)

In [None]:
submission.shape

In [None]:
submission.head()

There is an issue with the submission file on Kaggle: 23 qids that do not exist in the test dataset are expected in the sample submission file. We default these 23 entries to a target value of 0 to make a valid submission.
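
The same alignment can be expressed as a single left merge against the sample submission, which both fills in the missing qids and drops unexpected ones. This is a minimal sketch on made-up frames; `align_to_sample` is a hypothetical helper and it assumes each qid appears at most once in the predictions.

```python
import pandas as pd

def align_to_sample(submission, sample_df, default_target=0):
    """Keep exactly the qids in sample_df, in its order; fill missing targets."""
    aligned = sample_df[['qid']].merge(submission[['qid', 'target']], on='qid', how='left')
    aligned['target'] = aligned['target'].fillna(default_target).astype(int)
    return aligned

# Made-up frames: 'q3' is missing from the predictions, 'q9' is not expected
toy_sample = pd.DataFrame({'qid': ['q1', 'q2', 'q3'], 'target': [0, 0, 0]})
toy_submission = pd.DataFrame({'qid': ['q2', 'q1', 'q9'], 'target': [1, 0, 1]})

aligned = align_to_sample(toy_submission, toy_sample)
print(aligned)
```

The left merge keeps the row order of the sample file, which can make the result easier to compare against it.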

In [None]:
sample_df = pd.read_csv('/content/drive/MyDrive/colab_data_files/kaggle_1_dataset/sample_submission.csv')

In [None]:
sub_df = pd.read_csv('/content/drive/MyDrive/colab_data_files/kaggle_1_dataset/test_dataset.csv')

In [None]:
sample_df.shape

In [None]:
problematic_qids =list(set(sample_df['qid'].values)-set(sub_df['qid'].values))

In [None]:
problematic_qids

In [None]:
len(problematic_qids)

In [None]:
submission.shape

In [None]:
# DataFrame.append was removed in pandas 2.0, so add the missing qids with pd.concat
missing_rows = pd.DataFrame({'qid': problematic_qids, 'target': 0})
submission = pd.concat([submission, missing_rows], ignore_index=True)

In [None]:
submission.shape

In [None]:
#Let's remove those entries in submission which are not there in sample submission file
extra_qids =list(set(submission['qid'].values)-set(sample_df['qid'].values))
len(extra_qids)


In [None]:
submission =submission[submission['qid'].isin(list(sample_df['qid'].values))]

In [None]:
submission.shape

In [None]:
submission.to_csv('/content/drive/MyDrive/colab_data_files/results/submission_3.csv', index=False)

# ***Let's train an LSTM to see if the performance improves***

We will not remove stop words from the question text for the LSTM model. Since the problem is to find toxic/inappropriate questions, words like 'is', 'what', 'why' might be useful.

In [None]:
df_train = pd.read_csv("/content/drive/MyDrive/colab_data_files/kaggle_1_dataset/train_dataset.csv")
df_test = pd.read_csv("/content/drive/MyDrive/colab_data_files/kaggle_1_dataset/test_dataset.csv")

In [None]:
df_train['question_text'] = df_train['question_text'].apply(lambda x:simple_preprocess(x, max_len=30))
df_test['question_text'] = df_test['question_text'].apply(lambda x:simple_preprocess(x, max_len=30))


In [None]:
# Hyperparameters
MAX_SENT_LEN = 40   # Number of words to consider from each question_text
MAX_VOCAB_SIZE = 165000   # Max vocabulary size
BATCH_SIZE = 128
N_EPOCHS = 15

In [None]:
#Let's do padding to maintain a consistent input length
tokenizer = Tokenizer(num_words=MAX_VOCAB_SIZE)
tokenizer.fit_on_texts([' '.join(seq[:MAX_SENT_LEN]) for seq in df_train['question_text']])

print("Number of words in vocabulary:", len(tokenizer.word_index))

# Convert the sequence of words to a sequence of indices
X = tokenizer.texts_to_sequences([' '.join(seq[:MAX_SENT_LEN]) for seq in df_train['question_text']])
X = pad_sequences(X, maxlen=MAX_SENT_LEN, padding='post', truncating='post')

y = df_train['target']


In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42, train_size=0.9)

In [None]:
#Let's load the 300 dimensional GloVe embedding
import pickle
with open("/content/drive/MyDrive/colab_data_files/glove_embeddings_unpacked_6b_300d.pkl", "rb") as file_to_read:
  loaded_dict = pickle.load(file_to_read)
embeddings_index=loaded_dict['embeddings_index']



In [None]:
# Create an embedding matrix where each row represents a word from the vocabulary we obtained from the training data and the columns represent the embedding dimensions
# Adding 1 because index 0 is reserved (used for padding)
words_not_found = []
vocab_size = len(tokenizer.word_index) + 1
print('Loaded %s word vectors.' % len(embeddings_index))

embedding_dim = 300

# Create a weight matrix for words in the training data
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, i in tokenizer.word_index.items():
  if i >= vocab_size:
    continue
  embedding_vector = embeddings_index.get(word)
  if (embedding_vector is not None) and len(embedding_vector) > 0:
    embedding_matrix[i] = embedding_vector
  else:
    words_not_found.append(word)



In [None]:
len(words_not_found)

In [None]:
# Build a sequential model by stacking neural net units
model2 = Sequential()
embedding_layer = Embedding(vocab_size,
                            embedding_dim,
                            weights = [embedding_matrix],
                            input_length = MAX_SENT_LEN,
                            trainable=False)
model2.add(embedding_layer)
model2.add(Bidirectional(LSTM(128, return_sequences=True, dropout=0.50, name='first_lstm_layer')))
model2.add(Dropout(0.5))
model2.add(Bidirectional(LSTM(64, name='second_lstm_layer')))
model2.add(Dropout(0.5))
model2.add(Dense(64, activation='relu'))
model2.add(Dropout(0.5))
model2.add(Dense(1, activation='sigmoid', name='output_layer'))

In [None]:
print('Summary of the built model...')
model2.summary()

In [None]:
model2.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

In [None]:
model2.fit(X_train, y_train,
          batch_size=BATCH_SIZE,
          epochs=N_EPOCHS,
          validation_data=(X_test, y_test))

In [None]:
model2.save('/content/drive/MyDrive/colab_data_files/models/bilstm_128_64_64.keras')

We have saved the model for re-use. So loading it from there.

In [None]:
model2 = tf.keras.models.load_model('/content/drive/MyDrive/colab_data_files/models/bilstm_128_64_64.keras')


In [None]:
# model predictions on the test data
preds = model2.predict(X_test)


In [None]:
from sklearn.metrics import roc_curve
from numpy import sqrt
from numpy import argmax

In [None]:
t=[0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]

In [None]:
f1scores=[]
for threshold in t:
  y_pred = [1 if i>threshold else 0 for i in preds]
  f1scores.append(f1_score(y_test, y_pred))

In [None]:
f1scores

In [None]:
# model predictions on the test data
preds_train = model2.predict(X_train)


In [None]:
#Let's check the same on train data
f1scores=[]
for threshold in t:
  y_pred = [1 if i>threshold else 0 for i in preds_train]
  f1scores.append(f1_score(y_train, y_pred))

In [None]:
f1scores

A threshold value of 0.5 seems to be the optimum cutoff for the best f1 score on both test and train dataset

Let's use the learnt model to make predictions on df_test

In [None]:
# Function to get the predictions
def predictions(df_test, threshold):

  questions_list_idx = tokenizer.texts_to_sequences([' '.join(seq[:MAX_SENT_LEN]) for seq in df_test['question_text']])

  # Pad the sequences of the data
  questions_list_idx = pad_sequences(questions_list_idx, maxlen=MAX_SENT_LEN, padding='post', truncating='post')

  # Get the predicted probabilities from the LSTM model
  question_preds = model2.predict(questions_list_idx)

  # Add the predicted probabilities to the test data
  df_test['predictions'] = question_preds

  # Apply the threshold to turn probabilities into class labels
  pred_target = np.array(list(map(lambda x : 1 if x > threshold else 0, question_preds)))

  # Add the predicted labels to the test data
  df_test['predicted target'] = pred_target

  return df_test

In [None]:
df_test = predictions(df_test, 0.5)
df_test.to_csv('/content/drive/MyDrive/colab_data_files/results/df_test_4.csv', index=False)

In [None]:
def df_test_to_submission(df_test):
  result=df_test.rename(columns = {'predicted target':'target'})
  result= result[['qid', 'target']]
  return result

In [None]:
submission = df_test_to_submission(df_test)

There is an issue with the submission file on Kaggle: 23 qids that do not exist in the test dataset are expected in the sample submission file. We default these 23 entries to a target value of 0 to make a valid submission.

In [None]:
sample_df = pd.read_csv('/content/drive/MyDrive/colab_data_files/kaggle_1_dataset/sample_submission.csv')
sub_df = pd.read_csv('/content/drive/MyDrive/colab_data_files/kaggle_1_dataset/test_dataset.csv')
problematic_qids =list(set(sample_df['qid'].values)-set(sub_df['qid'].values))
print(len(problematic_qids))
# DataFrame.append was removed in pandas 2.0, so add the missing qids with pd.concat
missing_rows = pd.DataFrame({'qid': problematic_qids, 'target': 0})
submission = pd.concat([submission, missing_rows], ignore_index=True)

submission =submission[submission['qid'].isin(list(sample_df['qid'].values))]
print(submission.shape)




In [None]:
submission.to_csv('/content/drive/MyDrive/colab_data_files/results/submission_4.csv', index=False)

# **The above file is our final version of submission. We could not complete BERT training due to compute issues**

# ***Let's train a BERT model to see if the accuracy improves further***

In [None]:
#!pip install transformers

In [None]:
import os
import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset
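from torch.optim import AdamW  # the AdamW shipped with transformers is deprecated in favour of the PyTorch one
from transformers import BertTokenizer, BertModel, get_linear_schedule_with_warmup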
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import pandas as pd

In [None]:
df_train = pd.read_csv("/content/drive/MyDrive/colab_data_files/kaggle_1_dataset/train_dataset.csv")
df_test = pd.read_csv("/content/drive/MyDrive/colab_data_files/kaggle_1_dataset/test_dataset.csv")

tr_texts = df_train['question_text'].to_list()
te_texts = df_test['question_text'].to_list()

tr_labels=df_train['target'].to_list()


In [None]:
#Let's create a custom dataset class for text classification
class TextClassificationDataset(Dataset):
  def __init__(self, texts, labels, tokenizer, max_length):
    self.texts = texts
    self.labels = labels
    self.tokenizer = tokenizer
    self.max_length = max_length
  def __len__(self):
    return len(self.texts)
  def __getitem__(self, idx):
    text = self.texts[idx]
    label = self.labels[idx]
    encoding = self.tokenizer(text, return_tensors='pt', max_length=self.max_length, padding='max_length', truncation=True)
    return {'input_ids': encoding['input_ids'].flatten(), 'attention_mask': encoding['attention_mask'].flatten(), 'label': torch.tensor(label)}



In [None]:
#Build our custom BERT classifier
class BERTClassifier(nn.Module):
  def __init__(self, bert_model_name, num_classes):
    super(BERTClassifier, self).__init__()
    self.bert = BertModel.from_pretrained(bert_model_name)
    self.dropout = nn.Dropout(0.1)
    self.fc = nn.Linear(self.bert.config.hidden_size, num_classes)

  def forward(self, input_ids, attention_mask):
    outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
    pooled_output = outputs.pooler_output
    x = self.dropout(pooled_output)
    logits = self.fc(x)
    return logits

In [None]:
#Define the train() function
def train(model, data_loader, optimizer, scheduler, device):
  model.train()
  for batch in data_loader:
    optimizer.zero_grad()
    input_ids = batch['input_ids'].to(device)
    attention_mask = batch['attention_mask'].to(device)
    labels = batch['label'].to(device)
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    loss = nn.CrossEntropyLoss()(outputs, labels)
    loss.backward()
    optimizer.step()
    scheduler.step()

In [None]:
# Build our evaluation method
def evaluate(model, data_loader, device):
  model.eval()
  predictions = []
  actual_labels = []
  with torch.no_grad():
    for batch in data_loader:
      input_ids = batch['input_ids'].to(device)
      attention_mask = batch['attention_mask'].to(device)
      labels = batch['label'].to(device)
      outputs = model(input_ids=input_ids, attention_mask=attention_mask)
      _, preds = torch.max(outputs, dim=1)
      predictions.extend(preds.cpu().tolist())
      actual_labels.extend(labels.cpu().tolist())
  return accuracy_score(actual_labels, predictions), classification_report(actual_labels, predictions)

In [None]:
#Build our prediction method
def predict_toxicity(text, model, tokenizer, device, max_length=128):
  model.eval()
  encoding = tokenizer(text, return_tensors='pt', max_length=max_length, padding='max_length', truncation=True)
  input_ids = encoding['input_ids'].to(device)
  attention_mask = encoding['attention_mask'].to(device)

  with torch.no_grad():
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    _, preds = torch.max(outputs, dim=1)
  return 1 if preds.item() == 1 else 0




In [None]:
#Define our model’s parameters
bert_model_name = 'bert-base-uncased'
num_classes = 2
max_length = 128
batch_size = 40
num_epochs = 5
learning_rate = 2e-5




In [None]:
#Loading and splitting the data.
train_texts, val_texts, train_labels, val_labels = train_test_split(tr_texts, tr_labels, test_size=0.2, random_state=42)


In [None]:
#Initialize tokenizer, dataset, and data loader
tokenizer = BertTokenizer.from_pretrained(bert_model_name)
train_dataset = TextClassificationDataset(train_texts, train_labels, tokenizer, max_length)
val_dataset = TextClassificationDataset(val_texts, val_labels, tokenizer, max_length)
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=batch_size)


In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = BERTClassifier(bert_model_name, num_classes).to(device)

In [None]:
#Set up optimizer and learning rate scheduler
optimizer = AdamW(model.parameters(), lr=learning_rate)
total_steps = len(train_dataloader) * num_epochs
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=total_steps)

In [None]:
#Training the model
for epoch in range(num_epochs):
  print(f"Epoch {epoch + 1}/{num_epochs}")
  train(model, train_dataloader, optimizer, scheduler, device)
  accuracy, report = evaluate(model, val_dataloader, device)
  print(f"Validation Accuracy: {accuracy:.4f}")
  print(report)


Training could not be completed because it takes too long: the compute provided by Colab stops running midway.
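
A rough count of optimizer steps shows why this run is long. The sketch below uses only numbers already stated in this notebook (dataset size, split fraction, batch size, epochs). The time per step is a purely hypothetical figure chosen for illustration, not a measurement.

```python
from math import ceil

n_questions = 1044897      # training questions, from the dataset description
val_fraction = 0.2         # test_size used in train_test_split above
batch_size = 40
num_epochs = 5

n_val = ceil(n_questions * val_fraction)   # scikit-learn rounds the held-out size up
n_train = n_questions - n_val
steps_per_epoch = ceil(n_train / batch_size)
total_steps = steps_per_epoch * num_epochs

print(n_train, steps_per_epoch, total_steps)  # 835917 20898 104490

# Hypothetical speed, NOT measured: adjust to what the GPU actually achieves
assumed_seconds_per_step = 0.5
print(round(total_steps * assumed_seconds_per_step / 3600, 1), "hours")
```

Under that assumed speed the run would exceed a typical Colab session, which suggests options such as training on a stratified subsample, fewer epochs, or a shorter `max_length`, none of which were tried here.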

In [None]:
torch.save(model.state_dict(), "bert_classifier.pth")