Offensive Detection: Preprocessing and Embedding
 
In this notebook, we perform the following steps:
 
1. **Data Loading and Exploration**: Load the training data and inspect its columns.
2. **Data Preprocessing**: Drop unnecessary columns and clean the tweet text by removing URLs, punctuation, stopwords, and extra spaces.
3. **Saving Preprocessed Data**: Save the cleaned dataset as `train_prepro.csv`.
4. **Embedding using GloVe**: Tokenize the cleaned tweet text, convert it to padded sequences, and create an embedding matrix using pretrained GloVe embeddings.
5. **Saving the Embedding Matrix**: Save the embedding matrix for later use in model building.


1. Data Loading and Exploration
 
In this section, we load the CSV file that contains the training data. The dataset includes the following columns:
- `count`
- `hate_speech_count`
- `offensive_language_count`
- `neither_count`
- `tweet`
- `class`
 
We will inspect the data to understand its structure.
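Two quick checks are often worth running at this stage: missing values and class balance. Missing tweets matter because a text-cleaning function built on `re.sub` raises a `TypeError` when it receives `NaN`. The sketch below uses a made-up `toy_df` (an assumption for illustration, not the real data) with the same `tweet` and `class` columns.

```python
import pandas as pd

# Made-up rows (illustrative only) with the two columns used later in the notebook
toy_df = pd.DataFrame({
    "tweet": ["first example", "second example", None, "fourth example"],
    "class": [1, 1, 2, 0],
})

# Number of missing values in each column
missing_per_column = toy_df.isna().sum()

# Number of rows per class label, ordered by label
class_counts = toy_df["class"].value_counts().sort_index()

print(missing_per_column)
print(class_counts)
```

On the real data, the same two calls can be applied to `train_df` directly.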


Import

In [None]:
import re
import string

import nltk
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

In [None]:
train_df = pd.read_csv(r'C:\Users\hp\Downloads\nlp ass2\train (2).csv')

print("Columns in the dataset:")
print(train_df.columns)
print("\nTraining Data Sample:")
print(train_df.head())


Columns in the dataset:
Index(['count', 'hate_speech_count', 'offensive_language_count',
       'neither_count', 'tweet', 'class'],
      dtype='object')

Training Data Sample:
   count  hate_speech_count  offensive_language_count  neither_count  \
0      3                  2                         0              1   
1      3                  0                         0              3   
2      3                  0                         3              0   
3      3                  0                         3              0   
4      6                  0                         6              0   

                                               tweet  class  
0  RT @FunSizedYogi: @TheBlackVoice well how else...      0  
1  Funny thing is....it's not just the people doi...      2  
2  RT @winkSOSA: "@AintShitSweet__: "@Rakwon_OGOD...      1  
3  @Jbrendaro30 @ZGabrail @ramsin1995 @GabeEli8 @...      1  
4                                S/o that real bitch      1  


2. Data Preprocessing
 
In this section, we:
- **Drop Unnecessary Columns**: We only need the `tweet` and `class` columns for text classification.
- **Clean the Tweet Text**: Remove URLs, punctuation, stopwords, and extra spaces to reduce noise in the text.
 
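The cleaned text is stored in a new `clean_tweet` column. The same cell also fits the tokenizer and pads the sequences; those two steps are described in Section 4.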
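To make the cleaning steps concrete, here is a minimal sketch that applies them, in the same order, to one made-up tweet. `clean_text_demo` and `TOY_STOPWORDS` are hypothetical names introduced for illustration; the notebook itself uses NLTK's English stopword list.

```python
import re
import string

# Hypothetical mini stopword list; the notebook uses NLTK's English list instead
TOY_STOPWORDS = {"the", "is", "a", "this", "at", "to", "and"}

def clean_text_demo(text, stop_words=TOY_STOPWORDS):
    """Remove URLs and punctuation, lowercase, then drop stopwords."""
    text = re.sub(r"http\S+|www\S+", "", text)                        # drop URLs
    text = text.translate(str.maketrans("", "", string.punctuation))  # drop punctuation
    text = text.lower()
    # split() with no arguments also collapses runs of whitespace
    tokens = [word for word in text.split() if word not in stop_words]
    return " ".join(tokens)

sample = "RT @user: This is   a TEST tweet!!! Look at https://example.com #nlp"
cleaned = clean_text_demo(sample)
print(cleaned)  # rt user test tweet look nlp
```

Note that the retweet marker and the handle survive as `rt` and `user`: only the `@` and `:` characters are stripped as punctuation, so mentions are not removed by these steps.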


In [None]:

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Define a function to clean tweet text
def clean_text(text):
    # Remove URLs
    text = re.sub(r'http\S+|www\S+', '', text)
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Convert to lowercase
    text = text.lower()
    # Remove stopwords; split() also collapses runs of whitespace
    tokens = text.split()
    tokens = [word for word in tokens if word not in stop_words]
    # Join the remaining tokens into a single space-separated string
    cleaned_text = " ".join(tokens)
    return cleaned_text

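# Keep only the text and label columns (train_df was loaded in Section 1)
train_df_clean = train_df[['tweet', 'class']].copy()

# Store the cleaned text alongside the original tweet
train_df_clean['clean_tweet'] = train_df_clean['tweet'].apply(clean_text)

max_words = 10000  # vocabulary cap passed to Tokenizer(num_words=...)
max_len = 100      # every sequence is padded or truncated to this length

# Build the vocabulary from the cleaned tweets
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(train_df_clean['clean_tweet'])

# Convert tweets to integer sequences and pad them to max_len
sequences = tokenizer.texts_to_sequences(train_df_clean['clean_tweet'])
X = pad_sequences(sequences, maxlen=max_len)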

print("Shape of tokenized and padded tweet data:", X.shape)


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Shape of tokenized and padded tweet data: (19826, 100)


3. Saving Preprocessed Data
 
We now save the preprocessed training data (with the cleaned tweets) as `train_prepro.csv` so that it can be reused later in the pipeline.


In [None]:
train_df_clean.to_csv('train_prepro.csv', index=False)
print("Preprocessed training data saved as 'train_prepro.csv'")


Preprocessed training data saved as 'train_prepro.csv'


4. Embedding using GloVe
 
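This section covers three steps. The first two were already run at the end of the preprocessing cell in Section 2; the third is done here:
- **Tokenize the Cleaned Tweet Text**: Convert the text into sequences of integers, with the vocabulary capped by `max_words = 10000`.
- **Pad the Sequences**: Pad or truncate each sequence to a fixed length of `max_len = 100`.
- **Load Pretrained GloVe Embeddings**: Load the GloVe file (e.g., `glove.6B.100d.txt`) and create an embedding matrix that maps words in our vocabulary to their GloVe vectors.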
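The tokenizing and padding steps can be illustrated without TensorFlow. The sketch below is a pure-Python approximation, not the Keras implementation: `build_word_index` and `to_padded` are hypothetical helpers that mimic the Keras defaults (indices start at 1 and are ranked by frequency, ids at or above `num_words` are dropped, and both padding and truncation happen at the front of the sequence).

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency, starting at 1; index 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_padded(texts, word_index, num_words, max_len):
    """Map words to ids, drop ids >= num_words, then truncate and zero-pad on the left."""
    rows = []
    for text in texts:
        ids = [word_index[w] for w in text.split()
               if w in word_index and word_index[w] < num_words]
        ids = ids[-max_len:]                           # keep the last max_len ids
        rows.append([0] * (max_len - len(ids)) + ids)  # zeros go in front
    return rows

texts = ["good morning", "good good night everyone"]
word_index = build_word_index(texts)
padded = to_padded(texts, word_index, num_words=10, max_len=3)

print(word_index)  # {'good': 1, 'morning': 2, 'night': 3, 'everyone': 4}
print(padded)      # [[0, 1, 2], [1, 3, 4]]
```

The short text is padded with a leading zero, and the long one loses its first token. With `max_len = 100`, truncation should be rare for tweets, so most rows of `X` are mainly leading zeros.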


In [None]:
# Load GloVe embeddings
embedding_index = {}
glove_file = 'glove.6B.100d.txt'  

with open(glove_file, encoding='utf8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embedding_index[word] = coefs

print("Total number of words in GloVe:", len(embedding_index))


Total number of words in GloVe: 400000


In [None]:
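embedding_dim = 100  # must match the GloVe file (glove.6B.100d.txt)
word_index = tokenizer.word_index
# Row 0 is reserved for padding, hence the + 1
num_words = min(max_words, len(word_index) + 1)
embedding_matrix = np.zeros((num_words, embedding_dim))

for word, i in word_index.items():
    if i >= num_words:
        # Skip words outside the vocabulary cap
        continue
    embedding_vector = embedding_index.get(word)
    if embedding_vector is not None:
        # Words missing from GloVe keep an all-zero row
        embedding_matrix[i] = embedding_vector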

print("Embedding matrix shape:", embedding_matrix.shape)


Embedding matrix shape: (10000, 100)
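Any vocabulary word that is absent from GloVe keeps an all-zero row, which is likely for slang, misspellings, and usernames in tweets. A coverage check shows how much of the vocabulary actually received a pretrained vector. The sketch below uses toy stand-ins for `word_index` and `embedding_index` (assumptions for illustration); the same logic should apply to the real objects.

```python
import numpy as np

# Toy stand-ins for the notebook's objects (illustrative values only)
embedding_dim = 4
word_index = {"good": 1, "morning": 2, "lol": 3, "smh": 4}
embedding_index = {
    "good": np.ones(embedding_dim, dtype="float32"),
    "morning": np.full(embedding_dim, 0.5, dtype="float32"),
}

num_words = len(word_index) + 1  # row 0 is the padding row
embedding_matrix = np.zeros((num_words, embedding_dim))
for word, i in word_index.items():
    vector = embedding_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector

# Words inside the vocabulary cap that have no pretrained vector
missing = [w for w, i in word_index.items()
           if i < num_words and w not in embedding_index]
coverage = 1.0 - len(missing) / (num_words - 1)

print(f"coverage: {coverage:.0%}, missing: {missing}")
```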


5. Saving the Embedding Matrix
 
We now save the embedding matrix to a file so that it can be loaded later during model training without needing to reprocess the GloVe file.


In [None]:
np.savetxt('embedding_matrix.csv', embedding_matrix, delimiter=',')
print("Embedding matrix saved as 'embedding_matrix.csv'")


Embedding matrix saved as 'embedding_matrix.csv'
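When the matrix is needed again, `np.loadtxt` reads the CSV back. The sketch below round-trips a small random matrix (a stand-in for the real one) through a temporary directory. It also shows NumPy's binary `.npy` format via `np.save` and `np.load`, an alternative this notebook does not use, which stores the array in binary rather than as text.

```python
import os
import tempfile

import numpy as np

# Small random matrix standing in for the real embedding matrix
embedding_matrix = np.random.default_rng(0).normal(size=(5, 4))

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "embedding_matrix.csv")
    npy_path = os.path.join(tmp, "embedding_matrix.npy")

    # Text format, as used in this notebook
    np.savetxt(csv_path, embedding_matrix, delimiter=",")
    from_csv = np.loadtxt(csv_path, delimiter=",")

    # Binary format (alternative, not used in this notebook)
    np.save(npy_path, embedding_matrix)
    from_npy = np.load(npy_path)

print(from_csv.shape, np.allclose(from_csv, embedding_matrix))
print(from_npy.shape, np.array_equal(from_npy, embedding_matrix))
```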


#  Summary

In this notebook, we:
- Loaded and explored the training data.
- Dropped unnecessary columns and cleaned the tweet text by removing noise (URLs, punctuation, stopwords, extra spaces).
- Saved the preprocessed data as `train_prepro.csv`.
- Tokenized the cleaned tweet text, padded the sequences, and created an embedding matrix using pretrained GloVe embeddings.
- Saved the embedding matrix as `embedding_matrix.csv` for later use in model training.

This workflow prepares the data for the next steps: building and training the offensive language detection models.
