# 🧠 PART A

### 📌 **Import Libraries and Load spaCy Model**

This cell imports the necessary Python libraries for the preprocessing task:

- `spacy` for natural language processing
- `pandas` for data handling
- `re` and `string` for text cleaning
- `tqdm` for progress bars
- `time` to measure execution time

We also ask spaCy to use the GPU if one is available (`spacy.prefer_gpu()` stays on the CPU otherwise). The `en_core_web_sm` model is loaded with the named entity recognizer and dependency parser disabled: neither is needed for this preprocessing, and skipping them makes processing faster.

In [1]:
import spacy
import pandas as pd
import re
import string
from tqdm import tqdm
import time

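# Use the GPU if one is available; otherwise spaCy stays on the CPU.
spacy.prefer_gpu()
# Skip NER and dependency parsing; only tokens, stopword flags, and lemmas are used below.
nlp = spacy.load("en_core_web_sm", disable=["ner", "parser"])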

### 📌 **Load the Training and Test Datasets**

In this step, we load the training and test datasets from the specified paths using `pandas.read_csv()`. We also print the size of each dataset and preview a couple of rows from the training set.

**Make sure to update the paths to your actual dataset locations.**

In [2]:
train_df = pd.read_csv('/kaggle/input/nlp-a2-dataset/train.csv')
test_df = pd.read_csv('/kaggle/input/nlp-a2-dataset/test.csv')

print(f"Training set size: {len(train_df)}")
print(f"Test set size: {len(test_df)}")
print("\nSample raw training data:")
print(train_df.head(2))

Training set size: 13879
Test set size: 100

Sample raw training data:
                     title                                               text
0  Port St. Lucie, Florida  Port St. Lucie is a city in St. Lucie County, ...
1              Dirty Dozen  Dirty Dozen may refer to:\n\nBooks, film and t...


### 🧹 **Preprocess Text Data**

Each document is cleaned and normalized before modeling. The `preprocess_text` function applies these steps in order:

1. **Convert text to lowercase**
2. **Remove non-ASCII characters**
3. **Remove punctuation**
4. **Tokenize the text using spaCy**
5. **Remove stopwords** and non-alphabetic tokens (such as numbers)
6. **Lemmatize** the remaining words

We also use `tqdm` to show a progress bar, since running spaCy over the full training set takes several minutes.
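
The first three steps need only the standard library, so they can be tried in isolation before spaCy is involved. The sketch below is illustrative: `basic_clean`, `PUNCT_TABLE`, and the sample string are hypothetical names and data, not part of the notebook.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def basic_clean(text: str) -> str:
    """Lowercase, drop non-ASCII characters, and strip ASCII punctuation."""
    text = text.lower()
    # errors="ignore" silently discards characters that have no ASCII encoding.
    text = text.encode("ascii", errors="ignore").decode("ascii")
    return text.translate(PUNCT_TABLE)


# Made-up sample string, used only for illustration.
sample = "St. Lucie County's Café"
print(basic_clean(sample))  # st lucie countys caf
```

Two side effects are visible in the output: accented characters are dropped rather than transliterated ("café" becomes "caf"), and removing apostrophes fuses possessives and contractions into single tokens ("county's" becomes "countys").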

In [3]:
tqdm.pandas()

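def preprocess_text(text):
    # Lowercase, then drop any non-ASCII characters.
    text = text.lower().encode('ascii', errors='ignore').decode('ascii')
    # Strip ASCII punctuation.
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Tokenize with spaCy, keep alphabetic non-stopword tokens, and lemmatize them.
    doc = nlp(text)
    tokens = [token.lemma_ for token in doc if token.is_alpha and not token.is_stop]
    return " ".join(tokens)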


start = time.time()

# Process training data
print("Processing training data...")
train_df['processed_text'] = train_df['text'].progress_apply(preprocess_text)

# Process test data
print("\nProcessing test data...")
test_df['processed_text'] = test_df['text'].progress_apply(preprocess_text)

print("\nTotal time taken in preprocessing: ", time.time() - start)

Processing training data...
100%|██████████| 13879/13879 [15:39<00:00, 14.77it/s]

Processing test data...
100%|██████████| 100/100 [00:06<00:00, 14.97it/s]

Total time taken in preprocessing:  946.4992537498474
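
Most of the time above is spent calling `nlp(text)` once per row. spaCy's `nlp.pipe` processes texts in batches and is the usual recommendation for large collections. Below is a minimal sketch of that alternative, assuming the `nlp` object from the first cell is passed in. `preprocess_batch`, `basic_clean`, and the `batch_size` value of 64 are hypothetical choices, and no speedup has been measured on this dataset.

```python
import string
from typing import Iterable, List

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def basic_clean(text: str) -> str:
    """Lowercase, drop non-ASCII characters, and strip ASCII punctuation."""
    text = text.lower().encode("ascii", errors="ignore").decode("ascii")
    return text.translate(PUNCT_TABLE)


def preprocess_batch(texts: Iterable[str], nlp, batch_size: int = 64) -> List[str]:
    """Apply the same cleaning and filtering as preprocess_text, batched via nlp.pipe.

    `nlp` is assumed to be a loaded spaCy pipeline (any object with a
    compatible `pipe` method works).
    """
    cleaned = (basic_clean(text) for text in texts)
    results = []
    for doc in nlp.pipe(cleaned, batch_size=batch_size):
        tokens = [tok.lemma_ for tok in doc if tok.is_alpha and not tok.is_stop]
        results.append(" ".join(tokens))
    return results


# Possible use with the objects defined in the cells above:
# train_df['processed_text'] = preprocess_batch(train_df['text'], nlp)
```

The output should match `preprocess_text` row for row, since only the way documents are fed to the pipeline changes.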





### 🧪 **Create a Validation Set**

Here, we hold out the first 500 rows of the training dataset as a validation set for evaluating model performance during training. The remaining rows become the new training set, with the index reset to start at 0. Note that the split follows the file order; the rows are not shuffled first.

In [4]:
# Create validation set
validation_df = train_df.iloc[:500].copy()
train_df = train_df.iloc[500:].reset_index(drop=True)
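
If the rows in the file are ordered in some way (by topic, for example), the first 500 may not be representative. One alternative, which is not what this notebook does, is to pick the validation rows from a seeded shuffle. Below is a minimal sketch using only the standard library; `shuffled_split` and the seed value are hypothetical.

```python
import random
from typing import List, Tuple


def shuffled_split(n_rows: int, n_val: int, seed: int = 42) -> Tuple[List[int], List[int]]:
    """Return (validation_positions, training_positions) drawn from a seeded shuffle."""
    if not 0 <= n_val <= n_rows:
        raise ValueError("n_val must be between 0 and n_rows")
    positions = list(range(n_rows))
    # A dedicated Random instance keeps the split reproducible without
    # touching the global random state.
    random.Random(seed).shuffle(positions)
    return sorted(positions[:n_val]), sorted(positions[n_val:])


val_pos, train_pos = shuffled_split(n_rows=13879, n_val=500)
print(len(val_pos), len(train_pos))  # 500 13379

# Possible use with the DataFrame from the cells above:
# validation_df = train_df.iloc[val_pos].copy()
# train_df = train_df.iloc[train_pos].reset_index(drop=True)
```

Fixing the seed makes the split repeatable across runs, so validation scores stay comparable.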

### 📊 **Dataset Summary After Preprocessing**

This cell prints the size of each split after preprocessing and the validation split, then shows the first two processed texts from the training set alongside their titles.

In [5]:
# Shape after preprocessing
print("\nDataset shape after preprocessing:")
print(f"Training set size: {len(train_df)}")
print(f"Validation set size: {len(validation_df)}")
print(f"Test set size: {len(test_df)}")

print("\nSample processed data:")
print(train_df[['title', 'processed_text']].head(2))


Dataset shape after preprocessing:
Training set size: 13379
Validation set size: 500
Test set size: 100

Sample processed data:
           title                                     processed_text
0    Mike & Mike  mike mike mike mike morning american sportstal...
1  Carson Palmer  carson hilton palmer bear december american fo...
