# **Week 11 – Natural Language Processing (NLP)**

This week introduces core Natural Language Processing techniques using both deep learning (word embeddings + LSTM) and traditional ML preprocessing (stopwords, tokenization, lemmatization, TF-IDF).

Since the main house price dataset does not contain text, the Class Task uses a sample text dataset **(IMDB Movie Reviews)**. Assignment 11 follows two paths:

- If your project dataset contains text → apply full NLP preprocessing

- If not (like house prices) → demonstrate NLP workflows on a small sample dataset

This approach ensures you learn NLP techniques even if your project dataset is tabular.

# **Class Task – Tokenization & Word Embeddings (IMDB Dataset)**

In the Class Task, I implemented tokenization and deep learning–based word embeddings using the IMDB Movie Reviews dataset. The steps included:

- Loading the dataset as sequences of integer tokens

- Padding sequences to a fixed length

- Building a model using an Embedding layer + LSTM

- Training and evaluating a sentiment classifier

This task helped me understand how raw text is transformed into structured numerical representations using tokenization and embeddings. I also observed how LSTMs capture contextual meaning and sequential patterns within text.
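The idea of turning text into integer tokens can be sketched in a few lines. This is a minimal illustration on a hypothetical three-review corpus, not the encoding code behind the Keras IMDB loader: words are ranked by frequency so that the most frequent word gets the smallest index, and index 0 is left free for padding. The names `reviews`, `word_index`, and `encoded` are assumptions made for this example. The Keras loader additionally reserves indices for a start marker and for unknown words.

```python
from collections import Counter

# Hypothetical toy corpus; the real IMDB data ships already encoded.
reviews = ["the movie was great", "the movie was bad", "great acting"]

# Count how often each word occurs across all reviews.
counts = Counter(word for review in reviews for word in review.split())

# Rank by descending count; ties are broken alphabetically so the result is deterministic.
ranked = sorted(counts, key=lambda w: (-counts[w], w))

# Index 0 is kept free for padding, so real words start at 1.
word_index = {word: rank for rank, word in enumerate(ranked, start=1)}

# Replace every word with its integer index.
encoded = [[word_index[w] for w in review.split()] for review in reviews]

print(word_index)  # {'great': 1, 'movie': 2, 'the': 3, 'was': 4, 'acting': 5, 'bad': 6}
print(encoded)     # [[3, 2, 4, 1], [3, 2, 4, 6], [1, 5]]
```

Limiting the vocabulary to the 10,000 most frequent words then amounts to keeping only the smallest indices and mapping everything else to an "unknown" token.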

**Step 1: Load IMDB Dataset**

In [2]:
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Load top 10,000 most frequent words
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=10000)

print("Training samples:", len(X_train))
print("Test samples:", len(X_test))


Training samples: 25000
Test samples: 25000


**Step 2: Pad Sequences**

In [3]:
max_len = 200  # sequence length after padding

X_train = pad_sequences(X_train, maxlen=max_len)
X_test = pad_sequences(X_test, maxlen=max_len)
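By default, `pad_sequences` pads at the front with zeros and, for sequences longer than `maxlen`, also truncates from the front, so the last `maxlen` tokens of a long review are the ones kept. The pure-Python helper below (`pad_pre` is a hypothetical name used only for illustration) mimics that default behaviour for a single sequence:

```python
def pad_pre(seq, maxlen, value=0):
    """Mimic default 'pre' padding and 'pre' truncation for one sequence."""
    seq = seq[-maxlen:]                          # keep only the last maxlen tokens
    return [value] * (maxlen - len(seq)) + seq   # left-pad with the fill value

print(pad_pre([5, 8, 2], 5))           # [0, 0, 5, 8, 2]
print(pad_pre([1, 2, 3, 4, 5, 6], 4))  # [3, 4, 5, 6]
```

With `max_len = 200`, reviews shorter than 200 tokens are left-padded with zeros and longer ones lose their opening tokens.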


**Step 3: Build Simple Embedding + LSTM Model**

In [4]:
from tensorflow.keras import models, layers

model = models.Sequential([
    layers.Embedding(input_dim=10000, output_dim=64, input_length=max_len),
    layers.LSTM(64),
    layers.Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()
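The output of `model.summary()` is not shown above, but the parameter counts can be worked out by hand. The sketch below assumes the standard LSTM parameterization (four gates, each with input weights, recurrent weights, and one bias vector); the variable names are chosen for this example only.

```python
vocab_size, embed_dim, lstm_units = 10000, 64, 64

# Embedding: one 64-dimensional vector per vocabulary entry.
embedding_params = vocab_size * embed_dim

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)

# Dense(1): one weight per LSTM unit plus a single bias.
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params

print(embedding_params)  # 640000
print(lstm_params)       # 33024
print(dense_params)      # 65
print(total_params)      # 673089
```

Almost all of the parameters sit in the embedding table, which is why the vocabulary size and the embedding dimension dominate the model size here.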



**Step 4: Train Model**

In [5]:
history = model.fit(X_train, y_train, epochs=3, batch_size=128, validation_split=0.2)

Epoch 1/3
157/157 ━━━━━━━━━━━━━━━━━━━━ 53s 304ms/step - accuracy: 0.7373 - loss: 0.4990 - val_accuracy: 0.8472 - val_loss: 0.3575
Epoch 2/3
157/157 ━━━━━━━━━━━━━━━━━━━━ 43s 273ms/step - accuracy: 0.8921 - loss: 0.2670 - val_accuracy: 0.8658 - val_loss: 0.3165
Epoch 3/3
157/157 ━━━━━━━━━━━━━━━━━━━━ 40s 254ms/step - accuracy: 0.9270 - loss: 0.1944 - val_accuracy: 0.8464 - val_loss: 0.3487


**Step 5: Evaluate**

In [6]:
loss, acc = model.evaluate(X_test, y_test)
print("Test Accuracy:", acc)

782/782 ━━━━━━━━━━━━━━━━━━━━ 25s 32ms/step - accuracy: 0.8504 - loss: 0.3464
Test Accuracy: 0.8503599762916565
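The numbers in the training and evaluation logs can be cross-checked with a little arithmetic. The sketch below copies the validation metrics from the training log above and assumes the default evaluation batch size of 32. It reproduces the step counts shown in the progress bars and picks out the epoch with the best validation result.

```python
import math

train_samples, test_samples = 25000, 25000
validation_split, fit_batch, eval_batch = 0.2, 128, 32

# validation_split=0.2 holds out 5,000 reviews, leaving 20,000 for fitting.
fit_samples = train_samples - round(train_samples * validation_split)
fit_steps = math.ceil(fit_samples / fit_batch)     # batches per training epoch
eval_steps = math.ceil(test_samples / eval_batch)  # batches during evaluate()

# Validation metrics copied from the training log (epochs 1 to 3).
val_accuracy = [0.8472, 0.8658, 0.8464]
val_loss = [0.3575, 0.3165, 0.3487]
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1

print(fit_steps)   # 157
print(eval_steps)  # 782
print(best_epoch)  # 2
```

Validation loss is lowest at epoch 2 and rises again at epoch 3 while training accuracy keeps climbing, which is a common sign that the model is starting to overfit.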


# **Assignment 11 – NLP Preprocessing (Stopwords, Tokenization, Lemmatization, TF-IDF)**

Although the main project dataset (house prices) is tabular and lacks text content, I implemented a separate NLP preprocessing pipeline on a sample text corpus (the 25,000 labeled IMDB training reviews). The steps included:

- Lowercasing and punctuation removal

- Tokenization (splitting text into words)

- Stop-word removal

- Lemmatization

- TF-IDF vectorization

This exercise demonstrates my understanding of standard NLP workflows. It also clarifies why such methods are not applicable to the current project: without text fields (e.g., descriptions or reviews), there is nothing for textual feature engineering to operate on.

If the project later includes textual data such as house descriptions or seller notes, the same pipeline can be applied to extract features for machine learning models.

**Step 1: Load IMDB Raw Text**

In [10]:
import os

import pandas as pd
import tensorflow as tf

# Download and extract the raw IMDB archive under ./datasets
dataset = tf.keras.utils.get_file(
    "aclImdb_v1.tar.gz",
    "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz",
    untar=True,
    cache_dir='.',
)

# Extraction path from this run; the layout can differ between Keras versions
folder = "./datasets/aclImdb_v1_extracted/aclImdb"

# Read every training review; the "pos" / "neg" folder name gives the label
train_texts = []
train_labels = []

for label in ["pos", "neg"]:
    labeled_dir = os.path.join(folder, "train", label)
    for file in os.listdir(labeled_dir):
        with open(os.path.join(labeled_dir, file), "r", encoding="utf-8") as f:
            train_texts.append(f.read())
            train_labels.append(1 if label == "pos" else 0)

df = pd.DataFrame({"text": train_texts, "label": train_labels})
df.head()

|   | text | label |
|---|------|-------|
| 0 | Bromwell High is a cartoon comedy. It ran at t... | 1 |
| 1 | Homelessness (or Houselessness as George Carli... | 1 |
| 2 | Brilliant over-acting by Lesley Ann Warren. Be... | 1 |
| 3 | This is easily the most underrated film inn th... | 1 |
| 4 | This is not the typical Mel Brooks film. It wa... | 1 |


**Step 2: Clean Text (Lowercase, Remove symbols)**

In [11]:
import re

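def clean_text(text):
    """Lowercase the text and drop every character that is not a letter or whitespace."""
    text = text.lower()
    # Characters are deleted, not replaced by a space, so "over-acting" becomes "overacting"
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    return text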

df["clean"] = df["text"].apply(clean_text)
df.head()


|   | text | label | clean |
|---|------|-------|-------|
| 0 | Bromwell High is a cartoon comedy. It ran at t... | 1 | bromwell high is a cartoon comedy it ran at th... |
| 1 | Homelessness (or Houselessness as George Carli... | 1 | homelessness or houselessness as george carlin... |
| 2 | Brilliant over-acting by Lesley Ann Warren. Be... | 1 | brilliant overacting by lesley ann warren best... |
| 3 | This is easily the most underrated film inn th... | 1 | this is easily the most underrated film inn th... |
| 4 | This is not the typical Mel Brooks film. It wa... | 1 | this is not the typical mel brooks film it was... |
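Because the cleaning step deletes symbols instead of replacing them with spaces, neighbouring words can fuse, and any HTML line-break tags in the raw reviews leave a stray `br` behind. The sketch below reproduces the notebook's `clean_text` and compares it with a hypothetical variant, `clean_text_spaced`, which is not part of the original pipeline: it strips tags first, replaces other symbols with a space, and collapses repeated whitespace. The sample string is made up for illustration.

```python
import re

def clean_text(text):
    """Cleaning step as used in the notebook: delete non-letter characters."""
    text = text.lower()
    return re.sub(r"[^a-zA-Z\s]", "", text)

def clean_text_spaced(text):
    """Hypothetical variant: drop HTML tags, turn symbols into spaces."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)      # remove tags such as <br />
    text = re.sub(r"[^a-z\s]", " ", text)     # symbols become spaces
    return re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace

sample = "Brilliant over-acting!<br />Best"
print(clean_text(sample))         # brilliant overactingbr best
print(clean_text_spaced(sample))  # brilliant over acting best
```

Which behaviour is preferable depends on the task; the variant changes the vocabulary, so it would also change the TF-IDF features and the reported accuracy.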


**Step 3: Tokenization + Stopword Removal + Lemmatization**

In [14]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")
nltk.download("wordnet")

stop = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

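def preprocess(text):
    """Tokenize on whitespace, drop stop words, and lemmatize the remaining tokens."""
    tokens = text.split()
    tokens = [t for t in tokens if t not in stop]
    # lemmatize() treats each token as a noun by default, so "brooks" becomes "brook" but "ran" stays "ran"
    tokens = [lemmatizer.lemmatize(t) for t in tokens]
    return " ".join(tokens)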

df["processed"] = df["clean"].apply(preprocess)
df.head()

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\naree\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\naree\AppData\Roaming\nltk_data...


|   | text | label | clean | processed |
|---|------|-------|-------|-----------|
| 0 | Bromwell High is a cartoon comedy. It ran at t... | 1 | bromwell high is a cartoon comedy it ran at th... | bromwell high cartoon comedy ran time program ... |
| 1 | Homelessness (or Houselessness as George Carli... | 1 | homelessness or houselessness as george carlin... | homelessness houselessness george carlin state... |
| 2 | Brilliant over-acting by Lesley Ann Warren. Be... | 1 | brilliant overacting by lesley ann warren best... | brilliant overacting lesley ann warren best dr... |
| 3 | This is easily the most underrated film inn th... | 1 | this is easily the most underrated film inn th... | easily underrated film inn brook cannon sure f... |
| 4 | This is not the typical Mel Brooks film. It wa... | 1 | this is not the typical mel brooks film it was... | typical mel brook film much less slapstick mov... |


**Step 4: TF-IDF Vectorization**

In [15]:
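from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Keep only the 5,000 most frequent terms as features
tfidf = TfidfVectorizer(max_features=5000)

# Note: the vectorizer is fitted on all reviews before the split, so the IDF weights also see the test reviews.
# .toarray() builds a dense 25,000 x 5,000 matrix; LogisticRegression also accepts the sparse output directly.
X = tfidf.fit_transform(df["processed"]).toarray()
y = df["label"]

# Hold out 20% of the reviews for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)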
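To make the TF-IDF weighting concrete, here is a small pure-Python sketch of the computation that `TfidfVectorizer` performs with its default settings, as I understand them: raw term counts, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The three-document corpus and all variable names are hypothetical.

```python
import math
from collections import Counter

docs = ["good movie", "bad movie", "good good acting"]  # hypothetical toy corpus
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})   # ['acting', 'bad', 'good', 'movie']
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(t for doc in tokenized for t in set(doc))

# Smoothed inverse document frequency (rarer terms get larger weights).
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    """Term count times IDF, then scale the row to unit L2 length."""
    tf = Counter(tokens)
    raw = [tf[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

matrix = [tfidf_row(doc) for doc in tokenized]
for doc, row in zip(docs, matrix):
    print(doc, [round(x, 3) for x in row])
```

Terms that appear in only one document ("acting", "bad") receive a higher IDF than terms shared by two documents ("good", "movie"), so they contribute more to the final feature vector.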

**Step 5: Train Logistic Regression Classifier**

In [16]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

model_tfidf = LogisticRegression(max_iter=200)
model_tfidf.fit(X_train, y_train)

pred = model_tfidf.predict(X_test)
print("TF-IDF Model Accuracy:", accuracy_score(y_test, pred))

TF-IDF Model Accuracy: 0.8792
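As a rough sanity check on the reported accuracy, the sketch below converts it back into a count of correct predictions and attaches an approximate standard error. It assumes the 5,000-review test split produced by `test_size=0.2` and uses the usual normal approximation for a proportion; the variable names are for illustration only.

```python
import math

n_test = 5000       # 20% of the 25,000 reviews
accuracy = 0.8792   # value printed above

# Number of correctly classified test reviews implied by the accuracy.
correct = round(accuracy * n_test)

# Approximate standard error of a proportion estimated from n_test samples.
std_err = math.sqrt(accuracy * (1 - accuracy) / n_test)
margin_95 = 1.96 * std_err

print(correct)              # 4396
print(round(std_err, 4))    # 0.0046
print(round(margin_95, 3))  # 0.009
```

Under these assumptions the accuracy is known to within roughly one percentage point. The TF-IDF result and the LSTM test accuracy were measured on different test sets, so the two figures are not directly comparable.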
