In [1]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)


# Week 11: Natural Language Processing (NLP)

This week introduced Natural Language Processing (NLP) techniques for handling and analyzing text data.
Core NLP preprocessing steps were implemented to convert unstructured text into a structured numerical format suitable for machine-learning models.

**Dataset:**  
Simulated student feedback and academic comments (text data)

**Goal:**  
Build an NLP preprocessing pipeline including tokenization, stopword removal, and TF-IDF vectorization, and prepare text data for future analysis and modeling.








In [2]:
import sys
!"{sys.executable}" -m pip install nltk




In [1]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Irtaza_Majid\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Irtaza_Majid\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [3]:
import nltk
print("NLTK version:", nltk.__version__)


NLTK version: 3.9.2


In [4]:
import pandas as pd
import numpy as np

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


In [5]:
data = {
    "text": [
        "The course content was very helpful and easy to understand",
        "I struggled with assignments but learned a lot",
        "The instructor explained concepts clearly",
        "The workload was heavy and stressful",
        "I enjoyed the practical examples in class"
    ]
}

df = pd.DataFrame(data)
df


,text
0,The course content was very helpful and easy t...
1,I struggled with assignments but learned a lot
2,The instructor explained concepts clearly
3,The workload was heavy and stressful
4,I enjoyed the practical examples in class


In [6]:
df["clean_text"] = df["text"].str.lower()
df


,text,clean_text
0,The course content was very helpful and easy t...,the course content was very helpful and easy t...
1,I struggled with assignments but learned a lot,i struggled with assignments but learned a lot
2,The instructor explained concepts clearly,the instructor explained concepts clearly
3,The workload was heavy and stressful,the workload was heavy and stressful
4,I enjoyed the practical examples in class,i enjoyed the practical examples in class


In [7]:
import nltk
nltk.download('punkt_tab')


[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\Irtaza_Majid\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [8]:
nltk.download('punkt')


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Irtaza_Majid\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [9]:
df["tokens"] = df["clean_text"].apply(word_tokenize)
df


,text,clean_text,tokens
0,The course content was very helpful and easy t...,the course content was very helpful and easy t...,"[the, course, content, was, very, helpful, and..."
1,I struggled with assignments but learned a lot,i struggled with assignments but learned a lot,"[i, struggled, with, assignments, but, learned..."
2,The instructor explained concepts clearly,the instructor explained concepts clearly,"[the, instructor, explained, concepts, clearly]"
3,The workload was heavy and stressful,the workload was heavy and stressful,"[the, workload, was, heavy, and, stressful]"
4,I enjoyed the practical examples in class,i enjoyed the practical examples in class,"[i, enjoyed, the, practical, examples, in, class]"


In [10]:
stop_words = set(stopwords.words('english'))

df["filtered_tokens"] = df["tokens"].apply(
    lambda words: [word for word in words if word.isalpha() and word not in stop_words]
)

df


,text,clean_text,tokens,filtered_tokens
0,The course content was very helpful and easy t...,the course content was very helpful and easy t...,"[the, course, content, was, very, helpful, and...","[course, content, helpful, easy, understand]"
1,I struggled with assignments but learned a lot,i struggled with assignments but learned a lot,"[i, struggled, with, assignments, but, learned...","[struggled, assignments, learned, lot]"
2,The instructor explained concepts clearly,the instructor explained concepts clearly,"[the, instructor, explained, concepts, clearly]","[instructor, explained, concepts, clearly]"
3,The workload was heavy and stressful,the workload was heavy and stressful,"[the, workload, was, heavy, and, stressful]","[workload, heavy, stressful]"
4,I enjoyed the practical examples in class,i enjoyed the practical examples in class,"[i, enjoyed, the, practical, examples, in, class]","[enjoyed, practical, examples, class]"


In [11]:
df["processed_text"] = df["filtered_tokens"].apply(lambda x: " ".join(x))
df


,text,clean_text,tokens,filtered_tokens,processed_text
0,The course content was very helpful and easy t...,the course content was very helpful and easy t...,"[the, course, content, was, very, helpful, and...","[course, content, helpful, easy, understand]",course content helpful easy understand
1,I struggled with assignments but learned a lot,i struggled with assignments but learned a lot,"[i, struggled, with, assignments, but, learned...","[struggled, assignments, learned, lot]",struggled assignments learned lot
2,The instructor explained concepts clearly,the instructor explained concepts clearly,"[the, instructor, explained, concepts, clearly]","[instructor, explained, concepts, clearly]",instructor explained concepts clearly
3,The workload was heavy and stressful,the workload was heavy and stressful,"[the, workload, was, heavy, and, stressful]","[workload, heavy, stressful]",workload heavy stressful
4,I enjoyed the practical examples in class,i enjoyed the practical examples in class,"[i, enjoyed, the, practical, examples, in, class]","[enjoyed, practical, examples, class]",enjoyed practical examples class


In [12]:
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(df["processed_text"])

print("TF-IDF Matrix Shape:", X_tfidf.shape)


TF-IDF Matrix Shape: (5, 20)


In [13]:
tfidf.get_feature_names_out()


array(['assignments', 'class', 'clearly', 'concepts', 'content', 'course',
       'easy', 'enjoyed', 'examples', 'explained', 'heavy', 'helpful',
       'instructor', 'learned', 'lot', 'practical', 'stressful',
       'struggled', 'understand', 'workload'], dtype=object)

In [14]:
tfidf_df = pd.DataFrame(
    X_tfidf.toarray(),
    columns=tfidf.get_feature_names_out()
)

tfidf_df


,assignments,class,clearly,concepts,content,course,easy,enjoyed,examples,explained,heavy,helpful,instructor,learned,lot,practical,stressful,struggled,understand,workload
0,0.0,0.0,0.0,0.0,0.447214,0.447214,0.447214,0.0,0.0,0.0,0.0,0.447214,0.0,0.0,0.0,0.0,0.0,0.0,0.447214,0.0
1,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.5,0.0,0.0,0.5,0.0,0.0
2,0.0,0.0,0.5,0.5,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.57735,0.0,0.0,0.0,0.0,0.0,0.57735,0.0,0.0,0.57735
4,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.5,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0
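
The repeated values in the table above (0.447214, 0.5, 0.57735) can be reproduced by hand. The sketch below is a standard-library re-implementation, not scikit-learn's code. It assumes the `TfidfVectorizer` defaults: smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The names `df_counts` and `tfidf_row` are illustrative only.

```python
import math
from collections import Counter

# Processed comments copied from the processed_text column above.
docs = [
    "course content helpful easy understand",
    "struggled assignments learned lot",
    "instructor explained concepts clearly",
    "workload heavy stressful",
    "enjoyed practical examples class",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many comments each term appears.
df_counts = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf_row(tokens):
    """Return L2-normalized TF-IDF weights for one document as a dict."""
    tf = Counter(tokens)
    # Smoothed IDF (assumed to match the vectorizer defaults).
    raw = {
        term: count * (math.log((1 + n_docs) / (1 + df_counts[term])) + 1)
        for term, count in tf.items()
    }
    norm = math.sqrt(sum(weight * weight for weight in raw.values()))
    return {term: weight / norm for term, weight in raw.items()}


rows = [tfidf_row(tokens) for tokens in tokenized]

# Number of terms in each comment and the (shared) weight of its terms.
for tokens, row in zip(tokenized, rows):
    print(len(tokens), round(row[tokens[0]], 6))
```

Because every term occurs once, in exactly one comment, all terms in a comment get the same raw weight. After normalization a comment with k terms has weight 1/sqrt(k) per term: 0.447214 for five terms, 0.5 for four, and 0.57735 for three.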


In [15]:
similarity_matrix = cosine_similarity(X_tfidf)
similarity_matrix


array([[1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.]])

In [16]:
print("Higher cosine similarity values indicate more similar text documents.")


Higher cosine similarity values indicate more similar text documents.
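
The similarity matrix above is the identity matrix because, after stopword removal, no two comments share a term. The sketch below checks this with plain Python. It then shows a non-zero similarity using one hypothetical extra comment, which is not part of the dataset. For simplicity it uses unweighted (0/1) term vectors, so the printed value illustrates overlap and is not the TF-IDF similarity.

```python
import math
from itertools import combinations

# Processed comments copied from the processed_text column above.
docs = [
    "course content helpful easy understand",
    "struggled assignments learned lot",
    "instructor explained concepts clearly",
    "workload heavy stressful",
    "enjoyed practical examples class",
]
token_sets = [set(doc.split()) for doc in docs]

# Collect the shared terms of every pair of comments that overlaps.
shared = {
    (i, j): token_sets[i] & token_sets[j]
    for i, j in combinations(range(len(docs)), 2)
    if token_sets[i] & token_sets[j]
}
print("Pairs with shared terms:", shared)


def cosine(a, b):
    """Cosine similarity of two term sets treated as 0/1 vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / (math.sqrt(len(a)) * math.sqrt(len(b)))


# Hypothetical comment (assumption, for illustration only).
extra = set("heavy workload manageable".split())
print("Similarity to comment 3:", round(cosine(token_sets[3], extra), 3))
```

The empty dictionary confirms that every off-diagonal entry must be 0. The hypothetical comment shares two of three terms with comment 3, so its similarity is 2/3.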


# Week 11: Summary

**Dataset:**  
Simulated student feedback and academic comments (five short text samples)

**Technique Used:**  
Natural Language Processing (NLP) preprocessing using NLTK and TF-IDF (scikit-learn)

### Key Steps:
1. Normalized text data by converting all text to lowercase.

2. Tokenized text into individual words using NLTK's `word_tokenize`.

3. Removed stopwords and non-alphabetic tokens to reduce noise in the dataset.

4. Reconstructed clean text after preprocessing for feature extraction.

5. Applied TF-IDF vectorization to transform text into numerical feature representations.

6. Computed pairwise cosine similarity between the TF-IDF vectors; every off-diagonal value was 0 because no two comments share a term after stopword removal.
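
Steps 1 to 4 can be collapsed into a single helper. The sketch below is a minimal stand-in that runs without NLTK downloads. It uses a regular expression in place of `word_tokenize` and a small hand-picked stop list in place of NLTK's English list. Both are assumptions for illustration, and the names `DEMO_STOPWORDS` and `preprocess` are hypothetical.

```python
import re

# Hand-picked stop list covering only the function words that occur in the
# five sample comments. NLTK's English stopword list is much larger.
DEMO_STOPWORDS = frozenset(
    {"the", "was", "very", "and", "to", "i", "with", "but", "a", "in"}
)


def preprocess(text, stop_words=DEMO_STOPWORDS):
    """Lowercase, tokenize, drop stopwords, and rejoin into one string."""
    # Keep alphabetic runs only, which also drops digits and punctuation.
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [token for token in tokens if token not in stop_words]
    return " ".join(kept)


print(preprocess("The course content was very helpful and easy to understand"))
# course content helpful easy understand
```

With a DataFrame like the one in this notebook, the helper could be applied as `df["text"].apply(preprocess)`. The regular expression splits contractions differently from `word_tokenize`, so results may differ on text that contains apostrophes.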

### Insights:
- Tokenization and stopword removal reduced each comment to three to five content words, giving a 20-term vocabulary across the five comments.

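- In this small sample every term occurs in exactly one comment, so TF-IDF assigns equal weight to all terms within a comment and the cosine similarity matrix is the identity. A larger corpus with overlapping vocabulary is needed before TF-IDF can separate distinctive words from common ones.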

- NLP preprocessing is essential for converting raw text into a format suitable for machine-learning models.

- The prepared NLP pipeline can be extended to sentiment analysis or text classification tasks.

**Project Milestone:**  
NLP preprocessing pipeline successfully completed: the text data has been cleaned and vectorized with TF-IDF and is now ready for modeling.
This milestone enables future integration of text-based analysis into the overall project.

