# Question 2

**Q2: Use the dataset “INFO 617_QA_Question_Category.csv” to complete the following task.**
a) Create a new column named “Answer_Token_Count” to represent the number of words in the
column “Question_English”. To do this, you should tokenize texts in the “Question_English”
column using the word tokenizer we discussed in class, and count the number of elements in the
result.


b) In class, we showed how TF-IDF vectors can be used to group questions into mutually-related
clusters. Now, perform a classification task that predicts a question’s category (for consistency,
use “Cat1” instead of “Cat2” as the outcome variable) using its TF-IDF vector. You might want to
follow the steps for data preprocessing before implementing a classification model (e.g., remove
unnecessary features and duplicate data, impute or remove missing values, encode the categorical
variable, data segmentation, and data resampling.) In particular, as a few categories only include a
very small number of samples, remove the following categories from your data: “Sexuality and
sex”, “Social incidents and cultural issues” and “Physical health.”


In [None]:
import pandas as pd
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

In [None]:
# For Google Colab integration
from google.colab import drive
drive.mount('/content/drive')

# For data manipulation
import pandas as pd
import numpy as np

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# import data as dataframe
file_path = '/content/drive/MyDrive/Colab Notebooks/INFO 617_QA_Question_Category.csv'
df = pd.read_csv(file_path)


# calling head() method
df.head()

Unnamed: 0,QID,Question_English,Cat1,Cat2
0,100788013,I have been suffering from insomnia for half a...,Mental and emotional well-being,Physical health
1,100788017,"I can't describe what exactly is ""odd"" about m...",Family dynamics and parenting,Mental and emotional well-being
2,100788025,Always curious and wanting to experience hurti...,Behavioral issues and undesirable habits,Mental and emotional well-being
3,100788036,Is it normal to express anger? Is it necessary...,Interpersonal relationships and social skills,Mental and emotional well-being
4,100788040,"I dare not express my anger, but is it necessa...",Family dynamics and parenting,Interpersonal relationships and social skills


**Dataset Overview: Structure and Data Types**

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3858 entries, 0 to 3857
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   QID               3858 non-null   int64 
 1   Question_English  3858 non-null   object
 2   Cat1              3858 non-null   object
 3   Cat2              3858 non-null   object
dtypes: int64(1), object(3)
memory usage: 120.7+ KB


QID (int64): Unique identifier for each question.
Question_English (object): The textual representation of questions in English.
Cat1 (object): Primary category classification of the question.
Cat2 (object): Secondary category classification.
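Answer_Token_Count (int64): The number of tokens in the question (despite the column name). This column is created in the tokenization step later in the notebook, so it does not appear in the output above.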

**Statistical Summary of the Dataset**

In [None]:
df.describe()

Unnamed: 0,QID
count,3858.0
mean,100821900.0
std,15621.95
min,100788000.0
25%,100804300.0
50%,100828600.0
75%,100832000.0
max,100839600.0


A descriptive analysis of the dataset provides insight into the distribution of its numerical variables:

QID: A unique identifier for each question, ranging from about 100,788,000 to 100,839,600 with a standard deviation of 15,621.95. Because it is an identifier, these statistics carry no analytical meaning.
Answer_Token_Count (not shown in the output above, because the column is only created in the tokenization step later in the notebook; despite its name, it counts the tokens in each question):
Mean: 229.44 tokens per question.
Minimum: 10 tokens, indicating very short questions.
Maximum: 991 tokens, representing the longest questions.
Quartiles:
25th percentile: 125 tokens
50th percentile (Median): 219 tokens
75th percentile: 313 tokens
The standard deviation (134.66) is large relative to the mean, which indicates high variability in question length and may affect text analysis and model performance. Inspecting the length distribution and potential outliers may help refine preprocessing and feature selection.

In [None]:
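# Features with zero variation offer no information for prediction.
# Check every numeric column for a standard deviation of zero.
numeric_columns = df.select_dtypes(include="number").columns
zerovariation = [se for se in numeric_columns if df[se].std() == 0]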
print(zerovariation)

[]


In [None]:
display(df.isna().sum())

Unnamed: 0,0
QID,0
Question_English,0
Cat1,0
Cat2,0


A check for missing values confirms that the dataset is complete, with zero missing values across all four columns (QID, Question_English, Cat1, Cat2). No imputation or row removal is required at this stage.









In [None]:
# Check if any missing values exist in each row
missing_values_per_row = df.isna().any(axis = 1)
print(missing_values_per_row)

0       False
1       False
2       False
3       False
4       False
        ...  
3853    False
3854    False
3855    False
3856    False
3857    False
Length: 3858, dtype: bool


In [None]:
# Check if any missing values exist in each column
missing_values_per_column = df.isna().any(axis=0)
print(missing_values_per_column)

QID                 False
Question_English    False
Cat1                False
Cat2                False
dtype: bool


In [None]:
# Report columns and rows with missing values
print(df.columns[missing_values_per_column])
print(df.index[missing_values_per_row])

Index([], dtype='object')
Index([], dtype='int64')


**Tokenization Setup: Preparing NLTK for Text Processing**

In [None]:
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize

# Download NLTK tokenizer if not already installed
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

To enable tokenization, NLTK's 'punkt' resource is downloaded. It provides the pretrained sentence-boundary model that word_tokenize relies on to split text into sentences before breaking each sentence into word tokens.

**Ensuring NLTK Tokenizer Readiness for Text Processing**

In [None]:
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize

# Ensure the correct tokenizer package is downloaded
nltk.download('punkt', force=True)

# Manually specify the tokenizer path if needed
nltk.data.path.append('/usr/local/nltk_data')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


The 'punkt' package is re-downloaded with force=True to guarantee a fresh copy, and /usr/local/nltk_data is appended to NLTK's search path as a fallback location in case the resource is not found in the default directory.

**Tokenizing and Counting Words in Questions for Text Analysis**

In [None]:
# Download the 'punkt_tab' data package
nltk.download('punkt_tab')

# Ensure the column exists
if "Question_English" in df.columns:
    # Tokenize and count words
    df["Answer_Token_Count"] = df["Question_English"].astype(str).apply(lambda x: len(word_tokenize(x)))

    # Display the updated DataFrame
    display(df)
else:
    print("Column 'Question_English' not found in the dataset.")

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


Unnamed: 0,QID,Question_English,Cat1,Cat2,Answer_Token_Count
0,100788013,I have been suffering from insomnia for half a...,Mental and emotional well-being,Physical health,219
1,100788017,"I can't describe what exactly is ""odd"" about m...",Family dynamics and parenting,Mental and emotional well-being,283
2,100788025,Always curious and wanting to experience hurti...,Behavioral issues and undesirable habits,Mental and emotional well-being,113
3,100788036,Is it normal to express anger? Is it necessary...,Interpersonal relationships and social skills,Mental and emotional well-being,299
4,100788040,"I dare not express my anger, but is it necessa...",Family dynamics and parenting,Interpersonal relationships and social skills,322
...,...,...,...,...,...
3853,100839605,"How do you all feel about the idea of ""startin...",Family dynamics and parenting,Education and school life,178
3854,100839618,"Seeing my mom with her new partner together, I...",Family dynamics and parenting,Mental and emotional well-being,46
3855,100839625,"In my third year of high school, I don't know ...",Interpersonal relationships and social skills,Mental and emotional well-being,281
3856,100839634,My realization today is: learning to empathize...,Interpersonal relationships and social skills,Family dynamics and parenting,185


The 'punkt_tab' resource, which newer NLTK releases use for word_tokenize, was downloaded, and each entry in the "Question_English" column was tokenized and counted, producing a new column:

"Answer_Token_Count" holds the number of tokens in each question (the name follows the assignment, although the text being counted is the question). Because word_tokenize emits punctuation marks as separate tokens, the counts run slightly higher than pure word counts.
This feature supports analysis of question length and its possible relationship with the category labels (Cat1, Cat2).
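To make concrete what a token count measures, here is a minimal sketch. It uses a simple regular-expression tokenizer as a stand-in for NLTK's `word_tokenize` (an assumption, so that the example runs without downloading NLTK data; the two tokenizers differ on contractions and other edge cases). The function names are hypothetical. Like `word_tokenize`, the stand-in emits punctuation marks as separate tokens.

```python
import re


def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens.

    Hypothetical stand-in for nltk.tokenize.word_tokenize, not the same algorithm.
    """
    return re.findall(r"\w+|[^\w\s]", text)


def count_tokens(text):
    """Number of tokens, punctuation included."""
    return len(simple_tokenize(text))


def count_words(text):
    """Number of tokens that contain at least one letter or digit."""
    return sum(1 for tok in simple_tokenize(text) if any(ch.isalnum() for ch in tok))


question = "Is it normal to express anger? Is it necessary?"
print(simple_tokenize(question))
print(count_tokens(question), count_words(question))
```

The two counts differ by the number of punctuation tokens, which is why a token count is an upper bound on the word count.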

In [None]:
#Remove Unwanted Categories First
categories_to_remove = ["Sexuality and sex", "Social incidents and cultural issues", "Physical health"]
df = df[~df["Cat1"].isin(categories_to_remove)]  # Keep only relevant categories

In [None]:
# Drop duplicates
df = df.drop_duplicates(subset=['QID', 'Question_English', 'Cat1', 'Cat2', 'Answer_Token_Count'])

In [None]:
# Handle missing values (drop rows where either Question_English or Cat1 is NaN)
df = df.dropna(subset=["Question_English", "Cat1"])


### TF-IDF Matrix



**Transforming Text into Numerical Features Using TF-IDF**

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Convert text to TF-IDF features, dropping scikit-learn's built-in English stop words
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(df["Question_English"])  # Sparse matrix: one row per question, one column per term

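A TF-IDF (Term Frequency-Inverse Document Frequency) matrix converts the questions into numerical vectors for machine learning and text analysis. A term's weight grows with its frequency within a question and shrinks as the term appears in more questions, so distinctive words count more than common ones.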
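The weighting can be worked through by hand. The sketch below follows the formula scikit-learn uses by default (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalization), but it tokenizes on whitespace, which is a simplification of `TfidfVectorizer`'s own tokenizer. The function name and the three sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document.

    Uses raw counts, smoothed IDF ln((1 + n) / (1 + df)) + 1, and L2 normalization.
    Whitespace tokenization is a simplification for illustration.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        # Guard against empty documents, whose norm would be zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


docs = ["sleep problems at night", "exam stress at school", "sleep and exam stress"]
vecs = tfidf_vectors(docs)
for vec in vecs:
    print({term: round(weight, 3) for term, weight in vec.items()})
```

In the first document, "problems" occurs in only one document and therefore receives a higher weight than "sleep", which occurs in two.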

**TF-IDF Matrix Dimensions: Understanding Textual Feature Representation**

In [None]:
# Display the shape of the TF-IDF matrix
print(f"TF-IDF Matrix Shape: {X.shape}")  # (num_questions, num_unique_words)

TF-IDF Matrix Shape: (3844, 12502)


The TF-IDF matrix has a shape of (3844, 12502), indicating that:

The dataset contains 3,844 questions (rows) after category filtering and duplicate removal.
There are 12,502 unique terms (columns) across all questions after lowercasing and English stop-word removal.

**Finding the Most Similar Question Using TF-IDF and Cosine Similarity**

In [None]:
# Which question is most similar to the first question in the dataset, based on their TF-IDF vectors?
from sklearn.metrics.pairwise import cosine_similarity
most_simi_index = 0
most_simi_score = 0
for i in range(1, X.shape[0]):
  simi_score = cosine_similarity(X[0], X[i])
  if simi_score > most_simi_score:
    most_simi_score = simi_score
    most_simi_index = i
print(most_simi_score, most_simi_index)

print("\nOriginal Question:\n", df["Question_English"].iloc[0])
print("\nMost Similar Question:\n", df["Question_English"].iloc[most_simi_index])

[[0.26963764]] 1897

Original Question:
 I have been suffering from insomnia for half a year. I have difficulty falling asleep, and occasionally when I do, I get woken up by nightmares. I feel extremely exhausted physically, but my mind remains in an excited state, making it really difficult for me to fall asleep. I have tried various methods, such as drinking milk and foot baths before bed, but they have had the opposite effect, making it even harder for me to sleep. I used to be able to use boring textbooks to lull myself to sleep, but this year it doesn't work anymore, and I become more excited the more I read. I have also sought counseling, but after the counselor asked about my symptoms, they did not provide any advice on how to solve or alleviate them. I also tried self-hypnosis, but it had no effect. I attempted changing my environment, but I still got disturbed by various internal and external factors. I also tried to change my mental state, as I don't usually feel anxious. How

Using TF-IDF vectorization and cosine similarity, the question most similar to the first question in the dataset was identified. The cosine similarity between the two is 0.2696 on a scale from 0 (no shared terms) to 1 (identical direction), which indicates only modest lexical overlap even for the best match.

Most Similar Question Index: The most similar question is at position 1897 of the filtered DataFrame. This is a positional index (used with iloc), not the QID or the original row label.
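The search for the nearest question can be sketched without any library. The example below represents each vector as a `{term: weight}` dictionary and returns the position of the best match, mirroring the loop above; the helper names and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query_index, vectors):
    """Return (position, score) of the vector closest to vectors[query_index]."""
    best_index, best_score = None, -1.0
    for i, vec in enumerate(vectors):
        if i == query_index:
            continue  # skip the query itself
        score = cosine(vectors[query_index], vec)
        if score > best_score:
            best_index, best_score = i, score
    return best_index, best_score


vectors = [
    {"sleep": 2.0, "night": 1.0},
    {"exam": 1.0, "school": 1.0},
    {"sleep": 1.0, "tired": 1.0},
]
index, score = most_similar(0, vectors)
print(index, round(score, 4))
```

The returned value is a position in the list of vectors, so in a DataFrame setting it would be used with `iloc` rather than as a row label.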

In [None]:
from sklearn.preprocessing import LabelEncoder

#Encode Target Variable (Cat1)
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(df["Cat1"])  # Convert category labels to numbers

In [None]:
from sklearn.model_selection import train_test_split

#Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score


# Train a classification model
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

A Random Forest Classifier with 100 decision trees was trained on the TF-IDF vectors of the training questions to predict each question's primary category (Cat1). The model combines the votes of many decision trees, which reduces the overfitting of any single tree and improves generalization.

With random_state set to 42, the results are reproducible. The next step is to evaluate the model on the held-out test set using accuracy, precision, recall, and F1-score.

**Evaluating Random Forest Classifier Performance on Test Data**

In [None]:
#Model Evaluation
y_pred = clf.predict(X_test)

# Print Accuracy and Classification Report
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred, target_names=label_encoder.classes_))

Accuracy: 0.6410923276983095

Classification Report:
                                                precision    recall  f1-score   support

     Behavioral issues and undesirable habits       0.00      0.00      0.00        10
                    Education and school life       1.00      0.13      0.24        30
                Family dynamics and parenting       0.69      0.85      0.76       186
Interpersonal relationships and social skills       0.53      0.57      0.55       135
              Mental and emotional well-being       0.58      0.71      0.64       160
         Personal growth and self-development       0.40      0.05      0.09        42
          Romantic relationships and marriage       0.73      0.77      0.75       152
                 Work, career, and employment       0.74      0.37      0.49        54

                                     accuracy                           0.64       769
                                    macro avg       0.58      0.43      0.

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


The Random Forest Classifier achieved an accuracy of 64.11%, indicating moderate predictive performance. The classification report provides further insights into the model's effectiveness across different categories:

Highest Performance:
Family dynamics and parenting (85% recall, 76% F1-score)
Romantic relationships and marriage (77% recall, 75% F1-score)

Lowest Performance:
Behavioral issues and undesirable habits (0% precision and recall: the model identified none of the 10 test questions in this category)
Personal growth and self-development (5% recall, 9% F1-score)

The macro-average F1-score (44%) weights every category equally, while the weighted-average F1-score (61%) weights each category by its number of test samples. The gap between the two shows that the model performs well on frequent categories but struggles with underrepresented ones.
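The two averages can be reproduced by hand from the per-class rows of the classification report. The sketch below copies the rounded F1-scores and supports from that report, so the results are approximate.

```python
# Per-class (F1-score, support) pairs copied from the classification report above
report = {
    "Behavioral issues and undesirable habits": (0.00, 10),
    "Education and school life": (0.24, 30),
    "Family dynamics and parenting": (0.76, 186),
    "Interpersonal relationships and social skills": (0.55, 135),
    "Mental and emotional well-being": (0.64, 160),
    "Personal growth and self-development": (0.09, 42),
    "Romantic relationships and marriage": (0.75, 152),
    "Work, career, and employment": (0.49, 54),
}

total_support = sum(n for _, n in report.values())

# Macro average: every class counts equally, regardless of size
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

# Weighted average: each class counts in proportion to its support
weighted_f1 = sum(f1 * n for f1, n in report.values()) / total_support

print(f"support = {total_support}")
print(f"macro F1 ~ {macro_f1:.2f}, weighted F1 ~ {weighted_f1:.2f}")
```

Small classes with near-zero F1 pull the macro average down sharply but barely move the weighted average, which explains the difference between the two figures.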

Further improvements could involve:

Balancing class distribution (e.g., oversampling minority classes)
Feature selection or engineering for better representation
Hyperparameter tuning to optimize model performance.
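One possible way to address class imbalance is sketched below. It uses `class_weight="balanced"` in the random forest rather than oversampling (a swapped-in technique, named plainly here), and it wraps the vectorizer and classifier in a `Pipeline` so that TF-IDF is fitted on the training split only. The toy texts and labels are hypothetical; in the notebook they would correspond to `df["Question_English"]` and `df["Cat1"]`.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical toy data with an imbalanced label distribution (8 vs. 4)
texts = [
    "My mother and father argue about money at home",
    "My child refuses to listen to his parents",
    "My husband and I disagree about raising our daughter",
    "My parents compare me with my sister all the time",
    "How can I talk to my teenage son about school",
    "My mother criticizes every decision I make",
    "Our family dinner always ends in a fight",
    "My father never supported my choices as a child",
    "My boss gives me too much work every week",
    "I want to quit my job at the company",
    "My colleagues at the office ignore my ideas",
    "I am afraid of the interview for a new job",
]
labels = ["family"] * 8 + ["work"] * 4

# Split the raw text first, so the vectorizer never sees the test questions
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

model = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    # class_weight="balanced" reweights classes inversely to their frequency
    ("clf", RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=42)),
])
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(list(zip(y_test, predictions)))
```

Because the pipeline is a single estimator, it could also be passed to `GridSearchCV` for hyperparameter tuning, with parameter names prefixed by the step name (for example `clf__n_estimators`).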

# Question 3
Use the dataset “INFO 617_QA_Question_Category.csv” to complete the following task. In class, we
performed LDA (Latent Dirichlet Allocation) on questions in this dataset but the results are not
satisfactory. Likely, the reason is the frequent appearances of stop words, such as “A”, “but”, and “he”.
First, remove stop words from all questions (you can use the “stopwords” module in nltk package for this
purpose). Second, replicate the LDA. Third, present the new results by showing the top 10 words for each
topic.

We remove stop words (such as "the", "and", "is") from each question to emphasize more informative terms. The questions are then converted into a document-term matrix in which each entry is the number of times a word occurs in a question, keeping only the 5,000 most frequent words in the dataset.

In [None]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import numpy as np

nltk.download('stopwords')

stop_words = set(stopwords.words("english"))
df["Processed_Question"] = df["Question_English"].apply(lambda x: " ".join([word for word in word_tokenize(str(x).lower()) if word.isalpha() and word not in stop_words]))

# Convert text data into a CountVectorizer matrix
vectorizer = CountVectorizer(stop_words="english", max_features=5000)  # Limit features for efficiency
question_bow = vectorizer.fit_transform(df["Processed_Question"])

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


To enhance topic modeling accuracy, the text data underwent preprocessing before being transformed into a bag-of-words (BoW) representation using CountVectorizer. The key steps include:

Tokenization & Lowercasing: Converted text to lowercase and split it into individual tokens using word_tokenize.

Filtering Non-Alphabetic Tokens: Removed punctuation and numbers by keeping only purely alphabetic tokens.

Stopword Removal: Removed common English stopwords using NLTK's stopword list; CountVectorizer then applies scikit-learn's built-in English stop list as a second pass.

Feature Selection: Retained the 5,000 most frequent words to limit the vocabulary size.

Bag-of-Words Transformation: Converted the preprocessed text into a count matrix with CountVectorizer, where each row represents a question and each column holds a word's frequency in that question.
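These preprocessing steps can be sketched in plain Python. The example below uses a regular-expression tokenizer and a tiny illustrative stop-word set as stand-ins for `word_tokenize` and NLTK's stop-word list (both are assumptions, chosen so the sketch runs without downloads), and the function names are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative subset, not NLTK's actual stop-word list
STOP_WORDS = {"a", "and", "but", "he", "i", "in", "is", "my", "the", "to"}


def preprocess(text):
    """Lowercase, tokenize, and keep alphabetic tokens that are not stop words."""
    tokens = re.findall(r"[a-z]+|[^a-z\s]", text.lower())
    return [t for t in tokens if t.isalpha() and t not in STOP_WORDS]


def bag_of_words(docs, max_features):
    """Build a count matrix over the max_features most frequent words."""
    processed = [preprocess(d) for d in docs]
    totals = Counter(t for tokens in processed for t in tokens)
    vocab = sorted(t for t, _ in totals.most_common(max_features))
    rows = []
    for tokens in processed:
        counts = Counter(tokens)
        rows.append([counts[t] for t in vocab])
    return vocab, rows


docs = [
    "He is angry, but I want to sleep.",
    "My exam is in 2 days and I want to sleep!",
]
vocab, rows = bag_of_words(docs, max_features=5)
print(vocab)
print(rows)
```

Each row of the matrix corresponds to one document and each column to one vocabulary word, which is the shape of input an LDA model expects.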

**Fitting LDA Model**

In [None]:
# Train LDA Model with 10 topics
lda = LatentDirichletAllocation(n_components=10, random_state=0)
lda.fit(question_bow)

A Latent Dirichlet Allocation (LDA) model was trained with 10 topics to uncover underlying themes within the dataset. LDA is a popular topic modeling technique that assumes each document (question) consists of multiple topics, and each topic is characterized by a distribution of words.

n_components = 10: Specifies the number of topics to extract.
random_state = 0: Ensures reproducibility of results.
The fitted model assigns each question a distribution over these topics, which supports text analysis, trend identification, and content organization.

In [None]:
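# Display the top 10 words for each topic
feature_names = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    print(f"Top 10 words for Topic #{idx}:")
    # argsort is ascending, so the reversed slice yields the 10 highest-weight word indices
    print([feature_names[i] for i in topic.argsort()[:-11:-1]])
    print("\n")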

# Get the topic distribution for each question
topic_distribution = lda.transform(question_bow)

Top 10 words for Topic #0:
['relationship', 'like', 'feel', 'person', 'time', 'girlfriend', 'good', 'want', 'know', 'friend']


Top 10 words for Topic #1:
['mother', 'family', 'child', 'parents', 'money', 'father', 'home', 'husband', 'feel', 'years']


Top 10 words for Topic #2:
['work', 'job', 'feel', 'like', 'feeling', 'working', 'company', 'want', 'life', 'new']


Top 10 words for Topic #3:
['like', 'feel', 'want', 'know', 'love', 'time', 'relationship', 'really', 'said', 'ca']


Top 10 words for Topic #4:
['said', 'asked', 'angry', 'child', 'husband', 'time', 'told', 'saying', 'later', 'went']


Top 10 words for Topic #5:
['feel', 'like', 'day', 'time', 'want', 'year', 'ca', 'exam', 'sleep', 'school']


Top 10 words for Topic #6:
['feel', 'people', 'feeling', 'relationships', 'life', 'fear', 'afraid', 'anxiety', 'sense', 'friends']


Top 10 words for Topic #7:
['feel', 'like', 'ca', 'want', 'emotions', 'know', 'time', 'things', 'feeling', 'control']


Top 10 words for Topic #8:
['s

A Latent Dirichlet Allocation (LDA) model was used to extract 10 topics from the dataset, each characterized by its highest-weight words. The labels below are interpretive summaries of those words.

Key Topics and Their Interpretations:

Topic #0: Romantic Relationships & Friendships (relationship, girlfriend, friend, like, feel)

Topic #1: Family & Parenting (mother, parents, child, home, husband)

Topic #2: Work & Career (work, job, company, working, life)

Topic #3: Love & Emotional Bonds (love, relationship, want, feel, time)

Topic #4: Conflict & Communication Issues (said, angry, asked, told, later)

Topic #5: Daily Life & Routines (day, time, school, exam, sleep)

Topic #6: Anxiety & Psychological Well-being (fear, anxiety, relationships, people, sense)

Topic #7: Emotional Regulation & Self-Control (feel, emotions, control, want, things)

Topic #8: Education & Student Life (school, class, teacher, high, college)

Topic #9: Mental Health & Depression (depression, anxiety, psychological, emotions, mental)

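These topics highlight common themes in personal concerns and experiences. Several topics share generic words such as "feel", "like", and "want", so the themes overlap. The token "ca" is an artifact of word_tokenize splitting "can't" into "ca" and "n't"; adding such words to the stop list would likely sharpen the topics.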