# IS 733 - Lab 11

## Part I: Distributional Hypothesis

Use the distributional hypothesis (words that occur in similar contexts tend to have similar meanings) to fill in each blank, and show your work step by step.


### Example 1

A piece of ______ is on the plate.

Everyone enjoys eating ______.

You can cut ______ with a knife.

We make ______ from milk.

**Answer:** **cheese**

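**Solution:** Each context narrows the candidates. Something that sits on a plate and that everyone enjoys eating is a food; it can be cut with a knife, so it is solid; and it is made from milk, so it is a dairy product. 'cheese' fits all four contexts.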


### Example 2

The ______ is parked in the driveway.

He bought a new ______ for his birthday.

______ can drive really fast.

People often wash their ______ on the weekends.

**Answer:** **car**

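**Solution:** Each context narrows the candidates. Something parked in a driveway is a vehicle; it can be bought as a birthday gift; it can drive fast; and people wash it on weekends. 'car' fits all four contexts.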


### Example 3

I read an interesting ______ last night.

Many people enjoy a good ______ before bed.

______ often has chapters and a cover.

You can borrow a ______ from the library.

**Answer:** **book**

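**Solution:** Each context narrows the candidates. Something you read, often before bed, is a piece of writing; it has chapters and a cover; and it can be borrowed from a library. 'book' fits all four contexts.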
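
The reasoning used in the three examples can also be sketched in code. The snippet below is a minimal illustration, not part of the lab: it builds a context-count vector for each target word from a hypothetical toy corpus (invented for this sketch) and compares the vectors with cosine similarity. The helper names `context_vector` and `cosine` are assumptions made here.

```python
import math
from collections import Counter

# Hypothetical toy corpus invented for this sketch (not lab data).
corpus = [
    "we make cheese from milk",
    "we make butter from milk",
    "you can cut cheese with a knife",
    "you can cut butter with a knife",
    "he parked the car in the driveway",
    "he parked the truck in the driveway",
]

def context_vector(target, sentences):
    """Count the words that share a sentence with `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        if target in tokens:
            counts.update(t for t in tokens if t != target)
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

cheese = context_vector("cheese", corpus)
butter = context_vector("butter", corpus)
car = context_vector("car", corpus)

sim_cheese_butter = cosine(cheese, butter)
sim_cheese_car = cosine(cheese, car)
print(f"cheese~butter: {sim_cheese_butter:.2f}")
print(f"cheese~car:    {sim_cheese_car:.2f}")
```

In this toy corpus 'cheese' and 'butter' occur in identical contexts, so their similarity is 1, while 'cheese' and 'car' share no context words and score 0. Real corpora give much noisier counts, so this is only a picture of the idea.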


## Part II: NLP Model to Classify Data Stories


1. Load and preprocess the data
2. Convert text to TF-IDF features
3. Train and evaluate classification models (Logistic Regression, SVM, Multinomial NB, Random Forest)
4. Evaluate using 5-fold CV and Leave-One-Plot-Out CV
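
To make step 2 concrete, here is a simplified, hand-rolled TF-IDF sketch in plain Python. It uses the smoothed IDF form that scikit-learn documents for `smooth_idf=True`, followed by L2 normalization, but it splits on whitespace only, so it is not a drop-in replacement for `TfidfVectorizer`. The toy sentences and the names `doc_freq`, `idf`, and `tfidf` are assumptions for illustration.

```python
import math
from collections import Counter

# Hypothetical toy sentences, not taken from data_stories_one_shot.csv.
docs = [
    "the chart shows sales by month",
    "the chart shows revenue by region",
    "sales dropped because demand fell",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of sentences that contain each word.
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def idf(word):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1

def tfidf(tokens):
    # Term frequency times IDF, then scale the vector to unit L2 length.
    tf = Counter(tokens)
    weights = {w: tf[w] * idf(w) for w in tf}
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {w: v / norm for w, v in weights.items()}

vectors = [tfidf(tokens) for tokens in tokenized]
for word, weight in sorted(vectors[0].items(), key=lambda kv: -kv[1]):
    print(f"{word:8s} {weight:.3f}")
```

In the first sentence, 'month' appears in only one document, so it receives a higher weight than words such as 'chart' that appear in two. This is the property that lets rarer, more distinctive words dominate the feature vector.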


```python
# Import libraries
import re

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, LeaveOneGroupOut
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

nltk.download('wordnet')
nltk.download('stopwords')

# Load data
df = pd.read_csv('data_stories_one_shot.csv')

# Map Stage to label: 0=Show, 1=Tell
df['Label'] = df['Stage'].apply(lambda x: 0 if x == 1 else 1)

# Preprocessing function using regex tokenizer
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess(text):
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    # Regex-based tokenization
    tokens = re.findall(r'\b\w+\b', text)
    tokens = [lemmatizer.lemmatize(t) for t in tokens if t not in stop_words]
    return ' '.join(tokens)

df['clean_text'] = df['Sentence'].apply(preprocess)

# Feature extraction
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['clean_text'])
y = df['Label']
groups = df['Plot_Name']

# Define models
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'SVM (Linear)': LinearSVC(),
    'Multinomial NB': MultinomialNB(),
    'Random Forest': RandomForestClassifier(),
}

# 5-fold Cross-Validation
print("5-Fold CV Results:")
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: Mean Accuracy = {scores.mean():.3f} ± {scores.std():.3f}")

# Leave-One-Plot-Out CV
logo = LeaveOneGroupOut()
print("\nLeave-One-Plot-Out CV Results:")
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=logo, groups=groups)
    print(f"{name}: Mean Accuracy = {scores.mean():.3f} ± {scores.std():.3f}")
```



```text
5-Fold CV Results:
Logistic Regression: Mean Accuracy = 0.631 ± 0.039
SVM (Linear): Mean Accuracy = 0.823 ± 0.062
Multinomial NB: Mean Accuracy = 0.738 ± 0.095
Random Forest: Mean Accuracy = 0.677 ± 0.039

Leave-One-Plot-Out CV Results:
Logistic Regression: Mean Accuracy = 0.656 ± 0.197
SVM (Linear): Mean Accuracy = 0.798 ± 0.177
Multinomial NB: Mean Accuracy = 0.773 ± 0.161
Random Forest: Mean Accuracy = 0.681 ± 0.175
```
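
One caveat about the evaluation: the code fits `TfidfVectorizer` on all sentences before cross-validation, so the vocabulary and IDF weights are computed with the held-out sentences included. A possible variant, sketched below under stated assumptions, wraps the vectorizer and classifier in a scikit-learn pipeline so that each fold refits TF-IDF on its training portion only. The toy `texts`, `labels`, and `plots` are hypothetical stand-ins for `df['clean_text']`, `df['Label']`, and `df['Plot_Name']`; this is not the method that produced the reported numbers, and its scores would likely differ from them.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for the lab's cleaned sentences,
# labels (0=Show, 1=Tell), and plot names.
texts = [
    "bar chart show sale month", "sale rose demand grew",
    "line plot show price year", "price fell supply grew",
    "scatter plot show height weight", "weight rise height rise",
]
labels = [0, 1, 0, 1, 0, 1]
plots = ["bar", "bar", "line", "line", "scatter", "scatter"]

# The vectorizer is refit inside every training fold, so held-out
# sentences never influence the vocabulary or the IDF weights.
pipeline = make_pipeline(TfidfVectorizer(), LinearSVC(random_state=0))

scores = cross_val_score(
    pipeline, texts, labels, cv=LeaveOneGroupOut(), groups=plots
)
print(f"folds={len(scores)}, mean accuracy={scores.mean():.3f}")
```

With three distinct plot names, `LeaveOneGroupOut` produces three folds, each holding out every sentence from one plot. The same pipeline object can be passed to `cross_val_score` with `cv=5` for the standard split.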


## Results Summary

Under standard 5-fold cross-validation, only the Linear SVM exceeded 0.80 mean accuracy (0.823 ± 0.062). Multinomial Naive Bayes followed at 0.738, while Random Forest (0.677) and Logistic Regression (0.631) trailed well behind. The TF-IDF features therefore carry a usable signal for distinguishing “show” from “tell” sentences, but how much of it is recovered depends strongly on the model.

Under Leave-One-Plot-Out CV (holding out each plot's sentences in turn), mean accuracy changed only slightly: the SVM dipped to 0.798, while Multinomial NB (0.773), Random Forest (0.681), and Logistic Regression (0.656) edged up. The fold-to-fold standard deviation, however, rose from 0.04 to 0.10 under 5-fold CV to roughly 0.16 to 0.20 for every model. This suggests that accuracy varies considerably from plot to plot and that the models are sensitive to plot-specific vocabulary, although differences in held-out fold size may also contribute to the spread. The Linear SVM and Multinomial NB were the two strongest models under both splits, and Logistic Regression was the weakest.
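
The comparison between the two evaluation schemes can be checked directly from the printed means. The short sketch below copies the mean accuracies from the output above and computes the change from 5-fold CV to Leave-One-Plot-Out CV for each model; the dictionary names are chosen here for illustration.

```python
# Mean accuracies copied from the printed output of the notebook cell.
five_fold = {
    "Logistic Regression": 0.631,
    "SVM (Linear)": 0.823,
    "Multinomial NB": 0.738,
    "Random Forest": 0.677,
}
leave_one_plot_out = {
    "Logistic Regression": 0.656,
    "SVM (Linear)": 0.798,
    "Multinomial NB": 0.773,
    "Random Forest": 0.681,
}

# Change in mean accuracy when moving to the stricter, grouped split.
deltas = {
    name: round(leave_one_plot_out[name] - five_fold[name], 3)
    for name in five_fold
}

print(f"{'Model':20s} {'5-fold':>7s} {'LOPO':>7s} {'Change':>7s}")
for name, delta in deltas.items():
    print(
        f"{name:20s} {five_fold[name]:7.3f} "
        f"{leave_one_plot_out[name]:7.3f} {delta:+7.3f}"
    )

best_five_fold = max(five_fold, key=five_fold.get)
best_lopo = max(leave_one_plot_out, key=leave_one_plot_out.get)
```

Only the Linear SVM loses accuracy under the grouped split, and it remains the top model under both schemes.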

## GPT Statement

https://chatgpt.com/share/6812e26a-8b10-8001-9861-8843d81bca57