
# Friends Scripts — Preprocessing & Feature Extraction

This notebook prepares linguistic, semantic, stylistic, and structural features from the Friends TV series scripts for Exceptional Model Mining (EMM). The target variable is the per-episode IMDB rating, and we craft predictors from the scripts, metadata, and derived analyses.



## Notebook Roadmap

1. **Environment Setup** (Section 1) – install and import dependencies.
2. **Load Data** (Section 2) – inspect dialogue and metadata tables.
3. **Text Preprocessing** (Section 3) – clean, normalize, tokenize, and lemmatize dialogue while preserving negations.
4. **Feature Engineering** (Sections 4–10)
   - Lexical and stylistic metrics
   - Sentiment and emotion profiling
   - Topic modeling & semantic features
   - Humor indicators
   - Complexity & readability
   - Character interactions
   - Metadata-derived attributes
5. **Feature Aggregation** (Section 11) – consolidate all episode-level features and merge with IMDB ratings.
6. **Exploratory Visualizations** (Section 12) – histograms, bar charts, heatmaps for sanity checks.
7. **Export** (Section 13) – save the feature matrix for downstream EMM modeling.



> **Tip:** Execute the notebook step-by-step. Some stages (e.g., transformer-based sentiment analysis or BERTopic) are computationally intensive, so consider caching intermediate results when iterating.
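
One way to cache an expensive stage is to pickle its result to disk and reload it on later runs. The sketch below is only an illustration: `cached`, `CACHE_DIR`, and the cache names are hypothetical helpers, not part of this notebook.

```python
import os
import pickle
import tempfile

# Hypothetical cache location; point this at any writable folder.
CACHE_DIR = os.path.join(tempfile.gettempdir(), "friends_feature_cache")


def cached(name, compute, cache_dir=CACHE_DIR):
    """Return the pickled result for `name`, computing and saving it on first use."""
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, f"{name}.pkl")
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result


# Toy stand-in for an expensive step such as line-level sentiment scoring.
calls = []


def slow_step():
    calls.append(1)
    return {"01x01": 0.42}


demo_dir = tempfile.mkdtemp()
first = cached("sentiment_demo", slow_step, cache_dir=demo_dir)
second = cached("sentiment_demo", slow_step, cache_dir=demo_dir)  # read from disk
```

Delete the cached file whenever an upstream step (for example the cleaning rules) changes, otherwise stale results are reused.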



## 1. Environment Setup

Install external libraries (run once per environment). Feel free to comment out packages you already have installed.


In [None]:

# On a clean environment, uncomment the lines below.
# !pip install -q pandas numpy nltk spacy textstat transformers datasets bertopic gensim torch networkx seaborn matplotlib scikit-learn requests
# !python -m spacy download en_core_web_sm


In [None]:

import os
import json
import re
import math
from collections import Counter, defaultdict
from itertools import combinations, tee

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize

import spacy
from spacy.matcher import PhraseMatcher

from textstat import flesch_kincaid_grade, gunning_fog

from transformers import pipeline

from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

import networkx as nx

nltk.download('punkt')
nltk.download('punkt_tab')  # required by sent_tokenize on newer NLTK releases
nltk.download('stopwords')

nlp = spacy.load('en_core_web_sm')

# Ensure negations are preserved in stopword list
stop_words = set(stopwords.words('english'))
for neg in ['no', 'not', "n't", 'nor']:
    if neg in stop_words:
        stop_words.remove(neg)



## 2. Load Data

We use two CSV files:

* `friends_all_episodes_clean.csv` – cleaned dialogue with speakers, seasons, episodes, etc.
* `friends_episodes_v3.csv` – metadata including IMDB ratings.

The loading cell below assumes the CSVs live in the repository root. Adjust `DATA_DIR` if needed.


In [None]:

DATA_DIR = '.'
DIALOGUE_PATH = os.path.join(DATA_DIR, 'friends_all_episodes_clean.csv')
META_PATH = os.path.join(DATA_DIR, 'friends_episodes_v3.csv')

dialogue_df = pd.read_csv(DIALOGUE_PATH)
meta_df = pd.read_csv(META_PATH)

print('Dialogue shape:', dialogue_df.shape)
print('Metadata shape:', meta_df.shape)
dialogue_df.head()


In [None]:

meta_df.head()



### Standardize Episode Identifiers

Different datasets may use different column names. The helper below harmonizes identifiers and constructs a global episode index for merging.
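
As a small illustration of the idea, the sketch below resolves column aliases on a toy record and builds a zero-padded `SSxEE` code. The names `find_first` and `episode_code` are hypothetical and only mirror the approach used in the next cell.

```python
def find_first(available, candidates):
    """Return the first candidate name present in `available`."""
    for name in candidates:
        if name in available:
            return name
    raise KeyError(f"None of {candidates} found")


def episode_code(season, episode):
    """Format season and episode as a zero-padded code, e.g. 01x02."""
    return f"{int(season):02d}x{int(episode):02d}"


# Toy record using one of the alternative column spellings
row = {"Season": 1, "EpisodeNumber": 2}
season_key = find_first(row, ["season", "season_num", "Season"])
episode_key = find_first(row, ["episode", "episode_num", "EpisodeNumber"])
code = episode_code(row[season_key], row[episode_key])

# Zero padding keeps plain string sorting in season/episode order
ordered = sorted(["10x01", "02x10", "02x03"])
```

Because both parts are padded to two digits, the codes can be used directly as a merge key and as a sort key.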


In [None]:

EPISODE_ID_COLS = {
    'season': ['season', 'season_num', 'Season', 'season_number'],
    'episode': ['episode', 'episode_num', 'episode_number', 'Episode', 'EpisodeNumber', 'Episode_num'],
    'title': ['title', 'episode_title', 'EpisodeTitle', 'episode_name']
}


def find_column(df, candidates):
    for c in candidates:
        if c in df.columns:
            return c
    raise KeyError(f"None of the candidate columns {candidates} found in DataFrame")


season_col = find_column(dialogue_df, EPISODE_ID_COLS['season'])
episode_col = find_column(dialogue_df, EPISODE_ID_COLS['episode'])

if any(col in dialogue_df.columns for col in EPISODE_ID_COLS['title']):
    title_col = find_column(dialogue_df, EPISODE_ID_COLS['title'])
else:
    title_col = None

# Harmonize metadata as well
meta_season_col = find_column(meta_df, EPISODE_ID_COLS['season'])
meta_episode_col = find_column(meta_df, EPISODE_ID_COLS['episode'])
meta_title_col = find_column(meta_df, EPISODE_ID_COLS['title']) if any(col in meta_df.columns for col in EPISODE_ID_COLS['title']) else None

# Standard columns
for df, s_col, e_col in [
    (dialogue_df, season_col, episode_col),
    (meta_df, meta_season_col, meta_episode_col)
]:
    df.rename(columns={s_col: 'season', e_col: 'episode'}, inplace=True)

if title_col:
    dialogue_df.rename(columns={title_col: 'episode_title'}, inplace=True)
if meta_title_col:
    meta_df.rename(columns={meta_title_col: 'episode_title'}, inplace=True)

# Create a global episode index (season-episode formatted)
dialogue_df['episode_code'] = dialogue_df['season'].astype(int).astype(str).str.zfill(2) + 'x' + dialogue_df['episode'].astype(int).astype(str).str.zfill(2)
meta_df['episode_code'] = meta_df['season'].astype(int).astype(str).str.zfill(2) + 'x' + meta_df['episode'].astype(int).astype(str).str.zfill(2)

# Add global index
meta_df = meta_df.sort_values(['season', 'episode']).reset_index(drop=True)
meta_df['episode_global_number'] = meta_df.index + 1

print(dialogue_df[['season', 'episode', 'episode_code']].head())
print(meta_df[['season', 'episode', 'episode_code', 'episode_global_number']].head())



## 3. Text Preprocessing

Steps:

1. Lowercase text and strip stage directions `[Scene: ...]` or parenthetical actions.
2. Remove speaker prefixes (e.g., `ROSS:`) and filler utterances `(laughs)`, `(uh)`, etc.
3. Tokenize using spaCy, remove stopwords (preserving negations), and lemmatize tokens.
4. Aggregate line-level tokens per episode into documents.

We keep both the cleaned text and token lists for downstream features.
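
The regex-only sketch below walks through steps 1 and 2 on a single made-up line. It assumes speaker prefixes are written in capitals, so they are removed before lowercasing; `clean_line` and the patterns are illustrative and simpler than the ones defined in the next cell.

```python
import re

STAGE = re.compile(r"\[[^\]]*\]|\([^)]*\)")           # [Scene: ...] and (actions)
SPEAKER = re.compile(r"^\s*[A-Z][A-Z\s]+:")           # upper-case prefix such as ROSS:
FILLER = re.compile(r"\b(?:uh|um|erm|hmm+)\b", re.IGNORECASE)


def clean_line(raw):
    """Strip stage directions, speaker prefix, and fillers, then normalize."""
    text = STAGE.sub(" ", raw).strip()
    text = SPEAKER.sub(" ", text)                     # must run before lowercasing
    text = FILLER.sub(" ", text.lower())
    text = re.sub(r"[^a-z0-9'.!?\s]", " ", text)      # keep contractions and sentence ends
    return re.sub(r"\s+", " ", text).strip()


sample = "ROSS: (sighs) Um, we were on a break! [Scene: Central Perk]"
cleaned = clean_line(sample)
```

Keeping `.`, `!`, and `?` is a design choice in this sketch: sentence splitting and readability scores depend on those boundaries.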


In [None]:

# Stage directions in [brackets] and parenthetical actions such as (laughs)
STAGE_DIRECTION_PATTERN = re.compile(r'\[[^\]]+\]|\([^\)]*\)')
# Upper-case speaker prefix at the start of a line, e.g. "ROSS:"
SPEAKER_PATTERN = re.compile(r'^\s*[A-Z][A-Z\s]+:')
FILLERS = {
    'uh', 'um', 'erm', 'hmm', 'hmmm', 'mm', 'mmm', 'ah', 'oh', 'huh',
    'haha', 'hahah', 'hahaha', 'hehe', 'lol'
}
LAUGHTER_TOKENS = {'laugh', 'laughs', 'laughter', 'giggle', 'giggles'}
# Non-capturing group avoids the pandas "match groups" warning in str.contains
LAUGHTER_REGEX = re.compile(r'(?:laughs?|giggles?|chuckles?|snickers?|funny)', re.IGNORECASE)

catchphrases = [
    "how you doin'", 'we were on a break', 'could i be', 'smelly cat',
    'oh my god', 'my sandwich', "he's her lobster", 'i know!',
]
catchphrase_matcher = PhraseMatcher(nlp.vocab, attr='LOWER')
catchphrase_patterns = [nlp.make_doc(phrase) for phrase in catchphrases]
catchphrase_matcher.add('CATCHPHRASE', catchphrase_patterns)


filler_phrase_patterns = [
    re.compile(pattern, re.IGNORECASE)
    for pattern in [
        # \b anchors stop fillers from matching inside words ("um" in "umbrella")
        r'\b(?:uh|um|erm|hmm+|huh|mm-hmm|mmm+)\b',
        r'\b(?:lol|(?:ha){2,}h?|(?:he){2,}h?|ahh+)\b',
        r'\b(?:you know|i mean)\b'
    ]
]


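def clean_dialogue_text(text: str) -> str:
    if pd.isna(text):
        return ''
    # Strip stage directions and the upper-case speaker prefix *before*
    # lowercasing; SPEAKER_PATTERN expects capitals and cannot match afterwards.
    text = STAGE_DIRECTION_PATTERN.sub(' ', str(text)).strip()
    text = SPEAKER_PATTERN.sub(' ', text)
    text = text.lower()
    for pattern in filler_phrase_patterns:
        text = pattern.sub(' ', text)
    # Keep apostrophes (contractions) and sentence-final punctuation so that
    # sent_tokenize and the readability scores still see sentence boundaries.
    text = re.sub(r"[^a-z0-9'.!?\s]", ' ', text)
    text = re.sub(r'\s+', ' ', text)
    return text.strip()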


def tokenize_and_lemmatize(text: str):
    doc = nlp(text)
    tokens = []
    lemmas = []
    for token in doc:
        if token.is_space or token.is_punct:
            continue
        if token.text in stop_words:
            continue
        if token.lemma_ == '-PRON-':
            lemma = token.text.lower()
        else:
            lemma = token.lemma_.lower()
        if lemma and lemma not in stop_words and lemma not in FILLERS:
            tokens.append(token.text.lower())
            lemmas.append(lemma)
    return tokens, lemmas


def preprocess_row(row):
    # Use the first dialogue column that is present and not missing
    raw = next(
        (row[c] for c in ['line', 'dialogue', 'quote'] if c in row.index and pd.notna(row[c])),
        ''
    )
    cleaned = clean_dialogue_text(raw)
    tokens, lemmas = tokenize_and_lemmatize(cleaned)
    sentences = [sent for sent in sent_tokenize(cleaned) if sent.strip()]
    return pd.Series({
        'clean_text': cleaned,
        'tokens': tokens,
        'lemmas': lemmas,
        'num_words': len(tokens),
        'num_lemmas': len(lemmas),
        'num_sentences': len(sentences),
        # Token counts only need the tokenizer, not the full spaCy pipeline
        'sentence_lengths': [len(nlp.make_doc(sent)) for sent in sentences]
    })


# Apply preprocessing
preprocessed_df = dialogue_df.join(dialogue_df.apply(preprocess_row, axis=1))
preprocessed_df.head()



### Aggregate Dialogue Per Episode

We create a corpus where each episode is represented by its concatenated lemmatized text. This will feed lexical, sentiment, topic, and other feature computations.


In [None]:

# Identify character column
character_col_candidates = ['character', 'speaker', 'Person', 'person', 'name']
character_col = find_column(preprocessed_df, character_col_candidates)
preprocessed_df.rename(columns={character_col: 'character'}, inplace=True)

preprocessed_df['character'] = preprocessed_df['character'].str.title().str.strip()
preprocessed_df['line_id'] = np.arange(len(preprocessed_df))

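# Episode-level aggregation
from itertools import chain

# chain.from_iterable flattens the per-line lists without NumPy type coercion
# (np.concatenate turns ints into floats when some lines are empty)
agg_funcs = {
    'clean_text': lambda x: ' '.join(x),
    'lemmas': lambda x: list(chain.from_iterable(x)),
    'tokens': lambda x: list(chain.from_iterable(x)),
    'num_words': 'sum',
    'num_sentences': 'sum',
    'sentence_lengths': lambda x: list(chain.from_iterable(x)),
}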

episode_docs = preprocessed_df.groupby('episode_code').agg(agg_funcs)

episode_docs['episode_length'] = episode_docs['num_words']

episode_docs.head()



## 4. Lexical & Stylistic Features

We compute:

* **Total word count** per episode
* **Average sentence length** (tokens per sentence)
* **Type-token ratio** (unique lemmas / total lemmas)
* **Average words per line**
* **Character speaking proportions**
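
To make the definitions concrete, here is a minimal standard-library sketch of three of these metrics on toy inputs. The helper names and the sample values are made up for illustration.

```python
from collections import Counter


def type_token_ratio(lemmas):
    """Unique lemmas divided by total lemmas (0.0 for an empty list)."""
    return len(set(lemmas)) / len(lemmas) if lemmas else 0.0


def average(values):
    """Arithmetic mean with a 0.0 fallback for empty input."""
    return sum(values) / len(values) if values else 0.0


def speaking_shares(speakers):
    """Fraction of lines spoken by each character."""
    counts = Counter(speakers)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


lemmas = ["break", "break", "sandwich", "lobster"]
sentence_lengths = [4, 6, 8]
speakers = ["Ross", "Rachel", "Ross", "Joey"]

ttr = type_token_ratio(lemmas)
avg_len = average(sentence_lengths)
shares = speaking_shares(speakers)
```

Note that the type-token ratio falls as documents get longer, so it is best compared across episodes of similar length.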


In [None]:

def compute_lexical_features(df_lines, episode_docs):
    lexical_features = episode_docs[['num_words']].rename(columns={'num_words': 'total_words'})

    # Reuse the sentence lengths collected during preprocessing instead of
    # re-parsing every episode with spaCy
    lexical_features['avg_sentence_length'] = episode_docs['sentence_lengths'].apply(
        lambda lengths: float(np.mean(lengths)) if len(lengths) > 0 else 0.0
    )

    lexical_features['type_token_ratio'] = episode_docs['lemmas'].apply(lambda lemmas: len(set(lemmas)) / (len(lemmas) or 1))

    avg_words_per_line = df_lines.groupby('episode_code')['num_words'].mean().rename('avg_words_per_line')
    lexical_features = lexical_features.join(avg_words_per_line)

    # Character speaking proportions
    char_counts = df_lines.groupby(['episode_code', 'character']).size().rename('line_count').reset_index()
    char_totals = char_counts.groupby('episode_code')['line_count'].transform('sum')
    char_counts['line_pct'] = char_counts['line_count'] / char_totals

    major_characters = ['Rachel', 'Ross', 'Monica', 'Chandler', 'Joey', 'Phoebe']
    char_pivot = char_counts.pivot_table(index='episode_code', columns='character', values='line_pct', fill_value=0)
    for character in major_characters:
        if character not in char_pivot.columns:
            char_pivot[character] = 0.0
    char_pivot = char_pivot[[c for c in char_pivot.columns if c in major_characters]]
    char_pivot.columns = [f'line_pct_{c.lower()}' for c in char_pivot.columns]

    lexical_features = lexical_features.join(char_pivot, how='left')
    lexical_features.fillna(0, inplace=True)
    return lexical_features


lexical_features = compute_lexical_features(preprocessed_df, episode_docs)
lexical_features.head()



## 5. Sentiment & Emotion Features

* **Sentiment:** Use `cardiffnlp/twitter-roberta-base-sentiment` to score each line, then compute mean and variance per episode.
* **Emotion:** Map lemmas to emotions via the NRC Emotion Lexicon (anger, joy, sadness, surprise, disgust, fear) and aggregate proportions.
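
The aggregation logic can be sketched without any model or download. The lexicon below is a tiny hypothetical stand-in, not real NRC entries, and the sentiment scores are invented numbers.

```python
from collections import Counter
from statistics import mean, variance

EMOTIONS = ["anger", "joy", "sadness", "surprise", "disgust", "fear"]

# Hypothetical mini-lexicon for illustration only
TOY_LEXICON = {"love": ["joy"], "hate": ["anger", "disgust"], "afraid": ["fear"]}


def emotion_proportions(lemmas, lexicon=TOY_LEXICON):
    """Share of each tracked emotion among all emotion hits in the lemmas."""
    counts = Counter()
    for lemma in lemmas:
        counts.update(e for e in lexicon.get(lemma, []) if e in EMOTIONS)
    total = sum(counts.values())
    return {e: (counts[e] / total if total else 0.0) for e in EMOTIONS}


props = emotion_proportions(["love", "love", "hate", "sandwich"])

# Per-episode sentiment summary from invented line-level "pos" scores
pos_scores = [0.9, 0.1, 0.5]
pos_mean = mean(pos_scores)
pos_var = variance(pos_scores)  # sample variance, like the pandas default
```

An episode with no lexicon hits gets all-zero proportions in this sketch rather than missing values.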


In [None]:

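# Sentiment pipeline (transformer)
sentiment_analyzer = pipeline('sentiment-analysis', model='cardiffnlp/twitter-roberta-base-sentiment')

# This checkpoint reports generic label names, so map them to readable ones
LABEL_MAP = {
    'label_0': 'neg', 'label_1': 'neu', 'label_2': 'pos',
    'negative': 'neg', 'neutral': 'neu', 'positive': 'pos'
}


def score_sentiment(text):
    scores = {'neg': 0.0, 'neu': 0.0, 'pos': 0.0}
    if not text:
        return scores
    # top_k=None asks for the scores of all classes, not just the best one
    result = sentiment_analyzer(text, truncation=True, top_k=None)
    # Depending on the transformers version, a single input yields either a
    # flat list of dicts or a list wrapping that list
    if result and isinstance(result[0], list):
        result = result[0]
    for res in result:
        label = LABEL_MAP.get(res['label'].lower())
        if label is not None:
            scores[label] += res['score']
    return scores


sentiment_scores = preprocessed_df['clean_text'].apply(score_sentiment)
# Reuse the line index so the join aligns row by row
sentiment_df = pd.DataFrame(list(sentiment_scores), index=preprocessed_df.index)
preprocessed_df = preprocessed_df.join(sentiment_df)

sentiment_episode = preprocessed_df.groupby('episode_code')[['neg', 'neu', 'pos']].agg(['mean', 'var']).fillna(0)
sentiment_episode.columns = ['_'.join(col).strip() for col in sentiment_episode.columns.values]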

sentiment_episode.head()


In [None]:

# Load NRC Emotion Lexicon
import requests

NRC_URL = 'https://raw.githubusercontent.com/nductrin/NRC-Emotion-Lexicon-v0.92/master/NRC-Emotion-Lexicon-Wordlevel-v0.92.txt'
lexicon_path = 'nrc_lexicon.txt'
if not os.path.exists(lexicon_path):
    response = requests.get(NRC_URL, timeout=30)
    response.raise_for_status()
    with open(lexicon_path, 'w', encoding='utf-8') as f:
        f.write(response.text)

nrc_df = pd.read_csv(
    lexicon_path,
    names=['word', 'emotion', 'association'],
    sep='\t'
)

EMOTIONS = ['anger', 'joy', 'sadness', 'surprise', 'disgust', 'fear']

# Keep positive associations for the six tracked emotions only, so the
# proportions below sum to 1 over these emotions
nrc_df = nrc_df[(nrc_df['association'] == 1) & nrc_df['emotion'].isin(EMOTIONS)]
emotion_map = nrc_df.groupby('word')['emotion'].apply(list).to_dict()

def map_emotions(lemmas):
    counter = Counter()
    for lemma in lemmas:
        if lemma in emotion_map:
            counter.update(emotion_map[lemma])
    total = sum(counter.values()) or 1
    return {emotion: counter.get(emotion, 0) / total for emotion in EMOTIONS}

emotion_features = episode_docs['lemmas'].apply(map_emotions)
emotion_features = pd.DataFrame(list(emotion_features), index=episode_docs.index)
emotion_features.columns = [f'emotion_{col}' for col in emotion_features.columns]
emotion_features.head()



## 6. Topic Modeling & Semantic Features

We compute semantic features via BERTopic (can switch to LDA if resources are limited). Steps:

1. Fit BERTopic on episode documents.
2. Extract dominant topic per episode and topic probability scores.
3. Count frequencies for selected high-level keywords (`love`, `job`, `marriage`, etc.).
4. Perform Named Entity Recognition (NER) with spaCy to count salient entities.
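
Steps 2 and 3 reduce to simple operations once the topic probabilities and lemmas exist. The sketch below uses invented numbers and hypothetical helper names; it ignores BERTopic's outlier topic (`-1`), so the arg-max shown here may not coincide with the topic the model assigns.

```python
from collections import Counter


def dominant_topic(prob_row):
    """Return (topic_index, probability) for the most likely topic."""
    if not prob_row:
        return -1, float("nan")
    best = max(range(len(prob_row)), key=prob_row.__getitem__)
    return best, prob_row[best]


def keyword_counts(lemmas, keywords):
    """Count how often each tracked keyword occurs among the lemmas."""
    counter = Counter(lemmas)
    return {f"kw_{kw}": counter.get(kw, 0) for kw in keywords}


# Invented topic probabilities for a single episode
topic, score = dominant_topic([0.1, 0.7, 0.2])

kw = keyword_counts(["love", "job", "love", "coffee"], ["love", "job", "wedding"])
```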


In [None]:

# Prepare documents for BERTopic (lemmatized strings)
episode_docs['lemma_text'] = episode_docs['lemmas'].apply(lambda lemmas: ' '.join(lemmas))

vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words='english')
bertopic_model = BERTopic(vectorizer_model=vectorizer_model, language='english', calculate_probabilities=True)

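# BERTopic expects a plain list of strings
topics, probs = bertopic_model.fit_transform(episode_docs['lemma_text'].tolist())
episode_docs['bertopic_topic'] = topics
# With calculate_probabilities=True, probs is an (n_docs, n_topics) array;
# keep each episode's highest topic probability
probs = np.asarray(probs) if probs is not None else np.empty((len(topics), 0))
episode_docs['bertopic_score'] = probs.max(axis=1) if probs.ndim == 2 and probs.shape[1] > 0 else np.nan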

topic_info = bertopic_model.get_topic_info()
topic_info.head()


In [None]:

# Keyword frequency counts
keywords = ['love', 'job', 'marriage', 'break', 'date', 'baby', 'wedding', 'career', 'friend']

def count_keywords(lemmas):
    counter = Counter(lemmas)
    return {f'kw_{kw}': counter.get(kw, 0) for kw in keywords}

keyword_features = episode_docs['lemmas'].apply(count_keywords)
keyword_features = pd.DataFrame(list(keyword_features), index=episode_docs.index)
keyword_features.head()


In [None]:

# Named entity extraction
NER_COLUMNS = ['GPE', 'LOC', 'PERSON', 'ORG', 'EVENT', 'WORK_OF_ART']


def extract_named_entities(text):
    doc = nlp(text)
    counter = Counter()
    for ent in doc.ents:
        if ent.label_ in NER_COLUMNS:
            counter[f'ner_{ent.label_.lower()}'] += 1
    return counter

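# Note: clean_text is lowercased, which tends to reduce NER recall for proper nouns
ner_features = episode_docs['clean_text'].apply(extract_named_entities)
ner_features = pd.DataFrame(list(ner_features), index=episode_docs.index)
# Guarantee one column per tracked label, even if a label never occurs
ner_features = ner_features.reindex(columns=[f'ner_{label.lower()}' for label in NER_COLUMNS]).fillna(0).astype(int)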
ner_features.head()



## 7. Humor-Related Features

* Count lines containing laughter markers.
* Detect occurrences of canonical catchphrases.
* Optional: placeholder for sarcasm scores (extend with a dedicated model if available).
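
A minimal sketch of the first two counts, using plain regex and substring matching on made-up raw lines. It searches the raw text because parenthetical cues such as `(laughs)` disappear once stage directions are stripped; the pattern and phrase list are shortened for illustration.

```python
import re

LAUGHTER = re.compile(r"laughs?|giggles?|chuckles?|snickers?", re.IGNORECASE)
CATCHPHRASES = ["we were on a break", "smelly cat", "oh my god"]


def humor_counts(raw_lines):
    """Count lines with a laughter marker and total catchphrase occurrences."""
    laughter_lines = sum(1 for line in raw_lines if LAUGHTER.search(line))
    catchphrase_hits = sum(
        line.lower().count(phrase) for line in raw_lines for phrase in CATCHPHRASES
    )
    return {"laughter_line_count": laughter_lines, "catchphrase_count": catchphrase_hits}


lines = [
    "Ross: We were on a break! (laughs)",
    "Phoebe: Smelly cat, smelly cat...",
    "Chandler: Could I BE any funnier?",
]
counts = humor_counts(lines)
```

Substring matching is cruder than spaCy's `PhraseMatcher` (it ignores token boundaries), which is acceptable for a sanity check but not a replacement.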


In [None]:

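# Laughter counts per line. Parenthetical cues such as "(laughs)" are stripped
# from clean_text, so search the raw dialogue column when one is available.
raw_text_col = next((c for c in ['line', 'dialogue', 'quote'] if c in preprocessed_df.columns), 'clean_text')
preprocessed_df['has_laughter'] = preprocessed_df[raw_text_col].astype(str).str.contains(LAUGHTER_REGEX, na=False)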

# Catchphrase detection using spaCy PhraseMatcher

def count_catchphrases(text):
    doc = nlp.make_doc(text)
    matches = catchphrase_matcher(doc)
    return len(matches)

preprocessed_df['catchphrase_count'] = preprocessed_df['clean_text'].apply(count_catchphrases)

humor_episode = preprocessed_df.groupby('episode_code').agg({
    'has_laughter': 'sum',
    'catchphrase_count': 'sum'
})
humor_episode.rename(columns={
    'has_laughter': 'laughter_line_count',
    'catchphrase_count': 'catchphrase_count'
}, inplace=True)

# Placeholder for sarcasm score
preprocessed_df['sarcasm_score'] = np.nan  # Replace with actual model output if available
sarcasm_episode = preprocessed_df.groupby('episode_code')['sarcasm_score'].mean().rename('sarcasm_score_mean')

humor_features = humor_episode.join(sarcasm_episode)
humor_features.head()



## 8. Complexity Features

* **Flesch-Kincaid** and **Gunning Fog** readability scores per episode using `textstat`.
* **Syntactic complexity** via average dependency tree depth.
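
For reference, the Flesch-Kincaid grade is `0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59`, and the depth of a token is the number of ancestors between it and the root. The sketch below computes both from precomputed counts and a hand-written head array; the parse of the toy sentence is an assumption, not spaCy output.

```python
def fk_grade_from_counts(words, sentences, syllables):
    """Flesch-Kincaid grade level from raw counts."""
    return 0.39 * words / sentences + 11.8 * syllables / words - 15.59


def mean_dependency_depth(heads):
    """Average number of ancestors per token.

    heads[i] is the index of token i's head; the root points to itself.
    """
    depths = []
    for i in range(len(heads)):
        depth, node = 0, i
        while heads[node] != node:
            node = heads[node]
            depth += 1
        depths.append(depth)
    return sum(depths) / len(depths) if depths else float("nan")


grade = fk_grade_from_counts(words=100, sentences=10, syllables=150)

# "Joey eats the sandwich": Joey -> eats (root), the -> sandwich -> eats
heads = [1, 1, 3, 1]
depth = mean_dependency_depth(heads)
```

Both scores depend on sentence boundaries, so they are only meaningful if sentence-final punctuation survives the cleaning step.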


In [None]:


def compute_readability(text):
    if not text.strip():
        return pd.Series({'flesch_kincaid': np.nan, 'gunning_fog': np.nan})
    return pd.Series({
        'flesch_kincaid': flesch_kincaid_grade(text),
        'gunning_fog': gunning_fog(text)
    })


readability_features = episode_docs['clean_text'].apply(compute_readability)


def dependency_depth(doc):
    depths = [len(list(token.ancestors)) for token in doc]
    return np.mean(depths) if depths else np.nan


syntax_features = episode_docs['clean_text'].apply(lambda text: dependency_depth(nlp(text))).rename('avg_dependency_depth')

complexity_features = readability_features.join(syntax_features)
complexity_features.head()



## 9. Character Interaction Features

We explore conversational dynamics:

* **Turn-taking matrix:** counts transitions between speakers.
* **Co-occurrence density:** overall interaction richness.
* **Key pair interactions:** counts for iconic pairs (e.g., Chandler–Joey).
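
The counting scheme can be sketched on a toy speaker sequence: consecutive lines by different speakers count as one turn, directed for the matrix and undirected for pair totals. The function name and data are illustrative.

```python
from collections import Counter


def turn_taking(speakers):
    """Count directed transitions and undirected pair interactions."""
    transitions = Counter()
    pairs = Counter()
    for prev, curr in zip(speakers, speakers[1:]):
        if prev == curr:
            continue  # same speaker continuing is not a turn change
        transitions[(prev, curr)] += 1
        pairs[tuple(sorted((prev, curr)))] += 1
    return transitions, pairs


speakers = ["Chandler", "Joey", "Joey", "Chandler", "Monica", "Chandler"]
transitions, pairs = turn_taking(speakers)
total_turns = sum(pairs.values())
```

Adjacency is only a proxy for interaction: two lines that straddle a scene change are still counted as a turn between their speakers.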


In [None]:

iconic_pairs = [
    ('Chandler', 'Joey'),
    ('Ross', 'Rachel'),
    ('Monica', 'Chandler'),
    ('Phoebe', 'Joey'),
    ('Ross', 'Monica'),
    ('Rachel', 'Monica')
]


def compute_interactions(df):
    interactions = defaultdict(lambda: defaultdict(int))
    pair_counter = Counter()
    speakers = df['character'].fillna('Unknown').tolist()
    for prev, curr in zip(speakers, speakers[1:]):
        if prev == curr:
            continue
        interactions[prev][curr] += 1
        pair = tuple(sorted((prev, curr)))
        pair_counter[pair] += 1
    matrix = pd.DataFrame(interactions).fillna(0)
    total_turns = sum(pair_counter.values())
    unique_pairs = len(pair_counter)
    density = total_turns / (len(set(speakers)) ** 2 or 1)
    features = {
        'total_turns': total_turns,
        'unique_pairs': unique_pairs,
        'interaction_density': density
    }
    for pair in iconic_pairs:
        features[f'interactions_{pair[0].lower()}_{pair[1].lower()}'] = pair_counter.get(tuple(sorted(pair)), 0)
    return features

interaction_series = preprocessed_df.sort_values('line_id').groupby('episode_code').apply(compute_interactions)
# Build the frame from the groupby result's own index so rows stay aligned with their episodes
interaction_features = pd.DataFrame(interaction_series.tolist(), index=interaction_series.index)
interaction_features.head()



## 10. Meta Features

Derive structured attributes from metadata:

* Episode number within season and globally
* Season premiere/finale flags
* Season encoded as categorical dummies
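
The premiere and finale flags simply mark the lowest and highest episode number within each season. A standard-library sketch on a made-up episode list, with a hypothetical helper name:

```python
def premiere_finale_flags(episodes):
    """Flag the first and last episode of each season.

    `episodes` is a list of (season, episode) tuples.
    """
    first, last = {}, {}
    for season, ep in episodes:
        first[season] = min(ep, first.get(season, ep))
        last[season] = max(ep, last.get(season, ep))
    return [
        {
            "season": season,
            "episode": ep,
            "is_season_premiere": int(ep == first[season]),
            "is_season_finale": int(ep == last[season]),
        }
        for season, ep in episodes
    ]


flags = premiere_finale_flags([(1, 1), (1, 2), (1, 3), (2, 1), (2, 2)])
```

If the metadata is missing an episode at the start or end of a season, the flag lands on the wrong episode, so it is worth checking episode counts per season first.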


In [None]:

meta_features = meta_df.set_index('episode_code')[['season', 'episode', 'episode_global_number']]

# Premiere / finale flags
meta_features['is_season_premiere'] = meta_features.groupby('season')['episode'].transform(lambda x: x == x.min()).astype(int)
meta_features['is_season_finale'] = meta_features.groupby('season')['episode'].transform(lambda x: x == x.max()).astype(int)

season_dummies = pd.get_dummies(meta_features['season'], prefix='season', dtype=int)  # 0/1 instead of booleans
meta_features = meta_features.join(season_dummies)

meta_features.head()



## 11. Consolidate Feature Matrix

Combine all feature tables into a single DataFrame indexed by `episode_code`, then merge with IMDB ratings. Missing numeric feature values are filled with zeros, and missing episode titles are set to `'Unknown'`.


In [None]:

feature_frames = [
    lexical_features,
    sentiment_episode,
    emotion_features,
    keyword_features,
    ner_features,
    humor_features,
    complexity_features,
    interaction_features,
    meta_features
]

feature_matrix = pd.concat(feature_frames, axis=1)

# Merge with BERTopic outputs separately to avoid duplicates
feature_matrix = feature_matrix.join(episode_docs[['bertopic_topic', 'bertopic_score']], how='left')

# Merge IMDB ratings / metadata
rating_col_candidates = ['imdb_rating', 'imdb', 'imdb_rating_scaled', 'Rating', 'imdbRating']
imdb_col = find_column(meta_df, rating_col_candidates)

# episode_global_number already comes from meta_features; selecting it again
# would make the join fail on overlapping column names
target_cols = [imdb_col] + (['episode_title'] if 'episode_title' in meta_df.columns else [])
meta_target = meta_df.set_index('episode_code')[target_cols].rename(columns={imdb_col: 'imdb_rating'})

full_dataset = feature_matrix.join(meta_target, how='left')

# Handle missing values: zero-fill numeric features, but never the target
numeric_cols = full_dataset.select_dtypes(include=[np.number]).columns.drop('imdb_rating', errors='ignore')
full_dataset[numeric_cols] = full_dataset[numeric_cols].fillna(0)
if 'episode_title' in full_dataset.columns:
    full_dataset['episode_title'] = full_dataset['episode_title'].fillna('Unknown')

full_dataset.head()



## 12. Exploratory Visualizations

Sanity-check feature distributions.


In [None]:

plt.figure(figsize=(10, 6))
sns.histplot(full_dataset['total_words'], bins=20, kde=True)
plt.title('Distribution of Episode Word Counts')
plt.xlabel('Total Words')
plt.ylabel('Episodes')
plt.show()


In [None]:

plt.figure(figsize=(10, 6))
char_cols = [col for col in full_dataset.columns if col.startswith('line_pct_')]
full_dataset[char_cols].mean().sort_values(ascending=False).plot(kind='bar')
plt.title('Average Speaking Share by Main Characters')
plt.ylabel('Average Share of Lines (0–1)')
plt.show()


In [None]:

plt.figure(figsize=(10, 6))
interaction_cols = [col for col in full_dataset.columns if col.startswith('interactions_')]
sns.heatmap(full_dataset[interaction_cols], cmap='viridis')
plt.title('Character Pair Interactions per Episode')
plt.xlabel('Pair')
plt.ylabel('Episode Index')
plt.show()



## 13. Export Feature Matrix

Save the processed dataset for downstream Exceptional Model Mining experiments.


In [None]:

OUTPUT_PATH = 'friends_episode_feature_matrix.csv'
full_dataset.to_csv(OUTPUT_PATH, index=True)
print(f'Saved feature matrix to {OUTPUT_PATH}. Rows: {len(full_dataset)}')



### Next Steps

* Run preliminary EMM analyses (e.g., pattern mining on high-IMDB episodes).
* Experiment with alternative sentiment / emotion models for robustness.
* Integrate sarcasm detection when a reliable model is available.
