In [None]:
# Exercise sheet 9: Word2Vec on the transcripts of the 5 Star Trek TV shows and the names of their characters. Load the data set into your console.

# Task 1
In Moodle you will find the files trek.json and characters.csv. The first file contains transcripts of 5 Star Trek TV shows, separated into individual episodes.

The second file contains the names of characters, the TV show they appear in, and their respective rank or role in the show.

In this exercise, we will investigate how well Word2Vec models the relationships between characters in the Star Trek franchise and how different window sizes change the relationships that the model captures.

Please note: The names “obrien” and “tpol” originally contained an apostrophe. For Word2Vec to recognize the characters correctly, you have to replace each apostrophe with an empty string!


In [1]:
pip install gensim pandas scipy

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [38]:
import os
import json
import pandas as pd
import numpy as np
import gensim
from gensim.models import Word2Vec
from gensim.corpora import Dictionary
from scipy.spatial.distance import cdist
from itertools import combinations

In [13]:
# File paths
file_path = '/Users/oayanwale/Downloads/NLP_Exercise_24_25/Data'
trek_json_file = f'{file_path}/trek.json'
characters_csv_file = f'{file_path}/characters.csv'

# Task 1: Load data from JSON and CSV files.
with open(trek_json_file, 'r') as file:
    transcripts = json.load(file)

characters_df = pd.read_csv(characters_csv_file)

# Preprocess character names by removing apostrophes.
characters_df['Character'] = characters_df['Character'].str.replace("'", "", regex=False)


In [63]:
# Option 2: read the JSON with pandas and flatten it into one list of transcripts (this list is used for the rest of the sheet)

data = pd.read_json("/Users/oayanwale/Downloads/NLP_Exercise_23/trek.json")
texts = data.values.flatten().tolist()  # One flat list containing every cell of the table
texts = [x for x in texts if isinstance(x, str)]  # Keep only string cells (drops NaN and other non-string values)


# Task 2
# Preprocess the texts so that they are fit for an analysis. 
# Argue for the preprocessing steps you take for the given analysis.

In [64]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")
nltk.download("wordnet")

# Initialize lemmatizer and stopwords
lemma = WordNetLemmatizer()
stop = set(stopwords.words("english"))

# Normalize the stop words like the text itself (strip apostrophes, lemmatize) so they match the processed tokens
stop = {lemma.lemmatize(re.sub(r"[^a-z]", "", x)) for x in stop}

def preprocess(text):
    # Remove apostrophes to match Word2Vec format
    text = text.replace("'", "")

    # Drop everything up to and including the first date-like string (e.g. an air date), if present
    date_match = re.search(r"\d{1,2}(th|st|nd|rd)? \w+,? \d{4}|\w+ \d{1,2}(th|st|nd|rd)?,? \d{4}", text)
    if date_match:
        text = text[date_match.end():]  # Slice after the first match only, even if the same date recurs later

    # Convert to lowercase
    text = text.lower()

    # Remove non-alphabetic characters
    text = re.sub(r"[^a-z ]", " ", text)

    # Tokenize and remove extra spaces
    tokens = text.split()

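    # Lemmatize first, then filter: the stop set holds lemmatized forms
    # (e.g. "was" becomes "wa"), so tokens must be lemmatized before the lookup
    lemmas = (lemma.lemmatize(word) for word in tokens)
    tokens = [word for word in lemmas if word not in stop]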

    return tokens  # Return a list of tokens for Word2Vec

# Apply preprocessing
processed_texts = [preprocess(text) for text in texts]


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/oayanwale/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/oayanwale/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


# Preprocessing Steps and Their Justifications
Lowercasing Text: Lowercasing makes the text uniform, so that words that mean the same but differ in case (like "Data" and "data") are treated as one token. This matters for Word2Vec because it treats tokens as case-sensitive strings and would otherwise learn separate embeddings for the capitalized and the lowercase version of the same word.

Removing Apostrophes: characters.csv lists the names without apostrophes ("obrien", "tpol"). Because all other non-alphabetic characters are later replaced by spaces, a name such as "O'Brien" would otherwise be split into "o" and "brien" and never match its entry in the character list. Removing apostrophes first keeps each name as a single token that Word2Vec can learn a vector for.

Replacing Hyphens with Spaces: Hyphens connect words (e.g., "long-term" or "Jean-Luc"). The code replaces every non-alphabetic character, hyphens included, with a space, so compounds are split into their components. This keeps the vocabulary smaller and lets each component contribute to the counts of the plain word instead of forming a rare compound token.

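Tokenizing Text: Tokenizing the text (i.e., splitting it into individual words) is required because Word2Vec takes lists of tokens as input and learns each word's vector from the tokens around it.

Removing Stop Words: Function words such as "the" or "and" occur next to almost every other word and carry little information about which characters belong together. Removing them means the limited context window, especially the window of two, is filled with content words instead.

Lemmatizing: The WordNet lemmatizer maps inflected forms to a base form (with the default settings mainly plural nouns, e.g. "ships" to "ship"), so that variants of a word share one vector and one set of counts.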

Summary
The preprocessing steps standardize the text so that each character name and each content word is represented by exactly one token. Lowercasing, removing apostrophes, and replacing hyphens and other non-alphabetic characters with spaces prevent the model from treating spelling variants as different entities. Stop-word removal and lemmatization reduce the vocabulary to informative base forms, and tokenization produces the lists of words that Word2Vec needs as input.
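
The effect of the character-level steps can be checked on a single line of text. The sketch below is a simplified stand-in for preprocess (no date stripping, stop-word removal, or lemmatization, so it runs without NLTK); the function name normalize and the sample sentence are made up for illustration.

```python
import re

def normalize(text):
    """Drop apostrophes, lowercase, turn every other non-letter into a space, then split."""
    text = text.replace("'", "").lower()
    text = re.sub(r"[^a-z ]", " ", text)
    return text.split()

# Hypothetical sample line, not taken from the transcripts
sample = "O'Brien: Jean-Luc, the long-term plan failed."
tokens = normalize(sample)
print(tokens)
```

The apostrophe is removed before the other non-letters are replaced, so the name stays in one piece while the hyphenated words are split.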


# Task 3
# Train a Word2Vec model on all transcripts with a window size of two (i.e. two words in each direction) and a vector dimension of 300. 
# Train another model with the same parameters and only change the window size to ten.

In [65]:
from gensim.models import Word2Vec

# Train Word2Vec model with window size of two
model1 = Word2Vec(processed_texts, vector_size=300, window=2, min_count=1, workers=4)

# Train Word2Vec model with window size of ten
model2 = Word2Vec(processed_texts, vector_size=300, window=10, min_count=1, workers=4)
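
The window parameter sets how many tokens on each side of a word count as its context. The following sketch only illustrates that idea; it is not how gensim is implemented, and gensim's training applies additional sampling on top of the nominal window. The helper context_words and the token list are hypothetical.

```python
def context_words(tokens, index, window):
    """Return the tokens within `window` positions of tokens[index], excluding the word itself."""
    start = max(0, index - window)
    left = tokens[start:index]
    right = tokens[index + 1:index + 1 + window]
    return left + right

# Hypothetical token sequence after preprocessing
tokens = ["captain", "picard", "ordered", "riker", "data", "laforge", "bridge"]
i = tokens.index("riker")

narrow = context_words(tokens, i, window=2)   # two neighbours on each side
wide = context_words(tokens, i, window=10)    # reaches across the whole sequence here
print(narrow)
print(wide)
```

With the narrow window only the immediate neighbours inform a word's vector, while the wide window also pulls in words from further away in the same scene.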


In [66]:
from gensim.corpora import Dictionary

# Create dictionary and corpus using preprocessed tokens
dictionary = Dictionary(processed_texts)
corpus = [dictionary.doc2bow(script) for script in processed_texts]


# Task 4
# We will now use the characters from characters.csv and see how well Word2Vec differentiates the different TV shows. Calculate the cosine similarities of all possible character pairs for both models.

# Then, calculate the average similarity between all character pairs within each TV show and the average pairwise similarity to all characters of a different TV show. In the end you should have a 5x5 matrix, containing average pairwise similarities between and within all 5 TV shows.

# What do you notice? Which model differentiates the characters of a TV show better from those of other TV shows?


In [83]:
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

characters_df = pd.read_csv(characters_csv_file)
characters_df

Unnamed: 0,Character,Series,Roles
0,archer,ENT,Captains
1,kirk,TOS,Captains
2,picard,TNG,Captains
3,sisko,DS9,Captains
4,janeway,VOY,Captains
5,tucker,ENT,Engineers
6,scott,TOS,Engineers
7,laforge,TNG,Engineers
8,obrien,DS9,Engineers
9,torres,VOY,Engineers
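
The next cell calls calculate_pairwise_distances, whose definition does not appear in this notebook. The sketch below is one possible reconstruction, not the original helper: it assumes the function returns, for each name, a row of cosine distances to all names, which would explain the 0.0 self-entries in the printed output. The toy model is a stand-in so the example runs without a trained Word2Vec model.

```python
from types import SimpleNamespace

import numpy as np

def calculate_pairwise_distances(names, model):
    """Assumed behaviour: row i holds the cosine distances from names[i] to every name."""
    vectors = np.vstack([model.wv[name] for name in names])
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return (1.0 - unit @ unit.T).tolist()

# Toy stand-in for a trained model: anything exposing `.wv[name]` works here
toy_model = SimpleNamespace(wv={
    "kirk": np.array([1.0, 0.0]),
    "spock": np.array([1.0, 1.0]),
    "sisko": np.array([0.0, 1.0]),
})
rows = calculate_pairwise_distances(["kirk", "spock", "sisko"], toy_model)
print(rows[0])
```

If the original helper returned similarities instead, the same code with `unit @ unit.T` in place of `1.0 - unit @ unit.T` would apply. With a real gensim model, every name must be in the vocabulary, otherwise `model.wv[name]` raises a KeyError.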


In [84]:
# Step 2: Calculate pairwise distances using the full DataFrame
if "Character" in characters_df.columns:
    characters_df["Distances1"] = calculate_pairwise_distances(characters_df["Character"].tolist(), model=model1)
    characters_df["Distances2"] = calculate_pairwise_distances(characters_df["Character"].tolist(), model=model2)

# Display updated DataFrame with distances
print(characters_df)

   Character Series           Roles  \
0     archer    ENT        Captains   
1       kirk    TOS        Captains   
2     picard    TNG        Captains   
3      sisko    DS9        Captains   
4    janeway    VOY        Captains   
5     tucker    ENT       Engineers   
6      scott    TOS       Engineers   
7    laforge    TNG       Engineers   
8     obrien    DS9       Engineers   
9     torres    VOY       Engineers   
10      tpol    ENT  First Officers   
11     spock    TOS  First Officers   
12     riker    TNG  First Officers   
13      kira    DS9  First Officers   
14  chakotay    VOY  First Officers   
15      trip    ENT       Nicknames   
16    scotty    TOS       Nicknames   
17   beverly    TNG       Nicknames   
18    jadzia    DS9       Nicknames   
19     harry    VOY       Nicknames   

                                           Distances1  \
0   [0.0, 0.7208909048931522, 0.6895365561243276, ...   
1   [0.7208909048931522, 0.0, 0.602005696738748, 0...   
2   [0.68

In [88]:
def calculate_average_similarity(df, similarity_col):
    avg_within_series = {}
    avg_between_series = {}

    # Group by Series and calculate averages within each group
    grouped = df.groupby('Series')

    for name, group in grouped:
        # Flatten the rows of this series. Note: each row holds the values to ALL
        # characters, not only to the members of the same series
        all_similarities_within = [sim for sublist in group[similarity_col].tolist() for sim in sublist]
        avg_within_series[name] = np.mean(all_similarities_within)

        # Calculate averages with other series
        other_groups = df[df['Series'] != name]

        all_similarities_between = []
        for _, other_group in other_groups.groupby('Series'):
            all_similarities_between.extend([sim for sublist in other_group[similarity_col].tolist() for sim in sublist])
        
        avg_between_series[name] = np.mean(all_similarities_between) if all_similarities_between else 0

    return avg_within_series, avg_between_series

In [89]:
# Step 3: Calculate average similarities using both models
avg_within_1, avg_between_1 = calculate_average_similarity(characters_df, "Distances1")
avg_within_2, avg_between_2 = calculate_average_similarity(characters_df, "Distances2")


In [92]:
def create_similarity_matrix(avg_within, avg_between):
    series_names = list(avg_within.keys())
    
    # Initialize a square matrix with zeros
    similarity_matrix = np.zeros((len(series_names), len(series_names)))

    # Fill diagonal with within-series averages
    for i in range(len(series_names)):
        similarity_matrix[i][i] = avg_within[series_names[i]]

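    # Fill off-diagonal entries: column j gets the single between-series average
    # of series j, so all off-diagonal entries within one column are identical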
    for i in range(len(series_names)):
        for j in range(len(series_names)):
            if i != j:
                similarity_matrix[i][j] = avg_between[series_names[j]]

    return pd.DataFrame(similarity_matrix, index=series_names, columns=series_names)

In [93]:
# Step 4: Create similarity matrices for both models
similarity_matrix_model_1 = create_similarity_matrix(avg_within_1, avg_between_1)
similarity_matrix_model_2 = create_similarity_matrix(avg_within_2, avg_between_2)

# Display matrices
print("Similarity Matrix Model 1:\n", similarity_matrix_model_1)
print("Similarity Matrix Model 2:\n", similarity_matrix_model_2)

Similarity Matrix Model 1:
           DS9       ENT       TNG       TOS       VOY
DS9  0.672919  0.644596  0.646871  0.644426  0.640239
ENT  0.634404  0.632154  0.646871  0.644426  0.640239
TNG  0.634404  0.644596  0.623053  0.644426  0.640239
TOS  0.634404  0.644596  0.646871  0.632832  0.640239
VOY  0.634404  0.644596  0.646871  0.644426  0.649578
Similarity Matrix Model 2:
           DS9       ENT       TNG       TOS       VOY
DS9  0.860263  0.845245  0.838786  0.844609  0.827370
ENT  0.831916  0.806946  0.838786  0.844609  0.827370
TNG  0.831916  0.845245  0.832780  0.844609  0.827370
TOS  0.831916  0.845245  0.838786  0.809490  0.827370
VOY  0.831916  0.845245  0.838786  0.844609  0.878447
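
In the matrices above every column repeats a single between-series value, so they do not yet contain one average per pair of shows. A minimal sketch of the pair-by-pair average that Task 4 describes is given below. It works on cosine similarities and skips each character's similarity to itself; the function name group_similarity_matrix and the toy vectors are assumptions for illustration, not part of the original notebook.

```python
import numpy as np

def group_similarity_matrix(vectors, labels):
    """Average pairwise cosine similarity for every pair of groups."""
    vectors = np.asarray(vectors, dtype=float)
    labels = np.asarray(labels)
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T  # cosine similarity of every pair of rows
    groups = sorted(set(labels.tolist()))
    result = np.full((len(groups), len(groups)), np.nan)
    for i, a in enumerate(groups):
        for j, b in enumerate(groups):
            block = sims[np.ix_(labels == a, labels == b)]
            if a == b:
                # Within a group, skip each character's similarity to itself
                off_diag = ~np.eye(block.shape[0], dtype=bool)
                if off_diag.any():
                    result[i, j] = block[off_diag].mean()
            else:
                result[i, j] = block.mean()
    return groups, result

# Toy data: two groups whose members are identical within and orthogonal between
vectors = [[1, 0], [1, 0], [0, 1], [0, 1]]
labels = ["TOS", "TOS", "TNG", "TNG"]
groups, matrix = group_similarity_matrix(vectors, labels)
print(groups)
print(matrix)
```

With the notebook's objects, the call would presumably look like `group_similarity_matrix([model1.wv[c] for c in characters_df["Character"]], characters_df["Series"].tolist())`, assuming every name is in the model's vocabulary; passing the Roles column instead gives the grouping for Task 5.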


# Model Comparison:

Model 1 (window size 2): All values lie between about 0.62 and 0.67, lower than in Model 2. The overall level alone says little about differentiation; what matters is how far the diagonal (within a show) lies from the off-diagonal entries (between shows).

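Model 2 (window size 10): All values lie between about 0.81 and 0.88. This is consistent with a wider window: names that occur in the same scenes share more of their context words, which moves all character vectors closer together. A higher overall level does not by itself mean that the shows are separated better.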

Relative Differentiation Between Models:
The spread of the entries differs between the two models. In Model 1 the entries range from 0.623 to 0.673, a spread of about 0.05.

In Model 2 they range from 0.807 to 0.878, a spread of about 0.07, so the shows are pulled slightly further apart under the larger window.

Consistent Patterns Across Shows:

Both models rank the shows in a similar way. DS9 and VOY have the highest diagonal values in both matrices (0.673 and 0.650 in Model 1, 0.860 and 0.878 in Model 2), while ENT, TNG and TOS have the lowest.

The differences among shows might reflect how often characters interact or share similar themes and contexts.

Diagonal vs. Off-Diagonal Values:

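One would expect diagonal values (average similarity within the same show) to be higher than off-diagonal values (average similarity between different shows), since characters within the same show share more context and dialogue. The matrices show this only in part: in both models only DS9 and VOY have a diagonal value above all other entries in their row, whereas for ENT and TOS the diagonal is the smallest entry in the row.

In Model 2 the gap is larger for VOY (0.878 against at most 0.845 in its row, compared with 0.650 against 0.647 in Model 1) but smaller for DS9 (0.860 against 0.845, compared with 0.673 against 0.647). Together with the larger overall spread, this points to Model 2 differentiating the shows somewhat better, although the evidence from these matrices is suggestive rather than conclusive.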
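
The comparison of diagonal and off-diagonal values can be condensed into one number per show: the diagonal entry minus the mean of the other entries in the same row. The sketch below is an illustration under that definition; the name within_between_gap and the toy matrix are hypothetical.

```python
import numpy as np

def within_between_gap(matrix):
    """Diagonal entry minus the mean of the other entries in the same row, per group."""
    matrix = np.asarray(matrix, dtype=float)
    n = matrix.shape[0]
    diagonal = np.diag(matrix)
    off_diag_mean = (matrix.sum(axis=1) - diagonal) / (n - 1)
    return diagonal - off_diag_mean

# Toy matrix: only the first group is clearly more similar to itself than to the others
toy = [[0.9, 0.5, 0.5],
       [0.5, 0.6, 0.7],
       [0.4, 0.6, 0.5]]
gaps = within_between_gap(toy)
print(gaps)
```

A positive gap means the group is on average closer to itself than to the other groups. Applied to the notebook's results, this could be called as `within_between_gap(similarity_matrix_model_1.to_numpy())` and compared against the same call for Model 2.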



# Task 5
# Repeat Task 4 for the Roles column, which contains the role each character represents in the TV show.

# Again, compare the inner vs. outer similarities within these groups. Which model works better for this task?

In [102]:
# Step 1: Calculate pairwise distances for the role comparison

# The distances are computed between character names exactly as in Task 4;
# the Roles column is only used afterwards to group the characters
if "Roles" in characters_df.columns:
    characters_df["Role_Distances1"] = calculate_pairwise_distances(characters_df["Character"].tolist(), model=model1)
    characters_df["Role_Distances2"] = calculate_pairwise_distances(characters_df["Character"].tolist(), model=model2)

# Display updated DataFrame with distances based on roles
print(characters_df)

   Character Series           Roles  \
0     archer    ENT        Captains   
1       kirk    TOS        Captains   
2     picard    TNG        Captains   
3      sisko    DS9        Captains   
4    janeway    VOY        Captains   
5     tucker    ENT       Engineers   
6      scott    TOS       Engineers   
7    laforge    TNG       Engineers   
8     obrien    DS9       Engineers   
9     torres    VOY       Engineers   
10      tpol    ENT  First Officers   
11     spock    TOS  First Officers   
12     riker    TNG  First Officers   
13      kira    DS9  First Officers   
14  chakotay    VOY  First Officers   
15      trip    ENT       Nicknames   
16    scotty    TOS       Nicknames   
17   beverly    TNG       Nicknames   
18    jadzia    DS9       Nicknames   
19     harry    VOY       Nicknames   

                                           Distances1  \
0   [0.0, 0.7208909048931522, 0.6895365561243276, ...   
1   [0.7208909048931522, 0.0, 0.602005696738748, 0...   
2   [0.68

In [103]:
# Step 2: Calculate average similarities within and between roles
# Same computation as calculate_average_similarity in Task 4, grouped by Roles instead of Series

def calculate_average_similarity_by_role(df, similarity_col):
    avg_within_role = {}
    avg_between_role = {}

    # Group by Role and calculate averages within each group
    grouped = df.groupby('Roles')

    for name, group in grouped:
        # Flatten the rows of this role. Note: each row holds the values to ALL
        # characters, not only to the members of the same role
        all_similarities_within = [sim for sublist in group[similarity_col].tolist() for sim in sublist]
        avg_within_role[name] = np.mean(all_similarities_within) if all_similarities_within else 0

        # Calculate averages with other roles
        other_groups = df[df['Roles'] != name]

        all_similarities_between = []
        for _, other_group in other_groups.groupby('Roles'):
            all_similarities_between.extend([sim for sublist in other_group[similarity_col].tolist() for sim in sublist])
        
        avg_between_role[name] = np.mean(all_similarities_between) if all_similarities_between else 0

    return avg_within_role, avg_between_role

In [104]:
# Step 3: Calculate average similarities using both models based on Roles
avg_within_roles_1, avg_between_roles_1 = calculate_average_similarity_by_role(characters_df, "Role_Distances1")
avg_within_roles_2, avg_between_roles_2 = calculate_average_similarity_by_role(characters_df, "Role_Distances2")

In [105]:
# Step 4: Create Similarity Matrices Based on Roles

# Create similarity matrices for both models based on Roles
role_similarity_matrix_model_1 = create_similarity_matrix(avg_within_roles_1, avg_between_roles_1)
role_similarity_matrix_model_2 = create_similarity_matrix(avg_within_roles_2, avg_between_roles_2)

# Display matrices
print("Role Similarity Matrix Model 1:\n", role_similarity_matrix_model_1)
print("Role Similarity Matrix Model 2:\n", role_similarity_matrix_model_2)

Role Similarity Matrix Model 1:
                 Captains  Engineers  First Officers  Nicknames
Captains        0.635921   0.650206         0.64263   0.631423
Engineers       0.644169   0.617810         0.64263   0.631423
First Officers  0.644169   0.650206         0.64054   0.631423
Nicknames       0.644169   0.650206         0.64263   0.674159
Role Similarity Matrix Model 2:
                 Captains  Engineers  First Officers  Nicknames
Captains        0.834309   0.851137        0.838587   0.821940
Engineers       0.838677   0.796931        0.838587   0.821940
First Officers  0.838677   0.851137        0.834579   0.821940
Nicknames       0.838677   0.851137        0.838587   0.884522


Model Comparison for Task 5

Model 1:
The similarity scores in Model 1 are lower overall (about 0.62 to 0.67).
The roles lie close together, with a spread of about 0.06 between the smallest and the largest entry, so this model separates the characters by role only weakly.

Model 2:
The similarity scores in Model 2 are consistently higher across all role comparisons (about 0.80 to 0.88).
The spread between the smallest and the largest entry is larger, about 0.09, so the roles are pulled somewhat further apart.
The highest score (0.884522) belongs to "Nicknames", which is the only group whose diagonal value exceeds the other entries in its row. For Captains, Engineers and First Officers the diagonal is not the largest entry in the row, and this holds in Model 1 as well.

Conclusion
On these numbers, Model 2 is the better candidate for differentiating characters by role, because its entries are spread further apart (0.797 to 0.885, against 0.618 to 0.674 in Model 1). The higher overall level of its values is not evidence in itself, and in both models only the Nicknames group is more similar to itself than to the other groups, so the advantage of Model 2 for this task is small.

Implications
This implies that when analyzing character relationships within a franchise like Star Trek, a Word2Vec model with a larger window size (as in Model 2) captures broader contextual information from dialogues and interactions across episodes.