# Task 1 

In Moodle you will find the files trek.json and characters.csv. The first file contains transcripts of five Star Trek TV shows, separated into individual episodes. The second file contains
the names of the characters, the TV show they appear in, and their respective rank or role in the
show.


In this exercise, we will investigate how well Word2Vec models the relationships between characters in the Star Trek franchise and how different window sizes change the relationships
that the model captures.


Please note: The names “obrien” and “tpol” originally contained an apostrophe. For Word2Vec
to recognize the characters correctly, you have to replace each apostrophe with an empty string!


In [4]:
import json
import pandas as pd
from gensim.models import Word2Vec
from gensim.parsing.preprocessing import remove_stopwords
import re

# Load trek.json
with open('trek.json') as f:
    trek_data = json.load(f)

# Load characters.csv
characters_df = pd.read_csv('characters.csv')

characters_df

| | Character | Series | Roles |
|---|---|---|---|
| 0 | archer | ENT | Captains |
| 1 | kirk | TOS | Captains |
| 2 | picard | TNG | Captains |
| 3 | sisko | DS9 | Captains |
| 4 | janeway | VOY | Captains |
| 5 | tucker | ENT | Engineers |
| 6 | scott | TOS | Engineers |
| 7 | laforge | TNG | Engineers |
| 8 | obrien | DS9 | Engineers |
| 9 | torres | VOY | Engineers |


# Task 2


Preprocess the texts so that they are fit for analysis. Justify the preprocessing steps
you take for the given analysis.

In [18]:
# Load the JSON data from the file
filename = 'trek.json'
with open(filename, 'r') as file:
    trek_data = json.load(file)

# List to hold preprocessed sentences (as lists of words)
sentences = []

# Loop through each show and its episodes
for show, episodes in trek_data.items():
    for episode_title, episode_text in episodes.items():
        # Lowercase and remove specific apostrophes from character names
        text = episode_text.lower().replace("o'brien", "obrien").replace("t'pol", "tpol")
        # Tokenize the text, keeping only words
        tokens = re.findall(r'\w+', text)
        sentences.append(tokens)

# Example: Print the first few tokens from the first episode's processing
if sentences:
    print("First few tokens from the first episode:", sentences[0][:100])

First few tokens from the first episode: ['the', 'deep', 'space', 'nine', 'transcripts', 'emissary', 'emissary', 'stardate', '46379', '1', 'original', 'airdate', '3', 'jan', '1993', 'on', 'stardate', '43997', 'captain', 'jean', 'luc', 'picard', 'of', 'the', 'federation', 'starship', 'enterprise', 'was', 'kidnapped', 'for', 'six', 'days', 'by', 'an', 'invading', 'force', 'known', 'as', 'the', 'borg', 'surgically', 'altered', 'he', 'was', 'forced', 'to', 'lead', 'an', 'assault', 'on', 'starfleet', 'at', 'wolf', '359', 'saratoga', 'bridge', 'locutus', 'on', 'viewscreen', 'resistance', 'is', 'futile', 'you', 'will', 'disarm', 'your', 'weapons', 'and', 'escort', 'us', 'to', 'sector', 'zero', 'zero', 'one', 'if', 'you', 'attempt', 'to', 'intervene', 'we', 'will', 'destroy', 'you', 'captain', 'a', 'vulcan', 'red', 'alert', 'load', 'all', 'torpedo', 'bays', 'ready', 'phasers', 'move', 'us', 'to', 'position', 'alpha']
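One detail worth checking before training: each episode is passed to Word2Vec as a single token list, and gensim's documentation of the `batch_words` parameter notes that the standard Cython routines truncate individual texts longer than 10,000 words. Whether any transcript in trek.json is that long is not shown here, so the following is only a sketch of how over-long episodes could be split. `split_long`, `MAX_TOKENS` and `toy_sentences` are hypothetical names, not part of the original notebook.

```python
# Hypothetical helper: split token lists that exceed gensim's per-text limit.
MAX_TOKENS = 10_000  # limit stated in gensim's Word2Vec documentation


def split_long(token_lists, max_tokens=MAX_TOKENS):
    """Yield chunks of at most `max_tokens` tokens, preserving token order."""
    for tokens in token_lists:
        for start in range(0, len(tokens), max_tokens):
            yield tokens[start:start + max_tokens]


# Toy data standing in for `sentences`: one short and one over-long "episode".
toy_sentences = [["engage"] * 3, ["warp"] * 25_000]
chunks = list(split_long(toy_sentences))
print([len(chunk) for chunk in chunks])
```

In the notebook this could be applied as `sentences = list(split_long(sentences))` before training. Splitting at a fixed length cuts a few context windows at the chunk boundaries, which is a small cost compared with losing the end of an episode.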


# Task 3


Train a Word2Vec model on all transcripts with a window size of two (i.e. two words in each
direction) and a vector dimension of 300. Train another model with the same parameters and
only change the window size to ten.

In [19]:
from gensim.models import Word2Vec

# Assuming `sentences` is already prepared and contains the preprocessed and tokenized transcripts

# Train Word2Vec model with a window size of 2
model_window_2 = Word2Vec(sentences=sentences, vector_size=300, window=2, min_count=1, workers=4, sg=0)

# Train Word2Vec model with a window size of 10
model_window_10 = Word2Vec(sentences=sentences, vector_size=300, window=10, min_count=1, workers=4, sg=0)

# sg=0 specifies the training algorithm: CBOW (Continuous Bag of Words). 
# If you prefer to use the Skip-gram model, set sg=1.
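To make the `window` parameter concrete, the sketch below lists the context words around one target token for windows of 2 and 10. It shows the maximum window only: gensim's training differs in details (by default it samples a smaller effective window for each target word), and `context_window` is a hypothetical helper, not a gensim function.

```python
def context_window(tokens, index, window):
    """Return up to `window` tokens on each side of tokens[index]."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right


# Tokens taken from the preprocessed output shown under Task 2.
tokens = ['on', 'stardate', '43997', 'captain', 'jean', 'luc', 'picard',
          'of', 'the', 'federation', 'starship', 'enterprise', 'was',
          'kidnapped', 'for', 'six', 'days', 'by', 'an', 'invading',
          'force', 'known', 'as', 'the', 'borg']
target = tokens.index('picard')

print(context_window(tokens, target, 2))
print(context_window(tokens, target, 10))
```

With a window of 2, the target only sees the adjacent name parts and function words. With a window of 10, words such as `federation`, `starship` and `enterprise` enter the context as well.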


# Task 4


We will now use the characters from characters.csv and see how well Word2Vec differentiates the different TV shows. Calculate the cosine similarities of all possible character pairs for
both models. Then, calculate the average similarity between all character pairs within each TV
show and the average pairwise similarity to all characters of a different TV show. In the end you
should have a 5x5 matrix containing the average pairwise similarities between and within all 5 TV
shows.


What do you notice? Which model differentiates the characters of a TV show better from
those of other TV shows?


In [27]:
import pandas as pd
from itertools import combinations, product

# Load characters.csv
characters_df = pd.read_csv('characters.csv')

# Ensure character names match the preprocessing applied earlier (e.g., removing apostrophes)
characters_df['Character'] = characters_df['Character'].str.lower().replace({"o'brien": "obrien", "t'pol": "tpol"}, regex=True)

# Prepare a list of unique shows
shows = characters_df['Series'].unique()

# Initialize an empty dictionary to hold similarity matrices for each model
similarity_matrices = {'model_window_2': None, 'model_window_10': None}

# Function to calculate average similarities (assuming model is already trained)
def calculate_average_similarity(model, characters_df):
    # Initialize a matrix to store average similarities between shows
    avg_similarity_matrix = pd.DataFrame(0, index=shows, columns=shows, dtype=float)
    
    # Iterate over each combination of shows to calculate average similarities
    for show1, show2 in product(shows, repeat=2):
        characters1 = characters_df[characters_df['Series'] == show1]['Character'].tolist()
        characters2 = characters_df[characters_df['Series'] == show2]['Character'].tolist()
        
        total_similarity = 0
        count = 0
        
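        # Compute cosine similarity for each character pair between the two shows.
        # Note: for show1 == show2 this also pairs each character with itself
        # (similarity 1.0), which raises the diagonal averages.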
        for char1, char2 in product(characters1, characters2):
            try:
                similarity = model.wv.similarity(char1, char2)
                total_similarity += similarity
                count += 1
            except KeyError:  # Character not in model vocabulary
                continue
                
        # Calculate average similarity if there were any valid comparisons
        if count > 0:
            avg_similarity = total_similarity / count
        else:
            avg_similarity = None  # Indicate no valid comparisons
        
        avg_similarity_matrix.loc[show1, show2] = avg_similarity
    
    return avg_similarity_matrix

# Assuming 'model_window_2' and 'model_window_10' are your trained Word2Vec models
similarity_matrices['model_window_2'] = calculate_average_similarity(model_window_2, characters_df)
similarity_matrices['model_window_10'] = calculate_average_similarity(model_window_10, characters_df)

# The resulting 'similarity_matrices' dictionary contains the 5x5 matrices for each model

similarity_matrices

{'model_window_2':           ENT       TOS       TNG       DS9       VOY
 ENT  0.658812  0.530895  0.527138  0.542008  0.563587
 TOS  0.530895  0.618966  0.483849  0.462472  0.515654
 TNG  0.527138  0.483849  0.593197  0.517354  0.538675
 DS9  0.542008  0.462472  0.517354  0.623971  0.546402
 VOY  0.563587  0.515654  0.538675  0.546402  0.645253,
 'model_window_10':           ENT       TOS       TNG       DS9       VOY
 ENT  0.548281  0.209373  0.154335  0.198789  0.226451
 TOS  0.209373  0.551053  0.203047  0.119182  0.160095
 TNG  0.154335  0.203047  0.463180  0.170131  0.254550
 DS9  0.198789  0.119182  0.170131  0.451880  0.182436
 VOY  0.226451  0.160095  0.254550  0.182436  0.494807}
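The diagonal entries above are averaged over `product(characters1, characters2)` with both lists identical, so each character is also paired with itself (similarity 1.0). The sketch below shows how much this can matter, using made-up 2-dimensional vectors instead of the trained models. `toy_vectors`, `cosine` and both averaging functions are hypothetical and only illustrate the difference between `product` and `combinations`.

```python
import math
from itertools import combinations, product

# Made-up vectors standing in for model.wv[...]; not taken from the models.
toy_vectors = {"char_a": (1.0, 0.0), "char_b": (0.0, 1.0), "char_c": (1.0, 1.0)}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def mean_with_self_pairs(names):
    """Average over all ordered pairs, including (a, a), as product() does."""
    sims = [cosine(toy_vectors[a], toy_vectors[b])
            for a, b in product(names, repeat=2)]
    return sum(sims) / len(sims)


def mean_distinct_pairs(names):
    """Average over unordered pairs of two different characters."""
    sims = [cosine(toy_vectors[a], toy_vectors[b])
            for a, b in combinations(names, 2)]
    return sum(sims) / len(sims)


names = list(toy_vectors)
print(round(mean_with_self_pairs(names), 3))
print(round(mean_distinct_pairs(names), 3))
```

In `calculate_average_similarity`, the same effect could be obtained by iterating over `combinations(characters1, 2)` whenever `show1 == show2`. The matrices printed above were not recomputed this way.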

The average cosine similarities for the two models (window sizes 2 and 10) show how Word2Vec captures relationships between characters across the five Star Trek TV shows.

Observations:
Higher within-show similarities: In both models, the diagonal entries (average pairwise similarity within one TV show) are higher than every off-diagonal entry. Characters of the same show therefore occur in more similar contexts than characters of different shows, which is the expected outcome. One caveat: the code pairs each character with itself when both shows are the same, so the diagonal averages include self-similarities of 1.0 and overstate the within-show similarity.

Comparison between models:
Model with window size 2: Similarities are high both within shows (0.59 to 0.66) and between shows (0.46 to 0.56), so the gap between the two is only about 0.1. A small window captures the immediate context of a name, and that context presumably looks alike for characters of all shows, which may explain why even characters from different shows end up close together.

Model with window size 10: Within-show similarities remain the highest (0.45 to 0.55), while between-show similarities drop to 0.12 to 0.25. For example, the similarities between ENT and the other shows lie between 0.15 and 0.23. The gap between within-show and between-show similarity grows to about 0.3.
        
Analysis:
Differentiation of TV shows: The model with window size 10 separates the characters of one TV show from those of other shows more clearly than the model with window size 2, as its much lower between-show similarities show. The wider window takes in more of the surrounding scene, which is presumably more specific to a show than the words directly next to a name.

Within-show relationships: Both models give higher average similarities within shows than between them. With window size 2 the difference is much smaller, which suggests that this model mainly captures local usage patterns that character names share across shows.

Conclusion:
The model with window size 10 differentiates the characters of a TV show from those of other shows more effectively: its within-show and between-show averages differ by roughly 0.3, compared with roughly 0.1 for window size 2. A larger window therefore appears better suited to capturing the show-specific context of each character.
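The comparison can be condensed into one number per model: the mean of the diagonal entries minus the mean of the off-diagonal entries. The sketch below recomputes this from the values printed above, copied by hand as upper triangles. `separation_gap` is a hypothetical helper, and this gap is only one possible summary.

```python
from statistics import mean

# Upper triangles (diagonal first) of the matrices printed above,
# rows in the order ENT, TOS, TNG, DS9, VOY.
window_2 = [
    [0.658812, 0.530895, 0.527138, 0.542008, 0.563587],
    [0.618966, 0.483849, 0.462472, 0.515654],
    [0.593197, 0.517354, 0.538675],
    [0.623971, 0.546402],
    [0.645253],
]
window_10 = [
    [0.548281, 0.209373, 0.154335, 0.198789, 0.226451],
    [0.551053, 0.203047, 0.119182, 0.160095],
    [0.463180, 0.170131, 0.254550],
    [0.451880, 0.182436],
    [0.494807],
]


def separation_gap(upper_triangle):
    """Mean within-show similarity minus mean between-show similarity."""
    within = [row[0] for row in upper_triangle]  # diagonal entries
    between = [value for row in upper_triangle for value in row[1:]]
    return mean(within) - mean(between)


print(f"window 2:  {separation_gap(window_2):.3f}")
print(f"window 10: {separation_gap(window_10):.3f}")
```

Because the diagonal values include self-pairs, both gaps are somewhat overstated, but the inflation affects both models in the same way.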

# Task 5


Repeat Task 4 for the Roles column, which contains the role each character
plays in their TV show. Again, compare the similarities within these groups with the similarities between them.
Which model works better for this task?


In [28]:
import pandas as pd
from itertools import combinations_with_replacement


In [29]:
# Preprocess character names (if necessary, based on previous steps)
characters_df['Character'] = characters_df['Character'].str.lower().replace({"o'brien": "obrien", "t'pol": "tpol"}, regex=True)

# Group characters by role
role_groups = characters_df.groupby('Roles')['Character'].apply(list).to_dict()

# Function to calculate average similarities based on roles for a given model
def calculate_average_role_similarity(model, role_groups):
    roles = list(role_groups.keys())
    avg_similarity_matrix = pd.DataFrame(0, index=roles, columns=roles, dtype=float)
    
    for role1, role2 in combinations_with_replacement(roles, 2):
        characters1 = role_groups[role1]
        characters2 = role_groups[role2]
        total_similarity = 0
        count = 0
        
        for char1 in characters1:
            for char2 in characters2:
                # Ensure both characters are in the model's vocabulary.
                # Note: for role1 == role2, char1 == char2 is included (similarity 1.0).
                if char1 in model.wv.key_to_index and char2 in model.wv.key_to_index:
                    similarity = model.wv.similarity(char1, char2)
                    total_similarity += similarity
                    count += 1
        
        # Calculate average similarity if there were valid comparisons
        avg_similarity = total_similarity / count if count > 0 else float('nan')
        avg_similarity_matrix.loc[role1, role2] = avg_similarity
        if role1 != role2:
            avg_similarity_matrix.loc[role2, role1] = avg_similarity  # Fill symmetric value
    
    return avg_similarity_matrix

# Assuming 'model_window_2' and 'model_window_10' are your trained Word2Vec models
avg_role_similarity_matrix_window_2 = calculate_average_role_similarity(model_window_2, role_groups)
avg_role_similarity_matrix_window_10 = calculate_average_role_similarity(model_window_10, role_groups)


In [32]:
avg_role_similarity_matrix_window_2

| | Captains | Engineers | First Officers | Nicknames |
|---|---|---|---|---|
| Captains | 0.828997 | 0.606871 | 0.680319 | 0.310843 |
| Engineers | 0.606871 | 0.778255 | 0.610394 | 0.32684 |
| First Officers | 0.680319 | 0.610394 | 0.73028 | 0.320958 |
| Nicknames | 0.310843 | 0.32684 | 0.320958 | 0.651628 |


In [33]:
avg_role_similarity_matrix_window_10

| | Captains | Engineers | First Officers | Nicknames |
|---|---|---|---|---|
| Captains | 0.395027 | 0.210442 | 0.26103 | 0.052812 |
| Engineers | 0.210442 | 0.593059 | 0.259147 | 0.214536 |
| First Officers | 0.26103 | 0.259147 | 0.375868 | 0.097544 |
| Nicknames | 0.052812 | 0.214536 | 0.097544 | 0.45525 |


Grouping the characters by role instead of by TV show, the two Word2Vec models (window sizes 2 and 10) behave as follows:

Model with window size 2: Similarities are high overall. Within-role averages range from 0.65 to 0.83, but Captains, Engineers and First Officers are also similar to each other (0.61 to 0.68). Only the Nicknames group is clearly set apart, with similarities of 0.31 to 0.33 to the other roles. This model captures the immediate textual context around character mentions.

Model with window size 10: All similarities are lower. Within-role averages range from 0.38 to 0.59 and between-role averages from 0.05 to 0.26, so the roles lie further apart. Engineers form the tightest group (0.59).

Conclusion: The choice between the models depends on the analytical goal. Window size 2 yields the highest within-role similarities and is preferable for exploring what characters of the same role have in common. Window size 10 yields the lowest between-role similarities and distinguishes the roles more clearly. In both models the diagonal averages include each character's similarity to itself and are therefore inflated.
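A per-role view makes this trade-off more concrete: for each role, subtract the largest similarity to any other role from the within-role similarity. The sketch uses the values from the two tables above, copied by hand. `role_margins` is a hypothetical helper, and the margin is just one possible summary, not a measure the task prescribes.

```python
ROLES = ["Captains", "Engineers", "First Officers", "Nicknames"]

# Average role similarities as reported above (rows and columns follow ROLES).
window_2 = [
    [0.828997, 0.606871, 0.680319, 0.310843],
    [0.606871, 0.778255, 0.610394, 0.326840],
    [0.680319, 0.610394, 0.730280, 0.320958],
    [0.310843, 0.326840, 0.320958, 0.651628],
]
window_10 = [
    [0.395027, 0.210442, 0.261030, 0.052812],
    [0.210442, 0.593059, 0.259147, 0.214536],
    [0.261030, 0.259147, 0.375868, 0.097544],
    [0.052812, 0.214536, 0.097544, 0.455250],
]


def role_margins(matrix, labels):
    """Map each label to its diagonal value minus its largest off-diagonal value."""
    margins = {}
    for i, label in enumerate(labels):
        others = [value for j, value in enumerate(matrix[i]) if j != i]
        margins[label] = matrix[i][i] - max(others)
    return margins


for name, matrix in [("window 2", window_2), ("window 10", window_10)]:
    rounded = {role: round(m, 3) for role, m in role_margins(matrix, ROLES).items()}
    print(name, rounded)
```

On these numbers the picture is mixed: Engineers and First Officers have a larger margin with window size 10, while Captains and Nicknames have a larger margin with window size 2.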