# Action Network - Group 97  
### Social Graphs & Interactions

![Project Cover Image](https://w0.peakpx.com/wallpaper/194/584/HD-wallpaper-skyscraper-movie-10k-skyscraper-movie-2018-movies-movies-dwayne-johnson.jpg)

🔗 **GitHub Repository:**  
[Click here to view the repository](INSERT_GITHUB_LINK_HERE)

---

## Welcome to Our Project Explainer!

This Jupyter Notebook contains **all the code and explanations** related to the final project for **Group 97** in the course *Social Graphs & Interactions*.

Each section of the notebook is structured to guide you through:
- The **purpose** of the code  
- The **implementation details**  
- And **how everything ties together**

💡 Throughout the notebook, you will find **clear comments and explanations** for every major part of the code to ensure transparency and easy understanding.

---

**How to Use This Notebook**
- Expand the sections below to explore each part of the project  
- Follow the comments inside the code cells for step-by-step explanations  
- Use the GitHub link above for the full project structure, version control, and documentation

---

Enjoy exploring our work, and feel free to reach out if you have questions! 🙌


# Scraping section

## Packages

In [None]:
import os
import csv
import time
import requests
from collections import Counter, defaultdict
from community import community_louvain  # provided by the python-louvain package
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import re
import numpy as np
import itertools
import matplotlib.pyplot as plt
import math


## Function for scraping and storing the data

This section handles the automated retrieval and storage of movie reviews from the TMDB API.

1. **Fetching review data from TMDB**  
   The function `fetch_movie_data` sends a request to the TMDB API for a given movie ID and returns the corresponding review data as parsed JSON. A 0.25-second pause is inserted whenever the loop index is a multiple of 39, to avoid exceeding rate limits.

2. **Appending review data to a CSV file**  
   The function `write_to_csv` appends processed review information to a CSV file used to store all collected reviews.

3. **Resuming progress based on last processed movie**  
   The function `get_last_row_movie` reads the last row of the review file to determine which movie ID was most recently processed.  
   The main processing function uses this value to resume data collection without repeating previously completed work.

4. **Processing movie metadata and extracting review text**  
   For each movie, the script checks whether it has already been processed and whether it has a valid TMDB ID.  
   If both checks pass, reviews are fetched, and up to five reviews are stripped of newlines, commas, and apostrophes and joined with `|` into a single string.

5. **Saving processed reviews**  
   Each movie's cleaned review text is stored along with its movie ID in the output CSV file.  
   This ensures that the dataset can be built incrementally and used later for sentiment analysis.
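
The cleaning and joining in step 4 can be tried out without calling the API. The following is a minimal sketch, assuming a payload shaped like the `results` list that the notebook reads from the TMDB response; `join_reviews` and the sample payload are hypothetical names made up for this illustration, not part of the project code.

```python
def join_reviews(payload, max_reviews=5):
    """Clean the first few review texts and join them with '|'.

    `payload` is assumed to be a dict with a "results" list of
    {"content": ...} items, mirroring what the notebook reads.
    """
    texts = []
    for review in payload.get("results", [])[:max_reviews]:
        content = review.get("content", "")
        # Drop newlines, commas and apostrophes so the text is easy to store in a CSV row
        content = content.replace("\n", " ").replace(",", "").replace("'", "")
        texts.append(content)
    return "|".join(texts)


# Hypothetical payload, not real TMDB data
sample = {
    "results": [
        {"content": "Great stunts,\nweak plot."},
        {"content": "It's loud and fun"},
    ]
}

joined = join_reviews(sample)
print(joined)
```

A payload without a `results` key yields an empty string, which is why movies without reviews still get a row in the output file.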


In [None]:
# Fetch movie reviews from the TMDB API
def fetch_movie_data(tmdb_id, i, TMDB_API_KEY):
    if i % 39 == 0:
        time.sleep(0.25)

    url = (
        f"https://api.themoviedb.org/3/movie/{tmdb_id}/reviews"
        f"?api_key={TMDB_API_KEY}&language=en-US"
    )

    # The timeout keeps a stalled request from hanging the whole scrape
    data = requests.get(url, timeout=10).json()
    return data


# Append a row to a CSV file
def write_to_csv(row, filename="ml-latest/reviews.csv"):
    with open(filename, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(row)


# Get the last processed movieId from the CSV file (0 if only the header exists)
def get_last_row_movie(filename="ml-latest/reviews.csv"):
    with open(filename, "r", encoding="utf-8") as f:
        lines = [line for line in f if line.strip()]
    if len(lines) <= 1:  # empty file or header only
        return 0
    return int(lines[-1].split(",")[0])


# Fetch, process, and store movie reviews
def process_movie_data(data, TMDB_API_KEY, file_path="ml-latest/reviews.csv"):

    # Check if output file exists
    if os.path.exists(file_path):
        last_movie_id = get_last_row_movie(file_path)
        print(f"Resuming from movieId: {last_movie_id}")
    else:
        last_movie_id = 0
        with open(file_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["movieId", "reviews"])
        print("Creating new reviews.csv file")

    for i, (movieId, imdbId, tmdbId) in enumerate(data.itertuples(index=False)):

        # Skip already processed movies
        if movieId <= last_movie_id:
            continue

        # Skip missing TMDB IDs
        if pd.isna(tmdbId):
            continue

        tmdbId = int(tmdbId)

        # Fetch review data
        review_data = fetch_movie_data(tmdbId, i, TMDB_API_KEY)

        # Extract and clean review texts
        reviews_raw = review_data.get("results", [])
        reviews_texts = []

        for r in reviews_raw[:5]:
            content = r.get("content", "")
            content = content.replace("\n", " ").replace(",", "").replace("'", "")
            reviews_texts.append(content)

        reviews_joined = "|".join(reviews_texts)

        # Save result to CSV
        row = [movieId, reviews_joined]
        write_to_csv(row, file_path)

        print(f"Saved movieId {movieId}")


In [None]:
# Read the key from an environment variable instead of hard-coding it in the notebook
TMDB_API_KEY = os.environ["TMDB_API_KEY"]

movies_df = pd.read_csv("data/links_action.csv")
process_movie_data(movies_df, TMDB_API_KEY=TMDB_API_KEY, file_path="data/reviews_final_one_AA.csv")

# Graph section

## Initialization of data and preprocessing

This section prepares the movie data and constructs the actor collaboration network.

1. **Load and clean datasets**  
   The overview and movie metadata datasets are loaded.  
   The cast column is cleaned by filling missing values and converting all entries to strings.

2. **Convert cast and genre data to list format**  
   Cast names are split into lists of individual actors.  
   Genres are split into lists of individual genre labels to allow flexible filtering.

3. **Merge datasets on movie ID**  
   The overview data is merged with the genre information using `movieId` as the key.

4. **Filter movies by target genre**  
   The dataset is filtered to include only movies belonging to the selected genre (Action).

5. **Construct the actor network**  
   A graph is created where each actor is a node.  
   Edges are added between actors who appear in the same movie, with edge weights representing repeated collaborations.

6. **Remove isolated nodes**  
   Actors with no connections are removed to ensure the network only contains collaborative relationships.
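
Steps 5 and 6 can be checked on a toy example. This is a small sketch with made-up cast lists (the actor names and the `toy_` variables are hypothetical); it follows the same pattern as the notebook code: one edge per co-starring pair, a weight that counts repeated collaborations, and removal of actors without any co-star.

```python
import itertools
import networkx as nx

# Hypothetical cast lists: one inner list per movie
toy_casts = [
    ["Actor A", "Actor B", "Actor C"],
    ["Actor A", "Actor B"],
    ["Actor D"],  # solo cast, so Actor D ends up isolated
]

toy_G = nx.Graph()
for cast in toy_casts:
    toy_G.add_nodes_from(cast)
    # Every pair of actors in the same movie shares an edge
    for a, b in itertools.combinations(cast, 2):
        if toy_G.has_edge(a, b):
            toy_G[a][b]["weight"] += 1
        else:
            toy_G.add_edge(a, b, weight=1)

# Drop actors without any co-star
toy_H = toy_G.copy()
toy_H.remove_nodes_from([n for n, d in toy_G.degree() if d == 0])

print(toy_G["Actor A"]["Actor B"]["weight"])
print(sorted(toy_H.nodes()))
```

Actor A and Actor B share two movies, so their edge weight is 2, while Actor D is present in `toy_G` but absent from `toy_H`.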

In [None]:
# Load datasets
df = pd.read_csv("data/overview.csv")
movies_df = pd.read_csv("data/movies.csv")

# Clean cast column
df["cast_names"] = df["cast_names"].fillna("").astype(str)

# Convert cast strings to lists
df["cast_list"] = df["cast_names"].apply(
    lambda s: [c.strip() for c in s.split("|") if c.strip() != ""]
)

# Merge overview with genre data
merged_df = df.merge(movies_df[["movieId", "genres"]], on="movieId", how="left")

# Prepare genre column
merged_df["genres"] = merged_df["genres"].fillna("")
merged_df["genre_list"] = merged_df["genres"].apply(
    lambda s: [g.strip() for g in s.split("|") if g.strip() != ""]
)

# Filter by target genre
target_genre = "Action"
genre_df = merged_df[
    merged_df["genre_list"].apply(lambda lst: target_genre in lst)
].copy()


In [None]:
# Create graph
G = nx.Graph()

for cast in genre_df["cast_list"]:
    # Add actors as nodes
    for actor in cast:
        if actor not in G:
            G.add_node(actor)
    
    # Add edges between actors appearing in the same movie
    for a, b in itertools.combinations(cast, 2):
        if G.has_edge(a, b):
            G[a][b]["weight"] += 1  # increase weight if edge already exists
        else:
            G.add_edge(a, b, weight=1)

H = G.copy()
isolated = [n for n, d in H.degree() if d == 0]
H.remove_nodes_from(isolated)

## Centrality


This section evaluates the importance of actors in the network using three different centrality measures. Each measure highlights a different aspect of influence or prominence within the actor collaboration graph.

1. **Degree centrality**  
   Measures how many direct connections each actor has.  
   Actors with high degree centrality have collaborated with many others and are therefore highly embedded in the network.  
   The code computes degree centrality for all nodes and prints the top five actors with the most connections.

2. **Betweenness centrality**  
   Measures how often a node lies on the shortest paths between other nodes.  
   Actors with high betweenness centrality serve as bridges between otherwise separate parts of the network and may play an important structural role.  
   The code calculates betweenness centrality and outputs the top five actors acting as key intermediaries.

3. **Eigenvector centrality**  
   Measures influence by considering not only how many connections an actor has but also the importance of the actors they are connected to.  
   High eigenvector centrality indicates that an actor is connected to other highly influential actors.  
   The code computes eigenvector centrality and lists the top five most influential actors under this measure.
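
To see how the three measures can disagree, consider a toy network made of two triangles joined through a single bridge node. This is only an illustrative sketch; the graph and the node names are invented and unrelated to the actor data.

```python
import networkx as nx

# Hypothetical toy network: two triangles joined through a single bridge node "x"
toy = nx.Graph()
toy.add_edges_from([
    ("a", "b"), ("a", "c"), ("b", "c"),   # first triangle
    ("d", "e"), ("d", "f"), ("e", "f"),   # second triangle
    ("c", "x"), ("x", "d"),               # bridge
])

deg = nx.degree_centrality(toy)
btw = nx.betweenness_centrality(toy)
eig = nx.eigenvector_centrality(toy, max_iter=1000)

for name, scores in [("degree", deg), ("betweenness", btw), ("eigenvector", eig)]:
    top = max(scores, key=scores.get)
    print(f"{name:12s} top node: {top} ({scores[top]:.3f})")
```

The bridge node `x` has only two links, yet it lies on every shortest path between the two triangles, so it ranks first on betweenness. The nodes `c` and `d`, which have the most links, lead on degree and eigenvector centrality.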

In [None]:
# Calculate degree centrality for all nodes
degree_centrality = nx.degree_centrality(G)

# Sort actors by centrality score
sorted_degree = sorted(degree_centrality.items(), key=lambda x: x[1], reverse=True)

# Print top 5 actors
print("\nTop 5 actors by degree centrality:")
for actor, centrality in sorted_degree[:5]:
    print(f"{actor}: {centrality:.4f}")


In [None]:
# Calculate betweenness centrality for all nodes
betweenness_centrality = nx.betweenness_centrality(G)

# Sort actors by betweenness score
sorted_betweenness = sorted(betweenness_centrality.items(), key=lambda x: x[1], reverse=True)

# Print top 5 actors
print("\nTop 5 actors by betweenness centrality:")
for actor, centrality in sorted_betweenness[:5]:
    print(f"{actor}: {centrality:.4f}")


In [None]:
# Calculate eigenvector centrality for all nodes
eigenvector_centrality = nx.eigenvector_centrality(G, max_iter=1000)

# Sort actors by eigenvector score
sorted_eigenvector = sorted(eigenvector_centrality.items(), key=lambda x: x[1], reverse=True)

# Print top 5 actors
print("\nTop 5 actors by eigenvector centrality:")
for actor, centrality in sorted_eigenvector[:5]:
    print(f"{actor}: {centrality:.4f}")


## Community Detection

In this part of the notebook, we detect and interpret communities in the actor network. The code does the following:

1. **Extract the Giant Connected Component (GCC)**  
   From the cleaned graph `H`, the largest connected component is selected and stored as `H_gcc`. All further analysis is restricted to this subgraph.

2. **Run Louvain community detection on the GCC**  
   The Louvain algorithm is applied to `H_gcc` (using edge weights as collaboration strength).  
   This produces a partition `partition_gcc` that assigns each actor to a community.  
   The number of communities and the modularity score are computed to assess the structure.

3. **Measure and rank community sizes**  
   The size (number of actors) of each community is counted, and the largest communities are identified.

4. **Identify hub actors and name communities**  
   For each community, a subgraph is created and the weighted degree of each actor is computed.  
   The top actor by degree is selected as the hub, and the community is given an automatic name based on this hub.

5. **Build a text corpus per community**  
   Using `actor_movies` and `movie_overview`, all movie overviews associated with actors in each community are collected into `community_corpus`.

6. **Extract keywords with TF-IDF**  
   For communities with enough text (at least 10 overviews), a TF-IDF model is fitted.  
   The average TF-IDF score per term is computed, and the top 15 keywords are stored as `community_keywords[comm]`.

7. **Present selected communities**  
   For a chosen list of communities, the code prints the community ID, its automatically generated hub-based name, and the main TF-IDF keywords that characterize the movies associated with that community.
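
Steps 3 and 4 (community sizes and hub-based names) can be sketched on a small weighted graph. The example assumes a partition is already available as a node-to-community dictionary, like the one Louvain returns; the graph, the partition, and all `toy_` names are hypothetical.

```python
from collections import Counter
import networkx as nx

# Hypothetical weighted collaboration graph
toy = nx.Graph()
toy.add_weighted_edges_from([
    ("Actor A", "Actor B", 3),
    ("Actor A", "Actor C", 2),
    ("Actor B", "Actor C", 1),
    ("Actor C", "Actor D", 1),  # the only link between the two groups
    ("Actor D", "Actor E", 2),
    ("Actor D", "Actor F", 2),
])

# Hypothetical partition: node -> community ID
toy_partition = {
    "Actor A": 0, "Actor B": 0, "Actor C": 0,
    "Actor D": 1, "Actor E": 1, "Actor F": 1,
}

toy_sizes = Counter(toy_partition.values())

toy_hubs = {}
for comm in set(toy_partition.values()):
    members = [n for n, c in toy_partition.items() if c == comm]
    sub = toy.subgraph(members)
    # Weighted degree counts only edges inside the community
    hub, strength = max(sub.degree(weight="weight"), key=lambda item: item[1])
    toy_hubs[comm] = (hub, strength)

for comm, (hub, strength) in sorted(toy_hubs.items()):
    print(f"Community {comm} ({toy_sizes[comm]} actors): hub {hub}, weighted degree {strength}")
```

Because the degree is computed on the community subgraph, the edge between Actor C and Actor D does not contribute to either hub score.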


In [None]:
# Extract connected components
components = nx.connected_components(H)

# Select the largest connected component
gcc_nodes = max(components, key=len)

# Create subgraph of the giant connected component
H_gcc = H.subgraph(gcc_nodes).copy()

# Print size of the GCC
print(f"Size of GCC: {H_gcc.number_of_nodes()} nodes, {H_gcc.number_of_edges()} edges")

# Apply Louvain community detection on the GCC
partition_gcc = community_louvain.best_partition(H_gcc, weight="weight")

# Count number of detected communities
num_comms = len(set(partition_gcc.values()))
print(f"Louvain on GCC found {num_comms} communities")

# Compute modularity of the partition
Q = community_louvain.modularity(partition_gcc, H_gcc)
print(f"Modularity on GCC: {Q:.4f}")

# Count community sizes
community_sizes_gcc = Counter(partition_gcc.values())

# Print the largest communities
for comm, size in community_sizes_gcc.most_common(20):
    print(f"Community {comm}: {size} actors")


In [None]:
community_names = {}
community_hubs = {}

# Find top hub and assign community name
for comm in set(partition_gcc.values()):
    # Get nodes in the community
    nodes = [n for n, c in partition_gcc.items() if c == comm]
    sub = H_gcc.subgraph(nodes)
    
    # Compute weighted degrees
    degrees = sub.degree(weight="weight")
    
    # Select top-1 node by degree
    top1 = sorted(degrees, key=lambda x: x[1], reverse=True)[:1]
    community_hubs[comm] = top1
    
    # Create community name from top hub
    hub_names = [actor for actor, deg in top1]
    community_name = " - ".join(hub_names)
    
    community_names[comm] = community_name

# Count community sizes
sizes = Counter(partition_gcc.values())


In [None]:
# Map movie IDs to their overviews
movie_overview = dict(zip(genre_df["movieId"], genre_df["overview"]))

# Store movies for each actor
actor_movies = defaultdict(list)

# Assign movie IDs to each actor based on cast lists
for movie_id, cast in zip(genre_df["movieId"], genre_df["cast_list"]):
    for actor in cast:
        actor_movies[actor].append(movie_id)


In [None]:
# Collect movie overviews for each community
community_corpus = defaultdict(list)

for actor, comm in partition_gcc.items():
    for movie_id in actor_movies.get(actor, []):
        overview = movie_overview.get(movie_id, "")
        if isinstance(overview, str) and len(overview.strip()) > 0:
            community_corpus[comm].append(overview)


In [None]:
# Initialize TF-IDF vectorizer
vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)

community_keywords = {}

for comm, docs in community_corpus.items():
    # Skip communities with fewer than 10 overviews
    if len(docs) < 10:
        continue
    
    # Compute TF-IDF matrix
    tfidf = vectorizer.fit_transform(docs)
    features = vectorizer.get_feature_names_out()
    
    # Compute average TF-IDF scores
    avg_scores = tfidf.mean(axis=0).A1
    top_idx = avg_scores.argsort()[-15:][::-1]
    
    # Extract top keywords
    keywords = [features[i] for i in top_idx]
    community_keywords[comm] = keywords


In [None]:
# Community IDs chosen from one Louvain run; the labels may differ between runs
top_communities = [65, 54, 3, 9, 11, 25, 32, 4, 47, 63]

# Print community names and keywords
for comm in top_communities:
    print(f"\nCommunity {comm}: {community_names.get(comm, 'Unknown')}")
    print("Keywords:")
    print(", ".join(community_keywords.get(comm, ["No data"])))


## Sentiment Analysis

In this section, sentiment scores are computed and aggregated from reviews to actors and communities.

1. **Load and process movie reviews**  
   Movie reviews are loaded and a LabMT-based sentiment score is computed for each review using the `labmt_sentiment` function.

2. **Aggregate sentiment at movie level**  
   The mean sentiment score is computed for each movie by grouping reviews by `movieId`.  
   All missing (NaN) values are removed.

3. **Aggregate sentiment at actor level**  
   For each actor, the sentiment scores of all movies they appear in are collected.  
   The actor is assigned the mean of these values, or `None` if no valid values exist.

4. **Aggregate sentiment at community level**  
   Actor sentiment scores are grouped by community using the Louvain partition.  
   The mean sentiment is then computed for each community.

5. **Rank communities by sentiment**  
   Communities are sorted by their mean sentiment score in descending order.

6. **Link top communities with names and sentiment**  
   The top 10 communities are selected and combined with their automatically generated names and sentiment scores into a structured list for presentation.
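
The aggregation chain from review text to movie, actor, and community can be traced by hand on a tiny example. This sketch uses only the standard library and a made-up word list; the scores in `toy_scores` are hypothetical and are NOT real LabMT values, and all `toy_` names are invented for the illustration.

```python
import re
from statistics import mean

# Hypothetical word scores on a 1-9 scale, NOT real LabMT values
toy_scores = {"great": 8.0, "fun": 7.5, "boring": 2.5, "the": 5.0, "movie": 5.5}


def toy_sentiment(text, word_dict, lens=1.0, center=5.0):
    """Mean score of known words, ignoring near-neutral words around `center`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    scores = [word_dict[t] for t in tokens if t in word_dict]
    kept = [s for s in scores if s < center - lens or s > center + lens]
    return mean(kept) if kept else None


# Movie level: one text per movie, movies without a score are dropped
toy_reviews = {1: "Great fun", 2: "The movie", 3: "Boring movie"}
toy_movie_sentiment = {m: toy_sentiment(t, toy_scores) for m, t in toy_reviews.items()}
toy_movie_sentiment = {m: s for m, s in toy_movie_sentiment.items() if s is not None}

# Actor level: mean over the scored movies of each actor
toy_actor_movies = {"Actor A": [1, 3], "Actor B": [1], "Actor C": [2]}
toy_actor_sentiment = {}
for actor, movies in toy_actor_movies.items():
    vals = [toy_movie_sentiment[m] for m in movies if m in toy_movie_sentiment]
    toy_actor_sentiment[actor] = mean(vals) if vals else None

# Community level: mean over actors that have a score
toy_partition = {"Actor A": 0, "Actor B": 0, "Actor C": 1}
by_comm = {}
for actor, comm in toy_partition.items():
    if toy_actor_sentiment[actor] is not None:
        by_comm.setdefault(comm, []).append(toy_actor_sentiment[actor])
toy_community_sentiment = {c: mean(v) for c, v in by_comm.items()}

print(toy_movie_sentiment)
print(toy_actor_sentiment)
print(toy_community_sentiment)
```

Movie 2 contains only near-neutral words, so it gets no score; as a result Actor C has no sentiment value and community 1 does not appear in the final dictionary.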


In [None]:
# Load LabMT happiness dataset
labmt = pd.read_csv("Data_Set_S1.txt", sep="\t")

# Create word-to-happiness dictionary
labmt_dict = dict(zip(
    labmt["word"].astype(str).str.lower(),
    labmt["happiness_average"]
))

# Simple tokenizer for lowercase word extraction
def tokenize(text):
    return re.findall(r"[a-z']+", str(text).lower())


In [None]:
# Compute LabMT-based sentiment score for a text
def labmt_sentiment(text, word_dict, lens=1.0, center=5.0):
    # Tokenize input text
    tokens = tokenize(text)
    
    # Retrieve happiness scores for known words
    scores = [word_dict[t] for t in tokens if t in word_dict]

    # Remove neutral words based on threshold
    filtered = [s for s in scores if s < center - lens or s > center + lens]

    # Return None if no valid words remain
    if not filtered:
        return None

    # Return mean sentiment score
    return float(np.mean(filtered))


In [None]:
# Load reviews
reviews_df = pd.read_csv("data/reviews.csv")

# Compute sentiment score for each review
reviews_df["sentiment"] = reviews_df["reviews"].apply(
    lambda text: labmt_sentiment(text, labmt_dict)
)


In [None]:
# Compute average sentiment per movie
movie_sentiment = (
    reviews_df.groupby("movieId")["sentiment"].mean().to_dict()
)

# Remove NaN values (movies whose review text contained no scorable words)
movie_sentiment = {m: s for m, s in movie_sentiment.items() if pd.notna(s)}

# Print result
print(movie_sentiment)


In [None]:
# Compute average sentiment per actor
actor_sentiment = {}

for actor, movies in actor_movies.items():
    vals = []
    for m in movies:
        if m in movie_sentiment:
            if not np.isnan(movie_sentiment[m]):
                vals.append(movie_sentiment[m])
    
    # Assign mean sentiment or None if no values exist
    actor_sentiment[actor] = np.mean(vals) if len(vals) > 0 else None


In [None]:
# Collect actor sentiment scores per community
community_sentiment = defaultdict(list)

for actor, comm in partition_gcc.items():
    if actor_sentiment[actor] is not None:
        community_sentiment[comm].append(actor_sentiment[actor])

# Compute mean sentiment per community
community_sentiment_mean = {
    comm: np.mean(vals) for comm, vals in community_sentiment.items()
}

# Sort communities by mean sentiment score
sorted_sent = sorted(
    community_sentiment_mean.items(),
    key=lambda x: x[1],
    reverse=True
)

# Create structured list of top communities by sentiment
linked = []

for comm, sent in sorted_sent[:10]:
    name = community_names.get(comm, "Unknown")
    linked.append({
        "community": comm,
        "name": name,
        "sentiment": sent
    })

## Visualizations

In this section, the degree distribution of the network is inspected, and a structural backbone is extracted and visualized using the Disparity Filter method.

1. **Apply the Disparity Filter**  
   The filter is applied to the cleaned graph `H` with a chosen significance level `alpha = 0.5`.  
   This removes statistically insignificant edges while preserving the main structural connections of the network.

2. **Compute layout for visualization**  
   A spring layout is computed for the backbone graph to position nodes in a visually interpretable way.

3. **Visualize the backbone network**  
   The filtered network is plotted with small nodes and semi-transparent edges to highlight the main connectivity structure after edge reduction.
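
The significance test behind step 1 can be worked through for a single node. In the standard formulation of the disparity filter, an edge that carries a share `p_ij` of the total weight at a node of degree `k` has the p-value `(1 - p_ij) ** (k - 1)`, and the edge is kept when this value is below `alpha`. The sketch below assumes that formulation; `edge_p_value` and the example weights are hypothetical.

```python
def edge_p_value(weight, strength, degree):
    """p-value of one edge as seen from one of its endpoints.

    weight:   weight of the edge
    strength: sum of the weights of all edges at that node
    degree:   number of edges at that node (assumed to be at least 2)
    """
    p_ij = weight / strength
    return (1 - p_ij) ** (degree - 1)


# Hypothetical node with three edges of weight 8, 1 and 1
weights = [8, 1, 1]
strength = sum(weights)
p_values = [edge_p_value(w, strength, len(weights)) for w in weights]

alpha = 0.05
kept = [w for w, p in zip(weights, p_values) if p < alpha]

for w, p in zip(weights, p_values):
    print(f"weight {w}: p-value {p:.2f}")
print("kept weights:", kept)
```

The dominant edge holds 80% of the node's weight and gets a p-value of about 0.04, so it survives; the two weak edges get about 0.81 and are dropped. A larger `alpha` keeps more edges and therefore gives a denser backbone.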


In [None]:
# Compute degree values
degrees = [d for n, d in H.degree()]

# Plot degree distribution
# The upper bound is max + 2 so that the highest degree gets its own bin
plt.hist(degrees, bins=range(1, max(degrees) + 2), edgecolor="black")
plt.title("Degree Distribution")
plt.xlabel("Degree")
plt.ylabel("Number of Nodes")

# Use logarithmic scale for better visibility
plt.yscale("log")

# Show plot
plt.show()


In [None]:
# Sort nodes by degree
degree_sequence = sorted(H.degree(), key=lambda x: x[1], reverse=True)

# Print top 5 actors by degree
print("\nTop 5 actors by degree:")
for actor, degree in degree_sequence[:5]:
    print(f"{actor}: {degree}")

# Print bottom 5 actors by degree
print("\nBottom 5 actors by degree:")
for actor, degree in degree_sequence[-5:]:
    print(f"{actor}: {degree}")


In [None]:
# Extract network backbone using the Disparity Filter
def disparity_filter(G, alpha=0.05):
    backbone = nx.Graph()
    backbone.add_nodes_from(G.nodes(data=True))

    for node in G.nodes():
        k = len(list(G.neighbors(node)))  # Node degree

        # Keep all edges for nodes with degree 1
        if k <= 1:
            for nbr, data in G[node].items():
                backbone.add_edge(node, nbr, **data)
            continue

        # Sum of incident edge weights
        w_sum = sum(data["weight"] for _, data in G[node].items())

        for nbr, data in G[node].items():
            w = data["weight"]
            p_ij = w / w_sum

            # Disparity Filter p-value: chance of a weight share this large under the null model
            alpha_ij = (1 - p_ij) ** (k - 1)

            # Keep statistically significant edges
            if alpha_ij < alpha:
                backbone.add_edge(node, nbr, **data)

    return backbone


In [None]:
# Apply the disparity filter to the graph
backbone_H = disparity_filter(H, alpha=0.5)

# Compute layout for visualization
pos = nx.spring_layout(backbone_H, seed=42)

# Plot the backbone graph
plt.figure(figsize=(12, 12))
nx.draw_networkx_nodes(backbone_H, pos, node_size=20, node_color="blue")
nx.draw_networkx_edges(backbone_H, pos, alpha=0.5)
plt.show()
