<a href="https://colab.research.google.com/github/Sundaynot/HP_Big_data_project/blob/main/HP_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **BIG DATA PROJECT**

This project aims to analyse the seven *Harry Potter* books, written by *J.K. Rowling* between 1997 and 2007.

(1) Install the required libraries


In [7]:
# Install only the Python libraries we need
!pip install pyspark networkx
print("PySpark and NetworkX installed.")

PySpark and NetworkX installed.


(2) Create the SparkSession and clone the repo from GitHub


In [8]:
from pyspark.sql import SparkSession
import os
# Create a standard Spark session
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("HP_Analysis_NetworkX") \
    .getOrCreate()

print("Standard Spark session started.")


Standard Spark session started.


In [9]:
# Folder name and repository URL
repo_name = "HP_Big_data_project"
repo_url = "https://github.com/sundaynot/HP_Big_data_project.git"

# Clone the repo if it is not already present; otherwise just report it
if not os.path.exists(repo_name):
    print(f"Cloning repo '{repo_name}'...")
    !git clone {repo_url}
else:
    print(f"Repository '{repo_name}' already exists.")

Repository 'HP_Big_data_project' already exists.


In [10]:
# Useful libraries
from pyspark.sql import DataFrame, Window
from pyspark.sql import functions as F
from pyspark.sql.types import (ArrayType, StructType, StructField, IntegerType, DoubleType)
from functools import reduce

(3) Read .txt files with Spark

In [11]:
df_hp1 = spark.read.text("/content/HP_Big_data_project/database/01 Harry Potter and the Sorcerers Stone.txt")
df_hp2 = spark.read.text("/content/HP_Big_data_project/database/02 Harry Potter and the Chamber of Secrets.txt")
df_hp3 = spark.read.text("/content/HP_Big_data_project/database/03 Harry Potter and the Prisoner of Azkaban.txt")
df_hp4 = spark.read.text("/content/HP_Big_data_project/database/04 Harry Potter and the Goblet of Fire.txt")
df_hp5 = spark.read.text("/content/HP_Big_data_project/database/05 Harry Potter and the Order of the Phoenix.txt")
df_hp6 = spark.read.text("/content/HP_Big_data_project/database/06 Harry Potter and the Half-Blood Prince.txt")
df_hp7 = spark.read.text("/content/HP_Big_data_project/database/07 Harry Potter and the Deathly Hallows.txt")

(3.1) Show the first row of each file (to check for read errors)

In [12]:
df_hp1.show(1, truncate=False)


+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|value                                                                                                                                                                                                                                                                 |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people you’d expect to be involved in anything strange or mys

In [13]:
df_hp2.show(1, truncate=False)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|value                                                                                                                                                                                                                     |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Not for the first time, an argument had broken out over breakfast at number four, Privet Drive. Mr. Vernon Dursley had been woken in the early hours of the morning by a loud, hooting noise from his nephew Harry’s room.|
+-------------------------------------------------------------------------------------------------------------------

In [14]:
df_hp3.show(1, truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|value                                                                                                                                                                                                                                                                            |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Harry Potter was a highly unusual boy in many ways. For one thing, he hated the summer holidays more than any other time of year. For another, he really wanted to do his h

In [15]:
df_hp4.show(1, truncate=False)

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|value                                                                                                                                                                                                                                                                                                                                                                                                                                                |
+-----------------------------------------------------------------------------------------------------------------------

In [16]:
df_hp5.show(1, truncate=False)

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|value                                                                                                                                                                                                                                                                                                                                                            

In [17]:
df_hp6.show(1, truncate=False)

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|value                                                                                    

In [18]:
df_hp7.show(1, truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

(4) Let's process the text

(4.1) Define a function to find the chapters and their titles

In [19]:

def process_book_chapters(df_raw, book_number):
    """
    Process the raw DataFrame of a book's text and segment it into chapters,
    recognizing various heading formats ("CHAPTER I", "Chapter 2 - Title", etc.).
    """
    # PHASE 1: Initial cleaning
    # Add a "row_id" column to the raw DataFrame to preserve the original order of the text (increasing id for each row)
    # Filter out empty rows and rows made up only of whitespace
    # Create a Window ordered by row_id, i.e. in the original order of the text
    # (no partitionBy: Spark processes the whole book in a single partition, acceptable for one book)

    df_ordered = df_raw.withColumn("row_id", F.monotonically_increasing_id())
    df_cleaned = df_ordered.filter((F.col("value").isNotNull()) & (F.trim(F.col("value")) != ""))
    window_spec = Window.orderBy("row_id")

    # PHASE 2: Create the first chapter
    # (the first chapter has no "CHAPTER ONE" heading or similar: it starts at the first row of the book and ends at "CHAPTER TWO")
    # so the first non-empty row is used as the chapter's title

    first_line_title = df_cleaned.first()["value"]

    # PHASE 3: CHAPTER MARKERS
    # A regular expression (regex) that matches "CHAPTER" or "Chapter" (plus the variants "CAPTER" and "CHATER"),
    # the chapter number and, optionally, the chapter title

    chapter_regex = r"^(?:CHAPTER|Chapter|CAPTER|CHATER)\s+([A-Za-z0-9]+)(?:\s*[-–—:]\s*(.+))?$"

    # Add the columns "chapter_match" and "is_new_chapter_line":
    # "chapter_match" holds the whole matched string if the row matches chapter_regex,
    # otherwise an empty string;
    # "is_new_chapter_line" is TRUE when "chapter_match" != "",
    # otherwise FALSE

    df_with_markers = df_cleaned.withColumn("chapter_match",
        F.regexp_extract(F.col("value"), chapter_regex, 0)
        ).withColumn("is_new_chapter_line", F.col("chapter_match") != "")

    # Add the column "chapter_title_raw":
    # when "is_new_chapter_line" is TRUE it holds group 2 of the match (the chapter title, "" if absent);
    # otherwise it is NULL

    df_with_markers = df_with_markers.withColumn("chapter_title_raw",
        F.when(F.col("is_new_chapter_line"), F.regexp_extract(F.col("value"), chapter_regex, 2)))

    # PHASE 4: CHAPTER TITLE PROPAGATION
    # If the regex does not find the chapter title on the heading line, look for it in the next row
    # lead() returns the value of the next row in the specified window
    # Add a new column "chapter_title_marker":
    # lead_value if "is_new_chapter_line" is TRUE and "chapter_title_raw" is ""
    # otherwise "chapter_title_raw"

    lead_value = F.lead("value").over(window_spec)
    df_with_titles = df_with_markers.withColumn("chapter_title_marker",
        F.when((F.col("is_new_chapter_line")) & (F.col("chapter_title_raw") == ""),
            lead_value).otherwise(F.col("chapter_title_raw")))

    # Add the new column "chapter_title_propagated", which
    # fills down the last non-null "chapter_title_marker" across the window

    df_with_titles = df_with_titles.withColumn("chapter_title_propagated",
        F.last("chapter_title_marker", ignorenulls=True).over(window_spec))

    # PHASE 5: CHAPTER ID
    # Add a new column "chapter_id" starting from 1:
    # a running sum adds 1 every time is_new_chapter_line is TRUE, 0 otherwise

    df_with_ids = df_with_titles.withColumn("chapter_id",
        F.lit(1) + F.sum(F.when(F.col("is_new_chapter_line"), 1).otherwise(0)).over(window_spec))

    # Add the final column "chapter_title":
    # if "chapter_id" = 1 (first chapter) use the first row of the book as the chapter title,
    # otherwise copy it from "chapter_title_propagated"

    df_with_final_titles = df_with_ids.withColumn("chapter_title",
        F.when(F.col("chapter_id") == 1, F.lit(first_line_title)
        ).otherwise(F.col("chapter_title_propagated")))

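    # PHASE 6: Clean the text by removing the "CHAPTER ..." heading and the chapter title from "value"
    # Add the new column "is_title_line":
    # TRUE for the row that immediately follows a chapter heading (is_new_chapter_line = TRUE on the previous row),
    # FALSE otherwise
    # Note: the row after a heading is dropped even when the title was already on the heading line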

    df_with_meta_flags = df_with_final_titles.withColumn( "is_title_line",
        F.lag("is_new_chapter_line", 1, False).over(window_spec) )

    # Filter the text:
    # keep the rows where both "is_new_chapter_line" and "is_title_line" are FALSE
    # ("~" is the logical NOT operator)

    df_final_text = df_with_meta_flags.filter(
        (~F.col("is_new_chapter_line")) & (~F.col("is_title_line")) )

    # PHASE 7: GROUP BY
    # Group rows by chapter_id and chapter_title,
    # collect the (row_id, value) pairs of each chapter into "lines", sorted by row_id
    # (collect_list alone does not guarantee the original order of the rows)
    # Create a new column "chapter_text" by joining all the strings in "lines" with spaces
    # Create a new column "book_id" set to the book_number function argument
    # Select only book_id, chapter_id, chapter_title and chapter_text, ordered by chapter_id

    df_chapters = (
        df_final_text
        .groupBy("chapter_id", "chapter_title")
        .agg(F.sort_array(F.collect_list(F.struct("row_id", "value"))).alias("lines"))
        .withColumn("chapter_text", F.concat_ws(" ", F.col("lines.value")))
        .withColumn("book_id", F.lit(book_number))
        .select("book_id", "chapter_id", "chapter_title", "chapter_text")
        .orderBy("chapter_id"))

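    # If the chapter text starts with a single letter followed by one or more spaces, join it to the next word (N early -> Nearly)
    # Caveat: this also joins a genuine one-letter word at the start of a chapter (e.g. "A boy" -> "Aboy")

    df_chapters = df_chapters.withColumn(
        "chapter_text",
        F.regexp_replace(
            F.trim(F.col("chapter_text")),
            r"^(\w)\s+(.*)$",
            "$1$2"))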

    return df_chapters
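
The segmentation logic above can be tried without Spark. The following is a minimal plain-Python sketch, not the notebook's implementation: it applies the same heading pattern with the standard `re` module and derives chapter ids with a running sum. The sample lines and the helper name `segment_lines` are hypothetical. One difference to keep in mind: Python returns `None` for a missing title group, whereas Spark's `regexp_extract` returns an empty string.

```python
import re
from itertools import accumulate

# Same heading pattern as chapter_regex in process_book_chapters
CHAPTER_RE = re.compile(
    r"^(?:CHAPTER|Chapter|CAPTER|CHATER)\s+([A-Za-z0-9]+)(?:\s*[-–—:]\s*(.+))?$"
)

def segment_lines(lines):
    """Return (chapter_id, chapter_title, text) tuples for the body lines.

    Hypothetical helper mirroring phases 1-6: drop blank lines, mark headings,
    take the title from the heading or from the next line, number chapters
    with a running sum, then drop headings and the line after each heading.
    """
    lines = [ln for ln in lines if ln.strip()]
    matches = [CHAPTER_RE.match(ln) for ln in lines]
    is_heading = [m is not None for m in matches]
    # chapter_id = 1 + number of headings seen so far (current line included)
    chapter_ids = [1 + n for n in accumulate(int(h) for h in is_heading)]

    titles = {1: lines[0]}  # first chapter: the title is the first line
    for i, m in enumerate(matches):
        if m:
            inline_title = m.group(2)  # None when the heading has no title
            next_line = lines[i + 1] if i + 1 < len(lines) else None
            titles[chapter_ids[i]] = inline_title or next_line

    result = []
    for i, ln in enumerate(lines):
        follows_heading = i > 0 and is_heading[i - 1]
        if not is_heading[i] and not follows_heading:
            result.append((chapter_ids[i], titles[chapter_ids[i]], ln))
    return result

# Invented sample lines (only the two chapter titles come from the output table)
sample = [
    "First body line of the opening chapter.",
    "",
    "CHAPTER TWO",
    "THE VANISHING GLASS",
    "Body line of the second chapter.",
    "Chapter 3 - LETTERS FROM NO ONE",
    "This line is dropped because it follows a heading.",
    "Body line of the third chapter.",
]
for row in segment_lines(sample):
    print(row)
```

The third chapter in the sample shows the side effect of the lag-based filter: when the title is already on the heading line, the first body line after it is still removed.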


In [20]:
# List of DataFrames, one per book
all_hp_dfs = [df_hp1, df_hp2, df_hp3, df_hp4, df_hp5, df_hp6, df_hp7]

processed_books_list = []

print("Starting the processing...")

# Track each book's number (1-7) while iterating
for book_number, df_book_raw in enumerate(all_hp_dfs, start=1):
    print(f"Processing book {book_number}...")
    df_processed = process_book_chapters(df_book_raw, book_number)
    processed_books_list.append(df_processed)
print("Work done.")

# Merge everything into a single DataFrame
all_chapters_df = reduce(DataFrame.unionAll, processed_books_list)
all_chapters_df.cache()
print(f"Total chapters: {all_chapters_df.count()}")

# Show the result
all_chapters_df.orderBy("book_id", "chapter_id").show(18, truncate=20)

Starting the processing...
Processing book 1...
Processing book 2...
Processing book 3...
Processing book 4...
Processing book 5...
Processing book 6...
Processing book 7...
Work done.
Total chapters: 198
+-------+----------+--------------------+--------------------+
|book_id|chapter_id|       chapter_title|        chapter_text|
+-------+----------+--------------------+--------------------+
|      1|         1|Mr. and Mrs. Durs...|Mr. and Mrs. Durs...|
|      1|         2| THE VANISHING GLASS|Nearly ten years ...|
|      1|         3| LETTERS FROM NO ONE|The escape of the...|
|      1|         4|THE KEEPER OF THE...|BOOM. They knocke...|
|      1|         5|        DIAGON ALLEY|Harry woke early ...|
|      1|         6|THE JOURNEY FROM ...|Harry’s last mont...|
|      1|         7|     THE SORTING HAT|The door swung op...|
|      1|         8|  THE POTIONS MASTER|There, look.” “Wh...|
|      1|         9|   THE MIDNIGHT DUEL|Harry had never b...|
|      1|        10|           HALLOWE

In [21]:
# 1. SQL TempView
all_chapters_df.createOrReplaceTempView("harry_potter_saga")

# SQL query counting the chapters in each book
spark.sql("""
    SELECT book_id, COUNT(chapter_id) as num_chapters
    FROM harry_potter_saga
    GROUP BY book_id
    ORDER BY book_id
""").show()

+-------+------------+
|book_id|num_chapters|
+-------+------------+
|      1|          17|
|      2|          18|
|      3|          22|
|      4|          37|
|      5|          38|
|      6|          30|
|      7|          36|
+-------+------------+



(5). An interesting count


In [22]:
from pyspark.ml.feature import StopWordsRemover  # Import the class

# PHASE 1: TOKENIZE
# Take the column "chapter_text", convert it to lower case, and split it on whitespace (\s+)
# Name the exploded column "word"
# Select just 3 columns: "book_id", "chapter_id" and "word"

df_words = all_chapters_df.select("book_id","chapter_id",
    F.explode(F.split(F.lower(F.col("chapter_text")), r"\s+")).alias("word"))

# PHASE 2: NORMALIZE
# Remove non-word characters (anything other than letters, digits and "_") from each token
# Drop tokens shorter than 2 characters
df_cleaned_words = df_words.withColumn("word",
    F.regexp_replace(F.col("word"), r"[^\w]", "")
).filter(F.col("word") != "").filter(F.length(F.col("word")) >= 2)

# Build one array per chapter
# made of the cleaned words, and name it "words_array"
df_word_arrays = df_cleaned_words.groupBy("book_id", "chapter_id").agg(
    F.collect_list("word").alias("words_array"))

# PHASE 3: CUSTOMIZE STOPWORDSREMOVER
# Load the default English stop words
stop_words_list = StopWordsRemover.loadDefaultStopWords("english")

# Add other words
custom_stop_words = stop_words_list + [
    # Main characters (first name and surname)
    "harry", "potter",
    "ron", "weasley",
    "hermione", "granger",
    "dumbledore", "albus",
    "hagrid",
    "voldemort", "tom", "riddle",
    "snape", "severus",
    "malfoy", "draco",

    # Titles
    "professor", "mr", "mrs", "miss", "madam", "lord","harrys",

    # Narrative verbs
    "said", "asked", "looked", "thought", "knew", "know","saw","come",
    "didnt", "dont", "wasnt", "isnt", "its", "hes", "shes","got","seemed",
    "get","go", "see","looking","think","hed", "going", "look","im",

    # Others
    "one", "well", "like","around","still","something","right","long","head","us",
    "though","time","eyes","face","voice", "little", "yes", "first", "never"
]

# Initialize the remover on "words_array" and call the output column "filtered_words"
remover = StopWordsRemover(
    inputCol="words_array",
    outputCol="filtered_words")

remover.setStopWords(custom_stop_words)

# PHASE 4: APPLICATION
df_filtered_arrays = remover.transform(df_word_arrays)

# From the filtered DataFrame select "book_id", "chapter_id",
# and "word" (the exploded version of "filtered_words")
df_meaningful_words = df_filtered_arrays.select("book_id","chapter_id",
    F.explode(F.col("filtered_words")).alias("word"))

# PHASE 5: WORD COUNT PER BOOK
df_word_counts_per_book = (
    df_meaningful_words.groupBy("book_id", "word").count())

windowSpec = Window.partitionBy("book_id").orderBy(F.col("count").desc())

# Add a "rank" column to each row
df_ranked_words = df_word_counts_per_book.withColumn("rank", F.row_number().over(windowSpec))

# Select only the 5 most used words for each book
df_top5_word_per_book = df_ranked_words.filter(F.col("rank") <=5)

# Show the final result (sorted by book, then by rank)
print("The five most used words for each book:")
df_top5_word_per_book.orderBy("book_id", "rank").select("book_id", "word", "count").show(35, truncate=False)


The five most used words for each book:
+-------+--------+-----+
|book_id|word    |count|
+-------+--------+-----+
|1      |back    |259  |
|1      |uncle   |121  |
|1      |dudley  |116  |
|1      |door    |105  |
|1      |vernon  |105  |
|2      |back    |279  |
|2      |lockhart|196  |
|2      |dobby   |132  |
|2      |door    |127  |
|2      |school  |108  |
|3      |lupin   |371  |
|3      |back    |353  |
|3      |black   |314  |
|3      |sirius  |156  |
|3      |door    |141  |
|4      |back    |582  |
|4      |moody   |306  |
|4      |crouch  |281  |
|4      |wand    |267  |
|4      |cedric  |221  |
|5      |back    |776  |
|5      |sirius  |580  |
|5      |umbridge|498  |
|5      |door    |374  |
|5      |room    |342  |
|6      |back    |418  |
|6      |slughorn|337  |
|6      |room    |247  |
|6      |ginny   |212  |
|6      |hand    |205  |
|7      |wand    |566  |
|7      |back    |537  |
|7      |death   |304  |
|7      |room    |256  |
|7      |away    |241  |
+-------+--------+-----+
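
A note on ties: `row_number()` assigns distinct ranks even when counts are equal, so a tie at the cut-off is broken arbitrarily (`door` and `vernon` both have 105 in Book 1). The plain-Python sketch below contrasts that behaviour with a tie-preserving rank. It uses the Book 1 counts from the table plus one invented word to force a tie at position 5; the helper names are hypothetical and not part of the notebook.

```python
# Counts for Book 1 taken from the table above; "hypothetical_word" is an
# invented entry added only to create a tie at the cut-off.
counts = {
    "back": 259, "uncle": 121, "dudley": 116,
    "door": 105, "vernon": 105, "hypothetical_word": 105,
}

def top_k_row_number(counts, k):
    """Keep exactly k words, like row_number() <= k.

    Spark breaks ties arbitrarily; here alphabetical order is used as an
    explicit, reproducible tie-break.
    """
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ordered[:k]

def top_k_rank(counts, k):
    """Keep every word whose rank is <= k, like rank() <= k (ties all kept)."""
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    result = []
    for word, count in ordered:
        # rank = 1 + number of words with a strictly higher count
        rank = 1 + sum(1 for c in counts.values() if c > count)
        if rank <= k:
            result.append((word, count))
    return result

print(top_k_row_number(counts, 5))
print(top_k_rank(counts, 5))
```

In PySpark the tie-preserving variant corresponds to `F.rank()` in place of `F.row_number()`; alternatively, adding a secondary sort key such as `word` to the window ordering makes the cut deterministic.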

This analysis captures much of each book's narrative focus by surfacing its key characters, locations, and themes.

Book 1: The Dursleys (uncle, dudley, vernon).

Book 2: The new characters (lockhart, dobby) and the setting (school).

Book 3: The Marauders (lupin, black, sirius).

Book 4: The Triwizard Tournament (moody, crouch, cedric).

Book 5: The conflict (sirius, umbridge) and the key locations (door, room).

Book 6: The key characters (slughorn, ginny), the location (room), and the mystery (hand).

Book 7: The themes (wand, death, away) and the location (room).

The word "back" is also central: it ranks first or second in every book. One reading is that it represents:

- The Return of Voldemort: The entire plot is driven by Voldemort "coming back" to power.

- The Return to School: The narrative structure of the first six books is built on "going back" to Hogwarts.

- Looking Back (The Past): So much of the plot is discovered by "looking back" into memories (the Pensieve, Tom Riddle's diary).

- The Physical Action: Characters are constantly "going back" to rescue someone, "coming back" from a fight, or being "held back."

Read this way, it's the narrative glue that holds the whole series together.
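
These readings of "back" are interpretive, and "back" is also simply a very common adverb that the stop-word list does not remove. One way to probe them would be to count which words directly precede "back" in the raw text (before stop-word removal, since verbs such as "come" and "go" are filtered out above). The following is a minimal plain-Python sketch on an invented sample; the function name and the sample text are hypothetical.

```python
import re
from collections import Counter

def words_before(target, text):
    """Count the tokens that immediately precede `target` in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(prev for prev, cur in zip(tokens, tokens[1:]) if cur == target)

# Invented sample text, used only to show the shape of the result
sample = (
    "He came back to the castle. She looked back at the door. "
    "They went back inside, and he came back again."
)
print(words_before("back", sample).most_common())
```

Applied to the full chapter texts, a breakdown like this would show how much of the count comes from movement verbs ("came back", "went back") rather than from the thematic senses listed above.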

In [23]:
# Lists of enchantments
spell_list = [
    "lumos", "nox", "accio", "stupefy", "expelliarmus",
    "riddikulus", "obliviate", "incendio", "protego",
    "sectumsempra", "alohomora", "crucio", "imperio",
    "confringo", "diffindo" ]

multi_word_spells = {
    "avada kedavra": "avada_kedavra",
    "expecto patronum": "expecto_patronum",
    "petrificus totalus": "petrificus_totalus",
    "wingardium leviosa": "wingardium_leviosa"
}

# Complete list
all_spell_tokens = spell_list + list(multi_word_spells.values())

# PREPROCESSING: convert all the text to lower case
temp_df = all_chapters_df.withColumn("processed_text", F.lower(F.col("chapter_text")))

# Replace each multi-word spell with a single token ("avada kedavra" -> "avada_kedavra")
# (spell = the text to search for, token = its replacement)
for spell, token in multi_word_spells.items():
    temp_df = temp_df.withColumn("processed_text",
        F.regexp_replace(F.col("processed_text"), spell, token))

# Explode the processed text into one word per row
df_words_adv = temp_df.select("book_id",
    F.explode(F.split(F.col("processed_text"), r"\s+")).alias("word"))

# Remove non-word characters from each token (\w already includes "_", so the joined tokens survive)
df_cleaned_words_adv = df_words_adv.withColumn("word",
    F.regexp_replace(F.col("word"), r"[^\w_]", ""))

# Keep only the tokens that appear in our spell list
df_spells_adv = df_cleaned_words_adv.filter(
    F.col("word").isin(all_spell_tokens))

# Count
df_total_spell_counts_adv = (
    df_spells_adv.groupBy("word")
    .count()
    .orderBy(F.col("count").desc()))

print("Total count of enchantments:")
df_total_spell_counts_adv.show(truncate=False)

Total count of enchantments:
+------------------+-----+
|word              |count|
+------------------+-----+
|expecto_patronum  |36   |
|accio             |33   |
|stupefy           |26   |
|expelliarmus      |25   |
|lumos             |22   |
|avada_kedavra     |19   |
|riddikulus        |16   |
|crucio            |14   |
|petrificus_totalus|11   |
|protego           |11   |
|sectumsempra      |9    |
|imperio           |8    |
|alohomora         |7    |
|wingardium_leviosa|5    |
|diffindo          |5    |
|obliviate         |4    |
|incendio          |3    |
|nox               |2    |
|confringo         |2    |
+------------------+-----+



Here are the most interesting insights from this data:

1. ***Hope and Utility Outrank Attack***
The most telling detail is that the top two spells are not combat-focused:
 - expecto_patronum (36): This is the thematic spell of the series. It's not an attack, but a defense against despair (Dementors).

 - accio (33): This is the utility spell. Its high frequency shows the characters' growth. They aren't just in duels; they are actively solving problems, retrieving items, and using magic in practical ways.

2. ***The Data Reflects Harry's Signature Spell***
The core combat spells are stupefy (26) and expelliarmus (25), which are practically tied. Stupefy is the standard stunning spell, while expelliarmus is famously Harry's personal signature spell.

3. ***The Threat of the Unforgivable Curses***
The series grows darker over time, and the counts reflect it. All three Unforgivable Curses appear on the list:

- avada_kedavra (19)

- crucio (14)

- imperio (8)

4. ***Famous vs. Frequent***
This is a great insight into storytelling. wingardium_leviosa is arguably the most famous spell from the franchise ("It's Levi-O-sa, not Levio-SAH!").
And yet, it was only used 5 times.
In contrast, a simple utility spell like lumos (22) is used constantly but is far less "famous."

5. ***"Book-Specific" Spells***
You can clearly see which spells defined the plot of a specific book:

- riddikulus (16): This is almost certainly all from Book 3 (Prisoner of Azkaban) and the Boggart lessons.

- sectumsempra (9): This is the dark mystery at the heart of Book 6 (The Half-Blood Prince).
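
The book-specific claims above could be checked by counting spells per book rather than over the whole saga. The sketch below is a plain-Python illustration on invented text, not the notebook's Spark code: it reuses the idea of turning multi-word spells into single tokens, then counts `(book_id, spell)` pairs. The sample chapters and the helper name `count_spells_per_book` are hypothetical.

```python
import re
from collections import Counter

single_word_spells = {"riddikulus", "sectumsempra", "lumos"}
multi_word_spells = {"expecto patronum": "expecto_patronum"}
all_spell_tokens = single_word_spells | set(multi_word_spells.values())

def count_spells_per_book(chapters):
    """chapters: iterable of (book_id, chapter_text).

    Returns a Counter keyed by (book_id, spell_token).
    """
    counts = Counter()
    for book_id, text in chapters:
        text = text.lower()
        for spell, token in multi_word_spells.items():
            # \s+ between the words also matches double spaces or line breaks
            pattern = r"\s+".join(map(re.escape, spell.split()))
            text = re.sub(pattern, token, text)
        for raw in text.split():
            word = re.sub(r"[^\w]", "", raw)  # strip punctuation, keep "_"
            if word in all_spell_tokens:
                counts[(book_id, word)] += 1
    return counts

# Invented sample chapters (not quotations from the books)
chapters = [
    (3, "Riddikulus! he shouted. Expecto  Patronum! Riddikulus!"),
    (6, "Sectumsempra! Lumos."),
]
for key, n in sorted(count_spells_per_book(chapters).items()):
    print(key, n)
```

In the notebook itself, the equivalent would presumably be `df_spells_adv.groupBy("book_id", "word").count()`, since `df_spells_adv` already carries the `book_id` column.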


(6). TF-IDF

In [24]:
# Other useful libraries
from pyspark.ml.feature import CountVectorizer, IDF
from pyspark.ml import Pipeline

# 1. Configure CountVectorizer
# Input = filtered_words, output = raw_features
# minDF=2.0 --> "ignore words that appear in fewer than 2 chapters"
cv = CountVectorizer(inputCol="filtered_words",outputCol="raw_features",vocabSize=10000,minDF=2.0)

# 2. Configure IDF
# Takes raw_features and computes the TF-IDF scores
idf = IDF(inputCol="raw_features", outputCol="tfidf_features")

# 3. Build the Pipeline to run the steps in sequence
pipeline = Pipeline(stages=[cv, idf])

# 4. Train the pipeline with the data
print("Starting the pipeline training (CV + IDF)...")
pipeline_model = pipeline.fit(df_filtered_arrays)
print("Training complete.")

# 5. Apply the model transformation
tfidf_df = pipeline_model.transform(df_filtered_arrays)

# print("DataFrame with TF-IDF:")
# Sparse Vector
# tfidf_df.select("book_id", "chapter_id", "tfidf_features").show(truncate=80)

Starting the pipeline training (CV + IDF)...
Training complete.
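
To make the scores concrete, the sketch below recomputes TF-IDF by hand in plain Python. It assumes the smoothed formula given in the Spark ML documentation, idf(t) = ln((m + 1) / (df(t) + 1)), where m is the number of documents (here, chapters) and df(t) is the number of documents containing term t, with the raw term count as TF. The toy documents are invented, the function name is hypothetical, and the sketch ignores `minDF` and `vocabSize`.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns one {term: score} dict per doc,
    using raw counts as TF and idf = ln((m + 1) / (df + 1))."""
    m = len(docs)
    # Document frequency: in how many docs each term appears at least once
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((m + 1) / (df + 1)) for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]

# Invented toy "chapters"
docs = [
    ["quirrell", "wand", "quirrell"],
    ["wand", "door"],
    ["wand", "door", "dobby"],
]
for scores in tfidf(docs):
    print({t: round(s, 3) for t, s in scores.items()})
```

Under this formula a term present in every chapter scores 0, while a term repeated within a single chapter is boosted, which is the effect the per-book rankings rely on.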


In [25]:
# 6. Extract the vocabulary from the model
# Create a new DataFrame that maps each word to its index in the vocabulary
vocabulary = pipeline_model.stages[0].vocabulary
vocab_df = spark.createDataFrame(enumerate(vocabulary), ["index", "word"])

# 7. Define and apply the UDF
# From a sparse vector (e.g. (10000, [5, 25], [0.1, 0.8]))
# to an array of readable pairs: [(5, 0.1), (25, 0.8)]
def vector_to_array(v):
    return list(zip([int(i) for i in v.indices], [float(f) for f in v.values]))

to_array_udf = F.udf(vector_to_array,
    ArrayType(StructType([StructField("index", IntegerType()),StructField("score", DoubleType())])))

# Add a new column "scores_array" holding the UDF results
df_with_scores = tfidf_df.withColumn("scores_array", to_array_udf(F.col("tfidf_features")))

# Explode the new column and name the result "score_struct"
df_exploded = df_with_scores.select("book_id",
    F.explode(F.col("scores_array")).alias("score_struct"))

# 8. Join with the vocabulary to translate indexes into words
df_word_scores = df_exploded.join(
    vocab_df,df_exploded.score_struct.index == vocab_df.index
).select("book_id", "word", "score_struct.score")

# A word can occur in several chapters of a book:
# keep its highest chapter-level score so each (book, word) pair appears once
df_word_scores = df_word_scores.groupBy("book_id", "word") \
    .agg(F.max("score").alias("score"))

# 9. Find the 5 most important words for each book
windowSpec = Window.partitionBy("book_id").orderBy(F.col("score").desc())

df_top_words = df_word_scores.withColumn("rank", F.row_number().over(windowSpec)) \
                            .filter(F.col("rank") <= 5) \
                            .orderBy("book_id", "rank")

print("The five most important words for each book are (using TF-IDF):")
df_top_words.show(n=336, truncate=False)

The five most important words for each book are (using TF-IDF):
+-------+-----------+------------------+----+
|book_id|word       |score             |rank|
+-------+-----------+------------------+----+
|1      |quirrell   |126.83875566013481|1   |
|1      |dursley    |113.432224611812  |2   |
|1      |vernon     |87.1990801996875  |3   |
|1      |ronan      |78.14020927209204 |4   |
|1      |dudley     |68.99029093625391 |5   |
|2      |dobby      |105.99899747163643|1   |
|2      |lockhart   |80.41503929096756 |2   |
|2      |nick       |71.22474908628554 |3   |
|2      |bludger    |70.47239588371765 |4   |
|2      |car        |69.51827052940315 |5   |
|3      |marge      |151.7079321220254 |1   |
|3      |stan       |134.8031123969516 |2   |
|3      |ern        |100.67262086535317|3   |
|3      |pettigrew  |95.94356774606277 |4   |
|3      |buckbeak   |92.49501959737943 |5   |
|4      |frank      |186.4040704165901 |1   |
|4      |dobby      |135.2401002224327 |2   |
|4      |winky  

**1. Harry Potter and the Philosopher's Stone**

***quirrell***: The definition of a TF-IDF hit. He is the central villain for this book and never appears again.

***dursley, vernon, dudley***: Compared with the sequels, Book 1 devotes a lot of text to the Muggle world before Hogwarts.

***ronan***: The Centaur. Marks the first significant plot point in the Forbidden Forest.

**2. Harry Potter and the Chamber of Secrets**

***dobby***: The entire plot is driven by his attempts to "save" Harry.

***lockhart***: The exclusive Defense Against the Dark Arts teacher for this specific year.

***car, bludger***: Unique plot devices: the Flying Ford Anglia and the tampered Bludger are specific to this book.

**3. Harry Potter and the Prisoner of Azkaban**

***marge***: She appears in only one chapter.

***stan, ern***: The Knight Bus. The algorithm picks these up because the dialogue on the bus is repetitive and condensed, creating a statistical spike for these characters who rarely appear elsewhere.

***pettigrew***: The central mystery of the plot (Scabbers).

**4. Harry Potter and the Goblet of Fire**

***frank***: Frank Bryce is the protagonist of Chapter 1 and then disappears.

***winky, crouch***: The House-Elf subplot.

***cedric***: The tragic anchor of the Triwizard Tournament.

**5. Harry Potter and the Order of the Phoenix**

***umbridge***: She is present in almost every chapter as the antagonist, dominating the text frequency.

***prophecy***: The entire plot revolves around retrieving this object.

***ter***: A linguistic artifact. This is Hagrid's accent (phonetic "to"). It spikes here because Hagrid has long monologues recounting his journey to the giants, repeating this non-standard word many times.

**6. Harry Potter and the Half-Blood Prince**

***prime***: The political context. Refers to the Muggle Prime Minister.

***ogden, morfin***: The Pensieve Memories. These are not present-day characters, but figures from the Gaunt family flashbacks.

***slughorn***: The new professor and the holder of the key memory.

**7. Harry Potter and the Deathly Hallows**

***xenophilius***: Luna's dad, used to explain the Deathly Hallows symbol.

***griphook, greyback***: The story shifts away from Hogwarts classes to Gringotts bank.

***kreacher***: Unlike in Book 5, Kreacher becomes a pivotal ally here (leading the trio to the locket).

(7) LSH


In [26]:
from pyspark.ml.feature import Normalizer, BucketedRandomProjectionLSH

# STEP 1: Normalize
# L2-normalize 'tfidf_features' so that every chapter vector has unit length
# This prevents the chapter length from influencing the distance
normalizer = Normalizer(inputCol="tfidf_features", outputCol="normalized_features", p=2.0)
df_normalized = normalizer.transform(tfidf_df)

# STEP 2: LSH Configuration
# input = normalized_features, output = hashes
# bucketLength: the width of each hash bucket
# numHashTables: number of independent hash tables (more tables improve recall but cost more)
brp = BucketedRandomProjectionLSH(
    inputCol="normalized_features", outputCol="hashes",
    bucketLength=2.0, numHashTables=3)

# Train the LSH model on normalized data
lsh_model = brp.fit(df_normalized)
df_hashed = lsh_model.transform(df_normalized)

print("Hashing complete. Seraching for similarities...")

# STEP 3: Find similar chapters
# threshold=1.2: distance threshold, using the Euclidean Distance measure
pairs = lsh_model.approxSimilarityJoin(df_hashed, df_hashed, threshold=1.2, distCol="EuclideanDistance")

# STEP 4: Cleaning
# 1. Remove self-comparisons (same chapter of the same book, distance = 0)
# 2. Remove mirrored duplicates (book 1 ch. 2 vs book 3 ch. 4 == book 3 ch. 4 vs book 1 ch. 2)

clean_pairs = pairs.filter(
    (F.col("datasetA.book_id") < F.col("datasetB.book_id")) |
    ((F.col("datasetA.book_id") == F.col("datasetB.book_id")) &
     (F.col("datasetA.chapter_id") < F.col("datasetB.chapter_id")))
).orderBy(F.col("EuclideanDistance")).select(  # sort on the numeric distance; the formatted column is a string
    F.col("datasetA.book_id").alias("Book_A"),
    F.col("datasetA.chapter_id").alias("Chapter_A"),
    F.col("datasetB.book_id").alias("Book_B"),
    F.col("datasetB.chapter_id").alias("Chapter_B"),
    F.format_number(F.col("EuclideanDistance"), 4).alias("Distance")
)

print("The most similar pairs of chapters are (Low distance = High Similarity):")
clean_pairs.show(30, truncate=False)

Hashing complete. Seraching for similarities...
The most similar pairs of chapters are (Low distance = High Similarity):
+------+---------+------+---------+--------+
|Book_A|Chapter_A|Book_B|Chapter_B|Distance|
+------+---------+------+---------+--------+
|1     |3        |5     |2        |0.7565  |
|1     |3        |4     |4        |0.7766  |
|4     |4        |5     |2        |0.8219  |
|1     |3        |7     |3        |0.8280  |
|1     |3        |2     |1        |0.8330  |
|1     |3        |4     |3        |0.8386  |
|2     |1        |5     |2        |0.8632  |
|2     |1        |4     |4        |0.8633  |
|1     |2        |1     |3        |0.8686  |
|4     |3        |5     |2        |0.8703  |
|7     |24       |7     |25       |0.8830  |
|4     |4        |7     |3        |0.8838  |
|5     |2        |7     |3        |0.8884  |
|4     |3        |4     |4        |0.8956  |
|2     |1        |7     |3        |0.9134  |
|1     |2        |2     |1        |0.9255  |
|2     |2        |4     

***1. The Dursley Cluster***

Chapters: 1-3 (The Letters from No One), 5-2 (A Peck of Owls), 4-4 (Back to the Burrow, which starts at the Dursleys'), 2-1 (The Worst Birthday), 7-3 (The Dursleys Departing).

Why: In these chapters, there is often an invasion of letters or owls at the Dursley household.

Relevant words: Uncle, Vernon, Aunt, Petunia, Dudley, Letter, Owl, Kitchen, Scream, television, drill, living room.
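
Because the vectors were L2-normalized before hashing, the Euclidean distances in the table above can be read as cosine similarities: for unit vectors, d² = 2 − 2·cos, so cos = 1 − d²/2. The helper name below is hypothetical; the distances are copied from the output table.

```python
def cosine_from_unit_distance(distance):
    """Cosine similarity of two unit-length vectors, given their Euclidean distance."""
    return 1.0 - distance ** 2 / 2.0

# Distances copied from the output table; 1.2 is the join threshold.
examples = [("1-3 vs 5-2", 0.7565), ("1-3 vs 4-4", 0.7766), ("threshold", 1.2)]
for label, dist in examples:
    cos = cosine_from_unit_distance(dist)
    print(f"{label}: distance={dist:.4f} -> cosine={cos:.4f}")
```

Under this reading, the closest Dursley pair has a cosine similarity of about 0.71, and the threshold of 1.2 keeps every pair with a cosine similarity of at least 0.28.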

***2. Narrative Continuity***

7-20 vs 7-21 (Xenophilius Lovegood & The Tale of the Three Brothers).

Relevant words: Hallows, Wand, Peverell, Cloak, Stone.

7-24 vs 7-25 (The Wandmaker & Shell Cottage).

Relevant words: Griphook, Ollivander, Wand.

Why: In the 7th book, the plot is a continuous stream (the journey in the tent), lacking the distinct school-year structure.

***3. Grimmauld Place***

Chapters: 5-6 (The Noble and Most Ancient House of Black) and 7-10 (Kreacher’s Tale).

Why: Both chapters take place entirely inside Number 12, Grimmauld Place.

Relevant words: Kreacher, Portrait, Sirius, Mother, Walburga, Regulus, Locket, Clean.

***4. The Dobby Connection***

Chapters: 2-2 (Dobby’s Warning) and 4-21 (The House-Elf Liberation Front).

Why: In both chapters Dobby speaks obsessively to Harry.

Relevant words: Dobby, Elf, Sir.

***Conclusion***

The results suggest that J.K. Rowling relies on recurring lexical "templates".

When she writes about the Dursleys, she draws on the same specific set of words (anger, Muggle objects), which sets those chapters apart, in TF-IDF space, from the chapters set in the magical world.

When the plot becomes static (Harry hiding or traveling in Book 7), adjacent chapters resemble each other closely because the setting does not change.
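
How the bucketed random projection in step (7) assigns vectors to buckets can be sketched in plain Python. Each hash is floor((x · v) / bucketLength) for a random unit vector v. The toy dimension, the seed and the random stand-in vectors are assumptions made for illustration; they are not the notebook's TF-IDF data.

```python
import math
import random

def random_unit_vector(dim, rng):
    """Draw a random direction by normalizing Gaussian components."""
    vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def brp_hash(x, direction, bucket_length):
    """Bucketed random projection hash: floor((x . v) / bucket_length)."""
    projection = sum(a * b for a, b in zip(x, direction))
    return math.floor(projection / bucket_length)

rng = random.Random(0)  # fixed seed (hypothetical choice) for reproducibility
DIM = 50                # toy dimension, far smaller than a real TF-IDF vocabulary
directions = [random_unit_vector(DIM, rng) for _ in range(3)]  # one per hash table
points = [random_unit_vector(DIM, rng) for _ in range(200)]    # stand-ins for unit-length chapters

for bucket_length in (2.0, 0.1):
    seen = {brp_hash(p, v, bucket_length) for p in points for v in directions}
    print(f"bucket_length={bucket_length}: distinct hash values = {sorted(seen)}")
```

For unit vectors the projection always lies in [-1, 1], so a bucket length of 2.0 leaves only two possible hash values (-1 and 0) per table. A smaller bucket length gives finer buckets and fewer candidate pairs, at the risk of missing some true neighbours.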


(8) PAGE RANK

In [30]:
from pyspark.ml.feature import StopWordsRemover

# Build one array of cleaned words (no symbols) per chapter, named "words_array".
# A fresh DataFrame is needed here: the earlier one was built to analyze the story
# rather than the characters, so character names were removed from its "words_array"
# with a custom stop-word list.
df_word_arrays_PR = df_cleaned_words.groupBy("book_id", "chapter_id").agg(
    F.collect_list("word").alias("words_array"))

# Load the default English stop words
stop_words_list = StopWordsRemover.loadDefaultStopWords("english")

# Initialize the remover on "words_array" and call the output column "filtered_words"
remover = StopWordsRemover(inputCol="words_array", outputCol="filtered_words",
                           stopWords=stop_words_list)

df_filtered_arrays_PR = remover.transform(df_word_arrays_PR)

In [33]:
# 1. Define the principal characters (lower case)
characters_list = [
    "harry", "ron", "hermione", "dumbledore", "voldemort", "snape",
    "draco", "hagrid", "neville", "ginny", "lupin", "sirius",
    "mcgonagall", "dobby", "kreacher"]

# 2. Create a unique key per chapter (book_id_chapter_id), called chapter_key,
# and explode the word array into one row per word
df_words_per_chapter_PR = df_filtered_arrays_PR.select(
    F.concat(F.col("book_id"), F.lit("_"), F.col("chapter_id")).alias("chapter_key"),
    F.explode(F.col("filtered_words")).alias("word"))

# 3. Keep only the rows whose word is a tracked character, drop the rest
# distinct --> each character counts at most once per chapter
df_mentions_PR = df_words_per_chapter_PR.filter(
    F.col("word").isin(characters_list)).distinct()

# 4. Create two aliased copies of df_mentions_PR for the self-join
df_A = df_mentions_PR.alias("A")
df_B = df_mentions_PR.alias("B")

print("Starting the edges count...")

# 5. Join where A.word < B.word, which avoids mirrored pairs (Harry-Ron and Ron-Harry) and self-pairs (Harry-Harry)
# Rename the two nodes as source node and destination node
df_edges_raw_PR = df_A.join(df_B, on="chapter_key") \
    .where(F.col("A.word") < F.col("B.word")) \
    .select(F.col("A.word").alias("src_node"), F.col("B.word").alias("dst_node"))

# 6. Group by (src_node, dst_node),
# count, and rename the "count" column to "weight" (the usual term for weighted graphs)
edges_PR = df_edges_raw_PR.groupBy("src_node", "dst_node").count().withColumnRenamed("count", "weight")

print("Found Edges :")
edges_PR.orderBy(F.col("weight").desc()).show()

Starting the edges count...
Found Edges :
+----------+---------+------+
|  src_node| dst_node|weight|
+----------+---------+------+
|     harry|      ron|   177|
|     harry| hermione|   174|
|dumbledore|    harry|   173|
|  hermione|      ron|   171|
|dumbledore|      ron|   157|
|dumbledore| hermione|   156|
|     harry|voldemort|   126|
|    hagrid|    harry|   122|
|     harry|    snape|   118|
|    hagrid|      ron|   115|
|    hagrid| hermione|   114|
|  hermione|    snape|   113|
|dumbledore|   hagrid|   113|
|dumbledore|voldemort|   113|
|  hermione|voldemort|   112|
|dumbledore|    snape|   111|
|       ron|voldemort|   111|
|       ron|    snape|   111|
|     harry|   sirius|   100|
|     ginny|    harry|    98|
+----------+---------+------+
only showing top 20 rows
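
The self-join above counts, for every pair of tracked characters, the number of chapters that mention both. The same logic can be sketched without Spark; the `chapter_mentions` dictionary is a hypothetical toy input, not data from the books.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy input: the distinct tracked characters mentioned in each chapter.
chapter_mentions = {
    "1_1": {"harry", "dumbledore", "hagrid"},
    "1_2": {"harry"},
    "1_3": {"harry", "hagrid"},
}

edge_weights = Counter()
for characters in chapter_mentions.values():
    # Sorting first guarantees src < dst, mirroring the A.word < B.word filter.
    for src, dst in combinations(sorted(characters), 2):
        edge_weights[(src, dst)] += 1

for (src, dst), weight in edge_weights.most_common():
    print(src, dst, weight)
```

A chapter that mentions a single character contributes no edge, and the weight of a pair can never exceed the total number of chapters.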



In [35]:
import networkx as nx

#From Spark to Pandas
edges_pandas_df = edges_PR.toPandas()

print("Starting...")

G = nx.from_pandas_edgelist(
    edges_pandas_df, source='src_node',
    target='dst_node', edge_attr='weight')

# Calculate the PageRank
pagerank_scores = nx.pagerank(G, weight='weight')

# Sorted Results
print("Results of PageRank (using NetworkX) : ")
sorted_pagerank = sorted(pagerank_scores.items(), key=lambda item: item[1], reverse=True)

for character, score in sorted_pagerank:
    print(f"- {character.capitalize()}: {score:.4f}")

Starting...
Results of PageRank (using NetworkX) : 
- Harry: 0.1015
- Ron: 0.0972
- Hermione: 0.0964
- Dumbledore: 0.0942
- Hagrid: 0.0740
- Snape: 0.0722
- Voldemort: 0.0720
- Ginny: 0.0622
- Sirius: 0.0620
- Mcgonagall: 0.0606
- Neville: 0.0587
- Draco: 0.0503
- Lupin: 0.0503
- Dobby: 0.0250
- Kreacher: 0.0233
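
To make the scores above less of a black box, here is a minimal power-iteration sketch of weighted PageRank on an undirected graph. It is a simplified stand-in for `nx.pagerank`, not its actual implementation: it uses the same default damping factor (0.85) but a fixed number of iterations instead of a convergence tolerance. Only six of the printed edges are used, so the numbers will not match the notebook output.

```python
def weighted_pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank on an undirected weighted edge list."""
    # Build a symmetric adjacency map: each undirected edge is walked both ways.
    neighbors = {}
    for src, dst, weight in edges:
        neighbors.setdefault(src, {})[dst] = weight
        neighbors.setdefault(dst, {})[src] = weight

    nodes = list(neighbors)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    strength = {node: sum(neighbors[node].values()) for node in nodes}

    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            for other, weight in neighbors[node].items():
                # A node passes its rank to neighbors in proportion to edge weight.
                new_rank[other] += damping * rank[node] * weight / strength[node]
        rank = new_rank
    return rank

# Six edge weights copied from the printed edge table (a subset of the full graph).
sample_edges = [
    ("harry", "ron", 177), ("harry", "hermione", 174), ("dumbledore", "harry", 173),
    ("hermione", "ron", 171), ("dumbledore", "ron", 157), ("dumbledore", "hermione", 156),
]
scores = weighted_pagerank(sample_edges)
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.4f}")
```

Because every node hands out its rank in proportion to edge weight, a character who shares many chapters with already high-ranked characters ends up with a high score.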


**TIER 1 (The Core Four)**

Who: Harry, Ron, Hermione, Dumbledore.

Why: Their scores are nearly identical (0.094 to 0.102). The key finding is that Dumbledore scores almost as high as the Trio, which suggests he acts as a strategic "bridge" connecting different groups (heroes, villains, Ministry).

**TIER 2 (The Secondary Hubs)**

Who: Hagrid, Snape, Voldemort.

Why: Hagrid, Snape and Voldemort rank high not because they co-occur with many different characters, but because they are strongly connected to the Core Four.

**TIER 3 (The Key Allies)**

Who: Ginny, Sirius, McGonagall, Neville, Draco, Lupin.

Why: This is the main supporting cast. Ginny and Sirius rank highest in this group due to their strong, exclusive link to Harry (the network's most important node).

**TIER 4 (The Isolated Specialists)**

Who: Dobby, Kreacher.

Why: They are isolated, appear almost only alongside Harry, and are not "connectors" in the network.