# Dataset Creation

## Dataset Overview

The goal of this notebook is to create individual datasets containing user reviews for a selected set of movies, including the *Star Wars* saga and several other well-known films.  
Each dataset will be structured to support downstream tasks such as topic modelling.

The original datasets contain the following columns:
- `Review_ID`: unique identifier of the review  
- `Movie_ID`: unique identifier of the movie  
- `Movie_Title`: title of the reviewed movie  
- `Rating`: numerical rating assigned by the user to the film
- `Review_Date`: date when the review was posted  
- `Review_Title`: title of the review  
- `Review_Text`: full text of the review  
- `Helpful_Votes`: number of users who found the review helpful  
- `Total_Votes`: total number of votes the review received
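
As a quick sanity check, the column list above can be turned into a small schema test. This is only a sketch: `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names introduced here for illustration, and the toy DataFrame contains made-up values.

```python
import pandas as pd

# Column names copied from the list above
EXPECTED_COLUMNS = [
    "Review_ID", "Movie_ID", "Movie_Title", "Rating", "Review_Date",
    "Review_Title", "Review_Text", "Helpful_Votes", "Total_Votes",
]


def missing_columns(df: pd.DataFrame) -> list[str]:
    """Return the expected columns that are absent from df (hypothetical helper)."""
    return [col for col in EXPECTED_COLUMNS if col not in df.columns]


# Toy frame with made-up values, for illustration only
example = pd.DataFrame({"Review_ID": ["0000001"], "Movie_ID": ["tt0000000"]})
print(missing_columns(example))
```

An empty list means the DataFrame carries every column that the downstream steps rely on.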

The datasets include the following movies:

**Star Wars Series:**
- Episode I – The Phantom Menace (1999)  
- Episode II – Attack of the Clones (2002)  
- Episode III – Revenge of the Sith (2005)  
- Episode IV – A New Hope (1977)  
- Episode V – The Empire Strikes Back (1980)  
- Episode VI – Return of the Jedi (1983)  
- Episode VII – The Force Awakens (2015)  
- Episode VIII – The Last Jedi (2017)  
- Episode IX – The Rise of Skywalker (2019)

**Other Movies:**
- Parasite (2019)  
- The Good, the Bad and the Ugly (1966)  
- Harry Potter and the Sorcerer's Stone (2001)  
- Oppenheimer (2023)  
- La La Land (2016)  
- Raiders of the Lost Ark (1981)

## Setup: Installing and Importing Required Libraries

In [None]:
import subprocess
import sys

# List of required third-party packages
# (pickle is part of the Python standard library and needs no installation)
required_packages = [
    "pandas", "tqdm", "selenium"
]

def install_package(package):
    """Installs a package using pip if it's not already installed."""
    try:
        __import__(package)
        print(f"{package} is already installed.")
    except ImportError:
        print(f"Installing {package}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])

# Check and install missing packages
for package in required_packages:
    install_package(package)


pandas is already installed.
tqdm is already installed.
selenium is already installed.
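
The check above imports each package to see whether it is available. A lighter alternative, sketched here under the assumption that the import name matches the name passed in, is to ask the import system whether it can find the module without importing it. `is_importable` and `missing_modules` are hypothetical helper names.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def missing_modules(module_names):
    """Return the names from module_names that cannot be located."""
    return [name for name in module_names if not is_importable(name)]


# Standard-library modules such as pickle are always found, so they never
# need to be installed with pip.
print(is_importable("pickle"))
print(missing_modules(["pickle", "json", "surely_not_a_real_module_xyz"]))
```

Note that some packages use a different name on PyPI than at import time, so a name-based check like this one only works when the two coincide.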


In [3]:
from Retriever import Retriever
import pandas as pd
import pickle
import os

## Retrieve Reviews on Star Wars Movies

This cell defines a list of IMDb IDs (without the `tt` prefix) for the nine Star Wars episodes, ordered by release date: Episodes IV to VI, then I to III, then VII to IX. These IDs are used to identify and extract the reviews for each film.

In [None]:
# STAR WARS MOVIES
list_of_movies_ids_sw = ['0076759', # Star Wars: Episode IV - A New Hope
                         '0080684', # Star Wars: Episode V - The Empire Strikes Back
                         '0086190', # Star Wars: Episode VI - Return of the Jedi
                         '0120915', # Star Wars: Episode I - The Phantom Menace
                         '0121765', # Star Wars: Episode II - Attack of the Clones
                         '0121766', # Star Wars: Episode III - Revenge of the Sith
                         '2488496', # Star Wars: Episode VII - The Force Awakens
                         '2527336', # Star Wars: Episode VIII - The Last Jedi
                         '2527338'] # Star Wars: Episode IX - The Rise of Skywalker

### Retrieving Star Wars Reviews

This cell initializes a `Retriever` instance and iterates through the IMDb IDs of the Star Wars movies to collect their reviews.  
For each movie ID, the corresponding review IDs are retrieved. If no reviews are found, the movie is skipped.  
Otherwise, the reviews are fetched, converted into a DataFrame, and appended to a cumulative DataFrame (`reviews_df_sw`) containing all Star Wars reviews.

In [None]:
# Initialize the Retriever instance
retriever = Retriever()

# Start with an empty DataFrame
reviews_df_sw = pd.DataFrame()

# Loop through each movie ID and retrieve reviews
for movie_id in list_of_movies_ids_sw:

    # Retrieve review IDs for the current movie
    review_ids = retriever.get_reviews_ids_from_movie_id(movie_id)

    # If no reviews are found, skip to the next movie
    if not review_ids:
        print(f'No reviews found for movie ID {movie_id}. Skipping...')
        continue

    # Print the retrieved review IDs
    print('From the movie ' + str(movie_id) + ' were retrieved the following reviews:')
    print(review_ids)

    # Retrieve reviews as a DataFrame and concatenate it to the main DataFrame.
    # ignore_index=False keeps each movie's own 0-based row index.
    temp_df = retriever.get_reviews_dataframe_from_set_of_review_ids(review_ids)
    reviews_df_sw = pd.concat([reviews_df_sw, temp_df], ignore_index=False)

print(reviews_df_sw)

From the movie 0076759 were retrieved the following reviews:
{'2221293', '4756672', '0156096', '0155657', '0155649', '4953160', '0156097', '9774717', '6820587', '8894332', '0155615', '2700918', '5626227', '3698225', '4355842', '1874335', '8970299', '0155911', '3456651', '3066084', '0155944', '0156370', '1105625', '0156018', '0155577', '0155574', '2986047', '0156338', '0156302', '1819423', '10149651', '5662844', '0155954', '0156009', '9122781', '0155989', '1369318', '0155977', '2362318', '0155808', '1368831', '6776240', '4210998', '5392236', '0156023', '8154338', '0156019', '5459915', '0155934', '0156173', '0155916', '1024666', '0156362', '0156264', '0155959', '9346556', '1295150', '6240366', '5861351', '8646714', '0155997', '0977440', '0155776', '6176397', '0156225', '1885864', '1230435', '4541638', '0155742', '5779293', '0156104', '9553053', '0155659', '3595682', '1883776', '0156044', '3770220', '4193274', '0980099', '1685061', '6529355', '0155889', '0155602', '4037497', '5686327', '1

retrieving reviews: 100%|██████████| 2158/2158 [40:23<00:00,  1.12s/review]


From the movie 0080684 were retrieved the following reviews:
{'0175218', '0175219', '7850609', '2228217', '3362763', '5379421', '2923020', '0175269', '0175178', '3505039', '10123864', '8153753', '3462860', '1282496', '2315025', '2700351', '2946604', '6538305', '5497640', '0175331', '0175406', '3070618', '3250778', '0175209', '0175039', '0175353', '5412748', '5559711', '2247664', '1618217', '4139144', '2135470', '0175378', '0175055', '3247399', '0175140', '6698019', '5825447', '5215636', '9885867', '7386910', '1098779', '0175112', '0175267', '0175065', '9773188', '0175419', '0175197', '1445338', '0175375', '1764556', '3796121', '9362274', '7272603', '5294660', '0943055', '0175119', '8113160', '0175224', '4175073', '0175144', '6931343', '0175094', '1674596', '8513636', '0175052', '1879538', '2565647', '0175203', '0994521', '2590540', '1805155', '0175387', '0175033', '1100072', '8036856', '0175139', '0175148', '6359976', '0175292', '8280169', '0175020', '2703374', '8396182', '6905675', '1

retrieving reviews: 100%|██████████| 1507/1507 [28:39<00:00,  1.14s/review]


From the movie 0086190 were retrieved the following reviews:
{'8924035', '0204248', '3478682', '0204017', '1954323', '2093372', '3321363', '1404502', '0204224', '5466920', '1983468', '2522632', '0204142', '8201800', '3788069', '6063538', '0204041', '3384641', '9013697', '10004272', '1136251', '0204031', '3241943', '1991077', '1202676', '2397123', '2401983', '2932391', '1475765', '7266771', '5578417', '1229462', '3833130', '1605355', '3460918', '2537951', '3061618', '1057136', '1096197', '1917833', '0204292', '6378547', '5803942', '0204167', '3372223', '3377333', '9658249', '3841859', '0204141', '5927754', '0204171', '0204160', '2749449', '7276764', '7583103', '2540059', '0204128', '8044147', '0204289', '5394519', '3426518', '0204226', '0204257', '4899406', '2433590', '1383496', '9620438', '6364058', '9546540', '8364394', '0204091', '1114197', '5009557', '1153732', '3371180', '0204188', '0204209', '3329626', '9895872', '2656637', '0204236', '0204140', '0204130', '0204189', '0204067', '3

retrieving reviews: 100%|██████████| 1017/1017 [19:19<00:00,  1.14s/review]


From the movie 0120915 were retrieved the following reviews:
{'0480405', '0480861', '1124198', '0480392', '0483332', '1546491', '0481354', '0480414', '0480917', '0482819', '0481770', '0481517', '0480815', '0481329', '7261049', '6355171', '0480635', '1670783', '0482560', '9773015', '0482676', '8235560', '0482753', '2919809', '0481245', '1717240', '0482305', '0482313', '0482144', '0482970', '4348241', '0481451', '0481554', '0483090', '5530596', '6403072', '5437590', '0483307', '0482400', '0481958', '0482584', '1259010', '0482180', '3316006', '0481416', '0481734', '0482786', '2005699', '0482674', '1537699', '0482135', '0482458', '0481874', '0480529', '2254075', '0480365', '1831095', '1070880', '1188744', '0481926', '0481917', '0481151', '0482211', '0482349', '0480510', '0482252', '0481819', '0481188', '0482308', '0482449', '0480501', '0483049', '0483151', '7031568', '1642175', '0482703', '0480679', '0482234', '8298108', '0483048', '0481262', '0482325', '0480596', '3347239', '1228562', '04

retrieving reviews: 100%|██████████| 4094/4094 [1:19:07<00:00,  1.16s/review]


From the movie 0121765 were retrieved the following reviews:
{'0484620', '0485891', '0485713', '0483884', '0485028', '5748093', '0486473', '0486259', '0486353', '0484510', '0485086', '0486827', '0486516', '0484208', '6423009', '0484718', '0485463', '0484253', '0486427', '0484193', '0484401', '0485880', '0484482', '0486555', '0486576', '0485654', '0484120', '0483994', '0485041', '0484555', '9011133', '1835839', '0484799', '0486192', '0486887', '0486820', '7645172', '0484819', '0486219', '0485421', '0484992', '0485247', '0484229', '0990298', '0486001', '4203753', '0485029', '0486057', '0484137', '3466652', '0485014', '0485233', '3375981', '6574524', '0485422', '0486724', '0484000', '0486004', '4523039', '0486746', '1092624', '0484749', '0484198', '0485955', '9772807', '0484552', '0486359', '0486623', '0485328', '0485234', '5044917', '0486270', '0486826', '0485060', '0485002', '0486703', '0484045', '1334682', '0484227', '6644500', '0484536', '0486150', '0485125', '0486095', '10220303', '0

retrieving reviews: 100%|██████████| 3880/3880 [1:15:17<00:00,  1.16s/review]


From the movie 0121766 were retrieved the following reviews:
{'5328176', '1076500', '1087664', '1083438', '1120256', '1087672', '1622656', '1088332', '1255447', '1125323', '1287254', '1239739', '1216033', '1083855', '1539781', '1086911', '1086417', '1084044', '1095676', '1095545', '1101919', '1087355', '1099222', '1079881', '4994636', '1084403', '1096150', '1082940', '1090771', '1086758', '1088716', '1087891', '1089279', '1085326', '1083245', '1090432', '1129458', '1140090', '1084954', '1073963', '5432337', '7324305', '1086049', '7270854', '1087267', '1091107', '1083846', '1086597', '1088313', '1135306', '1084551', '1086936', '1086868', '2212063', '1199983', '1621552', '1084824', '1092304', '1089382', '1084518', '1085722', '9071466', '1085527', '1086560', '3851257', '1084056', '1088811', '5753339', '1631820', '1562530', '1084348', '1084519', '1525159', '1116769', '1087068', '5234911', '1211833', '1086965', '1665313', '1087450', '1084569', '5817878', '5476801', '1254863', '1084026', '10

retrieving reviews: 100%|██████████| 3876/3876 [1:15:31<00:00,  1.17s/review]


From the movie 2488496 were retrieved the following reviews:
{'3377036', '3375558', '3616455', '3374686', '8903870', '5281803', '3400557', '4394376', '3378063', '3385740', '3376818', '3603755', '3390787', '3377046', '3374059', '3376328', '3385339', '3392413', '5363280', '3468354', '3375168', '3376360', '4084633', '4009450', '3376098', '4025604', '3379279', '7445387', '3557043', '3382484', '3374075', '9404593', '3375843', '4808829', '3379618', '5398808', '3379075', '3782728', '3380382', '3383126', '3376006', '3436985', '3374708', '5355493', '3382655', '3479042', '6104941', '3376438', '3444134', '3373474', '3389547', '5516106', '3378339', '3378649', '3389258', '3378304', '3374132', '3378934', '3383071', '5694496', '3372584', '3482611', '3375025', '4046737', '3375663', '8224457', '3375245', '3434297', '3378091', '3379708', '3385663', '3602122', '3440845', '3376932', '3407640', '3509437', '7112169', '9494227', '3437745', '3373396', '3395483', '9935590', '4161832', '3401006', '3381898', '35

retrieving reviews: 100%|██████████| 4860/4860 [1:35:35<00:00,  1.18s/review]


From the movie 2527336 were retrieved the following reviews:
{'4011769', '4011249', '4290970', '4022132', '4143730', '5074524', '4004829', '4001479', '4005700', '4002260', '5383487', '4322222', '4004079', '4002819', '4010559', '4007836', '4001620', '4125620', '4478234', '4001202', '4017960', '4004176', '4005023', '4015829', '4006321', '4019997', '4002738', '4329059', '4020672', '4209369', '4001957', '4009805', '9055294', '4005493', '4000710', '4101704', '4007420', '4261279', '4007490', '4586231', '4003610', '4022888', '4560258', '4011862', '8330789', '4541991', '4267442', '5340668', '4002650', '4128905', '4003364', '4005158', '4008317', '4006341', '4001015', '4002892', '4002122', '4011987', '6783402', '6983204', '4021677', '4618918', '4001219', '4505075', '5872359', '4011507', '4009367', '4625990', '4025003', '4024770', '4104720', '4105726', '4597293', '4152303', '4066110', '5032710', '4006822', '4549052', '4112535', '4011814', '4491583', '8356580', '4010403', '4002579', '4034113', '52

retrieving reviews: 100%|██████████| 6909/6909 [2:13:56<00:00,  1.16s/review]  


From the movie 2527338 were retrieved the following reviews:
{'5361859', '5339380', '5335747', '5345879', '5360668', '5334297', '5332197', '5332813', '5345439', '5738785', '5334697', '8236846', '5334215', '5356773', '5385120', '5341291', '5331703', '5350173', '5355251', '5334855', '5340506', '5555956', '5886307', '7540763', '5342896', '5340278', '5331305', '5334953', '5343943', '6156374', '5331815', '5336829', '5744611', '5335872', '6086745', '5427454', '8746554', '5373365', '5575337', '5417133', '5337184', '5555584', '5350593', '5331037', '5375836', '5685846', '7257618', '5354920', '5362995', '5330990', '5710010', '5345080', '5382677', '5417782', '6412909', '5334943', '5584278', '5361937', '5333592', '5345358', '5361848', '5350501', '5333581', '10092607', '5353401', '5327575', '5332387', '5338146', '5899810', '5420263', '10108802', '5348457', '5426619', '5340720', '5356410', '5335197', '5333387', '5379060', '5372412', '7902790', '5341323', '8270900', '5331146', '5349462', '5331705', '

retrieving reviews: 100%|██████████| 7891/7891 [2:32:46<00:00,  1.16s/review]  

     Review_ID   Movie_ID                                    Movie_Title  \
0      2221293  tt0076759             Star Wars: Episode IV - A New Hope   
1      4756672  tt0076759             Star Wars: Episode IV - A New Hope   
2      0156096  tt0076759             Star Wars: Episode IV - A New Hope   
3      0155657  tt0076759             Star Wars: Episode IV - A New Hope   
4      0155649  tt0076759             Star Wars: Episode IV - A New Hope   
...        ...        ...                                            ...   
7886   5552672  tt2527338  Star Wars: Episode IX - The Rise of Skywalker   
7887   5342141  tt2527338  Star Wars: Episode IX - The Rise of Skywalker   
7888   5334366  tt2527338  Star Wars: Episode IX - The Rise of Skywalker   
7889   5831082  tt2527338  Star Wars: Episode IX - The Rise of Skywalker   
7890   5339016  tt2527338  Star Wars: Episode IX - The Rise of Skywalker   

      Rating       Review_Date                                  Review_Title  \
0      
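
The retrieval loop above grows the DataFrame with one `pd.concat` call per movie. A common alternative is to collect the per-movie frames in a list and concatenate once at the end. The sketch below illustrates that pattern only: `FakeRetriever` is a hypothetical stand-in that mimics the two `Retriever` method names used in this notebook and returns made-up data, and `collect_reviews` is a hypothetical helper. Unlike the cell above, it uses `ignore_index=True`, so the combined frame gets a single continuous row index.

```python
import pandas as pd


class FakeRetriever:
    """Hypothetical stand-in for the notebook's Retriever (no scraping, made-up data)."""

    def __init__(self, reviews_by_movie):
        self._reviews_by_movie = reviews_by_movie

    def get_reviews_ids_from_movie_id(self, movie_id):
        return set(self._reviews_by_movie.get(movie_id, {}))

    def get_reviews_dataframe_from_set_of_review_ids(self, review_ids):
        rows = [
            {"Review_ID": rid, "Review_Text": text}
            for reviews in self._reviews_by_movie.values()
            for rid, text in reviews.items()
            if rid in review_ids
        ]
        return pd.DataFrame(rows)


def collect_reviews(retriever, movie_ids):
    """Fetch one DataFrame per movie and concatenate them once at the end."""
    frames = []
    for movie_id in movie_ids:
        review_ids = retriever.get_reviews_ids_from_movie_id(movie_id)
        if not review_ids:
            continue  # no reviews for this movie, skip it
        frames.append(retriever.get_reviews_dataframe_from_set_of_review_ids(review_ids))
    if not frames:
        return pd.DataFrame()  # pd.concat raises on an empty list
    return pd.concat(frames, ignore_index=True)


fake = FakeRetriever({
    "m1": {"r1": "first text", "r2": "second text"},
    "m2": {},
    "m3": {"r3": "third text"},
})
all_reviews = collect_reviews(fake, ["m1", "m2", "m3"])
print(len(all_reviews))
```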




### Saving Star Wars Reviews Dataset

Once all Star Wars reviews have been collected into a single DataFrame, this cell saves the dataset as a `.pkl` file using the `pickle` module.  
The file is stored at the specified path (`../Dataset/sw_reviews.pkl`) for future use.

In [None]:
# File path
file_path = "../Dataset/sw_reviews.pkl"

# Save data
with open(file_path, "wb") as file:
    pickle.dump(reviews_df_sw, file)

print(f"Data stored in: {file_path}")

Data stored in: ../Dataset/sw_reviews.pkl


### Loading and Displaying the Star Wars Reviews Dataset

This test cell loads the previously saved Star Wars reviews dataset from the `.pkl` file using the `pickle` module.  
After loading, the first rows of the DataFrame are displayed to verify that the data was correctly stored and retrieved.


In [None]:
# File path
file_path = "../Dataset/sw_reviews.pkl"

# Load data
with open(file_path, 'rb') as file:
    reviews_df_sw = pickle.load(file)
    
# Display the first few rows of the DataFrame
reviews_df_sw.head()

| | Review_ID | Movie_ID | Movie_Title | Rating | Review_Date | Review_Title | Review_Text | Helpful_Votes | Total_Votes |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2221293 | tt0076759 | Star Wars: Episode IV - A New Hope | NaN | 15 March 2010 | Impossible to watch with fresh eyes | It was a long time ago when I first saw Star W... | 0.0 | 0.0 |
| 1 | 4756672 | tt0076759 | Star Wars: Episode IV - A New Hope | 10.0 | 1 April 2019 | It's Still Just Star Wars to Me | While I will acknowledge its faults this is st... | 0.0 | 0.0 |
| 2 | 0156096 | tt0076759 | Star Wars: Episode IV - A New Hope | 10.0 | 19 January 1999 | A modern myth that can't be beat | Star Wars is a modern myth that has a story li... | 0.0 | 0.0 |
| 3 | 0155657 | tt0076759 | Star Wars: Episode IV - A New Hope | NaN | 28 August 1999 | There is a God and his name is George Lucas | I saw for the first time when I was six years ... | 0.0 | 0.0 |
| 4 | 0155649 | tt0076759 | Star Wars: Episode IV - A New Hope | 1.0 | 31 August 1999 | Good but over-rated. | Frankly, I think "Star wars" is a great movie.... | 7.0 | 53.0 |
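
The save and load steps above can also be written with pandas' own `DataFrame.to_pickle` and `pd.read_pickle`. The sketch below is a minimal round-trip check on a toy DataFrame with made-up values. It writes to a temporary directory rather than to `../Dataset`, and the file name `example_reviews.pkl` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy data, for illustration only; IDs are kept as strings to preserve leading zeros
df = pd.DataFrame({
    "Review_ID": ["0000001", "0000002"],
    "Rating": [10.0, None],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_reviews.pkl")  # hypothetical file name
    df.to_pickle(path)                # save
    restored = pd.read_pickle(path)   # load

# DataFrame.equals treats missing values in the same position as equal
print(restored.equals(df))
```

As with any pickle file, only load `.pkl` files that come from a trusted source, because unpickling can execute arbitrary code.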


### Accessing a Specific Review by ID

This cell demonstrates how to access the full text of a specific review using its `Review_ID`.  
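The `.loc` indexer is used with a boolean mask to filter the DataFrame and retrieve the corresponding `Review_Text` for a given review. Note that `Review_ID` values are strings, so the ID must be passed as a string.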


In [None]:
# Display a specific review text
specific_text = reviews_df_sw.loc[reviews_df_sw['Review_ID'] == '2221293', 'Review_Text'].values[0]
print(specific_text)

It was a long time ago when I first saw Star Wars, I watched it as part of the trilogy in the early eighties, on TV and then as Lucas' CGI altered edits.There's not much I can add that isn't already littered on the, internet, countless books and so on. It has become ingrained in popular culture and it is impossible for me to watch it with fresh eyes. It was great to see my son watch it for the first time and no doubt his children will enjoy it too.The story is that dreamer Luke Skywalker must try to save Princess Leia from the evil clutches of Darth Vader. It could have been an awful b-movie but its strength is a great bold script, memorial characters, fantastic effects, costumes and John Williams timeless orchestral score. It has a princess, lasers, alien creatures, spaceships, and more. It's a good old fashioned tale of good versus evil and there really isn't much not to like.It has inspired filmmakers and has been parodied, imitated in numerous films, books and games. It has changed
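
The lookup above raises an `IndexError` when the requested ID is not in the DataFrame, because `.values[0]` is applied to an empty result. A defensive variant might look like the sketch below. `get_review_text` is a hypothetical helper name, and the toy data is made up.

```python
from typing import Optional

import pandas as pd


def get_review_text(df: pd.DataFrame, review_id: str) -> Optional[str]:
    """Return Review_Text for review_id, or None if the ID is absent (hypothetical helper)."""
    matches = df.loc[df["Review_ID"] == review_id, "Review_Text"]
    if matches.empty:
        return None
    return matches.iloc[0]


# Toy data, for illustration only
toy = pd.DataFrame({
    "Review_ID": ["0000001", "0000002"],
    "Review_Text": ["first made-up review", "second made-up review"],
})

print(get_review_text(toy, "0000002"))
print(get_review_text(toy, "9999999"))
```

Because the IDs are stored as strings with leading zeros, an abbreviated ID such as `"1"` does not match `"0000001"`.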

## Retrieve Reviews on Other Movies

This cell defines a list of IMDb IDs corresponding to a selection of other movies. These IDs will be used to retrieve and organize reviews for each of the selected films.

In [None]:
# OTHER MOVIES
list_of_movies_ids_others = ['6751668', # Parasite
                             '0060196', # The Good, the Bad and the Ugly
                             '0241527', # Harry Potter and the Sorcerer's Stone
                             '15398776', # Oppenheimer
                             '3783958', # La La Land
                             '0082971'] # Raiders of the Lost Ark

### Retrieving Other Movies Reviews

This cell initializes a `Retriever` instance and loops through the IMDb IDs of the selected movies to collect their reviews.  
For each movie, the review IDs are retrieved and used to fetch the corresponding review data.  
All reviews are then combined into a single DataFrame (`reviews_df_other`), which contains the complete set of reviews for the selected films.

In [None]:
# Initialize the Retriever instance
retriever = Retriever()

# Start with an empty DataFrame
reviews_df_other = pd.DataFrame()

# Loop through each movie ID and retrieve reviews
for movie_id in list_of_movies_ids_others:

    # Retrieve review IDs for the current movie
    review_ids = retriever.get_reviews_ids_from_movie_id(movie_id)

    # If no reviews are found, skip to the next movie
    if not review_ids:
        print(f'No reviews found for movie ID {movie_id}. Skipping...')
        continue

    # Print the retrieved review IDs
    print('From the movie ' + str(movie_id) + ' were retrieved the following reviews:')
    print(review_ids)

    # Retrieve reviews as a DataFrame and concatenate it to the main DataFrame.
    # ignore_index=False keeps each movie's own 0-based row index.
    temp_df = retriever.get_reviews_dataframe_from_set_of_review_ids(review_ids)
    reviews_df_other = pd.concat([reviews_df_other, temp_df], ignore_index=False)

print(reviews_df_other)

From the movie 6751668 were retrieved the following reviews:
{'9637661', '5510542', '5182892', '5499682', '6094155', '6432630', '8575840', '5479460', '5455102', '5497707', '5474801', '8604936', '5402089', '5562991', '5879734', '5689101', '5619903', '5473251', '5206652', '8370505', '7212564', '5068069', '6657588', '6111922', '5477602', '5709645', '7071309', '5413066', '5558988', '6971643', '10266177', '5657277', '9421473', '5582893', '5489428', '5476557', '5519180', '5519990', '9006868', '5464771', '5558930', '5968755', '5490408', '5525434', '5877800', '5098391', '5620712', '5411736', '5433291', '8674413', '6385885', '7573166', '5496998', '5479757', '6772747', '7598506', '5811761', '5466157', '5400281', '5475773', '7455592', '5636877', '5489234', '5534594', '8343717', '5231818', '5204711', '5850696', '5507171', '9424838', '5605791', '5579011', '5644782', '5411660', '7426632', '5347852', '5510482', '5293690', '5532134', '6967991', '7929899', '7782703', '5255606', '5596019', '5346582', '5

retrieving reviews: 100%|██████████| 3702/3702 [1:15:39<00:00,  1.23s/review]


From the movie 0060196 were retrieved the following reviews:
{'4931608', '0092838', '0092840', '2311851', '0092820', '0092672', '2466802', '3501397', '2041196', '0092856', '4426074', '5563839', '9822308', '3828803', '1551920', '3256242', '0092699', '3008633', '1423163', '2865594', '5248693', '3063298', '4160029', '0092742', '10157155', '1943619', '1977466', '4286679', '0092772', '8744929', '3238569', '5928471', '2605947', '2936610', '1816659', '5390527', '0092716', '1034877', '3398661', '0092853', '2641450', '2909579', '3012753', '1405659', '8273474', '3094099', '7082895', '0092800', '9945134', '0092607', '1801489', '0092615', '5676059', '6139539', '6887121', '0092688', '2446482', '0092612', '1599480', '4923425', '1330608', '6626019', '2588733', '2494149', '5599111', '4458309', '0092609', '3074009', '3691224', '10346495', '0092666', '9678809', '10186166', '9597033', '5706797', '7607662', '4265564', '0092719', '8682477', '2325320', '2648729', '3002756', '4615031', '1016231', '3863779', 

retrieving reviews: 100%|██████████| 1430/1430 [27:18<00:00,  1.15s/review]


From the movie 0241527 were retrieved the following reviews:
{'3524771', '4864065', '0717347', '0716768', '0717040', '0717991', '0717457', '2758217', '0717146', '2274960', '0717650', '0717967', '0716732', '9471254', '7905027', '0717702', '0717263', '0717255', '0717006', '0716733', '0717071', '0717570', '1055846', '0716815', '1766397', '0716844', '1314847', '1484252', '0717239', '3389731', '3674660', '5768409', '1885227', '0717430', '7637107', '2722827', '5049275', '4478861', '8994704', '1533439', '0716802', '0716871', '10186577', '8848418', '2904842', '10327811', '4366842', '0717811', '7776240', '4588750', '0717744', '0717276', '5661792', '0716975', '3426700', '9909535', '0717384', '0717812', '7559018', '0716828', '2277936', '3616152', '0716987', '0717435', '1689285', '6102575', '0717139', '0717516', '5991457', '0716756', '0716993', '0717017', '0717676', '0716857', '0717588', '3062511', '0717789', '0717215', '3474640', '3628645', '0717779', '1634106', '0717528', '0716788', '4614738', '

retrieving reviews: 100%|██████████| 2059/2059 [39:09<00:00,  1.14s/review]


From the movie 15398776 were retrieved the following reviews:
{'9972671', '9426804', '9205872', '9414242', '9712655', '10233375', '9290740', '9554992', '9205902', '9626902', '9272801', '9243646', '9210554', '9205643', '9290506', '9203751', '9206522', '9323685', '9217352', '9218337', '9329759', '9657847', '9226328', '9205080', '9202920', '9735458', '9229269', '9210983', '9209084', '9206454', '9370545', '9203930', '9432992', '9204435', '9205434', '9205126', '9217651', '9207207', '9253011', '9227350', '9577136', '9207633', '9212166', '9308134', '9220229', '9324747', '9370324', '9254970', '9244290', '9203875', '9207519', '9309033', '9219698', '9210003', '9231793', '9202814', '9211698', '9211634', '9249989', '9617145', '10287651', '9209292', '9303836', '9629812', '9208534', '9283845', '9205306', '9625935', '9591828', '9321019', '9200545', '9262217', '9211034', '9222516', '9270493', '9467689', '9227569', '9251330', '9209985', '9859519', '9709164', '9203105', '9213209', '9209707', '9269782', 

retrieving reviews: 100%|██████████| 4375/4375 [1:23:18<00:00,  1.14s/review]


From the movie 3783958 were retrieved the following reviews:
{'4027807', '3709335', '6633178', '3755401', '3752426', '3700429', '9527639', '3628244', '3654436', '7777171', '5036892', '3647629', '5963724', '4579697', '4539384', '3680171', '3626662', '3835359', '3778899', '3685098', '3666500', '3620981', '3623199', '3631197', '5034988', '6120908', '3628581', '3633429', '3614980', '3668763', '3619455', '3630003', '9675288', '5590831', '3687946', '10042315', '8122073', '3604887', '3621988', '3642683', '6407423', '9617683', '7435811', '6409006', '4130451', '3618245', '4689722', '3687187', '5609906', '5688844', '5826814', '8390916', '4276104', '3603453', '3701680', '7545246', '3611071', '3616424', '3771603', '7845270', '3626924', '6900336', '3628872', '3842139', '4683785', '3661208', '3603788', '3603098', '8761573', '3593980', '4431857', '3629518', '5952086', '7324770', '3623297', '3642173', '3702678', '3760799', '3625552', '3598330', '7146013', '3624419', '6593846', '3596091', '6531812', '5

retrieving reviews: 100%|██████████| 2369/2369 [45:41<00:00,  1.16s/review] 


From the movie 0082971 were retrieved the following reviews:
{'0188765', '3070630', '0188569', '5237029', '5664690', '2799283', '8784007', '8028775', '3315668', '0188624', '0188815', '3461143', '7922062', '2230341', '1531992', '1871988', '0188819', '7218640', '1039445', '2621203', '1596748', '0188652', '0188887', '5645826', '3291347', '0188693', '5064327', '0994119', '1004318', '2667166', '1241513', '0188821', '5406308', '9139950', '2586170', '1343514', '1365387', '8640280', '1914944', '2012732', '3634935', '0947804', '0188942', '9672754', '0188633', '4338320', '4193071', '2305129', '1885047', '9290507', '3064519', '9187382', '0188882', '2336982', '8784737', '0188728', '0188897', '8196795', '9482188', '8758251', '7093438', '0188775', '1896009', '8960640', '5231845', '10262201', '0188933', '0188764', '5166508', '1520670', '2592309', '9124506', '4950705', '9702747', '6822073', '0188883', '3829289', '4694758', '0188864', '1950832', '8376660', '0188595', '5652261', '0188679', '1762865', '2

retrieving reviews: 100%|██████████| 1197/1197 [23:15<00:00,  1.17s/review]

     Review_ID   Movie_ID              Movie_Title  Rating        Review_Date  \
0      9637661  tt6751668                 Parasite     5.0   23 February 2024   
1      5510542  tt6751668                 Parasite    10.0   26 February 2020   
2      5182892  tt6751668                 Parasite    10.0    12 October 2019   
3      5499682  tt6751668                 Parasite     9.0   21 February 2020   
4      6094155  tt6751668                 Parasite     8.0  14 September 2020   
...        ...        ...                      ...     ...                ...   
1192   7751614  tt0082971  Raiders of the Lost Ark    10.0    13 January 2022   
1193   1839402  tt0082971  Raiders of the Lost Ark    10.0      13 March 2008   
1194   0188748  tt0082971  Raiders of the Lost Ark    10.0      29 March 1999   
1195   3751186  tt0082971  Raiders of the Lost Ark     9.0       10 July 2017   
1196   0188782  tt0082971  Raiders of the Lost Ark    10.0    20 October 1999   
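
After a long retrieval run, it can be useful to confirm how many reviews were collected per movie and whether any review appears twice. The sketch below shows one way to do this on a toy DataFrame with made-up values. It assumes only the `Movie_Title` and `Review_ID` columns described in the overview.

```python
import pandas as pd

# Toy data, for illustration only
toy = pd.DataFrame({
    "Movie_Title": ["Film A", "Film A", "Film B"],
    "Review_ID": ["0000001", "0000002", "0000003"],
})

# Number of distinct reviews per movie
counts = toy.groupby("Movie_Title")["Review_ID"].nunique()

# True if any Review_ID occurs more than once
has_duplicates = bool(toy["Review_ID"].duplicated().any())

print(counts.to_dict())
print(has_duplicates)
```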

                           




### Saving Other Movies Reviews Dataset

This cell saves the collected reviews of the selected non–Star Wars movies into a `.pkl` file using the `pickle` module.  
The resulting DataFrame (`reviews_df_other`) is stored at the specified path (`../Dataset/others_reviews.pkl`) for future access and analysis.


In [None]:
# File path
file_path = "../Dataset/others_reviews.pkl"

# Save data
with open(file_path, "wb") as file:
    pickle.dump(reviews_df_other, file)

print(f"Data stored in: {file_path}")

Data stored in: ../Dataset/others_reviews.pkl


### Loading and Displaying the Other Movies Reviews Dataset

This test cell loads the saved dataset containing reviews of the selected movies from the `.pkl` file using the `pickle` module.  
After loading, the first rows of the DataFrame (`reviews_df_other`) are displayed to confirm that the data has been correctly stored and retrieved.


In [12]:
# File path
file_path = "../Dataset/others_reviews.pkl"

# Load data
with open(file_path, 'rb') as file:
    reviews_df_other = pickle.load(file)
    
# Display the first few rows of the DataFrame
reviews_df_other.head()

| | Review_ID | Movie_ID | Movie_Title | Rating | Review_Date | Review_Title | Review_Text | Helpful_Votes | Total_Votes |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 9637661 | tt6751668 | Parasite | 5.0 | 23 February 2024 | Solid Film Craftsmanship, Trash Story | I'm genuinely baffled this film won not only b... | 3.0 | 8.0 |
| 1 | 5510542 | tt6751668 | Parasite | 10.0 | 26 February 2020 | MASTERPIECE | Just watch it. It has everything; entertainmen... | 3.0 | 5.0 |
| 2 | 5182892 | tt6751668 | Parasite | 10.0 | 12 October 2019 | First Hit: I really enjoyed this story as it d... | First Hit: I really enjoyed this story as it d... | 24.0 | 40.0 |
| 3 | 5499682 | tt6751668 | Parasite | 9.0 | 21 February 2020 | If you love cliché stories this movie is not f... | I was not expecting that much of this movie. N... | 2.0 | 5.0 |
| 4 | 6094155 | tt6751668 | Parasite | 8.0 | 14 September 2020 | Amazing. | Good acting, cinematography, twists and screen... | 0.0 | 0.0 |
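
The `Review_Date` values shown above are plain strings such as `23 February 2024`. For later analysis it may help to parse them into timestamps. The sketch below assumes the day, full English month name, year layout seen in the displayed rows (format `%d %B %Y`) and uses made-up input, including one deliberately invalid entry.

```python
import pandas as pd

# Toy input in the same layout as the Review_Date column, plus one invalid value
dates = pd.Series(["23 February 2024", "12 October 2019", "not a date"])

# errors="coerce" turns unparseable strings into NaT instead of raising
parsed = pd.to_datetime(dates, format="%d %B %Y", errors="coerce")

print(parsed)
```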


### Accessing a Specific Review by ID

This cell demonstrates how to access the full text of a specific review using its `Review_ID`.  
The `.loc` indexer is used with a boolean mask to filter the DataFrame and retrieve the corresponding `Review_Text` for a given review.

In [7]:
# To access a specific review, filter on Review_ID with the .loc indexer

specific_text = reviews_df_other.loc[reviews_df_other['Review_ID'] == '9637661', 'Review_Text'].values[0]
print(specific_text)

I'm genuinely baffled this film won not only best foreign film, best directing and best screenwriting, but also Best Picture (historically reserved for American films only). Of all the films to break the barrier and be the first, they chose THIS? When will all this self loathing end? I guess never because as long as humans are alive there will always be hordes of abysmally depressed people who think the more they hate humanity and the ways of the world they more self righteous they'll feel about "calling it out". What a joke. While the film is extremely well made its story is soulless and not as intelligent of a critique of capitalism, egalitarianism, and meritocracy as the snobs make it seem. At this point it's the ultimate cliche. It play more like a weak melodrama/soft noir movie. It's just a weird film that meanders and is only occasionally entertaining. Some good humor and interesting situations that hold you in brief suspense. Other than that the only interesting parts are seeing

## Splitting Movie Reviews by Title


This section loads the two datasets: one containing reviews of the *Star Wars* films (`sw_reviews.pkl`) and one containing reviews of the other selected movies (`others_reviews.pkl`).  
After loading, the unique movie titles in each dataset are inspected. The datasets are then split into separate `.pkl` files, each containing the reviews for a single film.  
The files are saved in the `Dataset/Reviews_By_Movie` folder under consistent, descriptive names to simplify later access and analysis.
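
The inspect-then-split workflow described here can be sketched on toy data as follows. This is an illustration rather than the notebook's own code: it uses `groupby` instead of a boolean mask per title, writes to a temporary directory, and the titles and filenames in `toy_mapping` are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy data, for illustration only
toy = pd.DataFrame({
    "Movie_Title": ["Film A", "Film B", "Film A"],
    "Review_Text": ["made-up text 1", "made-up text 2", "made-up text 3"],
})
toy_mapping = {"Film A": "FilmA.pkl", "Film B": "FilmB.pkl"}  # hypothetical names

# Inspect the unique titles before splitting
titles = toy["Movie_Title"].unique().tolist()

with tempfile.TemporaryDirectory() as out_dir:
    # One pickle file per movie title
    for title, subset in toy.groupby("Movie_Title"):
        subset.to_pickle(os.path.join(out_dir, toy_mapping[title]))
    written = sorted(os.listdir(out_dir))
    film_a = pd.read_pickle(os.path.join(out_dir, "FilmA.pkl"))

print(titles)
print(written)
print(len(film_a))
```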

### Defining Output Filenames for Each Movie

This cell defines the manual mapping between movie titles and the desired output filenames. Both *Star Wars* and other selected films are assigned explicit filenames for clarity and consistency.

In [1]:
# Manual mapping of movie titles to output filenames

file_mapping = {
    # === Star Wars Episodes ===
    "Star Wars: Episode I - The Phantom Menace": "SW_Episode1.pkl",
    "Star Wars: Episode II - Attack of the Clones": "SW_Episode2.pkl",
    "Star Wars: Episode III - Revenge of the Sith": "SW_Episode3.pkl",
    "Star Wars: Episode IV - A New Hope": "SW_Episode4.pkl",
    "Star Wars: Episode V - The Empire Strikes Back": "SW_Episode5.pkl",
    "Star Wars: Episode VI - Return of the Jedi": "SW_Episode6.pkl",
    "Star Wars: Episode VII - The Force Awakens": "SW_Episode7.pkl",
    "Star Wars: Episode VIII - The Last Jedi": "SW_Episode8.pkl",
    "Star Wars: Episode IX - The Rise of Skywalker": "SW_Episode9.pkl",
    
    # === Other Movies ===
    "Harry Potter and the Sorcerer's Stone": "HarryPotter.pkl",
    "Raiders of the Lost Ark": "IndianaJones.pkl",
    "La La Land": "LaLaLand.pkl",
    "Parasite": "Parasite.pkl",
    "The Good, the Bad and the Ugly": "GoodBadUgly.pkl",
    "Oppenheimer": "Oppenheimer.pkl"
}
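Because the mapping is written by hand, a quick sanity check can catch two easy mistakes: the same filename assigned to two titles (the later file would overwrite the earlier one) and a missing `.pkl` extension. The sketch below runs on a shortened copy of the mapping. The helper `check_mapping` is hypothetical and not part of the notebook.

```python
from collections import Counter

# Shortened copy of the mapping, for illustration only
example_mapping = {
    "Parasite": "Parasite.pkl",
    "La La Land": "LaLaLand.pkl",
    "Oppenheimer": "Oppenheimer.pkl",
}

def check_mapping(mapping):
    """Return filenames that are duplicated or lack the .pkl extension."""
    counts = Counter(mapping.values())
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    bad_extension = sorted(
        name for name in mapping.values() if not name.endswith(".pkl")
    )
    return duplicates, bad_extension

duplicates, bad_extension = check_mapping(example_mapping)
print(duplicates, bad_extension)
```

Both lists are empty for a well-formed mapping.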

### Splitting Datasets and Saving by Movie Title

This cell loads the full review datasets, combines them, filters the reviews for each title in `file_mapping`, and saves each subset as an individual `.pkl` file in the `Dataset/Reviews_By_Movie` directory. Titles with no matching reviews are skipped with a warning.

In [None]:
import os
import pandas as pd

# Load datasets
sw_df = pd.read_pickle("../Dataset/sw_reviews.pkl")
others_df = pd.read_pickle("../Dataset/others_reviews.pkl")

# Combine the two DataFrames for unified processing
combined_df = pd.concat([sw_df, others_df], ignore_index=True)

# Output directory
output_dir = "../Dataset/Reviews_By_Movie"
os.makedirs(output_dir, exist_ok=True)

# Iterate over the mapping and save subsets
for title, filename in file_mapping.items():
    subset = combined_df[combined_df["Movie_Title"] == title]
    if not subset.empty:
        subset.to_pickle(os.path.join(output_dir, filename))
    else:
        print(f"Warning: No reviews found for '{title}', skipping file {filename}")

print("All movie-specific review files successfully saved to the 'Dataset/Reviews_By_Movie' folder.")

All movie-specific review files successfully saved to the 'Dataset/Reviews_By_Movie' folder.
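The final message is printed even when some titles are skipped, so it can be useful to read the files back and compare row counts. The sketch below repeats the same split-and-save logic on made-up data inside a temporary directory. `toy_combined` and `toy_mapping` are illustrative stand-ins for `combined_df` and `file_mapping`.

```python
import os
import tempfile

import pandas as pd

# Toy stand-ins for combined_df and file_mapping (made-up rows)
toy_combined = pd.DataFrame({
    "Movie_Title": ["Parasite", "Parasite", "La La Land"],
    "Review_Text": ["a", "b", "c"],
})
toy_mapping = {"Parasite": "Parasite.pkl", "La La Land": "LaLaLand.pkl"}

with tempfile.TemporaryDirectory() as tmp_dir:
    # Same split-and-save logic as the cell above, on the toy data
    for title, filename in toy_mapping.items():
        subset = toy_combined[toy_combined["Movie_Title"] == title]
        subset.to_pickle(os.path.join(tmp_dir, filename))

    # Read every file back and count its rows
    saved_counts = {
        filename: len(pd.read_pickle(os.path.join(tmp_dir, filename)))
        for filename in toy_mapping.values()
    }

print(saved_counts)
# True when every row of the combined frame landed in some file
print(sum(saved_counts.values()) == len(toy_combined))
```

If the second check prints `False` on real data, some reviews carry a title that is not a key of the mapping.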


## Retrieve Keywords from all Movies

This section focuses on extracting the most relevant keywords from the reviews of each movie in the dataset.  
The goal is to generate a structured collection of representative keywords per film, which will then be used to build a **Validation/Test dataset** for evaluating the performance of topic modeling algorithms.  
These keywords serve as a reference for assessing the coherence and relevance of the topics extracted from the review texts.

### IMDb Movie ID Lists

This cell defines two lists of IMDb movie IDs.  
The first list contains the IDs of the nine *Star Wars* episodes, while the second includes the other six selected movies.  

In [None]:
# STAR WARS MOVIES
list_of_movies_ids_sw = ['0076759', # Star Wars: Episode IV - A New Hope
                         '0080684', # Star Wars: Episode V - The Empire Strikes Back
                         '0086190', # Star Wars: Episode VI - Return of the Jedi
                         '0120915', # Star Wars: Episode I - The Phantom Menace
                         '0121765', # Star Wars: Episode II - Attack of the Clones
                         '0121766', # Star Wars: Episode III - Revenge of the Sith
                         '2488496', # Star Wars: Episode VII - The Force Awakens
                         '2527336', # Star Wars: Episode VIII - The Last Jedi
                         '2527338'] # Star Wars: Episode IX - The Rise of Skywalker

# OTHER MOVIES
list_of_movies_ids_others = ['6751668', # Parasite
                             '0060196', # The Good, the Bad and the Ugly
                             '0241527', # Harry Potter and the Sorcerer's Stone
                             '15398776', # Oppenheimer
                             '3783958', # La La Land
                             '0082971'] # Raiders of the Lost Ark
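The IDs above are bare digit strings, while the keywords dataset displayed later in this notebook shows them with a `tt` prefix (for example `tt0076759`). Whether `Retriever` adds the prefix itself is not shown here. The sketch below is one possible way to normalize both forms, and the helper name `to_tt_id` is hypothetical.

```python
def to_tt_id(movie_id):
    """Return the ID with a single leading 'tt', e.g. '0076759' -> 'tt0076759'."""
    movie_id = str(movie_id).strip()
    return movie_id if movie_id.startswith("tt") else f"tt{movie_id}"

example_ids = ["0076759", "tt15398776"]
print([to_tt_id(m) for m in example_ids])
```

Keeping the IDs as strings, as the lists above do, preserves the leading zeros that an integer conversion would drop.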

### Extracting Keywords for Validation

This cell initializes the `Retriever` instance and iterates over all selected IMDb movie IDs (both *Star Wars* and other films) to extract associated keywords.  
For each movie, the method `get_movie_keywords_from_movie_id()` is used to retrieve keywords, which are then aggregated into a single DataFrame (`keywords_df`).  

In [None]:
# Initialize the Retriever instance
retriever = Retriever()

# Combine both lists of movie IDs
all_movies_ids = list_of_movies_ids_sw + list_of_movies_ids_others

# Start with an empty DataFrame
keywords_df = pd.DataFrame()

# Loop through each movie ID and retrieve keywords
for movie_id in all_movies_ids:

    # Print the movie ID being processed
    print('Retrieving keywords from the movie ' + str(movie_id))

    # Retrieve keywords as a DataFrame and concatenate it to the main DataFrame
    temp_df = retriever.get_movie_keywords_from_movie_id(movie_id)
    keywords_df = pd.concat([keywords_df,temp_df], ignore_index=False)

print(keywords_df)

Retrieving keywords from the movie 0076759
Retrieving keywords from the movie 0080684
Retrieving keywords from the movie 0086190
Retrieving keywords from the movie 0120915
Retrieving keywords from the movie 0121765
Retrieving keywords from the movie 0121766
Retrieving keywords from the movie 2488496
Retrieving keywords from the movie 2527336
Retrieving keywords from the movie 2527338
Retrieving keywords from the movie 6751668
Retrieving keywords from the movie 0060196
Retrieving keywords from the movie 0241527
Retrieving keywords from the movie 15398776
Retrieving keywords from the movie 3783958
Retrieving keywords from the movie 0082971
    Movie_ID               Keyword  Helpful  Not_Helpful
0    0076759             rebellion       15            0
1    0076759              princess       12            0
2    0076759           space opera       11            0
3    0076759      good versus evil       10            0
4    0076759                 droid        9            0
..       ...
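Concatenating inside the loop copies the growing DataFrame on every iteration. A common alternative is to collect one DataFrame per movie in a list and concatenate once at the end, as sketched below. `fake_get_keywords` is a hypothetical stand-in that returns made-up rows. It does not reproduce the behaviour of `Retriever.get_movie_keywords_from_movie_id()`, and the IDs are invented.

```python
import pandas as pd

def fake_get_keywords(movie_id):
    """Hypothetical stand-in for the retriever call; returns made-up rows."""
    return pd.DataFrame({
        "Movie_ID": [movie_id, movie_id],
        "Keyword": ["example keyword a", "example keyword b"],
        "Helpful": [2, 1],
        "Not_Helpful": [0, 0],
    })

example_ids = ["0000001", "0000002"]

# Collect one DataFrame per movie, then concatenate once at the end
frames = [fake_get_keywords(movie_id) for movie_id in example_ids]
toy_keywords_df = pd.concat(frames, ignore_index=True)

print(toy_keywords_df)
```

With `ignore_index=True` the result gets a single running row index. The cell above uses `ignore_index=False`, which keeps each movie's own index, so index values repeat across movies.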

### Saving the Keywords Dataset

This cell saves the extracted keywords from all movies into a `.pkl` file using the `pickle` module.  
The resulting file (`keywords_ground_truth.pkl`) is stored in the `Dataset` folder and will be used as a **Validation/Test set** for topic modeling evaluation.

In [None]:
# File path
file_path = "../Dataset/keywords_ground_truth.pkl"

# Save data
with open(file_path, "wb") as file:
    pickle.dump(keywords_df, file)

print(f"Data stored in: {file_path}")

Data stored in: ../Dataset/keywords_ground_truth.pkl
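For DataFrames, pandas also offers `DataFrame.to_pickle` and `pd.read_pickle`, which the splitting step earlier in this notebook already uses. The sketch below shows a save-and-reload round trip on made-up data in a temporary directory. The file name and rows are illustrative only.

```python
import os
import tempfile

import pandas as pd

# Made-up single-row frame with the same columns as keywords_df
toy_keywords = pd.DataFrame({
    "Movie_ID": ["0000001"],
    "Keyword": ["example keyword"],
    "Helpful": [1],
    "Not_Helpful": [0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_keywords.pkl")
    toy_keywords.to_pickle(path)      # pandas wrapper around pickle
    reloaded = pd.read_pickle(path)

# Raises AssertionError if the reloaded frame differs from the original
pd.testing.assert_frame_equal(toy_keywords, reloaded)
print("Round trip OK")
```

Either approach writes a pickle file, so the choice is mostly about keeping the notebook consistent.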


### Loading and Displaying the Keywords Dataset

This test cell loads the previously saved `keywords_ground_truth.pkl` file containing movie-specific keywords using the `pickle` module. The loaded DataFrame (`keywords_df`) is then displayed to verify that the data has been correctly stored and retrieved.

In [None]:
import pickle
import pandas as pd

# File path
file_path = "../Dataset/keywords_ground_truth.pkl"

# Load data
with open(file_path, 'rb') as file:
    keywords_df = pickle.load(file)

# Total number of keywords
total_keywords = len(keywords_df)
print(f"Total number of keywords: {total_keywords}\n")

# Display the first few rows
display(keywords_df.head())

# Count number of keywords per movie
print("\nNumber of keywords per movie:")
keywords_per_movie = keywords_df.groupby("Movie_ID").size()
print(keywords_per_movie)


Total number of keywords: 5617



    Movie_ID           Keyword  Helpful  Not_Helpful
0  tt0076759         rebellion       15            0
1  tt0076759          princess       12            0
2  tt0076759       space opera       11            0
3  tt0076759  good versus evil       10            0
4  tt0076759             droid        9            0



Number of keywords per movie:
Movie_ID
tt0060196     273
tt0076759     347
tt0080684     291
tt0082971     220
tt0086190     223
tt0120915     410
tt0121765     386
tt0121766     477
tt0241527     400
tt15398776    325
tt2488496     515
tt2527336     546
tt2527338     598
tt3783958     290
tt6751668     316
dtype: int64
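The per-movie counts printed above add up to 5617, which matches the reported total. The same check can be automated, together with a test that every requested movie returned at least one keyword. The sketch below uses made-up IDs and rows, and `expected_ids` is an illustrative stand-in for the full list of requested IDs.

```python
import pandas as pd

# Made-up keyword rows for two movies
toy_keywords = pd.DataFrame({
    "Movie_ID": ["tt0000001", "tt0000001", "tt0000002"],
    "Keyword": ["a", "b", "c"],
})
expected_ids = {"tt0000001", "tt0000002", "tt0000003"}  # made-up IDs

per_movie = toy_keywords.groupby("Movie_ID").size()

# Per-movie counts must add up to the total number of rows
counts_consistent = bool(per_movie.sum() == len(toy_keywords))

# IDs that were requested but returned no keywords at all
missing_ids = sorted(expected_ids - set(per_movie.index))

print(counts_consistent, missing_ids)
```

A non-empty `missing_ids` list on the real data would point to a movie whose keyword retrieval failed silently.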
