## IMDB Classification Project

Project Objective: To correctly classify the genre of each film
- Part 1: Ingest, Preprocess, Visualize, & Save (Processed) Dataset (Exploratory Data Analysis)
- Part 2: Train & Evaluate a Classifier on the Processed Dataset

Dataset Source: https://www.kaggle.com/datasets/hijest/genre-classification-dataset-imdb

##### Import Necessary Libraries

In [0]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
from sklearn.metrics import classification_report, accuracy_score
import pyspark.sql.functions as F

##### Start Spark NLP Session

In [0]:
spark = sparknlp.start()

##### Data Ingestion

In [0]:
file_location = "/FileStore/tables/prepped_imdb_ds/part-00000-tid-8800166687608654366-d1d8971a-b0cc-44cc-86e4-24dd5bb81c1b-7881-1-c000.csv"
file_type = "csv"

# CSV options
infer_schema = "false"
first_row_is_header = "true"
delimiter = "\t"

# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type)\
  .option("inferSchema", infer_schema)\
  .option("header", first_row_is_header)\
  .option("sep", delimiter)\
  .load(file_location)

display(df)

label,text
drama,"Night Call (2016) : Simon's world is turned upside down when his little girl Katie is abducted during a family day out. After weeks of searching and appeals Simon decides to punish himself by offering a young prostitute with mounting debts, life changing money to end his life."
comedy,"Tunnel Vision (1976) : A committee investigating TV's first uncensored network examines a typical day's programming, which includes shows, commercials, news programs, you name it. What they discover will surely crack you up! This outrageous and irreverent spoof of television launched the careers of some of the greatest comedians of all time. This 1976 film tries to predict what American television will be like in the year 1985. Tunnelvision is America's first ""uncensored and free"" television network. Although wildly popular, it is also blamed for increased crime and unemployment. Christian A. Broder, president and founder of Tunnelvision, is called to defend his network in front of a Senate sub-committee. The sub-committee decides to view excerpts from a ""typical"" day of Tunnelvision broadcasting. What follows is a series of brief skits lampooning television, including cop shows, news broadcasts, situation comedies, and (of course) commercials."
documentary,"7 días con Alberto Corazón (2015) : The objective of this documentary is to convey in an exciting way who is Alberto Corazón. His work, his career, his creative process, his contribution to Spanish culture. His personality, his character and his ideas. Its contribution to the Spanish culture - especially the visual culture - through the recognition of marks and symbols, corporate identities, industrial designs, objects and signs. His creative process, showing closely and transparently how he create. His trajectory, linked to the recent history of Spain. His work in the broadest sense, including editorial design, posters, wall paintings, sundials, objects, paintings, sculptures, books. His personality, character and ideas, respecting the space of intimacy."
comedy,"""The Young Professionals"" (2015) : Whether it's blocking up mouse holes, running from Landlords or making puppet shows in the bath, it's never a dull moment for The Young Professionals. Desperate to break into the online world and escape the terrors of temping, Natalie presents the lives of six housemates struggling to get on the career ladder after uni and pay their rent on time. Which is all helped along with Keara - the one with the 'real' job."
documentary,"The End of Ageing (2010) : All over the world, human beings are living longer than ever before. This is due to many factors, including improved living conditions, lifestyle choices and medical advancements. While there is not a single cause, a growing community of scientists are pushing the limits of life expectancy. In the not-too-distant future, they may even be able to halt ageing altogether. As the world's population continues to live longer, our current economic systems will no longer be sustainable. Health care systems, and the economies that fund them, need to make major changes. Because a growing number of people are healthy enough continue to work and play, we will need to reevaluate the nature of employment and recreation."
short,"Begegnung mit Fritz Lang (1964) : Interview with Fritz Lang on the roof of Villa Malaparte on Capri during the filming of the fictitious film ""Odysseus"" and the filming of ""Le mépris"" by Jean-Luc Godard, in which Fritz Lang plays the role of an old film director called Lang. Interview with Fritz Lang on the roof of Villa Malaparte on Capri during the filming of the fictitious film ""Odysseus"" and the filming of ""Le mépris"" by Jean-Luc Godard, in which Fritz Lang plays the role of an old film director called Lang. During the interview, excerpts from the long films ""The Nibelungen"", ""The Tired Death"" and ""M"" are shown."
documentary,"Race Across America: Push Beyond (2017) : Marshall Nord is a 49-year-old executive, Father, husband and amateur endurance athlete. He has 30 years of multi-sport experience behind him, having run marathons, Iron Mans, triathlons, you name it. But with the big 5-0 looming, he is desperate to achieve something truly epic. And what is more epic than taking on the toughest ultra-endurance cycling event in the world? The Race Across America is a 3,070 mile bicycle race from the West to East of the USA, that must be completed in under 12 days. With a small support crew made up of family and friends but not a minute of RAAM experience between them, the odds are stacked against him. Marshall is determined to push beyond in order to achieve his goal however, no matter what obstacles the race, rules or road can throw in his way."
comedy,"""George & Leo"" (1997) : George Stoody is a mild-mannered bookstore owner who encounters a hoodlum/magician named Leo Wagonman, the estranged father of his new daughter-in-law Casey. Leo, on the run from a mob intent on collecting the payoff money Leo stole from a Las Vegas casino, decides to stay in the spare room above George's bookstore."
short,"La capsula (2009) : Italian millionaire Moritz Craffonara dreams about 'a bed under the stars'. He asks his friend Ross Lovegrove for help and the British designer constructs something that nobody thought was possible: a floating space capsule on top of a 2100 meter mountain in the heart of the Alps. The project was named ""Alpine Capsule"" and gained media attention all over the world. But eventually it is something way bigger: It became Moritz' monument. A documentary about the limits of modern technology and the human fear of being forgotten."
drama,"Herr über Leben und Tod (1955) : Barbara is married to Georg Bertram, a professor of medicine who once saved her father's life. Things go awry for the couple when she gives birth to a mentally defective child. For Georg, coldly clinical, euthanizing the infant is the only way out. He is about to commit the irreparable when Barbara manages to interrupt his fatal act. By mutual agreement, husband and wife decide that Barbara will go to Saint-Guénolé in Brittany, where she and their son will be cared for by Louise Kerbrec, Georg's former nurse, in the hypothetical hope that the boy's condition will improve. What they do not know yet is that Barbara will meet there another doctor, Daniel Karentis, much more sympathetic than Georg and also much handsomer..."


##### Split Dataset into Training & Testing Datasets (50/50)

In [0]:
train_ds, test_ds = df.randomSplit(weights=[0.50, 0.50], seed=42)
print(train_ds.count())
print(test_ds.count())

5059
4999
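
The two counts above are close to, but not exactly, half of the 10,058 rows. That is expected if `randomSplit` assigns each row to a split by an independent random draw rather than cutting the data at a fixed index. Below is a minimal pure-Python sketch of that idea, assuming a per-row uniform draw. The helper `bernoulli_split` is hypothetical and is not Spark's implementation, so its counts will not match the ones printed above.

```python
import random


def bernoulli_split(rows, train_fraction=0.5, seed=42):
    """Hypothetical helper: send each row to train or test by an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # Each row lands in train with probability train_fraction
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


rows = list(range(10058))  # same total as the two split counts above combined
train_rows, test_rows = bernoulli_split(rows)

# Sizes are near half of the total, but usually not exactly equal
print(len(train_rows), len(test_rows))
```

If exact split sizes or per-genre balance matter, a different strategy (for example, sampling within each label) would be needed; the notebook does not do this.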


##### Basic Values/Constants

In [0]:
NUM_OF_EPOCHS = 2
BATCH_SIZE = 64
LR = 3e-3
VERBOSITY_LEVEL = 1
MAX_LENGTH = 188

##### Define Pipeline Stages & Pipeline

In [0]:
# Document Assembler
doc = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Tokenizer
tokenizer = Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")

# DistilBert Embeddings
bert_embeds = DistilBertEmbeddings.pretrained()\
    .setInputCols(["document", "token"])\
    .setOutputCol("embeds")\
    .setMaxSentenceLength(MAX_LENGTH)

# Sentence Embeddings
sent_embeds = SentenceEmbeddings()\
    .setInputCols(["document", "embeds"])\
    .setOutputCol("sent_embeds")\
    .setPoolingStrategy("AVERAGE")

# Classifier (ClassifierDL), trained on the pooled sentence embeddings
clf = ClassifierDLApproach()\
    .setInputCols(["sent_embeds"])\
    .setOutputCol("class")\
    .setLabelColumn("label")\
    .setBatchSize(BATCH_SIZE)\
    .setLr(LR)\
    .setMaxEpochs(NUM_OF_EPOCHS)\
    .setVerbose(VERBOSITY_LEVEL)

# put pipeline together
nlp_clf_pipeline = Pipeline().setStages([
    doc, 
    tokenizer, 
    bert_embeds, 
    sent_embeds, 
    clf
])

distilbert_base_cased download started this may take some time.
Approximate size to download 232.7 MB
[ | ][OK!]
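
`SentenceEmbeddings` with the `AVERAGE` pooling strategy collapses the per-token vectors of a document into a single vector by taking their element-wise mean. The following is a small pure-Python sketch of that operation using made-up 3-dimensional token vectors (real DistilBERT vectors are much wider). `average_pool` is a hypothetical helper for illustration, not a Spark NLP function.

```python
def average_pool(token_vectors):
    """Element-wise mean of a list of equal-length token vectors."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


# Three made-up token vectors standing in for one document's token embeddings
token_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 1.0],
]

doc_vector = average_pool(token_vectors)
print(doc_vector)  # [2.0, 2.0, 1.0]
```

The classifier stage then sees one fixed-size vector per document, regardless of how many tokens the description contains.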


##### Fit/Train Model

In [0]:
imdb_model = nlp_clf_pipeline.fit(train_ds)

##### Inference: Predict Values Based on Test Dataset

In [0]:
preds = imdb_model.transform(test_ds)

##### Convert Relevant Columns to a Pandas DataFrame

In [0]:
preds_in_pandas = preds.select(
    F.col("text"),
    F.col("label").alias("ground_truth"),
    F.col("class.result").alias("prediction"),
).toPandas()

##### Display Classification Report

In [0]:
# 'class.result' is an array column; take its first element as the predicted label
preds_in_pandas['prediction'] = preds_in_pandas['prediction'].apply(lambda x: x[0])

report = classification_report(preds_in_pandas['ground_truth'], preds_in_pandas['prediction'])

print(report)

              precision    recall  f1-score   support

      comedy       0.62      0.77      0.69      1203
 documentary       0.69      0.86      0.77      1287
       drama       0.64      0.57      0.60      1233
       short       0.68      0.43      0.53      1276

    accuracy                           0.66      4999
   macro avg       0.66      0.66      0.65      4999
weighted avg       0.66      0.66      0.65      4999
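
The summary rows of the report can be cross-checked against the per-class rows. The sketch below copies the rounded figures from the report above, recomputes F1 as the harmonic mean of precision and recall, and recomputes the macro (unweighted) and support-weighted averages. Because the inputs are already rounded to two decimals, the results are only expected to agree with the report to within rounding. The names `report_rows`, `macro_avg`, and `weighted_avg` are illustrative.

```python
# Per-class rows copied from the report above: (precision, recall, f1, support)
report_rows = {
    "comedy":      (0.62, 0.77, 0.69, 1203),
    "documentary": (0.69, 0.86, 0.77, 1287),
    "drama":       (0.64, 0.57, 0.60, 1233),
    "short":       (0.68, 0.43, 0.53, 1276),
}

total_support = sum(row[3] for row in report_rows.values())

# F1 is the harmonic mean of precision and recall
f1_from_pr = {
    label: 2 * p * r / (p + r)
    for label, (p, r, _f1, _support) in report_rows.items()
}


def macro_avg(metric_index):
    """Unweighted mean of one metric column across classes."""
    return sum(row[metric_index] for row in report_rows.values()) / len(report_rows)


def weighted_avg(metric_index):
    """Mean of one metric column, weighted by each class's support."""
    return sum(row[metric_index] * row[3] for row in report_rows.values()) / total_support


print("total support:", total_support)
for label, f1 in f1_from_pr.items():
    print(f"{label:12s} recomputed f1 = {f1:.3f}")
print(f"macro recall    = {macro_avg(1):.3f}")
print(f"weighted recall = {weighted_avg(1):.3f}")
```

The support-weighted recall is, by definition, the overall accuracy, which is why both appear as 0.66 in the report. The per-class rows also show where the model is weakest: `short` has the lowest recall (0.43), so more than half of the short films in the test set were assigned another genre.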



##### End/Stop Spark Session

In [0]:
spark.stop()