## IMDB Classification Project

Project Objective: Classify the genre of each film from its title and description
- Part 1: Ingest, Preprocess, Visualize, & Save (Processed) Dataset (Exploratory Data Analysis)
- Part 2: Train & Evaluate a Model on the Processed Dataset

Dataset Source: https://www.kaggle.com/datasets/hijest/genre-classification-dataset-imdb

##### Import Necessary Libraries

In [0]:
from pyspark.ml import Pipeline
import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField, StringType

##### Data Ingestion

In [0]:
file_location = "/FileStore/tables/imdb/train_data.txt"
file_type = "text"

# CSV options (unused here: spark.read.text() reads each line into a single `value` column)
infer_schema = "false"
first_row_is_header = "false"

df = spark.read.text(file_location)

display(df)

value
"1 ::: Oscar et la dame rose (2009) ::: drama ::: Listening in to a conversation between his doctor and parents, 10-year-old Oscar learns what nobody has the courage to tell him. He only has a few weeks to live. Furious, he refuses to speak to anyone except straight-talking Rose, the lady in pink he meets on the hospital stairs. As Christmas approaches, Rose uses her fantastical experiences as a professional wrestler, her imagination, wit and charm to allow Oscar to live life and love to the full, in the company of his friends Pop Corn, Einstein, Bacon and childhood sweetheart Peggy Blue."
2 ::: Cupid (1997) ::: thriller ::: A brother and sister with a past incestuous relationship have a current murderous relationship. He murders the women who reject him and she murders the women who get too close to him.
"3 ::: Young, Wild and Wonderful (1980) ::: adult ::: As the bus empties the students for their field trip to the Museum of Natural History, little does the tour guide suspect that the students are there for more than just another tour. First, during the lecture films, the coeds drift into dreams of the most erotic fantasies one can imagine. After the films, they release the emotion of the fantasies in the most erotic and uncommon ways. One slips off to the curator's office for a little ""acquisition. "" Another finds the anthropologist to see what bones can be identified. Even the head teacher isn't immune. Soon the tour is over, but as the bus departs, everyone admits it was quite an education."
"4 ::: The Secret Sin (1915) ::: drama ::: To help their unemployed father make ends meet, Edith and her twin sister Grace work as seamstresses . An invalid, Grace falls prey to the temptations of Chinatown opium and becomes an addict, a condition worsened by a misguided physician who prescribes morphine to ease her pain. When their father strikes oil, the family enjoys a new prosperity and the sisters meet the eligible Jack Herron, a fellow oil prospector. To Grace's shock, Jack falls in love with Edith and in her jealousy, Grace tells Jack that Edith, not she, has a drug problem. Hinting that her sister will soon need more morphine, Grace arranges for a dinner in Chinatown with the couple. While her sister and Jack dance, Grace slips away to an opium den. Edith follows her, but ends up in the wrong den and is arrested in an ensuing drug raid. After he bails her out of jail, Edith takes an angry Jack to search for Grace and stumbles across her half-conscious body lying in the street. The truth about the sisters is revealed, and after sending Grace to a sanitarium in the country, Jack and Edith are married."
"5 ::: The Unrecovered (2007) ::: drama ::: The film's title refers not only to the un-recovered bodies at ground zero, but also to the state of the nation at large. Set in the hallucinatory period of time between September 11 and Halloween of 2001, The Unrecovered examines the effect of terror on the average mind, the way a state of heightened anxiety and/or alertness can cause the average person to make the sort of imaginative connections that are normally made only by artists and conspiracy theorists-both of whom figure prominently in this film. The Unrecovered explores the way in which irony, empathy, and paranoia relate to one another in the wake of 9/11."
"6 ::: Quality Control (2011) ::: documentary ::: Quality Control consists of a series of 16mm single take shots filmed in the summer of 2010,over a two day period, in a dry cleaners facility in Pritchard, Alabama, near Mobile, Quality Control exhibits the acts as well the conditions around labor and showcases, in Everson's words ""the fine folks of Alabama producing a superior product."" It is similar stylistically, in form and rhythm, to certain scenarios in Everson's award-winning and critically acclaimed previous films, including Erie (IFFR 2010) and in thematic concerns to several other short form works which follow the daily, quotidian tasks of workers in rest and in motion, and is an oblique sequel, ten years hence, to Everson's Creative Capital granted project A Week in the Hole (2001), which focused on an employee's adjustment to materials, time, space and personnel. Quality Control consists of a series of 16mm single take shots, filmed over a two day period in the summer of 2010, in a dry cleaners facility in Pritchard, Alabama, near Mobile. Quality Control exhibits the acts as well the conditions around labor. It is similar stylistically, in form and rhythm, to certain scenarios in Everson's award-winning and critically acclaimed previous films, including Erie (IFFR 2010) and in thematic concerns to several other short form works which follow the daily, quotidian tasks of workers in rest and in motion, including the factory routine captured in the short film A Week in the Hole (2001), which focused on an employee's adjustment to materials, time, space and personnel. Principal cast includes Shay Wright and Annette Speight."
"7 ::: ""Pink Slip"" (2009) ::: comedy ::: In tough economic times Max and Joey have all but run out of ideas until, they discover that senior housing is cheap. Not only that but Max's aunt just kicked the bucket and no one knows yet. In a hilarious series that always keeps you on your toes, the two friends take us on a cross-dressing, desperate and endearing ride through being broke."
"8 ::: One Step Away (1985) ::: crime ::: Ron Petrie (Keanu Reeves) is a troubled teen whose life is hanging by a thread, as he's on the verge of suspension from school, subject to arrest for breaking and entering, and the cause of his single mother's impending eviction from her apartment. Unless he can find a resolution, his only option seems to be life of street crime."
"9 ::: ""Desperate Hours"" (2016) ::: reality-tv ::: A sudden calamitous event, causing great loss of life, damage, or hardship, like a flood, a tornado, an airplane crash, or an earthquake. This is not only a documentary but a live account of dramatic events in real time. In this unique 13- part series you'll be an eyewitness to some of the greatest disasters of the last 100 years and you will have a rare opportunity to compare disasters across time and distance and decide which you think is the worst."
"10 ::: Spirits (2014/I) ::: horror ::: Four high school students embark on a terrifying journey through ShadowView Manor 2 years after a horrifying séance gone wrong. Intern Raven, decides to reconnect with her elementary school friends Kota, William, and Jessica by bringing them to her new workplace, ShadowView Manor for a bit of paranormal investigating. Hearing more forbidden secrets from the night janitor sends them into a dark descending spiral of terror."


##### Distribute Features to Proper Columns

In [0]:
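# Split each line into at most 4 parts (id, title, genre, description);
# the limit keeps any ':::' that appears inside a description intact
split_col = F.split(df['value'], ':::', 4)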

df = df.withColumn("title", split_col.getItem(1))\
    .withColumn("genre", split_col.getItem(2))\
    .withColumn("description", split_col.getItem(3))

display(df)

value,title,genre,description
"1 ::: Oscar et la dame rose (2009) ::: drama ::: Listening in to a conversation between his doctor and parents, 10-year-old Oscar learns what nobody has the courage to tell him. He only has a few weeks to live. Furious, he refuses to speak to anyone except straight-talking Rose, the lady in pink he meets on the hospital stairs. As Christmas approaches, Rose uses her fantastical experiences as a professional wrestler, her imagination, wit and charm to allow Oscar to live life and love to the full, in the company of his friends Pop Corn, Einstein, Bacon and childhood sweetheart Peggy Blue.",Oscar et la dame rose (2009),drama,"Listening in to a conversation between his doctor and parents, 10-year-old Oscar learns what nobody has the courage to tell him. He only has a few weeks to live. Furious, he refuses to speak to anyone except straight-talking Rose, the lady in pink he meets on the hospital stairs. As Christmas approaches, Rose uses her fantastical experiences as a professional wrestler, her imagination, wit and charm to allow Oscar to live life and love to the full, in the company of his friends Pop Corn, Einstein, Bacon and childhood sweetheart Peggy Blue."
2 ::: Cupid (1997) ::: thriller ::: A brother and sister with a past incestuous relationship have a current murderous relationship. He murders the women who reject him and she murders the women who get too close to him.,Cupid (1997),thriller,A brother and sister with a past incestuous relationship have a current murderous relationship. He murders the women who reject him and she murders the women who get too close to him.
"3 ::: Young, Wild and Wonderful (1980) ::: adult ::: As the bus empties the students for their field trip to the Museum of Natural History, little does the tour guide suspect that the students are there for more than just another tour. First, during the lecture films, the coeds drift into dreams of the most erotic fantasies one can imagine. After the films, they release the emotion of the fantasies in the most erotic and uncommon ways. One slips off to the curator's office for a little ""acquisition. "" Another finds the anthropologist to see what bones can be identified. Even the head teacher isn't immune. Soon the tour is over, but as the bus departs, everyone admits it was quite an education.","Young, Wild and Wonderful (1980)",adult,"As the bus empties the students for their field trip to the Museum of Natural History, little does the tour guide suspect that the students are there for more than just another tour. First, during the lecture films, the coeds drift into dreams of the most erotic fantasies one can imagine. After the films, they release the emotion of the fantasies in the most erotic and uncommon ways. One slips off to the curator's office for a little ""acquisition. "" Another finds the anthropologist to see what bones can be identified. Even the head teacher isn't immune. Soon the tour is over, but as the bus departs, everyone admits it was quite an education."
"4 ::: The Secret Sin (1915) ::: drama ::: To help their unemployed father make ends meet, Edith and her twin sister Grace work as seamstresses . An invalid, Grace falls prey to the temptations of Chinatown opium and becomes an addict, a condition worsened by a misguided physician who prescribes morphine to ease her pain. When their father strikes oil, the family enjoys a new prosperity and the sisters meet the eligible Jack Herron, a fellow oil prospector. To Grace's shock, Jack falls in love with Edith and in her jealousy, Grace tells Jack that Edith, not she, has a drug problem. Hinting that her sister will soon need more morphine, Grace arranges for a dinner in Chinatown with the couple. While her sister and Jack dance, Grace slips away to an opium den. Edith follows her, but ends up in the wrong den and is arrested in an ensuing drug raid. After he bails her out of jail, Edith takes an angry Jack to search for Grace and stumbles across her half-conscious body lying in the street. The truth about the sisters is revealed, and after sending Grace to a sanitarium in the country, Jack and Edith are married.",The Secret Sin (1915),drama,"To help their unemployed father make ends meet, Edith and her twin sister Grace work as seamstresses . An invalid, Grace falls prey to the temptations of Chinatown opium and becomes an addict, a condition worsened by a misguided physician who prescribes morphine to ease her pain. When their father strikes oil, the family enjoys a new prosperity and the sisters meet the eligible Jack Herron, a fellow oil prospector. To Grace's shock, Jack falls in love with Edith and in her jealousy, Grace tells Jack that Edith, not she, has a drug problem. Hinting that her sister will soon need more morphine, Grace arranges for a dinner in Chinatown with the couple. While her sister and Jack dance, Grace slips away to an opium den. Edith follows her, but ends up in the wrong den and is arrested in an ensuing drug raid. 
After he bails her out of jail, Edith takes an angry Jack to search for Grace and stumbles across her half-conscious body lying in the street. The truth about the sisters is revealed, and after sending Grace to a sanitarium in the country, Jack and Edith are married."
"5 ::: The Unrecovered (2007) ::: drama ::: The film's title refers not only to the un-recovered bodies at ground zero, but also to the state of the nation at large. Set in the hallucinatory period of time between September 11 and Halloween of 2001, The Unrecovered examines the effect of terror on the average mind, the way a state of heightened anxiety and/or alertness can cause the average person to make the sort of imaginative connections that are normally made only by artists and conspiracy theorists-both of whom figure prominently in this film. The Unrecovered explores the way in which irony, empathy, and paranoia relate to one another in the wake of 9/11.",The Unrecovered (2007),drama,"The film's title refers not only to the un-recovered bodies at ground zero, but also to the state of the nation at large. Set in the hallucinatory period of time between September 11 and Halloween of 2001, The Unrecovered examines the effect of terror on the average mind, the way a state of heightened anxiety and/or alertness can cause the average person to make the sort of imaginative connections that are normally made only by artists and conspiracy theorists-both of whom figure prominently in this film. The Unrecovered explores the way in which irony, empathy, and paranoia relate to one another in the wake of 9/11."
"6 ::: Quality Control (2011) ::: documentary ::: Quality Control consists of a series of 16mm single take shots filmed in the summer of 2010,over a two day period, in a dry cleaners facility in Pritchard, Alabama, near Mobile, Quality Control exhibits the acts as well the conditions around labor and showcases, in Everson's words ""the fine folks of Alabama producing a superior product."" It is similar stylistically, in form and rhythm, to certain scenarios in Everson's award-winning and critically acclaimed previous films, including Erie (IFFR 2010) and in thematic concerns to several other short form works which follow the daily, quotidian tasks of workers in rest and in motion, and is an oblique sequel, ten years hence, to Everson's Creative Capital granted project A Week in the Hole (2001), which focused on an employee's adjustment to materials, time, space and personnel. Quality Control consists of a series of 16mm single take shots, filmed over a two day period in the summer of 2010, in a dry cleaners facility in Pritchard, Alabama, near Mobile. Quality Control exhibits the acts as well the conditions around labor. It is similar stylistically, in form and rhythm, to certain scenarios in Everson's award-winning and critically acclaimed previous films, including Erie (IFFR 2010) and in thematic concerns to several other short form works which follow the daily, quotidian tasks of workers in rest and in motion, including the factory routine captured in the short film A Week in the Hole (2001), which focused on an employee's adjustment to materials, time, space and personnel. 
Principal cast includes Shay Wright and Annette Speight.",Quality Control (2011),documentary,"Quality Control consists of a series of 16mm single take shots filmed in the summer of 2010,over a two day period, in a dry cleaners facility in Pritchard, Alabama, near Mobile, Quality Control exhibits the acts as well the conditions around labor and showcases, in Everson's words ""the fine folks of Alabama producing a superior product."" It is similar stylistically, in form and rhythm, to certain scenarios in Everson's award-winning and critically acclaimed previous films, including Erie (IFFR 2010) and in thematic concerns to several other short form works which follow the daily, quotidian tasks of workers in rest and in motion, and is an oblique sequel, ten years hence, to Everson's Creative Capital granted project A Week in the Hole (2001), which focused on an employee's adjustment to materials, time, space and personnel. Quality Control consists of a series of 16mm single take shots, filmed over a two day period in the summer of 2010, in a dry cleaners facility in Pritchard, Alabama, near Mobile. Quality Control exhibits the acts as well the conditions around labor. It is similar stylistically, in form and rhythm, to certain scenarios in Everson's award-winning and critically acclaimed previous films, including Erie (IFFR 2010) and in thematic concerns to several other short form works which follow the daily, quotidian tasks of workers in rest and in motion, including the factory routine captured in the short film A Week in the Hole (2001), which focused on an employee's adjustment to materials, time, space and personnel. Principal cast includes Shay Wright and Annette Speight."
"7 ::: ""Pink Slip"" (2009) ::: comedy ::: In tough economic times Max and Joey have all but run out of ideas until, they discover that senior housing is cheap. Not only that but Max's aunt just kicked the bucket and no one knows yet. In a hilarious series that always keeps you on your toes, the two friends take us on a cross-dressing, desperate and endearing ride through being broke.","""Pink Slip"" (2009)",comedy,"In tough economic times Max and Joey have all but run out of ideas until, they discover that senior housing is cheap. Not only that but Max's aunt just kicked the bucket and no one knows yet. In a hilarious series that always keeps you on your toes, the two friends take us on a cross-dressing, desperate and endearing ride through being broke."
"8 ::: One Step Away (1985) ::: crime ::: Ron Petrie (Keanu Reeves) is a troubled teen whose life is hanging by a thread, as he's on the verge of suspension from school, subject to arrest for breaking and entering, and the cause of his single mother's impending eviction from her apartment. Unless he can find a resolution, his only option seems to be life of street crime.",One Step Away (1985),crime,"Ron Petrie (Keanu Reeves) is a troubled teen whose life is hanging by a thread, as he's on the verge of suspension from school, subject to arrest for breaking and entering, and the cause of his single mother's impending eviction from her apartment. Unless he can find a resolution, his only option seems to be life of street crime."
"9 ::: ""Desperate Hours"" (2016) ::: reality-tv ::: A sudden calamitous event, causing great loss of life, damage, or hardship, like a flood, a tornado, an airplane crash, or an earthquake. This is not only a documentary but a live account of dramatic events in real time. In this unique 13- part series you'll be an eyewitness to some of the greatest disasters of the last 100 years and you will have a rare opportunity to compare disasters across time and distance and decide which you think is the worst.","""Desperate Hours"" (2016)",reality-tv,"A sudden calamitous event, causing great loss of life, damage, or hardship, like a flood, a tornado, an airplane crash, or an earthquake. This is not only a documentary but a live account of dramatic events in real time. In this unique 13- part series you'll be an eyewitness to some of the greatest disasters of the last 100 years and you will have a rare opportunity to compare disasters across time and distance and decide which you think is the worst."
"10 ::: Spirits (2014/I) ::: horror ::: Four high school students embark on a terrifying journey through ShadowView Manor 2 years after a horrifying séance gone wrong. Intern Raven, decides to reconnect with her elementary school friends Kota, William, and Jessica by bringing them to her new workplace, ShadowView Manor for a bit of paranormal investigating. Hearing more forbidden secrets from the night janitor sends them into a dark descending spiral of terror.",Spirits (2014/I),horror,"Four high school students embark on a terrifying journey through ShadowView Manor 2 years after a horrifying séance gone wrong. Intern Raven, decides to reconnect with her elementary school friends Kota, William, and Jessica by bringing them to her new workplace, ShadowView Manor for a bit of paranormal investigating. Hearing more forbidden secrets from the night janitor sends them into a dark descending spiral of terror."


Output can only be rendered in Databricks
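
For spot-checking a few raw lines without a Spark session, the same split can be sketched in plain Python. This is an illustration only: `parse_record` is a hypothetical helper that is not part of the notebook, and the sample line is made up. It assumes each line has the form `id ::: title ::: genre ::: description` and, like the limit of 4 in the cell above, splits at most three times so that a stray `:::` inside a description is left alone. Unlike the notebook, it also trims whitespace in the same step.

```python
def parse_record(line: str) -> dict:
    """Split one raw line into its four ':::'-delimited fields."""
    # maxsplit=3 yields at most 4 parts, mirroring F.split(..., ':::', 4).
    parts = line.split(":::", 3)
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields, got {len(parts)}")
    # Strip the padding that surrounds each delimiter.
    record_id, title, genre, description = (p.strip() for p in parts)
    return {
        "id": record_id,
        "title": title,
        "genre": genre,
        "description": description,
    }


# Hypothetical line with a delimiter inside the description.
sample = "42 ::: Example Film (2000) ::: drama ::: A made-up plot ::: with a stray delimiter."
record = parse_record(sample)
print(record["title"], "|", record["genre"])
print(record["description"])
```

Raising on malformed lines is a design choice for this sketch; the Spark version instead produces nulls for missing fields.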

##### Remove Unnecessary Whitespace & Reduce Number of Classes Due to Class Sizes

In [0]:
df = df.withColumn("title", F.trim("title"))\
    .withColumn("genre", F.trim("genre"))\
    .withColumn("description", F.trim("description"))\
    .drop("value")

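# Genre labels (class values, not DataFrame columns) to keep
cols_to_keep = ['documentary','drama', 'comedy', 'short']

# Keep only rows whose genre is one of the retained classes
df = df.filter(df.genre.isin(cols_to_keep))

# Drop any remaining rows that contain nulls
df = df.na.drop()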

display(df)

title,genre,description
Oscar et la dame rose (2009),drama,"Listening in to a conversation between his doctor and parents, 10-year-old Oscar learns what nobody has the courage to tell him. He only has a few weeks to live. Furious, he refuses to speak to anyone except straight-talking Rose, the lady in pink he meets on the hospital stairs. As Christmas approaches, Rose uses her fantastical experiences as a professional wrestler, her imagination, wit and charm to allow Oscar to live life and love to the full, in the company of his friends Pop Corn, Einstein, Bacon and childhood sweetheart Peggy Blue."
The Secret Sin (1915),drama,"To help their unemployed father make ends meet, Edith and her twin sister Grace work as seamstresses . An invalid, Grace falls prey to the temptations of Chinatown opium and becomes an addict, a condition worsened by a misguided physician who prescribes morphine to ease her pain. When their father strikes oil, the family enjoys a new prosperity and the sisters meet the eligible Jack Herron, a fellow oil prospector. To Grace's shock, Jack falls in love with Edith and in her jealousy, Grace tells Jack that Edith, not she, has a drug problem. Hinting that her sister will soon need more morphine, Grace arranges for a dinner in Chinatown with the couple. While her sister and Jack dance, Grace slips away to an opium den. Edith follows her, but ends up in the wrong den and is arrested in an ensuing drug raid. After he bails her out of jail, Edith takes an angry Jack to search for Grace and stumbles across her half-conscious body lying in the street. The truth about the sisters is revealed, and after sending Grace to a sanitarium in the country, Jack and Edith are married."
The Unrecovered (2007),drama,"The film's title refers not only to the un-recovered bodies at ground zero, but also to the state of the nation at large. Set in the hallucinatory period of time between September 11 and Halloween of 2001, The Unrecovered examines the effect of terror on the average mind, the way a state of heightened anxiety and/or alertness can cause the average person to make the sort of imaginative connections that are normally made only by artists and conspiracy theorists-both of whom figure prominently in this film. The Unrecovered explores the way in which irony, empathy, and paranoia relate to one another in the wake of 9/11."
Quality Control (2011),documentary,"Quality Control consists of a series of 16mm single take shots filmed in the summer of 2010,over a two day period, in a dry cleaners facility in Pritchard, Alabama, near Mobile, Quality Control exhibits the acts as well the conditions around labor and showcases, in Everson's words ""the fine folks of Alabama producing a superior product."" It is similar stylistically, in form and rhythm, to certain scenarios in Everson's award-winning and critically acclaimed previous films, including Erie (IFFR 2010) and in thematic concerns to several other short form works which follow the daily, quotidian tasks of workers in rest and in motion, and is an oblique sequel, ten years hence, to Everson's Creative Capital granted project A Week in the Hole (2001), which focused on an employee's adjustment to materials, time, space and personnel. Quality Control consists of a series of 16mm single take shots, filmed over a two day period in the summer of 2010, in a dry cleaners facility in Pritchard, Alabama, near Mobile. Quality Control exhibits the acts as well the conditions around labor. It is similar stylistically, in form and rhythm, to certain scenarios in Everson's award-winning and critically acclaimed previous films, including Erie (IFFR 2010) and in thematic concerns to several other short form works which follow the daily, quotidian tasks of workers in rest and in motion, including the factory routine captured in the short film A Week in the Hole (2001), which focused on an employee's adjustment to materials, time, space and personnel. Principal cast includes Shay Wright and Annette Speight."
"""Pink Slip"" (2009)",comedy,"In tough economic times Max and Joey have all but run out of ideas until, they discover that senior housing is cheap. Not only that but Max's aunt just kicked the bucket and no one knows yet. In a hilarious series that always keeps you on your toes, the two friends take us on a cross-dressing, desperate and endearing ride through being broke."
The Spirit World: Ghana (2016),documentary,"Tom Beacham explores Ghana with Director of Photography Alex Holberton, in search of Spirits. The film looks at the overlap between the spirit world and the physical world, voodoo, witchcraft, the power of the Holy Spirit and magic. They meet with royal family members, magicians and voodoo priests as well as getting rare footage of a full Voodoo Ceremony."
In the Gloaming (1997),drama,"Danny, dying of Aids, returns home for his last months. Always close to his mother, they share moments of openness that tend to shut out Danny's father and his sister."
Pink Ribbons: One Small Step (2009),documentary,"A sister's breast cancer diagnosis and her brother's need to take action. Highlighting the events that took place during Franke's participation in the Susan G. Komen 3-Day Walk, including the training and fund-raising events in the six months preceding the Walk. It includes numerous interviews with breast cancer survivors who share the wealth of their experience with the viewer. Also included are appearances by local celebrities, athletes and musical artists, as well as informative interviews with health care professionals who explain what we should all know about dealing with this illness. This is the film the doctor should have given my sister before she left his office that fateful day. So I made this film for my sister, and for all our sisters... that they would always remember, ""You are not alone."""
The Glass Menagerie (1973),drama,"Amanda Wingfield dominates her children with her faded gentility and exaggerated tales of her Southern belle past. Her son plans escape; her daughter withdraws into a dream world. When a ""gentleman caller"" appears, things move to crisis point. Amanda is faced with the dilemma of a dependent life, for both her remaining years, but more importantly, for her dependent daughter, Laura. Amanda struggles to use the only means she has at her disposable to secure a future for her frightened and fragile ""spinster"" daughter."
Night Call (2016),drama,"Simon's world is turned upside down when his little girl Katie is abducted during a family day out. After weeks of searching and appeals Simon decides to punish himself by offering a young prostitute with mounting debts, life changing money to end his life."
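
The heading above attributes the reduction to class sizes, but the per-class counts are not shown in this section. A minimal sketch of how they could be inspected before filtering, in plain Python, using only the genres of the first ten raw records displayed earlier (in the notebook the analogous step would be a group-by count on the `genre` column). Ten rows are far too few to judge class balance, so this only illustrates the mechanics; `genres_to_keep` is a name chosen for this example.

```python
from collections import Counter

# Genres of the first ten records shown in the raw output above.
genres = [
    "drama", "thriller", "adult", "drama", "drama",
    "documentary", "comedy", "crime", "reality-tv", "horror",
]

# Count records per class, largest class first.
class_sizes = Counter(genres).most_common()

# Keep only the four classes retained in the notebook.
genres_to_keep = {"documentary", "drama", "comedy", "short"}
kept = [g for g in genres if g in genres_to_keep]

print(class_sizes)
print(kept)
```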



##### Combine Title & Description

In [0]:
df = df.select(F.col("genre").alias("label"), F.concat_ws(" : ", "title", "description").alias("text"))

##### Compute Word Count for Each Text Input (Title + Description)

In [0]:
df = df.withColumn("text_length", F.size(F.split(F.col('text'), ' ')))
display(df)

label,text,text_length
drama,"Oscar et la dame rose (2009) : Listening in to a conversation between his doctor and parents, 10-year-old Oscar learns what nobody has the courage to tell him. He only has a few weeks to live. Furious, he refuses to speak to anyone except straight-talking Rose, the lady in pink he meets on the hospital stairs. As Christmas approaches, Rose uses her fantastical experiences as a professional wrestler, her imagination, wit and charm to allow Oscar to live life and love to the full, in the company of his friends Pop Corn, Einstein, Bacon and childhood sweetheart Peggy Blue.",99
drama,"The Secret Sin (1915) : To help their unemployed father make ends meet, Edith and her twin sister Grace work as seamstresses . An invalid, Grace falls prey to the temptations of Chinatown opium and becomes an addict, a condition worsened by a misguided physician who prescribes morphine to ease her pain. When their father strikes oil, the family enjoys a new prosperity and the sisters meet the eligible Jack Herron, a fellow oil prospector. To Grace's shock, Jack falls in love with Edith and in her jealousy, Grace tells Jack that Edith, not she, has a drug problem. Hinting that her sister will soon need more morphine, Grace arranges for a dinner in Chinatown with the couple. While her sister and Jack dance, Grace slips away to an opium den. Edith follows her, but ends up in the wrong den and is arrested in an ensuing drug raid. After he bails her out of jail, Edith takes an angry Jack to search for Grace and stumbles across her half-conscious body lying in the street. The truth about the sisters is revealed, and after sending Grace to a sanitarium in the country, Jack and Edith are married.",197
drama,"The Unrecovered (2007) : The film's title refers not only to the un-recovered bodies at ground zero, but also to the state of the nation at large. Set in the hallucinatory period of time between September 11 and Halloween of 2001, The Unrecovered examines the effect of terror on the average mind, the way a state of heightened anxiety and/or alertness can cause the average person to make the sort of imaginative connections that are normally made only by artists and conspiracy theorists-both of whom figure prominently in this film. The Unrecovered explores the way in which irony, empathy, and paranoia relate to one another in the wake of 9/11.",110
documentary,"Quality Control (2011) : Quality Control consists of a series of 16mm single take shots filmed in the summer of 2010,over a two day period, in a dry cleaners facility in Pritchard, Alabama, near Mobile, Quality Control exhibits the acts as well the conditions around labor and showcases, in Everson's words ""the fine folks of Alabama producing a superior product."" It is similar stylistically, in form and rhythm, to certain scenarios in Everson's award-winning and critically acclaimed previous films, including Erie (IFFR 2010) and in thematic concerns to several other short form works which follow the daily, quotidian tasks of workers in rest and in motion, and is an oblique sequel, ten years hence, to Everson's Creative Capital granted project A Week in the Hole (2001), which focused on an employee's adjustment to materials, time, space and personnel. Quality Control consists of a series of 16mm single take shots, filmed over a two day period in the summer of 2010, in a dry cleaners facility in Pritchard, Alabama, near Mobile. Quality Control exhibits the acts as well the conditions around labor. It is similar stylistically, in form and rhythm, to certain scenarios in Everson's award-winning and critically acclaimed previous films, including Erie (IFFR 2010) and in thematic concerns to several other short form works which follow the daily, quotidian tasks of workers in rest and in motion, including the factory routine captured in the short film A Week in the Hole (2001), which focused on an employee's adjustment to materials, time, space and personnel. Principal cast includes Shay Wright and Annette Speight.",262
comedy,"""Pink Slip"" (2009) : In tough economic times Max and Joey have all but run out of ideas until, they discover that senior housing is cheap. Not only that but Max's aunt just kicked the bucket and no one knows yet. In a hilarious series that always keeps you on your toes, the two friends take us on a cross-dressing, desperate and endearing ride through being broke.",67
documentary,"The Spirit World: Ghana (2016) : Tom Beacham explores Ghana with Director of Photography Alex Holberton, in search of Spirits. The film looks at the overlap between the spirit world and the physical world, voodoo, witchcraft, the power of the Holy Spirit and magic. They meet with royal family members, magicians and voodoo priests as well as getting rare footage of a full Voodoo Ceremony.",65
drama,"In the Gloaming (1997) : Danny, dying of Aids, returns home for his last months. Always close to his mother, they share moments of openness that tend to shut out Danny's father and his sister.",35
documentary,"Pink Ribbons: One Small Step (2009) : A sister's breast cancer diagnosis and her brother's need to take action. Highlighting the events that took place during Franke's participation in the Susan G. Komen 3-Day Walk, including the training and fund-raising events in the six months preceding the Walk. It includes numerous interviews with breast cancer survivors who share the wealth of their experience with the viewer. Also included are appearances by local celebrities, athletes and musical artists, as well as informative interviews with health care professionals who explain what we should all know about dealing with this illness. This is the film the doctor should have given my sister before she left his office that fateful day. So I made this film for my sister, and for all our sisters... that they would always remember, ""You are not alone.""",139
drama,"The Glass Menagerie (1973) : Amanda Wingfield dominates her children with her faded gentility and exaggerated tales of her Southern belle past. Her son plans escape; her daughter withdraws into a dream world. When a ""gentleman caller"" appears, things move to crisis point. Amanda is faced with the dilemma of a dependent life, for both her remaining years, but more importantly, for her dependent daughter, Laura. Amanda struggles to use the only means she has at her disposable to secure a future for her frightened and fragile ""spinster"" daughter.",89
drama,"Night Call (2016) : Simon's world is turned upside down when his little girl Katie is abducted during a family day out. After weeks of searching and appeals Simon decides to punish himself by offering a young prostitute with mounting debts, life changing money to end his life.",48



##### Reduce Number of Samples

In [0]:
# Keep only rows whose text_length is below 151, then drop the helper column
df = df.filter(F.col("text_length") < 151)
df = df.drop("text_length")

print(df.count())
display(df)

32430


label,text
drama,"Oscar et la dame rose (2009) : Listening in to a conversation between his doctor and parents, 10-year-old Oscar learns what nobody has the courage to tell him. He only has a few weeks to live. Furious, he refuses to speak to anyone except straight-talking Rose, the lady in pink he meets on the hospital stairs. As Christmas approaches, Rose uses her fantastical experiences as a professional wrestler, her imagination, wit and charm to allow Oscar to live life and love to the full, in the company of his friends Pop Corn, Einstein, Bacon and childhood sweetheart Peggy Blue."
drama,"The Unrecovered (2007) : The film's title refers not only to the un-recovered bodies at ground zero, but also to the state of the nation at large. Set in the hallucinatory period of time between September 11 and Halloween of 2001, The Unrecovered examines the effect of terror on the average mind, the way a state of heightened anxiety and/or alertness can cause the average person to make the sort of imaginative connections that are normally made only by artists and conspiracy theorists-both of whom figure prominently in this film. The Unrecovered explores the way in which irony, empathy, and paranoia relate to one another in the wake of 9/11."
comedy,"""Pink Slip"" (2009) : In tough economic times Max and Joey have all but run out of ideas until, they discover that senior housing is cheap. Not only that but Max's aunt just kicked the bucket and no one knows yet. In a hilarious series that always keeps you on your toes, the two friends take us on a cross-dressing, desperate and endearing ride through being broke."
documentary,"The Spirit World: Ghana (2016) : Tom Beacham explores Ghana with Director of Photography Alex Holberton, in search of Spirits. The film looks at the overlap between the spirit world and the physical world, voodoo, witchcraft, the power of the Holy Spirit and magic. They meet with royal family members, magicians and voodoo priests as well as getting rare footage of a full Voodoo Ceremony."
drama,"In the Gloaming (1997) : Danny, dying of Aids, returns home for his last months. Always close to his mother, they share moments of openness that tend to shut out Danny's father and his sister."
documentary,"Pink Ribbons: One Small Step (2009) : A sister's breast cancer diagnosis and her brother's need to take action. Highlighting the events that took place during Franke's participation in the Susan G. Komen 3-Day Walk, including the training and fund-raising events in the six months preceding the Walk. It includes numerous interviews with breast cancer survivors who share the wealth of their experience with the viewer. Also included are appearances by local celebrities, athletes and musical artists, as well as informative interviews with health care professionals who explain what we should all know about dealing with this illness. This is the film the doctor should have given my sister before she left his office that fateful day. So I made this film for my sister, and for all our sisters... that they would always remember, ""You are not alone."""
drama,"The Glass Menagerie (1973) : Amanda Wingfield dominates her children with her faded gentility and exaggerated tales of her Southern belle past. Her son plans escape; her daughter withdraws into a dream world. When a ""gentleman caller"" appears, things move to crisis point. Amanda is faced with the dilemma of a dependent life, for both her remaining years, but more importantly, for her dependent daughter, Laura. Amanda struggles to use the only means she has at her disposable to secure a future for her frightened and fragile ""spinster"" daughter."
drama,"Night Call (2016) : Simon's world is turned upside down when his little girl Katie is abducted during a family day out. After weeks of searching and appeals Simon decides to punish himself by offering a young prostitute with mounting debts, life changing money to end his life."
comedy,"Babylon Vista (2001) : Frankie Reno was a child star on a TV show. But that was thirty years ago. Now he's busy making ends meet running ""Babylon Vista"", a Hollywood apartment inhabited by has-beens and wannabees - with more stories than rent payments. Frankie slowly finds himself getting sucked into the bizarre world of his dysfunctional tenants."
documentary,"""Wo Grafen schlafen - Eine Schlösser-Reise"" (2014) : The story of the Castle and Family of Norbert Salburg-Falkenstein, Commander of the Knights of Malte for Österreich and Bohemia, who married the daughter of a naval commander from Sweden and was able to restore the family property correctly with her help. A Wagnerian opera singer sings Welsh and Wagnerian music in the video, one of a series on how the nobility of Deutschland and Österreich live in the various castles. Lively and fun commentary from the TV hosts, one of whom is a descendant himself of the great Hapsburg dynasty."
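
The `text_length` values displayed before the filter match a plain whitespace token count of the `text` column (title and year included) for the rows checked here. The sketch below assumes that definition and replays the `< 151` filter on two rows copied from the tables above. It is plain Python, not Spark, and the names `token_count` and `MAX_TOKENS` are hypothetical.

```python
# Two rows copied from the tables in this notebook: (text, displayed text_length)
rows = [
    (
        "In the Gloaming (1997) : Danny, dying of Aids, returns home for his "
        "last months. Always close to his mother, they share moments of "
        "openness that tend to shut out Danny's father and his sister.",
        35,
    ),
    (
        "Night Call (2016) : Simon's world is turned upside down when his "
        "little girl Katie is abducted during a family day out. After weeks "
        "of searching and appeals Simon decides to punish himself by offering "
        "a young prostitute with mounting debts, life changing money to end "
        "his life.",
        48,
    ),
]

MAX_TOKENS = 151  # same threshold as the filter cell (hypothetical name)


def token_count(text: str) -> int:
    """Count whitespace-separated tokens (assumed definition of text_length)."""
    return len(text.split())


for text, shown_length in rows:
    n = token_count(text)
    # A row survives the filter when its token count is below the threshold
    print(f"tokens={n} displayed={shown_length} kept={n < MAX_TOKENS}")
```

Under this assumption, a description such as the 262-token "Quality Control" entry is dropped, which agrees with its absence from the filtered table.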


##### Remove Data Imbalance

In [0]:
samples_each_class = 2500

class_ratios = {}

# Compute a per-class sampling fraction for each label kept earlier
for x in cols_to_keep:
    # Select all samples with the current 'label' value
    current_df = df.filter(F.col("label") == x)
    print(x)
    
    # Fraction that keeps roughly samples_each_class rows; sampleBy needs a value in [0, 1]
    ratio = min(1.0, samples_each_class / current_df.count())
    print("Ratio:", ratio)
    # Store the fraction as a float rather than a string
    class_ratios[x] = ratio

print(class_ratios)

# sampleBy keeps each row independently with its class fraction, so class sizes are approximate
sampled_df = df.sampleBy("label", 
                         fractions=class_ratios, 
                         seed=42)

print(sampled_df.count())

documentary
Ratio: 0.2403383964622188
drama
Ratio: 0.22681908909453818
comedy
Ratio: 0.3897723729342064
short
Ratio: 0.544425087108014
{'documentary': 0.2403383964622188, 'drama': 0.22681908909453818, 'comedy': 0.3897723729342064, 'short': 0.544425087108014}
10058
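
The sampled count is 10058 rather than exactly 4 x 2500 = 10000 because `sampleBy` keeps each row independently with its class fraction, so each class size is only approximately 2500. The sketch below illustrates this in plain Python. The class sizes are not printed anywhere in the notebook; they are back-calculated from the ratios above (2500 / ratio, rounded) and happen to sum to 32430. The function name `bernoulli_sample_sizes` is hypothetical, and Python's random generator will not reproduce Spark's exact 10058.

```python
import random

samples_each_class = 2500

# Class sizes implied by the printed ratios (assumption: 2500 / ratio, rounded)
class_counts = {"documentary": 10402, "drama": 11022, "comedy": 6414, "short": 4592}

# Same formula as the notebook cell: target size divided by class size
fractions = {label: samples_each_class / n for label, n in class_counts.items()}


def bernoulli_sample_sizes(counts, fractions, seed=42):
    """Keep each row independently with its class fraction; return rows kept per class."""
    rng = random.Random(seed)
    return {
        label: sum(rng.random() < fractions[label] for _ in range(n))
        for label, n in counts.items()
    }


sizes = bernoulli_sample_sizes(class_counts, fractions)
for label, kept in sizes.items():
    print(f"{label}: fraction={fractions[label]:.4f} kept={kept}")
print("total kept:", sum(sizes.values()))
```

If exactly 2500 rows per class were required, a different approach (for example, ranking rows randomly within each label and keeping the first 2500) would be needed; the notebook itself uses the approximate sample.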


In [0]:
display(sampled_df)

label,text
drama,"Night Call (2016) : Simon's world is turned upside down when his little girl Katie is abducted during a family day out. After weeks of searching and appeals Simon decides to punish himself by offering a young prostitute with mounting debts, life changing money to end his life."
comedy,"Tunnel Vision (1976) : A committee investigating TV's first uncensored network examines a typical day's programming, which includes shows, commercials, news programs, you name it. What they discover will surely crack you up! This outrageous and irreverent spoof of television launched the careers of some of the greatest comedians of all time. This 1976 film tries to predict what American television will be like in the year 1985. Tunnelvision is America's first ""uncensored and free"" television network. Although wildly popular, it is also blamed for increased crime and unemployment. Christian A. Broder, president and founder of Tunnelvision, is called to defend his network in front of a Senate sub-committee. The sub-committee decides to view excerpts from a ""typical"" day of Tunnelvision broadcasting. What follows is a series of brief skits lampooning television, including cop shows, news broadcasts, situation comedies, and (of course) commercials."
documentary,"7 días con Alberto Corazón (2015) : The objective of this documentary is to convey in an exciting way who is Alberto Corazón. His work, his career, his creative process, his contribution to Spanish culture. His personality, his character and his ideas. Its contribution to the Spanish culture - especially the visual culture - through the recognition of marks and symbols, corporate identities, industrial designs, objects and signs. His creative process, showing closely and transparently how he create. His trajectory, linked to the recent history of Spain. His work in the broadest sense, including editorial design, posters, wall paintings, sundials, objects, paintings, sculptures, books. His personality, character and ideas, respecting the space of intimacy."
comedy,"""The Young Professionals"" (2015) : Whether it's blocking up mouse holes, running from Landlords or making puppet shows in the bath, it's never a dull moment for The Young Professionals. Desperate to break into the online world and escape the terrors of temping, Natalie presents the lives of six housemates struggling to get on the career ladder after uni and pay their rent on time. Which is all helped along with Keara - the one with the 'real' job."
documentary,"The End of Ageing (2010) : All over the world, human beings are living longer than ever before. This is due to many factors, including improved living conditions, lifestyle choices and medical advancements. While there is not a single cause, a growing community of scientists are pushing the limits of life expectancy. In the not-too-distant future, they may even be able to halt ageing altogether. As the world's population continues to live longer, our current economic systems will no longer be sustainable. Health care systems, and the economies that fund them, need to make major changes. Because a growing number of people are healthy enough continue to work and play, we will need to reevaluate the nature of employment and recreation."
short,"Begegnung mit Fritz Lang (1964) : Interview with Fritz Lang on the roof of Villa Malaparte on Capri during the filming of the fictitious film ""Odysseus"" and the filming of ""Le mépris"" by Jean-Luc Godard, in which Fritz Lang plays the role of an old film director called Lang. Interview with Fritz Lang on the roof of Villa Malaparte on Capri during the filming of the fictitious film ""Odysseus"" and the filming of ""Le mépris"" by Jean-Luc Godard, in which Fritz Lang plays the role of an old film director called Lang. During the interview, excerpts from the long films ""The Nibelungen"", ""The Tired Death"" and ""M"" are shown."
documentary,"Race Across America: Push Beyond (2017) : Marshall Nord is a 49-year-old executive, Father, husband and amateur endurance athlete. He has 30 years of multi-sport experience behind him, having run marathons, Iron Mans, triathlons, you name it. But with the big 5-0 looming, he is desperate to achieve something truly epic. And what is more epic than taking on the toughest ultra-endurance cycling event in the world? The Race Across America is a 3,070 mile bicycle race from the West to East of the USA, that must be completed in under 12 days. With a small support crew made up of family and friends but not a minute of RAAM experience between them, the odds are stacked against him. Marshall is determined to push beyond in order to achieve his goal however, no matter what obstacles the race, rules or road can throw in his way."
comedy,"""George & Leo"" (1997) : George Stoody is a mild-mannered bookstore owner who encounters a hoodlum/magician named Leo Wagonman, the estranged father of his new daughter-in-law Casey. Leo, on the run from a mob intent on collecting the payoff money Leo stole from a Las Vegas casino, decides to stay in the spare room above George's bookstore."
short,"La capsula (2009) : Italian millionaire Moritz Craffonara dreams about 'a bed under the stars'. He asks his friend Ross Lovegrove for help and the British designer constructs something that nobody thought was possible: a floating space capsule on top of a 2100 meter mountain in the heart of the Alps. The project was named ""Alpine Capsule"" and gained media attention all over the world. But eventually it is something way bigger: It became Moritz' monument. A documentary about the limits of modern technology and the human fear of being forgotten."
drama,"Herr über Leben und Tod (1955) : Barbara is married to Georg Bertram, a professor of medicine who once saved her father's life. Things go awry for the couple when she gives birth to a mentally defective child. For Georg, coldly clinical, euthanizing the infant is the only way out. He is about to commit the irreparable when Barbara manages to interrupt his fatal act. By mutual agreement, husband and wife decide that Barbara will go to Saint-Guénolé in Brittany, where she and their son will be cared for by Louise Kerbrec, Georg's former nurse, in the hypothetical hope that the boy's condition will improve. What they do not know yet is that Barbara will meet there another doctor, Daniel Karentis, much more sympathetic than Georg and also much handsomer..."


##### Save Processed Dataset

In [0]:
def save_modified_ds(name):
    # Write one tab-delimited part file with a header row, replacing any earlier copy
    writer = sampled_df.coalesce(1).write.format("csv").option("header", "true")
    writer.option("delimiter", "\t").mode("overwrite").save(f"dbfs:/FileStore/tables/{name}")
 
save_modified_ds("prepped_imdb_ds")
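
The descriptions contain many commas and embedded double quotes, so a tab delimiter keeps most fields free of the separator character. The sketch below is a local stand-in using Python's `csv` module, not Spark: it writes two shortened rows from this notebook as tab-delimited text and reads them back to check that they survive the round trip. Spark's CSV writer has its own quoting and escaping settings, so this shows the idea only.

```python
import csv
import io

# Two shortened rows taken from the tables in this notebook
rows = [
    {
        "label": "comedy",
        "text": '"Pink Slip" (2009) : In tough economic times Max and Joey '
                "have all but run out of ideas until, they discover that "
                "senior housing is cheap.",
    },
    {
        "label": "drama",
        "text": "In the Gloaming (1997) : Danny, dying of Aids, returns home "
                "for his last months.",
    },
]

# Write a header plus the rows as tab-delimited text into an in-memory buffer
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["label", "text"], delimiter="\t", lineterminator="\n"
)
writer.writeheader()
writer.writerows(rows)

# Read the same text back and compare it with the original rows
buffer.seek(0)
restored = [dict(r) for r in csv.DictReader(buffer, delimiter="\t")]
print("round trip intact:", restored == rows)
```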