<a href="https://colab.research.google.com/github/ElieB-1012/SpoilMe_GPT_3/blob/master/Dataset_Prepocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SpoilMe


In this Google Colab notebook, we import two datasets: one downloaded from Kaggle and one scraped from the MoviePooper website. We then preprocess and merge them to prepare the data for fine-tuning.

In [5]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

"Action Archives - MoviePooper full.csv" is the dataset that we scraped with Octoparse.

"IMDB Kaggle Dataset.json" is the dataset that we found on Kaggle.

In [38]:
df = pd.read_csv('Action Archives - MoviePooper full.csv')
df1 = pd.read_json('IMDB Kaggle Dataset.json')

Overview of the two datasets.

In [39]:
df.head()

| | Title | Spoiler |
|---|---|---|
| 0 | +1(2013) | Short Version:\nDavid #1 kills Jill #1 with a ... |
| 1 | 10(1979) | |
| 2 | 10 Cloverfield Lane(2016) | Howard (John Goodman) is truly hiding from a r... |
| 3 | 10 Things I Hate About You(1999) | Kat (Julia Stiles) reveals to Bianca (Larisa O... |
| 4 | 10 to Midnight(1983) | Frustrated that he cannot get a confession fro... |


In [40]:
df1.head()

| | movie_id | plot_summary | duration | genre | rating | release_date | plot_synopsis | title |
|---|---|---|---|---|---|---|---|---|
| 0 | tt0105112 | Former CIA analyst, Jack Ryan is in England wi... | 1h 57min | [Action, Thriller] | 6.9 | 1992-06-05 | Jack Ryan (Ford) is on a "working vacation" in... | Patriot Games |
| 1 | tt1204975 | Billy (Michael Douglas), Paddy (Robert De Niro... | 1h 45min | [Comedy] | 6.6 | 2013-11-01 | Four boys around the age of 10 are friends in ... | Last Vegas |
| 2 | tt0243655 | The setting is Camp Firewood, the year 1981. I... | 1h 37min | [Comedy, Romance] | 6.7 | 2002-04-11 | | Wet Hot American Summer |
| 3 | tt0040897 | Fred C. Dobbs and Bob Curtin, both down on the... | 2h 6min | [Adventure, Drama, Western] | 8.3 | 1948-01-24 | Fred Dobbs (Humphrey Bogart) and Bob Curtin (T... | The Treasure of the Sierra Madre |
| 4 | tt0126886 | Tracy Flick is running unopposed for this year... | 1h 43min | [Comedy, Drama, Romance] | 7.3 | 1999-05-07 | Jim McAllister (Matthew Broderick) is a much-a... | Election |


## Preprocessing

To merge the two datasets, we need a shared column. We use the movie title as the common key, so we first rename the "title" column of the Kaggle dataset to "Title" to match the scraped dataset.

In [12]:
df1.rename(columns = {'title':'Title'}, inplace = True)

In the spoiler dataset, the "Title" column includes the release year in parentheses (for example, "10 Cloverfield Lane(2016)"). We remove the year so that the titles match those in the Kaggle dataset.

In [13]:
df['Title'] = df['Title'].str.split('(').str[0]
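
Splitting on the first "(" works for the titles shown here, but it would also truncate any title that itself contains a parenthesis. As a hedged alternative, the sketch below removes only a trailing four-digit year in parentheses. The sample titles are hypothetical illustrations in the same "Name(YYYY)" shape, not rows confirmed to be in the scraped file.

```python
import pandas as pd

# Hypothetical sample titles in the same "Name(YYYY)" shape as the scraped data.
titles = pd.Series(
    ["+1(2013)", "10 Cloverfield Lane(2016)", "(500) Days of Summer(2009)"]
)

# Approach used in the notebook: keep everything before the first "(".
split_based = titles.str.split("(").str[0]

# Alternative: remove only a trailing "(YYYY)" plus surrounding whitespace.
regex_based = titles.str.replace(r"\s*\(\d{4}\)\s*$", "", regex=True)

print(split_based.tolist())
print(regex_based.tolist())
```

With the split-based approach the third sample title becomes an empty string, while the regex-based version keeps the leading "(500)" intact.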

In [14]:
df.head()

| | Title | Spoiler |
|---|---|---|
| 0 | +1 | Short Version:\nDavid #1 kills Jill #1 with a ... |
| 1 | 10 | |
| 2 | 10 Cloverfield Lane | Howard (John Goodman) is truly hiding from a r... |
| 3 | 10 Things I Hate About You | Kat (Julia Stiles) reveals to Bianca (Larisa O... |
| 4 | 10 to Midnight | Frustrated that he cannot get a confession fro... |


Merge on the "Title" column. By default, pd.merge performs an inner join, so only titles present in both datasets are kept.

In [15]:
df2 = pd.merge(df1, df, on = 'Title')
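
Because movie titles are not guaranteed to be unique (remakes can share a name), a title-only join can silently duplicate or drop rows. The following is a minimal sketch of how the merge could be audited. The frames `kaggle` and `scraped` and their contents are hypothetical stand-ins for `df1` and `df`, not data from the notebook.

```python
import pandas as pd

# Hypothetical stand-ins for df1 (Kaggle) and df (MoviePooper).
kaggle = pd.DataFrame(
    {"Title": ["Election", "Shame", "Dogma"], "plot_summary": ["s1", "s2", "s3"]}
)
scraped = pd.DataFrame(
    {"Title": ["Election", "Shame", "Shame", "Zodiac"], "Spoiler": ["a", "b", "c", "d"]}
)

# Titles that occur more than once on one side multiply rows in the merge.
dup_titles = scraped.loc[scraped["Title"].duplicated(keep=False), "Title"].unique()

# An outer join with indicator=True labels each row as matched or unmatched.
audit = pd.merge(kaggle, scraped, on="Title", how="outer", indicator=True)
match_counts = audit["_merge"].value_counts()

# The default inner join, as used in the notebook, keeps only matched titles.
merged = pd.merge(kaggle, scraped, on="Title")

print(dup_titles)
print(match_counts)
print(len(merged))
```

Inspecting `match_counts` before committing to the inner join gives a rough idea of how many movies are lost on each side.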

In [16]:
df2.head()

| | movie_id | plot_summary | duration | genre | rating | release_date | plot_synopsis | Title | Spoiler |
|---|---|---|---|---|---|---|---|---|---|
| 0 | tt0105112 | Former CIA analyst, Jack Ryan is in England wi... | 1h 57min | [Action, Thriller] | 6.9 | 1992-06-05 | Jack Ryan (Ford) is on a "working vacation" in... | Patriot Games | Ryan (Harrison Ford) manages to get his own fa... |
| 1 | tt1204975 | Billy (Michael Douglas), Paddy (Robert De Niro... | 1h 45min | [Comedy] | 6.6 | 2013-11-01 | Four boys around the age of 10 are friends in ... | Last Vegas | Billy (Michael Douglas) is getting married for... |
| 2 | tt0243655 | The setting is Camp Firewood, the year 1981. I... | 1h 37min | [Comedy, Romance] | 6.7 | 2002-04-11 | | Wet Hot American Summer | Beth (Janeane Garofalo) and Henry (David Hyde ... |
| 3 | tt0040897 | Fred C. Dobbs and Bob Curtin, both down on the... | 2h 6min | [Adventure, Drama, Western] | 8.3 | 1948-01-24 | Fred Dobbs (Humphrey Bogart) and Bob Curtin (T... | The Treasure of the Sierra Madre | The Treasure of the Sierra Madre\r\nAfter Fred... |
| 4 | tt0126886 | Tracy Flick is running unopposed for this year... | 1h 43min | [Comedy, Drama, Romance] | 7.3 | 1999-05-07 | Jim McAllister (Matthew Broderick) is a much-a... | Election | Short version:\nTracy Flick wins the school el... |


Create a DataFrame with only the relevant columns.

In [29]:
df3 = df2[['Title', 'plot_summary','plot_synopsis','Spoiler']]
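
pandas may treat a column selection like this as a slice of `df2`, which is what triggers the "A value is trying to be set on a copy of a slice" warnings in the cells that follow. One common way to avoid them is to take an explicit copy. The sketch below assumes a small hypothetical frame named `merged` in place of `df2`.

```python
import pandas as pd

# Hypothetical stand-in for the merged frame df2.
merged = pd.DataFrame(
    {
        "Title": ["Election"],
        "rating": [7.3],
        "Spoiler": ["Short version:\nTracy wins"],
    }
)

# .copy() makes the subset an independent DataFrame, so later column
# assignments are unambiguous and leave `merged` untouched.
subset = merged[["Title", "Spoiler"]].copy()
subset["Spoiler"] = subset["Spoiler"].str.replace("\n", " ", regex=False)

print(subset["Spoiler"].iloc[0])
```

The original frame still contains the newline afterwards, which confirms that the two objects no longer share data.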

Data Cleaning

In [30]:
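# No-op as written: without regex=True this matches only cells equal to '\n', and the result is not assigned.
df3.replace('\n', ' ')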

| | Title | plot_summary | plot_synopsis | Spoiler |
|---|---|---|---|---|
| 0 | Patriot Games | Former CIA analyst, Jack Ryan is in England wi... | Jack Ryan (Ford) is on a "working vacation" in... | Ryan (Harrison Ford) manages to get his own fa... |
| 1 | Last Vegas | Billy (Michael Douglas), Paddy (Robert De Niro... | Four boys around the age of 10 are friends in ... | Billy (Michael Douglas) is getting married for... |
| 2 | Wet Hot American Summer | The setting is Camp Firewood, the year 1981. I... | | Beth (Janeane Garofalo) and Henry (David Hyde ... |
| 3 | The Treasure of the Sierra Madre | Fred C. Dobbs and Bob Curtin, both down on the... | Fred Dobbs (Humphrey Bogart) and Bob Curtin (T... | The Treasure of the Sierra Madre\r\nAfter Fred... |
| 4 | Election | Tracy Flick is running unopposed for this year... | Jim McAllister (Matthew Broderick) is a much-a... | Short version:\nTracy Flick wins the school el... |
| ... | ... | ... | ... | ... |
| 1151 | Dogma | An abortion clinic worker with a special herit... | The film opens with a homeless man (Bud Cort) ... | Dogma\r\nThe comatose homeless man was God all... |
| 1152 | The Boy in the Striped Pajamas | Young Bruno lives a wealthy lifestyle in prewa... | | The “farm” is a Nazi death camp which Bruno’s ... |
| 1153 | The Butterfly Effect | Evan Treborn grows up in a small town with his... | In the year 1998, Evan Treborn (Ashton Kutcher... | The Butterfly Effect\r\nSeveral folks have wri... |
| 1154 | Shame | Brandon is a 30-something man living in New Yo... | Brandon (Michael Fassbender) is a successful, ... | Short Ending:\nBrandon Sullivan is a self-loat... |
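
The output above still contains `\n` and `\r\n`, because `DataFrame.replace` with a plain string replaces only cells whose entire value equals that string. The sketch below contrasts that default with `regex=True`, using a small hypothetical frame rather than the real data.

```python
import pandas as pd

# Hypothetical two-row frame: one spoiler with an embedded newline,
# and one cell that consists of a newline only.
frame = pd.DataFrame({"Spoiler": ["Short version:\nTracy wins", "\n"]})

# Default behaviour: only a cell exactly equal to "\n" is replaced.
exact = frame.replace("\n", " ")

# With regex=True the pattern is applied inside each string.
inside = frame.replace(r"[\r\n]+", " ", regex=True)

print(exact["Spoiler"].tolist())
print(inside["Spoiler"].tolist())
```

Both calls return a new DataFrame, so the result has to be assigned for the change to persist.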


In [31]:
df3['Spoiler'] = df3['Spoiler'].str.replace('\n', ' ')
df3['Spoiler'] = df3['Spoiler'].str.replace('\r', ' ')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df3['Spoiler'] = df3['Spoiler'].str.replace('\n', ' ')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df3['Spoiler'] = df3['Spoiler'].str.replace('\r', ' ')
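
Replacing `\n` and `\r` separately turns every `\r\n` pair into two spaces. If that matters for fine-tuning, one possible variant is to collapse any run of whitespace into a single space in one pass. This is a sketch on hypothetical, shortened spoiler strings, and it goes slightly beyond what the notebook does because it also collapses ordinary repeated spaces.

```python
import pandas as pd

# Hypothetical, shortened spoilers modelled on the rows shown above.
spoilers = pd.Series(
    ["Dogma\r\nThe comatose homeless man", "Short Ending:\nBrandon", None]
)

# One pass: any run of whitespace (including \r\n) becomes a single space,
# then leading and trailing spaces are removed. Missing values stay missing.
cleaned = spoilers.str.replace(r"\s+", " ", regex=True).str.strip()

print(cleaned.tolist())
```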


In [32]:
df3.dropna(inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df3.dropna(inplace=True)
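
`dropna` removes only rows that contain NaN or None, so a row whose text field is an empty string survives it. If empty strings should count as missing too, they can be converted to NaN first. The sketch below uses a hypothetical three-row frame and is an assumption about the desired behaviour, not what the notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one complete row, one empty-string synopsis, one NaN spoiler.
frame = pd.DataFrame(
    {
        "Title": ["Election", "Wet Hot American Summer", "10"],
        "plot_synopsis": ["Jim McAllister ...", "", "x"],
        "Spoiler": ["Short version: ...", "Beth and Henry ...", np.nan],
    }
)

# dropna alone: the empty string counts as a value, so only the NaN row goes.
after_dropna = frame.dropna()

# Assumption: treat empty or whitespace-only strings as missing as well.
strict = frame.replace(r"^\s*$", np.nan, regex=True).dropna()

print(after_dropna["Title"].tolist())
print(strict["Title"].tolist())
```

Note that the stricter variant would also discard movies whose only empty field is "plot_synopsis", a column that is dropped later in this notebook anyway.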


In [33]:
df3

| | Title | plot_summary | plot_synopsis | Spoiler |
|---|---|---|---|---|
| 0 | Patriot Games | Former CIA analyst, Jack Ryan is in England wi... | Jack Ryan (Ford) is on a "working vacation" in... | Ryan (Harrison Ford) manages to get his own fa... |
| 1 | Last Vegas | Billy (Michael Douglas), Paddy (Robert De Niro... | Four boys around the age of 10 are friends in ... | Billy (Michael Douglas) is getting married for... |
| 2 | Wet Hot American Summer | The setting is Camp Firewood, the year 1981. I... | | Beth (Janeane Garofalo) and Henry (David Hyde ... |
| 3 | The Treasure of the Sierra Madre | Fred C. Dobbs and Bob Curtin, both down on the... | Fred Dobbs (Humphrey Bogart) and Bob Curtin (T... | The Treasure of the Sierra Madre After Fred C... |
| 4 | Election | Tracy Flick is running unopposed for this year... | Jim McAllister (Matthew Broderick) is a much-a... | Short version: Tracy Flick wins the school ele... |
| ... | ... | ... | ... | ... |
| 1151 | Dogma | An abortion clinic worker with a special herit... | The film opens with a homeless man (Bud Cort) ... | Dogma The comatose homeless man was God all a... |
| 1152 | The Boy in the Striped Pajamas | Young Bruno lives a wealthy lifestyle in prewa... | | The “farm” is a Nazi death camp which Bruno’s ... |
| 1153 | The Butterfly Effect | Evan Treborn grows up in a small town with his... | In the year 1998, Evan Treborn (Ashton Kutcher... | The Butterfly Effect Several folks have writt... |
| 1154 | Shame | Brandon is a 30-something man living in New Yo... | Brandon (Michael Fassbender) is a successful, ... | Short Ending: Brandon Sullivan is a self-loath... |


In [34]:
# Keep only rows whose spoiler is at least 50 characters long.
df3 = df3[df3['Spoiler'].str.len() >= 50]

In [35]:
df3.sort_values(by='Title', inplace=True)
df3 = df3.reset_index(drop=True)
df3.drop(columns=['plot_synopsis'], inplace=True)
df3.rename(columns = {'plot_summary':'Synopsis'}, inplace = True)
df3

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df3.sort_values(by='Title', inplace=True)


| | Title | Synopsis | Spoiler |
|---|---|---|---|
| 0 | 10 Things I Hate About You | Adapted from William Shakespeare's play "The T... | Kat (Julia Stiles) reveals to Bianca (Larisa O... |
| 1 | 12 Angry Men | The defense and the prosecution have rested an... | One by one, Juror #8 (Henry Fonda) convinces t... |
| 2 | 12 Monkeys | An unknown and lethal virus has wiped out five... | James Cole’s (Bruce Willis’) memory/dream is o... |
| 3 | 12 Years a Slave | Based on an incredible true story of one man's... | Academy Awards BEST PICTURE Solomon Northup... |
| 4 | 127 Hours | 127 Hours is the true story of mountain climbe... | Aron Ralston (James Franco) goes hiking and cl... |
| ... | ... | ... | ... |
| 991 | Zero Dark Thirty | Maya is a CIA operative whose first experience... | After years of dead ends and even more terrori... |
| 992 | Zodiac | A serial killer in the San Francisco Bay Area ... | Robert Graysmith (Jake Gyllenhaal) concludes t... |
| 993 | Zombieland | Searching for family. In the early twenty-firs... | On the way to Pacific Playland, which Columbus... |
| 994 | Zoolander | Derek Zoolander is VH1's three time male model... | Zoolander Derek (Ben Stiller) discovers that ... |
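
The sort, index reset, column drop and rename can also be written as a single method chain that returns a new DataFrame instead of modifying one in place, which sidesteps the copy-of-a-slice warning. This is a sketch on a hypothetical two-row frame with the same column names as `df3`.

```python
import pandas as pd

# Hypothetical frame with the same columns as df3 before this step.
frame = pd.DataFrame(
    {
        "Title": ["Zodiac", "12 Monkeys"],
        "plot_summary": ["A serial killer ...", "An unknown and lethal virus ..."],
        "plot_synopsis": ["", ""],
        "Spoiler": ["Robert Graysmith concludes ...", "James Cole's memory ..."],
    }
)

# Each method returns a new DataFrame; `frame` itself is left unchanged.
final = (
    frame.sort_values(by="Title")
    .reset_index(drop=True)
    .drop(columns=["plot_synopsis"])
    .rename(columns={"plot_summary": "Synopsis"})
)

print(final)
```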


Save the DataFrame to a CSV file and download it.

In [None]:
from google.colab import files
df3.to_csv('Project_Dataset.csv', encoding='utf-8')
files.download('Project_Dataset.csv')
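
By default, `to_csv` also writes the row index as a leading column with an empty header, which pandas names "Unnamed: 0" when the file is read back. If that extra column is not wanted in the fine-tuning data, `index=False` omits it. The sketch below shows the difference with an in-memory buffer and a hypothetical one-row frame.

```python
import io

import pandas as pd

# Hypothetical one-row frame with the final column names.
frame = pd.DataFrame(
    {"Title": ["Election"], "Synopsis": ["Tracy Flick ..."], "Spoiler": ["Short version: ..."]}
)

# Default: the index is written as a leading column with an empty header.
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
cols_default = pd.read_csv(with_index).columns.tolist()

# index=False writes only the data columns.
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
cols_clean = pd.read_csv(without_index).columns.tolist()

print(cols_default)
print(cols_clean)
```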
