
## Plot Transformation into Features
*   This notebook is meant to be run in the Google Colab environment.
*   It takes each movie's plot text, tokenises it, and converts it into TF-IDF features for use in our data models.



In [33]:
import os
# Find the latest Spark 3.0.x release at http://www-us.apache.org/dist/spark/ and set it as the Spark version
# For example:
# spark_version = 'spark-3.0.2'
spark_version = 'spark-3.0.2'
os.environ['SPARK_VERSION']=spark_version

# Install Spark and Java
!apt-get update
!apt-get install openjdk-11-jdk-headless -qq > /dev/null
!wget -q http://www-us.apache.org/dist/spark/$SPARK_VERSION/$SPARK_VERSION-bin-hadoop2.7.tgz
!tar xf $SPARK_VERSION-bin-hadoop2.7.tgz
!pip install -q findspark

# Set Environment Variables
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.environ["SPARK_HOME"] = f"/content/{spark_version}-bin-hadoop2.7"

# Locate the Spark installation so that pyspark can be imported
import findspark
findspark.init()

Hit:1 http://archive.ubuntu.com/ubuntu bionic InRelease
Hit:2 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic InRelease
Hit:3 http://archive.ubuntu.com/ubuntu bionic-updates InRelease
Hit:4 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease
Hit:5 http://ppa.launchpad.net/cran/libgit2/ubuntu bionic InRelease

In [34]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("plot_NLP").getOrCreate()

In [35]:
from pyspark import SparkFiles
import pandas as pd

url = "https://data-bootcamp-ztc.s3.amazonaws.com/movies_complete_cleaned.csv"
spark.sparkContext.addFile(url)

# Read the CSV with a header row and let Spark infer the column types
df = spark.read.csv(SparkFiles.get("movies_complete_cleaned.csv"), header=True, inferSchema=True, sep=",")
df.show(10)

+--------------------+--------------------+------------------+-------+----------+----+-----+--------------+--------------------+--------------------+--------------------+-----------------+--------------------+--------------------+--------------------+--------------------+----------+----------+----------------+--------+------------+--------+--------------------+------+
|                name|          production|          director|runtime|  released|year|month|country_kaggle|        country_omdb|         star_kaggle|         actors_omdb|    writer_kaggle|        writers_omdb|       language_omdb|                plot|              awards|score_imdb|votes_imdb|score_metacritic|  budget|genre_kaggle|   gross|         genres_omdb|rating|
+--------------------+--------------------+------------------+-------+----------+----+-----+--------------+--------------------+--------------------+--------------------+-----------------+--------------------+--------------------+--------------------+-------

## Select the Plot Column and Drop Missing Plots

In [36]:
plot_df = df.select(["name","plot"])
plot_df.show()

+--------------------+--------------------+
|                name|                plot|
+--------------------+--------------------+
|                Gold|With the sudden d...|
|          The Choice|In a small coasta...|
|Middle School: Th...|Imaginative quiet...|
|    Midnight Special|Alton Meyer is a ...|
|     A Monster Calls|The monster does ...|
|The Brothers Grimsby|                null|
|Pride and Prejudi...|The five highly t...|
|Mike and Dave Nee...|Hard-partying bro...|
|             Snowden|SNOWDEN stars Jos...|
|             The Boy|Greta, a young Am...|
|Miracles from Heaven|MIRACLES FROM HEA...|
|    Día del atentado|                null|
|            The Boss|A titan of indust...|
|             Arrival|Linguistics profe...|
|       Gods of Egypt|Set, the merciles...|
|Warcraft: The Beg...|                null|
|Everybody Wants S...|In Texas in the f...|
|The Birth of a Na...|Two brothers, Phi...|
| Presencia siniestra|                null|
|                Goat|Reeling fr

In [37]:
plot_df = plot_df.filter("plot IS NOT NULL")
plot_df.show()

+--------------------+--------------------+
|                name|                plot|
+--------------------+--------------------+
|                Gold|With the sudden d...|
|          The Choice|In a small coasta...|
|Middle School: Th...|Imaginative quiet...|
|    Midnight Special|Alton Meyer is a ...|
|     A Monster Calls|The monster does ...|
|Pride and Prejudi...|The five highly t...|
|Mike and Dave Nee...|Hard-partying bro...|
|             Snowden|SNOWDEN stars Jos...|
|             The Boy|Greta, a young Am...|
|Miracles from Heaven|MIRACLES FROM HEA...|
|            The Boss|A titan of indust...|
|             Arrival|Linguistics profe...|
|       Gods of Egypt|Set, the merciles...|
|Everybody Wants S...|In Texas in the f...|
|The Birth of a Na...|Two brothers, Phi...|
|                Goat|Reeling from a te...|
|Miss Peregrine's ...|"When Jacob disco...|
|            War Dogs|Two friends in th...|
|            Deadpool|This is the origi...|
|            The Void|When polic

## Create Data Pipeline

In [38]:
from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF, IDF
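# Define the feature-extraction stages: tokenise, remove stop words, hash term counts, weight by IDF
tokenizer = Tokenizer(inputCol="plot", outputCol="token_text")
stopremove = StopWordsRemover(inputCol="token_text", outputCol="stop_tokens")
# NOTE: HashingTF reads 'token_text', so the stop-word-filtered 'stop_tokens' column is not used downstream.
# Set inputCol="stop_tokens" to hash the filtered tokens instead (the feature indices shown below would change).
hashingTF = HashingTF(inputCol="token_text", outputCol="hash_token")
idf = IDF(inputCol="hash_token", outputCol="idf_token")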
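
To make the stages above concrete, here is a minimal pure-Python sketch of what they compute. It is an illustration under stated assumptions, not Spark's implementation: Spark's `HashingTF` uses MurmurHash3 with 262144 buckets by default, whereas this sketch uses `zlib.crc32` with 16 buckets and a toy stop-word list. The helper names (`tokenize`, `remove_stop_words`, `hashing_tf`, `idf_weights`) and the sample sentences are hypothetical.

```python
import math
import zlib
from collections import Counter

NUM_BUCKETS = 16  # assumption: tiny for readability; Spark defaults to 262144
STOP_WORDS = {"a", "the", "in", "of", "is", "to"}  # assumption: toy list


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def hashing_tf(tokens, num_buckets=NUM_BUCKETS):
    """Hash each token to a bucket and count occurrences per bucket."""
    buckets = (zlib.crc32(t.encode("utf-8")) % num_buckets for t in tokens)
    return dict(Counter(buckets))


def idf_weights(tf_docs):
    """Smoothed IDF per bucket: log((n_docs + 1) / (doc_freq + 1))."""
    n_docs = len(tf_docs)
    doc_freq = Counter(bucket for tf in tf_docs for bucket in tf)
    return {b: math.log((n_docs + 1) / (df + 1)) for b, df in doc_freq.items()}


# Hypothetical sample sentences, not taken from the dataset
plots = [
    "A titan of industry is sent to prison",
    "The monster is sent to the forest",
]
tf_docs = [hashing_tf(remove_stop_words(tokenize(p))) for p in plots]
idf_by_bucket = idf_weights(tf_docs)
tfidf_docs = [{b: n * idf_by_bucket[b] for b, n in tf.items()} for tf in tf_docs]
```

A bucket that occurs in every document gets an IDF of `log(1) = 0`, so terms shared by all plots contribute nothing to the final features. Because distinct tokens can hash to the same bucket, a small bucket count causes collisions; a large default is what keeps them rare.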

In [39]:
# from pyspark.ml.feature import VectorAssembler
# from pyspark.ml.linalg import Vector

# # Create feature vectors
# clean_up = VectorAssembler(inputCols=['idf_token', ''], outputCol='features')

In [40]:
# Assemble the stages into a data-processing Pipeline
from pyspark.ml import Pipeline
data_prep_pipeline = Pipeline(stages=[tokenizer, stopremove, hashingTF, idf])

## Transform DataFrame

In [41]:
# Fit and transform the pipeline
cleaner = data_prep_pipeline.fit(plot_df)
cleaned = cleaner.transform(plot_df)
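
The fit-then-transform pattern can be sketched in plain Python. In this notebook's pipeline, `Tokenizer`, `StopWordsRemover`, and `HashingTF` are stateless transformers, while `IDF` is an estimator that must first learn document frequencies from the data. The sketch below is a simplified model of that idea, not Spark's code; the classes `Lowercase`, `VocabularyIndexer`, and `VocabularyModel`, and the functions `fit_pipeline` and `transform_pipeline`, are hypothetical.

```python
class Lowercase:
    """Transformer: stateless, so it only needs transform()."""

    def transform(self, rows):
        return [r.lower() for r in rows]


class VocabularyModel:
    """Fitted model produced by VocabularyIndexer.fit()."""

    def __init__(self, index):
        self.index = index

    def transform(self, rows):
        # Replace each known word with its learned integer id
        return [[self.index[w] for w in r.split() if w in self.index] for r in rows]


class VocabularyIndexer:
    """Estimator: fit() learns state from the data and returns a model."""

    def fit(self, rows):
        vocab = sorted({w for r in rows for w in r.split()})
        return VocabularyModel({w: i for i, w in enumerate(vocab)})


def fit_pipeline(stages, rows):
    """Fit estimators in order, feeding each stage the previous stage's output."""
    fitted = []
    for stage in stages:
        model = stage.fit(rows) if hasattr(stage, "fit") else stage
        rows = model.transform(rows)
        fitted.append(model)
    return fitted


def transform_pipeline(fitted, rows):
    """Apply every fitted stage in order."""
    for model in fitted:
        rows = model.transform(rows)
    return rows


fitted = fit_pipeline([Lowercase(), VocabularyIndexer()], ["The Boss", "The Boy"])
encoded = transform_pipeline(fitted, ["The Boy"])
```

Separating `fit` from `transform` means the learned state can be reused on new rows without refitting.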

In [42]:
# Show each movie's name, plot, and resulting TF-IDF features
cleaned.select(['name', 'plot','idf_token']).show()


+--------------------+--------------------+--------------------+
|                name|                plot|           idf_token|
+--------------------+--------------------+--------------------+
|                Gold|With the sudden d...|(262144,[3048,905...|
|          The Choice|In a small coasta...|(262144,[5,11104,...|
|Middle School: Th...|Imaginative quiet...|(262144,[3072,218...|
|    Midnight Special|Alton Meyer is a ...|(262144,[429,9129...|
|     A Monster Calls|The monster does ...|(262144,[4360,178...|
|Pride and Prejudi...|The five highly t...|(262144,[5923,127...|
|Mike and Dave Nee...|Hard-partying bro...|(262144,[7182,155...|
|             Snowden|SNOWDEN stars Jos...|(262144,[12810,38...|
|             The Boy|Greta, a young Am...|(262144,[14376,19...|
|Miracles from Heaven|MIRACLES FROM HEA...|(262144,[61,9420,...|
|            The Boss|A titan of indust...|(262144,[3280,942...|
|             Arrival|Linguistics profe...|(262144,[24980,27...|
|       Gods of Egypt|Set

## Export


In [43]:
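from google.colab import drive
import pandas as pd

# Mount Google Drive so the output file can be written to it
drive.mount('/drive')

# Collect the selected columns to the driver as a pandas DataFrame
final_df = cleaned.select(['name', 'plot', 'idf_token']).toPandas()

# Write the features to CSV without the pandas index column
final_df.to_csv('/drive/My Drive/BC/bc-data/finalproject/plot_features.csv', index=False)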



Drive already mounted at /drive; to attempt to forcibly remount, call drive.mount("/drive", force_remount=True).
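
When the exported file is read back, the `idf_token` column is plain text rather than a vector. The sketch below parses one such cell, assuming it was written in the same `(size,[indices],[values])` form that `show()` prints above. Here `262144` is the vector length (the default number of hash buckets), followed by the non-zero positions and their weights. The function name `parse_sparse_vector` and the sample values are hypothetical.

```python
import re

# Matches text such as "(262144,[5,11104],[1.5,0.25])"
SPARSE_PATTERN = re.compile(r"^\((\d+),\[([^\]]*)\],\[([^\]]*)\]\)$")


def parse_sparse_vector(text):
    """Split a '(size,[indices],[values])' string into its three parts."""
    match = SPARSE_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a sparse vector string: {text!r}")
    size = int(match.group(1))
    indices = [int(i) for i in match.group(2).split(",") if i]
    values = [float(v) for v in match.group(3).split(",") if v]
    if len(indices) != len(values):
        raise ValueError("indices and values differ in length")
    return size, indices, values


# Hypothetical cell content, not a value from the dataset
size, indices, values = parse_sparse_vector("(262144,[5,11104],[1.5,0.25])")
```

If the features are needed again as vectors, saving the DataFrame in a format that preserves column types, such as Parquet, avoids this parsing step.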
