# The evolution of the language in films

In [3]:
import pandas as pd
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
from pyspark.sql.types import *
import pyspark.sql.functions as F

#import findspark
#findspark.init()

from pyspark.sql import *
%matplotlib inline

# Later cells use `spark`, so create (or reuse) the session explicitly.
spark = SparkSession.builder.getOrCreate()
sqlContext = SQLContext(spark.sparkContext)

## 2 Subtitle analysis 

### 2.1 Preprocessing steps

Before starting the subtitle analysis, we preprocess the subtitle texts to bring them into a more suitable format and to remove undesirable parts.

As previously mentioned in the parsing section, sentences are stored as lists of strings. For instance, the sentence "You're a lovely person." is represented by the following list:
`["You", "'re", "a", "lovely", "person", "."]`.


There are a few types of words that we do not want in the analysis: common words that add no value or meaning to the text. One such category is stop words (https://en.wikipedia.org/wiki/Stop_words).

We also do not care about punctuation (https://en.wikipedia.org/wiki/Punctuation), so we will remove it as well.

We also want to transform each word into its "base" form. For instance, the words take, took and taken should be treated as a single word in the analysis and not as three different ones. We will use lemmatisation (https://en.wikipedia.org/wiki/Lemmatisation) to turn "took" and "taken" into their base verb form, which is "take". Beyond irregular verb forms, we also want to cover related cases, such as transforming plural words into singular ones and removing the -ing ending of words (walking -> walk).

Note from the above example sentence that contracted words (https://en.wikipedia.org/wiki/Wikipedia:List_of_English_contractions) are represented as two words in the sentence list. This makes sense when calculating the length of each sentence, since a contracted word is actually two words. However, the part on the right-hand side of the apostrophe in a contracted word does not add any value to our analysis. Therefore, we will drop any word that starts with an apostrophe.

Finally, we also want to transform each word into lower case. We want the words "Take" and "take" to be treated as the same word.

To summarize, we apply the following preprocessing steps to our subtitle data:

1. Transform each word into lower case 
2. Remove stop words
3. Remove punctuation 
4. Lemmatize words
5. Remove words which start with an apostrophe
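To make the five steps concrete, here is a minimal plain-Python sketch that applies them, in the listed order, to a tokenized sentence. It is only an illustration and not the Spark code used in this notebook: `TOY_STOP_WORDS`, `TOY_LEMMAS` and `preprocess` are hypothetical names, and the tiny hand-written word lists stand in for NLTK's stop word list and lemmatizer.

```python
import string

# Toy stand-ins, for illustration only. The notebook itself relies on
# NLTK's English stop word list and WordNet lemmatizer instead.
TOY_STOP_WORDS = {"you", "a", "she", "the"}
TOY_LEMMAS = {"took": "take", "taken": "take", "walking": "walk", "dogs": "dog"}
PUNCTUATION = set(string.punctuation)


def preprocess(tokens):
    """Apply the five preprocessing steps to one tokenized sentence."""
    words = [t.lower() for t in tokens]                     # 1. lower case
    words = [w for w in words if w not in TOY_STOP_WORDS]   # 2. stop words
    words = [w for w in words if w not in PUNCTUATION]      # 3. punctuation
    words = [TOY_LEMMAS.get(w, w) for w in words]           # 4. base form
    words = [w for w in words if not w.startswith("'")]     # 5. apostrophe
    return words


example = preprocess(["You", "'re", "a", "lovely", "person", "."])
another = preprocess(["She", "took", "the", "dogs", "walking", "."])
print(example)  # ['lovely', 'person']
print(another)  # ['take', 'dog', 'walk']
```

Lower-casing comes first so that "You" matches the lower-case stop word "you". The punctuation check compares whole tokens against single punctuation characters, so "'re" survives step 3 and is only removed in step 5.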


In [None]:
subtitle_df = spark.read.parquet("./subtitle_data.parquet")

(1) Let's convert each word into its lower case representation:

In [None]:
subtitle_df = subtitle_df.select('_id', F.lower(F.col('w')).alias('word'))

We will use NLTK (https://www.nltk.org/) for point 2 in the above list. NLTK provides predefined lists of stop words for several different languages. 

In [None]:
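from nltk.corpus import stopwords

# Requires the NLTK stop word corpus, e.g. via nltk.download('stopwords').
stop_words = set(stopwords.words('english'))
subtitle_df = subtitle_df\
                    .filter(~subtitle_df.word.isin(stop_words))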

For point 3, we can use the built-in punctuation list of Python's string module.

In [None]:
import string

# string.punctuation is a string of single punctuation characters.
punctuation_list = list(string.punctuation)
subtitle_df = subtitle_df\
                    .filter(~subtitle_df.word.isin(punctuation_list))

The NLTK library also provides a lemmatizer API, which we will use for point 4.

In [None]:
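from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# The 'v' argument makes the lemmatizer treat every word as a verb (took -> take).
lemmatize = F.udf(lambda x: lemmatizer.lemmatize(x, 'v'), StringType())
subtitle_df = subtitle_df.select('_id', lemmatize(F.col('word')).alias('word'))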