# Spark Dataframes (and RDDs) for manipulating text

On the login node of your Cirrus account, run the following commands to install the NLTK data packages that we will use later.

```
[XXX@cirrus-login0 lab_exercises]$ module load anaconda/python3
[XXX@cirrus-login0 lab_exercises]$ python
Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19) 
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk
>>> nltk.download('wordnet')
>>> nltk.download('punkt')
>>> nltk.download('stopwords')
```

The aim of this exercise is to learn how to manipulate data using Spark DataFrames. We are going to work with a subset of the Encyclopaedia Britannica from the [National Library of Scotland](https://data.nls.uk/data/digitised-collections/encyclopaedia-britannica/). We previously downloaded the full dataset, ingested the data, and produced a subsample dataset called *"nls_demo.csv"* (the one we are going to use) with a Spark tool called [defoe](https://github.com/alan-turing-institute/defoe/blob/master/docs/nls_demo_examples/nls_demo_individual_queries.md).

In [None]:
import nltk
import string

This is the format of the CSV file:
title,edition,year,place,archive_filename,page_filename,page_id,num_pages,type_archive,model,preprocess,page_string

In [None]:
# Read the CSV file into a DataFrame, using the first line as the header
df= sqlContext.read.csv("/lustre/home/shared/y15/spark/data/nls_demo.csv", header="true")
df.show(3)

In [None]:
# Keep the pages that are not null, group them by year, and count the pages per year
df.filter(df.page_string.isNotNull()).select(df.year, df.page_string).groupby(df.year).count().show()

In [None]:
# Check which rows have the literal value "year" in the column "year"
df[df.year.like("year")].collect()

In [None]:
# Filter the data again: keep pages that are not null and years that are not "year", select two columns, and count the rows per year
df.filter(df.page_string.isNotNull()).filter(df["year"]!="year").select(df.year, df.page_string).groupby(df.year).count().show()
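
To make the filter, group, and count chain above more concrete, here is a minimal plain-Python sketch of the same logic. It does not use Spark, and the rows are made-up stand-ins for `(year, page_string)` pairs rather than values from *nls_demo.csv*.

```python
from collections import Counter

# Made-up rows standing in for (year, page_string) pairs; not real data.
rows = [
    ("1771", "first page"),
    ("year", "page_string"),   # row that repeats the column names
    ("1771", None),            # page with no text
    ("1810", "another page"),
    ("1771", "third page"),
]

# Keep rows whose page text is present and whose year is not the literal "year".
kept = [(year, page) for year, page in rows
        if page is not None and year != "year"]

# Count the surviving rows per year, as groupby(...).count() does.
pages_per_year = Counter(year for year, _ in kept)
print(dict(pages_per_year))
```

Spark evaluates the same steps lazily and in parallel across partitions, but the result has the same shape: one count per distinct year.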

In [None]:
# Same as before, but grouping by place
df.filter(df.page_string.isNotNull()).filter(df["year"]!="year").select(df.place, df.page_string).groupby(df.place).count().show()

In [None]:
# Group by year, keeping only the years between 1773 and 1842 (inclusive)
df.filter(df.page_string.isNotNull()).filter(df["year"]!="year").filter(df.year.between(1773, 1842)).select(df.place, df.edition, df.page_string, df.year).groupby(df.year).count().show()

In [None]:
# Show the rows for the same range of years instead of counting them
df.filter(df.page_string.isNotNull()).filter(df["year"]!="year").filter(df.year.between(1773, 1842)).select(df.place, df.edition, df.page_string, df.year).show()

In [None]:
# Same as before, but keeping only the editions whose name starts with "Second"
df.filter(df.page_string.isNotNull()).filter(df["year"]!="year").filter(df.year.between(1773, 1842)).filter(df.edition.startswith("Second")).select(df.place, df.edition, df.page_string, df.year).show()

In [None]:
# Now let's create a DataFrame that keeps only the non-null pages and selects just the year and page_string columns
newdf=df.filter(df.page_string.isNotNull()).select(df.year, df.page_string)
# And check the schema of the new DataFrame
newdf.printSchema()

In [None]:
# Count the number of rows
newdf.count()

In [None]:
# Show the first 20 rows
newdf.show()

In [None]:
# Convert the DataFrame to an RDD of tuples, which is better suited to processing unstructured text
pages=newdf.rdd.map(tuple)

In [None]:
pages.take(8)

In [None]:
def sent_TokenizeFunct(x):
    # Split one page of text into a list of sentences
    return nltk.sent_tokenize(x)

In [None]:
sentenceTokenizeRDD = pages.map(lambda p: sent_TokenizeFunct(p[1]))

In [None]:
sentenceTokenizeRDD.take(5)

In [None]:
def word_TokenizeFunct(x):
    splitted = [word for line in x for word in line.split()]
    return splitted
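
The nested comprehension in `word_TokenizeFunct` splits every sentence on whitespace and flattens the result into a single list of words. A minimal sketch on made-up sentences (the function name `word_tokenize_sentences` is only illustrative):

```python
def word_tokenize_sentences(sentences):
    """Split each sentence on whitespace and flatten into one list of words."""
    return [word for sentence in sentences for word in sentence.split()]

# Made-up input: one page already split into sentences.
sample = ["The first sentence.", "A second  one, with extra spaces."]
words = word_tokenize_sentences(sample)
print(words)
```

Note that `split()` with no argument collapses repeated spaces, and that punctuation stays attached to the words ("sentence.", "one,"), which is why a later step strips it.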

In [None]:
wordTokenizeRDD = sentenceTokenizeRDD.map(word_TokenizeFunct)

In [None]:
wordTokenizeRDD.take(5)

In [None]:
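def removeStopWordsFunct(x):
    # Runs on the workers for every page, so keep the download quiet
    nltk.download('stopwords', quiet=True)
    from nltk.corpus import stopwords
    stop_words = set(stopwords.words('english'))
    # Keep only the words that are not in the English stop-word list
    filteredSentence = [w for w in x if w not in stop_words]
    return filteredSentence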

stopwordRDD = wordTokenizeRDD.map(removeStopWordsFunct)
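
The NLTK English stop-word list is lower case, so a capitalised token such as "The" passes through the filter above unchanged. If that matters for your analysis, one option is to compare in lower case. The sketch below assumes a tiny hand-written stop-word set (`STOP_WORDS`, a stand-in for the NLTK list) so that it runs without any downloads.

```python
# Tiny stand-in stop-word set; the notebook uses NLTK's English list instead.
STOP_WORDS = frozenset({"the", "a", "of", "and", "is"})

def remove_stop_words(words, stop_words=STOP_WORDS):
    """Drop stop words, comparing in lower case so "The" matches "the"."""
    return [w for w in words if w.lower() not in stop_words]

cleaned = remove_stop_words(["The", "history", "of", "Scotland", "is", "long"])
print(cleaned)
```

The original casing of the surviving words is kept; only the comparison is case-insensitive.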

In [None]:
stopwordRDD.take(10)

In [None]:
def removePunctuationsFunct(x):
    list_punct = list(string.punctuation)
    # Strip punctuation characters from every word
    filtered = [''.join(c for c in s if c not in list_punct) for s in x]
    # Drop the strings left empty once their punctuation is removed
    filtered_space = [s for s in filtered if s]
    return filtered_space
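
As an alternative sketch of the same step, `str.translate` with a deletion table removes punctuation without a character-by-character loop. The names `PUNCT_TABLE` and `remove_punctuation` are illustrative and not part of the notebook.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punctuation(words):
    """Strip ASCII punctuation from each word and drop words left empty."""
    stripped = (w.translate(PUNCT_TABLE) for w in words)
    return [w for w in stripped if w]

no_punct = remove_punctuation(["sentence.", "one,", "--", "well-known"])
print(no_punct)
```

Like the function above, this only handles the ASCII characters in `string.punctuation`; typographic quotes or dashes in the scanned text would need to be added to the table.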

In [None]:
rmvPunctRDD = stopwordRDD.map(removePunctuationsFunct)

In [None]:
rmvPunctRDD.take(10)

In [None]:
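def lemmatizationFunct(x):
    # Runs on the workers for every page, so keep the download quiet
    nltk.download('wordnet', quiet=True)
    lemmatizer = nltk.WordNetLemmatizer()
    # Reduce every word to its WordNet lemma (treated as a noun by default)
    finalLem = [lemmatizer.lemmatize(s) for s in x]
    return finalLem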

In [None]:
lem_wordsRDD = rmvPunctRDD.map(lemmatizationFunct)

In [None]:
lem_wordsRDD.take(10)