## NLP Preprocessing

This notebook contains the preprocessing conducted for the data to be used in the development of an NLP model.

### 1. Spark

We start Spark, import the necessary functions, register the UDFs from "preproc.py", and then read the file.

In [1]:
import findspark
findspark.init("/usr/local/spark/")

import collections

from pyspark.sql import SparkSession
from pyspark.sql import SQLContext

spark = SparkSession.builder \
   .master("local") \
   .appName("NLP1") \
   .config("spark.executor.memory", "1gb") \
   .config("spark.sql.random.seed", "1234") \
   .getOrCreate()
      
sc = spark.sparkContext


sqlContext = SQLContext(sc)

In [2]:
# Other packages
import sys
!{sys.executable} -m pip install nltk --no-cache-dir
!{sys.executable} -m pip install langid --no-cache-dir

import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')
nltk.download('wordnet')



[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### 1. Register functions

In [3]:
# Register functions
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
import preproc as pp


# Refer to https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.functions.udf.html
# `pp.check_lang` is used to classify the language of our input text
check_lang_udf = udf(pp.check_lang, StringType())


# removes stop words (cleaned_str/row/document)
remove_stops_udf = udf(pp.remove_stops, StringType())


# catch-all to remove other 'words' that I felt didn't add a lot of value
remove_features_udf = udf(pp.remove_features, StringType())


# Process of classifying words into their parts of speech and labeling them accordingly is known as part-of-speech
# tagging, POS-tagging, or simply tagging.
# http://www.nltk.org/book/ch05.html
tag_and_remove_udf = udf(pp.tag_and_remove, StringType())


# lemmatize
# http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
lemmatize_udf = udf(pp.lemmatize, StringType())


# check to see if a row only contains whitespace
check_blanks_udf = udf(pp.check_blanks, StringType())

### 1. Read file

In [4]:
# Load data
from pyspark.sql.functions import when
from pyspark.sql.utils import AnalysisException 

try: 
    file = (
        spark.read.format("csv")
        .option("header", "true")
        .option("multiline", "true")
        .option("quote", "\"")
        .option("escape", "\"")
        .load("../Final Preprocessing/final.csv")
    )
    file.show(1)
except AnalysisException: 
    print("Please check the Filename and Filepath")

+---+--------------------+--------------------+--------------------+--------------------+----------+-----------+--------+------------+------+---+--------+----+-----------+------------+----------------+--------+-------+------+--------------+-----------+--------------------+---------+
|_c0|          track_name|              artist|            track_id|          album_name|popularity|duration_ms|explicit|danceability|energy|key|loudness|mode|speechiness|acousticness|instrumentalness|liveness|valence| tempo|time_signature|track_genre|              lyrics|billboard|
+---+--------------------+--------------------+--------------------+--------------------+----------+-----------+--------+------------+------+---+--------+----+-----------+------------+----------------+--------+-------+------+--------------+-----------+--------------------+---------+
|  0|"""Martha: """"M'...|Friedrich von Flo...|1NzZWhNIP9DIX4yy0...|The World's Best ...|        23|     204706|   False|       0.222| 0.195|  5| -1

In [5]:
# Check size of loaded data
file.count()

61762
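The `multiline`, `quote` and `escape` options in the read cell matter because lyrics contain line breaks and quotation marks. As a rough illustration using only Python's standard `csv` module (no Spark involved, and the sample row is made up), the sketch below shows why counting physical lines would overstate the number of songs, while a quote-aware parser recovers the original records:

```python
import csv
import io

# Made-up sample row: the lyrics field contains a newline and double quotes.
rows = [["id", "lyrics"], ["0", 'She said "hello"\nand walked away']]

# Write the rows as CSV; the writer quotes the field and doubles inner quotes.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
raw = buffer.getvalue()

# Splitting on line breaks sees more physical lines than logical records.
physical_lines = raw.splitlines()

# A quote-aware parser restores the two records and the original text.
parsed = list(csv.reader(io.StringIO(raw)))

print(len(physical_lines), len(parsed))
print(parsed[1][1] == rows[1][1])
```

This is also why `file.count()` is a more reliable size check than counting lines in the raw file.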

### 2. Change the billboard variable into integers instead of strings

Next we convert the billboard variable from a string to an integer, where 1 indicates that the song was in the Billboard charts and 0 indicates that it was not. We do this because a string label would cause an error during the model training phase. Be careful: do not run this cell twice without reading the file in again, otherwise the file might become empty. Note that the cell below is currently wrapped in a string literal and is therefore not executed, which is why `label` still appears as a string in the schema printed in step 3.

In [6]:
"""from pyspark.sql.functions import col

#file = file.filter(col("billboard").isin("True", "False", "true", "false"))
file = file.withColumn('billboard', when((col('billboard') == True), 1).otherwise(0))"""

'from pyspark.sql.functions import col\n\n#file = file.filter(col("billboard").isin("True", "False", "true", "false"))\nfile = file.withColumn(\'billboard\', when((col(\'billboard\') == True), 1).otherwise(0))'
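One way to make this conversion safe to run more than once is to map every accepted spelling of the flag explicitly, including the already-converted values. The helper below is a hypothetical sketch in plain Python (`billboard_to_int` is not part of "preproc.py"); it could in principle be wrapped with `udf(..., IntegerType())`, but that wiring is an assumption and is not shown here:

```python
from typing import Optional, Union


def billboard_to_int(value: Union[str, int, bool, None]) -> Optional[int]:
    """Map a billboard flag to 1 or 0; return None for unrecognised input."""
    if value is None:
        return None
    # Normalise so that "True", "true", True and 1 are all treated alike.
    text = str(value).strip().lower()
    if text in {"true", "1"}:
        return 1
    if text in {"false", "0"}:
        return 0
    return None


# Applying the mapping twice gives the same result as applying it once.
once = billboard_to_int("True")
twice = billboard_to_int(once)
print(once, twice)
```

Returning `None` for unexpected values keeps malformed rows visible instead of silently labelling them 0.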

In [7]:
"""#file.printSchema()
file.show(1)"""

'#file.printSchema()\nfile.show(1)'

### 3. Select relevant columns and rename them

Now we select only the variables of interest: the lyrics of the song, the variable indicating whether the song was in the Billboard charts, and an id variable. We also rename the columns, because I previously made a mistake with the names and, instead of changing all my code, I chose to rename them with a few extra lines of code.

In [8]:
data = file.select("_c0","billboard","lyrics")
data = data.withColumnRenamed("_c0", "id")
data = data.withColumnRenamed("billboard", "label")
data = data.withColumnRenamed("lyrics", "text")

In [9]:
data.printSchema()

root
 |-- id: string (nullable = true)
 |-- label: string (nullable = true)
 |-- text: string (nullable = true)



### 4. Assign the language of each song to a new column

Now we start preprocessing. We use a UDF to detect the language of each song and add a new column, "lang", that stores the result.

In [10]:
lang_df = data.withColumn("lang", check_lang_udf(data["text"]))
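Since the body of `pp.check_lang` is not shown in this notebook, the following is only a sketch of how such a function might guard against missing or blank lyrics before calling a classifier. The names `make_check_lang` and `toy_classify` are hypothetical; the toy classifier stands in for a real one such as `langid.classify`, which returns a `(language, score)` tuple:

```python
from typing import Callable, Optional, Tuple

Classifier = Callable[[str], Tuple[str, float]]


def make_check_lang(classify: Classifier) -> Callable[[Optional[str]], Optional[str]]:
    """Build a language checker around any (language, score) classifier."""

    def check_lang(text: Optional[str]) -> Optional[str]:
        # Null or whitespace-only lyrics carry no language signal.
        if text is None or not text.strip():
            return None
        language, _score = classify(text)
        return language

    return check_lang


def toy_classify(text: str) -> Tuple[str, float]:
    """Stand-in classifier for illustration only; not a real language model."""
    words = set(text.lower().split())
    return ("en", 1.0) if words & {"the", "and", "you"} else ("und", 0.0)


check_lang = make_check_lang(toy_classify)
print(check_lang("You and the night"))
print(check_lang("   "))
```

Passing the classifier in as an argument keeps the null handling testable without installing a language-detection package.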

### 5. Create two datasets, one with all languages and one with only English

In addition to the existing dataset, we create another one consisting only of English songs. The idea behind this is that the model might perform better or worse on one of the two, because most Billboard songs are in English. To check whether this assumption holds, we create one dataset of "unfiltered" songs (in any language) and one of "onlyEnglish" songs.

#### Version 1: do not filter text by language

In [11]:
lang_df_unfiltered = lang_df

#### Version 2: filter text by language

We do this so that we can later see whether the model performs equally well when it is fed only English songs.

In [12]:
lang_df_onlyEnglish = lang_df.filter(lang_df["lang"] == "en")

Checking the results might take a while:

In [13]:
#language_counts = lang_df.select('lang').groupBy('lang').count().show()

In [14]:
#language_counts = lang_df_onlyEnglish.select('lang').groupBy('lang').count().show()

### 6. Next we remove stop words

From here on, we do every processing step twice: once for the data with all songs (unfiltered) and once for the data with only English songs (onlyEnglish). Each step builds on the previous one, meaning that its transformation is applied to the column created in the step before. In this first step, we create a new column that contains the lyrics of the song without the stop words defined by our UDF. Stop words are some of the most common words in a language. Removing them reduces dimensionality, which makes the whole program a bit faster. Since they appear in most songs, they probably also do little to distinguish one song from another.

In [15]:
rm_stops_unfiltered = lang_df_unfiltered.withColumn("stop_text", remove_stops_udf(lang_df_unfiltered["text"]))
rm_stops_onlyEnglish = lang_df_onlyEnglish.withColumn("stop_text", remove_stops_udf(lang_df_onlyEnglish["text"]))
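To make the idea concrete, here is a minimal plain-Python sketch of stop-word removal. It is not the code from "preproc.py": the tiny `STOP_WORDS` set is an assumption for illustration, whereas the real UDF presumably relies on a fuller list such as the NLTK stop words downloaded above:

```python
from typing import Optional

# Illustrative subset only; a real stop-word list is much longer.
STOP_WORDS = {"the", "a", "an", "and", "is", "in", "of", "to", "i", "you"}


def remove_stops(text: Optional[str]) -> Optional[str]:
    """Return the text without stop words, comparing case-insensitively."""
    if text is None:
        return None
    kept = [word for word in text.split() if word.lower() not in STOP_WORDS]
    return " ".join(kept)


print(remove_stops("I walk in the rain"))
```

Note that an English stop-word list has little effect on lyrics in other languages, which is one way the unfiltered and onlyEnglish datasets can end up differing after this step.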

### 7. Remove some other words that might not be necessary

Next we use a catch-all UDF that removes other words or strings that are probably unnecessary. This further reduces dimensionality and makes the later steps faster.

In [16]:
rm_features_df_unfiltered = rm_stops_unfiltered.withColumn("feat_text", remove_features_udf(rm_stops_unfiltered["stop_text"]))
rm_features_df_onlyEnglish = rm_stops_onlyEnglish.withColumn("feat_text", remove_features_udf(rm_stops_onlyEnglish["stop_text"]))
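What exactly `pp.remove_features` strips is not shown here, so the following is only a guess at what a catch-all cleaner of this kind might do: lowercase the text, drop URLs, replace everything except ASCII letters with spaces, and discard very short tokens. The patterns and the `min_length` parameter are assumptions:

```python
import re
from typing import Optional

URL_PATTERN = re.compile(r"https?://\S+")
NON_LETTER_PATTERN = re.compile(r"[^a-z\s]")


def remove_features(text: Optional[str], min_length: int = 3) -> Optional[str]:
    """Lowercase, strip URLs and non-letters, and drop very short tokens."""
    if text is None:
        return None
    lowered = text.lower()
    without_urls = URL_PATTERN.sub(" ", lowered)
    letters_only = NON_LETTER_PATTERN.sub(" ", without_urls)
    tokens = [token for token in letters_only.split() if len(token) >= min_length]
    return " ".join(tokens)


print(remove_features("Oh!! Visit http://example.com 2 times, baby"))
```

A pattern restricted to `a-z` also removes accented and non-Latin characters, which would be a significant loss for the unfiltered dataset; a cleaner meant for all languages would need a broader character class.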

### 8. POS-tagging the words

POS-tagging is the process of classifying words into their word classes or lexical categories. That is, we classify words as different types of nouns, adjectives and verbs. The tagger can not only determine which word class a word belongs to in a specific context, but can also guess which word type a word is based on its root. After tagging all the words, we keep only those that fit into our categories (all nouns, adjectives and verbs). I couldn't really figure out why exactly we do that, but my guess is that it again decreases dimensionality.

In [17]:
tagged_df_unfiltered = rm_features_df_unfiltered.withColumn("tagged_text", tag_and_remove_udf(rm_features_df_unfiltered["feat_text"]))
tagged_df_onlyEnglish = rm_features_df_onlyEnglish.withColumn("tagged_text", tag_and_remove_udf(rm_features_df_onlyEnglish["feat_text"]))
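The filtering half of this step can be sketched without running a tagger. The example below assumes the input is already a list of `(word, tag)` pairs with Penn Treebank tags, which is the shape `nltk.pos_tag` returns; the function name `keep_content_words` and the hand-tagged sample are hypothetical:

```python
from typing import List, Tuple

# Penn Treebank tag prefixes for nouns, adjectives and verbs.
KEPT_TAG_PREFIXES = ("NN", "JJ", "VB")


def keep_content_words(tagged: List[Tuple[str, str]]) -> str:
    """Keep only words whose tag marks a noun, adjective or verb."""
    kept = [word for word, tag in tagged if tag.startswith(KEPT_TAG_PREFIXES)]
    return " ".join(kept)


# Hand-tagged sample sentence, for illustration only.
tagged_example = [
    ("she", "PRP"),
    ("sings", "VBZ"),
    ("a", "DT"),
    ("sad", "JJ"),
    ("song", "NN"),
    ("slowly", "RB"),
]
print(keep_content_words(tagged_example))
```

Matching on prefixes covers the tag variants in one go, for example `NNS` for plural nouns or `VBG` for gerunds.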

### 9. Lemmatize

We lemmatize the words, meaning that we reduce the inflected forms of a word to a common base form (the lemma). For example, "sing", "sings" and "singing" are all mapped to "sing". Grouping derivationally related words such as "democracy", "democratic" and "democratization" under "democracy" goes further than that and is closer to what a stemmer does. Lemmatizing helps the model because there are fewer distinct words left to compare to each other.

In [18]:
lemm_df_unfiltered = tagged_df_unfiltered.withColumn("lemm_text", lemmatize_udf(tagged_df_unfiltered["tagged_text"]))
lemm_df_onlyEnglish = tagged_df_onlyEnglish.withColumn("lemm_text", lemmatize_udf(tagged_df_onlyEnglish["tagged_text"]))
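One detail that often matters here: NLTK's `WordNetLemmatizer.lemmatize` treats a word as a noun unless a part of speech is passed, so "singing" is only reduced to "sing" when it is marked as a verb. Whether `pp.lemmatize` does this is not visible in this notebook; the helper below is a hypothetical sketch of the usual mapping from Penn Treebank tags to WordNet's single-letter POS codes:

```python
def penn_to_wordnet(tag: str) -> str:
    """Translate a Penn Treebank tag into a WordNet POS code."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    # Nouns and anything unrecognised fall back to the noun code.
    return "n"


for tag in ["VBG", "JJ", "NNS", "RB", "DT"]:
    print(tag, penn_to_wordnet(tag))
```

The result would be passed as the `pos` argument, for example `WordNetLemmatizer().lemmatize(word, pos=penn_to_wordnet(tag))`, which requires the tags from the previous step to still be available.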

### 10. Check blanks

We check whether there are cells of text that contain only whitespace. First, we go through all rows and create a new column indicating whether the lemmatized text of a row consists only of whitespace. We then remove such rows using `.filter()`.

In [19]:
check_blanks_df_unfiltered = lemm_df_unfiltered.withColumn("is_blank", check_blanks_udf(lemm_df_unfiltered["lemm_text"]))
no_blanks_df_unfiltered = check_blanks_df_unfiltered.filter(check_blanks_df_unfiltered["is_blank"] == "False")

check_blanks_df_onlyEnglish = lemm_df_onlyEnglish.withColumn("is_blank", check_blanks_udf(lemm_df_onlyEnglish["lemm_text"]))
no_blanks_df_onlyEnglish = check_blanks_df_onlyEnglish.filter(check_blanks_df_onlyEnglish["is_blank"] == "False")
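Because `check_blanks_udf` is registered with `StringType()`, the filter above compares against the string "False" rather than a boolean. A function consistent with that comparison could look like the sketch below; this is an assumption about the behaviour of `pp.check_blanks`, not its actual code:

```python
from typing import Optional


def check_blanks(text: Optional[str]) -> str:
    """Return "True" if the text is missing or only whitespace, else "False"."""
    is_blank = text is None or text.strip() == ""
    # Return a string so the value matches a StringType column.
    return str(is_blank)


print(check_blanks("   "))
print(check_blanks("sing song"))
```

Returning a real boolean with `BooleanType()` would be the more conventional design, but the filter condition would then have to change accordingly.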

### 11. Deduplication

We drop duplicates based on "text" (the original lyrics column) and "label" (whether a song was in the charts or not). Technically we shouldn't have to do this, but we do it anyway to be on the safe side.

In [20]:
dedup_df_unfiltered = no_blanks_df_unfiltered.dropDuplicates(['text', 'label'])
dedup_df_onlyEnglish = no_blanks_df_onlyEnglish.dropDuplicates(['text', 'label'])
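To illustrate what deduplicating on the pair ("text", "label") means, here is a plain-Python sketch with made-up rows. It keeps the first row for each key, whereas Spark's `dropDuplicates` does not promise which of the duplicate rows survives:

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]


def drop_duplicates(rows: List[Row], keys: Tuple[str, ...]) -> List[Row]:
    """Keep the first row seen for each combination of key values."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[name] for name in keys)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Made-up rows: ids 0 and 1 are exact duplicates on (text, label).
sample = [
    {"id": "0", "text": "la la la", "label": "True"},
    {"id": "1", "text": "la la la", "label": "True"},
    {"id": "2", "text": "la la la", "label": "False"},
]
deduplicated = drop_duplicates(sample, ("text", "label"))
print([row["id"] for row in deduplicated])
```

Because "label" is part of the key, the same lyrics with two different labels are both kept, so conflicting labels for one song would survive this step.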

### 12. Select relevant data again

Now we select the relevant columns, which are "label", "id" and "lemm_text" (which is the last version of our lyrics after all the transformations we did).

In [21]:
data_set_unfiltered = dedup_df_unfiltered.select(dedup_df_unfiltered['id'], dedup_df_unfiltered['lemm_text'], dedup_df_unfiltered['label'])
data_set_onlyEnglish = dedup_df_onlyEnglish.select(dedup_df_onlyEnglish['id'], dedup_df_onlyEnglish['lemm_text'], dedup_df_onlyEnglish['label'])

In [22]:
#data_set_unfiltered.printSchema()

In [23]:
#data_set_onlyEnglish.printSchema()

### 13. Drop NAs

We drop rows with missing values to be on the safe side. This doesn't work perfectly: `na.drop()` only removes null values, not whitespace-only strings, so the final file will still contain some text cells with only whitespace. We remove those in the other notebook.

In [24]:
data_set_unfiltered = data_set_unfiltered.na.drop()
data_set_onlyEnglish = data_set_onlyEnglish.na.drop()

### 14. Rename column

We rename the "lemm_text" column to "text", as this is the name we use in the other notebooks as well. 

In [25]:
data_unfiltered = data_set_unfiltered.withColumnRenamed("lemm_text", "text")
data_onlyEnglish = data_set_onlyEnglish.withColumnRenamed("lemm_text", "text")

In [26]:
#data_unfiltered.printSchema()

In [27]:
#data_onlyEnglish.printSchema()

### 15. Save processed full data

Finally we save the data. The cell is commented out because it has already been run and takes a while; since we already have the data, there is no need to run it again.

In [28]:
"""import pandas
data_unfiltered.toPandas().to_csv('../Final Preprocessing/NLP_processed_unfiltered.csv')
data_onlyEnglish.toPandas().to_csv('../Final Preprocessing/NLP_processed_onlyEnglish.csv')"""