# 01-Preprocessing

The first NLP exercise is about preprocessing.

You will practice preprocessing raw data with NLTK.
This is the first step in most NLP projects, so it is worth mastering.

We will play with the *coldplay.csv* dataset, containing all the songs and lyrics of Coldplay.

As you know, the first step is to import some libraries. So import *nltk* as well as all the libraries you will need.

In [1]:
# Import NLTK and all the needed libraries
import nltk
nltk.download('punkt')  # Run once to download the resource
nltk.download('stopwords')  # Run once to download the resource
nltk.download('wordnet')  # Run once to download the resource
nltk.download('averaged_perceptron_tagger')  # Run once to download the resource
import numpy as np
import pandas as pd

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Avry\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Avry\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Avry\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Avry\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


Now load the dataset using pandas.

In [3]:
# TODO: Load the dataset in coldplay.csv
df = pd.read_csv('coldplay.csv')
df.head(10)

Unnamed: 0,Artist,Song,Link,Lyrics
0,Coldplay,Another's Arms,/c/coldplay/anothers+arms_21079526.html,Late night watching tv \nUsed to be you here ...
1,Coldplay,Bigger Stronger,/c/coldplay/bigger+stronger_20032648.html,I want to be bigger stronger drive a faster ca...
2,Coldplay,Daylight,/c/coldplay/daylight_20032625.html,"To my surprise, and my delight \nI saw sunris..."
3,Coldplay,Everglow,/c/coldplay/everglow_21104546.html,"Oh, they say people come \nThey say people go..."
4,Coldplay,Every Teardrop Is A Waterfall,/c/coldplay/every+teardrop+is+a+waterfall_2091...,"I turn the music up, I got my records on \nI ..."
5,Coldplay,Everything's Not Lost,/c/coldplay/everythings+not+lost_20032638.html,When I'm counting up my demons \nSaw there wa...
6,Coldplay,Fix You,/c/coldplay/fix+you_10069035.html,When you try your best but you don't succeed ...
7,Coldplay,For You,/c/coldplay/for+you_20032655.html,If you're lost and feel alone \nCircumnavigat...
8,Coldplay,Fun,/c/coldplay/fun_21104545.html,I know it's over before she says \nI know the...
9,Coldplay,Ghost Story,/c/coldplay/ghost+story_21083666.html,Maybe I'm just a ghost \nDisappear when anybo...


Now explore the dataset a bit: what are the columns? How many rows are there? Is any data missing?

In [6]:
import matplotlib.pyplot as plt
import seaborn as sns

In [9]:
# TODO: Explore the data
# 1. Number of columns
num_columns = df.shape[1]

# 2. Column names
column_names = df.columns.tolist()

# 3. Total rows
total_rows = df.shape[0]

# 4. Check for missing data in each column
missing_data = df.isnull().sum()

# 5. Check for duplicate rows in the DataFrame
duplicate_data = df.duplicated().sum()

# 6. Data type of each column
column_types = df.dtypes

# Additional Checks

# 7. Summary statistics
summary_stats = df.describe(include='all')


# 8. Outlier detection using IQR method
outliers = {}
for col in df.select_dtypes(include=['float64', 'int64']).columns:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    outliers[col] = df[(df[col] < (Q1 - 1.5 * IQR)) | (df[col] > (Q3 + 1.5 * IQR))]

# 9. Memory usage
memory_usage = df.memory_usage(deep=True)

# Display the results in a professional format
print("Dataset Overview:\n")
print(f"Number of Columns: {num_columns}")
print(f"Column Names: {', '.join(column_names)}")
print(f"Total Rows: {total_rows}\n")

print("Missing Data per Column:")
print(missing_data)
print(f"\nNumber of Duplicate Rows: {duplicate_data}\n")

print("Data Types of Each Column:")
print(column_types)
print("\nSummary Statistics:")
print(summary_stats)

print("\nPotential Outliers Detected:")
for col, outlier_data in outliers.items():
    print(f"\nColumn: {col}")
    print(outlier_data)

print("\nMemory Usage of DataFrame:")
print(memory_usage)

# Data Distribution Visualization
for col in df.select_dtypes(include=['float64', 'int64']).columns:
    plt.figure(figsize=(10, 6))
    sns.histplot(df[col], bins=30, kde=True)
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')
    plt.show()


Dataset Overview:

Number of Columns: 4
Column Names: Artist, Song, Link, Lyrics
Total Rows: 120

Missing Data per Column:
Artist    0
Song      0
Link      0
Lyrics    0
dtype: int64

Number of Duplicate Rows: 0

Data Types of Each Column:
Artist    object
Song      object
Link      object
Lyrics    object
dtype: object

Summary Statistics:
          Artist            Song                                     Link  \
count        120             120                                      120   
unique         1             120                                      120   
top     Coldplay  Another's Arms  /c/coldplay/anothers+arms_21079526.html   
freq         120               1                                        1   

                                                   Lyrics  
count                                                 120  
unique                                                120  
top     Late night watching tv  \nUsed to be you here ...  
freq                          

Now select the song 'Every Teardrop Is A Waterfall' and save the Lyrics text into a variable. Print the output of this variable.

In [10]:
# TODO: Select the song 'Every Teardrop Is A Waterfall'
lyrics_text = df.loc[df['Song'] == 'Every Teardrop Is A Waterfall', 'Lyrics'].iloc[0]

print(lyrics_text)

I turn the music up, I got my records on  
I shut the world outside until the lights come on  
Maybe the streets alight, maybe the trees are gone  
I feel my heart start beating to my favourite song  
  
And all the kids they dance, all the kids all night  
Until Monday morning feels another life  
I turn the music up  
I'm on a roll this time  
And heaven is in sight  
  
I turn the music up, I got my records on  
From underneath the rubble sing a rebel song  
Don't want to see another generation drop  
I'd rather be a comma than a full stop  
  
Maybe I'm in the black, maybe I'm on my knees  
Maybe I'm in the gap between the two trapezes  
But my heart is beating and my pulses start  
Cathedrals in my heart  
  
As we saw oh this light I swear you, emerge blinking into  
To tell me it's alright  
As we soar walls, every siren is a symphony  
And every tear's a waterfall  
Is a waterfall  
Oh  
Is a waterfall  
Oh oh oh  
Is a is a waterfall  
Every tear  
Is a waterfall  
Oh oh oh  


As you can see, there is some preprocessing needed here. So let's do it! What is usually the first step?

Tokenization, yes. So do tokenization on the lyrics of Every Teardrop Is A Waterfall.

You may need to import the required function from NLTK if you have not done so yet.

Be careful: selecting from a pandas DataFrame may return a Series rather than a string, so extract the value itself to get a plain string.
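
To make the type issue concrete, here is a minimal sketch on a tiny stand-in table. The name `toy_df` and its two rows are made up for illustration and are not part of *coldplay.csv*.

```python
import pandas as pd

# Hypothetical two-row stand-in for the real dataset
toy_df = pd.DataFrame({
    "Song": ["Fix You", "Fun"],
    "Lyrics": ["When you try your best", "I know it's over"],
})

# A boolean-mask selection returns a Series, even when only one row matches
selection = toy_df.loc[toy_df["Song"] == "Fun", "Lyrics"]
print(type(selection).__name__)

# Taking the first (and only) element gives a plain Python string
lyrics = selection.iloc[0]
print(type(lyrics).__name__)
```

A tokenizer such as `word_tokenize` expects the string, not the Series, so the last step matters.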

In [12]:
import nltk
from nltk.tokenize import word_tokenize

In [14]:
# TODO: Tokenize the lyrics of the song and save the tokens into a variable and print it
token = word_tokenize(lyrics_text)
token  

['I',
 'turn',
 'the',
 'music',
 'up',
 ',',
 'I',
 'got',
 'my',
 'records',
 'on',
 'I',
 'shut',
 'the',
 'world',
 'outside',
 'until',
 'the',
 'lights',
 'come',
 'on',
 'Maybe',
 'the',
 'streets',
 'alight',
 ',',
 'maybe',
 'the',
 'trees',
 'are',
 'gone',
 'I',
 'feel',
 'my',
 'heart',
 'start',
 'beating',
 'to',
 'my',
 'favourite',
 'song',
 'And',
 'all',
 'the',
 'kids',
 'they',
 'dance',
 ',',
 'all',
 'the',
 'kids',
 'all',
 'night',
 'Until',
 'Monday',
 'morning',
 'feels',
 'another',
 'life',
 'I',
 'turn',
 'the',
 'music',
 'up',
 'I',
 "'m",
 'on',
 'a',
 'roll',
 'this',
 'time',
 'And',
 'heaven',
 'is',
 'in',
 'sight',
 'I',
 'turn',
 'the',
 'music',
 'up',
 ',',
 'I',
 'got',
 'my',
 'records',
 'on',
 'From',
 'underneath',
 'the',
 'rubble',
 'sing',
 'a',
 'rebel',
 'song',
 'Do',
 "n't",
 'want',
 'to',
 'see',
 'another',
 'generation',
 'drop',
 'I',
 "'d",
 'rather',
 'be',
 'a',
 'comma',
 'than',
 'a',
 'full',
 'stop',
 'Maybe',
 'I',
 "'m",

It is starting to look good. But we still have the punctuation to remove, so let's do this.

In [15]:
import string

In [16]:
# TODO: Remove the punctuation, then save the result into a variable and print it
remove_punctuation_text = lyrics_text.translate(str.maketrans('', '', string.punctuation))
removed_punctuation_token = word_tokenize(remove_punctuation_text)
removed_punctuation_token

['I',
 'turn',
 'the',
 'music',
 'up',
 'I',
 'got',
 'my',
 'records',
 'on',
 'I',
 'shut',
 'the',
 'world',
 'outside',
 'until',
 'the',
 'lights',
 'come',
 'on',
 'Maybe',
 'the',
 'streets',
 'alight',
 'maybe',
 'the',
 'trees',
 'are',
 'gone',
 'I',
 'feel',
 'my',
 'heart',
 'start',
 'beating',
 'to',
 'my',
 'favourite',
 'song',
 'And',
 'all',
 'the',
 'kids',
 'they',
 'dance',
 'all',
 'the',
 'kids',
 'all',
 'night',
 'Until',
 'Monday',
 'morning',
 'feels',
 'another',
 'life',
 'I',
 'turn',
 'the',
 'music',
 'up',
 'Im',
 'on',
 'a',
 'roll',
 'this',
 'time',
 'And',
 'heaven',
 'is',
 'in',
 'sight',
 'I',
 'turn',
 'the',
 'music',
 'up',
 'I',
 'got',
 'my',
 'records',
 'on',
 'From',
 'underneath',
 'the',
 'rubble',
 'sing',
 'a',
 'rebel',
 'song',
 'Dont',
 'want',
 'to',
 'see',
 'another',
 'generation',
 'drop',
 'Id',
 'rather',
 'be',
 'a',
 'comma',
 'than',
 'a',
 'full',
 'stop',
 'Maybe',
 'Im',
 'in',
 'the',
 'black',
 'maybe',
 'Im',
 'on'
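
Note that stripping punctuation from the raw string also deletes apostrophes, which is why the output above contains 'Im', 'Dont' and 'Id'. One possible alternative, sketched here on a hand-written token list in the shape `word_tokenize` produces, is to tokenize first and then drop only the tokens made entirely of punctuation. The helper name `is_punctuation` is an assumption introduced for this example.

```python
import string

# Hand-written sample, shaped like word_tokenize output for a lyric fragment
sample_tokens = ['I', 'turn', 'the', 'music', 'up', ',', 'I', "'m", 'on', 'a', 'roll']

def is_punctuation(tok):
    """Return True when every character of the token is ASCII punctuation."""
    return all(ch in string.punctuation for ch in tok)

# Keep contractions such as "'m" intact, drop standalone marks such as ','
tokens_no_punct = [tok for tok in sample_tokens if not is_punctuation(tok)]
print(tokens_no_punct)
```

Which variant is preferable depends on how the later steps should treat contractions.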

We will now remove the stop words.

In [17]:

import nltk
from nltk.corpus import stopwords
 
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Avry\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [37]:
# TODO: remove the stop words using NLTK. Then put the result into a variable and print it
stop_words = set(stopwords.words('english'))
print(stop_words)
removed_stopwords_lyrics = [w for w in removed_punctuation_token if w.lower() not in stop_words]

# removed_stopwords_lyrics

{'his', "don't", 'yourself', 'll', 'as', 'of', 'only', 'your', 'we', 'yours', 'they', 'these', "haven't", 'off', 'have', 'before', 'after', 'down', 'about', 'mustn', 'no', 'from', 'should', "she's", 'doesn', 'i', "mightn't", "aren't", 'him', 'ain', 'doing', 'into', 'such', 'when', 'isn', 'all', "you'd", 'itself', "hadn't", 'and', 'very', 'between', 'couldn', 'been', 'but', 'if', 'because', 'too', 'few', 'did', 'be', 'in', 'can', "won't", 'nor', 'so', 'further', "couldn't", 'each', "that'll", 'on', "shouldn't", "doesn't", "wouldn't", 'any', 'ma', 'above', 'the', 're', 'or', 'it', 'where', 'being', 'through', "didn't", 'won', 'hasn', 'myself', 'o', "weren't", 'am', 'herself', "needn't", 'some', 'ourselves', 'for', 'an', 'more', 'don', 'below', 'needn', 'their', 'hers', 'over', "hasn't", 'will', 'he', 'haven', 'she', 'again', 'now', "shan't", 'what', 'a', 'this', 'under', 'both', 'm', "you'll", 'against', 'out', 'how', 'whom', 'didn', 'had', 'my', 'to', "you're", 'do', 'hadn', "you've", '

Okay, we now have far fewer words in our song, right?
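
One side effect is worth noticing: the stop-word list contains forms such as "don't" and 'i', but tokens whose apostrophes were stripped earlier ('Dont', 'Im') no longer match any entry and survive the filter. A minimal sketch, using a small hand-picked subset of the list printed above rather than the full NLTK set:

```python
# Illustrative subset of the English stop-word list printed above (not the full set)
stop_words_subset = {"i", "the", "a", "on", "don't", "don", "m"}

# Tokens as they look after punctuation was stripped from the raw string
stripped_tokens = ["Dont", "want", "Im", "on", "a", "roll"]

# Same case-insensitive filter as in the cell above
kept = [w for w in stripped_tokens if w.lower() not in stop_words_subset]
print(kept)  # ['Dont', 'want', 'Im', 'roll']
```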

The next step is lemmatization. We ran into an issue with it in the lectures, remember? Let's learn how to do it properly now.

First let's try to do it naively. Import the WordNetLemmatizer and perform lemmatization with default options.

In [19]:
from nltk.stem import WordNetLemmatizer
 
lemmatizer = WordNetLemmatizer()

In [26]:
# TODO: Perform lemmatization using WordNetLemmatizer on our tokens
lemmatized_origin = []
for w in token:
    lemmatized_origin.append(lemmatizer.lemmatize(w))

# Check whether lemmatization changed any token.
# Prints False when at least one token was modified.
print(lemmatized_origin == token)


False


As you can see, it worked well on nouns (plural words are now singular for example).

But verbs are not OK: we would want 'is' to become 'be', for example.

To do that, we need to do POS-tagging. So let's do this now.

POS-tagging means part-of-speech tagging: it classifies words into categories such as verbs, nouns, adverbs and so on.

In order to do that, we will use NLTK and the function *pos_tag*. You have to do it on the step before lemmatization, so use your variable containing all the tokens without punctuation and without stop words.

Hint: you can check on the internet how the *pos_tag* function works [here](https://www.nltk.org/book/ch05.html)

In [27]:
from nltk import pos_tag

In [42]:
# TODO: use the function pos_tag of NLTK to perform POS-tagging and print the result
# Note: removed_stopwords_lyrics holds the tokens with both punctuation
# and stop words already removed
pos_tags = pos_tag(removed_stopwords_lyrics)
print(pos_tags)


As you can see, it does not return values like 'a', 'n', 'v' or 'r', which is what the WordNet lemmatizer expects.

So we have to convert the tags produced by the NLTK POS-tagger before passing them to the WordNet lemmatizer. This is done in the function *get_wordnet_pos* written below. Try to understand it, and then we will reuse it.

In [43]:
# Lemmatize tokens with POS tagging
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return nltk.corpus.wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return nltk.corpus.wordnet.VERB
    elif treebank_tag.startswith('N'):
        return nltk.corpus.wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return nltk.corpus.wordnet.ADV
    else:
        return nltk.corpus.wordnet.NOUN
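
To see the mapping without running the tagger, the sketch below applies the same first-letter rule to a few hand-written (word, tag) pairs. The tags are illustrative examples rather than real tagger output, and `treebank_to_wordnet` is a hypothetical stand-in that uses the literal letters 'a', 'v', 'n', 'r' so that it runs without any NLTK resource.

```python
# WordNet's single-letter POS codes (the values of wordnet.ADJ, VERB, NOUN, ADV)
ADJ, VERB, NOUN, ADV = "a", "v", "n", "r"

def treebank_to_wordnet(treebank_tag):
    """Map a Treebank-style tag to a WordNet POS code using its first letter."""
    prefix_map = {"J": ADJ, "V": VERB, "N": NOUN, "R": ADV}
    # Anything else (determiners, pronouns, ...) falls back to noun
    return prefix_map.get(treebank_tag[:1], NOUN)

# Hand-written pairs in the shape pos_tag returns; the tags are illustrative
sample_tags = [("records", "NNS"), ("is", "VBZ"), ("favourite", "JJ"),
               ("rather", "RB"), ("the", "DT")]

for word, tag in sample_tags:
    print(word, tag, "->", treebank_to_wordnet(tag))
```

The noun fallback mirrors the `else` branch of *get_wordnet_pos*: it is also the default the lemmatizer uses when no POS is given.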

You now have everything you need to perform lemmatization properly.

To do so, use the following:
* your tags from the POS-tagging performed
* the function *get_wordnet_pos*
* the *WordNetLemmatizer*

In [44]:
# TODO: Perform the lemmatization properly
lemmatized_tokens = [lemmatizer.lemmatize(word, get_wordnet_pos(tag)) for word, tag in pos_tags]

# Check whether lemmatization changed anything: True if at least one token differs
tokens_changed = lemmatized_tokens != removed_stopwords_lyrics
lemmatized_tokens

['turn',
 'music',
 'get',
 'record',
 'shut',
 'world',
 'outside',
 'light',
 'come',
 'Maybe',
 'street',
 'alight',
 'maybe',
 'tree',
 'go',
 'feel',
 'heart',
 'start',
 'beat',
 'favourite',
 'song',
 'kid',
 'dance',
 'kid',
 'night',
 'Monday',
 'morning',
 'feel',
 'another',
 'life',
 'turn',
 'music',
 'Im',
 'roll',
 'time',
 'heaven',
 'sight',
 'turn',
 'music',
 'get',
 'record',
 'underneath',
 'rubble',
 'sing',
 'rebel',
 'song',
 'Dont',
 'want',
 'see',
 'another',
 'generation',
 'drop',
 'Id',
 'rather',
 'comma',
 'full',
 'stop',
 'Maybe',
 'Im',
 'black',
 'maybe',
 'Im',
 'knee',
 'Maybe',
 'Im',
 'gap',
 'two',
 'trapezes',
 'heart',
 'beating',
 'pulse',
 'start',
 'Cathedrals',
 'heart',
 'saw',
 'oh',
 'light',
 'swear',
 'emerge',
 'blink',
 'tell',
 'alright',
 'soar',
 'wall',
 'every',
 'siren',
 'symphony',
 'every',
 'tear',
 'waterfall',
 'waterfall',
 'Oh',
 'waterfall',
 'Oh',
 'oh',
 'oh',
 'waterfall',
 'Every',
 'tear',
 'waterfall',
 'Oh',
 '

What do you think?

Still not perfect, but it's the best we can do for now.

Now try stemming, with the help of the lecture, and compare the result with the lemmatization output.

In [47]:
# TODO: Perform stemming
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

stemmed = [stemmer.stem(word) for word in removed_stopwords_lyrics]
stemmed

['turn',
 'music',
 'got',
 'record',
 'shut',
 'world',
 'outsid',
 'light',
 'come',
 'mayb',
 'street',
 'alight',
 'mayb',
 'tree',
 'gone',
 'feel',
 'heart',
 'start',
 'beat',
 'favourit',
 'song',
 'kid',
 'danc',
 'kid',
 'night',
 'monday',
 'morn',
 'feel',
 'anoth',
 'life',
 'turn',
 'music',
 'im',
 'roll',
 'time',
 'heaven',
 'sight',
 'turn',
 'music',
 'got',
 'record',
 'underneath',
 'rubbl',
 'sing',
 'rebel',
 'song',
 'dont',
 'want',
 'see',
 'anoth',
 'gener',
 'drop',
 'id',
 'rather',
 'comma',
 'full',
 'stop',
 'mayb',
 'im',
 'black',
 'mayb',
 'im',
 'knee',
 'mayb',
 'im',
 'gap',
 'two',
 'trapez',
 'heart',
 'beat',
 'puls',
 'start',
 'cathedr',
 'heart',
 'saw',
 'oh',
 'light',
 'swear',
 'emerg',
 'blink',
 'tell',
 'alright',
 'soar',
 'wall',
 'everi',
 'siren',
 'symphoni',
 'everi',
 'tear',
 'waterfal',
 'waterfal',
 'oh',
 'waterfal',
 'oh',
 'oh',
 'oh',
 'waterfal',
 'everi',
 'tear',
 'waterfal',
 'oh',
 'oh',
 'oh',
 'hurt',
 'hurt',
 'ba

Do you see the difference? What would you use?

Both stemming and lemmatization share the same goal: reducing words to a base form.
But stemming works by stripping suffixes according to rules, so its output is often not a real word and can look like a spelling mistake (for example 'mayb' or 'waterfal'). It also lowercases the tokens.
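
A quick way to inspect the difference is to stem a few words taken from the lyrics in isolation. This is a minimal sketch with a hand-picked word list. It only needs the `PorterStemmer` class, which requires no downloaded resource.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Words from the song that the lemmatizer above left unchanged
for word in ["outside", "generation", "dance", "symphony"]:
    print(word, "->", stemmer.stem(word))
```

The stems match the entries in the stemming output above ('outsid', 'gener', 'danc', 'symphoni'), whereas the lemmatized list keeps all four words as they are.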