# Lab 7 - Textual Data Analytics
Complete the code wherever a TODO tag appears.
## 1. Feature Engineering
In this exercise we will see how TF-IDF ranking works. Implement the feature engineering and apply it, based on the code framework provided below.

First we use textual data from Twitter.

In [1]:
import numpy as np
import pandas as pd
data = pd.read_csv('elonmusk_tweets.csv')
print(len(data))
data.head()

2819


Unnamed: 0,id,created_at,text
0,849636868052275200,2017-04-05 14:56:29,b'And so the robots spared humanity ... https:...
1,848988730585096192,2017-04-03 20:01:01,"b""@ForIn2020 @waltmossberg @mims @defcon_5 Exa..."
2,848943072423497728,2017-04-03 16:59:35,"b'@waltmossberg @mims @defcon_5 Et tu, Walt?'"
3,848935705057280001,2017-04-03 16:30:19,b'Stormy weather in Shortville ...'
4,848416049573658624,2017-04-02 06:05:23,"b""@DaveLeeBBC @verge Coal is dying due to nat ..."


### 1.1. Text Normalization
Now we need to normalize text by stemming, tokenizing, and removing stopwords.

In [2]:
from __future__ import print_function, division
from nltk.stem import PorterStemmer, WordNetLemmatizer
import nltk
nltk.download('punkt')
import string
from nltk.corpus import stopwords
import math
from collections import Counter
nltk.download('stopwords')
import pprint 
pp = pprint.PrettyPrinter(indent=4)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [3]:
def normalize(document):
    # TODO: remove punctuation
    text = "".join([ch for ch in document if ch not in string.punctuation])
    
    # TODO: tokenize text
    tokens = nltk.word_tokenize(text)
    
    # TODO: Stemming
    stemmer = PorterStemmer()
    ret = " ".join([stemmer.stem(word.lower())for word in tokens])
    return ret

original_documents = [x.strip() for x in data['text']] 
documents = [normalize(d).split() for d in original_documents]
documents[0]

['band', 'so', 'the', 'robot', 'spare', 'human', 'httpstcov7jujqwfcv']

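As you can see, the normalization is still not perfect: the leading `b'` of the byte-string literal is merged into the first token ('band'), and the URL survives as a single token. Feel free to improve it (OPTIONAL), e.g. https://marcobonzanini.com/2015/03/09/mining-twitter-data-with-python-part-2/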
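One possible improvement, as a minimal sketch: it assumes the `text` column holds the printed form of a Python bytes object (hence the `b'...'` wrapper) and strips that wrapper, URLs and @-mentions before tokenizing. The function name `clean_tweet`, the regular expressions and the sample tweets are illustrative and not part of the lab framework.

```python
import re

def clean_tweet(raw):
    """Strip the bytes-literal wrapper, URLs and @-mentions from a tweet string."""
    text = raw.strip()
    # Remove a leading b' or b" together with the matching trailing quote, if present
    match = re.fullmatch(r"""b(['"])(.*)\1""", text, flags=re.DOTALL)
    if match:
        text = match.group(2)
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"@\w+", " ", text)          # drop @-mentions
    return " ".join(text.split())              # collapse repeated whitespace

# Hypothetical tweet in the same shape as the rows of the dataset
print(clean_tweet("b'@user Hello world https://t.co/xyz'"))
```

The cleaned string could then be passed to `normalize` in place of the raw text.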

### 1.2. Implement TF-IDF
Now you need to implement TF-IDF, including creating the vocabulary, computing the term frequencies, and weighting them by the inverse document frequency.
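As a reminder of what is being computed, here is a minimal sketch on a hypothetical three-document toy corpus. It assumes raw term counts for TF and a base-2 logarithm for IDF, i.e. idf(t) = log2(N / df(t)); other variants exist. The names `toy_docs`, `toy_idf` and `toy_tfidf` are made up for this illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each document is a list of already-normalized tokens
toy_docs = [["tesla", "model", "tesla"], ["spacex", "rocket"], ["tesla", "rocket"]]

def toy_idf(term, docs):
    """IDF with a base-2 log: log2(N / number of documents containing the term)."""
    df = sum(term in doc for doc in docs)
    return math.log(len(docs) / df, 2)

def toy_tfidf(doc, docs):
    """Map each term of `doc` to (raw count in doc) * IDF."""
    counts = Counter(doc)
    return {term: count * toy_idf(term, docs) for term, count in counts.items()}

weights = toy_tfidf(toy_docs[0], toy_docs)
print(weights)
```

In this toy corpus 'model' ends up with a higher weight than 'tesla' although 'tesla' occurs twice, because 'model' appears in only one document.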

In [4]:
# Flatten all the documents
flat_list = [word for doc in documents for word in doc]

# TODO: remove stop words from the vocabulary
stop_words = set(stopwords.words('english'))  # build the set once instead of once per word
words = [word for word in flat_list if word not in stop_words]

# TODO: we take the 500 most common words only
counts = Counter(words)
vocabulary = counts.most_common(500)
print([x for x in vocabulary if x[0] == 'tesla'])
vocabulary = [x[0] for x in vocabulary]
assert len(vocabulary) == 500

# vocabulary.sort()
vocabulary[:5]

[('tesla', 287)]


['brt', 'tesla', 'spacex', 'model', 'thi']

In [5]:
def tf(vocabulary, documents):
    matrix = [0] * len(documents)
    for i, document in enumerate(documents):
        counts = Counter(document)
        matrix[i] = [0] * len(vocabulary)
        for j, term in enumerate(vocabulary):
            matrix[i][j] = counts[term]
    return matrix

tf_matrix = tf(vocabulary, documents)  # separate name so the tf() function is not overwritten
np.array(vocabulary)[np.where(np.array(tf_matrix[1]) > 0)], np.array(tf_matrix[1])[np.where(np.array(tf_matrix[1]) > 0)]

(array(['tesla', 'exactli'], dtype='<U17'), array([1, 1]))

In [6]:
def idf(vocabulary, documents):
    """TODO: compute IDF, storing values in a dictionary"""
    idf = {}
    num_documents = len(documents)
    for i, term in enumerate(vocabulary):
        idf[term] = math.log(num_documents / sum(term in document for document in documents),2)
    return idf

idf = idf(vocabulary, documents)
[idf[key] for key in vocabulary[:5]]

[2.539126825495932,
 3.3163095197385393,
 3.7262581423445837,
 3.8171115727956972,
 3.8027562798186274]

In [7]:
def vectorize(document, vocabulary, idf):
    vector = [0]*len(vocabulary)
    counts = Counter(document)
    for i,term in enumerate(vocabulary):
        vector[i] = idf[term] * counts[term]
    return vector

document_vectors = [vectorize(s, vocabulary, idf) for s in documents]
np.array(vocabulary)[np.where(np.array(document_vectors[1]) > 0)], np.array(document_vectors[1])[np.where(np.array(document_vectors[1]) > 0)]

(array(['tesla', 'exactli'], dtype='<U17'), array([3.31630952, 6.65361284]))

### 1.3. Compare the results with the reference implementation of scikit-learn library.

Now we use the scikit-learn library. As you can see, the way we normalize the text affects the result. Feel free to improve it further (OPTIONAL), e.g. https://stackoverflow.com/questions/36182502/add-stemming-support-to-countvectorizer-sklearn
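Normalization is not the only difference. With its default settings, `TfidfVectorizer` uses a smoothed natural-log IDF, ln((1 + n) / (1 + df)) + 1, and rescales every document vector to unit L2 length, so its weights are not on the same scale as a plain count-times-IDF product. The sketch below reproduces that arithmetic in plain Python on made-up numbers; the helper names are hypothetical.

```python
import math

def sklearn_style_idf(n_docs, df):
    """Smoothed IDF used by TfidfVectorizer defaults: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + df)) + 1

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (returned unchanged if all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)

# Hypothetical document containing two vocabulary terms once each,
# in a corpus of 4 documents where the terms occur in 2 and 1 documents.
raw = [1 * sklearn_style_idf(4, 2), 1 * sklearn_style_idf(4, 1)]
unit = l2_normalize(raw)
print(unit)
```

After the L2 step every weight lies between 0 and 1, whatever the size of the corpus.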

In [8]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

tfidf = TfidfVectorizer(analyzer='word', ngram_range=(1,1), min_df = 1, stop_words = 'english', max_features=500)

features = tfidf.fit(original_documents)
corpus_tf_idf = tfidf.transform(original_documents) 

sum_words = corpus_tf_idf.sum(axis=0)
words_freq = [(word, sum_words[0, idx]) for word, idx in tfidf.vocabulary_.items()]
print(sorted(words_freq, key = lambda x: x[1], reverse=True)[:5])
print('tesla', corpus_tf_idf[1, features.vocabulary_['tesla']])

[('http', 163.54366542841234), ('https', 151.85039944652075), ('rt', 112.61998731390989), ('tesla', 95.96401470715628), ('xe2', 88.20944486346477)]
tesla 0.3495243100660956


### 1.4.  Apply TF-IDF for information retrieval
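We can use the vector representation of documents to implement an information retrieval system: the query is vectorized in the same way as the documents, and the documents are ranked by their similarity to it. We test with the query $Q$ = "tesla nasa".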
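For reference, the similarity measure used here is the cosine of the angle between the query vector $\mathbf{q}$ and a document vector $\mathbf{d}$ (both indexed by the vocabulary terms $i$):

$$
\cos(\mathbf{q}, \mathbf{d}) = \frac{\sum_i q_i d_i}{\sqrt{\sum_i q_i^2}\,\sqrt{\sum_i d_i^2}}
$$

As a small worked example on made-up vectors, $\mathbf{q} = (1, 1)$ and $\mathbf{d} = (1, 0)$ give $\cos(\mathbf{q}, \mathbf{d}) = 1 / (\sqrt{2} \cdot 1) \approx 0.707$. The measure is undefined when either vector is all zeros, so an implementation has to pick a convention for that case, for example returning 0.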

In [9]:
def cosine_similarity(v1,v2):
    """TODO: compute cosine similarity"""
    sumxx, sumxy, sumyy = 0, 0, 0
    for i in range(len(v1)):
        x = v1[i]; y =v2[i]
        sumxx += x*x
        sumyy += y*y
        sumxy += x*y
    if sumxy == 0:
        result = 0
    else:
        result = sumxy/math.sqrt(sumxx*sumyy)
    return result

def search_vec(query, k, vocabulary, stemmer, idf, document_vectors, original_documents):
    # Stem the query terms so they match the stemmed vocabulary
    q = [stemmer.stem(w) for w in query.split()]
    # idf is now passed in explicitly instead of being read from a global
    query_vector = vectorize(q, vocabulary, idf)
    
    # TODO: rank the documents by cosine similarity
    scores = [[cosine_similarity(query_vector, vec), d] for d, vec in enumerate(document_vectors)]
    scores.sort(key=lambda x: -x[0])
    print('Top-{0} documents'.format(k))
    for i in range(k):
        print(i, original_documents[scores[i][1]])

query = "tesla nasa"
stemmer = PorterStemmer()
search_vec(query, 5, vocabulary, stemmer, idf, document_vectors, original_documents)

Top-5 documents
0 b'@ashwin7002 @NASA @faa @AFPAA We have not ruled that out.'
1 b'RT @NASA: Updated @SpaceX #Dragon #ISS rendezvous times: NASA TV coverage begins Sunday at 3:30amET: http://t.co/qrm0Dz4jPE. Grapple at  ...'
2 b"Deeply appreciate @NASA's faith in @SpaceX. We will do whatever it takes to make NASA and the American people proud."
3 b'Would also like to congratulate @Boeing, fellow winner of the @NASA commercial crew program'
4 b"@astrostephenson We're aiming for late 2015, but NASA needs to have overlapping capability to be safe. Would do the same"


We can also use the scikit-learn library to do the retrieval.

In [10]:
new_features = tfidf.transform([query])

cosine_similarities = linear_kernel(new_features, corpus_tf_idf).flatten()
related_docs_indices = cosine_similarities.argsort()[::-1]

topk = 5
print('Top-{0} documents'.format(topk))
for i in range(topk):
    print(i, original_documents[related_docs_indices[i]])

Top-5 documents
0 b'@ashwin7002 @NASA @faa @AFPAA We have not ruled that out.'
1 b"SpaceX could not do this without NASA. Can't express enough appreciation. https://t.co/uQpI60zAV7"
2 b'@NASA launched a rocket into the northern lights http://t.co/tR2cSeMV'
3 b'Whatever happens today, we could not have done it without @NASA, but errors are ours alone and me most of all.'
4 b'RT @NASA: Updated @SpaceX #Dragon #ISS rendezvous times: NASA TV coverage begins Sunday at 3:30amET: http://t.co/qrm0Dz4jPE. Grapple at  ...'


## 2. Text Processing

### 2.1. Preprocessing

The first NLP exercise is about preprocessing.

You will practice preprocessing raw data with NLTK.
This is the first step in most NLP projects, so you have to master it.

We will play with the *coldplay.csv* dataset, containing all the songs and lyrics of Coldplay.

As you know, the first step is to import some libraries. So import *nltk* as well as all the libraries you will need.

In [11]:
# Import NLTK and all the needed libraries
import nltk
nltk.download('punkt') #Run this line one time to get the resource
nltk.download('stopwords') #Run this line one time to get the resource
nltk.download('wordnet') #Run this line one time to get the resource
nltk.download('averaged_perceptron_tagger') #Run this line one time to get the resource
import numpy as np
import pandas as pd

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


Now load the dataset using pandas.

In [12]:
# TODO: Load the dataset in coldplay.csv
import numpy as np
import pandas as pd
data = pd.read_csv('coldplay.csv')
print(len(data))
data.head()

120


Unnamed: 0,Artist,Song,Link,Lyrics
0,Coldplay,Another's Arms,/c/coldplay/anothers+arms_21079526.html,Late night watching tv \nUsed to be you here ...
1,Coldplay,Bigger Stronger,/c/coldplay/bigger+stronger_20032648.html,I want to be bigger stronger drive a faster ca...
2,Coldplay,Daylight,/c/coldplay/daylight_20032625.html,"To my surprise, and my delight \nI saw sunris..."
3,Coldplay,Everglow,/c/coldplay/everglow_21104546.html,"Oh, they say people come \nThey say people go..."
4,Coldplay,Every Teardrop Is A Waterfall,/c/coldplay/every+teardrop+is+a+waterfall_2091...,"I turn the music up, I got my records on \nI ..."


Now check the dataset and play with it a bit: what are the columns? How many rows are there? Is any data missing?
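A few pandas calls answer these questions directly. The sketch below runs them on a hypothetical three-row frame (`toy`) standing in for the real dataset, so the printed values are not those of *coldplay.csv*.

```python
import pandas as pd

# Hypothetical miniature frame standing in for coldplay.csv
toy = pd.DataFrame({
    "Song": ["A", "B", "C"],
    "Lyrics": ["la la", None, "oh oh"],
})

print(toy.shape)            # (number of rows, number of columns)
print(list(toy.columns))    # column names
missing = toy.isna().sum()  # count of missing values per column
print(missing)
```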

In [13]:
# TODO: Explore the data
df = pd.read_csv('coldplay.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120 entries, 0 to 119
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Artist  120 non-null    object
 1   Song    120 non-null    object
 2   Link    120 non-null    object
 3   Lyrics  120 non-null    object
dtypes: object(4)
memory usage: 3.9+ KB


Now select the song 'Every Teardrop Is A Waterfall' and save the Lyrics text into a variable. Print the output of this variable.

In [14]:
# TODO: Select the song 'Every Teardrop Is A Waterfall'
df_Song = df[df['Song'] == "Every Teardrop Is A Waterfall"]
df_Lyrics = df_Song["Lyrics"]
print(df_Lyrics)

4    I turn the music up, I got my records on  \nI ...
Name: Lyrics, dtype: object


As you can see, there is some preprocessing needed here. So let's do it! What is usually the first step?

Tokenization, yes. So do tokenization on the lyrics of Every Teardrop Is A Waterfall.

You may need to import the required NLTK module if you have not done so yet.

Be careful: the value selected from your pandas DataFrame may not have the right type, so convert it to a plain string first.
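To illustrate the type issue: filtering a DataFrame and picking a column yields a pandas Series, even when only one row matches. A minimal sketch on a hypothetical two-row frame (`toy`), using `.iloc[0]` to pull out the underlying string:

```python
import pandas as pd

# Hypothetical two-row frame standing in for coldplay.csv
toy = pd.DataFrame({
    "Song": ["A", "B"],
    "Lyrics": ["first line \nsecond line", "other words"],
})

selected = toy.loc[toy["Song"] == "A", "Lyrics"]  # still a Series of length 1
lyrics_text = selected.iloc[0]                     # plain Python str

print(type(selected).__name__, type(lyrics_text).__name__)
```

A tokenizer expects the `str`, not the Series.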

In [15]:
# TODO: Tokenize the lyrics of the song and save the tokens into a variable and print it
import string
import nltk
nltk.download('punkt')
from nltk.stem import PorterStemmer  # import only what is used instead of *

def normalize(document):
    # TODO: remove punctuation
    text = "".join([ch for ch in document if ch not in string.punctuation])
    
    # TODO: tokenize text
    tokens = nltk.word_tokenize(text)
    
    # TODO: Stemming
    stemmer = PorterStemmer()
    ret = " ".join([stemmer.stem(word.lower())for word in tokens])
    return ret

original_documents = [x.strip() for x in df_Song['Lyrics']]
documents = [normalize(d).split() for d in original_documents]
documents[0]

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


['i',
 'turn',
 'the',
 'music',
 'up',
 'i',
 'got',
 'my',
 'record',
 'on',
 'i',
 'shut',
 'the',
 'world',
 'outsid',
 'until',
 'the',
 'light',
 'come',
 'on',
 'mayb',
 'the',
 'street',
 'alight',
 'mayb',
 'the',
 'tree',
 'are',
 'gone',
 'i',
 'feel',
 'my',
 'heart',
 'start',
 'beat',
 'to',
 'my',
 'favourit',
 'song',
 'and',
 'all',
 'the',
 'kid',
 'they',
 'danc',
 'all',
 'the',
 'kid',
 'all',
 'night',
 'until',
 'monday',
 'morn',
 'feel',
 'anoth',
 'life',
 'i',
 'turn',
 'the',
 'music',
 'up',
 'im',
 'on',
 'a',
 'roll',
 'thi',
 'time',
 'and',
 'heaven',
 'is',
 'in',
 'sight',
 'i',
 'turn',
 'the',
 'music',
 'up',
 'i',
 'got',
 'my',
 'record',
 'on',
 'from',
 'underneath',
 'the',
 'rubbl',
 'sing',
 'a',
 'rebel',
 'song',
 'dont',
 'want',
 'to',
 'see',
 'anoth',
 'gener',
 'drop',
 'id',
 'rather',
 'be',
 'a',
 'comma',
 'than',
 'a',
 'full',
 'stop',
 'mayb',
 'im',
 'in',
 'the',
 'black',
 'mayb',
 'im',
 'on',
 'my',
 'knee',
 'mayb',
 'im'

It is starting to look good. Punctuation also has to be removed; the `normalize` function above already does this, so the cell below only imports what the following steps need.

In [16]:
from __future__ import print_function, division
from nltk.stem import PorterStemmer, WordNetLemmatizer
import nltk
nltk.download('punkt')
import string
from nltk.corpus import stopwords
import math
from collections import Counter
nltk.download('stopwords')
import pprint 
pp = pprint.PrettyPrinter(indent=4)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


We will now remove the stop words.

In [17]:
# TODO: Remove the stop words, then save the result into a variable and print it
# `documents` is a list of token lists, so filter the tokens of the (single) song
stop_words = set(stopwords.words('english'))
words = [word for word in documents[0] if word not in stop_words]
words

['turn', 'music', 'got', 'record', 'shut', 'world', 'outsid', 'light', 'come', 'mayb', 'street', 'alight', 'mayb', 'tree', 'gone', 'feel', 'heart', 'start', 'beat', 'favourit', 'song', 'kid', 'danc', 'kid', 'night', 'monday', 'morn', 'feel', 'anoth', 'life', 'turn', 'music', 'im', 'roll', 'thi', 'time', 'heaven', 'sight', 'turn', 'music', 'got', 'record', 'underneath', 'rubbl', 'sing', 'rebel', 'song', 'dont', 'want', 'see', 'anoth', 'gener', 'drop', 'id', 'rather', 'comma', 'full', ...

Okay, we now have far fewer words in our song, right?

The next step is lemmatization. But we ran into an issue in the lectures, remember? Let's learn how to do it properly now.

First let's try to do it naively. Import the WordNetLemmatizer and perform lemmatization with default options.

TODO: Perform lemmatization using WordNetLemmatizer on our tokens

In [18]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

nltk.download('wordnet')  # Download the WordNet corpus
nltk.download('stopwords')  # Download the stopwords corpus
nltk.download('omw-1.4')  # Download the omw-1.4 resource

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))  # build the stop word set once
filtered_tokens_Array = []

for item in documents:
    # Lemmatize first (the default POS is noun), then drop stop words
    words = [lemmatizer.lemmatize(word) for word in item]
    filtered_words = [word for word in words if word not in stop_words]
    filtered_tokens_Array.append(filtered_words)


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [19]:
print("\nfiltered:\n",filtered_tokens_Array)


filtered:
 [['turn', 'music', 'got', 'record', 'shut', 'world', 'outsid', 'light', 'come', 'mayb', 'street', 'alight', 'mayb', 'tree', 'gone', 'feel', 'heart', 'start', 'beat', 'favourit', 'song', 'kid', 'danc', 'kid', 'night', 'monday', 'morn', 'feel', 'anoth', 'life', 'turn', 'music', 'im', 'roll', 'thi', 'time', 'heaven', 'sight', 'turn', 'music', 'got', 'record', 'underneath', 'rubbl', 'sing', 'rebel', 'song', 'dont', 'want', 'see', 'anoth', 'gener', 'drop', 'id', 'rather', 'comma', 'full', 'stop', 'mayb', 'im', 'black', 'mayb', 'im', 'knee', 'mayb', 'im', 'gap', 'two', 'trapez', 'heart', 'beat', 'pul', 'start', 'cathedr', 'heart', 'saw', 'oh', 'thi', 'light', 'swear', 'emerg', 'blink', 'tell', 'alright', 'soar', 'wall', 'everi', 'siren', 'symphoni', 'everi', 'tear', 'waterfal', 'waterfal', 'oh', 'waterfal', 'oh', 'oh', 'oh', 'waterfal', 'everi', 'tear', 'waterfal', 'oh', 'oh', 'oh', 'hurt', 'hurt', 'bad', 'still', 'ill', 'rais', 'flag', 'oh', 'wa', 'wa', 'wa', 'wa', 'wa', 'waater

TODO: use the function pos_tag of NLTK to perform POS-tagging and print the result


In [20]:
nltk.download('universal_tagset')

[nltk_data] Downloading package universal_tagset to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!


True

In [21]:
tagged = nltk.pos_tag(filtered_words, tagset="universal")
print(tagged)

[('turn', 'NOUN'), ('music', 'NOUN'), ('got', 'VERB'), ('record', 'NOUN'), ('shut', 'NOUN'), ('world', 'NOUN'), ('outsid', 'NOUN'), ('light', 'ADJ'), ('come', 'VERB'), ('mayb', 'ADJ'), ('street', 'NOUN'), ('alight', 'VERB'), ('mayb', 'NOUN'), ('tree', 'VERB'), ('gone', 'VERB'), ('feel', 'ADJ'), ('heart', 'NOUN'), ('start', 'NOUN'), ('beat', 'VERB'), ('favourit', 'ADJ'), ('song', 'NOUN'), ('kid', 'NOUN'), ('danc', 'NOUN'), ('kid', 'NOUN'), ('night', 'NOUN'), ('monday', 'NOUN'), ('morn', 'VERB'), ('feel', 'VERB'), ('anoth', 'ADJ'), ('life', 'NOUN'), ('turn', 'NOUN'), ('music', 'NOUN'), ('im', 'NOUN'), ('roll', 'NOUN'), ('thi', 'NOUN'), ('time', 'NOUN'), ('heaven', 'ADJ'), ('sight', 'VERB'), ('turn', 'NOUN'), ('music', 'NOUN'), ('got', 'VERB'), ('record', 'ADJ'), ('underneath', 'NOUN'), ('rubbl', 'NOUN'), ('sing', 'VERB'), ('rebel', 'NOUN'), ('song', 'NOUN'), ('dont', 'NOUN'), ('want', 'VERB'), ('see', 'VERB'), ('anoth', 'DET'), ('gener', 'X'), ('drop', 'NOUN'), ('id', 'NOUN'), ('rather',

As you can see, it worked well on nouns (plural words are now singular for example).

But verbs are not OK: we would like 'is' to become 'be', for example.

To do that, we need to do POS-tagging. So let's do this now.

POS-tagging means part-of-speech tagging: it classifies words into categories such as verbs, nouns, adverbs and so on.

In order to do that, we will use NLTK and the function *pos_tag*. Apply it to the output of the step before lemmatization, i.e. your variable containing all the tokens without punctuation and without stop words.

Hint: you can check on the internet how the *pos_tag* function works [here](https://www.nltk.org/book/ch05.html)

 TODO: use the function pos_tag of NLTK to perform POS-tagging and print the result

In [22]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

sentence = "I turn the music on, I got my records on. I shut the world outside until the lights come on. Maybe the streets alight, maybe the trees are gone. I feel my heart start beating to my favorite song. And all the kids, they dance, all the kids all night. Until Monday morning feels another life. I turn the music up, I'm on a roll this time. And heaven is in sight. I turn the music up, I got my records on. From underneath the rubble, sing a rebel song. Don't want to see another generation drop. I'd rather be a comma than a full stop. Maybe I'm in the black, maybe I'm on my knees. Maybe I'm in the gap between the two trapezes. But my heart is beating and my pulses start. Cathedrals in my heart. As we saw oh, this light I swear. Oh, I'm gonna give it a try, I'll leave my door open wide. And I'm gonna be the one who never lets you in. And I'm gonna be the one who's gonna let you in. I'm gonna be the one who never lets you in. I'm gonna be the one who's gonna let you in. So, you hurt me bad but I won't mind. Still, I'll raise the flag. Oh, It was a wa wa wa wa wa wa. It was a wa wa wa wa wa wa wa. Every tear, every tear. Every teardrop is a waterfall. Oh, It was a wa wa wa wa wa wa. Every tear is a waterfall. So, you hurt me bad, but I won't mind. Still, I'll raise the flag. Oh."

tokens = word_tokenize(sentence)
tagged_words = pos_tag(tokens)

result = [(word, tag) for word, tag in tagged_words if tag not in ["DT", "IN", "CC", "RB", "TO", "PRP", "PRP$", "MD", "JJ", "CD", "UH", ".", ",", "VBZ"]]

print(result)

[('turn', 'VBP'), ('music', 'NN'), ('got', 'VBD'), ('records', 'NNS'), ('shut', 'VBD'), ('world', 'NN'), ('lights', 'NNS'), ('come', 'VBP'), ('streets', 'NNS'), ('alight', 'VBD'), ('trees', 'NNS'), ('are', 'VBP'), ('gone', 'VBN'), ('feel', 'VBP'), ('heart', 'NN'), ('start', 'VB'), ('beating', 'NN'), ('song', 'NN'), ('all', 'PDT'), ('kids', 'NNS'), ('dance', 'VBP'), ('all', 'PDT'), ('kids', 'NNS'), ('night', 'NN'), ('Monday', 'NNP'), ('morning', 'NN'), ('feels', 'NNS'), ('life', 'NN'), ('turn', 'VBP'), ('music', 'NN'), ("'m", 'VBP'), ('roll', 'NN'), ('time', 'NN'), ('heaven', 'NN'), ('sight', 'NN'), ('turn', 'VBP'), ('music', 'NN'), ('got', 'VBD'), ('records', 'NNS'), ('rubble', 'NN'), ('sing', 'VBG'), ('rebel', 'NN'), ('song', 'NN'), ('Do', 'VBP'), ('want', 'VB'), ('see', 'VB'), ('generation', 'NN'), ('drop', 'NN'), ('be', 'VB'), ('comma', 'NN'), ('stop', 'NN'), ("'m", 'VBP'), ("'m", 'VBP'), ('knees', 'NNS'), ("'m", 'VBP'), ('gap', 'NN'), ('trapezes', 'NNS'), ('heart', 'NN'), ('beating

As you can see, it does not return values like 'a', 'n', 'v' or 'r' as the WordNet lemmatizer is expecting...

So we have to convert the values from the NLTK POS-tagging to put them into the WordNet Lemmatizer. This is done in the function *get_wordnet_pos* written below. Try to understand it, and then we will reuse it.

In [23]:
from nltk.corpus import wordnet

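def get_wordnet_pos(pos_tag):
    """Convert a list of (word, Penn Treebank tag) pairs into (word, WordNet POS) pairs."""
    # Copy into a NumPy string array so the tag column can be overwritten
    output = np.asarray(pos_tag)
    for i in range(len(pos_tag)):
        if pos_tag[i][1].startswith('J'):      # JJ, JJR, JJS -> adjective
            output[i][1] = wordnet.ADJ
        elif pos_tag[i][1].startswith('V'):    # VB, VBD, VBG, ... -> verb
            output[i][1] = wordnet.VERB
        elif pos_tag[i][1].startswith('R'):    # RB, RBR, RBS, ... -> adverb
            output[i][1] = wordnet.ADV
        else:                                  # everything else is treated as a noun
            output[i][1] = wordnet.NOUN
    return output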

You now have everything needed to perform the lemmatization properly.

To do so, combine the following:
* your tags from the POS-tagging step
* the function *get_wordnet_pos*
* the *WordNetLemmatizer*

Perform the lemmatization properly.
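The three ingredients listed above could be combined roughly as follows. This is a sketch under assumptions, not the reference solution: `penn_to_wordnet` and `lemmatize_tagged` are hypothetical helper names, and `toy_lemmatize` is a tiny stand-in dictionary so the example runs without the WordNet corpus. In the lab you would pass `WordNetLemmatizer().lemmatize`, which accepts `(word, pos)`, instead.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to WordNet's one-letter POS code."""
    if tag.startswith('J'):
        return 'a'  # adjective, same value as nltk.corpus.wordnet.ADJ
    if tag.startswith('V'):
        return 'v'  # verb
    if tag.startswith('R'):
        return 'r'  # adverb
    return 'n'      # default: noun

def lemmatize_tagged(tagged_tokens, lemmatize):
    """Lemmatize (word, tag) pairs with any callable taking (word, pos)."""
    return [lemmatize(word, penn_to_wordnet(tag)) for word, tag in tagged_tokens]

# Stand-in lemmatizer (hypothetical): looks up a few hand-written lemmas
TOY_LEMMAS = {('is', 'v'): 'be', ('records', 'n'): 'record', ('got', 'v'): 'get'}

def toy_lemmatize(word, pos):
    return TOY_LEMMAS.get((word, pos), word)

tagged = [('got', 'VBD'), ('records', 'NNS'), ('is', 'VBZ')]
print(lemmatize_tagged(tagged, toy_lemmatize))
```

The point of passing the POS is visible with 'is': tagged as a verb it can be reduced to 'be', whereas a noun lookup would leave it unchanged.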

In [24]:
import nltk
from nltk.stem import WordNetLemmatizer

# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

# Input text
words = ['I', 'turn', 'music', 'I', 'get', 'record', 'I', 'shut', 'world', 'outside', 'light', 'come', 'Maybe', 'street',
         'alight', 'maybe', 'tree', 'go', 'I', 'feel', 'heart', 'start', 'beat', 'favourite', 'song', 'And', 'kid',
         'dance', 'kid', 'night', 'Until', 'Monday', 'morning', 'feel', 'another', 'life', 'I', 'turn', 'music', 'I',
         'roll', 'time', 'And', 'heaven', 'sight', 'I', 'turn', 'music', 'I', 'get', 'record', 'From', 'underneath',
         'rubble', 'sing', 'rebel', 'song', 'Do', 'want', 'see', 'another', 'generation', 'drop', 'I', 'rather', 'comma',
         'full', 'stop', 'Maybe', 'I', 'black', 'maybe', 'I', 'knees', 'Maybe', 'I', 'gap', 'two', 'trapeze', 'But',
         'heart', 'beating', 'pulse', 'start', 'Cathedrals', 'heart', 'As', 'saw', 'oh', 'light', 'I', 'swear',
         'emerge', 'blink', 'To', 'tell', 'alright', 'As', 'soar', 'wall', 'every', 'siren', 'symphony', 'And', 'every',
         'tear', 'waterfall', 'Is', 'waterfall', 'Oh', 'Is', 'waterfall', 'Oh', 'oh', 'oh', 'Is', 'waterfall', 'Every',
         'tear', 'Is', 'waterfall', 'Oh', 'oh', 'oh', 'So', 'hurt', 'hurt', 'bad', 'But', 'still', 'I', 'raise', 'flag',
         'Oh', 'It', 'wa', 'wa', 'wa', 'wa', 'A', 'wa', 'wa', 'wa', 'wa', 'Every', 'tear', 'Every', 'tear', 'Every',
         'teardrop', 'waterfall', 'Every', 'tear', 'Every', 'tear', 'Every', 'teardrop', 'waterfall', 'Every', 'tear',
         'Every', 'tear', 'Every', 'teardrop', 'waterfall']

# Lemmatize each word
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]

# Print the lemmatized words
print(lemmatized_words)

['I', 'turn', 'music', 'I', 'get', 'record', 'I', 'shut', 'world', 'outside', 'light', 'come', 'Maybe', 'street', 'alight', 'maybe', 'tree', 'go', 'I', 'feel', 'heart', 'start', 'beat', 'favourite', 'song', 'And', 'kid', 'dance', 'kid', 'night', 'Until', 'Monday', 'morning', 'feel', 'another', 'life', 'I', 'turn', 'music', 'I', 'roll', 'time', 'And', 'heaven', 'sight', 'I', 'turn', 'music', 'I', 'get', 'record', 'From', 'underneath', 'rubble', 'sing', 'rebel', 'song', 'Do', 'want', 'see', 'another', 'generation', 'drop', 'I', 'rather', 'comma', 'full', 'stop', 'Maybe', 'I', 'black', 'maybe', 'I', 'knee', 'Maybe', 'I', 'gap', 'two', 'trapeze', 'But', 'heart', 'beating', 'pulse', 'start', 'Cathedrals', 'heart', 'As', 'saw', 'oh', 'light', 'I', 'swear', 'emerge', 'blink', 'To', 'tell', 'alright', 'As', 'soar', 'wall', 'every', 'siren', 'symphony', 'And', 'every', 'tear', 'waterfall', 'Is', 'waterfall', 'Oh', 'Is', 'waterfall', 'Oh', 'oh', 'oh', 'Is', 'waterfall', 'Every', 'tear', 'Is', 'w

What do you think?

Still not perfect, but it's the best we can do for now.

Now you can try stemming, with the help of the lecture, and see how the results differ from lemmatization.

In [25]:
import nltk
from nltk.stem import PorterStemmer

def perform_stemming(text):
    stemmer = PorterStemmer()
    stemmed_words = []

    for word in text:
        stemmed_word = stemmer.stem(word)
        stemmed_words.append(stemmed_word)

    return stemmed_words

# Input text
text = ['i', 'turn', 'music', 'i', 'got', 'record', 'i', 'shut', 'world', 'outsid', 'light', 'come', 'mayb', 'street', 'alight', 'mayb', 'tree', 'gone', 'i', 'feel', 'heart', 'start', 'beat', 'favourit', 'song', 'and', 'kid', 'danc', 'kid', 'night', 'until', 'monday', 'morn', 'feel', 'anoth', 'life', 'i', 'turn', 'music', 'i', 'roll', 'time', 'and', 'heaven', 'sight', 'i', 'turn', 'music', 'i', 'got', 'record', 'from', 'underneath', 'rubbl', 'sing', 'rebel', 'song', 'do', 'want', 'see', 'anoth', 'gener', 'drop', 'i', 'rather', 'comma', 'full', 'stop', 'mayb', 'i', 'black', 'mayb', 'i', 'knee', 'mayb', 'i', 'gap', 'two', 'trapez', 'but', 'heart', 'beat', 'puls', 'start', 'cathedr', 'heart', 'as', 'saw', 'oh', 'light', 'i', 'swear', 'emerg', 'blink', 'to', 'tell', 'alright', 'as', 'soar', 'wall', 'everi', 'siren', 'symphoni', 'and', 'everi', 'tear', 'waterfal', 'is', 'waterfal', 'oh', 'is', 'waterfal', 'oh', 'oh', 'oh', 'is', 'waterfal', 'everi', 'tear', 'is', 'waterfal', 'oh', 'oh', 'oh', 'so', 'hurt', 'hurt', 'bad', 'but', 'still', 'i', 'rais', 'flag', 'oh', 'it', 'wa', 'wa', 'wa', 'wa', 'a', 'wa', 'wa', 'wa', 'wa', 'everi', 'tear', 'everi', 'tear', 'everi', 'teardrop', 'waterfal', 'everi', 'tear', 'everi', 'tear', 'everi', 'teardrop', 'waterfal', 'everi', 'tear', 'everi', 'tear', 'everi', 'teardrop', 'waterfal']

# Perform stemming
stemmed_text = perform_stemming(text)

# Print the result
print(stemmed_text)


['i', 'turn', 'music', 'i', 'got', 'record', 'i', 'shut', 'world', 'outsid', 'light', 'come', 'mayb', 'street', 'alight', 'mayb', 'tree', 'gone', 'i', 'feel', 'heart', 'start', 'beat', 'favourit', 'song', 'and', 'kid', 'danc', 'kid', 'night', 'until', 'monday', 'morn', 'feel', 'anoth', 'life', 'i', 'turn', 'music', 'i', 'roll', 'time', 'and', 'heaven', 'sight', 'i', 'turn', 'music', 'i', 'got', 'record', 'from', 'underneath', 'rubbl', 'sing', 'rebel', 'song', 'do', 'want', 'see', 'anoth', 'gener', 'drop', 'i', 'rather', 'comma', 'full', 'stop', 'mayb', 'i', 'black', 'mayb', 'i', 'knee', 'mayb', 'i', 'gap', 'two', 'trapez', 'but', 'heart', 'beat', 'pul', 'start', 'cathedr', 'heart', 'as', 'saw', 'oh', 'light', 'i', 'swear', 'emerg', 'blink', 'to', 'tell', 'alright', 'as', 'soar', 'wall', 'everi', 'siren', 'symphoni', 'and', 'everi', 'tear', 'waterf', 'is', 'waterf', 'oh', 'is', 'waterf', 'oh', 'oh', 'oh', 'is', 'waterf', 'everi', 'tear', 'is', 'waterf', 'oh', 'oh', 'oh', 'so', 'hurt', '