## Pre-processing a text for NLP purposes - Emre Kurt

**In this notebook, I prepare a text for natural language processing, using the lyrics of 'Aquemini' by the hip hop duo Outkast, a song from the album of the same name. The procedure covers removing unnecessary components (capitalization, punctuation, and stop words), tokenizing the text, and then stemming and lemmatizing the tokens.**

In [1]:
# Read the whole lyrics file into one string; the context manager closes the file.
with open('Aquemini.txt', 'r') as f:
    lyrics = f.read()
lyrics

'Even the Sun goes down, heroes eventually die\nHoroscopes often lie\nAnd sometimes, "Y"\nNothing is for sure, nothing is for certain\nNothing lasts forever, but until they close the curtain\n(Y\'all know)\nIt\'s him and I: Aquemini\n\nNow is the time to get on—like Spike Lee said, "Get on the Bus"\nGo get your work and keep your beeper chirping—it\'s a must\nIs you on that dust or corn starch? Familiar with that smack, man?\nThe music is like that green stuff provided to you by sack-man\nPacman, how in the fuck you think we gon\' do that, man?\nRiding round Old National on eighteens without no gat, man\nI\'m strapped, man, and ready to bust on any nigga like that, man\nMe and my nigga, we roll together like Batman and Robin\nWe prayed together through hard times, swung hard when it was fitting\nBut now, we tapping the brakes from all them corners that we be bending\nIn Volkswagens and Bonnevilles, Chevrolets and Coupe de Villes\nIf you ain\'t got no rims, nigga, don\'t get no wood-gra

In [2]:
import re
wordCount = len(re.findall(r'\w+', lyrics))
print("There are %i words in the song."%wordCount)

There are 782 words in the song.
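**One caveat about this count: the pattern `\w+` treats an apostrophe as a word boundary, so a contraction such as "It's" is counted as two words. Below is a minimal sketch comparing the regex count with a plain whitespace split on one line of the lyrics. The variable names are mine and do not appear in the cells above.**

```python
import re

# One line from the lyrics, used only for illustration.
line = "It's him and I: Aquemini"

regex_words = re.findall(r'\w+', line)  # splits "It's" into "It" and "s"
space_words = line.split()              # keeps "It's" and "I:" whole

print(len(regex_words), regex_words)
print(len(space_words), space_words)
```

**Neither count is "the" correct one; they simply answer slightly different questions about what a word is.**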


In [3]:
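# Punctuation marks to delete; note that dashes are removed without leaving a space.
punc = ["!", "?", ".", '"', "(", ")", ",", ":", "'", "-", "—"]
cleanLyrics = ''.join(letter for letter in lyrics if letter not in punc)

# Replace line breaks with spaces so the lyrics become one continuous string.
cleanLyrics = cleanLyrics.replace("\n", "  ")
cleanLyrics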

'Even the Sun goes down heroes eventually die  Horoscopes often lie  And sometimes Y  Nothing is for sure nothing is for certain  Nothing lasts forever but until they close the curtain  Yall know  Its him and I Aquemini    Now is the time to get onlike Spike Lee said Get on the Bus  Go get your work and keep your beeper chirpingits a must  Is you on that dust or corn starch Familiar with that smack man  The music is like that green stuff provided to you by sackman  Pacman how in the fuck you think we gon do that man  Riding round Old National on eighteens without no gat man  Im strapped man and ready to bust on any nigga like that man  Me and my nigga we roll together like Batman and Robin  We prayed together through hard times swung hard when it was fitting  But now we tapping the brakes from all them corners that we be bending  In Volkswagens and Bonnevilles Chevrolets and Coupe de Villes  If you aint got no rims nigga dont get no woodgrain steering wheel for real Real  You can go on
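**Deleting a dash outright fuses the words on either side of it, which is where "onlike" and "chirpingits" in the output above come from. Here is a minimal sketch of one possible alternative, assuming dashes should act as word separators. `clean_line`, `DASHES`, and `PUNCT` are hypothetical names, not part of the notebook.**

```python
# Sketch: treat dashes as separators and delete the other punctuation marks.
DASHES = ["-", "—"]
PUNCT = ["!", "?", ".", '"', "(", ")", ",", ":", "'"]

def clean_line(text):
    """Replace each dash with a space, then delete the remaining punctuation."""
    for dash in DASHES:
        text = text.replace(dash, " ")
    return "".join(ch for ch in text if ch not in PUNCT)

sample = 'Now is the time to get on—like Spike Lee said, "Get on the Bus"'
cleaned = clean_line(sample)
print(cleaned)
```

**The trade-off is that hyphenated compounds such as "wood-grain" would also be split into two words. Keeping only the em dash in `DASHES` would avoid that.**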

**I think it is safe to remove only the most common stop words (e.g. "and", "the"), since analyses that include them are unlikely to yield meaningful conclusions. Removing swear words also makes sense: words like "fuck" are rarely used in their literal sense ("fucking ___") and mostly serve as filler.**

In [6]:
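# Common English stop words, plus profanity and filler words to drop.
# Entries with apostrophes (e.g. "don't", "fuckin'") never match here,
# because apostrophes were already stripped from cleanLyrics.
commonStop = "and I a about an are as at be by com for from how in is it of on or that the this to was what when where who will with the www it's it won't can't don't".split()
stopwords = commonStop+["fuck", "shit", "bitch", "fucking", "fuckin'", "yeah"]
lyricsNoStops = ""
# Lowercase the text, then keep every word that is not in the stop list.
for word in cleanLyrics.lower().split():
    if word not in stopwords:
        lyricsNoStops+=word
        lyricsNoStops+=" "
lyricsNoStops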

'even sun goes down heroes eventually die horoscopes often lie sometimes y nothing sure nothing certain nothing lasts forever but until they close curtain yall know its him i aquemini now time get onlike spike lee said get bus go get your work keep your beeper chirpingits must you dust corn starch familiar smack man music like green stuff provided you sackman pacman you think we gon do man riding round old national eighteens without no gat man im strapped man ready bust any nigga like man me my nigga we roll together like batman robin we prayed together through hard times swung hard fitting but now we tapping brakes all them corners we bending volkswagens bonnevilles chevrolets coupe de villes if you aint got no rims nigga dont get no woodgrain steering wheel real real you can go chill out still build let your paper stack stead going into overkill pay your fuckin beeper bill even sun goes down heroes eventually die horoscopes often lie sometimes y nothing sure nothing certain nothing l
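**Because apostrophes were stripped before this step, stop-list entries such as "don't" and "fuckin'" can never match, which is why "dont" and "fuckin" survive in the output above. Here is a minimal sketch of one way to line the two up, assuming the stop list should receive the same apostrophe removal as the text. `raw_stops` and `stop_set` are illustrative names, and the short stop list is a subset chosen for the example.**

```python
# Sketch: strip apostrophes from the stop list so it matches the cleaned text.
raw_stops = ["and", "the", "it's", "don't", "fuckin'"]
stop_set = {word.replace("'", "") for word in raw_stops}

cleaned = "pay your fuckin beeper bill and dont get no woodgrain"
kept = [word for word in cleaned.lower().split() if word not in stop_set]
result = " ".join(kept)
print(result)
```

**A side effect to keep in mind: normalizing "it's" to "its" also removes the possessive "its". Using a set instead of a list makes each membership check faster, though that hardly matters for a single song.**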

In [12]:
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
tokens = word_tokenize(lyricsNoStops)
print("There are %i tokens (words) in the corpus."%len(tokens))

There are 557 tokens (words) in the corpus.


[nltk_data] Downloading package punkt to /Users/emrekurt/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [14]:
def word_count(words):
    # Map each distinct word to the number of times it occurs.
    counts = dict()
    for word in words:
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1

    return counts

print("There are %i types (unique words) in the corpus."%len(word_count(tokens)))

There are 313 types (unique words) in the corpus.
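**The same token and type counts can be obtained with `collections.Counter` from the standard library. Below is a minimal sketch on a short sample rather than the full corpus. `tokens_sample` is a made-up list for illustration, not the `tokens` variable from the notebook.**

```python
from collections import Counter

# A short sample standing in for the full token list.
tokens_sample = "nothing is for sure nothing is for certain".split()
counts = Counter(tokens_sample)

print(len(tokens_sample))       # number of tokens
print(len(counts))              # number of types (distinct words)
print(counts.most_common(2))    # the two most frequent words
```

**`Counter` also offers `most_common`, which is handy for inspecting the most frequent words without writing a sort by hand.**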


In [15]:
from nltk.stem import PorterStemmer
ps = PorterStemmer()

for token in tokens:
    roots = ps.stem(token)
    print([token, roots])

['even', 'even']
['sun', 'sun']
['goes', 'goe']
['down', 'down']
['heroes', 'hero']
['eventually', 'eventu']
['die', 'die']
['horoscopes', 'horoscop']
['often', 'often']
['lie', 'lie']
['sometimes', 'sometim']
['y', 'y']
['nothing', 'noth']
['sure', 'sure']
['nothing', 'noth']
['certain', 'certain']
['nothing', 'noth']
['lasts', 'last']
['forever', 'forev']
['but', 'but']
['until', 'until']
['they', 'they']
['close', 'close']
['curtain', 'curtain']
['yall', 'yall']
['know', 'know']
['its', 'it']
['him', 'him']
['i', 'i']
['aquemini', 'aquemini']
['now', 'now']
['time', 'time']
['get', 'get']
['onlike', 'onlik']
['spike', 'spike']
['lee', 'lee']
['said', 'said']
['get', 'get']
['bus', 'bu']
['go', 'go']
['get', 'get']
['your', 'your']
['work', 'work']
['keep', 'keep']
['your', 'your']
['beeper', 'beeper']
['chirpingits', 'chirpingit']
['must', 'must']
['you', 'you']
['dust', 'dust']
['corn', 'corn']
['starch', 'starch']
['familiar', 'familiar']
['smack', 'smack']
['man', 'man']
['music'

In [16]:
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
wordnet_lemmatizer = WordNetLemmatizer()

for token in tokens:
    print("Lemma for {} is {}".format(token, wordnet_lemmatizer.lemmatize(token)))

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/emrekurt/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Lemma for even is even
Lemma for sun is sun
Lemma for goes is go
Lemma for down is down
Lemma for heroes is hero
Lemma for eventually is eventually
Lemma for die is die
Lemma for horoscopes is horoscope
Lemma for often is often
Lemma for lie is lie
Lemma for sometimes is sometimes
Lemma for y is y
Lemma for nothing is nothing
Lemma for sure is sure
Lemma for nothing is nothing
Lemma for certain is certain
Lemma for nothing is nothing
Lemma for lasts is last
Lemma for forever is forever
Lemma for but is but
Lemma for until is until
Lemma for they is they
Lemma for close is close
Lemma for curtain is curtain
Lemma for yall is yall
Lemma for know is know
Lemma for its is it
Lemma for him is him
Lemma for i is i
Lemma for aquemini is aquemini
Lemma for now is now
Lemma for time is time
Lemma for get is get
Lemma for onlike is onlike
Lemma for spike is spike
Lemma for lee is lee
Lemma for said is said
Lemma for get is get
Lemma for bus is bus
Lemma for go is go
Lemma for get is get
Lemma fo
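**To see how the two approaches differ, the stems and lemmas printed above can be put side by side. The sketch below copies a few (token, stem, lemma) triples from the two outputs by hand and lists the tokens where they disagree. It does not call NLTK, and `rows` is only an illustrative sample.**

```python
# (token, Porter stem, WordNet lemma) triples copied from the printed outputs above.
rows = [
    ("goes", "goe", "go"),
    ("heroes", "hero", "hero"),
    ("eventually", "eventu", "eventually"),
    ("horoscopes", "horoscop", "horoscope"),
    ("nothing", "noth", "nothing"),
    ("bus", "bu", "bus"),
]

# Tokens for which stemming and lemmatization give different results.
differ = [token for token, stem, lemma in rows if stem != lemma]
print(differ)
```

**In this sample the stemmer often produces strings that are not words ("eventu", "noth", "bu"), while the lemmatizer returns dictionary forms.**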

**At this point, the single text we used is cleaned and ready for analysis. If we later want to keep something that was removed (for example, stop words), the corresponding step can be deleted or commented out.**
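**An alternative to editing cells is to wrap the cleaning steps in a function whose arguments switch each step on or off. Here is a minimal sketch under that assumption. `preprocess` and its parameters are hypothetical, and it covers only lowercasing, punctuation removal, and stop word removal.**

```python
def preprocess(text, punctuation=(), stop_words=(), lowercase=True):
    """Return a list of tokens; each cleaning step can be switched off."""
    if lowercase:
        text = text.lower()
    # Delete every character listed in `punctuation`.
    text = "".join(ch for ch in text if ch not in punctuation)
    # Split on whitespace and drop any stop words.
    return [word for word in text.split() if word not in stop_words]

line = "Even the Sun goes down, heroes eventually die"
full = preprocess(line, punctuation=[","], stop_words={"the"})
no_stop_removal = preprocess(line, punctuation=[","])
print(full)
print(no_stop_removal)
```

**Leaving out the `stop_words` argument keeps every word, so nothing has to be deleted from the notebook to compare the two versions.**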