# Data Cleaning

## Cleaning The Data

In [1]:
import pandas as pd

data_df = pd.read_csv("lyrics.csv",index_col=0)
data_df

Unnamed: 0,lyrics
ABBA,"[Verse 1] I, I've been in love before I though..."
David_Bowie,[Intro] [Verse 1] A small Jean Genie snuck of...
Janis_Joplin,"[Intro] Oh, come on, come on, come on, come on..."
Michael_Jackson,"[Verse 1] Your butt is mine, gonna tell you ri..."
Queen,[Verse 1] I can dim the lights and sing you so...
Rolling_Stones,[Intro] What a drag it is getting old [Verse ...
The_Clash,Stay around don't play around This old town an...
Bob_Dylan,[Verse 1] Go away from my window Leave at your...
Elton_John,[Verse 1] Can you hear it in the distance? Can...
Led_Zeppeling,[Intro] Hey That's right [Verse 1] Asked swee...


In [7]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/mausoto/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/mausoto/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [8]:
import re
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def clean_text_1(text):
    # Lowercase
    text = text.lower()
    # Remove bracketed section markers ([chorus], [guitar], etc.)
    text = re.sub(r'\[.*?\]', '', text)
    # Remove ASCII punctuation
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    # Remove words containing digits
    text = re.sub(r'\w*\d\w*', '', text)
    # Remove curly quotes and ellipses, which string.punctuation does not cover
    text = re.sub('[‘’“”…]', '', text)
    # Replace newlines with spaces
    text = re.sub(r'\n', ' ', text)
    # Drop stop words and single-character tokens
    stop_words = set(stopwords.words('english'))
    words = word_tokenize(text)
    new_text = ""
    for w in words:
        if w not in stop_words and len(w) > 1:
            new_text = new_text + " " + w
    return new_text

In [9]:
# Apply the first cleaning pass to every artist's lyrics
data_clean = pd.DataFrame(data_df.lyrics.apply(clean_text_1))

In [10]:
data_df.loc['David_Bowie']['lyrics'][:500]

"[Intro]  [Verse 1] A small Jean Genie snuck off to the city Strung out on lasers and slash back blazers And ate all your razors while pulling the waiters Talking 'bout Monroe and walking on Snow White New York's a go-go and everything tastes nice Poor little Greenie Woo-hoo  (Get back one)  [Chorus] The Jean Genie lives on his back The Jean Genie loves chimney stacks (The Jean Genie) he's outrageous, he screams and he bawls The Jean Genie, let yourself go, oh  [Interlude]  [Verse 2] Sits like a "

In [11]:
data_clean.loc['David_Bowie']['lyrics'][:500]

' small jean genie snuck city strung lasers slash back blazers ate razors pulling waiters talking bout monroe walking snow white new yorks gogo everything tastes nice poor little greenie woohoo get back one jean genie lives back jean genie loves chimney stacks jean genie hes outrageous screams bawls jean genie let go oh sits like man smiles like reptile loves loves short shell scratch sand wont let go hand says hes beautician sells nutrition keeps dead hair making underwear poor little greenie je'
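
To see what each regular-expression step in `clean_text_1` contributes, the sketch below applies the same substitutions one at a time to a made-up sample string. The sample text and the `step*` names are illustrative only and are not part of the dataset. It stops before stop-word removal so that it needs only the standard library.

```python
import re
import string

# Made-up sample, not taken from lyrics.csv
sample = "[Chorus] The Jean Genie lives on his back, 2 times!\nHe’s outrageous…"

step1 = sample.lower()                                            # lowercase
step2 = re.sub(r'\[.*?\]', '', step1)                             # drop [chorus]
step3 = re.sub('[%s]' % re.escape(string.punctuation), '', step2) # drop , and !
step4 = re.sub(r'\w*\d\w*', '', step3)                            # drop "2"
step5 = re.sub('[‘’“”…]', '', step4)                              # drop ’ and …
step6 = re.sub(r'\n', ' ', step5)                                 # newline -> space

for name, value in [("lower", step1), ("brackets", step2), ("punctuation", step3),
                    ("digits", step4), ("quotes", step5), ("newlines", step6)]:
    print(f"{name:<12} {value!r}")

tokens = step6.split()
print(tokens)
```

Removing the apostrophe turns "he’s" into "hes", which is the same effect that produces "hes" and "yorks" in the cleaned David Bowie lyrics above.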

## Stemming / Lemmatization

In [23]:
import nltk

from nltk.tokenize import sent_tokenize, word_tokenize

from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer

In [24]:
# Sample words, grouped by part of speech, for comparing stemmers and lemmatizers
verb_list = ["was","were","am","general","generalize","generalizing","insurance","insured"]
noun_list = ["dogs","feet","insurance","knowledge"]
adjec_list = ["harder","better","faster","stronger"]

In [29]:
porter = PorterStemmer()
lancaster = LancasterStemmer()

print("%-20s %-20s %-20s"% ("Word","Porter Stemmer","lancaster Stemmer"))
for word in verb_list:
    print("%-20s %-20s %-20s"%(word, porter.stem(word),lancaster.stem(word)))
print("--")
for word in noun_list:
    print("%-20s %-20s %-20s"%(word, porter.stem(word),lancaster.stem(word)))
print("--")
for word in adjec_list:
    print("%-20s %-20s %-20s"%(word, porter.stem(word),lancaster.stem(word)))

Word                 Porter Stemmer       lancaster Stemmer   
was                  wa                   was                 
were                 were                 wer                 
am                   am                   am                  
general              gener                gen                 
generalize           gener                gen                 
generalizing         gener                gen                 
insurance            insur                ins                 
insured              insur                ins                 
--
dogs                 dog                  dog                 
feet                 feet                 feet                
insurance            insur                ins                 
knowledge            knowledg             knowledg            
--
harder               harder               hard                
better               better               bet                 
faster               faster               fast   
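
Both stemmers strip suffixes by rule rather than looking words up, which is why stems such as `gener` or `insur` are not dictionary words. The toy function below is a deliberately simplified illustration of that idea. It is not the Porter or Lancaster algorithm, and `toy_stem` and its suffix list are hypothetical.

```python
def toy_stem(word, suffixes=("izing", "ize", "ance", "ed", "s")):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word  # no rule applied


for word in ["generalizing", "generalize", "insurance", "insured", "dogs", "feet", "was"]:
    print(f"{word:<15} {toy_stem(word)}")
```

A rule-based approach like this cannot relate an irregular form such as "feet" to "foot", which is one motivation for lemmatization.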

In [32]:
import nltk
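from nltk.stem import WordNetLemmatizer

# Initialise the WordNet lemmatizer
# (needs the WordNet corpus; run nltk.download('wordnet') once if it is missing)
lemmatizer = WordNetLemmatizer()

# Without a pos argument the word is treated as a noun, so it comes back unchanged.
# Pass pos='v' to lemmatize it as a verb.
# POS codes: ADJ='a', ADJ_SAT='s', ADV='r', NOUN='n', VERB='v'
print(lemmatizer.lemmatize("generalized"))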



generalized


In [33]:
print("%-20s  %-20s"% ("Word","WordNet Lemmatizer"))
for word in verb_list:
    print("%-20s %-20s"%(word,lemmatizer.lemmatize(word, pos='v')))
print("--")
for word in noun_list:
    print("%-20s %-20s"%(word,lemmatizer.lemmatize(word, pos='n')))
print("--")
for word in adjec_list:
    print("%-20s %-20s"%(word,lemmatizer.lemmatize(word, pos='a')))

Word                  WordNet Lemmatizer  
was                  be                  
were                 be                  
am                   be                  
general              general             
generalize           generalize          
generalizing         generalize          
insurance            insurance           
insured              insure              
--
dogs                 dog                 
feet                 foot                
insurance            insurance           
knowledge            knowledge           
--
harder               hard                
better               good                
faster               fast                
stronger             strong              
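
Unlike a stemmer, a lemmatizer maps each word to a dictionary form, and the right form depends on the part of speech. The sketch below mimics that behaviour with a tiny hand-written lookup table whose entries are copied from the output above. `TOY_LEMMAS` and `toy_lemmatize` are hypothetical names; the real WordNet lemmatizer relies on WordNet's own morphology rules and exception lists, not on a table like this.

```python
# (word, pos) -> lemma, using a few rows from the table above
TOY_LEMMAS = {
    ("was", "v"): "be",
    ("were", "v"): "be",
    ("feet", "n"): "foot",
    ("better", "a"): "good",
}


def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word itself if it is not listed."""
    return TOY_LEMMAS.get((word, pos), word)


for word, pos in [("was", "v"), ("feet", "n"), ("better", "a"), ("better", "n")]:
    print(f"{word:<10} {pos:<3} {toy_lemmatize(word, pos)}")
```

In this toy table "better" only becomes "good" when it is looked up as an adjective, which illustrates why the part of speech has to be supplied.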


In [34]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /home/mausoto/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/mausoto/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [37]:
# POS tagging

from nltk import word_tokenize, pos_tag

txt = "Remember when you were young, you shone like the sun Shine on you crazy diamond"
pos_tag(word_tokenize(txt))

[('Remember', 'NNP'),
 ('when', 'WRB'),
 ('you', 'PRP'),
 ('were', 'VBD'),
 ('young', 'JJ'),
 (',', ','),
 ('you', 'PRP'),
 ('shone', 'VBP'),
 ('like', 'IN'),
 ('the', 'DT'),
 ('sun', 'NN'),
 ('Shine', 'NN'),
 ('on', 'IN'),
 ('you', 'PRP'),
 ('crazy', 'VBP'),
 ('diamond', 'NN')]

POS tag list (Penn Treebank tag set):

- CC: coordinating conjunction
- CD: cardinal number
- DT: determiner
- EX: existential "there" (as in "there is")
- FW: foreign word
- IN: preposition or subordinating conjunction
- JJ: adjective (big)
- JJR: adjective, comparative (bigger)
- JJS: adjective, superlative (biggest)
- LS: list item marker ("1)")
- MD: modal (could, will)
- NN: noun, singular (desk)
- NNS: noun, plural (desks)
- NNP: proper noun, singular (Harrison)
- NNPS: proper noun, plural (Americans)
- PDT: predeterminer ("all" in "all the kids")
- POS: possessive ending ("'s" in "parent's")
- PRP: personal pronoun (I, he, she)
- PRP\$: possessive pronoun (my, his, hers)
- RB: adverb (very, silently)
- RBR: adverb, comparative (better)
- RBS: adverb, superlative (best)
- RP: particle ("up" in "give up")
- TO: "to" (as in "go to the store")
- UH: interjection (errrrrrrrm)
- VB: verb, base form (take)
- VBD: verb, past tense (took)
- VBG: verb, gerund or present participle (taking)
- VBN: verb, past participle (taken)
- VBP: verb, present tense, not 3rd person singular (take)
- VBZ: verb, present tense, 3rd person singular (takes)
- WDT: wh-determiner (which)
- WP: wh-pronoun (who, what)
- WP\$: possessive wh-pronoun (whose)
- WRB: wh-adverb (where, when)
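
WordNet expects its own one-letter POS codes ('a', 'r', 'n', 'v') rather than Penn Treebank tags, so the tags above need translating before they can be passed to the lemmatizer. A minimal sketch of such a mapping, keyed on the tag prefix, could look like this. `penn_to_wordnet` is a hypothetical helper, and both the adverb mapping and the fallback to noun are assumptions of this sketch.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code (hypothetical helper)."""
    if tag.startswith("JJ"):
        return "a"  # adjectives: JJ, JJR, JJS
    if tag.startswith("RB"):
        return "r"  # adverbs: RB, RBR, RBS
    if tag.startswith("VB"):
        return "v"  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    return "n"      # nouns and every other tag fall back to noun


# A few (token, tag) pairs taken from the tagged lyric above
tagged = [("were", "VBD"), ("young", "JJ"), ("sun", "NN"), ("the", "DT")]
for token, tag in tagged:
    print(f"{token:<8} {tag:<4} {penn_to_wordnet(tag)}")
```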

In [38]:
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()

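def lemmatize_tag(text):
    """Lemmatize each token using the part of speech assigned by pos_tag."""
    lemmas = []
    for token, tag in pos_tag(word_tokenize(text)):
        # The first letter of a Penn Treebank tag gives the coarse class:
        # J = adjective, N = noun, V = verb
        pos = tag[0].lower()
        if pos in ('j', 'n', 'v'):
            # WordNet uses 'a' for adjectives
            if pos == 'j':
                pos = 'a'
            lemmas.append(wnl.lemmatize(token, pos))
        else:
            # Any other tag falls back to the default (noun) lemmatization
            lemmas.append(wnl.lemmatize(token))
    return ' '.join(lemmas)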



In [40]:
data_clean = pd.DataFrame(data_clean.lyrics.apply(lemmatize_tag))

In [41]:
data_clean.loc['David_Bowie']['lyrics'][:500]

'small jean genie snuck city strung laser slash back blazer eat razor pull waiter talk bout monroe walk snow white new york gogo everything taste nice poor little greenie woohoo get back one jean genie live back jean genie love chimney stack jean genie he outrageous scream bawl jean genie let go oh sits like man smile like reptile love love short shell scratch sand wont let go hand say he beautician sell nutrition keep dead hair make underwear poor little greenie jean genie live back jean genie l'

### Save clean data

In [42]:
data_clean.to_csv('lyrics_clean.csv')

## Question
1. Which further cleaning steps could be applied to the text?