# Text cleaning

In [1]:
import pandas as pd

from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

from sklearn.datasets import fetch_20newsgroups

In [2]:
# load data
data = fetch_20newsgroups(subset='train')
df = pd.DataFrame(data.data, columns=['text'])
df.head()

Unnamed: 0,text
0,From: lerxst@wam.umd.edu (where's my thing)\nS...
1,From: guykuo@carson.u.washington.edu (Guy Kuo)...
2,From: twillis@ec.ecn.purdue.edu (Thomas E Will...
3,From: jgreen@amber (Joe Green)\nSubject: Re: W...
4,From: jcm@head-cfa.harvard.edu (Jonathan McDow...


In [3]:
# print example of text

print(df['text'][10])

From: irwin@cmptrc.lonestar.org (Irwin Arnstein)
Subject: Re: Recommendation on Duc
Summary: What's it worth?
Distribution: usa
Expires: Sat, 1 May 1993 05:00:00 GMT
Organization: CompuTrac Inc., Richardson TX
Keywords: Ducati, GTS, How much? 
Lines: 13

I have a line on a Ducati 900GTS 1978 model with 17k on the clock.  Runs
very well, paint is the bronze/brown/orange faded out, leaks a bit of oil
and pops out of 1st with hard accel.  The shop will fix trans and oil 
leak.  They sold the bike to the 1 and only owner.  They want $3495, and
I am thinking more like $3K.  Any opinions out there?  Please email me.
Thanks.  It would be a nice stable mate to the Beemer.  Then I'll get
a jap bike and call myself Axis Motors!

-- 
-----------------------------------------------------------------------
"Tuba" (Irwin)      "I honk therefore I am"     CompuTrac-Richardson,Tx
irwin@cmptrc.lonestar.org    DoD #0826          (R75/6)
-------------------------------------------------------------------

In [4]:
# remove punctuation (note: \w includes the underscore, so '_' is kept)
df['text'] = df['text'].str.replace(r'[^\w\s]', '', regex=True)

In [5]:
# print example without punctuation

print(df['text'][10])

From irwincmptrclonestarorg Irwin Arnstein
Subject Re Recommendation on Duc
Summary Whats it worth
Distribution usa
Expires Sat 1 May 1993 050000 GMT
Organization CompuTrac Inc Richardson TX
Keywords Ducati GTS How much 
Lines 13

I have a line on a Ducati 900GTS 1978 model with 17k on the clock  Runs
very well paint is the bronzebrownorange faded out leaks a bit of oil
and pops out of 1st with hard accel  The shop will fix trans and oil 
leak  They sold the bike to the 1 and only owner  They want 3495 and
I am thinking more like 3K  Any opinions out there  Please email me
Thanks  It would be a nice stable mate to the Beemer  Then Ill get
a jap bike and call myself Axis Motors

 

Tuba Irwin      I honk therefore I am     CompuTracRichardsonTx
irwincmptrclonestarorg    DoD 0826          R756
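
A note on the pattern used in In [4]: `\w` matches letters, digits and the underscore, so `[^\w\s]` leaves underscores in place. The sketch below uses plain `re` on a made-up sample string (an assumption for illustration, not taken from the dataset) to show the difference, plus one possible variant that also drops underscores.

```python
import re

# Made-up sample string (hypothetical, not from the dataset)
sample = "user_name@host.org -- $3,495!"

# Same pattern as in In [4]: everything that is not a word character
# or whitespace is removed, but '_' counts as a word character and stays
cleaned = re.sub(r"[^\w\s]", "", sample)

# Variant that also removes underscores
cleaned_no_underscore = re.sub(r"[^\w\s]|_", "", sample)

print(cleaned)
print(cleaned_no_underscore)
```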




In [6]:
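# alternative way to remove punctuation
# (re.escape keeps characters such as ']', '\' and '^' literal inside the class)
import re
import string
df['text'] = df['text'].str.replace('[{}]'.format(re.escape(string.punctuation)), '', regex=True)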

In [7]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
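
Because `string.punctuation` contains characters that are special inside a regex character class, how the class is built matters. The sketch below compares an unescaped class, an escaped class, and a regex-free approach with `str.translate`, all on a made-up sample string (an assumption for illustration). On a pandas column the same table could be passed to `Series.str.translate`.

```python
import re
import string

# Made-up sample string (hypothetical, not from the dataset)
sample = r"path\to[file] ^caret-dash"

# Unescaped: inside the class, "\]" is read as an escaped "]",
# so the backslash itself is not part of the set and survives
unescaped = re.sub("[{}]".format(string.punctuation), "", sample)

# Escaped: every punctuation character is matched literally
escaped = re.sub("[{}]".format(re.escape(string.punctuation)), "", sample)

# Regex-free alternative: delete the characters with a translation table
table = str.maketrans("", "", string.punctuation)
translated = sample.translate(table)

print(unescaped)
print(escaped)
print(translated)
```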

In [8]:
# remove numbers, keep only text

df['text'] = df['text'].str.replace(r'\d+', '', regex=True)

In [9]:
# print example without numbers

print(df['text'][10])

From irwincmptrclonestarorg Irwin Arnstein
Subject Re Recommendation on Duc
Summary Whats it worth
Distribution usa
Expires Sat  May   GMT
Organization CompuTrac Inc Richardson TX
Keywords Ducati GTS How much 
Lines 

I have a line on a Ducati GTS  model with k on the clock  Runs
very well paint is the bronzebrownorange faded out leaks a bit of oil
and pops out of st with hard accel  The shop will fix trans and oil 
leak  They sold the bike to the  and only owner  They want  and
I am thinking more like K  Any opinions out there  Please email me
Thanks  It would be a nice stable mate to the Beemer  Then Ill get
a jap bike and call myself Axis Motors

 

Tuba Irwin      I honk therefore I am     CompuTracRichardsonTx
irwincmptrclonestarorg    DoD           R
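
Removing digits leaves runs of spaces behind, as in the `Expires` line above. The later `split()`-based steps discard extra whitespace anyway, but if the text is used directly at this stage, the whitespace can be collapsed. A minimal sketch with plain `re`, not part of the original notebook:

```python
import re

# Header line from the example post, after punctuation removal
sample = "Expires Sat 1 May 1993 050000 GMT"

# Same digit-removal pattern as in In [8]
no_digits = re.sub(r"\d+", "", sample)

# Collapse any run of whitespace to a single space and trim the ends.
# Assumed pandas equivalent:
#   df['text'].str.replace(r'\s+', ' ', regex=True).str.strip()
collapsed = re.sub(r"\s+", " ", no_digits).strip()

print(repr(no_digits))
print(repr(collapsed))
```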




In [10]:
# put in lower case

df['text'] = df['text'].str.lower()

In [11]:
# print example in lower case

print(df['text'][10])

from irwincmptrclonestarorg irwin arnstein
subject re recommendation on duc
summary whats it worth
distribution usa
expires sat  may   gmt
organization computrac inc richardson tx
keywords ducati gts how much 
lines 

i have a line on a ducati gts  model with k on the clock  runs
very well paint is the bronzebrownorange faded out leaks a bit of oil
and pops out of st with hard accel  the shop will fix trans and oil 
leak  they sold the bike to the  and only owner  they want  and
i am thinking more like k  any opinions out there  please email me
thanks  it would be a nice stable mate to the beemer  then ill get
a jap bike and call myself axis motors

 

tuba irwin      i honk therefore i am     computracrichardsontx
irwincmptrclonestarorg    dod           r




In [12]:
stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [13]:
# remove stop words

# build the stop-word set once instead of on every call
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    # keep only the words that are not in the stop-word set
    return ' '.join(word for word in text.split() if word not in stop_words)

In [14]:
# test function on single text

remove_stopwords(df['text'][10])

'irwincmptrclonestarorg irwin arnstein subject recommendation duc summary whats worth distribution usa expires sat may gmt organization computrac inc richardson tx keywords ducati gts much lines line ducati gts model k clock runs well paint bronzebrownorange faded leaks bit oil pops st hard accel shop fix trans oil leak sold bike owner want thinking like k opinions please email thanks would nice stable mate beemer ill get jap bike call axis motors tuba irwin honk therefore computracrichardsontx irwincmptrclonestarorg dod r'
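
The order of the cleaning steps matters here. The stop word list printed above contains contractions written with apostrophes ("you're", "that'll"), but punctuation was already stripped from the text, so the text contains "youre" and "thatll", which do not match. One possible remedy is to add punctuation-free variants to the stop set. The sketch below uses a small hand-picked subset of the list and a hypothetical helper `strip_punct`; both are assumptions made to keep the example self-contained.

```python
import re

# Hand-picked subset of the NLTK list shown above (assumption: the notebook
# itself would use set(stopwords.words('english')) here)
stop = {"you're", "it's", "that'll", "the", "is"}

def strip_punct(word):
    # Hypothetical helper: same pattern as the punctuation-removal cell
    return re.sub(r"[^\w\s]", "", word)

# Add punctuation-free variants so they match the already cleaned text
stop_clean = stop | {strip_punct(w) for w in stop}

text = "youre sure thatll work"
kept_before = [w for w in text.split() if w not in stop]
kept_after = [w for w in text.split() if w not in stop_clean]

print(kept_before)
print(kept_after)
```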

In [15]:
# apply function to entire dataframe
# (this operation takes a while)

df['text'] = df['text'].apply(remove_stopwords)

In [16]:
# print example text without stopwords

print(df['text'][10])

irwincmptrclonestarorg irwin arnstein subject recommendation duc summary whats worth distribution usa expires sat may gmt organization computrac inc richardson tx keywords ducati gts much lines line ducati gts model k clock runs well paint bronzebrownorange faded leaks bit oil pops st hard accel shop fix trans oil leak sold bike owner want thinking like k opinions please email thanks would nice stable mate beemer ill get jap bike call axis motors tuba irwin honk therefore computracrichardsontx irwincmptrclonestarorg dod r


In [17]:
# Stemming

# see http://www.nltk.org/howto/stem.html for other stemmers

stemmer = SnowballStemmer("english")

In [18]:
# test stemmer on one word
stemmer.stem('running')

'run'

In [19]:
def stemm_words(text):
    # stem each whitespace-separated word and rejoin into one string
    return ' '.join(stemmer.stem(word) for word in text.split())

In [20]:
# test function on single text

stemm_words(df['text'][10])

'irwincmptrclonestarorg irwin arnstein subject recommend duc summari what worth distribut usa expir sat may gmt organ computrac inc richardson tx keyword ducati gts much line line ducati gts model k clock run well paint bronzebrownorang fade leak bit oil pop st hard accel shop fix tran oil leak sold bike owner want think like k opinion pleas email thank would nice stabl mate beemer ill get jap bike call axi motor tuba irwin honk therefor computracrichardsontx irwincmptrclonestarorg dod r'
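
The same words recur many times across documents, so caching per-word results may reduce the cost of stemming a whole column. The sketch below illustrates the idea with `functools.lru_cache`. `toy_stem` is a hypothetical stand-in that only strips a couple of suffixes, so the example runs without NLTK; in the notebook it would be replaced by `stemmer.stem`. Whether caching helps in practice depends on the data and should be measured.

```python
from functools import lru_cache

def toy_stem(word):
    # Hypothetical stand-in for stemmer.stem: strips a few suffixes only
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

@lru_cache(maxsize=None)
def cached_stem(word):
    # Each distinct word is stemmed once; repeats are served from the cache
    return toy_stem(word)

def stem_words_cached(text):
    return " ".join(cached_stem(word) for word in text.split())

result = stem_words_cached("leaks leaks paint leaks")
info = cached_stem.cache_info()

print(result)
print(info.hits, info.misses)
```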

In [21]:
# stem entire dataframe
df['text'] = df['text'].apply(stemm_words)

In [22]:
# print example with stemmed words

print(df['text'][10])

irwincmptrclonestarorg irwin arnstein subject recommend duc summari what worth distribut usa expir sat may gmt organ computrac inc richardson tx keyword ducati gts much line line ducati gts model k clock run well paint bronzebrownorang fade leak bit oil pop st hard accel shop fix tran oil leak sold bike owner want think like k opinion pleas email thank would nice stabl mate beemer ill get jap bike call axi motor tuba irwin honk therefor computracrichardsontx irwincmptrclonestarorg dod r
