<a href="https://colab.research.google.com/github/CRSpradlin/natural-language-processing-course/blob/main/NLP%20Course%20Work/6.%20Text%20Cleaning%20and%20Preprocessing/CompleteTextCleaningAndPreprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [135]:
!pip install beautifulsoup4
!pip install spacy
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [136]:
import pandas as pd
import numpy as np
import spacy

In [137]:
from spacy.lang.en.stop_words import STOP_WORDS as stopwords

In [138]:
df = pd.read_csv('https://raw.githubusercontent.com/laxmimerit/twitter-data/master/twitter4000.csv', encoding = 'latin1')

In [139]:
df

Unnamed: 0,twitts,sentiment
0,is bored and wants to watch a movie any sugge...,0
1,back in miami. waiting to unboard ship,0
2,"@misskpey awwww dnt dis brng bak memoriessss, ...",0
3,ughhh i am so tired blahhhhhhhhh,0
4,@mandagoforth me bad! It's funny though. Zacha...,0
...,...,...
3995,i just graduated,1
3996,Templating works; it all has to be done,1
3997,mommy just brought me starbucks,1
3998,@omarepps watching you on a House re-run...lov...,1


## Goal: Preprocess This Data
This data contains a lot of shorthand and informal spellings that are not standard English. We need to clean it so that downstream processing is as accurate as possible.

In [140]:
df['sentiment'].value_counts()

sentiment
0    2000
1    2000
Name: count, dtype: int64

## Word Counts

In [141]:
len('this is example text'.split())

4

In [142]:
# Calculate the word count of the twitts column and store the result in a new DataFrame column.
df['word_count'] = df['twitts'].apply(lambda x: len(str(x).split()))

In [143]:
df.sample(5)

Unnamed: 0,twitts,sentiment,word_count
582,I think my TwitterBerry is broken,0,6
2225,breakfast with my two favorite people ever.,1,7
883,Grrr doing my online class homework,0,6
1823,@Jonasbrothers the app never works for me,0,7
567,Argh! Sounds like someone is standing at my do...,0,22


In [144]:
df['word_count'].max()

32

In [145]:
df['word_count'].min()

1

In [146]:
# Return only the rows whose text consists of a single word.
df[df['word_count']==1]

Unnamed: 0,twitts,sentiment,word_count
385,homework,0,1
691,@ekrelly,0,1
1124,disappointed,0,1
1286,@officialmgnfox,0,1
1325,headache,0,1
1897,@MCRmuffin,0,1
2542,Graduated!,1,1
2947,reading,1,1
3176,@omeirdeleon,1,1
3470,www.myspace.com/myfinalthought,1,1


## Character Count

In [147]:
def char_count(text):
  s = text.split()
  s = ''.join(s)
  return len(s)

In [148]:
char_count('this test')

8

In [149]:
# As with the word count, calculate the character count (excluding whitespace) and add it to the DataFrame as a new column.
df['char_count'] = df['twitts'].apply(lambda x: char_count(str(x)))

In [150]:
df.sample(5)

Unnamed: 0,twitts,sentiment,word_count,char_count
2090,@phoenix_life you too! I'll be on again tomorrow,1,8,41
2925,@ClaytonKennedy I've actually done it several ...,1,10,59
2322,@RuSt true,1,2,9
2786,@OneNil same thing. as long as there's words.,1,8,38
2628,"@toitokyo Ok, then what happens?",1,5,28


## Average Word Length

In [151]:
x = 'this is an example' # 15 characters / 4 words = 3.75
y = 'this is another example' # 20 characters / 4 words = 5.0

In [152]:
df['avg_word_length'] = df['char_count']/df['word_count']
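
The smallest `word_count` in this data is 1, so the division above never hits zero. For data that might contain empty or whitespace-only text, a guarded version could look like the sketch below. `text_stats` is a hypothetical helper, not part of the notebook.

```python
def text_stats(text):
    """Return word count, non-space character count, and average word length."""
    words = str(text).split()  # split() with no argument drops all whitespace
    word_count = len(words)
    char_count = sum(len(word) for word in words)
    # Guard against empty or whitespace-only text to avoid dividing by zero.
    avg_word_length = char_count / word_count if word_count else 0.0
    return {
        "word_count": word_count,
        "char_count": char_count,
        "avg_word_length": avg_word_length,
    }

print(text_stats("this is an example"))
# {'word_count': 4, 'char_count': 15, 'avg_word_length': 3.75}
print(text_stats("   "))
# {'word_count': 0, 'char_count': 0, 'avg_word_length': 0.0}
```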

In [153]:
df.sample(5)

Unnamed: 0,twitts,sentiment,word_count,char_count,avg_word_length
3793,"That IS lame Liz lol, I should send you a care...",1,16,65,4.0625
1428,Relaxing before its time to go to work,0,8,31,3.875
1325,headache,0,1,8,8.0
2614,@niroism ill give it a go tomorrow .,1,8,29,3.625
163,@cassendraaa YES YES. UNDERSTAND.,0,4,30,7.5


## Stop Words Count
As a reminder, stop words are words that are so common in English that they carry very little information about the sentiment of a sentence. We usually want to remove them, but in some cases they are useful (e.g. can, cannot).
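
One way to keep the useful stop words is to subtract them from the set before filtering. The sketch below uses a tiny stand-in set (`SAMPLE_STOP_WORDS`) so it runs without spaCy; both that set and the `NEGATIONS` list are assumptions chosen for illustration, and the same subtraction should work on spaCy's `STOP_WORDS` since it is a plain Python set.

```python
# A tiny stand-in for spaCy's STOP_WORDS so the example runs on its own.
SAMPLE_STOP_WORDS = {"this", "is", "not", "a", "the", "i", "do", "cannot", "no", "it"}

# Negation words we choose to keep. This list is an assumption; tune it for your task.
NEGATIONS = {"not", "no", "cannot", "never"}

# Set difference: every stop word except the negations.
custom_stop_words = SAMPLE_STOP_WORDS - NEGATIONS

def remove_stop_words(text, stop_words):
    """Drop stop words, comparing in lower case so 'This' is treated like 'this'."""
    return " ".join(word for word in text.split() if word.lower() not in stop_words)

print(remove_stop_words("This is not a good movie", SAMPLE_STOP_WORDS))  # good movie
print(remove_stop_words("This is not a good movie", custom_stop_words))  # not good movie
```

With the full set the negation disappears and the sentence reads as positive; with the reduced set the "not" survives.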

In [154]:
print(stopwords)

{'our', 'amount', 'thence', 'namely', 'each', 'sixty', 'will', 'everywhere', 'upon', 'alone', 'must', 'five', 'always', 'anyhow', 'former', 'else', 'thus', 'they', 'meanwhile', 'around', 'part', 'one', 'beside', 'although', 'already', 'mostly', 'do', 'the', 'three', 'which', 'i', 'when', 'since', 'all', 'show', 'throughout', 'by', 'not', 'so', 'me', 'myself', 'became', 'can', 'moreover', 'would', 'name', 'various', 'whereas', '’m', 'for', 'further', 'hereafter', 'before', "'d", 'therein', 'eight', 'though', 'twelve', 'see', 'latterly', 'everything', 'therefore', 'did', 'after', 'above', 'just', "'ve", 'indeed', 'had', 'was', 'none', 'whereafter', 'because', 'done', 'is', 'full', 'then', 'toward', 'its', 'whenever', 'please', 'regarding', 'become', 'your', 'yours', 'and', 'again', 'call', 'hence', 'nothing', 'cannot', 'how', '‘ve', 'along', 'four', 'into', 'nine', 'together', 'be', 'least', 'first', 'ourselves', 'hereby', 'made', '‘re', 'even', 'these', '‘ll', 'whatever', 'formerly', 'n

In [155]:
len(stopwords)

326

In [156]:
x = 'this is an example'
words = x.split()
count = 0
for word in words:
  if word in stopwords:
    count += 1
print('Stopword count: ' + str(count))

Stopword count: 3


In [157]:
stopwords_count = lambda x: len([word for word in x.split() if word in stopwords])

In [158]:
df['stop_words_count'] = df.twitts.apply(stopwords_count)

In [159]:
df.sample(5)

Unnamed: 0,twitts,sentiment,word_count,char_count,avg_word_length,stop_words_count
3226,My first twwweeeeeettt yeeeah,1,4,26,6.5,1
1873,gotta make the most of my last full day in ktown,0,11,38,3.454545,8
1794,Saw Walk the Line tonight.Never heard much Cas...,0,21,127,6.047619,7
169,@lolly_poppet: matching socks im sorry ill be...,0,12,56,4.666667,4
607,ugh I hate working. . so tired,0,7,24,3.428571,1


## Count HashTags and Mentions (\#, @)

In [160]:
x = 'this is #hastag and this is @mention'
x.split()

['this', 'is', '#hastag', 'and', 'this', 'is', '@mention']

In [161]:
[word for word in x.split() if word.startswith('#') or word.startswith('@')]

['#hastag', '@mention']

In [162]:
df['hastags'] = df.twitts.apply(lambda x: len([word for word in x.split() if word.startswith('#')]))

In [163]:
df['mentions'] = df.twitts.apply(lambda x: len([word for word in x.split() if word.startswith('@')]))
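
The `startswith` checks only see tokens that begin with `#` or `@`, so they miss tags wrapped in punctuation such as `(@mention)` and they count a bare `#`. A regex-based variant is sketched below; `count_tags` and the two patterns are hypothetical names introduced for this example.

```python
import re

# '#' or '@' followed by word characters, and not preceded by a word character
# (so the '@' inside an email address is not counted as a mention).
HASHTAG_RE = re.compile(r"(?<!\w)#\w+")
MENTION_RE = re.compile(r"(?<!\w)@\w+")

def count_tags(text):
    """Count hashtags and mentions anywhere in the text."""
    return {
        "hashtags": len(HASHTAG_RE.findall(text)),
        "mentions": len(MENTION_RE.findall(text)),
    }

print(count_tags("thanks (@mention), loved it! #nlp mail user@example.com"))
# {'hashtags': 1, 'mentions': 1}
print(count_tags("use the # to tag"))
# {'hashtags': 0, 'mentions': 0}
```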

In [164]:
df[df['mentions']>0].sample(5)

Unnamed: 0,twitts,sentiment,word_count,char_count,avg_word_length,stop_words_count,hastags,mentions
3177,@Breezy4Sheezy Yeah we love ya,1,5,26,5.2,1,0,1
3738,@madmohican spendin it in bed wiv my lover hav...,1,24,104,4.333333,5,0,1
3685,@lilithsaintcrow You are so cool. I wanna get ...,1,12,50,4.166667,5,0,1
52,@mamma_J I was spining someone and fell,0,7,33,4.714286,3,0,1
2334,@hernseugene u said u were in class with the g...,1,20,102,5.1,7,0,1


In [165]:
df[df['hastags']>0].sample(5)

Unnamed: 0,twitts,sentiment,word_count,char_count,avg_word_length,stop_words_count,hastags,mentions
468,Stupid bus broke down so will be late to googl...,0,14,57,4.071429,6,2,0
2000,@roninofragnarok You can use the # to specify ...,1,24,98,4.083333,8,3,1
3619,#3hotwords Let's eat out,1,4,21,5.25,1,1,0
3057,"I am glowing with exhaustion, and am grateful ...",1,12,60,5.0,7,1,0
3891,@jsjv didi you wanna learn Italiano after watc...,1,17,98,5.764706,6,1,1


## Count Numeric Digits in twitts

In [166]:
x = 'this is 1 and 2'
x.split()[2].isdigit()

True

In [167]:
[digit for digit in x.split() if digit.isdigit()]

['1', '2']

In [168]:
df['numeric_count'] = df.twitts.apply(lambda x: len([digit for digit in x.split() if digit.isdigit()]))
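
Because `isdigit()` is applied to whole tokens, numbers with attached punctuation such as `3,` or `40%`, and decimals such as `2.5`, are not counted. If those matter, a regex is one option. The sketch below is an assumption about what counts as a number (standalone integers and simple decimals); `find_numbers` is a hypothetical helper.

```python
import re

# A standalone integer or simple decimal. The \b anchors keep digits that are
# part of a word (for example a username like r0ckergirl14) from matching.
NUMBER_RE = re.compile(r"\b\d+(?:\.\d+)?\b")

def find_numbers(text):
    """Return every standalone number found in the text, as strings."""
    return NUMBER_RE.findall(text)

sample = "in 3 hours, 40% done after 2.5 km @r0ckergirl14"
print(find_numbers(sample))                                 # ['3', '40', '2.5']
print([tok for tok in sample.split() if tok.isdigit()])     # ['3']
```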

In [169]:
df[df['numeric_count']>0].sample(5)

Unnamed: 0,twitts,sentiment,word_count,char_count,avg_word_length,stop_words_count,hastags,mentions,numeric_count
869,"@AgustinaP wow, multiple vacations must be nic...",0,20,87,4.35,6,0,1,1
1331,@lovelylaura1982 my fave 2 won't get a look in...,0,30,114,3.8,14,0,1,1
1156,watched like the last 40 minutes of 'the engli...,0,25,112,4.48,8,0,0,1
3038,"@run350 Girl, U outta see it. LOL I have 3 awk...",1,27,111,4.111111,8,0,1,1
3581,"Working. Will be in Gothenburg in 3 hours, wor...",1,15,73,4.866667,6,0,0,1


## Uppercase Word Count

In [170]:
x = 'EXAMPLE ONE' #Uppercase usually means higher intensity emotion
y = 'example two'

print([word for word in x.split() if word.isupper()])
print([word for word in y.split() if word.isupper()])

['EXAMPLE', 'ONE']
[]


In [171]:
df['upper_count'] = df.twitts.apply(lambda x: len([word for word in x.split() if word.isupper()]))
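
Note that `isupper()` is also true for one-letter words like `I` and `A`, so ordinary sentences can pick up an `upper_count` of 1. If the goal is to measure "shouting", one possible refinement is to require at least two letters. `count_shouted_words` and its `min_letters` parameter are hypothetical, shown only as a sketch.

```python
def count_shouted_words(text, min_letters=2):
    """Count fully upper-case words that contain at least `min_letters` letters."""
    count = 0
    for word in text.split():
        letters = sum(ch.isalpha() for ch in word)  # ignore digits and punctuation
        if word.isupper() and letters >= min_letters:
            count += 1
    return count

print(count_shouted_words("I HANDLED IT better"))  # 2  (HANDLED, IT)
print(count_shouted_words("I am a fan"))           # 0  ('I' alone is not shouting)
```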

In [172]:
df[df['upper_count']>0].sample(5)

Unnamed: 0,twitts,sentiment,word_count,char_count,avg_word_length,stop_words_count,hastags,mentions,numeric_count,upper_count
3896,@mrsteveharvey thanks for writing your book......,1,22,119,5.409091,7,0,1,0,1
42,Anyone want to see transformers tonight? :/ I ...,0,16,72,4.5,6,0,0,0,1
3447,Its raining in the DC area too? @LadyMinista,1,8,37,4.625,2,0,1,0,1
785,@MrGHETTISTORY I HANDLED IT BETTER THAN I THOU...,0,25,108,4.32,0,0,1,1,22
2128,@tdrracing interviewed by the Chron and mentio...,1,16,79,4.9375,6,0,1,1,1


## Lower Case Conversion

In [173]:
x = 'this is Text example'
x.lower() # text normalization

'this is text example'

In [174]:
x = 45.0
# x.lower() # Raises AttributeError: 'float' object has no attribute 'lower'
str(x).lower()

'45.0'

In [175]:
df['twitts'] = df['twitts'].apply(lambda x: str(x).lower())
# Note: this normalization must come after upper_count is computed, because lowercasing erases that signal.

In [176]:
df.sample(5)

Unnamed: 0,twitts,sentiment,word_count,char_count,avg_word_length,stop_words_count,hastags,mentions,numeric_count,upper_count
2026,cooking chicken noodles soup.. yum! perfect fo...,1,11,61,5.545455,2,0,0,0,0
1086,@degadeals good morning! it's overcast and rai...,0,20,90,4.5,7,0,1,0,0
1832,@kevinokeefe i can't imagine no globe. i'm a n...,0,19,102,5.368421,5,0,1,0,1
2213,@paulaaaron she did win! she is the grand pri...,1,17,66,3.882353,7,0,1,0,1
656,lol &quot;jizz in my pants&quot; song is hilar...,0,16,71,4.4375,8,0,0,0,1


## Expansion of Contractions

In [177]:
contractions = {
"ain't": "am not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'd've": "he would have",
"he'll": "he will",
"he'll've": "he will have",
"he's": "he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how does",
"i'd": "i would",
"i'd've": "i would have",
"i'll": "i will",
"i'll've": "i will have",
"i'm": "i am",
"i've": "i have",
"isn't": "is not",
"it'd": "it would",
"it'd've": "it would have",
"it'll": "it will",
"it'll've": "it will have",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she would",
"she'd've": "she would have",
"she'll": "she will",
"she'll've": "she will have",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so is",
"that'd": "that would",
"that'd've": "that would have",
"that's": "that is",
"there'd": "there would",
"there'd've": "there would have",
"there's": "there is",
"they'd": "they would",
"they'd've": "they would have",
"they'll": "they will",
"they'll've": "they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
"u": " you ",
"ur": " your ",
"n": " and ",
"won't": "will not",
'dis': 'this',
'bak': 'back',
'brng': 'bring'
}

In [178]:
x = "I'm don't he'll" # "I am", "do not", "he will"

In [179]:
def expand_contractions(text, contractions):
  words = text.split()
  for index, word in enumerate(words):
    lower = str(word).lower()
    if lower in contractions:
      words[index] = contractions[lower]
  return ' '.join(words)

In [180]:
expand_contractions(x, contractions)

'i am do not he will'
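
Since this function compares whole whitespace-separated tokens, a contraction with punctuation attached (for example `can't,`) is left unexpanded. A regex built from the dictionary keys is one way around that. This is a minimal sketch: `build_expander` is a hypothetical helper, and the small `demo_contractions` dict stands in for the full dictionary above.

```python
import re

def build_expander(mapping):
    """Return a function that expands every key of `mapping` found in a text."""
    # Longest keys first so "can't've" wins over "can't".
    keys = sorted(mapping, key=len, reverse=True)
    # Lookarounds instead of \b because some keys start with an apostrophe.
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(key) for key in keys) + r")(?!\w)",
        re.IGNORECASE,
    )

    def expand(text):
        # Look up the lower-cased match; surrounding punctuation is untouched.
        return pattern.sub(lambda match: mapping[match.group(1).lower()], text)

    return expand

demo_contractions = {
    "can't": "cannot",
    "can't've": "cannot have",
    "won't": "will not",
    "i'm": "i am",
}
expand = build_expander(demo_contractions)
print(expand("I'm sure we can't, and won't!"))
# i am sure we cannot, and will not!
```

Compiling the pattern once and reusing `expand` also avoids rebuilding anything per row when it is passed to `apply`.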

In [181]:
%%timeit
df['twitts'] = df.twitts.apply(lambda x: expand_contractions(x, contractions))

25.7 ms ± 5.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [182]:
df.sample(5)

Unnamed: 0,twitts,sentiment,word_count,char_count,avg_word_length,stop_words_count,hastags,mentions,numeric_count,upper_count
3687,pool side! feels like old times,1,6,26,4.333333,0,0,0,0,0
3538,@djotwist lol thanks oliver....ur the best!,1,6,38,6.333333,1,0,1,0,0
3614,@rustycharm def surprising for me i tell ya so...,1,22,99,4.5,9,1,1,0,1
249,"@themightyfoz from that tweet, the link went t...",0,24,111,4.625,9,0,1,0,1
267,morning tweeties happy monday having a nice qu...,0,11,55,5.0,2,0,0,0,0


You may need to manually extend the contractions dictionary with other shorthand you encounter, then re-run your processing.

## Count and Removal of Emails

In [183]:
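import re  # the standard-library module handles every pattern used below
df['twitts'].str.contains('.com').sum()  # note: '.' is a regex wildcard here, so this also matches words like 'welcome'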

336

In [184]:
df[df['twitts'].str.contains('gmail.com')]

Unnamed: 0,twitts,sentiment,word_count,char_count,avg_word_length,stop_words_count,hastags,mentions,numeric_count,upper_count
2448,"when i click my firefox 'most visited' tab, af...",1,17,79,4.647059,5,0,0,0,1


In [185]:
df[df['twitts'].str.contains('hotmail.com')]


Unnamed: 0,twitts,sentiment,word_count,char_count,avg_word_length,stop_words_count,hastags,mentions,numeric_count,upper_count
3713,@securerecs arghh me please markbradbury_16@ho...,1,5,51,10.2,0,0,1,0,0


In [186]:
x = df.iloc[3713]['twitts']

In [187]:
re.findall(r'([a-z0-9+._-]+@[a-z0-9+._-]+\.[a-z0-9+_-]+\b)', x)

['markbradbury_16@hotmail.com']

In [188]:
df['emails'] = df.twitts.apply(lambda x: re.findall(r'([a-z0-9+._-]+@[a-z0-9+._-]+\.[a-z0-9+_-]+\b)', x))

In [189]:
df['email_count'] = df.emails.apply(lambda x: len(x))

In [190]:
df[df['email_count']>0]

Unnamed: 0,twitts,sentiment,word_count,char_count,avg_word_length,stop_words_count,hastags,mentions,numeric_count,upper_count,emails,email_count
3713,@securerecs arghh me please markbradbury_16@ho...,1,5,51,10.2,0,0,1,0,0,[markbradbury_16@hotmail.com],1


In [191]:
df['twitts'] = df.twitts.apply(lambda x: re.sub(r'([a-z0-9+._-]+@[a-z0-9+._-]+\.[a-z0-9+_-]+\b)', '', x))
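
The pattern above only lists lower-case letters, which works here because the text was already lower-cased. If the same step had to run on raw text, compiling the pattern once with `re.IGNORECASE` is one option. `EMAIL_RE`, `extract_emails`, and `remove_emails` are hypothetical names for this sketch, and the address is a made-up example.

```python
import re

# Same pattern as above, compiled once and made case-insensitive.
EMAIL_RE = re.compile(r"[a-z0-9+._-]+@[a-z0-9+._-]+\.[a-z0-9+_-]+\b", re.IGNORECASE)

def extract_emails(text):
    """Return all email-like substrings."""
    return EMAIL_RE.findall(text)

def remove_emails(text):
    """Replace emails with a space, then collapse the leftover whitespace."""
    return " ".join(EMAIL_RE.sub(" ", text).split())

sample = "Mail me at John.Doe@Example.com please"
print(extract_emails(sample))  # ['John.Doe@Example.com']
print(remove_emails(sample))   # Mail me at please
```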

In [192]:
df[df['email_count']>0]

Unnamed: 0,twitts,sentiment,word_count,char_count,avg_word_length,stop_words_count,hastags,mentions,numeric_count,upper_count,emails,email_count
3713,@securerecs arghh me please,1,5,51,10.2,0,0,1,0,0,[markbradbury_16@hotmail.com],1


## Count URLs and Remove Them

In [193]:
x = 'please visit https://google.com or ssh://something.terminal and take a look at ftp://yourdrive-cool.com/test/1/5?amount=all done'

In [194]:
re.findall(r'\S+://\S+', x)

['https://google.com',
 'ssh://something.terminal',
 'ftp://yourdrive-cool.com/test/1/5?amount=all']

In [195]:
df['urls'] = df.twitts.apply(lambda x: re.findall(r'\S+://\S+', x))

In [196]:
df['url_count'] = df.urls.apply(lambda x: len(x))

In [197]:
df[df['url_count']>0]

Unnamed: 0,twitts,sentiment,word_count,char_count,avg_word_length,stop_words_count,hastags,mentions,numeric_count,upper_count,emails,email_count,urls,url_count
16,@brianquest i made 1 fo you 2: http://bit.ly/e...,0,19,81,4.263158,3,0,1,1,3,[],0,[http://bit.ly/eid8a],1
98,heading to work http://twitpic.com/4eojz,0,4,37,9.250000,1,0,0,0,0,[],0,[http://twitpic.com/4eojz],1
99,@blondeblogger http://twitpic.com/4w8hk - i am...,0,10,62,6.200000,4,0,1,0,0,[],0,[http://twitpic.com/4w8hk],1
144,i miss you ã¢ââ« http://blip.fm/~8lc2f,0,5,35,7.000000,1,0,0,0,2,[],0,[http://blip.fm/~8lc2f],1
183,photo: miss germany http://tumblr.com/xf825f012,0,4,44,11.000000,0,0,0,0,0,[],0,[http://tumblr.com/xf825f012],1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3819,new staark video of &quot;sweet release&quot; ...,1,20,118,5.900000,4,0,0,0,1,[],0,[http://staark.net/video.php],1
3826,http://tinyurl.com/kwmynq helmet - unsung to s...,1,9,76,8.444444,2,0,0,0,0,[],0,"[http://tinyurl.com/kwmynq, http://plurk.com/p...",2
3837,@r0ckergirl14 wow sweet again!! http://twitpic...,1,5,52,10.400000,0,0,1,0,0,[],0,[http://twitpic.com/69b67],1
3958,someone has been creative with my #deskmess wh...,1,13,91,7.000000,7,1,1,0,0,[],0,[http://twitpic.com/7jgf1],1


In [198]:
df['twitts'] = df.twitts.apply(lambda x: re.sub(r'\S+://\S+', '', x))
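
The `\S+://\S+` pattern needs a scheme, so scheme-less links such as the `www.myspace.com/myfinalthought` row seen earlier are left in the text. One possible extension also accepts tokens that start with `www.`; treat it as a sketch rather than a complete URL matcher. `URL_RE` and `remove_urls` are hypothetical names.

```python
import re

# Either "<scheme>://..." or a token that begins with "www."
URL_RE = re.compile(r"(?:\S+://|(?<!\S)www\.)\S+")

def remove_urls(text):
    """Drop URL-like tokens and collapse the leftover whitespace."""
    return " ".join(URL_RE.sub(" ", text).split())

sample = "see www.myspace.com/myfinalthought and http://bit.ly/eid8a now"
print(URL_RE.findall(sample))
# ['www.myspace.com/myfinalthought', 'http://bit.ly/eid8a']
print(remove_urls(sample))
# see and now
```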

In [199]:
df[df['url_count']>0]

Unnamed: 0,twitts,sentiment,word_count,char_count,avg_word_length,stop_words_count,hastags,mentions,numeric_count,upper_count,emails,email_count,urls,url_count
16,@brianquest i made 1 fo you 2: i tried but it...,0,19,81,4.263158,3,0,1,1,3,[],0,[http://bit.ly/eid8a],1
98,heading to work,0,4,37,9.250000,1,0,0,0,0,[],0,[http://twitpic.com/4eojz],1
99,@blondeblogger - i am so sad this is so blurry!,0,10,62,6.200000,4,0,1,0,0,[],0,[http://twitpic.com/4w8hk],1
144,i miss you ã¢ââ«,0,5,35,7.000000,1,0,0,0,2,[],0,[http://blip.fm/~8lc2f],1
183,photo: miss germany,0,4,44,11.000000,0,0,0,0,0,[],0,[http://tumblr.com/xf825f012],1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3819,new staark video of &quot;sweet release&quot; ...,1,20,118,5.900000,4,0,0,0,1,[],0,[http://staark.net/video.php],1
3826,helmet - unsung to start your day,1,9,76,8.444444,2,0,0,0,0,[],0,"[http://tinyurl.com/kwmynq, http://plurk.com/p...",2
3837,@r0ckergirl14 wow sweet again!!,1,5,52,10.400000,0,0,1,0,0,[],0,[http://twitpic.com/69b67],1
3958,someone has been creative with my #deskmess wh...,1,13,91,7.000000,7,1,1,0,0,[],0,[http://twitpic.com/7jgf1],1


## Remove RT (Retweets)

In [200]:
df['twitts'].str.contains('rt ').sum()

109

In [201]:
x = 'rt @username: this is a test'

In [202]:
re.sub(r'\brt\s+', '', x)  # \b keeps words that merely end in 'rt' (e.g. 'start') intact

'@username: this is a test'

In [203]:
df['twitts'] = df.twitts.apply(lambda x: re.sub(r'\brt\s+', '', x))

## Special/Punctuation Character Removal

In [204]:
x = '@crspradlin i was just testing....'

In [205]:
re.sub(r'[^\w\s]+', '', x)

'crspradlin i was just testing'

In [206]:
df['twitts'] = df.twitts.apply(lambda x: re.sub(r'[^\w\s]+', '', x))
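
Replacing punctuation with an empty string glues words together when there was no space around the punctuation, as in the `tonight.Never` tweet shown earlier. Substituting a space and then collapsing whitespace avoids that. `strip_punctuation` is a hypothetical helper for this sketch; note that it turns an unexpanded `it's` into `it s`, which is one reason to expand contractions first.

```python
import re

def strip_punctuation(text):
    """Replace each run of punctuation with a space, then collapse whitespace."""
    spaced = re.sub(r"[^\w\s]+", " ", text)
    return " ".join(spaced.split())

sample = "saw walk the line tonight.never heard...."
print(re.sub(r"[^\w\s]+", "", sample))  # saw walk the line tonightnever heard
print(strip_punctuation(sample))        # saw walk the line tonight never heard
```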

In [207]:
df.sample(5)

Unnamed: 0,twitts,sentiment,word_count,char_count,avg_word_length,stop_words_count,hastags,mentions,numeric_count,upper_count,emails,email_count,urls,url_count
2177,utaia that is cool i can live with that you ha...,1,17,67,3.941176,8,0,1,0,0,[],0,[],0
3385,girlonthemove_ hoi welkom,1,3,26,8.666667,0,0,1,0,0,[],0,[],0
3185,danieldraper looks like something i could wast...,1,10,54,5.4,4,0,1,0,1,[],0,[],0
3670,erickeee that is really cool thanks for the info,1,8,43,5.375,3,0,1,0,0,[],0,[],0
2554,lisaswrite goodnight,1,2,20,10.0,0,0,1,0,0,[],0,[],0


## Remove Multiple Spaces

In [208]:
x = 'this    is  a      test '

In [209]:
' '.join(x.split())

'this is a test'

In [210]:
df['twitts'] = df.twitts.apply(lambda x: ' '.join(x.split()))

## Remove HTML Tags

In [211]:
from bs4 import BeautifulSoup

In [212]:
x = '<html><h1> This is a test </h1></html>'

In [213]:
BeautifulSoup(x, 'lxml').get_text().strip()

'This is a test'

In [214]:
%%time
df['twitts'] = df.twitts.apply(lambda x: BeautifulSoup(x, 'lxml').get_text().strip())

CPU times: user 1.15 s, sys: 4.44 ms, total: 1.16 s
Wall time: 1.46 s
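
Several tweets contain HTML entities such as `&quot;`. Because punctuation was stripped in an earlier cell, the `&` and `;` are already gone by the time this step runs, which leaves a stray `quot` in the text. If entities are the main concern, the standard library's `html.unescape` can decode them, provided it runs before punctuation removal. `decode_entities` is a hypothetical wrapper, and the sample string is made up for illustration.

```python
import html

def decode_entities(text):
    """Turn HTML entities such as &quot; and &amp; back into plain characters."""
    return html.unescape(text)

sample = "new video of &quot;sweet release&quot; &amp; more"
print(decode_entities(sample))
# new video of "sweet release" & more
```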


## Remove Accented Characters

In [215]:
x = 'Áccěntěd těxt'

In [216]:
import unicodedata

In [217]:
def remove_accented_chars(text):
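  # NFKD (compatibility decomposition) splits accented characters into a base letter plus combining marks;
  # encoding to ASCII with 'ignore' then drops the marks, along with any other non-ASCII characters.
  text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')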
  return text

In [218]:
remove_accented_chars(x)

'Accented text'
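
The ASCII round trip also deletes every character that has no ASCII form, which may be more than you want for multilingual text. A gentler alternative, sketched here under that assumption, removes only the combining marks produced by the NFKD step. `strip_accents` is a hypothetical helper.

```python
import unicodedata

def strip_accents(text):
    """Remove accent marks but keep every other character."""
    decomposed = unicodedata.normalize("NFKD", text)
    # combining() returns a non-zero class for accent marks attached to a base letter.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("Áccěntěd těxt"))  # Accented text
print(strip_accents("café 日本"))       # cafe 日本
```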

In [219]:
df['twitts'] = df.twitts.apply(lambda x: remove_accented_chars(x))

## Remove Stop Words

In [220]:
x = 'this has some stop words'

In [221]:
' '.join([t for t in x.split() if t not in stopwords])

'stop words'

In [222]:
df['twitts_no_stopwords'] = df.twitts.apply(lambda x: ' '.join([t for t in x.split() if t not in stopwords]))

In [223]:
df.sample(5)

Unnamed: 0,twitts,sentiment,word_count,char_count,avg_word_length,stop_words_count,hastags,mentions,numeric_count,upper_count,emails,email_count,urls,url_count,twitts_no_stopwords
3598,monkeylover35 yeah i had to get away for a bit...,1,17,76,4.470588,8,0,1,0,2,[],0,[],0,monkeylover35 yeah away bit timetotime healthy
2079,mknell hi i am helping out morgan twitter for jb,1,9,41,4.555556,2,0,1,0,2,[],0,[],0,mknell hi helping morgan twitter jb
2570,lexithaboss thanx babe love back 2 you,1,7,34,4.857143,2,0,1,1,0,[],0,[],0,lexithaboss thanx babe love 2
2454,is impatiently waiting for her little sister t...,1,10,48,4.8,5,0,0,0,0,[],0,[],0,impatiently waiting little sister
1357,lakers is the champion how i wish it was magic,0,10,45,4.5,4,0,0,0,1,[],0,[],0,lakers champion wish magic


## Convert to Root or Base Word Form