## Complete text preprocessing


### General Feature Extraction
- File loading
- Word counts
- Characters count
- Average characters per word
- Stop words count
- Count #HashTags and @Mentions
- Whether numeric digits are present in tweets
- Upper case word counts


### Preprocessing and Cleaning
- Lower case
- Contraction to Expansion
- Emails removal and counts
- URLs removal and counts
- Removal of RT
- Removal of Special Characters
- Removal of multiple spaces
- Removal of HTML tags
- Removal of accented characters
- Removal of Stop Words
- Conversion into base form of words
- Common Occurring words Removal
- Rare Occurring words Removal
- Word Cloud
- Spelling Correction
- Tokenization
- Lemmatization
- Detecting Entities using NER
- Noun Detection
- Language Detection
- Sentence Translation
- Using Inbuilt Sentiment Classifier

In [117]:
import pandas as pd
import numpy as np
import spacy

In [118]:
from spacy.lang.en.stop_words import STOP_WORDS as stopwords

In [119]:
df = pd.read_csv('https://raw.githubusercontent.com/laxmimerit/twitter-data/master/twitter4000.csv', encoding='latin1')

In [120]:
df

Unnamed: 0,twitts,sentiment
0,is bored and wants to watch a movie any sugge...,0
1,back in miami. waiting to unboard ship,0
2,"@misskpey awwww dnt dis brng bak memoriessss, ...",0
3,ughhh i am so tired blahhhhhhhhh,0
4,@mandagoforth me bad! It's funny though. Zacha...,0
...,...,...
3995,i just graduated,1
3996,Templating works; it all has to be done,1
3997,mommy just brought me starbucks,1
3998,@omarepps watching you on a House re-run...lov...,1


In [121]:
df['sentiment'].value_counts()

0    2000
1    2000
Name: sentiment, dtype: int64



### Word counts

In [122]:
len('this is text'.split())

3

In [123]:
df['word_counts'] = df['twitts'].apply(lambda x: len(str(x).split()))

In [124]:
df.sample(5)

Unnamed: 0,twitts,sentiment,word_counts
1362,"starting to feel quite depressed, it will be b...",0,20
287,Windows VM decided to drop net access Copying...,0,25
141,@Rowdyeh its not there anymore,0,5
3018,@DinoGoesRawr I'm surprised you're so loveable,1,6
3325,"@Kimmy6313 I totally feel better, you were rig...",1,14


In [125]:
df['word_counts'].max()

32

In [126]:
df['word_counts'].min()

1

In [127]:
df[df['word_counts']==1]

Unnamed: 0,twitts,sentiment,word_counts
385,homework,0,1
691,@ekrelly,0,1
1124,disappointed,0,1
1286,@officialmgnfox,0,1
1325,headache,0,1
1897,@MCRmuffin,0,1
2542,Graduated!,1,1
2947,reading,1,1
3176,@omeirdeleon,1,1
3470,www.myspace.com/myfinalthought,1,1



### Characters count

In [128]:
len('this is')

7

In [129]:
def char_counts(x):
    s = x.split()
    x = ''.join(s)
    return len(x)

In [130]:
char_counts('this is')

6

In [131]:
df['char_counts'] = df['twitts'].apply(lambda x: char_counts(str(x)))

In [132]:
df.sample(5)

Unnamed: 0,twitts,sentiment,word_counts,char_counts
1120,"@KristenJStewart good good, i bet he gets pret...",0,15,76
860,yeah. several years ago. miss him every day ...,0,13,59
2119,@MetsMerized it was kind of funny late last ni...,1,20,93
3641,"@TheEllenShow God might know. Then again, she ...",1,20,94
3677,wishes that he would realize that I love him!!,1,9,38



### Average word length

In [133]:
x = 'this is' # 6/2 = 3
y = 'thankyou guys' # 12/2 = 6

In [134]:
df['avg_word_len'] = df['char_counts']/df['word_counts']

In [135]:
df.sample(4)

Unnamed: 0,twitts,sentiment,word_counts,char_counts,avg_word_len
3114,"@plushapo Sounds awesome, where are you? Som...",1,17,100,5.882353
1351,I wanna sleep but I need to study literature! ...,0,10,45,4.5
2436,*Playing spades*...unless u can also count car...,1,12,58,4.833333
416,can't seem to shake off the flu,0,7,25,3.571429
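The column division above relies on every tweet having at least one word. As a minimal sketch, the same calculation can be written as a standalone helper that returns `0.0` for empty text; the name `avg_word_len` is a hypothetical choice, not part of the original notebook.

```python
def avg_word_len(text):
    """Return the mean number of non-whitespace characters per word."""
    words = str(text).split()
    if not words:  # empty or whitespace-only text: avoid dividing by zero
        return 0.0
    return sum(len(w) for w in words) / len(words)

print(avg_word_len('this is'))        # 3.0
print(avg_word_len('thankyou guys'))  # 6.0
```

It could be applied with `df['twitts'].apply(avg_word_len)` in place of dividing the two count columns.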



### Stop words count

In [136]:
print(stopwords)

{'someone', 'thereby', 'top', 'becoming', 'it', 'only', 'yourself', 'forty', 'hence', 'being', 'themselves', 'among', 'twelve', 'what', 'myself', 'per', 'herein', 'at', 'thereafter', 'amongst', 'do', 'we', 'thereupon', 'the', 'these', 'fifteen', 'name', 'each', 'on', 'whatever', 'often', 'else', "'s", 'other', 'he', 'anyway', 'between', 'us', '‘d', 'where', 'throughout', 'nevertheless', 'some', "'d", 'became', 'whose', 'there', 'up', 'move', 'has', 'our', 'whole', 'hereupon', 'go', 'latterly', 'doing', 'upon', 'own', 'five', 'below', 'no', 'beforehand', 'cannot', 'serious', 'from', 'whereafter', 'am', 'will', 'now', '‘s', 'anyone', 'how', 'please', 'third', 'done', '‘re', 'whereas', 'to', 'four', 'until', 'whenever', 'under', 'mostly', 'get', 'thence', 'off', 'full', 'well', 'this', 'twenty', "'m", 'one', 'become', 'same', 'then', 'hereafter', 'ourselves', 'wherein', 'an', 'might', 'neither', 'beside', 'much', 'hers', 'elsewhere', '’ll', 'here', 'amount', 'former', 'anywhere', 'whom', 

In [137]:
len(stopwords)

326

In [138]:
x = 'this is the text data'

In [139]:
x.split()

['this', 'is', 'the', 'text', 'data']

In [140]:
[t for t in x.split() if t in stopwords]

['this', 'is', 'the']

In [141]:
len([t for t in x.split() if t in stopwords])

3

In [142]:
df['stop_words_len'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t in stopwords]))

In [143]:
df.sample(5)

Unnamed: 0,twitts,sentiment,word_counts,char_counts,avg_word_len,stop_words_len
1121,Trying to unravel a snafu with my record of em...,0,10,45,4.5,5
285,@mallory0905 i guess you never got that tomato...,0,22,114,5.181818,10
1080,@caitiebecker Awww I'm sorry that you are sick...,0,24,101,4.208333,10
2534,I was so happy today!,1,5,17,3.4,2
1765,I am soooo bummed out about leaving this place...,0,16,66,4.125,6
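The stop word list is lower case, and this count runs before the tweets are lower-cased, so tokens such as `I` or `SO` are not counted. A possible case-insensitive variant is sketched below. `SAMPLE_STOPWORDS` is a small hypothetical stand-in for spaCy's `STOP_WORDS` so that the snippet runs on its own.

```python
# Tiny stand-in for spaCy's STOP_WORDS (assumption, for illustration only).
SAMPLE_STOPWORDS = {'this', 'is', 'the', 'i', 'am', 'so', 'a'}

def count_stopwords(text, stopwords=SAMPLE_STOPWORDS):
    """Count whitespace-separated tokens that are stop words, ignoring case."""
    return sum(1 for t in str(text).split() if t.lower() in stopwords)

print(count_stopwords('this is the text data'))  # 3
print(count_stopwords('I am SO happy'))          # 3
```

In the notebook, the imported `stopwords` set would be passed as the second argument.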



### Count hashtags & mentions

In [144]:
x = 'this is #hashtag and this is @mention'

In [145]:
x.split()

['this', 'is', '#hashtag', 'and', 'this', 'is', '@mention']

In [146]:
[t for t in x.split() if t.startswith('@')]

['@mention']

In [147]:
len([t for t in x.split() if t.startswith('@')])

1

In [148]:
df['hashtags_count'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t.startswith('#')]))

In [149]:
df['mentions_count'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t.startswith('@')]))

In [150]:
df.sample(5)

Unnamed: 0,twitts,sentiment,word_counts,char_counts,avg_word_len,stop_words_len,hashtags_count,mentions_count
1581,waiting in the airport. vacation is over,0,7,34,4.857143,4,0,0
2886,"thinks that mean, hateful people make the worl...",1,16,78,4.875,7,0,0
2713,A perfect sunday for MindEmptyness,1,5,30,6.0,1,0,0
968,has subscribed with Alchemy twice to get an al...,0,26,112,4.307692,12,0,0
1763,@Ali1702 OMG- did daughter not come home last ...,0,23,94,4.086957,10,0,1
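`startswith` misses a tag that follows punctuation, such as `(@mention)`. A regex-based count is one alternative, sketched here under the assumption that a tag is `#` or `@` followed by word characters and not preceded by a word character (which also skips the `@` inside an email address). The names `HASHTAG_RE`, `MENTION_RE` and `count_tags` are hypothetical.

```python
import re

# '#' or '@' followed by word characters, not preceded by a word character.
HASHTAG_RE = re.compile(r'(?<!\w)#\w+')
MENTION_RE = re.compile(r'(?<!\w)@\w+')

def count_tags(text):
    """Return (hashtag_count, mention_count) for one piece of text."""
    text = str(text)
    return len(HASHTAG_RE.findall(text)), len(MENTION_RE.findall(text))

print(count_tags('this is #hashtag and this is @mention'))  # (1, 1)
print(count_tags('(@mention) write to a@b.com #tag!'))      # (1, 1)
```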



### Whether numeric digits are present in tweets

In [151]:
x = 'this is 1 and 2'

In [152]:
x.split()

['this', 'is', '1', 'and', '2']

In [153]:
x.split()[3].isdigit()

False

In [154]:
[t for t in x.split() if t.isdigit()]

['1', '2']

In [155]:
df['numerics_count'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t.isdigit()]))

In [156]:
df.sample(5)

Unnamed: 0,twitts,sentiment,word_counts,char_counts,avg_word_len,stop_words_len,hashtags_count,mentions_count,numerics_count
1178,"UP, my eyes hurt",0,4,13,3.25,1,0,0,0
3243,Mischka's... &lt;3 No better way to celebrate ...,1,11,56,5.090909,3,0,0,0
3324,http://twitpic.com/66lrp - Afternoon cream tea...,1,12,71,5.916667,4,0,0,0
1289,"Im still really worried, that i havent 'tweete...",0,14,71,5.071429,8,0,0,0
484,i wanna meet The Jonas Brothers,0,6,26,4.333333,1,0,0,0
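`isdigit()` only accepts tokens made entirely of digits, so `2,` or `3.5km` are not counted. If embedded numbers should count as well, a regex is one option. This is a sketch under that assumption; `NUMBER_RE` and `count_numbers` are hypothetical names.

```python
import re

# An integer or a simple decimal such as 3.5, anywhere in the text.
NUMBER_RE = re.compile(r'\d+(?:\.\d+)?')

def count_numbers(text):
    """Count numeric runs in the text, including those attached to other characters."""
    return len(NUMBER_RE.findall(str(text)))

print(count_numbers('this is 1 and 2'))       # 2
print(count_numbers('walked 2, then 3.5km'))  # 2
```

Note that this variant also counts the digits in tokens like `80's`, which the `isdigit()` version ignores.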



### Upper case words count

In [157]:
x = 'I AM HAPPY'
y = 'i am happy'

In [158]:
[t for t in x.split() if t.isupper()]

['I', 'AM', 'HAPPY']

In [159]:
df['upper_counts'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t.isupper()]))

In [160]:
df.sample(5)

Unnamed: 0,twitts,sentiment,word_counts,char_counts,avg_word_len,stop_words_len,hashtags_count,mentions_count,numerics_count,upper_counts
626,@joeypage omg not right we just walked 2 miles...,0,23,79,3.434783,10,0,1,1,0
1723,Somehow found myself stuck in the 80's with a ...,0,17,68,4.0,7,0,0,0,0
1876,@jonnybabyy I am SO JEALOUS!!!! I want to go s...,0,14,51,3.642857,5,0,1,0,5
1424,@Liljudy95 Aww Whoever Anita Lazo Was Im Sure ...,0,16,67,4.1875,0,0,1,0,1
444,I am jealous of Shelly's passport I want one ...,0,12,54,4.5,4,0,0,0,2


In [161]:
df.iloc[1012]['twitts']

'@lisalent  I am thinking of putting together a package for couples to elope on the central coast. Need a few photographers, interested?'

In [162]:
df.iloc[1012]['upper_counts']

1
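The tweet above gets an upper case count of 1 only because of the pronoun `I`. If single-letter words should not count as upper case, a minimum length can be required. This is a sketch; `count_upper_words` and its `min_len` parameter are hypothetical.

```python
def count_upper_words(text, min_len=2):
    """Count fully upper-case tokens that have at least min_len characters."""
    return sum(1 for t in str(text).split() if t.isupper() and len(t) >= min_len)

print(count_upper_words('I AM HAPPY'))                                   # 2
print(count_upper_words('I am thinking of putting together a package'))  # 0
```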



### Preprocessing & Cleaning

#### Lower case conversion

In [163]:
x = 'this is Text'

In [164]:
x.lower()

'this is text'

In [165]:
x = 45.0
str(x).lower()

'45.0'

In [166]:
df['twitts'] = df['twitts'].apply(lambda x: str(x).lower())

In [167]:
df.sample(5)

Unnamed: 0,twitts,sentiment,word_counts,char_counts,avg_word_len,stop_words_len,hashtags_count,mentions_count,numerics_count,upper_counts
2644,assignment actually looking pretty decent....s...,1,14,77,5.5,4,0,0,0,1
2220,@msballin that was too much fun!,1,6,27,4.5,4,0,1,0,0
2826,just watched the day the earth stood still pr...,1,10,53,5.3,4,0,0,0,0
2816,having a get together today...if you are comin...,1,13,64,4.923077,7,0,0,0,0
620,@daisygunner sorry if i gave you my cold! i t...,0,18,64,3.555556,8,0,1,0,1



#### Contraction to Expansion

In [168]:
contractions = {
"ain't": "am not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'd've": "he would have",
"he'll": "he will",
"he'll've": "he will have",
"he's": "he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how does",
"i'd": "i would",
"i'd've": "i would have",
"i'll": "i will",
"i'll've": "i will have",
"i'm": "i am",
"i've": "i have",
"isn't": "is not",
"it'd": "it would",
"it'd've": "it would have",
"it'll": "it will",
"it'll've": "it will have",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she would",
"she'd've": "she would have",
"she'll": "she will",
"she'll've": "she will have",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so is",
"that'd": "that would",
"that'd've": "that would have",
"that's": "that is",
"there'd": "there would",
"there'd've": "there would have",
"there's": "there is",
"they'd": "they would",
"they'd've": "they would have",
"they'll": "they will",
"they'll've": "they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
" u ": " you ",
" ur ": " your ",
" n ": " and ",
"won't": "will not",
'dis': 'this',
'bak': 'back',
'brng': 'bring'}

In [169]:
x = "i'm don't he'll" # "i am do not he will"

In [170]:
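def cont_to_exp(x):
    # Expand contractions with plain substring replacement; non-strings pass through unchanged.
    # Substring matching also rewrites keys found inside longer words (e.g. 'dis' in 'disappointed').
    if isinstance(x, str):
        for key, value in contractions.items():
            x = x.replace(key, value)
    return x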
    

In [171]:
cont_to_exp(x)

'i am do not he will'
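Because `str.replace` matches substrings, a short key such as `'dis'` also fires inside longer words. One way to avoid that is to compile the keys into a single regex that only matches whole tokens. The sketch below uses a small hypothetical sample, `SAMPLE_CONTRACTIONS`, in place of the full dictionary; `expand_contractions` is also a hypothetical name.

```python
import re

# Small sample of the mapping (assumption); the full dictionary could be used instead.
SAMPLE_CONTRACTIONS = {"i'm": "i am", "don't": "do not", "he'll": "he will",
                       "u": "you", "dis": "this"}

# Longest keys first so longer contractions win over their prefixes.
_keys = sorted(SAMPLE_CONTRACTIONS, key=len, reverse=True)
# The lookarounds stop a key from matching inside a longer word.
CONTRACTION_RE = re.compile(
    r"(?<![\w'])(" + "|".join(map(re.escape, _keys)) + r")(?![\w'])"
)

def expand_contractions(text):
    """Replace whole-token contractions; non-strings are returned unchanged."""
    if not isinstance(text, str):
        return text
    return CONTRACTION_RE.sub(lambda m: SAMPLE_CONTRACTIONS[m.group(1)], text)

print(expand_contractions("i'm don't he'll"))   # i am do not he will
print(expand_contractions("u r disappointed"))  # you r disappointed
```

With whole-token matching, keys would no longer need padding spaces such as `" u "`, and keys that begin with an apostrophe (for example `'cause`) would need a slightly different left boundary.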

In [172]:
%%timeit
df['twitts'] = df['twitts'].apply(lambda x: cont_to_exp(x))

92.2 ms ± 11.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [173]:
df.sample(5)

Unnamed: 0,twitts,sentiment,word_counts,char_counts,avg_word_len,stop_words_len,hashtags_count,mentions_count,numerics_count,upper_counts
3460,@karine1205 daaaaang partying it up! you go!!!,1,7,40,5.714286,1,0,1,0,1
1694,boarded my flight but forgot my qc3s,0,7,30,4.285714,3,0,0,0,0
997,"spedi have left the jungle, that means no more...",0,16,68,4.25,8,0,0,0,1
177,"my cat snuck out, and is stalking somewhere ar...",0,20,79,3.95,10,0,0,0,0
2315,@whoschrishughes finally got time to view the ...,1,19,93,4.894737,8,0,1,0,0


#### Count & Remove Emails

In [174]:
import re

In [175]:
df[df['twitts'].str.contains('hotmail.com', regex=False)]

Unnamed: 0,twitts,sentiment,word_counts,char_counts,avg_word_len,stop_words_len,hashtags_count,mentions_count,numerics_count,upper_counts
3713,@securerecs arghh me please markbradbury_16@h...,1,5,51,10.2,0,0,1,0,0


In [176]:
df.iloc[3713]['twitts']

'@securerecs arghh me please  markbradbury_16@hotmail.com'

In [177]:
x = '@securerecs arghh me please  markbradbury_16@hotmail.com'

In [178]:
re.findall(r'([a-z0-9+._-]+@[a-z0-9+._-]+\.[a-z0-9+_-]+)', x)

['markbradbury_16@hotmail.com']

In [179]:
df['emails'] = df['twitts'].apply(lambda x: re.findall(r'([a-z0-9+._-]+@[a-z0-9+._-]+\.[a-z0-9+_-]+\b)', x))

In [180]:
df['emails_count'] = df['emails'].apply(lambda x: len(x))

In [181]:
df[df['emails_count'] > 0]

Unnamed: 0,twitts,sentiment,word_counts,char_counts,avg_word_len,stop_words_len,hashtags_count,mentions_count,numerics_count,upper_counts,emails,emails_count
3713,@securerecs arghh me please markbradbury_16@h...,1,5,51,10.2,0,0,1,0,0,[markbradbury_16@hotmail.com],1


In [182]:
re.sub(r'([a-z0-9+._-]+@[a-z0-9+._-]+\.[a-z0-9+_-]+)',"", x)

'@securerecs arghh me please  '

In [183]:
df['twitts'] = df['twitts'].apply(lambda x: re.sub(r'([a-z0-9+._-]+@[a-z0-9+._-]+\.[a-z0-9+_-]+)',"", x))

In [184]:
df[df['emails_count']>0]

Unnamed: 0,twitts,sentiment,word_counts,char_counts,avg_word_len,stop_words_len,hashtags_count,mentions_count,numerics_count,upper_counts,emails,emails_count
3713,@securerecs arghh me please,1,5,51,10.2,0,0,1,0,0,[markbradbury_16@hotmail.com],1
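The same email pattern is written out three times above, once with a trailing `\b` and twice without. As a sketch, it can be compiled once and reused for both extraction and removal. `EMAIL_RE` and `extract_and_remove_emails` are hypothetical names, and `re.IGNORECASE` is an added assumption so that the pattern also works before lower-casing.

```python
import re

# Same character classes as the pattern above, compiled once.
EMAIL_RE = re.compile(r'[a-z0-9+._-]+@[a-z0-9+._-]+\.[a-z0-9+_-]+', re.IGNORECASE)

def extract_and_remove_emails(text):
    """Return (list_of_emails, text_with_emails_removed)."""
    text = str(text)
    return EMAIL_RE.findall(text), EMAIL_RE.sub('', text)

emails, cleaned = extract_and_remove_emails(
    '@securerecs arghh me please  markbradbury_16@hotmail.com'
)
print(emails)         # ['markbradbury_16@hotmail.com']
print(repr(cleaned))  # '@securerecs arghh me please  '
```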


#### Count and Remove URLs

In [185]:
x = 'hi, thanks to watching it. for more visit https://youtube.com/kgptalkie'

In [186]:
# e.g. ssh://git@git.com:username/repo.git=riif?%

In [187]:
re.findall(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', x)

[('https', 'youtube.com', '/kgptalkie')]

In [188]:
df['url_flags'] = df['twitts'].apply(lambda x: len(re.findall(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', x)))

In [189]:
df[df['url_flags']>0].sample(5)

Unnamed: 0,twitts,sentiment,word_counts,char_counts,avg_word_len,stop_words_len,hashtags_count,mentions_count,numerics_count,upper_counts,emails,emails_count,url_flags
2065,http://twitpic.com/6s8j1 - with mis bestiess b...,1,6,45,7.5,1,0,0,0,1,[],0,1
2927,http://twitpic.com/6qhi7 - omg! this is the wa...,1,18,97,5.388889,6,0,0,0,0,[],0,1
3282,for the folks that missed me http://twitpic.c...,1,7,47,6.714286,3,0,0,0,0,[],0,1
2243,please keep voting for @chriscuzzy &amp; @tomf...,1,19,120,6.315789,6,0,3,0,0,[],0,1
303,"oqo, she is dead - such a shame as they looked...",0,15,77,5.133333,6,0,0,0,1,[],0,1


In [190]:
x

'hi, thanks to watching it. for more visit https://youtube.com/kgptalkie'

In [191]:
re.sub(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '' , x)

'hi, thanks to watching it. for more visit '

In [192]:
df['twitts'] = df['twitts'].apply(lambda x: re.sub(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '' , x))

In [193]:
df[df['url_flags']>0].sample(5)

Unnamed: 0,twitts,sentiment,word_counts,char_counts,avg_word_len,stop_words_len,hashtags_count,mentions_count,numerics_count,upper_counts,emails,emails_count,url_flags
1458,@tracecyrus - the picture would not come up f...,0,10,62,6.2,4,0,1,0,0,[],0,1
3788,"@lux_bird - oh, yummie yummie...with icecream...",1,7,72,10.285714,0,0,1,0,0,[],0,1
2539,@jornjansen that is nice so where will you go...,1,16,83,5.1875,6,0,1,0,0,[],0,1
3268,2 of the kiddies who we took to the circus,1,11,57,5.181818,6,0,0,1,0,[],0,1
3203,"@thewebguy - dude, there is no hot topic in s...",1,14,85,6.071429,3,0,1,0,0,[],0,1
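The URL pattern above uses capturing groups, so `re.findall` returns tuples of scheme, host and path. If only the count and the cleaned text are needed, a compiled pattern with non-capturing groups is a little simpler. This is a sketch intended to match the same URLs; `URL_RE` and `count_and_remove_urls` are hypothetical names.

```python
import re

# Scheme, dotted host, then an optional path that must not end in '.', ',' or ':'.
URL_RE = re.compile(
    r'(?:https?|ftp|ssh)://[\w-]+(?:\.[\w-]+)+'
    r'(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?'
)

def count_and_remove_urls(text):
    """Return (number_of_urls, text_with_urls_removed)."""
    text = str(text)
    return len(URL_RE.findall(text)), URL_RE.sub('', text)

n, cleaned = count_and_remove_urls(
    'hi, thanks to watching it. for more visit https://youtube.com/kgptalkie'
)
print(n)              # 1
print(repr(cleaned))  # 'hi, thanks to watching it. for more visit '
```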


#### Remove RT

In [194]:
df[df['twitts'].str.contains('rt')]

Unnamed: 0,twitts,sentiment,word_counts,char_counts,avg_word_len,stop_words_len,hashtags_count,mentions_count,numerics_count,upper_counts,emails,emails_count,url_flags
4,@mandagoforth me bad! it is funny though. zach...,0,26,116,4.461538,13,0,2,0,0,[],0,0
23,"ut oh, i wonder if the ram on the desktop is s...",0,14,46,3.285714,7,0,0,0,2,[],0,0
59,@paulmccourt dunno what sky you're looking at!...,0,15,80,5.333333,3,0,1,0,0,[],0,0
75,im back home in belfast im realli tired thoug...,0,22,84,3.818182,9,0,0,0,1,[],0,0
81,@lilmonkee987 i know what you mean... i feel s...,0,11,48,4.363636,5,0,1,0,0,[],0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3913,for the press so after she recovered she kille...,1,24,100,4.166667,1,0,0,0,0,[],0,0
3919,earned her cpr &amp; first aid certifications!,1,7,40,5.714286,1,0,0,0,1,[],0,0
3945,"@teciav &quot;i look high, i look low, i look ...",1,23,106,4.608696,10,0,1,0,0,[],0,0
3951,i am soo very parched. and hungry. oh and i am...,1,21,87,4.142857,7,0,0,2,1,[],0,0


In [195]:
x = 'rt @username: hello hirt'

In [196]:
re.sub(r'\brt\b', '', x).strip()

'@username: hello hirt'

In [197]:
df['twitts'] = df['twitts'].apply(lambda x: re.sub(r'\brt\b', '', x).strip())

#### Remove special characters and punctuation

In [198]:
df.sample(3)

Unnamed: 0,twitts,sentiment,word_counts,char_counts,avg_word_len,stop_words_len,hashtags_count,mentions_count,numerics_count,upper_counts,emails,emails_count,url_flags
2083,the sun is shining and it is very hot today,1,9,34,3.777778,3,0,0,0,0,[],0,0
3679,@shaundiviney whats the stack packs gonna incl...,1,19,89,4.684211,8,0,1,0,0,[],0,0
220,@vishuxpert he he he.... sorry cant help you o...,0,15,75,5.0,5,0,1,0,1,[],0,0


In [199]:
x = '@duyku apparently i was not ready enough... i...'

In [200]:
re.sub(r'[^\w ]+', "", x)

'duyku apparently i was not ready enough i'

In [201]:
df['twitts'] = df['twitts'].apply(lambda x: re.sub(r'[^\w ]+', "", x))

In [202]:
df.sample(5)

Unnamed: 0,twitts,sentiment,word_counts,char_counts,avg_word_len,stop_words_len,hashtags_count,mentions_count,numerics_count,upper_counts,emails,emails_count,url_flags
1834,carley amp kim are coming over but no mallory ...,0,17,71,4.176471,5,0,0,0,0,[],0,0
879,nnoooo my hair poofed but the color looks gre...,0,15,72,4.8,6,0,0,0,1,[],0,0
1124,thisappointed,0,1,12,12.0,0,0,0,0,0,[],0,0
2455,mother day today,1,3,14,4.666667,0,0,0,0,0,[],0,0
835,avoiding statistics and now suddenlink shut th...,0,22,92,4.181818,12,0,0,0,1,[],0,0
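Stripping punctuation directly turns HTML entities such as `&amp;` into stray words, as with `amp` in the sample above. One possible refinement is to decode entities with the standard library's `html.unescape` before removing special characters. This is a sketch; `strip_special_chars` is a hypothetical name.

```python
import html
import re

def strip_special_chars(text):
    """Decode HTML entities, then drop everything except word characters and spaces."""
    text = html.unescape(str(text))  # '&amp;' -> '&', '&quot;' -> '"'
    return re.sub(r'[^\w ]+', '', text)

print(repr(strip_special_chars('carley &amp; kim are coming over')))
# 'carley  kim are coming over'
```

The double space left behind is handled by the whitespace cleanup in the next step.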


#### Remove multiple spaces, e.g. "hi    hello    "

In [203]:
x =  'hi    hello     how are you'

In [204]:
' '.join(x.split())

'hi hello how are you'

In [205]:
df['twitts'] = df['twitts'].apply(lambda x: ' '.join(x.split()))



#### Remove HTML tags

In [206]:
%pip install beautifulsoup4 lxml

Note: you may need to restart the kernel to use updated packages.


In [207]:
from bs4 import BeautifulSoup

In [208]:
x = '<html><h1> thanks for watching it </h1></html>'

In [209]:
x.replace('<html><h1>', '').replace('</h1></html>', '') # hard-coded replacements only handle these exact tags, so this is not a good way to remove HTML

' thanks for watching it '

In [210]:
BeautifulSoup(x, 'lxml').get_text().strip()

'thanks for watching it'

In [211]:
%%time
df['twitts'] = df['twitts'].apply(lambda x: BeautifulSoup(x, 'lxml').get_text().strip())

CPU times: total: 734 ms
Wall time: 928 ms
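When installing extra packages is not an option, tags can also be stripped with the standard library's `html.parser`. The sketch below is an alternative to the BeautifulSoup call above, not a drop-in equivalent for malformed markup. `_TextExtractor` and `strip_html` are hypothetical names.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect only the text content found between tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(text):
    """Return the text with HTML tags removed and outer whitespace trimmed."""
    parser = _TextExtractor()
    parser.feed(str(text))
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts).strip()

print(strip_html('<html><h1> thanks for watching it </h1></html>'))
# thanks for watching it
```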


#### Remove accented characters

In [212]:
x = 'Áccěntěd těxt'

In [213]:
import unicodedata

In [214]:
def remove_accented_chars(x):
    x = unicodedata.normalize('NFKD', x).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return x

In [215]:
remove_accented_chars(x)

'Accented text'

In [216]:
df['twitts'] = df['twitts'].apply(lambda x: remove_accented_chars(x))


#### Remove stop words

In [217]:
x = 'this is a stop words'

In [218]:
' '.join([t for t in x.split() if t not in stopwords])

'stop words'

In [219]:
df['twitts_no_stop'] = df['twitts'].apply(lambda x: ' '.join([t for t in x.split() if t not in stopwords]))

In [220]:
df.sample(5)

Unnamed: 0,twitts,sentiment,word_counts,char_counts,avg_word_len,stop_words_len,hashtags_count,mentions_count,numerics_count,upper_counts,emails,emails_count,url_flags,twitts_no_stop
1022,i cannot sleep got to go to class at 8 am and ...,0,16,45,2.8125,8,0,0,1,2,[],0,0,sleep got class 8
3749,off for a swim ahh i love the water,1,9,27,3.0,5,0,0,0,0,[],0,0,swim ahh love water
3417,transbay are there actually coffeshopscafes op...,1,23,112,4.869565,7,0,1,0,2,[],0,0,transbay actually coffeshopscafes open sf 910p...
1538,ow my arm,0,3,7,2.333333,1,0,0,0,0,[],0,0,ow arm
3896,mrsteveharvey thanks for writing your book i l...,1,22,119,5.409091,7,0,1,0,1,[],0,0,mrsteveharvey thanks writing book loved alot g...


#### Convert into base or root form of the word

In [235]:
nlp = spacy.load('en_core_web_sm')

In [236]:
x = 'this is chocolates. what is times? this balls'

In [239]:
def make_to_base(x):
    x = str(x)
    x_list = []
    doc = nlp(x)
    for token in doc:
        lemma = token.lemma_
        # spaCy v2 lemmatizes pronouns to '-PRON-'; keep the original token for pronouns and forms of "be"
        if lemma == '-PRON-' or lemma == 'be':
            lemma = token.text
        x_list.append(lemma)
    return ' '.join(x_list)

In [240]:
make_to_base(x)

'this is chocolate . what is time ? this ball'

In [241]:
df['twitts'] = df['twitts'].apply(lambda x: make_to_base(x))

In [242]:
df.sample(5)

Unnamed: 0,twitts,sentiment,word_counts,char_counts,avg_word_len,stop_words_len,hashtags_count,mentions_count,numerics_count,upper_counts,emails,emails_count,url_flags,twitts_no_stop
3788,lux_bird oh yummie yummiewith icecreamhmmmm,1,7,72,10.285714,0,0,1,0,0,[],0,1,lux_bird oh yummie yummiewith icecreamhmmmm
1021,iamlittleboot someone that s not as good as yo...,0,15,69,4.6,6,0,1,0,2,[],0,0,iamlittleboots thats good hard find
2780,petiteandchic I have dressy top tunic short am...,1,18,110,6.111111,3,0,1,0,1,[],0,0,petiteandchic dressy tops tunics short amp lon...
2983,love tokio hotel tom my loveee,1,7,35,5.0,1,0,0,0,0,[],0,0,love tokio hotel tom loveee
670,a random beautiful baby just give I a hug whil...,0,27,108,4.0,11,0,0,0,2,[],0,0,random beautiful baby gave hug shopping presh ...
