## Complete text preprocessing


### General Feature Extraction
- File loading
- Word counts
- Characters count
- Average characters per word
- Stop words count
- Count #HashTags and @Mentions
- Whether numeric digits are present in tweets
- Upper case word counts


### Preprocessing and Cleaning
- Lower case
- Contraction to Expansion
- Emails removal and counts
- URLs removal and counts
- Removal of RT
- Removal of Special Characters
- Removal of multiple spaces
- Removal of HTML tags
- Removal of accented characters
- Removal of Stop Words
- Conversion into base form of words
- Common Occurring words Removal
- Rare Occurring words Removal
- Word Cloud
- Spelling Correction
- Tokenization
- Lemmatization
- Detecting Entities using NER
- Noun Detection
- Language Detection
- Sentence Translation
- Using Inbuilt Sentiment Classifier

In [1]:
import pandas as pd
import numpy as np
import spacy

In [2]:
from spacy.lang.en.stop_words import STOP_WORDS as stopwords

In [3]:
df = pd.read_csv('https://raw.githubusercontent.com/laxmimerit/twitter-data/master/twitter4000.csv', encoding = 'latin1')

In [4]:
df

Unnamed: 0,twitts,sentiment
0,is bored and wants to watch a movie any sugge...,0
1,back in miami. waiting to unboard ship,0
2,"@misskpey awwww dnt dis brng bak memoriessss, ...",0
3,ughhh i am so tired blahhhhhhhhh,0
4,@mandagoforth me bad! It's funny though. Zacha...,0
...,...,...
3995,i just graduated,1
3996,Templating works; it all has to be done,1
3997,mommy just brought me starbucks,1
3998,@omarepps watching you on a House re-run...lov...,1


In [5]:
df['sentiment'].value_counts()

0    2000
1    2000
Name: sentiment, dtype: int64



### Word counts

In [6]:
len('this is text'.split())

3

In [7]:
df['word_counts'] = df['twitts'].apply(lambda x: len(str(x).split()))

In [8]:
df.sample(5)

Unnamed: 0,twitts,sentiment,word_counts
3615,@Emerald20 hmmm I might have to swing by for t...,1,13
3484,last week on sat nite.. had a blast catching u...,1,26
2443,"I'm trying to work, but am highly distracted w...",1,17
905,i am not a sult!!!!!!! am i?,0,7
1859,I think Im gonna lay out in the sun a bit then...,0,17


In [9]:
df['word_counts'].max()

32

In [10]:
df['word_counts'].min()

1

In [11]:
df[df['word_counts']==1]

Unnamed: 0,twitts,sentiment,word_counts
385,homework,0,1
691,@ekrelly,0,1
1124,disappointed,0,1
1286,@officialmgnfox,0,1
1325,headache,0,1
1897,@MCRmuffin,0,1
2542,Graduated!,1,1
2947,reading,1,1
3176,@omeirdeleon,1,1
3470,www.myspace.com/myfinalthought,1,1



### Characters count

In [12]:
len('this is')

7

In [13]:
def char_counts(x):
    # Count characters, excluding all whitespace
    s = x.split()
    x = ''.join(s)
    return len(x)

In [14]:
char_counts('this is')

6

In [15]:
df['char_counts'] = df['twitts'].apply(lambda x: char_counts(str(x)))

In [16]:
df.sample(5)

Unnamed: 0,twitts,sentiment,word_counts,char_counts
1805,the mall is a dangerous place for my bank acco...,0,10,40
1706,"lesson all planned, topic: myspace identity......",0,13,62
3074,I am spending a lot more time on Twitter after...,1,21,100
3793,"That IS lame Liz lol, I should send you a care...",1,16,65
3563,"#flylady lunch has been consumed, ds2 down for...",1,18,74



### Average word length

In [17]:
x = 'this is' # 6/2 = 3
y = 'thankyou guys' # 12/2 = 6

In [18]:
df['avg_word_len'] = df['char_counts']/df['word_counts']

In [19]:
df.sample(4)

Unnamed: 0,twitts,sentiment,word_counts,char_counts,avg_word_len
3341,hey guys..this is my twitter page... a little ...,1,15,69,4.6
2333,Gooood morning....I CAN'T WAIT TO GET HOME,1,7,36,5.142857
3879,Awake lovely and sunny outside! Good night las...,1,15,70,4.666667
3677,wishes that he would realize that I love him!!,1,9,38,4.222222



### Stop words count

In [20]:
print(stopwords)

{'someone', 'thereby', 'top', 'becoming', 'it', 'only', 'yourself', 'forty', 'hence', 'being', 'themselves', 'among', 'twelve', 'what', 'myself', 'per', 'herein', 'at', 'thereafter', 'amongst', 'do', 'we', 'thereupon', 'the', 'these', 'fifteen', 'name', 'each', 'on', 'whatever', 'often', 'else', "'s", 'other', 'he', 'anyway', 'between', 'us', '‘d', 'where', 'throughout', 'nevertheless', 'some', "'d", 'became', 'whose', 'there', 'up', 'move', 'has', 'our', 'whole', 'hereupon', 'go', 'latterly', 'doing', 'upon', 'own', 'five', 'below', 'no', 'beforehand', 'cannot', 'serious', 'from', 'whereafter', 'am', 'will', 'now', '‘s', 'anyone', 'how', 'please', 'third', 'done', '‘re', 'whereas', 'to', 'four', 'until', 'whenever', 'under', 'mostly', 'get', 'thence', 'off', 'full', 'well', 'this', 'twenty', "'m", 'one', 'become', 'same', 'then', 'hereafter', 'ourselves', 'wherein', 'an', 'might', 'neither', 'beside', 'much', 'hers', 'elsewhere', '’ll', 'here', 'amount', 'former', 'anywhere', 'whom', 

In [21]:
len(stopwords)

326

In [22]:
x = 'this is the text data'

In [23]:
x.split()

['this', 'is', 'the', 'text', 'data']

In [24]:
[t for t in x.split() if t in stopwords]

['this', 'is', 'the']

In [25]:
len([t for t in x.split() if t in stopwords])

3

In [26]:
df['stop_words_len'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t in stopwords]))

In [27]:
df.sample(5)

Unnamed: 0,twitts,sentiment,word_counts,char_counts,avg_word_len,stop_words_len
2044,@Kat_LB @liberty100 @Sharan23 Yeah it would ...,1,15,75,5.0,5
2382,Good Morning! TGIF! Short workday for me then...,1,15,76,5.066667,4
489,thiking of goin to the library but not realy c...,0,11,52,4.727273,6
2559,@_Ely_ Check your DM.,1,4,18,4.5,1
3997,mommy just brought me starbucks,1,5,27,5.4,2
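
At this point the tweets have not been lower-cased yet, and the membership test above is case-sensitive, while the printed stop-word set is all lower-case. Capitalised tokens such as "The" or "I" are therefore not counted. A minimal sketch of a case-insensitive count follows; `SAMPLE_STOPWORDS` is a tiny hypothetical stand-in for spaCy's `STOP_WORDS`, and `count_stop_words` is an illustrative name, not part of the notebook.

```python
# Hypothetical stand-in for spaCy's STOP_WORDS so the example runs on its own.
SAMPLE_STOPWORDS = {'this', 'is', 'the', 'i', 'am'}

def count_stop_words(text, stop_set=SAMPLE_STOPWORDS):
    """Count whitespace-separated tokens that are stop words, ignoring case."""
    return sum(1 for t in str(text).split() if t.lower() in stop_set)

print(count_stop_words('This is THE text data'))  # 3
```

Passing the real `stopwords` set as `stop_set` should give the same behaviour on the DataFrame column.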



### Count hashtags & mentions

In [28]:
x = 'this is #hashtag and this is @mention'

In [29]:
x.split()

['this', 'is', '#hashtag', 'and', 'this', 'is', '@mention']

In [30]:
[t for t in x.split() if t.startswith('@')]

['@mention']

In [31]:
len([t for t in x.split() if t.startswith('@')])

1

In [32]:
df['hashtags_count'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t.startswith('#')]))

In [33]:
df['mentions_count'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t.startswith('@')]))

In [34]:
df.sample(5)

Unnamed: 0,twitts,sentiment,word_counts,char_counts,avg_word_len,stop_words_len,hashtags_count,mentions_count
553,@mashable the link does not work,0,6,27,4.5,3,0,1
3825,on my way 2 skool class started 2day,1,8,29,3.625,2,0,0
3878,@mahoekst dude....... the only other search en...,1,22,116,5.272727,12,0,1
3341,hey guys..this is my twitter page... a little ...,1,15,69,4.6,4,0,0
594,boo everyone is being lame now,0,6,25,4.166667,4,0,0
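
`startswith` only catches tags at the very start of a whitespace-separated token, so something like `(#nlp)` is missed. A hedged alternative is a regex count; the patterns below are assumptions about what should count as a hashtag or mention, and `count_tags` is an illustrative helper name.

```python
import re

# Assumed patterns: '#' or '@' followed by word characters,
# and not preceded by a word character (so e-mail addresses are skipped).
HASHTAG_RE = re.compile(r'(?<!\w)#\w+')
MENTION_RE = re.compile(r'(?<!\w)@\w+')

def count_tags(text):
    """Return (hashtag_count, mention_count) for one tweet."""
    text = str(text)
    return len(HASHTAG_RE.findall(text)), len(MENTION_RE.findall(text))

print(count_tags('this is (#hashtag) and this is @mention, mail a@b.com'))  # (1, 1)
```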



### Numeric digits count

In [35]:
x = 'this is 1 and 2'

In [36]:
x.split()

['this', 'is', '1', 'and', '2']

In [37]:
x.split()[3].isdigit()

False

In [38]:
[t for t in x.split() if t.isdigit()]

['1', '2']

In [39]:
df['numerics_count'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t.isdigit()]))

In [40]:
df.sample(5)

Unnamed: 0,twitts,sentiment,word_counts,char_counts,avg_word_len,stop_words_len,hashtags_count,mentions_count,numerics_count
2743,Monday Sale - www.bellassweetboutique.etsy.co...,1,20,109,5.45,5,0,0,0
877,can't help but eat chocolate with peanuts and ...,0,16,80,5.0,6,0,0,0
853,getting a bit sick of sundays or rather the p...,0,21,80,3.809524,10,0,0,0
2213,@paulaaaron She did win! She is the grand pri...,1,17,66,3.882353,7,0,1,0
3756,@DsBabyGirl yup! Camden is gonna be fun,1,7,33,4.714286,2,0,1,0
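
`str.isdigit()` is only true when the whole token consists of digits, so tokens like `2day` or `7:00!` are not counted. If any token containing a digit should count, a sketch along these lines may help; `count_numeric_tokens` is a hypothetical helper, not part of the notebook.

```python
import re

def count_numeric_tokens(text):
    """Count tokens that contain at least one digit (e.g. '2day', '7:00!')."""
    return sum(1 for t in str(text).split() if re.search(r'\d', t))

print(count_numeric_tokens('on my way 2 skool class started 2day'))  # 2
```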



### Upper case words count

In [41]:
x = 'I AM HAPPY'
y = 'i am happy'

In [42]:
[t for t in x.split() if t.isupper()]

['I', 'AM', 'HAPPY']

In [43]:
df['upper_counts'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t.isupper()]))

In [44]:
df.sample(5)

Unnamed: 0,twitts,sentiment,word_counts,char_counts,avg_word_len,stop_words_len,hashtags_count,mentions_count,numerics_count,upper_counts
159,@MissFredi But I feel so constanly ill @Davin...,0,8,46,5.75,1,0,2,0,1
2957,@delta_goodrem awww that's so you ones that t...,1,18,82,4.555556,6,0,1,0,1
2590,@eliasatya Hi Elias! Nice to meet you on Twitt...,1,14,58,4.142857,6,0,1,0,0
3520,Haters do your job you got me this far!!!! Tha...,1,15,66,4.4,8,0,0,0,0
3550,@SelfMade2k9 lol.Atleast someone likes it.,1,5,38,7.6,1,0,1,0,0


In [45]:
df.iloc[1012]['twitts']

'@lisalent  I am thinking of putting together a package for couples to elope on the central coast. Need a few photographers, interested?'

In [52]:
df.iloc[1012]['upper_counts']

1
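
The count of 1 here comes from the pronoun "I", which `isupper()` treats as an upper-case word. If the feature is meant to capture shouting, one option is to ignore single-character tokens. This is a sketch under that assumption; `count_shouted_words` and `min_len` are illustrative names.

```python
import string

def count_shouted_words(text, min_len=2):
    """Count upper-case tokens of at least min_len characters, ignoring edge punctuation."""
    count = 0
    for token in str(text).split():
        word = token.strip(string.punctuation)
        if len(word) >= min_len and word.isupper():
            count += 1
    return count

print(count_shouted_words("Gooood morning....I CAN'T WAIT TO GET HOME"))  # 5
```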



### Preprocessing & Cleaning

#### Lower case conversion

In [46]:
x = 'this is Text'

In [47]:
x.lower()

'this is text'

In [48]:
x = 45.0
str(x).lower()

'45.0'

In [49]:
df['twitts'] = df['twitts'].apply(lambda x: str(x).lower())

In [50]:
df.sample(5)

Unnamed: 0,twitts,sentiment,word_counts,char_counts,avg_word_len,stop_words_len,hashtags_count,mentions_count,numerics_count,upper_counts
1765,i am soooo bummed out about leaving this place...,0,16,66,4.125,6,0,0,0,1
398,it's so damn hot in this apartment. too bad i ...,0,26,107,4.115385,15,0,0,0,0
1601,"@strikeitfierce ahahah. yeahh, ive been good. ...",0,18,95,5.277778,6,0,1,0,0
221,"rain, rain, go away! so done with all the wet...",0,24,101,4.208333,7,0,0,0,2
2467,woot.going to gibalter. http://myloc.me/2bz6,1,4,41,10.25,1,0,0,0,0



#### Contraction to Expansion

In [53]:
contractions = { 
"ain't": "am not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'd've": "he would have",
"he'll": "he will",
"he'll've": "he will have",
"he's": "he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how does",
"i'd": "i would",
"i'd've": "i would have",
"i'll": "i will",
"i'll've": "i will have",
"i'm": "i am",
"i've": "i have",
"isn't": "is not",
"it'd": "it would",
"it'd've": "it would have",
"it'll": "it will",
"it'll've": "it will have",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she would",
"she'd've": "she would have",
"she'll": "she will",
"she'll've": "she will have",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so is",
"that'd": "that would",
"that'd've": "that would have",
"that's": "that is",
"there'd": "there would",
"there'd've": "there would have",
"there's": "there is",
"they'd": "they would",
"they'd've": "they would have",
"they'll": "they will",
"they'll've": "they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
" u ": " you ",
" ur ": " your ",
" n ": " and ",
"won't": "will not",
'dis': 'this',
'bak': 'back',
'brng': 'bring'}

In [54]:
x = "i'm don't he'll" # "i am do not he will"

In [55]:
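def cont_to_exp(x):
    # Expand contractions in a string; non-string values are returned unchanged.
    # Note: str.replace matches substrings, so short keys such as 'dis'
    # are also replaced inside longer words.
    if isinstance(x, str):
        for key, value in contractions.items():
            x = x.replace(key, value)
    return x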

In [56]:
cont_to_exp(x)

'i am do not he will'
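
Because `str.replace` works on substrings, a key like `'dis'` also fires inside "disappointed", and `"can't"` is replaced before the longer `"can't've"` can match. A possible variant that only replaces whole tokens is sketched below. It uses a small hypothetical subset of the mapping (`sample_contractions`), and `expand_whole_words` is an illustrative name.

```python
import re

# Small hypothetical subset of the mapping above, for illustration only.
sample_contractions = {
    "can't": "cannot",
    "can't've": "cannot have",
    "dis": "this",
    "u": "you",
}

# Longest keys first so "can't've" wins over "can't";
# the lookarounds keep matches on whole tokens only.
_keys = sorted(sample_contractions, key=len, reverse=True)
_pattern = re.compile(
    r"(?<![\w'])(" + "|".join(re.escape(k) for k in _keys) + r")(?![\w'])"
)

def expand_whole_words(text):
    """Expand contractions only where they appear as complete tokens."""
    if not isinstance(text, str):
        return text
    return _pattern.sub(lambda m: sample_contractions[m.group(1)], text)

print(expand_whole_words("dis is disappointing, u can't've known"))
# this is disappointing, you cannot have known
```

With this approach the padded keys (`" u "`, `" ur "`, `" n "`) could be stored without their surrounding spaces, since the lookarounds already enforce token boundaries.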

In [57]:
%%timeit
df['twitts'] = df['twitts'].apply(lambda x: cont_to_exp(x))

80 ms ± 3.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [59]:
df.sample(5)

Unnamed: 0,twitts,sentiment,word_counts,char_counts,avg_word_len,stop_words_len,hashtags_count,mentions_count,numerics_count,upper_counts
3476,@livehereandnow 7:00! yay!,1,3,24,8.0,0,0,1,0,1
818,wow ... it would been 10 mouths today ...,0,9,33,3.666667,2,0,0,1,0
3850,@jamesiwilliams i luv hello kitty,1,5,29,5.8,0,0,1,0,1
416,cannot seem to shake off the flu,0,7,25,3.571429,4,0,0,0,0
1795,with all this shit i have to pay for and get f...,0,22,81,3.681818,15,0,0,0,1


### Count & Remove Emails

In [60]:
import re

In [61]:
df[df['twitts'].str.contains('hotmail.com', regex=False)]

Unnamed: 0,twitts,sentiment,word_counts,char_counts,avg_word_len,stop_words_len,hashtags_count,mentions_count,numerics_count,upper_counts
3713,@securerecs arghh me please markbradbury_16@h...,1,5,51,10.2,0,0,1,0,0


In [62]:
df.iloc[3713]['twitts']

'@securerecs arghh me please  markbradbury_16@hotmail.com'

In [63]:
x = '@securerecs arghh me please  markbradbury_16@hotmail.com'

In [64]:
re.findall(r'([a-z0-9+._-]+@[a-z0-9+._-]+\.[a-z0-9+_-]+)', x)

['markbradbury_16@hotmail.com']

In [65]:
df['emails'] = df['twitts'].apply(lambda x: re.findall(r'([a-z0-9+._-]+@[a-z0-9+._-]+\.[a-z0-9+_-]+\b)', x))

In [71]:
df['emails_count'] = df['emails'].apply(lambda x: len(x))

In [72]:
df[df['emails_count'] > 0]

Unnamed: 0,twitts,sentiment,word_counts,char_counts,avg_word_len,stop_words_len,hashtags_count,mentions_count,numerics_count,upper_counts,emails,emails_count
3713,@securerecs arghh me please markbradbury_16@h...,1,5,51,10.2,0,0,1,0,0,[markbradbury_16@hotmail.com],1


In [73]:
re.sub(r'([a-z0-9+._-]+@[a-z0-9+._-]+\.[a-z0-9+_-]+)',"", x)

'@securerecs arghh me please  '

In [74]:
df['twitts'] = df['twitts'].apply(lambda x: re.sub(r'([a-z0-9+._-]+@[a-z0-9+._-]+\.[a-z0-9+_-]+)',"", x))

In [75]:
df[df['emails_count']>0]

Unnamed: 0,twitts,sentiment,word_counts,char_counts,avg_word_len,stop_words_len,hashtags_count,mentions_count,numerics_count,upper_counts,emails,emails_count
3713,@securerecs arghh me please,1,5,51,10.2,0,0,1,0,0,[markbradbury_16@hotmail.com],1
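
The same pattern is written out three times above, once with a trailing `\b` and twice without. A sketch that compiles it once and reuses it for both counting and removal is shown below. The `re.IGNORECASE` flag is an added assumption (the notebook relies on the text already being lower-cased), and `extract_and_strip_emails` and the sample address are hypothetical.

```python
import re

# Same shape as the pattern above, compiled once.
# IGNORECASE is an assumption so the helper also works on text that was not lower-cased.
EMAIL_RE = re.compile(r'[a-z0-9+._-]+@[a-z0-9+._-]+\.[a-z0-9+_-]+', re.IGNORECASE)

def extract_and_strip_emails(text):
    """Return (list_of_emails, text_without_emails)."""
    text = str(text)
    emails = EMAIL_RE.findall(text)
    cleaned = EMAIL_RE.sub('', text).strip()
    return emails, cleaned

emails, cleaned = extract_and_strip_emails('@securerecs arghh me please  Someone@Example.com')
print(emails)   # ['Someone@Example.com']
print(cleaned)  # @securerecs arghh me please
```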


### Count & Remove URLs

In [76]:
x = 'hi, thanks to watching it. for more visit https://youtube.com/kgptalkie'

In [77]:
#shh://git@git.com:username/repo.git=riif?%

In [78]:
re.findall(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', x)

[('https', 'youtube.com', '/kgptalkie')]

In [79]:
df['url_flags'] = df['twitts'].apply(lambda x: len(re.findall(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', x)))

In [80]:
df[df['url_flags']>0].sample(5)

Unnamed: 0,twitts,sentiment,word_counts,char_counts,avg_word_len,stop_words_len,hashtags_count,mentions_count,numerics_count,upper_counts,emails,emails_count,url_flags
2912,mission statement or something like that | ps...,1,9,66,7.333333,3,0,0,0,0,[],0,1
3396,taking the kids to golfland sunsplash in rosev...,1,17,112,6.588235,5,0,0,0,1,[],0,1
3254,http://twitpic.com/6qb66 - i found 3g in a car...,1,13,62,4.769231,4,0,0,0,3,[],0,1
3363,@jeffreecuntstar http://twitpic.com/42iau - co...,1,5,48,9.6,0,0,1,0,0,[],0,1
3807,@chloevictoriaxo http://twitpic.com/5oo0f - nice,1,4,45,11.25,0,0,1,0,0,[],0,1


In [81]:
x

'hi, thanks to watching it. for more visit https://youtube.com/kgptalkie'

In [82]:
re.sub(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '' , x)

'hi, thanks to watching it. for more visit '

In [83]:
df['twitts'] = df['twitts'].apply(lambda x: re.sub(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '' , x))

In [86]:
df[df['url_flags']>0].sample(5)

Unnamed: 0,twitts,sentiment,word_counts,char_counts,avg_word_len,stop_words_len,hashtags_count,mentions_count,numerics_count,upper_counts,emails,emails_count,url_flags
2017,buy the song,1,4,29,7.25,1,0,0,0,0,[],0,1
2954,@michaelsheen - ooh so cute,1,6,47,7.833333,1,0,1,0,0,[],0,1
1704,mondays often mean back to work. sigh.,0,8,56,7.0,3,0,0,0,0,[],0,1
3807,@chloevictoriaxo - nice,1,4,45,11.25,0,0,1,0,0,[],0,1
2703,@tommcfly - this one! its awesome!,1,7,53,7.571429,0,0,1,0,1,[],0,1
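
Because the pattern above contains capturing groups, `re.findall` returns tuples of (scheme, host, path) rather than whole URLs. If the full URL string is wanted, the groups can be made non-capturing. This is a sketch of that variant; `URL_RE` and `extract_urls` are illustrative names.

```python
import re

# Same pattern as above, but with non-capturing groups so findall returns full URLs.
URL_RE = re.compile(
    r'(?:https?|ftp|ssh)://[\w_-]+(?:\.[\w_-]+)+'
    r'(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?'
)

def extract_urls(text):
    """Return every URL found in the text as a complete string."""
    return URL_RE.findall(str(text))

print(extract_urls('for more visit https://youtube.com/kgptalkie'))
# ['https://youtube.com/kgptalkie']
```

`URL_RE.sub('', text)` then removes the same matches, which keeps counting and removal consistent.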


### Remove RT

In [87]:
df[df['twitts'].str.contains('rt')]

Unnamed: 0,twitts,sentiment,word_counts,char_counts,avg_word_len,stop_words_len,hashtags_count,mentions_count,numerics_count,upper_counts,emails,emails_count,url_flags
4,@mandagoforth me bad! it is funny though. zach...,0,26,116,4.461538,13,0,2,0,0,[],0,0
23,"ut oh, i wonder if the ram on the desktop is s...",0,14,46,3.285714,7,0,0,0,2,[],0,0
59,@paulmccourt dunno what sky you're looking at!...,0,15,80,5.333333,3,0,1,0,0,[],0,0
75,im back home in belfast im realli tired thoug...,0,22,84,3.818182,9,0,0,0,1,[],0,0
81,@lilmonkee987 i know what you mean... i feel s...,0,11,48,4.363636,5,0,1,0,0,[],0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3913,for the press so after she recovered she kille...,1,24,100,4.166667,1,0,0,0,0,[],0,0
3919,earned her cpr &amp; first aid certifications!,1,7,40,5.714286,1,0,0,0,1,[],0,0
3945,"@teciav &quot;i look high, i look low, i look ...",1,23,106,4.608696,10,0,1,0,0,[],0,0
3951,i am soo very parched. and hungry. oh and i am...,1,21,87,4.142857,7,0,0,2,1,[],0,0


In [88]:
x = 'rt @username: hello hirt'

In [89]:
re.sub(r'\brt\b', '', x).strip()

'@username: hello hirt'

In [90]:
df['twitts'] = df['twitts'].apply(lambda x: re.sub(r'\brt\b', '', x).strip())

### Special characters and punctuation removal

In [91]:
df.sample(3)

Unnamed: 0,twitts,sentiment,word_counts,char_counts,avg_word_len,stop_words_len,hashtags_count,mentions_count,numerics_count,upper_counts,emails,emails_count,url_flags
3217,@vectorlovers great ! i jus' had a total vect...,1,25,116,4.64,9,0,1,0,0,[],0,0
2742,@iraymondliu: i like that!,1,4,23,5.75,0,0,1,0,1,[],0,0
1390,i never thought that everything wud end this w...,0,31,107,3.451613,16,0,0,0,2,[],0,0


In [92]:
x = '@duyku apparently i was not ready enough... i...'

In [95]:
re.sub(r'[^\w ]+', "", x)

'duyku apparently i was not ready enough i'

In [96]:
df['twitts'] = df['twitts'].apply(lambda x: re.sub(r'[^\w ]+', "", x))

In [97]:
df.sample(5)

Unnamed: 0,twitts,sentiment,word_counts,char_counts,avg_word_len,stop_words_len,hashtags_count,mentions_count,numerics_count,upper_counts,emails,emails_count,url_flags
64,kankzxd i am jealous i love that man i missed ...,0,17,64,3.764706,6,0,1,0,3,[],0,0
3194,jamfactory i will be there i need more drople...,1,9,48,5.333333,3,0,1,0,0,[],0,0
2986,oh man the bent spoons ice cream is muy delic...,1,24,114,4.75,5,0,0,0,8,[],0,0
3661,jonathanrknight you guys gotta do it again in ...,1,17,68,4.0,9,0,1,0,0,[],0,0
370,arghhhhhhhhhhhhh fuck the storm cannot go out ...,0,10,52,5.2,5,0,0,0,0,[],0,0


### Remove multiple white spaces "hi    hello    "

In [98]:
x =  'hi    hello     how are you'

In [99]:
' '.join(x.split())

'hi hello how are you'

In [101]:
df['twitts'] = df['twitts'].apply(lambda x: ' '.join(x.split()))



### Remove HTML tags

In [102]:
%pip install beautifulsoup4 lxml

Note: you may need to restart the kernel to use updated packages.


In [103]:
from bs4 import BeautifulSoup

In [104]:
x = '<html><h1> thanks for watching it </h1></html>'

In [105]:
x.replace('<html><h1>', '').replace('</h1></html>', '') # brittle: only removes these exact tags, so it does not generalise

' thanks for watching it '

In [106]:
BeautifulSoup(x, 'lxml').get_text().strip()

'thanks for watching it'

In [107]:
%%time
df['twitts'] = df['twitts'].apply(lambda x: BeautifulSoup(x, 'lxml').get_text().strip())

CPU times: total: 750 ms
Wall time: 814 ms
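
Note that the special-character step earlier already removes `<`, `>`, `&` and `;`, so tag stripping would need to run before it to have an effect on raw markup. If adding `bs4` and `lxml` is not an option, a standard-library sketch along the following lines could be used instead; `_TextExtractor` and `strip_html` are hypothetical names, and this is not a full replacement for BeautifulSoup on malformed HTML.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True by default, so &amp; becomes &
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(text):
    """Drop tags, decode entities, and collapse whitespace."""
    parser = _TextExtractor()
    parser.feed(str(text))
    parser.close()  # flush any buffered text
    return ' '.join(''.join(parser.parts).split())

print(strip_html('<html><h1> thanks for watching it </h1></html>'))
# thanks for watching it
```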


### Remove accented chars

In [108]:
x = 'Áccěntěd těxt'

In [109]:
import unicodedata

In [110]:
def remove_accented_chars(x):
    x = unicodedata.normalize('NFKD', x).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return x

In [111]:
remove_accented_chars(x)

'Accented text'

In [112]:
df['twitts'] = df['twitts'].apply(lambda x: remove_accented_chars(x))
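
The ASCII round trip above also discards any character that has no ASCII decomposition (for example `ø`), not just the accent marks. If only the accents should go, one alternative is to drop the Unicode combining marks after normalisation. This is a sketch; `strip_combining_marks` is an illustrative name.

```python
import unicodedata

def strip_combining_marks(text):
    """Remove accent marks but keep characters that have no ASCII equivalent."""
    decomposed = unicodedata.normalize('NFKD', str(text))
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_combining_marks('Áccěntěd těxt'))  # Accented text
print(strip_combining_marks('Søren'))          # Søren
```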


### Remove stop words

In [113]:
x = 'this is a stop words'

In [114]:
' '.join([t for t in x.split() if t not in stopwords])

'stop words'

In [115]:
df['twitts_no_stop'] = df['twitts'].apply(lambda x: ' '.join([t for t in x.split() if t not in stopwords]))

In [116]:
df.sample(5)

Unnamed: 0,twitts,sentiment,word_counts,char_counts,avg_word_len,stop_words_len,hashtags_count,mentions_count,numerics_count,upper_counts,emails,emails_count,url_flags,twitts_no_stop
1495,so bb rocks again lol i dnt want it to take ov...,0,25,85,3.4,9,0,0,0,2,[],0,0,bb rocks lol dnt want summer bt wil lol nyt il...
1878,mad the nuggets just gave up,0,6,24,4.0,2,0,0,0,0,[],0,0,mad nuggets gave
2430,katyonak get 100 followers a day using wwwtwee...,1,20,95,4.75,9,0,1,1,0,[],0,0,katyonak 100 followers day wwwtweeteraddercom ...
1101,summer school is not fun im tired of all this ...,0,15,58,3.866667,7,0,0,0,0,[],0,0,summer school fun im tired fucking work cold
2414,boris kodjoe what a specimen,1,5,28,5.6,2,0,0,0,0,[],0,0,boris kodjoe specimen
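
Row 1101 above shows a side effect worth noting for sentiment work: "summer school is not fun" loses its negation and becomes "summer school fun". One hedged option is to exclude negation words from the stop list before filtering. In the sketch below, `SAMPLE_STOPWORDS` is a tiny hypothetical stand-in for spaCy's `STOP_WORDS`, and `NEGATIONS` is an assumed list that would need tuning for a real task.

```python
# Hypothetical stand-in for spaCy's STOP_WORDS so the example runs on its own.
SAMPLE_STOPWORDS = {'is', 'not', 'of', 'all', 'this', 'no', 'the', 'a'}

# Assumed set of negation words to keep; adjust as needed.
NEGATIONS = {'not', 'no', 'never', 'cannot'}

def remove_stop_words(text, stop_set=SAMPLE_STOPWORDS, keep=NEGATIONS):
    """Remove stop words but preserve the tokens listed in `keep`."""
    effective = stop_set - keep
    return ' '.join(t for t in str(text).split() if t not in effective)

print(remove_stop_words('summer school is not fun'))  # summer school not fun
```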
