<a href="https://colab.research.google.com/github/CRSpradlin/natural-language-processing-course/blob/main/NLP%20Course%20Work/6.%20Text%20Cleaning%20and%20Preprocessing/CompleteTextCleaningAndPreprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [153]:
import pandas as pd
import numpy as np
import spacy

In [154]:
from spacy.lang.en.stop_words import STOP_WORDS as stopwords

In [155]:
df = pd.read_csv('https://raw.githubusercontent.com/laxmimerit/twitter-data/master/twitter4000.csv', encoding = 'latin1')

In [156]:
df

Unnamed: 0,twitts,sentiment
0,is bored and wants to watch a movie any sugge...,0
1,back in miami. waiting to unboard ship,0
2,"@misskpey awwww dnt dis brng bak memoriessss, ...",0
3,ughhh i am so tired blahhhhhhhhh,0
4,@mandagoforth me bad! It's funny though. Zacha...,0
...,...,...
3995,i just graduated,1
3996,Templating works; it all has to be done,1
3997,mommy just brought me starbucks,1
3998,@omarepps watching you on a House re-run...lov...,1


## Goal: Preprocess This Data
This dataset contains a lot of shorthand and slang that is not standard English. We need to clean it so that downstream processing is as accurate as possible.

In [157]:
df['sentiment'].value_counts()

sentiment
0    2000
1    2000
Name: count, dtype: int64

## Word Counts

In [158]:
len('this is example text'.split())

4

In [159]:
# Calculate the word count of the twitts column and store the result in a new column of the data frame.
df['word_count'] = df['twitts'].apply(lambda x: len(str(x).split()))

In [160]:
df.sample(5)

Unnamed: 0,twitts,sentiment,word_count
2479,So tired... Getting ready for juries. Just gav...,1,19
3026,Heading back to the R.C. tomorrow. Great vaca...,1,8
404,I literally feel sick without him...I hate tha...,0,26
410,"Feel sooo bad! I got Matty sick, now I'm at wo...",0,23
3387,Another loser: I'm happy Kalkata Knight Riders...,1,13


In [161]:
df['word_count'].max()

32

In [162]:
df['word_count'].min()

1

In [163]:
# Return only the rows whose tweet consists of a single word.
df[df['word_count']==1]

Unnamed: 0,twitts,sentiment,word_count
385,homework,0,1
691,@ekrelly,0,1
1124,disappointed,0,1
1286,@officialmgnfox,0,1
1325,headache,0,1
1897,@MCRmuffin,0,1
2542,Graduated!,1,1
2947,reading,1,1
3176,@omeirdeleon,1,1
3470,www.myspace.com/myfinalthought,1,1
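
The `str(x)` wrapper in the word count matters when a cell is not a string. The sketch below is a plain-Python version of the same count (the name `word_count` is only illustrative) and shows two edge cases: repeated spaces are ignored by `split()`, and a missing value becomes the single token `'nan'`.

```python
def word_count(value):
    """Count whitespace-separated tokens; non-string values are coerced with str()."""
    return len(str(value).split())

# split() with no arguments collapses runs of whitespace.
print(word_count('back in miami.  waiting to unboard ship'))  # 7

# A missing value (float NaN) is stringified to 'nan' and counted as one word.
print(word_count(float('nan')))  # 1
```

If the column could contain missing values, it may be worth filtering them out before counting rather than treating `'nan'` as a word.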


## Character Count

In [164]:
def char_count(text):
  s = text.split()
  s = ''.join(s)
  return len(s)

In [165]:
char_count('this test')

8

In [166]:
# Much like the word count, calculate the character count and add it to the data frame as a new column. (Without Spaces)
df['char_count'] = df['twitts'].apply(lambda x: char_count(str(x)))

In [167]:
df.sample(5)

Unnamed: 0,twitts,sentiment,word_count,char_count
2659,Just got back from shopping w/ mom now readin...,1,13,52
2555,@manicmother how's the little one doing? And ...,1,11,53
2997,"Sweet! ah, the twinkie.",1,4,20
2937,kickin it w/ me abygirl,1,5,19
3048,Not sleeping. Sugarfree redbull is currently m...,1,13,74


## Average Word Length

In [168]:
x = 'this is an example' # 15/4 = 3.75
y = 'this is another example' # 20/4 = 5.0

In [169]:
df['avg_word_length'] = df['char_count']/df['word_count']

In [170]:
df.sample(5)

Unnamed: 0,twitts,sentiment,word_count,char_count,avg_word_length
519,Second bar I've tried has flakey wifi. Left a...,0,26,107,4.115385
1720,oh gosh i'm so unprepared for the upcoming tes...,0,27,109,4.037037
1439,That's a nasty sounding cat fight outside. I t...,0,26,109,4.192308
1293,"@orangecatblues Ugh, flea goop and THEN gettin...",0,12,66,5.5
1474,Shit happens But Why ??? I hate this part of ...,0,32,94,2.9375
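
As a quick check of the arithmetic, the average word length can also be computed directly from the text. This is a minimal sketch, and the helper name `avg_word_length` is an assumption made for illustration. It also guards against an empty string, which the column division above does not need here because the minimum word count is 1.

```python
def avg_word_length(text):
    """Average number of non-space characters per whitespace-separated word."""
    words = text.split()
    if not words:
        return 0.0  # avoid dividing by zero for empty text
    return sum(len(word) for word in words) / len(words)

print(avg_word_length('this is an example'))       # 15 / 4 = 3.75
print(avg_word_length('this is another example'))  # 20 / 4 = 5.0
```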


## Stop Words Count
As a reminder, stop words are words that are very common in English but carry little information about the sentiment of a sentence. We usually want to remove them, but in some cases they are useful (e.g. can, cannot).
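
One way to keep the useful words is to subtract a small "keep" set from the stop word set before filtering. The sketch below uses a tiny stand-in set instead of spaCy's `STOP_WORDS` so that it runs on its own. `SAMPLE_STOPWORDS`, `KEEP`, and `remove_stopwords` are hypothetical names, not part of spaCy.

```python
# Small stand-in for a real stop word set, used only so the sketch is self-contained.
SAMPLE_STOPWORDS = {'i', 'can', 'cannot', 'not', 'this', 'is', 'a', 'the', 'do'}

# Hypothetical set of words we want to keep because they carry negation.
KEEP = {'cannot', 'not'}

def remove_stopwords(text, stopwords, keep=frozenset()):
    """Drop stop words from text, except those listed in keep."""
    effective = set(stopwords) - set(keep)
    return ' '.join(word for word in text.split() if word.lower() not in effective)

print(remove_stopwords('i cannot do this', SAMPLE_STOPWORDS))        # every word is removed
print(remove_stopwords('i cannot do this', SAMPLE_STOPWORDS, KEEP))  # 'cannot' survives
```

The same function should work with the real stop word set passed as `stopwords`, assuming it is a set of lowercase strings.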

In [171]:
print(stopwords)

{'throughout', 'hereafter', 'whose', 'became', 'whole', 'two', 'several', 'among', 'yours', 'using', '‘ve', 'both', 'four', 'though', 'which', 'i', 'even', 'back', 'by', '’s', 'could', 'but', 'while', 'third', 'however', 're', 'fifteen', 'thereby', 'keep', 'own', 'against', 'themselves', '’m', 'wherever', 'say', 'move', 'everything', 'had', 'never', 'am', 'beyond', 'toward', 'a', 'what', 'bottom', 'behind', 'therefore', 'serious', "'re", 'after', 'myself', 'everyone', 'well', "'ve", 'make', 'because', 'call', 'except', 'eight', '’ll', 'might', 'forty', 'will', 'fifty', "'s", 'few', 'thru', 'whereas', 'further', 'sixty', 'whereafter', 'can', 'indeed', 'less', 'regarding', 'still', 'amongst', 'go', 'neither', 'been', 'beside', 'meanwhile', 'per', 'so', '’re', 'sometime', 'whereby', 'an', 'seem', 'until', 'via', 'if', 'beforehand', 'to', 'five', 'himself', 'one', 'during', 'part', 'former', 'although', 'thence', 'nor', 'give', 'towards', 'here', 'we', 'also', 'did', 'mine', 'why', 'may', 

In [172]:
len(stopwords)

326

In [173]:
x = 'this is an example'
words = x.split()
count = 0
for word in words:
  if word in stopwords:
    count += 1
print('Stopword count: ' + str(count))

Stopword count: 3


In [174]:
stopwords_count = lambda x: len([word for word in x.split() if word in stopwords])

In [175]:
df['stop_words_count'] = df.twitts.apply(stopwords_count)

In [176]:
df.sample(5)

Unnamed: 0,twitts,sentiment,word_count,char_count,avg_word_length,stop_words_count
481,Hates movin! Ahhh help me,0,5,21,4.2,1
1248,is missing the sun again. Stuck in a windowles...,0,10,46,4.6,4
214,"my eye keeps twitching, lack of sleep",0,7,31,4.428571,2
2612,"@dannynic Hola Barrow, we are in Las Palmas, G...",1,22,94,4.272727,9
1030,fml. why don't any of my friends have money? w...,0,24,98,4.083333,10
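
Note that the count above is an exact, case-sensitive match and runs before the text is lowercased, so tokens such as `This` or `me!` are not counted. A possible variant, sketched here with a small stand-in set (`SAMPLE_STOPWORDS` and `stopword_count` are illustrative names), lowercases each token and strips surrounding punctuation first.

```python
import string

# Stand-in for a real stop word set, kept tiny so the sketch is self-contained.
SAMPLE_STOPWORDS = {'this', 'is', 'an', 'the', 'me'}

def stopword_count(text, stopwords):
    """Count stop words, ignoring case and punctuation attached to a token."""
    count = 0
    for word in text.split():
        normalized = word.lower().strip(string.punctuation)
        if normalized in stopwords:
            count += 1
    return count

print(stopword_count('This is an example', SAMPLE_STOPWORDS))  # 3
print(stopword_count('Help me!', SAMPLE_STOPWORDS))            # 1
```

With exact matching, the same two inputs would give 2 and 0 respectively.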


## Count HashTags and Mentions (\#, @)

In [177]:
x = 'this is #hastag and this is @mention'
x.split()

['this', 'is', '#hastag', 'and', 'this', 'is', '@mention']

In [178]:
[word for word in x.split() if word.startswith('#') or word.startswith('@')]

['#hastag', '@mention']

In [179]:
df['hastags'] = df.twitts.apply(lambda x: len([word for word in x.split() if word.startswith('#')]))

In [180]:
df['mentions'] = df.twitts.apply(lambda x: len([word for word in x.split() if word.startswith('@')]))

In [181]:
df[df['mentions']>0].sample(5)

Unnamed: 0,twitts,sentiment,word_count,char_count,avg_word_length,stop_words_count,hastags,mentions
1152,@stilettojungle thanks for the bracelet sale p...,0,15,97,6.466667,4,0,1
2568,"@ratrapid wow! that inventon is so good, im gl...",1,14,54,3.857143,6,0,1
565,@derrickkendall that is if i'm not busy murder...,0,17,84,4.941176,8,0,1
3121,@Kakaze Can you please tell @johnrauchman how ...,1,14,67,4.785714,8,0,2
2729,"@sdashwood heyy nice to meet yaa.. btw, what ...",1,15,64,4.266667,4,0,1


In [182]:
df[df['hastags']>0].sample(5)

Unnamed: 0,twitts,sentiment,word_count,char_count,avg_word_length,stop_words_count,hastags,mentions
3938,#DMCwmnSHOW enjoying the show very much from s...,1,8,44,5.5,5,1,0
3244,#Iremember the skipper grows up doll -- move h...,1,18,73,4.055556,9,1,0
475,@Restrictor I can only imagine what it might h...,0,23,113,4.913043,10,1,1
3941,#assassinate is also trending because #spymast...,1,10,65,6.5,4,2,0
2577,#3hotwords - @britneyspears' &quot;phonography...,1,4,49,12.25,0,1,1
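
The `startswith` checks count a bare `@` (as in `@ home`) as a mention and miss tags that follow punctuation, such as `(#monday)`. The regular expressions below are one possible stricter alternative, not the method used in this notebook. They require at least one word character after the symbol and no word character immediately before it, which also skips e-mail-like text.

```python
import re

# '#' or '@' must not be preceded by a word character and must be followed by one or more.
HASHTAG_RE = re.compile(r'(?<!\w)#\w+')
MENTION_RE = re.compile(r'(?<!\w)@\w+')

def count_hashtags(text):
    return len(HASHTAG_RE.findall(text))

def count_mentions(text):
    return len(MENTION_RE.findall(text))

sample = "sorry i'm @ home, thanks @friend! (#monday) a@b.com"
print(count_mentions(sample))  # 1: only '@friend'
print(count_hashtags(sample))  # 1: '#monday'
```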


## Numeric Digit Count

In [183]:
x = 'this is 1 and 2'
x.split()[2].isdigit()

True

In [184]:
[digit for digit in x.split() if digit.isdigit()]

['1', '2']

In [185]:
df['numeric_count'] = df.twitts.apply(lambda x: len([digit for digit in x.split() if digit.isdigit()]))

In [186]:
df[df['numeric_count']>0].sample(5)

Unnamed: 0,twitts,sentiment,word_count,char_count,avg_word_length,stop_words_count,hastags,mentions,numeric_count
1530,"On my way 2 da pool party! Lost a lil weight, ...",0,15,58,3.866667,3,0,0,1
2668,@chocoboy1der - Earl! I need you 2 vote 4 me. ...,1,11,51,4.636364,1,0,1,2
837,Sorry I'm @ home bored 2....waiting 4 my baby ...,0,12,49,4.083333,2,0,1,2
3273,@Nennamusic thats all you need babe key 2 life,1,9,38,4.222222,2,0,1,1
2378,just recieved my oasis tickets for oasis heato...,1,20,85,4.25,6,0,0,1
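
Because `isdigit()` is applied to whole tokens, a number glued to punctuation (for example `2....waiting`) is not counted. A regex with word boundaries is one possible alternative. This is only a sketch: it would also count each side of a decimal such as `3.5` separately, so it is not a drop-in replacement.

```python
import re

# A run of digits with a word boundary on both sides, so '2' in '2....waiting'
# matches but the digits inside 'chocoboy1der' or '12e40f' do not.
NUMBER_RE = re.compile(r'\b\d+\b')

def count_numbers(text):
    return len(NUMBER_RE.findall(text))

sample = 'bored 2....waiting 4 my baby'
print(count_numbers(sample))                        # 2
print(len([t for t in sample.split() if t.isdigit()]))  # 1 with the token-based check
```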


## Uppercase Word Count

In [187]:
x = 'EXAMPLE ONE' # Uppercase usually means higher-intensity emotion
y = 'example two'

print([word for word in x.split() if word.isupper()])
print([word for word in y.split() if word.isupper()])

['EXAMPLE', 'ONE']
[]


In [188]:
df['upper_count'] = df.twitts.apply(lambda x: len([word for word in x.split() if word.isupper()]))

In [189]:
df[df['upper_count']>0].sample(5)

Unnamed: 0,twitts,sentiment,word_count,char_count,avg_word_length,stop_words_count,hastags,mentions,numeric_count,upper_count
573,Oh my what a night! haha Kasey and I are stayi...,0,27,103,3.814815,13,0,0,1,1
35,"I was rollin' up Prince Ave, heard all the sir...",0,17,69,4.058824,7,0,0,0,1
3281,@NikkiLorenzo. Hey miss thang. AJ is gunna che...,1,15,68,4.533333,3,0,1,0,2
2128,@tdrracing interviewed by the Chron and mentio...,1,16,79,4.9375,6,0,1,1,1
2253,"Hilarious text, I always knew that there was s...",1,15,107,7.133333,7,0,0,0,1
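
One caveat: `'I'.isupper()` is `True`, so the pronoun "I" is counted as an uppercase word. If the goal is to measure "shouting", a possible refinement is to require a minimum number of letters. The function name and the `min_letters` parameter below are assumptions made for illustration.

```python
def count_shouted_words(text, min_letters=2):
    """Count all-uppercase tokens that contain at least min_letters letters."""
    count = 0
    for word in text.split():
        letters = [ch for ch in word if ch.isalpha()]
        if len(letters) >= min_letters and word.isupper():
            count += 1
    return count

print(count_shouted_words('I am SO tired, AJ!'))                 # 2: 'SO' and 'AJ!'
print(count_shouted_words('I am SO tired, AJ!', min_letters=1))  # 3: 'I' is included
```

Initials and acronyms such as `AJ` are still counted, so the result remains a rough signal of intensity.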


## Lower Case Conversion

In [190]:
x = 'this is Text example'
x.lower() # text normalization

'this is text example'

In [191]:
x = 45.0
# x.lower() # This raises AttributeError: 'float' object has no attribute 'lower'
str(x).lower()

'45.0'

In [192]:
df['twitts'] = df['twitts'].apply(lambda x: str(x).lower())
# Note: we lowercase only after computing upper_count; lowercasing first would make that count zero for every row.

In [193]:
df.sample(5)

Unnamed: 0,twitts,sentiment,word_count,char_count,avg_word_length,stop_words_count,hastags,mentions,numeric_count,upper_count
1273,"wow, really tired and unmotivated today. but t...",0,24,112,4.666667,9,0,0,0,1
1562,"@dollparts666 eh! hate life, they might! po...",0,22,106,4.818182,8,0,1,0,0
341,primavera all morning. bugs in the system = gr...,0,18,84,4.666667,7,0,0,0,0
3765,listening to kate voegele (as always!) &amp; g...,1,22,97,4.409091,5,0,0,1,0
523,misses your jokes. http://plurk.com/p/12e40f,0,4,41,10.25,1,0,0,0,0


## Expansion of Contractions

In [194]:
contractions = {
"ain't": "am not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'd've": "he would have",
"he'll": "he will",
"he'll've": "he will have",
"he's": "he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how does",
"i'd": "i would",
"i'd've": "i would have",
"i'll": "i will",
"i'll've": "i will have",
"i'm": "i am",
"i've": "i have",
"isn't": "is not",
"it'd": "it would",
"it'd've": "it would have",
"it'll": "it will",
"it'll've": "it will have",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she would",
"she'd've": "she would have",
"she'll": "she will",
"she'll've": "she will have",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so is",
"that'd": "that would",
"that'd've": "that would have",
"that's": "that is",
"there'd": "there would",
"there'd've": "there would have",
"there's": "there is",
"they'd": "they would",
"they'd've": "they would have",
"they'll": "they will",
"they'll've": "they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
"u": " you ",
"ur": " your ",
"n": " and ",
"won't": "would not",
'dis': 'this',
'bak': 'back',
'brng': 'bring'
}

In [195]:
x = "I'm don't he'll" # "I am", "do not", "he will"

In [196]:
def expand_contractions(text, contractions):
  words = text.split()
  for index, word in enumerate(words):
    lower = str(word).lower()
    if lower in contractions:
      words[index] = contractions[lower]
  return ' '.join(words)

In [197]:
expand_contractions(x, contractions)

'i am do not he will'
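
Because the lookup is done on whitespace-separated tokens, a contraction followed by punctuation (`don't,`) or written with a curly apostrophe (`don’t`) is not matched. The sketch below shows one possible regex-based variant. It uses a small stand-in dictionary, and the name `expand_contractions_punct` is hypothetical.

```python
import re

# Small stand-in for the full contractions dictionary.
SAMPLE_CONTRACTIONS = {"i'm": "i am", "don't": "do not", "he'll": "he will"}

# A token is a run of letters and apostrophes; punctuation stays outside the match.
TOKEN_RE = re.compile(r"[A-Za-z']+")

def expand_contractions_punct(text, contractions):
    """Expand contractions even when punctuation is attached to the word."""
    text = text.replace('\u2019', "'")  # normalize curly apostrophes

    def replace(match):
        token = match.group(0)
        return contractions.get(token.lower(), token)

    return TOKEN_RE.sub(replace, text)

print(expand_contractions_punct("I'm sure he'll say don't, right?", SAMPLE_CONTRACTIONS))
# i am sure he will say do not, right?
```

Unlike the `split`/`join` version, this variant leaves the original whitespace untouched.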

In [198]:
%%timeit
df['twitts'] = df.twitts.apply(lambda x: expand_contractions(x, contractions))

27.6 ms ± 7.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [199]:
df.sample(5)

Unnamed: 0,twitts,sentiment,word_count,char_count,avg_word_length,stop_words_count,hastags,mentions,numeric_count,upper_count
1166,gotta write 15 pages about witches in the midd...,0,12,54,4.5,4,0,0,1,0
230,@rustyrockets i am dancing due to &quot;wolf s...,0,22,125,5.681818,10,0,1,0,0
2420,my family i love these boys http://twitpic.com...,1,7,46,6.571429,2,0,0,0,1
2647,please keep following me.,1,4,22,5.5,2,0,0,0,0
3703,song of the day: outta here by esmee denters--...,1,23,86,3.73913,13,0,0,0,0


You may need to manually extend the contractions dictionary with other contractions and shorthand words that you encounter, and then re-run your processing.
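
To decide what to add, it can help to list the most frequent apostrophe-containing tokens that the dictionary does not cover yet (the dictionary above has no entry for `you're` or `we're`, for example). The helper below is a rough sketch with illustrative names. It will also flag possessives such as `mom's`, so the output still needs manual review.

```python
import re
from collections import Counter

def unmapped_contractions(texts, contractions, top_n=10):
    """Return the most common apostrophe tokens that are missing from contractions."""
    counts = Counter()
    for text in texts:
        # letters, an apostrophe, then more letters or apostrophes (e.g. "we're")
        for token in re.findall(r"[a-z]+'[a-z']+", str(text).lower()):
            if token not in contractions:
                counts[token] += 1
    return counts.most_common(top_n)

sample_texts = ["you're late and we're waiting", "we're done, don't worry"]
known = {"don't": "do not"}
print(unmapped_contractions(sample_texts, known))
# [("we're", 2), ("you're", 1)]
```

On the real data, the call would presumably look like `unmapped_contractions(df['twitts'], contractions)`.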