What should we do for text pre-processing?

・Convert text to lowercase – This avoids distinguishing between words that differ only in case.

・Remove numbers – Numbers may or may not be relevant to our analysis. They usually do not carry any importance in sentiment analysis.

・Remove punctuation – Punctuation can provide grammatical context that supports understanding, but for bag-of-words sentiment analysis it does not add value.

・Remove English stop words – Stop words are the most common words in a language. Words like “for”, “of” and “are” are typical English stop words.

・Remove own stop words (if required) – Instead of, or in addition to, the English stop words, we can remove our own stop words. The choice of own stop words might depend on the domain of discourse, and might not become apparent until we’ve done some analysis.

・Strip white space – Eliminate extra white spaces.

・Stemming – Reduces words to a root form. Stemming uses an algorithm that removes common English word endings, such as “es”, “ed” and “’s”. For example, “computer” and “computers” both become “comput”.

・Lemmatisation – Transforms words to their dictionary base form, e.g., “produce” and “produced” both become “produce”.

・Sparse terms – We are often not interested in infrequent terms in our documents. Such “sparse” terms should be removed from the document term matrix.

Reference: https://datascience.stackexchange.com/questions/11402/preprocessing-text-before-use-rnn/11421
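
The sparse-term step from the list above is not demonstrated in the cells that follow, so here is a minimal sketch of it using only the standard library. It counts in how many documents each term appears and drops terms below a threshold. The function name `drop_sparse_terms`, the `min_doc_freq` parameter and the toy documents are assumptions for illustration, not part of the original pipeline.

```python
from collections import Counter

def drop_sparse_terms(docs, min_doc_freq=2):
    """Remove terms that occur in fewer than min_doc_freq documents.

    docs is a list of whitespace-separated strings (hypothetical input format).
    """
    tokenized = [doc.split() for doc in docs]
    # Document frequency: count each term at most once per document.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    return [" ".join(t for t in tokens if doc_freq[t] >= min_doc_freq)
            for tokens in tokenized]

docs = ["edit war talk page", "talk page template", "edit template rare"]
print(drop_sparse_terms(docs))
```

A sensible threshold depends on the corpus size; with around 160,000 comments, a value well above 2 would probably be needed to shrink the vocabulary noticeably.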

In [1]:
import pandas as pd   
from nltk import tokenize
import re
from nltk.corpus import stopwords
from nltk import stem
df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")

In [2]:
df_train["comment_text"] = df_train["comment_text"].str.replace("\n"," ") #replace newlines with spaces
df_test["comment_text"] = df_test["comment_text"].str.replace("\n"," ")
df_train.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation Why the edits made under my userna...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,""" More I can't make any real suggestions on im...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


In [3]:
df_test.head()

Unnamed: 0,id,comment_text
0,00001cee341fdb12,Yo bitch Ja Rule is more succesful then you'll...
1,0000247867823ef7,"== From RfC == The title is fine as it is, ..."
2,00013b17ad220c46,""" == Sources == * Zawe Ashton on Lapland..."
3,00017563c3f7919a,":If you have a look back at the source, the in..."
4,00017695ad8997eb,I don't anonymously edit articles at all.


In [50]:
s = df_train["comment_text"][0]
print(s)

Explanation Why the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27


In [51]:
# Convert capital letters to lowercase
s = s.lower()
print(s)

explanation why the edits made under my username hardcore metallica fan were reverted? they weren't vandalisms, just closure on some gas after i voted at new york dolls fac. and please don't remove the template from the talk page since i'm retired now.89.205.38.27


In [61]:
#Number removal
def remove_number(s):
    s = re.sub(r'[0-9]', "", s)
    return s
s = remove_number(s)
print(s)

explanation why the edits made under my username hardcore metallica fan were reverted? they weren't vandalisms just closure on some gas after i voted at new york dolls fac and please don't remove the template from the talk page since i'm retired now
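
Note that the pattern `[0-9]` deletes digits one at a time, so separators inside a number survive. Applied directly to the lowercased comment, the trailing IP address would leave a run of dots behind. A hedged alternative, assuming that digit groups joined by "." or "," should be dropped as a unit (the name `remove_number_groups` is hypothetical):

```python
import re

def remove_number_groups(s):
    # Match a run of digits plus any further digit groups joined by '.' or ','
    # (e.g. "89.205.38.27" or "1,000"), and delete the whole match.
    return re.sub(r'\d+(?:[.,]\d+)*', "", s)

text = "i'm retired now.89.205.38.27"
print(re.sub(r'[0-9]', "", text))   # digits only: the dots remain
print(remove_number_groups(text))   # the whole dotted number is removed
```

Whether this is preferable depends on whether the punctuation is removed afterwards anyway, as it is in this notebook.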


In [80]:
#Punctuation removal
def remove_punctuation(s):
    # Delete each listed character; apostrophes are kept so contractions stay intact.
    for ch in '",.?![]{}():;':
        s = s.replace(ch, "")
    return s
s = remove_punctuation(s)
print(s)

explanation why the edits made under my username hardcore metallica fan were reverted they weren't vandalisms just closure on some gas after i voted at new york dolls fac and please don't remove the template from the talk page since i'm retired now
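
`remove_punctuation` only removes the characters it lists explicitly, so symbols such as "=", "*" or "-" pass through. As a sketch of a broader variant, `str.translate` with `string.punctuation` removes all ASCII punctuation in one pass. The function name `remove_punctuation_translate` and the `keep` parameter are assumptions; the apostrophe is kept by default so that contractions such as "don't" still match the stop word list.

```python
import string

def remove_punctuation_translate(s, keep="'"):
    # Build a table that deletes every ASCII punctuation character except those in keep.
    to_delete = "".join(ch for ch in string.punctuation if ch not in keep)
    return s.translate(str.maketrans("", "", to_delete))

print(remove_punctuation_translate("they weren't vandalisms, just closure (really)!"))
```

This is not a drop-in replacement: it removes more characters than the function above, which changes the resulting tokens.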


In [54]:
#Lemmatization only (no stemming). pos='v' treats every word as a verb, so plural nouns such as 'vandalisms' stay unchanged.
words = s.split()
lemmatizer = stem.WordNetLemmatizer()
for i in range(len(words)):
    words[i] = lemmatizer.lemmatize(words[i],pos='v')
print(words)

['explanation', 'why', 'the', 'edit', 'make', 'under', 'my', 'username', 'hardcore', 'metallica', 'fan', 'be', 'reverted?', 'they', "weren't", 'vandalisms', 'just', 'closure', 'on', 'some', 'gas', 'after', 'i', 'vote', 'at', 'new', 'york', 'dolls', 'fac', 'and', 'please', "don't", 'remove', 'the', 'template', 'from', 'the', 'talk', 'page', 'since', "i'm", 'retire', 'now']


In [55]:
#Stop word removal
#Show stopword list
#We can add our own stop words by adding them to this set
stops = set(stopwords.words("english"))
print(stops)

{'ain', 'yourselves', 'your', "couldn't", 'an', 'were', 'i', 'have', "it's", 'won', 'which', "mustn't", 'then', "didn't", "you'll", 'in', 'it', 'is', 'on', 'who', 'further', 'once', 'other', 'couldn', 'itself', 'our', 'those', 'again', 'few', 'any', "she's", 'ours', "won't", 'for', 'into', 'through', 'why', 'hasn', 'own', 'while', 'her', 'mustn', 'didn', 'here', 'should', 'now', "hasn't", 'these', 'above', 'nor', 'my', 'how', "you're", 'isn', 'if', 'his', 'himself', 'by', 've', 'not', 'are', "shouldn't", 'their', 'a', 'both', 'mightn', 'him', 'most', "aren't", "wouldn't", 'myself', 'they', 'to', 'only', "hadn't", "mightn't", "needn't", 'there', 'same', "you'd", 'or', "isn't", 'more', 'herself', 'ourselves', 'you', 'up', 'all', 'just', "should've", 'each', 'been', 'its', 'll', 'doesn', 'has', 'from', 'but', 'over', 'at', 'aren', 'after', 'doing', 'where', "don't", 'and', 'down', 'as', 'themselves', "that'll", 'am', "shan't", 'yours', 'so', 'such', 'too', 'does', 'before', "you've", 'off

In [56]:
#Stop word removal
stops = set(stopwords.words("english"))
meaningful_words = [w for w in words if w not in stops]
print(meaningful_words)

['explanation', 'edit', 'make', 'username', 'hardcore', 'metallica', 'fan', 'reverted?', 'vandalisms', 'closure', 'gas', 'vote', 'new', 'york', 'dolls', 'fac', 'please', 'remove', 'template', 'talk', 'page', 'since', "i'm", 'retire']
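
To add domain-specific stop words, as suggested in the list at the top, the set can be extended with a union. The sketch below uses a tiny stand-in set instead of the NLTK list so that it runs without any downloads; `base_stops` and `own_stops` are hypothetical examples, and which words to add depends on the corpus.

```python
# Stand-in for set(stopwords.words("english")); only a few entries for illustration.
base_stops = {"the", "my", "on", "at", "and", "from"}
# Hypothetical domain-specific stop words for Wikipedia talk pages.
own_stops = {"talk", "page", "username"}

stops = base_stops | own_stops
words = ["remove", "the", "template", "from", "the", "talk", "page"]
meaningful_words = [w for w in words if w not in stops]
print(meaningful_words)
```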


In [69]:
#Case-preserving version: capital and lowercase count as different words.
#Capitalized stop words such as 'They' are not removed, because the stop word list is lowercase.
def comment_to_words_1(comment):
    lemmatizer = stem.WordNetLemmatizer()
    comment = remove_number(comment)
    comment = remove_punctuation(comment)
    # Lemmatize every token as a verb.
    words = [lemmatizer.lemmatize(w, pos='v') for w in comment.split()]
    stops = set(stopwords.words("english"))
    meaningful_words = [w for w in words if w not in stops]
    # Join the words back into one string separated by space and return the result.
    return " ".join(meaningful_words)

#Lowercased version: capital and lowercase count as the same word.
#Only digits and the punctuation listed in remove_punctuation are stripped.
def comment_to_words_2(comment):
    lemmatizer = stem.WordNetLemmatizer()
    comment = remove_number(comment)
    comment = remove_punctuation(comment)
    # Lowercase first, then lemmatize every token as a verb.
    words = [lemmatizer.lemmatize(w, pos='v') for w in comment.lower().split()]
    stops = set(stopwords.words("english"))
    meaningful_words = [w for w in words if w not in stops]
    # Join the words back into one string separated by space and return the result.
    return " ".join(meaningful_words)

In [72]:
comment_to_words_1(df_train["comment_text"][0] )

"Explanation Why edit make username Hardcore Metallica Fan revert They vandalisms closure GAs I vote New York Dolls FAC And please remove template talk page since I'm retire"

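The output above still contains "They", "I" and "And" because the NLTK stop word list is lowercase and the membership test is case-sensitive. A minimal sketch of one way to keep the original casing while still dropping those words, by comparing a lowercased copy of each token (the `stops` set here is a small stand-in, and `drop_stops_keep_case` is a hypothetical helper):

```python
def drop_stops_keep_case(words, stops):
    # Compare in lowercase, but return the tokens with their original casing.
    return [w for w in words if w.lower() not in stops]

stops = {"they", "i", "and", "the"}   # stand-in for the NLTK English list
words = ["They", "vandalisms", "I", "vote", "And", "please"]
print([w for w in words if w not in stops])   # case-sensitive: nothing is removed
print(drop_stops_keep_case(words, stops))
```
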
In [81]:
#tqdm shows a progress bar for the loop. It is optional and can be omitted.
from tqdm import tqdm
clean_comment_train = []
for i in tqdm(range(df_train.shape[0])):
    clean_comment_train.append(comment_to_words_1(df_train["comment_text"][i]))

100%|██████████| 159571/159571 [01:16<00:00, 2095.61it/s]


In [82]:
clean_comment_test = []
for i in tqdm(range(df_test.shape[0])):
    clean_comment_test.append(comment_to_words_1(df_test["comment_text"][i]))

100%|██████████| 153164/153164 [01:13<00:00, 2097.88it/s]


In [83]:
#Save the result as a new CSV file, since preprocessing takes time.
#assign() returns a copy, so df_train and df_test keep the raw text for the second pass below.
df_train.assign(comment_text=clean_comment_train).to_csv( 'clean_train.csv' )
df_test.assign(comment_text=clean_comment_test).to_csv( 'clean_test.csv' )

In [84]:
clean_comment_train = []
for i in tqdm(range(df_train.shape[0])):
    clean_comment_train.append(comment_to_words_2(df_train["comment_text"][i]))

100%|██████████| 159571/159571 [01:16<00:00, 2085.26it/s]


In [85]:
clean_comment_test = []
for i in tqdm(range(df_test.shape[0])):
    clean_comment_test.append(comment_to_words_2(df_test["comment_text"][i]))

100%|██████████| 153164/153164 [01:11<00:00, 2143.80it/s]


In [86]:
#Lowercased version (without capital letters)
df_train["comment_text"] = clean_comment_train
df_train.to_csv( 'clean_train_wo_capital.csv' )
df_test["comment_text"] = clean_comment_test
df_test.to_csv( 'clean_test_wo_capital.csv' )