# Hate_speech-Text Pre-processing
Special thanks to the VidyaAnalytica tutorials that helped me in this exercise.

# Problem Statement
To pre-process the tweet dataset so that it can later be used for supervised and unsupervised learning.

# 1)- Importing key modules

In [1]:
# Let's be rebels and ignore warnings for now
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Visualization 
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [3]:
import nltk
import pandas as pd
import numpy as np
import requests
import pickle

# 2)-Loading Dataset

In [4]:
train = pd.read_pickle('basic_feature.pkl')
train.head()

| Unnamed: 0 | id | label | tweet | word_count | char_count | avg_word | stopwords | hastags | numerics | upper |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | @user when a father is dysfunctional and is s... | 21 | 102 | 4.555556 | 10 | 1 | 0 | 0 |
| 1 | 2 | 0 | @user @user thanks for #lyft credit i can't us... | 22 | 122 | 5.315789 | 5 | 3 | 0 | 0 |
| 2 | 3 | 0 | bihday your majesty | 5 | 21 | 5.666667 | 1 | 0 | 0 | 0 |
| 3 | 4 | 0 | #model i love u take with u all the time in ... | 17 | 86 | 4.928571 | 5 | 1 | 0 | 0 |
| 4 | 5 | 0 | factsguide: society now #motivation | 8 | 39 | 8.0 | 1 | 1 | 0 | 0 |


# 3)-Basic pre-processing of text data
- Lower casing
- Punctuation removal
- Stopwords removal
- Frequent words removal
- Rare words removal
- Spelling correction
- Tokenization
- Stemming
- Lemmatization

### 3.1)-Lower case
The first pre-processing step is to transform the tweets into lower case. This avoids having multiple copies of the same word: when calculating word counts, for example, ‘Father’ and ‘father’ would otherwise be treated as different words.

In [5]:
train['tweet'] = train['tweet'].apply(lambda x: " ".join(x.lower() for x in x.split()))
train['tweet'].head()

0    @user when a father is dysfunctional and is so...
1    @user @user thanks for #lyft credit i can't us...
2                                  bihday your majesty
3    #model i love u take with u all the time in ur...
4                  factsguide: society now #motivation
Name: tweet, dtype: object
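
To make the motivation concrete, here is a minimal sketch on three made-up tweets (not taken from the dataset) showing how case variants split a word's count until the text is lower-cased. The helper name `word_counts` is only an illustration.

```python
from collections import Counter

# Made-up examples, not rows from the dataset
tweets = ["Father knows best", "my father is home", "FATHER and son"]

def word_counts(texts):
    """Count whitespace-separated words across an iterable of strings."""
    return Counter(word for text in texts for word in text.split())

raw = word_counts(tweets)
lowered = word_counts(t.lower() for t in tweets)

print(raw["Father"], raw["father"], raw["FATHER"])  # 1 1 1
print(lowered["father"])                            # 3
```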

### 3.2)-Removing Punctuation
Punctuation carries little information for this task, so removing it helps reduce the size of the text data.

In [6]:
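# regex=True is needed on recent pandas versions, where str.replace treats the pattern literally by default
train['tweet'] = train['tweet'].str.replace(r'[^\w\s]', '', regex=True)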
train['tweet'].head()

0    user when a father is dysfunctional and is so ...
1    user user thanks for lyft credit i cant use ca...
2                                  bihday your majesty
3    model i love u take with u all the time in urð...
4                    factsguide society now motivation
Name: tweet, dtype: object
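
The pattern `[^\w\s]` removes every character that is neither a word character nor whitespace. In Python 3, `\w` is Unicode-aware by default, which would explain why characters such as `ð` survive in the output above. The sketch below, on a made-up string, compares the default behaviour with the `re.ASCII` flag; whether dropping non-ASCII characters is desirable for this dataset is a judgment call, not something the notebook does.

```python
import re

# Made-up sample; "\u00f0" is the character shown as "ð" in the outputs above
sample = "@user can't wait for #lyft: ok_go ur\u00f0!"

# Default (Unicode) mode: \w matches letters of any script, digits and "_"
unicode_clean = re.sub(r"[^\w\s]", "", sample)

# re.ASCII restricts \w to [a-zA-Z0-9_], so non-ASCII characters are removed too
ascii_clean = re.sub(r"[^\w\s]", "", sample, flags=re.ASCII)

print(unicode_clean)  # user cant wait for lyft ok_go urð
print(ascii_clean)    # user cant wait for lyft ok_go ur
```

Note that the underscore is kept in both cases, because `_` counts as a word character.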

In [7]:
train['tweet'][1]

'user user thanks for lyft credit i cant use cause they dont offer wheelchair vans in pdx disapointed getthanked'

In [8]:
train['tweet'][0]

'user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction run'

In [9]:
#train['tweet'] = train['tweet'].str.replace('[^a-zA-Z]','')
#train['tweet'].head()

In [10]:
import string
import re

def clean_more(text):
    '''Remove text in square brackets, punctuation, and any remaining words that contain digits.'''
    # Raw strings keep the regex backslashes from being read as string escapes
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

clean_pun_2 = lambda x: clean_more(x)

In [11]:
train['tweet'] = train.tweet.apply(clean_pun_2)
train['tweet'].head()

0    user when a father is dysfunctional and is so ...
1    user user thanks for lyft credit i cant use ca...
2                                  bihday your majesty
3    model i love u take with u all the time in urð...
4                    factsguide society now motivation
Name: tweet, dtype: object
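
As a worked example of what `clean_more` does, the sketch below repeats the function on a made-up string that still contains brackets and digits. The final whitespace normalisation is an extra step added here for readability; in the notebook, the later `split()`/`join` steps have the same effect.

```python
import re
import string

def clean_more(text):
    """Same three substitutions as the cell above, repeated to keep this example self-contained."""
    text = re.sub(r"\[.*?\]", "", text)                              # drop [bracketed] text
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)  # drop ASCII punctuation
    text = re.sub(r"\w*\d\w*", "", text)                             # drop words containing a digit
    return text

# Made-up sample, not a row from the dataset
sample = "thanks [link] for the 2nd ride, gr8 service!"

cleaned = clean_more(sample)
normalized = " ".join(cleaned.split())  # collapse the gaps left by removed words

print(repr(cleaned))
print(normalized)  # thanks for the ride service
```

Because the earlier punctuation step has already removed `[` and `]`, the bracket pattern can only match when the function is applied to raw text such as the sample here.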

### 3.3)-Removal of Stop Words

In [12]:
from nltk.corpus import stopwords
stop = stopwords.words('english')

train['tweet'] = train['tweet'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
train['tweet'].head()

0    user father dysfunctional selfish drags kids d...
1    user user thanks lyft credit cant use cause do...
2                                       bihday majesty
3                model love u take u time urð ðððð ððð
4                        factsguide society motivation
Name: tweet, dtype: object

In [13]:
train['tweet'][0]

'user father dysfunctional selfish drags kids dysfunction run'

In [14]:
train['tweet'][1]

'user user thanks lyft credit cant use cause dont offer wheelchair vans pdx disapointed getthanked'
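
The filtering step can be sketched without NLTK as follows. `STOP_WORDS` is a hypothetical mini list used only for illustration (the notebook uses NLTK's English list), and it is stored as a set so that each membership test does not have to scan a list.

```python
# Hypothetical mini stop list; the notebook itself uses NLTK's English stop words
STOP_WORDS = frozenset({"a", "and", "is", "so", "he", "his", "into", "when"})

def remove_stopwords(text, stop_words=STOP_WORDS):
    """Keep only the whitespace-separated words that are not in stop_words."""
    return " ".join(word for word in text.split() if word not in stop_words)

tweet = ("user when a father is dysfunctional and is so selfish "
         "he drags his kids into his dysfunction run")

print(remove_stopwords(tweet))
# user father dysfunctional selfish drags kids dysfunction run
```

In the notebook output above, `cant` and `dont` survive this step, presumably because the apostrophes were stripped before the stop list was applied.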

### 3.4)-Common word removal

In [15]:
# let’s check the 10 most frequently occurring words in our text data
freq = pd.Series(' '.join(train['tweet']).split()).value_counts()[:10]
freq

user     17473
love      2647
ð         2522
day       2199
â         1797
happy     1663
amp       1582
im        1139
u         1136
time      1110
dtype: int64

Let's remove these words, as their presence will not be of any use in classifying our text data.

In [16]:
freq = list(freq.index)
train['tweet'] = train['tweet'].apply(lambda x: " ".join(x for x in x.split() if x not in freq))
train['tweet'].head()

0    father dysfunctional selfish drags kids dysfun...
1    thanks lyft credit cant use cause dont offer w...
2                                       bihday majesty
3                              model take urð ðððð ððð
4                        factsguide society motivation
Name: tweet, dtype: object

In [17]:
freq = pd.Series(' '.join(train['tweet']).split()).value_counts()[:10]
freq

life        1086
like        1042
today        991
new          983
positive     928
thankful     919
get          917
people       852
good         840
bihday       825
dtype: int64

We can clearly see that we have got a better list of most frequent words now than before.

In [18]:
train['tweet'][0]

'father dysfunctional selfish drags kids dysfunction run'

In [19]:
train['tweet'][1]

'thanks lyft credit cant use cause dont offer wheelchair vans pdx disapointed getthanked'
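
The same idea can be sketched with the standard library: count every word, take the top `n`, and filter them out. The tweets below are made up, and the cut-off (10 in the notebook, 2 here) is a tunable choice.

```python
from collections import Counter

# Made-up tweets, not rows from the dataset
tweets = [
    "user love this day",
    "user user love happy day",
    "user love sunny walk",
]

# Corpus-wide word frequencies
counts = Counter(word for tweet in tweets for word in tweet.split())

# The two most frequent words: "user" (4 occurrences) and "love" (3)
top_words = {word for word, _ in counts.most_common(2)}

filtered = [
    " ".join(w for w in tweet.split() if w not in top_words)
    for tweet in tweets
]
print(filtered)  # ['this day', 'happy day', 'sunny walk']
```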

### 3.5)-Rare words removal
Let’s remove rarely occurring words from the text. Because they are so rare, the association between them and other words is dominated by noise.

In [20]:
freq = pd.Series(' '.join(train['tweet']).split()).value_counts()[-10:]
freq

newchallenge          1
itchyeyes             1
outnproud             1
defund                1
flemington            1
winch                 1
nomoresnapchat        1
blurred               1
treasonâ              1
horizoninnovations    1
dtype: int64

In [21]:
freq = list(freq.index)
train['tweet'] = train['tweet'].apply(lambda x: " ".join(x for x in x.split() if x not in freq))
train['tweet'].head()

0    father dysfunctional selfish drags kids dysfun...
1    thanks lyft credit cant use cause dont offer w...
2                                       bihday majesty
3                              model take urð ðððð ððð
4                        factsguide society motivation
Name: tweet, dtype: object

In [22]:
freq = pd.Series(' '.join(train['tweet']).split()).value_counts()[-10:]
freq

wii                          1
toddlerlife                  1
comparison                   1
hayleyrosescott              1
knight                       1
myphotography                1
sanogo                       1
whenrealtorscompeteyouwin    1
stirling                     1
nofillers                    1
dtype: int64

**There are still words that are rare. Of course, we can have another round to remove them. For simplicity, let's keep it to one round.**
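
Since many words occur exactly once, taking the last ten entries of the frequency table removes only an arbitrary handful of them. One alternative, shown here as a sketch and not as the approach used in this notebook, is to drop every word below a count threshold in a single pass. The function name and the `min_count` parameter are hypothetical.

```python
from collections import Counter

def remove_rare_words(texts, min_count=2):
    """Drop every word that occurs fewer than min_count times across all texts."""
    counts = Counter(word for text in texts for word in text.split())
    keep = {word for word, n in counts.items() if n >= min_count}
    return [" ".join(w for w in text.split() if w in keep) for text in texts]

# Made-up tweets; "outnproud" and "itchyeyes" each occur once
tweets = [
    "father kids run",
    "kids run outnproud",
    "father run itchyeyes",
]

print(remove_rare_words(tweets))
# ['father kids run', 'kids run', 'father run']
```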

### 3.6)- Spelling correction

In [23]:
#use the textblob library
from textblob import TextBlob
train['tweet'][:5].apply(lambda x: str(TextBlob(x).correct()))

0    father dysfunctional selfish drags kiss dysfun...
1    thanks left credit can use cause dont offer wh...
2                                       midday majesty
3                               model take or ðððð ððð
4                        factsguide society motivation
Name: tweet, dtype: object

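Here `model take urð ðððð ððð` has been converted to `model take or ðððð ððð`, which is still not correct, and several valid words were changed for the worse (`kids` became `kiss`, `lyft` became `left`, `bihday` became `midday`). Tweets are mostly written in slang, so more such mistakes are likely, and spelling correction may not be a very helpful technique to apply here.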

### 3.7)-Tokenization
Tokenization refers to dividing the text into a sequence of words or sentences.

In [24]:
TextBlob(train['tweet'][1]).words

WordList(['thanks', 'lyft', 'credit', 'cant', 'use', 'cause', 'dont', 'offer', 'wheelchair', 'vans', 'pdx', 'disapointed', 'getthanked'])
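
For intuition, word and sentence tokenization can be roughly approximated with regular expressions. This is only a sketch on a made-up string; TextBlob's tokenizers apply more careful rules than these two patterns.

```python
import re

# Made-up text that still has case and punctuation
text = "Thanks for the credit. I can't use it! Disappointed?"

# Word tokens: runs of letters, keeping apostrophes inside words
words = re.findall(r"[A-Za-z']+", text)

# Sentence tokens: split on whitespace that follows ., ! or ?
sentences = re.split(r"(?<=[.!?])\s+", text)

print(words)
# ['Thanks', 'for', 'the', 'credit', 'I', "can't", 'use', 'it', 'Disappointed']
print(sentences)
# ['Thanks for the credit.', "I can't use it!", 'Disappointed?']
```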

### 3.8)- Stemming
Stemming refers to the removal of suffixes, like “ing”, “ly”, “s”, etc.

In [25]:
from nltk.stem import PorterStemmer
st = PorterStemmer()
train['tweet'][:5].apply(lambda x: " ".join([st.stem(word) for word in x.split()]))

0        father dysfunct selfish drag kid dysfunct run
1    thank lyft credit cant use caus dont offer whe...
2                                       bihday majesti
3                              model take urð ðððð ððð
4                              factsguid societi motiv
Name: tweet, dtype: object

In [26]:
train['tweet'][0]

'father dysfunctional selfish drags kids dysfunction run'
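
To illustrate the idea of suffix removal, here is a deliberately naive sketch. It is not the Porter algorithm: `PorterStemmer` applies ordered rule phases with extra conditions, which is how it produces stems such as `dysfunct` and `majesti` in the output above. The names `SUFFIXES`, `naive_stem` and `min_stem` are made up for this example.

```python
SUFFIXES = ("ing", "ly", "s")  # checked in this order

def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, provided at least min_stem characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["drags", "kids", "quickly", "running", "this"]:
    print(w, "->", naive_stem(w))
# drags -> drag
# kids -> kid
# quickly -> quick
# running -> runn
# this -> thi
```

The last two results show why real stemmers need more than a suffix list: `running` keeps its doubled consonant, and `this` loses a letter that was never a suffix.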

### 3.9)-Lemmatization

Lemmatization is often a more effective option than stemming because it converts the word into its root word, rather than just stripping the suffixes. It makes use of a vocabulary and performs a morphological analysis to obtain the root word. Therefore, we usually prefer lemmatization over stemming.

In [27]:
from textblob import Word
train['tweet'] = train['tweet'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
train['tweet'].head()

0    father dysfunctional selfish drag kid dysfunct...
1    thanks lyft credit cant use cause dont offer w...
2                                       bihday majesty
3                              model take urð ðððð ððð
4                        factsguide society motivation
Name: tweet, dtype: object

In [28]:
train['tweet'][0]

'father dysfunctional selfish drag kid dysfunction run'

In [29]:
train['tweet'][1]

'thanks lyft credit cant use cause dont offer wheelchair van pdx disapointed getthanked'
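
The vocabulary-based idea behind lemmatization can be sketched with a lookup table. `LEMMAS` is a hypothetical hand-written dictionary standing in for a real lexical database (TextBlob's `Word.lemmatize()` relies on WordNet), so this is an illustration of the principle and not a usable lemmatizer.

```python
# Hypothetical mini vocabulary mapping inflected forms to their dictionary form
LEMMAS = {
    "kids": "kid",
    "vans": "van",
    "drags": "drag",
    "mice": "mouse",
    "feet": "foot",
}

def toy_lemmatize(word):
    """Return the dictionary form if the word is known, else the word unchanged."""
    return LEMMAS.get(word, word)

tweet = "father drags kids mice feet getthanked"
print(" ".join(toy_lemmatize(w) for w in tweet.split()))
# father drag kid mouse foot getthanked
```

Irregular forms such as `mice` cannot be handled by stripping a suffix, while unknown tokens pass through unchanged, just as `getthanked` does in the output above.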

In [30]:
train.to_pickle('basic_text_pre-process.pkl')