# Hate_speech-Advanced Text Processing
Special thanks to the VidyaAnalytica tutorials that helped me with this exercise.

# Problem Statement
To preprocess the dataset for later use in supervised and unsupervised learning.

# 1)- Importing key modules

In [1]:
# Let's be rebels and ignore warnings for now
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Visualization 
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [3]:
import nltk
import pandas as pd
import numpy as np
import requests
import pickle

# 2)-Loading Dataset

In [4]:
train = pd.read_pickle('basic_text_pre-process.pkl')
train.head()

Unnamed: 0,id,label,tweet,word_count,char_count,avg_word,stopwords,hastags,numerics,upper
0,1,0,father dysfunctional selfish drag kid dysfunct...,21,102,4.555556,10,1,0,0
1,2,0,thanks lyft credit cant use cause dont offer w...,22,122,5.315789,5,3,0,0
2,3,0,bihday majesty,5,21,5.666667,1,0,0,0
3,4,0,model take urð ðððð ððð,17,86,4.928571,5,1,0,0
4,5,0,factsguide society motivation,8,39,8.0,1,1,0,0


# 3)-Advanced Text Processing
- N-grams
- Term Frequency
- Inverse Document Frequency
- Term Frequency-Inverse Document Frequency (TF-IDF)
- Bag of Words
- Sentiment Analysis
- Word Embedding

### 3.1)-N-gram
An n-gram is a sequence of N consecutive words. N-grams with N=1 are called unigrams; bigrams (N=2), trigrams (N=3), and so on can also be used.

Unigrams usually carry less information than bigrams and trigrams. The basic principle behind n-grams is that they capture language structure, such as which letter or word is likely to follow a given one. The longer the n-gram (the higher the N), the more context you have to work with. The optimum length depends on the application: if your n-grams are too short, you may fail to capture important differences; if they are too long, you may fail to capture the "general knowledge" and only fit particular cases.

In [5]:
train['tweet'][0]

'father dysfunctional selfish drag kid dysfunction run'

In [7]:
from textblob import TextBlob
TextBlob(train['tweet'][0]).ngrams(2)

[WordList(['father', 'dysfunctional']),
 WordList(['dysfunctional', 'selfish']),
 WordList(['selfish', 'drag']),
 WordList(['drag', 'kid']),
 WordList(['kid', 'dysfunction']),
 WordList(['dysfunction', 'run'])]
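
To make the sliding-window idea concrete, here is a minimal sketch of n-gram extraction in plain Python. The helper name `ngrams` and the whitespace tokenization are assumptions made for illustration; TextBlob's own tokenizer may split text differently.

```python
def ngrams(text, n):
    """Return all contiguous n-word sequences in a whitespace-tokenized text."""
    tokens = text.split()
    # Slide a window of width n across the token list.
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

tweet = "father dysfunctional selfish drag kid dysfunction run"
bigrams = ngrams(tweet, 2)
trigrams = ngrams(tweet, 3)

print(bigrams[0])     # ['father', 'dysfunctional']
print(len(bigrams))   # 6
print(len(trigrams))  # 5
```

A tweet with fewer than `n` words simply yields an empty list, which is worth keeping in mind for very short tweets.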

### 3.2)-Term frequency
Term frequency is the ratio of the number of times a word occurs in a sentence to the length of that sentence.

Therefore, we can generalize term frequency as:

TF = (Number of times term T appears in the particular row) / (number of terms in that row)

In [8]:
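# Raw counts for the tweet at index 1 (the counts are not divided by the tweet length here)
tf1 = (train['tweet'][1:2]).apply(lambda x: pd.Series(x.split(" ")).value_counts()).sum(axis = 0).reset_index()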
tf1.columns = ['words','tf']
tf1

Unnamed: 0,words,tf
0,getthanked,1
1,thanks,1
2,use,1
3,credit,1
4,pdx,1
5,wheelchair,1
6,dont,1
7,offer,1
8,lyft,1
9,cause,1
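
The cell above reports raw counts, whereas the formula divides by the number of terms in the row. A minimal sketch of the normalized version follows; the function name `term_frequency` and the example sentence are hypothetical, chosen only to show a repeated word.

```python
from collections import Counter

def term_frequency(text):
    """Map each word to (occurrences of the word) / (number of words in the text)."""
    tokens = text.split()
    if not tokens:
        return {}  # an empty text has no terms
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}

tf = term_frequency("go away go home")
print(tf)  # {'go': 0.5, 'away': 0.25, 'home': 0.25}
```

With this normalization the values in one row always sum to 1, so long and short tweets become comparable.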


### 3.3)-Inverse Document Frequency

The intuition behind inverse document frequency (IDF) is that a word is not of much use to us if it appears in all the documents.

Therefore, the IDF of each word is the log of the ratio of the total number of rows to the number of rows in which that word is present.

IDF = log(N/n), where N is the total number of rows and n is the number of rows in which the word is present.

In [9]:
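# Note: str.contains matches substrings (and treats the word as a regex),
# so 'use' also counts tweets that contain 'cause' or 'because'
for i,word in enumerate(tf1['words']):
  tf1.loc[i, 'idf'] = np.log(train.shape[0]/(len(train[train['tweet'].str.contains(word)])))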

tf1

Unnamed: 0,words,tf,idf
0,getthanked,1,9.679156
1,thanks,1,4.597751
2,use,1,3.574363
3,credit,1,7.327781
4,pdx,1,8.762865
5,wheelchair,1,9.273691
6,dont,1,3.746911
7,offer,1,6.522155
8,lyft,1,8.762865
9,cause,1,5.690172


**The higher the IDF value, the rarer the word is across the tweets.**
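
The IDF formula can be sketched on a toy corpus. This version counts a document only when the word appears as a whole token, which differs from the substring matching of `str.contains` used above. The function name and the three example documents are assumptions made for illustration.

```python
import math

def inverse_document_frequency(word, documents):
    """IDF = log(N / n), counting a document only if it holds the word as a whole token."""
    n_containing = sum(1 for doc in documents if word in doc.split())
    if n_containing == 0:
        raise ValueError(f"{word!r} does not occur in any document")
    return math.log(len(documents) / n_containing)

docs = [
    "thanks lyft credit cant use",
    "cause dont offer wheelchair",
    "dont use cause",
]

idf_use = inverse_document_frequency("use", docs)                 # log(3 / 2)
idf_wheelchair = inverse_document_frequency("wheelchair", docs)   # log(3 / 1)

# Substring matching would also count 'cause', lowering the IDF of 'use'.
substring_matches = sum(1 for doc in docs if "use" in doc)

print(round(idf_use, 3), round(idf_wheelchair, 3), substring_matches)
```

Here the substring count for 'use' is 3 while the whole-token count is 2, which shows how the choice of matching rule can shift IDF values.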

### 3.4)-Term Frequency – Inverse Document Frequency (TF-IDF)
TF-IDF is the product of TF and IDF.

In [10]:
tf1['tfidf'] = tf1['tf'] * tf1['idf']
tf1

Unnamed: 0,words,tf,idf,tfidf
0,getthanked,1,9.679156,9.679156
1,thanks,1,4.597751,4.597751
2,use,1,3.574363,3.574363
3,credit,1,7.327781,7.327781
4,pdx,1,8.762865,8.762865
5,wheelchair,1,9.273691,9.273691
6,dont,1,3.746911,3.746911
7,offer,1,6.522155,6.522155
8,lyft,1,8.762865,8.762865
9,cause,1,5.690172,5.690172


We can see that TF-IDF has penalized words like 'dont' and 'use' because they occur commonly. However, it has given a high weight to 'getthanked' and 'wheelchair', which appear in very few tweets and therefore say more about this particular tweet.

In [11]:
# Let's apply to all
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=1000, lowercase=True, analyzer='word', stop_words= 'english',ngram_range=(1,1))
train_vect = tfidf.fit_transform(train['tweet'])

In [12]:
train_vect

<31962x1000 sparse matrix of type '<class 'numpy.float64'>'
	with 113362 stored elements in Compressed Sparse Row format>

In [13]:
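# Convert to a dense array and store it as a TF-IDF document-term matrix
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0)
tfidf_dtm = pd.DataFrame(train_vect.toarray(), columns=tfidf.get_feature_names_out())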

In [14]:
tfidf_dtm.head()

Unnamed: 0,able,absolutely,account,act,action,actor,actually,adapt,add,adult,...,âï,âïð,ðâ,ðâï,ðð,ððð,ðððð,ððððð,ðððððð,ó¾
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.448983,0.496101,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
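
The values in this matrix differ from the hand-computed TF-IDF because `TfidfVectorizer`, with its default settings, uses a smoothed IDF, ln((1 + N) / (1 + n)) + 1, and then scales each row to unit Euclidean length. The sketch below reproduces that weighting in plain Python on a toy corpus. The function name `tfidf_rows` and the example documents are hypothetical, and whitespace splitting stands in for the vectorizer's own tokenizer and stop-word handling.

```python
import math
from collections import Counter

def tfidf_rows(documents):
    """Smoothed TF-IDF with L2-normalized rows, one dict of weights per document."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word occurs at least once.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    idf = {w: math.log((1 + n_docs) / (1 + d)) + 1 for w, d in df.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {w: c * idf[w] for w, c in counts.items()}
        norm = math.sqrt(sum(v * v for v in weights.values()))
        rows.append({w: v / norm for w, v in weights.items()} if norm else {})
    return rows

rows = tfidf_rows(["go away", "go home", "stay home now"])
for row in rows:
    print({w: round(v, 3) for w, v in row.items()})
```

Because of the row normalization, a document that contains a single vocabulary term gets a weight of exactly 1.0 for it, which is one plausible reason for the 1.0 entries that appear in such matrices.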


In [15]:
# We can always transpose it into a term-document matrix.
tfidf_tdm=tfidf_dtm.transpose()

In [16]:
tfidf_tdm.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,31952,31953,31954,31955,31956,31957,31958,31959,31960,31961
able,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
absolutely,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
account,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
act,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
action,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### 3.5)-Bag of Words
Bag of Words (BoW) is a representation of text that describes the occurrence of words within the text data. The intuition is that two similar text fields will contain similar kinds of words, and will therefore have similar bags of words. Further, from the text alone we can learn something about the meaning of the document.

In [17]:
from sklearn.feature_extraction.text import CountVectorizer
bow = CountVectorizer(max_features=1000, lowercase=True, ngram_range=(1,1),analyzer = "word")
train_bow = bow.fit_transform(train['tweet'])
train_bow

<31962x1000 sparse matrix of type '<class 'numpy.int64'>'
	with 127755 stored elements in Compressed Sparse Row format>
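
As a rough illustration of what the count matrix holds, the sketch below builds a small bag of words by hand. The function name `bag_of_words` and the example documents are assumptions. Keeping the most frequent terms mirrors the idea behind `max_features`, though the vectorizer's tokenization and tie-breaking may differ.

```python
from collections import Counter

def bag_of_words(documents, max_features=None):
    """Return (vocabulary, matrix) where matrix[i][j] counts vocabulary[j] in document i."""
    tokenized = [doc.split() for doc in documents]
    totals = Counter(word for tokens in tokenized for word in tokens)
    # Keep the most frequent terms, then sort them so the column order is stable.
    vocabulary = sorted(word for word, _ in totals.most_common(max_features))
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

vocabulary, matrix = bag_of_words(["go away", "go home go"])
print(vocabulary)  # ['away', 'go', 'home']
print(matrix)      # [[1, 1, 0], [0, 2, 1]]
```

Word order is discarded entirely: "go home go" and "go go home" produce the same row.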

### 3.6)-Sentiment Analysis

In [18]:
#check the sentiment of the first few tweets

train['tweet'][:5].apply(lambda x: TextBlob(x).sentiment)

0    (-0.3, 0.5354166666666667)
1                    (0.2, 0.2)
2                    (0.0, 0.0)
3                    (0.0, 0.0)
4                    (0.0, 0.0)
Name: tweet, dtype: object

It returns a tuple representing the polarity and subjectivity of each tweet. Here, we only extract the polarity, as it indicates the sentiment: values nearer to 1 mean a positive sentiment and values nearer to -1 mean a negative sentiment.

In [19]:
train['sentiment'] = train['tweet'].apply(lambda x: TextBlob(x).sentiment[0] )
train[['tweet','sentiment']].head()

Unnamed: 0,tweet,sentiment
0,father dysfunctional selfish drag kid dysfunct...,-0.3
1,thanks lyft credit cant use cause dont offer w...,0.2
2,bihday majesty,0.0
3,model take urð ðððð ððð,0.0
4,factsguide society motivation,0.0
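
If a categorical feature is more convenient than a raw score, the polarity can be bucketed. The sketch below is one possible approach, not part of TextBlob: the function name `polarity_label` and the threshold of 0.1 are hypothetical choices that would need tuning on real data.

```python
def polarity_label(polarity, threshold=0.1):
    """Bucket a polarity score in [-1, 1] into a coarse sentiment label."""
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"

# Polarity values taken from the first five tweets shown above.
polarities = [-0.3, 0.2, 0.0, 0.0, 0.0]
labels = [polarity_label(p) for p in polarities]
print(labels)  # ['negative', 'positive', 'neutral', 'neutral', 'neutral']
```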


### 3.7)-Word Embeddings
Word embedding is the representation of text in the form of vectors. The underlying idea is that similar words will have vectors that lie close to each other.
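
Closeness between word vectors is commonly measured with cosine similarity. The sketch below uses tiny made-up 3-dimensional vectors purely for illustration; real embeddings have many more dimensions and their values come from a trained model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors, not taken from any real model.
toy_vectors = {
    "happy": [0.9, 0.1, 0.0],
    "glad":  [0.8, 0.2, 0.1],
    "car":   [0.0, 0.1, 0.9],
}

sim_related = cosine_similarity(toy_vectors["happy"], toy_vectors["glad"])
sim_unrelated = cosine_similarity(toy_vectors["happy"], toy_vectors["car"])
print(sim_related > sim_unrelated)  # True
```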

#### word2vec

Word2Vec models require a lot of text, so we can either train one on our training data or use pre-trained word vectors, such as those developed by Google or trained on Wikipedia. The cells below use Stanford's GloVe vectors converted to the word2vec format.

In [None]:
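# Convert the GloVe file to word2vec format so that gensim can load it
# (in gensim 4.x, KeyedVectors.load_word2vec_format(glove_input_file, binary=False, no_header=True) reads the GloVe file directly)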
from gensim.scripts.glove2word2vec import glove2word2vec
glove_input_file = 'glove.6B.100d.txt'
word2vec_output_file = 'glove.6B.100d.txt.word2vec'
glove2word2vec(glove_input_file, word2vec_output_file)

In [None]:
from gensim.models import KeyedVectors # load the Stanford GloVe model
filename = 'glove.6B.100d.txt.word2vec'
model = KeyedVectors.load_word2vec_format(filename, binary=False)

Let’s say our tweet contains the text ‘go away’. We can easily obtain its word vectors.

In [None]:
model['go']

In [None]:
model['away']

We then take the average to represent the string ‘go away’ as a single vector with 100 dimensions.

In [None]:
(model['go'] + model['away'])/2

We have converted the entire string into a vector which can now be used as a feature in any modelling technique.
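
The same averaging can be generalized to a whole tweet. The sketch below assumes a plain dictionary of word vectors and skips words that are missing from it; the function name `average_vector`, the toy 2-dimensional vectors, and the choice to return a zero vector when no word is known are all illustrative assumptions.

```python
def average_vector(text, vectors, dim):
    """Average the vectors of the known words in a text; zeros if none are known."""
    known = [token for token in text.split() if token in vectors]
    if not known:
        return [0.0] * dim
    summed = [0.0] * dim
    for token in known:
        for i, value in enumerate(vectors[token]):
            summed[i] += value
    return [s / len(known) for s in summed]

# Hypothetical toy vectors standing in for 100-dimensional GloVe vectors.
toy = {"go": [1.0, 0.0], "away": [0.0, 1.0]}

print(average_vector("go away", toy, dim=2))        # [0.5, 0.5]
print(average_vector("go away xyzzy", toy, dim=2))  # [0.5, 0.5]
print(average_vector("xyzzy", toy, dim=2))          # [0.0, 0.0]
```

Skipping unknown words matters for tweets, where misspellings and hashtags are often absent from a pre-trained vocabulary.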