## Advanced Text Processing

    N-grams
    Term Frequency
    Inverse Document Frequency
    Term Frequency-Inverse Document Frequency (TF-IDF)
    Bag of Words
    Sentiment Analysis
    Word Embedding

In [None]:
import pandas as pd
train = pd.read_csv('D:\\Datasets\\train_E6oV3lV.csv')

In [None]:
train.head()

## N-grams

N-grams are sequences of N consecutive words taken from a text.\
N-grams with N=1 are called unigrams, those with N=2 bigrams, those with N=3 trigrams, and so on.

In [None]:
from textblob import TextBlob

In [None]:
TextBlob(train['tweet'][0]).ngrams(2)
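
As a minimal sketch of the definition above, n-grams can also be built by hand with a sliding window over a token list. The helper name `ngrams` and the sample sentence are illustrative assumptions; TextBlob applies its own tokenizer, so its output may differ from a plain whitespace split.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # Slide a window of width n over the token list; the result is
    # empty when the list is shorter than n.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical sample sentence, split on whitespace only.
tokens = "the quick brown fox".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```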

## Term frequency

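Term frequency is the ratio of the number of times a word occurs in a sentence to the length of that sentence.\
The cell below computes the raw counts for a single tweet, without dividing by the tweet's length.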

In [None]:
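tf1 = (train['tweet'][1:2]).apply(lambda x: pd.Series(x.split(" ")).value_counts()).sum(axis=0).reset_index()
tf1.columns = ['words', 'tf']  # name the columns so that later cells can use tf1['words'] and tf1['tf']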

In [None]:
tf1
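
The ratio form of term frequency described at the start of this section can be sketched in plain Python. The function name `term_frequency`, the lowercasing step, and the sample sentence are assumptions made for illustration, not part of the notebook.

```python
from collections import Counter

def term_frequency(sentence):
    """Map each word to (occurrences of the word) / (number of words)."""
    tokens = sentence.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    # An empty sentence yields an empty dict, so no division by zero occurs.
    return {word: count / total for word, count in counts.items()}

# Hypothetical sample sentence: "the" occurs 2 times out of 6 words.
tf = term_frequency("the cat sat on the mat")
print(tf)
```

Because every count is divided by the same sentence length, the values for one sentence sum to 1.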

## Inverse Document Frequency

The intuition behind IDF is that a word is of little use to us if it appears in every document.\
Therefore, the IDF of a word is the log of the ratio of the total number of rows to the number of rows in which that word is present.

In [None]:
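import numpy as np
for i, word in enumerate(tf1['words']):
    # regex=False matches the token literally; tweets contain characters that are regex metacharacters
    tf1.loc[i, 'idf'] = np.log(train.shape[0] / len(train[train['tweet'].str.contains(word, regex=False)]))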
    
tf1
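
The IDF definition can be checked on a toy corpus with a small pure-Python sketch. The function name `inverse_document_frequency` and the three sample documents are assumptions for illustration. Unlike the pandas cell, which counts rows containing the word as a substring, this sketch counts documents containing the word as a whole token.

```python
import math

def inverse_document_frequency(documents):
    """Map each word to log(total documents / documents containing the word)."""
    n_docs = len(documents)
    # One set of tokens per document, so a word counts once per document.
    tokenized = [set(doc.lower().split()) for doc in documents]
    vocabulary = set().union(*tokenized)
    return {
        word: math.log(n_docs / sum(word in doc for doc in tokenized))
        for word in vocabulary
    }

# Hypothetical toy corpus: "the" appears in all 3 documents, "cat" in 2, "dog" in 1.
docs = ["the cat sat", "the dog barked", "the cat purred"]
idf = inverse_document_frequency(docs)
print(idf)
```

A word present in every document gets an IDF of log(1) = 0, which matches the intuition that such a word carries little information.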

## Term Frequency - Inverse Document Frequency (TF-IDF)

In [None]:
tf1['tfidf'] = tf1['tf'] * tf1['idf']
tf1

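scikit-learn's `TfidfVectorizer` computes TF-IDF for the whole corpus directly. Its values differ from the manual ones above because, by default, it uses a smoothed IDF and L2-normalizes each row.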

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_features=1000, lowercase=True, analyzer='word', stop_words='english', ngram_range=(1, 1))
train_vect = tfidf.fit_transform(train['tweet'])
train_vect

## Bag of Words

Bag of Words (BoW) is a representation of text that describes the presence (or count) of words within the text data, ignoring their order.\
The intuition is that two similar text fields will contain similar kinds of words and will therefore have similar bags of words.\
Furthermore, the words alone tell us something about the meaning of the document.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
bow = CountVectorizer(max_features=1000, lowercase=True, ngram_range=(1,1),analyzer='word')
train_bow = bow.fit_transform(train['tweet'])
train_bow
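
To make the representation concrete, here is a minimal pure-Python sketch that builds a document-term count matrix for a toy corpus. The helper `bag_of_words` and the two sample documents are illustrative assumptions; `CountVectorizer` uses its own tokenizer and returns a sparse matrix, so it is not reproduced exactly here.

```python
from collections import Counter

def bag_of_words(documents):
    """Return (sorted vocabulary, one row of word counts per document)."""
    tokenized = [doc.lower().split() for doc in documents]
    # The vocabulary fixes the column order of the matrix.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Counter returns 0 for words that are absent from the document.
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

# Hypothetical toy corpus.
docs = ["good movie good plot", "bad movie"]
vocabulary, matrix = bag_of_words(docs)
print(vocabulary)
print(matrix)
```

Each row has one entry per vocabulary word, so documents that share many words end up with similar rows.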

## Sentiment Analysis

In [None]:
train['tweet'][:5].apply(lambda x: TextBlob(x).sentiment)

It returns a tuple holding the polarity and subjectivity of each tweet.\
Here, we extract only the polarity, which indicates the sentiment: values near 1 are positive and values near -1 are negative.

In [None]:
train['sentiment'] = train['tweet'].apply(lambda x: TextBlob(x).sentiment[0])
train[['tweet','sentiment']].head()
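
If discrete labels are more convenient than a raw score, the polarity can be bucketed with a cutoff. This is only a sketch: the function `label_polarity` and the threshold of 0.1 are hypothetical choices, not something TextBlob prescribes.

```python
def label_polarity(polarity, threshold=0.1):
    """Turn a polarity score in [-1, 1] into a coarse sentiment label."""
    # The threshold is an assumed cutoff; scores close to 0 count as neutral.
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"

# Made-up polarity scores for illustration.
labels = [label_polarity(p) for p in (0.8, -0.5, 0.0)]
print(labels)
```

With the `sentiment` column created above, the same function could be applied as `train['sentiment'].apply(label_polarity)`.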

## Word Embeddings

Word embedding is the representation of text in the form of vectors.\
The underlying idea is that similar words will have a small distance between their vectors.


In [None]:
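# Convert the GloVe text file to word2vec format so that gensim can load it.
from gensim.scripts.glove2word2vec import glove2word2vec
glove_input_file = 'glove.6B.100d.txt'  # pre-trained GloVe vectors, downloaded separately
word2vec_output_file = 'glove.6B.100d.txt.word2vec'
glove2word2vec(glove_input_file, word2vec_output_file)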
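
To illustrate the distance idea from the start of this section, the sketch below compares toy vectors using cosine similarity, a common closeness measure for embeddings. The 3-dimensional vectors are made up for illustration and are not real GloVe values; the function name `cosine_similarity` is likewise an assumption.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy embeddings (assumption): real GloVe vectors have 100 dimensions here.
vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.75, 0.2],
    "apple": [0.1, 0.2, 0.9],
}
sim_royal = cosine_similarity(vectors["king"], vectors["queen"])
sim_fruit = cosine_similarity(vectors["king"], vectors["apple"])
print(sim_royal, sim_fruit)
```

In this toy setup the related pair scores higher than the unrelated pair. With the converted file, the real vectors could be loaded through gensim's `KeyedVectors.load_word2vec_format(word2vec_output_file, binary=False)` and compared in the same way.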