# Text Representation with Feature Engineering

### Exploring Traditional Statistical Models

Feature Engineering is often known as the secret sauce to creating superior and better performing machine learning models. Just one excellent feature could be your ticket to winning a Kaggle challenge! The importance of feature engineering is even more important for unstructured, textual data because we need to convert free flowing text into some numeric representations which can then be understood by machine learning algorithms. 

Here we will explore the following feature engineering techniques:

- Bag of Words Model (TF)
- Bag of N-grams Model
- TF-IDF Model
- Similarity Features

# Prepare a Sample Corpus

Let’s now take a sample corpus of documents on which we will run most of our analyses in this article. A corpus is typically a collection of text documents usually belonging to one or more subjects or domains.

In [1]:
!pip install contractions
!pip install textsearch
!pip install tqdm
import nltk
nltk.download('punkt')

Collecting contractions
  Downloading contractions-0.0.52-py2.py3-none-any.whl (7.2 kB)
Collecting textsearch>=0.0.21
  Downloading textsearch-0.0.21-py2.py3-none-any.whl (7.5 kB)
Collecting pyahocorasick
  Downloading pyahocorasick-1.4.2.tar.gz (321 kB)
[K     |████████████████████████████████| 321 kB 8.2 MB/s 
[?25hCollecting anyascii
  Downloading anyascii-0.2.0-py3-none-any.whl (283 kB)
[K     |████████████████████████████████| 283 kB 35.2 MB/s 
[?25hBuilding wheels for collected packages: pyahocorasick
  Building wheel for pyahocorasick (setup.py) ... [?25l[?25hdone
  Created wheel for pyahocorasick: filename=pyahocorasick-1.4.2-cp37-cp37m-linux_x86_64.whl size=85447 sha256=21e927968555e4d330cd68fc459633ade14c51be96b217b740f26e2cc17c0633
  Stored in directory: /root/.cache/pip/wheels/25/19/a6/8f363d9939162782bb8439d886469756271abc01f76fbd790f
Successfully built pyahocorasick
Installing collected packages: pyahocorasick, anyascii, textsearch, contractions
Successfully install

True

In [2]:
import pandas as pd
import numpy as np

pd.options.display.max_colwidth = 200

corpus = ['The sky is blue and beautiful sky sky',
          'Love this blue and beautiful sky blue blue!',
          'The quick brown fox jumps over the lazy dog.',
          "A king's breakfast has sausages, ham, bacon, eggs, toast and beans",
          'I love green eggs, ham, sausages and bacon!',
          'The brown fox is quick and the blue dog is lazy!',
          'The sky is very blue and the sky is very beautiful today',
          'The dog is lazy but the brown fox is quick!'    
]
labels = ['weather', 'weather', 'animals', 'food', 'food', 'animals', 'weather', 'animals']

corpus = np.array(corpus)
corpus_df = pd.DataFrame({'Document': corpus, 
                          'Category': labels})
corpus_df = corpus_df[['Document', 'Category']]
corpus_df

Unnamed: 0,Document,Category
0,The sky is blue and beautiful sky sky,weather
1,Love this blue and beautiful sky blue blue!,weather
2,The quick brown fox jumps over the lazy dog.,animals
3,"A king's breakfast has sausages, ham, bacon, eggs, toast and beans",food
4,"I love green eggs, ham, sausages and bacon!",food
5,The brown fox is quick and the blue dog is lazy!,animals
6,The sky is very blue and the sky is very beautiful today,weather
7,The dog is lazy but the brown fox is quick!,animals


In [3]:
import nltk 
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [4]:
import nltk 
import re

stop_words = nltk.corpus.stopwords.words('english')

def normalize_document(text):
  text = re.sub(r'[^a-zA-Z\s]','',text,re.I)
  text = text.lower()
  text = text.strip()
  tokens = nltk.word_tokenize(text)
  filtered_tokens = [token for token in tokens if token not in stop_words]
  return  " ".join(filtered_tokens)

In [5]:
corpus

array(['The sky is blue and beautiful sky sky',
       'Love this blue and beautiful sky blue blue!',
       'The quick brown fox jumps over the lazy dog.',
       "A king's breakfast has sausages, ham, bacon, eggs, toast and beans",
       'I love green eggs, ham, sausages and bacon!',
       'The brown fox is quick and the blue dog is lazy!',
       'The sky is very blue and the sky is very beautiful today',
       'The dog is lazy but the brown fox is quick!'], dtype='<U66')

In [6]:
normalize_corpus = np.vectorize(normalize_document)

norm_corpus = normalize_corpus(corpus)
norm_corpus

array(['sky blue beautiful sky sky', 'love blue beautiful sky blue blue',
       'quick brown fox jumps lazy dog',
       'kings breakfast sausages ham , bacon , eggs , toast beans',
       'love green eggs ham sausages bacon !',
       'brown fox quick blue dog lazy', 'sky blue sky beautiful today',
       'dog lazy brown fox quick'], dtype='<U57')

## Count Vectorizer from SKlean

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

In [8]:
# max_df=1.0, min_df=1
#Max = remove data values that appear frequently
#max_df = 0.60 - Ignore all the words that appear in more than 60% rows/Docs
#max_df = 50 -  Ignore all the words that appear in more than 50 rows/Docs

#min_df = remove data values that appear too infrequently
#min_df = 0.06 - Ignore all the words that appear in less than 6% rows/Docs
#min_df = 50 -  Ignore all the words that appear in less than 50 rows/Docs
cv = CountVectorizer(min_df = 0., max_df = 1.)
cv_matrix = cv.fit_transform(norm_corpus)
cv_matrix = cv_matrix.toarray()
cv_matrix

array([[0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0],
       [0, 0, 1, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0],
       [1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0],
       [1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0],
       [0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0],
       [0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 1],
       [0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0]])

In [9]:
vocab = cv.get_feature_names()
pd.DataFrame(cv_matrix,columns = vocab)

Unnamed: 0,bacon,beans,beautiful,blue,breakfast,brown,dog,eggs,fox,green,ham,jumps,kings,lazy,love,quick,sausages,sky,toast,today
0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0
1,0,0,1,3,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0
2,0,0,0,0,0,1,1,0,1,0,0,1,0,1,0,1,0,0,0,0
3,1,1,0,0,1,0,0,1,0,0,1,0,1,0,0,0,1,0,1,0
4,1,0,0,0,0,0,0,1,0,1,1,0,0,0,1,0,1,0,0,0
5,0,0,0,1,0,1,1,0,1,0,0,0,0,1,0,1,0,0,0,0
6,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,1
7,0,0,0,0,0,1,1,0,1,0,0,0,0,1,0,1,0,0,0,0


# One hot Vector

In [10]:
#to generate one hot vector out of this 
# max_df=1.0, min_df=1
#Max = remove data values that appear frequently
#max_df = 0.60 - Ignore all the words that appear in more than 60% rows/Docs
#max_df = 50 -  Ignore all the words that appear in more than 50 rows/Docs

#min_df = remove data values that appear too infrequently
#min_df = 0.06 - Ignore all the words that appear in less than 6% rows/Docs
#min_df = 50 -  Ignore all the words that appear in less than 50 rows/Docs
cv = CountVectorizer(min_df = 0., max_df = 1.,binary = True)
cv_matrix = cv.fit_transform(norm_corpus)
cv_matrix = cv_matrix.toarray()
cv_matrix

array([[0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0],
       [1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0],
       [1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0],
       [0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0],
       [0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1],
       [0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0]])

In [11]:
#Unigram - One 
vocab = cv.get_feature_names()
pd.DataFrame(cv_matrix,columns = vocab)

Unnamed: 0,bacon,beans,beautiful,blue,breakfast,brown,dog,eggs,fox,green,ham,jumps,kings,lazy,love,quick,sausages,sky,toast,today
0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0
2,0,0,0,0,0,1,1,0,1,0,0,1,0,1,0,1,0,0,0,0
3,1,1,0,0,1,0,0,1,0,0,1,0,1,0,0,0,1,0,1,0
4,1,0,0,0,0,0,0,1,0,1,1,0,0,0,1,0,1,0,0,0
5,0,0,0,1,0,1,1,0,1,0,0,0,0,1,0,1,0,0,0,0
6,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1
7,0,0,0,0,0,1,1,0,1,0,0,0,0,1,0,1,0,0,0,0


## Bag of N-Gram words

In [12]:
#Bigrams - (Bacon,beans) (beans,beautiful) (beau..,blue)
bv = CountVectorizer(ngram_range=(1,2))
bv_matrix = bv.fit_transform(norm_corpus)

bv_matrix = bv_matrix.toarray()
vocab = bv.get_feature_names()
pd.DataFrame(bv_matrix,columns = vocab)

Unnamed: 0,bacon,bacon eggs,beans,beautiful,beautiful sky,beautiful today,blue,blue beautiful,blue blue,blue dog,blue sky,breakfast,breakfast sausages,brown,brown fox,dog,dog lazy,eggs,eggs ham,eggs toast,fox,fox jumps,fox quick,green,green eggs,ham,ham bacon,ham sausages,jumps,jumps lazy,kings,kings breakfast,lazy,lazy brown,lazy dog,love,love blue,love green,quick,quick blue,quick brown,sausages,sausages bacon,sausages ham,sky,sky beautiful,sky blue,sky sky,toast,toast beans,today
0,0,0,0,1,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,1,1,0,0,0
1,0,0,0,1,1,0,3,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,1,0,1,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,1,1,0,0,0,0,0,0,1,1,0,0,1,0,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0
3,1,1,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,1,0,1,0,0,0,0,0,1,1,0,0,0,1,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1,1,0
4,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,1,1,1,0,1,0,0,0,0,0,0,0,1,0,1,0,0,0,1,1,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,1,0,0,1,0,0,0,1,1,1,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0
6,0,0,0,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,1,1,0,0,0,1
7,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0


## **TF-IDF**
Tf - Term Frequency 
IDF - Inverse Document Frequency


In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer

tv = TfidfVectorizer(min_df= 0.,max_df = 1.,use_idf = True)
tv_matrix = tv.fit_transform(norm_corpus)

tv_matrix = tv_matrix.toarray()
vocab = tv.get_feature_names()
pd.DataFrame(tv_matrix,columns = vocab)

Unnamed: 0,bacon,beans,beautiful,blue,breakfast,brown,dog,eggs,fox,green,ham,jumps,kings,lazy,love,quick,sausages,sky,toast,today
0,0.0,0.0,0.304732,0.267182,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.914195,0.0,0.0
1,0.0,0.0,0.31217,0.821114,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.361761,0.0,0.0,0.31217,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.380362,0.380362,0.0,0.380362,0.0,0.0,0.525949,0.0,0.380362,0.0,0.380362,0.0,0.0,0.0,0.0
3,0.321164,0.383215,0.0,0.0,0.383215,0.0,0.0,0.321164,0.0,0.0,0.321164,0.0,0.383215,0.0,0.0,0.0,0.321164,0.0,0.383215,0.0
4,0.394554,0.0,0.0,0.0,0.0,0.0,0.0,0.394554,0.0,0.470784,0.394554,0.0,0.0,0.0,0.394554,0.0,0.394554,0.0,0.0,0.0
5,0.0,0.0,0.0,0.365048,0.0,0.416351,0.416351,0.0,0.416351,0.0,0.0,0.0,0.0,0.416351,0.0,0.416351,0.0,0.0,0.0,0.0
6,0.0,0.0,0.360826,0.316365,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.721652,0.0,0.498935
7,0.0,0.0,0.0,0.0,0.0,0.447214,0.447214,0.0,0.447214,0.0,0.0,0.0,0.0,0.447214,0.0,0.447214,0.0,0.0,0.0,0.0


In [14]:
!pip install contractions 
!pip install textsearch 
!pip install tqdm
import nltk



In [15]:
dataset = pd.read_csv(r'https://github.com/dipanjanS/nlp_workshop_dhs18/raw/master/Unit%2011%20-%20Sentiment%20Analysis%20-%20Unsupervised%20Learning/movie_reviews.csv.bz2', compression='bz2')
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


In [16]:
dataset.head()

Unnamed: 0,review,sentiment
0,"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me...",positive
1,"A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire p...",positive
2,"I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. The plot is simplistic, but the dialogue i...",positive
3,Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenl...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is a visually stunning film to watch. Mr. Mattei offers us a vivid portrait about human relations. This is a movie that seems to be telling us what mone...",positive


In [17]:
#lexicon based sentiment analysis models - Use the model to classify
dataset['review'][0]

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fa

In [18]:
dataset.shape

(50000, 2)

In [19]:
#Split it into train and test
reviews = dataset['review'].values
sentiments = dataset['sentiment'].values


train_reviews = reviews[:35000]
test_reviews = reviews[35000:]

train_sentiments = sentiments[:35000]
test_sentiments = sentiments[35000:]

In [20]:
#Perform pre-processing to the text
import contractions
from bs4 import BeautifulSoup 
import tqdm
import unicodedata
def strip_html_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    [s.extract() for s in soup(['iframe', 'script'])]
    stripped_text = soup.get_text()
    stripped_text = re.sub(r'[\r|\n|\r\n]+', '\n', stripped_text)
    return stripped_text


def remove_accented_chars(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text

def pre_process_corpus(docs):
  norm_docs = []
  for text in tqdm.tqdm(docs): 
    text = strip_html_tags(text)
    text = re.sub(r'[^a-zA-Z\s]','',text,re.I)
    text = text.lower()
    text = text.strip()
    text = remove_accented_chars(text)
    text = contractions.fix(text)
    tokens = nltk.word_tokenize(text)
    filtered_tokens = [token for token in tokens if token not in stop_words]
    text = " ".join(filtered_tokens)
    text = re.sub(" +", ' ',text)
    text =text.strip()
    norm_docs.append(text)
  return norm_docs

In [21]:
%%time 

norm_train_corpus = pre_process_corpus(train_reviews)
norm_test_corpus = pre_process_corpus(test_reviews)

100%|██████████| 35000/35000 [01:24<00:00, 414.19it/s]
100%|██████████| 15000/15000 [00:35<00:00, 422.33it/s]

CPU times: user 1min 59s, sys: 822 ms, total: 2min
Wall time: 2min





In [66]:
## Feature Engineering 
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(binary = False, min_df = 5,max_df = 1.0,ngram_range= (1,2))
cv_train_features = cv.fit_transform(norm_train_corpus)

cv_test_features = cv.transform(norm_test_corpus)

In [None]:
## Feature Engineering 
from sklearn.feature_extraction.text import TfidfVectorizer

tv = TfidfVectorizer(use_idf = True, min_df = 5,max_df = 1.0,ngram_range= (1,2))
tv_train_features = tv.fit_transform(norm_train_corpus)

In [71]:
tv_test_features = tv.transform(norm_test_corpus)

In [67]:
#Try out Logistic 
from sklearn.linear_model import LogisticRegression

#initiate a model 
lr = LogisticRegression(penalty = 'l2',max_iter = 500,random_state = 42)

lr.fit(cv_train_features,train_sentiments)

lr_bow_predictions = lr.predict(cv_test_features)

In [68]:
len(lr_bow_predictions)

15000

In [69]:
from sklearn.metrics import confusion_matrix, classification_report

labels = ['negative','positive']

print(classification_report(test_sentiments,lr_bow_predictions))
pd.DataFrame(confusion_matrix(test_sentiments,lr_bow_predictions),index = labels,columns = labels)

              precision    recall  f1-score   support

    negative       0.90      0.90      0.90      7490
    positive       0.90      0.90      0.90      7510

    accuracy                           0.90     15000
   macro avg       0.90      0.90      0.90     15000
weighted avg       0.90      0.90      0.90     15000



Unnamed: 0,negative,positive
negative,6716,774
positive,746,6764


In [73]:
#fitting and training on TFIDF vectorizers

lr.fit(tv_train_features,train_sentiments)
lr_tfidf_predictions = lr.predict(tv_test_features)

In [74]:
from sklearn.metrics import confusion_matrix, classification_report

labels = ['negative','positive']

print(classification_report(test_sentiments,lr_tfidf_predictions))
pd.DataFrame(confusion_matrix(test_sentiments,lr_tfidf_predictions),index = labels,columns = labels)

              precision    recall  f1-score   support

    negative       0.91      0.89      0.90      7490
    positive       0.89      0.91      0.90      7510

    accuracy                           0.90     15000
   macro avg       0.90      0.90      0.90     15000
weighted avg       0.90      0.90      0.90     15000



Unnamed: 0,negative,positive
negative,6679,811
positive,677,6833


In [None]:
#take time till 10:40am IST 
#try Random Forest - For both CV and TFIDF

In [22]:
#LSTM based Sentiment Classifier 
norm_train_corpus = pre_process_corpus(train_reviews)
norm_test_corpus = pre_process_corpus(test_reviews)

100%|██████████| 35000/35000 [01:23<00:00, 420.14it/s]
100%|██████████| 15000/15000 [00:36<00:00, 409.65it/s]


In [23]:
import tensorflow as tf

t = tf.keras.preprocessing.text.Tokenizer(oov_token= '<UNK>')#If any word not in the vocab is there name tag it UNK
# fit the tokenizer on the document 
t.fit_on_texts(norm_train_corpus)
t.word_index['<PAD>']= 0

In [24]:
train_sequences = t.texts_to_sequences(norm_train_corpus)
test_sequences = t.texts_to_sequences(norm_test_corpus)

In [25]:
print("Vocabulary size ={}".format(len(t.word_index)))
print("Number of Documents={}".format(t.document_count))

Vocabulary size =97894
Number of Documents=35000


In [26]:
MAX_SEQUENCE_LENGTH = 1000

In [27]:
#pad dataset to a maximum review length in words
X_train = tf.keras.preprocessing.sequence.pad_sequences(train_sequences, maxlen=MAX_SEQUENCE_LENGTH)
X_test = tf.keras.preprocessing.sequence.pad_sequences(test_sequences, maxlen=MAX_SEQUENCE_LENGTH)

In [28]:
X_train.shape

(35000, 1000)

In [29]:
#Encode the labels 
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
num_classes = 2

In [30]:
y_train = le.fit_transform(train_sentiments)
y_test = le.transform(test_sentiments)

In [31]:
VOCAB_SIZE = len(t.word_index)

In [32]:
EMBEDDING_DIM = 300 #Dimension for dense embedding for each token

model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Embedding(input_dim =VOCAB_SIZE,output_dim = EMBEDDING_DIM,input_length = MAX_SEQUENCE_LENGTH))
model.add(tf.keras.layers.LSTM(128,return_sequences = False))
model.add(tf.keras.layers.Dense(256,activation = 'relu'))
model.add(tf.keras.layers.Dense(1,activation = 'sigmoid'))

model.compile(loss = 'binary_crossentropy',optimizer="adam",metrics =['accuracy'])

In [33]:
model.summary()


Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 1000, 300)         29368200  
_________________________________________________________________
lstm (LSTM)                  (None, 128)               219648    
_________________________________________________________________
dense (Dense)                (None, 256)               33024     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 257       
Total params: 29,621,129
Trainable params: 29,621,129
Non-trainable params: 0
_________________________________________________________________


In [34]:
model.fit(X_train,y_train,epochs = 2,batch_size = 100,shuffle = True,validation_split =0.1)

Epoch 1/2
Epoch 2/2


<tensorflow.python.keras.callbacks.History at 0x7f5d9697b7d0>

In [36]:
## Evaluation on Test
scores = model.evaluate(X_test,y_test)
print("Accuracy: ",scores[1]*100)

Accuracy:  89.11333084106445


In [38]:
predictions = model.predict_classes(X_test).ravel()
predictions = ['positive' if item==1 else 'negative' for item in predictions]



In [39]:
from sklearn.metrics import confusion_matrix, classification_report

labels = ['negative','positive']

print(classification_report(test_sentiments,predictions))
pd.DataFrame(confusion_matrix(test_sentiments,predictions),index = labels,columns = labels)

              precision    recall  f1-score   support

    negative       0.89      0.89      0.89      7490
    positive       0.89      0.89      0.89      7510

    accuracy                           0.89     15000
   macro avg       0.89      0.89      0.89     15000
weighted avg       0.89      0.89      0.89     15000



Unnamed: 0,negative,positive
negative,6678,812
positive,821,6689


In [None]:
dataset = pd.read_csv(r'https://github.com/dipanjanS/nlp_workshop_dhs18/raw/master/Unit%2011%20-%20Sentiment%20Analysis%20-%20Unsupervised%20Learning/movie_reviews.csv.bz2', compression='bz2')
dataset.info()