[![AnalyticsDojo](https://github.com/rpi-techfundamentals/spring2019-materials/blob/master/fig/final-logo.png?raw=1)](http://rpi.analyticsdojo.com)
<center><h1> Vectorization Options</h1></center>
<center><h3><a href = 'http://rpi.analyticsdojo.com'>rpi.analyticsdojo.com</a></h3></center>

This notebook is adapted from [Bag of Words Meets Bags of Popcorn](https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-1-for-beginners-bag-of-words) and
[https://github.com/wendykan/DeepLearningMovies](https://github.com/wendykan/DeepLearningMovies).


## Vectorizers

To be modeled, text must first be converted into numeric vectors. This notebook covers several approaches to text vectorization: bag of words, TF-IDF, Doc2Vec, LSI, and LDA.

### Bag of Words

In [35]:
import pandas as pd
import numpy as np
from gensim import models
from gensim.models.doc2vec import Doc2Vec
from gensim.models.doc2vec import TaggedDocument

In [36]:
!wget https://github.com/rpi-techfundamentals/spring2019-materials/raw/master/input/labeledTrainData.tsv
!wget https://github.com/rpi-techfundamentals/spring2019-materials/raw/master/input/unlabeledTrainData.tsv
!wget https://github.com/rpi-techfundamentals/spring2019-materials/raw/master/input/testData.tsv

--2023-03-20 13:39:59--  https://github.com/rpi-techfundamentals/spring2019-materials/raw/master/input/labeledTrainData.tsv
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/rpi-techfundamentals/spring2019-materials/master/input/labeledTrainData.tsv [following]
--2023-03-20 13:39:59--  https://raw.githubusercontent.com/rpi-techfundamentals/spring2019-materials/master/input/labeledTrainData.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 33556378 (32M) [text/plain]
Saving to: 'labeledTrainData.tsv.2'


2023-03-20 13:40:03 (7.68 MB/s) - 'labeledTrainData.tsv.2' saved [33556378/33556

In [37]:
# quoting=3 is csv.QUOTE_NONE, so the double quotes inside the reviews are read literally.
train = pd.read_csv('labeledTrainData.tsv', header=0,
                    delimiter="\t", quoting=3)
unlabeled_train = pd.read_csv('unlabeledTrainData.tsv', header=0,
                              delimiter="\t", quoting=3)
test = pd.read_csv('testData.tsv', header=0,
                   delimiter="\t", quoting=3)

In [38]:
print(train.columns.values, test.columns.values)

['id' 'sentiment' 'review'] ['id' 'review']


In [39]:
train.head()

Unnamed: 0,id,sentiment,review
0,"""5814_8""",1,"""With all this stuff going down at the moment ..."
1,"""2381_9""",1,"""\""The Classic War of the Worlds\"" by Timothy ..."
2,"""7759_3""",0,"""The film starts with a manager (Nicholas Bell..."
3,"""3630_4""",0,"""It must be assumed that those who praised thi..."
4,"""9495_8""",1,"""Superbly trashy and wondrously unpretentious ..."


In [40]:
print('The train shape is: ', train.shape)
print('The test shape is: ', test.shape)

The train shape is:  (25000, 3)
The test shape is:  (25000, 2)


In [41]:
print('The first review is:')
print(train["review"][0])

The first review is:
"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bi

### Common Preprocessing

Packages such as gensim provide a variety of preprocessing routines. Chaining them turns each raw review into a list of tokens.


https://radimrehurek.com/gensim/parsing/preprocessing.html
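
To make the idea of chained preprocessing concrete, here is a minimal standard-library sketch. The filter functions and the tiny stopword set are hypothetical stand-ins, not gensim's implementations; the only point is that each filter maps a string to a string, the filters run in order, and the result is split on whitespace.

```python
import re
import string

# Tiny illustrative stopword list (an assumption, far shorter than a real one).
STOPWORDS = {"the", "a", "is", "of", "and", "this"}

def strip_tags(text):
    """Replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)

def strip_punct(text):
    """Replace ASCII punctuation with spaces."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.translate(table)

def strip_digits(text):
    """Replace runs of digits with a space."""
    return re.sub(r"\d+", " ", text)

def drop_stopwords(text):
    """Remove words found in STOPWORDS."""
    return " ".join(w for w in text.split() if w not in STOPWORDS)

def apply_filters(text, filters):
    """Apply each filter in order, then split the result on whitespace."""
    for f in filters:
        text = f(text)
    return text.split()

filters = [strip_tags, str.lower, strip_punct, strip_digits, drop_stopwords]
tokens = apply_filters("This is the BEST film of 1999!<br /><br />Loved it.", filters)
print(tokens)  # ['best', 'film', 'loved', 'it']
```

The order matters: lowercasing has to happen before stopword removal here, because the stopword set is lowercase.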


In [56]:
from bs4 import BeautifulSoup
from gensim.parsing.preprocessing import strip_punctuation, strip_numeric, stem_text, preprocess_string
from gensim.parsing.preprocessing import strip_multiple_whitespaces, remove_stopwords

# Define custom filters; preprocess_string applies them in order, then splits on whitespace.
text_col = 'review'
CUSTOM_FILTERS = [
    lambda x: BeautifulSoup(x, 'html.parser').get_text(),  # strip HTML markup such as <br />
    lambda x: x.strip(),         # trim leading and trailing whitespace (keep str, not bytes)
    lambda x: x.lower(),         # lowercase
    strip_multiple_whitespaces,  # collapse repeated whitespace
    strip_numeric,               # remove digits
    strip_punctuation,           # replace punctuation with spaces
    remove_stopwords,            # remove stopwords
    stem_text,                   # Porter-stem each word
]

train[text_col + '_tokens'] = train[text_col].apply(preprocess_string, filters=CUSTOM_FILTERS)
train



Unnamed: 0,id,sentiment,review,review_tokens
0,"""5814_8""",1,"""With all this stuff going down at the moment ...","[stuff, go, moment, mj, ve, start, listen, mus..."
1,"""2381_9""",1,"""\""The Classic War of the Worlds\"" by Timothy ...","[classic, war, world, timothi, hine, entertain..."
2,"""7759_3""",0,"""The film starts with a manager (Nicholas Bell...","[film, start, manag, nichola, bell, give, welc..."
3,"""3630_4""",0,"""It must be assumed that those who praised thi...","[assum, prais, film, greatest, film, opera, t,..."
4,"""9495_8""",1,"""Superbly trashy and wondrously unpretentious ...","[superbl, trashi, wondrous, unpretenti, s, exp..."
...,...,...,...,...
24995,"""3453_3""",0,"""It seems like more consideration has gone int...","[like, consider, gone, imdb, review, film, wen..."
24996,"""5064_1""",0,"""I don't believe they made this film. Complete...","[t, believ, film, complet, unnecessari, film, ..."
24997,"""10905_3""",0,"""Guy is a loser. Can't get girls, needs to bui...","[gui, loser, t, girl, need, build, pick, stron..."
24998,"""10194_3""",0,"""This 30 minute documentary Buñuel made in the...","[minut, documentari, buñuel, earli, s, spain, ..."


In [43]:
from gensim import corpora

# Create a Dictionary that maps each distinct token to an integer id.
cdict = corpora.Dictionary(train[text_col + '_tokens'].to_list())

# Convert each token list into a bag of words: a list of (token_id, count) pairs.
train[text_col + '_bow'] = train[text_col + '_tokens'].apply(cdict.doc2bow)
train

Unnamed: 0,id,sentiment,review,review_tokens,review_bow
0,"""5814_8""",1,"""With all this stuff going down at the moment ...","[stuff, go, moment, mj, ve, start, listen, mus...","[(0, 1), (1, 1), (2, 1), (3, 3), (4, 1), (5, 1..."
1,"""2381_9""",1,"""\""The Classic War of the Worlds\"" by Timothy ...","[classic, war, world, timothi, hine, entertain...","[(26, 1), (36, 1), (40, 2), (64, 1), (79, 2), ..."
2,"""7759_3""",0,"""The film starts with a manager (Nicholas Bell...","[film, start, manag, nichola, bell, give, welc...","[(7, 1), (8, 2), (15, 1), (19, 1), (21, 1), (2..."
3,"""3630_4""",0,"""It must be assumed that those who praised thi...","[assum, prais, film, greatest, film, opera, t,...","[(8, 4), (14, 1), (39, 1), (40, 5), (79, 1), (..."
4,"""9495_8""",1,"""Superbly trashy and wondrously unpretentious ...","[superbl, trashi, wondrous, unpretenti, s, exp...","[(0, 1), (16, 1), (40, 1), (51, 1), (58, 1), (..."
...,...,...,...,...,...
24995,"""3453_3""",0,"""It seems like more consideration has gone int...","[like, consider, gone, imdb, review, film, wen...","[(8, 16), (14, 1), (19, 1), (40, 2), (44, 1), ..."
24996,"""5064_1""",0,"""I don't believe they made this film. Complete...","[t, believ, film, complet, unnecessari, film, ...","[(3, 1), (14, 2), (38, 1), (39, 1), (40, 7), (..."
24997,"""10905_3""",0,"""Guy is a loser. Can't get girls, needs to bui...","[gui, loser, t, girl, need, build, pick, stron...","[(7, 1), (8, 2), (40, 3), (43, 1), (46, 2), (5..."
24998,"""10194_3""",0,"""This 30 minute documentary Buñuel made in the...","[minut, documentari, buñuel, earli, s, spain, ...","[(28, 1), (35, 1), (40, 1), (61, 1), (64, 1), ..."
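
The `(token_id, count)` pairs in the `review_bow` column can be reproduced by hand on a toy corpus. The sketch below uses only the standard library; `build_vocab` and `to_bow` are hypothetical helpers, and the ids they assign need not match the ones gensim's `Dictionary` would pick.

```python
from collections import Counter

def build_vocab(docs):
    """Assign each distinct token an integer id in order of first appearance."""
    vocab = {}
    for doc in docs:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs; tokens outside the vocabulary are skipped."""
    counts = Counter(vocab[tok] for tok in doc if tok in vocab)
    return sorted(counts.items())

docs = [["film", "start", "film"], ["great", "film"]]
vocab = build_vocab(docs)
bows = [to_bow(d, vocab) for d in docs]
print(vocab)  # {'film': 0, 'start': 1, 'great': 2}
print(bows)   # [[(0, 2), (1, 1)], [(0, 1), (2, 1)]]
```

Word order is discarded: only which tokens occur, and how often, survives in the representation.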


In [44]:
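def transform(x, model):
    """Apply a fitted gensim model to a single document vector."""
    return model[x]

# Fit TF-IDF weights on the bag-of-words corpus; normalize=True scales each document vector to unit length.
tfidf_bow = models.TfidfModel(train[text_col + '_bow'].to_list(), normalize=True)
train[text_col + '_tfidf_bow'] = train[text_col + '_bow'].apply(transform, model=tfidf_bow)
train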

Unnamed: 0,id,sentiment,review,review_tokens,review_bow,review_tfidf_bow
0,"""5814_8""",1,"""With all this stuff going down at the moment ...","[stuff, go, moment, mj, ve, start, listen, mus...","[(0, 1), (1, 1), (2, 1), (3, 3), (4, 1), (5, 1...","[(0, 0.017801646828531716), (1, 0.032536370221..."
1,"""2381_9""",1,"""\""The Classic War of the Worlds\"" by Timothy ...","[classic, war, world, timothi, hine, entertain...","[(26, 1), (36, 1), (40, 2), (64, 1), (79, 2), ...","[(26, 0.05409934611948453), (36, 0.04806180321..."
2,"""7759_3""",0,"""The film starts with a manager (Nicholas Bell...","[film, start, manag, nichola, bell, give, welc...","[(7, 1), (8, 2), (15, 1), (19, 1), (21, 1), (2...","[(7, 0.02641721019773179), (8, 0.0113918159397..."
3,"""3630_4""",0,"""It must be assumed that those who praised thi...","[assum, prais, film, greatest, film, opera, t,...","[(8, 4), (14, 1), (39, 1), (40, 5), (79, 1), (...","[(8, 0.027915395297452853), (14, 0.01445786468..."
4,"""9495_8""",1,"""Superbly trashy and wondrously unpretentious ...","[superbl, trashi, wondrous, unpretenti, s, exp...","[(0, 1), (16, 1), (40, 1), (51, 1), (58, 1), (...","[(0, 0.026260993443946862), (16, 0.03952381157..."
...,...,...,...,...,...,...
24995,"""3453_3""",0,"""It seems like more consideration has gone int...","[like, consider, gone, imdb, review, film, wen...","[(8, 16), (14, 1), (19, 1), (40, 2), (44, 1), ...","[(8, 0.3705908678713166), (14, 0.0479838505272..."
24996,"""5064_1""",0,"""I don't believe they made this film. Complete...","[t, believ, film, complet, unnecessari, film, ...","[(3, 1), (14, 2), (38, 1), (39, 1), (40, 7), (...","[(3, 0.04430060588799008), (14, 0.067862671930..."
24997,"""10905_3""",0,"""Guy is a loser. Can't get girls, needs to bui...","[gui, loser, t, girl, need, build, pick, stron...","[(7, 1), (8, 2), (40, 3), (43, 1), (46, 2), (5...","[(7, 0.08536332493723874), (8, 0.0368109758149..."
24998,"""10194_3""",0,"""This 30 minute documentary Buñuel made in the...","[minut, documentari, buñuel, earli, s, spain, ...","[(28, 1), (35, 1), (40, 1), (61, 1), (64, 1), ...","[(28, 0.06264404738146905), (35, 0.05167372356..."
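
To show where the weights in `review_tfidf_bow` come from, here is a hand-rolled sketch in plain Python. It assumes the weighting is term count times `log2(N / document_frequency)`, followed by scaling each document to unit Euclidean length; that is my understanding of the gensim defaults, so treat the exact formula as an assumption. `tfidf` is a hypothetical helper, not a library function.

```python
import math

def tfidf(bows):
    """Weight (term_id, count) pairs by count * log2(N / df), then L2-normalize each document."""
    n_docs = len(bows)
    # Document frequency: number of documents each term appears in.
    df = {}
    for bow in bows:
        for term_id, _ in bow:
            df[term_id] = df.get(term_id, 0) + 1
    idf = {t: math.log2(n_docs / d) for t, d in df.items()}

    weighted = []
    for bow in bows:
        # Terms that occur in every document get idf 0 and are dropped.
        pairs = [(t, c * idf[t]) for t, c in bow if idf[t] > 0]
        norm = math.sqrt(sum(w * w for _, w in pairs))
        weighted.append([(t, w / norm) for t, w in pairs] if norm else [])
    return weighted

bows = [[(0, 2), (1, 1)], [(0, 1), (2, 1)], [(1, 3)]]
weighted = tfidf(bows)
for doc in weighted:
    print(doc)
```

In the second toy document, term 2 occurs in only one of the three documents, so it ends up with a larger weight than term 0 even though both have a count of 1.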


In [45]:
# Doc2Vec: tag each tokenized review with its row index, then train one vector per document.
train[text_col+'_tag']=pd.Series(TaggedDocument(doc, [i]) for i, doc in enumerate(train[text_col+'_tokens'].to_list()))
doc2vec = Doc2Vec(train[text_col+'_tag'] , vector_size=50, window=2, min_count=1, workers=4)
# gensim 4 renamed `docvecs` to `dv`; the old name triggers a deprecation warning.
train[text_col+'_docvecs']=pd.Series([doc2vec.dv[x] for x in range(len(train))])
train


Unnamed: 0,id,sentiment,review,review_tokens,review_bow,review_tfidf_bow,review_tag,review_docvecs
0,"""5814_8""",1,"""With all this stuff going down at the moment ...","[stuff, go, moment, mj, ve, start, listen, mus...","[(0, 1), (1, 1), (2, 1), (3, 3), (4, 1), (5, 1...","[(0, 0.017801646828531716), (1, 0.032536370221...","([stuff, go, moment, mj, ve, start, listen, mu...","[-0.37071925, -0.37471336, 0.35511154, 0.31053..."
1,"""2381_9""",1,"""\""The Classic War of the Worlds\"" by Timothy ...","[classic, war, world, timothi, hine, entertain...","[(26, 1), (36, 1), (40, 2), (64, 1), (79, 2), ...","[(26, 0.05409934611948453), (36, 0.04806180321...","([classic, war, world, timothi, hine, entertai...","[-0.13829194, 0.6065801, -0.3300408, 0.0780963..."
2,"""7759_3""",0,"""The film starts with a manager (Nicholas Bell...","[film, start, manag, nichola, bell, give, welc...","[(7, 1), (8, 2), (15, 1), (19, 1), (21, 1), (2...","[(7, 0.02641721019773179), (8, 0.0113918159397...","([film, start, manag, nichola, bell, give, wel...","[-0.87693584, -0.6684592, 0.039008725, 0.04651..."
3,"""3630_4""",0,"""It must be assumed that those who praised thi...","[assum, prais, film, greatest, film, opera, t,...","[(8, 4), (14, 1), (39, 1), (40, 5), (79, 1), (...","[(8, 0.027915395297452853), (14, 0.01445786468...","([assum, prais, film, greatest, film, opera, t...","[0.05829846, 0.006241254, -0.68099487, -1.0528..."
4,"""9495_8""",1,"""Superbly trashy and wondrously unpretentious ...","[superbl, trashi, wondrous, unpretenti, s, exp...","[(0, 1), (16, 1), (40, 1), (51, 1), (58, 1), (...","[(0, 0.026260993443946862), (16, 0.03952381157...","([superbl, trashi, wondrous, unpretenti, s, ex...","[-0.12724672, -0.27751902, 0.23157094, -0.1183..."
...,...,...,...,...,...,...,...,...
24995,"""3453_3""",0,"""It seems like more consideration has gone int...","[like, consider, gone, imdb, review, film, wen...","[(8, 16), (14, 1), (19, 1), (40, 2), (44, 1), ...","[(8, 0.3705908678713166), (14, 0.0479838505272...","([like, consider, gone, imdb, review, film, we...","[0.15594596, 0.18544354, -0.105273694, 0.22136..."
24996,"""5064_1""",0,"""I don't believe they made this film. Complete...","[t, believ, film, complet, unnecessari, film, ...","[(3, 1), (14, 2), (38, 1), (39, 1), (40, 7), (...","[(3, 0.04430060588799008), (14, 0.067862671930...","([t, believ, film, complet, unnecessari, film,...","[-0.11668433, 0.3858125, 0.12042065, -0.166378..."
24997,"""10905_3""",0,"""Guy is a loser. Can't get girls, needs to bui...","[gui, loser, t, girl, need, build, pick, stron...","[(7, 1), (8, 2), (40, 3), (43, 1), (46, 2), (5...","[(7, 0.08536332493723874), (8, 0.0368109758149...","([gui, loser, t, girl, need, build, pick, stro...","[-0.019323384, 0.02601225, -0.20017359, -0.245..."
24998,"""10194_3""",0,"""This 30 minute documentary Buñuel made in the...","[minut, documentari, buñuel, earli, s, spain, ...","[(28, 1), (35, 1), (40, 1), (61, 1), (64, 1), ...","[(28, 0.06264404738146905), (35, 0.05167372356...","([minut, documentari, buñuel, earli, s, spain,...","[-0.013894695, -0.16366182, -0.06063278, 0.144..."


In [46]:
def create_dense(x, vlen=50):
    """
    Create a dense vector of float64's from a sparse vector of (index, value) pairs.
    Indices missing from the sparse vector are filled with 0.0.
    Returns NaN if x cannot be read as (index, value) pairs.
    """
    try:
        x = dict(x)
        return [np.float64(x.get(i, 0.0)) for i in range(vlen)]
    except (TypeError, ValueError):
        return np.nan
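
As a quick illustration of the sparse-to-dense conversion that the helper above performs, here is a dependency-free sketch on a toy vector. `to_dense` is a hypothetical name used only for this example; gensim also ships `gensim.matutils.sparse2full` for the same purpose.

```python
def to_dense(sparse, length):
    """Expand (index, value) pairs into a list of floats, filling gaps with 0.0."""
    dense = [0.0] * length
    for idx, val in sparse:
        # Ignore indices outside the requested length.
        if 0 <= idx < length:
            dense[idx] = float(val)
    return dense

dense = to_dense([(0, 0.5), (3, 2)], length=5)
print(dense)  # [0.5, 0.0, 0.0, 2.0, 0.0]
```

A fixed length matters because LSI and LDA can omit entries from the sparse output, while most downstream models expect every row to have the same number of features.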

In [47]:

# LSI on raw counts: project each bag-of-words document onto 50 latent topics.
lsi_model_bow = models.LsiModel(train[text_col+'_bow'].to_list(), id2word=cdict, num_topics=50)
train[text_col+'_lsi_bow']=train[text_col+'_bow'].apply(transform, model=lsi_model_bow)
train[text_col+'_lsi_bow_d']=train[text_col+'_lsi_bow'].apply(create_dense, vlen=50)

In [48]:
# LSI on TF-IDF weights instead of raw counts.
lsi_model_tfidf = models.LsiModel(train[text_col+'_tfidf_bow'].to_list(), id2word=cdict, num_topics=50)
train[text_col+'_lsi_tfidf']=train[text_col+'_tfidf_bow'].apply(transform, model=lsi_model_tfidf)
train[text_col+'_lsi_tfidf_d']=train[text_col+'_lsi_tfidf'].apply(create_dense, vlen=50)

In [49]:
# LDA on raw counts; minimum_probability=0 keeps low-probability topics in each document's output.
lda_model_bow = models.LdaModel(train[text_col+'_bow'].to_list(), id2word=cdict, num_topics=50, minimum_probability=0)
train[text_col+'_lda_bow']=train[text_col+'_bow'].apply(transform, model=lda_model_bow)
train[text_col+'_lda_bow_d']=train[text_col+'_lda_bow'].apply(create_dense, vlen=50)

In [50]:
# LDA fitted on TF-IDF weights instead of raw counts.
lda_model_tfidf = models.LdaModel(train[text_col+'_tfidf_bow'].to_list(), id2word=cdict, num_topics=50, minimum_probability=0)
train[text_col+'_lda_tfidf']=train[text_col+'_tfidf_bow'].apply(transform, model=lda_model_tfidf)
train[text_col+'_lda_tfidf_d']=train[text_col+'_lda_tfidf'].apply(create_dense, vlen=50)

In [51]:
train

Unnamed: 0,id,sentiment,review,review_tokens,review_bow,review_tfidf_bow,review_tag,review_docvecs,review_lsi_bow,review_lsi_bow_d,review_lsi_tfidf,review_lsi_tfidf_d,review_lda_bow,review_lda_bow_d,review_lda_tfidf,review_lda_tfidf_d
0,"""5814_8""",1,"""With all this stuff going down at the moment ...","[stuff, go, moment, mj, ve, start, listen, mus...","[(0, 1), (1, 1), (2, 1), (3, 3), (4, 1), (5, 1...","[(0, 0.017801646828531716), (1, 0.032536370221...","([stuff, go, moment, mj, ve, start, listen, mu...","[-0.37071925, -0.37471336, 0.35511154, 0.31053...","[(0, 11.779776065057115), (1, 0.68880848196096...","[11.779776065057115, 0.6888084819609639, 1.001...","[(0, 0.13420705817330306), (1, 0.0204246000473...","[0.13420705817330306, 0.020424600047362993, -0...","[(0, 0.00010167516), (1, 0.00010167516), (2, 0...","[0.00010167516302317381, 0.0001016751630231738...","[(0, 0.003221488), (1, 0.003221488), (2, 0.003...","[0.00322148809209466, 0.00322148809209466, 0.0..."
1,"""2381_9""",1,"""\""The Classic War of the Worlds\"" by Timothy ...","[classic, war, world, timothi, hine, entertain...","[(26, 1), (36, 1), (40, 2), (64, 1), (79, 2), ...","[(26, 0.05409934611948453), (36, 0.04806180321...","([classic, war, world, timothi, hine, entertai...","[-0.13829194, 0.6065801, -0.3300408, 0.0780963...","[(0, 1.8561407545341606), (1, 1.89513522830264...","[1.8561407545341606, 1.8951352283026432, 0.902...","[(0, 0.10878197286923105), (1, -0.038464149430...","[0.10878197286923105, -0.03846414943017419, 0....","[(0, 0.057396516), (1, 0.0002598573), (2, 0.00...","[0.05739651620388031, 0.0002598573046270758, 0...","[(0, 0.0029386827), (1, 0.0029386827), (2, 0.0...","[0.0029386826790869236, 0.0029386826790869236,..."
2,"""7759_3""",0,"""The film starts with a manager (Nicholas Bell...","[film, start, manag, nichola, bell, give, welc...","[(7, 1), (8, 2), (15, 1), (19, 1), (21, 1), (2...","[(7, 0.02641721019773179), (8, 0.0113918159397...","([film, start, manag, nichola, bell, give, wel...","[-0.87693584, -0.6684592, 0.039008725, 0.04651...","[(0, 4.764756308545519), (1, 2.073713845195448...","[4.764756308545519, 2.0737138451954484, -1.120...","[(0, 0.10832676919366832), (1, -0.037771180057...","[0.10832676919366832, -0.03777118005717449, 0....","[(0, 9.49308e-05), (1, 0.107810296), (2, 9.493...","[9.493080142419785e-05, 0.10781029611825943, 9...","[(0, 0.0022494197), (1, 0.0022494197), (2, 0.0...","[0.002249419689178467, 0.002249419689178467, 0..."
3,"""3630_4""",0,"""It must be assumed that those who praised thi...","[assum, prais, film, greatest, film, opera, t,...","[(8, 4), (14, 1), (39, 1), (40, 5), (79, 1), (...","[(8, 0.027915395297452853), (14, 0.01445786468...","([assum, prais, film, greatest, film, opera, t...","[0.05829846, 0.006241254, -0.68099487, -1.0528...","[(0, 7.83569395865421), (1, 2.7322675300286496...","[7.83569395865421, 2.7322675300286496, -1.6456...","[(0, 0.11090122350875947), (1, -0.027767160612...","[0.11090122350875947, -0.027767160612794, 0.00...","[(0, 0.0001194125), (1, 0.0001194125), (2, 0.0...","[0.00011941249977098778, 0.0001194124997709877...","[(0, 0.0029939013), (1, 0.0029939013), (2, 0.0...","[0.0029939012601971626, 0.0029939012601971626,..."
4,"""9495_8""",1,"""Superbly trashy and wondrously unpretentious ...","[superbl, trashi, wondrous, unpretenti, s, exp...","[(0, 1), (16, 1), (40, 1), (51, 1), (58, 1), (...","[(0, 0.026260993443946862), (16, 0.03952381157...","([superbl, trashi, wondrous, unpretenti, s, ex...","[-0.12724672, -0.27751902, 0.23157094, -0.1183...","[(0, 4.381490951727345), (1, 4.539909128826721...","[4.381490951727345, 4.5399091288267215, -0.292...","[(0, 0.1362741166670168), (1, -0.0238380867201...","[0.1362741166670168, -0.023838086720193115, 0....","[(0, 0.00010438681), (1, 0.00010438681), (2, 0...","[0.00010438681056257337, 0.0001043868105625733...","[(0, 0.0018709088), (1, 0.0018709088), (2, 0.0...","[0.0018709087744355202, 0.0018709087744355202,..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24995,"""3453_3""",0,"""It seems like more consideration has gone int...","[like, consider, gone, imdb, review, film, wen...","[(8, 16), (14, 1), (19, 1), (40, 2), (44, 1), ...","[(8, 0.3705908678713166), (14, 0.0479838505272...","([like, consider, gone, imdb, review, film, we...","[0.15594596, 0.18544354, -0.105273694, 0.22136...","[(0, 13.536256392336092), (1, -8.7504469219753...","[13.536256392336092, -8.750446921975358, -1.57...","[(0, 0.21539958446612195), (1, -0.028975017869...","[0.21539958446612195, -0.02897501786907147, -0...","[(0, 0.0004084208), (1, 0.0004084208), (2, 0.0...","[0.00040842080488801, 0.00040842080488801, 0.0...","[(0, 0.0042754444), (1, 0.0042754444), (2, 0.0...","[0.004275444429367781, 0.004275444429367781, 0..."
24996,"""5064_1""",0,"""I don't believe they made this film. Complete...","[t, believ, film, complet, unnecessari, film, ...","[(3, 1), (14, 2), (38, 1), (39, 1), (40, 7), (...","[(3, 0.04430060588799008), (14, 0.067862671930...","([t, believ, film, complet, unnecessari, film,...","[-0.11668433, 0.3858125, 0.12042065, -0.166378...","[(0, 4.968709313380288), (1, 5.659449057954181...","[4.968709313380288, 5.659449057954181, -3.2577...","[(0, 0.1819199245141776), (1, -0.0006595155745...","[0.1819199245141776, -0.0006595155745994549, 0...","[(0, 0.00024402926), (1, 0.00024402926), (2, 0...","[0.0002440292591927573, 0.0002440292591927573,...","[(0, 0.0030703177), (1, 0.0030703177), (2, 0.0...","[0.0030703176744282246, 0.0030703176744282246,..."
24997,"""10905_3""",0,"""Guy is a loser. Can't get girls, needs to bui...","[gui, loser, t, girl, need, build, pick, stron...","[(7, 1), (8, 2), (40, 3), (43, 1), (46, 2), (5...","[(7, 0.08536332493723874), (8, 0.0368109758149...","([gui, loser, t, girl, need, build, pick, stro...","[-0.019323384, 0.02601225, -0.20017359, -0.245...","[(0, 3.417893938851843), (1, 0.762966760162663...","[3.417893938851843, 0.7629667601626638, -0.275...","[(0, 0.15146882820812926), (1, 0.0486274672625...","[0.15146882820812926, 0.04862746726251161, 0.0...","[(0, 0.19364153), (1, 0.00037042814), (2, 0.00...","[0.1936415284872055, 0.00037042814074084163, 0...","[(0, 0.0036014372), (1, 0.0036014372), (2, 0.0...","[0.0036014372017234564, 0.0036014372017234564,..."
24998,"""10194_3""",0,"""This 30 minute documentary Buñuel made in the...","[minut, documentari, buñuel, earli, s, spain, ...","[(28, 1), (35, 1), (40, 1), (61, 1), (64, 1), ...","[(28, 0.06264404738146905), (35, 0.05167372356...","([minut, documentari, buñuel, earli, s, spain,...","[-0.013894695, -0.16366182, -0.06063278, 0.144...","[(0, 2.9912926459933327), (1, 3.24235945183102...","[2.9912926459933327, 3.2423594518310255, -0.23...","[(0, 0.07659577172365657), (1, -0.014535841799...","[0.07659577172365657, -0.014535841799178663, 0...","[(0, 0.00023004648), (1, 0.00023004648), (2, 0...","[0.0002300464839208871, 0.0002300464839208871,...","[(0, 0.0036316316), (1, 0.0036316316), (2, 0.0...","[0.003631631610915065, 0.003631631610915065, 0..."
