
In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv


**Checking statistics of the Corpus**

In [2]:
# import required libraries
import pandas as pd
reviews = pd.read_csv ('/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv')
reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


In [3]:
# comparing the text of two selected reviews
print (repr(reviews.iloc[3344]['review'][0:300]))
print (repr(reviews.iloc[23909]['review'][0:300]))

'The first time I saw this "film" I loved it. When I was 11, I was more interested in the music and dancing. As I\'ve grown older, I\'ve become more interested in the acting as well. While the first half is just a retrospective of Michael\'s career (from the Jackson 5 up to "Bad"), it was still entertai'
'...now please move on because that\'s getting on my nerves.<br /><br />Seriously, the man behind brilliant pieces like "My Own Private Idaho" and "To Die For" (and others not so brilliant movies, i.e. the unnecessary "Psycho" remake) started an experimental phase with "Gerry", which reached its peak '


In [4]:
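# split each review into paragraphs: break wherever sentence-ending punctuation
# (., ? or !) is followed by optional whitespace and a newline
import re
reviews ["paragraphs"] = reviews ["review"].map (lambda text: re.split (r'[.?!]\s*\n', text))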
reviews ['number_of_paragraphs'] = reviews ["paragraphs"].map (len)
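
To see what the split pattern does, here is a small sketch on made-up strings (the sample texts and the alternative `<br />` pattern are assumptions, not taken from the notebook). The pattern only splits where sentence-ending punctuation is followed by a newline, so text whose paragraphs are separated by `<br /><br />`, as in the review excerpts printed above, stays in one piece.

```python
import re

pattern = r'[.?!]\s*\n'

# Made-up sample texts, for illustration only
with_newlines = "First paragraph.\nSecond paragraph!\n\nThird paragraph"
with_br_tags = "First paragraph.<br /><br />Second paragraph."

# The punctuation and the newline are consumed by the split
parts_newlines = re.split(pattern, with_newlines)

# No newline in the text, so the whole string comes back as one element
parts_br = re.split(pattern, with_br_tags)

# Hypothetical alternative: split on runs of <br> tags instead
parts_br_alt = re.split(r'(?:<br\s*/?>\s*)+', with_br_tags)

print(parts_newlines)
print(len(parts_br))
print(parts_br_alt)
```

If most reviews contain no newline characters, `number_of_paragraphs` will be 1 for them and the paragraph corpus will be nearly the same as the review corpus.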

**Preparations**

In [5]:
# import required libraries
import sklearn
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from spacy.lang.de.stop_words import STOP_WORDS

tfidf_text_vectorizer = TfidfVectorizer(stop_words=list(STOP_WORDS))
vectors_text = tfidf_text_vectorizer.fit_transform (reviews ['review'])
vectors_text.shape

(50000, 101758)
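
Note that `spacy.lang.de.stop_words` is spaCy's German stop-word list, while the reviews are in English, so English function words such as "the" and "of" stay in the vocabulary. A minimal sketch of the difference stop-word filtering makes, assuming scikit-learn's built-in `'english'` list is an acceptable substitute (spaCy also ships an English list in `spacy.lang.en.stop_words`). The two toy sentences are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up toy corpus, for illustration only
toy_reviews = ["the movie was good",
               "the film was bad and the plot was weak"]

# No stop-word filtering: function words remain in the vocabulary
no_filter = TfidfVectorizer()
no_filter.fit(toy_reviews)

# English stop-word filtering with scikit-learn's built-in list
english_filter = TfidfVectorizer(stop_words='english')
english_filter.fit(toy_reviews)

print(sorted(no_filter.vocabulary_))
print(sorted(english_filter.vocabulary_))
```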

In [6]:
# flatten the paragraphs keeping the sentiment
paragraph_df = pd.DataFrame ([{'review': paragraph, 'sentiment': sentiment}
                             for paragraphs, sentiment in \
                             zip (reviews ['paragraphs'], reviews ['sentiment'])
                             for paragraph in paragraphs if paragraph])
tfidf_para_vectorizer = TfidfVectorizer(stop_words=list(STOP_WORDS))
tfidf_para_vectors = tfidf_para_vectorizer.fit_transform (paragraph_df ['review'])
tfidf_para_vectors.shape

(50000, 101758)

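**Nonnegative matrix factorization** - $ V \approx W \cdot H $, where $V$ is the non-negative document-term matrix (here the TF-IDF vectors), $W$ maps documents to topics and $H$ maps topics to words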
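
A minimal NumPy sketch of the factorization on a made-up matrix, using the classic multiplicative update rules. This is an illustration only and not the solver that scikit-learn's `NMF` uses by default. All names and sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy non-negative "document-term" matrix: 6 documents, 5 terms, rank 2
V = rng.random((6, 2)) @ rng.random((2, 5))

k = 2                      # number of topics
W = rng.random((6, k))     # document-topic weights
H = rng.random((k, 5))     # topic-term weights
eps = 1e-9                 # avoids division by zero

initial_error = np.linalg.norm(V - W @ H) / np.linalg.norm(V)

# Multiplicative updates keep W and H non-negative at every step
for _ in range(500):
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)

final_error = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(initial_error, final_error)
```

Each row of `H` plays the role of a topic (a weighting over terms), and each row of `W` says how strongly a document uses each topic.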

In [7]:
# import required library
from sklearn.decomposition import NMF

nmf_text_model = NMF (n_components = 10, random_state = 42)
W_text_matrix = nmf_text_model.fit_transform (vectors_text)
H_text_matrix = nmf_text_model.components_

# define a function for printing the top words of each topic
def display_topics (model, features, no_top_words=5):
    for topic, word_vector in enumerate (model.components_):  # use the model passed in
        total = word_vector.sum ()
        largest = word_vector.argsort ()[::-1]  # invert sort order
        print ("\ntopic %02d" % topic)
        for i in range (0, no_top_words):
            print ("  %s (%2.2f)" % (features [largest [i]],
                                    word_vector [largest[i]] * 100.0/total))
            
# calling the function
display_topics (nmf_text_model, tfidf_text_vectorizer.get_feature_names_out())




topic 00
  the (8.01)
  of (1.84)
  to (0.47)
  from (0.43)
  on (0.41)

topic 01
  br (22.11)
  10 (0.50)
  some (0.26)
  no (0.25)
  here (0.24)

topic 02
  to (2.22)
  they (1.33)
  that (1.24)
  have (0.75)
  show (0.73)

topic 03
  he (2.65)
  his (2.18)
  to (0.96)
  him (0.92)
  the (0.89)

topic 04
  film (4.47)
  is (2.06)
  this (1.92)
  films (0.97)
  to (0.96)

topic 05
  movie (6.44)
  this (3.26)
  is (1.94)
  bad (1.40)
  movies (1.32)

topic 06
  and (3.04)
  of (1.22)
  is (1.17)
  are (0.60)
  as (0.60)

topic 07
  you (6.92)
  if (2.34)
  your (1.59)
  don (0.90)
  watch (0.88)

topic 08
  she (4.31)
  is (1.02)
  the (0.79)
  to (0.76)
  and (0.56)

topic 09
  it (6.17)
  and (1.74)
  but (1.19)
  my (1.07)
  the (0.97)


In [8]:
# share of each topic in the total document-topic weight, in percent
W_text_matrix.sum (axis=0)/W_text_matrix.sum()*100.0

array([17.79233196, 11.90474076,  9.18845062,  8.77492193,  8.10281665,
        9.50987968,  9.76571393,  8.26599463,  5.81269207, 10.88245777])
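
The same arithmetic on a tiny made-up document-topic matrix (3 documents, 2 topics; the values are assumptions chosen so the result is easy to check by hand):

```python
import numpy as np

# Made-up document-topic weights: rows are documents, columns are topics
W_toy = np.array([[0.2, 0.0],
                  [0.1, 0.3],
                  [0.0, 0.4]])

# Column sums are 0.3 and 0.7, the grand total is 1.0
topic_share = W_toy.sum(axis=0) / W_toy.sum() * 100.0
print(topic_share)
```

The shares always add up to 100, so they can be read as the relative size of each topic in the corpus.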

**Create a topic model for paragraphs using NMF**

In [13]:
nmf_para_model = NMF (n_components = 10, random_state = 42)
W_para_matrix = nmf_para_model.fit_transform (tfidf_para_vectors)
H_para_matrix = nmf_para_model.components_

display_topics (nmf_para_model, tfidf_para_vectorizer.get_feature_names_out ())




topic 00
  the (8.01)
  of (1.84)
  to (0.47)
  from (0.43)
  on (0.41)

topic 01
  br (22.11)
  10 (0.50)
  some (0.26)
  no (0.25)
  here (0.24)

topic 02
  to (2.22)
  they (1.33)
  that (1.24)
  have (0.75)
  show (0.73)

topic 03
  he (2.65)
  his (2.18)
  to (0.96)
  him (0.92)
  the (0.89)

topic 04
  film (4.47)
  is (2.06)
  this (1.92)
  films (0.97)
  to (0.96)

topic 05
  movie (6.44)
  this (3.26)
  is (1.94)
  bad (1.40)
  movies (1.32)

topic 06
  and (3.04)
  of (1.22)
  is (1.17)
  are (0.60)
  as (0.60)

topic 07
  you (6.92)
  if (2.34)
  your (1.59)
  don (0.90)
  watch (0.88)

topic 08
  she (4.31)
  is (1.02)
  the (0.79)
  to (0.76)
  and (0.56)

topic 09
  it (6.17)
  and (1.74)
  but (1.19)
  my (1.07)
  the (0.97)


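**Latent semantic analysis with singular value decomposition** - any $ m \times n $ matrix $V$ can be decomposed as follows
$V = U \cdot \Sigma \cdot Z^* $, where $U$ and $Z$ have orthonormal columns and $\Sigma$ is a diagonal matrix of non-negative singular values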
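
A small NumPy sketch of the decomposition on a made-up matrix (the values and variable names are assumptions). It also shows the truncation step: keeping only the `k` largest singular values gives a rank-`k` approximation, which is the idea behind `TruncatedSVD`.

```python
import numpy as np

# Made-up 4 x 3 matrix, for illustration only
V = np.array([[2.0, 0.0, 1.0],
              [0.0, 1.0, 0.0],
              [1.0, 0.0, 2.0],
              [0.0, 3.0, 0.0]])

# Thin SVD: s holds the singular values in descending order
U, s, Zh = np.linalg.svd(V, full_matrices=False)

# Multiplying the three factors recovers V
reconstructed = U @ np.diag(s) @ Zh

# Rank-k approximation from the k largest singular values
k = 2
V_k = U[:, :k] @ np.diag(s[:k]) @ Zh[:k, :]

# The approximation error equals the norm of the discarded singular values
error = np.linalg.norm(V - V_k)
print(s, error)
```

Unlike NMF, the factors of an SVD can contain negative entries, so per-topic word weights expressed as a percentage of the row sum are harder to interpret for an SVD model.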

In [14]:
# import required module
from sklearn.decomposition import TruncatedSVD

svd_para_model = TruncatedSVD (n_components = 10, random_state = 42)
W_svd_para_matrix = svd_para_model.fit_transform (tfidf_para_vectors)
H_svd_para_matrix = svd_para_model.components_

display_topics (svd_para_model, tfidf_para_vectorizer.get_feature_names_out ())



**Latent Dirichlet Allocation**

In [16]:
# import required modules
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

count_para_vectorizer = CountVectorizer (stop_words=list(STOP_WORDS))
count_para_vectors = count_para_vectorizer.fit_transform (paragraph_df ['review'])

lda_para_model = LatentDirichletAllocation (n_components = 10, random_state = 42)
W_lda_para_matrix = lda_para_model.fit_transform (count_para_vectors)
H_lda_para_matrix = lda_para_model.components_

# use the vocabulary of the vectorizer the LDA model was trained on
display_topics (lda_para_model, count_para_vectorizer.get_feature_names_out ())

KeyboardInterrupt: 
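
The output above only shows that the cell was interrupted. Assuming the reason was training time on the full corpus, the sketch below shows some settings that usually make LDA cheaper: a capped vocabulary (`max_features`), online learning (`learning_method='online'`) and fewer passes (`max_iter`). The toy corpus and all parameter values are assumptions, not tuned recommendations.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Made-up toy corpus, for illustration only
toy_docs = ["great acting and a great story",
            "terrible plot and boring acting",
            "the music and dancing were wonderful",
            "boring story with terrible music"]

# Cap the vocabulary size to keep the count matrix small
toy_vectorizer = CountVectorizer(stop_words='english', max_features=1000)
toy_counts = toy_vectorizer.fit_transform(toy_docs)

# Online variational Bayes with few passes over the data
toy_lda = LatentDirichletAllocation(n_components=2,
                                    learning_method='online',
                                    max_iter=5,
                                    random_state=42)
toy_doc_topics = toy_lda.fit_transform(toy_counts)

# Each row is a probability distribution over topics
print(toy_doc_topics.sum(axis=1))
```

On a multi-core machine, passing `n_jobs=-1` to `LatentDirichletAllocation` is another option worth trying.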

In [None]:
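# visualizing LDA results
# pyLDAvis 3.4 renamed the pyLDAvis.sklearn module to pyLDAvis.lda_model
import pyLDAvis
try:
    import pyLDAvis.lda_model as lda_vis
except ImportError:
    import pyLDAvis.sklearn as lda_vis

lda_display = lda_vis.prepare (lda_para_model, count_para_vectors,
                               count_para_vectorizer, sort_topics = False)
pyLDAvis.display (lda_display)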