# II.  Python Packages for Text Mining and Analysis

A wide variety of Python packages and modules are available for performing text mining and analysis.  When executed, the code cells below will load those needed for the activities presented in this course module.  Comments in the code identify each package being loaded, and you can refer to each package's documentation for more specific information.  You must run the code cells below to prepare your environment for the text mining and analysis tasks presented in this module.

In [None]:
# update Colab environment to latest version of NLTK
# documentation: https://www.nltk.org/
!pip install nltk -U

Collecting nltk
  Downloading https://files.pythonhosted.org/packages/5e/37/9532ddd4b1bbb619333d5708aaad9bf1742f051a664c3c6fa6632a105fd8/nltk-3.6.2-py3-none-any.whl (1.5MB)
     |████████████████████████████████| 1.5MB 5.1MB/s 
Installing collected packages: nltk
  Found existing installation: nltk 3.2.5
    Uninstalling nltk-3.2.5:
      Successfully uninstalled nltk-3.2.5
Successfully installed nltk-3.6.2


In [3]:
# we also need to install pyLDAvis, which includes the gensim connector
!pip install pyLDAvis

Collecting pyLDAvis
  Downloading https://files.pythonhosted.org/packages/03/a5/15a0da6b0150b8b68610cc78af80364a80a9a4c8b6dd5ee549b8989d4b60/pyLDAvis-3.3.1.tar.gz (1.7MB)
     |████████████████████████████████| 1.7MB 5.1MB/s 
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
    Preparing wheel metadata ... done
Collecting numpy>=1.20.0
  Downloading https://files.pythonhosted.org/packages/3f/03/c3526fb4e79a793498829ca570f2f868204ad9a8040afcd72d82a8f121db/numpy-1.21.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.7MB)
     |████████████████████████████████| 15.7MB 187kB/s 
Collecting pandas>=1.2.0
  Downloading https://files.pythonhosted.org/packages/99/f7/01cea7f6c963100f045876eb4aa1817069c5c9eca73d2dbfb5d31ff9a39f/pandas-1.3.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (10.8MB)
     |████████

In [6]:
# import the base nltk package
import nltk

# load the nltk tokenize module
from nltk.tokenize import word_tokenize

# download the punkt model
nltk.download('punkt')

# import nltk stopword module
from nltk.corpus import stopwords

# download the stopword list
nltk.download('stopwords')

# import the nltk porter stemmer
from nltk.stem.porter import PorterStemmer

# import the nltk lemmatizer
from nltk.stem import WordNetLemmatizer

# import regular expression package
import re

# import numpy
import numpy as np

# import pandas
import pandas as pd
from pprint import pprint

# import the os module 
import os

# import main Gensim package
import gensim

# import gensim corpora module
import gensim.corpora as corpora

# import gensim simple_preprocess function
from gensim.utils import simple_preprocess

# import gensim topic coherence model
from gensim.models import CoherenceModel

# import spacy package for lemmatization
import spacy

# import pyLDAvis package for model visualization
import pyLDAvis

# import the pyLDAvis gensim connector package
# (named gensim_models in pyLDAvis 3.x and imported later in this module)
#import pyLDAvis.gensim

# import the matplotlib package for plotting
import matplotlib.pyplot as plt

# set up matplotlib to render plots inline in the notebook
%matplotlib inline

# Enable logging for gensim - optional
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)

# disable deprecation warnings
import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
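
Because every later cell depends on these imports, it can be useful to confirm that the packages are installed before going further. The following is a minimal sketch that uses only the standard library; the helper name `find_missing` is hypothetical, and the list of module names is simply taken from the import cell above.

```python
import importlib.util

# Top-level import names used in this module (taken from the import cell above).
required_modules = ["nltk", "gensim", "spacy", "pyLDAvis", "numpy", "pandas", "matplotlib"]

def find_missing(module_names):
    """Return the names in module_names that cannot be located for import."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]

missing = find_missing(required_modules)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

Any name reported as not installed can be added with `!pip install <name>` in a separate cell before rerunning the imports.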


# III.  Load a Working Corpus

Before you can load files for analysis, you must mount your Google Drive in this environment.

In [3]:
from google.colab import drive
drive.mount('/gdrive/')

Mounted at /gdrive/


Once your Google Drive has successfully mounted, you can open one of the sample data files provided for the course or a file of your own that you have placed in the data_my directory of the Course Home Directory:


1.   To load a course sample file, in the code cell below uncomment (remove the hashtag at the start of the line) the line that reads, "," and then run the cell.
2.   To load a text (ASCII) file of your own, replace the "\<filename\>" substring in the line that reads, "," with the name of your file, uncomment the line, and then run the cell.



In [7]:
working_file_directory_path = "/gdrive/MyDrive/rbs_digital_approaches_2021/data_class/eebo-tcp/"

In [8]:
file_list = os.listdir(working_file_directory_path) 

In [10]:
print(file_list)

['B06649.headed.txt', 'B06707.headed.txt', 'B06669.headed.txt', 'B07103.headed.txt', 'B06716.headed.txt', 'B06667.headed.txt', 'B06774.headed.txt', 'B06872.headed.txt', 'B06876.headed.txt', 'B06758.headed.txt', 'B06672.headed.txt', 'B06789.headed.txt', 'B06569.headed.txt', 'B06608.headed.txt', 'B06674.headed.txt', 'B06761.headed.txt', 'B06682.headed.txt', 'B06575.headed.txt', 'B06645.headed.txt', 'B06792.headed.txt', 'B06605.headed.txt', 'B06795.headed.txt', 'B06699.headed.txt', 'B06777.headed.txt', 'B06632.headed.txt', 'B06782.headed.txt', 'B06802.headed.txt', 'B06688.headed.txt', 'B06712.headed.txt', 'B06788.headed.txt', 'B31385.headed.txt', 'B06739.headed.txt', 'B06556.headed.txt', 'B06784.headed.txt', 'B06677.headed.txt', 'B06646.headed.txt', 'B06762.headed.txt', 'B06558.headed.txt', 'B25542.headed.txt', 'B06656.headed.txt', 'B06787.headed.txt', 'B06767.headed.txt', 'B06614.headed.txt', 'B06563.headed.txt', 'B06624.headed.txt', 'B06634.headed.txt', 'B06694.headed.txt', 'B06597.head

In [11]:
this_path = working_file_directory_path + file_list[0]
print(this_path)

/gdrive/MyDrive/rbs_digital_approaches_2021/data_class/eebo-tcp/B06649.headed.txt
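
Joining the directory and file name with `+` works here only because the directory path ends in a slash, and `os.listdir` returns every entry in the directory in arbitrary order. As a hedged alternative, the sketch below builds paths with `os.path.join`, keeps only names with a given suffix, and sorts them. The helper name `list_text_files` and the demo file names are made up for illustration.

```python
import os
import tempfile

def list_text_files(directory, suffix=".txt"):
    """Return full paths of files in directory ending with suffix, sorted by name."""
    names = sorted(name for name in os.listdir(directory) if name.endswith(suffix))
    return [os.path.join(directory, name) for name in names]

# Demonstration on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ["B002.headed.txt", "B001.headed.txt", "notes.md"]:
        with open(os.path.join(demo_dir, name), "w") as handle:
            handle.write("sample")
    demo_paths = list_text_files(demo_dir)
    demo_names = [os.path.basename(path) for path in demo_paths]

print(demo_names)
```

Sorting the names makes document positions in later lists reproducible from one session to the next.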


In [15]:
text_collection = []
for nextfile in file_list:
  next_file_path = working_file_directory_path + nextfile
  # open each file with a context manager so it is closed after reading
  with open(next_file_path, "r", encoding='windows-1252') as next_file_object:
    next_text = next_file_object.read()
  text_collection.append(next_text)

In [16]:
print(text_collection[1])

The World turn'd up-side down OR,  Money grown Trouble some. Shewing the vanity of youngmen, who spend their youthfull days in rioting and want onness, which is undoubtely the High-way to want and Beggary, as you may plainly see in these following lines, wherein the Extravagant doth not only lament his mispent time, but also gives advice to others, to prevent tjose miseries which befell him by his profuse spending till too Late he sees his error. Tune of, Packingtons Pound.       I Am a young blade that had money good store But now by debauchery grown very poor When I had enough to have served my turn Oh then in my pocket my money did burn Then straitway I hunted to find out good fellows, And could not endure to be out of an Alehouse, But by Whoring and Drinking I now am undone, And now I am laugh'd at, by every one. And when I was drunk I must needs have a whore, By which means I quickly consumed my store; For I met with a Wench with her powderde locks, And she for my love furnish me 
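
The `encoding='windows-1252'` argument matters because the same bytes can decode to different characters, or fail to decode at all, under another encoding. The sketch below illustrates this with a made-up byte string; whether the course files actually contain such characters is an assumption, not something shown above.

```python
# Bytes as they might appear in a Windows-1252 file: curly quotes around a word.
raw_bytes = b"\x93Money\x94 grown troublesome"

# Decoding with the matching encoding recovers the curly quotation marks.
decoded = raw_bytes.decode("windows-1252")
print(decoded)

# The same bytes are not valid UTF-8, so decoding them that way fails.
try:
    raw_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print("Valid UTF-8:", utf8_ok)
```

If a file of your own was saved as UTF-8, passing `encoding='utf-8'` to `open` would be the matching choice.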

In [17]:
from gensim.utils import simple_preprocess

In [18]:
tokens = [simple_preprocess(next_doc, deacc=True) for next_doc in text_collection]


In [21]:
print(tokens[23])
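
To make the tokenization step less of a black box, here is a rough standard-library approximation of what `simple_preprocess(text, deacc=True)` does: strip accents, lowercase, split into runs of letters, and discard very short or very long tokens. The function name `rough_preprocess` is hypothetical, and gensim's own tokenizer may differ in edge cases, so treat this as a sketch rather than a drop-in replacement.

```python
import re
import unicodedata

def rough_preprocess(text, min_len=2, max_len=15):
    """Approximate simple_preprocess(text, deacc=True) with the standard library."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Lowercase and keep runs of letters only (no digits, punctuation, or underscores).
    words = re.findall(r"[^\W\d_]+", stripped.lower())
    # Discard tokens that are shorter than min_len or longer than max_len.
    return [word for word in words if min_len <= len(word) <= max_len]

sample = "The World turn'd up-side down OR, Money grown Trouble some."
sample_tokens = rough_preprocess(sample)
print(sample_tokens)
```

Note that punctuation splits words, so "turn'd" yields "turn" and a one-letter "d" that is then dropped by the length filter.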



In [23]:
gensim_dictionary = corpora.Dictionary()
gensim_corpus = [gensim_dictionary.doc2bow(token, allow_update=True) for token in tokens]

In [24]:
print(gensim_dictionary)

Dictionary(79940 unique tokens: ['accidents', 'achilles', 'action', 'ad', 'ages']...)


In [27]:
print(gensim_corpus[0])

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 3), (8, 1), (9, 1), (10, 1), (11, 1), (12, 5), (13, 19), (14, 1), (15, 1), (16, 1), (17, 1), (18, 3), (19, 3), (20, 1), (21, 1), (22, 1), (23, 1), (24, 5), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 11), (33, 7), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 1), (53, 2), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1), (59, 3), (60, 1), (61, 1), (62, 1), (63, 1), (64, 1), (65, 1), (66, 1), (67, 3), (68, 1), (69, 1), (70, 1), (71, 1), (72, 1), (73, 1), (74, 1), (75, 1), (76, 1), (77, 2), (78, 1), (79, 1), (80, 1), (81, 1), (82, 4), (83, 1), (84, 1), (85, 1), (86, 1), (87, 2), (88, 1), (89, 2), (90, 3), (91, 1), (92, 1), (93, 1), (94, 1), (95, 2), (96, 1), (97, 1), (98, 1), (99, 1), (100, 1), (101, 1), (102, 1), (103, 2), (104, 1), (105, 1), (106, 1), (107, 1), (108, 2), (109, 3), (110, 3
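
Each pair in the output above is a token id followed by the number of times that token occurs in the document. The following sketch builds the same kind of bag-of-words structure with plain Python so the idea is visible. The function name `build_bow` and the toy documents are made up, and ids are assigned here in order of first appearance, which may differ from the numbering gensim produces.

```python
from collections import Counter

def build_bow(tokenized_docs):
    """Assign an integer id to each new token and count tokens per document."""
    token_to_id = {}
    corpus = []
    for doc in tokenized_docs:
        for token in doc:
            # New tokens receive the next unused id; known tokens keep theirs.
            token_to_id.setdefault(token, len(token_to_id))
        counts = Counter(token_to_id[token] for token in doc)
        corpus.append(sorted(counts.items()))
    return token_to_id, corpus

docs = [["money", "and", "money"], ["and", "trouble"]]
token_to_id, bow_corpus = build_bow(docs)
print(token_to_id)
print(bow_corpus)
```

Word order within a document is discarded in this representation; only the counts survive.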

In [30]:
bow_word_frequencies = [[(gensim_dictionary[token_id], frequency) for token_id, frequency in doc] for doc in gensim_corpus]
print(bow_word_frequencies[:50])



















In [31]:
from gensim import models
tfidf = models.TfidfModel(gensim_corpus, smartirs='ntc')


In [33]:
tfidf_list = []
for doc in tfidf[gensim_corpus]:
    tfidf_list.append([[gensim_dictionary[token_id], np.around(weight, decimals=2)] for token_id, weight in doc])

In [34]:
print(tfidf_list[0])

[['accidents', 0.06], ['achilles', 0.06], ['action', 0.04], ['ad', 0.03], ['ages', 0.03], ['air', 0.06], ['alas', 0.03], ['all', 0.0], ['allow', 0.04], ['almost', 0.03], ['alone', 0.02], ['am', 0.01], ['an', 0.03], ['and', 0.01], ['appetite', 0.04], ['are', 0.0], ['argument', 0.04], ['arran', 0.09], ['as', 0.0], ['at', 0.0], ['attaques', 0.09], ['austin', 0.07], ['back', 0.03], ['banisht', 0.07], ['be', 0.0], ['bed', 0.03], ['better', 0.01], ['biast', 0.09], ['bloom', 0.08], ['bode', 0.08], ['breast', 0.04], ['build', 0.05], ['but', 0.02], ['by', 0.01], ['caesar', 0.05], ['calling', 0.03], ['can', 0.01], ['cato', 0.04], ['chain', 0.06], ['chamber', 0.05], ['charles', 0.04], ['city', 0.03], ['clouds', 0.05], ['colour', 0.03], ['common', 0.01], ['confess', 0.04], ['conscience', 0.02], ['constellat', 0.09], ['correspondency', 0.07], ['could', 0.01], ['cramm', 0.09], ['crime', 0.03], ['croud', 0.08], ['crown', 0.07], ['dar', 0.07], ['dark', 0.05], ['dead', 0.02], ['departure', 0.05], ['des
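
Words such as "all", "are", and "as" receive a weight of 0.0 in the output above, which is what TF-IDF is designed to do for terms found in every, or nearly every, document. The sketch below shows one common formulation: term count times log2(N / document frequency), followed by scaling each document to unit length. The function name `tfidf_weights` is hypothetical, and the exact formula gensim applies for `smartirs='ntc'` may differ by version, so this is an illustration of the idea and not a reimplementation.

```python
import math

def tfidf_weights(bow_corpus):
    """Weight (id, count) pairs by count * log2(N / df), then scale each document to unit length."""
    num_docs = len(bow_corpus)
    # Document frequency: the number of documents that contain each token id.
    doc_freq = {}
    for doc in bow_corpus:
        for token_id, _ in doc:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1
    weighted = []
    for doc in bow_corpus:
        raw = [(tid, count * math.log2(num_docs / doc_freq[tid])) for tid, count in doc]
        norm = math.sqrt(sum(weight * weight for _, weight in raw))
        weighted.append([(tid, weight / norm if norm else 0.0) for tid, weight in raw])
    return weighted

# Toy corpus: token 1 occurs in both documents, tokens 0 and 2 in one each.
toy_corpus = [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
toy_weights = tfidf_weights(toy_corpus)
for doc in toy_weights:
    print([(tid, round(weight, 2)) for tid, weight in doc])
```

In the toy corpus, the token shared by both documents ends up with weight 0.0, while the tokens unique to one document carry all of the weight.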

In [35]:
# Python program to sort a list of
# [term, weight] pairs by the second item using sort()

# Function to sort the list by the second item of each pair
def Sort_Tuple(tup):

    # reverse defaults to False (sorts in ascending order)
    # key is a lambda that returns the second element
    # of each sublist; sort() reorders the list in place
    tup.sort(key = lambda x: x[1])
    return tup


# print the list; Sort_Tuple is defined above but not called here,
# so the output is still in its original (unsorted) order
print(tfidf_list[0])

[['accidents', 0.06], ['achilles', 0.06], ['action', 0.04], ['ad', 0.03], ['ages', 0.03], ['air', 0.06], ['alas', 0.03], ['all', 0.0], ['allow', 0.04], ['almost', 0.03], ['alone', 0.02], ['am', 0.01], ['an', 0.03], ['and', 0.01], ['appetite', 0.04], ['are', 0.0], ['argument', 0.04], ['arran', 0.09], ['as', 0.0], ['at', 0.0], ['attaques', 0.09], ['austin', 0.07], ['back', 0.03], ['banisht', 0.07], ['be', 0.0], ['bed', 0.03], ['better', 0.01], ['biast', 0.09], ['bloom', 0.08], ['bode', 0.08], ['breast', 0.04], ['build', 0.05], ['but', 0.02], ['by', 0.01], ['caesar', 0.05], ['calling', 0.03], ['can', 0.01], ['cato', 0.04], ['chain', 0.06], ['chamber', 0.05], ['charles', 0.04], ['city', 0.03], ['clouds', 0.05], ['colour', 0.03], ['common', 0.01], ['confess', 0.04], ['conscience', 0.02], ['constellat', 0.09], ['correspondency', 0.07], ['could', 0.01], ['cramm', 0.09], ['crime', 0.03], ['croud', 0.08], ['crown', 0.07], ['dar', 0.07], ['dark', 0.05], ['dead', 0.02], ['departure', 0.05], ['des
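
The list printed above is not sorted, because the cell prints `tfidf_list[0]` without calling `Sort_Tuple`. To see the most distinctive terms first, the pairs can be ordered by weight from highest to lowest. A minimal sketch follows, using a hypothetical helper `sort_by_weight` and a handful of pairs copied from the output above; unlike `list.sort`, `sorted` returns a new list and leaves the original untouched.

```python
def sort_by_weight(pairs, descending=True):
    """Return a new list of [term, weight] pairs ordered by weight."""
    return sorted(pairs, key=lambda pair: pair[1], reverse=descending)

# A few pairs copied from the output above.
sample_pairs = [["accidents", 0.06], ["achilles", 0.06], ["action", 0.04],
                ["all", 0.0], ["arran", 0.09]]

top_terms = sort_by_weight(sample_pairs)
print(top_terms[:3])
```

In the notebook, the same call could be applied to `tfidf_list[0]` to list the highest-weighted terms of the first document.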

In [42]:
from gensim import models

In [49]:
lda_model = gensim.models.ldamodel.LdaModel(corpus=gensim_corpus, 
                                            id2word=gensim_dictionary,
                                            num_topics=20, 
                                            random_state=100,
                                            update_every=1,
                                            chunksize=100,
                                            passes=10,
                                            alpha='auto',
                                            per_word_topics=True)

In [61]:
topics = lda_model.print_topics(num_words=10)
for topic in topics:
    print(topic)

(0, '0.001*"clipped" + 0.001*"swear" + 0.001*"payment" + 0.001*"pass" + 0.001*"crowns" + 0.001*"loans" + 0.001*"receivers" + 0.001*"boat" + 0.001*"exchequer" + 0.001*"payments"')
(1, '0.000*"and" + 0.000*"the" + 0.000*"of" + 0.000*"to" + 0.000*"that" + 0.000*"in" + 0.000*"his" + 0.000*"he" + 0.000*"was" + 0.000*"be"')
(2, '0.000*"and" + 0.000*"the" + 0.000*"of" + 0.000*"to" + 0.000*"that" + 0.000*"in" + 0.000*"he" + 0.000*"for" + 0.000*"was" + 0.000*"his"')
(3, '0.004*"blazon" + 0.001*"royal" + 0.001*"cheerful" + 0.000*"tempteth" + 0.000*"arms" + 0.000*"cryest" + 0.000*"soit" + 0.000*"diev" + 0.000*"pense" + 0.000*"droit"')
(4, '0.000*"and" + 0.000*"of" + 0.000*"the" + 0.000*"in" + 0.000*"that" + 0.000*"was" + 0.000*"he" + 0.000*"his" + 0.000*"to" + 0.000*"kynge"')
(5, '0.048*"of" + 0.037*"the" + 0.026*"and" + 0.025*"to" + 0.025*"was" + 0.022*"in" + 0.020*"he" + 0.019*"his" + 0.015*"ye" + 0.015*"that"')
(6, '0.019*"my" + 0.016*"and" + 0.013*"for" + 0.012*"to" + 0.011*"me" + 0.010*"the"
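
Several of the topics above consist almost entirely of function words ("and", "the", "of"). The NLTK stop word list is downloaded earlier in this module but is not applied in the cells shown, so one plausible next step is to filter the tokens before building the dictionary. The sketch below is a standard-library illustration: the function name `filter_common_tokens`, the toy documents, and the tiny stop word set are all made up, and a list of modern English stop words may miss early modern spellings such as "ye".

```python
def filter_common_tokens(tokenized_docs, stop_words=frozenset(), max_doc_fraction=0.5):
    """Drop stop words and tokens that occur in more than max_doc_fraction of documents."""
    num_docs = len(tokenized_docs)
    # Count the number of documents in which each token appears at least once.
    doc_freq = {}
    for doc in tokenized_docs:
        for token in set(doc):
            doc_freq[token] = doc_freq.get(token, 0) + 1
    return [
        [token for token in doc
         if token not in stop_words and doc_freq[token] / num_docs <= max_doc_fraction]
        for doc in tokenized_docs
    ]

# Hypothetical toy documents and stop word list, for illustration only.
toy_docs = [["the", "kynge", "and", "crowns"],
            ["the", "boat", "and", "payment"],
            ["the", "royal", "arms", "of", "kynge"],
            ["of", "loans", "and", "payment"]]
toy_stop_words = {"the", "and", "of"}
filtered_docs = filter_common_tokens(toy_docs, toy_stop_words)
print(filtered_docs)
```

Gensim's `Dictionary.filter_extremes` offers a related document-frequency filter that operates on the dictionary itself.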

In [None]:
!pip install pyLDAvis -U

In [67]:
import pyLDAvis

In [65]:
from pyLDAvis import gensim_models

In [70]:
lda_viz = pyLDAvis.gensim_models.prepare(lda_model, gensim_corpus, gensim_dictionary)
pyLDAvis.display(lda_viz)

