# Tutorial: Working with Text

## Guggenheim Museum Art Books

This tutorial shows how to work with text in Python. To run the examples, we use art books made publicly available by the Guggenheim Museum. The full repository of books is available here: https://archive.org/details/guggenheimmuseum?and%5B%5D=mediatype%3A%22texts%22&sort=titleSorter&page=1 
The plain-text (txt) version of the collection has been split into multiple files, one book per file.
The data can be found in ../data/books/{1, 2, ..., 220}.txt
There are 207 art books.


# Step 1: Load the data

First, let's read the book and make sure the document is decoded properly. 
Select the book that you want to load:
   * Open ../data/book_list.csv
   * Select the book you are interested in working with (e.g. "Marc Chagall and the Jewish theater")
   * Find the corresponding book_urn (e.g. "chagallj00chag")
   * Create a URL by replacing [your book] with the book_urn in https://raw.githubusercontent.com/AnnaNican/wcaiconf_2019/master/data/books/[your book].txt 
   (e.g. https://raw.githubusercontent.com/AnnaNican/wcaiconf_2019/master/data/books/chagallj00chag.txt )
   * Assign the URL to fileurl in the cell below

In [2]:
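try:  # Python 3
    from urllib.request import urlopen
except ImportError:  # Python 2
    from urllib2 import urlopen

fileurl = 'https://raw.githubusercontent.com/AnnaNican/wcaiconf_2019/master/data/books/chagallj00chag.txt'
# Download the raw bytes and decode them as UTF-8 text
booktext = urlopen(fileurl).read().decode('utf-8')

# Drop line breaks so the book becomes one continuous string
booktext = booktext.replace('\n', '')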

print(booktext)



# Step 2: Exploring the data


##  Tokenisation
Tokenisation is the process of splitting a raw string into a list of pieces, called tokens.

What is a token? We're interested in meaningful units of text:

* Words
* Phrases
* Punctuation
* Numbers
* Dates
* Currencies
* Hashtags
* ...?


While chopping the text into tokens, a tokeniser may also throw away certain characters, such as punctuation. 
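
As a rough illustration of what a tokeniser does, the sketch below splits a string with a single regular expression. The pattern and the `simple_tokenize` name are assumptions made for this example only. NLTK's `word_tokenize`, used in the next cell, applies more careful rules (for instance, it separates the possessive `'s` into its own token).

```python
import re

# Hypothetical pattern: a word (optionally with an internal apostrophe) or one punctuation mark
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def simple_tokenize(text):
    """Split raw text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

sample = "Chagall's murals, painted in 1920, filled the theater."
tokens = simple_tokenize(sample)
print(tokens)
```

Punctuation is kept here as separate tokens; whether to discard it is a choice that depends on the application.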

In [10]:
from nltk.tokenize import word_tokenize

try:  # py3
    all_tokens = word_tokenize(booktext)
except UnicodeDecodeError:  # py27
    all_tokens = word_tokenize(booktext.decode('utf-8'))

# Report on the token list; len(booktext) would count characters instead (687938 for this book)
print("Total number of tokens: {}".format(len(all_tokens)))
print("Sample of tokens: {}".format(all_tokens[0:10]))


## Counting Words
We start with a simple word count using collections.Counter

We are interested in finding:

* how many times a word occurs across the whole corpus (total number of occurrences)
* in how many documents a word occurs
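
The two quantities above differ only in whether repeated occurrences inside one document are counted. The sketch below shows both on a toy corpus invented for this example; the names `term_frequency` and `document_frequency` are illustrative and are not used elsewhere in the notebook.

```python
from collections import Counter

# Toy corpus (made up for illustration): each inner list is one tokenised document
docs = [
    ["chagall", "painted", "the", "theater"],
    ["the", "theater", "in", "moscow"],
    ["the", "the", "murals"],
]

# Term frequency: every occurrence counts
term_frequency = Counter(token for doc in docs for token in doc)

# Document frequency: each document counts a word at most once
document_frequency = Counter(token for doc in docs for token in set(doc))

print(term_frequency["the"])
print(document_frequency["the"])
```

Since this tutorial loads a single book, the cells that follow only compute the first quantity.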

In [11]:
from collections import Counter

total_term_frequency = Counter(all_tokens)

for word, freq in total_term_frequency.most_common(20):
    print("{}\t{}".format(word, freq))

,	9484
the	7179
.	6484
of	4064
and	3585
in	2923
a	2385
to	2096
Chagall	1197
``	1062
is	1053
's	994
his	973
''	969
)	943
The	919
(	914
was	882
that	850
with	831


## Stop-words
We notice that some of the most common words above are not very interesting.

These words are called stop-words, and they don't provide any particular meaning in isolation (articles, conjunctions, pronouns, etc.)

Notice:

* there is no "universal" list of stop-words
* removing stop-words can be useful or damaging depending on the application
* e.g. if you remove stop-words, what do you do with "The Who", "to be or not to be" and similar phrases?
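
To make the last point concrete, here is a minimal sketch with a tiny stop list invented for this example (NLTK's English list is much longer). The `remove_stop_words` helper is hypothetical and, unlike the filter used later in this notebook, it compares tokens case-insensitively.

```python
# A tiny, made-up stop list for illustration
toy_stop_list = {"the", "to", "be", "or", "not", "a", "of"}

def remove_stop_words(tokens, stop_list):
    """Keep only the tokens whose lowercase form is not in stop_list."""
    return [t for t in tokens if t.lower() not in stop_list]

band = remove_stop_words("The Who played a concert".split(), toy_stop_list)
quote = remove_stop_words("to be or not to be".split(), toy_stop_list)
print(band)
print(quote)
```

The band name loses its article and the famous quote disappears entirely, which is why stop-word removal has to be judged per application.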

In [19]:
from nltk.corpus import stopwords
import string

print(stopwords.words('english'))
print(len(stopwords.words('english')))
print(string.punctuation)

[u'i', u'me', u'my', u'myself', u'we', u'our', u'ours', u'ourselves', u'you', u'your', u'yours', u'yourself', u'yourselves', u'he', u'him', u'his', u'himself', u'she', u'her', u'hers', u'herself', u'it', u'its', u'itself', u'they', u'them', u'their', u'theirs', u'themselves', u'what', u'which', u'who', u'whom', u'this', u'that', u'these', u'those', u'am', u'is', u'are', u'was', u'were', u'be', u'been', u'being', u'have', u'has', u'had', u'having', u'do', u'does', u'did', u'doing', u'a', u'an', u'the', u'and', u'but', u'if', u'or', u'because', u'as', u'until', u'while', u'of', u'at', u'by', u'for', u'with', u'about', u'against', u'between', u'into', u'through', u'during', u'before', u'after', u'above', u'below', u'to', u'from', u'up', u'down', u'in', u'out', u'on', u'off', u'over', u'under', u'again', u'further', u'then', u'once', u'here', u'there', u'when', u'where', u'why', u'how', u'all', u'any', u'both', u'each', u'few', u'more', u'most', u'other', u'some', u'such', u'no', u'nor', u

In [20]:
stop_list = stopwords.words('english') + list(string.punctuation)

tokens_no_stop = [token for token in all_tokens
                        if token not in stop_list]

total_term_frequency_no_stop = Counter(tokens_no_stop)

for word, freq in total_term_frequency_no_stop.most_common(20):
    print("{}\t{}".format(word.encode('utf-8'), freq))

Chagall	1197
``	1062
's	994
''	969
The	919
I	793
—	666
Yiddish	592
Jewish	563
theater	455
art	433
Russian	364
Theater	348
In	338
world	280
one	266
And	205
Moscow	201
new	201
n't	200



Notice The, In and And above (uppercase initials), as well as Theater alongside theater.

Different variations of the same word are counted as different words (they are, after all, different strings).

## Text Normalisation
Text normalisation replaces tokens with a canonical form, so we can group together different spellings/variations of the same word.

Examples:

* lowercasing
* stemming
* American-to-British mapping
* synonym mapping

Stemming is the process of reducing a word to its base/root form, called the stem.

In [22]:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
all_tokens_lower = [t.lower() for t in all_tokens]

tokens_normalised = [stemmer.stem(t) for t in all_tokens_lower
                                     if t not in stop_list]

total_term_frequency_normalised = Counter(tokens_normalised)

for word, freq in total_term_frequency_normalised.most_common(20):
    print("{}\t{}".format(word.encode('utf-8'), freq))

chagal	1203
``	1062
's	994
''	969
theater	828
—	666
art	593
yiddish	592
jewish	568
paint	465
russian	368
world	337
artist	331
new	327
one	316
work	297
jew	222
life	214
time	213
moscow	201



Tips:

* a stem is not always a word
* be careful with one-way transformations (like lowercasing)
* wrap your preprocessing steps in a function / chain of functions for better design
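
One possible way to follow the last tip is sketched below. The `preprocess` function, its signature and the toy inputs are assumptions for this example, not part of the original notebook.

```python
from collections import Counter
import string

def preprocess(tokens, stop_list, stem=None):
    """Lowercase tokens, drop stop-words, then optionally apply a stemming function."""
    lowered = (t.lower() for t in tokens)
    kept = [t for t in lowered if t not in stop_list]
    return [stem(t) for t in kept] if stem else kept

# Toy inputs, made up for illustration
toy_stop_list = {"the", "of", "and"} | set(string.punctuation)
toy_tokens = ["The", "Theater", "of", "Chagall", ",", "the", "theater", "."]

counts = Counter(preprocess(toy_tokens, toy_stop_list))
print(counts.most_common(2))
```

With the objects defined in the cells above, the same steps could be run as `preprocess(all_tokens, stop_list, stemmer.stem)`.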

## n-grams
When we are interested in phrases rather than single terms, we can look into n-grams

An n-gram is a sequence of n adjacent terms.

Commonly used n-grams include bigrams (n=2) and trigrams (n=3).
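
The definition can be written out in a few lines of plain Python. This is only a sketch for intuition; `make_ngrams` is a hypothetical helper, and the cells below use `nltk.ngrams` instead.

```python
def make_ngrams(tokens, n):
    """Return the list of n-grams (as tuples) of adjacent tokens."""
    # Zip the token list with itself shifted by 0, 1, ..., n-1 positions
    return list(zip(*(tokens[i:] for i in range(n))))

toy_tokens = ["the", "yiddish", "chamber", "theater"]
toy_bigrams = make_ngrams(toy_tokens, 2)
toy_trigrams = make_ngrams(toy_tokens, 3)
print(toy_bigrams)
print(toy_trigrams)
```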

In [23]:
from nltk import ngrams

phrases = Counter(ngrams(all_tokens_lower, 2))
for phrase, freq in phrases.most_common(20):
    print("{}\t{}".format(phrase, freq))

(u'of', u'the')	1163
(u'in', u'the')	812
(u',', u'and')	809
(u',', u'the')	574
(u'.', u'the')	504
(u'.', u'.')	446
(u'chagall', u"'s")	399
(u'to', u'the')	363
(u')', u'.')	358
(u'and', u'the')	350
(u'on', u'the')	330
(u'.', u'in')	268
(u'for', u'the')	255
(u',', u'in')	252
(u')', u',')	251
(u'of', u'a')	230
(u'.', u'i')	229
(u',', u'a')	213
(u'the', u'yiddish')	209
(u'.', u"''")	204


In [24]:
phrases = Counter(ngrams(all_tokens_lower, 3))
for phrase, freq in phrases.most_common(20):
    print("{}\t{}".format(phrase, freq))

(u'.', u'.', u'.')	268
(u',', u'and', u'the')	96
(u'the', u'yiddish', u'theater')	96
(u',', u'pp', u'.')	79
(u'yiddish', u'chamber', u'theater')	70
(u"''", u'(', u'in')	67
(u'of', u'chagall', u"'s")	65
(u'new', u'york', u':')	60
(u'.', u'in', u'the')	60
(u'.', u'chagall', u"'s")	57
(u',', u'in', u'the')	54
(u'the', u'jewish', u'theater')	51
(u'of', u'the', u'yiddish')	50
(u'texts', u'and', u'documents')	48
(u'in', u'chagall', u"'s")	47
(u'.', u'no', u'.')	47
(u'the', u'yiddish', u'chamber')	46
(u'of', u'the', u'theater')	45
(u',', u'no', u'.')	45
(u'marc', u'chagall', u':')	43



### n-grams and stop-words
Stop-word removal will affect n-grams

e.g. phrases like "a pinch of salt" become "pinch salt" after stop-word removal
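
The effect on the example phrase can be checked directly. The stop list below is a made-up two-word list for this illustration, and `bigrams` is a hypothetical helper.

```python
phrase = "a pinch of salt".split()
toy_stop_list = {"a", "of"}  # made-up stop list for this example

def bigrams(tokens):
    """Pair each token with its successor."""
    return list(zip(tokens, tokens[1:]))

before = bigrams(phrase)
after = bigrams([t for t in phrase if t not in toy_stop_list])
print(before)
print(after)
```

After filtering, "pinch" and "salt" become adjacent and form a bigram that never occurs in the original text.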

In [25]:
phrases = Counter(ngrams(tokens_no_stop, 2))

for phrase, freq in phrases.most_common(20):
    print("{}\t{}".format(phrase, freq))

(u'Chagall', u"'s")	399
(u'Marc', u'Chagall')	137
(u'New', u'York')	111
(u'Chamber', u'Theater')	110
(u'Yiddish', u'Theater')	99
(u'Yiddish', u'theater')	81
(u'Yiddish', u'Chamber')	70
(u'Sholem', u'Aleichem')	70
(u'Jewish', u'Theater')	62
(u'``', u'The')	54
(u'Lanternshooter', u'Menakhem-Mendel')	53
(u'The', u'Russian')	48
(u'Texts', u'Documents')	47
(u'St.', u'Petersburg')	46
(u'``', u'Chagall')	46
(u'Chagall', u'The')	46
(u'Granovskii', u"'s")	44
(u"''", u'Russian')	44
(u"''", u'Yiddish')	44
(u"''", u'The')	42


In [26]:
phrases = Counter(ngrams(tokens_no_stop, 3))

for phrase, freq in phrases.most_common(20):
    print("{}\t{}".format(phrase, freq))

(u'Yiddish', u'Chamber', u'Theater')	70
(u'Chagall', u'The', u'Russian')	39
(u'The', u'Russian', u'Years')	38
(u'Solomon', u'R.', u'Guggenheim')	31
(u'Marc', u'Chagall', u'The')	30
(u'Menakhem-Mendel', u'Lanternshooter', u'Menakhem-Mendel')	24
(u'Bakingfish', u'Lanternshooter', u'Menakhem-Mendel')	24
(u'State', u"Tret'iakov", u'Gallery')	24
(u'Lanternshooter', u'Menakhem-Mendel', u'Lanternshooter')	23
(u'``', u'Marc', u'Chagall')	23
(u'Chagall', u"'s", u'art')	21
(u'State', u'Jewish', u'Chamber')	19
(u'Jewish', u'Chamber', u'Theater')	19
(u'State', u'Yiddish', u'Chamber')	19
(u'Texts', u'Documents', u'1')	18
(u'Sholem', u'Aleichem', u'Evening')	17
(u'R.', u'Guggenheim', u'Museum')	17
(u'Vitali', u'Marc', u'Chagall')	17
(u'Sholem', u'Aleichem', u"'s")	17
(u'Chagall', u"'s", u'paintings')	17







# Step 3: Named Entity Recognition / Entity Extraction

Named entity recognition (NER) is probably the first step towards information extraction. It seeks to locate and classify named entities in text into pre-defined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc. NER is used in many fields of Natural Language Processing (NLP), and it can help answer many real-world questions, such as:

* Which companies were mentioned in the news article?
* Were specified products mentioned in complaints or reviews?
* Does the tweet contain the name of a person? Does the tweet contain this person’s location?

In this step we run a [named entity](https://en.wikipedia.org/wiki/Named_entity) recognizer with spaCy to identify the names of things, such as persons, organizations, or locations, in the raw text. Let’s get started!


##  Entity Extraction with spaCy






In [1]:
# in case requirements.txt would not work
# import sys
# !{sys.executable} -m pip install spacy
# !{sys.executable} -m spacy download en

In [None]:
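import spacy
from spacy import displacy
# import en_core_web_sm
# nlp = en_core_web_sm.load()
nlp = spacy.load("en")  # on spaCy 3+, where shortcut names were removed, load "en_core_web_sm" instead

doc = nlp(booktext)

# doc.ents holds the recognised entities, each with its text and label
print("Total number of entities: {}".format(len(doc.ents)))
print("Sample of entities: {}".format([(X.text, X.label_) for X in doc.ents][:10]))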


In [None]:
from collections import Counter
labels = [x.label_ for x in doc.ents]
Counter(labels)

# Step 4: Summarization

There are two types of text summarization algorithms: *extractive* and *abstractive*. 

* Extractive summarization algorithms attempt to score the phrases or sentences in a document and return only the most highly informative blocks of text.

* Abstractive text summarization actually creates new text which doesn’t exist in that form in the document. Abstractive summarization is what you might do when explaining a book you read to your friend, and it is much more difficult for a computer to do than extractive summarization.

### PyTeaser

[PyTeaser](https://github.com/xiaoxu193/PyTeaser) is a Python implementation of the Scala project TextTeaser, which is a heuristic approach for extractive text summarization. TextTeaser associates a score with every sentence. This score is a linear combination of features extracted from that sentence. The features that TextTeaser looks at are:

* titleFeature: The count of words which are common to the title of the document and the sentence.
* sentenceLength: The authors of TextTeaser defined a constant “ideal” (with value 20), which represents the ideal sentence length in number of words. sentenceLength is calculated as a normalized distance from this value.
* sentencePosition: Normalized sentence number (position in the list of sentences).
* keywordFrequency: Term frequency in the bag-of-words model (after removing stop words).
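
The idea of scoring a sentence as a linear combination of features can be sketched as follows. This is not PyTeaser's code: the function names are hypothetical, only two of the four features are shown, the title feature is normalised by the title length, and the equal weights are an assumption made for this example.

```python
IDEAL_LENGTH = 20  # the "ideal" number of words mentioned above

def sentence_length_score(words, ideal=IDEAL_LENGTH):
    """1.0 for a sentence of exactly `ideal` words, falling linearly with the distance."""
    return 1 - abs(ideal - len(words)) / float(ideal)

def title_score(words, title_words):
    """Fraction of title words that also appear in the sentence."""
    title = set(w.lower() for w in title_words)
    overlap = title & set(w.lower() for w in words)
    return len(overlap) / float(len(title)) if title else 0.0

def combined_score(features, weights):
    """Linear combination: sum of weight * feature value."""
    return sum(weights[name] * value for name, value in features.items())

sentence = "Chagall painted murals for the Yiddish theater".split()
features = {
    "title": title_score(sentence, "Book about Chagall".split()),
    "length": sentence_length_score(sentence),
}
# Equal weights are an assumption for this sketch, not PyTeaser's actual values
weights = {"title": 1.0, "length": 1.0}
score = combined_score(features, weights)
print(score)
```

A summarizer of this kind would compute such a score for every sentence and return the top-scoring ones.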



In [5]:
from pyteaser import Summarize
summaries = Summarize("Book about Chagall", booktext)
print(summaries)

[u'Therefore, Chagall meditating on his visions, Chagall the draftsman, is perceived even more sharply than Chagall the painter.', u'See Grigori Kasovsky, "Chagall and the Jewish Art Programme," in Vitali, Marc Chagall: The Russian Years ipo6-ip22, p. 57. 66.', u'It wants to produce the kernel trom which a normal Yiddish theater, Yiddish theater art in a European sense, will develop.', u'The Vilna artists have lived to see Chagall with their own eyes and to hear him speak in the international language, the Esperanto, called Jewish art. " Chagall delivered the opening address.', u'M. Chagall, "Letter to Pavel Davidovitch Ettinger 1920," in Vitali, Marc Chagall: The Russian Years lpo6~ip22, pp. 73\u201475. 71.']


### Gensim 

The [gensim.summarization module](https://radimrehurek.com/gensim/summarization/summariser.html) implements TextRank, an unsupervised algorithm based on weighted graphs, from a paper by Mihalcea et al. TextRank works as follows:

* Pre-process the text: remove stop words and stem the remaining words.
* Create a graph where vertices are sentences.
* Connect every sentence to every other sentence by an edge. The weight of the edge is how similar the two sentences are.
* Run the PageRank algorithm on the graph.
* Pick the vertices (sentences) with the highest PageRank score.

In the original TextRank, the weight of an edge between two sentences is the percentage of words appearing in both of them. 
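
The graph steps above can be sketched in plain Python. This is a simplified illustration, not gensim's implementation: pre-processing is skipped, the Jaccard overlap used for edge weights is one possible reading of "percentage of words appearing in both", and the damping factor and iteration count are assumed values.

```python
def overlap_similarity(a, b):
    """Share of distinct words that two tokenised sentences have in common."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / float(len(union)) if union else 0.0

def textrank(sentences, damping=0.85, iterations=50):
    """Score sentences by PageRank-style power iteration over a similarity graph."""
    n = len(sentences)
    # Edge weights between every pair of distinct sentences
    weights = [[overlap_similarity(sentences[i], sentences[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sums = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        # Each sentence receives score from its neighbours, in proportion to edge weight
        scores = [
            (1 - damping) / n + damping * sum(
                scores[j] * weights[j][i] / out_sums[j]
                for j in range(n) if out_sums[j] > 0)
            for i in range(n)
        ]
    return scores

toy_sentences = [s.split() for s in [
    "chagall painted the yiddish theater murals",
    "the yiddish theater opened in moscow",
    "the murals of chagall filled the theater",
    "it rained on tuesday",
]]
scores = textrank(toy_sentences)
best = max(range(len(toy_sentences)), key=scores.__getitem__)
print(" ".join(toy_sentences[best]))
```

The sentence that shares the most vocabulary with the others ranks highest, while the unrelated last sentence keeps only the baseline score.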


In [None]:
from gensim.summarization.summarizer import summarize
print(summarize(booktext))

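Tested with gensim version 3.4.0. The gensim.summarization module was removed in gensim 4.0, so this cell needs a 3.x release.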


### LexRank (sumy)

LexRank is an unsupervised, graph-based approach similar to TextRank. LexRank uses IDF-modified cosine as the similarity measure between two sentences. This similarity is used as the weight of the graph edge between two sentences. LexRank also incorporates a post-processing step which makes sure that the top sentences chosen for the summary are not too similar to each other.

More on LexRank Vs. TextRank can be found here.
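
For intuition, the IDF-modified cosine between two sentences can be sketched as below. This is an illustrative reimplementation, not sumy's code: each sentence is treated as one document when computing IDF, and the helper names are hypothetical.

```python
import math
from collections import Counter

def inverse_document_frequency(sentences):
    """idf(w) = log(N / number of sentences containing w)."""
    n = len(sentences)
    df = Counter(w for s in sentences for w in set(s))
    return {w: math.log(n / float(count)) for w, count in df.items()}

def idf_modified_cosine(x, y, idf):
    """Cosine similarity between two tokenised sentences with tf * idf term weights."""
    tf_x, tf_y = Counter(x), Counter(y)
    numerator = sum(tf_x[w] * tf_y[w] * idf[w] ** 2 for w in set(tf_x) & set(tf_y))
    norm_x = math.sqrt(sum((tf_x[w] * idf[w]) ** 2 for w in tf_x))
    norm_y = math.sqrt(sum((tf_y[w] * idf[w]) ** 2 for w in tf_y))
    return numerator / (norm_x * norm_y) if norm_x and norm_y else 0.0

toy_sentences = [s.split() for s in [
    "the yiddish theater opened",
    "the yiddish theater closed",
    "the weather was cold",
]]
idf = inverse_document_frequency(toy_sentences)
related = idf_modified_cosine(toy_sentences[0], toy_sentences[1], idf)
unrelated = idf_modified_cosine(toy_sentences[0], toy_sentences[2], idf)
print(related)
print(unrelated)
```

Because "the" occurs in every sentence, its IDF is zero and it contributes nothing, so two sentences that share only that word get a similarity of 0.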

Note on running time: extremely slow

In [None]:
# Import library essentials
from sumy.parsers.plaintext import PlaintextParser  # plaintext parser; other parsers are available for HTML etc.
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer  # LexRank; other algorithms are also built in

# Parse the book text into a sumy document (sentences and words)
parser = PlaintextParser.from_string(booktext, Tokenizer("english"))
summarizer = LexRankSummarizer()

summary = summarizer(parser.document, 5)  # summarize the document with 5 sentences

for sentence in summary:
    print(sentence)




### Luhn (sumy)

The Luhn algorithm is one of the earliest summarization algorithms, proposed by and named after the IBM researcher Hans Peter Luhn. It scores sentences based on the frequency of the most important words.

Note on running time: super fast
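
A heavily simplified sketch of frequency-based sentence scoring is shown below. It is not sumy's `LuhnSummarizer`: the scoring rule (significant words squared, divided by sentence length), the `top_n` cut-off and the toy stop list are assumptions chosen for this illustration.

```python
from collections import Counter

def luhn_style_scores(sentences, stop_list, top_n=2):
    """Score each tokenised sentence by (significant words)^2 / sentence length."""
    # Significant words: the most frequent words that are not stop-words
    frequency = Counter(w for s in sentences for w in s if w not in stop_list)
    significant = set(w for w, _ in frequency.most_common(top_n))
    scores = []
    for s in sentences:
        hits = sum(1 for w in s if w in significant)
        scores.append(hits ** 2 / float(len(s)) if s else 0.0)
    return scores

toy_sentences = [s.split() for s in [
    "chagall painted the theater",
    "the theater honoured chagall",
    "chagall left",
    "it rained on tuesday",
]]
toy_scores = luhn_style_scores(toy_sentences, stop_list={"the", "it", "on"})
print(toy_scores)
```

Sentences dense in frequent content words score highest, and a single pass over the text is enough, which is consistent with the short running time noted above.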


In [8]:
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.luhn import LuhnSummarizer


parser = PlaintextParser.from_string(booktext, Tokenizer("english"))
summarizer_luhn = LuhnSummarizer()
summary_1 = summarizer_luhn(parser.document, 2)  # summarize the document with 2 sentences
for sentence in summary_1:
    print(sentence)


Yet for the most part those various items are not depictions of individual objects in the world but represent several recognizable domains throughout Chagall's art: old Jews of the recent religious past, as seen from the distance of a secular generation; Christian officials and peasants of the village; his own, invented "Vitebsk " as the symbolic small town of a distant Jewish world; another version of "Vitebsk," with its churches symbolizing provincial Russia; animals in that world, often humanized; his child-bride Bella and loving couples; Jesus Christ as the suffering Jew; Paris with the emblematic Eiffel Tower and the window of his studio; and, later in his career, anonymous Jewish masses, crossing the Red Sea or facing the Holocaust; and the world of the Bible.
A skillful and excellently precise brush; now fondly licking, now scratching; now bathing in the even ripple of the daubs, now scattering marvelous "Chagallian " little dots, drops and patterns, joyful and resounding, scarl

### LSA (sumy)
Latent semantic analysis (LSA) is an unsupervised method of summarization. It combines term frequency techniques with singular value decomposition to summarize texts. It is one of the more recently suggested techniques for summarization.

Note on running time: extremely slow

In [None]:
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer

parser = PlaintextParser.from_string(booktext, Tokenizer("english"))
summarizer_lsa = LsaSummarizer()
summary_2 = summarizer_lsa(parser.document, 2)  # summarize the document with 2 sentences
for sentence in summary_2:
    print(sentence)

# Step 5: Topic Modeling



In [15]:
## Bag of words
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import CountVectorizer

# LDA needs a collection of documents; as an assumption, each sentence of the book is treated as one document
documents = sent_tokenize(booktext)

# Build the document-term count matrix, dropping English stop-words
count_vectorizer = CountVectorizer(stop_words='english')
count_data = count_vectorizer.fit_transform(documents)

print(count_data.shape)

In [14]:
import warnings
warnings.simplefilter("ignore", DeprecationWarning)
# Load the LDA model from scikit-learn
from sklearn.decomposition import LatentDirichletAllocation as LDA

# Helper function
def print_topics(model, count_vectorizer, n_top_words):
    # get_feature_names_out() replaces get_feature_names() in recent scikit-learn releases
    if hasattr(count_vectorizer, "get_feature_names_out"):
        words = count_vectorizer.get_feature_names_out()
    else:
        words = count_vectorizer.get_feature_names()
    for topic_idx, topic in enumerate(model.components_):
        print("\nTopic #%d:" % topic_idx)
        # Indices of the n_top_words largest weights, in descending order
        print(" ".join([words[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))

# Tweak the two parameters below
number_topics = 5
number_words = 10
# Create and fit the LDA model
lda = LDA(n_components=number_topics)
lda.fit(count_data)
# Print the topics found by the LDA model
print("Topics found via LDA:")
print_topics(lda, count_vectorizer, number_words)