## NLP Chapter 2

### Building a Counter with bag-of-words
In this exercise, you'll build your first (in this course) ```bag-of-words``` counter using a Wikipedia article, which is loaded as ```article``` in the cells below. Try building the bag-of-words without looking at the full article text, and guess what the topic is! If you'd like to peek at the title at the end, we've included it as ```article_title```. Note that this ```article``` text has had very little preprocessing from the raw Wikipedia database entry.

In [1]:
from nltk.tokenize import word_tokenize
from collections import Counter

In [2]:
# Read the whole article; the with-block closes the file afterwards
with open('datasets/Wikipedia articles/wiki_text_debugging.txt') as wikipedia_file:
    article = wikipedia_file.read()


In [3]:
# Tokenize the article: tokens
tokens = word_tokenize(article)

# Convert the tokens into lowercase: lower_tokens
lower_tokens = [t.lower() for t in tokens]

# Create a Counter with the lowercase tokens: bow_simple
bow_simple = Counter(lower_tokens)

# Print the 10 most common tokens
print(bow_simple.most_common(10))

[(',', 151), ('the', 150), ('.', 89), ('of', 81), ("''", 68), ('to', 63), ('a', 60), ('in', 44), ('and', 41), ('debugging', 40)]
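
The same counting step can be tried without any NLTK data. This is only a minimal sketch: the sample sentence and the regex tokenizer are assumptions standing in for the Wikipedia article and `word_tokenize`. Unlike `word_tokenize`, the regex drops punctuation instead of returning it as tokens, which is why `','` and `'.'` rank so high in the output above but would not appear here.

```python
import re
from collections import Counter

# Hypothetical sample text standing in for the article
sample = "The bug was found. The bug was fixed, and the fix was tested."

# Simple regex tokenizer (an assumption; word_tokenize also emits punctuation tokens)
sample_tokens = re.findall(r"[A-Za-z]+", sample)

# Lowercase so that "The" and "the" are counted together
sample_lower = [t.lower() for t in sample_tokens]

# Count each distinct token
sample_bow = Counter(sample_lower)

print(sample_bow.most_common(3))
```

Lowercasing before counting matters: without it, "The" and "the" would be tallied as two different tokens.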


### Text preprocessing practice
```
from nltk.stem import WordNetLemmatizer
```
Preprocessing helps make for better input data when performing machine learning or other statistical methods.

Examples:
* Tokenization to create a bag of words
* Lowercasing words
* Lemmatization/stemming: shortening words to their root stems
* Removing stop words, punctuation, or unwanted tokens

Now it's your turn to apply the techniques you've learned to clean up text for better NLP results. You'll need to remove stop words and non-alphabetic tokens, lemmatize, and build a new bag-of-words from your cleaned text.

**Preprocessing example**

* **Input text:** Cats, dogs and birds are common pets. So are fish.
* **Output tokens:** cat, dog, bird, common, pet, fish
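
The example above can be reproduced with a toy pipeline. This is a sketch under loud assumptions: `TOY_STOPS` is a three-word hypothetical stop list, and `toy_lemmatize` is a naive plural stripper standing in for `WordNetLemmatizer`, which does real dictionary lookups. Neither is how NLTK works internally. The sketch only shows the order of the steps.

```python
import re

# Tiny hypothetical stop-word list; the notebook uses NLTK's English list instead
TOY_STOPS = {"and", "are", "so"}

def toy_lemmatize(word):
    """Naive stand-in for a lemmatizer: strip a plural 's' from longer words."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def toy_preprocess(text):
    # Lowercase and keep alphabetic runs only (drops punctuation and digits)
    tokens = re.findall(r"[a-z]+", text.lower())
    # Remove stop words, then reduce each remaining token to its toy lemma
    return [toy_lemmatize(t) for t in tokens if t not in TOY_STOPS]

print(toy_preprocess("Cats, dogs and birds are common pets. So are fish."))
# ['cat', 'dog', 'bird', 'common', 'pet', 'fish']
```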



In [4]:
# Import WordNetLemmatizer
from nltk.stem import WordNetLemmatizer
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Krishna\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### Extracting the data from PDF

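The stop-word list ```english_stops``` can come from either of two sources: the local file ```datasets/english_stopwords.txt```, or NLTK's built-in list, ```stopwords.words('english')``` (after ```from nltk.corpus import stopwords```). If you read the file into a single string with ```''.join(open(...))```, split it into a list or set of words before testing membership against it.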
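
One caveat when loading stop words from a file: testing membership against a single string is a substring test, not a whole-word test. The sketch below illustrates the difference. The file contents are hypothetical and assume one stop word per line, which may not match the real `english_stopwords.txt`.

```python
# Hypothetical file contents, assuming one stop word per line
stop_text = "the\nand\nare\n"

# Membership in a string is a substring test, so "he" matches inside "the"
print("he" in stop_text)

# Splitting into a set gives exact, whole-word matching
stop_set = set(stop_text.split())
print("he" in stop_set)
print("the" in stop_set)
```

A set also makes each lookup constant-time, which helps when filtering every token of a long document.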

In [10]:
import PyPDF2 
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

In [20]:
# Path of the PDF to read (to process many files, loop over a list of paths)
filename = "datasets/Spring Microservices.pdf"

# Open the file in binary mode; the with-block closes it when we are done
with open(filename, 'rb') as pdfFileObj:
    # pdfReader is a readable object that will be parsed.
    # PdfFileReader, numPages, getPage and extractText are the older PyPDF2 names;
    # PyPDF2 3.x replaces them with PdfReader, reader.pages and extract_text().
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    # Knowing the number of pages lets us loop over every page
    num_pages = pdfReader.numPages
    text = ""
    # Read each page and append its text
    for page_number in range(num_pages):
        text += pdfReader.getPage(page_number).extractText()

# PyPDF2 cannot read scanned files, so an empty string means no text was extracted.
# In that case, fall back to the OCR library textract to convert the
# scanned/image-based PDF into text (textract.process returns bytes).
if text == "":
    import textract
    text = textract.process(filename, method='tesseract', language='eng').decode('utf-8')

# text now contains all the text derived from the PDF. Type print(text) to see it.
# It likely contains a lot of spaces and junk such as '\n'.
# Next, we clean the text variable and turn it into a list of keywords.



In [22]:
# Tokenize the article: tokens
tokens = word_tokenize(text)

# Convert the tokens into lowercase: lower_tokens
lower_tokens = [t.lower() for t in tokens]

# Retain alphabetic words: alpha_only
alpha_only = [t for t in lower_tokens if t.isalpha()]

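# Remove all stop words: no_stops
# (build the stop-word set once instead of reloading the list for every token)
english_stops = set(stopwords.words('english'))
no_stops = [t for t in alpha_only if t not in english_stops]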

# Instantiate the WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# Lemmatize all tokens into a new list: lemmatized
lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in no_stops]

# Create the bag-of-words: bow
bow = Counter(lemmatized)

# Print the 10 most common tokens
print(bow.most_common(10))


[('service', 883), ('microservices', 819), ('application', 480), ('spring', 349), ('case', 304), ('also', 297), ('microservice', 285), ('following', 260), ('data', 255), ('one', 249)]


### Gensim
Gensim is a free Python library designed to automatically extract semantic topics from documents, as efficiently (computer-wise) and painlessly (human-wise) as possible.

```Gensim``` is designed to process raw, unstructured digital texts (“plain text”).

The algorithms in ```Gensim```, such as ```Word2Vec```, ```FastText```, Latent Semantic Analysis (LSA, also called LSI; see ```LsiModel```), and Latent Dirichlet Allocation (LDA; see ```LdaModel```), automatically discover the semantic structure of documents by examining statistical co-occurrence patterns within a corpus of training documents. These algorithms are unsupervised, which means no human labeling is necessary: you only need a corpus of plain text documents.

### Creating and querying a corpus with gensim
It's time to apply the methods you learned in the previous video to create your first gensim dictionary and corpus!

You'll use these data structures to investigate word trends and potentially interesting topics in your document set. To get started, we have imported a few additional messy articles from Wikipedia, which were preprocessed by lowercasing all words, tokenizing them, and removing stop words and punctuation. These were then stored in a list of document tokens called ```articles```. You'll need to do some light preprocessing and then generate the gensim dictionary and corpus.

In [28]:
# My Code - to extract data from txt files in Wikipedia articles folder
import glob
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Write the pattern: pattern
pattern = 'datasets/Wikipedia articles/wiki_text*.txt'

# Save all file matches: txt_files
txt_files = glob.glob(pattern)

# Instantiate the WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

read_files = []
for txt in txt_files:

    with open(txt, 'r', encoding="utf8") as content_file:
        read_files.append(content_file.read())

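# Build the stop-word set once instead of reloading the list for every token
english_stops = set(stopwords.words('english'))

articles = []
for file_text in read_files:
    # Lowercase, tokenize, and keep alphabetic tokens only
    tokens = [w for w in word_tokenize(file_text.lower()) if w.isalpha()]
    no_stops = [t for t in tokens if t not in english_stops]
    lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in no_stops]
    articles.append(lemmatized)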

In [9]:
print(articles[3][:20])

['multiple', 'one', 'file', 'debugging', 'debugger', 'debugging', 'tool', 'computer', 'program', 'used', 'software', 'programs', 'target', 'program', 'code', 'examined', 'might', 'alternatively', 'running', 'set']


In [29]:
# Import Dictionary
from gensim.corpora.dictionary import Dictionary

# Create a Dictionary from the articles: dictionary
dictionary = Dictionary(articles)

# Select the id for "computer": computer_id
computer_id = dictionary.token2id.get("computer")

# Use computer_id with the dictionary to print the word
print(dictionary.get(computer_id))

# Create the bag-of-words corpus (a plain list of doc2bow vectors, not an MmCorpus): corpus
corpus = [dictionary.doc2bow(article) for article in articles]

# Print the first 10 word ids with their frequency counts from the fifth document
print(corpus[4][:10])


computer
[(1, 1), (13, 1), (14, 1), (17, 1), (24, 1), (27, 1), (33, 1), (34, 4), (42, 2), (43, 7)]
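
To make the two data structures concrete, here is a pure-Python sketch of what a token-to-id dictionary and a bag-of-words corpus contain. It is not gensim's implementation: the documents are made up, the `toy_` names are hypothetical, and assigning ids in first-seen order is an assumption that may differ from the ids gensim assigns.

```python
from collections import Counter

# Hypothetical pre-tokenized documents
toy_articles = [
    ["computer", "program", "bug", "program"],
    ["computer", "language", "computer"],
]

# Map each distinct token to an integer id (first-seen order is an assumption)
toy_token2id = {}
for toy_doc in toy_articles:
    for token in toy_doc:
        toy_token2id.setdefault(token, len(toy_token2id))

# Reverse mapping, for looking a word up by its id
toy_id2token = {i: t for t, i in toy_token2id.items()}

def toy_doc2bow(doc):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(toy_token2id[t] for t in doc)
    return sorted(counts.items())

toy_corpus = [toy_doc2bow(d) for d in toy_articles]
print(toy_token2id["computer"], toy_id2token[toy_token2id["computer"]])
print(toy_corpus)
```

Each document becomes a sparse list of `(id, count)` pairs, so words that do not occur in a document take no space in its vector.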


### Gensim bag-of-words
Now, you'll use your new gensim corpus and dictionary to see the most common terms per document and across all documents. You can use your dictionary to look up the terms. Take a guess at what the topics are and feel free to explore more documents in the IPython Shell!

You have access to the dictionary and corpus objects you created in the previous exercise, as well as the Python defaultdict and itertools to help with the creation of intermediate data structures for analysis.

The fifth document from corpus is stored in the variable ```doc```; a copy sorted by word count in descending order is stored in ```bow_doc```.

In [31]:
from collections import defaultdict
import itertools

# Save the fifth document: doc
doc = corpus[4]

# Sort the doc for frequency: bow_doc
bow_doc = sorted(doc, key=lambda w: w[1], reverse=True)

# Print the top 5 words of the document alongside the count
for word_id, word_count in bow_doc[:5]:
    print(dictionary.get(word_id), word_count)
    
# Create the defaultdict: total_word_count
total_word_count = defaultdict(int)
for word_id, word_count in itertools.chain.from_iterable(corpus):
    total_word_count[word_id] += word_count
    
# Create a sorted list from the defaultdict: sorted_word_count
sorted_word_count = sorted(total_word_count.items(), key=lambda w: w[1], reverse=True) 

# Print the top 5 words across all documents alongside the count
for word_id,word_count in sorted_word_count[:5]:
    print(dictionary.get(word_id), word_count)


debugging 40
system 25
bug 16
software 16
problem 15
computer 749
software 450
program 340
cite 322
language 320
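
The corpus-wide totals come from flattening every document's `(word_id, count)` pairs into one stream and summing per id. A minimal sketch on a made-up two-document corpus (the `mini_` names and counts are hypothetical):

```python
from collections import defaultdict
import itertools

# Hypothetical corpus: each document is a list of (word_id, count) pairs
mini_corpus = [
    [(0, 2), (1, 1)],
    [(1, 3), (2, 1)],
]

# chain.from_iterable flattens the per-document lists into one stream of pairs
mini_totals = defaultdict(int)
for word_id, word_count in itertools.chain.from_iterable(mini_corpus):
    mini_totals[word_id] += word_count

# Sort by total count, highest first
mini_sorted = sorted(mini_totals.items(), key=lambda pair: pair[1], reverse=True)
print(mini_sorted)
# [(1, 4), (0, 2), (2, 1)]
```

`defaultdict(int)` starts every unseen id at 0, so no existence check is needed before adding.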


### What is tf-idf?
**Term frequency - inverse document frequency**

* Allows you to determine the most important words in each document
* Each corpus may have shared words beyond just stop words
* These words should be down-weighted in importance
* Example from astronomy: "sky"
* Ensures the most common words don't show up as keywords
* Keeps document-specific frequent words weighted high

**Tf-idf formula**

$$w_{i,j} = tf_{i,j} \cdot \log\left(\frac{N}{df_i}\right)$$

* $w_{i,j}$ = tf-idf weight for token $i$ in document $j$
* $tf_{i,j}$ = number of occurrences of token $i$ in document $j$
* $df_i$ = number of documents that contain token $i$
* $N$ = total number of documents

You want to calculate the tf-idf weight for the word "computer", which appears five times in a document containing 100 words. Given a corpus containing 200 documents, with 20 documents mentioning the word "computer", tf-idf can be calculated by multiplying term frequency with inverse document frequency.

Term frequency = the word's share of all tokens in the document (a length-normalized variant of the raw count used in the formula above).

Inverse document frequency = the logarithm of the total number of documents in the corpus divided by the number of documents containing the term.
```
(5 / 100) * log(200 / 20)

```
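
The expression above can be evaluated directly. The text does not say which logarithm base is intended, so this sketch computes the weight for three common bases. The variable names are illustrative only.

```python
import math

tf = 5 / 100          # "computer" appears 5 times in a 100-word document
n_docs = 200          # total number of documents in the corpus
doc_freq = 20         # number of documents that mention "computer"
ratio = n_docs / doc_freq

tfidf_ln = tf * math.log(ratio)       # natural logarithm
tfidf_log10 = tf * math.log10(ratio)  # base-10 logarithm
tfidf_log2 = tf * math.log2(ratio)    # base-2 logarithm

print(round(tfidf_ln, 4), round(tfidf_log10, 4), round(tfidf_log2, 4))
```

The base only rescales every weight by the same constant factor, so the ranking of words within a corpus does not depend on it.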

### Tf-idf with Wikipedia
Now it's your turn to determine new significant terms for your corpus by applying gensim's tf-idf. You will again have access to the same corpus and dictionary objects you created in the previous exercises - dictionary, corpus, and doc. Will tf-idf make for more interesting results on the document level?

In [33]:
# Import TfidfModel
from gensim.models.tfidfmodel import TfidfModel

# Create a new TfidfModel using the corpus: tfidf
tfidf = TfidfModel(corpus)

# Calculate the tfidf weights of doc: tfidf_weights
tfidf_weights = tfidf[doc]

# Print the first five weights
print(tfidf_weights[:5])

# Sort the weights from highest to lowest: sorted_tfidf_weights
sorted_tfidf_weights = sorted(tfidf_weights, key=lambda w: w[1], reverse=True)

# Print the top 5 weighted words
for term_id, weight in sorted_tfidf_weights[:5]:
    print(dictionary.get(term_id), weight)


[(1, 0.013142166260971512), (13, 0.013142166260971512), (14, 0.013142166260971512), (17, 0.013142166260971512), (24, 0.020829840701882617)]
wolf 0.2355708661191282
debugging 0.21817967280241635
fence 0.18845669289530256
squeeze 0.14134251967147693
tron 0.14134251967147693
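
The output shows why corpus-wide words such as "computer" drop out of the top terms: a word that occurs in every document gets an inverse document frequency of log(1) = 0. The pure-Python sketch below illustrates this on a made-up corpus. It is not gensim's code. The base-2 logarithm and the scaling of each document to unit length are assumptions meant to resemble the `TfidfModel` defaults, and `toy_tfidf` is a hypothetical helper.

```python
import math

# Hypothetical corpus of (word_id, count) pairs; word 0 occurs in every document
tiny_corpus = [
    [(0, 1), (1, 2), (2, 1)],
    [(0, 1), (2, 1)],
    [(0, 3), (3, 1)],
]

def toy_tfidf(doc, corpus):
    n_docs = len(corpus)
    # Document frequency: number of documents containing each word id
    df = {}
    for other in corpus:
        for word_id, _ in other:
            df[word_id] = df.get(word_id, 0) + 1
    # Raw weight = count * log2(N / df); zero-weight words are dropped
    raw = [(w, c * math.log2(n_docs / df[w])) for w, c in doc]
    raw = [(w, x) for w, x in raw if x > 0]
    # Scale so the squared weights of the document sum to 1
    norm = math.sqrt(sum(x * x for _, x in raw))
    return [(w, x / norm) for w, x in raw] if norm else []

tiny_weights = toy_tfidf(tiny_corpus[0], tiny_corpus)
print(tiny_weights)
```

Word 0 disappears from the result even though it occurs in the document, while word 1, which is unique to this document, receives the largest weight.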


### summarization.summarizer – TextRank Summariser

In [41]:
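# Note: the gensim.summarization module exists only in gensim releases before 4.0
from gensim.summarization.summarizer import summarize
from gensim.summarization import keywords

# Read the whole article; the with-block closes the file afterwards
with open('datasets/Wikipedia articles/wiki_text_debugging.txt') as wikipedia_file:
    article = wikipedia_file.read()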
print(article)

'''Debugging''' is the process of finding and resolving of defects that prevent correct operation of computer software or a system.  

Numerous books have been written about debugging (see below: #Further reading|Further reading), as it involves numerous aspects, including interactive debugging, control flow, integration testing, Logfile|log files, monitoring (Application monitoring|application, System Monitoring|system), memory dumps, Profiling (computer programming)|profiling, Statistical Process Control, and special design tactics to improve detection while simplifying changes.

Origin
A computer log entry from the Mark&nbsp;II, with a moth taped to the page

The terms "bug" and "debugging" are popularly attributed to Admiral Grace Hopper in the 1940s.[http://foldoc.org/Grace+Hopper Grace Hopper]  from FOLDOC While she was working on a Harvard Mark II|Mark II Computer at Harvard University, her associates discovered a moth stuck in a relay and thereby impeding operation, whereupon s

In [40]:
summarized_text = summarize(article)
print(summarized_text)

Numerous books have been written about debugging (see below: #Further reading|Further reading), as it involves numerous aspects, including interactive debugging, control flow, integration testing, Logfile|log files, monitoring (Application monitoring|application, System Monitoring|system), memory dumps, Profiling (computer programming)|profiling, Statistical Process Control, and special design tactics to improve detection while simplifying changes.
The terms "bug" and "debugging" are popularly attributed to Admiral Grace Hopper in the 1940s.[http://foldoc.org/Grace+Hopper Grace Hopper]  from FOLDOC While she was working on a Harvard Mark II|Mark II Computer at Harvard University, her associates discovered a moth stuck in a relay and thereby impeding operation, whereupon she remarked that they were "debugging" the system.
However the term "bug" in the meaning of technical error dates back at least to 1878 and Thomas Edison (see software bug for a full discussion), and "debugging" seems 

In [58]:
print(keywords(article))

uber
companies
taxi company
value
values
false
political
economy
silicon
hero
marissa
management
ruthless greed
personal dressing
unroll
rights
