## NLP Chapter 3

### Building a Counter with bag-of-words
In this exercise, you'll build your first (in this course) ```bag-of-words``` counter using a Wikipedia article, which has been pre-loaded as ```article```. Try building the bag-of-words without looking at the full article text and guess what the topic is! If you'd like to peek at the title at the end, we've included it as ```article_title```. Note that this ```article``` text has had very little preprocessing from the raw Wikipedia database entry.

In [1]:
from nltk.tokenize import word_tokenize
from collections import Counter

In [2]:
# Read the whole article; the with-block closes the file afterwards
with open('datasets/Wikipedia articles/wiki_text_debugging.txt') as wikipedia_file:
    article = wikipedia_file.read()


In [3]:
# Tokenize the article: tokens
tokens = word_tokenize(article)

# Convert the tokens into lowercase: lower_tokens
lower_tokens = [t.lower() for t in tokens]

# Create a Counter with the lowercase tokens: bow_simple
bow_simple = Counter(lower_tokens)

# Print the 10 most common tokens
print(bow_simple.most_common(10))

[(',', 151), ('the', 150), ('.', 89), ('of', 81), ("''", 68), ('to', 63), ('a', 60), ('in', 44), ('and', 41), ('debugging', 40)]
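
The same counting idea can be tried on a tiny string without NLTK. The following is a minimal sketch, assuming that whitespace splitting is an acceptable stand-in for `word_tokenize`; `sample_text`, `toy_tokens`, and `toy_bow` are hypothetical names used only for illustration.

```python
from collections import Counter

# Hypothetical sample text, not taken from the Wikipedia article
sample_text = "The cat chased the other cat while the dog slept"

# Whitespace splitting stands in for word_tokenize (assumption for this sketch)
toy_tokens = [t.lower() for t in sample_text.split()]

# Count how often each lowercase token occurs
toy_bow = Counter(toy_tokens)

# The two most frequent tokens with their counts
print(toy_bow.most_common(2))
```

Here `the` occurs three times and `cat` twice, so those two lead the list. As in the article output above, a stop word tops the counts, which is why later steps filter such tokens out.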


### Text preprocessing practice
```
from nltk.stem import WordNetLemmatizer
```
Text preprocessing helps make for better input data when performing machine learning or other statistical methods.

Examples:
* Tokenization to create a bag of words
* Lowercasing words
* Lemmatization/stemming: shorten words to their root stems
* Removing stop words, punctuation, or unwanted tokens

Now it's your turn to apply the techniques you've learned to clean up text for better NLP results. You'll need to remove stop words and non-alphabetic characters, lemmatize, and build a new bag-of-words from your cleaned text.

**Preprocessing example**

* **Input text:** Cats, dogs and birds are common pets. So are fish.
* **Output tokens:** cat, dog, bird, common, pet, fish
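
The input/output pair above can be reproduced with a dependency-free sketch. This is not the NLTK pipeline used later in the notebook: the stop list `TOY_STOPS` is a three-word assumption, and `naive_lemma` is a crude plural-stripping rule standing in for WordNet lemmatization. Both names, and `toy_preprocess`, are hypothetical.

```python
import re

# Toy stop list for this sketch only; NLTK's English list is much longer
TOY_STOPS = {"and", "are", "so"}

def naive_lemma(token):
    """Strip a trailing 's' from longer tokens (crude stand-in for lemmatization)."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def toy_preprocess(text):
    # Lowercase, then keep runs of letters only (drops punctuation and digits)
    tokens = re.findall(r"[a-z]+", text.lower())
    # Remove stop words, then reduce each remaining token
    return [naive_lemma(t) for t in tokens if t not in TOY_STOPS]

print(toy_preprocess("Cats, dogs and birds are common pets. So are fish."))
```

The plural rule only works for regular nouns such as these; a real lemmatizer is needed for forms like "mice" or "geese".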



In [4]:
# Import WordNetLemmatizer
from nltk.stem import WordNetLemmatizer
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Krishna\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### Extracting the data from PDF

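Stop words can come from either of two sources: a local file, ```english_stops = ''.join(open('datasets/english_stopwords.txt'))```, or NLTK's built-in list, ```stopwords.words('english')``` (after ```from nltk.corpus import stopwords```).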
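
One caveat with the file-based option: joining the file into a single string turns `t not in english_stops` into a substring test rather than a word lookup. A minimal sketch of an alternative follows, assuming the file holds one stop word per line (the real layout of `english_stopwords.txt` is not shown here); `load_stopwords` is a hypothetical helper, and the demo writes a throwaway file so the block runs on its own.

```python
import os
import tempfile

def load_stopwords(path):
    """Read one stop word per line into a set for exact-match lookups."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}

# Demo with a throwaway file; the real file's layout is an assumption here
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("the\nof\nand\n")
    demo_path = tmp.name

demo_stops = load_stopwords(demo_path)
os.remove(demo_path)

# Exact membership: 'he' is a substring of 'the' but not a stop word in this set
print("the" in demo_stops, "he" in demo_stops)
```

A set also makes each membership check constant time on average, which matters when filtering a book-length token list.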

In [10]:
import PyPDF2
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

In [20]:
# Path of the PDF to read; loop over a list of paths to process many files
filename = 'datasets/Spring Microservices.pdf'

text = ""
# Open the PDF in binary mode; the with-block closes the file afterwards
with open(filename, 'rb') as pdfFileObj:
    # pdfReader is a readable object that will be parsed
    # (PyPDF2 3.0+ renames these calls: PdfReader, len(reader.pages),
    # reader.pages[i], and extract_text())
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    # Knowing the number of pages lets us parse through all of them
    num_pages = pdfReader.numPages
    # Read each page and append its text
    for page_number in range(num_pages):
        pageObj = pdfReader.getPage(page_number)
        text += pageObj.extractText()

# PyPDF2 cannot read scanned files, so an empty result means the PDF is
# image based. In that case, run the OCR library textract to convert it.
if text == "":
    import textract  # only needed for the OCR fallback
    # textract.process returns bytes, so decode to get a string
    text = textract.process(filename, method='tesseract', language='eng').decode('utf-8')

# Now we have a text variable which contains all the text derived from our
# PDF file. Type print(text) to see what it contains. It likely contains a
# lot of spaces and possibly junk such as '\n'.
# Next, we will clean our text variable and return it as a list of keywords.



In [22]:
# Tokenize the article: tokens
tokens = word_tokenize(text)

# Convert the tokens into lowercase: lower_tokens
lower_tokens = [t.lower() for t in tokens]

# Retain alphabetic words: alpha_only
alpha_only = [t for t in lower_tokens if t.isalpha()]

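# Remove all stop words: no_stops
# Build the stop-word set once instead of reloading the list for every token
english_stops = set(stopwords.words('english'))
no_stops = [t for t in alpha_only if t not in english_stops]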

# Instantiate the WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# Lemmatize all tokens into a new list: lemmatized
lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in no_stops]

# Create the bag-of-words: bow
bow = Counter(lemmatized)

# Print the 10 most common tokens
print(bow.most_common(10))


[('service', 883), ('microservices', 819), ('application', 480), ('spring', 349), ('case', 304), ('also', 297), ('microservice', 285), ('following', 260), ('data', 255), ('one', 249)]
