## NLP Chapter3

### Building a Counter with bag-of-words
In this exercise, you'll build your first (in this course) ```bag-of-words``` counter using a Wikipedia article, which has been pre-loaded as article. Try doing the bag-of-words without looking at the full article text, and guessing what the topic is! If you'd like to peek at the title at the end, we've included it as ```article_title```. Note that this ```article``` text has had very little preprocessing from the raw Wikipedia database entry.

In [5]:
from nltk.tokenize import word_tokenize
from collections import Counter

In [6]:
wikipedia_file = open('datasets/Wikipedia articles/wiki_text_debugging.txt')
article = ''.join(wikipedia_file)


In [7]:
# Tokenize the article: tokens
tokens = word_tokenize(article)

# Convert the tokens into lowercase: lower_tokens
lower_tokens = [t.lower() for t in tokens]

# Create a Counter with the lowercase tokens: bow_simple
bow_simple = Counter(lower_tokens)

# Print the 10 most common tokens
print(bow_simple.most_common(10))

[(',', 151), ('the', 150), ('.', 89), ('of', 81), ("''", 68), ('to', 63), ('a', 60), ('in', 44), ('and', 41), ('debugging', 40)]


### Text preprocessing practice
```
from nltk.stem import WordNetLemmatizer
```
Helps make for better input data
When performing machine learning or other statistical methods
Examples:
* Tokenization to create a bag of words
* Lowercasing words
1. Lemmatization/Stemming
2. Shorten words to their root stems
3. Removing stop words, punctuation, or unwanted tokens

Now, it's your turn to apply the techniques you've learned to help clean up text for better NLP results. You'll need to remove stop words and non-alphabetic characters, lemmatize, and perform a new bag-of-words on your cleaned text

**Preprocessing example**

* **Input text:** Cats, dogs and birds are common pets. So are fish.
* **Output tokens:** cat, dog, bird, common, pet, fish



In [13]:
# Import WordNetLemmatizer
from nltk.stem import WordNetLemmatizer
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\E082952\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.


True

In [11]:
english_stops = ''.join(open('datasets/english_stopwords.txt'))

In [14]:
# Retain alphabetic words: alpha_only
alpha_only = [t for t in lower_tokens if t.isalpha()]

# Remove all stop words: no_stops
no_stops = [t for t in alpha_only if t not in english_stops]

# Instantiate the WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# Lemmatize all tokens into a new list: lemmatized
lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in no_stops]

# Create the bag-of-words: bow
bow = Counter(lemmatized)

# Print the 10 most common tokens
print(bow.most_common(10))


[('debugging', 40), ('system', 25), ('software', 16), ('bug', 16), ('problem', 15), ('tool', 15), ('computer', 14), ('process', 13), ('term', 13), ('used', 12)]


In [16]:
# importing required modules
import PyPDF2
PyPDF2.download()
# creating a pdf file object
pdfFileObj = open('datasets/ds.pdf', 'rb')
 
# creating a pdf reader object
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
 
# printing number of pages in pdf file
print(pdfReader.numPages)
 
# creating a page object
pageObj = pdfReader.getPage(0)
 
# extracting text from page
print(pageObj.extractText())
 
# closing the pdf file object
pdfFileObj.close()

ModuleNotFoundError: No module named 'PyPDF2'