## Lab Assignment: Text Analytics / Natural Language Processing (NLP) with the NLTK library

### Objective: To apply knowledge of text analytics and NLP in Python.

### Instructions:

This will be part of a livecoding session.

*Introductory Terms*
- Corpus = Collection of documents / texts
- Tokens = Individual elements (words)
- Tokenizing = Process of breaking text down into tokens (by sentence or by word)
- Stop Words = Dictionary of words to discard (a, an, the, if, etc.)
- Stemming = Reducing words to their root (stem)
- Parts of Speech = Noun, Pronoun, Verb, Adverb, etc.
  - JJ = Adjectives 
  - NN = Nouns
  - RB = Adverbs
  - PRP = Pronouns
  - VB = Verbs
- Lemmatizing = Reducing words to their dictionary base form (lemma), e.g. "mice" becomes "mouse"
- Chunking = Grouping tokens into phrases (e.g. noun phrases) for context
- Chinking = Excluding a pattern from a chunk / opposite of chunking
- Named Entity Recognition (NER) = Finding named entities (people, places, organizations) in your text
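
The tokenizing, stop word, and stemming terms above can be sketched in a few lines. This is only a minimal illustration: the sample sentence and the tiny `stop_words` set are made up for the example, and `wordpunct_tokenize` is used instead of `word_tokenize` so that no NLTK data download is needed.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

text = "The cats were running quickly across the gardens."

# Tokenizing: split the raw text into word and punctuation tokens.
tokens = wordpunct_tokenize(text)

# Stop words: a tiny hand-written set, for illustration only.
# NLTK's full English list is in nltk.corpus.stopwords (needs a download).
stop_words = {"the", "a", "an", "were", "across"}
content = [t.lower() for t in tokens if t.isalpha() and t.lower() not in stop_words]

# Stemming: strip suffixes with the Porter algorithm.
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in content]

print(tokens)
print(content)
print(stems)
```

Note that a stem is not always a dictionary word, which is the main practical difference from lemmatizing.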

*Great reference:  https://realpython.com/nltk-nlp-python/#:~:text=Natural%20language%20processing%20(NLP)%20is,and%20contains%20human%2Dreadable%20text.*

### Deliverables:

This Jupyter notebook.

### Grading Criteria:

1. Correctness and functionality of each function/program
2. Proper use of basic control structures and functions
3. Code readability and organization

*Note:* Students are encouraged to work collaboratively, but each student must submit their own work. In addition, students should use version control (e.g. GitHub) to manage their code and collaborate with peers.

Example from NLTK website

In [15]:
import nltk
from nltk import Tree
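# Resources for ne_chunk. word_tokenize and pos_tag also require 'punkt' and
# 'averaged_perceptron_tagger', which are downloaded in the Gutenberg cell (In [1]).
nltk.download('maxent_ne_chunker')
nltk.download('words')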

sentence = """At eight o'clock on Thursday morning... Arthur didn't feel very good."""
tokens = nltk.word_tokenize(sentence)
tokens
# Output: ['At', 'eight', "o'clock", 'on', 'Thursday', 'morning', '...', 'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']

tagged = nltk.pos_tag(tokens)
tagged[0:6]
# Output: [('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'), ('Thursday', 'NNP'), ('morning', 'NN')]

entities = nltk.ne_chunk(tagged)
entities
# Output: Tree('S', [('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'), ('Thursday', 'NNP'), ('morning', 'NN'), ('...', ':'), Tree('PERSON', [('Arthur', 'NNP')]), ('did', 'VBD'), ("n't", 'RB'), ('feel', 'VB'), ('very', 'RB'), ('good', 'JJ'), ('.', '.')])

t = Tree.fromstring('(S (NP (DT the) (NN cat)) (VP (VBD sat) (PP (IN on) (NP (DT a) (NN mat)))))')
t.draw()

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\mike\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\mike\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!
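
The chunking and chinking terms from the introduction can be illustrated with NLTK's `RegexpParser`, which returns the same kind of `Tree` object that `ne_chunk` produces in the cell above. This is a minimal sketch, assuming hand-written POS tags so that no tagger model is required; the grammars and the `np_phrases` helper are illustrative choices, not something the lab prescribes.

```python
from nltk import RegexpParser

# Hand-tagged tokens (Penn Treebank tags), so no tagger model is needed.
tagged = [("the", "DT"), ("cat", "NN"), ("sat", "VBD"),
          ("on", "IN"), ("a", "DT"), ("mat", "NN")]

# Chunking: group an optional determiner, any adjectives, and a noun into an NP.
chunker = RegexpParser("NP: {<DT>?<JJ>*<NN>}")
chunked = chunker.parse(tagged)

# Chinking: chunk everything first, then carve verbs and prepositions back out.
chinker = RegexpParser(r"""
NP:
    {<.*>+}
    }<VBD|IN>+{
""")
chinked = chinker.parse(tagged)


def np_phrases(tree):
    """Return the text of every NP subtree (hypothetical helper for this example)."""
    return [" ".join(word for word, _ in subtree.leaves())
            for subtree in tree.subtrees(lambda t: t.label() == "NP")]


print(np_phrases(chunked))
print(np_phrases(chinked))
```

For this sentence both grammars yield the same two noun phrases; chunking says what to keep, while chinking says what to leave out.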


Process the Gutenberg corpus

In [1]:
import nltk
from nltk.corpus import gutenberg
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk import Tree

# Download necessary NLTK resources
nltk.download('gutenberg')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('stopwords')

# Load and tokenize text data from the Gutenberg corpus
emma = gutenberg.raw('austen-emma.txt')
tokens = word_tokenize(emma)

# Perform basic text preprocessing
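tokens = [token.lower() for token in tokens if token.isalpha()]  # Convert to lowercase; drop punctuation and numbers
stop_words = set(stopwords.words('english'))  # Build the set once instead of reloading the list for every token
tokens = [token for token in tokens if token not in stop_words]  # Remove stop words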

# Perform frequency distribution analysis
fdist = FreqDist(tokens)
most_common = fdist.most_common(10)
print("Most common words:")
for word, frequency in most_common:
    print(f"{word}: {frequency}")

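# Perform part-of-speech tagging on the first 100 tokens
# (tags are less reliable here because the tokens were lowercased and stop words were removed)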
tagged_tokens = nltk.pos_tag(tokens[:100])
print("\nPart-of-speech tagging:")
for token, tag in tagged_tokens:
    print(f"{token}: {tag}")

# Perform lemmatization (lemmatize() treats every word as a noun unless a pos argument is given)
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens[:100]]
print("\nLemmatization:")
for token, lemma in zip(tokens[:100], lemmatized_tokens):
    print(f"{token}: {lemma}")

# Build a flat tree of (tag token) pairs for display; this is not a syntactic parse
parsed_sent = Tree.fromstring('(S ' + ' '.join([f'({tag} {token})' for token, tag in tagged_tokens]) + ')')
print("\nTreebank Visualization:")
parsed_sent.pretty_print()



ModuleNotFoundError: No module named 'nltk'
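
The flat tree at the end of the cell above is assembled by building a bracketed string and parsing it back with `Tree.fromstring`. A possible alternative, sketched here under the assumption that the input is a list of `(token, tag)` pairs like `tagged_tokens`, is to construct the `Tree` objects directly, which avoids any string parsing. The sample pairs below are hand-written for illustration and are not actual tagger output.

```python
from nltk import Tree

# Hand-written (token, tag) pairs standing in for tagged_tokens.
tagged_tokens = [("emma", "NN"), ("handsome", "JJ"), ("clever", "JJ")]

# One preterminal node per token: the tag is the label, the token is the leaf.
flat = Tree("S", [Tree(tag, [token]) for token, tag in tagged_tokens])

print(flat)           # bracketed form
print(flat.leaves())  # tokens in order
print(flat.pos())     # (token, tag) pairs recovered from the tree
```

Calling `flat.pretty_print()` would render this tree as text in the same way as the cell above.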