
# **Practice Assignment: NLP with NLTK & spaCy**

* This assignment is part of the NLP Workshop on YouTube, which is free and open to the public.
* **Lecturer: Reza Shokrzad.**
* [Access the first class session](https://youtube.com/live/lDCoqQSc4ZE?feature=share)
* [Class and session schedule](https://docs.google.com/spreadsheets/d/1SP3NJ9H7yp8sgof-zp_t4oxmdxjMdEgoL_mmCDvdUm4/edit?gid=0#gid=0)


Welcome to this **Fill-in-the-Blanks NLP Assignment!** 🎯 This exercise will help you solidify your understanding of **NLTK** and **spaCy** by filling in the missing parts of the code. Follow the instructions carefully, and make sure to test your solutions!


## **1. Working with Corpora & Lexical Resources**
**Task:** Load and analyze texts from different corpora.
- Use NLTK’s **Gutenberg** corpus to load the text of *Moby Dick*.
- Tokenize it into words.
- Count the top 10 most frequent words (excluding stopwords).
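The three steps in this task (tokenize, drop stopwords, count) can be sketched without any corpus download. The snippet below is a minimal illustration using only the standard library; the sample sentence, the tiny `STOPWORDS` set, and the helper name `top_words` are assumptions made for the example, and they stand in for NLTK's tokenizer and its much longer English stopword list.

```python
import re
from collections import Counter

# Hypothetical mini stopword list for illustration only
STOPWORDS = {"the", "a", "of", "and", "is", "in", "to", "me"}

def top_words(text, n=10):
    """Return the n most frequent non-stopword tokens as (word, count) pairs."""
    # Crude tokenizer: runs of letters, digits, or underscores, lowercased
    tokens = re.findall(r"\w+", text.lower())
    kept = [tok for tok in tokens if tok not in STOPWORDS]
    return Counter(kept).most_common(n)

# Made-up sample text, not a quotation from the corpus
sample = "Call me Ishmael. The whale, the white whale, is the whale of the sea."
print(top_words(sample, n=3))
```

Lowercasing before counting merges "The" and "the"; whether that is desirable depends on the analysis, since it also merges proper nouns with common words.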

In [6]:
import nltk

nltk.download('stopwords')
nltk.download('gutenberg')
nltk.download('punkt')
nltk.download('punkt_tab')


from nltk.corpus import gutenberg
from nltk.probability import FreqDist
from nltk.corpus import stopwords



# Load text
text = gutenberg.raw('melville-moby_dick.txt')  # FILL THIS

# Tokenize words
words = nltk.tokenize.word_tokenize(text)

# Remove stopwords and punctuation; build the stopword set once so each lookup is fast
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word.isalnum() and word.lower() not in stop_words]  # FILL THIS


# Compute frequency distribution
fdist = FreqDist(filtered_words)

# Print top 10 words
print(fdist.most_common(10))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

## **2. Tokenization Techniques**
**Task:** Tokenize a given text using both **NLTK** and **spaCy**.

In [6]:
import spacy
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download('punkt')
nlp = spacy.load('en_core_web_sm')

text = "SpaCy is fast! However, NLTK provides flexibility in tokenization."

# NLTK Tokenization
nltk_word_tokens = word_tokenize(text)  # FILL THIS
nltk_sent_tokens = sent_tokenize(text)  # FILL THIS

# spaCy Tokenization
doc = nlp(text)
spacy_tokens = [token.text for token in doc]

print("NLTK Word Tokens:", nltk_word_tokens)
print("NLTK Sentence Tokens:", nltk_sent_tokens)
print("spaCy Tokens:", spacy_tokens)

NLTK Word Tokens: ['SpaCy', 'is', 'fast', '!', 'However', ',', 'NLTK', 'provides', 'flexibility', 'in', 'tokenization', '.']
NLTK Sentence Tokens: ['SpaCy is fast!', 'However, NLTK provides flexibility in tokenization.']
spaCy Tokens: ['SpaCy', 'is', 'fast', '!', 'However', ',', 'NLTK', 'provides', 'flexibility', 'in', 'tokenization', '.']
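For this simple sentence both libraries produce the same word tokens. A rough regex-only approximation gives the same result here, which helps show what a tokenizer has to decide. This is a sketch under the assumption that words are runs of word characters and sentences end at `.`, `!`, or `?` followed by whitespace; it is not how NLTK or spaCy tokenize internally.

```python
import re

text = "SpaCy is fast! However, NLTK provides flexibility in tokenization."

# Words are runs of word characters; any other non-space character is its own token
word_tokens = re.findall(r"\w+|[^\w\s]", text)

# Split after sentence-final punctuation that is followed by whitespace
sentence_tokens = re.split(r"(?<=[.!?])\s+", text)

print(word_tokens)
print(sentence_tokens)
```

The approximation breaks quickly on real text: a string such as "Dr. Smith isn't here." would be split after "Dr." and "isn't" would become three tokens, which is the kind of case the library tokenizers are built to handle.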


## **3. Regex Pattern Matching for Phone Number Detection**
**Task:** Write a regex pattern to find the phone numbers in the text.

In [9]:
import re

# Example 2: Phone Number Extraction
text_phones = "Call me at +1-202-555-0173 or reach our office at (415) 123-4567."
# Optional country code, then area code (optionally in parentheses), then 3 + 4 digits
phone_pattern = r"(?:\+?\d{1,3}[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}"

phones = re.findall(phone_pattern, text_phones)
print("Detected Phone Numbers:", phones)


Detected Phone Numbers: ['+1-202-555-0173', '(415) 123-4567']
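One way to make each piece of a phone pattern explicit is `re.VERBOSE`, which allows whitespace and comments inside the pattern. The sketch below treats the country code as optional, so a number that starts with a parenthesised area code is matched whole, including the opening parenthesis. The name `PHONE_RE` is an assumption for the example, and the pattern only targets the two North American style formats in the sample text.

```python
import re

PHONE_RE = re.compile(
    r"""
    (?:\+?\d{1,3}[-.\s]?)?   # optional country code with optional plus sign
    \(?\d{3}\)?              # three-digit area code, optionally in parentheses
    [-.\s]?                  # optional separator
    \d{3}                    # next three digits
    [-.\s]?                  # optional separator
    \d{4}                    # last four digits
    """,
    re.VERBOSE,
)

text_phones = "Call me at +1-202-555-0173 or reach our office at (415) 123-4567."
print(PHONE_RE.findall(text_phones))
```

Because the only group is non-capturing, `findall` returns the full matched strings rather than group contents.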


## **4. Stopwords Filtering using NLTK**
**Task:** Analyze movie reviews where stopwords are removed to focus on meaningful words.

In [11]:

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("stopwords")
nltk.download("punkt")

# 🎬 Sample Movie Review
review = """The movie was absolutely amazing! The cinematography was stunning, and the characters were incredibly well-developed.
However, the storyline felt a bit predictable at times, and some scenes were unnecessarily long. Overall, a great experience!"""

# Tokenize words
words = word_tokenize(review)

# Remove stopwords and punctuation (build the stopword set once for fast lookups)
stop_words = set(stopwords.words("english"))
filtered_words = [word for word in words if word.lower() not in stop_words and word.isalnum()]

# Output results
print("Original Words:", words)
print("\nFiltered (No Stopwords):", filtered_words)


Original Words: ['The', 'movie', 'was', 'absolutely', 'amazing', '!', 'The', 'cinematography', 'was', 'stunning', ',', 'and', 'the', 'characters', 'were', 'incredibly', 'well-developed', '.', 'However', ',', 'the', 'storyline', 'felt', 'a', 'bit', 'predictable', 'at', 'times', ',', 'and', 'some', 'scenes', 'were', 'unnecessarily', 'long', '.', 'Overall', ',', 'a', 'great', 'experience', '!']

Filtered (No Stopwords): ['movie', 'absolutely', 'amazing', 'cinematography', 'stunning', 'characters', 'incredibly', 'However', 'storyline', 'felt', 'bit', 'predictable', 'times', 'scenes', 'unnecessarily', 'long', 'Overall', 'great', 'experience']


## **5. Stemming Methods using NLTK**
**Task:** Analyze legal and scientific terms to observe how different stemming algorithms behave.

In [1]:
from nltk.stem import PorterStemmer, LancasterStemmer

# ⚖️ Sample Legal & Scientific Terms
words = ["arguing", "justification", "liable", "obligations", "classification", "microbiology", "evolutionary", "running", "happiness"]

# Initialize Stemmer Objects
porter = PorterStemmer()
lancaster = LancasterStemmer()

# Apply Stemming
porter_stems = [porter.stem(word) for word in words]
lancaster_stems = [lancaster.stem(word) for word in words]

# Output Results
print("Original Words:", words)
print("\nPorter Stemmer Results:", porter_stems)
print("\nLancaster Stemmer Results:", lancaster_stems)


Original Words: ['arguing', 'justification', 'liable', 'obligations', 'classification', 'microbiology', 'evolutionary', 'running', 'happiness']

Porter Stemmer Results: ['argu', 'justif', 'liabl', 'oblig', 'classif', 'microbiolog', 'evolutionari', 'run', 'happi']

Lancaster Stemmer Results: ['argu', 'just', 'liabl', 'oblig', 'class', 'microbiolog', 'evolv', 'run', 'happy']
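Stems such as `argu` and `happi` are not dictionary words because stemmers strip suffixes by rule rather than looking words up. The toy function below illustrates that idea only; its suffix list and the name `toy_stem` are hypothetical, and it is neither the Porter nor the Lancaster algorithm.

```python
# Hypothetical suffix list, checked in order (longer suffixes first)
SUFFIXES = ["ations", "ation", "ness", "ing", "s"]

def toy_stem(word, min_stem_length=3):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem_length:
            return stem
    return word

for w in ["arguing", "obligations", "running", "happiness", "liable"]:
    print(w, "->", toy_stem(w))
```

The toy version turns "running" into "runn" because it has no rule for doubled consonants, whereas both NLTK stemmers above return "run". Extra rules of that kind are where the real algorithms differ in how aggressively they cut.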


## **6. Lemmatization Strategies using NLTK & spaCy**

### NLTK’s WordNetLemmatizer
**Task:** Lemmatize a political news headline to show how lemmatization helps retain the correct part of speech (POS) while normalizing words.

In [9]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize

nltk.download("wordnet")
nltk.download("punkt")

# 📰 Sample News Headline
headline = "The senators debated the increasing regulations affecting technology companies."

# Tokenize words
words = word_tokenize(headline)

# Initialize Lemmatizer
lemmatizer = WordNetLemmatizer()

# Apply Lemmatization (default without POS tagging)
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]

print("Original Words:", words)
print("\nLemmatized Words:", lemmatized_words)


Original Words: ['The', 'senators', 'debated', 'the', 'increasing', 'regulations', 'affecting', 'technology', 'companies', '.']

Lemmatized Words: ['The', 'senator', 'debated', 'the', 'increasing', 'regulation', 'affecting', 'technology', 'company', '.']
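In the output above, "debated", "increasing", and "affecting" are unchanged because `WordNetLemmatizer.lemmatize` defaults to `pos="n"`, so every token is looked up as a noun. The sketch below imitates that behaviour with a small hand-written table; `LEMMAS` and `toy_lemmatize` are hypothetical names for illustration and stand in for WordNet, which is not used here.

```python
# Hypothetical (word, pos) -> lemma table standing in for a real lexicon
LEMMAS = {
    ("senators", "n"): "senator",
    ("regulations", "n"): "regulation",
    ("companies", "n"): "company",
    ("debated", "v"): "debate",
    ("increasing", "v"): "increase",
    ("affecting", "v"): "affect",
}

def toy_lemmatize(word, pos="n"):
    """Look the word up under the given POS; return it unchanged if absent."""
    return LEMMAS.get((word.lower(), pos), word)

print(toy_lemmatize("debated"))           # looked up as a noun: no entry, unchanged
print(toy_lemmatize("debated", pos="v"))  # looked up as a verb: entry found
```

With NLTK, the equivalent step is passing a `pos` argument (for example `"v"` for verbs) to `lemmatize`, usually derived from a POS tagger.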


### spaCy’s Built-in Lemmatizer

In [10]:
import spacy

nlp = spacy.load("en_core_web_sm")

# Process the same headline
doc = nlp(headline)

# Apply Lemmatization
spacy_lemmatized = [token.lemma_ for token in doc]

print("\nspaCy Lemmatized Words:", spacy_lemmatized)



spaCy Lemmatized Words: ['the', 'senator', 'debate', 'the', 'increase', 'regulation', 'affect', 'technology', 'company', '.']


## **7. Parsing & Chunking using NLTK**

**Task:** Analyze legal contracts and job descriptions where parsing and chunking help extract meaningful phrases like noun phrases (NPs) or verb phrases (VPs).

In [31]:
# 📜 Task: Extracting Key Phrases from Legal & Job Documents
import nltk

# nltk.download("punkt")
nltk.download("averaged_perceptron_tagger_eng")

# 📜 Sample Legal Contract Text
contract_text = "The tenant shall pay the monthly rent before the 5th of each month."

# Tokenize & POS Tagging
words = nltk.tokenize.word_tokenize(contract_text)
pos_tags = nltk.pos_tag(words)

# Define a Chunking Grammar for Noun Phrases (NP)
grammar = r"NP: {<DT>?<JJ>*<NN>+}"

# Apply Chunking
chunk_parser = nltk.RegexpParser(grammar)
tree = chunk_parser.parse(pos_tags)

# Display Results
print("Chunked Tree:")
tree.pretty_print()


Chunked Tree:
                                               S                                                                     
    ___________________________________________|__________________________________________________________            
   |       |        |       |      |      |    |          NP                      NP                      NP         
   |       |        |       |      |      |    |     _____|______         ________|_________         _____|_____      
shall/MD pay/VB before/IN the/DT 5th/CD of/IN ./. The/DT     tenant/NN the/DT monthly/JJ rent/NN each/DT     month/NN
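The grammar `NP: {<DT>?<JJ>*<NN>+}` reads as: an optional determiner, any number of adjectives, then one or more tokens tagged `NN`. A minimal sketch of that matching logic in plain Python follows, assuming the `(word, tag)` pairs shown in the tree above. `chunk_noun_phrases` is a hypothetical helper, not part of NLTK, and it covers only this one pattern rather than `RegexpParser` in general.

```python
def chunk_noun_phrases(tagged):
    """Collect spans matching DT? JJ* NN+ from a list of (word, tag) pairs."""
    chunks, i = [], 0
    while i < len(tagged):
        j = i
        if tagged[j][1] == "DT":                         # optional determiner
            j += 1
        while j < len(tagged) and tagged[j][1] == "JJ":  # zero or more adjectives
            j += 1
        k = j
        while k < len(tagged) and tagged[k][1] == "NN":  # one or more nouns
            k += 1
        if k > j:                                        # at least one noun matched
            chunks.append(" ".join(word for word, _ in tagged[i:k]))
            i = k
        else:
            i += 1
    return chunks

# Tags copied from the chunked tree above
tagged = [("The", "DT"), ("tenant", "NN"), ("shall", "MD"), ("pay", "VB"),
          ("the", "DT"), ("monthly", "JJ"), ("rent", "NN"), ("before", "IN"),
          ("the", "DT"), ("5th", "CD"), ("of", "IN"), ("each", "DT"),
          ("month", "NN"), (".", ".")]
print(chunk_noun_phrases(tagged))
```

"the 5th" is not chunked because `5th` is tagged `CD`, which the pattern does not allow; widening the grammar to accept `CD` would be one way to capture it.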



## **8. Exploring Hyponyms & Hypernyms using WordNet (NLTK)**

**Task:** Explore hyponyms (more specific terms) and hypernyms (more general terms), the hierarchical relationships between words that matter in scientific and business domains.
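Hypernym and hyponym lookups amount to walking up or down a tree of concepts. The sketch below shows the idea with a small hand-written taxonomy; the `PARENT` mapping and both helper names are assumptions for illustration and are not WordNet data or NLTK functions.

```python
# Hypothetical child -> parent taxonomy for illustration only
PARENT = {
    "lioness": "lion",
    "lion_cub": "lion",
    "lion": "big_cat",
    "tiger": "big_cat",
    "big_cat": "feline",
    "feline": "carnivore",
}

def hypernym_chain(word):
    """Walk upward from word, returning ever more general concepts."""
    chain = []
    while word in PARENT:
        word = PARENT[word]
        chain.append(word)
    return chain

def direct_hyponyms(word):
    """Return the immediate, more specific children of word."""
    return sorted(child for child, parent in PARENT.items() if parent == word)

print(hypernym_chain("lion"))
print(direct_hyponyms("lion"))
```

A synset's `hypernyms()` and `hyponyms()` methods return only the immediate neighbours, so they correspond to one step of this walk rather than the whole chain.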

In [53]:
# 🔍 Task: Explore Word Relationships in Science & Business
from nltk.corpus import wordnet

# 🦁 Find Hypernyms & Hyponyms for "lion"
word = "lion"
synset = wordnet.synsets(word)[0]  # Selecting the first synset

# Hypernyms (More General Category)
hypernyms = synset.hypernyms()
print(f"Hypernyms (More General Concept) of '{word}':")
print([hypernym.name().split('.')[0] for hypernym in hypernyms])

# Hyponyms (More Specific Types)
hyponyms = synset.hyponyms()
print(f"\nHyponyms (More Specific Types) of '{word}':")
print([hyponym.name().split('.')[0] for hyponym in hyponyms])



## **9. Named Entity Recognition (NER) with spaCy**
**Task:** Extract named entities from a complex sentence.

In [54]:
nlp = spacy.load("en_core_web_sm")
text = "In 1969, Neil Armstrong became the first person to walk on the Moon during the Apollo 11 mission."

doc = nlp(text)  # FILL THIS

print("Named Entities:")
for ent in doc.ents:
    print(f"{ent.text} -> {ent.label_}")

Named Entities:
1969 -> DATE
Neil Armstrong -> PERSON
first -> ORDINAL
Moon -> PERSON
Apollo 11 -> LAW
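The labels come from a statistical model and can be wrong: in the output above, "Moon" is tagged `PERSON` and "Apollo 11" is tagged `LAW`. Grouping entities by label is one simple way to review such results. The sketch below works on the `(text, label)` pairs copied from that output; `group_by_label` is a hypothetical helper, not a spaCy function.

```python
from collections import defaultdict

# (text, label) pairs copied from the printed output above
entities = [
    ("1969", "DATE"),
    ("Neil Armstrong", "PERSON"),
    ("first", "ORDINAL"),
    ("Moon", "PERSON"),
    ("Apollo 11", "LAW"),
]

def group_by_label(pairs):
    """Map each label to the list of entity texts that received it."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)

grouped = group_by_label(entities)
print(grouped["PERSON"])
```

With a processed `doc`, the same pairs can be built as `[(ent.text, ent.label_) for ent in doc.ents]`.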
