# Session 14 🐍

☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️

***

# 108. NLTK (Natural Language Toolkit)  
NLTK is a leading platform for building Python programs to work with human language data (Natural Language Processing). It provides easy-to-use interfaces to over 50 corpora and lexical resources along with a suite of text processing libraries.

***

# 109. Text Processing
- Tokenization (splitting text into words/sentences)
- Stemming (stripping suffixes by rule; the result is not always a real word)
- Lemmatization (reducing words to their dictionary form using vocabulary and part of speech)
- Part-of-speech tagging (identifying nouns, verbs etc.)
- Named entity recognition (identifying people, places etc.)

***

# 110. Corpora & Lexical Resources
- WordNet (lexical database)
- Stopwords (common words to filter out)
- Brown Corpus (English text categorized by genre)
- Gutenberg Corpus (literary works)
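
These resources are not bundled with the `nltk` package itself; each one is fetched once with `nltk.download(...)`. As a minimal sketch, the check below reports which of the four resources are already installed locally. The helper name `is_installed` and the `resources` mapping are illustrative, not part of NLTK.

```python
import nltk

def is_installed(resource_path):
    """Return True if NLTK can locate the resource in its local data directories."""
    try:
        nltk.data.find(resource_path)
        return True
    except LookupError:
        # nltk.data.find raises LookupError when the resource is missing
        return False

# Download name -> path inside the NLTK data directory
resources = {
    "stopwords": "corpora/stopwords",
    "wordnet": "corpora/wordnet",
    "brown": "corpora/brown",
    "gutenberg": "corpora/gutenberg",
}

status = {name: is_installed(path) for name, path in resources.items()}
for name, present in status.items():
    # A missing resource can be fetched with nltk.download(name)
    print(name, "installed" if present else "missing")
```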

***

# 111. NLP Algorithms
- Classification (text categorization)
- Clustering (grouping similar documents)
- Sentiment analysis (opinion mining)

***

# 112. Core Functionality

***

## 112-1. Tokenization

In [None]:
from nltk.tokenize import word_tokenize, sent_tokenize

text = "Hello world! NLTK is awesome."
print(word_tokenize(text))  # ['Hello', 'world', '!', 'NLTK', 'is', 'awesome', '.']
print(sent_tokenize(text))  # ['Hello world!', 'NLTK is awesome.']
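
`word_tokenize` and `sent_tokenize` rely on the Punkt models, which have to be downloaded once (`nltk.download('punkt')`, or `'punkt_tab'` on recent NLTK releases). When that is not possible, the regex-based tokenizers below work without any downloaded data. This is a sketch of an alternative, not a drop-in replacement: regex tokenizers do not handle contractions or abbreviations the way `word_tokenize` does.

```python
from nltk.tokenize import RegexpTokenizer, wordpunct_tokenize

text = "Hello world! NLTK is awesome."

# wordpunct_tokenize splits into runs of word characters and runs of punctuation
with_punct = wordpunct_tokenize(text)
print(with_punct)  # ['Hello', 'world', '!', 'NLTK', 'is', 'awesome', '.']

# RegexpTokenizer keeps only what the pattern matches, so \w+ drops punctuation
word_only = RegexpTokenizer(r"\w+")
tokens = word_only.tokenize(text)
print(tokens)  # ['Hello', 'world', 'NLTK', 'is', 'awesome']
```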

***

## 112-2. Stopwords Removal

In [None]:
from nltk.corpus import stopwords

# Requires the stopword lists: nltk.download('stopwords')
words = ["the", "quick", "brown", "fox"]
stop_words = set(stopwords.words('english'))
filtered = [w for w in words if w.lower() not in stop_words]
print(filtered)  # ['quick', 'brown', 'fox']

***

## 112-3. Stemming vs Lemmatization

In [None]:
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
print(stemmer.stem("running"))  # "run"

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running", pos="v"))  # "run"
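
"running" happens to give the same result with both tools, which hides the main difference: a stemmer strips suffixes by rule and may return something that is not a word, while a lemmatizer looks the word up in WordNet. The sketch below shows the stemmer side only, since `PorterStemmer` needs no downloaded data. The word list is an illustrative choice.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["running", "studies", "caresses"]

# Map each word to its Porter stem
stems = {word: stemmer.stem(word) for word in words}
for word, stem in stems.items():
    print(f"{word} -> {stem}")

# "studies" becomes "studi", which is not a dictionary word;
# a lemmatizer given the right part of speech would return "study" instead.
```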

***

## 112-4. Part-of-Speech Tagging

In [None]:
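from nltk import pos_tag
from nltk.tokenize import word_tokenize

# Requires the tokenizer and tagger models (e.g. nltk.download('averaged_perceptron_tagger'))
tokens = word_tokenize("NLTK is amazing")
print(pos_tag(tokens))  # A list of (token, Penn Treebank tag) pairs, e.g. ('is', 'VBZ')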
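
Because `pos_tag` returns plain `(token, tag)` tuples, filtering by part of speech is ordinary list processing. In the Penn Treebank tag set, noun tags start with `NN` and verb tags start with `VB`. The sketch below uses a hand-written tagged list so that it runs without the tagger model; the pairs are illustrative, not real tagger output.

```python
from nltk.probability import FreqDist

# Hand-written (token, tag) pairs in the same shape pos_tag returns
tagged = [
    ("The", "DT"), ("cat", "NN"), ("chased", "VBD"),
    ("two", "CD"), ("mice", "NNS"),
]

# Noun tags (NN, NNS, NNP, NNPS) share the NN prefix; verb tags share VB
nouns = [word for word, tag in tagged if tag.startswith("NN")]
verbs = [word for word, tag in tagged if tag.startswith("VB")]
print(nouns)  # ['cat', 'mice']
print(verbs)  # ['chased']

# Count how often each tag occurs
tag_counts = FreqDist(tag for _, tag in tagged)
print(tag_counts.most_common())
```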

***

## 112-5. Named Entity Recognition (NER)

In [None]:
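from nltk import ne_chunk, pos_tag
from nltk.tokenize import word_tokenize

# Requires the 'maxent_ne_chunker' and 'words' resources in addition to the
# tokenizer and tagger models (resource names vary slightly across NLTK releases)
text = "Apple is looking at buying U.K. startup for $1 billion"
tags = pos_tag(word_tokenize(text))
entities = ne_chunk(tags)
print(entities)  # An nltk.Tree whose labelled subtrees (e.g. ORGANIZATION, GPE, PERSON) mark the entities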
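
The value returned by `ne_chunk` is an `nltk.Tree`: ordinary tokens stay as `(token, tag)` tuples, and each recognized entity becomes a subtree labelled with its type. The sketch below walks such a tree to pull out the entities. The tree is built by hand so that the example runs without the chunker model; the labels a real model assigns to this sentence may differ.

```python
from nltk.tree import Tree

# Hand-built tree in the shape ne_chunk returns (illustrative, not model output)
chunked = Tree("S", [
    Tree("ORGANIZATION", [("Google", "NNP")]),
    ("opened", "VBD"),
    ("offices", "NNS"),
    ("in", "IN"),
    Tree("GPE", [("New", "NNP"), ("York", "NNP")]),
])

found = []
for node in chunked:
    # Entities are subtrees; everything else is a plain (token, tag) tuple
    if isinstance(node, Tree):
        name = " ".join(token for token, tag in node.leaves())
        found.append((node.label(), name))

print(found)  # [('ORGANIZATION', 'Google'), ('GPE', 'New York')]
```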

***

## 112-6. Frequency Distribution

In [None]:
from nltk.probability import FreqDist

words = word_tokenize("hello hello world")
fdist = FreqDist(words)
print(fdist.most_common())  # [('hello', 2), ('world', 1)]
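
Beyond `most_common()`, a `FreqDist` exposes the totals needed for simple corpus statistics. A short sketch, using `str.split` instead of `word_tokenize` so that no downloaded model is required; the sample sentence is an illustrative choice.

```python
from nltk.probability import FreqDist

words = "the cat sat on the mat the end".split()
fdist = FreqDist(words)

total = fdist.N()    # total number of tokens counted
unique = fdist.B()   # number of distinct tokens
print(total, unique)           # 8 6
print(fdist["the"])            # 3
print(fdist.freq("the"))       # 0.375 (relative frequency: 3 / 8)
print(fdist.max())             # 'the' (the most frequent token)
print(fdist["dog"])            # 0 (unseen tokens count as zero)

# Lexical diversity: unique tokens divided by total tokens
diversity = unique / total
print(diversity)               # 0.75
```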

***

# 113. Advanced Features

***

## 113-1. Text Classification

In [None]:
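from nltk.classify import NaiveBayesClassifier

def features(text):
    # NLTK classifiers expect a feature dict, not a raw string
    return {word: True for word in text.lower().split()}

train_data = [(features("great movie"), "pos"), (features("terrible acting"), "neg")]
classifier = NaiveBayesClassifier.train(train_data)
# Words never seen in training (here "film") are ignored, so the input must share a word with the training data
print(classifier.classify(features("great film")))  # "pos"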
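
NLTK classifiers are trained and evaluated on `(featureset, label)` pairs, where each featureset is a dict. The sketch below trains on four toy examples and scores the classifier on a held-out set with `nltk.classify.accuracy`. The `bag_of_words` helper and all example texts are illustrative; a data set this small says nothing about real-world performance.

```python
from nltk.classify import NaiveBayesClassifier, accuracy

def bag_of_words(text):
    """Turn a string into the {word: True} featureset NLTK classifiers expect."""
    return {word: True for word in text.lower().split()}

train_set = [
    (bag_of_words("great movie"), "pos"),
    (bag_of_words("great acting"), "pos"),
    (bag_of_words("terrible movie"), "neg"),
    (bag_of_words("terrible acting"), "neg"),
]
classifier = NaiveBayesClassifier.train(train_set)

# Held-out examples; "film" and "plot" were never seen in training and are ignored
test_set = [
    (bag_of_words("great film"), "pos"),
    (bag_of_words("terrible plot"), "neg"),
]
score = accuracy(classifier, test_set)
print(score)  # 1.0
```

`classifier.show_most_informative_features()` prints which words contributed most, which helps when a prediction looks surprising.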

---

## 113-2. WordNet (Lexical Database)

In [None]:
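from nltk.corpus import wordnet

# Requires the WordNet corpus: nltk.download('wordnet')
synsets = wordnet.synsets("happy")  # Each synset is one sense of the word, not a flat list of synonyms
print(synsets[0].definition())  # "enjoying or showing or marked by joy or pleasure"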

---

## 113-3. Sentiment Analysis

In [None]:
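from nltk.sentiment import SentimentIntensityAnalyzer

# Requires the VADER lexicon: nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("NLTK is amazing!"))
# Prints a dict with 'neg', 'neu', 'pos', and 'compound' keys; here 'neg' is 0.0 and 'compound' is positive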
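
The `compound` value is a single score normalized to the range -1 to 1, and it is usually what gets turned into a label. The sketch below uses the conventional VADER cut-offs of +0.05 and -0.05; treat the threshold as an assumption to tune for your data. The function name `label_from_compound` is hypothetical, and it takes a plain number so that it runs without the VADER lexicon.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score (-1 to 1) to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Illustrative compound scores, not real analyzer output
for value in (0.62, -0.48, 0.0):
    print(value, label_from_compound(value))
```

With a real analyzer, the call would be `label_from_compound(sia.polarity_scores(text)["compound"])`.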

***

# Some Exercises

**1.** Tokenize the following text into words and sentences: "NLTK is a Python library for NLP. It provides easy-to-use interfaces to over 50 corpora!"
- Remove all stopwords from the tokenized words
- Apply both stemming and lemmatization to each remaining word
- Compare the results

___

**2.** Tag parts of speech in the sentence: "The quick brown fox jumps over the lazy dog"
- Extract only the nouns and verbs from the tagged result
- Create a frequency distribution of the POS tags
- Visualize the frequency distribution using matplotlib

---

**3.**  Process the text: "Apple Inc. is planning to open a new store in Paris next month"
- Perform named entity recognition
- Extract all organization and location entities
- Display the entity types and their spans in the original text

---

**4.** Find all synsets for the word "bank"
- For each synset, print:
    - Definition
    - Example sentences
    - All lemmas
- Calculate the similarity between "bank" (financial institution) and "riverbank"
- Find the lowest common hypernym between "car" and "bicycle"

***

**5.** Create a small training set of 10 positive and 10 negative movie reviews
- Train a Naive Bayes classifier using NLTK
- Test it on new sentences like "The acting was terrible" and "Wonderful cinematography"
- Calculate the accuracy using a small test set

***

**6.** Analyze the sentiment of these 3 product reviews:
- "This product is absolutely wonderful!"
- "Waste of money, would not recommend"
- "It's okay, but could be better"

For each review, display:
- Polarity scores
- Overall sentiment (positive/negative/neutral)

Then compare the results with TextBlob's sentiment analysis.

***

**7.** Take the first 100 words from the NLTK Gutenberg corpus (austen-emma.txt)
- Create a frequency distribution
- Plot the 10 most common words
- Calculate:
    - Lexical diversity (unique words/total words)
    - Average word length
    - Word length distribution

***

**8.** Create a text processing pipeline that:
- Takes raw text input
- Tokenizes and removes stopwords
- Applies lemmatization
- Extracts named entities
- Performs sentiment analysis

Apply it to a paragraph from a news article and present the results in a structured format.

***

# 🌞 https://github.com/AI-Planet 🌞