
# Introduction to Natural Language Processing

## What is Natural Language Processing?

Have you ever wondered how Siri understands what you say? Or how Google Translate can translate between hundreds of languages? These are just a couple of examples of NLP in action.

You might be wondering, what exactly is NLP? Well, in simple terms, it's a field in artificial intelligence that helps computers understand, interpret, and respond to human language in a meaningful way.

Cool, right?

Applications of NLP are everywhere! The field has grown rapidly in recent years, with ChatGPT and other applications such as sentiment analysis, chatbots, recommendation systems, and more!

In this lab, we're going to start with some basics of NLP as building blocks.
Don't worry if you're not a programming pro; we're going to guide you every step of the way. Here's what we're going to do:

1. **Introduction to Python for NLP**: Python is a favorite language among NLP practitioners, and you're about to find out why! We'll get you started with some of the basic coding concepts you'll need for this lab.

2. **Playing with the NLTK library**: NLTK stands for Natural Language Toolkit, and it's a fantastic tool for dealing with human language data. It's going to be our best friend in this lab!

3. **Diving into Text Processing**: We'll go through some important NLP techniques like tokenization, stemming, and lemmatization. These are just fancy terms for breaking down and simplifying text so a computer can understand it.

Let's dive in...



## Lab: Getting Started with Python for NLP


Python is a great language for NLP due to its simplicity and the powerful libraries it has.


### Introduction to NLTK library

Let's begin by importing the Natural Language Toolkit (NLTK), one of the most popular libraries for NLP in Python. We'll also download a specific package called 'punkt'. Punkt is a pre-trained sentence tokenizer model; `word_tokenize` relies on it to split text into sentences before breaking each sentence into individual words or tokens, a fundamental step in many NLP tasks.


In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

### Basic text processing (tokenization, stemming, lemmatization)

Now we'll perform tokenization, one of the most basic yet crucial steps in NLP. Tokenization is the process of splitting a sentence into individual words or 'tokens'. Each token is like a meaningful piece of a puzzle, and together they form the full picture - the sentence. Let's try tokenizing a simple sentence and see what we get.


In [None]:
from nltk.tokenize import word_tokenize
text = "This is an example sentence for tokenization."

tokens = word_tokenize(text)
print(tokens)

['This', 'is', 'an', 'example', 'sentence', 'for', 'tokenization', '.']
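To get a feel for what a tokenizer does, here is a minimal sketch that uses only Python's built-in `re` module. The function name `simple_tokenize` and the regular expression are our own illustration, not NLTK's implementation; `word_tokenize` applies more careful rules.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

demo_tokens = simple_tokenize("This is an example sentence for tokenization.")
print(demo_tokens)
```

On this sentence the sketch gives the same tokens as the cell above. It starts to differ on trickier input: it breaks "don't" into `don`, `'`, `t`, whereas `word_tokenize` produces `do` and `n't`.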


Next up, we have two super useful techniques: 'stemming' and 'lemmatization'.

#### Stemming

**Stemming** is like trimming a word down to its root form. For example, 'running' and 'runs' both reduce to the stem 'run'. Stemming strips affixes (most often suffixes) using simple rules, so an irregular form such as 'ran' is left as it is.

This helps us standardize words and can improve matching, for example when a search for 'run' should also find documents that contain 'running'.

NLTK provides several off-the-shelf stemmers, and for this example, we'll use the Porter stemmer, one of the oldest and simplest stemming algorithms.



In [None]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stemmed = [stemmer.stem(token) for token in tokens]
print(stemmed)

['thi', 'is', 'an', 'exampl', 'sentenc', 'for', 'token', '.']
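Why does 'This' turn into 'thi'? A rule-based stemmer strips endings without knowing what the word means. The sketch below is a deliberately simplified stand-in (the function `naive_stem` and its suffix list are made up for this illustration and are not the Porter algorithm), but it shows the same kind of behavior.

```python
def naive_stem(word, suffixes=("ing", "ed", "es", "s")):
    """Lowercase a word and strip the first matching suffix (toy example, not Porter)."""
    word = word.lower()
    for suffix in suffixes:
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["This", "running", "runs", "ran", "is"]:
    print(w, "->", naive_stem(w))
```

The toy rules cut 'This' to 'thi', leave 'runn' for 'running', and do nothing for 'ran'. The real Porter stemmer has extra rules that handle the doubled consonant in 'running', but it is still purely rule-based.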



As you may have noticed, stemming is a bit crude and might give us a root that's not even a real word! That's where lemmatization comes in.

#### Lemmatization

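**Lemmatization** is another technique that is a bit smarter: it looks words up in a dictionary (WordNet, in NLTK's case) and gives us the basic form of a word, also known as the 'lemma', which is a real word. So, 'is', 'am', and 'are' would all become 'be', provided the lemmatizer is told they are verbs. By default, NLTK's `WordNetLemmatizer` treats every word as a noun, so verbs are left unchanged unless you pass `pos='v'`.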

In [None]:
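nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

# Tokenize a new sentence (the one shown in the output below)
text = "This is a more complex sentence that I am running to demonstrate lemmatization."
tokens = word_tokenize(text)

lemmatizer = WordNetLemmatizer()
# lemmatize() treats every token as a noun unless a part of speech is given
lemmatized = [lemmatizer.lemmatize(token) for token in tokens]
print(lemmatized)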

['This', 'is', 'a', 'more', 'complex', 'sentence', 'that', 'I', 'am', 'running', 'to', 'demonstrate', 'lemmatization', '.']


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
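Notice that 'is', 'am', and 'running' come out unchanged in the output above. That is because `lemmatize()` assumes each word is a noun unless told otherwise. The sketch below imitates that with a tiny lookup table; `TOY_LEMMAS` and `toy_lemmatize` are hypothetical names invented for this illustration, and the table is nothing like the size of WordNet's real data.

```python
# Hypothetical miniature lemma table keyed by (word, part of speech).
# WordNet's real data is far larger; these entries are for illustration only.
TOY_LEMMAS = {
    ("is", "v"): "be",
    ("am", "v"): "be",
    ("are", "v"): "be",
    ("running", "v"): "run",
    ("categories", "n"): "category",
}

def toy_lemmatize(word, pos="n"):
    """Look up (word, pos) in the toy table; return the word unchanged if absent."""
    return TOY_LEMMAS.get((word.lower(), pos), word)

print(toy_lemmatize("is"))          # treated as a noun: no entry, so unchanged
print(toy_lemmatize("is", "v"))     # treated as a verb: maps to 'be'
print(toy_lemmatize("categories"))  # noun entry exists: maps to 'category'
```

The takeaway is that a lemmatizer usually needs the part of speech to pick the right lemma, which is one reason POS tagging (covered later in this lab) is so useful.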


In [None]:
text = "“In linguistic morphology, inflection is a process of word formation, \
        in which a word is modified to express different grammatical categories \
        such as tense, case, voice, aspect, person, number, gender, mood, animacy, and definiteness.”"
tokens = word_tokenize(text)
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(token) for token in tokens]
print(lemmatized)

['“', 'In', 'linguistic', 'morphology', ',', 'inflection', 'is', 'a', 'process', 'of', 'word', 'formation', ',', 'in', 'which', 'a', 'word', 'is', 'modified', 'to', 'express', 'different', 'grammatical', 'category', 'such', 'a', 'tense', ',', 'case', ',', 'voice', ',', 'aspect', ',', 'person', ',', 'number', ',', 'gender', ',', 'mood', ',', 'animacy', ',', 'and', 'definiteness', '.', '”']


### Stemming vs Lemmatization showdown!


| Technique | Type | Points |
| --- | --- | --- |
| **Stemming** | Pros | **Boosts speed**: Helps computer programs run faster. |
| | | **Groups words**: Lumps together words that mean similar things. |
| | | **Simplifies analysis**: Makes it easier to understand and compare texts. |
| | Cons | **Can mix up words**: Sometimes, wrongly groups different words together. |
| | | **Can split similar words**: At times, fails to group words that are similar. |
| | | **Struggles with complex languages**: Languages with complicated grammar can pose challenges. |
| **Lemmatization** | Pros | **More accurate**: Gives more precise results because it uses a dictionary and, when provided, the word's part of speech. |
| | Cons | **Takes longer**: Compared to stemming, lemmatization takes more time because it does a more thorough job. |
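The two stemming cons in the table, wrongly grouping different words and failing to group similar ones, are easier to see with a concrete case. The sketch below uses a made-up "truncation stemmer" that keeps the first seven characters of each word. It is an assumption for illustration only and not a stemmer that NLTK ships.

```python
from collections import defaultdict

def truncation_stem(word, length=7):
    """Toy stemmer: keep only the first `length` characters, lowercased."""
    return word.lower()[:length]

words = ["universe", "university", "universal", "data", "datum"]
groups = defaultdict(list)
for w in words:
    # Words that share a stem end up in the same group.
    groups[truncation_stem(w)].append(w)

for stem, members in groups.items():
    print(stem, members)
```

Here 'universe', 'university', and 'universal' all land in one group even though their meanings differ (mixing up words), while 'data' and 'datum' stay apart even though they are forms of the same word (splitting similar words).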



### YOU TRY:

**Exercise 1:** Tokenize the following sentence: "This is yet another example sentence that is being used for the NLP lab."


In [None]:
sentence = "This is yet another example sentence that is being used for the NLP lab."
### YOUR CODE HERE

['This', 'is', 'yet', 'another', 'example', 'sentence', 'that', 'is', 'being', 'used', 'for', 'the', 'NLP', 'lab', '.']



**Exercise 2:** Stem and lemmatize the tokens obtained in the previous step.


In [None]:
### YOUR CODE HERE

Stemmed:  ['thi', 'is', 'yet', 'anoth', 'exampl', 'sentenc', 'that', 'is', 'be', 'use', 'for', 'the', 'nlp', 'lab', '.']
Lemmatized:  ['This', 'is', 'yet', 'another', 'example', 'sentence', 'that', 'is', 'being', 'used', 'for', 'the', 'NLP', 'lab', '.']


### Part of Speech (POS) Tagging
POS tagging is the task of labeling the words in a sentence with their appropriate part of speech (noun, verb, adjective, etc.). NLTK provides the `pos_tag` function for this:



In [None]:
nltk.download('averaged_perceptron_tagger')
from nltk import pos_tag

tokens = word_tokenize("This is an example sentence for POS tagging.")
tagged_tokens = pos_tag(tokens)
print(tagged_tokens)


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


[('This', 'DT'), ('is', 'VBZ'), ('an', 'DT'), ('example', 'NN'), ('sentence', 'NN'), ('for', 'IN'), ('POS', 'NNP'), ('tagging', 'NN'), ('.', '.')]


Here's a breakdown of the sentence "This is an example sentence for POS tagging.":


- "This/DT": "This" is a determiner.

- "is/VBZ": "is" is a verb, 3rd person singular present.

- "an/DT": "an" is a determiner.

- "example/NN": "example" is a noun.

- "sentence/NN": "sentence" is a noun.

- "for/IN": "for" is a preposition.

- "POS/NNP": "POS" is tagged as a singular proper noun, most likely because it is capitalized.

- "tagging/NN": "tagging" is a noun.

- "./.": "." is sentence-final punctuation; punctuation marks get their own tags.
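Once tokens are tagged, the tags are easy to work with because the result is just a list of `(word, tag)` pairs. As a small sketch, the snippet below copies the tagged output from above into a list (the variable name `tagged_example` is ours) and pulls out the nouns and verbs.

```python
# Tagged tokens copied from the pos_tag output shown above.
tagged_example = [
    ("This", "DT"), ("is", "VBZ"), ("an", "DT"), ("example", "NN"),
    ("sentence", "NN"), ("for", "IN"), ("POS", "NNP"), ("tagging", "NN"),
    (".", "."),
]

# Penn Treebank noun tags all start with "NN" (NN, NNS, NNP, NNPS).
nouns = [word for word, tag in tagged_example if tag.startswith("NN")]
# Verb tags all start with "VB" (VB, VBD, VBG, VBN, VBP, VBZ).
verbs = [word for word, tag in tagged_example if tag.startswith("VB")]

print("Nouns:", nouns)
print("Verbs:", verbs)
```

Matching on the tag prefix is a common shortcut, since related Penn Treebank tags share their first two letters.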


### Common POS Tags
The tags used are from the Penn Treebank Project, a widely used resource for training NLP models. The tagset includes:


- NN: noun, singular or mass

- DT: determiner

- VB: verb, base form

- WRB: Wh-adverb

- JJ: adjective

- MD: modal

- IN: preposition or subordinating conjunction

In the tree output printed by `ne_chunk` in the NER section below, the "S" at the top stands for Sentence and marks the root of the tree. This notation is commonly used in NLP for parsing sentences into their grammatical structure.

Please note that part-of-speech tagging is not always 100% accurate, especially with ambiguous sentences or words that can serve multiple functions depending on the context. For example, in the woodchuck sentence from Exercise 3 below, the two occurrences of "chuck" receive different tags (VBZ and VB), and "woodchuck" could have been tagged differently under another interpretation.

### Named Entity Recognition (NER)
NER is the task of finding named entities like person names, geographic locations, company names, etc.

In [None]:
nltk.download('maxent_ne_chunker')
nltk.download('words')

from nltk import ne_chunk

ne_tree = ne_chunk(tagged_tokens)
print(ne_tree)


(S
  This/DT
  is/VBZ
  an/DT
  example/NN
  sentence/NN
  for/IN
  (ORGANIZATION POS/NNP)
  tagging/NN
  ./.)


[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!
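The printed tree can be read as a list in which ordinary tokens sit directly under `S` and each named entity is wrapped in a labeled chunk. The sketch below rebuilds that shape with plain Python tuples and lists so it runs without NLTK. `tree_sketch` and `extract_entities` are hypothetical names for this illustration and are a simplified stand-in for `nltk.Tree`, not the real class.

```python
# Simplified stand-in for the printed tree: plain tokens are (word, tag) pairs,
# and a chunk is a (label, [(word, tag), ...]) pair. This is not nltk.Tree.
tree_sketch = [
    ("This", "DT"), ("is", "VBZ"), ("an", "DT"), ("example", "NN"),
    ("sentence", "NN"), ("for", "IN"),
    ("ORGANIZATION", [("POS", "NNP")]),
    ("tagging", "NN"), (".", "."),
]

def extract_entities(nodes):
    """Return (label, text) for every chunk in the simplified tree."""
    entities = []
    for first, second in nodes:
        if isinstance(second, list):  # a chunk: label plus its tokens
            text = " ".join(word for word, _tag in second)
            entities.append((first, text))
    return entities

print(extract_entities(tree_sketch))
```

With the real `ne_tree`, the same idea applies: chunks are subtrees with a `label()`, and everything else is a plain `(word, tag)` tuple. Note that "POS" is not actually an organization here, so this is a good reminder that NER makes mistakes.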


### YOU TRY:

**Exercise 3:** Go through the process (using different variable names) to tag the parts of speech in the following sentence:


In [None]:
woodchuck = "how much wood could a woodchuck named Mr. WoodChuck chuck if a woodchuck could chuck wood"
### YOUR CODE HERE





[('how', 'WRB'), ('much', 'JJ'), ('wood', 'NN'), ('could', 'MD'), ('a', 'DT'), ('woodchuck', 'NN'), ('named', 'VBN'), ('Mr.', 'NNP'), ('WoodChuck', 'NNP'), ('chuck', 'VBZ'), ('if', 'IN'), ('a', 'DT'), ('woodchuck', 'NN'), ('could', 'MD'), ('chuck', 'VB'), ('wood', 'NN')]


**Exercise 4:** Now perform NER (named entity recognition) on the sentence above, and consider which entities it finds in that sentence and how it might be working.

In [None]:
### YOUR CODE HERE


(S
  how/WRB
  much/JJ
  wood/NN
  could/MD
  a/DT
  woodchuck/NN
  named/VBN
  (PERSON Mr./NNP WoodChuck/NNP)
  chuck/VBZ
  if/IN
  a/DT
  woodchuck/NN
  could/MD
  chuck/VB
  wood/NN)
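One way to think about how NER "might be working" is that POS tags and capitalization are strong clues. The sketch below is a simple heuristic of our own, not NLTK's actual chunker (which is a trained classifier): it joins runs of consecutive proper-noun tags into candidate entities. The names `tagged_woodchuck` and `proper_noun_runs` are made up for this illustration.

```python
from itertools import groupby

# Tagged tokens copied from the Exercise 3 output above.
tagged_woodchuck = [
    ("how", "WRB"), ("much", "JJ"), ("wood", "NN"), ("could", "MD"),
    ("a", "DT"), ("woodchuck", "NN"), ("named", "VBN"), ("Mr.", "NNP"),
    ("WoodChuck", "NNP"), ("chuck", "VBZ"), ("if", "IN"), ("a", "DT"),
    ("woodchuck", "NN"), ("could", "MD"), ("chuck", "VB"), ("wood", "NN"),
]

def proper_noun_runs(tagged):
    """Join each run of consecutive NNP/NNPS tokens into one candidate entity."""
    runs = []
    for is_proper, group in groupby(tagged, key=lambda pair: pair[1] in ("NNP", "NNPS")):
        if is_proper:
            runs.append(" ".join(word for word, _tag in group))
    return runs

print(proper_noun_runs(tagged_woodchuck))
```

The heuristic finds the same span that the chunker labeled `PERSON`, but it cannot say what kind of entity it is. Deciding between person, organization, and place is the part that needs a trained model.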


## Practice on Your Own





**Exercise 5:** Write a function that takes a sentence as input and returns the tokens, stemmed words, and lemmatized words.

In [None]:
sentence = "This is another example sentence for practice."
### YOUR CODE HERE


Tokens: ['This', 'is', 'another', 'example', 'sentence', 'for', 'practice', '.']
Stemmed: ['thi', 'is', 'anoth', 'exampl', 'sentenc', 'for', 'practic', '.']
Lemmatized: ['This', 'is', 'another', 'example', 'sentence', 'for', 'practice', '.']


**Exercise 6:** Write a function that takes a sentence, tokenizes it, performs POS tagging and NER, and returns the result.

In [None]:
sentence = "Elon Musk, the CEO of SpaceX, announced a new mission to Mars."
### YOUR CODE HERE



(S
  (PERSON Elon/NNP)
  (GPE Musk/NNP)
  ,/,
  the/DT
  (ORGANIZATION CEO/NNP)
  of/IN
  (ORGANIZATION SpaceX/NNP)
  ,/,
  announced/VBD
  a/DT
  new/JJ
  mission/NN
  to/TO
  (PERSON Mars/NNP)
  ./.)


### Bonus section and Preview for Next Time!

See if you can look through this code and think about what it does:

In [None]:

from nltk.corpus import webtext
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from collections import Counter

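# Initialize stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Load the English stopword list
# (on a fresh runtime, first run nltk.download('stopwords') and nltk.download('webtext'))
stop_words = set(stopwords.words('english'))
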
# Get the web and chat text corpus
text = webtext.raw()

# Tokenize text
tokens = word_tokenize(text)

# Remove stopwords
tokens = [token for token in tokens if token not in stop_words]

# Perform stemming
stemmed_tokens = [stemmer.stem(token) for token in tokens]

# Perform lemmatization
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]

# Identify most common words
most_common_stemmed_words = Counter(stemmed_tokens).most_common(10)
most_common_lemmatized_words = Counter(lemmatized_tokens).most_common(10)

print("Most common stemmed words: ", most_common_stemmed_words)
print("Most common lemmatized words: ", most_common_lemmatized_words)


Most common stemmed words:  [('.', 16479), (':', 14328), (',', 12427), ('i', 7805), ('?', 4743), ('!', 4417), ('*', 3944), ('#', 3777), ("n't", 3481), ("'s", 3198)]
Most common lemmatized words:  [('.', 16479), (':', 14328), (',', 12427), ('I', 7805), ('?', 4743), ('!', 4417), ('*', 3944), ('#', 3777), ("n't", 3477), ("'s", 3181)]
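Notice that the "most common words" above are mostly punctuation, and that 'I' survives stopword removal because the check is case-sensitive while NLTK's English stopword list is lowercase. The sketch below shows one possible cleanup on a tiny made-up example; `sample_tokens` and `sample_stop_words` are hypothetical stand-ins for the real corpus and stopword list.

```python
from collections import Counter

# Hypothetical miniature token list and stopword set, for illustration only.
sample_tokens = ["I", "like", "Firefox", ".", "I", "do", "n't", "like",
                 "bugs", "!", "Bugs", ":", "the", "worst"]
sample_stop_words = {"i", "do", "the"}

cleaned = [
    token.lower()                       # normalize case before counting
    for token in sample_tokens
    if token.isalpha()                  # drop punctuation and pieces like "n't"
    and token.lower() not in sample_stop_words
]
print(Counter(cleaned).most_common(2))
```

Lowercasing before the stopword check and keeping only alphabetic tokens leaves counts that reflect actual words. Whether to drop tokens such as "n't" depends on the task, so treat this filter as one reasonable choice rather than the only one.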
