# Understanding Natural Language Processing (NLP)
Natural Language Processing (NLP) is a branch of artificial intelligence (AI) focused on enabling computers to understand, interpret, and respond to human language in a way that is both meaningful and useful. NLP bridges the gap between human communication and machine understanding, leveraging computational methods to process and analyze large amounts of natural language data. It combines principles of linguistics, computer science, and machine learning to tackle the complexities of human language.

## Key Concepts
1. Tokenization  
Tokenization is the process of breaking down a text into smaller units called tokens, which can be words, phrases, or characters. For example, the sentence "NLP is fascinating!" can be tokenized into ["NLP", "is", "fascinating", "!"].

    - Word-level tokenization focuses on splitting text into words.
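    - Subword tokenization splits words into smaller pieces (a word such as "unhappiness" might become "un", "happi", "ness"). It is common in machine translation, where word fragments carry meaning and rare words must still be representable.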
    - Sentence tokenization splits text into sentences.
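
The three granularities can be illustrated with the standard library alone. This is a minimal sketch that assumes simple regular-expression rules; the function names are hypothetical, and the rules are far cruder than those of a trained tokenizer such as NLTK's.

```python
import re

def word_tokens(text):
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation
    # character, so "fascinating!" yields "fascinating" and "!".
    return re.findall(r"\w+|[^\w\s]", text)

def sentence_tokens(text):
    # Naive rule: split on whitespace that follows ., ! or ?.
    # Abbreviations such as "Dr." would be split incorrectly.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def char_tokens(text):
    # Character-level tokenization: every character is a token.
    return list(text)

print(word_tokens("NLP is fascinating!"))
# ['NLP', 'is', 'fascinating', '!']
print(sentence_tokens("NLP is fascinating! It powers chatbots."))
# ['NLP is fascinating!', 'It powers chatbots.']
```

The word-level result matches the example given above; the sentence splitter is only reliable on text without abbreviations or decimal numbers.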

2. Part-of-Speech (POS) Tagging  
POS tagging labels each word in a sentence with its grammatical category, such as noun, verb, or adjective. For example:

    - "The dog barks" → "The (determiner) dog (noun) barks (verb)".

3. Named Entity Recognition (NER)  
NER identifies and classifies entities within text, such as names of people, organizations, locations, dates, and more.

    - Example: "Barack Obama was born in Hawaii" → Barack Obama (Person), Hawaii (Location).
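
To make the input and output of NER concrete, here is a minimal sketch based on gazetteer lookup, that is, matching text against a hand-written list of known names. The `GAZETTEER` table and `find_entities` function are hypothetical. Statistical NER systems learn to recognize entities they have never seen, which this lookup cannot do.

```python
# Hypothetical gazetteer: a hand-written map from entity names to labels.
GAZETTEER = {
    "Barack Obama": "Person",
    "Hawaii": "Location",
}

def find_entities(text, gazetteer=GAZETTEER):
    """Return (entity, label, start_offset) for every gazetteer hit in text."""
    hits = []
    for name, label in gazetteer.items():
        start = text.find(name)
        while start != -1:
            hits.append((name, label, start))
            start = text.find(name, start + len(name))
    # Report entities in the order they appear in the text.
    return sorted(hits, key=lambda hit: hit[2])

print(find_entities("Barack Obama was born in Hawaii"))
# [('Barack Obama', 'Person', 0), ('Hawaii', 'Location', 25)]
```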

4. Sentiment Analysis  
Sentiment analysis determines the emotional tone or opinion expressed in a piece of text. It typically classifies text as positive, negative, or neutral and is often applied to product reviews and social media posts.

    - Example: "I love this product!" → Positive sentiment.
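
One simple way to arrive at such a label is lexicon-based scoring: count positive and negative words and compare. The sketch below assumes two tiny hand-written word lists (`POSITIVE` and `NEGATIVE` are hypothetical); it ignores negation ("not good") and sarcasm, which real systems have to handle.

```python
import re

# Hypothetical mini-lexicon; real lexicons hold thousands of scored words.
POSITIVE = {"love", "great", "excellent", "good"}
NEGATIVE = {"hate", "terrible", "awful", "bad"}

def sentiment(text):
    # Lowercase and keep only alphabetic words (and apostrophes).
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("I love this product!"))  # positive
```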

5. Stemming and Lemmatization  
These techniques reduce words to their root forms to simplify text processing:

    - Stemming strips suffixes using fixed rules (e.g., "running" → "run").
    - Lemmatization uses a vocabulary and the word's part of speech to return the dictionary form (e.g., "better", as an adjective, → "good").

6. Syntax and Parsing  
Parsing analyzes the grammatical structure of a sentence to understand its syntax. Dependency parsing identifies relationships between words in a sentence, aiding in understanding complex sentences.
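
A dependency parse can be stored as a list in which each token points to its head. The sketch below uses a hand-written parse of "The dog barks" rather than the output of a parser; the relation names (`det`, `nsubj`, `root`) follow Universal Dependencies conventions, and the helper functions are hypothetical.

```python
# Hand-written parse of "The dog barks".
# Each entry is (token, head_position, relation); positions are 1-based
# and head_position 0 marks the root of the sentence.
parse = [
    ("The", 2, "det"),      # "The" modifies token 2, "dog"
    ("dog", 3, "nsubj"),    # "dog" is the subject of token 3, "barks"
    ("barks", 0, "root"),
]

def root(parse):
    """Return the token that has no head."""
    return next(tok for tok, head, _ in parse if head == 0)

def dependents(parse, head_position):
    """Return (token, relation) pairs attached to the given head."""
    return [(tok, rel) for tok, head, rel in parse if head == head_position]

print(root(parse))            # barks
print(dependents(parse, 3))   # [('dog', 'nsubj')]
```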

7. Text Classification  
Text classification assigns predefined categories to text, such as spam detection in emails or categorizing news articles into topics.
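
As an illustration of spam detection, the following is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing, written with the standard library only. The four training sentences are hypothetical toy data; a real classifier would be trained on thousands of labeled messages.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (text, label). Returns the counts the classifier needs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab

def predict(text, model):
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        # log P(label) + sum of log P(word | label), with add-one smoothing
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training data
model = train([
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch with the team on monday", "ham"),
])
print(predict("claim your free prize", model))        # spam
print(predict("agenda for the team meeting", model))  # ham
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, and the smoothing keeps an unseen word from forcing a probability of zero.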

8. Language Modeling  
Language models predict the probability of word sequences, enabling applications like autocomplete, speech recognition, and machine translation. Models like GPT (Generative Pre-trained Transformer) represent state-of-the-art approaches in this area.
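
The idea of assigning probabilities to word sequences can be shown with a bigram model, which estimates the probability of a word from the single word before it. This is a minimal sketch using raw counts on a hypothetical three-sentence corpus; it is far simpler than a neural model such as GPT and applies no smoothing.

```python
from collections import Counter

def train_bigram(sentences):
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens[:-1])             # count each token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # count adjacent pairs
    return unigrams, bigrams

def prob(word, prev, model):
    """Maximum-likelihood estimate of P(word | prev)."""
    unigrams, bigrams = model
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]

def suggest(prev, model):
    """Most frequent word after prev, as an autocomplete might offer."""
    _, bigrams = model
    candidates = {w: c for (p, w), c in bigrams.items() if p == prev}
    return max(candidates, key=candidates.get) if candidates else None

# Hypothetical toy corpus
corpus = ["i like nlp", "i like python", "i study nlp"]
model = train_bigram(corpus)
print(prob("like", "i", model))  # 0.6666666666666666
print(suggest("i", model))       # like
```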

**Installing NLTK**
`pip install nltk`

In [42]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk import pos_tag, ne_chunk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tree import Tree

In [43]:
# Download required NLTK data files
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('maxent_ne_chunker_tab')
nltk.download('words')
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/aayamojha/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /Users/aayamojha/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker_tab to
[nltk_data]     /Users/aayamojha/nltk_data...
[nltk_data]   Package maxent_ne_chunker_tab is already up-to-date!
[nltk_data] Downloading package words to /Users/aayamojha/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/aayamojha/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/aayamojha/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

| **Download**                 | **Purpose**                                                             | **Used By**                        |
|------------------------------|-------------------------------------------------------------------------|-----------------------------------|
| `punkt_tab`                     | Sentence and word tokenization rules.                                   | `sent_tokenize`, `word_tokenize`. |
| `averaged_perceptron_tagger_eng`| POS tagging model.                                                     | `nltk.pos_tag`.                   |
| `maxent_ne_chunker_tab`         | Named Entity Recognition (NER).                                         | `nltk.ne_chunk`.                  |
| `words`                     | English vocabulary dataset for validation in NER and other tasks.       | `nltk.ne_chunk` (indirectly).     |
| `wordnet`                   | Lexical database for lemmatization and advanced semantic analysis.      | `WordNetLemmatizer`.              |
| `stopwords`                 | List of stop words for filtering.                                       | `nltk.corpus.stopwords`.          |


In [44]:
# Example text
text = """Natural Language Processing (NLP) is a fascinating field of AI. 
          Researchers use it to develop tools like chatbots, translators, and summarizers."""

In [45]:
print("Tokenization:")
sentences = sent_tokenize(text)  # Sentence Tokenization
print("Sentences:", sentences)

Tokenization:
Sentences: ['Natural Language Processing (NLP) is a fascinating field of AI.', 'Researchers use it to develop tools like chatbots, translators, and summarizers.']


In [46]:
words = word_tokenize(text)  # Word Tokenization
print("Words:", words)

Words: ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'a', 'fascinating', 'field', 'of', 'AI', '.', 'Researchers', 'use', 'it', 'to', 'develop', 'tools', 'like', 'chatbots', ',', 'translators', ',', 'and', 'summarizers', '.']


In [47]:
print("\nPart-of-Speech (POS) Tagging:")
pos_tags = pos_tag(words)
print("POS Tags:", pos_tags)


Part-of-Speech (POS) Tagging:
POS Tags: [('Natural', 'JJ'), ('Language', 'NNP'), ('Processing', 'NNP'), ('(', '('), ('NLP', 'NNP'), (')', ')'), ('is', 'VBZ'), ('a', 'DT'), ('fascinating', 'JJ'), ('field', 'NN'), ('of', 'IN'), ('AI', 'NNP'), ('.', '.'), ('Researchers', 'NNP'), ('use', 'VBP'), ('it', 'PRP'), ('to', 'TO'), ('develop', 'VB'), ('tools', 'NNS'), ('like', 'IN'), ('chatbots', 'NNS'), (',', ','), ('translators', 'NNS'), (',', ','), ('and', 'CC'), ('summarizers', 'NNS'), ('.', '.')]


| Tag   | Meaning                                   |
|-----|:-------------------------------------------:|
| CC    | Coordinating conjunction                 |
| CD    | Cardinal number                          |
| DT    | Determiner                               |
| EX    | Existential "there"                      |
| FW    | Foreign word                             |
| IN    | Preposition or subordinating conjunction |
| JJ    | Adjective                                |
| JJR   | Adjective, comparative                   |
| JJS   | Adjective, superlative                   |
| LS    | List item marker                         |
| MD    | Modal                                    |
| NN    | Noun, singular or mass                   |
| NNS   | Noun, plural                             |
| NNP   | Proper noun, singular                    |
| NNPS  | Proper noun, plural                      |
| PDT   | Predeterminer                            |
| POS   | Possessive ending                        |
| PRP   | Personal pronoun                         |
| PRP\$  | Possessive pronoun                       |
| RB    | Adverb                                   |
| RBR   | Adverb, comparative                      |
| RBS   | Adverb, superlative                      |
| RP    | Particle                                 |
| SYM   | Symbol                                   |
| TO    | To                                       |
| UH    | Interjection                             |
| VB    | Verb, base form                          |
| VBD   | Verb, past tense                         |
| VBG   | Verb, gerund or present participle       |
| VBN   | Verb, past participle                    |
| VBP   | Verb, non-3rd person singular present    |
| VBZ   | Verb, 3rd person singular present        |
| WDT   | Wh-determiner                            |
| WP    | Wh-pronoun                               |
| WP\$  | Possessive wh-pronoun                    |
| WRB   | Wh-adverb                                |
| .     | Sentence-final punctuation (. ! ?)       |
| ,     | Comma                                    |
| :     | Colon, semicolon, or ellipsis            |
| (     | Left parenthesis                         |
| )     | Right parenthesis                        |
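
Tags for the same word class share a prefix in this tagset (NN, NNS, NNP, and NNPS all start with "NN"; the verb tags all start with "VB"), so tagged output can be filtered with a simple prefix test. The sketch below copies part of the tagger output shown earlier; the helper `words_with_prefix` is hypothetical.

```python
# Excerpt of the tagger output shown earlier, copied verbatim.
pos_tags = [('Researchers', 'NNP'), ('use', 'VBP'), ('it', 'PRP'), ('to', 'TO'),
            ('develop', 'VB'), ('tools', 'NNS'), ('like', 'IN'), ('chatbots', 'NNS')]

def words_with_prefix(tagged, prefix):
    """Return the words whose tag starts with the given prefix."""
    return [word for word, tag in tagged if tag.startswith(prefix)]

print(words_with_prefix(pos_tags, "NN"))  # ['Researchers', 'tools', 'chatbots']
print(words_with_prefix(pos_tags, "VB"))  # ['use', 'develop']
```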


In [49]:
print("\nStemming:")
stemmer = PorterStemmer()
stems = [stemmer.stem(word) for word in words]
print("Stems:", stems)


Stemming:
Stems: ['natur', 'languag', 'process', '(', 'nlp', ')', 'is', 'a', 'fascin', 'field', 'of', 'ai', '.', 'research', 'use', 'it', 'to', 'develop', 'tool', 'like', 'chatbot', ',', 'translat', ',', 'and', 'summar', '.']


In [50]:
print("\nLemmatization:")
lemmatizer = WordNetLemmatizer()
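# No pos argument is passed, so every token is looked up as a noun (the default).
# This is why "is" is not reduced to "be" in the output.
lemmas = [lemmatizer.lemmatize(word) for word in words]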
print("Lemmas:", lemmas)


Lemmatization:
Lemmas: ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'a', 'fascinating', 'field', 'of', 'AI', '.', 'Researchers', 'use', 'it', 'to', 'develop', 'tool', 'like', 'chatbots', ',', 'translator', ',', 'and', 'summarizers', '.']


# Difference Between Stemming and Lemmatization
Both stemming and lemmatization are techniques used in natural language processing (NLP) to reduce words to their base or root forms. However, they differ in their approaches, accuracy, and output.  

## Stemming
- Definition: Stemming cuts affixes off a word to approximate its root form, without considering the word's meaning or part of speech.
- Approach: Stemming applies heuristic rules; the Porter stemmer used above, for example, strips suffixes. It is generally faster but less accurate than lemmatization.
- Output: Stems are not always dictionary words; the output above contains "natur", "languag", and "fascin".

**Example of Stemming:**
- Word: "running"
- Stemmed Form: "run" (correct)
- Word: "better"
- Stemmed Form: "better" (unchanged; suffix rules cannot relate "better" to its base form "good")
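
The rule-based character of stemming can be seen in a toy suffix stripper. This is a minimal sketch with a hypothetical four-rule list; it is not the Porter algorithm, which has many more rules, including one that reduces the doubled consonant in "runn" to give "run".

```python
# Hypothetical rule list, checked in order; NOT the Porter algorithm.
SUFFIX_RULES = [("ing", ""), ("ies", "y"), ("ed", ""), ("s", "")]

def toy_stem(word, min_stem=3):
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        stem = word[: len(word) - len(suffix)]
        # Apply the first matching rule, but never leave a stem shorter
        # than min_stem characters (this protects words such as "is").
        if word.endswith(suffix) and len(stem) >= min_stem:
            return stem + replacement
    return word

for w in ["running", "studies", "tools", "better"]:
    print(w, "->", toy_stem(w))
# running -> runn
# studies -> study
# tools -> tool
# better -> better
```

The output "runn" shows how blind rule application produces non-words, and "better" passes through untouched because no suffix rule applies to it.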

## Lemmatization
- Definition: Lemmatization reduces a word to its base or dictionary form (lemma). Unlike stemming, it takes the word's part of speech (POS) into account.
- Approach: Lemmatization uses a vocabulary and morphological analysis to find the base form. It is more computationally intensive than stemming but usually more accurate. NLTK's `WordNetLemmatizer` treats every word as a noun unless a `pos` argument is supplied, so the POS has to be passed explicitly.
- Output: The result is a dictionary word whenever the lemmatizer knows the input; `WordNetLemmatizer` returns words it cannot find in WordNet unchanged (for example, "chatbots" in the output above).

**Example of Lemmatization:**
- Word: "running" (as a verb)
- Lemmatized Form: "run"
- Word: "better" (as an adjective)
- Lemmatized Form: "good" (the lemma of "better" is "good")
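
The role of the part of speech can be sketched with a lookup keyed on both the word and its POS. The `LEMMAS` table and `toy_lemmatize` function below are hypothetical stand-ins for a real lexical database such as WordNet; the POS codes (`n`, `v`, `a`, `r`) and the noun default mirror the convention of NLTK's `WordNetLemmatizer.lemmatize(word, pos)`.

```python
# Hypothetical miniature lexicon keyed on (word, part of speech).
# POS codes: n = noun, v = verb, a = adjective, r = adverb.
LEMMAS = {
    ("running", "v"): "run",
    ("better", "a"): "good",
    ("better", "r"): "well",
    ("tools", "n"): "tool",
}

def toy_lemmatize(word, pos="n"):
    # Unknown (word, pos) pairs are returned unchanged.
    return LEMMAS.get((word.lower(), pos), word)

print(toy_lemmatize("better", pos="a"))   # good
print(toy_lemmatize("better"))            # better (no noun entry)
print(toy_lemmatize("running", pos="v"))  # run
```

The same surface form maps to different lemmas depending on its POS, which is why a lemmatizer needs that information and a stemmer does not.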

# What are Stopwords?
Stopwords are common words that are filtered out in text processing tasks, especially in Natural Language Processing (NLP). These words typically add little meaning to the analysis of the content. They are very frequent in a language and are mostly function words such as pronouns, prepositions, articles, conjunctions, and auxiliary verbs.

Stopwords are removed from the text to reduce the dataset's size and to focus the analysis on the more meaningful words.

## Where Do Stopwords Come From?
Stopwords come from the structure of natural language and the need to process text in a more meaningful way. In a sentence, words like "and," "the," "is," "of," etc., are necessary for the sentence to make grammatical sense but don't contribute significant semantic value. Removing these words helps focus on the key content words that are more relevant for many NLP tasks like text classification, sentiment analysis, and information retrieval.  

**For example:**  

- Original Sentence: "The quick brown fox jumps over the lazy dog."
- Stopwords Removed: "quick brown fox jumps lazy dog."
- By removing stopwords, the focus shifts to the main content of the sentence.



## What Words Are Chosen as Stopwords?
The specific words chosen as stopwords depend on the language and the task. However, stopwords are generally function words that do not carry much meaning by themselves. These typically include:  

**Common Stopwords in English:**
- Articles: "a", "an", "the"
- Prepositions: "in", "on", "at", "by", "with"
- Pronouns: "I", "you", "he", "she", "it", "we", "they"
- Conjunctions: "and", "but", "or", "because", "if"
- Auxiliary Verbs: "is", "are", "was", "were", "have", "has", "do", "does"
- Determiners: "this", "that", "these", "those"
- Other Common Words: "to", "for", "of", "from", "as", "so", "up", "down"


In [55]:
text

'Natural Language Processing (NLP) is a fascinating field of AI. \n          Researchers use it to develop tools like chatbots, translators, and summarizers.'

In [54]:
words = word_tokenize(text)

# Load stopwords from NLTK
stop_words = set(stopwords.words('english'))

# Filter out stopwords
filtered_words = [word for word in words if word.lower() not in stop_words]

print(filtered_words)

['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'fascinating', 'field', 'AI', '.', 'Researchers', 'use', 'develop', 'tools', 'like', 'chatbots', ',', 'translators', ',', 'summarizers', '.']
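
The filtered list still contains punctuation tokens such as "(" and ".", because they are not in the stopword list. If punctuation is not needed downstream (an assumption that depends on the task), it can be dropped in a second pass. This minimal sketch uses `string.punctuation` from the standard library on the output copied from above.

```python
import string

# Output of the stopword filter above, copied verbatim.
filtered_words = ['Natural', 'Language', 'Processing', '(', 'NLP', ')',
                  'fascinating', 'field', 'AI', '.', 'Researchers', 'use',
                  'develop', 'tools', 'like', 'chatbots', ',', 'translators',
                  ',', 'summarizers', '.']

def is_punctuation(token):
    # True when every character of the token is an ASCII punctuation mark.
    return all(ch in string.punctuation for ch in token)

content_words = [w for w in filtered_words if not is_punctuation(w)]
print(content_words)
```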


In [56]:
stop_words

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r