# What is NLP?

Natural Language Processing (NLP) = teaching computers to understand and work with human language (such as English or Hindi).


### Why is it needed?
Because computers don’t understand human language naturally. They operate on numbers (ultimately 0s and 1s), so we use NLP to help them make sense of our messy, emotional, and context-heavy language.

<hr>

### 1. Text Preprocessing in NLP

Text data is messy. Before you can train a model, you must clean and standardize it.
Let’s go through all the common preprocessing steps:

In [1]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/abdullahshaikh/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/abdullahshaikh/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/abdullahshaikh/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/abdullahshaikh/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/abdullahshaikh/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Step 1: Tokenization

### 👉 What is it?
Breaking text into smaller parts (sentences or words).
These parts are called tokens.

Example:

"I love machine learning"

After word tokenization →

`["I", "love", "machine", "learning"]`
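To see why a dedicated tokenizer is useful, here is a minimal sketch that uses only Python's built-in `str.split()`. It is not how NLTK tokenizes, and the second sample string is made up for illustration.

```python
# Naive whitespace tokenization with the built-in str.split().
sentence = "I love machine learning"
tokens = sentence.split()
print(tokens)  # ['I', 'love', 'machine', 'learning']

# Limitation: punctuation stays glued to the neighbouring word.
messy = "Hello, world! How are you?"
messy_tokens = messy.split()
print(messy_tokens)  # ['Hello,', 'world!', 'How', 'are', 'you?']
```

Whitespace splitting is enough for the simple example, but tokens like `'Hello,'` show why real tokenizers treat punctuation as separate tokens.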

In [2]:
### paragraph ---> Sentence

from nltk.tokenize import sent_tokenize

corpus = """Hello Abdullah this side. Hope you're doing
good and in pink of health.
"""

sent_tokenize(corpus)  # splits at sentence-ending punctuation, not at line breaks

['Hello Abdullah this side.', "Hope you're doing\ngood and in pink of health."]

In [3]:
### paragraph ---> words
from nltk.tokenize import word_tokenize

word_tokenize(corpus)

['Hello',
 'Abdullah',
 'this',
 'side',
 '.',
 'Hope',
 'you',
 "'re",
 'doing',
 'good',
 'and',
 'in',
 'pink',
 'of',
 'health',
 '.']

In [4]:
from nltk.tokenize import wordpunct_tokenize

wordpunct_tokenize(corpus)

['Hello',
 'Abdullah',
 'this',
 'side',
 '.',
 'Hope',
 'you',
 "'",
 're',
 'doing',
 'good',
 'and',
 'in',
 'pink',
 'of',
 'health',
 '.']
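The two outputs differ only around the apostrophe: `word_tokenize` keeps `'re` together as one token, while `wordpunct_tokenize` splits text into runs of word characters and runs of punctuation. NLTK's documentation describes `WordPunctTokenizer` as using the regular expression `\w+|[^\w\s]+`, so the behaviour can be sketched with the standard `re` module. The shortened sample sentence is an assumption made for this illustration.

```python
import re

# Pattern documented for NLTK's WordPunctTokenizer:
# runs of word characters, or runs of non-space punctuation.
WORDPUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")

sample = "Hope you're doing good."
pieces = WORDPUNCT_PATTERN.findall(sample)
print(pieces)
# ['Hope', 'you', "'", 're', 'doing', 'good', '.']
```

Because the apostrophe is not a word character, `you're` becomes three tokens here, which matches the `wordpunct_tokenize` output above.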

### Step 2: Stopword Removal

#### What are stopwords?
Stopwords are common words in a language that don’t add much meaning for analysis.

Examples of stopwords in English:

`[is, the, a, an, and, I, you, of, to, in]`

In [5]:
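from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text = "I am learning natural language processing"
words = word_tokenize(text)

# Build the stopword set once; set lookups are fast, and this avoids
# reloading the list for every word.
stop_words = set(stopwords.words('english'))

# Compare in lowercase so that "I" matches the lowercase entry "i".
filtered = [w for w in words if w.lower() not in stop_words]
print(filtered)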

['learning', 'natural', 'language', 'processing']
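Stopword removal is not always harmless: NLTK's English list includes negations such as "not", and dropping them can flip the meaning of a sentence. The sketch below uses a tiny hand-picked stopword set and a hypothetical helper `remove_stopwords` (neither comes from NLTK) to show how a keep-list can protect such words.

```python
# A tiny, hand-picked stopword set for illustration only;
# real lists such as NLTK's are much longer.
STOPWORDS = {"i", "am", "is", "the", "a", "an", "and", "this", "not"}

# Words we choose to keep even though they appear in the stopword set.
KEEP = {"not"}

def remove_stopwords(tokens, stopwords=STOPWORDS, keep=KEEP):
    """Drop tokens whose lowercase form is a stopword, unless it is in keep."""
    active = stopwords - keep
    return [t for t in tokens if t.lower() not in active]

tokens = "This movie is not good".split()
print(remove_stopwords(tokens, keep=set()))  # ['movie', 'good']
print(remove_stopwords(tokens))              # ['movie', 'not', 'good']
```

Without the keep-list the review reads as positive; with it, the negation survives. Whether to keep such words depends on the task (for example, sentiment analysis usually needs them).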


<hr>

## Stemming and Lemmatization

Both are used to reduce a word to its root form, but they do it differently.

### Stemming – (Rough Cut ✂️)
Reduce a word to its <b>base/root</b> by chopping off suffixes, regardless of whether the result is a meaningful word.

Example:

`"playing", "played", "plays" → "play"`



`"studies", "studying" → "studi"`
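To make the "chopping off suffixes" idea concrete, here is a deliberately crude sketch. The function `toy_stem` and its suffix list are assumptions made for illustration; real stemmers such as Porter apply many more rules, so their results can differ.

```python
# Suffixes are tried in this order; the first match wins.
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    """Strip one known suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["playing", "played", "plays", "studies", "string"]:
    print(f"{w} -> {toy_stem(w)}")
# playing -> play
# played -> play
# plays -> play
# studies -> studi
# string -> str
```

The last line shows the "rough cut" problem: a rule that knows nothing about meaning happily turns "string" into "str".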


Common Stemmers in Python:

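- PorterStemmer – the classic algorithm; simple and widely used

- LancasterStemmer – more aggressive; often produces shorter stems

- SnowballStemmer – an updated version of the Porter algorithm (also known as Porter2) that also supports languages other than English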

In [14]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["playing", "flies", "happiness", "maximum", "emotion", "studies", "crying", "nationalism"]

print("PorterStemmer Results:")
for word in words:
    print(f"{word} → {stemmer.stem(word)}")


PorterStemmer Results:
playing → play
flies → fli
happiness → happi
maximum → maximum
emotion → emot
studies → studi
crying → cri
nationalism → nation


In [15]:
from nltk.stem import LancasterStemmer

stemmer = LancasterStemmer()
words = ["playing", "flies", "happiness", "maximum", "emotion", "studies", "crying", "nationalism"]

print("LancasterStemmer Results:")
for word in words:
    print(f"{word} → {stemmer.stem(word)}")

LancasterStemmer Results:
playing → play
flies → fli
happiness → happy
maximum → maxim
emotion → emot
studies → study
crying → cry
nationalism → nat


In [13]:
from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer("english")

print(stemmer.stem("playing"))       # play
print(stemmer.stem("flies"))         # fli
print(stemmer.stem("universities"))  # univers
print(stemmer.stem("studying"))      # studi

play
fli
univers
studi


In [16]:
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")
words = ["playing", "flies", "happiness", "maximum", "emotion", "studies", "crying", "nationalism"]

print("SnowballStemmer Results:")
for word in words:
    print(f"{word} → {stemmer.stem(word)}")


SnowballStemmer Results:
playing → play
flies → fli
happiness → happi
maximum → maximum
emotion → emot
studies → studi
crying → cri
nationalism → nation


## What is Lemmatization?

Lemmatization is the process of converting a word to its <b>dictionary base form (called the lemma)</b> using a vocabulary and grammatical information such as the word’s part of speech.

Real-Life Example:

Original sentence:

`"The children were playing and studies were ongoing."`

With stemming:

`["The", "children", "were", "play", "and", "studi", "were", "ongo"]`

With lemmatization (when each word’s part of speech is known):

`["The", "child", "be", "play", "and", "study", "be", "ongoing"]`

Much cleaner and more meaningful, right?
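One way to picture the difference from stemming is a dictionary lookup keyed by the word and its part of speech. The table and the function `toy_lemmatize` below are invented for illustration and cover only a handful of words; a real lemmatizer relies on a full lexical database such as WordNet.

```python
# (word, part of speech) -> lemma; "n" = noun, "v" = verb, "a" = adjective.
LEMMA_TABLE = {
    ("children", "n"): "child",
    ("studies", "n"): "study",
    ("were", "v"): "be",
    ("playing", "v"): "play",
    ("better", "a"): "good",
}

def toy_lemmatize(word, pos="n"):
    """Look up the lemma; return the word unchanged if it is not listed."""
    return LEMMA_TABLE.get((word.lower(), pos), word)

print(toy_lemmatize("children"))          # child
print(toy_lemmatize("playing"))           # playing (treated as a noun)
print(toy_lemmatize("playing", pos="v"))  # play
print(toy_lemmatize("were", pos="v"))     # be
```

The same word can map to different lemmas depending on its part of speech, which is why the tag matters so much for lemmatization.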

In [17]:
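from nltk.stem import WordNetLemmatizer

words = ["The", "children", "were", "playing", "and", "studies", "were", "ongoing"]

lemmatizer = WordNetLemmatizer()

# Without a pos argument, every word is treated as a noun,
# so verb forms such as "were" and "playing" come back unchanged.
for word in words:
    print(f"{word} → {lemmatizer.lemmatize(word)}")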

The → The
children → child
were → were
playing → playing
and → and
studies → study
were → were
ongoing → ongoing


In [18]:
# POS - Part Of Speech

print(lemmatizer.lemmatize("running", pos="v"))  # Output: "run" (pos="v" means verb)
print(lemmatizer.lemmatize("better", pos="a"))   # Output: "good" (pos="a" means adjective)

run
good
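In practice the part of speech usually comes from a tagger that emits Penn Treebank tags (such as `VBG` or `JJR`), while WordNet expects the single letters `n`, `v`, `a`, and `r`. The helper below, `penn_to_wordnet`, is a hypothetical name and a simplified mapping written for this sketch, not part of NLTK.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # everything else falls back to noun

for tag in ["NNS", "VBD", "VBG", "JJR", "RB"]:
    print(tag, "->", penn_to_wordnet(tag))
# NNS -> n
# VBD -> v
# VBG -> v
# JJR -> a
# RB -> r
```

The returned letter can be passed as the `pos` argument, for example `lemmatizer.lemmatize("playing", pos=penn_to_wordnet("VBG"))`. The tags themselves could come from a tagger such as `nltk.pos_tag`, which needs an additional tagger resource that is not downloaded in this notebook.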
