## What is NLP?

**Natural Language Processing (NLP)** is a field of Artificial Intelligence that focuses on the interaction between humans and computers using natural language.

It enables machines to understand, interpret, generate, and respond to human language in ways that are useful for practical tasks.

### Goals of NLP:
- Convert unstructured language into structured data.
- Enable machines to understand human communication.
- Build applications that can read, listen, and generate human language.

---

## Real-World Applications of NLP

| Application            | Description |
|------------------------|-------------|
| 🤖 **Chatbots/Assistants** | Virtual assistants such as Siri, Alexa, and Google Assistant use NLP to understand and respond to user queries. |
| 🌍 **Translation**         | Tools like Google Translate convert text from one language to another. |
| 😊 **Sentiment Analysis**  | NLP can detect the emotional tone (positive, negative, neutral) in texts like reviews or tweets. |
| 🔍 **Search Engines**      | Google uses NLP to understand queries and rank pages. |
| 🏥 **Healthcare**          | NLP is used to extract information from clinical notes, medical records, and research papers. |
| 📄 **Document Summarization** | Automatic summarization of news, legal documents, etc. |
| ✉️ **Spam Detection**      | Email services use NLP to classify emails as spam or not. |

---

## Structured vs Unstructured Data

| Type         | Examples                      | Description |
|--------------|-------------------------------|-------------|
| **Structured**   | SQL tables, Excel sheets        | Clearly defined fields, easy to analyze. |
| **Unstructured** | Texts, Emails, PDFs, Social Media | Human language in raw form—difficult for machines to analyze without NLP. |

> **Note:** By commonly cited estimates, roughly **80% of the world’s data is unstructured**, which is one reason NLP is so important.

---

## Text Preprocessing Overview

Before using language data in models, we must clean and standardize it. This is called **preprocessing**.

### Common Preprocessing Steps:
1. Lowercasing  
2. Removing punctuation  
3. Tokenization (splitting text into words)  
4. Stopword removal (like “is”, “the”, “a”)  
5. Stemming / Lemmatization (converting words to root form)  
6. Removing special characters, numbers, emojis (if not needed)  

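The steps above can be sketched with only the Python standard library. This is a minimal illustration, not a production pipeline: the function name `clean_text` is hypothetical, `STOPWORDS` is a small hand-picked sample (real projects typically use a fuller list such as NLTK's), and stemming/lemmatization is left out.

```python
import re
import string

# Assumption: a tiny hand-picked stopword sample, for illustration only.
STOPWORDS = {"is", "the", "a", "this", "my"}

def clean_text(text):
    """Lowercase, strip punctuation and other symbols, tokenize, drop stopwords."""
    text = text.lower()                                                # step 1: lowercasing
    text = text.translate(str.maketrans("", "", string.punctuation))  # step 2: punctuation
    text = re.sub(r"[^a-z\s]", " ", text)                              # step 6: digits, emojis, symbols
    tokens = text.split()                                              # step 3: whitespace tokenization
    return [tok for tok in tokens if tok not in STOPWORDS]             # step 4: stopword removal

print(clean_text("The 2 cats are SLEEPING on a mat! 🤖"))
# ['cats', 'are', 'sleeping', 'on', 'mat']
```
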
---

## Hands-On Task: Simple Text Cleaning

Let’s start with a basic text preprocessing example using Python.

### 🔧 Sample Text:

```python
text = "Hello World! This is my FIRST NLP session. Let's clean this text. 🤖"
```

### NLP Techniques to Apply:
1) Text Preprocessing
2) Feature Extraction
3) Model Training
4) Output Generation

# Terms used in NLP:
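1) Corpus: the full collection of texts being analyzed
2) Document: a single text unit in the corpus (for example, one email or one review)
3) Vocabulary: the set of unique words across the corpus
4) Words: the individual tokens that make up a document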

# Text Preprocessing (Part 1)

---

## 1. Tokenization

**Definition:**  
Tokenization is the process of breaking text into smaller pieces called **tokens**. These tokens can be:

- **Words** → Word-level tokenization  
- **Sentences** → Sentence-level tokenization

### Why it matters:
- Machine learning models work with numbers, not raw text.
- Tokenization helps in separating meaningful units for further processing.

### Examples:

**Sentence:**  
`"NLP is fun!"`

- **Word tokens:**  
  `["NLP", "is", "fun", "!"]`

- **Sentence tokens:**  
  `["NLP is fun!"]`

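Both tokenization levels can be approximated with regular expressions. This is a rough sketch: the helper names `word_tokens` and `sentence_tokens` are hypothetical, and the patterns are simplifications. Libraries such as NLTK handle abbreviations, contractions, and other edge cases that this does not.

```python
import re

def word_tokens(text):
    # A token is either a run of word characters or a single punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

def sentence_tokens(text):
    # Split on whitespace that follows sentence-ending punctuation.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(word_tokens("NLP is fun!"))                      # ['NLP', 'is', 'fun', '!']
print(sentence_tokens("NLP is fun! I use it daily."))  # ['NLP is fun!', 'I use it daily.']
```
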
---

## 2. Stopwords

**Definition:**  
Stopwords are common words in a language (like `"the"`, `"is"`, `"and"`) that carry little meaning and are often removed during preprocessing.

### Why remove them?
- They occur frequently but don’t contribute much to understanding.
- Removing stopwords helps reduce **noise** in the data and improves processing efficiency.

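A minimal sketch of stopword removal on an already tokenized sentence. The `STOPWORDS` set here is a small hand-picked sample and `remove_stopwords` is a hypothetical helper; NLTK's English list is considerably longer.

```python
# Assumption: a small hand-picked stopword sample, for illustration only.
STOPWORDS = {"the", "is", "and", "a", "of", "to"}

def remove_stopwords(tokens):
    # Compare in lowercase so "The" and "the" are treated the same.
    return [tok for tok in tokens if tok.lower() not in STOPWORDS]

tokens = ["The", "movie", "is", "long", "and", "boring"]
print(remove_stopwords(tokens))  # ['movie', 'long', 'boring']
```

Stopword removal is not always safe: NLTK's English list includes negations such as "not", and dropping them can change the meaning of a sentence in tasks like sentiment analysis.
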
---

## 3. Stemming vs Lemmatization

| Feature   | **Stemming**                           | **Lemmatization**                            |
|-----------|----------------------------------------|----------------------------------------------|
| **Function** | Strips suffixes using heuristic rules  | Uses vocabulary and morphology to find the base form (lemma) |
| **Output**   | Often not a real word                  | Real, dictionary-valid word                   |
| **Tool Example** | `PorterStemmer`                      | `WordNetLemmatizer`, `spaCy`                  |
| **Example**     | `"running"` → `"run"` (both)         | `"better"` → `"good"` (lemmatizer only; needs the part-of-speech tag) |

> **Tip:** Lemmatization is usually more accurate but slower than stemming.

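The contrast in the table can be illustrated with two toy functions. Everything here is a deliberately simplified stand-in: `toy_stem` is a hypothetical suffix stripper (not the Porter algorithm), and `LEMMAS` is a tiny hand-written lookup standing in for a real lexicon such as WordNet.

```python
def toy_stem(word):
    """Toy suffix stripper (NOT the Porter algorithm): rules only, no dictionary."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "runn" -> "run".
            if stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return word

# Assumption: a tiny hand-written lookup standing in for a real lexicon.
LEMMAS = {"running": "run", "studies": "study", "better": "good"}

def toy_lemmatize(word):
    """Dictionary lookup: returns a real base form when the word is known."""
    return LEMMAS.get(word, word)

for w in ("running", "studies", "better"):
    print(f"{w:8} stem={toy_stem(w):7} lemma={toy_lemmatize(w)}")
```

With these toy rules, `studies` is stemmed to the non-word `studi` while the lookup returns `study`, and `better` is left unchanged by the stemmer but mapped to `good` by the lookup.
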
---



# Feature Extraction (Part 2)

1) One Hot Encoding

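As a minimal sketch of the one-hot encoding listed above: each word is mapped to a vector as long as the vocabulary, with a single 1 at that word's index and 0 everywhere else. The helper name `one_hot` and the example vocabulary are illustrative assumptions.

```python
vocabulary = ["I", "love", "NLP", "is", "fun"]
index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    # All zeros except a single 1 at the word's vocabulary index.
    vector = [0] * len(vocabulary)
    vector[index[word]] = 1
    return vector

print(one_hot("love"))  # [0, 1, 0, 0, 0]
print(one_hot("fun"))   # [0, 0, 0, 0, 1]
```
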
## 4. Bag of Words (BoW)

### Definition:
**Bag of Words (BoW)** is a technique used to convert a collection of text documents into numerical feature vectors. It creates a vocabulary of known words and encodes each document as a vector of word counts.

> BoW ignores grammar and word order, and focuses only on the frequency of words.

---

### Example:

Two texts:

1. `"I love NLP"`  
2. `"NLP is fun"`

---

**Vocabulary** → `[ "I", "love", "NLP", "is", "fun" ]`

**BoW Matrix**:

|        | I | love | NLP | is | fun |
|--------|---|------|-----|----|-----|
| Doc 1  | 1 | 1    | 1   | 0  | 0   |
| Doc 2  | 0 | 0    | 1   | 1  | 1   |

---

### Interpretation:
- Each row represents a document.
- Each column represents a word from the vocabulary.
- Values are the **count of times the word appears** in the document.

---

> **Note:** BoW is simple but powerful. However, it does not capture the meaning or context of words.

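The BoW matrix above can be reproduced in a few lines of plain Python. This is a sketch that tokenizes by whitespace and keeps the original casing so the output matches the table; it is not how a library implements it.

```python
from collections import Counter

docs = ["I love NLP", "NLP is fun"]

# Build the vocabulary in order of first appearance.
vocabulary = []
for doc in docs:
    for word in doc.split():
        if word not in vocabulary:
            vocabulary.append(word)

# One row per document, one column per vocabulary word.
bow_matrix = []
for doc in docs:
    counts = Counter(doc.split())
    bow_matrix.append([counts[word] for word in vocabulary])

print(vocabulary)  # ['I', 'love', 'NLP', 'is', 'fun']
print(bow_matrix)  # [[1, 1, 1, 0, 0], [0, 0, 1, 1, 1]]
```

scikit-learn's `CountVectorizer` does the same job at scale, but with its default settings it lowercases text and drops single-character tokens such as "I", so its vocabulary would differ from the table.
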

# N-Grams in NLP

---

## What are N-Grams?

An **n-gram** is a contiguous sequence of **n tokens (usually words)** from a given text.

| n       | Name     | Example from `"I love NLP"`      |
|---------|----------|----------------------------------|
| 1-gram  | Unigram  | `["I", "love", "NLP"]`           |
| 2-gram  | Bigram   | `["I love", "love NLP"]`         |
| 3-gram  | Trigram  | `["I love NLP"]`                 |

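The rows of the table can be generated with a sliding window over the token list. The function name `ngrams` is a hypothetical helper used only for this sketch.

```python
def ngrams(tokens, n):
    # Slide a window of size n across the token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I love NLP".split()
print(ngrams(tokens, 1))  # ['I', 'love', 'NLP']
print(ngrams(tokens, 2))  # ['I love', 'love NLP']
print(ngrams(tokens, 3))  # ['I love NLP']
```
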
---

## Why Use N-Grams?

Using n-grams (especially bigrams and trigrams) helps capture **phrases and context**, which single words (unigrams) cannot provide.

### ✅ Common Applications:
- **Text classification**
- **Sentiment analysis**
- **Plagiarism detection**
- **Language modeling**
- **Autocomplete systems**

---

## Limitations of N-Grams

| Limitation | Description |
|------------|-------------|
| **Sparsity** | Higher-order n-grams result in very large and sparse feature spaces. |
| **Memory & Computation** | Storing and processing bigram/trigram matrices is resource-intensive. |
| **Lack of Deep Semantics** | N-grams don't understand the meaning behind words or grammar. |
| **Vocabulary Explosion** | As `n` increases, the number of possible combinations grows exponentially. |

> ✅ **Tip:** For deeper understanding of context and meaning, techniques like **word embeddings** (Word2Vec, GloVe) or **transformers** are used instead of pure n-grams.


# 📌 TF-IDF (Term Frequency–Inverse Document Frequency)

## 🔍 What is TF-IDF?

TF-IDF is a statistical measure used in **Natural Language Processing (NLP)** and **information retrieval** to evaluate how important a word is to a document in a collection or corpus.

It is commonly used to **transform textual data into numerical features** for tasks like:

- Text classification
- Document clustering
- Search engines
- Spam filtering

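One common textbook formulation multiplies the term frequency (the share of a document's tokens that are the term) by the inverse document frequency (the log of the number of documents divided by the number of documents containing the term). The sketch below assumes that formulation; several variants exist, and scikit-learn's `TfidfVectorizer` by default smooths the IDF and L2-normalizes each row, so its numbers will differ.

```python
import math

docs = [["i", "love", "nlp"], ["nlp", "is", "fun"]]

def tf(term, doc):
    # Term frequency: share of the document's tokens that are `term`.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Inverse document frequency: log(N / number of documents containing `term`).
    # Assumes `term` occurs in at least one document.
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(round(tfidf("nlp", docs[0], docs), 4))   # 0.0    ("nlp" is in every document)
print(round(tfidf("love", docs[0], docs), 4))  # 0.231  (1/3 * ln 2)
```

A word that appears in every document gets a score of zero, while a word that is frequent in one document but rare elsewhere scores highest.
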
---




```python
import pandas as pd
import numpy as np
import nltk  # Natural Language Toolkit: tokenization, stopword removal, stemming or lemmatization
import torch
import torch.nn as nn
import torch.optim as optim
import string

from torch.utils.data import Dataset, DataLoader
# We use TF-IDF for feature extraction
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report


nltk.download('punkt')      # tokenizer models required for tokenization
nltk.download('punkt_tab')  # tokenizer tables required by newer NLTK releases
nltk.download('stopwords')  # stopword lists required for stopword removal
nltk.download('wordnet')    # WordNet data required by WordNetLemmatizer

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
```


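```python
# Load the spam dataset, reading the file with latin-1 encoding
data = pd.read_csv('spam.csv', encoding='latin-1')

# Keep only the message text (v2) and its label (v1), then rename the columns
data = data[['v2', 'v1']]
data.columns = ['text', 'labels']
data.head()

# Count missing values in each column
data.isnull().sum()
```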

```python
# 1) Text Preprocessing
# Build the stopword set and the lemmatizer once instead of on every call
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
  text = text.lower()
  # remove punctuation
  text = "".join(char for char in text if char not in string.punctuation)
  # tokenization
  tokens = word_tokenize(text)
  # remove stopwords
  tokens = [word for word in tokens if word not in stop_words]
  # lemmatization
  tokens = [lemmatizer.lemmatize(word) for word in tokens]
  return " ".join(tokens)

data['cleaned_text'] = data['text'].apply(preprocess_text)

# Encode the labels as integers: ham -> 0, spam -> 1
data['labels'] = data['labels'].map({"ham":0, "spam":1})
data.head()
```

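```python
# 2) Feature Extraction with TF-IDF (a weighted form of bag-of-words / n-gram counts)
# Note: fitting on the full dataset before the split lets test-set statistics
# leak into the features; for a stricter evaluation, split first and fit on
# the training texts only.
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(data['cleaned_text']).toarray()
y = data['labels'].values

# X and y are already NumPy arrays at this point
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```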


```python
class EmailDataset(Dataset):
  """Wraps the TF-IDF feature matrix and the labels as float32 tensors."""
  def __init__(self, X, y):
    self.X = torch.tensor(X, dtype=torch.float32)
    self.y = torch.tensor(y, dtype=torch.float32)

  def __len__(self):
    return len(self.X)

  def __getitem__(self, idx):
    return self.X[idx], self.y[idx]


train_dataset = EmailDataset(X_train, y_train)
test_dataset = EmailDataset(X_test, y_test)

# Shuffle only the training data; keep the test order fixed
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)
```


```python
class SpamClassifier(nn.Module):
  """Feed-forward binary classifier: two hidden layers and a sigmoid output."""
  def __init__(self, num_inputs):
    super().__init__()
    self.fc1 = nn.Linear(num_inputs, 128)
    self.fc2 = nn.Linear(128, 64)
    self.fc3 = nn.Linear(64, 1)

    self.dropout = nn.Dropout(0.5)
    self.relu = nn.ReLU()
    self.sigmoid = nn.Sigmoid()

  def forward(self, x):
    x = self.relu(self.fc1(x))
    x = self.relu(self.fc2(x))
    x = self.dropout(x)
    # Sigmoid maps the output to a spam probability between 0 and 1
    x = self.sigmoid(self.fc3(x))
    return x


# Use a GPU when one is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = SpamClassifier(num_inputs=X_train.shape[1]).to(device)

# Binary cross-entropy pairs with the single sigmoid output
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
```

```python
num_epochs = 10
model.train()  # training mode: dropout is active
for epoch in range(1, num_epochs+1):
  total_loss = 0.0
  for inputs, labels in train_loader:
    inputs, labels = inputs.to(device), labels.to(device)
    optimizer.zero_grad()
    outputs = model(inputs)
    # labels have shape (batch,), outputs (batch, 1), so add a dimension
    loss = criterion(outputs, labels.unsqueeze(1))
    loss.backward()
    optimizer.step()

    total_loss += loss.item()

  # average loss per batch for this epoch
  avg_loss = total_loss / len(train_loader)
  print(f"Epoch [{epoch}/{num_epochs}], Loss: {avg_loss:.4f}")
```




```python
model.eval()  # evaluation mode: dropout is disabled
correct = 0
total = 0
with torch.no_grad():
  for inputs, labels in test_loader:
    inputs, labels = inputs.to(device), labels.to(device)
    outputs = model(inputs)
    # probabilities of 0.5 or higher are classified as spam
    predicted = (outputs >= 0.5).float()
    total += labels.size(0)
    correct += (predicted == labels.unsqueeze(1)).sum().item()

accuracy = correct / total
print(f"Test Accuracy: {accuracy:.4f}")
```



```python
test_email = 'Hey you have won millions of dollars'

def predict_email(email):
  """Print the prediction and return 1 for spam, 0 for not spam."""
  model.eval()
  preprocessed_email = preprocess_text(email)
  email_vector = vectorizer.transform([preprocessed_email]).toarray()
  email_tensor = torch.tensor(email_vector, dtype=torch.float32).to(device)
  # no gradients are needed for inference
  with torch.no_grad():
    output = model(email_tensor)
  predicted_label = int((output >= 0.5).item())

  if predicted_label == 1:
    print("Email is spam")
  else:
    print("Email is not spam")
  return predicted_label


result = predict_email(test_email)
```