# Unstructured ML

## **🔍 What We Will Cover in Unstructured ML**
Unstructured data requires specialized **preprocessing, feature extraction, and models**. We can divide this into:  

### **1️⃣ Natural Language Processing (NLP) → Text-Based ML**
✅ Tokenization, Stemming, Lemmatization  
✅ TF-IDF, Word Embeddings (Word2Vec, BERT)  
✅ Text Classification, Sentiment Analysis  
✅ Named Entity Recognition (NER), Topic Modeling  
✅ Chatbot Development (Rasa, OpenAI API, LLMs)  

### **2️⃣ Computer Vision (CV) → Image-Based ML**
✅ Image Preprocessing (OpenCV, PIL)  
✅ Feature Extraction (CNN, ResNet, Vision Transformers)  
✅ Image Classification, Object Detection  
✅ Face Recognition, OCR (Tesseract, EasyOCR)  

### **3️⃣ Audio & Speech Processing**
✅ Feature Extraction (MFCC, Spectrograms)  
✅ Speech-to-Text (Whisper, DeepSpeech)  
✅ Voice Classification, Music Genre Recognition

## **Deep Learning vs. Shallow Learning in Unstructured ML**
Unstructured ML can be divided into **Deep Learning** and **Shallow Learning**, based on the **modeling techniques** used:

---

### **🔹 1️⃣ Deep Learning in Unstructured ML**
- **Deep Learning** refers to **neural networks with many layers**, capable of automatically learning complex patterns in unstructured data.
  
- **Common deep learning models for unstructured ML**:
  - **CNNs (Convolutional Neural Networks)**: Used for **image data** (e.g., image classification, object detection).
  - **RNNs (Recurrent Neural Networks)**, **LSTMs (Long Short-Term Memory)**, and **GRUs (Gated Recurrent Units)**: Used for **sequence data** (e.g., text, speech).
  - **Transformers** (like **BERT, GPT**): Used for **NLP** tasks like sentiment analysis, text generation, translation, etc.
  - **Autoencoders**: Used for **unsupervised tasks** like anomaly detection, image reconstruction, and denoising.

#### **Deep Learning Examples in Unstructured ML:**
- **Image Classification**: Using CNNs to classify images (e.g., cat vs dog).
- **NLP**: Using RNNs, LSTMs, or transformers like BERT for text classification, translation, or summarization.
- **Speech-to-Text**: Using deep neural networks to transcribe speech into text.

---

### **🔹 2️⃣ Shallow Learning in Unstructured ML**
- **Shallow Learning** refers to **traditional machine learning algorithms** applied to unstructured data, often with manual feature extraction.

- **Shallow models for unstructured ML** typically use **features engineered from the raw data**, such as using:
  - **Handcrafted image features** (e.g., HOG, SIFT, or ORB) followed by traditional models like **SVM, Random Forest, or k-NN**.
  - **Manual text features** (e.g., **TF-IDF** or **bag of words**) followed by traditional classifiers like **Naive Bayes**, **SVM**, or **Logistic Regression**.

#### **Shallow Learning Examples in Unstructured ML:**
- **Image Classification**: Using **handcrafted features** (e.g., HOG, SIFT) with **SVM** or **Random Forest**.
- **Text Classification**: Using **TF-IDF features** with **Logistic Regression** or **Naive Bayes**.
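
To make the "handcrafted features + traditional model" idea concrete, here is a minimal sketch. It assumes scikit-learn's bundled 8×8 digits dataset and uses a simplified, HOG-like descriptor written in NumPy (`orientation_histogram` is a hypothetical helper, not the full HOG algorithm) followed by an SVM.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def orientation_histogram(image, n_bins=8, cell=4):
    """Simplified HOG-like descriptor: one gradient-orientation histogram per cell."""
    gy, gx = np.gradient(image.astype(float))
    magnitude = np.hypot(gx, gy)
    # Unsigned gradient orientation in [0, 180) degrees
    orientation = np.degrees(np.arctan2(gy, gx)) % 180.0
    features = []
    for r in range(0, image.shape[0], cell):
        for c in range(0, image.shape[1], cell):
            hist, _ = np.histogram(
                orientation[r:r + cell, c:c + cell],
                bins=n_bins,
                range=(0.0, 180.0),
                weights=magnitude[r:r + cell, c:c + cell],
            )
            features.append(hist)
    vec = np.concatenate(features)
    return vec / (np.linalg.norm(vec) + 1e-8)  # L2-normalize the descriptor


digits = load_digits()  # 8x8 grayscale digit images bundled with scikit-learn
X = np.array([orientation_histogram(img) for img in digits.images])
X_train, X_test, y_train, y_test = train_test_split(
    X, digits.target, test_size=0.25, random_state=0, stratify=digits.target
)

# Traditional classifier trained on the engineered features, not on raw pixels
clf = SVC(kernel="rbf", gamma="scale")
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)

print(f"Feature vector length: {X.shape[1]}")
print(f"Test accuracy: {accuracy:.3f}")
```

The feature design (cell size, number of bins, normalization) is chosen by hand here; a CNN would instead learn comparable filters from the pixels.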

---

## **📊 Summary:**
- **Deep Learning**: Uses **neural networks with many layers** for **end-to-end learning** from raw unstructured data (e.g., raw images, raw text).
- **Shallow Learning**: Requires **manual feature extraction** before applying traditional ML models.

## **📌 NLP Roadmap**
We'll go step by step, from basic text processing to advanced **deep learning NLP models**.

### **1️⃣ Text Preprocessing**
✅ Tokenization (**Splitting text into words/sentences**)  
✅ Stemming vs Lemmatization (**Reducing words to base form**)  
✅ Stopword Removal (**Filtering unnecessary words**)  
✅ Part-of-Speech (POS) Tagging (**Identifying word types**)  

### **2️⃣ Feature Engineering for NLP**
✅ **TF-IDF (Term Frequency - Inverse Document Frequency)**  
✅ **Word Embeddings** (Word2Vec, GloVe, FastText)  
✅ **Transformers** (BERT, GPT)  

### **3️⃣ NLP Applications**
✅ **Text Classification** (Spam Detection, Sentiment Analysis)  
✅ **Named Entity Recognition (NER)** (Extracting Names, Locations, Dates, etc.)  
✅ **Topic Modeling** (LDA, Latent Semantic Analysis)  

### **4️⃣ Advanced NLP & Chatbot Development**
✅ **Intent Recognition** for chatbots  
✅ **Building AI Chatbots with Rasa & OpenAI API (GPT Models)**  
✅ **Deploying NLP Models (FastAPI, Flask, Streamlit)**  

# **🚀 Step 1: Basic NLP Preprocessing (Tokenization, Lemmatization, Stopwords Removal) using SpaCy**  

## **📌 Why is Preprocessing Important?**
Before applying machine learning models to text, we need to **clean and structure** the raw text to remove unnecessary noise and standardize words.  
✅ **Tokenization**: Breaking text into words/sentences.  
✅ **Lemmatization**: Converting words to their dictionary form (usually more accurate than stemming, which only strips suffixes).  
✅ **Stopword Removal**: Removing common words that add little meaning (e.g., "the", "is").  
✅ **POS Tagging**: Identifying parts of speech (nouns, verbs, adjectives).  

```python
!python -m spacy download en_core_web_sm
```
### **✅ Output:**
```
Successfully installed en-core-web-sm-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
```


## **1️⃣ Tokenization (Breaking Text into Words)**
```python
import spacy

# Load the SpaCy English model
nlp = spacy.load("en_core_web_sm")

# Example text
text = "SpaCy is an amazing NLP library! It makes text processing easy."

# Process text with SpaCy
doc = nlp(text)

# Tokenization: Extract words
tokens = [token.text for token in doc]
print("🔹 Tokens:", tokens)
```


### **✅ Output:**
```
🔹 Tokens: ['SpaCy', 'is', 'an', 'amazing', 'NLP', 'library', '!', 'It', 'makes', 'text', 'processing', 'easy', '.']
```
💡 **SpaCy splits punctuation into separate tokens (note `'!'` and `'.'` in the output).**

---

## **2️⃣ Lemmatization (Reducing Words to Root Form)**
Lemmatization converts words to their **dictionary root** while keeping the meaning.  
For example, `"running"` → `"run"`, `"better"` → `"good"`.

```python
lemmas = [token.lemma_ for token in doc]
print("🔹 Lemmatized:", lemmas)
```
### **✅ Output:**
```
🔹 Lemmatized: ['SpaCy', 'be', 'an', 'amazing', 'NLP', 'library', '!', 'it', 'make', 'text', 'process', 'easy', '.']
```
💡 **"is" → "be"** and **"makes" → "make"** → Shows how lemmatization standardizes words.
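
For comparison, here is a minimal stemming sketch. It assumes NLTK's `PorterStemmer` (which needs no extra data download) and a hypothetical word list. A stemmer strips suffixes by rule, so it can return non-words such as `studi`, whereas a lemmatizer returns dictionary forms.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical word list for illustration
words = ["running", "makes", "studies", "easily"]

# Rule-based suffix stripping: no dictionary or part-of-speech information is used
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} → {stem}")
```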

## **3️⃣ Stopword Removal (Removing Unimportant Words)**
Stopwords like `"the", "is", "an"` add little meaning to NLP models.  
SpaCy has a **built-in stopword list**.

```python
# Removing stopwords
filtered_tokens = [token.text for token in doc if not token.is_stop]
print("🔹 Tokens without Stopwords:", filtered_tokens)
```
### **✅ Output:**
```
🔹 Tokens without Stopwords: ['SpaCy', 'amazing', 'NLP', 'library', '!', 'makes', 'text', 'processing', 'easy', '.']
```
💡 **"is", "an", "it" were removed because they are stopwords.**
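
The default stopword list is not always what a task needs. SpaCy's English list includes negations such as "not", which can matter for sentiment analysis. The sketch below assumes we want to keep a few negations; it uses a blank English pipeline (`nlp_blank`, a name chosen here) so it runs without a model download.

```python
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

# A blank English pipeline still carries the default stopword list
nlp_blank = spacy.blank("en")
text = "the movie is not good"

before = [t.text for t in nlp_blank(text) if not t.is_stop]

# Assumption for this sketch: negations should survive stopword removal
for word in ("not", "no", "never"):
    nlp_blank.vocab[word].is_stop = False

after = [t.text for t in nlp_blank(text) if not t.is_stop]

print("Default list contains 'not':", "not" in STOP_WORDS)
print("Before:", before)
print("After: ", after)
```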

## **4️⃣ POS Tagging (Identifying Parts of Speech)**
Each word in a sentence has a role: **noun, verb, adjective, etc.**  
This helps in tasks like **Named Entity Recognition (NER)** and **sentiment analysis**.

```python
for token in doc:
    print(f"{token.text} → {token.pos_}")
```
### **✅ Output:**
```
SpaCy → PROPN
is → AUX
an → DET
amazing → ADJ
NLP → PROPN
library → NOUN
! → PUNCT
It → PRON
makes → VERB
text → NOUN
processing → NOUN
easy → ADJ
. → PUNCT
```
💡 **"SpaCy" is a Proper Noun (`PROPN`), "makes" is a Verb (`VERB`)** → Useful for text understanding.

## **📌 Combining All Preprocessing Steps**
Let's combine **tokenization, lemmatization, and stopword removal** in one function, keeping only alphabetic tokens (`token.is_alpha`) so that punctuation is dropped.

```python
def preprocess_text(text):
    doc = nlp(text)
    cleaned_tokens = [token.lemma_ for token in doc if not token.is_stop and token.is_alpha]
    return cleaned_tokens

# Example usage
text = "SpaCy is an amazing NLP library! It makes text processing easy."
processed_text = preprocess_text(text)
print("✅ Processed Text:", processed_text)
```
### **✅ Output:**
```
✅ Processed Text: ['SpaCy', 'amazing', 'NLP', 'library', 'make', 'text', 'process', 'easy']
```
💡 **Now the text is cleaned and ready for ML models!**

# **🚀 Step 2: Feature Engineering for NLP**  

Now that we have **cleaned and preprocessed text**, we need to **convert it into numerical representations** that ML models can understand.  

---

### **📌 NLP Feature Engineering Overview**
#### **1️⃣ TF-IDF (Term Frequency - Inverse Document Frequency)**
- Traditional **bag-of-words** approach that weighs words based on their importance in a document.  

#### **2️⃣ Word Embeddings (Word2Vec, GloVe, FastText)**
- **Dense vector representations** that capture the **semantic meaning of words**.  

#### **3️⃣ Transformers (BERT, GPT)**
- **Context-aware embeddings** that understand word relationships **based on full sentence context**.

---

### **1️⃣ TF-IDF (Term Frequency - Inverse Document Frequency)**
✅ **What is TF-IDF?**  
- **Term Frequency (TF):** Measures how frequently a word appears in a document.  
- **Inverse Document Frequency (IDF):** Downweights words that appear frequently in many documents (e.g., "the", "is").  
- **TF-IDF** = TF × IDF → Higher values mean **important words**, lower values mean **common words**.

#### **🔹 Example: Compute TF-IDF in Python**
```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample text data
documents = [
    "Machine learning is great",
    "Deep learning is a subset of machine learning",
    "Natural language processing is part of AI"
]

# Initialize TF-IDF Vectorizer
tfidf = TfidfVectorizer()

# Fit and transform text data
tfidf_matrix = tfidf.fit_transform(documents)

# Convert to DataFrame for better readability
df_tfidf = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf.get_feature_names_out())

print(df_tfidf)
```
### **✅ Output:**
```
         ai      deep     great        is  language  learning   machine  \
0  0.000000  0.000000  0.631745  0.373119  0.000000  0.480458  0.480458   
1  0.000000  0.414541  0.000000  0.244835  0.000000  0.630538  0.315269   
2  0.410747  0.000000  0.000000  0.242594  0.410747  0.000000  0.000000   

    natural        of      part  processing    subset  
0  0.000000  0.000000  0.000000    0.000000  0.000000  
1  0.000000  0.315269  0.000000    0.000000  0.414541  
2  0.410747  0.312384  0.410747    0.410747  0.000000  
```
✅ **TF-IDF downweights words shared by many documents: "is" appears in all three documents and gets the lowest non-zero weight in every row.**  
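
The weights in the table above can be reproduced by hand. The sketch below assumes scikit-learn's default settings: a smoothed IDF, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization of each document row. The variable names are chosen for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

documents = [
    "Machine learning is great",
    "Deep learning is a subset of machine learning",
    "Natural language processing is part of AI",
]

# Raw term counts (TF), one row per document
counts = CountVectorizer().fit_transform(documents).toarray().astype(float)
n_docs = counts.shape[0]

# Document frequency: number of documents that contain each term
doc_freq = (counts > 0).sum(axis=0)

# Smoothed IDF, the scikit-learn default (smooth_idf=True)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# TF x IDF, then L2-normalize each row (norm="l2")
tfidf_manual = counts * idf
tfidf_manual /= np.linalg.norm(tfidf_manual, axis=1, keepdims=True)

# Compare with the library result
tfidf_sklearn = TfidfVectorizer().fit_transform(documents).toarray()
print(np.allclose(tfidf_manual, tfidf_sklearn))
```

Because of the `+ 1` in the IDF term, a word that occurs in every document (such as "is") keeps a small non-zero weight instead of dropping to zero.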



### **2️⃣ Word Embeddings (Word2Vec, GloVe, FastText)**
✅ **What are Word Embeddings?**  
- Unlike TF-IDF, **word embeddings capture relationships between words**.  
- **Words with similar meanings have similar vectors.**  
- Used in **deep learning models (RNNs, Transformers, etc.)**.

#### **🔹 Example: Word2Vec (Train Your Own Embeddings)**
```python
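import nltk
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize

# word_tokenize needs the Punkt tokenizer data ("punkt_tab" on newer NLTK releases)
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)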

# Sample text data
sentences = [
    "Machine learning is great",
    "Deep learning is a subset of machine learning",
    "Natural language processing is part of AI"
]

# Tokenize sentences
tokenized_sentences = [word_tokenize(sentence.lower()) for sentence in sentences]

# Train Word2Vec model
w2v_model = Word2Vec(sentences=tokenized_sentences, vector_size=50, window=5, min_count=1, workers=4)

# Get word vector for "machine"
print(w2v_model.wv["machine"])
```
✅ **Now words are represented as dense numerical vectors instead of simple word counts!**  

---

### **3️⃣ Transformers (BERT, GPT)**
✅ **What Makes Transformers Special?**
- Unlike **Word2Vec**, which gives a single vector for each word, **BERT & GPT create different word vectors based on context.**  
- Example:  
  - "Apple is a fruit" → `Apple` means a **fruit**.  
  - "Apple makes iPhones" → `Apple` means a **company**.  
  - **BERT produces a different vector for `Apple` in each sentence, because the surrounding words differ.**

#### **🔹 Example: Extracting BERT Embeddings**
```python
from transformers import BertTokenizer, BertModel
import torch

# Load pre-trained BERT model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Example text
text = "Machine learning is amazing!"

# Tokenize text
tokens = tokenizer(text, return_tensors="pt")

# Get BERT embeddings
with torch.no_grad():
    outputs = model(**tokens)

# Extract embeddings
embeddings = outputs.last_hidden_state
print(embeddings.shape)  # Shape: (batch_size, sequence_length, hidden_size)
```
✅ **Now words are represented with deep contextual understanding!**  
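
The `last_hidden_state` tensor holds one vector per token. A common way to get a single vector per sentence is to average the token vectors while ignoring padding. The sketch below shows that masked mean pooling on a small hand-made tensor, so it runs without downloading a model; `mean_pool` is a hypothetical helper, and in practice its inputs would be `outputs.last_hidden_state` and `tokens["attention_mask"]`.

```python
import torch


def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors per sentence, ignoring padded positions."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)                          # avoid division by zero
    return summed / counts


# Toy input: 1 sentence, 3 token positions, hidden size 2; the last position is padding
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])

sentence_vec = mean_pool(hidden, mask)
print(sentence_vec.shape)  # (batch_size, hidden_size)
print(sentence_vec)        # the padded position does not affect the average
```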

# **🚀 Step 3: NLP Applications**
Now that we have **numerical representations of text (TF-IDF, Word Embeddings, BERT, etc.)**, let’s apply them to **real-world NLP tasks**:

✅ **Text Classification** (Spam Detection, Sentiment Analysis)  
✅ **Named Entity Recognition (NER)** (Extracting Names, Locations, Dates, etc.)  
✅ **Topic Modeling** (LDA, Latent Semantic Analysis)  

---

## **1️⃣ Text Classification (Spam Detection, Sentiment Analysis)**  
Text classification assigns a **category** to a given text (e.g., **spam vs. not spam, positive vs. negative sentiment**).  
We’ll cover:  
- **TF-IDF + Logistic Regression** (Shallow Learning)  
- **Word2Vec + LSTM** (Deep Learning)  

---

### **🔹 Example 1: Spam Detection using TF-IDF + Logistic Regression**
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Sample dataset (Spam vs Not Spam)
data = {
    "text": ["Win a free iPhone now!", "Meeting at 10 AM", "Congratulations! You won a lottery", 
             "Call me when you are free", "Get rich quick with this scheme!"],
    "label": [1, 0, 1, 0, 1]  # 1 = Spam, 0 = Not Spam
}
df = pd.DataFrame(data)

# Split data
X_train, X_test, y_train, y_test = train_test_split(df["text"], df["label"], test_size=0.2, random_state=42)

# Convert text to TF-IDF features
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Train Logistic Regression model
model = LogisticRegression()
model.fit(X_train_tfidf, y_train)

# Predict and evaluate
y_pred = model.predict(X_test_tfidf)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
```
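✅ **A simple, strong baseline for spam detection. With only five messages the test set holds a single example, so the accuracy printed here is illustrative only.**  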
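
To classify new, raw messages, it can help to bundle the vectorizer and the classifier into one object. The sketch below assumes scikit-learn's `make_pipeline` and a hypothetical six-message toy dataset; the predicted labels and probabilities on data this small should not be read as meaningful.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data, for illustration only
texts = [
    "Win a free iPhone now!", "Meeting at 10 AM",
    "Congratulations! You won a lottery", "Call me when you are free",
    "Get rich quick with this scheme!", "Lunch tomorrow at noon?",
]
labels = [1, 0, 1, 0, 1, 0]  # 1 = Spam, 0 = Not Spam

# The pipeline fits the vectorizer and the classifier together, so the same
# vocabulary and IDF weights are reused at prediction time
spam_clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
spam_clf.fit(texts, labels)

new_messages = ["Free lottery tickets, win now!", "Are we still meeting at 10?"]
predictions = spam_clf.predict(new_messages)
probabilities = spam_clf.predict_proba(new_messages)[:, 1]  # P(spam)

for message, label, prob in zip(new_messages, predictions, probabilities):
    print(f"{message!r}: label={label}, P(spam)={prob:.2f}")
```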

---

### **🔹 Example 2: Sentiment Analysis using Word2Vec + Deep Learning**
```python
import gensim
import numpy as np
from tensorflow.keras.initializers import Constant
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer, text_to_word_sequence

# Sample dataset
sentences = ["I love this movie!", "This was a terrible experience", "Absolutely fantastic!", "Worst product ever"]
labels = [1, 0, 1, 0]  # 1 = Positive, 0 = Negative

# Tokenization (the Keras Tokenizer lowercases text and strips punctuation)
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)
X = tokenizer.texts_to_sequences(sentences)
X = pad_sequences(X, padding="post")

# Word2Vec embeddings, trained on the same normalized tokens so both vocabularies match
w2v_sentences = [text_to_word_sequence(s) for s in sentences]
word2vec = gensim.models.Word2Vec(sentences=w2v_sentences, vector_size=50, min_count=1, workers=4)

# Build the embedding matrix (row 0 is reserved for padding)
vocab_size = len(tokenizer.word_index) + 1
embedding_matrix = np.zeros((vocab_size, 50))
for word, i in tokenizer.word_index.items():
    if word in word2vec.wv:
        embedding_matrix[i] = word2vec.wv[word]

# Build LSTM model with frozen, pre-trained embeddings
model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=50,
              embeddings_initializer=Constant(embedding_matrix), trainable=False),
    LSTM(10),
    Dense(1, activation="sigmoid")
])

model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model.fit(X, np.array(labels), epochs=10, verbose=1)

# Predict sentiment (pad the new text the same way as the training data)
new_seq = tokenizer.texts_to_sequences(["Amazing experience!"])
new_X = pad_sequences(new_seq, maxlen=X.shape[1], padding="post")
print("Sentiment Prediction:", model.predict(new_X))
```
✅ **LSTMs can use word order, which bag-of-words features such as TF-IDF ignore. With four training sentences, the prediction here is illustrative only.**  

---

## **2️⃣ Named Entity Recognition (NER)**
NER identifies **important entities** (names, locations, organizations, dates, etc.) in a sentence.  
We will use **SpaCy for efficient NER**.

### **🔹 Example: Extracting Names, Locations, and Dates using SpaCy**
```python
import spacy
from spacy import displacy

# Load SpaCy's pre-trained model
nlp = spacy.load("en_core_web_sm")

text = "Elon Musk founded SpaceX in 2002 and Tesla in 2004. He was born in South Africa."

# Apply NLP pipeline
doc = nlp(text)

# Extract Named Entities
for ent in doc.ents:
    print(f"{ent.text} → {ent.label_}")

# Display named entities visually (inside a Jupyter notebook)
displacy.render(doc, style="ent", jupyter=True)
```
✅ **Automatically extracts Elon Musk (PERSON), SpaceX (ORG), and 2002 (DATE)!**  
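
When a statistical model misses domain-specific names, rule-based patterns can fill the gap through the same `doc.ents` interface. The sketch below assumes SpaCy v3's `entity_ruler` component on a blank English pipeline (so no model download is needed); the patterns and the `nlp_rules` name are hypothetical examples.

```python
from collections import defaultdict

import spacy

# A blank pipeline plus rules: no statistical model is involved here
nlp_rules = spacy.blank("en")
ruler = nlp_rules.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "PERSON", "pattern": "Elon Musk"},
    {"label": "ORG", "pattern": "SpaceX"},
    {"label": "ORG", "pattern": "Tesla"},
])

doc = nlp_rules("Elon Musk founded SpaceX in 2002 and Tesla in 2004.")

# Group entity texts by label
entities_by_label = defaultdict(list)
for ent in doc.ents:
    entities_by_label[ent.label_].append(ent.text)

print(dict(entities_by_label))
```

Only the listed phrases are matched, so the dates are not tagged in this sketch; the same component can also be added to a pre-trained pipeline to combine rules with the statistical model.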

---

## **3️⃣ Topic Modeling (LDA, Latent Semantic Analysis)**
Topic Modeling helps **discover hidden topics in text**.  
We’ll use **LDA (Latent Dirichlet Allocation)** to find topics in a small collection of documents.

### **🔹 Example: Topic Modeling with LDA**
```python
import gensim
from gensim import corpora
from nltk.tokenize import word_tokenize
import nltk

# word_tokenize needs the Punkt tokenizer data ("punkt_tab" on newer NLTK releases)
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)

# Sample dataset
documents = ["I love machine learning and AI", "Deep learning is the future of AI", 
             "Natural language processing enables AI to understand text"]

# Tokenize and create dictionary (maps each word to an integer id)
tokenized_docs = [word_tokenize(doc.lower()) for doc in documents]
dictionary = corpora.Dictionary(tokenized_docs)

# Create bag-of-words corpus: one list of (word_id, count) pairs per document
corpus = [dictionary.doc2bow(text) for text in tokenized_docs]

# Train LDA model (random_state fixed so the topics are reproducible)
lda_model = gensim.models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10, random_state=42)

# Display topics as weighted word lists
for idx, topic in lda_model.print_topics(-1):
    print(f"Topic {idx+1}: {topic}")
```
✅ **Prints the top words for each of the two topics. On a corpus this small, and without stopword removal, the topics are illustrative only.**  