### Naive Bayes Classifier


Naive Bayes is an efficient supervised classification algorithm that scales well to high-dimensional data.

Why is it called naive? Because it assumes that every input feature is independent of the others, given the class. The assumption rarely holds exactly, so the model works best when the correlation between the features is low.
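
Written out, the assumption lets the posterior factor into one term per feature. The symbols here are introduced for illustration: y is the class and x_1, ..., x_n are the feature values.

\[
P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)
\]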

It can make good predictions even when the training data is relatively small.


## Naive Bayes Classifier - Simple Illustration

### Problem
We want to predict whether a person will **buy a computer** based on their `Age` and `Income`.

Here’s the dataset:

| Age    | Income | Buys Computer |
|--------|--------|----------------|
| Youth  | High   | No             |
| Youth  | High   | No             |
| Middle | High   | Yes            |
| Senior | Medium | Yes            |
| Senior | Low    | Yes            |
| Senior | Low    | No             |
| Middle | Low    | Yes            |
| Youth  | Medium | No             |
| Youth  | Low    | Yes            |
| Senior | Medium | Yes            |

Now, we want to predict for this new person:
> Age = **Senior**, Income = **High**

---

###  Step-by-step (Using Naive Bayes)

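1. **Check how often people buy a computer:**
   - Out of 10 people, 6 said **Yes** and 4 said **No**.
   - P(yes) = 6/10
   - P(no) = 4/10

2. **Calculate the score of each class for Age = Senior and Income = High:**
   - P(yes | age = senior, income = high) ∝ P(yes) × P(senior | yes) × P(high | yes) = (6/10) × (3/6) × (1/6) = 0.05
   - P(no | age = senior, income = high) ∝ P(no) × P(senior | no) × P(high | no) = (4/10) × (1/4) × (2/4) = 0.05

The new person is assigned to the class with the higher score. The scores are only proportional to the posterior probabilities because the shared denominator P(age = senior, income = high) is dropped. In this example both scores equal 0.05, so the data does not favor either class and the tie has to be broken by some convention.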
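
The counting steps above can be sketched in plain Python. This is a minimal illustration rather than a general implementation: the function name `naive_bayes_scores` and the tuple layout of `rows` are choices made here, and no smoothing is applied to zero counts.

```python
from collections import Counter

# Rows copied from the table above: (age, income, buys_computer)
rows = [
    ("Youth", "High", "No"),
    ("Youth", "High", "No"),
    ("Middle", "High", "Yes"),
    ("Senior", "Medium", "Yes"),
    ("Senior", "Low", "Yes"),
    ("Senior", "Low", "No"),
    ("Middle", "Low", "Yes"),
    ("Youth", "Medium", "No"),
    ("Youth", "Low", "Yes"),
    ("Senior", "Medium", "Yes"),
]


def naive_bayes_scores(rows, age, income):
    """Return P(class) * P(age | class) * P(income | class) for each class."""
    class_counts = Counter(label for _, _, label in rows)
    scores = {}
    for label, n in class_counts.items():
        in_class = [r for r in rows if r[2] == label]
        p_class = n / len(rows)
        p_age = sum(r[0] == age for r in in_class) / n
        p_income = sum(r[1] == income for r in in_class) / n
        scores[label] = p_class * p_age * p_income
    return scores


scores = naive_bayes_scores(rows, age="Senior", income="High")

# Dividing by the sum of the scores turns them into posterior probabilities.
total = sum(scores.values())
posteriors = {label: s / total for label, s in scores.items()}
print(scores)
print(posteriors)
```

For this table the two scores are both about 0.05, so each normalized posterior is about 0.5.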


### Credit data set

In [6]:
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix


In [2]:
df=pd.read_csv("data/credit_data.csv")
df.dropna(inplace=True) 
df.head()

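|   | clientid | income      | age       | loan        | default |
|---|----------|-------------|-----------|-------------|---------|
| 0 | 1        | 66155.9251  | 59.017015 | 8106.532131 | 0       |
| 1 | 2        | 34415.15397 | 48.117153 | 6564.745018 | 0       |
| 2 | 3        | 57317.17006 | 63.108049 | 8020.953296 | 0       |
| 3 | 4        | 42709.5342  | 45.751972 | 6103.64226  | 0       |
| 4 | 5        | 66952.68885 | 18.584336 | 8770.099235 | 1       |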


In [4]:
df.corr()

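|          | clientid  | income    | age       | loan     | default   |
|----------|-----------|-----------|-----------|----------|-----------|
| clientid | 1.0       | 0.039133  | -0.014704 | 0.018358 | -0.021217 |
| income   | 0.039133  | 1.0       | -0.033687 | 0.441539 | 0.002222  |
| age      | -0.014704 | -0.033687 | 1.0       | 0.002309 | -0.429759 |
| loan     | 0.018358  | 0.441539  | 0.002309  | 1.0      | 0.377169  |
| default  | -0.021217 | 0.002222  | -0.429759 | 0.377169 | 1.0       |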


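Naive Bayes assumes that income, age, and loan are independent of each other within each class. Income and loan are moderately correlated here (about 0.44), so the assumption is only partly met and Naive Bayes may not fit this data as well as it otherwise could.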

In [3]:
features = df[["income", "age", "loan"]] 
target = df["default"]
Xtrain, Xtest, ytrain, ytest = train_test_split(features, target, test_size=0.2, random_state=42)

In [7]:
model=GaussianNB()
model.fit(Xtrain, ytrain)
ypred=model.predict(Xtest)
accuracy = accuracy_score(ytest, ypred)
print(f"Accuracy: {accuracy * 100:.2f}%")
conf_matrix = confusion_matrix(ytest, ypred)
print("Confusion Matrix:",conf_matrix)

Accuracy: 90.50%
Confusion Matrix: [[330   9]
 [ 29  32]]
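
Accuracy alone hides how the errors are split between the two classes. The sketch below reads precision and recall for the default class off the confusion matrix printed above. It assumes scikit-learn's layout (rows are true classes, columns are predicted classes, ordered 0 then 1); the variable names are chosen here for illustration.

```python
# Confusion matrix from the run above.
# Rows: true class, columns: predicted class, ordered [no default, default].
conf_matrix = [[330, 9],
               [29, 32]]

tn, fp = conf_matrix[0]
fn, tp = conf_matrix[1]

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the predicted defaulters, how many really defaulted
recall = tp / (tp + fn)     # of the real defaulters, how many were caught

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
# accuracy=0.905 precision=0.780 recall=0.525
```

So although 90.5% of the test predictions are correct, the model catches only about half of the actual defaulters (32 of 61).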



### Gaussian Naive Bayes - 2 Feature Example with Manual Calculation

### Mini Dataset

| Person | Income  | Age | Default |
|--------|---------|-----|---------|
| A      | 50,000  | 30  | 0       |
| B      | 60,000  | 40  | 0       |
| C      | 70,000  | 50  | 0       |
| D      | 30,000  | 20  | 1       |
| E      | 35,000  | 25  | 1       |

We want to predict whether a **new person** with:

- **Income = 65,000**
- **Age = 45**

will **Default (1)** or **Not Default (0)**.

---

#### Step 1: Separate Data by Class

**Class 0 (No Default):**
- Income: 50,000, 60,000, 70,000
- Age: 30, 40, 50

**Class 1 (Default):**
- Income: 30,000, 35,000
- Age: 20, 25

---

#### Step 2: Calculate Mean & Standard Deviation

The standard deviations below are population values (dividing by n) for both classes.

#### Class 0:
- Income: Mean = 60,000, Std Dev ≈ 8,165
- Age: Mean = 40, Std Dev ≈ 8.16

#### Class 1:
- Income: Mean = 32,500, Std Dev = 2,500
- Age: Mean = 22.5, Std Dev = 2.5
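
The per-class statistics can be reproduced with Python's `statistics` module. This sketch assumes the population standard deviation (`pstdev`, dividing by n); the `data` and `stats` dictionaries are names introduced here.

```python
from statistics import fmean, pstdev

# Mini dataset from the table above, grouped by class label.
data = {
    0: {"income": [50_000, 60_000, 70_000], "age": [30, 40, 50]},
    1: {"income": [30_000, 35_000], "age": [20, 25]},
}

# For every class and feature, store (mean, population standard deviation).
stats = {
    label: {name: (fmean(values), pstdev(values)) for name, values in features.items()}
    for label, features in data.items()
}

for label, per_feature in stats.items():
    for name, (mean, std) in per_feature.items():
        print(f"class {label} {name}: mean={mean:.1f}, std={std:.2f}")
```

Swapping `pstdev` for `statistics.stdev` gives the sample version (dividing by n - 1), which is noticeably larger on groups this small: about 3,536 instead of 2,500 for the Class 1 incomes.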

---

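#### Step 3: Gaussian Probability Formula

For each feature, μ and σ are the mean and standard deviation of that feature within the class being evaluated:

\[
P(x) = \frac{1}{\sqrt{2\pi \sigma^2}} \cdot e^{- \frac{(x - \mu)^2}{2\sigma^2}}
\]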
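
The formula translates directly into a small function, which gives a quick way to check hand calculations. This is a sketch using only the standard library. The name `gaussian_pdf` is chosen here, and the inputs are the rounded Class 0 statistics from Step 2.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std at x."""
    coeff = 1.0 / math.sqrt(2.0 * math.pi * std ** 2)
    exponent = -((x - mean) ** 2) / (2.0 * std ** 2)
    return coeff * math.exp(exponent)


# Class 0 statistics from Step 2, evaluated at the new point.
p_income = gaussian_pdf(65_000, mean=60_000, std=8_165)
p_age = gaussian_pdf(45, mean=40, std=8.16)
print(f"income density: {p_income:.3e}, age density: {p_age:.4f}")
```

These come out at roughly 4.05e-05 for income and 0.0405 for age. They are densities rather than probabilities, which is why the income value is so small: it is spread over a scale of tens of thousands.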

---

#### Class 0 (No Default) - New Point: Income = 65,000, Age = 45

#### Income:
\[
P = \frac{1}{\sqrt{2\pi \cdot 8165^2}} \cdot e^{- \frac{(65000 - 60000)^2}{2 \cdot 8165^2}}
\approx 0.0000405
\]

#### Age:
\[
P = \frac{1}{\sqrt{2\pi \cdot 8.16^2}} \cdot e^{- \frac{(45 - 40)^2}{2 \cdot 8.16^2}}
\approx 0.0405
\]

#### Combined Probability (Class 0):
\[
P(x \mid \text{class } 0) \approx 0.0000405 \times 0.0405 \approx 1.64 \times 10^{-6}
\]

---

#### Class 1 (Default)

#### Income = 65,000 is **very far** from mean (32,500)

\[
P \approx 0
\]

#### Age = 45 is also **far** from mean (22.5)

\[
P \approx 0
\]

#### Combined Probability (Class 1):
\[
P(x \mid \text{class } 1) \approx 0 \times 0 = 0
\]
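
Neither Class 1 density is exactly zero; both are just too small to show at this precision. A common way to compare such tiny products is to add log-densities instead of multiplying densities. The sketch below does that under two assumptions: population standard deviations (dividing by n), and class priors of 3/5 and 2/5 taken from the class counts in the mini dataset. `gaussian_log_pdf` and `classes` are names introduced for illustration.

```python
import math


def gaussian_log_pdf(x, mean, std):
    """Log-density of a normal distribution with the given mean and std at x."""
    return -0.5 * math.log(2.0 * math.pi * std ** 2) - (x - mean) ** 2 / (2.0 * std ** 2)


new_income, new_age = 65_000, 45

# Per class: (prior, (income mean, income std), (age mean, age std)).
# Standard deviations are population values, an assumption of this sketch.
classes = {
    0: (3 / 5, (60_000, 8_164.97), (40, 8.16497)),
    1: (2 / 5, (32_500, 2_500.0), (22.5, 2.5)),
}

log_scores = {}
for label, (prior, (m_inc, s_inc), (m_age, s_age)) in classes.items():
    # log(prior * p_income * p_age) = log(prior) + log(p_income) + log(p_age)
    log_scores[label] = (
        math.log(prior)
        + gaussian_log_pdf(new_income, m_inc, s_inc)
        + gaussian_log_pdf(new_age, m_age, s_age)
    )

prediction = max(log_scores, key=log_scores.get)
print(log_scores, prediction)
```

The log-score for Class 0 is around -14, while for Class 1 it is below -100, so the gap between the classes stays visible instead of rounding to zero.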

---

#### Final Prediction

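Since the combined likelihood for **Class 0** is far higher than for Class 1, and the class priors (3/5 and 2/5) are of similar size, Class 0 also has the higher posterior:

> **Prediction: No Default**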

---

#### Takeaway

Gaussian Naive Bayes uses the bell-curve formula to estimate how likely each input feature value is within each class, multiplies those likelihoods together with the class prior, and chooses the class with the highest resulting score.

## Text Clustering

The aim of text clustering is to build a model that groups similar documents together.

| Aspect          | Semantic Clustering               | Thematic Clustering             |
|-----------------|-----------------------------------|---------------------------------|
| Based on        | Meaning of words/sentences        | Underlying topic or theme       |
| Uses            | Sentence embeddings, similarity   | Topic models (LDA, BERTopic)    |
| Focus           | Fine-grained, meaning-level       | Broad, category-level           |
| Example         | “reset password” vs “recover access” | Finance news vs Sports news    |


