# Imports

In [1]:
import pandas as pd
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder

# Data importing

In [2]:
data = pd.read_csv('spam.csv',encoding='utf-8',usecols=['v1','v2'])

data.head()

     v1                                                 v2
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...


# Checking missing values and eliminating

In [3]:
data.isnull().sum()

v1    0
v2    1
dtype: int64

In [4]:
data.dropna(inplace=True)

# Converting String to Vector

In [5]:
cv = CountVectorizer()
labelenc = LabelEncoder()

x = cv.fit_transform(data['v2'])
y = labelenc.fit_transform(data['v1'])

# Splitting data into train and test set

In [6]:
X_train, X_test, y_train, y_test = train_test_split(x,y, test_size=0.25,random_state=42)

# Training a model

In [7]:
svc = SVC()
svc.fit(X_train,y_train)

In [8]:
svc.score(X_test,y_test)

0.9763101220387652

# Testing

In [9]:
email = cv.transform(['Congratulations!!!, You won Lottery of $1500000000 just now, just click on following link https://lottery.com/claim to claim your prize money'])
labelenc.classes_[svc.predict(email)[0]]

'spam'

In [10]:
email = cv.transform(['Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'])
labelenc.classes_[svc.predict(email)[0]]



'ham'

In [11]:
y_pred= svc.predict(X_test)

In [12]:
from sklearn.metrics import classification_report 
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.97      1.00      0.99      1195
           1       1.00      0.83      0.91       198

    accuracy                           0.98      1393
   macro avg       0.99      0.92      0.95      1393
weighted avg       0.98      0.98      0.98      1393



In [None]:

---

### **Problem Statement**
**Goal**: Implement a program for **e-mail spam filtering** using a **text classification algorithm**. The program should:
1. Take a dataset of labeled emails (spam or ham).
2. Train a model to classify emails as "spam" or "ham" (non-spam) using a machine learning algorithm.
3. Test the model's performance on unseen data.
4. Classify new emails based on their text content.

**Approach**:
We use the **Support Vector Machine (SVM)** algorithm for classification, with text preprocessing techniques like tokenization and vectorization.

---

### **Detailed Explanation of Code**

#### Step 1: Import Required Libraries
```python
import pandas as pd
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
```
- **`pandas`**: For data manipulation and analysis (loading and processing the dataset).
- **`numpy`**: For numerical operations. The notebook never calls it directly, so this import is optional; pandas and scikit-learn use NumPy internally.
- **`SVC`**: Support Vector Classifier, used for the classification task.
- **`train_test_split`**: Splits the dataset into training and testing subsets.
- **`CountVectorizer`**: Converts text data into a numerical feature matrix.
- **`LabelEncoder`**: Encodes categorical labels (e.g., "spam" and "ham") into numerical values.

---

#### Step 2: Load the Dataset
```python
data = pd.read_csv('spam.csv', encoding='utf-8', usecols=['v1', 'v2'])
```
- **Purpose**: Load the dataset `spam.csv`.
  - `v1`: The label (spam or ham).
  - `v2`: The email text.
- **`usecols=['v1', 'v2']`** ensures only the necessary columns are loaded.
- **Dataset Example** (rows 0 and 2 of the `data.head()` output):
  ```
      v1       v2
  0   ham      Go until jurong point, crazy.. Available only ...
  2   spam     Free entry in 2 a wkly comp to win FA Cup fina...
  ```
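As a small illustration of what `usecols` does, the sketch below reads a made-up in-memory CSV instead of `spam.csv`. The `extra` column and both rows are assumptions for the example, not the real file contents.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for spam.csv; the "extra" column is invented.
csv_text = (
    "v1,v2,extra\n"
    "ham,Ok lar... Joking wif u oni...,\n"
    "spam,Free entry in 2 a wkly comp,x\n"
)

# usecols keeps only the listed columns and silently drops the rest.
sample = pd.read_csv(io.StringIO(csv_text), usecols=["v1", "v2"])

print(list(sample.columns))
print(sample.shape)
```

Only `v1` and `v2` survive the load, so later steps never see the unused column.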

---

#### Step 3: Data Exploration
```python
data.head()
data.isnull().sum()
```
- **`data.head()`**: Displays the first 5 rows of the dataset.
- **`data.isnull().sum()`**: Counts the missing values in each column. In the notebook run, `v1` has none and `v2` has one.

---

#### Step 4: Handle Missing Data
```python
data.dropna(inplace=True)
```
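- **Purpose**: Removes every row that contains a missing value (here, the single row with no `v2` text). `inplace=True` modifies `data` directly instead of returning a copy.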
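The check-then-drop pattern from Steps 3 and 4 can be tried on a tiny frame. This is a sketch with made-up rows; using `subset=` and `reset_index` is an optional variation, not what the notebook does.

```python
import pandas as pd

# Made-up rows standing in for the real dataset; one message text is missing.
sample = pd.DataFrame({
    "v1": ["ham", "spam", "ham"],
    "v2": ["see you at lunch", None, "call me later"],
})

# Count missing values per column, as data.isnull().sum() does in the notebook.
missing_per_column = sample.isnull().sum()

# Drop rows without message text and renumber the index (returns a new frame).
cleaned = sample.dropna(subset=["v2"]).reset_index(drop=True)

print(missing_per_column.to_dict())
print(len(sample), "->", len(cleaned))
```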

---

#### Step 5: Text Vectorization
```python
cv = CountVectorizer()
x = cv.fit_transform(data['v2'])
```
- **Purpose**: Convert email text into numerical form for model training.
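- **`CountVectorizer`**:
  - Lowercases the text and breaks it into tokens (words of two or more characters, by default).
  - Creates a "Bag of Words" (BoW) representation: one column per vocabulary word, holding how many times that word occurs in each message.
- **Example**:
  For emails:
  ```
  Email 1: "dog bites man"
  Email 2: "man bites dog"
  ```
  The BoW representation (columns follow `CountVectorizer`'s alphabetical vocabulary order):
  ```
  bites  dog  man
  1      1    1  -> Email 1
  1      1    1  -> Email 2
  ```
  Both emails get the same vector, because a bag of words keeps counts and discards word order.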
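The toy example can be reproduced directly. The sketch below adds a third made-up sentence so that a count above 1 appears; `cv_demo` is a name chosen for this illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["dog bites man", "man bites dog", "dog bites dog"]

cv_demo = CountVectorizer()
bow = cv_demo.fit_transform(docs)  # sparse matrix: one row per document

# vocabulary_ maps each token to its column index; sort tokens by that index.
columns = sorted(cv_demo.vocabulary_, key=cv_demo.vocabulary_.get)

print(columns)
print(bow.toarray())
```

The first two rows are identical because word order is discarded, while the third row records that "dog" occurs twice and "man" not at all.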

---

#### Step 6: Label Encoding
```python
labelenc = LabelEncoder()
y = labelenc.fit_transform(data['v1'])
```
- **Purpose**: Encode categorical labels ("spam", "ham") into numerical values.
  - Example:
    - "ham" → 0
    - "spam" → 1
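A short sketch of the encoding on made-up labels: `LabelEncoder` sorts the class names, which is why "ham" becomes 0 and "spam" becomes 1. The name `enc` is chosen for this example.

```python
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
codes = enc.fit_transform(["ham", "spam", "ham", "ham"])

classes = enc.classes_.tolist()                    # sorted class names
decoded = enc.inverse_transform([1, 0]).tolist()   # codes back to names

print(classes)
print(codes.tolist())
print(decoded)
```

`inverse_transform` is an alternative to indexing `classes_` by hand when mapping predictions back to label names.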

---

#### Step 7: Train-Test Split
```python
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=42)
```
- **Purpose**: Split the data into training (75%) and testing (25%) sets.
- **`random_state=42`**: Ensures reproducibility by fixing the randomness.
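Because ham is far more common than spam in this data, one optional variation is to pass `stratify=` so both subsets keep the same class ratio. The notebook does not do this; the sketch below uses made-up labels (16 ham, 4 spam) to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 16 ham (0) and 4 spam (1).
labels = np.array([0] * 16 + [1] * 4)
features = np.arange(20).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.25, random_state=42, stratify=labels
)

print(len(y_tr), len(y_te))            # sizes of the two subsets
print(int(y_tr.sum()), int(y_te.sum()))  # spam count in each subset
```

With stratification, the 4:1 ham-to-spam ratio is preserved in both the training and the test subset.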

---

#### Step 8: Train the Model
```python
svc = SVC()
svc.fit(X_train, y_train)
```
- **`SVC()`**: Initializes the Support Vector Classifier with scikit-learn's default settings (RBF kernel, `C=1.0`).
- **`fit()`**: Trains the SVM model using the training data (`X_train` and `y_train`).
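Since `SVC()` is called without arguments, the defaults decide how the model behaves. The sketch below inspects them and shows, as an optional variation the notebook does not use, how a linear kernel would be requested.

```python
from sklearn.svm import SVC

# Inspect the hyperparameters that a bare SVC() uses.
defaults = SVC().get_params()
print(defaults["kernel"], defaults["C"])

# Optional variation (an assumption, not the notebook's choice):
# a linear kernel is a common alternative for sparse bag-of-words features.
linear_svc = SVC(kernel="linear", C=1.0)
print(linear_svc.get_params()["kernel"])
```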

---

#### Step 9: Evaluate the Model
```python
svc.score(X_test, y_test)
```
- **Purpose**: Computes the accuracy (the fraction of correct predictions) on the test data; the notebook run reports about 0.976.
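For a classifier, `score` returns mean accuracy. The sketch below computes accuracy by hand on made-up predictions and, using the support counts from the notebook's classification report (1195 ham, 198 spam), shows what a model that always answers "ham" would score.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up true labels and predictions for illustration.
y_true = np.array([0, 0, 1, 1, 0])
y_hat = np.array([0, 0, 1, 0, 0])

manual = float(np.mean(y_true == y_hat))          # fraction of matches
library = float(accuracy_score(y_true, y_hat))    # same quantity via sklearn

# Majority-class baseline on the notebook's test split: always predict ham.
baseline = 1195 / (1195 + 198)

print(manual, library, round(baseline, 3))
```

The baseline is already about 0.858, so accuracy alone flatters any model on this data; the per-class metrics in the classification report are more informative.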

---

#### Step 10: Predict New Emails
```python
email = cv.transform(['Congratulations!!!, You won Lottery of $1500000000...'])
labelenc.classes_[svc.predict(email)[0]]
```
- **Purpose**: Classify a new email.
  - The text is vectorized using the trained `CountVectorizer` (`cv.transform()`).
  - `svc.predict(email)` predicts the label (0 or 1).
  - `labelenc.classes_` maps the label back to the original class ("ham" or "spam").
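The three steps above can be wrapped in a helper. This is a minimal sketch trained on four made-up messages with a linear kernel so that it runs on its own; `classify`, `cv_demo`, `enc`, and `model` are hypothetical names, and the real notebook would use its own `cv`, `labelenc`, and `svc`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC

# Tiny made-up training set, only to make the sketch self-contained.
texts = [
    "win free prize now",
    "free cash prize claim",
    "see you at lunch",
    "call me after lunch",
]
labels = ["spam", "spam", "ham", "ham"]

cv_demo = CountVectorizer()
enc = LabelEncoder()
model = SVC(kernel="linear").fit(cv_demo.fit_transform(texts), enc.fit_transform(labels))


def classify(text):
    """Vectorize one message with the fitted vocabulary and return its label name."""
    features = cv_demo.transform([text])   # transform only, never fit_transform
    code = model.predict(features)[0]      # numeric class code
    return str(enc.inverse_transform([code])[0])


print(classify("claim your free prize"))
print(classify("lunch at noon"))
```

New text must go through `transform`, not `fit_transform`, so that it is mapped onto the vocabulary the model was trained on; words outside that vocabulary are ignored.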

---

#### Step 11: Classification Report
```python
y_pred = svc.predict(X_test)
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
```
- **Purpose**: Generate a detailed report of the model's performance, including:
  - **Precision**: How many predicted "spam" emails are actually spam?
  - **Recall**: How many actual spam emails were correctly predicted?
  - **F1-Score**: The harmonic mean of precision and recall.
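To make the three metrics concrete, the sketch below evaluates made-up predictions in which no ham is flagged as spam but one spam message is missed, the same pattern as in the notebook's report (perfect precision for spam, lower recall).

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up labels: 1 = spam, 0 = ham. One spam message is missed.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_hat = [1, 1, 0, 0, 0, 0, 0, 0]

precision = precision_score(y_true, y_hat)  # TP / (TP + FP) = 2 / 2
recall = recall_score(y_true, y_hat)        # TP / (TP + FN) = 2 / 3
f1 = f1_score(y_true, y_hat)                # harmonic mean of the two

print(round(precision, 2), round(recall, 2), round(f1, 2))
```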

---

### **Theory: Support Vector Machine (SVM)**

#### What is SVM?
Support Vector Machine is a supervised learning algorithm used for classification tasks. It works by:
1. Finding a hyperplane (decision boundary) that best separates classes (e.g., spam and ham).
2. Maximizing the margin (distance) between the hyperplane and the nearest points from each class.
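
For reference, these two steps correspond to the standard soft-margin formulation for the linear case. This is a textbook sketch: the symbols $w$, $b$, $\xi_i$ and the label coding $y_i \in \{-1, +1\}$ are notation introduced here, not taken from the notebook.

$$
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^2 + C\sum_{i=1}^{n}\xi_i
\quad\text{subject to}\quad
y_i\,(w^\top x_i + b) \ge 1 - \xi_i,\qquad \xi_i \ge 0 .
$$

The hyperplane is $w^\top x + b = 0$ and the margin width is $2/\lVert w\rVert$, so minimizing $\lVert w\rVert^2$ maximizes the margin. The slack variables $\xi_i$ allow some points to violate the margin, and $C$ sets how heavily those violations are penalized. With a non-linear kernel such as the RBF default of `SVC()`, the same problem is solved in its kernelized dual form.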

#### Why SVM for Spam Filtering?
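- Effective in high-dimensional spaces such as bag-of-words text features, where each vocabulary word is one dimension.
- Handles linearly separable data directly and non-linearly separable data through kernels (for example, the RBF kernel that `SVC()` uses by default).
- Margin maximization acts as a form of regularization, which tends to limit overfitting on smaller datasets, although the result still depends on `C` and the kernel.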

---

### **Functions and Constructs Used**

1. **`CountVectorizer`**:
   - Converts text into numerical features using token counts.
   - Commonly used in Natural Language Processing (NLP).

2. **`LabelEncoder`**:
   - Encodes categorical labels into numerical values.

3. **`train_test_split`**:
   - Splits the dataset into training and testing subsets.

4. **`SVC`**:
   - Support Vector Classifier; used here for binary classification (spam vs. ham), though it also supports multi-class problems.

5. **`classification_report`**:
   - Summarizes model performance with metrics like precision, recall, and F1-score.

---

### **Example Run**

#### Input Data (spam.csv):
| v1   | v2                                                |
|------|---------------------------------------------------|
| ham  | Go until jurong point, crazy.. Available only ... |
| spam | Free entry in 2 a wkly comp to win FA Cup fina... |

#### Test Input:
1. `"Congratulations!!!, You won Lottery of $1500000000..."` → Predicted: **Spam**
2. `"Go until jurong point, crazy.."` → Predicted: **Ham**

#### Classification Report Output:
```
              precision    recall  f1-score   support

           0       0.97      1.00      0.99      1195
           1       1.00      0.83      0.91       198

    accuracy                           0.98      1393
   macro avg       0.99      0.92      0.95      1393
weighted avg       0.98      0.98      0.98      1393
```
