from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import pandas as pd

class SpamDetection:
    def __init__(self, data_path):
        # Initialize the class with the path to the dataset
        self.data_path = data_path

    def load_data(self):
        """Load dataset from a specified path."""
        # Read the CSV file into a pandas DataFrame and return it
        return pd.read_csv(self.data_path)

    def preprocess(self, data):
        """Preprocess data by separating features and target."""
        # Use all columns except 'spam' as features (X)
        X = data.drop(columns=['spam'])
        # The 'spam' column is the target variable (y)
        y = data['spam']
        # Split the data into training and testing sets (70% train, 30% test)
        return train_test_split(X, y, test_size=0.3, random_state=42)

    def train_model(self, X_train, y_train):
        """Train the Random Forest model."""
        # Initialize the Random Forest classifier with 100 trees
        rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
        # Fit the model to the training data
        rf_classifier.fit(X_train, y_train)
        # Return the trained model
        return rf_classifier

    def evaluate_model(self, model, X_test, y_test):
        """Evaluate model performance and print results."""
        # Make predictions on the test set
        y_pred = model.predict(X_test)
        # Calculate and print the accuracy of the model
        print("Accuracy:", accuracy_score(y_test, y_pred))
        # Print a detailed classification report (precision, recall, f1-score)
        print(classification_report(y_test, y_pred))

# Implementation
# Create an instance of the SpamDetection class with the specified data path
spam_detector = SpamDetection(data_path='spambase.csv')
# Load the dataset
data = spam_detector.load_data()
# Preprocess the data and split it into training and testing sets
X_train, X_test, y_train, y_test = spam_detector.preprocess(data)
# Train the Random Forest model using the training data
model = spam_detector.train_model(X_train, y_train)
# Evaluate the trained model using the test data
spam_detector.evaluate_model(model, X_test, y_test)

Accuracy: 0.9565532223026793
              precision    recall  f1-score   support

           0       0.95      0.98      0.96       804
           1       0.97      0.93      0.95       577

    accuracy                           0.96      1381
   macro avg       0.96      0.95      0.96      1381
weighted avg       0.96      0.96      0.96      1381



# Let's break down each part of the code and explain its purpose:

### 1. **Class Definition and `__init__` Method**
```python
class SpamDetection:
    def __init__(self, data_path):
        self.data_path = data_path
```
- **Explanation**: 
  - The `SpamDetection` class is defined to encapsulate the functionality for spam detection.
  - The `__init__` method is the constructor, which is called when an instance of the class is created.
  - It takes a `data_path` parameter to specify the location of the dataset and assigns it to the instance variable `self.data_path`.
- **Benefit**: 
  - Passing the path at construction time lets the same class be pointed at any CSV file that has a `spam` target column, without editing the methods.

### 2. **Loading Data**
```python
def load_data(self):
    """Load dataset from a specified path."""
    return pd.read_csv(self.data_path)
```
- **Explanation**: 
  - This method loads the dataset from the path specified during initialization using the `pandas.read_csv` function.
- **Benefit**: 
  - It provides a centralized place to load the data, making it easy to switch datasets if needed.
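
Because `preprocess` later expects a `spam` column, loading is a natural place to fail fast when that column is missing. The following is a minimal sketch of that idea, not part of the original class: the helper name `load_checked`, its `target` parameter, and the in-memory CSV values are all assumptions made for illustration.

```python
import io

import pandas as pd


def load_checked(source, target="spam"):
    """Read a CSV and confirm the target column exists (hypothetical helper)."""
    data = pd.read_csv(source)
    if target not in data.columns:
        raise ValueError(
            f"expected a '{target}' column, found {list(data.columns)}"
        )
    return data


# Tiny in-memory CSV standing in for spambase.csv (made-up values).
csv_text = "word_freq_free,char_freq_dollar,spam\n0.32,0.77,1\n0.00,0.00,0\n"
data = load_checked(io.StringIO(csv_text))
print(data.shape)  # (2, 3)
```

`pd.read_csv` accepts a file path or a file-like object, so the same helper would work with `'spambase.csv'`.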

### 3. **Data Preprocessing**
```python
def preprocess(self, data):
    """Preprocess data by separating features and target."""
    # Use all columns except 'spam' as features
    X = data.drop(columns=['spam'])
    y = data['spam']
    return train_test_split(X, y, test_size=0.3, random_state=42)
```
- **Explanation**: 
  - The `preprocess` method separates the features (`X`) and target variable (`y`). Here, it uses all columns except 'spam' as features (`X`) and assigns the 'spam' column as the target (`y`).
  - Then, it splits the data into training and testing sets using `train_test_split` from scikit-learn with a 70-30 split (70% for training, 30% for testing).
- **Benefit**: 
  - It ensures that the model trains on a subset of the data and is tested on unseen data, which is crucial for evaluating the model's generalization ability.
  - The use of a fixed `random_state` ensures reproducibility of the data split.
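
The split above is purely random, so the class ratio in the training and testing sets can drift slightly from the ratio in the full dataset. One possible variation, sketched here on synthetic data, is to pass `stratify=y` so that both subsets keep the same ratio. The feature names and the 60/40 class mix are made up for illustration; the original code does not stratify.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dataset: 60 non-spam rows (0) and 40 spam rows (1).
rng = np.random.default_rng(42)
data = pd.DataFrame({
    "feature_a": rng.random(100),
    "feature_b": rng.random(100),
    "spam": [0] * 60 + [1] * 40,
})

X = data.drop(columns=["spam"])
y = data["spam"]

# stratify=y asks the splitter to preserve the class ratio in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(len(X_train), len(X_test))  # 70 30
print("spam share (train):", y_train.mean())
print("spam share (test):", y_test.mean())
```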

### 4. **Training the Model**
```python
def train_model(self, X_train, y_train):
    """Train the Random Forest model."""
    rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
    rf_classifier.fit(X_train, y_train)
    return rf_classifier
```
- **Explanation**: 
  - This method trains a Random Forest classifier on the provided training data (`X_train`, `y_train`).
  - The `RandomForestClassifier` is initialized with `n_estimators=100` (the number of trees in the forest) and `random_state=42` for reproducibility.
  - The `fit` method trains the model using the training features (`X_train`) and the target labels (`y_train`).
- **Benefit**: 
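  - Random Forest is an ensemble of decision trees, each trained on a bootstrap sample of the training data, whose predictions are combined into a single output. Combining many trees typically reduces the overfitting seen in a single deep decision tree. The algorithm supports both classification and regression and often performs well with little tuning.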
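
To make the "forest of trees" idea concrete, the sketch below fits the same classifier configuration on synthetic data and inspects the fitted ensemble. The dataset comes from `make_classification` and is an assumption for illustration; `estimators_` and `feature_importances_` are attributes that scikit-learn exposes on a fitted `RandomForestClassifier`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the spam features (not the real dataset).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=42
)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# The fitted forest holds one decision tree per estimator.
print(len(rf.estimators_))  # 100

# Impurity-based importances, normalized to sum to 1 across features.
importances = rf.feature_importances_
for idx in importances.argsort()[::-1][:3]:
    print(f"feature {idx}: {importances[idx]:.3f}")
```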

### 5. **Evaluating the Model**
```python
def evaluate_model(self, model, X_test, y_test):
    """Evaluate model performance and print results."""
    y_pred = model.predict(X_test)
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print(classification_report(y_test, y_pred))
```
- **Explanation**: 
  - This method evaluates the trained model using the test set (`X_test`, `y_test`).
  - It makes predictions using the `predict` method of the trained model (`model`) on the test features (`X_test`).
  - It calculates the accuracy of the model by comparing the predicted labels (`y_pred`) with the true labels (`y_test`) using `accuracy_score`.
  - The `classification_report` provides a more detailed performance evaluation, including precision, recall, and F1-score.
- **Benefit**: 
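  - The evaluation shows how well the model generalizes to unseen data. The classification report adds per-class precision, recall, and F1-score, which matter when classes are imbalanced because accuracy alone can hide weak performance on the smaller class. In the output above, for example, recall is lower for class 1 (0.93) than for class 0 (0.98).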
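
The metrics in the report all derive from the confusion matrix. The sketch below works through that arithmetic on a handful of made-up labels (an assumption for illustration, not the model's real predictions) and checks the hand-computed values against scikit-learn's own functions.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Toy labels: 1 = spam, 0 = not spam (made-up values).
y_true = [0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 0]

# For binary labels the matrix is [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 4 1 1 2

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted spam that is spam
recall = tp / (tp + fn)                     # share of actual spam that was caught

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
# accuracy=0.750 precision=0.667 recall=0.667

# The library functions agree with the manual arithmetic.
assert abs(accuracy - accuracy_score(y_true, y_pred)) < 1e-12
assert abs(precision - precision_score(y_true, y_pred)) < 1e-12
assert abs(recall - recall_score(y_true, y_pred)) < 1e-12
```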

### 6. **Implementation (Outside Class)**
```python
spam_detector = SpamDetection(data_path='spambase.csv')
data = spam_detector.load_data()
X_train, X_test, y_train, y_test = spam_detector.preprocess(data)
model = spam_detector.train_model(X_train, y_train)
spam_detector.evaluate_model(model, X_test, y_test)
```
- **Explanation**: 
  - An instance of the `SpamDetection` class is created with the dataset path (`'spambase.csv'`).
  - The `load_data` method is called to load the dataset into a pandas DataFrame.
  - The `preprocess` method is called to split the data into training and testing sets.
  - The `train_model` method trains the model on the training data.
  - Finally, the `evaluate_model` method evaluates the trained model on the test data.
- **Benefit**: 
  - This block of code ties everything together and provides a clear flow of actions from data loading to model evaluation. It allows you to run the complete spam detection pipeline.
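
If `spambase.csv` is not at hand, the same flow can be exercised end to end on a synthetic file. The sketch below is one way to do that under stated assumptions: `run_pipeline` is a hypothetical helper that condenses the five steps into one function, and the data is generated with `make_classification`, so the printed accuracy says nothing about real spam detection.

```python
import os
import tempfile

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def run_pipeline(data_path, target="spam"):
    """Load, split, train, and score in one call (hypothetical helper)."""
    data = pd.read_csv(data_path)
    X = data.drop(columns=[target])
    y = data[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42
    )
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    return model, accuracy_score(y_test, model.predict(X_test))


# Build a synthetic dataset with a 'spam' column and write it to a temp CSV.
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
frame = pd.DataFrame(X, columns=[f"f{i}" for i in range(5)])
frame["spam"] = y

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "synthetic_spam.csv")
    frame.to_csv(csv_path, index=False)
    model, accuracy = run_pipeline(csv_path)

print(f"Accuracy on synthetic data: {accuracy:.3f}")
```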

### Overall Benefits:
- **Modularity**: Each method is dedicated to a specific task, which improves code readability and maintainability.
- **Reusability**: The class can be reused for any dataset that follows the same layout (a CSV file with a `spam` target column) by passing a different `data_path`.
- **Extensibility**: Because training is isolated in `train_model`, another machine learning algorithm can be substituted for `RandomForestClassifier` by changing that one method.
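
One way to make the classifier swappable without editing the training code is to pass the estimator in as an argument. The sketch below illustrates that design on synthetic data. The helper `train_any` and the choice of `LogisticRegression` as the alternative are assumptions for illustration, not part of the original class.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def train_any(estimator, X_train, y_train):
    """Fit a fresh copy of any scikit-learn classifier (hypothetical helper)."""
    model = clone(estimator)  # leave the caller's estimator unfitted
    model.fit(X_train, y_train)
    return model


# Synthetic data standing in for the spam features.
X, y = make_classification(n_samples=300, n_features=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

candidates = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

# Every scikit-learn classifier exposes fit/predict/score, so the loop is uniform.
scores = {
    name: train_any(estimator, X_train, y_train).score(X_test, y_test)
    for name, estimator in candidates.items()
}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```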