# Decision Tree Classification for Spam Detection

## Task Overview
In this task, we build a Decision Tree classifier to predict whether an email is spam based on its content. We use a labeled dataset of emails, each marked as spam (1) or not spam (0). The goal is to evaluate the model's performance and discuss potential improvements.

## Dataset
The dataset used for this task is in CSV format and consists of two columns:
- **text**: The content of the email.
- **spam**: A binary label indicating whether the email is spam (1) or not spam (0).

### Data Loading
We will begin by loading the dataset and inspecting the first few entries to understand its structure. Additionally, we will output the total number of entries loaded to ensure the data is correctly imported.
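
The loading step can be sketched as follows. This is a minimal illustration, not the notebook's exact code: the in-memory `SAMPLE_CSV` and the helper `load_emails` are hypothetical stand-ins so the example runs without `emails.csv`. It reports the row count both before and after incomplete rows are dropped, since the two can differ.

```python
import io

import pandas as pd

# Hypothetical stand-in for emails.csv so the sketch runs on its own.
SAMPLE_CSV = io.StringIO(
    "text,spam\n"
    '"Subject: win a free prize now",1\n'
    '"Subject: meeting notes attached",0\n'
    '"Subject: row with no label",\n'
)


def load_emails(source):
    """Return the raw frame and a cleaned copy with integer labels."""
    raw = pd.read_csv(source, on_bad_lines="skip")
    # Keep only rows that have both a message body and a label,
    # then convert the float labels (1.0 / 0.0) to integers.
    cleaned = raw.dropna(subset=["text", "spam"]).assign(
        spam=lambda df: df["spam"].astype(int)
    )
    return raw, cleaned


raw, emails = load_emails(SAMPLE_CSV)
print(f"Loaded {len(raw)} rows, kept {len(emails)} complete rows.")
print(emails.head())
```

Printing both counts makes it easier to notice when a large share of the file is discarded during cleaning.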

## Steps
1. **Data Preprocessing**: Handle missing values (rows with missing values are dropped) and convert the label to an integer. Lowercasing and punctuation handling are left to the vectorizer's default tokenizer.
2. **Feature Extraction**: Convert the text data into numerical features using TF-IDF vectorization with English stop words removed.
3. **Data Splitting**: Split the dataset into training (80%) and testing (20%) sets to evaluate the model's performance.
4. **Model Training**: Train a Decision Tree classifier using the training data.
5. **Model Evaluation**: Evaluate the model's accuracy on both the training and testing sets. A classification report and a confusion matrix are also generated to analyze performance on the test set.
6. **Predictions**: Use the trained model to predict the spam status of new email samples.
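
The steps above can also be chained in a scikit-learn `Pipeline`. The sketch below is one possible arrangement under stated assumptions: the toy corpus and the `clean_text` helper are invented for illustration and are not part of the dataset or the notebook's code. One design difference is that the raw text is split before vectorization, so the TF-IDF vocabulary and weights are learned from the training split only.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy corpus invented for illustration; it is not the emails.csv data.
texts = [
    "Subject: win a free prize today, click now!!!",
    "Subject: cheap loans approved instantly, apply online",
    "Subject: exclusive offer: discount watches and pills",
    "Subject: claim your lottery winnings before midnight",
    "Subject: updated spreadsheet and assumptions attached",
    "Subject: meeting moved to Thursday at 10am",
    "Subject: quarterly report draft for your review",
    "Subject: lunch tomorrow with the project team",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]


def clean_text(text):
    """Lowercase, drop a leading 'Subject:' and collapse non-letters to spaces."""
    text = re.sub(r"^subject:\s*", "", text.lower())
    return re.sub(r"[^a-z]+", " ", text).strip()


# Split the raw text first so the vectorizer never sees the test emails.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

model = Pipeline([
    ("tfidf", TfidfVectorizer(preprocessor=clean_text, stop_words="english")),
    ("tree", DecisionTreeClassifier(random_state=42)),
])
model.fit(X_train, y_train)

print("Training accuracy:", model.score(X_train, y_train))
print("Testing accuracy:", model.score(X_test, y_test))
print("Prediction:", model.predict(["Subject: claim your free prize"]))
```

With only eight emails the scores say nothing about real performance; the point is the order of operations, not the numbers.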

## Expected Outcomes
- An accuracy score for both training and testing datasets.
- A classification report showing precision, recall, and F1-score.
- A confusion matrix illustrating the number of true positive, true negative, false positive, and false negative predictions.
- Insights into the model's performance and potential areas for improvement.
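
One area for improvement that commonly applies to decision trees is overfitting: a tree with no depth limit can memorize its training set. The sketch below is an assumption-laden illustration on synthetic data from `make_classification`, not on the email dataset, and the candidate depths are arbitrary. It compares training and testing accuracy for several `max_depth` settings so the gap between the two can be inspected.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy binary classification data (not the email dataset).
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=5, flip_y=0.1, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Map each max_depth setting to (training accuracy, testing accuracy).
results = {}
for depth in (None, 2, 5, 10):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    results[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))

for depth, (train_acc, test_acc) in results.items():
    print(f"max_depth={depth}: train={train_acc:.3f}, test={test_acc:.3f}")
```

A large gap between the two columns for `max_depth=None` would suggest the tree is fitting noise; whether a shallower tree helps on the real emails would need to be checked on that data.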

## Conclusion
By the end of this task, we will have a functional spam detection model based on a Decision Tree classifier and a better understanding of its strengths and weaknesses.


In [15]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

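# Load the CSV, skipping rows that pandas cannot parse
data = pd.read_csv('emails.csv', on_bad_lines='skip')

# Row count as loaded, before any incomplete rows are dropped
print(f"Data loaded successfully: {len(data)} entries.")

# Drop rows with a missing message body or label
data = data.dropna(subset=['text', 'spam'])

print("Data loaded successfully:")
print(data.head())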

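# Labels are read as floats (1.0 / 0.0); convert them to integers
data['spam'] = data['spam'].astype(int)

# Convert the email text to TF-IDF features, ignoring common English stop words.
# Note: the vectorizer is fit on the full dataset before the split below,
# so vocabulary and IDF weights from the test emails leak into training.
tfidf = TfidfVectorizer(stop_words='english')
X = tfidf.fit_transform(data['text'])
y = data['spam']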

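# Hold out 20% of the emails for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a decision tree with default settings (no depth limit)
dtc = DecisionTreeClassifier(random_state=42)
dtc.fit(X_train, y_train)

# Predict on both splits to compare training and testing performance
y_pred_train = dtc.predict(X_train)
y_pred_test = dtc.predict(X_test)

print("Training Accuracy:", accuracy_score(y_train, y_pred_train))
print("Testing Accuracy:", accuracy_score(y_test, y_pred_test))

# Per-class precision, recall, and F1, plus the confusion matrix, on the test set
print("\nClassification Report:\n", classification_report(y_test, y_pred_test))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_test))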

example_emails = [
    "Subject: naturally irresistible your corporate identity  lt is really hard to recollect a company : the  market is full of suqgestions and the information isoverwhelminq ; but a good  catchy logo , stylish statlonery and outstanding website  will make the task much easier .  we do not promise that havinq ordered a iogo your  company will automaticaily become a world ieader : it isguite ciear that  without good products , effective business organization and practicable aim it  will be hotat nowadays market ; but we do promise that your marketing efforts  will become much more effective .",
    "Subject:  Raptors  here is the most recent version of the spreadsheet and the accompanying  assumptions"
]
# Reuse the already fitted vectorizer (transform, not fit_transform) on new emails
example_emails_tfidf = tfidf.transform(example_emails)
predictions = dtc.predict(example_emails_tfidf)
print("Predictions for example emails:", predictions)


Data loaded successfully: 2435 entries.
Data loaded successfully:
                                                text  spam
1  Subject: the stock trading gunslinger  fanny i...   1.0
3  Subject: save your money buy getting this thin...   1.0
5  Subject: save your money buy getting this thin...   1.0
6  Subject: save your money buy getting this thin...   1.0
9  Subject: security alert - confirm your nationa...   1.0
Training Accuracy: 1.0
Testing Accuracy: 0.9113924050632911

Classification Report:
               precision    recall  f1-score   support

           0       0.90      0.88      0.89        32
           1       0.92      0.94      0.93        47

    accuracy                           0.91        79
   macro avg       0.91      0.91      0.91        79
weighted avg       0.91      0.91      0.91        79

Confusion Matrix:
 [[28  4]
 [ 3 44]]
Predictions for example emails: [1 0]
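
The confusion matrix in the output above can be unpacked by hand to check the classification report. The sketch below copies the four counts from that output and assumes scikit-learn's layout, where rows are true labels and columns are predicted labels in the order not spam (0), spam (1).

```python
# Confusion matrix copied from the output above:
# rows are true labels, columns are predicted labels, ordered [0, 1].
confusion = [[28, 4],
             [3, 44]]

tn, fp = confusion[0]  # true not-spam: correctly kept, wrongly flagged
fn, tp = confusion[1]  # true spam: missed, correctly flagged
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision_spam = tp / (tp + fp)
recall_spam = tp / (tp + fn)
f1_spam = 2 * precision_spam * recall_spam / (precision_spam + recall_spam)

print(f"accuracy       = {accuracy:.4f}")        # 0.9114
print(f"precision (1)  = {precision_spam:.4f}")  # 0.9167
print(f"recall (1)     = {recall_spam:.4f}")     # 0.9362
print(f"f1-score (1)   = {f1_spam:.4f}")         # 0.9263
```

These values round to the 0.92, 0.94, and 0.93 reported for class 1. Read directly from the matrix, 4 legitimate emails were flagged as spam and 3 spam emails were missed, out of 79 test emails. Together with a training accuracy of 1.0 against a testing accuracy of about 0.91, this suggests the tree fits the training set more closely than it generalizes.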
