
# Task
This tutorial builds a simple logistic regression model that classifies emails as spam or not spam, using the dataset at "/content/emails.csv".

## Load the data

### Subtask:
Load the `emails.csv` dataset into a pandas DataFrame.


**Reasoning**:
Import pandas and load the CSV file into a DataFrame, then display the head to inspect the data.



In [None]:
import pandas as pd
# Upload emails.csv from the GitHub repo to Colab first so that it exists at /content/emails.csv.
df = pd.read_csv('/content/emails.csv')
display(df.head())

## Explore and preprocess data

### Subtask:
Inspect the data, handle missing values, and prepare the text data for the model (e.g., using TF-IDF).


**Reasoning**:
Inspect the data for shape and missing values, and then convert the text data to TF-IDF representation.



In [None]:
print("Shape of the DataFrame:", df.shape)

print("\nMissing values per column:")
print(df.isnull().sum())

# Handle missing values - drop rows with missing values
df.dropna(inplace=True)
print("\nShape of the DataFrame after dropping missing values:", df.shape)


from sklearn.feature_extraction.text import TfidfVectorizer

# Convert text data to TF-IDF
tfidf_vectorizer = TfidfVectorizer(max_features=5000) # Limiting features to 5000
tfidf_matrix = tfidf_vectorizer.fit_transform(df['text'])

print("\nShape of the TF-IDF matrix:", tfidf_matrix.shape)
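
To make the TF-IDF step more concrete, here is a minimal sketch on a hypothetical three-email corpus (the texts and the `toy_` names are illustrative and not taken from `emails.csv`). With scikit-learn's default settings, single-character tokens such as "a" are dropped, each row is scaled to unit length, and a word that appears in fewer documents receives a larger weight than one that appears in many.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not taken from emails.csv
toy_corpus = [
    "win a free prize now",
    "meeting agenda for monday",
    "free prize inside, claim now",
]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_corpus)

# vocabulary_ maps each token to its column index in the matrix
vocab = toy_vectorizer.vocabulary_
first_row = toy_matrix.toarray()[0]

print("Matrix shape (emails, distinct tokens):", toy_matrix.shape)
# "win" occurs in one email, "free" in two, so "win" is weighted higher
print("Weight of 'win' in email 0: ", round(first_row[vocab["win"]], 3))
print("Weight of 'free' in email 0:", round(first_row[vocab["free"]], 3))
```

The same idea applies to the real dataset, where `max_features=5000` keeps only the 5000 most frequent tokens as columns.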

## Split the data

### Subtask:
Split the data into training and testing sets.


**Reasoning**:
Split the TF-IDF features and the `spam` labels into training and testing sets using `train_test_split`, holding out 20% of the rows for testing.



In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(tfidf_matrix, df['spam'], test_size=0.2, random_state=42)

print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)
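
One caveat about the cells above: the vectorizer was fitted on every row before the split, so the vocabulary and IDF weights were computed with the test emails included. A common alternative is to split the raw text first and fit the vectorizer on the training portion only. The sketch below shows that ordering on a hypothetical ten-email dataset; the texts, labels, and `_alt` variable names are assumptions for illustration, and `stratify` is added so both splits keep the same spam ratio.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy data (1 = spam, 0 = not spam), not taken from emails.csv
toy_texts = [
    "win a free prize now", "claim your free reward today",
    "cheap loans approved instantly", "exclusive offer limited time",
    "you have won the lottery", "meeting agenda for monday",
    "project update and next steps", "lunch tomorrow at noon",
    "quarterly report attached", "please review the draft",
]
toy_labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# 1. Split the raw text first, keeping the class ratio in both parts
train_texts, test_texts, train_labels, test_labels = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels
)

# 2. Learn the vocabulary and IDF weights from the training emails only
vectorizer_alt = TfidfVectorizer(max_features=5000)
X_train_alt = vectorizer_alt.fit_transform(train_texts)

# 3. Reuse the fitted vectorizer on the test emails (transform, not fit)
X_test_alt = vectorizer_alt.transform(test_texts)

print("Train matrix:", X_train_alt.shape)
print("Test matrix: ", X_test_alt.shape)
```

Words that occur only in the test emails are simply ignored by `transform`, which mirrors what happens when the model later sees genuinely new email.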

## Build and train the model

### Subtask:
Create a logistic regression model and train it on the training data.


**Reasoning**:
Import `LogisticRegression`, instantiate it with default settings, and fit it to the training data.



In [None]:
from sklearn.linear_model import LogisticRegression

# Instantiate a LogisticRegression object
model = LogisticRegression()

# Fit the model to the training data
model.fit(X_train, y_train)
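
To illustrate what the fitted model computes, the sketch below trains a logistic regression on a hypothetical one-feature dataset and reproduces its predicted probabilities by hand: a linear score `z = w*x + b` passed through the sigmoid function. The data and the `toy_` names are assumptions made for this example only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical one-feature dataset: larger values belong to class 1
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_model = LogisticRegression()
toy_model.fit(X_toy, y_toy)

# Linear score for every sample, using the learned weight and intercept
z = X_toy @ toy_model.coef_.ravel() + toy_model.intercept_[0]

# Sigmoid turns the score into a probability of class 1
manual_proba = 1.0 / (1.0 + np.exp(-z))
sklearn_proba = toy_model.predict_proba(X_toy)[:, 1]

print("Manual and scikit-learn probabilities match:",
      np.allclose(manual_proba, sklearn_proba))
```

In the spam model the same formula applies, except that `x` is a 5000-dimensional TF-IDF vector and the model learns one weight per token.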

## Evaluate the model

### Subtask:
Evaluate the model's performance on the testing data using appropriate metrics.


**Reasoning**:
Import the necessary metrics from sklearn.metrics, make predictions on the test set, and calculate the evaluation metrics.



In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Make predictions on the test features
y_pred = model.predict(X_test)

# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Print the metrics
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-score: {f1:.4f}")
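
The four metrics can all be derived from the confusion matrix counts. As a minimal sketch, the example below uses hypothetical labels and predictions (not the notebook's results) and computes each metric by hand so that the definitions are explicit.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground truth and predictions (1 = spam, 0 = not spam)
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

# For binary labels, ravel() yields the counts in this order
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()

manual_accuracy = (tp + tn) / (tp + tn + fp + fn)  # correct / all
manual_precision = tp / (tp + fp)  # flagged as spam that really is spam
manual_recall = tp / (tp + fn)     # actual spam that was caught
manual_f1 = (2 * manual_precision * manual_recall
             / (manual_precision + manual_recall))

print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")
print(f"Accuracy:  {manual_accuracy:.4f}")
print(f"Precision: {manual_precision:.4f}")
print(f"Recall:    {manual_recall:.4f}")
print(f"F1-score:  {manual_f1:.4f}")
```

Calling `confusion_matrix(y_test, y_pred)` on the notebook's own variables gives the corresponding counts for the spam model.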

## Make predictions

### Subtask:
Show how to make predictions on new data using the trained model.


**Reasoning**:
Create new email texts, transform them using the fitted TF-IDF vectorizer, and predict their spam classification using the trained model. Then, display the original texts and their predictions.



In [None]:
# 1. Create new email texts
new_emails = [
    "Subject: special offer just for you! limited time!", # Likely spam
]

# 2. Transform new email texts using the fitted tfidf_vectorizer
new_emails_tfidf = tfidf_vectorizer.transform(new_emails)

# 3. Use the trained model to make predictions
predictions = model.predict(new_emails_tfidf)

# 4. Print the original texts and their predictions
print("New Email Predictions:")
for email, prediction in zip(new_emails, predictions):
    print(f"Email: {email}")
    print(f"Prediction: {'Spam' if prediction == 1 else 'Not Spam'}")
    print("-" * 20)

## Summarize findings

### Subtask:
Provide a summary of the tutorial and the results.


**Reasoning**:
Write a markdown summary of the tutorial, including the steps, evaluation metrics, and prediction process.



In [None]:
# Use an f-string so the metric placeholders below are replaced with the computed values
summary = f"""
## Logistic Regression Tutorial: Email Spam Classification

This tutorial demonstrates how to build a Logistic Regression model to classify emails as spam or not spam using the provided dataset.

**1. Data Loading:**
The tutorial began by loading the email dataset from the `/content/emails.csv` file into a pandas DataFrame. The dataset contains two columns: 'text' (the email content) and 'spam' (the target variable, where 1 indicates spam and 0 indicates not spam).

**2. Data Exploration and Preprocessing:**
The data was inspected for its shape and missing values. Fortunately, no missing values were found. The 'text' column, containing the raw email content, was then transformed into a numerical representation using the TF-IDF (Term Frequency-Inverse Document Frequency) vectorization technique. This process converts text into a matrix of numerical features, which is suitable for machine learning models. A maximum of 5000 features was used for the TF-IDF representation.

**3. Data Splitting:**
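The dataset was split into training and testing sets. 80% of the data was allocated for training the model, and the remaining 20% was held out for evaluating its performance. The model therefore never sees the test labels during training. Note, however, that the TF-IDF vectorizer was fitted on the full dataset before the split, so its vocabulary and IDF weights were computed with the test emails included.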

**4. Model Building and Training:**
A Logistic Regression model was instantiated from the scikit-learn library. This model was then trained on the training data (the TF-IDF features and corresponding spam labels). The training process involves the model learning the relationship between the email features and the likelihood of an email being spam.

**5. Model Evaluation:**
After training, the model's performance was evaluated on the unseen testing data. The following metrics were used:

*   **Accuracy:** The proportion of correctly classified emails (both spam and not spam).
*   **Precision:** Out of all emails predicted as spam, the proportion that were actually spam. High precision indicates a low rate of false positives (non-spam emails incorrectly classified as spam).
*   **Recall:** Out of all actual spam emails, the proportion that were correctly identified by the model. High recall indicates a low rate of false negatives (spam emails incorrectly classified as non-spam).
*   **F1-score:** The harmonic mean of precision and recall, providing a balanced measure of the model's performance.

The evaluation results on the test set were:
*   Accuracy: {accuracy:.4f}
*   Precision: {precision:.4f}
*   Recall: {recall:.4f}
*   F1-score: {f1:.4f}

On this split the model reached high accuracy, with precision higher than recall: it rarely flagged legitimate emails as spam but let some spam through.

**6. Making Predictions on New Data:**
Finally, the tutorial demonstrated how to use the trained model to predict whether new, unseen emails are spam or not spam. This involves taking the new email text, transforming it using the *same* fitted TF-IDF vectorizer used during training, and then feeding the resulting numerical features into the trained Logistic Regression model to get a spam/not spam prediction.
"""

print(summary)

## Summary:

### Data Analysis Key Findings

*   The dataset contains 5728 emails with their content and spam classification.
*   There were no missing values in the dataset.
*   The text data was successfully transformed into a TF-IDF matrix with 5000 features.
*   The data was split into an 80% training set (4582 samples) and a 20% testing set (1146 samples).
*   A Logistic Regression model was trained on the TF-IDF features and spam labels.
*   The trained model achieved high performance on the test set:
    *   Accuracy: 0.9808
    *   Precision: 0.9963
    *   Recall: 0.9276
    *   F1-score: 0.9607

### Insights or Next Steps

*   A precision of 0.9963 means that almost every email the model flags as spam really is spam, which matters for keeping legitimate email out of the spam folder (few false positives).
*   A recall of 0.9276 is lower than the precision: roughly 7% of the actual spam in the test set was missed. Techniques that raise recall without giving up much precision, such as lowering the decision threshold, are worth exploring.
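
One way to trade precision for recall is to lower the probability threshold at which an email is labelled spam; for a binary model, `predict` effectively uses 0.5. The sketch below shows the effect on hypothetical probabilities and labels; the arrays and the `metrics_at` helper are illustrative assumptions. With the notebook's model, the probabilities would come from `model.predict_proba(X_test)[:, 1]`.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical true labels and predicted spam probabilities
toy_labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])
toy_spam_proba = np.array([0.90, 0.80, 0.45, 0.35, 0.40, 0.20, 0.10, 0.05])

def metrics_at(threshold):
    """Label an email spam when its probability reaches the threshold."""
    preds = (toy_spam_proba >= threshold).astype(int)
    return precision_score(toy_labels, preds), recall_score(toy_labels, preds)

for threshold in (0.5, 0.3):
    p, r = metrics_at(threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

On this toy data, lowering the threshold from 0.5 to 0.3 catches all the spam at the cost of one false positive. Whether that trade is worthwhile for the real model depends on how costly a misfiled legitimate email is, and the threshold should be chosen on validation data rather than on the test set.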


In [None]:
# The code below demonstrates how to make predictions on new email texts using the trained model.
# It first creates a list of new emails, transforms them into TF-IDF features using the same vectorizer
# used during training, and then uses the trained Logistic Regression model to predict
# whether each email is spam or not.

# 1. Create new email texts
new_emails = [
    "Subject: special offer just for you! limited time!", # Likely spam
    "Subject: meeting scheduled for tomorrow at 10am", # Likely not spam
    "Subject: your account has been compromised - urgent action required", # Likely spam
    "Subject: project update and next steps", # Likely not spam
    "Subject: congratulations! you've won a prize!", # Likely spam
]

# 2. Transform new email texts using the fitted tfidf_vectorizer
new_emails_tfidf = tfidf_vectorizer.transform(new_emails)

# 3. Use the trained model to make predictions
predictions = model.predict(new_emails_tfidf)

# 4. Print the original texts and their predictions
print("New Email Predictions:")
for email, prediction in zip(new_emails, predictions):
    print(f"Email: {email}")
    print(f"Prediction: {'Spam' if prediction == 1 else 'Not Spam'}")
    print("-" * 20)