# **Email spam classifier: Team tasks**
### Let's name it **NeuralSheild**

### **From Nimona Engida, team leader**

<pre> 
Hey team! We're building a program that detects whether an email is spam or not.
Each of you will do one important part.
Don't worry, I'll guide you through it.
</pre>

### **Rules**:
1. Only work in your assigned section.
2. Run all the cells above yours first.
3. If you get stuck, use comments (#) to explain what you are trying to do or
   which alternative/new method you used,
   or ask the team leader for help.
4. Finish on time so the next person can go.

# **Starter code (Please run this first)**

### **Tip:** *# type: ignore* hides editor warnings on lines where a variable is not assigned yet (don't worry about it). Those lines still need your code before the cell will run.


#### 🚨 ***Note: I assigned your names randomly.***

In [1]:
import pandas as pd 
import numpy as np
import re
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression # for now let's use logistic regression, maybe XGBoost later
from sklearn.metrics import accuracy_score

print('All tools are loaded and ready.')
print('You can now start task 1.')

All tools are loaded and ready.
You can now start task 1.


In case you want to know the difference between CountVectorizer and TfidfVectorizer:

  - **CountVectorizer**: Counts how many times each word appears in a document.
  - **TfidfVectorizer**: Also counts word occurrences, but then downweights words that are common across many documents.
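To see the difference in practice, here is a minimal sketch on a tiny made-up corpus (the three messages are invented for illustration and are not from our dataset). The word "free" appears in every message and "prize" in only one, so both get the same raw count in the first message, but TF-IDF should give "prize" the larger weight.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up corpus: "free" is in every message, "prize" is in only one.
docs = [
    "free prize",
    "free lunch today",
    "free meeting today",
]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)

tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(docs)

# vocabulary_ maps each word to its column index in the matrix.
c_free, c_prize = count_vec.vocabulary_["free"], count_vec.vocabulary_["prize"]
t_free, t_prize = tfidf_vec.vocabulary_["free"], tfidf_vec.vocabulary_["prize"]

# Raw counts for the first message: both words appear once.
print("counts:", counts[0, c_free], counts[0, c_prize])

# TF-IDF for the first message: the rarer word "prize" gets the higher weight.
print("tfidf: ", round(tfidf[0, t_free], 3), round(tfidf[0, t_prize], 3))
```

In this project we start with CountVectorizer; swapping in TfidfVectorizer later would only change the line that creates the vectorizer.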

## **Task 1: Data Loading & cleaning**

**Assigned to**: Hemen 

**Your Job**: Get our data ready by cleaning up the text messages.

In [6]:
# ============================================ Task 1 ===================================================================

# --- Step 1: Load the data from our file ---
df = pd.read_csv('NeuralSheild/data/raw/spam.csv') 

# Let's see what our data looks like!
print("First 5 rows of our data:")
print(df.head())  
print("\n")

# --- Step 2: Count how many spam vs ham messages we have ---
print(df['Category'].value_counts()) 
print("\n")

# --- Step 3: Clean the text messages ---

def clean_text(text):
    """
    Clean the text by:
    1. Making everything lowercase
    2. Removing punctuation (! , . ? etc.)
    3. Removing extra spaces
    """
    text = text.lower() 
    text = re.sub(r'[^\w\s]', '', text)  
    text = re.sub(r'\s+', ' ', text).strip() 
    return text
    
# Apply our cleaning function to all messages
df['cleaned_text'] = df['Message'].apply(clean_text)    # type: ignore

# --- Step 4: Show before and after cleaning (already done) ---
print("🔍 Before vs After cleaning (first 2 messages):")
print("ORIGINAL: " + df.loc[0, 'Message'])
print("CLEANED:  " + df.loc[0, 'cleaned_text'])
print("---")
print("ORIGINAL: " + df.loc[1, 'Message'])
print("CLEANED:  " + df.loc[1, 'cleaned_text'])

print("\n✅ GREAT JOB! TASK 1 COMPLETE!")
print("Please tell Team Lead and Teammate 2 you're done")
# ======================== ✅ TASK 1 END ========================

First 5 rows of our data:
  Category                                            Message
0      ham  Go until jurong point, crazy.. Available only ...
1      ham                      Ok lar... Joking wif u oni...
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...
3      ham  U dun say so early hor... U c already then say...
4      ham  Nah I don't think he goes to usf, he lives aro...


Category
ham     19162
spam     6758
Name: count, dtype: int64


🔍 Before vs After cleaning (first 2 messages):
ORIGINAL: Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
CLEANED:  go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat
---
ORIGINAL: Ok lar... Joking wif u oni...
CLEANED:  ok lar joking wif u oni

✅ GREAT JOB! TASK 1 COMPLETE!
Please tell Team Lead and Teammate 2 you're done
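
If you want to check what `clean_text` does on trickier input, here is a small sketch that repeats the same three steps on two made-up strings (the sample messages are invented for illustration). It shows two things worth knowing: the pattern `[^\w\s]` keeps digits and underscores, and removing the hyphen glues the digits of a phone number together.

```python
import re

def clean_text(text):
    """Same three steps as Task 1: lowercase, drop punctuation, squeeze spaces."""
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)       # keeps letters, digits, underscores, whitespace
    text = re.sub(r'\s+', ' ', text).strip()  # collapses tabs and repeated spaces
    return text

# Made-up sample messages, not taken from spam.csv.
samples = [
    "WINNER!!   Call 0800-123 now...",
    "don't_miss   this\toffer",
]
for s in samples:
    print(repr(clean_text(s)))
# 'winner call 0800123 now'
# 'dont_miss this offer'
```

Note that `clean_text` assumes every message is a string. If the `Message` column ever contained missing values, `text.lower()` would raise an error, so it may be worth checking `df['Message'].isna().sum()` first.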


## **Task 2: Preparing Data for the Computer**
**Assigned to:** Yanet

**Your Job:** Convert words into numbers that the computer can understand.

In [8]:
# ======================== 🛠️ TASK 2 START ========================
# PREREQUISITE: Make sure Teammate 1 has run their cell first!

# INSTRUCTIONS: 
# 1. Read the explanations

# --- Step 1: Convert words to numbers ---

# Create a word counter (limits to 500 most common words to keep it simple)
vectorizer = CountVectorizer(max_features=500)

# convert the cleaned messages to numbers
X = vectorizer.fit_transform(df['cleaned_text'])

print(f"We have {X.shape[0]} messages") # Already done
print(f"We're using {X.shape[1]} different words as features")
print("\n")

# --- Step 2: Prepare our labels ---
# Convert 'ham' to 0 and 'spam' to 1 (computers love numbers!)
y = df['Category'].map({'ham': 0, 'spam': 1})

print("First 10 labels:")
print(y.head(10))  

print("\n✅ AWESOME! TASK 2 COMPLETE!")
print("Please tell Team Lead and Teammate 3 you're done")
# ======================== ✅ TASK 2 END ========================

We have 25920 messages
We're using 500 different words as features


First 10 labels:
0    0
1    0
2    1
3    0
4    0
5    1
6    0
7    0
8    1
9    1
Name: Category, dtype: int64

✅ AWESOME! TASK 2 COMPLETE!
Please tell Team Lead and Teammate 3 you're done
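
Two quick sanity checks can be useful after this step: looking at which words the vectorizer kept, and making sure every label was mapped. The sketch below uses a hypothetical four-row table with the same column names as our data (the rows and the names `toy`, `toy_vectorizer`, `toy_X`, and `toy_y` are made up for illustration). One label is deliberately written as `Spam` to show that `.map()` leaves a missing value for anything not in the dictionary.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini dataset with the same column names as our DataFrame.
toy = pd.DataFrame({
    "Category": ["ham", "spam", "ham", "Spam"],   # note the stray capital S
    "cleaned_text": [
        "see you at lunch",
        "win a free prize now",
        "lunch at noon",
        "free prize call now",
    ],
})

# Keep only the 5 most frequent words, like max_features=500 on the real data.
toy_vectorizer = CountVectorizer(max_features=5)
toy_X = toy_vectorizer.fit_transform(toy["cleaned_text"])

# vocabulary_ maps each kept word to its column index in toy_X.
kept_words = sorted(toy_vectorizer.vocabulary_)
print("Kept words:", kept_words)
print("Matrix shape:", toy_X.shape)

# .map() returns NaN for any value that is missing from the dictionary.
toy_y = toy["Category"].map({"ham": 0, "spam": 1})
print("Unmapped labels:", int(toy_y.isna().sum()))
```

On the real data, the same check would be `y.isna().sum()`; a result of 0 means every row is either `ham` or `spam`.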


## **Task 3: Splitting Data & Making Charts**
**Assigned to:** Yosef

**Your Job:** Split our data and create visualizations to understand it better.


In [None]:
# ======================== 🛠️ TASK 3 START ========================
# PREREQUISITE: Make sure Teammate 2 has run their cell first!

# INSTRUCTIONS: 
# 1. Enjoy the charts that appear!

# --- Step 1: Split our data into training and testing ---
# We'll use 80% for training, 20% for testing

X_train, X_test, y_train, y_test =     # type: ignore

print(f"Training messages: {X_train.shape[0]}") # Already done
print(f"Testing messages: {X_test.shape[0]}")
print("\n")

# --- Step 2: Create cool charts! ---

# Chart 1: Spam vs Ham distribution
plt.figure(figsize=(12, 4))

# First chart - how many spam vs ham. Write the code below.

# Add numbers on top of bars. Write the code below.

# Second chart - training vs test split. Write the code below.

# Add numbers on top of bars. Write the code below.


plt.tight_layout()
plt.show()

print("✅ FANTASTIC! TASK 3 COMPLETE!")
print("Please tell Team Lead and Teammate 4 you're done")
# ======================== ✅ TASK 3 END ========================
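
If you are unsure how to start Task 3, here is one possible approach, shown on stand-in data so it does not touch the notebook's own variables. Everything named `X_demo`, `y_demo`, `X_tr`, `X_te`, `y_tr`, `y_te`, and `bar_with_labels` is hypothetical and only for illustration. The choices of `random_state=42` and `stratify` are assumptions, not requirements from the task.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 messages, 5 features, 25% spam.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 3, size=(100, 5))
y_demo = np.array([0] * 75 + [1] * 25)

# test_size=0.2 keeps 20% for testing; stratify keeps the spam ratio
# the same in the training and testing parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

def bar_with_labels(ax, names, values, title):
    """Draw a bar chart and write each value above its bar."""
    bars = ax.bar(names, values)
    for bar, value in zip(bars, values):
        ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height(),
                str(value), ha="center", va="bottom")
    ax.set_title(title)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
bar_with_labels(ax1, ["ham", "spam"],
                [int((y_demo == 0).sum()), int((y_demo == 1).sum())],
                "Spam vs Ham")
bar_with_labels(ax2, ["train", "test"],
                [X_tr.shape[0], X_te.shape[0]],
                "Training vs Test split")
fig.tight_layout()
```

In the notebook you would pass the real `X` and `y` from Task 2 to `train_test_split` and assign the result to `X_train, X_test, y_train, y_test`.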

## **Task 4: Training Our Spam Detective**
**Assigned to:** Sumeya

**Your Job:** Train our computer model to recognize spam!

In [None]:
# ======================== 🛠️ TASK 4 START ========================
# PREREQUISITE: Make sure Teammate 3 has run their cell first!

# INSTRUCTIONS: 
# 1. See how smart our model becomes!

# --- Step 1: Create and train our spam detective ---
# Create a Logistic Regression model (great for yes/no questions) with max_iter=1000
model =      # type: ignore

# Train/fit the model with our training data


print("✅ Training complete! Our model is ready!")
print("\n")

# --- Step 2: Check how well our model learned ---
# See how accurate our model is on the training data
train_predictions =     # type: ignore
train_accuracy =        # type: ignore

print(f"Training Accuracy: {train_accuracy:.1%}")
print("(This shows how well our model learned from the training data)")

print("\n🎉 INCREDIBLE! TASK 4 COMPLETE!")
print("Please tell Team Lead you're done!")
# ======================== ✅ TASK 4 END ========================
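
For Task 4, the sketch below shows the general pattern (create, fit, predict, score) on stand-in data rather than on our real features. The names `X_demo`, `y_demo`, `demo_model`, `train_acc`, and `test_acc` are hypothetical, and the data is randomly generated for illustration only. It also scores the held-out test part, which the task does not ask for, because training accuracy alone can look better than the model really is.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 200 messages, 4 count-like features, 25% spam.
rng = np.random.default_rng(0)
y_demo = np.array([0] * 150 + [1] * 50)
X_demo = rng.poisson(1.0, size=(200, 4))
X_demo[:, 0] += 3 * y_demo   # make feature 0 larger for spam so there is something to learn

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# max_iter=1000 gives the solver more iterations to converge.
demo_model = LogisticRegression(max_iter=1000)
demo_model.fit(X_tr, y_tr)

# accuracy_score takes the true labels first, then the predictions.
train_acc = accuracy_score(y_tr, demo_model.predict(X_tr))
test_acc = accuracy_score(y_te, demo_model.predict(X_te))

print(f"Training accuracy: {train_acc:.1%}")
print(f"Test accuracy:     {test_acc:.1%}")
```

In the notebook, the same calls would use `X_train` and `y_train` from Task 3, with the results stored in `model`, `train_predictions`, and `train_accuracy`.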

# ======= 🚀 TEAM LEADER SECTION =======
# Amazing work team!

### ***As team leader, my tasks are:***
- Building a GUI desktop app for the spam detector (already done)
- Developing a CLI app for the spam detector (under active development)
- Developing a public website for the spam detector (under active development)
- Building the cross-platform spam detector module (80% complete)
- Developing a browser extension for the spam detector (under active development)
- Training the model on spam datasets from many social media platforms (under active development)
- More.