<a href="https://colab.research.google.com/github/Harsha-Sankarasetty/spam-detector-final/blob/main/Spam_message_detector.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [3]:
!curl -s https://api.ipify.org


34.83.9.157

In [None]:
# ==============================================================================
# SPAM DETECTOR V3: FINE-TUNED DISTILBERT (for Google Colab)
# ==============================================================================
# INSTRUCTIONS:
# 1. In your Colab notebook, go to Runtime -> Change runtime type.
# 2. Select "GPU" as the hardware accelerator. This is crucial for training.
# 3. Copy and paste this entire code block into a single cell and run it.
# 4. This final version includes more targeted data to reduce false positives.
# 5. Click the final localtunnel URL to use the most advanced version of the app.
# ==============================================================================

# --- Step 1: Install necessary libraries ---
print("--- Step 1: Installing and Upgrading Libraries ---")
!pip install flask -q
!npm install -g localtunnel
# Force upgrade transformers to ensure the latest API is used
!pip install -U transformers[torch] datasets scikit-learn -q
print("Installation complete.")

# --- Verify transformers version ---
import transformers
print(f"Transformers version check: {transformers.__version__}")


# --- Step 2: Create the 'templates' folder and the HTML file ---
print("\n--- Step 2: Creating templates/index.html ---")
!mkdir -p templates
html_content = """
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Spam Detector v3 (BERT)</title>
    <script src="https://cdn.tailwindcss.com"></script>
    <link rel="preconnect" href="https://fonts.googleapis.com">
    <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
    <link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;500;600;700&display=swap" rel="stylesheet">
    <style>
        body { font-family: 'Inter', sans-serif; }
        .ham { color: #16a34a; }
        .spam { color: #dc2626; }
    </style>
</head>
<body class="bg-gray-100 flex items-center justify-center min-h-screen">
    <div class="w-full max-w-2xl mx-auto bg-white rounded-2xl shadow-lg p-8 md:p-12">
        <div class="text-center mb-8">
            <h1 class="text-3xl md:text-4xl font-bold text-gray-800">Spam Detector v3</h1>
            <p class="text-gray-500 mt-2">Powered by a State-of-the-Art (BERT) Model</p>
        </div>
        <form action="/predict" method="post">
            <div class="mb-6">
                <label for="message" class="block mb-2 text-sm font-medium text-gray-700">Your Message</label>
                <textarea id="message" name="message" rows="6" class="bg-gray-50 border border-gray-300 text-gray-900 text-sm rounded-lg focus:ring-blue-500 focus:border-blue-500 block w-full p-4" placeholder="Type or paste your message here...">{{ message_text if message_text }}</textarea>
            </div>
            <button type="submit" class="w-full text-white bg-blue-600 hover:bg-blue-700 focus:ring-4 focus:outline-none focus:ring-blue-300 font-medium rounded-lg text-lg px-5 py-3 text-center transition-colors duration-300">
                Check Message
            </button>
        </form>
        {% if prediction_text %}
        <div id="result" class="mt-8 border-t pt-6 text-center">
            <h2 class="text-xl font-semibold text-gray-800 mb-2">Result:</h2>
            <p id="prediction" class="text-2xl font-bold {{ result_class }}">
                {{ prediction_text }}
            </p>
        </div>
        {% endif %}
    </div>
</body>
</html>
"""
with open("templates/index.html", "w") as f:
    f.write(html_content)
print("HTML file created successfully.")


# --- Step 3: Fine-Tune and Save the BERT Model ---
print("\n--- Step 3: Fine-Tuning and Saving the BERT Model ---")
import requests
import zipfile
import io
import pandas as pd
import torch
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification, Trainer, TrainingArguments

# Define data loading function
def load_data():
    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip"
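    response = requests.get(url, timeout=60)
    response.raise_for_status()  # stop here on a failed download rather than on a confusing zip error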
    z = zipfile.ZipFile(io.BytesIO(response.content))
    df = pd.read_csv(z.open('SMSSpamCollection'), sep='\t', header=None, names=['label', 'message'])
    df['label'] = df['label'].map({'ham': 0, 'spam': 1})

    print("Augmenting data with new examples...")
    new_data = [
        # Modern Spam Examples
        {'label': 1, 'message': 'We noticed unusual activity on your account. Please verify your details immediately to avoid suspension: https://bit.ly/secure-now'},
        {'label': 1, 'message': 'Your package delivery has been rescheduled. Please confirm your address and a new delivery time here: https://bit.ly/delivery-update'},
        {'label': 1, 'message': 'URGENT! Your account has been compromised. Click here to secure it NOW.'},
        {'label': 1, 'message': 'Congratulations! You have won a $1000 gift card. Claim it here.'},
        {'label': 1, 'message': 'Your tax refund is pending. Please provide your details to process the payment.'},
        {'label': 1, 'message': 'A suspicious login to your account was detected. If this was not you, please click here to reset your password.'},
        {'label': 1, 'message': "Hey, I saw this and thought of you! A new investment opportunity that's doing really well. My cousin already made a lot. Check it out: [some-link]"},
        {'label': 1, 'message': "You won't believe the returns on this new crypto coin. I'm already up 50%. Get in before it's too late!"},
        {'label': 1, 'message': "A friend showed me this stock and it's about to take off. This is not financial advice but you should see this."},
        {'label': 1, 'message': "I'm so sorry to bother you, I'm stuck abroad and lost my wallet. Could you possibly help me out with a small loan? I'll pay you back as soon as I get home. It's really an emergency."},
        {'label': 1, 'message': "This is an urgent request. I'm in a difficult situation and need some financial assistance. Can you help?"},
        {'label': 1, 'message': "Someone you know mentioned you in a photo. Click here to see the post and who tagged you."},
        {'label': 1, 'message': "You have a new friend request. Click to view their profile."},
        {'label': 1, 'message': "Your friend tagged you in a new video. Watch it now!"},


        # Modern Ham Examples (that might look like spam)
        {'label': 0, 'message': 'Just a security notification: A new device logged into your account from a new location. If this was not you, please contact support.'},
        {'label': 0, 'message': 'Your monthly invoice is ready. You can view the full details and payment options in your account dashboard.'},
        {'label': 0, 'message': 'The prize for winning the sales competition will be announced Friday.'},
        {'label': 0, 'message': 'Your password has been successfully reset. If you did not make this change, please contact us immediately.'},
        {'label': 0, 'message': 'Your one-time security code is 123456. Do not share this with anyone.'},
        {'label': 0, 'message': "Hey, I saw this article about our favorite band and thought of you! Check it out when you have a chance."},
        {'label': 0, 'message': "Remember that investment app we talked about? I finally signed up. Let me know if you want my referral code."},
        {'label': 0, 'message': "My cousin is running a marathon for charity, I'm sharing the donation link in case you're interested."},
        {'label': 0, 'message': "The winner of our company's 'Innovator of the Month' prize will be announced at the meeting. Your project is a strong contender!"},
        {'label': 0, 'message': "Congratulations to the winner of this quarter's performance prize!"},
        {'label': 0, 'message': "Don't forget to submit your project for the annual innovation prize. The winner gets a bonus."},
        {'label': 0, 'message': "The free entry passes for the corporate event are now available for pickup."},
        {'label': 0, 'message': "Click here to view the agenda for the upcoming all-hands meeting."},
        {'label': 0, 'message': "Hey, my flight was cancelled and I'm stuck in London overnight. I'll have to reschedule our meeting tomorrow. So sorry for the trouble."},
        {'label': 0, 'message': "I'm having a bit of an emergency, my car broke down. I'm going to be late."},
        {'label': 0, 'message': "Dear customer, a payment of $55.90 to 'Online Retailer' was just processed. If you do not recognize this transaction, click here to report it."},
        {'label': 0, 'message': "Your order #12345 has shipped! You can track your package using the link provided."},
        {'label': 0, 'message': "Thanks for your recent purchase. Here is your receipt and order details."},
        {'label': 0, 'message': "Last chance to get 25% off! Our biggest sale of the year ends tonight. Don't miss out on these amazing deals. Click to shop now."},
        {'label': 0, 'message': "Your weekly newsletter is here! Check out the latest articles and updates."},
        {'label': 0, 'message': "Flash Sale Alert! Get 50% off all items for the next 24 hours. Thanks for being a loyal customer."},
        {'label': 0, 'message': "Sarah just shared a photo with you on SocialApp. Click to view."},
        {'label': 0, 'message': "You have a new message from John. Reply now."},
        {'label': 0, 'message': "Your friend just posted for the first time in a while."},
        {'label': 0, 'message': "We need to review the security protocols for the upcoming software release. Can you send me the documentation on our spam and phishing detection algorithms?"},
        {'label': 0, 'message': "Our new anti-spam filter is getting great reviews. Let's discuss the implementation plan."},
        {'label': 0, 'message': "Please review the attached report on the latest phishing threats and our mitigation strategies."},
        {'label': 0, 'message': "URGENT IT NOTIFICATION: We will be performing a critical security update tonight at 10 PM. Please ensure all your work is saved and you are logged out of all systems before then to avoid data loss."},
        {'label': 0, 'message': "Security Alert: Please be aware of a new phishing campaign targeting our employees. Do not click any suspicious links."}
    ]

    new_df = pd.DataFrame(new_data)
    df = pd.concat([df, new_df], ignore_index=True)
    print(f"Data augmented. New dataset size: {len(df)}")

    return df

# Load data
df = load_data()
train_texts = df['message'].tolist()
train_labels = df['label'].tolist()

# Load tokenizer
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

# Tokenize the texts
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=128)

# Create a PyTorch dataset
class SpamDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item
    def __len__(self):
        return len(self.labels)

train_dataset = SpamDataset(train_encodings, train_labels)

# Load the pre-trained model
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)

# Define training arguments - Simplified for compatibility
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3, # Increased epochs for better learning on new data
    per_device_train_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    report_to="none",
)

# Create the standard Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)

# Fine-tune the model
trainer.train()

# Save the fine-tuned model and tokenizer
model.save_pretrained('./spam_bert_model')
tokenizer.save_pretrained('./spam_bert_model')

print("✅ BERT Model and Tokenizer have been fine-tuned and saved.")


# --- Step 4: Run the Flask Web Application with Localtunnel ---
print("\n--- Step 4: Launching the Web App with Localtunnel ---")
from flask import Flask, render_template, request
from transformers import pipeline
import subprocess
import threading
import time

# Set up Flask app
app = Flask(__name__)

# Load the fine-tuned model using a pipeline for easy inference
spam_classifier = pipeline('text-classification', model='./spam_bert_model', tokenizer='./spam_bert_model')

# Define routes
@app.route('/')
def home():
    return render_template('index.html')

@app.route('/predict', methods=['POST'])
def predict():
    user_message = request.form.get('message')
    if not user_message:
        return render_template('index.html', prediction_text="Please enter a message to classify.", message_text="")
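    # Truncate long inputs to the same 128-token limit used during training
    prediction = spam_classifier(user_message, truncation=True, max_length=128)[0]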
    if prediction['label'] == 'LABEL_1':
        result_text = "This looks like Spam."
        result_class = "spam"
    else:
        result_text = "This looks like a legitimate message (Ham)."
        result_class = "ham"
    return render_template('index.html', prediction_text=result_text, message_text=user_message, result_class=result_class)

# Function to run Flask app
def run_app():
    app.run(port=5001)

# Start Flask app in a separate thread
flask_thread = threading.Thread(target=run_app)
flask_thread.daemon = True
flask_thread.start()
time.sleep(2)

# Start localtunnel and automatically handle IP verification
print("Starting localtunnel...")
lt_process = subprocess.Popen(['lt', '--port', '5001'], stdout=subprocess.PIPE, stderr=subprocess.PIPE)

for line in iter(lt_process.stdout.readline, b''):
    if b'your url is:' in line:
        url = line.decode('utf-8').split(' ')[-1].strip()
        print(f"Tunnel URL: {url}")
        print("Waiting for tunnel to stabilize...")
        time.sleep(5)
        bypass_url = url + "/bypass-tunnel-reminder"
        for i in range(3):
            print(f"Attempt {i+1} to bypass password page...")
            try:
                subprocess.run(['curl', '-s', bypass_url], check=True, timeout=10)
                print("Bypass successful!")
                break
            except (subprocess.CalledProcessError, subprocess.TimeoutExpired) as e:
                print(f"Bypass attempt {i+1} failed. Retrying in 5 seconds...")
                if i < 2:
                    time.sleep(5)
        print(f"\n✅ Your Spam Detector v3 is live!")
        print(f"👉 Click here to open: {url}")
        break

# Keep the main thread alive by waiting for the localtunnel process to terminate
try:
    lt_process.wait()
except KeyboardInterrupt:
    print("Stopping the web server.")
    lt_process.kill()


--- Step 1: Installing and Upgrading Libraries ---

changed 22 packages in 2s

3 packages are looking for funding
  run `npm fund` for details
Installation complete.
Transformers version check: 4.56.1

--- Step 2: Creating templates/index.html ---
HTML file created successfully.

--- Step 3: Fine-Tuning and Saving the BERT Model ---
Augmenting data with new examples...
Data augmented. New dataset size: 5615


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss
10,0.6168
20,0.6087
30,0.567
40,0.5089
50,0.4351
60,0.3699
70,0.2822
80,0.1967
90,0.1292
100,0.1347
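
For context on the loss log above, the length of the run can be worked out by hand from the values passed to `TrainingArguments`. The sketch below assumes a single GPU and no gradient accumulation (the `Trainer` defaults) and uses the dataset size printed earlier (5615). The helper name `training_steps` is illustrative and not part of the notebook.

```python
import math

def training_steps(num_examples, batch_size, epochs):
    """Return (steps_per_epoch, total_steps) for one device, no gradient accumulation."""
    # The last partial batch still counts as a step, hence the ceiling.
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs

# Values taken from the notebook: 5615 examples, batch size 16, 3 epochs, 500 warmup steps.
steps_per_epoch, total_steps = training_steps(num_examples=5615, batch_size=16, epochs=3)
warmup_steps = 500
warmup_fraction = warmup_steps / total_steps

print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps: {total_steps}")
print(f"warmup covers {warmup_fraction:.0%} of training")
```

Under these assumptions the learning rate is still ramping up for nearly half of the run, and the logged rows shown here all fall inside that warmup window.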


Device set to use cuda:0


✅ BERT Model and Tokenizer have been fine-tuned and saved.

--- Step 4: Launching the Web App with Localtunnel ---
 * Serving Flask app '__main__'
 * Debug mode: off


Address already in use
Port 5001 is in use by another program. Either identify and stop that program, or start the server with a different port.


Starting localtunnel...
Tunnel URL: https://fancy-tires-pull.loca.lt
Waiting for tunnel to stabilize...
Attempt 1 to bypass password page...


INFO:werkzeug:127.0.0.1 - - [06/Sep/2025 11:22:49] "GET /bypass-tunnel-reminder HTTP/1.1" 404 -


Bypass successful!

✅ Your Spam Detector v3 is live!
👉 Click here to open: https://fancy-tires-pull.loca.lt


INFO:werkzeug:127.0.0.1 - - [06/Sep/2025 11:23:09] "GET /favicon.ico HTTP/1.1" 404 -
INFO:werkzeug:127.0.0.1 - - [06/Sep/2025 11:23:13] "GET / HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [06/Sep/2025 11:23:14] "GET /favicon.ico HTTP/1.1" 404 -
INFO:werkzeug:127.0.0.1 - - [06/Sep/2025 11:23:52] "POST /predict HTTP/1.1" 200 -
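
The run above trains on the full dataset and reports only training loss, so it does not show whether the added examples actually reduce false positives. One way to check is to hold out some labelled messages and count the outcomes. The sketch below does that in plain Python. The names `label_to_int` and `spam_metrics` and the toy labels are hypothetical. In the notebook, the predictions would come from calling `spam_classifier` on held-out messages, where `'LABEL_1'` means spam, as in the `/predict` route.

```python
def label_to_int(pipeline_label):
    """Map the pipeline's string label to 1 (spam) or 0 (ham)."""
    return 1 if pipeline_label == "LABEL_1" else 0

def spam_metrics(y_true, y_pred):
    """Confusion counts and rates for binary labels (1 = spam, 0 = ham)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham flagged as spam
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam that slipped through
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
    }

# Toy data standing in for held-out labels and pipeline outputs (illustrative only).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
raw_predictions = ["LABEL_1", "LABEL_1", "LABEL_0", "LABEL_0",
                   "LABEL_0", "LABEL_0", "LABEL_1", "LABEL_0"]
y_pred = [label_to_int(label) for label in raw_predictions]

metrics = spam_metrics(y_true, y_pred)
print(metrics)
```

The false-positive rate is the number to watch for the goal stated in the instructions, since it measures how often legitimate messages are flagged as spam.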
