# Final Assignment - Data Science Methodology

## ROLE - The Client

### Problem:

We are an email service provider with 10,000 business clients. Our selling point is security and efficiency.

However, over the last six months, spammers have become much smarter. It used to be easy to spot spam—it was just bad grammar and 'You won a lottery' subject lines. Now, spammers are using sophisticated templates that look like real invoices or bank alerts.

**The Impact:**

- Lost Productivity: Our business clients are complaining that their employees spend 20 minutes every morning just deleting junk.

- Churn Risk: Three major clients have threatened to cancel their contracts and move to Gmail if we don't fix this.

- Volume: We receive 5 million emails a day. We cannot hire humans to check them (for privacy reasons and sheer volume). We need a system that scales.



----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

## ROLE - Data Scientist

Solving the problem using the **CRISP-DM** methodology (Cross-Industry Standard Process for Data Mining).

### Stage 1: Business Understanding

Before we even touch the data, we need to understand the business. We must identify the problem, the constraints, the goals, and the deliverables for the project. Essentially, we need to define what success looks like.

In our problem statement, we saw the core issue: spam emails are getting mixed in with important emails. This directly hurts user productivity and puts major clients at risk of churning. Since the volume of incoming emails is massive, we can't solve this manually.

Therefore, our goal isn't just identifying patterns; it is to help the client increase productivity and reduce churn risk.

The critical constraint here is minimizing False Positives. We cannot risk flagging a real invoice or a legitimate email as spam. If a critical email gets moved to the spam folder, it could cause a direct business loss.

Finally, the output we expect from the model is a binary classification: 0 for Safe and 1 for Spam.

### Stage 2: Data Understanding


Once we have understood the business problem, we move to the second stage: Data Understanding.

Here, we ask ourselves: What data do we actually have?

The major dataset is the email logs from the client's server. To understand this, we need to pull that raw data and save it into our database for inspection.

- We then look for specific patterns. What does a spam email actually look like?
- We analyze the Subject Lines: Are there common triggers?
- We examine the Body: What keywords appear most often? What is the structure of the text?
- We check the Metadata: Is there a specific sender address? What time of day did these emails arrive?
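As a rough illustration of this kind of exploration, here is a minimal sketch that counts the most common subject-line words and arrival hours for spam. The sample rows, the field names (`subject`, `hour`, `label`), and the helper `top_subject_words` are all made up for illustration; the real email logs will look different.

```python
from collections import Counter

# Hypothetical, hand-written sample of labeled log rows (1 = spam, 0 = safe).
emails = [
    {"subject": "Invoice overdue: pay now", "hour": 3, "label": 1},
    {"subject": "Your bank alert: verify account now", "hour": 2, "label": 1},
    {"subject": "Project meeting at 10", "hour": 9, "label": 0},
    {"subject": "Invoice for March", "hour": 14, "label": 0},
]

def top_subject_words(rows, label, n=3):
    """Count the most common subject-line words for one class."""
    counts = Counter()
    for row in rows:
        if row["label"] == label:
            counts.update(row["subject"].lower().split())
    return counts.most_common(n)

# Which words show up most often in spam subjects?
spam_words = top_subject_words(emails, label=1)

# At what hours of the day did the spam arrive?
spam_hours = Counter(row["hour"] for row in emails if row["label"] == 1)

print(spam_words)
print(spam_hours)
```

Even on this toy sample, a word like "invoice" appears in both spam and safe subjects, which is exactly why the new, smarter spam is hard to spot by keywords alone.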

By asking these questions, we also discover problems with the data. The biggest issue we identify is that computers can't read language—they only understand numbers. We realize we will need to convert this text into tokens later. We also find 'dirty' data, such as empty emails or broken characters.
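A small sketch of how the 'dirty' rows could be counted is shown below. It assumes that an empty email is a body that is missing or only whitespace, and that broken characters show up as the Unicode replacement character (U+FFFD). Both are my assumptions, and the helper name `find_dirty` is hypothetical.

```python
def find_dirty(bodies):
    """Return the indexes of empty bodies and of bodies with broken characters."""
    # Assumption: "empty" means missing (None) or whitespace-only.
    empty = [i for i, body in enumerate(bodies) if not body or not body.strip()]
    # Assumption: a failed decode left the replacement character U+FFFD behind.
    broken = [i for i, body in enumerate(bodies) if body and "\ufffd" in body]
    return empty, broken

# Hypothetical sample of email bodies.
sample_bodies = ["Hello team", "", "   ", "Caf\ufffd invoice attached", None]
empty_rows, broken_rows = find_dirty(sample_bodies)

print("empty:", empty_rows)
print("broken:", broken_rows)
```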

The outcome of this stage is crucial. All the insights and problems we discover here become the input for the next stage: Data Preparation.

### Stage 3: Data Preparation


In my three weeks of learning Data Science, I’ve heard countless times that this stage consumes the most time in any project. It is the 'janitor work' of the process.

Based on the insights we gathered in the previous stage, we now apply the fixes to the data.

First, we clean the data. Emails often arrive full of HTML tags (like `<div>` or `<br>`). The first step is to strip these tags out so we are left with just the plain text.
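One possible way to strip the tags, sketched with Python's built-in `html.parser` module. The class name `TextExtractor` and the function `strip_html` are my own names, not part of any library. Note that this simple version also keeps the text inside `<script>` or `<style>` blocks, which a real pipeline would probably want to drop.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects only the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called by the parser for every run of plain text it finds.
        self.parts.append(data)

def strip_html(raw):
    """Return the plain text of an HTML email body."""
    parser = TextExtractor()
    parser.feed(raw)
    parser.close()
    # Collapse the extra whitespace left behind where tags used to be.
    return " ".join(" ".join(parser.parts).split())

plain = strip_html("<div>Your invoice<br>is <b>ready</b></div>")
print(plain)
```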

Next, we handle inconsistency. Emails are written in all kinds of formats—some are all caps, some are lowercase, some are mixed. To a computer, 'FREE' and 'free' look like two different words. So, I will convert everything into a universal lowercase format to ensure consistency.

Finally, as we established earlier, computers don't understand language. We need to tokenize the emails. I will break every sentence down into individual words—or 'tokens'—and store them in a list. A raw email is now transformed into a clean list of keywords like ['lottery', 'winner', 'click', 'here'] or ['invoice', 'cash'].
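The lowercase and tokenize steps could look roughly like this. The regular expression and the tiny `STOP_WORDS` set are assumptions for illustration; I added stop-word removal only so that the output contains keywords rather than filler words, and a real project would use a fuller list.

```python
import re

# Hypothetical stop-word list, kept tiny for the example.
STOP_WORDS = {"a", "an", "the", "you", "are", "is", "to", "of", "and"}

def tokenize(text):
    """Lowercase the text, split it into words, and drop filler words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [word for word in words if word not in STOP_WORDS]

tokens = tokenize("You are a LOTTERY winner! Click here")
print(tokens)
```

Because everything is lowercased first, `tokenize("FREE")` and `tokenize("free")` give the same result.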

This prepared data is now ready to be fed into the analysis. It becomes the clean input for our next stage: Modeling.

### Stage 4: Modeling


I’ll be honest—I haven't built a model practically yet. But through this course, I’ve learned that the most critical step isn't just writing code; it's selecting the appropriate model for the specific problem we are solving.

Why Logistic Regression? In our first stage (Business Understanding), we defined that we need a simple 'Yes' or 'No' output: Is this Spam (1) or Safe (0)? We also need to know the conviction—how confident is the model in its decision?

Based on my learning, Logistic Regression fits this well. Why? Because it is a simple model that outputs a probability rather than only a hard label. It doesn't just guess; it tells us, 'I am 80% sure this is spam.' That probability lets us choose a cautious cutoff before flagging an email, which aligns with our business need to minimize False Positives.
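To make the probability idea concrete, here is a minimal sketch of the sigmoid function that Logistic Regression uses to turn a raw score into a probability, plus a cutoff. The 0.9 threshold is only an assumed example of a cautious setting; the right value would have to come out of the Evaluation stage.

```python
import math

def sigmoid(z):
    """Squash any real-valued score into a probability between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # Equivalent form that avoids overflow for very negative scores.
    e = math.exp(z)
    return e / (1.0 + e)

def classify(score, threshold=0.9):
    """Return (label, probability); flag as spam only above the threshold."""
    probability = sigmoid(score)
    label = 1 if probability >= threshold else 0
    return label, probability

# A raw score of ln(4) corresponds to '80% sure this is spam'.
label_cautious, p = classify(math.log(4))               # assumed cutoff of 0.9
label_default, _ = classify(math.log(4), threshold=0.5)
print(round(p, 2), label_cautious, label_default)
```

With the cautious cutoff, an email the model is only 80% sure about stays in the inbox, which trades some missed spam for fewer False Positives.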

The Process: We take our tens of thousands of cleaned, tokenized emails and split them. A common standard is a 75/25 split:

- 75% is used for Training (Teaching the model).
- 25% is hidden away for Testing (Checking the model later).
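A minimal sketch of the 75/25 split using only the standard library. The function name and the fixed `seed` are my choices (the seed just makes the shuffle repeatable); libraries such as scikit-learn provide a ready-made helper for the same job.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle the rows and hold back a fraction of them for testing."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # shuffle so the split is not ordered
    n_test = int(round(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]

# Stand-in for 100 prepared emails, represented here only by their ids.
train, test = train_test_split(range(100))
print(len(train), len(test))
```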

How it Learns: During training, the model 'crunches' the data.

- It learns that words like 'Meeting,' 'Project,' or 'Mom' usually appear in safe emails, pulling the probability down toward 0.
- In parallel, it learns that terms like 'Wire Transfer,' 'Lottery,' or 'Click Here' push the probability up toward 1 (Spam).

By the end of this stage, we have a trained model that can look at a new email and calculate the probability of it being junk.
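Since I haven't built a model in practice yet, the following is only a toy sketch of how the training could work: a tiny Logistic Regression trained with gradient descent, where each word gets one weight. The four training emails, the learning rate, and the number of epochs are all made-up values, and a real project would rely on a tested library instead of hand-written code.

```python
import math

def sigmoid(z):
    """Turn a raw score into a probability between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train(samples, epochs=200, lr=0.5):
    """samples: list of (tokens, label). Returns (word weights, bias)."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for tokens, label in samples:
            words = set(tokens)  # feature = word is present in the email
            z = bias + sum(weights.get(w, 0.0) for w in words)
            error = sigmoid(z) - label  # gradient of the log loss w.r.t. z
            bias -= lr * error
            for w in words:
                weights[w] = weights.get(w, 0.0) - lr * error
    return weights, bias

def predict_proba(tokens, weights, bias):
    """Probability that an email with these tokens is spam."""
    z = bias + sum(weights.get(w, 0.0) for w in set(tokens))
    return sigmoid(z)

# Hypothetical training set: 1 = spam, 0 = safe.
samples = [
    (["lottery", "winner", "click", "here"], 1),
    (["wire", "transfer", "click", "here"], 1),
    (["project", "meeting", "tomorrow"], 0),
    (["mom", "dinner", "tomorrow"], 0),
]
weights, bias = train(samples)

p_spam = predict_proba(["lottery", "click"], weights, bias)
p_safe = predict_proba(["project", "meeting"], weights, bias)
print(p_spam, p_safe)
```

After training, words seen only in spam (like 'lottery') end up with positive weights and words seen only in safe emails (like 'meeting') end up with negative weights, which is the push and pull described above.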

### Stage 5: Evaluation

We have reached the fifth stage: Evaluation. This is arguably the most important safety check in the entire Data Science lifecycle.

Why? Because before we deploy this model into the real world, we need to prove it works.

Remember the 25% of data we held back during the previous stage? That data is now our 'Exam Paper.' The model has never seen these emails before. We feed this test data into the model and compare its predictions against the real answers.

What are we looking for? First, we look at Performance: If the model catches 90% to 98% of the spam, that’s a great start.

But—and this is critical—we must look back at our Business Constraints. In Stage 1, we explicitly stated that we cannot afford to lose important emails. So, I have to ask: 'Did the model accidentally flag an email from the CEO as spam?'

If the answer is 'Yes,' then even with a 98% spam catch rate, the model is a failure. We cannot deploy it.
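A sketch of how this check could be scored on the test set. It counts the four possible outcomes and derives the catch rate (recall) and the False Positive count. The labels below are invented, and the deployment rule (at least 90% recall and zero False Positives) is my reading of the constraint from Stage 1, not a fixed standard.

```python
def evaluate(y_true, y_pred):
    """Compare predictions with the real answers (1 = spam, 0 = safe)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam caught
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam missed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # safe email flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # safe email kept
    return {
        "tp": tp, "fn": fn, "fp": fp, "tn": tn,
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
    }

# Hypothetical real answers and model predictions for 10 test emails.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
report = evaluate(y_true, y_pred)

# Assumed deployment gate: good catch rate AND no legitimate email flagged.
ready_to_deploy = report["recall"] >= 0.90 and report["fp"] == 0
print(report, ready_to_deploy)
```

In this example one safe email was flagged, so the gate fails even though most predictions are correct.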

The Iterative Loop: If we fail this check, we stop. We don't deploy. Instead, we go back to the Modeling or even Data Preparation stage to fix the issue. We iterate through these steps again and again until we get a model that is both accurate and safe.

Only when the model meets the criteria we defined in the beginning do we give the green light for the final stage.

### Stage 6: Deployment

Finally, we arrive at Stage 6: Deployment.

Once we have confirmed all the critical outcomes in the Evaluation stage, we are ready to go live.

The Integration: For this project, the model needs to sit directly on the client's Email Server.

Here is how it works in production:

- Every email that arrives at the server passes through our model first.
- The model flags it: 0 (Safe) or 1 (Spam).
- If it’s Safe, it goes to the Inbox. If it’s Spam, it is routed to the Junk folder.
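The three steps above can be sketched as a single routing function. Everything here is hypothetical: `toy_score` is only a stand-in for the trained model, and the 0.9 threshold is an assumed cautious cutoff, not a value from the project.

```python
def route(email_text, score_fn, threshold=0.9):
    """Score one incoming email and decide which folder it goes to."""
    probability = score_fn(email_text)
    label = 1 if probability >= threshold else 0  # 1 = Spam, 0 = Safe
    folder = "Junk" if label == 1 else "Inbox"
    return label, folder

def toy_score(email_text):
    # Hypothetical stand-in for the trained model's probability output.
    return 0.95 if "lottery" in email_text.lower() else 0.05

decision_spam = route("You won the LOTTERY, click here", toy_score)
decision_safe = route("Project meeting moved to 3pm", toy_score)
print(decision_spam, decision_safe)
```

Passing the scoring function in as a parameter means the routing code does not need to change when the model is retrained or replaced.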

A Critical Note on Privacy: We must remember we are dealing with private, sensitive user data. We need to ensure that every email is encrypted during this process so that no private information is leaked while the model is reading it.

The Feedback Loop: But here is the truth: Deployment is not the end of a Data Science project. In fact, this is where the real work starts.

Once the model is in the real world, we must continuously collect feedback:

- Are users manually moving emails out of the Spam folder? (That means our model made a mistake.)
- Are spammers using new tricks?

This new data might reveal new variables we missed earlier, like spammers using emojis or specific colors to trick the system. This new information flows right back to Stage 1 (Business Understanding), and the cycle begins again. We constantly update the model to keep it smarter than the spammers.
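One way the feedback could be captured is sketched below: every time a user moves an email between folders, the move is stored as a corrected label for the next round of training. The class `FeedbackLog`, the folder names, and the message ids are assumptions for illustration. Treating an Inbox-to-Junk move as missed spam is my own addition to the idea.

```python
from dataclasses import dataclass, field

@dataclass
class FeedbackLog:
    """Stores user corrections as fresh labeled examples for retraining."""
    corrections: list = field(default_factory=list)

    def record_move(self, email_id, from_folder, to_folder):
        if from_folder == "Junk" and to_folder == "Inbox":
            # User rescued a legitimate email: the model made a False Positive.
            self.corrections.append((email_id, 0))
        elif from_folder == "Inbox" and to_folder == "Junk":
            # User reported missed spam: the model made a False Negative.
            self.corrections.append((email_id, 1))

    def false_positives(self):
        """Ids of the emails the model wrongly sent to Junk."""
        return [eid for eid, label in self.corrections if label == 0]

log = FeedbackLog()
log.record_move("msg-001", "Junk", "Inbox")
log.record_move("msg-002", "Inbox", "Junk")
log.record_move("msg-003", "Inbox", "Archive")  # ignored: not a spam correction
print(log.corrections, log.false_positives())
```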

# Author
Prdeep Buchadi
