# Email Spam Detection System

## Business Problem
Email spam causes productivity loss and security risks (phishing, malware, fraud).

## Objective
Build an NLP classification system to automatically detect spam emails.

## Business Impact
- Reduce phishing risk
- Increase employee productivity
- Improve email security automation


In [6]:
import pandas as pd

df = pd.read_csv("data/emails.csv")
df.head()


| | text | label |
|---|---|---|
| 0 | Your account has been suspended, click here to... | spam |
| 1 | Meeting scheduled for tomorrow at 10am | ham |
| 2 | Limited offer!!! Buy now and get 50% discount | spam |
| 3 | Invoice attached for your recent purchase | ham |
| 4 | Urgent: Update your password immediately | spam |


In [7]:
df['label'].value_counts()
df['text'][0]

'Your account has been suspended, click here to verify'

In [8]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

X = df['text']
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

tfidf = TfidfVectorizer(max_features=3000, ngram_range=(1,2))
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)


In [9]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

model = LogisticRegression()
model.fit(X_train_tfidf, y_train)

y_pred = model.predict(X_test_tfidf)

print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

         ham       1.00      1.00      1.00        36
        spam       1.00      1.00      1.00        44

    accuracy                           1.00        80
   macro avg       1.00      1.00      1.00        80
weighted avg       1.00      1.00      1.00        80



## Business Insight
The spam classifier scores 1.00 precision, recall, and F1 on the 80-email test split and can be integrated into:
- Corporate email gateways
- Phishing detection systems
- Security automation pipelines

ROI:
Even a 5% reduction in phishing can save significant IT and security costs.
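
The scores reported above come from a single 80-email split, so a cross-validated estimate may give a steadier picture before any gateway integration. The following is a minimal sketch, not the notebook's own evaluation: the twelve example emails are invented stand-ins for `data/emails.csv`, and the name `spam_pipeline` is hypothetical. It bundles the vectorizer and classifier into one scikit-learn `Pipeline` so that TF-IDF is refit inside each fold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for data/emails.csv.
texts = [
    "Your account has been suspended, click here to verify",
    "Limited offer!!! Buy now and get 50% discount",
    "Urgent: Update your password immediately",
    "Win a free prize, click here now",
    "Claim your reward before it expires today",
    "Cheap loans approved instantly, apply now",
    "Meeting scheduled for tomorrow at 10am",
    "Invoice attached for your recent purchase",
    "Lunch on Friday with the project team",
    "Please review the attached quarterly report",
    "Notes from the design review are in the shared folder",
    "Can we move the standup to the afternoon",
]
labels = ["spam"] * 6 + ["ham"] * 6

# Same vectorizer settings as the notebook cell, wrapped with the classifier.
spam_pipeline = make_pipeline(
    TfidfVectorizer(max_features=3000, ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)

# Stratified folds keep the spam/ham ratio the same in every fold.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(spam_pipeline, texts, labels, cv=cv, scoring="accuracy")
print("fold accuracies:", scores)

# Fit on all data, then score a new message as a gateway might.
spam_pipeline.fit(texts, labels)
spam_col = list(spam_pipeline.classes_).index("spam")
spam_prob = spam_pipeline.predict_proba(["Click here to claim your prize"])[0, spam_col]
print("spam probability:", spam_prob)
```

A gateway could compare `spam_prob` against a threshold chosen from validation data rather than relying on the default 0.5 cut-off.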


# Resume to Job Matching System

## Business Problem
Recruiters waste time manually screening resumes.

## Objective
Match resumes with job descriptions using TF-IDF vectors and cosine similarity.


In [11]:
resumes = pd.read_csv("data/resumes.csv")
jobs = pd.read_csv("data/jobs.csv")
resumes.head()
jobs.head()

| | description |
|---|---|
| 0 | Hiring machine learning engineer with Python a... |
| 1 | Looking for data analyst with SQL and BI tools |
| 2 | Backend developer with Java and cloud experience |
| 3 | AI engineer needed for deep learning projects |
| 4 | Hiring machine learning engineer with Python a... |


In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

all_text = pd.concat([resumes['text'], jobs['description']])

tfidf = TfidfVectorizer(max_features=2000)
tfidf_matrix = tfidf.fit_transform(all_text)

resume_vecs = tfidf_matrix[:len(resumes)]
job_vecs = tfidf_matrix[len(resumes):]

similarity_matrix = cosine_similarity(resume_vecs, job_vecs)
similarity_matrix[:3]


array([[0.67210318, 0.05796255, 0.37155057, 0.08229893, 0.67210318,
        0.05796255, 0.37155057, 0.08229893, 0.67210318, 0.05796255,
        0.37155057, 0.08229893, 0.67210318, 0.05796255, 0.37155057,
        0.08229893, 0.67210318, 0.05796255, 0.37155057, 0.08229893,
        0.67210318, 0.05796255, 0.37155057, 0.08229893, 0.67210318,
        0.05796255, 0.37155057, 0.08229893, 0.67210318, 0.05796255,
        0.37155057, 0.08229893, 0.67210318, 0.05796255, 0.37155057,
        0.08229893, 0.67210318, 0.05796255, 0.37155057, 0.08229893,
        0.67210318, 0.05796255, 0.37155057, 0.08229893, 0.67210318,
        0.05796255, 0.37155057, 0.08229893, 0.67210318, 0.05796255,
        0.37155057, 0.08229893, 0.67210318, 0.05796255, 0.37155057,
        0.08229893, 0.67210318, 0.05796255, 0.37155057, 0.08229893,
        0.67210318, 0.05796255, 0.37155057, 0.08229893, 0.67210318,
        0.05796255, 0.37155057, 0.08229893, 0.67210318, 0.05796255,
        0.37155057, 0.08229893, 0.67210318, ...

## Business Insight
This system can:
- Rank candidates automatically
- Reduce screening time by 60%+
- Improve recruiter efficiency

Use Case:
HR platforms, applicant tracking systems (ATS), internal hiring tools.
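
Ranking candidates for a job amounts to sorting one column of the resume-by-job similarity matrix. Here is a minimal sketch under stated assumptions: the helper name `top_candidates` and the small `sim` matrix are hypothetical, and rows are taken to be resumes and columns jobs, matching the `cosine_similarity(resume_vecs, job_vecs)` call above.

```python
import numpy as np


def top_candidates(similarity_matrix, job_idx, top_k=3):
    """Return (resume_index, score) pairs for one job, best match first."""
    scores = np.asarray(similarity_matrix)[:, job_idx]
    # argsort is ascending, so reverse it before taking the first top_k.
    order = np.argsort(scores)[::-1][:top_k]
    return [(int(i), float(scores[i])) for i in order]


# Hypothetical scores for 4 resumes (rows) against 2 jobs (columns).
sim = np.array([
    [0.67, 0.06],
    [0.10, 0.80],
    [0.37, 0.20],
    [0.05, 0.55],
])

print(top_candidates(sim, job_idx=0, top_k=2))
print(top_candidates(sim, job_idx=1, top_k=3))
```

In the notebook, `similarity_matrix` could be passed in place of `sim`, and the returned indices used with `resumes.iloc[...]` to recover the candidate rows.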


# Customer Feedback Sentiment Analysis

## Business Problem
Companies struggle to monitor customer satisfaction at scale.

## Objective
Analyze feedback to detect negative customer experiences early.


In [13]:
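feedback = pd.read_csv("data/feedback.csv")

# Rule-based labels: feedback containing any of these keywords is marked negative.
# Substring matching means 'crash' also matches 'crashes'.
NEGATIVE_KEYWORDS = ['terrible', 'late', 'slow', 'crash', 'complicated']

feedback['sentiment'] = feedback['feedback'].apply(
    lambda x: 'negative' if any(w in x.lower() for w in NEGATIVE_KEYWORDS) else 'positive'
)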

feedback.head()


| | feedback | sentiment |
|---|---|---|
| 0 | The delivery was very late and customer servic... | negative |
| 1 | Great product quality and fast shipping | positive |
| 2 | The app is slow and crashes frequently | negative |
| 3 | Excellent support team and quick response | positive |
| 4 | Terrible experience, will not buy again | negative |


In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

X = feedback['feedback']
y = feedback['sentiment']

tfidf = TfidfVectorizer(max_features=2000)
X_tfidf = tfidf.fit_transform(X)

model = LogisticRegression()
model.fit(X_tfidf, y)




## Business Insight
This model helps:
- Detect service issues early
- Prioritize customer complaints
- Reduce churn by proactive intervention
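
Prioritizing complaints, as listed above, could be done by ranking incoming feedback by the model's predicted probability of the negative class. The sketch below assumes a small invented training set in place of `data/feedback.csv`; the function name `prioritize` is hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labelled feedback standing in for data/feedback.csv.
train_texts = [
    "The delivery was very late and support was unhelpful",
    "The app is slow and crashes frequently",
    "Terrible experience, will not buy again",
    "Checkout is complicated and slow",
    "Great product quality and fast shipping",
    "Excellent support team and quick response",
    "Very happy with the purchase",
    "Smooth ordering and friendly service",
]
train_labels = ["negative"] * 4 + ["positive"] * 4

clf = make_pipeline(TfidfVectorizer(max_features=2000), LogisticRegression(max_iter=1000))
clf.fit(train_texts, train_labels)


def prioritize(messages, model):
    """Sort messages by predicted probability of being negative, highest first."""
    neg_col = list(model.classes_).index("negative")
    probs = model.predict_proba(messages)[:, neg_col]
    return sorted(zip(messages, probs), key=lambda pair: pair[1], reverse=True)


incoming = [
    "Fast shipping and great quality",
    "The app crashes and is terrible",
    "Order arrived late",
]
for message, prob in prioritize(incoming, clf):
    print(f"{prob:.2f}  {message}")
```

A support queue could then review messages from the top of this list first.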


# Knowledge Base Semantic Search

## Business Problem
Employees waste time searching documentation.

## Objective
Enable semantic document search using NLP similarity.


In [15]:
kb = pd.read_csv("data/kb_docs.csv")

tfidf = TfidfVectorizer(max_features=1000)
kb_vecs = tfidf.fit_transform(kb['document'])

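def search(query, top_k=3):
    """Return the top_k documents most similar to the query."""
    q_vec = tfidf.transform([query])
    sims = cosine_similarity(q_vec, kb_vecs)[0]
    # argsort is ascending, so take the last top_k indices and reverse them
    top_idx = sims.argsort()[-top_k:][::-1]
    return kb.iloc[top_idx]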

search("refund and return policy")


| | document |
|---|---|
| 0 | Refund policy allows returns within 30 days. |
| 3 | Data privacy is protected under company policy. |
| 4 | Shipping usually takes 3 to 5 business days. |

## Business Insight
This search tool reduces support load and internal search time, and it improves employee productivity and customer self-service.
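
The sample query above returns the privacy and shipping documents alongside the refund policy because `search` always returns `top_k` rows, whatever their scores. One possible refinement is a minimum similarity cut-off. This is a sketch only: the function name `search_with_threshold`, the `min_score` default of 0.1, and two of the five documents (marked in comments) are assumptions, not taken from `data/kb_docs.csv`.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy knowledge base; rows marked "hypothetical" are invented for illustration.
docs = pd.DataFrame({"document": [
    "Refund policy allows returns within 30 days.",
    "Password resets are handled through the IT portal.",  # hypothetical
    "Vacation requests need manager approval.",  # hypothetical
    "Data privacy is protected under company policy.",
    "Shipping usually takes 3 to 5 business days.",
]})

vectorizer = TfidfVectorizer(max_features=1000)
doc_vecs = vectorizer.fit_transform(docs["document"])


def search_with_threshold(query, top_k=3, min_score=0.1):
    """Return up to top_k documents whose cosine similarity is at least min_score."""
    q_vec = vectorizer.transform([query])
    sims = cosine_similarity(q_vec, doc_vecs)[0]
    top_idx = sims.argsort()[::-1][:top_k]
    # Drop candidates that share too little vocabulary with the query.
    keep = [int(i) for i in top_idx if sims[i] >= min_score]
    return docs.iloc[keep].assign(score=sims[keep])


result = search_with_threshold("refund and return policy")
print(result)
```

With this toy data the shipping document shares no terms with the query, so it is filtered out, while the privacy document still matches on the word "policy". A suitable `min_score` would need tuning on real queries.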


# Toxic & Fake Content Detection

## Business Problem
Toxic content harms brand reputation and user trust.

## Objective
Automatically detect toxic or misleading content.


In [16]:
toxic = pd.read_csv("data/toxic.csv")

X = toxic['text']
y = toxic['toxic']

tfidf = TfidfVectorizer(max_features=2000)
X_tfidf = tfidf.fit_transform(X)

model = LogisticRegression()
model.fit(X_tfidf, y)




## Business Insight
This system can be used for:
- Social media moderation
- Comment filtering
- Brand safety systems
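
The cell above fits the classifier on all rows and reports no metrics, so its accuracy on unseen content is unknown. A minimal sketch of a held-out evaluation follows, assuming invented example texts in place of `data/toxic.csv` and a binary 0/1 `toxic` label. The use of `class_weight="balanced"` is a suggestion for the case where toxic examples are the minority, not something the notebook does.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical examples standing in for data/toxic.csv (1 = toxic, 0 = not toxic).
texts = [
    "You are an idiot and nobody likes you",
    "This is the worst garbage, you fool",
    "Shut up, you stupid loser",
    "Get lost, you worthless idiot",
    "Thanks for the helpful explanation",
    "I really enjoyed this article",
    "Could you share the source for this claim",
    "Great point, I had not considered that",
    "Looking forward to the next update",
    "The tutorial was clear and easy to follow",
    "Congratulations on the release",
    "I disagree, but I see your reasoning",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# stratify keeps the toxic/non-toxic ratio the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

toxic_clf = make_pipeline(
    TfidfVectorizer(max_features=2000),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
toxic_clf.fit(X_train, y_train)

report = classification_report(y_test, toxic_clf.predict(X_test), zero_division=0)
print(report)
```

For moderation use cases, recall on the toxic class is often the number to watch, since missed toxic content is usually costlier than a false flag.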
