# Title: Spam Classification Using Naive Bayes with Parameter Optimization


#### Group Member Names: Harsh Patel



### INTRODUCTION: 
The rise in unwanted messages on social media and email has made spam detection increasingly critical. Spammers exploit platforms like Facebook, Twitter, and YouTube to disseminate malware, phishing links, and deceptive content. Spam, defined as unsolicited digital communication, poses significant risks to users' safety and platform security. Research highlights the use of text-based techniques, machine learning, and deep learning for spam identification while addressing challenges related to datasets and detection methods. Despite efforts, spam continues to grow, causing financial losses for both companies and individuals.
*********************************************************************************************************************
#### AIM : 
This project aims to classify spam messages using the Naive Bayes algorithm, an effective method for text classification tasks. The study compares two popular text vectorization techniques, CountVectorizer and TfidfVectorizer, to identify the more suitable approach. The main contribution is tuning the pipeline hyperparameters with GridSearchCV: the Naive Bayes smoothing parameter alpha and the vectorizer settings ngram_range and max_df. The results demonstrate a marked improvement in model performance through hyperparameter tuning, as measured by metrics such as accuracy, precision, recall, F1-score, and AUC.
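
To make the two vectorizer settings named above concrete, here is a minimal pure-Python sketch of what `ngram_range` and `max_df` control. It is an illustration only, not scikit-learn's implementation: the helper names `ngrams` and `build_vocabulary` and the toy messages are assumptions made for this example.

```python
from collections import Counter

def ngrams(tokens, ngram_range=(1, 2)):
    """Return all word n-grams with n between min_n and max_n, joined by spaces."""
    min_n, max_n = ngram_range
    return [
        " ".join(tokens[i:i + n])
        for n in range(min_n, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]

def build_vocabulary(docs, ngram_range=(1, 1), max_df=1.0):
    """Collect n-grams, dropping those found in more than max_df of the documents."""
    doc_freq = Counter()
    for doc in docs:
        # set() so that each term is counted once per document
        doc_freq.update(set(ngrams(doc.lower().split(), ngram_range)))
    limit = max_df * len(docs)
    return sorted(term for term, df in doc_freq.items() if df <= limit)

# Hypothetical toy messages, not taken from the project dataset
docs = ["free prize call now", "call me later", "free entry call now", "see you later"]

unigram_vocab = build_vocabulary(docs, ngram_range=(1, 1), max_df=1.0)
bigram_vocab = build_vocabulary(docs, ngram_range=(1, 2), max_df=0.5)

print(unigram_vocab)
print(bigram_vocab)
```

With `ngram_range=(1, 2)` the vocabulary gains word pairs such as "call now", and with `max_df=0.5` the word "call" is dropped because it appears in three of the four messages.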

*********************************************************************************************************************
#### GitHub Repo: https://github.com/Harshhpp/Harsh.Patel.BDAT1004PS1.git

*********************************************************************************************************************
#### DESCRIPTION OF PAPER:
The paper explores the challenge of spam email detection, focusing on the limitations of existing spam filters like Gmail's system. It highlights how legitimate emails can sometimes be misclassified as spam, and vice versa. The research investigates the storage of user data in browsers (Internet Explorer, Firefox, and Chrome on Windows 10) and uses machine learning algorithms, particularly the KN and NB models, to improve spam detection accuracy. The study also introduces two filtering models, Opinion Rank and Latent Dirichlet Allocation (LDA), to enhance the classification of spam emails. The research aims to offer better solutions for managing the growing volume of spam emails and to contribute to the development of more effective spam filtering techniques using machine learning.

*********************************************************************************************************************
#### PROBLEM STATEMENT :
Trust in social systems evolves over time as individuals gain life experience and engage with diverse social networks. However, current models often overlook the dynamics of trust and fail to account for multiple languages, which is crucial due to the global nature of social platforms. Additionally, linguistic spam may lead to the disregard of valuable information. Future research should focus on incorporating trust dynamics, multilingual data, and multimedia content to enhance the accuracy and applicability of trust models across various domains.

*********************************************************************************************************************
#### CONTEXT OF THE PROBLEM:
The context of the problem revolves around the evolving nature of trust in social systems, particularly within online platforms like social networks. As individuals interact with these platforms, their trust in the system fluctuates based on their experiences and social engagements. Traditional trust models often fail to account for this dynamic shift, as well as the multilingual and multimedia aspects of user-generated content. Given the global nature of social networks, these models typically focus on textual tags and user profiles, ignoring the value of other forms of communication (audio, visual) and the challenges posed by linguistic spam. The problem lies in the need for more effective trust modeling that incorporates these complexities, ensuring that trust dynamics are better understood and applied across different languages, cultures, and types of content.
*********************************************************************************************************************
#### SOLUTION: 
The solution involves implementing a text-based Machine Learning pipeline using Multinomial Naive Bayes for spam classification. By preprocessing text data and leveraging feature extraction techniques, the model identifies spam messages with high accuracy. To enhance performance, hyperparameter tuning and ROC-AUC evaluation ensure robust detection. This approach provides an efficient, scalable, and adaptable solution to the spam detection problem.


# Background
*********************************************************************************************************************


|Reference|Explanation|Dataset/Input|Weakness|
|------|------|------|------|



*********************************************************************************************************************






# Implement paper code :

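# Imports needed by the code in this notebook
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import (accuracy_score, classification_report, confusion_matrix,
                             roc_auc_score, roc_curve)

# Assumes df is already loaded with 'Category' and 'Message' columns, and that
# X_train, X_test, y_train, y_test come from a train/test split with spam encoded as 1 and ham as 0

# Chart - 1 Pie Chart Visualization Code For Distribution of Spam vs Ham Messages
spread = df['Category'].value_counts()
plt.rcParams['figure.figsize'] = (5,5)

# Set Labels
spread.plot(kind = 'pie', autopct='%1.2f%%', cmap='Set1')
plt.title('Distribution of Spam vs Ham')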

# Display the Chart
plt.show()
     
# Splitting Spam Messages
df_spam = df[df['Category']=='spam'].copy()

# Chart - 2 WordCloud Plot Visualization Code For Most Used Words in Spam Messages
# Create a String to Store All The Words
comment_words = ''

# Remove The Stopwords
stopwords = set(STOPWORDS)

# Iterate Through The Column
for val in df_spam.Message:

    # Typecast each value to a string, lowercase it, and split it into tokens
    tokens = str(val).lower().split()

    comment_words += " ".join(tokens)+" "

# Set Parameters
wordcloud = WordCloud(width = 1000, height = 500,
                background_color ='white',
                stopwords = stopwords,
                min_font_size = 10,
                max_words = 1000,
                colormap = 'gist_heat_r').generate(comment_words)

# Set Labels
plt.figure(figsize = (6,6), facecolor = None)
plt.title('Most Used Words In Spam Messages', fontsize = 15, pad=20)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)

# Display Chart
plt.show()
def evaluate_model(model, X_train, X_test, y_train, y_test):
    '''Fit the model on the training data and evaluate it on the train and test sets.

    Prints the train and test ROC AUC scores, plots both ROC curves, shows the
    train and test confusion matrices, and prints both classification reports.

    Returns the following scores as a list (precision, recall, and F1 are weighted averages):
    precision_train, precision_test, recall_train, recall_test, acc_train, acc_test,
    roc_auc_train, roc_auc_test, F1_train, F1_test
    '''

    # fit the model on the training data
    model.fit(X_train, y_train)

    # make predictions on the train and test data
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)
    pred_prob_train = model.predict_proba(X_train)[:,1]
    pred_prob_test = model.predict_proba(X_test)[:,1]

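    # calculate ROC AUC score from the predicted spam probabilities, not the hard labels,
    # so that it matches the ROC curves plotted below
    roc_auc_train = roc_auc_score(y_train, pred_prob_train)
    roc_auc_test = roc_auc_score(y_test, pred_prob_test)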
    print("\nTrain ROC AUC:", roc_auc_train)
    print("Test ROC AUC:", roc_auc_test)

    # plot the ROC curve
    fpr_train, tpr_train, thresholds_train = roc_curve(y_train, pred_prob_train)
    fpr_test, tpr_test, thresholds_test = roc_curve(y_test, pred_prob_test)
    plt.plot([0,1],[0,1],'k--')
    plt.plot(fpr_train, tpr_train, label="Train ROC AUC: {:.2f}".format(roc_auc_train))
    plt.plot(fpr_test, tpr_test, label="Test ROC AUC: {:.2f}".format(roc_auc_test))
    plt.legend()
    plt.title("ROC Curve")
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.show()

    # calculate confusion matrix
    cm_train = confusion_matrix(y_train, y_pred_train)
    cm_test = confusion_matrix(y_test, y_pred_test)

    fig, ax = plt.subplots(1, 2, figsize=(11,4))

    print("\nConfusion Matrix:")
    sns.heatmap(cm_train, annot=True, xticklabels=['Negative', 'Positive'], yticklabels=['Negative', 'Positive'], cmap="Oranges", fmt='.4g', ax=ax[0])
    ax[0].set_xlabel("Predicted Label")
    ax[0].set_ylabel("True Label")
    ax[0].set_title("Train Confusion Matrix")

    sns.heatmap(cm_test, annot=True, xticklabels=['Negative', 'Positive'], yticklabels=['Negative', 'Positive'], cmap="Oranges", fmt='.4g', ax=ax[1])
    ax[1].set_xlabel("Predicted Label")
    ax[1].set_ylabel("True Label")
    ax[1].set_title("Test Confusion Matrix")

    plt.tight_layout()
    plt.show()


    # calculate classification report
    cr_train = classification_report(y_train, y_pred_train, output_dict=True)
    cr_test = classification_report(y_test, y_pred_test, output_dict=True)
    print("\nTrain Classification Report:")
    crt = pd.DataFrame(cr_train).T
    print(crt.to_markdown())
    # sns.heatmap(pd.DataFrame(cr_train).T.iloc[:, :-1], annot=True, cmap="Blues")
    print("\nTest Classification Report:")
    crt2 = pd.DataFrame(cr_test).T
    print(crt2.to_markdown())
    # sns.heatmap(pd.DataFrame(cr_test).T.iloc[:, :-1], annot=True, cmap="Blues")


    precision_train = cr_train['weighted avg']['precision']
    precision_test = cr_test['weighted avg']['precision']

    recall_train = cr_train['weighted avg']['recall']
    recall_test = cr_test['weighted avg']['recall']

    acc_train = accuracy_score(y_true = y_train, y_pred = y_pred_train)
    acc_test = accuracy_score(y_true = y_test, y_pred = y_pred_test)

    F1_train = cr_train['weighted avg']['f1-score']
    F1_test = cr_test['weighted avg']['f1-score']

    model_score = [precision_train, precision_test, recall_train, recall_test, acc_train, acc_test, roc_auc_train, roc_auc_test, F1_train, F1_test ]
    return model_score

# Defining a function for the Email Spam Detection System
def detect_spam(email_text):
    # Uses the trained pipeline `clf`, which must already be defined and fitted
    # (load or train it before calling this function)

    # Make a prediction; predict returns an array, so take its first element
    prediction = clf.predict([email_text])[0]

    if prediction == 0:
        return "This is a Ham Email!"
    else:
        return "This is a Spam Email!"

# Example of how to use the function
sample_email = 'Free Tickets for IPL'
result = detect_spam(sample_email)
print(result)
*********************************************************************************************************************

### Contribution Code:
# Function to Evaluate Model
def evaluate_model(model, X_train, X_test, y_train, y_test):
    model.fit(X_train, y_train)
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)
    pred_prob_test = model.predict_proba(X_test)[:, 1]

    # Print evaluation metrics
    print("Classification Report (Test Set):")
    print(classification_report(y_test, y_pred_test))
    print("\nConfusion Matrix (Test Set):")
    cm = confusion_matrix(y_test, y_pred_test)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.title("Confusion Matrix")
    plt.xlabel("Predicted Label")
    plt.ylabel("True Label")
    plt.show()

    # Plot ROC Curve  
    fpr, tpr, _ = roc_curve(y_test, pred_prob_test)
    plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(y_test, pred_prob_test):.2f}")
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.title("ROC Curve")
    plt.legend()
    plt.show()

# 1: CountVectorizer with Default MultinomialNB
print("**Baseline Model: CountVectorizer + Default MultinomialNB**")
pipeline_count = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB())
])
evaluate_model(pipeline_count, X_train, X_test, y_train, y_test)
# 2: TfidfVectorizer with Default MultinomialNB
print("\n**Baseline Model: TfidfVectorizer + Default MultinomialNB**")
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer 
pipeline_tfidf = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', MultinomialNB())
])
evaluate_model(pipeline_tfidf, X_train, X_test, y_train, y_test)
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

pipeline_optimized = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', MultinomialNB())
])
# Define parameter grid
param_grid = {
    'vectorizer__ngram_range': [(1, 1), (1, 2)],  
    'vectorizer__max_df': [0.75, 1.0],           
    'classifier__alpha': [0.1, 0.5, 1.0]         
}
# Perform Grid Search
grid_search = GridSearchCV(pipeline_optimized, param_grid, cv=3, scoring='roc_auc', verbose=2)
grid_search.fit(X_train, y_train)
# Best Parameters and Score
print("\nBest Parameters:", grid_search.best_params_)
print("Best Cross-Validation AUC Score:", grid_search.best_score_)
# Evaluate Best Model
best_model = grid_search.best_estimator_
evaluate_model(best_model, X_train, X_test, y_train, y_test)
# ML Model - 1 Implementation
# Create a machine learning pipeline using scikit-learn, combining text vectorization (CountVectorizer)
# and a Multinomial Naive Bayes classifier for email spam detection.
clf = Pipeline([
    ('vectorizer', CountVectorizer()),  # Step 1: Text data transformation
    ('nb', MultinomialNB())  # Step 2: Classification using Naive Bayes
])

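# Model is trained (fit) and evaluated inside evaluate_model
# Visualizing evaluation Metric Score chart
# Note: the contribution version of evaluate_model defined above returns None, so
# MultinomialNB_score only holds the score list when the paper version is the one in scope
MultinomialNB_score = evaluate_model(clf, X_train, X_test, y_train, y_test)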

### Results :
Model Performance: Baseline models should perform relatively well, with TfidfVectorizer-based models consistently outperforming CountVectorizer-based models.
AUC Score: Optimizing the model using grid search should result in a higher AUC score, indicating that the model ranks spam messages above ham messages more reliably.
Confusion Matrix: A good model should have a high True Positive and True Negative with a low False Positive and False Negative.
ROC Curve: An ideal model will have a high AUC, indicating that it can distinguish between classes successfully.
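
The link between the confusion matrix and the reported metrics can be sketched with a few lines of plain Python. The counts below are hypothetical and are not results from this project. The helper `binary_metrics` is an illustrative name, and its argument order follows the `[[TN, FP], [FN, TP]]` layout that scikit-learn's `confusion_matrix` produces for labels 0 (ham) and 1 (spam).

```python
def binary_metrics(tn, fp, fn, tp):
    """Derive accuracy, precision, recall, and F1 for the positive (spam) class."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical counts for illustration only
metrics = binary_metrics(tn=950, fp=10, fn=20, tp=120)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With an imbalanced dataset like this one, accuracy stays high even when recall on the spam class is noticeably lower, which is why precision, recall, and F1 are reported alongside it.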

*******************************************************************************************************************************


#### Observations :
Model Performance: Using the evaluation metrics, determine how well the model is performing. 
Feature Engineering Influence: Consider the influence of feature extraction approaches (for example, CountVectorizer vs. TfidfVectorizer).
Hyperparameter Tuning: The grid search can help locate the optimum hyperparameters for fine-tuning the model. It could demonstrate that tweaking n-grams or alpha (for Naive Bayes) considerably increases performance.
Cross-validation and Overfitting: Evaluate performance on both the training and test sets to see if there is a substantial gap (overfitting) or consistent performance (good generalization).
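
To show what the smoothing parameter alpha changes, the sketch below computes additive (Laplace/Lidstone) smoothed word likelihoods for one class, using the standard formula (count + alpha) / (total + alpha * vocabulary size). The function name `smoothed_likelihoods`, the toy messages, and the vocabulary are assumptions for illustration.

```python
from collections import Counter

def smoothed_likelihoods(class_docs, vocabulary, alpha=1.0):
    """Estimate P(word | class) with additive smoothing over a fixed vocabulary."""
    counts = Counter(word for doc in class_docs for word in doc.split())
    total = sum(counts[word] for word in vocabulary)
    denominator = total + alpha * len(vocabulary)
    return {word: (counts[word] + alpha) / denominator for word in vocabulary}

# Hypothetical spam messages and vocabulary
spam_docs = ["free prize now", "free entry"]
vocabulary = ["entry", "free", "later", "now", "prize"]

strong = smoothed_likelihoods(spam_docs, vocabulary, alpha=1.0)
weak = smoothed_likelihoods(spam_docs, vocabulary, alpha=0.1)

# "later" never occurs in the spam messages, yet it keeps a non-zero probability
print(f"alpha=1.0: free={strong['free']:.3f}, later={strong['later']:.3f}")
print(f"alpha=0.1: free={weak['free']:.3f}, later={weak['later']:.3f}")
```

A smaller alpha trusts the observed counts more and gives unseen words a much lower probability, which is why it is a natural value to include in a grid search.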

*******************************************************************************************************************************


### Conclusion and Future Direction:
This study implemented and evaluated Naive Bayes classifiers for text classification, comparing two vectorization approaches (CountVectorizer and TfidfVectorizer). The models were evaluated using key measures such as precision, recall, F1 score, AUC score, and confusion matrices. The results show that TfidfVectorizer with Naive Bayes outperforms CountVectorizer in terms of AUC scores, and that grid search optimization improves performance further. The tuned model improves classification accuracy and is well suited to text classification tasks such as spam detection or sentiment analysis.
*******************************************************************************************************************************
#### Learnings :
Understanding Text Representation: Learn how alternative vectorization strategies (CountVectorizer vs TfidfVectorizer) affect text classification performance.
Model Optimization: The grid search demonstrates the value of hyperparameter adjustment and model optimization in improving model performance.
Model Evaluation: How metrics such as AUC, accuracy, recall, and confusion matrices aid in assessing model performance and the capacity to generalize effectively to previously unseen data.
Practical Application: How the Naive Bayes classifier works well for text classification problems, especially when combined with the appropriate preprocessing steps.
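
One way to read AUC is as the probability that a randomly chosen spam message receives a higher score than a randomly chosen ham message, with ties counted as half. The sketch below computes it that way on hypothetical scores and shows that thresholding the probabilities into hard 0/1 labels first discards ranking information. The function name `pairwise_auc` and the data are assumptions for illustration.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (spam, ham) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

# Hypothetical labels (1 = spam) and predicted spam probabilities
y_true = [0, 0, 0, 1, 1, 1]
probabilities = [0.05, 0.20, 0.60, 0.40, 0.70, 0.90]
hard_labels = [1 if p >= 0.5 else 0 for p in probabilities]

auc_from_probabilities = pairwise_auc(y_true, probabilities)
auc_from_hard_labels = pairwise_auc(y_true, hard_labels)

print(round(auc_from_probabilities, 3))  # 0.889
print(round(auc_from_hard_labels, 3))    # 0.667
```

The same model looks worse when scored on hard labels, so AUC is best computed from predicted probabilities.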

*******************************************************************************************************************************
#### Results Discussion :
The model evaluation results showed that TfidfVectorizer + MultinomialNB outperformed CountVectorizer + MultinomialNB in terms of precision, recall, and AUC scores. This suggests that adding more weight to key phrases in the document, like Tfidf does, improves the model's ability to distinguish between classes. The grid search optimization also improved the model's performance by altering parameters like the n-gram range and smoothing factor (alpha). The confusion matrix and ROC curves offered visual information about the model's ability to accurately classify, with the optimized model displaying fewer misclassifications.
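
The weighting effect described above can be illustrated by computing inverse document frequency by hand. The sketch uses the smoothed form idf(t) = ln((1 + n) / (1 + df(t))) + 1, which to my understanding is the default in scikit-learn's TfidfVectorizer. The function name `smoothed_idf` and the toy messages are assumptions, and the final L2 row normalization is omitted.

```python
import math

def smoothed_idf(docs):
    """Return the smoothed inverse document frequency of every term in docs."""
    n_docs = len(docs)
    doc_freq = {}
    for doc in docs:
        # set() so that each term is counted once per document
        for term in set(doc.lower().split()):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1
        for term, df in doc_freq.items()
    }

# Hypothetical toy messages
docs = ["free prize call now", "call me later", "call you now"]
idf = smoothed_idf(docs)

for term in ("call", "now", "prize"):
    print(f"{term}: {idf[term]:.3f}")
```

A word that appears in every message ("call") gets the minimum weight of 1, while a rare word ("prize") gets a larger weight, which is the effect credited above for the better class separation.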

*******************************************************************************************************************************
#### Limitations :
Data Dependency: The caliber and volume of the training data have a significant impact on the models' performance. Poor generalization to unknown data or overfitting might result from incomplete or biased data.
Interpretability: Although Naive Bayes classifiers are easy to interpret, deep learning models or ensemble approaches may be better at identifying intricate patterns in the data.
Text Preprocessing: Using more sophisticated preprocessing methods (such as lemmatization, stop word removal, etc.) may enhance the models' performance.
Multilingual Handling: The models' efficacy in a worldwide setting may be limited because they fail to take into consideration the dataset's linguistic diversity or various languages.
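
As a rough idea of what an extra preprocessing step could look like, the sketch below lowercases a message, replaces URLs and numbers with placeholder tokens, and removes stop words. The stop word list, the placeholder names `urltoken` and `numtoken`, and the function `preprocess` are all assumptions for illustration. Lemmatization is not covered here because it would need an NLP library.

```python
import re

# Deliberately tiny, hypothetical stop word list; a real one would be much longer
STOP_WORDS = {"a", "an", "the", "to", "is", "you", "for", "and", "of"}

def preprocess(text):
    """Lowercase, replace URLs and numbers with placeholders, and drop stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " urltoken ", text)  # collapse every URL to one token
    text = re.sub(r"\d+", " numtoken ", text)           # collapse every number to one token
    tokens = re.findall(r"[a-z]+", text)                # keep alphabetic tokens only
    return [token for token in tokens if token not in STOP_WORDS]

sample = "WIN a FREE prize! Call 0800123 or visit http://example.com to claim"
cleaned = preprocess(sample)
print(cleaned)
```

A function like this could be passed to a vectorizer as a custom tokenizer, so the step would be tuned and evaluated together with the rest of the pipeline.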




*******************************************************************************************************************************
#### Future Extension :
Deep Learning Models: Going beyond conventional models, future research may use deep learning methods for more complex text understanding and categorization, such as transformers, recurrent neural networks (RNNs), or BERT.
Multilingual Models: Considering the worldwide scope of social networks, it may be beneficial to integrate multilingual text data and investigate multilingual vectorization techniques such as multilingual BERT.
Reducing Noise: The robustness and performance of the model would be enhanced by using improved methods for removing spam or noise from the textual input.
Domain-Specific Tuning: Improving classification accuracy in particular domains by fine-tuning the models for those domains (such as legal papers or medical texts).



# References:

[1] https://www.kaggle.com/code/anaghakp/email-spam-detection/notebook