# **TASK 2: TEXT CLASSIFICATION WITH NAIVE BAYES**

---



### **Description:**

Develop a text classification model using the Naive Bayes algorithm to classify text documents into
predefined categories (e.g., spam detection, sentiment analysis). Preprocess the text data, train the
model, and assess its performance using evaluation metrics.

### **Requirements:**
* Python code for text classification.
* Trained Model.
* Evaluation metrics (accuracy, precision, recall).
* A brief analysis report.

To develop a text classification model using the Naive Bayes algorithm, we'll work step by step through a small spam detection example. The steps include:

1. Data Preprocessing: Cleaning and tokenizing the text, then converting it to numerical form using TF-IDF.
2. Model Training: Fitting a Multinomial Naive Bayes classifier.
3. Evaluation Metrics: Calculating accuracy, precision, and recall.

Here’s the Python code for the project:

# **Step 1: Install Required Libraries**

In [1]:
!pip install scikit-learn pandas numpy nltk



# **Step 2: Import Libraries and Load Data**

In [20]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report
import joblib
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import re

# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')

# Example dataset
data = {
    'text': ["Win a free iPhone now!", "Your order is confirmed", "Congratulations, you've won!",
             "Meeting at 5 PM", "Free vacation offer!", "Project deadline is tomorrow"],
    'label': [1, 0, 1, 0, 1, 0]  # 1 = spam, 0 = not spam
}

# Convert to DataFrame
df = pd.DataFrame(data)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# **Step 3: Text Preprocessing**

In [21]:
def preprocess_text(text):
    # Remove special characters and digits
    text = re.sub(r'\W', ' ', text)
    text = re.sub(r'\d', ' ', text)

    # Convert to lowercase
    text = text.lower()

    # Tokenize
    tokens = word_tokenize(text)

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]

    return ' '.join(tokens)

# Apply preprocessing to all text data
df['cleaned_text'] = df['text'].apply(preprocess_text)
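
To see what the cleaning steps do to individual messages, here is a minimal, dependency-free sketch of the same sequence (strip non-word characters and digits, lowercase, tokenize, drop stop words). It is an illustration only: `preprocess_demo` and `DEMO_STOP_WORDS` are hypothetical names, whitespace splitting stands in for `word_tokenize`, and the tiny hand-picked stop-word set stands in for NLTK's much longer English list.

```python
import re

# Hypothetical, hand-picked stop words for illustration only.
# NLTK's English stop-word list is much longer.
DEMO_STOP_WORDS = {"a", "is", "at", "you", "ve", "your", "now"}

def preprocess_demo(text, stop_words=DEMO_STOP_WORDS):
    # Replace non-word characters and digits with spaces
    text = re.sub(r"\W", " ", text)
    text = re.sub(r"\d", " ", text)
    # Lowercase and split on whitespace (a stand-in for word_tokenize)
    tokens = text.lower().split()
    # Drop stop words and rejoin into a single string
    return " ".join(t for t in tokens if t not in stop_words)

print(preprocess_demo("Win a free iPhone now!"))  # win free iphone
print(preprocess_demo("Meeting at 5 PM"))         # meeting pm
```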

# **Step 4: Split Data into Training and Testing Sets**

In [22]:
# Split data into features (preprocessed text) and labels
X = df['cleaned_text']
y = df['label']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english')

# Transform the text data into TF-IDF vectors
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

# **Step 5: Train the Naive Bayes Model**

In [23]:
# Initialize and train the Naive Bayes model
nb_model = MultinomialNB()
nb_model.fit(X_train_tfidf, y_train)
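
For intuition about what the classifier computes, here is a minimal from-scratch sketch of Multinomial Naive Bayes with Laplace (add-one) smoothing. It is not scikit-learn's implementation: it works on raw word counts rather than TF-IDF weights, and `train_nb`, `predict_nb`, and the toy documents are hypothetical names and data chosen for illustration.

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from word counts."""
    vocab = sorted({w for d in docs for w in d.split()})
    log_prior, log_likelihood = {}, {}
    for c in sorted(set(labels)):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        # log P(class): fraction of training documents in this class
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(w for d in class_docs for w in d.split())
        total = sum(counts.values()) + alpha * len(vocab)
        # log P(word | class) with add-alpha smoothing
        log_likelihood[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_likelihood

def predict_nb(doc, log_prior, log_likelihood):
    """Return the class with the highest log posterior score."""
    scores = {}
    for c in log_prior:
        # Words outside the training vocabulary are skipped
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in doc.split() if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)

# Toy training data (1 = spam, 0 = not spam)
docs = ["win free iphone", "order confirmed", "free vacation offer", "meeting pm"]
labels = [1, 0, 1, 0]
log_prior, log_likelihood = train_nb(docs, labels)

print(predict_nb("free offer", log_prior, log_likelihood))         # 1
print(predict_nb("meeting confirmed", log_prior, log_likelihood))  # 0
```

Smoothing matters here because a word that never appears in a class would otherwise get probability zero and veto that class outright.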

# **Step 6: Make Predictions and Evaluate the Model**

In [24]:
# Predict on test data
y_pred = nb_model.predict(X_test_tfidf)

# Compute accuracy, precision, and recall
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

print(f'Accuracy: {accuracy:.4f}')
print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')

# Generate classification report
print(classification_report(y_test, y_pred))

# Save the trained model and vectorizer
joblib.dump(nb_model, 'naive_bayes_text_classifier.pkl')
joblib.dump(tfidf_vectorizer, 'tfidf_vectorizer.pkl')

Accuracy: 1.0000
Precision: 1.0000
Recall: 1.0000
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         1
           1       1.00      1.00      1.00         1

    accuracy                           1.00         2
   macro avg       1.00      1.00      1.00         2
weighted avg       1.00      1.00      1.00         2



['tfidf_vectorizer.pkl']
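
Both saved files are needed at prediction time, because new text must go through the same fitted vectorizer before it reaches the model. The following self-contained sketch shows one way to reload and reuse them. It retrains a small model on the example messages and writes to a temporary directory so it can run on its own, and the two new messages are made up for illustration.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["Win a free iPhone now!", "Your order is confirmed", "Congratulations, you've won!",
         "Meeting at 5 PM", "Free vacation offer!", "Project deadline is tomorrow"]
labels = [1, 0, 1, 0, 1, 0]  # 1 = spam, 0 = not spam

# Fit a vectorizer and model so the example is self-contained
vectorizer = TfidfVectorizer(stop_words="english")
model = MultinomialNB().fit(vectorizer.fit_transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "naive_bayes_text_classifier.pkl")
    vectorizer_path = os.path.join(tmp, "tfidf_vectorizer.pkl")
    joblib.dump(model, model_path)
    joblib.dump(vectorizer, vectorizer_path)

    # Later (for example, in another script): reload both artifacts
    loaded_model = joblib.load(model_path)
    loaded_vectorizer = joblib.load(vectorizer_path)

# Made-up messages to classify
new_messages = ["Free prize waiting for you", "See you at the meeting"]
preds_before = model.predict(vectorizer.transform(new_messages))
preds_after = loaded_model.predict(loaded_vectorizer.transform(new_messages))
print(list(preds_after))
```

The reloaded pair should give the same predictions as the original objects, which is a quick sanity check after saving.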

# **Step 7: Analysis Report (Brief)**
Report:

1. Dataset: Six short example messages labeled as spam (1) or not spam (0), a binary classification task.
2. Model: Multinomial Naive Bayes trained on TF-IDF features.
3. Accuracy: The model scores 100% on the test split, but that split contains only two messages, so the figure is not a reliable estimate of real-world performance.
4. Precision: The fraction of messages predicted as spam that are actually spam.
5. Recall: The fraction of actual spam messages that the model identifies.

Naive Bayes is a reasonable baseline for simple text classification tasks like spam detection. More complex models like Support Vector Machines or Neural Networks may be needed for more nuanced tasks.
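
To make the three metrics concrete, here is a small sketch that computes them directly from true and predicted labels. The `binary_metrics` helper and the label lists are hypothetical and are not results from this notebook.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, and recall for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # spam caught
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false alarms
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # spam missed
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Hypothetical labels for illustration (1 = spam, 0 = not spam)
y_true_demo = [1, 0, 1, 1, 0, 0]
y_pred_demo = [1, 0, 0, 1, 1, 0]

acc, prec, rec = binary_metrics(y_true_demo, y_pred_demo)
print(f"Accuracy: {acc:.4f}")    # Accuracy: 0.6667
print(f"Precision: {prec:.4f}")  # Precision: 0.6667
print(f"Recall: {rec:.4f}")      # Recall: 0.6667
```

In this example there are two true positives, one false positive, and one false negative, so precision and recall are both 2/3.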