**Practical 05**

**Aim: Text Categorization**
*   Implement a text classification algorithm (e.g., Naive Bayes or Support Vector Machines).
*   Train the classifier on a labelled dataset and evaluate its performance.


In [5]:
#Step 1: Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

In [6]:
#Step 2: Load the training dataset
#Reads the CSV file that contains training data.
df = pd.read_csv("/content/Dataset - Dataset.csv")

The dataset is assumed to contain three columns:

covid -> text describing COVID-related symptoms
fever -> text describing fever symptoms
flu -> target label (Yes/No)
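
The CSV itself is not shown in this practical, so the following is only a sketch of the layout these column descriptions imply. The rows and the name `toy_df` are invented for illustration and are not taken from the actual dataset.

```python
import pandas as pd

# Hypothetical rows, invented for illustration only.
toy_df = pd.DataFrame({
    "covid": ["dry cough", "runny nose", "loss of smell"],
    "fever": ["high fever", "normal temperature", "mild fever"],
    "flu": ["Yes", "No", "Yes"],
})

# Check that the columns used in the later steps are present.
required = {"covid", "fever", "flu"}
missing = required - set(toy_df.columns)
print("Missing columns:", missing if missing else "none")
print(toy_df.head())
```

Running the same column check on the real `df` right after loading it would catch a misnamed header before the later steps fail.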

In [7]:
#Step 3: Combine text columns
data = df["covid"] + " " + df["fever"]
X = data.astype(str)   # Features (text)
y = df["flu"]          # Labels

data = df["covid"] + " " + df["fever"]
Joins the covid and fever symptom text of each row into a single string, separated by a space.

X = data.astype(str)
Casts every value to a string so that the vectorizer receives text only.

y = df["flu"]
Stores the label (Yes/No) that the classifier learns to predict.
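
The combining step can be tried on a tiny table to see what the joined text looks like. This is a minimal sketch with invented rows. Filling empty cells with `fillna("")` before joining is an assumption added here and is not part of the practical's code.

```python
import pandas as pd

# Hypothetical rows; the second row has no covid text.
toy = pd.DataFrame({
    "covid": ["dry cough", None],
    "fever": ["high fever", "normal temperature"],
})

# Replace missing cells with an empty string, join the two columns,
# then strip the stray space left behind by an empty cell.
combined = (toy["covid"].fillna("") + " " + toy["fever"].fillna("")).str.strip()
print(combined.tolist())
```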

In [8]:
#Step 4: Split data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

#80% data for learning, 20% data for testing accuracy.

This step splits the data into two parts:
- one part for learning (training)
- one part for checking (testing)

train_test_split()
A scikit-learn function that shuffles the data and divides it into training and test subsets.
Think of it as studying from a question bank while holding back some questions for the exam.

X
X = input data
In this practical: combined text (covid + fever symptoms)
Example:

"dry cough high fever"
"runny nose normal temperature"

y
y = output labels
In this practical: flu = Yes or No
Example:
Yes
No

test_size=0.2
20% data -> test data
80% data -> training data

Example:
If you have 10 patients:
8 patients -> learn from
2 patients -> check diagnosis accuracy


random_state=42
Fixes the seed used for the random shuffle, so every run of the program produces the same split.

Example:
Shuffling a deck of cards in exactly the same order every time.



X_train
- Text used for training
- Example: symptoms to learn from

X_test
- Text used for testing
- Example: symptoms to check accuracy

y_train
- Correct answers for training data

y_test
- Correct answers for test data
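
The effect of `test_size` and `random_state` described above can be checked on a small example. This is a sketch with ten invented records. The names `texts`, `labels`, `a_train`, and so on are hypothetical.

```python
from sklearn.model_selection import train_test_split

# Ten hypothetical patient records and labels, invented for illustration.
texts = [f"patient {i} symptoms" for i in range(10)]
labels = ["Yes", "No"] * 5

# Two calls with the same random_state should give the same split.
a_train, a_test, a_ytrain, a_ytest = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)
b_train, b_test, b_ytrain, b_ytest = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)

print(len(a_train), len(a_test))  # sizes of the training and test parts
print(a_test == b_test)           # whether both calls chose the same test rows
```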

In [9]:
#Step 5: Convert text into numbers (Bag of Words)

vectorizer = CountVectorizer()

X_train_counts = vectorizer.fit_transform(X_train)
X_test_counts = vectorizer.transform(X_test)

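vectorizer = CountVectorizer()
Creates a vectorizer that counts how often each word occurs in a text.

X_train_counts = vectorizer.fit_transform(X_train)
Learns the vocabulary from the training text and converts each entry into a vector of word counts.

X_test_counts = vectorizer.transform(X_test)
Converts the test text using the vocabulary learned from the training data; words not seen during training are ignored.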

In [10]:
#Step 6: Train Naive Bayes Classifier

classifier = MultinomialNB()
classifier.fit(X_train_counts, y_train)

classifier = MultinomialNB()
Creates a Multinomial Naive Bayes classifier, a variant suited to word-count features.

classifier.fit(X_train_counts, y_train)
Trains the model on the training word counts and their labels.
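
The training step can be illustrated on a handful of invented sentences. This is a minimal sketch, not the practical's data. The names `texts`, `labels`, `vec`, and `model` are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented training sentences and labels.
texts = [
    "dry cough high fever",
    "body ache high fever",
    "runny nose normal temperature",
    "sneezing normal temperature",
]
labels = ["Yes", "Yes", "No", "No"]

vec = CountVectorizer()
counts = vec.fit_transform(texts)

# The default alpha=1.0 applies additive (Laplace) smoothing, so a word
# missing from one class does not force that class probability to zero.
model = MultinomialNB()
model.fit(counts, labels)

new_counts = vec.transform(["high fever"])
proba = model.predict_proba(new_counts)
print(model.classes_)             # class order used by predict_proba
print(model.predict(new_counts))  # predicted label for the new text
print(proba.round(3))             # probability for each class
```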

In [11]:
#Step 7: Evaluate the model

y_pred = classifier.predict(X_test_counts)
#Predicts output for test data.

accuracy = accuracy_score(y_test, y_pred)
#Computes the fraction of test predictions that match the true labels.
print(f"Accuracy: {accuracy:.2f}")
#Displays accuracy value.

print("Classification Report:")
print(classification_report(y_test, y_pred))
#Shows precision, recall, and F1-score.

Accuracy: 1.00
Classification Report:
              precision    recall  f1-score   support

          No       1.00      1.00      1.00         1
         Yes       1.00      1.00      1.00         2

    accuracy                           1.00         3
   macro avg       1.00      1.00      1.00         3
weighted avg       1.00      1.00      1.00         3
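
The report above rests on only three test samples, so a perfect score says little about how the model behaves on other data. To show where precision, recall, and F1 come from, the sketch below recomputes them by hand from a confusion matrix. The six labels are invented for illustration and are not from the practical's dataset.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Invented true labels and predictions.
y_true = ["Yes", "Yes", "No", "Yes", "No", "No"]
y_hat = ["Yes", "No", "No", "Yes", "Yes", "No"]

# Rows are true classes, columns are predicted classes, in the order No, Yes.
cm = confusion_matrix(y_true, y_hat, labels=["No", "Yes"])
tn, fp, fn, tp = cm.ravel()

# Metrics for the "Yes" class, computed from the four counts.
precision = tp / (tp + fp)  # of everything predicted Yes, how much was Yes
recall = tp / (tp + fn)     # of everything truly Yes, how much was found
f1 = 2 * precision * recall / (precision + recall)

print(cm)
print(precision, recall, f1)
# Cross-check against scikit-learn's own functions.
print(precision_score(y_true, y_hat, pos_label="Yes"))
print(recall_score(y_true, y_hat, pos_label="Yes"))
print(f1_score(y_true, y_hat, pos_label="Yes"))
```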



In [12]:
#Step 8: Test on new (unseen) dataset

data1 = pd.read_csv("/content/Test - Test.csv")
#Loads new unseen data for prediction.

new_data = data1["covid"] + " " + data1["fever"]
#Combines symptoms into single text.
new_data_counts = vectorizer.transform(new_data.astype(str))
#Converts new text into numeric form.

predictions = classifier.predict(new_data_counts)
#Predicts whether flu = Yes or No.
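
New text must pass through the same fitted vectorizer before it reaches the classifier. One alternative, not used in this practical, is to bundle both steps with scikit-learn's `make_pipeline` so they cannot get out of sync. The sketch below uses invented sentences, and all variable names are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented training sentences and labels.
train_texts = [
    "dry cough high fever",
    "body ache high fever",
    "runny nose normal temperature",
    "sneezing normal temperature",
]
train_labels = ["Yes", "Yes", "No", "No"]

# The pipeline vectorizes and classifies in a single object.
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
pipeline.fit(train_texts, train_labels)

# Raw text goes in directly; the pipeline applies transform() itself.
unseen = ["high fever dry cough", "sneezing runny nose"]
preds = pipeline.predict(unseen)
print(list(preds))
```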

In [13]:
#Step 9: Save predictions to CSV

predictions_df = pd.DataFrame(predictions, columns=["flu_prediction"])
#Converts predictions into table format.

data1 = pd.concat([data1, predictions_df], axis=1)
#Adds predictions as a new column.

data1.to_csv(r"Downloads\Test_output.csv", index=False)
#Saves results to a new CSV file. The raw string (r"...") keeps the backslash literal instead of treating \T as an escape sequence.

print("Predictions saved successfully!")

Predictions saved successfully!
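
The saving step can also be written with `pathlib`, which builds file paths without backslashes inside string literals. This is a sketch under assumptions: the rows, the stand-in predictions, and the temporary output folder are all invented so the example runs anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical unseen rows and stand-in predictions.
new_rows = pd.DataFrame({
    "covid": ["dry cough", "runny nose"],
    "fever": ["high fever", "normal temperature"],
})
preds = ["Yes", "No"]  # stands in for classifier.predict(...)

# assign() adds the predictions as a new column in one step.
result = new_rows.assign(flu_prediction=preds)

# Build the output path and create the folder if it does not exist.
out_dir = Path(tempfile.mkdtemp()) / "output"
out_dir.mkdir(parents=True, exist_ok=True)
out_path = out_dir / "Test_output.csv"
result.to_csv(out_path, index=False)

# Read the file back to confirm what was written.
check = pd.read_csv(out_path)
print(check)
```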


