# **Capstone Project: Multi-Class Text Classification**
Python 3.11 | End-to-End Machine Learning Pipeline


## **1. Problem Statement**
Build a machine learning model that assigns each input text to one of several categories (here: tech, sports, and politics). The pipeline covers text preprocessing, TF-IDF vectorization, model training, evaluation, and prediction on new text.

## **2. Import Dependencies**

In [ ]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score
import matplotlib.pyplot as plt

## **3. Load Dataset**

In [ ]:
# Example dataset — replace with your real dataset
data = pd.DataFrame({
    'text': [
        "Python is great for data science",
        "Football is a popular sport",
        "Elections impact government policies",
        "Machine learning improves automation",
        "Basketball requires agility and speed"
    ],
    'label': ['tech', 'sports', 'politics', 'tech', 'sports']
})
data.head()

## **4. Text Preprocessing**

In [ ]:
import re

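def preprocess(text):
    """Lowercase the text and drop every character that is not a letter or a space."""
    text = text.lower()
    # Digits and punctuation are removed outright, not replaced with a space
    text = re.sub(r'[^a-zA-Z ]', '', text)
    return text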

data['clean_text'] = data['text'].apply(preprocess)
data.head()
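
To see what the cleaning step does, here is a small worked example. The sample sentence is made up for illustration and is not part of the dataset. Because characters are deleted rather than replaced with a space, hyphenated words merge and removed digits leave a double space behind.

```python
import re

def preprocess(text):
    # Same cleaning rule as the notebook cell above
    text = text.lower()
    text = re.sub(r'[^a-zA-Z ]', '', text)
    return text

sample = "AI-driven policy: 3 new rules!"  # hypothetical input
cleaned = preprocess(sample)
print(repr(cleaned))  # 'aidriven policy  new rules'
```

If merged tokens such as `aidriven` are a problem for your data, one option is to substitute a space instead of an empty string and collapse repeated whitespace afterwards.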

## **5. Train / Test Split**

In [ ]:
X = data['clean_text']
y = data['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
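
A quick check of the split sizes is worth doing on a dataset this small. The sketch below repeats the split on plain lists holding the same five example rows: with `test_size=0.2`, only one sample is held out, so the evaluation further down rests on a single prediction.

```python
from sklearn.model_selection import train_test_split

texts = [
    "python is great for data science",
    "football is a popular sport",
    "elections impact government policies",
    "machine learning improves automation",
    "basketball requires agility and speed",
]
labels = ["tech", "sports", "politics", "tech", "sports"]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))  # 4 1
```

With a real dataset, passing `stratify=y` keeps class proportions similar across both sets. It needs at least two samples per class, so it would not work on this toy data, where `politics` appears only once.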

## **6. Vectorization (TF-IDF)**

In [ ]:
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

## **7. Train Model (Logistic Regression)**

In [ ]:
model = LogisticRegression(max_iter=1000)
model.fit(X_train_tfidf, y_train)

## **8. Evaluation**

In [ ]:
y_pred = model.predict(X_test_tfidf)
print("Accuracy:", accuracy_score(y_test, y_pred))
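# zero_division=0 reports 0 instead of emitting a warning when a class has no predicted samples
print(classification_report(y_test, y_pred, zero_division=0))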
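
A confusion matrix complements the report by showing which classes get mistaken for which. The label lists below are invented purely to illustrate the layout and are not results from this notebook. Rows are true labels and columns are predicted labels, in the order given by `labels`.

```python
from sklearn.metrics import confusion_matrix

labels = ["politics", "sports", "tech"]

# Hypothetical true and predicted labels, for illustration only
y_true = ["tech", "sports", "politics", "tech", "sports"]
y_hat = ["tech", "sports", "tech", "tech", "politics"]

cm = confusion_matrix(y_true, y_hat, labels=labels)
print(cm)
# [[0 0 1]
#  [1 1 0]
#  [0 0 2]]
```

Reading the first row: the single `politics` example was predicted as `tech`.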

## **9. Predict on New Text**

In [ ]:
def predict(text):
    cleaned = preprocess(text)
    vect = vectorizer.transform([cleaned])
    return model.predict(vect)[0]

predict("The government introduced a new AI policy")
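
Beyond the single predicted label, `predict_proba` exposes the model's estimated probability for each class. This is a self-contained sketch that refits on the five example rows; the helper name `predict_with_scores` is an assumption introduced here, not part of the notebook.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def preprocess(text):
    return re.sub(r'[^a-zA-Z ]', '', text.lower())

texts = [
    "Python is great for data science",
    "Football is a popular sport",
    "Elections impact government policies",
    "Machine learning improves automation",
    "Basketball requires agility and speed",
]
labels = ["tech", "sports", "politics", "tech", "sports"]

vectorizer = TfidfVectorizer()
model = LogisticRegression(max_iter=1000)
model.fit(vectorizer.fit_transform([preprocess(t) for t in texts]), labels)

def predict_with_scores(text):
    # Hypothetical helper: returns {class label: estimated probability}
    vect = vectorizer.transform([preprocess(text)])
    probs = model.predict_proba(vect)[0]
    return dict(zip(model.classes_, probs))

scores = predict_with_scores("The government introduced a new AI policy")
print(scores)
```

On a dataset this small the probabilities will be close to each other, so treat them as a rough ranking and not as calibrated confidence.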

## **10. Summary**

- Dataset loaded and preprocessed
- TF-IDF vectorization applied
- Multi-class classifier trained
- Accuracy and per-class precision, recall, and F1 computed on the held-out test set
- Ready for deployment or enhancement once trained on a real dataset
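
As one possible enhancement, the cleaning, vectorization, and classification steps can be bundled into a single scikit-learn `Pipeline`, so one object is fitted, used for prediction, and later saved. This is a sketch under the assumption that the same three steps are kept; the name `text_clf` and the query sentence are illustrative.

```python
import re
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def preprocess(text):
    return re.sub(r'[^a-zA-Z ]', '', text.lower())

texts = [
    "Python is great for data science",
    "Football is a popular sport",
    "Elections impact government policies",
    "Machine learning improves automation",
    "Basketball requires agility and speed",
]
labels = ["tech", "sports", "politics", "tech", "sports"]

text_clf = Pipeline([
    # The custom preprocessor replaces the vectorizer's built-in lowercasing step
    ("tfidf", TfidfVectorizer(preprocessor=preprocess)),
    ("clf", LogisticRegression(max_iter=1000)),
])
text_clf.fit(texts, labels)

# Raw text goes in directly; the pipeline cleans and vectorizes it internally
prediction = text_clf.predict(["Basketball and football are popular"])[0]
print(prediction)
```

Keeping everything in one object avoids the risk of applying a different cleaning rule or vectorizer at prediction time than was used during training.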
