# Text Classification Notebook
This notebook reproduces the workflow in the project's `app.py` and `src/TextClassifier.py`.
Each logical step from data loading to training, evaluation and prediction is provided in its own cell with a short explanation.
Run the cells in order to reproduce the behavior of the scripts.

## 1) Imports and logging
We import the same libraries used by the scripts and set up basic logging for visibility.

In [5]:
import logging
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import re

logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
logging.info("Imports complete.")

2026-02-20 13:28:15,594 - INFO - Imports complete.


## 2) TextClassifier class
The `TextClassifier` implementation is copied from `src/TextClassifier.py` so that the notebook is standalone.
The class provides preprocessing, training, prediction and evaluation methods.

In [6]:
class TextClassifier:
    """
    A simple text classification pipeline.
    """

    def __init__(self):
        self.vectorizer = CountVectorizer()
        self.model = LogisticRegression(max_iter=1000)
        logging.info("TextClassifier initialized.")

    def preprocess_text(self, text: str) -> str:
        """
        Cleans and normalizes a single text string.
        - Lowercases text
        - Removes every character except lowercase letters and whitespace
          (digits and punctuation are dropped)
        - Collapses runs of whitespace into single spaces
        """
        logging.debug(f"Preprocessing text: '{text}'")
        text = text.lower()
        non_alphabetical_characters = r"[^a-z\s]"
        text = re.sub(non_alphabetical_characters, "", text)
        text = " ".join(text.split())
        logging.debug(f"Preprocessed text: '{text}'")
        return text

    def train(self, texts: list[str], labels: list[str]):
        """
        Trains the classification model.
        """
        logging.info("Starting model training.")
        # Preprocess all texts
        processed_texts = [self.preprocess_text(text) for text in texts]

        # Fit vectorizer and transform texts
        X = self.vectorizer.fit_transform(processed_texts)
        y = labels

        # Train the model
        self.model.fit(X, y)
        logging.info("Model training completed.")

    def predict(self, texts: list[str]) -> list[str]:
        """
        Makes predictions on new text data.
        """
        logging.info("Starting prediction.")
        # Preprocess new texts
        processed_texts = [self.preprocess_text(text) for text in texts]

        # Transform texts using the fitted vectorizer
        X_new = self.vectorizer.transform(processed_texts)

        # Make predictions
        predictions = self.model.predict(X_new).tolist()
        logging.info("Prediction completed.")
        return predictions

    def evaluate(self, texts: list[str], true_labels: list[str]) -> float:
        """
        Evaluates the model's accuracy.
        """
        logging.info("Starting model evaluation.")
        predictions = self.predict(texts)
        score = accuracy_score(true_labels, predictions)
        logging.info(f"Model accuracy: {score:.4f}")
        return score
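
The effect of `preprocess_text` is easiest to see on concrete strings. The sketch below is a standalone copy of the same three steps (lowercase, strip with the `[^a-z\s]` pattern, collapse whitespace); the second sample string is a hypothetical input made up for illustration. Note that digits are removed along with punctuation.

```python
import re


def preprocess_text(text: str) -> str:
    """Standalone copy of the cleaning steps, for illustration only."""
    text = text.lower()                    # 1. lowercase
    text = re.sub(r"[^a-z\s]", "", text)   # 2. keep only a-z and whitespace
    return " ".join(text.split())          # 3. collapse whitespace runs


samples = [
    "What a great product, I will buy 2 more!",
    "SSL   handshake\terror (code 503)",  # hypothetical input
]
cleaned = [preprocess_text(s) for s in samples]
for original, result in zip(samples, cleaned):
    print(f"{original!r} -> {result!r}")
```

Because numbers such as status codes disappear at this stage, they cannot contribute features to the model.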


## 3) Initialise the classifier
Create an instance of `TextClassifier`.

In [7]:
classifier = TextClassifier()
logging.info("Classifier instance created.")

2026-02-20 13:28:15,646 - INFO - TextClassifier initialized.
2026-02-20 13:28:15,648 - INFO - Classifier instance created.


## 4) Load dataset
Load the CSV at `data/raw/text-label.csv` and inspect a few rows.

In [8]:
df = pd.read_csv("./data/raw/text-label.csv")
texts = df['text'].tolist()
labels = df['label'].tolist()
print(f"Loaded {len(texts)} samples")
df.head()

Loaded 263 samples


Unnamed: 0,text,label
0,After deploying to staging the site returns a ...,config
1,Application fails to start on production becau...,config
2,SMTP emails are not being delivered; service l...,config
3,OAuth login fails with redirect URI does not m...,config
4,SSL handshake error in browser; certificate ch...,config
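
The `Unnamed: 0` column in the preview typically appears when a CSV was written together with its row index. Assuming that is the case for this file, passing `index_col=0` to `pd.read_csv` would restore it as the index. The sketch below uses a hypothetical two-row CSV held in memory rather than the real file.

```python
import io

import pandas as pd

# Hypothetical CSV saved with its index, mimicking the layout of the preview.
csv_text = ",text,label\n0,first example,config\n1,second example,code\n"

default_df = pd.read_csv(io.StringIO(csv_text))               # index becomes a data column
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)  # first column used as index

print(list(default_df.columns))
print(list(indexed_df.columns))
```

Since the notebook only reads the `text` and `label` columns, the extra column is harmless here; this is purely a tidiness option.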


## 5) Split into train/test
We use `train_test_split` with the same parameters as `app.py`.

In [9]:
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)
print(f"Train: {len(X_train)} samples, Test: {len(X_test)} samples")

Train: 210 samples, Test: 53 samples
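
The split sizes follow from rounding: with a fractional `test_size`, scikit-learn rounds the test count up, and the remainder goes to training. The sketch below checks that arithmetic for 263 samples and shows, on a hypothetical toy dataset (30 samples, two classes in a 2:1 ratio), how `stratify` keeps the class proportions the same in both splits.

```python
import math
from collections import Counter

from sklearn.model_selection import train_test_split

# Split-size arithmetic for the 263 samples loaded above.
n_samples = 263
n_test = math.ceil(0.2 * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)

# Hypothetical toy data: 20 "config" and 10 "code" samples.
toy_texts = [f"sample {i}" for i in range(30)]
toy_labels = ["config"] * 20 + ["code"] * 10

_, _, toy_y_train, toy_y_test = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels
)
train_counts = Counter(toy_y_train)
test_counts = Counter(toy_y_test)
print(train_counts, test_counts)
```

In the toy data the 2:1 ratio is preserved in both the training and test parts, which is the property `stratify=labels` is meant to provide on the real data.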


## 6) Train the model
Call the `train` method on the classifier with the training split.

In [10]:
classifier.train(X_train, y_train)
print("Training completed.")

2026-02-20 13:28:15,741 - INFO - Starting model training.
2026-02-20 13:28:15,902 - INFO - Model training completed.


Training completed.


## 7) Evaluate on the test set
Compute accuracy on the held-out test split.

In [11]:
accuracy = classifier.evaluate(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")

2026-02-20 13:28:15,954 - INFO - Starting model evaluation.
2026-02-20 13:28:15,955 - INFO - Starting prediction.
2026-02-20 13:28:15,958 - INFO - Prediction completed.
2026-02-20 13:28:15,963 - INFO - Model accuracy: 0.9057


Test accuracy: 0.91
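
Accuracy is the fraction of test samples whose predicted label matches the true label. The sketch below computes it by hand on hypothetical toy labels. With 53 test samples, the logged value of 0.9057 is consistent with 48 correct predictions, which is an inference from the numbers above rather than something the notebook reports directly.

```python
def accuracy(true_labels: list[str], predicted_labels: list[str]) -> float:
    """Fraction of positions where the prediction equals the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("Both label lists must have the same length.")
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)


# Hypothetical toy labels: three of four predictions are correct.
toy_true = ["config", "code", "code", "config"]
toy_pred = ["config", "code", "config", "config"]
toy_accuracy = accuracy(toy_true, toy_pred)
print(toy_accuracy)

# 48 correct out of 53 rounds to the value in the log line.
implied_accuracy = round(48 / 53, 4)
print(implied_accuracy)
```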


## 8) Predict on new unseen data
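Create a small list of unseen texts and call `predict`. Note that these sample sentences are product reviews, whereas the dataset preview shows technical issue descriptions with labels such as `config`, so the predictions below demonstrate the API rather than meaningful classifications.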

In [12]:
new_texts_for_prediction = [
    "This is an absolutely fantastic product, highly recommended!",
    "I am extremely disappointed with the quality.",
    "It's an average item, nothing special but not bad.",
    "What a great product, I will buy 2 more!",
    "This was a terrible investment, I regret it.",
]
predictions = classifier.predict(new_texts_for_prediction)
for i, (text, pred) in enumerate(zip(new_texts_for_prediction, predictions)):
    print(f"Text {i+1}: '{text}' -> Predicted: {pred}")

2026-02-20 13:28:15,979 - INFO - Starting prediction.
2026-02-20 13:28:15,982 - INFO - Prediction completed.


Text 1: 'This is an absolutely fantastic product, highly recommended!' -> Predicted: code
Text 2: 'I am extremely disappointed with the quality.' -> Predicted: code
Text 3: 'It's an average item, nothing special but not bad.' -> Predicted: config
Text 4: 'What a great product, I will buy 2 more!' -> Predicted: code
Text 5: 'This was a terrible investment, I regret it.' -> Predicted: code
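
One reason such predictions can be unreliable is that `CountVectorizer.transform` silently ignores words it did not see during fitting. The sketch below fits a vectorizer on two hypothetical in-domain sentences and compares how many known words an in-domain and an out-of-domain sentence contain. The actual vocabulary learned from the training split is not shown in this notebook, so the real overlap may be larger than in this toy case.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for the training texts.
toy_training_texts = [
    "ssl handshake error certificate chain",
    "null pointer exception in parser module",
]
vectorizer = CountVectorizer()
vectorizer.fit(toy_training_texts)

in_domain = vectorizer.transform(["certificate error during ssl handshake"])
out_of_domain = vectorizer.transform(["absolutely fantastic product highly recommended"])

# nnz counts the non-zero entries, i.e. distinct known words in each text.
print(in_domain.nnz)
print(out_of_domain.nnz)
```

When a text maps to an all-zero vector, a linear model such as `LogisticRegression` can only fall back on its intercept, so the predicted label says little about the text itself.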


## Notebook complete
Run the cells in order to reproduce the results above. A possible refinement is to import `TextClassifier` from the existing `src` module instead of copying the class into the notebook.