# Joke Quality Rating with Ridge Regression and n-grams
This notebook shows a **full, editable workflow** for rating jokes on a continuous scale using a Ridge regression model with configurable n-gram ranges.

1. **Load** your CSV (`ID, Title, Category, Body, Rating`).
2. **Vectorise** the `Body` column with an n-gram span you choose.
3. **Train / evaluate** Ridge Regression.
4. **Tweak** `NGRAM_LOW` / `NGRAM_HIGH`, re-run and compare metrics.

> **Tip:** If `Rating` is empty, label some jokes manually, train the model, and then go back and predict on the rest.

In [1]:
# Install dependencies if needed
!pip install pandas scikit-learn matplotlib -q

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge 
from sklearn.metrics import mean_squared_error

## 1. Configuration – edit these values as needed

In [12]:
# ---------------- user-editable ----------------
CSV_PATH = Path('jester.csv')      # path to your dataset
TEXT_COL = 'Body'                 # column with the actual joke
LABEL_COL = 'Rating'              # numeric rating (NaN if unlabeled)
NGRAM_LOW, NGRAM_HIGH = 1, 2      # n-gram range (inclusive)
TEST_SIZE = 0.2                   # 20% test split
ALPHA = 1.0                       # Ridge regression alpha
RANDOM_STATE = 42
# -----------------------------------------------

## 2. Load data

In [13]:
df = pd.read_csv(CSV_PATH)
print(f'Dataset shape: {df.shape}')
df.head()

Dataset shape: (150, 5)


| | ID | Title | Category | Body | Rating |
|---|---|---|---|---|---|
| 0 | 1 | | | `A man visits the doctor. The doctor says "I ha...` | |
| 1 | 2 | | | `This couple had an excellent relationship goin...` | |
| 2 | 3 | | | `Q. What's 200 feet long and has 4 teeth? \n\nA...` | |
| 3 | 4 | | | `Q. What's the difference between a man and a t...` | |
| 4 | 5 | | | `Q.\tWhat's O. J. Simpson's Internet address? \...` | -1.756331 |


## 3. Train / Test split

In [16]:
X = df[TEXT_COL].fillna('')
y = df[LABEL_COL]

mask_labeled = y.notna()
X_labeled, y_labeled = X[mask_labeled], y[mask_labeled]

X_train, X_test, y_train, y_test = train_test_split(
    X_labeled, y_labeled,
    test_size=TEST_SIZE,
    random_state=RANDOM_STATE,
)
print(f'Train size: {X_train.shape[0]}, Test size: {X_test.shape[0]}')

Train size: 112, Test size: 28


## 4. Build & fit Pipeline

In [20]:
pipe = Pipeline([
    ("vect", CountVectorizer(ngram_range=(NGRAM_LOW, NGRAM_HIGH))),
    ("reg", Ridge(alpha=ALPHA))
])

pipe.fit(X_train, y_train)
mse = mean_squared_error(y_test, pipe.predict(X_test))
# mean_squared_error returns the MSE; np.sqrt(mse) gives the RMSE.
print("MSE =", mse)

MSE = 2.3607549063920024
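
The "tweak and compare" step from the workflow can be sketched as a loop over several n-gram ranges. This is only an illustration under stated assumptions: the texts and ratings below are made-up stand-ins for `X_labeled` / `y_labeled`, and the `rmse` helper and the mean-rating baseline are additions that are not part of the notebook. scikit-learn's `mean_squared_error` returns the mean squared error, so the helper takes the square root explicitly.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Toy stand-in data (assumption): swap in X_labeled / y_labeled.
texts = [
    "why did the chicken cross the road",
    "a man walks into a bar",
    "knock knock who is there",
    "what do you call a sleeping bull",
    "the doctor says take two pills",
    "why did the scarecrow win an award",
    "a horse walks into a bar",
    "what do you call a fake noodle",
    "the teacher says open your books",
    "why did the bicycle fall over",
]
ratings = np.array([1.0, -0.5, 0.2, 2.0, -1.5, 1.2, -0.3, 1.8, -1.0, 0.9])

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, ratings, test_size=0.3, random_state=42
)


def rmse(y_true, y_pred):
    """Root mean squared error (hypothetical helper)."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


# Baseline: always predict the mean rating of the training set.
baseline = rmse(y_te, np.full(len(y_te), y_tr.mean()))
print(f"mean baseline RMSE = {baseline:.3f}")

results = {}
for ngram_range in [(1, 1), (1, 2), (2, 2)]:
    model = Pipeline([
        ("vect", CountVectorizer(ngram_range=ngram_range)),
        ("reg", Ridge(alpha=1.0)),
    ])
    model.fit(X_tr, y_tr)
    results[ngram_range] = rmse(y_te, model.predict(X_te))
    print(f"ngram_range={ngram_range}: RMSE = {results[ngram_range]:.3f}")
```

Comparing each range against the baseline shows whether the n-gram features help at all; with only a handful of test rows, the differences are likely to be noisy.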


## 5. Predict joke quality for texts without a rating

In [28]:
# --- Predict joke quality for every row without a rating ---
UNLABELED_MASK = df[LABEL_COL].isna()

if UNLABELED_MASK.any():
    print(f"{UNLABELED_MASK.sum()} jokes have no rating")
    preds = pipe.predict(df.loc[UNLABELED_MASK, TEXT_COL].fillna(""))

    df.loc[UNLABELED_MASK, LABEL_COL] = preds

    # Path.with_stem requires Python 3.9+
    OUTPUT = CSV_PATH.with_stem(CSV_PATH.stem + "_scored")
    df.to_csv(OUTPUT, index=False)
    print(f"✔️  Ratings have been added. New file: {OUTPUT}")
else:
    print("All rows already have a rating; nothing to fill in.")

All rows already have a rating; nothing to fill in.
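
Because the cell above writes predictions into the `Rating` column itself, a re-run can no longer tell human labels from model output. One possible variation, sketched here under assumptions, keeps predictions in a separate column. The `PredictedRating` column name, the toy frame, and `predict_stub` (a stand-in for `pipe.predict`) are all hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration; the real notebook would use `df`.
df_demo = pd.DataFrame({
    "Body": ["joke a", "joke b", "joke c"],
    "Rating": [1.5, np.nan, -0.5],
})


def predict_stub(texts):
    """Stand-in for pipe.predict: returns 0.0 for every text."""
    return np.zeros(len(texts))


unlabeled = df_demo["Rating"].isna()

# Hypothetical column: holds model output only, NaN for labeled rows.
df_demo["PredictedRating"] = np.nan
df_demo.loc[unlabeled, "PredictedRating"] = predict_stub(
    df_demo.loc[unlabeled, "Body"]
)

print(df_demo)
```

With this layout the original `Rating` values stay untouched, so the training mask `y.notna()` still selects only manually labeled jokes on later runs.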
