# Math Problem Classification with Classical Machine Learning Models

In this notebook I explore how well simple classical models perform on the math problem classification data from
[this](https://www.kaggle.com/competitions/classification-of-math-problems-by-kasut-academy/overview) Kaggle competition.

What this notebook covers:
- It first trains Logistic Regression and LightGBM models on TF-IDF features, using scikit-learn pipelines.
- Many samples contain gibberish text, so preprocessing is applied to handle that.
    - Over-preprocessing degraded the scores, so the function removes only the clearly unnecessary parts of the text.
    - The resulting preprocessing function is simple but more robust.
    - Despite removing a lot of gibberish, accuracy on the held-out split barely improved (Logistic Regression: 0.7949 to 0.7978; LightGBM: 0.7900 to 0.7910).
- Because the classes are imbalanced, SMOTE is used to oversample the minority classes.
- With SMOTE, the LightGBM held-out score improved by about 1 point (0.7910 to 0.7988), which is not much of an improvement.

Next steps for this exploration:
- LightGBM is overfitting (about 0.99 on the training split vs. about 0.79 on the held-out split), and this should be handled properly.
- Hyperparameter tuning may improve the score.
- Only TF-IDF features are used to train the models. Feature engineering (e.g., counts of particular words) may help the models pick up the relationships.

**Public Test Score:** 0.7862
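
The feature-engineering idea from the next steps could look roughly like the sketch below. It is not part of this notebook's pipeline: the helper `handcrafted_counts`, the token list `TOKENS`, and the toy questions are all hypothetical and chosen only for illustration. The sketch combines TF-IDF features with a few per-token counts through a scikit-learn `FeatureUnion`.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

# Hypothetical tokens; which ones actually help is an open question.
TOKENS = ["triangle", "prime", "probability", "\\frac", "\\int"]

def handcrafted_counts(texts):
    """Return a sparse matrix with one column per token in TOKENS."""
    rows = [[str(t).lower().count(tok) for tok in TOKENS] for t in texts]
    return csr_matrix(np.asarray(rows, dtype=float))

# TF-IDF columns and handcrafted count columns are stacked side by side.
features = FeatureUnion([
    ("tfidf", TfidfVectorizer()),
    ("counts", FunctionTransformer(handcrafted_counts)),
])

model = Pipeline([
    ("features", features),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# Toy data standing in for the competition questions and labels.
toy_X = [
    "Find the area of a triangle with sides 3, 4, 5.",
    "Show that 97 is a prime number.",
    "A triangle has angles in ratio 1:2:3.",
    "How many prime numbers are below 50?",
]
toy_y = [0, 1, 0, 1]
model.fit(toy_X, toy_y)
print(model.predict(toy_X))
```

Whether such counts add anything beyond what TF-IDF already captures would need to be checked on the held-out split.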

# Install Packages

In [3]:
!pip install imbalanced-learn -q




# Import Libraries

In [4]:
import re
import pandas as pd
from sklearn.metrics import f1_score
from imblearn.over_sampling import SMOTE
from collections import Counter
from sklearn.pipeline import Pipeline
from lightgbm import LGBMClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer

# Load Data

In [None]:
train_data_path = "https://raw.githubusercontent.com/PrudhvirajuChekuri/Final-Project-Group8/refs/heads/master/code/data/train.csv"
test_data_path = "https://raw.githubusercontent.com/PrudhvirajuChekuri/Final-Project-Group8/refs/heads/master/code/data/test.csv"
sample_sub_path = "https://raw.githubusercontent.com/PrudhvirajuChekuri/Final-Project-Group8/refs/heads/master/code/data/sample_submission.csv"

train = pd.read_csv(train_data_path)
test = pd.read_csv(test_data_path)
sample_sub = pd.read_csv(sample_sub_path)

display(train.shape, test.shape)
display(sample_sub.head())

(10189, 2)

(3044, 2)

Unnamed: 0,id,label
0,0,0
1,1,0
2,2,0
3,3,0
4,4,0


In [9]:
train.head()

Unnamed: 0,Question,label
0,A solitaire game is played as follows. Six di...,3
1,2. The school table tennis championship was he...,5
2,"Given that $x, y,$ and $z$ are real numbers th...",0
3,$25 \cdot 22$ Given three distinct points $P\l...,1
4,I am thinking of a five-digit number composed ...,5


In [10]:
test.head()

Unnamed: 0,id,Question
0,0,b'Solve 0 = -i - 91*i - 1598*i - 64220 for i.\n'
1,1,Galperin G.A.\n\nA natural number $N$ is 999.....
2,2,Example 7 Calculate $\frac{1}{2 \sqrt{1}+\sqrt...
3,3,"If $A$, $B$, and $C$ represent three distinct ..."
4,4,2. Calculate $1+12+123+1234+12345+123456+12345...


In [11]:
train["label"].value_counts()

label
0    2618
1    2439
5    1827
4    1712
2    1039
3     368
6     100
7      86
Name: count, dtype: int64
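
To put the imbalance in numbers, the short sketch below reuses the counts printed above and computes the largest-to-smallest class ratio, together with the weights that scikit-learn's `class_weight="balanced"` option would assign, which follow `n_samples / (n_classes * count)`. The counts here come from the full training file; inside the notebook the weights are computed on the training split, so the exact values differ slightly.

```python
# Label counts copied from the value_counts() output above.
counts = {0: 2618, 1: 2439, 5: 1827, 4: 1712, 2: 1039, 3: 368, 6: 100, 7: 86}

n_samples = sum(counts.values())
n_classes = len(counts)

# Ratio between the most and least frequent classes.
imbalance_ratio = max(counts.values()) / min(counts.values())

# "balanced" heuristic: rare classes get proportionally larger weights.
weights = {label: n_samples / (n_classes * c) for label, c in counts.items()}

print(f"samples: {n_samples}, imbalance ratio: {imbalance_ratio:.1f}")
for label in sorted(weights):
    print(f"label {label}: weight {weights[label]:.2f}")
```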

In [12]:
for ind, i in enumerate(train["Question"][80:100]):
    print("Q:", ind)
    print(i)
    print("\n")

Q: 0
b'Solve 19*v - 316 - 142 = 151 + 75 for v.\n'


Q: 1
10. (5 points) As shown in the figure, it is a toy clock. When the hour hand makes one complete revolution, the minute hand makes 9 revolutions. If the two hands overlap at the beginning, then when the two hands overlap again, the degree the hour hand has turned is $\qquad$ .




Q: 2
5) A cyclist climbs a hill at a constant speed $v$ and then descends the same road at a constant speed that is three times the previous speed. The average speed for the entire round trip is. .
(A) $\frac{3}{4} v$
(B) $\frac{4}{3} v$
(C) $\frac{3}{2} v$
(D) $2 v$
(E) depends on the length of the road


Q: 3
6. In $\triangle A B C$,
$$
\tan A 、(1+\sqrt{2}) \tan B 、 \tan C
$$

form an arithmetic sequence. Then the minimum value of $\angle B$ is


Q: 4
I3.1 Let $x \neq \pm 1$ and $x \neq-3$. If $a$ is the real root of the equation $\frac{1}{x-1}+\frac{1}{x+3}=\frac{2}{x^{2}-1}$, find the value of $a$
I3.2 If $b>1, f(b)=\frac{-a}{\log _{2} b}$ and $g(b)

- Many problems start with extraneous prefixes in various forms, such as
    - numbers
    - Example ...
    - task ...
- Some contain links and random names.
- Some contain the names of question creators, authors, etc.
- The questions may have been extracted or scraped rather than written by hand, or this noise may have been introduced deliberately for competition purposes.
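
One way to check how common these patterns are is to count regex matches. The sketch below is an illustration only: the pattern names and regexes in `PATTERNS` are assumptions, and `samples` holds shortened versions of questions shown earlier rather than the full dataset.

```python
import re

# Hypothetical patterns for the kinds of noise listed above.
PATTERNS = {
    "bytes_wrapper": re.compile(r"^b'.*'$", re.DOTALL),       # b'...' string residue
    "leading_number": re.compile(r"^\s*\d+[.)]\s"),           # "10. " or "5) "
    "example_or_task": re.compile(r"^\s*(example|task)\b", re.IGNORECASE),
}

# Shortened versions of questions printed earlier in the notebook.
samples = [
    "b'Solve 19*v - 316 - 142 = 151 + 75 for v.\\n'",
    "10. (5 points) As shown in the figure, it is a toy clock.",
    "5) A cyclist climbs a hill at a constant speed $v$",
    "Example 7 Calculate",
    "I am thinking of a five-digit number",
]

# Number of samples matching each pattern.
hits = {
    name: sum(bool(pattern.search(s)) for s in samples)
    for name, pattern in PATTERNS.items()
}
print(hits)
```

On the real data, the same patterns could be applied with `train["Question"].str.contains(pattern)` to get per-pattern frequencies.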

In [13]:
X = train["Question"]
y = train["label"]
display(X.shape, y.shape, X.head(2), y.head(2))

(10189,)

(10189,)

0    A solitaire game is played as follows.  Six di...
1    2. The school table tennis championship was he...
Name: Question, dtype: object

0    3
1    5
Name: label, dtype: int64

# Split data

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.1, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((9170,), (1019,), (9170,), (1019,))

# Create Training Pipeline with Logistic Regression

In [15]:
lr = Pipeline([('vect', CountVectorizer()),
               ('tfidf', TfidfTransformer()),
               ('clf', LogisticRegression(class_weight="balanced", multi_class='ovr'))
              ])
lr.fit(X_train, y_train)
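
As a side note on the feature step, `CountVectorizer` followed by `TfidfTransformer` produces the same matrix as a single `TfidfVectorizer` with default settings. The sketch below checks this on a few made-up strings and also shows which tokens the default analyzer keeps for a LaTeX-heavy question; the toy documents are assumptions, not rows from the dataset.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)
from sklearn.pipeline import make_pipeline

# Toy documents standing in for competition questions.
docs = [
    "Find $x$ if \\frac{1}{x} = 2",
    "Solve 19*v - 316 = 151 for v",
    "In triangle ABC find the angle B",
]

# Two-step version, as used in the pipelines of this notebook.
two_step = make_pipeline(CountVectorizer(), TfidfTransformer()).fit_transform(docs)
# One-step version.
one_step = TfidfVectorizer().fit_transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
print("identical features:", same)

# The default analyzer lowercases and keeps only tokens of 2+ word characters,
# so single-letter variables, digits and LaTeX punctuation are dropped.
tokens = CountVectorizer().build_analyzer()(docs[0])
print(tokens)
```

Because much of the mathematical notation is dropped by the default tokenizer, a custom `token_pattern` might be worth trying, though its effect on the score is untested here.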



# Make Predictions

In [16]:
train_preds_lr = lr.predict(X_train)
test_preds_lr = lr.predict(X_test)

# Evaluate

In [17]:
print("Logistic Regression Train Accuracy:", lr.score(X_train, y_train))
print("Logistic Regression Test Accuracy:", lr.score(X_test, y_test))

Logistic Regression Train Accuracy: 0.8735005452562704
Logistic Regression Test Accuracy: 0.7948969578017664


# Create Training Pipeline with LightGBM

In [18]:
lgbm = Pipeline([('vect', CountVectorizer()),
               ('tfidf', TfidfTransformer()),
               ('clf', LGBMClassifier(is_unbalance = True, verbose = -1))
              ])

lgbm.fit(X_train, y_train)

# Make Predictions

In [19]:
train_preds_lgbm = lgbm.predict(X_train)
test_preds_lgbm = lgbm.predict(X_test)

# Evaluation

In [20]:
print("LGBM Train Accuracy:", lgbm.score(X_train, y_train))
print("LGBM Test Accuracy:", lgbm.score(X_test, y_test))

LGBM Train Accuracy: 0.9924754634678299
LGBM Test Accuracy: 0.7899901864573111


# Data Preprocessing

In [21]:
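def clean_math_text_final(text):
    """Lightly clean a question: drop leading numbering, URLs, hashtags and emoji."""
    text = str(text)
    # Remove a leading problem number such as "12. ".
    text = re.sub(r'^\s*\d+\.\s*', '', text)
    # Remove URLs.
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)
    # Remove hashtag-like tokens.
    text = re.sub(r'#\w+', ' ', text)
    # Remove emoji and pictographs. Note that the last range is very broad and
    # also covers CJK characters and a number of other symbols.
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"
                           u"\U0001F300-\U0001F5FF"
                           u"\U0001F680-\U0001F6FF"
                           u"\U0001F1E0-\U0001F1FF"
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    text = emoji_pattern.sub(r' ', text)
    # Collapse runs of whitespace and lowercase the result.
    text = re.sub(r'\s+', ' ', text).strip().lower()

    return text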
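
A quick check of what two of these patterns do and do not catch is sketched below. The regexes are copied from `clean_math_text_final` so that the block runs on its own, and the test strings are shortened versions of questions printed earlier.

```python
import re

# Patterns copied from clean_math_text_final.
numbering = re.compile(r'^\s*\d+\.\s*')
emoji_pattern = re.compile("["
                           "\U0001F600-\U0001F64F"
                           "\U0001F300-\U0001F5FF"
                           "\U0001F680-\U0001F6FF"
                           "\U0001F1E0-\U0001F1FF"
                           "\U00002702-\U000027B0"
                           "\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)

# "10. " is stripped, but the "(5 points)" that follows is kept.
dotted = numbering.sub('', "10. (5 points) As shown in the figure")
print(dotted)

# "5) " style numbering does not match the pattern and is left in place.
paren = numbering.sub('', "5) A cyclist climbs a hill")
print(paren)

# The broad emoji range also removes the ideographic comma U+3001.
latex = emoji_pattern.sub(' ', "\\tan A 、(1+\\sqrt{2}) \\tan B")
print(latex)
```

So the function leaves some prefixes untouched (`5)`, `Example ...`, the `b'...'` wrapper), while the wide Unicode range removes more than emoji.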

# Re-run both models on Cleaned Data

In [22]:
train['Question'] = train['Question'].apply(clean_math_text_final)
test['Question'] = test['Question'].apply(clean_math_text_final)

train.head()

Unnamed: 0,Question,label
0,a solitaire game is played as follows. six dis...,3
1,the school table tennis championship was held ...,5
2,"given that $x, y,$ and $z$ are real numbers th...",0
3,$25 \cdot 22$ given three distinct points $p\l...,1
4,i am thinking of a five-digit number composed ...,5


In [23]:
X = train["Question"]
y = train["label"]
display(X.shape, y.shape, X.head(2), y.head(2))

(10189,)

(10189,)

0    a solitaire game is played as follows. six dis...
1    the school table tennis championship was held ...
Name: Question, dtype: object

0    3
1    5
Name: label, dtype: int64

In [24]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.1, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((9170,), (1019,), (9170,), (1019,))

In [25]:
lr = Pipeline([('vect', CountVectorizer()),
               ('tfidf', TfidfTransformer()),
               ('clf', LogisticRegression(class_weight="balanced", multi_class='ovr'))
              ])
lr.fit(X_train, y_train)



In [26]:
train_preds_lr = lr.predict(X_train)
test_preds_lr = lr.predict(X_test)

In [27]:
print("Logistic Regression Train Accuracy:", lr.score(X_train, y_train))
print("Logistic Regression Test Accuracy:", lr.score(X_test, y_test))

Logistic Regression Train Accuracy: 0.874154852780807
Logistic Regression Test Accuracy: 0.7978410206084396


In [28]:
lgbm = Pipeline([('vect', CountVectorizer()),
               ('tfidf', TfidfTransformer()),
               ('clf', LGBMClassifier(is_unbalance = True, verbose = -1))
              ])

lgbm.fit(X_train, y_train)

In [29]:
train_preds_lgbm = lgbm.predict(X_train)
test_preds_lgbm = lgbm.predict(X_test)

In [30]:
print("LGBM Train Accuracy:", lgbm.score(X_train, y_train))
print("LGBM Test Accuracy:", lgbm.score(X_test, y_test))

LGBM Train Accuracy: 0.9924754634678299
LGBM Test Accuracy: 0.7909715407262021


# SMOTE

In [31]:
print("\n--- Count Vectorization ---")
count_vectorizer = CountVectorizer()

count_vectorizer.fit(X_train)

X_train_counts = count_vectorizer.transform(X_train)
X_test_counts = count_vectorizer.transform(X_test)
print("Shape of Count Vectorizer train features:", X_train_counts.shape)
print("Shape of Count Vectorizer test features:", X_test_counts.shape)


print("\n--- TF-IDF Transformation ---")
tfidf_transformer = TfidfTransformer()

tfidf_transformer.fit(X_train_counts)

X_train_tfidf = tfidf_transformer.transform(X_train_counts)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)
print("Shape of TF-IDF train features:", X_train_tfidf.shape)
print("Shape of TF-IDF test features:", X_test_tfidf.shape)

print("\n--- SMOTE Oversampling ---")
print("Shape of TF-IDF train features before SMOTE:", X_train_tfidf.shape)
print("Original training labels:", Counter(y_train))

smote = SMOTE(random_state=42)

X_train_resampled, y_train_resampled = smote.fit_resample(X_train_tfidf, y_train)

print("Shape of TF-IDF train features AFTER SMOTE:", X_train_resampled.shape)
print("Labels in resampled training set:", Counter(y_train_resampled))

print("\n--- Logistic Regression Training ---")
lr_model = LogisticRegression(multi_class='ovr')

lr_model.fit(X_train_resampled, y_train_resampled)
print("Logistic Regression model trained.")

print("\n--- Evaluation ---")

y_pred_train = lr_model.predict(X_train_resampled)
y_pred_test = lr_model.predict(X_test_tfidf)


train_f1_micro = f1_score(y_train_resampled, y_pred_train, average='micro')
test_f1_micro = f1_score(y_test, y_pred_test, average='micro')

print(f"Logistic Regression Train F1 Score (Micro): {train_f1_micro:.4f}")
print(f"Logistic Regression Test F1 Score (Micro): {test_f1_micro:.4f}")


--- Count Vectorization ---
Shape of Count Vectorizer train features: (9170, 11762)
Shape of Count Vectorizer test features: (1019, 11762)

--- TF-IDF Transformation ---
Shape of TF-IDF train features: (9170, 11762)
Shape of TF-IDF test features: (1019, 11762)

--- SMOTE Oversampling ---
Shape of TF-IDF train features before SMOTE: (9170, 11762)
Original training labels: Counter({0: 2356, 1: 2195, 5: 1644, 4: 1541, 2: 935, 3: 331, 6: 90, 7: 78})
Shape of TF-IDF train features AFTER SMOTE: (18848, 11762)
Labels in resampled training set: Counter({4: 2356, 1: 2356, 0: 2356, 5: 2356, 2: 2356, 3: 2356, 7: 2356, 6: 2356})

--- Logistic Regression Training ---




Logistic Regression model trained.

--- Evaluation ---
Logistic Regression Train F1 Score (Micro): 0.9157
Logistic Regression Test F1 Score (Micro): 0.7821
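
For context on what the oversampling step does: SMOTE creates each synthetic row by picking a minority sample, picking one of its nearest minority-class neighbours, and interpolating at a random point between the two. In the output above, label 7 grows from 78 to 2356 rows, so about 97% of its training rows are synthetic. The sketch below illustrates the interpolation idea in plain NumPy; it is a simplified illustration, not imbalanced-learn's implementation, and `smote_like_sample` is a hypothetical helper.

```python
import numpy as np

def smote_like_sample(X_minority, n_new, k=5, seed=0):
    """Generate n_new synthetic rows by interpolating between neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    if n < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, n - 1)

    # Pairwise squared distances; a sample is never its own neighbour.
    d2 = ((X_minority[:, None, :] - X_minority[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)
    neighbours = np.argsort(d2, axis=1)[:, :k]

    # Pick a base sample and one of its k nearest neighbours for each new row.
    base = rng.integers(0, n, size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]

    # x_new = x_base + lam * (x_neighbour - x_base), with lam drawn from [0, 1).
    lam = rng.random((n_new, 1))
    return X_minority[base] + lam * (X_minority[picked] - X_minority[base])

corners = [[0, 0], [1, 0], [0, 1], [1, 1]]
synthetic = smote_like_sample(corners, n_new=10, k=2)
print(synthetic.shape)
```

Because every synthetic row lies on a segment between two existing rows, the method adds no information that is not already in the minority samples, which may be one reason the gain is small here.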


In [32]:
print("\n--- LGBM Training ---")
lgbm = LGBMClassifier(verbose = -1)
lgbm.fit(X_train_resampled, y_train_resampled)
print("LGBM model trained.")

print("\n--- Evaluation ---")

y_pred_train = lgbm.predict(X_train_resampled)
y_pred_test = lgbm.predict(X_test_tfidf)


train_f1_micro = f1_score(y_train_resampled, y_pred_train, average='micro')
test_f1_micro = f1_score(y_test, y_pred_test, average='micro')

print(f"LGBM Train F1 Score (Micro): {train_f1_micro:.4f}")
print(f"LGBM Test F1 Score (Micro): {test_f1_micro:.4f}")


--- LGBM Training ---
LGBM model trained.

--- Evaluation ---
LGBM Train F1 Score (Micro): 0.9969
LGBM Test F1 Score (Micro): 0.7988
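
The scores in the SMOTE cells are computed with `f1_score(..., average='micro')`. For single-label multiclass problems, micro-averaged F1 is the same number as accuracy, so these values are directly comparable with the accuracies reported earlier. Macro-averaged F1 weights every class equally and tends to be lower when rare classes are predicted poorly. The toy labels below are made up to show the difference.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up imbalanced labels: eight samples of class 0, two of class 1.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# One of the two minority samples is misclassified.
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

acc = accuracy_score(y_true, y_pred)
micro = f1_score(y_true, y_pred, average="micro")
macro = f1_score(y_true, y_pred, average="macro")

print(f"accuracy: {acc:.4f}")
print(f"micro F1: {micro:.4f}")
print(f"macro F1: {macro:.4f}")
```

Reporting macro F1 alongside accuracy would show more directly whether SMOTE helps the two smallest classes.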


# Kaggle Competition Submission

In [34]:
test_counts = count_vectorizer.transform(test["Question"])
test_tfidf = tfidf_transformer.transform(test_counts)

predicted_labels = lgbm.predict(test_tfidf)

submission_df = pd.DataFrame({
    'id': test['id'],
    'label': predicted_labels
})


submission_filename = 'submission.csv'
submission_df.to_csv(submission_filename, index=False)
print(f"Submission file '{submission_filename}' created successfully.")
display(submission_df.head())

Submission file 'submission.csv' created successfully.


Unnamed: 0,id,label
0,0,0
1,1,4
2,2,0
3,3,4
4,4,4
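
Before uploading, it can be worth comparing the submission with the sample file. The sketch below is a hypothetical helper (`check_submission` is not part of the notebook) that runs on toy frames. It assumes the sample file has `id` and `label` columns and that valid labels are 0 to 7, as in the training data.

```python
import pandas as pd

def check_submission(submission, sample, valid_labels):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if list(submission.columns) != list(sample.columns):
        problems.append("columns differ from the sample file")
    if len(submission) != len(sample):
        problems.append("row count differs from the sample file")
    elif "id" in submission and "id" in sample:
        if not (submission["id"].to_numpy() == sample["id"].to_numpy()).all():
            problems.append("ids differ from the sample file")
    if "label" in submission:
        if not submission["label"].isin(list(valid_labels)).all():
            problems.append("labels outside the expected set")
    if submission.isna().any().any():
        problems.append("missing values present")
    return problems

# Toy frames standing in for sample_sub and submission_df.
sample = pd.DataFrame({"id": [0, 1, 2], "label": [0, 0, 0]})
good = pd.DataFrame({"id": [0, 1, 2], "label": [0, 4, 7]})
bad = pd.DataFrame({"id": [0, 1, 2], "label": [0, 4, 9]})

good_problems = check_submission(good, sample, valid_labels=range(8))
bad_problems = check_submission(bad, sample, valid_labels=range(8))
print(good_problems)
print(bad_problems)
```

In the notebook this would be called as `check_submission(submission_df, sample_sub, range(8))`.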
