This project focuses on classifying math problems into specific topics using NLP and machine learning techniques. The dataset includes natural language math questions categorized into one of eight predefined math topics.
Source: Kaggle Competition (Classification of Math Problems)
The dataset consists of the following files:
- train.csv: ~10,000 labeled math questions
- test.csv: ~3,000 unlabeled math questions
Each question is expressed in natural language, such as:
"Find the real solutions to x² - 5x + 6 = 0."
Label | Topic |
---|---|
0 | Algebra |
1 | Geometry and Trigonometry |
2 | Calculus and Analysis |
3 | Probability and Statistics |
4 | Number Theory |
5 | Combinatorics and Discrete Math |
6 | Linear Algebra |
7 | Abstract Algebra and Topology |
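As a small convenience, the label table above can be kept as a lookup so that numeric predictions can be read back as topic names. This is only a sketch: the ids and topic names come from the table, while the names `LABEL_TO_TOPIC`, `TOPIC_TO_LABEL`, and `topic_name` are hypothetical and not taken from the notebook.

```python
# Label ids and topic names copied from the table above.
LABEL_TO_TOPIC = {
    0: "Algebra",
    1: "Geometry and Trigonometry",
    2: "Calculus and Analysis",
    3: "Probability and Statistics",
    4: "Number Theory",
    5: "Combinatorics and Discrete Math",
    6: "Linear Algebra",
    7: "Abstract Algebra and Topology",
}

# Reverse lookup, handy when starting from a topic name.
TOPIC_TO_LABEL = {topic: label for label, topic in LABEL_TO_TOPIC.items()}


def topic_name(label: int) -> str:
    """Return the topic name for a label id (hypothetical helper)."""
    return LABEL_TO_TOPIC[int(label)]
```

For example, `topic_name(4)` returns `"Number Theory"`.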
The notebook implements an NLP-based classification pipeline using:
- Text cleaning and preprocessing
- TF-IDF feature extraction
- Classification models such as Logistic Regression, SVM, and XGBoost
- Hyperparameter tuning and evaluation
- Submission formatting for Kaggle
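The steps above might be wired together roughly as follows. This is a minimal sketch, not the notebook's actual code: the toy questions, the `build_pipeline` helper, and the hyperparameter values (`ngram_range`, `sublinear_tf`, `max_iter`) are all assumptions chosen for illustration, and the real pipeline trains on train.csv.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy stand-in data (hypothetical); the real notebook uses train.csv.
questions = [
    "Solve the equation x^2 - 5x + 6 = 0 for x.",
    "Factor the polynomial x^2 + 3x + 2.",
    "Find the area of a triangle with base 4 and height 3.",
    "Compute the angle opposite the longest side of the triangle.",
]
labels = [0, 0, 1, 1]  # 0 = Algebra, 1 = Geometry and Trigonometry


def build_pipeline() -> Pipeline:
    """TF-IDF features followed by a linear classifier (assumed settings)."""
    return Pipeline([
        # Word unigrams and bigrams with sublinear term-frequency scaling.
        ("tfidf", TfidfVectorizer(lowercase=True, ngram_range=(1, 2), sublinear_tf=True)),
        # Logistic Regression is one of the model families listed above.
        ("clf", LogisticRegression(max_iter=1000)),
    ])


model = build_pipeline()
model.fit(questions, labels)

# Predict a label id for an unseen question.
predictions = model.predict(["Find the area of a triangle with base 10 and height 2."])
```

Keeping the vectorizer and classifier in one `Pipeline` means the TF-IDF vocabulary is fit only on training data, and the `clf` step can be swapped for an SVM or XGBoost model without touching the feature extraction.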
Achieved a private leaderboard score of 0.7711 on Kaggle.
Tech Stack
This project uses the following tools and libraries:
- Python: Core programming language
- Pandas: Data manipulation and analysis
- NumPy: Numerical operations
- Scikit-learn: ML models, preprocessing, and evaluation
- XGBoost: Advanced gradient boosting classifier
- Jupyter Notebook: Interactive development and experimentation
- NLP Techniques:
  - TF-IDF (Term Frequency-Inverse Document Frequency)
  - Regular Expressions (Regex)
  - Tokenization
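To illustrate the regex and tokenization items, here is one way a math question could be cleaned and split into tokens using only the standard library. The cleaning rules, the token pattern, and the function names `clean_text` and `tokenize` are assumptions for illustration; the notebook's actual preprocessing is not shown in this README.

```python
import re

# Assumed rules: collapse whitespace, then split into words, numbers, and symbols.
_WHITESPACE = re.compile(r"\s+")
_TOKEN = re.compile(r"[a-z]+|\d+(?:\.\d+)?|[^\sa-z\d]")


def clean_text(text: str) -> str:
    """Lowercase the text and collapse runs of whitespace."""
    return _WHITESPACE.sub(" ", text.lower()).strip()


def tokenize(text: str) -> list[str]:
    """Split cleaned text into word, number, and single-symbol tokens."""
    return _TOKEN.findall(clean_text(text))


tokens = tokenize("Find  the real solutions to x^2 - 5x + 6 = 0.")
# tokens == ['find', 'the', 'real', 'solutions', 'to', 'x', '^', '2',
#            '-', '5', 'x', '+', '6', '=', '0', '.']
```

Keeping operators such as `^`, `+`, and `=` as separate tokens is a design choice in this sketch; math symbols may carry topic signal that a words-only tokenizer would discard.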