A comprehensive Natural Language Processing project for classifying news articles written in Emakhuwa, Mozambique's most widely spoken language, into different topic categories.
This project implements and evaluates various machine learning approaches for automatic topic classification of Emakhuwa news articles. The work includes extensive data analysis, preprocessing, feature extraction, and classification using both traditional ML and advanced NLP techniques.
The dataset consists of scraped Emakhuwa news articles with the following features:
- Headlines: News article titles in Emakhuwa
- Content: Full article text in Emakhuwa
- Categories: 7 topic categories (desporto, cultura, política, economia, sociedade, saúde, mundo)
- Translators: Information about who translated the articles
- Split: Pre-defined train/test division following the original study
- Total Articles: 2,434
- Training Set: 1,337 articles
- Test Set: 560 articles
- Categories: 7 (with class imbalance issues)
- Language: Emakhuwa (Mozambican native language)
- Data Analysis: Comprehensive profiling and exploratory data analysis
- Preprocessing: Text cleaning, tokenization, and stopword removal for Emakhuwa
- Feature Engineering: Implementation of various text representation methods
- Classification: Evaluation of multiple machine learning algorithms
- Comparison: Baseline comparison between full-text and headline-only classification
- Text Normalization: Conversion to lowercase
- Stopword Removal: Custom Emakhuwa stopwords identified through iterative analysis
- Text Cleaning: Removal of non-alphabetic characters and punctuation
- Tokenization: Word-level tokenization for feature extraction
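The four preprocessing steps above can be sketched as a single function. This is a minimal illustration and not the notebook's actual code: the `preprocess` name and the stopword set are assumptions, and the stopwords are placeholders, not real Emakhuwa words.

```python
import re

# Placeholder stopwords for illustration only; these are NOT real Emakhuwa stopwords.
PLACEHOLDER_STOPWORDS = {"aaa", "bbb"}

# Matches runs of letters, so digits, punctuation and underscores are dropped.
WORD_PATTERN = re.compile(r"[^\W\d_]+")


def preprocess(text, stopwords=PLACEHOLDER_STOPWORDS):
    """Lowercase the text, keep alphabetic tokens only, and drop stopwords."""
    tokens = WORD_PATTERN.findall(text.lower())
    return [tok for tok in tokens if tok not in stopwords]


sample = "Aaa, Maputo 2024: bbb NEWS!"
tokens = preprocess(sample)
print(tokens)
```

In practice the stopword set would be replaced by the custom Emakhuwa list built during the iterative analysis.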
- Bag of Words (BOW): Traditional count-based representation
- TF-IDF: Term Frequency-Inverse Document Frequency weighting
- Character N-grams: Character-level n-grams (2-5 characters)
- Word2Vec: Dense vector representations trained on the corpus
- Logistic Regression: Linear classification with L2 regularization
- Naive Bayes: Multinomial Naive Bayes for text classification
- XGBoost: Gradient boosting for handling complex patterns
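A minimal sketch of how two of these classifiers could be trained on TF-IDF features with scikit-learn pipelines. The training sentences are English placeholders, the hyperparameters are assumptions and not the notebook's settings, and XGBoost is omitted because it additionally needs integer-encoded labels.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy placeholder data; the labels reuse two of the dataset's category names.
train_texts = [
    "team wins the final match",
    "striker scores in the match",
    "market prices rise again",
    "bank raises interest rates",
]
train_labels = ["desporto", "desporto", "economia", "economia"]
test_texts = ["team wins the match", "prices rise in the market"]

models = {
    # LogisticRegression applies L2 regularization by default.
    "logistic_regression": make_pipeline(
        TfidfVectorizer(), LogisticRegression(max_iter=1000)
    ),
    # MultinomialNB accepts any non-negative features, including TF-IDF.
    "naive_bayes": make_pipeline(TfidfVectorizer(), MultinomialNB()),
}

predictions = {}
for name, model in models.items():
    model.fit(train_texts, train_labels)
    predictions[name] = [str(p) for p in model.predict(test_texts)]

print(predictions)
```

Wrapping the vectorizer and classifier in one pipeline ensures the vocabulary is learned from the training split only.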
- Full Text Classification: Using both headlines and content
- Headline-Only Classification: Baseline using only article titles
- Performance Metrics: Precision, Recall, F1-Score, and Accuracy
- Cross-Method Comparison: Systematic evaluation across all combinations
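To make the metrics concrete, here is a small pure-Python sketch that computes per-class precision, recall and F1 plus overall accuracy. The `per_class_metrics` helper and the example labels are hypothetical; the notebook most likely relies on scikit-learn's built-in metrics instead.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return ({label: (precision, recall, f1)}, accuracy)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gains a false positive
            fn[truth] += 1  # true label gains a false negative

    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)

    accuracy = sum(tp.values()) / len(y_true)
    return scores, accuracy


# Made-up labels for illustration; not results from the project.
y_true = ["desporto", "desporto", "cultura", "economia"]
y_pred = ["desporto", "cultura", "cultura", "economia"]
scores, accuracy = per_class_metrics(y_true, y_pred)
print(scores, accuracy)
```

Running the same function on the full-text predictions and on the headline-only predictions gives directly comparable numbers for the two setups.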
├── Emakhuwa.ipynb # Main analysis notebook
├── README.md # This file
├── gemini_headline.py # Additional script for testing an LLM for classification
├── emakhuwa_news_topic_classification.html # Result of the data profiling
└── Emakhuwa News Topic Classification Dataset.json # Dataset file
Create an environment and install the following libraries:
pip install pandas numpy matplotlib seaborn
pip install scikit-learn xgboost gensim
pip install wordcloud ydata-profiling
pip install nltk
- Clone the repository and navigate to the project directory
- Ensure the dataset file (Emakhuwa News Topic Classification Dataset.json) is in the same directory as the notebook
- Open and run Emakhuwa.ipynb in Jupyter Notebook/Lab
The notebook is structured in the following sections:
- Data Loading and Initial Exploration
- Comprehensive Data Profiling
- Data Preprocessing and Cleaning
- Feature Extraction and Representation
- Model Training and Evaluation
- Results Comparison and Visualization
The analysis reveals performance variations across different feature extraction and classification combinations:
- TF-IDF + Logistic Regression: Generally strong performance across categories
- BOW + XGBoost: Good handling of feature interactions
- Character N-grams: Effective for the morphologically rich Emakhuwa language
- Class Imbalance: Significant imbalance, with "desporto" and "cultura" as the largest classes
- Feature Importance: TF-IDF effectively reduces noise from common Emakhuwa function words
- Text Length Impact: Full text generally outperforms headline-only classification
- Language-Specific Patterns: Character n-grams capture Emakhuwa morphological patterns effectively
The notebook includes comprehensive visualizations:
- Word Clouds: Category-specific and overall vocabulary analysis
- Class Distribution: Bar charts showing category imbalances
- Performance Heatmaps: Model comparison across metrics
- Feature Analysis: Most important features for each classification approach