Emakhuwa News Topic Classification

A comprehensive Natural Language Processing project for classifying news articles written in Emakhuwa, Mozambique's most widely spoken language, into different topic categories.

📋 Overview

This project implements and evaluates various machine learning approaches for automatic topic classification of Emakhuwa news articles. The work includes extensive data analysis, preprocessing, feature extraction, and classification using both traditional ML and advanced NLP techniques.

📊 Dataset

The dataset consists of scraped Emakhuwa news articles with the following features:

Headlines: News article titles in Emakhuwa
Content: Full article text in Emakhuwa
Categories: 7 topic categories (desporto, cultura, política, economia, sociedade, saúde, mundo)
Translators: Information about who translated the articles
Split: Pre-defined train/test division following the original study

Dataset Statistics

Total Articles: 2,434
Training Set: 1,337 articles
Test Set: 560 articles
Categories: 7 (with class imbalance issues)
Language: Emakhuwa (Mozambican native language)
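
The class imbalance noted above can be quantified by counting labels per category. The following is a minimal sketch, not the notebook's code: the function name `class_shares` and the toy label list are assumptions made for illustration.

```python
from collections import Counter


def class_shares(labels):
    """Return {label: fraction of examples}, ordered from most to least frequent."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.most_common()}


# Toy labels for illustration only; the real values come from the dataset's category field.
example_labels = ["desporto", "desporto", "cultura", "saúde"]
shares = class_shares(example_labels)
print(shares)
```

A share far above 1/7 for one category would indicate that it dominates the seven-way split.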

🎯 Project Objectives

Data Analysis: Comprehensive profiling and exploratory data analysis
Preprocessing: Text cleaning, tokenization, and stopword removal for Emakhuwa
Feature Engineering: Implementation of various text representation methods
Classification: Evaluation of multiple machine learning algorithms
Comparison: Baseline comparison between full-text and headline-only classification

🛠️ Methodology

Data Preprocessing

Text Normalization: Conversion to lowercase
Stopword Removal: Custom Emakhuwa stopwords identified through iterative analysis
Text Cleaning: Removal of non-alphabetic characters and punctuation
Tokenization: Word-level tokenization for feature extraction
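
The four steps above could be combined roughly as follows. This is a sketch under stated assumptions: the stopword set is a hypothetical placeholder (the project's custom Emakhuwa list is not reproduced in this README), and the function name `preprocess` is illustrative.

```python
import re

# Hypothetical placeholder entries; the project's actual Emakhuwa stopword list is not shown here.
STOPWORDS = {"ni", "wa", "ya"}


def preprocess(text, stopwords=STOPWORDS):
    """Lowercase, strip non-alphabetic characters, tokenize on whitespace, drop stopwords."""
    text = text.lower()
    # Replace punctuation, digits and underscores with spaces; accented letters are kept.
    text = re.sub(r"[\W\d_]+", " ", text)
    return [token for token in text.split() if token not in stopwords]


tokens = preprocess("Nampula ni Maputo, 2024!")
print(tokens)
```

Note that this version also treats apostrophes as punctuation, which may or may not match the choice made in the notebook.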

Feature Extraction Methods

Bag of Words (BOW): Traditional count-based representation
TF-IDF: Term Frequency-Inverse Document Frequency weighting
Character N-grams: Character-level n-grams (2-5 characters)
Word2Vec: Dense vector representations trained on the corpus
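
The three sparse representations can be built with scikit-learn vectorizers, as sketched below. The toy documents are placeholders, and the choice of `analyzer="char_wb"` (n-grams that do not cross word boundaries) is an assumption; the notebook may use `analyzer="char"` or different settings. Word2Vec is left out of this sketch because it depends on gensim rather than scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy placeholder documents; in the project these would be the preprocessed articles.
docs = ["aaa bbb ccc", "aaa bbb", "ccc ddd ccc"]

bow = CountVectorizer()    # raw term counts
tfidf = TfidfVectorizer()  # counts reweighted by inverse document frequency
# Character n-grams of length 2 to 5, restricted to word boundaries (assumed setting).
char_ngrams = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5))

X_bow = bow.fit_transform(docs)
X_tfidf = tfidf.fit_transform(docs)
X_char = char_ngrams.fit_transform(docs)

print(X_bow.shape, X_tfidf.shape, X_char.shape)
```

Each call returns a sparse matrix with one row per document, so the outputs can be passed directly to the classifiers listed below.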

Classification Algorithms

Logistic Regression: Linear classification with L2 regularization
Naive Bayes: Multinomial Naive Bayes for text classification
XGBoost: Gradient boosting for handling complex patterns

Evaluation Approaches

Full Text Classification: Using both headlines and content
Headline-Only Classification: Baseline using only article titles
Performance Metrics: Precision, Recall, F1-Score, and Accuracy
Cross-Method Comparison: Systematic evaluation across all combinations
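
For reference, the metrics above can be computed per category as in the following dependency-free sketch. The function names are illustrative, and the macro average shown here is an assumption: this README does not state which averaging the notebook reports. Macro averaging weights every category equally, which is often preferred when classes are imbalanced.

```python
def per_class_scores(y_true, y_pred):
    """Return ({label: (precision, recall, f1)}, accuracy) for two equal-length label lists."""
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = (precision, recall, f1)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return scores, accuracy


def macro_f1(scores):
    """Unweighted mean of the per-class F1 values."""
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


# Toy labels for illustration only.
scores, accuracy = per_class_scores(["a", "a", "b", "b"], ["a", "b", "b", "b"])
print(scores, accuracy, macro_f1(scores))
```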

📁 File Structure

├── Emakhuwa.ipynb                 # Main analysis notebook
├── README.md                      # This file
├── gemini_headline.py             # Additional script for testing an LLM for classification
├── emakhuwa_news_topic_classification.html  # Result of the data profiling
└── Emakhuwa News Topic Classification Dataset.json  # Dataset file

🚀 Getting Started

Prerequisites

Create a virtual environment and install the following libraries:

pip install pandas numpy matplotlib seaborn
pip install scikit-learn xgboost gensim
pip install wordcloud ydata-profiling
pip install nltk

Running the Analysis

Clone the repository and navigate to the project directory
Ensure the dataset file (Emakhuwa News Topic Classification Dataset.json) is in the same directory as the notebook
Open and run Emakhuwa.ipynb in Jupyter Notebook/Lab

The notebook is structured in the following sections:

Data Loading and Initial Exploration
Comprehensive Data Profiling
Data Preprocessing and Cleaning
Feature Extraction and Representation
Model Training and Evaluation
Results Comparison and Visualization

📈 Key Results

Best Performing Models

The analysis reveals performance variations across different feature extraction and classification combinations:

TF-IDF + Logistic Regression: Generally strong performance across categories
BOW + XGBoost: Good handling of feature interactions
Character N-grams: Effective for morphologically rich Emakhuwa language

Important Findings

Class Imbalance: Significant imbalance with "desporto" and "cultura" as major classes
Feature Importance: TF-IDF effectively reduces noise from common Emakhuwa function words
Text Length Impact: Full text generally outperforms headline-only classification
Language-Specific Patterns: Character n-grams capture Emakhuwa morphological patterns effectively

📊 Visualizations

The notebook includes comprehensive visualizations:

Word Clouds: Category-specific and overall vocabulary analysis
Class Distribution: Bar charts showing category imbalances
Performance Heatmaps: Model comparison across metrics
Feature Analysis: Most important features for each classification approach
