T-yago/PLN

Emakhuwa News Topic Classification

A comprehensive Natural Language Processing project for classifying news articles written in Emakhuwa, Mozambique's most widely spoken language, into different topic categories.

📋 Overview

This project implements and evaluates various machine learning approaches for automatic topic classification of Emakhuwa news articles. The work includes extensive data analysis, preprocessing, feature extraction, and classification using both traditional ML and advanced NLP techniques.

📊 Dataset

The dataset consists of scraped Emakhuwa news articles with the following features:

  • Headlines: News article titles in Emakhuwa
  • Content: Full article text in Emakhuwa
  • Categories: 7 topic categories (desporto, cultura, política, economia, sociedade, saúde, mundo)
  • Translators: Information about who translated the articles
  • Split: Pre-defined train/test division following the original study
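As a rough illustration of how the pre-defined split might be applied, the sketch below parses a few placeholder records and separates them by their split label. The field names (`headline`, `content`, `category`, `split`) and the record layout are assumptions made for this example; the actual schema of the dataset file may differ.

```python
import json
from collections import Counter

# Placeholder records. The keys used here are assumed for illustration and
# are not confirmed against the real dataset file.
raw = """[
  {"headline": "h1", "content": "c1", "category": "desporto", "split": "train"},
  {"headline": "h2", "content": "c2", "category": "desporto", "split": "train"},
  {"headline": "h3", "content": "c3", "category": "cultura", "split": "train"},
  {"headline": "h4", "content": "c4", "category": "saúde", "split": "test"}
]"""


def split_records(records):
    """Separate records into train and test lists using their 'split' field."""
    train = [r for r in records if r["split"] == "train"]
    test = [r for r in records if r["split"] == "test"]
    return train, test


records = json.loads(raw)
train, test = split_records(records)

# Counting labels in the training portion gives a quick view of class balance.
label_counts = Counter(r["category"] for r in train)
print(len(train), len(test), label_counts.most_common())
```

For the real data, the inline string would be replaced by reading `Emakhuwa News Topic Classification Dataset.json`, adjusting the keys to whatever the file actually contains.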

Dataset Statistics

  • Total Articles: 2,434
  • Training Set: 1,337 articles
  • Test Set: 560 articles
  • Categories: 7 (with class imbalance issues)
  • Language: Emakhuwa (Mozambican native language)

🎯 Project Objectives

  1. Data Analysis: Comprehensive profiling and exploratory data analysis
  2. Preprocessing: Text cleaning, tokenization, and stopword removal for Emakhuwa
  3. Feature Engineering: Implementation of various text representation methods
  4. Classification: Evaluation of multiple machine learning algorithms
  5. Comparison: Baseline comparison between full-text and headline-only classification

🛠️ Methodology

Data Preprocessing

  • Text Normalization: Conversion to lowercase
  • Stopword Removal: Custom Emakhuwa stopwords identified through iterative analysis
  • Text Cleaning: Removal of non-alphabetic characters and punctuation
  • Tokenization: Word-level tokenization for feature extraction
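The four steps above can be sketched as a single function. This is a minimal illustration rather than the notebook's implementation: the function name `preprocess` is hypothetical, and the stopword set must be supplied by the caller because the project's custom Emakhuwa list is not reproduced here.

```python
import re

# Matches runs of letters (Unicode-aware), so accented characters survive
# while digits, underscores, and punctuation are dropped.
TOKEN_RE = re.compile(r"[^\W\d_]+")


def preprocess(text, stopwords=frozenset()):
    """Lowercase, keep alphabetic tokens only, then drop stopwords.

    `stopwords` is a placeholder parameter; pass the custom Emakhuwa list.
    """
    tokens = TOKEN_RE.findall(text.lower())
    return [t for t in tokens if t not in stopwords]


# Generic placeholder input, not real Emakhuwa text.
sample_tokens = preprocess("Foo, BAR 123 baz!", stopwords={"bar"})
print(sample_tokens)
```

One consequence of this regex is that words containing apostrophes or hyphens are split into separate tokens; whether that is desirable for Emakhuwa orthography is a choice the sketch does not settle.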

Feature Extraction Methods

  1. Bag of Words (BOW): Traditional count-based representation
  2. TF-IDF: Term Frequency-Inverse Document Frequency weighting
  3. Character N-grams: Character-level n-grams (2-5 characters)
  4. Word2Vec: Dense vector representations trained on the corpus

Classification Algorithms

  • Logistic Regression: Linear classification with L2 regularization
  • Naive Bayes: Multinomial Naive Bayes for text classification
  • XGBoost: Gradient boosting for handling complex patterns
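A minimal sketch of how two of these classifiers might be paired with TF-IDF features in a scikit-learn pipeline. The training texts are English placeholders, not Emakhuwa, and the hyperparameters are assumptions rather than the notebook's settings. XGBoost is omitted here; its `XGBClassifier` would typically also need the string labels encoded as integers first.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Placeholder training data with two of the seven categories.
train_texts = [
    "goal match team",
    "team win match",
    "vote election party",
    "party election law",
]
train_labels = ["desporto", "desporto", "política", "política"]

models = {
    # LogisticRegression applies L2 regularization by default.
    "logreg": make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000)),
    "nb": make_pipeline(TfidfVectorizer(), MultinomialNB()),
}

predictions = {}
for name, model in models.items():
    model.fit(train_texts, train_labels)
    predictions[name] = model.predict(["match team goal", "election vote"]).tolist()

print(predictions)
```

Wrapping the vectorizer and classifier in one pipeline keeps the vocabulary fitted on training data only, which avoids leaking test-set terms into the features.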

Evaluation Approaches

  1. Full Text Classification: Using both headlines and content
  2. Headline-Only Classification: Baseline using only article titles
  3. Performance Metrics: Precision, Recall, F1-Score, and Accuracy
  4. Cross-Method Comparison: Systematic evaluation across all combinations
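To make the metrics concrete, here is a small standard-library sketch that computes per-class precision, recall, and F1 along with accuracy. The function names are hypothetical, the labels are invented for the example, and macro-averaging F1 is an assumption of this sketch (it weights every class equally, which is one common choice under class imbalance).

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return {label: {"precision", "recall", "f1"}} for every label seen."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label was wrong
            fn[t] += 1  # true label was missed
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(report):
    """Unweighted mean of per-class F1 scores."""
    return sum(m["f1"] for m in report.values()) / len(report)


# Invented labels for illustration only.
y_true = ["desporto", "desporto", "cultura", "saúde"]
y_pred = ["desporto", "cultura", "cultura", "saúde"]

report = per_class_metrics(y_true, y_pred)
print(accuracy(y_true, y_pred), macro_f1(report))
```

Running the same functions on predictions from a full-text model and a headline-only model, over the same test split, gives directly comparable numbers. scikit-learn's `classification_report` provides an equivalent summary.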

📁 File Structure

├── Emakhuwa.ipynb                 # Main analysis notebook
├── README.md                      # This file
├── gemini_headline.py             # Additional script for testing LLM for classification
├── emakhuwa_news_topic_classification.html  # Result of the data profiling
└── Emakhuwa News Topic Classification Dataset.json  # Dataset file

🚀 Getting Started

Prerequisites

Create an environment and install the following libraries:

pip install pandas numpy matplotlib seaborn
pip install scikit-learn xgboost gensim
pip install wordcloud ydata-profiling
pip install nltk

Running the Analysis

  1. Clone the repository and navigate to the project directory
  2. Ensure the dataset file is in the same directory
  3. Open and run Emakhuwa.ipynb in Jupyter Notebook/Lab

The notebook is structured in the following sections:

  • Data Loading and Initial Exploration
  • Comprehensive Data Profiling
  • Data Preprocessing and Cleaning
  • Feature Extraction and Representation
  • Model Training and Evaluation
  • Results Comparison and Visualization

📈 Key Results

Best Performing Models

The analysis reveals performance variations across different feature extraction and classification combinations:

  • TF-IDF + Logistic Regression: Generally strong performance across categories
  • BOW + XGBoost: Good handling of feature interactions
  • Character N-grams: Effective for the morphologically rich Emakhuwa language

Important Findings

  1. Class Imbalance: Significant imbalance with "desporto" and "cultura" as major classes
  2. Feature Importance: TF-IDF effectively reduces noise from common Emakhuwa function words
  3. Text Length Impact: Full text generally outperforms headline-only classification
  4. Language-Specific Patterns: Character n-grams capture Emakhuwa morphological patterns effectively

📊 Visualizations

The notebook includes comprehensive visualizations:

  • Word Clouds: Category-specific and overall vocabulary analysis
  • Class Distribution: Bar charts showing category imbalances
  • Performance Heatmaps: Model comparison across metrics
  • Feature Analysis: Most important features for each classification approach
