<h2 align="center">COMP8420 ADV NLP FINAL PROJECT</h2>
<h2 align="center">MultiLingAI: Multilingual Contextual Summarization for Global Enterprises</h2>

<h2 align="center">Submitted by:</h2>
<h4 align="center">Muhammad Haris Rizwan | Student ID: 47565284 </h4>
<h4 align="center">Syed Rafay Ali | Student ID: 47833920 </h4>

## __Table of Contents__

1. [Introduction](#1.-Introduction)
2. [Dataset](#2.-Dataset)
3. [Data Preprocessing](#3.-Data-Preprocessing)

# __1. Introduction__

![MULTILINGAI](images/MULTILINGAI_PIC.jpeg)

In this project, we assume the role of engineers at `MultiLingAI`, an IT company specializing in advanced Natural Language Processing (NLP) solutions for global enterprises. `MultiLingAI` offers a variety of services, including sentiment analysis, text summarization, named entity recognition, and chatbots. Our primary task is to develop and implement a multilingual summarization tool that addresses the unique challenges faced by these enterprises.

## __Problem Statement__
Global enterprises operate across multiple regions and languages, requiring accurate and context-preserving summaries of documents in various languages. This need is driven by the necessity to streamline operations, enhance communication, and ensure that vital information is accessible and understandable to all stakeholders, regardless of their linguistic background.

## __Objective__
The objective of our project is to develop a multilingual summarization tool that can generate accurate and contextually relevant summaries for documents written in multiple languages. This tool aims to maintain the integrity and key information of the original documents while making them concise and easy to understand for a diverse global audience.

## __Project Scope__
The scope of our project involves addressing the real-world challenge of handling and summarizing large volumes of multilingual documents.
* Our target users are global enterprises with diverse linguistic documentation needs. 
* By leveraging advanced NLP models such as mBERT, XLM-R, and multilingual T5, we aim to create a robust solution that can be seamlessly integrated into the company's existing systems.
* The project will include data collection, preprocessing, model training, evaluation, and integration phases, ensuring a comprehensive approach to solving this complex problem.

# __2. Dataset__

![dataset](images/dataset_pic.webp)

For our project on Multilingual Contextual Summarization for Global Enterprises, the dataset plays a critical role in ensuring the accuracy and relevance of the generated summaries. We selected a dataset that provides a diverse and comprehensive collection of multilingual documents, which is essential for training and evaluating our models.

## __Selected Dataset__
We will utilize the MLSUM dataset, which stands out as a large-scale multilingual summarization dataset. MLSUM contains over 1.5 million article-summary pairs in five different languages: French, German, Spanish, Russian, and Turkish. This dataset is particularly suitable for our project because it offers a wide variety of articles and summaries from reputable news sources, ensuring both the quality and diversity needed for robust model training.

## __References:__
* `MLSUM`: The Multilingual Summarization Corpus - This dataset was introduced to facilitate research in multilingual text summarization by providing a large-scale, diverse set of news articles and summaries. It includes articles from five languages and aims to enable new research directions in the text summarization community.

* `XL-Sum`: Large-Scale Multilingual Abstractive Summarization - XL-Sum provides an extensive collection of multilingual summarization data, enhancing the ability to develop models that perform well across various languages. This dataset complements MLSUM by offering additional resources and benchmarks for evaluating summarization models.

* Contrastive Aligned Joint Learning for Multilingual Summarization - This reference explores novel methods for improving multilingual summarization, focusing on contrastive learning strategies. It provides insights into the challenges and solutions for developing high-quality summarization models, which will be valuable for refining our approach.

## __Selected Dataset Details__
`MLSUM`: Contains over `1.5 million` article-summary pairs in five languages.

* __Languages__: French, German, Spanish, Russian, Turkish.
* __Source__: News articles from reputable sources.
* __Data Collection Process__: We will obtain the dataset from public repositories and preprocess it through tokenization, normalization, and language detection. This preprocessing step is crucial for preparing the data for model training.

# __3. Data Preprocessing__

![Process flow](images/process_flow_pic.webp)

* __Data cleaning__: Removing observations that fall outside the project scope.
* __Tokenization__: Splitting text into words or subwords to facilitate model understanding.
* __Normalization__: Standardizing text data to remove inconsistencies.
* __Language Detection__: Identifying and labeling the language of each document to ensure accurate processing.

By leveraging the MLSUM dataset and incorporating insights from the referenced works, we aim to develop a robust multilingual summarization tool that meets the needs of global enterprises, providing accurate and context-preserving summaries across multiple languages.
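
The cleaning, normalization, and language-labeling steps listed above can be sketched in plain Python. This is a minimal illustration rather than the project's actual pipeline: the helper names (`normalize_text`, `guess_language`, `clean_records`), the `min_chars` threshold, and the tiny stopword lists are all assumptions made for the example. A real pipeline would more likely rely on a dedicated language-identification library.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny stopword lists; illustrative only.
STOPWORDS = {
    "fr": {"le", "la", "les", "et", "des", "est", "une", "dans"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein", "mit"},
    "es": {"el", "los", "las", "y", "es", "una", "por", "con"},
}


def normalize_text(text):
    """Apply Unicode NFKC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def guess_language(text):
    """Return the language whose stopwords overlap most with the text, or None."""
    words = set(re.findall(r"\w+", text.lower()))
    scores = {lang: len(words & stops) for lang, stops in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def clean_records(records, min_chars=20):
    """Normalize each record, drop very short texts, and attach a language label."""
    cleaned = []
    for record in records:
        text = normalize_text(record["text"])
        if len(text) < min_chars:
            continue  # data cleaning: skip texts too short to summarize
        cleaned.append({**record, "text": text, "lang": guess_language(text)})
    return cleaned
```

Subword tokenization is left to the model tokenizer in the code cells that follow, so this sketch only covers the steps that run before it.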

In [13]:
from datasets import load_dataset
from transformers import MBart50Tokenizer, MBartForConditionalGeneration

#Load MLSUM dataset for French
dataset = load_dataset("mlsum", "fr")

In [14]:
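#Load the mBART-50 tokenizer with French as source and target language
#(if src_lang is not set, the tokenizer defaults to English)
tokenizer = MBart50Tokenizer.from_pretrained('facebook/mbart-large-50', src_lang='fr_XX', tgt_lang='fr_XX')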
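
Because mBART-50 prefixes each sequence with a language code, extending the pipeline beyond French needs a mapping from MLSUM configuration names to mBART-50 codes. The sketch below assumes the Hugging Face configuration names for MLSUM (`de`, `es`, `fr`, `ru`, `tu`) and the language codes published with mBART-50. The dictionary and the `mbart_code_for` helper are illustrative names, not part of either library.

```python
# Assumed mapping from MLSUM configuration names to mBART-50 language codes.
# Note that MLSUM on the Hugging Face Hub uses "tu" (not "tr") for Turkish.
MLSUM_TO_MBART50 = {
    "de": "de_DE",
    "es": "es_XX",
    "fr": "fr_XX",
    "ru": "ru_RU",
    "tu": "tr_TR",
}


def mbart_code_for(mlsum_config):
    """Return the mBART-50 language code for an MLSUM configuration name."""
    try:
        return MLSUM_TO_MBART50[mlsum_config]
    except KeyError:
        raise ValueError(f"Unsupported MLSUM configuration: {mlsum_config!r}") from None
```

The returned code could then be passed as `src_lang` and `tgt_lang` when loading the tokenizer for another language.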


In [15]:
#Preprocessing function
def preprocess_function(examples):
    inputs = tokenizer(examples['text'], max_length=512, padding='max_length', truncation=True)
    targets = tokenizer(examples['summary'], max_length=150, padding='max_length', truncation=True)
    return {
        'input_ids': inputs['input_ids'],
        'attention_mask': inputs['attention_mask'],
        'labels': targets['input_ids']
    }
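
One detail the function above leaves open is how padded label positions are treated during training. With `padding='max_length'`, the label sequences contain pad token ids, and a common convention for Hugging Face sequence-to-sequence models is to replace those ids with `-100` so the loss ignores them. The helper below is a minimal sketch of that convention. `mask_label_padding` is a hypothetical name, and the token ids and the pad id of `1` in the usage lines are example values only (in practice the pad id would come from `tokenizer.pad_token_id`).

```python
IGNORE_INDEX = -100  # ignored by PyTorch's cross-entropy loss by default


def mask_label_padding(label_batch, pad_token_id, ignore_index=IGNORE_INDEX):
    """Replace pad ids in each label sequence so they do not contribute to the loss."""
    return [
        [ignore_index if token_id == pad_token_id else token_id for token_id in labels]
        for labels in label_batch
    ]


# Example with made-up token ids and a pad id of 1.
masked = mask_label_padding([[250008, 47, 2, 1, 1]], pad_token_id=1)
print(masked)
```

Applied inside `preprocess_function`, this would wrap `targets['input_ids']` before it is returned as `labels`.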

In [None]:
#Applying the preprocessing to the dataset
dataset = dataset.map(preprocess_function, batched=True)

In [None]:
#Optional: Save the processed dataset
dataset.save_to_disk("processed_mlsum")