<h2 align="center">COMP8420 ADV NLP FINAL PROJECT</h2>
<h2 align="center">MultiLingAI: Multilingual Contextual Summarization for Global Enterprises</h2>

<h2 align="center">Submitted by:</h2>
<h4 align="center">Muhammad Haris Rizwan | Student ID: 47565284 </h4>
<h4 align="center">Syed Rafay Ali | Student ID: 47833920 </h4>

## __Table of Contents__

1. [Introduction](#1.-Introduction)
2. [Dataset](#2.-Dataset)
3. [Data Preprocessing](#3.-Data-Preprocessing)

# __1. Introduction__

![MULTILINGAI](images/MULTILINGAI_PIC.jpeg)

In this project, we assume the role of engineers at `MultiLingAI`, an IT company specializing in advanced Natural Language Processing (NLP) solutions for global enterprises. `MultiLingAI` offers a variety of services, including sentiment analysis, text summarization, named entity recognition, and chatbots. Our primary task is to develop and implement a multilingual summarization tool that addresses the unique challenges faced by these enterprises.

## __Problem Statement__
Global enterprises operate across multiple regions and languages, requiring accurate and context-preserving summaries of documents in various languages. This need is driven by the necessity to streamline operations, enhance communication, and ensure that vital information is accessible and understandable to all stakeholders, regardless of their linguistic background.

## __Objective__
The objective of our project is to develop a multilingual summarization tool that can generate accurate and contextually relevant summaries for documents written in multiple languages. This tool aims to maintain the integrity and key information of the original documents while making them concise and easy to understand for a diverse global audience.

## __Project Scope__
The scope of our project involves addressing the real-world challenge of handling and summarizing large volumes of multilingual documents.
* Our target users are global enterprises with diverse linguistic documentation needs. 
* By leveraging advanced NLP models such as mBERT, XLM-R, and multilingual T5, we aim to create a robust solution that can be seamlessly integrated into the company's existing systems.
* The project will include data collection, preprocessing, model training, evaluation, and integration phases, ensuring a comprehensive approach to solving this complex problem.

# __2. Dataset__

![dataset](images/dataset_pic.webp)

For our project on Multilingual Contextual Summarization for Global Enterprises, the dataset plays a critical role in ensuring the accuracy and relevance of the generated summaries. We have selected datasets that provide a diverse and comprehensive collection of multilingual documents, which are essential for training and evaluating our models.

## __Selected Dataset__
We will utilize the MLSUM dataset, which stands out as a large-scale multilingual summarization dataset. MLSUM contains over 1.5 million article-summary pairs in five different languages: French, German, Spanish, Russian, and Turkish. This dataset is particularly suitable for our project because it offers a wide variety of articles and summaries from reputable news sources, ensuring both the quality and diversity needed for robust model training.

## __References:__
* `MLSUM`: The Multilingual Summarization Corpus - This dataset was introduced to facilitate research in multilingual text summarization by providing a large-scale, diverse set of news articles and summaries. It includes articles from five languages and aims to enable new research directions in the text summarization community.

* `XL-Sum`: Large-Scale Multilingual Abstractive Summarization - XL-Sum provides an extensive collection of multilingual summarization data, enhancing the ability to develop models that perform well across various languages. This dataset complements MLSUM by offering additional resources and benchmarks for evaluating summarization models.

* Contrastive Aligned Joint Learning for Multilingual Summarization - This reference explores novel methods for improving multilingual summarization, focusing on contrastive learning strategies. It provides insights into the challenges and solutions for developing high-quality summarization models, which will be valuable for refining our approach.

## __Selected Dataset Details__

### __MLSUM:__

* Contents: Contains over 1.5 million article-summary pairs from five languages.
* Languages: French, German, Spanish, Russian, Turkish.
* Source: News articles from reputable sources.
* Data Collection Process: We will collect the dataset from public repositories and ensure it is preprocessed for tokenization, normalization, and language detection. This preprocessing step is crucial for preparing the data for model training.

### __CNN/DailyMail:__

* Contents: Contains over 300,000 article-summary pairs.
* Languages: English.
* Source: News articles primarily from CNN and the Daily Mail, providing a rich source of diverse topics and high-quality journalism.
* Data Collection Process: The dataset is available through Hugging Face and is accessed directly with the `datasets` library. We apply the same preprocessing steps to it, such as normalization and tokenization. The dataset is structured into three splits (train, validation, and test), which supports both training and evaluation of summarization models. The `article` column contains the full text, while the `highlights` column contains the summaries.

# __3. Data Preprocessing__

![Process flow](images/process_flow_pic.webp)

* __Data Visualisation__: Loading and inspecting the dataset(s) to see what they contain and what can be done with them.
* __Data Cleaning__: Removing content that is outside the project scope, e.g. dropping extra columns, stripping punctuation, and lowercasing the text.
* __Normalization__: Standardizing text data to remove inconsistencies.
* __Tokenization__: Splitting text into words or subwords to facilitate model understanding.
* __Language Detection__: Identifying and labeling the language of each document to ensure accurate processing.

By leveraging the MLSUM dataset and incorporating insights from the referenced works, we aim to develop a robust multilingual summarization tool that meets the needs of global enterprises, providing accurate and context-preserving summaries across multiple languages.
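
The normalization, tokenization, and language detection steps listed above can be sketched with the standard library alone. This is a minimal illustration, not the pipeline used in the notebook: the function names and the tiny stopword lists are assumptions made for this example, and the stopword-overlap heuristic only stands in for a real detector such as the `langdetect` package used later in the project.

```python
import re
import unicodedata

# Tiny illustrative stopword lists (an assumption for this sketch, not a real language model)
STOPWORDS = {
    "fr": {"le", "la", "les", "et", "des", "une", "est", "dans"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein", "mit"},
    "en": {"the", "and", "is", "of", "to", "in", "that", "with"},
}

def normalize(text: str) -> str:
    """Apply Unicode NFKC normalization, lowercase, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"\s+", " ", text).strip()

def tokenize(text: str) -> list[str]:
    """Split text into word tokens (runs of letters, digits, or underscores)."""
    return re.findall(r"\w+", text)

def guess_language(tokens: list[str]) -> str:
    """Return the language whose stopword list overlaps most with the tokens."""
    counts = {lang: sum(tok in words for tok in tokens) for lang, words in STOPWORDS.items()}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else "unknown"

sample = "  Le  chat est\u00a0dans la maison.  "
tokens = tokenize(normalize(sample))
print(tokens)                  # ['le', 'chat', 'est', 'dans', 'la', 'maison']
print(guess_language(tokens))  # fr
```

Keeping each step as a small pure function makes it easy to apply the same chain to the `text` and `summary` columns and to swap in a stronger tokenizer or detector later.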

In [1]:
# Relevant libraries

# dataset
from datasets import load_dataset, Dataset
import pandas as pd
import string
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

# Download necessary NLTK data
nltk.download('punkt')
nltk.download('stopwords')

# model 
from transformers import MBart50Tokenizer, MBartForConditionalGeneration, Trainer, TrainingArguments

[nltk_data] Downloading package punkt to /Users/rafay/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/rafay/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
2024-06-08 02:10:04.920947: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [2]:
import sys
print("Python version")
print(sys.version)
print("Version info.")
print(sys.version_info)

Python version
3.11.3 (main, Apr 19 2023, 18:51:09) [Clang 14.0.6 ]
Version info.
sys.version_info(major=3, minor=11, micro=3, releaselevel='final', serial=0)


## STEP 1: Data Visualisation

In [3]:
#Load MLSUM dataset for French
dataset_fr = load_dataset("mlsum", "fr")



In [4]:
# Load MLSUM dataset for German
dataset_de = load_dataset("mlsum", "de")

In [4]:
# Load the CNN/DailyMail dataset for English
dataset_eng = load_dataset('cnn_dailymail', '3.0.0')

In [23]:
# Print dataset details
print("DATASET-FRENCH DETAILS:",dataset_fr.shape)
print("DATASET-GERMAN DETAILS:", dataset_de.shape)
print("DATASET-ENGLISH DETAILS:",dataset_eng.shape)

DATASET-FRENCH DETAILS: {'train': (392902, 6), 'validation': (16059, 6), 'test': (15828, 6)}
DATASET-GERMAN DETAILS: {'train': (220887, 6), 'validation': (11394, 6), 'test': (10701, 6)}
DATASET-ENGLISH DETAILS: {'train': (287113, 3), 'validation': (13368, 3), 'test': (11490, 3)}


In [5]:
# Split each dataset into its train, validation, and test sets
train_fr = dataset_fr['train']
validation_fr = dataset_fr['validation']
test_fr = dataset_fr['test']

train_de = dataset_de['train']
validation_de = dataset_de['validation']
test_de = dataset_de['test']

train_eng = dataset_eng['train']
validation_eng = dataset_eng['validation']
test_eng = dataset_eng['test']

In [11]:
# Convert to Pandas DataFrame
train_df_fr = pd.DataFrame(train_fr)
validation_df_fr = pd.DataFrame(validation_fr)
test_df_fr = pd.DataFrame(test_fr)

train_df_de = pd.DataFrame(train_de)
validation_df_de = pd.DataFrame(validation_de)
test_df_de = pd.DataFrame(test_de)

train_df_eng = pd.DataFrame(train_eng)
validation_df_eng = pd.DataFrame(validation_eng)
test_df_eng = pd.DataFrame(test_eng)

In [30]:
train_df_fr.head(2)

| | text | summary | topic | url | title | date |
|---|---|---|---|---|---|---|
| 0 | Jean-Jacques Schuhl, Gilles Leroy, Christian G... | Jean-Jacques Schuhl, Gilles Leroy, Christian G... | livres | https://www.lemonde.fr/livres/article/2010/01/... | La rentrée littéraire promet un programme de b... | 01/01/2010 |
| 1 | Une semaine après l'attaque terroriste manquée... | Cette demande intervient une semaine après l'a... | proche-orient | https://www.lemonde.fr/proche-orient/article/2... | Gordon Brown appelle à une réunion internation... | 01/01/2010 |


In [32]:
train_df_de.sample(2)

| | text | summary | topic | url | title | date |
|---|---|---|---|---|---|---|
| 79051 | Unter den Internetunternehmen ist Twitter eine... | Unter den Internetunternehmen ist Twitter eine... | wirtschaft | https://www.sueddeutsche.de/wirtschaft/twitter... | Twitter-Börsengang: Reich durch Zwitschern | 00/10/2013 |
| 71089 | Nach einem Sabbatjahr kehrt Ronnie O'Sullivan ... | Nach einem Sabbatjahr kehrt Ronnie O'Sullivan ... | sport | https://www.sueddeutsche.de/sport/snooker-welt... | Snooker-Weltmeister Ronnie O'Sullivan - Erholt... | 00/05/2013 |


In [31]:
train_df_eng.head(2)

| | article | highlights | id |
|---|---|---|---|
| 0 | LONDON, England (Reuters) -- Harry Potter star... | Harry Potter star Daniel Radcliffe gets £20M f... | 42c027e4ff9730fbb3de84c1af0d2c506e41c3e4 |
| 1 | Editor's note: In our Behind the Scenes series... | Mentally ill inmates in Miami are housed on th... | ee8871b15c50d0db17b0179a6d2beab35065f1e9 |


In [33]:
test_df_eng.sample(2)

| | article | highlights | id |
|---|---|---|---|
| 3788 | After West Ham announced a vast reduction in s... | West Ham became first Premier League club to d... | fd984802497ba123291ca39b1f763d3ae195d831 |
| 6292 | Hassan Munshi, one of two teenagers feared to ... | Families of Hassan Munshi and Talha Asmal, bot... | b410ef51a9d6c9b4566b2d69b02e877500f07357 |


In [6]:
%%time
# only want the text/article and summary/highlights columns from the three datasets (French, German, English) for now
# Convert to Pandas DataFrame and select only the required columns (French)
train_df_fr1 = pd.DataFrame(train_fr)[['text', 'summary']]
validation_df_fr1 = pd.DataFrame(validation_fr)[['text', 'summary']]
test_df_fr1 = pd.DataFrame(test_fr)[['text', 'summary']]

CPU times: user 8.7 s, sys: 1.51 s, total: 10.2 s
Wall time: 15.3 s


In [7]:
%%time
# Convert to Pandas DataFrame and select only the required columns (German)
train_df_de1 = pd.DataFrame(train_de)[['text', 'summary']]
validation_df_de1 = pd.DataFrame(validation_de)[['text', 'summary']]
test_df_de1 = pd.DataFrame(test_de)[['text', 'summary']]

CPU times: user 4.6 s, sys: 707 ms, total: 5.31 s
Wall time: 8.11 s


In [8]:
%%time
# Convert to Pandas DataFrame and select only the required columns (English)
train_df_eng1 = pd.DataFrame(train_eng)[['article', 'highlights']]
validation_df_eng1 = pd.DataFrame(validation_eng)[['article', 'highlights']]
test_df_eng1 = pd.DataFrame(test_eng)[['article', 'highlights']]

CPU times: user 3.62 s, sys: 1.19 s, total: 4.81 s
Wall time: 10.2 s


In [36]:
# Display the resulting DataFrames to verify the columns
train_df_fr1.head(2)

| | text | summary |
|---|---|---|
| 0 | Jean-Jacques Schuhl, Gilles Leroy, Christian G... | Jean-Jacques Schuhl, Gilles Leroy, Christian G... |
| 1 | Une semaine après l'attaque terroriste manquée... | Cette demande intervient une semaine après l'a... |


In [34]:
train_df_de1.tail(2)

| | text | summary |
|---|---|---|
| 220885 | In Deutschland gibt es ihn bisher vor allem an... | Explosiv und höllisch stark: Das chinesische N... |
| 220886 | Der Weihnachtsbaum vor dem Reichstag ist eine ... | Die Deutschen lieben ihre Weihnachtsbäume. Abe... |


In [41]:
train_df_eng1.head(2)

| | article | highlights |
|---|---|---|
| 0 | LONDON, England (Reuters) -- Harry Potter star... | Harry Potter star Daniel Radcliffe gets £20M f... |
| 1 | Editor's note: In our Behind the Scenes series... | Mentally ill inmates in Miami are housed on th... |


## STEP 2: Data Cleaning: Normalisation

In [9]:
# Function to preprocess text
def preprocess_text(text):
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Convert to lowercase
    text = text.lower()
    return text

# Function to preprocess the dataset
def preprocess_dataset(df, text_column, summary_column):
    df[text_column] = df[text_column].apply(preprocess_text)
    df[summary_column] = df[summary_column].apply(preprocess_text)
    return df
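
One side effect of `preprocess_text` is visible in the outputs further down: deleting the apostrophe fuses `l'attaque` into `lattaque`, and `string.punctuation` only covers ASCII, so characters such as French guillemets are left in place. A possible alternative, sketched below under the assumption that replacing punctuation with a space is acceptable for this task, uses Unicode categories instead. The function name `strip_punctuation` is introduced here for illustration only. Note that it differs from the original in one respect: ASCII symbols such as `$`, `+`, or `=` belong to the Unicode symbol categories rather than punctuation, so this variant keeps them.

```python
import unicodedata

def strip_punctuation(text: str) -> str:
    """Lowercase text and replace every Unicode punctuation character with a space."""
    cleaned = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch
        for ch in text
    )
    # Collapse the runs of whitespace left behind by the replacements
    return " ".join(cleaned.lower().split())

print(strip_punctuation("Après l'attaque, «Jean-Jacques» a dit: non!"))
# après l attaque jean jacques a dit non
```

It can be applied column-wise in the same way as the original, for example `df['text'].map(strip_punctuation)`.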

In [10]:
%%time
# Preprocess the French dataset
train_df_fr_processed = preprocess_dataset(train_df_fr1, 'text', 'summary')
validation_df_fr_processed = preprocess_dataset(validation_df_fr1, 'text', 'summary')
test_df_fr_processed = preprocess_dataset(test_df_fr1, 'text', 'summary')

CPU times: user 51.5 s, sys: 5.85 s, total: 57.3 s
Wall time: 1min 31s


In [11]:
%%time
# Preprocess the German dataset
train_df_de_processed = preprocess_dataset(train_df_de1, 'text', 'summary')
validation_df_de_processed = preprocess_dataset(validation_df_de1, 'text', 'summary')
test_df_de_processed = preprocess_dataset(test_df_de1, 'text', 'summary')

CPU times: user 31.3 s, sys: 1.84 s, total: 33.1 s
Wall time: 38 s


In [12]:
%%time
# Preprocess the English dataset
train_df_eng_processed = preprocess_dataset(train_df_eng1, 'article', 'highlights')
validation_df_eng_processed = preprocess_dataset(validation_df_eng1, 'article', 'highlights')
test_df_eng_processed = preprocess_dataset(test_df_eng1, 'article', 'highlights')

CPU times: user 30.8 s, sys: 3.77 s, total: 34.5 s
Wall time: 53.1 s


In [50]:
# Display the preprocessed French DataFrame
train_df_fr_processed.head(2)

| | text | summary |
|---|---|---|
| 0 | jeanjacques schuhl gilles leroy christian gail... | jeanjacques schuhl gilles leroy christian gail... |
| 1 | une semaine après lattaque terroriste manquée ... | cette demande intervient une semaine après lat... |


In [48]:
# Display the preprocessed German DataFrame
train_df_de_processed.tail(2)

| | text | summary |
|---|---|---|
| 220885 | in deutschland gibt es ihn bisher vor allem an... | explosiv und höllisch stark das chinesische na... |
| 220886 | der weihnachtsbaum vor dem reichstag ist eine ... | die deutschen lieben ihre weihnachtsbäume aber... |


In [51]:
# Display the preprocessed English DataFrame
train_df_eng_processed.head(2)

| | article | highlights |
|---|---|---|
| 0 | london england reuters harry potter star dani... | harry potter star daniel radcliffe gets £20m f... |
| 1 | editors note in our behind the scenes series c... | mentally ill inmates in miami are housed on th... |


In [13]:
# For the sake of consistency, we rename the English dataset's columns to match the other two datasets:
# Rename the columns in the English dataset
train_df_eng_processed.rename(columns={'article': 'text', 'highlights': 'summary'}, inplace=True)
validation_df_eng_processed.rename(columns={'article': 'text', 'highlights': 'summary'}, inplace=True)
test_df_eng_processed.rename(columns={'article': 'text', 'highlights': 'summary'}, inplace=True)

In [14]:
# Display the resulting DataFrames to verify the column names
train_df_eng_processed.head(2)

| | text | summary |
|---|---|---|
| 0 | london england reuters harry potter star dani... | harry potter star daniel radcliffe gets £20m f... |
| 1 | editors note in our behind the scenes series c... | mentally ill inmates in miami are housed on th... |


## STEP 3: Data size reduction

Reducing the size of our training dataset is a crucial step to ensure the efficient use of computational resources and prevent potential crashes during the training process. Given the large size of our datasets—such as the French dataset with nearly `400,000` observations—it is important to balance the need for a representative sample with the limitations of our computing environment. 

By randomly sampling a subset of `100,000` observations, we maintain the diversity and representativeness of the data while significantly decreasing the computational load. This reduction allows us to streamline the training process, making it more manageable and ensuring smoother execution within the constraints of our Jupyter Notebook environment. This approach is particularly useful at this stage, as it facilitates faster iterations and debugging, ultimately leading to more efficient model development and refinement.
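
One practical caveat with fixed-size sampling: pandas' `DataFrame.sample(n=...)` raises a `ValueError` when `n` exceeds the number of rows (unless sampling with replacement), which matters if the same cell is reused on a smaller split. The sketch below shows the capping idea with the standard library; the helper name `sample_indices` is hypothetical, and the resulting indices could be passed to `df.iloc[...]`.

```python
import random

def sample_indices(population_size: int, n: int, seed: int = 42) -> list[int]:
    """Return up to n distinct row indices, reproducibly for a given seed."""
    k = min(n, population_size)  # never request more rows than exist
    rng = random.Random(seed)    # local generator, so the global random state is untouched
    return sorted(rng.sample(range(population_size), k))

# French training split: 392,902 rows, capped sample of 100,000
fr_idx = sample_indices(392902, 100000)
print(len(fr_idx))                     # 100000

# A split smaller than the requested sample is returned in full
print(len(sample_indices(500, 1000)))  # 500
```

Fixing the seed, as the notebook does with `random_state=42`, keeps the subset identical across runs, so results stay comparable between experiments.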

In [15]:
# Randomly sample 100,000 observations from each training dataset

# French dataset
train_df_fr_sampled = train_df_fr_processed.sample(n=100000, random_state=42)

# German dataset
train_df_de_sampled = train_df_de_processed.sample(n=100000, random_state=42)

# English dataset
train_df_eng_sampled = train_df_eng_processed.sample(n=100000, random_state=42)

In [38]:
# English dataset (smaller for initial training purposes)
train_df_eng_demo = train_df_eng_processed.sample(n=10000, random_state=42)
validation_df_eng_demo = validation_df_eng_processed.sample(n=1000, random_state=42)
test_df_eng_demo = test_df_eng_processed.sample(n=1000, random_state=42)

In [41]:
train_df_eng_demo.shape

(10000, 2)

## STEP 4: Tokenisation

In [16]:
# Copy the sampled training sets and the processed validation/test sets into new *_tokenised variables for all three datasets
train_df_fr_tokenised = train_df_fr_sampled.copy()
validation_df_fr_tokenised = validation_df_fr_processed.copy()
test_df_fr_tokenised = test_df_fr_processed.copy()

train_df_de_tokenised = train_df_de_sampled.copy()
validation_df_de_tokenised = validation_df_de_processed.copy()
test_df_de_tokenised = test_df_de_processed.copy()

train_df_eng_tokenised = train_df_eng_sampled.copy()
validation_df_eng_tokenised = validation_df_eng_processed.copy()
test_df_eng_tokenised = test_df_eng_processed.copy()

In [24]:
# Verify the copies by printing the shape of each new French DataFrame
print(train_df_fr_tokenised.shape)
print(validation_df_fr_tokenised.shape)
print(test_df_fr_tokenised.shape)

(100000, 2)
(16059, 2)
(15828, 2)


In [25]:
#Load tokenizer
tokenizer = MBart50Tokenizer.from_pretrained('facebook/mbart-large-50')

In [26]:
# Preprocessing function for tokenization
def preprocess_function(examples):
    inputs = tokenizer(examples['text'], max_length=700, padding='max_length', truncation=True)
    targets = tokenizer(examples['summary'], max_length=150, padding='max_length', truncation=True)
    return {
        'input_ids': inputs['input_ids'],
        'attention_mask': inputs['attention_mask'],
        'labels': targets['input_ids']
    }
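
Because the summaries are padded to `max_length=150`, the `labels` returned above contain padding token ids. Hugging Face sequence-to-sequence models compute their loss with cross-entropy that ignores positions labelled `-100`, so a common refinement is to replace padding ids in the labels with `-100` so that padding does not contribute to the loss. The sketch below is one way to do this for a batch of label sequences; the helper name `mask_padding` and the toy token ids are assumptions for illustration, and the real padding id should be read from `tokenizer.pad_token_id`.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss

def mask_padding(label_ids: list[list[int]], pad_token_id: int) -> list[list[int]]:
    """Replace padding token ids in each label sequence with IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if tok == pad_token_id else tok for tok in seq]
        for seq in label_ids
    ]

# Toy batch; pad_token_id=1 is an assumption for the demo
labels = [[250004, 17, 42, 2, 1, 1], [250004, 99, 2, 1, 1, 1]]
print(mask_padding(labels, pad_token_id=1))
# [[250004, 17, 42, 2, -100, -100], [250004, 99, 2, -100, -100, -100]]
```

This expects a list of sequences, which matches the batched `Dataset.map(..., batched=True)` usage; for row-wise calls the single sequence would need to be wrapped in a list first.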

In [28]:
# Apply the preprocessing function to the French dataset
train_df_fr_tokenized = train_df_fr_tokenised.apply(preprocess_function, axis=1)
validation_df_fr_tokenized = validation_df_fr_tokenised.apply(preprocess_function, axis=1)
test_df_fr_tokenized = test_df_fr_tokenised.apply(preprocess_function, axis=1)

In [30]:
# Repeat for the German dataset
train_df_de_tokenized = train_df_de_tokenised.apply(preprocess_function, axis=1)
validation_df_de_tokenized = validation_df_de_tokenised.apply(preprocess_function, axis=1)
test_df_de_tokenized = test_df_de_tokenised.apply(preprocess_function, axis=1)

In [31]:
%%time
# Repeat for the English dataset
train_df_eng_tokenized = train_df_eng_tokenised.apply(preprocess_function, axis=1)
validation_df_eng_tokenized = validation_df_eng_tokenised.apply(preprocess_function, axis=1)
test_df_eng_tokenized = test_df_eng_tokenised.apply(preprocess_function, axis=1)

CPU times: user 7min 9s, sys: 37.4 s, total: 7min 46s
Wall time: 10min 2s


In [50]:
# Convert DataFrame to Dataset format expected by Hugging Face Trainer
train_df_eng_hf = Dataset.from_pandas(train_df_eng_demo)
validation_df_eng_hf = Dataset.from_pandas(validation_df_eng_demo)
test_df_eng_hf = Dataset.from_pandas(test_df_eng_demo)

# Apply the preprocessing function to the datasets
train_df_eng_demo_tokenized = train_df_eng_hf.map(preprocess_function, batched=True)
validation_df_eng_demo_tokenized = validation_df_eng_hf.map(preprocess_function, batched=True)
test_df_eng_demo_tokenized = test_df_eng_hf.map(preprocess_function, batched=True)

Map: 100%|████████████████████████| 10000/10000 [00:35<00:00, 282.51 examples/s]
Map: 100%|██████████████████████████| 1000/1000 [00:03<00:00, 283.46 examples/s]
Map: 100%|██████████████████████████| 1000/1000 [00:03<00:00, 289.46 examples/s]


# __4. Model Selection__

In [1]:
import pandas as pd
from datasets import Dataset, load_dataset
from transformers import BartTokenizer, BartForConditionalGeneration, Trainer, TrainingArguments

# Load the MLSUM dataset for French language
dataset_fr = load_dataset("mlsum", "fr")

# Convert the dataset to a pandas DataFrame
df = pd.DataFrame(dataset_fr['train'])

# Shuffle the DataFrame and reset the index
df = df.sample(frac=1).reset_index(drop=True)

# Select 10 rows for training
train_df = df.iloc[:10]

# Select 2 rows for validation
validation_df = df.iloc[10:12]

# Convert pandas DataFrames to Hugging Face Datasets
train_dataset = Dataset.from_pandas(train_df)
validation_dataset = Dataset.from_pandas(validation_df)

# Define the tokenizer and model
model_name = "facebook/bart-large-cnn"
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

# Preprocess the dataset
def preprocess_function(examples):
    # Tokenize the source articles
    model_inputs = tokenizer(examples['text'], max_length=1024, truncation=True, padding="max_length")

    # Tokenize the target summaries; text_target replaces the deprecated as_target_tokenizer() context manager
    labels = tokenizer(text_target=examples['summary'], max_length=150, truncation=True, padding="max_length")

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Tokenize the dataset
train_dataset = train_dataset.map(preprocess_function, batched=True)
validation_dataset = validation_dataset.map(preprocess_function, batched=True)

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    save_total_limit=3,
    load_best_model_at_end=True,
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset
)

# Fine-tune the model
trainer.train()

# Save the fine-tuned model
model.save_pretrained("fine-tuned-bart-mlsum-fr-sampled")
tokenizer.save_pretrained("fine-tuned-bart-mlsum-fr-sampled")

print("Training complete. The model has been saved ")


2024-06-12 16:50:34.552180: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.



| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | No log | 6.231728 |
| 2 | No log | 5.423596 |
| 3 | No log | 5.025204 |


Non-default generation parameters: {'max_length': 142, 'min_length': 56, 'early_stopping': True, 'num_beams': 4, 'length_penalty': 2.0, 'no_repeat_ngram_size': 3, 'forced_bos_token_id': 0, 'forced_eos_token_id': 2}
There were missing keys in the checkpoint model loaded: ['model.encoder.embed_tokens.weight', 'model.decoder.embed_tokens.weight', 'lm_head.weight'].

Training complete. The model has been saved 
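
The `No log` entries in the Training Loss column are most likely a consequence of the step count rather than a failure. With 10 training rows, a batch size of 4, and 3 epochs, the run performs fewer optimizer steps than `logging_steps=10`, so no training loss is logged before any evaluation. The arithmetic below is a sketch that assumes a single device and no gradient accumulation (the cell does not set either); the helper name is hypothetical.

```python
import math

def total_optimizer_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    """Optimizer steps for a run on one device without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # last batch may be partial
    return steps_per_epoch * epochs

steps = total_optimizer_steps(num_examples=10, batch_size=4, epochs=3)
logging_steps = 10  # value passed to TrainingArguments above

print(steps)                  # 9
print(steps < logging_steps)  # True
```

Lowering `logging_steps` below the total step count, or training on more rows, would make the training loss appear alongside the validation loss.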


# __5. Results Inference__

Since it was not possible to fine-tune the model, we identified suitable pretrained models and used them to run inference for this project.

In [3]:
import tkinter as tk
from tkinter import scrolledtext
from transformers import BartTokenizer, BartForConditionalGeneration, MarianMTModel, MarianTokenizer
from langdetect import detect, DetectorFactory
from langdetect.lang_detect_exception import LangDetectException

# Seed for reproducibility
DetectorFactory.seed = 0

# Load the fine-tuned model and tokenizer
model_name = "fine-tuned-bart-mlsum-fr-sampled"
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

# Load translation models and tokenizers
translation_models = {
    "fr": {
        "model": MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-fr-en"),
        "tokenizer": MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-fr-en"),
    },
    "de": {
        "model": MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-de-en"),
        "tokenizer": MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-de-en"),
    }
}

# Function to summarize text
def summarize_text(text, max_length=150, min_length=30):
    # Tokenize the input text (BART takes the raw text; a "summarize: " prefix is a T5 convention)
    inputs = tokenizer.encode(text, return_tensors="pt", max_length=1024, truncation=True)
    
    # Generate the summary
    summary_ids = model.generate(inputs, max_length=max_length, min_length=min_length, length_penalty=2.0, num_beams=4, early_stopping=True)
    
    # Decode the summary
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    
    return summary

# Function to translate text to English
def translate_to_english(text, source_lang):
    translation_tokenizer = translation_models[source_lang]["tokenizer"]
    translation_model = translation_models[source_lang]["model"]
    
    inputs = translation_tokenizer.encode(text, return_tensors="pt", truncation=True)
    translated_ids = translation_model.generate(inputs, max_length=512, num_beams=4, early_stopping=True)
    translated_text = translation_tokenizer.decode(translated_ids[0], skip_special_tokens=True)
    
    return translated_text

# Function to handle the summarize button click
def summarize():
    input_text = input_text_area.get("1.0", tk.END).strip()
    if not input_text:
        output_text_area.delete("1.0", tk.END)
        output_text_area.insert(tk.END, "Please enter text to summarize.")
        return

    try:
        detected_language = detect(input_text)
    except LangDetectException:
        output_text_area.delete("1.0", tk.END)
        output_text_area.insert(tk.END, "Language detection failed. Please enter a valid text.")
        return
    
    language_label_var.set(f"Detected Language: {detected_language}")

    if detected_language in ["fr", "de", "en"]:
        summary = summarize_text(input_text)
        if detected_language != "en":
            summary = translate_to_english(summary, detected_language)
    else:
        summary = f"Unsupported language detected: {detected_language}"

    output_text_area.delete("1.0", tk.END)
    output_text_area.insert(tk.END, summary)

# Create the main window
root = tk.Tk()
root.title("MultiLingAI")

# Create and place the input text area
input_text_label = tk.Label(root, text="Input Text:")
input_text_label.pack()
input_text_area = scrolledtext.ScrolledText(root, wrap=tk.WORD, width=60, height=10)
input_text_area.pack(padx=10, pady=10)

# Create and place the summarize button
summarize_button = tk.Button(root, text="Summarize", command=summarize)
summarize_button.pack(pady=10)

# Create and place the language detection label
language_label_var = tk.StringVar(value="Detected Language: N/A")
language_label = tk.Label(root, textvariable=language_label_var)
language_label.pack()

# Create and place the output text area
output_text_label = tk.Label(root, text="Summary:")
output_text_label.pack()
output_text_area = scrolledtext.ScrolledText(root, wrap=tk.WORD, width=60, height=10)
output_text_area.pack(padx=10, pady=10)

# Start the main loop
root.mainloop()
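
The language branching inside `summarize()` is tied to the Tkinter widgets, which makes it awkward to test. One option, sketched here as an assumption rather than as part of the original application, is to isolate the decision in a pure function that returns the steps to run for a detected language code. The names `SUPPORTED`, `NEEDS_TRANSLATION`, and `plan_steps` are hypothetical.

```python
SUPPORTED = {"fr", "de", "en"}          # languages handled by the interface above
NEEDS_TRANSLATION = SUPPORTED - {"en"}  # summaries in these languages are translated to English

def plan_steps(detected_language: str) -> list[str]:
    """Return the processing steps for a detected language code."""
    if detected_language not in SUPPORTED:
        return ["unsupported"]
    steps = ["summarize"]
    if detected_language in NEEDS_TRANSLATION:
        steps.append(f"translate:{detected_language}->en")
    return steps

print(plan_steps("fr"))  # ['summarize', 'translate:fr->en']
print(plan_steps("en"))  # ['summarize']
print(plan_steps("es"))  # ['unsupported']
```

The button callback could then call `plan_steps` and dispatch to `summarize_text` and `translate_to_english`, leaving the routing logic verifiable without opening a window or loading any model.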