# Title: AIDI 1002 Final Term Project Report

#### Members' Names: Hitesh Nirola (200511247), Harkirat Singh Sran (200534289)
#### Emails: 200511247@student.georgianc.on.ca, 200534289@student.georgianc.on.ca


# Introduction:
The paper reviewed in this report builds on a major advance in Natural Language Processing (NLP): deep unsupervised language representations trained exclusively on plain text corpora. Unlike earlier methods that assigned a single vector to each word form, these models produce contextual embeddings, computing the representation of each token instance from the entire input sequence. The first such models were based on recurrent neural networks; later ones adopted the Transformer architecture, including well-known models such as GPT, BERT, XLNet, and RoBERTa.

Using these pre-trained models in a transfer learning setting has brought significant improvements across diverse NLP tasks. Publicly shared pre-trained weights make it easier to build state-of-the-art NLP systems while saving time and compute. The effectiveness of unsupervised language model pre-training is well established for English, and multilingual or cross-lingual variants such as mBERT, XLM, and XLM-R accommodate over a hundred languages within a single model. The paper reviewed here instead develops a model dedicated to French.

In essence, the paper describes a methodology for building deep unsupervised language representations and demonstrates their impact on NLP tasks. This approach has become standard practice in the field, enabling powerful language models that apply across languages and tasks.

#### Problem Description:

The key aspects of the problem include:

Contextual Embeddings and Limitations of Word Embeddings:
The paper addresses the limitations of traditional word embeddings, which assign a single representation to each word regardless of context. A word may have several meanings depending on its context, and a single static vector cannot capture this nuance.

Shift to Contextual Embeddings:
The paper highlights a shift from context-free word embeddings to contextual embeddings, where the representation of a word is influenced by the entire input sequence. This enables the encoding of complex syntactic and semantic characteristics of words or sentences.

Evolution of Self-Supervised Learning:
The paper traces the evolution of self-supervised learning techniques, starting with neural language modeling and progressing through ELMo, ULMFiT, and Transformer-based architectures (GPT, BERT, XLNet, etc.).

Extension to Other Languages:
The paper discusses the extension of pre-trained language models beyond English to various languages, including French. It addresses the need for language-specific models and benchmarks, considering the success of models like CamemBERT for French.

Lack of French-Specific Benchmark:
A notable problem highlighted in the paper is the absence of a standardized benchmark for French, comparable to GLUE for English. The paper emphasizes the need for such a benchmark to evaluate models on a range of French NLP tasks.

Development of FlauBERT:
The paper describes the development of FlauBERT, outlining the training data collection process, the text preprocessing pipeline, and the model architecture for both FlauBERT-BASE and FlauBERT-LARGE.

#### Context of the Problem:

The problem addressed by FlauBERT matters for advancing natural language processing (NLP), particularly for French. The shift from traditional word embeddings to contextual embeddings reflects the need for language representations that capture how meaning varies with context. The extension of pre-trained language models to languages beyond English, exemplified by CamemBERT and FlauBERT, underscores the importance of addressing linguistic diversity. The absence of a standardized benchmark for evaluating NLP models in French is a gap in the field and calls for a rigorous, shared evaluation methodology. Overall, the development of FlauBERT not only improves language understanding for French but also fosters open research, collaboration, and the application of advanced language models to a broad range of NLP tasks.


#### Limitations of Other Approaches:

Traditional word embeddings such as word2vec and GloVe have been instrumental in capturing semantic relationships, but they cannot account for context-dependent word meanings. Early approaches like ELMo and ULMFiT addressed this limitation by introducing contextual embeddings, yet they required significant in-domain data for effective pre-training. This dependency on large domain-specific datasets is a drawback, and it paved the way for more recent Transformer-based models such as BERT and GPT, which are pre-trained on extensive general-domain text corpora and achieve state-of-the-art results on a wide range of NLP tasks.
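
The difference between a static and a contextual representation can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the two-dimensional vectors are invented, and `contextual_embed` is only a crude stand-in (mixing each token with the sentence mean), not the mechanism used by ELMo, BERT, or FlauBERT. The French word "avocat" means both "lawyer" and "avocado", which makes it a convenient example.

```python
# Toy illustration only: none of this is word2vec, GloVe, ELMo, or FlauBERT.

# Hypothetical 2-dimensional static vectors, one per word form.
STATIC = {
    "l'": [0.0, 0.0],
    "avocat": [1.0, 1.0],
    "plaide": [2.0, 0.0],  # "pleads": legal context
    "mûrit": [0.0, 2.0],   # "ripens": fruit context
}


def static_embed(tokens):
    """Context-free lookup: a word form always maps to the same vector."""
    return [STATIC[token] for token in tokens]


def contextual_embed(tokens):
    """Crude stand-in for a contextual encoder.

    Each token vector is averaged with the mean vector of the sentence,
    so the result depends on the surrounding words.
    """
    vectors = static_embed(tokens)
    dim = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    return [[0.5 * v[d] + 0.5 * mean[d] for d in range(dim)] for v in vectors]


legal = ["l'", "avocat", "plaide"]  # "the lawyer pleads"
fruit = ["l'", "avocat", "mûrit"]   # "the avocado ripens"

# Compare the vector of "avocat" (index 1) across the two sentences.
static_same = static_embed(legal)[1] == static_embed(fruit)[1]
contextual_same = contextual_embed(legal)[1] == contextual_embed(fruit)[1]
print(static_same, contextual_same)  # True False
```

The static lookup returns the same vector for "avocat" in both sentences, while the context-dependent version returns two different vectors, which is the property that contextual models provide at scale.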

#### Solution:
FlauBERT addresses the limitations of prior approaches by adopting the contextual embedding paradigm. Trained on a diverse French text corpus drawn from multiple sources, it captures context-dependent meanings instead of assigning a single representation to each word. Its Transformer-based architecture yields state-of-the-art results on French NLP tasks and reduces the dependency on large in-domain datasets.


# Background

The following table summarizes the tasks and datasets used to evaluate French NLP models in the paper.

| Task                   | Dataset                     | Weakness                                | Explanation                                      |
| ---------------------- | --------------------------- | --------------------------------------- | ------------------------------------------------ |
| Text Classification    | CLS-FR (Books, DVD, Music)  | not provided in the paper  | CLS dataset consists of Amazon reviews for books, DVDs, and music in French. Models are evaluated based on their accuracy in classifying sentiment. |
| Paraphrasing           | PAWS-X-FR (General domain)  | not provided in the paper  | PAWS-X dataset is an extension of the Paraphrase Adversaries from Word Scrambling (PAWS) for English, now including French. Models are evaluated on their accuracy in identifying semantically equivalent sentence pairs. |
| Natural Language Inference | XNLI-FR (Diverse genres) | not provided in the paper | XNLI is a cross-lingual NLI corpus that extends the MultiNLI corpus to 15 languages, including French. Models are evaluated based on their accuracy in determining the relationship (entailment, contradiction, or neutral) between sentence pairs. |
| Constituency Parsing and POS Tagging | French Treebank (Daily newspaper) | not provided in the paper  | The French Treebank consists of sentences from the French daily newspaper Le Monde, annotated with constituency and dependency syntactic trees and part-of-speech tags. Models are evaluated based on labeled F-score for constituency parsing and POS tagging accuracy. |
| Dependency Parsing     | French Treebank (Daily newspaper) | not provided in the paper | The dependency parsing task uses a reimplementation of the model of Dozat and Manning (2016). Models are evaluated with unlabeled and labeled attachment scores (UAS and LAS). |
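
Two of the metrics named in the table, accuracy and labeled F-score, can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the evaluation code used in the paper: the function names, the example labels, and the `(label, start, end)` span encoding are all hypothetical, and standard parsing evaluation tools apply additional conventions (for example around punctuation and duplicate spans) that are ignored here.

```python
def accuracy(gold, predicted):
    """Fraction of examples whose predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def labeled_f_score(gold_spans, predicted_spans):
    """Labeled F-score over constituents encoded as (label, start, end).

    A predicted constituent counts as correct only if its label and both
    boundaries match a gold constituent.
    """
    gold, pred = set(gold_spans), set(predicted_spans)
    matched = len(gold & pred)
    if matched == 0:
        return 0.0
    precision = matched / len(pred)
    recall = matched / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical XNLI-style labels for four sentence pairs.
gold_labels = ["entailment", "neutral", "contradiction", "neutral"]
pred_labels = ["entailment", "neutral", "neutral", "neutral"]
print(accuracy(gold_labels, pred_labels))  # 0.75

# Hypothetical constituents for a three-token sentence.
gold_tree = [("S", 0, 3), ("NP", 0, 2), ("VP", 2, 3)]
pred_tree = [("S", 0, 3), ("NP", 0, 1), ("VP", 2, 3)]
print(labeled_f_score(gold_tree, pred_tree))
```

In the second example two of the three predicted constituents match the gold tree, so precision and recall are both two thirds.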

# Methodology

Provide details of the existing paper method and your contribution that you are implementing in the next section with figure(s).  

For figures you can use this tag:

![Alternate text ](Figure.png "Title of the figure, location is simply the directory of the notebook")

# Implementation

In this section, you will provide the code and its explanation. You may have to create more cells after this. (To keep the Notebook clean, do not display debugging output or thousands of print statements from hundreds of epochs. Make sure it is readable for others by reviewing it yourself carefully.)

In [2]:
# Code cells

In [4]:
# Code cells

In [3]:
# Code cells

# Conclusion and Future Direction

In this work, we introduce FlauBERT, a pre-trained language model specifically designed for the French language. FlauBERT was trained on a diverse corpus from multiple sources and has demonstrated state-of-the-art performance on various French Natural Language Processing (NLP) tasks. Notably, it outperforms multilingual/cross-lingual models and is on par with CamemBERT, another pre-trained language model for French, despite being trained on nearly half the amount of text data.

To ensure the reproducibility of our pipeline, we have not only released FlauBERT itself but also provided preprocessing and training scripts. Additionally, we introduce a general benchmark, FLUE, for evaluating French NLP systems. This benchmark serves as a standardized measure for assessing the performance of different models on French-specific tasks.

Furthermore, FlauBERT is now integrated into Hugging Face's transformers library, enhancing its accessibility and usability for the wider NLP community.
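
As an illustration of that integration, the sketch below shows one way to obtain contextual token vectors from a FlauBERT checkpoint. It rests on several assumptions: the `transformers` and `torch` packages are installed (the FlauBERT tokenizer also relies on `sacremoses`), the checkpoint name `flaubert/flaubert_base_cased` is available on the Hugging Face Hub, and the weights can be downloaded on first use. The helper name `embed_sentence` is hypothetical.

```python
def embed_sentence(text, model_name="flaubert/flaubert_base_cased"):
    """Return one contextual vector per token of `text`.

    Imports are kept inside the function so that the definition can be
    loaded even when the heavy dependencies are not installed.
    """
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()  # disable dropout for deterministic outputs

    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Shape: (1, number_of_tokens, hidden_size)
    return outputs.last_hidden_state
```

Calling `embed_sentence("L'avocat plaide devant le tribunal.")` would download the checkpoint on first use and return a tensor with one row per subword token, including the special tokens added by the tokenizer.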

In simpler terms, we present FlauBERT as a powerful language model tailored for the French language, showcasing its impressive performance on various tasks. We've made our work easily reproducible by sharing scripts and benchmarks, and FlauBERT is now conveniently available through the Hugging Face transformers library.

We see several possibilities for making the FlauBERT language model even better:

Training for Specific Topics:
We can teach FlauBERT to understand specific topics like law or medicine better by giving it more focused training on those subjects.

Continuous Learning:
We want FlauBERT to keep learning and staying updated as language use evolves over time.

Understanding More Than Just Text:
We're exploring ways to teach FlauBERT to understand not only text but also images and other types of information together. This could make it more powerful in understanding the world.

Using Fewer Resources:
We're looking into ways to make FlauBERT work faster and use less computing power, so it can be used on devices like phones.

Speaking Many Languages:
We're trying to help FlauBERT understand and work well with different languages, making it useful for people who speak various languages.

Explaining Itself Better:
We want FlauBERT to be better at explaining why it makes certain predictions, especially in situations where understanding its decisions is important.

Being Fair and Unbiased:
We're working to make sure FlauBERT doesn't show unfair preferences and behaves reliably across different groups of people.

Easy for Everyone to Use:
We're designing simple and easy-to-use interfaces or tools so that even people who aren't experts in language technology can benefit from FlauBERT.

Working Together with Others:
We're excited about working with other researchers and developers to improve FlauBERT and contribute to the growing knowledge in language technology.

By exploring these ideas, we hope to make FlauBERT even more useful and capable in understanding and working with language.








# References:

[1]: Abeillé, A., Clément, L., and Toussenel, F. (2003). "Building a Treebank for French." In Building and Using Parsed Corpora, pages 165–187. Springer Netherlands, Dordrecht.

[2]: Antoun, W., Baly, F., and Hajj, H. (2020). "AraBERT: Transformer-based model for Arabic language understanding." arXiv preprint arXiv:2003.00104.

[3]: Artetxe, M. and Schwenk, H. (2019). "Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond." Transactions of the Association for Computational Linguistics, 7:597–610.

[4]: Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. (2003). "A neural probabilistic language model." Journal of Machine Learning Research, 3(Feb):1137–1155.