# AI Project - Rap Generation
# Quentin Le Lan & Marius Le Douarin


This project fine-tunes a GPT-2 model to generate French rap lyrics. We followed the advice given [here](https://discuss.huggingface.co/t/fine-tune-gpt2-for-french-belgium-rap/7098). This notebook explains how to train the model and how to use it, and provides a graphical interface.

We use the model **louis2020belgpt2**, available on [GitHub](https://github.com/antoiloui/belgpt2) or [Hugging Face](https://huggingface.co/antoinelouis/belgpt2):

author = Louis, Antoine
title = BelGPT-2: a GPT-2 model pre-trained on French corpora.
year = 2020



In [1]:
!pip install jaxlib==0.4.20 -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
!pip install transformers datasets flax
!pip install -q streamlit

Looking in links: https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
Collecting jaxlib==0.4.20
  Downloading https://storage.googleapis.com/jax-releases/cuda12/jaxlib-0.4.20%2Bcuda12.cudnn89-cp310-cp310-manylinux2014_x86_64.whl (138.7 MB)
Installing collected packages: jaxlib
  Attempting uninstall: jaxlib
    Found existing installation: jaxlib 0.4.23+cuda12.cudnn89
    Uninstalling jaxlib-0.4.23+cuda12.cudnn89:
      Successfully uninstalled jaxlib-0.4.23+cuda12.cudnn89
Successfully installed jaxlib-0.4.20+cuda12.cudnn89
Collecting datasets
  Downloading datasets-2.16.1-py3-none-any.whl (507 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)

## Train the GPT model

This command fine-tunes the model for the first time. You need the script [run_clm_flax.py](https://github.com/huggingface/transformers/blob/main/examples/flax/language-modeling/run_clm_flax.py). The duration of an epoch depends on the size of your dataset: for a dataset of 22000 entries, it takes about 1 hour.

In [None]:
!python run_clm_flax.py --model_name_or_path antoiloui/belgpt2 --train_file train.csv --do_eval --validation_file validation.csv --output_dir output --do_train --preprocessing_num_workers 2 --num_train_epoch 1 --block_size=1024 --per_device_train_batch_size 4 --eval_steps 1000
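
The command above expects `train.csv` and `validation.csv` to exist already. The sketch below shows one possible way to produce them from a list of lyrics. It assumes a single column named `text` (as far as we can tell, the script reads the `text` column when there is one); the function name `write_lyrics_csv`, the 10% validation ratio and the sample lines are hypothetical.

```python
import csv
import random


def write_lyrics_csv(lyrics, train_path, validation_path, validation_ratio=0.1, seed=0):
    """Shuffle the lyrics and write them to two CSV files with a single `text` column."""
    shuffled = list(lyrics)
    random.Random(seed).shuffle(shuffled)
    # Keep at least one sample for validation.
    n_validation = max(1, int(len(shuffled) * validation_ratio))
    splits = {
        validation_path: shuffled[:n_validation],
        train_path: shuffled[n_validation:],
    }
    for path, rows in splits.items():
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["text"])  # assumed column name
            writer.writerows([row] for row in rows)
    return len(splits[train_path]), n_validation


if __name__ == "__main__":
    # Hypothetical placeholder data; replace with your own lyrics.
    sample = [f"couplet {i}" for i in range(20)]
    print(write_lyrics_csv(sample, "train.csv", "validation.csv"))
```

Fixing the seed keeps the split identical between runs, so later epochs are evaluated on the same validation set.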

This command continues fine-tuning after the first epoch. Be careful with the parameters `config_name`, `tokenizer_name` and `model_name_or_path`: they must point to the output of the previous epoch.

In [None]:
!python run_clm_flax.py --config_name ./drive/MyDrive/ia/epoch4/config.json --tokenizer_name ./drive/MyDrive/ia/epoch4/ --model_name_or_path ./drive/MyDrive/ia/epoch4/flax_model.msgpack --train_file ./drive/MyDrive/ia/train.csv --do_eval --validation_file ./drive/MyDrive/ia/validation.csv --output_dir ./drive/MyDrive/ia/epoch5 --do_train --preprocessing_num_workers 2 --num_train_epoch 1 --block_size=1024 --per_device_train_batch_size 3 --eval_steps 1000
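
Since only the epoch folders change from one run to the next, the command can be generated instead of edited by hand. The helper below is a minimal sketch under that assumption: `build_resume_command` is a hypothetical name, and the `epochN` folder layout is simply the one used in the command above.

```python
import shlex


def build_resume_command(base_dir, previous_epoch, batch_size=3):
    """Build the run_clm_flax.py command that continues training from `previous_epoch`."""
    previous = f"{base_dir}/epoch{previous_epoch}"
    output = f"{base_dir}/epoch{previous_epoch + 1}"
    args = [
        "python", "run_clm_flax.py",
        # These three arguments must all point to the previous epoch's output.
        "--config_name", f"{previous}/config.json",
        "--tokenizer_name", f"{previous}/",
        "--model_name_or_path", f"{previous}/flax_model.msgpack",
        "--train_file", f"{base_dir}/train.csv",
        "--do_eval",
        "--validation_file", f"{base_dir}/validation.csv",
        "--output_dir", output,
        "--do_train",
        "--preprocessing_num_workers", "2",
        "--num_train_epoch", "1",
        "--block_size=1024",
        "--per_device_train_batch_size", str(batch_size),
        "--eval_steps", "1000",
    ]
    return shlex.join(args)


if __name__ == "__main__":
    print(build_resume_command("./drive/MyDrive/ia", previous_epoch=4))
```

In a notebook, the resulting string can be pasted after `!` in a new cell.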

## Execute the model locally

This section generates text directly in the notebook, without the graphical interface.

In [2]:
from transformers import (
    CONFIG_MAPPING,
    FLAX_MODEL_FOR_CAUSAL_LM_MAPPING,
    AutoConfig,
    AutoTokenizer,
    FlaxAutoModelForCausalLM,
)


In [3]:
MAX_LEN = 40 #@param {type:"slider", min:1, max:500, step:10}
MIN_LEN = 31 #@param {type:"slider", min:1, max:500, step:10}
temp = 0.7 #@param {type:"slider", min:0.0, max:2.0, step:0.1}
top_p=0.95 #@param {type:"slider", min:0.0, max:1.0, step:0.05}
top_k=100 #@param {type:"slider", min:0, max:1000, step:10}
repetition_penalty=1.5 #@param {type:"slider", min:0.0, max:10, step:0.5}
input_text="Négro jtire une taffe, jfais des gros nuages Jrappe tellement ma life, ça devient meme plus un jeu" #@param {type:"string"}

To use it, change the paths of the `config`, the `tokenizer` and the `model` to your own files.

In [4]:
import numpy as np

# Load the fine-tuned checkpoint (adapt these paths to your own output directory).
config = AutoConfig.from_pretrained('./drive/MyDrive/ia/epoch4/config.json')
tokenizer = AutoTokenizer.from_pretrained('drive/MyDrive/ia/epoch4')
model = FlaxAutoModelForCausalLM.from_pretrained('drive/MyDrive/ia/epoch4/flax_model.msgpack', config=config)

# Encode the prompt; every token is real text, so the attention mask is all ones.
input_ids = tokenizer.encode(input_text, return_tensors="np")
attention_mask = np.ones(input_ids.shape)

# Sample one continuation with the parameters chosen above.
output = model.generate(
    input_ids,
    attention_mask=attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    do_sample=True,
    top_k=top_k,
    max_length=MAX_LEN,
    min_length=MIN_LEN,
    top_p=top_p,
    temperature=temp,
    repetition_penalty=repetition_penalty,
    num_return_sequences=1,
)

# Decode the generated token ids back to text.
sequences = np.array(output.sequences)
decoded_output = [tokenizer.decode(sample, skip_special_tokens=True) for sample in sequences]
print(decoded_output)

["Négro jtire une taffe, jfais des gros nuages Jrappe tellement ma life, ça devient meme plus un jeu  j'ai la bite d'un mec qui veut"]
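
To give an intuition of what `temp`, `top_k` and `top_p` do during sampling, here is a small pure-Python sketch applied to a toy distribution. It is an illustration under simplifying assumptions, not the code used by the library: `filter_distribution` is a hypothetical name, `temperature` must be strictly positive, and real implementations work on logits and may differ in details.

```python
import math


def filter_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return the renormalized sampling probabilities after temperature, top-k and top-p."""
    # Temperature: divide the logits before the softmax (values < 1 sharpen the distribution).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank token indices from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Top-k: keep only the k most probable tokens (0 disables the filter).
    if top_k > 0:
        order = order[:top_k]
    mass = sum(probs[i] for i in order)

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i] / mass
        if cumulative >= top_p:
            break

    # Renormalize over the surviving tokens; all the others get probability 0.
    final_mass = sum(probs[i] for i in kept)
    return [probs[i] / final_mass if i in kept else 0.0 for i in range(len(probs))]


if __name__ == "__main__":
    toy_logits = [math.log(p) for p in (0.5, 0.3, 0.15, 0.05)]
    print(filter_distribution(toy_logits, temperature=0.7, top_k=3, top_p=0.95))
```

Lower values of `top_k` and `top_p` both shrink the set of candidate tokens, which makes the output more predictable; a higher temperature does the opposite.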


## Use the graphical interface

We use Streamlit to build a chat interface similar to ChatGPT.

### Installation

In [5]:
!npm install localtunnel

[K[?25h[37;40mnpm[0m [0m[30;43mWARN[0m [0m[35msaveError[0m ENOENT: no such file or directory, open '/content/package.json'
[K[?25h[37;40mnpm[0m [0m[34;40mnotice[0m[35m[0m created a lockfile as package-lock.json. You should commit this file.
[0m[37;40mnpm[0m [0m[30;43mWARN[0m [0m[35menoent[0m ENOENT: no such file or directory, open '/content/package.json'
[0m[37;40mnpm[0m [0m[30;43mWARN[0m[35m[0m content No description
[0m[37;40mnpm[0m [0m[30;43mWARN[0m[35m[0m content No repository field.
[0m[37;40mnpm[0m [0m[30;43mWARN[0m[35m[0m content No README data
[0m[37;40mnpm[0m [0m[30;43mWARN[0m[35m[0m content No license field.
[0m
+ localtunnel@2.0.2
added 22 packages from 22 contributors and audited 22 packages in 2.163s

3 packages are looking for funding
  run `npm fund` for details

found 1 moderate severity vulnerability
  run `npm audit fix` to fix them, or `npm audit` for details


### Run the app

Before running this part, run the cell in the Streamlit app section at the end of this notebook, which writes `app.py`.

In [7]:
!streamlit run /content/app.py &>/content/logs.txt &

**Expose port 8501**

Run the cell below, then click the URL it displays.

The previous cell creates a `logs.txt` file. Copy the IP address from its "External URL" line and paste it into the page opened by the localtunnel URL.
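
Instead of reading the log by eye, the IP address can be extracted with a regular expression. This is a minimal sketch: `extract_external_ip` is a hypothetical helper, and the sample log is invented to mimic the lines Streamlit prints at startup, so the exact format of your own log may differ.

```python
import re


def extract_external_ip(log_text):
    """Return the IP address found on the "External URL" line, or None if absent."""
    match = re.search(r"External URL:\s*https?://([0-9.]+)(?::\d+)?", log_text)
    return match.group(1) if match else None


# Hypothetical log content, shaped like the lines Streamlit prints at startup.
sample_log = """
  You can now view your Streamlit app in your browser.

  Network URL: http://172.28.0.12:8501
  External URL: http://203.0.113.7:8501
"""

if __name__ == "__main__":
    print(extract_external_ip(sample_log))
```

In the notebook, the same function can be called on the content of `/content/logs.txt` once Streamlit has started.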

In [8]:
!npx localtunnel --port 8501

[K[?25hnpx: installed 22 in 2.04s
your url is: https://public-bears-happen.loca.lt
^C


### The Streamlit app

To use it, change the paths of the `config`, the `tokenizer` and the `model` to your own files.

In [6]:
%%writefile app.py
from transformers import (
    CONFIG_MAPPING,
    FLAX_MODEL_FOR_CAUSAL_LM_MAPPING,
    AutoConfig,
    AutoTokenizer,
    FlaxAutoModelForCausalLM,
)
import numpy as np
import streamlit as st

if "messages" not in st.session_state:
    st.session_state.messages = []

config = AutoConfig.from_pretrained('./drive/MyDrive/ia/epoch4/config.json')
tokenizer = AutoTokenizer.from_pretrained('drive/MyDrive/ia/epoch4')
@st.cache_resource
def loadModel():
  return FlaxAutoModelForCausalLM.from_pretrained('drive/MyDrive/ia/epoch4/flax_model.msgpack',config=config)
model=loadModel()

col1, col2 = st.columns(2)
with col1:
  with st.form("slider_form"):
    with st.expander("Options"):
      MAX_LEN = st.slider('Max length of the sentence',1, 500, 100, step=10)
      MIN_LEN = st.slider('Min length of the sentence',1, 500, 50, step=10)
      temp = st.slider('Balance between deterministic outputs and creative exploration',0.0, 2.0, 0.7, step=0.1,help="Here high values tend to flatten the distribution of probabilities")
      penalty = st.slider('How strongly should we discourage repetitive tokens',0.0, 10.0, 2.0, step=1.0)
      top_k = st.slider('The number of tokens with the highest probability to keep (top-k filtering)',0, 1000, 100, step=10)
      top_p = st.slider('Keeps the smallest set of most probable tokens whose cumulative probability reaches top_p (top-p sampling)',0.0, 1.0, 0.90, step=0.05)
      submit = st.form_submit_button("Submit Slider Values")
with col2:
  for message in st.session_state.messages:
      with st.chat_message(message["role"]):
          st.markdown(message["content"])
if prompt := st.chat_input("Say something"):
  st.session_state.messages.append({"role": "user", "content": prompt}) # store the prompt in the history
  with col2:
    with st.chat_message("user"):
      st.markdown(prompt) # display the prompt
    with st.chat_message("assistant"):
      message_placeholder = st.empty()
      input_ids = tokenizer.encode(prompt, return_tensors="np")
      attention_mask = np.ones(input_ids.shape)  # Build an attention mask of ones for every token

      output = model.generate(input_ids, attention_mask=attention_mask, pad_token_id=tokenizer.eos_token_id,do_sample=True,
          top_k=top_k,
          max_length=MAX_LEN,
          min_length=MIN_LEN,
          top_p=top_p,
          temperature=temp,
          repetition_penalty=penalty,
          num_return_sequences=1)

      output=np.array(output.sequences)

      decoded_output = []
      for sample in output:
          tmp=tokenizer.decode(sample, skip_special_tokens=True)
          decoded_output.append(tmp)
      message_placeholder.markdown( ' '.join(decoded_output))
    st.session_state.messages.append({"role": "assistant", "content":  ' '.join(decoded_output)}) # add the reply to the history

Writing app.py
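
One detail worth keeping in mind with the `MAX_LEN` and `MIN_LEN` sliders: in `generate`, `max_length` counts the prompt tokens as well as the generated ones, and nothing prevents the minimum slider from being set above the maximum. The helper below is a sketch of one way to derive consistent values from a budget of new tokens. `generation_lengths` is a hypothetical name, and the default context size of 1024 is an assumption taken from the `--block_size=1024` used during training.

```python
def generation_lengths(prompt_tokens, min_new_tokens, max_new_tokens, context_size=1024):
    """Convert "new token" budgets into the total lengths passed to generate()."""
    if prompt_tokens >= context_size:
        raise ValueError("the prompt already fills the context window")
    # Total length = prompt + new tokens, capped by the context window.
    max_length = min(prompt_tokens + max_new_tokens, context_size)
    # min_length can never exceed max_length.
    min_length = min(prompt_tokens + min_new_tokens, max_length)
    return min_length, max_length


if __name__ == "__main__":
    # A 30-token prompt with between 20 and 50 new tokens requested.
    print(generation_lengths(30, 20, 50))
```

The two returned values could then be passed as `min_length` and `max_length`, with `prompt_tokens` taken from `input_ids.shape[1]`.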
