# NLP (Hugging Face)

In this notebook, we illustrate how you can perform natural language processing (NLP). In particular, we look at how pretrained transformer-based models from Hugging Face can be used for such tasks.

Let's start by installing and loading everything we need for the first example in this notebook. Because we want to have a look at a few pretrained models, this might take a while.

In [1]:
!apt-get install ffmpeg
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2CTCTokenizer
import subprocess
import os
from IPython.display import Audio, display, HTML
from google.colab import drive, files
import logging
import warnings
# Suppress info messages (only display errors)
logging.getLogger("transformers").setLevel(logging.ERROR)
warnings.filterwarnings("ignore", message="The secret `HF_TOKEN` does not exist in your Colab secrets.")

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
ffmpeg is already the newest version (7:4.4.2-0ubuntu0.22.04.1).
0 upgraded, 0 newly installed, 0 to remove and 45 not upgraded.


## Speech to text

As a first task, we would like to extract the audio track from a video and then turn the audio file into text. You can find another example along the same lines on [Kaggle](https://www.kaggle.com/code/dikshabhati2002/speech-to-text-with-hugging-face). We first load a pre-trained Wav2Vec2 model provided by Facebook AI through the Hugging Face model hub. Wav2Vec2 is a model architecture developed by Facebook AI for speech recognition tasks. We then load the matching tokenizer. In general, a tokenizer converts between text and the tokens a model works with; here, we use it to turn the token IDs predicted by the model back into text. These pre-trained resources are then used to transcribe the audio from a video file.

In [2]:
# Load the pre-trained speech recognition model and tokenizer
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
tokenizer = Wav2Vec2CTCTokenizer.from_pretrained("facebook/wav2vec2-base-960h")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Let's have a look at one sentence from one of the lectures on transformers in the Deep Learning in Python Introductory course. If you were to run the code on your own computer, you could deal with rather large files. On Google Colab, however, we might run into runtime issues (in the default environment). So, we use a small file: *Download example_transformers_rcds.mp4 from the GitHub folder to your computer and upload it below.*

**NB:** If you run this cell several times, you might run into errors, as the file is reported as not uploaded. What happens is that, if the file already exists, example_transformers_rcds.mp4 is saved as example_transformers_rcds(1).mp4 etc., and the cell then no longer finds it under the expected name. You can delete existing files by clicking on the folder icon on the left.

In [3]:
# Mount Google Drive
drive.mount('/content/drive')

# Specify the destination path in Google Drive
drive_path = "/content/drive/My Drive/example_transformers_rcds.mp4"

# Prompt the user to upload the file "example_transformers_rcds.mp4"
print("Please upload the file 'example_transformers_rcds.mp4':")
uploaded = files.upload()

# Check if "example_transformers_rcds.mp4" was uploaded
if 'example_transformers_rcds.mp4' in uploaded:
    # Save the uploaded file to the specified destination path
    with open(drive_path, 'wb') as f:
        f.write(uploaded['example_transformers_rcds.mp4'])
    print(f"File 'example_transformers_rcds.mp4' uploaded successfully to '{drive_path}'.")
else:
    print("No file named 'example_transformers_rcds.mp4' uploaded.")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Please upload the file 'example_transformers_rcds.mp4':


Saving example_transformers_rcds.mp4 to example_transformers_rcds.mp4
File 'example_transformers_rcds.mp4' uploaded successfully to '/content/drive/My Drive/example_transformers_rcds.mp4'.
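
The renaming problem described in the note above could be sidestepped by looking for the uploaded file under its possibly renamed key instead of only under the exact name. The following is a minimal sketch under stated assumptions: `find_uploaded_key` is a hypothetical helper (not part of `google.colab`), and it assumes the renamed file carries a numeric suffix such as `(1)` directly before the extension, with or without a preceding space. A plain dictionary stands in for the result of `files.upload()`.

```python
import re

def find_uploaded_key(uploaded, expected_name):
    """Return the key in `uploaded` that matches `expected_name`,
    allowing for a numeric suffix such as "(1)" before the extension.
    Hypothetical helper; returns None if nothing matches."""
    stem, dot, extension = expected_name.rpartition(".")
    # Matches e.g. "name.mp4", "name(1).mp4" and "name (1).mp4"
    pattern = re.compile(
        re.escape(stem) + r" ?(\(\d+\))?" + re.escape(dot + extension)
    )
    for key in uploaded:
        if pattern.fullmatch(key):
            return key
    return None

# A plain dict standing in for the dictionary returned by files.upload()
fake_upload = {"example_transformers_rcds (1).mp4": b"video bytes"}
key = find_uploaded_key(fake_upload, "example_transformers_rcds.mp4")
print(key)
```

With such a helper, the upload cell could write `uploaded[key]` to `drive_path` whenever `key` is not `None`, regardless of how many times the cell has been run.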


In [4]:
!ffmpeg -y -i "/content/drive/My Drive/example_transformers_rcds.mp4" -vn -acodec pcm_s16le -ar 16000 -ac 1 "/content/drive/My Drive/audio_transformers_rcds.wav" 2>/dev/null
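
The shell command above extracts the audio track as a mono, 16-bit PCM WAV file sampled at 16 kHz, which is the sample rate the `facebook/wav2vec2-base-960h` model was trained on. As a sketch of how the same call could be made from Python with the `subprocess` module imported earlier, the hypothetical helpers below build the identical argument list and, optionally, run it. The function names and the example paths are assumptions for illustration only.

```python
import subprocess

def build_ffmpeg_command(video_path, audio_path, sample_rate=16000):
    """Build the ffmpeg argument list for extracting a mono WAV track."""
    return [
        "ffmpeg",
        "-y",                      # overwrite the output file if it exists
        "-i", video_path,          # input video
        "-vn",                     # drop the video stream
        "-acodec", "pcm_s16le",    # 16-bit PCM audio
        "-ar", str(sample_rate),   # sample rate in Hz
        "-ac", "1",                # one channel (mono)
        audio_path,
    ]

def extract_audio(video_path, audio_path):
    """Run ffmpeg and raise CalledProcessError if the conversion fails."""
    command = build_ffmpeg_command(video_path, audio_path)
    subprocess.run(command, check=True, capture_output=True)

# Only build and show the command here; call extract_audio(...) to run it
command = build_ffmpeg_command("input.mp4", "output.wav")
print(" ".join(command))
```

Passing the arguments as a list avoids quoting problems with paths that contain spaces, such as `My Drive`, and `check=True` turns a failed conversion into an exception instead of a silent failure.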

Now, we use the pre-trained Wav2Vec2 model to convert the audio into written text.

In [5]:
audio_file = "/content/drive/My Drive/audio_transformers_rcds.wav"

# Load the audio file and perform speech recognition
if os.path.exists(audio_file):
    try:
        # Play audio
        display(Audio(audio_file, autoplay=False))

        # Load audio waveform
        waveform, sample_rate = torchaudio.load(audio_file)

        # Perform speech recognition
        with torch.no_grad():
            logits = model(waveform).logits

        # Decode the predicted tokens into text
        predicted_ids = torch.argmax(logits, dim=-1)
        transcription = tokenizer.batch_decode(predicted_ids, skip_special_tokens=True)

        display(HTML('<span style="color: #00008b">' + transcription[0] + '</span>.'))

    except Exception as e:
        print("Error occurred during transcription:", e)
    finally:
        if os.path.exists(audio_file):
            os.remove(audio_file)
else:
    print("Error: Audio file not found.")
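
The decoding step in the cell above (an `argmax` over the logits followed by `batch_decode`) relies on the CTC scheme that `Wav2Vec2ForCTC` is named after: the model predicts one token per audio frame, and the decoder collapses repeated tokens and drops a special blank token. The following is a minimal sketch of that idea in plain Python. The toy vocabulary and the choice of `0` as the blank ID are assumptions for illustration; the real tokenizer has its own vocabulary and handles further details such as word boundaries.

```python
def greedy_ctc_decode(token_ids, id_to_char, blank_id=0):
    """Collapse repeated ids, drop blanks, and map the rest to characters."""
    decoded = []
    previous_id = None
    for token_id in token_ids:
        # A token counts only if it differs from the frame directly before it
        if token_id != previous_id and token_id != blank_id:
            decoded.append(id_to_char[token_id])
        previous_id = token_id
    return "".join(decoded)

# Toy vocabulary (hypothetical, much smaller than the real one)
toy_vocab = {0: "<blank>", 1: "H", 2: "E", 3: "L", 4: "O"}

# Frame-by-frame argmax output: repeats and blanks (0) are common
frame_ids = [1, 1, 0, 2, 2, 3, 0, 3, 3, 4, 0]
print(greedy_ctc_decode(frame_ids, toy_vocab))
```

Note how the blank between the two runs of `3` is what allows the double letter to survive the collapsing step.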

## Punctuation .,;?!

Okay, that did the job: we have generated text from a recording. Amazing. But it's not really that readable. After all, it's all capitals and there's no punctuation. Restoring punctuation is a complicated task that requires quite a bit of understanding of the text, meaning that you would have to train a sophisticated network to do this. Luckily, Hugging Face offers [pretrained packages](https://stackoverflow.com/questions/31514136/how-to-add-punctuation-to-text-using-python).

In [6]:
try:
    from deepmultilingualpunctuation import PunctuationModel
except ImportError:
    !pip install deepmultilingualpunctuation
    from deepmultilingualpunctuation import PunctuationModel

model1 = PunctuationModel()



In [7]:
transcript = transcription[0].lower() # Shift from capitals to lower case letters
transcript_punctuation = model1.restore_punctuation(transcript)

That's it. We added punctuation. But before we print anything, let's first ensure that the first letter of each sentence is capitalised. Such tasks can be solved with nltk.

In [8]:
try:
    import nltk
except ImportError:
    !pip install nltk
    import nltk
import re  # part of the Python standard library, so no installation is needed

nltk.download('punkt')

def capitalize_sentences(text):
    # Tokenize the text into sentences
    sentences = nltk.sent_tokenize(text)

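    # Capitalize the first letter of each sentence and leave the rest unchanged
    # (str.capitalize() would also lowercase the remainder of the sentence)
    capitalized_sentences = [sentence[:1].upper() + sentence[1:] for sentence in sentences]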

    # Join the sentences back into a single text
    return ' '.join(capitalized_sentences)

transcript_punctuation_capitals = capitalize_sentences(transcript_punctuation)

display (HTML('<font color="#00008b">'+transcript_punctuation_capitals+'</font>'))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Autocorrect

The paragraph now looks much nicer than before... indeed, it now looks so nice that something else has become obvious: some words were incorrectly identified, and some are outright nonsense. Can we deal with that using Python? Yes! For this purpose, let's have a look at yet another library for NLP.

In [9]:
try:
    import language_tool_python
except ImportError:
    !pip install language_tool_python
    import language_tool_python

tool = language_tool_python.LanguageTool('en-GB')

In [10]:
# Check grammar
matches = tool.check(transcript_punctuation_capitals)

# Get the corrected text
corrected_text = tool.correct(transcript_punctuation_capitals)

# Tokenize the original text and the corrected text
original_words = transcript_punctuation_capitals.split()
corrected_words = corrected_text.split()

# Define HTML syntax for colors and strikethrough
BLUE = "<font color='#00008b'>"
RED = "<font color='red'>"
GREEN = "<font color='green'>"
STRIKETHROUGH = "<s>"
ENDC = "</font>"

# Build the HTML-formatted output
output = ""
counter = 0
for index, word in enumerate(original_words):
    # Stop once either word list is exhausted
    if index+counter >= len(original_words) or index >= len(corrected_words):
        break
    if original_words[index+counter] != corrected_words[index]:
        # Print the original word with strikethrough in red
        output += f"{RED}<s>{original_words[index+counter]}</s>{ENDC} "
        # Print the corrected word in green
        output += f"{GREEN}{corrected_words[index]}{ENDC} "
        # If two original words were merged into one (with or without a hyphen),
        # strike through the second original word as well and skip past it
        has_next = index+counter+1 < len(original_words)
        if has_next and (original_words[index+counter]+original_words[index+counter+1] == corrected_words[index] or original_words[index+counter]+"-"+original_words[index+counter+1] == corrected_words[index]):
            counter = counter + 1
            output += f"{RED}<s>{original_words[index+counter]}</s>{ENDC} "
    else:
        # Print the unchanged word in blue
        output += f"{BLUE}{original_words[index+counter]}{ENDC} "

# If corrected text has extra words, print them in green
if len(corrected_words) > len(original_words):
    extra_words = corrected_words[len(original_words):]
    for extra_word in extra_words:
        output += f"{GREEN}{extra_word}{ENDC} "

# Display the HTML-formatted output
display(HTML(output))

Admittedly, that's somewhat better... but not yet perfect. How can we improve? Have a look at the exercises.
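
The word-by-word comparison in the cell above handles one special case (two words merged into one) and can drift out of step when the correction inserts or removes words elsewhere. One possible alternative, sketched here under the assumption that a plain word-level diff is good enough, is to let `difflib.SequenceMatcher` from the standard library align the two word lists. The function name `highlight_changes` is hypothetical, and unchanged words are left uncoloured for brevity.

```python
import difflib
import html

def highlight_changes(original_text, corrected_text):
    """Return HTML marking removed words in red and added words in green."""
    original_words = original_text.split()
    corrected_words = corrected_text.split()
    matcher = difflib.SequenceMatcher(
        a=original_words, b=corrected_words, autojunk=False
    )

    parts = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            parts.extend(html.escape(word) for word in original_words[i1:i2])
            continue
        # "delete" and "replace" remove words from the original text
        for word in original_words[i1:i2]:
            parts.append(f"<font color='red'><s>{html.escape(word)}</s></font>")
        # "insert" and "replace" add words from the corrected text
        for word in corrected_words[j1:j2]:
            parts.append(f"<font color='green'>{html.escape(word)}</font>")
    return " ".join(parts)

print(highlight_changes("a trans former model", "a transformer model"))
```

In the notebook, the returned string could be passed to `display(HTML(...))` in the same way as `output` above.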

## Creating summaries

If you have a long text, maybe you want a summary. That also sounds like an NLP task and something transformers should be able to deal with... and so it is. Let's load another pre-trained model. To try it out, we need some text, which we can get from the internet. We could for instance use [Beautiful Soup](https://pypi.org/project/beautifulsoup4/) (named after a poem in Alice in Wonderland) to scrape the arXiv. However, there already exists a package that allows us to do this in just a few lines of code.

In [11]:
try:
    import arxiv
except ImportError:
    !pip install arxiv
    import arxiv

from transformers import BartForConditionalGeneration, BartTokenizer

In [14]:
# Construct the default API client.
client = arxiv.Client()

# Search for the 10 most recent articles matching the keyword "Cosmology".
search = arxiv.Search(
    query="Cosmology",
    max_results=10,
    sort_by=arxiv.SortCriterion.SubmittedDate
)

# Fetch and display titles of the search results
titles = []
for result in client.results(search):
    titles.append(result.title)

# Prompt the user to select a paper by its title
print("Available papers:")
for i, title in enumerate(titles, start=1):
    print(f"{i}. {title}")

selected_title_index = input("\nEnter the number of the paper you want to view: ")

# Check if the input is valid
if selected_title_index.isdigit():
    selected_title_index = int(selected_title_index) - 1

    if 0 <= selected_title_index < len(titles):
        selected_title = titles[selected_title_index]
        print(f"\nFetching abstract for '{selected_title}'...\n")

        # Search for the paper with the selected title
        search_by_title = arxiv.Search(query=f"ti:\"{selected_title}\"")
        try:
            selected_paper = next(client.results(search_by_title))
            display(HTML(f"<b>Title:</b> {selected_paper.title}"))
            # Abstract of the paper
            abstract = selected_paper.summary
            display(HTML(f"<b>Abstract:</b> {abstract}"))
        except StopIteration:
            display(HTML("Paper not found."))
    else:
        display(HTML("Invalid selection."))
else:
    display(HTML("Invalid input. Please enter a number."))

Available papers:
1. Bounds on galaxy stochasticity from halo occupation distribution modeling
2. Observation of Gravitational Waves from the Coalescence of a $2.5-4.5~M_\odot$ Compact Object and a Neutron Star
3. {\sc SimBIG}: Cosmological Constraints using Simulation-Based Inference of Galaxy Clustering with Marked Power Spectra
4. Early evolution of spin direction in dark matter halos and the effect of the surrounding large-scale tidal field
5. Gravitational collapse in effective loop quantum gravity: beyond marginally bound configurations
6. Gravitational lensing by a Lorentz-violating black hole
7. Quadratic Rastall Gravity: from low-mass HESS J1731-347 to high-mass PSR J0952-0607 pulsars
8. Neutrinos as possible probes for quantum gravity
9. Impact of Black Hole Parameters on Photon Sphere and Shadow Radius: New Analytical Approach
10. Tidal heating as a discriminator for horizons in equatorial eccentric extreme mass ratio inspirals

Enter the number of the paper you want to view

In [15]:
# Load pre-trained BART model and tokenizer
model_name = "facebook/bart-large-cnn"
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

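# Tokenize and summarize the input text using BART
# (no task prefix is needed: bart-large-cnn is already fine-tuned for summarization,
# and the "summarize: " prefix is a T5 convention)
inputs = tokenizer.encode(abstract, return_tensors="pt", max_length=1024, truncation=True)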
summary_ids = model.generate(inputs, max_length=150, min_length=50, length_penalty=0.7, num_beams=4, early_stopping=True)

# Decode and post-process the summary
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# Display the summary as HTML
display(HTML("<b>Summary:</b>"))
display(HTML(f"{BLUE} {summary} {ENDC}"))
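
The cell above truncates the input at 1024 tokens (`max_length=1024, truncation=True`), so anything beyond that limit is silently dropped. That is fine for an abstract, but for a genuinely long text one option is to split it into chunks and summarise each chunk separately. The following is a rough sketch only: `split_into_chunks` is a hypothetical helper, it counts words rather than tokens (the model's limit is in tokens, so the word count is just an approximation), and the value of 400 words is an arbitrary assumption.

```python
def split_into_chunks(text, max_words=400):
    """Split text into consecutive chunks of at most `max_words` words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]

# Artificial long text made of 1000 placeholder words
long_text = " ".join(f"word{i}" for i in range(1000))
chunks = split_into_chunks(long_text, max_words=400)
print([len(chunk.split()) for chunk in chunks])
```

Each chunk could then be passed through the same `tokenizer.encode` and `model.generate` calls as above, and the partial summaries joined together.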

## Exercises

**Exercise 1:** The library used above for dealing with grammatical errors did solve most of the problems with the text, but not all of them. Moreover, it's not from Hugging Face. Libraries exist that are able to alter the text more thoroughly. For instance, take a look at [gramformer](https://github.com/PrithivirajDamodaran/Gramformer).

**Exercise 2:** I have been rather sparse in my comments in this notebook. Expand on some of the comments. Are there any lines of code that you do not understand? Discuss in the group.