Consider the article:
```text
Bloice, M.D., Holzinger, A. (2016). A Tutorial on Machine Learning and Data Science Tools with Python.
```
In: Holzinger, A. (eds) Machine Learning for Health Informatics.<br> Lecture Notes in Computer Science, vol 9605. Springer, Cham. <br>https://doi.org/10.1007/978-3-319-50478-0_22



Using the `PyMuPDF` and `Regex` libraries, extract the text and list every **URL** present in the article.

Form for submitting the code:
* https://forms.gle/D54GFxjB8s6ZqkPo9
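
As a point of comparison with the task statement, the following is a minimal sketch of the PyMuPDF-plus-regex route: read the plain text of every page, then match URLs with a pattern. The names `URL_PATTERN`, `find_urls`, and `extract_pdf_text` are hypothetical helpers introduced here, the pattern is an assumed approximation of what counts as a URL, and the standard-library `re` module stands in for "Regex".

```python
import re

# Assumed pattern: an http(s) scheme followed by characters that are unlikely
# to be delimiters in running text. It is an approximation, not a full URL grammar.
URL_PATTERN = re.compile(r"https?://[^\s<>)\]]+")


def find_urls(text):
    """Return the sorted, de-duplicated URLs found in a block of text."""
    urls = set()
    for match in URL_PATTERN.findall(text):
        # Trailing punctuation usually belongs to the sentence, not to the URL.
        urls.add(match.rstrip(".,;:!?"))
    return sorted(urls)


def extract_pdf_text(pdf_path):
    """Concatenate the plain text of every page using PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so find_urls also works without it

    with fitz.open(pdf_path) as doc:
        return "\n".join(page.get_text() for page in doc)
```

Calling `find_urls(extract_pdf_text("artigoAtividade2.pdf"))` would return the list to print. Because the text is read line by line as laid out on the page, a URL that wraps across two lines would be captured only up to the line break, so the result may still need manual review.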

In [34]:
import pdfx
import re

def reconstruct_broken_urls(raw_urls):
    """
    Normalize the URLs extracted from the PDF.

    Removes whitespace and trailing punctuation, and prepends `http://` to
    entries that look like a domain or page but carry no scheme.
    """
    reconstructed_urls = []
    for url in raw_urls:
        # Remove all whitespace, which also rejoins URLs broken by spaces
        url = re.sub(r'\s+', '', url)

        # Strip trailing punctuation
        url = re.sub(r'[.,;:!?]+$', '', url)

        if re.match(r'https?://', url) or re.match(r'www\.', url):
            # Already starts with a scheme or with `www.`: keep as is
            reconstructed_urls.append(url)
        elif re.search(r'\.html|\.org|\.com|\.io|\.gov|\.net|\.edu', url) and 'http' not in url:
            # Common domain or page suffix without a scheme: assume http
            reconstructed_urls.append('http://' + url)
        else:
            # Anything else (DOIs, file names, fragments) is kept for later review
            reconstructed_urls.append(url)

    # Remove duplicates and sort
    return sorted(set(reconstructed_urls))

def extract_pdf_data(pdf_path):
    """
    Extract the URLs from a PDF, normalize them, and print them as a numbered list.
    """
    # Load the PDF with pdfx
    pdf = pdfx.PDFx(pdf_path)

    # Collect the references that pdfx classified as URLs
    references = pdf.get_references_as_dict()
    raw_urls = references.get('url', [])

    # Normalize and de-duplicate the URLs
    fixed_urls = reconstruct_broken_urls(raw_urls)

    # Print the URLs
    print("\n=== URLs in the PDF ===")
    for i, url in enumerate(fixed_urls, start=1):
        print(f"{i}. {url}")

# Path to the PDF
pdf_path = "artigoAtividade2.pdf"

# Run the extraction
extract_pdf_data(pdf_path)


=== URLs in the PDF ===
1. 10.1007/978-3-319-50478-0
2. 10.1007/978-3-642-40763-5
3. fibonacci.py
4. http://augmentor.readthedocs.io
5. http://augmentorjl.readthedocs.io
6. http://cacm.acm.org/blogs/blog-cacm/176450-
7. http://cacm.acm.org/blogs/blog-cacm/176450-python-is-now-the-most-popular-introductory-teaching-language-at-top-u-s-universities
8. http://developer.nvidia.com/digits
9. http://dx.doi.org/10.1007/978-3-642-40763-5_51
10. http://localhost:8888/
11. http://mathesaurus.sourceforge.net/matlab-numpy.html
12. http://pandas
13. http://pandas.pydata.org/pandas-docs/stable/missing
14. http://pandas.pydata.org/pandas-docs/stable/missing_data.html
15. http://pandas.pydata.org/pandas-docs/stable/visualization.html
16. http://pydata.org/pandas-docs/stable/visualization.html
17. http://scikit-learn.org/stable/documentation.html
18. http://topepo.github.io/caret/index.html
19. http://torch.ch/docs/getting-started.html
20. http://www.cancer.gov
21. http://www.scipy-lectures.org/
22. http: