### Collecting data from the Web

In this notebook we systematize the collection of data from the GenBank website.

Given a sequence identified by, for example, an id, we want to visit the site, download the corresponding record, parse it (as we have already done), and then insert the relevant parts into a database.

Examples of links to sequences:
- https://www.ncbi.nlm.nih.gov/nuccore/L42022
- https://www.ncbi.nlm.nih.gov/nuccore/L42023
- https://www.ncbi.nlm.nih.gov/nuccore/LC740868.1
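
The links above share a fixed prefix followed by the sequence id. A minimal sketch of building such a link from an id; `nuccore_url` and `NUCCORE_BASE` are hypothetical names introduced here for illustration, not part of any library:

```python
from urllib.parse import quote

# Prefix shared by the example links above
NUCCORE_BASE = "https://www.ncbi.nlm.nih.gov/nuccore/"

def nuccore_url(accession):
    """Build the page URL for a sequence id such as 'L42022'."""
    # quote() guards against characters that are not valid in a URL path
    return NUCCORE_BASE + quote(accession.strip())

for acc in ["L42022", "L42023", "LC740868.1"]:
    print(nuccore_url(acc))
```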

Once this link has been requested, JavaScript running inside the page issues a second request to the server asking for the sequence record.

Example:
- https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?tool=portal&save=file&log$=seqview&db=nuccore&report=genbank&id=804715&conwithfeat=on&hide-cdd=on&ncbi_phid=null
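
To see what this second request carries, its query string can be split into parameters with the standard library. The sketch below only inspects the example URL above and makes no request. Note that `id` is a number rather than a sequence id like L42022.

```python
from urllib.parse import urlsplit, parse_qs

# The example request shown above, split over several lines for readability
url = ("https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?tool=portal&save=file"
       "&log$=seqview&db=nuccore&report=genbank&id=804715&conwithfeat=on"
       "&hide-cdd=on&ncbi_phid=null")

parts = urlsplit(url)
params = parse_qs(parts.query)  # maps each parameter name to a list of values

print(parts.path)
for name, values in params.items():
    print(name, "=", values[0])
```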

In the following example, the page is requested, but it does not contain the record we are interested in.

The record is loaded asynchronously by JavaScript code.

In [1]:
# import requests
# r = requests.get('https://www.ncbi.nlm.nih.gov/nuccore/PA500505.1')
# print(r.content)
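
One way to check whether a response holds the record is to test for the `LOCUS` line with which the records printed in this notebook begin. A minimal sketch; `looks_like_genbank_record` is a hypothetical helper, and the HTML payload is a made-up stand-in for the real page:

```python
def looks_like_genbank_record(payload):
    """Return True if the payload starts with a GenBank LOCUS line."""
    if isinstance(payload, bytes):
        text = payload.decode("utf-8", errors="replace")
    else:
        text = payload
    return text.lstrip().startswith("LOCUS")

# Made-up stand-in for the HTML page returned by the first request
html_page = b"<!DOCTYPE html><html><head><title>Nucleotide</title></head></html>"
# First line of the record as printed further down in this notebook
record = b"LOCUS       PA500505                 318 bp    DNA     linear   PAT 31-MAY-2022\n"

print(looks_like_genbank_record(html_page))
print(looks_like_genbank_record(record))
```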

In the following example, instead of requesting the page that contains the record, only the record itself is requested, now that we understand how it is requested.

In [2]:
# pip3 install requests beautifulsoup4

import requests
from bs4 import BeautifulSoup

# Making a GET request
r = requests.get('https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id=2246533317&db=nuccore&report=genbank&conwithfeat=on&hide-cdd=on&retmode=text&ncbi_phid=CE88F25338A215A1000000000483042A&withmarkup=on&tool=portal&log$=seqview&maxdownloadsize=1000000')
#r = requests.get('https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id=804715&db=nuccore&report=genbank&conwithfeat=on&hide-cdd=on&retmode=html&ncbi_phid=CE8B6449389BE9F100000000068605EB&withmarkup=on&tool=portal&log$=seqview&maxdownloadsize=1000000')

print( r.content ) 




b'LOCUS       PA500505                 318 bp    DNA     linear   PAT 31-MAY-2022\nDEFINITION  JP 2022506561-A/6: ANTIBODY BINDING TO HUMAN IL-1beta, PREPARATION\n            METHOD THEREFOR AND USE THEREOF.\nACCESSION   PA500505\nVERSION     PA500505.1\nKEYWORDS    JP 2022506561-A/6.\nSOURCE      Homo\n  ORGANISM  Homo\n            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;\n            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;\n            Catarrhini; Hominidae.\nREFERENCE   1  (bases 1 to 318)\n  AUTHORS   Xia,Y., Wang,Z., Zhang,P. and Li,B.\n  TITLE     ANTIBODY BINDING TO HUMAN IL-1beta, PREPARATION METHOD THEREFOR AND\n            USE THEREOF\n  JOURNAL   Patent: JP 2022506561-A 6 17-JAN-2022;\n            ZEDA BIOPHARMACEUTICALS INC\nCOMMENT     OS   Homo\n            PN   JP 2022506561-A/6\n            PD   17-JAN-2022\n            PF   04-NOV-2019 JP 2021523978\n            PR   07-NOV-2018 CN 201811322002.7\n            PA   ZEDA BIOP

### Problem

Requesting the page does not return the sequence record.

The record has to be requested by an id (an internal number) that is different from the sequence id (L42022, for example).

### Solution

The solution is to make two requests. The first requests the page, from which only the internal numeric id associated with the sequence is extracted. That internal id is then used to make the second request.
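
The two offline pieces of this approach, extracting the internal id and building the second URL, can be sketched without any network access. Assumptions: the id is read from the `ncbi_uidlist` meta tag that the notebook code searches for, the query parameters are the ones used in this notebook, and the HTML fragment is made up, with 804715 borrowed from the example request purely for illustration. The standard-library `HTMLParser` is swapped in here for BeautifulSoup, and `UidListParser`, `extract_uid` and `record_url` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

VIEWER_URL = "https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi"

class UidListParser(HTMLParser):
    """Collect the content of <meta name="ncbi_uidlist" ...> tags."""

    def __init__(self):
        super().__init__()
        self.uids = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if attributes.get("name") == "ncbi_uidlist" and attributes.get("content"):
            self.uids.append(attributes["content"])

def extract_uid(html):
    """Return the first internal id found in the page, or None."""
    parser = UidListParser()
    parser.feed(html)
    return parser.uids[0] if parser.uids else None

def record_url(uid):
    """Build the URL of the second request from the internal id."""
    params = {
        "id": uid,
        "db": "nuccore",
        "report": "genbank",
        "conwithfeat": "on",
        "hide-cdd": "on",
        "retmode": "text",
        "maxdownloadsize": 5000000,
    }
    return VIEWER_URL + "?" + urlencode(params)

# Made-up fragment standing in for the page returned by the first request
sample_html = """
<html><head>
<title>Example</title>
<meta name="ncbi_uidlist" content="804715" />
</head><body></body></html>
"""

uid = extract_uid(sample_html)
print(uid)
print(record_url(uid))
```

Returning `None` when the tag is missing lets the caller decide what to do before the second request is attempted.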



In [3]:
# # pip3 install html5lib
# def get_file(url):
# 	import requests
# 	from bs4 import BeautifulSoup

# 	# Making a GET request
# 	r = requests.get(url)
# 	# Parsing the HTML
# 	soup = BeautifulSoup(r.content, 'html.parser')

# 	# Look for a meta tag with a given attribute
# 	lines = soup.find_all('meta', {'name':"ncbi_uidlist"} )

# 	id = ""
# 	url = ""
# 	for line in lines:
# 		# print(line)
# 		# if 'name' in line.attrs:
# 		# 	print(line.attrs['name'])
# 		if 'content' in line.attrs:
# 			# print(line.attrs['content'])		
# 			id = line.attrs['content']

# 	if id:
# 		url = "https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id={}&db=nuccore&report=genbank&conwithfeat=on&hide-cdd=on&retmode=text&maxdownloadsize=5000000".format(id)

# 	r2 = requests.get( url )

# 	return print(r2.content)

# get_file("https://www.ncbi.nlm.nih.gov/nuccore/L42200")



In [5]:
# pip3 install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup

def get_file(url):
	# First request: fetch the sequence page
	r = requests.get(url)
	# Parse the HTML
	soup = BeautifulSoup(r.content, 'html.parser')

	# Look for the meta tag that carries the internal numeric id
	lines = soup.find_all('meta', {'name': "ncbi_uidlist"})

	uid = ""
	for line in lines:
		if 'content' in line.attrs:
			uid = line.attrs['content']

	# Without an internal id there is nothing to request
	if not uid:
		raise ValueError("ncbi_uidlist meta tag not found in {}".format(url))

	# Second request: fetch the record itself using the internal id
	record_url = "https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id={}&db=nuccore&report=genbank&conwithfeat=on&hide-cdd=on&retmode=text&maxdownloadsize=5000000".format(uid)
	r2 = requests.get(record_url)

	return r2.content

a = get_file("https://www.ncbi.nlm.nih.gov/nuccore/L42200")
# Decode the bytes so that line breaks are rendered as such
b = a.decode("utf-8")
print(b)

LOCUS       HLICPITSAD               249 bp    DNA     linear   PLN 07-JUL-2003
DEFINITION  Heliamphora minor chloroplast internal transcribed spacer 2,
            partial sequence.
ACCESSION   L42200
VERSION     L42200.2
KEYWORDS    .
SOURCE      chloroplast Heliamphora minor
  ORGANISM  Heliamphora minor
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliopsida; eudicotyledons; Gunneridae;
            Pentapetalae; asterids; Ericales; Sarraceniaceae; Heliamphora.
REFERENCE   1  (bases 1 to 249)
  AUTHORS   Bayer,R.J., Hufford,L. and Soltis,D.E.
  TITLE     Phylogenetic relationships in Sarraceniaceae based on rbcL and ITS
            sequences
  JOURNAL   Syst. Bot. 21 (2), 121-134 (1996)
REFERENCE   2  (bases 1 to 249)
  AUTHORS   Bayer,R.J.
  TITLE     The implications of morphological and molecular sequence data from
            cpDNA and nrDNA for phylogeny reconstruction in Sarraceniaceae
  JOURNAL   Unpublished
CO

In [6]:
# # pip3 install html5lib

# import requests
# from bs4 import BeautifulSoup


# # Making a GET request
# r = requests.get('https://www.ncbi.nlm.nih.gov/nuccore/L42022')

# # https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id=804715&db=nuccore&report=genbank&conwithfeat=on&hide-cdd=on&retmode=html&ncbi_phid=CE8B6449389BE9F100000000068605EB&withmarkup=on&tool=portal&log$=seqview&maxdownloadsize=1000000

# # Parsing the HTML
# soup = BeautifulSoup(r.content, 'html.parser')

# # Getting the title tag
# print(soup.title)

# s = soup.find('pre', class_='genbank')
# # s = soup.find('div', class_='sequence')
# # s = soup.find_all('meta', name='ncbi_uidlist')

# # lines = soup.find_all('meta')
# # lines = soup.find_all("genbank")
# # for line in lines:
# # 	print(line.text)

# # use the child attribute to get
# # the name of the child tag

In [7]:
# # pip3 install html5lib

# import requests
# from bs4 import BeautifulSoup

# r = requests.get('https://www.ncbi.nlm.nih.gov/nuccore/L42022')


# soup = BeautifulSoup(r.content, 'html.parser')
# print(soup)
# s = soup.find('pre', class_='genbank')



In [8]:
# import requests
# from bs4 import BeautifulSoup


# # Making a GET request
# r = requests.get('https://www.geeksforgeeks.org/python-programming-language/')

# # Parsing the HTML
# soup = BeautifulSoup(r.content, 'html.parser')
# s = soup.find('div', class_='entry-content')

# lines = s.find_all('p')

# for line in lines:
# 	print(line.text)

In [9]:
# import time

# for i in range(200,206):
#     url = f"https://www.ncbi.nlm.nih.gov/nuccore/L42{i}" 
#     print(url)
#     get_file(url)
#     print()
#     time.sleep(1)
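
When looping over several sequence ids, it helps to pause between requests and to keep going when one of them fails. A sketch under assumptions: `fetch_many` and its `fetch` argument are hypothetical names, where `fetch` stands for any function that takes a URL and returns the record, such as `get_file`. The demo uses a stand-in fetcher so that it runs without network access.

```python
import time

def fetch_many(accessions, fetch, delay=1.0):
    """Fetch each accession in turn, pausing `delay` seconds between requests."""
    results = {}
    for position, accession in enumerate(accessions):
        if position > 0:
            time.sleep(delay)  # pause before every request except the first
        url = "https://www.ncbi.nlm.nih.gov/nuccore/" + accession
        try:
            results[accession] = fetch(url)
        except Exception as error:
            # Keep going; store the failure in place of the record
            results[accession] = error
    return results

def fake_fetch(url):
    # Stand-in for get_file(): builds a dummy payload, no network access
    return ("record for " + url.rsplit("/", 1)[-1]).encode()

accessions = ["L42{}".format(i) for i in range(200, 206)]
records = fetch_many(accessions, fake_fetch, delay=0)
print(sorted(records))
```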


