# Project: Scraping gene names, functions, and sequences by `Protein_ID`

## Analysis goal

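The `protein_ids-1.csv` table contains only the `Protein_ID` of each protein of interest, so we need to send HTTP requests to UniProt to fetch the corresponding gene names, functions, and sequences, and then export them to a new table.

The dataset is a CSV file, `protein_ids-1.csv`, containing `Protein_ID` values collected from the literature.

`Protein_ID`: the accession number of the protein, which can be used to look up information about the protein on the relevant search sites.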
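Before any requests are sent, it can help to check that every `Protein_ID` looks like a UniProt accession. The sketch below assumes the accession pattern published in the UniProt help pages (worth re-checking against the current documentation); the helper name `is_uniprot_accession` is hypothetical and not part of the original notebook.

```python
import re

# Accession pattern as documented by UniProt (an assumption here; verify against the help page).
ACCESSION_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)

def is_uniprot_accession(protein_id: str) -> bool:
    """Return True if the whole string matches the accession pattern."""
    return ACCESSION_RE.fullmatch(protein_id.strip()) is not None

# The first three IDs come from the table; the last one is a made-up bad value.
ids = ["A0A023PZG4", "O13539", "P00127", "not-an-id"]
suspicious = [pid for pid in ids if not is_uniprot_accession(pid)]
print(suspicious)
```

Filtering out malformed IDs first avoids spending a request on values that cannot resolve.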

## Reading the data

Import the libraries needed for data analysis and for fetching data from the web.

In [90]:
import numpy as np
import pandas as pd
import requests
import urllib.request, urllib.error
import json

Read `protein_ids-1.csv` with Pandas and assign the result to the variable `protein_ids`.

In [91]:
protein_ids = pd.read_csv("C:\\Users\\king\\Documents\\WeChat Files\\wxid_9yc2eazro0k312\\FileStorage\\File\\2024-11\\protein_ids-1.csv")
protein_ids.head(5)

Unnamed: 0,Protein_ID
0,A0A023PZG4
1,O13539
2,P00127
3,P00546
4,P00549


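## Fetching the data

Use HTTP requests (here via `urllib.request`) to fetch the information from UniProt; the latest API is documented on the official website.

- First, iterate over the `Protein_ID` values in the table so that a URL can be built for each protein accession.
- Build the URL and assign it to the variable `url`.
- Then use `Request` from `urllib.request` to construct the request and assign it to the variable `request`.
- Use the `urlopen` method of `urllib.request` to get the response and assign it to the variable `response`.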
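The URL-building and request-construction steps above can be sketched without touching the network. The helper name `build_request`, the shortened `User-Agent` string, and the explicit `Accept` header are assumptions for illustration, not part of the original notebook.

```python
import urllib.request

BASE_URL = "https://www.ebi.ac.uk/proteins/api/proteins/"  # same endpoint as in the notebook

def build_request(protein_id: str) -> urllib.request.Request:
    """Build (but do not send) the request for one accession."""
    url = BASE_URL + protein_id
    headers = {
        "User-Agent": "Mozilla/5.0",   # shortened placeholder value
        "Accept": "application/json",  # assumption: ask explicitly for JSON
    }
    return urllib.request.Request(url=url, headers=headers)

request = build_request("O13539")
print(request.full_url)
print(request.get_method())
```

Keeping construction separate from `urlopen` makes it easy to inspect the URL and headers before anything is sent.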

### First fetch the JSON for a single `Protein_ID`, then use its structure to guide the rest of the data extraction

In [108]:
base_url = "https://www.ebi.ac.uk/proteins/api/proteins/"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36 Edg/130.0.0.0"}
Gene_name = [] # stores the gene names
Functions = [] # stores the protein functions
Sequence = [] # stores the protein sequences

In [93]:
url = base_url + protein_ids['Protein_ID'][1]  # URL for one protein (index 1, O13539)
request = urllib.request.Request(url=url, headers=headers)
try:
    response = urllib.request.urlopen(request)
    HTML = response.read()  # read the response body
    print(response.code)  # HTTP status code
    print(HTML.decode('utf-8'))
except urllib.error.URLError as E:
    if hasattr(E, 'code'):  # only HTTP errors carry a status code
        print(E.code)
    if hasattr(E, 'reason'):
        print(E.reason)


200
{"accession":"O13539","id":"THP2_YEAST","proteinExistence":"Evidence at protein level","info":{"type":"Swiss-Prot","created":"2006-12-12","modified":"2024-10-02","version":162},"organism":{"taxonomy":559292,"names":[{"type":"scientific","value":"Saccharomyces cerevisiae (strain ATCC 204508 / S288c)"},{"type":"common","value":"Baker's yeast"}],"lineage":["Eukaryota","Fungi","Dikarya","Ascomycota","Saccharomycotina","Saccharomycetes","Saccharomycetales","Saccharomycetaceae","Saccharomyces"]},"secondaryAccession":["D3DLB6"],"protein":{"recommendedName":{"fullName":{"value":"THO complex subunit THP2"}}},"gene":[{"name":{"value":"THP2"},"olnNames":[{"value":"YHR167W"}]}],"comments":[{"type":"FUNCTION","text":[{"value":"Component the THO subcomplex of the TREX complex, which operates in coupling transcription elongation to mRNA export. The THO complex is recruited to transcribed genes and moves along the gene with the elongating polymerase during transcription. THO is important for stabi

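The output shows that Gene_name is at "gene"->"name"->"value" (`gene` is a list, so its first element is used); Functions is at "protein"->"recommendedName"->"fullName"->"value"; and Sequence is at "sequence"->"sequence".
Try extracting these fields.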

In [94]:
All_data = json.loads(HTML)  # parse the JSON into a Python object
gene_name = All_data.get('gene', {})[0]['name'].get('value', {})
protein_functions = All_data.get('protein', {}).get('recommendedName', {}).get('fullName', {}).get('value')
protein_sequence = All_data.get('sequence', {}).get('sequence', {})
print(gene_name)
print(protein_functions)
print(protein_sequence)

THP2
THO complex subunit THP2
MTKEEGRTYFESLCEEEQSLQESQTHLLNILDILSVLADPRSSDDLLTESLKKLPDLHRELINSSIRLRYDKYQTREAQLLEDTKTGRDVAAGVQNPKSISEYYSTFEHLNRDTLRYINLLKRLSVDLAKQVEVSDPSVTVYEMDKWVPSEKLQGILEQYCAPDTDIRGVDAQIKNYLDQIKMARAKFGLENKYSLKERLSTLTKELNHWRKEWDDIEMLMFGDDAHSMKKMIQKIDSLKSEINAPSESYPVDKEGDIVLE
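Note that the value stored as `protein_functions` is the protein's recommended name. The JSON printed earlier also carries a free-text comment of type `FUNCTION` under `"comments"`. A minimal sketch of how that text could be collected is given below, using a hand-trimmed copy of the O13539 record (sequence and comment text shortened); the helper name `function_comments` is hypothetical, and the layout is assumed to match what the output above shows.

```python
# Trimmed, hand-copied subset of the O13539 response shown above.
record = {
    "accession": "O13539",
    "protein": {"recommendedName": {"fullName": {"value": "THO complex subunit THP2"}}},
    "gene": [{"name": {"value": "THP2"}, "olnNames": [{"value": "YHR167W"}]}],
    "comments": [
        {
            "type": "FUNCTION",
            "text": [{"value": "Component the THO subcomplex of the TREX complex, which "
                               "operates in coupling transcription elongation to mRNA export."}],
        },
    ],
    "sequence": {"sequence": "MTKEEGRTYF"},  # shortened for the example
}

def function_comments(rec: dict) -> list:
    """Collect the text of every comment whose type is FUNCTION."""
    texts = []
    for comment in rec.get("comments", []):
        if comment.get("type") == "FUNCTION":
            texts.extend(t.get("value", "") for t in comment.get("text", []))
    return texts

recommended_name = record["protein"]["recommendedName"]["fullName"]["value"]
print(recommended_name)
print(function_comments(record))
```

Records without a `FUNCTION` comment simply yield an empty list, so the helper can be called on every record without extra checks.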


The extraction works. Next, loop over the first four `Protein_ID` values in the table (the slice `[0:4]`) and fetch `gene_name`, `protein_functions`, and `protein_sequence` for each.

In [95]:
for Protein_ID in protein_ids['Protein_ID'][0:4]:
    url = base_url + Protein_ID  # URL for the current protein
    request = urllib.request.Request(url=url, headers=headers)
    try:
        response = urllib.request.urlopen(request)
        HTML = response.read()  # read the response body
        print(response.code)  # HTTP status code
    except urllib.error.URLError as E:
        if hasattr(E, 'code'):  # only HTTP errors carry a status code
            print(E.code)
        if hasattr(E, 'reason'):
            print(E.reason)
        continue  # nothing to parse for this ID

    All_data = json.loads(HTML)  # parse the JSON into a Python object
    try:
        gene_name = All_data.get('gene', {})[0]['name'].get('value', {})
    except KeyError:
        # some proteins have no gene name, so fall back to the locus tag (olnNames)
        gene_name = All_data.get('gene', {})[0]['olnNames'][0]['value']

    protein_functions = All_data.get('protein', {}).get('recommendedName', {}).get('fullName', {}).get('value')
    protein_sequence = All_data.get('sequence', {}).get('sequence', {})
    print(f"gene_name: {gene_name}, protein_functions: {protein_functions}, protein_sequence: {protein_sequence}")

200
gene_name: YLR236C, protein_functions: Uncharacterized protein YLR236C, protein_sequence: MHTICLRSPIDESSPLPYKSIRQPLENAHSCQALCSLMAVLCASAAHRLSETFPMRLVVAREYANWGAFQHAFTRRAGASVAATSAWFDAVAAGTENAHMQSAESCN
200
gene_name: THP2, protein_functions: THO complex subunit THP2, protein_sequence: MTKEEGRTYFESLCEEEQSLQESQTHLLNILDILSVLADPRSSDDLLTESLKKLPDLHRELINSSIRLRYDKYQTREAQLLEDTKTGRDVAAGVQNPKSISEYYSTFEHLNRDTLRYINLLKRLSVDLAKQVEVSDPSVTVYEMDKWVPSEKLQGILEQYCAPDTDIRGVDAQIKNYLDQIKMARAKFGLENKYSLKERLSTLTKELNHWRKEWDDIEMLMFGDDAHSMKKMIQKIDSLKSEINAPSESYPVDKEGDIVLE
200
gene_name: QCR6, protein_functions: Cytochrome b-c1 complex subunit 6, mitochondrial, protein_sequence: MGMLELVGEYWEQLKITVVPVVAAAEDDDNEQHEEKAAEGEEKEEENGDEDEDEDEDEDDDDDDDEDEEEEEEVTDQLEDLREHFKNTEEGKALVHHYEECAERVKIQQQQPGYADLEHKEDCVEEFFHLQHYLDTATAPRLFDKLK
200
gene_name: CDC28, protein_functions: Cyclin-dependent kinase 1, protein_sequence: MSGELANYKRLEKVGEGTYGVVYKALDLRPGQGQRVVALKKIRLESEDEGVPSTAIREISLLKELKDDNIVRLYDIVHSDAHKLYLVFEFLDLDLKRYMEGIPKDQPLGA
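The fallback from gene name to locus tag can be written as one small function instead of nested `try` blocks. This is a sketch under the assumption that records follow the layout seen in the output (`gene` is a list whose first entry may hold `name`, `olnNames`, or `orfNames`); the helper name `pick_gene_name` is hypothetical.

```python
def pick_gene_name(record: dict, default=None):
    """Return the first available label: gene name, then OLN (locus tag), then ORF name."""
    genes = record.get("gene") or []
    if not genes:
        return default  # record has no gene section at all
    gene = genes[0]
    if "name" in gene:
        return gene["name"].get("value", default)
    for key in ("olnNames", "orfNames"):
        if gene.get(key):
            return gene[key][0].get("value", default)
    return default

# Small hand-made records covering each branch.
with_name = {"gene": [{"name": {"value": "THP2"}, "olnNames": [{"value": "YHR167W"}]}]}
only_oln = {"gene": [{"olnNames": [{"value": "YLR236C"}]}]}
no_gene = {"accession": "X"}

print(pick_gene_name(with_name), pick_gene_name(only_oln), pick_gene_name(no_gene))
```

Because every branch returns a value, the caller always gets exactly one entry per protein, which keeps result lists aligned with the input table.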

Next, loop over all `Protein_ID` values and store everything that is retrieved.

In [109]:
for Protein_ID in protein_ids['Protein_ID']:
    url = base_url + Protein_ID  # URL for the current protein
    request = urllib.request.Request(url=url, headers=headers)
    try:
        response = urllib.request.urlopen(request)
        HTML = response.read()  # read the response body
        print(response.code)  # HTTP status code
        All_data = json.loads(HTML)  # parse the JSON into a Python object
    except urllib.error.URLError as E:
        if hasattr(E, 'code'):  # only HTTP errors carry a status code
            print(E.code)
        if hasattr(E, 'reason'):
            print(E.reason)
        # keep the three lists aligned with protein_ids even when a request fails
        Gene_name.append(None)
        Functions.append(None)
        Sequence.append(None)
        continue

    gene = (All_data.get('gene') or [{}])[0]
    if 'name' in gene:
        gene_name = gene['name'].get('value')
    elif 'olnNames' in gene:
        gene_name = gene['olnNames'][0]['value']  # no gene name: use the locus tag
    elif 'orfNames' in gene:
        gene_name = gene['orfNames'][0]['value']  # no locus tag either: use the ORF name
    else:
        gene_name = 'We are finding it...'
    Gene_name.append(gene_name)  # collect into the Gene_name list created earlier

    protein_functions = All_data.get('protein', {}).get('recommendedName', {}).get('fullName', {}).get('value')
    Functions.append(protein_functions)  # collect into the Functions list created earlier

    protein_sequence = All_data.get('sequence', {}).get('sequence')
    Sequence.append(protein_sequence)  # collect into the Sequence list created earlier

"That's all..."

200
200
... (repeated `200` status lines omitted)


"That's all..."

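With several hundred requests in one loop, a single transient network error would otherwise cost a row. A minimal retry sketch follows; the function name `fetch_record`, its parameters, and the pause length are assumptions for illustration. The `opener` argument exists only so the logic can be exercised without network access.

```python
import io
import json
import time
import urllib.error
import urllib.request

def fetch_record(request_or_url, opener=urllib.request.urlopen, retries=3, pause=1.0):
    """Return the parsed JSON body, or None after `retries` failed attempts."""
    for attempt in range(1, retries + 1):
        try:
            with opener(request_or_url) as response:
                return json.loads(response.read())
        except urllib.error.URLError as err:
            print(f"attempt {attempt} failed: {getattr(err, 'reason', err)}")
            if attempt < retries:
                time.sleep(pause)  # brief pause before trying again
    return None

# Offline demonstration with a stand-in opener that fails once, then succeeds.
calls = {"n": 0}

def flaky_opener(_):
    calls["n"] += 1
    if calls["n"] == 1:
        raise urllib.error.URLError("temporary failure")
    return io.BytesIO(b'{"accession": "O13539"}')

result = fetch_record("unused-in-demo", opener=flaky_opener, pause=0)
print(result)
```

In the real loop, the function would be called with the `Request` object built for each `Protein_ID`, and a `None` result would be recorded as a missing row.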
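Combine the stored data into a DataFrame assigned to the variable `protein_gene_function_sequence`, then concatenate it column-wise with the `protein_ids` table and assign the result to the variable `protein_ids_gene_function_sequence`.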

In [111]:
protein_gene_function_sequence = pd.DataFrame({'Gene_name': Gene_name, 'Functions': Functions, 'Sequence': Sequence})
protein_ids_gene_function_sequence = pd.concat([protein_ids, protein_gene_function_sequence], axis=1)
protein_ids_gene_function_sequence.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 830 entries, 0 to 829
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Protein_ID  830 non-null    object
 1   Gene_name   830 non-null    object
 2   Functions   830 non-null    object
 3   Sequence    830 non-null    object
dtypes: object(4)
memory usage: 26.1+ KB
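`pd.concat(..., axis=1)` pairs rows by position, so it is only correct when every request contributed exactly one entry to each list. An alternative sketch, assuming each result is kept together with its `Protein_ID`, joins on the key instead. The variable names and the toy data below are illustrative only.

```python
import pandas as pd

ids_demo = pd.DataFrame({"Protein_ID": ["A0A023PZG4", "O13539", "P00127"]})

# One dict per successful request; P00127 is left out on purpose to mimic a failed fetch.
records = [
    {"Protein_ID": "O13539", "Gene_name": "THP2", "Functions": "THO complex subunit THP2"},
    {"Protein_ID": "A0A023PZG4", "Gene_name": "YLR236C", "Functions": "Uncharacterized protein YLR236C"},
]
annotations = pd.DataFrame(records)

# A left merge keeps every input ID and leaves NaN where nothing was fetched.
merged = ids_demo.merge(annotations, on="Protein_ID", how="left")
print(merged)
```

Joining on `Protein_ID` makes a failed or out-of-order request visible as a missing value rather than as silently shifted rows.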


In [112]:
protein_ids_gene_function_sequence.head(50)

Unnamed: 0,Protein_ID,Gene_name,Functions,Sequence
0,A0A023PZG4,YLR236C,Uncharacterized protein YLR236C,MHTICLRSPIDESSPLPYKSIRQPLENAHSCQALCSLMAVLCASAA...
1,O13539,THP2,THO complex subunit THP2,MTKEEGRTYFESLCEEEQSLQESQTHLLNILDILSVLADPRSSDDL...
2,P00127,QCR6,"Cytochrome b-c1 complex subunit 6, mitochondrial",MGMLELVGEYWEQLKITVVPVVAAAEDDDNEQHEEKAAEGEEKEEE...
3,P00546,CDC28,Cyclin-dependent kinase 1,MSGELANYKRLEKVGEGTYGVVYKALDLRPGQGQRVVALKKIRLES...
4,P00549,CDC19,Pyruvate kinase 1,MSRLERLTSLNVVAGSDLRRTSIIGTIGPKTNNPETLVALRKAGLN...
5,P00812,CAR1,Arginase,METGPHYNYYKNRELSIVLAPFSGGQGKLGVEKGPKYMLKHGLQTS...
6,P00815,HIS4,Histidine biosynthesis trifunctional protein,MVLPILPLIDDLASWNSKKEYVSLVGQVLLDGSSLSNEEILQFSKE...
7,P00899,TRP2,Anthranilate synthase component 1,MTASIKIQPDIDSLKQLQQQNDDSSINMYPVYAYLPSLDLTPHVAY...
8,P00942,TPI1,Triosephosphate isomerase,MARTFFVGGNFKLNGSKQSIKEIVERLNTASIPENVEVVICPPATY...
9,P00950,GPM1,Phosphoglycerate mutase 1,MPKLVLVRHGQSEWNEKNLFTGWVDVKLSAKGQQEAARAGELLKEK...


## Saving the table

In [114]:
protein_ids_gene_function_sequence.to_csv("C:\\Users\\king\\Documents\\WeChat Files\\wxid_9yc2eazro0k312\\FileStorage\\File\\2024-11\\protein_ids_gene_function_sequence.csv", index=False)
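The `index=False` argument matters when the file is read back later. The small in-memory sketch below (toy data, no files written) shows what each setting produces; it illustrates pandas behaviour and is not part of the original notebook.

```python
import io
import pandas as pd

demo = pd.DataFrame({"Protein_ID": ["O13539"], "Gene_name": ["THP2"]})

with_index = io.StringIO()
demo.to_csv(with_index)                  # default index=True also writes the row index
without_index = io.StringIO()
demo.to_csv(without_index, index=False)  # only the data columns are written

with_index.seek(0)
without_index.seek(0)
cols_with = list(pd.read_csv(with_index).columns)
cols_without = list(pd.read_csv(without_index).columns)
print(cols_with)
print(cols_without)
```

When the index is written, reading the file back yields an extra `Unnamed: 0` column, which `index=False` avoids.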