### Properties of Verified Proteins of Homo sapiens

Using the advanced search tool at https://www.uniprot.org/, we queried only for proteins with Taxonomy: Homo sapiens and Review-Status: Yes. We found 20,427 such results (as of 10 October 2023), all of which were downloaded from the website in gz format.

As of 8 March 2024, the same query returns 20,434 Swiss-Prot verified human proteins on UniProt.

#### Extracting the fasta file from the gzip

In [1]:
import gzip

input_file = "uniprotkb_taxonomy_id_9606_AND_reviewed_2024_03_08.fasta.gz"
output_file = "uniprotkb_taxonomy_id_9606_AND_reviewed_2024_03_08.fasta"

with gzip.open(input_file, 'rb') as gz_file:
    with open(output_file, 'wb') as out_file:
        out_file.write(gz_file.read())

print("File extracted successfully.")

File extracted successfully.
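
The cell above reads the whole decompressed archive into memory before writing it out, which is fine for a file of this size. As a minimal sketch for larger downloads, the same extraction can be streamed in chunks with `shutil.copyfileobj`. The helper name `extract_gzip` and the demo file names are hypothetical, and the demo record is made up rather than taken from the UniProt download.

```python
import gzip
import os
import shutil
import tempfile

def extract_gzip(input_file, output_file):
    """Decompress input_file to output_file in chunks instead of all at once."""
    with gzip.open(input_file, "rb") as gz_file, open(output_file, "wb") as out_file:
        shutil.copyfileobj(gz_file, out_file)

# Demonstration on a tiny throwaway archive (not the real UniProt download).
demo_dir = tempfile.mkdtemp()
demo_gz = os.path.join(demo_dir, "demo.fasta.gz")
demo_fasta = os.path.join(demo_dir, "demo.fasta")
demo_text = b">sp|P00000|DEMO_HUMAN Hypothetical demo record\nMSYTLDSLGNPSAYRR\n"
with gzip.open(demo_gz, "wb") as handle:
    handle.write(demo_text)

extract_gzip(demo_gz, demo_fasta)
with open(demo_fasta, "rb") as handle:
    extracted = handle.read()
print("Round trip matches:", extracted == demo_text)
```

For the real file, the call would be `extract_gzip(input_file, output_file)` with the two names defined in the cell above.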


#### Importing necessary libraries/modules

In [1]:
from pyfaidx import Fasta
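import Bio
# Explicit imports instead of a wildcard; these three are used by the structure-parsing example below
from Bio.PDB import MMCIFParser, PDBParser, CaPPBuilder
from Bio.SeqUtils.ProtParam import ProteinAnalysis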
from Bio.SeqUtils.ProtParam import ProtParamData
from tqdm.notebook import tqdm

ModuleNotFoundError: No module named 'pyfaidx'

#### Extracting information from fasta file

In [3]:
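# Key each record by its UniProt accession. Note: this reads the 10 October 2023 download, not the 2024 file extracted above.
sequences = Fasta('uniprotkb_taxonomy_id_9606_AND_reviewed_2023_10_10.fasta', key_function=lambda x: x.split('|')[1])
print("Number of Sequences:", len(sequences.keys()))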

Number of Sequences: 20427
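
The `key_function` above relies on the layout of UniProt FASTA record names, which have the form `db|ACCESSION|ENTRY_NAME`. The sketch below shows the same split on a single name. It assumes the name passed to `key_function` is the first whitespace-delimited token of the header line; `accession_from_name` is a hypothetical helper and the example name is written out by hand for illustration.

```python
def accession_from_name(name):
    """Return the accession, i.e. the second '|'-separated field of a UniProt record name."""
    return name.split('|')[1]

# Illustrative record name in the "db|ACCESSION|ENTRY_NAME" layout.
example_name = "sp|P07197|NFM_HUMAN"
database, accession, entry_name = example_name.split('|')

print(accession_from_name(example_name))  # P07197
print(database, entry_name)               # sp NFM_HUMAN
```

Keying by accession is what makes lookups such as `d['P07197']` possible later in the notebook.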


#### Function to extract the properties of a protein

In [20]:
# cif_parser = MMCIFParser()
# structure = cif_parser.get_structure('P07197', "1fat-sf.cif")

# pdb_parser = PDBParser()
# structure = pdb_parser.get_structure("PHA-L", "AF-P05087-F1-model_v4.pdb")

# polypeptide_builder = CaPPBuilder()
# counter = 1
# for polypeptide in polypeptide_builder.build_peptides(structure):
#     seq = polypeptide.get_sequence()
#     print(f"Sequence: {counter}, Length: {len(seq)}")
#     print(seq)
#     counter += 1

In [4]:
def sequence_analysis(sequence, pH=7):
    """Compute ProtParam properties for one protein sequence and return them as a dict.

    Selenocysteine ('U') is replaced with cysteine ('C') before analysis, so the
    returned 'Sequence' and all derived values describe the substituted sequence.
    """
    sequence = str(sequence).replace('U', 'C')
    analyzed_seq = ProteinAnalysis(sequence)

    seq_info = dict()
    seq_info['Sequence'] = sequence
    seq_info['Sequence Length'] = len(sequence)
    seq_info['Molecular Weight'] = analyzed_seq.molecular_weight()
    seq_info['GRAVY'] = analyzed_seq.gravy()
    seq_info['Amino Acid Count'] = analyzed_seq.count_amino_acids()
    seq_info['Amino Acid Percent'] = analyzed_seq.get_amino_acids_percent()
    seq_info['Molar Extinction Coefficient'] = analyzed_seq.molar_extinction_coefficient()
    seq_info['Isoelectric Point'] = analyzed_seq.isoelectric_point()
    seq_info['Instability Index'] = analyzed_seq.instability_index()
    seq_info['Aromaticity'] = analyzed_seq.aromaticity()
    seq_info['Secondary Structure'] = analyzed_seq.secondary_structure_fraction()
    seq_info['Flexibility'] = analyzed_seq.flexibility()
    seq_info[f'Charge at {pH}'] = analyzed_seq.charge_at_pH(pH=pH)
    return seq_info

#### Extracting properties from all proteins

In [5]:
uniprot_ids = list(sequences.keys())
uniprot_protein_props = dict()
for id_ in uniprot_ids:
    uniprot_protein_props[id_] = sequence_analysis(str(sequences[id_]))
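
The loop above assumes that, after the `U` to `C` substitution, every sequence contains only letters the analysis can handle. As a hedged sketch that is not part of the original notebook, sequences containing other letters (for example `X` for an unknown residue) could be flagged before analysis. `STANDARD_RESIDUES` and `nonstandard_residues` are hypothetical names, and the demo sequences are made up.

```python
# The 20 standard amino acids in one-letter code.
STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")

def nonstandard_residues(sequence):
    """Return a sorted list of letters in the sequence outside the 20 standard amino acids."""
    return sorted(set(sequence) - STANDARD_RESIDUES)

# Made-up sequences for illustration only.
demo_sequences = {
    "DEMO1": "MSSHEGGKKKALKQPKK",  # only standard residues
    "DEMO2": "MSUHEGXKKK",         # contains U and X
}

flagged = {}
for id_, seq in demo_sequences.items():
    extra = nonstandard_residues(seq)
    if extra:
        flagged[id_] = extra

print(flagged)
```

Applied to the real data, the same check would run over `str(sequences[id_])` for each entry in `uniprot_ids`.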

In [6]:
import json

# Serialize the nested dictionary straight to disk
with open("proteins.json", "w") as file:
    json.dump(uniprot_protein_props, file)

In [7]:
with open("proteins.json", "r") as file:
    d = json.load(file)
print(d['P07197'])

{'Sequence': 'MSYTLDSLGNPSAYRRVTETRSSFSRVSGSPSSGFRSQSWSRGSPSTVSSSYKRSMLAPRLAYSSAMLSSAESSLDFSQSSSLLNGGSGPGGDYKLSRSNEKEQLQGLNDRFAGYIEKVHYLEQQNKEIEAEIQALRQKQASHAQLGDAYDQEIRELRATLEMVNHEKAQVQLDSDHLEEDIHRLKERFEEEARLRDDTEAAIRALRKDIEEASLVKVELDKKVQSLQDEVAFLRSNHEEEVADLLAQIQASHITVERKDYLKTDISTALKEIRSQLESHSDQNMHQAEEWFKCRYAKLTEAAEQNKEAIRSAKEEIAEYRRQLQSKSIELESVRGTKESLERQLSDIEERHNHDLSSYQDTIQQLENELRGTKWEMARHLREYQDLLNVKMALDIEIAAYRKLLEGEETRFSTFAGSITGPLYTHRPPITISSKIQKPKVEAPKLKVQHKFVEEIIEETKVEDEKSEMEEALTAITEELAVSMKEEKKEAAEEKEEEPEAEEEEVAAKKSPVKATAPEVKEEEGEKEEEEGQEEEEEEDEGAKSDQAEEGGSEKEGSSEKEEGEQEEGETEAEAEGEEAEAKEEKKVEEKSEEVATKEELVADAKVEKPEKAKSPVPKSPVEEKGKSPVPKSPVEEKGKSPVPKSPVEEKGKSPVPKSPVEEKGKSPVSKSPVEEKAKSPVPKSPVEEAKSKAEVGKGEQKEEEEKEVKEAPKEEKVEKKEEKPKDVPEKKKAESPVKEEAVAEVVTITKSVKVHLEKETKEEGKPLQQEKEKEKAGGEGGSEEEGSDKGAKGSRKEDIAVNGEVEGKEEVEQETKEKGSGREEEKGVVTNGLDLSPADEKKGGDKSEEKVVVTKTVEKITSEGGDGATKYITKSVTVTQKVEEHEETFEEKLVSTKKVEKVTSHAIVKEVTQSD', 'Sequence Length': 916, 'Molecular Weight': 102470.78320000065, 'GR

In [9]:
len(d)

20427
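
One side effect of the JSON round trip is worth noting: JSON has no tuple type, so any value held as a Python tuple comes back as a list. The sketch below assumes, as in current Biopython releases, that `molar_extinction_coefficient()` returns a tuple; the `record` dictionary is a hand-written stand-in for one entry of `uniprot_protein_props`.

```python
import json

# Hand-written stand-in for one entry; the tuple mimics a two-value result.
record = {"Molar Extinction Coefficient": (0, 0), "Sequence Length": 64}

restored = json.loads(json.dumps(record))

before = type(record["Molar Extinction Coefficient"]).__name__
after = type(restored["Molar Extinction Coefficient"]).__name__
print(before, "->", after)  # tuple -> list
```

Code that reads the file back should therefore index these values as lists rather than compare them against tuples.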

### Reading from protein_props.json

In [3]:
import json
with open('protein_props.json', 'r') as file:
    data = json.load(file)

info = data['A0A024R1R8']
for key, value in info.items():
    print(f"{key}: {value}")

Sequence: MSSHEGGKKKALKQPKKQAKEMDEEEKAFKQKQKEEQKKLEVLKAKVVGKGPLATGGIKKSGKK
Sequence Length: 64
Molecular Weight: 7091.307600000003
GRAVY: -1.4281250000000014
Amino Acid Count: {'A': 5, 'C': 0, 'D': 1, 'E': 8, 'F': 1, 'G': 7, 'H': 1, 'I': 1, 'K': 20, 'L': 4, 'M': 2, 'N': 0, 'P': 2, 'Q': 5, 'R': 0, 'S': 3, 'T': 1, 'V': 3, 'W': 0, 'Y': 0}
Amino Acid Percent: {'A': 0.078125, 'C': 0.0, 'D': 0.015625, 'E': 0.125, 'F': 0.015625, 'G': 0.109375, 'H': 0.015625, 'I': 0.015625, 'K': 0.3125, 'L': 0.0625, 'M': 0.03125, 'N': 0.0, 'P': 0.03125, 'Q': 0.078125, 'R': 0.0, 'S': 0.046875, 'T': 0.015625, 'V': 0.046875, 'W': 0.0, 'Y': 0.0}
Molar Extinction Coefficient: [0, 0]
Isoelectric Point: 10.000094032287599
Instability Index: 41.382812499999986
Aromaticity: 0.015625
Secondary Structure: [0.609375, 0.203125, 0.15625]
Flexibility: [1.0270357142857143, 1.0508214285714286, 1.069107142857143, 1.0568690476190477, 1.0592857142857144, 1.0289880952380952, 1.0330238095238096, 1.0715833333333333, 1.03335714285714
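
As a sanity check on the stored values, GRAVY and aromaticity for `A0A024R1R8` can be recomputed from the amino acid counts printed above. This is a minimal sketch that assumes GRAVY is the mean Kyte-Doolittle hydropathy over all residues and aromaticity is the fraction of F, W, and Y residues. The `KYTE_DOOLITTLE` table is typed in here rather than taken from Biopython.

```python
# Kyte-Doolittle hydropathy values per residue (entered by hand for this sketch).
KYTE_DOOLITTLE = {
    'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5,
    'Q': -3.5, 'E': -3.5, 'G': -0.4, 'H': -3.2, 'I': 4.5,
    'L': 3.8, 'K': -3.9, 'M': 1.9, 'F': 2.8, 'P': -1.6,
    'S': -0.8, 'T': -0.7, 'W': -0.9, 'Y': -1.3, 'V': 4.2,
}

# 'Amino Acid Count' for A0A024R1R8, copied from the output above.
counts = {'A': 5, 'C': 0, 'D': 1, 'E': 8, 'F': 1, 'G': 7, 'H': 1, 'I': 1,
          'K': 20, 'L': 4, 'M': 2, 'N': 0, 'P': 2, 'Q': 5, 'R': 0, 'S': 3,
          'T': 1, 'V': 3, 'W': 0, 'Y': 0}

length = sum(counts.values())

# GRAVY: total hydropathy divided by sequence length.
gravy = sum(KYTE_DOOLITTLE[aa] * n for aa, n in counts.items()) / length

# Aromaticity: fraction of aromatic residues (F, W, Y).
aromaticity = (counts['F'] + counts['W'] + counts['Y']) / length

print(length)           # 64
print(round(gravy, 6))  # -1.428125
print(aromaticity)      # 0.015625
```

These agree with the `Sequence Length`, `GRAVY`, and `Aromaticity` values in the output above, up to floating-point rounding.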