## SCOP2 data extraction modules


As shown in the EDA section, SCOP2 data is split across multiple files and contains many variables that are irrelevant to fold classification.

This section contains modules that automatically transform SCOP2 data into "sequence - fold" pairs suitable for the fold classification problem.

In [33]:
import typing as tp
import os.path
import requests

import pandas as pd
from sklearn import preprocessing

The functions below download the data and convert it into a format readable by pandas.

In [37]:
def delete_header_of_classification_file(filename: str) -> None:
    """
    Deletes header of classification file.

    Drops the first five lines and strips the two leading characters
    from the sixth line, which then serves as the column header row.

    Input: 
        `filename (str)` -- classification file name
    """
    with open(filename, 'r') as scop_classes_raw:
        lines = scop_classes_raw.readlines()
    with open(filename, 'w') as scop_classes_transformed:
        # Line 6 holds the column names behind a two-character prefix.
        scop_classes_transformed.write(lines[5][2:])
        scop_classes_transformed.writelines(lines[6:])

    
def load_datafiles(urls: tp.List[str], filenames: tp.List[str], 
                  cla_id: tp.Optional[int] = None) -> None:
    """
    Downloads data from urls and stores them in ./../data/ folder with
    names given in filenames. Files that already exist are skipped.

    Input: 
        `urls (list[str])` -- urls of data files
        `filenames (list[str])` -- names of files
        `cla_id (int or None)` -- id of classification file in filenames

    """
    for i, (url, filename) in enumerate(zip(urls, filenames)):
        path = './../data/' + filename
        if os.path.isfile(path):
            continue
        r = requests.get(url, allow_redirects=True)
        r.raise_for_status()  # do not save an error page as data
        with open(path, 'wb') as file:
            file.write(r.content)
        # Compare by value: `is` tests identity and is unreliable for ints.
        if i == cla_id:
            delete_header_of_classification_file(path)
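
To make the header-stripping step concrete, here is a minimal sketch that runs the same idea on a temporary file. The file contents and the helper name `strip_header` are made up for illustration; only the layout (five comment lines, then a prefixed column-header line) mirrors what `delete_header_of_classification_file` assumes.

```python
import os
import tempfile

# Made-up contents that mimic the assumed layout: five comment lines,
# then a "# "-prefixed column header, then the data rows.
sample = (
    "# comment 1\n"
    "# comment 2\n"
    "# comment 3\n"
    "# comment 4\n"
    "# comment 5\n"
    "# FA-DOMID SCOPCLA\n"
    "8000001 CF=2000001\n"
)


def strip_header(path: str, n_comment_lines: int = 5, prefix_len: int = 2) -> None:
    """Hypothetical helper: drop the comment lines and unprefix the header."""
    with open(path, "r") as f:
        lines = f.readlines()
    with open(path, "w") as f:
        f.write(lines[n_comment_lines][prefix_len:])
        f.writelines(lines[n_comment_lines + 1:])


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy-cla.txt")
    with open(path, "w") as f:
        f.write(sample)
    strip_header(path)
    with open(path) as f:
        result = f.read()

print(result)
```

After the call, the file starts with a plain space-delimited header row, which is what `pd.read_csv(..., delimiter=" ")` needs in order to pick up the column names.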
        

The functions below transform the data into pandas dataframes.

In [34]:
def get_classes_table(cla_filename: str = 'scop-classes.txt', 
                      id_name: str = 'FA-DOMID') -> pd.DataFrame:
    """
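    Converts .txt classification file into dataframe.

    Input: 
        `cla_filename (str)` -- file name of classification file
        `id_name (str)` -- name of id column in cla_filename

    Returns:
        `classes_df (pd.DataFrame)` -- table with [id_name, FOLD] columns,
        where FOLD is the label-encoded fold id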
    """
    classes_df = pd.read_csv('./../data/' + cla_filename, delimiter = " ")
    classes_df = classes_df.loc[:, [id_name, "SCOPCLA"]]
    classes_df['FOLD-RAW']=classes_df['SCOPCLA'].str.extract(r'.*CF=(\d+).*')
    classes_df = classes_df.drop("SCOPCLA", axis = 1)
    le = preprocessing.LabelEncoder()
    folds = le.fit_transform(classes_df["FOLD-RAW"])
    classes_df["FOLD"] = folds
    del classes_df["FOLD-RAW"]
    return classes_df
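
The fold extraction and label encoding can be tried in isolation on a toy table. This is only a sketch: the `SCOPCLA` strings and domain ids below are invented, and the comma-separated `KEY=value` layout is an assumption based on the `CF=(\d+)` pattern used above.

```python
import pandas as pd
from sklearn import preprocessing

# Invented rows; only the "CF=<digits>" part matters for the extraction.
toy = pd.DataFrame({
    "FA-DOMID": [1, 2, 3],
    "SCOPCLA": [
        "TP=1,CL=1000000,CF=2000031,SF=3000001,FA=4000001",
        "TP=1,CL=1000000,CF=2000007,SF=3000002,FA=4000002",
        "TP=1,CL=1000001,CF=2000031,SF=3000003,FA=4000003",
    ],
})

# Pull out the raw fold id (still a string at this point).
fold_raw = toy["SCOPCLA"].str.extract(r"CF=(\d+)", expand=False)

# Map each distinct raw fold id to a consecutive integer label.
encoder = preprocessing.LabelEncoder()
toy["FOLD"] = encoder.fit_transform(fold_raw.tolist())

print(toy[["FA-DOMID", "FOLD"]])
print(list(encoder.classes_))
```

Keeping the fitted encoder around (rather than discarding it, as `get_classes_table` does) allows the integer labels to be mapped back to SCOP2 fold ids with `encoder.inverse_transform`.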

In [129]:
def get_sequence_table(filename: str) -> pd.DataFrame:
    '''
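    Returns data extracted from .fa representation file.

    The file is expected to alternate header lines (starting with
    '>' followed by a numeric domain id) and single-line sequences.

    Input: 
            'filename (str)': Name of the file

    Returns:
            'df (pd.DataFrame)': Extracted data with [sequence, id] columns
    '''
    with open(filename, 'r') as scop_fa_rep_raw:
        # splitlines() drops the trailing newline from every line,
        # so the stored sequences do not end with '\n'.
        fa_lines = scop_fa_rep_raw.read().splitlines()
    assert len(fa_lines) % 2 == 0, "WRONG FILE REPRESENTATION"
    info = fa_lines[::2]
    sequences = fa_lines[1::2]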

    data = {'info': info,
            'sequence': sequences
            }
    df = pd.DataFrame(data, columns = ['info', 'sequence'])
    df['id']=df['info'].str.extract(r'>(\d+).*').astype(int)
    return df.drop("info", axis = 1)
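
The parsing logic can be checked without downloading anything by feeding it an in-memory file. The two records below are invented; the sketch assumes, like the function above, that every record occupies exactly two lines (a header starting with `>` and a numeric id, then the full sequence on one line).

```python
import io

import pandas as pd

# Invented two-record file in the assumed "header line / sequence line" layout.
toy_fa = io.StringIO(
    ">101 made-up header fields\n"
    "MKIKVALLDK\n"
    ">102 made-up header fields\n"
    "QTPHILIVED\n"
)

lines = toy_fa.read().splitlines()
assert len(lines) % 2 == 0, "expected header/sequence pairs"

# Even positions are headers, odd positions are sequences.
toy_df = pd.DataFrame({"info": lines[::2], "sequence": lines[1::2]})
toy_df["id"] = toy_df["info"].str.extract(r">(\d+)", expand=False).astype(int)
toy_df = toy_df.drop(columns="info")

print(toy_df)
```

A file whose sequences are wrapped over several lines would not fit this layout; the even-length check alone cannot detect that, so a general FASTA parser would be needed in that case.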


In [170]:
def get_sequence_to_class_table(class_table: pd.DataFrame,
                               sequence_table: pd.DataFrame) -> pd.DataFrame:
    '''
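    Takes class and sequence tables and joins them on the domain id.
    Only ids present in both tables are kept (inner join).

    Input: 
            'class_table (pd.DataFrame)': class table with [id, fold] columns
            'sequence_table (pd.DataFrame)': seq table with [seq, id] columns 

    The id columns are picked by position: the first column of
    class_table and the second column of sequence_table.

    Returns:
            'sequence_to_class_table (pd.DataFrame)': sequence to class table 
             with [fold, seq] columns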
    '''
    cla_id = class_table.columns[0]
    seq_id = sequence_table.columns[1]
    joined_table = pd.merge(class_table, sequence_table, left_on = cla_id,
                           right_on = seq_id)
    del joined_table[seq_id]
    del joined_table[cla_id]
    return joined_table
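
A small sketch of the join on toy tables shows what happens to ids that appear in only one of the two inputs. All values are invented; the column names follow the defaults used in this section.

```python
import pandas as pd

# Invented tables: id 101 has no sequence, id 104 has no class.
toy_classes = pd.DataFrame({"FA-DOMID": [101, 102, 103], "FOLD": [0, 1, 1]})
toy_sequences = pd.DataFrame(
    {"sequence": ["MKIK", "QTPH", "SFER"], "id": [102, 103, 104]}
)

# pd.merge defaults to an inner join, so unmatched ids are dropped silently.
joined = pd.merge(toy_classes, toy_sequences, left_on="FA-DOMID", right_on="id")
joined = joined.drop(columns=["FA-DOMID", "id"])

print(joined)
```

Only the two ids shared by both tables survive. When the number of dropped rows matters, passing `indicator=True` with `how="outer"` to `pd.merge` is one way to inspect which side each unmatched row came from.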

In [178]:
def prepare_scop_data(
    scop_fa_filename: str = 'scop_fa_represeq_lib20210227.fa', 
    scop_sf_filename: str = 'scop_sf_represeq_lib20210227.fa'
) -> pd.DataFrame:
    """
    Performs default preparation of SCOP2 data: downloads the files
    that are not yet present, builds the class and sequence tables
    and joins them.

    Input:
        'scop_fa_filename (str)': name of scop family representative domain 
            sequences file
        'scop_sf_filename (str)': name of scop superfamily representative 
            domain sequences file

    Returns:
        'sequence_to_class_table (pd.DataFrame)': sequence to class table 
            with [fold, seq] columns
    """
    urls = ['https://scop2.mrc-lmb.cam.ac.uk/files/scop-cla-latest.txt',
           'https://scop2.mrc-lmb.cam.ac.uk/files/scop-des-latest.txt',
           'https://scop2.mrc-lmb.cam.ac.uk/files/' + scop_fa_filename,
           'https://scop2.mrc-lmb.cam.ac.uk/files/' + scop_sf_filename]
    filenames = ['scop-classes.txt', 'scop-des-latest.txt', 
                'scop_fa_represeq.fa', 'scop_sf_represeq']
    load_datafiles(urls, filenames, cla_id = 0)
    class_table = get_classes_table()
    seq_table = get_sequence_table('./../data/' + filenames[2])
    return get_sequence_to_class_table(class_table, seq_table)

In [179]:
#  prepare_scop_data()

| | FOLD | sequence |
|---|---|---|
| 0 | 1400 | DMKRQQRFFRIPFIRPADQYKDPQNKKKGWWYAHFDGPWIARQMEL... |
| 1 | 1400 | RQREIEMNRQQRFFRIPFIRPADQYKDPQSKKKGWWYAHFDGPWIA... |
| 2 | 11 | MKIKVALLDKDKEYLDRLTGVFNTKYADKLEVYSFTDEKNAIESVK... |
| 3 | 11 | QTPHILIVEDELVTRNTLKSIFEAEGYDVFEATDGAEMHQILSEYD... |
| 4 | 11 | SFERVFGKRVIILGGGALVSQVAIGAISEADRHNLRGERISVDTMP... |
| ... | ... | ... |
| 34587 | 9 | EKKYIVGFKQTMSAMSSAKKKDVISEKGGKVQKQFKYVNAAAATLD... |
| 34588 | 9 | EKREVLAGHARRQAPQAVDKGPVTGDQRISVTVVLRRQRGDELEAH... |
| 34589 | 9 | HEIYDGHAVYQVDVASMDQVKLVHDFENDLMLDVWSDAVPGRPGKV... |
| 34590 | 9 | FVNEWAAEIPGGQEAASAIAEELGYDLLGQIGSLENHYLFKHKSHP... |
