## Building the ClinVar TP53 Dataset
---

#### **Project Objectives**:

**1) Downloading the raw ClinVar dataset**

- Downloaded from: [https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/archive/](https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/archive/)
- Loading the dataset into a pandas DataFrame.
- Reviewing columns and data types.
- Identifying potential features.

**2) Initial data cleaning (`1`)**

- Filtering for only the **TP53** target gene.
- Removing duplicates and incomplete variant records.
- Removing metadata (non-informative features lacking biological relevance).
- Identifying missing values (NaN) and applying imputation methods.

**3) Engineering the target variable (binary classification)**

- For binary classification, focusing solely on **benign** and **pathogenic** labels.
- Removing all other variant types: **VUS**, **conflicting**, and **-1**.
- Mapping: `likely benign` → **benign** and `likely pathogenic` → **pathogenic**.

**4) Integrating VEP (Variant Effect Predictor) annotations**

- Using the Ensembl REST API to fetch additional annotations: [https://rest.ensembl.org/#VEP](https://rest.ensembl.org/#VEP)
- Goal: Create an extended dataset with enriched annotations to improve subsequent model performance.

**5) Cleaning the extended dataset (`2`)**

- Removing variants containing 'NA' values.
- Filtering out variants without any VEP annotations.
- Performing complete encoding of categorical annotations.

**6) Forming the final feature set**

- Defining a minimal, interpretable set of features.
- Finalizing a cleaned dataset ready for training various ML models.

---

In [5]:
# Libraries for this phase
import os
from pathlib import Path

import numpy as np
import pandas as pd

Local project folder structure:

Project root: Binary Classification

Folders: raw data and notebooks

In [6]:
current = os.getcwd()
parent_dir = Path.cwd().parent
for root,dirs,files in os.walk(parent_dir):
    for file in files:
        if file.endswith('_raw.txt'):
            print(root)
            raw_file_path = os.path.join(root,file)


c:\Users\Admin\Binary Classification\raw data


In [None]:
raw_data_iterator = pd.read_csv(
    raw_file_path,
    sep='\t',
    chunksize=200000,
    iterator=True
)
# read_csv returns a TextFileReader (an iterator over chunks) because of the file size

In [8]:
dataset = []
for chunk in raw_data_iterator:
    dataset.append(chunk)
dataset = pd.concat(dataset)



In [9]:
dataset.shape

(8364862, 43)

In [10]:
dataset = dataset[dataset['GeneSymbol']=='TP53'] #tp53 
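
Since only the TP53 rows are kept, the filter could also be applied to each chunk while reading, so the full table of about 8.4 million rows never has to be held in memory. A minimal sketch, assuming the same tab-separated layout; the in-memory sample is a made-up stand-in for the raw file:

```python
import io
import pandas as pd

# Tiny made-up stand-in for the tab-separated raw file (illustrative values only)
sample_tsv = (
    "GeneSymbol\tAssembly\tStart\n"
    "TP53\tGRCh38\t100\n"
    "BRCA1\tGRCh38\t200\n"
    "TP53\tGRCh37\t300\n"
    "TP53\tGRCh38\t400\n"
)

reader = pd.read_csv(io.StringIO(sample_tsv), sep="\t", chunksize=2)

# Keep only the TP53 rows of each chunk before concatenating
tp53_chunks = [chunk[chunk["GeneSymbol"] == "TP53"] for chunk in reader]
tp53_only = pd.concat(tp53_chunks, ignore_index=True)
print(tp53_only.shape)
```

With the real file, `io.StringIO(sample_tsv)` would be replaced by `raw_file_path` and a larger `chunksize`.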

In [11]:
# Filter dataset to keep only GRCh38 assembly (newer version)
# Rationale: GRCh38 contains updated genome annotations and corrections compared to GRCh37
dataset = dataset[dataset['Assembly'] == 'GRCh38']


In [13]:
dataset['Assembly'].value_counts() # GRCh38 - 3758


Assembly
GRCh38    3758
Name: count, dtype: int64

In [14]:
dataset.duplicated().any()

np.False_

In [15]:
dataset.isnull().any()
dataset.isna().any().sum()

np.int64(0)

In [16]:
dataset.columns

Index(['#AlleleID', 'Type', 'Name', 'GeneID', 'GeneSymbol', 'HGNC_ID',
       'ClinicalSignificance', 'ClinSigSimple', 'LastEvaluated', 'RS# (dbSNP)',
       'nsv/esv (dbVar)', 'RCVaccession', 'PhenotypeIDS', 'PhenotypeList',
       'Origin', 'OriginSimple', 'Assembly', 'ChromosomeAccession',
       'Chromosome', 'Start', 'Stop', 'ReferenceAllele', 'AlternateAllele',
       'Cytogenetic', 'ReviewStatus', 'NumberSubmitters', 'Guidelines',
       'TestedInGTR', 'OtherIDs', 'SubmitterCategories', 'VariationID',
       'PositionVCF', 'ReferenceAlleleVCF', 'AlternateAlleleVCF',
       'SomaticClinicalImpact', 'SomaticClinicalImpactLastEvaluated',
       'ReviewStatusClinicalImpact', 'Oncogenicity',
       'OncogenicityLastEvaluated', 'ReviewStatusOncogenicity',
       'SCVsForAggregateGermlineClassification',
       'SCVsForAggregateSomaticClinicalImpact',
       'SCVsForAggregateOncogenicityClassification'],
      dtype='object')

##### 2) Initial Dataset Cleaning

> Dropping all columns except the identification columns necessary for subsequent VEP annotation integration.

- **Chromosome** - Required for precise genomic location
- **Start** - Variant position (only single-nucleotide changes are tracked, so the **Stop** column is not needed)
- **ReferenceAlleleVCF** - Reference nucleotide
- **AlternateAlleleVCF** - Alternate (mutated) nucleotide
- **GeneSymbol** - Confirmation that the variant is from the target gene in this project (**TP53**)
- **ClinSigSimple** - Target label (0 and 1; rows labeled -1 are dropped in step 3)

[Chromosome, Start, ReferenceAlleleVCF, AlternateAlleleVCF]

In [17]:
selected_cols = ['Chromosome','Start','ReferenceAlleleVCF','AlternateAlleleVCF','GeneSymbol','ClinSigSimple']
dataset = dataset[selected_cols] 

In [18]:
dataset.isna().any()

Chromosome            False
Start                 False
ReferenceAlleleVCF    False
AlternateAlleleVCF    False
GeneSymbol            False
ClinSigSimple         False
dtype: bool

In [19]:
dataset.isnull().any()

Chromosome            False
Start                 False
ReferenceAlleleVCF    False
AlternateAlleleVCF    False
GeneSymbol            False
ClinSigSimple         False
dtype: bool

In [20]:
dataset['Chromosome'].value_counts() 

Chromosome
17    3674
17      84
Name: count, dtype: int64

In [21]:
# Standardizing the 'Chromosome' column, merging mixed dtype (str) and (int) into `str` type
dataset['Chromosome'] = dataset['Chromosome'].astype(str)

In [22]:
dataset['Chromosome'].value_counts()

Chromosome
17    3758
Name: count, dtype: int64
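
The two separate `17` rows in the earlier count come from mixed types: when a large file is parsed in pieces, pandas can infer integers for some parts of a column and strings for others. The sketch below reproduces the effect on made-up values and shows the option of fixing the type at read time through the `dtype` argument of `pd.read_csv` (an assumption about how the loading step could be changed, not what the notebook does):

```python
import io
import pandas as pd

# Mixed int/str values, as can happen when parts of a file are parsed with different dtypes
mixed = pd.Series([17, 17, "17"], dtype=object, name="Chromosome")
print(mixed.value_counts())    # 17 (int) and '17' (str) are counted separately

unified = mixed.astype(str)
print(unified.value_counts())  # a single '17' category

# Alternative: declare the dtype up front so no mixing occurs
tsv = "Chromosome\tStart\n17\t100\nX\t200\n"
df = pd.read_csv(io.StringIO(tsv), sep="\t", dtype={"Chromosome": str})
print(df["Chromosome"].tolist())
```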

In [23]:
# Dropping all rows whose chromosome value is 'na'
dataset = dataset[dataset['Chromosome']=='17']


`ReferenceAlleleVCF` and `AlternateAlleleVCF` contain the specific nucleotides (A/C/G/T), aligned with the reference genome. They are precise sources of information and essential for VEP annotation.

These two columns need additional cleaning because they contain longer sequences as well as single characters. The longer sequences correspond to variant types such as insertions and deletions affecting larger gene segments, which are not the focus of this research. The focus is on single nucleotide variants (SNVs) and determining their pathogenicity.

#### Alternative: Also filter for valid nucleotides only
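
The alternative mentioned here could be sketched as follows: instead of checking only the string length, both allele columns are checked against the set of valid nucleotides, which also excludes single-character values that are not A/C/G/T. The sample rows are made up for illustration.

```python
import pandas as pd

VALID_NUCLEOTIDES = {"A", "C", "G", "T"}

# Made-up rows: an SNV, a deletion, a hypothetical placeholder allele, another SNV
variants = pd.DataFrame({
    "ReferenceAlleleVCF": ["G", "GC", "N", "T"],
    "AlternateAlleleVCF": ["A", "G", "A", "C"],
})

# A row is an SNV only if both alleles are exactly one valid nucleotide
is_snv = (
    variants["ReferenceAlleleVCF"].isin(VALID_NUCLEOTIDES)
    & variants["AlternateAlleleVCF"].isin(VALID_NUCLEOTIDES)
)
snvs = variants[is_snv].reset_index(drop=True)
print(snvs)
```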

In [24]:
dataset['ReferenceAlleleVCF'].value_counts()

ReferenceAlleleVCF
G                           948
C                           890
A                           576
T                           555
GC                           41
                           ... 
GGAGTCTTCCAGTGTGATGATGGT      1
ACTATGTCG                     1
GCGGCTCATAGGGCACC             1
AGTTGGCAAAACATCTT             1
ATGTAGT                       1
Name: count, Length: 416, dtype: int64

In [25]:
# Keeping only rows where both alleles are a single character (one nucleotide)
dataset = dataset[(dataset['ReferenceAlleleVCF'].str.len()==1)&(dataset['AlternateAlleleVCF'].str.len()==1)]
dataset['ReferenceAlleleVCF'].value_counts()

ReferenceAlleleVCF
G    852
C    812
A    515
T    503
Name: count, dtype: int64

In [26]:
dataset['AlternateAlleleVCF'].value_counts()

AlternateAlleleVCF
A    778
T    704
C    632
G    568
Name: count, dtype: int64

In [27]:
dataset['GeneSymbol'].value_counts()

GeneSymbol
TP53    2682
Name: count, dtype: int64

In [28]:
dataset = dataset.reset_index(drop=True)
dataset

Unnamed: 0,Chromosome,Start,ReferenceAlleleVCF,AlternateAlleleVCF,GeneSymbol,ClinSigSimple
0,17,7674221,G,A,TP53,1
1,17,7674191,C,T,TP53,1
2,17,7674230,C,A,TP53,1
3,17,7674208,A,G,TP53,1
4,17,7676154,G,C,TP53,1
...,...,...,...,...,...,...
2677,17,7676274,T,A,TP53,-1
2678,17,7676351,T,A,TP53,-1
2679,17,7676405,T,G,TP53,-1
2680,17,7676517,T,C,TP53,-1


Further filter the dataset according to the coordinates/location of the gene (TP53) provided on the official Ensembl website [https://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000141510;r=17:7661779-7687546]

**location** - Chromosome 17: 7,661,779-7,687,546 reverse strand.

In [29]:
tp53_start = 7661779
tp53_end = 7687546

dataset = dataset[
    (dataset['Start']>=tp53_start) &
    (dataset['Start']<=tp53_end)
]

dataset

Unnamed: 0,Chromosome,Start,ReferenceAlleleVCF,AlternateAlleleVCF,GeneSymbol,ClinSigSimple
0,17,7674221,G,A,TP53,1
1,17,7674191,C,T,TP53,1
2,17,7674230,C,A,TP53,1
3,17,7674208,A,G,TP53,1
4,17,7676154,G,C,TP53,1
...,...,...,...,...,...,...
2677,17,7676274,T,A,TP53,-1
2678,17,7676351,T,A,TP53,-1
2679,17,7676405,T,G,TP53,-1
2680,17,7676517,T,C,TP53,-1


### 3) Target feature -> __ClinSigSimple__
- label 0 `benign`/`likely benign` -> **benign**

- label 1 `pathogenic`/`likely pathogenic` -> **pathogenic**

- label -1 "conflicting interpretation of pathogenicity": variants without a clear effect on the carrier (dropped)
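
The mapping from objective 3 (`likely benign` → benign, `likely pathogenic` → pathogenic, everything else dropped) could also be applied to the free-text `ClinicalSignificance` column as a cross-check on `ClinSigSimple`. As far as I understand the ClinVar README, `ClinSigSimple = 0` only states that no pathogenic or likely pathogenic assertion exists, so it may also cover uncertain or conflicting records; this is worth verifying. A minimal sketch on made-up rows, where the label strings in the dictionary are assumptions to be checked against the actual column values:

```python
import pandas as pd

# Assumed label strings; verify against dataset['ClinicalSignificance'].unique()
LABEL_MAP = {
    "benign": 0,
    "likely benign": 0,
    "benign/likely benign": 0,
    "pathogenic": 1,
    "likely pathogenic": 1,
    "pathogenic/likely pathogenic": 1,
}

# Made-up example rows
df = pd.DataFrame({
    "ClinicalSignificance": [
        "Pathogenic",
        "Likely benign",
        "Uncertain significance",
        "Conflicting classifications of pathogenicity",
        "Benign/Likely benign",
    ]
})

# Normalise case, map to 0/1, and drop everything that is not in the map
df["target"] = df["ClinicalSignificance"].str.lower().map(LABEL_MAP)
labelled = df.dropna(subset=["target"]).astype({"target": int})
print(labelled)
```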

In [30]:
dataset = dataset[dataset['ClinSigSimple']!=-1]

In [31]:
dataset['ClinSigSimple'].value_counts()

ClinSigSimple
0    2026
1     621
Name: count, dtype: int64

#### 4) Integration of VEP annotations (Ensembl REST API) [https://rest.ensembl.org/documentation/info/vep_region_post]

In practice, Ensembl VEP returns multiple annotations per variant because a single variant can overlap multiple transcripts.

In scientific papers, the common practice is to use one representative annotation per variant, the one marked as canonical. This keeps the data retrieved for the variants in the current dataset uniform.

This convention is also described in the paper 'Annotating and prioritizing genomic variants using the Ensembl Variant Effect Predictor—A tutorial',

Hunt SE et al. (2021) [https://pmc.ncbi.nlm.nih.gov/articles/PMC7613081/]

The `plugin` section on the page lists the additional annotations I will integrate into the dataset:

- AlphaMissense

- Blosum62

- CADD (cadd_raw & cadd_phred)

- LoF

- PolyPhen score

- SIFT score

- IMPACT, which the request always returns.

After several unsuccessful attempts with the GET method for a single variant, I tried a POST request [https://rest.ensembl.org/documentation/info/vep_region_post] to retrieve annotations for a larger number of variants.

The format used for the POST request is: [Chromosome Start . ReferenceAlleleVCF AlternateAlleleVCF]

In [32]:

# Work on an explicit copy so the new column is not written to a slice of an earlier DataFrame
dataset = dataset.copy()
dataset['vcf_format'] = dataset.apply(
    lambda row: f"{row['Chromosome']} {row['Start']} . {row['ReferenceAlleleVCF']} {row['AlternateAlleleVCF']}", 
    axis=1
)

variantsParams = dataset['vcf_format'].to_list()
print(len(variantsParams))
_test = variantsParams  # reused below for small test requests

2647


For the API call, I followed the official example at the bottom of the page [https://rest.ensembl.org/documentation/info/vep_region_post] with minor modifications.

In the param variable, I defined the final set of available parameters after several attempts.

In [33]:
chunk = 200
for i in variantsParams:
    print(i)

17 7674221 . G A
17 7674191 . C T
17 7674230 . C A
17 7674208 . A G
17 7676154 . G C
17 7674216 . C A
17 7675143 . C A
17 7674238 . C T
17 7674229 . C T
17 7674220 . C T
17 7675214 . A G
17 7673806 . C A
17 7674241 . G A
17 7676011 . T A
17 7673776 . G A
17 7674230 . C T
17 7673802 . C T
17 7673554 . C A
17 7673781 . C G
17 7675161 . G T
17 7675161 . G A
17 7676264 . C A
17 7674193 . A T
17 7675088 . C T
17 7670678 . A G
17 7675200 . C G
17 7673745 . T A
17 7670699 . C T
17 7674965 . G A
17 7674872 . T G
17 7673766 . T A
17 7668434 . T G
17 7675070 . C A
17 7676564 . C G
17 7670669 . G T
17 7676230 . G A
17 7674945 . G A
17 7674894 . G A
17 7674892 . T C
17 7673803 . G A
17 7673610 . T C
17 7670658 . T C
17 7670616 . G A
17 7674859 . C T
17 7675212 . A G
17 7673728 . C A
17 7673523 . A G
17 7674944 . C A
17 7674227 . T C
17 7673733 . T C
17 7687373 . T G
17 7676196 . G C
17 7676121 . G A
17 7676567 . C T
17 7676040 . C T
17 7675146 . G A
17 7675145 . C T
17 7675140 . G A
17 7675127 . A

In [35]:
import requests
import json

server = "https://rest.ensembl.org" 
ext = "/vep/homo_sapiens/region"

headers = { 
    "Content-Type": "application/json",
    "Accept": "application/json"
} 

variant = _test[:1]  # test request with a single variant

data = {
    "variants": variant
} 
param ={ 
    'canonical':1, # flags the canonical transcript in the response
    'CADD':1, 
    'AlphaMissense':1,
    'Blosum62':1,
    'LoF':1,
    'REVEL':1
}
try:
    r = requests.post(server + ext, headers=headers, json=data,params=param)
    print(f"Connection status: {r.status_code}")
except requests.exceptions.RequestException as err:
    raise ConnectionError(f"Request to {server + ext} failed") from err

print(r.json())


Connection status: 200
[{'assembly_name': 'GRCh38', 'seq_region_name': '17', 'strand': 1, 'end': 7674221, 'input': '17 7674221 . G A', 'colocated_variants': [{'strand': 1, 'seq_region_name': '17', 'id': 'CM010465', 'start': 7674221, 'clinvar_somatic_classification': 'Neoplasm(oncogenicity:Oncogenic)', 'allele_string': 'HGMD_MUTATION', 'phenotype_or_disease': 1, 'end': 7674221}, {'seq_region_name': '17', 'id': 'CM900211', 'strand': 1, 'end': 7674221, 'phenotype_or_disease': 1, 'allele_string': 'HGMD_MUTATION', 'clinvar_somatic_classification': 'Neoplasm(oncogenicity:Oncogenic)', 'start': 7674221}, {'seq_region_name': '17', 'var_synonyms': {'COSMIC': ['COSM10656']}, 'id': 'COSV52662035', 'strand': 1, 'end': 7674221, 'somatic': 1, 'allele_string': 'COSMIC_MUTATION', 'phenotype_or_disease': 1, 'start': 7674221, 'clinvar_somatic_classification': 'Neoplasm(oncogenicity:Oncogenic)'}, {'somatic': 1, 'end': 7674221, 'phenotype_or_disease': 1, 'allele_string': 'COSMIC_MUTATION', 'start': 7674221

In [36]:
result = r.json()
print(result[0]) #json string

{'assembly_name': 'GRCh38', 'seq_region_name': '17', 'strand': 1, 'end': 7674221, 'input': '17 7674221 . G A', 'colocated_variants': [{'strand': 1, 'seq_region_name': '17', 'id': 'CM010465', 'start': 7674221, 'clinvar_somatic_classification': 'Neoplasm(oncogenicity:Oncogenic)', 'allele_string': 'HGMD_MUTATION', 'phenotype_or_disease': 1, 'end': 7674221}, {'seq_region_name': '17', 'id': 'CM900211', 'strand': 1, 'end': 7674221, 'phenotype_or_disease': 1, 'allele_string': 'HGMD_MUTATION', 'clinvar_somatic_classification': 'Neoplasm(oncogenicity:Oncogenic)', 'start': 7674221}, {'seq_region_name': '17', 'var_synonyms': {'COSMIC': ['COSM10656']}, 'id': 'COSV52662035', 'strand': 1, 'end': 7674221, 'somatic': 1, 'allele_string': 'COSMIC_MUTATION', 'phenotype_or_disease': 1, 'start': 7674221, 'clinvar_somatic_classification': 'Neoplasm(oncogenicity:Oncogenic)'}, {'somatic': 1, 'end': 7674221, 'phenotype_or_disease': 1, 'allele_string': 'COSMIC_MUTATION', 'start': 7674221, 'clinvar_somatic_class

In [37]:
for key,value in result[0].items():
    print(key,value) # the per-transcript annotations are stored under 'transcript_consequences'

assembly_name GRCh38
seq_region_name 17
strand 1
end 7674221
input 17 7674221 . G A
colocated_variants [{'strand': 1, 'seq_region_name': '17', 'id': 'CM010465', 'start': 7674221, 'clinvar_somatic_classification': 'Neoplasm(oncogenicity:Oncogenic)', 'allele_string': 'HGMD_MUTATION', 'phenotype_or_disease': 1, 'end': 7674221}, {'seq_region_name': '17', 'id': 'CM900211', 'strand': 1, 'end': 7674221, 'phenotype_or_disease': 1, 'allele_string': 'HGMD_MUTATION', 'clinvar_somatic_classification': 'Neoplasm(oncogenicity:Oncogenic)', 'start': 7674221}, {'seq_region_name': '17', 'var_synonyms': {'COSMIC': ['COSM10656']}, 'id': 'COSV52662035', 'strand': 1, 'end': 7674221, 'somatic': 1, 'allele_string': 'COSMIC_MUTATION', 'phenotype_or_disease': 1, 'start': 7674221, 'clinvar_somatic_classification': 'Neoplasm(oncogenicity:Oncogenic)'}, {'somatic': 1, 'end': 7674221, 'phenotype_or_disease': 1, 'allele_string': 'COSMIC_MUTATION', 'start': 7674221, 'clinvar_somatic_classification': 'Neoplasm(oncogeni

In [38]:

ciljneAnotacije = result[0]['transcript_consequences']


The output contains many annotations that are not needed, so I only look at the 'transcript_consequences' nested list, which contains the annotations relevant for ML.

The transcripts also contain different keys. This is expected in genomics, because one variant can affect several transcripts with different functional effects (1 gene = many different products).

For this reason, I select one representative transcript per variant (1 variant : 1 transcript).

> The transcript must be canonical (`canonical: 1`) and protein-coding (`biotype: protein_coding`).

In [39]:
# Transcript selection: one representative transcript per variant
def select_transcript(anotacije):
    # Only protein-coding transcripts are considered
    is_proteinCoding = [p for p in anotacije if p.get('biotype')=='protein_coding']
    # Among those, prefer the transcript flagged as canonical
    is_canonical = [c for c in is_proteinCoding if c.get('canonical')==1]
    if is_canonical:
        return is_canonical[0]
    # Fall back to the first protein-coding transcript, or an empty dict if none exists
    return is_proteinCoding[0] if is_proteinCoding else {}
selected = select_transcript(ciljneAnotacije)
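
Because a genomic position can overlap transcripts of more than one gene, the selection could additionally be restricted to the target gene. The sketch below is a hypothetical variant of the selection step (`select_tp53_transcript` is not part of the notebook). It assumes each transcript entry carries a `gene_symbol` key, which should be verified against the actual response:

```python
def select_tp53_transcript(annotations, gene="TP53"):
    """Return one representative transcript dict for `gene`, or None if there is none."""
    # Assumption: entries carry 'gene_symbol', 'biotype' and an optional 'canonical' flag
    candidates = [
        t for t in annotations
        if t.get("gene_symbol") == gene and t.get("biotype") == "protein_coding"
    ]
    canonical = [t for t in candidates if t.get("canonical") == 1]
    if canonical:
        return canonical[0]
    return candidates[0] if candidates else None


# Made-up transcript entries for illustration
example = [
    {"gene_symbol": "WRAP53", "biotype": "protein_coding", "canonical": 1, "transcript_id": "T1"},
    {"gene_symbol": "TP53", "biotype": "protein_coding", "transcript_id": "T2"},
    {"gene_symbol": "TP53", "biotype": "protein_coding", "canonical": 1, "transcript_id": "T3"},
]
chosen = select_tp53_transcript(example)
print(chosen["transcript_id"])
```

Returning `None` for variants without a matching transcript makes it explicit that such variants need to be skipped or handled before feature extraction.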

In [40]:
rows=[]
for i in range(len(result)):
    transcripts = result[i]['transcript_consequences']
    select = select_transcript(transcripts)
    features = {
    'variant_id': i,
    'cadd_phred': select.get('cadd_phred'),
    'cadd_raw': select.get('cadd_raw'),
    'revel': select.get('revel'),
    'sift_score': select.get('sift_score'),
    'polyphen_score': select.get('polyphen_score'),
    'blosum62': select.get('blosum62'),
    'alphamissense_score': select.get('alphamissense',{}).get('am_pathogenicity'), # AlphaMissense returns two values; only the score is kept, the class label is not needed
    'impact': select.get('impact'),
    'consequence': (select.get('consequence_terms', [None])[0])
    }   

    rows.append(features)   
print(rows)

[{'variant_id': 0, 'cadd_phred': 29.2, 'cadd_raw': 5.203168, 'revel': 0.927, 'sift_score': 0, 'polyphen_score': None, 'blosum62': -3, 'alphamissense_score': 0.9968, 'impact': 'MODERATE', 'consequence': 'missense_variant'}]


#### Building a DataFrame from these features

In [41]:
values = []
for i in rows:
    column_names = i.keys()
    column_values = i.values()
    values.append(list(column_values))

In [42]:
for i in rows:
    data = list(i.values())
    print(data)

[0, 29.2, 5.203168, 0.927, 0, None, -3, 0.9968, 'MODERATE', 'missense_variant']


In [43]:
features_df = pd.DataFrame(data=values,columns= column_names)
features_df

Unnamed: 0,variant_id,cadd_phred,cadd_raw,revel,sift_score,polyphen_score,blosum62,alphamissense_score,impact,consequence
0,0,29.2,5.203168,0.927,0,,-3,0.9968,MODERATE,missense_variant
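
Since `rows` is already a list of dictionaries with identical keys, the DataFrame could also be built in one step, without collecting the values and column names separately. A short sketch using the single row printed above (`features_direct` is a name chosen here for illustration):

```python
import pandas as pd

# The single feature row printed earlier in the notebook
rows = [{
    "variant_id": 0, "cadd_phred": 29.2, "cadd_raw": 5.203168, "revel": 0.927,
    "sift_score": 0, "polyphen_score": None, "blosum62": -3,
    "alphamissense_score": 0.9968, "impact": "MODERATE",
    "consequence": "missense_variant",
}]

# A list of dicts maps directly to rows; the keys become column names
features_direct = pd.DataFrame(rows)
print(features_direct.columns.tolist())
```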


In [44]:
# Copy of the starting dataset; keep only the first row to match the single test variant
dataset_cp = dataset.copy()
dataset_cp = dataset_cp.iloc[:1,:]
dataset_cp

Unnamed: 0,Chromosome,Start,ReferenceAlleleVCF,AlternateAlleleVCF,GeneSymbol,ClinSigSimple,vcf_format
0,17,7674221,G,A,TP53,1,17 7674221 . G A


Shown here is the integrated dataset after merging the ClinVar records with the Ensembl VEP annotations. Not all attributes are intended for modeling; some serve as source or meta-information.

In [45]:
test_df = pd.concat([dataset_cp,features_df],axis=1) 
test_df

Unnamed: 0,Chromosome,Start,ReferenceAlleleVCF,AlternateAlleleVCF,GeneSymbol,ClinSigSimple,vcf_format,variant_id,cadd_phred,cadd_raw,revel,sift_score,polyphen_score,blosum62,alphamissense_score,impact,consequence
0,17,7674221,G,A,TP53,1,17 7674221 . G A,0,29.2,5.203168,0.927,0,,-3,0.9968,MODERATE,missense_variant


### Purposefully defined test dataset that will serve for EDA
(LoF score was not available to retrieve, so it was excluded from the start)

In [46]:
features_df = test_df.iloc[:,8:].copy()  # annotation columns only; copied so the target can be added safely
features_df

Unnamed: 0,cadd_phred,cadd_raw,revel,sift_score,polyphen_score,blosum62,alphamissense_score,impact,consequence
0,29.2,5.203168,0.927,0,,-3,0.9968,MODERATE,missense_variant


### And target:

In [47]:
target = test_df['ClinSigSimple']
features_df['Target'] = target

features + target:

In [48]:
features_df.head(35)

Unnamed: 0,cadd_phred,cadd_raw,revel,sift_score,polyphen_score,blosum62,alphamissense_score,impact,consequence,Target
0,29.2,5.203168,0.927,0,,-3,0.9968,MODERATE,missense_variant,1


Next step:
Complete retrieval of batch annotations from the REST API

Attached is the complete dataset, ready for EDA (the POST request has a limit of 200 variants).

---

In [49]:
def preuzmi_anotacije(batch_list): 
    server = "https://rest.ensembl.org" 
    ext = "/vep/homo_sapiens/region"

    headers = { 
        "Content-Type": "application/json",
        "Accept": "application/json"
    } 

    variant = batch_list  

    data = {
        "variants": variant
    } 
    param ={ 
        'canonical':1, 
        'CADD':1, 
        'AlphaMissense':1,
        'Blosum62':1,
        'LoF':1,
        'REVEL':1
    }
    try:
        r = requests.post(server + ext, headers=headers, json=data,params=param)
        r.raise_for_status()
        return r.json()
    except requests.exceptions.HTTPError as err:
        print(f"HTTP Error: {err}")


In [51]:
storage = []
batch_size = 150
for i in range(0,len(variantsParams),batch_size):
    batch = variantsParams[i:i + batch_size] # 150 per request, below the 200-variant limit
    batch_result = preuzmi_anotacije(batch)
    storage.extend(batch_result)



HTTP Error: 503 Server Error: Service Unavailable for url: https://rest.ensembl.org/vep/homo_sapiens/region?canonical=1&CADD=1&AlphaMissense=1&Blosum62=1&LoF=1&REVEL=1


TypeError: 'NoneType' object is not iterable
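
The failure above happens because `preuzmi_anotacije` returns `None` after the 503 response, and `storage.extend(None)` then raises the `TypeError`. One possible remedy is to retry each batch a few times with a growing pause and to skip batches that still fail. A minimal sketch, where `fetch_with_retry` and its parameters are hypothetical and the fetch function is passed in as an argument (in the notebook that would be `preuzmi_anotacije`). A stand-in function is used here so the example runs without network access:

```python
import time

def fetch_with_retry(fetch, batch, retries=3, base_delay=1.0):
    """Call fetch(batch) until it returns a result; wait longer after each failure."""
    for attempt in range(retries):
        result = fetch(batch)
        if result is not None:
            return result
        time.sleep(base_delay * (2 ** attempt))  # exponential backoff
    return []  # give up on this batch but keep the loop alive


# Stand-in for the real request function: fails once, then succeeds
calls = {"n": 0}
def flaky_fetch(batch):
    calls["n"] += 1
    return None if calls["n"] == 1 else [{"input": v} for v in batch]

variants = ["17 7674221 . G A", "17 7674191 . C T", "17 7674230 . C A"]
storage = []
batch_size = 2
for i in range(0, len(variants), batch_size):
    batch = variants[i:i + batch_size]
    storage.extend(fetch_with_retry(flaky_fetch, batch, base_delay=0.01))

print(len(storage))
```

Batches that are skipped after all retries leave gaps, so the annotations should later be matched to variants by a key rather than by position.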

In [None]:
features_kompletno = []
for i,variant in enumerate(storage):
    if not isinstance(variant,dict):
        continue
    if 'transcript_consequences' not in variant:
        continue
    transcripti = variant['transcript_consequences']
    selected = select_transcript(transcripti)

    features = {
        'variant_id': i+1,
        'cadd_phred': selected.get('cadd_phred'),
        'cadd_raw': selected.get('cadd_raw'),
        'revel': selected.get('revel'),
        'sift_score': selected.get('sift_score'),
        'blosum62': selected.get('blosum62'),
        'alphamissense_score': selected.get('alphamissense', {}).get('am_pathogenicity'),
        'impact': selected.get('impact'),
        'consequence': selected.get('consequence_terms', [None])[0]
    }
    
    features_kompletno.append(features)

In [None]:
print(features_kompletno[-1]['variant_id'])
# 2647 

2647


In [None]:
svi_redovi = []
for i in features_kompletno:
    ime_kolone = i.keys()
    redovi = i.values()
    svi_redovi.append(list(redovi))



In [None]:
to_append = pd.DataFrame(data=svi_redovi,columns=ime_kolone)

In [None]:
# dataset with annotations
to_append.shape

(2646, 9)

In [None]:
# initial dataset
dataset.shape
dataset

Unnamed: 0,Chromosome,Start,ReferenceAlleleVCF,AlternateAlleleVCF,GeneSymbol,ClinSigSimple,vcf_format
0,17,7674221,G,A,TP53,1,17 7674221 . G A
1,17,7674191,C,T,TP53,1,17 7674191 . C T
2,17,7674230,C,A,TP53,1,17 7674230 . C A
3,17,7674208,A,G,TP53,1,17 7674208 . A G
4,17,7676154,G,C,TP53,1,17 7676154 . G C
...,...,...,...,...,...,...,...
2658,17,7676070,T,G,TP53,0,17 7676070 . T G
2659,17,7673762,T,C,TP53,0,17 7673762 . T C
2660,17,7669609,T,G,TP53,0,17 7669609 . T G
2661,17,7675221,T,G,TP53,1,17 7675221 . T G


In [None]:
Path.cwd()

WindowsPath('c:/Users/Admin/ML/tp53_variant_classification/notebooks')

In [None]:
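# Align on row position (assumes variant_id is the 1-based position of each variant in request order)
to_append.index = to_append['variant_id'].to_numpy() - 1
complete_dataset = dataset.reset_index(drop=True).join(to_append)
complete_dataset.to_csv('c:/Users/Admin/ML/tp53_variant_classification/data/processed/dataset_raw.csv')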

In [None]:
complete_dataset.shape
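
Combining the two tables by index depends on both having the same order and no skipped variants. A more defensive option is to carry the `input` string that VEP echoes back for each variant (visible in the response printed earlier) and merge on it against `vcf_format`. A sketch with made-up values:

```python
import pandas as pd

# ClinVar side: one row per variant, with the string that was sent to VEP
clinvar = pd.DataFrame({
    "vcf_format": ["17 7674221 . G A", "17 7674191 . C T", "17 7674230 . C A"],
    "ClinSigSimple": [1, 1, 0],
})

# VEP side: made-up scores, out of order, with one variant missing
vep = pd.DataFrame({
    "input": ["17 7674230 . C A", "17 7674221 . G A"],
    "cadd_phred": [25.0, 29.2],
})

# Left merge keeps every ClinVar row; variants without annotation get NaN
merged = clinvar.merge(vep, how="left", left_on="vcf_format", right_on="input")
merged = merged.drop(columns="input")
print(merged)
```

In the notebook this would require adding `'input': variant.get('input')` to the feature dictionary when the annotations are collected.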

---