# Notebook for initial data to CSV conversion

### **Please use the following folder structure**:

```
└───data
    ├───EE_015
    |     └───EE_015.vcf.gz
    ├───EE_050
    |     └───EE_050.vcf.gz
    └───EE_069
          └───EE_069.vcf.gz
```
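
Before running the conversion, it can help to confirm that this layout is actually in place. The following is a minimal sketch; `SAMPLES` and `missing_inputs` are illustrative names and not part of the original notebook.

```python
from pathlib import Path

# Sample IDs taken from the folder structure shown above
SAMPLES = ["EE_015", "EE_050", "EE_069"]


def missing_inputs(data_root, samples=SAMPLES):
    """Return the expected <sample>/<sample>.vcf.gz paths that do not exist."""
    root = Path(data_root)
    expected = [root / s / f"{s}.vcf.gz" for s in samples]
    return [p for p in expected if not p.is_file()]


# Report every input file that is missing under ./data
for path in missing_inputs("data"):
    print(f"Missing input file: {path}")
```

An empty result means every sample folder contains its `.vcf.gz` file.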

## Unpack original files to CSV

In [5]:
!pip install unvcf



In [6]:
%%cmd

unvcf data/EE_015/EE_015.vcf.gz data/EE_015/
unvcf data/EE_050/EE_050.vcf.gz data/EE_050/
unvcf data/EE_069/EE_069.vcf.gz data/EE_069/

Microsoft Windows [Version 10.0.22621.2428]
(c) Microsoft Corporation. All rights reserved.

(pathogen) c:\Users\barte\Desktop\Studies\V semester\pathogenicity-assessment>
(pathogen) c:\Users\barte\Desktop\Studies\V semester\pathogenicity-assessment>unvcf data/EE_015/EE_015.vcf.gz data/EE_015/


Destination folder: c:\Users\barte\Desktop\Studies\V semester\pathogenicity-assessment\data\EE_015
Files that are being generated:
- EE_015.vcf.gz.sample.AB.csv
- EE_015.vcf.gz.sample.AD.csv
- EE_015.vcf.gz.sample.AF.csv
- EE_015.vcf.gz.sample.DP.csv
- EE_015.vcf.gz.sample.F1R2.csv
- EE_015.vcf.gz.sample.F2R1.csv
- EE_015.vcf.gz.sample.GT.csv
- EE_015.vcf.gz.sample.PGT.csv
- EE_015.vcf.gz.sample.PID.csv
- EE_015.vcf.gz.sample.PS.csv
- EE_015.vcf.gz.sample.SB.csv
- EE_015.vcf.gz.default.csv
- EE_015.vcf.gz.genotype.csv


Please ensure that each individual file can fit in memory and
use the keyword ``blocksize=None to remove this message``
Setting ``blocksize=None``
  warn(


Warming up the engine... done.


102517 genotypes [00:28, 3569.45 genotypes/s]


Finished successfully!

(pathogen) c:\Users\barte\Desktop\Studies\V semester\pathogenicity-assessment>unvcf data/EE_050/EE_050.vcf.gz data/EE_050/
Destination folder: c:\Users\barte\Desktop\Studies\V semester\pathogenicity-assessment\data\EE_050
Files that are being generated:
- EE_050.vcf.gz.sample.AB.csv
- EE_050.vcf.gz.sample.AD.csv
- EE_050.vcf.gz.sample.AF.csv
- EE_050.vcf.gz.sample.DP.csv
- EE_050.vcf.gz.sample.F1R2.csv
- EE_050.vcf.gz.sample.F2R1.csv
- EE_050.vcf.gz.sample.GT.csv
- EE_050.vcf.gz.sample.PGT.csv
- EE_050.vcf.gz.sample.PID.csv
- EE_050.vcf.gz.sample.PS.csv
- EE_050.vcf.gz.sample.SB.csv
- EE_050.vcf.gz.default.csv
- EE_050.vcf.gz.genotype.csv


Please ensure that each individual file can fit in memory and
use the keyword ``blocksize=None to remove this message``
Setting ``blocksize=None``
  warn(


Warming up the engine... done.


201753 genotypes [00:54, 3676.53 genotypes/s]


Finished successfully!

(pathogen) c:\Users\barte\Desktop\Studies\V semester\pathogenicity-assessment>unvcf data/EE_069/EE_069.vcf.gz data/EE_069/
Destination folder: c:\Users\barte\Desktop\Studies\V semester\pathogenicity-assessment\data\EE_069
Files that are being generated:
- EE_069.vcf.gz.sample.AB.csv
- EE_069.vcf.gz.sample.AD.csv
- EE_069.vcf.gz.sample.DP.csv
- EE_069.vcf.gz.sample.GQ.csv
- EE_069.vcf.gz.sample.GT.csv
- EE_069.vcf.gz.sample.PGT.csv
- EE_069.vcf.gz.sample.PID.csv
- EE_069.vcf.gz.sample.PL.csv
- EE_069.vcf.gz.sample.SAC.csv
- EE_069.vcf.gz.default.csv
- EE_069.vcf.gz.genotype.csv


Please ensure that each individual file can fit in memory and
use the keyword ``blocksize=None to remove this message``
Setting ``blocksize=None``
  warn(


Warming up the engine... done.


140174 genotypes [00:37, 3740.17 genotypes/s]


Finished successfully!

(pathogen) c:\Users\barte\Desktop\Studies\V semester\pathogenicity-assessment>
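
The `%%cmd` cell above relies on the Windows command prompt. On other systems, the same three calls could be issued from Python with `subprocess`. This is only a sketch: `unvcf_command` and `convert_all` are hypothetical helpers, and it assumes that `unvcf` is on the `PATH` and accepts the destination folder without a trailing slash.

```python
import subprocess
from pathlib import Path


def unvcf_command(sample, data_root="data"):
    """Build the argument list for one `unvcf <input> <destination>` call."""
    folder = Path(data_root) / sample
    return ["unvcf", str(folder / f"{sample}.vcf.gz"), str(folder)]


def convert_all(samples, data_root="data"):
    """Run unvcf once per sample and stop at the first failing call."""
    for sample in samples:
        subprocess.run(unvcf_command(sample, data_root), check=True)
```

Calling `convert_all(["EE_015", "EE_050", "EE_069"])` would then replace the shell cell.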

## Clean up folders

In [7]:
from pathlib import Path
import pandas as pd


vcfs = ["EE_015", "EE_050", "EE_069"]
data_folders = [Path("data/") / x for x in vcfs]

for sample, folder in zip(vcfs, data_folders):
    # Snapshot the listing first, since files are removed and created below
    for path in list(folder.iterdir()):
        file = path.name
        # Skip the source VCF and outputs already converted by an earlier run
        if not file.endswith(".csv"):
            continue
        # Remove sample files
        if "sample" in file:
            path.unlink()
        # Rewrite genotype info without CSQ
        elif "genotype" in file:
            df = pd.read_csv(path, sep="\t")
            df.drop("CSQ", axis=1).to_csv(folder / f"{sample}_genotype.csv.gz", sep=";", compression="gzip")
            path.unlink()
        # Change default separator and compress
        elif "default" in file:
            df = pd.read_csv(path, sep="\t")
            df.to_csv(folder / f"{sample}_default.csv.gz", sep=";", compression="gzip")
            path.unlink()

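
Because the cleanup deletes files, a dry run that only reports the intended action for each file can be useful first. The sketch below mirrors the name-based rules of the cleanup cell. `plan_cleanup` is a hypothetical helper, and it assumes that unvcf output names follow the `<sample>.vcf.gz.<kind>.csv` pattern seen in the log above.

```python
def plan_cleanup(filenames):
    """Map each file name to (action, target name) without touching the disk."""
    plan = {}
    for name in filenames:
        # Anything that is not a raw unvcf CSV is left alone
        if not name.endswith(".csv"):
            plan[name] = ("keep", None)
            continue
        # Derive the sample ID from the name instead of a fixed-width slice
        sample = name.split(".vcf")[0]
        if "sample" in name:
            plan[name] = ("delete", None)
        elif "genotype" in name:
            plan[name] = ("convert", f"{sample}_genotype.csv.gz")
        elif "default" in name:
            plan[name] = ("convert", f"{sample}_default.csv.gz")
        else:
            plan[name] = ("keep", None)
    return plan
```

Passing the contents of one sample folder, for example `plan_cleanup(p.name for p in Path("data/EE_015").iterdir())`, lists what would be deleted or converted.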


## Move CSQ to a separate file

In [8]:
import vcf
import pandas as pd


for sample, folder in zip(vcfs, data_folders):
    data = []

    vcf_file = folder / f"{sample}.vcf.gz"
    with open(vcf_file, "rb") as vcf_file_binary:
        vcf_reader = vcf.Reader(vcf_file_binary)

        # Skip the first 50 characters of the CSQ description (its free-text
        # prefix); the rest is the pipe-separated list of CSQ field names
        column_csq_headers = vcf_reader.infos["CSQ"].desc[50:].split("|")

        for record in vcf_reader:
            # Keep only the first CSQ annotation of each record
            data.append(record.INFO["CSQ"][0].split("|"))

    df = pd.DataFrame(data, columns=column_csq_headers)
    df.to_csv(folder / f"{sample}_csq.csv.gz", sep=";", compression="gzip")
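
The column names above come from slicing the CSQ description at a fixed offset of 50 characters. A less brittle option is to split on the `Format: ` marker instead. The sketch below assumes the description has the form `Consequence annotations from Ensembl VEP. Format: <field>|<field>|...`, whose prefix happens to be 50 characters long. `csq_columns` is a hypothetical helper, and the example string is shortened for illustration.

```python
def csq_columns(description):
    """Extract the pipe-separated CSQ field names from an INFO description."""
    _, marker, fields = description.partition("Format: ")
    if not marker:
        raise ValueError("CSQ description has no 'Format: ' marker")
    return [name.strip() for name in fields.split("|")]


# Shortened, illustrative description string (real headers list more fields)
example = "Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL"
print(csq_columns(example))  # ['Allele', 'Consequence', 'IMPACT', 'SYMBOL']
```

With this helper, the header line would become `column_csq_headers = csq_columns(vcf_reader.infos["CSQ"].desc)`, and it fails loudly if the description has a different shape.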

##### To read the data:
```py
df = pd.read_csv("data/EE_015/EE_015_csq.csv.gz", sep=";", compression="gzip")
```
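
To load all three outputs of one sample at once, the same call can be wrapped in a small helper. This is a sketch under the naming used in the cells above (`<sample>_default`, `<sample>_genotype`, `<sample>_csq`); `output_paths` and `load_sample` are hypothetical names.

```python
from pathlib import Path


def output_paths(sample, data_root="data"):
    """Return the expected compressed CSV path for each output kind."""
    folder = Path(data_root) / sample
    kinds = ("default", "genotype", "csq")
    return {kind: folder / f"{sample}_{kind}.csv.gz" for kind in kinds}


def load_sample(sample, data_root="data"):
    """Read every output of one sample into a dict of DataFrames."""
    import pandas as pd  # imported lazily so output_paths works without pandas

    return {
        kind: pd.read_csv(path, sep=";", compression="gzip")
        for kind, path in output_paths(sample, data_root).items()
    }
```

For example, `load_sample("EE_015")["csq"].head()` shows the first rows of the CSQ table.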