### Standardize Study Data Format to CSV

The goal of this notebook is to standardize the data in this study's raw directory to CSV. No attempt will be made to harmonize or modify the data beyond reformatting.

In [1]:
import pandas as pd
import fsspec
from pathlib import Path

import openpyxl

In [2]:
fs = fsspec.filesystem("")

In [3]:
fs.mkdirs('../csv/', exist_ok=True)

In [4]:
avail_data = [f.split('/')[-1] for f in fs.ls('../raw/') if f.split('/')[-1][0] != '.']
avail_data

['CINECA_synthetic_cohort_Europe_CH_SIB_2021_03_10.xlsx']

Just a single Excel file in this study:

In [5]:
workbook = openpyxl.load_workbook('../raw/CINECA_synthetic_cohort_Europe_CH_SIB_2021_03_10.xlsx')

And just a single sheet in the workbook:

In [6]:
workbook.sheetnames

['CINECA_synthetic_cohort_Europe_']
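If a workbook in another study had several sheets, the same reformatting step could be applied per sheet. The following is a minimal sketch, not part of this notebook's workflow: `write_sheets_to_csv`, the `stem__sheet` file naming, and the toy frames are all hypothetical. It assumes the input is a mapping of sheet name to DataFrame, which is what `pd.read_excel(path, sheet_name=None)` returns. It also writes with `index=False`, which differs from the save step used later in this notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd


def write_sheets_to_csv(sheets, out_dir, stem):
    """Write each DataFrame in `sheets` to its own CSV file and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for sheet_name, frame in sheets.items():
        # Hypothetical naming scheme: <stem>__<sheet name>.csv
        target = out_dir / f"{stem}__{sheet_name}.csv"
        frame.to_csv(target, index=False)
        written.append(target)
    return written


# Toy stand-in for the mapping returned by pd.read_excel(path, sheet_name=None)
toy_sheets = {
    "sheet_a": pd.DataFrame({"pt": ["FAKE0"], "age": [73]}),
    "sheet_b": pd.DataFrame({"pt": ["FAKE1"], "age": [41]}),
}

with tempfile.TemporaryDirectory() as tmp:
    written_names = [p.name for p in write_sheets_to_csv(toy_sheets, tmp, "toy_study")]
```

With only one sheet, as here, reading that sheet directly is simpler.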

Open the sheet:

In [7]:
df = pd.read_excel('../raw/CINECA_synthetic_cohort_Europe_CH_SIB_2021_03_10.xlsx', sheet_name='CINECA_synthetic_cohort_Europe_')

In [8]:
df

Unnamed: 0,pt,phyact,alcfrq,sbsmk,ethori_self,jobtyp,lvpl,SBP,DBP,mrtsts2,gender,age,wt,ht,cafuse,HRTRTE,dginvtx2,dginvtx3,cmatccd1_2,cmatccd1_3
0,FAKE0,>3WK,N,N,O,SE,SW,132.5,68.0,0,0,73,71.3,146.0,NONE,80.5,CEREBRAL PROBLEMS,HEROIN ADDICT,S01B,S01CA
1,FAKE1,2WK,SW,S,B,IW,PT,188.0,59.0,1,0,41,69.6,149.0,1B3,83.0,LUMBAR PAINS,ESOPHAGEAL AND STOMACH DISORDER,D05A,C08DB
2,FAKE2,K,R,N,O,PR,SW,178.0,114.0,1,0,52,105.8,143.0,1B3,74.0,ALCOHOL PROBLEMS,"ANXIETY, DEPRESSION",A03FA,J05AE
3,FAKE3,K,SW,F,O,ME,SZ,204.5,114.0,1,1,50,48.7,181.5,>6,66.5,EXPECTORANT,ACID RELFUX DISEASE,B05BA,C05CX
4,FAKE4,K,3D,F,B,PR,SW,110.0,87.5,0,1,43,82.4,182.5,>6,77.0,ANTIBIOTIC (GERM INFECTION),PROSTATE PROBLEM,S01AX,A12CC
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6728,FAKE6728,1WK,2WK,N,B,EU,PT,151.0,102.0,1,1,51,89.7,186.0,4B6,90.5,UNKNOWN STOMACH PROBLEM,TONIC,N05AN,C08DB
6729,FAKE6729,N,1D,F,K,QW,PT,155.0,72.0,0,0,40,63.4,182.0,>6,80.0,ALLERGIC TO DUST MITE (STOPPED THE MEDICATION ...,RENAL LITHIASIS,A10BG,S01CA
6730,FAKE6730,2WK,N,S,W,FM,SZ,112.5,55.0,1,1,61,61.3,187.0,4B6,91.0,HANDS PAINS,VESTIBULITIS,N05AX,N02CX
6731,FAKE6731,>3WK,R,S,A,FM,SW,102.0,53.5,1,1,59,82.5,174.0,4B6,79.5,PAIN RELEVER,PREVENTION - WELLNESS,S01XA,N07CA


Get study name from folder name:

In [9]:
study_name = fs.ls('')[0].split('/')[-3]
study_name

'Europe_CH_SIB'
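The expression above takes the first entry listed in the working directory and reads the path component two levels up, which is the parent of the working directory. A possibly less fragile way to get the same component is `pathlib`, sketched here under the assumption that the notebook runs from a directory directly inside the study folder. The helper name `study_name_from_dir` and the example path (including the `notebooks` directory name) are hypothetical.

```python
from pathlib import Path


def study_name_from_dir(notebook_dir: Path) -> str:
    """Return the name of the directory that contains `notebook_dir`."""
    return notebook_dir.parent.name


# Hypothetical layout: <study>/notebooks/, <study>/raw/, <study>/csv/
example_dir = Path("/data/Europe_CH_SIB/notebooks")
example_study_name = study_name_from_dir(example_dir)

# In the notebook itself one would pass Path.cwd() instead of example_dir.
```

This does not depend on the working directory containing at least one entry, nor on the order in which entries are listed.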

Save to CSV format:

In [10]:
df.to_csv(f'../csv/{study_name}.csv')
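Note that `DataFrame.to_csv` writes the row index as an extra leading column unless `index=False` is passed. Whether that column is wanted here is a judgment call for the study. The sketch below only illustrates the difference on a toy frame (the frame and variable names are illustrative, not the study data).

```python
import io

import pandas as pd

# Toy frame standing in for the study data
toy = pd.DataFrame({"pt": ["FAKE0", "FAKE1"], "age": [73, 41]})

# Default: the row index is written as an unnamed first column
header_with_index = toy.to_csv().splitlines()[0]

# index=False: only the data columns are written
header_without_index = toy.to_csv(index=False).splitlines()[0]

# Reading the default output back yields an extra "Unnamed: 0" column
reloaded_columns = list(pd.read_csv(io.StringIO(toy.to_csv())).columns)
```

Here `header_with_index` starts with an empty field, and the reloaded frame gains an `Unnamed: 0` column that was not in the original data columns.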