# Harmonize the format
The datasets differ in format. The project's first step is therefore to 
convert all of them to one uniform format.

## Loading the data
List the files that contain the original data:

In [1]:
from os import listdir
from os.path import isfile, join

# where the original data is located
directory = '../../../../data/datasets/01_original'

files = [f for f in listdir(directory) if isfile(join(directory, f))] 
subjects = [file.split('_raw.')[0] for file in files] # filename without the '_raw.<extension>' suffix

Load each file into a dataframe and group the dataframes in a dictionary keyed by subject:

In [2]:
import pandas as pd

datasets = {
    subject: {
        # .tsv files are tab-separated; everything else is read as comma-separated
        'dataframe': pd.read_csv(f'{directory}/{filename}',
                                 sep='\t' if filename.endswith('.tsv') else ','),
        # the .csv files are flagged as SYNERGY datasets
        'synergy': filename.endswith('.csv'),
    }
    for subject, filename in zip(subjects, files)
}
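
The loading step relies on a filename convention that the code implies but never shows. The sketch below spells it out for two hypothetical filenames; `adhd_raw.csv` and `pancreatic_surgery_raw.tsv` are assumptions, and the actual names in `01_original` may differ. The helper `describe_file` is illustrative only and not part of the notebook.

```python
from typing import Tuple


def describe_file(filename: str) -> Tuple[str, str, bool]:
    """Return (subject, separator, synergy flag) for a raw data filename."""
    # the subject is everything before the '_raw.<extension>' suffix
    subject = filename.split('_raw.')[0]
    # tab for .tsv files, comma otherwise
    separator = '\t' if filename.endswith('.tsv') else ','
    # .csv files are flagged as SYNERGY datasets
    synergy = filename.endswith('.csv')
    return subject, separator, synergy


# hypothetical filenames, used only to illustrate the convention
for name in ['adhd_raw.csv', 'pancreatic_surgery_raw.tsv']:
    print(name, '->', describe_file(name))
```

A file that does not contain `_raw.` would keep its full name as the subject, so the convention is worth checking before loading.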

## Different formats
Access a dataset from the dictionary by its subject:

In [3]:
datasets['pancreatic_surgery']['dataframe'].head()

|   | State | StudyType | Abstract | Title | LiteratureId | ArticleUrl | FirstAuthor | Doi |
|---|-------|-----------|----------|-------|--------------|------------|-------------|-----|
| 0 | 3 | 7 | In this paper, I consider: the value of variou... | Reflections and proposals for the standardizat... | 10718171 |  | Elias | 10.1053/ejso.1999.0731 |
| 1 | 3 | 7 | The importance of diagnostic endoscopic retrog... | Diagnostic endoscopic retrograde cholangiopanc... | 10718385 |  | Ponchon | 10.1055/s-2000-95 |
| 2 | 3 | 7 | A number of endoscopic interventions have expa... | Therapeutic pancreatic endoscopy. | 10718387 |  | Neuhaus | 10.1055/s-2000-94 |
| 3 | 3 | 7 | BACKGROUND: Gastric lipase contributes signifi... | Cephalic phase of lipolysis is impaired in pan... | 10720121 |  | Wøjdemann | 10.1080/003655200750024407 |
| 4 | 3 | 7 | BACKGROUND/AIM: The pancreas is an organ highl... | Ischemia/Reperfusion-Induced pancreatitis. | 10720825 |  | Sakorafas | 10.1159/000018793 |


The pancreatic surgery dataset has a different format from the SYNERGY datasets, such as the one on ADHD:

In [4]:
datasets['adhd']['dataframe'].head()

|   | pmid | doi | openalex_id | label_included |
|---|------|-----|-------------|----------------|
| 0 | https://pubmed.ncbi.nlm.nih.gov/10051933 | https://doi.org/10.1007/bf03012457 | https://openalex.org/W2082613933 | 0 |
| 1 | https://pubmed.ncbi.nlm.nih.gov/10053177 | https://doi.org/10.1056/nejm199903043400903 | https://openalex.org/W2312609348 | 0 |
| 2 | https://pubmed.ncbi.nlm.nih.gov/10066996 | https://doi.org/10.1037/0021-843x.108.1.90 | https://openalex.org/W2022904832 | 0 |
| 3 | https://pubmed.ncbi.nlm.nih.gov/10072008 | https://doi.org/10.1097/00000539-199903000-00020 | https://openalex.org/W2021097359 | 0 |
| 4 | https://pubmed.ncbi.nlm.nih.gov/10072410 | https://doi.org/10.1056/nejm199903113401003 | https://openalex.org/W4239283954 | 0 |


## Uniformize

Define a function that transforms each dataframe to the following uniform format:

| include 	| title 	| abstract 	| doi 	| literature_id 	| openalex_id 	|
|---------	|-------	|----------	|-----	|---------------	|-------------	|
| bool    	| str   	| str      	| str 	| str           	| str         	|

In [5]:
import numpy as np
import re

# column names and values differ between SYNERGY and non-SYNERGY datasets
def uniformize(dataframe: pd.DataFrame, synergy: bool) -> pd.DataFrame:

    # the datasets differ in column names for labels, doi, and identifiers
    label_column = 'label_included' if synergy else 'State'
    doi = 'doi' if synergy else 'Doi'
    id_column = 'pmid' if synergy else 'LiteratureId'

    # uniformize the include label
    exclude_label = 0 if synergy else 3
    mapping = lambda x: x != exclude_label
    
    # identifiers come from Web of Science (prefix WOS:), Cochrane Central (prefix CN-),
    # PubMed (no prefix), or are hand-signed (prefix HS-)
    id_formats = r'(WOS:|CN-|HS-)*([A-Z]|\d)+$'
    # extract the identifier from the original column; missing values become pd.NA
    literature_ids = [re.search(id_formats, raw_id).group()
                      if pd.notna(raw_id) else pd.NA for raw_id in dataframe[id_column]]

    return pd.DataFrame(
        data={
            'include': dataframe[label_column].map(mapping),
            'title': pd.NA if synergy else dataframe['Title'],
            'abstract': pd.NA if synergy else dataframe['Abstract'],
            'doi': dataframe[doi],
            'literature_id': literature_ids,
            'openalex_id': dataframe['openalex_id'] if synergy else pd.NA,
        }
    )
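
To see what the identifier pattern in `uniformize` extracts, the sketch below applies the same regular expression to a few sample values. The PubMed URL is taken from the `adhd` preview above; the `WOS:`, `CN-`, and `HS-` values are made up for illustration, and `extract_id` is a hypothetical helper that returns `None` when nothing matches.

```python
import re
from typing import Optional

# same pattern as in uniformize: optional source prefix,
# then capital letters or digits up to the end of the string
ID_PATTERN = r'(WOS:|CN-|HS-)*([A-Z]|\d)+$'


def extract_id(raw_id: str) -> Optional[str]:
    """Return the trailing identifier of raw_id, or None if there is none."""
    match = re.search(ID_PATTERN, raw_id)
    return match.group() if match else None


samples = [
    'https://pubmed.ncbi.nlm.nih.gov/10051933',  # from the adhd preview
    'WOS:000123456789012',                       # made-up value
    'CN-00123456',                               # made-up value
    'HS-42',                                     # made-up value
]
for sample in samples:
    print(sample, '->', extract_id(sample))
```

Because the pattern is anchored at the end of the string, the URL part of a PubMed link is dropped while the prefixes `WOS:`, `CN-`, and `HS-` are kept as part of the identifier.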

Uniformize the datasets and save them in a new dictionary:

In [6]:
uniform_datasets = {key: uniformize(
    value['dataframe'], value['synergy']) for key, value in datasets.items()}
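
The label rule inside `uniformize` can also be checked in isolation. The sketch below mirrors it with a hypothetical helper, `to_include`; the label values passed in are illustrative, and which `State` values other than 3 actually occur in the data is not shown here.

```python
def to_include(label: int, synergy: bool) -> bool:
    """Mirror the label rule in uniformize: one value per source means 'excluded'."""
    # SYNERGY marks excluded records with 0, the other datasets with State 3
    exclude_label = 0 if synergy else 3
    return label != exclude_label


print(to_include(0, synergy=True))   # SYNERGY record labelled 0
print(to_include(1, synergy=True))   # SYNERGY record labelled 1
print(to_include(3, synergy=False))  # non-SYNERGY record with State 3
```

Note that for the non-SYNERGY datasets every `State` other than 3 counts as included, so it may be worth confirming that this matches the meaning of the remaining states.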

Verify the new format:

In [7]:
uniform_datasets['adhd'].head()

|   | include | title | abstract | doi | literature_id | openalex_id |
|---|---------|-------|----------|-----|---------------|-------------|
| 0 | False |  |  | https://doi.org/10.1007/bf03012457 | 10051933 | https://openalex.org/W2082613933 |
| 1 | False |  |  | https://doi.org/10.1056/nejm199903043400903 | 10053177 | https://openalex.org/W2312609348 |
| 2 | False |  |  | https://doi.org/10.1037/0021-843x.108.1.90 | 10066996 | https://openalex.org/W2022904832 |
| 3 | False |  |  | https://doi.org/10.1097/00000539-199903000-00020 | 10072008 | https://openalex.org/W2021097359 |
| 4 | False |  |  | https://doi.org/10.1056/nejm199903113401003 | 10072410 | https://openalex.org/W4239283954 |
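
Inspecting one dataset by eye does not cover the others. A programmatic check on the column names is one possible complement; the sketch below is an assumption about how such a check could look, and `has_uniform_columns` is a hypothetical helper.

```python
# column names of the uniform format, in the order produced by uniformize
EXPECTED_COLUMNS = ['include', 'title', 'abstract', 'doi',
                    'literature_id', 'openalex_id']


def has_uniform_columns(columns) -> bool:
    """Return True if the column names match the uniform format exactly."""
    return list(columns) == EXPECTED_COLUMNS


print(has_uniform_columns(EXPECTED_COLUMNS))
print(has_uniform_columns(['pmid', 'doi', 'openalex_id', 'label_included']))
```

In the notebook this could be applied to every dataset with `all(has_uniform_columns(df.columns) for df in uniform_datasets.values())`.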


### Export

In [8]:
directory_to_save = '../../../../data/01_uniform'

for subject, dataframe in uniform_datasets.items():
    dataframe.to_csv(f'{directory_to_save}/{subject}_uniform.csv', index=False)
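
The export assumes that the target directory already exists; `to_csv` does not create missing directories. A minimal sketch of a guard, using a hypothetical helper `uniform_path` that is not part of the notebook:

```python
from pathlib import Path


def uniform_path(directory: str, subject: str) -> Path:
    """Create the output directory if needed and return the CSV path for a subject."""
    target = Path(directory)
    # parents=True also creates intermediate folders; exist_ok avoids errors on reruns
    target.mkdir(parents=True, exist_ok=True)
    return target / f'{subject}_uniform.csv'
```

With this helper, the export call could become `dataframe.to_csv(uniform_path(directory_to_save, subject), index=False)`.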