# 1) Exploratory data analysis
## 1.1) The train.csv file

In [1]:
import pandas as pd

### The train.csv file maps IDs to study titles and to labels indicating which dataset each study mentions.
### Each ID corresponds to a JSON file in the /train directory containing the text of one study.

In [2]:
dataset = pd.read_csv('train.csv')

In [3]:
dataset

Unnamed: 0,Id,pub_title,dataset_title,dataset_label,cleaned_label
0,d0fa7568-7d8e-4db9-870f-f9c6f668c17b,The Impact of Dual Enrollment on College Degre...,National Education Longitudinal Study,National Education Longitudinal Study,national education longitudinal study
1,2f26f645-3dec-485d-b68d-f013c9e05e60,Educational Attainment of High School Dropouts...,National Education Longitudinal Study,National Education Longitudinal Study,national education longitudinal study
2,c5d5cd2c-59de-4f29-bbb1-6a88c7b52f29,Differences in Outcomes for Female and Male St...,National Education Longitudinal Study,National Education Longitudinal Study,national education longitudinal study
3,5c9a3bc9-41ba-4574-ad71-e25c1442c8af,Stepping Stone and Option Value in a Model of ...,National Education Longitudinal Study,National Education Longitudinal Study,national education longitudinal study
4,c754dec7-c5a3-4337-9892-c02158475064,"Parental Effort, School Resources, and Student...",National Education Longitudinal Study,National Education Longitudinal Study,national education longitudinal study
...,...,...,...,...,...
19656,b3498176-8832-4033-aea6-b5ea85ea04c4,RSNA International Trends: A Global Perspectiv...,RSNA International COVID-19 Open Radiology Dat...,RSNA International COVID Open Radiology Database,rsna international covid open radiology database
19657,f77eb51f-c3ac-420b-9586-cb187849c321,MCCS: a novel recognition pattern-based method...,CAS COVID-19 antiviral candidate compounds dat...,CAS COVID-19 antiviral candidate compounds dat...,cas covid 19 antiviral candidate compounds dat...
19658,ab59bcdd-7b7c-4107-93f5-0ccaf749236c,Quantitative Structure–Activity Relationship M...,CAS COVID-19 antiviral candidate compounds dat...,CAS COVID-19 antiviral candidate compounds dat...,cas covid 19 antiviral candidate compounds dat...
19659,fd23e7e0-a5d2-4f98-992d-9209c85153bb,A ligand-based computational drug repurposing ...,CAS COVID-19 antiviral candidate compounds dat...,CAS COVID-19 antiviral candidate compounds dat...,cas covid 19 antiviral candidate compounds dat...


### There are almost 20 thousand rows, but the code below shows only about 14 thousand unique studies. (Curiously, the number of distinct IDs does not match the number of distinct titles!) About 5 thousand rows therefore repeat a study that already appears, possibly with exactly the same information.

In [4]:
print(len(dataset["Id"].unique()))
print(len(dataset["pub_title"].unique()))

14316
14268
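The gap between rows and unique IDs can have two sources: rows that are identical in every column, or the same ID paired with different labels. Below is a minimal sketch of how the two cases could be told apart. The frame `toy` is made up for illustration and is not taken from train.csv.

```python
import pandas as pd

# Toy frame (hypothetical rows) mimicking the columns of train.csv.
toy = pd.DataFrame({
    "Id": ["a", "a", "b", "b", "c"],
    "pub_title": ["T1", "T1", "T2", "T2", "T3"],
    "cleaned_label": ["adni", "adni", "slosh model", "noaa tide gauge", "ibtracs"],
})

# Rows that repeat an earlier row in every column.
exact_duplicates = int(toy.duplicated().sum())

# Rows whose Id already appeared, whatever the other columns contain.
repeated_ids = int(toy.duplicated(subset="Id").sum())

print(exact_duplicates, repeated_ids)
```

If the second number is larger than the first on the real data, some studies are listed with more than one label.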


### In total, 156 different dataset labels appear across all these studies. The true number of datasets is probably smaller, because the same dataset is sometimes referred to by different names; for example, the labels near the end that mention the SARS-CoV-2 genome.

In [5]:
x = dataset['cleaned_label'].unique()
print(len(x))
x

156


array(['national education longitudinal study', 'noaa tidal station',
       'slosh model', 'noaa c cap', 'aging integrated database agid ',
       'alzheimers disease neuroimaging initiative',
       'aging integrated database',
       'noaa national water level observation network',
       'noaa water level station',
       'baltimore longitudinal study of aging blsa ',
       'Baltimore Longitudinal Study of Aging (BLSA)',
       'national water level observation network',
       'arms farm financial and crop production practices',
       'beginning postsecondary student',
       'noaa sea lake and overland surges from hurricanes',
       'noaa tide gauge',
       'the national institute on aging genetics of alzheimer s disease data storage site',
       'national center for education statistics common core of data',
       'national science foundation survey of industrial research and development',
       'baccalaureate and beyond',
       'noaa international best track archive for
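One rough way to surface labels that may refer to the same dataset is to compare them word by word. The sketch below uses word-level Jaccard similarity. The helper names `jaccard` and `similar_pairs` and the 0.5 threshold are assumptions made for illustration, not part of the original analysis.

```python
from itertools import combinations

def jaccard(a, b):
    """Word-level Jaccard similarity between two label strings."""
    words_a, words_b = set(a.split()), set(b.split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)

def similar_pairs(labels, threshold=0.5):
    """Return label pairs whose similarity reaches the (arbitrary) threshold."""
    return [(a, b) for a, b in combinations(labels, 2) if jaccard(a, b) >= threshold]

# A few cleaned labels that appear in the outputs of this notebook.
labels = [
    "baltimore longitudinal study of aging",
    "baltimore longitudinal study of aging blsa ",
    "sars cov 2 genome sequences",
    "genome sequence of sars cov 2",
    "slosh model",
]
for pair in similar_pairs(labels):
    print(pair)
```

A word-overlap measure will miss pure acronyms (for instance `adni` against its spelled-out form), so candidate pairs would still need manual review.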

### There are no missing values. I used to think our task was to detect WHETHER a study cites a dataset, but now I see that each row pairs a study with exactly one dataset label, and we need to determine WHICH one. I believe the citations will come only from the 156 datasets above... or is it open to any possible dataset in the world?

In [6]:
print(dataset['Id'].isna().sum())
print(dataset['pub_title'].isna().sum())
print(dataset['cleaned_label'].isna().sum())

0
0
0
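Whether each study really maps to a single dataset can be checked directly, since the repeated IDs seen earlier might carry different labels. The following sketch counts distinct labels per ID on a small made-up frame; the rows are hypothetical.

```python
import pandas as pd

# Toy frame (hypothetical rows): study "a" appears with two different labels.
toy = pd.DataFrame({
    "Id": ["a", "a", "b", "c"],
    "cleaned_label": [
        "adni",
        "alzheimer s disease neuroimaging initiative adni ",
        "slosh model",
        "ibtracs",
    ],
})

# Number of distinct labels attached to each study ID.
labels_per_id = toy.groupby("Id")["cleaned_label"].nunique()

# IDs associated with more than one label.
multi_label_ids = labels_per_id[labels_per_id > 1].index.tolist()
print(multi_label_ids)
```

An empty list on the real data would support the one-dataset-per-study reading; a non-empty one would mean the task is multi-label.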


### The datasets vary widely in how often they are cited. Many are cited hundreds of times, others only a few times (duplicates are not taken into account here).

In [7]:
for label, count in dataset['cleaned_label'].value_counts().items():
    print(f"{count} times: {label}.")

3660 times: adni.
2390 times: alzheimer s disease neuroimaging initiative adni .
1152 times: baltimore longitudinal study of aging.
1151 times: trends in international mathematics and science study.
1003 times: early childhood longitudinal study.
671 times: education longitudinal study.
643 times: census of agriculture.
621 times: agricultural resource management survey.
545 times: national education longitudinal study.
490 times: rural urban continuum codes.
431 times: baltimore longitudinal study of aging blsa .
426 times: survey of earned doctorates.
380 times: north american breeding bird survey.
314 times: world ocean database.
304 times: slosh model.
299 times: noaa tide gauge.
298 times: survey of doctorate recipients.
280 times: ibtracs.
255 times: coastal change analysis program.
252 times: common core of data.
242 times: sars cov 2 genome sequences.
239 times: beginning postsecondary students.
222 times: genome sequence of sars cov 2.
210 times: our world in data.
198 times: 

## 1.2) Files in /train
### Function that builds a dictionary whose keys are IDs and whose values are the corresponding texts, read from the .json files in the /train directory

In [8]:
import os

In [10]:
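def get_JSON(URL):
    """Map each ID (file name without '.json') to the raw text of its file."""
    # URL is a directory path relative to the current working directory.
    dirpath = os.path.join(os.getcwd(), URL)
    id2text = dict()
    for filename in os.listdir(dirpath):
        if not filename.endswith('.json'):
            continue  # skip anything that is not a study file
        ID = filename[:-5]  # strip the '.json' extension
        filepath = os.path.join(dirpath, filename)
        # Close the file promptly and read it as UTF-8.
        with open(filepath, encoding='utf-8') as f:
            id2text[ID] = f.read()
    return id2text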

### Builds a dictionary whose entries are ID:text pairs

In [11]:
id2text = get_JSON('train')

for ID, text in id2text.items():
    print(ID)
    print(text[:1000])
    break

0007f880-0a9b-492d-9a58-76eb0b0e0bd7
[{"section_title": "Abstract", "text": "The aim of this study was to identify if acquiring ICT skills through DOT Lebanon's ICT training program (a local NGO) improved income generation opportunities after 3-months of completing the training. The target population was the NGO's vulnerable young beneficiaries. This study was completed in an effort to find creative and digital solutions to the high rate of youth unemployment in Lebanon (37%), one of the highest rates in the world. Results showed that 48% of beneficiaries who were unemployed at baseline, were exposed to at least one income generation opportunity 3 months after completing the DOT Lebanon training. Also, 49% of beneficiaries who were already employed at baseline were exposed to at least one income generation opportunity. Gender, English proficiency and governorate were variables that were found to be statistically significant. Males were more likely than females to be exposed to income g
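The printed sample suggests that each file holds a JSON list of sections, each with a `section_title` and a `text` field. Assuming every file follows that layout, the raw string could be parsed and flattened into plain text along these lines. The function name `sections_to_text` and the sample string are hypothetical.

```python
import json

def sections_to_text(raw):
    """Join the 'text' field of every section in one raw JSON string."""
    sections = json.loads(raw)
    # Sections without a 'text' key contribute an empty string.
    return "\n".join(section.get("text", "") for section in sections)

# Hypothetical two-section document in the same layout as the files in /train.
raw = (
    '[{"section_title": "Abstract", "text": "First part."}, '
    '{"section_title": "Methods", "text": "Second part."}]'
)
print(sections_to_text(raw))
```

Keeping the section titles separate may also be worthwhile later, since a dataset mention in a "Data" or "Methods" section could matter more than one elsewhere.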

# 2) Data cleaning
### Removes duplicate rows with the same ID (keeping the first occurrence) and makes rows accessible by ID

In [12]:
dataset.drop_duplicates(subset="Id", inplace=True)
dataset.set_index("Id", inplace=True)
dataset

Id,pub_title,dataset_title,dataset_label,cleaned_label
d0fa7568-7d8e-4db9-870f-f9c6f668c17b,The Impact of Dual Enrollment on College Degre...,National Education Longitudinal Study,National Education Longitudinal Study,national education longitudinal study
2f26f645-3dec-485d-b68d-f013c9e05e60,Educational Attainment of High School Dropouts...,National Education Longitudinal Study,National Education Longitudinal Study,national education longitudinal study
c5d5cd2c-59de-4f29-bbb1-6a88c7b52f29,Differences in Outcomes for Female and Male St...,National Education Longitudinal Study,National Education Longitudinal Study,national education longitudinal study
5c9a3bc9-41ba-4574-ad71-e25c1442c8af,Stepping Stone and Option Value in a Model of ...,National Education Longitudinal Study,National Education Longitudinal Study,national education longitudinal study
c754dec7-c5a3-4337-9892-c02158475064,"Parental Effort, School Resources, and Student...",National Education Longitudinal Study,National Education Longitudinal Study,national education longitudinal study
...,...,...,...,...
f89dd9fa-07af-4384-aa0c-0d14602c0cea,Artificial Intelligence of COVID-19 Imaging: A...,RSNA International COVID-19 Open Radiology Dat...,RSNA International COVID-19 Open Radiology Dat...,rsna international covid 19 open radiology dat...
b3498176-8832-4033-aea6-b5ea85ea04c4,RSNA International Trends: A Global Perspectiv...,RSNA International COVID-19 Open Radiology Dat...,RSNA International COVID Open Radiology Database,rsna international covid open radiology database
f77eb51f-c3ac-420b-9586-cb187849c321,MCCS: a novel recognition pattern-based method...,CAS COVID-19 antiviral candidate compounds dat...,CAS COVID-19 antiviral candidate compounds dat...,cas covid 19 antiviral candidate compounds dat...
ab59bcdd-7b7c-4107-93f5-0ccaf749236c,Quantitative Structure–Activity Relationship M...,CAS COVID-19 antiviral candidate compounds dat...,CAS COVID-19 antiviral candidate compounds dat...,cas covid 19 antiviral candidate compounds dat...


In [13]:
dataset.loc["d0fa7568-7d8e-4db9-870f-f9c6f668c17b", "cleaned_label"]

'national education longitudinal study'

### Attaches to each ID in dataset the text contained in the corresponding .json file

In [14]:
for ID, text in id2text.items():
    dataset.loc[ID, "text"] = text

dataset

Id,pub_title,dataset_title,dataset_label,cleaned_label,text
d0fa7568-7d8e-4db9-870f-f9c6f668c17b,The Impact of Dual Enrollment on College Degre...,National Education Longitudinal Study,National Education Longitudinal Study,national education longitudinal study,"[{""section_title"": ""What is this study about?""..."
2f26f645-3dec-485d-b68d-f013c9e05e60,Educational Attainment of High School Dropouts...,National Education Longitudinal Study,National Education Longitudinal Study,national education longitudinal study,"[{""section_title"": ""November 2004"", ""text"": ""D..."
c5d5cd2c-59de-4f29-bbb1-6a88c7b52f29,Differences in Outcomes for Female and Male St...,National Education Longitudinal Study,National Education Longitudinal Study,national education longitudinal study,"[{""section_title"": ""Differences in Outcomes fo..."
5c9a3bc9-41ba-4574-ad71-e25c1442c8af,Stepping Stone and Option Value in a Model of ...,National Education Longitudinal Study,National Education Longitudinal Study,national education longitudinal study,"[{""section_title"": ""Abstract"", ""text"": ""Federa..."
c754dec7-c5a3-4337-9892-c02158475064,"Parental Effort, School Resources, and Student...",National Education Longitudinal Study,National Education Longitudinal Study,national education longitudinal study,"[{""section_title"": ""Abstract"", ""text"": ""This a..."
...,...,...,...,...,...
f89dd9fa-07af-4384-aa0c-0d14602c0cea,Artificial Intelligence of COVID-19 Imaging: A...,RSNA International COVID-19 Open Radiology Dat...,RSNA International COVID-19 Open Radiology Dat...,rsna international covid 19 open radiology dat...,"[{""section_title"": """", ""text"": ""T he coronavir..."
b3498176-8832-4033-aea6-b5ea85ea04c4,RSNA International Trends: A Global Perspectiv...,RSNA International COVID-19 Open Radiology Dat...,RSNA International COVID Open Radiology Database,rsna international covid open radiology database,"[{""section_title"": ""Introduction"", ""text"": ""Ou..."
f77eb51f-c3ac-420b-9586-cb187849c321,MCCS: a novel recognition pattern-based method...,CAS COVID-19 antiviral candidate compounds dat...,CAS COVID-19 antiviral candidate compounds dat...,cas covid 19 antiviral candidate compounds dat...,"[{""section_title"": ""Introduction"", ""text"": ""Th..."
ab59bcdd-7b7c-4107-93f5-0ccaf749236c,Quantitative Structure–Activity Relationship M...,CAS COVID-19 antiviral candidate compounds dat...,CAS COVID-19 antiviral candidate compounds dat...,cas covid 19 antiviral candidate compounds dat...,"[{""section_title"": ""INTRODUCTION"", ""text"": ""Th..."
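Assigning through `.loc` inside a loop is slow on thousands of rows, and it would add a new row for any ID that has a file but no entry in the table. A possible alternative is to map the index through the dictionary in one step, sketched here on made-up data (`toy` and `toy_id2text` are hypothetical stand-ins for `dataset` and `id2text`).

```python
import pandas as pd

# Toy stand-ins for the real table and dictionary.
toy = pd.DataFrame(
    {"cleaned_label": ["adni", "slosh model"]},
    index=pd.Index(["a", "b"], name="Id"),
)
toy_id2text = {"a": "text of a", "c": "text with no row in the table"}

# Look up each index value in the dictionary. IDs without a text become NaN,
# and dictionary keys absent from the index do not create new rows.
toy["text"] = toy.index.map(toy_id2text)
print(toy)
```

Rows left with NaN in `text` would then flag IDs whose JSON file is missing.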


# 3) Storing the cleaned dataset

### Saves the dataset to a serialized (pickle) file so that get_JSON() does not need to be run again

In [16]:
dataset.to_pickle('clean_train')
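In a later session the table can be restored with `pd.read_pickle`. The round trip below is a small sketch on a made-up frame, written to a temporary directory so that it does not touch the real `clean_train` file.

```python
import os
import tempfile

import pandas as pd

# Toy frame (hypothetical row) indexed by Id, like the cleaned dataset.
toy = pd.DataFrame({"cleaned_label": ["adni"]}, index=pd.Index(["a"], name="Id"))

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "clean_train")  # same file name, temporary location
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# The restored frame should match the original, index included.
print(restored.equals(toy))
```

Pickle files should only be loaded from trusted sources, and they may not load across very different pandas versions.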