# Integrate Labeled Dataset

In [1]:
import pandas as pd

Read the CSV from a Google Spreadsheet. This only works when the spreadsheet is public and has also been **published to the web** under `File > Publish To Web`.

In [2]:
df = pd.read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vSgqy_wWDV8vT_nVjMcL71shNayrzINnLKFBTq5IHe5EjeDemP4DbGcjEdsmHyz_o9mVq49-9txSCBl/pub?output=csv")

## Clean typos and empty fields from the incoming CSV

In [3]:
df = df.dropna(subset=["labels"])  # drop rows without a label
df["labels"] = df["labels"].astype(int)
df = df[df["labels"].isin([0, 1])]  # drop mistyped labels, e.g. 0 entered as 9
df = df[["Title", "PMID", "text", "DOI", "labels"]]
df.head()

Unnamed: 0,Title,PMID,text,DOI,labels
0,Pregabalin for postherpetic itch: a case report,32206971,Postherpetic itch has not commonly received at...,10.1186/s40981-020-00330-x,1
1,Endocytosis and transcytosis of gliadin peptides,26883352,Celiac disease (CD) is a frequent inflammatory...,10.1186/s40348-015-0029-z,1
2,Introducing A Family With Tens of Rare Craniof...,31299789,Craniofacial clefts are one of the rarest cong...,10.1097/SCS.0000000000005269,1
3,Smart Wearable Device Users' Behavior Is Essen...,34363130,This study aimed to explore the effect on phys...,10.1007/s12529-021-10013-1,0
4,Effect of grape seed proanthocyanidins on acti...,33427704,Grape seed proanthocyanidin extract (GSPE) has...,10.3233/THC-202655,1
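The cleaning step above assumes every non-empty label can be cast to an integer; a non-numeric typo would make `astype(int)` raise. A minimal sketch of a more tolerant variant follows, assuming such typos should simply be dropped. The helper name `clean_labels` and the toy data are hypothetical and not part of the original notebook.

```python
import pandas as pd

def clean_labels(frame: pd.DataFrame, column: str = "labels") -> pd.DataFrame:
    """Keep only rows whose label is exactly 0 or 1 (hypothetical helper)."""
    out = frame.copy()
    # Non-numeric entries such as "x" become NaN instead of raising.
    out[column] = pd.to_numeric(out[column], errors="coerce")
    # NaN is not in [0, 1], so missing and unparseable labels are dropped too.
    out = out[out[column].isin([0, 1])]
    return out.astype({column: int})

# Toy data: two valid labels, one mistyped digit, one empty field, one stray letter.
toy = pd.DataFrame({"text": ["a", "b", "c", "d", "e"],
                    "labels": ["1", "0", "9", None, "x"]})
cleaned = clean_labels(toy)
print(cleaned)
```

Only the rows labeled `1` and `0` survive; the mistyped `9`, the empty field, and the stray letter are all filtered out in one pass.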


In [4]:
df_old = pd.read_csv("datasets/problem_statements.csv")

In [5]:
# .copy() gives an independent DataFrame, so adding a column raises no SettingWithCopyWarning
new_data = df[~df["PMID"].isin(df_old["PMID"])].copy()
new_data["source"] = "labels_oct7"
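The selection of rows whose `PMID` is not yet in the existing dataset can be sketched on toy data. This is only an illustration: the helper `select_new_rows` and the toy frames are hypothetical, and `assign` is used here as one way to add the `source` column without modifying a slice of the original frame.

```python
import pandas as pd

def select_new_rows(incoming, existing, key="PMID", source="labels_oct7"):
    """Return rows of `incoming` whose key is absent from `existing` (hypothetical helper)."""
    is_new = ~incoming[key].isin(existing[key])
    # assign() returns a new DataFrame, so no slice of `incoming` is modified.
    return incoming[is_new].assign(source=source)

# The existing data holds PMIDs as floats because some rows have no PMID.
existing = pd.DataFrame({"PMID": [1.0, 2.0, None], "labels": [1, 0, 1]})
incoming = pd.DataFrame({"PMID": [2, 3], "labels": [0, 1]})
fresh = select_new_rows(incoming, existing)
print(fresh)
```

In this toy case only the row with `PMID` 3 is kept, since 2 already appears in the existing data even though it is stored there as a float.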


In [8]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
dfx = pd.concat([df_old, new_data], ignore_index=True)
dfx = dfx.dropna(subset=["labels"])
dfx["labels"] = dfx["labels"].astype(int)
dfx = dfx[dfx["labels"].isin([0, 1])]  # drop mistyped labels, e.g. 0 entered as 9
dfx = dfx[["Title", "PMID", "text", "DOI", "labels", "source"]]

In [9]:
dfx

Unnamed: 0,Title,PMID,text,DOI,labels,source
0,,,The difficulty with this task lies in the fact...,,1,acl_cambridge
1,,,The problem with rich annotations is that they...,,1,acl_cambridge
2,,,"As a consequence , when adapting existing meth...",,1,acl_cambridge
3,,,The second problem of traditional word alignme...,,1,acl_cambridge
4,,,The main drawback of these systems is that the...,,1,acl_cambridge
...,...,...,...,...,...,...
3574,Ten-Year Trends and Clinical Relevance of the ...,28472788.0,Antimicrobial resistance of Streptococcus pneu...,10.1159/000470828,1,labels_oct7
3575,"Program FACTOR at 10: Origins, development and...",28438248.0,We aim to provide a conceptual view of the ori...,10.7334/psicothema2016.304,0,labels_oct7
3576,How low an effect of a preventive measure agai...,27777090.0,Traveller's diarrhoea (TD) is the most common ...,10.1016/j.tmaid.2016.10.005,1,labels_oct7
3577,"Clinical study on the efficacy, acceptance, an...",29663792.0,The primary objective of this trial was to dem...,10.23736/S0031-0808.18.03447-X,0,labels_oct7


In [10]:
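# index=False keeps the row index out of the file, so no "Unnamed: 0" column appears on reload
dfx.to_csv("datasets/problem_statements.csv", index=False)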
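Because the saved file is read back in on the next run, it may help to see how the `index` argument of `to_csv` affects the columns that `read_csv` returns. The sketch below uses an in-memory buffer and a toy frame (both assumptions for illustration) rather than the real dataset file.

```python
import io
import pandas as pd

frame = pd.DataFrame({"PMID": [1, 2], "labels": [1, 0]})

# Default behaviour: the row index is written as an unnamed first column.
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False writes only the data columns.
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

The notebook's explicit column selection already discards any extra index column after loading, so either form works here; the difference matters mainly for other consumers of the file.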