# Cleaning Data PD


This notebook cleans the metadata retrieved with the Arcas software.

In [1]:
import glob
import pandas as pd


In [2]:
dfs = []
for filename in glob.glob("../data/PD_*.json"):
    dfs.append(pd.read_json(filename))


In [3]:
dfs.append(pd.read_json("../data/bibliography.json"))


In [4]:
df = pd.concat(dfs, ignore_index=True, sort=False)


In [5]:
df.provenance.unique()


array(['arXiv', 'Nature', 'IEEE', 'Springer', 'PLOS', 'Manual'],
      dtype=object)

In [6]:
len(df.title.unique()), len(df.unique_key.unique())


(3107, 3204)

In [7]:
provenance_size = (
    df.groupby(["unique_key", "provenance"])
    .size()
    .reset_index()
    .groupby("provenance")
    .size()
)
provenance_size


provenance
IEEE         295
Manual        90
Nature       687
PLOS         482
Springer     576
arXiv       1074
dtype: int64

In [8]:
# Drop records dated before 1950 or after 2018 (rows with a missing date are kept).
df = df[~(df["date"] < 1950)]
df = df[~(df["date"] > 2018)]


In [9]:
df = df.replace(to_replace=2021, value=2015)


In [10]:
df.to_json("../data/pd_November_2018.json")


Cleaning authors' names 
----------------------------

The issue with names is that a person's name can be written in several ways. This could not have been avoided during data collection, because journals and the authors themselves write names differently.

> *ex. Nikoleta Evdokia Glynatsi, Nikoleta E Glynatsi, N E Glynatsi, N Glynatsi.*

Few efficient ways of addressing this problem were found. After reviewing several string comparison methods, the Levenshtein distance was chosen as the measure. It is a string metric that counts the minimum number of single-character edits (insertions, deletions or substitutions) needed to turn one sequence into the other ([Wikipedia](https://en.wikipedia.org/wiki/Levenshtein_distance)).

To compute the distance in Python, the open source library [fuzzywuzzy](https://github.com/seatgeek/fuzzywuzzy) is used.
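
To make the measure concrete, here is a minimal pure-Python sketch of the Levenshtein distance, applied to the name variants from the example above. The function `levenshtein` is written for illustration only and is an assumption of this sketch: it is not how fuzzywuzzy is implemented, and fuzzywuzzy reports a similarity score out of 100 rather than a raw edit count.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    # previous[j] is the distance between the prefix of `a` handled so far
    # and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute, or keep a match
                )
            )
        previous = current
    return previous[-1]


# Name variants taken from the example above, lowercased.
names = [
    "nikoleta evdokia glynatsi",
    "nikoleta e glynatsi",
    "n e glynatsi",
    "n glynatsi",
]
# Distance of each variant from the full name.
distances = {name: levenshtein(names[0], name) for name in names}
```

The shorter the variant, the larger its distance from the full name, which is why a raw edit count on its own is a weak signal for abbreviated names and a token-based score is a natural alternative.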

In [11]:
df = pd.read_json("../data/pd_November_2018.json")


In [12]:
# First, lowercase every letter in the author strings.
df.author = df.author.str.lower()


In [13]:
# from fuzzywuzzy import fuzz
# import itertools

# import tqdm

# temp = df
# pairs = itertools.combinations(temp.author.unique(), 2)

# to_check = []
# for i, j in tqdm.tqdm(pairs):
#     ratio = fuzz.token_set_ratio(i,j)
#     if ratio >=90 and ratio != 100:
#         to_check.append((i, j))
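
The commented-out cell above compares every pair of unique author names and keeps the pairs that score at least 90 but not 100 for manual checking. The sketch below shows the same pairwise idea using only the standard library: `difflib.SequenceMatcher` is swapped in for fuzzywuzzy's `token_set_ratio`, so the scores are on a 0 to 1 scale and will not match fuzzywuzzy's values. The helper `candidate_pairs`, the threshold of 0.9 and the sample names are assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def candidate_pairs(names, threshold=0.9):
    """Return pairs of distinct names whose similarity is at least `threshold`."""
    pairs = []
    # Deduplicate and sort so the output order is reproducible.
    for first, second in combinations(sorted(set(names)), 2):
        ratio = SequenceMatcher(None, first, second).ratio()
        # Skip a perfect score, mirroring the `ratio != 100` check above.
        if threshold <= ratio < 1.0:
            pairs.append((first, second))
    return pairs


# Illustrative sample; "j smith" is a made-up name, not taken from the data.
authors = ["nikoleta e glynatsi", "nikoleta glynatsi", "j smith"]
to_check = candidate_pairs(authors)
```

Comparing all pairs grows quadratically with the number of unique names, so on the full data set this step is slow, which is presumably why the original loop is wrapped in a `tqdm` progress bar.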


Duplicate articles
------------------

In [14]:
table = (
    df.groupby(["title", "unique_key"]).size().reset_index().groupby("title").count()
)
duplicates = table[table["unique_key"] > 1]


In [15]:
duplicates_title = df[df["title"].isin(duplicates.index)]["title"].unique()


In [16]:
duplicates_in_arxiv = df[
    (df["title"].isin(duplicates.index)) & (df["provenance"] == "arXiv")
]["title"].unique()


In [17]:
diff = list(set(duplicates_title) - set(duplicates_in_arxiv))


In [18]:
df_without_arxiv = df[~(df["provenance"] == "arXiv")]


In [19]:
df_without_arxiv = df_without_arxiv.drop_duplicates(subset="title")


In [20]:
df_without_arxiv.to_json("../data/pd_November_2018_without_arxiv.json")


**Drop duplicates.**

In [21]:
articles_to_drop = df[
    (df["title"].isin(duplicates.index)) & (df["provenance"] == "arXiv")
]["unique_key"].unique()


In [22]:
df = df[~df["unique_key"].isin(articles_to_drop)]


In [23]:
len(df["title"].unique()), len(df["unique_key"].unique())


(3089, 3167)

**Export clean json.**

In [24]:
df.to_json("../data/pd_November_2018_clean.json")
