In [1]:
import pandas as pd


pd.set_option("display.max_columns", None)

In [2]:
df = pd.read_parquet("../data/dblp-v10.parquet")

In [3]:
df.head()

Unnamed: 0,abstract,authors,n_citation,references,title,venue,year,id
0,"In this paper, a robust 3D triangular mesh wat...","['S. Ben Jabra', 'Ezzeddine Zagrouba']",50,"['09cb2d7d-47d1-4a85-bfe5-faa8221e644b', '10aa...",A new approach of 3D watermarking based on ima...,international symposium on computers and commu...,2008,4ab3735c-80f1-472d-b953-fa0557fed28b
1,We studied an autoassociative neural network w...,"['Joaquín J. Torres', 'Jesús M. Cortés', 'Joaq...",50,"['4017c9d2-9845-4ad2-ad5b-ba65523727c5', 'b118...",Attractor neural networks with activity-depend...,Neurocomputing,2007,4ab39729-af77-46f7-a662-16984fb9c1db
2,It is well-known that Sturmian sequences are t...,"['Genevi eve Paquin', 'Laurent Vuillon']",50,"['1c655ee2-067d-4bc4-b8cc-bc779e9a7f10', '2e4e...",A characterization of balanced episturmian seq...,Electronic Journal of Combinatorics,2007,4ab3a4cf-1d96-4ce5-ab6f-b3e19fc260de
3,One of the fundamental challenges of recognizi...,"['Yaser Sheikh', 'Mumtaz Sheikh', 'Mubarak Shah']",221,"['056116c1-9e7a-4f9b-a918-44eb199e67d6', '05ac...",Exploring the space of a human action,international conference on computer vision,2005,4ab3a98c-3620-47ec-b578-884ecf4a6206
4,This paper generalizes previous optimal upper ...,"['Efraim Laksman', 'Håkan Lennerstad', 'Magnus...",0,"['01a765b8-0cb3-495c-996f-29c36756b435', '5dbc...",Generalized upper bounds on the minimum distan...,Ima Journal of Mathematical Control and Inform...,2015,4ab3b585-82b4-4207-91dd-b6bce7e27c4e


In [4]:
df.shape

(1000000, 8)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 8 columns):
 #   Column      Non-Null Count    Dtype 
---  ------      --------------    ----- 
 0   abstract    827533 non-null   object
 1   authors     999998 non-null   object
 2   n_citation  1000000 non-null  int64 
 3   references  875583 non-null   object
 4   title       1000000 non-null  object
 5   venue       822245 non-null   object
 6   year        1000000 non-null  int64 
 7   id          1000000 non-null  object
dtypes: int64(2), object(6)
memory usage: 61.0+ MB


In [6]:
df.duplicated().sum()

0
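
A count of zero here only rules out rows that are identical across every column. Records that share an `id` or a `title` but differ elsewhere would not be counted. The sketch below shows the difference on a small made-up frame (the `toy` data is hypothetical, not taken from the dataset), using the `subset` parameter of `DataFrame.duplicated`.

```python
import pandas as pd

# Hypothetical toy data standing in for the real frame.
toy = pd.DataFrame(
    {
        "id": ["a", "b", "c", "c"],
        "title": ["T1", "T2", "T3", "T3 (revised)"],
        "year": [2001, 2002, 2003, 2004],
    }
)

# Rows identical in every column.
full_row_dupes = int(toy.duplicated().sum())

# Rows that repeat an earlier id, regardless of the other columns.
id_dupes = int(toy.duplicated(subset=["id"]).sum())

print(full_row_dupes, id_dupes)
```

On the real data, the same check would be `df.duplicated(subset=["id"]).sum()`.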

In [7]:
def print_null_percentage(df: pd.DataFrame) -> None:
    """Print the percentage of null values in each column."""
    for col in df.columns:
        # isnull().mean() is the fraction of null rows in the column
        null_pct = df[col].isnull().mean() * 100
        print(f"{col} has {null_pct:.2f}% null values")


print_null_percentage(df)

abstract has 17.25% null values
authors has 0.00% null values
n_citation has 0.00% null values
references has 12.44% null values
title has 0.00% null values
venue has 17.78% null values
year has 0.00% null values
id has 0.00% null values
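
The same per-column figures can also be computed without an explicit loop. The following is a minimal sketch, assuming a helper named `null_percentage` (a hypothetical name) and a made-up `toy` frame. It returns a `Series`, which is easier to sort or filter than printed text.

```python
import pandas as pd


def null_percentage(frame: pd.DataFrame) -> pd.Series:
    """Return the percentage of null values per column, rounded to 2 decimals."""
    # isna() gives a boolean frame; the column mean is the null fraction.
    return frame.isna().mean().mul(100).round(2)


# Hypothetical toy data: half of the abstracts are missing.
toy = pd.DataFrame(
    {
        "abstract": ["x", None, "y", None],
        "title": ["a", "b", "c", "d"],
    }
)

pct = null_percentage(toy)
print(pct.sort_values(ascending=False))
```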


# Dataset Column Documentation

The table below describes the columns in the dataset, including each column's data type, a brief description, and an example for reference.

| Column Name    | Data Type             | Description                                                          | Example                                                                                                                               |
|----------------|-----------------------|----------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------|
| `id`           | `string`              | Unique identifier of the paper                                       | `013ea675-bb58-42f8-a423-f5534546b2b1`                                                                                                |
| `title`        | `string`              | Title of the paper                                                   | `Prediction of consensus binding mode geometries for related chemical series of positive allosteric modulators of adenosine and muscarinic acetylcholine receptors` |
| `authors`      | `list of strings`     | List of the paper's authors                                          | `["Leon A. Sakkal", "Kyle Z. Rajkowski", "Roger S. Armen"]`                                                                           |
| `venue`        | `string`              | Publication venue (journal, conference)                              | `Journal of Computational Chemistry`                                                                                                  |
| `year`         | `int`                 | Year of publication                                                  | `2017`                                                                                                                                |
| `n_citation`   | `int`                 | Number of citations received                                         | `0`                                                                                                                                   |
| `references`   | `list of strings`     | List of IDs of the cited papers                                      | `["4f4f200c-0764-4fef-9718-b8bccf303dba", "aa699fbf-fabe-40e4-bd68-46eaf333f7b1"]`                                                    |
| `abstract`     | `string`              | Abstract of the paper                                                | `This paper studies ...`                                                                                                              |
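
`df.info()` reports `authors` and `references` as `object`, so it is worth checking whether each cell holds a real list or a stringified one such as `"['A', 'B']"`. The sketch below is one possible way to normalize both cases. It assumes the strings are valid Python literals, and `to_list` and the `toy` series are hypothetical names and data used only for illustration.

```python
import ast

import pandas as pd


def to_list(value):
    """Parse a stringified list; pass real sequences through; map missing values to []."""
    if value is None or (isinstance(value, float) and pd.isna(value)):
        return []
    if isinstance(value, str):
        # literal_eval only evaluates Python literals, never arbitrary code.
        return list(ast.literal_eval(value))
    return list(value)


# Hypothetical toy column mixing the three cases.
toy = pd.Series(
    ["['A. Author', 'B. Author']", None, ["C. Author"]],
    dtype=object,
)

parsed = toy.apply(to_list)
print(parsed.tolist())
```

On the real data this would be applied as `df["authors"].apply(to_list)`, after confirming the actual cell type with something like `type(df["authors"].iloc[0])`.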

### Example JSON Record

```json
{
  "id": "013ea675-bb58-42f8-a423-f5534546b2b1",
  "title": "Prediction of consensus binding mode geometries for related chemical series of positive allosteric modulators of adenosine and muscarinic acetylcholine receptors",
  "authors": [
    "Leon A. Sakkal",
    "Kyle Z. Rajkowski",
    "Roger S. Armen"
  ],
  "venue": "Journal of Computational Chemistry",
  "year": 2017,
  "n_citation": 0,
  "references": [
    "4f4f200c-0764-4fef-9718-b8bccf303dba",
    "aa699fbf-fabe-40e4-bd68-46eaf333f7b1"
  ],
  "abstract": "This paper studies ..."
}
```

# Issues found

- Null values in the columns *abstract* (17.25%), *authors* (2 rows), *references* (12.44%), and *venue* (17.78%).
- The *id*, *references*, and *n_citation* columns may not be useful.
    - Since this is a thematic analysis, perhaps only the columns that carry content information are useful (*title*, *authors*, *abstract*, *year*, *venue*).
    - Columns that represent relationships rather than content, such as *id*, *references*, and *n_citation*, would not be needed, since topic clustering does not require exploring those relationships.
    - In the end, perhaps only *title* and *abstract* are useful. In terms of actual content, they are the most important.
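
The column selection discussed above could be sketched as follows. This is only one possible approach: it assumes rows without an abstract are dropped rather than kept with the title alone, and the `toy` frame, `content_cols`, and the `text` column are hypothetical names introduced for illustration.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the dataset.
toy = pd.DataFrame(
    {
        "id": ["1", "2", "3"],
        "title": ["Graph mining", "Neural nets", "Coding theory"],
        "abstract": ["We mine graphs.", None, "We bound codes."],
        "n_citation": [5, 0, 2],
    }
)

# Keep only the content columns and drop rows with no abstract.
content_cols = ["title", "abstract"]
corpus = toy[content_cols].dropna(subset=["abstract"]).copy()

# Join title and abstract into a single text field for later vectorization.
corpus["text"] = corpus["title"] + ". " + corpus["abstract"]

print(corpus["text"].tolist())
```

Dropping rows with a null *abstract* would discard about 17% of the dataset, so falling back to the title alone for those rows is an alternative worth weighing.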