# Processo de limpeza dos datasets de Pleiades

### Sobre o dataset escolhido
<p>Pleiades é uma comunidade responsável por manter um "dicionário geográfico" de lugares antigos, como templos, santuários, fortes, cavernas, lagos, entre outras estruturas consideradas portadoras de valor histórico, com o objetivo de fornecer informações para pesquisas históricas, computacionais, para ferramentas de visualização e usuários curiosos sobre o assunto. </p>

<a href="https://pleiades.stoa.org/home" >Site Pleiades</a>

<p> O site construído sobre <b>o resultado final deste processo de limpeza</b> tem como público alvo curiosos sobre estruturas anciãs catalogadas pelo Pleiades e como objetivo apresentar uma exploração de seu conteúdo visando descobrir novas curiosidades sobre o que já foi encontrado.</p>

### Formatação dos dados
<p>As informações desse dicionário estão disponíveis em json e csv. Por estar com a organização mais intuitiva, os dados csv, do dia <b>08/06/2020 07:09</b>, foram selecionados. Eles estão divididos em 3 arquivos diferentes: </p>
<ul>
<li>Locations: as localizações de cada estrutura.</li>
<li>Places: as conexões entre as estruturas.</li>
<li>Names: nomes das estruturas, sua traduções literais e lingua de origem.</li>
</ul>
<p>O próprio site fornece uma explicação breve dos atributos de cada arquivo e até faz sugestão de como unificá-los. A nota de <b>06/09/2019 13:46</b> foi usada como referência para organizar e filtrar os atributos considerados relevantes para o trabalho de visualização de dados: </p>

<a href="http://atlantides.org/downloads/pleiades/dumps/README.txt" >Explicação dos atributos</a>

### Objetivo da limpeza
<p>O processo de exploração do conteúdo dos datasets, a junção entre os arquivos e a filtragem dos atributos tem como principal objetivo preparar os dados para facilitar seu uso no trabalho final da disciplina de Visualização de Dados.</p>

### PROCESSO DE LIMPEZA E GERAÇÃO DO DATASET PRINCIPAL:

#### 1) Carregando as bibliotecas e os datasets

In [1]:
import pandas as pd
import re
import numpy as np

In [2]:
places = pd.read_csv('pleiades-places.csv')

In [3]:
names = pd.read_csv('pleiades-names.csv')

In [4]:
locations = pd.read_csv('pleiades-locations.csv')

#### 2) Observando seus formatos

In [5]:
places.shape

(37156, 26)

In [6]:
names.shape

(32972, 26)

In [7]:
locations.shape

(40003, 25)

#### 3) Observando os atributos
<p><b>Atributos exclusivos encontrados em cada arquivo:</b></p>
<ul>
<li><b>Locations:</b> geometry (equivalente ao extent em places e names), featuresTypes e pid.</li>
<li><b>Places:</b> connectsWith (outros locais aos quais um se conecta), hasConnectionWith (outros locais conecatos a um).</li>
<li><b>Names:</b> nameAttested (pronúncia do nome), nameLanguage (lingua original do nome), nameTransliterated (tradução literal do nome).</li>
</ul>
<p><b>OBS: </b>Neste ponto algumas inconsistências foram encontradas entre a descrição auxiliar fornecida pelo Pleiades e o conteúdo dos csvs. Algumas colunas dadas como adicionais em uma rquivo estavam presentes nos outros dois (como featureTypes "exclusivo" de locations também presente em places e a equivalência de geometry à extent) e alguns atributos não foram encontrados (como avgRating e numRatings ausentes em names).</p>

In [8]:
places.loc[0]

authors                                          Becker, J., T. Elliott
bbox                       13.4119837, 42.082885, 13.4119837, 42.082885
connectsWith                                                     413005
created                                            2016-11-04T16:36:09Z
creators                                               jbecker, thomase
currentVersion                                                        1
description           The post-Roman settlement at Alba Fucens becam...
extent                {"type": "Point", "coordinates": [13.4119837, ...
featureTypes                                                 settlement
geoContext                                                          NaN
hasConnectionsWith                                                  NaN
id                                                             48210385
locationPrecision                                               precise
maxDate                                                         

In [9]:
locations.loc[0]

authors                                                     Becker, J.
bbox                      13.4119837, 42.082885, 13.4119837, 42.082885
created                                           2013-07-15T17:16:48Z
creators                                                       jbecker
currentVersion                                                       4
description          The post-Roman settlement at Alba Fucens. Loca...
featureType                                                settlement,
geometry             {"type": "Point", "coordinates": [13.4119837, ...
id                                        location-of-borgo-medioevale
locationPrecision                                              precise
locationType                                           representative,
maxDate                                                           1453
minDate                                                            640
modified                                          2016-11-04T23:33:08Z
path  

In [10]:
names.loc[0]

authors               Spann, P., R. Warner, R. Talbert, T. Elliott, ...
bbox                         -3.606772, 39.460299, -3.606772, 39.460299
created                                            2010-09-24T19:02:22Z
creators                                                     P.O. Spann
currentVersion                                                      NaN
description                                                         NaN
extent                {"type": "Point", "coordinates": [-3.606772, 3...
id                                                            consabura
locationPrecision                                               precise
maxDate                                                             640
minDate                                                            -330
modified                                           2011-09-05T20:57:22Z
nameAttested                                                        NaN
nameLanguage                                                    

#### 4) Comparando uma mesma tupla
<p>Após a escolha do id de um local, seus dados foram buscados e comparados nos 3 arquivos. Foi descoberto que no names um mesmo id pode ter mais de uma tupla, ou seja, um mesmo local pode ter mais de um nome, algo a ser levado em consideração e cuidado no momentode junção dos arquivos.</p>

In [11]:
places.loc[places['id'] == 265876]

Unnamed: 0,authors,bbox,connectsWith,created,creators,currentVersion,description,extent,featureTypes,geoContext,...,path,reprLat,reprLatLong,reprLong,tags,timePeriods,timePeriodsKeys,timePeriodsRange,title,uid
2,"Spann, P., DARMC, R. Talbert, S. Gillies, R. W...","-3.606772, 39.460299, -3.606772, 39.460299",,2010-09-24T19:02:22Z,P.O. Spann,14.0,"An ancient settlement, likely of Celtic origin...","{""type"": ""Point"", ""coordinates"": [-3.606772, 3...",settlement,Consuegra,...,/places/265876,39.460299,"39.460299,-3.606772",-3.606772,"dare:ancient=1, dare:major=1, dare:feature=maj...",HRL,"hellenistic-republican,roman,late-antique","-330.0,640.0",Consabura/Consabrum,3fb26862377912da0f866fc310bcaf0c


In [12]:
locations.loc[locations['pid'] == '/places/265876']

Unnamed: 0,authors,bbox,created,creators,currentVersion,description,featureType,geometry,id,locationPrecision,...,pid,reprLat,reprLatLong,reprLong,tags,timePeriods,timePeriodsKeys,timePeriodsRange,title,uid
2,"Spann, P., D. R. Talbert, T. Elliott, S. Gillies","-3.606772, 39.460299, -3.606772, 39.460299",2011-03-10T00:05:52Z,P.O. Spann,1.0,1M scale point location,unknown,"{""type"": ""Point"", ""coordinates"": [-3.606772, 3...",darmc-location-20192,precise,...,/places/265876,39.460299,"39.460299,-3.606772",-3.606772,,HRL,"hellenistic-republican,roman,late-antique","-330.0,640.0",DARMC location 20192,62e35ae520c733fd1b3d538cfc93024d


In [13]:
names.loc[names['pid'] == '/places/265876']

Unnamed: 0,authors,bbox,created,creators,currentVersion,description,extent,id,locationPrecision,maxDate,...,pid,reprLat,reprLatLong,reprLong,tags,timePeriods,timePeriodsKeys,timePeriodsRange,title,uid
0,"Spann, P., R. Warner, R. Talbert, T. Elliott, ...","-3.606772, 39.460299, -3.606772, 39.460299",2010-09-24T19:02:22Z,P.O. Spann,,,"{""type"": ""Point"", ""coordinates"": [-3.606772, 3...",consabura,precise,640.0,...,/places/265876,39.460299,"39.460299,-3.606772",-3.606772,,HRL,"hellenistic-republican,roman,late-antique","-330.0,640.0",Consabura,fc0ee157ff11ce6d2e72cd7c5df06fee
1,"Spann, P., R. Warner, R. Talbert, T. Elliott, ...","-3.606772, 39.460299, -3.606772, 39.460299",2010-09-24T19:02:22Z,P.O. Spann,,,"{""type"": ""Point"", ""coordinates"": [-3.606772, 3...",consabrum,precise,640.0,...,/places/265876,39.460299,"39.460299,-3.606772",-3.606772,,HRL,"hellenistic-republican,roman,late-antique","-330.0,640.0",Consabrum,e2b8756302fb62ff0e301710c265a3e6
2,"Spann, P., R. Warner, R. Talbert, T. Elliott, ...","-3.606772, 39.460299, -3.606772, 39.460299",2010-09-24T19:02:22Z,P.O. Spann,,,"{""type"": ""Point"", ""coordinates"": [-3.606772, 3...",kondabora,precise,640.0,...,/places/265876,39.460299,"39.460299,-3.606772",-3.606772,,HRL,"hellenistic-republican,roman,late-antique","-330.0,640.0",Kondabora,f742b67b8343a6c68f222fbf607dcbf9
3,"Becker, J.","-4.0, 39.0, -3.0, 40.0",2016-02-08T23:21:01Z,jbecker,0.0,An ethnic name used by Pliny the Elder.,"{""type"": ""Polygon"", ""coordinates"": [[[-4.0, 39...",consaburrenses,precise,300.0,...,/places/265876,39.460299,"39.460299,-3.606772",-3.606772,,R,roman,"-30.0,300.0",Consaburrenses,ddc6939eba8aaf3729e6c254dedc90e4


#### 5) Observando duplicações
<p>Por não apresentar repetições, <b>places</b> foi escolhido como arquivo principal. Foi gerada uma lista com os ids que aparecem mais de uma vez para cada arquivo que apresenta duplicações.</p>

In [14]:
def getDups(dataset, c):
    num = 0
    if (c == 0):
        dd = dataset.pivot_table(index=['pid'], aggfunc='size')
    else:
        dd = dataset.pivot_table(index=['id'], aggfunc='size')
    l = []
    for x in range(len(dd)):
        if(dd.iloc[x] > 1):
            num = num + 1
            l.append(dd.index[x])
    print(str(num) + '/' + str(len(dataset)))
    return l

In [15]:
dubP = getDups(places,1)

0/37156


In [16]:
dubL = getDups(locations, 0)

2987/40003


In [17]:
dubN = getDups(names, 0)

5200/32972


#### 6) Selecionando atributos e removendo os desnecessários
<p>Os seguintes atributos de cada arquivo foram considerados relevantes para a visualização do trabalho:</p>
<ul>
<li><b>Places:</b> id, title, maxDate, minDate, description, timePeriodsKeys, tags, connectsWith, hasConnectionsWith, reprLat, reprLong, bbox, extent.</li>
<li><b>Locations:</b> featureType.</li>
<li><b>Names:</b> nameAttested, nameLanguage, nameTransliterated.</li>
</ul>
<p><b>OBS: </b>Atributos não listados foram descatados por não serem relevantes (como autor, data de alteração da tupla, etc) ou por serem redundantes (como reprLatLong ser a junção de reprLat com reprLong).</p>

#### 7) Pre-filtragem de atributos
<p>Os atributos definitivamente desnecessários foram removidos. Alguns ainda foram mantidos para que posteriormente seja feito um merge entre os arquivos para se obter o maior volume de dados possível.</p>

In [18]:
places = places.filter(items=['id','title','maxDate','minDate','description','timePeriodsKeys','tags',
                              'connectsWith','hasConnectionsWith','reprLat','reprLong','bbox','extent'])

In [19]:
places.loc[0]

id                                                             48210385
title                                                   Borgo Medievale
maxDate                                                            1453
minDate                                                             640
description           The post-Roman settlement at Alba Fucens becam...
timePeriodsKeys                                     mediaeval-byzantine
tags                                                                NaN
connectsWith                                                     413005
hasConnectionsWith                                                  NaN
reprLat                                                         42.0829
reprLong                                                         13.412
bbox                       13.4119837, 42.082885, 13.4119837, 42.082885
extent                {"type": "Point", "coordinates": [13.4119837, ...
Name: 0, dtype: object

In [20]:
locations = locations.filter(items=['pid','featureType','maxDate','minDate','description',
                                    'timePeriodsKeys','tags','reprLat','reprLong','bbox','geometry'])

In [21]:
locations.loc[0]

pid                                                 /places/48210385
featureType                                              settlement,
maxDate                                                         1453
minDate                                                          640
description        The post-Roman settlement at Alba Fucens. Loca...
timePeriodsKeys                                  mediaeval-byzantine
tags                                                  extant remains
reprLat                                                      42.0829
reprLong                                                      13.412
bbox                    13.4119837, 42.082885, 13.4119837, 42.082885
geometry           {"type": "Point", "coordinates": [13.4119837, ...
Name: 0, dtype: object

In [22]:
names = names.filter(items=['pid','nameAttested','nameLanguage','nameTransliterated'])

In [23]:
names.loc[0]

pid                   /places/265876
nameAttested                     NaN
nameLanguage                     NaN
nameTransliterated         Consabura
Name: 0, dtype: object

#### 8) Adaptando alguns tipos
<p>Alguns atributos, como tags e featureTypes, foram re-organizados para facilitar sua leitura na visualização. Além disso, algumas tags e ids estavam repetidas dentro de uma mesma célula de um mesmo atributo, estas repetições também foram descartadas, e alguns dados com espaços e caracteres desnecessários, ou que poderiam atrapalhar na visualização posteriormente, foram limpos com uso de regular expressions.</p>

In [24]:
def fixingDataTypes(dataset, c):
    for i in range(len(dataset.index)):

        if(c == 0):
            idd = int(re.search(r'\d+', str(dataset.at[i, 'id'])).group(0)) #Remove espaços estranho antes dos ids
            dataset.at[i, 'id'] = idd
            
            ################################ FIXING CONNECTIONS
        
            cw = str(dataset.at[i, 'connectsWith'])
            if(cw == "nan"):
                dataset.at[i, 'connectsWith'] = []
            else:
                cw = cw.split(",")
                if(cw[-1] == ''):
                    cw = cw[:-1]
                for w in range(len(cw)):
                    cw[w] = re.search(r'\d+', str(cw)).group(0)
                cw = list(set([int(x) for x in cw]))
                dataset.at[i, 'connectsWith'] = cw

            ################################ FIXING CONNECTED

            hc = str(dataset.at[i, 'hasConnectionsWith'])
            if(hc == "nan"):
                dataset.at[i, 'hasConnectionsWith'] = []
            else:
                hc = hc.split(",")
                if(hc[-1] == ''):
                    hc = hc[:-1]
                for w in range(len(hc)):
                    hc[w] = re.search(r'\d+', str(hc)).group(0)
                hc = list(set([int(x) for x in hc]))
                dataset.at[i, 'hasConnectionsWith'] = hc
        
        elif(c == 1):
            idd = int(re.search(r'\d+', str(dataset.at[i, 'pid'])).group(0))
            dataset.at[i, 'pid'] = idd
            
            ################################ FIXING FEATURETYPE
            
            feature = str(dataset.at[i, 'featureType'])
            if(feature == "nan"):
                dataset.at[i, 'featureType'] = ['unknown']
            else:
                feature = feature.split(",")
                if(feature[-1] == ''):
                    feature = feature[:-1]
                for w in range(len(feature)):
                    feature[w] = re.sub(r'^\s{1,}', '', feature[w])
                feature = list(set(feature))
                dataset.at[i, 'featureType'] = feature
        
        else:
            idd = int(re.search(r'\d+', str(dataset.at[i, 'pid'])).group(0))
            dataset.at[i, 'pid'] = idd
            
            ################################ FIXING NAMEATTES
        
            na = str(dataset.at[i, 'nameAttested'])
            if(na == "nan"):
                dataset.at[i, 'nameAttested'] = ['unknown']
            else:
                na = na.split(",")
                if(na[-1] == ''):
                    na = na[:-1]
                for w in range(len(na)):
                    na[w] = re.sub(r'^\s{1,}', '', na[w])
                na = list(set(na))
                dataset.at[i, 'nameAttested'] = na

            ################################ FIXING LANGUAGE

            nl = str(dataset.at[i, 'nameLanguage'])
            if(nl == "nan"):
                dataset.at[i, 'nameLanguage'] = ['unknown']
            else:
                nl = nl.split(",")
                if(nl[-1] == ''):
                    nl = nl[:-1]
                for w in range(len(nl)):
                    nl[w] = re.sub(r'^\s{1,}', '', nl[w])
                nl = list(set(nl))
                dataset.at[i, 'nameLanguage'] = nl

            ################################ FIXING TRANSLITERATION

            nt = str(dataset.at[i, 'nameTransliterated'])
            if(nt == "nan"):
                dataset.at[i, 'nameTransliterated'] = ['unknown']
            else:
                nt = nt.split(",")
                if(nt[-1] == ''):
                    nt = nt[:-1]
                for w in range(len(nt)):
                    nt[w] = re.sub(r'^\s{1,}', '', nt[w])
                nt = list(set(nt))
                dataset.at[i, 'nameTransliterated'] = nt
            
        
        if(c == 0 or c == 1):
        
            ################################ FIXING MAXDATE

            maxd = str(dataset.at[i, 'maxDate'])
            if(maxd != "nan"):
                dataset.at[i, 'maxDate'] = int(float(maxd))              

            ################################ FIXING MINDATE

            mind = str(dataset.at[i, 'minDate'])
            if(mind != "nan"):
                dataset.at[i, 'minDate'] = int(float(mind))         

            ################################ FIXING TIMEPERIODS

            pkeys = str(dataset.at[i, 'timePeriodsKeys'])  
            if(pkeys == "nan"):
                dataset.at[i, 'timePeriodsKeys'] = ['unknown']
            else:
                pkeys = pkeys.split(",")
                if(pkeys[-1] == ''):
                    pkeys = pkeys[:-1]
                for w in range(len(pkeys)):
                    pkeys[w] = re.sub(r'^\s{1,}', '', pkeys[w])
                pkeys = list(set(pkeys))
                dataset.at[i, 'timePeriodsKeys'] = pkeys
            
            ################################ FIXING TAGS
            
            tag = str(dataset.at[i, 'tags'])
            if(tag == "nan"):
                dataset.at[i, 'tags'] = []
            else:
                tag = tag.split(",")
                if(tag[-1] == ''):
                    tag = tag[:-1]
                for w in range(len(tag)):
                    tag[w] = re.sub(r'^\s{1,}', '', tag[w])
                tag = list(set(tag))
                dataset.at[i, 'tags'] = tag
            
            ################################ FIXING BBOX
            
            box = str(dataset.at[i, 'bbox'])
            if(box == "nan"):
                dataset.at[i, 'bbox'] = []
            else:
                box = box.split(",")
                if(box[-1] == ''):
                    box = box[:-1]
                for w in range(len(box)):
                    box[w] = re.sub(r'^\s{1,}', '', box[w])
                box = [float(x) for x in box]
                dataset.at[i, 'bbox'] = box
                

In [25]:
fixingDataTypes(places, 0)

In [26]:
places.loc[0]

id                                                             48210385
title                                                   Borgo Medievale
maxDate                                                            1453
minDate                                                             640
description           The post-Roman settlement at Alba Fucens becam...
timePeriodsKeys                                   [mediaeval-byzantine]
tags                                                                 []
connectsWith                                                   [413005]
hasConnectionsWith                                                   []
reprLat                                                         42.0829
reprLong                                                         13.412
bbox                     [13.4119837, 42.082885, 13.4119837, 42.082885]
extent                {"type": "Point", "coordinates": [13.4119837, ...
Name: 0, dtype: object

In [27]:
fixingDataTypes(locations, 1)

In [28]:
locations.loc[0]

pid                                                         48210385
featureType                                             [settlement]
maxDate                                                         1453
minDate                                                          640
description        The post-Roman settlement at Alba Fucens. Loca...
timePeriodsKeys                                [mediaeval-byzantine]
tags                                                [extant remains]
reprLat                                                      42.0829
reprLong                                                      13.412
bbox                  [13.4119837, 42.082885, 13.4119837, 42.082885]
geometry           {"type": "Point", "coordinates": [13.4119837, ...
Name: 0, dtype: object

In [29]:
fixingDataTypes(names, 2)

In [30]:
names.loc[0]

pid                        265876
nameAttested            [unknown]
nameLanguage            [unknown]
nameTransliterated    [Consabura]
Name: 0, dtype: object

<p><b>Salvando o progresso até agora...</b></p>

In [31]:
places.to_csv('PlacesPV.csv', index=False)

In [32]:
locations.to_csv('LocationsPV.csv', index=False)

In [33]:
names.to_csv('NamesPV.csv', index=False)

#### 9) Filtrando identificadores repetidos em um mesmo arquivo
<p>O objetivo desta etapa é fazer o merge de tuplas repetidas em um mesmo arquivo.</p>

In [34]:
ppv = places.copy()

In [35]:
lpv = locations.copy()

In [36]:
npv = names.copy()

<p><b>Recordando as cópias...</b></p>

In [37]:
dupP = getDups(ppv,1)

0/37156


In [38]:
dupL = getDups(lpv, 0)

2987/40003


In [39]:
dupN = getDups(npv, 0)

5200/32972


<p><b>Filtrando names...</b></p>

In [40]:
def filteringNames(dataset, l):    
    for i in range(len(l)):
        aux = dataset[dataset['pid'] == l[i]].iloc[0] ## primeira tupla que aparecer com certo id
        dat = dataset.loc[dataset['pid'] == l[i]] ## dataframe de tuplas com mesmo id
        
        for t in range(len(dat)):
            
            ################################ FIXING NAMEATTES
            
            if(aux['nameAttested'][0] == 'unknown' and dat.iloc[t]['nameAttested'][0] != 'unknown'):
                aux['nameAttested'] = dat.iloc[t]['nameAttested']
                
            elif(aux['nameAttested'][0] != 'unknown' and dat.iloc[t]['nameAttested'][0] != 'unknown'):
                aux['nameAttested'] = list(set(aux['nameAttested'] + dat.iloc[t]['nameAttested']))
            
            ################################ FIXING LANGUAGE
            
            if(aux['nameLanguage'][0] == 'unknown' and dat.iloc[t]['nameLanguage'][0] != 'unknown'):
                aux['nameLanguage'] = dat.iloc[t]['nameLanguage']
                
            elif(aux['nameLanguage'][0] != 'unknown' and dat.iloc[t]['nameLanguage'][0] != 'unknown'):
                aux['nameLanguage'] = list(set(aux['nameLanguage'] + dat.iloc[t]['nameLanguage']))
                
            ################################ FIXING TRANSLITERATION
            
            if(aux['nameTransliterated'][0] == 'unknown' and dat.iloc[t]['nameTransliterated'][0] != 'unknown'):
                aux['nameTransliterated'] = dat.iloc[t]['nameTransliterated']
                
            elif(aux['nameTransliterated'][0] != 'unknown' and dat.iloc[t]['nameTransliterated'][0] != 'unknown'):
                aux['nameTransliterated'] = list(set(aux['nameTransliterated'] + dat.iloc[t]['nameTransliterated']))

        dataset = dataset.drop(dataset[dataset['pid'] == l[i]].index)
        dataset.append(aux.to_frame())
        
    print('DONE')
    
    return dataset

In [41]:
nf = filteringNames(npv,dupN)

DONE


In [42]:
getDups(nf,0)

0/17716


[]

In [43]:
def filteringLocations(dataset, l):    
    for i in range(len(l)):
        aux = dataset[dataset['pid'] == l[i]].iloc[0] ## primeira tupla que aparecer com certo id
        dat = dataset.loc[dataset['pid'] == l[i]] ## dataframe de tuplas com mesmo id
        
        for t in range(len(dat)):
            
            ################################ FIXING LATLONG
            
            if(np.isnan(aux['reprLat']) and np.isnan(aux['reprLong']) and 
               not(np.isnan(dat.iloc[t]['reprLat'])) and not(np.isnan(dat.iloc[t]['reprLong']))):
                aux['reprLat'] = dat.iloc[t]['reprLat']
                aux['reprLong'] = dat.iloc[t]['reprLong']
            
            ################################ FIXING FEATURETYPE
            
            if(aux['featureType'][0] == 'unknown' and dat.iloc[t]['featureType'][0] != 'unknown'):
                aux['featureType'] = dat.iloc[t]['featureType']
                
            elif(aux['featureType'][0] != 'unknown' and dat.iloc[t]['featureType'][0] != 'unknown'):
                aux['featureType'] = list(set(aux['featureType'] + dat.iloc[t]['featureType']))
            
            ################################ FIXING MAXDATE
            
            if(np.isnan(aux['maxDate']) and not(np.isnan(dat.iloc[t]['maxDate']))):
                aux['maxDate'] = dat.iloc[t]['maxDate']
            
            ################################ FIXING MINDATE
            
            if(np.isnan(aux['minDate']) and not(np.isnan(dat.iloc[t]['minDate']))):
                aux['minDate'] = dat.iloc[t]['minDate']
            
            ################################ FIXING DESCRIPTION
            
            if(aux['description'] == '' and dat.iloc[t]['description'] != ''):
                aux['description'] = dat.iloc[t]['description']
            
            ################################ FIXING TIMEPERIODS
            
            if(aux['timePeriodsKeys'][0] == 'unknown' and dat.iloc[t]['timePeriodsKeys'][0] != 'unknown'):
                aux['timePeriodsKeys'] = dat.iloc[t]['timePeriodsKeys']
                
            elif(aux['timePeriodsKeys'][0] != 'unknown' and dat.iloc[t]['timePeriodsKeys'][0] != 'unknown'):
                aux['timePeriodsKeys'] = list(set(aux['timePeriodsKeys'] + dat.iloc[t]['timePeriodsKeys']))
            
            ################################ FIXING TAGS
            
            if(len(aux['tags']) == 0 and len(dat.iloc[t]['tags']) != 0):
                aux['tags'] = dat.iloc[t]['tags']
                
            elif(len(aux['tags']) != 0 and len(dat.iloc[t]['tags']) != 0):
                aux['tags'] = list(set(aux['tags'] + dat.iloc[t]['tags']))
            
            ################################ FIXING BBOX
            
            if(len(aux['bbox']) == 0 and len(dat.iloc[t]['bbox']) != 0):
                aux['bbox'] = dat.iloc[t]['bbox']
            
            ################################ FIXING GEOMETRY
            
            if(aux['geometry'] == '' and dat.iloc[t]['geometry'] != ''):
                aux['geometry'] = dat.iloc[t]['geometry']
            

        dataset = dataset.drop(dataset[dataset['pid'] == l[i]].index)
        dataset.append(aux.to_frame())
    
    print('DONE')
    
    return dataset

In [44]:
lf = filteringLocations(lpv, dupL)

DONE


In [45]:
getDups(lf,0)

0/33631


[]

<p><b>Mais um save point...</b></p>

In [46]:
nf.to_csv('NamesPV2.csv', index=False)

In [47]:
lf.to_csv('LocationsPV2.csv', index=False)

#### 10) Fazendo o merge dos 3 arquivos
<p>Será feito em partes, primeiro entre places e locations, depois o resultado desse primeiro merge com names.</p>

In [48]:
pm = ppv.copy()

In [49]:
pm.shape

(37156, 13)

In [50]:
lm = lf.copy()

In [51]:
nm = nf.copy()

<p><b>Fazendo o merge entre places e locations...</b></p>

In [52]:
def mergePL(places, locations):
    
    places['featureType'] = ''

    for l in range(len(locations)):
        lax = locations.iloc[l]
        pax = places[places['id'] == lax.iat[0]]
        p = pax.index[0]

        ################################ FIXING MAXDATE

        if(str(pax['maxDate'].iat[0]) == 'nan' and not(np.isnan(lax.iat[2]))):
            places.at[p,'maxDate'] = lax.iat[2]               

        ################################ FIXING MINDATE

        if(str(pax['minDate'].iat[0]) == 'nan' and not(np.isnan(lax.iat[3]))):
            places.at[p,'minDate'] = lax.iat[3]

        ################################ FIXING DESCRIPTION

        if(str(pax['description'].iat[0]) == '' and lax.iat[4] != ''):
            places.at[p,'description'] = lax.iat[4]

        ################################ FIXING TIMEPERIODS

        if(pax['timePeriodsKeys'].iat[0][0] == 'unknown' and lax.iat[5][0] != 'unknown'):
            places.at[p,'timePeriodsKeys'] = lax.iat[5]

        elif(pax['timePeriodsKeys'].iat[0][0] != 'unknown' and lax.iat[5][0] != 'unknown'):
            places.at[p,'timePeriodsKeys'] = list(set(places.at[p,'timePeriodsKeys'] + lax.iat[5]))

        ################################ FIXING TAGS

        if(len(pax['tags'].iat[0]) == 0 and len(lax.iat[6]) != 0):
            places.at[p,'tags'] = lax.iat[6]

        elif(len(pax['tags'].iat[0]) != 0 and len(lax.iat[6]) != 0):
            places.at[p,'tags'] = list(set(places.at[p,'tags'] + lax.iat[6]))

        ################################ FIXING LATLONG

        if((str(pax['reprLat'].iat[0]) == 'nan' and str(pax['reprLong'].iat[0]) == 'nan') and 
           (not(np.isnan(lax.iat[7])) and not(np.isnan(lax.iat[8])))):
            places.at[p,'reprLat'] = lax.iat[7]
            places.at[p,'reprLong'] = lax.iat[8]

        ################################ FIXING BBOX

        if(len(pax['bbox'].iat[0]) == 0 and len(lax.iat[9]) != 0):
            places.at[p,'bbox'] = lax.iat[9]

        ################################ FIXING GEOMETRY

        if((str(pax['extent'].iat[0]) == '' or str(pax['extent'].iat[0]) == 'nan') and
           (lax.iat[10] != '' or str(lax.iat[10]) != 'nan')):
            places.at[p,'extent'] = lax.iat[10]

        ################################ FIXING FEATURETYPE

        if(pax['featureType'].iat[0] == '' and lax.iat[1][0] != 'unknown'):
            places.at[p,'featureType'] = lax.iat[1]

        elif(pax['featureType'].iat[0] == '' and lax.iat[1][0] == 'unknown'):
            places.at[p,'featureType'] = ['unknown']
                
    print('DONE')
    
    return places   

In [53]:
pm2 = mergePL(pm, lm)

DONE


In [54]:
pm3 = pm2.copy()

In [55]:
pm3.shape

(37156, 14)

In [56]:
getDups(pm3,1)

0/37156


[]

<p><b>Fazendo o merge entre places e names...</b></p>

In [57]:
def mergeNP(places, names):
    places['nameAttested'] = ''
    places['nameLanguage'] = ''
    places['nameTransliterated'] = ''
    
    for l in range(len(names)):
        nax = names.iloc[l]
        pax = places[places['id'] == nax.iat[0]]
        p = pax.index[0]
        
        ################################ FIXING NAMEATTES
        
        if(pax['nameAttested'].iat[0] == '' and nax.iat[1][0] != 'unknown'):
            places.at[p,'nameAttested'] = nax.iat[1]

        elif(pax['nameAttested'].iat[0] == '' and nax.iat[1][0] == 'unknown'):
            places.at[p,'nameAttested'] = ['unknown']
        
        ################################ FIXING LANGUAGE
        
        if(pax['nameLanguage'].iat[0] == '' and nax.iat[1][0] != 'unknown'):
            places.at[p,'nameLanguage'] = nax.iat[1]

        elif(pax['nameLanguage'].iat[0] == '' and nax.iat[1][0] == 'unknown'):
            places.at[p,'nameLanguage'] = ['unknown']
        
        ################################ FIXING TRANSLITERATION
        
        if(pax['nameTransliterated'].iat[0] == '' and nax.iat[1][0] != 'unknown'):
            places.at[p,'nameTransliterated'] = nax.iat[1]

        elif(pax['nameTransliterated'].iat[0] == '' and nax.iat[1][0] == 'unknown'):
            places.at[p,'nameTransliterated'] = ['unknown']
        
    print('DONE')
     
    return places

In [58]:
pm4 = mergeNP(pm3, nm)

DONE


In [64]:
pm5 = pm4.copy()

In [65]:
pm5.shape

(37156, 17)

In [66]:
getDups(pm5,1)

0/37156


[]

<p><b>Últimos ajustes</b></p>

In [67]:
def finalAdjst(places):
    for p in range(len(places)):
        pax = places.loc[p]
        
        if(isinstance(pax['featureType'], str)):
            places.at[p,'featureType'] = ['unknown']
        
        if(isinstance(pax['nameAttested'],str)):
            places.at[p,'nameAttested'] = ['unknown']
            
        if(isinstance(pax['nameLanguage'],str)):
            places.at[p,'nameLanguage'] = ['unknown']
            
        if(isinstance(pax['nameTransliterated'],str)):
            places.at[p,'nameTransliterated'] = ['unknown']
    
    print('DONE')
     
    return places    

In [68]:
pm6 = finalAdjst(pm5)

DONE


In [92]:
pf = pm6.copy()

In [93]:
pf.shape

(37156, 17)

In [94]:
getDups(pf,1)

0/37156


[]

In [95]:
pf = pf[pf['reprLat'].notna()]

In [96]:
pf = pf[pf['reprLong'].notna()]

In [97]:
pf.shape

(29767, 17)

In [98]:
pf.to_csv('PLEAIDES.csv', encoding='utf-8', index=False)