# Data cleaning

This notebook contains the operations used to clean the data we obtained.

Global Biotic Interactions: the cleaned interaction data can be found at this [link](https://drive.google.com/file/d/1YhU8QPXXilg4icozIvttXu2hyfm_0yOC/view?usp=sharing).

## Interactions (Global Biotic Interactions)

These data are provided as a CSV file that lists, as pairs, the interactions between different organisms in the environment. The original file was too large (approximately 14 GB) to handle conveniently, so the first step was to drop the extra columns, keeping only the data on the organisms involved and the type of interaction between them. The `miller` tool was used to extract only the desired columns from the CSV. With the file reduced to the necessary information, the resulting data can be loaded as follows:

In [1]:
import pandas as pd

interactions_df = pd.read_csv('../../interactions.csv')
interactions_df

Unnamed: 0,sourceTaxonSpeciesName,sourceTaxonKingdomName,interactionTypeName,targetTaxonSpeciesName,targetTaxonKingdomName
0,Andrena milwaukeensis,Animalia,visitsFlowersOf,Zizia aurea,Plantae
1,Andrena mandibularis,Animalia,visitsFlowersOf,Zanthoxylum americanum,Plantae
2,Andrena edwardsi,Animalia,visitsFlowersOf,Wyethia mollis,Plantae
3,Andrena mandibularis,Animalia,visitsFlowersOf,Viburnum dentatum,Plantae
4,Andrena milwaukeensis,Animalia,visitsFlowersOf,Viburnum lentago,Plantae
...,...,...,...,...,...
7822665,Calyptra orthograpta,Animalia,eats,Cervus unicolor,Animalia
7822666,Calyptra orthograpta,Animalia,eats,Elephas maximus,Animalia
7822667,Calyptra parva,Animalia,eats,,Animalia
7822668,Calyptra pseudobicolor,Animalia,eats,Homo sapiens,Animalia
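
The column-reduction step described above was done with `miller`, and the exact command is not recorded in this notebook. As a rough illustration of the same idea, the sketch below streams a CSV row by row with Python's standard `csv` module and keeps only the five columns shown in the table. The function name `extract_columns`, the `extraColumn` field, and the in-memory sample are assumptions made for the example. They are not part of the original workflow.

```python
import csv
import io

# Columns kept in the reduced file, as listed in the table above.
KEEP = [
    "sourceTaxonSpeciesName",
    "sourceTaxonKingdomName",
    "interactionTypeName",
    "targetTaxonSpeciesName",
    "targetTaxonKingdomName",
]


def extract_columns(src, dst, columns=KEEP):
    """Copy rows from src to dst, keeping only the named columns."""
    reader = csv.DictReader(src)
    # extrasaction="ignore" silently drops every field not in `columns`.
    writer = csv.DictWriter(dst, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    for row in reader:  # one row at a time, so memory use stays flat
        writer.writerow(row)


# Tiny in-memory stand-in for the raw file; "extraColumn" is hypothetical.
raw = io.StringIO(
    "sourceTaxonSpeciesName,sourceTaxonKingdomName,interactionTypeName,"
    "targetTaxonSpeciesName,targetTaxonKingdomName,extraColumn\n"
    "Andrena milwaukeensis,Animalia,visitsFlowersOf,Zizia aurea,Plantae,x\n"
)
reduced = io.StringIO()
extract_columns(raw, reduced)
print(reduced.getvalue())
```

With real files, the same function would be called on two file objects opened with `newline=""`, as the `csv` module documentation recommends. Streaming this way avoids loading the full 14 GB file into memory.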


However, these data still contain rows with missing values, which are of little use here. To clean the table, we removed every row with a missing value in the columns that hold the scientific names of the interacting species or the type of interaction between them.

In [3]:
# Keep only rows where both species names and the interaction type are present
interactions_df = interactions_df.dropna(subset=['sourceTaxonSpeciesName', 'interactionTypeName', 'targetTaxonSpeciesName'])
interactions_df

Unnamed: 0,sourceTaxonSpeciesName,sourceTaxonKingdomName,interactionTypeName,targetTaxonSpeciesName,targetTaxonKingdomName
0,Andrena milwaukeensis,Animalia,visitsFlowersOf,Zizia aurea,Plantae
1,Andrena mandibularis,Animalia,visitsFlowersOf,Zanthoxylum americanum,Plantae
2,Andrena edwardsi,Animalia,visitsFlowersOf,Wyethia mollis,Plantae
3,Andrena mandibularis,Animalia,visitsFlowersOf,Viburnum dentatum,Plantae
4,Andrena milwaukeensis,Animalia,visitsFlowersOf,Viburnum lentago,Plantae
...,...,...,...,...,...
7822664,Calyptra orthograpta,Animalia,eats,Bubalus bubalis,Animalia
7822665,Calyptra orthograpta,Animalia,eats,Cervus unicolor,Animalia
7822666,Calyptra orthograpta,Animalia,eats,Elephas maximus,Animalia
7822668,Calyptra pseudobicolor,Animalia,eats,Homo sapiens,Animalia
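
To see how much a filter like this removes, it can help to count the missing values in each required column before dropping anything. The sketch below does this on a three-row toy frame invented for illustration. The rows are not taken from the dataset, and the names `toy`, `required`, and `rows_dropped` are assumptions made for the example.

```python
import pandas as pd

# Toy data (hypothetical): one complete row and two rows with a missing name.
toy = pd.DataFrame({
    "sourceTaxonSpeciesName": ["Calyptra orthograpta", "Calyptra parva", None],
    "interactionTypeName": ["eats", "eats", "eats"],
    "targetTaxonSpeciesName": ["Elephas maximus", None, "Homo sapiens"],
})

# Same columns the notebook passes to dropna(subset=...).
required = [
    "sourceTaxonSpeciesName",
    "interactionTypeName",
    "targetTaxonSpeciesName",
]

# Number of missing values in each required column.
missing_per_column = toy[required].isna().sum()

# A row is dropped if any of the required columns is missing.
cleaned = toy.dropna(subset=required)
rows_dropped = len(toy) - len(cleaned)

print(missing_per_column)
print(f"rows dropped: {rows_dropped}")
```

Running the same two lines (`isna().sum()` and the length difference) on `interactions_df` before the `dropna` call would show which column accounts for most of the removed rows.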


In [4]:
interactions_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3835705 entries, 0 to 7822669
Data columns (total 5 columns):
 #   Column                  Dtype 
---  ------                  ----- 
 0   sourceTaxonSpeciesName  object
 1   sourceTaxonKingdomName  object
 2   interactionTypeName     object
 3   targetTaxonSpeciesName  object
 4   targetTaxonKingdomName  object
dtypes: object(5)
memory usage: 304.6+ MB


Finally, the cleaned data are saved to disk again.

In [5]:
interactions_df.to_csv('../../clean-interactions.csv', index=False)

## Red list (IUCN)



In [8]:
redlist_df = pd.read_csv('../../redlist_species_data_83aad1b1-09d6-4283-b629-0ccacb10797b/assessments.csv')
redlist_df

Unnamed: 0,assessmentId,internalTaxonId,scientificName,redlistCategory,redlistCriteria,yearPublished,assessmentDate,criteriaVersion,language,rationale,...,populationTrend,range,useTrade,systems,conservationActions,realm,yearLastSeen,possiblyExtinct,possiblyExtinctInTheWild,scopes
0,495630,10030,Hexanchus griseus,Near Threatened,A2bd,2020,2019-11-21 00:00:00 UTC,3.1,English,<p>The&#160;Bluntnose Sixgill Shark (<em>Hexan...,...,Decreasing,"The Bluntnose Sixgill Shark has a widespread, ...","<p>The species is utilized for its meat, liver...",Marine,"<p>Since 2010, the European Union Fisheries Co...",,,False,False,Global
1,495907,10041,Heosemys annandalii,Critically Endangered,A2cd+4cd,2021,2018-03-13 00:00:00 UTC,3.1,English,<p><em>Heosemys annandalii</em> is considered ...,...,Decreasing,<p>The range of <em>Heosemys annandalii</em> i...,The species is collected for local consumption...,Terrestrial|Freshwater (=Inland waters),<p><em>Heosemys annandalii </em>is included in...,Indomalayan,,False,False,Global
2,497499,132523146,Hubbsina turneri,Critically Endangered,"B1ab(i,ii,iii,iv)+2ab(i,ii,iii,iv)",2019,2018-04-17 00:00:00 UTC,3.1,English,The Highland Splitfin is now only known to be ...,...,Decreasing,The Highland Splitfin is a freshwater fish spe...,The Highland Splitfin is not a target species ...,Freshwater (=Inland waters),No conservation actions targeting&#160;<em>Hub...,Neotropical,,False,False,Global
3,497550,10267,Hungerfordia pelewensis,Endangered,"B1ab(ii,iii)+2ab(ii,iii)",2012,2011-08-22 00:00:00 UTC,3.1,English,"<p><span lang=""EN-US"">In recent surveys, the s...",...,Unknown,"<p><span lang=""EN-US"">This is a land snail end...",This species is not utilized.,Terrestrial,<p> </p><p> </p><p>Field work to define the ...,Oceanian,,False,False,Global
4,498476,10769,Ictalurus mexicanus,Vulnerable,D2,2019,2018-12-06 00:00:00 UTC,3.1,English,<em>I. mexicanus </em>is herein categorized as...,...,Unknown,<p><em>Ictalurus mexicanus</em> is a species e...,This species is not utilised or traded.,Freshwater (=Inland waters),"<p>In Mexico,&#160;<em>Ictalurus mexicanus</em...",Neotropical,,False,False,Global
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
50245,197564498,64563550,Filicium thouarsianum,Near Threatened,"B2ab(i,ii,iii)",2020,2020-03-28 00:00:00 UTC,3.1,English,<em>Filicium</em> <em>thouarsianum </em>is<em>...,...,Decreasing,<em>Filicium</em> <em>thouarsianum</em>&#160;i...,There is no reported use information for the s...,Terrestrial,"The species occurs in <span class=""ItemText"">A...",Afrotropical,,False,False,Global
50246,197565732,46486,Melanophylla angustior,Endangered,"A3c; B2ab(ii,iii,v)",2020,2020-06-17 00:00:00 UTC,3.1,English,"<em>Melanophylla</em> <em>angustior,</em> smal...",...,Decreasing,<em>Melanophylla</em> <em>angustior</em> endem...,There is no reported use information of this s...,Terrestrial,There are two subpopulations known for the spe...,Afrotropical,,False,False,Global
50247,197569838,46489,Melanophylla madagascariensis,Endangered,B1ab(iii)+2ab(iii),2020,2020-03-26 00:00:00 UTC,3.1,English,<em><em><em><em>Melanophylla</em> <em>madagasc...,...,Decreasing,<em>Melanophylla</em> <em>madagascariensis </e...,There is no reported use information for this ...,Terrestrial,One known subpopulation is recorded within Bet...,Afrotropical,,False,False,Global
50248,197570616,46490,Melanophylla modestei,Endangered,"A3c; B2ab(ii,iii,iv,v)",2020,2020-03-25 00:00:00 UTC,3.1,English,<em>Melanophylla</em> <em>modestei </em>is a t...,...,Decreasing,<em>Melanophylla</em> <em>modestei </em>is end...,There is no reported use information for this ...,Terrestrial,The species is known from Makira and Masoala p...,Afrotropical,,False,False,Global
