<a href="https://colab.research.google.com/github/Daniel022de/Bootcamp_SoulCode_EngenhariaDados/blob/main/ETL/atividade_customers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#ETL of MongoDB collections

This notebook covers an activity proposed by instructor **Igor Gondim** in SoulCode's **data engineering** bootcamp: extracting, treating, cleaning, and loading the customers collection, which is part of a public MongoDB dataset (Sample Analytics Dataset).

 **The assignment was:**

* Connect to the MongoDB dataset and extract the collection into Colab;
* Treat, clean, and normalize the data.


! **You can find this notebook in my repository** [GitHub](https://github.com/Daniel022de/Bootcamp_SoulCode_EngenhariaDados)

! **You can contact me by email at** ddololiveira.pessoal@gmail.com **and on** [Linkedin](https://www.linkedin.com/in/daniel-oliveira-503b0323b/).

! **Questions, recommendations, and feedback are all welcome.**


#Installation

In [47]:
pip install pandera

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


#Importing libraries

In [48]:
import pandas as pd
import pandera as pa
import numpy as np
from pymongo import MongoClient

#MongoDB connection

In [49]:
uri = "mongodb+srv://daniel-soulcode.v9iencb.mongodb.net/?authSource=%24external&authMechanism=MONGODB-X509&retryWrites=true&w=majority"
client = MongoClient(uri,tls=True,tlsCertificateKeyFile='/content/drive/MyDrive/Colab Notebooks/X509-cert-1388842986422460516.pem')


#Extraction

In [50]:
cursor = client['sample_analytics']['customers'].find()
df_customers = pd.DataFrame(list(cursor))

#Pre-analysis

***This DataFrame contains information about customers of a financial services application.***

*Column information:*

1. **"username"**: the customer's username;
2. **"name"**: name;
3. **"address"**: address;
4. **"birthdate"**: date of birth;
5. **"email"**: email;
6. **"active"**: activity status column (TRUE | FALSE);
7. **"accounts"**: a list of the accounts held by the customer;
8. **"tier_and_details"**: details about the benefits the customer is entitled to.

***->*** *This information was found* [here](https://www.mongodb.com/docs/atlas/sample-data/sample-analytics/)

In [51]:
df_customers.head(1)

Unnamed: 0,_id,username,name,address,birthdate,email,active,accounts,tier_and_details
0,5ca4bbcea2dd94ee58162a68,fmiller,Elizabeth Ray,"9286 Bethany Glens\nVasqueztown, CO 22939",1977-03-02 02:20:31,arroyocolton@gmail.com,True,"[371138, 324287, 276528, 332179, 422649, 387979]",{'0df078f33aa74a2e9696e0520c1a828a': {'tier': ...


## Data types



1. The '_id' column inherited from MongoDB can be dropped **OK**
2. Column types **OK**



In [52]:
df_customers.dtypes

_id                         object
username                    object
name                        object
address                     object
birthdate           datetime64[ns]
email                       object
active                      object
accounts                    object
tier_and_details            object
dtype: object

## Checking the data column by column

1. The 'active' column is NaN in 499 of its 500 rows (99.8%). **|>**

  **Two options**: drop the whole column or keep it with its null values

  **Choice**: I will keep this column with its null values, because this status can change over time.
  **OK** 

2. Many empty dictionaries in the 'tier_and_details' column **|>**

    **Option**: I will convert these empty dictionaries to NumPy nulls.  **OK**


In [53]:
#Checking the number of non-null values per column

df_customers.count()

_id                 500
username            500
name                500
address             500
birthdate           500
email               500
active                1
accounts            500
tier_and_details    500
dtype: int64

In [54]:
#Checking the number of null values per column

df_customers.isna().sum()

_id                   0
username              0
name                  0
address               0
birthdate             0
email                 0
active              499
accounts              0
tier_and_details      0
dtype: int64
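
The two observations above (a mostly empty 'active' column and empty dictionaries in 'tier_and_details') can be quantified directly. The sketch below is a minimal illustration on a small hypothetical `sample` DataFrame that only mimics the shape of `df_customers`; it is not the real data. Note that `isna()` does not flag empty dictionaries, so they are counted separately.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mimicking the shape of df_customers (not the real data).
sample = pd.DataFrame({
    "email": ["a@example.com", "b@example.com", "c@example.com", "d@example.com"],
    "active": [True, np.nan, np.nan, np.nan],
    "tier_and_details": [{"x": {"tier": "Gold"}}, {}, {}, {"y": {"tier": "Bronze"}}],
})

# Share of missing values per column (between 0.0 and 1.0).
missing_share = sample.isna().mean()

# An empty dict is not NaN, so count those cells explicitly.
empty_dicts = int(sample["tier_and_details"].map(lambda v: v == {}).sum())

print(missing_share)
print("empty dicts:", empty_dicts)
```

Applied to the real DataFrame, the same two expressions would run against `df_customers` instead of `sample`.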

In [55]:
#Grouping the data to look for inconsistencies

df_customers.groupby(['email'],dropna=False).size()

email
aacosta@yahoo.com         1
aadkins@hotmail.com       1
aaron15@yahoo.com         1
aaron99@yahoo.com         1
aarongreer@hotmail.com    1
                         ..
zamoragary@gmail.com      1
zanderson@hotmail.com     1
zhines@yahoo.com          1
zmelton@gmail.com         1
zyoung@gmail.com          1
Length: 499, dtype: int64
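
The grouping above returns 499 distinct emails for 500 rows, which suggests that one address appears more than once. A minimal sketch of how the repeated rows could be listed, using a small hypothetical `sample` DataFrame in place of `df_customers`:

```python
import pandas as pd

# Hypothetical sample: three rows, one email repeated.
sample = pd.DataFrame({
    "username": ["u1", "u2", "u3"],
    "email": ["a@example.com", "b@example.com", "a@example.com"],
})

# keep=False marks every row of a repeated group, not only the later ones.
repeated = sample[sample.duplicated(subset="email", keep=False)]
print(repeated)
```

Whether a repeated email is a real inconsistency or two legitimate customers sharing an address is a judgment call that depends on the business rules.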

In [56]:
#Checking the accounts data
for i in df_customers.accounts:
  print(i)

[371138, 324287, 276528, 332179, 422649, 387979]
[116508]
[462501, 228290, 968786, 515844, 377292]
[170945, 951849]
[721914, 817222, 973067, 260799, 87389]
[904260, 565468]
[627629, 55958, 771641]
[385397, 337979, 325377, 440243, 586395, 86702]
[702610, 240640]
[344885, 839927, 853542]
[987709]
[662207, 816481]
[571880]
[88112, 567199, 436071, 226641]
[883283, 980867, 164836, 200611, 528224, 931483]
[631901, 814687]
[550665, 321695]
[66698, 859246, 183400, 460192]
[205563, 616602, 387877, 460069, 442724]
[700880, 376846, 271554]
[177069, 233104, 671035, 575454, 285919, 947160]
[928230, 120548, 667833, 810947]
[784245, 896066, 991412, 951840]
[836981]
[602560, 986196, 51080, 690617, 225602]
[388578]
[84115]
[607567, 429282]
[130832, 685011, 958231, 924297]
[457709, 852937, 271109, 601671, 343230]
[884849]
[391557, 280758, 90117, 867593, 719343]
[158557]
[87965, 312230, 759079, 986996]
[380253]
[973364, 83355]
[741673, 145588, 956881, 965514, 654939]
[179746]
[181073, 358036, 769877, 163

In [57]:
#Checking the tier_and_details data
dicionario = df_customers.tier_and_details
for k,v in dicionario.items():
  print(f'{k} : {v}')

0 : {'0df078f33aa74a2e9696e0520c1a828a': {'tier': 'Bronze', 'id': '0df078f33aa74a2e9696e0520c1a828a', 'active': True, 'benefits': ['sports tickets']}, '699456451cc24f028d2aa99d7534c219': {'tier': 'Bronze', 'benefits': ['24 hour dedicated line', 'concierge services'], 'active': True, 'id': '699456451cc24f028d2aa99d7534c219'}}
1 : {'c06d340a4bad42c59e3b6665571d2907': {'tier': 'Platinum', 'benefits': ['dedicated account representative'], 'active': True, 'id': 'c06d340a4bad42c59e3b6665571d2907'}, '5d6a79083c26402bbef823a55d2f4208': {'tier': 'Bronze', 'benefits': ['car rental insurance', 'concierge services'], 'active': True, 'id': '5d6a79083c26402bbef823a55d2f4208'}, 'b754ec2d455143bcb0f0d7bd46de6e06': {'tier': 'Gold', 'benefits': ['airline lounge access'], 'active': True, 'id': 'b754ec2d455143bcb0f0d7bd46de6e06'}}
2 : {}
3 : {'a15baf69a759423297f11ce6c7b0bc9a': {'tier': 'Platinum', 'benefits': ['airline lounge access'], 'active': True, 'id': 'a15baf69a759423297f11ce6c7b0bc9a'}}
4 : {}
5 :

#Data treatment | Cleaning

## Dropping the '_id' column

In [58]:
df_customers.drop(['_id'],axis=1,inplace=True)

## Converting null values to NumPy NaN

In [59]:
#Replace literal 'NaN' strings with NumPy's NaN
df_customers.replace(['NaN'],np.nan,inplace=True)

##tier_and_details

* Converting the empty dictionaries in the 'tier_and_details' column to null

In [60]:
#Replace every empty dict in 'tier_and_details' with NaN
for i in range(len(df_customers)):
  if df_customers.loc[i,'tier_and_details'] == {}:
    df_customers.loc[i,'tier_and_details'] = np.nan

In [61]:
#Checking that the change was applied
dicionario = df_customers.tier_and_details
for k,v in dicionario.items():
  print(f'{k} : {v}')

0 : {'0df078f33aa74a2e9696e0520c1a828a': {'tier': 'Bronze', 'id': '0df078f33aa74a2e9696e0520c1a828a', 'active': True, 'benefits': ['sports tickets']}, '699456451cc24f028d2aa99d7534c219': {'tier': 'Bronze', 'benefits': ['24 hour dedicated line', 'concierge services'], 'active': True, 'id': '699456451cc24f028d2aa99d7534c219'}}
1 : {'c06d340a4bad42c59e3b6665571d2907': {'tier': 'Platinum', 'benefits': ['dedicated account representative'], 'active': True, 'id': 'c06d340a4bad42c59e3b6665571d2907'}, '5d6a79083c26402bbef823a55d2f4208': {'tier': 'Bronze', 'benefits': ['car rental insurance', 'concierge services'], 'active': True, 'id': '5d6a79083c26402bbef823a55d2f4208'}, 'b754ec2d455143bcb0f0d7bd46de6e06': {'tier': 'Gold', 'benefits': ['airline lounge access'], 'active': True, 'id': 'b754ec2d455143bcb0f0d7bd46de6e06'}}
2 : nan
3 : {'a15baf69a759423297f11ce6c7b0bc9a': {'tier': 'Platinum', 'benefits': ['airline lounge access'], 'active': True, 'id': 'a15baf69a759423297f11ce6c7b0bc9a'}}
4 : nan
5
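
The row-by-row loop used above works, but the same replacement can also be written as a single mapping over the column. The sketch below is one possible alternative, shown on a hypothetical `sample` DataFrame rather than on `df_customers`:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with two empty dictionaries.
sample = pd.DataFrame({
    "tier_and_details": [{"x": {"tier": "Gold"}}, {}, {"y": {"tier": "Bronze"}}, {}],
})

# Map each cell: an empty dict becomes NaN, anything else is left unchanged.
sample["tier_and_details"] = sample["tier_and_details"].map(
    lambda v: np.nan if isinstance(v, dict) and not v else v
)

print(sample["tier_and_details"].isna().sum())
```

The `isinstance` guard keeps the lambda safe if the column already contains NaN or other non-dict values.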

## Translating the column names to Portuguese


In [62]:
df_customers.rename(columns={
    'username':'usuario',
    'name':'nome',
    'address':'endereco',
    'birthdate':'dt_nascimento',
    'accounts':'contas',
    'tier_and_details':'beneficios'

},inplace=True)

#Validation

In [63]:
#Validation schema

schema = pa.DataFrameSchema(
    columns = {
        'usuario':pa.Column(pa.String),
        'nome':pa.Column(pa.String),
        'endereco':pa.Column(pa.String),
        'dt_nascimento':pa.Column(pa.DateTime),
        'email':pa.Column(pa.String),
        'active':pa.Column(pa.String,nullable=True),
        'contas':pa.Column(pa.String),
        'beneficios':pa.Column(pa.String,nullable=True),
    })

In [64]:
schema.validate(df_customers)

Unnamed: 0,usuario,nome,endereco,dt_nascimento,email,active,contas,beneficios
0,fmiller,Elizabeth Ray,"9286 Bethany Glens\nVasqueztown, CO 22939",1977-03-02 02:20:31,arroyocolton@gmail.com,True,"[371138, 324287, 276528, 332179, 422649, 387979]",{'0df078f33aa74a2e9696e0520c1a828a': {'tier': ...
1,valenciajennifer,Lindsay Cowan,Unit 1047 Box 4089\nDPO AA 57348,1994-02-19 23:46:27,cooperalexis@hotmail.com,,[116508],{'c06d340a4bad42c59e3b6665571d2907': {'tier': ...
2,hillrachel,Katherine David,"55711 Janet Plaza Apt. 865\nChristinachester, ...",1988-06-20 22:15:34,timothy78@hotmail.com,,"[462501, 228290, 968786, 515844, 377292]",
3,serranobrian,Leslie Martinez,Unit 2676 Box 9352\nDPO AA 38560,1974-11-26 14:30:20,tcrawford@gmail.com,,"[170945, 951849]",{'a15baf69a759423297f11ce6c7b0bc9a': {'tier': ...
4,charleshudson,Brad Cardenas,"2765 Powers Meadow\nHeatherfurt, CT 53165",1977-05-06 21:57:35,dustin37@yahoo.com,,"[721914, 817222, 973067, 260799, 87389]",
...,...,...,...,...,...,...,...,...
495,amandawilliams,Brandy Huang,"9505 Melissa Streets\nSouth Frankville, NJ 91189",1975-09-22 14:21:58,scottjonathan@yahoo.com,,"[650729, 991663, 144876, 912504, 88163]",
496,stricklandjeffery,Xavier Myers,"499 Jonathan Streets Apt. 890\nEast Ashley, MD...",1987-10-24 19:05:15,fredsmith@yahoo.com,,"[285957, 875868, 138703, 122908, 370468]",
497,smcintyre,Christopher Lawrence,"00881 West Flat\nNorth Emily, IL 32130",1997-03-05 18:20:57,vkeith@yahoo.com,,"[551774, 264502, 599670, 193228, 397774]",
498,qknight,Gabriel Romero,"79375 David Neck\nWest Matthewton, NJ 92863",1971-05-04 21:20:10,erica98@gmail.com,,"[568852, 351063, 635650, 229182, 732327, 89698]",
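
The schema above declares 'contas' and 'beneficios' as strings, yet validation passed even though those columns hold lists and dictionaries, which suggests that only the column dtype (object) was checked. As a complement, the sketch below checks the Python type of each cell with plain pandas. The helper `all_cells_are` and the `sample` DataFrame are hypothetical names introduced for illustration only.

```python
import pandas as pd

def all_cells_are(series: pd.Series, expected_type: type) -> bool:
    """Return True when every non-null cell is an instance of expected_type."""
    return bool(series.dropna().map(lambda v: isinstance(v, expected_type)).all())

# Hypothetical sample using the translated column names.
sample = pd.DataFrame({
    "usuario": ["fmiller", "hillrachel"],
    "contas": [[371138, 324287], [462501]],
    "beneficios": [{"x": {"tier": "Bronze"}}, float("nan")],
})

print(all_cells_are(sample["usuario"], str))
print(all_cells_are(sample["contas"], list))
print(all_cells_are(sample["beneficios"], dict))
```

Null cells are skipped on purpose, mirroring the `nullable=True` setting used for 'beneficios' in the schema.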


#Loading

***I will load this DataFrame into a new MongoDB database and collection***

In [65]:
colecao_customers = client['sample_analytics_tratado']['customers_tratado']

In [66]:
df_customers_dict = df_customers.to_dict('records')

In [67]:
colecao_customers.insert_many(df_customers_dict)

<pymongo.results.InsertManyResult at 0x7f857d123520>
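
Because `to_dict('records')` keeps NaN cells as float NaN values, the documents inserted above store those fields as NaN rather than omitting them. If omitting them is preferred, one possible approach is to strip those keys from each record before calling `insert_many`. The helper `drop_nan_fields` below is a hypothetical name, and the records are made-up examples rather than the real data.

```python
import math

def drop_nan_fields(record: dict) -> dict:
    """Return a copy of record without the keys whose value is a float NaN."""
    return {
        key: value
        for key, value in record.items()
        if not (isinstance(value, float) and math.isnan(value))
    }

# Hypothetical records in the shape produced by to_dict('records').
records = [
    {"usuario": "fmiller", "active": True, "beneficios": {"x": {"tier": "Bronze"}}},
    {"usuario": "hillrachel", "active": float("nan"), "beneficios": float("nan")},
]

cleaned = [drop_nan_fields(r) for r in records]
print(cleaned)
```

With this approach a missing 'active' status would be represented by the absence of the field, which is a design choice rather than a requirement.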