<a href="https://colab.research.google.com/github/JosenildoJunior/StatPyDataScience/blob/main/Desafio_Ifood.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Análise exploratória em dados do Ifood**

O conjunto de dados é composto por clientes da empresa Ifood com dados sobre:

- Perfis de clientes
- Preferências do produto
- Sucessos/fracassos da campanha
- Desempenho do canal


**O objetivo é realizar uma análise exploratória desses dados.**

## **Importando os dados**

Nessa parte iremos importar todas as bibliotecas que iremos precisar para realizar nossa analise e também os dados, dito isso vamos seguir para a importação das bibliotecas e posteriormente dos dados

In [1]:
# Manipulação de dados
import pandas as pd

# Algébra linear
import numpy as np

# Visualização de dados
import matplotlib.pyplot as plt
import seaborn as sns

# Funções estatísticas
import statistics
import scipy.stats

Com as bibliotecas importadas podemos seguir para a importação dos dados propriamente dito

In [2]:
# Acesso ao drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# Importando o dataset
df = pd.read_csv('/content/drive/MyDrive/Estatística para ciência de dados/mkt_data.csv')

# Observando os primeiro registros
df.head()

Unnamed: 0.1,Unnamed: 0,Income,Kidhome,Teenhome,Recency,MntWines,MntFruits,MntMeatProducts,MntFishProducts,MntSweetProducts,...,education_Graduation,education_Master,education_PhD,MntTotal,MntRegularProds,AcceptedCmpOverall,marital_status,education_level,kids,expenses
0,0,58138.0,0,0,58,635,88,546,172,88,...,3.0,,,1529,1441,0,Single,Graduation,0,1529
1,1,46344.0,1,1,38,11,1,6,2,1,...,3.0,,,21,15,0,Single,Graduation,2,21
2,2,71613.0,0,0,26,426,49,127,111,21,...,3.0,,,734,692,0,Together,Graduation,0,734
3,3,26646.0,1,0,26,11,4,20,10,3,...,3.0,,,48,43,0,Together,Graduation,1,48
4,4,58293.0,1,0,94,173,43,118,46,27,...,,,5.0,407,392,0,Married,PhD,1,407


Agora podemos partir para a proxima etapa

## **Análise inicial**

Nessa parte do desafio responderemos algumas perguntas com o intuito de enterdermos um pouco mais de como esses dados se comportam, para isso responderemos as seguintes perguntas:

- Quantos dados temos? Linhas e colunas
- Quais são as colunas numéricas?
- Temos duplicados na nossa base? Se tivermos, retire-os
- Temos dados nulos nessa base? Será que eles indicam algo? O que fazer com eles?
- Qual é a média, mediana, 25 percentil, 75 percentil, mínimo e máximo de cada uma das colunas numéricas?

Agora que já sabemos as perguntas, vamos partir para a resolução das mesmas.

### **Quantos dados temos? Linhas e colunas**

In [None]:
# Observando as dimensões dos dados
df.shape

(2205, 44)

Dessa forma, podemos observar que temos 2205 linhas e 44 colunas nessa base de dados.

### **Quais são as colunas numéricas?**

In [None]:
# Observando o tipo de dado
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2205 entries, 0 to 2204
Data columns (total 44 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Unnamed: 0            2205 non-null   int64  
 1   Income                2205 non-null   float64
 2   Kidhome               2205 non-null   int64  
 3   Teenhome              2205 non-null   int64  
 4   Recency               2205 non-null   int64  
 5   MntWines              2205 non-null   int64  
 6   MntFruits             2205 non-null   int64  
 7   MntMeatProducts       2205 non-null   int64  
 8   MntFishProducts       2205 non-null   int64  
 9   MntSweetProducts      2205 non-null   int64  
 10  MntGoldProds          2205 non-null   int64  
 11  NumDealsPurchases     2205 non-null   int64  
 12  NumWebPurchases       2205 non-null   int64  
 13  NumCatalogPurchases   2205 non-null   int64  
 14  NumStorePurchases     2205 non-null   int64  
 15  NumWebVisitsMonth    

Dessa forma, podemos observar o tipo de dado presente em cada uma das colunas. Temos 2 variáveis categóricas e 42 colunas numéricas, sendo elas:

**Variáveis categóricas:**
 - marital_status        
 - education_level

**Variáveis numéricas:**
- Income
- Kidhome
- Teenhome              
- Recency               
- MntWines              
- MntFruits             
- MntMeatProducts       
- MntFishProducts     
-  MntSweetProducts       
- MntGoldProds          
- NumDealsPurchases
- NumWebPurchases       
- NumCatalogPurchases    
- NumStorePurchases     
- NumWebVisitsMonth    
- AcceptedCmp3        
- AcceptedCmp4         
- AcceptedCmp5          
- AcceptedCmp1          
- AcceptedCmp2          
- Complain             
- Z_CostContact        
-  Z_Revenue             
-  Response              
- Age                   
- Customer_Days       
- marital_Divorced     
- marital_Married      
- marital_Single        
- marital_Together    
- marital_Widow         
- education_2n Cycle    
- education_Basic       
- education_Graduation  
- education_Master     
-  education_PhD         
-  MntTotal              
-  MntRegularProds       
-  AcceptedCmpOverall   
-  kids                
-  expenses             




### **Temos duplicados na nossa base? Se tivermos, retire-os**

In [4]:
# Verificando o valores duplicados
df.duplicated().sum()

0

Podemos observar que nesta base não existem valores duplicados. Sendo assim, podemos seguir adiante.

### **Temos dados nulos nessa base? Será que eles indicam algo? O que fazer com eles?**

In [6]:
# Observando os valores ausentes
def percent_ausentes(df_medias):
    p_faltantes = df_medias.isnull().mean()
    valores_faltantes = pd.DataFrame({'Variavéis': df_medias.columns,
                                           '% de ausentes': p_faltantes}
                                      ).reset_index(drop = True)

    return valores_faltantes.sort_values(by = ['% de ausentes'], ascending = False)

percent_ausentes(df)

Unnamed: 0,Variavéis,% de ausentes
33,education_Basic,0.97551
31,marital_Widow,0.965533
32,education_2n Cycle,0.910204
27,marital_Divorced,0.895692
35,education_Master,0.834921
36,education_PhD,0.784127
29,marital_Single,0.783673
30,marital_Together,0.742404
28,marital_Married,0.612698
34,education_Graduation,0.495238


Essa função nos retorna a porcentagem de valores ausentes presentes em cada uma coluna, podemos observar que existem variaveis com sua grande maioreia de registros ausentes como por exemplo education_basic, marital_widow, education_2n Cycle possuem mais de 90% dos valores ausentes, as variaveis marital_divorced e education_master possuem mais de 80%, variaveis como education_Phd, marital_single e marital_togheter possuem um pouco mais de 70% de valores ausentes enquanto as variaveis marital_married e education_graduation possuem respectivamente 61% e 49% de valores ausentes

Pelo expressivo numero de valores ausentes é possivel levantar a hipotese de que o pessoal que os usuarios não davam tanta importancia para o preenchimento dessas caracteristicas e acabavam deixando em branco,

Essas varaiveis que em sua grande maioria posuem valores ausentes não vai nos dizer nada a respeito dos nossos dados, então no momento elas podem ser retidaras do nosso dataset, vamos aproveitar para retirar também a coluna Unnamed, já que ela não será utilizada

In [7]:
# Realizando a exclusão das colunas
df = df.drop(['Unnamed','education_Basic', 'marital_Widow', 'education_2n Cycle', 'marital_Divorced', 'education_Master',
              'education_PhD', 'marital_Single', 'marital_Together', 'marital_Married', 'education_Graduation'], axis = 1)

Agora vamos observar novamente os valores ausentes

In [8]:
# Observando os valores ausentes
def percent_ausentes(df_medias):
    p_faltantes = df_medias.isnull().mean()
    valores_faltantes = pd.DataFrame({'Variavéis': df_medias.columns,
                                           '% de ausentes': p_faltantes}
                                      ).reset_index(drop = True)

    return valores_faltantes.sort_values(by = ['% de ausentes'], ascending = False)

percent_ausentes(df)

Unnamed: 0,Variavéis,% de ausentes
0,Unnamed: 0,0.0
25,Age,0.0
19,AcceptedCmp1,0.0
20,AcceptedCmp2,0.0
21,Complain,0.0
22,Z_CostContact,0.0
23,Z_Revenue,0.0
24,Response,0.0
26,Customer_Days,0.0
1,Income,0.0


Com os valores ausentes tratados podemos seguir adiante

### **Qual é a média, mediana, 25 percentil, 75 percentil, mínimo e máximo de cada uma das colunas numéricas?**

Como esse dataset possui bastante colunas vamos utilizar o seguinte método para que todas as colunas sejam exibidas corretamente

In [11]:
# Configurar para exibir todas as colunas
pd.set_option('display.max_columns', None)

Vamos utilizar o método describe para observar essas informação de todas as colunas numéricas de uma só vez

In [12]:
# Resumo estatístico
df.describe()

Unnamed: 0.1,Unnamed: 0,Income,Kidhome,Teenhome,Recency,MntWines,MntFruits,MntMeatProducts,MntFishProducts,MntSweetProducts,MntGoldProds,NumDealsPurchases,NumWebPurchases,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Z_CostContact,Z_Revenue,Response,Age,Customer_Days,MntTotal,MntRegularProds,AcceptedCmpOverall,kids,expenses
count,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0
mean,1102.0,51622.094785,0.442177,0.506576,49.00907,306.164626,26.403175,165.312018,37.756463,27.128345,44.057143,2.318367,4.10068,2.645351,5.823583,5.336961,0.073923,0.074376,0.073016,0.064399,0.013605,0.00907,3.0,11.0,0.15102,51.095692,2512.718367,562.764626,518.707483,0.29932,0.948753,562.764626
std,636.672993,20713.063826,0.537132,0.54438,28.932111,337.493839,39.784484,217.784507,54.824635,41.130468,51.736211,1.886107,2.737424,2.798647,3.241796,2.413535,0.261705,0.262442,0.260222,0.245518,0.115872,0.094827,0.0,0.0,0.35815,11.705801,202.563647,575.936911,553.847248,0.68044,0.749231,575.936911
min,0.0,1730.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,11.0,0.0,24.0,2159.0,4.0,-283.0,0.0,0.0,4.0
25%,551.0,35196.0,0.0,0.0,24.0,24.0,2.0,16.0,3.0,1.0,9.0,1.0,2.0,0.0,3.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,11.0,0.0,43.0,2339.0,56.0,42.0,0.0,0.0,56.0
50%,1102.0,51287.0,0.0,0.0,49.0,178.0,8.0,68.0,12.0,8.0,25.0,2.0,4.0,2.0,5.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,11.0,0.0,50.0,2515.0,343.0,288.0,0.0,1.0,343.0
75%,1653.0,68281.0,1.0,1.0,74.0,507.0,33.0,232.0,50.0,34.0,56.0,3.0,6.0,4.0,8.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,11.0,0.0,61.0,2688.0,964.0,884.0,0.0,1.0,964.0
max,2204.0,113734.0,2.0,2.0,99.0,1493.0,199.0,1725.0,259.0,262.0,321.0,15.0,27.0,28.0,13.0,20.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0,11.0,1.0,80.0,2858.0,2491.0,2458.0,4.0,3.0,2491.0
