# Final Project

You are part of the Data Science and Analytics team at Popolishoshop and have received a dataset with information about the most recent Black Friday. The business team has asked you for a report with some specific information, plus a study whose questions you should answer using the datasets provided.

For this challenge, we will work with the [Black Friday](https://www.kaggle.com/sdolezel/black-friday) dataset, which gathers data on purchase transactions at a retail store. The dataset is split across several files, and it is your job to understand how they relate to one another.

We will use it to practice data exploration with pandas.

The table below lists the column names and the field descriptions.

| Column                     | Description                                            |
|----------------------------|--------------------------------------------------------|
| User_ID                    | User ID                                                |
| Product_ID                 | Product ID                                             |
| Gender                     | User's gender                                          |
| Age                        | Age, in ranges                                         |
| Occupation                 | Occupation (masked)                                    |
| City_Category              | City category (A, B, C)                                |
| Stay_In_Current_City_Years | Number of years living in the current city             |
| Marital_Status             | Marital status                                         |
| Product_Category_1         | Product category (masked)                              |
| Product_Category_2         | Additional category the product may belong to (masked) |
| Product_Category_3         | Additional category the product may belong to (masked) |
| Purchase                   | Purchase amount                                        |

All code should be written with reuse in mind. It will be evaluated by running the entire notebook on another table with the same columns, so pay attention to code quality and reproducibility.

## Analysis _setup_

Read the three provided datasets and join them into a single DataFrame.

In [1]:
import pandas as pd
import numpy as np

In [2]:
produtos = pd.read_csv('product_info.csv', sep=';')
compras = pd.read_csv('purchase.csv', sep=',')
usuarios = pd.read_csv('user_profile.csv', sep='|')

In [3]:
base_prdts_comprs = pd.merge(produtos, compras, on='Product_ID')

In [4]:
base_final = pd.merge(base_prdts_comprs, usuarios, on='User_ID')
base_final

Unnamed: 0,Product_ID,Product_Category_1,Product_Category_2,Product_Category_3,User_ID,Purchase,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status
0,P00069042,3,,,1000001,8370,F,0-17,10,A,2,0
1,P00248942,1,6.0,14.0,1000001,15200,F,0-17,10,A,2,0
2,P00087842,12,,,1000001,1422,F,0-17,10,A,2,0
3,P00085442,12,14.0,,1000001,1057,F,0-17,10,A,2,0
4,P00184942,1,8.0,17.0,1000001,19219,F,0-17,10,A,2,0
...,...,...,...,...,...,...,...,...,...,...,...,...
537572,P00353842,15,,,1002204,12965,M,55+,3,B,4+,1
537573,P00311442,11,15.0,,1002204,1714,M,55+,3,B,4+,1
537574,P00325642,11,15.0,,1002204,1561,M,55+,3,B,4+,1
537575,P00341842,15,16.0,,1002204,21172,M,55+,3,B,4+,1
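
Since the notebook will be re-run on another table, it may help to make the joins fail loudly if a key is unexpectedly duplicated. The sketch below uses small made-up frames in place of the three files and assumes one row per product and one row per user in the lookup tables; the `how='left'` and `validate` arguments are a suggestion, not what the cells above use.

```python
import pandas as pd

# Toy stand-ins for the three files (values are made up for illustration)
produtos = pd.DataFrame({'Product_ID': ['P1', 'P2'], 'Product_Category_1': [3, 1]})
compras = pd.DataFrame({'User_ID': [10, 10, 11],
                        'Product_ID': ['P1', 'P2', 'P1'],
                        'Purchase': [100, 250, 90]})
usuarios = pd.DataFrame({'User_ID': [10, 11], 'Gender': ['F', 'M']})

# validate= raises pandas.errors.MergeError if the right-hand key is not unique,
# so a silently inflated row count is caught early
base = (
    compras
    .merge(produtos, on='Product_ID', how='left', validate='many_to_one')
    .merge(usuarios, on='User_ID', how='left', validate='many_to_one')
)

# Left joins on unique keys keep exactly one row per purchase
assert len(base) == len(compras)
print(base.shape)
```

Starting from the purchase table keeps one row per transaction, which makes the row count easy to sanity-check after each join.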


## Question 1

How many observations and how many columns are there in the complete dataset (all files joined)? Answer as a tuple `(n_observacoes, n_colunas)`.

In [5]:
base_final.shape

(537577, 12)

## Question 2

How many women aged 26 to 35 are there in the dataset? Answer as a single scalar.

In [6]:
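# Filter on the age range explicitly instead of relying on '26-35' being the most frequent value
((base_final['Gender'] == 'F') & (base_final['Age'] == '26-35')).sum()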

49348

## Question 3

How many unique users are there in the dataset? Answer as a single scalar.

In [7]:
base_final['User_ID'].nunique()

5891

## Question 4

What percentage of records (rows) has at least one null value (`None`, `NaN`, etc.)? Answer as a single scalar between 0 and 1.

In [8]:
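# Fraction of nulls per column; note that the question asks for the fraction of rows with at least one null
base_final.isna().sum()/base_final.shape[0]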

Product_ID                    0.000000
Product_Category_1            0.000000
Product_Category_2            0.310627
Product_Category_3            0.694410
User_ID                       0.000000
Purchase                      0.000000
Gender                        0.000000
Age                           0.000000
Occupation                    0.000000
City_Category                 0.000000
Stay_In_Current_City_Years    0.000000
Marital_Status                0.000000
dtype: float64
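
The output above is a per-column breakdown, while the question asks for a single row-level number. A minimal sketch of the row-level calculation on a made-up frame (the values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame: three of the four rows contain at least one null
df = pd.DataFrame({
    'Product_Category_2': [6.0, np.nan, 14.0, np.nan],
    'Product_Category_3': [14.0, np.nan, np.nan, np.nan],
    'Purchase': [15200, 8370, 1057, 1422],
})

# any(axis=1) is True for each row holding at least one null;
# the mean of a boolean Series is the fraction of True values
frac_rows_with_null = float(df.isna().any(axis=1).mean())
print(frac_rows_with_null)
```

Applied to `base_final`, the same expression would return one scalar between 0 and 1.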

## Question 5

How many null values are there in the variable (column) with the most nulls? Answer as a single scalar.

In [9]:
base_final.isnull().sum().max()

373299

## Question 6

What is the most frequent value (not counting nulls) in `Product_Category_3`? Answer as a single scalar.

In [10]:
base_final['Product_Category_3'].dropna().mode()[0]

16.0

## Question 7

Can we say that whenever an observation is null in `Product_Category_2` it is also null in `Product_Category_3`? Answer with a bool (`True`, `False`).

In [11]:
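# Compares the two null masks row by row (equivalence), which is stricter than the one-way implication asked above
(base_final['Product_Category_2'].isnull() == base_final['Product_Category_3'].isnull()).all()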

False
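
The question is about a one-way implication (null in category 2 implies null in category 3), which can hold even when the two null masks are not identical. A small sketch on made-up data shows the difference between the two checks:

```python
import numpy as np
import pandas as pd

# Toy data: row 2 is null in category 3 only
df = pd.DataFrame({
    'Product_Category_2': [6.0, np.nan, 14.0, np.nan],
    'Product_Category_3': [14.0, np.nan, np.nan, np.nan],
})

null_2 = df['Product_Category_2'].isna()
null_3 = df['Product_Category_3'].isna()

# Implication: every row that is null in category 2 is also null in category 3
implies = bool(null_3[null_2].all())
# Equivalence: the two null masks are identical row by row
equivalent = bool((null_2 == null_3).all())
print(implies, equivalent)
```

On this toy frame the implication holds while the equivalence does not, so the two checks can give different answers on the same data.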

## Question 8

What is the ID of the user who spent the most on Black Friday?

In [12]:
# Select both columns directly instead of joining the frame with itself
compras_usuario = base_final[['User_ID', 'Purchase']]
compras_user_total = compras_usuario.groupby(by='User_ID').sum()
compras_user_total.idxmax()

Purchase    1004277
dtype: int64

## Question 9

Which group (men or women) spent the most on Black Friday?

In [13]:
# Select both columns directly instead of joining the frame with itself
compras_genero = base_final[['Gender', 'Purchase']]
compras_genero_total = compras_genero.groupby(by='Gender').sum()
compras_genero_total.idxmax()

Purchase    M
dtype: object

## Question 10

Build a new table with the most purchased category for each customer.

Note: if the customer bought a product that has values in all three category columns, all of those categories must be counted.

Illustration, with each category column ranging from 1 to 12:

Category_1 ranges from 1 to 12
Category_2 ranges from 1 to 12
Category_3 ranges from 1 to 12

A purchase with the following categories:

Category_1: 2
Category_2: 4
Category_3: 1

adds one count to each of columns 1, 2 and 4:

       1 2 3 4 5 6 7 8 9 10 11 12
739273 1 1 0 1 0 0 0 0 0 0  0  0

In [15]:
compras_categoria = base_final[['User_ID', 'Product_Category_1', 'Product_Category_2', 'Product_Category_3']]
compras_categoria

Unnamed: 0,User_ID,Product_Category_1,Product_Category_2,Product_Category_3
0,1000001,3,,
1,1000001,1,6.0,14.0
2,1000001,12,,
3,1000001,12,14.0,
4,1000001,1,8.0,17.0
...,...,...,...,...
537572,1002204,15,,
537573,1002204,11,15.0,
537574,1002204,11,15.0,
537575,1002204,15,16.0,


In [16]:
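# Inspect only the first row: duplicated() marks values already seen earlier in that row
# (here the NaN in Product_Category_3 repeats the NaN in Product_Category_2)
for index, row in compras_categoria.iterrows():
    print(row.duplicated())
    break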

User_ID               False
Product_Category_1    False
Product_Category_2    False
Product_Category_3     True
Name: 0, dtype: bool


In [54]:
for k, v in np.ndenumerate(compras_categoria):
    print(k)
    break
    

(0, 0)


In [17]:
for n in range(1, 4):
    min_ = compras_categoria[f'Product_Category_{n}'].min()
    max_ = compras_categoria[f'Product_Category_{n}'].max()
    print(f'Category_{n} ranges from {min_} to {max_}')

Category_1 ranges from 1 to 18
Category_2 ranges from 2.0 to 18.0
Category_3 ranges from 3.0 to 18.0


In [18]:
for i in range(1, 4):
    # mode() ignores NaN by default; [0] picks the smallest value when there is a tie
    moda = compras_categoria[f'Product_Category_{i}'].mode()
    print(f'Product_Category_{i}:', int(moda[0]))

Product_Category_1: 5
Product_Category_2: 8
Product_Category_3: 16


In [58]:
dummies01 = pd.get_dummies(compras_categoria, columns=['Product_Category_1']).drop(columns=['Product_Category_2', 'Product_Category_3'])
novo_dummie01 = dummies01.groupby(by='User_ID').sum()
# Keep only the category number as the column label ('Product_Category_1_5' -> '5'),
# without hard-coding which categories exist
novo_dummie01.columns = [str(int(float(col.rsplit('_', 1)[-1]))) for col in novo_dummie01.columns]
novo_dummie01

User_ID,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
1000001,4.0,1.0,11.0,2.0,2.0,1.0,0.0,8.0,0.0,0.0,0.0,3.0,0.0,1.0,0.0,1.0,0.0,0.0
1000002,31.0,1.0,0.0,0.0,13.0,6.0,0.0,25.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1000003,15.0,2.0,1.0,0.0,9.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1000004,13.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1000005,18.0,2.0,1.0,3.0,20.0,6.0,5.0,44.0,0.0,0.0,2.0,0.0,0.0,1.0,1.0,3.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1006036,81.0,14.0,12.0,13.0,119.0,20.0,6.0,151.0,0.0,5.0,28.0,1.0,4.0,2.0,10.0,8.0,0.0,2.0
1006037,14.0,2.0,0.0,1.0,24.0,6.0,0.0,43.0,0.0,3.0,4.0,0.0,4.0,1.0,2.0,11.0,0.0,1.0
1006038,0.0,0.0,2.0,0.0,4.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1006039,7.0,3.0,11.0,0.0,28.0,0.0,0.0,5.0,0.0,3.0,0.0,4.0,6.0,0.0,0.0,0.0,0.0,0.0


In [59]:
dummies02 = pd.get_dummies(compras_categoria, columns=['Product_Category_2']).drop(columns=['Product_Category_1', 'Product_Category_3'])
novo_dummie02 = dummies02.groupby(by='User_ID').sum()
# Keep only the category number as the column label ('Product_Category_2_8.0' -> '8')
novo_dummie02.columns = [str(int(float(col.rsplit('_', 1)[-1]))) for col in novo_dummie02.columns]
# Categories that never occur in this column (here, 1) get an all-zero column so the three tables line up
novo_dummie02 = novo_dummie02.reindex(columns=[str(c) for c in range(1, 19)], fill_value=0)
novo_dummie02

User_ID,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
1000001,0,2,0,11,0,1,0,4,0,0,0,0,0,1,1,0,1,0
1000002,0,8,0,0,2,2,0,18,0,0,2,0,1,5,5,10,1,0
1000003,0,13,0,1,3,0,0,3,0,0,0,0,0,1,0,1,0,1
1000004,0,4,0,0,0,1,0,2,0,0,1,0,0,0,3,1,0,0
1000005,0,3,1,1,5,2,0,11,0,0,3,1,4,11,4,11,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1006036,0,27,1,13,16,12,1,53,4,5,10,5,9,54,43,32,10,1
1006037,0,3,0,1,1,6,0,14,1,1,0,0,3,10,9,16,5,0
1006038,0,0,0,0,2,0,0,2,0,0,1,0,0,4,0,0,0,0
1006039,0,3,0,9,5,1,0,6,2,0,1,8,5,6,2,5,1,0


In [60]:
dummies03 = pd.get_dummies(compras_categoria, columns=['Product_Category_3']).drop(columns=['Product_Category_1', 'Product_Category_2'])
novo_dummie03 = dummies03.groupby(by='User_ID').sum()
# Keep only the category number as the column label ('Product_Category_3_16.0' -> '16')
novo_dummie03.columns = [str(int(float(col.rsplit('_', 1)[-1]))) for col in novo_dummie03.columns]
# Categories that never occur in this column (here 1, 2 and 7) get an all-zero column so the three tables line up
novo_dummie03 = novo_dummie03.reindex(columns=[str(c) for c in range(1, 19)], fill_value=0)
novo_dummie03

User_ID,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
1000001,0,0,0,0,3,0,0,1,2,0,0,5,0,1,0,1,1,0
1000002,0,0,0,0,0,1,0,3,1,1,0,0,2,5,2,4,6,1
1000003,0,0,0,0,3,0,0,2,0,0,2,0,0,2,1,1,0,2
1000004,0,0,0,0,0,0,0,0,1,0,1,0,0,2,2,2,1,0
1000005,0,0,0,0,1,1,0,2,0,1,0,1,0,1,1,7,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1006036,0,0,1,0,5,2,0,5,4,1,1,3,3,17,26,25,5,3
1006037,0,0,0,0,0,0,0,4,3,0,0,1,1,5,3,11,4,0
1006038,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0
1006039,0,0,0,0,3,0,0,2,1,0,0,8,1,5,0,2,1,1


In [61]:
merge_1_2_3 = (novo_dummie01 + novo_dummie02 + novo_dummie03)
merge_1_2_3

User_ID,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
1000001,4.0,3.0,11.0,13.0,5.0,2.0,0.0,13.0,2.0,0.0,0.0,8.0,0.0,3.0,1.0,2.0,2.0,0.0
1000002,31.0,9.0,0.0,0.0,15.0,9.0,0.0,46.0,1.0,1.0,2.0,0.0,3.0,10.0,7.0,14.0,7.0,1.0
1000003,15.0,15.0,1.0,1.0,15.0,0.0,0.0,6.0,0.0,0.0,2.0,0.0,0.0,3.0,1.0,2.0,0.0,4.0
1000004,13.0,4.0,0.0,0.0,0.0,1.0,0.0,2.0,1.0,0.0,2.0,0.0,0.0,2.0,5.0,3.0,1.0,0.0
1000005,18.0,5.0,2.0,4.0,26.0,9.0,5.0,57.0,0.0,1.0,5.0,2.0,4.0,13.0,6.0,21.0,2.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1006036,81.0,41.0,14.0,26.0,140.0,34.0,7.0,209.0,8.0,11.0,39.0,9.0,16.0,73.0,79.0,65.0,15.0,6.0
1006037,14.0,5.0,0.0,2.0,25.0,12.0,0.0,61.0,4.0,4.0,4.0,1.0,8.0,16.0,14.0,38.0,9.0,1.0
1006038,0.0,0.0,2.0,0.0,6.0,0.0,0.0,6.0,0.0,0.0,1.0,0.0,0.0,4.0,0.0,0.0,3.0,0.0
1006039,7.0,6.0,11.0,9.0,36.0,1.0,0.0,13.0,3.0,3.0,1.0,20.0,12.0,11.0,2.0,7.0,2.0,1.0


In [62]:
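# idxmax(axis=1) returns, for each user, the label of the column with the highest count
# (ties resolve to the first such column), so it does not depend on column positions
mais_comprado = merge_1_2_3.idxmax(axis=1).astype(int)
lista_compra = {
    'Usuário': mais_comprado.index.tolist(),
    'Mais_comprado': mais_comprado.tolist(),
}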

In [63]:
pd.DataFrame(lista_compra)

Unnamed: 0,Usuário,Mais_comprado
0,1000001,4
1,1000002,8
2,1000003,1
3,1000004,1
4,1000005,8
...,...,...
5886,1006036,8
5887,1006037,8
5888,1006038,5
5889,1006039,5
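
An alternative that avoids handling each category column separately is to reshape the three columns into one long column and count from there. The sketch below runs on a small made-up frame; it is one possible approach and not the method used in the cells above.

```python
import numpy as np
import pandas as pd

# Toy purchases for two users (values are made up for illustration)
compras_categoria = pd.DataFrame({
    'User_ID': [1, 1, 1, 2, 2],
    'Product_Category_1': [3, 1, 3, 5, 8],
    'Product_Category_2': [np.nan, 6.0, 4.0, 8.0, np.nan],
    'Product_Category_3': [np.nan, 14.0, np.nan, np.nan, np.nan],
})

cat_cols = ['Product_Category_1', 'Product_Category_2', 'Product_Category_3']

# One row per (purchase, category) pair; empty category slots are dropped
long = (
    compras_categoria
    .melt(id_vars='User_ID', value_vars=cat_cols, value_name='Category')
    .dropna(subset=['Category'])
    .astype({'Category': int})
)

# Users in rows, categories in columns, counts in cells
contagem = long.groupby(['User_ID', 'Category']).size().unstack(fill_value=0)

# Label of the most frequent category per user (first label wins on ties)
mais_comprado = contagem.idxmax(axis=1).rename('Mais_comprado').reset_index()
print(mais_comprado)
```

Because the category labels come from the data, this version needs no hard-coded list of categories and keeps working if another table contains a different set.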


In [64]:
pd.pivot_table(merge_1_2_3,index=["User_ID"], values=['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18'], aggfunc=[np.sum, np.cumsum], fill_value=0, margins=True)

User_ID,sum(1),sum(10),sum(11),sum(12),sum(13),sum(14),sum(15),sum(16),sum(17),sum(18),...,cumsum(17),cumsum(18),cumsum(2),cumsum(3),cumsum(4),cumsum(5),cumsum(6),cumsum(7),cumsum(8),cumsum(9)
1000001,4,0,0,8,0,3,1,2,2,0,...,2.0,0.0,3.0,11.0,13.0,5.0,2.0,0.0,13.0,2.0
1000002,31,1,2,0,3,10,7,14,7,1,...,7.0,1.0,9.0,0.0,0.0,15.0,9.0,0.0,46.0,1.0
1000003,15,0,2,0,0,3,1,2,0,4,...,0.0,4.0,15.0,1.0,1.0,15.0,0.0,0.0,6.0,0.0
1000004,13,0,2,0,0,2,5,3,1,0,...,1.0,0.0,4.0,0.0,0.0,0.0,1.0,0.0,2.0,1.0
1000005,18,1,5,2,4,13,6,21,2,0,...,2.0,0.0,5.0,2.0,4.0,26.0,9.0,5.0,57.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1006037,14,4,4,1,8,16,14,38,9,1,...,9.0,1.0,5.0,0.0,2.0,25.0,12.0,0.0,61.0,4.0
1006038,0,0,1,0,0,4,0,0,3,0,...,3.0,0.0,0.0,2.0,0.0,6.0,0.0,0.0,6.0,0.0
1006039,7,3,1,20,12,11,2,7,2,1,...,2.0,1.0,6.0,11.0,9.0,36.0,1.0,0.0,13.0,3.0
1006040,23,6,12,3,6,24,19,19,9,1,...,9.0,1.0,10.0,1.0,3.0,49.0,15.0,5.0,87.0,4.0


## Question 11

Normalize the Purchase column. The normalization formula is:


$$x_{norm} = \frac{x - x_{min}}{x_{max} - x_{min}} $$

In [65]:
purchase = base_final['Purchase']
norm_purchase = (purchase - purchase.min())/(purchase.max() - purchase.min())
norm_purchase

0         0.344255
1         0.631519
2         0.052027
3         0.036676
4         0.800555
            ...   
537572    0.537517
537573    0.064309
537574    0.057873
537575    0.882697
537576    0.538863
Name: Purchase, Length: 537577, dtype: float64

In [66]:
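# Not a min-max scaling: value_counts(normalize=True) gives the relative frequency of each distinct Purchase value
purchase_n = base_final['Purchase'].value_counts(normalize=True)
purchase_n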

6855     0.000346
7011     0.000344
6891     0.000339
7193     0.000339
6879     0.000337
           ...   
11197    0.000002
21153    0.000002
9204     0.000002
14726    0.000002
9435     0.000002
Name: Purchase, Length: 17959, dtype: float64
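
To make the min-max formula reusable on another table, it can be wrapped in a small function. The sketch below is illustrative: the name `min_max` and the choice to map a constant column to zeros are assumptions, not part of the assignment.

```python
import pandas as pd

def min_max(series: pd.Series) -> pd.Series:
    """Scale a numeric Series to [0, 1]; a constant Series maps to all zeros."""
    span = series.max() - series.min()
    if span == 0:
        # Avoid dividing by zero when every value is the same
        return pd.Series(0.0, index=series.index, name=series.name)
    return (series - series.min()) / span

# Toy values chosen so the result is easy to check by hand
purchase = pd.Series([100, 150, 300], name='Purchase')
print(min_max(purchase).tolist())
```

With `x_min = 100` and `x_max = 300`, the middle value becomes `(150 - 100) / (300 - 100) = 0.25`.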

## Question 12
Does marital status influence the amount spent and the product category purchased? Show it!

If I wanted to sell more products from category 14, which marital status should I target with advertising?

In [67]:
def estado_civil(x):
    # Map the per-user sum of Marital_Status back to a 0/1 flag; missing values count as 0
    if pd.isna(x):
        return 0
    elif x > 0:
        return 1
    else:
        return x

In [68]:
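# Sum only the columns that are needed; summing every column would also try to "sum" text columns such as Gender
bs_fnl_group = base_final.groupby(by='User_ID')[['Marital_Status', 'Purchase']].sum()
novo_bs_fnl = bs_fnl_group[['Marital_Status']].applymap(estado_civil)
marital_analysis = merge_1_2_3.join(novo_bs_fnl[['Marital_Status']]).join(bs_fnl_group[['Purchase']])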


In [69]:
marital_analysis[['Marital_Status', 'Purchase']].groupby(by='Marital_Status').sum()

Marital_Status,Purchase
0,2966289500
1,2051378878


In [70]:
pd.pivot_table(marital_analysis, index=["Marital_Status"], aggfunc=np.mean)

Marital_Status,1,10,11,12,13,14,15,16,17,18,2,3,4,5,6,7,8,9,Purchase
0,24.357624,1.551068,7.04302,3.093064,3.572725,12.663447,12.329236,14.786362,5.043606,1.689494,12.701493,4.189055,6.971905,33.422886,7.224173,0.693298,31.773193,3.123207,868097.600234
1,22.280922,1.786985,6.310428,3.160469,3.632175,12.331447,11.722716,13.711399,5.219078,1.857316,11.551738,3.625707,5.985853,30.978173,6.688763,0.773646,31.934115,2.72312,829174.970897


In [71]:
pd.pivot_table(marital_analysis, index=["Marital_Status"], aggfunc=np.max)

Marital_Status,1,10,11,12,13,14,15,16,17,18,2,3,4,5,6,7,8,9,Purchase
0,209.0,22.0,161.0,49.0,58.0,209.0,123.0,182.0,63.0,32.0,110.0,53.0,108.0,330.0,78.0,24.0,394.0,28.0,10536783
1,204.0,22.0,142.0,52.0,46.0,115.0,120.0,173.0,56.0,29.0,115.0,49.0,110.0,340.0,66.0,23.0,381.0,27.0,8699232


In [72]:
pd.pivot_table(marital_analysis, values="14", index=["Marital_Status"], aggfunc=[np.min, np.mean, np.max, np.argmax])

Marital_Status,amin(14),mean(14),amax(14),argmax(14)
0,0.0,12.663447,209.0,521
1,0.0,12.331447,115.0,397
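
The tables above compare per-user counts. A complementary, row-level view is the share of purchases in each marital-status group that involve category 14 in any of the three category columns. The sketch below uses made-up rows; the summary column names (`compras`, `gasto_medio`, `share_cat_14`) are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy purchases (values are made up for illustration)
df = pd.DataFrame({
    'Marital_Status': [0, 0, 0, 1, 1],
    'Product_Category_1': [14, 5, 8, 14, 1],
    'Product_Category_2': [np.nan, 14.0, np.nan, np.nan, 2.0],
    'Product_Category_3': [np.nan, np.nan, np.nan, np.nan, 14.0],
    'Purchase': [100, 200, 300, 400, 500],
})

cat_cols = ['Product_Category_1', 'Product_Category_2', 'Product_Category_3']

# True when category 14 appears in any of the three columns (NaN compares as False)
tem_cat_14 = df[cat_cols].eq(14).any(axis=1)

resumo = df.assign(tem_cat_14=tem_cat_14).groupby('Marital_Status').agg(
    compras=('Purchase', 'size'),         # number of purchase rows
    gasto_medio=('Purchase', 'mean'),     # mean purchase amount
    share_cat_14=('tem_cat_14', 'mean'),  # fraction of rows involving category 14
)
print(resumo)
```

Comparing shares rather than raw totals keeps the larger group from dominating simply because it has more users.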


## Question 13
Which variables had the greatest impact on the purchase value? How did you reach that conclusion?

In [22]:
base_final[base_final['User_ID'] == 1000456]['Purchase'].sum()

925393

In [75]:
group_by_user_id = base_final.groupby(by = 'User_ID')

In [76]:
purchase_by_group_user = group_by_user_id[['Purchase']].sum()
purchase_by_group_user

User_ID,Purchase
1000001,333481
1000002,810353
1000003,341635
1000004,205987
1000005,821001
...,...
1006036,3821666
1006037,1075037
1006038,80859
1006039,554504


In [77]:
gender_by_group_user = group_by_user_id[['Gender']].max()
gender_by_group_user

User_ID,Gender
1000001,F
1000002,M
1000003,M
1000004,M
1000005,M
...,...
1006036,F
1006037,F
1006038,F
1006039,F


In [78]:
age_by_group_user = group_by_user_id[['Age']].max()
age_by_group_user

User_ID,Age
1000001,0-17
1000002,55+
1000003,26-35
1000004,46-50
1000005,26-35
...,...
1006036,26-35
1006037,46-50
1006038,55+
1006039,46-50


In [79]:
occ_by_group_user = group_by_user_id[['Occupation']].max()
occ_by_group_user

User_ID,Occupation
1000001,10
1000002,16
1000003,15
1000004,7
1000005,20
...,...
1006036,15
1006037,1
1006038,1
1006039,0


In [80]:
city_type_by_group_user = group_by_user_id[['City_Category']].max()
city_type_by_group_user

User_ID,City_Category
1000001,A
1000002,C
1000003,A
1000004,B
1000005,A
...,...
1006036,B
1006037,C
1006038,C
1006039,B


In [81]:
city_time_by_group_user = group_by_user_id[['Stay_In_Current_City_Years']].max()
city_time_by_group_user

User_ID,Stay_In_Current_City_Years
1000001,2
1000002,4+
1000003,3
1000004,2
1000005,1
...,...
1006036,4+
1006037,4+
1006038,2
1006039,4+


In [82]:
mar_st_by_group_user = group_by_user_id[['Marital_Status']].max()
mar_st_by_group_user

User_ID,Marital_Status
1000001,0
1000002,0
1000003,0
1000004,1
1000005,1
...,...
1006036,1
1006037,0
1006038,0
1006039,1


In [83]:
geral_group_by_user = purchase_by_group_user.join([gender_by_group_user, age_by_group_user, 
                             occ_by_group_user, city_type_by_group_user, 
                             city_time_by_group_user, mar_st_by_group_user])

geral_group_by_user

User_ID,Purchase,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status
1000001,333481,F,0-17,10,A,2,0
1000002,810353,M,55+,16,C,4+,0
1000003,341635,M,26-35,15,A,3,0
1000004,205987,M,46-50,7,B,2,1
1000005,821001,M,26-35,20,A,1,1
...,...,...,...,...,...,...,...
1006036,3821666,F,26-35,15,B,4+,1
1006037,1075037,F,46-50,1,C,4+,0
1006038,80859,F,55+,1,C,2,0
1006039,554504,F,46-50,0,B,4+,1
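
The user-level table above is assembled from seven separate group-bys and a join. The same shape can be produced in a single pass with named aggregation, sketched below on a made-up frame. It assumes the profile columns are constant within each user, so `'first'` is enough to pick their value.

```python
import pandas as pd

# Toy rows: user 1 has two purchases, user 2 has one
base = pd.DataFrame({
    'User_ID': [1, 1, 2],
    'Purchase': [100, 250, 90],
    'Gender': ['F', 'F', 'M'],
    'Age': ['0-17', '0-17', '55+'],
})

# One aggregation per output column: total spend plus one profile value per user
geral = base.groupby('User_ID').agg(
    Purchase=('Purchase', 'sum'),
    Gender=('Gender', 'first'),
    Age=('Age', 'first'),
)
print(geral)
```

The remaining profile columns (`Occupation`, `City_Category`, and so on) would follow the same pattern.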


In [84]:
geral_group_by_user.groupby(by = 'Gender')['Purchase'].sum()

Gender
F    1164624021
M    3853044357
Name: Purchase, dtype: int64

In [85]:
geral_group_by_user.groupby(by = 'Gender')['Purchase'].describe()

Gender,count,mean,std,min,25%,50%,75%,max
F,1666.0,699054.034214,795738.053574,44108.0,200245.25,398178.0,863756.75,6186498.0
M,4225.0,911963.16142,975397.759694,45551.0,254674.0,565925.0,1193530.0,10536783.0


In [86]:
geral_group_by_user.groupby(by = 'Age')['Purchase'].sum()

Age
0-17      132659006
18-25     901669280
26-35    1999749106
36-45    1010649565
46-50     413418223
51-55     361908356
55+       197614842
Name: Purchase, dtype: int64

In [87]:
geral_group_by_user.groupby(by = 'Age')['Purchase'].describe()

Age,count,mean,std,min,25%,50%,75%,max
0-17,218.0,608527.550459,675160.7,75906.0,188927.25,404517.5,773810.75,5628295.0
18-25,1069.0,843469.859682,878749.5,46070.0,245489.0,536567.0,1098464.0,6476786.0
26-35,2053.0,974061.912323,1019681.0,44432.0,255866.0,606952.0,1313679.0,8699232.0
36-45,1167.0,866023.620394,971746.6,55900.0,244167.0,512174.0,1132586.5,10536783.0
46-50,531.0,778565.391714,917390.7,62250.0,226492.0,461704.0,928761.0,6044178.0
51-55,481.0,752408.224532,783468.9,45551.0,230895.0,443201.0,980068.0,4799323.0
55+,372.0,531222.693548,611005.0,44108.0,175645.0,326574.0,674327.0,5961987.0


In [88]:
geral_group_by_user.groupby(by = 'Occupation')['Purchase'].sum().sort_values(ascending=False).head(10)

Occupation
4     657530393
0     625814811
7     549282744
1     414552829
17    387240355
12    300672105
20    292276985
14    255594745
16    234442330
2     233275393
Name: Purchase, dtype: int64

In [89]:
geral_group_by_user.groupby(by = 'Occupation')['Purchase'].describe().sort_values('mean', ascending=False).head(10)

Occupation,count,mean,std,min,25%,50%,75%,max
20,273.0,1070612.0,1154509.0,64526.0,296701.0,680293.0,1407088.0,8699232.0
19,71.0,1029796.0,1009385.0,77353.0,327798.0,633162.0,1350664.5,4836540.0
5,111.0,1013742.0,1038643.0,65645.0,269102.0,551853.0,1487262.0,4689382.0
16,235.0,997626.9,1130652.0,49104.0,256533.5,609448.0,1328137.5,10536783.0
3,170.0,943696.8,1010049.0,61580.0,241660.75,556793.0,1279912.5,6511302.0
2,256.0,911232.0,886681.3,57793.0,242439.5,572798.5,1386000.5,4642305.0
0,688.0,909614.6,995364.9,44432.0,234463.75,498255.5,1121956.25,6126540.0
18,67.0,899249.3,1061300.0,97953.0,205740.0,588712.0,1007572.0,6044178.0
4,740.0,888554.6,941804.2,60660.0,248872.0,556233.0,1144994.0,6476786.0
14,294.0,869369.9,947044.2,44108.0,249832.75,516149.0,1108524.75,6565878.0


In [90]:
geral_group_by_user.groupby(by = 'City_Category')['Purchase'].sum()

City_Category
A    1295668797
B    2083431612
C    1638567969
Name: Purchase, dtype: int64

In [91]:
geral_group_by_user.groupby(by = 'City_Category')['Purchase'].describe()

City_Category,count,mean,std,min,25%,50%,75%,max
A,1045.0,1239874.0,1362553.0,45551.0,281780.0,730131.0,1672669.0,10536783.0
B,1707.0,1220522.0,1054555.0,44432.0,355724.5,873346.0,1876029.0,5327346.0
C,3139.0,522003.2,422754.3,44108.0,196893.5,378037.0,727077.0,2456078.0


In [92]:
geral_group_by_user.groupby(by = 'Stay_In_Current_City_Years')['Purchase'].sum()

Stay_In_Current_City_Years
0      672505429
1     1763243917
2      934676626
3      872531130
4+     774711276
Name: Purchase, dtype: int64

In [93]:
geral_group_by_user.groupby(by = 'Stay_In_Current_City_Years')['Purchase'].describe()

Stay_In_Current_City_Years,count,mean,std,min,25%,50%,75%,max
0,772.0,871121.022021,1016661.0,44432.0,229245.25,525988.5,1065989.25,10536783.0
1,2086.0,845275.127996,913984.0,45551.0,232679.75,509719.0,1115229.0,7577505.0
2,1145.0,816311.463755,843417.4,49104.0,239413.0,492235.0,1092133.0,5985405.0
3,979.0,891247.3238,997781.0,44108.0,238062.0,542807.0,1132382.0,8699232.0
4+,909.0,852267.630363,937305.4,46070.0,243535.0,520203.0,1075149.0,6511302.0


In [94]:
geral_group_by_user.groupby(by = 'Marital_Status')['Purchase'].sum()

Marital_Status
0    2966289500
1    2051378878
Name: Purchase, dtype: int64

In [95]:
geral_group_by_user.groupby(by = 'Marital_Status')['Purchase'].describe()

Marital_Status,count,mean,std,min,25%,50%,75%,max
0,3417.0,868097.600234,941075.547299,44432.0,239367.0,535792.0,1116522.0,10536783.0
1,2474.0,829174.970897,921437.493612,44108.0,231294.25,496568.0,1076033.25,8699232.0
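
The group summaries above are hard to compare across variables by eye. One possible way to put them on a common scale is the correlation ratio (eta squared): the share of the variance in `Purchase` explained by the differences between group means. This technique is brought in here for illustration and is not part of the notebook; the helper name `eta_squared` and the toy data are hypothetical.

```python
import pandas as pd

def eta_squared(df: pd.DataFrame, group_col: str, value_col: str) -> float:
    """Between-group sum of squares divided by total sum of squares (0 to 1)."""
    grand_mean = df[value_col].mean()
    ss_total = ((df[value_col] - grand_mean) ** 2).sum()
    grouped = df.groupby(group_col)[value_col]
    # Each group contributes its size times the squared distance of its mean
    ss_between = (grouped.size() * (grouped.mean() - grand_mean) ** 2).sum()
    return float(ss_between / ss_total)

# Toy user-level table: spend differs a lot by city and only a little by gender
users = pd.DataFrame({
    'City_Category': ['A', 'A', 'C', 'C'],
    'Gender': ['F', 'M', 'F', 'M'],
    'Purchase': [10.0, 12.0, 2.0, 4.0],
})

ranking = pd.Series(
    {col: eta_squared(users, col, 'Purchase') for col in ['City_Category', 'Gender']}
).sort_values(ascending=False)
print(ranking)
```

A value near 1 means group membership accounts for most of the spread and a value near 0 means it accounts for very little. The measure is descriptive only, and it is undefined when `Purchase` is constant.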
