# Challenge 3

In this challenge we will work on the `Orders` data set. In your work you will apply the thinking process and workflow we showed you in Challenge 2.

You are serving as a Business Intelligence Analyst at the headquarters of an international fashion goods chain store. Today your boss asked you to do two things for her:

**First, identify two groups of customers from the data set.** The first group is **VIP Customers** whose **aggregated expenses** at your global chain stores are **above the 95th percentile** (a.k.a. the 0.95 quantile). The second group is **Preferred Customers** whose **aggregated expenses** are **between the 75th and 95th percentiles**.

**Second, identify which country has the most of your VIP customers, and which country has the most of your VIP+Preferred Customers combined.**

## Q1: How to identify VIP & Preferred Customers?

We start by importing all the required libraries:

In [20]:
# Import required libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Next, import `Orders` from Ironhack's database into a dataframe variable called `orders`. Print the head of `orders` to get an overview of the data:

In [21]:
list(orders.columns)

['InvoiceNo',
 'StockCode',
 'year',
 'month',
 'day',
 'hour',
 'Description',
 'Quantity',
 'InvoiceDate',
 'UnitPrice',
 'CustomerID',
 'Country',
 'amount_spent']

In [22]:
# Load the Orders data set from a local CSV file
path = './Orders.csv'
orders = pd.read_csv(path)
# Read the column names only after `orders` exists
cols = list(orders.columns)
orders.head()

,Unnamed: 0,InvoiceNo,StockCode,year,month,day,hour,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,amount_spent
0,0,536365,85123A,2010,12,3,8,white hanging heart t-light holder,6,2010-12-01 08:26:00,2.55,17850,United Kingdom,15.3
1,1,536365,71053,2010,12,3,8,white metal lantern,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34
2,2,536365,84406B,2010,12,3,8,cream cupid hearts coat hanger,8,2010-12-01 08:26:00,2.75,17850,United Kingdom,22.0
3,3,536365,84029G,2010,12,3,8,knitted union flag hot water bottle,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34
4,4,536365,84029E,2010,12,3,8,red woolly hottie white heart.,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34


In [23]:
# Drop the 'Unnamed: 0' column because it just duplicates the index
orders.drop('Unnamed: 0',axis=1,inplace=True)

In [24]:
orders.head()

,InvoiceNo,StockCode,year,month,day,hour,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,amount_spent
0,536365,85123A,2010,12,3,8,white hanging heart t-light holder,6,2010-12-01 08:26:00,2.55,17850,United Kingdom,15.3
1,536365,71053,2010,12,3,8,white metal lantern,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34
2,536365,84406B,2010,12,3,8,cream cupid hearts coat hanger,8,2010-12-01 08:26:00,2.75,17850,United Kingdom,22.0
3,536365,84029G,2010,12,3,8,knitted union flag hot water bottle,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34
4,536365,84029E,2010,12,3,8,red woolly hottie white heart.,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34


In [25]:
orders.columns = [c.lower().replace(' ', '') for c in orders.columns] # Lowercase the names and strip spaces so columns can be accessed with dot notation
orders.head()

,invoiceno,stockcode,year,month,day,hour,description,quantity,invoicedate,unitprice,customerid,country,amount_spent
0,536365,85123A,2010,12,3,8,white hanging heart t-light holder,6,2010-12-01 08:26:00,2.55,17850,United Kingdom,15.3
1,536365,71053,2010,12,3,8,white metal lantern,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34
2,536365,84406B,2010,12,3,8,cream cupid hearts coat hanger,8,2010-12-01 08:26:00,2.75,17850,United Kingdom,22.0
3,536365,84029G,2010,12,3,8,knitted union flag hot water bottle,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34
4,536365,84029E,2010,12,3,8,red woolly hottie white heart.,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34


---

"Identify VIP and Preferred Customers" is the non-technical goal of your boss. You need to translate that goal into the technical language that data analysts use:

## How to label customers whose aggregated `amount_spent` is in a given quantile range?


We break down the main problem into several sub problems:

#### Sub Problem 1: How to aggregate the `amount_spent` for unique customers?

#### Sub Problem 2: How to select customers whose aggregated `amount_spent` is in a given quantile range?

#### Sub Problem 3: How to label selected customers as "VIP" or "Preferred"?

*Note: If you want to break down the main problem in a different way, please feel free to revise the sub problems above.*
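
To make the three sub problems concrete, here is a minimal sketch on a tiny made-up table. The frame `toy_orders` and the helper `label_customer` are hypothetical names used only for illustration, and the numbers are not taken from the real `Orders` data. It shows one way the steps could fit together, not the only valid breakdown.

```python
import pandas as pd

# Toy order lines (made-up values, not the real Orders data)
toy_orders = pd.DataFrame({
    "customerid": [1, 1, 2, 3, 3, 4, 5],
    "amount_spent": [10.0, 20.0, 5.0, 50.0, 50.0, 8.0, 300.0],
})

# Sub Problem 1: one aggregated total per unique customer
totals = toy_orders.groupby("customerid")["amount_spent"].sum()

# Sub Problem 2: quantile cut-offs computed on the per-customer totals
q75, q95 = totals.quantile([0.75, 0.95])

# Sub Problem 3: label each customer by the range its total falls in
def label_customer(total, low, high):
    if total >= high:
        return "VIP"
    if total >= low:
        return "Preferred"
    return "-"

labels = totals.apply(label_customer, args=(q75, q95))
print(pd.DataFrame({"total": totals, "label": labels}))
```

Note that the cut-offs are computed on the per-customer totals, not on the individual order lines; computing them on raw rows would answer a different question.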

Now in the workspace below, tackle each of the sub problems using the iterative problem-solving workflow. Insert cells as necessary to write your code and explain your steps.

In [35]:
# your code here
cust = orders.groupby(['customerid']).sum(numeric_only=True)
cust_orden = cust.sort_values('amount_spent', ascending=False)
cust_orden.head()


customerid,invoiceno,year,month,day,hour,quantity,unitprice,amount_spent
14646,1163267611,4182810,14191,6552,24488,197491,5176.09,280206.02
18102,243297801,866723,3746,1261,5587,64124,1940.92,259657.3
17450,188845149,677704,2292,842,4140,69993,1143.32,194550.79
16446,1688629,6033,22,11,27,80997,4.98,168472.5
14911,3196374868,11416155,46220,18930,68148,80515,26185.72,143825.06


In [38]:
per95 = np.percentile(cust_orden.amount_spent, 95) # Define the 95th and 75th percentile cut-offs
per75 = np.percentile(cust_orden.amount_spent, 75)
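
As a quick sanity check on the cut-offs, the sketch below compares `np.percentile` with the pandas `Series.quantile` method on a small made-up series (the values are assumptions, standing in for `cust_orden.amount_spent`). Both use linear interpolation by default, so they should agree; the main difference is the scale of the argument.

```python
import numpy as np
import pandas as pd

# Made-up customer totals, standing in for cust_orden.amount_spent
totals = pd.Series([5.0, 8.0, 30.0, 100.0, 300.0])

# NumPy takes percentiles on a 0-100 scale
np_q75, np_q95 = np.percentile(totals, [75, 95])

# pandas takes quantiles on a 0-1 scale
pd_q75, pd_q95 = totals.quantile([0.75, 0.95])

# Both default to linear interpolation, so the cut-offs agree
print(np_q75, pd_q75)
print(np_q95, pd_q95)
```

One practical difference: `Series.quantile` skips missing values, whereas `np.percentile` returns `nan` if the input contains a `NaN`.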

In [52]:
vip = cust_orden[cust_orden.amount_spent >= per95] # Boolean mask: keep customers at or above the 95th percentile
vip.head()

customerid,invoiceno,year,month,day,hour,quantity,unitprice,amount_spent,VIP_OR_PREFERRED
14646,1163267611,4182810,14191,6552,24488,197491,5176.09,280206.02,VIP
18102,243297801,866723,3746,1261,5587,64124,1940.92,259657.3,VIP
17450,188845149,677704,2292,842,4140,69993,1143.32,194550.79,VIP
16446,1688629,6033,22,11,27,80997,4.98,168472.5,VIP
14911,3196374868,11416155,46220,18930,68148,80515,26185.72,143825.06,VIP


In [53]:
vip.shape

(217, 9)

In [54]:
preferred = cust_orden[(cust_orden.amount_spent >= per75) & (cust_orden.amount_spent < per95)] # Boolean mask: keep customers between the two cut-offs
preferred.head()

customerid,invoiceno,year,month,day,hour,quantity,unitprice,amount_spent,VIP_OR_PREFERRED
13050,223872574,810364,2992,1278,4940,3748,1204.52,5836.86,Preferred
12720,197785584,711849,2760,1106,4153,4672,956.36,5781.73,Preferred
15218,92641005,333826,1009,302,1823,3329,513.44,5756.89,Preferred
17686,159766169,575146,1824,905,3433,2478,1103.64,5739.46,Preferred
13178,147365548,532883,1858,872,3321,3570,542.34,5725.47,Preferred


In [55]:
preferred.shape

(868, 9)

In [56]:
def sibuencliente(x, corte75, corte95):
    # Label an aggregated spend x using the 75th and 95th percentile cut-offs
    if x >= corte95:
        return 'VIP'
    elif x >= corte75:
        return 'Preferred'
    else:
        return '-'


In [57]:
cust_orden['VIP_OR_PREFERRED'] = cust_orden.amount_spent.apply(lambda x: sibuencliente(x,per75,per95))
cust_orden.head()

customerid,invoiceno,year,month,day,hour,quantity,unitprice,amount_spent,VIP_OR_PREFERRED
14646,1163267611,4182810,14191,6552,24488,197491,5176.09,280206.02,VIP
18102,243297801,866723,3746,1261,5587,64124,1940.92,259657.3,VIP
17450,188845149,677704,2292,842,4140,69993,1143.32,194550.79,VIP
16446,1688629,6033,22,11,27,80997,4.98,168472.5,VIP
14911,3196374868,11416155,46220,18930,68148,80515,26185.72,143825.06,VIP


In [59]:
cust_orden.sample(10)

customerid,invoiceno,year,month,day,hour,quantity,unitprice,amount_spent,VIP_OR_PREFERRED
12587,2321060,8044,48,20,48,51,46.24,144.0,-
17388,8414581,30165,102,96,177,1068,70.47,1259.56,-
13774,12480468,44242,198,22,264,192,72.4,345.0,-
16014,80323550,291595,737,611,2350,265,406.35,662.38,-
14447,39313655,140770,489,277,787,571,180.36,1163.23,-
14291,133083051,476607,1729,754,2653,2749,614.51,3883.25,Preferred
17844,2822140,10055,40,20,55,52,4.99,51.56,-
15527,91633654,327775,1397,476,1940,992,541.01,2429.78,Preferred
17048,23475070,86473,124,149,571,485,159.38,925.35,-
14076,25856876,92506,368,92,690,81,92.42,122.47,-
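
The row-by-row `apply` used for labeling works, but the same labels can also be produced in a vectorized way. The sketch below uses `np.select` on a small made-up series; the values and the index are assumptions for illustration only. Conditions are checked in order, so the VIP test has to come first.

```python
import numpy as np
import pandas as pd

# Made-up per-customer totals indexed by customer id
totals = pd.Series(
    [5.0, 8.0, 30.0, 100.0, 300.0],
    index=[2, 4, 1, 3, 5],
    name="amount_spent",
)
q75, q95 = totals.quantile([0.75, 0.95])

# np.select picks the first condition that is True for each element
conditions = [totals >= q95, totals >= q75]
choices = ["VIP", "Preferred"]
labels = pd.Series(
    np.select(conditions, choices, default="-"),
    index=totals.index,
    name="VIP_OR_PREFERRED",
)
print(labels)
```

For a data set of this size either approach is fine; the vectorized form mainly pays off on much larger tables.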


Now we leave it to you to solve Q2 and Q3, building on your solution to Q1:

## Q2: How to identify which country has the most VIP Customers?

In [4]:
# your code here

In [65]:
paises = orders.groupby(['country', 'customerid']).sum()
paises.head()

country,customerid,invoiceno,stockcode,year,month,day,hour,description,quantity,invoicedate,unitprice,amount_spent
Australia,12386,5381968,2256722915229262295321906224952255522557214222...,20102,98,32,96,20 dolly pegs retrospotassorted bottle top ma...,354,2010-12-08 09:53:002010-12-08 09:53:002010-12-...,23.91,401.9
Australia,12388,55733658,84970L71459224292226247590B47590A2266922148229...,201100,592,381,1230,single heart zinc t-light holderhanging jam ja...,1462,2011-01-17 11:12:002011-01-17 11:12:002011-01-...,277.77,2780.66
Australia,12393,35452768,215812261984997B207272072622383212492237822175...,128704,315,213,732,skulls design cotton tote bagset of 6 soldie...,816,2011-01-11 09:47:002011-01-11 09:47:002011-01-...,145.9,1582.6
Australia,12415,398543981,2207822079220802207722505225162251722518225192...,1439876,4254,2169,8061,ribbon reel lace design ribbon reel hearts des...,77670,2011-01-06 11:12:002011-01-06 11:12:002011-01-...,2097.08,124914.53
Australia,12422,11563488,207282071321937219362193220717225032071285099C...,42231,85,47,189,lunch bag cars bluejumbo bag owlsstrawberry ...,195,2011-01-19 09:13:002011-01-19 09:13:002011-01-...,51.12,386.2


In [66]:
paises['VIP_OR_PREFERRED'] = paises.amount_spent.apply(lambda x: sibuencliente(x,per75,per95))
paises.head()

country,customerid,invoiceno,stockcode,year,month,day,hour,description,quantity,invoicedate,unitprice,amount_spent,VIP_OR_PREFERRED
Australia,12386,5381968,2256722915229262295321906224952255522557214222...,20102,98,32,96,20 dolly pegs retrospotassorted bottle top ma...,354,2010-12-08 09:53:002010-12-08 09:53:002010-12-...,23.91,401.9,-
Australia,12388,55733658,84970L71459224292226247590B47590A2266922148229...,201100,592,381,1230,single heart zinc t-light holderhanging jam ja...,1462,2011-01-17 11:12:002011-01-17 11:12:002011-01-...,277.77,2780.66,Preferred
Australia,12393,35452768,215812261984997B207272072622383212492237822175...,128704,315,213,732,skulls design cotton tote bagset of 6 soldie...,816,2011-01-11 09:47:002011-01-11 09:47:002011-01-...,145.9,1582.6,-
Australia,12415,398543981,2207822079220802207722505225162251722518225192...,1439876,4254,2169,8061,ribbon reel lace design ribbon reel hearts des...,77670,2011-01-06 11:12:002011-01-06 11:12:002011-01-...,2097.08,124914.53,VIP
Australia,12422,11563488,207282071321937219362193220717225032071285099C...,42231,85,47,189,lunch bag cars bluejumbo bag owlsstrawberry ...,195,2011-01-19 09:13:002011-01-19 09:13:002011-01-...,51.12,386.2,-


In [72]:
paisesindex = paises.reset_index()
paisesindex.head()

,country,customerid,invoiceno,stockcode,year,month,day,hour,description,quantity,invoicedate,unitprice,amount_spent,VIP_OR_PREFERRED
0,Australia,12386,5381968,2256722915229262295321906224952255522557214222...,20102,98,32,96,20 dolly pegs retrospotassorted bottle top ma...,354,2010-12-08 09:53:002010-12-08 09:53:002010-12-...,23.91,401.9,-
1,Australia,12388,55733658,84970L71459224292226247590B47590A2266922148229...,201100,592,381,1230,single heart zinc t-light holderhanging jam ja...,1462,2011-01-17 11:12:002011-01-17 11:12:002011-01-...,277.77,2780.66,Preferred
2,Australia,12393,35452768,215812261984997B207272072622383212492237822175...,128704,315,213,732,skulls design cotton tote bagset of 6 soldie...,816,2011-01-11 09:47:002011-01-11 09:47:002011-01-...,145.9,1582.6,-
3,Australia,12415,398543981,2207822079220802207722505225162251722518225192...,1439876,4254,2169,8061,ribbon reel lace design ribbon reel hearts des...,77670,2011-01-06 11:12:002011-01-06 11:12:002011-01-...,2097.08,124914.53,VIP
4,Australia,12422,11563488,207282071321937219362193220717225032071285099C...,42231,85,47,189,lunch bag cars bluejumbo bag owlsstrawberry ...,195,2011-01-19 09:13:002011-01-19 09:13:002011-01-...,51.12,386.2,-


In [73]:
paisesindex2= paisesindex[paisesindex.VIP_OR_PREFERRED == 'VIP']

In [76]:
paisesindex2.groupby('country').customerid.count().sort_values(ascending=False)

country
United Kingdom     177
Germany             10
France               9
Switzerland          3
Spain                2
Portugal             2
Japan                2
EIRE                 2
Finland              1
Channel Islands      1
Netherlands          1
Norway               1
Singapore            1
Denmark              1
Sweden               1
Cyprus               1
Australia            1
Name: customerid, dtype: int64
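
The VIP counts above add up to 216, while `vip.shape` reported 217 customers. One possible explanation is that grouping by `country` and `customerid` together splits the total of any customer who ordered from more than one country, so that customer can fall below the cut-off. The sketch below shows an alternative on made-up data: label customers on their overall total first, then attach a single country to each one. The "home country" rule (the country where the customer spent the most) is an assumption made for this example, not something the challenge specifies.

```python
import pandas as pd

# Toy order lines; customer 5 buys in two countries (made-up values)
toy_orders = pd.DataFrame({
    "customerid": [1, 1, 2, 3, 3, 4, 5, 5],
    "country": ["UK", "UK", "UK", "France", "France", "Germany", "UK", "Spain"],
    "amount_spent": [10.0, 20.0, 5.0, 50.0, 50.0, 8.0, 250.0, 50.0],
})

# Label on the overall per-customer total, independent of country
totals = toy_orders.groupby("customerid")["amount_spent"].sum()
q95 = totals.quantile(0.95)
vip_ids = totals[totals >= q95].index

# Assumption: a customer's home country is where they spent the most
spend = toy_orders.groupby(["customerid", "country"], as_index=False)["amount_spent"].sum()
home_country = (
    spend.sort_values("amount_spent", ascending=False)
    .drop_duplicates("customerid")
    .set_index("customerid")["country"]
)

# Count VIP customers per home country
vip_by_country = home_country.loc[vip_ids].value_counts()
print(vip_by_country)
```

In this toy data, customer 5 is a VIP on the overall total of 300, but neither the UK part (250) nor the Spain part (50) reaches the cut-off of 260 on its own, so a per-country grouping would miss that customer.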

In [77]:
paisesindexpreferred = paisesindex[paisesindex.VIP_OR_PREFERRED == 'Preferred']
paisesindexpreferred.groupby('country').customerid.count().sort_values(ascending=False)

country
United Kingdom     755
Germany             29
France              20
Belgium             11
Switzerland          6
Norway               6
Spain                5
Portugal             5
Italy                5
Finland              4
Australia            3
Channel Islands      3
Israel               2
Japan                2
Denmark              2
Cyprus               2
Greece               1
Austria              1
EIRE                 1
Lebanon              1
Malta                1
Poland               1
Sweden               1
Canada               1
Iceland              1
Name: customerid, dtype: int64

## Q3: How to identify which country has the most VIP+Preferred Customers combined?

In [5]:
# your code here
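
One possible starting point for Q3, sketched on made-up data: filter the labeled table down to customers that are either VIP or Preferred, then count unique customers per country. The frame `labelled` is a hypothetical stand-in shaped like `paisesindex`, and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical labeled table, one row per (country, customer)
labelled = pd.DataFrame({
    "country": ["UK", "UK", "UK", "France", "France", "Germany"],
    "customerid": [1, 2, 3, 4, 5, 6],
    "VIP_OR_PREFERRED": ["VIP", "Preferred", "-", "Preferred", "-", "VIP"],
})

# Keep customers carrying either label
is_top = labelled["VIP_OR_PREFERRED"].isin(["VIP", "Preferred"])

# Count unique customers per country, largest first
combined = (
    labelled[is_top]
    .groupby("country")["customerid"]
    .nunique()
    .sort_values(ascending=False)
)
print(combined)
print(combined.idxmax())
```

Applied to the real data, the same filter and count would replace the two separate per-label counts computed for Q2.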