### Customer Profiling

This activity gives you practice exploring data with visualizations built in `matplotlib`, `seaborn`, and `plotly`. The dataset contains customer demographics, amounts spent on each product category, responses to promotional campaigns, and the channels through which purchases were made. A complete data dictionary appears below.

Your task is to explore the data and use visualizations to answer specific questions about it. Post each question and its visualization in the group discussion for this activity. Some example questions to explore:

-----

- Does income differentiate customers who purchase wine? 
- Which customers were more likely to accept the offer in the last promotional campaign?
- Are customers with children more likely to purchase products online?
- Do married people purchase more wine?
- What kinds of purchases led to customer complaints?

-----

#### Data Dictionary

Attributes


```
ID: Customer's unique identifier
Year_Birth: Customer's birth year
Education: Customer's education level
Marital_Status: Customer's marital status
Income: Customer's yearly household income
Kidhome: Number of children in customer's household
Teenhome: Number of teenagers in customer's household
Dt_Customer: Date of customer's enrollment with the company
Recency: Number of days since customer's last purchase
Complain: 1 if customer complained in the last 2 years, 0 otherwise


MntWines: Amount spent on wine in last 2 years
MntFruits: Amount spent on fruits in last 2 years
MntMeatProducts: Amount spent on meat in last 2 years
MntFishProducts: Amount spent on fish in last 2 years
MntSweetProducts: Amount spent on sweets in last 2 years
MntGoldProds: Amount spent on gold in last 2 years

Promotion

AcceptedCmp1: 1 if customer accepted the offer in the 1st campaign, 0 otherwise
AcceptedCmp2: 1 if customer accepted the offer in the 2nd campaign, 0 otherwise
AcceptedCmp3: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
AcceptedCmp4: 1 if customer accepted the offer in the 4th campaign, 0 otherwise
AcceptedCmp5: 1 if customer accepted the offer in the 5th campaign, 0 otherwise
Response: 1 if customer accepted the offer in the last campaign, 0 otherwise


NumWebPurchases: Number of purchases made through the company’s web site
NumCatalogPurchases: Number of purchases made using a catalogue
NumStorePurchases: Number of purchases made directly in stores
NumWebVisitsMonth: Number of visits to company’s web site in the last month
```

In [24]:
import numpy as np
import pandas as pd

import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
from plotly.figure_factory import create_table
from plotly.subplots import make_subplots
import plotly.graph_objects as go

In [25]:
df = pd.read_csv('data/marketing_campaign.csv', sep = '\t')

In [26]:
df.head(2)

Unnamed: 0,ID,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,...,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Z_CostContact,Z_Revenue,Response
0,5524,1957,Graduation,Single,58138.0,0,0,04-09-2012,58,635,...,7,0,0,0,0,0,0,3,11,1
1,2174,1954,Graduation,Single,46344.0,1,1,08-03-2014,38,11,...,5,0,0,0,0,0,0,3,11,0


In [27]:
# Unfinished idea: wine spending ordered by income
# df[['Income', 'MntWines']].sort_values(by='Income')

In [28]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 29 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   2240 non-null   int64  
 1   Year_Birth           2240 non-null   int64  
 2   Education            2240 non-null   object 
 3   Marital_Status       2240 non-null   object 
 4   Income               2216 non-null   float64
 5   Kidhome              2240 non-null   int64  
 6   Teenhome             2240 non-null   int64  
 7   Dt_Customer          2240 non-null   object 
 8   Recency              2240 non-null   int64  
 9   MntWines             2240 non-null   int64  
 10  MntFruits            2240 non-null   int64  
 11  MntMeatProducts      2240 non-null   int64  
 12  MntFishProducts      2240 non-null   int64  
 13  MntSweetProducts     2240 non-null   int64  
 14  MntGoldProds         2240 non-null   int64  
 15  NumDealsPurchases    2240 non-null   i
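
The `df.info()` output shows 2216 non-null `Income` values out of 2240 rows, and `Dt_Customer` stored as `object`. A minimal sketch of how both could be handled, using a small hypothetical frame (`toy`) in place of the real data. The day-month-year format is inferred from values such as `21-08-2013` in the preview and is an assumption about the full file:

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the marketing data
toy = pd.DataFrame({
    "Income": [58138.0, None, 71613.0],
    "Dt_Customer": ["04-09-2012", "08-03-2014", "21-08-2013"],
})

# Count missing values per column
missing = toy.isna().sum()

# Parse enrollment dates, assuming day-month-year strings
toy["Dt_Customer"] = pd.to_datetime(toy["Dt_Customer"], format="%d-%m-%Y")

# Drop rows without income only for analyses that actually need it
toy_income = toy.dropna(subset=["Income"])
```

Dropping rows per analysis, rather than once for the whole frame, keeps customers with a missing income available for questions that do not involve income.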

In [29]:
df_w = df.groupby('Education')[['MntWines','Income']].sum()
df_w['Avg_Income'] = df.groupby('Education')[['Income']].mean().round(1)
df_w = df_w.sort_values(by='Avg_Income', ascending=False)
create_table(df_w, index=True, index_title='Education')
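
Summed `Income` within an education level mostly reflects how many customers are in that level, so per-customer averages may be easier to compare. A hedged sketch using named aggregations on a hypothetical frame (`toy`, with values invented for illustration):

```python
import pandas as pd

# Hypothetical data; not taken from the marketing file
toy = pd.DataFrame({
    "Education": ["PhD", "PhD", "Basic", "Graduation", "Graduation"],
    "MntWines": [400, 600, 10, 300, 200],
    "Income": [60000.0, 70000.0, 20000.0, 50000.0, None],
})

summary = (
    toy.groupby("Education")
    .agg(
        customers=("MntWines", "size"),   # group size
        avg_wine=("MntWines", "mean"),    # mean wine spending per customer
        avg_income=("Income", "mean"),    # missing incomes are skipped
    )
    .sort_values("avg_income", ascending=False)
)
```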

### Does income differentiate customers who purchase wine? 

In [30]:
# Marginal histograms: one looks bimodal, the other is right-skewed (long tail toward high values)
px.scatter(data_frame=df, x = 'Income', y = 'MntWines',  labels={'MntWines':'Wines'}, title='Customers who purchase wine' , color = 'Education', marginal_y = "histogram", marginal_x = "histogram")

In [31]:
# Sort by income first so each line runs left to right instead of zigzagging
px.line(data_frame=df.sort_values('Income'), x = 'Income', y = 'MntWines',  color = 'Education')

In [32]:
# Does income differentiate customers who purchase wine
px.scatter(data_frame=df, x = 'Income', y = 'MntWines', labels={'MntWines':'Wines'}, color = 'Education', log_x=True,  trendline_options=dict(log_x=True), trendline="ols",  title="Wine spending vs. income: OLS fit on log-transformed income")

In [33]:
px.scatter(data_frame=df.query('Education == "Basic"'), x = 'Income', y = 'MntWines', color = 'Education',  log_x=True, trendline="ols")
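
To put a number next to the trendlines, one option is the correlation between log income and wine spending within each education level. A minimal sketch on hypothetical data (`toy`). Using Pearson correlation on `log10(Income)` is my choice here, not something the activity prescribes:

```python
import numpy as np
import pandas as pd

# Hypothetical data; incomes double at each step, so log income is evenly spaced
toy = pd.DataFrame({
    "Education": ["Basic"] * 4 + ["PhD"] * 4,
    "Income": [10000.0, 20000.0, 40000.0, 80000.0] * 2,
    "MntWines": [5, 10, 15, 20, 100, 300, 500, 700],
})

# Rows without income cannot be log-transformed, so drop them first
toy = toy.dropna(subset=["Income"])
toy["log_income"] = np.log10(toy["Income"])

# Pearson correlation of log income with wine spending, per education level
corr_by_edu = pd.Series({
    edu: group["log_income"].corr(group["MntWines"])
    for edu, group in toy.groupby("Education")
})
```

A value near 1 within a group would suggest wine spending rises steadily with log income for that group; it does not say how large the spending is.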

Post your questions with an accompanying visualization in Canvas. Generate at least three different questions, each with its own visualization, and explain your interpretation of each visualization in complete sentences.

In [34]:
px.scatter_3d(df, x='MntWines', y= 'Income', z='Education', color='Education', log_x=True)

### Are customers with children more likely to purchase products online? 

In [35]:
df1 = df.query('Kidhome in [1,2]').dropna()[['Kidhome','Income','NumWebPurchases']]
px.scatter(df1, x = 'Income', y='NumWebPurchases', color='Kidhome', trendline='ols', log_x=True, labels={'NumWebPurchases':'Purchases'}, title='Web Purchases of customers with children')

In [36]:
df3 = df.query('Kidhome in [1,2]').dropna()[['Kidhome','Income','NumStorePurchases']]
px.scatter(df3, x = 'Income', y='NumStorePurchases',labels={'NumStorePurchases':'Purchases'}, color='Kidhome', log_x=True, trendline='ols', title='Store Purchases of customers with children')

In [37]:
df2 = df.query('Kidhome in [1,2]').dropna()[['Kidhome','Income','NumCatalogPurchases']]
px.scatter(df2, x = 'Income', y='NumCatalogPurchases', labels={'NumCatalogPurchases':'Purchases'}, color='Kidhome', trendline='ols', log_x=True,  title='Catalog Purchases of customers with children')

In [38]:
df_t = df.groupby('Kidhome')[['NumWebPurchases','NumCatalogPurchases','NumStorePurchases']].sum()
df_t['Avg. Income'] = df.groupby('Kidhome')[['Income']].mean().round(1)
create_table(df_t, index=True, index_title='Kidhome')
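
The scatter plots above keep only households with `Kidhome` of 1 or 2, so they cannot show whether customers with children differ from those without. One possible comparison, sketched on a hypothetical frame (`toy`), is the share of each group's purchases made on the web. The `has_children` flag combining `Kidhome` and `Teenhome` is an assumption about what "with children" should mean:

```python
import pandas as pd

# Hypothetical data; not taken from the marketing file
toy = pd.DataFrame({
    "Kidhome": [0, 0, 1, 2],
    "Teenhome": [0, 1, 0, 0],
    "NumWebPurchases": [8, 4, 2, 3],
    "NumCatalogPurchases": [4, 2, 0, 1],
    "NumStorePurchases": [8, 6, 2, 6],
})

channels = ["NumWebPurchases", "NumCatalogPurchases", "NumStorePurchases"]

# Label each customer by whether any child or teenager lives at home
has_any = (toy["Kidhome"] + toy["Teenhome"]) > 0
toy["has_children"] = has_any.map({True: "yes", False: "no"})

# Total purchases per channel in each group, then the web channel's share
totals = toy.groupby("has_children")[channels].sum()
web_share = totals["NumWebPurchases"] / totals.sum(axis=1)
```

Comparing shares rather than raw counts avoids rewarding whichever group simply has more customers.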

In [39]:
df = pd.read_csv('data/marketing_campaign.csv', sep = '\t')
df_l = df.query('Kidhome in [1,2]')[['Kidhome','NumWebPurchases','NumCatalogPurchases','NumStorePurchases', 'Income']]
df_r = df.groupby('Kidhome')[['Income']].agg('mean')
df_r=df_r.rename(columns={'Income':'AvgIncome'})
#df_l
df_pur = pd.merge(df_l, df_r, left_on='Kidhome', right_index=True, how='left')
df_pur

Unnamed: 0,Kidhome,NumWebPurchases,NumCatalogPurchases,NumStorePurchases,Income,AvgIncome
1,1,1,1,2,46344.0,39138.076663
3,1,2,0,4,26646.0,39138.076663
4,1,5,3,6,58293.0,39138.076663
7,1,4,0,4,33454.0,39138.076663
8,1,3,0,2,30351.0,39138.076663
...,...,...,...,...,...,...
2230,1,3,1,2,11012.0,39138.076663
2233,1,3,1,3,666666.0,39138.076663
2234,1,1,0,2,34421.0,39138.076663
2236,2,8,2,5,64014.0,39149.500000
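
The output above contains an `Income` of 666666.0, far from the other values shown, and a single point like that can pull an OLS trendline. A minimal sketch of flagging such values with the 1.5 × IQR rule, applied to the incomes visible in the output (collected in a hypothetical series named `income`). The rule and its threshold are a common convention, not something taken from this notebook:

```python
import pandas as pd

# Incomes copied from the rows displayed above
income = pd.Series([
    46344.0, 26646.0, 58293.0, 33454.0, 30351.0,
    11012.0, 666666.0, 34421.0, 64014.0,
])

# Interquartile range and the conventional 1.5 * IQR fences
q1, q3 = income.quantile([0.25, 0.75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Flag values outside the fences and keep the rest
is_outlier = (income < lower) | (income > upper)
income_trimmed = income[~is_outlier]
```

Whether to drop, cap, or keep a flagged value is a judgment call; plotting with and without it shows how much the trendline depends on that one customer.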


In [40]:
px.scatter(x=df_pur['NumWebPurchases'],y=df_pur['Income'],trendline='ols', log_y=True, labels={'y':'Income','x':'Web Purchases'}, title='Purchases of customers with children')

In [41]:
px.scatter(x=df_pur['NumCatalogPurchases'],y=df_pur['Income'], trendline='ols', log_y=True, labels={'y':'Income','x':'Catalog Purchases'}, title='Purchases of customers with children')

In [42]:
px.scatter(x=df_pur['NumStorePurchases'],y=df_pur['Income'], trendline='ols', log_y=True, labels={'y':'Income','x':'Store Purchases'}, title='Purchases of customers with children')

### Do married people purchase more wine? 

In [43]:
df

Unnamed: 0,ID,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,...,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Z_CostContact,Z_Revenue,Response
0,5524,1957,Graduation,Single,58138.0,0,0,04-09-2012,58,635,...,7,0,0,0,0,0,0,3,11,1
1,2174,1954,Graduation,Single,46344.0,1,1,08-03-2014,38,11,...,5,0,0,0,0,0,0,3,11,0
2,4141,1965,Graduation,Together,71613.0,0,0,21-08-2013,26,426,...,4,0,0,0,0,0,0,3,11,0
3,6182,1984,Graduation,Together,26646.0,1,0,10-02-2014,26,11,...,6,0,0,0,0,0,0,3,11,0
4,5324,1981,PhD,Married,58293.0,1,0,19-01-2014,94,173,...,5,0,0,0,0,0,0,3,11,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2235,10870,1967,Graduation,Married,61223.0,0,1,13-06-2013,46,709,...,5,0,0,0,0,0,0,3,11,0
2236,4001,1946,PhD,Together,64014.0,2,1,10-06-2014,56,406,...,7,0,0,0,1,0,0,3,11,0
2237,7270,1981,Graduation,Divorced,56981.0,0,0,25-01-2014,91,908,...,6,0,1,0,0,0,0,3,11,0
2238,8235,1956,Master,Together,69245.0,0,1,24-01-2014,8,428,...,3,0,0,0,0,0,0,3,11,0


In [44]:
df_m = df.groupby('Marital_Status')[['MntWines','Income',]].sum()
df_m['Avg. Income'] = df.groupby('Marital_Status')[['Income']].mean().round(1)
create_table(df_m, index=True, index_title='Marital_Status')

In [45]:
#df_m = df_m.reset_index()
px.scatter(df.query('Marital_Status in ["Married","Together"]'), x = 'Income', y='MntWines', labels={'MntWines':'Amount spent on wine'}, color='Marital_Status', trendline='ols', log_x=True,  title='Wine purchases of married and cohabiting customers')
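
Because the marital-status groups differ in size, the summed `MntWines` in the table can favor whichever group has the most customers. Comparing per-customer mean and median spending is one alternative, sketched here on a hypothetical frame (`toy`, with values invented for illustration):

```python
import pandas as pd

# Hypothetical data; not taken from the marketing file
toy = pd.DataFrame({
    "Marital_Status": ["Married", "Married", "Married", "Single", "Single", "Together"],
    "MntWines": [100, 200, 900, 300, 500, 250],
})

# Group size, per-customer mean and median, and the total for contrast
wine_by_status = (
    toy.groupby("Marital_Status")["MntWines"]
    .agg(["count", "mean", "median", "sum"])
    .sort_values("median", ascending=False)
)
```

In this toy data the married group has the largest total but the lowest median, which is the kind of gap a sum alone would hide.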

In [46]:
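import plotly.express as px

# Plotly's built-in gapminder sample, kept in its own variable so the marketing `df` is not overwritten
gapminder = px.data.gapminder()
df_2007 = gapminder.query("year==2007")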

fig = px.scatter(df_2007,
                 x="gdpPercap", y="lifeExp", size="pop", color="continent",
                 log_x=True, size_max=60,
                 template="plotly_dark", title="Gapminder 2007: '%s' theme" % "plotly_dark")
fig.show()