# **Section 01:  Exploratory Data Analysis**

* Are there any null values or outliers? How will you wrangle/handle them?
* Are there any variables that warrant transformations?
* Are there any useful variables that you can engineer with the given data?
* Do you notice any patterns or anomalies in the data? Can you plot them?

In [None]:
#Load libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime

#Load dataset
df = pd.read_csv('../input/marketing-data/marketing_data.csv')

#Get to know the dataset and display all columns
pd.set_option('display.max_columns', None)
df.head()

In [None]:
#Determine number of rows and columns
df.shape

In [None]:
#Basic metrics
df.describe()

In [None]:
#Check features, datatypes and null values
df.info()

Observations:
* The column name of feature "Income" has white space at the beginning
* Feature "Income" is of datatype "object"
* Feature "Income" has null (missing) values

Next steps: 
* Remove white spaces
* Convert "Income" to datatype float
* Imputation of null values for feature "Income"
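
The three steps above can be sketched on a small toy frame. The values and the exact string format below are assumptions for illustration, not rows from the real file; `pd.to_numeric` is used here as one possible way to do the conversion.

```python
import pandas as pd

# Toy frame (assumed values): padded column name, currency strings, one missing entry
raw = pd.DataFrame({" Income ": ["$84,835.00 ", "$57,091.00 ", None, "$32,474.00 "]})

# Strip surrounding white space from the column names
raw.columns = raw.columns.str.strip()

# Remove "$" and "," as literal characters, trim padding, then convert to numbers
income = pd.to_numeric(
    raw["Income"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .str.strip()
)

# Fill the missing entry with the median of the observed values
income = income.fillna(income.median())
print(income.tolist())
```

Passing `regex=False` keeps `$` from being read as a regular-expression anchor.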


In [None]:
#Remove white spaces:
df.columns = df.columns.str.replace(' ', '')
df.info()

In [None]:
#Delete "$" (regex=False treats "$" as a literal character, not as a regex anchor)
df['Income'] = df['Income'].str.replace('$', '', regex=False)

#Delete ","
df['Income'] = df['Income'].str.replace(',', '', regex=False)

#Convert string to float
df['Income'] = df['Income'].astype('float')

In [None]:
df.info()

Observation
* Null values in feature "Income"

In [None]:
#Boxplot of "Income" 
df['Income'].plot(kind='box')

Observation:

* There are outliers in feature "Income", some customers have a very high income -> most likely natural outliers, will leave them as they are
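
A boxplot flags points by the 1.5 x IQR rule, and the same rule can be applied numerically to count the flagged incomes. The sketch below uses made-up income values, so the numbers are assumptions and only the technique carries over.

```python
import pandas as pd

# Made-up incomes with one extreme value (illustrative only)
income = pd.Series([30000, 42000, 51000, 58000, 67000, 75000, 666666], dtype=float)

# Quartiles and interquartile range
q1, q3 = income.quantile([0.25, 0.75])
iqr = q3 - q1

# Whisker limits as used by a standard boxplot
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the whiskers are the boxplot "outliers"
outliers = income[(income < lower) | (income > upper)]
print(outliers.tolist())
print(income.median())  # the median is barely affected by the extreme value
```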

In [None]:
#Impute null values with median to minimize the effect of outliers
df['Income'] = df['Income'].fillna(df['Income'].median())

In [None]:
df.info()

In [None]:
# Plot numerical variables

df_plot = df.drop(['ID', 'Education', 'Marital_Status', 'AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'Response', 'Complain'], axis=1)

df_plot.plot(subplots=True, layout=(4,4), kind='box', figsize=(12,10))

Observations:
* Outliers can be found in many columns, probably due to individual buying behaviour
* "Year_Birth" values before 1900 are not plausible

Next step:
* Convert "Year_Birth" to "Age"
* Impute values for ages >120

In [None]:
#Convert birthdate to age
import datetime
now = datetime.datetime.now()
df['Age'] = now.year - df['Year_Birth']

In [None]:
df.head()

In [None]:
df.describe()

In [None]:
#Check correlations with "Age"-feature (numeric columns only, Kendall's tau)
df_corr = df.select_dtypes(include=np.number).corr(method='kendall').unstack().sort_values(kind="quicksort", ascending=False).reset_index()
df_corr.rename(columns={"level_0": "Feature 1", "level_1": "Feature 2", 0: 'Correlation Coefficient'}, inplace=True)
df_corr[df_corr['Feature 1'] == 'Age']

Observation:
* No strong correlation of "Age" with other features -> will replace implausible ages with the median
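
The same correlation lookup is repeated for several features in this notebook, so it could be wrapped in a small helper. `correlations_with` is a hypothetical name introduced here, and the toy frame is made up; this is a sketch, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def correlations_with(frame, feature, method="pearson"):
    """Correlation of `feature` with every other numeric column, strongest first."""
    numeric = frame.select_dtypes(include=np.number)  # skip text columns
    corr = numeric.corr(method=method)[feature].drop(feature)
    return corr.sort_values(ascending=False)

# Made-up data: Income rises with Age, Visits fall with Age
toy = pd.DataFrame({
    "Age": [30, 40, 50, 60, 70],
    "Income": [20, 35, 50, 65, 80],
    "Visits": [9, 7, 5, 3, 1],
    "Education": ["PhD", "Master", "Basic", "PhD", "Master"],  # non-numeric, ignored
})

result = correlations_with(toy, "Age")
print(result)
```

Selecting numeric columns first avoids errors on text columns such as "Education" in recent pandas versions.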

In [None]:
df['Age'].median()

In [None]:
#Replace implausible ages (>120) with the median age instead of a hard-coded value
df['Age'] = np.where(df['Age'] > 120, df['Age'].median(), df['Age'])
df.describe()

Feature engineering: 

* Minors = Kidhome + Teenhome
* Total amount spent = Amount spent for wine + fruits + meat + fish + sweet + gold
* Amount spent on luxury items = Amount spent for wine + gold
* Number of all purchases = Purchases in store + catalog + web + deals
* Remote purchases = Purchases in catalog + web
* Marketing responsiveness = AcceptedCmp1 + AcceptedCmp2 + AcceptedCmp3 + AcceptedCmp4 + AcceptedCmp5 + Response

Transformation:

* Feature "Dt_Customer" converted to "Customer_since" (year of enrollment)
* Delete "Year_Birth", but keep "Age"


In [None]:
#Minors in household 
df['Minors'] = df['Kidhome'] + df['Teenhome']

#Convert Dt_Customer to Customer since
df['Customer_since'] = pd.DatetimeIndex(df['Dt_Customer']).year
df = df.drop(['Dt_Customer'], axis=1)

#Total Amount Spent
df['TotalMnt'] = df['MntWines']+df['MntFruits']+df['MntMeatProducts']+df['MntFishProducts']+df['MntSweetProducts']+df["MntGoldProds"]

#Amount spent on luxury items
df['LuxMnt'] = df['MntWines']+df["MntGoldProds"]

#Number of total purchases
df['NumPurchases'] = df['NumDealsPurchases']+df['NumWebPurchases']+df['NumCatalogPurchases']+df['NumStorePurchases']

#Number of remote purchases
df['RemPurchases']=df['NumWebPurchases']+df['NumCatalogPurchases'] 

#Marketing responsiveness
df['Responsiveness'] = df['AcceptedCmp1']+df['AcceptedCmp2']+df['AcceptedCmp3']+df['AcceptedCmp4']+df['AcceptedCmp5']+df['Response']

#Drop 'Year_Birth'
df = df.drop(['Year_Birth'], axis=1)

df.head()

In [None]:
df.columns

**Explore dataset with correlation matrix**

In [None]:
df_corr = df.drop(columns=['ID', 'Kidhome', 'Teenhome']).select_dtypes(include=np.number)

# Compute the correlation matrix
corr = df_corr.corr()

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(19, 19))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(20, 230, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmin =-1, vmax=1, annot=True, fmt='.2f' ,center=0, square=True, linewidths=.5, cbar_kws={"shrink": .5})

**Explore effects of income**

In [None]:
df_corr = df.select_dtypes(include=np.number).corr().unstack().sort_values(kind="quicksort", ascending=False).reset_index()
df_corr.rename(columns={"level_0": "Feature 1", "level_1": "Feature 2", 0: 'Correlation Coefficient'}, inplace=True)
df_corr[df_corr['Feature 1'] == 'Income']

In [None]:
sns.boxplot(x='Education', y='Income', data=df)

In [None]:
sns.boxplot(x='Minors', y='Income', data=df)

Observations

Income is strongly positively correlated with:
* total amount spent
* number of purchases
* amount spent for wines
* amount spent for meat
* total amount spent for luxury items
* number of catalog purchases 
* number of store purchases
* higher education (above basic)

Income is negatively correlated with:
* monthly website visits
* presence of minors in the household

Income is a key determinant of demand. Depending on the good, the relationship between income and demand can be direct or inverse: for luxury goods, demand rises with income, whereas for inferior goods (e.g. basic food) demand falls as income rises. 
This relationship was described in the 19th century by Ernst Engel, a German statistician. Engel’s Law states that households with lower income spend a larger proportion of their income on food than households with higher income, even though absolute dollar expenditure on food still increases with income. 
The trend in this dataset towards more purchases of luxury products at higher incomes is in line with these concepts.
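
Engel's Law can be illustrated by comparing absolute food spending with the share of income spent on food across income bands. The sketch below uses made-up numbers and a hypothetical `FoodMnt` column (the dataset has no such column; it would have to be built from the food-related "Mnt" features), so it only shows the shape of the calculation.

```python
import pandas as pd

# Made-up households: income and a hypothetical total food spending column
toy = pd.DataFrame({
    "Income": [20000, 25000, 40000, 45000, 70000, 80000, 110000, 120000],
    "FoodMnt": [400, 450, 600, 630, 840, 880, 990, 1080],
})

# Share of income spent on food
toy["FoodShare"] = toy["FoodMnt"] / toy["Income"]

# Split households into four equally sized income bands
toy["IncomeBand"] = pd.qcut(toy["Income"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])

# Mean absolute spending and mean share per band
summary = toy.groupby("IncomeBand", observed=True)[["FoodMnt", "FoodShare"]].mean()
print(summary)
```

In this toy data the absolute spending rises from band to band while the share falls, which is the pattern the law describes.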


**Effect of minors**

In [None]:
df_corr = df.select_dtypes(include=np.number).corr().unstack().sort_values(kind="quicksort", ascending=False).reset_index()
df_corr.rename(columns={"level_0": "Feature 1", "level_1": "Feature 2", 0: 'Correlation Coefficient'}, inplace=True)
df_corr[df_corr['Feature 1'] == 'Minors']

In [None]:
sns.kdeplot(data=df, x='NumPurchases', hue='Minors', fill=True)

In [None]:
sns.kdeplot(data=df, x='RemPurchases', hue='Minors', fill=True)

In [None]:
sns.kdeplot(data=df, x='NumWebVisitsMonth', hue='Minors', fill=True)

Observations

The presence of minors in the household is positively correlated with:

* number of deals purchased
* number of monthly website visits

The presence of minors in the household is negatively correlated with:

* total amount spent
* amount spent on fish
* amount spent on meat
* amount spent on sweets
* amount spent on fruits
* amount spent on luxury items
* number of catalog purchases
* number of remote purchases
* number of store purchases
* Income

The presence of minors is negatively correlated with income and, in line with Engel's law, with the total amount spent. With more minors in the household, fewer purchases are made remotely, despite a higher number of website visits.

**Customer loyalty, recency, satisfaction and campaign responsiveness**

In [None]:
df_corr = df.select_dtypes(include=np.number).corr().unstack().sort_values(kind="quicksort", ascending=False).reset_index()
df_corr.rename(columns={"level_0": "Feature 1", "level_1": "Feature 2", 0: 'Correlation Coefficient'}, inplace=True)
df_corr[df_corr['Feature 1'] == 'Responsiveness']

Observations:

Responsiveness is positively correlated with:

* amount spent on wine
* amount spent on luxury items
* total amount spent
* number of catalog purchases
* number of remote purchases
* amount spent on meat

Responsiveness is negatively correlated with:

* Income

We do not have any details on the type of marketing activities and what products/channels they promoted. The data suggest that the marketing activities had a stronger effect on purchases of luxury goods and remote purchases.

In [None]:
df_corr = df.select_dtypes(include=np.number).corr().unstack().sort_values(kind="quicksort", ascending=False).reset_index()
df_corr.rename(columns={"level_0": "Feature 1", "level_1": "Feature 2", 0: 'Correlation Coefficient'}, inplace=True)
df_corr[df_corr['Feature 1'] == 'Recency']

In [None]:
sns.boxplot(x='Customer_since', y='TotalMnt', data=df)

In [None]:
sns.boxplot(x='Responsiveness', y='Recency', data=df)

In [None]:
sns.boxplot(x='Complain', y='TotalMnt', data=df)

In [None]:
sns.boxplot(x='Responsiveness', y='NumPurchases', data=df)

In [None]:
sns.boxplot(x='Responsiveness', y='TotalMnt', data=df)

Observations:

* Total amount spent is negatively correlated with the year of enrollment -> customers who enrolled earlier have spent more, which suggests loyalty
* Customers who accepted 2 or 3 campaigns had the shortest time since their last purchase
* Customers who accepted 4 (5) campaigns made the highest (2nd highest) number of purchases and spent the highest (2nd highest) total amount
* Customers with a complaint spent less
* No strong correlations for feature "Recency"


In [None]:
sns.catplot(x='Responsiveness', kind='count', palette='Blues', data=df)

The majority of customers (more than 3/4) did not respond to any of the marketing campaigns.
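
The share behind such a statement can be computed directly instead of being read off the count plot. The sketch below uses a made-up "Responsiveness" series, so the resulting percentages are illustrative only.

```python
import pandas as pd

# Made-up responsiveness scores (number of accepted campaigns per customer)
responsiveness = pd.Series([0, 0, 0, 1, 0, 2, 0, 0, 1, 0])

# Share of customers per number of accepted campaigns
share_by_count = responsiveness.value_counts(normalize=True).sort_index()

# Share of customers who accepted no campaign at all
share_none = (responsiveness == 0).mean()

print(share_by_count)
print(share_none)
```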

In [None]:
df_corr = df.select_dtypes(include=np.number).corr().unstack().sort_values(kind="quicksort", ascending=False).reset_index()
df_corr.rename(columns={"level_0": "Feature 1", "level_1": "Feature 2", 0: 'Correlation Coefficient'}, inplace=True)
df_corr[df_corr['Feature 1'] == 'NumWebVisitsMonth']

Monthly website visits are positively correlated with:

* Number of deals purchases
* Presence of minors

Monthly website visits are negatively correlated with:

* Income
* Amount spent on meat
* Number of catalog purchases
* Number of store purchases
* Total amount spent
* Amount spent on fish
* Amount spent on sweets
* Amount spent on fruit
* Amount spent on luxury products
* Amount spent on wine



# **Section 02: Statistical Analysis**



Please run statistical tests in the form of regressions to answer these questions & propose data-driven action recommendations to your CMO. Make sure to interpret your results with non-statistical jargon so your CMO can understand your findings.

* What factors are significantly related to the number of store purchases?
* Does US fare significantly better than the Rest of the World in terms of total purchases?
* Your supervisor insists that people who buy gold are more conservative. Therefore, people who spent an above average amount on gold in the last 2 years would have more in store purchases. Justify or refute this statement using an appropriate statistical test.
* Fish has Omega 3 fatty acids which are good for the brain. Accordingly, do "Married PhD candidates" have a significant relation with amount spent on fish? What other factors are significantly related to amount spent on fish? (Hint: use your knowledge of interaction variables/effects)
* Is there a significant relationship between geographical region and success of a campaign?
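
For the last question, one possible approach is a chi-square test of independence between country and campaign acceptance. The sketch below is only an assumption about how this could be tested: the data are made up, and "Response" stands in for whichever campaign flag is of interest.

```python
import pandas as pd
from scipy import stats

# Made-up data: 40 customers per country, similar acceptance rates
toy = pd.DataFrame({
    "Country": ["US"] * 40 + ["SP"] * 40,
    "Response": [1] * 10 + [0] * 30 + [1] * 12 + [0] * 28,
})

# Contingency table: rows are countries, columns are accepted / not accepted
table = pd.crosstab(toy["Country"], toy["Response"])

# Chi-square test of independence
chi2, p_value, dof, expected = stats.chi2_contingency(table)
print(table)
print(p_value)
```

A large p-value, as in this toy example, would mean there is no evidence that acceptance depends on the country.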

In [None]:
df.columns

In [None]:
df_reg = df.drop(['ID'], axis=1)
df_reg.head()

In [None]:
def create_dummies(df,column_name):
    dummies = pd.get_dummies(df[column_name],prefix=column_name)
    df = pd.concat([df,dummies],axis=1)
    return df

df_reg = create_dummies(df_reg,"Education")
df_reg.head()

In [None]:
df_reg = create_dummies(df_reg,"Marital_Status")
df_reg = create_dummies(df_reg,"Country")
df_reg.head()

In [None]:
df_reg.columns

In [None]:
df_reg_dropped = df_reg.drop(['Education', 'Marital_Status', 'Country', 'NumStorePurchases'], axis=1)
df_reg_dropped.head()

In [None]:
X = df_reg_dropped
y = df_reg['NumStorePurchases']

In [None]:
from sklearn.model_selection import train_test_split

train_X, test_X, train_y, test_y = train_test_split(df_reg_dropped, y, test_size=0.2,random_state=42)

from sklearn.linear_model import LinearRegression 
from sklearn.metrics import mean_absolute_error

reg = LinearRegression()  # "normalize" was removed in scikit-learn 1.2 and does not change ordinary least squares predictions
reg.fit(train_X,train_y)
y_pred = reg.predict(test_X)
mean_absolute_error(test_y, y_pred)

In [None]:
import eli5
from eli5.sklearn import PermutationImportance

perm = PermutationImportance(reg, random_state=0).fit(test_X, test_y)
eli5.show_weights(perm, feature_names = test_X.columns.tolist(), top=10)

Most important features for number of store purchases are remote purchases, number of deals purchased, number of web purchases and number of catalog purchases.
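
If `eli5` is not available, scikit-learn's own `permutation_importance` offers a similar ranking. The sketch below runs on synthetic data with one informative and one uninformative feature; the column names `signal` and `noise` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: the target depends on "signal" only
rng = np.random.default_rng(0)
X = pd.DataFrame({"signal": rng.normal(size=300), "noise": rng.normal(size=300)})
y = 3.0 * X["signal"] + rng.normal(scale=0.1, size=300)

train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(train_X, train_y)

# Shuffle each feature in turn and measure how much the test score drops
result = permutation_importance(model, test_X, test_y, n_repeats=10, random_state=0)
importances = pd.Series(result.importances_mean, index=test_X.columns).sort_values(ascending=False)
print(importances)
```

When reading the ranking for store purchases, keep in mind that `NumPurchases` is defined earlier as a sum that includes `NumStorePurchases`, so part of the result may reflect that arithmetic relationship rather than customer behaviour.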

In [None]:
plt.figure()
df.groupby('Country')['NumPurchases'].sum().plot(kind='bar')
plt.title('Number of purchases per country')
plt.ylabel('Number of purchases')

The US does not fare better than the Rest of the World in terms of total purchases. Spain, South Africa, Canada, Australia and India each have a higher total number of purchases than the US.
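
Totals per country depend on how many customers each country has, so a per-customer comparison with a significance test may be a useful complement. The sketch below applies Welch's t-test to made-up data; the choice of test is an assumption, not something the original analysis used.

```python
import pandas as pd
from scipy import stats

# Made-up purchases per customer for three countries
toy = pd.DataFrame({
    "Country": ["US"] * 6 + ["SP"] * 6 + ["CA"] * 6,
    "NumPurchases": [14, 16, 15, 17, 13, 16,
                     15, 14, 16, 15, 17, 13,
                     16, 14, 15, 13, 17, 15],
})

# Split customers into US and rest of the world
is_us = toy["Country"] == "US"
us = toy.loc[is_us, "NumPurchases"]
rest = toy.loc[~is_us, "NumPurchases"]

# Welch's t-test does not assume equal variances in the two groups
t_stat, p_value = stats.ttest_ind(us, rest, equal_var=False)
print(us.mean(), rest.mean(), p_value)
```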

In [None]:
data = df[['MntGoldProds', 'NumStorePurchases']]
x = df['MntGoldProds']
y = df['NumStorePurchases']
plt.scatter(x, y)

z = np.polyfit(x, y, 1)
p = np.poly1d(z)
plt.plot(x,p(x),"r--")
plt.show()

In [None]:
from scipy import stats

tau, p_value = stats.kendalltau(df['MntGoldProds'], df['NumStorePurchases'])

p_value

People who spent an above-average amount on gold do indeed have more in-store purchases. The correlation is statistically significant; however, it does not prove that people who buy gold are more conservative and therefore prefer buying in stores.
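
The supervisor's statement compares two groups (above-average versus other gold spenders), which can also be tested directly. The sketch below uses a one-sided Mann-Whitney U test on made-up data; the choice of test is an assumption and other tests would be defensible as well.

```python
import pandas as pd
from scipy import stats

# Made-up customers: gold spending and number of store purchases
toy = pd.DataFrame({
    "MntGoldProds": [5, 10, 12, 20, 60, 80, 95, 120],
    "NumStorePurchases": [2, 3, 4, 5, 7, 8, 6, 9],
})

# Flag customers who spent more than the average amount on gold
above_avg = toy["MntGoldProds"] > toy["MntGoldProds"].mean()
high = toy.loc[above_avg, "NumStorePurchases"]
low = toy.loc[~above_avg, "NumStorePurchases"]

# One-sided test: do above-average gold spenders have more store purchases?
u_stat, p_value = stats.mannwhitneyu(high, low, alternative="greater")
print(u_stat, p_value)
```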

In [None]:
plt.figure(figsize=(25,5))
sns.boxplot(x=df['Education'], y=df['MntFishProducts'], hue=df['Marital_Status'])

Observation: 

Married PhD candidates do not spend more on fish products.
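
The hint about interaction effects could be followed by adding a product column for "married and PhD" to a regression. The sketch below builds such a column on synthetic data with known effects and fits ordinary least squares with NumPy; the column names `Married`, `PhD` and `Married_PhD` are hypothetical, and judging significance would additionally require standard errors, for example from a package such as statsmodels.

```python
import numpy as np
import pandas as pd

# Synthetic dummies for marital status and education
toy = pd.DataFrame({
    "Married": [0, 0, 1, 1, 0, 0, 1, 1],
    "PhD":     [0, 1, 0, 1, 0, 1, 0, 1],
})

# Interaction term: 1 only for customers who are both married and hold a PhD
toy["Married_PhD"] = toy["Married"] * toy["PhD"]

# Synthetic target with known effects (baseline 30, married +5, PhD -10, interaction +8)
toy["MntFishProducts"] = 30 + 5 * toy["Married"] - 10 * toy["PhD"] + 8 * toy["Married_PhD"]

# Ordinary least squares with an intercept column
design = np.column_stack([np.ones(len(toy)), toy["Married"], toy["PhD"], toy["Married_PhD"]])
coef, *_ = np.linalg.lstsq(design, toy["MntFishProducts"].to_numpy(dtype=float), rcond=None)
print(coef)
```

The last coefficient is the extra effect of being both married and a PhD holder, beyond the two separate effects.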

In [None]:
#Predictors: all features except the target; target: amount spent on fish
df_reg_dropped2 = df_reg.drop(['Education', 'Marital_Status', 'Country', 'MntFishProducts'], axis=1)
X = df_reg_dropped2
y = df_reg['MntFishProducts']

In [None]:
from sklearn.model_selection import train_test_split

train_X, test_X, train_y, test_y = train_test_split(df_reg_dropped2, y, test_size=0.2,random_state=42)

from sklearn.linear_model import LinearRegression 
from sklearn.metrics import mean_absolute_error

reg = LinearRegression()  # "normalize" was removed in scikit-learn 1.2 and does not change ordinary least squares predictions
reg.fit(train_X,train_y)
y_pred = reg.predict(test_X)
mean_absolute_error(test_y, y_pred)

In [None]:
import eli5
from eli5.sklearn import PermutationImportance

perm = PermutationImportance(reg, random_state=0).fit(test_X, test_y)
eli5.show_weights(perm, feature_names = test_X.columns.tolist(), top=10)

Main factors related to the amount spent on fish are amounts spent on meat and wines.

# **Section 03: Data Visualization**

Please plot and visualize the answers to the below questions.

* Which marketing campaign is most successful?
* What does the average customer look like for this company?
* Which products are performing best?
* Which channels are underperforming?

In [None]:
#determine acceptance rate per campaign
campaign_acceptance_rate = pd.DataFrame(df[['AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'Response']].mean()*100, columns=['Percent']).reset_index()
                     
                               
                               
# plot
fig_dims = (10, 10)
fig, ax = plt.subplots(figsize=fig_dims)
sns.barplot(x='index', y='Percent', data=campaign_acceptance_rate.sort_values('Percent'))
plt.xlabel('Campaign')
plt.ylabel('% Accepted')


The most recent campaign had by far the highest acceptance rate.

In [None]:
#average customer 

#age
age = round(df['Age'].mean())
print(age)

#income
income = round(df['Income'].mean())
print(income)

#customer since
customer_since = round(df['Customer_since'].mean())
print(customer_since)

#TotalAmountSpent
TotalAmountspent = round(df['TotalMnt'].mean())
print(TotalAmountspent)

#Responsiveness
Responsiveness = df['Responsiveness'].mean()
print(Responsiveness)

#Number of Minors in Household
Minors = df['Minors'].value_counts()
print(Minors)

#Education
Education = df['Education'].value_counts()
print(Education)

#Marital_Status
Marital_Status = df['Marital_Status'].value_counts()
print(Marital_Status)

#Recency
Recency = round(df['Recency'].mean())
print(Recency)

The average customer:

* is 52 years old
* is married
* has one minor in the household
* holds a "Graduation"-level degree
* earns around 52k USD
* spent 606 USD in total
* responded to 0.4 marketing campaigns on average
* has been a customer since 2013
* made the last purchase 49 days ago




In [None]:
#TotalAmountSpent
TotalAmountspent_sum = round(df['TotalMnt'].sum())
print("Total Revenues" + " " + str(TotalAmountspent_sum))

Wines_sum = round(df['MntWines'].sum())
print("Wine Revenues" + " " + str(Wines_sum))

Fruits_sum = round(df['MntFruits'].sum())
print("Fruits Revenues" + " " + str(Fruits_sum))

Meat_sum = round(df['MntMeatProducts'].sum())
print("Meat Revenues" + " " + str(Meat_sum))

Fish_sum = round(df['MntFishProducts'].sum())
print("Fish Revenues" + " " + str(Fish_sum))

Sweet_sum = round(df['MntSweetProducts'].sum())
print("Sweets Revenues" + " " + str(Sweet_sum))

Gold_sum = round(df['MntGoldProds'].sum())
print("Gold Revenues" + " " + str(Gold_sum))


* Wines are the best performing products
* Meat products are the 2nd best performing
* Gold products are the 3rd best performing
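
The ranking above can also be produced in one step by summing all product columns and sorting. The sketch below uses a made-up two-row frame with only four of the product columns, so the figures are illustrative only.

```python
import pandas as pd

# Made-up spending per customer for a subset of product columns
toy = pd.DataFrame({
    "MntWines": [300, 500],
    "MntMeatProducts": [200, 150],
    "MntGoldProds": [40, 60],
    "MntFruits": [20, 30],
})

# Revenue per product, best performing first
revenue = toy.sum().sort_values(ascending=False)

# Share of each product in total revenue
share = revenue / revenue.sum()

print(revenue)
print(share.round(3))
```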


In [None]:
NumCatalogPurchases_sum = round(df['NumCatalogPurchases'].sum())

NumWebPurchases_sum = round(df['NumWebPurchases'].sum())

NumStorePurchases_sum = round(df['NumStorePurchases'].sum())

NumDealsPurchases_sum = round(df['NumDealsPurchases'].sum())

piechart_channel = np.array([NumCatalogPurchases_sum, NumWebPurchases_sum, NumStorePurchases_sum, NumDealsPurchases_sum])
mylabels = ["Catalog", "Web", "Store", "Deals"]

plt.pie(piechart_channel, labels = mylabels, autopct='%1.1f%%')
plt.title("Channel performance")
plt.show() 

Store is the most successful channel, followed by Web. Catalog and Deals are the weakest channels, though we do not have any information whether the deals supported any of the other channels, e.g. special discount on web purchases.