# Data analysis of Marketing campaign (toy model)
#### Exploratory Data Analysis with python

The dataset used was picked up from [Kaggle](https://www.kaggle.com/jackdaoud/marketing-data). Is an online community of data scientists and machine learning practitioners, is pretty awesome so please take a look at its datasets!

Let's start with this. This dataset was uploaded by Jack Daoud.

#### Description of the dataset
This dataset was provided to students for the final project of Msc. in Business Analytics in order to test their statistical analysis skills.

### Section 01
##### _Exploratory Data analysis_
In this section we are going to answer some questions like: 
- Are there any null values or outliers? How can you handle them?
- Are there any variables that warrant transformations?
- Are there any useful variables that you can engineer with the given data?
- Do you notice any patterns of anormalies in the data? Can you plot them?

As you can see, this section involves some cleaning, exploration, and data visualization, the basics steps for data analysis.

The first thing we have to do is to load the packages we're going to use with the following command.

In [34]:
import pandas as pd;
import numpy as np;
import matplotlib.pyplot as plt
import seaborn as sns;
from datetime import datetime

Now we have to read the csv file and and view the data.
- Dataset info:

In [4]:
df = pd.read_csv('C:/Users/simon/Documents/Projects/Marketing/marketing_data.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 28 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   ID                   2240 non-null   int64 
 1   Year_Birth           2240 non-null   int64 
 2   Education            2240 non-null   object
 3   Marital_Status       2240 non-null   object
 4    Income              2216 non-null   object
 5   Kidhome              2240 non-null   int64 
 6   Teenhome             2240 non-null   int64 
 7   Dt_Customer          2240 non-null   object
 8   Recency              2240 non-null   int64 
 9   MntWines             2240 non-null   int64 
 10  MntFruits            2240 non-null   int64 
 11  MntMeatProducts      2240 non-null   int64 
 12  MntFishProducts      2240 non-null   int64 
 13  MntSweetProducts     2240 non-null   int64 
 14  MntGoldProds         2240 non-null   int64 
 15  NumDealsPurchases    2240 non-null   int64 
 16  NumWeb

Looks like the _'Income'_ column have a blank space, we have to correct that. Also we have to change the _'Income'_ column from an object Dtype to a float Dtype, but first let's take a look at the head of the dataset


In [21]:
df.columns = df.columns.str.replace(' ', '')
df.head()

Unnamed: 0,ID,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,...,NumStorePurchases,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Response,Complain,Country
0,1826,1970,Graduation,Divorced,"$84,835.00",0,0,6/16/14,0,189,...,6,1,0,0,0,0,0,1,0,SP
1,1,1961,Graduation,Single,"$57,091.00",0,0,6/15/14,0,464,...,7,5,0,0,0,0,1,1,0,CA
2,10476,1958,Graduation,Married,"$67,267.00",0,1,5/13/14,0,134,...,5,2,0,0,0,0,0,0,0,US
3,1386,1967,Graduation,Together,"$32,474.00",1,1,5/11/14,0,10,...,2,7,0,0,0,0,0,0,0,AUS
4,5371,1989,Graduation,Single,"$21,474.00",1,0,4/8/14,0,6,...,2,7,1,0,0,0,0,1,0,SP


To transform the data type of the _'Income'_ column we have to delete the '$' and ',' symbols.

In [33]:
df.Income = df.Income.str.replace('$', '')
df.Income = df.Income.str.replace(',', '')
df.Income.astype(float);


Lastly we ar going to transform the _'Dt_Customer'_ column from an object data type to a datatime and look for nan values in the dataset.

In [40]:
df.Dt_Customer = pd.to_datetime(df.Dt_Customer)
pd.isna(df).sum()

ID                      0
Year_Birth              0
Education               0
Marital_Status          0
Income                 24
Kidhome                 0
Teenhome                0
Dt_Customer             0
Recency                 0
MntWines                0
MntFruits               0
MntMeatProducts         0
MntFishProducts         0
MntSweetProducts        0
MntGoldProds            0
NumDealsPurchases       0
NumWebPurchases         0
NumCatalogPurchases     0
NumStorePurchases       0
NumWebVisitsMonth       0
AcceptedCmp3            0
AcceptedCmp4            0
AcceptedCmp5            0
AcceptedCmp1            0
AcceptedCmp2            0
Response                0
Complain                0
Country                 0
dtype: int64

In this case we have 24 empty values in the _'Income'_ column, since is just a little amount of values missing, we can just drop them and keep the rest for further analysis. Also we ar going to drop the _'ID'_ column because it doesn't give us relevant information.

In [51]:
df_clean = df[pd.notna(df.Income)]
df_clean.drop('ID', axis = 1, inplace = True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().drop(


With the cleaned data we can start our exploration throught the data with the `decribe()` function, this fucntion give us the basic statistics of the dataframe such as:
- Minimum value
- Maximum value 
- Mean
- Standard deviation

In [52]:
df_clean.describe()

Unnamed: 0,Year_Birth,Kidhome,Teenhome,Recency,MntWines,MntFruits,MntMeatProducts,MntFishProducts,MntSweetProducts,MntGoldProds,...,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Response,Complain
count,2216.0,2216.0,2216.0,2216.0,2216.0,2216.0,2216.0,2216.0,2216.0,2216.0,...,2216.0,2216.0,2216.0,2216.0,2216.0,2216.0,2216.0,2216.0,2216.0,2216.0
mean,1968.820397,0.441787,0.505415,49.012635,305.091606,26.356047,166.995939,37.637635,27.028881,43.965253,...,2.671029,5.800993,5.319043,0.073556,0.074007,0.073105,0.064079,0.013538,0.150271,0.009477
std,11.985554,0.536896,0.544181,28.948352,337.32792,39.793917,224.283273,54.752082,41.072046,51.815414,...,2.926734,3.250785,2.425359,0.261106,0.261842,0.260367,0.24495,0.115588,0.357417,0.096907
min,1893.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1959.0,0.0,0.0,24.0,24.0,2.0,16.0,3.0,1.0,9.0,...,0.0,3.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,1970.0,0.0,0.0,49.0,174.5,8.0,68.0,12.0,8.0,24.5,...,2.0,5.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,1977.0,1.0,1.0,74.0,505.0,33.0,232.25,50.0,33.0,56.0,...,4.0,8.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,1996.0,2.0,2.0,99.0,1493.0,199.0,1725.0,259.0,262.0,321.0,...,28.0,13.0,20.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


We can sum the columns of kidhome and teenhome to know the total numbers of dependents of each customer, also we calculate the total amount spent and the total purchases.

In [54]:
df_clean['Mnt'] = df_clean[[col for col in df_clean.columns if 'Mnt' in col]].sum(axis = 1) 

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_clean['Mnt'] = df_clean[[col for col in df_clean.columns if 'Mnt' in col]].sum(axis = 1)


In [55]:
df

0       1190
1        577
2        251
3         11
4         91
        ... 
2235     689
2236      55
2237     309
2238    1383
2239    1078
Name: Mnt, Length: 2216, dtype: int64