### 1. Purpose of this notebook

The purpose of this notebook is to do an initial exploration and cleaning of the data, with the emphasis on cleaning. We need to ensure that the data is correct and usable.

In general, the data quality is not bad. Below are the main things we found:

- We found two columns that were not in the pdf: 'Z_Revenue' and 'Response'.
- We found two customers with a strange marital status: "Absurd". We assume this status is wrong.
- The Z_Revenue data was wrong: every row contains the same value, 11. We know that the success rate of the campaign was 15%, not 100%, so we created a new column.
- We found three customers with an implausible birth year (1900 or earlier, which would make them well over one hundred years old).
- We created a new column containing the age of each customer.
- We changed the type of some columns.



### 2. Read data

#### 2.1 Import Python packages

In [116]:
import pandas as pd
import datetime

import src.data_clean as dc
from src.paths import DATA

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


#### 2.2 Sample data

In [168]:
df = pd.read_csv(DATA / 'ml_project1_data.csv')
df.sample(5)

Unnamed: 0,ID,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,...,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Z_CostContact,Z_Revenue,Response
411,10882,1976,Graduation,Married,53858.0,0,1,2012-11-09,50,407,...,4,0,0,0,0,0,0,3,11,0
2233,9432,1977,Graduation,Together,666666.0,1,0,2013-06-02,23,9,...,6,0,0,0,0,0,0,3,11,0
1705,8799,1984,PhD,Married,38175.0,1,0,2013-09-23,6,70,...,7,0,0,0,0,0,0,3,11,0
88,8504,1973,Graduation,Married,79593.0,0,0,2014-05-12,70,350,...,2,0,0,1,0,0,0,3,11,0
775,6825,1953,Graduation,Together,41452.0,1,1,2013-03-06,86,13,...,7,0,0,0,0,0,0,3,11,0


#### 2.3 Dataframe data

In [169]:
#Total rows and columns
df.shape

(2240, 29)

In [170]:
#Column names
df.columns

Index(['ID', 'Year_Birth', 'Education', 'Marital_Status', 'Income', 'Kidhome',
       'Teenhome', 'Dt_Customer', 'Recency', 'MntWines', 'MntFruits',
       'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts',
       'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases',
       'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth',
       'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1',
       'AcceptedCmp2', 'Complain', 'Z_CostContact', 'Z_Revenue', 'Response'],
      dtype='object')

In [171]:
df.dtypes

ID                       int64
Year_Birth               int64
Education               object
Marital_Status          object
Income                 float64
Kidhome                  int64
Teenhome                 int64
Dt_Customer             object
Recency                  int64
MntWines                 int64
MntFruits                int64
MntMeatProducts          int64
MntFishProducts          int64
MntSweetProducts         int64
MntGoldProds             int64
NumDealsPurchases        int64
NumWebPurchases          int64
NumCatalogPurchases      int64
NumStorePurchases        int64
NumWebVisitsMonth        int64
AcceptedCmp3             int64
AcceptedCmp4             int64
AcceptedCmp5             int64
AcceptedCmp1             int64
AcceptedCmp2             int64
Complain                 int64
Z_CostContact            int64
Z_Revenue                int64
Response                 int64
dtype: object

### 2. Cleaning data

#### For each column, we run the following checks:
- check for null data
- check for dirty data
- check for duplicate data
- check if the data type is correct
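
The helper `dc.look_quality_of_the_column_data` used below lives in `src/data_clean.py`, which is not shown in this notebook. As a rough sketch of what such a per-column report might compute, assuming it reports the dtype, the null count, whether any value repeats, and the sorted unique values (the name `summarize_column`, the `max_chars` parameter, and the exact logic are hypothetical):

```python
import pandas as pd


def summarize_column(df: pd.DataFrame, column: str, max_chars: int = 1000) -> str:
    """Build a short quality report for one column (hypothetical helper)."""
    series = df[column]
    # Sorted unique non-null values, joined into one string for display
    values = ", ".join(str(v) for v in sorted(series.dropna().unique()))
    lines = [
        f"Column name: {column}",
        f"Data type: {series.dtype}",
        f"Null data: {series.isna().sum()}",
        f"Duplicate data: {series.duplicated().any()}",
        "Values:",
        values[:max_chars],  # truncate long value lists
    ]
    return "\n".join(lines)


# Tiny illustrative frame, not the project data
toy = pd.DataFrame({"Kidhome": [0, 1, 1, 2]})
report = summarize_column(toy, "Kidhome")
print(report)
```

Truncating the value list keeps the report readable for high-cardinality columns, at the cost of hiding the largest values.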

#### ID

In [172]:
dc.look_quality_of_the_column_data(df, 'ID')

Column name: ID
Data type: int64
Null data: 0
Duplicate data: False
Values:
0, 1, 9, 13, 17, 20, 22, 24, 25, 35, 48, 49, 55, 67, 73, 75, 78, 87, 89, 92, 113, 115, 123, 125, 143, 146, 153, 158, 164, 175, 176, 178, 182, 193, 194, 195, 199, 202, 203, 213, 217, 221, 231, 232, 234, 236, 238, 241, 246, 247, 252, 254, 255, 257, 263, 269, 271, 273, 274, 286, 291, 295, 304, 309, 310, 313, 322, 326, 332, 339, 340, 347, 359, 361, 367, 368, 375, 378, 380, 387, 405, 425, 433, 437, 448, 450, 451, 453, 454, 455, 456, 460, 466, 477, 486, 492, 498, 500, 503, 520, 521, 523, 524, 528, 531, 535, 538, 544, 550, 564, 569, 574, 577, 590, 591, 606, 607, 610, 615, 618, 624, 626, 635, 640, 641, 642, 663, 675, 679, 692, 697, 701, 702, 709, 713, 716, 736, 737, 738, 749, 760, 762, 771, 773, 793, 796, 798, 800, 803, 807, 810, 819, 821, 830, 832, 833, 837, 839, 843, 849, 850, 851, 868, 873, 879, 880, 891, 895, 898, 902, 905, 907, 916, 922, 933, 938, 940, 942, 944, 945, 946, 948, 954, 955, 961, 965, 966, 967, 968, 97

IDs are identifiers, not quantities, so the correct type is string. Otherwise, everything is OK!

In [173]:
df = df.assign(ID=df['ID'].astype(str))

#### Year_Birth

In [174]:
dc.look_quality_of_the_column_data(df, 'Year_Birth')

Column name: Year_Birth
Data type: int64
Null data: 0
Duplicate data: True
Values:
1893, 1899, 1900, 1940, 1941, 1943, 1944, 1945, 1946, 1947, 1948, 1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996


There is something strange: 3 customers have a birth year less than or equal to 1900. We can't consider these 3 customers when analyzing the year of birth.
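
One way to keep these rows in the dataset while excluding them from any birth-year analysis is to flag them. The sketch below assumes a cutoff of 1900 (taken from the observation above) and uses a hypothetical column name, `Year_Birth_valid`; the toy frame stands in for the real data.

```python
import pandas as pd

# Toy rows standing in for the real data
toy = pd.DataFrame({
    "ID": ["1", "2", "3", "4"],
    "Year_Birth": [1893, 1900, 1957, 1984],
})

# Assumption: birth years of 1900 or earlier are data-entry errors
toy = toy.assign(Year_Birth_valid=toy["Year_Birth"] > 1900)

# Restrict birth-year analysis to the plausible rows
plausible = toy.loc[toy["Year_Birth_valid"], "Year_Birth"]
print(plausible.describe())
```

Flagging rather than dropping keeps the other columns of these customers available for the rest of the analysis.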

#### Education

In [175]:
dc.look_quality_of_the_column_data(df, 'Education')

Column name: Education
Data type: object
Null data: 0
Duplicate data: True
Values:
2n Cycle, Basic, Graduation, Master, PhD


Everything is OK!

#### Marital_Status

In [176]:
dc.look_quality_of_the_column_data(df, 'Marital_Status')

Column name: Marital_Status
Data type: object
Null data: 0
Duplicate data: True
Values:
Absurd, Alone, Divorced, Married, Single, Together, Widow, YOLO


One marital status looks strange: "Absurd". It appears in two rows:

In [177]:
df['Marital_Status'].groupby(df['Marital_Status']).count()

Marital_Status
Absurd        2
Alone         3
Divorced    232
Married     864
Single      480
Together    580
Widow        77
YOLO          2
Name: Marital_Status, dtype: int64
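
If "Absurd" is treated as invalid, one option is to replace it with a missing value so that it does not form its own category later. This is only a sketch of that option, not something this notebook does; the list `invalid_statuses` is an assumption and the toy frame stands in for the real data.

```python
import pandas as pd

toy = pd.DataFrame({"Marital_Status": ["Married", "Absurd", "Single", "Absurd"]})

# Assumption: these labels are not real marital statuses
invalid_statuses = ["Absurd"]

# Series.mask replaces the values where the condition is True with a missing value
cleaned = toy["Marital_Status"].mask(toy["Marital_Status"].isin(invalid_statuses))
print(cleaned.value_counts(dropna=False))
```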

#### Income

In [178]:
dc.look_quality_of_the_column_data(df, 'Income')

Column name: Income
Data type: float64
Null data: 24
Duplicate data: True
Values:
1730.0, 2447.0, 3502.0, 4023.0, 4428.0, 4861.0, 5305.0, 5648.0, 6560.0, 6835.0, 7144.0, 7500.0, 8028.0, 8820.0, 8940.0, 9255.0, 9548.0, 9722.0, 10245.0, 10404.0, 10979.0, 11012.0, 11448.0, 12393.0, 12571.0, 13084.0, 13260.0, 13533.0, 13624.0, 13672.0, 13724.0, 14045.0, 14188.0, 14421.0, 14515.0, 14661.0, 14796.0, 14849.0, 14906.0, 14918.0, 15033.0, 15038.0, 15056.0, 15072.0, 15253.0, 15287.0, 15315.0, 15345.0, 15716.0, 15759.0, 15862.0, 16005.0, 16014.0, 16185.0, 16248.0, 16269.0, 16529.0, 16531.0, 16581.0, 16626.0, 16653.0, 16813.0, 16860.0, 16927.0, 17003.0, 17117.0, 17144.0, 17148.0, 17256.0, 17323.0, 17345.0, 17459.0, 17487.0, 17649.0, 17688.0, 18100.0, 18169.0, 18222.0, 18227.0, 18351.0, 18358.0, 18393.0, 18492.0, 18589.0, 18690.0, 18701.0, 18746.0, 18793.0, 18890.0, 18929.0, 18978.0, 18988.0, 19107.0, 19329.0, 19346.0, 19414.0, 19419.0, 19444.0, 19485.0, 19510.0, 19514.0, 19656.0, 19740.0, 19789.0, 

There are 24 null values in this column.
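
How to treat the missing incomes is left open here. As a hedged sketch, the share of missing values can be measured first, and one possible treatment (an assumption, not a choice made in this notebook) is to fill the gaps with the median. The toy frame stands in for the real data.

```python
import pandas as pd

toy = pd.DataFrame({"Income": [58138.0, None, 71613.0, 26646.0]})

# Boolean mask of the rows with no income recorded
missing = toy["Income"].isna()
print(f"Missing incomes: {missing.sum()} ({missing.mean():.1%} of rows)")

# One possible treatment (assumption): fill gaps with the median,
# which is less sensitive to extreme incomes than the mean
filled = toy["Income"].fillna(toy["Income"].median())
```

The median is suggested over the mean because the column contains at least one very large value (the `describe()` output further down reports a maximum of 666666.0).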

#### Kidhome

In [179]:
dc.look_quality_of_the_column_data(df, 'Kidhome')

Column name: Kidhome
Data type: int64
Null data: 0
Duplicate data: True
Values:
0, 1, 2


Everything is OK!

#### Teenhome

In [180]:
dc.look_quality_of_the_column_data(df, 'Teenhome')

Column name: Teenhome
Data type: int64
Null data: 0
Duplicate data: True
Values:
0, 1, 2


Everything is OK!

#### Dt_Customer

In [181]:
dc.look_quality_of_the_column_data(df, 'Dt_Customer')

Column name: Dt_Customer
Data type: object
Null data: 0
Duplicate data: True
Values:
2012-07-30, 2012-07-31, 2012-08-01, 2012-08-02, 2012-08-03, 2012-08-04, 2012-08-05, 2012-08-06, 2012-08-07, 2012-08-08, 2012-08-09, 2012-08-10, 2012-08-11, 2012-08-12, 2012-08-13, 2012-08-14, 2012-08-15, 2012-08-16, 2012-08-17, 2012-08-18, 2012-08-19, 2012-08-20, 2012-08-21, 2012-08-22, 2012-08-23, 2012-08-24, 2012-08-25, 2012-08-26, 2012-08-27, 2012-08-28, 2012-08-29, 2012-08-30, 2012-08-31, 2012-09-01, 2012-09-02, 2012-09-03, 2012-09-04, 2012-09-05, 2012-09-06, 2012-09-07, 2012-09-08, 2012-09-09, 2012-09-10, 2012-09-11, 2012-09-12, 2012-09-14, 2012-09-15, 2012-09-17, 2012-09-18, 2012-09-19, 2012-09-20, 2012-09-21, 2012-09-22, 2012-09-23, 2012-09-24, 2012-09-25, 2012-09-26, 2012-09-27, 2012-09-28, 2012-09-29, 2012-09-30, 2012-10-01, 2012-10-02, 2012-10-04, 2012-10-05, 2012-10-06, 2012-10-07, 2012-10-09, 2012-10-10, 2012-10-11, 2012-10-12, 2012-10-13, 2012-10-14, 2012-10-15, 2012-10-16, 2012-10-17, 201

This column should be stored as datetime.

In [182]:
df = (df.assign(
         Dt_Customer = pd.to_datetime(df['Dt_Customer']))
)

Check whether any customers have an enrollment year earlier than or equal to their birth year.

In [183]:
df[df.Year_Birth >= df.Dt_Customer.dt.year]

Unnamed: 0,ID,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,...,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Z_CostContact,Z_Revenue,Response


There are no such cases.

#### Recency

In [184]:
dc.look_quality_of_the_column_data(df, 'Recency')

Column name: Recency
Data type: int64
Null data: 0
Duplicate data: True
Values:
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99


Everything is OK!
Some customers have gone more than 90 days without a purchase.

#### MntWines

In [185]:
dc.look_quality_of_the_column_data(df, 'MntWines')

Column name: MntWines
Data type: int64
Null data: 0
Duplicate data: True
Values:
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 115, 116, 117, 120, 121, 122, 123, 124, 125, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 138, 139, 140, 141, 143, 144, 145, 146, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 191, 192, 194, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 215, 216, 217, 218, 219, 220, 221

Everything is OK!

#### MntFruits

In [186]:
dc.look_quality_of_the_column_data(df, 'MntFruits')

Column name: MntFruits
Data type: int64
Null data: 0
Duplicate data: True
Values:
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 111, 112, 114, 115, 117, 120, 122, 123, 124, 126, 127, 129, 130, 131, 132, 133, 134, 137, 138, 140, 142, 143, 144, 147, 148, 149, 151, 152, 153, 154, 155, 159, 160, 161, 162, 163, 164, 166, 168, 169, 172, 174, 178, 181, 183, 184, 185, 189, 190, 193, 194, 197, 199


Everything is OK!

#### MntMeatProducts

In [187]:
dc.look_quality_of_the_column_data(df, 'MntMeatProducts')

Column name: MntMeatProducts
Data type: int64
Null data: 0
Duplicate data: True
Values:
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 147, 149, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 188, 189, 192, 193, 194, 195, 196, 197, 199, 201, 202, 203, 204, 205, 206, 207, 208, 209, 211, 212, 213, 214, 21

Everything is OK!

#### MntFishProducts

In [188]:
dc.look_quality_of_the_column_data(df, 'MntFishProducts')

Column name: MntFishProducts
Data type: int64
Null data: 0
Duplicate data: True
Values:
0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 15, 16, 17, 19, 20, 21, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 54, 55, 56, 58, 59, 60, 61, 62, 63, 64, 65, 67, 68, 69, 71, 72, 73, 75, 76, 77, 78, 80, 81, 82, 84, 85, 86, 89, 90, 91, 93, 94, 95, 97, 98, 99, 101, 102, 103, 104, 106, 108, 110, 111, 112, 114, 115, 116, 119, 120, 121, 123, 124, 125, 127, 128, 129, 130, 132, 133, 134, 136, 137, 138, 140, 141, 142, 145, 146, 147, 149, 150, 151, 153, 156, 158, 159, 160, 162, 164, 166, 167, 168, 169, 171, 172, 173, 175, 177, 179, 180, 181, 182, 184, 185, 186, 188, 189, 192, 193, 194, 197, 198, 199, 201, 202, 205, 207, 208, 210, 212, 216, 218, 219, 220, 223, 224, 225, 227, 229, 231, 232, 234, 237, 240, 242, 246, 247, 250, 253, 254, 258, 259


Everything is OK!

#### MntSweetProducts

In [189]:
dc.look_quality_of_the_column_data(df, 'MntSweetProducts')

Column name: MntSweetProducts
Data type: int64
Null data: 0
Duplicate data: True
Values:
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 118, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 132, 133, 134, 136, 137, 138, 139, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 156, 157, 160, 161, 162, 163, 165, 166, 167, 169, 172, 173, 174, 175, 176, 178, 179, 182, 185, 187, 188, 189, 191, 192, 194, 195, 196, 197, 198, 262, 263


Everything is OK!

#### MntGoldProds

In [190]:
dc.look_quality_of_the_column_data(df, 'MntGoldProds')

Column name: MntGoldProds
Data type: int64
Null data: 0
Duplicate data: True
Values:
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 106, 107, 108, 109, 110, 111, 112, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 157, 158, 159, 160, 161, 162, 163, 165, 166, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 180, 181, 182, 183, 185, 187, 190, 191, 192, 195, 196, 197, 198, 199, 200, 203, 204, 205, 207, 210, 215, 216, 218, 219, 223, 224, 227, 229, 231, 232, 233, 241, 242, 245,

Everything is OK!

#### NumDealsPurchases

In [191]:
dc.look_quality_of_the_column_data(df, 'NumDealsPurchases')

Column name: NumDealsPurchases
Data type: int64
Null data: 0
Duplicate data: True
Values:
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15


Everything is OK!

#### NumWebPurchases

In [192]:
dc.look_quality_of_the_column_data(df, 'NumWebPurchases')

Column name: NumWebPurchases
Data type: int64
Null data: 0
Duplicate data: True
Values:
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 23, 25, 27


Everything is OK!

#### NumCatalogPurchases

In [193]:
dc.look_quality_of_the_column_data(df, 'NumCatalogPurchases')

Column name: NumCatalogPurchases
Data type: int64
Null data: 0
Duplicate data: True
Values:
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 22, 28


Everything is OK!

#### NumStorePurchases

In [194]:
dc.look_quality_of_the_column_data(df, 'NumStorePurchases')

Column name: NumStorePurchases
Data type: int64
Null data: 0
Duplicate data: True
Values:
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13


Everything is OK!

#### NumWebVisitsMonth

In [195]:
dc.look_quality_of_the_column_data(df, 'NumWebVisitsMonth')

Column name: NumWebVisitsMonth
Data type: int64
Null data: 0
Duplicate data: True
Values:
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 13, 14, 17, 19, 20


Everything is OK!

#### AcceptedCmp3

In [196]:
dc.look_quality_of_the_column_data(df, 'AcceptedCmp3')

Column name: AcceptedCmp3
Data type: int64
Null data: 0
Duplicate data: True
Values:
0, 1


Everything is OK!

#### AcceptedCmp4

In [197]:
dc.look_quality_of_the_column_data(df, 'AcceptedCmp4')

Column name: AcceptedCmp4
Data type: int64
Null data: 0
Duplicate data: True
Values:
0, 1


Everything is OK!

#### AcceptedCmp5

In [198]:
dc.look_quality_of_the_column_data(df, 'AcceptedCmp5')

Column name: AcceptedCmp5
Data type: int64
Null data: 0
Duplicate data: True
Values:
0, 1


Everything is OK!

#### AcceptedCmp1

In [199]:
dc.look_quality_of_the_column_data(df, 'AcceptedCmp1')

Column name: AcceptedCmp1
Data type: int64
Null data: 0
Duplicate data: True
Values:
0, 1


Everything is OK!

#### AcceptedCmp2

In [200]:
dc.look_quality_of_the_column_data(df, 'AcceptedCmp2')

Column name: AcceptedCmp2
Data type: int64
Null data: 0
Duplicate data: True
Values:
0, 1


Everything is OK!

#### Complain

In [201]:
dc.look_quality_of_the_column_data(df, 'Complain')

Column name: Complain
Data type: int64
Null data: 0
Duplicate data: True
Values:
0, 1


The correct type is bool. Otherwise, everything is OK!
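
The conversion itself is not shown in this notebook. A minimal sketch of how a 0/1 column could be cast to bool, with a guard that the column really contains only 0 and 1 (the list `flag_columns` is hypothetical and the toy frame stands in for the real data):

```python
import pandas as pd

toy = pd.DataFrame({"Complain": [0, 1, 0], "Response": [1, 0, 0]})

# Hypothetical list of 0/1 columns to convert
flag_columns = ["Complain"]

# Guard: only convert when the column contains nothing but 0 and 1
for col in flag_columns:
    assert toy[col].isin([0, 1]).all(), f"{col} has values other than 0/1"

toy = toy.astype({col: bool for col in flag_columns})
print(toy.dtypes)
```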

#### Z_CostContact

In [202]:
dc.look_quality_of_the_column_data(df, 'Z_CostContact')

Column name: Z_CostContact
Data type: int64
Null data: 0
Duplicate data: True
Values:
3


Apparently, the cost of contacting a customer is $3.

In [203]:
df['Z_CostContact'].sum()

6720

The sum of this column matches what was reported in the pdf.

#### Response

In [204]:
dc.look_quality_of_the_column_data(df, 'Response')

Column name: Response
Data type: int64
Null data: 0
Duplicate data: True
Values:
0, 1


Everything is OK!

#### Z_Revenue

In [205]:
dc.look_quality_of_the_column_data(df, 'Z_Revenue')

Column name: Z_Revenue
Data type: int64
Null data: 0
Duplicate data: True
Values:
11


This column contains a single value: 11. That cannot be right, because not all customers accepted the campaign.

In [206]:
df['Z_Revenue'].sum()

24640

The value above does not match what was reported in the pdf. We need to create a new column.

In [207]:
# Keep the revenue only for customers who accepted the campaign (Response == 1); 0 otherwise
df = df.assign(Z_Revenue_correct=df['Z_Revenue'].where(df['Response'] == 1, 0))
df['Z_Revenue_correct'].sum()

3674

Now the value above matches what was reported in the pdf.
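
As a quick arithmetic cross-check, the totals above can be reproduced from the row count and the constant revenue of 11. The numbers are taken from the outputs in this notebook; the variable names are only illustrative.

```python
n_customers = 2240        # rows in the dataframe (df.shape)
revenue_per_accept = 11   # constant value of Z_Revenue
corrected_revenue = 3674  # sum of Z_Revenue_correct

# Original column: every customer is counted as a success
assert n_customers * revenue_per_accept == 24640

# Corrected column: only customers with Response == 1 contribute
n_accepted = corrected_revenue // revenue_per_accept
success_rate = n_accepted / n_customers
print(n_accepted, f"{success_rate:.1%}")  # 334 14.9%
```

The resulting rate of about 14.9% is consistent with the 15% campaign success rate mentioned at the top of the notebook.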

### The necessary adjustments have been made. Now let's look at a simple distribution of data for the quantitative columns.

In [208]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Year_Birth,2240.0,1968.805804,11.984069,1893.0,1959.0,1970.0,1977.0,1996.0
Income,2216.0,52247.251354,25173.076661,1730.0,35303.0,51381.5,68522.0,666666.0
Kidhome,2240.0,0.444196,0.538398,0.0,0.0,0.0,1.0,2.0
Teenhome,2240.0,0.50625,0.544538,0.0,0.0,0.0,1.0,2.0
Recency,2240.0,49.109375,28.962453,0.0,24.0,49.0,74.0,99.0
MntWines,2240.0,303.935714,336.597393,0.0,23.75,173.5,504.25,1493.0
MntFruits,2240.0,26.302232,39.773434,0.0,1.0,8.0,33.0,199.0
MntMeatProducts,2240.0,166.95,225.715373,0.0,16.0,67.0,232.0,1725.0
MntFishProducts,2240.0,37.525446,54.628979,0.0,3.0,12.0,50.0,259.0
MntSweetProducts,2240.0,27.062946,41.280498,0.0,1.0,8.0,33.0,263.0


In [209]:
df.head()

Unnamed: 0,ID,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,...,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Z_CostContact,Z_Revenue,Response,Z_Revenue_correct
0,5524,1957,Graduation,Single,58138.0,0,0,2012-09-04,58,635,...,0,0,0,0,0,0,3,11,1,11
1,2174,1954,Graduation,Single,46344.0,1,1,2014-03-08,38,11,...,0,0,0,0,0,0,3,11,0,0
2,4141,1965,Graduation,Together,71613.0,0,0,2013-08-21,26,426,...,0,0,0,0,0,0,3,11,0,0
3,6182,1984,Graduation,Together,26646.0,1,0,2014-02-10,26,11,...,0,0,0,0,0,0,3,11,0,0
4,5324,1981,PhD,Married,58293.0,1,0,2014-01-19,94,173,...,0,0,0,0,0,0,3,11,0,0


Now that we have checked the data quality of this dataframe, we'll do the exploratory data analysis (EDA) in the next notebook.

### 3. Conclusion

- We found two columns that were not in the pdf: 'Z_Revenue' and 'Response'.
- We found two customers with a strange marital status: "Absurd". We assume this status is wrong.
- The Z_Revenue data was wrong: every row contains the same value, 11. We know that the success rate of the pilot campaign was 15%, not 100%, so we created a new column.
- We found three customers with an implausible birth year (1900 or earlier, which would make them well over one hundred years old).
- We created a new column containing the age of each customer.
- We changed the type of some columns.

In [210]:
df.to_csv(DATA / 'ml_project1_data_cleaned.csv', index=False)
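
Note that CSV does not store dtypes, so the string `ID` and datetime `Dt_Customer` conversions made above are not restored automatically when the file is read again. A minimal sketch of reading such a file back with explicit types, using an in-memory buffer and toy rows in place of the real file:

```python
import io

import pandas as pd

# Toy frame with the same kinds of dtypes as the cleaned data
toy = pd.DataFrame({
    "ID": ["55", "1234"],
    "Dt_Customer": pd.to_datetime(["2012-09-04", "2014-03-08"]),
})

# Round-trip through CSV text, as to_csv / read_csv would do with a file
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Without these arguments, ID would be read as integers and Dt_Customer as text
restored = pd.read_csv(buffer, dtype={"ID": str}, parse_dates=["Dt_Customer"])
print(restored.dtypes)
```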