# 1. Open dataset using pandas dataframe

In [4]:
import pandas as pd
pd.set_option('display.max_columns',60 ) #set maximum displayed number of columns 
df = pd.read_csv(r'C:\users\vdavis\vTargetMail_ICA04a.csv',header='infer' , thousands=',', encoding='utf-8') 
df.tail(5)

Unnamed: 0,CustomerKey,GeographyKey,Title,FirstName,MiddleName,LastName,NameStyle,BirthDate,MaritalStatus,Suffix,Gender,EmailAddress,YearlyIncome,TotalChildren,NumberChildrenAtHome,EnglishEducation,EnglishOccupation,HouseOwnerFlag,NumberCarsOwned,AddressLine1,AddressLine2,Phone,DateFirstPurchase,CommuteDistance,Region,Age,BikeBuyer
18479,29479,209,,Tommy,L,Tang,0,1958-07-04,M,,M,tommy2@adventure-works.com,30000.0,1,0,Graduate Degree,Clerical,1,0,"111, rue Maillard",,1 (11) 500 555-0136,2007-03-08,0-1 Miles,Europe,59,1
18480,29480,248,,Nina,W,Raji,0,1960-11-10,S,,F,nina21@adventure-works.com,30000.0,3,0,Graduate Degree,Clerical,1,0,9 Katherine Drive,,1 (11) 500 555-0146,2008-01-18,0-1 Miles,Europe,56,1
18481,29481,120,,Ivan,,Suri,0,1960-01-05,S,,M,ivan0@adventure-works.com,30000.0,3,0,Graduate Degree,Clerical,0,0,Knaackstr 4,,1 (11) 500 555-0144,2006-02-13,0-1 Miles,Europe,57,1
18482,29482,179,,Clayton,,Zhang,0,1959-03-05,M,,M,clayton0@adventure-works.com,30000.0,3,0,Bachelors,Clerical,1,0,"1080, quai de Grenelle",,1 (11) 500 555-0137,2007-03-22,0-1 Miles,Europe,58,1
18483,29483,217,,Jésus,L,Navarro,0,1959-12-08,M,,M,jésus9@adventure-works.com,30000.0,0,0,Bachelors,Clerical,1,0,"244, rue de la Centenaire",,1 (11) 500 555-0141,2007-03-13,0-1 Miles,Europe,57,1


In [5]:
df.shape

(18484, 27)

# 2. Covert datatype and data values
It can be seen that many columns that contains "Not Applicable", "NaN", "Not Available" values.
Pandas can only recognize NaN- not a number, None, NaT as missing values. Thus, we need to convert "Not Available" in numerical columns so that Pandas can recognize them.

In [6]:
import numpy as np
# Replace all occurrences of Not Available with numpy "not a number=NaN"
df = df.replace({'NaN': np.nan})

# Iterate through the columns
for col in list(df.columns):
    # Select columns that should be numeric
    if ('Age' in col or 'Total' in col or 'Key' in col or 'Income' in col or 'Number' in col):
        # Convert the data type to float
        df[col] = df[col].astype(float)

In [7]:
df.shape

(18484, 27)

# 3. Data binning

Data binning (also called Discrete binning or bucketing) is a data pre-processing technique used to reduce the effects of minor observation errors. The original data values which fall in a given small interval, a bin, are replaced by a value representative of that interval, often the central value. It is a form of quantization.

Statistical data binning is a way to group a number of more or less continuous values into a smaller number of "bins". For example, if you have data about a group of people, you might want to arrange their ages into a smaller number of age intervals (for example, grouping into young, middle-aged, or senior groups). (<i>https://en.wikipedia.org/wiki/Data_binning</i>)

Pandas offers a cut function. Use cut when you need to segment and sort data values into bins. This function is also useful for going from a continuous variable to a categorical variable. 
(<i>https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.cut.html</i>). 

If missing values exists in the column that you want to bin, you will encounter the following error message:
</br>
<i>TypeError: '<' not supported between instances of 'int' and 'NoneType'</i> then you need to deal with those missing values before applying the function.

Below: create two new BIN two columns for each of two "original" columns (eg. A and B)
For each original column, one new BIN column will be "smart" and one will be "regular"
- The "smart" bin column's bin-boundaries are choosen intelligently (by you: eg. 10-20 choices), and
- The "regular" bin column's bin-boundaries cover the space in regular intervals (eg. 10-20 choices) 

For each of above: 
the bin-list (green) shows the "boundary-points" between bins, and 
the group-names (red) are your chosen data-values to be generated for each row:
(these group-name data-values are character, but you can choose numeric values
          inside the quotes: a later python-cell will calculate correlations (numeric only))<br>
If you choose n different data-values for your group-names, you must choose n+1 boudaries in bin-list

In [8]:
binsA_SMART =        [0, 40,  50,   55,    60,    70,  120]
group_namesA_SMART = ['40', '50', '55', '60', '70', '120']
df['AgeSmartBin'] = pd.cut(df['Age'], binsA_SMART, labels = group_namesA_SMART)

binsA_REGULAR =      [0, 50,  65,   80,    95,    110,  120]
group_namesA_REGULAR = ['50', '65', '80', '95', '110', '120']
df['AgeRegularBin'] = pd.cut(df['Age'], binsA_REGULAR, labels = group_namesA_REGULAR)

df.tail(8)

Unnamed: 0,CustomerKey,GeographyKey,Title,FirstName,MiddleName,LastName,NameStyle,BirthDate,MaritalStatus,Suffix,Gender,EmailAddress,YearlyIncome,TotalChildren,NumberChildrenAtHome,EnglishEducation,EnglishOccupation,HouseOwnerFlag,NumberCarsOwned,AddressLine1,AddressLine2,Phone,DateFirstPurchase,CommuteDistance,Region,Age,BikeBuyer,AgeSmartBin,AgeRegularBin
18476,29476.0,147.0,,Elizabeth,,Bradley,0,1959-07-03,M,,F,elizabeth30@adventure-works.com,20000.0,2.0,0.0,Partial College,Manual,0,1.0,Nonnendamm 2,,1 (11) 500 555-0177,2006-01-21,0-1 Miles,Europe,58.0,1,60,65
18477,29477.0,253.0,,Neil,N,Ruiz,0,1959-07-06,M,,M,neil3@adventure-works.com,20000.0,2.0,0.0,Partial College,Manual,0,1.0,P.O. Box 9178,,1 (11) 500 555-0114,2007-12-20,1-2 Miles,Europe,58.0,1,60,65
18478,29478.0,269.0,,Darren,D,Carlson,0,1959-05-25,S,,M,darren41@adventure-works.com,30000.0,3.0,0.0,Graduate Degree,Clerical,1,0.0,5240 Premier Pl.,,1 (11) 500 555-0132,2007-12-28,0-1 Miles,Europe,58.0,1,60,65
18479,29479.0,209.0,,Tommy,L,Tang,0,1958-07-04,M,,M,tommy2@adventure-works.com,30000.0,1.0,0.0,Graduate Degree,Clerical,1,0.0,"111, rue Maillard",,1 (11) 500 555-0136,2007-03-08,0-1 Miles,Europe,59.0,1,60,65
18480,29480.0,248.0,,Nina,W,Raji,0,1960-11-10,S,,F,nina21@adventure-works.com,30000.0,3.0,0.0,Graduate Degree,Clerical,1,0.0,9 Katherine Drive,,1 (11) 500 555-0146,2008-01-18,0-1 Miles,Europe,56.0,1,60,65
18481,29481.0,120.0,,Ivan,,Suri,0,1960-01-05,S,,M,ivan0@adventure-works.com,30000.0,3.0,0.0,Graduate Degree,Clerical,0,0.0,Knaackstr 4,,1 (11) 500 555-0144,2006-02-13,0-1 Miles,Europe,57.0,1,60,65
18482,29482.0,179.0,,Clayton,,Zhang,0,1959-03-05,M,,M,clayton0@adventure-works.com,30000.0,3.0,0.0,Bachelors,Clerical,1,0.0,"1080, quai de Grenelle",,1 (11) 500 555-0137,2007-03-22,0-1 Miles,Europe,58.0,1,60,65
18483,29483.0,217.0,,Jésus,L,Navarro,0,1959-12-08,M,,M,jésus9@adventure-works.com,30000.0,0.0,0.0,Bachelors,Clerical,1,0.0,"244, rue de la Centenaire",,1 (11) 500 555-0141,2007-03-13,0-1 Miles,Europe,57.0,1,60,65


#### b) IF the above does not work / is not well-designed, you can "Manually" bin, as shown in example below:

In [None]:
# df['Age group by hand'] = ['unknown' if x == None else 'young' if x<=45 else 'middle-aged' if x<=65 else 'senior'  for x in data["Age"]]
# df.tail(50)

After the BIN columns have been created, repeat the operation of replacing NaN and converting new numeric columns to datatype "float"

In [9]:
import numpy as np
# Replace all occurrences of Not Available with numpy "not a number=NaN"
df = df.replace({'NaN': np.nan})

# Iterate through the columns
for col in list(df.columns):
    # Select columns that should be numeric
    if ('Age' in col or 'Total' in col or 'Key' in col or 'Income' in col or 'Number' in col):
        # Convert the data type to float
        df[col] = df[col].astype(float)

In [10]:
df.tail(10)

Unnamed: 0,CustomerKey,GeographyKey,Title,FirstName,MiddleName,LastName,NameStyle,BirthDate,MaritalStatus,Suffix,Gender,EmailAddress,YearlyIncome,TotalChildren,NumberChildrenAtHome,EnglishEducation,EnglishOccupation,HouseOwnerFlag,NumberCarsOwned,AddressLine1,AddressLine2,Phone,DateFirstPurchase,CommuteDistance,Region,Age,BikeBuyer,AgeSmartBin,AgeRegularBin
18474,29474.0,174.0,,Jaime,B,Raje,0,1959-10-04,M,,M,jaime36@adventure-works.com,20000.0,2.0,0.0,Partial College,Manual,0,1.0,Potsdamer Straße 646,,1 (11) 500 555-0174,2005-12-20,0-1 Miles,Europe,57.0,1,60.0,65.0
18475,29475.0,147.0,,Jared,A,Ward,0,1959-09-23,S,,M,jared6@adventure-works.com,20000.0,2.0,0.0,Partial College,Manual,0,1.0,Erftplatz 876,,1 (11) 500 555-0135,2005-12-30,2-5 Miles,Europe,58.0,1,60.0,65.0
18476,29476.0,147.0,,Elizabeth,,Bradley,0,1959-07-03,M,,F,elizabeth30@adventure-works.com,20000.0,2.0,0.0,Partial College,Manual,0,1.0,Nonnendamm 2,,1 (11) 500 555-0177,2006-01-21,0-1 Miles,Europe,58.0,1,60.0,65.0
18477,29477.0,253.0,,Neil,N,Ruiz,0,1959-07-06,M,,M,neil3@adventure-works.com,20000.0,2.0,0.0,Partial College,Manual,0,1.0,P.O. Box 9178,,1 (11) 500 555-0114,2007-12-20,1-2 Miles,Europe,58.0,1,60.0,65.0
18478,29478.0,269.0,,Darren,D,Carlson,0,1959-05-25,S,,M,darren41@adventure-works.com,30000.0,3.0,0.0,Graduate Degree,Clerical,1,0.0,5240 Premier Pl.,,1 (11) 500 555-0132,2007-12-28,0-1 Miles,Europe,58.0,1,60.0,65.0
18479,29479.0,209.0,,Tommy,L,Tang,0,1958-07-04,M,,M,tommy2@adventure-works.com,30000.0,1.0,0.0,Graduate Degree,Clerical,1,0.0,"111, rue Maillard",,1 (11) 500 555-0136,2007-03-08,0-1 Miles,Europe,59.0,1,60.0,65.0
18480,29480.0,248.0,,Nina,W,Raji,0,1960-11-10,S,,F,nina21@adventure-works.com,30000.0,3.0,0.0,Graduate Degree,Clerical,1,0.0,9 Katherine Drive,,1 (11) 500 555-0146,2008-01-18,0-1 Miles,Europe,56.0,1,60.0,65.0
18481,29481.0,120.0,,Ivan,,Suri,0,1960-01-05,S,,M,ivan0@adventure-works.com,30000.0,3.0,0.0,Graduate Degree,Clerical,0,0.0,Knaackstr 4,,1 (11) 500 555-0144,2006-02-13,0-1 Miles,Europe,57.0,1,60.0,65.0
18482,29482.0,179.0,,Clayton,,Zhang,0,1959-03-05,M,,M,clayton0@adventure-works.com,30000.0,3.0,0.0,Bachelors,Clerical,1,0.0,"1080, quai de Grenelle",,1 (11) 500 555-0137,2007-03-22,0-1 Miles,Europe,58.0,1,60.0,65.0
18483,29483.0,217.0,,Jésus,L,Navarro,0,1959-12-08,M,,M,jésus9@adventure-works.com,30000.0,0.0,0.0,Bachelors,Clerical,1,0.0,"244, rue de la Centenaire",,1 (11) 500 555-0141,2007-03-13,0-1 Miles,Europe,57.0,1,60.0,65.0


In [11]:
df.shape

(18484, 29)

# 4. Remove columns with high % of missing values

Pecentage missing values of columns are calculated and displayed in a descending order

In [12]:
# Function to calculate missing values by column
def missing_values_table(df):
        # Total missing values
        mis_val = df.isnull().sum()
        
        # Percentage of missing values
        mis_val_percent = 100 * df.isnull().sum() / len(df)
        
        # Make a table with the results
        mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
        
        # Rename the columns
        mis_val_table_ren_columns = mis_val_table.rename(
        columns = {0 : 'Missing Values', 1 : '% of Total Values'})
        
        # Sort the table by percentage of missing descending
        mis_val_table_ren_columns = mis_val_table_ren_columns[
            mis_val_table_ren_columns.iloc[:,1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
        
        # Print some summary information
        print ("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"      
            "There are " + str(mis_val_table_ren_columns.shape[0]) +
              " columns that have missing values.")
        
        # Return the dataframe with missing information
        return mis_val_table_ren_columns

missing_values_table(df)

Your selected dataframe has 29 columns.
There are 4 columns that have missing values.


Unnamed: 0,Missing Values,% of Total Values
Suffix,18481,100.0
Title,18383,99.5
AddressLine2,18172,98.3
MiddleName,7830,42.4


Delete columns that has more than 60% of missing values

In [13]:
# Define function
def dropColumnsWithHighPercentageOfMissingValues(df, k):
    missing_df = missing_values_table(df);
    missing_columns = list(missing_df[missing_df['% of Total Values'] > k].index)
    print('We will remove %d columns.' % len(missing_columns))
    for col in missing_columns:
        print('drop column:',col)
        df= df.drop(col, axis=1)
    return df
#call the function
df = dropColumnsWithHighPercentageOfMissingValues(df, 60)
#display result
df.tail(10)

Your selected dataframe has 29 columns.
There are 4 columns that have missing values.
We will remove 3 columns.
drop column: Suffix
drop column: Title
drop column: AddressLine2


Unnamed: 0,CustomerKey,GeographyKey,FirstName,MiddleName,LastName,NameStyle,BirthDate,MaritalStatus,Gender,EmailAddress,YearlyIncome,TotalChildren,NumberChildrenAtHome,EnglishEducation,EnglishOccupation,HouseOwnerFlag,NumberCarsOwned,AddressLine1,Phone,DateFirstPurchase,CommuteDistance,Region,Age,BikeBuyer,AgeSmartBin,AgeRegularBin
18474,29474.0,174.0,Jaime,B,Raje,0,1959-10-04,M,M,jaime36@adventure-works.com,20000.0,2.0,0.0,Partial College,Manual,0,1.0,Potsdamer Straße 646,1 (11) 500 555-0174,2005-12-20,0-1 Miles,Europe,57.0,1,60.0,65.0
18475,29475.0,147.0,Jared,A,Ward,0,1959-09-23,S,M,jared6@adventure-works.com,20000.0,2.0,0.0,Partial College,Manual,0,1.0,Erftplatz 876,1 (11) 500 555-0135,2005-12-30,2-5 Miles,Europe,58.0,1,60.0,65.0
18476,29476.0,147.0,Elizabeth,,Bradley,0,1959-07-03,M,F,elizabeth30@adventure-works.com,20000.0,2.0,0.0,Partial College,Manual,0,1.0,Nonnendamm 2,1 (11) 500 555-0177,2006-01-21,0-1 Miles,Europe,58.0,1,60.0,65.0
18477,29477.0,253.0,Neil,N,Ruiz,0,1959-07-06,M,M,neil3@adventure-works.com,20000.0,2.0,0.0,Partial College,Manual,0,1.0,P.O. Box 9178,1 (11) 500 555-0114,2007-12-20,1-2 Miles,Europe,58.0,1,60.0,65.0
18478,29478.0,269.0,Darren,D,Carlson,0,1959-05-25,S,M,darren41@adventure-works.com,30000.0,3.0,0.0,Graduate Degree,Clerical,1,0.0,5240 Premier Pl.,1 (11) 500 555-0132,2007-12-28,0-1 Miles,Europe,58.0,1,60.0,65.0
18479,29479.0,209.0,Tommy,L,Tang,0,1958-07-04,M,M,tommy2@adventure-works.com,30000.0,1.0,0.0,Graduate Degree,Clerical,1,0.0,"111, rue Maillard",1 (11) 500 555-0136,2007-03-08,0-1 Miles,Europe,59.0,1,60.0,65.0
18480,29480.0,248.0,Nina,W,Raji,0,1960-11-10,S,F,nina21@adventure-works.com,30000.0,3.0,0.0,Graduate Degree,Clerical,1,0.0,9 Katherine Drive,1 (11) 500 555-0146,2008-01-18,0-1 Miles,Europe,56.0,1,60.0,65.0
18481,29481.0,120.0,Ivan,,Suri,0,1960-01-05,S,M,ivan0@adventure-works.com,30000.0,3.0,0.0,Graduate Degree,Clerical,0,0.0,Knaackstr 4,1 (11) 500 555-0144,2006-02-13,0-1 Miles,Europe,57.0,1,60.0,65.0
18482,29482.0,179.0,Clayton,,Zhang,0,1959-03-05,M,M,clayton0@adventure-works.com,30000.0,3.0,0.0,Bachelors,Clerical,1,0.0,"1080, quai de Grenelle",1 (11) 500 555-0137,2007-03-22,0-1 Miles,Europe,58.0,1,60.0,65.0
18483,29483.0,217.0,Jésus,L,Navarro,0,1959-12-08,M,M,jésus9@adventure-works.com,30000.0,0.0,0.0,Bachelors,Clerical,1,0.0,"244, rue de la Centenaire",1 (11) 500 555-0141,2007-03-13,0-1 Miles,Europe,57.0,1,60.0,65.0


In [14]:
df.shape

(18484, 26)

# 5. Remove rows with a missing value in the Target Column

We assume that "TargetColumn" is the name of our target column. We count the number of missing values in the target column by using print(data['TargetColumn'].isnull()) <br>
BEFORE RUNNING CELLS BELOW: Substitute the name of your Target Column into TargetColumn...

In [15]:
print(df["BikeBuyer"].isnull().sum())

0


In [11]:
df = df[pd.notnull(df['BikeBuyer'])]
df.info()

In [16]:
df.shape

(18484, 26)

we removed XXXXXX rows and now the dataset consists of YYYYYY rows

# 6. Replace extreme-outlier values with mean column value
- On the low end, an extreme outlier is below  $\text{First Quartile} -1.5 * \text{Interquartile Range}$
- On the high end, an extreme outlier is above $\text{Third Quartile} + 1.5 * \text{Interquartile Range}$
<br>where Interquartile Range = Third Quartile - First Quartile

## 6.1. Select column containing extreme outliers

In [17]:

def containingExtremeOutliers(df, colName):
    n,m = df.shape
    result = False
    # Calculate first and third quartile
    first_quartile = df[colName].describe()['25%']
    third_quartile = df[colName].describe()['75%']

    # Interquartile range
    iqr = third_quartile - first_quartile
    outlierDF = df[(df[colName] < (first_quartile - 1.5 * iqr)) |
            (df[colName] > (third_quartile + 1.5 * iqr))]
    n,m = outlierDF.shape
    if n == 0:
        return False
    else:
        return True

columnsWithOutliers = []
for c in list(df.columns):
    if df[c].dtype != 'object' and containingExtremeOutliers(df,c):
        columnsWithOutliers.append(c)

print('Columns containing extreme outliers:')
columnsWithOutliers

Columns containing extreme outliers:


['YearlyIncome', 'NumberCarsOwned', 'Age', 'AgeSmartBin', 'AgeRegularBin']

## 6.2. Display Box plot of any one (chosen) column
##        -Do EITHER 6.2a or 6.2b below (whichever one works)!!---------------

## 6.2a. Use plotly's "offline.iplot" to display (horizontal) Box plot----------
Note: For these Plotly Box-plots to be visible: you MUST use CHROME browser (Not Internet Explorer/Edge) <br>
Note: If needed, before running this plotly cell, click Start/Anaconda3/AnacondaPrompt and type pip install plotly .<br>
This should install the plotly library (from the internet) before you import and run it below...

In [None]:
import plotly as py
from plotly.graph_objs import *
py.offline.init_notebook_mode(connected = True)

def displayBoxPlot(df,col):
    trace = Box(x=df[col].dropna())
    data = [trace]
    layout = {'title':'box plots',
                'xaxis':{'title':'values'},
                'yaxis':{'title':col}
             }
    fig = Figure(data = data, layout = layout)
    py.offline.iplot(fig)


displayBoxPlot(df,'PutChosenColumnNameHere')

## 6.2b. Use pyplot's "boxplot" to display (vertical) Box plot----------

In [14]:
# #packages for visualizing data
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier, export_graphviz

#setting up environment
%matplotlib inline
pd.set_option('display.max_columns',500 )
import warnings
warnings.filterwarnings('ignore')   #turn off warning

# The following cell enables you do a boxplots of any chosen column
plt.boxplot(df['PutChosenColumnNameHere'])
plt.show()

## 6.3. Replacing extreme outliers by mean value of the column


In [18]:
# define a function that remove rows that contain extreme outlier in column given by colName
def replaceExtremeOutliers(df_in, col_name):
    m = df_in[col_name].describe()['min']
    
    # Calculate first and third quartile
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3-q1 #Interquartile range
    fence_low  = q1-1.5*iqr
    fence_high = q3+1.5*iqr
    
    df_in[col_name] = [m if (x < fence_low  or x > fence_high) else x for x in df_in[col_name] ]

    return df_in

#repeatedly remove rows containing extreme outliers in numeric columns
for c in list(df.columns):
    if df[c].dtype != 'object':
        df = replaceExtremeOutliers(df,c)
        
# display info of the dataframe after 
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18484 entries, 0 to 18483
Data columns (total 26 columns):
CustomerKey             18484 non-null float64
GeographyKey            18484 non-null float64
FirstName               18484 non-null object
MiddleName              10654 non-null object
LastName                18484 non-null object
NameStyle               18484 non-null int64
BirthDate               18484 non-null object
MaritalStatus           18484 non-null object
Gender                  18484 non-null object
EmailAddress            18484 non-null object
YearlyIncome            18484 non-null float64
TotalChildren           18484 non-null float64
NumberChildrenAtHome    18484 non-null float64
EnglishEducation        18484 non-null object
EnglishOccupation       18484 non-null object
HouseOwnerFlag          18484 non-null int64
NumberCarsOwned         18484 non-null float64
AddressLine1            18484 non-null object
Phone                   18484 non-null object
DateFirstPurc

In [19]:
df.shape

(18484, 26)

# 7. Replace missing numeric/categ values w mean/most frequent values
<b>In General: for a numerical variable:</b><br>
Ignore these observations<br>
Replace with general average<br>
Replace with similar type of averages/mode/median<br>
Build model to predict missing values<br>
<b>Here, we will fill missing values in each column by its mean value.</b>
<br>
<br>
<b>In General: for a categorical variable:</b><br>
Ignore observation<br>
Replace by most frequent value<br>
Replace using an algorithm like KNN using the neighbours.<br>
Predict the observation using a multiclass predictor.<br>
<b>Here, we will replace all missing values by their most frequent values</b>

In [20]:
for c in list(df.columns):
    if df[c].dtype == 'object':
        df[c].fillna(df[c].mode()[0], inplace=True)
    else:
        df[c].fillna(int(df[c].mean()), inplace=True)
        
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18484 entries, 0 to 18483
Data columns (total 26 columns):
CustomerKey             18484 non-null float64
GeographyKey            18484 non-null float64
FirstName               18484 non-null object
MiddleName              18484 non-null object
LastName                18484 non-null object
NameStyle               18484 non-null int64
BirthDate               18484 non-null object
MaritalStatus           18484 non-null object
Gender                  18484 non-null object
EmailAddress            18484 non-null object
YearlyIncome            18484 non-null float64
TotalChildren           18484 non-null float64
NumberChildrenAtHome    18484 non-null float64
EnglishEducation        18484 non-null object
EnglishOccupation       18484 non-null object
HouseOwnerFlag          18484 non-null int64
NumberCarsOwned         18484 non-null float64
AddressLine1            18484 non-null object
Phone                   18484 non-null object
DateFirstPurc

In [21]:
df.shape

(18484, 26)

# 8. Remove weakly correlated columns

## 8.1. Correlations between feature columns and the target
<br>In order to quantify correlations between the features (variables) and the target, 
we can calculate the Pearson correlation coefficient. 
This is a measure of the strength and direction of a linear relationship between two variables: 
    a value of -1 means the two variables are perfectly negatively linearly correlated and 
    a value of +1 means the two variables are perfectly positively linearly correlated.
We can then use these values for selecting the features to employ in our model.

In [22]:
print(df['Age'].corr(df['BikeBuyer']))

-0.08535297993998012


In [23]:
# Find all correlations and sort 
correlations_data = df.corr()['BikeBuyer'].sort_values()

# Print the most negative correlations
print(correlations_data.head(20), '\n')

# Print the most positive correlations
print(correlations_data.tail(20))

NumberCarsOwned        -0.141520
TotalChildren          -0.127152
AgeRegularBin          -0.098177
NumberChildrenAtHome   -0.086707
GeographyKey           -0.086619
Age                    -0.085353
CustomerKey             0.005804
HouseOwnerFlag          0.007494
AgeSmartBin             0.019158
YearlyIncome            0.027336
BikeBuyer               1.000000
NameStyle                    NaN
Name: BikeBuyer, dtype: float64 

NumberCarsOwned        -0.141520
TotalChildren          -0.127152
AgeRegularBin          -0.098177
NumberChildrenAtHome   -0.086707
GeographyKey           -0.086619
Age                    -0.085353
CustomerKey             0.005804
HouseOwnerFlag          0.007494
AgeSmartBin             0.019158
YearlyIncome            0.027336
BikeBuyer               1.000000
NameStyle                    NaN
Name: BikeBuyer, dtype: float64


## 8.2. Group column based on correlation values to the target
In the procedure below, we identify/group columns according to their correlation score:<br>
You must first choose a threshold (eg .12 (ie, 12%)) and type it below where you see 0.12.<br>
Also, replace TargetColumn below with the name of the earlier-chosen Target Column.<br>
Then, three groups will be created:<br>
Strongly-positively correlated items will have a score above the threshold (eg. >.12 to 1)<br>
Strongly-negatively correlated items will have a score below the negative-threshold (eg. <-.12 to -1)<br>
Weakly correlated items have score between negative-threshold and threshold (eg. between -.12 and .12 (including 0))<br>
We intend to keep the strongly positive and strongly negative items, and drop the weakly correlated.

In [24]:
def classifyColumnBasedOnCorrWithTarget(df,target,threshold):
    negCorrCol = []
    posCorrCol = []
    negCorr = []
    posCorr = []
    lessCorrCols = []
    for col in list(df):
        if(df[col].dtype == np.float64 or df[col].dtype == np.int64) and (col!=target):
            corr = df[col].corr(df[target])
            if corr > threshold:
                posCorr.append(df[col].corr(df[target]))
                posCorrCol.append(col)
            else:
                if corr < -threshold:
                    negCorr.append(df[col].corr(df[target]))
                    negCorrCol.append(col)
                else:
                    lessCorrCols.append(col)
    posCorrCols = [x for _,x in sorted(zip(posCorr,posCorrCol),reverse=True)]
    negCorrCols = [x for _,x in sorted(zip(negCorr,negCorrCol))]
    return negCorrCols, lessCorrCols, posCorrCols

negCorrCols, lessCorrCols, posCorrCols = classifyColumnBasedOnCorrWithTarget(df,"BikeBuyer",0.12)
print('weakly correlated columns:')
lessCorrCols

weakly correlated columns:


['CustomerKey',
 'GeographyKey',
 'NameStyle',
 'YearlyIncome',
 'NumberChildrenAtHome',
 'HouseOwnerFlag',
 'Age',
 'AgeSmartBin',
 'AgeRegularBin']

## 8.3. Drop columns that has weakly correlated with the target

In [25]:
for col in lessCorrCols:
    df=df.drop(col,1)
    
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18484 entries, 0 to 18483
Data columns (total 17 columns):
FirstName            18484 non-null object
MiddleName           18484 non-null object
LastName             18484 non-null object
BirthDate            18484 non-null object
MaritalStatus        18484 non-null object
Gender               18484 non-null object
EmailAddress         18484 non-null object
TotalChildren        18484 non-null float64
EnglishEducation     18484 non-null object
EnglishOccupation    18484 non-null object
NumberCarsOwned      18484 non-null float64
AddressLine1         18484 non-null object
Phone                18484 non-null object
DateFirstPurchase    18484 non-null object
CommuteDistance      18484 non-null object
Region               18484 non-null object
BikeBuyer            18484 non-null int64
dtypes: float64(2), int64(1), object(14)
memory usage: 2.4+ MB


In [26]:
df.shape

(18484, 17)

# 9. Drop any columns with too high a Percentage of Unique Values or too high a number of Distinct Values

# 9.1. Define Function that will count number of: Distinct values, Unique values and Percentage of Unique values in total

In [27]:
def countDistinctUniqueValues(df, colName):
#     counts the number of distinct values with only one instance 
    numberOfInstances = df[colName].count()
    countInstancesForEachDistinctValue = df[colName].value_counts()
    countDistinct = df[colName].nunique()       # counts the number of distinct values of your column (SELECT DISTINCT ...)
    countUnique = 0                             # counts the number of distinct values with only one instance
    for y in countInstancesForEachDistinctValue:
        if y == 1:
            countUnique+=1
    percentage = round(countUnique/ numberOfInstances*100,2) # percentage of instances having a value for this attribute that no other instances have in the data.
    return countDistinct,countUnique,percentage

# 9.2. Define Function that drops any object-typ (nominal?) columns with too high unique% or too high distinct_count

In [28]:
def deleteColumnWithHighPercentageUniqueValues(df, uniq_perc_threshold, dist_count_threshold):
    for colName in df.columns:
        distinct, unique, percentage  = countDistinctUniqueValues(df,colName)
        print(colName+' has datatype of '+ str(df[colName].dtype)+' with:d,u,p(u)='+str(distinct)+','+str(unique)+','+str(percentage)+'!')
        if (str(df[colName].dtype)=='object') and (percentage > uniq_perc_threshold or distinct > dist_count_threshold):
            print(colName+'-------will be dropped------------------------------')
#   to only list these columns without dropping them, comment-out the line below and then run this cell...
            df = df.drop(colName,axis=1)
    return df

# 9.3. Execute Function (defined above) that drops columns

In [29]:
# BEFORE running this cell: 
# replace 60 with desired unique% threshold and 1000 with desired distinct_count threshold
df = deleteColumnWithHighPercentageUniqueValues(df,60,1000)

FirstName has datatype of object with:d,u,p(u)=670,93,0.5!
MiddleName has datatype of object with:d,u,p(u)=44,9,0.05!
LastName has datatype of object with:d,u,p(u)=375,126,0.68!
BirthDate has datatype of object with:d,u,p(u)=8252,2901,15.69!
BirthDate-------will be dropped------------------------------
MaritalStatus has datatype of object with:d,u,p(u)=2,0,0.0!
Gender has datatype of object with:d,u,p(u)=2,0,0.0!
EmailAddress has datatype of object with:d,u,p(u)=18484,18484,100.0!
EmailAddress-------will be dropped------------------------------
TotalChildren has datatype of float64 with:d,u,p(u)=6,0,0.0!
EnglishEducation has datatype of object with:d,u,p(u)=5,0,0.0!
EnglishOccupation has datatype of object with:d,u,p(u)=5,0,0.0!
NumberCarsOwned has datatype of float64 with:d,u,p(u)=4,0,0.0!
AddressLine1 has datatype of object with:d,u,p(u)=12802,7606,41.15!
AddressLine1-------will be dropped------------------------------
Phone has datatype of object with:d,u,p(u)=8890,8175,44.23!
Phone

In [30]:
df.shape

(18484, 12)

# 10. If needed, convert the target column to a two-valued column (eg. 1/0)
Before running, replace TargetColumn below with the name of the earlier-chosen Target Column.<br>

In [21]:
# df["TargetColumn"] = [0 if x < 50 else 1 for x in df["TargetColumn"]]

# 11. Export the dataframe into a "Cleaned" csv file

In [22]:
# df.to_csv(r'C:\users\vdavis\cleanedvTargetMail_ICA04a.csv.csv',index=False, encoding='utf-8')

## After the above is done, do this before/as you feed csv into weka:
In order to use J48 Decision Tree model in WEKA: BEFORE reading csv into WEKA, you must:<br>
A) Open csv in Excel and create new target column in new, last position: name it "score". <br>
B) Assume old-target-column is J: then, in row2 of new target column, type a formula =IF(J2=1,"yes","no")<br>       -copy this formula into each other row of this new target column.<br>
                -confirm that for each row, its formula has its row# in the place of the old "2" <br>
C) Copy new-target-column over itself, using copy-special/values to replace formulas with values<br>
D) Confirm that new yes/no matches old 1/0 correctly.<br>
      If so, Remove old target column, as it is now unnecessary.  Save and keep this csv file.<br><br>
In order to use J48 Decision Tree model in WEKA: AFTER reading csv, you must remove any unacceptable columns:<br>
E) In WEKA: select each column and look for error-msg "Too many values" / "Cannot display" on right-panel<br>
F) If column has such err-msg, select it (on check-box) and hit "REMOVE" button<br>
G) After all unnacceptable columns have been removed, all graphics in right-panel should show (in color?)<br>
H) After all unnacceptable columns have been removed, Classify tab should show J48 model as selectable.