In [1]:
# Importing library
import pandas as pd

In [2]:
# Reading dataset
data = pd.read_csv('https://raw.githubusercontent.com/analyticsindiamagazine/MocksDatasets/main/Customer_Missing.csv')
data.head()

Unnamed: 0,S.No.,Cust_ID,Cust_City,Cust_Age,Cust_Fam_Memb,Cust_Month_Income,Cust_Purchase
0,1,100089,Bangalore,51.0,3.0,170000.0,28333.0
1,2,100018,Bangalore,39.0,2.0,195000.0,
2,3,100070,Hyderabad,,3.0,103333.0,17222.0
3,4,100090,Bangalore,34.0,4.0,85000.0,10625.0
4,5,100016,Bangalore,22.0,6.0,,3055.0


In [3]:
data.shape

(20, 7)

# **Missing value treatment**

In [4]:
# Data information
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   S.No.              20 non-null     int64  
 1   Cust_ID            20 non-null     int64  
 2   Cust_City          18 non-null     object 
 3   Cust_Age           18 non-null     float64
 4   Cust_Fam_Memb      19 non-null     float64
 5   Cust_Month_Income  18 non-null     float64
 6   Cust_Purchase      18 non-null     float64
dtypes: float64(4), int64(2), object(1)
memory usage: 1.2+ KB


In [5]:
# Missing values in each column
data.isnull().sum()

S.No.                0
Cust_ID              0
Cust_City            2
Cust_Age             2
Cust_Fam_Memb        1
Cust_Month_Income    2
Cust_Purchase        2
dtype: int64

Out of 7 columns, 5 have some null values. To move further with these columns we have two options: dropping the rows that contain missing values, or imputing the missing values. Dropping is generally advisable only when a small share of the data (roughly 2% to 5%) is missing. In our case, 9 values are missing across 5 columns of a 20-row dataset, so dropping every affected row could discard a large part of the data. It is therefore better to impute the values.
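
To put a number on how much data is missing before choosing between dropping and imputing, the share of missing cells and of affected rows can be computed directly. The following is a minimal sketch on a small made-up frame; the `toy` DataFrame, its column names and its values are assumptions for illustration, not the customer dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the customer dataset
toy = pd.DataFrame({
    'age': [51.0, np.nan, 34.0, 22.0],
    'city': ['Bangalore', 'Pune', None, 'Bangalore'],
    'income': [170000.0, 195000.0, 85000.0, np.nan],
})

# Percentage of missing values in each column
missing_per_column = toy.isnull().mean() * 100

# Percentage of all cells that are missing
missing_cells_pct = toy.isnull().to_numpy().mean() * 100

# Percentage of rows that dropna() would remove
rows_lost_pct = toy.isnull().any(axis=1).mean() * 100

print(missing_per_column)
print(missing_cells_pct, rows_lost_pct)
```

A low cell-level percentage can still translate into many lost rows, which is one reason imputation is often preferred on small datasets.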

Before imputing, let’s create a copy of the existing dataset so that the original values remain available later for other imputation methods.

In [6]:
# Creating a copy of df
data2 = data.copy()

In [7]:
# Missing values of a feature
data['Cust_Age'].isnull().sum()

2

In [8]:
data['Cust_Age']

0     51.0
1     39.0
2      NaN
3     34.0
4     22.0
5     35.0
6     30.0
7      NaN
8     30.0
9     25.0
10    43.0
11    41.0
12    42.0
13    51.0
14    24.0
15    44.0
16    26.0
17    39.0
18    33.0
19    39.0
Name: Cust_Age, dtype: float64

We will first impute the missing values in the column Cust_Age. Since it is a continuous feature, the mean is an appropriate value to fill the gaps with.

Let’s calculate the mean value first.

In [9]:
# Mean of the feature values
age_mean = data['Cust_Age'].mean()

In [10]:
# Filling missing values with mean
data['Cust_Age'] = data['Cust_Age'].fillna(value = age_mean)
data['Cust_Age']

0     51.0
1     39.0
2     36.0
3     34.0
4     22.0
5     35.0
6     30.0
7     36.0
8     30.0
9     25.0
10    43.0
11    41.0
12    42.0
13    51.0
14    24.0
15    44.0
16    26.0
17    39.0
18    33.0
19    39.0
Name: Cust_Age, dtype: float64

We have now imputed the missing values for Cust_Age. Next we impute the missing values for the column Cust_City. Since it is a categorical column, the appropriate measure is the mode, which is the most frequent value in the column.

Below is the feature before imputing with the mode value.

In [11]:
data['Cust_City']

0     Bangalore
1     Bangalore
2     Hyderabad
3     Bangalore
4     Bangalore
5     Hyderabad
6     Bangalore
7           NaN
8     Bangalore
9          Pune
10    Bangalore
11    Hyderabad
12    Bangalore
13         Pune
14    Hyderabad
15    Bangalore
16    Bangalore
17          NaN
18    Bangalore
19    Bangalore
Name: Cust_City, dtype: object

Now imputing with mode value. 

In [12]:
city_mode = data['Cust_City'].mode()
city_mode


0    Bangalore
dtype: object

In [13]:
# Filling missing values with mode
data['Cust_City'] = data['Cust_City'].fillna(value = city_mode[0])
data['Cust_City']

0     Bangalore
1     Bangalore
2     Hyderabad
3     Bangalore
4     Bangalore
5     Hyderabad
6     Bangalore
7     Bangalore
8     Bangalore
9          Pune
10    Bangalore
11    Hyderabad
12    Bangalore
13         Pune
14    Hyderabad
15    Bangalore
16    Bangalore
17    Bangalore
18    Bangalore
19    Bangalore
Name: Cust_City, dtype: object

Above we have successfully imputed the categorical missing value by mode. 
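
The `[0]` index used above is needed because `mode()` returns a Series rather than a single value: when several values are tied for most frequent, all of them are returned. The sketch below illustrates this on a made-up column (the `cities` Series is an assumption for illustration, not part of the dataset).

```python
import pandas as pd

# Hypothetical column with a tie between two cities and one missing value
cities = pd.Series(['Pune', 'Bangalore', 'Pune', 'Bangalore', None])

# mode() ignores missing values and returns every tied value, in sorted order
modes = cities.mode()
print(modes.tolist())

# Taking the first entry picks one of the tied values as the fill value
filled = cities.fillna(modes[0])
print(filled.tolist())
```

With a tie, the choice of `modes[0]` is arbitrary (it is simply the first value in sorted order), so it is worth checking the length of the returned Series before relying on it.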

So far we have imputed missing values for two columns. Three features remain: Cust_Fam_Memb, which we treat as categorical and fill with the mode, and Cust_Month_Income and Cust_Purchase, which are continuous and are filled with the mean. Let’s impute all of them in a single cell.

First, let’s check the missing values that remain in the dataset.

In [14]:
data.isnull().sum() 

S.No.                0
Cust_ID              0
Cust_City            0
Cust_Age             0
Cust_Fam_Memb        1
Cust_Month_Income    2
Cust_Purchase        2
dtype: int64

In [15]:
# Filling all remaining missing values (mode for family members, mean for income and purchase)
data['Cust_Fam_Memb'] = data['Cust_Fam_Memb'].fillna(value = data['Cust_Fam_Memb'].mode()[0])
data['Cust_Month_Income'] = data['Cust_Month_Income'].fillna(value = data['Cust_Month_Income'].mean())
data['Cust_Purchase'] = data['Cust_Purchase'].fillna(value = data['Cust_Purchase'].mean())

Let's check the sum of all the missing values across the columns.

In [16]:
data.isnull().sum()

S.No.                0
Cust_ID              0
Cust_City            0
Cust_Age             0
Cust_Fam_Memb        0
Cust_Month_Income    0
Cust_Purchase        0
dtype: int64
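
The three separate fills above could also be written as one call, since `DataFrame.fillna` accepts a dict that maps each column to its own fill value. This is a minimal sketch on a made-up frame that reuses the column names; the `toy` values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame reusing three column names from the dataset
toy = pd.DataFrame({
    'Cust_Fam_Memb': [3.0, 2.0, np.nan, 3.0],
    'Cust_Month_Income': [100000.0, np.nan, 80000.0, 60000.0],
    'Cust_Purchase': [10000.0, 20000.0, np.nan, 30000.0],
})

# One fill value per column: mode for the count-like column, mean for the others
fill_values = {
    'Cust_Fam_Memb': toy['Cust_Fam_Memb'].mode()[0],
    'Cust_Month_Income': toy['Cust_Month_Income'].mean(),
    'Cust_Purchase': toy['Cust_Purchase'].mean(),
}

# fillna with a dict only touches the listed columns
toy_filled = toy.fillna(value=fill_values)
print(toy_filled)
```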

Another way to impute missing values is the SimpleImputer class provided by the sklearn library. For this method we will use the copy of the data made earlier (data2), which still contains the missing values.

For demonstration, we impute the missing values in the column Cust_Month_Income. Before imputation, the column looks like this:

In [17]:
data2['Cust_Month_Income']

0     170000.0
1     195000.0
2     103333.0
3      85000.0
4          NaN
5     116666.0
6      75000.0
7      46666.0
8     150000.0
9      41666.0
10     86000.0
11    136666.0
12     70000.0
13    170000.0
14     48000.0
15     88000.0
16    130000.0
17     97500.0
18         NaN
19    130000.0
Name: Cust_Month_Income, dtype: float64

Below we import the SimpleImputer class from the sklearn library:

In [18]:
# Missing values imputation with SKLearn Imputer
import numpy as np
from sklearn.impute import SimpleImputer

When initializing the SimpleImputer class we need to specify the strategy, which determines the statistic used for imputation, and indicate how missing values are represented in the data. After initializing the class, we can impute the missing values with the fit_transform method.

In [19]:
imputer = SimpleImputer(missing_values=np.nan, strategy='mean') # strategy="most_frequent" in case of categorical values
data2['Cust_Month_Income'] = imputer.fit_transform(data2['Cust_Month_Income'].values.reshape(-1,1))[:,0]

In [20]:
data2['Cust_Month_Income']

0     170000.000000
1     195000.000000
2     103333.000000
3      85000.000000
4     107749.833333
5     116666.000000
6      75000.000000
7      46666.000000
8     150000.000000
9      41666.000000
10     86000.000000
11    136666.000000
12     70000.000000
13    170000.000000
14     48000.000000
15     88000.000000
16    130000.000000
17     97500.000000
18    107749.833333
19    130000.000000
Name: Cust_Month_Income, dtype: float64
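
`fit_transform` combines two steps: `fit` learns the fill value from the data and `transform` applies it. The sketch below separates the two on a made-up column so the learned value can be inspected through the `statistics_` attribute; the `toy` frame and its `income` column are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical single-column frame with one missing value
toy = pd.DataFrame({'income': [100000.0, np.nan, 80000.0, 60000.0]})

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# fit() expects 2-D input, so select the column with double brackets
imputer.fit(toy[['income']])
print(imputer.statistics_)  # learned fill value for each column

# transform() returns a 2-D array; take its first column to assign back
toy['income'] = imputer.transform(toy[['income']])[:, 0]
print(toy['income'].tolist())
```

Keeping `fit` and `transform` apart also makes it possible to learn the fill value on one set of rows and apply it to another.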

# **Defining features**

Now that all the missing values have been imputed, the next step is to define the input and output features from the cleaned dataset.

Before defining the input features, let’s take a look at the cleaned dataset.

In [21]:
data.head()

Unnamed: 0,S.No.,Cust_ID,Cust_City,Cust_Age,Cust_Fam_Memb,Cust_Month_Income,Cust_Purchase
0,1,100089,Bangalore,51.0,3.0,170000.0,28333.0
1,2,100018,Bangalore,39.0,2.0,195000.0,15551.222222
2,3,100070,Hyderabad,36.0,3.0,103333.0,17222.0
3,4,100090,Bangalore,34.0,4.0,85000.0,10625.0
4,5,100016,Bangalore,22.0,6.0,107749.833333,3055.0


Our input features will be the columns from Cust_ID to Cust_Month_Income, which can be extracted as below.

In [22]:
# Defining input features
input_features = data.iloc[:,1:-1].values

In [23]:
input_features.shape

(20, 5)

As you can see, out of 7 columns we have defined 5 as input features. Let’s take a look at these input feature patterns.

In [24]:
input_features

array([[100089, 'Bangalore', 51.0, 3.0, 170000.0],
       [100018, 'Bangalore', 39.0, 2.0, 195000.0],
       [100070, 'Hyderabad', 36.0, 3.0, 103333.0],
       [100090, 'Bangalore', 34.0, 4.0, 85000.0],
       [100016, 'Bangalore', 22.0, 6.0, 107749.83333333333],
       [100033, 'Hyderabad', 35.0, 3.0, 116666.0],
       [100035, 'Bangalore', 30.0, 4.0, 75000.0],
       [100067, 'Bangalore', 36.0, 6.0, 46666.0],
       [100058, 'Bangalore', 30.0, 2.0, 150000.0],
       [100051, 'Pune', 25.0, 6.0, 41666.0],
       [100024, 'Bangalore', 43.0, 5.0, 86000.0],
       [100091, 'Hyderabad', 41.0, 3.0, 136666.0],
       [100070, 'Bangalore', 42.0, 6.0, 70000.0],
       [100044, 'Pune', 51.0, 3.0, 170000.0],
       [100094, 'Hyderabad', 24.0, 5.0, 48000.0],
       [100044, 'Bangalore', 44.0, 3.0, 88000.0],
       [100064, 'Bangalore', 26.0, 2.0, 130000.0],
       [100096, 'Bangalore', 39.0, 4.0, 97500.0],
       [100053, 'Bangalore', 33.0, 4.0, 107749.83333333333],
       [100066, 'Bangalore', 3

In [25]:
# Defining output feature
ouput_feature = data.iloc[:,-1].values

In [26]:
# Shape of the output feature
ouput_feature.shape

(20,)
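
The positional slices used above depend on the column order. As a hedged alternative, the same selection can be made by column name with `loc`, whose label slices include both endpoints. The three-row `toy` frame below reuses the dataset's column names, but its values are illustrative assumptions.

```python
import pandas as pd

# Hypothetical three-row frame with the same column names as the dataset
toy = pd.DataFrame({
    'S.No.': [1, 2, 3],
    'Cust_ID': [100089, 100018, 100070],
    'Cust_City': ['Bangalore', 'Bangalore', 'Hyderabad'],
    'Cust_Age': [51.0, 39.0, 36.0],
    'Cust_Fam_Memb': [3.0, 2.0, 3.0],
    'Cust_Month_Income': [170000.0, 195000.0, 103333.0],
    'Cust_Purchase': [28333.0, 15551.0, 17222.0],
})

# Label-based slice: both 'Cust_ID' and 'Cust_Month_Income' are included
X_by_name = toy.loc[:, 'Cust_ID':'Cust_Month_Income'].values

# Position-based slice, as used in the cells above
X_by_pos = toy.iloc[:, 1:-1].values

# Output feature selected by name
y = toy['Cust_Purchase'].values

print(X_by_name.shape, y.shape)
```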

# **Creating training and testing set**

Creating the training and test sets is usually considered the last step of data preprocessing. In this step, we set aside part of the data as a training set, which is used exclusively to train the model, and use the remaining test set to evaluate how well the model generalizes.

This split is usually done with the train_test_split function provided by the sklearn library. Below we import that function from sklearn.

In [27]:
# Creating training and test patterns
from sklearn.model_selection import train_test_split

In [28]:
X_train, X_test, y_train, y_test = train_test_split(input_features, ouput_feature, train_size=0.8)

In [29]:
# Shape of training and test patterns
print('Size of input training patterns: ', X_train.shape)
print('Size of output training patterns: ', y_train.shape)
print('Size of input test patterns: ', X_test.shape)
print('Size of output test patterns: ', y_test.shape)

Size of input training patterns:  (16, 5)
Size of output training patterns:  (16,)
Size of input test patterns:  (4, 5)
Size of output test patterns:  (4,)
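
Because train_test_split shuffles the rows at random, the call above produces a different split on every run. A minimal sketch of making the split reproducible with the `random_state` parameter follows; the arrays `X` and `y` and the seed value 42 are arbitrary assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 samples with 2 features each
X = np.arange(40).reshape(20, 2)
y = np.arange(20)

# The same seed yields the same split on every call
split_a = train_test_split(X, y, train_size=0.8, random_state=42)
split_b = train_test_split(X, y, train_size=0.8, random_state=42)

X_train, X_test, y_train, y_test = split_a
print(X_train.shape, X_test.shape)

# Check that both calls returned identical arrays
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)
```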
