# Clustering Ghanaian households based on their expenditure patterns

### 1.3 Feature Creation (Data Cleansing and Feature Engineering)

This task transforms input columns of various relations into formats that are compatible with machine learning algorithms. We consider several data cleaning techniques, including removing correlated variables, normalisation, discretisation, PCA, dummy encoding, replacing missing values, and removing duplicates.

Import data

In [5]:
import pandas as pd

In [6]:
df = pd.read_csv('final_capstone_project_data.csv')

In [7]:
df.head(4)

Unnamed: 0.1,Unnamed: 0,hid,region,rururb,TOTFOOD,TOTALCH,TOTCLTH,TOTHOUS,TOTFURN,TOTHLTH,TOTTRSP,TOTCMNQ,TOTRCRE,TOTEDUC,TOTMISC,TOTAL_EXP
0,0,70001/02,Western,Urban,9437.439453,0.0,1978.0,2569.800049,1128.790039,182.5,2248.399902,138.899994,55.25,1481.5,1081.459961,20302.039398
1,1,70001/05,Western,Urban,6990.47998,0.0,906.0,10808.799805,1693.709961,21.9,192.5,370.399994,50.0,5079.0,260.200012,26372.989752
2,2,70001/06,Western,Urban,3079.566895,0.0,442.0,1240.099976,379.599976,29.200001,0.0,146.0,21.900002,827.0,932.660034,7098.026882
3,3,70001/07,Western,Urban,6542.259766,0.0,2435.0,1200.800049,580.959961,29.200001,2579.5,567.700012,12.5,1198.0,898.919983,16044.839771


#### PREPROCESSING

Since we are clustering our data, we do not set a target column; all attributes are regular attributes. We do not discretize any of the numerical columns, and no new attribute is generated. The only columns we remove are the identifier, location and total expenditure columns, which are dropped before normalization. The remaining attributes are all numerical, and hence we do not unify value types (for example 'nominal to text', 'text to nominal' or 'numerical to real').

REPLACING MISSING VALUES

There are no missing values in the current data set.
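A claim like this can be checked with a short pandas snippet. The sketch below uses a small made-up frame (the values and the deliberately inserted `NaN` are assumptions for illustration, not the household data) to show how missing values per column could be counted.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the household data; the values are made up
# and one NaN is inserted on purpose so the check has something to find.
toy = pd.DataFrame({
    "TOTFOOD": [9437.4, 6990.5, np.nan],
    "TOTCLTH": [1978.0, 906.0, 442.0],
})

# Number of missing entries in each column.
missing_per_column = toy.isna().sum()

# True only when the whole frame is free of missing values.
has_no_missing = not toy.isna().any().any()

print(missing_per_column)
print(has_no_missing)
```

Running the same two expressions on `df` should report zero for every column if the data set is indeed complete.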

SAMPLING

Since there are fewer than 10,000 samples in our dataset, we do not sample down. Instead, we use all samples.
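For reference, a conditional down-sampling step could look roughly like the sketch below. The helper name `sample_down`, the fixed seed and the toy frames are assumptions for illustration; only the 10,000-row threshold comes from the text.

```python
import pandas as pd

MAX_ROWS = 10_000  # threshold mentioned in the text


def sample_down(frame: pd.DataFrame, max_rows: int = MAX_ROWS, seed: int = 0) -> pd.DataFrame:
    """Return the frame unchanged if it is small enough, else a random subset."""
    if len(frame) <= max_rows:
        return frame
    # random_state makes the subset reproducible between runs.
    return frame.sample(n=max_rows, random_state=seed)


# Made-up frames: one below the threshold, one above it.
small = pd.DataFrame({"x": range(50)})
large = pd.DataFrame({"x": range(12_000)})

print(len(sample_down(small)), len(sample_down(large)))
```

With a data set below the threshold, as here, the helper simply returns the input and every household is kept.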

#### NORMALIZATION

Given the range of values in the columns (starting from 0 and reaching the thousands), we normalize all columns of the dataset. We use StandardScaler, which rescales each column to zero mean and unit variance.

First, we drop the total expenditure column (TOTAL_EXP), as we do not need it for the clustering analysis. We also drop Unnamed: 0, hid, region and rururb.

In [8]:
df.drop(['Unnamed: 0', 'hid', 'region', 'rururb', 'TOTAL_EXP'], axis=1, inplace=True)

In [9]:
df.head(2)

Unnamed: 0,TOTFOOD,TOTALCH,TOTCLTH,TOTHOUS,TOTFURN,TOTHLTH,TOTTRSP,TOTCMNQ,TOTRCRE,TOTEDUC,TOTMISC
0,9437.439453,0.0,1978.0,2569.800049,1128.790039,182.5,2248.399902,138.899994,55.25,1481.5,1081.459961
1,6990.47998,0.0,906.0,10808.799805,1693.709961,21.9,192.5,370.399994,50.0,5079.0,260.200012


In [11]:
from sklearn.preprocessing import StandardScaler

In [12]:
df_normalized = StandardScaler().fit_transform(df)
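To make explicit what this step computes, the sketch below compares `StandardScaler` with a manual z-score on a small made-up array (the numbers are assumptions for illustration). Each column is centred on its mean and divided by its population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values: two columns on very different scales.
values = np.array([[0.0, 100.0],
                   [5.0, 300.0],
                   [10.0, 500.0]])

scaled = StandardScaler().fit_transform(values)

# Same computation by hand: subtract the column mean and divide by the
# population standard deviation (ddof=0), which StandardScaler uses.
manual = (values - values.mean(axis=0)) / values.std(axis=0)

same = np.allclose(scaled, manual)
print(same)
```

After scaling, every column has mean 0 and standard deviation 1, so no single expenditure category dominates distance-based clustering purely because of its magnitude.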

In [None]:
#df.values

In [14]:
df_normalized

array([[ 0.86976757, -0.28206333,  0.81332647, ..., -0.21668055,
        -0.11521547,  0.97090572],
       [ 0.33314593, -0.28206333, -0.09551523, ..., -0.22431094,
         0.8063206 , -0.07716584],
       [-0.5245228 , -0.28206333, -0.48889447, ..., -0.2651517 ,
        -0.28287228,  0.78101095],
       ...,
       [-1.09421672, -0.28206333, -0.7788421 , ..., -0.29698133,
        -0.4767855 , -0.40922664],
       [-0.88609968, -0.28206333, -0.68982309, ..., -0.29698133,
        -0.46397749, -0.40922664],
       [-1.08461131, -0.28206333, -0.63047709, ..., -0.29698133,
        -0.4819087 , -0.40922664]])

Let's put the normalized data into a DataFrame.

In [16]:
X = df_normalized

In [17]:
# Lower-case column names, in the same order as the columns of X
columns = ['totfood', 'totalch', 'totclth', 'tothous', 'totfurn', 'tothlth',
           'tottrsp', 'totcmnq', 'totrcre', 'toteduc', 'totmisc']
df_norm_dataframe = pd.DataFrame(X, columns=columns)

In [18]:
df_norm_dataframe.head()

Unnamed: 0,totfood,totalch,totclth,tothous,totfurn,tothlth,tottrsp,totcmnq,totrcre,toteduc,totmisc
0,0.869768,-0.282063,0.813326,0.597423,0.93085,0.270093,0.381865,-0.27459,-0.216681,-0.115215,0.970906
1,0.333146,-0.282063,-0.095515,3.93793,1.783318,-0.204459,-0.270768,0.137049,-0.224311,0.806321,-0.077166
2,-0.524523,-0.282063,-0.488894,0.058295,-0.199684,-0.182889,-0.331876,-0.261965,-0.265152,-0.282872,0.781011
3,0.234851,-0.282063,1.200771,0.042361,0.10417,-0.182889,0.486971,0.487875,-0.278814,-0.187837,0.737953
4,2.104382,-0.282063,0.356362,0.286199,1.399336,6.28828,-0.007448,0.594741,-0.195824,0.020165,0.877056


MULTIPLY DATA

Here, we copy the normalized data to another DataFrame, which will be used to find the best feature sets for improving the cluster model, if applicable.

In [19]:
df_norm_dataframe_copy = df_norm_dataframe.copy()
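The reason for using `.copy()` rather than plain assignment is that the copy can then be modified without touching the original. A minimal sketch with made-up values (the frame names `original` and `working` are assumptions for illustration):

```python
import pandas as pd

# Made-up stand-in for the normalized data.
original = pd.DataFrame({"totfood": [0.87, 0.33], "totclth": [0.81, -0.10]})

working = original.copy()  # deep copy by default
working["totfood"] = 0.0   # overwrite a column in the copy only

print(original["totfood"].tolist())
print(working["totfood"].tolist())
```

The original frame keeps its values, so feature-selection experiments on the copy do not affect the baseline data.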

SAVING THE DATA

In [21]:
df_norm_dataframe.to_csv('df_norm_dataframe')

In [22]:
df_norm_dataframe_copy.to_csv('df_norm_dataframe_copy')
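One thing to be aware of when saving: `to_csv` writes the row index by default, and reading the file back then yields an extra `Unnamed: 0` column like the one dropped earlier. The sketch below, using a made-up frame and a hypothetical file name in a temporary directory, shows the difference that `index=False` makes.

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for the normalized data.
frame = pd.DataFrame({"totfood": [0.87, 0.33], "totclth": [0.81, -0.10]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example_norm.csv")  # hypothetical file name

    # Without the index: the file contains only the data columns.
    frame.to_csv(path, index=False)
    restored = pd.read_csv(path)

    # Default behaviour: the index is written as an unnamed first column.
    frame.to_csv(path)
    with_index = pd.read_csv(path)

restored_columns = list(restored.columns)
with_index_columns = list(with_index.columns)
print(restored_columns)
print(with_index_columns)
```

Whether to pass `index=False` here depends on how later steps read the files, so this is only an option to consider, not a required change.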