# Pre-Processing Data

In this script, I'm working on pre-processing my facial recognition image data. To begin, I load my previously saved DataFrame from a pickle file and check its content and structure.

I then move on to the pre-processing phase, where I first remove any rows with missing values from the DataFrame. Once I've confirmed that no null values remain, I separate the DataFrame into independent features (X) and dependent labels (y).

I then encode the labels as binary values, with 'female' mapped to 1 and 'male' mapped to 0. This binary label encoding turns the categorical labels into numbers, which is the form most machine learning algorithms expect.

Next, I normalize the independent features with min-max scaling so that all values fall between 0 and 1. Keeping every pixel on the same small scale generally helps scale-sensitive models train more stably and converge faster.

After confirming that the normalization and label mapping were applied correctly, I save the processed arrays (Xnorm, y_norm) with NumPy's np.savez function for later use in a machine learning model.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import cv2
from PIL import Image
%matplotlib inline

In [2]:
import pickle

# load the pickled DataFrame; the context manager closes the file handle afterwards
with open(r'W:\MayCooperStation\New Documents\Data Science and ML\FacialRecognition\data\dataframe_images_100_100.pickle','rb') as f:
    df = pickle.load(f)

In [3]:
df.head()

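|  | gender | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ... | 9990 | 9991 | 9992 | 9993 | 9994 | 9995 | 9996 | 9997 | 9998 | 9999 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | W:\MayCooperStation\New Documents\Data Science... | 188 | 180 | 184 | 188 | 173 | 179 | 192 | 178 | 216 | ... | 109 | 111 | 115 | 116 | 122 | 116 | 119 | 120 | 117 | 111 |
| 1 | W:\MayCooperStation\New Documents\Data Science... | 32 | 24 | 32 | 27 | 29 | 29 | 29 | 32 | 36 | ... | 58 | 47 | 27 | 33 | 22 | 28 | 22 | 36 | 62 | 17 |
| 2 | W:\MayCooperStation\New Documents\Data Science... | 22 | 30 | 39 | 36 | 30 | 61 | 11 | 17 | 10 | ... | 156 | 171 | 177 | 186 | 176 | 185 | 186 | 190 | 177 | 177 |
| 3 | W:\MayCooperStation\New Documents\Data Science... | 35 | 35 | 35 | 35 | 35 | 35 | 35 | 35 | 35 | ... | 75 | 82 | 90 | 92 | 86 | 70 | 89 | 84 | 84 | 74 |
| 4 | W:\MayCooperStation\New Documents\Data Science... | 86 | 86 | 71 | 54 | 45 | 49 | 33 | 20 | 18 | ... | 35 | 34 | 32 | 32 | 30 | 32 | 34 | 34 | 33 | 30 |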


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5460 entries, 0 to 6057
Columns: 10001 entries, gender to 9999
dtypes: object(1), uint8(10000)
memory usage: 52.2+ MB


# Data Preprocessing
- Remove missing values
- Data normalization (min-max scaling)

In [5]:
# removing missing values
df.dropna(axis=0,inplace=True)

In [6]:
df.isnull().sum()

gender    0
0         0
1         0
2         0
3         0
         ..
9995      0
9996      0
9997      0
9998      0
9999      0
Length: 10001, dtype: int64
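
With 10,001 columns, the per-column summary above is truncated in display, so a single total can be easier to read at a glance. Here is a minimal sketch on a small hypothetical frame (`toy` and its values are made up for illustration and are not part of this dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the image DataFrame
toy = pd.DataFrame({
    "gender": ["female", "male", None],
    "0": [10, 20, 30],
    "1": [5.0, np.nan, 15.0],
})

# Total number of missing cells across the whole frame
missing_before = int(toy.isnull().sum().sum())

# Drop every row that contains at least one missing value
cleaned = toy.dropna(axis=0)
missing_after = int(cleaned.isnull().sum().sum())

print(missing_before, missing_after, cleaned.shape)
```

A total of zero after `dropna` says the same thing as the column-by-column listing, just in one number.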

In [7]:
# split the data into two parts
X = df.iloc[:,1:].values # independent features (pixel values)
y = df.iloc[:,0].values # dependent variable (path ending in the gender label)

In [8]:
# Extract gender from the path
y = np.array([i.split('\\')[-1] for i in y])

In [9]:
print(np.unique(y)) # Should print 'female' and 'male'

['female' 'male']
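
The split on the backslash works because the last path component is the label. As a small sketch of the same idea, the standard library's `PureWindowsPath` can parse Windows-style paths on any operating system. The paths below are hypothetical, since the real ones are truncated in the output above:

```python
from pathlib import PureWindowsPath

# Hypothetical paths; the last component plays the role of the gender label
paths = [r"C:\data\crop\female", r"C:\data\crop\male"]

# Same approach as the notebook: split on the backslash, keep the last piece
labels_split = [p.split("\\")[-1] for p in paths]

# Alternative: let pathlib extract the final path component
labels_pathlib = [PureWindowsPath(p).name for p in paths]

print(labels_split, labels_pathlib)
```

Both approaches should give the same labels here; the `pathlib` version also tolerates forward slashes in the path.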


In [10]:
# Binary classification mapping 
# female = 1, male = 0
y_norm = np.where(y=='female',1,0)

In [11]:
print(np.unique(y_norm)) # Should print '0' and '1'

[0 1]
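
One detail worth keeping in mind is that `np.where(y=='female',1,0)` maps every value that is not exactly 'female' to 0, which is why checking `np.unique(y)` first matters. The sketch below, on hypothetical labels (`y_toy` is not the notebook's data), also counts how many samples fall into each class:

```python
import numpy as np

# Hypothetical labels, not the notebook's data
y_toy = np.array(["female", "male", "male", "female", "female"])

# Same mapping as above: 'female' -> 1, anything else -> 0
y_toy_bin = np.where(y_toy == "female", 1, 0)

# return_counts=True gives the number of samples per class
values, counts = np.unique(y_toy_bin, return_counts=True)
print(dict(zip(values.tolist(), counts.tolist())))
```

Class counts like these give a quick view of how balanced the two classes are before training.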


In [12]:
X.shape

(5460, 10000)

# Min-Max Scaling
## $X_{\text{norm}} = \frac{x - \text{minValue}}{\text{maxValue} - \text{minValue}}$

The formula calculates the normalized value Xnorm by subtracting the minimum value from x to determine the distance from the minimum. Then, it divides this distance by the range, which is the difference between the maximum and minimum values. This ensures that the normalized value falls within the range of 0 to 1.
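
To make the formula concrete, here is a minimal sketch of a general min-max scaler applied to a few hypothetical pixel values. The function name `min_max_scale` and the constant-input guard are my own assumptions for illustration, not something this notebook defines:

```python
import numpy as np

def min_max_scale(x):
    """Scale an array to [0, 1] using its global minimum and maximum."""
    x = np.asarray(x, dtype=np.float64)  # convert once so all arithmetic is in floating point
    lo, hi = x.min(), x.max()
    if hi == lo:
        # Constant input: the range is zero, so return zeros instead of dividing by 0
        return np.zeros_like(x)
    return (x - lo) / (hi - lo)

pixels = np.array([0, 51, 102, 255], dtype=np.uint8)  # hypothetical pixel values
scaled = min_max_scale(pixels)
print(scaled)

# When the minimum is 0, the formula reduces to x / max
assert np.allclose(scaled, pixels / pixels.max())
```

The final assertion shows that when the minimum is 0, the full formula and a plain division by the maximum give the same result.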

In [13]:
X.min() , X.max()

(0, 255)

In [14]:
# X.min() is 0, so (X - min) / (max - min) reduces to X / max
Xnorm = X / X.max()

In [15]:
# all values are now in the range 0 to 1
Xnorm

array([[0.7372549 , 0.70588235, 0.72156863, ..., 0.47058824, 0.45882353,
        0.43529412],
       [0.1254902 , 0.09411765, 0.1254902 , ..., 0.14117647, 0.24313725,
        0.06666667],
       [0.08627451, 0.11764706, 0.15294118, ..., 0.74509804, 0.69411765,
        0.69411765],
       ...,
       [0.09803922, 0.09803922, 0.10196078, ..., 0.11764706, 0.12156863,
        0.13333333],
       [0.08235294, 0.10588235, 0.12156863, ..., 0.07843137, 0.08627451,
        0.09803922],
       [0.01568627, 0.01176471, 0.00784314, ..., 0.35294118, 0.35294118,
        0.36470588]])

In [17]:
y

array(['female', 'female', 'female', ..., 'male', 'male', 'male'],
      dtype='<U6')

In [18]:
Xnorm.shape

(5460, 10000)

In [24]:
# Print unique values in `y` before applying np.where
print(np.unique(y))

# Now, apply np.where and print unique values in `y_norm`
y_norm = np.where(y=='female',1,0)
print(np.unique(y_norm))

['female' 'male']
[0 1]


In [26]:
# save Xnorm and y_norm to a NumPy .npz archive (positional arrays are stored as arr_0 and arr_1)
np.savez(r'W:\MayCooperStation\New Documents\Data Science and ML\FacialRecognition\data\data_10000_norm.npz',Xnorm,y_norm)
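
Because the arrays are passed to `np.savez` positionally, they are stored under the default keys `arr_0` and `arr_1`. The sketch below shows the round trip on small hypothetical arrays written to a temporary directory (`X_demo`, `y_demo`, and `demo.npz` are stand-ins, not the real data or path):

```python
import os
import tempfile
import numpy as np

# Hypothetical small arrays standing in for Xnorm and y_norm
X_demo = np.array([[0.0, 0.5], [1.0, 0.25]])
y_demo = np.array([1, 0])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.npz")

    # Positional arrays are stored under the default keys arr_0, arr_1, ...
    np.savez(path, X_demo, y_demo)

    # Load the archive back and read each array by its key
    with np.load(path) as data:
        keys = list(data.files)
        X_back = data["arr_0"]
        y_back = data["arr_1"]

print(keys)
```

Passing the arrays as keyword arguments instead, for example `np.savez(path, X=X_demo, y=y_demo)`, stores them under those names, which can make the loading code easier to read.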