## Feature Engineering, Selection and Normalization

Feature engineering creates new features or transforms existing ones so that machine learning models receive more informative inputs. It involves selecting, extracting, and transforming relevant features from the raw data to make them more useful for modeling.

### 1.0 Importing the libraries

In [68]:
#Import Library
import warnings
warnings.filterwarnings('ignore')
import os


#DataFrame Library
import pandas as pd
import numpy as np

#Visualization Library
import matplotlib.pyplot as plt
import seaborn as sns

# To save the model
import joblib

#Modeling Libraries.
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

Load the dataset from a local Excel file.

In [69]:
DATA_PATH = 'D:/Projects/CLIENTS/Toto Afya/'
df = pd.read_excel(os.path.join(DATA_PATH, 'TOTO AFYA-MBEYA 2.xlsx'))

### 2.0 Creating New Features

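### 2.1 Children's Age Grouping

Children's ages are grouped into four bands using the bin edges 0, 4, 6, 13 and 17. `pd.cut` closes each interval on the right by default, so for whole-number ages the labelled groups actually cover:

- Toddlers 0-4: ages 1 to 4
- Preschoolers: 4-5: ages 5 to 6
- School-age children: 6-12: ages 7 to 13
- Adolescents: 13-17 years old: ages 14 to 17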

In [70]:
# Create age groups using pd.cut(); intervals are right-closed by default,
# so the bins are (0, 4], (4, 6], (6, 13] and (13, 17]
bins = [0, 4, 6, 13, 17]  # Age group boundaries
labels = ['Toddlers 0-4','Preschoolers: 4-5', 'School-age children: 6-12', 'Adolescents: 13-17 years old']  # Group labels
df['age_group'] = pd.cut(df['Age'], bins=bins, labels=labels)

# Check the distribution of age groups in the dataset
print(df['age_group'].value_counts())

School-age children: 6-12       253
Preschoolers: 4-5               181
Toddlers 0-4                     85
Adolescents: 13-17 years old     55
Name: age_group, dtype: int64
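
Because the intervals are right-closed, a 6-year-old lands in "Preschoolers: 4-5" and a 13-year-old in "School-age children: 6-12". The sketch below shows one possible way to make the intervals match the label text, using `right=False` with each edge set to the lower bound of its group. The sample ages, bin edges and label names here are assumptions for illustration, not the notebook's own values.

```python
import pandas as pd

# Hypothetical sample ages, not taken from the dataset
ages = pd.Series([0, 2, 3, 5, 6, 12, 13, 17])

# Left-closed bins: [0, 3), [3, 6), [6, 13), [13, 18)
bins = [0, 3, 6, 13, 18]
labels = ['Toddlers: 0-2', 'Preschoolers: 3-5',
          'School-age children: 6-12', 'Adolescents: 13-17']

# right=False makes each interval include its lower edge and exclude its upper edge
age_group = pd.cut(ages, bins=bins, labels=labels, right=False)
print(age_group.tolist())
```

Applying this to the real data would change the group counts shown above, so the downstream cells would need to be re-run.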


### 2.2 Regions Grouping

Each child's region is mapped to an Urban, Semi-Urban or Rural area type.

In [71]:
# Create a dictionary to map regions to area types
region_to_area = {
    'Arusha': 'Semi-Urban',
    'Dodoma': 'Semi-Urban',
    'Geita': 'Rural',
    'Iringa': 'Rural',
    'Kagera': 'Rural',
    'Kigoma': 'Rural',
    'Kilimanjaro': 'Semi-Urban',
    'Mara': 'Rural',
    'Mbeya': 'Semi-Urban',
    'Morogoro': 'Rural',
    'Mwanza': 'Semi-Urban',
    'Njombe': 'Rural',
    'Pwani': 'Rural',
    'Rukwa': 'Rural',
    'Shinyanga': 'Rural',
    'Singida': 'Rural',
    'Songwe': 'Rural',
    'Tabora': 'Rural',
    'Tanga': 'Rural',
    'Unguja': 'Rural',
    'Kinondoni': 'Urban',
    'Temeke': 'Urban',
    'Ilala': 'Urban'
}

# Create a new column "Area_Type" based on the region names
df['Area_Type'] = df['Region'].map(region_to_area)
print(df['Area_Type'].value_counts())

Urban         358
Semi-Urban    111
Rural         105
Name: Area_Type, dtype: int64
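
The three counts above add up to 574, so every region in this dataset was found in the dictionary. In general, `Series.map` returns `NaN` for any key missing from the mapping, so it can be worth checking for unmapped regions. A minimal sketch on made-up data (the short mapping and the region 'Lindi' are assumptions used only to trigger the missing case):

```python
import pandas as pd

# Hypothetical mini-mapping; 'Lindi' is deliberately left out
region_to_area = {'Ilala': 'Urban', 'Mbeya': 'Semi-Urban', 'Geita': 'Rural'}
regions = pd.Series(['Ilala', 'Mbeya', 'Lindi', 'Geita'])

area_type = regions.map(region_to_area)

# Regions that were not found in the dictionary come back as NaN
unmapped = sorted(regions[area_type.isna()].unique())
print(unmapped)
```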


### 2.3 Grouping Number of Visits in 2021

Visits in the `Visits Jul 21` column are grouped into Normal (0 to 1 visits), Regular (2 to 3 visits) and Frequent (4 to 17 visits) patients.

In [72]:
# Create a new column 'Visits Group21'
bins = [0, 1, 3, 17] # define the bin edges
labels = ['Normal Patient', 'Regular Patient', 'Frequent Patient'] # define the labels for each group
df['Visits Group21'] = pd.cut(df['Visits Jul 21'], bins=bins, labels=labels, include_lowest=True)
print(df['Visits Group21'].value_counts())

Normal Patient      442
Regular Patient     119
Frequent Patient     13
Name: Visits Group21, dtype: int64


### 2.4 Grouping Number of Visits in 2022

In [73]:
# Create a new column 'Visits Group22'
bins = [0, 1, 3, 17] # define the bin edges
labels = ['Normal Patient', 'Regular Patient', 'Frequent Patient'] # define the labels for each group
df['Visits Group22'] = pd.cut(df['Visits Jul 22'], bins=bins, labels=labels, include_lowest=True)
print(df['Visits Group22'].value_counts())

Normal Patient      403
Regular Patient     150
Frequent Patient     21
Name: Visits Group22, dtype: int64
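
The 2021 and 2022 cells repeat the same binning logic. One way to keep the two in sync is a small helper, sketched below. The function name `group_visits`, the constants and the demo values are hypothetical; the bin edges and labels are the ones used above.

```python
import pandas as pd

VISIT_BINS = [0, 1, 3, 17]
VISIT_LABELS = ['Normal Patient', 'Regular Patient', 'Frequent Patient']

def group_visits(visits: pd.Series) -> pd.Series:
    """Bin visit counts into [0, 1], (1, 3] and (3, 17]."""
    return pd.cut(visits, bins=VISIT_BINS, labels=VISIT_LABELS,
                  include_lowest=True)

# Hypothetical visit counts, not taken from the dataset
demo = pd.DataFrame({'Visits Jul 21': [0, 1, 2, 4],
                     'Visits Jul 22': [1, 3, 17, 2]})

# Apply the same grouping to both years
for year in ('21', '22'):
    demo[f'Visits Group{year}'] = group_visits(demo[f'Visits Jul {year}'])
print(demo)
```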


### 3.0 Analysis of the Created DataFrame

Separate the categorical and numerical columns.

In [74]:
cats = ['Gender', 'Category', 'Ownership', 'Region','age_group','Visits Group21','Visits Group22', 'Area_Type']
nums = ['Age', 'Visits Jul 21', 'Amount Paid Jul 21', 'Visits Jul 22', 'Amount Paid Jul 22']
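
The two lists can also be derived from the column dtypes rather than typed by hand, which avoids putting a column in the wrong list. A minimal sketch on a small made-up frame (the rows are illustrative only). On the real data this would also pick up `S/n` as numeric, so it would need to be excluded separately.

```python
import pandas as pd

# Hypothetical two-row frame mirroring a few of the dataset's columns
demo = pd.DataFrame({
    'Gender': ['Female', 'Male'],
    'Age': [4, 9],
    'Amount Paid Jul 21': [28000, 61000],
    'age_group': pd.Categorical(['Toddlers 0-4', 'School-age children: 6-12']),
})

# Numeric dtypes go to nums; everything else (object, category) goes to cats
nums = demo.select_dtypes(include='number').columns.tolist()
cats = demo.select_dtypes(exclude='number').columns.tolist()
print(cats, nums)
```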

In [75]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 574 entries, 0 to 573
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   S/n                 574 non-null    int64   
 1   Gender              574 non-null    object  
 2   Age                 574 non-null    int64   
 3   Category            574 non-null    object  
 4   Ownership           574 non-null    object  
 5   Region              574 non-null    object  
 6   Visits Jul 21       574 non-null    int64   
 7   Amount Paid Jul 21  574 non-null    int64   
 8   Visits Jul 22       574 non-null    int64   
 9   Amount Paid Jul 22  574 non-null    int64   
 10  age_group           574 non-null    category
 11  Area_Type           574 non-null    object  
 12  Visits Group21      574 non-null    category
 13  Visits Group22      574 non-null    category
dtypes: category(3), int64(6), object(5)
memory usage: 51.6+ KB


Preview of the first five rows of the DataFrame.

In [83]:
df.head()

| | S/n | Gender | Age | Category | Ownership | Region | Visits Jul 21 | Amount Paid Jul 21 | Visits Jul 22 | Amount Paid Jul 22 | age_group | Area_Type | Visits Group21 | Visits Group22 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Female | 4 | Specialized Clinic (Polyclinic) | Private | Temeke | 1 | 28000 | 1 | 11600 | Toddlers 0-4 | Urban | Normal Patient | Normal Patient |
| 1 | 2 | Male | 4 | Specialized Clinic (Polyclinic) | Private | Ilala | 1 | 31950 | 2 | 46960 | Toddlers 0-4 | Urban | Normal Patient | Regular Patient |
| 2 | 3 | Male | 4 | Health Centre | Faith Based | Morogoro | 1 | 5350 | 1 | 3900 | Toddlers 0-4 | Rural | Normal Patient | Normal Patient |
| 3 | 4 | Male | 4 | Dispensary | Private | Unguja | 1 | 10100 | 3 | 18460 | Toddlers 0-4 | Rural | Normal Patient | Regular Patient |
| 4 | 5 | Male | 9 | Zonal Referral Hospital | Faith Based | Kilimanjaro | 2 | 61000 | 3 | 49700 | School-age children: 6-12 | Semi-Urban | Regular Patient | Regular Patient |


Make a copy and save it as a new dataset.

In [84]:
dfori = df.copy()
dfori.to_csv('New Cleaned Datafile.csv', index=False)
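
Note that CSV stores plain text, so the `category` dtype of columns such as `age_group` is not preserved when the file is read back. A minimal sketch of the round trip, using an in-memory buffer and a made-up one-column frame instead of the real file:

```python
import io
import pandas as pd

# Hypothetical frame with one categorical column
demo = pd.DataFrame(
    {'age_group': pd.Categorical(['Toddlers 0-4', 'Preschoolers: 4-5'])})

# Write to an in-memory CSV and read it back
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# The category dtype is lost on reload and has to be restored explicitly
restored = reloaded.astype({'age_group': 'category'})
print(demo['age_group'].dtype, reloaded['age_group'].dtype,
      restored['age_group'].dtype)
```

The restored column is unordered; if the group order matters, it would have to be set again after loading.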

## 4. Data Normalization

At this stage the categorical features are encoded and the data is normalized.

### 4.1 Feature Transformation

- Ordinal Encoding
- One hot Encoding
- Feature Encoding
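
Of the techniques listed, the encoding cell in this section applies one-hot encoding only. For groups with a natural order, such as the visit groups, ordinal encoding is an alternative. A minimal sketch, assuming the same bin edges as in section 2.3 and made-up visit counts: `pd.cut` returns an ordered categorical, and `.cat.codes` exposes the integer position of each label.

```python
import pandas as pd

# Hypothetical visit counts, not taken from the dataset
visits = pd.Series([1, 2, 5, 0])

groups = pd.cut(visits, bins=[0, 1, 3, 17],
                labels=['Normal Patient', 'Regular Patient', 'Frequent Patient'],
                include_lowest=True)

# Ordinal encoding: Normal -> 0, Regular -> 1, Frequent -> 2
codes = groups.cat.codes
print(codes.tolist())
```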

Make a copy of the data first.

In [78]:
dfori = df.copy()

In [79]:

# One-hot encode every categorical column in a single call.
# get_dummies() replaces each listed column with '<column>_<value>' indicator columns.
cat_cols = ['Gender', 'Region', 'Category', 'Ownership',
            'Area_Type', 'age_group', 'Visits Group21', 'Visits Group22']
dfori = pd.get_dummies(dfori, columns=cat_cols)

# Drop the serial-number column.
dfori = dfori.drop(columns=['S/n'])

### 4.2 Normalization (Feature Scaling)

In [80]:
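# Columns to scale: everything except the raw numeric columns,
# i.e. the one-hot indicator columns.
norm_ori = dfori.drop(columns = ['Amount Paid Jul 22','Amount Paid Jul 21','Visits Jul 22','Visits Jul 21','Age']).columns
print(norm_ori)

# Min-max scale the selected columns. MinMaxScaler scales each column independently,
# so one call covers them all. Indicator values stay 0/1 and become floats;
# Age, the visit counts and the amounts paid keep their original scale.
dfori[norm_ori] = MinMaxScaler().fit_transform(dfori[norm_ori])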

Index(['Gender_Female', 'Gender_Male', 'Region_Arusha', 'Region_Dodoma',
       'Region_Geita', 'Region_Ilala', 'Region_Iringa', 'Region_Kagera',
       'Region_Kigoma', 'Region_Kilimanjaro', 'Region_Kinondoni',
       'Region_Mara', 'Region_Mbeya', 'Region_Morogoro', 'Region_Mwanza',
       'Region_Njombe', 'Region_Pwani', 'Region_Rukwa', 'Region_Shinyanga',
       'Region_Singida', 'Region_Songwe', 'Region_Tabora', 'Region_Tanga',
       'Region_Temeke', 'Region_Unguja', 'Category_Dispensary',
       'Category_District Hospital', 'Category_Health Centre',
       'Category_National Referral Hospital', 'Category_Pharmacy',
       'Category_Regional Referral Hospital',
       'Category_Specialized Clinic (Polyclinic)',
       'Category_Specialized Clinics', 'Category_Zonal Referral Hospital',
       'Ownership_Faith Based', 'Ownership_Government', 'Ownership_Private',
       'Area_Type_Rural', 'Area_Type_Semi-Urban', 'Area_Type_Urban',
       'age_group_Toddlers 0-4', 'age_group_Prescho
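
The cell above scales only the indicator columns, so `Age` and the amount columns stay on their original scale. If those were to be scaled as well, min-max scaling maps each column to the range 0 to 1 with (x - min) / (max - min), which is the formula `MinMaxScaler` applies with its default `feature_range`. A minimal worked sketch in plain pandas on made-up values (the numbers are assumptions chosen so the arithmetic is easy to follow):

```python
import pandas as pd

# Hypothetical values, not taken from the dataset
demo = pd.DataFrame({'Age': [4, 9, 14],
                     'Amount Paid Jul 21': [5000, 30000, 55000]})

# Column-wise min-max scaling: (x - min) / (max - min)
scaled = (demo - demo.min()) / (demo.max() - demo.min())
print(scaled)
```

For example, Age 9 becomes (9 - 4) / (14 - 4) = 0.5. When a model is trained on a train/test split, the minimum and maximum would normally be taken from the training rows only.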

### 4.3 Saving the normalized file

In [81]:
dfori2 = dfori.copy()
dfori2.to_csv('Normalized Dataset.csv', index=False)

In [82]:
dfori2.head()

| | Age | Visits Jul 21 | Amount Paid Jul 21 | Visits Jul 22 | Amount Paid Jul 22 | Gender_Female | Gender_Male | Region_Arusha | Region_Dodoma | Region_Geita | ... | age_group_Toddlers 0-4 | age_group_Preschoolers: 4-5 | age_group_School-age children: 6-12 | age_group_Adolescents: 13-17 years old | Visits Group21_Normal Patient | Visits Group21_Regular Patient | Visits Group21_Frequent Patient | Visits Group22_Normal Patient | Visits Group22_Regular Patient | Visits Group22_Frequent Patient |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 1 | 28000 | 1 | 11600 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1 | 4 | 1 | 31950 | 2 | 46960 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 4 | 1 | 5350 | 1 | 3900 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 3 | 4 | 1 | 10100 | 3 | 18460 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | 9 | 2 | 61000 | 3 | 49700 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
