## Feature Engineering


#### Step 1: Import Libraries and Load Data
First, we need to import the necessary libraries and load the dataset.

In [38]:
import pandas as pd
import numpy as np

# Load a sample dataset
df = pd.DataFrame({
    'age': [25, 32, 47, 51, 62],
    'salary': [50000, 60000, 120000, 90000, 150000],
    'gender': ['male', 'female', 'female', 'male', 'female'],
    'purchase_history': ['yes', 'no', 'yes', 'no', 'yes'],
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix']
})

print("Original DataFrame:")
print(df)


Original DataFrame:
   age  salary  gender purchase_history         city
0   25   50000    male              yes     New York
1   32   60000  female               no  Los Angeles
2   47  120000  female              yes      Chicago
3   51   90000    male               no      Houston
4   62  150000  female              yes      Phoenix


In [39]:
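# Keep an independent copy of the original data; plain assignment would only create an alias
df1 = df.copy()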

#### Step 2: Handle Missing Values
Before creating new features, handle any missing values by filling or dropping them. The sample data has no missing values, so the cell below leaves it unchanged, but the same pattern applies to real datasets.

In [40]:
# Fill missing values with the mean for numerical columns.
# Assign the result back instead of calling fillna(inplace=True) on a column,
# which newer pandas versions warn about and may not apply to df.
df['age'] = df['age'].fillna(df['age'].mean())
df['salary'] = df['salary'].fillna(df['salary'].mean())

# Fill missing values with the mode (most frequent value) for categorical columns
for col in ['gender', 'purchase_history', 'city']:
    df[col] = df[col].fillna(df[col].mode()[0])
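
Because the sample data is complete, the imputation step has no visible effect. The following minimal sketch uses a hypothetical frame named demo (not part of the tutorial's dataset) with deliberately missing entries to show what mean and mode imputation do.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps; not part of the tutorial's dataset
demo = pd.DataFrame({
    'age': [25.0, np.nan, 47.0],
    'gender': ['male', None, 'male'],
})

# Count missing values per column before imputing
missing_before = demo.isna().sum()

# Mean for the numeric column, mode for the categorical one
demo['age'] = demo['age'].fillna(demo['age'].mean())
demo['gender'] = demo['gender'].fillna(demo['gender'].mode()[0])

missing_after = demo.isna().sum()
print(missing_before.to_dict())
print(missing_after.to_dict())
print(demo)
```

The missing age becomes 36.0, the mean of 25 and 47, and the missing gender becomes 'male', the most frequent value.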


#### Step 3: Create New Features
##### 3.1 Create Interaction Features
Interaction features are created by combining two or more features. For example, we can create a new feature by multiplying age and salary.

In [41]:
# Create interaction feature
df['age_salary'] = df['age'] * df['salary']


##### 3.2 Create Polynomial Features
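Polynomial features can be useful for capturing non-linear relationships in the data. With degree=2, PolynomialFeatures returns the original age and salary columns, their squares, and their product, so concatenating its output onto df repeats the age and salary columns and adds an 'age salary' column identical to age_salary.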

In [42]:
from sklearn.preprocessing import PolynomialFeatures

# Create polynomial features
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(df[['age', 'salary']])
poly_df = pd.DataFrame(poly_features, columns=poly.get_feature_names_out(['age', 'salary']))

df = pd.concat([df, poly_df], axis=1)


In [43]:
df

   age  salary  gender  purchase_history         city  age_salary  age.1  salary.1   age^2  age salary       salary^2
0   25   50000    male               yes     New York     1250000   25.0   50000.0   625.0   1250000.0   2500000000.0
1   32   60000  female                no  Los Angeles     1920000   32.0   60000.0  1024.0   1920000.0   3600000000.0
2   47  120000  female               yes      Chicago     5640000   47.0  120000.0  2209.0   5640000.0  14400000000.0
3   51   90000    male                no      Houston     4590000   51.0   90000.0  2601.0   4590000.0   8100000000.0
4   62  150000  female               yes      Phoenix     9300000   62.0  150000.0  3844.0   9300000.0  22500000000.0


##### 3.3 Create Binned Features
Binning can convert continuous variables into categorical variables.

In [44]:
df1['age']

0    25
1    32
2    47
3    51
4    62
Name: age, dtype: int64

In [46]:
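# Bin the original age column from df1, because df now holds two columns named 'age'.
# The bin edges must cover the full age range (25 to 62 here), or values outside become NaN.
df['age_bin'] = pd.cut(df1['age'], bins=[20, 30, 40, 50, 60, 70], labels=['20-30', '30-40', '40-50', '50-60', '60-70'], include_lowest=True)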

# If there are still issues, consider extending the bin edges
# df['age_bin'] = pd.cut(df['age'], bins=[0, 30, 40, 50, 60, 100], labels=['0-30', '30-40', '40-50', '50-60', '60-100'], include_lowest=True)
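
To make the edge behavior of pd.cut concrete, here is a small sketch on a hypothetical series of ages chosen to sit on or near the bin boundaries. It assumes the same bins and labels as the cell above.

```python
import pandas as pd

# Hypothetical ages picked to land on bin boundaries
ages = pd.Series([20, 30, 31, 62])
bins = [20, 30, 40, 50, 60, 70]
labels = ['20-30', '30-40', '40-50', '50-60', '60-70']

# include_lowest=True closes the first interval on the left, so 20 is binned
with_lowest = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)

# Without it, 20 sits on the open left edge of the first interval and becomes NaN
without_lowest = pd.cut(ages, bins=bins, labels=labels)

print(with_lowest.tolist())
print(without_lowest.tolist())
```

Intervals are closed on the right by default, which is why an age of exactly 30 lands in '20-30' rather than '30-40'.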


##### 3.4 Encode Categorical Features
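Categorical features need to be encoded into numerical values. We can use one-hot encoding for nominal categories. Setting drop_first=True drops the first level of each feature, which avoids perfectly collinear dummy columns.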

In [47]:
# One-hot encode categorical features
df = pd.get_dummies(df, columns=['gender', 'purchase_history', 'city', 'age_bin'], drop_first=True)
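
The effect of drop_first is easiest to see side by side. The sketch below uses a hypothetical one-column frame named people; dtype=int is passed only so the result prints as 0/1 regardless of pandas version.

```python
import pandas as pd

# Hypothetical frame used only for this comparison
people = pd.DataFrame({'gender': ['male', 'female', 'female']})

# One column per category
full = pd.get_dummies(people, columns=['gender'], dtype=int)

# The first category (alphabetically, 'female') is dropped and becomes the implicit baseline
reduced = pd.get_dummies(people, columns=['gender'], drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
```

With drop_first=True, a row of all zeros in the remaining columns stands for the dropped category.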


#### Step 4: Scale Features
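Scaling matters for algorithms that rely on distance metrics or gradient-based optimization, such as k-nearest neighbors and support vector machines; tree-based models are largely insensitive to it. We can use StandardScaler from scikit-learn, which rescales each column to zero mean and unit variance.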

In [48]:
from sklearn.preprocessing import StandardScaler

# Scale numerical features
scaler = StandardScaler()
df[['age', 'salary', 'age_salary']] = scaler.fit_transform(df[['age', 'salary', 'age_salary']])
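
The cell above fits the scaler on all rows at once. When data is split into training and test sets, a common pattern is to fit on the training rows only and reuse those statistics for new rows. The sketch below assumes a hypothetical split of the sample's age and salary values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical split: four training rows and one held-out row (age, salary)
train = np.array([[25.0, 50000.0], [32.0, 60000.0], [47.0, 120000.0], [51.0, 90000.0]])
test = np.array([[62.0, 150000.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learn mean and std from training rows only
test_scaled = scaler.transform(test)        # reuse those statistics on new rows

print(train_scaled.mean(axis=0))  # approximately 0 for each column
print(train_scaled.std(axis=0))   # approximately 1 for each column
print(test_scaled)
```

Fitting on the training rows alone keeps information about the held-out rows from leaking into the preprocessing step.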


#### Step 5: Feature Selection
Feature selection can be done using statistical methods or algorithms that provide feature importance scores. With only five rows, the importance scores below are illustrative rather than reliable estimates.

In [49]:
from sklearn.ensemble import RandomForestClassifier

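# Assuming 'purchase' is the target variable
y = [1, 0, 1, 0, 1]  # Example target variable
# Drop age_salary because it duplicates the 'age salary' polynomial column.
# Caution: X still contains purchase_history_yes, which is identical to y here,
# so its high importance score reflects target leakage rather than a real signal.
X = df.drop(columns=['age_salary'])  # Features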

# Fit RandomForest to determine feature importance
model = RandomForestClassifier()
model.fit(X, y)
feature_importance = pd.Series(model.feature_importances_, index=X.columns)
print("\nFeature Importance:")
print(feature_importance.sort_values(ascending=False))



Feature Importance:
purchase_history_yes    0.228666
salary^2                0.094645
salary                  0.094072
salary                  0.086340
age                     0.085195
age                     0.066294
age salary              0.064290
age^2                   0.052835
city_Los Angeles        0.048396
age_bin_30-40           0.041237
age_bin_50-60           0.039376
city_New York           0.024485
gender_male             0.021048
city_Houston            0.020046
age_bin_40-50           0.018900
age_bin_60-70           0.008448
city_Phoenix            0.005727
dtype: float64
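
The text also mentions statistical methods. As one possible illustration, the sketch below applies scikit-learn's SelectKBest with the ANOVA F-test (f_classif) to synthetic data from make_classification. The dataset, the choice of k=3, and the names X_demo and y_demo are assumptions made for this example only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 8 features, of which 3 carry signal about the class
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

# Score every feature with the ANOVA F-test and keep the 3 highest-scoring ones
selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X_demo, y_demo)

print(selector.get_support(indices=True))  # column indices of the kept features
print(X_selected.shape)
```

Unlike random forest importances, the F-test scores each feature on its own, so it does not account for interactions between features.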


## Encode categorical variables (e.g., one-hot encoding, label encoding)

#### Step 1: Import Libraries and Load Data
Let's start by importing the necessary libraries and creating a sample dataset.

In [50]:
import pandas as pd

# Sample dataset
data = {
    'gender': ['male', 'female', 'male', 'female', 'male'],
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'],
    'age_group': ['young', 'middle-aged', 'young', 'senior', 'middle-aged']
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)


Original DataFrame:
   gender         city    age_group
0    male     New York        young
1  female  Los Angeles  middle-aged
2    male      Chicago        young
3  female      Houston       senior
4    male      Phoenix  middle-aged


#### Step 2: One-Hot Encoding with pandas
We'll use pd.get_dummies() from pandas to perform one-hot encoding.

In [51]:
# Perform one-hot encoding on categorical columns.
# dtype=int yields 0/1 columns; recent pandas versions default to True/False.
df_encoded = pd.get_dummies(df, columns=['gender', 'city', 'age_group'], dtype=int)

print("\nOne-Hot Encoded DataFrame:")
print(df_encoded)



One-Hot Encoded DataFrame:
   gender_female  gender_male  city_Chicago  city_Houston  city_Los Angeles  \
0              0            1             0             0                 0   
1              1            0             0             0                 1   
2              0            1             1             0                 0   
3              1            0             0             1                 0   
4              0            1             0             0                 0   

   city_New York  city_Phoenix  age_group_middle-aged  age_group_senior  \
0              1             0                      0                 0   
1              0             0                      1                 0   
2              0             0                      0                 0   
3              0             0                      0                 1   
4              0             1                      1                 0   

   age_group_young  
0                1  
1                0  
2                1  
3                0  
4                0  
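
pd.get_dummies derives its columns from whatever data it is given, so new data with different categories produces different columns. As a hedged alternative, scikit-learn's OneHotEncoder remembers the categories it was fitted on. The frames train_cities and new_cities below are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data and new data; 'Boston' was never seen during fitting
train_cities = pd.DataFrame({'city': ['New York', 'Chicago', 'Houston']})
new_cities = pd.DataFrame({'city': ['Chicago', 'Boston']})

# handle_unknown='ignore' encodes unseen categories as all zeros instead of raising
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train_cities[['city']])

# The result is sparse by default; .toarray() converts it to a dense array
encoded_new = encoder.transform(new_cities[['city']]).toarray()

print(list(encoder.categories_[0]))
print(encoded_new)
```

The learned categories are stored in sorted order, so 'Chicago' maps to the first column and the unseen 'Boston' becomes a row of zeros.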

#### Step 3: Label Encoding with scikit-learn
For label encoding, we use LabelEncoder from scikit-learn.

In [52]:
from sklearn.preprocessing import LabelEncoder

# Fit a separate LabelEncoder per column so each mapping is kept
# and can be inverted later with inverse_transform
encoders = {}
for col in ['gender', 'city', 'age_group']:
    encoders[col] = LabelEncoder()
    df[f'{col}_encoded'] = encoders[col].fit_transform(df[col])

print("\nLabel Encoded DataFrame:")
print(df)



Label Encoded DataFrame:
   gender         city    age_group  gender_encoded  city_encoded  \
0    male     New York        young               1             3   
1  female  Los Angeles  middle-aged               0             2   
2    male      Chicago        young               1             0   
3  female      Houston       senior               0             1   
4    male      Phoenix  middle-aged               1             4   

   age_group_encoded  
0                  2  
1                  0  
2                  2  
3                  1  
4                  0  
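
To see how the integer codes above map back to categories, the fitted encoder can be inspected. This short sketch refits a LabelEncoder on the same age_group values for illustration.

```python
from sklearn.preprocessing import LabelEncoder

age_groups = ['young', 'middle-aged', 'young', 'senior', 'middle-aged']

encoder = LabelEncoder()
codes = encoder.fit_transform(age_groups)

# classes_ lists the categories in sorted order; each position is that category's code
print(list(encoder.classes_))
print(codes.tolist())

# inverse_transform recovers the original labels from the codes
restored = encoder.inverse_transform(codes)
print(restored.tolist())
```

The codes follow alphabetical order (middle-aged=0, senior=1, young=2), not the natural ordering of the age groups.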


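One-Hot Encoding: Creates one binary column per category; suitable for nominal (unordered) categorical data.
Label Encoding: Converts each category into an integer. LabelEncoder assigns codes in sorted (alphabetical) order, so it reflects a meaningful ranking only when the sorted order happens to match it; scikit-learn documents LabelEncoder as intended for target labels rather than input features.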
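
For ordinal features where the ranking matters, one option is scikit-learn's OrdinalEncoder with an explicit category order. The sketch below assumes the intended ranking is young < middle-aged < senior; that ordering is an assumption for illustration and is not stated in the data itself.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

groups = pd.DataFrame({
    'age_group': ['young', 'middle-aged', 'young', 'senior', 'middle-aged']
})

# Assumed ranking: young < middle-aged < senior
order = [['young', 'middle-aged', 'senior']]
encoder = OrdinalEncoder(categories=order)

# fit_transform expects a 2D input and returns a 2D float array
groups['age_group_ordinal'] = encoder.fit_transform(groups[['age_group']]).ravel()

print(groups)
```

Here young maps to 0.0, middle-aged to 1.0, and senior to 2.0, so the integer codes increase with the assumed ranking.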