
**Feature engineering** is a crucial step in the machine learning workflow because it can significantly affect the performance of predictive models.

Before diving into specific techniques, it's important to note that the goal of feature engineering is to create features that are informative, relevant, and *non*-redundant. Informative features provide useful information to the model, relevant features are related to the target variable, and non-redundant features don't overlap in terms of the information they provide.

Now let's explore some common feature engineering techniques:

1. **Scaling and normalization**: This technique brings features onto a comparable range. It matters because many machine learning algorithms, such as distance-based and gradient-based methods, are sensitive to the scale of features. Common techniques include min-max scaling and z-score normalization; a log transformation is often used alongside them to reduce skew.

2. **One-hot encoding**: One-hot encoding is used to convert categorical features into binary features that can be used by machine learning algorithms. In this technique, each category is represented by a binary feature, with a value of 1 indicating the presence of the category and 0 indicating its absence.

3. **Feature selection**: This technique involves selecting a subset of the most important features for the model. It reduces the dimensionality of the feature space, which can lead to faster and sometimes more accurate models. Common feature selection techniques include correlation-based feature selection, recursive feature elimination, and L1 (lasso) regularization. Principal component analysis, by contrast, builds new features and belongs under feature extraction.

4. **Feature extraction**: Feature extraction involves creating new features from existing ones. This technique is useful when the existing features are not informative enough or when there are too many features. Common feature extraction techniques include principal component analysis, linear discriminant analysis, and non-negative matrix factorization.

5. **Text preprocessing**: When working with text data, it's important to preprocess the data to extract relevant information. This may involve techniques such as tokenization, stemming, lemmatization, and stop-word removal.

6. **Time-series feature engineering**: When working with time-series data, it's important to create features that capture the temporal patterns in the data. This may involve creating lagged features, rolling statistics, and trend indicators.

7. **Image feature engineering**: When working with image data, it's important to create features that capture the visual patterns in the data. This may involve techniques such as edge detection, texture analysis, and feature extraction using deep learning models.

These are just a few examples of the many feature engineering techniques available. It's important to choose the appropriate techniques based on the characteristics of the data and the specific requirements of the predictive model.
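
As a rough illustration of the first technique in the list, the sketch below applies min-max scaling, z-score normalization, and a log transformation to a single column using plain pandas and NumPy. The `income` column and its values are made up for the example and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy feature; the values are invented for illustration.
df_scale = pd.DataFrame({'income': [20.0, 40.0, 60.0, 100.0]})
col = df_scale['income']

# Min-max scaling: map the column onto the [0, 1] range.
df_scale['income_minmax'] = (col - col.min()) / (col.max() - col.min())

# Z-score normalization: zero mean and unit (population) standard deviation.
df_scale['income_zscore'] = (col - col.mean()) / col.std(ddof=0)

# Log transformation: compresses large values; log1p computes log(1 + x).
df_scale['income_log'] = np.log1p(col)

print(df_scale)
```

In a real workflow the minimum, maximum, mean, and standard deviation would normally be computed on the training data only and then reused on new data.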

In [None]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

d = {
    'A': [1, 2, None, 4],
    'B': [5, 6, 7, None],
    'C': [None, 8, 9, 10],
}

df = pd.DataFrame(d)
print(df)

     A    B     C
0  1.0  5.0   NaN
1  2.0  6.0   8.0
2  NaN  7.0   9.0
3  4.0  NaN  10.0


In [None]:
# Drop every row that contains at least one missing value
df_without_missing = df.dropna()
print(df_without_missing)

     A    B    C
1  2.0  6.0  8.0


In [None]:
# Imputation: replace missing values with estimates such as the mean, median, or mode.
# Here each column's missing entries are filled with that column's mean.

df_imputed = df.fillna(df.mean())
print(df_imputed)

          A    B     C
0  1.000000  5.0   9.0
1  2.000000  6.0   8.0
2  2.333333  7.0   9.0
3  4.000000  6.0  10.0
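
The cell above fills gaps with the column mean. Median and mode imputation follow the same pattern; the sketch below rebuilds the same toy frame under a new name (`df_num`, chosen here for illustration) so it can run on its own.

```python
import pandas as pd

# Same toy values as the frame used in the cells above.
df_num = pd.DataFrame({
    'A': [1, 2, None, 4],
    'B': [5, 6, 7, None],
    'C': [None, 8, 9, 10],
})

# Median imputation: less sensitive to outliers than the mean.
df_median = df_num.fillna(df_num.median())

# Mode imputation: mode() can return several rows when values tie,
# so take the first row (the smallest tied value in each column).
df_mode = df_num.fillna(df_num.mode().iloc[0])

print(df_median)
print(df_mode)
```

Because every value in this tiny frame is unique, the "mode" is simply the smallest value in each column; mode imputation is more meaningful for categorical or heavily repeated data.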


In [None]:
# Drop every column that contains at least one missing value.
# Each column in this frame has one, so no columns remain.
df_C_without_missing = df.dropna(axis=1)
print(df_C_without_missing)

Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]


In [None]:
import pandas as pd

data = {'Color': ['Red', 'Blue', None, 'Red', 'Blue', 'Green', 'Red'],
        'Size': ['Small', 'Medium', 'Large', None, 'Small', 'Large', 'Small']}
df = pd.DataFrame(data)
print(df)


   Color    Size
0    Red   Small
1   Blue  Medium
2   None   Large
3    Red    None
4   Blue   Small
5  Green   Large
6    Red   Small


In [None]:
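# Create one indicator column per category.
# A row with a missing value gets 0 in every indicator of that variable.
indicator_variables = pd.get_dummies(df)
print(indicator_variables)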

   Color_Blue  Color_Green  Color_Red  Size_Large  Size_Medium  Size_Small
0           0            0          1           0            0           1
1           1            0          0           0            1           0
2           0            0          0           1            0           0
3           0            0          1           0            0           0
4           1            0          0           0            0           1
5           0            1          0           1            0           0
6           0            0          1           0            0           1


In [None]:
import pandas as pd
from sklearn.impute import SimpleImputer

data = {'Color': ['Red', 'Blue', None, 'Red', 'Blue', 'green'],
        'Size': ['Small', 'Medium', 'Large', None, 'Small', 'Large']}
df = pd.DataFrame(data)
# get_dummies encodes a missing value as all zeros, so no NaN is left to impute
indicator_variables = pd.get_dummies(df)
# With nothing missing, the imputer returns the indicators unchanged (as a NumPy array)
impute = SimpleImputer(strategy="most_frequent")
imputed_values = impute.fit_transform(indicator_variables)
print(imputed_values)

[[0 1 0 0 0 1]
 [1 0 0 0 1 0]
 [0 0 0 1 0 0]
 [0 1 0 0 0 0]
 [1 0 0 0 0 1]
 [0 0 1 1 0 0]]


In [None]:
import pandas as pd
from sklearn.impute import SimpleImputer

# Same steps as the previous cell, with the result wrapped back into a labeled DataFrame
data = {'Color': ['Red', 'Blue', None, 'Red', 'Blue', 'green'],
        'Size': ['Small', 'Medium', 'Large', None, 'Small', 'Large']}
df = pd.DataFrame(data)
indicator_variables = pd.get_dummies(df)
impute = SimpleImputer(strategy="most_frequent")
imputed_values = impute.fit_transform(indicator_variables)
imputed_df = pd.DataFrame(imputed_values, columns=indicator_variables.columns)
print(imputed_df)

   Color_Blue  Color_Red  Color_green  Size_Large  Size_Medium  Size_Small
0           0          1            0           0            0           1
1           1          0            0           0            1           0
2           0          0            0           1            0           0
3           0          1            0           0            0           0
4           1          0            0           0            0           1
5           0          0            1           1            0           0


**2.2 Dealing with Categorical Variables**




**One-hot encoding**

In this approach, each category of a categorical variable is transformed into a binary feature. For example, if a variable "color" has categories "red," "blue," and "green," it would be encoded as three separate binary features: "color_red," "color_blue," and "color_green."


In [2]:
import pandas as pd
data = {'color': ['red', 'blue', 'green', 'red', 'blue'],
        'size': ['S', 'M', 'L', 'S', 'L']}

df = pd.DataFrame(data)

# One binary column per category of each variable
one_hot_encoded = pd.get_dummies(df)
print(one_hot_encoded)

   color_blue  color_green  color_red  size_L  size_M  size_S
0           0            0          1       0       0       1
1           1            0          0       0       1       0
2           0            1          0       1       0       0
3           0            0          1       0       0       1
4           1            0          0       1       0       0
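
Two practical details often come up with one-hot encoding: dropping one redundant indicator per variable, and keeping the columns consistent when new data contains categories the training data never had. The sketch below shows one way to handle both with pandas; the `new` frame and its `'yellow'` category are hypothetical.

```python
import pandas as pd

train = pd.DataFrame({'color': ['red', 'blue', 'green', 'red', 'blue']})
# Hypothetical new data containing a category ('yellow') absent from training.
new = pd.DataFrame({'color': ['green', 'yellow']})

# drop_first=True removes one indicator per variable (here 'color_blue'),
# which avoids perfectly collinear columns in linear models.
train_encoded = pd.get_dummies(train, drop_first=True, dtype=int)

# Encode the new data, then align it to the training columns:
# unseen categories are dropped and absent columns are filled with 0.
new_encoded = pd.get_dummies(new, dtype=int).reindex(
    columns=train_encoded.columns, fill_value=0
)

print(train_encoded)
print(new_encoded)
```

One caveat of this design: with `drop_first=True`, an unseen category such as `'yellow'` ends up as all zeros, which is the same encoding as the dropped baseline category.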


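**Label encoding** assigns a unique integer to each category of a categorical variable. This introduces an implicit ordering among categories, which may mislead the model when the categories have no natural order. It suits ordinal variables only when the assigned integers follow the real order; scikit-learn's `LabelEncoder` numbers classes in sorted order and is intended for encoding target labels.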


In [4]:
from sklearn.preprocessing import LabelEncoder

colors = ['Red', 'Blue', 'Green', 'Red', 'Green']
label_encoder = LabelEncoder()
# Classes are numbered in sorted order: Blue=0, Green=1, Red=2
encoded_colors = label_encoder.fit_transform(colors)
print(encoded_colors)

[2 0 1 2 1]
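
For a genuinely ordinal variable, the integer codes should follow the real order rather than the alphabetical one. A minimal sketch using an ordered pandas `Categorical` is shown below; the size values and the order `S < M < L` are assumptions made for this example.

```python
import pandas as pd

sizes = pd.Series(['S', 'M', 'L', 'S', 'L'])

# Declare the intended order explicitly; an alphabetical encoder would
# instead produce L=0, M=1, S=2, which reverses the natural ordering.
size_order = ['S', 'M', 'L']
ordered_sizes = pd.Categorical(sizes, categories=size_order, ordered=True)

# .codes gives the position of each value in size_order.
size_codes = pd.Series(ordered_sizes.codes, index=sizes.index)
print(size_codes.tolist())  # prints [0, 1, 2, 0, 2]
```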


**Target Encoding:**

Target encoding replaces each category with the mean target value of that category. This technique can be useful when there is a correlation between the categorical variable and the target variable.

In the code below, the `KFold` class from `sklearn.model_selection` creates a `kf` object that splits the data into folds. The `n_splits` parameter is set to 5, so the six rows are split into five folds. The `shuffle` parameter is set to `True` to randomize the sample indices before splitting, and `random_state=42` makes that shuffle reproducible.

The code then uses a `for` loop to iterate over the train and validation indices generated by `kf.split(data)`. In each iteration, the category means are computed from the training rows only and then mapped onto the validation rows, so a row's encoded value never uses its own target.

In [9]:
import pandas as pd
from sklearn.model_selection import KFold

data = pd.DataFrame({
    'category': ['A', 'B', 'A', 'C', 'B', 'C'],
    'target': [1, 0, 1, 0, 1, 0]
})

data['category_target_encoded'] = 0
# Perform target encoding using K-fold cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_index, val_index in kf.split(data):
    train_data = data.iloc[train_index]
    # Copy the validation rows so the assignment below does not act on a slice of `data`
    val_data = data.iloc[val_index].copy()

    # Calculate the mean target value for each category in the training set
    category_mean = train_data.groupby('category')['target'].mean()

    # Map the mean target values to the corresponding categories in the validation set
    val_data['category_target_encoded'] = val_data['category'].map(category_mean)

    # Update the target-encoded values in the main dataset
    data.iloc[val_index] = val_data
print(data)


  category  target  category_target_encoded
0        A       1                        1
1        B       0                        1
2        A       1                        1
3        C       0                        0
4        B       1                        0
5        C       0                        0


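
With only a couple of rows per category, the category means above swing between 0 and 1. One common variation, not used in the original cell, is to blend each category mean with the global mean. The sketch below assumes a smoothing strength `m` (a hypothetical parameter chosen for illustration) and, for brevity, computes the statistics on the whole frame; in practice they would be computed on the training fold only, as in the loop above.

```python
import pandas as pd

data_te = pd.DataFrame({
    'category': ['A', 'B', 'A', 'C', 'B', 'C'],
    'target': [1, 0, 1, 0, 1, 0],
})

# Hypothetical smoothing strength: acts like m pseudo-observations at the global mean.
m = 2
global_mean = data_te['target'].mean()

# Per-category row count and mean target.
stats = data_te.groupby('category')['target'].agg(['count', 'mean'])

# Blend each category mean with the global mean, weighted by category size.
smoothed = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)

# Categories missing from `smoothed` would map to NaN; fall back to the global mean.
data_te['category_smoothed'] = data_te['category'].map(smoothed).fillna(global_mean)
print(data_te)
```

With `m = 2`, category A moves from a raw mean of 1.0 to 0.75 and category C from 0.0 to 0.25, while B stays at the global mean of 0.5.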

**Frequency Encoding:**

Frequency encoding replaces each category with its frequency in the dataset. It can capture information about the distribution of categories, especially when certain categories occur more frequently than others.

In [10]:
import pandas as pd

# Sample dataset with a categorical feature 'category'
data = pd.DataFrame({
    'category': ['A', 'B', 'A', 'C', 'B', 'C']
})
# Count how often each category occurs in the dataset
category_counts = data['category'].value_counts()
# Store each row's category count divided by the number of rows (relative frequency)
data['category_frequency_encoded'] = data['category'].map(category_counts) / len(data)
print(data)

  category  category_frequency_encoded
0        A                    0.333333
1        B                    0.333333
2        A                    0.333333
3        C                    0.333333
4        B                    0.333333
5        C                    0.333333
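
Every category in the sample above occurs twice, so all rows receive the same value. The sketch below uses a hypothetical imbalanced sample to show the encoding separating categories, and relies on `value_counts(normalize=True)` to get relative frequencies directly.

```python
import pandas as pd

# Hypothetical imbalanced sample, so the encoded values differ by category.
data_freq = pd.DataFrame({'category': ['A', 'A', 'A', 'B', 'B', 'C', 'A', 'B']})

# normalize=True returns relative frequencies instead of raw counts.
category_freq = data_freq['category'].value_counts(normalize=True)

# Replace each category with its relative frequency.
data_freq['category_frequency_encoded'] = data_freq['category'].map(category_freq)
print(data_freq)
```

Here A maps to 0.5, B to 0.375, and C to 0.125. Note that two categories with the same frequency become indistinguishable under this encoding.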
