Feature engineering is a crucial step in the machine learning workflow because the way features are chosen and constructed can have a significant impact on the performance of predictive models.

Before diving into specific techniques, it's important to note that the goal of feature engineering is to create features that are informative, relevant, and *non*-redundant. Informative features provide useful information to the model, relevant features are related to the target variable, and non-redundant features don't overlap in terms of the information they provide.

Now let's explore some common feature engineering techniques:

1. **Scaling and normalization**: These techniques bring features onto a comparable range. This matters because many machine learning algorithms, such as distance-based methods and models trained with gradient descent, are sensitive to the scale of their inputs. Common techniques include min-max scaling, z-score normalization, and log transformation.

2. **One-hot encoding**: One-hot encoding is used to convert categorical features into binary features that can be used by machine learning algorithms. In this technique, each category is represented by a binary feature, with a value of 1 indicating the presence of the category and 0 indicating its absence.

3. **Feature selection**: This technique involves selecting a subset of the most important features for the model. Reducing the dimensionality of the feature space can speed up training and, by removing noisy or redundant inputs, can improve accuracy. Common feature selection techniques include correlation-based feature selection, recursive feature elimination, and L1 (lasso) regularization. Principal component analysis also reduces dimensionality, but it builds new features rather than selecting existing ones, so it is listed under feature extraction.

4. **Feature extraction**: Feature extraction involves creating new features from existing ones. This technique is useful when the existing features are not informative enough or when there are too many features. Common feature extraction techniques include principal component analysis, linear discriminant analysis, and non-negative matrix factorization.

5. **Text preprocessing**: When working with text data, it's important to preprocess the data to extract relevant information. This may involve techniques such as tokenization, stemming, lemmatization, and stop-word removal.

6. **Time-series feature engineering**: When working with time-series data, it's important to create features that capture the temporal patterns in the data. This may involve creating lagged features, rolling statistics, and trend indicators.

7. **Image feature engineering**: When working with image data, it's important to create features that capture the visual patterns in the data. This may involve techniques such as edge detection, texture analysis, and feature extraction using deep learning models.
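
As a minimal sketch of the first item in the list, the snippet below applies min-max scaling, z-score normalization, and a log transformation to one small column using only pandas and NumPy. The column name `income` and its values are hypothetical and chosen purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one numeric feature with a wide range
df_scale = pd.DataFrame({'income': [10.0, 20.0, 30.0, 40.0, 50.0]})
col = df_scale['income']

# Min-max scaling: map values linearly onto [0, 1]
df_scale['income_minmax'] = (col - col.min()) / (col.max() - col.min())

# Z-score normalization: zero mean, unit (population) standard deviation
df_scale['income_zscore'] = (col - col.mean()) / col.std(ddof=0)

# Log transformation: log(1 + x) compresses large values and is defined at x = 0
df_scale['income_log'] = np.log1p(col)

print(df_scale)
```

In a real workflow the minimum, maximum, mean, and standard deviation would normally be computed on the training split only and then reused for validation and test data.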

These are just a few examples of the many feature engineering techniques available. It's important to choose the appropriate techniques based on the characteristics of the data and the specific requirements of the predictive model.
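
To illustrate the time-series item from the list, here is a small sketch that builds a lagged feature, a rolling statistic, and a simple trend indicator with pandas. The `sales` column, its values, and the dates are made up for the example.

```python
import pandas as pd

# Hypothetical daily sales figures
ts = pd.DataFrame(
    {'sales': [10.0, 12.0, 13.0, 15.0, 18.0]},
    index=pd.date_range('2024-01-01', periods=5, freq='D'),
)

# Lagged feature: the previous day's value
ts['sales_lag_1'] = ts['sales'].shift(1)

# Rolling statistic: mean over the current and the two previous days
ts['sales_roll_mean_3'] = ts['sales'].rolling(window=3).mean()

# Simple trend indicator: day-over-day change
ts['sales_diff_1'] = ts['sales'].diff()

print(ts)
```

The first rows contain NaN because no earlier observations exist for them. If the current day's value is the prediction target, the rolling window and the difference would need to be shifted by one step so that they use past values only.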

In [1]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

d = {
    'A': [1, 2, None, 4],
    'B': [5, 6, 7, None],
    'C': [None, 8, 9, 10],
}

df = pd.DataFrame(d)
print(df)

     A    B     C
0  1.0  5.0   NaN
1  2.0  6.0   8.0
2  NaN  7.0   9.0
3  4.0  NaN  10.0


In [3]:
# Drop every row that contains at least one missing value
df_without_missing = df.dropna()
print(df_without_missing)

     A    B    C
1  2.0  6.0  8.0


In [6]:
# Imputation: missing values can be replaced with estimated values
# (mean, median, or mode imputation); here each column's mean is used

df_imputed=df.fillna(df.mean())
print(df_imputed)

          A    B     C
0  1.000000  5.0   9.0
1  2.000000  6.0   8.0
2  2.333333  7.0   9.0
3  4.000000  6.0  10.0
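
The comment above also mentions median imputation. The following sketch shows one way it might look on the same toy data, rebuilt here so the block runs on its own. The names `df_num`, `df_median`, and `was_missing` are illustrative, and recording a missingness flag is an optional extra step rather than part of the original notebook.

```python
import pandas as pd

# Same toy data as above, rebuilt so the block is self-contained
df_num = pd.DataFrame({
    'A': [1, 2, None, 4],
    'B': [5, 6, 7, None],
    'C': [None, 8, 9, 10],
})

# Median imputation: less sensitive to outliers than the mean
df_median = df_num.fillna(df_num.median())

# Optional: record where values were missing before they were filled
was_missing = df_num.isna().add_suffix('_was_missing')

print(df_median)
print(was_missing)
```

Keeping the flags can help a model distinguish observed values from filled-in ones, at the cost of extra columns.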


In [4]:
# Drop every column that contains at least one missing value;
# each column here has one, so the result is an empty DataFrame
df_C_without_missing=df.dropna(axis=1)
print(df_C_without_missing)

Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]
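
Dropping every column with a missing value removes all of the data in this case. As a hedged alternative, `dropna` accepts a `thresh` argument (minimum number of non-missing values required) and a `subset` argument (which labels to check). The threshold of 3 and the choice of column `'A'` below are arbitrary values picked for this toy example.

```python
import pandas as pd

# Same toy data as above, rebuilt so the block is self-contained
df_num = pd.DataFrame({
    'A': [1, 2, None, 4],
    'B': [5, 6, 7, None],
    'C': [None, 8, 9, 10],
})

# Keep only columns with at least 3 non-missing values (all three qualify here)
df_cols_kept = df_num.dropna(axis=1, thresh=3)

# Drop a row only when the value in column 'A' is missing
df_rows_kept = df_num.dropna(subset=['A'])

print(df_cols_kept)
print(df_rows_kept)
```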


In [18]:
import pandas as pd
from sklearn.impute import SimpleImputer

data = {'Color': ['Red', 'Blue', None, 'Red', 'Blue', 'Green', 'Red'],
        'Size': ['Small', 'Medium', 'Large', None, 'Small', 'Large', 'Small']}
df = pd.DataFrame(data)
print(df)


   Color    Size
0    Red   Small
1   Blue  Medium
2   None   Large
3    Red    None
4   Blue   Small
5  Green   Large
6    Red   Small


In [19]:
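# One-hot encode both columns; a row with a missing value gets 0
# in every indicator column of that feature
indicator_variables = pd.get_dummies(df)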
print(indicator_variables)

   Color_Blue  Color_Green  Color_Red  Size_Large  Size_Medium  Size_Small
0           0            0          1           0            0           1
1           1            0          0           0            1           0
2           0            0          0           1            0           0
3           0            0          1           0            0           0
4           1            0          0           0            0           1
5           0            1          0           1            0           0
6           0            0          1           0            0           1
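
In the output above, rows 2 and 3 are all zeros for `Color` and `Size` respectively, so the missing values are encoded only implicitly. One possible way to make them explicit is the `dummy_na` option of `pd.get_dummies`, sketched below on the same data. The name `df_cat` is illustrative, and `dtype=int` is passed only to keep the 0/1 display across pandas versions.

```python
import pandas as pd

# Same categorical data as above, rebuilt so the block is self-contained
df_cat = pd.DataFrame({
    'Color': ['Red', 'Blue', None, 'Red', 'Blue', 'Green', 'Red'],
    'Size': ['Small', 'Medium', 'Large', None, 'Small', 'Large', 'Small'],
})

# dummy_na=True adds one extra indicator column per feature for missing values
encoded = pd.get_dummies(df_cat, dummy_na=True, dtype=int)
print(encoded)
```

With the extra columns, every row has exactly one active indicator per original feature, including the rows where the value was missing.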


In [20]:
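import pandas as pd
from sklearn.impute import SimpleImputer

data = {'Color': ['Red', 'Blue', None, 'Red', 'Blue', 'green'],
        'Size': ['Small', 'Medium', 'Large', None, 'Small', 'Large']}
df = pd.DataFrame(data)
# get_dummies encodes a missing value as 0 in every indicator column,
# so no NaN remains for the imputer to fill and the array printed below
# is identical to the one-hot encoding itself
indicator_variables = pd.get_dummies(df)
impute = SimpleImputer(strategy="most_frequent")
imputed_values = impute.fit_transform(indicator_variables)
print(imputed_values)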

[[0 1 0 0 0 1]
 [1 0 0 0 1 0]
 [0 0 0 1 0 0]
 [0 1 0 0 0 0]
 [1 0 0 0 0 1]
 [0 0 1 1 0 0]]


In [21]:
import pandas as pd
from sklearn.impute import SimpleImputer

data = {'Color': ['Red', 'Blue', None, 'Red', 'Blue', 'green'],
        'Size': ['Small', 'Medium', 'Large', None, 'Small', 'Large']}
df = pd.DataFrame(data)
# As before, the indicator columns contain no NaN, so the imputer
# leaves them unchanged (rows 2 and 3 stay all-zero for one feature)
indicator_variables = pd.get_dummies(df)
impute = SimpleImputer(strategy="most_frequent")
imputed_values = impute.fit_transform(indicator_variables)
# Wrap the NumPy array in a DataFrame to restore the column names
imputed_df = pd.DataFrame(imputed_values, columns=indicator_variables.columns)
print(imputed_df)

   Color_Blue  Color_Red  Color_green  Size_Large  Size_Medium  Size_Small
0           0          1            0           0            0           1
1           1          0            0           0            1           0
2           0          0            0           1            0           0
3           0          1            0           0            0           0
4           1          0            0           0            0           1
5           0          0            1           1            0           0
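
In the result above, row 2 still has no active `Color` column and row 3 has no active `Size` column, because the missing values were already turned into zeros before the imputer ran. A minimal sketch of the opposite order, imputing first and encoding afterwards, is shown below. It swaps `SimpleImputer` for plain pandas (`mode` and `fillna`) to stay dependency-free, and it uses the earlier seven-row data so that each column has a single most frequent value. The names `df_cat`, `df_filled`, and `encoded_filled` are illustrative.

```python
import pandas as pd

# Seven-row categorical data from the earlier cell
df_cat = pd.DataFrame({
    'Color': ['Red', 'Blue', None, 'Red', 'Blue', 'Green', 'Red'],
    'Size': ['Small', 'Medium', 'Large', None, 'Small', 'Large', 'Small'],
})

# Fill each column's missing entries with that column's most frequent value
most_frequent = df_cat.mode().iloc[0]
df_filled = df_cat.fillna(most_frequent)

# Encode only after imputation, so every row has one category per feature
encoded_filled = pd.get_dummies(df_filled, dtype=int)
print(encoded_filled)
```

When several categories tie for most frequent, `mode().iloc[0]` simply takes the first one in sorted order, so a deliberate tie-breaking rule may be preferable on real data.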
