# Steps in Feature Engineering
The details of feature engineering vary with the specific problem, but the general steps are:

- Data Cleaning: Identify and correct errors or inconsistencies in the dataset to ensure data quality and reliability.
- Data Transformation: Transform raw data into a format suitable for modeling including scaling, normalization and encoding.
- Feature Extraction: Create new features by combining or deriving information from existing ones to provide more meaningful input to the model.
- Feature Selection: Choose the most relevant features for the model using techniques like correlation analysis, mutual information and stepwise regression.
- Feature Iteration: Continuously refine features based on model performance by adding, removing or modifying features for improvement.
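
The steps above can be sketched end to end on a small table. This is a minimal illustration, not a prescribed workflow: the `income`, `debt` and `defaulted` columns and their values are hypothetical, and correlation with the target stands in for the selection techniques listed. The ratio feature is derived before scaling so that it is computed in the original units.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one duplicated row and one missing income value.
raw = pd.DataFrame({
    "income": [40_000, 52_000, None, 61_000, 61_000, 75_000, 30_000, 88_000],
    "debt": [10_000, 5_000, 8_000, 30_000, 30_000, 7_000, 20_000, 9_000],
    "defaulted": [0, 0, 0, 1, 1, 0, 1, 0],
})

# Data cleaning: drop exact duplicates, then fill the missing income with the median.
clean = raw.drop_duplicates().reset_index(drop=True)
clean["income"] = clean["income"].fillna(clean["income"].median())

# Feature extraction: derive a ratio from two existing columns.
clean["debt_to_income"] = clean["debt"] / clean["income"]

# Data transformation: standardize each feature to zero mean and unit variance.
feature_cols = ["income", "debt", "debt_to_income"]
scaled = pd.DataFrame(
    StandardScaler().fit_transform(clean[feature_cols]),
    columns=feature_cols,
)

# Feature selection: rank features by absolute correlation with the target.
scores = scaled.corrwith(clean["defaulted"]).abs().sort_values(ascending=False)
print(scores)
```

Feature iteration would then mean rerunning this sketch with features added, removed or modified and comparing model performance each time.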

## Common Techniques in Feature Engineering

### 1. One-Hot Encoding

One-hot encoding converts a categorical variable into one binary indicator column per category, so that models that require numeric input can use it.

In [3]:
import pandas as pd

data = {'Color': ['Red', 'Blue', 'Green', 'Blue']}
df = pd.DataFrame(data)

df_encoded = pd.get_dummies(df, columns=['Color'], prefix='Color')

df_encoded

|   | Color_Blue | Color_Green | Color_Red |
|---|------------|-------------|-----------|
| 0 | False | False | True |
| 1 | True | False | False |
| 2 | False | True | False |
| 3 | True | False | False |


### 2. Binning

Binning groups the values of a continuous variable into discrete intervals (bins), turning it into a categorical feature.

In [5]:
import pandas as pd

data = {'Age': [23, 45, 18, 34, 67, 50, 21]}
df = pd.DataFrame(data)

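# With right=True (the pandas default) each bin includes its upper edge, so the
# intervals are [0, 20], (20, 40], (40, 60], (60, 100] and line up with the labels.
# Ages above 100 fall outside the last edge and would become NaN.
bins = [0, 20, 40, 60, 100]
labels = ['0-20', '21-40', '41-60', '61+']

df['Age_Group'] = pd.cut(df['Age'], bins=bins, labels=labels,
                         right=True, include_lowest=True)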

df

|   | Age | Age_Group |
|---|-----|-----------|
| 0 | 23 | 21-40 |
| 1 | 45 | 41-60 |
| 2 | 18 | 0-20 |
| 3 | 34 | 21-40 |
| 4 | 67 | 61+ |
| 5 | 50 | 41-60 |
| 6 | 21 | 21-40 |
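
Fixed bin edges can leave some bins nearly empty when the data is skewed. As a possible alternative, the sketch below uses `pd.qcut`, which chooses the edges from the data's quantiles so that each bin holds roughly the same number of rows. The choice of four bins and the `Q1` to `Q4` labels are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({"Age": [23, 45, 18, 34, 67, 50, 21]})

# qcut derives the bin edges from quantiles of the data (here quartiles),
# rather than from edges chosen by hand as pd.cut does.
df["Age_Quartile"] = pd.qcut(df["Age"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])

print(df)
print(df["Age_Quartile"].value_counts().sort_index())
```

With only seven rows the bins cannot be exactly equal in size, but every row is assigned to one of the four groups.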


### 3. Text Data Preprocessing

Text preprocessing removes stop words, stems the remaining words and vectorizes the result so that text can be used as input to machine learning models.

In [6]:
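import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

# The stop-word list is a separate NLTK corpus; download it once before first use.
nltk.download('stopwords', quiet=True)

texts = ["This is a sample sentence.", "Text data preprocessing is important."]

stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
vectorizer = CountVectorizer()


def preprocess_text(text):
    # split() only splits on whitespace, so trailing punctuation stays attached
    # to a token ("sentence.") and the stemmer leaves such tokens unchanged apart
    # from lowercasing. CountVectorizer drops the punctuation later.
    words = text.split()
    # Drop stop words (case-insensitively), then stem what remains.
    words = [stemmer.stem(word)
             for word in words if word.lower() not in stop_words]
    return " ".join(words)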


cleaned_texts = [preprocess_text(text) for text in texts]

X = vectorizer.fit_transform(cleaned_texts)

print("Cleaned Texts:", cleaned_texts)
print("Vectorized Text:", X.toarray())

Cleaned Texts: ['sampl sentence.', 'text data preprocess important.']
Vectorized Text: [[0 0 0 1 1 0]
 [1 1 1 0 0 1]]


### 4. Feature Splitting

Feature splitting divides a single feature into multiple sub-features, such as an address into street, city and ZIP code, so that the model can use each component separately.

In [8]:
import pandas as pd

data = {'Full_Address': [
    '123 Elm St, Springfield, 12345', '456 Oak Rd, Shelbyville, 67890']}
df = pd.DataFrame(data)

df[['Street', 'City', 'Zipcode']] = df['Full_Address'].str.extract(
    r'([0-9]+\s[\w\s]+),\s([\w\s]+),\s(\d+)')

df

|   | Full_Address | Street | City | Zipcode |
|---|--------------|--------|------|---------|
| 0 | 123 Elm St, Springfield, 12345 | 123 Elm St | Springfield | 12345 |
| 1 | 456 Oak Rd, Shelbyville, 67890 | 456 Oak Rd | Shelbyville | 67890 |
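
When the delimiter is consistent, a regular expression is not strictly needed. The sketch below splits the same addresses with `str.split` and then splits the street further into a house number and a street name. The `House_Number` and `Street_Name` columns are hypothetical additions for illustration, and the approach assumes every address has exactly three comma-separated parts.

```python
import pandas as pd

df = pd.DataFrame({"Full_Address": [
    "123 Elm St, Springfield, 12345",
    "456 Oak Rd, Shelbyville, 67890",
]})

# Split on ", " at most twice; expand=True returns one column per part.
df[["Street", "City", "Zipcode"]] = df["Full_Address"].str.split(", ", n=2, expand=True)

# Split the street once more: leading digits are the house number, the rest is the name.
df[["House_Number", "Street_Name"]] = df["Street"].str.extract(r"^(\d+)\s+(.*)$")

print(df)
```

All resulting columns are strings; keeping `Zipcode` as a string preserves any leading zeros.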

