## Feature Engineering 

- Feature engineering is the process of selecting, modifying, or creating features (variables) from raw data to improve the performance of machine learning models.
- It involves transforming the data into a more suitable format, making it easier for models to learn patterns and make accurate predictions.
- It is a critical step in the data preprocessing pipeline and plays a key role in the success of machine learning projects.

## Transforming Variables
- Transforming variables is a core part of feature engineering. It modifies the scale, distribution, or nature of a variable so that it meets model assumptions (for example, approximate normality or constant variance) or is otherwise better suited to analysis or modeling.

- Techniques for transforming variables:

    - Log transformation
    - Square root transformation
    - Box-Cox transformation
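
To make the relationship between these three techniques concrete, here is a minimal sketch of the Box-Cox formula written by hand with NumPy. The helper `box_cox` and the sample values are illustrative assumptions, not part of the dataset or of any library. It shows that the log transformation is the special case lambda = 0, and that lambda = 0.5 gives the square root up to a linear rescaling.

```python
import numpy as np

def box_cox(x, lam):
    """Box-Cox power transform for strictly positive x (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return np.log(x)              # limiting case of the formula as lam -> 0
    return (x ** lam - 1.0) / lam     # general power-transform case

values = np.array([1.0, 4.0, 9.0, 100.0])  # made-up, right-skewed sample

log_like = box_cox(values, 0)      # same as np.log(values)
sqrt_like = box_cox(values, 0.5)   # same as 2 * np.sqrt(values) - 2

print(log_like)
print(sqrt_like)
```

In practice, `scipy.stats.boxcox` also estimates a suitable lambda from the data, which this sketch does not attempt.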
    

In [None]:
import pandas as pd
import numpy as np
df = pd.read_csv("HousePrices.csv")

# Example: Creating a new feature 'total_rooms' by adding bedrooms and bathrooms
df['total_rooms'] = df['bedrooms'] + df['bathrooms']


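# Log transformation
# A log transformation is useful for right-skewed data and for reducing the impact of large outliers.
# It applies the natural logarithm to each value, which compresses large values and makes the distribution less skewed.
# Note: np.log returns -inf for 0 and NaN for negative values, so 'price' must be strictly positive
# (np.log1p is an alternative when zeros are present).

# Logarithmic transformation of the 'price' column
df['log_price'] = np.log(df['price'])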



# Square root transformation
# Like the log transformation, the square root transformation reduces right skew and can stabilize variance,
# but its effect is milder and it is defined for zero values.
# Square-root transforming the 'price' column
df['SquareRoot_price'] = np.sqrt(df['price'])

# Displaying the DataFrame with the new feature
print("DataFrame with square root transformed 'price':")
print(df[['price', 'SquareRoot_price']])


# Box-Cox transformation
# The Box-Cox transformation is a family of power transformations that includes the log and square root transformations as special cases.
# It can therefore handle a broader range of data distributions.
# The input must be strictly positive because the transformation involves taking the logarithm, which is undefined for zero or negative values.
# If the data contain zeros or negative values, adding a constant to shift them above zero avoids mathematical errors and lets the transformation be applied.

from scipy.stats import boxcox
# Applying the Box-Cox transformation to the 'sqft_living' column
# boxcox returns the transformed values and the fitted lambda (discarded here)
df['BoxCox_sqft_living'], _ = boxcox(df['sqft_living'])

# Displaying the original and Box-Cox transformed 'sqft_living' values
print("DataFrame with Box-Cox transformed 'sqft_living':")
print(df[['sqft_living', 'BoxCox_sqft_living']])


## Feature Scaling
- Feature scaling is a technique used in machine learning and data preprocessing to standardize or normalize the range of independent variables or features of a dataset.
- Min-max scaling transforms data to a specific range, typically between 0 and 1, preserving the relative differences between values. This normalization technique is ideal for datasets with known bounds, ensuring that all values are rescaled proportionally to fit within the specified range.
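- Standard scaling (standardization) rescales each feature to have a mean of 0 and a standard deviation of 1. It is preferable for approximately normally distributed data, and it depends less on the observed minimum and maximum than min-max scaling does.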
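
The two scaling formulas can be sketched by hand with NumPy. The array `x` below is a made-up feature, not a column from the dataset, and the standard deviation is the population one (`ddof=0`), which is also what scikit-learn's `StandardScaler` uses.

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])  # made-up feature values

# Min-max scaling: (x - min) / (max - min), so the result lies in [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standard scaling: (x - mean) / std, so the result has mean 0 and std 1
x_standard = (x - x.mean()) / x.std()

print(x_minmax)    # [0.   0.25 0.5  0.75 1.  ]
print(x_standard)
```

Both are linear rescalings, so neither changes the shape of the distribution; they only change its location and spread.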

In [None]:
from sklearn.preprocessing import MinMaxScaler

# Scaling numeric features using min-max scaling
scaler = MinMaxScaler()
df[['sqft_living', 'sqft_lot']] = scaler.fit_transform(df[['sqft_living', 'sqft_lot']])
print(df)

## Label Encoding
- Label encoding is a technique used to convert categorical labels into a numeric format, making it suitable for machine learning algorithms that require numerical input.
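- In label encoding, each category is assigned an integer value.
- This is useful for ordinal categorical data, where the order of categories matters, provided the assigned integers follow that order. Note that scikit-learn's `LabelEncoder` assigns integers in sorted (alphabetical) order of the labels, so 'large', 'medium', and 'small' become 0, 1, and 2, which does not match the natural size order.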
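
When the order of categories matters, one option is to spell the order out explicitly instead of relying on an encoder's default ordering. The following is a minimal sketch using a pandas `map`; the DataFrame `sizes` and the mapping `size_order` are illustrative names chosen here, and the ordering small < medium < large is an assumption about what the categories mean.

```python
import pandas as pd

# Sample data (made up for illustration)
sizes = pd.DataFrame({'size': ['small', 'medium', 'large', 'medium', 'small']})

# Explicit order chosen by hand: small < medium < large
size_order = {'small': 0, 'medium': 1, 'large': 2}
sizes['size_ordinal'] = sizes['size'].map(size_order)

print(sizes)
```

Any category missing from the mapping becomes `NaN`, which makes unexpected values easy to spot.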

In [None]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Sample DataFrame
data = {'size': ['small', 'medium', 'large', 'medium', 'small']}
df1 = pd.DataFrame(data)

# Before label encoding
print("Original DataFrame:")
print(df1)

# Apply label encoding
label_encoder = LabelEncoder()
df1['size_encoded'] = label_encoder.fit_transform(df1['size'])

# After label encoding
print("\nDataFrame after label encoding:")
print(df1)



# Demonstrating label encoding on the DataFrame loaded from the CSV file
# (LabelEncoder is already imported at the top of this cell)

# Label encoding for the 'city' column
label_encoder = LabelEncoder()
df['city_encoded'] = label_encoder.fit_transform(df['city'])
print(df)

## One-Hot Encoding

- One-hot encoding is a technique to represent categorical variables as binary vectors.
- It is particularly useful when dealing with nominal categorical data, where there is no inherent order among categories.
- In one-hot encoding, each category is transformed into a binary column, and in each row exactly one of those columns is "hot" (1) to indicate which category is present.
- It adds one column per category, so it increases dataset dimensionality, which can become costly for columns with many distinct values.

In [None]:
import pandas as pd

# Sample DataFrame
data = {'color': ['red', 'blue', 'green', 'red', 'green']}
df2 = pd.DataFrame(data)

# Before one-hot encoding
print("Original DataFrame:")
print(df2)

# Apply one-hot encoding
df2_encoded = pd.get_dummies(df2, columns=['color'], prefix='color')

# After one-hot encoding
print("\nDataFrame after one-hot encoding:")
print(df2_encoded)


# Demonstrating one-hot encoding using the CSV file
# One-hot encoding for the 'view' column
df_encode = pd.get_dummies(df, columns=['view'], prefix='view')

# After one-hot encoding
print("\nDataFrame after one-hot encoding:")
print(df_encode)

## Hashing
- Hashing is a technique that converts input data of variable length into a fixed-length string of characters, called a hash code.
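
As a minimal sketch of this idea, the standard-library `hashlib` module can turn strings of any length into a hash code of fixed length. The helpers `hash_code` and `hash_bucket`, the choice of SHA-256, the bucket count of 8, and the sample strings are all assumptions made for illustration. Reducing the hash code to a small number of buckets is one way it could be used to encode a categorical value as a number.

```python
import hashlib

def hash_code(value):
    """Return a fixed-length (64 hex characters) SHA-256 hash code for a string."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

def hash_bucket(value, n_buckets=8):
    """Map a string to an integer bucket in [0, n_buckets) (hypothetical helper)."""
    return int(hash_code(value), 16) % n_buckets

# Made-up inputs of different lengths
cities = ["Seattle", "Kent", "a much longer city name than the others"]

for city in cities:
    # The hash code length is the same regardless of the input length
    print(city, len(hash_code(city)), hash_bucket(city))
```

Different inputs can land in the same bucket (a collision), so the number of buckets trades memory against how often distinct categories are merged.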