  **Encoding Categorical Features.**

In [1]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Load the Iris dataset
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['species'] = data.target  # add the integer class labels (0, 1, 2) as a 'species' column

# Map the integer labels (0, 1, 2) to the actual species names for demonstration
df['species'] = df['species'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})

# Hypothetical ordinal feature
# Derive an example ordinal feature 'petal_size' from petal length to demonstrate ordinal encoding
# Bins (0, 2], (2, 4], (4, 10] cm map to the ordered levels 'small' < 'medium' < 'large'
df['petal_size'] = pd.cut(df['petal length (cm)'], bins=[0, 2, 4, 10], labels=['small', 'medium', 'large'])

# Display the initial dataset with the added features
print('Initial DataFrame with Nominal and Ordinal Features:\n',df.tail())

#------------------------
# Nominal Encoding
#------------------------

# Encode 'species' with OneHotEncoder (nominal data: the categories have no order)
# Note: the sparse_output parameter requires scikit-learn >= 1.2
onehot_encoder = OneHotEncoder(sparse_output=False)
species_encoded = onehot_encoder.fit_transform(df[['species']])

# Convert the encoded array to a DataFrame with readable column names
species_encoded_df = pd.DataFrame(species_encoded, columns=onehot_encoder.get_feature_names_out(['species']))
df = pd.concat([df, species_encoded_df], axis=1)
print(df.tail())

#------------------------
# Ordinal Encoding
#------------------------

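# Encode 'petal_size' with OrdinalEncoder, passing the category order explicitly
ordinal_encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
# fit_transform returns a 2D array of shape (n_samples, 1); flatten it to fill one column
df['petal_size_encoded'] = ordinal_encoder.fit_transform(df[['petal_size']]).ravel()

# Inspect the DataFrame while the original categorical columns are still present
print(df.tail())
# Drop the original categorical columns if desired
df = df.drop(columns=['species', 'petal_size'])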

# Display the transformed DataFrame
print('\nTransformed DataFrame with Encoded Nominal and Ordinal Features:\n',df.head())

Initial DataFrame with Nominal and Ordinal Features:
      sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
145                6.7               3.0                5.2               2.3   
146                6.3               2.5                5.0               1.9   
147                6.5               3.0                5.2               2.0   
148                6.2               3.4                5.4               2.3   
149                5.9               3.0                5.1               1.8   

       species petal_size  
145  virginica      large  
146  virginica      large  
147  virginica      large  
148  virginica      large  
149  virginica      large  
     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
145                6.7               3.0                5.2               2.3   
146                6.3               2.5                5.0               1.9   
147                6.5               3.0        

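After the original `species` and `petal_size` columns are dropped, the transformed DataFrame contains:

The original numerical features (sepal and petal lengths and widths).
One-hot columns for the flower species (`species_setosa`, `species_versicolor`, `species_virginica`).
A numeric encoding of petal size (`petal_size_encoded`: small = 0, medium = 1, large = 2).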

Purpose of the code:

Convert categorical (text) data into a numerical form suitable for machine learning algorithms.
Use OneHotEncoder for nominal data, where the categories have no inherent order.
Use OrdinalEncoder for ordinal data, where the categories have a meaningful order.

_______________________________________________________________________________________________________________



**Scaling Techniques.**
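
With default settings, the three scalers used in this section each apply a simple per-feature formula: Min-Max scaling computes (x - min) / (max - min), standardization computes (x - mean) / std, and robust scaling computes (x - median) / IQR. As a minimal sketch, the following reproduces these formulas with NumPy on a single made-up column (the values, including the outlier 100, are hypothetical) and checks them against scikit-learn.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# One made-up feature column with an outlier (hypothetical values)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Min-Max scaling: maps the column onto [0, 1]
manual_min_max = (x - x.min()) / (x.max() - x.min())

# Standardization: zero mean, unit variance (population std, ddof=0)
manual_z = (x - x.mean()) / x.std()

# Robust scaling: centre on the median, divide by the interquartile range
q1, q3 = np.percentile(x, [25, 75])
manual_robust = (x - np.median(x)) / (q3 - q1)

# scikit-learn's default scalers should agree with the manual formulas
assert np.allclose(manual_min_max, MinMaxScaler().fit_transform(x))
assert np.allclose(manual_z, StandardScaler().fit_transform(x))
assert np.allclose(manual_robust, RobustScaler().fit_transform(x))

print(manual_min_max.ravel())
print(manual_z.ravel())
print(manual_robust.ravel())
```

In this toy column the outlier squeezes the Min-Max values of the four typical points into roughly [0, 0.03], while their robust-scaled values stay between -1 and 0.5. That is the usual reason to prefer robust scaling for features with extreme values.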

In [1]:
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# Load the California Housing dataset (it replaces the deprecated Boston dataset)
data = fetch_california_housing()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Display initial feature ranges
print('Initial feature ranges:\n',df.describe())

# Min-Max Scaling
min_max_scaler = MinMaxScaler()
df_min_max_scaled = pd.DataFrame(min_max_scaler.fit_transform(df),columns=df.columns)

print('Feature ranges after Min-Max Scaling:\n',df_min_max_scaled.describe())

# Standardization (Z-Score Scaling)
z_score_scaler = StandardScaler()
df_z_score_scaled = pd.DataFrame(z_score_scaler.fit_transform(df),columns=df.columns)

print('\nFeature ranges after Z-Score Scaling:\n',df_z_score_scaled.describe())

# Robust Scaling
robust_scaler = RobustScaler()
df_robust_scaled = pd.DataFrame(robust_scaler.fit_transform(df),columns=df.columns)

print('\nFeature ranges after Robust Scaling:\n',df_robust_scaled.describe())

Initial feature ranges:
              MedInc      HouseAge      AveRooms  ...      AveOccup      Latitude     Longitude
count  20640.000000  20640.000000  20640.000000  ...  20640.000000  20640.000000  20640.000000
mean       3.870671     28.639486      5.429000  ...      3.070655     35.631861   -119.569704
std        1.899822     12.585558      2.474173  ...     10.386050      2.135952      2.003532
min        0.499900      1.000000      0.846154  ...      0.692308     32.540000   -124.350000
25%        2.563400     18.000000      4.440716  ...      2.429741     33.930000   -121.800000
50%        3.534800     29.000000      5.229129  ...      2.818116     34.260000   -118.490000
75%        4.743250     37.000000      6.052381  ...      3.282261     37.710000   -118.010000
max       15.000100     52.000000    141.909091  ...   1243.333333     41.950000   -114.310000

[8 rows x 8 columns]
Feature ranges after Min-Max Scaling:
              MedInc      HouseAge      AveRooms  ...      A