
In [1]:
import pandas as pd

df = pd.read_csv('/content/adult.csv')

print("First 5 rows of the DataFrame:")
print(df.head())

print("\nDataFrame Info:")
df.info()

First 5 rows of the DataFrame:
   age  workclass  fnlwgt     education  educational-num      marital-status  \
0   25    Private  226802          11th                7       Never-married   
1   38    Private   89814       HS-grad                9  Married-civ-spouse   
2   28  Local-gov  336951    Assoc-acdm               12  Married-civ-spouse   
3   44    Private  160323  Some-college               10  Married-civ-spouse   
4   18          ?  103497  Some-college               10       Never-married   

          occupation relationship   race  gender  capital-gain  capital-loss  \
0  Machine-op-inspct    Own-child  Black    Male             0             0   
1    Farming-fishing      Husband  White    Male             0             0   
2    Protective-serv      Husband  White    Male             0             0   
3  Machine-op-inspct      Husband  Black    Male          7688             0   
4                  ?    Own-child  White  Female             0             0   

   hour

In [2]:
# Split columns by dtype: integer/float columns are numerical, object columns are categorical
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()

print("Numerical Columns:")
print(numerical_cols)
print("\nCategorical Columns:")
print(categorical_cols)

Numerical Columns:
['age', 'fnlwgt', 'educational-num', 'capital-gain', 'capital-loss', 'hours-per-week']

Categorical Columns:
['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'gender', 'native-country', 'income']


In [3]:
import numpy as np

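# Work on a copy so the original df stays available for later comparison
df_encoded = df.copy()

for col in categorical_cols:
    # The dataset marks missing values with a literal '?'; convert them to NaN
    df_encoded[col] = df_encoded[col].replace('?', np.nan)
    # Impute with the most frequent category of that column
    df_encoded[col] = df_encoded[col].fillna(df_encoded[col].mode()[0])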

print("Missing values '?' replaced with NaN and then filled with mode for categorical columns.")

Missing values '?' replaced with NaN and then filled with mode for categorical columns.
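
The two imputation steps can be checked on a small frame before trusting them on the full dataset. The sketch below uses a hypothetical toy column (the values are illustrative, not rows from `adult.csv`) and assumes, as in the cell above, that missing entries are marked with a literal `'?'`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column; the values are illustrative, not taken from adult.csv
toy = pd.DataFrame({"workclass": ["Private", "?", "Private", "Local-gov", "?"]})

# Step 1: turn the '?' placeholder into a real missing value
toy["workclass"] = toy["workclass"].replace("?", np.nan)
n_missing = int(toy["workclass"].isna().sum())

# Step 2: fill missing entries with the most frequent category
# (Series.mode ignores NaN by default)
mode_value = toy["workclass"].mode()[0]
toy["workclass"] = toy["workclass"].fillna(mode_value)

print(n_missing, mode_value, toy["workclass"].tolist())
```

Mode imputation inflates the share of the majority category, so it may be worth comparing against keeping a separate "missing" category when the number of `'?'` entries is large.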


In [4]:
education_mapping = {
    'Preschool': 0,
    '1st-4th': 1,
    '5th-6th': 2,
    '7th-8th': 3,
    '9th': 4,
    '10th': 5,
    '11th': 6,
    '12th': 7,
    'HS-grad': 8,
    'Some-college': 9,
    'Assoc-voc': 10,
    'Assoc-acdm': 11,
    'Bachelors': 12,
    'Masters': 13,
    'Prof-school': 14,
    'Doctorate': 15
}

income_mapping = {
    '<=50K': 0,
    '>50K': 1
}

df_encoded['education'] = df_encoded['education'].map(education_mapping)
df_encoded['income'] = df_encoded['income'].map(income_mapping)

print("Education and income columns have been label encoded.")

Education and income columns have been label encoded.
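
One caveat with dictionary-based encoding is that `Series.map` returns `NaN` for any value that is not a key in the dictionary, without raising an error. A minimal sketch of a check for this, using a hypothetical toy series and a deliberately incomplete mapping (`'Unknown-level'` is an invented label for illustration):

```python
import pandas as pd

# Hypothetical toy data; 'Unknown-level' stands in for a label absent from the mapping
toy = pd.Series(["HS-grad", "Bachelors", "Unknown-level"], name="education")
partial_mapping = {"HS-grad": 8, "Bachelors": 12}

encoded = toy.map(partial_mapping)

# Series.map yields NaN for values that are missing from the dictionary,
# so list the original labels that failed to map
unmapped = toy[encoded.isna()].unique().tolist()
print(unmapped)
```

Running the same check on `df_encoded['education']` and `df_encoded['income']` after mapping would confirm that every category in the data was covered.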


In [5]:
nominal_cols = [col for col in categorical_cols if col not in ['education', 'income']]
df_one_hot_encoded = pd.get_dummies(df_encoded[nominal_cols], drop_first=True)

print("Nominal columns identified and one-hot encoded.")

Nominal columns identified and one-hot encoded.
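
To see what `drop_first=True` does, the sketch below one-hot encodes a hypothetical three-category column both ways. The column values are illustrative only.

```python
import pandas as pd

# Hypothetical toy column with three categories (illustrative values)
toy = pd.DataFrame({"race": ["White", "Black", "Other", "White"]})

full = pd.get_dummies(toy, drop_first=False)
reduced = pd.get_dummies(toy, drop_first=True)

# With drop_first=True the first category in sorted order is dropped;
# a row with all indicator columns False then represents that category
print(list(full.columns))
print(list(reduced.columns))
print(reduced.iloc[1].tolist())  # the 'Black' row
```

Dropping one level avoids perfectly collinear indicator columns, which mainly matters for linear models fitted with an intercept; for other models, keeping all levels is also a reasonable choice.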


In [6]:
df_encoded = df_encoded.drop(columns=nominal_cols)

print("Original nominal columns dropped from df_encoded.")

Original nominal columns dropped from df_encoded.


In [7]:
df_encoded = pd.concat([df_encoded, df_one_hot_encoded], axis=1)

print("df_encoded and df_one_hot_encoded concatenated.")

df_encoded and df_one_hot_encoded concatenated.


In [8]:
print("\nFirst 5 rows of the transformed DataFrame:")
print(df_encoded.head())

print("\nDataFrame Info after encoding:")
df_encoded.info()


First 5 rows of the transformed DataFrame:
   age  fnlwgt  education  educational-num  capital-gain  capital-loss  \
0   25  226802          6                7             0             0   
1   38   89814          8                9             0             0   
2   28  336951         11               12             0             0   
3   44  160323          9               10          7688             0   
4   18  103497          9               10             0             0   

   hours-per-week  income  workclass_Local-gov  workclass_Never-worked  ...  \
0              40       0                False                   False  ...   
1              50       0                False                   False  ...   
2              40       1                 True                   False  ...   
3              40       1                False                   False  ...   
4              30       0                False                   False  ...   

   native-country_Portugal  native-c

In [9]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df_encoded[numerical_cols] = scaler.fit_transform(df_encoded[numerical_cols])

print("Numerical columns have been Min-Max scaled.")

print("\nFirst 5 rows of the transformed DataFrame after scaling numerical features:")
print(df_encoded.head())

print("\nDataFrame Info after scaling:")
df_encoded.info()

Numerical columns have been Min-Max scaled.

First 5 rows of the transformed DataFrame after scaling numerical features:
        age    fnlwgt  education  educational-num  capital-gain  capital-loss  \
0  0.109589  0.145129          6         0.400000      0.000000           0.0   
1  0.287671  0.052451          8         0.533333      0.000000           0.0   
2  0.150685  0.219649         11         0.733333      0.000000           0.0   
3  0.369863  0.100153          9         0.600000      0.076881           0.0   
4  0.013699  0.061708          9         0.600000      0.000000           0.0   

   hours-per-week  income  workclass_Local-gov  workclass_Never-worked  ...  \
0        0.397959       0                False                   False  ...   
1        0.500000       0                False                   False  ...   
2        0.397959       1                 True                   False  ...   
3        0.397959       1                False                   False  ... 

In [10]:
print("Descriptive statistics for numerical features BEFORE scaling (from original df):")
print(df[numerical_cols].describe())

print("\nDescriptive statistics for numerical features AFTER scaling (from df_encoded):")
print(df_encoded[numerical_cols].describe())

Descriptive statistics for numerical features BEFORE scaling (from original df):
                age        fnlwgt  educational-num  capital-gain  \
count  48842.000000  4.884200e+04     48842.000000  48842.000000   
mean      38.643585  1.896641e+05        10.078089   1079.067626   
std       13.710510  1.056040e+05         2.570973   7452.019058   
min       17.000000  1.228500e+04         1.000000      0.000000   
25%       28.000000  1.175505e+05         9.000000      0.000000   
50%       37.000000  1.781445e+05        10.000000      0.000000   
75%       48.000000  2.376420e+05        12.000000      0.000000   
max       90.000000  1.490400e+06        16.000000  99999.000000   

       capital-loss  hours-per-week  
count  48842.000000    48842.000000  
mean      87.502314       40.422382  
std      403.004552       12.391444  
min        0.000000        1.000000  
25%        0.000000       40.000000  
50%        0.000000       40.000000  
75%        0.000000       45.000000  
ma

## Explain Scaling Impact


Scaling numerical features is an important preprocessing step, particularly for algorithms that are sensitive to the magnitude and range of their input features:

1.  **Algorithms Sensitive to Feature Scales**: Many machine learning algorithms perform poorly or converge slowly when the input features have very different scales. Examples include:
    *   **Gradient Descent-based Algorithms (e.g., Logistic Regression, Neural Networks)**: These algorithms update weights using the gradient of the cost function. When features have very different scales, the partial derivatives differ in magnitude too, which elongates the contours of the cost surface. A single learning rate is then too large along some directions and too small along others, so the optimizer zigzags toward the minimum and converges slowly or oscillates.
    *   **Distance-based Algorithms (e.g., K-Nearest Neighbors, Support Vector Machines, K-Means Clustering)**: These algorithms rely on distances (or, for kernel SVMs, similarity measures) between data points. If one feature has a much larger range than the others, it dominates the distance calculation and drowns out the rest. In this dataset, for example, `fnlwgt` spans roughly 12,000 to 1.5 million while `age` spans 17 to 90. Scaling puts the features on comparable ranges so that each can contribute to the distance.
    *   **Principal Component Analysis (PCA)**: PCA finds the directions of maximum variance. Without scaling, features with larger variances (often just a consequence of larger units) dominate the leading principal components, so the components reflect measurement scale more than structure in the data.

2.  **Faster Convergence and Preventing Dominance**: Scaling brings all numerical features into a comparable range (e.g., [0, 1] with Min-Max scaling or mean 0 and standard deviation 1 with Standardization). This has two main benefits:
    *   **Faster Convergence**: With scaled features, the contours of the cost function become closer to circular, so gradient descent can move more directly toward the minimum and typically needs fewer iterations.
    *   **Preventing Feature Dominance**: When features are on vastly different scales, features with larger values can inadvertently dominate the learning process. Scaling ensures that no single feature's magnitude unfairly influences the model's performance, promoting a more balanced contribution from all features.

3.  **Interpretability and Numerical Stability**: While not always the primary reason, scaling can also aid in:
    *   **Interpretability**: For some models, having features on a similar scale can make the interpretation of feature importances or model coefficients more straightforward.
    *   **Numerical Stability**: Extremely large or small feature values can sometimes lead to numerical precision issues during computations, especially with floating-point arithmetic. Scaling helps mitigate these issues, leading to more numerically stable algorithms.
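
The distance-dominance point above can be illustrated with a small numeric sketch. The four rows below are hypothetical `[age, fnlwgt]` pairs chosen for illustration, and the scaling is written out with NumPy as `(x - min) / (max - min)` per column, which is what `MinMaxScaler` computes with its default settings.

```python
import numpy as np

# Hypothetical rows with columns [age, fnlwgt]; values are illustrative only
X = np.array([
    [25.0, 200_000.0],    # reference person
    [27.0, 230_000.0],    # similar age, moderately different fnlwgt
    [70.0, 201_000.0],    # very different age, nearly identical fnlwgt
    [90.0, 1_000_000.0],  # extra row that sets the upper end of both ranges
])

def euclidean(a, b):
    """Euclidean distance between two 1-D arrays."""
    return float(np.linalg.norm(a - b))

# Distances on the raw scale are driven almost entirely by fnlwgt
raw_d01 = euclidean(X[0], X[1])
raw_d02 = euclidean(X[0], X[2])

# Column-wise Min-Max scaling to [0, 1]
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
scaled_d01 = euclidean(X_scaled[0], X_scaled[1])
scaled_d02 = euclidean(X_scaled[0], X_scaled[2])

print(f"raw:    d(0,1)={raw_d01:.1f}  d(0,2)={raw_d02:.1f}")
print(f"scaled: d(0,1)={scaled_d01:.3f}  d(0,2)={scaled_d02:.3f}")
```

On the raw values the 70-year-old is the nearest neighbor of the 25-year-old because the 45-year age gap barely registers next to `fnlwgt`; after scaling, the 27-year-old is nearer. Which neighbor is more meaningful depends on the task, but the raw ranking is determined by units rather than by any judgment about feature relevance.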

In [11]:
df_encoded.to_csv('adult_processed.csv', index=False)

print("Processed dataset saved to adult_processed.csv")

Processed dataset saved to adult_processed.csv


## Summary:

### Data Analysis Key Findings
*   The original `adult.csv` dataset contained 48,842 entries and 15 columns, with 6 numerical (`int64`) and 9 categorical (`object`) columns.
*   Initial inspection revealed '?' characters in categorical columns like `workclass` and `occupation`, which were treated as missing values.
*   **Feature Identification**:
    *   Numerical columns identified: `age`, `fnlwgt`, `educational-num`, `capital-gain`, `capital-loss`, `hours-per-week`.
    *   Categorical columns identified: `workclass`, `education`, `marital-status`, `occupation`, `relationship`, `race`, `gender`, `native-country`, `income`.
*   **Categorical Feature Encoding**:
    *   Missing '?' values in categorical columns were replaced with `NaN` and then imputed using the mode of their respective columns.
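    *   The `education` column was ordinally encoded with a manual mapping to integers from 0 (Preschool) to 15 (Doctorate). In the sample rows shown, this value equals `educational-num` minus 1, so the two columns appear to carry the same information.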
    *   The `income` column was Label Encoded into 0 (`<=50K`) and 1 (`>50K`).
    *   Other nominal categorical features were One-Hot Encoded, resulting in an expansion of columns.
*   **Numerical Feature Scaling**:
    *   All identified numerical features were scaled using `MinMaxScaler`, transforming their values to lie within the range [0, 1]. For example, before scaling, `age` ranged from 17 to 90, `fnlwgt` from 12,285 to 1,490,400, and `capital-gain` from 0 to 99,999. After scaling, all minimum values became 0.0 and maximum values became 1.0.
*   The final transformed dataset, `df_encoded`, contains 48,842 entries and 84 columns, with every column stored as `float64`, `int64`, or `bool` and no remaining missing values.
*   The processed dataset was saved as `adult_processed.csv`.
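
The scaled values reported above follow directly from the Min-Max formula `(x - min) / (max - min)`. As a quick check, the sketch below applies it to two values from the first rows of the dataset, using the minima and maxima from the `describe()` output; the helper name `min_max_scale` is introduced here for illustration.

```python
def min_max_scale(x, x_min, x_max):
    """Map x from the range [x_min, x_max] onto [0, 1]."""
    return (x - x_min) / (x_max - x_min)

# age = 25 with observed range 17..90 (row 0)
age_scaled = min_max_scale(25, 17, 90)

# capital-gain = 7688 with observed range 0..99999 (row 3)
gain_scaled = min_max_scale(7688, 0, 99999)

print(round(age_scaled, 6), round(gain_scaled, 6))  # 0.109589 0.076881
```

Both results match the scaled `age` and `capital-gain` values printed for rows 0 and 3 after applying `MinMaxScaler`.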

### Insights or Next Steps
*   Scaling numerical features is critical for gradient descent-based and distance-based algorithms, ensuring faster convergence and preventing features with larger scales from dominating the model's learning process.
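*   The `adult_processed.csv` dataset is now free of missing values and contains no string columns. The six columns in `numerical_cols` are scaled to [0, 1], but the ordinal `education` column was not part of that list and remains on its 0 to 15 scale, so it may need scaling as well before use with scale-sensitive models.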
