# EDA, Feature Engineering, and Feature Selection Checklist

## Loading and basic details

1. Load the Dataset into a DataFrame.
    - If the dataset has more than one data source (CSVs, Tables, etc.), look for foreign keys to merge them into one DataFrame.

2. Take a look at the present columns using - df.columns

3. For details like the datatype of each column use - df.info()

4. To get statistical details of numerical columns use - df.describe()
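
The loading steps above can be sketched as follows. This is a minimal illustration, assuming two hypothetical CSV sources (`customers` and `orders`) that share a `customer_id` key; the data and column names are invented for the example and are not part of the checklist.

```python
import io
import pandas as pd

# Two hypothetical CSV sources sharing the key column "customer_id".
# io.StringIO stands in for real file paths so the sketch is self-contained.
customers_csv = io.StringIO("customer_id,name,age\n1,Asha,34\n2,Ben,29\n3,Chen,41\n")
orders_csv = io.StringIO("order_id,customer_id,amount\n10,1,250.0\n11,1,99.5\n12,3,40.0\n")

customers = pd.read_csv(customers_csv)
orders = pd.read_csv(orders_csv)

# Merge on the foreign key; a left join keeps every row of `orders`.
df = orders.merge(customers, on="customer_id", how="left")

print(df.columns.tolist())  # column names
df.info()                   # dtypes and non-null counts (info is a method, so it needs parentheses)
print(df.describe())        # summary statistics of the numeric columns
```

Whether to use a left, inner, or outer join depends on which rows must survive the merge; the left join here is only one reasonable choice.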

## Data Analysis & Basic EDA

1. Find out Missing Values using,
    - missing values count per column - df.isnull().sum()

    - missing values percentage - df.isnull().mean() * 100

    - sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap="viridis")


2. Duplicate values can be identified using,

    - df.duplicated().sum()


3. Select columns based on their Datatypes

    - df.select_dtypes(...)


4. The Distribution of a feature's values can be summarized using

    - df.feature_name.value_counts()


5. This can be plotted as a Pie Chart. Pie Charts are useful for showing the share of each value within a single feature.

    - values = df.feature_1.value_counts().values

    - index = df.feature_1.value_counts().index

    - plt.pie(values, labels=index, autopct='%1.2f%%')


6. To group Datapoints by one or more features, like GROUP BY in SQL, use,

    - df.groupby([feat1, feat2, feat3]).size().reset_index()


7. To get the Top 10 values of a Column

    - df.groupby(['Feature_name']).size().reset_index().rename(columns={0:'Count'}).sort_values(by="Count", ascending=False)[:10]


8. To compare two features, split by the values of a third feature, draw a Bar Plot

    - sns.barplot(x="feat1", y="feat2", data=df, hue='feat_3', palette=[...])


9. For analysing each feature's count & distribution use a Count Plot.

    - sns.countplot(x="feat1", data=df, hue='feat2', palette=[...])


10. Write Observations along the way
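
The non-plotting checks in this list can be tried on a small example. The sketch below assumes a hypothetical toy DataFrame with columns `city`, `plan`, and `spend` (stand-ins for feat1, feat2, and so on); the plotting calls are left out so that only pandas and NumPy are needed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune", "Pune", "Delhi", "Pune", None],
    "plan": ["basic", "pro", "basic", "basic", "pro", "pro", "basic"],
    "spend": [10.0, 25.0, 10.0, np.nan, 30.0, 22.0, 8.0],
})

# Fraction of missing values per column (multiply by 100 for a percentage).
missing_frac = df.isnull().mean()

# Number of fully duplicated rows (row 2 repeats row 0).
n_duplicates = df.duplicated().sum()

# Split columns by dtype.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

# Distribution of one categorical feature (missing values are excluded by default).
city_counts = df["city"].value_counts()

# SQL-style GROUP BY with a row count per group.
group_sizes = df.groupby(["city", "plan"]).size().reset_index(name="Count")

# Top 10 most frequent values of a column (value_counts is already sorted).
top_cities = city_counts.head(10)

print(missing_frac, n_duplicates, group_sizes, top_cities, sep="\n\n")
```

Passing `name="Count"` to `reset_index` is a shorter alternative to renaming column `0` afterwards; both give the same table.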

## Feature Engineering

1. Type-convert **Numerical features** to int/float using .astype(int) or .astype(float)

    - **Binning** (for Tree models)
        - splitting continuous values into Categories


2. Encoding **Categorical features**

    - For **Ordinal Features** that have a meaningful order use **Ordinal Encoding** (1, 2, 3...) like,
        - Ratings - Poor, Fair, Good, Very Good
        - Sizes - Small, Medium, Large, etc
        - Use df['feature'].map({...}) to encode

    - For **Nominal Features** that don't have any meaningful order use **Nominal Encoding** methods (one-hot, frequency, target, hashing, embeddings), like,
        - Color - Red, Blue, Green
        - Country names - India, USA, UK
        - Gender - Male, Female
        - According to the number of Categories,
            - fewer than 20 categories - **One-Hot Encoding**, use pd.get_dummies(df)
            - 20–100 categories - **Frequency/Count Encoding** or **Target Encoding**
            - 100+ categories (especially with text/IDs) - **Target Encoding, Hashing trick, or Embeddings**


3. **Date/Time features**

    - Split into,
        - year (one_hot_enc)
        - month (1,2,3 / one_hot_enc)
        - day_of_week (one_hot_enc)
        - hour
        - season (summer, winter)
        - is_weekend


4. **Aggregating statistics** using one Feature

    - **df.groupby("user_id").agg(["mean", "count"])**

    - Gives the mean and count of the other (numerical) Features for each unique user_id.

    - Merged back on user_id, this adds new features describing user behavior.


5. **Text Features**

    - Sentiment score - positive/negative

    - TF-IDF / Count Vector


6. **Derived features** from Other Models

    - **Clustering (Kmeans)** - assign cluster label as a feature.

    - **PCA** - reduce dimensions into components.

    - Example:
        - Customer dataset → **KMeans clusters:**
        - Customer A = cluster 1 (budget buyer)
        - Customer B = cluster 2 (premium buyer)

    - Now cluster_id is a feature in the main model.


7. For categories with **High Cardinality**

    - For Categories with many unique values, for example: P12345, P54321, P11111 ... (millions of unique values)

        - **Hashing Trick**
            - map categories into fixed-size buckets 

            - Hash into 1000 buckets: hash(P12345) % 1000 = 347

        - **Embeddings**
            - learn dense vector representations

            - Train embeddings: P12345 → [0.2, -0.5, 0.7, ...]


8. Handling Missing values

    - (Simple way) Fill with the **Mode** of the Feature - df['feature'] = df["feature"].fillna(df["feature"].mode()[0])

    - Or use other **Imputation** techniques based on statistical methods like - **mean, median, interpolation, or ML.**
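
Several of the steps above (ordinal encoding, date/time parts, per-user aggregates, the hashing trick, and one-hot encoding) can be sketched on a toy DataFrame. All column names and values below are hypothetical. The helper `hash_bucket` is an illustrative function written for this example, not a library API, and it uses MD5 only because Python's built-in `hash()` is salted per process for strings and so is not reproducible across runs.

```python
import hashlib
import pandas as pd

# Hypothetical toy data; all column names are illustrative.
df = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "rating": ["Poor", "Good", "Fair", "Very Good", "Good"],
    "color": ["Red", "Blue", "Green", "Red", "Blue"],
    "product_id": ["P12345", "P54321", "P11111", "P12345", "P54321"],
    "amount": [10.0, 20.0, 5.0, 15.0, 30.0],
    "ts": pd.to_datetime(["2024-01-06 09:00", "2024-01-08 18:30",
                          "2024-02-10 12:00", "2024-02-12 08:15",
                          "2024-03-03 23:45"]),
})

# Ordinal encoding: map a single column (a Series) to ordered integers.
rating_order = {"Poor": 1, "Fair": 2, "Good": 3, "Very Good": 4}
df["rating_enc"] = df["rating"].map(rating_order)

# Date/time parts.
df["month"] = df["ts"].dt.month
df["day_of_week"] = df["ts"].dt.dayofweek            # Monday=0 ... Sunday=6
df["hour"] = df["ts"].dt.hour
df["is_weekend"] = (df["day_of_week"] >= 5).astype(int)

# Per-user aggregates, merged back as new features.
user_stats = (df.groupby("user_id")["amount"]
                .agg(["mean", "count"])
                .add_prefix("user_amount_")
                .reset_index())
df = df.merge(user_stats, on="user_id", how="left")

# Hashing trick: map each ID into one of n_buckets fixed buckets.
def hash_bucket(value: str, n_buckets: int = 1000) -> int:
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

df["product_bucket"] = df["product_id"].map(hash_bucket)

# One-hot encoding for a low-cardinality nominal feature.
df = pd.get_dummies(df, columns=["color"], prefix="color")

print(df.head())
```

Different IDs can land in the same bucket; that collision rate is the price paid for a fixed feature width, and the bucket count of 1000 simply mirrors the example above.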

## Imputation and Encoding
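
As a rough illustration of the imputation options named under "Handling Missing values" (mode, median, interpolation), the sketch below fills three hypothetical columns in three different ways. The data and column names are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column.
df = pd.DataFrame({
    "city": ["Pune", None, "Pune", "Delhi"],
    "age": [30.0, np.nan, 40.0, 50.0],
    "temp": [20.0, np.nan, 24.0, 26.0],
})

# Categorical feature: fill with the most frequent value (mode).
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Numerical feature: fill with the median, which is robust to outliers.
df["age"] = df["age"].fillna(df["age"].median())

# Ordered (e.g. time-indexed) feature: linear interpolation between neighbours.
df["temp"] = df["temp"].interpolate(method="linear")

print(df)
```

When a train/test split is involved, the fill values (mode, median) are usually computed on the training set and then reused on the test set, so that no information leaks from test data.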

## Feature Selection

1. Drop Constant Features

    - Drop features whose variance is zero or below a chosen threshold
    - Use VarianceThreshold (from sklearn.feature_selection) with the desired threshold value and drop the columns it flags

2. From Correlation Matrix

    - Correlation: “Does one feature increase or decrease in a linear relationship with another feature?”

    - Compute the Pearson Correlation Matrix using the training set only.

    - Identify highly correlated feature pairs (absolute correlation coefficient |r| > "X")

    - From each highly correlated pair, choose one feature to drop (to reduce multicollinearity).

    - Drop the selected features from both the training and testing sets to ensure consistency.

3. From **Mutual Information** for Classification

    - Mutual Information: “Can this feature help predict or explain the target — in any way, linear or not?”

    - It measures the dependency between the feature and the Target variable.

    - Compute the MI value of each feature against the Target variable (e.g. mutual_info_classif from sklearn.feature_selection).

    - Load those values into a Series and sort it in descending order.

    - Select the top "X" features.

4. From **Mutual Information** for Regression

    - Mutual Information: “Can this feature help predict or explain the target — in any way, linear or not?”

    - It measures the dependency between the feature and the Target variable.

    - Compute the MI value of each feature against the Target variable (e.g. mutual_info_regression from sklearn.feature_selection).

    - Load those values into a Series and sort it in descending order.

    - Select the top "X" % of the total features.
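
The selection steps above can be chained as in the following sketch. It assumes scikit-learn is installed and uses synthetic data invented for the example: `x1_copy` is nearly collinear with `x1`, `constant` never varies, and the target depends only on `x1`. The correlation cutoff of 0.9 and the choice of keeping one top feature are arbitrary stand-ins for the "X" values in the checklist.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold, mutual_info_classif

# Synthetic training data (hypothetical; for illustration only).
rng = np.random.default_rng(0)
n = 300
x1 = rng.normal(size=n)
X_train = pd.DataFrame({
    "x1": x1,
    "x1_copy": x1 * 2.0 + rng.normal(scale=0.01, size=n),  # nearly collinear with x1
    "noise": rng.normal(size=n),
    "constant": np.ones(n),
})
y_train = (x1 > 0).astype(int)  # binary target driven by x1 only

# 1. Drop constant features (variance not above the threshold).
vt = VarianceThreshold(threshold=0.0)
vt.fit(X_train)
X_train = X_train[X_train.columns[vt.get_support()]]

# 2. Pearson correlation on the training set; scan the upper triangle so
#    each pair is seen once, and drop the second feature of each pair.
corr = X_train.corr(method="pearson").abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
X_train = X_train.drop(columns=to_drop)

# 3. Mutual information against the target, sorted in descending order.
mi = pd.Series(
    mutual_info_classif(X_train, y_train, random_state=0),
    index=X_train.columns,
).sort_values(ascending=False)
top_features = mi.head(1).index.tolist()

print(to_drop, mi, top_features, sep="\n")
```

For a regression target, `mutual_info_regression` from the same module takes the place of `mutual_info_classif`. Whatever columns are dropped here would also need to be dropped from the test set, as noted in step 2.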

## Imbalance Handling