# EDA and Feature Engineering Checklist

## Loading and basic details

1. Load the dataset into a DataFrame.
    - If there is more than one data source (CSVs, tables, etc.), look for foreign keys and merge the sources into a single DataFrame.

2. Look at the available columns using df.columns

3. For details such as the datatype and non-null count of each column, use df.info()

4. To get summary statistics of the numerical columns, use df.describe()
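
The loading steps above can be sketched as follows. This is a minimal example, not a prescribed workflow: the two in-memory CSVs, the column names, and the `customer_id` key are all hypothetical stand-ins for real data sources.

```python
import io

import pandas as pd

# Hypothetical in-memory CSVs standing in for real files; with files on disk,
# pass the file path to pd.read_csv instead.
orders_csv = io.StringIO(
    "order_id,customer_id,amount\n1,10,25.0\n2,11,40.5\n3,10,12.0\n"
)
customers_csv = io.StringIO("customer_id,country\n10,India\n11,USA\n")

orders = pd.read_csv(orders_csv)
customers = pd.read_csv(customers_csv)

# customer_id is the assumed foreign key shared by both sources.
# A left merge keeps every order and attaches the matching customer row.
df = orders.merge(customers, on="customer_id", how="left")

print(df.columns.tolist())  # column names
df.info()                   # dtypes and non-null counts
print(df.describe())        # summary statistics of numerical columns
```

A left merge is used here so that no rows from the main table are dropped; other `how` values may suit other datasets better.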

## Data Analysis

1. Find missing values using
    - df.isnull().sum() for the number of missing values per column
    - sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap="viridis") for a visual overview

2. Inspect the distribution of a feature's values using
    - df.feature_name.value_counts()

3. These counts can be plotted as a pie chart, which shows the share of each value within a feature.
    - plt.pie(counts.values, labels=counts.index), where counts = df.feature_name.value_counts()

4. To group data points by one or more features, as with GROUP BY in SQL, use
    - df.groupby([f1, f2, f3]).size().reset_index()

5. To get the 10 most frequent values of a column
    - df.groupby(['Feature_name']).size().reset_index().rename(columns={0:'Count'}).sort_values(by="Count", ascending=False)[:10]
    - Or, more briefly, df['Feature_name'].value_counts().head(10)

6. To compare two features and see how the relationship varies with a third feature, draw a bar plot
    - sns.barplot(x="feat1", y="feat2", data=df, hue="feat3", palette=[...])

7. To analyse the count of each value of a feature, optionally split by a second feature, use a count plot.
    - sns.countplot(x="feat1", data=df, hue="feat2", palette=[...])

8. Write down observations along the way.
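
The tabular steps of this section (missing values, value counts, grouping, and top values) might look like this on a small frame. The toy data and the column names `city`, `segment`, and `age` are hypothetical and only serve to illustrate the calls.

```python
import pandas as pd

# Hypothetical toy data; column names are placeholders.
df = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune", "Mumbai", "Delhi", "Pune"],
    "segment": ["A", "B", "A", "A", "B", "B"],
    "age": [25, None, 31, 40, None, 22],
})

# Number of missing values per column.
missing_per_column = df.isnull().sum()

# Distribution of a single feature's values, most frequent first.
city_counts = df["city"].value_counts()

# SQL-style GROUP BY: number of rows for each (city, segment) combination.
group_sizes = df.groupby(["city", "segment"]).size().reset_index(name="Count")

# The most frequent values of a column (at most 10).
top_cities = city_counts.head(10)

print(missing_per_column)
print(city_counts)
print(group_sizes)
```

Passing `name="Count"` to `reset_index` names the size column directly, which avoids a separate `rename` step.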

## Feature Engineering

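1. Convert numeric-like features to int/float using .astype(int) or .astype(float). Note that .astype(int) raises an error on columns that contain missing values, so handle those first.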


2. Encoding Categorical features

    - For ordinal features, whose categories have a meaningful order, use Ordinal Encoding. Examples:
        - Ratings - Poor, Fair, Good, Very Good
        - Sizes - Small, Medium, Large, etc.
        - Use df['feature'].map({...}) to encode

    - For nominal features, whose categories don't have any meaningful order, use Nominal Encoding (One-Hot Encoding). Examples:
        - Color - Red, Blue, Green
        - Countries - India, USA, UK
        - Gender - Male, Female
        - Use pd.get_dummies(df)
        - As a rule of thumb, choose the encoding by the number of categories:
            - < 20 categories: One-Hot Encoding
            - 20–100 categories: Frequency or Target Encoding
            - 100+ categories (especially with text/IDs): Target Encoding, Hashing, or Embeddings


3. Handling Missing values

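    - Fill them with the mode of the feature (best suited to categorical features) - df["feature"] = df["feature"].fillna(df["feature"].mode()[0])
    - Note that SMOTE is an oversampling technique for imbalanced classes, not an imputation method; if it is needed, apply it only after missing values have been filled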
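
The encoding and imputation steps above can be combined as in the sketch below. The frame, the `rating` and `color` columns, and the integer codes chosen for the ratings are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy data with one ordinal and one nominal feature.
df = pd.DataFrame({
    "rating": ["Poor", "Good", "Fair", "Very Good", "Good"],
    "color": ["Red", "Blue", None, "Red", "Red"],
})

# Fill missing values with the most frequent value (the mode) of the column.
df["color"] = df["color"].fillna(df["color"].mode()[0])

# Ordinal encoding: map each category to an integer that preserves its order.
# The specific codes are an assumption; only their ordering matters.
rating_order = {"Poor": 0, "Fair": 1, "Good": 2, "Very Good": 3}
df["rating_encoded"] = df["rating"].map(rating_order)

# One-hot encoding: one 0/1 indicator column per category of the nominal feature.
encoded = pd.get_dummies(df, columns=["color"], dtype=int)

print(encoded)
```

Imputation is done before encoding here so that the missing color does not end up as a row of all-zero indicator columns.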