![Data Exploration and Preparation banner](./images/2_data_exploration_and_preparation.png)

# 2. Data Exploration and Preparation

Perform exploratory data analysis (EDA) to gain insights into the dataset: calculate summary statistics, visualize the data, and identify missing values and outliers. Then prepare the data by handling missing values, removing outliers, encoding categorical variables, and scaling/normalizing numerical features.

## 2.1 Data Exploration

![Data Exploration and Preparation EDA](./images/2_data_exploration_and_preparation_eda.png)

- **Understand the context and domain**: Gather information about the context and domain from which the data was collected. This can provide valuable insights into the meaning and relevance of the features, as well as potential data quality issues or biases. Example:
    - NYC Taxi Dataset - Understanding that trip_distance of 0 might indicate canceled rides, not errors, requires domain knowledge

- **Check for data quality issues**: In addition to identifying missing values and outliers, look for other data quality issues such as duplicates, inconsistent formatting, or invalid values. Examples:
    - Adult Census Income Dataset - Contains duplicates and inconsistent country name formats that need cleaning
    - Airbnb Listings - Price fields stored as strings with "$" symbols need conversion to numeric
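
The checks above can be sketched with pandas. This is a minimal illustration on a made-up table: the `listings` frame, its column names, and its values are assumptions for the example, not the real Airbnb or Adult Census schemas.

```python
import pandas as pd

# Hypothetical listings table; column names and values are illustrative only.
listings = pd.DataFrame({
    "listing_id": [1, 2, 2, 3],
    "country": ["United-States", "united states", "united states", " Canada "],
    "price": ["$120.00", "$1,250.00", "$1,250.00", "$85.50"],
})

# 1. Count and drop exact duplicate rows.
n_duplicates = int(listings.duplicated().sum())
listings = listings.drop_duplicates().reset_index(drop=True)

# 2. Normalize inconsistent text formatting (whitespace, case, separators).
listings["country"] = (
    listings["country"]
    .str.strip()
    .str.lower()
    .str.replace("-", " ", regex=False)
)

# 3. Strip "$" and thousands separators, then convert to a numeric dtype.
listings["price"] = (
    listings["price"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)

print(n_duplicates)
print(listings)
```

On real data it is usually worth inspecting the duplicate rows before dropping them, since some repeats can be legitimate records.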

- **Explore feature distributions**: Visualize the distributions of individual features using histograms, box plots, or violin plots. This can help identify skewed distributions, multimodality, or other patterns that may require specific data transformations. Examples:
    - House Prices Dataset - Sale prices show a right-skewed distribution that benefits from a log transformation. A log transformation compresses large values more than small ones. Using base 10: \$100k becomes `log10(100000) = 5.0`; \$1M becomes `log10(1000000) = 6.0`; \$10M becomes `log10(10000000) = 7.0`. So the 10x gap between \$100k and \$1M becomes just 1.0 in log space. This pulls in the long right tail and makes the distribution closer to symmetric and bell-shaped, which many statistical methods and linear models handle better.
    - Iris Dataset - Petal length shows a bimodal distribution, revealing distinct species groups.
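
The compression effect of the log transformation can be checked with a few lines of standard-library Python. The three prices are made-up round numbers chosen to match the example above, and base 10 is an assumption; natural log behaves the same way up to a constant factor.

```python
import math

# Made-up sale prices spanning three orders of magnitude.
prices = [100_000, 1_000_000, 10_000_000]

# Base-10 logarithm: each 10x increase in price adds 1.0.
log_prices = [math.log10(p) for p in prices]

# Gaps between consecutive values, before and after the transform.
raw_gaps = [b - a for a, b in zip(prices, prices[1:])]
log_gaps = [b - a for a, b in zip(log_prices, log_prices[1:])]

print(log_prices)
print(raw_gaps)   # gaps grow 10x at each step in raw space
print(log_gaps)   # gaps are equal in log space
```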

<br>

- **Analyze feature correlations**: Calculate and visualize the correlations between features, using either scatter plot matrices or correlation matrices. This can help identify redundant features or potential multicollinearity issues.

  Examples:
    - Cars Dataset - Engine size and horsepower show strong positive correlation (0.87), suggesting redundancy
    - Real Estate - Square footage and number of rooms highly correlated, may only need one for modeling
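
A correlation matrix is a one-liner in pandas. The sketch below uses a small invented `cars` table (the values are not from any real dataset) and an arbitrary threshold of 0.8 to list strongly correlated pairs.

```python
import pandas as pd

# Invented values for illustration; not taken from a real cars dataset.
cars = pd.DataFrame({
    "engine_size": [1.6, 2.0, 2.5, 3.0, 3.5, 5.0],
    "horsepower": [110, 150, 180, 240, 280, 400],
    "mpg": [38, 33, 30, 25, 22, 15],
})

# Pearson correlation is the default method of DataFrame.corr().
corr = cars.corr()
print(corr.round(2))

# List each feature pair whose absolute correlation exceeds the threshold.
threshold = 0.8  # assumed cut-off; tune per project
cols = corr.columns
strong_pairs = [
    (cols[i], cols[j], round(float(corr.iloc[i, j]), 2))
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if abs(corr.iloc[i, j]) > threshold
]
print(strong_pairs)
```

Pearson correlation only captures linear relationships; `cars.corr(method="spearman")` is an option when the relationship is monotonic but not linear.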

<br>

- **Investigate feature-target relationships**: For supervised learning problems, explore the relationships between the features and the target variable. This can be done using scatter plots, box plots, or other visualization techniques, depending on the data types. Examples:
    - Diamond Prices - Clear exponential relationship between carat weight and price
    - Wine Quality - pH levels show non-linear relationship with quality ratings

- **Identify potential interactions**: Look for potential interactions between features, as these may reveal important patterns or relationships that could improve model performance. Example:
    - Bike Sharing Dataset - Temperature and humidity affect rental counts differently on weekdays vs weekends:
        - Weekdays (commuters): high temp + high humidity = still high rentals (people must commute to work regardless); low temp + high humidity = moderate rentals (they still need to get to work).
        - Weekends (leisure riders): high temp + high humidity = low rentals (too uncomfortable for leisure riding); high temp + low humidity = very high rentals (perfect weather for fun rides).

## 2.2 Data Preparation/Cleaning

- **Handle categorical data**: Choose an encoding technique such as label (ordinal) encoding (e.g. 0-short, 1-medium, 2-tall), one-hot encoding (e.g. is_short, is_medium, is_tall), or target encoding, depending on the nature of the categorical features and the machine learning algorithm being used.

  Examples:
    - Titanic Dataset - Passenger class (1st, 2nd, 3rd) uses ordinal encoding; Embarked port (S, C, Q) uses one-hot encoding
    - Adult Income - Education level (Bachelors, Masters, PhD) benefits from ordinal encoding preserving education hierarchy

![Data Exploration and Preparation handle categorical data](./images/2_data_exploration_and_preparation_handle_categorical_data.png)
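
The two encodings from the Titanic example can be sketched with pandas. The `passengers` frame is a hypothetical four-row sample, and the class labels are written as strings here purely to show the mapping step.

```python
import pandas as pd

# Hypothetical sample; labels are strings here only to illustrate the mapping.
passengers = pd.DataFrame({
    "Pclass": ["3rd", "1st", "2nd", "3rd"],
    "Embarked": ["S", "C", "Q", "S"],
})

# Ordinal encoding: the classes have a natural order, so map them to ranks.
class_order = {"1st": 1, "2nd": 2, "3rd": 3}
passengers["Pclass_ordinal"] = passengers["Pclass"].map(class_order)

# One-hot encoding: ports have no order, so each gets its own 0/1 column.
embarked_dummies = pd.get_dummies(
    passengers["Embarked"], prefix="Embarked", dtype=int
)

encoded = pd.concat([passengers[["Pclass_ordinal"]], embarked_dummies], axis=1)
print(encoded)
```

For linear models, one of the one-hot columns is often dropped (`drop_first=True`) to avoid perfectly collinear features.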

- **Address imbalanced data**: If the target variable is imbalanced (one class is significantly more prevalent than others), consider techniques like oversampling, undersampling, or using class weights to mitigate the impact of imbalanced data on model performance. SMOTE (Synthetic Minority Over-sampling Technique) creates synthetic minority-class examples by interpolating between existing ones.

  Examples:
    - Medical Diagnosis (Rare Disease) - 1:1000 ratio needs synthetic minority oversampling
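    - Credit Card Fraud - Only 0.17% of transactions are fraudulent, requiring SMOTE oversampling or undersampling. Real fraud example A - [amount=\$5000, time=2am, location=Nigeria]; Real fraud example B - [amount=\$3000, time=3am, location=Russia]; SMOTE creates a synthetic example by interpolating the numeric features - [amount=\$4000, time=2:30am]. A categorical feature such as location cannot be interpolated; variants such as SMOTE-NC handle it by taking the most common category among the nearest neighbors.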

![Data Exploration and Preparation address imbalanced data](./images/2_data_exploration_and_preparation_address_imbalanced_data.png)
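
The interpolation step at the heart of SMOTE can be sketched in plain Python. This is a simplified illustration, not the imbalanced-learn implementation: real SMOTE picks the second point from the k nearest minority-class neighbors, whereas here the two points are given directly. The helper name `interpolate_sample` is hypothetical.

```python
import random

def interpolate_sample(a, b, rng):
    """Return a synthetic point on the line segment between points a and b."""
    lam = rng.random()  # interpolation weight in [0, 1)
    return [x + lam * (y - x) for x, y in zip(a, b)]

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Two minority-class (fraud) examples, numeric features only:
# [amount in dollars, hour of day]
fraud_a = [5000.0, 2.0]
fraud_b = [3000.0, 3.0]

synthetic = interpolate_sample(fraud_a, fraud_b, rng)
print(synthetic)
```

Resampling should be applied to the training split only, so that the evaluation data keeps the real class ratio.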

- **Feature engineering**: Explore the creation of new features from existing ones, such as interactions, ratios, or derived features, which may capture important patterns or relationships in the data.

  Examples:
    - House Prices - Creating "TotalSF" by combining basement, 1st floor, and 2nd floor square footage
    - E-commerce Dataset - Creating "days_since_last_purchase" from timestamp data improves churn prediction

![Data Exploration and Preparation feature engineering](./images/2_data_exploration_and_preparation_feature_engineering.png)
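
Both derived features from the examples can be sketched with pandas. The column names `TotalBsmtSF`, `1stFlrSF`, and `2ndFlrSF` are assumed to follow the Kaggle House Prices naming; the `customers` table and the `snapshot_date` reference date are hypothetical.

```python
import pandas as pd

# Assumed column names; values are invented.
houses = pd.DataFrame({
    "TotalBsmtSF": [800, 0, 1200],
    "1stFlrSF": [900, 1100, 1300],
    "2ndFlrSF": [700, 0, 600],
})

# Combined feature: total square footage across all floors.
houses["TotalSF"] = houses["TotalBsmtSF"] + houses["1stFlrSF"] + houses["2ndFlrSF"]

# Hypothetical customer table with the timestamp of each last purchase.
customers = pd.DataFrame({
    "customer_id": [1, 2],
    "last_purchase": pd.to_datetime(["2024-01-15", "2024-03-01"]),
})

# Recency feature relative to an assumed snapshot date.
snapshot_date = pd.Timestamp("2024-03-31")
customers["days_since_last_purchase"] = (
    snapshot_date - customers["last_purchase"]
).dt.days

print(houses["TotalSF"].tolist())
print(customers["days_since_last_purchase"].tolist())
```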

- **Feature selection**: Identify and remove irrelevant or redundant features that may not contribute significantly to the model's performance. This can be done using techniques like correlation analysis, recursive feature elimination, or embedded feature selection methods.

  Example:
    - Gene Expression Data - Thousands of gene features reduced to top 50 using recursive feature elimination for cancer classification

![Data Exploration and Preparation feature selection](./images/2_data_exploration_and_preparation_feature_selection.png)
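
Of the techniques listed, correlation analysis is the simplest to sketch. The helper below (a hypothetical name, `drop_highly_correlated`) drops one feature from each pair whose absolute correlation exceeds an assumed threshold. It is a filter method, not recursive feature elimination, and the data is invented.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df, threshold=0.9):
    """Drop the later column of every pair with |correlation| > threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is considered once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Invented real-estate features: rooms closely tracks sqft, age does not.
homes = pd.DataFrame({
    "sqft": [800, 1200, 1500, 2000, 2600, 3200],
    "rooms": [2, 3, 4, 5, 6, 8],
    "age": [30, 5, 42, 12, 25, 8],
})

reduced, dropped = drop_highly_correlated(homes, threshold=0.9)
print(dropped)
print(list(reduced.columns))
```

Which member of a correlated pair to keep is a judgment call; this sketch simply keeps the column that appears first.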

- **Data transformation**: Apply appropriate transformations to the data, such as log transformations, Box-Cox transformations, or other non-linear transformations, to address skewed distributions (e.g. individual income) or improve the linearity of relationships between features and the target variable. These transformations can make skewed distributions more symmetric, reduce the influence of outliers, and stabilize variance, which often leads to more accurate and robust predictive models.

  Examples:
    - Income Prediction - Log transform applied to the income distribution (heavily right-skewed). The massive gap between \$30k and \$10M (a 333x difference) becomes a difference of just 5.8 in natural-log space (`ln(333) ≈ 5.8`).
    - Population Density - Square root transformation compresses city population densities. Instead of Dhaka (44,000 people/km²) being 4,400x denser than a rural area (10 people/km²) and dominating everything, it is only about 66x in √-space (`√4400 ≈ 66`), allowing the model to learn patterns across all density levels, not just the extremes.

![Data Exploration and Preparation data transformation](./images/2_data_exploration_and_preparation_data_transformation.png)
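
The arithmetic behind both examples can be verified with the standard library. The income and density figures are the ones quoted in the examples; the natural log is assumed for the income case.

```python
import math

# Income example: ratio in raw space vs. difference in natural-log space.
low_income, high_income = 30_000, 10_000_000
raw_ratio = high_income / low_income
log_gap = math.log(high_income) - math.log(low_income)

# Density example: ratio in raw space vs. ratio after a square root.
rural, dhaka = 10, 44_000  # people per km²
raw_density_ratio = dhaka / rural
sqrt_density_ratio = math.sqrt(dhaka) / math.sqrt(rural)

print(round(raw_ratio), round(log_gap, 1))                  # 333 5.8
print(round(raw_density_ratio), round(sqrt_density_ratio))  # 4400 66
```

For data containing zeros, `math.log1p` (log of 1 + x) is a common substitute, since the plain logarithm is undefined at zero.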

- **Handling date and time data**: If the dataset includes date or time features, consider extracting additional features like day of the week, month, or hour, which may capture important patterns or seasonality.

  Examples:
    - Retail Sales - Extracting month, day_of_week, is_weekend, is_holiday from transaction dates
    - Energy Consumption - Hour_of_day, season, and is_business_day features from timestamps predict usage patterns

![Data Exploration and Preparation handling time](./images/2_data_exploration_and_preparation_handling_time.png)
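
Calendar features like these can be derived with the standard library alone. The helper name `time_features` is hypothetical, and `is_holiday` is left out because it needs a region-specific holiday calendar.

```python
from datetime import datetime

def time_features(ts: datetime) -> dict:
    """Derive simple calendar features from a single timestamp."""
    return {
        "month": ts.month,
        "day_of_week": ts.weekday(),      # Monday=0 ... Sunday=6
        "hour_of_day": ts.hour,
        "is_weekend": ts.weekday() >= 5,  # Saturday or Sunday
    }

# A Saturday evening transaction (made-up timestamp).
saturday_evening = datetime(2024, 7, 6, 18, 30)
features = time_features(saturday_evening)
print(features)
```

With pandas, the same fields are available on a datetime column through the `.dt` accessor (for example `.dt.month` and `.dt.dayofweek`).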

- **Data integration and standardization**: If working with multiple data sources, consider techniques for integrating and merging the data, while handling issues like inconsistent formats, missing values, or duplicate records.

  Examples:
    - Healthcare Records - Merging patient data from multiple hospitals with different ID systems and date formats
    - Financial Data - Combining stock prices (daily) with earnings reports (quarterly) requires careful time alignment

![Data Exploration and Preparation data standardization](./images/2_data_exploration_and_preparation_data_standardization.png)
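
One way to align daily prices with quarterly earnings is an as-of join, sketched here with `pandas.merge_asof`. Both tables and their column names are invented. Each price row receives the most recent earnings report published on or before that date, which avoids leaking future information into earlier rows.

```python
import pandas as pd

# Invented daily prices and quarterly earnings reports.
prices = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-30", "2024-02-15", "2024-05-02"]),
    "close": [101.0, 104.5, 110.2],
})
earnings = pd.DataFrame({
    "report_date": pd.to_datetime(["2024-01-31", "2024-04-30"]),
    "eps": [1.20, 1.35],
})

# merge_asof requires both frames to be sorted by their join keys.
aligned = pd.merge_asof(
    prices.sort_values("date"),
    earnings.sort_values("report_date"),
    left_on="date",
    right_on="report_date",
    direction="backward",  # use the latest report at or before each date
)
print(aligned)
```

The first price row has no earlier report, so its `eps` is missing; how to treat such rows (drop, impute, or flag) depends on the use case.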

- **Data versioning and reproducibility**: Maintain a clear record of the data transformations and preprocessing steps applied, using version control or data pipelines to ensure reproducibility and transparency in the modeling process.

  Example:
    - Kaggle Competitions - Winners share exact preprocessing pipelines enabling result reproduction and learning
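
A lightweight way to record which preprocessing was applied is to store the configuration alongside a hash of it. The sketch below is one possible approach using only the standard library; the `fingerprint` helper and the configuration keys are hypothetical.

```python
import hashlib
import json

def fingerprint(obj) -> str:
    """Return a stable SHA-256 hex digest of a JSON-serializable object."""
    # sort_keys makes the digest independent of dictionary key order.
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

# Hypothetical record of the preprocessing steps applied to a dataset.
preprocessing_config = {
    "drop_duplicates": True,
    "log_transform": ["price"],
    "one_hot": ["Embarked"],
    "random_seed": 42,
}

config_hash = fingerprint(preprocessing_config)
print(config_hash[:12])  # short identifier to store with model artifacts
```

Committing the configuration file to version control and logging the hash with each trained model makes it possible to trace a result back to the exact preprocessing that produced it.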

Data exploration and preparation is typically where data scientists spend most of their time.

![Data science perception vs reality](./images/2_data_exploration_and_preparation_perception_reality.jpeg)

A common saying in the industry: _"Garbage in, garbage out"_.