# Exploratory Data Analysis (EDA)

**First**, *Formulate Questions* - Define objectives and questions to be answered from the data.

- If we don't know what we are looking for, we are unlikely to find anything interesting.
- Questions guide the analysis and help focus on relevant aspects of the data.

### Four objectives of EDA

1. Discover Patterns
2. Spot Anomalies
3. Frame Hypotheses
4. Check Assumptions

### Key steps in EDA

1. *Data Loading* - Gather relevant data from sources:
    1. Files (CSV, Excel, TXT, JSON, XML, PDF, etc.)
    1. Surveys (Google Forms, SurveyMonkey)
    1. Web Scraping (BeautifulSoup, Scrapy, Selenium)
    1. APIs (Twitter, Facebook, Google Maps, etc.)
    1. Databases (SQL, NoSQL)
1. *Data Cleaning* - Prepare data for analysis:
    1. Handle missing values
        1. Remove rows with missing values
        1. Impute missing values
            - Mean, Median, Mode
            - Forward fill, Backward fill
            - Interpolation
        1. Drop columns with too many missing values
    1. Remove duplicates
    1. Correct errors (invalid values)
    1. Standardize feature names
    1. Remove irrelevant and redundant features
    1. Standardize data:
        1. Convert data types (e.g., `str -> datetime`)
        1. Resolve inconsistent labels (e.g., `Female` and `F` both present in the `gender` column)
        1. Use one format for dates, phone numbers, etc.
1. *Feature Engineering* - Derive features based on domain knowledge and the questions to be answered:
    1. Handle multicollinearity (e.g., PCA)
    1. Handle imbalanced data (e.g., SMOTE)
    1. Create new variables
        1. `age` from `date_of_birth`
        1. `BMI` from `weight` and `height`
        1. `season` from `date`
    1. *Data Enrichment* - Add new data from external sources
        1. Geocoding (convert address to latitude and longitude using **Google Maps API**)
        1. Sentiment analysis (analyze text data using **OpenAI API**)
    1. Discretize continuous variables
        1. `age_group` from `age`
        1. `income_group` from `income`
        1. `weight_category` from `weight`
    1. Handle date-time variables
        1. Extract year, month, day, day of week, etc.
        1. Time since last purchase
        1. Time since first purchase
    1. Aggregate data
        1. `total_sales` from `sales` table
        1. `orders_count` from `orders` table
        1. `maximum_amount` from `transactions` table
1. *Transformation*
    1. Scale and transform the data:
        1. Normalize
            - Min-max Scaling
        1. Standardize
            - Z-score
            - Robust Scaling (median and IQR)
        1. Reduce skewness
            - Log Transformation
    1. Handle Outliers
        - Z-score method
        - IQR method
        - Percentile capping (e.g., at the 95th or 99th percentile)
        - Domain knowledge
    1. Encode categorical variables (e.g., one-hot encoding)
1. *Data Exploration*
    1. Univariate Analysis - Understand the distribution of each variable.
        1. *Descriptive Statistics*
            - Central tendency (mean, median, mode)
            - Dispersion (range, variance, standard deviation)
            - Shape (skewness, kurtosis)
        1. *Visualizations*
            - Histogram
            - Boxplot
            - Density plot
            - Violin plot
            - Bar plot
            - Pie chart
            - Frequency table
            - Word cloud
    1. Bivariate Analysis - Understand the relationship between two variables.
        1. *Descriptive Statistics*
            - Covariance
            - Correlation
        1. *Visualizations*
            - Scatter plot
            - Line plot
            - Heatmap
            - Pairplot
            - Boxplot
            - Violin plot
            - Bar plot
            - Stacked bar plot
            - Grouped bar plot
1. *Modeling* (optional) - Use statistical and machine learning models to:
    1. Quantify relationships (regression)
    1. Make inferences about a population from a sample
    1. Test hypotheses
1. *Communicate Insights* - Present findings to stakeholders using:
    1. *Reports*: Jupyter Notebook
    1. *Dashboards*: Tableau, Power BI, Looker Studio (formerly Google Data Studio)
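
The cleaning and feature-engineering steps listed above can be sketched in a few lines. This is a minimal illustration rather than a prescribed workflow: the use of pandas, the toy data, the column names (`Customer ID`, `Gender`, `Date Of Birth`, `Income`), the reference date, and the age bins are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy data; names and values are invented for illustration.
raw = pd.DataFrame({
    "Customer ID": [1, 2, 2, 3, 4],
    "Gender": ["Female", "M", "M", "F", "Male"],
    "Date Of Birth": ["1990-05-17", "1985-11-02", "1985-11-02",
                      "2000-01-30", "1978-07-09"],
    "Income": [52000.0, None, None, 38000.0, 91000.0],
})

# Standardize feature names: lower-case with underscores.
df = raw.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))

# Remove exact duplicate rows.
df = df.drop_duplicates().reset_index(drop=True)

# Impute missing income with the median of the observed values.
df["income"] = df["income"].fillna(df["income"].median())

# Make category labels consistent (one spelling per category).
df["gender"] = df["gender"].replace({"F": "Female", "M": "Male"})

# Convert data types: str -> datetime.
df["date_of_birth"] = pd.to_datetime(df["date_of_birth"], format="%Y-%m-%d")

# Create a new variable: age in whole years at an assumed reference date.
reference = pd.Timestamp("2024-01-01")
dob = df["date_of_birth"]
had_birthday = (dob.dt.month < reference.month) | (
    (dob.dt.month == reference.month) & (dob.dt.day <= reference.day)
)
df["age"] = reference.year - dob.dt.year - (~had_birthday).astype(int)

# Discretize a continuous variable: age_group from age (bins are assumptions).
df["age_group"] = pd.cut(
    df["age"], bins=[0, 29, 49, 120], labels=["<30", "30-49", "50+"]
)
```

Fixing the reference date keeps the derived `age` reproducible; in practice it would typically be the date the data was extracted.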

In summary, EDA is an iterative process of assessing, preprocessing, understanding, and communicating insights from data. It serves as a preliminary stage before **model building**, with the overall goal of gaining initial knowledge about the dataset and discovering meaningful patterns.
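
The scaling and outlier-handling techniques named in the *Transformation* step reduce to a few short formulas. The sketch below writes them out with only the Python standard library so each formula is visible. The sample values are hypothetical, and the `inclusive` quantile method and the `1.5 * IQR` fence are common conventions assumed here, not choices stated in these notes.

```python
import math
import statistics

# Hypothetical sample with one obvious extreme value.
values = [12.0, 15.0, 14.0, 10.0, 13.0, 11.0, 16.0, 95.0]

# Z-score: subtract the mean, divide by the (population) standard deviation.
mean = statistics.fmean(values)
std = statistics.pstdev(values)
z_scores = [(v - mean) / std for v in values]

# Min-max scaling: rescale to the [0, 1] range.
lo, hi = min(values), max(values)
min_max = [(v - lo) / (hi - lo) for v in values]

# Robust scaling: center on the median and divide by the IQR,
# so extreme values have less influence on the scale.
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1
robust = [(v - median) / iqr for v in values]

# Log transformation: compress large values to reduce right skew.
log_values = [math.log1p(v) for v in values]

# IQR method: flag points more than 1.5 * IQR beyond the quartiles.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in values if v < lower or v > upper]
```

In this small sample the IQR method flags `95.0`, while its z-score stays below the commonly used cutoff of 3 because the extreme value inflates the standard deviation. This is one reason to compare several outlier methods rather than rely on a single one.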