# Pandas

- Overview: https://pandas.pydata.org/docs/getting_started/overview.html
- Getting Started: https://pandas.pydata.org/docs/getting_started/intro_tutorials/01_table_oriented.html

### Key steps in Exploratory Data Analysis (EDA)

1. *Formulate Questions* - Define objectives and questions to be answered from the data.
2. *Data Collection* - Gather relevant data from sources:
    1. APIs
    1. Databases
    1. Web Scraping
    1. Files
    1. Surveys
3. *Data Cleaning* - Prepare data for analysis by handling:
    1. Missing values
    1. Duplicates
    1. Outliers, detected using:
        - z-score
        - IQR
        - domain knowledge
    1. Errors (invalid values)
    1. Inconsistencies (e.g., `Female` and `F` both present in a gender column)
    1. Data types (e.g., convert strings to datetime)
    1. Formatting (e.g., date formats)
    1. Column names (e.g., rename columns)
4. *Feature Engineering*
    1. Create new variables
        1. age from date of birth
        1. BMI from weight and height
        1. season from date
    1. Scale variables (e.g., standardization)
        1. Min-Max scaling
        1. Z-score scaling
        1. Robust scaling
        1. Normalization
        1. Log transformation
    1. Encode categorical variables (e.g., one-hot encoding)
    1. Handle multicollinearity (e.g., PCA)
    1. Handle imbalanced data (e.g., SMOTE)
5. *Data Exploration*
    1. Univariate Analysis - Understand the distribution of each variable.
        1. *Descriptive Statistics*
            - Central tendency (mean, median, mode)
            - Dispersion (range, variance, standard deviation)
            - Shape (skewness, kurtosis)
        1. *Visualizations*
            - Histogram
            - Boxplot
            - Density plot
            - Violin plot
            - Bar plot
            - Pie chart
            - Frequency table
            - Word cloud
    1. Bivariate Analysis - Understand the relationship between two variables.
        1. *Descriptive Statistics*
            - Covariance
            - Correlation
        1. *Visualizations*
            - Scatter plot
            - Line plot
            - Heatmap
            - Pairplot
            - Boxplot
            - Violin plot
            - Bar plot
            - Stacked bar plot
            - Grouped bar plot
6. *Statistical Analysis*
    1. Quantify relationships (regression)
    1. Make inferences about a population from a sample
    1. Hypothesis testing
7. *Communicate Insights* - Present findings to stakeholders using:
    1. *Reports*: Jupyter Notebook
    1. *Dashboards*: Tableau, Power BI, Looker Studio (formerly Google Data Studio)
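
Several of the data cleaning steps in item 3 can be sketched in pandas. This is a minimal illustration, not a recommended pipeline: the `raw` DataFrame, its column names, the `clean` helper, the median fill, and the 1.5 × IQR cutoff are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy data: column names and values are illustrative only.
raw = pd.DataFrame({
    "Gender ": ["Female", "F", "M", "Male", "F", "F", "M"],
    "Joined": ["2021-01-05", "2021-02-10", "2021-02-10", "2021-03-15",
               "2021-04-01", "2021-04-01", "2021-05-20"],
    "Income": [52000.0, 49000.0, None, 51000.0, 50000.0, 50000.0, 900000.0],
})


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Apply a few common cleaning steps and return a new DataFrame."""
    # Column names: strip stray whitespace and lower-case them.
    df = df.rename(columns=lambda c: c.strip().lower())

    df = df.assign(
        # Inconsistencies: map abbreviations onto a single spelling.
        gender=df["gender"].replace({"F": "Female", "M": "Male"}),
        # Data types: convert date strings to datetime.
        joined=pd.to_datetime(df["joined"]),
    )

    # Duplicates: drop rows that are identical in every column.
    df = df.drop_duplicates()

    # Missing values: fill with the median (one choice among several).
    df = df.assign(income=df["income"].fillna(df["income"].median()))

    # Outliers: keep rows within 1.5 * IQR of the quartiles.
    q1, q3 = df["income"].quantile([0.25, 0.75])
    iqr = q3 - q1
    keep = df["income"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return df[keep].reset_index(drop=True)


cleaned = clean(raw)
print(cleaned)
```

Normalizing labels before dropping duplicates matters here: rows that differ only in `F` versus `Female` would otherwise not be recognized as duplicates.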

In summary, EDA is an iterative process of assessing, preprocessing, understanding, and communicating data in order to discover meaningful patterns. It is a preliminary stage before **model building**, and its overall goal is to gain initial knowledge about the dataset.
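
As a rough illustration of the feature engineering and data exploration steps (items 4 and 5), the sketch below derives BMI, applies z-score scaling, one-hot encodes a categorical column, and computes a few univariate and bivariate statistics. The `people` DataFrame and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: column names and values are illustrative only.
people = pd.DataFrame({
    "weight_kg": [60.0, 72.0, 85.0, 95.0],
    "height_m": [1.60, 1.75, 1.80, 1.90],
    "group": ["a", "b", "a", "b"],
})

# Create a new variable: BMI = weight (kg) / height (m) squared.
people["bmi"] = people["weight_kg"] / people["height_m"] ** 2

# Z-score scaling: subtract the mean, divide by the standard deviation.
# Series.std() uses the sample standard deviation (ddof=1) by default.
weight = people["weight_kg"]
people["weight_z"] = (weight - weight.mean()) / weight.std()

# Encode a categorical variable with one-hot encoding.
encoded = pd.get_dummies(people, columns=["group"])

# Univariate analysis: central tendency, dispersion, and shape of one variable.
summary = people["bmi"].describe()
skewness = people["bmi"].skew()

# Bivariate analysis: Pearson correlation between two variables.
corr = people["weight_kg"].corr(people["height_m"])

print(summary)
print("skewness:", skewness)
print("correlation:", corr)
```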