# Week 5: EDA - Descriptive Statistics

### Objectives
- Learn to compute key statistical summaries and understand data distributions.
- Understand how to calculate and interpret correlations in data.

### Topics
- **Mean, Median, Mode**: Measures of central tendency to summarize data.
- **Standard Deviation and Variance**: Measures of data spread.
- **Correlations**: Identifying relationships between variables.

### Content

1. **Introduction to Descriptive Statistics**
   - Descriptive statistics provide a way to summarize and understand the basic features of a dataset. They include measures of central tendency (like mean, median, and mode) and measures of variability (like standard deviation and variance).
   - **Importance of Descriptive Statistics**:
     - Summarizes data for a quick overview of main characteristics.
     - Helps identify data trends, patterns, and potential anomalies.

2. **Measures of Central Tendency**
   - **Mean (Average)**:
     - The mean is the sum of all values divided by the number of values. Because every value contributes to it, a single extreme value can pull the mean away from where most of the data lies.
     - **Example**:
       ```python
       mean_value = df['column'].mean()
       print(mean_value)
       ```
   - **Median**:
     - The median is the middle value of the dataset when arranged in ascending order; with an even number of values, it is the average of the two middle values. It’s less affected by outliers than the mean.
     - **Example**:
       ```python
       median_value = df['column'].median()
       print(median_value)
       ```
   - **Mode**:
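     - The mode is the most frequently occurring value in the dataset. It’s useful for categorical data or data with repeated values. A dataset can have more than one mode, so pandas' `mode()` returns a Series rather than a single value.
     - **Example**:
       ```python
       # mode() returns a Series because several values can tie for most frequent
       mode_values = df['column'].mode()
       print(mode_values)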
       ```
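
To make the difference between the three measures concrete, here is a minimal sketch on made-up data. The `salaries` and `colors` series and their values are illustrative assumptions, not part of any course dataset.

```python
import pandas as pd

# Hypothetical salaries in thousands; 200 is a deliberate outlier.
salaries = pd.Series([30, 32, 32, 35, 200])

mean_value = salaries.mean()      # sum of 329 divided by 5 values
median_value = salaries.median()  # middle value of the sorted data
mode_values = salaries.mode()     # Series holding the most frequent value(s)

print(f"mean={mean_value}, median={median_value}, mode={mode_values.tolist()}")

# A column can have several modes; mode() returns all of them, sorted.
colors = pd.Series(["red", "blue", "red", "blue", "green"])
color_modes = colors.mode()
print(color_modes.tolist())
```

The single outlier lifts the mean to 65.8 while the median and mode stay at 32, which is why the median is often preferred for skewed data.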

3. **Measures of Variability**
   - **Standard Deviation**:
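     - Standard deviation (SD) quantifies how far values typically lie from the mean, in the same units as the data. A higher SD indicates greater data dispersion. By default, pandas' `std()` computes the sample standard deviation (dividing by n - 1); pass `ddof=0` for the population version.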
     - **Example**:
       ```python
       std_dev = df['column'].std()
       print(std_dev)
       ```
   - **Variance**:
     - Variance is the average of the squared deviations from the mean, so it is expressed in squared units of the data. It is the square of the standard deviation. Like `std()`, pandas' `var()` divides by n - 1 by default.
     - **Example**:
       ```python
       variance = df['column'].var()
       print(variance)
       ```
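
The relationship between variance and standard deviation, and the effect of the `ddof` argument, can be checked on a small example. The numbers below are made up for illustration and chosen so that the population variance comes out to a round value.

```python
import pandas as pd

# Hypothetical values with mean 5; squared deviations sum to 32.
values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

sample_var = values.var()             # divides by n - 1 (pandas default, ddof=1)
sample_std = values.std()             # square root of the sample variance
population_var = values.var(ddof=0)   # divides by n
population_std = values.std(ddof=0)   # square root of the population variance

print(sample_var, sample_std)
print(population_var, population_std)
```

Here the population variance is 32 / 8 = 4 and the population SD is 2, while the sample versions are slightly larger because they divide by 7 instead of 8.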

4. **Summarizing Descriptive Statistics with `describe()`**
   - The `describe()` method summarizes every numeric column by default, reporting count, mean, standard deviation, min, max, and quartiles (25%, 50%, 75%). The 50% row is the median. Pass `include='all'` to summarize non-numeric columns as well.
   - **Example**:
     ```python
     summary_stats = df.describe()
     print(summary_stats)
     ```
   - **Interpreting Results**:
     - Use these statistics to understand the general shape and spread of your data. A mean far from the 50% value hints at skew, and a min or max far outside the 25% to 75% range flags unusually low or high values worth inspecting.
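
As a sketch of how the `describe()` output can be read programmatically, the example below pulls the quartiles out of the summary and applies the 1.5 × IQR rule. That rule is one common heuristic for flagging unusual values and is an addition here, not something the lesson prescribes; the DataFrame and its columns are hypothetical.

```python
import pandas as pd

# Hypothetical data: one numeric column and one text column.
df = pd.DataFrame({
    "age": [22, 25, 27, 30, 95],
    "city": ["Lima", "Oslo", "Lima", "Pune", "Oslo"],
})

summary = df.describe()  # only the numeric column appears by default
print(summary)

# Individual statistics are addressed by row label and column name.
q1 = summary.loc["25%", "age"]
q3 = summary.loc["75%", "age"]
iqr = q3 - q1

# 1.5 * IQR rule: a common heuristic, not a definitive outlier test.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df.loc[(df["age"] < lower) | (df["age"] > upper), "age"]
print(outliers.tolist())
```

In this made-up data the mean (39.8) sits well above the median (27), and the value 95 falls outside the computed fences, so both signals point at the same suspicious row.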

5. **Correlations**
   - **Understanding Correlation**:
     - Correlation measures how strongly two variables move together. The Pearson coefficient, which pandas computes by default, captures the strength and direction of a *linear* relationship. Values range from -1 to 1:
       - **1**: Perfect positive correlation.
       - **0**: No linear correlation (a non-linear relationship may still exist).
       - **-1**: Perfect negative correlation.
     - **Positive Correlation**: As one variable increases, the other tends to increase.
     - **Negative Correlation**: As one variable increases, the other tends to decrease.
   - **Calculating Correlation with `corr()`**:
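     - The `corr()` method computes pairwise correlations (Pearson by default) between the numeric columns of a DataFrame and returns them as a matrix. In pandas 2.0 and later, non-numeric columns are no longer dropped automatically, so pass `numeric_only=True` or select the numeric columns first.
     - **Example**:
       ```python
       # numeric_only=True (pandas 1.5+) skips text columns, which can otherwise raise an error in pandas 2.x
       correlation_matrix = df.corr(numeric_only=True)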
       print(correlation_matrix)
       ```
   - **Using Heatmaps for Visualization**:
     - Visualize correlations with a heatmap, which provides a quick visual representation of relationships between variables.
     - **Example**:
       ```python
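       import seaborn as sns
       import matplotlib.pyplot as plt

       # Fix the color scale to [-1, 1] so that 0 maps to the neutral midpoint of 'coolwarm'
       sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
       plt.show()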
       ```
   - **Interpreting Correlations**:
     - Correlation analysis helps identify candidate predictors and related variables. A high positive correlation between two variables shows that they tend to move together, but it does not by itself establish cause and effect: both may be driven by a shared underlying factor, or the association may be coincidental.

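The three reference values of the coefficient can be reproduced on a tiny constructed DataFrame. All column names and values below are assumptions made for illustration. The numeric columns are selected explicitly with `select_dtypes` so the sketch does not depend on the pandas version.

```python
import pandas as pd

x = [-2, -1, 0, 1, 2]
df = pd.DataFrame({
    "x": x,
    "linear_up": [2 * v + 1 for v in x],   # rises in a straight line with x
    "linear_down": [-3 * v for v in x],    # falls in a straight line as x rises
    "squared": [v ** 2 for v in x],        # depends on x, but not linearly
    "label": ["a", "b", "c", "d", "e"],    # non-numeric column, excluded below
})

# Keep numeric columns explicitly so the call works across pandas versions.
correlation_matrix = df.select_dtypes(include="number").corr()
print(correlation_matrix.round(2))

r_up = correlation_matrix.loc["x", "linear_up"]
r_down = correlation_matrix.loc["x", "linear_down"]
r_curve = correlation_matrix.loc["x", "squared"]
```

The `squared` column is fully determined by `x`, yet its Pearson correlation with `x` is zero because the relationship is symmetric rather than linear. A coefficient near 0 therefore rules out a linear trend, not every kind of dependence.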
6. **Practical Workflow for Descriptive Statistics**
   - A typical workflow for performing descriptive statistics includes:
     1. Calculating measures of central tendency (mean, median, mode).
     2. Assessing data spread using standard deviation and variance.
     3. Summarizing statistics using `describe()`.
     4. Analyzing relationships through correlation coefficients and visualizations.
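
The four steps above can be strung together as in the following sketch. The helper name `summarize_numeric` and the study-hours dataset are hypothetical, introduced only to show one possible arrangement of the workflow.

```python
import pandas as pd

def summarize_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return central tendency and spread for each numeric column."""
    numeric = df.select_dtypes(include="number")
    return pd.DataFrame({
        "mean": numeric.mean(),
        "median": numeric.median(),
        "mode": numeric.mode().iloc[0],  # first mode if several values tie
        "std": numeric.std(),            # sample SD (ddof=1)
        "variance": numeric.var(),       # sample variance (ddof=1)
    })

# Hypothetical dataset for illustration.
df = pd.DataFrame({
    "hours_studied": [1, 2, 2, 4, 6],
    "exam_score": [52, 58, 60, 71, 84],
})

central_and_spread = summarize_numeric(df)                # steps 1 and 2
overview = df.describe()                                  # step 3
correlations = df.select_dtypes(include="number").corr()  # step 4

print(central_and_spread)
print(overview)
print(correlations)
```

Reporting only the first mode keeps the table to one row per column. When ties matter, inspect `df[col].mode()` directly instead.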

### Exercises
- **Exercise 1**: Compute the mean, median, and mode for a dataset column and interpret each value in context.
- **Exercise 2**: Calculate the standard deviation and variance for a numeric column and describe what the results suggest about data spread.
- **Exercise 3**: Generate a correlation matrix for the dataset, visualize it using a heatmap, and interpret key relationships between variables.
