# Experiment 5: Exploratory Data Analysis (EDA)

## What is EDA?

**Exploratory Data Analysis (EDA)** is the process of examining a dataset to:
- Understand its **structure and content**
- Identify **patterns, trends, or relationships**
- Detect **missing values, outliers, or errors**
- Summarize the main characteristics of the data

EDA is the **first step** before applying any machine learning model, because it helps us decide:
- Which preprocessing techniques are needed
- Which features are useful or irrelevant
- How the data should be split, scaled, or transformed

---

## Steps in EDA

1. **Dataset Overview**
   - Rows, columns, data types
   - Missing values check
   - Summary statistics

2. **Univariate Analysis (single variable)**
   - Distribution of numerical columns (mean, median, std, quantiles)
   - Frequency of categorical columns (value counts)

3. **Bivariate Analysis (two variables)**
   - Relationship between numerical variables (correlation)
   - Compare numerical values across categories (groupby)

4. **Multivariate Analysis**
   - Studying more than two variables together
   - Identifying interaction effects between features

5. **Outlier & Anomaly Detection**
   - Identifying values that do not follow the expected pattern
   - Checking quartiles and IQR (Interquartile Range)

---

## EDA Without vs With Visualization

- **Without Visualization (this experiment):**
  - Focus on numerical/statistical summaries using Pandas functions
  - No graphs or plots

- **With Visualization (Experiment 10):**
  - Use charts (histogram, scatterplot, boxplot, heatmap) to visually understand data

---

➡️ In this experiment, we will perform **EDA using Pandas functions only**, step by step, on the dataset imported in Experiment 4.

# Section A – Data Overview

In this section, we will perform the **basic exploration** of our dataset using Pandas.  
This step helps us understand the **size, structure, and summary** of the dataset.  

We will cover the following methods:  
1. `.head()` → View first 5 rows  
2. `.tail()` → View last 5 rows  
3. `.shape` → Number of rows and columns  
4. `.info()` → Data types, non-null counts, memory usage  
5. `.describe()` → Summary statistics of numerical columns  
6. `.dtypes` → Data types of each column  
7. `.isnull().sum()` → Missing values per column

In [1]:
import pandas as pd

# Import Titanic dataset from seaborn's GitHub repo
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv"
df = pd.read_csv(url)

bold = "\033[1m"
reset = "\033[0m"

print(f"{bold}Dataset Imported Successfully:{reset}")
print(df.head(), end="\n\n")

Dataset Imported Successfully:
   survived  pclass     sex   age  sibsp  parch     fare embarked  class  \
0         0       3    male  22.0      1      0   7.2500        S  Third   
1         1       1  female  38.0      1      0  71.2833        C  First   
2         1       3  female  26.0      0      0   7.9250        S  Third   
3         1       1  female  35.0      1      0  53.1000        S  First   
4         0       3    male  35.0      0      0   8.0500        S  Third   

     who  adult_male deck  embark_town alive  alone  
0    man        True  NaN  Southampton    no  False  
1  woman       False    C    Cherbourg   yes  False  
2  woman       False  NaN  Southampton   yes   True  
3  woman       False    C  Southampton   yes  False  
4    man        True  NaN  Southampton    no   True  



## 1. First 5 Rows – `.head()`

The `.head()` method shows the **first 5 rows** of the dataset by default; pass a number, e.g. `df.head(10)`, to see more.  
This helps us quickly inspect the structure and verify the data was loaded correctly.

In [3]:
print(f"{bold}First 5 Rows of Dataset:{reset}")
print(df.head(), end="\n\n")

First 5 Rows of Dataset:
   survived  pclass     sex   age  sibsp  parch     fare embarked  class  \
0         0       3    male  22.0      1      0   7.2500        S  Third   
1         1       1  female  38.0      1      0  71.2833        C  First   
2         1       3  female  26.0      0      0   7.9250        S  Third   
3         1       1  female  35.0      1      0  53.1000        S  First   
4         0       3    male  35.0      0      0   8.0500        S  Third   

     who  adult_male deck  embark_town alive  alone  
0    man        True  NaN  Southampton    no  False  
1  woman       False    C    Cherbourg   yes  False  
2  woman       False  NaN  Southampton   yes   True  
3  woman       False    C  Southampton   yes  False  
4    man        True  NaN  Southampton    no   True  



## 2. Last Five Rows
The `.tail()` method displays the last 5 rows of the dataset.  
This helps us check the ending portion of the data and verify ordering.

In [6]:
print(f"{bold}Last 5 Rows of Dataset:{reset}")
print(df.tail(), end="\n\n")

Last 5 Rows of Dataset:
     survived  pclass     sex   age  sibsp  parch   fare embarked   class  \
886         0       2    male  27.0      0      0  13.00        S  Second   
887         1       1  female  19.0      0      0  30.00        S   First   
888         0       3  female   NaN      1      2  23.45        S   Third   
889         1       1    male  26.0      0      0  30.00        C   First   
890         0       3    male  32.0      0      0   7.75        Q   Third   

       who  adult_male deck  embark_town alive  alone  
886    man        True  NaN  Southampton    no   True  
887  woman       False    B  Southampton   yes   True  
888  woman       False  NaN  Southampton    no  False  
889    man        True    C    Cherbourg   yes   True  
890    man        True  NaN   Queenstown    no   True  



## 3. Shape of the Dataset
The `.shape` attribute gives the number of rows and columns in the dataset.  
It is displayed as a tuple: `(rows, columns)`.  
The `.size` attribute gives the total number of cells (rows × columns).

In [7]:
print(f"{bold}Shape of Dataset (rows, columns):{reset}")
print(df.shape, end="\n\n")
print(f"{bold}Size of Dataset (rows*columns):{reset}")
print(df.size, end="\n\n")

Shape of Dataset (rows, columns):
(891, 15)

Size of Dataset (rows*columns):
13365



## 4. Dataset Information
The `.info()` method provides:
- Column names
- Data types
- Non-null counts
- Memory usage

In [None]:
print(f"{bold}Dataset Info:{reset}")
print(end='\n\n')
df.info()

Dataset Info:


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   survived     891 non-null    int64  
 1   pclass       891 non-null    int64  
 2   sex          891 non-null    object 
 3   age          714 non-null    float64
 4   sibsp        891 non-null    int64  
 5   parch        891 non-null    int64  
 6   fare         891 non-null    float64
 7   embarked     889 non-null    object 
 8   class        891 non-null    object 
 9   who          891 non-null    object 
 10  adult_male   891 non-null    bool   
 11  deck         203 non-null    object 
 12  embark_town  889 non-null    object 
 13  alive        891 non-null    object 
 14  alone        891 non-null    bool   
dtypes: bool(2), float64(2), int64(4), object(7)
memory usage: 92.4+ KB
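
The `+` in `92.4+ KB` signals a lower bound: for `object` columns pandas counts only the references, not the Python strings they point to. A minimal sketch of how `memory_usage(deep=True)` asks for the fuller estimate, using a small hypothetical frame (`toy` is an illustration, not the experiment's dataset):

```python
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data
toy = pd.DataFrame({
    "age": [22.0, 38.0, None],
    "sex": ["male", "female", "female"],
})

shallow = toy.memory_usage().sum()          # string columns may be counted as references only
deep = toy.memory_usage(deep=True).sum()    # also inspects the stored string objects

print(f"shallow: {shallow} bytes, deep: {deep} bytes")
```

On the Titanic frame, `df.info(memory_usage="deep")` should report the deeper estimate in the same summary. The exact numbers depend on the pandas version and on how strings are stored.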


## 5. Summary Statistics
The `.describe()` method provides summary statistics for numerical columns:
- Count, mean, std (standard deviation)
- Min, 25%, 50%, 75%, max

In [None]:
print(f"{bold}Summary Statistics of Numerical Columns:{reset}")
print(df.describe(), end="\n\n")

Summary Statistics of Numerical Columns:
         survived      pclass         age       sibsp       parch        fare
count  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean     0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std      0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min      0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%      0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%      0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%      1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max      1.000000    3.000000   80.000000    8.000000    6.000000  512.329200



## 6. Data Types of Columns
The `.dtypes` attribute tells us the type of data in each column.  
Examples: `int64`, `float64`, `object` (strings), `category`, `bool`.

In [None]:
print(f"{bold}Data Types of Columns:{reset}")
print(df.dtypes, end="\n\n")

Data Types of Columns:
survived         int64
pclass           int64
sex             object
age            float64
sibsp            int64
parch            int64
fare           float64
embarked        object
class           object
who             object
adult_male        bool
deck            object
embark_town     object
alive           object
alone             bool
dtype: object
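
Because the CSV is read as plain text, columns such as `class` arrive as `object` rather than `category`. The sketch below shows one way to select columns by type and to convert a text column into an ordered categorical. The frame `toy` and the category order are assumptions made for illustration:

```python
import pandas as pd

# Hypothetical toy frame with the same kinds of columns as the Titanic data
toy = pd.DataFrame({
    "age": [22.0, 38.0, 26.0],
    "pclass": [3, 1, 3],
    "class": ["Third", "First", "Third"],
})

# Pick out the numerical columns by dtype
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Convert the text column to an ordered categorical (order assumed here)
class_dtype = pd.CategoricalDtype(["First", "Second", "Third"], ordered=True)
toy["class"] = toy["class"].astype(class_dtype)

print(numeric_cols)
print(toy.dtypes)
```

An ordered categorical keeps `First < Second < Third` when sorting or grouping, instead of relying on alphabetical order.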



## 7. Missing Values
Chaining `.isnull()` with `.sum()` counts the number of missing values in each column.  
This helps us detect data quality issues.

In [12]:
print(f"{bold}Missing Values per Column:{reset}")
print(df.isnull().sum(), end="\n\n")

Missing Values per Column:
survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64
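
Raw counts are easier to judge as a share of all rows. In the output above, `age` is missing in 177 of 891 rows (about 19.9%) and `deck` in 688 of 891 (about 77.2%). A minimal sketch of the same calculation on a small hypothetical frame (`toy` is illustrative only):

```python
import pandas as pd

# Hypothetical toy frame with some missing entries
toy = pd.DataFrame({
    "age": [22.0, None, 26.0, None],
    "deck": [None, "C", None, None],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

# The mean of a boolean mask is the fraction of True values
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```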



# Section B: Univariate Analysis

Univariate analysis means analyzing **one variable at a time**.  
We will explore both numerical and categorical columns.

Steps:
1. Mean of a numerical column  
2. Median of a numerical column  
3. Mode of a numerical column  
4. Standard deviation  
5. Variance  
6. Quantiles (25%, 50%, 75%)  
7. Value counts of a categorical column  
8. Unique values of a categorical column

In [13]:
bold = "\033[1m"
reset = "\033[0m"

# Just checking dataset columns for reference
print(f"{bold}Columns in Titanic Dataset:{reset}")
print(df.columns, end="\n\n")

Columns in Titanic Dataset:
Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
       'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town',
       'alive', 'alone'],
      dtype='object')



## 1. Mean
The `.mean()` method calculates the average of a numerical column, skipping missing values by default.  
We will calculate the mean age of passengers (based on the 714 non-null ages).

In [14]:
print(f"{bold}Mean Age of Passengers:{reset}")
print(df["age"].mean(), end="\n\n")

Mean Age of Passengers:
29.69911764705882



## 2. Median
The `.median()` function returns the middle value when data is sorted.  
We will calculate the median passenger age.

In [None]:
print(f"{bold}Median Age of Passengers:{reset}")
print(df["age"].median(), end="\n\n")

Median Age of Passengers:
28.0



## 3. Mode
The `.mode()` function returns the most frequent value(s).  
We will find the most common age in the dataset.

In [15]:
print(f"{bold}Mode of Passenger Age:{reset}")
print(df["age"].mode(), end="\n\n")

Mode of Passenger Age:
0    24.0
Name: age, dtype: float64
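
The result above is a Series rather than a single number because several values can tie for most frequent. A small sketch with made-up ages (`ages` is a hypothetical example) shows what a tie looks like:

```python
import pandas as pd

# Hypothetical ages where 24.0 and 30.0 both occur twice
ages = pd.Series([24.0, 24.0, 30.0, 30.0, 41.0])

modes = ages.mode()          # all tied values, sorted ascending
first_mode = modes.iloc[0]   # take the first one when a single value is needed

print(modes.tolist())
print(first_mode)
```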



## 4. Standard Deviation
The `.std()` function measures how much values deviate from the mean.  
A higher std means more variation in the data.

In [17]:
print(f"{bold}Standard Deviation of Age:{reset}")
print(df["age"].std(), end="\n\n")

Standard Deviation of Age:
14.526497332334044



## 5. Variance
The `.var()` function measures the spread of the data.  
It is the square of the standard deviation.

In [18]:
print(f"{bold}Variance of Age:{reset}")
print(df["age"].var(), end="\n\n")

Variance of Age:
211.0191247463081
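
By default pandas computes the *sample* variance and standard deviation (dividing by n − 1, i.e. `ddof=1`). The sketch below, on a made-up series (`values` is hypothetical), checks that the variance is the square of the standard deviation and shows the population version for comparison:

```python
import pandas as pd

# Hypothetical values with mean 5 and squared deviations summing to 32
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_var = values.var()             # ddof=1: divide by n - 1
population_var = values.var(ddof=0)   # divide by n
sample_std = values.std()             # ddof=1 as well

print(sample_var, population_var, sample_std ** 2)
```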



## 6. Quantiles
The `.quantile()` method returns the value below which a given fraction of the data falls.  
We will calculate the 25%, 50%, and 75% quantiles of Age.

In [None]:
print(f"{bold}Age Quantiles (25%, 50%, 75%):{reset}")
print(df["age"].quantile([0.25, 0.5, 0.75]), end="\n\n")

Age Quantiles (25%, 50%, 75%):
0.25    20.125
0.50    28.000
0.75    38.000
Name: age, dtype: float64
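
When the requested quantile falls between two data points, pandas interpolates linearly by default, so a quantile such as 20.125 need not be a value that occurs in the column. A minimal sketch on a made-up series (`values` is hypothetical):

```python
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0])

# Position 0.25 * (4 - 1) = 0.75 lies between the first two values
q25 = values.quantile(0.25)
q50 = values.quantile(0.50)   # the 50% quantile is the median

print(q25, q50, values.median())
```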



## 7 & 8. Value Counts and Unique Values
- `.value_counts()` → frequency of each category  
- `.unique()` → unique categories in a column  

We will apply these on the **class** column.

In [20]:
print(f"{bold}Value Counts of Class Column:{reset}")
print(df["class"].value_counts(), end="\n\n")

print(f"{bold}Unique Values in Class Column:{reset}")
print(df["class"].unique(), end="\n\n")

Value Counts of Class Column:
class
Third     491
First     216
Second    184
Name: count, dtype: int64

Unique Values in Class Column:
['Third' 'First' 'Second']
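
Two variations are often useful alongside plain counts: proportions via `normalize=True`, and the number of distinct categories via `.nunique()`. The sketch below uses a small hypothetical series (`classes`) that includes one missing entry:

```python
import pandas as pd

# Hypothetical class labels with one missing value
classes = pd.Series(["Third", "First", "Third", "Second", "Third", None])

shares = classes.value_counts(normalize=True)       # proportions, missing values excluded
with_missing = classes.value_counts(dropna=False)   # counts, missing values included
n_unique = classes.nunique()                        # distinct non-missing categories

print(shares)
print(with_missing)
print(n_unique)
```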



# Section C: Bivariate Analysis

Bivariate analysis means exploring the relationship between **two variables**.  
In this section, we will cover:

1. Correlation between numerical variables  
2. Correlation with a specific column (survived)  
3. Groupby with mean  
4. Groupby with multiple aggregations  
5. Cross-tabulation between categorical variables

## 1. Correlation Between Numerical Variables
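The `.corr()` method computes pairwise correlation (Pearson by default) between numerical columns.  
With `numeric_only=True`, boolean columns such as `adult_male` and `alone` are included as well.  
Correlation values range from **-1 to +1**:
- `+1` → perfect positive linear relationship  
- `-1` → perfect negative linear relationship  
- `0` → no linear relationship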

In [None]:
bold = "\033[1m"
reset = "\033[0m"

print(f"{bold}Correlation Matrix of Numerical Columns:{reset}")
print(df.corr(numeric_only=True), end="\n\n")

Correlation Matrix of Numerical Columns:
            survived    pclass       age     sibsp     parch      fare  \
survived    1.000000 -0.338481 -0.077221 -0.035322  0.081629  0.257307   
pclass     -0.338481  1.000000 -0.369226  0.083081  0.018443 -0.549500   
age        -0.077221 -0.369226  1.000000 -0.308247 -0.189119  0.096067   
sibsp      -0.035322  0.083081 -0.308247  1.000000  0.414838  0.159651   
parch       0.081629  0.018443 -0.189119  0.414838  1.000000  0.216225   
fare        0.257307 -0.549500  0.096067  0.159651  0.216225  1.000000   
adult_male -0.557080  0.094035  0.280328 -0.253586 -0.349943 -0.182024   
alone      -0.203367  0.135207  0.198270 -0.584471 -0.583398 -0.271832   

            adult_male     alone  
survived     -0.557080 -0.203367  
pclass        0.094035  0.135207  
age           0.280328  0.198270  
sibsp        -0.253586 -0.584471  
parch        -0.349943 -0.583398  
fare         -0.182024 -0.271832  
adult_male    1.000000  0.404744  
alone         0.404744  1.000000  
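
A correlation near 0 only rules out a *linear* relationship. The sketch below uses made-up columns (`demo` is hypothetical) to show the three reference cases, including a variable that is fully determined by `x` and still has zero correlation with it:

```python
import pandas as pd

x = pd.Series([-2.0, -1.0, 0.0, 1.0, 2.0])

demo = pd.DataFrame({
    "x": x,
    "linear": 3 * x + 1,   # perfect positive linear relationship
    "inverse": -2 * x,     # perfect negative linear relationship
    "squared": x ** 2,     # depends on x, but not linearly
})

corr_with_x = demo.corr()["x"]
print(corr_with_x)
```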

## 2. Correlation with a Specific Column
We can also check correlation of all numerical columns with one target-like variable.  
Here, we will see which features are most related to **survival**.

In [None]:
print(f"{bold}Correlation of Numerical Columns with Survived:{reset}")
print(df.corr(numeric_only=True)["survived"].sort_values(ascending=False), end="\n\n")

Correlation of Numerical Columns with Survived:
survived      1.000000
fare          0.257307
parch         0.081629
sibsp        -0.035322
age          -0.077221
alone        -0.203367
pclass       -0.338481
adult_male   -0.557080
Name: survived, dtype: float64



## 3. Groupby with Mean
The `.groupby()` method allows grouping the dataset by categories.  
We will find the **average fare paid by each passenger class**.

In [30]:
print(f"{bold}Average Fare by Passenger Class:{reset}")
print(df.groupby("class")["fare"].mean(), end="\n\n")

Average Fare by Passenger Class:
class
First     84.154687
Second    20.662183
Third     13.675550
Name: fare, dtype: float64



## 4. Groupby with Multiple Aggregations
We can apply multiple functions on grouped data using `.agg()`.  
Here, we calculate **mean, min, and max of Age and Fare grouped by Sex**.

In [None]:
print(f"{bold}Aggregated Stats of Age and Fare by Sex:{reset}")
print(df.groupby("sex")[["age", "fare"]].agg(["mean", "min", "max"]), end="\n\n")

Aggregated Stats of Age and Fare by Sex:
              age                   fare                
             mean   min   max       mean   min       max
sex                                                     
female  27.915709  0.75  63.0  44.479818  6.75  512.3292
male    30.726645  0.42  80.0  25.523893  0.00  512.3292
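
The two-level column header above can be awkward to work with. One alternative is named aggregation, which yields flat, self-describing column names. This is a sketch on a hypothetical frame (`toy`, with made-up values), not the notebook's own approach:

```python
import pandas as pd

# Hypothetical toy frame
toy = pd.DataFrame({
    "sex": ["female", "male", "female", "male"],
    "age": [38.0, 22.0, 26.0, 35.0],
    "fare": [71.28, 7.25, 7.92, 8.05],
})

# Each keyword is an output column: (input column, aggregation)
summary = toy.groupby("sex").agg(
    mean_age=("age", "mean"),
    max_fare=("fare", "max"),
)
print(summary)
```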



## 5. Cross-tabulation (Frequency Table)
`pd.crosstab()` creates a frequency table for two categorical variables.  
We will check the relationship between **Sex and Survival**.

In [40]:
print(f"{bold}Cross-tabulation of Sex vs Survived:{reset}")
print(pd.crosstab(df["sex"], df["survived"]), end="\n\n")

Cross-tabulation of Sex vs Survived:
survived    0    1
sex               
female     81  233
male      468  109



# Section D: Multivariate Analysis & Outlier Detection

In this section, we will:
1. Use `.describe(include="all")` to summarize both numerical and categorical columns  
2. Revisit correlation matrix as a multivariate overview  
3. Detect outliers using the Interquartile Range (IQR) method  
4. Identify rows with outliers in Fare

## 1. Describe with include="all"
The `.describe(include="all")` method gives summary statistics for both:
- Numerical columns (count, mean, std, min, quartiles, max)  
- Categorical columns (count, unique, top, freq)  

Statistics that do not apply to a column's type appear as `NaN`.

In [41]:
bold = "\033[1m"
reset = "\033[0m"

print(f"{bold}Full Summary of Dataset (Numerical + Categorical):{reset}")
print(df.describe(include="all"), end="\n\n")

Full Summary of Dataset (Numerical + Categorical):
          survived      pclass   sex         age       sibsp       parch  \
count   891.000000  891.000000   891  714.000000  891.000000  891.000000   
unique         NaN         NaN     2         NaN         NaN         NaN   
top            NaN         NaN  male         NaN         NaN         NaN   
freq           NaN         NaN   577         NaN         NaN         NaN   
mean      0.383838    2.308642   NaN   29.699118    0.523008    0.381594   
std       0.486592    0.836071   NaN   14.526497    1.102743    0.806057   
min       0.000000    1.000000   NaN    0.420000    0.000000    0.000000   
25%       0.000000    2.000000   NaN   20.125000    0.000000    0.000000   
50%       0.000000    3.000000   NaN   28.000000    0.000000    0.000000   
75%       1.000000    3.000000   NaN   38.000000    1.000000    0.000000   
max       1.000000    3.000000   NaN   80.000000    8.000000    6.000000   

              fare embarked 

## 2. Multivariate Correlation Matrix
The `.corr()` matrix can also serve as a **multivariate overview**:  
each entry is still a pairwise measure, but the matrix shows every pair of numerical variables at once.

In [42]:
print(f"{bold}Multivariate Correlation Matrix:{reset}")
print(df.corr(numeric_only=True), end="\n\n")

Multivariate Correlation Matrix:
            survived    pclass       age     sibsp     parch      fare  \
survived    1.000000 -0.338481 -0.077221 -0.035322  0.081629  0.257307   
pclass     -0.338481  1.000000 -0.369226  0.083081  0.018443 -0.549500   
age        -0.077221 -0.369226  1.000000 -0.308247 -0.189119  0.096067   
sibsp      -0.035322  0.083081 -0.308247  1.000000  0.414838  0.159651   
parch       0.081629  0.018443 -0.189119  0.414838  1.000000  0.216225   
fare        0.257307 -0.549500  0.096067  0.159651  0.216225  1.000000   
adult_male -0.557080  0.094035  0.280328 -0.253586 -0.349943 -0.182024   
alone      -0.203367  0.135207  0.198270 -0.584471 -0.583398 -0.271832   

            adult_male     alone  
survived     -0.557080 -0.203367  
pclass        0.094035  0.135207  
age           0.280328  0.198270  
sibsp        -0.253586 -0.584471  
parch        -0.349943 -0.583398  
fare         -0.182024 -0.271832  
adult_male    1.000000  0.404744  
alone         0.404744  1.000000  
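
To look at three variables together rather than pair by pair, one option is a pivot table with one variable on the rows, one on the columns, and a third as the aggregated value. The sketch below uses a small hypothetical frame (`toy`, with made-up rows):

```python
import pandas as pd

# Hypothetical toy frame
toy = pd.DataFrame({
    "sex": ["female", "female", "male", "male", "female", "male"],
    "class": ["First", "Third", "First", "Third", "Third", "Third"],
    "survived": [1, 1, 0, 0, 0, 1],
})

# Mean of `survived` (the survival rate) for each sex/class combination
rate_table = toy.pivot_table(
    index="sex", columns="class", values="survived", aggfunc="mean"
)
print(rate_table)
```

Reading across a row shows how the rate changes with class for one sex, and comparing the rows shows whether that pattern differs between the sexes, which is the kind of interaction effect mentioned in the EDA steps.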

## 3. Detecting Outliers with IQR
Steps:
1. Find Q1 (25%) and Q3 (75%)  
2. Compute IQR = Q3 − Q1  
3. Define outlier boundaries:  
   - Lower bound = Q1 − 1.5 × IQR  
   - Upper bound = Q3 + 1.5 × IQR  
4. Values outside these bounds are potential outliers

In [45]:
Q1 = df["fare"].quantile(0.25)
Q3 = df["fare"].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print(f"{bold}Fare Outlier Detection Bounds:{reset}")
print(f"Lower Bound: {lower_bound}, Upper Bound: {upper_bound}", end="\n\n")

Fare Outlier Detection Bounds:
Lower Bound: -26.724, Upper Bound: 65.6344



## 4. Identify Rows with Outliers
We will filter rows in the **Fare** column that lie outside  
the IQR bounds and treat them as potential outliers.

In [None]:
outliers_fare = df[(df["fare"] < lower_bound) | (df["fare"] > upper_bound)]

print(f"{bold}Outlier Rows in Fare:{reset}")
print(outliers_fare[["sex", "age", "fare", "class"]], end="\n\n")

Outlier Rows in Fare:
        sex   age      fare  class
1    female  38.0   71.2833  First
27     male  19.0  263.0000  First
31   female   NaN  146.5208  First
34     male  28.0   82.1708  First
52   female  49.0   76.7292  First
..      ...   ...       ...    ...
846    male   NaN   69.5500  Third
849  female   NaN   89.1042  First
856  female  45.0  164.8667  First
863  female   NaN   69.5500  Third
879  female  56.0   83.1583  First

[116 rows x 4 columns]
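
The same IQR rule can be applied to any numerical column by wrapping it in a small helper. The sketch below is one possible way to do this. The function `iqr_bounds` and the frame `toy` are hypothetical names introduced for illustration:

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical toy frame: one fare is far above the rest
toy = pd.DataFrame({
    "age": [20.0, 22.0, 24.0, 26.0, 28.0, 30.0, 32.0, 34.0],
    "fare": [7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 100.0],
})

# Count values outside the fences for each column
outlier_counts = {}
for col in ["age", "fare"]:
    lower, upper = iqr_bounds(toy[col])
    mask = (toy[col] < lower) | (toy[col] > upper)
    outlier_counts[col] = int(mask.sum())

print(outlier_counts)
```

Comparisons with `NaN` evaluate to `False`, so rows with a missing value are never flagged by this mask.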



# Section E: Mini Case Study

Now let’s apply what we learned in a practical way.  
We will answer some analytical questions using the Titanic dataset:

1. Which passenger class paid the highest average fare?  
2. Which sex had the highest survival rate?  
3. Which passenger class had the youngest average age?  
4. Compare mean fare of survivors vs non-survivors.

## 1. Highest Average Fare by Passenger Class
We group by `class` and calculate the mean of `fare`.

In [46]:
bold = "\033[1m"
reset = "\033[0m"

avg_fare_by_class = df.groupby("class")["fare"].mean().sort_values(ascending=False)

print(f"{bold}Average Fare by Passenger Class:{reset}")
print(avg_fare_by_class, end="\n\n")

print(f"{bold}Passenger class with highest average fare:{reset} {avg_fare_by_class.index[0]}")

Average Fare by Passenger Class:
class
First     84.154687
Second    20.662183
Third     13.675550
Name: fare, dtype: float64

Passenger class with highest average fare: First


## 2. Highest Survival Rate by Sex
We group by `sex` and calculate the mean of `survived`  
(since `1 = survived`, the mean gives the survival rate).

In [None]:
survival_rate_by_sex = df.groupby("sex")["survived"].mean().sort_values(ascending=False)

print(f"{bold}Survival Rate by Sex:{reset}")
print(survival_rate_by_sex, end="\n\n")

print(f"{bold}Sex with highest survival rate:{reset} {survival_rate_by_sex.index[0]}")

Survival Rate by Sex:
sex
female    0.742038
male      0.188908
Name: survived, dtype: float64

Sex with highest survival rate: female


## 3. Youngest Average Age by Passenger Class
We group by `class` and calculate the mean of `age`.  
The class with the lowest mean age is the youngest.

In [None]:
avg_age_by_class = df.groupby("class")["age"].mean().sort_values()

print(f"{bold}Average Age by Passenger Class:{reset}")
print(avg_age_by_class, end="\n\n")

print(f"{bold}Passenger class with youngest average age:{reset} {avg_age_by_class.index[0]}")

Average Age by Passenger Class:
class
Third     25.140620
Second    29.877630
First     38.233441
Name: age, dtype: float64

Passenger class with youngest average age: Third


## 4. Mean Fare of Survivors vs Non-Survivors
We group by `survived` and compare the mean fare.  
This helps us check if fare amount was related to survival.

In [None]:
fare_by_survival = df.groupby("survived")["fare"].mean()

print(f"{bold}Mean Fare by Survival Status:{reset}")
print(fare_by_survival, end="\n\n")

print(f"{bold}Interpretation:{reset} Survivors generally paid higher fares than non-survivors.")

Mean Fare by Survival Status:
survived
0    22.117887
1    48.395408
Name: fare, dtype: float64

Interpretation: Survivors generally paid higher fares than non-survivors.


# Section F: Wrap-up & Key Takeaways

✅ In this experiment, we performed **Exploratory Data Analysis (EDA)** on the Titanic dataset.  

### What we covered:
1. **Section A – Data Overview**  
   - Head, tail, shape, info, describe, dtypes, missing values  

2. **Section B – Univariate Analysis**  
   - Mean, median, mode, std, variance, quantiles, value_counts, unique  

3. **Section C – Bivariate Analysis**  
   - Correlation, groupby mean, multiple aggregations, cross-tab  

4. **Section D – Multivariate & Outlier Detection**  
   - Full describe, correlation matrix, IQR method for outliers  

5. **Section E – Mini Case Study**  
   - Solved real questions with groupby and aggregation  

---

### Key Takeaways:
- **EDA is the foundation** for any data science project.  
- It helps detect **data quality issues** (missing values, outliers).  
- It reveals **patterns and relationships** that guide preprocessing & modeling.  
- Visualization (covered in Experiment 10) complements these statistical methods.  

➡️ With this, we are ready to move from **EDA** to **Data Preprocessing & Visualization** in the upcoming experiments.