## Visualization settings
### `sns.set(style="whitegrid", palette="pastel")`

This configures Seaborn's default aesthetics (in Seaborn 0.11 and later, `sns.set` is an alias for `sns.set_theme`):

-   `style="whitegrid"`: Uses a white background with gridlines, which makes values easier to read off in bar plots and box plots.
    
-   `palette="pastel"`: Uses Seaborn's soft, light `pastel` color palette for plot elements. It's visually gentle and works well for presentations or dashboards.
    

### 📐 `plt.rcParams["figure.figsize"] = (8, 5)`

This sets the default size of your plots using Matplotlib:

-   `figure.figsize = (8, 5)`: Every new figure defaults to 8 inches wide and 5 inches tall unless you override it for a specific plot. This keeps your visualizations consistent.
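As a minimal sketch of how the default and a per-figure override interact (the `Agg` backend line is only an assumption so the snippet runs without a display, and the variable names are illustrative):

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs headless
import matplotlib.pyplot as plt

# Default (width, height) in inches for figures created after this line.
plt.rcParams["figure.figsize"] = (8, 5)

fig_default, ax_default = plt.subplots()            # picks up the 8 x 5 default
fig_wide, ax_wide = plt.subplots(figsize=(12, 4))   # per-figure override

default_size = tuple(fig_default.get_size_inches())
wide_size = tuple(fig_wide.get_size_inches())
print(default_size)
print(wide_size)

plt.close("all")  # release the figures
```

Setting the default once and overriding only where a plot needs a different aspect ratio keeps most plotting calls free of sizing arguments.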


### 🧮 `print("Shape of dataset:", df.shape)`

-   This prints the **dimensions** of your DataFrame.
    
-   `df.shape` returns a tuple: `(number_of_rows, number_of_columns)`
    
-   Example output:
    
    ```
    Shape of dataset: (7043, 21)
    ```
    
    This tells you there are 7,043 rows and 21 columns in your dataset.
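Because `df.shape` is a plain tuple, it can be unpacked directly. A small sketch, using a made-up three-row DataFrame as a stand-in for the real dataset (the column names are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for the real dataset.
df = pd.DataFrame({
    "customer_id": ["A1", "B2", "C3"],
    "tenure": [1, 34, 2],
    "monthly_charges": [29.85, 56.95, 53.85],
})

n_rows, n_cols = df.shape  # tuple unpacking: (rows, columns)
print("Shape of dataset:", df.shape)
print(f"{n_rows} rows, {n_cols} columns")
```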
    

### 👀 `df.head()`

-   This displays the **first 5 rows** of your DataFrame by default.
    
-   It’s a quick way to preview your data and check if it loaded correctly.
    
-   You’ll see column names and sample values, which helps you:
    
    -   Spot data types (numeric, categorical, etc.)
        
    -   Identify missing or inconsistent values
        
    -   Understand the structure of your features
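A short sketch of the preview workflow, assuming a small made-up DataFrame. `head()` only shows sample values, so pairing it with `df.dtypes` is one way to confirm the actual data types rather than inferring them by eye:

```python
import pandas as pd

# Hypothetical data: ten rows so the default of five is visible.
df = pd.DataFrame({
    "tenure": range(1, 11),
    "contract": ["Month-to-month", "One year"] * 5,
})

first_five = df.head()     # default: n=5
first_three = df.head(3)   # explicit row count
last_two = df.tail(2)      # counterpart: last rows

print(first_three)
print(df.dtypes)           # exact dtype of each column
```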


### 🧾 `df.info()` — DataFrame Summary at a Glance

This method gives you a concise summary of your DataFrame, including:

|     Feature    |                                Description                                |
|:--------------:|:-------------------------------------------------------------------------:|
| Index Range    | Shows the number of rows and the index type (e.g., `RangeIndex: 7043 entries, 0 to 7042`) |
| Column Names   | Lists all column names                                                    |
| Non-Null Count | Tells you how many non-missing values are in each column                  |
| Data Types     | Shows the data type of each column (e.g., int64, float64, object)         |
| Memory Usage   | Displays how much memory the DataFrame consumes                           |
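A minimal sketch of reading this summary, assuming a tiny made-up DataFrame with one missing value per column. `df.info()` prints its report and returns `None`, so the `buf` parameter is used here to capture the text, and `df.isna().sum()` gives the missing counts as numbers:

```python
import io
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({
    "tenure": [1, 34, None, 45],
    "contract": ["Month-to-month", None, "One year", "Two year"],
})

df.info()  # prints the summary to stdout

# Capture the same report as a string, e.g. for logging.
buffer = io.StringIO()
df.info(buf=buffer)
info_text = buffer.getvalue()

# Numeric complement to the Non-Null Count column.
missing_per_column = df.isna().sum()
print(missing_per_column)
```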

### 🧠 Why It’s Useful 

When you work with ML pipelines and secure data flows:

-   You can **quickly spot missing values** that need imputation.
    
-   You'll know which columns are **categorical (`object`) vs. numeric**, which is crucial for preprocessing.
    
-   It reports **memory usage**, so you can see where converting data types would save memory. This is especially important in embedded or resource-constrained environments.
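One common way to act on the memory figure is to convert repetitive text columns to the `category` dtype. The sketch below assumes a made-up `churn` column with only two distinct values; the actual savings depend on the data and the pandas version:

```python
import pandas as pd

# Hypothetical column: 1,000 rows but only two distinct values.
df = pd.DataFrame({"churn": ["Yes", "No"] * 500})

# deep=True counts the string contents, not just the references to them.
before = df.memory_usage(deep=True).sum()

df["churn"] = df["churn"].astype("category")
after = df.memory_usage(deep=True).sum()

print(f"before: {before} bytes, after: {after} bytes")
```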


### 📊 `df.describe(include="all")`

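This generates summary statistics for **all columns**, regardless of data type:

-   For **numerical columns**: it shows `count`, `mean`, `std`, `min`, the 25th, 50th, and 75th percentiles, and `max`.
    
-   For **categorical columns**: it shows `count`, `unique` (the number of distinct values), `top` (the most frequent value), and `freq` (how often that value occurs).
    
-   For **datetime columns**: it shows `count`, `min` (earliest), and `max` (latest); recent pandas versions also report the mean and percentiles.
    

By default, `describe()` summarizes only the numeric columns of a mixed DataFrame. Using `include="all"` ensures that non-numeric columns are included too; statistics that don't apply to a column appear as `NaN`.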
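To make the difference concrete, here is a small sketch on a made-up DataFrame with one numeric and one text column (the column names are hypothetical):

```python
import pandas as pd

# Hypothetical data: one numeric and one categorical (text) column.
df = pd.DataFrame({
    "tenure": [1, 34, 2, 45],
    "contract": ["Month-to-month", "One year", "Month-to-month", "Two year"],
})

numeric_only = df.describe()              # default: numeric columns only
everything = df.describe(include="all")   # numeric and non-numeric columns

print(numeric_only.columns.tolist())
print(everything)
```

In `everything`, the `contract` column has values for `unique`, `top`, and `freq` but `NaN` for `mean`, while `tenure` is the other way around.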

### 🔄 `.T`

This is the **transpose** accessor, shorthand for `.transpose()`. It flips the rows and columns:

-   Normally, `describe()` puts each statistic in its own row and each DataFrame column in its own column.
    
-   Transposing makes each **column** in your original DataFrame become a **row** in the output, with its stats laid out horizontally.
    

This is especially useful when you have many columns, because a long table is easier to scan than a wide one.
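A sketch of the transposed summary on made-up data. The derived `missing` column is not part of `describe()`; it is an illustrative addition that subtracts each `count` from the total number of rows:

```python
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({
    "tenure": [1, 34, None, 45],
    "contract": ["Month-to-month", "One year", "Month-to-month", None],
})

# One row per original column, one column per statistic.
summary = df.describe(include="all").T

# Illustrative extra column: rows minus non-missing count.
summary["missing"] = len(df) - summary["count"]

print(summary[["count", "missing"]])
```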

### 🧠 Why You’d Use This

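For ML workflows and data preprocessing, `df.describe(include="all").T` gives you a **quick overview of all features**, helping you spot:

-   Missing values (a `count` lower than the number of rows)
    
-   Skewed distributions (a `mean` far from the median in the `50%` column)
    
-   Dominant categories (`top` and `freq`)
    
-   Outliers (a `min` or `max` far from the quartiles)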


### 🧼 Column Name Cleaning Breakdown



```
# Parentheses let the method chain span several lines.
df.columns = (
    df.columns
    .str.strip()                          # Removes leading/trailing whitespace
    .str.lower()                          # Converts all column names to lowercase
    .str.replace(" ", "_", regex=False)   # Replaces spaces with underscores
)
```

### ✅ Why This Is Useful

-   **Consistency**: Makes column names predictable and easier to reference in code.
    
-   **Avoids Errors**: Prevents issues when accessing columns with spaces or mixed casing.
    
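-   **Cleaner Code**: You can now do things like `df.total_charges` instead of `df['Total Charges']`. Attribute access only works when the name is a valid Python identifier and does not clash with an existing DataFrame attribute or method.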
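A before-and-after sketch of the cleaning chain, assuming a made-up DataFrame whose column names contain stray spaces and mixed casing:

```python
import pandas as pd

# Hypothetical messy column names: leading/trailing spaces, mixed case.
df = pd.DataFrame({
    " Customer ID": ["A1", "B2"],
    "Total Charges ": [29.85, 1889.5],
    "Monthly Charges": [29.85, 56.95],
})

df.columns = (
    df.columns
    .str.strip()                          # drop leading/trailing whitespace
    .str.lower()                          # lowercase everything
    .str.replace(" ", "_", regex=False)   # spaces -> underscores
)

print(df.columns.tolist())

# Attribute access now works because the names are valid identifiers.
total = df.total_charges.sum()
print(total)
```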