# Insurance Dataset Analysis
This notebook provides a step-by-step analysis of the insurance dataset, with code and explanations for each step.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load the insurance dataset
df = pd.read_csv('../datasets/insurance.csv')
print('Rows, Cols:', df.shape)
# A bare df.head() in the middle of a cell is not displayed, so print it explicitly
print(df.head())
df.info()
print('\nMissing values per column:\n', df.isna().sum())

## Data Loading and Inspection
This cell loads the insurance dataset and prints a quick overview: its shape, the column types and non-null counts reported by `df.info()`, and the number of missing values per column.
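
Raw missing-value counts can be easier to judge alongside percentages. The following is a minimal sketch on a made-up frame; the `toy` data and the `missing_summary` helper are illustrative assumptions, not part of the notebook or the real dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the insurance data; columns and values are assumptions.
toy = pd.DataFrame({
    "age": [19, 33, np.nan, 45],
    "sex": ["female", "male", "male", None],
    "charges": [16884.92, 4449.46, 21984.47, 3866.86],
})

def missing_summary(frame):
    """Return the count and percentage of missing values per column."""
    counts = frame.isna().sum()
    percent = (counts / len(frame) * 100).round(1)
    return pd.DataFrame({"missing": counts, "percent": percent})

summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(df)` on the loaded frame would give the same two-column view for the real data.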

In [None]:
# Clean text columns if present
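def clean_text(x):
    """Trim a value and collapse internal whitespace; leave missing values as-is."""
    if pd.isna(x):
        return x
    # strip() removes leading/trailing whitespace
    s = str(x).strip()
    # split() + join() collapses any run of whitespace into a single space
    return " ".join(s.split())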

for col in ["sex", "region"]:
    if col in df.columns:
        df[col] = df[col].apply(clean_text)
        df[col] = df[col].str.title()

## Cleaning Text Columns
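This cell standardizes the 'sex' and 'region' text columns: it trims leading and trailing whitespace, collapses runs of internal whitespace into single spaces, and title-cases each value, so that variants such as ' male' and 'MALE' end up as the same category.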
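
The effect of this cleanup can be checked on a few made-up values. The sketch below uses pandas' vectorized string methods instead of `apply`, which is a swapped-in alternative rather than the notebook's own approach; the `messy` values are invented for illustration.

```python
import pandas as pd

# Invented examples of inconsistent text values, including a missing one.
messy = pd.Series(["  male", "FEMALE ", "south   west", None])

# Vectorized cleanup: trim, collapse whitespace runs, then title-case.
cleaned = (
    messy.str.strip()
         .str.replace(r"\s+", " ", regex=True)
         .str.title()
)
print(cleaned.tolist())
```

Missing values pass through the `.str` methods and stay missing, which matches the `pd.isna` guard in `clean_text`.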

In [None]:
# Remove duplicates
before = df.shape[0]
df = df.drop_duplicates()
after = df.shape[0]
print(f"Removed {before - after} duplicate rows")

## Removing Duplicates
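This cell removes exact duplicate rows (rows that match in every column), keeping the first occurrence of each, and reports how many rows were dropped. Because it runs after the text cleanup, rows that previously differed only in whitespace or letter case are also caught.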
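
Before dropping anything, it can help to look at the rows that would be affected. This is a small sketch on invented data: `duplicated(keep=False)` flags every member of a duplicate group, not only the later copies.

```python
import pandas as pd

# Invented rows; the first, second, and fourth are identical.
toy = pd.DataFrame({
    "age": [19, 19, 33, 19],
    "region": ["Southwest", "Southwest", "Northwest", "Southwest"],
    "charges": [1639.56, 1639.56, 4449.46, 1639.56],
})

# Flag all rows that belong to a duplicate group so they can be inspected.
dupe_mask = toy.duplicated(keep=False)
print(toy[dupe_mask])

# Drop the repeats (the first occurrence is kept) and tidy the index.
deduped = toy.drop_duplicates().reset_index(drop=True)
print(f"Removed {len(toy) - len(deduped)} duplicate rows")
```

Resetting the index is optional; it only avoids gaps in the row labels after rows are removed.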

In [None]:
# Drop irrelevant/noisy column if present
if 'random_notes' in df.columns:
    df = df.drop(columns=['random_notes'])

## Dropping Irrelevant Columns
This cell drops the 'random_notes' column, which is not useful for analysis. If the column is not present, the DataFrame is left unchanged.
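
If more than one column needs to go, the existence check can be folded into the call itself with `errors="ignore"`. A minimal sketch, assuming a made-up frame; `not_a_real_column` is a hypothetical name included only to show that absent columns are skipped.

```python
import pandas as pd

# Made-up frame with one useful column and one noisy column.
toy = pd.DataFrame({
    "age": [19, 33],
    "random_notes": ["n/a", "call back"],
})

# errors="ignore" skips any listed column that does not exist.
to_drop = ["random_notes", "not_a_real_column"]
trimmed = toy.drop(columns=to_drop, errors="ignore")
print(trimmed.columns.tolist())
```

The explicit `if` check used in the cell is equally valid for a single column; this form mainly saves repetition when the list grows.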