# U.S. Medical Insurance Costs

## 📋 Task 1: Look over your dataset

**Description:** Open `insurance.csv` and examine the file structure. Note how the information is organized and consider which aspects you want to investigate.

### Key Questions to Consider:
- How is the data organized?
- What patterns do you notice?
- What relationships might exist between variables?

### Steps to Follow:
1. Open the `insurance.csv` file in a text editor or spreadsheet application
2. Examine the column headers and data types
3. Look for patterns in the data
4. Note any interesting observations

### What to Look For:
- **Column names**: What information does each column represent?
- **Data types**: Are the values numbers, text, or categories?
- **Sample data**: What do the first few rows look like?
- **Data size**: How many rows and columns are there?
- **Missing values**: Are there any empty cells or missing data?

### Your Turn:
Before writing any code, take a moment to manually examine the `insurance.csv` file. What do you notice about the structure and content?


## 🎯 Task 2: Scoping Your Project

We'll define the key questions and an ordered plan for analysis so that the following sections (Smoking, Region, Variables, Functions) flow naturally.


## 📥 Task 3: Import and explore the dataset

**Comprehensive Data Exploration:** Load `insurance.csv` with pandas and perform thorough analysis including:
- Dataset structure (shape, columns, data types)
- Summary statistics for all numeric columns
- Categorical variable distributions (regions, smokers, sex)
- Missing value analysis
- Manual calculations of averages, min/max values
- Data quality assessment

This establishes a complete baseline understanding for all subsequent analyses.


In [60]:
import pandas as pd

df = pd.read_csv("insurance.csv")

average_age = df["age"].sum() / len(df)
average_bmi = df["bmi"].sum() / len(df)
average_children = df["children"].sum() / len(df)
average_charges = df["charges"].sum() / len(df)
num_columns = len(df.columns)
num_rows = len(df)
regions = df["region"].value_counts()

max_age = df["age"].max()
min_age = df["age"].min()

max_charges = df["charges"].max()
min_charges = df["charges"].min()

max_bmi = df["bmi"].max()
min_bmi = df["bmi"].min()
rows_with_nulls = df.isnull().sum()

categorical_col = df.select_dtypes(include=['object']).columns

smokers = len(df[df["smoker"] == "yes"])
no_smokers = len(df[df["smoker"] == "no"])



print(df.info(), "\n")
print(df.describe(),"\n")
print("Average age:",average_age)
print(f"Max age is: {max_age} and the minimum age is: {min_age}\n")
print(f"The max BMI is {max_bmi}, and the minimum BMI is {min_bmi}")
print("Average BMI:",average_bmi,"\n")
print(f"The maximum charges is {max_charges}, and the minimum is {min_charges}\n")
print("Average Children:",average_children)
print("Average Charges:", round(average_charges, 2), "\n")
print("Number of columns:", num_columns)
print("Number of rows:", num_rows, "\n")
print("Rows / columns shape:", df.shape, "\n")
print("Data types", df.dtypes, "\n")
print("Regions:", regions, "\n")
print("Null values per column:")
print(rows_with_nulls)
print("Columns that are categorical:")
print(categorical_col)
print(f"Number of smokers is {smokers} and number of non-smokers is {no_smokers}")







df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB
None 

               age          bmi     children       charges
count  1338.000000  1338.000000  1338.000000   1338.000000
mean     39.207025    30.663397     1.094918  13270.422265
std      14.049960     6.098187     1.205493  12110.011237
min      18.000000    15.960000     0.000000   1121.873900
25%      27.000000    26.296250     0.000000   4740.287150
50%      39.000000    30.400000     1.000000   9382.033000
75%      51.000000    34.693750     2.000000  16639.912515
m

   age     sex     bmi  children  smoker     region      charges
0   19  female    27.9         0     yes  southwest    16884.924
1   18    male   33.77         1      no  southeast    1725.5523
2   28    male    33.0         3      no  southeast     4449.462
3   33    male  22.705         0      no  northwest  21984.47061
4   32    male   28.88         0      no  northwest    3866.8552
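
As a cross-check on the manual calculations above, pandas' built-in reductions should return the same values. The sketch below is illustrative only: `sample` is a hypothetical five-row frame copied from the `df.head()` rows, so that the example runs without `insurance.csv`. It also shows the difference between counting nulls per column and counting rows that contain any null.

```python
import pandas as pd

# Hypothetical stand-in for insurance.csv: the five rows shown by df.head()
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "sex": ["female", "male", "male", "male", "male"],
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
    "children": [0, 1, 3, 0, 0],
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

# Manual average (sum / count) versus the built-in mean
manual_mean_age = sample["age"].sum() / len(sample)
builtin_mean_age = sample["age"].mean()

# Nulls per column versus the number of rows containing at least one null
nulls_per_column = sample.isna().sum()
rows_with_any_null = int(sample.isna().any(axis=1).sum())

# Category frequencies as proportions instead of raw counts
smoker_share = sample["smoker"].value_counts(normalize=True)

print("Manual vs built-in mean age:", manual_mean_age, builtin_mean_age)
print("Rows with at least one null:", rows_with_any_null)
print(smoker_share)
```

On the full dataset the same calls would be applied to `df` instead of `sample`.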


## 🔍 Analysis 1: Smoking and charges

Compare mean charges for smokers vs non‑smokers using both manual selection and `groupby` to validate results.


In [76]:
# How does smoking affect insurance cost?
# Create a column with charges for smokers

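# numpy.NaN was removed in NumPy 2.0; np.nan works in all versions
import numpy as np


def is_smoker(row):
    # Return the charge for smokers, NaN otherwise
    if row["smoker"] == "yes":
        return row["charges"]
    else:
        return np.nan

def no_smoker(row):
    # Return the charge for non-smokers, NaN otherwise
    if row["smoker"] == "no":
        return row["charges"]
    else:
        return np.nan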

df["charges_smokers"] = df.apply(is_smoker, axis=1)

df["charge_nosmokers"] = df.apply(no_smoker, axis=1)

smokers_average_charge = df["charges_smokers"].sum() / len(df[df["smoker"] == "yes"])
no_smokers_average_charge = df["charge_nosmokers"].sum() / len(df[df["smoker"] == "no"])

print(f"The average charges for smokers is {smokers_average_charge}")
print(f"The average charges for non-smokers is {no_smokers_average_charge}")

if smokers_average_charge < no_smokers_average_charge:
    print(f"The result is surprising: the average charge for smokers is {round(smokers_average_charge, 2)} and for non-smokers is {round(no_smokers_average_charge, 2)}\n")
else:
    print("The result makes sense: the average cost for smokers is higher than for non-smokers\n")

print("======= Another way to calculate the mean ==========")
print("Using .mean() and groupby()\n")
smoker_charges_new = df.groupby("smoker")["charges"].mean()
print(f"The mean charges for smokers and non-smokers are:\n{smoker_charges_new}")

The average charges for smokers is 32050.23183153285
The average charges for non-smokers is 8434.268297856202
The result makes sense: the average cost for smokers is higher than for non-smokers

======= Another way to calculate the mean ==========
Using .mean() and groupby()

The mean charges for smokers and non-smokers are:
smoker
no      8434.268298
yes    32050.231832
Name: charges, dtype: float64
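
The manual comparison above can also be done without helper columns by selecting rows with a boolean mask. This is a minimal sketch under an assumption: `sample` is a hypothetical five-row frame taken from the `df.head()` rows, not the full dataset, so the numbers it prints differ from the results above.

```python
import pandas as pd

# Hypothetical stand-in: smoker and charges values from the df.head() rows
sample = pd.DataFrame({
    "smoker": ["yes", "no", "no", "no", "no"],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

# Boolean mask: True for smokers, False for non-smokers
is_smoker_mask = sample["smoker"] == "yes"

# Select the charges of each group with .loc and average them directly
smoker_mean = sample.loc[is_smoker_mask, "charges"].mean()
non_smoker_mean = sample.loc[~is_smoker_mask, "charges"].mean()

# groupby should agree with the mask-based result
by_group = sample.groupby("smoker")["charges"].mean()

print("Smokers:", smoker_mean)
print("Non-smokers:", non_smoker_mean)
print(by_group)
```

Because `.mean()` skips nothing here (there are no NaN placeholders), no extra columns are added to the frame.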


## 🌍 Analysis 2: Regional differences

Compute average charges per `region`, then identify the highest and lowest regions using `idxmax()` and `idxmin()`.


In [105]:
# Does region affect the cost of the insurance?

average_per_region = df.groupby("region")["charges"].mean()
print(average_per_region)

# Access the mean for each region
northeast_mean = average_per_region["northeast"]
northwest_mean = average_per_region["northwest"]
southeast_mean = average_per_region["southeast"]
southwest_mean = average_per_region["southwest"]

regions_means = [northeast_mean, northwest_mean, southeast_mean, southwest_mean]

print(f"\nThe mean charges for the northeast region is: {northeast_mean}\n")

max_average_region = average_per_region.max()
max_region_name = average_per_region.idxmax()
print(f"The region with the highest average charges is {max_region_name} with ${round(max_average_region, 2)}")


min_average_region = average_per_region.min()
min_region_average_name = average_per_region.idxmin()  
print(f"The region with the lowest average charges is {min_region_average_name} with ${round(min_average_region, 2)}")


region
northeast    13406.384516
northwest    12417.575374
southeast    14735.411438
southwest    12346.937377
Name: charges, dtype: float64

The mean charges for the northeast region is: 13406.384516385804

The region with the highest average charges is southeast with $14735.41
The region with the lowest average charges is southwest with $12346.94
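
A ranking of all regions, together with the number of rows behind each mean, can complement `idxmax()` and `idxmin()`. The sketch below assumes a hypothetical five-row `sample` frame (the `df.head()` rows), so its ordering does not match the full-dataset results above.

```python
import pandas as pd

# Hypothetical stand-in: region and charges values from the df.head() rows
sample = pd.DataFrame({
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

# Mean charge and row count per region, sorted from highest to lowest mean
region_summary = (
    sample.groupby("region")["charges"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)

print(region_summary)
print("Highest:", region_summary.index[0])
print("Lowest:", region_summary.index[-1])
```

Showing `count` next to `mean` makes it easier to judge whether a regional difference rests on enough rows to be meaningful.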


## 💾 Task 4: Save dataset features to variables

Extract columns into Python lists and organize them into dictionaries for convenient access. Verify lengths and sample values, then demonstrate indexed access to a single record.


In [125]:
ages = df["age"].tolist()

charges = df["charges"].tolist()

smokers = df["smoker"].tolist()

children = df["children"].tolist()

sex = df["sex"].tolist()

bmi = df["bmi"].tolist()

region = df["region"].tolist()

# Length of the variables
print(f"The length of ages is {len(ages)}")
print(f"The length of smokers is {len(smokers)}")
print(f"The length of charges is {len(charges)}")


# First values
print(f"The first values for ages are {ages[:5]}")
print(f"The first values for sex are {sex[:5]}\n")

personal_info = {
    "age": ages,
    "sex": sex,
    "children": children,
    "region": region
}

health_info = {
    "smoker": smokers,
    "bmi": bmi
}

# Print info for the 3rd person (index 2)
print("Personal info for 3rd person:")
print(f"Age: {personal_info['age'][2]}")
print(f"Sex: {personal_info['sex'][2]}")
print(f"Children: {personal_info['children'][2]}")
print(f"Region: {personal_info['region'][2]}")

print("\nHealth info for 3rd person:")
print(f"Smoker: {health_info['smoker'][2]}")
print(f"BMI: {health_info['bmi'][2]}")

The length of ages is 1338
The length of smokers is 1338
The length of charges is 1338
The first values for ages are [19, 18, 28, 33, 32]
The first values for sex are ['female', 'male', 'male', 'male', 'male']

Personal info for 3rd person:
Age: 28
Sex: male
Children: 3
Region: southeast

Health info for 3rd person:
Smoker: no
BMI: 33.0
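
An alternative layout is one dictionary per person instead of one list per column, which keeps each record's fields together. The sketch below is only an illustration: the lists hold the first five values of each column as shown earlier, and `records` and `third_person` are hypothetical names.

```python
# Hypothetical stand-in lists: the first five values of each column
ages = [19, 18, 28, 33, 32]
sex = ["female", "male", "male", "male", "male"]
children = [0, 1, 3, 0, 0]
region = ["southwest", "southeast", "southeast", "northwest", "northwest"]
smokers = ["yes", "no", "no", "no", "no"]
bmi = [27.9, 33.77, 33.0, 22.705, 28.88]

# zip walks the parallel lists together, producing one dict per person
records = [
    {"age": a, "sex": s, "children": c, "region": r, "smoker": sm, "bmi": b}
    for a, s, c, r, sm, b in zip(ages, sex, children, region, smokers, bmi)
]

# The 3rd person (index 2) is now a single lookup
third_person = records[2]
print(third_person)
```

With this layout, `records[2]` replaces six separate indexed lookups such as `personal_info['age'][2]`.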


## 🔧 Task 5: Build analysis functions

Implement reusable helpers for grouped statistics (mean, min, max, std) and conditional counts; then test them on the dataset.


In [174]:
# First function to calculate average per group
def calculate_average_by_group(data, group_column, value_column):
    average_group = data.groupby(group_column)[value_column].mean()
    return average_group

def calculate_total(data, group_column, type1, value_column, type2):
    sum_group = len(data[(data[group_column] == type1) & (data[value_column] == type2)])
    return sum_group


print("The average BMI per sex:",calculate_average_by_group(df, "sex", "bmi"), "\n")
print("The average charges per sex:",calculate_average_by_group(df, "sex", "charges"), "\n")
print("The average age per sex:",calculate_average_by_group(df, "sex", "age"), "\n")

print("The total number of female smokers:", calculate_total(df, "sex", "female", "smoker", "yes"), "\n")

# Same count as calculate_total, with clearer parameter names
def count_by_two_conditions(data, column1, value1, column2, value2):
    """Count rows that meet two conditions"""
    count = len(data[(data[column1] == value1) & (data[column2] == value2)])
    return count

# Usage examples:
print("Female smokers:", count_by_two_conditions(df, "sex", "female", "smoker", "yes"))
print("Male non-smokers:", count_by_two_conditions(df, "sex", "male", "smoker", "no"), "\n")

def find_max_by_group(data, group_column, value_column):
    max_group_value = data.groupby(group_column)[value_column].max()
    return max_group_value

print("Maximum age per sex:", find_max_by_group(df, "sex", "age"), "\n")
print("Maximum BMI per sex:", find_max_by_group(df, "sex", "bmi"), "\n")

def find_min_by_group(data, group_column, value_column):
    min_group = data.groupby(group_column)[value_column].min()
    return min_group

print("The minimum age per sex:", find_min_by_group(df, "sex", "age"), "\n")

# Find standard deviation
def standard_deviation(data, group_column, value_column):
    standard_group = data.groupby(group_column)[value_column].std()
    return standard_group

print("The standard deviation for BMI by sex:", standard_deviation(df, "sex", "bmi"), "\n")
print("The standard deviation for charges by sex:", standard_deviation(df, "sex", "charges"))


The average BMI per sex: sex
female    30.377749
male      30.943129
Name: bmi, dtype: float64 

The average charges per sex: sex
female    12569.578844
male      13956.751178
Name: charges, dtype: float64 

The average age per sex: sex
female    39.503021
male      38.917160
Name: age, dtype: float64 

The total number of female smokers: 115 

Female smokers: 115
Male non-smokers: 517 

Maximum age per sex: sex
female    64
male      64
Name: age, dtype: int64 

Maximum BMI per sex: sex
female    48.07
male      53.13
Name: bmi, dtype: float64 

The minimum age per sex: sex
female    18
male      18
Name: age, dtype: int64 

The standard deviation for BMI by sex: sex
female    6.046023
male      6.140435
Name: bmi, dtype: float64 

The standard deviation for charges by sex: sex
female    11128.703801
male      12971.025915
Name: charges, dtype: float64
