# U.S. Medical Insurance Costs

This notebook analyzes a dataset of U.S. medical insurance records. The data is stored in the file `insurance.csv`, so we'll begin by importing the **csv** library to work with it:


In [1]:
import csv

## Dataset description

The dataset file contains seven fields:
- `age`, the age of the individual;
- `sex`, the sex of the individual;
- `bmi`, the [body mass index](https://en.wikipedia.org/wiki/Body_mass_index) of the individual;
- `children`, the number of children of the individual;
- `smoker`, whether the individual smokes or not;
- `region`, a very broad description of the individual's location;
- `charges`, how much the individual pays for insurance.
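
As a quick check of the field list above, the header row can be read back through `csv.DictReader`. This is a minimal sketch: the in-memory `sample` string is a stand-in for `insurance.csv`, and the column order in it is an assumption based on the order of the list above.

```python
import csv
import io

# Stand-in for insurance.csv: only a header row, with the column
# order assumed from the field list above.
sample = io.StringIO("age,sex,bmi,children,smoker,region,charges\n")

reader = csv.DictReader(sample)
field_names = reader.fieldnames  # the header row is read on first access

print(field_names)
print(len(field_names))
```

With the real file, the same `fieldnames` attribute could be inspected on the reader created from `open('insurance.csv', newline='')`.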

We can define one list for each of these fields:

In [2]:
ages = []
sexes = []
bmis = []
children = []
is_smoker = []
regions = []
charges = []

With our lists defined, we can read the dataset file and populate them:

In [7]:
with open('insurance.csv', newline='') as insurance_dataset:
    # DictReader yields one dict per row, keyed by the header names
    insurance_dict = csv.DictReader(insurance_dataset)
    for item in insurance_dict:
        # Every value is read as a string, including the numeric fields
        ages.append(item["age"])
        sexes.append(item["sex"])
        bmis.append(item["bmi"])
        children.append(item["children"])
        is_smoker.append(item["smoker"])
        regions.append(item["region"])
        charges.append(item["charges"])

Because every row contributes exactly one value to each list, all lists end up with the same number of items, and the same index in every list refers to the same individual.
We can verify this by checking the length of each list:

In [9]:
# Comment or uncomment each line to hide or show its output

print(len(ages))
print(len(sexes))
print(len(bmis))
print(len(children))
print(len(is_smoker))
print(len(regions))
print(len(charges))

1338
1338
1338
1338
1338
1338
1338


All lists have 1338 items each, as expected.
With the data imported, we can go ahead and analyze it.
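
One thing to keep in mind before analyzing: `csv.DictReader` returns every value as a string, so numeric fields need converting before any arithmetic. The sketch below shows one possible way to convert while reading. The rows in `sample` are made up for illustration and are not taken from `insurance.csv`, and treating `smoker` as a `yes`/`no` field is an assumption.

```python
import csv
import io

# Made-up rows for illustration only; they do not come from insurance.csv.
sample = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "30,female,25.5,1,no,southwest,4000.25\n"
    "45,male,31.0,2,yes,northeast,21000.5\n"
)

ages, bmis, children, is_smoker, charges = [], [], [], [], []
for row in csv.DictReader(sample):
    ages.append(int(row["age"]))              # whole years
    bmis.append(float(row["bmi"]))            # decimal value
    children.append(int(row["children"]))     # whole count
    is_smoker.append(row["smoker"] == "yes")  # assumes "yes"/"no" values
    charges.append(float(row["charges"]))     # decimal amount

print(ages)
print(is_smoker)
```

Converting at read time keeps later steps such as `min`, `max`, and averages free of repeated casts, at the cost of failing early if a value is malformed.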

## Analysis

### Sex Distribution

First, we'll check the sex distribution of our dataset. We can do this by counting how many `male`s and `female`s are present, as well as their percentages:

In [14]:
males = sexes.count('male')
females = sexes.count('female')
no_of_people = len(sexes)

print(f"Number of males: {males} ({round(males/no_of_people * 100, 2)}%)")
print(f"Number of females: {females} ({round(females/no_of_people * 100, 2)}%)")

Number of males: 676 (50.52%)
Number of females: 662 (49.48%)


The dataset is balanced in terms of sex, with only 14 more males than females.
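
The same counts can also be obtained without naming each category in advance. The following is a minimal sketch using `collections.Counter`; `sample_sexes` is a made-up list standing in for the notebook's `sexes` list.

```python
from collections import Counter

# Made-up values for illustration; in the notebook this would be `sexes`.
sample_sexes = ["male", "female", "female", "male", "male"]

counts = Counter(sample_sexes)
total = sum(counts.values())

# most_common() lists the categories from most to least frequent
for label, count in counts.most_common():
    print(f"Number of {label}s: {count} ({count / total:.2%})")
```

This approach would also surface any unexpected category, such as a misspelled or missing value, which two hard-coded `count` calls would silently skip.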

### Ages

To get a feel for the ages of the people in the dataset, we'll find the youngest and oldest ages present, as well as the average age: