# U.S. Medical Charges

### Dataset overview

The dataset provides medical insurance information for individual clients from the United States.

---

### Goals
- Solidify my understanding of data analysis concepts by exploring a real-world dataset.
- Formulate exploratory questions that uncover meaningful patterns in the dataset, for example comparing insurance costs between smokers and non-smokers. At this stage, the priority is to generate ideas freely rather than to judge how strong each question is; I will revisit them when my analytical skills are stronger.
- Begin practicing how to interpret findings, even when the conclusions are tentative. The aim is to build the habit of turning observations into meaningful insights.

---

### Basic exploratory questions:

- **How many *records* are there in the dataset?**
> This is the most basic fact to establish about the dataset. It shows how large the dataset is, and the result is useful for downstream percentage-based calculations.
- **What are the unique values or categories of categorical variables such as `region`, `sex`, and `smoker`, and of numerical variables with few distinct values such as `children`?**
> This step defines the structure of each categorical variable. Knowing all the possible values each variable can take is essential for downstream tasks such as analyzing the *distribution within a variable* or category-based *aggregations* (e.g., grouping by `region` or by `smoker` status).
> Although `children` is a numerical variable, its small number of distinct values (0 to 5 in this dataset) makes it suitable for grouping and aggregation like a categorical variable.
- **What are the distributions within such categories?** (e.g., how many people in the dataset are smokers and how many are not)
> Examining the distribution shows which group dominates the dataset (e.g., whether most clients are male or female, or which region is most represented).
> These distributions also support downstream tasks such as percentage-based calculations, subgroup comparisons, and aggregations.
- **What are the mean (average) values for numerical variables such as `age`, `children`, `charges`, and `bmi`?**
> Calculating the mean helps establish the central tendency of each variable in the dataset.
> It is important to interpret the mean as a reference point: it is a mathematical center, not necessarily the most likely value. To better understand the shape and spread of the data, I plan to explore additional metrics such as *median*, *mode*, and *standard deviation* as I continue learning.

---

### Some Ideas for further exploration:
- What are the proportions of smokers and non-smokers in the population?
- What is the average BMI of a typical smoker compared to a non-smoker?
- What are the average charges of a typical smoker compared to a non-smoker?
- Which region has the highest average BMI?
- Which region has the highest average insurance costs?
- Which region has the most smokers?
- What is the proportion of female clients who smoke relative to female clients who do not?
- What is the proportion of male clients who smoke relative to male clients who do not?
- What is the proportion of clients who have no children compared to those who have at least one child?
- What are the average charges of these two groups?
- What is the average age of a typical female client and a male client?

### Basic exploration

Import csv library

In [279]:
import csv

Basic information of the dataset

- How many lines are there in the dataset?
- What are the headers of the dataset?
- How is the data organized?

In [280]:
row_count = 0
rows_get = 10
horizontal_line_length = 100
mode = 'table'

header = []
preview_rows = []

with open('insurance.csv') as insurance_csv:
    if mode == 'dict':
        insurance_reader = csv.DictReader(insurance_csv)
        header = insurance_reader.fieldnames
    else:
        insurance_reader = csv.reader(insurance_csv)
        # skip the header row
        header = next(insurance_reader)
        print(header)
    for i, row in enumerate(insurance_reader):
        if i < rows_get:
            preview_rows.append(row)
        row_count += 1
        

# print the preview rows, then a horizontal line to separate them from the summary
for row in preview_rows:
    print(row)
print("-" * horizontal_line_length)
print("There are " + str(row_count) + " rows in the dataset.")

['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges']
['19', 'female', '27.9', '0', 'yes', 'southwest', '16884.924']
['18', 'male', '33.77', '1', 'no', 'southeast', '1725.5523']
['28', 'male', '33', '3', 'no', 'southeast', '4449.462']
['33', 'male', '22.705', '0', 'no', 'northwest', '21984.47061']
['32', 'male', '28.88', '0', 'no', 'northwest', '3866.8552']
['31', 'female', '25.74', '0', 'no', 'southeast', '3756.6216']
['46', 'female', '33.44', '1', 'no', 'southeast', '8240.5896']
['37', 'female', '27.74', '3', 'no', 'northwest', '7281.5056']
['37', 'male', '29.83', '2', 'no', 'northeast', '6406.4107']
['60', 'female', '25.84', '0', 'no', 'northwest', '28923.13692']
----------------------------------------------------------------------------------------------------
There are 1338 rows in the dataset.


There are 7 variables in the dataset:

- `age`: a *discrete numerical variable* indicating the clients' age
- `sex`: a *categorical variable* with two possible values - 'male' and 'female'
- `bmi`: a *continuous numerical variable* short for "body mass index", represented as a float
- `children`: a *discrete numerical variable* showing the number of children for each client represented by an integer
- `smoker`: a *categorical binary variable* that indicates whether a client smokes ('yes') or not ('no')
- `charges`: a *continuous numerical variable* (represented as a float), denoting the charges for each client (in USD)
- `region`: a *categorical nominal variable* representing the client's geographical region (e.g., northwest, southeast), rather than a specific city or district


Load values into Python using a dictionary

In [281]:
# Create lists to store the values of different variables
insurance_dict = {
    'age': [],
    'sex': [],
    'bmi': [],
    'children': [],
    'smoker': [],
    'charges': [],
    'region': []
}

Open the file and load the data into the dictionary created in the step above.
- Since all of the loaded data is in string form, the values of the numerical variables need to be converted accordingly.
- Check whether any variable has missing values.

In [282]:
# load the data from the file into the corresponding lists
with open('insurance.csv', newline='') as insurance_csv:
    insurance_reader = csv.DictReader(insurance_csv)
    # append the values of each row to the respective lists,
    # converting numeric fields from strings where needed
    for row in insurance_reader:
        insurance_dict['age'].append(int(row['age']))
        insurance_dict['sex'].append(row['sex'])
        insurance_dict['bmi'].append(float(row['bmi']))
        insurance_dict['children'].append(int(row['children']))
        insurance_dict['smoker'].append(row['smoker'])
        insurance_dict['region'].append(row['region'])
        insurance_dict['charges'].append(float(row['charges']))  

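# check for missing values: a blank numeric field would already have raised
# ValueError in int()/float() above, so look for empty strings in the remaining columns
missing_columns = [col for col, values in insurance_dict.items() if any(value == '' for value in values)]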
if missing_columns:
    print("Missing values are found in: ", ", ".join(missing_columns))
else:
    print("There are no missing values")

There are no missing values
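
A blank cell in a numeric column would make `int()` or `float()` raise `ValueError` during loading, so one option is to inspect the raw strings before converting them. The following is a minimal sketch of that idea, not the notebook's own method: the in-memory sample and the helper name `find_missing` are hypothetical and stand in for `insurance.csv`.

```python
import csv
import io

# Hypothetical in-memory sample standing in for insurance.csv;
# the second data row has a blank bmi field on purpose.
sample_csv = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "19,female,27.9,0,yes,southwest,16884.924\n"
    "18,male,,1,no,southeast,1725.5523\n"
)


def find_missing(csv_file):
    """Return (row_number, column) pairs for every blank cell."""
    missing = []
    reader = csv.DictReader(csv_file)
    for row_number, row in enumerate(reader, start=1):
        for column in reader.fieldnames:
            value = row.get(column)
            # DictReader fills cells missing from a short row with None
            if value is None or value.strip() == "":
                missing.append((row_number, column))
    return missing


missing_cells = find_missing(sample_csv)
print(missing_cells)
```

With the real file, the same helper could be called on the object returned by `open('insurance.csv', newline='')`.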


#### Figure out the unique values for the following variables: `smoker`, `region`,  `sex`, `age`, and `children`

In [283]:
# Return the dictionary of a variable's unique values (key) and their distribution (value)
def get_unique_dict_with_count(my_list):
    result = {}
    for element in my_list:
        # Make sure element exists in the dictionary
        result.setdefault(element, {})
        # Use get to avoid KeyError for the right hand side
        result[element]['count'] = result[element].get('count', 0) + 1
    return result
# Get the dictionary for the selected fields
unique_smoker_dict = get_unique_dict_with_count(insurance_dict['smoker'])
unique_region_dict = get_unique_dict_with_count(insurance_dict['region'])
unique_sex_dict = get_unique_dict_with_count(insurance_dict['sex'])
unique_age_dict = get_unique_dict_with_count(insurance_dict['age'])
unique_children_dict = get_unique_dict_with_count(insurance_dict['children'])


def display_value(variable, my_dict, type_ = 'unique'):
    if type_ == 'distribution':
        print("Distribution within {variable}:".format(variable=variable))
        for key, value in my_dict.items():
            print("  \'{key}\' has {value} instances.".format(key=key, value=value['count']))
    elif type_ == 'total_row':
        # each value is a dict such as {'count': n}, so add up the 'count' entries
        num_of_rows = 0
        for value in my_dict.values():
            num_of_rows += value['count']
        print("{variable} has {num_of_rows} rows in total.".format(variable=variable, num_of_rows = num_of_rows))
    # default value is to display the unique values
    else:
        # join the sorted unique values into a comma-separated string
        unique_values = ", ".join(str(key) for key in sorted(my_dict))
        print("The unique values for {variable} are: {values}".format(variable = variable, values = unique_values))

print("Display unique values of variables:")
print('-' * horizontal_line_length)
display_value('smoker', unique_smoker_dict)
display_value('region', unique_region_dict)
display_value('sex', unique_sex_dict)
display_value('age', unique_age_dict)
display_value('children', unique_children_dict)

Display unique values of variables:
----------------------------------------------------------------------------------------------------
The unique values for smoker are: no, yes
The unique values for region are: northeast, northwest, southeast, southwest
The unique values for sex are: female, male
The unique values for age are: 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64
The unique values for children are: 0, 1, 2, 3, 4, 5


#### Comment on the unique values:
- For categorical variables such as `smoker`, `region`, and `sex`, the values match what the overview led me to expect, and there are no abnormal values caused by misspellings.
- For numerical variables, `children` has a small cardinality (6 values, from 0 to 5), which makes it useful for grouping and aggregation. `age` has 47 unique values (18 through 64), so it is less useful for grouping, and computing the distribution within `age` is unlikely to be as informative.
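
For comparison, the standard library's `collections.Counter` performs the same kind of tallying and also gives the cardinality directly. The sketch below is an alternative to `get_unique_dict_with_count`, not a replacement for it: it stores plain counts rather than nested dictionaries, and it uses only the regions of the ten preview rows printed earlier.

```python
from collections import Counter

# Regions of the ten preview rows printed earlier in this notebook
preview_regions = [
    "southwest", "southeast", "southeast", "northwest", "northwest",
    "southeast", "southeast", "northwest", "northeast", "northwest",
]

region_counts = Counter(preview_regions)

# Cardinality: the number of distinct values
print(len(region_counts))
# Distinct values in sorted order
print(sorted(region_counts))
# Count for each value, most frequent first
print(region_counts.most_common())
```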

#### Figure out the distribution for the following variables: `smoker`, `region`, `sex`, `children`

In [284]:
print("Display the distribution for certain variables: ")
print('-' * horizontal_line_length)
display_value('smoker', unique_smoker_dict, 'distribution')
display_value('region', unique_region_dict, 'distribution')
display_value('sex', unique_sex_dict, 'distribution')
display_value('children', unique_children_dict, 'distribution')


Display the distribution for certain variables: 
----------------------------------------------------------------------------------------------------
Distribution within smoker:
  'yes' has 274 instances.
  'no' has 1064 instances.
Distribution within region:
  'southwest' has 325 instances.
  'southeast' has 364 instances.
  'northwest' has 325 instances.
  'northeast' has 324 instances.
Distribution within sex:
  'female' has 662 instances.
  'male' has 676 instances.
Distribution within children:
  '0' has 574 instances.
  '1' has 324 instances.
  '3' has 157 instances.
  '2' has 240 instances.
  '5' has 18 instances.
  '4' has 25 instances.


#### Comment on the distribution:
- `smoker`: there are far more non-smokers than smokers (1064 against 274).
- `region`: the distribution is fairly even (324 to 364 clients per region), so all 4 regions are well represented.
- `sex`: the gap between male and female clients is small (676 against 662), so neither sex is under- or overrepresented.
- `children`: clients with no children are the largest group, making up about 43% of the dataset (574 of 1338).
- Overall, `region` and `sex` are fairly balanced, while `smoker` and `children` are skewed. To get a clearer picture, I can express these counts as proportions and percentages of the whole population.
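
The counts above can be expressed as percentages of the whole population. Here is a minimal sketch that uses the counts copied from the distribution output; the helper name `to_percentages` is hypothetical.

```python
# Counts copied from the distribution output above
total_clients = 1338
smoker_counts = {"yes": 274, "no": 1064}
children_counts = {0: 574, 1: 324, 2: 240, 3: 157, 4: 25, 5: 18}


def to_percentages(counts, total):
    """Convert raw counts to percentages of the total, rounded to one decimal."""
    return {key: round(count / total * 100, 1) for key, count in counts.items()}


smoker_pct = to_percentages(smoker_counts, total_clients)
children_pct = to_percentages(children_counts, total_clients)

print(smoker_pct)
print(children_pct)
```

Roughly one client in five is a smoker, and clients without children account for a little under 43% of the dataset.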

#### Calculate the average values for the following variables: `age`, `children`, `charges`, and `bmi`

In [285]:
def calculate_average(variable, my_list):
    total = 0
    for item in my_list:
        total += item
    print("The average for {variable} is {average}".format(variable = variable, average = round(total / len(my_list), ndigits=2)))

calculate_average('age', insurance_dict['age'])
calculate_average('children', insurance_dict['children'])
calculate_average('charges', insurance_dict['charges'])
calculate_average('bmi', insurance_dict['bmi'])

The average for age is 39.21
The average for children is 1.09
The average for charges is 13270.42
The average for bmi is 30.66
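
The mean alone does not describe the shape or spread of a variable. As a minimal sketch, the standard library's `statistics` module can report the median and the standard deviation next to the mean. The sample below uses only the `charges` of the ten preview rows printed earlier, so the numbers are illustrative and do not describe the full dataset.

```python
import statistics

# Charges of the ten preview rows printed earlier (not the full dataset)
preview_charges = [
    16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552,
    3756.6216, 8240.5896, 7281.5056, 6406.4107, 28923.13692,
]

mean_charge = statistics.mean(preview_charges)
median_charge = statistics.median(preview_charges)
stdev_charge = statistics.stdev(preview_charges)  # sample standard deviation

print("mean:  ", round(mean_charge, 2))
print("median:", round(median_charge, 2))
print("stdev: ", round(stdev_charge, 2))
```

In this small sample the mean is well above the median, because a few large charges pull the mean upward. With the notebook's own data, the same calls could be applied to `insurance_dict['charges']`.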


### Extra exploration:

#### Proportion between smokers and non-smokers

In [286]:
print("There are almost " + str(round(unique_smoker_dict['yes']['count'] / unique_smoker_dict['no']['count'], ndigits=1)) + " smokers for every one non-smoker client.")
print("There are almost " + str(round(unique_smoker_dict['no']['count'] / unique_smoker_dict['yes']['count'], ndigits=1)) + " non-smoker client for every one smoker client.")

There are almost 0.3 smokers for every one non-smoker client.
There are almost 3.9 non-smoker client for every one smoker client.


*Non-smokers dominate the dataset. Specifically, for every smoker there are almost 4 non-smokers, which indicates an imbalance in representation for this variable.*


#### Average `bmi` and `charges` of a typical smoker compared to a non-smoker

In [287]:
for i in range(row_count):
    smoker = insurance_dict['smoker'][i] # either 'yes' or 'no'
    unique_smoker_dict[smoker]['total_bmi'] = unique_smoker_dict[smoker].get('total_bmi', 0) + insurance_dict['bmi'][i]
    unique_smoker_dict[smoker]['total_charges'] = unique_smoker_dict[smoker].get('total_charges', 0) + insurance_dict['charges'][i]

print("The average BMI of non-smoker clients: ", unique_smoker_dict['no']['total_bmi'] / unique_smoker_dict['no']['count'])
print("The average BMI of smoker clients", unique_smoker_dict['yes']['total_bmi'] / unique_smoker_dict['yes']['count'])
print('-' * horizontal_line_length)
print("The average charges of a typical non_smoker client: ", unique_smoker_dict['no']['total_charges'] / unique_smoker_dict['no']['count'])
print("The average charges of a typical smoker client: ", unique_smoker_dict['yes']['total_charges'] / unique_smoker_dict['yes']['count'])


The average BMI of non-smoker clients:  30.651795112781922
The average BMI of smoker clients 30.708448905109503
----------------------------------------------------------------------------------------------------
The average charges of a typical non_smoker client:  8434.268297856199
The average charges of a typical smoker client:  32050.23183153285


- The average BMI of a typical smoker is almost the same as that of a typical non-smoker (about 30.7 in both groups).
- A typical smoker is charged almost four times as much as a typical non-smoker (about $32,050 against $8,434).
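
As a quick check on this comparison, the ratio and the difference of the two averages printed above can be computed directly. This is a small sketch with the values copied from the output. It is a descriptive comparison of group averages only and does not account for other variables such as `age` or `bmi`.

```python
# Averages copied from the output above
avg_charges_non_smoker = 8434.268297856199
avg_charges_smoker = 32050.23183153285

ratio = avg_charges_smoker / avg_charges_non_smoker
difference = avg_charges_smoker - avg_charges_non_smoker

print("Smokers are charged about {ratio} times as much on average.".format(ratio=round(ratio, 1)))
print("The average difference is about ${difference:,.0f}.".format(difference=difference))
```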

#### Find out the average `bmi` and insurance `charges`, and the total `smoker` for each region

In [288]:
for i in range(row_count):
    region = insurance_dict['region'][i]
    unique_region_dict[region]['total_bmi'] = unique_region_dict[region].get('total_bmi', 0) + insurance_dict['bmi'][i]
    unique_region_dict[region]['total_charges'] = unique_region_dict[region].get('total_charges', 0) + insurance_dict['charges'][i]
    unique_region_dict[region]['total_smokers'] = unique_region_dict[region].get('total_smokers', 0) + (1 if insurance_dict['smoker'][i] == 'yes' else 0)
# Figure out the average bmi among the 4 regions
for key, value in unique_region_dict.items():
    print("The average BMI of the {region} region is {value}".format(region=key, value= value['total_bmi'] / value['count']))
print('-' * horizontal_line_length)
# Figure out the average insurance costs among the 4 regions
for key, value in unique_region_dict.items():
    print("The average insurance costs of the {region} region is {value}.".format(region=key, value = value['total_charges'] / value['count']))
print('-' * horizontal_line_length)
for key, value in unique_region_dict.items():
    print("There is one smoker for almost every {number} people in the {region} region".format(number=round(value['count'] / value['total_smokers'], ndigits=1), region=key))

The average BMI of the southwest region is 30.59661538461538
The average BMI of the southeast region is 33.35598901098903
The average BMI of the northwest region is 29.199784615384626
The average BMI of the northeast region is 29.17350308641976
----------------------------------------------------------------------------------------------------
The average insurance costs of the southwest region is 12346.93737729231.
The average insurance costs of the southeast region is 14735.411437609895.
The average insurance costs of the northwest region is 12417.575373969228.
The average insurance costs of the northeast region is 13406.3845163858.
----------------------------------------------------------------------------------------------------
There is one smoker for almost every 5.6 people in the southwest region
There is one smoker for almost every 4.0 people in the southeast region
There is one smoker for almost every 5.6 people in the northwest region
There is one smoker for almost every 4.8 people in the northeast region

- There is a notable difference in average BMI between the northern regions (around 29) and the southern regions (over 30), with the Southeast highest at about 33.4.
- The average insurance charges range from about $12,300 to $14,700 across regions, with the Southeast having the highest.
- Smokers are slightly more common in the eastern regions, especially the Southeast, where 1 in 4 clients is a smoker, compared to about 1 in 5.6 in both western regions.

#### Figure out the average `bmi`, insurance `charges`, and `age`, and the share of smokers, for each `sex`:

In [289]:
for i in range(row_count):
    sex = insurance_dict['sex'][i]
    unique_sex_dict[sex]['total_bmi'] = unique_sex_dict[sex].get('total_bmi', 0) + insurance_dict['bmi'][i]
    unique_sex_dict[sex]['total_charges'] = unique_sex_dict[sex].get('total_charges', 0) + insurance_dict['charges'][i]
    unique_sex_dict[sex]['total_age'] = unique_sex_dict[sex].get('total_age', 0) + insurance_dict['age'][i]
    unique_sex_dict[sex]['total_smokers'] = unique_sex_dict[sex].get('total_smokers', 0) + (1 if insurance_dict['smoker'][i] == 'yes' else 0)

for key, value in unique_sex_dict.items():
    print("The average bmi for {sex}: {value}".format(sex=key, value=round(value['total_bmi'] / value['count'], ndigits=1)))
print('-' * horizontal_line_length)
for key, value in unique_sex_dict.items():
    print("The average insurance charges for {sex}: {charges}".format(sex=key, charges=round(value['total_charges'] / value['count'], ndigits=1)))
print('-' * horizontal_line_length)
for key, value in unique_sex_dict.items():
    print("The average age for {sex}: {age}".format(sex=key, age=round(value['total_age'] / value['count'], ndigits=1)))
print('-' * horizontal_line_length)
for key, value in unique_sex_dict.items():
    print("There is one smoker for almost every {number} in {sex}".format(number=round(value['count'] / value['total_smokers'], ndigits=1), sex=key))

The average bmi for female: 30.4
The average bmi for male: 30.9
----------------------------------------------------------------------------------------------------
The average insurance charges for female: 12569.6
The average insurance charges for male: 13956.8
----------------------------------------------------------------------------------------------------
The average age for female: 39.5
The average age for male: 38.9
----------------------------------------------------------------------------------------------------
There is one smoker for almost every 5.8 in female
There is one smoker for almost every 4.3 in male


- The average BMI is close to 30 for both sexes, with the male average slightly higher (30.9 against 30.4).
- On average, male clients are charged about $1,400 more than female clients ($13,957 vs. $12,570).
- The difference in average age between male and female clients is minimal, less than one year.
- Smoking is more common among male clients: about one smoker for every 4.3 male clients, compared to about one for every 5.8 female clients.
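
The per-group loops for `smoker`, `region`, and `sex` all follow the same pattern: add up a value per group, then divide by the group's count. One possible way to factor that pattern out is sketched below. The helper name `group_average` is hypothetical, and the sample uses only the first five preview rows printed earlier.

```python
from collections import defaultdict


def group_average(groups, values):
    """Return {group: average of the values that belong to that group}."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for group, value in zip(groups, values):
        totals[group] += value
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Sex and BMI of the first five preview rows printed earlier
sample_sex = ["female", "male", "male", "male", "male"]
sample_bmi = [27.9, 33.77, 33.0, 22.705, 28.88]

averages = group_average(sample_sex, sample_bmi)
print(averages)
```

With the notebook's data it might be called as `group_average(insurance_dict['sex'], insurance_dict['bmi'])`. Because the helper builds fresh totals on every call, re-running a cell would not add to totals left over from a previous run.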

TODO: go through the whole project again to polish if needed, and make sure the logic flow is smooth