# U.S. Medical Insurance Costs
In this project, I investigate a CSV file of medical insurance costs using Python fundamentals. The goal is to analyze the attributes in **insurance.csv** to learn more about the patients in the file and to identify potential use cases for the dataset.

In [None]:
import csv

The cell above imports the only library this analysis requires. I will use the `csv` module from the Python standard library to process the **insurance.csv** data. Other libraries could enhance the analysis, but I will stick with `csv` for simplicity.

Before diving into the code, let's examine **insurance.csv** to understand its structure. I will analyze these key aspects:

- Column and row structure
- Data completeness
- Data types (numeric and categorical variables)

I will create seven empty lists to store each data column from **insurance.csv**:

In [None]:
ages = []
sexes = []
bmis = []
num_children = []
smoker_statuses = []
regions = []
insurance_charges = []

The dataset contains these fields:

- Age of patient
- Gender
- Body Mass Index (BMI)
- Number of dependent children
- Smoking status
- Geographic region in U.S.
- Annual medical insurance charges

Initial inspection shows no missing values, so each column can be stored in its own list without any cleaning.
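The structural checks listed earlier (columns, row count, completeness) can also be done in code rather than by eye. The following is a minimal sketch, not part of the original notebook; `inspect_csv` is a hypothetical helper name, and it assumes that a missing value shows up as an empty or absent cell.

```python
import csv

def inspect_csv(csv_file):
    """Return the column names, the row count, and a per-column count of empty cells."""
    with open(csv_file, newline='') as csv_info:
        reader = csv.DictReader(csv_info)
        columns = list(reader.fieldnames or [])
        missing = {column: 0 for column in columns}
        row_count = 0
        for row in reader:
            row_count += 1
            for column in columns:
                value = row[column]
                # DictReader fills short rows with None; blank strings count as missing too
                if value is None or value.strip() == '':
                    missing[column] += 1
    return columns, row_count, missing
```

Calling `inspect_csv('insurance.csv')` would return the three values as a tuple; a column whose missing count is above zero would need cleaning before numeric conversion.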


Here's a helper function to streamline data loading:

In [None]:
def load_list_data(lst, csv_file, column_name):
    # open csv file (the csv module docs recommend newline='' for files passed to a reader)
    with open(csv_file, newline='') as csv_info:
        # read the data from the csv file
        csv_dict = csv.DictReader(csv_info)
        # loop through the data in each row of the csv
        for row in csv_dict:
            # add the data from each row to a list
            lst.append(row[column_name])
    # return the list
    return lst

This function keeps the data loading code concise. Instead of writing separate code for each column, we can reuse it for every column, at the cost of reading the file once per call.

Now we can load each column of data:

In [None]:
load_list_data(ages, 'insurance.csv', 'age')
load_list_data(sexes, 'insurance.csv', 'sex')
load_list_data(bmis, 'insurance.csv', 'bmi')
load_list_data(num_children, 'insurance.csv', 'children')
load_list_data(smoker_statuses, 'insurance.csv', 'smoker')
load_list_data(regions, 'insurance.csv', 'region')
load_list_data(insurance_charges, 'insurance.csv', 'charges')
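Each call to `load_list_data` reopens and rereads the file, so the seven calls above read **insurance.csv** seven times. For a file this small that is harmless, but a single-pass alternative is easy to sketch. `load_columns` below is a hypothetical helper, not part of the original notebook, and it returns a dictionary of lists instead of filling lists passed in by the caller.

```python
import csv

def load_columns(csv_file, column_names):
    """Read the file once and return a dict mapping each column name to a list of strings."""
    columns = {name: [] for name in column_names}
    with open(csv_file, newline='') as csv_info:
        for row in csv.DictReader(csv_info):
            for name in column_names:
                columns[name].append(row[name])
    return columns
```

With this version, `load_columns('insurance.csv', ['age', 'sex', 'bmi'])` would give the same string values as the individual calls, keyed by column name.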

With the data from **insurance.csv** now organized into labeled lists, we can begin the analysis. Of the many questions this dataset could answer, I will focus on the following:

- Calculate the mean patient age
- Analyze gender distribution in the dataset
- Identify the unique geographic regions represented
- Calculate mean annual medical expenses
- Compile comprehensive patient records into a dictionary

To execute these analyses efficiently, I have developed a `PatientsInfo` class with five specialized methods:

- `analyze_ages()`
- `analyze_sexes()`
- `unique_regions()`
- `average_charges()`
- `create_dictionary()`

The implementation of this class follows below.

In [21]:
class PatientsInfo:
    def __init__(self, patients_ages, patients_sexes, patients_bmis, patients_num_children, 
                 patients_smoker_statuses, patients_regions, patients_charges):
        self.patients_ages = patients_ages
        self.patients_sexes = patients_sexes
        self.patients_bmis = patients_bmis
        self.patients_num_children = patients_num_children
        self.patients_smoker_statuses = patients_smoker_statuses
        self.patients_regions = patients_regions
        self.patients_charges = patients_charges

    def analyze_ages(self):
        total_age = sum(int(age) for age in self.patients_ages)
        average_age = total_age / len(self.patients_ages)
        return f"Average Patient Age: {round(average_age, 2)} years"

    def analyze_sexes(self):
        females = sum(1 for sex in self.patients_sexes if sex == 'female')
        males = sum(1 for sex in self.patients_sexes if sex == 'male')
        print(f"Count for female: {females}")
        print(f"Count for male: {males}")

    def unique_regions(self):
        return list(set(self.patients_regions))

    def average_charges(self):
        total_charges = sum(float(charge) for charge in self.patients_charges)
        average = total_charges / len(self.patients_charges)
        return f"Average Yearly Medical Insurance Charges: {round(average, 2)} dollars."
    
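    def create_dictionary(self):
        # convert the numeric columns from CSV strings so every field has a usable type
        return {
            "age": [int(age) for age in self.patients_ages],
            "sex": self.patients_sexes,
            "bmi": [float(bmi) for bmi in self.patients_bmis],
            "children": [int(children) for children in self.patients_num_children],
            "smoker": self.patients_smoker_statuses,
            "regions": self.patients_regions,
            "charges": [float(charge) for charge in self.patients_charges]
        }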

Let's create an instance of the class called `patient_info` to run the analysis methods and examine the results.

In [24]:
# Create instance of PatientsInfo class with our data
patient_info = PatientsInfo(ages, sexes, bmis, num_children, smoker_statuses, regions, insurance_charges)

In [26]:
patient_info.analyze_ages()

'Average Patient Age: 39.21 years'

The average patient age in **insurance.csv** is approximately 39 years. Summary statistics like this are a first check on whether the dataset resembles the broader population; before extrapolating findings to other populations, we must make sure the dataset is sufficiently comprehensive and diverse.

The range and standard deviation of the ages would describe how widely they are spread. Judging whether the sample is representative would additionally require comparing that distribution against a reference population.
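The range and standard deviation can be computed with the standard library alone. This is a minimal sketch rather than part of the `PatientsInfo` class; `summarize_ages` is a hypothetical helper, and it assumes the ages are still the strings read from the CSV. It uses `statistics.stdev`, which is the sample standard deviation and needs at least two values.

```python
import statistics

def summarize_ages(ages):
    """Return the min, max, range, mean, and sample standard deviation of the ages."""
    values = [int(age) for age in ages]  # the CSV values arrive as strings
    return {
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
    }
```

Calling `summarize_ages(ages)` on the list loaded earlier would give the spread of patient ages alongside the mean reported above.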

In [32]:
patient_info.analyze_sexes()

Count for female: 662
Count for male: 676


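Examining the gender distribution in **insurance.csv** matters for assessing population representation. With 662 female and 676 male patients (about 49.5% and 50.5% of the 1,338 records), the two groups are nearly balanced. Balance of this kind is particularly useful when developing classification models, since heavily skewed attributes can bias what a model learns.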

In real-world scenarios, data often shows imbalances that can complicate statistical analysis. We'll explore these challenges in upcoming portfolio projects.

In [38]:
patient_info.unique_regions()

['southwest', 'northwest', 'southeast', 'northeast']

The dataset encompasses four distinct geographical regions within the United States.
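`unique_regions()` lists the regions but not how many patients fall in each, and because it builds the list from a `set`, the order of the result can differ between runs. A possible extension, sketched here with a hypothetical `patients_per_region` helper that is not part of the original class, counts the patients per region and sorts the result by name so the output is stable.

```python
from collections import Counter

def patients_per_region(regions):
    """Count patients in each region, returned as (region, count) pairs sorted by region name."""
    return sorted(Counter(regions).items())
```

Calling `patients_per_region(regions)` on the list loaded earlier would return one pair per region.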

In [47]:
patient_info.average_charges()

'Average Yearly Medical Insurance Charges: 13270.42 dollars.'

Our analysis shows the average annual medical insurance charge per person is approximately $13,270. Future work could explore relationships between patient characteristics and insurance costs, such as whether charges tend to rise with age.
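One way to start on the age question is the Pearson correlation coefficient between age and charges. The sketch below is an illustration under stated assumptions, not part of the original notebook: `pearson_correlation` is a hypothetical helper, it accepts the string values read from the CSV, and it measures only linear association, so a value near zero would not rule out other kinds of relationship.

```python
import math

def pearson_correlation(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length numeric sequences."""
    xs = [float(x) for x in xs]  # accept the raw CSV strings
    ys = [float(y) for y in ys]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    # raises ZeroDivisionError if either sequence is constant
    return covariance / math.sqrt(variance_x * variance_y)
```

Calling `pearson_correlation(ages, insurance_charges)` would return a value between -1 and 1, where values closer to 1 indicate that charges tend to increase with age.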

In [None]:
patient_info.create_dictionary()

The patient information is now stored in a single dictionary keyed by attribute, which makes further analysis of the **insurance.csv** attributes easier.
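The dictionary returned by `create_dictionary()` is organized by column: each key maps to a list covering all patients. Some analyses are easier with one record per patient instead. The conversion can be sketched as follows; `columns_to_records` is a hypothetical helper, and it assumes every list in the dictionary has the same length.

```python
def columns_to_records(columns):
    """Turn a dict of equal-length lists into a list of per-row dicts."""
    keys = list(columns)
    # zip(*...) walks the column lists in parallel, yielding one tuple per patient
    return [dict(zip(keys, values)) for values in zip(*columns.values())]
```

Calling `columns_to_records(patient_info.create_dictionary())` would give a list in which each element holds a single patient's attributes.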