<div id='title'>
<h1>Project: Exploratory Data Analysis of U.S. Medical Insurance Costs</h1>
</div>

<div id='table-of-contents'>
<h1>Table of Contents</h1>
<ul>
    <li><a href="#title">Title</a></li>
    <li><a href="#table-of-contents">Table of Contents</a></li>
    <li><a href="#introduction">Introduction</a></li>
    <ul>
        <li><a href="#project-description">Project Description</a></li>
        <li><a href="#project-objectives">Project Objectives</a></li>
        <li><a href="#personal-notes">Personal Notes</a></li>
        <ul>
            <li><a href="#challenges">Challenges</a></li>
            <li><a href="#learning-goals">Learning Goals</a></li>
        </ul>
        <li><a href="#project-setup">Project Setup</a></li>
    </ul>
    <li><a href="#initial-data-exploration">Initial Data Exploration</a></li>
    <ul>
        <li><a href="#quick-data-overview">Quick Data Overview</a></li>
        <li><a href="#data-dimensions">Data Dimensions</a></li>
        <li><a href="#handling-null-values">Handling Null Values</a></li>
        <li><a href="#handling-duplicate-values">Handling Duplicate Values</a></li>
        <li><a href="#data-dimensions-after-cleaning">Data Dimensions After Cleaning</a></li>
        <li><a href="#saving-the-cleaned-data">Saving the Cleaned Data</a></li>
    </ul>
    <li><a href="#exploratory-data-analysis">Exploratory Data Analysis</a></li>
    <ul>
        <li><a href="#data-types">Data Types</a></li>
        <li><a href="#univariate-analysis">Univariate Analysis</a></li>
        <ul>
            <li><a href="#univariate-descriptive-statistics">Descriptive Statistics</a></li>
            <li><a href="#univariate-histograms">Histograms</a></li>
            <li><a href="#univariate-bar-charts">Bar Charts</a></li>
            <li><a href="#univariate-interpreation">Interpretation of Univariate Analysis</a></li>
            <ul>
                <li><a href="#univariate-interpretation-categorical">Numerical Variables</a></li>
                <ul>
                    <li><a href="#univariate-numerical-interpretation-age">Age</a></li>
                    <li><a href="#univariate-numerical-interpretation-bmi">BMI</a></li>
                    <li><a href="#univariate-numerical-interpretation-children">Children</a></li>
                    <li><a href="#univariate-numerical-interpretation-charges">Charges</a></li>
                </ul>
                <li><a href="#univariate-categorical-interpretation">Categorical Variables</a></li>
                <ul>
                    <li><a href="#univariate-categorical-interpretation-sex">Sex</a></li>
                    <li><a href="#univariate-categorical-interpretation-smoker">Smoker</a></li>
                    <li><a href="#univariate-categorical-interpretation-region">Region</a></li>
                </ul>
            </ul>
        </ul>
        <li><a href="#bivariate-analysis">Bivariate Analysis</a></li>
        <ul>
            <li><a href="#bivariate-charges-versus-age">Charges versus Age</a></li>
            <li><a href="#bivariate-charges-versus-bmi">Charges versus BMI</a></li>
            <li><a href="#bivariate-charges-versus-children">Charges versus Children</a></li>
            <li><a href="#bivariate-charges-versus-sex">Charges versus Sex</a></li>
            <li><a href="#bivariate-charges-versus-smoker">Charges versus Smoker</a></li>
            <li><a href="#bivariate-charges-versus-region">Charges versus Region</a></li>
        </ul>
        <li><a href="#summary-of-the-exploratory-data-analysis">Summary of the Exploratory Data Analysis</a></li>
    </ul>
    <li><a href="#conclusions">Conclusions</a></li>
</ul>
</div>

<div id='introduction'>
<h1>Introduction</h1>
</div>

<div id='project-description'>
<h2>Project Description</h2>
<p>(Provided by Codecademy)</p>
<p>For this project, you will be investigating a medical insurance costs dataset in a .csv file using the Python skills that you've developed. This dataset and its parameters will seem familiar if you've done any of the previous Python projects in the data science path.</p>
<p>However, you're now tasked with working with the actual information in the dataset and performing your own independent analysis on real-world data! We will not be providing step-by-step instructions on what to do, but we will provide you with a framework to structure your exploration and analysis.</p>
</div>

<div id='project-objectives'>
<h2>Project Objectives</h2>
<p>(Provided by Codecademy)</p>
<ul>
    <li>Work locally on your own computer</li>
    <li>Import a dataset into a Jupyter Notebook</li>
    <li>Analyze a dataset by building out analysis</li>
    <li>Share your analysis in a blog post</li>
    <li>Optional: Document and organize your findings</li>
    <li>Optional: Make predictions about a dataset's features based on your findings</li>
</ul>
</div>

<div id='personal-notes'>
<h2>Personal Notes</h2>
</div>

<div id='challenges'>
<h3>Challenges</h3>
<ul>
    <li>This is intended to be a "growing" project. The goal is to improve it with time as I learn more about data science.</li>
    <li>This is my first experience working with Git branches, which presents a learning curve for managing different versions of the project effectively.</li>
</ul>
</div>

<div id='learning-goals'>
<h3>Learning Goals</h3>
<ul>
    <li>Enhance my skills in exploratory data analysis, focusing on the variables and objectives specified by the project description.</li>
    <li>Learn the basics of Git branching to manage the different stages and updates of this project. This ties in with the challenge noted above, should be a valuable skill for future projects, and will mostly be practiced through GitHub.</li>
</ul>
</div>

<div id='project-setup'>
<h2>Project Setup</h2>
<p>The next code cell imports the necessary libraries and loads the dataset into a dataframe.</p>
<p>If the cell prints "The source file was not found", the dataset could not be read from the expected path. The project expects the directory structure presented in the GitHub repository; changing that structure might break this cell.</p>
</div>

In [1]:
# WARNING: This cell *needs* to be run *first* for the rest of the notebook to work.
import pandas as pd
import matplotlib.pyplot as plt
import scipy

# Define the paths to the data files
data_path = '../data'
raw_data_path = data_path + '/raw'
processed_data_path = data_path + '/processed'

# Define the path to the source file
source_file_path = raw_data_path + '/raw.csv'

# Error handling for the source file
try:
    # Load the source file into a dataframe
    main_df = pd.read_csv(source_file_path)
except FileNotFoundError:
    print('The source file was not found')
else:
    print(main_df.head())
    
    # Library versions
    print('Versions:')
    print(f'Pandas version: {pd.__version__}')
    print(f'Matplotlib version: {plt.matplotlib.__version__}')
    print(f'SciPy version: {scipy.__version__}')

   age     sex     bmi  children smoker     region      charges
0   19  female  27.900         0    yes  southwest  16884.92400
1   18    male  33.770         1     no  southeast   1725.55230
2   28    male  33.000         3     no  southeast   4449.46200
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male  28.880         0     no  northwest   3866.85520
Versions:
Pandas version: 2.0.3
Matplotlib version: 3.7.2
SciPy version: 1.11.2
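
<p>As a side note, the same paths could also be built with <code>pathlib</code>, which makes it easy to report where the file was expected when it is missing. This is only a minimal sketch: the helper <code>describe_source</code> is a hypothetical name, and the layout simply mirrors the paths used in the setup cell.</p>

```python
from pathlib import Path

# Assumed layout, mirroring the string paths used in the setup cell
data_path = Path('..') / 'data'
source_file_path = data_path / 'raw' / 'raw.csv'


def describe_source(path):
    """Return a short status message for the expected source file (hypothetical helper)."""
    if path.is_file():
        return f'Found source file: {path}'
    # resolve() shows the absolute location that was actually checked
    return f'Source file not found: {path} (resolved to {path.resolve()})'


print(describe_source(source_file_path))
```

<p>Printing the resolved path shows which folder the notebook was actually started from, which is the usual cause of a missing file.</p>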


<div id='initial-data-exploration'>
<h1>Initial Data Exploration</h1>
</div>

<div id='quick-data-overview'>
<h2>Quick Data Overview</h2>
<p>Before diving into the analysis, it helps to look at the first and last few rows of the dataset. This shows how the records are laid out and what kind of values each column holds.</p>
</div>

In [2]:
# Display the first five rows of the dataframe
main_df.head()

   age     sex     bmi  children smoker     region      charges
0   19  female  27.900         0    yes  southwest  16884.92400
1   18    male  33.770         1     no  southeast   1725.55230
2   28    male  33.000         3     no  southeast   4449.46200
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male  28.880         0     no  northwest   3866.85520


In [3]:
# Display the last five rows of the dataframe
main_df.tail()

      age     sex    bmi  children smoker     region     charges
1333   50    male  30.97         3     no  northwest  10600.5483
1334   18  female  31.92         0     no  northeast   2205.9808
1335   18  female  36.85         0     no  southeast   1629.8335
1336   21  female  25.80         0     no  southwest   2007.9450
1337   61  female  29.07         0    yes  northwest  29141.3603


<p>At first glance, the dataset seems fairly clean. There are no obvious issues, and all the values shown fall within a reasonable range.</p>
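
<p>A visual check of ten rows only goes so far. One way to back up the impression is to compute the minimum and maximum of each numerical column and list the distinct labels of each text column. The sketch below is illustrative: <code>sample_df</code> is a hypothetical frame that contains only the five rows printed by the setup cell, not the full dataset.</p>

```python
import pandas as pd

# Illustrative sample only: the five rows printed by the setup cell
sample_df = pd.DataFrame({
    'age': [19, 18, 28, 33, 32],
    'sex': ['female', 'male', 'male', 'male', 'male'],
    'bmi': [27.9, 33.77, 33.0, 22.705, 28.88],
    'children': [0, 1, 3, 0, 0],
    'smoker': ['yes', 'no', 'no', 'no', 'no'],
    'region': ['southwest', 'southeast', 'southeast', 'northwest', 'northwest'],
    'charges': [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

# Minimum and maximum of each numerical column
numeric_ranges = sample_df.select_dtypes(include='number').agg(['min', 'max'])
print(numeric_ranges)

# Distinct labels of each non-numerical column, to spot typos or odd categories
for column in sample_df.select_dtypes(exclude='number').columns:
    print(column, sorted(sample_df[column].unique()))
```

<p>Applied to <code>main_df</code>, the same two steps would cover every row rather than just the ones displayed.</p>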

<div id='data-dimensions'>
<h2>Data Dimensions</h2>
<p>The shape of the dataset gives its number of rows and columns. Knowing the size up front helps in anticipating how much memory and processing time the analysis will need.</p>
</div>

In [4]:
# Display the shape of the dataframe
print(main_df.shape)

(1338, 7)


<p>The dataset consists of 1338 rows and seven columns. This is a fairly small dataset, which should be easy to work with.</p>
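
<p>The shape says how many cells there are, but not how much memory they take. A rough figure can be obtained with <code>memory_usage</code>. The sketch below uses a small hypothetical frame (<code>sample_df</code>, three columns from the first rows shown earlier) so that it runs on its own.</p>

```python
import pandas as pd

# Illustrative sample only: three columns from the first rows shown earlier
sample_df = pd.DataFrame({
    'age': [19, 18, 28],
    'sex': ['female', 'male', 'male'],
    'charges': [16884.924, 1725.5523, 4449.462],
})

n_rows, n_columns = sample_df.shape

# deep=True also counts the strings stored in text columns
memory_per_column = sample_df.memory_usage(deep=True)
total_bytes = int(memory_per_column.sum())

print(f'{n_rows} rows x {n_columns} columns, {total_bytes} bytes in memory')
```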

<div id='handling-null-values'>
<h2>Handling Null Values</h2>
<p>Null values can cause data analysis and models to misbehave. It's important to know if there are any null values in the dataset and how to handle them.</p>
</div>

In [5]:
# Check for null values in the dataframe
print(main_df.isnull().sum())

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64


<p>There are no null values in the dataset. This is good news, as it means that there is no need to handle null values.</p>
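
<p>For reference, this is roughly what handling would look like if null values had been present. The rows in <code>toy_df</code> are made up for illustration and are not part of the dataset; dropping rows and filling gaps are two common options, not the only ones.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical rows with gaps; the real dataset has no null values
toy_df = pd.DataFrame({
    'age': [19, 18, np.nan],
    'smoker': ['yes', None, 'no'],
    'charges': [16884.924, 1725.5523, 4449.462],
})

# Count the null values per column
print(toy_df.isnull().sum())

# Option 1: drop every row that has at least one null value
dropped_df = toy_df.dropna()

# Option 2: fill the gaps, here with the median for numbers and a label for text
filled_df = toy_df.fillna({'age': toy_df['age'].median(), 'smoker': 'unknown'})
print(filled_df)
```

<p>Dropping rows is simplest but loses data, while filling keeps every row at the cost of introducing assumed values.</p>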

<div id='handling-duplicate-values'>
<h2>Handling Duplicate Values</h2>
<p>Duplicate entries can skew the results of an analysis, because the same record is counted more than once. It’s important to know if there are any duplicate rows in the dataset and how to handle them.</p>
</div>

In [6]:
# Display duplicated rows (every occurrence after the first)
print(main_df[main_df.duplicated()])

     age   sex    bmi  children smoker     region    charges
581   19  male  30.59         0     no  northwest  1639.5631


<p>There is a single duplicated row in the dataset. In principle it could be a legitimate second record, since the dataset has no names or IDs to tell individuals apart. However, every column matches, including a charges value given to four decimal places, which is too specific to be a coincidence. For this reason, the duplicate row will be removed and only the first instance will be kept.</p>
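
<p>By default, <code>duplicated()</code> flags only the repeated occurrences, so the original row is not shown next to its copy. Passing <code>keep=False</code> flags every member of a duplicate group, which makes the comparison easier. The sketch below uses a hypothetical <code>toy_df</code> that contains the duplicated row twice plus one unrelated row.</p>

```python
import pandas as pd

# Toy data: the duplicated row from the output above, entered twice,
# plus one unrelated row from the head of the dataset
toy_df = pd.DataFrame({
    'age': [19, 18, 19],
    'sex': ['male', 'male', 'male'],
    'bmi': [30.59, 33.77, 30.59],
    'children': [0, 1, 0],
    'smoker': ['no', 'no', 'no'],
    'region': ['northwest', 'southeast', 'northwest'],
    'charges': [1639.5631, 1725.5523, 1639.5631],
})

# keep=False flags every member of a duplicate group, not just the repeats
all_copies = toy_df[toy_df.duplicated(keep=False)]
print(all_copies)
```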

In [7]:
# Remove duplicate rows, keeping the first occurrence of each
main_df = main_df.drop_duplicates(keep='first')

<div id='data-dimensions-after-cleaning'>
<h2>Data Dimensions After Cleaning</h2>
<p>After cleaning the dataset, it's important to check the dimensions again. This will help determine if the cleaning process was successful. In this instance, the number of rows should be reduced by one.</p>
</div>

In [8]:
# Display the shape of the dataframe
print(main_df.shape)

(1337, 7)


<p>The number of rows dropped from 1338 to 1337, which confirms that the duplicate row was removed.</p>
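
<p>Comparing shapes by eye works for a single duplicate. The same check can also be written as assertions, so that it fails loudly if the counts ever disagree. This is a minimal sketch on a hypothetical two-column <code>toy_df</code>, not the notebook's actual cleaning step.</p>

```python
import pandas as pd

# Toy data with one repeated row (illustrative values only)
toy_df = pd.DataFrame({
    'age': [19, 18, 19],
    'charges': [1639.5631, 1725.5523, 1639.5631],
})

rows_before = len(toy_df)
expected_removed = int(toy_df.duplicated().sum())

cleaned_df = toy_df.drop_duplicates()

# The row count should drop by exactly the number of flagged duplicates
assert len(cleaned_df) == rows_before - expected_removed
# No duplicates should remain after cleaning
assert not cleaned_df.duplicated().any()

print(f'Removed {expected_removed} row(s); {len(cleaned_df)} remain')
```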