# **(Healthcare Insurance Cost Analysis Assessment)**

## Objectives

* Write your notebook objective here, for example, "Fetch data from Kaggle and save as raw data", or "engineer features for modelling"

## Inputs

* Write down which data or information you need to run the notebook 

## Outputs

* Write here which files, code or artefacts you generate by the end of the notebook 

## Additional Comments

* If you have any additional comments that don't fit in the previous bullets, please state them here. 



---

# Change working directory

* We are assuming you will store the notebooks in a subfolder, therefore when running the notebook in the editor, you will need to change the working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [117]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\Nine\\OneDrive'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [118]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [119]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\Nine'
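
The same step can be sketched with `pathlib`, which returns path objects instead of strings. This is only an illustration: `parent_directory` is a hypothetical helper name, not something the notebook defines, and the `os.chdir` call is left commented out so the sketch has no side effects.

```python
import os
from pathlib import Path


def parent_directory(path):
    """Return the parent folder of `path` as a Path object."""
    return Path(path).resolve().parent


# Parent of the current working directory.
target = parent_directory(os.getcwd())

# Uncomment to actually switch to the parent folder.
# os.chdir(target)
print(target)
```
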

# Section 1

Import the libraries used in this notebook.

In [120]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns



---

# Section 2

Load the raw data, clean it, and create BMI and age categories.

In [121]:
# Load raw data
df = pd.read_csv("C:\\Users\\Nine\\OneDrive\\Documents\\VS Code Projects\\archive\\insurance.csv")

# Preview the first five rows
print(df.head())



   age     sex     bmi  children smoker     region      charges
0   19  female  27.900         0    yes  southwest  16884.92400
1   18    male  33.770         1     no  southeast   1725.55230
2   28    male  33.000         3     no  southeast   4449.46200
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male  28.880         0     no  northwest   3866.85520
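
A load step like this fails with a long traceback when the file path is wrong. The sketch below wraps `pd.read_csv` in a small check that fails early with a clear message. The helper name `load_csv` and the demo file name are assumptions made for illustration; the demo file holds only the first two rows shown above.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_csv(path):
    """Read a CSV, raising a clear error if the file does not exist."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"CSV not found: {path}")
    return pd.read_csv(path)


# Demo with a throwaway file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "insurance_demo.csv"  # hypothetical file name
    demo_path.write_text(
        "age,sex,bmi,children,smoker,region,charges\n"
        "19,female,27.9,0,yes,southwest,16884.924\n"
        "18,male,33.77,1,no,southeast,1725.5523\n"
    )
    demo = load_csv(demo_path)

print(demo.shape)
```
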


In [122]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


In [123]:
# Check for missing values (print the counts so the result is displayed)
print(df.isnull().sum())

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64
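
A missing-value check only helps if its result is actually displayed. The sketch below shows the pattern on a toy frame with deliberately missing entries; the data are invented for illustration and are not taken from the insurance file.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value in each column.
toy = pd.DataFrame(
    {
        "age": [19, np.nan, 28],
        "smoker": ["yes", "no", None],
    }
)

# Count missing values per column and in total.
missing_per_column = toy.isnull().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print("Total missing:", total_missing)
```
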


In [124]:
# Check for duplicates.
print(df.duplicated())

# The rows shown are all False, but the printed output is truncated, so this is not conclusive.
# To be sure, we will normalise the text columns so that stray spaces or casing cannot make identical values look different.


0       False
1       False
2       False
3       False
4       False
        ...  
1333    False
1334    False
1335    False
1336    False
1337    False
Length: 1338, dtype: bool
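
Printing the full boolean mask hides most rows behind the `...`, so a duplicate in the middle would go unnoticed. A minimal sketch of summarising the mask instead, using a toy frame invented for illustration:

```python
import pandas as pd

# Toy frame in which the third row repeats the first.
toy = pd.DataFrame(
    {
        "age": [19, 18, 19],
        "region": ["northwest", "southeast", "northwest"],
    }
)

mask = toy.duplicated()

# Summaries are not affected by display truncation.
has_duplicates = bool(mask.any())
duplicate_count = int(mask.sum())

print(has_duplicates, duplicate_count)
```
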


In [125]:
# Make all text columns lowercase and strip surrounding whitespace
for col in df.select_dtypes(include='object'):
    df[col] = df[col].str.strip().str.lower()
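
To illustrate why this normalisation matters for duplicate detection, the sketch below uses a toy frame (invented data) in which two rows differ only by casing and padding. The text columns are listed explicitly here, which is an assumption made for the demo rather than the notebook's approach of selecting columns by dtype.

```python
import pandas as pd

# Two rows that describe the same record but differ in casing and spaces.
toy = pd.DataFrame(
    {
        "sex": ["male", " Male "],
        "region": ["northwest", "NorthWest"],
        "age": [19, 19],
    }
)

before = int(toy.duplicated().sum())

# Normalise the text columns (listed by name for this demo).
text_cols = ["sex", "region"]
for col in text_cols:
    toy[col] = toy[col].str.strip().str.lower()

after = int(toy.duplicated().sum())

print(before, after)
```
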

In [126]:
print(df.duplicated())

0       False
1       False
2       False
3       False
4       False
        ...  
1333    False
1334    False
1335    False
1336    False
1337    False
Length: 1338, dtype: bool


In [127]:
# Round all float columns to 2 decimal places
for col in df.select_dtypes(include='float'):
    df[col] = df[col].round(2)
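
Rounding merges values that differ only beyond the second decimal, so it can turn near-identical rows into exact duplicates. The sketch below demonstrates this effect on invented numbers; it does not claim that this is what happened in the insurance data.

```python
import pandas as pd

# Two rows that differ only in the third and fourth decimal places.
toy = pd.DataFrame(
    {
        "bmi": [30.590, 30.594],
        "charges": [1639.5631, 1639.5649],
    }
)

before = int(toy.duplicated().sum())

# Round every column to 2 decimal places.
rounded = toy.round(2)
after = int(rounded.duplicated().sum())

print(before, after)
```
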

In [128]:
print(df.duplicated().sum())

# The count now shows that there is 1 duplicate row

1


In [129]:
# Show every row involved in the duplication (keep=False marks all copies)
df[df.duplicated(keep=False)]

     age   sex    bmi  children  smoker     region  charges
195   19  male  30.59         0      no  northwest  1639.56
581   19  male  30.59         0      no  northwest  1639.56


In [130]:
df[df.duplicated()]

     age   sex    bmi  children  smoker     region  charges
581   19  male  30.59         0      no  northwest  1639.56
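
The difference between the two calls comes from the `keep` parameter of `duplicated()`. A minimal sketch on a toy frame (invented data) showing both views and the effect of `drop_duplicates()`:

```python
import pandas as pd

# Toy frame in which row 2 repeats row 0.
toy = pd.DataFrame(
    {
        "age": [19, 25, 19],
        "charges": [1639.56, 2000.00, 1639.56],
    }
)

# keep=False marks every copy, including the first occurrence.
all_copies = toy[toy.duplicated(keep=False)]

# The default keep='first' marks only the later copies.
later_copies = toy[toy.duplicated()]

# drop_duplicates() removes exactly the rows marked by the default.
deduplicated = toy.drop_duplicates()

print(list(all_copies.index), list(later_copies.index), list(deduplicated.index))
```
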


In [131]:
#We will now delete the duplicate

df = df.drop_duplicates()

In [132]:
print(df.duplicated().sum())

#now we can see that there are no more duplicates

0


In [133]:
# Convert categorical columns to category dtype
categorical_cols = ['sex', 'smoker', 'region']
for col in categorical_cols:
    df[col] = df[col].astype('category')

In [134]:
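df[['sex','smoker','region']].value_counts()

# Only the expected labels appear, so there are no misspelt values in the categorical columns.
# Missing values were already ruled out by df.info(), since value_counts() does not count them by default.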

sex     smoker  region   
female  no      southwest    141
                southeast    139
                northwest    135
male    no      southeast    134
female  no      northeast    132
male    no      northwest    131
                southwest    126
                northeast    125
        yes     southeast     55
                northeast     38
                southwest     37
female  yes     southeast     36
                northeast     29
                northwest     29
male    yes     northwest     29
female  yes     southwest     21
Name: count, dtype: int64
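
Reading a table of counts to spot misspelt labels works for a small dataset, but the check can also be made explicit. The sketch below compares each column against a set of allowed labels; the allowed sets are taken from the output above, while the toy frame with the deliberate typo `femal` is invented for illustration.

```python
import pandas as pd

# Allowed labels per column, as seen in the value counts above.
expected = {
    "sex": {"female", "male"},
    "smoker": {"yes", "no"},
    "region": {"southwest", "southeast", "northwest", "northeast"},
}

# Toy frame containing one deliberate typo.
toy = pd.DataFrame(
    {
        "sex": ["male", "femal"],
        "smoker": ["no", "yes"],
        "region": ["southwest", "northeast"],
    }
)

# Any label outside the allowed set is reported per column.
unexpected = {
    col: set(toy[col].dropna().unique()) - allowed
    for col, allowed in expected.items()
}

print(unexpected)
```
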

In [135]:
df.describe()


             age        bmi  children       charges
count     1337.0     1337.0    1337.0        1337.0
mean   39.222139  30.663628  1.095737  13279.121503
std    14.044333   6.100233  1.205571  12110.359677
min         18.0      15.96       0.0       1121.87
25%         27.0      26.29       0.0       4746.34
50%         39.0       30.4       1.0       9386.16
75%         51.0       34.7       2.0      16657.72
max         64.0      53.13       5.0      63770.43


In [136]:
#The max BMI may be a mistake or incorrect input
df[df['bmi'] == df['bmi'].max()]


      age   sex    bmi  children  smoker     region  charges
1317   18  male  53.13         0      no  southeast  1163.46


In [137]:
df.sort_values(by='bmi', ascending=False).head(5)[['age','bmi','charges','smoker','region']]

# The top row looks suspicious: a BMI of 53.13 is very high for an 18-year-old non-smoker with low charges (1163.46).
# This may be a data entry error.


      age    bmi   charges  smoker     region
1317   18  53.13   1163.46      no  southeast
1047   22  52.58   44501.4     yes  southeast
847    23  50.38   2438.06      no  southeast
116    58  49.06  11381.33      no  southeast
286    46  48.07   9432.93      no  northeast


In [138]:
# Flag the suspicious BMI as an outlier (only the maximum, 53.13, exceeds the threshold of 53)
df['is_bmi_outlier'] = df['bmi'] > 53


# Count the flagged outliers
df['is_bmi_outlier'].value_counts()

is_bmi_outlier
False    1336
True        1
Name: count, dtype: int64
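
The threshold of 53 is chosen by hand to isolate one row. A common alternative, shown here only as a sketch and not as the notebook's method, is the interquartile range (IQR) rule with the conventional multiplier of 1.5. The helper name `iqr_outlier_mask` and the toy values are assumptions made for illustration.

```python
import pandas as pd


def iqr_outlier_mask(series, k=1.5):
    """Mark values more than k * IQR below Q1 or above Q3."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Toy BMI values with one extreme entry at the end.
toy_bmi = pd.Series([22.7, 27.9, 28.9, 30.4, 33.0, 33.8, 53.1])
mask = iqr_outlier_mask(toy_bmi)

print(toy_bmi[mask])
```

With the quartiles reported by `df.describe()` (26.29 and 34.7), the upper fence would be 34.7 + 1.5 × 8.41 ≈ 47.32, so this rule would flag more rows than the fixed threshold of 53.
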

In [139]:
# Create BMI category
def bmi_category(bmi):
    if bmi < 18.5:
        return 'Underweight'
    elif bmi < 25:
        return 'Normal'
    elif bmi < 30:
        return 'Overweight'
    else:
        return 'Obese'

df['bmi_category'] = df['bmi'].apply(bmi_category)

In [140]:
#Checking underweight category
df[df['bmi'] < 18.5][['bmi', 'bmi_category']].head()


       bmi  bmi_category
28   17.39   Underweight
128  17.76   Underweight
172  15.96   Underweight
198  18.05   Underweight
232   17.8   Underweight
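
The same binning can be sketched with `pd.cut`, which assigns categories without a row-by-row `apply`. This is an alternative shown for comparison, not the notebook's method. The bin edges mirror the thresholds in `bmi_category`, and `right=False` makes each bin include its lower edge, matching the `<` comparisons. Note that, unlike the function, this version yields a missing value for a negative BMI.

```python
import numpy as np
import pandas as pd

# Bin edges and labels mirroring bmi_category.
bins = [0, 18.5, 25, 30, np.inf]
labels = ["Underweight", "Normal", "Overweight", "Obese"]

# Sample values, including the exact boundaries.
toy_bmi = pd.Series([15.96, 18.5, 24.99, 25.0, 30.0, 53.13])

# right=False gives intervals [0, 18.5), [18.5, 25), [25, 30), [30, inf).
categories = pd.cut(toy_bmi, bins=bins, labels=labels, right=False)

print(list(categories))
```
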


In [141]:
# Create age category
def age_category(age):
    if age < 25:
        return '18-24'
    elif age < 35:
        return '25-34'
    elif age < 45:
        return '35-44'
    elif age < 55:
        return '45-54'
    else:
        return '55+'

df['age_category'] = df['age'].apply(age_category)


In [142]:
# Checking age category
df[df['age'] < 25][['age', 'age_category']].head()

    age  age_category
0    19         18-24
1    18         18-24
12   23         18-24
15   19         18-24
17   23         18-24
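
Looking at the first five rows only confirms the youngest band. A boundary check covers every band at once. In the sketch below, the logic of `age_category` is restated so the block is self-contained, and the dictionary `boundary_cases` is an assumption introduced for the test. The ages 18 and 64 are the minimum and maximum reported by `df.describe()`.

```python
def age_category(age):
    """Same banding logic as the notebook's age_category."""
    if age < 25:
        return '18-24'
    elif age < 35:
        return '25-34'
    elif age < 45:
        return '35-44'
    elif age < 55:
        return '45-54'
    else:
        return '55+'


# Ages on either side of each boundary and the label each should get.
boundary_cases = {
    18: '18-24', 24: '18-24',
    25: '25-34', 34: '25-34',
    35: '35-44', 44: '35-44',
    45: '45-54', 54: '45-54',
    55: '55+', 64: '55+',
}

# Collect any age whose label differs from the expected one.
mismatches = {
    age: age_category(age)
    for age, expected_label in boundary_cases.items()
    if age_category(age) != expected_label
}

print(mismatches)
```

One caveat: the first branch labels any age under 25 as `18-24`, including values below 18. That is harmless here because the minimum age in the data is 18.
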


---

NOTE

* You may add as many sections as you want, as long as it supports your project workflow.
* All notebook's cells should be run top-down (you can't create a dynamic wherein a given point you need to go back to a previous cell to execute some task, like go back to a previous cell and refresh a variable content)

---

# Push files to Repo

* In cases where you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [143]:
import os
try:
  # create your folder here
  # os.makedirs(name='')
  pass  # placeholder so the try block is valid until a folder name is added
except Exception as e:
  print(e)
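
A minimal sketch of the folder-creation step, assuming a nested output folder is wanted. The folder names `outputs` and `cleaned` are hypothetical, and a temporary directory is used so the demo leaves nothing behind. Passing `exist_ok=True` lets the cell be re-run without raising an error when the folder already exists.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as base:
    # Hypothetical folder names, chosen only for this demo.
    output_dir = os.path.join(base, "outputs", "cleaned")

    # Creates intermediate folders as needed.
    os.makedirs(output_dir, exist_ok=True)

    # A second call does not raise because exist_ok=True.
    os.makedirs(output_dir, exist_ok=True)

    created = os.path.isdir(output_dir)

print(created)
```
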