## 2. Transformation

In [1]:
import numpy as np
import pandas as pd


The cleaned file is now in use. The categorical columns will be encoded as integers to allow further analysis of the data set.

In [3]:


df = pd.read_csv('/Users/nataliewaugh/Documents/DataCode/Healthcare-Insurance-Cost-Analysis/jupyter_notebooks/data_raw/insurance_cleaned.csv')
df.sample(5)

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
227,58,female,41.91,0,no,southeast,24227.33724
1326,51,male,30.03,1,no,southeast,9377.9047
1022,18,male,23.32,1,no,southeast,1711.0268
1210,39,male,34.1,2,no,southeast,23563.01618
367,42,female,24.985,2,no,northwest,8017.06115


In [4]:
df['region'].value_counts()

region
southeast    364
southwest    325
northwest    324
northeast    324
Name: count, dtype: int64

Checking the data types to confirm which columns can be analysed as they are.

In [20]:
df.dtypes


age           int64
sex          object
bmi         float64
children      int64
smoker       object
region       object
charges     float64
dtype: object

The sex, smoker and region columns are of type object, which is the type pandas uses for strings; here they hold categorical data. They need to be encoded (e.g., using label encoding or one-hot encoding) to convert them into a numerical format suitable for analysis and machine learning models.
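As a rough illustration of the two encoding options named above, the sketch below applies both to a small made-up series. The `region` series and its values are hypothetical stand-ins, not rows from the data set, and `astype("category").cat.codes` is only one of several ways to label-encode in pandas.

```python
import pandas as pd

# Hypothetical toy series; the values mirror category names seen in the data.
region = pd.Series(["southeast", "northwest", "southeast", "northeast"])

# Label encoding: one integer per category.
# Categories are sorted alphabetically, so northeast=0, northwest=1, southeast=2.
labels = region.astype("category").cat.codes

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(region, prefix="region")

print(labels.tolist())
print(one_hot.columns.tolist())
```

Label encoding gives a single column but implies an order between categories, which is why one-hot encoding is often preferred for a column such as region that has no natural ordering.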

In [5]:
df.columns

Index(['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges'], dtype='object')

In [6]:
df['sex_encoded'] = df['sex'].map({'male': 0, 'female': 1})
df['smoker_encoded'] = df['smoker'].map({'no': 0, 'yes': 1})

The sex and smoker columns are now encoded.
Sex: male = 0, female = 1
Smoker: non-smoker = 0, smoker = 1
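One thing worth keeping in mind with `Series.map` and a dictionary: any value that is not a key in the dictionary comes back as NaN rather than raising an error. The sketch below shows this on a made-up series (the capitalised "Yes" is a hypothetical bad value, not something found in this data set).

```python
import pandas as pd

# Hypothetical example: "Yes" (capitalised) is not a key in the mapping.
smoker = pd.Series(["no", "yes", "Yes"])
encoded = smoker.map({"no": 0, "yes": 1})

# Values missing from the mapping are returned as NaN,
# so isna() picks out the rows that were not encoded.
unmapped = smoker[encoded.isna()]
print(unmapped.tolist())
```

A check like this is a cheap way to confirm that every category was covered by the mapping before the original column is dropped.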

In [7]:
region_encoded = pd.get_dummies(df["region"], prefix="region")
df = pd.concat([df, region_encoded], axis=1)

In [8]:
df[['region_northeast', 'region_northwest', 'region_southeast', 'region_southwest']] = \
    df[['region_northeast', 'region_northwest', 'region_southeast', 'region_southwest']].astype(int)
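The two cells above can probably be combined into one step, since `pd.get_dummies` accepts a `dtype` argument. The sketch below assumes a small hypothetical frame called `toy` rather than the real data.

```python
import pandas as pd

# Hypothetical toy frame standing in for the insurance data.
toy = pd.DataFrame({
    "age": [58, 51, 42],
    "region": ["southeast", "southeast", "northwest"],
})

# dtype=int produces integer indicator columns directly,
# so no separate astype(int) call is needed afterwards.
region_encoded = pd.get_dummies(toy["region"], prefix="region", dtype=int)
toy = pd.concat([toy, region_encoded], axis=1)

print(toy.dtypes)
```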

In [31]:
df.dtypes

age                   int64
sex                  object
bmi                 float64
children              int64
smoker               object
region               object
charges             float64
sex_encoded           int64
smoker_encoded        int64
region_northeast      int64
region_northwest      int64
region_southeast      int64
region_southwest      int64
dtype: object

The data transformations are complete. Each categorical column now has an integer-encoded counterpart to allow for analysis.


In [9]:
df = df.drop(['sex', 'smoker', 'region'], axis=1)

Dropping the old categorical columns, as they are no longer needed. The previous data file still contains this information.

In [10]:
df.dtypes

age                   int64
bmi                 float64
children              int64
charges             float64
sex_encoded           int64
smoker_encoded        int64
region_northeast      int64
region_northwest      int64
region_southeast      int64
region_southwest      int64
dtype: object

Final check of the data to ensure there are no errors.


In [11]:
print(df.isnull().sum())

age                 0
bmi                 0
children            0
charges             0
sex_encoded         0
smoker_encoded      0
region_northeast    0
region_northwest    0
region_southeast    0
region_southwest    0
dtype: int64
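The same kind of checks could also be written as assertions, so that a problem stops the notebook instead of needing to be spotted in printed output. This is a minimal sketch on a hypothetical frame named `check`; the column names follow the ones used in this notebook, but the rows are made up.

```python
import pandas as pd

# Hypothetical stand-in for the transformed frame.
check = pd.DataFrame({
    "sex_encoded": [1, 0, 0],
    "smoker_encoded": [0, 0, 1],
    "region_northeast": [0, 0, 1],
    "region_southeast": [1, 1, 0],
})
region_cols = [c for c in check.columns if c.startswith("region_")]

# No missing values anywhere in the frame.
assert check.isnull().sum().sum() == 0

# Binary columns contain only 0 and 1.
assert check[["sex_encoded", "smoker_encoded"]].isin([0, 1]).all().all()

# One-hot columns: every row has exactly one region flag set.
rows_with_one_region = (check[region_cols].sum(axis=1) == 1)
assert rows_with_one_region.all()
```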


In [12]:
print("Missing Values:\n", df.isnull().sum(), "\n")
print("Duplicate Rows:", df.duplicated().sum(), "\n")
print("Data Types:\n", df.dtypes, "\n")
print("Unique Encoded Values:")
print("  sex_encoded:", df['sex_encoded'].unique())
print("  smoker_encoded:", df['smoker_encoded'].unique(), "\n")
print("Region Columns (One-Hot) Sums:")
print(df[['region_northeast', 'region_northwest', 'region_southeast', 'region_southwest']].sum(), "\n")
print("Statistical Summary:\n", df.describe(), "\n")
print("Data Shape:", df.shape)

Missing Values:
 age                 0
bmi                 0
children            0
charges             0
sex_encoded         0
smoker_encoded      0
region_northeast    0
region_northwest    0
region_southeast    0
region_southwest    0
dtype: int64 

Duplicate Rows: 0 

Data Types:
 age                   int64
bmi                 float64
children              int64
charges             float64
sex_encoded           int64
smoker_encoded        int64
region_northeast      int64
region_northwest      int64
region_southeast      int64
region_southwest      int64
dtype: object 

Unique Encoded Values:
  sex_encoded: [1 0]
  smoker_encoded: [1 0] 

Region Columns (One-Hot) Sums:
region_northeast    324
region_northwest    324
region_southeast    364
region_southwest    325
dtype: int64 

Statistical Summary:
                age          bmi     children       charges  sex_encoded  \
count  1337.000000  1337.000000  1337.000000   1337.000000  1337.000000   
mean     39.222139    30.663452    

The final check found no missing values or duplicate rows, so the transformed data is saved to a new file.

In [13]:
df.to_csv('/Users/nataliewaugh/Documents/DataCode/Healthcare-Insurance-Cost-Analysis/jupyter_notebooks/data_raw/insurance_transformed.csv', index=False)
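A quick way to confirm the save worked is to read the file back and compare it with what was written. The sketch below does this on a hypothetical two-row frame and writes to a temporary directory, so the path is an assumption and not the project path used above.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical frame standing in for the transformed data.
out = pd.DataFrame({"age": [58, 51], "smoker_encoded": [0, 1]})

with tempfile.TemporaryDirectory() as tmp:
    # Temporary location for illustration only.
    path = Path(tmp) / "insurance_transformed.csv"
    # index=False stops pandas writing the row index as an extra column.
    out.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(reloaded.columns.tolist())
print(reloaded.shape == out.shape)
```

Without `index=False`, the index would be written as an unnamed first column, which pandas labels `Unnamed: 0` when the file is read back.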