# CSI4142 - Data Science

## Assignment 3 - Predictive analysis - Regression and Classification

Shacha Parker 300235525

Callum Frodsham 300199446

This part of the assignment takes a dataset and performs an empirical study with a linear regression task.

#### Execution of the Notebook
 
FILL THIS IN WHEN FINALIZED

### Dataset Information
Dataset name: 

<a href="https://www.kaggle.com/datasets/mirichoi0218/insurance">Medical Cost Personal Datasets</a>
<br>
Provider: Miri Choi on Kaggle

<b>Features</b>

    age: age of primary beneficiary - quantitative - continuous

    sex: insurance contractor gender, female, male - categorical, nominal

    bmi: body mass index, i.e. body weight divided by height squared (kg / m ^ 2); an objective indicator of whether weight is high or low relative to height, ideally 18.5 to 24.9 - quantitative, continuous

    children: Number of children covered by health insurance / Number of dependents - quantitative, discrete

    smoker: does the beneficiary smoke - categorical, nominal

    region: the beneficiary's residential area in the US: northeast, southeast, southwest, northwest. - categorical, nominal

    charges: Individual medical costs billed by health insurance - quantitative, continuous


In [None]:
# Imports
import numpy as np
import pandas as pd
import sklearn
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset
# CHANGE THIS TO GITHUB RAW
data = pd.read_csv('insurance.csv')

# output settings for debugging
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

## Data Cleaning
The dataset is already clean. This can be seen with the results of the code cell below: no entries are null, and all values are appropriate. No cleaning is needed.

In [None]:
# Display information about the dataframe
# (info() prints its summary itself and returns None, so it is not wrapped in print)
data.info()

# Describe the dataframe
print(data.describe())

# Check for missing values
missing_values = data.isnull().sum()
print("Missing values in each column:\n", missing_values)

# Check for unique values in categorical columns to identify any unclean data
categorical_columns = data.select_dtypes(include=['object']).columns
for column in categorical_columns:
    print(f"\nUnique values in column '{column}':\n", data[column].unique())
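A check for "appropriate" categorical values can also be written as an explicit comparison against the allowed categories. The sketch below is a minimal illustration on a made-up toy frame; the helper name `unexpected_values` and the `expected` dictionary are hypothetical and not part of the notebook.

```python
import pandas as pd

def unexpected_values(df, expected):
    """Map each column to the set of values that are not in its allowed set."""
    report = {}
    for column, allowed in expected.items():
        extra = set(df[column].dropna().unique()) - allowed
        if extra:
            report[column] = extra
    return report

# Toy frame with made-up values; "Male" is a deliberate inconsistency
toy = pd.DataFrame({
    "sex": ["female", "male", "Male"],
    "region": ["northeast", "southwest", "northwest"],
})

# Allowed categories, following the feature list at the top of the notebook
expected = {
    "sex": {"female", "male"},
    "region": {"northeast", "southeast", "southwest", "northwest"},
}

report = unexpected_values(toy, expected)
print(report)
```

An empty report would indicate that every categorical value falls within its allowed set.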

## Categorical Feature Encoding
Using the get_dummies function, we create a version of the dataset in which every categorical feature is replaced by one-hot vectors. The numerical features remain as they were.

In [None]:
#print(data.head(), "\n")
# One-hot encode the categorical features (the numerical features are left unchanged)
data_onehot = pd.get_dummies(data, columns=['sex', 'smoker', 'region'])
print(data_onehot.head())
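To illustrate what one-hot encoding does to a categorical column, the sketch below applies `pd.get_dummies` to a three-row toy frame. The values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy frame with made-up values, only to illustrate the encoding
toy = pd.DataFrame({
    "age": [19, 33, 45],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "northwest", "southwest"],
})

# Encode only the categorical columns; 'age' passes through unchanged
toy_onehot = pd.get_dummies(toy, columns=["smoker", "region"])
print(list(toy_onehot.columns))
print(toy_onehot)
```

Each category becomes its own indicator column named `<feature>_<category>`, and exactly one indicator per original feature is set in every row.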

## EDA and Outlier detection
a. Outlier detection

In [None]:
# Visualize categorical features
for column in categorical_columns:
    plt.figure(figsize=(10, 5))
    sns.countplot(data=data, x=column)
    plt.title(f'Distribution of {column}')
    plt.show()

The count plots show no unexpected or unusually rare categories, so the categorical features do not appear to contain outliers.

In [None]:
# Visualize numerical features
numerical_columns = data.select_dtypes(include=['int64', 'float64']).columns
for column in numerical_columns:
    plt.figure(figsize=(10, 5))
    sns.boxplot(data=data, x=column)
    plt.title(f'Box plot of {column}')
    plt.show()

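Since the feature 'bmi' has only 9 outliers among the 1000+ entries (0.67%), we can remove those rows without a significant statistical impact. An outlier here is any value more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile.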

In [None]:
original = data.shape

# Calculate Q1 (25th percentile) and Q3 (75th percentile)
Q1 = data['bmi'].quantile(0.25)
Q3 = data['bmi'].quantile(0.75)

IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Filter out the outliers
data = data[(data['bmi'] >= lower_bound) & (data['bmi'] <= upper_bound)]

plt.figure(figsize=(10, 5))
# Pass the column name; the data argument already supplies the dataframe
sns.boxplot(data=data, x='bmi')
plt.title('Box plot of BMI after outlier removal')
plt.show()

print("Original data shape:", original)
print("Data shape after removing outliers:", data.shape)
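The IQR rule used in the cell above can be wrapped in a small reusable function. The following is a minimal sketch on made-up values; the helper name `iqr_bounds` and its parameter `k` are hypothetical and not part of the notebook.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences of the IQR rule for a numeric Series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up values: 50.0 lies far from the rest
toy_bmi = pd.Series([22.0, 24.0, 26.0, 28.0, 30.0, 50.0])

lower, upper = iqr_bounds(toy_bmi)
outliers = toy_bmi[(toy_bmi < lower) | (toy_bmi > upper)]
print(lower, upper, outliers.tolist())
```

The same function could be applied to both 'bmi' and 'charges', which avoids repeating the quartile arithmetic for each feature.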

For the 'charges' outliers, removing every affected row would negatively impact the dataset. Instead, we'll use random sample imputation to replace each outlying 'charges' value with a value drawn at random from the non-outlier charges.

In [None]:
# Calculate Q1 (25th percentile) and Q3 (75th percentile) for 'charges'
Q1_charges = data['charges'].quantile(0.25)
Q3_charges = data['charges'].quantile(0.75)
IQR_charges = Q3_charges - Q1_charges
lower_bound_charges = Q1_charges - 1.5 * IQR_charges
upper_bound_charges = Q3_charges + 1.5 * IQR_charges

# Create D1 by removing rows where 'charges' is an outlier
D1 = data[(data['charges'] >= lower_bound_charges) & (data['charges'] <= upper_bound_charges)]

# Create D2 by keeping every row and marking the 'charges' outliers as missing (NaN)
D2 = data.copy()
D2.loc[(D2['charges'] < lower_bound_charges) | (D2['charges'] > upper_bound_charges), 'charges'] = np.nan

# Display the shapes of the original, D1, and D2 datasets
print("Original data shape:", data.shape)
print("D1 data shape (without 'charges' outliers):", D1.shape)
print("D2 data shape ('charges' outliers set to NaN, rows kept):", D2.shape)

missing_values_d2 = D2.isnull().sum()
print("Missing values in each column of D2:\n", missing_values_d2)

# Get the indices of missing values in D2
missing_indices = D2[D2['charges'].isnull()].index

# Randomly sample values from D1['charges'] to fill the missing values in D2
imputed_values = D1['charges'].sample(n=len(missing_indices), replace=True).values

# Impute the missing values in D2
D2.loc[missing_indices, 'charges'] = imputed_values

missing_values_d2 = D2.isnull().sum()
print("Missing values in each column of D2 after RSI:\n", missing_values_d2)

data = D2
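Random sample imputation can be shown in isolation on a toy Series. The sketch below uses made-up values and adds a fixed `random_state`, which the cell above does not use, so that the draw is repeatable; this is an assumption for illustration rather than the notebook's setting.

```python
import numpy as np
import pandas as pd

# Made-up values; NaN marks the entries to impute
toy = pd.Series([100.0, 120.0, np.nan, 110.0, np.nan], name="charges")

# Donor pool: the observed (non-missing) values
donors = toy.dropna()
missing_idx = toy.index[toy.isna()]

# Draw one donor value per missing entry, with replacement;
# random_state fixes the draw so reruns give the same result
fill = donors.sample(n=len(missing_idx), replace=True, random_state=0).to_numpy()

imputed = toy.copy()
imputed.loc[missing_idx] = fill
print(imputed)
```

Every imputed entry is an actually observed value, so the imputed column stays within the range of the donor pool.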