# Supervised Learning: Regression

Kaggle Dataset (https://www.kaggle.com/mirichoi0218/insurance)

Main objective of the analysis that specifies whether your model will be focused on prediction or interpretation.
   * Our model will focus on both predicting the charges and quantifying the magnitude of each feature's effect, so interpretability matters as well.

Brief description of the data set you chose and a summary of its attributes.
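   * The data provided by Kaggle (linked above) has 1,338 rows and 7 columns (age, sex, bmi, children, smoker, region, charges) with no missing values. It was already tidy, so no cleaning was needed.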
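One way to back up the claim that no cleaning is needed is to count missing values and duplicate rows. The sketch below is illustrative only: the frame `toy` is a hypothetical stand-in for `insurance.csv` with invented values, and the helper name `tidy_report` is not part of the original notebook.

```python
import pandas as pd

def tidy_report(frame: pd.DataFrame) -> dict:
    """Return simple data-quality counts for a DataFrame."""
    return {
        "missing_values": int(frame.isna().sum().sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
    }

# Hypothetical stand-in for the insurance data (values invented for illustration).
toy = pd.DataFrame({
    "age": [19, 18, 28, 18],
    "bmi": [27.9, 33.77, None, 33.77],
    "smoker": ["yes", "no", "no", "no"],
})

report = tidy_report(toy)
print(report)
```

Running the same helper on the real `df` would show whether either count is nonzero before skipping the cleaning step.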

Brief summary of data exploration and actions taken for data cleaning and feature engineering.
   * As noted above, the data did not need cleaning. For feature engineering:
       * The Sex and Smoker columns were converted to booleans
       * The Children and Region columns were converted to dummy variables
       * The Age column was separated into the following age ranges: 18-25, 26-35, 36-50, 51-64
       * The mean BMI was computed for males and females within each engineered age range, as well as by region
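The grouped BMI means described in the last bullet could be computed with a `groupby`. This is a minimal sketch, assuming a frame that already carries an `age_group` column; the rows in `toy` are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical rows shaped like the engineered insurance data (values invented).
toy = pd.DataFrame({
    "age_group": ["18-25", "18-25", "26-35", "26-35", "36-50", "51-64"],
    "sex": ["female", "male", "female", "male", "female", "male"],
    "bmi": [20.0, 30.0, 22.0, 28.0, 26.0, 32.0],
    "region": ["southwest", "southeast", "southwest",
               "northwest", "southeast", "northwest"],
})

# Mean BMI for each (age group, sex) pair.
bmi_by_age_sex = toy.groupby(["age_group", "sex"])["bmi"].mean()

# Mean BMI for each region.
bmi_by_region = toy.groupby("region")["bmi"].mean()

print(bmi_by_age_sex)
print(bmi_by_region)
```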

Summary of training at least three linear regression models which should be variations that cover using a simple linear regression as a baseline, adding polynomial effects, and using a regularization regression. Preferably, all use the same training and test splits, or the same cross-validation method.
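The three variations named above (baseline linear, polynomial effects, regularization) can be compared on one shared train/test split. The following is a rough NumPy-only sketch on synthetic data, not the notebook's actual modelling code: the data, the helper names `poly2`, `fit`, and `r2`, and the penalty strength `alpha=1.0` are all assumptions made for illustration. In practice scikit-learn's `LinearRegression`, `PolynomialFeatures`, and `Ridge` would typically fill these roles.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (invented): y is linear in x0 and quadratic in x1.
n = 200
X = rng.uniform(-2, 2, size=(n, 2))
y = 1.0 + 2.0 * X[:, 0] + 1.5 * X[:, 1] ** 2 + rng.normal(0, 0.1, size=n)

# One shared train/test split for all three models.
order = rng.permutation(n)
train, test = order[:150], order[150:]

def poly2(features):
    """Append squares and the pairwise interaction to a two-column matrix."""
    a, b = features[:, 0], features[:, 1]
    return np.column_stack([features, a ** 2, b ** 2, a * b])

def fit(features, target, alpha=0.0):
    """Least squares with an optional L2 penalty (intercept left unpenalized)."""
    design = np.column_stack([np.ones(len(features)), features])
    penalty = alpha * np.eye(design.shape[1])
    penalty[0, 0] = 0.0
    return np.linalg.solve(design.T @ design + penalty, design.T @ target)

def r2(features, target, coef):
    """Coefficient of determination on the given data."""
    pred = np.column_stack([np.ones(len(features)), features]) @ coef
    ss_res = np.sum((target - pred) ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Each entry: (feature matrix, ridge penalty strength).
models = {
    "linear": (X, 0.0),
    "polynomial": (poly2(X), 0.0),
    "ridge_polynomial": (poly2(X), 1.0),
}

scores = {}
for name, (features, alpha) in models.items():
    coef = fit(features[train], y[train], alpha)
    scores[name] = r2(features[test], y[test], coef)

for name, score in scores.items():
    print(f"{name}: test R^2 = {score:.3f}")
```

Because every model is scored on the same held-out rows, differences in test R^2 reflect the model form rather than the split.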

A paragraph explaining which of your regressions you recommend as a final model that best fits your needs in terms of accuracy and explainability.

Summary of key findings and insights, which walks your reader through the main drivers of your model and insights from your data derived from your linear regression model.

Suggestions for next steps in analyzing this data, which may include revisiting this model and adding specific data features to achieve a better explanation or a better prediction.


Questions:
How much will the insurer charge based on:
* BMI
* children
* age
* region
* smoking

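Null/Alternative Hypotheses

1) ANOVA
* Ho: mean BMI is equal across all regions (µ BMI_region_i == µ BMI_region_j for every pair of regions i, j)
* Ha: mean BMI differs for at least one region (µ BMI_region_i != µ BMI_region_j for some i, j)

2) ANOVA
* Ho: mean BMI is equal across all age groups (µ BMI_age_i == µ BMI_age_j for every pair of age groups i, j)
* Ha: mean BMI differs for at least one age group (µ BMI_age_i != µ BMI_age_j for some i, j)

3) ANOVA
* Ho: mean charges are equal across all regions (µ Charges_region_i == µ Charges_region_j for every pair of regions i, j)
* Ha: mean charges differ for at least one region (µ Charges_region_i != µ Charges_region_j for some i, j)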

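4) ANOVA
* Ho: mean charges are equal across all age groups (µ Charges_age_i == µ Charges_age_j for every pair of age groups i, j)
* Ha: mean charges differ for at least one age group (µ Charges_age_i != µ Charges_age_j for some i, j)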
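Each hypothesis pair above compares group means, which a one-way ANOVA does through the F statistic: between-group variance divided by within-group variance. The sketch below computes it by hand with NumPy on three invented BMI samples; the function name `one_way_anova_f` and the sample values are hypothetical. `scipy.stats.f_oneway` returns the same statistic together with a p-value.

```python
import numpy as np

def one_way_anova_f(*groups):
    """Return the one-way ANOVA F statistic for two or more samples."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    k = len(groups)                          # number of groups
    n_total = sum(len(g) for g in groups)    # total number of observations
    grand_mean = np.concatenate(groups).mean()

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

# Hypothetical BMI samples for three regions (values invented for illustration).
region_a = [24.0, 25.0, 26.0]
region_b = [25.0, 26.0, 27.0]
region_c = [28.0, 29.0, 30.0]

f_stat = one_way_anova_f(region_a, region_b, region_c)
print(f"F = {f_stat:.2f}")
```

A large F relative to the F distribution with (k - 1, N - k) degrees of freedom is evidence against Ho.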

In [1]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', None)

In [2]:
df = pd.read_csv("insurance.csv")
df

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350
1336,21,female,25.800,0,no,southwest,2007.94500


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


In [4]:
df.describe()

Unnamed: 0,age,bmi,children,charges
count,1338.0,1338.0,1338.0,1338.0
mean,39.207025,30.663397,1.094918,13270.422265
std,14.04996,6.098187,1.205493,12110.011237
min,18.0,15.96,0.0,1121.8739
25%,27.0,26.29625,0.0,4740.28715
50%,39.0,30.4,1.0,9382.033
75%,51.0,34.69375,2.0,16639.912515
max,64.0,53.13,5.0,63770.42801


In [5]:
df['region'].unique()

array(['southwest', 'southeast', 'northwest', 'northeast'], dtype=object)

In [6]:
df['children'].unique()

array([0, 1, 3, 2, 5, 4])

In [7]:
num_df = df.drop(['sex','smoker', 'region', 'children'], axis=1)
num_df

Unnamed: 0,age,bmi,charges
0,19,27.900,16884.92400
1,18,33.770,1725.55230
2,28,33.000,4449.46200
3,33,22.705,21984.47061
4,32,28.880,3866.85520
...,...,...,...
1333,50,30.970,10600.54830
1334,18,31.920,2205.98080
1335,18,36.850,1629.83350
1336,21,25.800,2007.94500


In [8]:
cat_df = df.drop(['age','bmi','charges'], axis=1)
cat_df

Unnamed: 0,sex,children,smoker,region
0,female,0,yes,southwest
1,male,1,no,southeast
2,male,3,no,southeast
3,male,0,no,northwest
4,male,0,no,northwest
...,...,...,...,...
1333,male,3,no,northwest
1334,female,0,no,northeast
1335,female,0,no,southeast
1336,female,0,no,southwest


In [9]:
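# Right-closed age bins; the lowest edge matches the '18-25' label,
# and include_lowest=True keeps age 18 inside the first bin.
bins = [18, 25, 35, 50, 64]

bins_num_df = pd.cut(num_df['age'], bins, include_lowest = True, labels = ('18-25', '26-35', '36-50', '51-64'))
# pd.cut returns an ordered categorical; drop the ordering before encoding.
cat_df['age_group'] = bins_num_df.cat.as_unordered()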
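Since `pd.cut` uses right-closed intervals by default, it is worth checking where boundary ages land. The sketch below is a small illustration with its own edges starting at 18 to match the labels. For this data, any first edge at or below 18 gives the same groups. The `ages` series is invented for the check.

```python
import pandas as pd

# Boundary ages for each intended range (invented for the check).
ages = pd.Series([18, 25, 26, 35, 36, 50, 51, 64])
edges = [18, 25, 35, 50, 64]
labels = ["18-25", "26-35", "36-50", "51-64"]

# include_lowest=True closes the first interval on the left, so 18 is kept.
groups = pd.cut(ages, bins=edges, include_lowest=True, labels=labels)
print(groups.tolist())

# Without include_lowest, the first interval is (18, 25] and age 18 becomes NaN.
without_lowest = pd.cut(ages, bins=edges, labels=labels)
print(int(without_lowest.isna().sum()))
```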

In [10]:
# Cast the child count to str so get_dummies one-hot encodes it instead of leaving it numeric.
cat_df['children'] = cat_df['children'].map(str)

In [11]:
cat_df_dummies = pd.get_dummies(cat_df, drop_first=True)
cat_df_dummies

Unnamed: 0,sex_male,children_1,children_2,children_3,children_4,children_5,smoker_yes,region_northwest,region_southeast,region_southwest,age_group_26-35,age_group_36-50,age_group_51-64
0,0,0,0,0,0,0,1,0,0,1,0,0,0
1,1,1,0,0,0,0,0,0,1,0,0,0,0
2,1,0,0,1,0,0,0,0,1,0,1,0,0
3,1,0,0,0,0,0,0,1,0,0,1,0,0
4,1,0,0,0,0,0,0,1,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1333,1,0,0,1,0,0,0,1,0,0,0,1,0
1334,0,0,0,0,0,0,0,0,0,0,0,0,0
1335,0,0,0,0,0,0,0,0,1,0,0,0,0
1336,0,0,0,0,0,0,0,0,0,1,0,0,0
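With `drop_first=True`, the first level of each categorical column (in sorted order) is dropped and becomes the reference category, which is why the output has no `sex_female`, `children_0`, `smoker_no`, `region_northeast`, or `age_group_18-25` column. The sketch below illustrates this on a tiny invented frame. `dtype=int` is an addition here so the result prints as 0/1 regardless of pandas version, since recent versions default to booleans.

```python
import pandas as pd

# Hypothetical rows (values invented for illustration).
toy = pd.DataFrame({
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "northeast"],
})

# 'no' and 'northeast' sort first, so they become the dropped reference levels.
dummies = pd.get_dummies(toy, drop_first=True, dtype=int)
print(dummies)
```

A row of all zeros therefore means "reference level in every column", here a non-smoker from the northeast.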
