In [1]:
import pandas as pd
import seaborn as sns

Machine learning step one: load your data, do some exploratory data analysis (EDA) and make sure everything looks the way you expect. This can include:
- checking your features for NaNs and removing or imputing the affected data.
- understanding the types of data you are working with and the shape of your data. .info(), .describe() and .shape are useful for this.
- one hot encoding your data if you are planning on working with linear models, or other models that need categorical variables transformed into numeric variables.

- NOTE: Data exploration can be time consuming, and it can take quite a while to clean your data and decide which type of ML algorithm to use. That is OK! :-)
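
The insurance data used below turns out to have no NaNs, so the "removing or imputing" step never appears in this notebook. As a minimal sketch of what it could look like, here is a hypothetical toy dataframe (the `toy` frame and its values are made up for illustration) cleaned in two ways. Median and most-frequent-value imputation are just one reasonable choice, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps; this is NOT the insurance dataset
toy = pd.DataFrame({
    "age": [19, 18, np.nan, 33],
    "bmi": [27.9, np.nan, 33.0, 22.7],
    "smoker": ["yes", "no", None, "no"],
})

print(toy.shape)         # (rows, columns)
print(toy.isna().sum())  # number of missing values per column

# Option 1: drop every row that contains at least one NaN
dropped = toy.dropna()

# Option 2: impute instead of dropping.
# Numeric columns get the column median, the categorical column gets its most frequent value.
imputed = toy.copy()
for col in ["age", "bmi"]:
    imputed[col] = imputed[col].fillna(imputed[col].median())
imputed["smoker"] = imputed["smoker"].fillna(imputed["smoker"].mode()[0])
```

Dropping rows is simplest but throws data away; imputing keeps every row at the cost of inventing plausible values.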

# Linear Regression

In [46]:
linear_data = pd.read_csv("./ml-linear-regression-Meg-Guidry/data/insurance.csv")

In [47]:
# Now let's make sure the data looks like we expect it to:
linear_data.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


In [48]:
# Now let's check what types of data we are working with and whether any of the columns have NaN values:
linear_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


In [49]:
# Check to see if we have any NaN values in our dataframe:
linear_data.isna().sum()

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

In [50]:
# .describe() summarises only the numeric columns by default, so it is also a quick way to see which features are numeric:
linear_data.describe()

Unnamed: 0,age,bmi,children,charges
count,1338.0,1338.0,1338.0,1338.0
mean,39.207025,30.663397,1.094918,13270.422265
std,14.04996,6.098187,1.205493,12110.011237
min,18.0,15.96,0.0,1121.8739
25%,27.0,26.29625,0.0,4740.28715
50%,39.0,30.4,1.0,9382.033
75%,51.0,34.69375,2.0,16639.912515
max,64.0,53.13,5.0,63770.42801


In [51]:
# From exploring the data I can see that we do not have any NaN values to worry about. 
# But we do have some categorical features in the data that we need to one hot encode:

Which features do we need to one hot encode? The features that are of type "object". For this dataset this includes:
- sex
- smoker
- region

The target variable for this dataset will be "charges" as this is what we will want to predict.
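
Rather than reading the column types off the .info() output by eye, the non-numeric columns can also be picked out programmatically. This is a small sketch on a toy frame (`sample` is a hypothetical name) built from the first three rows shown by .head() above; on the real data the same call would be made on linear_data.

```python
import pandas as pd

# Toy frame with the same columns as the insurance data (first three rows from .head())
sample = pd.DataFrame({
    "age": [19, 18, 28],
    "sex": ["female", "male", "male"],
    "bmi": [27.9, 33.77, 33.0],
    "children": [0, 1, 3],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "southeast"],
    "charges": [16884.924, 1725.5523, 4449.462],
})

# Everything that is not numeric is a candidate for one hot encoding
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()
print(categorical_cols)
```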

In [52]:
# Count how often each unique row occurs; a count above 1 means the row is duplicated:
linear_data.value_counts()

age  sex     bmi     children  smoker  region     charges    
19   male    30.590  0         no      northwest  1639.56310     2
47   male    29.830  3         no      northwest  9620.33070     1
48   female  25.850  3         yes     southeast  24180.93350    1
             22.800  0         no      southwest  8269.04400     1
47   male    47.520  1         no      southeast  8083.91980     1
                                                                ..
31   female  25.740  0         no      southeast  3756.62160     1
             23.600  2         no      southwest  4931.64700     1
             21.755  0         no      northwest  4134.08245     1
30   male    44.220  2         no      southeast  4266.16580     1
64   male    40.480  0         no      southeast  13831.11520    1
Name: count, Length: 1337, dtype: int64
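
The output above has a length of 1337 for a dataframe of 1338 rows, and one row has a count of 2, which suggests that a single row is duplicated. A more direct way to check this is sketched below on a hypothetical toy frame (`rows` is made up, with the duplicated values copied from the output above). Whether a duplicate should be dropped is a judgment call: two people can genuinely share the same values.

```python
import pandas as pd

# Toy frame in which the first two rows are identical
rows = pd.DataFrame({
    "age": [19, 19, 47],
    "sex": ["male", "male", "male"],
    "bmi": [30.59, 30.59, 29.83],
    "charges": [1639.5631, 1639.5631, 9620.3307],
})

# duplicated() flags every repeat of an earlier row
n_duplicates = rows.duplicated().sum()
print(n_duplicates)

# drop_duplicates() keeps the first occurrence of each row
deduplicated = rows.drop_duplicates().reset_index(drop=True)
print(deduplicated.shape)
```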

## One hot encoding

In [53]:
# One hot encoding: my own version of a one hot encoding function

def one_hot_encode(df, list_of_cols):
    """Return a copy of df with each column in list_of_cols replaced by dummy columns."""
    encoded = df.copy()

    for col in list_of_cols:
        # pandas joins the prefix and the category with "_",
        # so the new columns are named e.g. "sex = _female"
        dummies = pd.get_dummies(
            data=df[col],
            prefix=f"{col} = "
        )
        # Append the dummy columns, then drop the original categorical column
        encoded = pd.concat([encoded, dummies], axis=1)
        encoded = encoded.drop(columns=[col])

    return encoded

In [54]:
linear_data_OHE = one_hot_encode(linear_data, ["sex", "smoker", "region"])

In [55]:
linear_data_OHE.head()

Unnamed: 0,age,bmi,children,charges,sex = _female,sex = _male,smoker = _no,smoker = _yes,region = _northeast,region = _northwest,region = _southeast,region = _southwest
0,19,27.9,0,16884.924,True,False,False,True,False,False,False,True
1,18,33.77,1,1725.5523,False,True,True,False,False,False,True,False
2,28,33.0,3,4449.462,False,True,True,False,False,False,True,False
3,33,22.705,0,21984.47061,False,True,True,False,False,True,False,False
4,32,28.88,0,3866.8552,False,True,True,False,False,True,False,False


In [56]:
linear_data_OHE.shape

(1338, 12)

In [57]:
# What happens when I use pd.get_dummies on the entire dataframe?
dummies = pd.get_dummies(data= linear_data)
dummies.head()

Unnamed: 0,age,bmi,children,charges,sex_female,sex_male,smoker_no,smoker_yes,region_northeast,region_northwest,region_southeast,region_southwest
0,19,27.9,0,16884.924,True,False,False,True,False,False,False,True
1,18,33.77,1,1725.5523,False,True,True,False,False,False,True,False
2,28,33.0,3,4449.462,False,True,True,False,False,False,True,False
3,33,22.705,0,21984.47061,False,True,True,False,False,True,False,False
4,32,28.88,0,3866.8552,False,True,True,False,False,True,False,False


In [58]:
dummies.shape

(1338, 12)

### NOTE ON ONE HOT ENCODING: 

Instead of writing a new function for the one hot encoding step, I should just use the existing function pd.get_dummies properly! If I give this function a dataframe, it returns a new dataframe with dummy variables for all the categorical (object) columns. It also drops the original columns for you.
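
pd.get_dummies also accepts a columns argument to encode only selected columns, and a drop_first argument that leaves out one dummy per feature. The second option is not used in this notebook; it is shown here only as a sketch, on a toy frame (`sample` is a hypothetical name) with values taken from .head() above. Dropping one level removes a redundant column (smoker_no is simply the opposite of smoker_yes), which can matter when interpreting the coefficients of an unregularised linear model.

```python
import pandas as pd

# Toy frame with a subset of the insurance columns
sample = pd.DataFrame({
    "age": [19, 18, 28, 33],
    "smoker": ["yes", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest"],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061],
})

# Encode only the named columns; the originals are dropped automatically
encoded = pd.get_dummies(sample, columns=["smoker", "region"])
print(list(encoded.columns))

# drop_first=True omits the first category of each encoded column
encoded_drop_first = pd.get_dummies(sample, columns=["smoker", "region"], drop_first=True)
print(list(encoded_drop_first.columns))
```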



## Linear Regression - creating, training and testing the model

In [67]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

In [60]:
# Now that the data is one hot encoded, let's define the feature matrix and target variable:
X = linear_data_OHE.drop(["charges"], axis=1)
y = linear_data_OHE["charges"]

In [64]:
### Instantiate the model 

In [65]:
lr_model = LinearRegression()

### Create the train test split

In [70]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=42)

### Fit the instantiated model to the training data and make predictions!

In [73]:
lr_model.fit(X_train, y_train)
lrmodel_predictions = lr_model.predict(X_test)

### Analyse how the model performed with metrics:

In [77]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

#### MAE: Mean Absolute Error

In [78]:
MAE = mean_absolute_error(y_test, lrmodel_predictions)
MAE

4181.194473753645

#### MSE: Mean Squared Error

In [79]:
MSE = mean_squared_error(y_test, lrmodel_predictions)
MSE

33596915.851361476
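
The MSE is expressed in squared units of charges, which makes it hard to compare with the MAE. Taking the square root gives the root mean squared error (RMSE), which is back in the units of charges. The notebook does not compute the RMSE itself; the sketch below just applies the square root to the MSE value printed above.

```python
import math

# MSE value printed by the cell above
MSE = 33596915.851361476

# RMSE is in the same units as "charges", so it can be read next to the MAE
RMSE = math.sqrt(MSE)
print(round(RMSE, 2))
```

The RMSE comes out noticeably larger than the MAE of about 4181. Because squaring penalises large errors more heavily, a gap like this suggests that a few predictions are off by a lot.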

#### K-fold cross-validation

Analysing how our model performs on the test data is great, but with a single train/test split it can be tricky to know just how well it is really performing. Use cross-validation to explore how much the error varies from fold to fold, and whether or not your model might be overfitting (high variance!).
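
A minimal sketch of what this could look like with scikit-learn's cross_val_score is shown below. It runs on synthetic data (`X_demo` and `y_demo` are hypothetical stand-ins) so that it is self-contained; in this notebook X and y would take their place. The choice of 5 folds and of MAE as the metric is an assumption, not something fixed by the analysis above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the feature matrix and target
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

# Five shuffled folds: each fold is used once as the held-out set
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# scikit-learn scorers follow "higher is better", so the MAE is returned negated
neg_mae = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=cv, scoring="neg_mean_absolute_error"
)
mae_per_fold = -neg_mae

# The spread across folds shows how sensitive the error is to the choice of split
print(mae_per_fold)
print(mae_per_fold.mean(), mae_per_fold.std())
```

If the per-fold errors are similar to each other, the model is fairly stable; a large spread between folds is a hint that the result depends heavily on which rows happened to land in the test set.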