<a href="https://colab.research.google.com/github/Jonny-T87/Dojo-Work/blob/main/Pre_Processing_Exercise_(Practice).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

- Jonny Tesfahun
- 06/20/22

How well can the charges be predicted based on the age, sex, BMI, number of children, smoking habit, and region of the patient?  

At this point, you are just completing the pre-processing. In a later assignment, you will apply the additional steps to actually address the question.

You will need to:

- Define features (X) and target (y)
- Train test split the data to prepare for machine learning
- Identify each feature as numerical, ordinal, or nominal. (Please provide this answer in a text cell in your Colab notebook)
- Ordinal encode any ordinal features
- One Hot Encode any nominal features 
- Scale any numeric features
- Concatenate all features back into one dataframe.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_selector
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer

In [2]:
df = pd.read_csv('/content/drive/MyDrive/DojoBootCamp/Project Files/insurance.csv')

In [3]:
df.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


In [5]:
# sex is a binary nominal feature (two categories with no inherent order)
df['sex'].value_counts()

male      676
female    662
Name: sex, dtype: int64

In [7]:
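# with only two categories, a simple 0/1 mapping works as the encoding
# (assigning the result back avoids relying on inplace=True on a column selection)
df['sex'] = df['sex'].map({'male': 0, 'female': 1})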
df['sex'].value_counts()

0    676
1    662
Name: sex, dtype: int64

In [8]:
# smoker is also a binary feature (no / yes)
df['smoker'].value_counts()

no     1064
yes     274
Name: smoker, dtype: int64

In [9]:
# and it can be encoded as 0/1 in the same way
df['smoker'] = df['smoker'].map({'no': 0, 'yes': 1})
df['smoker'].value_counts()

0    1064
1     274
Name: smoker, dtype: int64

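I will assign "charges" as the target (y) and the rest of the columns as the features (X).

Of the features, age, bmi, and children are numeric; sex and smoker are binary and were encoded as 0/1 above; region is nominal and is one-hot encoded below.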

In [10]:
X = df.drop(columns=['charges'])
y = df['charges']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
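
Because `test_size` is not passed, `train_test_split` falls back to its default of holding out 25% of the rows. A quick sketch with a placeholder frame of the same length (1338 rows; the column `x` and the target values are made up) shows the resulting sizes:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# placeholder data with the same number of rows as the insurance dataset
X_demo = pd.DataFrame({'x': range(1338)})
y_demo = pd.Series(range(1338))

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# number of rows in the training and test sets
print(len(X_tr), len(X_te))
# the shuffled split keeps the original row labels rather than renumbering from 0
print(X_tr.index[:5].tolist())
```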

In [11]:
# create a selector for the categorical (object dtype) columns; only region is left
cat_selector = make_column_selector(dtype_include='object')

In [12]:
# select the categorical columns
cat_selector(X_train)

['region']

In [13]:
# isolate the categorical column for one-hot encoding
# (despite the variable names, these frames hold the region column, not smoker)
train_smoker_data = X_train[cat_selector(X_train)]
test_smoker_data = X_test[cat_selector(X_test)]
train_smoker_data

Unnamed: 0,region
693,northwest
1297,southeast
634,southwest
1022,southeast
178,southwest
...,...
1095,northeast
1130,southeast
1294,northeast
860,southwest


In [14]:
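# instantiate the one-hot encoder
# (scikit-learn 1.2 renamed `sparse` to `sparse_output`; newer versions need sparse_output=False)
ohe_encoder = OneHotEncoder(sparse=False, handle_unknown='ignore')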

In [15]:
#fit the OneHotEncoder on the training data
ohe_encoder.fit(train_smoker_data)

OneHotEncoder(handle_unknown='ignore', sparse=False)

In [16]:
# transform both the training and the testing data with the fitted encoder; each result is a NumPy array
train_ohe = ohe_encoder.transform(train_smoker_data)
test_ohe = ohe_encoder.transform(test_smoker_data)
train_ohe

array([[0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.],
       ...,
       [1., 0., 0., 0.],
       [0., 0., 0., 1.],
       [0., 0., 0., 1.]])

In [17]:
# use get_feature_names_out() to get the names of the new one-hot columns
ohe_column_names = ohe_encoder.get_feature_names_out(train_smoker_data.columns)

In [18]:
# convert the arrays to DataFrames, using the column names extracted from the encoder
train_ohe = pd.DataFrame(train_ohe, columns=ohe_column_names)
test_ohe = pd.DataFrame(test_ohe, columns=ohe_column_names)
train_ohe

Unnamed: 0,region_northeast,region_northwest,region_southeast,region_southwest
0,0.0,1.0,0.0,0.0
1,0.0,0.0,1.0,0.0
2,0.0,0.0,0.0,1.0
3,0.0,0.0,1.0,0.0
4,0.0,0.0,0.0,1.0
...,...,...,...,...
998,1.0,0.0,0.0,0.0
999,0.0,0.0,1.0,0.0
1000,1.0,0.0,0.0,0.0
1001,0.0,0.0,0.0,1.0


Concatenating the One-hot Encoded Categorical Feature with the Numeric Features

In [19]:
# create a numeric selector
num_selector = make_column_selector(dtype_include='number')

In [20]:
# isolate the numeric columns and reset the index so the rows line up
# with the one-hot DataFrames, which are indexed from 0
train_nums = X_train[num_selector(X_train)].reset_index(drop=True)
test_nums = X_test[num_selector(X_test)].reset_index(drop=True)
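
The `reset_index(drop=True)` call matters because `pd.concat(..., axis=1)` aligns rows by index label, not by position. A toy sketch shows what happens without it (the three ages and index labels are taken from the training rows shown in the outputs; the frame names are hypothetical):

```python
import pandas as pd

# numeric rows keep their shuffled labels from the split
nums = pd.DataFrame({'age': [24, 28, 51]}, index=[693, 1297, 634])
# a frame built from the encoder's array gets a fresh index starting at 0
ohe = pd.DataFrame({'region_northwest': [1.0, 0.0, 0.0]})

# without resetting, no labels match, so every row is padded with NaN
misaligned = pd.concat([nums, ohe], axis=1)
# after resetting, both frames share the labels 0, 1, 2
aligned = pd.concat([nums.reset_index(drop=True), ohe], axis=1)

print(misaligned.shape, int(misaligned.isna().sum().sum()))
print(aligned.shape, int(aligned.isna().sum().sum()))
```

An alternative with the same effect is to pass `index=X_train.index` when building the one-hot DataFrame, which keeps the original row labels instead of discarding them.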

In [21]:
# combine the numeric and one-hot encoded columns on axis 1 (columns), separately for train and test
X_train_processed = pd.concat([train_nums, train_ohe], axis=1)
X_test_processed = pd.concat([test_nums, test_ohe], axis=1)
X_train_processed

Unnamed: 0,age,sex,bmi,children,smoker,region_northeast,region_northwest,region_southeast,region_southwest
0,24,0,23.655,0,0,0.0,1.0,0.0,0.0
1,28,1,26.510,2,0,0.0,0.0,1.0,0.0
2,51,0,39.700,1,0,0.0,0.0,0.0,1.0
3,47,0,36.080,1,1,0.0,0.0,1.0,0.0
4,46,1,28.900,2,0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...
998,18,1,31.350,4,0,1.0,0.0,0.0,0.0
999,39,1,23.870,5,0,0.0,0.0,1.0,0.0
1000,58,0,25.175,0,0,1.0,0.0,0.0,0.0
1001,37,1,47.600,2,1,0.0,0.0,0.0,1.0
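
The task list also calls for scaling the numeric features, which the cells above do not do yet. One way to sketch that step is with `StandardScaler`, fit on the training rows only. The choice of scaler, the tiny frames, and the names `scaler`, `train_demo`, and `test_demo` are assumptions for illustration; the values come from the `df.head()` output.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

num_cols = ['age', 'bmi', 'children']

# first four head() rows stand in for the training set, the fifth for the test set
train_demo = pd.DataFrame({
    'age': [19, 18, 28, 33],
    'bmi': [27.9, 33.77, 33.0, 22.705],
    'children': [0, 1, 3, 0],
})
test_demo = pd.DataFrame({'age': [32], 'bmi': [28.88], 'children': [0]})

scaler = StandardScaler()
# learn the column means and standard deviations from the training data only
scaler.fit(train_demo[num_cols])

# apply the same training statistics to both sets
train_scaled = pd.DataFrame(scaler.transform(train_demo[num_cols]), columns=num_cols)
test_scaled = pd.DataFrame(scaler.transform(test_demo[num_cols]), columns=num_cols)

print(train_scaled.round(3))
print(test_scaled.round(3))
```

In the notebook this would apply to `train_nums` and `test_nums` before the concatenation. Whether the 0/1 columns `sex` and `smoker` should be scaled as well is a modeling choice.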
