# IS 470 QUIZ 2

---

# Clustering on insurance data set

---
You have been given a data file by an insurance company. The goal of this analysis is to segment the customers into clusters based on the insurance data.<br>
<br>
The insurance data set has 1338 observations of 7 variables.
<br>
<br>
VARIABLE DESCRIPTIONS:<br>
age: age in years<br>
sex: gender<br>
bmi: body mass index<br>
children: number of children<br>
smoker: whether the person smokes (yes/no)<br>
region: geographic region<br>
expenses: yearly medical expenses<br>

### 1. Upload and clean data

In [1]:
# Upload data
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

Saving insurance.csv to insurance.csv
User uploaded file "insurance.csv" with length 50264 bytes


In [2]:
# Import libraries
import pandas as pd
from sklearn.cluster import KMeans
from collections import Counter
from sklearn import preprocessing
from matplotlib import pyplot as plt

In [3]:
# Read data
insurance = pd.read_csv("insurance.csv")
insurance

Unnamed: 0,age,sex,bmi,children,smoker,region,expenses
0,19,female,27.9,0,yes,southwest,16884.92
1,18,male,33.8,1,no,southeast,1725.55
2,28,male,33.0,3,no,southeast,4449.46
3,33,male,22.7,0,no,northwest,21984.47
4,32,male,28.9,0,no,northwest,3866.86
...,...,...,...,...,...,...,...
1333,50,male,31.0,3,no,northwest,10600.55
1334,18,female,31.9,0,no,northeast,2205.98
1335,18,female,36.9,0,no,southeast,1629.83
1336,21,female,25.8,0,no,southwest,2007.95


In [4]:
# Examine variable type
insurance.dtypes

age           int64
sex          object
bmi         float64
children      int64
smoker       object
region       object
expenses    float64
dtype: object

In [5]:
# Change categorical variables to "category"
insurance['sex'] = insurance['sex'].astype('category')
insurance['smoker'] = insurance['smoker'].astype('category')
insurance['region'] = insurance['region'].astype('category')

In [6]:
# Examine variable type
insurance.dtypes

age            int64
sex         category
bmi          float64
children       int64
smoker      category
region      category
expenses     float64
dtype: object

### 2. Prepare data set for clustering (2 points)

In [7]:
# Create dummy variables
insurance = pd.get_dummies(insurance, columns=['sex','smoker','region'], drop_first=True)
insurance

Unnamed: 0,age,bmi,children,expenses,sex_male,smoker_yes,region_northwest,region_southeast,region_southwest
0,19,27.9,0,16884.92,0,1,0,0,1
1,18,33.8,1,1725.55,1,0,0,1,0
2,28,33.0,3,4449.46,1,0,0,1,0
3,33,22.7,0,21984.47,1,0,1,0,0
4,32,28.9,0,3866.86,1,0,1,0,0
...,...,...,...,...,...,...,...,...,...
1333,50,31.0,3,10600.55,1,0,1,0,0
1334,18,31.9,0,2205.98,0,0,0,0,0
1335,18,36.9,0,1629.83,0,0,0,1,0
1336,21,25.8,0,2007.95,0,0,0,0,1
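The effect of `drop_first=True` can be sketched in plain Python. A categorical column with k levels becomes k-1 indicator columns, and the first level in sorted order is encoded as all zeros. Judging from the column names above, that baseline is presumably `northeast` for `region`, `female` for `sex`, and `no` for `smoker`. The helper `one_hot_drop_first` is hypothetical and only illustrates the idea; it is not how pandas implements `get_dummies`.

```python
def one_hot_drop_first(values, prefix):
    """Encode category labels as 0/1 indicator columns,
    dropping the first level (in sorted order) as the baseline."""
    levels = sorted(set(values))
    kept = levels[1:]  # the first level becomes the all-zeros baseline
    return [{f"{prefix}_{lvl}": int(v == lvl) for lvl in kept} for v in values]


# One example row per region; "northeast" is the inferred baseline level
regions = ["southwest", "southeast", "northwest", "northeast"]
encoded = one_hot_drop_first(regions, "region")
for region, row in zip(regions, encoded):
    print(region, row)
```

A row with zeros in all three region columns therefore belongs to the baseline region, which is worth keeping in mind when reading the cluster summaries later on.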


In [9]:
# Apply min-max normalization (2 points)
min_max_scaler = preprocessing.MinMaxScaler()
insurance_normalized = pd.DataFrame(
    min_max_scaler.fit_transform(insurance), columns=insurance.columns
)
insurance_normalized

Unnamed: 0,age,bmi,children,expenses,sex_male,smoker_yes,region_northwest,region_southeast,region_southwest
0,0.021739,0.320755,0.0,0.251611,0.0,1.0,0.0,0.0,1.0
1,0.000000,0.479784,0.2,0.009636,1.0,0.0,0.0,1.0,0.0
2,0.217391,0.458221,0.6,0.053115,1.0,0.0,0.0,1.0,0.0
3,0.326087,0.180593,0.0,0.333010,1.0,0.0,1.0,0.0,0.0
4,0.304348,0.347709,0.0,0.043816,1.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...
1333,0.695652,0.404313,0.6,0.151299,1.0,0.0,1.0,0.0,0.0
1334,0.000000,0.428571,0.0,0.017305,0.0,0.0,0.0,0.0,0.0
1335,0.000000,0.563342,0.0,0.008108,0.0,0.0,0.0,1.0,0.0
1336,0.065217,0.264151,0.0,0.014144,0.0,0.0,0.0,0.0,1.0
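As a rough check on the table above, min-max normalization maps each value x in a column to (x - min) / (max - min), so every column ends up in the range [0, 1]. The sketch below applies that formula by hand to a few ages. The maximum age of 64 is an assumption inferred from the output (age 19 maps to 0.021739, which implies a range of 46 above the minimum of 18), and `min_max_scale` is a hypothetical helper, not part of scikit-learn.

```python
def min_max_scale(values):
    """Rescale a list of numbers to [0, 1] using (x - min) / (max - min).
    Assumes the values are not all identical (otherwise max - min is zero)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# 18, 19, 33 and 50 appear in the rows shown; 64 is the assumed column maximum
ages = [18, 19, 33, 50, 64]
scaled = min_max_scale(ages)
print([round(s, 6) for s in scaled])
# → [0.0, 0.021739, 0.326087, 0.695652, 1.0]
```

The values for ages 19, 33 and 50 agree with the `age` column in the normalized table. Columns that are already 0/1 indicators are left unchanged by this transformation.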


### 3. Clustering model (4 points)

In [10]:
# Build a clustering model with n_clusters=3 and random_state=0 (1 point)
model1 = KMeans(n_clusters=3, random_state=0)
model1.fit(insurance_normalized)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=0, tol=0.0001, verbose=0)

In [11]:
# Generate the cluster labels (1 point)
model1.labels_

array([2, 0, 0, ..., 0, 2, 1], dtype=int32)

In [12]:
# Show cluster size (1 point)
Counter(model1.labels_)

Counter({0: 364, 1: 649, 2: 325})

In [16]:
# Show cluster centroids (1 point)
# Average the original (un-normalized) rows within each cluster
pd.DataFrame({
    'cluster 1': insurance[model1.labels_ == 0].mean(axis=0),
    'cluster 2': insurance[model1.labels_ == 1].mean(axis=0),
    'cluster 3': insurance[model1.labels_ == 2].mean(axis=0),
})

Unnamed: 0,cluster 1,cluster 2,cluster 3
age,38.93956,39.232666,39.455385
bmi,33.359341,29.18906,30.596615
children,1.049451,1.097072,1.141538
expenses,14735.411538,12911.218136,12346.937908
sex_male,0.519231,0.49923,0.501538
smoker_yes,0.25,0.192604,0.178462
region_northwest,0.0,0.50077,0.0
region_southeast,1.0,0.0,0.0
region_southwest,0.0,0.0,1.0
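The table above averages the un-normalized rows in each cluster, although the model was fitted on normalized data. The two views should agree because min-max scaling is an affine transformation of each column: the mean of the scaled values, mapped back to the original scale, equals the mean of the original values. The sketch below illustrates this with three expense values borrowed from the data preview; grouping them into one cluster, and the column minimum and maximum used, are assumptions made purely for illustration.

```python
def scale(v, lo, hi):
    """Min-max scale a single value."""
    return (v - lo) / (hi - lo)


def unscale(s, lo, hi):
    """Invert min-max scaling for a single value."""
    return s * (hi - lo) + lo


# Hypothetical cluster members (values taken from the preview rows)
members = [1725.55, 4449.46, 3866.86]
lo, hi = 1000.0, 60000.0  # assumed column min and max, not taken from the data

mean_original = sum(members) / len(members)
mean_scaled = sum(scale(v, lo, hi) for v in members) / len(members)

# Both numbers agree up to floating-point rounding
print(round(mean_original, 2), round(unscale(mean_scaled, lo, hi), 2))
```

In the notebook itself, a comparable result could likely be obtained with `min_max_scaler.inverse_transform(model1.cluster_centers_)`. Small differences from the table are possible if the algorithm stopped before fully converging, since the stored centroids are then not exactly the means of the final cluster members.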


***Download the html file and submit to BeachBoard***<br>
<br>
1.   ***Download the IS470_quiz2.ipynb file***
2.   ***Upload the IS470_quiz2.ipynb file***
3.   ***Run the code below to generate a html file***
4.   ***Download the html file and submit to BeachBoard***

In [17]:
!jupyter nbconvert --to html IS470_quiz2.ipynb

[NbConvertApp] Converting notebook IS470_quiz2.ipynb to html
[NbConvertApp] Writing 300907 bytes to IS470_quiz2.html
