# Part 1: Cleaning Data
Following the Data Science Methodology, this part covers Business Understanding, Analytic Approach, Data Requirements, Data Collection, Data Understanding, and Data Preparation.

## <h2 style="text-align:center">Data Science Methodology</h2>
<p>
<center><img src="https://miro.medium.com/max/1838/1*YPsZO50dIiEKpW9RqzqsTw.jpeg" alt = "Data Science Methodology" width = "400" height = "300"></center>
<ol>
<li><strong>Business Understanding:</strong>
<p> Medical insurance is an important part of financial security. If something threatens our health, such as an illness or an accident, medical insurance covers our medical expenses, and in exchange we pay an insurance fee every month. The companies that provide medical insurance, however, face uncertainty about medical costs. Medical costs rise every year, but if a company raises every customer's fee in step with those rising costs, it will lose customers. To manage this problem, companies usually assess the risk of each customer and set an insurance fee that matches that customer's risk profile.
<p> Based on this problem, I want to answer two questions:
<ul>
<li>What features are the most influential in determining the insurance fee of each customer?</li>
<li>Can I determine the insurance fee of each customer automatically?</li>
</ul>
</li>


<p>
<li><strong>Analytic Approach:</strong>
<p>To answer these questions, I will use a machine learning approach: a supervised model that solves a regression problem, such as linear, ridge, or lasso regression. With such a model, I can find out which features are most influential and determine the insurance fee automatically.
</li>

<p>
<li><strong>Data Requirements:</strong>
<p>The data I need is historical data on customers' insurance fees together with their demographic attributes, such as sex, BMI, and smoking status.
</li>

<p>
<li><strong>Data Collection:</strong>
<p>Unfortunately, each company keeps its data on insurance fees and customer demographics confidential. However, I found a suitable dataset on <strong>Kaggle</strong>, the <a href = "https://www.kaggle.com/mirichoi0218/insurance?ref=hackernoon.com">Medical Cost Personal Dataset</a> from <strong>Zack Stedy</strong>, who provides it for free for anyone who wants to learn machine learning.
</ol>
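
As a rough illustration of the regression approach described in the Analytic Approach step, here is a minimal sketch using base R's `lm()` on simulated data. The data frame `toy`, its columns, and the numbers used to generate it are all made up for illustration and are not taken from the insurance dataset. Ridge and lasso need an extra package, so the sketch only covers plain linear regression.

```r
# Simulated toy data (NOT the insurance dataset): the fee depends strongly on
# smoking, moderately on age, and not at all on the noise column.
set.seed(42)
n <- 200
toy <- data.frame(
  age    = sample(18:64, n, replace = TRUE),
  smoker = sample(c("yes", "no"), n, replace = TRUE),
  noise  = rnorm(n)
)
toy$fee <- 2000 + 250 * toy$age + 20000 * (toy$smoker == "yes") +
  rnorm(n, sd = 3000)

# Ordinary least squares; lm() dummy-codes the smoker column automatically
fit <- lm(fee ~ age + smoker + noise, data = toy)

# Coefficient table: estimate, standard error, t value, and p value per term
coefs <- summary(fit)$coefficients
print(round(coefs, 3))

# Predict the fee of a new (hypothetical) customer
new_customer <- data.frame(age = 40, smoker = "yes", noise = 0)
predicted_fee <- predict(fit, newdata = new_customer)
print(predicted_fee)
```

The `t value` column gives one simple way to compare how strongly each feature is tied to the fee, and `predict()` shows how a fitted model can produce a fee automatically.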

## <strong>5. Data Understanding:</strong>
<p>The first thing I do in this stage is understand the features in my data. I will probably come back to this stage later, because the Data Science Methodology is iterative and allows us to return to earlier stages.

In [1]:
data <- read.csv("insurance.csv")

In [2]:
df <- data.frame(data)  # read.csv() already returns a data frame; df is the name used from here on

In [3]:
head(df)

| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| | &lt;int&gt; | &lt;chr&gt; | &lt;dbl&gt; | &lt;int&gt; | &lt;chr&gt; | &lt;chr&gt; | &lt;dbl&gt; |
| 1 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 2 | 18 | male | 33.77 | 1 | no | southeast | 1725.552 |
| 3 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 4 | 33 | male | 22.705 | 0 | no | northwest | 21984.471 |
| 5 | 32 | male | 28.88 | 0 | no | northwest | 3866.855 |
| 6 | 31 | female | 25.74 | 0 | no | southeast | 3756.622 |


In [4]:
str(df)

'data.frame':	1338 obs. of  7 variables:
 $ age     : int  19 18 28 33 32 31 46 37 37 60 ...
 $ sex     : chr  "female" "male" "male" "male" ...
 $ bmi     : num  27.9 33.8 33 22.7 28.9 ...
 $ children: int  0 1 3 0 0 0 1 3 2 0 ...
 $ smoker  : chr  "yes" "no" "no" "no" ...
 $ region  : chr  "southwest" "southeast" "southeast" "northwest" ...
 $ charges : num  16885 1726 4449 21984 3867 ...


- From the output above we can see which features I have (7 features), the number of instances (1338 instances), and the data type of each feature
- Description about my features:
    - age(int): age of customer
    - sex(chr): gender of customer
    - bmi(num): Body mass index (kg / m ^ 2)
    - children(int): Number of children covered by health insurance
    - smoker(chr): whether the customer smokes (yes or no)
    - region(chr): the beneficiary's residential area in the US (northeast, southeast, southwest, northwest)
    - charges(num): Individual medical costs billed by health insurance
- In this stage, I understand a few things about my dataset:
    - My dataset was collected in the US
    - BMI is recorded to find out whether the customer is obese or not:
        - Below 18.5: Underweight
        - 18.5 - 24.9: Healthy
        - 25.0 - 29.9: Overweight
        - 30.0 and above: Obese
    - Smoking status is recorded because people who smoke are more likely to develop serious illnesses such as cancer
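
The BMI ranges above can be turned into a categorical feature with `cut()`. This is only a sketch: the name `bmi_category` is my own, and placing the boundaries at exactly 18.5, 25, and 30 is my reading of the ranges listed above. The six BMI values are copied from the `head(df)` output.

```r
# BMI values of the first six customers, copied from the head(df) output
bmi <- c(27.9, 33.77, 33.0, 22.705, 28.88, 25.74)

# right = FALSE makes each interval closed on the left:
# [-Inf, 18.5), [18.5, 25), [25, 30), [30, Inf)
bmi_category <- cut(
  bmi,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  labels = c("Underweight", "Healthy", "Overweight", "Obese"),
  right  = FALSE
)

print(data.frame(bmi, bmi_category))

# Number of customers per category
print(table(bmi_category))
```

With `right = FALSE`, a BMI of exactly 30 is labelled Obese and a BMI of exactly 25 is labelled Overweight, which matches the lower bounds of the ranges in the list.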

## Check Missing Values:
- In this stage, I will check whether the dataset has any missing values

In [5]:
sprintf("Number of missing values in this dataset: %d", sum(is.na(df)))

Fortunately, this dataset has no missing values, so I can skip the "Cleaning Missing Values" step.
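
A single total does not show where missing values would be, and `is.na()` does not count empty strings in character columns (with its default settings, `read.csv()` keeps blank text fields as `""`). The sketch below counts both per column. The data frame `toy` and its values are made up for illustration; the same two expressions should work on `df`.

```r
# Toy data frame with one NA and one empty string (hypothetical values)
toy <- data.frame(
  age    = c(19, NA, 28),
  smoker = c("yes", "", "no"),
  stringsAsFactors = FALSE
)

# Missing values per column, as counted by is.na()
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Empty or whitespace-only strings per character column
is_chr <- vapply(toy, is.character, logical(1))
blank_per_column <- vapply(
  toy[is_chr],
  function(x) sum(trimws(x) == "", na.rm = TRUE),
  integer(1)
)
print(blank_per_column)
```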

## Check Duplicate Values:
- Besides missing values, I also need to check for duplicate rows

In [6]:
sprintf("Number of duplicated rows in this dataset: %d", sum(duplicated(df)))

In [7]:
df[duplicated(df),]

| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| | &lt;int&gt; | &lt;chr&gt; | &lt;dbl&gt; | &lt;int&gt; | &lt;chr&gt; | &lt;chr&gt; | &lt;dbl&gt; |
| 582 | 19 | male | 30.59 | 0 | no | northwest | 1639.563 |


There is 1 duplicated row in this dataset, so I will drop that duplicate.
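
Note that `duplicated()` flags only the second and later copies of a row, which is why a single row is listed and not the pair. To see every copy side by side, the call can be combined with `fromLast = TRUE`. The sketch below uses a small data frame `toy` built from values shown in the outputs above (the duplicated row and one row from `head(df)`), with only four of the seven columns for brevity.

```r
# Rows 1 and 3 are identical; row 2 is a different customer
toy <- data.frame(
  age     = c(19, 18, 19),
  sex     = c("male", "male", "male"),
  bmi     = c(30.59, 33.77, 30.59),
  charges = c(1639.563, 1725.552, 1639.563)
)

# duplicated() marks only the later copy
later_copies <- duplicated(toy)
print(later_copies)   # FALSE FALSE TRUE

# Scanning from both ends marks every row that has a twin
all_copies <- duplicated(toy) | duplicated(toy, fromLast = TRUE)
print(toy[all_copies, ])
```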

## Check Data Frame Summary:
- This dataset may contain invalid input values. To check for that, I will look at the summary of the dataset

In [8]:
summary(df)

      age            sex                 bmi           children    
 Min.   :18.00   Length:1338        Min.   :15.96   Min.   :0.000  
 1st Qu.:27.00   Class :character   1st Qu.:26.30   1st Qu.:0.000  
 Median :39.00   Mode  :character   Median :30.40   Median :1.000  
 Mean   :39.21                      Mean   :30.66   Mean   :1.095  
 3rd Qu.:51.00                      3rd Qu.:34.69   3rd Qu.:2.000  
 Max.   :64.00                      Max.   :53.13   Max.   :5.000  
    smoker             region             charges     
 Length:1338        Length:1338        Min.   : 1122  
 Class :character   Class :character   1st Qu.: 4740  
 Mode  :character   Mode  :character   Median : 9382  
                                       Mean   :13270  
                                       3rd Qu.:16640  
                                       Max.   :63770  

All the summary statistics make sense, so I assume that this data does not contain invalid input.
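
For the character columns, `summary()` reports only the length and class, so it cannot reveal a misspelled category. A possible extra check is sketched below. The data frame `toy` holds the character columns of the six rows shown by `head(df)`, and the `allowed` list is an assumption on my part: the region names come from the feature description, while the values for `sex` and `smoker` are simply the ones visible in the output.

```r
# Character columns of the first six rows, copied from the head(df) output
toy <- data.frame(
  sex    = c("female", "male", "male", "male", "male", "female"),
  smoker = c("yes", "no", "no", "no", "no", "no"),
  region = c("southwest", "southeast", "southeast",
             "northwest", "northwest", "southeast")
)

# Values each column is expected to take (assumed, see the note above)
allowed <- list(
  sex    = c("female", "male"),
  smoker = c("yes", "no"),
  region = c("northeast", "southeast", "southwest", "northwest")
)

# Frequency table per column
for (col in names(allowed)) print(table(toy[[col]]))

# For each column, collect the values that are not in the expected set
unexpected <- lapply(names(allowed), function(col) {
  setdiff(unique(as.character(toy[[col]])), allowed[[col]])
})
names(unexpected) <- names(allowed)
print(unexpected)
```

An empty result for every column means no unexpected category was found in the rows that were checked.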

## <strong>6. Data Preparation:</strong>
<p>Now that I have checked everything I need to check, I can clean this dataset. The only cleaning step is removing the duplicate row, and I will save the result in a new CSV file.

In [9]:
dfclean <- unique(df)  # keep the first copy of each row and drop later duplicates

In [10]:
head(dfclean)

| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| | &lt;int&gt; | &lt;chr&gt; | &lt;dbl&gt; | &lt;int&gt; | &lt;chr&gt; | &lt;chr&gt; | &lt;dbl&gt; |
| 1 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 2 | 18 | male | 33.77 | 1 | no | southeast | 1725.552 |
| 3 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 4 | 33 | male | 22.705 | 0 | no | northwest | 21984.471 |
| 5 | 32 | male | 28.88 | 0 | no | northwest | 3866.855 |
| 6 | 31 | female | 25.74 | 0 | no | southeast | 3756.622 |


In [11]:
write.csv(dfclean, "insuranceDataCleaned.csv", row.names = FALSE)
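
Before moving on, it can be worth confirming that the cleaning removed exactly one row and that the saved file reads back unchanged. The sketch below does this on a small stand-in data frame `toy` (three rows built from values shown earlier, one of them duplicated) and writes to a temporary file, so it does not touch `insuranceDataCleaned.csv`. The names `toy`, `toy_clean`, and `reloaded` are my own.

```r
# Stand-in for df: rows 1 and 3 are identical
toy <- data.frame(
  age     = c(19L, 18L, 19L),
  smoker  = c("no", "no", "no"),
  charges = c(1639.563, 1725.552, 1639.563)
)
toy_clean <- unique(toy)

# The cleaned data should have no duplicates and exactly one row fewer
rows_removed    <- nrow(toy) - nrow(toy_clean)
duplicates_left <- sum(duplicated(toy_clean))
print(c(rows_removed = rows_removed, duplicates_left = duplicates_left))

# Write to a temporary file and read it back
path <- tempfile(fileext = ".csv")
write.csv(toy_clean, path, row.names = FALSE)
reloaded <- read.csv(path)

# Compare shape and then each column in turn
same_shape  <- identical(dim(reloaded), dim(toy_clean))
same_values <- all(mapply(
  function(a, b) isTRUE(all.equal(a, b)),
  reloaded, toy_clean
))
print(c(same_shape = same_shape, same_values = same_values))
```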