## Starting Words

This notebook is aimed at people who are new to data analysis. It explains the data modeling process as simply as possible, including details of how I approach problem solving.

### Data Analysis Process
I follow these steps for data analysis:
1. Define the problem
2. Analyze and prepare the data
3. Develop and evaluate the models
4. Improve results
5. Present results

## Define the problem
The insurance company needs a program that predicts the insurance charge (premium) to be levied on a new client. The task is to develop a model that returns a charge amount when given certain inputs about the new client. The company has data on existing clients, including age, sex, BMI, smoking habit, and other attributes. Based on these variables, the model is expected to predict the charge for a new client as accurately as possible.

Calculating an insurance charge is a complex and time-consuming process. Developing a model to predict the insurance charge will help the company avoid the burden of manual calculation and reduce the risk of human error. Once the model is developed, the manual calculation steps are removed from the process, which improves efficiency and work speed and reduces operating costs.

In this problem, the target variable is numeric, so we will develop regression models (both linear and non-linear). The accuracy of the models will be measured in terms of R square.

## Analyze and Prepare the data
Analyzing the data includes descriptive analysis and visualization. We will not carry out a full EDA (Exploratory Data Analysis) and will focus on basic analysis only. Let us begin by importing the necessary libraries and the data.

In [30]:
### Import the necessary libraries

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

### Library to encode categorical variable
from sklearn.preprocessing import LabelEncoder

### Linear Regression model libraries
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR


### Import the data as a pandas DataFrame
df = pd.read_csv('../input/data-visualizatiion/insurance.csv')

In [31]:
### Data summary
df.shape

The dataset contains 1338 entities (rows) and 7 attributes (columns).

In [32]:
df.head(10)

We can observe the first 10 entries of the dataset. Here, age, bmi, children and charges are numeric, and the others are categorical. In a real case, we may have a large number of attributes (more than 50), and it is not practical to identify the type of each variable by visual inspection. Hence, we use `df.dtypes` to check the type of each variable.

In [33]:
df.dtypes

Here, we can check the data type of each attribute. The number of children could also be treated as categorical, but pandas read it as numeric because the values are numbers. Hence, we will change its data type to object.

In [34]:
df.children = df.children.astype(object)
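
When a dataset has many columns, the numeric and categorical columns can also be separated programmatically. The following is a minimal sketch, assuming a small made-up sample (the `sample` frame and its values are hypothetical and are not taken from the insurance file); it uses `select_dtypes` to split the column names by type.

```python
import pandas as pd

# Hypothetical three-row sample that mimics some of the insurance columns.
sample = pd.DataFrame({
    "age": [19, 18, 28],
    "sex": ["female", "male", "male"],
    "bmi": [27.9, 33.8, 33.0],
    "children": [0, 1, 3],
    "smoker": ["yes", "no", "no"],
    "charges": [16000.0, 1700.0, 4400.0],
})

# Treat the child count as a category, as this notebook does.
sample["children"] = sample["children"].astype(object)

# Split the column names by data type.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```

After the conversion, `children` is listed with the categorical columns even though its values are numbers.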

One of the most important aspects of data preparation is checking for missing values (NaN) in the dataset and making suitable adjustments.

In [35]:
### Check whether there is any missing values
df.isnull().sum()

Here, the number of missing values for each column is shown. Since all the columns have zero missing values, we do not need any adjustments.
Now, we continue with the descriptive analysis of the numeric attributes.

In [36]:
### Descriptive Statistics
df.describe()

The descriptive analysis provides eight summary statistics for each numeric attribute, as shown in the table. For the age and bmi attributes, the mean and median are similar, so we can expect little skewness. However, the median of charges is lower than the mean, signaling the presence of skewness. Moreover, the maximum value of charges is well above the mean and is the likely cause of the skewness. Hence, we can treat the highest charges as outliers and remove them from the data for analysis. For this, let us check the share of records with charges above 45,000.

In [37]:
df[df['charges'] > 45000]['charges'].count()/df['charges'].count()

About 3% of the records have charges above 45,000. We treat them as outliers and remove them from the dataset.

In [38]:
### Remove outliers (copy so that later column assignments act on an independent DataFrame)
df = df[df['charges'] <= 45000].copy()
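
The mean-versus-median reasoning behind this cut can also be checked numerically. The sketch below is only an illustration on made-up numbers (the `charges` series is hypothetical, not the insurance data); it shows how one very large value pulls the mean above the median and produces positive skewness, and how the skewness changes once that value is removed.

```python
import pandas as pd

# Made-up charges: most values are small, one is very large.
charges = pd.Series([1000.0, 2000.0, 3000.0, 4000.0, 50000.0])

mean_value = charges.mean()      # pulled upward by the large value
median_value = charges.median()  # not affected by the size of the top value
skewness = charges.skew()        # positive for a long right tail

# Apply the same kind of threshold filter used in this notebook.
trimmed = charges[charges <= 45000]
trimmed_skewness = trimmed.skew()

print(f"mean={mean_value}, median={median_value}, skew={skewness:.2f}")
print(f"skew after trimming={trimmed_skewness:.2f}")
```

On the real data, the same two calls (`df['charges'].skew()` before and after the filter) would show how much the cut actually helps.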

### Data Visualization
We will continue the analysis of data through visualization.

In [39]:
### Numerical variable visualization
sns.pairplot(df)

While analyzing the numeric data, we should look for correlation between pairs of attributes, with special focus on the target variable, that is, charges. Here, charges and age show a roughly linear growth pattern, signaling a possible correlation. The plot of bmi against charges has no clear pattern, so a correlation is less likely.
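
The visual impression from a pair plot can be backed by a correlation coefficient. The following is a minimal sketch on made-up values (the `toy` frame is hypothetical and chosen so that age tracks charges while bmi does not); it uses `DataFrame.corr`, which computes the Pearson correlation by default.

```python
import pandas as pd

# Made-up values: charges rise with age, bmi moves independently.
toy = pd.DataFrame({
    "age": [20, 30, 40, 50, 60],
    "bmi": [30.0, 22.0, 35.0, 25.0, 28.0],
    "charges": [2000.0, 4000.0, 6000.0, 8000.0, 10000.0],
})

# Pearson correlation of every numeric column with the target.
corr_with_charges = toy.corr()["charges"].sort_values(ascending=False)
print(corr_with_charges)
```

A value close to 1 or -1 points to a strong linear relation, and a value close to 0 points to little or none.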

In [40]:
### Numerical and categorical variable visualization
sns.pairplot(df, hue = 'sex')
sns.pairplot(df, hue = 'region')
sns.pairplot(df, hue = 'children')
sns.pairplot(df, hue = 'smoker')

The plots colored by the smoker attribute show a clear pattern. In general, charges are lower for people who do not smoke, so smoking habit likely affects the insurance charge. The other categorical variables appear unlikely to have a significant impact on charges.
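
A group-wise summary is a quick numeric companion to these plots. The sketch below uses a made-up frame (the `toy` data is hypothetical, not the insurance file) and compares the mean and median charges of smokers and non-smokers with `groupby`.

```python
import pandas as pd

# Made-up records: smokers are given visibly higher charges.
toy = pd.DataFrame({
    "smoker": ["yes", "no", "yes", "no", "no", "yes"],
    "charges": [30000.0, 4000.0, 26000.0, 6000.0, 5000.0, 34000.0],
})

# Mean and median charges for each smoker group.
by_smoker = toy.groupby("smoker")["charges"].agg(["mean", "median"])
print(by_smoker)
```

Running the same `groupby` on `df` with `'sex'`, `'region'` or `'children'` in place of `'smoker'` would give a comparable check for the other categorical variables.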

## Develop and Evaluate the model
Now we move to the next step, where we develop the models and evaluate their performance. Here, we need to understand three concepts: Machine Learning Algorithm, Resampling and Performance Metrics.

### Machine Learning Algorithm
A machine learning algorithm is the method used to carry out a specified task, generally predicting output values from given input data. For this problem, we apply regression methods (both linear and non-linear) from supervised machine learning and select the best-performing one.

The linear machine learning algorithms are as follows:
1. Linear Regression.
2. Ridge Regression.
3. LASSO Linear Regression.
4. Elastic Net Regression.

The nonlinear machine learning algorithms are:
1. k-Nearest Neighbors.
2. Classification and Regression Trees.
3. Support Vector Machines.

### Resampling
To evaluate the performance of a machine learning algorithm, we can use the train-test method, where the available data is divided into a train set and a test set in a certain ratio (generally 80:20). We build the model on the train set and make predictions on the test set, then measure accuracy by comparing the predictions to the actual target values in the test set.

The problem with this method is that we only have a single estimate, with little idea of the variability or uncertainty in the estimate. We can address this issue by estimating the performance parameter multiple times from our data sample. In other words, we change the sample data in train and test datasets multiple times and measure the accuracy of the model.

The different techniques that we can use for resampling are:
* Train and Test Sets.
* k-fold Cross Validation.
* Leave One Out Cross Validation.
* Repeated Random Test-Train Splits.
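
The resampling techniques listed above can be sketched with scikit-learn splitters. This is only an illustration under stated assumptions: the data is synthetic (generated with NumPy, not the insurance data), and mean absolute error is used as the score because R square is not defined on the single-sample test folds that leave-one-out produces.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import (
    KFold, LeaveOneOut, ShuffleSplit, cross_val_score,
)

# Synthetic data: a noisy linear relation (stand-in for real features).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=60)

splitters = {
    "k-fold (k=5)": KFold(n_splits=5, shuffle=True, random_state=0),
    "leave-one-out": LeaveOneOut(),
    "repeated random splits": ShuffleSplit(n_splits=10, test_size=0.2, random_state=0),
}

cv_summary = {}
for name, splitter in splitters.items():
    # scikit-learn reports negated errors, so flip the sign back.
    scores = -cross_val_score(
        LinearRegression(), X, y, cv=splitter, scoring="neg_mean_absolute_error"
    )
    cv_summary[name] = (scores.mean(), scores.std(), len(scores))
    print(f"{name}: MAE {scores.mean():.3f} (std {scores.std():.3f}) over {len(scores)} splits")
```

Each technique yields several scores instead of one, so both the average performance and its spread can be reported.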

### Performance Metrics
Performance metrics are the indicators used to evaluate the performance of machine learning algorithms. The most common metrics for evaluating predictions on regression machine learning problems are:
* Mean Absolute Error.
* Mean Squared Error.
* R square
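
These three metrics can be computed with `sklearn.metrics`. The following is a small sketch on made-up actual and predicted values (the arrays `y_true` and `y_pred` are hypothetical); it also recomputes R square by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted values; every prediction is off by 10.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 390.0])

mae = mean_absolute_error(y_true, y_pred)  # mean of |error|
mse = mean_squared_error(y_true, y_pred)   # mean of error squared
r2 = r2_score(y_true, y_pred)              # 1 - SS_res / SS_tot

# The same R square computed by hand.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(f"MAE={mae}, MSE={mse}, R square={r2:.3f}")
```

MAE and MSE are in the units of the target (and its square), while R square is unitless, which makes it convenient for comparing models.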

### Work Model
To better understand ML algorithms and resampling, we will first develop a single linear regression model with a train-test split and then develop multiple models with cross-validation resampling. We will use R square as the performance metric.

First, we will convert the object variables into categorical variables and then apply `LabelEncoder` to convert them into numeric codes.

In [41]:
##Converting objects labels into categorical
df[['sex', 'smoker', 'region', 'children']] = df[['sex', 'smoker', 'region', 'children']].astype('category')
df.dtypes

##Converting category labels into numerical using LabelEncoder
label = LabelEncoder()

label.fit(df.sex.drop_duplicates())
df.sex = label.transform(df.sex)

label.fit(df.smoker.drop_duplicates())
df.smoker = label.transform(df.smoker)

label.fit(df.region.drop_duplicates())
df.region = label.transform(df.region)

label.fit(df.children.drop_duplicates())
df.children = label.transform(df.children)

df.dtypes
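
To see what the encoder does, here is a minimal sketch on a short made-up list of labels (the `regions` list is illustrative only). `LabelEncoder` sorts the distinct labels and assigns them the integers 0, 1, 2, and so on.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative labels, not read from the insurance file.
regions = ["southwest", "southeast", "northwest", "northeast", "southeast"]

encoder = LabelEncoder()
codes = encoder.fit_transform(regions)

# classes_ holds the sorted labels; the position of a label is its code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}

print(mapping)
print(codes.tolist())
```

Note that the integer codes follow alphabetical order, which has no real meaning for a variable such as region; a linear model will still treat the codes as magnitudes. One-hot encoding (for example with `pd.get_dummies`) is a common alternative when this matters.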

In [42]:
### Separating the DataFrame into independent and target variables
x = df.drop(['charges'], axis = 1)  #Independent variables
y = df['charges']                   #Target variable

Here, we have all independent variables in the dataframe x and target variable is in y.

In [43]:
### Linear regression with train-test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.20)
model = LinearRegression()
model.fit(X_train, y_train)
r_sq = model.score(X_test, y_test)  # score on the held-out test set, not on the full data
print(f"coefficient of determination: {r_sq}")

We first split the dataset into train and test sets in the ratio 80:20. We then used linear regression as our model and fitted it on the train set. The coefficient of determination, or R square, is printed.

### Evaluating multiple models
Now, we will develop different regression models and evaluate their performance. For this, we create a list of models, one for each ML algorithm. The result captures the average R square for each model based on the cross-validation method.

In [44]:
### Prepare a list of models
models = []
models.append(('Linear', LinearRegression()))
models.append(('Ridge', Ridge(alpha = 0.5)))
# Other Lasso arguments are left at their defaults; 'normalize' was removed in scikit-learn 1.2
models.append(('Lasso', Lasso(alpha=0.2)))
models.append(('Elastic', ElasticNet()))
models.append(('Kneighbour', KNeighborsRegressor()))
models.append(('CART',  DecisionTreeRegressor()))
models.append(('SVR',  SVR()))

In [45]:
### Evaluate each model in terms of R square
results = []
names = []
scoring = 'r2'  # R square; 'accuracy' applies only to classification

for name, model in models:
    kfold = KFold(n_splits=10, random_state=7, shuffle = True)
    # Pass the 10-fold splitter and the metric explicitly so they are actually used
    cv_results = cross_val_score(model, x, y, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

Hence, the Linear, Ridge and Lasso regression methods have similar accuracy and are the best methods for the current dataset. Support Vector Regression (SVR) has the worst performance.

In [46]:
# Visual representation of model comparison
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

## Improve Results
We have further options to improve the result with ensembles and algorithm tuning. However, the current dataset is not that complex and the result obtained is satisfactory. Hence, further work on improving the result is not required.
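
For reference, both options can be sketched briefly. This is an illustration under stated assumptions and not part of this notebook's results: the data is synthetic, the grid of `alpha` values is arbitrary, and a random forest is picked as one example of an ensemble.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Synthetic data with a linear part and a small non-linear part.
rng = np.random.default_rng(7)
X = rng.normal(size=(120, 3))
y = 4.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.3, size=120)

kfold = KFold(n_splits=5, shuffle=True, random_state=7)

# Algorithm tuning: search over the Ridge penalty strength.
search = GridSearchCV(
    Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=kfold, scoring="r2"
)
search.fit(X, y)
print("best alpha:", search.best_params_["alpha"])
print(f"best Ridge R square: {search.best_score_:.3f}")

# Ensemble: average the predictions of many decision trees.
forest = RandomForestRegressor(n_estimators=100, random_state=7)
forest_r2 = cross_val_score(forest, X, y, cv=kfold, scoring="r2").mean()
print(f"random forest R square: {forest_r2:.3f}")
```

The same pattern would apply to the insurance data by passing `x` and `y` in place of the synthetic arrays.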

## Present Results
The result can be presented in different forms - slides, a report and other methods. For my work, this notebook serves as the document that presents the result.