**Principle of the DM**


* To build our forecasting model, we follow the CRISP-DM methodology.
* The goal is to transform data into knowledge.
* For any project and any data set, building knowledge from the data goes through these 4 steps:
    1. Get the data
    2. Prepare the data (data exploration, data cleaning, data reduction, data transformation)
    3. Extract patterns
    4. Evaluate patterns
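
The four steps can be sketched end to end on synthetic data. This is only an illustration under stated assumptions: the `toy` frame, its linear relation between age and charges, and every variable name below are made up, and linear regression is just one simple choice of model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# 1. Get the data (synthetic stand-in; a real project would read a file instead)
rng = np.random.default_rng(0)
toy = pd.DataFrame({"age": rng.integers(18, 65, size=200)})
toy["charges"] = 250.0 * toy["age"] + rng.normal(0.0, 500.0, size=200)

# 2. Prepare the data (here: drop duplicated rows and rows with missing values)
toy = toy.drop_duplicates().dropna()

# 3. Extract a pattern (fit a model on a training subset)
X_train, X_test, y_train, y_test = train_test_split(
    toy[["age"]], toy["charges"], test_size=0.2, random_state=0
)
toy_model = LinearRegression().fit(X_train, y_train)

# 4. Evaluate the pattern on rows the model has not seen
toy_mae = mean_absolute_error(y_test, toy_model.predict(X_test))
print(f"slope: {toy_model.coef_[0]:.1f}, test MAE: {toy_mae:.1f}")
```

The fitted slope should land close to the 250 used to generate the data, which is a quick sanity check that the extracted pattern reflects the data rather than noise.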





# Medical Cost



### * Business understanding


This notebook explores a data set of treatment costs for different patients, with the aim of predicting these medical costs from attributes such as age, BMI, ...

### * Defining business objectives



* Make the data more accessible and understandable for everyone.
* Provide rapid analyses of the various factors affecting medical costs.
* Use different prediction and classification algorithms.
* Calculate and compare evaluation measures.




##### importing libraries



In [None]:
#import Python libraries
import pandas as pd #for DataFrames  -- resembles relational DB and SQL
import numpy as np #for mathematical operations -- resembles Matlab

In [None]:
# import the libraries used to plot and visualize the data
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)
import plotly.graph_objs as go
import warnings
warnings.filterwarnings('ignore')

In [None]:
# import the scikit-learn tools used below: data splitting, linear models, and evaluation metrics
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
from sklearn import metrics

## 1.Get Data

##### In this part we load the data and get a first understanding of it

In [None]:
# set the path of the data file
filepath = "../input/costmedical.csv"
df = pd.read_csv(filepath)

In [None]:
df.info()

In [None]:
type(df)

In [None]:
# The number of rows and columns in this data frame
df.shape

In [None]:
# display the first 10 lines of the data
df.head(10)

In [None]:
# list of column names
df.columns

In [None]:
#the data type of each column
df.dtypes

In [None]:
# in this cell we look at the distinct values of the object-type columns
df.sex.unique()


In [None]:
df.smoker.unique()


In [None]:
df.region.unique()

In [None]:
# unique() is generally used on object-type attributes; it also works on other attribute types, but is less useful there
df.age.unique()

In [None]:
df.describe()

###### Today, we will explore a set of data dedicated to the treatment costs of different patients
###### Techniques: regression, SVM

## 2- DATA PREPARATION

### Explore the data


##### - Check the missing value


We want to know how many missing values there are in each attribute (column).
To do this, use the function isna() or isnull(); the two are equivalent.

In [None]:
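# correlation is only defined for numeric columns, so select them first
df.select_dtypes(include='number').corr()

In [None]:
df.select_dtypes(include='number').corr()['charges'].sort_values()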

In [None]:
# are there missing (null) values in our data?
df.isnull().sum()

Good: there are no missing values in the data. Let's look at the data more closely, keeping in mind that we are mainly interested in the amount of the costs.
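
As a small illustration of how such a missing-value check behaves when gaps do exist, here is a sketch on made-up rows. The `toy` frame and its values are invented for the example and are not taken from the data set.

```python
import numpy as np
import pandas as pd

# Made-up rows with two gaps, only to show how the counts are produced
toy = pd.DataFrame({
    "age": [19, 33, np.nan, 45],
    "smoker": ["yes", None, "no", "no"],
    "charges": [16884.9, 4449.5, 3866.9, 8240.6],
})

missing_per_column = toy.isna().sum()     # one count per column
same_result = toy.isnull().sum()          # isnull is an alias of isna
rows_with_gaps = int(toy.isna().any(axis=1).sum())

print(missing_per_column)
print("rows with at least one missing value:", rows_with_gaps)
```

Counting per column shows where the gaps are; counting per row shows how many records would be lost if incomplete rows were simply dropped.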

##### - Check duplicate values


To find duplicated rows, we use the duplicated() function.

In [None]:
df.duplicated().sum()


- We have a duplicate row; it must be removed (data cleaning phase) with the function drop_duplicates().

In [None]:
df.duplicated().shape

In [None]:
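# drop_duplicates returns a new DataFrame, so assign it back to actually remove the duplicate
df = df.drop_duplicates(keep='first')
df.shape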
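
Note that `drop_duplicates` returns a new DataFrame rather than changing the original one. A minimal sketch on made-up rows (the `toy` frame is hypothetical) shows the difference between calling it and assigning its result:

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [19, 19, 28],
    "sex": ["male", "male", "female"],
    "charges": [1639.56, 1639.56, 4449.46],
})

n_dupes = int(toy.duplicated().sum())      # rows identical to an earlier row
toy.drop_duplicates(keep="first")          # result discarded: toy is unchanged
rows_unassigned = len(toy)
toy = toy.drop_duplicates(keep="first")    # assigned back: duplicate removed
rows_assigned = len(toy)
print(n_dupes, rows_unassigned, rows_assigned)   # prints: 1 3 2
```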

### Visualize distribution of attribute values



This will help us identify bad values and bad attributes ...


In [None]:
# value_counts() returns the number of occurrences of each unique value,
# which gives the distribution of a particular column of the DataFrame.
# By default it returns absolute frequencies, sorted from most to least frequent.
df.age.value_counts()
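
For comparison, relative frequencies can be obtained with the `normalize` parameter. The short sketch below uses a made-up `ages` series rather than the real column:

```python
import pandas as pd

ages = pd.Series([18, 18, 19, 18, 23, 19], name="age")

absolute = ages.value_counts()                 # counts, most frequent first
relative = ages.value_counts(normalize=True)   # same order, as fractions of the total
print(absolute)
print(relative.round(3))
```

Relative frequencies are often easier to compare across columns or data sets of different sizes.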

In [None]:
# plots give a clearer view of the data
plt.figure(figsize=(12,5))
plt.title("Distribution of age")
ax = sns.histplot(df["age"], kde=True, stat="density")

###### * Note that the most common ages are between 18 and 23.

In [None]:
# we can also display the least frequent values
df.age.value_counts().tail(n=4)

In [None]:
# then show only the 10 most frequent values
df.age.value_counts().head(n=10)

In [None]:
# number of patients of each sex, in the same order as the pie-chart labels below
sex_counts = df.sex.value_counts()
gender_list = [int(sex_counts["female"]), int(sex_counts["male"])]
gender_list

In [None]:
labels = ["Female", "Male"]
values = gender_list
trace = go.Pie(labels=labels, values=values,
               hoverinfo='label+percent', textinfo='percent', 
               textfont=dict(size=20),
               )
data = [trace]
layout = go.Layout(title='Rate of Males & Females')
fig = go.Figure(data = data, layout = layout)
iplot(fig)

In [None]:
df.bmi.value_counts()

In [None]:
plt.figure(figsize=(12,5))
plt.title("Distribution of BMI")
ax = sns.histplot(df["bmi"], kde=True, stat="density")

In [None]:
df.children.value_counts()

In [None]:
df.children.value_counts().plot(kind='bar')

In [None]:
df.smoker.value_counts()

In [None]:
df.smoker.value_counts().plot(kind='bar')

In [None]:
df.region.value_counts()

In [None]:
df.region.value_counts().plot(kind='bar')

In [None]:
df.charges.value_counts()

In [None]:
plt.figure(figsize=(12,5))
plt.title("Distribution of charges")
ax = sns.histplot(df["charges"], color = 'y', kde=True, stat="density")

In [None]:
# describe() gives simple statistical summaries of ALL the numeric columns of the DataFrame
df.describe()

In [None]:
# describe() can also summarize the object-type columns
df.describe(include=['object'])

In [None]:
df.describe(include=['int64'])

### Visualize the relationship between attributes
This is useful for detecting *irrelevant* and *redundant* attributes.

In [None]:
# corr() computes the linear correlation coefficient between pairs of NUMERIC attributes,
# so the numeric columns are selected first
df.select_dtypes(include='number').corr()
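
The linear (Pearson) correlation coefficient can also be computed by hand, which makes explicit what the number measures. The sketch below uses five made-up (age, charges) pairs, not the real data:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "age": [19, 25, 33, 47, 60],
    "charges": [1800.0, 2700.0, 4100.0, 8300.0, 12900.0],
})

x = toy["age"].to_numpy(dtype=float)
y = toy["charges"].to_numpy()

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)), with dx and dy the deviations from the mean
dx, dy = x - x.mean(), y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
r_pandas = toy["age"].corr(toy["charges"])
print(round(r_manual, 4), round(r_pandas, 4))
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate that no linear relationship is visible, although a non-linear one may still exist.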

In [None]:
# We use the crosstab() method to display the co-occurrence frequency table of two attributes.
pd.crosstab(df.age,df.children)

In [None]:
pd.crosstab(df.age,df.smoker)
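
Raw co-occurrence counts can be hard to compare when the groups have different sizes. As a sketch on made-up rows (the `toy` frame is hypothetical), `normalize="index"` turns each row of the table into shares that sum to 1:

```python
import pandas as pd

toy = pd.DataFrame({
    "smoker": ["yes", "no", "no", "yes", "no", "no"],
    "children": [0, 1, 0, 2, 1, 0],
})

counts = pd.crosstab(toy.smoker, toy.children)                         # raw co-occurrence counts
row_share = pd.crosstab(toy.smoker, toy.children, normalize="index")   # each row sums to 1
print(counts)
print(row_share.round(2))
```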

- The attributes are well structured: none of them is redundant or irrelevant.
- To see the correlation between all attributes, the object-type attributes must first be converted, since corr() computes the linear correlation coefficient of two NUMERIC attributes.
- We use scikit-learn's LabelEncoder for this conversion.

In [None]:
from sklearn.preprocessing import LabelEncoder
#sex
le = LabelEncoder()
le.fit(df.sex.drop_duplicates()) 
df.sex = le.transform(df.sex)
# smoker or not
le.fit(df.smoker.drop_duplicates()) 
df.smoker = le.transform(df.smoker)
#region
le.fit(df.region.drop_duplicates()) 
df.region = le.transform(df.region)
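
It helps to check which integer each category receives, because `LabelEncoder` sorts the class names before numbering them. A minimal sketch, using a short made-up list of region names in place of the real column:

```python
from sklearn.preprocessing import LabelEncoder

# Made-up sample of region names; the order of appearance does not matter
regions = ["southwest", "southeast", "northwest", "northeast", "southeast"]

encoder = LabelEncoder()
codes = encoder.fit_transform(regions)

# classes_ is sorted, so the position of each name is its integer code
mapping = {name: int(code) for code, name in enumerate(encoder.classes_)}
print(mapping)
print(codes.tolist())
print(list(encoder.inverse_transform([0, 3])))
```

Keeping such a mapping (or the fitted encoder itself) makes it possible to label plots and tables with the original category names later on.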

In [None]:
df.corr()['charges'].sort_values()

* According to the correlation table, there is a strong correlation between the charges and whether the patient smokes.

###   1. Medical Costs of Smoker vs Non-Smokers


![image](http://www.lemondedesados.fr/wp-content/uploads/2016/10/stop-620x350.jpg)

In [None]:
# sort patients from highest to lowest charges; reset_index keeps the old row labels in an "index" column
charges_sorted = df.sort_values('charges', ascending=False).reset_index()
charges_sorted.head()

In [None]:
trace0 = go.Scatter(
    x = charges_sorted.index,
    y = charges_sorted[charges_sorted.smoker == 1].charges,
    name = "Smokers",
    mode='lines',
    marker=dict(
        size=12,
        color = "red", #set color equal to a variable
    )
)

trace1 = go.Scatter(
    x = charges_sorted.index,
    y = charges_sorted[charges_sorted.smoker == 0].charges,
    name = "Non-Smokers",
    mode='lines',
    marker=dict(
        size=12,
        color = "green", #set color equal to a variable
    )
)


data = [trace0,trace1]
layout = go.Layout(title = 'Medical Costs of Smoker vs Non-Smokers',
              xaxis = dict(title = 'Persons'),
              yaxis = dict(title = 'Medical Costs'),)
fig = go.Figure(data=data, layout=layout)
iplot(fig)

* => For smokers, the cost of treatment is almost twice that of non-smokers.

###   2. BMI compared to average medical costs (low-normal-high)



![image](https://www.healthhub.sg/sites/assets/Assets/Programs/bmi/image01.gif)

In [None]:
# the three ranges cover every BMI value: below 18.5, 18.5 to 24.9 inclusive, above 24.9
dict_bmi= {'low' : df[df.bmi < 18.5].charges.mean(),
               'normal' : df[(df.bmi >= 18.5) & (df.bmi <= 24.9)].charges.mean(),
               'high' : df[df.bmi > 24.9].charges.mean(),
             }
df_bmi = pd.DataFrame.from_dict(dict_bmi, orient='index')
df_bmi.reset_index(inplace=True)
df_bmi.columns = ['bmi', 'mean_value']
df_bmi
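
An alternative way to build such categories is `pd.cut`, which assigns every row to exactly one bin. This is a sketch on made-up values, not the notebook's method: the `toy` frame is hypothetical, and with `right=False` each bin includes its lower edge, so a BMI of exactly 24.9 would count as "high" here, which is a boundary choice assumed for the example.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "bmi": [17.0, 18.5, 22.0, 24.0, 27.5, 31.0],
    "charges": [2000.0, 3000.0, 4000.0, 5000.0, 9000.0, 11000.0],
})

bmi_group = pd.cut(
    toy["bmi"],
    bins=[-np.inf, 18.5, 24.9, np.inf],
    labels=["low", "normal", "high"],
    right=False,   # intervals are [lower, upper): 18.5 falls in "normal"
)
mean_by_group = toy.groupby(bmi_group, observed=False)["charges"].mean()
print(mean_by_group)
```

Because the bins are contiguous, no row can fall between two categories, which is easy to get wrong when each range is written as a separate comparison.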

In [None]:
my_color = ['rgb(220,250,39)','rgb(102,189,99)','rgb(115,48,39)']
trace=go.Bar(
            x=df_bmi.bmi,
            y=df_bmi.mean_value,
            text="Mean Medical Costs",
            marker=dict(
                color=my_color,
                line=dict(
                color=my_color,
                width=1.5),
            ),
            opacity=0.7)

data = [trace]
layout = go.Layout(title = 'Body mass index',
              xaxis = dict(title = 'BMI'),
              yaxis = dict(title = 'mean charges'))
fig = go.Figure(data = data, layout = layout)
iplot(fig)

In [None]:
plt.figure(figsize=(12,5))
plt.title("Distribution of charges for patients with BMI = low")
ax = sns.histplot(df[df.bmi < 18.5]['charges'], color = 'm', kde=True, stat="density")

In [None]:
plt.figure(figsize=(12,5))
plt.title("Distribution of charges for patients with BMI = normal")
ax = sns.histplot(df[(df.bmi >= 18.5) & (df.bmi <= 24.9)]['charges'], color = 'b', kde=True, stat="density")

In [None]:
plt.figure(figsize=(12,5))
plt.title("Distribution of charges for patients with BMI=high")
ax = sns.histplot(df[df.bmi > 24.9]['charges'], color = 'y', kde=True, stat="density")

### 3. Medical Costs Means by Regions


![image](https://www.sare.org/extension/htmlmap/design/standard/images/SARE_USA-292.png)

In [None]:
# LabelEncoder assigns the codes in alphabetical order of the region names
dict_regions= {'northeast' : df[df.region == 0].charges.mean(),
              'northwest' : df[df.region == 1].charges.mean(),
              'southeast' : df[df.region == 2].charges.mean(),
              'southwest' : df[df.region == 3].charges.mean()
             }
df_regions = pd.DataFrame.from_dict(dict_regions, orient='index')
df_regions.reset_index(inplace=True)
df_regions.columns = ['regions', 'charges']

df_regions
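
The same kind of per-group table can be produced in one step with `groupby`, which avoids matching integer codes to names by hand. The sketch below works on a made-up frame that still holds the region names as strings; the `toy` frame and the `mean_charges` column name are assumptions for the example.

```python
import pandas as pd

toy = pd.DataFrame({
    "region": ["southwest", "southeast", "southeast", "northwest", "northeast", "northeast"],
    "charges": [4000.0, 9000.0, 11000.0, 5000.0, 6000.0, 8000.0],
})

# One mean per region, labelled by the region name itself
region_means = (
    toy.groupby("region")["charges"]
       .mean()
       .rename("mean_charges")
       .reset_index()
)
print(region_means)
```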

In [None]:
import plotly.graph_objs as go

trace=go.Bar(
            x=df_regions.regions,
            y=df_regions.charges,
            text="Mean Medical Costs",
            opacity=0.8)

data = [trace]
layout = go.Layout(title ='Medical Cost Means by Regions',
              xaxis = dict(title = 'Region'),
              yaxis = dict(title = 'Medical Cost'))
fig = go.Figure(data = data, layout = layout)
iplot(fig)

### 4. Medical Costs Means by Age


![image](https://cdn.psychologytoday.com/sites/default/files/styles/image-article_inline_full/public/field_blog_entry_images/Longevity%20Cartoon_1.jpg?itok=X89Hn_1J)

In [None]:
# lower bounds are inclusive so that patients aged exactly 18, 30 or 50 are counted
dict_age= {'youth' : df[(df.age >= 18)&(df.age < 30 )].charges.mean(),
               'adult' : df[(df.age >= 30)&(df.age < 50 )].charges.mean(),
               'elders' : df[(df.age >= 50)&(df.age < 70 )].charges.mean(),
             }
df_age = pd.DataFrame.from_dict(dict_age, orient='index')
df_age.reset_index(inplace=True)
df_age.columns = ['age', 'mean_value']
df_age

In [None]:
my_color = ['rgb(150,150,155)','rgb(107,189,99)','rgb(15,148,139)']
trace=go.Bar(
            x=df_age.age,
            y=df_age.mean_value,
            text="Mean Medical Costs",
            marker=dict(
                color=my_color,
                line=dict(
                color=my_color,
                width=1.5),
            ),
            opacity=0.7)

data = [trace]
layout = go.Layout(title = 'age category index',
              xaxis = dict(title = 'age'),
              yaxis = dict(title = 'mean charges'))
fig = go.Figure(data = data, layout = layout)
iplot(fig)

### 5.Medical Cost by Age and Sex


In [None]:
sns.set_style('ticks')
col_list = ['light lavender','denim']
col_list_palette = sns.xkcd_palette(col_list)
sns.set_palette(col_list_palette)
a = sns.FacetGrid(df, col='sex',hue='sex',height =6,aspect= 0.9)                  
a.map(plt.scatter, 'age','charges')
a.set_axis_labels('Age', 'Medical Costs in Dollars')
plt.suptitle('Medical Costs by Age & Sex', fontsize = 25)
plt.tight_layout(rect=[0, 0.03, 1, 0.95])

- 0: female
- 1: male

### 6. Medical Cost by Age and Smoker/Non-Smoker


In [None]:
# Medical Cost by Age and Smoker/Non-Smoker
col_list = ["shit","pistachio"]
col_list_palette = sns.xkcd_palette(col_list)
sns.set_palette(col_list_palette)
a = sns.FacetGrid(df, col='smoker',hue= 'smoker',height =6,aspect= 0.9)
a.map(plt.scatter, 'age','charges')
plt.suptitle('Medical Costs by Age & Smoker', fontsize = 25)
a.set_axis_labels('Age', 'Medical Costs in Dollars')
plt.tight_layout(rect=[0, 0.03, 1, 0.95])

- 0: non-smoker
- 1: smoker

## Pattern Extraction Experiments

### Experiment 1: linear regression method with same training and test data

In [None]:
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
from sklearn import metrics

Notice that all attributes are now numeric.

In [None]:
df.head()

In [None]:
df.keys()

In [None]:
# In this part, we will use the entire data as training and test sets.

x = df.drop(['charges'], axis = 1)
y = df.charges

In [None]:
x_train=x
y_train=y
x_test=x
y_test=y

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
# create object of LinearRegression class
model1 = LinearRegression()

In [None]:
# call learning algorithm with training data
model1.fit(x_train, y_train)

In [None]:
# use extracted pattern to make predictions on the test set
ypred1 = model1.predict(x_test)

In [None]:
type(ypred1)

In [None]:
ypred1.shape

In [None]:
# measure the prediction error on the test set
rmse1 = np.sqrt(metrics.mean_squared_error(y_test, ypred1))
mae1 = metrics.mean_absolute_error(y_test, ypred1)

In [None]:
print("Root mean squared error: %.3f" % rmse1)
print('Mean Absolute Error: %.3f' % mae1)
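
To make the two error measures concrete, here is a small worked example on four made-up predictions. RMSE is the square root of the mean squared error and MAE is the mean of the absolute errors; the arrays below are invented for the illustration.

```python
import numpy as np
from sklearn import metrics

y_true = np.array([1000.0, 2000.0, 3000.0, 4000.0])
y_hat = np.array([1100.0, 1800.0, 3000.0, 4400.0])

errors = y_true - y_hat                        # [-100, 200, 0, -400]
rmse_manual = np.sqrt(np.mean(errors ** 2))    # sqrt((1e4 + 4e4 + 0 + 16e4) / 4)
mae_manual = np.mean(np.abs(errors))           # (100 + 200 + 0 + 400) / 4

rmse_sklearn = np.sqrt(metrics.mean_squared_error(y_true, y_hat))
mae_sklearn = metrics.mean_absolute_error(y_true, y_hat)
print(rmse_manual, mae_manual)
```

RMSE is larger than MAE here because squaring gives the single 400 error more weight; a large gap between the two usually points to a few badly predicted cases.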

In [None]:
# Make sure this value is identical to output of previous cell
ypred1[0]

#### Divide data into training and test subsets
Use the ``train_test_split`` function to randomly divide the data into 80% training instances and 20% test instances. The result is stored in four variables, in the order the function returns them: ``Xtrain``, ``Xtest``, ``ytrain``, ``ytest``.


In [None]:
# read the documentation
?train_test_split

In [None]:
Xtrain, Xtest, ytrain, ytest = train_test_split( x, y, test_size=0.2)
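
Two properties of `train_test_split` are worth checking on a tiny example: the order of the returned arrays and the effect of fixing `random_state`. The arrays below are made up, and the seed value 42 is an arbitrary choice for the sketch.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 made-up rows, 2 features
y = np.arange(10)

# Return order is X_train, X_test, y_train, y_test
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_tr.shape, X_te.shape)         # (8, 2) (2, 2)
print(np.array_equal(y_te, y_te2))    # True: same seed, same split
```

Without a fixed `random_state`, every run draws a different split, so error values computed on the test set change slightly from run to run.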

In [None]:
model2 = LinearRegression()

In [None]:
model2.fit(Xtrain, ytrain)

In [None]:
ypred2 = model2.predict(Xtest)

In [None]:
# measure the prediction error on the test set
rmse2 = np.sqrt(metrics.mean_squared_error(ytest, ypred2))
mae2 = metrics.mean_absolute_error(ytest, ypred2)

In [None]:
print("Root mean squared error: %.3f" % rmse2)
print('Mean Absolute Error: %.3f' % mae2)

In [None]:
print(rmse1,mae1)
print(rmse2,mae2)

In [None]:
pd.DataFrame( {'model 1 coefficient': model1.coef_, 'model 2 coefficient': model2.coef_})

## Evaluate pattern

In [None]:
df = pd.DataFrame({'predicted_price':ypred2, 'true_price':ytest})
df.head(10)

In [None]:


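# fit on a reproducible train/test split (75% / 25% by default)
x_train1,x_test1,y_train1,y_test1 = train_test_split(x,y, random_state = 0)
lr = LinearRegression().fit(x_train1,y_train1)

y_train_pred = lr.predict(x_train1)
y_test_pred = lr.predict(x_test1)

# score() returns the coefficient of determination R^2 on the test set
print(lr.score(x_test1,y_test1))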
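
The value returned by `score` for a regressor is the coefficient of determination, R² = 1 - SS_res / SS_tot. A short sketch on synthetic data shows the calculation; the generated `X` and `y`, the seed, and the slope of 250 are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
X = rng.uniform(18, 64, size=(100, 1))                   # made-up "age" feature
y = 250.0 * X[:, 0] + rng.normal(0.0, 1000.0, size=100)  # made-up "charges"

reg = LinearRegression().fit(X, y)
y_hat = reg.predict(X)

ss_res = np.sum((y - y_hat) ** 2)       # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)    # total sum of squares around the mean
r2_manual = 1.0 - ss_res / ss_tot

print(round(r2_manual, 4), round(reg.score(X, y), 4), round(r2_score(y, y_hat), 4))
```

An R² of 1 means the predictions match the targets exactly, while 0 means the model does no better than always predicting the mean.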

In [None]:
# a single color is used: a list passed to c must contain one entry per point
df.plot.scatter(x='true_price', y='predicted_price', c='#feb3b3', title = 'predicted price vs. true price')


END.