Problem definition: we want to predict insurance charges from features such as sex, smoker status, region, and BMI.
The charges are a continuous value and the data is labeled, so this is a supervised regression problem and we will use linear regression.

In [890]:
############################################ Packages Used ######################################              

# to deal with the data set and process it
import pandas as pd

# to split the data set into training and test sets so the model can be evaluated
from sklearn.model_selection import train_test_split

# to create the linear regression model
from sklearn.linear_model import LinearRegression

# to know how good the model is
from sklearn.metrics import r2_score

# for the array operations in the from-scratch implementation
import numpy as np

In [891]:
# read the data set into a DataFrame
df = pd.read_csv('insurance.csv')
print(df)

      age     sex     bmi  children smoker     region      charges
0      19  female  27.900         0    yes  southwest  16884.92400
1      18    male  33.770         1     no  southeast   1725.55230
2      28    male  33.000         3     no  southeast   4449.46200
3      33    male  22.705         0     no  northwest  21984.47061
4      32    male  28.880         0     no  northwest   3866.85520
...   ...     ...     ...       ...    ...        ...          ...
1333   50    male  30.970         3     no  northwest  10600.54830
1334   18  female  31.920         0     no  northeast   2205.98080
1335   18  female  36.850         0     no  southeast   1629.83350
1336   21  female  25.800         0     no  southwest   2007.94500
1337   61  female  29.070         0    yes  northwest  29141.36030

[1338 rows x 7 columns]


In [892]:
# get some information about the data set
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


The data set contains 7 columns: 6 features plus the target, charges. Three of the features have dtype object (sex, smoker, region). The data set does not contain any null values.

In [893]:
# display the unique values of the columns whose dtype is object
print(df['region'].unique())
print(df['sex'].unique())
print(df['smoker'].unique())

['southwest' 'southeast' 'northwest' 'northeast']
['female' 'male']
['yes' 'no']


In [894]:
# use one-hot encoding to convert the string columns to numeric indicator columns
one_hot_encoded_data = pd.get_dummies(df, columns = ['region', 'sex', 'smoker']) 
print(one_hot_encoded_data)

      age     bmi  children      charges  region_northeast  region_northwest  \
0      19  27.900         0  16884.92400                 0                 0   
1      18  33.770         1   1725.55230                 0                 0   
2      28  33.000         3   4449.46200                 0                 0   
3      33  22.705         0  21984.47061                 0                 1   
4      32  28.880         0   3866.85520                 0                 1   
...   ...     ...       ...          ...               ...               ...   
1333   50  30.970         3  10600.54830                 0                 1   
1334   18  31.920         0   2205.98080                 1                 0   
1335   18  36.850         0   1629.83350                 0                 0   
1336   21  25.800         0   2007.94500                 0                 0   
1337   61  29.070         0  29141.36030                 0                 1   

      region_southeast  region_southwes

In [895]:
# information about the data after preprocessing
one_hot_encoded_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   age               1338 non-null   int64  
 1   bmi               1338 non-null   float64
 2   children          1338 non-null   int64  
 3   charges           1338 non-null   float64
 4   region_northeast  1338 non-null   uint8  
 5   region_northwest  1338 non-null   uint8  
 6   region_southeast  1338 non-null   uint8  
 7   region_southwest  1338 non-null   uint8  
 8   sex_female        1338 non-null   uint8  
 9   sex_male          1338 non-null   uint8  
 10  smoker_no         1338 non-null   uint8  
 11  smoker_yes        1338 non-null   uint8  
dtypes: float64(2), int64(2), uint8(8)
memory usage: 52.4 KB
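
The info output above shows eight indicator columns. Within each group (region, sex, smoker) the indicators sum to 1 on every row, so one column per group is redundant. One possible refinement, not used in this notebook, is to pass drop_first=True to pd.get_dummies. A minimal sketch on a made-up three-row frame (the `toy` frame and its values are illustrative assumptions, not rows from insurance.csv):

```python
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration only.
toy = pd.DataFrame({
    "age": [19, 18, 28],
    "sex": ["female", "male", "male"],
    "smoker": ["yes", "no", "no"],
})

# Full one-hot encoding: one indicator column per category.
full = pd.get_dummies(toy, columns=["sex", "smoker"], dtype=int)

# drop_first=True drops the first category (in sorted order) of each column,
# leaving a single 0/1 indicator for every binary feature.
reduced = pd.get_dummies(toy, columns=["sex", "smoker"], drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
```

The fit quality should be the same either way; the reduced encoding mainly makes the coefficients easier to interpret, since each one is measured against the dropped category.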


In [896]:
# extract the independent and dependent variables
X = one_hot_encoded_data.loc[:, one_hot_encoded_data.columns != 'charges']
y = one_hot_encoded_data['charges']

In [897]:
# split the data set into training and test sets (25% held out for testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
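
Because the split above is random, the scores printed in this notebook change from run to run. A small sketch of how a fixed seed makes the split repeatable; the arrays `X_demo` and `y_demo` and the seed 42 are arbitrary assumptions chosen for illustration, not part of the original notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the insurance data).
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

# The same random_state gives the same shuffle, hence the same split.
split_a = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

# Element 1 of each result is the test portion of X.
same_test_rows = np.array_equal(split_a[1], split_b[1])
print(same_test_rows)
print(split_a[0].shape, split_a[1].shape)
```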

In [898]:
# create a LinearRegression object
model = LinearRegression()

# fit the model: estimate the coefficients that minimize the squared error
model.fit(X_train, y_train)

# predict the charges for the held-out test data
y_predict = model.predict(X_test)

# evaluate the model with the R^2 score (1.0 is a perfect fit)
r2 = r2_score(y_test, y_predict)
print(r2)

0.7486737521763733
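
For reference, the R^2 score is 1 minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values. The sketch below writes that definition out with NumPy; `r2_manual` and the small demo arrays are hypothetical names and values introduced here, not part of the notebook:

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, written out by hand."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return 1.0 - ss_res / ss_tot

y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])

# Perfect predictions explain all of the variation.
perfect = r2_manual(y_true_demo, y_true_demo)

# Always predicting the mean explains none of it.
mean_only = r2_manual(y_true_demo, np.full(4, y_true_demo.mean()))

print(perfect, mean_only)
```

Read this way, the score of about 0.75 above means the model explains roughly three quarters of the variance of the charges in the test set.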


In [899]:
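class linearRegression:
    """Linear regression trained with batch gradient descent."""

    def __init__(self, lr=0.001, n_iters=1000):
        self.lr = lr              # learning rate (step size)
        self.n_iters = n_iters    # number of gradient descent iterations
        self.weights = None
        self.bias = None

    def fit(self, X, y):
        # work on float arrays so DataFrame/Series inputs behave like NumPy arrays
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0.0

        for _ in range(self.n_iters):
            # predictions with the current parameters
            y_pred = np.dot(X, self.weights) + self.bias

            # gradients of the squared-error cost with respect to weights and bias
            dw = (1 / n_samples) * np.dot(X.T, (y_pred - y))
            db = (1 / n_samples) * np.sum(y_pred - y)

            # move the parameters one step against the gradient
            self.weights = self.weights - self.lr * dw
            self.bias = self.bias - self.lr * db

    def predict(self, X):
        return np.dot(np.asarray(X, dtype=float), self.weights) + self.bias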
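
The update rule in fit can be read as gradient descent on a halved mean squared error. The factor 1/2 is an assumption made here so that the gradients match the dw and db lines exactly; with the plain mean squared error they would carry an extra factor of 2, which is equivalent to doubling the learning rate. With n samples, predictions \hat{y} = Xw + b, and learning rate \alpha (lr in the code):

```latex
\begin{align*}
J(w, b) &= \frac{1}{2n} \sum_{i=1}^{n} \left(\hat{y}_i - y_i\right)^2,
  \qquad \hat{y} = Xw + b \\
\frac{\partial J}{\partial w} &= \frac{1}{n} X^{\top} \left(\hat{y} - y\right),
  \qquad
  \frac{\partial J}{\partial b} = \frac{1}{n} \sum_{i=1}^{n} \left(\hat{y}_i - y_i\right) \\
w &\leftarrow w - \alpha \, \frac{\partial J}{\partial w},
  \qquad
  b \leftarrow b - \alpha \, \frac{\partial J}{\partial b}
\end{align*}
```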

In [900]:
model_two = linearRegression(lr=0.0002)
model_two.fit(X_train, y_train)

y_predict_two = model_two.predict(X_test)

r2_two = r2_score(y_test, y_predict_two)
print(r2_two)

0.19707846785516248


I am encountering a problem: a learning rate of 0.01 produces NaN values and I do not know why, while a learning rate of 0.0002 runs without NaN but gives a very low R^2 value, so the model is doing very badly.
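
A likely cause, offered as a hypothesis rather than a verified diagnosis of this run: the features are on very different scales (age and bmi are in the tens, the indicator columns are 0 or 1). On unscaled inputs, gradient descent diverges once the learning rate is too large for the largest feature scale; the weights then overflow to inf and turn into NaN. A much smaller rate stays stable but moves too slowly to converge in 1000 iterations. Standardizing the features usually allows a larger rate. The sketch below illustrates this on synthetic data whose ranges loosely imitate age, bmi, and smoker; the data, the coefficients used to generate it, and the helper names `gradient_descent` and `r2` are all assumptions introduced for illustration:

```python
import numpy as np

# Synthetic stand-in data (assumption): ranges loosely imitate the insurance columns.
rng = np.random.default_rng(0)
n = 500
age = rng.uniform(18, 64, n)
bmi = rng.uniform(16, 53, n)
smoker = rng.integers(0, 2, n).astype(float)
X = np.column_stack([age, bmi, smoker])
y = 250 * age + 300 * bmi + 20000 * smoker + rng.normal(0, 3000, n)

def gradient_descent(X, y, lr, n_iters=1000):
    """Same update rule as the linearRegression class above."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iters):
        err = X @ w + b - y
        w = w - lr * (X.T @ err) / n_samples
        b = b - lr * err.mean()
    return w, b

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# 1) Raw features with lr=0.01: the updates blow up (warnings silenced on purpose).
with np.errstate(over="ignore", invalid="ignore"):
    w_raw, b_raw = gradient_descent(X, y, lr=0.01)
diverged = not np.all(np.isfinite(w_raw))

# 2) Standardized features (zero mean, unit variance) with the same lr=0.01.
mu, sigma = X.mean(axis=0), X.std(axis=0)
X_std = (X - mu) / sigma
w_std, b_std = gradient_descent(X_std, y, lr=0.01)
r2_std = r2(y, X_std @ w_std + b_std)

print("raw features diverged:", diverged)
print("R^2 with standardized features:", r2_std)
```

If this approach is applied to the notebook, the mean and standard deviation should be computed on X_train only and then reused to transform X_test, so that no information from the test set leaks into training.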