## Linear Regression with PyTorch - Machine Learning with Python

### Insurance cost prediction using linear regression


In this assignment we're going to use information such as a person's age, sex, BMI, number of children and smoking habit to predict their yearly medical bills. This kind of model is useful for insurance companies when determining a person's yearly insurance premium. The dataset for this problem is taken from [Kaggle](https://www.kaggle.com/mirichoi0218/insurance).


We will create a model with the following steps:
1. Download and explore the dataset
2. Prepare the dataset for training
3. Create a linear regression model
4. Train the model to fit the data
5. Make predictions using the trained model


This assignment builds upon the concepts from the first 2 lessons. It will help to review these Jupyter notebooks:
- PyTorch basics: https://jovian.ai/aakashns/01-pytorch-basics
- Linear Regression: https://jovian.ai/aakashns/02-linear-regression
- Logistic Regression: https://jovian.ai/aakashns/03-logistic-regression
- Linear regression (minimal): https://jovian.ai/aakashns/housing-linear-minimal
- Logistic regression (minimal): https://jovian.ai/aakashns/mnist-logistic-minimal

As you go through this notebook, you will find a **???** in certain places. Your job is to replace the **???** with appropriate code or values, to ensure that the notebook runs properly end-to-end. In some cases, you'll be required to choose some hyperparameters (learning rate, batch size etc.). Try to experiment with the hyperparameters to get the lowest loss.
- Insurance Model using Pytorch: https://www.kaggle.com/code/sanath123/insurance-model-using-pytorch/notebook


In [39]:
# Uncomment and run the appropriate command for your operating system, if required

# Linux / Binder
# !pip install torch==1.7.0+cpu torchvision==0.8.1+cpu torchaudio==0.7.0 -f https://download.pytorch.org/whl/torch_stable.html
# !pip install torch==1.7.1+cpu torchvision==0.8.2+cpu -f https://download.pytorch.org/whl/torch_stable.html

# Windows
#!pip install torch==1.9.1+cpu torchvision==0.10.1+cpu torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html
#!pip install torch==1.7.1+cpu torchvision==0.8.2+cpu -f https://download.pytorch.org/whl/torch_stable.html
### the line below works for Windows - Thank God!
#!pip install torch torchvision torchaudio

# MacOS
# !pip install numpy matplotlib pandas torch torchvision torchaudio



In [None]:
#!pip uninstall torch

In [None]:
#!pip uninstall torchvision

In [None]:
#!pip uninstall torchaudio

In [40]:
import torch
import torchvision
import torch.nn as nn
import pandas as pd
import matplotlib.pyplot as plt
import torch.nn.functional as F
from torchvision.datasets.utils import download_url
from torch.utils.data import DataLoader, TensorDataset, random_split

## Step 1: Download and explore the data

Let us begin by downloading the data. We'll use the `download_url` function from `torchvision` to get the data as a CSV (comma-separated values) file.

In [41]:
DATASET_URL = "https://gist.githubusercontent.com/BirajCoder/5f068dfe759c1ea6bdfce9535acdb72d/raw/c84d84e3c80f93be67f6c069cbdc0195ec36acbd/insurance.csv"
DATA_FILENAME = "insurance.csv"
download_url(DATASET_URL, '.')

Using downloaded and verified file: .\insurance.csv


To load the dataset into memory, we'll use the `read_csv` function from the `pandas` library. The data will be loaded as a Pandas dataframe. See this short tutorial to learn more: https://data36.com/pandas-tutorial-1-basics-reading-data-files-dataframes-data-selection/

In [241]:
dataframe_raw = pd.read_csv(DATA_FILENAME)
dataframe_raw.head()

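| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |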


We're going to customize the data slightly, so that every participant receives a slightly different version of the dataset. Fill in your name below as a string (enter at least 5 characters).

In [242]:
your_name = 'Achan' # at least 5 characters

The `customize_dataset` function will customize the dataset slightly using your name as a source of random numbers.

In [243]:
# The full dataset is used here (fraction 1.00 instead of 0.95), and the 'region' column is kept
def customize_dataset(dataframe_raw, rand_str):
    dataframe = dataframe_raw.copy(deep=True)
    # shuffle the rows; with a fraction of 1.00 no rows are dropped
    dataframe = dataframe.sample(int(1.00*len(dataframe)), random_state=int(ord(rand_str[0])))
    # scale input
    dataframe.bmi = dataframe.bmi * ord(rand_str[1])/100.
    # scale target
    dataframe.charges = dataframe.charges * ord(rand_str[2])/100.
    # drop column
    #if ord(rand_str[3]) % 2 == 1:
        #dataframe = dataframe.drop(['region'], axis=1)
    return dataframe

In [244]:
dataframe = customize_dataset(dataframe_raw, your_name)
dataframe.head()

| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 27 | 55 | female | 32.44725 | 2 | no | northwest | 12759.37754 |
| 752 | 64 | male | 37.52595 | 0 | no | northwest | 14778.957388 |
| 1258 | 55 | male | 37.33785 | 3 | no | northwest | 31266.123772 |
| 384 | 44 | male | 21.91365 | 2 | no | northeast | 8634.637076 |
| 406 | 33 | female | 24.0669 | 0 | no | southeast | 4352.501816 |


In [245]:
#How many rows does the dataset have?
num_rows = dataframe.shape[0]
print(num_rows)

1338


In [246]:
#How many columns does the dataset have?
num_cols = dataframe.shape[1]
print(num_cols)

7


In [247]:
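# The customized dataset has 1338 rows x 7 columns (see the counts above).
# Which columns of the dataset are used as inputs?
# Note: 'charges' is also the target in `output_cols`, so listing it here feeds the target to the model as an input.
input_cols = ['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges']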
print(input_cols)

['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges']


In [248]:
# Which of the input columns are non-numeric (categorical) variables?
categorical_cols = ['sex', 'smoker', 'region']
print(categorical_cols)

['sex', 'smoker', 'region']


In [249]:
#What are the column titles of output/target variable(s)?
output_cols=['charges']
output_cols

['charges']

**Q: (Optional) What is the minimum, maximum and average value of the `charges` column? Can you show the distribution of values in a graph?**
Use this data visualization cheatsheet for reference: https://jovian.ai/aakashns/dataviz-cheatsheet

In [250]:
dataframe.describe()
#dataframe[['charges']].describe()

| | age | bmi | children | charges |
|---|---|---|---|---|
| count | 1338.0 | 1338.0 | 1338.0 | 1338.0 |
| mean | 39.207025 | 30.356763 | 1.094918 | 13801.239156 |
| std | 14.04996 | 6.037205 | 1.205493 | 12594.411686 |
| min | 18.0 | 15.8004 | 0.0 | 1166.748856 |
| 25% | 27.0 | 26.033288 | 0.0 | 4929.898636 |
| 50% | 39.0 | 30.096 | 1.0 | 9757.31432 |
| 75% | 51.0 | 34.346812 | 2.0 | 17305.509016 |
| max | 64.0 | 52.5987 | 5.0 | 66321.24513 |
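One way to answer the optional question is sketched below. It is a minimal example rather than the assignment's reference solution: `summarize_charges` and `plot_charges` are hypothetical helper names, and the stand-in frame only holds the five `charges` values from the raw `head()` output shown earlier. In the notebook you would pass `dataframe` instead.

```python
import pandas as pd

# Stand-in frame: the five `charges` values from the raw head() output above.
sample_df = pd.DataFrame(
    {"charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552]}
)

def summarize_charges(df):
    """Return the minimum, maximum and mean of the `charges` column."""
    charges = df["charges"]
    return {"min": charges.min(), "max": charges.max(), "mean": charges.mean()}

def plot_charges(df, bins=30):
    """Draw a histogram of `charges` and return the matplotlib Axes."""
    # Imported here so the summary above works without a plotting backend.
    import matplotlib.pyplot as plt
    fig, ax = plt.subplots()
    ax.hist(df["charges"], bins=bins)
    ax.set_xlabel("charges")
    ax.set_ylabel("number of people")
    ax.set_title("Distribution of yearly medical charges")
    return ax

summary = summarize_charges(sample_df)
print(summary)
```

Calling `plot_charges(dataframe)` in a notebook cell should render the histogram inline; the bin count of 30 is an arbitrary starting point.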


## Step 2: Prepare the dataset for training

We need to convert the data from the Pandas dataframe into PyTorch tensors for training. To do this, the first step is to convert it into numpy arrays. If you've filled out `input_cols`, `categorical_cols` and `output_cols` correctly, the following function will perform the conversion to numpy arrays.

In [251]:
def dataframe_to_arrays(dataframe):
    # Make a copy of the original dataframe
    dataframe1 = dataframe.copy(deep=True)
    # Convert non-numeric categorical columns to numbers
    for col in categorical_cols:
        dataframe1[col] = dataframe1[col].astype('category').cat.codes
    # Extract inputs & outputs as numpy arrays
    inputs_array = dataframe1[input_cols].to_numpy()
    targets_array = dataframe1[output_cols].to_numpy()
    return inputs_array, targets_array

Read through the [Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html) to understand how we're converting categorical variables into numbers.
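As a small illustration of that conversion, the sketch below applies `astype('category').cat.codes` to a toy frame. The toy rows are made up for the example, but they use the same category values that appear in the dataset.

```python
import pandas as pd

# Toy frame (illustrative rows) using category values found in the dataset.
toy = pd.DataFrame({
    "sex": ["female", "male", "male", "female"],
    "region": ["southwest", "southeast", "northwest", "northeast"],
})

for col in ["sex", "region"]:
    as_category = toy[col].astype("category")
    # Categories are stored in sorted order, so codes follow alphabetical order.
    print(col, list(as_category.cat.categories))
    toy[col] = as_category.cat.codes

sex_codes = toy["sex"].tolist()        # female -> 0, male -> 1
region_codes = toy["region"].tolist()  # northeast -> 0, ..., southwest -> 3
print(sex_codes, region_codes)
```

Integer codes imply an ordering between regions that the data does not really have. One-hot encoding with `pd.get_dummies` is a common alternative if that matters for your model.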

In [252]:
inputs_array, targets_array = dataframe_to_arrays(dataframe)
inputs_array, targets_array

(array([[5.50000000e+01, 0.00000000e+00, 3.24472500e+01, ...,
         0.00000000e+00, 1.00000000e+00, 1.27593775e+04],
        [6.40000000e+01, 1.00000000e+00, 3.75259500e+01, ...,
         0.00000000e+00, 1.00000000e+00, 1.47789574e+04],
        [5.50000000e+01, 1.00000000e+00, 3.73378500e+01, ...,
         0.00000000e+00, 1.00000000e+00, 3.12661238e+04],
        ...,
        [1.90000000e+01, 1.00000000e+00, 2.74230000e+01, ...,
         1.00000000e+00, 3.00000000e+00, 1.69497598e+04],
        [5.80000000e+01, 0.00000000e+00, 2.68983000e+01, ...,
         0.00000000e+00, 1.00000000e+00, 1.27118142e+04],
        [2.90000000e+01, 0.00000000e+00, 2.76606000e+01, ...,
         1.00000000e+00, 2.00000000e+00, 1.98720908e+04]]),
 array([[12759.37754 ],
        [14778.957388],
        [31266.123772],
        ...,
        [16949.75984 ],
        [12711.814232],
        [19872.090784]]))

In [253]:
# Keep references to the numpy arrays `inputs_array` and `targets_array` (still float64 here).
# They are converted into `torch.float32` PyTorch tensors when the tensors are created below.
inputs = inputs_array
targets = targets_array

In [254]:
inputs.dtype, targets.dtype

(dtype('float64'), dtype('float64'))
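The arrays are still `float64` at this point. A minimal sketch of an explicit conversion to `torch.float32` tensors follows. The two small arrays are stand-ins with the same layout as `inputs_array` and `targets_array`, not the real data.

```python
import numpy as np
import torch

# Stand-in float64 arrays (illustrative values, same layout as the real ones).
inputs_array = np.array([[19.0, 0.0, 27.9], [18.0, 1.0, 33.77]])
targets_array = np.array([[16884.924], [1725.5523]])

# torch.from_numpy keeps the numpy dtype (float64); .float() casts to float32.
inputs = torch.from_numpy(inputs_array).float()
targets = torch.from_numpy(targets_array).float()

print(inputs.dtype, targets.dtype)
```

`nn.Linear` layers use `float32` weights by default, which is why the inputs and targets need the same data type.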

Next, we need to create PyTorch datasets & data loaders for training & validation. We'll start by creating a `TensorDataset`.

In [255]:
tensor_x = torch.Tensor(inputs) # transform to torch tensor
tensor_y = torch.Tensor(targets)

dataset = TensorDataset(tensor_x,tensor_y)

In [256]:
# Check that the data type is `torch.float32`
tensor_x.dtype, tensor_y.dtype

(torch.float32, torch.float32)

Pick a number between `0.1` and `0.2` to determine the fraction of data that will be used for creating the validation set. Then use `random_split` to create training & validation datasets.

In [257]:
val_percent = 0.2 # between 0.1 and 0.2
val_size = int(num_rows * val_percent)
train_size = num_rows - val_size

train_ds, val_ds = torch.utils.data.random_split(dataset, [train_size, val_size])
# train_ds, val_ds = ??? # Use the random_split function to split dataset into 2 parts of the desired length
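`random_split` shuffles differently on every run, so the validation loss can vary between runs. If you want a repeatable split, one option is to pass a seeded generator, as in this sketch. The seed value 42 is arbitrary and the all-zeros dataset is only a placeholder with the row count reported above.

```python
import torch
from torch.utils.data import TensorDataset, random_split

num_rows = 1338  # row count reported earlier in the notebook
# Placeholder dataset: zeros with 7 input columns and 1 target column.
dataset = TensorDataset(torch.zeros(num_rows, 7), torch.zeros(num_rows, 1))

val_percent = 0.2
val_size = int(num_rows * val_percent)
train_size = num_rows - val_size

# A seeded generator makes the split reproducible (seed chosen arbitrarily).
generator = torch.Generator().manual_seed(42)
train_ds, val_ds = random_split(dataset, [train_size, val_size], generator=generator)

print(len(train_ds), len(val_ds))
```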

In [258]:
#Pick a batch size for the data loader
batch_size = 32

In [259]:
train_loader = DataLoader(train_ds, batch_size, shuffle=True)
val_loader = DataLoader(val_ds, batch_size)

Let's look at a batch of data to verify everything is working fine so far.

In [260]:
for xb, yb in train_loader:
    print("inputs:", xb)
    print("targets:", yb)
    break

inputs: tensor([[1.9000e+01, 1.0000e+00, 3.0096e+01, 0.0000e+00, 0.0000e+00, 3.0000e+00,
         1.3066e+03],
        [6.0000e+01, 0.0000e+00, 2.5582e+01, 0.0000e+00, 0.0000e+00, 1.0000e+00,
         3.0080e+04],
        [2.2000e+01, 1.0000e+00, 3.1037e+01, 1.0000e+00, 0.0000e+00, 1.0000e+00,
         2.7490e+03],
        [3.7000e+01, 0.0000e+00, 3.8006e+01, 0.0000e+00, 1.0000e+00, 2.0000e+00,
         4.2036e+04],
        [5.1000e+01, 0.0000e+00, 3.7679e+01, 0.0000e+00, 1.0000e+00, 2.0000e+00,
         4.6176e+04],
        [5.0000e+01, 1.0000e+00, 2.5047e+01, 0.0000e+00, 0.0000e+00, 2.0000e+00,
         8.7804e+03],
        [3.1000e+01, 1.0000e+00, 3.0754e+01, 3.0000e+00, 0.0000e+00, 1.0000e+00,
         5.6420e+03],
        [4.5000e+01, 0.0000e+00, 2.5443e+01, 3.0000e+00, 0.0000e+00, 3.0000e+00,
         9.4659e+03],
        [3.1000e+01, 1.0000e+00, 3.9095e+01, 1.0000e+00, 0.0000e+00, 2.0000e+00,
         4.0308e+03],
        [2.4000e+01, 0.0000e+00, 2.7443e+01, 0.0000e+00, 0.0000e+

## Step 3: Create a Linear Regression Model

Our model itself is a fairly straightforward linear regression (we'll build more complex models in the next assignment). 

In [261]:
input_size = len(input_cols)
output_size = len(output_cols)

In [262]:
input_size

7

In [263]:
output_size

1

**Q : Complete the class definition below by filling out the constructor (`__init__`), `forward`, `training_step` and `validation_step` methods.**

Hint: Think carefully about picking a good loss function (it's not cross entropy). Maybe try 2-3 of them and see which one works best. See https://pytorch.org/docs/stable/nn.functional.html#loss-functions
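To get a feel for how regression losses differ, the sketch below evaluates three candidates from `torch.nn.functional` on two made-up predictions. The numbers are illustrative only and say nothing about which loss works best on the insurance data.

```python
import torch
import torch.nn.functional as F

# Made-up predictions and targets; errors are -10 and +20.
preds = torch.tensor([[100.0], [200.0]])
targets = torch.tensor([[110.0], [180.0]])

mse = F.mse_loss(preds, targets)           # mean of squared errors
mae = F.l1_loss(preds, targets)            # mean of absolute errors
smooth = F.smooth_l1_loss(preds, targets)  # quadratic near zero, linear for large errors

print(mse.item(), mae.item(), smooth.item())  # 250.0 15.0 14.5
```

Squared error punishes large misses much more heavily than absolute error, which matters for a target like `charges` with a long right tail.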

In [264]:
class InsuranceModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 =  nn.Linear(input_size,40) 
        self.linear2 =  nn.Linear(40,output_size) 
        
        # fill this (hint: use input_size & output_size defined above)
        
    def forward(self, xb):
        x1 = F.relu(self.linear1(xb))
        x2 = F.relu(self.linear2(x1))
        out = x2                      # fill this
        return out
    
    def training_step(self, batch):
        inputs, targets = batch 
        # Generate predictions
        out = self(inputs)          
        # Calculate loss
        loss = F.mse_loss(out, targets)                          # fill this
        return loss
    
    def validation_step(self, batch):
        inputs, targets = batch
        # Generate predictions
        out = self(inputs)
        # Calculate loss
        loss = F.mse_loss(out, targets)                           # fill this    
        return {'val_loss': loss.detach()}
        
    def validation_epoch_end(self, outputs):
        batch_losses = [x['val_loss'] for x in outputs]
        epoch_loss = torch.stack(batch_losses).mean()   # Combine losses
        return {'val_loss': epoch_loss.item()}
    
    def epoch_end(self, epoch, result, num_epochs):
        # Print result every 20th epoch
        if (epoch+1) % 20 == 0 or epoch == num_epochs-1:
            print("Epoch [{}], val_loss: {:.4f}".format(epoch+1, result['val_loss']))
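Note that the class above stacks two linear layers with a ReLU after each, so it is a small neural network rather than a plain linear regression, and the final ReLU prevents it from ever predicting a negative value. For comparison, here is a minimal sketch of a single-layer linear model. `PlainLinearModel` is a hypothetical name, and the input size of 6 is an assumption that corresponds to the six non-target columns.

```python
import torch
import torch.nn as nn

class PlainLinearModel(nn.Module):  # hypothetical name, not part of the assignment
    def __init__(self, input_size, output_size):
        super().__init__()
        # One weight per input feature plus a bias: y = xW^T + b
        self.linear = nn.Linear(input_size, output_size)

    def forward(self, xb):
        # No activation: the output can take any real value.
        return self.linear(xb)

torch.manual_seed(0)
plain_model = PlainLinearModel(input_size=6, output_size=1)  # 6 is an assumption
batch = torch.randn(4, 6)  # random stand-in batch of 4 rows
preds = plain_model(batch)
print(preds.shape)  # torch.Size([4, 1])
```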

Let us create a model using the `InsuranceModel` class. You may need to come back later and re-run the next cell to reinitialize the model, in case the loss becomes `nan` or `infinity`.

In [265]:
model = InsuranceModel()

Let's check out the weights and biases of the model using `model.parameters`.

In [266]:
list(model.parameters())

[Parameter containing:
 tensor([[ 0.2350, -0.1937, -0.2645, -0.0269,  0.0139, -0.1765, -0.0352],
         [ 0.1605,  0.0977,  0.0167, -0.3663, -0.2933,  0.2976,  0.3662],
         [-0.2715,  0.1109,  0.0376, -0.0430,  0.0293,  0.0323, -0.3228],
         [ 0.3725,  0.1816,  0.2741, -0.3072,  0.2767, -0.3741,  0.2467],
         [ 0.1316, -0.2969, -0.1938, -0.3118,  0.2111,  0.3022,  0.3609],
         [-0.2005,  0.0108,  0.1126,  0.0312,  0.0603, -0.2224,  0.1240],
         [ 0.0186,  0.1288,  0.0685, -0.3030,  0.1368,  0.1665,  0.3240],
         [ 0.2916, -0.0891, -0.1479,  0.2005,  0.1523, -0.3445, -0.2792],
         [ 0.1254,  0.0307,  0.0340,  0.0461, -0.2739, -0.2991, -0.0582],
         [ 0.0286,  0.1906, -0.0699, -0.2248, -0.0842,  0.3518,  0.3644],
         [-0.2709, -0.3195,  0.3215, -0.3466, -0.1025,  0.3241, -0.0589],
         [ 0.1220,  0.0774,  0.1784,  0.3565, -0.1950,  0.1857,  0.0093],
         [-0.2941, -0.0550,  0.1528,  0.0108,  0.1393, -0.0102, -0.2319],
         [ 0.04

## Step 4: Train the model to fit the data

To train our model, we'll use the same `fit` function explained in the lecture. That's the benefit of defining a generic training loop: you can use it for any problem.

In [267]:
def evaluate(model, val_loader):
    outputs = [model.validation_step(batch) for batch in val_loader]
    return model.validation_epoch_end(outputs)

def fit(epochs, lr, model, train_loader, val_loader, opt_func=torch.optim.SGD):
    history = []
    optimizer = opt_func(model.parameters(), lr)
    for epoch in range(epochs):
        # Training Phase 
        for batch in train_loader:
            loss = model.training_step(batch)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        # Validation phase
        result = evaluate(model, val_loader)
        model.epoch_end(epoch, result, epochs)
        history.append(result)
    return history

**Q: Use the `evaluate` function to calculate the loss on the validation set before training.**

In [268]:
result = evaluate(model, val_loader) # Use the evaluate function
print(result)

{'val_loss': 297221120.0}


In [269]:
# Extra: evaluate on the training set as well (the result is still reported under the 'val_loss' key)
result_t = evaluate(model, train_loader)
print(result_t)

{'val_loss': 336659008.0}


We are now ready to train the model. You may need to run the training loop many times, for different numbers of epochs and with different learning rates, to get a good result. Also, if your loss becomes too large (or `nan`), you may have to re-initialize the model by running the cell `model = InsuranceModel()`. Experiment with this for a while, and try to get to as low a loss as possible.

**Q: Train the model 4-5 times with different learning rates & for different number of epochs.**

Hint: Vary learning rates by orders of 10 (e.g. 1e-2, 1e-3, 1e-4, 1e-5, 1e-6) to figure out what works.
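The hint can be turned into a small sweep that starts from a fresh model for each learning rate, so one diverged run cannot affect the next. The sketch below uses a toy one-feature problem (`y = 2x + 1`) and a hypothetical `final_loss` helper instead of the notebook's `fit` function. It only illustrates the pattern.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.linspace(0, 1, 32).unsqueeze(1)  # toy inputs, shape (32, 1)
y = 2 * x + 1                              # toy targets

def final_loss(lr, epochs=50):
    """Train a fresh linear model with plain SGD and return its final MSE."""
    model = nn.Linear(1, 1)  # re-initialized for every learning rate
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        loss = F.mse_loss(model(x), y)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    with torch.no_grad():
        return F.mse_loss(model(x), y).item()

results = {lr: final_loss(lr) for lr in [1e-2, 1e-3, 1e-4, 1e-5, 1e-6]}
for lr, loss in results.items():
    print(f"lr={lr:g}  final_loss={loss:.4f}")
```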

In [270]:
epochs = 10
lr = 0.01
history1 = fit(epochs, lr, model, train_loader, val_loader)

Epoch [10], val_loss: 310975808.0000


In [271]:
epochs = 20
lr = .001
history2 = fit(epochs, lr, model, train_loader, val_loader)

Epoch [20], val_loss: 310975808.0000


In [272]:
epochs = 30
lr = .00001
history3 = fit(epochs, lr, model, train_loader, val_loader)

Epoch [20], val_loss: 310975808.0000
Epoch [30], val_loss: 310975808.0000


In [273]:
epochs = 40
lr = .000001
history4 = fit(epochs, lr, model, train_loader, val_loader)

Epoch [20], val_loss: 310975808.0000
Epoch [40], val_loss: 310975808.0000


In [274]:
epochs = 100
lr = .00001
history5 = fit(epochs, lr, model, train_loader, val_loader)

Epoch [20], val_loss: 310975808.0000
Epoch [40], val_loss: 310975808.0000
Epoch [60], val_loss: 310975808.0000
Epoch [80], val_loss: 310975808.0000
Epoch [100], val_loss: 310975808.0000
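The validation loss is identical in every run above. One possible explanation, offered as an assumption rather than a confirmed diagnosis, is that the final `F.relu` in `forward` has been pushed into its flat region: once the value entering the ReLU is negative, the output is 0 and no gradient flows back, so the weights stop changing. The toy example below shows that effect in isolation.

```python
import torch
import torch.nn.functional as F

# A negative value entering the final ReLU (illustrative number).
pre_activation = torch.tensor([[-3.0]], requires_grad=True)
target = torch.tensor([[5.0]])

out = F.relu(pre_activation)    # clamped to 0
loss = F.mse_loss(out, target)  # (0 - 5)^2 = 25
loss.backward()

# The gradient with respect to the pre-activation is zero, so an SGD step
# would leave whatever produced this value unchanged.
grad = pre_activation.grad
print(out.item(), loss.item(), grad.abs().item())
```

If this is what happened, re-initializing the model and removing the activation on the output layer (or starting with a much smaller learning rate) would be worth trying.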


**Q: What is the final validation loss of your model?**

In [275]:
# Earlier runs gave 325285792.0 (split = 0.1) and 336885312.0 (split = 0.2);
# the training runs recorded above all end at 310975808.0.
val_loss = 310975808.0

## Step 5: Make predictions using the trained model

**Q: Complete the following function definition to make predictions on a single input.**

In [276]:
def predict_single(input, target, model):
    inputs = input.unsqueeze(0)
    predictions = model(inputs)                # call the model itself rather than .forward()
    prediction = predictions[0].detach()
    print("Input:", input)
    print("Target:", target)
    print("Prediction:", prediction)
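A variation on this function, shown as a sketch: wrapping the forward pass in `torch.no_grad()` avoids building a computation graph during inference, so no `.detach()` is needed. `predict_single_no_grad` is a hypothetical helper name, and the tiny `nn.Linear` model stands in for the trained model.

```python
import torch
import torch.nn as nn

def predict_single_no_grad(x, model):  # hypothetical helper
    """Return the model's prediction for one sample without tracking gradients."""
    model.eval()                      # switch layers such as dropout to inference mode
    with torch.no_grad():             # no computation graph is recorded
        batch = x.unsqueeze(0)        # add a batch dimension: (features,) -> (1, features)
        return model(batch)[0]        # drop the batch dimension again

toy_model = nn.Linear(3, 1)           # stand-in for a trained model
sample = torch.tensor([1.0, 2.0, 3.0])
prediction = predict_single_no_grad(sample, toy_model)
print(prediction.shape, prediction.requires_grad)  # torch.Size([1]) False
```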

In [277]:
input, target = val_ds[0]
predict_single(input, target, model)

Input: tensor([4.2000e+01, 1.0000e+00, 2.4394e+01, 0.0000e+00, 1.0000e+00, 2.0000e+00,
        2.0296e+04])
Target: tensor([20296.1641])
Prediction: tensor([0.])


In [278]:
input, target = val_ds[10]
predict_single(input, target, model)

Input: tensor([4.2000e+01, 0.0000e+00, 2.4735e+01, 2.0000e+00, 0.0000e+00, 1.0000e+00,
        8.3377e+03])
Target: tensor([8337.7432])
Prediction: tensor([0.])


In [279]:
input, target = val_ds[23]
predict_single(input, target, model)

Input: tensor([3.2000e+01, 0.0000e+00, 4.0689e+01, 0.0000e+00, 0.0000e+00, 3.0000e+00,
        4.1494e+03])
Target: tensor([4149.4346])
Prediction: tensor([0.])


In [281]:
input, target = val_ds[100]
predict_single(input, target, model)

Input: tensor([5.5000e+01, 0.0000e+00, 2.9839e+01, 2.0000e+00, 0.0000e+00, 2.0000e+00,
        1.2357e+04])
Target: tensor([12357.2480])
Prediction: tensor([0.])


## (Optional) Step 6: Try another dataset & blog about it

While this last step is optional for the submission of your assignment, we highly recommend that you do it. Try to clean up & replicate this notebook for a different linear regression or logistic regression problem. This will help solidify your understanding, and give you a chance to differentiate the generic patterns in machine learning from problem-specific details.

Here are some sources to find good datasets:

- https://lionbridge.ai/datasets/10-open-datasets-for-linear-regression/
- https://www.kaggle.com/rtatman/datasets-for-regression-analysis
- https://archive.ics.uci.edu/ml/datasets.php?format=&task=reg&att=&area=&numAtt=&numIns=&type=&sort=nameUp&view=table
- https://people.sc.fsu.edu/~jburkardt/datasets/regression/regression.html
- https://archive.ics.uci.edu/ml/datasets/wine+quality
- https://pytorch.org/docs/stable/torchvision/datasets.html

We also recommend that you write a blog about your approach to the problem. Here is a suggested structure for your post (feel free to experiment with it):

- Interesting title & subtitle
- Overview of what the blog covers (which dataset, linear regression or logistic regression, intro to PyTorch)
- Downloading & exploring the data
- Preparing the data for training
- Creating a model using PyTorch
- Training the model to fit the data
- Your thoughts on how to experiment with different hyperparameters to reduce loss
- Making predictions using the model

Ref. source: https://www.kaggle.com/code/sanath123/insurance-model-using-pytorch/notebook