<a href="https://colab.research.google.com/github/jeffheaton/app_deep_learning/blob/main/assignments/assignment_yourname_class8.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# T81-558: Applications of Deep Neural Networks
* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), McKelvey School of Engineering, [Washington University in St. Louis](https://engineering.wustl.edu/academics/programs/index.html)
* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).

**Module 8 Assignment: Feature Engineering**

**Student Name: Your Name**

# Assignment Instructions

This assignment is similar to assignment 5, except that you must use feature engineering to solve it.  I provide you with a dataset that contains dimensions and the quality of items of specific shapes.  With the values of 'height', 'width', 'depth'. 'shape', and 'quality' you should try to predict the cost of these items.  You should be able to match very close to solution file, if you feature engineer correctly.  To get full credit your average cost should not be more than 50 off from the solution.  The autocorrector will let you know if you are in this range.

You can find all of the needed CSV files here:

* [Shapes - Training](https://data.heatonresearch.com/data/t81-558/datasets/shapes-train.csv)
* [Shapes - Submit](https://data.heatonresearch.com/data/t81-558/datasets/shapes-test.csv)

Use the training file to train your neural network and submit results for for the data contained in the test/submit file.

In [1]:
try:
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)
    COLAB = True
    print("Note: using Google CoLab")
except:
    print("Note: not using Google CoLab")
    COLAB = False

Mounted at /content/drive
Note: using Google CoLab


# Assignment Submit Function

You will submit the 10 programming assignments electronically.  The following submit function can be used to do this.  My server will perform a basic check of each assignment and let you know if it sees any basic problems. 

**It is unlikely that should need to modify this function.**

In [27]:
import base64
import os
import numpy as np
import pandas as pd
import requests
import PIL
import PIL.Image
import io

# This function submits an assignment.  You can submit an assignment as much as you like, only the final
# submission counts.  The paramaters are as follows:
# data - List of pandas dataframes or images.
# key - Your student key that was emailed to you.
# no - The assignment class number, should be 1 through 1.
# source_file - The full path to your Python or IPYNB file.  This must have "_class1" as part of its name.  
# .             The number must match your assignment number.  For example "_class2" for class assignment #2.
def submit(data,key,no,source_file=None):
    if source_file is None and '__file__' not in globals(): raise Exception('Must specify a filename when a Jupyter notebook.')
    if source_file is None: source_file = __file__
    suffix = '_class{}'.format(no)
    if suffix not in source_file: raise Exception('{} must be part of the filename.'.format(suffix))
    with open(source_file, "rb") as image_file:
        encoded_python = base64.b64encode(image_file.read()).decode('ascii')
    ext = os.path.splitext(source_file)[-1].lower()
    if ext not in ['.ipynb','.py']: raise Exception("Source file is {} must be .py or .ipynb".format(ext))
    payload = []
    for item in data:
        if type(item) is PIL.Image.Image:
            buffered = BytesIO()
            item.save(buffered, format="PNG")
            payload.append({'PNG':base64.b64encode(buffered.getvalue()).decode('ascii')})
        elif type(item) is pd.core.frame.DataFrame:
            payload.append({'CSV':base64.b64encode(item.to_csv(index=False).encode('ascii')).decode("ascii")})
    r= requests.post("https://api.heatonresearch.com/assignment-submit",
        headers={'x-api-key':key}, json={ 'payload': payload,'assignment': no, 'ext':ext, 'py':encoded_python})
    if r.status_code==200:
        print("Success: {}".format(r.text))
    else: print("Failure: {}".format(r.text))

# Assignment #8 Sample Code

The following code provides a starting point for this assignment.

In [22]:
import copy


class EarlyStopping:
    def __init__(self, patience=50, min_delta=0, restore_best_weights=True):
        self.patience = patience
        self.min_delta = min_delta
        self.restore_best_weights = restore_best_weights
        self.best_model = None
        self.best_loss = None
        self.counter = 0
        self.status = ""

    def __call__(self, model, val_loss):
        if self.best_loss is None:
            self.best_loss = val_loss
            self.best_model = copy.deepcopy(model.state_dict())
        elif self.best_loss - val_loss >= self.min_delta:
            self.best_model = copy.deepcopy(model.state_dict())
            self.best_loss = val_loss
            self.counter = 0
            self.status = f"Improvement found, counter reset to {self.counter}"
        else:
            self.counter += 1
            self.status = f"No improvement in the last {self.counter} epochs"
            if self.counter >= self.patience:
                self.status = f"Early stopping triggered after {self.counter} epochs."
                if self.restore_best_weights:
                    model.load_state_dict(self.best_model)
                return True
        return False

In [11]:
from sklearn import metrics
import scipy as sp
import numpy as np
import math
from sklearn import metrics

import torch
import torch.nn.functional as F
import pandas as pd

def perturbation_rank(device, model, x, y, names, regression):
    model.to(device)
    model.eval() # set the model to evaluation mode

    #x = torch.tensor(x).float().to(device)
    #y = torch.tensor(y).float().to(device)
    
    errors = []

    for i in range(x.shape[1]):
        hold = x[:, i].clone()
        x[:, i] = torch.randperm(x.shape[0]).to(device)  # shuffling
        
        with torch.no_grad():
            pred = model(x)

        if regression:
            loss_fn = torch.nn.MSELoss()
            error = loss_fn(y, pred).item()
        else:
            # pred should be probabilities; apply softmax if not done in model's forward method
            if len(pred.shape) == 2 and pred.shape[1] > 1:
                pred = F.softmax(pred, dim=1)
                loss_fn = torch.nn.CrossEntropyLoss()
                error = loss_fn(pred, y.long()).item()
            else:
                loss_fn = nn.MSELoss()
                error = loss_fn(y, pred).item()
            
            
        errors.append(error)
        x[:, i] = hold
        
    max_error = max(errors)
    importance = [e/max_error for e in errors]

    data = {'name':names, 'error':errors, 'importance':importance}
    result = pd.DataFrame(data, columns=['name', 'error', 'importance'])
    result.sort_values(by=['importance'], ascending=[0], inplace=True)
    result.reset_index(inplace=True, drop=True)
    return result

In [23]:
import time

import numpy as np
import pandas as pd
import torch.nn as nn
import torch.nn.functional as F
import tqdm
from sklearn import preprocessing
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from torch.autograd import Variable
from torch.utils.data import DataLoader, TensorDataset

device = 'mps'
# Read the MPG dataset.
df =pd.read_csv("https://data.heatonresearch.com/data/t81-558/datasets/shapes-train.csv")

# Pandas to Numpy
df_dummies = pd.get_dummies(df['shape']).astype(int)
df = pd.concat([df, df_dummies], axis=1)
result = df['cost']
x_columns = df.columns.drop(['shape', 'cost', 'id'])
x = df[x_columns].values
print(x_columns)
print(result)
y = result.values  # regression

# Split into validation and training sets
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=42
)

# Numpy to Torch Tensor
x_train = torch.tensor(x_train, device=device, dtype=torch.float32)
y_train = torch.tensor(y_train, device=device, dtype=torch.float32)

x_test = torch.tensor(x_test, device=device, dtype=torch.float32)
y_test = torch.tensor(y_test, device=device, dtype=torch.float32)


# Create datasets
BATCH_SIZE = 16

dataset_train = TensorDataset(x_train, y_train)
dataloader_train = DataLoader(dataset_train, batch_size=BATCH_SIZE, shuffle=True)

dataset_test = TensorDataset(x_test, y_test)
dataloader_test = DataLoader(dataset_test, batch_size=BATCH_SIZE, shuffle=True)


# Create model

model = nn.Sequential(
    nn.Linear(x_train.shape[1], 50), 
    nn.ReLU(), 
    nn.Linear(50, 25), 
    nn.ReLU(), 
    nn.Linear(25, 1)
)

model = torch.compile(model, backend="aot_eager").to(device)

# Define the loss function for regression
loss_fn = nn.MSELoss()

# Define the optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

es = EarlyStopping()

epoch = 0
done = False
while epoch < 1000 and not done:
    epoch += 1
    steps = list(enumerate(dataloader_train))
    pbar = tqdm.tqdm(steps)
    model.train()
    for i, (x_batch, y_batch) in pbar:
        y_batch_pred = model(x_batch).flatten()  #
        loss = loss_fn(y_batch_pred, y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        loss, current = loss.item(), (i + 1) * len(x_batch)
        if i == len(steps) - 1:
            model.eval()
            pred = model(x_test).flatten()
            vloss = loss_fn(pred, y_test)
            if es(model, vloss):
                done = True
            pbar.set_description(
                f"Epoch: {epoch}, tloss: {loss}, vloss: {vloss:>7f}, EStop:[{es.status}]"
            )
        else:
            pbar.set_description(f"Epoch: {epoch}, tloss {loss:}")

from sklearn import metrics

# Measure RMSE error.  RMSE is common for regression.
pred = model(x_test)
score = torch.sqrt(torch.nn.functional.mse_loss(pred.flatten(), y_test))
print(f"Final score (RMSE): {score}")

Index(['height', 'width', 'depth', 'quality', 'box', 'cylinder', 'ellipsoid'], dtype='object')
0        200.49
1       1175.71
2        131.72
3         15.83
4        340.21
         ...   
9995    3918.36
9996      35.57
9997     325.90
9998     481.23
9999     824.83
Name: cost, Length: 10000, dtype: float64


Epoch: 1, tloss: 8715.2001953125, vloss: 116250.695312, EStop:[]: 100%|██████████| 469/469 [00:04<00:00, 108.89it/s]
Epoch: 2, tloss: 103877.78125, vloss: 62129.804688, EStop:[Improvement found, counter reset to 0]: 100%|██████████| 469/469 [00:03<00:00, 119.22it/s]
Epoch: 3, tloss: 8979.873046875, vloss: 55541.324219, EStop:[Improvement found, counter reset to 0]: 100%|██████████| 469/469 [00:04<00:00, 116.07it/s]
Epoch: 4, tloss: 149801.6875, vloss: 58336.093750, EStop:[No improvement in the last 1 epochs]: 100%|██████████| 469/469 [00:03<00:00, 118.43it/s]
Epoch: 5, tloss: 34558.9609375, vloss: 50142.160156, EStop:[Improvement found, counter reset to 0]: 100%|██████████| 469/469 [00:03<00:00, 117.68it/s]
Epoch: 6, tloss: 89361.4921875, vloss: 39417.894531, EStop:[Improvement found, counter reset to 0]: 100%|██████████| 469/469 [00:03<00:00, 117.91it/s]
Epoch: 7, tloss: 157668.65625, vloss: 39259.000000, EStop:[Improvement found, counter reset to 0]: 100%|██████████| 469/469 [00:03<0

KeyboardInterrupt: 

In [24]:
from IPython.display import display, HTML

names = list(df.columns) # x+y column name
names.remove('id')
names.remove('shape')
names.remove("cost") # remove the target(y)
print(names)
rank = perturbation_rank(device, model, x_test, y_test, names, True)
display(rank)

['height', 'width', 'depth', 'quality', 'box', 'cylinder', 'ellipsoid']


  return F.mse_loss(input, target, reduction=self.reduction)


Unnamed: 0,name,error,importance
0,box,810990500.0,1.0
1,cylinder,453834900.0,0.559606
2,width,43281650.0,0.053369
3,height,38544570.0,0.047528
4,depth,22616700.0,0.027888
5,ellipsoid,20057830.0,0.024733
6,quality,975930.6,0.001203


In [42]:
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import torch
import torch.nn as nn
import torch.optim as optim

# This is your student key that I emailed to you at the beginnning of the semester.
key = "r2nrqz2pX53SGKnwA07UW52mBbNzuLpf8e2ZYIV9"  # This is an example key and will not work.

# You must also identify your source file.  (modify for your local setup)
file='/Users/zixuanni/Desktop/T81-558/assignment_zixuan_class8.ipynb'  # Google CoLab
# file='C:\\Users\\jeffh\\projects\\t81_558_deep_learning\\assignments\\assignment_yourname_class8.ipynb'  # Windows
# file='/Users/jheaton/projects/t81_558_deep_learning/assignments/assignment_yourname_class8.ipynb'  # Mac/Linux

# Begin assignment
df_train = pd.read_csv("https://data.heatonresearch.com/data/t81-558/datasets/shapes-train.csv")
df_submit = pd.read_csv("https://data.heatonresearch.com/data/t81-558/datasets/shapes-test.csv")
id = df_submit['id']
df_dummies = pd.get_dummies(df_submit['shape']).astype(int)
df_submit = pd.concat([df_submit, df_dummies], axis=1)
x_columns = df_submit.columns.drop(['shape', 'id'])
x = df_submit[x_columns].values
x_pred = torch.tensor(x, device=device, dtype=torch.float32)
pred = model(x_pred)
df_submit = pd.DataFrame(pred.cpu().detach().numpy())
df_submit.rename(columns={df_submit.columns[0]: 'cost'}, inplace=True)
df_submit['id'] = id
print(df_submit)
submit(source_file=file,data=[df_submit],key=key,no=8)

            cost     id
0      10.905555  10001
1     336.237671  10002
2      20.813232  10003
3     981.914978  10004
4     266.961121  10005
...          ...    ...
1995  824.682800  11996
1996  278.743805  11997
1997  241.665329  11998
1998   39.184223  11999
1999   11.798323  12000

[2000 rows x 2 columns]
Success: Submitted Assignment 8 for n.zixuan:
You have submitted this assignment 8 times. (this is fine)
Note: The mean difference 18.44340547050001 for column 'cost' is acceptable and is less than the maximum allowed value of '50.0' for this assignment.
