<a href="https://colab.research.google.com/github/jeffheaton/app_deep_learning/blob/main/assignments/assignment_yourname_class4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# T81-558: Applications of Deep Neural Networks
* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), McKelvey School of Engineering, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)
* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).

**Module 4 Assignment: Classification and Regression Neural Network**

**Student Name: Your Name**

# Assignment Instructions

For this assignment, you will use the **crx.csv** dataset.  This is a public dataset that you can find [here](https://archive.ics.uci.edu/ml/datasets/credit+approval). You should use the CSV file on my data site, at this location: [crx.csv](https://data.heatonresearch.com/data/t81-558/crx.csv) because it includes column headers.  The primary use for this dataset is binary classification. There are 15 attributes, plus a target column that contains only + or -.  Some of the columns have missing values.

You should train a neural network and return the predictions.  You will submit these predictions to the **submit** function.  See [Assignment #1](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/assignments/assignment_yourname_class1.ipynb) for details on how to submit an assignment or check that one was submitted.

Complete the following tasks:

* Your task is to replace missing values in columns *a2* and *a14* with values estimated by a neural network (one neural network for *a2* and another for *a14*).
* Your submission file will contain the same headers as the source CSV: *a1*, *a2*, *s3*, *a4*, *a5*, *a6*, *a7*, *a8*, *a9*, *a10*, *a11*, *a12*, *a13*, *a14*, *a15*, and *a16*.
* You should only need to modify *a2* and *a14*.
* Neural networks can be much more powerful at filling missing values than the median or mean.
* Train two neural networks to predict *a2* and *a14*.  
* The *y* (target) for training the two nets will be *a2* and *a14*, depending on which you are trying to fill.
* The x for training the two nets will be 's3','a8','a9','a10','a11','a12','a13','a15'.  These are chosen because it is important not to use any columns with missing values; also, it could cause unwanted bias if we include the ultimate target (*a16*).
* ONLY predict new values for missing values in *a2* and *a14*.
* You will likely get this small warning:  Warning: The mean of column a14 differs from the solution file by 0.20238937709643778. (might not matter if small)
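
The row-selection logic in the task list above can be sketched on a small made-up table. This is only an illustration under stated assumptions: the `toy` DataFrame and its values are hypothetical (not taken from crx.csv), and an ordinary least-squares fit stands in for the neural network so the example stays short. The point is the pattern: train on rows where the target is present, scale with training statistics, and write predictions back only where the target was missing.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for crx.csv: two complete feature columns, one column with gaps.
toy = pd.DataFrame({
    "s3": [0.0, 4.5, 1.5, 2.0, 5.0, 0.5],
    "a8": [1.2, 3.0, 0.3, 1.0, 2.5, 0.1],
    "a2": [30.8, 58.7, np.nan, 27.8, np.nan, 20.2],
})
features = ["s3", "a8"]   # columns with no missing values
target = "a2"             # column whose gaps we want to fill

known = toy[target].notna()   # rows usable for training
missing = ~known              # rows that need a prediction

# Scale both subsets with the TRAINING mean and standard deviation.
mu = toy.loc[known, features].mean()
sigma = toy.loc[known, features].std(ddof=0)
x_train = ((toy.loc[known, features] - mu) / sigma).to_numpy()
x_missing = ((toy.loc[missing, features] - mu) / sigma).to_numpy()

def add_bias(x):
    """Append a column of ones so the linear fit has an intercept."""
    return np.hstack([x, np.ones((x.shape[0], 1))])

# Placeholder model (least squares); the assignment asks for a neural network here.
weights, *_ = np.linalg.lstsq(
    add_bias(x_train), toy.loc[known, target].to_numpy(), rcond=None
)

# Only the rows that were missing receive new values.
filled = toy.copy()
filled.loc[missing, target] = add_bias(x_missing) @ weights
print(filled)
```

Swapping the least-squares step for a trained network leaves the surrounding masking and write-back code unchanged.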



# Google CoLab Instructions

If you are using Google CoLab, it will be necessary to mount your GDrive so that you can send your notebook during the submit process. Running the following code will map your GDrive to ```/content/drive```.

In [None]:
try:
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)
    COLAB = True
    print("Note: using Google CoLab")
except:
    print("Note: not using Google CoLab")
    COLAB = False

# Early stopping (see module 3.4)
import copy

class EarlyStopping:
    def __init__(self, patience=5, min_delta=0, restore_best_weights=True):
        self.patience = patience
        self.min_delta = min_delta
        self.restore_best_weights = restore_best_weights
        self.best_model = None
        self.best_loss = None
        self.counter = 0
        self.status = ""

    def __call__(self, model, val_loss):
        if self.best_loss is None:
            self.best_loss = val_loss
            self.best_model = copy.deepcopy(model.state_dict())
        elif self.best_loss - val_loss >= self.min_delta:
            self.best_model = copy.deepcopy(model.state_dict())
            self.best_loss = val_loss
            self.counter = 0
            self.status = f"Improvement found, counter reset to {self.counter}"
        else:
            self.counter += 1
            self.status = f"No improvement in the last {self.counter} epochs"
            if self.counter >= self.patience:
                self.status = f"Early stopping triggered after {self.counter} epochs."
                if self.restore_best_weights:
                    model.load_state_dict(self.best_model)
                return True
        return False

# Make use of a GPU or MPS (Apple) if one is available.  (see module 3.2)
import torch

device = (
    "mps"
    if torch.backends.mps.is_available()
    else "cuda"
    if torch.cuda.is_available()
    else "cpu"
)
print(f"Using device: {device}")

Mounted at /content/drive
Note: using Google CoLab
Using device: cpu




# Assignment Submit Function

You will submit the ten programming assignments electronically.  The following **submit** function can be used to do this.  My server will perform a basic check of each assignment and let you know if it sees any underlying problems.

**It is unlikely that you will need to modify this function.**
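
To see what the function sends for a DataFrame, the sketch below repeats its CSV and base64 encoding steps and then reverses them. The `frame` contents are made-up illustrative rows, and the decoding half is an assumption about how one might inspect the payload locally, not a description of what the server does.

```python
import base64
import io

import pandas as pd

# Illustrative rows only; a real submission would use the full assignment DataFrame.
frame = pd.DataFrame({"a2": [30.5, 58.25], "a16": ["+", "-"]})

# Same encoding steps the submit function applies to each DataFrame.
encoded = base64.b64encode(frame.to_csv(index=False).encode("ascii")).decode("ascii")

# Reverse the steps to inspect the CSV text that would be transmitted.
decoded_text = base64.b64decode(encoded).decode("ascii")
restored = pd.read_csv(io.StringIO(decoded_text))
print(decoded_text)
```

Because the CSV text is encoded as ASCII, a DataFrame containing non-ASCII characters would raise a `UnicodeEncodeError` at this step.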

In [None]:
import base64
import os
import numpy as np
import pandas as pd
import requests
import PIL
import PIL.Image
import io

# This function submits an assignment.  You can submit an assignment as many times as you like; only the final
# submission counts.  The parameters are as follows:
# data - List of pandas dataframes or images.
# key - Your student key that was emailed to you.
# course - The course that you are in, currently t81-558 or t81-559.
# no - The assignment class number, should be 1 through 10.
# source_file - The full path to your Python or IPYNB file.  This must have "_class1" as part of its name.
#               The number must match your assignment number.  For example "_class2" for class assignment #2.
def submit(data,key,course,no,source_file=None):
    if source_file is None and '__file__' not in globals(): raise Exception('Must specify a filename when using a Jupyter notebook.')
    if source_file is None: source_file = __file__
    suffix = '_class{}'.format(no)
    if suffix not in source_file: raise Exception('{} must be part of the filename.'.format(suffix))
    with open(source_file, "rb") as image_file:
        encoded_python = base64.b64encode(image_file.read()).decode('ascii')
    ext = os.path.splitext(source_file)[-1].lower()
    if ext not in ['.ipynb','.py']: raise Exception("Source file is {} must be .py or .ipynb".format(ext))
    payload = []
    for item in data:
        if type(item) is PIL.Image.Image:
            buffered = io.BytesIO()
            item.save(buffered, format="PNG")
            payload.append({'PNG':base64.b64encode(buffered.getvalue()).decode('ascii')})
        elif type(item) is pd.core.frame.DataFrame:
            payload.append({'CSV':base64.b64encode(item.to_csv(index=False).encode('ascii')).decode("ascii")})
    r= requests.post("https://api.heatonresearch.com/wu/submit",
        headers={'x-api-key':key}, json={ 'payload': payload,'assignment': no, 'course':course, 'ext':ext, 'py':encoded_python})
    if r.status_code==200:
        print("Success: {}".format(r.text))
    else: print("Failure: {}".format(r.text))

# Assignment #4 Sample Code

The following code provides a starting point for this assignment.

In [None]:
import os
import pandas as pd
from scipy.stats import zscore
import io
import requests
import numpy as np
from sklearn import metrics

# This is your student key that I emailed to you at the beginning of the semester.
# key = "H3B554uPhc3f8kirGGBYA7cYuDOamhXM87OY6QH1"  # This is an example key and will not work.
key = ""

# You must also identify your source file.  (modify for your local setup)
# file='/content/drive/My Drive/Colab Notebooks/assignment_solution_class4.ipynb'  # Google CoLab
# file='C:\\Users\\jeffh\\projects\\t81_558_deep_learning\\assignments\\assignment_yourname_class4.ipynb'  # Windows
# file='/Users/jeff/projects/t81_558_deep_learning/assignments/assignment_yourname_class4.ipynb'  # Mac/Linux
file= '/content/drive/My Drive/Colab Notebooks/assignment_ZhijiangLi_class4.ipynb'



# Begin assignment
df = pd.read_csv("https://data.heatonresearch.com/data/t81-558/crx.csv",na_values=['?'])





In [None]:
import pandas as pd
import numpy as np
from scipy.stats import zscore
from sklearn.model_selection import train_test_split
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
import tqdm
from sklearn.preprocessing import LabelEncoder, StandardScaler, OneHotEncoder


class NeuralNetwork(nn.Module):
    def __init__(self, input_size):
        super(NeuralNetwork, self).__init__()
        # Define the layers of the network
        self.fc1 = nn.Linear(input_size, 64)   # First fully connected layer with 64 units
        self.fc2 = nn.Linear(64, 32)           # Second fully connected layer with 32 units
        self.fc3 = nn.Linear(32, 16)           # Third fully connected layer with 16 units
        self.fc4 = nn.Linear(16, 1)            # Output layer for regression (1 output)

    def forward(self, x):
        # Define the forward pass through the layers
        x = torch.relu(self.fc1(x))   # Apply ReLU activation to the first layer
        x = torch.relu(self.fc2(x))   # Apply ReLU activation to the second layer
        x = torch.relu(self.fc3(x))   # Apply ReLU activation to the third layer
        x = self.fc4(x)               # Output layer, no activation (for regression)
        return x

def fill_missing_numeric(df, current, target):
    # le = LabelEncoder()
    # df[target] = le.fit_transform(df[target].astype(str))
    df['a16'] = df['a16'].map({'+': 1, '-': 0})
    df['a9'] = df['a9'].map({'t': 1, 'f': 0})
    df['a10'] = df['a10'].map({'t': 1, 'f': 0})
    df['a12'] = df['a12'].map({'t': 1, 'f': 0})
    df['a13'] = df['a13'].map({'g': 1, 's': 0})

    # Identify numeric columns
    numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    numeric_cols.remove(current)  # Remove the current column from input features

    # Impute missing values in numeric columns with median values
    for col in numeric_cols:
        df[col] = df[col].fillna(df[col].median())

    # Separate the data into sets with and without missing 'current' values
    train_data = df[df[current].notna()]
    missing_data = df[df[current].isna()]

    # Early exit if no missing data is present
    if missing_data.empty:
        return df

    # Create input features and target for training data
    X_train = train_data[numeric_cols + [target]]
    #X_train = train_data[numeric_cols]
    y_train = train_data[current]

    # Scale features using z-score for train data
    X_train_scaled = X_train.apply(zscore).values

    # Convert to PyTorch tensors for train data
    X_train_tensor = torch.tensor(X_train_scaled, dtype=torch.float32)
    y_train_tensor = torch.tensor(y_train.values, dtype=torch.float32).view(-1, 1)

    # Determine the input size from the number of features in X_train
    input_size = X_train.shape[1]

    # Create the model with the correct input size
    model = NeuralNetwork(input_size)
    criterion = nn.MSELoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    # DataLoader for training data
    dataset = TensorDataset(X_train_tensor, y_train_tensor)
    loader = DataLoader(dataset, batch_size=64, shuffle=True)

    # Training loop
    epochs = 100
    for epoch in tqdm.tqdm(range(epochs)):
        for X_batch, y_batch in loader:
            model.train()
            optimizer.zero_grad()
            y_pred = model(X_batch)
            loss = criterion(y_pred, y_batch)
            loss.backward()
            optimizer.step()

    # If there are missing values, predict them
    if not missing_data.empty:
        # Create input features for missing data
        X_missing = missing_data[numeric_cols + [target]]
        #X_missing = missing_data[numeric_cols]

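        # Scale the missing-data features with the training-set mean and std
        # (ddof=0 matches scipy's zscore) so both sets share the same scale
        X_missing_scaled = ((X_missing - X_train.mean()) / X_train.std(ddof=0)).values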

        # Convert to PyTorch tensors for missing data
        X_missing_tensor = torch.tensor(X_missing_scaled, dtype=torch.float32)

        # Predict missing values
        model.eval()
        with torch.no_grad():
            predicted_values = model(X_missing_tensor).numpy().flatten()
            df.loc[df[current].isna(), current] = predicted_values

    return df

# string_columns = ['a9', 'a10', 'a12', 'a13']
# for col in string_columns:
#     df[col] = df[col].astype(str)

# feature_cols = ['s3', 'a8', 'a9', 'a10', 'a11', 'a12', 'a13', 'a15']

# df = train_predict_model(df, feature_cols, 'a2')

# df = train_predict_model(df, feature_cols, 'a14')

df_submit = df.copy()

df_1 = df.copy()
df_2 = df.copy()

df_r1 = fill_missing_numeric(df_1, 'a2', 'a16')
df_r2 = fill_missing_numeric(df_2, 'a14', 'a16')

df_submit['a2'] = df_r1['a2']
df_submit['a14'] = df_r2['a14']

# string_columns = ['a9', 'a10', 'a12', 'a13']
# for col in string_columns:
#     df_submit[col] = df_submit[col].astype(str)


# df_submit['a16'] = df_submit['a16'].map({1: '+', 0: '-'})

# df_submit['a16'] = df_submit['a16'].astype(str)

# df_submit['a9'] = df_submit['a9'].map({1:'t', 0:'f'})
# df_submit['a10'] = df_submit['a10'].map({1:'t', 0:'f'})
# df_submit['a12'] = df_submit['a12'].map({1:'t', 0:'f'})
# df_submit['a13'] = df_submit['a13'].map({1:'g', 0:'s'})

# df_submit['a9'] = df_submit['a9'].astype(str)
# df_submit['a10'] = df_submit['a10'].astype(str)
# df_submit['a12'] = df_submit['a12'].astype(str)
# df_submit['a13'] = df_submit['a13'].astype(str)

string_columns = ['a9', 'a10', 'a12', 'a13', 'a16']
for col in string_columns:
    df_submit[col] = df_submit[col].astype(str)

#print(df_submit['a16'])

submit(source_file=file,data=[df_submit],key=key,course='t81-558',no=4)

# df_submit.head()



100%|██████████| 100/100 [00:02<00:00, 35.20it/s]
100%|██████████| 100/100 [00:02<00:00, 36.18it/s]


Success: Submitted Assignment 4 (t81-558) for l.zhijiang:
You have submitted this assignment 91 times. (this is fine)
Note: The mean difference 0.008878891495140095 for column 'a2' is acceptable and is less than the maximum allowed value of '1.0' for this assignment.
Note: The mean difference 0.8307530712372397 for column 'a14' is acceptable and is less than the maximum allowed value of '1.0' for this assignment.
