<a href="https://colab.research.google.com/github/jeffheaton/app_deep_learning/blob/master/assignments/assignment_yourname_t81_558_class2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# T81-558: Applications of Deep Neural Networks
* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), School of Engineering and Applied Science, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)
* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).

**Module 2 Assignment: Creating Columns in Pandas**

**Student Name: Chang(Jason) Ti**

# Assignment Instructions

For this assignment, you will use the **reg-36-data.csv** dataset.  This file contains a dataset that I generated specifically for this class.  You can find the CSV file on my data site, at this location: [reg-36-data.csv](http://data.heatonresearch.com/data/t81-558/datasets/reg-36-data.csv).

For this assignment, load and modify the data set.  You will submit this modified dataset to the **submit** function.  See [Assignment #1](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/assignments/assignment_yourname_class1.ipynb) for details on how to submit an assignment or check that one was submitted.

Modify the dataset as follows:

* Add a column named *ratio* that is *max* divided by *number*.  Leave *max* and *number* in the dataframe.
* Replace the *cat2* column with dummy variables. e.g. 'cat2_CA-0', 'cat2_CA-1',
       'cat2_CA-10', 'cat2_CA-11', 'cat2_CA-12', ...
* Replace the *item* column with dummy variables, e.g. 'item_IT-0', 'item_IT-1',
       'item_IT-10', 'item_IT-11', 'item_IT-12', ...
* For field *length* replace missing values with the median of *length*.
* For field *height*, replace missing values with the median of *height*, then convert the column to z-scores.
* Remove all other columns.
* Your submitted dataframe will have these columns: 'height', 'max', 'number', 'length', 'ratio', 'cat2_CA-0', 'cat2_CA-1',
       'cat2_CA-10', 'cat2_CA-11', 'cat2_CA-12', 'cat2_CA-13', 'cat2_CA-14',
       'cat2_CA-15', 'cat2_CA-16', 'cat2_CA-17', 'cat2_CA-18', 'cat2_CA-19',
       'cat2_CA-1A', 'cat2_CA-1B', 'cat2_CA-1C', 'cat2_CA-1D', 'cat2_CA-1E',
       'cat2_CA-1F', 'cat2_CA-2', 'cat2_CA-20', 'cat2_CA-21', 'cat2_CA-22',
       'cat2_CA-23', 'cat2_CA-24', 'cat2_CA-25', 'cat2_CA-26', 'cat2_CA-27',
       'cat2_CA-3', 'cat2_CA-4', 'cat2_CA-5', 'cat2_CA-6', 'cat2_CA-7',
       'cat2_CA-8', 'cat2_CA-9', 'cat2_CA-A', 'cat2_CA-B', 'cat2_CA-C',
       'cat2_CA-D', 'cat2_CA-E', 'cat2_CA-F', 'item_IT-0', 'item_IT-1',
       'item_IT-10', 'item_IT-11', 'item_IT-12', 'item_IT-13', 'item_IT-14',
       'item_IT-15', 'item_IT-16', 'item_IT-17', 'item_IT-18', 'item_IT-19',
       'item_IT-1A', 'item_IT-1B', 'item_IT-1C', 'item_IT-1D', 'item_IT-1E',
       'item_IT-2', 'item_IT-3', 'item_IT-4', 'item_IT-5', 'item_IT-6',
       'item_IT-7', 'item_IT-8', 'item_IT-9', 'item_IT-A', 'item_IT-B',
       'item_IT-C', 'item_IT-D', 'item_IT-E', 'item_IT-F'.
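
The transformations listed above can be sketched on a small, made-up dataframe. The values below are hypothetical and only illustrate the mechanics; the real columns come from **reg-36-data.csv**. The z-score is computed by hand here with the population standard deviation (`ddof=0`), which is also the default for `scipy.stats.zscore`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from reg-36-data.csv.
toy = pd.DataFrame({
    "max": [10, 20, 30, 40],
    "number": [2, 4, 5, 8],
    "cat2": ["CA-0", "CA-1", "CA-0", "CA-A"],
    "length": [1.0, np.nan, 3.0, 5.0],
    "height": [2.0, 4.0, np.nan, 6.0],
})

# New column: element-wise division of two existing columns.
toy["ratio"] = toy["max"] / toy["number"]

# Median imputation: Series.median() skips NaN by default.
toy["length"] = toy["length"].fillna(toy["length"].median())
toy["height"] = toy["height"].fillna(toy["height"].median())

# z-score: subtract the mean, divide by the population standard deviation.
toy["height"] = (toy["height"] - toy["height"].mean()) / toy["height"].std(ddof=0)

# One-hot encode cat2 and drop the original column.
dummies = pd.get_dummies(toy["cat2"], prefix="cat2")
toy = pd.concat([toy.drop(columns="cat2"), dummies], axis=1)

print(toy.columns.tolist())
```

Impute before standardizing: the mean and standard deviation used for the z-score are then computed on the filled column, which is the order the instructions describe.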

In [4]:
import pandas as pd
from scipy.stats import zscore

# Load the dataset
file_path = "/content/reg-36-data.csv"  # Ensure the file is uploaded to Colab
df = pd.read_csv(file_path)

# Add the 'ratio' column: max divided by number
df['ratio'] = df['max'] / df['number']

# Replace 'cat2' column with dummy variables
cat2_dummies = pd.get_dummies(df['cat2'], prefix='cat2')

# Replace 'item' column with dummy variables
item_dummies = pd.get_dummies(df['item'], prefix='item')

# Fill missing values in 'length' with the median
df['length'] = df['length'].fillna(df['length'].median())

# Fill missing values in 'height' with the median and apply z-score transformation
df['height'] = df['height'].fillna(df['height'].median())
df['height'] = zscore(df['height'])

# Keep only the required columns
required_columns = ['height', 'max', 'number', 'length', 'ratio']
final_df = df[required_columns].join(cat2_dummies).join(item_dummies)

# Ensure columns match the required output format by reordering them
expected_columns = [
    'height', 'max', 'number', 'length', 'ratio',
    'cat2_CA-0', 'cat2_CA-1', 'cat2_CA-10', 'cat2_CA-11', 'cat2_CA-12',
    'cat2_CA-13', 'cat2_CA-14', 'cat2_CA-15', 'cat2_CA-16', 'cat2_CA-17',
    'cat2_CA-18', 'cat2_CA-19', 'cat2_CA-1A', 'cat2_CA-1B', 'cat2_CA-1C',
    'cat2_CA-1D', 'cat2_CA-1E', 'cat2_CA-1F', 'cat2_CA-2', 'cat2_CA-20',
    'cat2_CA-21', 'cat2_CA-22', 'cat2_CA-23', 'cat2_CA-24', 'cat2_CA-25',
    'cat2_CA-26', 'cat2_CA-27', 'cat2_CA-3', 'cat2_CA-4', 'cat2_CA-5',
    'cat2_CA-6', 'cat2_CA-7', 'cat2_CA-8', 'cat2_CA-9', 'cat2_CA-A',
    'cat2_CA-B', 'cat2_CA-C', 'cat2_CA-D', 'cat2_CA-E', 'cat2_CA-F',
    'item_IT-0', 'item_IT-1', 'item_IT-10', 'item_IT-11', 'item_IT-12',
    'item_IT-13', 'item_IT-14', 'item_IT-15', 'item_IT-16', 'item_IT-17',
    'item_IT-18', 'item_IT-19', 'item_IT-1A', 'item_IT-1B', 'item_IT-1C',
    'item_IT-1D', 'item_IT-1E', 'item_IT-2', 'item_IT-3', 'item_IT-4',
    'item_IT-5', 'item_IT-6', 'item_IT-7', 'item_IT-8', 'item_IT-9',
    'item_IT-A', 'item_IT-B', 'item_IT-C', 'item_IT-D', 'item_IT-E', 'item_IT-F'
]

# Ensure all expected dummy columns exist, fill missing ones with 0
for col in expected_columns:
    if col not in final_df.columns:
        final_df[col] = 0

# Reorder columns
final_df = final_df[expected_columns]

# Display the processed dataframe in Google Colab
display(final_df)

# Save the processed dataset to CSV for submission
output_path = "/content/processed_reg-36-data.csv"
final_df.to_csv(output_path, index=False)

print(f"Processing complete. The processed dataset has been saved as '{output_path}'.")


Unnamed: 0,height,max,number,length,ratio,cat2_CA-0,cat2_CA-1,cat2_CA-10,cat2_CA-11,cat2_CA-12,...,item_IT-6,item_IT-7,item_IT-8,item_IT-9,item_IT-A,item_IT-B,item_IT-C,item_IT-D,item_IT-E,item_IT-F
0,0.496557,44907,16669,12471.1127,2.694043,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,-1.419844,48831,8652,10035.7085,5.643897,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,-0.288551,40760,23103,14442.6566,1.764273,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,1.344844,33597,17680,15121.4937,1.900283,False,False,False,False,False,...,False,False,True,False,False,False,False,False,False,False
4,-0.000328,29848,24136,18093.9147,1.236659,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
494,-0.000328,41369,21155,11691.7957,1.955519,False,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False
495,-0.327335,45004,24213,13873.7262,1.858671,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
496,-0.000328,65793,15942,14585.8900,4.127023,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
497,-0.418630,68441,20380,12625.3643,3.358243,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


Processing complete. The processed dataset has been saved as '/content/processed_reg-36-data.csv'.


# Google CoLab Instructions

If you are using Google CoLab, it will be necessary to mount your GDrive so that you can send your notebook during the submit process. Running the following code will map your GDrive to ```/content/drive```.

In [5]:
try:
  from google.colab import drive, userdata
  drive.mount('/content/drive', force_remount=True)
  COLAB = True
  print("Note: using Google CoLab")
except Exception:
  print("Note: not using Google CoLab")
  COLAB = False

# Assignment Submission Key - this was sent to you during the first week of class.
# If you are in both classes, this is the same key.
if COLAB:
  # For Colab, add to your "Secrets" (key icon at the left)
  key = userdata.get('T81_558_KEY')
else:
  # If not colab, enter your key here, or use an environment variable.
  # (this is only an example key, use yours)
  key = "Gx5en9cEVvaZnjhdaushddhuhhO4PsI32sgldAXj"

Mounted at /content/drive
Note: using Google CoLab
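
Outside Colab, the cell above suggests an environment variable as an alternative to pasting the key into the notebook. A minimal sketch of that option follows. The helper name `load_key` is hypothetical, and the variable name `T81_558_KEY` is an assumption that simply mirrors the Colab secret name.

```python
import os


def load_key(env_name: str = "T81_558_KEY") -> str:
    """Return the submission key from the environment, or fail with a clear message."""
    value = os.environ.get(env_name)
    if not value:
        # Hypothetical variable name; use whatever name you exported in your shell.
        raise RuntimeError(f"Set the {env_name} environment variable to your submission key.")
    return value
```

With the variable exported in your shell, `key = load_key()` would replace the hard-coded string, which keeps the key out of the notebook file that gets uploaded.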


# Assignment Submit Function

You will submit the ten programming assignments electronically.  The following **submit** function can be used to do this.  My server will perform a basic check of each assignment and let you know if it sees any underlying problems.

**It is unlikely that you will need to modify this function.**

In [None]:
import base64
import os
import numpy as np
import pandas as pd
import requests
import PIL
import PIL.Image
import io
from typing import List, Union

# This function submits an assignment.  You can submit an assignment as many times as you like; only the final
# submission counts.  The parameters are as follows:
# data - List of pandas dataframes or images.
# key - Your student key that was emailed to you.
# course - The course that you are in, currently t81-558 or t81-559.
# no - The assignment class number, should be 1 through 10.
# source_file - The full path to your Python or IPYNB file.  This must have "_class1" as part of its name.
#               The number must match your assignment number.  For example "_class2" for class assignment #2.

def submit(
    data: List[Union[pd.DataFrame, PIL.Image.Image]],
    key: str,
    course: str,
    no: int,
    source_file: str = None
) -> None:
    if source_file is None and '__file__' not in globals():
        raise Exception("Must specify a filename when in a Jupyter notebook.")
    if source_file is None:
        source_file = __file__

    suffix = f'_class{no}'
    if suffix not in source_file:
        raise Exception(f"{suffix} must be part of the filename.")

    ext = os.path.splitext(source_file)[-1].lower()
    if ext not in ['.ipynb', '.py']:
        raise Exception(f"Source file is {ext}; must be .py or .ipynb")

    with open(source_file, "rb") as file:
        encoded_python = base64.b64encode(file.read()).decode('ascii')

    payload = []
    for item in data:
        if isinstance(item, PIL.Image.Image):
            buffered = io.BytesIO()
            item.save(buffered, format="PNG")
            payload.append({'PNG': base64.b64encode(buffered.getvalue()).decode('ascii')})
        elif isinstance(item, pd.DataFrame):
            payload.append({'CSV': base64.b64encode(item.to_csv(index=False).encode('ascii')).decode("ascii")})
        else:
            raise ValueError(f"Unsupported data type: {type(item)}")

    response = requests.post(
        "https://api.heatonresearch.com/wu/submit",
        headers={'x-api-key': key},
        json={
            'payload': payload,
            'assignment': no,
            'course': course,
            'ext': ext,
            'py': encoded_python
        }
    )

    if response.status_code == 200:
        print(f"Success: {response.text}")
    else:
        print(f"Failure: {response.text}")
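
To see what the function sends for a dataframe, the encoding step can be reproduced and reversed locally. This is only a sketch of the CSV-to-base64 round trip used in `submit`; the small dataframe is made up, and nothing here contacts the server.

```python
import base64
import io

import pandas as pd

# Hypothetical dataframe standing in for a processed assignment result.
frame = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})

# Encode the same way submit does: CSV text -> ASCII bytes -> base64 -> ASCII string.
encoded = base64.b64encode(frame.to_csv(index=False).encode("ascii")).decode("ascii")

# Reverse the steps to confirm the data survives the trip.
decoded = pd.read_csv(io.StringIO(base64.b64decode(encoded).decode("ascii")))

print(decoded.equals(frame))
```

Because the CSV text is encoded as ASCII, a dataframe containing non-ASCII characters would raise a `UnicodeEncodeError` at this step. The index is not transmitted, since `to_csv` is called with `index=False`.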

# Assignment #2 Sample Code

The following code provides a starting point for this assignment.

In [None]:
import os
import pandas as pd
from scipy.stats import zscore

# You must identify your source file.  (modify for your local setup)
file="/content/drive/My Drive/Colab Notebooks/assignment_yourname_t81_558_class2.ipynb"  # Google CoLab
# file='C:\\Users\\jeffh\\projects\\t81_558_deep_learning\\assignments\\assignment_yourname_class2.ipynb'  # Windows
# file='/Users/jheaton/projects/t81_558_deep_learning/assignments/assignment_yourname_class2.ipynb'  # Mac/Linux

# Begin assignment
df = pd.read_csv("/content/reg-36-data.csv")
print(len(df))

df.drop('id',axis=1,inplace=True)
df.drop('convention',axis=1,inplace=True)

# Add a column named 'ratio' (max divided by number)
df['ratio'] = df['max'] / df['number']

# Replace 'cat2' column with dummy variables
cat2_dummies = pd.get_dummies(df['cat2'], prefix='cat2')

# Replace 'item' column with dummy variables
item_dummies = pd.get_dummies(df['item'], prefix='item')

# Fill missing values in 'length' with the median
df['length'] = df['length'].fillna(df['length'].median())

# Fill missing values in 'height' with the median and apply z-score transformation
df['height'] = df['height'].fillna(df['height'].median())
df['height'] = zscore(df['height'])

# Keep only required columns
required_columns = ['height', 'max', 'number', 'length', 'ratio']
final_df = df[required_columns].join(cat2_dummies).join(item_dummies)

# Ensure columns match the required output format by reordering them
expected_columns = [
    'height', 'max', 'number', 'length', 'ratio',
    'cat2_CA-0', 'cat2_CA-1', 'cat2_CA-10', 'cat2_CA-11', 'cat2_CA-12',
    'cat2_CA-13', 'cat2_CA-14', 'cat2_CA-15', 'cat2_CA-16', 'cat2_CA-17',
    'cat2_CA-18', 'cat2_CA-19', 'cat2_CA-1A', 'cat2_CA-1B', 'cat2_CA-1C',
    'cat2_CA-1D', 'cat2_CA-1E', 'cat2_CA-1F', 'cat2_CA-2', 'cat2_CA-20',
    'cat2_CA-21', 'cat2_CA-22', 'cat2_CA-23', 'cat2_CA-24', 'cat2_CA-25',
    'cat2_CA-26', 'cat2_CA-27', 'cat2_CA-3', 'cat2_CA-4', 'cat2_CA-5',
    'cat2_CA-6', 'cat2_CA-7', 'cat2_CA-8', 'cat2_CA-9', 'cat2_CA-A',
    'cat2_CA-B', 'cat2_CA-C', 'cat2_CA-D', 'cat2_CA-E', 'cat2_CA-F',
    'item_IT-0', 'item_IT-1', 'item_IT-10', 'item_IT-11', 'item_IT-12',
    'item_IT-13', 'item_IT-14', 'item_IT-15', 'item_IT-16', 'item_IT-17',
    'item_IT-18', 'item_IT-19', 'item_IT-1A', 'item_IT-1B', 'item_IT-1C',
    'item_IT-1D', 'item_IT-1E', 'item_IT-2', 'item_IT-3', 'item_IT-4',
    'item_IT-5', 'item_IT-6', 'item_IT-7', 'item_IT-8', 'item_IT-9',
    'item_IT-A', 'item_IT-B', 'item_IT-C', 'item_IT-D', 'item_IT-E', 'item_IT-F'
]

# Ensure all expected dummy columns exist, fill missing ones with 0
for col in expected_columns:
    if col not in final_df.columns:
        final_df[col] = 0

# Reorder columns
final_df = final_df[expected_columns]

# Save the processed dataset for submission
output_file = "2.csv"
final_df.to_csv(output_file, index=False)
print(f"Processed dataset saved as '{output_file}'.")

## Submit assignment
# Submit the processed dataframe (final_df), not the partially modified df
final_df.to_csv('2.csv',index=False)
submit(source_file=file,data=[final_df],key=key,course='t81-558',no=2)
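
Before submitting, it can help to confirm that a dataframe's columns match the list given in the assignment instructions. The helper below, `column_diff`, is a hypothetical name used only for illustration, and the demo data is made up.

```python
from typing import List, Tuple

import pandas as pd


def column_diff(frame: pd.DataFrame, expected: List[str]) -> Tuple[List[str], List[str]]:
    """Return (missing, extra): expected columns not present, and present columns not expected."""
    actual = list(frame.columns)
    missing = [c for c in expected if c not in actual]
    extra = [c for c in actual if c not in expected]
    return missing, extra


# Hypothetical example: 'ratio' was never created and 'id' was never removed.
demo = pd.DataFrame({"height": [0.1], "max": [5], "id": [7]})
missing, extra = column_diff(demo, ["height", "max", "ratio"])
print(missing, extra)
```

Both lists being empty means the column set matches. Column order is not checked by this sketch.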