# Assignment Instructions

For this assignment, you will use the **reg-33-data.csv** dataset.  This file contains a dataset that I generated specifically for this class.  You can find the CSV file on my data site, at this location: [reg-33-data.csv](http://data.heatonresearch.com/data/t81-558/datasets/reg-33-data.csv).

For this assignment, load and modify the dataset.  You will submit this modified dataset to the **submit** function.  See [Assignment #1](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/assignments/assignment_yourname_class1.ipynb) for details on how to submit an assignment or check that one was submitted.

Modify the dataset as follows:

* Add a column named *ratio* that is *max* divided by *number*.  Leave *max* and *number* in the dataframe.
* Replace the *cat2* column with dummy variables, e.g. 'cat2_CA-0', 'cat2_CA-1',
       'cat2_CA-10', 'cat2_CA-11', 'cat2_CA-12', ...
* Replace the *item* column with dummy variables, e.g. 'item_IT-0', 'item_IT-1',
       'item_IT-10', 'item_IT-11', 'item_IT-12', ...
* For field *length* replace missing values with the median of *length*.
* For field *height* replace missing values with the median of *height* and convert to a z-score.
* Remove all other columns.
* Your submitted dataframe will have these columns: 'height', 'max', 'number', 'length', 'ratio', 'cat2_CA-0', 'cat2_CA-1',
       'cat2_CA-10', 'cat2_CA-11', 'cat2_CA-12', 'cat2_CA-13', 'cat2_CA-14',
       'cat2_CA-15', 'cat2_CA-16', 'cat2_CA-17', 'cat2_CA-18', 'cat2_CA-19',
       'cat2_CA-1A', 'cat2_CA-1B', 'cat2_CA-1C', 'cat2_CA-1D', 'cat2_CA-1E',
       'cat2_CA-1F', 'cat2_CA-2', 'cat2_CA-20', 'cat2_CA-21', 'cat2_CA-22',
       'cat2_CA-23', 'cat2_CA-24', 'cat2_CA-25', 'cat2_CA-26', 'cat2_CA-27',
       'cat2_CA-3', 'cat2_CA-4', 'cat2_CA-5', 'cat2_CA-6', 'cat2_CA-7',
       'cat2_CA-8', 'cat2_CA-9', 'cat2_CA-A', 'cat2_CA-B', 'cat2_CA-C',
       'cat2_CA-D', 'cat2_CA-E', 'cat2_CA-F', 'item_IT-0', 'item_IT-1',
       'item_IT-10', 'item_IT-11', 'item_IT-12', 'item_IT-13', 'item_IT-14',
       'item_IT-15', 'item_IT-16', 'item_IT-17', 'item_IT-18', 'item_IT-19',
       'item_IT-1A', 'item_IT-1B', 'item_IT-1C', 'item_IT-1D', 'item_IT-1E',
       'item_IT-2', 'item_IT-3', 'item_IT-4', 'item_IT-5', 'item_IT-6',
       'item_IT-7', 'item_IT-8', 'item_IT-9', 'item_IT-A', 'item_IT-B',
       'item_IT-C', 'item_IT-D', 'item_IT-E', 'item_IT-F'.

In [55]:
import os
import pandas as pd
from scipy.stats import zscore
 
# Begin assignment
df = pd.read_csv("../data/reg-33-data.csv")

# Pass axis by keyword; the positional form is rejected by pandas 2.0 and later
df.drop('id', axis=1, inplace=True)

In [56]:
df.head()

Unnamed: 0,convention,height,max,cat2,number,usage,region,length,code,power,item,weight,country,target
0,CO-1A,4284.51,44907,CA-E,16669,US-7,RE-4,12471.1127,CO-B,27351.36,IT-17,13722,CO-1,44098.106769
1,CO-C,806.88,48831,CA-A,8652,US-20,RE-15,10035.7085,CO-E,42323.89,IT-1E,33779,CO-0,95567.294044
2,CO-19,2859.8,40760,CA-16,23103,US-17,RE-1D,14442.6566,CO-5,30660.91,IT-14,26633,CO-23,48583.507153
3,CO-2B,5823.87,33597,CA-9,17680,US-10,RE-1D,15121.4937,CO-B,59456.24,IT-8,14537,CO-11,130572.202064
4,CO-5,,29848,CA-9,24136,US-21,RE-4,18093.9147,CO-4,46998.44,IT-5,21135,CO-1E,85768.81285


#### Add a column named *ratio* that is *max* divided by *number*.  Leave *max* and *number* in the dataframe.

In [57]:
df['ratio'] = df['max'] / df['number']

#### Replace the *cat2* column with dummy variables, e.g. 'cat2_CA-0', 'cat2_CA-1', 'cat2_CA-10', 'cat2_CA-11', 'cat2_CA-12', ...

In [58]:
df['cat2'] = 'cat2_' + df['cat2'] 

right = pd.get_dummies(df['cat2'])

df = df.join(right)

df.drop('cat2', axis=1, inplace=True)

#### Replace the *item* column with dummy variables, e.g. 'item_IT-0', 'item_IT-1', 'item_IT-10', 'item_IT-11', 'item_IT-12', ...

In [59]:
df['item'] = 'item_' + df['item'] 

right = pd.get_dummies(df['item'])

df = df.join(right)

df.drop('item', axis=1, inplace=True)
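The two encoding steps above can also be done in a single call. This is a minimal sketch on a made-up toy frame, not the assignment data. It assumes the default prefix behavior of `pd.get_dummies`, which prepends the source column name and an underscore, and it passes `dtype=int` so the indicators come out as 0/1 rather than booleans on recent pandas versions.

```python
import pandas as pd

# Toy frame standing in for the assignment data (values are made up).
toy = pd.DataFrame({
    "cat2": ["CA-E", "CA-A", "CA-E"],
    "item": ["IT-17", "IT-1E", "IT-8"],
    "max": [10, 20, 30],
})

# columns= names the fields to encode. Each one is replaced by indicator
# columns named <column>_<value>, and the original column is dropped.
encoded = pd.get_dummies(toy, columns=["cat2", "item"], dtype=int)

print(list(encoded.columns))
# ['max', 'cat2_CA-A', 'cat2_CA-E', 'item_IT-17', 'item_IT-1E', 'item_IT-8']
```

Untouched columns keep their position at the front, and the new indicator columns are appended in the order the encoded fields were listed.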

#### For field *length* replace missing values with the median of *length*.

In [60]:
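# Assign the result back; calling fillna(inplace=True) on a selected column is unreliable in recent pandas
df['length'] = df['length'].fillna(df['length'].median())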
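As a quick illustration of what the median fill does, here is a small sketch on made-up values (the toy frame is hypothetical, not the assignment data). It relies on `Series.median` skipping missing values by default, so the median is computed only from the observed entries.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value (numbers are made up).
toy = pd.DataFrame({"length": [10.0, np.nan, 30.0, 50.0]})

# median() ignores NaN, so this is the median of 10, 30 and 50.
median = toy["length"].median()

# Assign the filled column back to the frame.
toy["length"] = toy["length"].fillna(median)

print(toy["length"].tolist())  # [10.0, 30.0, 30.0, 50.0]
```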

#### For field *height* replace missing values with the median of *height* and convert to a z-score.


It is useful to standardize the values (raw scores) of a normal distribution by converting them into z-scores because:

(a) it allows researchers to calculate the probability of a score occurring within a standard normal distribution; and

(b) it enables us to compare two scores from different samples (which may have different means and standard deviations).
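For reference, the standard definition of the z-score is sketched here. The symbols are not from the assignment text: $x$ is a raw score, $\mu$ is the mean of the column, and $\sigma$ is its standard deviation.

$$
z = \frac{x - \mu}{\sigma}
$$

A value of $z = 1$ means the raw score lies one standard deviation above the mean, and $z = 0$ means it equals the mean.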

In [61]:
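# Assign the result back; calling fillna(inplace=True) on a selected column is unreliable in recent pandas
df['height'] = df['height'].fillna(df['height'].median())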

In [62]:
std = df['height'].std()
mean = df['height'].mean()

In [63]:
df['height'] = (df['height'] - mean) / std
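One detail worth checking in the manual calculation above: pandas `Series.std()` defaults to the sample standard deviation (`ddof=1`), whereas `scipy.stats.zscore` defaults to the population standard deviation (`ddof=0`). The sketch below uses made-up values to show that the two conventions give slightly different z-scores. Which convention the grader expects is not stated in the assignment, so treat this as a comparison only.

```python
import pandas as pd

# Made-up values with mean 5 and population standard deviation 2.
heights = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Sample standard deviation (ddof=1), the pandas default used above.
z_sample = (heights - heights.mean()) / heights.std()

# Population standard deviation (ddof=0), the scipy.stats.zscore default.
z_population = (heights - heights.mean()) / heights.std(ddof=0)

print(z_population.iloc[0])  # -1.5
print(z_sample.iloc[0])      # about -1.403
```

On a large dataset the gap between the two is small, but it can be enough to fail an exact comparison against a reference solution.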

#### Remove all other columns.

In [64]:
df = df.drop(['convention', 'region', 'usage', 'code', 'power', 'country', 'target', 'weight'], axis=1)
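An alternative to listing every column to drop is to select the columns to keep. The following is a minimal sketch on a hypothetical one-row frame. The names `base`, `dummies`, and `kept` are illustrative, and the prefixes `cat2_` and `item_` are assumed to match the dummy columns created earlier.

```python
import pandas as pd

# Hypothetical one-row frame with a few wanted and unwanted columns.
toy = pd.DataFrame({
    "height": [0.1], "max": [5], "number": [2], "length": [3.0],
    "ratio": [2.5], "cat2_CA-0": [1], "item_IT-0": [0],
    "weight": [7], "target": [1.5],
})

# Fixed numeric columns required by the assignment.
base = ["height", "max", "number", "length", "ratio"]

# Dummy columns are recognized by their prefix.
dummies = [c for c in toy.columns if c.startswith(("cat2_", "item_"))]

# Selecting by name keeps only the wanted columns, in this order.
kept = toy[base + dummies]

print(list(kept.columns))
# ['height', 'max', 'number', 'length', 'ratio', 'cat2_CA-0', 'item_IT-0']
```

Selecting rather than dropping means that a column added to the source file later cannot slip into the submission unnoticed.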

In [65]:
df.head()

Unnamed: 0,height,max,number,length,ratio,cat2_CA-0,cat2_CA-1,cat2_CA-10,cat2_CA-11,cat2_CA-12,...,item_IT-6,item_IT-7,item_IT-8,item_IT-9,item_IT-A,item_IT-B,item_IT-C,item_IT-D,item_IT-E,item_IT-F
0,0.453201,44907,16669,12471.1127,2.694043,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,-1.482107,48831,8652,10035.7085,5.643897,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,-0.339653,40760,23103,14442.6566,1.764273,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1.309858,33597,17680,15121.4937,1.900283,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
4,0.006937,29848,24136,18093.9147,1.236659,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
