# Lab Assignment Five: Wide and Deep Network Architectures
In this lab, you will select a prediction task to perform on your dataset, evaluate two different deep learning architectures and tune hyper-parameters for each architecture. If any part of the assignment is not clear, ask the instructor to clarify. 

This report is worth 10% of the final grade. Please upload a report (one per team) with all code used, visualizations, and text in a rendered Jupyter notebook. For any visualizations that cannot be embedded in the notebook, please provide screenshots of the output. The results should be reproducible using your report. Please carefully describe every assumption and every step in your report.

## Dataset Selection

Select a dataset as you did in lab one. That is, the dataset must be tabular data and must have categorical features. In terms of generalization performance, it is helpful to have a large dataset for building a wide and deep network. It is also helpful to have many different categorical features to create the embeddings and cross-product embeddings. It is fine to perform binary classification, multi-class classification, or regression. You are NOT allowed to use the census (i.e., Adult) dataset that was given as an example in class.

We have selected this dataset: https://www.kaggle.com/datasets/syedanwarafridi/vehicle-sales-data



## Grading Rubric

### Preparation (4 points total)
- [1 point] Define and prepare your class variables. Use proper variable representations (int, float, one-hot, etc.). Use pre-processing methods (as needed) for dimensionality reduction, scaling, etc. Remove variables that are not needed/useful for the analysis. Describe the final dataset that is used for classification/regression (include a description of any newly formed variables you created). You have the option of using tf.dataset for processing, but it is not required.

- [1 point] Identify groups of features in your data that should be combined into cross-product features. Provide a compelling justification for why these features should be crossed (or why some features should not be crossed).

- [1 point] Choose and explain what metric(s) you will use to evaluate your algorithm’s performance. You should give a detailed argument for why this (these) metric(s) are appropriate on your data. That is, why is the metric appropriate for the task (e.g., in terms of the business case for the task)? Please note: accuracy is rarely the best evaluation metric to use. Think deeply about an appropriate measure of performance.

- [1 point] Choose the method you will use for dividing your data into training and testing (i.e., are you using Stratified 10-fold cross validation? Shuffle splits? Why?). Explain why your chosen method is appropriate or use more than one method as appropriate. Argue why your cross validation method is a realistic mirroring of how an algorithm would be used in practice. Use the method to split your data that you argue for.
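
As one illustration of the metric discussion above, the sketch below computes mean absolute error (MAE) and mean absolute percentage error (MAPE) with NumPy. It assumes a regression task on a price-like target, which this notebook has not stated explicitly; the function names `mae` and `mape` and the sample prices are hypothetical and only serve to show the arithmetic.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction; assumes no zero targets."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)))

# Hypothetical prices in dollars (not taken from the dataset)
y_true = [10000.0, 20000.0, 40000.0]
y_pred = [11000.0, 18000.0, 40000.0]

print("MAE :", mae(y_true, y_pred))
print("MAPE:", mape(y_true, y_pred))
```

MAE reports the typical error in dollars, while MAPE scales each error by the true value, so a $1,000 miss counts for more on a cheap vehicle than on an expensive one. Which of the two (if either) fits the business case is a judgment the report still has to argue.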


In [14]:
import pandas as pd

df = pd.read_csv('car_prices.csv')
headers = df.columns
print(df.info())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 558837 entries, 0 to 558836
Data columns (total 16 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   year          558837 non-null  int64  
 1   make          548536 non-null  object 
 2   model         548438 non-null  object 
 3   trim          548186 non-null  object 
 4   body          545642 non-null  object 
 5   transmission  493485 non-null  object 
 6   vin           558833 non-null  object 
 7   state         558837 non-null  object 
 8   condition     547017 non-null  float64
 9   odometer      558743 non-null  float64
 10  color         558088 non-null  object 
 11  interior      558088 non-null  object 
 12  seller        558837 non-null  object 
 13  mmr           558799 non-null  float64
 14  sellingprice  558825 non-null  float64
 15  saledate      558825 non-null  object 
dtypes: float64(4), int64(1), object(11)
memory usage: 68.2+ MB
None


In [15]:
df.dropna(inplace=True)
print(df.info())

<class 'pandas.core.frame.DataFrame'>
Index: 472325 entries, 0 to 558836
Data columns (total 16 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   year          472325 non-null  int64  
 1   make          472325 non-null  object 
 2   model         472325 non-null  object 
 3   trim          472325 non-null  object 
 4   body          472325 non-null  object 
 5   transmission  472325 non-null  object 
 6   vin           472325 non-null  object 
 7   state         472325 non-null  object 
 8   condition     472325 non-null  float64
 9   odometer      472325 non-null  float64
 10  color         472325 non-null  object 
 11  interior      472325 non-null  object 
 12  seller        472325 non-null  object 
 13  mmr           472325 non-null  float64
 14  sellingprice  472325 non-null  float64
 15  saledate      472325 non-null  object 
dtypes: float64(4), int64(1), object(11)
memory usage: 61.3+ MB
None
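
In the outputs above, `dropna` reduces the data from 558,837 to 472,325 rows, and `transmission` has the fewest non-null values (493,485). Before dropping rows, it can help to see which columns drive that loss. The sketch below is one possible way to summarize missing values; `missing_report` is a hypothetical helper, and the toy frame uses made-up values with column names borrowed from the dataset.

```python
import numpy as np
import pandas as pd

def missing_report(frame):
    """Return per-column missing counts and fractions, largest count first."""
    counts = frame.isna().sum()
    report = pd.DataFrame({"n_missing": counts, "fraction": counts / len(frame)})
    return report.sort_values("n_missing", ascending=False)

# Toy frame with hypothetical values; the real notebook would pass `df` instead
toy = pd.DataFrame({
    "make": ["Jeep", None, "GMC", "Toyota"],
    "transmission": ["automatic", None, None, "automatic"],
    "odometer": [126044.0, 10077.0, np.nan, 54808.0],
})

report = missing_report(toy)
rows_kept = len(toy.dropna())

print(report)
print(rows_kept, "of", len(toy), "rows survive dropna()")
```

If a single column accounts for most of the dropped rows, imputing it or encoding "missing" as its own category may preserve more data than dropping every incomplete row.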


In [18]:
df = df[df['year'] >= 2005]
df = df[df['year'] < 2015]
print(df.info())


<class 'pandas.core.frame.DataFrame'>
Index: 417707 entries, 2 to 558836
Data columns (total 16 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   year          417707 non-null  int64  
 1   make          417707 non-null  object 
 2   model         417707 non-null  object 
 3   trim          417707 non-null  object 
 4   body          417707 non-null  object 
 5   transmission  417707 non-null  object 
 6   vin           417707 non-null  object 
 7   state         417707 non-null  object 
 8   condition     417707 non-null  float64
 9   odometer      417707 non-null  float64
 10  color         417707 non-null  object 
 11  interior      417707 non-null  object 
 12  seller        417707 non-null  object 
 13  mmr           417707 non-null  float64
 14  sellingprice  417707 non-null  float64
 15  saledate      417707 non-null  object 
dtypes: float64(4), int64(1), object(11)
memory usage: 54.2+ MB
None


In [19]:
df['year'].value_counts()


year
2013    87467
2012    87380
2014    69712
2011    41384
2008    27011
2007    25378
2010    22616
2006    21631
2009    17959
2005    17169
Name: count, dtype: int64

In [21]:
from sklearn.model_selection import train_test_split

df_train_orig, df_test_orig = train_test_split(df, test_size=0.2, random_state=37)
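
The rubric asks for a justified cross-validation scheme, and a single 80/20 split gives only one performance estimate. As a sketch of one alternative, shuffled K-fold splitting can be applied to the training portion to obtain several estimates for later statistical comparison. This assumes a regression target, so stratifying on class labels does not apply; `kfold_indices` is a hypothetical helper, and the fold count and row count are arbitrary.

```python
import numpy as np
from sklearn.model_selection import KFold

def kfold_indices(n_rows, n_splits=5, seed=37):
    """Return a list of (train_idx, val_idx) pairs from shuffled K-fold."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    # KFold only needs the number of rows, so a placeholder array is enough
    placeholder = np.zeros((n_rows, 1))
    return list(kf.split(placeholder))

# Small hypothetical example; the notebook would use len(df_train) instead of 20
folds = kfold_indices(n_rows=20)
for i, (train_idx, val_idx) in enumerate(folds):
    print(f"fold {i}: {len(train_idx)} train rows, {len(val_idx)} validation rows")
```

Each row lands in exactly one validation fold, so every model under comparison can be scored on the same folds and the per-fold scores paired.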

In [22]:
from copy import deepcopy
df_train = deepcopy(df_train_orig)
df_test = deepcopy(df_test_orig)

In [23]:
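import numpy as np

# reset_index() returns a new DataFrame and leaves the original unchanged.
# The results are not assigned here, so both frames keep their original row
# labels (visible in the output below). To renumber rows from 0, assign the
# result, e.g. df_test = df_test.reset_index(drop=True).
df_train.reset_index()
df_test.reset_index()

df_test.head()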

|  | year | make | model | trim | body | transmission | vin | state | condition | odometer | color | interior | seller | mmr | sellingprice | saledate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 387952 | 2008 | Jeep | Liberty | Sport | SUV | automatic | 1j8gn28k18w195884 | oh | 43.0 | 126044.0 | red | gray | tc's used cars llc | 6450.0 | 7800.0 | Tue Mar 03 2015 01:30:00 GMT-0800 (PST) |
| 369170 | 2014 | Lincoln | MKX | Base | SUV | automatic | 2lmdj8jk4ebl06749 | tn | 49.0 | 10077.0 | — | beige | ford motor credit company | 36600.0 | 34000.0 | Thu Mar 05 2015 03:00:00 GMT-0800 (PST) |
| 165759 | 2007 | GMC | Yukon | SLE | SUV | automatic | 1gkfk13047r312578 | md | 43.0 | 90057.0 | blue | beige | lexus of rockville | 16700.0 | 15400.0 | Tue Jan 20 2015 01:30:00 GMT-0800 (PST) |
| 113164 | 2012 | Toyota | Prius c | Two | Hatchback | automatic | jtdkdtb34c1509187 | pa | 29.0 | 54808.0 | black | gray | ken pollock nissan llc | 11900.0 | 9000.0 | Fri Jan 16 2015 01:00:00 GMT-0800 (PST) |
| 75538 | 2011 | GMC | Yukon | SLT | SUV | automatic | 1gks1ce05br150916 | ca | 4.0 | 61831.0 | black | black | rvr | 23500.0 | 24500.0 | Wed Dec 31 2014 12:30:00 GMT-0800 (PST) |
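
The rubric asks for cross-product features built from groups of categorical columns. A minimal sketch of one way to form such a feature is shown below: the columns are joined into a single string and then integer-encoded so the result can feed an embedding layer. Crossing `make` with `body` is only an illustrative assumption, not a choice this notebook has made, and `add_cross` is a hypothetical helper.

```python
import pandas as pd

def add_cross(frame, cols, name):
    """Return a copy of `frame` with `cols` joined into one crossed string column."""
    out = frame.copy()
    crossed = out[cols[0]].astype(str)
    for col in cols[1:]:
        crossed = crossed + "_" + out[col].astype(str)
    out[name] = crossed
    return out

# Toy rows with hypothetical values, using column names from the dataset
toy = pd.DataFrame({
    "make": ["Jeep", "GMC", "GMC", "Toyota"],
    "body": ["SUV", "SUV", "SUV", "Hatchback"],
})

crossed = add_cross(toy, ["make", "body"], "make_body")
# factorize assigns integer codes in order of first appearance
codes, vocab = pd.factorize(crossed["make_body"])

print(crossed)
print(list(codes), list(vocab))
```

In practice the vocabulary would be built from the training split only and then reused for the test split, with a reserved code for combinations that never appeared in training.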


### Modeling (5 points total)
- [2 points] Create at least three combined wide and deep networks to classify your data using Keras (this total of "three" includes the model you will train in the next step of the rubric). Visualize the performance of the network on the training data and validation data in the same plot versus the training iterations.
Note: you can use the "history" return parameter that is part of Keras "fit" function to easily access this data.

- [2 points] Investigate generalization performance by altering the number of layers in the deep branch of the network. Try at least two models (this "two" includes the wide and deep model trained from the previous step). Use the method of cross validation and evaluation metric that you argued for at the beginning of the lab to answer: Which model, with what number of layers, performs best? Use proper statistical methods to compare the performance of different models.

- [1 point] Compare the performance of your best wide and deep network to a standard multi-layer perceptron (MLP). Alternatively, you can compare to a network without the wide branch (i.e., just the deep network). For classification tasks, compare using the receiver operating characteristic and area under the curve. For regression tasks, use Bland-Altman plots and residual variance calculations. Use proper statistical methods to compare the performance of different models.
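
For the regression comparison mentioned in the last rubric item, the quantities behind a Bland-Altman plot can be computed with NumPy alone. The sketch below assumes two sets of predictions for the same observations (for example, the wide and deep network versus the MLP); the function names and sample numbers are hypothetical. The limits of agreement use the conventional bias ± 1.96 standard deviations of the differences.

```python
import numpy as np

def bland_altman_stats(pred_a, pred_b):
    """Return pairwise means, differences, bias, and 95% limits of agreement."""
    a = np.asarray(pred_a, dtype=float)
    b = np.asarray(pred_b, dtype=float)
    mean_pair = (a + b) / 2.0          # x-axis of a Bland-Altman plot
    diff = a - b                       # y-axis of a Bland-Altman plot
    bias = float(diff.mean())
    sd = float(diff.std(ddof=1))
    limits = (bias - 1.96 * sd, bias + 1.96 * sd)
    return mean_pair, diff, bias, limits

def residual_variance(y_true, y_pred):
    """Sample variance of the residuals y_true - y_pred."""
    resid = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(resid.var(ddof=1))

# Hypothetical values, not results from this notebook
y_true = [10.0, 12.0, 15.0, 16.0]
pred_a = [10.0, 12.0, 14.0, 16.0]
pred_b = [9.0, 13.0, 13.0, 17.0]

mean_pair, diff, bias, limits = bland_altman_stats(pred_a, pred_b)
print("bias:", bias, "limits of agreement:", limits)
print("residual variance A:", residual_variance(y_true, pred_a))
print("residual variance B:", residual_variance(y_true, pred_b))
```

Plotting `diff` against `mean_pair`, with horizontal lines at the bias and the two limits, gives the Bland-Altman plot itself. A bias near zero with narrow limits suggests the two models largely agree.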


### Exceptional Work (1 point total)
5000 students: You have free rein to provide additional analyses.
One idea (required for 7000 level students): Capture the embedding weights from the deep network and, if needed, perform dimensionality reduction on the output of these embedding layers. That is, pass the observations into the network, save the embedded weights (called embeddings), and then perform dimensionality reduction in order to visualize results. Visualize and explain any clusters in the data.
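
The sketch below illustrates how embedding weights might be captured from a small wide and deep model in Keras and reduced to two dimensions. It is an assumption-laden example, not the course's reference architecture: the layer names (`cross_embed`, `make_embed`), vocabulary sizes, and layer widths are all hypothetical, the wide branch is represented as an embedding of one crossed feature, and the projection uses a plain SVD-based PCA written in NumPy.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_wide_deep(n_cross, n_make, n_numeric, embed_dim=4):
    """Build a small wide and deep regression model (hypothetical sizes)."""
    cross_in = layers.Input(shape=(1,), dtype="int32", name="cross_in")
    make_in = layers.Input(shape=(1,), dtype="int32", name="make_in")
    num_in = layers.Input(shape=(n_numeric,), name="num_in")

    # Wide branch: crossed-feature embedding passed straight to the output
    wide = layers.Embedding(n_cross, embed_dim, name="cross_embed")(cross_in)
    wide = layers.Flatten()(wide)

    # Deep branch: categorical embedding plus numeric inputs, then dense layers
    make_vec = layers.Embedding(n_make, embed_dim, name="make_embed")(make_in)
    make_vec = layers.Flatten()(make_vec)
    deep = layers.Concatenate()([make_vec, num_in])
    deep = layers.Dense(16, activation="relu")(deep)
    deep = layers.Dense(8, activation="relu")(deep)

    merged = layers.Concatenate()([wide, deep])
    out = layers.Dense(1, name="price")(merged)
    model = keras.Model([cross_in, make_in, num_in], out)
    model.compile(optimizer="adam", loss="mae")
    return model

def project_2d(weights):
    """Project rows of an embedding matrix onto their first two principal axes."""
    centered = weights - weights.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

model = build_wide_deep(n_cross=12, n_make=6, n_numeric=2)
# One row of weights per category code: shape (n_make, embed_dim)
make_weights = model.get_layer("make_embed").get_weights()[0]
coords = project_2d(make_weights)
print(make_weights.shape, coords.shape)
```

The model here is untrained, so the weights are random and show no structure; clusters would only be meaningful after calling `fit`. Each row of `coords` can then be plotted and labeled with the category it encodes.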