# Lab Assignment Five: Wide and Deep Network Architectures

Team Members: Stephen Palmier, Bryn McCann, Jaime Garza

# Part 1 - Preparation 

1.1 Define and Prepare Class Variables

In [3]:
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split

# Load the dataset from the specified path
file_path = r"C:\Users\jaime\OneDrive - Southern Methodist University\Courses\CS7324\Labs\Lab 5\AirBNB\Unit_1_Project_Dataset.csv"
airbnb_data = pd.read_csv(file_path)

# Identify numerical and categorical columns
numerical_cols = [
    'host_since_year', 'latitude', 'longitude', 'accommodates', 'bathrooms',
    'bedrooms', 'beds', 'guests_included', 'minimum_nights', 'number_of_reviews',
    'review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness',
    'review_scores_checkin', 'review_scores_communication',
    'review_scores_location', 'review_scores_value',
]
categorical_cols = [
    'neighbourhood_cleansed', 'city', 'state', 'country',
    'property_type', 'room_type', 'bed_type', 'host_response_time',
]

# Define the preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ])

# Prepare features and target
features = airbnb_data.drop(['price', 'host_name', 'host_since_anniversary', 'zipcode'], axis=1)
target = airbnb_data['price']

# Applying preprocessing
X = preprocessor.fit_transform(features)

# Stratified split based on neighbourhood to ensure each train-test set is representative
# Note: if 'neighbourhood_cleansed' contains NaN values, you will need to handle these before stratifying
X_train, X_test, y_train, y_test = train_test_split(X, target, test_size=0.2, random_state=42, stratify=features['neighbourhood_cleansed'])

# Print the first five rows of the training data to verify (sparse matrix output)
print(X_train[:5])


  (0, 0)	-0.7916372063774119
  (0, 1)	0.3837438115303823
  (0, 2)	-0.03502058299434622
  (0, 3)	-0.6342676503297628
  (0, 4)	-0.28607889796756397
  (0, 5)	-0.4681847495753422
  (0, 6)	-0.5947328560432157
  (0, 7)	-0.5605763984427392
  (0, 8)	-0.2681582960265635
  (0, 9)	4.638520945486807
  (0, 10)	-0.1781496960135872
  (0, 11)	-0.5479739306284881
  (0, 12)	-0.2991564865199796
  (0, 13)	0.49722548702230135
  (0, 14)	0.46730467049690233
  (0, 15)	0.8324057557723143
  (0, 16)	-0.04576051100999835
  (0, 22)	1.0
  (0, 40)	1.0
  (0, 91)	1.0
  (0, 97)	1.0
  (0, 98)	1.0
  (0, 113)	1.0
  (0, 120)	1.0
  (0, 123)	1.0
  :	:
  (4, 0)	-0.7916372063774119
  (4, 1)	1.3448431221578148
  (4, 2)	0.1322445816520758
  (4, 3)	0.503796375562582
  (4, 4)	-0.28607889796756397
  (4, 5)	1.7887428440927218
  (4, 6)	1.218684456508622
  (4, 7)	0.3127320599904302
  (4, 8)	0.2586748950384158
  (4, 9)	-0.4644870326417867
  (4, 10)	0.7508896112485252
  (4, 11)	0.6781102312892552
  (4, 12)	0.73413980420103
  (4, 13)	0.4

Part 1 explained: 
* 1.1: We standardized the numerical features so that no single feature dominates because of its scale, and we one-hot encoded the categorical features so the model can consume them as numeric input. The target variable is `price`, which makes this a regression task. Columns that are not useful as predictors, such as `host_name` and `zipcode`, are dropped so the model can focus on more informative features.

* 1.2: Justification: This code loads the Airbnb dataset from a specified file location with `pd.read_csv()` and then identifies the numerical and categorical columns that need preprocessing. A `ColumnTransformer` scales the numerical columns with `StandardScaler` and one-hot encodes the categorical columns so that they are compatible with machine learning models. To build the feature set, the columns `price`, `host_name`, `host_since_anniversary`, and `zipcode` are dropped; `price` is the target and is extracted separately. After preprocessing, the data is split into training and testing sets, stratified on the `neighbourhood_cleansed` column. Finally, the first five training rows are printed to confirm that preprocessing and splitting worked as intended.

* 1.3: Because predicting price is a regression task, we intend to evaluate performance on the Airbnb dataset with mean absolute error (MAE) or mean squared error (MSE) rather than accuracy. MAE measures the average size of the errors in the same units as price, which makes it easy to see how far predictions fall from actual prices. MSE is less interpretable because it is in squared units, but it penalizes large errors more heavily, which makes it sensitive to outliers and badly mispriced listings. Since the business case centers on predicting listing prices accurately enough to inform decisions by hosts and guests, both metrics fit the goal of reducing prediction error.

* 1.4: The code uses `train_test_split` to divide the data into training (80%) and testing (20%) sets. Stratifying on the `neighbourhood_cleansed` column keeps the proportion of each neighbourhood roughly the same in both sets. This matters in practice because the model would be applied to listings drawn from the same mix of neighbourhoods, so a test set that preserves that mix gives a more realistic estimate of how the model would perform once deployed.
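
The split strategy in 1.4 and the metrics in 1.3 can be sketched together on a small example. The block below is a minimal illustration under stated assumptions, not the lab's actual pipeline: the `toy` DataFrame is synthetic, the three neighbourhood labels are made up, and `Ridge` is only a placeholder regressor. It also differs from the cell above in one respect: it splits the raw frame first and fits the preprocessor on the training split only, which is one common way to keep test-set statistics out of the scaler.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the Airbnb data (hypothetical values, not the real dataset)
rng = np.random.default_rng(42)
n = 500
toy = pd.DataFrame({
    "accommodates": rng.integers(1, 9, size=n),
    "bedrooms": rng.integers(1, 5, size=n),
    "neighbourhood_cleansed": rng.choice(["A", "B", "C"], size=n, p=[0.6, 0.3, 0.1]),
})
toy["price"] = (
    40 * toy["accommodates"] + 25 * toy["bedrooms"] + rng.normal(0, 20, size=n)
)

features = toy.drop(columns=["price"])
target = toy["price"]

# Split the raw frame first, stratifying on neighbourhood
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42,
    stratify=features["neighbourhood_cleansed"],
)

# Compare neighbourhood proportions between the two splits
train_share = X_train["neighbourhood_cleansed"].value_counts(normalize=True)
test_share = X_test["neighbourhood_cleansed"].value_counts(normalize=True)
max_share_gap = (train_share - test_share).abs().max()

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["accommodates", "bedrooms"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["neighbourhood_cleansed"]),
])

# Placeholder regressor; the pipeline fits the preprocessor on training rows only
model = Pipeline([("prep", preprocessor), ("reg", Ridge(alpha=1.0))])
model.fit(X_train, y_train)
pred = model.predict(X_test)

# MAE is in price units; MSE is in squared price units and weights large misses more
mae = mean_absolute_error(y_test, pred)
mse = mean_squared_error(y_test, pred)
print(f"largest neighbourhood share gap: {max_share_gap:.3f}")
print(f"MAE: {mae:.2f}  MSE: {mse:.2f}  RMSE: {np.sqrt(mse):.2f}")
```

Taking the square root of MSE (RMSE) puts it back in price units, which makes it easier to read next to MAE. A large gap between the two suggests that a few large errors dominate.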