In [19]:
import sys
!{sys.executable} -m pip install category_encoders



In [20]:
# Airbnb preprocessing and training data development

%reset_selective -f regex
import os
import datetime
import pprint

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import zscore
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler

%matplotlib inline

In [21]:
file_path = "/Users/juanreyes/Downloads/cleaned_airbnb_cdmx_2024_prediction_data.csv"
mexico_airbnb_df = pd.read_csv("/Users/juanreyes/Downloads/cleaned_Airbnb_prediction_data.csv")

In [23]:
# Identify categorical and numeric features
categorical_low_card = ['disponibility_cat', 'min_nights_cat', 'average_measure', 'supermarket']
categorical_high_card = ['name', 'host_name', 'neighbourhood', 'room_type', 'price_category']
numeric_features = ['calculated_host_listings_count', 'hospital', 'university', 'subway', 'restaurant', 'park', 'security_index', 'demand_index', 'neighbourhood_area', 'density_hospital', 'density_supermarket', 'density_park', 'density_university', 'density_subway', 'density_restaurant', 'dis_to_tourist_point_per_neighborhood'
]
target_variable = 'income_per_month'

# Preview data to confirm structure
mexico_airbnb_df[categorical_low_card + categorical_high_card + numeric_features + [target_variable]].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 26 columns):
 #   Column                                 Non-Null Count  Dtype
---  ------                                 --------------  -----
 0   disponibility_cat                      1000 non-null   int64
 1   min_nights_cat                         1000 non-null   int64
 2   average_measure                        1000 non-null   int64
 3   supermarket                            1000 non-null   int64
 4   name                                   1000 non-null   int64
 5   host_name                              1000 non-null   int64
 6   neighbourhood                          1000 non-null   int64
 7   room_type                              1000 non-null   int64
 8   price_category                         1000 non-null   int64
 9   calculated_host_listings_count         1000 non-null   int64
 10  hospital                               1000 non-null   int64
 11  university                     

In [26]:
# Step 1: Create dummy variables for low-cardinality categorical features
categorical_low_card = ['disponibility_cat', 'min_nights_cat', 'average_measure', 'supermarket']
mexico_airbnb_df_dummies = pd.get_dummies(mexico_airbnb_df, columns=categorical_low_card, drop_first=True)

In the code above, we created dummy variables for the low-cardinality categorical features (disponibility_cat, min_nights_cat, average_measure, supermarket). These were one-hot encoded with pd.get_dummies() to convert them into binary features; drop_first=True drops the first level of each feature so that it serves as the baseline.
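As a minimal sketch of what this encoding does, the toy frame below (made-up values, with a column name borrowed from the notebook) shows how pd.get_dummies() expands one categorical column and how drop_first=True removes the first level.

```python
import pandas as pd

# Toy frame: the column name mirrors the notebook, the values are made up.
toy = pd.DataFrame({"min_nights_cat": [0, 1, 2, 1], "price": [10, 20, 30, 40]})

# One indicator column per level; drop_first=True drops the first level (0),
# which becomes the implicit baseline.
encoded = pd.get_dummies(toy, columns=["min_nights_cat"], drop_first=True)

print(encoded.columns.tolist())
# ['price', 'min_nights_cat_1', 'min_nights_cat_2']
```

A row where both indicator columns are zero (or False) belongs to the dropped baseline level.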

In [27]:
# Step 2: Standardize numeric features
scaler = StandardScaler()
mexico_airbnb_df_dummies[numeric_features] = scaler.fit_transform(mexico_airbnb_df_dummies[numeric_features])

In step 2, we standardized the numeric features with StandardScaler. The next step target encodes the high-cardinality categorical features (name, host_name, neighbourhood, room_type and price_category), which can introduce sparsity and noise when one-hot encoded.
For these we apply target encoding, which replaces each category with the mean of the target variable (income_per_month) within that category. The goal is to reduce dimensionality while maintaining predictive power.

In [28]:
# Step 3: Target Encode High-Cardinality Features
# Target encoding helps with features like name, host_name and neighbourhood.
# Using category_encoders.TargetEncoder, we replace each category with the mean of the target (income_per_month) for that category.

import category_encoders as ce

# Instantiate target encoder
target_encoder = ce.TargetEncoder(cols=['name', 'host_name', 'neighbourhood', 'room_type', 'price_category'])

# Fit and transform
Airbnb_df_encoded = target_encoder.fit_transform(mexico_airbnb_df_dummies, mexico_airbnb_df['income_per_month'])


Step 3:
In the above step, we applied target encoding to the high-cardinality features.
Some categorical variables in this dataset, such as name, host_name, neighbourhood, room_type and price_category, have a large number of unique categories (high cardinality). One-hot encoding them would create a sparse dataset and might introduce noise or overfitting in our trained models.
To address this, we used target encoding via category_encoders.TargetEncoder, which replaces each category with the mean of the target variable (income_per_month) for that category. This approach reduces dimensionality and preserves predictive information.
This step helps improve model performance when dealing with high-cardinality categorical features.
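To make the idea concrete, here is a pandas-only sketch of plain mean target encoding on made-up data. It is an illustration under assumptions, not what category_encoders does internally: ce.TargetEncoder also smooths each category mean toward the global mean, so its values will differ. The sketch learns the means from training rows only and falls back to the global mean for unseen categories.

```python
import pandas as pd

# Toy data: column names mirror the notebook, the values are made up.
train = pd.DataFrame({
    "neighbourhood": ["A", "A", "B", "B", "B"],
    "income_per_month": [100.0, 300.0, 50.0, 70.0, 90.0],
})
test = pd.DataFrame({"neighbourhood": ["A", "B", "C"]})

# Per-category target means, learned from the training rows only.
category_means = train.groupby("neighbourhood")["income_per_month"].mean()
global_mean = train["income_per_month"].mean()

# Replace each category with its mean; the unseen category "C"
# falls back to the global training mean.
train_encoded = train["neighbourhood"].map(category_means)
test_encoded = test["neighbourhood"].map(category_means).fillna(global_mean)

print(test_encoded.tolist())
# [200.0, 70.0, 122.0]
```

Learning the means from the training rows alone keeps the target values of the test rows out of the encoded features.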

In [29]:
# Step 4:  Standard Scale the Numeric Features
from sklearn.preprocessing import StandardScaler

numeric_features = ['calculated_host_listings_count', 'hospital', 'university', 'subway', 'restaurant', 'park', 'security_index', 'demand_index', 'neighbourhood_area', 'density_hospital', 'density_supermarket', 'density_park', 'density_university', 'density_subway', 'density_restaurant', 'dis_to_tourist_point_per_neighborhood'
]
scaler = StandardScaler()
Airbnb_df_encoded[numeric_features] = scaler.fit_transform(Airbnb_df_encoded[numeric_features])

In step 4, we applied standard scaling so that all numeric features are on a comparable scale. This helps scale-sensitive models, such as regularized linear models, converge and weigh features evenly; tree-based models such as gradient boosting are largely insensitive to feature scale.
We used StandardScaler from sklearn.preprocessing to transform each numeric feature ('calculated_host_listings_count', 'hospital', 'university', 'subway', 'restaurant', 'park', 'security_index', 'demand_index', 'neighbourhood_area', 'density_hospital', 'density_supermarket', 'density_park', 'density_university', 'density_subway', 'density_restaurant', 'dis_to_tourist_point_per_neighborhood') so that it has a mean of 0 and a standard deviation of 1. This prevents features with larger magnitudes from dominating the model training process.
Note that only the numeric features are standardized; categorical features (including dummy or encoded ones) are not scaled.
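The transformation itself is z = (x - mean) / std. The sketch below reproduces it by hand on made-up values, using the population standard deviation (ddof=0), which is the convention StandardScaler follows. Column names are borrowed from the notebook for illustration only.

```python
import numpy as np
import pandas as pd

# Two toy features on very different scales (values are made up).
toy = pd.DataFrame({
    "park": [1.0, 2.0, 3.0, 4.0],
    "neighbourhood_area": [100.0, 200.0, 300.0, 400.0],
})

# z = (x - mean) / std, computed column by column with the population std.
standardized = (toy - toy.mean()) / toy.std(ddof=0)

# After scaling, each column has mean 0 and standard deviation 1.
print(np.allclose(standardized.mean(), 0.0))
print(np.allclose(standardized.std(ddof=0), 1.0))
```

Because the two toy columns differ only by a factor of 100, they become identical after standardization, which shows how the original magnitude stops mattering.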

In [30]:
# Step 5: Split into Training and Testing Sets
from sklearn.model_selection import train_test_split

# Define X and y
X = Airbnb_df_encoded.drop(columns=['income_per_month'])
y = Airbnb_df_encoded['income_per_month']

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Optional: Check shape
print("Train set shape:", X_train.shape)
print("Test set shape:", X_test.shape)

Train set shape: (800, 66)
Test set shape: (200, 66)


In step 5, we split our data into training and testing sets so that model performance can be evaluated fairly on data the model has not seen. We split the dataset into the following subsets.
Training Set: Used to train the machine learning model.
Testing Set: Used to evaluate the model’s performance on unseen data.
We used the train_test_split() function from sklearn.model_selection with the following configuration:
test_size=0.2: 20% of the data is reserved for testing.
random_state=42: ensures the split is reproducible.
No stratify argument is passed: the target (income_per_month) is treated as continuous here, so the split is a plain random shuffle rather than a class-balanced one.
With 1,000 rows, this yields 800 training rows and 200 test rows. Note that the target encoder and scaler were fit on the full dataset before this split, so the test rows are not fully independent of the preprocessing.
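A common way to keep preprocessing statistics from seeing the test rows is to split first and fit the scaler on the training rows only; the notebook itself fits its transformers before the split. The sketch below shows that ordering on synthetic data. The data and the single feature are assumptions for illustration, not the project's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (values are random, not from the project).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "park": rng.normal(5, 2, size=100),
    "income_per_month": rng.normal(1000, 200, size=100),
})

X = toy[["park"]]
y = toy["income_per_month"]

# Split first, so the test rows never influence the fitted statistics.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # mean and std learned from training rows
X_test_scaled = scaler.transform(X_test)        # reuse the training mean and std

print(X_train_scaled.shape, X_test_scaled.shape)
# (80, 1) (20, 1)
```

The same fit-on-train, transform-on-test pattern applies to a target encoder, since its category means are derived from the target.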

In [37]:
# Save the datasets in the processed data folder
processed_dir = "/Users/juanreyes/Desktop/DataScienceGuidedCapstone/Springboard/CapstoneTwo_AirbnbPrediction/data/Processed"
X_train.to_csv(os.path.join(processed_dir, "X_train.csv"), index=False)
X_test.to_csv(os.path.join(processed_dir, "X_test.csv"), index=False)
y_train.to_csv(os.path.join(processed_dir, "y_train.csv"), index=False)
y_test.to_csv(os.path.join(processed_dir, "y_test.csv"), index=False)

In this last step, we save our processed datasets as CSV files for reuse in the modeling phase. The data is saved as four files: X_train, X_test, y_train and y_test.
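When these files are read back in the modeling phase, note that a target saved from a Series comes back from pd.read_csv() as a one-column DataFrame. The sketch below shows one way to round-trip it, using a temporary directory and made-up values rather than the project's paths.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up target values standing in for y_train.
y_toy = pd.Series([100.0, 250.0, 80.0], name="income_per_month")

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "y_train.csv"
    y_toy.to_csv(path, index=False)
    # read_csv returns a one-column DataFrame; squeeze turns it back into a Series.
    y_loaded = pd.read_csv(path).squeeze("columns")

print(type(y_loaded).__name__, y_loaded.name)
# Series income_per_month
```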

In the following step we will use our preprocessed datasets to train and evaluate machine learning models.

Feature Type Justification

In this notebook we identified categorical and continuous features through data types and domain understanding.
Categorical Features: 'disponibility_cat', 'min_nights_cat', 'average_measure', 'supermarket' have a small number of unique values and are encoded using one-hot encoding.
High Cardinality Categorical: 'name', 'host_name', 'neighbourhood', 'room_type', 'price_category' are treated as having a large number of unique categories, so we applied target encoding to avoid high dimensionality.
Continuous Features: the columns in numeric_features, such as 'calculated_host_listings_count', 'hospital' and 'university', are treated as numeric values based on their value distribution and usage in previous prediction models. We used StandardScaler on these numeric features to standardize their magnitude, which is important for many machine learning algorithms.
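One way to back this grouping with a quick check is to count the unique values per column. The sketch below does so on a made-up frame; the max_levels cut-off is a hypothetical threshold chosen for illustration, not a value used in this notebook.

```python
import pandas as pd

# Toy frame: column names mirror the notebook, the values are made up.
toy = pd.DataFrame({
    "room_type": [0, 1, 0, 1, 0, 1],
    "host_name": [10, 11, 12, 13, 14, 15],
})

max_levels = 3  # hypothetical cut-off between low and high cardinality

# Number of distinct values in each column.
cardinality = toy.nunique()
low_card = cardinality[cardinality <= max_levels].index.tolist()
high_card = cardinality[cardinality > max_levels].index.tolist()

print(low_card, high_card)
# ['room_type'] ['host_name']
```

Running the same count on the real columns would show whether each feature sits on the intended side of whatever threshold is chosen.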


Conclusion:
This notebook covers the pre-processing and training data development phase of our Airbnb prediction project.
Dummy Encoding: Applied one-hot encoding to low-cardinality categorical features (disponibility_cat, min_nights_cat, average_measure, supermarket) to prepare them for modeling.
Target Encoding: Handled high-cardinality categorical features (name, host_name, neighbourhood, room_type, price_category) using target encoding, which reduces dimensionality while preserving meaningful patterns with respect to the target variable income_per_month.
Standardization: Scaled the continuous numeric features (such as 'calculated_host_listings_count', 'hospital', 'university') using StandardScaler so that they are on a comparable scale during model learning.
Train-Test Split: Split the dataset into training and testing subsets using an 80/20 ratio with a fixed random seed (random_state=42) for reproducible model evaluation.
After completing our preprocessing steps, we now have clean and standardized datasets ready for building and evaluating predictive machine learning models.
