# Phase 3: Data Transformation for Machine Learning and Analytics
This notebook focuses on preparing the Airbnb dataset for advanced analytics and machine learning by performing **complex data transformations**, **feature engineering**, and **data normalization/encoding**.

### Goals:
1. **Complex Data Transformation Workflows**: Aggregate, filter, and enrich data to prepare for advanced analytics.
2. **Feature Engineering**: Extract meaningful features (temporal and textual) to improve model performance.
3. **Advanced Data Normalization and Encoding**: Prepare data for machine learning by scaling and encoding features.


In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, LabelEncoder, OneHotEncoder
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from datetime import datetime
import warnings
from tqdm import tqdm
warnings.filterwarnings('ignore')


In [2]:
file_path = '/Users/asr/Desktop/College/CapstoneProject/CapStoneProject_Group3/dataset/final_data.csv'
data = pd.read_csv(file_path)
print("Data Loaded Successfully!")
data.info()

Data Loaded Successfully!
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23918082 entries, 0 to 23918081
Data columns (total 37 columns):
 #   Column                       Dtype  
---  ------                       -----  
 0   listing_id                   int64  
 1   scrape_id                    int64  
 2   name                         object 
 3   host_id                      int64  
 4   host_name                    object 
 5   host_since                   object 
 6   host_location                object 
 7   latitude                     float64
 8   longitude                    float64
 9   neighbourhood_cleansed       object 
 10  city                         object 
 11  state                        object 
 12  country                      object 
 13  property_type                object 
 14  room_type                    object 
 15  accommodates                 int64  
 16  bathrooms                    float64
 17  bedrooms                     float64
 18  beds          

# Problem Statement 6: Complex Data Transformation Workflows

### Objective:
Transform the data to meet business requirements and enable advanced analytics. This involves:
1. **Aggregation**: Summarize data to identify trends and patterns.
2. **Filtering**: Clean the dataset by removing irrelevant or redundant rows and columns.
3. **Enrichment**: Create new derived features to enhance the dataset's value for analysis.

### What We Are Doing and Why:
1. **Aggregation**:
   - Aggregating `price_y`, `number_of_reviews`, `reviews_per_month`, and `availability_365` by `neighbourhood_cleansed` yields the average price, total review count, average monthly review rate, and average availability for each neighborhood.
   - This helps identify high-performing neighborhoods for targeted analysis.

2. **Filtering**:
   - Filtering rows where `price_y` is greater than 0 and `availability_365` is greater than 0 ensures we focus on valid and active listings.
   - Adding a `revenue` column (`price_y` × `availability_365`) quantifies each listing's potential earnings.
   - Classifying revenue into "Low," "Medium," and "High" categories enables segmentation analysis.

3. **Enrichment**:
   - **`price_per_bedroom`**: Normalizes price by the number of bedrooms for fairer comparisons.
   - **`review_count_score`**: Combines `number_of_reviews` and `review_scores_rating` to rank listings based on popularity and quality.
   - **`is_recent_review`**: Identifies listings with reviews in the past year for recency-based analysis.

These transformations make the dataset more actionable for analysis and machine learning.
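
The revenue bucketing and the per-bedroom normalization described above can be sketched on a small toy frame. The values below are illustrative only and are not taken from the Airbnb data; the point is to show how `pd.cut` treats bin edges (bins are right-closed by default) and one way to guard against division by zero.

```python
import numpy as np
import pandas as pd

# Toy listings; values are illustrative only, not taken from the Airbnb data.
toy = pd.DataFrame({
    "price_y": [50.0, 100.0, 200.0],
    "availability_365": [100, 200, 365],
    "bedrooms": [1.0, 0.0, 2.0],
})

# Potential revenue: nightly price times the number of available days.
toy["revenue"] = toy["price_y"] * toy["availability_365"]

# pd.cut uses right-closed bins by default:
# (0, 5000] -> Low, (5000, 20000] -> Medium, (20000, inf) -> High.
toy["revenue_category"] = pd.cut(
    toy["revenue"],
    bins=[0, 5000, 20000, np.inf],
    labels=["Low", "Medium", "High"],
)

# Guard against division by zero: a bedroom count of 0 becomes NaN before
# dividing, and the resulting NaN is then filled with 0.
bedrooms = toy["bedrooms"].where(toy["bedrooms"] > 0)
toy["price_per_bedroom"] = (toy["price_y"] / bedrooms).fillna(0)

print(toy)
```

A revenue of exactly 5000 lands in "Low" and exactly 20000 in "Medium", which matters when thresholds are chosen as round numbers.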


In [3]:
# Aggregation
aggregated_data = data.groupby('neighbourhood_cleansed').agg({
    'price_y': 'mean',
    'number_of_reviews': 'sum',
    'reviews_per_month': 'mean',
    'availability_365': 'mean'
}).reset_index()

aggregated_data.rename(columns={
    'price_y': 'avg_price',
    'number_of_reviews': 'total_reviews',
    'reviews_per_month': 'avg_reviews_per_month',
    'availability_365': 'avg_availability_365'
}, inplace=True)
print("Aggregated Data:")
display(aggregated_data.head())

Aggregated Data:


| | neighbourhood_cleansed | avg_price | total_reviews | avg_reviews_per_month | avg_availability_365 |
|---|---|---|---|---|---|
| 0 | Adams | 122.849728 | 66551910 | 4.121451 | 282.171341 |
| 1 | Alki | 145.67658 | 9330495 | 2.698397 | 292.83156 |
| 2 | Arbor Heights | 168.486319 | 364270 | 0.942455 | 278.083333 |
| 3 | Atlantic | 118.705948 | 33995370 | 4.42454 | 243.608844 |
| 4 | Belltown | 154.828553 | 121058820 | 3.824602 | 239.967988 |
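
As a possible alternative to aggregating and then renaming, pandas' named aggregation assigns the output column names in one step. The sketch below uses a toy frame with made-up values. It also illustrates a caveat: if a listing appears on several rows, as the repeated `listing_id` values in the previews suggest, a row-wise sum of `number_of_reviews` counts that listing's reviews once per row. Whether de-duplicating by `listing_id` is appropriate for this dataset is an assumption to verify.

```python
import pandas as pd

# Toy data: listing 1 appears on two rows (values are illustrative only).
toy = pd.DataFrame({
    "listing_id": [1, 1, 2, 3],
    "neighbourhood_cleansed": ["Adams", "Adams", "Adams", "Alki"],
    "price_y": [100.0, 120.0, 140.0, 150.0],
    "number_of_reviews": [10, 10, 30, 5],
})

# Named aggregation: output column = (input column, aggregation function).
summary = (
    toy.groupby("neighbourhood_cleansed")
    .agg(
        avg_price=("price_y", "mean"),
        total_reviews=("number_of_reviews", "sum"),
    )
    .reset_index()
)
print(summary)

# Summing once per unique listing instead of once per row.
per_listing_reviews = (
    toy.drop_duplicates("listing_id")
    .groupby("neighbourhood_cleansed")["number_of_reviews"]
    .sum()
)
print(per_listing_reviews)
```

In this toy example the row-wise total for Adams is 50, while the per-listing total is 40.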


In [4]:
# Filtering
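# Take an explicit copy so that new columns can be added to the filtered frame
# without triggering pandas' SettingWithCopyWarning
filtered_data = data[(data['price_y'] > 0) & (data['availability_365'] > 0)].copy()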
print(f"Filtered Data Shape: {filtered_data.shape}")

Filtered Data Shape: (23638940, 37)


In [5]:
# Add a revenue column (price * availability_365)
filtered_data["revenue"] = filtered_data["price_y"] * filtered_data["availability_365"]

# Add a column to classify properties as high, medium, or low revenue
filtered_data["revenue_category"] = pd.cut(
    filtered_data["revenue"],
    bins=[0, 5000, 20000, np.inf],
    labels=["Low", "Medium", "High"]
)

print(filtered_data[["revenue", "revenue_category"]].head())


       revenue revenue_category
0  29410.00000             High
1  29410.00000             High
2  48597.59954             High
3  48597.59954             High
4  48597.59954             High


In [6]:
# Enrichment
data['price_per_bedroom'] = np.where(data['bedrooms'] > 0, data['price_y'] / data['bedrooms'], 0)
data['review_count_score'] = data['number_of_reviews'] * data['review_scores_rating']
last_review_dt = pd.to_datetime(data['last_review'], errors='coerce')
# Vectorized comparison; missing or unparseable dates (NaT) compare as False
data['is_recent_review'] = last_review_dt >= (pd.Timestamp.now() - pd.DateOffset(years=1))

print("Data after Enrichment:")
display(data[['price_per_bedroom', 'review_count_score', 'is_recent_review']].head())

Data after Enrichment:


| | price_per_bedroom | review_count_score | is_recent_review |
|---|---|---|---|
| 0 | 85.0 | 19665.0 | False |
| 1 | 85.0 | 19665.0 | False |
| 2 | 140.45549 | 19665.0 | False |
| 3 | 140.45549 | 19665.0 | False |
| 4 | 140.45549 | 19665.0 | False |


# Problem Statement 7: Feature Engineering for Machine Learning

### Objective:
Extract meaningful features that enhance machine learning model performance. This involves:
1. **Temporal Features**: Extract time-based features from review and hosting data to capture trends and patterns.
2. **Textual Features**: Use Natural Language Processing (NLP) techniques to extract insights from text data.

### What We Are Doing and Why:
1. **Temporal Features**:
   - Extracting features such as `last_review_year`, `last_review_month`, and `last_review_dayofweek` from `last_review` enables time-based analysis.
   - Temporal features like `days_since_last_review` and `host_duration_days` provide critical context for modeling activity over time.

2. **Textual Features (TF-IDF)**:
   - Using `TfidfVectorizer` to convert the `name` column into numerical features captures textual patterns in the listing names.
   - By precomputing a global vocabulary, we ensure consistency across batches, preventing column mismatches.

These features enhance the dataset’s richness, making it suitable for both predictive and descriptive analytics.


In [7]:
# Temporal Features
data['last_review_date'] = pd.to_datetime(data['last_review'], errors='coerce')
data['last_review_year'] = data['last_review_date'].dt.year
data['last_review_month'] = data['last_review_date'].dt.month
data['last_review_dayofweek'] = data['last_review_date'].dt.dayofweek

print("Temporal Features Extracted:")
display(data[['last_review_year', 'last_review_month', 'last_review_dayofweek']].head())

Temporal Features Extracted:


| | last_review_year | last_review_month | last_review_dayofweek |
|---|---|---|---|
| 0 | 2016 | 1 | 5 |
| 1 | 2016 | 1 | 5 |
| 2 | 2016 | 1 | 5 |
| 3 | 2016 | 1 | 5 |
| 4 | 2016 | 1 | 5 |


In [8]:
# Fill missing values in 'name'
data['name'] = data['name'].fillna('unknown')

# Fit the vectorizer on the entire 'name' column
vectorizer = TfidfVectorizer(max_features=50, stop_words='english')  # Adjust max_features as needed
vectorizer.fit(data['name'].astype(str))  # Pre-fit on all data

# Extract the global vocabulary
global_vocab = vectorizer.get_feature_names_out()
print(f"Global vocabulary created with {len(global_vocab)} features.")

Global vocabulary created with 50 features.


In [26]:
# Define batch size
batch_size = 1000
output_file = '/Users/asr/Desktop/College/CapstoneProject/CapStoneProject_Group3/dataset/final_data_with_tfidf.csv'

# Initialize the output file
open(output_file, 'w').close()  # Create (or truncate) an empty file without writing a blank line

# Process data in batches
for start in tqdm(range(0, len(data), batch_size), desc="Processing and merging batches", unit="batch"):
    # Select the current batch
    batch = data.iloc[start:start + batch_size].copy()
    
    # Fill missing values in 'name'
    batch['name'] = batch['name'].fillna('unknown')

    # Use the pre-fitted vectorizer with the global vocabulary
    tfidf_matrix = vectorizer.transform(batch['name'].astype(str))  # Use .transform, not .fit_transform

    # Convert TF-IDF matrix to DataFrame
    tfidf_batch_df = pd.DataFrame(tfidf_matrix.toarray(), columns=global_vocab)

    # Merge TF-IDF features with the current batch
    merged_batch = pd.concat([batch.reset_index(drop=True), tfidf_batch_df.reset_index(drop=True)], axis=1)

    # Append the merged batch to the output file
    merged_batch.to_csv(output_file, mode='a', header=not bool(start), index=False)

Processing and merging batches: 100%|██████████| 23919/23919 [12:25<00:00, 32.07batch/s]   


In [10]:
# Load a sample from the saved file
result = pd.read_csv(output_file, nrows=5)  # Pass the variable, not the string "output_file"
print(result.head())

result = pd.read_csv(output_file)
print(f"Total rows: {len(result)}")


   listing_id       scrape_id                          name  host_id  \
0      241032  20160104002432  Stylish Queen Anne Apartment   956883   
1      241032  20160104002432  Stylish Queen Anne Apartment   956883   
2      241032  20160104002432  Stylish Queen Anne Apartment   956883   
3      241032  20160104002432  Stylish Queen Anne Apartment   956883   
4      241032  20160104002432  Stylish Queen Anne Apartment   956883   

  host_name  host_since                       host_location   latitude  \
0     Maija  2011-08-11  Seattle, Washington, United States  47.636289   
1     Maija  2011-08-11  Seattle, Washington, United States  47.636289   
2     Maija  2011-08-11  Seattle, Washington, United States  47.636289   
3     Maija  2011-08-11  Seattle, Washington, United States  47.636289   
4     Maija  2011-08-11  Seattle, Washington, United States  47.636289   

    longitude neighbourhood_cleansed  ... room seattle spacious studio suite  \
0 -122.371025        West Queen Anne  ... 
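
For a file with tens of millions of rows, counting rows does not require loading everything into memory at once. The helper below, `count_csv_rows`, is a hypothetical name introduced for this sketch; it reads only the first column in chunks and sums the chunk lengths. The in-memory CSV text stands in for the real output file.

```python
import io

import pandas as pd


def count_csv_rows(source, chunksize=100_000):
    """Count data rows in a CSV by streaming it in chunks (hypothetical helper)."""
    # usecols=[0] keeps only the first column to reduce memory per chunk.
    reader = pd.read_csv(source, usecols=[0], chunksize=chunksize)
    return sum(len(chunk) for chunk in reader)


# Toy CSV standing in for the real file (values are illustrative only).
csv_text = "listing_id,price_y\n1,85.0\n2,140.5\n3,99.0\n"
total_rows = count_csv_rows(io.StringIO(csv_text), chunksize=2)
print(f"Total rows: {total_rows}")
```

The same function accepts a file path in place of the `io.StringIO` object.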

In [11]:
# Convert dates to datetime format
data["host_since"] = pd.to_datetime(data["host_since"])
data["first_review"] = pd.to_datetime(data["first_review"])
data["last_review"] = pd.to_datetime(data["last_review"])

# Create new temporal features
data["host_duration_days"] = (pd.Timestamp.now() - data["host_since"]).dt.days
data["days_since_last_review"] = (pd.Timestamp.now() - data["last_review"]).dt.days
data["days_between_reviews"] = (data["last_review"] - data["first_review"]).dt.days

print(data[["host_duration_days", "days_since_last_review", "days_between_reviews"]].head())

   host_duration_days  days_since_last_review  days_between_reviews
0                4875                    3270                  1523
1                4875                    3270                  1523
2                4875                    3270                  1523
3                4875                    3270                  1523
4                4875                    3270                  1523
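
Features computed against `pd.Timestamp.now()` change every time the notebook is re-run. One possible way to make them reproducible is to measure against a fixed reference date. The date used below is an assumption (the `scrape_id` values begin with `20160104`, which may indicate the scrape date), and the toy rows are illustrative only.

```python
import pandas as pd

# Toy rows with a missing and an unparseable date (illustrative only).
toy = pd.DataFrame({
    "host_since": ["2011-08-11", "2013-02-21", None],
    "last_review": ["2016-01-02", "not a date", "2015-12-29"],
})

# Assumption: a fixed reference date instead of the current time.
reference_date = pd.Timestamp("2016-01-04")

# errors="coerce" turns missing or unparseable values into NaT.
host_since = pd.to_datetime(toy["host_since"], errors="coerce")
last_review = pd.to_datetime(toy["last_review"], errors="coerce")

toy["host_duration_days"] = (reference_date - host_since).dt.days
toy["days_since_last_review"] = (reference_date - last_review).dt.days
toy["last_review_dayofweek"] = last_review.dt.dayofweek  # Monday=0 ... Sunday=6

# NaT compares as False, so rows without a valid review date are not "recent".
toy["is_recent_review"] = last_review >= reference_date - pd.DateOffset(years=1)

print(toy)
```

With a reference date near the scrape date, `is_recent_review` reflects recency at collection time rather than at the time the notebook happens to be executed.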


# Problem Statement 8: Advanced Data Normalization and Encoding

### Objective:
Prepare the dataset for machine learning by:
1. **Normalizing numerical features** to ensure scale consistency.
2. **Encoding categorical features** to convert them into machine-readable formats.

### What We Are Doing and Why:
1. **Normalization**:
   - Applying Min-Max Scaling to columns like `price_y`, `review_scores_rating`, `number_of_reviews`, and `availability_365` rescales each feature to the [0, 1] range, so numerical features are on a comparable scale.
   - This prevents features with larger ranges from dominating the model during training.

2. **Encoding**:
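   - **One-Hot Encoding**: Converts categorical variables like `room_type` and `property_type` into binary indicator columns, which is essential for models that cannot handle categorical data directly. The code uses `drop_first=True`, so the first category of each variable is dropped and acts as the baseline.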
   - **Label Encoding**: Encodes binary features like `is_recent_review` into numerical values (0 or 1).

These steps ensure the dataset is ready for machine learning algorithms that require normalized and encoded inputs.
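
The two encodings can be sketched on a toy frame (the category values are illustrative only). With `drop_first=True`, `pd.get_dummies` drops the first category in sorted order, so a row with all indicator columns equal to 0 belongs to that baseline category.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with one categorical and one boolean column (illustrative only).
toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Shared room", "Private room"],
    "is_recent_review": [True, False, False, True],
})

# One-hot encode room_type; "Entire home/apt" is dropped as the baseline.
encoded = pd.get_dummies(toy, columns=["room_type"], drop_first=True)

# Label-encode the boolean flag: False -> 0, True -> 1.
encoded["is_recent_review"] = LabelEncoder().fit_transform(toy["is_recent_review"])

print(encoded)
```

For a boolean column, `astype(int)` would produce the same 0/1 values without an encoder object.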


In [12]:
# Normalization
scaler = MinMaxScaler()
numerical_cols = ['price_y', 'review_scores_rating', 'number_of_reviews', 'availability_365']
normalized_data = pd.DataFrame(scaler.fit_transform(data[numerical_cols]), columns=numerical_cols)
print("Normalized Data:")
display(normalized_data.head())

Normalized Data:


| | price_y | review_scores_rating | number_of_reviews | availability_365 |
|---|---|---|---|---|
| 0 | 0.045732 | 0.9375 | 0.436709 | 0.947945 |
| 1 | 0.045732 | 0.9375 | 0.436709 | 0.947945 |
| 2 | 0.079546 | 0.9375 | 0.436709 | 0.947945 |
| 3 | 0.079546 | 0.9375 | 0.436709 | 0.947945 |
| 4 | 0.079546 | 0.9375 | 0.436709 | 0.947945 |
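
Min-Max scaling maps each column with x' = (x - min) / (max - min), so the column minimum becomes 0 and the maximum becomes 1. The sketch below checks this against `MinMaxScaler` on toy values (illustrative only, not taken from the dataset).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy numeric columns (illustrative only).
toy = pd.DataFrame({
    "price_y": [50.0, 100.0, 250.0],
    "number_of_reviews": [0.0, 20.0, 80.0],
})

# Scaling with scikit-learn, using its default feature range of [0, 1].
scaled = MinMaxScaler().fit_transform(toy)

# The same transformation written out column by column.
manual = (toy - toy.min()) / (toy.max() - toy.min())

print(np.allclose(scaled, manual.to_numpy()))
```

Because the minimum and maximum are learned from the data, a single extreme value compresses the remaining values toward 0, which is worth checking for skewed columns such as price.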


In [None]:
# Encoding
# One-Hot Encoding
encoded_data = pd.get_dummies(data, columns=['room_type', 'property_type'], drop_first=True)

# Label Encoding
label_encoder = LabelEncoder()
encoded_data['is_recent_review'] = label_encoder.fit_transform(data['is_recent_review'])

print("Encoded Data:")
display(encoded_data.head())

Encoded Data:


| | listing_id | scrape_id | name | host_id | host_name | host_since | host_location | latitude | longitude | neighbourhood_cleansed | ... | property_type_Chalet | property_type_Condominium | property_type_Dorm | property_type_House | property_type_Loft | property_type_Other | property_type_Tent | property_type_Townhouse | property_type_Treehouse | property_type_Yurt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 241032 | 20160104002432 | Stylish Queen Anne Apartment | 956883 | Maija | 2011-08-11 | Seattle, Washington, United States | 47.636289 | -122.371025 | West Queen Anne | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 241032 | 20160104002432 | Stylish Queen Anne Apartment | 956883 | Maija | 2011-08-11 | Seattle, Washington, United States | 47.636289 | -122.371025 | West Queen Anne | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 241032 | 20160104002432 | Stylish Queen Anne Apartment | 956883 | Maija | 2011-08-11 | Seattle, Washington, United States | 47.636289 | -122.371025 | West Queen Anne | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 241032 | 20160104002432 | Stylish Queen Anne Apartment | 956883 | Maija | 2011-08-11 | Seattle, Washington, United States | 47.636289 | -122.371025 | West Queen Anne | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 241032 | 20160104002432 | Stylish Queen Anne Apartment | 956883 | Maija | 2011-08-11 | Seattle, Washington, United States | 47.636289 | -122.371025 | West Queen Anne | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

