<a href="https://colab.research.google.com/github/0-Parth-D/Dataset-Extraction-and-Exploration/blob/main/Data_Exploration_DM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project Checkpoint 1: Dataset Comparison, Selection, and EDA

## PART A: IDENTIFICATION OF CANDIDATE DATASETS

## Dataset 1: Airbnb United States Listings

### 1. Dataset Name and Source

**Name:** Inside Airbnb - United States Multi-City Listings Dataset

**Source:** http://insideairbnb.com/get-the-data/

**Provider:** Inside Airbnb (Independent, non-commercial project)

**Scrape dates:** September 2025 - December 2025

**Data Collection Process:**
1. Scraped listings from multiple U.S. cities via Inside Airbnb
2. Downloaded `listings.csv.gz` (83 columns) + `reviews.csv.gz` (6 columns) per city
3. Concatenated city-level files into unified U.S. dataset
4. Preserved city identifier for multi-city comparisons
5. Selected 12-13 essential columns for analysis

**Actual Column Structure (83 columns):**

```
id, listing_url, scrape_id, last_scraped, source, name, description,
neighborhood_overview, picture_url, host_id, host_url, host_name,
host_since, host_location, host_about, host_response_time,
host_response_rate, host_acceptance_rate, host_is_superhost,
host_thumbnail_url, host_picture_url, host_neighbourhood,
host_listings_count, host_total_listings_count, host_verifications,
host_has_profile_pic, host_identity_verified, neighbourhood,
neighbourhood_cleansed, neighbourhood_group_cleansed, latitude,
longitude, property_type, room_type, accommodates, bathrooms,
bathrooms_text, bedrooms, beds, amenities, price, minimum_nights,
maximum_nights, [... 40 more columns including reviews, availability,
revenue estimates, host tenure metrics]
```

**Reviews Data (6 columns):**
```
listing_id, id, date, reviewer_id, reviewer_name, comments
```


---

### 2. Course Topic Alignment

- **Frequent Itemset Mining:** Apriori on `amenities`; regional patterns
- **Graph Mining:** Host-listing bipartite graph; multi-city host networks
- **Large-Scale ML:** SGD on 100K+ samples; price prediction
- **Clustering:** K-Means for listing archetypes across regions
- **Text Mining:** TF-IDF on descriptions; sentiment on review comments
- **Anomaly Detection:** Price outliers; city-specific anomalies
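
As a rough illustration of the itemset idea on `amenities`, the sketch below parses each value as a JSON list and counts frequent pairs. It assumes the column holds JSON-encoded list strings; the helper name `frequent_pairs` and the `min_support` threshold are hypothetical, and it stops at pairs rather than running a full Apriori or FP-Growth pass.

```python
import json
from collections import Counter
from itertools import combinations


def frequent_pairs(amenity_strings, min_support=0.5):
    """Return amenity pairs whose support (fraction of listings) >= min_support.

    Hypothetical helper: counts single items first, then only pairs built from
    frequent single items (the Apriori pruning idea), stopping at size 2.
    """
    baskets = [set(json.loads(s)) for s in amenity_strings if s]
    n = len(baskets)
    if n == 0:
        return {}

    item_counts = Counter(item for basket in baskets for item in basket)
    frequent_items = {i for i, c in item_counts.items() if c / n >= min_support}

    pair_counts = Counter()
    for basket in baskets:
        # Sort so each pair has one canonical ordering.
        kept = sorted(basket & frequent_items)
        pair_counts.update(combinations(kept, 2))

    return {pair: c / n for pair, c in pair_counts.items() if c / n >= min_support}


# Made-up amenity strings in the same shape as the raw column.
sample = [
    '["Wifi", "TV", "Kitchen"]',
    '["Wifi", "Kitchen"]',
    '["Wifi", "TV"]',
    '["Kitchen", "Hangers"]',
]
pairs = frequent_pairs(sample, min_support=0.5)
print(pairs)
```

Pruning infrequent single items before forming pairs keeps the pair counter small, which matters once the number of distinct amenities grows.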

---

### 3. Potential Beyond-Course Techniques

- **XGBoost:** Gradient boosting for non-linear price prediction
- **Topic Modeling (LDA):** Discover regional themes in text
- **BERT Sentiment:** Transformer-based sentiment on reviews
- **Hierarchical Models:** Mixed-effects with city random effects

---

### 4. Dataset Size and Structure

- **Rows:** 280,673 listings in the combined file (34 city-level listing files were found on the download page)
- **Columns:** 83 original, 12-13 selected
- **Cities:** Boston, NYC, Newport, and others
- **File size:** ~100+ MB (combined)

**Selected 13 Columns:**
id, city, name, description, host_name, price, neighbourhood_cleansed,
latitude, longitude, room_type, accommodates, review_scores_rating, amenities

---

### 5. Data Types

- **Text:** name, description, host_name, amenities (JSON list)
- **Numerical:** price (string "$150.00", needs cleaning), accommodates, lat/lon, review_scores_rating
- **Categorical:** city, neighbourhood_cleansed, room_type
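
The `price` cleaning mentioned above could look roughly like the following. This is a minimal sketch, assuming prices appear as strings such as `"$150.00"` or `"$1,250.00"` and that missing values are `None` or NaN; `parse_price` is a hypothetical helper name.

```python
import math


def parse_price(raw):
    """Convert a price string such as "$1,250.00" to a float.

    Hypothetical helper; returns None for missing or unparseable values so
    they can be handled explicitly later (dropped or imputed).
    """
    if raw is None or (isinstance(raw, float) and math.isnan(raw)):
        return None
    text = str(raw).strip().replace("$", "").replace(",", "")
    try:
        value = float(text)
    except ValueError:
        return None
    # Guard against the literal string "nan", which float() accepts.
    return None if math.isnan(value) else value


prices = [parse_price(p) for p in ["$150.00", "$1,250.00", None, float("nan"), "n/a"]]
print(prices)
```

With pandas, the same function could be applied column-wise, for example `listings_df["price"].map(parse_price)`.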

---

### 6. Target Variable(s)

- **Primary:** `price` (continuous, regression)
- **Secondary:** `review_scores_rating` (continuous or binary)
- **Derived:** `is_professional_host` (binary, from `calculated_host_listings_count`)

---

### 7. Licensing and Usage Constraints

**License:** CC BY 4.0

**Usage:** Academic/non-commercial with attribution

**Citation:**
Inside Airbnb. (2025). United States Airbnb Listings [Dataset].
Retrieved from http://insideairbnb.com/get-the-data/.
Multiple U.S. cities, scraped September-December 2025.

## Dataset 2: GitHub Developer Social Network

### 1. Dataset Name and Source

**Name:** GitHub Social Network Dataset (MUSAE GitHub)

**Source:** Stanford SNAP - https://snap.stanford.edu/data/github-social.html

**Authors:** Rozemberczki, Allen, & Sarkar (2021)

**Alternative:** PyTorch Geometric (`torch_geometric.datasets.GitHub`)

**Files:**
- `musae_github_edges.csv` - Edge list (289,003 edges)
- `musae_github_target.csv` - Node labels (37,700 developers)
- `musae_github_features.json` - Starred repositories (sparse)

---

### 2. Course Topic Alignment

- **Frequent Itemset Mining:** Technology stack patterns from starred repos
- **Graph Mining:** PageRank, community detection, centrality measures
- **Large-Scale ML:** Node classification (Web vs. ML developer)
- **Clustering:** Community detection algorithms
- **Graph Embeddings:** Node2Vec, DeepWalk
- **LSH:** Fast developer similarity search

---

### 3. Potential Beyond-Course Techniques

- **Graph Neural Networks (GCN/GAT):** Message-passing for node classification
- **Link Prediction:** Recommend developers to follow
- **Influence Maximization:** Identify key opinion leaders
- **Heterogeneous GNNs:** Multi-relation graphs (follows, stars, contributes)

---

### 4. Dataset Size and Structure

- **Nodes:** 37,700 developers
- **Edges:** 289,003 mutual follower relationships
- **Density:** 0.0004 (sparse)
- **Average degree:** 15.3 connections
- **File size:** ~20 MB total
- **Connected components:** 1 (a single connected component; every developer is reachable from every other)
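
The density and average-degree figures follow directly from the node and edge counts. A short check, assuming a simple undirected graph with no self-loops:

```python
n_nodes = 37_700
n_edges = 289_003

# Each undirected edge contributes to the degree of two nodes.
average_degree = 2 * n_edges / n_nodes
# Density = realized edges / possible edges in a simple undirected graph.
density = 2 * n_edges / (n_nodes * (n_nodes - 1))

print(f"average degree: {average_degree:.1f}")
print(f"density: {density:.4f}")
```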

---

### 5. Data Types

- **Graph:** Edge pairs (id_1, id_2)
- **Binary:** ml_target (0=Web, 1=ML developer)
- **Lists:** starred_repos (integer list, variable length 10-500)
- **Categorical:** location, employer (very sparse)
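
To make the edge-pair format concrete, here is a small sketch that reads an edge list and counts node degrees. It assumes a CSV with a header row and the columns `id_1` and `id_2` listed above; the toy edges are made up and only stand in for `musae_github_edges.csv`.

```python
import csv
import io
from collections import Counter


def degree_counts(edge_file):
    """Count the degree of each node from an undirected edge list.

    Assumes a CSV with a header row and columns `id_1`, `id_2`.
    """
    degrees = Counter()
    for row in csv.DictReader(edge_file):
        degrees[int(row["id_1"])] += 1
        degrees[int(row["id_2"])] += 1
    return degrees


# Tiny made-up edge list standing in for the real file.
toy_edges = io.StringIO("id_1,id_2\n0,1\n0,2\n1,2\n2,3\n")
degrees = degree_counts(toy_edges)
print(degrees.most_common())
```

On the real data, passing an open file handle instead of the `StringIO` object would give the full degree distribution.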

---

### 6. Target Variable(s)

- **Primary:** `ml_target` (binary, 50-50 balanced classification)
- **Derived:** Community ID (from clustering), Influence score (PageRank)
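
The PageRank-based influence score can be sketched with plain power iteration. This is an illustrative version only, assuming an undirected graph stored as a dict of neighbor sets in which every node has at least one neighbor; a library implementation would normally be used on the full graph.

```python
def pagerank(adjacency, damping=0.85, iterations=50):
    """Power-iteration PageRank on an undirected graph given as {node: set(neighbors)}.

    Assumes every node has at least one neighbor (true for a connected graph).
    """
    nodes = list(adjacency)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {}
        for node in nodes:
            # Each neighbor splits its current rank evenly among its own neighbors.
            incoming = sum(rank[nb] / len(adjacency[nb]) for nb in adjacency[node])
            new_rank[node] = (1 - damping) / n + damping * incoming
        rank = new_rank
    return rank


# Toy star graph: node 0 is mutually connected to nodes 1-3.
toy = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
scores = pagerank(toy)
print(max(scores, key=scores.get))
```

The damping factor of 0.85 is the conventional default, not a value taken from the project.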

---

### 7. Licensing and Usage Constraints

**License:** Stanford SNAP Academic License

**Usage:** Free for academic/educational, must cite paper

**Citation:**
```bibtex
@article{rozemberczki2021multi,
  title={Multi-scale Attributed Node Embedding},
  author={Rozemberczki, Benedek and Allen, Carl and Sarkar, Rik},
  journal={Journal of Complex Networks},
  volume={9}, number={2}, year={2021}
}
```

## Dataset 3: Spotify Million Playlist Dataset

### 1. Dataset Name and Source

**Name:** Spotify Million Playlist Dataset (MPD Remastered)

**Source:**
- Primary: Spotify Research (https://research.atspotify.com/)
- Challenge: RecSys Challenge 2018 (re-released 2020)
- Platform: AIcrowd (https://www.aicrowd.com/challenges/spotify-million-playlist-dataset-challenge)
- Paper: Language Model-Based Playlist Generation Recommender System (RecSys 2025)
- GitHub: https://github.com/elea-vellard/LLM-Playlist-Recommender

**Description:** Largest public music playlist dataset, containing 1 million user-created Spotify playlists with over 2 million unique tracks by roughly 295,000 artists.

---

### 2. Course Topic Alignment

- **Frequent Itemset Mining:** Track co-occurrence in playlists
- **Clustering:** Group playlists by genre/theme
- **Text Mining:** TF-IDF on playlist titles and descriptions
- **Graph Mining:** Track-playlist bipartite graph
- **Large-Scale ML:** Recommendation system on 1M playlists
- **Recommender Systems:** Matrix factorization, collaborative filtering

---

### 3. Potential Beyond-Course Techniques

- **Language Models (GPT/BERT):** Fine-tune transformer on playlist titles
- **Semantic Embeddings:** Create track embeddings using playlist context
- **Transformer Recommenders:** Use attention mechanisms for sequential playlist continuation
- **Audio Feature Integration:** Combine Spotify API features with collaborative signals

---

### 4. Dataset Size and Structure

**Dataset Statistics:**

- **Playlists:** 1,000,000
- **Unique tracks:** 2,262,292
- **Unique artists:** ~295,000
- **Total entries:** ~66 million
- **File size:** ~32 GB (JSON)
- **Avg tracks per playlist:** 66
- **Playlist length range:** 5-250 tracks

**Data Structure (JSON):**
```json
{
  "playlists": [
    {
      "name": "Country Classics",
      "collaborative": "false",
      "pid": 0,
      "modified_at": 1493424000,
      "num_tracks": 52,
      "tracks": [
        {
          "pos": 0,
          "artist_name": "Johnny Cash",
          "track_uri": "spotify:track:...",
          "track_name": "Ring of Fire",
          "duration_ms": 156000,
          "album_name": "Ring of Fire: The Best of..."
        }
      ]
    }
  ]
}
```
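
Given the structure above, a slice can be flattened into `(pid, track_uri)` pairs, which are the edges of the track-playlist bipartite graph and the input to co-occurrence counting. The sketch below uses a made-up two-playlist slice with placeholder URIs; `playlist_track_pairs` is a hypothetical helper name.

```python
import json
from collections import Counter

# Made-up slice in the same shape as the structure shown above.
raw = """
{
  "playlists": [
    {"name": "Country Classics", "pid": 0, "num_tracks": 2,
     "tracks": [
       {"pos": 0, "artist_name": "Johnny Cash", "track_uri": "spotify:track:a"},
       {"pos": 1, "artist_name": "Dolly Parton", "track_uri": "spotify:track:b"}
     ]},
    {"name": "Road Trip", "pid": 1, "num_tracks": 1,
     "tracks": [
       {"pos": 0, "artist_name": "Johnny Cash", "track_uri": "spotify:track:a"}
     ]}
  ]
}
"""


def playlist_track_pairs(slice_dict):
    """Yield (pid, track_uri) pairs, i.e. edges of the track-playlist bipartite graph."""
    for playlist in slice_dict["playlists"]:
        for track in playlist["tracks"]:
            yield playlist["pid"], track["track_uri"]


pairs = list(playlist_track_pairs(json.loads(raw)))
# Number of playlists each track appears in (a simple popularity signal).
track_popularity = Counter(uri for _, uri in pairs)
print(pairs)
print(track_popularity)
```

Because the function is a generator, it could be run one slice file at a time without holding the full dataset in memory.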

---

### 5. Data Types

- **Text:** name, description, artist_name, track_name, album_name
- **Numerical:** pid, num_tracks, num_followers, duration_ms, modified_at
- **Categorical:** track_uri, artist_uri, album_uri
- **Boolean:** collaborative
- **Lists:** tracks (array of track objects)

---

### 6. Target Variable(s)

**Primary:** Playlist Continuation (recommend 500 tracks given K=0,1,5,10,25,100 seed tracks)

**Secondary:** Playlist Title Generation (generate title from tracks)

**Derived:** Genre classification, Playlist theme clustering, Track popularity prediction

---

### 7. Licensing and Usage Constraints

**License:** Spotify Research License (Non-commercial)

**Usage:** Academic/educational with registration on AIcrowd

**Citation:**
Chen, C.-W., Lamere, P., Schedl, M., & Zamani, H. (2018).
Recsys challenge 2018: Automatic music playlist continuation.
ACM Conference on Recommender Systems, 527-528.

## PART B: COMPARATIVE ANALYSIS OF DATASETS

### 1. Supported Data Mining Tasks

| Dataset | Course Techniques (In-Scope) | Beyond-Course Techniques |
|--------|------------------------------|---------------------------|
| **Airbnb US Listings** | - Frequent itemsets on `amenities` (association rules)<br>- Clustering of listings (price, location, capacity)<br>- Text mining on `description` and review `comments`<br>- Graph mining on host–listing network<br>- Anomaly detection on price and occupancy | - XGBoost/LightGBM for price prediction<br>- Topic modeling (LDA) on descriptions/reviews<br>- BERT-based sentiment on reviews<br>- Hierarchical/mixed-effects models with city random effects |
| **GitHub Social Network** | - Graph mining (PageRank, centrality, communities)<br>- Node classification (Web vs. ML developer)<br>- Graph embeddings (Node2Vec)<br>- Clustering of developers via embeddings<br>- Frequent itemsets on starred repositories | - Graph Neural Networks (GCN/GAT/GraphSAGE)<br>- Link prediction (future follows)<br>- Influence maximization<br>- Heterogeneous GNNs (follows, stars, contributions) |
| **Spotify Million Playlist** | - Frequent itemsets on tracks in playlists<br>- Clustering of playlists/tracks (usage patterns)<br>- Text mining on playlist titles/descriptions<br>- Graph mining on track–playlist bipartite graph<br>- Large-scale recommendation (implicit feedback) | - Transformer-based LLM playlist generation<br>- Sequence models for next-track prediction<br>- Track embeddings from co-occurrence (Word2Vec-style)<br>- Multi-modal models combining audio + collaborative signals |

---

### 2. Data Quality Issues

| Dataset | Main Data Quality Issues | Impact on Analysis |
|--------|--------------------------|--------------------|
| **Airbnb US Listings** | - Missing values in text fields (`neighborhood_overview`, `host_about`), ratings, and host metadata<br>- Inconsistent `price` formatting (strings with currency symbols, commas)<br>- Noisy/free-text descriptions and reviews<br>- Duplicated or near-duplicated hosts across cities<br>- Heterogeneous cities (different markets mixed together) | - Requires careful cleaning and imputation, especially for `price` and `review_scores_rating`<br>- Amenity parsing (JSON-like strings) needed before itemset mining<br>- City heterogeneity must be controlled for in modeling (city dummies or hierarchical models) |
| **GitHub Social Network** | - Sparse graph (low density, many weakly connected nodes)<br>- Missing or unreliable node attributes (location, employer often missing or noisy)<br>- Starred repositories list length varies widely<br>- Labels (Web vs. ML) derived from job titles, may be noisy | - Graph sparsity affects community detection quality and centrality measures<br>- Node classification cannot rely heavily on metadata; must lean on graph structure<br>- Label noise may cap achievable accuracy for supervised tasks |
| **Spotify Million Playlist** | - Very large size (~32 GB JSON) makes full in-memory loading difficult<br>- Playlist titles/descriptions often short, vague, or missing<br>- User behavior bias: popular tracks/artists overrepresented<br>- Track metadata incomplete without Spotify API (no audio features in raw MPD) | - Requires sampling or chunked processing for EDA and baselines<br>- Text mining on titles alone may be weak; need to combine with tracks<br>- Popularity bias must be handled (e.g., reweighting, debiasing) for fair recommendations<br>- Extra pipeline needed to enrich with audio features via API |
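
For the sampling noted in the Spotify row, one option is reservoir sampling, which draws a uniform sample from a stream without knowing its length or loading it fully. This is a sketch under assumptions: the `range` below is only a stand-in for a generator that yields playlists slice by slice, and the sample size is arbitrary.

```python
import random


def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            # Replace an existing element with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample


# Stand-in for a stream of playlist IDs read one slice at a time.
playlist_ids = range(100_000)
subset = reservoir_sample(playlist_ids, k=1_000)
print(len(subset))
```

Fixing the seed keeps the sampled subset reproducible across EDA runs.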

---

### 3. Algorithmic Feasibility

| Dataset | Feasibility of Course Algorithms | Feasibility of Beyond-Course Techniques |
|--------|-----------------------------------|-----------------------------------------|
| **Airbnb US Listings** | - Apriori/FP-Growth on `amenities` feasible (tens–hundreds of thousands of rows)<br>- Clustering (K-Means) on tabular features runs easily on a single machine<br>- Text mining (TF-IDF on descriptions) manageable if vocabulary limited (e.g., max_features)<br>- Host–listing graph small enough for NetworkX/igraph | - XGBoost on 100K+ rows is tractable on a laptop (seconds–minutes)<br>- LDA topic modeling feasible; may need sampling for very long texts<br>- BERT fine-tuning requires GPU but can use a subset of listings/reviews<br>- Hierarchical models can be slow if many cities; possible with careful grouping |
| **GitHub Social Network** | - Graph algorithms (PageRank, centrality, communities) scale well at 37K nodes / 289K edges<br>- Node2Vec embeddings feasible in minutes on CPU<br>- Node classification with basic ML models (logistic regression, random forest) fast | - GCN/GAT/GraphSAGE training on this scale is comfortable on a single GPU; still feasible on CPU with some patience<br>- Link prediction (heuristics + ML) feasible; full all-pairs is expensive but negative sampling helps<br>- Influence maximization on full graph possible with greedy approximations, but may need heuristics for speed |
| **Spotify Million Playlist** | - Frequent itemsets on full dataset expensive; need sampling or approximate methods<br>- K-Means/other clustering impractical on full 66M interactions; must work on reduced representations (track embeddings, playlist vectors)<br>- TF-IDF on all titles/descriptions feasible but must stream or limit vocabulary | - Full-sequence Transformers on all 1M playlists expensive; realistic on sampled subsets or shorter sequences<br>- Word2Vec-style track embeddings (Skip-gram/CBOW) feasible with optimized libraries (Gensim, fastText) on powerful machine<br>- Multi-modal models (audio + interaction) likely need GPU and sub-sampling<br>- Production-scale recommendation pipelines require distributed systems (Spark, PySpark), but research-scale experiments can sample 100K playlists |

---

### 4. Bias Considerations

| Dataset | Sources of Bias | Consequences |
|--------|------------------|--------------|
| **Airbnb US Listings** | - Geographic bias: Major cities overrepresented; rural areas sparse<br>- Platform bias: Only Airbnb; excludes hotels and other platforms<br>- Host bias: Professional hosts with many listings vs. casual hosts<br>- Review bias: Only guests who choose to review are observed | - Models may overfit to big-city patterns and not generalize<br>- Pricing or “success” analyses could reinforce existing inequalities (e.g., gentrification, neighborhood exclusion)<br>- Professional host behavior could dominate patterns, hiding casual host dynamics |
| **GitHub Social Network** | - Sample bias: Only developers with ≥10 starred repos are included<br>- Geographic bias: Overrepresentation of U.S./Europe developers<br>- Labeling bias: Web vs. ML based on job titles; non-standard titles may be misclassified<br>- Activity bias: Only active, visible developers included | - Conclusions may not generalize to casual or early-career developers<br>- Risk of overstating centrality or “importance” of certain regions/companies<br>- Using labels for screening/hiring could unfairly disadvantage certain groups |
| **Spotify Million Playlist** | - Popularity bias: Most streams and playlist entries concentrated on top artists/tracks<br>- Demographic bias: Spotify users skew younger and by region (depends on country coverage)<br>- Historical bias: Music trends at time of data collection influence patterns<br>- Platform curation bias: Editorial playlists influence user behavior | - Recommenders may keep pushing already-popular tracks (“rich-get-richer” effect)<br>- Niche genres and minority artists under-recommended<br>- Analyses may not generalize to non-Spotify listeners or other platforms |

---

### 5. Ethical Considerations

| Dataset | Key Ethical Questions | Risk Level & Mitigations |
|--------|------------------------|--------------------------|
| **Airbnb US Listings** | - Could price/revenue models encourage further gentrification or displacement?<br>- Are we exposing patterns that let landlords optimize profit at the expense of local residents?<br>- Is there risk of re-identifying hosts or neighborhoods in a harmful way? | - **Risk:** Medium–High (housing and affordability are sensitive topics)<br>- **Mitigations:** Work with aggregated/geographically coarse results; avoid naming specific hosts; explicitly discuss housing/inequality context; frame findings as descriptive, not prescriptive for exploitation |
| **GitHub Social Network** | - Could developer embeddings/scores be used for hiring discrimination or surveillance?<br>- Are we ranking people in ways they never consented to?<br>- Are we amplifying visibility gaps between developers? | - **Risk:** Medium (professional consequences, reputation)<br>- **Mitigations:** Avoid per-user “scorecards”; present results in aggregate; emphasize methodological/structural insights; discuss ethical use in hiring and evaluation explicitly |
| **Spotify Million Playlist** | - Do recommender models further marginalize niche artists/genres?<br>- Are we reinforcing narrow listening habits instead of discovery?<br>- Could playlist generation be used to manipulate mood/behavior without transparency? | - **Risk:** Medium (cultural and economic impact on artists, user autonomy)<br>- **Mitigations:** Evaluate diversity/novelty of recommendations; analyze and report popularity bias; avoid user-level profiling; frame work as improving discovery and artist exposure, not purely engagement maximization |


In [1]:
from bs4 import BeautifulSoup
from google.colab import files

In [5]:
import requests # Import the requests library to fetch content from URLs

html_url = "https://insideairbnb.com/get-the-data/"  # Inside Airbnb download index page

extracted_urls_listings = []
extracted_urls_reviews = []

print(f"Processing URL: {html_url}")

try:
    # Fetch content from the URL (timeout so a stalled connection cannot hang the cell)
    response = requests.get(html_url, timeout=30)
    response.raise_for_status() # Raise an HTTPError for bad responses (4xx or 5xx)
    content = response.content

    # BeautifulSoup can handle bytes directly and detect encoding
    soup = BeautifulSoup(content, 'html.parser')

    for link in soup.find_all('a', href=True):
        href = link['href']
        # Filter for URLs containing '/united-states/' and ending with 'listings.csv.gz'
        if '/united-states/' in href and href.endswith('listings.csv.gz'):
            extracted_urls_listings.append(href)
        if '/united-states/' in href and href.endswith('reviews.csv.gz'):
            extracted_urls_reviews.append(href)

    print(f"\nExtracted {len(extracted_urls_listings)} listings URLs:")
    for url in extracted_urls_listings:
        print(url)
    print(f"\nExtracted {len(extracted_urls_reviews)} reviews URLs:")
    for url in extracted_urls_reviews:
        print(url)
except requests.exceptions.RequestException as e:
    print(f"Error fetching URL {html_url}: {e}")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

Processing URL: https://insideairbnb.com/get-the-data/

Extracted 34 listings URLs:
https://data.insideairbnb.com/united-states/ny/albany/2025-11-07/data/listings.csv.gz
https://data.insideairbnb.com/united-states/nc/asheville/2025-09-22/data/listings.csv.gz
https://data.insideairbnb.com/united-states/tx/austin/2025-09-16/data/listings.csv.gz
https://data.insideairbnb.com/united-states/ma/boston/2025-09-23/data/listings.csv.gz
https://data.insideairbnb.com/united-states/mt/bozeman/2025-11-12/data/listings.csv.gz
https://data.insideairbnb.com/united-states/fl/broward-county/2025-09-26/data/listings.csv.gz
https://data.insideairbnb.com/united-states/ma/cambridge/2025-09-28/data/listings.csv.gz
https://data.insideairbnb.com/united-states/il/chicago/2025-09-22/data/listings.csv.gz
https://data.insideairbnb.com/united-states/nv/clark-county-nv/2025-09-23/data/listings.csv.gz
https://data.insideairbnb.com/united-states/oh/columbus/2025-09-26/data/listings.csv.gz
https://data.insideairbnb.com
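
The extracted URLs encode the state, city, and scrape date in their path, which is one way to recover a city identifier for each file. The sketch below assumes the path layout `/<country>/<state>/<city>/<date>/data/<file>` seen in the output above; `parse_listing_url` is a hypothetical helper, and URLs with any other layout return `None`.

```python
from urllib.parse import urlparse


def parse_listing_url(url):
    """Split an Inside Airbnb data URL into country, state, city, and scrape date.

    Assumes the path layout /<country>/<state>/<city>/<date>/data/<file>;
    URLs with a different layout return None.
    """
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) != 6 or parts[4] != "data":
        return None
    country, state, city, scrape_date, _, filename = parts
    return {
        "country": country,
        "state": state,
        "city": city,
        "scrape_date": scrape_date,
        "file": filename,
    }


info = parse_listing_url(
    "https://data.insideairbnb.com/united-states/ma/boston/2025-09-23/data/listings.csv.gz"
)
print(info)
```

When loading each file, the parsed values could be attached as columns (for example `df["city"] = info["city"]`) before concatenation.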

In [6]:
import pandas as pd

# --- Process Listings ---
listings_dfs = []
print(f"Downloading and combining {len(extracted_urls_listings)} listing files...")

for url in extracted_urls_listings:
    print(f"Loading: {url}")
    try:
        # pandas infers gzip compression from the .gz extension;
        # low_memory=False reads each file in one pass so column dtypes are inferred consistently
        df = pd.read_csv(url, low_memory=False)
        # Optionally add a column to track source if needed, e.g., city/url
        # df['source_url'] = url
        listings_dfs.append(df)
    except Exception as e:
        print(f"Error loading {url}: {e}")

if listings_dfs:
    listings_df = pd.concat(listings_dfs, ignore_index=True)
    print(f"\nSuccessfully combined listings. Shape: {listings_df.shape}")
    display(listings_df.head())
else:
    print("No listings data could be loaded.")

# --- Process Reviews ---
reviews_dfs = []
print(f"\nDownloading and combining {len(extracted_urls_reviews)} review files...")

for url in extracted_urls_reviews:
    print(f"Loading: {url}")
    try:
        df = pd.read_csv(url)
        reviews_dfs.append(df)
    except Exception as e:
        print(f"Error loading {url}: {e}")

if reviews_dfs:
    reviews_df = pd.concat(reviews_dfs, ignore_index=True)
    print(f"\nSuccessfully combined reviews. Shape: {reviews_df.shape}")
    display(reviews_df.head())
else:
    print("No reviews data could be loaded.")

Downloading and combining 34 listing files...
Loading: https://data.insideairbnb.com/united-states/ny/albany/2025-11-07/data/listings.csv.gz
Loading: https://data.insideairbnb.com/united-states/nc/asheville/2025-09-22/data/listings.csv.gz
Loading: https://data.insideairbnb.com/united-states/tx/austin/2025-09-16/data/listings.csv.gz
Loading: https://data.insideairbnb.com/united-states/ma/boston/2025-09-23/data/listings.csv.gz
Loading: https://data.insideairbnb.com/united-states/mt/bozeman/2025-11-12/data/listings.csv.gz
Loading: https://data.insideairbnb.com/united-states/fl/broward-county/2025-09-26/data/listings.csv.gz


Loading: https://data.insideairbnb.com/united-states/ma/cambridge/2025-09-28/data/listings.csv.gz
Loading: https://data.insideairbnb.com/united-states/il/chicago/2025-09-22/data/listings.csv.gz
Loading: https://data.insideairbnb.com/united-states/nv/clark-county-nv/2025-09-23/data/listings.csv.gz
Loading: https://data.insideairbnb.com/united-states/oh/columbus/2025-09-26/data/listings.csv.gz
Loading: https://data.insideairbnb.com/united-states/tx/dallas/2025-11-23/data/listings.csv.gz
Loading: https://data.insideairbnb.com/united-states/co/denver/2025-09-29/data/listings.csv.gz
Loading: https://data.insideairbnb.com/united-states/tx/fort-worth/2025-09-16/data/listings.csv.gz
Loading: https://data.insideairbnb.com/united-states/hi/hawaii/2025-09-16/data/listings.csv.gz
Loading: https://data.insideairbnb.com/united-states/nj/jersey-city/2025-09-25/data/listings.csv.gz
Loading: https://data.insideairbnb.com/united-states/ca/los-angeles/2025-12-04/data/listings.csv.gz
Loading: https://data

Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,...,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,host_profile_id,host_profile_url,hosts_time_as_user_years,hosts_time_as_user_months,hosts_time_as_host_years,hosts_time_as_host_months
0,2992450,https://www.airbnb.com/rooms/2992450,20251107023918,2025-11-07,city scrape,Luxury 2 bedroom apartment,The apartment is located in a quiet neighborho...,,https://a0.muscache.com/pictures/44627226/0e72...,4621559,...,1,0,0,0.07,,,,,,
1,3820211,https://www.airbnb.com/rooms/3820211,20251107023918,2025-11-07,city scrape,Restored Precinct in Center Sq. w/Parking,Step into the charming and comfy 1BR/1BA apart...,Overview<br /><br />The lovely apartment is lo...,https://a0.muscache.com/pictures/prohost-api/H...,19648678,...,5,0,0,2.3,,,,,,
2,5651579,https://www.airbnb.com/rooms/5651579,20251107023918,2025-11-07,city scrape,Large studio apt by Capital Center & ESP@,"Spacious studio with hardwood floors, fully eq...",The neighborhood is very eclectic. We have a v...,https://a0.muscache.com/pictures/b3fc42f3-6e5e...,29288920,...,1,1,0,2.95,,,,,,
3,6623339,https://www.airbnb.com/rooms/6623339,20251107023918,2025-11-07,city scrape,Center Sq. Loft in Converted Precinct w/ Parking,Step into the charming and comfy 1BR/1BA apart...,Overview<br /><br />The lovely apartment is lo...,https://a0.muscache.com/pictures/prohost-api/H...,19648678,...,5,0,0,2.62,,,,,,
4,9005989,https://www.airbnb.com/rooms/9005989,20251107023918,2025-11-07,city scrape,"Studio in The heart of Center SQ, in Albany NY",(21 years of age or older ONLY) NON- SMOKING.....,"There are many shops, restaurants, bars, museu...",https://a0.muscache.com/pictures/d242a77e-437c...,17766924,...,1,0,0,5.53,,,,,,



Downloading and combining 34 review files...
Loading: https://data.insideairbnb.com/united-states/ny/albany/2025-11-07/data/reviews.csv.gz
Loading: https://data.insideairbnb.com/united-states/nc/asheville/2025-09-22/data/reviews.csv.gz
Loading: https://data.insideairbnb.com/united-states/tx/austin/2025-09-16/data/reviews.csv.gz
Loading: https://data.insideairbnb.com/united-states/ma/boston/2025-09-23/data/reviews.csv.gz
Loading: https://data.insideairbnb.com/united-states/mt/bozeman/2025-11-12/data/reviews.csv.gz
Loading: https://data.insideairbnb.com/united-states/fl/broward-county/2025-09-26/data/reviews.csv.gz
Loading: https://data.insideairbnb.com/united-states/ma/cambridge/2025-09-28/data/reviews.csv.gz
Loading: https://data.insideairbnb.com/united-states/il/chicago/2025-09-22/data/reviews.csv.gz
Loading: https://data.insideairbnb.com/united-states/nv/clark-county-nv/2025-09-23/data/reviews.csv.gz
Loading: https://data.insideairbnb.com/united-states/oh/columbus/2025-09-26/data/re

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,2992450,15066586,2014-07-01,16827297,Kristen,Large apartment; nice kitchen and bathroom. Ke...
1,2992450,21810844,2014-10-24,22648856,Christopher,"This may be a little late, but just to say Ken..."
2,2992450,27434334,2015-03-04,45406,Altay,The apartment was very clean and convenient to...
3,2992450,28524578,2015-03-25,5485362,John,Kenneth was ready when I got there and arrange...
4,2992450,35913434,2015-06-23,15772025,Jennifer,We were pleased to see how 2nd Street and the ...


In [7]:
display(listings_df.sample(5))
display(reviews_df.sample(5))

Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,...,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,host_profile_id,host_profile_url,hosts_time_as_user_years,hosts_time_as_user_months,hosts_time_as_host_years,hosts_time_as_host_months
222359,17381227,https://www.airbnb.com/rooms/17381227,20250930030713,2025-09-30,city scrape,"Newport, Vineyards, Beaches 10 minutes 12 beds",Home with 1 acre rear fenced yard. 5 star int...,"Newport Vineyard (Brix Restaurant), Newport Na...",https://a0.muscache.com/pictures/cc5ee2f2-77d5...,27612979,...,4,0,0,0.89,,,,,,
154224,1445145077579514212,https://www.airbnb.com/rooms/1445145077579514212,20251204025459,2025-12-05,city scrape,Comfort in the Heart of Town.,Enjoy a stylish experience at this centrally-l...,,https://a0.muscache.com/pictures/hosting/Hosti...,53720573,...,1,0,0,4.66,1.465873e+18,https://www.airbnb.com/users/profile/146587293...,9.0,10.0,1.0,11.0
259477,1409692710248514238,https://www.airbnb.com/rooms/1409692710248514238,20250925033206,2025-09-25,city scrape,.3BR/2BA Spacious Downtown Families Kids Fun,"Well equipped, spacious, serene, clean 3 BR/2 ...",,https://a0.muscache.com/pictures/hosting/Hosti...,692682282,...,2,3,0,2.65,,,,,,
183812,16071397,https://www.airbnb.com/rooms/16071397,20251204025441,2025-12-05,previous scrape,Small Private Room in Gramercy/East Village,PLEASE DONT BOOK UNLESS YOU HAVE MULTIPLE POSI...,,https://a0.muscache.com/pictures/miso/Hosting-...,7555939,...,0,1,0,0.07,1.462735e+18,https://www.airbnb.com/users/profile/146273508...,12.0,4.0,9.0,0.0
14852,30390378,https://www.airbnb.com/rooms/30390378,20250923202714,2025-09-24,city scrape,"Blueground | Back Bay, gym, nr the common",Show up and start living from day one in Bosto...,,https://a0.muscache.com/pictures/prohost-api/H...,107434423,...,343,0,0,0.01,,,,,,


Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
13160887,17790109,460078736501519622,2021-09-26,422227979,Kaitlyn,Truly a wonderful house to stay in and enjoy y...
3331251,38907909,1189230618283373099,2024-06-28,563077581,Paulina,Fantastic place and area. Ea and Nickm are gre...
6306823,34199862,1480662455629579797,2025-08-04,62265714,Caitlin,A+. I really can’t say enough good things abou...
9965116,23432574,398609664060314196,2021-07-03,390338755,Carol,Wonderful modern stand alone house with secure...
8071883,8090951,837712913760401337,2023-03-01,330536116,Marion,This is the place to stay! Look no further. ...


In [8]:
listings_df.columns

Index(['id', 'listing_url', 'scrape_id', 'last_scraped', 'source', 'name',
       'description', 'neighborhood_overview', 'picture_url', 'host_id',
       'host_url', 'host_name', 'host_since', 'host_location', 'host_about',
       'host_response_time', 'host_response_rate', 'host_acceptance_rate',
       'host_is_superhost', 'host_thumbnail_url', 'host_picture_url',
       'host_neighbourhood', 'host_listings_count',
       'host_total_listings_count', 'host_verifications',
       'host_has_profile_pic', 'host_identity_verified', 'neighbourhood',
       'neighbourhood_cleansed', 'neighbourhood_group_cleansed', 'latitude',
       'longitude', 'property_type', 'room_type', 'accommodates', 'bathrooms',
       'bathrooms_text', 'bedrooms', 'beds', 'amenities', 'price',
       'minimum_nights', 'maximum_nights', 'minimum_minimum_nights',
       'maximum_minimum_nights', 'minimum_maximum_nights',
       'maximum_maximum_nights', 'minimum_nights_avg_ntm',
       'maximum_nights_avg_ntm', 'ca

In [9]:
# Columns kept for the analysis
listings_cols = [
    'id',
    'name',
    'description',
    'host_name',
    'host_id',
    'price',
    'neighbourhood_cleansed',
    'latitude',
    'longitude',
    'room_type',
    'accommodates',
    'review_scores_rating',
    'amenities',
]

# Filter the dataframe; .copy() avoids chained-assignment warnings in later cells
listings_df = listings_df[listings_cols].copy()

print(f"Filtered listings shape: {listings_df.shape}")
display(listings_df.sample(5))

Filtered listings shape: (280673, 13)


Unnamed: 0,id,name,description,host_name,host_id,price,neighbourhood_cleansed,latitude,longitude,room_type,accommodates,review_scores_rating,amenities
64707,945657712977559613,Winterfell - Duplex close to OSU/Short North,"Welcome home, my Lord, to your Columbus strong...",Sam,90062227,$115.00,Near North/University,39.99421,-82.99793,Entire home/apt,6,4.91,"[""Hot water"", ""Coffee"", ""Dishes and silverware..."
252360,1192830311404590479,"New- Stunning Oceanfront ""Pelican Bluffs""","Perched on a dramatic oceanfront bluff, Pelica...",Katherine,18470900,$562.00,Unincorporated Areas,37.532139,-122.517919,Entire home/apt,6,4.98,"[""Cooking basics"", ""Hangers"", ""Lockbox"", ""Free..."
188556,31042303,Manhattan Huge Luxurious Room near Columbia Univ,Hi welcome to New York City! Traveling is abou...,Yao & Rain,10415675,,Harlem,40.82128,-73.95344,Private room,2,4.55,"[""Shower gel"", ""Dishes and silverware"", ""Refri..."
33896,1407255211659738057,Lyfe Beach Resort 2 Bdrm Oceanview Hollywood,Lyfe Beach Resort is a beachfront resort in Ho...,Michael,513817283,,Hollywood,25.987374,-80.119741,Entire home/apt,6,5.0,"[""Wifi"", ""TV"", ""Sun loungers"", ""Clothing stora..."
79209,789456,Hip Modern Ocean Front Real One Bdr,Ilikai Hotel Ocean Front 2 pool one adults onl...,Janna,4049989,$165.00,Primary Urban Center,21.28354,-157.83937,Entire home/apt,4,4.5,"[""Exterior security cameras on property"", ""Smo..."
