Findings Markdown

In [None]:
Group data

In [None]:
# Group by a categorical column (example: 'category')
grouped = df.groupby('category').agg({
    'value': ['count', 'mean', 'sum']
})

grouped


What it shows:
Prices vary far less between neighbourhoods than you’d expect — everything gravitates around £600–£700, regardless of borough.

Key takeaways:

Staten Island shared rooms (£715) are, oddly, the most expensive borough and room-type combination in the entire dataset.

That screams data anomaly, mispricing, or a landlord with delusions of grandeur.

Brooklyn Hotel Rooms (£711) are also unusually high compared with Manhattan hotels (£681).

Most stable, predictable area: Manhattan — all room types cluster tightly around £620–£680.

A few dodgy entries exist:

“brookln” and “manhatan” with a single listing each.

Classic data entry errors. You've already cleaned them, but they are worth noting.
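For reference, one way the misspelled borough labels could be folded back into their correct groups is sketched below. The column name `neighbourhood_group`, the `TYPO_FIXES` mapping, and the toy rows are assumptions made for illustration, not the notebook's actual cleaning step.

```python
import pandas as pd

# Toy data for illustration only; the column name is an assumption.
listings = pd.DataFrame({
    "neighbourhood_group": ["Brooklyn", "brookln", "Manhattan", "manhatan", " Queens "],
})

# Map each known misspelling (lower-cased) to its canonical borough name.
TYPO_FIXES = {"brookln": "Brooklyn", "manhatan": "Manhattan"}


def normalise_borough(label: str) -> str:
    """Trim whitespace, repair known typos, and standardise capitalisation."""
    cleaned = label.strip()
    return TYPO_FIXES.get(cleaned.lower(), cleaned.title())


listings["neighbourhood_group"] = listings["neighbourhood_group"].map(normalise_borough)

# Single-listing groups disappear once the typos are merged.
borough_counts = listings["neighbourhood_group"].value_counts()
print(borough_counts)
```

Checking `value_counts()` before and after the mapping is a quick way to confirm that no stray one-off labels remain.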

Conclusion:
There’s a surprising uniformity of pricing across New York. Borough choice explains far less than expected — room type is the stronger driver.
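A rough way to check which factor moves price more is to compare how far the average price ranges across boroughs with how far it ranges across room types. The sketch below uses made-up toy rows and assumed column names (`neighbourhood_group`, `room_type`, `price`); the numbers are not taken from the real dataset.

```python
import pandas as pd

# Toy rows for illustration only; values and column names are assumptions.
listings = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Brooklyn", "Brooklyn", "Queens", "Queens"],
    "room_type": ["Entire home/apt", "Private room"] * 3,
    "price": [650, 620, 640, 630, 645, 625],
})

# Mean price for every borough / room-type combination.
mean_price = listings.pivot_table(
    values="price", index="neighbourhood_group", columns="room_type", aggfunc="mean"
)
print(mean_price)

# Range (max minus min) of the average price along each factor.
borough_means = listings.groupby("neighbourhood_group")["price"].mean()
room_means = listings.groupby("room_type")["price"].mean()
borough_range = borough_means.max() - borough_means.min()
room_range = room_means.max() - room_means.min()

print(f"Range across boroughs:   {borough_range:.2f}")
print(f"Range across room types: {room_range:.2f}")
```

The factor with the larger range is the stronger candidate driver, although a range is a crude measure and says nothing about statistical significance.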

In [None]:
Extract insights

In [None]:
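from IPython.display import display

# Quick descriptive statistics
# (display() is needed because a cell only renders its last expression)
display(df.describe())

# Correlations for numeric columns
display(df.corr(numeric_only=True))

# Top categories
df['category'].value_counts()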


Here’s what your grouped outputs tell us:

Most common neighbourhoods

Manhattan (43k)

Brooklyn (41k)

Queens (13k)

This mirrors real Airbnb distribution — the dataset appears realistic.

Most common room types

Entire homes dominate (53k)

Private rooms follow closely (46k)

Shared rooms are rare (2.2k)

Hotel rooms practically don’t exist (115)

Manhattan + Brooklyn are almost identical in split — signalling a similar Airbnb market structure.
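To compare the room-type split between boroughs of very different sizes, raw counts can be converted to row-wise shares. This is a minimal sketch on toy data, assuming columns named `neighbourhood_group` and `room_type`.

```python
import pandas as pd

# Toy data for illustration only; column names are assumptions.
listings = pd.DataFrame({
    "neighbourhood_group": ["Manhattan"] * 4 + ["Brooklyn"] * 4 + ["Queens"] * 2,
    "room_type": [
        "Entire home/apt", "Entire home/apt", "Private room", "Private room",
        "Entire home/apt", "Entire home/apt", "Private room", "Private room",
        "Private room", "Private room",
    ],
})

# normalize="index" turns each row into proportions that sum to 1,
# so boroughs can be compared regardless of how many listings they have.
room_split = pd.crosstab(
    listings["neighbourhood_group"], listings["room_type"], normalize="index"
)
print(room_split.round(2))
```

Rows with near-identical proportions indicate a similar market structure, even when the absolute listing counts differ.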

Top-level price pattern

Despite the stereotype, entire homes in Manhattan aren't the most expensive.
Brooklyn, Queens, and the Bronx are all within £10–£15 of each other.

This implies:
➡️ Price isn't driven by neighbourhood in this dataset
➡️ One possible explanation: hosts anchor their prices to each other (herd pricing behaviour)

In [None]:
Identify patterns

In [None]:
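from IPython.display import display

# Trend over time (sorted by date so the line is drawn chronologically)
df.set_index('date')['value'].sort_index().plot()

# Find anomalies (very basic): values more than 3 standard deviations above the mean
display(df[df['value'] > df['value'].mean() + 3*df['value'].std()])

# Pivot for pattern spotting
df.pivot_table(values='value', index='category', columns='month', aggfunc='mean')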


Low Correlations Everywhere

Your correlation matrix shows nearly every numeric variable is weakly related:

Price is essentially uncorrelated with every other numeric variable (abs corr ≈ 0.00–0.04).

Reviews, minimum nights, availability — also weak.

➡️ Translation:
There are no strong linear relationships between the numeric columns.
This is common with messy scraped Airbnb datasets: price is likely driven by unrecorded variables (amenities, décor, photos).

The only remotely interesting correlation:

availability_365 ↔ calculated_host_listings_count: 0.159

At 0.159 this is still a weak relationship, but it suggests that hosts with many properties tend to keep them available for more of the year (likely professional landlords with less personal use).
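Rather than scanning the correlation matrix by eye, the pairs can be ranked by absolute strength. The sketch below is one way to do that; the toy values are invented and will not reproduce the correlations quoted above, and the column names are assumptions.

```python
import numpy as np
import pandas as pd

# Toy numeric data for illustration only; column names are assumptions.
listings = pd.DataFrame({
    "price": [600, 650, 700, 620, 680, 640],
    "availability_365": [10, 200, 150, 365, 30, 300],
    "calculated_host_listings_count": [1, 3, 2, 8, 1, 5],
})

corr = listings.corr(numeric_only=True)

# Keep only the upper triangle (k=1 skips the diagonal),
# so each pair appears once and self-correlations are dropped.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to (column_a, column_b) -> r, drop the masked cells,
# and sort by absolute strength, strongest first.
ranked = (
    upper.stack()
    .dropna()
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(ranked)
```

Sorting on the absolute value keeps strong negative correlations near the top instead of burying them at the bottom of the list.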

Oddities to note

Availability has a minimum of –10 days, which is impossible — a classic data entry or scraping error.

Likewise, no NYC building has “construction_year = 0”, so treat any such values you come across as errors rather than real years.
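Impossible values like these can be flagged with simple range checks before any further analysis. This is a minimal sketch on toy rows; the column names and the decision to treat year 0 as "unknown" are assumptions.

```python
import pandas as pd

# Toy rows for illustration only; column names are assumptions.
listings = pd.DataFrame({
    "availability_365": [-10, 0, 120, 365, 3677],
    "construction_year": [2005, 0, 2019, 2012, 2008],
})

# A listing can only be available for 0 to 365 days of the coming year.
bad_availability = ~listings["availability_365"].between(0, 365)

# Year 0 is treated here as a placeholder for "unknown" (an assumption).
bad_year = listings["construction_year"] == 0

# Replace implausible entries with missing values instead of dropping whole rows.
cleaned = listings.copy()
cleaned["availability_365"] = cleaned["availability_365"].mask(bad_availability)
cleaned["construction_year"] = cleaned["construction_year"].mask(bad_year)

print(f"{bad_availability.sum()} bad availability values, {bad_year.sum()} bad construction years")
print(cleaned)
```

Masking to missing values keeps the rest of each row usable, which matters when only one field in a listing is corrupted.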

In [None]:
Understand numeric + categorical data

In [None]:
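import pandas as pd
from IPython.display import display

# Numeric summary
display(df.select_dtypes(include='number').describe())

# Categorical summary: number of distinct values per text column
display(df.select_dtypes(include='object').nunique())

# Crosstab: counts for each pair of values from two categorical columns
pd.crosstab(df['category'], df['outcome'])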


Numeric:

Huge variance in host activity (listings: 1 → 332).

Availability ranges –10 → 3677, although a valid value must lie between 0 and 365, confirming the dataset is not fully clean at the source.

Price variance is smaller than you would expect for New York listings.

Categorical:

Over 60k unique property names — highly messy field.

Nearly 2,000 unique house rules — impossible to meaningfully analyse without NLP.

Neighbourhood has 8 unique groups, incl. typos → should be normalised.

Room Type × Neighbourhood Cross-Tab

Clear pattern:

Manhattan and Brooklyn dominate in every room category.

Hotel rooms almost exclusively live in Manhattan.

Queens + Bronx lean heavily toward private rooms — “budget boroughs”.