# HW 5 – Case Studies

**Learning Objectives**
- Practice full-cycle exploratory data analysis on two real-world case studies.
- Translate ambiguous stakeholder questions into concrete analytical tasks and visualizations.
- Summarize quantitative findings with clear, data-backed narratives.

## Getting Started
- Work inside a clean conda/virtualenv with pandas, numpy, seaborn, matplotlib, and plotly installed.
- Create a `data/` folder to store the required CSV files; the starter code expects it at `../data/` relative to this notebook.
- Use clear, commented code and keep figures legible (titles, axis labels, legends).
- Place all narrative answers in Markdown cells immediately after the relevant code.

## Data Sources
- NYC Airbnb listings: retrieve the most recent CSV from [Inside Airbnb](https://insideairbnb.com/get-the-data/) or the course repository (`nyc_airbnb.csv`).
- Food prices for nutrition: download `food_prices.csv` from the course repository or the Food Prices for Nutrition portal.
- If you store the files elsewhere, update the paths accordingly and document the change.

## Part A – NYC Airbnb Case Study

### Question 1 – Acquire and Inspect the NYC Airbnb Data (15 pts)

1. Download the latest NYC Airbnb listings data (CSV) from the Inside Airbnb site or the course website and save it as `nyc_airbnb.csv` inside a local `data/` directory.
2. Load the dataset with `pandas.read_csv`. If you place the notebook in the course repo root, the relative path should be `../data/nyc_airbnb.csv`.
3. Display the dataset shape, column names, and data types.
4. Compute the number of missing values per column and identify the three fields with the most missingness.
5. Add a short Markdown cell summarizing your most important takeaways from this first pass.
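
The inspection steps above can be sketched on a small stand-in frame. The rows and the column names `name`, `last_review`, and `reviews_per_month` are invented for illustration and are not taken from the real file; only the pandas calls are meant to carry over.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the listings table (values invented for illustration).
toy = pd.DataFrame({
    "name": ["Loft", None, "Studio", "Room"],
    "price": [150, 80, 120, 60],
    "last_review": [None, None, "2019-05-01", None],
    "reviews_per_month": [np.nan, np.nan, 0.5, 1.2],
})

print(toy.shape)    # (rows, columns)
print(toy.dtypes)   # column names with their data types

# Missing values per column, largest first; head(3) keeps the top three.
missing = toy.isna().sum().sort_values(ascending=False)
top_three = missing.head(3)
print(top_three)
```

Dividing `missing` by `len(toy)` gives the share of missing rows per column, which is often easier to discuss in the summary than raw counts.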

```python
# TODO: load the NYC Airbnb data and perform the requested inspection
import pandas as pd
from pathlib import Path

data_path = Path('../data/nyc_airbnb.csv')
# Your code here
```


### Question 2 – Understand Price Levels Across the City (15 pts)

1. Compute the minimum, maximum, mean, and median price for the full dataset.
2. Group prices by `neighbourhood_group` (borough) and produce a tidy summary showing count of listings, mean price, and median price.
3. Create a visualization that contrasts the price distribution across boroughs (e.g., faceted histograms or box plots). Label axes clearly and style for readability.
4. Write 2–3 bullet points describing the major pricing patterns you observe across boroughs.
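
One possible shape for the overall and per-borough summaries is sketched below on invented prices. The output column names `listings`, `mean_price`, and `median_price` are hypothetical labels chosen for this example, not names required by the assignment.

```python
import pandas as pd

# Toy prices (invented); the real data has many more rows and boroughs.
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Manhattan",
                            "Brooklyn", "Brooklyn", "Queens"],
    "price": [200, 150, 1000, 90, 110, 70],
})

# Citywide minimum, maximum, mean, and median in one call.
overall = toy["price"].agg(["min", "max", "mean", "median"])

# One row per borough, built with named aggregations and returned as a tidy frame.
by_borough = (
    toy.groupby("neighbourhood_group")["price"]
    .agg(listings="count", mean_price="mean", median_price="median")
    .reset_index()
    .sort_values("median_price", ascending=False)
)

print(overall)
print(by_borough)
```

In the toy Manhattan rows the mean sits far above the median because of a single expensive listing. The same gap in the real data would be a hint that the price distribution is right-skewed and that medians or a log scale may describe it better.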

```python
# TODO: summarize and visualize NYC Airbnb price distributions
# Your code here
```


### Question 3 – Investigate Drivers of Listing Availability (15 pts)

1. Make a violin or box plot of `availability_365` by `room_type` to compare availability distributions across listing types.
2. Quantify differences by calculating the median availability for each `room_type`.
3. Check whether room type availability differs by borough by creating a two-way summary (e.g., pivot table) and discuss one notable contrast.
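
The per-type medians and the two-way summary could be computed along the following lines. The rows and the room type labels in this toy frame are invented for illustration; the plot itself is left out of the sketch.

```python
import pandas as pd

# Toy availability data (invented values and labels).
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Brooklyn",
                            "Brooklyn", "Brooklyn", "Manhattan"],
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt",
                  "Private room", "Private room", "Entire home/apt"],
    "availability_365": [30, 200, 90, 365, 100, 50],
})

# Median availability for each room type.
median_by_type = toy.groupby("room_type")["availability_365"].median()

# Boroughs as rows, room types as columns, median availability in the cells.
two_way = toy.pivot_table(
    index="neighbourhood_group",
    columns="room_type",
    values="availability_365",
    aggfunc="median",
)

print(median_by_type)
print(two_way)
```

Cells backed by only a handful of listings can be misleading, so a second pivot with `aggfunc="count"` is a cheap check before discussing any contrast.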

```python
# TODO: analyze how availability varies by room type and borough
# Your code here
```


### Question 4 – Popularity, Reviews, and Host Behavior (20 pts)

1. Create at least one figure that relates `number_of_reviews` to `price`. Consider log-scaling one or both axes if the distributions are heavily skewed.
2. Aggregate reviews by `neighbourhood_group` to identify which boroughs receive the most total reviews and which have the highest median number of reviews.
3. Identify hosts with more than 20 listings (`calculated_host_listings_count > 20`) and summarize how their average prices and numbers of reviews compare with the overall market.
4. Provide a short interpretation (3–4 sentences) about what your findings suggest regarding listing popularity and host specialization.
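
A minimal sketch of the borough aggregation and the large-host comparison follows, using invented rows. It assumes that filtering listings on `calculated_host_listings_count > 20` is an acceptable way to pick out large hosts; `total_reviews`, `median_reviews`, `large_hosts`, and `all_listings` are hypothetical labels.

```python
import pandas as pd

# Toy listings (invented values).
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Brooklyn", "Brooklyn", "Queens"],
    "price": [300, 250, 100, 80, 60],
    "number_of_reviews": [5, 10, 40, 60, 20],
    "calculated_host_listings_count": [25, 25, 1, 2, 1],
})

# Total and median review counts per borough.
reviews_by_borough = toy.groupby("neighbourhood_group")["number_of_reviews"].agg(
    total_reviews="sum", median_reviews="median"
)

# Listings that belong to hosts with more than 20 listings.
is_large_host = toy["calculated_host_listings_count"] > 20

# Side-by-side means for large-host listings and for the whole market.
metrics = ["price", "number_of_reviews"]
comparison = pd.DataFrame({
    "large_hosts": toy.loc[is_large_host, metrics].mean(),
    "all_listings": toy[metrics].mean(),
})

print(reviews_by_borough)
print(comparison)
```

Note that `all_listings` still includes the large-host listings; comparing against `toy.loc[~is_large_host, metrics]` instead would contrast the two groups directly.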

```python
# TODO: explore relationships among reviews, prices, and hosts
# Your code here
```


### Question 5 – Manhattan Deep Dive (10 pts)

1. Filter to Manhattan listings with `price < 1000` to focus on typical market activity.
2. Map the spatial distribution using latitude and longitude (static map or scatter colored by price).
3. Report the five Manhattan neighbourhoods with the highest median price and describe how they compare on availability and review volume.
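
The filter and the neighbourhood ranking might look like the sketch below. The rows and the single-letter neighbourhood names are invented, and the summary column names are hypothetical; the map is not covered here.

```python
import pandas as pd

# Toy listings (invented values; neighbourhood names are placeholders).
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan"] * 5 + ["Brooklyn"],
    "neighbourhood": ["A", "A", "B", "B", "C", "D"],
    "price": [400, 1500, 200, 300, 100, 90],
    "availability_365": [100, 10, 200, 300, 50, 365],
    "number_of_reviews": [3, 0, 20, 30, 50, 5],
})

# Keep Manhattan listings priced below 1000.
manhattan = toy[(toy["neighbourhood_group"] == "Manhattan") & (toy["price"] < 1000)]

# Rank neighbourhoods by median price and carry availability and reviews along.
top_neighbourhoods = (
    manhattan.groupby("neighbourhood")
    .agg(
        median_price=("price", "median"),
        median_availability=("availability_365", "median"),
        total_reviews=("number_of_reviews", "sum"),
    )
    .nlargest(5, "median_price")
)

print(top_neighbourhoods)
```

With only three neighbourhoods in the toy data, `nlargest(5, ...)` returns all three; on the real data it returns exactly five rows.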

```python
# TODO: complete the Manhattan-focused analysis
# Your code here
```


## Part B – Food Prices for Nutrition Case Study

### Question 6 – Load and Prepare the Food Prices for Nutrition Data (10 pts)

1. Download `food_prices.csv` from the course site and place it in `../data/` relative to this notebook.
2. Load the dataset with pandas, drop rows containing missing values in the core analytical fields, and limit the dataframe to the provided list of focus countries (India, Senegal, Albania, China, United States, United Arab Emirates, Türkiye, Egypt, Ghana).
3. Keep the columns needed for analysis: `Country Name`, `Time`, `Cost of a healthy diet [CoHD]`, `Percent of the population who cannot afford a healthy diet [CoHD_headcount]`, and `Population [Pop]`.
4. Create a `pop_log` column equal to the natural log of population.
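
The preparation steps above can be sketched as follows. The toy rows are invented, `Extra column` is a hypothetical field standing in for columns you will drop, and the sketch assumes that the "core analytical fields" are the five columns listed in step 3.

```python
import numpy as np
import pandas as pd

cohd = "Cost of a healthy diet [CoHD]"
headcount = "Percent of the population who cannot afford a healthy diet [CoHD_headcount]"
pop = "Population [Pop]"
keep = ["Country Name", "Time", cohd, headcount, pop]
focus = ["India", "Senegal", "Albania", "China", "United States",
         "United Arab Emirates", "Türkiye", "Egypt", "Ghana"]

# Toy stand-in for the raw file (all numbers invented).
toy = pd.DataFrame({
    "Country Name": ["India", "India", "Ghana", "France"],
    "Time": [2017, 2018, 2017, 2017],
    cohd: [2.8, 2.9, np.nan, 3.0],
    headcount: [70.0, 68.0, 60.0, 0.1],
    pop: [1.0e9, 2.0e9, 3.0e7, 6.7e7],
    "Extra column": ["x", "y", "z", "w"],
})

food = toy.dropna(subset=keep)                     # drop rows missing a core field
food = food[food["Country Name"].isin(focus)]      # keep the focus countries only
food = food[keep].copy()                           # keep the analysis columns
food["pop_log"] = np.log(food[pop])                # natural log of population
food = food.reset_index(drop=True)

print(food)
```

Country spellings in the real file may not match the list above exactly, so comparing `food["Country Name"].unique()` against `focus` is a useful check that no country was silently filtered out.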

```python
# TODO: load and preprocess the food prices dataset
# Your code here
```


### Question 7 – Track Affordability Over Time (10 pts)

1. For each country, compute the year-over-year change in the percent of the population that cannot afford a healthy diet.
2. Create a line plot of `CoHD_headcount` over time for each country (use color or facets to keep the plot readable).
3. Highlight at least one country whose trend deserves extra attention and explain the context in 2–3 sentences.
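
A year-over-year change can be computed with a grouped difference, as in this sketch on invented values. The column name `headcount_yoy_change` is a hypothetical label, and the code assumes each country has one row per consecutive year.

```python
import pandas as pd

headcount = "Percent of the population who cannot afford a healthy diet [CoHD_headcount]"

# Toy series for two countries (invented values, deliberately unsorted).
toy = pd.DataFrame({
    "Country Name": ["Ghana", "Ghana", "Ghana", "India", "India", "India"],
    "Time": [2019, 2017, 2018, 2017, 2018, 2019],
    headcount: [58.0, 62.0, 60.0, 70.0, 68.0, 69.0],
})

# Sort first so diff() compares each year with the previous one within a country.
toy = toy.sort_values(["Country Name", "Time"]).reset_index(drop=True)

# Change in percentage points; the first year of each country has no predecessor.
toy["headcount_yoy_change"] = toy.groupby("Country Name")[headcount].diff()

print(toy)
```

If a country has gaps in `Time`, the difference spans more than one year, so it is worth checking `toy.groupby("Country Name")["Time"].diff()` before interpreting the result.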

```python
# TODO: analyze trajectories in diet affordability
# Your code here
```


### Question 8 – Explain the Relationship Between Cost and Affordability (15 pts)

1. Build an interactive scatter plot (Plotly Express is recommended) comparing `Cost of a healthy diet [CoHD]` (x-axis) and `CoHD_headcount` (y-axis).
2. Use `pop_log` to scale the marker sizes and add an animation or small-multiple feature over `Time`.
3. Annotate at least two notable country-years directly on the chart or in a short Markdown note to call out interesting patterns.

```python
# TODO: create an interactive visualization linking cost and affordability
# Your code here
```


### Question 9 – Synthesize Your Findings (10 pts)

1. Write a concise reflection (roughly 200 words) comparing the challenges posed by the NYC Airbnb and food affordability case studies.
2. Address how data availability, cleaning, and visualization choices influenced your conclusions in each scenario.
3. Suggest one follow-up analysis or external data source that would deepen either case study.

> Add your written response here.