# Lab Instructions

Find a dataset that interests you. I'd recommend starting on [Kaggle](https://www.kaggle.com/). Read through all of the material about the dataset and download a .CSV file.

1. Write a short summary of the data.  Where did it come from?  How was it collected?  What are the features in the data?  Why is this dataset interesting to you?  

2. Identify 5 interesting questions about your data that you can answer using Pandas methods.  

3. Answer those questions!  You may use any method you want (including LLMs) to help you write your code; however, you should use Pandas to find the answers.  LLMs will not always write code in this way without specific instruction.  

4. Write the answer to your question in a text box underneath the code you used to calculate the answer.



The housing dataset comes from the California Census and contains information about housing districts in California. It is commonly used to analyze housing prices and the factors that influence them. Each row describes one district, and the features are longitude, latitude, median housing age, total rooms, total bedrooms, population, households, median income, median house value, and proximity to the ocean. This dataset is interesting because it helps us understand how location, income, and demographics relate to housing prices, which can be useful for real estate, economics, and urban planning.


In [1]:
import pandas as pd

# Load the housing dataset
df = pd.read_csv("housing.csv")

# Show the first few rows
df.head()


Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY
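The `Unnamed: 0` column in the output above usually means the CSV was saved with its row index included. A minimal sketch of how that column could be avoided, assuming the first column really is just a saved index; the inline `csv_text` is a two-row excerpt of the rows shown above standing in for `housing.csv`:

```python
import io
import pandas as pd

# Two-row excerpt of the table above, standing in for housing.csv.
# The empty first header field is what pandas labels "Unnamed: 0".
csv_text = """,longitude,latitude,median_house_value
0,-122.23,37.88,452600.0
1,-122.22,37.86,358500.0
"""

# Default read: the saved index appears as an extra data column.
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.columns.tolist())

# index_col=0 tells pandas to use that first column as the row index instead.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(clean.columns.tolist())
```

With the real file, the equivalent call would be `pd.read_csv("housing.csv", index_col=0)`.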


What is the average median house value?

In [2]:
# 1. Average median house value
avg_house_value = df["median_house_value"].mean()
avg_house_value


np.float64(206855.81690891474)

The mean of the district-level median house values is about **$206,856**. Because each row is a housing district, this is an average of district medians during the time the data was collected, not the price of a typical individual home.
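A mean can be pulled upward by a few expensive districts, so it can help to report the median next to it. The sketch below uses only the five rows shown by `df.head()` as a stand-in for the full file, so the numbers it prints are not the dataset-wide values:

```python
import pandas as pd

# The five median_house_value entries shown by df.head() above.
sample = pd.DataFrame(
    {"median_house_value": [452600.0, 358500.0, 352100.0, 341300.0, 342200.0]}
)

# agg with a list of names returns one value per statistic.
summary = sample["median_house_value"].agg(["mean", "median"])
print(summary)
```

On the full dataset the same call would be `df["median_house_value"].agg(["mean", "median"])`.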


Which areas (by ocean proximity) have the highest average house values?

In [4]:
# 2. Average house value by ocean proximity
avg_by_ocean = df.groupby("ocean_proximity")["median_house_value"].mean()
avg_by_ocean


ocean_proximity
<1H OCEAN     240084.285464
INLAND        124805.392001
ISLAND        380440.000000
NEAR BAY      259212.311790
NEAR OCEAN    249433.977427
Name: median_house_value, dtype: float64

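Grouped by ocean proximity, **ISLAND** districts have the highest average value (about $380,440), followed by **NEAR BAY** (about $259,212), **NEAR OCEAN** (about $249,434), and **<1H OCEAN** (about $240,084). **INLAND** districts are the lowest at about $124,805, well below every coastal category, so proximity to the coast is associated with higher housing prices.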
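To make the ranking easier to read, the grouped means can be sorted, and a count column shows how many districts stand behind each mean. This is a sketch on a made-up miniature table (the `toy` rows are illustrative, not real census rows):

```python
import pandas as pd

# Made-up miniature table, only to demonstrate the pattern.
toy = pd.DataFrame({
    "ocean_proximity": ["INLAND", "INLAND", "NEAR BAY", "NEAR BAY", "ISLAND"],
    "median_house_value": [100000.0, 140000.0, 250000.0, 270000.0, 380000.0],
})

# Mean and number of districts per category, highest mean first.
by_ocean = (
    toy.groupby("ocean_proximity")["median_house_value"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
print(by_ocean)
```

Replacing `toy` with `df` applies the same steps to the housing data. A category with a small `count` deserves more caution than one backed by many districts.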


Is there a correlation between median income and house value?

In [5]:
# 3. Correlation between income and house value
correlation = df["median_income"].corr(df["median_house_value"])
correlation


np.float64(0.6880752079585484)

There is a moderately strong positive correlation (about **0.69**) between median income and median house value. Districts with higher median incomes tend to have higher median house values, which makes sense economically, although correlation alone does not show that income causes the higher prices.
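The same idea extends to every numeric column at once, which shows how income compares with the other features. A sketch using three columns from the five `df.head()` rows as a stand-in for the full file; with so few rows the printed values are not meaningful, only the pattern is:

```python
import pandas as pd

# Columns copied from the df.head() output above.
sample = pd.DataFrame({
    "housing_median_age": [41.0, 21.0, 52.0, 52.0, 52.0],
    "median_income": [8.3252, 8.3014, 7.2574, 5.6431, 3.8462],
    "median_house_value": [452600.0, 358500.0, 352100.0, 341300.0, 342200.0],
    "ocean_proximity": ["NEAR BAY"] * 5,
})

# Keep numeric columns only, then read off the house-value column of the
# correlation matrix and drop its trivial self-correlation of 1.0.
corr_with_value = (
    sample.select_dtypes("number")
    .corr()["median_house_value"]
    .drop("median_house_value")
    .sort_values(ascending=False)
)
print(corr_with_value)
```

`select_dtypes("number")` is used here so the text column `ocean_proximity` is excluded before `corr()` runs.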


What is the average house age in the dataset?

In [6]:
# 4. Average house age
avg_house_age = df["housing_median_age"].mean()
avg_house_age


np.float64(28.639486434108527)

The average of the districts' median housing ages is about **28.6 years**. Housing in a typical district is therefore neither brand new nor extremely old.
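A single average does not show how ages are spread out. One way to check that is to bin the ages and look at the share of districts in each bin. The bin edges and labels below are arbitrary choices made for illustration, and the data is again just the five `df.head()` rows:

```python
import pandas as pd

# housing_median_age values from the df.head() output above.
ages = pd.Series([41.0, 21.0, 52.0, 52.0, 52.0], name="housing_median_age")

# Hypothetical age bands: (0, 10], (10, 30], (30, 100].
bands = pd.cut(
    ages,
    bins=[0, 10, 30, 100],
    labels=["10 or under", "11 to 30", "over 30"],
)

# Share of districts per band, listed in band order.
shares = bands.value_counts(normalize=True).sort_index()
print(shares)
```

On the full dataset, `df["housing_median_age"]` would take the place of `ages`.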


Which region (latitude/longitude) has the highest concentration of households?

In [7]:
# 5. Region with the highest households
top_household_area = df.groupby(["longitude", "latitude"])["households"].sum().idxmax()
top_household_area


(np.float64(-122.41), np.float64(37.79))

The coordinate pair with the most households is **longitude -122.41, latitude 37.79**, which falls in San Francisco. Because the code sums `households` across all districts that share the same coordinates, this is the single point with the largest household total. It suggests that dense urban areas hold far more households than rural areas.
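Grouping on exact coordinates only combines districts that share the same point, so nearby districts are counted separately. A sketch of two possible extensions: listing the top few points with `nlargest`, and rounding coordinates to one decimal place to form coarser regions. The `toy` rows and the one-decimal rounding are assumptions made for illustration:

```python
import pandas as pd

# Made-up rows, only to demonstrate the pattern.
toy = pd.DataFrame({
    "longitude": [-122.41, -122.41, -122.42, -118.30],
    "latitude": [37.79, 37.79, 37.79, 34.02],
    "households": [500.0, 700.0, 900.0, 1000.0],
})

# Household totals per exact coordinate pair, largest first.
per_point = toy.groupby(["longitude", "latitude"])["households"].sum()
top_points = per_point.nlargest(3)
print(top_points)

# Coarser regions: round both coordinates to one decimal before grouping.
coarse = (
    toy.assign(
        lon_bin=toy["longitude"].round(1),
        lat_bin=toy["latitude"].round(1),
    )
    .groupby(["lon_bin", "lat_bin"])["households"]
    .sum()
)
top_region = coarse.idxmax()
print(top_region)
```

In this toy table the two neighboring points near longitude -122.4 merge into one region once the coordinates are rounded, which is the effect the coarser grouping is meant to show.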
