# Assignment 1: Car Listings Data

Welcome to your first required assignment. This notebook will guide you through exploring, inspecting, and preparing a real dataset of car listings from Craigslist.

Please read these instructions carefully before you begin:

- **AI use**: You may use AI to *ask questions* if you are stuck or need clarification. However, do not use AI code-completion or copy/paste code from AI. The purpose of this assignment is for you to gain fluency with Pandas by writing code yourself.  
- **Questions**: Look for cells marked with `Q:`. These contain questions you must answer, either in code or in Markdown.  
- **Markdown**: Use Markdown cells to explain your results, interpret outputs, and format your answers neatly. Clear communication is part of the assignment. Learn more about Markdown via:
    - [This introduction](https://www.markdownguide.org/getting-started/)
    - [A guide to the basic syntax](https://www.markdownguide.org/basic-syntax/)
    - [A cheat sheet](https://www.markdownguide.org/cheat-sheet/)
    - [A tutorial where you can explore](https://www.markdowntutorial.com/)
    
- **Output**: Ensure your notebook runs top-to-bottom without errors. Remove stray debug code before submitting.  

**What this assignment covers:**
- Loading and inspecting real-world data
- Understanding data structure, scope, and missingness (Week 1 readings: IMS Ch 1–2)
- Preparing data through thoughtful decisions (dropping duplicates, standardizing formats, creating derived variables)
- Reflecting on how preparation decisions affect analysis

This assignment does *not* require visualization or model fitting—those come in later weeks.

When you've finished your work on this notebook, comment out any cells that print large outputs or dataframes. Then restart the kernel, clear the outputs, and run all cells to confirm everything works as expected. Finally, commit and push your changes to GitHub.

I've placed blank code cells where you need to write code, but don't feel like you have to do all the work for a given part in a single cell. I like to have each step in its own cell, so I can see the output and check my work as I go. Your written answers don't need to be long.

I've also added Markdown comments showing where to type your answers. Those comments use a structure you might not have seen, with `<!--` marking the start of the comment and `-->` marking the end. Anything in a Markdown cell between those markers will not be displayed when you "render" the Markdown cell (by hitting Shift + Enter).

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [None]:
listings = pd.read_csv("car_listings.csv")

In [None]:

# Let's handle some column conversions to make it easier to work with this data
listings["time_posted"] = pd.to_datetime(listings["time_posted"], errors="coerce")
listings["year_from_time_posted"] = listings["time_posted"].dt.year

# These can have missing values, so cast to pandas' nullable integer type
listings["year"] = pd.to_numeric(listings["year"], errors="coerce").astype("Int64")
listings["odometer"] = pd.to_numeric(listings["odometer"], errors="coerce").astype("Int64")
listings["post_id"] = pd.to_numeric(listings["post_id"], errors="coerce").astype("Int64")
listings["num_images"] = pd.to_numeric(listings["num_images"], errors="coerce").astype("Int64")

listings["price"] = pd.to_numeric(listings["price"], errors="coerce")
listings["latitude"] = pd.to_numeric(listings["latitude"], errors="coerce")
listings["longitude"] = pd.to_numeric(listings["longitude"], errors="coerce")
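
The nullable-integer cast used in the cell above can be seen on a tiny example. This is a minimal sketch on made-up values (the `raw` Series is hypothetical, not taken from `car_listings.csv`). It compares a float column, where a missing value turns every year into a decimal, with pandas' `Int64` type, which keeps whole numbers alongside `<NA>`.

```python
import pandas as pd

# Hypothetical toy values; not taken from car_listings.csv
raw = pd.Series([2015, None, 2008])

# With a missing value present, a float column stores years as 2015.0 and NaN
as_float = raw.astype("float64")

# The nullable integer type keeps whole numbers and stores the gap as <NA>
as_nullable = raw.astype("Int64")

print(as_float.dtype, as_nullable.dtype)
print(as_nullable)
```

A plain NumPy `int64` column cannot hold a missing value at all, which is why the capital-I `Int64` type is used for columns like `year` and `odometer`.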



---

### Part 1: Inspect the Data

Before we can analyze or prepare data, we must understand what it is: its source, structure, scope, and limitations. Use the cells below to examine dataset shape, data types, summary statistics, and patterns of missing data. I've provided a single code cell for you, but the typical practice is to give each discrete analysis its own cell. Then you can show some of the output and write up your answer.

**Keep in mind:** Understanding *what this data represents* (where it came from, what it covers, who collected it) is as important as the numbers themselves.
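
The kinds of inspection listed above (shape, data types, summary statistics, missingness) might look like this on a small frame. This is only a sketch: the `toy` DataFrame and its values are made up, and only the column names echo the assignment.

```python
import pandas as pd

# Hypothetical toy frame; column names mirror the assignment, values are made up
toy = pd.DataFrame({
    "price": [4500.0, None, 12000.0, 300.0],
    "odometer": pd.array([120000, 85000, None, 210000], dtype="Int64"),
    "fuel": ["gas", None, "diesel", "gas"],
})

# .shape is a (rows, columns) tuple
n_rows, n_cols = toy.shape

# Count missing values per column, largest first
missing_counts = toy.isna().sum().sort_values(ascending=False)

# Summary statistics (count, mean, min, max, quartiles) for numeric columns
summary = toy.describe()

print(n_rows, n_cols)
print(toy.dtypes)
print(missing_counts)
print(summary)
```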

In [18]:
# Use this cell to explore the dataset and answer the questions in the assignment.

Q: What does `errors="coerce"` do in the functions above?  

A: <!-- Your answer here -->  

---

Q: How many rows and columns are in this dataset?  

A: <!-- Your answer here -->  

---

Q: Which columns have the most missing values?  

A: <!-- Your answer here -->  

---

Q: What data type is used for the `time_posted` column? Why do you think that matters for analysis?  

A: <!-- Your answer here -->  

---

Q: Looking at the summary statistics (`.describe()`), what are the minimum and maximum values for `price` and `odometer`? Do they seem reasonable?  

A: <!-- Your answer here -->  


---

### Part 2: Preparing the Data for Analysis

Before analyzing this data, we need to prepare it by making thoughtful decisions about duplicates, data types, and derived variables. Each step below represents a key preparation decision:

- Drop duplicate rows (`post_id` should uniquely identify a listing).
- Standardize string columns to lowercase.  
- Create new variables:
  - `car_age = year_from_time_posted - year`   
  - `high_mileage = odometer > 150000`  
  - `price_per_mile = price / odometer`
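
To make the three steps above concrete, here is a minimal sketch on a made-up frame. The `toy` DataFrame and every value in it are hypothetical, and it uses plain integer columns, not the nullable `Int64` columns in `listings`. It also shows one thing worth watching for: in this toy frame, an odometer reading of 0 makes `price_per_mile` infinite without raising an error.

```python
import pandas as pd

# Hypothetical toy frame; values are made up for illustration
toy = pd.DataFrame({
    "post_id": [1, 1, 2, 3],
    "make": ["Ford", "Ford", "TOYOTA", "honda"],
    "year": [2010, 2010, 2018, 2005],
    "year_from_time_posted": [2021, 2021, 2021, 2021],
    "odometer": [160000, 160000, 40000, 0],
    "price": [5000.0, 5000.0, 18000.0, 2500.0],
})

# Keep one row per post_id
toy = toy.drop_duplicates(subset="post_id").copy()

# Lowercase a string column so "Ford" and "ford" count as the same make
toy["make"] = toy["make"].str.lower()

# Derived variables
toy["car_age"] = toy["year_from_time_posted"] - toy["year"]
toy["high_mileage"] = toy["odometer"] > 150000
# Dividing by an odometer of 0 produces inf here, not an exception
toy["price_per_mile"] = toy["price"] / toy["odometer"]

print(toy)
```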

In [19]:
# your code here

Q: You standardized the string columns (`make`, `model`, `fuel`, `drive`, etc.) by converting everything to lowercase. Why is that helpful?  

A: <!-- Your answer here -->  

---

Q: You created a new variable `car_age = year_from_time_posted - year`, where the year is extracted from the `time_posted` column. Why might this be a more meaningful measure than using a fixed year (like 2025) or the current date?   

A: <!-- Your answer here -->  

---

Q: You created a flag `high_mileage = odometer > 150000`. What make/model with at least 100 observations has the highest proportion of listings in this category?  

A: <!-- Your answer here -->  

---

Q: You created a new variable `price_per_mile = price / odometer`. Which make has the highest **median** value for this new variable? Exclude makes with very few listings and explain your choice. Do you think this statistic is meaningful? Why or why not?  

A: <!-- Your answer here -->

---

### Part 3: Handling Missing Data

**Reflection on Missing Data Decisions:**

When we encounter missing values, we have several choices:
1. **Remove** rows with missing data (but lose information).
2. **Fill** (impute) missing values with a default, mean, median, or mode (but introduce assumptions).
3. **Keep** missing values and handle them during analysis.

Each choice has implications. For this analysis, we will fill missing values as follows:
- Numeric columns (`odometer`, `year`, `price`) with the **median**.  
- Categorical columns (`fuel`, `drive`, `transmission`, `paint`) with the **mode** (most common value).  

After filling, we also filter listings to a reasonable price range ($5,000–$50,000) and explore patterns across locations and time.
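
The fill-then-filter approach described above can be sketched on a made-up frame. The `toy` DataFrame and its values are hypothetical. Note that `.mode()` returns a Series (there can be ties), so the sketch takes its first entry, and that `.between()` includes both endpoints by default.

```python
import pandas as pd

# Hypothetical toy frame; values are made up for illustration
toy = pd.DataFrame({
    "price": [3000.0, None, 20000.0, 60000.0, 8000.0],
    "fuel": ["gas", "gas", None, "diesel", "gas"],
})

# Numeric column: fill the gap with the median of the observed values
toy["price"] = toy["price"].fillna(toy["price"].median())

# Categorical column: fill the gap with the most common value
toy["fuel"] = toy["fuel"].fillna(toy["fuel"].mode().iloc[0])

# Keep only rows whose price falls in the range, endpoints included
in_range = toy[toy["price"].between(5000, 50000)]

print(toy)
print(len(in_range))
```

Because the fill happens before the filter, the imputed row is treated like any other listing when the price range is applied.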

In [None]:
# Your code here

Q: What assumption are you making when you fill missing values this way? How might this assumption affect downstream analyses?

A: <!-- Your answer here -->

---

Q: Which preparation steps from Part 2 are irreversible decisions that could affect later analyses? 

A: <!-- Your answer here -->

---

Q: After filtering to prices between $5,000 and $50,000, how many rows remain?  

A: <!-- Your answer here -->  

---

Q: What is the average price by location across this reduced dataset?  

A: <!-- Your answer here -->  

---

Q: Looking at monthly counts of listings, do you notice any seasonal patterns or anomalies?  

A: <!-- Your answer here -->

---

### Part 4: Reflection

**What is this data, and how should we think about it?**

Write a short reflection in Markdown:
- Where did this data come from? What does it represent?
- What are the scope and limitations of this dataset? (e.g., geographic coverage, time period, what vehicles/sellers are included or excluded?)
- What was most challenging about preparing this dataset?
- Did anything surprise you about the data or the preparation process?

<!-- Your answer here -->