# IS 4487 Assignment 7: Data Transformation with Airbnb Listings

In this assignment, you will:
- Load the Airbnb dataset you cleaned in Assignment 6
- Apply data transformation techniques like scaling, binning, encoding, and feature creation
- Make the dataset easier to use for tasks like pricing analysis, guest segmentation, or listing recommendations
- Practice writing up your analysis clearly so a business audience — like a host, marketing manager, or city partner — could understand it

## Why This Matters

Airbnb analysts, hosts, and city partners rely on clean and well-structured data to make smart decisions. Whether they’re adjusting prices, identifying high-performing listings, or designing better guest experiences, they need data that’s transformed, organized, and ready for use.

This assignment helps you practice that kind of real-world thinking: taking messy real data and getting it ready for action.

<a href="https://colab.research.google.com/github/Stan-Pugsley/is_4487_base/blob/main/Assignments/assignment_07_data_transformation.ipynb" target="_parent">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>



## Dataset Description

The dataset you'll be using is a **detailed Airbnb listing file**, available from [Inside Airbnb](https://insideairbnb.com/get-the-data/).

Each row represents one property listing. The columns include:

- **Host attributes** (e.g., host ID, host name, host response time)
- **Listing details** (e.g., price, room type, minimum nights, availability)
- **Location data** (e.g., neighborhood, latitude/longitude)
- **Property characteristics** (e.g., number of bedrooms, amenities, accommodates)
- **Calendar/booking variables** (e.g., last review date, number of reviews)

The schema is consistent across cities, so you can expect similar columns regardless of the location you choose.

## 1. Setup and Load Your Data

You'll be working with the `cleaned_airbnb_data.csv` file you exported from Assignment 6.

📌 In Google Colab:
- Click the folder icon on the left sidebar
- Use the upload button to add your CSV file to the session
- Then use the code block below to read it into your notebook

Before getting started, make sure you import the libraries you'll need for this assignment:
- `pandas`, `numpy` for data manipulation
- `matplotlib.pyplot`, `seaborn` for visualizations


In [3]:
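# Import the libraries used in this assignment
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the cleaned dataset exported from Assignment 6
path = 'cleaned_airbnb_data.csv'
df = pd.read_csv(path)
df.head(10)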

Unnamed: 0,id,listing_url,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,host_url,...,review_scores_value,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,price_numeric,host_response_rate_numeric,host_acceptance_rate_numeric
0,2992450,https://www.airbnb.com/rooms/2992450,2025-08-04,city scrape,Luxury 2 bedroom apartment,The apartment is located in a quiet neighborho...,,https://a0.muscache.com/pictures/44627226/0e72...,4621559,https://www.airbnb.com/users/show/4621559,...,3.67,f,1,1,0,0,0.07,70.0,,50.0
1,3820211,https://www.airbnb.com/rooms/3820211,2025-08-04,city scrape,Funky Urban Gem: Prime Central Location - Park...,Step into the charming and comfy 1BR/1BA apart...,Overview<br /><br />The lovely apartment is lo...,https://a0.muscache.com/pictures/prohost-api/H...,19648678,https://www.airbnb.com/users/show/19648678,...,4.77,f,4,4,0,0,2.32,104.0,100.0,100.0
2,5651579,https://www.airbnb.com/rooms/5651579,2025-08-04,city scrape,Large studio apt by Capital Center & ESP@,"Spacious studio with hardwood floors, fully eq...",The neighborhood is very eclectic. We have a v...,https://a0.muscache.com/pictures/b3fc42f3-6e5e...,29288920,https://www.airbnb.com/users/show/29288920,...,4.64,f,2,1,1,0,2.97,75.0,100.0,99.0
3,6623339,https://www.airbnb.com/rooms/6623339,2025-08-04,city scrape,Bright & Cozy City Stay · Top Location + Parking!,Step into the charming and comfy 1BR/1BA apart...,Overview<br /><br />The lovely apartment is lo...,https://a0.muscache.com/pictures/prohost-api/H...,19648678,https://www.airbnb.com/users/show/19648678,...,4.72,f,4,4,0,0,2.68,101.0,100.0,100.0
4,9005989,https://www.airbnb.com/rooms/9005989,2025-08-04,city scrape,"Studio in The heart of Center SQ, in Albany NY",(21 years of age or older ONLY) NON- SMOKING.....,"There are many shops, restaurants, bars, museu...",https://a0.muscache.com/pictures/d242a77e-437c...,17766924,https://www.airbnb.com/users/show/17766924,...,4.77,f,1,1,0,0,5.67,110.0,,100.0
5,9501054,https://www.airbnb.com/rooms/9501054,2025-08-04,city scrape,Spacious suite with full bath by Capital Center,Great location within walking distance to the ...,The place is located in the Historic Mansion n...,https://a0.muscache.com/pictures/45153167-d704...,29288920,https://www.airbnb.com/users/show/29288920,...,4.67,f,2,1,1,0,3.7,60.0,100.0,99.0
6,10768745,https://www.airbnb.com/rooms/10768745,2025-08-04,city scrape,Alb hospital area studio bath wifi. (Red),Spacious warm studio in 1840 house close to ho...,,https://a0.muscache.com/pictures/hosting/Hosti...,5691268,https://www.airbnb.com/users/show/5691268,...,4.91,f,2,0,2,0,7.1,47.0,100.0,95.0
7,11253948,https://www.airbnb.com/rooms/11253948,2025-08-04,city scrape,/Fire Place Bungalow\ 1917 SUNY Eagle 6Beds 2B...,Cute single family bungalow (1200sqft) in a co...,The super convenient area is one of the big se...,https://a0.muscache.com/pictures/f0e43651-0834...,4259750,https://www.airbnb.com/users/show/4259750,...,4.87,f,8,8,0,0,1.8,192.0,97.0,87.0
8,11639446,https://www.airbnb.com/rooms/11639446,2025-08-04,city scrape,$55twin($30 foreign student)FreeBF Noa/c no smoke,Regular breakfast free<br /><br />Lunch packed...,One block away to unique organic honest weigh ...,https://a0.muscache.com/pictures/1129f67c-dfb2...,61700428,https://www.airbnb.com/users/show/61700428,...,4.87,f,2,0,2,0,2.07,55.0,100.0,0.0
9,12799126,https://www.airbnb.com/rooms/12799126,2025-08-04,city scrape,Private Room in the Hearth of the Albany,,"It's perfect location for any kind of events, ...",https://a0.muscache.com/pictures/3e656127-d8e2...,69729286,https://www.airbnb.com/users/show/69729286,...,4.75,f,2,0,2,0,0.46,58.0,100.0,71.0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 459 entries, 0 to 458
Data columns (total 78 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   id                                            459 non-null    int64  
 1   listing_url                                   459 non-null    object 
 2   last_scraped                                  459 non-null    object 
 3   source                                        459 non-null    object 
 4   name                                          459 non-null    object 
 5   description                                   449 non-null    object 
 6   neighborhood_overview                         196 non-null    object 
 7   picture_url                                   459 non-null    object 
 8   host_id                                       459 non-null    int64  
 9   host_url                                      459 non-null    obj

## 2. Check for Skew in a Numeric Column

Business framing:  

Airbnb listings can have a wide range of values for things like price, availability, or reviews. These kinds of distributions can be hard to visualize, summarize, or model.

Choose one **numeric column** that appears skewed and do the following:
- Plot a histogram
- Apply a transformation (e.g., log or other method)
- Plot again to compare
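
As one possible sketch of these steps, the example below uses a small made-up series in place of your data. The column name `price_numeric` is borrowed from the preview above; the toy values and the choice of `np.log1p` are assumptions for illustration, not the required approach.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Toy stand-in for a right-skewed column such as price_numeric (values are made up)
toy = pd.DataFrame(
    {"price_numeric": [47, 55, 58, 60, 70, 75, 101, 104, 110, 192, 450, 900]}
)

# log1p computes log(1 + x), so it stays defined when a column contains zeros
toy["price_log"] = np.log1p(toy["price_numeric"])

# Side-by-side histograms: original vs. transformed
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(toy["price_numeric"], bins=10)
axes[0].set_title("Original price")
axes[1].hist(toy["price_log"], bins=10)
axes[1].set_title("log1p(price)")
plt.tight_layout()
plt.show()

# Skewness closer to 0 indicates a more symmetric distribution
skew_before = toy["price_numeric"].skew()
skew_after = toy["price_log"].skew()
print(skew_before, skew_after)
```

Comparing the two skewness values alongside the plots gives you a number to cite in your written response, in addition to the visual difference.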


### In Your Response:
1. What column did you examine?
2. What transformation did you try, and why?
3. How did the transformed version help make the data more usable for analysis or stakeholder review?



In [None]:
# Add code here 🔧

### ✍️ Your Response: 🔧
1.

2.

3.

## 3. Scale Two Numeric Columns

Business framing:

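If an analyst wanted to compare listing price to the minimum number of nights required, or build a model that weighs both, those values need to be on a similar scale. Otherwise the column with the larger range (typically price) can dominate the comparison.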

Follow these steps:
- Pick two numeric columns with different value ranges (e.g., one column may have a min of 0 and a max of 255; another may have a min of 100 and a max of 400)
- Use Min-Max scaling on one column (the range should shrink to 0-1)
- Use Z-score normalization (also known as standardization) on the other column
- Add the two scaled results to the dataset as two new columns
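
A minimal sketch of both scaling methods on toy data follows. The column names `price_numeric` and `minimum_nights`, the new column names, and the values are assumptions for illustration; substitute the columns you chose.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy data (made-up values) with two columns on very different scales
toy = pd.DataFrame({
    "price_numeric": [47.0, 60.0, 75.0, 110.0, 192.0],
    "minimum_nights": [1, 2, 2, 7, 30],
})

# Min-Max scaling: (x - min) / (max - min), so results lie in [0, 1]
toy["price_minmax"] = MinMaxScaler().fit_transform(toy[["price_numeric"]]).ravel()

# Z-score standardization: (x - mean) / std, so results have mean 0 and std 1
# (StandardScaler uses the population standard deviation, i.e. ddof=0)
toy["min_nights_zscore"] = StandardScaler().fit_transform(toy[["minimum_nights"]]).ravel()

print(toy)
```

Both scalers expect a 2-D input, which is why the column is selected with double brackets; `.ravel()` flattens the result back into a single column.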

### In Your Response:
1. What two columns did you scale, and which methods did you use?
2. When might these scaled values be more useful than the originals?
3. Who at Airbnb might benefit from this transformation and why?

In [None]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Min-Max scale


# Z-score scale



### ✍️ Your Response: 🔧
1.

2.

3.

## 4. Group a Numeric Column into Categories

Business framing:  

Let’s say an Airbnb marketing team wants to segment listings by review activity. They don’t want exact numbers — they just want to know if a listing has “low,” “medium,” or “high” review volume.

Follow these steps:

- Choose a numeric column that could be grouped (e.g., reviews, availability).
- Group the values of this column into 3 or 4 bins
- Create a new column whose values are the bin labels (e.g., “Low”, “Medium”, and “High”; add a fourth label if you use four bins)
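
The binning steps could look something like the sketch below. The column name `number_of_reviews`, the cut points, and the toy values are assumptions for illustration; choose edges that make sense for your own data.

```python
import pandas as pd

# Toy review counts (made-up values)
toy = pd.DataFrame({"number_of_reviews": [0, 3, 8, 15, 40, 75, 120, 300]})

# Fixed-edge bins: (-1, 10] -> Low, (10, 50] -> Medium, (50, inf] -> High
# The cut points 10 and 50 are arbitrary choices for this example
toy["review_volume"] = pd.cut(
    toy["number_of_reviews"],
    bins=[-1, 10, 50, float("inf")],
    labels=["Low", "Medium", "High"],
)

# Alternative: quantile bins put roughly equal numbers of listings in each group
toy["review_volume_q"] = pd.qcut(
    toy["number_of_reviews"], q=3, labels=["Low", "Medium", "High"]
)

print(toy)
```

`pd.cut` is a reasonable fit when the thresholds have business meaning, while `pd.qcut` is useful when you want groups of similar size.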

### In Your Response:
1. What column did you group, and how many categories did you use?
2. Why might someone prefer this grouped view over raw numbers?
3. Who would this help at Airbnb, and how?


In [None]:
# Add code here 🔧

### ✍️ Your Response: 🔧
1.

2.

3.

## 5. Create Two New Business-Relevant Variables

Business framing:  

Stakeholders often want to know things like: What’s the cost per night? Are listings geared toward long-term stays? These kinds of features aren’t always in the dataset — analysts create them.

Follow these steps:

- Think of two new columns you can create using the data you already have.
  - One might be a ratio or interaction between columns (e.g., price ÷ nights).
  - The other might be a flag based on a condition (e.g., stays longer than 30 days).
- Add the new columns to your DataFrame.
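
As a hedged illustration of one ratio feature and one flag feature, the sketch below uses toy data. The columns `price_numeric`, `accommodates`, and `minimum_nights`, the new column names, and the 30-night threshold are assumptions; a price-per-guest ratio is used here as one alternative to the price ÷ nights example.

```python
import pandas as pd

# Toy listings (made-up values)
toy = pd.DataFrame({
    "price_numeric": [70.0, 104.0, 192.0, 60.0],
    "accommodates": [2, 4, 6, 1],
    "minimum_nights": [1, 2, 30, 45],
})

# Ratio feature: nightly price per guest the listing can host
# (a zero in accommodates would produce inf, so check for zeros in real data)
toy["price_per_guest"] = toy["price_numeric"] / toy["accommodates"]

# Flag feature: True when the minimum stay is 30 nights or more
toy["is_long_term"] = toy["minimum_nights"] >= 30

print(toy)
```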

### In Your Response:
1. What two new columns did you create?
2. Who would use them (e.g., host, manager, or platform)?
3. How could they help someone make a better decision?

In [None]:
# Add code here 🔧

### ✍️ Your Response: 🔧
1.

2.

3.



## 6. Encode a Categorical Column

Business framing:  

Let’s say you’re helping the Airbnb data science team build a model to predict booking rates. Most modeling libraries expect numeric inputs, so categorical columns like `room_type`, `neighbourhood`, or `cancellation_policy` usually can’t be used until they’re converted to numbers.

- Choose one categorical column from your dataset (e.g., room type or neighborhood group)
- Decide on an encoding method:
  - Use one-hot encoding for nominal (unordered) categories
  - Use ordinal encoding (a ranking) only if the categories have a clear order
- Apply the encoding using `pandas` or another tool
- Add the new encoded column(s) to your DataFrame
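
One way to apply one-hot encoding with `pandas` is sketched below on toy data. The `room_type` values and the `room_type_` prefix are assumptions for illustration; your dataset may contain different categories.

```python
import pandas as pd

# Toy listings with a nominal (unordered) category
toy = pd.DataFrame(
    {"room_type": ["Entire home/apt", "Private room", "Shared room", "Private room"]}
)

# One-hot encoding: one 0/1 indicator column per category
dummies = pd.get_dummies(toy["room_type"], prefix="room_type", dtype=int)

# Keep the original column and append the indicator columns next to it
toy = pd.concat([toy, dummies], axis=1)

print(toy)
```

Each row has exactly one indicator set to 1. For some models you may prefer `drop_first=True` in `pd.get_dummies` to avoid a redundant column.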

### In Your Response:
1. What column did you encode and why?
2. What encoding method did you use?
3. How could this transformation help a pricing model, dashboard, or business report?



In [None]:
# Add code here 🔧

### ✍️ Your Response: 🔧
1.

2.

3.

## 7. Reflection

You’ve applied the same kinds of transformation techniques used in real Airbnb analytics projects — from pricing engines to host tools to tourism dashboards.

Now step back and reflect.

### In Your Response:
1. What transformation step felt most important or interesting?
2. Which of your changes would be most useful to a host, analyst, or city planner?
3. If you were going to build a tool or dashboard, what would you do next with this data?
4. How does this relate to the customized learning outcome you created in Canvas?



### ✍️ Your Response: 🔧

1.

2.

3.

4.



## Submission Instructions
✅ Checklist:
- All code cells run without error
- All markdown responses are complete
- Submit on Canvas as instructed

In [None]:
!jupyter nbconvert --to html "assignment_07_LastnameFirstname.ipynb"