# Section B: Practical questions with applied multiple choice

## General Rules:
- This is an open book examination.
- Students may make use of a calculator.
- This is an online examination where you will access a computer; however, you may not communicate with other students in any form.
- Headphones are prohibited.
- The use of AI (ChatGPT etc.) is prohibited.
- All cell phones are to be switched off for the duration of the exam.
- The invigilator will not assist you with the explanation of questions.
- Students are prohibited from conversing in any manner with other students.

## My Name and Surname

Name =
<br>
Surname =  

### Part 1: SQL Queries  
You are provided with a pre-populated SQLite database named `airbnb.db`. Download [here](https://www.kaggle.com/datasets/arianazmoudeh/airbnbopendata) if you haven't already. Your task is to explore this database and write a series of SQL queries to perform the tasks detailed below. Queries should be optimised to run in 20 seconds or less.

The tables and columns included in the `airbnb.db` are:

- `listings`: `id`, `host_id`, `name`, `neighbourhood_id`, `latitude`, `longitude`, `room_type_id`, `construction_year`, `number_of_reviews`, `last_review`, `reviews_per_month`, `review_rate_number`, `calculated_host_listings_count`, `availability_365`, `instant_bookable`, `cancellation policy`, `house_rules`, `license`
- `hosts`: `id`, `name`, `identity_verified`
- `neighbourhoods`: `id`, `name`, `neighbourhood_group_id`
- `neighbourhood_groups`: `id`, `name`
- `room_types`: `id`, `type`
- `cancellation_policies`: `id`, `policy`

In [1]:
!pip install mysql-connector-python

Collecting mysql-connector-python
  Downloading mysql_connector_python-9.3.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (7.2 kB)
Downloading mysql_connector_python-9.3.0-cp311-cp311-manylinux_2_28_x86_64.whl (33.9 MB)
Successfully installed mysql-connector-python-9.3.0


In [2]:
import os
import json
import random
import sqlite3
import sqlparse
import pandas as pd
import numpy as np

import seaborn as sns
import mysql.connector

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

import matplotlib.pyplot as plt

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
# Load your database and create a database connection.
# You can connect to the sql database in any way you wish.
# Use this method if you are unsure how to proceed.
# Ensure the database file exists at the path assigned to db_path below.
try:
    db_path = "/content/drive/MyDrive/airbnb_nyc.db"  # full path on Google Drive
    with sqlite3.connect(db_path) as conn:
        print(f"Opened SQLite database with version {sqlite3.sqlite_version} successfully.")

        # List all tables in the database
        tables_df = pd.read_sql('''SELECT name FROM sqlite_master WHERE type='table';''', conn)
        print(tables_df)

except sqlite3.OperationalError as e:
    print("Failed to open database:", e)


Opened SQLite database with version 3.37.2 successfully.
                    name
0   neighbourhood_groups
1         neighbourhoods
2             room_types
3  cancellation_policies
4                  hosts
5               listings
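One caveat about the connection cell: `sqlite3.connect` creates a new, empty database file when the path does not exist, so a mistyped path tends to show up as an empty table list rather than as an `OperationalError`. A minimal sketch of one way to guard against that is to open the file read-only through an SQLite URI. The helper name `open_existing_db` and the demo file name are hypothetical.

```python
import sqlite3
from pathlib import Path


def open_existing_db(db_path):
    """Open an SQLite file read-only so a missing file raises an error
    instead of being silently created as an empty database."""
    # file:///...?mode=ro asks SQLite to open an existing file read-only.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)


# Demonstration with a file name that is assumed not to exist.
try:
    open_existing_db("no_such_file_for_demo.db")
    opened = True
except sqlite3.OperationalError as e:
    opened = False
    print("Failed to open database:", e)
```

Read-only mode also protects the exam database from accidental writes, at the cost of not being able to create temporary tables in it.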


**1. Listing Availability**

**1.1 Listings available all year**

In [5]:
query = """
        SELECT COUNT(*) AS full_year_listings
        FROM listings
        WHERE availability_365 = 365;"""
q = pd.read_sql(query, conn)

# Show result
q.head()

Unnamed: 0,full_year_listings
0,573


**1.2 Neighborhoods with highest availability**

In [6]:
query = """
        SELECT n.name AS neighborhood, ROUND(AVG(l.availability_365), 1) AS avg_availability
        FROM listings l
        JOIN neighbourhoods n ON l.neighbourhood_id = n.id
        GROUP BY neighborhood
        ORDER BY avg_availability DESC;"""
q = pd.read_sql(query, conn)

# Show result
q.head()


Unnamed: 0,neighborhood,avg_availability
0,Shore Acres,396.0
1,Willowbrook,351.0
2,Midland Beach,326.3
3,Olinville,312.0
4,Mill Basin,307.0
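The top average above (396.0) is larger than 365, and section 6.1 lists negative values, so `availability_365` appears to contain out-of-range entries that distort the ranking. A minimal sketch of the same query restricted to plausible values follows. The in-memory tables and their rows are hypothetical stand-ins for `airbnb.db`, and `demo` and `filtered` are illustrative names.

```python
import sqlite3
import pandas as pd

# Toy stand-in for the real database (hypothetical rows, illustration only).
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE neighbourhoods (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE listings (
        id INTEGER PRIMARY KEY,
        neighbourhood_id INTEGER,
        availability_365 INTEGER
    );
    INSERT INTO neighbourhoods VALUES (1, 'A'), (2, 'B');
    INSERT INTO listings VALUES
        (1, 1, 396), (2, 1, 100), (3, 1, 200),
        (4, 2, -5),  (5, 2, 300);
""")

query = """
    SELECT n.name AS neighborhood,
           ROUND(AVG(l.availability_365), 1) AS avg_availability,
           COUNT(*) AS valid_listings
    FROM listings l
    JOIN neighbourhoods n ON l.neighbourhood_id = n.id
    WHERE l.availability_365 BETWEEN 0 AND 365   -- drop out-of-range values
    GROUP BY n.id, n.name
    ORDER BY avg_availability DESC;"""
filtered = pd.read_sql(query, demo)
print(filtered)
```

Reporting `valid_listings` next to the average makes it easier to spot neighbourhoods whose ranking rests on only a handful of rows.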


**1.3 Availability vs. price vs. reviews correlation**

In [7]:
query = """
        SELECT
          ROUND(AVG(availability_365), 1) AS avg_availability,
          ROUND(AVG(price), 2) AS avg_price,
          ROUND(AVG(reviews_per_month), 2) AS avg_reviews
        FROM listings;"""
q = pd.read_sql(query, conn)

# Show result
q.head()


Unnamed: 0,avg_availability,avg_price,avg_reviews
0,135.1,623.6,1.35
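The query above returns three overall averages, which describe each column separately but do not measure how the columns move together. A minimal sketch of computing pairwise Pearson correlations with pandas follows. The in-memory table and its rows are hypothetical stand-ins for `listings`, and `demo` and `corr_matrix` are illustrative names.

```python
import sqlite3
import pandas as pd

# Toy stand-in for the listings table (hypothetical rows, illustration only).
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE listings (
        availability_365 INTEGER,
        price REAL,
        reviews_per_month REAL
    );
    INSERT INTO listings VALUES
        (10, 100.0, 4.0), (20, 200.0, 3.0), (30, 300.0, 2.0),
        (40, 400.0, 1.0), (50, NULL, 0.5);
""")

# Pull the raw columns and let pandas compute the correlation matrix.
cols = pd.read_sql(
    "SELECT availability_365, price, reviews_per_month FROM listings;", demo
)
# DataFrame.corr skips missing values pair by pair.
corr_matrix = cols.corr(method="pearson")
print(corr_matrix.round(2))
```

Values near 1 or -1 indicate a strong linear relationship and values near 0 a weak one. On the real table the same two calls should work against `conn`.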


**2. Pricing Patterns**

**2.1 Average price by room type**

In [8]:
query = """
        SELECT rt.type AS room_type, ROUND(AVG(l.price), 2) AS avg_price
        FROM listings l
        JOIN room_types rt ON l.room_type_id = rt.id
        GROUP BY room_type
        ORDER BY avg_price DESC;"""
q = pd.read_sql(query, conn)

# Show result
q.head()

Unnamed: 0,room_type,avg_price
0,Shared room,650.88
1,Private room,623.8
2,Entire home/apt,622.3


**2.2 Most expensive neighborhoods**

In [9]:
query = """
        SELECT n.name AS neighborhood, MAX(l.price) AS max_price
        FROM listings l
        JOIN neighbourhoods n ON l.neighbourhood_id = n.id
        GROUP BY neighborhood
        ORDER BY max_price DESC;"""
q = pd.read_sql(query, conn)

# Show result
q.head()

Unnamed: 0,neighborhood,max_price
0,Prospect-Lefferts Gardens,1200.0
1,Harlem,1200.0
2,Greenwich Village,1200.0
3,Gramercy,1200.0
4,East New York,1200.0


**2.3 Overpriced areas with low review activity**

In [10]:
query = """
        SELECT n.name AS neighborhood, ROUND(AVG(l.price), 2) AS avg_price, AVG(l.number_of_reviews) AS avg_reviews
        FROM listings l
        JOIN neighbourhoods n ON l.neighbourhood_id = n.id
        GROUP BY neighborhood
        HAVING avg_reviews < 10
        ORDER BY avg_price DESC;"""
q = pd.read_sql(query, conn)

# Show result
q.head()

Unnamed: 0,neighborhood,avg_price,avg_reviews
0,Little Neck,1030.67,8.333333
1,New Springville,932.0,8.0
2,Sea Gate,673.0,2.666667
3,Stuyvesant Town,625.58,4.263158
4,Unionport,613.25,9.0


**2.4 Price trends by minimum stay (short vs long)**

In [11]:
query = """
        SELECT
          CASE
            WHEN minimum_nights <= 3 THEN 'Short Stay'
            WHEN minimum_nights <= 14 THEN 'Medium Stay'
            ELSE 'Long Stay'
          END AS stay_type,
          ROUND(AVG(price), 2) AS avg_price
        FROM listings
        GROUP BY stay_type;"""
q = pd.read_sql(query, conn)

# Show result
q.head()


Unnamed: 0,stay_type,avg_price
0,Long Stay,622.68
1,Medium Stay,621.01
2,Short Stay,624.51


**3. Host Activity**

**3.1 Hosts with most listings**

In [12]:
query = """
        SELECT h.name AS host_name, COUNT(*) AS listings_count
        FROM listings l
        JOIN hosts h ON l.host_id = h.id
        GROUP BY host_name
        ORDER BY listings_count DESC
        LIMIT 10;"""
q = pd.read_sql(query, conn)

# Show result
q.head()

Unnamed: 0,host_name,listings_count
0,Michael,337
1,David,329
2,John,233
3,Alex,199
4,Anna,183
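Because the query groups by `h.name`, every host who shares a first name is merged into one row, so a figure such as 337 for "Michael" may combine many separate hosts. A minimal sketch that counts listings per host `id` instead follows. The in-memory tables and rows are hypothetical, and `demo` and `per_host` are illustrative names.

```python
import sqlite3
import pandas as pd

# Toy stand-in: two different hosts share the name 'Michael'.
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE hosts (id INTEGER PRIMARY KEY, name TEXT, identity_verified INTEGER);
    CREATE TABLE listings (id INTEGER PRIMARY KEY, host_id INTEGER);
    INSERT INTO hosts VALUES (1, 'Michael', 1), (2, 'Michael', 0), (3, 'Anna', 1);
    INSERT INTO listings VALUES (1, 1), (2, 1), (3, 2), (4, 3), (5, 3), (6, 3);
""")

query = """
    SELECT h.id AS host_id, h.name AS host_name, COUNT(*) AS listings_count
    FROM listings l
    JOIN hosts h ON l.host_id = h.id
    GROUP BY h.id, h.name            -- one row per host, not per name
    ORDER BY listings_count DESC, h.id
    LIMIT 10;"""
per_host = pd.read_sql(query, demo)
print(per_host)
```

In this toy data, grouping by name would report three listings for "Michael", while grouping by `id` shows two hosts with two and one listings.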


**3.2 Verified vs. unverified hosts**

In [13]:
query = """
        SELECT identity_verified, COUNT(*) AS total_hosts
        FROM hosts
        GROUP BY identity_verified;"""
q = pd.read_sql(query, conn)

# Show result
q.head()

Unnamed: 0,identity_verified,total_hosts
0,0,19582
1,1,19832


**3.3 Review frequency by host type (verified vs not)**

In [14]:
query = """
        SELECT h.identity_verified, ROUND(AVG(l.reviews_per_month), 2) AS avg_reviews_per_month
        FROM listings l
        JOIN hosts h ON l.host_id = h.id
        GROUP BY h.identity_verified;"""
q = pd.read_sql(query, conn)

# Show result
q.head()

Unnamed: 0,identity_verified,avg_reviews_per_month
0,0,1.36
1,1,1.34


**3.4 Detect large operators (hosts with ≥ 10 listings)**

In [15]:
query = """
        SELECT h.name AS host_name, COUNT(*) AS total_listings
        FROM listings l
        JOIN hosts h ON l.host_id = h.id
        GROUP BY host_name
        HAVING total_listings >= 10
        ORDER BY total_listings DESC;"""
q = pd.read_sql(query, conn)

# Show result
q.head()

Unnamed: 0,host_name,total_listings
0,Michael,337
1,David,329
2,John,233
3,Alex,199
4,Anna,183


**4. Neighbourhood Performance**

**4.1 Which neighbourhoods have the most listings?**
This query ranks all neighbourhoods by their number of listings, from most to fewest.

In [16]:
query = """
        SELECT n.name AS neighborhood, COUNT(*) AS total_listings
        FROM listings l
        JOIN neighbourhoods n ON l.neighbourhood_id = n.id
        GROUP BY neighborhood
        ORDER BY total_listings DESC;"""
q = pd.read_sql(query, conn)

# Show result
q.head()


Unnamed: 0,neighborhood,total_listings
0,Bedford-Stuyvesant,3244
1,Williamsburg,3158
2,Harlem,2269
3,Bushwick,1984
4,Hell's Kitchen,1525


**4.2 Most reviewed listings by neighborhood**

In [17]:
query = """
        SELECT n.name AS neighborhood, SUM(l.number_of_reviews) AS total_reviews
        FROM listings l
        JOIN neighbourhoods n ON l.neighbourhood_id = n.id
        GROUP BY neighborhood
        ORDER BY total_reviews DESC;"""
q = pd.read_sql(query, conn)

# Show result
q.head()

Unnamed: 0,neighborhood,total_reviews
0,Bedford-Stuyvesant,111677
1,Williamsburg,76704
2,Harlem,71365
3,Bushwick,52861
4,Hell's Kitchen,51282


**4.3 Average price per neighborhood**

In [18]:
query = """
        SELECT n.name AS neighborhood, ROUND(AVG(l.price), 2) AS avg_price
        FROM listings l
        JOIN neighbourhoods n ON l.neighbourhood_id = n.id
        GROUP BY neighborhood
        ORDER BY avg_price DESC;"""
q = pd.read_sql(query, conn)

# Show result
q.head()

Unnamed: 0,neighborhood,avg_price
0,Little Neck,1030.67
1,New Springville,932.0
2,New Dorp Beach,924.33
3,Jamaica Hills,918.25
4,Shore Acres,913.0


**4.4 Dominant room type per neighborhood**

In [19]:
query = """
        SELECT n.name AS neighborhood, rt.type AS most_common_room_type, COUNT(*) AS count
        FROM listings l
        JOIN neighbourhoods n ON l.neighbourhood_id = n.id
        JOIN room_types rt ON l.room_type_id = rt.id
        GROUP BY neighborhood, most_common_room_type
        ORDER BY neighborhood, count DESC;"""
q = pd.read_sql(query, conn)

# Show result
q.head()


Unnamed: 0,neighborhood,most_common_room_type,count
0,Allerton,Entire home/apt,19
1,Allerton,Private room,15
2,Arden Heights,Entire home/apt,3
3,Arden Heights,Private room,2
4,Arrochar,Entire home/apt,12
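The result above lists every room type in each neighbourhood, so the dominant type still has to be read off the first row of each group. A minimal sketch that keeps only the top room type per neighbourhood with `ROW_NUMBER()` follows. It assumes SQLite 3.25 or newer for window functions (the notebook reports 3.37.2). The in-memory tables, their rows, and the names `demo` and `dominant` are hypothetical.

```python
import sqlite3
import pandas as pd

# Toy stand-in for the real tables (hypothetical rows, illustration only).
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE neighbourhoods (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE room_types (id INTEGER PRIMARY KEY, type TEXT);
    CREATE TABLE listings (
        id INTEGER PRIMARY KEY, neighbourhood_id INTEGER, room_type_id INTEGER
    );
    INSERT INTO neighbourhoods VALUES (1, 'North'), (2, 'South');
    INSERT INTO room_types VALUES (1, 'Entire home/apt'), (2, 'Private room');
    INSERT INTO listings VALUES
        (1, 1, 1), (2, 1, 1), (3, 1, 2),
        (4, 2, 2), (5, 2, 2), (6, 2, 1);
""")

query = """
    WITH counts AS (
        SELECT n.name AS neighborhood, rt.type AS room_type,
               COUNT(*) AS listing_count
        FROM listings l
        JOIN neighbourhoods n ON l.neighbourhood_id = n.id
        JOIN room_types rt ON l.room_type_id = rt.id
        GROUP BY n.name, rt.type
    ),
    ranked AS (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY neighborhood
                   ORDER BY listing_count DESC, room_type
               ) AS rn
        FROM counts
    )
    SELECT neighborhood, room_type AS most_common_room_type, listing_count
    FROM ranked
    WHERE rn = 1                     -- keep the top room type only
    ORDER BY neighborhood;"""
dominant = pd.read_sql(query, demo)
print(dominant)
```

Ties are broken alphabetically by room type here; `RANK()` would be the alternative if tied room types should all be reported.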


**5. Review Behaviour**

**5.1 Most engaging listings (by reviews per month)**

In [20]:
query = """
        SELECT name, reviews_per_month
        FROM listings
        ORDER BY reviews_per_month DESC
        LIMIT 10;"""
q = pd.read_sql(query, conn)

# Show result
q.head()


Unnamed: 0,name,reviews_per_month
0,Enjoy great views of the City in our Deluxe Room!,58.5
1,Great Room in the heart of Times Square!,27.95
2,Studio Apartment 6 minutes from JFK Airport,15.32
3,6 Minutes From JFK Airport Cozy Bedroom,15.23
4,Nice Room 1 block away from Times Square action!,14.62


**5.2 Correlation: Reviews per month vs rating**

In [21]:
query = """
        SELECT ROUND(AVG(reviews_per_month), 2) AS avg_reviews, ROUND(AVG(review_rate_number), 2) AS avg_rating
        FROM listings
        WHERE reviews_per_month IS NOT NULL AND review_rate_number IS NOT NULL;"""
q = pd.read_sql(query, conn)

# Show result
q.head()


Unnamed: 0,avg_reviews,avg_rating
0,1.35,3.22


**5.3 Review trends over time (recent vs old listings)**

In [22]:
query = """
        SELECT
          CASE
            WHEN last_review >= '2023-01-01' THEN 'Recent'
            ELSE 'Old'
          END AS review_period,
          ROUND(AVG(reviews_per_month), 2) AS avg_reviews
        FROM listings
        GROUP BY review_period;"""
q = pd.read_sql(query, conn)

# Show result
q.head()


Unnamed: 0,review_period,avg_reviews
0,Old,1.35
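Only an "Old" bucket appears above, which means no `last_review` value compares as greater than or equal to `'2023-01-01'`. One possible alternative is to derive the cutoff from the newest review in the data rather than hard-coding a date. The sketch below assumes `last_review` is stored as ISO-8601 text (`YYYY-MM-DD`), which the source does not confirm. The one-year window, the toy rows, and the names `demo` and `periods` are hypothetical.

```python
import sqlite3
import pandas as pd

# Toy stand-in for the listings table (hypothetical rows, illustration only).
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE listings (
        id INTEGER PRIMARY KEY, last_review TEXT, reviews_per_month REAL
    );
    INSERT INTO listings VALUES
        (1, '2022-06-01', 3.0), (2, '2022-01-15', 2.0),
        (3, '2019-03-10', 0.5), (4, '2018-07-04', 1.5),
        (5, NULL, NULL);
""")

query = """
    WITH cutoff AS (
        -- one year before the most recent review in the data
        SELECT date(MAX(last_review), '-1 year') AS day FROM listings
    )
    SELECT
      CASE
        WHEN l.last_review IS NULL THEN 'Never reviewed'
        WHEN l.last_review >= c.day THEN 'Recent'
        ELSE 'Old'
      END AS review_period,
      COUNT(*) AS n_listings,
      ROUND(AVG(l.reviews_per_month), 2) AS avg_reviews
    FROM listings l
    CROSS JOIN cutoff c
    GROUP BY review_period
    ORDER BY review_period;"""
periods = pd.read_sql(query, demo)
print(periods)
```

Separating listings that were never reviewed keeps them from being counted as "Old", since a `NULL` date fails the `>=` comparison and would otherwise fall into the `ELSE` branch.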


**6. Exposing The Operators**

**6.1 Listings likely to be ghost/idle (no availability)**

In [23]:
query = """
        SELECT name, availability_365
        FROM listings
        WHERE availability_365 < 0;"""
q = pd.read_sql(query, conn)

# Show result
q.head()


Unnamed: 0,name,availability_365
0,Columbus Circle Luxury Bldg - Private Room&Bath,-4
1,Bright Modern Charming Housebarge,-8
2,Superior @ Box House,-2
3,Spacious Townhome Apt in Brooklyn,-9
4,SUPER CUTE EAST VILLAGE APARTMENT,-9
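The filter `availability_365 < 0` returns rows whose availability is negative, which looks like invalid data rather than idle listings; a listing with no availability would have the value 0. A minimal sketch that counts these cases separately follows. The bucket labels, the toy rows, and the names `demo` and `status` are hypothetical.

```python
import sqlite3
import pandas as pd

# Toy stand-in for the listings table (hypothetical rows, illustration only).
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE listings (
        id INTEGER PRIMARY KEY, name TEXT, availability_365 INTEGER
    );
    INSERT INTO listings VALUES
        (1, 'a', -4), (2, 'b', 0), (3, 'c', 0),
        (4, 'd', 120), (5, 'e', 396), (6, 'f', NULL);
""")

query = """
    SELECT
      CASE
        WHEN availability_365 IS NULL THEN 'missing'
        WHEN availability_365 < 0 THEN 'invalid (negative)'
        WHEN availability_365 = 0 THEN 'idle (0 days)'
        WHEN availability_365 > 365 THEN 'invalid (over 365)'
        ELSE 'valid (1-365)'
      END AS availability_status,
      COUNT(*) AS n_listings
    FROM listings
    GROUP BY availability_status
    ORDER BY availability_status;"""
status = pd.read_sql(query, demo)
print(status)
```

With the buckets in place, the idle listings can be selected with `WHERE availability_365 = 0`, and the invalid rows can be reviewed or excluded on their own.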


**6.2 Potential shell operators (hosts with many listings + unverified)**

In [24]:
query = """
        SELECT h.name AS host_name, h.identity_verified, COUNT(*) AS listings_count
        FROM listings l
        JOIN hosts h ON l.host_id = h.id
        GROUP BY host_name, h.identity_verified
        HAVING listings_count >= 1 AND h.identity_verified = 'no';"""
q = pd.read_sql(query, conn)

# Show result
q.head()


Unnamed: 0,host_name,identity_verified,listings_count
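The empty result above is consistent with the output of 3.2, where `identity_verified` holds 0 and 1: comparing that column with the text `'no'` matches no rows. The condition `listings_count >= 1` also does not restrict the result to hosts with many listings. A minimal sketch under those observations follows. It reuses the "≥ 10 listings" threshold from 3.4 as an assumption, groups by host `id`, and runs on hypothetical toy rows; `demo`, `MIN_LISTINGS`, and `suspects` are illustrative names.

```python
import sqlite3
import pandas as pd

# Toy stand-in for the real tables (hypothetical rows, illustration only).
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE hosts (id INTEGER PRIMARY KEY, name TEXT, identity_verified INTEGER);
    CREATE TABLE listings (id INTEGER PRIMARY KEY, host_id INTEGER);
    INSERT INTO hosts VALUES (1, 'Alex', 0), (2, 'Alex', 1), (3, 'Sam', 0);
""")
# Host 1 (unverified): 12 listings; host 2 (verified): 15; host 3 (unverified): 2.
demo.executemany(
    "INSERT INTO listings (host_id) VALUES (?)",
    [(1,)] * 12 + [(2,)] * 15 + [(3,)] * 2,
)

MIN_LISTINGS = 10  # assumed threshold, mirroring the definition used in 3.4

query = """
    SELECT h.id AS host_id, h.name AS host_name, h.identity_verified,
           COUNT(*) AS listings_count
    FROM listings l
    JOIN hosts h ON l.host_id = h.id
    WHERE h.identity_verified = 0        -- 0 = unverified, as seen in 3.2
    GROUP BY h.id, h.name, h.identity_verified
    HAVING COUNT(*) >= ?
    ORDER BY listings_count DESC;"""
suspects = pd.read_sql(query, demo, params=(MIN_LISTINGS,))
print(suspects)
```

Filtering on `identity_verified` in `WHERE` removes verified hosts before aggregation, and `HAVING` then keeps only the unverified hosts at or above the threshold.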
