# Section B: Practical questions with applied multiple choice

## General Rules:
- This is an open book examination.
- Students may make use of a calculator.
- This is an online examination where you will access a computer; however, you may not communicate with other students in any form.
- Headphones are prohibited.
- The use of AI tools (ChatGPT etc.) is prohibited.
- All cell phones are to be switched off for the duration of the exam.
- The invigilator will not assist you with the explanation of questions.
- Students are prohibited from conversing in any manner with other students.

## My Name and Surname

Name =
<br>
Surname =  

### Part 1: SQL Queries  
You are provided with a pre-populated SQLite database named `airbnb.db`. Download it [here](https://www.kaggle.com/datasets/arianazmoudeh/airbnbopendata) if you haven't already. Your task is to explore this database and write a series of SQL queries to perform the tasks detailed below. Queries should be optimised to run in 20 seconds or less.

The tables and columns included in the `airbnb.db` are:

- `listings`: `id`, `host_id`, `name`, `neighbourhood_id`, `latitude`, `longitude`, `room_type_id`, `construction_year`, `number_of_reviews`, `last_review`, `reviews_per_month`, `review_rate_number`, `calculated_host_listings_count`, `availability_365`, `instant_bookable`, `cancellation policy`, `house_rules`, `license`
- `hosts`: `id`, `name`, `identity_verified`
- `neighbourhoods`: `id`, `name`, `neighbourhood_group_id`
- `neighbourhood_groups`: `id`, `name`
- `room_types`: `id`, `type`
- `cancellation_policies`: `id`, `policy`
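
Because a written column list can drift from the actual file, it may help to read the schema from the database itself. The sketch below is an illustration only: it builds a small in-memory database with two hypothetical tables as a stand-in for `airbnb.db`, and the helper name `table_columns` is an assumption, not part of the exam material.

```python
import sqlite3

# Hypothetical in-memory stand-in for airbnb.db; the real file has more tables and columns.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE room_types (id INTEGER PRIMARY KEY, type TEXT)")
demo_conn.execute(
    "CREATE TABLE hosts (id INTEGER PRIMARY KEY, name TEXT, identity_verified TEXT)"
)

def table_columns(connection, table):
    """Return the column names of `table` as reported by PRAGMA table_info."""
    # PRAGMA statements do not accept bound parameters, so the name is interpolated.
    # Only pass table names you trust, such as names read from sqlite_master.
    rows = connection.execute(f'PRAGMA table_info("{table}")').fetchall()
    return [row[1] for row in rows]  # index 1 holds the column name

# Collect every table name, then look up its columns.
table_names = [
    r[0]
    for r in demo_conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
    )
]
schema = {t: table_columns(demo_conn, t) for t in table_names}
print(schema)
```

Pointing the same helper at a connection to the real database file would list the columns actually available to your queries.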

In [25]:
!pip install mysql-connector-python




In [26]:
import os
import json
import random
import sqlite3
import sqlparse
import pandas as pd
import numpy as np

import seaborn as sns
import mysql.connector

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

import matplotlib.pyplot as plt

In [27]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [28]:
# Load your database and create a database connection.
# You can connect to the sql database in any way you wish.
# Use this method if you are unsure how to proceed.
# Ensure the database file exists at the db_path set below.
try:
    db_path = "/content/drive/MyDrive/airbnb_nyc.db"  # full path on Google Drive
    with sqlite3.connect(db_path) as conn:
        print(f"Opened SQLite database with version {sqlite3.sqlite_version} successfully.")

        # List all tables in the database
        tables_df = pd.read_sql('''SELECT name FROM sqlite_master WHERE type='table';''', conn)
        print(tables_df)

except sqlite3.OperationalError as e:
    print("Failed to open database:", e)


Opened SQLite database with version 3.37.2 successfully.
                    name
0   neighbourhood_groups
1         neighbourhoods
2             room_types
3  cancellation_policies
4                  hosts
5               listings


In [90]:
table_name = 'listings'

with sqlite3.connect(db_path) as conn:
    df = pd.read_sql(f"SELECT * FROM {table_name} LIMIT 5;", conn)
    print(f"First 5 rows of table: {table_name}")
    print(df)


First 5 rows of table: listings
        id                                              name     host_id  \
0  1001254                Clean & quiet apt home by the park -1589892906   
1  1002102                             Skylit Midtown Castle   795565271   
2  1003689  Entire Apt: Spacious Studio/Loft by central park  1843282861   
3  1004098         Large Cozy 1 BR Apartment In Midtown East -1746088462   
4  1005202                                   BlissArtsSpace!   627526493   

   neighbourhood_id  latitude  longitude country country_code  room_type_id  \
0                 1  40.64749  -73.97237       1            1             1   
1                 2  40.75362  -73.98377       1            1             2   
2                 3  40.79851  -73.94399       1            1             2   
3                 4  40.74767  -73.97500       1            1             2   
4                 5  40.68688  -73.95596       1            1             1   

   construction_year  ...  number_of

**1. How many listings are there in the listings table?**
This query counts the rows in the `listings` table. The resulting total is a quick measure of the dataset's size.

In [40]:
query = """
        SELECT COUNT(*) AS total_listings
        FROM listings;"""
q = pd.read_sql(query, conn)

# Show result
q.head()

Unnamed: 0,total_listings
0,39415


**2. What is the average price of all listings?**
This query computes the average `price` across all listings, giving a sense of the typical cost of an Airbnb in New York City.


In [41]:
query = """
        SELECT AVG(price) AS average_price
        FROM listings;"""
q = pd.read_sql(query, conn)

# Show result
q.head()



Unnamed: 0,average_price
0,623.597564
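
One detail worth keeping in mind when reading this average: in SQLite, `AVG(column)` and `COUNT(column)` skip `NULL` values, while `COUNT(*)` counts every row. The sketch below demonstrates this on a tiny in-memory table with made-up prices; it does not use the exam database, and whether the real `price` column contains `NULL`s is not something the notebook shows.

```python
import sqlite3

# Hypothetical three-row table; one listing has no price recorded.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE listings (id INTEGER PRIMARY KEY, price REAL)")
demo_conn.executemany(
    "INSERT INTO listings VALUES (?, ?)",
    [(1, 100.0), (2, 200.0), (3, None)],
)

# AVG and COUNT(price) ignore the NULL row; COUNT(*) includes it.
avg_price, priced_rows, total_rows = demo_conn.execute(
    "SELECT AVG(price), COUNT(price), COUNT(*) FROM listings"
).fetchone()
print(avg_price, priced_rows, total_rows)
```

Comparing `COUNT(price)` with `COUNT(*)` on the real table would show how many listings the average leaves out.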


**3. How many hosts are in the dataset?**
This query counts the distinct `host_id` values in `listings`, which measures how many hosts have at least one listing.

In [44]:
query = """
        SELECT COUNT(DISTINCT host_id) AS total_hosts
        FROM listings;"""
q = pd.read_sql(query, conn)

# Show result
q.head()

Unnamed: 0,total_hosts
0,39414
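
Counting distinct `host_id` values in `listings` and counting rows in `hosts` answer slightly different questions; the two agree only when every host has at least one listing. The sketch below shows the difference on hypothetical in-memory data and makes no claim about which number the real database gives.

```python
import sqlite3

# Hypothetical data: three hosts, but only two of them own listings.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE hosts (id INTEGER PRIMARY KEY, name TEXT)")
demo_conn.execute("CREATE TABLE listings (id INTEGER PRIMARY KEY, host_id INTEGER)")
demo_conn.executemany(
    "INSERT INTO hosts VALUES (?, ?)", [(1, "Ann"), (2, "Ben"), (3, "Cy")]
)
demo_conn.executemany(
    "INSERT INTO listings VALUES (?, ?)", [(101, 1), (102, 1), (103, 2)]
)

# Hosts that appear in at least one listing.
hosts_with_listings = demo_conn.execute(
    "SELECT COUNT(DISTINCT host_id) FROM listings"
).fetchone()[0]

# Every host registered in the hosts table.
hosts_registered = demo_conn.execute("SELECT COUNT(*) FROM hosts").fetchone()[0]

print(hosts_with_listings, hosts_registered)
```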


**4. What are the different room types available?**
This query lists each unique `type` value in the `room_types` table.

In [50]:
query = """
        SELECT DISTINCT type
        FROM room_types;"""
q = pd.read_sql(query, conn)

# Show result
q.head()

Unnamed: 0,type
0,Entire home/apt
1,Private room
2,Shared room


**5. What is the most expensive and cheapest listing?**
These two queries find the listing with the highest `price` and the listing with the lowest `price`.

In [62]:
# Most expensive listing
query = """
        SELECT id, name, price
        FROM listings
        ORDER BY price DESC
        LIMIT 1;"""
p = pd.read_sql(query, conn)

# Least expensive listing
query = """
        SELECT id, name, price
        FROM listings
        ORDER BY price ASC
        LIMIT 1;"""
q = pd.read_sql(query, conn)

# Show both results
print("Most Expensive Listing:")
print(p)

print("\nLeast Expensive Listing:")
print(q)



Most Expensive Listing:
        id                               name   price
0  2431241  Beautiful Central Harlem sleeps 4  1200.0

Least Expensive Listing:
        id                               name  price
0  1200164  MANHATTAN Neat, Nice, Bright ROOM   50.0
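
`ORDER BY ... LIMIT 1` returns a single row even when several listings share the extreme price. As an alternative sketch, the query below compares each price against `MAX(price)` and `MIN(price)` so that ties are kept. The data is hypothetical and in-memory; it is not the exam database.

```python
import sqlite3

# Hypothetical listings; two of them tie for the highest price.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE listings (id INTEGER PRIMARY KEY, name TEXT, price REAL)"
)
demo_conn.executemany(
    "INSERT INTO listings VALUES (?, ?, ?)",
    [(1, "A", 50.0), (2, "B", 1200.0), (3, "C", 1200.0), (4, "D", 300.0)],
)

# Keep every row whose price equals the overall maximum or minimum.
extremes_query = """
    SELECT id, name, price
    FROM listings
    WHERE price = (SELECT MAX(price) FROM listings)
       OR price = (SELECT MIN(price) FROM listings)
    ORDER BY price DESC, id;
"""
extremes = demo_conn.execute(extremes_query).fetchall()
print(extremes)
```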


**6. Which neighbourhoods have listings?**
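This query lists the distinct names in the `neighbourhoods` table. Note that it does not consult `listings`, so it returns every neighbourhood in the lookup table; confirming that each one has at least one listing would require a join or `EXISTS` check against `listings`.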

In [65]:
query = """
        SELECT DISTINCT name
        FROM neighbourhoods;"""
q = pd.read_sql(query, conn)

# Show result
q.head()

Unnamed: 0,name
0,Allerton
1,Arden Heights
2,Arrochar
3,Arverne
4,Astoria
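
The query above reads only the `neighbourhoods` table. A minimal sketch of a version that keeps only neighbourhoods with at least one listing is shown below, using `EXISTS` against `listings`. The rows are hypothetical and in-memory, so the result says nothing about the real database.

```python
import sqlite3

# Hypothetical data: three neighbourhoods, one of which has no listings.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE neighbourhoods (id INTEGER PRIMARY KEY, name TEXT)")
demo_conn.execute(
    "CREATE TABLE listings (id INTEGER PRIMARY KEY, neighbourhood_id INTEGER)"
)
demo_conn.executemany(
    "INSERT INTO neighbourhoods VALUES (?, ?)",
    [(1, "Allerton"), (2, "Astoria"), (3, "Arverne")],
)
demo_conn.executemany(
    "INSERT INTO listings VALUES (?, ?)", [(101, 1), (102, 2), (103, 2)]
)

# EXISTS keeps a neighbourhood only if some listing references it.
with_listings_query = """
    SELECT n.name
    FROM neighbourhoods n
    WHERE EXISTS (
        SELECT 1 FROM listings l WHERE l.neighbourhood_id = n.id
    )
    ORDER BY n.name;
"""
names_with_listings = [row[0] for row in demo_conn.execute(with_listings_query)]
print(names_with_listings)
```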


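**7. How many superhosts (exceptional hosts, defined as having at least 4 reviews per month and more than 50 reviews) are there?**
This query lists the `host_id` and listing `name` for every listing that meets the superhost criteria. It returns rows rather than a single count, and because `name` here is the listing name, a host with several qualifying listings can appear more than once.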

In [70]:
query = """
        SELECT DISTINCT host_id, name
        FROM listings
        WHERE reviews_per_month >= 4 AND number_of_reviews > 50;"""
q = pd.read_sql(query, conn)

# Show result
q.head()

Unnamed: 0,host_id,name
0,-763918612,PRIVATE Room on Historic Sugar Hill
1,-1694027102,☆Massive DUPLEX☆ 2BR & 2BTH East Village 9+ Gu...
2,1208474933,Astoria-Private Home NYC-
3,1363927884,Hospitality on Propsect Pk-12 yrs Hosting Lega...
4,387201125,yahmanscrashpads
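
Since the question asks "how many", a single number may be the more direct answer. One possible sketch is to wrap the same filter in `COUNT(DISTINCT host_id)`, so a host with several qualifying listings is counted once. The rows below are hypothetical, and the count they produce is not the answer for the exam database.

```python
import sqlite3

# Hypothetical listings: host 10 qualifies twice, host 11 fails on review count,
# host 12 qualifies once.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    """CREATE TABLE listings (
           id INTEGER PRIMARY KEY,
           host_id INTEGER,
           reviews_per_month REAL,
           number_of_reviews INTEGER
       )"""
)
demo_conn.executemany(
    "INSERT INTO listings VALUES (?, ?, ?, ?)",
    [
        (1, 10, 4.5, 80),
        (2, 10, 5.0, 120),
        (3, 11, 6.0, 30),
        (4, 12, 4.0, 51),
    ],
)

# Same WHERE clause as the notebook query, but counting distinct hosts.
superhost_count = demo_conn.execute(
    """SELECT COUNT(DISTINCT host_id)
       FROM listings
       WHERE reviews_per_month >= 4 AND number_of_reviews > 50"""
).fetchone()[0]
print(superhost_count)
```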


**8. Which room type has the highest average price?**
This query identifies the room type with the highest average `price` by joining `listings` to `room_types` and grouping by `type`.

In [77]:
query = """
        SELECT type, AVG(price) AS average_price
        FROM listings
        JOIN room_types ON listings.room_type_id = room_types.id
        GROUP BY type
        ORDER BY average_price DESC
        LIMIT 1;"""
q = pd.read_sql(query, conn)

# Show result
q.head()

Unnamed: 0,type,average_price
0,Shared room,650.88


**9. Which neighbourhoods have the highest number of listings?**
This query counts how many listings exist in each neighbourhood. It helps identify the most active or popular neighbourhoods for Airbnb activity.

In [92]:
query = """
        SELECT l.neighbourhood_id, COUNT(*) AS total_listings
        FROM listings l
        GROUP BY l.neighbourhood_id
        ORDER BY total_listings DESC
        LIMIT 5;"""
q = pd.read_sql(query, conn)

# Show result
q.head()

Unnamed: 0,neighbourhood_id,total_listings
0,5,3244
1,10,3158
2,11,2269
3,8,1984
4,6,1525
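
The result above identifies neighbourhoods only by `neighbourhood_id`. If names are wanted, one option is to join `listings` to `neighbourhoods`, sketched below on hypothetical in-memory rows with placeholder names. The real id-to-name mapping is not shown in this notebook, so none is assumed here.

```python
import sqlite3

# Hypothetical data with placeholder neighbourhood names.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE neighbourhoods (id INTEGER PRIMARY KEY, name TEXT)")
demo_conn.execute(
    "CREATE TABLE listings (id INTEGER PRIMARY KEY, neighbourhood_id INTEGER)"
)
demo_conn.executemany(
    "INSERT INTO neighbourhoods VALUES (?, ?)",
    [(5, "Neighbourhood A"), (10, "Neighbourhood B"), (11, "Neighbourhood C")],
)
demo_conn.executemany(
    "INSERT INTO listings VALUES (?, ?)",
    [(1, 5), (2, 5), (3, 5), (4, 10), (5, 10), (6, 11)],
)

# Group by the neighbourhood key and report its name alongside the count.
top_query = """
    SELECT n.name, COUNT(*) AS total_listings
    FROM listings l
    JOIN neighbourhoods n ON n.id = l.neighbourhood_id
    GROUP BY n.id, n.name
    ORDER BY total_listings DESC, n.name
    LIMIT 5;
"""
top_neighbourhoods = demo_conn.execute(top_query).fetchall()
print(top_neighbourhoods)
```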


**10. Which host has the most listings?**
This query identifies the host with the greatest number of listings on the platform.

In [93]:
query = """
        SELECT host_id, COUNT(*) AS listing_count
        FROM listings
        GROUP BY host_id
        ORDER BY listing_count DESC
        LIMIT 1;"""
q = pd.read_sql(query, conn)

# Show result
q.head()


Unnamed: 0,host_id,listing_count
0,-1148768701,2


**11. Which listings are available every day of the year and have more than 50 reviews?**
This query filters for high-availability, high-activity listings.


In [95]:
query = """
        SELECT id, name
        FROM listings
        WHERE availability_365 = 365 AND number_of_reviews > 50;"""
q = pd.read_sql(query, conn)

# Show result
q.head()


Unnamed: 0,id,name
0,2489785,"1Bedroom, Seconds from L train"
1,4111337,"Bklyn, private Three Bedroom."
2,4929294,Convenience & Chill
3,6683951,Putnam Garden -2BDR Garden Apt
4,6701624,Clean Cozy & Comfy Apartment!


**12. Which hosts have listings in more than one neighbourhood?**
This identifies multi-location hosts, possibly professional operators.

In [97]:
query = """
        SELECT host_id
        FROM listings
        GROUP BY host_id
        HAVING COUNT(DISTINCT neighbourhood_id) > 1;"""
q = pd.read_sql(query, conn)

# Show result
q.head()


Unnamed: 0,host_id
0,-1148768701
