# Section B: Practical questions with applied multiple choice

## General Rules:
- This is an open book examination.
- Students may make use of a calculator.
- This is an online examination where you will access a computer; however you may not communicate with other students in any form.
- Headphones are prohibited.
- The use of AI (ChatGPT etc.) is prohibited.
- All cell phones are to be switched off for the duration of the exam.
- The invigilator will not assist you with the explanation of questions.
- Students are prohibited from conversing in any manner with other students.

## My Name and Surname

Name =
<br>
Surname =  

### Part 1: SQL Queries  
You are provided with a pre-populated SQLite database named `airbnb.db`. Download [here](https://www.kaggle.com/datasets/arianazmoudeh/airbnbopendata) if you haven't already. Your task is to explore this database and write a series of SQL queries to perform the tasks detailed below. Queries should be optimised to run within 20 seconds or less.
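
One way to check a query against the 20-second limit is to time it. The sketch below is illustrative only: `timed_read_sql` is a hypothetical helper, and the in-memory table is toy data standing in for the real database.

```python
import sqlite3
import time

import pandas as pd


def timed_read_sql(query, conn):
    """Run a query with pandas and return (DataFrame, elapsed seconds)."""
    start = time.perf_counter()
    df = pd.read_sql(query, conn)
    elapsed = time.perf_counter() - start
    return df, elapsed


# Toy in-memory table (assumption: not the real exam data).
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE listings (id INTEGER PRIMARY KEY, price REAL)")
demo_conn.executemany(
    "INSERT INTO listings (id, price) VALUES (?, ?)",
    [(1, 100.0), (2, 250.0), (3, 75.0)],
)

timed_df, seconds = timed_read_sql("SELECT COUNT(*) AS n FROM listings;", demo_conn)
print(f"Counted {timed_df['n'].iloc[0]} rows in {seconds:.4f} s")
demo_conn.close()
```

The same helper can wrap any query in this notebook by passing the notebook's own `conn` instead of `demo_conn`.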

The tables and columns included in the `airbnb.db` are:

- `listings`: `listing_id`, `host_id`, `listing_name`, `neighbourhood`, `room_type`, `price`, `minimum_nights`, `number_of_reviews`, `last_review`, `reviews_per_month`, `calculated_host_listings_count`, `availability_365`  
- `hosts`: `host_id`, `host_name`, `host_since`, `host_location`, `host_response_time`, `host_response_rate`, `host_is_superhost`  
- `reviews`: `review_id`, `listing_id`, `reviewer_id`, `review_date`, `comments`  
- `reviewers`: `reviewer_id`, `reviewer_name`  
- `calendar`: `listing_id`, `date`, `available`, `price`  
- `neighbourhoods`: `neighbourhood`, `borough`  
- `amenities`: `listing_id`, `amenity_name`

In [3]:
import os
import json
import random
import sqlite3
import sqlparse
import pandas as pd
import numpy as np

import seaborn as sns
import mysql.connector

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

import matplotlib.pyplot as plt

In [5]:
# Load your database and create a database connection.
# You can connect to the sql database in any way you wish.
# Use this method if you are unsure how to proceed.
# Ensure the database file exists at the path assigned to db_path below.
try:
    db_path = "/content/drive/MyDrive/airbnb_nyc.db"  # full path on Google Drive
    with sqlite3.connect(db_path) as conn:
        print(f"Opened SQLite database with version {sqlite3.sqlite_version} successfully.")

        # List all tables in the database
        tables_df = pd.read_sql('''SELECT name FROM sqlite_master WHERE type='table';''', conn)
        print(tables_df)

except sqlite3.OperationalError as e:
    print("Failed to open database:", e)


Opened SQLite database with version 3.37.2 successfully.
                    name
0   neighbourhood_groups
1         neighbourhoods
2             room_types
3  cancellation_policies
4                  hosts
5               listings


In [20]:
table_name = 'amenities'

with sqlite3.connect(db_path) as conn:
    df = pd.read_sql(f"SELECT * FROM {table_name} LIMIT 5;", conn)
    print(f"First 5 rows of table: {table_name}")
    print(df)


DatabaseError: Execution failed on sql 'SELECT * FROM amenities LIMIT 5;': no such table: amenities
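
The error above arises because `amenities` is not among the tables actually present in the file. A minimal sketch of checking for a table before querying it, and of listing a table's columns with `PRAGMA table_info`, might look like this. The helpers `list_tables` and `preview_table` are hypothetical names, and the in-memory database is toy data.

```python
import sqlite3

import pandas as pd


def list_tables(conn):
    """Return the names of all tables in the connected SQLite database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name;"
    ).fetchall()
    return [name for (name,) in rows]


def preview_table(conn, table_name, n=5):
    """Return the first n rows of table_name, or None if it does not exist."""
    available = list_tables(conn)
    if table_name not in available:
        print(f"No table named {table_name!r}; available tables: {available}")
        return None
    # table_name was matched against sqlite_master, so quoting it here is safe.
    return pd.read_sql(f'SELECT * FROM "{table_name}" LIMIT ?;', conn, params=(n,))


# Toy in-memory database (assumption: not the real exam data).
schema_conn = sqlite3.connect(":memory:")
schema_conn.execute("CREATE TABLE hosts (id INTEGER PRIMARY KEY, name TEXT)")
schema_conn.execute("INSERT INTO hosts VALUES (1, 'Example Host')")

missing = preview_table(schema_conn, "amenities")  # prints a message, returns None
found = preview_table(schema_conn, "hosts")

# PRAGMA table_info returns one row per column of the named table.
columns = pd.read_sql("PRAGMA table_info(hosts);", schema_conn)["name"].tolist()
print(columns)
schema_conn.close()
```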

**1. First 5 Rows from Listings Table**
This query provides a quick overview of the listings data.
It selects and displays the first 5 rows from the listings table, which helps users understand the structure and sample content of the dataset.

In [7]:
query = "SELECT * FROM listings LIMIT 5;"
q = pd.read_sql(query, conn)

# Show result
q.head()

Unnamed: 0,id,name,host_id,neighbourhood_id,latitude,longitude,country,country_code,room_type_id,construction_year,...,number_of_reviews,last_review,reviews_per_month,review_rate_number,calculated_host_listings_count,availability_365,instant_bookable,cancellation_policy_id,house_rules,license
0,1001254,Clean & quiet apt home by the park,-1589892906,1,40.64749,-73.97237,1,1,1,2020,...,9,,0.21,4,6,286,0,1,Clean up and treat the home the way you'd like...,
1,1002102,Skylit Midtown Castle,795565271,2,40.75362,-73.98377,1,1,2,2007,...,45,,0.38,4,2,228,0,2,Pet friendly but please confirm with me if the...,
2,1003689,Entire Apt: Spacious Studio/Loft by central park,1843282861,3,40.79851,-73.94399,1,1,2,2009,...,9,,0.1,3,1,289,0,2,"Please no smoking in the house, porch or on th...",
3,1004098,Large Cozy 1 BR Apartment In Midtown East,-1746088462,4,40.74767,-73.975,1,1,2,2013,...,74,,0.59,3,1,374,1,3,"No smoking, please, and no drugs.",
4,1005202,BlissArtsSpace!,627526493,5,40.68688,-73.95596,1,1,1,2009,...,49,,0.4,5,1,219,0,2,House Guidelines for our BnB We are delighted ...,


**2. List all neighbourhood group names.**
This query identifies all the unique neighbourhood group names in the dataset.
It retrieves every row from the neighbourhood_groups table, offering insight into how New York City is geographically categorized.

In [8]:
query = "SELECT * FROM neighbourhood_groups;"
q = pd.read_sql(query, conn)

# Show result
q.head()



Unnamed: 0,id,name
0,4,Bronx
1,1,Brooklyn
2,2,Manhattan
3,3,Queens
4,5,Staten Island


**3. What is the name of the room type with ID = 2?**
This query reveals the room type name associated with the ID of 2.
It is useful for understanding what category this specific room type ID corresponds to in the room_types table.

In [21]:
query = "SELECT * FROM room_types WHERE id = 2;"
q = pd.read_sql(query, conn)

# Show result
q.head()

Unnamed: 0,id,type
0,2,Entire home/apt


**4. How many listings are there in the listings table?**
This query calculates the total number of listings available in the listings table.
It provides a summary count, which is helpful for understanding the dataset’s size.

In [9]:
query = "SELECT COUNT(*) FROM listings;"
q = pd.read_sql(query, conn)

# Show result
q.head()

Unnamed: 0,COUNT(*)
0,39415


**5. Which host has the ID 795565271?**
This query retrieves information about a specific host.
It returns the record of the host with an ID of 795565271 from the hosts table to provide targeted information.



In [14]:
query = "SELECT * FROM hosts WHERE id = 795565271;"
q = pd.read_sql(query, conn)

# Show result
q.head()

Unnamed: 0,id,name,identity_verified
0,795565271,Jenna,1
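
When the ID comes from a variable rather than being typed into the query, a parameterised query is a common alternative to string formatting. The sketch below assumes a toy `hosts` table with the same three columns as the output above; the IDs and names are made up for illustration.

```python
import sqlite3

import pandas as pd

# Toy in-memory hosts table (assumption: not the real exam data).
param_conn = sqlite3.connect(":memory:")
param_conn.execute(
    "CREATE TABLE hosts (id INTEGER PRIMARY KEY, name TEXT, identity_verified INTEGER)"
)
param_conn.executemany(
    "INSERT INTO hosts VALUES (?, ?, ?)",
    [(101, "Host A", 1), (102, "Host B", 0)],
)

host_id = 101
# The ? placeholder is filled in by the sqlite3 driver, not by string formatting.
host_df = pd.read_sql(
    "SELECT * FROM hosts WHERE id = ?;", param_conn, params=(host_id,)
)
print(host_df)
param_conn.close()
```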


**6. What is the minimum and maximum price in the listings table?**
This query identifies the price range of listings in New York City.
It returns the minimum and maximum listing prices found in the listings table, offering insight into the affordability spectrum.

In [10]:
query = "SELECT MIN(price), MAX(price) FROM listings;"
q = pd.read_sql(query, conn)

# Show result
q.head()

Unnamed: 0,MIN(price),MAX(price)
0,50.0,1200.0


**7. How many listings exist per neighbourhood?**
This query helps assess listing density across different areas.
It groups the data by neighbourhood and counts how many listings exist in each, highlighting areas with more or fewer options.

In [11]:
query = "SELECT neighbourhood_id, COUNT(*) AS listing_count FROM listings GROUP BY neighbourhood_id;"
q = pd.read_sql(query, conn)

# Show result
q.head()

Unnamed: 0,neighbourhood_id,listing_count
0,1,145
1,2,1057
2,3,950
3,4,320
4,5,3244


**8. Which neighbourhood group has the highest number of listings?**
This query finds the most active region for short-term rentals.
It calculates the number of listings in each neighbourhood group and returns the one with the highest count.

In [12]:
query = """
    SELECT ng.name AS neighbourhood_group, COUNT(*) AS listing_count
    FROM listings l
    JOIN neighbourhoods n ON l.neighbourhood_id = n.id
    JOIN neighbourhood_groups ng ON n.neighbourhood_group_id = ng.id
    GROUP BY ng.name
    ORDER BY listing_count DESC
    LIMIT 1; """
q = pd.read_sql(query, conn)

# Show result
q.head()

Unnamed: 0,neighbourhood_group,listing_count
0,Manhattan,16658


**11. Find hosts who have more than 1 listing in different neighbourhoods.**
This query identifies multi-listing hosts with wide geographical coverage.
It returns the IDs and names of hosts whose listings span more than one distinct neighbourhood, indicating hosts with a broad presence.



In [21]:
query = """
    SELECT h.id AS host_id, h.name, COUNT(DISTINCT l.neighbourhood_id) AS neighbourhood_count
    FROM hosts h
    JOIN listings l ON h.id = l.host_id
    GROUP BY h.id, h.name
    HAVING neighbourhood_count > 1;"""
q = pd.read_sql(query, conn)

# Show result
q.head()

Unnamed: 0,host_id,name,neighbourhood_count
0,-1148768701,Yolanda,2
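
The use of `COUNT(DISTINCT l.neighbourhood_id)` matters here: a host with several listings in a single neighbourhood should not qualify. The following sketch contrasts it with `COUNT(*)` on a toy `listings` table; the rows are invented for illustration.

```python
import sqlite3

import pandas as pd

# Toy in-memory listings table (assumption: not the real exam data).
count_conn = sqlite3.connect(":memory:")
count_conn.execute(
    "CREATE TABLE listings (id INTEGER PRIMARY KEY, host_id INTEGER, neighbourhood_id INTEGER)"
)
count_conn.executemany(
    "INSERT INTO listings VALUES (?, ?, ?)",
    [
        (1, 1, 10),  # host 1: two listings, same neighbourhood
        (2, 1, 10),
        (3, 2, 10),  # host 2: two listings, different neighbourhoods
        (4, 2, 20),
        (5, 3, 30),  # host 3: a single listing
    ],
)

# Hosts with more than one listing, regardless of location.
multi_listing = pd.read_sql(
    "SELECT host_id FROM listings GROUP BY host_id "
    "HAVING COUNT(*) > 1 ORDER BY host_id;",
    count_conn,
)["host_id"].tolist()

# Hosts whose listings cover more than one distinct neighbourhood.
multi_neighbourhood = pd.read_sql(
    "SELECT host_id FROM listings GROUP BY host_id "
    "HAVING COUNT(DISTINCT neighbourhood_id) > 1 ORDER BY host_id;",
    count_conn,
)["host_id"].tolist()

print(multi_listing, multi_neighbourhood)
count_conn.close()
```

In this toy data host 1 passes the `COUNT(*)` test but not the `COUNT(DISTINCT ...)` test, which is the case the question is meant to exclude.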


**12. Average Price by Neighbourhood Group for High-Review, High-Price Listings**
This query focuses on high-quality, premium listings.
It calculates the average price per neighbourhood group, but only includes listings with more than 10 reviews and a price above $100.



In [24]:
query = """
    SELECT ng.name AS neighbourhood_group, AVG(l.price) AS avg_price
    FROM listings l
    JOIN neighbourhoods n ON l.neighbourhood_id = n.id
    JOIN neighbourhood_groups ng ON n.neighbourhood_group_id = ng.id
    WHERE l.number_of_reviews > 10 AND l.price > 100
    GROUP BY ng.name
    ORDER BY avg_price DESC;"""
q = pd.read_sql(query, conn)

# Show result
q.head()


Unnamed: 0,neighbourhood_group,avg_price
0,Staten Island,680.355649
1,Brooklyn,647.811899
2,Queens,647.669732
3,Manhattan,644.043753
4,Bronx,629.058824
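
Because the `WHERE` clause is applied before `GROUP BY`, the averages above describe only the listings that pass both filters, not every listing in each group. A small sketch of that effect follows. It simplifies the schema to a single toy table with a `neighbourhood_group` text column (an assumption made for brevity; the real database uses joins), and the prices are invented.

```python
import sqlite3

import pandas as pd

# Toy in-memory table with a simplified schema (assumption: not the real data).
agg_conn = sqlite3.connect(":memory:")
agg_conn.execute(
    "CREATE TABLE listings (neighbourhood_group TEXT, price REAL, number_of_reviews INTEGER)"
)
agg_conn.executemany(
    "INSERT INTO listings VALUES (?, ?, ?)",
    [
        ("A", 50.0, 20),   # excluded by the filter: price not above 100
        ("A", 200.0, 20),  # kept
        ("A", 400.0, 5),   # excluded by the filter: too few reviews
        ("B", 300.0, 15),  # kept
        ("B", 120.0, 30),  # kept
    ],
)

# Average over every listing in each group.
all_avg = pd.read_sql(
    "SELECT neighbourhood_group, AVG(price) AS avg_price FROM listings "
    "GROUP BY neighbourhood_group ORDER BY neighbourhood_group;",
    agg_conn,
)

# Average over listings that pass both filters; rows are removed before grouping.
filtered_avg = pd.read_sql(
    "SELECT neighbourhood_group, AVG(price) AS avg_price FROM listings "
    "WHERE number_of_reviews > 10 AND price > 100 "
    "GROUP BY neighbourhood_group ORDER BY avg_price DESC;",
    agg_conn,
)

print(all_avg)
print(filtered_avg)
agg_conn.close()
```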
