# Data Analytics with Databricks: A Case Study Approach
## Chapter Two

This chapter covers the ingestion, cleaning, and management of the Airbnb Reviews dataset. Learners will use SQL to perform the data cleaning and preparation tasks.

### Exercise 1: Removing Duplicates and Handling Missing Values (SQL-based)

#### **Dataset**
- The dataset is located in the `hive_metastore.default.airbnb_open_data` table.

#### **Output**
- Your output should be stored at: `hive_metastore.default.airbnb_open_data_copy`

#### **Learning Objective**
- Learn how to identify and remove duplicate entries and handle missing values in a SQL table.

#### **Context**
This exercise focuses on data cleaning, specifically identifying duplicate rows and handling missing values. Cleaning data is a crucial step in preparing datasets for analysis and ensures that insights are based on accurate, high-quality information.



#### **Exercise Question:**
- After removing duplicates and imputing missing values, how many rows remain in the dataset?

#### **End Goal:**
- A clean dataset with no duplicates and no missing values in the `review_score` column.



#### **Steps to be executed by the student:**
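First, create a copy of the raw dataset so that the original table stays untouched:

In [0]:
%sql
-- Work on a copy so the raw table is preserved.
-- OR REPLACE lets this cell be re-run without a "table already exists" error.
CREATE OR REPLACE TABLE hive_metastore.default.airbnb_open_data_copy AS
SELECT *
FROM hive_metastore.default.airbnb_open_data;

1. Write a query to find and count duplicate entries based on `host_id`, `last_review`, and `list_id`:
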
In [0]:
%sql
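
A minimal sketch of the duplicate-counting step on a toy table, not the solution against the real table. The table name `listings` and all row values are invented for illustration, and Python's built-in `sqlite3` is used only so the example runs anywhere. The query uses standard `GROUP BY ... HAVING`, so it should carry over to a `%sql` cell once the table name is swapped.

```python
import sqlite3

# Toy stand-in for the Airbnb table; the table name and rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE listings (host_id INTEGER, list_id INTEGER, last_review TEXT)"
)
conn.executemany(
    "INSERT INTO listings VALUES (?, ?, ?)",
    [
        (1, 10, "2021-05-01"),
        (1, 10, "2021-05-01"),  # duplicate of the row above
        (2, 20, "2021-06-15"),  # unique row
        (3, 30, "2021-07-04"),
        (3, 30, "2021-07-04"),
        (3, 30, "2021-07-04"),
    ],
)

# Group on the three key columns and keep only groups that occur more than once.
DUPLICATE_GROUPS_SQL = """
SELECT host_id, last_review, list_id, COUNT(*) AS occurrences
FROM listings
GROUP BY host_id, last_review, list_id
HAVING COUNT(*) > 1
ORDER BY host_id
"""

duplicate_groups = conn.execute(DUPLICATE_GROUPS_SQL).fetchall()
for row in duplicate_groups:
    print(row)
```

Each returned row is one duplicate group together with the number of times it occurs; the unique row for host 2 is filtered out by the `HAVING` clause.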


2. Select only the latest review for each duplicate group.

In [0]:
%sql
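
One possible way to keep a single row per group is `ROW_NUMBER()` over a window. The sketch below assumes that a "duplicate group" means rows sharing `host_id` and `list_id`, and that "latest" means the greatest `last_review`; both are interpretations, and the toy table `listings` and its rows are invented. It runs on Python's built-in `sqlite3` (window functions need SQLite 3.25 or later).

```python
import sqlite3

# Toy data: table name and rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE listings (host_id INTEGER, list_id INTEGER, last_review TEXT)"
)
conn.executemany(
    "INSERT INTO listings VALUES (?, ?, ?)",
    [
        (1, 10, "2021-05-01"),
        (1, 10, "2021-08-20"),  # newer review for the same host and listing
        (2, 20, "2021-06-15"),
        (3, 30, "2021-07-04"),
        (3, 30, "2021-07-04"),  # exact duplicate
    ],
)

# Rank rows inside each (host_id, list_id) group, newest review first,
# then keep only the top-ranked row of every group.
LATEST_PER_GROUP_SQL = """
SELECT host_id, list_id, last_review
FROM (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY host_id, list_id
               ORDER BY last_review DESC
           ) AS rn
    FROM listings
) AS ranked
WHERE rn = 1
ORDER BY host_id
"""

latest_rows = conn.execute(LATEST_PER_GROUP_SQL).fetchall()
for row in latest_rows:
    print(row)
```

The dates are stored as ISO-formatted text here, which sorts chronologically; a real `DATE` or `TIMESTAMP` column would not need that assumption.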



3. Identify missing values in `price`:

In [0]:
%sql
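
A small sketch of counting missing values, assuming they are stored as SQL `NULL` (if the real column holds empty strings instead, the expression would need adjusting). The table `listings` and its values are invented; `COUNT(price)` skips `NULL`s, so subtracting it from `COUNT(*)` gives the number of missing entries.

```python
import sqlite3

# Toy data: table name and prices are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listings (list_id INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO listings VALUES (?, ?)",
    [(10, 100.0), (20, None), (30, 250.0), (40, None), (50, 80.0)],
)

# COUNT(*) counts every row; COUNT(price) counts only non-NULL prices.
MISSING_PRICE_SQL = """
SELECT COUNT(*)                AS total_rows,
       COUNT(*) - COUNT(price) AS missing_price
FROM listings
"""

total_rows, missing_price = conn.execute(MISSING_PRICE_SQL).fetchone()
print(total_rows, missing_price)
```

The same query can be re-run after an imputation step to confirm that the missing count has dropped to zero.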

4. Impute missing values in the `review_score` column with the average score of the dataset:

In [0]:
%sql
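
Mean imputation can be sketched with `COALESCE` and a window average. This is one approach among several, shown on an invented toy table `listings`; it returns the imputed column in a `SELECT` rather than updating the table in place, and persisting the result is left to the exercise.

```python
import sqlite3

# Toy data: table name and scores are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listings (list_id INTEGER, review_score REAL)")
conn.executemany(
    "INSERT INTO listings VALUES (?, ?)",
    [(10, 4.0), (20, None), (30, 2.0), (40, None), (50, 3.0)],
)

# AVG(...) OVER () is the average over all rows (NULLs are ignored),
# and COALESCE substitutes it wherever review_score is NULL.
IMPUTE_SQL = """
SELECT list_id,
       COALESCE(review_score, AVG(review_score) OVER ()) AS review_score
FROM listings
ORDER BY list_id
"""

imputed = conn.execute(IMPUTE_SQL).fetchall()
for row in imputed:
    print(row)
```

With the toy values above, the average of the three known scores is 3.0, so both missing entries are filled with 3.0 while the existing scores stay unchanged.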

5. Display the count of missing values in `price` after imputation:

In [0]:
%sql


### Exercise 2: Calculating Summary Statistics (SQL-based)

#### **Dataset**
- The dataset is located in the `hive_metastore.default.airbnb_open_data` table.

#### **Output**
- Your output should be stored at: `hive_metastore.default.airbnb_open_data_copy`

#### **Learning Objective**
- Learn how to calculate summary statistics such as averages and counts using SQL.

#### **Context**
Understanding how to calculate summary statistics is essential for analyzing datasets. This exercise teaches learners how to use SQL to gain insights into the data by calculating the average review score per city and the total number of reviews per host.


#### **Exercise Question:**
- What is the average review score for "Los Angeles"?

#### **End Goal:**
- A table displaying the average review score per city and the total number of reviews per host.

#### **Steps to be executed by the student:**

### Solution

1. Write a query to calculate the average `review_rate_number` for each `neighbourhood_group`:

In [0]:
%sql
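
A grouped average can be sketched as follows. The table `listings`, the group labels, and the ratings are invented for illustration; only the column names `neighbourhood_group` and `review_rate_number` come from the step above.

```python
import sqlite3

# Toy data: table name, group labels, and ratings are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE listings (neighbourhood_group TEXT, review_rate_number REAL)"
)
conn.executemany(
    "INSERT INTO listings VALUES (?, ?)",
    [
        ("Group A", 4.0),
        ("Group A", 2.0),
        ("Group B", 5.0),
        ("Group B", 3.0),
        ("Group C", 1.0),
    ],
)

# One output row per neighbourhood_group with the mean rating of its rows.
AVG_PER_GROUP_SQL = """
SELECT neighbourhood_group,
       AVG(review_rate_number) AS avg_review_rate
FROM listings
GROUP BY neighbourhood_group
ORDER BY neighbourhood_group
"""

avg_per_group = conn.execute(AVG_PER_GROUP_SQL).fetchall()
for row in avg_per_group:
    print(row)
```

Adding a `WHERE neighbourhood_group = '...'` filter would narrow the result to a single group when only one value is needed.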


2. Write a query to count the total number of reviews for each host:

In [0]:
%sql
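
A per-host count can be sketched with `COUNT(*)` and `GROUP BY`. The sketch assumes each row represents one review, which may not match the real table; if the table instead stores a per-listing review count, a `SUM` over that column would be the closer fit. The table `listings` and its rows are invented.

```python
import sqlite3

# Toy data: table name and rows are invented; one row is treated as one review.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listings (host_id INTEGER, list_id INTEGER)")
conn.executemany(
    "INSERT INTO listings VALUES (?, ?)",
    [(1, 10), (1, 11), (2, 20), (3, 30), (3, 31), (3, 32)],
)

# Count the rows that belong to each host.
REVIEWS_PER_HOST_SQL = """
SELECT host_id, COUNT(*) AS total_reviews
FROM listings
GROUP BY host_id
ORDER BY host_id
"""

reviews_per_host = conn.execute(REVIEWS_PER_HOST_SQL).fetchall()
for row in reviews_per_host:
    print(row)
```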

3. Combine both statistics into one query using subqueries:

In [0]:
%sql
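
One way to combine the two statistics is to compute each in its own subquery and join them to the distinct host and group pairs. This is a sketch under assumptions: the table `listings` and its rows are invented, and each row is treated as one review.

```python
import sqlite3

# Toy data: table name and rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE listings ("
    "host_id INTEGER, neighbourhood_group TEXT, review_rate_number REAL)"
)
conn.executemany(
    "INSERT INTO listings VALUES (?, ?, ?)",
    [
        (1, "Group A", 4.0),
        (1, "Group A", 2.0),
        (2, "Group A", 3.0),
        (3, "Group B", 5.0),
        (3, "Group B", 4.0),
        (3, "Group B", 3.0),
    ],
)

# Subquery g: average rating per neighbourhood_group.
# Subquery h: number of rows (reviews) per host.
# Subquery l: distinct (host, group) pairs that tie the two together.
COMBINED_SQL = """
SELECT l.host_id,
       l.neighbourhood_group,
       g.avg_review_rate,
       h.total_reviews
FROM (SELECT DISTINCT host_id, neighbourhood_group FROM listings) AS l
JOIN (SELECT neighbourhood_group,
             AVG(review_rate_number) AS avg_review_rate
      FROM listings
      GROUP BY neighbourhood_group) AS g
  ON g.neighbourhood_group = l.neighbourhood_group
JOIN (SELECT host_id, COUNT(*) AS total_reviews
      FROM listings
      GROUP BY host_id) AS h
  ON h.host_id = l.host_id
ORDER BY l.host_id, l.neighbourhood_group
"""

combined = conn.execute(COMBINED_SQL).fetchall()
for row in combined:
    print(row)
```

Keeping the two aggregates in separate subqueries avoids mixing grouping levels: the average is computed per group and the count per host, and the join only attaches them to each host row.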
