
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img
    src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png"
    alt="Databricks Learning"
  >
</div>



# LAB - Load and Explore Data


Welcome to the "Load and Explore Data" lab! In this lab, you will practice loading and exploring data with PySpark in a Databricks environment. You will get hands-on experience saving data to and reading data from Delta tables, managing data permissions, computing summary statistics, and using data profiling tools to explore the Telco dataset.


**Lab Outline:**


In this lab, you will learn how to:
1. Read data from a Delta table
1. Manage data permissions
1. Show summary statistics
1. Use the Data Profiler to explore a DataFrame
    - Check outliers
    - Check data distributions
1. Read previous versions of the Delta table


## REQUIRED - SELECT CLASSIC COMPUTE
Before executing cells in this notebook, please select your classic compute cluster in the lab. Be aware that **Serverless** is enabled by default.

Follow these steps to select the classic compute cluster:
1. Navigate to the top-right of this notebook and click the drop-down menu to select your cluster. By default, the notebook will use **Serverless**.

2. If your cluster is available, select it and continue to the next cell. If the cluster is not shown:

   - Click **More** in the drop-down.
   
   - In the **Attach to an existing compute resource** window, use the first drop-down to select your unique cluster.

**NOTE:** If your cluster has terminated, you might need to restart it in order to select it. To do this:

1. Right-click on **Compute** in the left navigation pane and select *Open in new tab*.

2. Find the triangle icon to the right of your compute cluster name and click it.

3. Wait a few minutes for the cluster to start.

4. Once the cluster is running, complete the steps above to select your cluster.


## Requirements

Please review the following requirements before starting the lesson:

* To run this notebook, you need to use one of the following Databricks runtime(s): **17.3.x-cpu-ml-scala2.13**


## Lab Setup

Before starting the Lab, follow these initial steps:

1. Run the provided classroom setup script. This script defines the configuration variables that each user needs for the lab. Execute the following code cell:

In [0]:
%run ../Includes/Classroom-Setup-1.2


**Other Conventions:**

Throughout this lab, we'll make use of the object `DA`, which provides critical variables. Execute the code block below to see various variables that will be used in this notebook:

In [0]:
print(f"Username:          {DA.username}")
print(f"Catalog Name:      {DA.catalog_name}")
print(f"Schema Name:       {DA.schema_name}")
print(f"Working Directory: {DA.paths.working_dir}")
print(f"Dataset Location:  {DA.paths.datasets}")

## Task 1: Read Data from Delta Table


+ Use Spark to read the CSV dataset into a DataFrame, then write the DataFrame out as a Delta table.
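
One possible shape for this step is sketched below. It is not the lab's official solution. The helper names `build_csv_path` and `load_csv_as_delta` are hypothetical, and the sketch assumes the CSV has a header row and that `spark` is the SparkSession that Databricks notebooks provide.

```python
def build_csv_path(datasets_root, volume_name, csv_name):
    """Join the dataset root, volume folder, and file name into a CSV path."""
    return f"{datasets_root.rstrip('/')}/{volume_name}/{csv_name}.csv"


def load_csv_as_delta(spark, csv_path, full_table_name):
    """Read a CSV file into a DataFrame and save it as a Delta table.

    `spark` is the active SparkSession. Assumes the file has a header row.
    """
    df = (
        spark.read
        .option("header", True)       # first line holds the column names
        .option("inferSchema", True)  # let Spark guess the column types
        .csv(csv_path)
    )
    (
        df.write
        .format("delta")
        .mode("overwrite")                # replace the table if it already exists
        .option("overwriteSchema", True)  # allow the schema to be replaced too
        .saveAsTable(full_table_name)
    )
    return df
```

In the notebook, the path could come from `build_csv_path(DA.paths.datasets.telco, shared_volume_name, csv_name)` and the table name from `table_name_bronze`.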



In [0]:
## Load dataset with spark
shared_volume_name = 'telco' ## From Marketplace
csv_name = 'telco-customer-churn-missing' ## CSV file name
dataset_path = f"{DA.paths.datasets.telco}/{shared_volume_name}/{csv_name}.csv" ## Full path

## Read dataset with spark
telco_df = <FILL_IN>

table_name = "telco_missing"
table_name_bronze = f"{table_name}_bronze"

## Write it as delta table
telco_df.write.<FILL_IN>
telco_df.show()

## Task 2: Manage Data Permissions

Establish controlled access to the Telco Delta table by granting specific permissions for essential actions.

+ Grant permissions for specific actions (e.g., read, write) on the Delta table.
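
As a rough illustration, a grant can also be issued from Python with `spark.sql`. The sketch below assumes a Unity Catalog table and uses the built-in `account users` group as the principal. Both are assumptions about your workspace, and the helper names are hypothetical.

```python
def build_grant_statement(privilege, full_table_name, principal):
    """Build a GRANT statement for a table.

    The principal is wrapped in backticks because group names such as
    `account users` contain spaces.
    """
    return f"GRANT {privilege} ON TABLE {full_table_name} TO `{principal}`"


def grant_table_privilege(spark, privilege, full_table_name, principal):
    """Run the GRANT statement with the active SparkSession."""
    spark.sql(build_grant_statement(privilege, full_table_name, principal))
```

For read access, `privilege` would be `"SELECT"`. The same statement can be pasted into a `%sql` cell instead.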

In [0]:
%sql
-- Write a query that grants all users permission to access the Delta table
<FILL_IN>;

## Task 3: Show Summary Statistics


Compute and present key statistical metrics to gain a comprehensive understanding of the Telco dataset.


+ Utilize PySpark to compute and display summary statistics for the Telco dataset.

+ Include key metrics such as mean, standard deviation, min, max, etc.
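
A minimal sketch of one way to do this uses `DataFrame.summary`, which accepts the names of the statistics to compute. The helper `summarize_dataframe` is hypothetical. The pure-Python `reference_stats` function is included only to show what the numeric metrics mean. It uses the sample standard deviation, which is what Spark's `stddev` reports.

```python
import statistics


def summarize_dataframe(df, columns=None):
    """Return a Spark DataFrame with count, mean, stddev, min, and max.

    `df` is a Spark DataFrame. Pass `columns` to restrict the summary.
    """
    selected = df.select(*columns) if columns else df
    return selected.summary("count", "mean", "stddev", "min", "max")


def reference_stats(values):
    """Compute the same metrics for a plain Python list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stddev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }
```

In the notebook, `summarize_dataframe(telco_df).show()` would print the table. On Databricks, `dbutils.data.summarize(telco_df)` is another option that renders an interactive profile.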

In [0]:
## Show summary of the Data
<FILL_IN>

## Task 4: Use Data Profiler to Explore DataFrame
Use the Data Profiler and Visualization Editor tools.

+ Use the Data Profiler to explore the structure, data types, and basic statistics of the DataFrame.
    - **Task 4.1.1:** Identify columns with missing values and analyze the percentage of missing data for each column.
    - **Task 4.1.2:** Review the data types of each column to ensure they match expectations. Identify any columns that might need type conversion.
+ Use the Visualization Editor to check outliers and data distributions:
    - **Task 4.2.1:** Create a bar chart to visualize the distribution of churned and non-churned customers.
    - **Task 4.2.2:** Generate a pie chart to visualize the distribution of different contract types.
    - **Task 4.2.3:** Create a scatter plot to explore the relationship between monthly charges and total charges.
    - **Task 4.2.4:** Visualize the count of customers for each payment method using a bar chart.
    - **Task 4.2.5:** Compare monthly charges for different contract types using a box plot.
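
The Data Profiler and Visualization Editor are interactive, but the same checks can be approximated in code. The sketch below is an illustration with hypothetical helper names, not part of the lab. It counts nulls per column, converts the counts to percentages, and computes the 1.5 × IQR fences that a box plot conventionally uses to flag outliers.

```python
def null_counts(df):
    """Count null values per column of a Spark DataFrame."""
    # Imported here so the pure-Python helpers below work without PySpark.
    from pyspark.sql import functions as F

    row = df.select(
        [F.count(F.when(F.col(c).isNull(), 1)).alias(c) for c in df.columns]
    ).first()
    return row.asDict()


def missing_percentages(counts, total_rows):
    """Convert per-column null counts into percentages of all rows."""
    if total_rows == 0:
        return {name: 0.0 for name in counts}
    return {name: round(100.0 * n / total_rows, 2) for name, n in counts.items()}


def iqr_bounds(q1, q3, k=1.5):
    """Return the (lower, upper) fences used to flag outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Example usage in a notebook (assumes `telco_df` exists and that a column
# named MonthlyCharges is numeric; both are assumptions):
# pct = missing_percentages(null_counts(telco_df), telco_df.count())
# q1, q3 = telco_df.approxQuantile("MonthlyCharges", [0.25, 0.75], 0.01)
# lower, upper = iqr_bounds(q1, q3)
```

Values outside the returned fences are candidates for closer inspection, not confirmed errors.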


In [0]:
## Display the data and Explore the Data Profiler and Visualization Editor
<FILL_IN>

## Task 5: Drop the Column
Remove a column that is not needed, so the dataset stays clean and focused.


+ Identify the column that needs to be dropped. In this lab, drop the 'SeniorCitizen' column.


+ Use the appropriate command or method to drop the identified column from the Telco dataset.


+ Verify that the column has been successfully dropped by displaying the updated dataset.
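
The three steps above might look like the following sketch. The helper names are hypothetical. The `overwriteSchema` option is used on the assumption that the saved table's schema should no longer include the dropped column.

```python
def drop_and_overwrite(df, column_name, full_table_name):
    """Drop one column and overwrite the Delta table with the result."""
    if column_name not in df.columns:
        raise ValueError(f"Column {column_name!r} not found in DataFrame")
    dropped_df = df.drop(column_name)
    (
        dropped_df.write
        .format("delta")
        .mode("overwrite")
        .option("overwriteSchema", True)  # replace the schema as well as the data
        .saveAsTable(full_table_name)
    )
    return dropped_df


def column_was_dropped(columns_before, columns_after, column_name):
    """Check that exactly the named column disappeared."""
    return (
        column_name in columns_before
        and column_name not in columns_after
        and len(columns_after) == len(columns_before) - 1
    )
```

After the write, comparing `telco_df.columns` with `telco_dropped_df.columns` through `column_was_dropped` gives a quick verification, alongside displaying the updated dataset.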

In [0]:
## Drop SeniorCitizen Column 
telco_dropped_df = <FILL_IN>

## Overwrite the Delta table
telco_dropped_df.write.mode("overwrite")<FILL_IN>

## Task 6: Time-Travel to First Version


Read the Telco dataset as it existed in its initial state and explore the characteristics of the first version. Time travel reads an older snapshot; it does not change the current table.


+ Use Delta time-travel capabilities to read the initial version of the table.


+ Display and analyze the first version of the Telco dataset to understand its original structure and content.
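
As a hedged sketch of timestamp-based time travel: `DESCRIBE HISTORY` lists one row per table version, and the `timestampAsOf` read option selects the snapshot that was current at a given time. The helper names below are hypothetical, and the sketch assumes `spark` is the notebook's SparkSession.

```python
def history_query(full_table_name):
    """Build the SQL statement that lists a Delta table's versions."""
    return f"DESCRIBE HISTORY {full_table_name}"


def first_version_timestamp(spark, full_table_name):
    """Return the commit timestamp of the earliest available version."""
    history = spark.sql(history_query(full_table_name))
    row = history.orderBy("version").select("timestamp").first()
    return row["timestamp"]


def read_as_of_timestamp(spark, full_table_name, timestamp):
    """Read the table as it was at the given timestamp."""
    return (
        spark.read
        .option("timestampAsOf", str(timestamp))
        .table(full_table_name)
    )
```

Calling `printSchema()` on the returned DataFrame should show the original columns, including any that were dropped in later versions.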


In [0]:
## Extract timestamp of first version (can also be set manually)
timestamp_v0 = spark.sql(<FILL_IN>)
(spark
        .read
        .option(<FILL_IN>)
        .table(<FILL_IN>)
        .printSchema()
)


## Task 7: Read Previous Versions of the Delta Table
Demonstrate the ability to read data from a specific version of the Delta table.

+ Replace the timestamp in the code with the actual version or timestamp of interest.
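
A version number can be used in place of a timestamp. The following sketch, with hypothetical helper names, lists the available versions and reads one of them with the `VERSION AS OF` clause.

```python
def version_query(full_table_name, version):
    """Build a query that reads a specific version of a Delta table."""
    if isinstance(version, bool) or not isinstance(version, int) or version < 0:
        raise ValueError("version must be a non-negative integer")
    return f"SELECT * FROM {full_table_name} VERSION AS OF {version}"


def show_history(spark, full_table_name):
    """Print the version, timestamp, and operation of each table version."""
    (
        spark.sql(f"DESCRIBE HISTORY {full_table_name}")
        .select("version", "timestamp", "operation")
        .show(truncate=False)
    )


def read_version(spark, full_table_name, version):
    """Return the table contents as of the given version number."""
    return spark.sql(version_query(full_table_name, version))
```

Version numbers start at 0, so `read_version(spark, name, 0)` returns the table as first written, provided that version has not been removed by retention cleanup.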

In [0]:
## Show table versions
<FILL_IN>

## Conclusion
In this lab, you demonstrated how to explore and manipulate the dataset using Databricks, focusing on data exploration, management, and time-travel capabilities.

&copy; 2025 Databricks, Inc. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache Iceberg logo are trademarks of the <a href="https://www.apache.org/" target="_blank">Apache Software Foundation</a>.<br/><br/><a href="https://databricks.com/privacy-policy" target="_blank">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use" target="_blank">Terms of Use</a> | <a href="https://help.databricks.com/" target="_blank">Support</a>