# Databricks Lakehouse Platform

<p align="center">
    <img src="images/DatabricksLogo.png" width="300" height="150"/>
</p>

The [*Databricks Lakehouse Platform*](https://www.databricks.com/product/data-lakehouse) is a powerful solution for managing data at scale, offering a comprehensive environment that goes beyond data storage. It encompasses a wide range of tasks and capabilities, including efficient data ingestion, storage, ETL operations, and unified analytics supporting multiple programming languages. Users can seamlessly perform diverse activities, from building end-to-end machine learning pipelines to real-time streaming analytics, within this holistic platform.

Databricks is built upon *Apache Spark*, a powerful open-source distributed computing system that enables parallel processing of large datasets. Spark provides a unified analytics engine for big data processing, offering high-level APIs in multiple programming languages like Scala, Python, and SQL. This integration with Spark allows the platform to facilitate the creation, deployment, and maintenance of enterprise-grade data solutions. 

Moreover, Databricks integrates with cloud storage across the major cloud providers (AWS, Azure, and Google Cloud), giving users a consistent experience on each of them. Because the platform is not tied to a single provider, it also helps address concerns about **vendor lock-in**, giving organizations the freedom to adapt their data management strategy.

## Databricks as a Lakehouse Platform

To understand the innovation behind Databricks, let's explore the concept of a *Lakehouse*. Traditionally, data management involved the use of *Data Lakes* and *Data Warehouses*:

- A **Data Lake** is a storage system that allows for the ingestion of large amounts of data without predefined structures. It provides flexibility, enabling storage of data in its raw format, making it suitable for a variety of data types.

- A **Data Warehouse** is a structured, high-performance database optimized for analytical queries. Unlike a Data Lake, it requires predefined schemas, organizing data into tables. It excels at handling structured data, making it ideal for analytical tasks.

Now, envision a **Lakehouse** – Databricks' innovative approach that combines the best of both worlds. It seamlessly blends the flexibility and scalability of a Data Lake with the structure and efficiency of a Data Warehouse. In essence, Databricks acts as a unified platform capable of handling a spectrum of data types – from raw and unstructured to processed and organized.

## Deploying Architecture using Databricks

### Control Plane

The **Control Plane** serves as the command center for managing the Databricks infrastructure. Databricks hosts the Control Plane. Here, configurations are set, access controls are defined, and security policies are established. This plane takes responsibility for creating and managing clusters, ensuring optimal resource utilization, and governing the overall lifecycle of the Databricks platform.

In a broader context, the concept of a "plane" in cloud architecture refers to a distinct layer of functionality. 

### Data Plane

In contrast, the **Data Plane** is where the actual data processing and storage activities occur, and it resides in the cloud provider's environment. Databricks leverages cloud-native storage solutions like AWS S3, Azure Data Lake Storage, or Google Cloud Storage for storing both raw and processed data. Within the Data Plane, computational tasks are distributed across clusters, which are groups of virtual machines hosted in the chosen cloud provider's environment. This separation means that the computational workload runs in the cloud provider's infrastructure, while management stays in the Control Plane.

<p align="center">
    <img src="images/Databricks Architecture.png" width="550" height="500"/>
</p>

### Deployment Workflow

To make use of the capabilities of the Databricks Lakehouse Platform, teams follow a streamlined deployment workflow that starts with the creation of a Databricks cluster. This cluster will be the heart of all computational tasks in Databricks. We will talk about clusters in more detail in a later section, but for now let's have a look at the deployment workflow:

1. **Configuration through the Workspace**

   - Administrators start the process of configuring important aspects of the platform through the Control Plane <br><br>

2. **Cluster Creation with Control Plane**

   - Using web-based cluster management tools within the Control Plane, administrators can create and manage clusters in the cloud environment. These clusters will be hosted in the Data Plane and used for different data processing tasks. <br><br>

3. **Workspace Interaction for Collaboration**

   - Data scientists and engineers actively engage with Databricks, benefiting from collaborative features facilitated by the Control Plane. *Databricks Notebooks* serve as a collaborative canvas for code development, query execution, and analysis, fostering teamwork and innovation. We will discuss Notebooks in more detail later in this lesson. <br><br>

4. **Data Processing in the Cloud**

   - The Data Plane comes into play as cluster VMs within the cloud provider's infrastructure process and store data
   - Utilizing cloud-native storage solutions, such as AWS S3, Azure Data Lake Storage, or Google Cloud Storage, the Data Plane orchestrates the storage of both raw and processed data <br><br>

5. **Sharing Insights and Collaboration**

   - Insights and results derived from data processing are shared, becoming the focal point for knowledge exchange among team members. Team members collaboratively analyze results, interpreting findings and refining approaches iteratively.

## Create a Databricks Account

To explore the different features Databricks has to offer, the first step is to create a Databricks account. Depending on your cloud provider, there are different types of Databricks accounts available. To explore the platform without the need for a cloud provider account, *Databricks Community Edition* offers a free and accessible option, making it ideal for individuals and small teams.

Follow these steps to sign up for a Databricks Community Edition account:

- Visit the [Databricks Community Edition](https://docs.databricks.com/en/getting-started/community-edition.html) page using your web browser
- Click on the **Try Databricks** button at the top-right of the page to begin the registration process
- Fill in essential details, including your name, phone number, and country
- Click **Continue** to proceed. When prompted to choose a cloud provider, opt for **Get started with Community Edition** at the bottom of the page to create a Databricks account without a cloud provider account.

<p align="center">
    <img src="images/CommunityEdition.png" width="300" height="500"/>
</p>

- After selecting Community Edition, you will be redirected to a page instructing you to check your email
- Look for an email from Databricks and follow the link provided to confirm your email address
- Once verified, set a password for your Databricks Community Edition account, and you should be redirected to the Databricks homepage

Now that your account is ready, let's familiarize ourselves with the various features it offers.

## Navigating the Databricks Workspace

Upon logging into your Databricks account, you'll find yourself in the default *persona* - the Data Science and Engineering persona.

<p align="center">
    <img src="images/DefaultPersona.png" width="850" height="400"/>
</p>

This persona is designed to cater to the needs of both data engineers and data scientists. However, Databricks offers flexibility by allowing you to switch between different personas, such as Machine Learning and SQL, based on your specific tasks and preferences.

### Changing Personas

To switch personas, you can use the "persona switcher" located in the upper-left corner of the page.

<p align="center">
    <img src="images/SwitchPersona.png" width="900" height="450"/>
</p>

Depending on your needs, you can choose between the following personas:

- **Data Science and Engineering**

  - Ideal for tasks involving data exploration, analysis, and engineering
  - Combines features essential for both data engineers and data scientists <br><br>

- **Machine Learning**

  - Tailored for developing and deploying machine learning models
  - Provides tools for model training, tuning, and deployment <br><br>

- **SQL**

  - Not present in Community Edition
  - Focused on executing SQL queries for data analysis and reporting 

### Main Features in Data Science and Engineering Persona

We will only use the Data Science and Engineering persona going forward, so let's understand its main features. Let's start by creating an expanded view of the menu options for quick access. To do this, click on the **Menu options** at the bottom-left of the menu bar, then select **Expand**.

<p align="center">
    <img src="images/ExpandMenu.png" width="900" height="500"/>
</p>

This will keep the menu options expanded, providing easier navigation and access to the tools and features you use frequently.

#### 1. Workspace

The **Workspace** tab serves as the central hub for your projects, offering a unified environment for data engineers, data scientists, and analysts to collaborate. Within the Workspace, teams can collectively develop code, create notebooks, and share insights. This collaborative hub streamlines workflows, promoting agility and innovation. 

Here, you'll notice two main folders: **Shared** and **Users**.

<p align="center">
    <img src="images/WorkspaceFolders.png" width="700" height="250"/>
</p>

The **Shared** folder is designed for collaborative efforts within your team. It's a shared space where team members can collaborate on projects, share notebooks, and work together on tasks. Notebooks and projects placed in this folder are accessible to everyone within your team.

The **Users** folder is personalized for each individual user in the Databricks environment. Within this folder, you can organize your work in a way that suits your preferences. It provides a private space for your notebooks and projects. Anything stored here is only accessible to you.

Within the Workspace, you have the ability to create folders and notebooks to structure your work efficiently:

<p align="center">
    <img src="images/CreateNotebook.png" width="950" height="350"/>
</p>

Alternatively, you can also create Notebooks using the **Create** button in the menu options. We will discuss Notebooks in more detail later in this lesson.

#### 2. Data

The **Data** tab can be used to manage tables, databases, and data storage. Please note that this tab will appear empty if no clusters have been created. However, normally this tab allows you to visualize your data for analysis. It provides a seamless way to interact with your data, making data exploration easier.

#### 3. Compute

In the **Compute** tab, you can manage clusters for data processing. This involves creating, configuring, and monitoring clusters based on your specific processing needs.

> Clusters are the fundamental units providing computational resources for running code, executing queries, and processing data in Databricks. When connected to a cloud provider, Databricks clusters represent a collection of Virtual Machines (VMs) within the cloud provider account.

There are two primary types of clusters:

- **All-Purpose** clusters:

  - Are versatile and suitable for a variety of data processing tasks. They offer a balanced combination of CPU and memory resources. 
  - They are designed to run continuously, making them ideal for persistent tasks like ongoing data exploration and analysis
  - They are also suited for interactive tasks where users need continuous access to a computing environment for running queries, experimenting with code, and exploring data interactively <br><br>

- **Job** clusters:

  - Are created dynamically for the duration of a specific job and terminate automatically once the job is completed 
  - They are efficient for running specific tasks or jobs, especially those scheduled at intervals
  - They provide resource isolation for each job, ensuring that the job's execution does not interfere with other ongoing tasks in the environment

Let's create an all-purpose cluster that we will use throughout this course:

- From the **Compute** tab, select **Create compute** under the **All-purpose compute** tab
- Give your cluster a name and leave the Databricks runtime version as the default one
- Click **Create compute** to begin the process of creating a cluster
- Creating the cluster might take a couple of minutes. Once the cluster has been created, you should be able to see it in the **All-purpose compute** tab:

<p align="center">
    <img src="images/TrainingCluster.png" width="950" height="350"/>
</p>

Now that you have a cluster in your workspace, notice that if you navigate back to the **Data** tab, you will have a new database there. This default database serves as the initial location for data in Databricks. It's a pre-configured storage space where you can organize and manage your datasets.

> IMPORTANT: Note that in Databricks Community Edition, your cluster might automatically shut down after a period of inactivity (typically around 30 minutes). You also won't be able to restart a cluster that has shut down, so you will need to create a new one every time this happens.

#### 4. Workflows

The **Workflows** tab is dedicated to managing and monitoring data workflows and jobs. Here, you can schedule and track the progress of your data processing tasks. We will not cover workflows in this course.

## Databricks Notebooks

> **Databricks Notebooks** are interactive, web-based documents that enable users to integrate code, queries, visualizations and narrative text in a collaborative environment. These notebooks provide a unified platform for end-to-end data solutions, allowing users to work with different programming languages and technologies within a single interface.

Key features of Databricks Notebooks include:

- **Multi-Language Support**: Databricks Notebooks support multiple programming languages, including Python, Scala, SQL, and R. This flexibility allows users to choose the language that best suits their data processing and analysis needs.

- **Collaborative Environment**: Notebooks promote collaboration among team members by providing a shared workspace where multiple users can contribute, edit, and comment on the same documents

- **Interactive Data Exploration**: Users can perform interactive data exploration and analysis by writing and executing code cells in real-time. Notebooks offer an iterative development process, allowing users to experiment with code and immediately see the results.

- **Version Control**: Notebooks automatically track history, enabling users to review and revert to previous states. This version control feature provides a safety net for experimentation.

To create a new Notebook in Databricks, navigate to the **Workspace** tab. Select the location where you want to create the Notebook, your user folder for example. Click on the **Create** button, then select **Notebook**.

<p align="center">
    <img src="images/CreateNotebook2.png" width="950" height="400"/>
</p>

You can change the notebook name by clicking on the notebook name at the top (**Untitled Notebook ...**) and then entering a new name, for example **My first notebook**.

In order to run commands in a notebook, we first have to attach the notebook to a cluster, which will provide the compute power for our tasks. To do so, click on the **Connect** button at the top-right of the page, and then select the cluster you have previously created:

<p align="center">
    <img src="images/AttachNotebook.png" width="850" height="250"/>
</p>

Once attached, the **Connect** button should display the name of the cluster you've just attached your notebook to.

> The default language for a notebook can be set from the drop-down next to the notebook name. Notice that the default language for your notebook is currently Python, but you can change this to SQL, Scala or R.

### Notebook Cells

> In Databricks Notebooks, a cell is a fundamental unit of content that can contain code, queries, or text. Cells enable users to organize and structure their work in a modular and interactive manner. 

There are primarily three types of cells in Databricks Notebooks: *code cells*, *SQL cells*, and *Markdown cells*. Notice that when we created our first notebook, it contained an empty code cell by default.

> **Code cells** are used to write and execute code snippets in various programming languages such as Python, Scala and R. Our existing code cell has Python set as the default programming language. This can be changed using the programming language drop-down menu:

<p align="center">
    <img src="images/CellLanguage.png" width="850" height="200"/>
</p>

Code cells can be executed individually, allowing users to interactively run and test code in a step-by-step manner. For example, let's write a simple Python command in our code cell:

```python
print("Hello, Databricks!")
```

> To run a cell press the play button or press **Shift + Enter**.

Once the cell has run, you should see the following output:

<p align="center">
    <img src="images/FirstCell.png" width="750" height="150"/>
</p>

You can create new cells using the **+** button above or below an existing cell. By default, this creates a cell of the same type as the previous one, but you can switch the cell type (code, SQL, Markdown) from the toolbar.

<p align="center">
    <img src="images/InsertCell.png" width="750" height="200"/>
</p>

> SQL cells are specifically designed for writing and executing SQL queries against structured data. They can interact with tables and datasets, making them well-suited for relational data analysis. 

An example of an SQL cell command would be:

```sql
SELECT * FROM table_name
```

> Markdown cells allow users to write formatted text, create headings, lists, links, or embed images to provide rich documentation. They support Markdown syntax, enabling users to structure content in a visually appealing way. They are often used to provide explanations, instructions, or narrative alongside code.

For example, the following Markdown code:

<p align="center">
    <img src="images/MarkdownCode.png" width="850" height="325"/>
</p>

Would be displayed like this, once the cell is run:

<p align="center">
    <img src="images/MarkdownRan.png" width="850" height="300"/>
</p>


### Notebook Magic Commands

Databricks Notebooks offer a multitude of powerful features, and among them, *magic commands* stand out as a versatile tool for improving interactive data analysis and overall coding experience. **Magic commands** in Databricks are prefixed with a percentage sign (`%`) and provide a wide range of functionalities, from switching programming languages to interacting with the Databricks File System (DBFS) and even running different notebooks.

#### 1. Switching Programming Languages

You can use magic commands to switch between different programming languages within the same notebook. While the entire notebook may have a default programming language, which can be set using the toggle next to the notebook name, you have the flexibility to use different programming languages in different notebook cells using magic commands.

For instance, in a Python notebook, you can use `%scala` or `%sql` magic commands within specific cells to introduce Scala or SQL code. This flexibility allows you to combine the strength of multiple languages within a single notebook, catering to the diverse requirements of your data analysis and processing tasks.
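As a complement to the `%sql` magic command, SQL can also be issued from a Python cell. The following is a minimal sketch, assuming it is used inside a Databricks notebook where a `spark` session object is predefined. The helper `preview_query` is a hypothetical name introduced here for illustration, and `table_name` is the same placeholder used in the SQL cell example earlier.

```python
def preview_query(table_name, limit=5):
    """Build a SQL query string that previews the first rows of a table."""
    return f"SELECT * FROM {table_name} LIMIT {limit}"


query = preview_query("table_name")
print(query)

# Inside a Databricks notebook attached to a cluster, the predefined `spark`
# session could run the same query from a Python cell (assumption: the table exists):
#     display(spark.sql(query))
# A cell starting with the `%sql` magic command would contain the raw SQL instead.
```

Keeping the query in a Python string is handy when it has to be built from variables, while `%sql` cells tend to be more readable for fixed queries.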

#### 2. Running Different Notebooks

Using the `%run` magic command, you can execute code from a different notebook within your current notebook. This promotes modularity and code reusability, enabling you to build a library of functions or code snippets and easily share them across notebooks.

Let's look at an example of using the `%run` magic command. Start by creating a new notebook in your workspace. Call this notebook **New Notebook**. In the first cell of this notebook, write the following Python code to define a function:

<p align="center">
    <img src="images/NewNotebook.png" width="850" height="250"/>
</p>

Navigate back to your first notebook. In a new cell in this notebook, use the `%run` magic command to execute the cells from the **New Notebook**. This command needs the file path of the new notebook, which you can copy from the Workspace navigator as shown below:

<p align="center">
    <img src="images/CopyFilePath.png" width="850" height="400"/>
</p>

After copying the file path of the new notebook, run that notebook from your first notebook using the following syntax: `%run path/to/new_notebook`, as seen in the image below. Once it has run, the function defined in **New Notebook** is available in your first notebook. Call it and print the result, and you should see the square of the given number.

<p align="center">
    <img src="images/RunMagicCommand.png" width="850" height="400"/>
</p>
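The `%run` example above can be sketched in text form as follows. This is an illustrative reconstruction rather than the exact code from the screenshots: the function name `square` and the input value `4` are assumptions, and the `%run` line is shown as a comment because it is a notebook magic command, not Python, and has to sit in a cell of its own.

```python
# --- Cell in "New Notebook" ---
def square(number):
    """Return the square of the given number."""
    return number * number


# --- Cell in the first notebook (containing only the magic command) ---
# %run path/to/new_notebook

# --- Next cell in the first notebook ---
# After %run, functions defined in "New Notebook" are available here.
result = square(4)
print(result)  # 16
```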


#### 3. File System (fs) Magic Commands

Magic commands prefixed with `%fs` provide a way of interacting with the *Databricks File System (DBFS)*.

> **DBFS** is a distributed file system mounted on top of existing cloud storage solutions (such as AWS S3, Azure Data Lake Storage, or Google Cloud Storage) to provide a unified and scalable storage layer for Databricks.

DBFS provides a hierarchical file system, and its structure is similar to that of a traditional file system:

- `dbfs:/`: The root of DBFS
- `dbfs:/mnt/`: Mount point for cloud storage
- `dbfs:/mnt/data/`: Example directory within DBFS

DBFS is accessible to all clusters in a Databricks Workspace, making it a centralized and shared storage layer. It supports various file formats and is seamlessly integrated with Databricks Notebooks.

A common `%fs` magic command is `%fs ls`, which lists the files and directories within DBFS. When you run this command, you will typically see output similar to the following:

```text
dbfs:/databricks-datasets/
dbfs:/databricks-results/
dbfs:/mnt/
```
  - `dbfs:/databricks-datasets/`: This directory contains curated datasets provided by Databricks for exploration and learning purposes
  - `dbfs:/databricks-results/`: This directory is used to store temporary results generated during the execution of notebooks or jobs
  - `dbfs:/mnt/`: The `mnt/` directory is a common mount point for cloud storage, allowing you to access external data sources. This won't be present in Databricks Community Edition, as this workspace is not directly associated with any cloud provider.

Other commands include:

- Reading file contents using the `%fs head` command

  - Use `%fs head` to view the contents of a file in DBFS without reading the entire file
  - For example: `%fs head "dbfs:/databricks-datasets/README.md"` <br><br>

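- Copying files using the `%fs cp` command

  - The `%fs cp` command copies files between locations, such as from the local file system of the cluster's driver node to DBFS, or between DBFS directories
  - Paths without a scheme are treated as DBFS paths, so local files need the `file:/` prefix. For example (the `/tmp/` location is illustrative): `%fs cp file:/tmp/localfile.txt dbfs:/mnt/data/localfile.txt`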

#### 4. Databricks Utilities (`dbutils`)

`dbutils` is a versatile module in Databricks that offers a range of utilities for tasks related to file management, collaboration and notebook customization. It provides a set of tools to enhance your interactive data analysis and coding sessions within Databricks.

Unlike `%fs`, `dbutils` is not a magic command: its utilities are called as regular functions from a code cell, for example a Python cell. Commonly used calls include:

- `dbutils.fs.cp` to copy files, for example from the driver's local file system to DBFS. This is particularly useful for bringing external data into your Databricks environment. For example (the `/tmp/` location is illustrative): `dbutils.fs.cp("file:/tmp/localfile.txt", "dbfs:/mnt/data/localfile.txt")`.

- `dbutils.fs.ls` to list files and directories in DBFS. It provides an overview of the file structure within a specified directory. For example: `dbutils.fs.ls("dbfs:/mnt/data/")`.

- `dbutils.notebook.run` to execute other notebooks within the same workspace. It takes the notebook path and a timeout in seconds, for example: `dbutils.notebook.run("/Workspace/OtherNotebook", 60)`. Unlike `%run`, the other notebook runs separately, so the functions it defines are not made available in the calling notebook; the call returns the notebook's exit value instead.

While both `dbutils` and `%fs` provide functionality for file system operations in Databricks, the key distinction lies in the broader capabilities offered by `dbutils`. `dbutils` not only facilitates file manipulation but also allows you to execute other notebooks, create interactive widgets, and manage libraries. This extended functionality enhances the versatility of `dbutils`, making it a preferred choice for a comprehensive set of tasks within Databricks notebooks.
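To make the file system utilities more concrete, here is a minimal sketch built on `dbutils.fs.ls`. It assumes that each entry returned by `dbutils.fs.ls` exposes `path` and `size` attributes, and the helper names `list_dbfs_paths` and `total_size_bytes` are hypothetical, introduced only for illustration. The helpers take `dbutils` as a parameter so that the code can be defined anywhere; inside a Databricks notebook the `dbutils` object is already available.

```python
def list_dbfs_paths(dbutils, directory):
    """Return the full paths of the entries directly inside a DBFS directory."""
    # Each entry returned by dbutils.fs.ls describes one file or sub-directory.
    return [entry.path for entry in dbutils.fs.ls(directory)]


def total_size_bytes(dbutils, directory):
    """Sum the sizes (in bytes) of the entries directly inside a DBFS directory."""
    return sum(entry.size for entry in dbutils.fs.ls(directory))


# In a Databricks notebook attached to a cluster, a cell could then call:
#     list_dbfs_paths(dbutils, "dbfs:/databricks-datasets/")
#     total_size_bytes(dbutils, "dbfs:/databricks-datasets/")
```

Because the results are ordinary Python values, they can be filtered, sorted, or passed on to other code, which is harder to do with the output of a `%fs` cell.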

## Databricks Repos

We've briefly discussed before that Databricks provides version control functionality within notebooks, allowing users to track changes, revert to previous versions, and collaborate effectively.

To access version history, navigate to the desired notebook. In the notebook toolbar at the top of the page, locate the section that says **Last edit was made ...**. Click on this to access the version history panel:

<p align="center">
    <img src="images/VersionHistory.png" width="850" height="500"/>
</p>

This built-in version control for notebooks has limitations. Different Notebook versions can easily be deleted by users, and there is no possibility of working with branches. Databricks proposes an alternative, *Databricks Repos* - a solution that extends version control capabilities and introduces a more robust collaboration environment.

> Note: Databricks Repos are not part of the Databricks Community Edition. We will introduce the workflow to create and use a Databricks Repo here, but you won't be able to follow along at this point. Later when you get to your specialisation project you will get access to a full-feature Databricks account where you will be able to use Databricks Repos. Make sure to come back to this section then.

The UI of the paid version of Databricks looks a bit different from that of the Community Edition:

<p align="center">
    <img src="images/DatabricksAccount.png" width="950" height="500"/>
</p>

Notice that rather than having a toggle bar to switch between the different personas, all the features from the different personas are present in the left-hand side menu under their specific tab. 

Now let's understand the basics of Databricks Repos in more detail:

> **Databricks Repos** is a version control system integrated into Databricks, addressing the shortcomings of standard notebook versioning. It leverages Git for efficient version control, enabling users to manage changes, create branches, and collaborate seamlessly.

### Configure GitHub-Databricks Integration

Databricks integrates with various Git providers, enabling the functionality of Databricks Repos and facilitating collaborative workflows for efficient notebook management. To enable this integration, start by following these steps to configure a connection to GitHub:

- In the Databricks workspace, begin by navigating to your user settings

<p align="center">
    <img src="images/UserSettings.png" width="650" height="300"/>
</p>

- Within your user settings, locate and select the **Linked accounts** option. Here, choose GitHub as your preferred Git provider.

- Use the **Link Git Account** feature to establish a connection to your GitHub account, eliminating the need for a personal access token. This action will redirect you to the following authorization page:

<p align="center">
    <img src="images/DatabricksAuth.png" width="450" height="500"/>
</p>

- Click **Authorize Databricks** to grant the necessary permissions. Upon successful authorization you will be redirected to Databricks.

- If the operation was successful you should see a message indicating that you have linked your GitHub account. Additionally, verify the linked GitHub account status under the **Linked accounts** section.

### Cloning a GitHub Repository in Databricks Repos

In this section, we'll walk through the process of creating a new repository in GitHub and cloning it into Databricks Repos. Begin by creating a new GitHub repository. Initialize this repository with a `README` file. Once created, make a note of the HTTPS URL of the repo.

To clone the newly created repository in Databricks Repos, follow these steps:

- Navigate to your Databricks workspace. In the workspace, click on the **New** button and select **Repo**.

<p align="center">
    <img src="images/AddNewRepo.png" width="450" height="300"/>
</p>

- This will open a new page, where you need to provide the previously copied repository URL, select the Git provider as GitHub, and enter a repository name for Databricks Repos

- Once you have filled in the details, click the **Create Repo** button to initiate the cloning process

- Once the process is complete, you should be able to see a new folder called **Repos** in the Databricks workspace. In this location, you should see the folder belonging to the newly created Databricks Repo.

<p align="center">
    <img src="images/DatabricksRepos.png" width="950" height="200"/>
</p>


### Working with Branches in Databricks Repos

Databricks Repos offer robust version control features, including the ability to work with branches, as mentioned before. Branches allow you to isolate your work, experiment with new features, and collaborate more effectively. Here's a step-by-step guide on how to work with branches:

#### 1. Understanding Branches

- In your Databricks workspace, navigate to the **Repos** folder where your cloned repository is hosted

- Next to the repository name, you can see the current branch (usually set to `main`). Click on the branch name, which will redirect you to the following page, displaying the existing branches:

<p align="center">
    <img src="images/BranchPage.png" width="850" height="450"/>
</p>

- In this window, you can view existing branches using the toggle dropdown next to the current branch name. As you can see above, our current repository only has one branch, the `main` branch. 

- Let's create a new branch called `dev` using the **Create Branch** button. At this point you might be met with the following error message:

<p align="center">
    <img src="images/ErrorMessage.png" width="550" height="450"/>
</p>

- Follow the link to the **Databricks GitHub app installation page**. This will redirect you to the following page:

<p align="center">
    <img src="images/InstallDatabricksApp.png" width="400" height="550"/>
</p>

- You can choose to allow access to all repositories in your GitHub account or only one/few selected ones. In this example, we will give access only to the repository that has been cloned in Databricks Repos. Click **Install** to continue. At this point, you might be asked to log in to GitHub to confirm access.

- Return to Databricks and try creating the `dev` branch again. This time you should be successful.


#### 2. Making Changes

- Create a new file or clone an existing notebook within the `dev` branch. Before proceeding, make sure you are on the correct branch using the toggle next to the repo name.

- To add a new file you can use the **Add** button and choose the file type (folder, notebook, or file)

- To clone an existing file, navigate to the desired file in your workspace, click on the three dots on the right-hand side of the file name, and then select **Clone**

<p align="center">
    <img src="images/CloneNotebook.png" width="550" height="400"/>
</p>

- On the clone page, navigate to the desired location for the cloned notebook. Typically, you'd move back to the Workspace directory, then into the Repos directory, selecting your cloned Databricks Repo folder before clicking **Clone**.

- Once that's done you should see the newly cloned file in the Databricks Repo:

<p align="center">
    <img src="images/RepoClonedFile.png" width="850" height="300"/>
</p>

- After making these changes, commit them by clicking on the branch name next to the repository name. On this page, review the changes, add a descriptive commit message, then commit and push the changes using the **Commit & Push** button.

<p align="center">
    <img src="images/ReposPush.png" width="850" height="450"/>
</p>

#### 3. Merging Branches

To merge the changes from the `dev` branch into `main`, navigate to GitHub. Create a pull request on the GitHub repository to merge the changes from your branch into the `main` branch.

> It's important to note that Databricks Repos don't support merging branches directly. GitHub is the preferred platform for this operation.

#### 4. Pulling Changes

- To retrieve the recently merged changes on the `main` branch, switch to the `main` branch in Databricks Repos

- Use the **Pull** button located on the top-right side of the page to fetch and merge changes from the remote repository

By following these steps, you've successfully worked with branches in Databricks Repos, made changes, pushed them to the remote repository, and even managed to merge changes on GitHub.

## Key Takeaways

- Databricks offers a comprehensive solution for scalable data management, combining the flexibility of a Data Lake with the structure of a Data Warehouse
- The Databricks platform comprises a **Control Plane** for management and a **Data Plane** for processing and storage
- The Databricks Workspace serves as a unified environment where teams can collectively develop code, create notebooks, and share insights
- Databricks clusters, including **All-Purpose** and **Job** clusters, serve as the computational backbone, providing resources for running code, executing queries, and processing data in the cloud
- Databricks Notebooks support multiple languages and provide features such as version control and interactive data exploration
- Magic commands, prefixed with `%`, enhance the interactive coding experience by enabling language switching, running different notebooks, and interacting with the Databricks File System (DBFS), offering a versatile toolset for data analysis and manipulation
- Databricks Repos, integrated with Git, enhance version control capabilities, allowing users to create branches, collaborate effectively, and work seamlessly with GitHub for robust notebook management