Based upon `LLM 00b - Introduction to Databricks.py`

## Reading data
When you ran the **Setup** cell at the top of the notebook, several variables were created for you. One of them is `DA.paths.datasets`, the path to the datasets used during this course.
One such dataset is located at **`{DA.paths.datasets}/news/labelled_newscatcher_dataset.csv`**. Let's use `pandas` to read that CSV file.


In [None]:
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
DA_paths_datasets = "../data"  # local stand-in for DA.paths.datasets (renamed for local Jupyter notebook versions)

# Specify the location of the csv file
csv_location = f"{DA_paths_datasets}/news/labelled_newscatcher_dataset.csv"
# Read the dataset
newscatcher = pd.read_csv(csv_location, sep=";")
# Display the first few rows of the dataset
newscatcher.head()
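The `sep=";"` argument matters because `pd.read_csv` assumes a comma delimiter by default. Here is a minimal sketch of the difference, assuming a tiny in-memory sample in the same semicolon-delimited layout. The two rows are made up for illustration and are not taken from the real dataset; only the `topic` and `title` column names come from this notebook.

```python
import io

import pandas as pd

# Hypothetical two-row sample; the real file has more columns and many more rows
sample = "topic;title\nSCIENCE;A made-up headline\nHEALTH;Another made-up headline\n"

# With the default comma separator, each line is read as one fused column
wrong = pd.read_csv(io.StringIO(sample))

# With sep=";", the columns are split as intended
right = pd.read_csv(io.StringIO(sample), sep=";")

print(wrong.shape)           # (2, 1)
print(right.shape)           # (2, 2)
print(list(right.columns))   # ['topic', 'title']
```

If `newscatcher.head()` ever shows a single wide column, a mismatched separator is a likely cause.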

We can now use `matplotlib` to plot aggregate data from our dataset.

The `display()` command pretty-prints a wide variety of data types, including Apache Spark DataFrames and pandas DataFrames.

In a Databricks notebook, it also lets you create visualizations without writing additional code. For example, after executing the `display()` cell below, click the `+` icon in the results to add a visualization. Select the **Bar** visualization type and click **Save**.

In [None]:
# Count how many articles exist per topic
newscatcher_counts_by_topic = (
    newscatcher.loc[:, ["topic", "title"]].groupby("topic").agg("count").reset_index(drop=False)
)

# Create a bar plot
plt.bar(newscatcher_counts_by_topic["topic"], height=newscatcher_counts_by_topic["title"])
plt.xticks(rotation=45)
plt.show()
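To see what the aggregation above computes, here is a small sketch on a toy DataFrame. The three rows are hypothetical and exist only to show the behaviour. Note that `agg("count")` counts non-null `title` values per topic, while `groupby(...).size()` counts rows, so the two differ when a title is missing.

```python
import pandas as pd

# Hypothetical toy data: one SCIENCE row has a missing title
toy = pd.DataFrame(
    {
        "topic": ["SCIENCE", "HEALTH", "SCIENCE"],
        "title": ["a", "b", None],
    }
)

# Same pattern as in the cell above: count non-null titles per topic
by_count = toy.loc[:, ["topic", "title"]].groupby("topic").agg("count").reset_index(drop=False)

# Alternative: count rows per topic regardless of missing titles
by_size = toy.groupby("topic").size().reset_index(name="rows")

print(by_count)
print(by_size)
```

For a dataset without missing titles, both approaches give the same numbers.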

In [None]:
display(newscatcher_counts_by_topic)

Databricks Runtime (DBR) environments come with many pre-installed libraries (for example, <a href="https://docs.databricks.com/release-notes/runtime/13.1ml.html#python-libraries-on-cpu-clusters" target="_blank">DBR 13.1 Python libraries</a>), but sometimes you'll want to install additional ones.

Additional libraries can be installed directly onto your cluster in the **Compute** tab, or you can install them with a scope specific to your individual notebook using the `%pip` magic command.

Because you'll sometimes need to restart your Python kernel after installing a new library via `%pip`, it's considered best practice to put all `%pip` commands at the very top of your notebook.
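Before importing a freshly installed library, it can help to confirm that the running kernel actually sees it. The following is a minimal sketch using the standard-library `importlib.metadata` module. The helper name `installed_version` is our own, not part of Databricks or pip.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if the package is visible to this kernel, otherwise None
print(installed_version("nlptest"))
```

If this prints `None` right after a `%pip install`, restarting the kernel is a reasonable first thing to try.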



Now we can import the newly installed `nlptest` package (see the notes in the comments below).

In [None]:
# Install first with: pip install nlptest==1.1.0 (this also required installing the setuptools package)

import nlptest

# Note that this library has been superseded by `langtest`
# http://langtest.org/docs/pages/docs/install