# 9.2 - Exploring One Column of Data

## Welcome to Jupyter Notebooks
Jupyter Notebooks let you write code, run it, and see the results right away. In this notebook, you’ll explore a dataset and make charts to understand what the data is telling you.

A notebook is made of **cells**. Some cells have text (like directions), and some have code. To run a code cell, click the play button or press `Shift + Enter` while that cell is selected.

## Prerequisites

### Install extensions

Before starting, install the recommended Visual Studio Code extensions:
1. Open the **Extensions** tab ![Extension Icon](ExtensionIcon.png) on the left side.
2. Click the filter icon ![Funnel Icon](funnel.png) and choose `Recommended`.
3. Click the cloud icon ![Cloud Icon](cloud.png) to install the recommended extensions.

### Install libraries

Run the next code cell to install the Python libraries for this activity.

In [None]:
# install the necessary libraries for this notebook
%pip install pandas
%pip install matplotlib
%pip install numpy

### Loading the Libraries

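After the libraries are installed, run the next cell to import pandas into Python. You should do this at the start of each new notebook. pandas uses matplotlib behind the scenes to draw charts, so you don't need to import matplotlib yourself for this activity.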

In [None]:
import pandas

## Looking at the Data

Now we’ll load the data and explore it. Run each code cell in order and look at what changes in the output. If you edit a cell, run it again to update the result.

We’ll use a CSV file called `dogs.csv` that has information about dog breeds. To view it:
- Open the Explorer panel ![Explorer icon](ExplorerIcon.png) on the left.
- Right-click `dogs.csv` and choose **Open With...**
- Select **Text Editor**.

Each row is one dog breed. Each column (separated by commas) is a different trait. The first row has the column names.

### Loading the Data

We’ll use pandas to read the CSV file, then preview the first few rows.

In [None]:
# Load data about dogs from the CSV file
dogInfo = pandas.read_csv('dogs.csv')

# Display the first few rows of the data
dogInfo.head()

You can change how many rows you see by changing the number inside `head()`. Try `head(10)` or `head(20)`, then run the cell again.
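
Here is a small sketch of how `head()` responds to the number you give it. It builds a tiny made-up table instead of reading `dogs.csv`. The name `sample` and every value in it are invented for illustration; the real file has more rows and columns.

```python
import pandas

# Made-up rows for illustration only; these are NOT taken from dogs.csv
sample = pandas.DataFrame({
    'Breed Group': ['Toy', 'Hound', 'Herding', 'Toy', 'Working', 'Hound'],
    'Maximum Height': [10, 27, 22, 11, 28, 25],
})

first_two = sample.head(2)  # only the first 2 rows
default = sample.head()     # with no number, head() returns up to 5 rows

print(len(first_two))  # 2
print(len(default))    # 5
```

The made-up table has 6 rows, so `head()` with no number leaves the last row out.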

### Data Description

To see all column names in the dataset, run the next cell with `keys()`.

In [None]:
dogInfo.keys()

To look at one column, use its name in square brackets. Example: `dogInfo['Breed Group']`
**Notice:** this uses square brackets `[]`, just like list indexing. In a DataFrame, the column names (keys) go inside the brackets to choose which column to display. Try a few different column names to see their data.

In [None]:
dogInfo['Breed Group']

The dataset has over 100 rows, so printing everything at once is hard to read.
- Use `head()` to see the first rows.
- Use `tail()` to see the last rows.
- You can pass a number to either one, like `tail(10)`.

Try changing the cell above to show the last 10 rows of data.
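
As a rough sketch of how column selection and `tail()` work together, the cell below uses a tiny made-up table. The name `sample` and its values are assumptions for illustration and do not come from `dogs.csv`.

```python
import pandas

# Made-up rows for illustration only; these are NOT taken from dogs.csv
sample = pandas.DataFrame({
    'Breed Group': ['Toy', 'Hound', 'Herding', 'Toy', 'Working'],
    'Maximum Height': [10, 27, 22, 11, 28],
})

groups = sample['Breed Group']  # pick one column by its name
last_two = groups.tail(2)       # last 2 values in that column

print(list(last_two))  # ['Toy', 'Working']
```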

### Unique Values

Another useful step is finding all **different** values in a column. Use `unique()` on `Breed Group` to see all group names.

In [None]:
dogInfo['Breed Group'].unique()

#### Counting Unique Values

`unique()` shows the distinct values. If you only want the number of distinct values, use `nunique()`.
Edit the cell above and try `nunique()` on `Breed Group`.

To count how many times each value appears, use `value_counts()`.
Run the cell below, then change `unique()` to `value_counts()` and run again to see how many breeds are in each group.

In [None]:
dogInfo['Breed Group'].unique()
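
To compare the three methods side by side, here is a minimal sketch on a short made-up column. The values in `groups` are invented for illustration and are not the real breed groups from `dogs.csv`.

```python
import pandas

# Made-up values for illustration only
groups = pandas.Series(['Toy', 'Hound', 'Toy', 'Herding', 'Toy', 'Hound'],
                       name='Breed Group')

distinct = groups.unique()      # each different value, in order of first appearance
how_many = groups.nunique()     # just the number of different values
counts = groups.value_counts()  # how many times each value appears, largest first

print(list(distinct))    # ['Toy', 'Hound', 'Herding']
print(how_many)          # 3
print(counts.to_dict())  # {'Toy': 3, 'Hound': 2, 'Herding': 1}
```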

## Visualizing Data

Now let’s make charts from the data. We’ll start with a bar chart of breed groups.

Use `plot()` and set `kind` to choose a chart type:
- `kind='line'` for a line chart
- `kind='bar'` for a bar chart

Add a clear chart title with the `title` parameter.

In [None]:
dogInfo['Breed Group'].value_counts().plot(kind='bar', title='Dog Breeds by Group');

Bar charts work well for category data because they show counts clearly.

With this chart, you can quickly see:
- all categories (breed groups)
- which group appears most
- which groups appear least
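
If you want to double-check what you read off a bar chart, the numbers behind the bars are available too. This is one possible sketch using made-up values. `idxmax()` and `idxmin()` are pandas methods that return the label with the largest and smallest count.

```python
import pandas

# Made-up values for illustration only; not the real breed groups
groups = pandas.Series(['Toy', 'Hound', 'Toy', 'Herding', 'Toy', 'Hound'],
                       name='Breed Group')

counts = groups.value_counts()  # the bar heights in a bar chart of this column

most_common = counts.idxmax()   # label of the tallest bar
least_common = counts.idxmin()  # label of the shortest bar

print(most_common)   # Toy
print(least_common)  # Herding
```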

### Your turn

Use the next code cell to create a bar chart for each column in the dataset.
As you make each chart, ask: **Is this a good chart for this type of data? Why or why not?**
Complete page 1 of your activity guide as you go. Use the **Copy** button above a chart when you need to paste it into your guide.

After you check all columns, choose one chart to analyze more deeply on page 2.

## Histograms

Run the next cell to make a bar chart for `Maximum Height` (in inches).
The x-axis shows each maximum height that appears in the data, and the y-axis shows how many breeds have that maximum height.

In [None]:
dogInfo['Maximum Height'].value_counts().plot(kind='bar', 
                                              title='Distribution of Dog Breeds by Height',
                                              ylabel='Number of Breeds',
                                              xlabel='Maximum Height (inches)');

This chart has two issues:
1. Bars are ordered from the largest count to the smallest, not by height.
2. There are many height values, so it looks crowded.

To fix the first issue, sort by height before plotting.
In the next cell, add `sort_index()` after `value_counts()`, like this:
`...value_counts().sort_index().plot(...)`

Run it and compare with the previous chart. What changed?

In [None]:
dogInfo['Maximum Height'].value_counts().plot(kind='bar',
                                              title='Distribution of Dog Breeds by Height', 
                                              ylabel='Number of Breeds',
                                              xlabel='Maximum Height (inches)');
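
To see what `sort_index()` changes without drawing a chart, here is a small sketch with made-up heights. The values in `heights` are invented for illustration. The index (the left-hand labels) holds the heights, and the values hold the counts.

```python
import pandas

# Made-up heights in inches, for illustration only
heights = pandas.Series([24, 10, 24, 18, 10, 24], name='Maximum Height')

by_count = heights.value_counts()  # ordered by count, largest first
by_height = by_count.sort_index()  # same counts, reordered by height

print(list(by_count.index))    # [24, 10, 18]
print(list(by_height.index))   # [10, 18, 24]
print(list(by_height.values))  # [2, 1, 3]
```

The counts stay attached to their heights. Only the order of the bars changes.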

Sorting makes the x-axis easier to read, but the chart is still crowded. A histogram can help.
A histogram groups values into ranges (bins), which makes patterns easier to see.

We'll use `hist()` instead of `value_counts().plot()`. Run the next cell to see what happens with `bins=5`.

**Note:** `hist()` works a little differently. After creating the chart, set labels using:
- `set_title()`
- `set_xlabel()`
- `set_ylabel()`

In [None]:
# hist() gives back the chart's axes object; we save it as ax and use it to add labels
ax = dogInfo['Maximum Height'].hist(bins=5)
ax.set_title('Distribution of Dog Breeds by Height')
ax.set_xlabel('Maximum Height (inches)')
ax.set_ylabel('Number of Breeds');
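
The sketch below, which uses made-up heights, shows why the labels are set after the chart is created: `hist()` returns a matplotlib axes object, and the `set_...()` methods belong to that object. The variable name `ax` and the values in `heights` are choices made for this example only.

```python
import pandas

# Made-up heights in inches, for illustration only
heights = pandas.Series([8, 9, 11, 18, 22, 24, 27, 28, 31, 32],
                        name='Maximum Height')

ax = heights.hist(bins=5)  # draws the histogram and returns its axes object
ax.set_title('Distribution of Dog Breeds by Height')
ax.set_xlabel('Maximum Height (inches)')
ax.set_ylabel('Number of Breeds')

# The axes object remembers the labels we gave it
print(ax.get_title())
print(ax.get_xlabel())
```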


Now the bars are wider and grouped, so trends are easier to spot.
The x-axis now shows ranges of heights instead of exact heights.

In the next cell, change `bins` to a list of cut points (bin edges).
Example: `bins=[0, 10, 20, 30, 40]`

Run it and compare with the previous histogram. What do you notice?

In [None]:
ax = dogInfo['Maximum Height'].hist(bins=5)
ax.set_title('Distribution of Dog Breeds by Height')
ax.set_xlabel('Maximum Height (inches)')
ax.set_ylabel('Number of Breeds');

If you use 5 numbers like `[0, 10, 20, 30, 40]`, you get 4 bars.
That’s because the numbers are **edges** of the bins, and each bar is the space between two edges.

Example: `bins=[0, 10, 25, 45]` gives 3 bars with different widths.
Try different bin edges and see how the chart changes.
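
To check the "5 edges, 4 bars" idea with numbers, here is a hedged sketch that sorts made-up heights into bins with `pandas.cut` and counts each bin. The heights are invented, and none of them sits exactly on an edge, because `cut` and `hist()` handle edge values differently.

```python
import pandas

# Made-up heights in inches, for illustration only
heights = pandas.Series([8, 9, 11, 18, 22, 24, 27, 28, 31, 32])

edges = [0, 10, 20, 30, 40]                # 5 edges
binned = pandas.cut(heights, bins=edges)   # which bin each height falls into
counts = binned.value_counts().sort_index()  # count per bin, in bin order

print(len(edges))           # 5
print(len(counts))          # 4 (one count per bar)
print(list(counts.values))  # [2, 2, 4, 2]
```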

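You can also let pandas choose bin edges automatically by picking a bin count.
To see the edges for a given bin count, run the next cell, which uses `pandas.cut(..., retbins=True)`. These edges closely match the ones `hist()` draws. The only difference is that `cut` nudges the lowest edge down slightly so the smallest value is included.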

In [None]:
# Calculate the bin edges ahead of time. cut returns two things: the bin each
# value falls into, and the bin edges. We only need the edges, so the
# underscore ignores the first result.
_, edges = pandas.cut(dogInfo['Maximum Height'], bins=5, retbins=True)

ax = dogInfo['Maximum Height'].hist(bins=5)
ax.set_title('Distribution of Dog Breeds by Height')
ax.set_xlabel('Maximum Height (inches)')
ax.set_ylabel('Number of Breeds')
ax.set_xticks(edges); # set the x-ticks to the bin edges we calculated
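
Here is a small sketch of what `retbins=True` gives back, using made-up heights instead of `dogs.csv`. The values in `heights` are invented for illustration. Asking for 5 bins should produce 6 edges that cover the smallest and largest heights.

```python
import pandas

# Made-up heights in inches, for illustration only
heights = pandas.Series([8, 9, 11, 18, 22, 24, 27, 28, 31, 32])

# With retbins=True, cut returns the binned values AND the edges it used
_, edges = pandas.cut(heights, bins=5, retbins=True)

print(len(edges))  # 6 edges for 5 bins
print(edges)       # evenly spaced edges from about the minimum to the maximum
```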

### Your turn

Use the next code cell to make a histogram for `Maximum Weight`.
Try different bin counts and bin edge lists.

As you test, think about:
- What patterns become easier to see?
- Which bin setup tells the clearest story?

Complete page 3 of your activity guide when you choose your best bin setup.

Use the code cell below to create a histogram for one more column of your choice.
Test different bin counts and bin edges, then decide which version is easiest to understand.
Complete page 4 of your activity guide after you pick your final chart.

In [None]:
dogInfo.keys() # get the column names