<img src="../assets/ittc_logo_full.png" height=150>

# Lecture 3 Practical: Extracting and importing data

## In this Practical

In this Practical you will:

1. Import data from both CSV and Parquet files
2. Explore the DataFrame object
3. Export data to various formats

## 0. Before we begin
First we'll need to install the packages we'll be using today. Run the cell below by clicking on it, then holding `shift` and pressing `enter`.

In [None]:
%pip install -r ../requirements.txt


## 1. Module imports

The first step in using Pandas (and any Python module) is to import it.

Here we import `pandas` and rename it as `pd` to make typing easier.

In [None]:
import pandas as pd

## 2. Importing data

There are three files that we will import and explore. They are:

* **concrete.csv**

    This table contains a range of recipes for concrete (including cement, water, aggregate, and age) and the resulting compressive strength. Compressive strength is the "dependent variable", and the recipe amounts are the "independent variables".

* **chemicalmanufacturing.csv**

    This table contains the yield (the dependent variable) as a function of a large number of manufacturing inputs and processes (the independent variables).

* **backblaze.parquet**

    This is public time series data for computer hard drives used by a large cloud provider. For each hard drive, this time series data records its serial number as well as its failure status.

We will import them with their _relative_ path. Relative to this notebook, they are located up one folder (`..`) and in the `data` directory. Their paths are:

* `../data/concrete.csv`
* `../data/chemicalmanufacturing.csv`
* `../data/backblaze.parquet`

### 2.1 Import

Import each of the above files by choosing between the following functions:

* `pd.read_csv()` [(documentation)](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) 
* `pd.read_parquet()` [(documentation)](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_parquet.html)
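As a minimal sketch of how `skiprows` behaves, the example below reads CSV text from memory rather than from disk. The two junk lines, the column names and the values are all made up for illustration; the real layout of `chemicalmanufacturing.csv` may differ, so check the file before choosing how many rows to skip.

```python
import io

import pandas as pd

# Hypothetical file contents: two non-data lines precede the real header row.
raw_text = """Exported by a hypothetical lab system
Generated 2020-01-01
Yield,Input1,Input2
38.0,6.25,49.58
42.4,8.01,60.97
"""

# skiprows=2 tells pandas to ignore the first two lines,
# so the third line is used as the header.
example = pd.read_csv(io.StringIO(raw_text), skiprows=2)
print(example)
```

Without `skiprows`, pandas would treat the first line as the header and the import would either fail or produce a misshapen table.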

In [None]:
# Import the concrete CSV file

# Set the path to the location of concrete.csv
concrete = pd.read_csv("../data/concrete.csv")

In [None]:
# Import chemical data
# HINT: you might need to use skiprows to fix any errors
# chemical = ...

In [None]:
# backblaze = ...

### 2.2 Display the table

To preview the table you have two options:

* Use the function `display()`
* Jupyter notebooks will automatically display the final value in a cell



In [None]:
# Preview each of the DataFrames using display()
display(concrete)

In [None]:
# Preview the backblaze DataFrame by setting it as the final value in this cell

# Uncomment this line:
# backblaze

## 3. Exploring the DataFrame

The following functions are useful for understanding your imported DataFrame:

* `.columns` to get the names of each column
* `len()` to get the number of rows
* `.shape` to get the (rows, columns) tuple
* `.info()` to get summary information on each column
* `.describe()` to generate an automatic statistical summary of each **numeric** column
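The sketch below tries each of these on a tiny, made-up DataFrame (the column names mimic `concrete.csv`, but the values are invented). Note that the column names and the shape are attributes, so they are accessed without parentheses, while `.info()` and `.describe()` are methods.

```python
import pandas as pd

# Toy data for illustration only; the values are invented.
toy = pd.DataFrame({
    "Cement": [540.0, 332.5, 198.6],
    "Water": [162.0, 228.0, 192.0],
    "Age": [28, 270, 360],
})

n_rows = len(toy)                  # number of rows
column_names = list(toy.columns)   # attribute: no parentheses
n_rows_cols = toy.shape            # attribute: (rows, columns) tuple
summary = toy.describe()           # statistics for each numeric column

print(n_rows, column_names, n_rows_cols)
toy.info()                         # prints dtypes and non-null counts
print(summary)
```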

In [None]:
# How many rows does each DataFrame have? Which is the largest?

display(len(concrete))

# chemical...
# backblaze...

In [None]:
# What are the names of the columns?
display(concrete.columns)

# chemical...
# backblaze...

In [None]:
# How many rows and columns does our table have? What are their data types?
# Explore the metadata about this table using the method .info()
concrete.info()

# chemical...
# backblaze...

In [None]:
# Use the .describe() method on concrete to get a statistical summary of each column

# concrete...

## 4. Selecting columns

There are multiple ways to select columns from your DataFrame including:

* `df["ColumnName"]`
* `df.ColumnName` (but only if the column name doesn't have spaces or special characters)
* `df.loc[:, "ColumnName"]`

When a single column is selected, Pandas will return a `Series` object.

In [None]:
# Select the Age column from concrete using each of these methods

# concrete...

Multiple columns can be selected at once which will return a DataFrame that is a subset of the original:

* `df[["ColumnName1", "ColumnName2"]]`

For example, to select both the Age and Cement columns we would use: `concrete[["Age", "Cement"]]`
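The difference between single and double brackets is easiest to see on a small example. The sketch below uses an invented three-column DataFrame and checks the type of each result.

```python
import pandas as pd

# Toy data for illustration only; the values are invented.
toy = pd.DataFrame({
    "Cement": [540.0, 332.5],
    "Water": [162.0, 228.0],
    "Age": [28, 270],
})

single = toy["Age"]               # single brackets: a Series
same_single = toy.loc[:, "Age"]   # the same Series via .loc
subset = toy[["Age", "Cement"]]   # double brackets: a DataFrame

print(type(single).__name__)
print(type(subset).__name__)
```

The columns of `subset` appear in the order given in the list, not the order in the original table.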

In [None]:
# Select the Age, Cement and Water columns from the concrete DataFrame

# concrete...

It is possible to perform calculations over a whole column similar to Excel.

Methods that will return a single value include:

* `.sum()`
* `.min()` and `.max()`
* `.mean()` and `.median()`

There are also methods that will return multiple values:

* `.unique()` will return every unique value in a column _without_ duplicates
* `.quantile()` will return a value for every quantile specified
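To make the return values concrete, here is a short sketch on an invented `Age` column of five values. The single-value methods return a plain number, `.unique()` returns an array, and `.quantile()` with a list returns a Series indexed by the requested quantiles.

```python
import pandas as pd

# Toy column for illustration only; the values are invented.
ages = pd.Series([3, 7, 28, 28, 90], name="Age")

total = ages.sum()
middle = ages.median()
distinct = ages.unique()                 # array of distinct values
quartiles = ages.quantile([0.25, 0.75])  # Series indexed by 0.25 and 0.75

print(total, middle)
print(distinct)
print(quartiles)
```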

In [None]:
# What is the mean and median compressive strength of all of the concrete recipes?

display(concrete["CompressiveStrength"].mean())

# median: concrete...

In [None]:
# What are the unique age values in the Age column?
concrete["Age"].unique()

In [None]:
# What are the 0.25 and 0.75 quantiles (the 25th and 75th percentiles) of concrete age?
concrete["Age"].quantile([0.25, 0.75])

In [None]:
# What are the 0.25 and 0.75 quantiles of the concrete CompressiveStrength?

# concrete...

## 5. Indexing into your DataFrame

### 5.1 Accessing rows

You can index individual rows using the indexer `.iloc[rownumber]` and ranges using `.iloc[start:end]`:

* e.g. `concrete.iloc[50]` or `concrete.iloc[30:70]`

You can select columns by name:

* Single columns: e.g. `concrete["FlyAsh"]`
* Multiple columns: e.g. `concrete[["FlyAsh", "FineAggregate", "Water"]]`

For advanced help, see the documentation [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html).

In [None]:
# Print row 50
concrete.iloc[50]

In [None]:
# Print rows 100-150 (the end of an .iloc slice is excluded, so we stop at 151)
display(concrete.iloc[100:151])

### 5.2 Accessing cells

Individual cells are accessed by specifying the row followed by the column.

Recall that we have two indexers: `.iloc[]` is integer-based (starting at 0) and `.loc[]` uses the row and column labels directly.

* `concrete.iloc[50, 7]`
* `concrete.loc[50, "Age"]`

In [None]:
# Get the value from the 47th row of the Age column using both methods

# concrete...

We can mix slices of rows and columns:

In [None]:
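# Print a subset: rows 50-70 of just two columns
# .loc slices by label and includes the end label (70)
display(concrete.loc[50:70, ["Cement", "FlyAsh"]])

# Or the same thing but with a different style: which do you prefer?
# .iloc slices by position and excludes the end, so we stop at 71
concrete[["Cement", "FlyAsh"]].iloc[50:71]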

### 5.3 Modifying the DataFrame

We can modify both individual cells as well as full slices in a single command, e.g.

* Update a single cell: `concrete.loc[50, "Age"] = 100`
* Bulk update a slice: `concrete.loc[50:70, "Age"] += 100`
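A small sketch of both kinds of update on an invented `Age` column, so the effect is easy to check by eye:

```python
import pandas as pd

# Toy data for illustration only; the values are invented.
toy = pd.DataFrame({"Age": [3, 7, 28, 90]})

toy.loc[0, "Age"] = 100       # overwrite a single cell
toy.loc[1:2, "Age"] += 100    # add 100 to the rows labelled 1 and 2 (inclusive)

print(toy["Age"].tolist())
```

Both statements modify `toy` in place, so rerunning the cell applies the bulk update again.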

In [None]:
# Add 2 units of time to the Age of each row

# concrete...

## 6. Filtering the DataFrame

Filter rows using the `.query("....")` method:

* e.g. `concrete.query("Water > 150 and 400 < Cement < 500")`

In [None]:
concrete.query("Water > 150 and 400 < Cement < 500")

In [None]:
# Compare the mean CompressiveStrength of mixtures with an Age less than 30 against those with an Age greater than 30

# Less than 30
meanless30 = concrete.query("Age < 30")["CompressiveStrength"].mean()

# Versus greater than 30...
# meanplus30 = ...

# Which is stronger?

We can also use a "pure Python" expression to create a boolean mask, e.g.:

```
mask = (concrete["Water"] > 150) & (concrete["Water"] < 500)
concrete[mask]
```
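The mask is simply a Series of `True`/`False` values, one per row. The sketch below builds one on an invented `Water` column and confirms that it selects the same rows as the equivalent `.query()` call.

```python
import pandas as pd

# Toy data for illustration only; the values are invented.
toy = pd.DataFrame({"Water": [140.0, 160.0, 180.0, 520.0]})

# Each comparison needs its own parentheses because & binds
# more tightly than > and <.
mask = (toy["Water"] > 150) & (toy["Water"] < 500)

filtered_by_mask = toy[mask]
filtered_by_query = toy.query("150 < Water < 500")

print(mask.tolist())
print(filtered_by_mask.equals(filtered_by_query))
```

When building a mask, combine conditions with `&` (and) or `|` (or) rather than the Python keywords `and` and `or`, which do not work element by element on a Series.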

Use this method to filter the `chemical` DataFrame and show only those rows where `ManufacturingProcess38` is 2:

In [None]:
# mask = ...
# chemical[mask]

In [None]:
# For the chemical DataFrame, find the mean Yield for each value of ManufacturingProcess38 (values: 0, 2, 3)
# Does this process seem to make a big difference to the yield?

yield0 = chemical.query("ManufacturingProcess38 == 0")["Yield"].mean()
# yield2 =
# yield3 =

## 7. Advanced: Setting the index

Take a look again at the Backblaze data set:

In [None]:
display(backblaze)

The left-most column is the DataFrame index. Currently this is set simply as the row number (starting from 0).

Is this the most natural index? What if we want to set the index to the date instead?

We can use the method `.set_index()` to set any column as the index:

In [None]:
backblaze_date_index = backblaze.set_index("date")
display(backblaze_date_index)

The reason we might do this is that now we can use our friend the `.loc[]` indexer to index using dates (or date ranges):

In [None]:
backblaze_date_index.loc["2013/05/10"]
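Date ranges work the same way, using a slice of two date strings. The sketch below is a self-contained illustration on invented data; it assumes the `date` column holds real datetime values (here created with `pd.to_datetime`), which may or may not match how the Backblaze file stores its dates. The serial numbers are hypothetical.

```python
import pandas as pd

# Toy data for illustration only; serial numbers and values are invented.
toy = pd.DataFrame({
    "date": pd.to_datetime(
        ["2013-05-09", "2013-05-10", "2013-05-10", "2013-05-11"]
    ),
    "serial_number": ["A1", "A1", "B2", "B2"],
    "failure": [0, 0, 0, 1],
})

# Sorting the index keeps date-range slices well defined.
toy_date_index = toy.set_index("date").sort_index()

one_day = toy_date_index.loc["2013-05-10"]                  # every row on that date
date_range = toy_date_index.loc["2013-05-10":"2013-05-11"]  # both end dates included

print(one_day)
print(date_range)
```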

We could also index instead by `serial_number`.

Set the index to `serial_number` and try indexing using one of the serial numbers, `ZJV4H4Q7`:

In [None]:
# backblaze_serial_index = ...

# backblaze_serial_index.loc["ZJV4H4Q7"]

An index is usually most useful when it is unique, i.e. each index value points to a single row.

This data set can be uniquely indexed by a _combination_ of `serial_number` and `date`.

Pandas allows for hierarchical indexes using the method `.set_index([col1, col2, ...])`:

In [None]:
backblaze_multiindex = backblaze.set_index(
    ["serial_number", "date"],
).sort_index()
display(backblaze_multiindex)

Multiindexes allow you to select data based on one or more of the index values. If the DataFrame index is sorted, this can make selection across very large data sets fast and efficient.
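Before working with the full data set, it can help to see the three selection styles on a table small enough to read in full. The sketch below uses invented serial numbers and keeps the dates as plain ISO-format strings for simplicity (an assumption; such strings sort chronologically, but the real data may store dates differently).

```python
import pandas as pd

# Toy data for illustration only; serial numbers and values are invented.
toy = pd.DataFrame({
    "serial_number": ["B2", "A1", "A1", "B2"],
    "date": ["2013-05-10", "2013-05-10", "2013-05-11", "2013-05-11"],
    "failure": [0, 0, 0, 1],
})
toy_multiindex = toy.set_index(["serial_number", "date"]).sort_index()

# One level: all rows for drive A1, returned with date as the index.
one_drive = toy_multiindex.loc["A1"]

# Both levels: a single row, returned as a Series.
one_row = toy_multiindex.loc[("B2", "2013-05-11"), :]

# A list of serial numbers combined with a slice of dates.
subset = toy_multiindex.loc[
    (["A1", "B2"], slice("2013-05-11", "2013-05-11")), :
]

print(one_drive)
print(one_row)
print(subset)
```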

Here are some examples of indexing with multiindexes:

In [None]:
# We can index one level, which returns a new DataFrame indexed by date
backblaze_multiindex.loc["Z3006275"]

In [None]:
# We can index two levels by chaining .loc[] twice
backblaze_multiindex.loc["Z3006275"].loc["2013/05/10"]

In [None]:
# Or we can pass a tuple to .loc[] to do the same thing using the format:
# df.loc[(level1, level2), :]
backblaze_multiindex.loc[("Z3006275", "2013/05/10"), :]

In [None]:
# Finally, we can use lists and slices (ranges) for each level of the index
backblaze_multiindex.loc[
    (["Z3006275", "Z3001Y9R", "Z3004WYG"], slice("2013/01/01", "2016/01/01")), :
]