# CSC271: Lecture Notes

## Data Cleaning and Preparation

### In this lesson:
- [Introduction](#intro)
- [Inspecting the data](#inspecting)
- [Specifying column types](#specifying-types) (`dtypes`, `astype`, `datetime`, `to_datetime`)
- [Removing (dropping) columns](#removing-columns) (`drop`)
- [Filling missing data](#filling-missing-data) (`fillna`)
- [Modifying column names](#column-names) (`rename`)
- [Adding columns](#creating-new-columns)
- [Removing rows](#removing-rows) (`dropna`, `drop_duplicates`)


<a id="intro"></a>
### Introduction

**Raw data** is data collected from a source in its original form. For example, it may be data that you download from an open data repository or data that you retrieve from a website.

**Data cleaning** is the process of fixing problems with data, such as by correcting or removing data that is incomplete, inaccurate, or irrelevant. Relatedly, **data preparation** is the process of preparing data for analysis, such as by normalizing numeric values or encoding categorical variables. The terms cleaning and preparation are often used interchangeably or in combination.



In this lecture we'll use the following (fake) raw dataset representing pets and their veterinary appointments.




| owner_name       | pet_name  | pet_nickname | species | appointment_date    | last_visit   | num_visits | type            | insurance_provider | estimated_cost |
|-------------------|-----------|--------------|---------|----------------------|--------------|------------|-----------------|---------------------|----------------|
| Aisha Khan        | Bella     | B            | Dog     | 2026-02-01 09:30    | 15/01/2025   | 3          | Annual checkup  | HealthyPaws        | 120.75            |
| Miguel Torres     | WHiskers  |              | Cat     | 2026-02-03 14:00    | 10/12/2022   | 2          | Vaccination     |                     | 80             |
| Priya Patel       | Goldie    |              | Fish    |                      |              |           |                 | AquaLife            |                |
| Kwame Mensah      | Rex       | Rexy         | Dog     | 2026-02-03 16:45    | 20/02/2025   | 4          |                 |                     | 0              |
| Sofia Gonzalez    | MITTENS   |              | Cat     | 2026-02-03 10:00    | 12/03/2024   | 2          | Nail trim       | PetShield           | 15.50             |
| Hiro Tanaka       | Bella     |              | Dog     | 2026-01-30 13:30    | 05/04/2023   | 3          | Checkup         |                     | 120.75            |
| Fatima Al-Sayed   | Fluffy    |              | Rabbit  | 2026-01-30 09:45    | 25/01/2025   | 2          | Vaccination     | PetCarePlus         | 90             |
| Jamal Johnson     | shadow    |              | Cat     | 2026-02-01 15:00    |              |           |                 |                     |                |
| Leila Hassan      | Goldy     | Go-go        | Dog     | 2026-02-01 11:30    | 05/01/2026   | 2          | Checkup         |                     |                |
| Jamal Johnson     | shadow    |              | Cat     | 2026-02-01 15:00    |              |           |                 |                     |                |
| Tomasz Nowak      | Snowball  | Snowie       | Rabbit  |                      | 01/05/2002   | 5          | Vaccination     |                     | 140            |



In [4]:
import pandas as pd
import numpy as np

df = pd.read_csv('pets.csv')
df.head(5)

Unnamed: 0,owner_name,pet_name,pet_nickname,species,appointment_date,last_visit,num_visits,type,insurance_provider,estimated_cost
0,Aisha Khan,Bella,B,Dog,2026-02-01 09:30,15/01/2025,3.0,Annual checkup,HealthyPaws,120.75
1,Miguel Torres,WHiskers,,Cat,2026-02-03 14:00,10/12/2022,2.0,Vaccination,,80.0
2,Priya Patel,Goldie,,Fish,,,,,AquaLife,
3,Kwame Mensah,Rex,Rexy,Dog,2026-02-03 16:45,20/02/2025,4.0,,,0.0
4,Sofia Gonzalez,MITTENS,,Cat,2026-02-03 10:00,12/03/2024,2.0,Nail trim,PetShield,15.5


<a id="inspecting"></a>
### Inspecting the data

We called the `head` method above to see the first five rows of the `DataFrame`.

We will also check how many missing values there are per column:

In [2]:
df.isna().sum()

owner_name            0
pet_name              0
pet_nickname          7
species               0
appointment_date      2
last_visit            3
num_visits            3
type                  4
insurance_provider    7
estimated_cost        4
dtype: int64
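
Two other quick checks are often useful at this stage: the overall size of the data and the fraction (rather than the count) of missing values per column. The sketch below uses a small hypothetical `DataFrame` named `sample` as a stand-in for `pets.csv`, so the values are illustrative only.

```python
import pandas as pd
import numpy as np

# Hypothetical three-row sample standing in for pets.csv
sample = pd.DataFrame({
    'pet_name': ['Bella', 'Goldie', 'Rex'],
    'pet_nickname': ['B', np.nan, 'Rexy'],
    'num_visits': [3, np.nan, 4],
})

# shape is a (rows, columns) tuple
print(sample.shape)  # (3, 3)

# isna() gives True/False per cell; the mean of each column is
# the fraction of its values that are missing
missing_fraction = sample.isna().mean()
print(missing_fraction)
```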

<a id="specifying-types"></a>
### Specifying column types

As we see when we inspect the `df.dtypes` attribute, most columns have the generic `object` type.

In [3]:
df.dtypes

owner_name             object
pet_name               object
pet_nickname           object
species                object
appointment_date       object
last_visit             object
num_visits            float64
type                   object
insurance_provider     object
estimated_cost        float64
dtype: object

Since the `num_visits` and `estimated_cost` columns contain numeric values and some missing values, they are treated as type `float64`.

The `num_visits` values are always whole numbers, so we don't need a floating point representation. We can instead treat this data as integers using the `astype` method:

In [None]:
df['num_visits'] = df['num_visits'].astype('Int16')
df.dtypes

I chose `Int16`, which is a 16-bit integer type. It can represent integer values from -32,768 to 32,767, which is more than sufficient for this data. There are also 8-bit, 32-bit, and 64-bit integer types.

Differences between integer types:
- `Int8`, `Int16`, `Int32`, `Int64` (capital `I`) are Pandas nullable types and can contain missing values (displayed as `<NA>`)
- `int8`, `int16`, `int32`, `int64` cannot contain missing values
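
The difference between the two families can be seen in a minimal sketch. The `visits` Series and the `conversion_failed` flag are hypothetical names used only for this illustration.

```python
import pandas as pd
import numpy as np

# Hypothetical visit counts with one missing value
visits = pd.Series([3.0, 2.0, np.nan])

# The nullable type (capital I) keeps the missing value
nullable = visits.astype('Int16')
print(nullable)

# The lowercase type has no way to store a missing value,
# so the conversion raises an error
conversion_failed = False
try:
    visits.astype('int16')
except (ValueError, TypeError) as err:
    conversion_failed = True
    print(f'Conversion failed: {err}')
```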

**datetime**

Columns representing dates can be converted to the Pandas `datetime` type, which has a set of helpful attributes and methods. To do so, we pass `pd.to_datetime` a `format` string that describes how the date and time information is laid out. The table below shows a subset of the available format directives.

In [None]:
df['appointment_date'] = pd.to_datetime(df['appointment_date'], format='%Y-%m-%d %H:%M')
df['last_visit'] = pd.to_datetime(df['last_visit'], format='%d/%m/%Y')

df.dtypes

If any datetimes are missing, they will be set to `NaT`, which stands for **N**ot-**a**-**T**ime. This is similar to the NaN (**N**ot-**a**-**N**umber) value used for missing numeric values.
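
As a small illustration of `NaT`, the sketch below parses a hypothetical Series (`raw_dates`) in the same day/month/year layout as `last_visit`, with one value missing.

```python
import pandas as pd

# Hypothetical last-visit strings in day/month/year format, one missing
raw_dates = pd.Series(['15/01/2025', None, '10/12/2022'])

parsed = pd.to_datetime(raw_dates, format='%d/%m/%Y')
print(parsed)

# isna() treats NaT as missing, just like NaN
print(parsed.isna().sum())  # 1
```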

<div class="alert alert-block alert-info">

### datetime formatting

| Directive | Meaning                                      | Example Output      |
|-----------|---------------------------------------------|----------------------|
| %a        | Weekday as locale’s abbreviated name        | Sun                 |
| %A        | Weekday as locale’s full name              | Sunday              |
| %d        | Day of the month (zero-padded)             | 01                  |
| %b        | Month as locale’s abbreviated name         | Jan                 |
| %B        | Month as locale’s full name                | January             |
| %m        | Month as a zero-padded decimal number      | 01                  |
| %y        | Year without century (zero-padded)         | 26                  |
| %Y        | Year with century                          | 2026                |
| %H        | Hour (24-hour clock, zero-padded)          | 14                  |
| %I        | Hour (12-hour clock, zero-padded)          | 02                  |
| %p        | Locale’s AM or PM                          | PM                  |
| %M        | Minute (zero-padded)                       | 05                  |
| %S        | Second (zero-padded)                       | 09                  |
| %c        | Locale’s date and time representation      | Thu Jan 15 10:13:57 2026 |
| %x        | Locale’s date representation               | 01/15/26            |
| %X        | Locale’s time representation               | 10:13:57            |

For a complete list see the [Python Documentation](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior).
</div>
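
The same directives work in both directions: for parsing text into a datetime and for formatting a datetime as text. A minimal sketch, using a hypothetical timestamp `ts` and only directives that do not depend on the locale:

```python
import pandas as pd

# Hypothetical timestamp used only for illustration
ts = pd.Timestamp('2026-01-15 14:05:09')

# Formatting: datetime -> text
as_text = ts.strftime('%d/%m/%Y %H:%M')
print(as_text)  # 15/01/2026 14:05

# Parsing: text -> datetime, using the same format string
round_trip = pd.to_datetime(as_text, format='%d/%m/%Y %H:%M')
print(round_trip)
```

Note that the seconds are lost in the round trip because the format string does not include `%S`.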

<a id="removing-columns"></a>
### Removing (dropping) columns

There are several reasons why we may wish to remove (drop) a column from a `DataFrame`. For example, the column may:
- contain a high proportion of missing values
- be irrelevant to the analysis
- contain redundant information
- contain data with low variability (i.e., nearly every row has the same value)
- be of poor quality (i.e., it cannot easily be cleaned)
- contain sensitive or personally identifiable information that is not essential to the analysis

In our pet data, let's drop the pet's nickname column, because more than half of the rows are missing data in that column and it is not relevant to our analysis.

In [None]:
df = df.drop(columns=['pet_nickname'])
df.head()
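
The "high proportion of missing values" criterion can also be checked programmatically. The sketch below is one possible approach, not a rule: the 50% `threshold` is an assumed cutoff chosen for illustration, and `sample` is a hypothetical stand-in for the pet data.

```python
import pandas as pd
import numpy as np

# Hypothetical sample in which pet_nickname is mostly missing
sample = pd.DataFrame({
    'pet_name': ['Bella', 'Goldie', 'Rex', 'Fluffy'],
    'pet_nickname': ['B', np.nan, np.nan, np.nan],
    'species': ['Dog', 'Fish', 'Dog', 'Rabbit'],
})

threshold = 0.5  # assumed cutoff, for illustration only

# Fraction of missing values in each column
missing_fraction = sample.isna().mean()

# Names of the columns whose missing fraction exceeds the threshold
mostly_missing = missing_fraction[missing_fraction > threshold].index

trimmed = sample.drop(columns=mostly_missing)
print(list(trimmed.columns))  # ['pet_name', 'species']
```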

<a id="filling-missing-data"></a>

### Filling missing data

**Data imputation**, or filling missing data, involves replacing missing values with values that are inferred from the existing data. This process can help maintain dataset completeness and improve the quality of analysis or modeling.

Using our knowledge of vet appointments, if the appointment type isn't specified, it is usually a check-up. We'll fill in the missing values with that type.

In [None]:
df['type'] = df['type'].fillna('Checkup')
df.head()

For the missing cost, we'll fill it in using the mean estimated cost.

In [None]:
mean_cost = df['estimated_cost'].mean()
df['estimated_cost'] = df['estimated_cost'].fillna(mean_cost)
df.head()
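
When several columns need different fill values, `fillna` also accepts a dictionary that maps column names to values. A minimal sketch on a hypothetical `sample` frame (the costs here are made up so the mean is easy to verify by hand):

```python
import pandas as pd
import numpy as np

# Hypothetical sample with one missing type and one missing cost
sample = pd.DataFrame({
    'type': ['Vaccination', np.nan, 'Nail trim'],
    'estimated_cost': [80.0, np.nan, 20.0],
})

# mean() skips missing values by default, so this is (80 + 20) / 2
mean_cost = sample['estimated_cost'].mean()

# Each key is a column name; each value fills that column's gaps
filled = sample.fillna({'type': 'Checkup', 'estimated_cost': mean_cost})
print(filled)
```

Filling with the mean leaves the column's average unchanged but reduces its variability, so it is worth keeping in mind which values were imputed.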

<a id="column-names"></a>
### Modifying column names

Let's update the column names of our data set. We can rename them one at a time or several at once:

In [None]:
df = df.rename(columns={'owner_name': 'Owner'})
df = df.rename(columns={'pet_name': 'Pet',
                        'species': 'Species',
                        'appointment_date': 'Appointment',
                        'insurance_provider': 'Insurance',
                        'last_visit': 'Last Visit Date', 
                        'num_visits': 'Visit Count',
                        'type': 'Type',
                        'estimated_cost': 'Estimated Cost'})
df.head()
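
Two details of `rename` are easy to miss: it returns a new `DataFrame` rather than changing the original, and by default it silently skips keys that match no column. The sketch below shows both on a hypothetical one-row frame; `'pet_nmae'` is a deliberate typo.

```python
import pandas as pd

# Hypothetical one-row sample
sample = pd.DataFrame({'owner_name': ['Aisha Khan'], 'pet_name': ['Bella']})

# 'pet_nmae' matches no column, so that entry is ignored without an error
renamed = sample.rename(columns={'owner_name': 'Owner', 'pet_nmae': 'Pet'})
print(list(renamed.columns))  # ['Owner', 'pet_name']

# The original is unchanged because the result was stored in a new variable
print(list(sample.columns))   # ['owner_name', 'pet_name']
```

Checking the column names after renaming, as `df.head()` does above, is a simple way to catch this kind of typo.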


<a id="creating-new-columns"></a>
### Adding columns

The `Appointment` column contains both the date and time of the appointment as a value of type `datetime`. Let's split this data into two columns. We'll do this by creating two new columns with the date and time and then removing the `Appointment` column.

In [None]:
df['Appointment Date'] = df['Appointment'].dt.date
df['Appointment Time'] = df['Appointment'].dt.time

df = df.drop(columns=['Appointment'])

df.head()

<div class="alert alert-block alert-info">

### Example Pandas datetime attributes


| Attribute        | Description                                                                 |
|-------------------|-----------------------------------------------------------------------------|
| `year`           | Returns the year component of the datetime values.                        |
| `month`          | Returns the month component (1–12).                                       |
| `day`            | Returns the day of the month (1–31).                                      |
| `hour`           | Returns the hour component (0–23).                                        |
| `minute`         | Returns the minute component (0–59).                                      |
| `second`         | Returns the second component (0–59).                                      |
| `date`           | Returns the date part as `datetime.date`.                                 |
| `time`           | Returns the time part as `datetime.time`.                                 |

For a complete list, see the [DatetimeIndex documentation](https://pandas.pydata.org/docs/reference/api/pandas.DatetimeIndex.html).
</div>
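
On a `Series`, these attributes are reached through the `.dt` accessor, as in the cell above. A short sketch using two hypothetical appointment times:

```python
import pandas as pd

# Hypothetical appointment datetimes
appointments = pd.Series(pd.to_datetime(['2026-02-01 09:30',
                                         '2026-02-03 14:00']))

years = appointments.dt.year    # 2026, 2026
hours = appointments.dt.hour    # 9, 14
dates = appointments.dt.date    # datetime.date objects

print(years.tolist(), hours.tolist())
print(dates.tolist())
```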

<a id="removing-rows"></a>
### Removing rows

Removing a row from a `DataFrame` may be desirable if it:
- contains missing data
- is duplicated
- is an outlier

Imagine that we are only interested in pets who have upcoming appointments booked. Rows with a missing appointment date aren't relevant to that analysis, so we'll remove them.

In [None]:
df = df.dropna(subset=['Appointment Date'])
df.head()
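
The `subset` argument matters here: without it, `dropna` removes every row that has a missing value in *any* column. The sketch below contrasts the two calls on a hypothetical three-row `sample`.

```python
import pandas as pd

# Hypothetical sample: row 1 lacks an appointment, row 2 lacks insurance
sample = pd.DataFrame({
    'Pet': ['Bella', 'Goldie', 'Rex'],
    'Appointment Date': ['2026-02-01', None, '2026-02-03'],
    'Insurance': ['HealthyPaws', 'AquaLife', None],
})

# Only rows missing an appointment date are removed
with_appointment = sample.dropna(subset=['Appointment Date'])
print(with_appointment['Pet'].tolist())  # ['Bella', 'Rex']

# Rows with a missing value in any column are removed
complete_rows = sample.dropna()
print(complete_rows['Pet'].tolist())     # ['Bella']
```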

We can use the `DataFrame` method `drop_duplicates` to remove duplicate rows. For example, there are two identical rows for shadow, so let's remove the second occurrence:

In [None]:
print(f'Size before removing duplicates {df.shape}')
df = df.drop_duplicates()
print(f'Size after removing duplicates {df.shape}')

<div class="alert alert-block alert-info">

| Method | Description |
|--------|------------|
| `df.drop_duplicates()` | Remove duplicate rows, keeping the first occurrence (default behavior). |
| `df.drop_duplicates(keep='first')` | Keep the first occurrence of each duplicate and drop the rest. |
| `df.drop_duplicates(keep='last')` | Keep the last occurrence of each duplicate and drop the rest. |
| `df.drop_duplicates(keep=False)` | Drop **all** duplicate rows, keeping only rows that are unique. |

For complete information, see `help(pd.DataFrame.drop_duplicates)` or https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html

</div>
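
The effect of each `keep` option in the table can be seen on a small hypothetical frame in which rows 0 and 2 are identical. The variable names below are illustrative only.

```python
import pandas as pd

# Hypothetical sample: rows 0 and 2 are exact duplicates
sample = pd.DataFrame({
    'owner': ['Jamal Johnson', 'Leila Hassan', 'Jamal Johnson'],
    'pet': ['shadow', 'Goldy', 'shadow'],
})

keep_first = sample.drop_duplicates()             # keeps rows 0 and 1
keep_last = sample.drop_duplicates(keep='last')   # keeps rows 1 and 2
unique_only = sample.drop_duplicates(keep=False)  # keeps row 1 only

print(keep_first.index.tolist())   # [0, 1]
print(keep_last.index.tolist())    # [1, 2]
print(unique_only.index.tolist())  # [1]
```

The original index labels are preserved in each result, which makes it easy to see which rows were kept.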