# 📓 Lesson 6: Changing Data Types and Using Categorical Data
📘 What you will learn:
1. How to check and change column data types
2. How to use astype() to convert types
3. How to work with category data to save memory
4. How to convert dates to datetime type

## Step 1: Load the Dataset
We’ll again use Sales_January_2019.csv from the data/ folder.


In [None]:
import pandas as pd

# Load dataset
df = pd.read_csv('../data/Sales_January_2019.csv')

# Show data types
print(df.dtypes)

💡 This tells you what type each column currently is (e.g., object, int64, float64).
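To see how a single bad value changes a column's type, here is a minimal sketch on a made-up three-row CSV (the column names and values are hypothetical, not taken from the sales file). One non-numeric entry is enough for pandas to read the whole Quantity column as text (object dtype in most pandas versions).

```python
import io
import pandas as pd

# Hypothetical CSV; the stray "unknown" prevents Quantity from being read as a number
csv_text = """Order ID,Product,Quantity,Price
1001,iPhone,1,700.0
1002,MacBook,unknown,1700.0
1003,iPhone,2,700.0
"""

sample = pd.read_csv(io.StringIO(csv_text))

# One dtype per column
print(sample.dtypes)

# info() adds non-null counts and an overall memory estimate
sample.info()
```

If a column you expect to be numeric shows up as text here, it needs the conversion covered in the next step.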

## Step 2: Convert Strings to Numbers
Columns such as Quantity Ordered and Price Each may be read as strings if any value in them is not a valid number. Convert them with pd.to_numeric():

In [None]:
# errors='coerce' turns values that cannot be parsed into NaN instead of raising an error
df['Quantity Ordered'] = pd.to_numeric(df['Quantity Ordered'], errors='coerce')
df['Price Each'] = pd.to_numeric(df['Price Each'], errors='coerce')

# Drop rows where conversion failed and became NaN
df = df.dropna(subset=['Quantity Ordered', 'Price Each'])

# Check types
print(df.dtypes)
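The lesson goals mention astype(), so here is a small sketch of how it can follow to_numeric(). The values are hypothetical. Because NaN is a float, a coerced column comes back as float64; once the NaN rows are dropped, astype('int64') gives whole-number quantities.

```python
import pandas as pd

# Hypothetical raw values, stored as object dtype the way mixed text usually is
raw = pd.Series(['2', '1', 'Quantity Ordered', '3'], dtype='object')

# Text that cannot be parsed becomes NaN, which forces a float dtype
numeric = pd.to_numeric(raw, errors='coerce')
print(numeric.dtype)

# After dropping NaN, the remaining values convert cleanly to integers
clean = numeric.dropna().astype('int64')
print(clean.tolist())  # [2, 1, 3]
```

Note that astype('int64') raises an error if NaN values are still present, which is why dropna() comes first.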


## Step 3: Convert to Categorical Data
If a column contains only a few distinct values that repeat many times (e.g., city names or products), you can use the category type to save memory:

In [None]:
# Before
print(df['Product'].memory_usage(deep=True))

# Convert to category
df['Product'] = df['Product'].astype('category')

# After
print(df['Product'].memory_usage(deep=True))
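The saving comes from how a categorical column is stored: each distinct value is kept once, and every row holds a small integer code pointing to it. A minimal sketch with hypothetical product names:

```python
import pandas as pd

# Hypothetical values; only two distinct strings appear
products = pd.Series(
    ['Monitor', 'Charger', 'Monitor', 'Monitor', 'Charger'],
    dtype='category',
)

# The distinct values, stored once
print(products.cat.categories.tolist())  # ['Charger', 'Monitor']

# One small integer per row, indexing into the categories
print(products.cat.codes.tolist())       # [1, 0, 1, 1, 0]
```

The more rows share the same few values, the larger the saving compared with storing a full string per row.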


📌 What does deep=True do?

By default (deep=False), memory_usage() reports only shallow memory. For an object column, that means the array of references to the Python strings, not the strings themselves.

When you set deep=True, Pandas also inspects each element and adds the memory consumed by the strings or other objects stored in the column.

This matters most for columns with object or string data types, where the shallow figure can seriously underestimate the real footprint.

In [None]:
# Use a separate name so the sales data in df is not overwritten
demo = pd.DataFrame({
    'Product': ['iPhone', 'iPhone', 'MacBook', 'iPhone', 'MacBook']
})

print("Without deep:", demo['Product'].memory_usage(deep=False))
print("With deep:", demo['Product'].memory_usage(deep=True))


As you can see, the memory with deep=True is more accurate because it includes the content of each string, not just the references.

💡 You can also check how many times each value appears, which shows how few distinct values the column holds:

In [None]:
print(df['Product'].value_counts())

## Step 4: Convert Dates to datetime
To work with dates (e.g., filtering, sorting, grouping), convert them using pd.to_datetime():

In [None]:
df['Order Date'] = pd.to_datetime(df['Order Date'], errors='coerce')

# Check result
print(df['Order Date'].head())
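When you know the layout of the date strings, passing format= makes parsing explicit and usually faster. The sketch below assumes a month/day/two-digit-year layout with a 24-hour time; that layout and the sample values are assumptions, so check a few rows of your own file before reusing the format string.

```python
import pandas as pd

# Hypothetical timestamps; the last entry is deliberately not a date
raw_dates = pd.Series(['01/22/19 21:25', '01/28/19 14:15', 'Order Date'])

# errors='coerce' turns entries that do not match the format into NaT
parsed = pd.to_datetime(raw_dates, format='%m/%d/%y %H:%M', errors='coerce')

print(parsed)
print("Unparsed rows:", parsed.isna().sum())
```

Counting NaT values right after conversion is a quick way to see how many rows failed to parse.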

You can now extract date parts:

In [None]:
df['Month'] = df['Order Date'].dt.month
df['Hour'] = df['Order Date'].dt.hour

# Show only the relevant columns for the first few rows
print(df[['Order Date', 'Month', 'Hour']].head())
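The .dt accessor exposes more than month and hour. As a rough illustration on hypothetical timestamps, the sketch below also pulls out the day and weekday name, then counts orders per hour.

```python
import pandas as pd

# Hypothetical order timestamps
orders = pd.DataFrame({
    'Order Date': pd.to_datetime([
        '2019-01-22 21:25',
        '2019-01-22 09:05',
        '2019-01-28 21:40',
    ])
})

orders['Day'] = orders['Order Date'].dt.day
orders['Weekday'] = orders['Order Date'].dt.day_name()
orders['Hour'] = orders['Order Date'].dt.hour

# Number of orders placed in each hour of the day
orders_per_hour = orders.groupby('Hour').size()

print(orders)
print(orders_per_hour)
```

These accessors only work on datetime columns, so they fail with an AttributeError if the column is still stored as text.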

## Practice Exercises

1. Load Sales_January_2019.csv
2. Convert Quantity Ordered and Price Each to numeric
3. Convert Product and City to category
4. Convert Order Date to datetime
5. Create new columns for Month and Hour
6. Compare memory usage of categorical vs object types

In [None]:
df = pd.read_csv('../data/Sales_January_2019.csv')

df['Quantity Ordered'] = pd.to_numeric(df['Quantity Ordered'], errors='coerce')
df['Price Each'] = pd.to_numeric(df['Price Each'], errors='coerce')
df = df.dropna(subset=['Quantity Ordered', 'Price Each'])

# Convert to category
df['Product'] = df['Product'].astype('category')
# Extract the city, assuming addresses look like "street, city, state zip"
df['City'] = df['Purchase Address'].str.extract(r',\s*([^,]+),', expand=False)
df['City'] = df['City'].astype('category')

# Convert to datetime
df['Order Date'] = pd.to_datetime(df['Order Date'], errors='coerce')

# Extract month and hour
df['Month'] = df['Order Date'].dt.month
df['Hour'] = df['Order Date'].dt.hour
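For exercise 6, one possible approach is to measure the same column both ways with memory_usage(deep=True). The sketch below uses a synthetic column with hypothetical product names rather than the sales file, so the exact numbers will differ from yours.

```python
import pandas as pd

# Hypothetical column: three distinct names, each repeated 10,000 times
names = ['USB-C Charging Cable', 'Wired Headphones', '27in Monitor']
as_object = pd.Series(names * 10_000, dtype='object')
as_category = as_object.astype('category')

object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object:   {object_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")
print(f"ratio:    {object_bytes / category_bytes:.1f}x")
```

On the real data, measure the Product and City columns before and after astype('category') to make the same comparison.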


## Summary

In this lesson, you learned how to:

- Change data types using astype() and to_numeric()
- Convert columns to category to reduce memory usage
- Convert string dates to datetime format
- Extract parts such as month and hour from datetime columns

👉 In the next lesson, you will learn how to apply functions to your data, perform calculations, and sort your results.