
# Class 10 — Pandas Basics: Data Representation & Types (Documented)

This notebook is a **fully narrated** walkthrough of how pandas represents data. Each section includes explanations, definitions, and example code.

> Topics: dtypes, memory usage, nullable dtypes, string vs object, datetime, timedelta, categorical types, and type conversion helpers.



## 1. Imports


In [None]:

import pandas as pd
import numpy as np

pd.__version__, np.__version__



**Why these libraries?**
- **pandas** is the primary library for tabular data.
- **numpy** underpins pandas with fast array operations.



## 2. Create a sample DataFrame

We'll build a small dataset we can safely modify without reading external files.


In [None]:

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "name": ["Alice", "Bob", "Charlie", "Diana", "Evan"],
    "join_date": ["2022-01-15", "2021-12-30", "2022-02-10", "2023-05-01", None],
    "department": ["HR", "Finance", "Finance", "IT", "Finance"],
    "salary": [50000, 60000, 55000, 70000, None],
    "active": [True, True, False, True, None],
    "region": ["West", "West", "East", "North", "East"]
})
df



**Notes**
- `join_date` is currently text (`object`) with a missing value.
- `salary` has a missing value (`None`), which pandas stores as `NaN`, so the column is `float64` rather than `int64`.
- `active` has a missing value, so it is stored as `object` for now; this will be useful when we discuss **nullable booleans**.



## 3. Inspecting dtypes and memory usage


In [None]:

df.dtypes


In [None]:

# Memory usage (deep=True is more accurate for object columns)
df.memory_usage(deep=True)



**Definitions**
- **dtype**: The data type of a column/Series (e.g., `int64`, `float64`, `object`, `bool`, `string`, `category`, `datetime64[ns]`).  
- **`object`**: A generic dtype for Python objects, often used for strings or mixed/messy data.  
- **Memory usage**: Important for performance and scalability. Using the right dtype can reduce memory.
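
To make the memory point concrete, here is a minimal sketch that stores the same integers at two widths. The Series names and the data are made up for illustration, and the narrower dtype is only safe because every value fits in it.

```python
import pandas as pd

# Hypothetical example data: 1,000 small integers stored as 64-bit values.
wide = pd.Series(range(1000), dtype="int64")

# The same values stored as 16-bit integers (safe: the maximum is 999).
narrow = wide.astype("int16")

# index=False counts only the values, not the index.
wide_bytes = wide.memory_usage(index=False)
narrow_bytes = narrow.memory_usage(index=False)
print(wide_bytes, narrow_bytes)
```

Each `int64` value takes 8 bytes and each `int16` value takes 2, so the narrower column should need a quarter of the space.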



## 4. Converting to appropriate types

We'll convert columns to more specific dtypes for correctness and performance.


In [None]:

# Datetime conversion
df["join_date"] = pd.to_datetime(df["join_date"], errors="coerce")

# Numeric conversion (salary may have missing and/or text; use to_numeric)
df["salary"] = pd.to_numeric(df["salary"], errors="coerce")

# Pandas string dtype (more consistent string handling than 'object')
df["name"] = df["name"].astype("string")
df["region"] = df["region"].astype("string")

# Nullable boolean dtype that supports missing values: 'boolean'
df["active"] = df["active"].astype("boolean")

df.dtypes



**Why this matters**
- `datetime64[ns]` enables date math and `.dt` accessors.
- `to_numeric(..., errors="coerce")` safely converts non-numeric strings to `NaN` instead of failing.
- `string` dtype provides consistent, vectorized string ops and proper missing value semantics (`<NA>`).
- `boolean` (lowercase) is pandas' **nullable boolean** type that supports `<NA>`, unlike NumPy's `bool`.
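
The difference between the nullable `boolean` dtype and NumPy's `bool` can be sketched as follows. The `flags` Series is a hypothetical example, not part of the dataset above.

```python
import pandas as pd

# Hypothetical flags with one missing entry; pandas infers object dtype here.
flags = pd.Series([True, False, None])

# The nullable dtype keeps the gap as a missing value.
nullable = flags.astype("boolean")

# Forcing NumPy bool has no slot for "missing", so None becomes False.
forced = flags.astype(bool)

print(nullable.tolist())
print(forced.tolist())
```

The forced conversion loses information silently, which is why the nullable dtype is usually the safer choice for flags that can be unknown.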



## 5. Working with datetime columns


In [None]:

df["year"] = df["join_date"].dt.year
df["month"] = df["join_date"].dt.month
df["weekday"] = df["join_date"].dt.day_name()

# Date filtering and arithmetic examples
recent = df[df["join_date"] >= "2022-01-01"]
delta_days = (pd.Timestamp("2024-01-01") - df["join_date"]).dt.days

df[["join_date", "year", "month", "weekday"]], recent[["id", "join_date"]], delta_days



**Definitions**
- **`.dt` accessor**: exposes vectorized datetime properties (e.g., `.year`, `.month`) and methods.
- **`Timedelta`**: the result of subtracting two datetimes; exposes components such as `.days` and `.seconds` (reached through `.dt` on a Series).
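
As a small illustration of the `Timedelta` components, the sketch below subtracts two made-up timestamps. Note that `.seconds` holds only the remainder within the last partial day, while `total_seconds()` gives the whole duration.

```python
import pandas as pd

# Hypothetical start and end times chosen for easy arithmetic.
start = pd.Timestamp("2024-01-01 08:00")
end = pd.Timestamp("2024-01-03 20:30")

gap = end - start            # a single pd.Timedelta: 2 days, 12 hours, 30 minutes

print(gap.days)              # 2
print(gap.seconds)           # 45000 (12.5 hours, the part beyond whole days)
print(gap.total_seconds())   # 217800.0
```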



## 6. Categorical data for memory & performance


In [None]:

# Compare memory usage before/after converting 'department' to category
before = df["department"].memory_usage(deep=True)

df["department"] = df["department"].astype("category")

after = df["department"].memory_usage(deep=True)

before, after, float((before - after) / before * 100)



**Why category?**
- Stores values as integer **codes** referencing a small set of **categories**.
- Great for repeating strings (e.g., departments, regions) → **less memory**, faster `groupby` and `value_counts`.
- You can also set an **order** for categories (useful for sorting and comparisons).
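
The codes-and-categories layout can be inspected directly. This is a minimal sketch on a standalone Series that reuses the department labels from the sample data. When categories are inferred, pandas sorts the labels, so the codes follow alphabetical order here.

```python
import pandas as pd

dept = pd.Series(["HR", "Finance", "Finance", "IT", "Finance"], dtype="category")

# The small lookup table of distinct labels.
labels = list(dept.cat.categories)

# One small integer per row, pointing into that table.
codes = dept.cat.codes.tolist()

print(labels)
print(codes)
```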


In [None]:

# Setting ordered categories and renaming labels
df["department"] = df["department"].cat.set_categories(["HR", "Finance", "IT"], ordered=True)
df["department"] = df["department"].cat.rename_categories({"IT": "Information Tech"})
df["department"].cat.categories, df["department"].cat.ordered



## 7. String operations with the `.str` accessor


In [None]:

# Normalize names: strip spaces, title case, and create an email-like alias
df["name_clean"] = df["name"].str.strip().str.title()
df["email_alias"] = df["name_clean"].str.replace(r"\s+", ".", regex=True).str.lower() + "@example.com"

# Pattern checks
mask_contains_a = df["name_clean"].str.contains("a", case=False, na=False)
df[["name", "name_clean", "email_alias"]], mask_contains_a



**Notes**
- `.str` methods are **vectorized** (fast) and handle missing values gracefully (often via `na=` parameter).
- Prefer `string` dtype for consistent behavior; `object` can hide mixed types and slow things down.
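
To see how `object` can hide mixed types, consider this sketch with a deliberately messy, hypothetical column. On `object` dtype, `.str` methods return a missing value for elements that are not strings rather than raising an error, so the problem can go unnoticed.

```python
import pandas as pd

# Hypothetical messy column: a string, an integer, and a missing value.
messy = pd.Series(["alice", 42, None], dtype="object")

# Listing the Python type of each element exposes the mix.
kinds = messy.map(lambda v: type(v).__name__).tolist()

# The integer cannot be upper-cased, so its result is missing.
upper = messy.str.upper()

print(kinds)
print(upper.isna().tolist())
```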



## 8. Nullable numeric & boolean dtypes

Pandas provides **nullable** dtypes that represent missing values as `<NA>` without changing the column's type:
- `Int8/Int16/Int32/Int64` (capital **I**): nullable integer types
- `Float32/Float64` (capital **F**): nullable float types. NumPy's lowercase `float32/float64` already hold `NaN`; choosing a 32-bit type halves memory at the cost of precision
- `boolean`: nullable boolean type


In [None]:

# Example: convert id to nullable Int64 and salary to nullable Float32 (half the width of Float64)
df["id"] = df["id"].astype("Int64")
df["salary"] = df["salary"].astype("Float32")

df.dtypes, df.memory_usage(deep=True)



**Tip**
- Use `pd.to_numeric(..., downcast="integer")` or `pd.to_numeric(..., downcast="float")` to request the smallest dtype that can still hold the values.
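
A minimal sketch of the `downcast=` parameter on two hypothetical Series. The resulting dtype depends on the values: pandas picks the smallest type that fits them, and `float32` is the smallest float it will choose.

```python
import pandas as pd

counts = pd.Series([1, 2, 3])          # stored as 64-bit integers
prices = pd.Series([1.5, 2.25, 3.0])   # stored as 64-bit floats

small_counts = pd.to_numeric(counts, downcast="integer")
small_prices = pd.to_numeric(prices, downcast="float")

print(small_counts.dtype, small_prices.dtype)
```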



## 9. Safe type conversions and error handling


In [None]:

# Suppose we have a messy numeric column as text
s = pd.Series(["10", "20", "oops", None, "30.5"])

# Convert safely
num = pd.to_numeric(s, errors="coerce")
num, num.dtype



**Definitions**
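- **`errors="coerce"`**: values that cannot be parsed become `NaN` instead of raising an error.
- **`errors="ignore"`**: returns the input unchanged if parsing fails. This option is deprecated as of pandas 2.2; catch the exception from the default `errors="raise"` instead.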
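
The sketch below contrasts the default `errors="raise"` with `errors="coerce"` on the same messy Series, and shows one way to count values that failed to parse separately from values that were already missing. The variable names `raised` and `invalid` are illustrative.

```python
import pandas as pd

s = pd.Series(["10", "20", "oops", None, "30.5"])

# The default, errors="raise", stops at the first value it cannot parse.
try:
    pd.to_numeric(s, errors="raise")
    raised = False
except ValueError:
    raised = True

# With errors="coerce", compare against the original to separate values that
# failed to parse from values that were missing to begin with.
num = pd.to_numeric(s, errors="coerce")
invalid = num.isna() & s.notna()

print(raised, int(invalid.sum()))
```

Only `"oops"` counts as invalid here; the `None` entry was missing before conversion.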



## 10. Missing values: `NaN`, `<NA>`, and `NaT`


In [None]:

# Demonstrate the three common missing representations
missing_demo = pd.DataFrame({
    "float_col": [1.0, np.nan, 3.0],
    "string_col": pd.Series(["x", None, "z"], dtype="string"),
    "dt_col": pd.to_datetime(["2020-01-01", None, "2020-01-03"])
})
missing_demo, missing_demo.dtypes



**Key points**
- **`NaN`**: Missing value for floating-point arrays (from NumPy).
- **`<NA>`**: Pandas' **scalar** (`pd.NA`) for missing values in **nullable** dtypes (`string`, `Int64`, `boolean`, etc.).
- **`NaT`**: Missing value for datetime and timedelta dtypes.
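
Because the three markers behave differently under comparison, `pd.isna` is the reliable way to test for any of them. A short sketch:

```python
import numpy as np
import pandas as pd

# pd.isna recognizes all three missing-value markers.
markers = [np.nan, pd.NA, pd.NaT]
all_missing = all(pd.isna(m) for m in markers)

# NaN compares unequal to itself, so equality is not a usable test.
nan_equals_itself = np.nan == np.nan

# Comparisons with pd.NA propagate: the result is pd.NA, not True or False.
na_comparison = pd.NA == 1

print(all_missing, nan_equals_itself, na_comparison)
```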



## 11. Mini workflow: clean & optimize


In [None]:

clean = df.copy()

# Standardize strings
clean["name"] = clean["name"].str.strip().str.title().astype("string")
clean["region"] = clean["region"].str.strip().str.title().astype("string")

# Ensure dtypes
clean["join_date"] = pd.to_datetime(clean["join_date"], errors="coerce")
clean["department"] = clean["department"].astype("category")
clean["salary"] = pd.to_numeric(clean["salary"], errors="coerce").astype("Float32")
clean["id"] = clean["id"].astype("Int64")
clean["active"] = clean["active"].astype("boolean")

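# Note: df was already converted in the sections above, so "before" and "after"
# have nearly identical dtypes here and the reported saving will be close to zero.
before_mem = df.memory_usage(deep=True).sum()
after_mem = clean.memory_usage(deep=True).sum()

clean, {
    "before_bytes": int(before_mem),
    "after_bytes": int(after_mem),
    "save_pct": float((before_mem - after_mem) / before_mem * 100),
}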



**Outcome**
- Cleaned types enable safer operations and typically reduce memory.
- This pattern generalizes to most real-world CSVs after initial load.
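
To see a meaningful before/after difference, the comparison needs to start from untouched data. The sketch below builds a hypothetical `raw` frame (the sample labels repeated to 1,000 rows) and applies the same kinds of conversions. Exact byte counts vary by pandas version and platform, so none are quoted.

```python
import pandas as pd

# Hypothetical raw data: the sample labels repeated to 1,000 rows.
raw = pd.DataFrame({
    "id": list(range(1000)),
    "department": ["HR", "Finance", "Finance", "IT", "Finance"] * 200,
    "active": [True, True, False, True, None] * 200,
})

optimized = raw.copy()
optimized["id"] = pd.to_numeric(optimized["id"], downcast="integer")
optimized["department"] = optimized["department"].astype("category")
optimized["active"] = optimized["active"].astype("boolean")

before_bytes = int(raw.memory_usage(deep=True).sum())
after_bytes = int(optimized.memory_usage(deep=True).sum())

print(before_bytes, after_bytes)
```

Keeping an unmodified copy of the loaded data is what makes this comparison possible.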



## 12. Practice exercises
1. Convert a text date column to `datetime64[ns]`, extract `year` and `quarter`.
2. Change a low-cardinality text column (few distinct values, many repeats) to `category`; report memory savings.
3. Convert a messy numeric column using `pd.to_numeric(..., errors="coerce")`; count invalid rows.
4. Switch a boolean-like column (`"Y"`/`"N"`/`None`) to pandas `boolean` dtype.
5. Downcast floats/ints to smaller dtypes where safe (the `downcast=` parameter of `pd.to_numeric`).
