# 13. `.dt` Accessor Basics

-----

The `.dt` accessor is a special tool in Pandas that "unlocks" a large set of date and time properties and methods on a Series. You *cannot* use it on a regular `object` (text) column; you must *first* convert the column to a datetime dtype (usually `datetime64[ns]`) using `pd.to_datetime()`.

Think of a `datetime64` object as a locked box containing all the parts of a date (year, month, day, hour, etc.). The `.dt` accessor is the **key** to that box. Once you use it (e.g., `s.dt`), Pandas opens the box and gives you access to all the individual components, like `.dt.year`, `.dt.month`, and `.dt.day_name()`.

**How It Works in Memory**: The `.dt` accessor itself doesn't store anything. It's a "gateway" or "accessor" object. When you call `s.dt.year`, Pandas is *not* storing a separate column of years. It's looking at the underlying `datetime64[ns]` data (which is just a single 64-bit integer for each date) and, on the fly, calculating the "year" component from that integer. This makes it very fast and memory-efficient.
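
The idea above can be checked with a minimal sketch. It assumes nanosecond resolution, forced here with an explicit `astype('datetime64[ns]')` because the default unit can differ between Pandas versions; the variable names are illustrative only.

```python
import pandas as pd

# Two dates, one day apart, starting at the Unix epoch
s = pd.Series(pd.to_datetime(['1970-01-01', '1970-01-02'])).astype('datetime64[ns]')

# Each timestamp is stored as one 64-bit integer:
# nanoseconds elapsed since 1970-01-01 00:00:00
raw = s.astype('int64')
print(raw.tolist())

# .dt.year is derived from those integers each time it is called
years = s.dt.year
print(years.tolist())
```

The second raw value is the number of nanoseconds in one day, which is all Pandas needs to work out the year, month, or weekday on demand.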

**When to Use This**: This is the standard tool for date-based **feature engineering**: breaking a single date column into multiple useful pieces of information for analysis or machine learning models.

  * Use `.dt.year` or `.dt.month` to **group by** time periods (e.g., `df.groupby(df['date'].dt.year)['sales'].sum()`).
  * Use `.dt.day_name()` or `.dt.hour` to **filter** for specific patterns (e.g., "find all sales that happened on a weekend" or "during morning hours").
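
Both uses can be sketched on a tiny, made-up DataFrame. The column names `date` and `sales` are assumptions chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2025-01-04', '2025-01-06', '2026-03-01']),
    'sales': [10, 20, 30],
})

# Group by calendar year, summing only the numeric column
sales_by_year = df.groupby(df['date'].dt.year)['sales'].sum()
print(sales_by_year)

# Filter: keep rows that fall on Saturday (5) or Sunday (6)
weekend_sales = df[df['date'].dt.weekday >= 5]
print(weekend_sales)
```

Selecting `['sales']` before `.sum()` avoids asking Pandas to sum the datetime column itself.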

-----

### 0\. Syntax & Parameters

The `.dt` accessor is placed *between* your Series name and the property/method you want.

```python
# Accessing a PROPERTY (no parentheses)
series.dt.property
```

  * **Properties:** `year`, `month`, `day`, `hour`, `minute`, `second`, `weekday`, `dayofyear`, `quarter`, etc.

```python
# Accessing a METHOD (with parentheses)
series.dt.method()
```

  * **Methods:** `day_name()`, `month_name()`, `strftime()`, `normalize()`, `floor()`, `ceil()`, etc.

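**Key Point:** `.dt` is only available on a Series with a datetime-like dtype: `datetime64[ns]` (time-zone naive or aware), `timedelta64[ns]`, or `Period`. The properties listed above apply to datetimes; a timedelta Series exposes its own set (e.g., `.dt.days`, `.dt.seconds`).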
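
As a brief illustration of the timedelta case, the sketch below subtracts a fixed start date from a datetime Series and reads duration components through `.dt`. The dates are invented for the example.

```python
import pandas as pd

start = pd.Timestamp('2025-01-01')
ends = pd.Series(pd.to_datetime(['2025-01-10', '2025-03-01']))

# Subtracting datetimes yields a timedelta64 Series
elapsed = ends - start

# Timedelta Series expose duration components through .dt
elapsed_days = elapsed.dt.days
elapsed_seconds = elapsed.dt.total_seconds()
print(elapsed_days.tolist())
print(elapsed_seconds.tolist())
```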

-----

### 1\. Basic Example (Accessing Properties)

Let's convert a Series and access its basic parts.

```python
import pandas as pd
import numpy as np

s_dates = pd.Series(['2025-01-01', '2025-05-15', '2026-11-30'])

# 1. We MUST convert it first
s_dt = pd.to_datetime(s_dates)

print("--- 1. Converted Series (datetime64[ns]) ---")
print(s_dt)

# 2. Now, we can use the .dt accessor

# Example 1: Get the year
print("\n--- 2. Example 1: .dt.year ---")
print(s_dt.dt.year)

# Example 2: Get the month
print("\n--- 3. Example 2: .dt.month ---")
print(s_dt.dt.month)

# Example 3: Get the day
print("\n--- 4. Example 3: .dt.day ---")
print(s_dt.dt.day)

# Example 4: Get the day of the week (Mon=0, Sun=6)
print("\n--- 5. Example 4: .dt.weekday ---")
print(s_dt.dt.weekday)
```

**Output:**

```
--- 1. Converted Series (datetime64[ns]) ---
0   2025-01-01
1   2025-05-15
2   2026-11-30
dtype: datetime64[ns]

--- 2. Example 1: .dt.year ---
0    2025
1    2025
2    2026
dtype: int32

--- 3. Example 2: .dt.month ---
0     1
1     5
2    11
dtype: int32

--- 4. Example 3: .dt.day ---
0     1
1    15
2    30
dtype: int32

--- 5. Example 4: .dt.weekday ---
0    2
1    3
2    0
dtype: int32
```

**Explanation:**
As you can see, each `.dt.property` call returned a *new Series* containing just that part of the date. Note that `.dt.weekday` returned `2` (Wednesday), `3` (Thursday), and `0` (Monday).

-----

### 2\. Intermediate Example (Accessing Methods)

Methods (with `()`) give you more formatted or computed results.

```python
# Use the same s_dt from the previous example
s_dt = pd.to_datetime(pd.Series(['2025-01-01', '2025-05-15', '2026-11-30']))

# Example 5: Get the day's NAME
print("\n--- 6. Example 5: .dt.day_name() ---")
print(s_dt.dt.day_name())

# Example 6: Get the month's NAME
print("\n--- 7. Example 6: .dt.month_name() ---")
print(s_dt.dt.month_name())

# Example 7: Normalize (strip time information)
s_time = pd.Series(['2025-01-01 08:30:00', '2025-05-15 12:00:00'])
s_dt_time = pd.to_datetime(s_time)
print("\n--- 8. Example 7: Before .dt.normalize() ---")
print(s_dt_time)

print("\n--- 9. After .dt.normalize() ---")
print(s_dt_time.dt.normalize())
```

**Output:**

```
--- 6. Example 5: .dt.day_name() ---
0     Wednesday
1      Thursday
2        Monday
dtype: object

--- 7. Example 6: .dt.month_name() ---
0     January
1         May
2    November
dtype: object

--- 8. Example 7: Before .dt.normalize() ---
0   2025-01-01 08:30:00
1   2025-05-15 12:00:00
dtype: datetime64[ns]

--- 9. After .dt.normalize() ---
0   2025-01-01
1   2025-05-15
dtype: datetime64[ns]
```

**Explanation:**
The methods `day_name()` and `month_name()` returned string representations. `.dt.normalize()` is a useful method to "zero out" the time, leaving just the date, which is great for grouping by day.
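
A minimal sketch of the "grouping by day" idea, using invented timestamps and an assumed `amount` column:

```python
import pandas as pd

events = pd.DataFrame({
    'timestamp': pd.to_datetime([
        '2025-01-01 08:30:00', '2025-01-01 17:45:00', '2025-01-02 09:00:00',
    ]),
    'amount': [5, 7, 11],
})

# normalize() sets every time to midnight, so same-day rows share one key
daily_total = events.groupby(events['timestamp'].dt.normalize())['amount'].sum()
print(daily_total)
```

The result keeps a `datetime64` index, so it can still be resampled or plotted as a time series afterwards.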

-----

### 3\. Advanced or Tricky Case (Formatting & Filtering)

This is where `.dt` becomes a powerful tool.

**Example 8: `strftime()` for custom formats**
`strftime` (string format time) lets you build any date string you want.

```python
# %Y = 4-digit year, %m = 2-digit month, %d = 2-digit day
# %B = Full month name, %A = Full day name
print("\n--- 10. Example 8: .dt.strftime('%Y-%m') ---")
print(s_dt.dt.strftime('%Y-%m')) # Format as YYYY-MM

print("\n--- 11. Example 9: .dt.strftime (complex) ---")
print(s_dt.dt.strftime('%A, %B %d'))
```

**Output:**

```
--- 10. Example 8: .dt.strftime('%Y-%m') ---
0    2025-01
1    2025-05
2    2026-11
dtype: object

--- 11. Example 9: .dt.strftime (complex) ---
0    Wednesday, January 01
1         Thursday, May 15
2      Monday, November 30
dtype: object
```

**Explanation:**
`strftime` is a method that lets you *format* your dates back into strings, but in any custom format you want. This is great for reports.

**Example 10: Filtering with `.dt`**
This is the *real* power. You use the `.dt` accessor *inside* a boolean indexing mask.

```python
df = pd.DataFrame({
    'date': pd.to_datetime(['2025-01-15', '2025-02-10', '2025-03-05', '2025-04-30']),
    'sales': [100, 150, 50, 200]
})
print("\n--- 12. Original DataFrame ---")
print(df)

# Example 11: Find all sales from February
mask = df['date'].dt.month == 2
print("\n--- 13. Example 11: Sales from February ---")
print(df[mask])

# Example 12: Find all sales from the first half of the month
mask2 = df['date'].dt.day <= 15
print("\n--- 14. Example 12: Sales from 1st-15th ---")
print(df[mask2])
```

**Output:**

```
--- 12. Original DataFrame ---
        date  sales
0 2025-01-15    100
1 2025-02-10    150
2 2025-03-05     50
3 2025-04-30    200

--- 13. Example 11: Sales from February ---
        date  sales
1 2025-02-10    150

--- 14. Example 12: Sales from 1st-15th ---
        date  sales
0 2025-01-15    100
1 2025-02-10    150
2 2025-03-05     50
```
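
Masks built from `.dt` can also be combined. The sketch below reuses the same sample data; `.dt.quarter` and `isin()` are standard Pandas calls, while the particular conditions are arbitrary choices for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2025-01-15', '2025-02-10', '2025-03-05', '2025-04-30']),
    'sales': [100, 150, 50, 200],
})

# Each comparison yields a boolean Series; combine them with
# & (and), | (or), ~ (not)
in_q1 = df['date'].dt.quarter == 1
early_in_month = df['date'].dt.day <= 10
q1_early = df[in_q1 & early_in_month]
print(q1_early)

# isin() matches several values at once, here February or April
feb_or_apr = df[df['date'].dt.month.isin([2, 4])]
print(feb_or_apr)
```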

-----

### 4\. Real-World Use Case (Feature Engineering)

This is the \#1 use case for the `.dt` accessor. You take a *single* date column and create *many* new "feature" columns to help a machine learning model find patterns.

**Example 13: Create multiple features**

```python
df = pd.DataFrame({
    'timestamp': pd.to_datetime(['2025-11-17 08:30:00', '2025-11-18 12:15:00'])
})
print("\n--- 15. Original DataFrame ---")
print(df)

# Example 14: Create 'month', 'weekday', and 'hour' features
df['month'] = df['timestamp'].dt.month
df['weekday'] = df['timestamp'].dt.day_name()
df['hour'] = df['timestamp'].dt.hour
df['is_weekend'] = df['timestamp'].dt.weekday >= 5 # (Mon=0... Sat=5, Sun=6)

print("\n--- 16. Example 14: DataFrame with new features ---")
print(df)
```

**Output:**

```
--- 15. Original DataFrame ---
            timestamp
0 2025-11-17 08:30:00
1 2025-11-18 12:15:00

--- 16. Example 14: DataFrame with new features ---
            timestamp  month    weekday  hour  is_weekend
0 2025-11-17 08:30:00     11     Monday     8       False
1 2025-11-18 12:15:00     11    Tuesday    12       False
```

**Explanation:**
We started with just one `timestamp` column. From it, we "engineered" four new columns (`month`, `weekday`, `hour`, `is_weekend`) that can be fed to a model. A model can't understand `'2025-11-17 08:30:00'`, but it *can* understand `month=11`, `hour=8`, and `is_weekend=False`.
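
When the same features are needed for several tables, the steps can be wrapped in a small helper. `add_date_features` is a hypothetical name, not a Pandas function, and the chosen features are only one reasonable selection.

```python
import pandas as pd

def add_date_features(df, column):
    """Return a copy of df with calendar features derived from `column`."""
    out = df.copy()
    ts = out[column]
    out['month'] = ts.dt.month
    out['weekday'] = ts.dt.weekday          # Mon=0 ... Sun=6
    out['hour'] = ts.dt.hour
    out['is_weekend'] = ts.dt.weekday >= 5  # Sat=5, Sun=6
    return out

raw = pd.DataFrame({
    'timestamp': pd.to_datetime(['2025-11-17 08:30:00', '2025-11-22 23:05:00'])
})
features = add_date_features(raw, 'timestamp')
print(features)
```

Here `weekday` is kept numeric rather than using `day_name()`, since many models expect numbers; that is a design choice, not a requirement.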

-----

### 5\. Common Mistakes / Pitfalls

**Mistake 15: `AttributeError: Can only use .dt accessor...`**
This is the \#1 error. You forgot to convert the column first.

```python
s_text = pd.Series(['2025-01-01'])
print("\n--- 17. Series is 'object' type ---")
print(f"Dtype: {s_text.dtype}")

# Wrong code
try:
    s_text.dt.year
except AttributeError as e:
    print(f"\n--- 18. Error ---")
    print(e)
```

**Error/Wrong Output:**
`AttributeError: Can only use .dt accessor with datetimelike values`

**Why it happens:** The column `s_text` is `object` (text). Pandas doesn't know it's a date.

**Example 19: Corrected code:**
You *must* convert it first.

```python
s_dt = pd.to_datetime(s_text)
print("\n--- 19. Corrected ---")
print(s_dt.dt.year)
```
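
Real columns often contain a few values that cannot be parsed. A hedged sketch of one way to handle that uses `errors='coerce'`; the sample strings are invented.

```python
import pandas as pd

s_messy = pd.Series(['2025-01-01', 'not a date', '2026-07-04'])

# errors='coerce' turns unparseable entries into NaT instead of raising
s_parsed = pd.to_datetime(s_messy, errors='coerce')
print(s_parsed)

# .dt works now; rows that were NaT come back as NaN
print(s_parsed.dt.year)

# Inspect the original text of the rows that failed to parse
failed = s_messy[s_parsed.isna()]
print(failed)
```

Checking `failed` before moving on helps make sure that coercion did not silently discard data you care about.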

**Mistake 20: Forgetting `.dt`**
This is a very common beginner mistake.

```python
s_dt = pd.to_datetime(pd.Series(['2025-01-01']))
print("\n--- 20. Series is datetime ---")
print(f"Dtype: {s_dt.dtype}")

# Wrong code
try:
    s_dt.year
except AttributeError as e:
    print(f"\n--- 21. Error ---")
    print(e)
```

**Error/Wrong Output:**
`AttributeError: 'Series' object has no attribute 'year'`

**Why it happens:** The `Series` object itself doesn't have a `.year` property. The *accessor* has the property.

**Example 21: Corrected code:**
You have to put `.dt` in the middle.

```python
print("\n--- 22. Corrected ---")
print(s_dt.dt.year)
```

# `pd.Categorical()` and the `category` dtype

-----

The **category** data type is a special, high-performance type in Pandas. It is a memory-saving "specialist" for columns that have a *limited number* of *repeating string values*.

Think of a column like "Gender" (`['Male', 'Female', 'Male', 'Male']...`). Instead of storing the full text string "Male" thousands of times, Pandas can use the `category` type to store "Male" and "Female" *once*, and then use tiny integers (like `0` and `1`) behind the scenes to represent the full column. This can save a massive amount of memory (often 90%+) and also speed up operations like `groupby`.

`pd.Categorical()` is the underlying *constructor* that creates this structure. However, in practice, you will almost always use the shortcut: `df['col'].astype('category')`.

**How It Works in Memory**: A `category` column is split into two parts:

1.  **`categories`**: An index of the *unique* values (e.g., `['Female', 'Male']`).
2.  **`codes`**: A column of *integers* (e.g., `[1, 0, 1, 1]` for the "Gender" example above) that map to positions in `categories`.

This is why it's so efficient. Storing thousands of `int`s is much, much cheaper than storing thousands of full-text strings.
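
The two parts can be inspected directly. This is a small sketch using the `.cat` accessor; the sample values are illustrative.

```python
import pandas as pd

s = pd.Series(['Male', 'Female', 'Male', 'Male'], dtype='category')

categories = s.cat.categories   # unique labels, stored once
codes = s.cat.codes             # one small integer per row

print(list(categories))
print(codes.tolist())

# Looking each code up in `categories` rebuilds the original values
rebuilt = categories[codes.to_numpy()]
print(list(rebuilt))
```

By default the inferred categories are sorted, so the integer assigned to a label depends on its sorted position rather than on the order in which it first appears.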

**When to Use This**:

  * **This is a critical optimization.** Consider it for any text (`object`) column with low cardinality (i.e., few unique values compared to the total size).
  * **Good candidates:** "State", "Country", "Gender", "Status" (e.g., "Pending", "Complete"), "Department", "SKU".
  * **Bad candidates:** "Full Name", "Email Address", "Comment Text" (these are all unique and won't save any memory).
  * Use `pd.Categorical()` (the constructor) when you need to *pre-define* the categories and their order (e.g., "Low", "Medium", "High").

-----

### 0\. Syntax & Parameters

There are two ways to create a categorical type.

#### 1\. The Easy Way: `series.astype('category')`

This is what you will use 99% of the time.

```python
series.astype('category')
```

  * **What it does:** Converts an existing Series (usually `object` type) to `category` type. Pandas will automatically find the unique values.

#### 2\. The Powerful Way: `pandas.Categorical()`

This is the "constructor" you use when you need more control, like setting a specific *order*.

```python
pandas.Categorical(values, categories=None, ordered=False)
```

  * **`values`**
      * **What it does:** The raw data (a list, Series, etc.) that you want to convert.
      * **Default value:** (Required)
      * **When you would use it:** You *always* provide this. `pd.Categorical(['A', 'B', 'A'])`.
  * **`categories`**
      * **What it does:** An "allow list" of the *only* categories that are valid. If you provide this, any value in `values` that is *not* in this list will be converted to `NaN`.
      * **Default value:** `None`
      * **When you would use it:** To enforce data quality or set a specific order. `categories=['Low', 'Medium', 'High']`.
      * **What happens if you don't specify it:** Pandas infers the categories from the `values` (e.g., `['A', 'B']`).
  * **`ordered`**
      * **What it does:** A boolean (True/False). If `True`, it tells Pandas that the categories have a *meaningful order* (e.g., "Low" \< "Medium" \< "High").
      * **Default value:** `False`
      * **When you would use it:** Set this to `True` when creating an ordinal (ordered) category. This "unlocks" `min()`/`max()` and comparisons such as `<` and `>`, and sorting follows the order given in `categories`.
      * **What happens if you don't specify it:** The categories are treated as unordered (e.g., "Male" is not \> "Female").
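
The same parameters can also be supplied through a `CategoricalDtype` object passed to `astype()`. The sketch below compares that route with the constructor; both are standard Pandas APIs, and the sample data is made up.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

levels = pd.Series(['Medium', 'Low', 'High', 'Low'])

# Option 1: the constructor, wrapped back into a Series
via_constructor = pd.Series(
    pd.Categorical(levels, categories=['Low', 'Medium', 'High'], ordered=True)
)

# Option 2: a reusable dtype object passed to astype()
level_dtype = CategoricalDtype(categories=['Low', 'Medium', 'High'], ordered=True)
via_dtype = levels.astype(level_dtype)

print(via_dtype)
print(via_constructor.dtype == via_dtype.dtype)
```

A dtype object is convenient when several columns or DataFrames should share exactly the same categories and order.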

-----

### 1\. Basic Example

Let's see the 99% use case: `.astype('category')`.

```python
import pandas as pd
import numpy as np

# Example 1: Original 'object' Series
# A column of 1 million rows, but only 3 unique values
s_object = pd.Series(['A', 'B', 'C', 'A'] * 250000)
print("--- 1. Original (object) ---")
print(s_object.head())
print(f"Dtype: {s_object.dtype}")
print(f"Memory: {s_object.memory_usage(deep=True)} bytes")

# Example 2: Converted 'category' Series
s_cat = s_object.astype('category')
print("\n--- 2. Converted (category) ---")
print(s_cat.head())
print(f"Dtype: {s_cat.dtype}")
print(f"Memory: {s_cat.memory_usage(deep=True)} bytes")
```

**Output:**

```
--- 1. Original (object) ---
0    A
1    B
2    C
3    A
dtype: object
Dtype: object
Memory: 60000128 bytes

--- 2. Converted (category) ---
0    A
1    B
2    C
3    A
dtype: category
Categories (3, object): ['A', 'B', 'C']
Dtype: category
Memory: 1000332 bytes
```

**Explanation:**
Look at the memory usage\! We went from \~60 MB to \~1 MB, a saving of about 98% (exact byte counts vary with your Pandas and Python versions). Pandas automatically found the 3 unique "Categories" (`['A', 'B', 'C']`) and converted the 1 million strings into 1 million small integers.

**Example 3: Using the `.cat` accessor**
Just like `.dt` for dates, `category` columns unlock the `.cat` accessor.

```python
# 's_cat' is our Series from Example 2
# Example 4: Get the categories
print("\n--- 3. .cat.categories ---")
print(s_cat.cat.categories)

# Example 5: Get the underlying integer codes
print("\n--- 4. .cat.codes (showing first 5) ---")
print(s_cat.cat.codes.head())
```

**Output:**

```
--- 3. .cat.categories ---
Index(['A', 'B', 'C'], dtype='object')

--- 4. .cat.codes (showing first 5) ---
0    0
1    1
2    2
3    0
4    1
dtype: int8
```

**Explanation:**
This shows how it works. It stores `['A', 'B', 'C']` *once*, and the data `['A', 'B', 'C', 'A', 'B']` is stored as `[0, 1, 2, 0, 1]`...

-----

### 2\. Intermediate Example

Using `pd.Categorical()` to create an **ordered** category. This is the "powerful" way.

**Example 6: Creating an *ordered* category**
This is for data where the order *matters* (e.g., 'Low' \< 'Medium' \< 'High').

```python
# The data, in a random order
data = ['Medium', 'Low', 'High', 'Low', 'Medium']

# Example 7: Define the *correct* order
level_order = ['Low', 'Medium', 'High']

# Create the categorical
s_ordered = pd.Categorical(data, categories=level_order, ordered=True)
s_ordered = pd.Series(s_ordered) # Put it back in a Series for easy viewing

print("--- 5. Ordered Category ---")
print(s_ordered)
print(f"\nIs it ordered? {s_ordered.cat.ordered}")
```

**Output:**

```
--- 5. Ordered Category ---
0    Medium
1       Low
2      High
3       Low
4    Medium
dtype: category
Categories (3, object): ['Low' < 'Medium' < 'High']

Is it ordered? True
```

**Explanation:**
By setting `ordered=True` and providing the `categories` list, we've "taught" Pandas the correct order. The output `['Low' < 'Medium' < 'High']` shows this.

**Example 8: Why an ordered category is powerful**
Now we can sort *logically*, not alphabetically.

```python
# Use the ordered Series from Example 7
print("\n--- 6. Logical Sorting ---")
print(s_ordered.sort_values())

# Example 9: We can also filter using < or >
print("\n--- 7. Logical Filtering (> 'Low') ---")
print(s_ordered[s_ordered > 'Low'])
```

**Output:**

```
--- 6. Logical Sorting ---
1       Low
3       Low
0    Medium
4    Medium
2      High
dtype: category
Categories (3, object): ['Low' < 'Medium' < 'High']

--- 7. Logical Filtering (> 'Low') ---
0    Medium
2      High
4    Medium
dtype: category
Categories (3, object): ['Low' < 'Medium' < 'High']
```

**Explanation:**

  * A normal sort (`.sort_values()`) on "High", "Low", "Medium" would be alphabetical ("High", "Low", "Medium").
  * Because our category is *ordered*, `sort_values()` correctly puts "Low" first.
  * We can also now use logical filters like `> 'Low'`. On plain `object` strings the same comparison would be alphabetical, which is not what we want here.
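
A side-by-side sketch of the difference, using the same five values as plain text and as an ordered category:

```python
import pandas as pd

data = ['Medium', 'Low', 'High', 'Low', 'Medium']

# Plain strings sort (and compare) alphabetically
as_text = pd.Series(data)
text_sorted = as_text.sort_values().tolist()
print(text_sorted)

# An ordered categorical sorts and compares by the declared order
as_ordered = pd.Series(
    pd.Categorical(data, categories=['Low', 'Medium', 'High'], ordered=True)
)
ordered_sorted = as_ordered.sort_values().tolist()
print(ordered_sorted)
print(as_ordered.min(), as_ordered.max())
```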

-----

### 3\. Advanced or Tricky Case

Using `categories` to enforce data quality and finding values that aren't in the list.

**Example 10: Using `categories` as a "validator"**
What happens if our data has a typo?

```python
data = ['A', 'B', 'C', 'D'] # 'D' is a typo
allowed_cats = ['A', 'B', 'C']

# Example 11: Create a categorical, defining the *only* allowed categories
s_validated = pd.Categorical(data, categories=allowed_cats)
s_validated = pd.Series(s_validated)

print("--- 8. Validated Category ---")
print(s_validated)
```

**Output:**

```
--- 8. Validated Category ---
0      A
1      B
2      C
3    NaN
dtype: category
Categories (3, object): ['A', 'B', 'C']
```

**Explanation:**
We defined the `categories` as `['A', 'B', 'C']`. When `pd.Categorical` saw `'D'` in the `data`, it did *not* know what to do with it, so it converted it to `NaN`. This is a powerful way to clean data and enforce a "schema."
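
Because the conversion to `NaN` is silent, it can help to list the offending values first. This is one possible check, sketched with `isin()`; the variable names are illustrative.

```python
import pandas as pd

data = pd.Series(['A', 'B', 'C', 'D'])
allowed_cats = ['A', 'B', 'C']

# Flag values outside the allow list BEFORE they silently become NaN
unexpected = data[~data.isin(allowed_cats)]
print(unexpected)

validated = pd.Series(pd.Categorical(data, categories=allowed_cats))
print(validated.isna().sum())
```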

**Example 12: Adding a new category**
What if you need to add a new *valid* category after creation?

```python
s_cat = pd.Series(['A', 'B']).astype('category')
print("\n--- 9. Before ---")
print(s_cat.cat.categories)

# This will add 'C' to the list of known categories
s_cat_new = s_cat.cat.add_categories(['C'])
print("\n--- 10. After .cat.add_categories() ---")
print(s_cat_new.cat.categories)
```

**Output:**

```
--- 9. Before ---
Index(['A', 'B'], dtype='object')

--- 10. After .cat.add_categories() ---
Index(['A', 'B', 'C'], dtype='object')
```

**Example 13: Removing a category**

```python
# Example 14: Removing 'B'
s_cat_removed = s_cat.cat.remove_categories(['B'])
print("\n--- 11. After .cat.remove_categories() ---")
print(s_cat_removed.cat.categories)
```

**Output:**

```
--- 11. After .cat.remove_categories() ---
Index(['A'], dtype='object')
```
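
One detail worth seeing in a sketch: removing a category that is still in use turns those rows into `NaN`. If the goal is only to drop labels that no row uses, `remove_unused_categories()` is the gentler option. The sample Series is invented.

```python
import pandas as pd

s_cat = pd.Series(['A', 'B', 'A']).astype('category')

# Dropping 'B' from the categories turns existing 'B' rows into NaN
s_removed = s_cat.cat.remove_categories(['B'])
print(s_removed)

# remove_unused_categories() only drops labels that no row uses
s_filtered = s_cat[s_cat != 'B'].cat.remove_unused_categories()
print(list(s_filtered.cat.categories))
```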

-----

### 4\. Real-World Use Case

**Example 15: Cleaning and ordering a "Survey" column**
You have survey data with "ratings" as text. You want to clean, order, and analyze it.

```python
df = pd.DataFrame({
    'user': [1, 2, 3, 4, 5],
    'rating': ['Good', 'OK', 'Bad', 'OK', 'Godo'] # 'Godo' is a typo
})
print("--- 12. Original Survey Data ---")
print(df)

# Example 16: Define the order and clean
# 1. Define the correct, ordered categories
rating_order = ['Bad', 'OK', 'Good']

# 2. Convert using pd.Categorical to validate
# This turns 'Godo' into NaN
df['rating_cat'] = pd.Categorical(df['rating'], categories=rating_order, ordered=True)

print("\n--- 13. Cleaned and Ordered ---")
print(df)
df.info()

# Example 17: Now we can analyze
print("\n--- 14. Find all users who rated > 'Bad' ---")
print(df[df['rating_cat'] > 'Bad'])
```

**Output:**

```
--- 12. Original Survey Data ---
   user rating
0     1   Good
1     2     OK
2     3    Bad
3     4     OK
4     5   Godo

--- 13. Cleaned and Ordered ---
   user rating rating_cat
0     1   Good       Good
1     2     OK         OK
2     3    Bad        Bad
3     4     OK         OK
4     5   Godo        NaN
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   user        5 non-null      int64   
 1   rating      5 non-null      object  
 2   rating_cat  4 non-null      category
dtypes: category(1), int64(1), object(1)
memory usage: 325.0 bytes

--- 14. Find all users who rated > 'Bad' ---
   user rating rating_cat
0     1   Good       Good
1     2     OK         OK
3     4     OK         OK
```

**Explanation:**
This is a perfect workflow. We used `pd.Categorical` to:

1.  **Clean** the data (the typo `'Godo'` became `NaN`).
2.  **Order** the data (`'Bad' < 'OK' < 'Good'`).
3.  **Enable** powerful filtering (like `df['rating_cat'] > 'Bad'`).
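
If a typo has an obvious intended value, it can be repaired before validation instead of being dropped. The `corrections` mapping below is a hypothetical example; assuming `'Godo'` meant `'Good'` is a judgment call that should be confirmed against the real data.

```python
import pandas as pd

ratings = pd.Series(['Good', 'OK', 'Bad', 'OK', 'Godo'])
rating_order = ['Bad', 'OK', 'Good']

# Hypothetical correction table; in practice it would come from reviewing
# the values that fall outside rating_order
corrections = {'Godo': 'Good'}
repaired = ratings.replace(corrections)

rating_cat = pd.Series(
    pd.Categorical(repaired, categories=rating_order, ordered=True)
)
print(rating_cat.isna().sum())
print(rating_cat.max())
```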

-----

### 5\. Common Mistakes / Pitfalls

**Mistake 18: Using `category` on a high-cardinality column**
"Cardinality" means "number of unique values."

```python
# Wrong code (bad idea)
# 100,000 synthetic addresses, every one of them unique
s_email = pd.Series([f'user{i}@test.com' for i in range(100_000)])
print("\n--- 15. High Cardinality (e.g., email) ---")
print(f"Memory (object): {s_email.memory_usage(deep=True)} bytes")

s_email_cat = s_email.astype('category')
print("\n--- 16. As category (BAD) ---")
print(f"Memory (category): {s_email_cat.memory_usage(deep=True)} bytes")

**Why it happens:** If almost *every* value is unique (like an email or a unique ID), the `category` type has to store *all* of them in the `categories` index *plus* the `codes`\! This often uses *more* memory than just leaving it as `object`.
**Rule of thumb:** `category` is good when `df['col'].nunique() / len(df)` is very low (e.g., \< 1%).
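
The rule of thumb can be turned into a quick check. `unique_ratio` is a hypothetical helper, and the 1% threshold simply mirrors the guideline above rather than any hard limit in Pandas.

```python
import pandas as pd

def unique_ratio(series):
    """Share of distinct values: near 0 is repetitive, near 1 is mostly unique."""
    return series.nunique() / len(series)

states = pd.Series(['CA', 'NY', 'TX', 'CA'] * 250)              # 3 distinct in 1,000 rows
emails = pd.Series([f'user{i}@test.com' for i in range(1000)])  # all distinct

for name, col in [('states', states), ('emails', emails)]:
    ratio = unique_ratio(col)
    # 0.01 mirrors the "< 1%" rule of thumb; treat it as a starting point
    verdict = 'good candidate' if ratio < 0.01 else 'leave as object'
    print(f'{name}: ratio={ratio:.3f} -> {verdict}')
```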

**Mistake 19: Trying to set a value that isn't a known category**

```python
s_cat = pd.Series(['A', 'B']).astype('category')
print("\n--- 17. Original Categories ---")
print(s_cat.cat.categories)

# Wrong code
try:
    # Assign to an existing row; 'C' is not in ['A', 'B']
    s_cat.loc[1] = 'C'
except (TypeError, ValueError) as e:  # older Pandas versions raise ValueError
    print(f"\n--- 18. Error ---")
    print(e)
```

**Error/Wrong Output:**
`TypeError: Cannot setitem on a Categorical with a new category...`

**Why it happens:** This is a safety feature. The column *only* knows about 'A' and 'B'.

**Example 20: Corrected code:**
You *must* add the category *first*.

```python
s_cat_new = s_cat.cat.add_categories(['C'])
s_cat_new.loc[1] = 'C' # Now this works
print("\n--- 19. Corrected ---")
print(s_cat_new)
```

**Mistake 21: Forgetting `ordered=True`**

```python
# Wrong code
s_unordered = pd.Series(['Low', 'Medium', 'High']).astype('category')
print("\n--- 20. Unordered Category ---")
try:
    print(s_unordered.min()) # Fails on unordered
except TypeError as e:
    print(e)
```

**Error/Wrong Output:**
`Categorical is not ordered...`

**Why it happens:** You didn't tell Pandas the order. It has no idea if "Low" is smaller than "High".

**Correction:** Create the category with `ordered=True` (e.g., `pd.Categorical(..., ordered=True)`) so that `min()`, `max()`, and `<`/`>` work.
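
For a column that is already a `category`, one way to apply the correction without rebuilding it is `.cat.set_categories()`, sketched below.

```python
import pandas as pd

s_unordered = pd.Series(['Low', 'Medium', 'High']).astype('category')

# set_categories() can reorder the labels and mark them as ordered
s_fixed = s_unordered.cat.set_categories(['Low', 'Medium', 'High'], ordered=True)

print(s_fixed.cat.ordered)
print(s_fixed.min(), s_fixed.max())
```

Calling `.cat.as_ordered()` alone would keep the inferred alphabetical order (`'High' < 'Low' < 'Medium'`), which is why the explicit list is passed here.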

-----

### 6\. Key Terms (Explained Simply)

  * **`.dt` Accessor:** A "gateway" property on a `datetime64[ns]` Series that "unlocks" date and time properties (like `.dt.year`, `.dt.day_name()`).
  * **`datetime64[ns]`:** The data type for a date/time column in Pandas. You *must* have this type before you can use `.dt`.
  * **Feature Engineering:** The process of creating new, useful columns (features) from your raw data, often by using `.dt` (e.g., creating `.dt.month`, `.dt.weekday`).
  * **`category` (dtype):** A memory-saving data type for columns with few unique, repeated values (e.g., 'Male'/'Female').
  * **`pd.Categorical()`:** The "constructor" used to build a categorical column, especially when you need to define a specific *order* (`ordered=True`).
  * **`.cat` Accessor:** A "gateway" property on a `category` Series that "unlocks" its special properties (like `.cat.categories`, `.cat.codes`).
  * **Cardinality:** The number of unique values in a column. `category` type is best for **low-cardinality** columns (e.g., 'State', not 'Email').
  * **Ordinal:** A category with a meaningful order (e.g., 'Low' \< 'Medium' \< 'High'). Created by setting `ordered=True`.

-----

### 7\. Best Practices

  * **For `.dt`:**
      * **Convert first:** Always use `pd.to_datetime()` *before* you try to use `.dt`.
      * **Vectorize:** Use `.dt` properties to filter or create new columns.
          * **Good:** `df[df['date'].dt.weekday > 4]`
          * **Bad:** `for ts in df['date']: if ts.weekday() > 4: ...` (a Python-level loop over every row, typically orders of magnitude slower on large data).
      * **Feature Engineer:** Don't hesitate to break a date into many `.dt` columns (`year`, `month`, `weekday`) for analysis.
  * **For `category`:**
      * **Check `nunique()`:** Before converting, check `df['col'].nunique()`. If it's very high (like 'Email'), *do not* use `category`.
      * **Use on low-cardinality `object`:** Prefer `s.astype('category')` for columns like 'State', 'Gender', 'Status', 'Region' to save memory.
      * **Use `pd.Categorical()` for ordering:** If your category has a logical order ('Bad' \< 'Good'), use the full `pd.Categorical(..., ordered=True)` constructor to set it.
      * **Add categories first:** If you need to add a new value (like 'C') to a category column that only knows 'A' and 'B', you must use `s.cat.add_categories(['C'])` *before* you can set the value.
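
To illustrate the "Vectorize" advice, this sketch computes the same weekend flag two ways and confirms they agree. No timing is shown; the speed gap depends on data size and environment.

```python
import pandas as pd

df = pd.DataFrame({'date': pd.date_range('2025-01-01', periods=14, freq='D')})

# Vectorized: one call handles the whole column
weekend_fast = df['date'].dt.weekday > 4

# Row by row: same answer, but Python-level work for every element
weekend_slow = df['date'].apply(lambda ts: ts.weekday() > 4)

print(weekend_fast.tolist() == weekend_slow.tolist())
print(int(weekend_fast.sum()))
```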

-----

### 8\. Mini Summary

  * **`.dt` accessor** is the key (`.dt.`) that unlocks properties (`.year`, `.month`) and methods (`.day_name()`) on a `datetime64[ns]` Series.
  * You *must* convert text to `datetime64[ns]` using `pd.to_datetime()` *before* you can use `.dt`.
  * **`category` dtype** is a massive memory-saver for text columns with few unique values (low cardinality).
  * Use `s.astype('category')` for a quick, unordered conversion.
  * Use `pd.Categorical(..., ordered=True)` to create *ordered* categories (like 'Low', 'Medium', 'High') that can be sorted logically.
  * `category` columns have a `.cat` accessor for tasks like adding/removing categories.

-----

### 10\. Practice Tasks

**Data for Tasks:**

```python
df_practice = pd.DataFrame({
    'timestamp': ['2025-01-01 10:00', '2025-01-31 22:00', '2025-02-15 05:00', '2025-02-16 12:00'],
    'user_level': ['Gold', 'Silver', 'Bronze', 'Silver'],
    'sales': [100, 50, 20, 55]
})
```

**Task 24 (Easy - `.dt`):**
First, convert the 'timestamp' column in `df_practice` to `datetime64[ns]`. Then, use the `.dt` accessor to create a new column called 'Hour' that contains just the hour (10, 22, 5, 12).

**Task 25 (Medium - `category`):**
The 'user\_level' column is a perfect candidate for `category` type. Create a new DataFrame `df_medium` where 'user\_level' is converted to a category. Then, print the *memory usage* of the original `df_practice['user_level']` and the new `df_medium['user_level']`.

**Task 26 (Hard - both):**
Using `df_practice`, create an "analysis DataFrame" that shows:

1.  A new column 'day\_name' (e.g., 'Wednesday') from the 'timestamp' column.
2.  The 'user\_level' column has been converted into an *ordered* category with the logical order `['Bronze', 'Silver', 'Gold']`.
3.  Filter this new DataFrame to show only sales that occurred on a 'day\_name' *after* 'Monday' (using `.dt.weekday`) and had a 'user\_level' *greater than* 'Bronze'.

-----

### 11\. Recommended Next Topic

You've learned how to convert and work with the most important data types. The next logical step from the roadmap is to focus on a new data-cleaning task: finding and handling "bad" data beyond just `NaN`s.

**Recommended:** **Handling Missing Data (re-visited) & Duplicates (`.isna()`, `.dropna()`, `.fillna()`, `.duplicated()`, `.drop_duplicates()`)**

-----

### 12\. Quick Reference Card

| Accessor | Data Type | Syntax | Example Properties / Methods |
| :--- | :--- | :--- | :--- |
| **`.dt`** | `datetime64[ns]` | `series.dt.property` | `.dt.year`, `.dt.month`, `.dt.day`, `.dt.hour`, `.dt.weekday` |
| | (dates/times) | `series.dt.method()` | `.dt.day_name()`, `.dt.month_name()`, `.dt.normalize()`, `.dt.strftime()` |
| **`.cat`** | `category` | `series.cat.property` | `.cat.categories`, `.cat.codes`, `.cat.ordered` |
| | (categorical) | `series.cat.method()` | `.cat.add_categories()`, `.cat.remove_categories()`, `.cat.set_categories()` |
| **To Create** | `object` -\> `datetime` | `pd.to_datetime(series)` | (Use `format` and `errors='coerce'`) |
| **To Create** | `object` -\> `category` | `series.astype('category')` | (Fastest, unordered) |
| **To Create** | `object` -\> `category` | `pd.Categorical(series, ...)` | (Use for `ordered=True`) |

-----

### 13\. Common Interview Questions

1.  **I have a 'date' column as `object`. Why can't I use `df['date'].dt.year`?**
      * The `.dt` accessor *only* works on a column that is already a `datetime64[ns]` type. You must first convert it using `df['date'] = pd.to_datetime(df['date'])`.
2.  **I have a column of 5 million rows for "State". `df.info()` shows it's using 400MB of memory. How can I fix this?**
      * Convert it to a `category` type: `df['State'] = df['State'].astype('category')`. Since there are only \~50 unique states, Pandas will store the 50 names once and use small integers for the 5 million rows, drastically reducing memory usage.
3.  **How do you create a 'Survey' column where "Bad" \< "OK" \< "Good"?**
      * Plain `.astype('category')` is not enough: it creates an *unordered* category with the labels in sorted (alphabetical) order.
      * Use the `pd.Categorical` constructor (or an ordered `CategoricalDtype`) instead:
      * `order = ['Bad', 'OK', 'Good']`
      * `df['Survey_Cat'] = pd.Categorical(df['Survey_Raw'], categories=order, ordered=True)`
4.  **How do you find all sales that happened on a weekend?**
      * First, ensure your date column is `datetime64[ns]`.
      * Then, use the `.dt.weekday` property. The weekday accessor returns Monday=0, Sunday=6.
      * `df_weekends = df[df['date'].dt.weekday >= 5]`

-----

### 14\. Performance Considerations

  * **`.dt` Accessor:**
      * **Time Complexity:** All `.dt` property access (like `.dt.year`) is **O(n)**. The calculation is vectorized and very fast.
      * **Memory Usage:** Accessing a `.dt` property creates a *new Series* (a copy) containing the extracted data (e.g., a Series of `int`s for the year). This is generally small.
  * **`category` Type:**
      * **Time Complexity (Creation):** `.astype('category')` is **O(n log k)** or `O(n + k)`, where 'n' is rows and 'k' is unique categories. It has to find all unique values and create the codes. This is a one-time cost.
      * **Time Complexity (Operations):** Operations like `groupby('my_cat_col')` are *significantly faster* on a `category` column than an `object` column, because Pandas can group on the underlying integers.
      * **Memory Usage:** This is the main point. `category` can be **10-100x more memory-efficient** than `object` for low-cardinality columns.

-----

### 15\. When NOT to Use This

  * **When NOT to use `.dt`:**
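      * For a `datetime64[ns]` *column* (Series), `.dt` is the standard way to access its components, so there is rarely a reason to avoid it. The main exceptions are objects that expose the same attributes directly: a `DatetimeIndex` (`df.index.year`) and a single `Timestamp` (`ts.year`) need no `.dt`.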
  * **When NOT to use `category`:**
      * **High-Cardinality Columns:** Do **NOT** use `.astype('category')` on a column where most values are unique (like 'Email', 'User\_ID', 'Comment\_Text'). It will use *more* memory and be *slower* because the `categories` list will be huge.
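      * **If you rely heavily on string operations:** The `.str` accessor still works on a `category` Series whose categories are strings, but the result comes back as a plain `object` or `bool` Series rather than a `category`, so you may need to convert again afterwards. There is no `s.cat.contains('A')`; to work on the unique labels only, use the categories themselves (e.g., `s.cat.categories.str.contains('A')`).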
      * **If you are constantly adding new "types":** If your column is a "tag" field where new tags appear *constantly*, it's a bad fit. You would have to call `.cat.add_categories()` every time, which is inefficient.
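
The "use the categories themselves" approach can be sketched as follows: run the string test once per unique label, then map the matching labels back to rows with `isin()`. The fruit names are invented sample data.

```python
import pandas as pd

s = pd.Series(['Apple', 'Banana', 'Avocado', 'Banana'], dtype='category')

# Test each unique label once (case-sensitive match on capital 'A')
matching = s.cat.categories[s.cat.categories.str.contains('A')]
print(list(matching))

# Map the matching labels back onto the rows
mask = s.isin(matching)
print(mask.tolist())
```

With only a handful of labels and many rows, this does the string work on a few labels instead of on every row.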