# 12. Type conversions first subtopic: `.astype()`.

-----

The `.astype()` method is your primary, all-purpose tool for changing the data type of a Series or an entire DataFrame. Think of it as a "type-casting" command. You use it when you *know* the data is in the wrong format and you want to *force* Pandas to change it.

For example, a column of numbers (`1`, `2`, `3`) might be loaded as text (`'1'`, `'2'`, `'3'`), which Pandas calls an `object` type. You can't do math on text. `.astype(int)` is the command you use to "re-cast" that text into actual numbers.

**How It Works in Memory**: `.astype()` returns a **new** Series or DataFrame; it never modifies the original in place. For each column being converted, Pandas allocates a new NumPy array and copies the data over, changing its type in the process. Both the old and the new data exist until you drop the original, so converting a whole DataFrame can briefly take up to roughly twice the memory (less if the target types are narrower).
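
One way to see this for yourself is to modify the converted result and check that the original is unaffected. This is a minimal sketch; the variable names are only illustrative.

```python
import pandas as pd

original = pd.Series([1, 2, 3])          # int64
converted = original.astype("float64")   # new Series with its own data

# Modifying the converted Series leaves the original untouched.
converted.iloc[0] = 99.0

print(original.tolist())                 # [1, 2, 3]
print(converted.tolist())                # [99.0, 2.0, 3.0]
print(original.dtype, converted.dtype)   # int64 float64
```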

**When to Use This**:

  * **Always use this** when you need to change from one valid type to another (e.g., `int` to `float`, `float` to `int`, `object` to `int`, `object` to `category`).
  * Use it to convert text (`object`) to numbers (`int` or `float`) when you are **100% sure** the column contains *only* numbers (or text that looks like numbers).
  * Use it to convert text (`object`) to `category` to save memory.
  * Use it to "downcast" types to save memory (e.g., from `int64` to `int32`).
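
As a sketch of the downcasting case, the snippet below checks that every value fits in the smaller type before casting. The range check is an added precaution, not something `.astype()` requires, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 200, 3000], dtype="int64")

# Confirm every value fits in the smaller type before casting.
limits = np.iinfo("int16")
fits = bool(s.between(limits.min, limits.max).all())

small = s.astype("int16") if fits else s

print(fits)                              # True
print(s.memory_usage(index=False))       # 24 (3 values x 8 bytes)
print(small.memory_usage(index=False))   # 6  (3 values x 2 bytes)
```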

-----

### 0\. Syntax & Parameters

This method can be called on a Series or a DataFrame.

#### `series.astype()`

```python
series.astype(dtype, copy=True, errors='raise')
```

#### `dataframe.astype()`

```python
dataframe.astype(dtype, copy=True, errors='raise')
```

  * **`dtype`**
      * **What it does:** This is the most important parameter. It's the target data type you want to convert *to*.
      * **Default value:** (Required)
      * **When you would use it:** You *always* provide this.
          * **For a Series:** `s.astype(int)` or `s.astype('int64')` or `s.astype(np.int64)`.
          * **For a DataFrame:** Pass a single type to convert *every* column (`df.astype(float)`), or a **dictionary** mapping columns to types to convert only some of them: `df.astype({'col_A': int, 'col_B': float})`.
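  * **`copy`**
      * **What it does:** A boolean (True/False) that controls whether the result is always a fresh copy. Setting `copy=False` tells Pandas to *try* not to make a copy *if* the data type is already correct; a real conversion allocates new data either way.
      * **Default value:** `True` (recent pandas versions show `None` in the signature and are phasing this keyword out in favor of Copy-on-Write).
      * **When you would use it:** You almost never change this. The default is safer and more predictable.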
  * **`errors`**
      * **What it does:** Tells Pandas what to do if it finds a value it *cannot* convert.
      * **Default value:** `'raise'`
      * **When you would use it:**
          * `'raise'`: The default. It will stop your code and show a `ValueError` (e.g., if you try to `astype(int)` on the string `'hello'`).
          * `'ignore'`: This is **dangerous**. It will silently fail, not perform the conversion, and return your *original* object. It's almost never what you want.
      * **Note:** For handling bad data, `pd.to_numeric()` (our next topic) is *much* better because it has an `errors='coerce'` option.

-----

### 1\. Basic Example

Let's do the most common conversions: `float` to `int`, then `object` (text) to `int`.

```python
import pandas as pd
import numpy as np

# Example 1: Basic conversion (float to int)
s_float = pd.Series([1.0, 2.5, 3.9])
print("--- 1. Original (float) ---")
print(s_float)
print(f"Dtype: {s_float.dtype}")

# This will TRUNCATE the decimals
s_int = s_float.astype(int)
print("\n--- 2. After .astype(int) ---")
print(s_int)
print(f"Dtype: {s_int.dtype}")
```

**Output:**

```
--- 1. Original (float) ---
0    1.0
1    2.5
2    3.9
dtype: float64
Dtype: float64

--- 2. After .astype(int) ---
0    1
1    2
2    3
dtype: int64
Dtype: int64
```

**Explanation:**
`.astype(int)` successfully converted the Series. Note that it does *not* round: it **truncates** (cuts off) the decimal, moving toward zero. `2.5` became `2`, and `3.9` became `3`.
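
If you want rounding rather than truncation, one option is to call `.round()` before casting. A small sketch with negative values included, since truncation moves toward zero:

```python
import pandas as pd

s = pd.Series([-2.7, -0.5, 2.5, 3.9])

truncated = s.astype(int)          # drops the fractional part
rounded = s.round().astype(int)    # rounds first, then casts

print(truncated.tolist())  # [-2, 0, 2, 3]
print(rounded.tolist())    # [-3, 0, 2, 4]
```

Note that `.round()` rounds halves to the nearest even number, which is why `2.5` becomes `2` and `-0.5` becomes `0` here.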

**Example 2: `object` (text) to `int`**
This is the most common use case after loading a CSV.

```python
# 'Price' is loaded as text
s_text = pd.Series(['100', '200', '300'])
print("\n--- 3. Original (text/object) ---")
print(s_text)
print(f"Dtype: {s_text.dtype}")

# Convert to numeric
s_numeric = s_text.astype(int)
print("\n--- 4. After .astype(int) ---")
print(s_numeric)
print(f"Dtype: {s_numeric.dtype}")

# Now we can do math
print(f"\nSum: {s_numeric.sum()}")
```

**Output:**

```
--- 3. Original (text/object) ---
0    100
1    200
2    300
dtype: object
Dtype: object

--- 4. After .astype(int) ---
0    100
1    200
2    300
dtype: int64
Dtype: int64

Sum: 600
```

**Explanation:**
The original Series was `object` type, so you can't do math on it: `s_text.sum()` would have just concatenated the strings into `'100200300'`. `s_numeric.sum()` returns the number `600`.

-----

### 2\. Intermediate Example

Using `.astype()` on a full DataFrame and converting to `category`.

**Example 3: Using `astype` on a DataFrame with a `dict`**
This is the standard way to clean multiple columns at once.

```python
df = pd.DataFrame({
    'A_str': ['1', '2', '3'],
    'B_float': [10.5, 11.2, 12.9],
    'C_keep': ['x', 'y', 'z']
})
print("--- 5. Original DataFrame ---")
df.info()

# We want to change A to int and B to int
df_clean = df.astype({'A_str': int, 'B_float': int})

print("\n--- 6. Cleaned DataFrame ---")
df_clean.info()

print("\n--- 7. Cleaned Data ---")
print(df_clean)
```

**Output:**

```
--- 5. Original DataFrame ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   A_str    3 non-null      object 
 1   B_float  3 non-null      float64
 2   C_keep   3 non-null      object 
dtypes: float64(1), object(2)
memory usage: 200.0+ bytes

--- 6. Cleaned DataFrame ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   A_str    3 non-null      int64 
 1   B_float  3 non-null      int64 
 2   C_keep   3 non-null      object
dtypes: int64(2), object(1)
memory usage: 200.0+ bytes

--- 7. Cleaned Data ---
   A_str  B_float C_keep
0      1       10      x
1      2       11      y
2      3       12      z
```

**Explanation:**
By passing a dictionary `{'A_str': int, 'B_float': int}` to `df.astype()`, we changed *only* those two columns. `C_keep` was left alone, and `B_float` was truncated, just as in Example 1.

**Example 4: Converting `object` to `category` (Memory Saver)**
This is a *critical* optimization.

```python
# A Series with lots of repeated text
s_gender = pd.Series(['Male', 'Female', 'Male', 'Male', 'Female'] * 1000)
print("\n--- 8. Original (object) ---")
print(s_gender.head())
print(f"Memory (object): {s_gender.memory_usage(deep=True)} bytes")

# Convert to category
s_cat = s_gender.astype('category')
print("\n--- 9. Converted (category) ---")
print(s_cat.head())
print(f"Memory (category): {s_cat.memory_usage(deep=True)} bytes")
```

**Output:**

```
--- 8. Original (object) ---
0      Male
1    Female
2      Male
3      Male
4    Female
dtype: object
Memory (object): 310040 bytes

--- 9. Converted (category) ---
0      Male
1    Female
2      Male
3      Male
4    Female
dtype: category
Categories (2, object): ['Female', 'Male']
Memory (category): 7016 bytes
```

**Explanation:**
The `object` Series stores 5,000 full-text strings ("Male", "Female", ...). The `category` type is much smarter: it stores the unique values (`['Female', 'Male']`) *once* and then uses tiny integers (`0`, `1`) to point to them. The memory usage dropped from roughly 310,000 bytes to roughly 7,000 bytes (exact figures vary with your Python and pandas versions). The saving depends on the column having few unique values relative to its length. This is one of the most important uses of `.astype()`.
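
To look at those two pieces directly, the `.cat` accessor exposes the unique values and the integer codes. A short sketch reusing the same data:

```python
import pandas as pd

s_gender = pd.Series(['Male', 'Female', 'Male', 'Male', 'Female'] * 1000)
s_cat = s_gender.astype('category')

categories = s_cat.cat.categories.tolist()
first_codes = s_cat.cat.codes.head().tolist()

print(categories)              # ['Female', 'Male']
print(first_codes)             # [1, 0, 1, 1, 0]
print(s_cat.cat.codes.dtype)   # int8

# The categorical version should be the smaller of the two.
saves_memory = s_cat.memory_usage(deep=True) < s_gender.memory_usage(deep=True)
print(saves_memory)            # True
```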

-----

### 3\. Advanced or Tricky Case

Handling `NaN` (missing) values during conversion.

**Example 5: `float` with `NaN` to `int` (The Old Way - Fails)**
You cannot have `NaN` in a standard `int` column.

```python
s_nan = pd.Series([1.0, 2.0, np.nan])
print("\n--- 10. Original (float with NaN) ---")
print(s_nan)

try:
    s_nan.astype(int)
except ValueError as e:
    print(f"\n--- 11. Error ---")
    print(e)
```

**Output:**

```
--- 10. Original (float with NaN) ---
0    1.0
1    2.0
2    NaN
dtype: float64

--- 11. Error ---
Cannot convert non-finite values (NA or inf) to integer
```

**Example 6: The `Int64` (nullable integer) fix**
Pandas created a special type `Int64` (capital "I") to fix this.

```python
# Use the string alias 'Int64' (capital I)
s_nullable_int = s_nan.astype('Int64')
print("\n--- 12. Converted to 'Int64' (nullable) ---")
print(s_nullable_int)
print(f"Dtype: {s_nullable_int.dtype}")
```

**Output:**

```
--- 12. Converted to 'Int64' (nullable) ---
0       1
1       2
2    <NA>
dtype: Int64
Dtype: Int64
```

**Explanation:**
The `int64` (lowercase "i") type *cannot* hold missing values. The special `Int64` (capital "I") type *can*: it stores each missing entry as `pd.NA`, which prints as `<NA>`. This is extremely useful.
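
A brief sketch of how the nullable column behaves afterwards: missing entries are detected by `.isna()` and skipped by aggregations such as `.sum()`.

```python
import numpy as np
import pandas as pd

s_nullable = pd.Series([1.0, 2.0, np.nan]).astype('Int64')

missing_flags = s_nullable.isna().tolist()
total = s_nullable.sum()

print(missing_flags)                 # [False, False, True]
print(s_nullable.iloc[2] is pd.NA)   # True
print(total)                         # 3 (the missing value is skipped)
```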

**Example 7: Converting to `bool` (True/False)**
`astype(bool)` has very specific rules.

```python
s_bool = pd.Series([0, 1, 10, -1, 0.0, 0.5])
print("\n--- 13. Numbers for bool conversion ---")
print(s_bool)

print("\n--- 14. After .astype(bool) ---")
print(s_bool.astype(bool))
```

**Output:**

```
--- 13. Numbers for bool conversion ---
0     0.0
1     1.0
2    10.0
3    -1.0
4     0.0
5     0.5
dtype: float64

--- 14. After .astype(bool) ---
0    False
1     True
2     True
3     True
4    False
5     True
dtype: bool
```

**Explanation:**
When converting to `bool`, **only `0` and `0.0`** are `False`. *All* other numbers (positive, negative, or decimals like 0.5) are `True`. That includes `NaN`, which also converts to `True`, so handle missing values before converting.

**Example 8: `object` to `bool`**
This is even trickier and can be dangerous.

```python
s_text_bool = pd.Series(['True', 'False', 'TRUE', 'FALSE', 'true', 'false', 'T', 'F'])
print("\n--- 15. Text for bool conversion ---")
print(s_text_bool)

print("\n--- 16. After .astype(bool) ---")
print(s_text_bool.astype(bool))
```

**Output:**

```
--- 15. Text for bool conversion ---
0     True
1    False
2     TRUE
3    FALSE
4     true
5    false
6        T
7        F
dtype: object

--- 16. After .astype(bool) ---
0    True
1    True
2    True
3    True
4    True
5    True
6    True
7    True
dtype: bool
```

**Explanation:**
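This is a **major pitfall**. When converting from `object` to `bool`, *any non-empty string* is `True`. Even the string `'False'` becomes `True`! To *properly* convert strings like 'True'/'False', use `.map({'True': True, 'False': False})`. Any string missing from the mapping (such as `'TRUE'` or `'T'` here) becomes `NaN`, so normalize the case or list every variant you expect.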
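
A minimal sketch of the mapping approach, assuming the column only uses spellings of true/false and T/F. The `bool_map` dictionary is hypothetical; extend it to whatever variants your data contains.

```python
import pandas as pd

s_text_bool = pd.Series(['True', 'False', 'TRUE', 'FALSE', 'true', 'false', 'T', 'F'])

# Hypothetical mapping: list every spelling your data is known to use.
bool_map = {'true': True, 'false': False, 't': True, 'f': False}

# Normalize whitespace and case first so one mapping covers all variants.
normalized = s_text_bool.str.strip().str.lower()
s_bool = normalized.map(bool_map)

# Any value missing from the mapping shows up as NaN, so check for leftovers.
unmapped = s_text_bool[s_bool.isna()]

print(s_bool.tolist())   # [True, False, True, False, True, False, True, False]
print(len(unmapped))     # 0
```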

-----

### 4\. Real-World Use Case

**Example 9: Cleaning a "dirty" DataFrame**
You just loaded a file and `.info()` shows everything is an `object`.

```python
df_dirty = pd.DataFrame({
    'Order ID': ['1001', '1002', '1003'],
    'Quantity': ['5', '1', '10'],
    'Price': ['10.99', '5.00', '1.25'],
    'Category': ['Fruit', 'Fruit', 'Veg']
})
print("\n--- 17. Dirty DataFrame ---")
df_dirty.info()

# Example 10: Define the data types
# 'Order ID' should stay text, 'Quantity' should be int
# 'Price' should be float, 'Category' should be category
type_map = {
    'Quantity': int,
    'Price': float,
    'Category': 'category'
}

# Example 11: Clean the DataFrame
df_clean = df_dirty.astype(type_map)

print("\n--- 18. Cleaned DataFrame ---")
df_clean.info()

# Example 12: Prove it works
print(f"\nTotal quantity: {df_clean['Quantity'].sum()}")
print(f"Total price: {df_clean['Price'].sum()}")
```

**Output:**

```
--- 17. Dirty DataFrame ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Order ID  3 non-null      object
 1   Quantity  3 non-null      object
 2   Price     3 non-null      object
 3   Category  3 non-null      object
dtypes: object(4)
memory usage: 224.0+ bytes

--- 18. Cleaned DataFrame ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   Order ID  3 non-null      object  
 1   Quantity  3 non-null      int64   
 2   Price     3 non-null      float64 
 3   Category  3 non-null      category
dtypes: category(1), float64(1), int64(1), object(1)
memory usage: 319.0+ bytes

Total quantity: 16
Total price: 17.24
```

**Explanation:**
We used a `type_map` dictionary to clean the three columns that were wrong. `Order ID` was correctly left as an `object`. We can now perform math on `Quantity` and `Price`, and `Category` is memory-efficient.

-----

### 5\. Common Mistakes / Pitfalls

**Mistake 13: `ValueError` from a "dirty" entry**
This is the single most common problem with `.astype()`.

```python
s_dirty = pd.Series(['1', '2', '3-Oops', '4'])
print("\n--- 19. Dirty Series ---")
print(s_dirty)

# Wrong code
try:
    s_dirty.astype(int)
except ValueError as e:
    print(f"\n--- 20. Error ---")
    print(e)
```

**Error/Wrong Output:**
`ValueError: invalid literal for int() with base 10: '3-Oops'`
**Why it happens:** `.astype(int)` is an "all or nothing" operation. The instant it found `'3-Oops'`, which it couldn't convert to a number, it failed and stopped.
**Correction:** This is where `.astype()` is the *wrong tool*. You **must** use `pd.to_numeric(s, errors='coerce')` for this, which we will cover next.
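
Before deciding how to fix the column, it often helps to see exactly which entries are the problem. One possible sketch, using `pd.to_numeric(errors='coerce')` only as a detector:

```python
import pandas as pd

s_dirty = pd.Series(['1', '2', '3-Oops', '4'])

# Values that fail numeric parsing become NaN under errors='coerce'.
parsed = pd.to_numeric(s_dirty, errors='coerce')

# Bad entries: NaN after parsing, but not missing in the original.
bad_values = s_dirty[parsed.isna() & s_dirty.notna()]

print(bad_values.tolist())         # ['3-Oops']
print(bad_values.index.tolist())   # [2]
```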

**Mistake 14: `astype(bool)` on text**
(See Example 8) The string `'False'` converts to `True`, which is almost never what you want.

**Mistake 15: Forgetting to re-assign**
`.astype()` returns a *copy*. It does not change your original DataFrame.

```python
df = pd.DataFrame({'A': ['1', '2']})
print("\n--- 21. Original ---")
print(f"Dtype: {df['A'].dtype}")

# Wrong code
df.astype({'A': int}) # This creates a copy, then throws it away

print("\n--- 22. After (Still object!) ---")
print(f"Dtype: {df['A'].dtype}")

# Example 16: Corrected code
df_clean = df.astype({'A': int})
print("\n--- 23. Corrected (New DF) ---")
print(f"Dtype: {df_clean['A'].dtype}")
```

**Output:**

```
--- 21. Original ---
Dtype: object

--- 22. After (Still object!) ---
Dtype: object

--- 23. Corrected (New DF) ---
Dtype: int64
```

----






# Type conversions second subtopic: `pd.to_numeric()`.





-----

`pd.to_numeric()` is your specialized tool for converting a Series (or column) to a **numeric type** (like `int64` or `float64`). Unlike the all-purpose `.astype()`, `pd.to_numeric()` is *smarter* and *safer* when dealing with "dirty" data that might contain non-numeric values.

Its killer feature is the `errors='coerce'` parameter. This tells Pandas: "Try to make this column numeric. If you find a value you can't convert (like `'hello'` or `'3-Oops'`), don't stop and raise an error. Just *coerce* it into a `NaN` (Not a Number) value, and keep going."

**How It Works in Memory**: `pd.to_numeric()` is a function, not a method, so you call it as `pd.to_numeric(s)`. Given a Series, it returns a **new** Series with the converted data (a list or array comes back as a NumPy array). It parses each value and then picks one dtype for the whole result: `int64` if every value is a whole number, `float64` if any value has a decimal point. If `errors='coerce'` introduces a `NaN`, the *entire* Series becomes `float64`, because `NaN` is a float.
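
The dtype rule can be checked with three tiny Series. This is only a sketch to illustrate the behavior described above:

```python
import pandas as pd

all_ints = pd.to_numeric(pd.Series(['1', '2', '3']))
with_decimal = pd.to_numeric(pd.Series(['1', '2.5', '3']))
with_bad = pd.to_numeric(pd.Series(['1', 'oops', '3']), errors='coerce')

print(all_ints.dtype)      # int64
print(with_decimal.dtype)  # float64
print(with_bad.dtype)      # float64 (the NaN forces a float column)
```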

**When to Use This**:

  * **This is the preferred tool** for converting a text (`object`) column to numbers when you *suspect* there might be bad data.
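  * Use this when your column has placeholders like `'--'` or `'N/A'` mixed in with clean numbers. Values like `'$1,200'` or `'50%'` need their symbols stripped first (see Example 4); otherwise `errors='coerce'` turns them into `NaN` as well.
  * Use this (with `errors='coerce'`) *before* `.astype(int)` on dirty data: it clears out the non-numeric "gunk" that would cause `.astype()` to fail. The resulting `NaN`s still have to be filled or dropped before `.astype(int)` will work, or you can convert to the nullable `'Int64'` type instead.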

-----

### 0\. Syntax & Parameters

`pd.to_numeric()` is a top-level Pandas function, *not* a DataFrame method.

```python
pandas.to_numeric(arg, errors='raise', downcast=None)
```

  * **`arg`** (argument)
      * **What it does:** The object you want to convert. This is typically a `pd.Series` (or a single column from a DataFrame, like `df['col']`).
      * **Default value:** (Required)
      * **When you would use it:** You *always* provide this. `pd.to_numeric(df['my_column'])`
  * **`errors`**
      * **What it does:** This is the most important parameter. It tells Pandas what to do when it finds a value it can't convert.
      * **Default value:** `'raise'`
      * **When you would use it:**
          * `'raise'`: The default. It will stop your code and show a `ValueError` if it finds a bad value (like `'hello'`). This is the same behavior as `.astype()`.
          * `'coerce'`: **This is the killer feature.** It will replace any bad value (like `'hello'`) with `NaN`. This allows your code to run without stopping.
          * `'ignore'`: This is the *least* useful. It will silently fail on bad values, returning the input unchanged (so the whole column remains `object`). It is deprecated as of pandas 2.2, so avoid it in new code.
  * **`downcast`**
      * **What it does:** An advanced memory-saving feature. If your data is all numbers, this will try to "downcast" them to the smallest possible numeric type.
      * **Default value:** `None`
      * **When you would use it:**
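          * `downcast='integer'` (or `'signed'`): Picks the smallest signed integer type that can hold every value. If your data is `[1.0, 2.0, 3.0]`, the result is `int8`.
          * `downcast='unsigned'`: The same idea with unsigned types (`uint8`, `uint16`, ...), for data with no negative values.
          * `downcast='float'`: Will try to use `float32` instead of `float64`.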
      * **What happens if you don't specify it:** It will use the standard `int64` or `float64`, which is fine.

-----

### 1\. Basic Example

Let's see the most important use: `errors='coerce'`.

```python
import pandas as pd
import numpy as np

# Example 1: A dirty Series with a non-numeric string
s_dirty = pd.Series(['1', '2', '3-Oops', '4'])
print("--- 1. Dirty Series ---")
print(s_dirty)
print(f"Dtype: {s_dirty.dtype}")

# Example 2: Using .astype() (This will FAIL)
try:
    s_dirty.astype(int)
except ValueError as e:
    print(f"\n--- 2. .astype(int) FAILS ---")
    print(e)

# Example 3: Using pd.to_numeric() with errors='coerce'
s_clean = pd.to_numeric(s_dirty, errors='coerce')
print("\n--- 3. pd.to_numeric(errors='coerce') WORKS ---")
print(s_clean)
print(f"Dtype: {s_clean.dtype}")
```

**Output:**

```
--- 1. Dirty Series ---
0         1
1         2
2    3-Oops
3         4
dtype: object
Dtype: object

--- 2. .astype(int) FAILS ---
invalid literal for int() with base 10: '3-Oops'

--- 3. pd.to_numeric(errors='coerce') WORKS ---
0    1.0
1    2.0
2    NaN
3    4.0
dtype: float64
Dtype: float64
```

**Explanation:**
This is the perfect comparison. `.astype(int)` saw `'3-Oops'` and failed instantly. `pd.to_numeric(errors='coerce')` saw `'3-Oops'`, didn't panic, and quietly replaced it with `NaN` (Not a Number). Note that the `dtype` is now `float64`, because `NaN` is a float.

-----

### 2\. Intermediate Example

You can combine `to_numeric` with other methods to clean data.

**Example 4: A common data-cleaning workflow**
Data often has `$` and `,` symbols. `to_numeric` can't handle them *unless* you clean them first.

```python
s_prices = pd.Series(['$1,200.50', '$50.00', 'No Data', '$100.00'])
print("--- 4. Price Series (very dirty) ---")
print(s_prices)

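# Example 5: Clean the strings first
# regex=False treats '$' as a literal character, not a regex end-of-string anchor
s_cleaned_strings = s_prices.str.replace('$', '', regex=False).str.replace(',', '', regex=False)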
print("\n--- 5. After replacing '$' and ',' ---")
print(s_cleaned_strings)

# Example 6: NOW use to_numeric
s_numeric_prices = pd.to_numeric(s_cleaned_strings, errors='coerce')
print("\n--- 6. After pd.to_numeric(errors='coerce') ---")
print(s_numeric_prices)
print(f"Total value: {s_numeric_prices.sum()}")
```

**Output:**

```
--- 4. Price Series (very dirty) ---
0    $1,200.50
1       $50.00
2      No Data
3      $100.00
dtype: object

--- 5. After replacing '$' and ',' ---
0    1200.50
1      50.00
2    No Data
3     100.00
dtype: object

--- 6. After pd.to_numeric(errors='coerce') ---
0    1200.5
1      50.0
2       NaN
3     100.0
dtype: float64
Total value: 1350.5
```

**Explanation:**
This is a standard 2-step process.

1.  Use `.str.replace()` to remove non-numeric characters like `$` and `,`.
2.  Use `pd.to_numeric(errors='coerce')` to convert the cleaned strings to numbers, which neatly handles the remaining bad entries like `'No Data'`.
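
The same two steps can also be written with a single regular expression that strips both symbols at once. This is an alternative sketch, not a replacement for the chained version:

```python
import pandas as pd

s_prices = pd.Series(['$1,200.50', '$50.00', 'No Data', '$100.00'])

# The character class [$,] matches either a dollar sign or a comma.
s_stripped = s_prices.str.replace(r'[$,]', '', regex=True)
s_numeric_prices = pd.to_numeric(s_stripped, errors='coerce')

print(s_numeric_prices.tolist())  # [1200.5, 50.0, nan, 100.0]
print(s_numeric_prices.sum())     # 1350.5
```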

**Example 7: Using `downcast='integer'`**
This is a memory-saving trick.

```python
s_floats = pd.Series(['1.0', '2.0', '3.0'])
print("\n--- 7. Series of float-strings ---")
print(s_floats)

# Example 8: Standard conversion
s_float_std = pd.to_numeric(s_floats)
print(f"\n--- 8. Standard conversion (Dtype: {s_float_std.dtype}) ---")
print(s_float_std)

# Example 9: Downcast to integer
s_downcast = pd.to_numeric(s_floats, downcast='integer')
print(f"\n--- 9. Downcast to integer (Dtype: {s_downcast.dtype}) ---")
print(s_downcast)
```

**Output:**

```
--- 7. Series of float-strings ---
0    1.0
1    2.0
2    3.0
dtype: object

--- 8. Standard conversion (Dtype: float64) ---
0    1.0
1    2.0
2    3.0
dtype: float64

--- 9. Downcast to integer (Dtype: int8) ---
0    1
1    2
2    3
dtype: int8
```

**Explanation:**
The standard `to_numeric` saw the `.` and kept them as `float64`. By adding `downcast='integer'`, we told Pandas: "If possible, after converting, please try to store these as the smallest integer type that fits." It was possible, and because `1`, `2`, and `3` each fit in a single byte, the result is `int8` rather than `int64`.
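
The chosen type depends on the values themselves. A small sketch comparing a few inputs and `downcast` options:

```python
import pandas as pd

s_small = pd.Series(['1', '2', '3'])
s_larger = pd.Series(['1', '2', '300'])   # 300 does not fit in int8
s_decimal = pd.Series(['1.5', '2.5'])

small_int = pd.to_numeric(s_small, downcast='integer')
larger_int = pd.to_numeric(s_larger, downcast='integer')
larger_uint = pd.to_numeric(s_larger, downcast='unsigned')
small_float = pd.to_numeric(s_decimal, downcast='float')

print(small_int.dtype)    # int8
print(larger_int.dtype)   # int16
print(larger_uint.dtype)  # uint16
print(small_float.dtype)  # float32
```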

-----

### 3\. Advanced or Tricky Case

What happens if you *want* `int` but `coerce` gives you `NaN`?

**Example 10: The `coerce` + `fillna` + `astype` chain**
This is the full "pro" pattern to get a clean `int` column from dirty data.

```python
s_dirty = pd.Series(['1', '2', '3-Oops', '4', np.nan])
print("--- 10. Dirty Series with NaN ---")
print(s_dirty)

# Step 1: Force all non-numeric to NaN
s_coerced = pd.to_numeric(s_dirty, errors='coerce')
print("\n--- 11. Step 1: Coerced (float with NaN) ---")
print(s_coerced)

# Step 2: Fill the NaNs with a value (e.g., 0)
s_filled = s_coerced.fillna(0)
print("\n--- 12. Step 2: Filled (float with 0.0) ---")
print(s_filled)

# Step 3: Now that it's safe (no NaNs), convert to int
s_final_int = s_filled.astype(int)
print("\n--- 13. Step 3: Final (int) ---")
print(s_final_int)
```

**Output:**

```
--- 10. Dirty Series with NaN ---
0         1
1         2
2    3-Oops
3         4
4       NaN
dtype: object

--- 11. Step 1: Coerced (float with NaN) ---
0    1.0
1    2.0
2    NaN
3    4.0
4    NaN
dtype: float64

--- 12. Step 2: Filled (float with 0.0) ---
0    1.0
1    2.0
2    0.0
3    4.0
4    0.0
dtype: float64

--- 13. Step 3: Final (int) ---
0    1
1    2
2    0
3    4
4    0
dtype: int64
```

**Explanation:**
This 3-step chain is one of the most important patterns in data cleaning:

1.  `pd.to_numeric(s, errors='coerce')` turns all "bad" strings (`'3-Oops'`) into `NaN`.
2.  `.fillna(0)` replaces all `NaN`s (both the original `np.nan` and the new ones) with `0`.
3.  `.astype(int)` safely converts the all-float, no-`NaN` Series into a clean integer Series.
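
Filling with `0` invents a value for every bad entry, which may not be what you want. If the missing entries should stay missing, one alternative sketch is to skip `fillna` and convert to the nullable `'Int64'` type instead:

```python
import numpy as np
import pandas as pd

s_dirty = pd.Series(['1', '2', '3-Oops', '4', np.nan])

# Coerce bad strings to NaN, then store as nullable integers.
s_nullable = pd.to_numeric(s_dirty, errors='coerce').astype('Int64')

missing_count = int(s_nullable.isna().sum())
total = s_nullable.sum()

print(s_nullable.dtype)  # Int64
print(missing_count)     # 2
print(total)             # 7
```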

**Example 11: Applying to a DataFrame column**
You *don't* call `pd.to_numeric(df)`. You call it on the *column* and re-assign it.

```python
df = pd.DataFrame({'A': [1, 2], 'B': ['5', '6-Bad']})
print("\n--- 14. Original DataFrame ---")
df.info()

# Apply the function to the 'B' column
df['B'] = pd.to_numeric(df['B'], errors='coerce')

print("\n--- 15. After applying to 'B' ---")
df.info()
print(df)
```

**Output:**

```
--- 14. Original DataFrame ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   A       2 non-null      int64 
 1   B       2 non-null      object
dtypes: int64(1), object(1)
memory usage: 160.0+ bytes

--- 15. After applying to 'B' ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A       2 non-null      int64  
 1   B       1 non-null      float64
dtypes: float64(1), int64(1)
memory usage: 160.0 bytes
   A    B
0  1  5.0
1  2  NaN
```
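
When several columns need the same treatment, one common approach is to let `DataFrame.apply` call `pd.to_numeric` once per column. The sketch below adds a hypothetical third column `C` to the example data for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2],
    'B': ['5', '6-Bad'],
    'C': ['7.5', 'n/a'],   # hypothetical extra column for this sketch
})

numeric_cols = ['B', 'C']

# apply() calls pd.to_numeric once per column and forwards errors='coerce'.
converted = df[numeric_cols].apply(pd.to_numeric, errors='coerce')
df_clean = df.assign(**{col: converted[col] for col in numeric_cols})

dtype_names = df_clean.dtypes.astype(str).to_dict()
print(dtype_names)  # {'A': 'int64', 'B': 'float64', 'C': 'float64'}
```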

-----

### 4\. Real-World Use Case

**Example 12: Cleaning a "Percentage" column**
You load a column that looks like `95%`, `90%`, `80.5%`.

```python
s_percent = pd.Series(['95%', '90%', '80.5%', 'N/A'])
print("--- 16. Original Percentage Series ---")
print(s_percent)

# Example 13: Clean and convert
# Step 1: Remove the '%'
s_cleaned = s_percent.str.replace('%', '')
print("\n--- 17. After removing '%' ---")
print(s_cleaned)

# Step 2: Convert, coercing 'N/A'
s_numeric = pd.to_numeric(s_cleaned, errors='coerce')
print("\n--- 18. After to_numeric ---")
print(s_numeric)

# Example 14: Bonus step - convert to decimal
s_decimal = s_numeric / 100
print("\n--- 19. As decimal ---")
print(s_decimal)
```

**Output:**

```
--- 16. Original Percentage Series ---
0      95%
1      90%
2    80.5%
3      N/A
dtype: object

--- 17. After removing '%' ---
0      95
1      90
2    80.5
3     N/A
dtype: object

--- 18. After to_numeric ---
0    95.0
1    90.0
2    80.5
3     NaN
dtype: float64

--- 19. As decimal ---
0    0.950
1    0.900
2    0.805
3      NaN
dtype: float64
```

**Explanation:**
This is a perfect, common workflow. `to_numeric` can't handle `%`, so we remove it. Then `to_numeric` *can* handle `'N/A'` by coercing it to `NaN`.

-----

### 5\. Common Mistakes / Pitfalls

**Mistake 15: Using `errors='ignore'`**
This is the most dangerous and useless option.

```python
s_dirty = pd.Series(['1', '2', '3-Oops', '4'])
print("\n--- 20. Dirty Series ---")
print(s_dirty)

# Wrong code (using 'ignore')
s_ignored = pd.to_numeric(s_dirty, errors='ignore')
print("\n--- 21. After errors='ignore' ---")
print(s_ignored)
print(f"Dtype: {s_ignored.dtype}")
print(f"Sum (fails): {s_ignored.sum()}")
```

**Error/Wrong Output:**

```
--- 20. Dirty Series ---
0         1
1         2
2    3-Oops
3         4
dtype: object

--- 21. After errors='ignore' ---
0         1
1         2
2    3-Oops
3         4
dtype: object
Dtype: object
Sum (fails): 123-Oops4
```

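**Why it happens:** `errors='ignore'` means "If you find a bad value, *just stop* and return the original object." It didn't convert *anything*. The `dtype` is still `object`, so `.sum()` concatenates the strings instead of adding numbers. The option is also deprecated as of pandas 2.2. Use `errors='coerce'`.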
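
If you really do want "convert when possible, otherwise leave the data alone", it is clearer to say so explicitly. A minimal sketch with a hypothetical helper, `to_numeric_or_original`, which is not part of pandas:

```python
import pandas as pd

def to_numeric_or_original(values):
    """Hypothetical helper: convert if every value parses, else return the input."""
    try:
        return pd.to_numeric(values)
    except (ValueError, TypeError):
        return values

clean = to_numeric_or_original(pd.Series(['1', '2', '3']))
dirty = to_numeric_or_original(pd.Series(['1', '2', '3-Oops']))

print(clean.dtype)      # int64
print(dirty.tolist())   # ['1', '2', '3-Oops']
```

Because the fallback is written out, a reader can see that a failed conversion returns text, which `errors='ignore'` hides.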

**Mistake 16: Forgetting to re-assign**
Like `.astype()`, `pd.to_numeric()` returns a *new Series*.

```python
df = pd.DataFrame({'A': ['1', '2-Bad']})
print("\n--- 22. Original DF ---")
print(df.dtypes)

# Wrong code
pd.to_numeric(df['A'], errors='coerce') # This returns a new Series, but we don't save it

print("\n--- 23. After (Still object!) ---")
print(df.dtypes)

# Example 17: Corrected code
df['A'] = pd.to_numeric(df['A'], errors='coerce')
print("\n--- 24. Corrected ---")
print(df.dtypes)
```

**Output:**

```
--- 22. Original DF ---
A    object
dtype: object

--- 23. After (Still object!) ---
A    object
dtype: object

--- 24. Corrected ---
A    float64
dtype: object
```

----



# Type conversions third subtopic: `pd.to_datetime()`.




-----

`pd.to_datetime()` is your specialized tool for converting strings or numbers that *represent dates* into a true **datetime object**. This is one of the most important steps in data cleaning.

When you load data, dates are almost always read as simple text (`object`), like `'2025-11-17'`. You can't do date-based operations on text (e.g., "find all sales from last week," or "group by month"). `pd.to_datetime()` is the function that parses this text and converts it into a special `datetime64[ns]` object, which is a data type that Pandas understands as a date and time. This "unlocks" all of a DataFrame's powerful time-series abilities.

**How It Works in Memory**: `pd.to_datetime()` is a highly optimized function that parses strings. It creates a **new** `pd.Series` (a copy) in memory. The data type of this new Series will be `datetime64[ns]`, which means it stores each date as a 64-bit integer representing the number of nanoseconds since a specific point in time (1970-01-01, the "Unix epoch"). This is what makes date calculations (like finding the difference between two dates) extremely fast.
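
The integer representation is visible through `Timestamp.value`, and date arithmetic returns a `Timedelta`. A short sketch:

```python
import pandas as pd

one_second_after_epoch = pd.Timestamp('1970-01-01 00:00:01')
print(one_second_after_epoch.value)   # 1000000000 (nanoseconds since the epoch)

s = pd.to_datetime(pd.Series(['2025-01-01', '2025-01-03']))
gap = s.iloc[1] - s.iloc[0]

print(gap)                            # 2 days 00:00:00
print(gap == pd.Timedelta(days=2))    # True
```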

**When to Use This**:

  * **This is the preferred tool** for converting any date-like column (`object`) into a `datetime64` type.
  * Use this when your dates are in a standard format (`'2025-11-17'`) that Pandas can guess.
  * Pass the `format` parameter when your dates are in a non-ISO or ambiguous format (`'11/17/2025'`, `'17-Nov-2025'`). Pandas can often guess these, but an explicit format is faster and removes the guesswork.
  * Use this with `errors='coerce'` to handle unparseable dates (like `'hello'`) by turning them into `NaT` (Not a Time).

-----

### 0\. Syntax & Parameters

`pd.to_datetime()` is a top-level Pandas function.

```python
pandas.to_datetime(arg, errors='raise', format=None, infer_datetime_format=False, ...)
```

  * **`arg`** (argument)
      * **What it does:** The object you want to convert. This is typically a `pd.Series` (e.g., `df['col']`), a list, or even a full DataFrame (if you're assembling dates from columns).
      * **Default value:** (Required)
      * **When you would use it:** You *always* provide this. `pd.to_datetime(df['date_column'])`
  * **`errors`**
      * **What it does:** Tells Pandas what to do when it finds a date string it *cannot* parse.
      * **Default value:** `'raise'`
      * **When you would use it:**
          * `'raise'`: The default. It will stop your code and show a `ValueError` if it finds a bad date (like `'hello'`).
          * `'coerce'`: **This is the killer feature.** It will replace any bad/unparseable date (like `'hello'` or `'2025-02-30'`) with `NaT` (Not a Time), which is the `NaN` equivalent for datetimes.
          * `'ignore'`: This is the *least* useful. It will silently fail on bad values, returning the original `object` Series. It is deprecated as of pandas 2.2, so avoid it in new code.
  * **`format`**
      * **What it does:** This is the *second* most important parameter. It's a "format string" that tells Pandas *exactly* how your dates are structured. (e.g., `%m/%d/%Y` tells Pandas to look for "month/day/4-digit-year").
      * **Default value:** `None`
      * **When you would use it:** You *must* use this when your dates are in a non-standard format. If you don't, Pandas will try to guess, which can be slow or (worse) *wrong*.
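      * **What happens if you don't specify it:** Pandas infers the format. In pandas 2.0 and later, it guesses a format from the first non-missing value and applies it to the whole column; values that don't match raise an error (or become `NaT` with `errors='coerce'`). Inference usually works for standard formats but can guess wrong on ambiguous ones (is `01/02/2025` Jan 2nd or Feb 1st?).
  * **`infer_datetime_format`**
      * **What it does:** In pandas 1.x, if `True` (and `format` is `None`), Pandas would "learn" the format from the first non-`NaN` date string and reuse it for the rest of the column.
      * **Default value:** `False`
      * **When you would use it:** On pandas 1.x, setting it to `True` could give a large speed boost when your dates are in a *consistent*, standard format. It is deprecated since pandas 2.0, where a strict version of this behavior became the default and passing the argument has no effect.
      * **What happens if you don't specify it:** On pandas 1.x, each value is parsed individually, which is slower. On pandas 2.0 and later, nothing changes.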
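
To show `format` and `errors='coerce'` working together, here is a small sketch with one valid date, one impossible date, and one non-date string (the sample values are made up for illustration):

```python
import pandas as pd

s_raw = pd.Series(['11/17/2025', '02/30/2025', 'hello'])

# Month/day/4-digit-year; anything that does not fit becomes NaT.
parsed = pd.to_datetime(s_raw, format='%m/%d/%Y', errors='coerce')

missing_flags = parsed.isna().tolist()

print(missing_flags)                                 # [False, True, True]
print(parsed.iloc[0] == pd.Timestamp('2025-11-17'))  # True
```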

-----

### 1\. Basic Example

Let's convert a standard, clean list of date strings.

```python
import pandas as pd
import numpy as np

# Example 1: A Series of ISO-formatted strings (standard)
s_dates = pd.Series(['2025-01-01', '2025-01-02', '2025-01-03'])
print("--- 1. Original (text/object) ---")
print(s_dates)
print(f"Dtype: {s_dates.dtype}")

# Example 2: Convert using pd.to_datetime
s_datetime = pd.to_datetime(s_dates)
print("\n--- 2. After pd.to_datetime() ---")
print(s_datetime)
print(f"Dtype: {s_datetime.dtype}")

# Example 3: What we've "unlocked" (the .dt accessor)
print("\n--- 3. Unlocked .dt accessor ---")
print(f"Day names: {s_datetime.dt.day_name().tolist()}")
print(f"Months: {s_datetime.dt.month.tolist()}")
```

**Output:**

```
--- 1. Original (text/object) ---
0    2025-01-01
1    2025-01-02
2    2025-01-03
dtype: object
Dtype: object

--- 2. After pd.to_datetime() ---
0   2025-01-01
1   2025-01-02
2   2025-01-03
dtype: datetime64[ns]
Dtype: datetime64[ns]

--- 3. Unlocked .dt accessor ---
Day names: ['Wednesday', 'Thursday', 'Friday']
Months: [1, 1, 1]
```

**Explanation:**
Pandas automatically recognized the standard `'YYYY-MM-DD'` format and converted the `object` Series to a `datetime64[ns]` Series. This "unlocked" the `.dt` accessor, which lets us instantly pull out the day name, month, year, etc.
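
As a small follow-up sketch, the same Series can be queried for a few other `.dt` attributes. The variable names `years`, `weekdays`, and `labels` are just illustrative, and the `strftime` output assumes an English locale.

```python
import pandas as pd

s_datetime = pd.to_datetime(pd.Series(['2025-01-01', '2025-01-02', '2025-01-03']))

years = s_datetime.dt.year.tolist()                   # calendar year
weekdays = s_datetime.dt.dayofweek.tolist()           # Monday=0 ... Sunday=6
labels = s_datetime.dt.strftime('%d %b %Y').tolist()  # back to formatted text

print(years)     # [2025, 2025, 2025]
print(weekdays)  # [2, 3, 4]
print(labels)    # ['01 Jan 2025', '02 Jan 2025', '03 Jan 2025']
```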

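**Example 4: Handling mixed standard formats**
Pandas can guess many common formats, even when one Series mixes them. From Pandas 2.0 onward you have to opt in with `format='mixed'`, because the format is otherwise inferred from the first element and applied to all the others. In older versions, omit the `format` argument.

```python
s_mixed = pd.Series(['2025-01-01', '02/01/2025', 'Jan 3, 2025'])
print("\n--- 4. Mixed (but standard) formats ---")
# format='mixed' (Pandas 2.0+) infers the format of each element individually
print(pd.to_datetime(s_mixed, format='mixed'))
```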

**Output:**

```
--- 4. Mixed (but standard) formats ---
0   2025-01-01
1   2025-02-01
2   2025-01-03
dtype: datetime64[ns]
```

**Explanation:** Pandas parsed all three formats, reading the ambiguous `'02/01/2025'` month-first (the US convention) as February 1st. *Warning:* if that value was meant as January 2nd, the result is silently wrong. This is why an explicit `format` is safer.

-----

### 2\. Intermediate Example

Using the `format` parameter for non-standard, ambiguous dates.

**Example 5: The Ambiguity Problem**
Is `'05-06-2025'` May 6th or June 5th?

```python
s_ambiguous = pd.Series(['05-06-2025', '07-08-2025'])
print("--- 5. Ambiguous Dates ---")
print(pd.to_datetime(s_ambiguous))
```

**Output:**

```
--- 5. Ambiguous Dates ---
0   2025-05-06
1   2025-07-08
dtype: datetime64[ns]
```

**Explanation:** By default, Pandas reads ambiguous strings month-first, so it "guessed" `MM-DD-YYYY`. This is dangerous if your data is actually from Europe (`DD-MM-YYYY`), because nothing fails: you simply get the wrong dates.
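
One lighter-weight option, sketched here, is the `dayfirst=True` argument of `pd.to_datetime`, which asks the parser to prefer the day-first reading. The Pandas documentation describes it as a preference rather than a strict rule, so treat this as a convenience and not a guarantee.

```python
import pandas as pd

s_ambiguous = pd.Series(['05-06-2025', '07-08-2025'])

# dayfirst=True asks the parser to prefer DD-MM-YYYY when a string is ambiguous.
dt_european = pd.to_datetime(s_ambiguous, dayfirst=True)
print(dt_european)
```

Here `'05-06-2025'` is read as June 5th and `'07-08-2025'` as August 7th.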

**Example 6: Using `format` to be explicit (Day-First)**
This is the *safe* way. `format` strings tell Pandas exactly what to look for:

  * `%d` = day
  * `%m` = month
  * `%Y` = 4-digit year

```python
s_ambiguous = pd.Series(['05-06-2025', '07-08-2025'])
# Tell Pandas to read it as Day-Month-Year
dt_dayfirst = pd.to_datetime(s_ambiguous, format='%d-%m-%Y')
print("\n--- 6. With format='%d-%m-%Y' (Day-First) ---")
print(dt_dayfirst)
```

**Output:**

```
--- 6. With format='%d-%m-%Y' (Day-First) ---
0   2025-06-05
1   2025-08-07
dtype: datetime64[ns]
```

**Explanation:**
By providing the `format`, we forced Pandas to read `'05-06-2025'` as June 5th, not May 6th. No guessing is involved, so the result is correct as long as the data really is day-first.

**Example 7: Using `format` for other weird strings**

```python
s_weird = pd.Series(['Nov 17, 2025, 03:30 PM'])
# %b = Month abbrev, %d = day, %Y = year
# %I = 12-hr, %M = minute, %p = AM/PM
dt_weird = pd.to_datetime(s_weird, format='%b %d, %Y, %I:%M %p')
print("\n--- 7. Parsing a complex string ---")
print(dt_weird)
```

**Output:**

```
--- 7. Parsing a complex string ---
0   2025-11-17 15:30:00
dtype: datetime64[ns]
```
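
The same format codes also work in the other direction. As a minimal sketch (assuming an English locale for `%b` and `%p`), `.dt.strftime()` turns the parsed datetimes back into text, which is a quick way to check that a format string really describes your data.

```python
import pandas as pd

fmt = '%b %d, %Y, %I:%M %p'
s_weird = pd.Series(['Nov 17, 2025, 03:30 PM'])

parsed = pd.to_datetime(s_weird, format=fmt)
round_trip = parsed.dt.strftime(fmt)  # datetime -> text, same format codes

print(round_trip.iloc[0])             # Nov 17, 2025, 03:30 PM
print((round_trip == s_weird).all())  # True
```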

-----

### 3\. Advanced or Tricky Case

Using `errors` to handle bad data.

**Example 8: `errors='raise'` (The Default)**
This is what happens when a `format` is specified but the data doesn't match.

```python
s_dirty = pd.Series(['11-17-2025', '11-18-2025', 'NOT A DATE', '11-20-2025'])
print("\n--- 8. Dirty Date Series ---")
print(s_dirty)

try:
    # Use default errors='raise'
    pd.to_datetime(s_dirty, format='%m-%d-%Y')
except ValueError as e:
    print(f"\n--- 9. errors='raise' FAILS ---")
    print(e)
```

**Output** (the exact wording of the error message varies between Pandas versions):

```
--- 8. Dirty Date Series ---
0    11-17-2025
1    11-18-2025
2    NOT A DATE
3    11-20-2025
dtype: object

--- 9. errors='raise' FAILS ---
time data 'NOT A DATE' does not match format '%m-%d-%Y' (match)
```

**Example 10: `errors='coerce'` (The Fix)**
This is usually the most practical way to handle bad data.

```python
# Use errors='coerce' to turn bad dates into NaT
dt_coerced = pd.to_datetime(s_dirty, format='%m-%d-%Y', errors='coerce')
print("\n--- 10. errors='coerce' WORKS ---")
print(dt_coerced)
```

**Output:**

```
--- 10. errors='coerce' WORKS ---
0   2025-11-17
1   2025-11-18
2          NaT
3   2025-11-20
dtype: datetime64[ns]
```

**Explanation:** `errors='coerce'` converted the good dates, and when it reached `'NOT A DATE'` it stored `NaT` (Not a Time) instead of raising. The trade-off is that failures are silent, so it is worth checking how many `NaT` values you got back.
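
Because coercion is silent, it helps to audit what was lost. The sketch below compares the original strings with the parsed result to list the rows that failed. The name `failed_mask` and the extra `None` entry are illustrative additions, not part of the example above.

```python
import pandas as pd

s_dirty = pd.Series(['11-17-2025', '11-18-2025', 'NOT A DATE', None, '11-20-2025'])
dt_coerced = pd.to_datetime(s_dirty, format='%m-%d-%Y', errors='coerce')

# Rows that had a value before parsing but are NaT afterwards failed to parse.
# Rows that were already missing (None) are not counted as failures.
failed_mask = dt_coerced.isna() & s_dirty.notna()

print(f"Unparseable rows: {failed_mask.sum()}")
print(s_dirty[failed_mask])
```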

**Example 11: `errors='ignore'` (The Weird One)**
This is almost never what you want, and it has been deprecated since Pandas 2.2, where it emits a `FutureWarning`.

```python
# Use errors='ignore'
dt_ignored = pd.to_datetime(s_dirty, format='%m-%d-%Y', errors='ignore')
print("\n--- 11. errors='ignore' (returns object) ---")
print(dt_ignored)
print(f"Dtype: {dt_ignored.dtype}")
```

**Output:**

```
--- 11. errors='ignore' (returns object) ---
0    11-17-2025
1    11-18-2025
2    NOT A DATE
3    11-20-2025
dtype: object
Dtype: object
```

**Explanation:** `errors='ignore'` saw the bad value and just *gave up*, returning the *original object-type Series* unchanged. Even the valid dates were left as text, and nothing tells you the conversion failed.
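
If you do want the "return the input unchanged on failure" behavior, one way to get it explicitly is to catch the exception yourself. This is a minimal sketch; `parse_or_keep` is a hypothetical helper name, not a Pandas function.

```python
import pandas as pd

def parse_or_keep(series, fmt):
    """Hypothetical helper: return parsed datetimes, or the input unchanged on failure."""
    try:
        return pd.to_datetime(series, format=fmt)
    except ValueError:
        # Parsing failed for at least one value; hand back the original Series.
        return series

s_dirty = pd.Series(['11-17-2025', 'NOT A DATE'])
s_clean = pd.Series(['11-17-2025', '11-18-2025'])

print(parse_or_keep(s_dirty, '%m-%d-%Y').dtype)  # original text dtype, untouched
print(parse_or_keep(s_clean, '%m-%d-%Y').dtype)  # a datetime64 dtype
```

Unlike `errors='ignore'`, the fallback is visible in your own code, so you can log or count the failures there.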

-----

### 4\. Real-World Use Case

**Example 12: Assembling a date from multiple columns**
This is a *very* common scenario.

```python
df = pd.DataFrame({
    'year': [2025, 2026, 2025],
    'month': [1, 5, 12],
    'day': [1, 15, 31],
    'sales': [100, 200, 300]
})
print("--- 12. Original DataFrame with cols ---")
print(df)

# Example 13: Pass a DataFrame to pd.to_datetime
# It will find 'year', 'month', 'day'
df['date'] = pd.to_datetime(df[['year', 'month', 'day']])
print("\n--- 13. After assembling date ---")
print(df)
df.info()
```

**Output:**

```
--- 12. Original DataFrame with cols ---
   year  month  day  sales
0  2025      1    1    100
1  2026      5   15    200
2  2025     12   31    300

--- 13. After assembling date ---
   year  month  day  sales       date
0  2025      1    1    100 2025-01-01
1  2026      5   15    200 2026-05-15
2  2025     12   31    300 2025-12-31
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   year    3 non-null      int64         
 1   month   3 non-null      int64         
 2   day     3 non-null      int64         
 3   sales   3 non-null      int64         
 4   date    3 non-null      datetime64[ns]
dtypes: datetime64[ns](1), int64(4)
memory usage: 248.0 bytes
```

**Explanation:** `pd.to_datetime()` is smart. When you pass it a DataFrame whose columns have standard names (at minimum `year`, `month`, and `day`; optionally `hour`, `minute`, `second`, and finer units), it automatically assembles them into a single datetime column.
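
Real data rarely uses exactly those column names. As a sketch, assuming hypothetical source columns `yr`, `mo`, `dy`, and `hr`, you can rename them first and then assemble:

```python
import pandas as pd

df_raw = pd.DataFrame({
    'yr': [2025, 2026],
    'mo': [1, 5],
    'dy': [1, 15],
    'hr': [9, 18],
})

# Map the hypothetical source names onto the names pd.to_datetime recognizes.
parts = df_raw.rename(columns={'yr': 'year', 'mo': 'month', 'dy': 'day', 'hr': 'hour'})
df_raw['timestamp'] = pd.to_datetime(parts[['year', 'month', 'day', 'hour']])
print(df_raw)
```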

**Example 14: Converting Unix timestamps**

```python
s_epoch = pd.Series([1731846000, 1731932400])
print("\n--- 14. Unix Epoch (int) ---")
print(s_epoch)

# Tell it the unit is 'seconds'
dt_epoch = pd.to_datetime(s_epoch, unit='s')
print("\n--- 15. Converted from Unix ---")
print(dt_epoch)
```

**Output:**

```
--- 14. Unix Epoch (int) ---
0    1731846000
1    1731932400
dtype: int64

--- 15. Converted from Unix ---
0   2024-11-17 12:20:00
1   2024-11-18 12:20:00
dtype: datetime64[ns]
```

*(Note: the timestamp `1731846000` falls in November 2024, not 2025. Unix timestamps count seconds since 1970-01-01 00:00 UTC, so the times shown are in UTC.)*
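
Timestamps often arrive in milliseconds rather than seconds. The sketch below uses `unit='ms'` for the same two instants and adds `utc=True` to get a timezone-aware result. The millisecond values are simply the Example 14 values multiplied by 1000.

```python
import pandas as pd

# The same instants as in Example 14, expressed in milliseconds.
s_epoch_ms = pd.Series([1731846000000, 1731932400000])

dt_naive = pd.to_datetime(s_epoch_ms, unit='ms')          # timezone-naive
dt_utc = pd.to_datetime(s_epoch_ms, unit='ms', utc=True)  # timezone-aware (UTC)

print(dt_naive.iloc[0])  # 2024-11-17 12:20:00
print(dt_utc.iloc[0])    # 2024-11-17 12:20:00+00:00
```

Picking the wrong `unit` does not raise an error for values like these; it just produces dates that are off by orders of magnitude, so check a value or two by eye.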

-----

### 5\. Common Mistakes / Pitfalls

**Mistake 15: The `format` string is *wrong***
This is one of the most common errors. Your `format` string *must exactly match* your data, including the separators.

```python
s_data = pd.Series(['17/11/2025']) # Day/Month/Year
print("\n--- 16. Data (D/M/Y) ---")
print(s_data)

# Wrong code
try:
    # We told it Month/Day/Year
    pd.to_datetime(s_data, format='%m/%d/%Y')
except ValueError as e:
    print(f"\n--- 17. Error: Format mismatch ---")
    print(e)
```

**Error/Wrong Output:**

`ValueError: time data '17/11/2025' does not match format '%m/%d/%Y' (match)`

**Why it happens:** It was looking for a month (`%m`) in the first position, but `17` is not a valid month.

**Example 18: Corrected code**

```python
# Correct code
dt_correct = pd.to_datetime(s_data, format='%d/%m/%Y')
print("\n--- 18. Corrected format ---")
print(dt_correct)
```
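
When you are unsure which format a column uses, you can test a few candidates before committing. This is a minimal sketch; `first_matching_format` is a hypothetical helper, and the candidate list is an assumption you would adapt to your own data.

```python
import pandas as pd

def first_matching_format(series, candidates):
    """Hypothetical helper: return the first format that parses every non-null value."""
    non_null = series.dropna()
    for fmt in candidates:
        parsed = pd.to_datetime(non_null, format=fmt, errors='coerce')
        if parsed.notna().all():
            return fmt
    return None  # no candidate fits all values

s_data = pd.Series(['17/11/2025', '18/11/2025'])
print(first_matching_format(s_data, ['%m/%d/%Y', '%d/%m/%Y']))  # %d/%m/%Y
```

Note the limitation: if every day in the column is 12 or lower, both candidates parse without errors and the helper returns whichever comes first, so the order of `candidates` matters.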

**Mistake 19: Forgetting to re-assign**
`pd.to_datetime()` returns a *new Series*; it never modifies the original in place.

```python
df = pd.DataFrame({'date': ['2025-01-01']})
print("\n--- 19. Original (object) ---")
print(df.dtypes)

# Wrong code
pd.to_datetime(df['date'])  # Result is discarded; df is unchanged!

print("\n--- 20. After (still object!) ---")
print(df.dtypes)

# Example 20: Corrected code
df['date'] = pd.to_datetime(df['date'])
print("\n--- 21. Corrected ---")
print(df.dtypes)
```

**Output:**

```
--- 19. Original (object) ---
date    object
dtype: object

--- 20. After (still object!) ---
date    object
dtype: object

--- 21. Corrected ---
date    datetime64[ns]
dtype: object
```
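
The same rule applies if you prefer method chaining. As a sketch, `DataFrame.assign()` also returns a new DataFrame, so its result has to be bound to a name as well. The `sales` column is only there for illustration.

```python
import pandas as pd

df = pd.DataFrame({'date': ['2025-01-01'], 'sales': [100]})

# assign() returns a new DataFrame, so the result still has to be bound to a name.
df = df.assign(date=lambda d: pd.to_datetime(d['date']))
print(df.dtypes)
```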