### **Identifying Data Types in Pandas**

Before cleaning or transforming data, it's essential to **check the data types** of each column in your DataFrame. Pandas provides several ways to do this.

---
##### ✅ **Key Takeaway**
* `dtypes` → Quick glance at column data types
* `info()` → Summary including non-null counts and memory usage
* `select_dtypes()` → Filter columns by type for focused operations

➡️ **1. Using `dtypes`**

The simplest way to view the **data types of all columns**: **`print(df.dtypes)`**
> This displays each column name along with its corresponding data type (`int64`, `float64`, `object`, `bool`, `datetime64[ns]`, etc.).
---

➡️ **2. Using `info()`**

For a **more detailed summary**, including **data types**, **non-null counts**, and **memory usage**: **`df.info()`**

**This method gives a quick overview of:**
* Number of non-null entries per column
* Column data types
* Total memory usage
---
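The `+` in a summary line such as `memory usage: 255.0+ bytes` means pandas did not measure the contents of `object` columns, so the figure is a lower bound. A minimal sketch, assuming a small made-up DataFrame (the column names are hypothetical), that requests a deep measurement and captures the text through the `buf` parameter of `info()`:

```python
import io

import pandas as pd

# Hypothetical sample data, only for illustration
df = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['a', 'b', 'c'],   # string column
})

# info() prints to stdout by default; passing buf captures the text instead
buffer = io.StringIO()
df.info(buf=buffer, memory_usage='deep')  # 'deep' also measures the string contents
summary = buffer.getvalue()
print(summary)
```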

➡️ **3. Using `select_dtypes()`**

🔹 You can **filter columns** based on their data type(s):
```python
# Select all numeric columns (any integer or float dtype)
numeric_columns = df.select_dtypes(include='number').columns

# Select specific data types
specific_columns = df.select_dtypes(include=['int64', 'object']).columns
```
🔹 You can also include **multiple types** at once:
```python
df.select_dtypes(include=['int64', 'float64', 'bool'])
```
🔹 You can combine `include` and `exclude`, as long as the two lists do not overlap:
```python
# All numeric columns except float64
df.select_dtypes(include=['number'], exclude=['float64'])
```
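To illustrate what "focused operations" can look like, here is a small sketch that applies a calculation to the numeric columns only. The DataFrame and its column names are made up for this example.

```python
import pandas as pd

# Hypothetical sample data, only for illustration
df = pd.DataFrame({
    'price': [10.0, 12.5, 9.0],
    'qty': [1, 3, 2],
    'label': ['a', 'b', 'c'],
    'in_stock': [True, False, True],
})

# 'number' matches every numeric dtype (int32, int64, float64, ...) but not bool
numeric_df = df.select_dtypes(include='number')
print(numeric_df.columns.tolist())   # ['price', 'qty']

# Apply an operation to the numeric columns only
print(numeric_df.mean())

# Everything that is not numeric
other_df = df.select_dtypes(exclude='number')
print(other_df.columns.tolist())     # ['label', 'in_stock']
```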

In [5]:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [1.0, 2.0, 3.0],
    'C': ['a', 'b', 'c'],
    'D': [True, False, True],
    'E': pd.date_range('20230101', periods=3),
    'F': [1, 2, None]
})

# Output datatypes of all columns
print(f"Datatypes of all columns:\n{df.dtypes}\n")

# Comprehensive overview
df.info()

# Numeric columns
numeric_columns = df.select_dtypes(include=[np.number]).columns
print(f"\nNumeric columns:{numeric_columns}\n")

# String (object) columns
string_columns = df.select_dtypes(include=['object']).columns
print(f"String columns:{string_columns}\n")

# Boolean columns
bool_columns = df.select_dtypes(include=['bool']).columns
print(f"Boolean columns:{bool_columns}\n")

# Datetime columns
date_columns = df.select_dtypes(include=['datetime']).columns
print(f"Datetime columns:{date_columns}\n")

Datatypes of all columns:
A             int64
B           float64
C            object
D              bool
E    datetime64[ns]
F           float64
dtype: object

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   A       3 non-null      int64         
 1   B       3 non-null      float64       
 2   C       3 non-null      object        
 3   D       3 non-null      bool          
 4   E       3 non-null      datetime64[ns]
 5   F       2 non-null      float64       
dtypes: bool(1), datetime64[ns](1), float64(2), int64(1), object(1)
memory usage: 255.0+ bytes

Numeric columns:Index(['A', 'B', 'F'], dtype='object')

String columns:Index(['C'], dtype='object')

Boolean columns:Index(['D'], dtype='object')

Datetime columns:Index(['E'], dtype='object')



#### ➡️ **Identifying Data Types (Continued)**

In addition to checking data types for the entire DataFrame, you can inspect **individual columns** or **specific entries** to understand the data more precisely.

---

➡️ **1. Check Data Type of a Single Column**
```python
print(df['A'].dtype)
```
This returns the data type (`int64`, `float64`, `object`, etc.) of the column **A**.
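
Rather than comparing the dtype against strings, a column's dtype can also be tested with the helper functions in `pandas.api.types`. A minimal sketch, assuming a small sample DataFrame similar to the one used earlier (the `kinds` dictionary is just an illustrative name):

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype

df = pd.DataFrame({
    'A': [1, 2, 3],
    'C': ['a', 'b', 'c'],
    'E': pd.date_range('20230101', periods=3),
})

# Classify each column by inspecting its dtype
kinds = {}
for name in df.columns:
    col = df[name]
    if is_numeric_dtype(col):
        kinds[name] = 'numeric'
    elif is_datetime64_any_dtype(col):
        kinds[name] = 'datetime'
    else:
        kinds[name] = 'other'
    print(f"{name}: dtype={col.dtype}, kind={kinds[name]}")
```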

---

➡️ **2. Check Data Type of a Specific Entry**
```python
print(type(df['A'].iloc[0]))  # .iloc selects by position, regardless of index labels
```
This displays the type of a single value in the column (e.g., `<class 'numpy.int64'>` for a numeric column or `<class 'str'>` for a string column).

---

➡️ **3. Columns with Mixed Data Types**

If a column contains **mixed types** (like integers, floats, and strings), pandas automatically stores it as **`object`**:
```python
df['H'] = [1, 'two', 3.0]
print(df['H'].dtype)
```
`Output:` **object**
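
Because `object` hides what is actually stored, it can help to look at the Python type of each value. The following is one possible way to detect a mixed column, sketched on a standalone Series with the same values as column **H**:

```python
import pandas as pd

mixed = pd.Series([1, 'two', 3.0], name='H')
print(mixed.dtype)  # object

# Map each value to its Python type and count the occurrences
type_counts = mixed.map(type).value_counts()
print(type_counts)

# More than one distinct type means the column is mixed
is_mixed = mixed.map(type).nunique() > 1
print(is_mixed)  # True
```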

🔹 **Problem 1: You are given a DataFrame.**
* Create a **dictionary with column names as keys** and their **data types as values**.

In [21]:
import pandas as pd

# Create a sample DataFrame with a variety of column types
df = pd.DataFrame({
    'A': [1, 2, 3],                            # integers -> typically int64
    'B': [1.0, 2.0, 3.0],                      # floats -> float64
    'C': ['a', 'b', 'c'],                      # strings -> object dtype
    'D': [True, False, True],                  # booleans -> bool
    'E': pd.date_range('20230101', periods=3), # datetimes -> datetime64[ns]
    'F': [1, 2, None]                          # mixed/nullable -> upcast to float64 (NaN for None)
})

# Build a dictionary mapping each column name to its dtype (as a string).
data_types = {column: str(df[column].dtype) for column in df.columns} # Converting dtype to str makes it easy to log.
print(data_types)

{'A': 'int64', 'B': 'float64', 'C': 'object', 'D': 'bool', 'E': 'datetime64[ns]', 'F': 'float64'}
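
As an alternative to the dictionary comprehension, `df.dtypes` is itself a Series indexed by column name, so it can be converted directly. A short sketch on a reduced sample DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [1.0, 2.0, 3.0],
    'D': [True, False, True],
})

# Convert each dtype to its string name, then turn the Series into a dict
data_types = df.dtypes.astype(str).to_dict()
print(data_types)  # {'A': 'int64', 'B': 'float64', 'D': 'bool'}
```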


### **Converting Numeric Data Types**

When working with datasets, numeric values are sometimes stored as **strings** (e.g., `"25"`, `"100.5"`).
To perform calculations, they must be **converted to numeric data types**.

---
✅ **Key Takeaway**

Use `pd.to_numeric()` when you expect **inconsistent or messy data**, since it can safely handle errors and maintain data integrity.

---

➡️ **1. Using `astype()`**

The simplest method to **convert data to numeric form** is: **`df['A'] = df['A'].astype(float)`**
> This directly converts the column **A** to the requested numeric type (e.g., `int` or `float`). It raises a `ValueError` if any value cannot be converted.
---
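The `astype()` approach described above works well on clean data but offers no way to skip bad values. A minimal sketch of that behavior, using two made-up Series (`clean` and `dirty` are illustrative names):

```python
import pandas as pd

clean = pd.Series(['1', '2', '3'])
dirty = pd.Series(['1', '2', 'three'])

# Works: every string is a valid number
print(clean.astype(float).tolist())  # [1.0, 2.0, 3.0]

# Fails: 'three' cannot be converted, so the whole conversion is aborted
try:
    dirty.astype(float)
    failed = False
except ValueError as exc:
    failed = True
    print(f"astype failed: {exc}")
```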

➡️ **2. Using `pd.to_numeric()`**

`pd.to_numeric()` offers **more control** and is the preferred method for handling invalid data.

🔹 You can specify how to deal with errors using the **`errors`** parameter:
* `'raise'` → raises an error if invalid data is found (default)
* `'coerce'` → converts invalid entries to `NaN`
* `'ignore'` → returns the input unchanged if any value is invalid (deprecated since pandas 2.2)

In [None]:
df = pd.DataFrame({
    'A': ['1', '2', '3'],
    'B': [1.1, 2.2, 3.3],
    'C': [1, 2, 3]
})
print(df.dtypes)  # before conversion: A holds strings (object)

df[['A', 'B', 'C']] = df[['A', 'B', 'C']].apply(pd.to_numeric, errors='coerce')
print(df.dtypes)  # after conversion: A is int64

A     object
B    float64
C      int64
dtype: object
A      int64
B    float64
C      int64
dtype: object


In [31]:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': ['1', '2', '3'],
    'B': [1.1, 2.2, 3.3],
    'C': [1, 2, 3]
})
print("Datatype before change:\n", df.dtypes, '\n')

df['A'] = df['A'].astype(int)
df['B'] = df['B'].astype(float)
df['C'] = df['C'].astype(np.int32)
print("Datatype after change:\n", df.dtypes, '\n')

df1 = pd.DataFrame({
    'D': ['1', '2', 'three', '4'],
    'E': ['1.1', '2.2', 'NaN', '4.4']
})

df1['D'] = pd.to_numeric(df1['D'], errors='coerce')
df1['E'] = pd.to_numeric(df1['E'], errors='coerce')

# Note that 'three' has been converted to NaN
print(f"Converts Invalid Entry in Column D into NaN:\n{df1}")

Datatype before change:
 A     object
B    float64
C      int64
dtype: object 

Datatype after change:
 A      int64
B    float64
C      int32
dtype: object 

Converts Invalid Entry in Column D into NaN:
     D    E
0  1.0  1.1
1  2.0  2.2
2  NaN  NaN
3  4.0  4.4
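
Since `errors='coerce'` replaces bad values silently, it is often worth checking which entries were lost. One possible check, sketched on a made-up Series, compares the missing values before and after conversion:

```python
import pandas as pd

raw = pd.Series(['1', '2', 'three', None, '4'], name='D')

converted = pd.to_numeric(raw, errors='coerce')

# A value was invalid if it was present before but is NaN afterwards
failed_mask = raw.notna() & converted.isna()
print(raw[failed_mask].tolist())   # ['three']
print(int(failed_mask.sum()))      # 1
```

The original `None` is not counted as a failure because it was already missing before the conversion.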


### **Categorical Data in Pandas**

Categorical data in pandas is extremely useful when working with **non-numeric data** that has a **limited number of possible values** (e.g., `"Male"`, `"Female"`, `"Other"` or `"High"`, `"Medium"`, `"Low"`).

---
➡️ **Why Use Categorical Data?**
1. **Memory Efficiency**
   * Each distinct string value is stored only once in memory.
   * Instead of storing full strings repeatedly, pandas stores **integer codes** internally.
2. **Faster Computation**: Operations like sorting, comparisons, and grouping are often faster since integers are compared instead of strings.
3. **Meaningful Representation**: Categorical data provides a structured way to represent **non-numeric labels** with a fixed set of categories.
---
✅ **Key Takeaway**

Use **categorical data** to save memory and improve performance when working with columns that have **repeated string values or a limited set of categories**.

---

➡️ **Internal Representation**

Categorical data in pandas is represented using two main components:
1. **Categories Array**
   * A unique array containing **all possible category values**.
   * Example: `['Low', 'Medium', 'High']`

2. **Codes Array**
   * An **integer array** where each number refers to the index of a category in the categories array.
   * Example: `[0, 2, 1, 0]` corresponds to the categories above.

In [33]:
import pandas as pd

df = pd.DataFrame({
    'Priority': ['High', 'Low', 'Medium', 'High', 'Low']
})
df['Priority'] = df['Priority'].astype('category')

print(df['Priority'])
print('\n', df['Priority'].cat.codes)

0      High
1       Low
2    Medium
3      High
4       Low
Name: Priority, dtype: category
Categories (3, object): ['High', 'Low', 'Medium']

 0    0
1    1
2    2
3    0
4    1
dtype: int8
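
The categories in the cell above come out in alphabetical order. To reproduce the `['Low', 'Medium', 'High']` ordering from the description, the categories can be passed explicitly. A minimal sketch using `pd.Categorical` directly:

```python
import pandas as pd

# Explicit category order, so 'Low' -> 0, 'Medium' -> 1, 'High' -> 2
cat = pd.Categorical(
    ['Low', 'High', 'Medium', 'Low'],
    categories=['Low', 'Medium', 'High'],
    ordered=True,
)

print(list(cat.categories))  # ['Low', 'Medium', 'High']
print(cat.codes.tolist())    # [0, 2, 1, 0]

# Each code is an index into the categories array
decoded = [cat.categories[code] for code in cat.codes]
print(decoded)               # ['Low', 'High', 'Medium', 'Low']

# Sorting follows the category order rather than alphabetical order
sorted_values = pd.Series(cat).sort_values().tolist()
print(sorted_values)         # ['Low', 'Low', 'Medium', 'High']
```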


➡️ **Let's consider a column with 1 million rows, containing only the values 'Apple', 'Banana', and 'Cherry'.**
1. **Without Categorical:**
   * Each row holds a reference to a Python string object.
   * Raw characters alone: about (5 + 6 + 6) / 3 ≈ 5.7 bytes per row, or ≈ 5.7 MB (assuming 1 byte per character and equally frequent values).
   * Each Python string also carries roughly 40 to 50 bytes of object overhead, and each row needs an 8-byte pointer, so the real cost is closer to 55 bytes per row ≈ 55 MB.

2. **With Categorical:**
   * Categories Array: ['Apple', 'Banana', 'Cherry'] ≈ 17 bytes of characters, plus a small fixed overhead
   * Codes Array: 1,000,000 integers (1 byte each, stored as `int8`) ≈ 1 MB
   * Total: ≈ 1 MB

In [34]:
import pandas as pd
import numpy as np

# Generating the data
n = 1_000_000
data = np.random.choice(['Apple', 'Banana', 'Cherry'], size=n)

# Without Categorical
df = pd.DataFrame({'Fruit': data})
print("Memory usage without Categorical:")
print(df.memory_usage(deep=True))

# With Categorical
df['Fruit'] = df['Fruit'].astype('category')
print("\nMemory usage with Categorical:")
print(df.memory_usage(deep=True))

Memory usage without Categorical:
Index         132
Fruit    54666783
dtype: int64

Memory usage with Categorical:
Index        132
Fruit    1000272
dtype: int64
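
The measured figures above can be roughly reproduced with back-of-the-envelope arithmetic. This sketch assumes a 64-bit Python build (8-byte pointers), equally frequent values, and that the string object is counted once per row. The exact numbers depend on the Python version, so no expected output is shown.

```python
import sys

n = 1_000_000
fruits = ['Apple', 'Banana', 'Cherry']

# Size of each Python string object (header + characters); varies by Python version
sizes = {fruit: sys.getsizeof(fruit) for fruit in fruits}
print(sizes)

# object column: one 8-byte pointer per row plus the string object it points to
avg_string_size = sum(sizes.values()) / len(sizes)
object_estimate = n * (8 + avg_string_size)

# category column: one int8 code per row plus the three category strings
category_estimate = n * 1 + sum(sizes.values())

print(f"object   ~ {object_estimate / 1e6:.1f} MB")
print(f"category ~ {category_estimate / 1e6:.1f} MB")
```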


### **Date and Time Conversion in Pandas**

Working with **dates and times** is a key part of data analysis, especially for time series data such as sales, weather, or stock prices.
Pandas provides powerful tools to handle, manipulate, and analyze datetime information.

---

### ✅ **Key Takeaways**

* Use **`pd.to_datetime()`** to convert strings to datetime objects.
* Use **`errors='coerce'`** to safely handle invalid entries.
* Use **`tz_localize()`** and **`tz_convert()`** for timezone-aware operations.
* Always ensure datetime columns are properly typed before performing date-based analysis or resampling.


➡️ **Primary Function: `pd.to_datetime()`**

The main function for converting data to datetime in pandas is **`pd.to_datetime()`**.
It automatically detects common date formats and converts strings into proper datetime objects.

In [46]:
import pandas as pd

# Converting a series of strings to datetime
df = pd.DataFrame({'date': ['2023-07-15', '2023-07-16', '2023-07-17']})
print(f"Original DataFrame before Datetime Conversion:\n{df}")
print(df.dtypes, '\n')

df['date'] = pd.to_datetime(df['date'])
print(f"Modified DataFrame after Datetime conversion:\n{df}")
print(df.dtypes)

Original DataFrame before Datetime Conversion:
         date
0  2023-07-15
1  2023-07-16
2  2023-07-17
date    object
dtype: object 

Modified DataFrame after Datetime conversion:
        date
0 2023-07-15
1 2023-07-16
2 2023-07-17
date    datetime64[ns]
dtype: object


➡️ **Specifying Date Formats**

While pandas can infer most formats, specifying the format helps in **ambiguous cases** or **custom date patterns**.

In [66]:
# American format (month/day/year)
date = pd.to_datetime('7/15/2023', format='%m/%d/%Y')
print(date)

# Custom format with month abbreviation
date = pd.to_datetime('15-Jul-2023', format='%d-%b-%Y')
print(date)

2023-07-15 00:00:00
2023-07-15 00:00:00
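
The "ambiguous cases" mentioned above are dates where day and month could be swapped. A short sketch with a made-up date string shows how the explicit `format` decides the interpretation:

```python
import pandas as pd

ambiguous = '03/04/2023'  # March 4 or April 3?

# Month first (American style)
as_month_first = pd.to_datetime(ambiguous, format='%m/%d/%Y')
# Day first (common in Europe)
as_day_first = pd.to_datetime(ambiguous, format='%d/%m/%Y')

print(as_month_first.date())  # 2023-03-04
print(as_day_first.date())    # 2023-04-03
```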


➡️ **Handling Invalid Dates**

You can handle errors gracefully using the `errors` parameter:
* **`errors='raise'`** → Raises an error on invalid data (default).
* **`errors='coerce'`** → Converts invalid data to `NaT` (Not a Time).
* **`errors='ignore'`** → Returns the input unchanged if parsing fails (deprecated since pandas 2.2).

In [48]:
dates = ['2023-07-15', 'Invalid Date', '2023-07-17']
pd.to_datetime(dates, errors='coerce')  # 'Invalid Date' becomes NaT

DatetimeIndex(['2023-07-15', 'NaT', '2023-07-17'], dtype='datetime64[ns]', freq=None)
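
After coercing, it can be useful to see which inputs turned into `NaT`, and to compare that with the default `errors='raise'` behavior. A minimal sketch on the same three values, held in a Series so the original strings stay available:

```python
import pandas as pd

raw = pd.Series(['2023-07-15', 'Invalid Date', '2023-07-17'])
parsed = pd.to_datetime(raw, errors='coerce')

# Rows that held a value before parsing but are NaT afterwards
bad_rows = raw[raw.notna() & parsed.isna()]
print(bad_rows)

# errors='raise' (the default) stops with an exception instead
try:
    pd.to_datetime(raw)
    raised = False
except ValueError as exc:
    raised = True
    print(f"Parsing failed: {exc}")
```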

➡️ **Handling Time Zones**

You can **assign** and **convert** time zones using `tz_localize()` and `tz_convert()`.

In [57]:
date = pd.to_datetime('2023-07-15 12:00:00')
print(f"Original Date Format: {date}\n")

date_tz = date.tz_localize('UTC')
print(f"Datetime after using [.tz_localize('UTC')]: {date_tz}\n")

date_tz_convert = date.tz_localize('UTC').tz_convert('US/Eastern')
print(f"Datetime after using [.tz_localize('UTC').tz_convert('US/Eastern')]: {date_tz_convert}")

Original Date Format: 2023-07-15 12:00:00

Datetime after using [.tz_localize('UTC')]: 2023-07-15 12:00:00+00:00

Datetime after using [.tz_localize('UTC').tz_convert('US/Eastern')]: 2023-07-15 08:00:00-04:00
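
The cell above works on a single `Timestamp`. For a whole column, the same two methods are available through the `.dt` accessor. A sketch with two made-up timestamps, assumed to have been recorded in UTC:

```python
import pandas as pd

# Hypothetical timestamps, recorded in UTC but stored without a time zone
events = pd.Series(pd.to_datetime(['2023-01-15 12:00:00', '2023-07-15 12:00:00']))

# tz_localize attaches a time zone; tz_convert translates to another one
events_utc = events.dt.tz_localize('UTC')
events_eastern = events_utc.dt.tz_convert('US/Eastern')

print(events_eastern)
# January falls in standard time (UTC-5), July in daylight time (UTC-4)
print(events_eastern.dt.hour.tolist())  # [7, 8]
```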


➡️ **`Problem:` You are given a DataFrame. Process it with pandas by performing the following operations.**

🔹 **Output the updated DataFrame after each operation.**
* Convert the **'order_date' column to datetime format**. Handle any **invalid dates** in the 'order_date' column by replacing them with **NaT (Not a Time)**.
* Create a **new column 'formatted_date'** that displays the date in the **format: "DD-Mon-YYYY"** (e.g., "15-Mar-2023").

In [75]:
import pandas as pd

sample_data = {
    'order_id': [1, 2, 3, 4, 5],
    'order_date': ['2023-03-15', '2023-04-01', 'invalid_date', '2023-03-20', '2023-04-05'],
    'total_amount': [100.50, 200.75, 150.25, 300.00, 75.80]
}
df = pd.DataFrame(sample_data)
print(f"Original DataFrame w/o Conversion:\n{df}\n")

# Convert the 'order_date' column to datetime format. Handle invalid dates in column by replacing them with NaT.
df['order_date'] = pd.to_datetime(df['order_date'], errors='coerce')
print(f"After converting 'Order Date' into Datetime & Handling Invalid Dates:\n{df}\n")

# Create a new column 'formatted_date' that displays the date in the format "DD-Mon-YYYY".
# 'order_date' is already datetime64, so the .dt accessor can be used directly.
df['formatted_date'] = df['order_date'].dt.strftime('%d-%b-%Y')
print(df)

Original DataFrame w/o Conversion:
   order_id    order_date  total_amount
0         1    2023-03-15        100.50
1         2    2023-04-01        200.75
2         3  invalid_date        150.25
3         4    2023-03-20        300.00
4         5    2023-04-05         75.80

After converting 'Order Date' into Datetime & Handling Invalid Dates:
   order_id order_date  total_amount
0         1 2023-03-15        100.50
1         2 2023-04-01        200.75
2         3        NaT        150.25
3         4 2023-03-20        300.00
4         5 2023-04-05         75.80

   order_id order_date  total_amount formatted_date
0         1 2023-03-15        100.50    15-Mar-2023
1         2 2023-04-01        200.75    01-Apr-2023
2         3        NaT        150.25            NaN
3         4 2023-03-20        300.00    20-Mar-2023
4         5 2023-04-05         75.80    05-Apr-2023
