# 💾 Part 7: Pure Pandas Data Type Optimization (Memory Efficiency)

**Goal:** To significantly reduce the memory footprint of a DataFrame by optimizing its column data types, using only **100% native Pandas methods** (no direct NumPy imports or other external libraries needed).

---
### Key Learning Objectives
1.  Analyze memory usage with `.memory_usage(deep=True)`.
2.  Master `pd.to_numeric()` for converting messy strings into numbers.
3.  Optimize object/string columns using `.astype('category')`.
4.  Optimize integer/float columns using smaller types (`uint8`, `float32`).
5.  Convert strings to `datetime` objects and extract features using the `.dt` accessor.
   

## 1. Environment Setup and Baseline Memory Usage

In [1]:
import pandas as pd

# Set pandas display options
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 2)

print("=== PURE PANDAS DATA TYPE OPTIMIZATION ===")
print("\n🔧 Environment setup complete! (100% Pandas)")

# Load the Titanic dataset (continuing from Monday)
titanic_url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
titanic_df = pd.read_csv(titanic_url)

# Apply Monday's missing value fixes using pandas-only methods (Group-based Age Imputation)
age_by_group = titanic_df.groupby(['Pclass', 'Sex'])['Age'].transform('median')
titanic_df['Age'] = titanic_df['Age'].fillna(age_by_group)

# Apply Monday's missing value fixes using pandas-only methods (Mode Imputation)
embarked_mode = titanic_df['Embarked'].mode()[0]
titanic_df['Embarked'] = titanic_df['Embarked'].fillna(embarked_mode)

print("✅ Dataset loaded with missing value fixes applied")

# Analyze current data types using pure pandas
print("\n📋 Current Data Types:")
print(titanic_df.dtypes)

print(f"\n💾 Current Memory Usage: {titanic_df.memory_usage(deep=True).sum() / 1024:.2f} KB")

=== PURE PANDAS DATA TYPE OPTIMIZATION ===

🔧 Environment setup complete! (100% Pandas)
✅ Dataset loaded with missing value fixes applied

📋 Current Data Types:
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

💾 Current Memory Usage: 285.64 KB
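
The single total above hides where the memory goes. As a minimal sketch, assuming a hypothetical toy frame rather than the Titanic data, the per-column breakdown below compares shallow and deep measurements. The `object` dtype is forced so the example behaves the same across pandas versions.

```python
import pandas as pd

# Hypothetical toy frame (not the Titanic data): one numeric and one string column
df = pd.DataFrame({
    "count": [1, 2, 3, 4],
    "city": pd.Series(["Paris", "London", "Paris", "London"], dtype="object"),
})

# Shallow measurement: object columns are counted as 8-byte pointers only
shallow = df.memory_usage(deep=False)
# Deep measurement: also counts the Python string objects behind each pointer
deep = df.memory_usage(deep=True)

# Side-by-side report, one row per column (plus the index)
report = pd.DataFrame({"shallow_bytes": shallow, "deep_bytes": deep})
print(report)
```

Numeric columns report the same size either way. The gap on the string column is why `deep=True` matters when hunting for savings.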


## 2. Handling Messy Numeric Data (`pd.to_numeric()`)

Real-world data often stores numbers as strings due to commas, currency symbols, or non-numeric values (like 'unknown').

* **Cleanup:** Use Pandas string accessor (`.str`) methods (like `.str.replace()`) to clean the data.
* **Conversion:** Use `pd.to_numeric(series, errors='coerce')` to convert the string column to a proper numeric type, replacing any remaining invalid text with `NaN`.

In [2]:
# Create a messy dataset to demonstrate pandas conversion techniques
messy_data = pd.DataFrame({
    'customer_id': ['001', '002', '003', '004', '005'],
    'age': ['25', '30', 'unknown', '35', '28'],
    'salary': ['50000', '60,000', '70000', 'confidential', '55000'],    # Has commas
    'department': ['IT', 'HR', 'IT', 'Finance', 'HR'],
})

print("🔧 Created messy dataset for practice:")
print(messy_data)
print("\n📋 Original data types:")
print(messy_data.dtypes)

# Demonstrate different error handling approaches using pandas-only
print("\n1. Converting Age Column:")
# Method 1: Coerce errors to NaN (pandas-only)
messy_data['age_coerce'] = pd.to_numeric(messy_data['age'], errors='coerce')
print("With errors='coerce':", messy_data['age_coerce'].tolist())

print("\n2. Converting Salary Column (with preprocessing):")
# Clean salary data first using pandas string methods
messy_data['salary_clean'] = messy_data['salary'].str.replace(',', '')
# Convert to numeric using pandas
messy_data['salary_numeric'] = pd.to_numeric(messy_data['salary_clean'], errors='coerce')
print("Final numeric salaries:", messy_data['salary_numeric'].tolist())

🔧 Created messy dataset for practice:
  customer_id      age        salary department
0         001       25         50000         IT
1         002       30        60,000         HR
2         003  unknown         70000         IT
3         004       35  confidential    Finance
4         005       28         55000         HR

📋 Original data types:
customer_id    object
age            object
salary         object
department     object
dtype: object

1. Converting Age Column:
With errors='coerce': [25.0, 30.0, nan, 35.0, 28.0]

2. Converting Salary Column (with preprocessing):
Final numeric salaries: [50000.0, 60000.0, 70000.0, nan, 55000.0]
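
Because `errors='coerce'` turns bad values into `NaN` silently, it can be worth listing what was lost. The sketch below assumes a hypothetical `raw` Series modelled on the salary column, with an extra `$` and a genuinely missing entry added for illustration.

```python
import pandas as pd

# Hypothetical raw column, modelled on the salary example
raw = pd.Series(["50000", "60,000", "$70000", "confidential", None])

# Strip commas and dollar signs; inside a character class, $ is a literal
cleaned = raw.str.replace(r"[,$]", "", regex=True)
numeric = pd.to_numeric(cleaned, errors="coerce")

# Values present in the raw data that became NaN after coercion
failed = raw[numeric.isna() & raw.notna()]
print(failed.tolist())  # ['confidential']
```

Checking `failed` before moving on separates values that were missing all along from values the cleanup step failed to handle.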


## 3. Advanced Numeric Type Optimization

Smaller data types require less memory. We aim to downcast our numeric columns (`int64`, `float64`) to the smallest type that can safely hold the data range without losing information.

* **Integer Downcasting:** `int64` → `uint8`, `int16`, etc., using `.astype()`.
* **Float Downcasting:** `float64` → `float32`. This requires checking for **precision loss** before committing.
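
Before hard-coding a target type, it helps to look at the actual value range. The sketch below uses hypothetical toy Series, not the Titanic columns, and shows `pd.to_numeric(..., downcast=...)`, a pandas alternative to `.astype()` that this notebook does not otherwise use. It picks the smallest fitting type automatically.

```python
import pandas as pd

# Hypothetical small columns standing in for SibSp-like counts and fares
counts = pd.Series([0, 1, 8, 3], dtype="int64")
fares = pd.Series([7.25, 71.2833, 8.05], dtype="float64")

# Inspect the value range before choosing a target type
print(counts.min(), counts.max())

# Let pandas pick the smallest unsigned integer type that holds every value
counts_small = pd.to_numeric(counts, downcast="unsigned")
# For floats, downcast="float" goes no lower than float32
fares_small = pd.to_numeric(fares, downcast="float")

print(counts_small.dtype, fares_small.dtype)
```

The explicit `.astype()` calls in the next cell give full control over the result, while `downcast` is convenient when the ranges are not known in advance.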

In [3]:
# Work with Titanic data for realistic optimization
print("🎯 Optimizing Titanic Dataset Numeric Types")
titanic_optimized = titanic_df.copy()

# Analyze and Apply Integer Optimizations
print("\n1. Applying Integer Optimizations:")
# PassengerId can be uint16 (range 1-891)
titanic_optimized['PassengerId'] = titanic_optimized['PassengerId'].astype('uint16')
# SibSp can be uint8 (range 0-8)
titanic_optimized['SibSp'] = titanic_optimized['SibSp'].astype('uint8')
# Parch can be uint8 (range 0-6)
titanic_optimized['Parch'] = titanic_optimized['Parch'].astype('uint8')
print("✅ Integer optimizations applied!")

# Analyze and Apply Float Optimizations
print("\n2. Applying Float Optimizations:")
float_cols = ['Age', 'Fare']
for col in float_cols:
    original_memory = titanic_optimized[col].memory_usage(deep=True)
    test_float32 = titanic_optimized[col].astype('float32')

    # Check for precision loss using pandas comparison:
    # flag the column if any float64 value differs from its float32 copy by more than 0.01 (absolute)
    diff_series = (titanic_optimized[col] - test_float32).abs()
    precision_loss = (diff_series > 0.01).any()

    if not precision_loss:
        titanic_optimized[col] = test_float32
        new_memory = titanic_optimized[col].memory_usage(deep=True)
        print(f"✅ {col}: Converted to float32. Memory: {original_memory} → {new_memory} bytes")
    else:
        print(f"❌ {col}: Kept as float64 (precision loss detected).")

🎯 Optimizing Titanic Dataset Numeric Types

1. Applying Integer Optimizations:
✅ Integer optimizations applied!

2. Applying Float Optimizations:
✅ Age: Converted to float32. Memory: 7260 → 3696 bytes
✅ Fare: Converted to float32. Memory: 7260 → 3696 bytes


## 4. Categorical and Boolean Optimization

For string-heavy data this is often the source of the biggest memory savings: in the run below, `Sex` drops from 47,983 to 1,239 bytes.

* **Categorical:** Use `.astype('category')` for string columns with a **low number of unique values** (e.g., `Sex`, `Embarked`). Pandas stores the unique strings once and assigns efficient integer codes to each row.
* **Boolean:** Convert `int` columns representing binary data (0/1) to the `bool` type using `.astype('bool')`.
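
Whether `category` pays off depends on how often values repeat. The sketch below compares a low-cardinality column with an all-unique one. The frame and its column names are hypothetical, and the `object` dtype is forced for consistency across pandas versions.

```python
import pandas as pd

# Hypothetical frame: one low-cardinality and one all-unique string column
df = pd.DataFrame({
    "sex": pd.Series(["male", "female"] * 500, dtype="object"),
    "ticket": pd.Series([f"T{i:05d}" for i in range(1000)], dtype="object"),
})

results = {}
for col in df.columns:
    # Share of distinct values: near 0 means many repeats, 1.0 means all unique
    unique_ratio = df[col].nunique() / len(df)
    before = df[col].memory_usage(deep=True)
    after = df[col].astype("category").memory_usage(deep=True)
    results[col] = (unique_ratio, before, after)
    print(f"{col}: unique ratio {unique_ratio:.3f}, {before} -> {after} bytes")
```

A low unique ratio is a reasonable signal that the conversion is worthwhile. For a column where nearly every value is distinct, such as `Name` or `Ticket`, the saving shrinks and the column may even grow, because every string still has to be stored once alongside the codes.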

In [4]:
# Apply categorical conversions using pandas-only
print("\n1. Applying Categorical Conversions:")
categorical_cols = ['Pclass', 'Sex', 'Embarked']

for col in categorical_cols:
    original_memory = titanic_optimized[col].memory_usage(deep=True)
    titanic_optimized[col] = titanic_optimized[col].astype('category')
    new_memory = titanic_optimized[col].memory_usage(deep=True)
    print(f"✅ {col}: Optimized memory savings: {original_memory} → {new_memory} bytes")

print("\n2. Understanding Categorical Data Structure:")
sex_categorical = titanic_optimized['Sex']
print(f"Sex categories: {sex_categorical.cat.categories.tolist()}")
print(f"Sex codes (first 5): {sex_categorical.cat.codes[:5].tolist()}")


# Convert Survived column to boolean using pandas-only
print("\n3. Boolean Conversion:")
original_memory = titanic_optimized['Survived'].memory_usage(deep=True)
titanic_optimized['Survived'] = titanic_optimized['Survived'].astype('bool')
new_memory = titanic_optimized['Survived'].memory_usage(deep=True)

print(f"✅ Survived converted to boolean. Memory: {original_memory} → {new_memory} bytes")


1. Applying Categorical Conversions:
✅ Pclass: Optimized memory savings: 7260 → 1155 bytes
✅ Sex: Optimized memory savings: 47983 → 1239 bytes
✅ Embarked: Optimized memory savings: 44682 → 1281 bytes

2. Understanding Categorical Data Structure:
Sex categories: ['female', 'male']
Sex codes (first 5): [1, 0, 0, 0, 1]

3. Boolean Conversion:
✅ Survived converted to boolean. Memory: 7260 → 1023 bytes


## 5. DateTime Conversion and Feature Extraction

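Dates stored as strings are just text. Converting them to the `datetime64` type stores each value in a fixed 8 bytes and unlocks date arithmetic and the `.dt` accessor.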

* **Conversion:** Use `pd.to_datetime(series, errors='coerce')` to convert strings to the `datetime` type, turning unparseable values into `NaT`.
* **Extraction:** Use the `.dt` accessor (e.g., `.dt.year`, `.dt.day_name()`) to extract valuable time-based features from the `datetime` column.

In [5]:
# Create messy date data using pandas-only methods
messy_data_dates = pd.DataFrame({
    'join_date': ['2023-01-15', '2023-02-20', 'invalid', '2023-03-10', '2023-04-05']
})

print("1. DateTime Conversion:")
print("Original join_date values:", messy_data_dates['join_date'].tolist())

# Convert with error handling using pandas
messy_data_dates['join_date_dt'] = pd.to_datetime(messy_data_dates['join_date'], errors='coerce')
print("Converted to datetime:", messy_data_dates['join_date_dt'].tolist())


# Extract features from datetime using pandas .dt accessor
print("\n2. DateTime Feature Extraction:")
# Extract various date components using pandas-only
messy_data_dates['join_year'] = messy_data_dates['join_date_dt'].dt.year
messy_data_dates['join_weekday_name'] = messy_data_dates['join_date_dt'].dt.day_name()
# Caution: NaT rows compare as False here instead of staying missing
messy_data_dates['join_is_weekend'] = messy_data_dates['join_date_dt'].dt.weekday >= 5

print("📋 Extracted datetime features:")
datetime_features = ['join_date_dt', 'join_year', 'join_weekday_name', 'join_is_weekend']
print(messy_data_dates[datetime_features])

print("✅ DateTime feature extraction complete using pandas-only!")

1. DateTime Conversion:
Original join_date values: ['2023-01-15', '2023-02-20', 'invalid', '2023-03-10', '2023-04-05']
Converted to datetime: [Timestamp('2023-01-15 00:00:00'), Timestamp('2023-02-20 00:00:00'), NaT, Timestamp('2023-03-10 00:00:00'), Timestamp('2023-04-05 00:00:00')]

2. DateTime Feature Extraction:
📋 Extracted datetime features:
  join_date_dt  join_year join_weekday_name  join_is_weekend
0   2023-01-15     2023.0            Sunday             True
1   2023-02-20     2023.0            Monday            False
2          NaT        NaN               NaN            False
3   2023-03-10     2023.0            Friday            False
4   2023-04-05     2023.0         Wednesday            False
✅ DateTime feature extraction complete using pandas-only!
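
In the output above, the invalid date turns `join_year` into a float column and reports `join_is_weekend` as `False`. The sketch below is one possible way to keep missing values explicit. It assumes the nullable `Int64` and `boolean` dtypes are acceptable downstream, and it uses a hypothetical three-row Series.

```python
import pandas as pd

# Hypothetical dates, one of them unparseable
dates = pd.to_datetime(
    pd.Series(["2023-01-15", "invalid", "2023-03-10"]), errors="coerce"
)

# Nullable Int64 keeps the year as an integer even with a missing row
year = dates.dt.year.astype("Int64")

# Comparing a missing weekday with 5 yields False, so mask those rows back to <NA>
is_weekend = (dates.dt.weekday >= 5).astype("boolean").mask(dates.isna())

print(year.tolist())
print(is_weekend.tolist())
```

This keeps "not a weekend" and "date unknown" distinguishable, which matters if the flag later feeds a filter or a model.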


## 6. Final Titanic Optimization Summary

In [6]:
# Start fresh for final optimization tracking
titanic_final = titanic_df.copy()

print("\n📊 6. Final Titanic Optimization Summary")
print("=" * 40)
print(f"Memory usage BEFORE optimization: {titanic_final.memory_usage(deep=True).sum() / 1024:.2f} KB")

# Apply all optimized types at once for the final DataFrame
titanic_final['PassengerId'] = titanic_final['PassengerId'].astype('uint16')
titanic_final['SibSp'] = titanic_final['SibSp'].astype('uint8')
titanic_final['Parch'] = titanic_final['Parch'].astype('uint8')
titanic_final['Survived'] = titanic_final['Survived'].astype('bool')

# Apply categorical types
categorical_cols = ['Pclass', 'Sex', 'Embarked']
for col in categorical_cols:
    titanic_final[col] = titanic_final[col].astype('category')

# Apply optimized float types (assuming safe conversion from Cell 3)
titanic_final['Age'] = titanic_final['Age'].astype('float32')
titanic_final['Fare'] = titanic_final['Fare'].astype('float32')

final_memory = titanic_final.memory_usage(deep=True).sum() / 1024
original_memory = titanic_df.memory_usage(deep=True).sum() / 1024
total_savings = original_memory - final_memory
savings_percentage = (total_savings / original_memory * 100)

print(f"Memory usage AFTER optimization: {final_memory:.2f} KB")
print(f"Total savings: {total_savings:.2f} KB ({savings_percentage:.1f}%)")


print("\nOptimized data types:")
print(titanic_final.dtypes)


📊 6. Final Titanic Optimization Summary
========================================
Memory usage BEFORE optimization: 285.64 KB
Memory usage AFTER optimization: 161.20 KB
Total savings: 124.45 KB (43.6%)

Optimized data types:
PassengerId      uint16
Survived           bool
Pclass         category
Name             object
Sex            category
Age             float32
SibSp             uint8
Parch             uint8
Ticket           object
Fare            float32
Cabin            object
Embarked       category
dtype: object
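
The column-by-column assignments in the cell above can also be written as a single `.astype()` call with a mapping. The sketch below uses a hypothetical miniature frame with Titanic-like column names. The `dtype_map` name is illustrative, not part of the original notebook.

```python
import pandas as pd

# Hypothetical miniature frame with Titanic-like column names
df = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Sex": ["male", "female", "female"],
    "Fare": [7.25, 71.2833, 7.925],
})

# One mapping documents every target type and applies them in a single call
dtype_map = {
    "PassengerId": "uint16",
    "Survived": "bool",
    "Sex": "category",
    "Fare": "float32",
}
optimized = df.astype(dtype_map)
print(optimized.dtypes)
```

`.astype()` returns a new DataFrame, so `df` keeps its original types and the two frames can be compared with `.memory_usage(deep=True)`.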


In [8]:
print("\n" + "="*60)
print("📚 SUMMARY: Pure pandas Data Type Optimization")
print("="*60)

print("\n✅ SKILLS MASTERED TODAY (pandas-Only):")
print("1. Memory analysis with .memory_usage(deep=True)")
print("2. Numeric conversion with pd.to_numeric(errors='coerce')")
print("3. Categorical optimization with .astype('category')")
print("4. Boolean conversion using .astype('bool')")
print("5. DateTime conversion and feature extraction with .dt accessor")


print("\n🔥 POWER TECHNIQUE OF THE DAY:")
print("PURE PANDAS DATA TYPE OPTIMIZATION")
print(f"→ Achieved {savings_percentage:.1f}% memory reduction using pandas-only")
print("→ Stores categorical data efficiently with integer codes")
print("→ Ensures data integrity with precision testing (implicit in our approach)")

print(f"\n✓ Session 7 completed! Pure pandas optimization mastered - {savings_percentage:.1f}% memory saved!")
print("🐼 100% pandas-powered, zero external dependencies! 🐼")


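============================================================
📚 SUMMARY: Pure pandas Data Type Optimization
============================================================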

✅ SKILLS MASTERED TODAY (pandas-Only):
1. Memory analysis with .memory_usage(deep=True)
2. Numeric conversion with pd.to_numeric(errors='coerce')
3. Categorical optimization with .astype('category')
4. Boolean conversion using .astype('bool')
5. DateTime conversion and feature extraction with .dt accessor

🔥 POWER TECHNIQUE OF THE DAY:
PURE PANDAS DATA TYPE OPTIMIZATION
→ Achieved 43.6% memory reduction using pandas-only
→ Stores categorical data efficiently with integer codes
→ Ensures data integrity with precision testing (implicit in our approach)

✓ Session 7 completed! Pure pandas optimization mastered - 43.6% memory saved!
🐼 100% pandas-powered, zero external dependencies! 🐼
