# Column Operations in Pandas

### What Are Column Operations?

Column operations in Pandas involve creating new columns, modifying existing ones, renaming them for clarity, or removing unnecessary columns to streamline our dataset. These operations are core to **data cleaning**, **feature engineering**, and **dataset restructuring** — all of which are essential for any kind of data analysis or machine learning workflow.

With just a few lines of code, we can manipulate columns using vectorized operations, built-in Pandas methods, or custom logic. Whether we're building new features like "is_child" or renaming "Pclass" to "Passenger_Class", column operations let us transform our raw data into a meaningful format.

### Why Column Operations Are Important

Column operations are crucial for:

- **Feature Engineering**: Creating new features based on domain logic or statistical transformations.
- **Cleaning Data**: Removing or renaming columns that are unclear or redundant.
- **Readability**: Giving clear and meaningful names to columns helps teams and tools understand the dataset better.
- **Preprocessing for ML Models**: Selecting and engineering the right columns is vital for model accuracy.

In practice, nearly **every data science project** requires column operations at some point before the data is ready for modeling or analysis.

### Common Column Operations and Their Syntax

**1. Creating a New Column**

We can assign a Series, a list, a scalar, or the result of an expression to a new column name. Pandas creates the column if it does not already exist.

In [1]:
import pandas as pd
df = pd.read_csv("data/train.csv")

# Add a column 'FamilySize' = SibSp + Parch
df['FamilySize'] = df['SibSp'] + df['Parch']
print(df[['SibSp', 'Parch', 'FamilySize']].head())

   SibSp  Parch  FamilySize
0      1      0           1
1      1      0           1
2      0      0           0
3      1      0           1
4      0      0           0
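
As a complement to direct assignment, the sketch below shows `DataFrame.assign`, which returns a new DataFrame instead of modifying the original. The data is a made-up toy frame that only borrows the Titanic column names, and `IsAlone` is a hypothetical column name chosen for illustration.

```python
import pandas as pd

# Toy data using the Titanic column names; the values are made up for illustration.
toy = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})

# Direct assignment modifies `toy` in place.
toy["FamilySize"] = toy["SibSp"] + toy["Parch"]

# assign() returns a new DataFrame and leaves `toy` untouched,
# which makes it convenient for method chaining.
# 'IsAlone' is a hypothetical column name, not part of the dataset.
with_flag = toy.assign(IsAlone=(toy["FamilySize"] == 0).astype(int))

print(with_flag)
```

Direct assignment is the shorter form for notebooks; `assign` tends to fit better when several steps are chained and the original frame should stay unchanged.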


**2. Modifying an Existing Column**

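We can overwrite an existing column with a transformed version of itself. Here `np.log1p` computes log(1 + x), which compresses the long right tail of `Fare` and is safe for fares of 0. Because the column is overwritten, re-running the cell applies the transformation again.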

In [2]:
import numpy as np
df['Fare'] = np.log1p(df['Fare'])
print(df[['Fare']].head())

       Fare
0  2.110213
1  4.280593
2  2.188856
3  3.990834
4  2.202765
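
One way to avoid transforming a column twice by accident is to write the result to a new column and keep the raw values. The sketch below assumes a toy frame with made-up fares; `Fare_log` is a hypothetical column name.

```python
import numpy as np
import pandas as pd

# Made-up fares for illustration only.
toy = pd.DataFrame({"Fare": [0.0, 7.25, 71.2833]})

# Write the transformed values to a new column so the raw fares stay available
# and re-running the line cannot transform the data twice.
# 'Fare_log' is a hypothetical name chosen for this sketch.
toy["Fare_log"] = np.log1p(toy["Fare"])

# np.expm1 is the inverse of np.log1p, so the original scale can be recovered.
recovered = np.expm1(toy["Fare_log"])

print(toy)
print(np.allclose(recovered, toy["Fare"]))
```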


**3. Renaming Columns**

Use the `.rename()` method with the `columns` argument, which maps old names to new ones. Set `inplace=True` to modify `df` directly; otherwise `.rename()` returns a new DataFrame.

In [3]:
# Rename 'Pclass' to 'Passenger_Class'
df.rename(columns={'Pclass': 'Passenger_Class'}, inplace=True)
print(df.columns)

Index(['PassengerId', 'Survived', 'Passenger_Class', 'Name', 'Sex', 'Age',
       'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'FamilySize'],
      dtype='object')
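
By default, `.rename()` silently ignores keys that match no column, so a typo goes unnoticed. The sketch below, on a made-up toy frame, shows the non-inplace form and the `errors="raise"` option that turns a missing key into a `KeyError`.

```python
import pandas as pd

# Toy frame with two Titanic-style columns; the values are made up.
toy = pd.DataFrame({"Pclass": [3, 1], "Sex": ["male", "female"]})

# Without inplace=True, rename() returns a new DataFrame.
renamed = toy.rename(columns={"Pclass": "Passenger_Class"})

# By default a key that matches no column is silently ignored;
# errors="raise" turns that into a KeyError, which catches typos.
try:
    toy.rename(columns={"PClass": "Passenger_Class"}, errors="raise")
    typo_caught = False
except KeyError:
    typo_caught = True

print(list(renamed.columns), typo_caught)
```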


**4. Dropping Columns**

Use the `.drop()` method with `axis=1` to remove columns. Pass a single label or a list of labels.

In [4]:
# Drop the 'Cabin' column
df.drop('Cabin', axis=1, inplace=True)
print("Dropped 'Cabin' column. Current columns:", df.columns)

# Drop multiple columns
df.drop(['Ticket', 'Embarked'], axis=1, inplace=True)
print("Dropped 'Ticket' and 'Embarked' columns. Current columns:", df.columns)

Dropped 'Cabin' column. Current columns: Index(['PassengerId', 'Survived', 'Passenger_Class', 'Name', 'Sex', 'Age',
       'SibSp', 'Parch', 'Ticket', 'Fare', 'Embarked', 'FamilySize'],
      dtype='object')
Dropped 'Ticket' and 'Embarked' columns. Current columns: Index(['PassengerId', 'Survived', 'Passenger_Class', 'Name', 'Sex', 'Age',
       'SibSp', 'Parch', 'Fare', 'FamilySize'],
      dtype='object')
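
Dropping a column that is already gone raises a `KeyError`, which is easy to hit when a notebook cell is re-run. A minimal sketch on made-up data, using the `columns=` keyword (equivalent to `axis=1`) and `errors="ignore"`:

```python
import pandas as pd

# Toy frame; the values are made up for illustration.
toy = pd.DataFrame(
    {"Cabin": ["C85", None], "Ticket": ["A/5", "PC"], "Fare": [7.25, 71.28]}
)

# columns= is equivalent to passing the labels with axis=1.
no_cabin = toy.drop(columns="Cabin")

# errors="ignore" skips labels that are not present,
# so the line can be re-run without raising a KeyError.
slim = no_cabin.drop(columns=["Cabin", "Ticket"], errors="ignore")

print(list(slim.columns))
```

Reassigning the result, as done here, leaves the original frame intact, which helps when an earlier state needs to be inspected again.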


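### Examples Using Titanic Dataset

The cell below repeats the four operations in one place. It runs on the `df` already modified by the cells above, so `Fare` is log-transformed a second time (the printed values are `log1p(log1p(Fare))`) and the rename is a no-op because `Pclass` was already renamed. Reload the CSV first to run these steps on the raw data.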

In [None]:
# 1. Create a new column 'FamilySize'
df['FamilySize'] = df['SibSp'] + df['Parch']
print(df[['SibSp', 'Parch', 'FamilySize']].head())

# 2. Log-transform 'Fare' to reduce skew (applied a second time if 'Fare' was already transformed)
import numpy as np
df['Fare'] = np.log1p(df['Fare'])
print(df[['Fare']].head())

# 3. Rename 'Pclass' to 'Passenger_Class'
df.rename(columns={'Pclass': 'Passenger_Class'}, inplace=True)
print(df.columns)

# 4. Drop the 'SibSp' column
df.drop('SibSp', axis=1, inplace=True)

# 5. Drop 'Parch' and 'FamilySize'
df.drop(['Parch', 'FamilySize'], axis=1, inplace=True)

# Check final columns
print(df.columns)

   SibSp  Parch  FamilySize
0      1      0           1
1      1      0           1
2      0      0           0
3      1      0           1
4      0      0           0
       Fare
0  1.134691
1  1.664038
2  1.159662
3  1.607603
4  1.164014
Index(['PassengerId', 'Survived', 'Passenger_Class', 'Name', 'Sex', 'Age',
       'SibSp', 'Parch', 'Fare', 'FamilySize'],
      dtype='object')
Index(['PassengerId', 'Survived', 'Passenger_Class', 'Name', 'Sex', 'Age',
       'Fare'],
      dtype='object')


### Real-World Machine Learning Use Cases

- Creating a `HasFamily` flag: `df['HasFamily'] = (df['FamilySize'] > 0).astype(int)`
- Transforming skewed numeric columns (for example with `np.log1p`) or rescaling them for models that are sensitive to feature scale.
- Renaming columns to remove spaces or special characters before exporting to ML libraries.
- Dropping columns that are irrelevant or leak information.
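
The renaming and flag-creation items above can be sketched as follows. The messy column names and values are invented for illustration, and the regular expression is one possible cleaning rule, not a standard one.

```python
import pandas as pd

# Made-up frame with column names containing spaces and special characters.
toy = pd.DataFrame({"Family Size": [1, 0, 4], "Fare ($)": [7.25, 8.05, 31.0]})

# Replace every run of non-alphanumeric characters with "_"
# and trim leading or trailing underscores.
toy.columns = (
    toy.columns.str.replace(r"[^0-9a-zA-Z]+", "_", regex=True).str.strip("_")
)

# Vectorized flag: 1 if the passenger travels with family, else 0.
toy["HasFamily"] = (toy["Family_Size"] > 0).astype(int)

print(toy)
```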

### Best Practices

| Task | Best Practice |
| --- | --- |
| Creating Columns | Use vectorized operations (avoid loops) |
| Renaming | Use `.rename()` and keep names consistent across projects |
| Dropping | Either reassign (`df = df.drop(...)`) or use `inplace=True`, never both (`inplace=True` returns `None`) |
| Modifying | Prefer vectorized Pandas or NumPy operations; reserve `.apply()` for logic that cannot be vectorized |
| Permanent Changes | Always inspect `.columns` before saving or exporting |

### Exercises

Q1. Create a new column `is_child` where age < 12 → 1, else 0

In [6]:
df['is_child'] = df['Age'].apply(lambda age: 1 if age < 12 else 0)
print(df[['Name', 'Age', 'is_child']].head())

                                                Name   Age  is_child
0                            Braund, Mr. Owen Harris  22.0         0
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  38.0         0
2                             Heikkinen, Miss. Laina  26.0         0
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  35.0         0
4                           Allen, Mr. William Henry  35.0         0
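
The same flag can be built without `.apply()`. The sketch below uses a made-up `Age` column that includes a missing value, since the real column contains NaNs; `is_child_or_nan` is a hypothetical name for a variant that keeps missing ages missing.

```python
import numpy as np
import pandas as pd

# Made-up ages, including one missing value.
toy = pd.DataFrame({"Age": [4.0, 22.0, np.nan, 11.0]})

# The comparison is evaluated for the whole column at once. NaN < 12 is False,
# so a passenger with a missing age is labelled 0, the same as with the lambda.
toy["is_child"] = (toy["Age"] < 12).astype(int)

# Variant that keeps missing ages as NaN instead of labelling them 0.
# 'is_child_or_nan' is a hypothetical column name.
toy["is_child_or_nan"] = (toy["Age"] < 12).astype(float).where(toy["Age"].notna())

print(toy)
```

Whether a missing age should count as "not a child" or stay missing is a modeling decision; the first form matches the exercise as stated.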


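Q2. Create a column `Fare_Category` with values: 'Low' if Fare < 10, 'Mid' if 10 ≤ Fare < 50, else 'High'. Note that `Fare` holds log-transformed values at this point, so every value is below 10 and every row is labelled 'Low'; the thresholds are meant for raw fares.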

In [7]:
def categorize_fare(fare):
    if fare < 10:
        return 'Low'
    elif fare < 50:
        return 'Mid'
    else:
        return 'High'

df['Fare_Category'] = df['Fare'].apply(categorize_fare)
print(df[['Fare', 'Fare_Category']].head())

       Fare Fare_Category
0  1.134691           Low
1  1.664038           Low
2  1.159662           Low
3  1.607603           Low
4  1.164014           Low
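
For comparison, the same binning can be expressed with `pd.cut` on untransformed fares. This is a sketch on made-up values chosen to sit on and around the thresholds, not the output of the notebook above.

```python
import pandas as pd

# Made-up raw fares placed on and around the category boundaries.
toy = pd.DataFrame({"Fare": [7.25, 10.0, 49.99, 50.0, 512.33]})

# right=False makes each bin closed on the left: [0, 10), [10, 50), [50, inf).
toy["Fare_Category"] = pd.cut(
    toy["Fare"],
    bins=[0, 10, 50, float("inf")],
    labels=["Low", "Mid", "High"],
    right=False,
)

print(toy)
```

`pd.cut` returns an ordered categorical column, so 'Low' < 'Mid' < 'High' is preserved for sorting and grouping, which plain strings from `.apply()` do not provide.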


Q3. Rename columns `Sex` to `Gender` and `Age` to `Passenger_Age`

In [8]:
df.rename(columns={'Sex': 'Gender', 'Age': 'Passenger_Age'}, inplace=True)
print(df[['Gender', 'Passenger_Age']].head())

   Gender  Passenger_Age
0    male           22.0
1  female           38.0
2  female           26.0
3  female           35.0
4    male           35.0


Q4. Drop columns `PassengerId` and `Name` from the dataset

In [9]:
df.drop(['PassengerId', 'Name'], axis=1, inplace=True)
print(df.head())

   Survived  Passenger_Class  Gender  Passenger_Age      Fare  is_child  \
0         0                3    male           22.0  1.134691         0   
1         1                1  female           38.0  1.664038         0   
2         1                3  female           26.0  1.159662         0   
3         1                1  female           35.0  1.607603         0   
4         0                3    male           35.0  1.164014         0   

  Fare_Category  
0           Low  
1           Low  
2           Low  
3           Low  
4           Low  


Q5. Add a new column `Fare_Squared` which is Fare squared

In [10]:
df['Fare_Squared'] = df['Fare'] ** 2
print(df[['Fare', 'Fare_Squared']].head())

       Fare  Fare_Squared
0  1.134691      1.287524
1  1.664038      2.769024
2  1.159662      1.344817
3  1.607603      2.584388
4  1.164014      1.354930


### Summary

Column operations are the backbone of data transformation in Pandas. They allow us to reshape our dataset by adding, updating, renaming, or removing columns. In real-world projects, column operations are often the **first and most frequent actions** taken before any serious modeling or visual analysis begins.

From creating derived features like `FamilySize` or `Fare_Category`, to cleaning up messy column names or dropping irrelevant information, these steps make our dataset **more structured, readable, and model-ready**. Dropping unused columns also reduces memory usage, and vectorized operations keep processing fast on large datasets.

These operations are also integral to **feature engineering**, which is one of the most impactful steps in the ML pipeline. A good model often starts with well-engineered columns. When done properly, these operations simplify our workflow and help eliminate data quality issues early in the process.

Mastering these operations ensures we can move swiftly from raw CSVs to a clean, analysis-ready dataset. Understanding how to write clean, maintainable code for column operations is a sign of a strong data professional or ML engineer.