# **Day 5 – Pandas: Data Manipulation II**

Day 5 of the QuantLake Internship

## **Objective**

Today’s focus was to level up data wrangling skills using Pandas:
- Reshape data with `.pivot()`, `.pivot_table()`, and `.melt()`
- Apply custom functions with `.apply()`, `.map()`, and `.replace()`
- Combine datasets using `pd.concat()`
- Build a mini data transformation pipeline using real-world data

## **Section 1: Reshaping DataFrames**

In this section, we explored ways to reshape and reorganize datasets using `.pivot()`, `.pivot_table()`, and `.melt()`.

These methods convert data between long and wide formats, which makes it easier to summarize or compare values across multiple dimensions.

In [1]:
import pandas as pd
import numpy as np

In [5]:
# Load the Superstore dataset (df is overwritten by the small sample below; the file is reloaded in Section 5)
df = pd.read_csv('Sample - Superstore.csv', encoding='cp1252')

In [7]:
# Sample dataset
data = {
    'Region': ['East', 'East', 'West', 'West'],
    'Category': ['Furniture', 'Technology', 'Furniture', 'Technology'],
    'Sales': [1500, 2500, 1800, 2200]
}

df = pd.DataFrame(data)

In [8]:
# Pivot
pivot_df = df.pivot(index='Region', columns='Category', values='Sales')
print("Pivot Table:")
print(pivot_df)

Pivot Table:
Category  Furniture  Technology
Region                         
East           1500        2500
West           1800        2200


In [9]:
# Pivot Table with aggregation
pivot_avg = df.pivot_table(values='Sales', index='Region', columns='Category', aggfunc='mean')
print("\nPivot Table with Aggregation:")
print(pivot_avg)


Pivot Table with Aggregation:
Category  Furniture  Technology
Region                         
East         1500.0      2500.0
West         1800.0      2200.0
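
On the sample above, `.pivot()` and `.pivot_table()` give the same numbers because every Region/Category pair occurs once. The sketch below uses a small hypothetical frame (`dup_df`, not part of the original notebook) in which one pair occurs twice, to show where the two methods diverge: `.pivot()` raises a `ValueError`, while `.pivot_table()` aggregates the duplicates.

```python
import pandas as pd

# Hypothetical data: the (East, Furniture) pair appears twice.
dup_df = pd.DataFrame({
    'Region': ['East', 'East', 'West'],
    'Category': ['Furniture', 'Furniture', 'Technology'],
    'Sales': [1000, 2000, 2200],
})

# .pivot() cannot decide which of the two East/Furniture values to keep.
try:
    dup_df.pivot(index='Region', columns='Category', values='Sales')
    pivot_failed = False
except ValueError:
    pivot_failed = True

# .pivot_table() averages the duplicates; combinations with no rows become NaN.
dup_avg = dup_df.pivot_table(values='Sales', index='Region',
                             columns='Category', aggfunc='mean')

print("pivot() failed on duplicates:", pivot_failed)
print(dup_avg)
```

This is why `.pivot_table()` is usually the safer choice on raw transactional data, where repeated index/column pairs are common.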


In [10]:
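# Melting (unpivot): Region is kept as an identifier and Sales becomes variable/value pairs;
# Category is dropped because it is listed in neither id_vars nor value_vars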
melted = pd.melt(df, id_vars=['Region'], value_vars=['Sales'])
print("\nMelted Data:")
print(melted)


Melted Data:
  Region variable  value
0   East    Sales   1500
1   East    Sales   2500
2   West    Sales   1800
3   West    Sales   2200
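
The wide-to-long direction is easier to see when `.melt()` is applied to a table that is actually wide. A minimal sketch, assuming the same four-row sample as above (the names `wide` and `long_again` are illustrative): it pivots the data to wide format and then melts it back, recovering the original Region/Category/Sales rows.

```python
import pandas as pd

df = pd.DataFrame({
    'Region': ['East', 'East', 'West', 'West'],
    'Category': ['Furniture', 'Technology', 'Furniture', 'Technology'],
    'Sales': [1500, 2500, 1800, 2200],
})

# Long -> wide: one row per Region, one column per Category.
wide = df.pivot(index='Region', columns='Category', values='Sales')

# Wide -> long: reset_index() turns Region back into a regular column
# so it can serve as the identifier; the Category columns are unpivoted.
long_again = wide.reset_index().melt(
    id_vars='Region', var_name='Category', value_name='Sales'
)

print(long_again)
```

The row order after the round trip may differ from the original, so sort on Region and Category before comparing the two frames.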


## **Section 2: Applying Custom Functions**

Here, we used `.apply()` with lambda functions to run element-wise operations on individual columns.

We created derived columns, such as a profit classification and a discount flag based on business rules, to enrich the dataset with fields that are easier to filter and report on.

In [11]:
# Add a Profit column
df['Profit'] = [300, 500, 200, 700]

In [12]:
# Apply profit classification: High (> 500), Medium (> 250), Low otherwise
df['Profit_Level'] = df['Profit'].apply(lambda x: 'High' if x > 500 else 'Medium' if x > 250 else 'Low')
print("Profit Level Classification:")
print(df[['Profit', 'Profit_Level']])

Profit Level Classification:
   Profit Profit_Level
0     300       Medium
1     500       Medium
2     200          Low
3     700         High
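
As an alternative to the nested lambda, the same thresholds can be expressed as bins. This is a sketch using `pd.cut()`, which was not part of the day's exercise; the bin edges are chosen to reproduce the lambda's rules, with 500 falling into "Medium" because bins are right-inclusive by default.

```python
import numpy as np
import pandas as pd

profit = pd.Series([300, 500, 200, 700], name='Profit')

# Right-inclusive bins: (-inf, 250] -> Low, (250, 500] -> Medium, (500, inf) -> High
profit_level = pd.cut(
    profit,
    bins=[-np.inf, 250, 500, np.inf],
    labels=['Low', 'Medium', 'High'],
)

print(profit_level.tolist())  # ['Medium', 'Medium', 'Low', 'High']
```

The result is a categorical column, so the labels keep their Low < Medium < High order when sorted.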


In [13]:
# Flag high discounts
df['Discount'] = [0.1, 0.5, 0.3, 0.92]
df['High_Discount'] = df['Discount'].apply(lambda x: 'Yes' if x > 0.9 else 'No')
print("\nHigh Discount Flag:")
print(df[['Discount', 'High_Discount']])


High Discount Flag:
   Discount High_Discount
0      0.10            No
1      0.50            No
2      0.30            No
3      0.92           Yes
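
For a simple two-way flag like this one, a vectorized comparison avoids calling a Python function per row. A minimal sketch with `np.where()`, assuming the same discount values and the same 0.9 threshold as above:

```python
import numpy as np
import pandas as pd

discount = pd.Series([0.1, 0.5, 0.3, 0.92], name='Discount')

# np.where(condition, value_if_true, value_if_false) evaluates the whole column at once.
high_discount = np.where(discount > 0.9, 'Yes', 'No')

print(list(high_discount))  # ['No', 'No', 'No', 'Yes']
```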


## **Section 3: Mapping & Replacing Values**

This section focused on cleaning and transforming categorical data using `.map()` and `.replace()`.

Typical use cases included mapping full country names to codes or standardizing segment labels, which improves consistency and readability in reports.



In [14]:
# Mapping example
df['Country'] = ['India', 'USA', 'India', 'Canada']
country_map = {'India': 'IN', 'USA': 'US', 'Canada': 'CA'}
df['Country_Code'] = df['Country'].map(country_map)

In [15]:
# Replacing example
df['Segment'] = ['Consumer', 'Home Office', 'Corporate', 'Consumer']
df['Segment'] = df['Segment'].replace({'Consumer': 'Retail'})

In [16]:
print("Mapped & Replaced Columns:")
print(df[['Country', 'Country_Code', 'Segment']])

Mapped & Replaced Columns:
  Country Country_Code      Segment
0   India           IN       Retail
1     USA           US  Home Office
2   India           IN    Corporate
3  Canada           CA       Retail
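
One difference between the two methods matters when the dictionary does not cover every value: `.map()` returns `NaN` for keys it cannot find, while `.replace()` leaves unmatched values unchanged. The sketch below illustrates this with a hypothetical series in which 'Brazil' is deliberately missing from the mapping.

```python
import pandas as pd

countries = pd.Series(['India', 'USA', 'Brazil'])  # 'Brazil' has no entry in the mapping
country_map = {'India': 'IN', 'USA': 'US'}

mapped = countries.map(country_map)        # unmatched values become NaN
replaced = countries.replace(country_map)  # unmatched values are kept as-is

print(mapped.tolist())
print(replaced.tolist())
```

In practice, `.map()` is handy for spotting incomplete lookup tables (count the `NaN`s afterwards), while `.replace()` suits partial relabeling such as the Consumer to Retail change above.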


## **Section 4: Combining DataFrames**

We practiced combining multiple DataFrames with `pd.concat()`, both vertically (stacking rows) and horizontally (placing columns side by side).

This is useful for stacking data from different sources, such as appending monthly reports, and for attaching extra columns to existing records, such as additional user profile fields.

In [17]:
df1 = pd.DataFrame({
    'ID': [1, 2],
    'Name': ['Alice', 'Bob']
})

df2 = pd.DataFrame({
    'ID': [3, 4],
    'Name': ['Charlie', 'Diana']
})

In [18]:
# Vertical concat
combined_vert = pd.concat([df1, df2], ignore_index=True)
print("Vertically Combined:")
print(combined_vert)

Vertically Combined:
   ID     Name
0   1    Alice
1   2      Bob
2   3  Charlie
3   4    Diana
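
Two details of vertical concatenation are worth seeing side by side: what happens to the index without `ignore_index=True`, and what happens when the frames do not share all columns. A small sketch with hypothetical monthly frames (`jan` and `feb` are illustrative names):

```python
import pandas as pd

jan = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})
feb = pd.DataFrame({'ID': [3], 'Name': ['Charlie'], 'Age': [41]})  # has an extra column

# Without ignore_index, each frame keeps its own row labels (0, 1, 0).
stacked = pd.concat([jan, feb])

# With ignore_index=True, rows are renumbered 0..n-1.
stacked_reset = pd.concat([jan, feb], ignore_index=True)

# Columns missing from one frame are filled with NaN for that frame's rows.
print(stacked.index.tolist())
print(stacked_reset)
```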


In [19]:
# Horizontal concat: columns are placed side by side and matched on the row index,
# so the shared ID column appears twice in the result
df3 = pd.DataFrame({'ID': [1, 2], 'Age': [23, 30]})
horiz_comb = pd.concat([df1, df3], axis=1)
print("\nHorizontally Combined:")
print(horiz_comb)


Horizontally Combined:
   ID   Name  ID  Age
0   1  Alice   1   23
1   2    Bob   2   30
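
Because `pd.concat(axis=1)` aligns rows on the index rather than on any column, the duplicated `ID` column can be avoided by moving `ID` into the index first. A minimal sketch, assuming `ID` uniquely identifies each row; the second frame is given in a different row order here (a hypothetical variation) to show that alignment follows the labels, not the position.

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})
df3 = pd.DataFrame({'ID': [2, 1], 'Age': [30, 23]})  # same IDs, different row order

# Align on ID by making it the index of both frames, then restore it as a column.
by_id = pd.concat(
    [df1.set_index('ID'), df3.set_index('ID')],
    axis=1,
).reset_index()

print(by_id)
```

Each name ends up next to the age that carries the same `ID`, and `ID` appears only once.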


## **Section 5: End-to-End Mini Pipeline**

We built a complete data wrangling pipeline that included loading raw data, filtering, grouping, applying transformations, and pivoting the results to simulate a dashboard-ready summary.

This task reflected real-world data transformation steps used in reporting and analytics.



In [22]:
# Load dataset
superstore_df = pd.read_csv('Sample - Superstore.csv', encoding='cp1252')

In [23]:
# Step 1: Filter (.copy() makes an independent DataFrame, which avoids SettingWithCopyWarning in Step 2)
filtered = superstore_df[superstore_df['Sales'] > 500].copy()

In [24]:
# Step 2: Add derived column
filtered['Profit_Level'] = filtered['Profit'].apply(lambda x: 'High' if x > 100 else 'Low')


In [25]:
# Step 3: Group
grouped = filtered.groupby(['Category', 'Region'])['Profit'].agg(['sum', 'mean']).reset_index()

In [26]:
# Step 4: Pivot
pivot_dashboard = grouped.pivot(index='Region', columns='Category', values='mean')
print("Final Dashboard:")
print(pivot_dashboard)

Final Dashboard:
Category  Furniture  Office Supplies  Technology
Region                                          
Central   42.691829       125.734569  297.883231
East      27.608349       201.709670  336.405846
South     48.965407       139.492102  189.485028
West      58.020595       255.967001  251.700606
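
The filter, group, and pivot steps can also be wrapped in a single reusable function. The sketch below is one possible arrangement, not the notebook's original code: `build_dashboard` and `min_sales` are hypothetical names, `.pivot_table()` stands in for the separate group and pivot steps, and a few made-up rows replace the Superstore file so the example runs on its own.

```python
import pandas as pd

def build_dashboard(raw: pd.DataFrame, min_sales: float = 500) -> pd.DataFrame:
    """Keep rows with Sales above min_sales, then average Profit per Region and Category."""
    filtered = raw.loc[raw['Sales'] > min_sales]
    # pivot_table groups by Region/Category and reshapes in one call.
    return filtered.pivot_table(
        values='Profit', index='Region', columns='Category', aggfunc='mean'
    )

# Hypothetical rows with the same column names as the Superstore data.
sample = pd.DataFrame({
    'Region': ['East', 'East', 'West', 'West', 'East'],
    'Category': ['Furniture', 'Furniture', 'Technology', 'Furniture', 'Technology'],
    'Sales': [600, 900, 1200, 300, 800],
    'Profit': [50, 150, 400, 20, -100],
})

dashboard = build_dashboard(sample)
print(dashboard)
```

With the real file, the call would be `build_dashboard(superstore_df)`; combinations with no rows above the sales threshold show up as `NaN`.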


## **Summary**

- On Day 5, I learned to reshape, enrich, and transform datasets using advanced Pandas operations.

- I applied `.pivot()`, `.pivot_table()`, and `.melt()` for reshaping; used `.apply()` with lambda functions to create derived columns; cleaned categorical data with `.map()` and `.replace()`; and combined multiple DataFrames using `pd.concat()`.

- Finally, I built a mini data processing pipeline, from raw CSV to structured, summarized output, simulating real-world transformation workflows.

- These skills form the backbone of efficient data preparation for analytics and dashboards.