# Exercise 4.4: Data Wrangling & Subsetting

## 📋 Table of Contents

1. [Introduction](#Introduction)
2. [Import Libraries & Data](#Import)
3. [Dropping Columns](#Dropping)
4. [Renaming Columns](#Renaming)
5. [Changing Data Types](#Types)
6. [Transposing Data](#Transposing)
7. [Creating a Data Dictionary](#DataDict)
8. [Subsetting Data](#Subsetting)
9. [Exporting Data](#Exporting)
10. [Reflection](#Reflection)

## Introduction

In this exercise, I wrangle and subset the Instacart orders and products data using pandas. Tasks include dropping and renaming columns, changing data types, transposing a dataframe, building a data dictionary, subsetting, and exporting the cleaned files.

## Import Libraries & Data

In [4]:
import pandas as pd
import os

# Set the path to your project folder
path = r'C:\Users\rhysm\OneDrive\Desktop\Career Foundry\Data Immersion\Module 4\04-2025 Instacart Basket Analysis'

# Import orders.csv and products.csv
df_ords = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'orders.csv'))
df_prods = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'products.csv'))
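
A quick sanity check right after import can catch a wrong path or an unexpected column early. The sketch below is only an illustration: it reads a tiny in-memory stand-in for `orders.csv` (the `csv_text` string and the `prior` values in `eval_set` are assumptions, not the real file) and prints the shape, the column names, and the missing-value counts.

```python
import io
import pandas as pd

# Hypothetical stand-in for orders.csv: three rows typed in by hand
csv_text = """order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
2539329,1,prior,1,2,8,
2398795,1,prior,2,3,7,15.0
473747,1,prior,3,3,12,21.0
"""

# read_csv accepts any file-like object, so StringIO works in place of a path
df_sample = pd.read_csv(io.StringIO(csv_text))

print(df_sample.shape)             # (rows, columns)
print(df_sample.columns.tolist())  # column names in file order
print(df_sample.isna().sum())      # missing values per column
```

The same three `print` calls can be pointed at `df_ords` and `df_prods` once the real files are loaded.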

## Dropping Columns

In [6]:
# Drop 'eval_set' column from df_ords
df_ords = df_ords.drop(columns=['eval_set'])
print(df_ords.head())

   order_id  user_id  order_number  order_dow  order_hour_of_day  \
0   2539329        1             1          2                  8   
1   2398795        1             2          3                  7   
2    473747        1             3          3                 12   
3   2254736        1             4          4                  7   
4    431534        1             5          4                 15   

   days_since_prior_order  
0                     NaN  
1                    15.0  
2                    21.0  
3                    29.0  
4                    28.0  
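
One thing worth knowing about the cell above: `drop()` returns a new dataframe, and running the same cell a second time raises a `KeyError` because the column is already gone. A minimal sketch on a small made-up frame (the values in `eval_set` are assumed) shows both behaviours and the `errors='ignore'` option that makes the cell safe to re-run.

```python
import pandas as pd

# Small illustrative frame; the eval_set values are made up
df = pd.DataFrame({
    'order_id': [2539329, 2398795],
    'user_id': [1, 1],
    'eval_set': ['prior', 'prior'],
})

# drop() returns a new dataframe; df itself still has the column
df_dropped = df.drop(columns=['eval_set'])

# With errors='ignore', dropping a column that is already gone is a no-op
df_dropped_again = df_dropped.drop(columns=['eval_set'], errors='ignore')

print(df.columns.tolist())
print(df_dropped_again.columns.tolist())
```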


## Renaming Columns

In [8]:
# Rename 'order_dow' to 'order_day_of_week'
df_ords.rename(columns={'order_dow': 'order_day_of_week'}, inplace=True)
print(df_ords.head())

   order_id  user_id  order_number  order_day_of_week  order_hour_of_day  \
0   2539329        1             1                  2                  8   
1   2398795        1             2                  3                  7   
2    473747        1             3                  3                 12   
3   2254736        1             4                  4                  7   
4    431534        1             5                  4                 15   

   days_since_prior_order  
0                     NaN  
1                    15.0  
2                    21.0  
3                    29.0  
4                    28.0  
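
By default, `rename()` silently ignores keys that do not match any column, so a typo in the old name leaves the dataframe unchanged without any warning. The sketch below, on a one-row made-up frame with a deliberately misspelled key (`order_dwo`), shows how `errors='raise'` surfaces that mistake.

```python
import pandas as pd

df = pd.DataFrame({'order_id': [2539329], 'order_dow': [2]})

# Misspelled key: nothing is renamed and no error is raised
df_typo = df.rename(columns={'order_dwo': 'order_day_of_week'})

# errors='raise' turns the same typo into a KeyError
try:
    df.rename(columns={'order_dwo': 'order_day_of_week'}, errors='raise')
    typo_caught = False
except KeyError:
    typo_caught = True

# The correctly spelled key renames the column as intended
df_renamed = df.rename(columns={'order_dow': 'order_day_of_week'}, errors='raise')

print(df_typo.columns.tolist())
print(typo_caught)
print(df_renamed.columns.tolist())
```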


## Changing Data Types

In [10]:
# Convert 'order_id' to string
df_ords['order_id'] = df_ords['order_id'].astype('str')
print(df_ords.dtypes)

order_id                   object
user_id                     int64
order_number                int64
order_day_of_week           int64
order_hour_of_day           int64
days_since_prior_order    float64
dtype: object
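
The reason for converting an identifier like `order_id` to a string is that it should not be treated as a quantity. A rough illustration on a made-up three-row frame: before the conversion `describe()` summarises `order_id` as if it were a measurement, afterwards it is left out of the numeric summary.

```python
import pandas as pd

df = pd.DataFrame({
    'order_id': [2539329, 2398795, 473747],
    'order_number': [1, 2, 3],
})

# describe() summarises numeric columns only (when any are present)
numeric_before = df.describe().columns.tolist()

df['order_id'] = df['order_id'].astype('str')

numeric_after = df.describe().columns.tolist()

print(numeric_before)
print(numeric_after)
```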


## Transposing Data

In [12]:
# Transpose the df_prods dataframe (for data dictionary)
df_prods_t = df_prods.T
print(df_prods_t)

                                    0                 1      \
product_id                              1                 2   
product_name   Chocolate Sandwich Cookies  All-Seasons Salt   
aisle_id                               61               104   
department_id                          19                13   
prices                                5.8               9.3   

                                              2      \
product_id                                        3   
product_name   Robust Golden Unsweetened Oolong Tea   
aisle_id                                         94   
department_id                                     7   
prices                                          4.5   

                                                           3      \
product_id                                                     4   
product_name   Smart Ones Classic Favorites Mini Rigatoni Wit...   
aisle_id                                                      38   
department_id     
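
Transposing the full products dataframe creates one column per product, which is why the printout above is so wide. One option, sketched here on a hand-typed copy of the first three products, is to transpose only a few rows with `head()`.

```python
import pandas as pd

# First three products as shown in the output above, typed in by hand
df = pd.DataFrame({
    'product_id': [1, 2, 3],
    'product_name': ['Chocolate Sandwich Cookies', 'All-Seasons Salt',
                     'Robust Golden Unsweetened Oolong Tea'],
    'aisle_id': [61, 104, 94],
    'department_id': [19, 13, 7],
    'prices': [5.8, 9.3, 4.5],
})

# Transpose only the first two rows: columns become rows, rows become columns
df_t = df.head(2).T

print(df_t)
```

Because each transposed column now mixes numbers and text, the original column types are generally not preserved in `df_t`.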

## Creating a Data Dictionary

In [19]:
# Create a data dictionary from df_prods (keyed by row index, one entry per product)
data_dict = df_prods.to_dict('index')
print(list(data_dict.items())[:1])  # Preview the first entry

[(0, {'product_id': 1, 'product_name': 'Chocolate Sandwich Cookies', 'aisle_id': 61, 'department_id': 19, 'prices': 5.8})]
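
With `to_dict('index')` the keys are the row positions (0, 1, 2, ...), not the product IDs. If lookups by `product_id` are the goal, one possible variation (an assumption about how the dictionary will be used, not part of the exercise) is to set `product_id` as the index first. The sketch uses a hand-typed copy of the first two products.

```python
import pandas as pd

df = pd.DataFrame({
    'product_id': [1, 2],
    'product_name': ['Chocolate Sandwich Cookies', 'All-Seasons Salt'],
    'aisle_id': [61, 104],
    'department_id': [19, 13],
    'prices': [5.8, 9.3],
})

# Keyed by product_id instead of row position
data_dict_by_id = df.set_index('product_id').to_dict('index')

# 'records' gives a plain list with one dict per row
records = df.to_dict('records')

print(data_dict_by_id[2]['product_name'])
print(records[0])
```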


## Subsetting Data

In [22]:
# Subset breakfast items (department_id == 14)
df_breakfast = df_prods[df_prods['department_id'] == 14]
print(df_breakfast.head())

# Subset dinner party items
dinner_parties = [5, 20, 7, 12]
df_dinner_party = df_prods[df_prods['department_id'].isin(dinner_parties)]
print(df_dinner_party.head())
print("Total dinner party items:", df_dinner_party.shape[0])  # shape[0] is the row count

# Subset user_id 1 orders
df_user_1 = df_ords[df_ords['user_id'] == 1]
print(df_user_1)

     product_id                                      product_name  aisle_id  \
27           28                                 Wheat Chex Cereal       121   
33           34                                               NaN       121   
67           68                           Pancake Mix, Buttermilk       130   
89           90                                      Smorz Cereal       121   
210         211  Gluten Free Organic Cereal Coconut Maple Vanilla       130   

     department_id  prices  
27              14    10.1  
33              14    12.2  
67              14    13.7  
89              14     3.9  
210             14     3.6  
    product_id                                    product_name  aisle_id  \
2            3            Robust Golden Unsweetened Oolong Tea        94   
6            7                  Pure Coconut Water With Orange        98   
9           10  Sparkling Orange Juice & Prickly Pear Beverage       115   
10          11                               Pe
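
The subsetting in this section relies on boolean masks: a comparison such as `df['department_id'] == 14` produces one `True`/`False` per row, and indexing with that mask keeps only the `True` rows. A small sketch on four hand-picked rows (the selection of rows is my own, for illustration) makes the intermediate mask visible.

```python
import pandas as pd

df = pd.DataFrame({
    'product_id': [3, 28, 68, 1],
    'product_name': ['Robust Golden Unsweetened Oolong Tea', 'Wheat Chex Cereal',
                     'Pancake Mix, Buttermilk', 'Chocolate Sandwich Cookies'],
    'department_id': [7, 14, 14, 19],
})

# Step 1: the comparison yields a boolean Series, one value per row
is_breakfast = df['department_id'] == 14

# Step 2: .loc with the mask keeps only the rows marked True
df_breakfast = df.loc[is_breakfast]

# isin() builds the same kind of mask for a list of values
dinner_parties = [5, 20, 7, 12]
df_dinner = df.loc[df['department_id'].isin(dinner_parties)]

# shape is a (rows, columns) tuple; len() or shape[0] gives the row count
n_breakfast = len(df_breakfast)
n_dinner = df_dinner.shape[0]

print(is_breakfast.tolist())
print(n_breakfast, n_dinner)
```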

## Exporting Data

In [25]:
# Export cleaned orders and products dataframes
df_ords.to_csv(os.path.join(path, '02 Data', 'Prepared Data', 'orders_wrangled.csv'))
df_prods.to_csv(os.path.join(path, '02 Data', 'Prepared Data', 'products_wrangled.csv'))
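
Two details of the CSV round trip are worth checking before relying on the exported files. Without `index=False`, `to_csv()` writes the row index as an extra unnamed column, and a string `order_id` is read back as an integer unless `dtype` is given to `read_csv()`. The sketch below demonstrates both on a made-up two-row frame, writing to an in-memory buffer instead of the project folder.

```python
import io
import pandas as pd

df = pd.DataFrame({'order_id': ['2539329', '2398795'], 'user_id': [1, 1]})

# Default export: the index is written as an extra, unnamed first column
buf_default = io.StringIO()
df.to_csv(buf_default)
cols_default = pd.read_csv(io.StringIO(buf_default.getvalue())).columns.tolist()

# index=False leaves the index out of the file
buf_clean = io.StringIO()
df.to_csv(buf_clean, index=False)

# Reading back: order_id is inferred as an integer unless dtype says otherwise
df_back = pd.read_csv(io.StringIO(buf_clean.getvalue()))
df_back_str = pd.read_csv(io.StringIO(buf_clean.getvalue()), dtype={'order_id': str})

print(cols_default)
print(df_back.dtypes)
print(df_back_str['order_id'].iloc[0])
```

Whether to pass `index=False` in the export cell above is a judgment call; it mainly matters if the files are re-imported later.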

## Reflection

✅ In this exercise, I practiced key data wrangling techniques: dropping and renaming columns, converting data types, transposing a dataframe, creating a data dictionary, subsetting data, and exporting cleaned data files.