# DATA TRANSFORMATION

### Starting Point
- This notebook uses a pre-cleaned dataset from the data cleaning phase.
- No additional cleaning is performed; all steps here focus on transformations and business logic only.

In [1]:
import pandas as pd
data = pd.read_csv("cleaned_data_superstore.csv")
print(data)

        invoice-id stock-code                          description  qty  \
0         536365.0     85123A   WHITE HANGING HEART T-LIGHT HOLDER    6   
1         536365.0      71053                  WHITE METAL LANTERN    6   
2         536365.0     84406B       CREAM CUPID HEARTS COAT HANGER    8   
3         536365.0     84029G  KNITTED UNION FLAG HOT WATER BOTTLE    6   
4         536365.0     84029E       RED WOOLLY HOTTIE WHITE HEART.    6   
...            ...        ...                                  ...  ...   
406824    581587.0      22613          PACK OF 20 SPACEBOY NAPKINS   12   
406825    581587.0      22899         CHILDREN'S APRON DOLLY GIRL     6   
406826    581587.0      23254        CHILDRENS CUTLERY DOLLY GIRL     4   
406827    581587.0      23255      CHILDRENS CUTLERY CIRCUS PARADE    4   
406828    581587.0      22138        BAKING SET 9 PIECE RETROSPOT     3   

               invoice-date  unit-date  cust-id         country  
0       2010-12-01 08:26:00      

In [3]:
# pd.read_csv already returns a DataFrame; df is the working DataFrame used in the rest of the notebook
df = pd.DataFrame(data)
print(df)

        invoice-id stock-code                          description  qty  \
0         536365.0     85123A   WHITE HANGING HEART T-LIGHT HOLDER    6   
1         536365.0      71053                  WHITE METAL LANTERN    6   
2         536365.0     84406B       CREAM CUPID HEARTS COAT HANGER    8   
3         536365.0     84029G  KNITTED UNION FLAG HOT WATER BOTTLE    6   
4         536365.0     84029E       RED WOOLLY HOTTIE WHITE HEART.    6   
...            ...        ...                                  ...  ...   
406824    581587.0      22613          PACK OF 20 SPACEBOY NAPKINS   12   
406825    581587.0      22899         CHILDREN'S APRON DOLLY GIRL     6   
406826    581587.0      23254        CHILDRENS CUTLERY DOLLY GIRL     4   
406827    581587.0      23255      CHILDRENS CUTLERY CIRCUS PARADE    4   
406828    581587.0      22138        BAKING SET 9 PIECE RETROSPOT     3   

               invoice-date  unit-date  cust-id         country  
0       2010-12-01 08:26:00      

## Filter rows

### Sanity Filtering
- Before analysis, records with logically invalid values (such as negative quantities) are removed.
- This ensures the dataset reflects real-world scenarios and prevents misleading results.

In [7]:
# Keep only rows with a non-negative quantity; .copy() avoids chained-assignment warnings in later cells
df = df[df['qty'] >= 0].copy()
print(df)

        invoice-id stock-code                          description  qty  \
0         536365.0     85123A   WHITE HANGING HEART T-LIGHT HOLDER    6   
1         536365.0      71053                  WHITE METAL LANTERN    6   
2         536365.0     84406B       CREAM CUPID HEARTS COAT HANGER    8   
3         536365.0     84029G  KNITTED UNION FLAG HOT WATER BOTTLE    6   
4         536365.0     84029E       RED WOOLLY HOTTIE WHITE HEART.    6   
...            ...        ...                                  ...  ...   
406824    581587.0      22613          PACK OF 20 SPACEBOY NAPKINS   12   
406825    581587.0      22899         CHILDREN'S APRON DOLLY GIRL     6   
406826    581587.0      23254        CHILDRENS CUTLERY DOLLY GIRL     4   
406827    581587.0      23255      CHILDRENS CUTLERY CIRCUS PARADE    4   
406828    581587.0      22138        BAKING SET 9 PIECE RETROSPOT     3   

               invoice-date  unit-date  cust-id         country  
0       2010-12-01 08:26:00      

In [8]:
df['qty'].min()  # check the minimum to confirm that no negative quantities remain

1
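The sanity filter above can also report how many rows it discards, which makes the step easier to audit. The following is a minimal sketch on a toy DataFrame; the values and the names `toy`, `mask`, `filtered`, and `removed` are invented for illustration and are not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the real dataset (values are invented for illustration)
toy = pd.DataFrame({"qty": [6, -2, 8, 0, 3]})

mask = toy["qty"] >= 0         # same rule as the notebook: keep non-negative quantities
filtered = toy[mask].copy()    # .copy() gives an independent DataFrame for later edits
removed = int((~mask).sum())   # number of rows that fail the rule

print(f"removed {removed} of {len(toy)} rows")  # removed 1 of 5 rows
```

Keeping the boolean mask in its own variable lets the same condition drive both the filter and the row count.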

In [10]:
# Cast the ID columns from float to integer
df['invoice-id'] = df['invoice-id'].astype(int)
df['cust-id'] = df['cust-id'].astype(int)

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 397924 entries, 0 to 406828
Data columns (total 8 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   invoice-id    397924 non-null  int32 
 1   stock-code    397924 non-null  object
 2   description   397924 non-null  object
 3   qty           397924 non-null  int64 
 4   invoice-date  168649 non-null  object
 5   unit-date     397924 non-null  int64 
 6   cust-id       397924 non-null  int32 
 7   country       397924 non-null  object
dtypes: int32(2), int64(2), object(4)
memory usage: 24.3+ MB


In [12]:
# Only the last expression in a cell is displayed, so the output below is the minimum cust-id
df['invoice-id'].min()
df['cust-id'].min()

12346
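Casting float IDs with `astype(int)` raises an error on missing values and silently truncates any fractional part, so it can help to check both conditions first. The sketch below runs on a toy DataFrame with invented values. Requesting `"int64"` explicitly is an assumption about the desired result: it gives the same dtype on every platform, whereas the `int32` in the `df.info()` output above suggests a platform where plain `int` maps to 32 bits.

```python
import pandas as pd

# Toy stand-in for the ID columns (values are illustrative)
toy = pd.DataFrame({
    "invoice-id": [536365.0, 536366.0],
    "cust-id": [17850.0, 12346.0],
})

for col in ["invoice-id", "cust-id"]:
    # Guard against the two failure modes of a float-to-int cast
    if toy[col].isna().any():
        raise ValueError(f"{col} contains missing values")
    if not (toy[col] % 1 == 0).all():
        raise ValueError(f"{col} contains non-integer values")
    toy[col] = toy[col].astype("int64")  # explicit width, independent of platform

print(toy.dtypes)
```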

### Business Filtering Decision
- After reviewing the dataset, no additional business-based filtering was required, as all remaining records already represent valid and relevant transactions for analysis.

### Derived Columns
- New columns are created to capture key analytical metrics that are not directly available in the raw data.
- These derived values simplify analysis and support business insights.

In [14]:
# The price column is labelled 'unit-date' in the input file; rename it to 'unit-price'
df.rename(columns={
    'unit-date': 'unit-price'
}, inplace=True)

df.columns

Index(['invoice-id', 'stock-code', 'description', 'qty', 'invoice-date',
       'unit-price', 'cust-id', 'country'],
      dtype='object')

In [16]:
# Create the derived column: line total = quantity * unit price
df['total'] = df['qty'] * df['unit-price']
df.columns

Index(['invoice-id', 'stock-code', 'description', 'qty', 'invoice-date',
       'unit-price', 'cust-id', 'country', 'total'],
      dtype='object')

In [17]:
# Confirm that the new column was created
df['total']

0         12
1         18
2         16
3         18
4         18
          ..
406824     0
406825    12
406826    16
406827    16
406828    12
Name: total, Length: 397924, dtype: int64
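The derived column is an element-wise product, so a row's `total` is 0 whenever its unit price is 0, as in row 406824 above. A small sketch on invented data shows the arithmetic and one way to count such rows; `toy` and `zero_total` are hypothetical names used only here.

```python
import pandas as pd

# Toy rows: the second one has a unit price of 0 (values are illustrative)
toy = pd.DataFrame({"qty": [6, 12, 4], "unit-price": [3, 0, 4]})

# Element-wise product, computed row by row
toy["total"] = toy["qty"] * toy["unit-price"]

# Rows whose total is 0 may deserve a closer look before aggregation
zero_total = toy[toy["total"] == 0]

print(toy["total"].tolist())  # [18, 0, 16]
print(len(zero_total))        # 1
```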

## Business Rules and Ordering
- Business rules are applied to translate numerical metrics into interpretable categories.
- The dataset is sorted to highlight the most impactful records for analysis.


**Flagging each transaction as high or low value**


In [28]:
# Label transactions with a total of 5000 or more as 'High Value', all others as 'Low Value'
df['value_flag'] = df['total'].apply(
    lambda x: 'High Value' if x >= 5000 else 'Low Value'
)

df.columns

Index(['invoice-id', 'stock-code', 'description', 'qty', 'invoice-date',
       'unit-price', 'cust-id', 'country', 'total', 'value_flag'],
      dtype='object')

In [25]:
df['value_flag']

0         Low Value
1         Low Value
2         Low Value
3         Low Value
4         Low Value
            ...    
406824    Low Value
406825    Low Value
406826    Low Value
406827    Low Value
406828    Low Value
Name: value_flag, Length: 397924, dtype: object
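The same rule can be written without a per-row lambda. The sketch below uses `numpy.where` on a toy column with invented totals; it is an alternative formulation of the 5000 threshold, not the notebook's own code.

```python
import numpy as np
import pandas as pd

# Toy totals around the 5000 threshold (values are illustrative)
toy = pd.DataFrame({"total": [12, 5000, 161990, 4999]})

# Vectorised form of the rule: one comparison over the whole column
toy["value_flag"] = np.where(toy["total"] >= 5000, "High Value", "Low Value")

# Quick check of how many rows fall into each category
counts = toy["value_flag"].value_counts()
print(counts)
```

On a dataset of several hundred thousand rows, the vectorised form is usually faster than `apply`, and `value_counts()` gives a quick view of how rare the high-value category is.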


**Sorting Data**


In [31]:
# Sort by total, largest first
df = df.sort_values(by='total', ascending=False)
df

| | invoice-id | stock-code | description | qty | invoice-date | unit-price | cust-id | country | total | value_flag |
|---|---|---|---|---|---|---|---|---|---|---|
| 406349 | 581483 | 23843 | PAPER CRAFT , LITTLE BIRDIE | 80995 | 2011-12-09 09:15:00 | 2 | 16446 | United Kingdom | 161990 | High Value |
| 38120 | 541431 | 23166 | MEDIUM CERAMIC TOP STORAGE JAR | 74215 | | 1 | 12346 | United Kingdom | 74215 | High Value |
| 159239 | 556444 | 22502 | PICNIC BASKET WICKER 60 PIECES | 60 | 2011-06-10 15:28:00 | 649 | 15098 | United Kingdom | 38940 | High Value |
| 121396 | 551697 | POST | POSTAGE | 1 | 2011-05-03 13:46:00 | 8142 | 16029 | United Kingdom | 8142 | High Value |
| 254663 | 567423 | 23243 | SET OF TEA COFFEE SUGAR TINS PANTRY | 1412 | | 5 | 17450 | United Kingdom | 7060 | High Value |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 199989 | 561382 | 85231G | ORANGE SCENTED SET/9 T-LIGHTS | 4 | | 0 | 16750 | United Kingdom | 0 | Low Value |
| 45973 | 542399 | 21985 | PACK OF 12 HEARTS DESIGN TISSUES | 24 | | 0 | 15549 | United Kingdom | 0 | Low Value |
| 110022 | 550192 | 22348 | TEA BAG PLATE RED RETROSPOT | 12 | | 0 | 12582 | Finland | 0 | Low Value |
| 110015 | 550191 | 21654 | RIDGED GLASS FINGER BOWL | 6 | | 0 | 16806 | United Kingdom | 0 | Low Value |


In [32]:
# Sort by invoice-id, smallest first
df = df.sort_values(by='invoice-id', ascending=True)
df

| | invoice-id | stock-code | description | qty | invoice-date | unit-price | cust-id | country | total | value_flag |
|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 536365 | 21730 | GLASS STAR FROSTED T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 4 | 17850 | United Kingdom | 24 | Low Value |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3 | 17850 | United Kingdom | 18 | Low Value |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3 | 17850 | United Kingdom | 18 | Low Value |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3 | 17850 | United Kingdom | 18 | Low Value |
| 5 | 536365 | 22752 | SET 7 BABUSHKA NESTING BOXES | 2 | 2010-12-01 08:26:00 | 7 | 17850 | United Kingdom | 14 | Low Value |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 406822 | 581587 | 22629 | SPACEBOY LUNCH BOX | 12 | 2011-12-09 12:50:00 | 1 | 12680 | France | 12 | Low Value |
| 406820 | 581587 | 22730 | ALARM CLOCK BAKELIKE IVORY | 4 | 2011-12-09 12:50:00 | 3 | 12680 | France | 12 | Low Value |
| 406823 | 581587 | 23256 | CHILDRENS CUTLERY SPACEBOY | 4 | 2011-12-09 12:50:00 | 4 | 12680 | France | 16 | Low Value |
| 406814 | 581587 | 22631 | CIRCUS PARADE LUNCH BOX | 12 | 2011-12-09 12:50:00 | 1 | 12680 | France | 12 | Low Value |


In [34]:
# Sort by invoice-date, most recent first
df = df.sort_values(by='invoice-date', ascending=False)
df

| | invoice-id | stock-code | description | qty | invoice-date | unit-price | cust-id | country | total | value_flag |
|---|---|---|---|---|---|---|---|---|---|---|
| 406819 | 581587 | 22726 | ALARM CLOCK BAKELIKE GREEN | 4 | 2011-12-09 12:50:00 | 3 | 12680 | France | 12 | Low Value |
| 406825 | 581587 | 22899 | CHILDREN'S APRON DOLLY GIRL | 6 | 2011-12-09 12:50:00 | 2 | 12680 | France | 12 | Low Value |
| 406818 | 581587 | 22727 | ALARM CLOCK BAKELIKE RED | 4 | 2011-12-09 12:50:00 | 3 | 12680 | France | 12 | Low Value |
| 406828 | 581587 | 22138 | BAKING SET 9 PIECE RETROSPOT | 3 | 2011-12-09 12:50:00 | 4 | 12680 | France | 12 | Low Value |
| 406821 | 581587 | 22367 | CHILDRENS APRON SPACEBOY DESIGN | 8 | 2011-12-09 12:50:00 | 1 | 12680 | France | 8 | Low Value |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 389008 | 579885 | 22597 | MUSICAL ZINC HEART DECORATION | 11 | | 0 | 15444 | United Kingdom | 0 | Low Value |
| 389009 | 579885 | 22602 | RETROSPOT WOODEN HEART DECORATION | 10 | | 0 | 15444 | United Kingdom | 0 | Low Value |
| 389030 | 579885 | 22530 | MAGIC DRAWING SLATE DOLLY GIRL | 2 | | 0 | 15444 | United Kingdom | 0 | Low Value |
| 389031 | 579885 | 22536 | MAGIC DRAWING SLATE PURDEY | 3 | | 0 | 15444 | United Kingdom | 0 | Low Value |
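In the `df.info()` output earlier, `invoice-date` has dtype `object`, so the sort above compares strings. That works for this ISO-style format, but parsing the column to datetimes first is more robust and makes the handling of missing dates explicit. The sketch below uses invented rows; the secondary sort key and `na_position="last"` are illustrative choices, not the notebook's settings.

```python
import pandas as pd

# Toy rows, one with a missing invoice date (values are illustrative)
toy = pd.DataFrame({
    "invoice-id": [581587, 536365, 541431],
    "invoice-date": ["2011-12-09 12:50:00", "2010-12-01 08:26:00", None],
})

# Parse strings to datetimes; unparseable or missing entries become NaT
toy["invoice-date"] = pd.to_datetime(toy["invoice-date"], errors="coerce")

# Most recent first, ties broken by invoice-id, missing dates kept at the end
ordered = toy.sort_values(
    by=["invoice-date", "invoice-id"],
    ascending=[False, True],
    na_position="last",
)

print(ordered["invoice-id"].tolist())  # [581587, 536365, 541431]
```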


### Saving the dataset

In [35]:
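# Note: this writes to the same filename that was loaded at the start, so the pre-cleaned input file is overwritten
df.to_csv("cleaned_data_superstore.csv", index=False)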
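The save step writes to the same filename that was read at the top of the notebook. If the pre-cleaned input should stay untouched, the transformed data can go to a separate file. The sketch below is one way to do that on a toy DataFrame; the name `transformed_data_superstore.csv` is hypothetical, and a temporary directory is used only so the example runs anywhere.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the transformed dataset (values are illustrative)
toy = pd.DataFrame({"invoice-id": [536365], "total": [18]})

# Hypothetical output name, chosen so the input file is not overwritten
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "transformed_data_superstore.csv")
toy.to_csv(out_path, index=False)

# Read the file back to confirm that the round trip preserves the data
round_trip = pd.read_csv(out_path)
print(round_trip.equals(toy))
```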


## Summary
- The pre-cleaned dataset was filtered to remove logically invalid records, and derived columns were created to support analysis.
- Business rules were applied to translate numerical values into interpretable categories, and the data was ordered to highlight key records.
- The final dataset is structured, reliable, and ready for aggregation and insight generation.