# Data Cleaning and Transformation

In [13]:
import pandas as pd
import numpy as np
df_sm = pd.read_excel("charity shop data initial.xlsx", sheet_name = "Space Management", index_col = None)

Due to a shop refurbishment in January 2024, the data for the weeks commencing 14th, 21st and 28th January is all zero. We therefore remove these three weeks, as they do not give a true representation of sales.

In [15]:
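#drop the three refurbishment weeks; .copy() gives an independent dataframe so the later in-place updates do not trigger SettingWithCopyWarning
df = df_sm.loc[(df_sm['date week commencing'] != '28/01/2024') & (df_sm['date week commencing'] != '21/01/2024') & (df_sm['date week commencing'] != '14/01/2024')].copy()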
df.shape

(5146, 10)
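
As a sanity check on the claim that these weeks are entirely zero, the weeks could also be identified from the data rather than typed in by hand. The following is a minimal sketch on made-up toy data; the frame `toy` and its values are hypothetical, and it assumes the date column holds strings in `dd/mm/yyyy` form, which the filter above suggests but does not confirm.

```python
import pandas as pd

# Toy stand-in for the Space Management sheet (values are invented).
toy = pd.DataFrame({
    "date week commencing": ["07/01/2024", "07/01/2024",
                             "14/01/2024", "14/01/2024",
                             "04/02/2024", "04/02/2024"],
    "sub category": ["books", "jewellery"] * 3,
    "Dept £": [120.0, 45.5, 0.0, 0.0, 98.0, 60.0],
})

# For each week, check whether every revenue value is exactly zero.
all_zero = toy["Dept £"].eq(0).groupby(toy["date week commencing"]).all()
zero_weeks = all_zero[all_zero].index.tolist()

# Keep only rows from weeks that have at least one non-zero value.
toy_clean = toy.loc[~toy["date week commencing"].isin(zero_weeks)].copy()

print(zero_weeks)
print(toy_clean.shape)
```

Checking that every value is zero, rather than that the weekly sum is zero, avoids flagging a week in which positive and negative entries happen to cancel.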

The sub categories electrical, furniture and non clothing promotion are not sold in the shop, so any non-zero values recorded against them are the result of human error at the till. As agreed with the store manager, we move all electrical sales to the jewellery category, furniture sales to books and non clothing promotion sales to kids non clothing.
This means the other column values for these sub categories must be updated as well.

In [25]:
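#move revenue for electrical, furniture and non clothing promotion to jewellery, books and kids non clothing respectively, due to human error at the till
#for each week we add the Dept £ and No. items sold to the correct sub category, set the three unsold sub categories to 0, then subtract their bays from the total rows
#note: the bay variables are set when the electrical, furniture and non clothing promotion rows are visited, so these rows must come before the total rows within each week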
for index, row in df.iterrows():
    date = row['date week commencing']
    if row['sub category'] == 'jewellery':
        electrical_dept = df[(df['sub category'] == 'electrical') & (df['date week commencing'] == date)]['Dept £'].values
        electrical_num = df[(df['sub category'] == 'electrical') & (df['date week commencing'] == date)]['No. items sold'].values
        df.loc[index, 'Dept £'] += electrical_dept
        df.loc[index, 'No. items sold'] += electrical_num
    if row['sub category'] == 'books':
        furniture_dept = df[(df['sub category'] == 'furniture') & (df['date week commencing'] == date)]['Dept £'].values
        furniture_num = df[(df['sub category'] == 'furniture') & (df['date week commencing'] == date)]['No. items sold'].values
        df.loc[index, 'Dept £'] += furniture_dept
        df.loc[index, 'No. items sold'] += furniture_num
    if row['sub category'] == 'kids non clothing':
        nc_prom_dept = df[(df['sub category'] == 'non clothing promotion') & (df['date week commencing'] == date)]['Dept £'].values
        nc_prom_num = df[(df['sub category'] == 'non clothing promotion') & (df['date week commencing'] == date)]['No. items sold'].values
        df.loc[index, 'Dept £'] += nc_prom_dept
        df.loc[index, 'No. items sold'] += nc_prom_num
    if row['sub category'] == 'electrical':
        electrical_bays = row['No. of bays']
        df.loc[index, 'Dept £'] = 0
        df.loc[index, 'No. items sold'] = 0
        df.loc[index, 'No. of bays'] = 0
    if row['sub category'] == 'furniture':
        furniture_bays = row['No. of bays']
        df.loc[index, 'Dept £'] = 0
        df.loc[index, 'No. items sold'] = 0
        df.loc[index, 'No. of bays'] = 0
    if row['sub category'] == 'non clothing promotion':
        nc_prom_bays = row['No. of bays']
        df.loc[index, 'Dept £'] = 0
        df.loc[index, 'No. items sold'] = 0
        df.loc[index, 'No. of bays'] = 0
    if row['sub category'] == 'total for non clothing':
        df.loc[index, 'No. of bays'] = df.loc[index, 'No. of bays'] - electrical_bays - furniture_bays - nc_prom_bays
    if row['sub category'] == 'donated total':
        df.loc[index, 'No. of bays'] = df.loc[index, 'No. of bays'] - electrical_bays - furniture_bays - nc_prom_bays
    if row['sub category'] == 'donated and big total':
        df.loc[index, 'No. of bays'] = df.loc[index, 'No. of bays'] - electrical_bays - furniture_bays - nc_prom_bays
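
The same reallocation can be expressed without a row-by-row loop. The sketch below is an alternative, not the method used in this notebook: it relabels the mis-keyed rows with their target sub category, sums per week and sub category, and then subtracts the removed bays from the total rows. The helper `reallocate`, the constants `MOVES` and `TOTALS`, and the toy values are all hypothetical, and only the three columns being corrected are carried through.

```python
import pandas as pd

WEEK, CAT = "date week commencing", "sub category"
MOVES = {"electrical": "jewellery", "furniture": "books",
         "non clothing promotion": "kids non clothing"}
TOTALS = ["total for non clothing", "donated total", "donated and big total"]

def reallocate(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    wrong = out[CAT].isin(list(MOVES))

    # Bays held by the mis-keyed sub categories, summed per week.
    removed_bays = out.loc[wrong].groupby(WEEK)["No. of bays"].sum()

    # Relabel mis-keyed rows to their target and zero their bays, so that
    # summing per (week, sub category) folds only their sales into the target.
    out.loc[wrong, "No. of bays"] = 0
    out[CAT] = out[CAT].map(lambda c: MOVES.get(c, c))
    out = out.groupby([WEEK, CAT], as_index=False, sort=False)[
        ["Dept £", "No. items sold", "No. of bays"]].sum()

    # Subtract the removed bays from each total row of the same week.
    is_total = out[CAT].isin(TOTALS)
    out.loc[is_total, "No. of bays"] -= (
        out.loc[is_total, WEEK].map(removed_bays).fillna(0))
    return out

# Toy data for a single week (values are invented).
toy = pd.DataFrame({
    WEEK: ["07/01/2024"] * 7,
    CAT: ["jewellery", "electrical", "books", "furniture",
          "kids non clothing", "non clothing promotion",
          "donated and big total"],
    "Dept £": [50.0, 5.0, 30.0, 0.0, 20.0, 2.0, 107.0],
    "No. items sold": [10, 1, 15, 0, 8, 1, 35],
    "No. of bays": [2.0, 1.0, 3.0, 0.0, 1.0, 0.5, 7.5],
})
print(reallocate(toy))
```

Because the bays to subtract are looked up by week, this version does not depend on the order of rows within a week. It also removes the three unsold sub categories in the same step, which the notebook does separately in the next cell.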

In [27]:
#drop the electrical, furniture and non clothing promotion rows now that the corrections have been made, since these are not sold in the shop
df = df[(df['sub category'] != 'electrical') & (df['sub category'] != 'furniture') & (df['sub category'] != 'non clothing promotion')].copy()

In [29]:
df.shape

(4633, 10)

In [35]:
#update new percentages in the dataset
for index, row in df.iterrows():
    date = row['date week commencing']
    if row['sub category'] != 'donated and big total':
        df.loc[index, 'Dept %'] = np.round(100*(df.loc[index, 'Dept £'] / df[(df['sub category'] == 'donated and big total') & (df['date week commencing'] == date)]['Dept £'].values),2)
        df.loc[index, '% of space'] = np.round(100*(df.loc[index, 'No. of bays'] / df[(df['sub category'] == 'donated and big total') & (df['date week commencing'] == date)]['No. of bays'].values), 2)
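
The percentage update can also be written as a column-wise calculation. This is a minimal sketch on invented toy data, assuming there is exactly one 'donated and big total' row per week; the names `toy`, `totals` and `GRAND` are hypothetical. Unlike the loop above, it also writes 100 into the grand total rows themselves.

```python
import numpy as np
import pandas as pd

WEEK, CAT = "date week commencing", "sub category"
GRAND = "donated and big total"

# Toy data for two weeks (values are invented).
toy = pd.DataFrame({
    WEEK: ["07/01/2024"] * 3 + ["04/02/2024"] * 3,
    CAT: ["books", "jewellery", GRAND] * 2,
    "Dept £": [30.0, 70.0, 100.0, 50.0, 150.0, 200.0],
    "No. of bays": [3.0, 1.0, 4.0, 2.0, 2.0, 4.0],
})

# One grand total row per week, indexed by week, used as the denominator.
totals = toy.loc[toy[CAT] == GRAND].set_index(WEEK)

# Look up each row's weekly total and compute its share, rounded to 2 d.p.
toy["Dept %"] = np.round(
    100 * toy["Dept £"] / toy[WEEK].map(totals["Dept £"]), 2)
toy["% of space"] = np.round(
    100 * toy["No. of bays"] / toy[WEEK].map(totals["No. of bays"]), 2)

print(toy)
```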

In [37]:
#export the new updated dataframe to excel
df.to_excel("charity shop data transformed.xlsx", index = False, sheet_name = 'Space Management')

Now that we have the new dataframe in Excel, we still have some changes to make. Firstly, we finish correcting the columns by recalculating the average selling price and average sale per bay values, since these change when values are moved out of electrical, furniture and non clothing promotion, and they are easy to correct in Excel. After this, we check that all the column data types are correct and adjust them if needed, and change the sub category column to sentence case so that it is more readable in graphs. Finally, we pull out the day, month and year for each date into separate columns and create two new columns called modified month and modified year. These new columns account for weeks in which only 1 or 2 days fall in a particular month or year. For example, if a week begins on 31st December 2024, we give it a modified month of January and a modified year of 2025, since 6 of the 7 days fall in January 2025. The formulae used to work out these new column values can be seen in the sheet named 'Space Management - Formulae' in the Charity Shop Excel workbook.
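
For readers working in pandas rather than Excel, the modified month and year could be sketched as follows. This is not the workbook's formula, which is not reproduced here. It assumes the rule is "assign the week to the month and year that contain the majority of its seven days", which is the same as taking the month and year of the fourth day of the week. The frame `weeks` and its dates are hypothetical.

```python
import pandas as pd

# Toy week-commencing dates in dd/mm/yyyy form (invented for illustration).
weeks = pd.DataFrame(
    {"date week commencing": ["24/12/2024", "31/12/2024", "07/01/2025"]})

start = pd.to_datetime(weeks["date week commencing"], format="%d/%m/%Y")

# Fourth day of the 7-day week: the month it falls in holds at least 4 days.
midweek = start + pd.Timedelta(days=3)

weeks["day"] = start.dt.day
weeks["month"] = start.dt.month
weeks["year"] = start.dt.year
weeks["modified month"] = midweek.dt.month   # assumed majority rule
weeks["modified year"] = midweek.dt.year     # assumed majority rule

print(weeks)
```

Under this assumption the week commencing 31st December 2024 is assigned to January 2025, matching the example above.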