# Data Checks
Here we run some initial checks to get a general overview of the data and confirm it looks as expected before starting the analysis.

In [3]:
import pandas as pd
df_sm = pd.read_excel("charity shop data initial.xlsx", sheet_name="Space Management", index_col=None)

In [5]:
df_sm.shape

(5236, 10)

The dataframe is the expected size. In 16 of the 174 weeks there are 31 sub categories (because the additional EOL Clothing sub category was introduced in the 2025 financial year), and in the remaining 158 weeks there are 30 sub categories as expected. This gives (16x31) + (158x30) = 496 + 4740 = 5236 rows, matching the output, and the 10 columns are also correct.
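
The same arithmetic can be checked in code rather than by hand. The following is a minimal sketch, assuming the column name `date week commencing` from the dataframe above; the helper `rows_per_week_profile` and the small stand-in frame are hypothetical and only there so the example runs without the Excel file.

```python
import pandas as pd

def rows_per_week_profile(df, date_col="date week commencing"):
    """Return {rows_in_a_week: number_of_weeks_with_that_many_rows}."""
    rows_per_week = df.groupby(date_col).size()
    return rows_per_week.value_counts().to_dict()

# Expected total from the reasoning above: 16 weeks of 31 rows, 158 weeks of 30 rows.
expected_rows = 16 * 31 + 158 * 30
print(expected_rows)

# Hypothetical stand-in data: two weeks with 3 rows each and one week with 4 rows.
demo = pd.DataFrame({
    "date week commencing": ["06/07/2025"] * 3 + ["13/07/2025"] * 3 + ["20/07/2025"] * 4,
    "sub category": list("abc") + list("abc") + list("abcd"),
})
profile = rows_per_week_profile(demo)
print(profile)
```

If the structure described above holds, calling `rows_per_week_profile(df_sm)` on the real data would be expected to show 158 weeks with 30 rows and 16 weeks with 31 rows.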

In [8]:
df_sm.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5236 entries, 0 to 5235
Data columns (total 10 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   date week commencing     5236 non-null   object 
 1   category                 5236 non-null   object 
 2   sub category             5236 non-null   object 
 3   Dept £                   5236 non-null   float64
 4   Dept %                   5236 non-null   float64
 5   No. items sold           5236 non-null   int64  
 6   Average selling price £  5236 non-null   float64
 7   No. of bays              5236 non-null   float64
 8   % of space               5236 non-null   float64
 9   Average sale per bay £   5236 non-null   float64
dtypes: float64(6), int64(1), object(3)
memory usage: 409.2+ KB


This gives a first look at the columns and their data types. There appear to be no null values, which we confirm next.

In [11]:
null_mask = df_sm.isnull().any(axis=1)
null_rows = df_sm[null_mask]

print(null_rows)

Empty DataFrame
Columns: [date week commencing, category, sub category, Dept £, Dept %, No. items sold, Average selling price £, No. of bays, % of space, Average sale per bay £]
Index: []


As shown above, there are no null values in the dataset, so nothing needs fixing there.
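
A per-column count is another way to run the same check, and it shows where any gaps sit if they ever appear. This is a small sketch on a hypothetical stand-in frame (the values are made up for illustration, not taken from the dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in frame with one deliberately missing value.
demo = pd.DataFrame({
    "sub category": ["books", "media", "home"],
    "No. of bays": [1.0, np.nan, 2.0],
})

# Number of missing values in each column.
nulls_per_column = demo.isna().sum()
print(nulls_per_column)

# Total number of missing cells; 0 would mean nothing needs fixing.
total_nulls = int(nulls_per_column.sum())
print(total_nulls)
```

On the real data the equivalent call would be `df_sm.isna().sum()`, which should be all zeros given the result above.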

In [14]:
df_sm.describe()

| | Dept £ | Dept % | No. items sold | Average selling price £ | No. of bays | % of space | Average sale per bay £ |
|---|---|---|---|---|---|---|---|
| count | 5236.0 | 5236.0 | 5236.0 | 5236.0 | 5236.0 | 5236.0 | 5236.0 |
| mean | 478.140521 | 12.716826 | 143.036096 | 4.575407 | 3.040393 | 12.790649 | 133.865376 |
| std | 927.088922 | 24.447392 | 281.151248 | 3.510331 | 5.832331 | 24.431132 | 88.766878 |
| min | -14.95 | -0.29 | -1.0 | -2.5 | 0.0 | 0.0 | -29.9 |
| 25% | 52.75 | 1.4175 | 8.0 | 2.04 | 0.5 | 2.08 | 79.0975 |
| 50% | 116.515 | 3.24 | 27.0 | 3.62 | 1.0 | 4.08 | 137.405 |
| 75% | 307.69 | 8.32 | 117.0 | 6.6925 | 2.0 | 8.33 | 181.0 |
| max | 6227.52 | 100.0 | 2063.0 | 30.0 | 41.0 | 100.0 | 942.25 |


The `describe` output above gives an overall feel for the numeric columns. Note that several columns have negative minimum values (for example Dept £ at -14.95 and No. items sold at -1).
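
The `min` row of the summary shows negative values in several columns. One way to see how widespread they are is to count them per numeric column. The helper `negative_counts` below is hypothetical, and the stand-in frame only borrows a few values from the summary for illustration.

```python
import pandas as pd

def negative_counts(df):
    """Count the negative values in each numeric column."""
    numeric = df.select_dtypes(include="number")
    return (numeric < 0).sum()

# Hypothetical stand-in frame; the real check would use df_sm.
demo = pd.DataFrame({
    "sub category": ["books", "media", "home"],
    "Dept £": [52.75, -14.95, 116.5],
    "No. items sold": [8, -1, 27],
    "No. of bays": [0.5, 1.0, 2.0],
})
counts = negative_counts(demo)
print(counts)
```

Text columns are skipped by `select_dtypes`, so only the numeric columns appear in the result.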

In [17]:
df_sm.nunique()

date week commencing        174
category                      4
sub category                 31
Dept £                     3627
Dept %                     1723
No. items sold              759
Average selling price £    1104
No. of bays                  49
% of space                  323
Average sale per bay £     3458
dtype: int64

Next we look at the number of unique values in each column. There are 174 distinct dates, which is correct as we have 174 weeks' worth of data. We also see 4 categories and 31 sub categories, as expected.

In [20]:
df_sm['category'].value_counts()

category
clothing        2452
non clothing    1392
total            870
big              522
Name: count, dtype: int64

We can confirm the individual category value counts are correct. There should be 3 big values each week, so 3x174 = 522; 5 total values each week, so 5x174 = 870; and 8 non clothing values each week, so 8x174 = 1392. Clothing accounts for the rest of the rows, 5236 - (522 + 870 + 1392) = 2452, which matches the output since we already confirmed that the total count is correct. This is consistent with 14 regular clothing sub categories each week plus the 16 EOL Clothing rows: (14x174) + 16 = 2452.
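
The expected counts can be written out as a short calculation. This is a sketch under two assumptions that are inferred rather than stated outright in the data: that clothing has 14 regular sub categories per week (30 minus the 3 + 5 + 8 above), and that the 16 EOL Clothing rows fall under the clothing category.

```python
weeks = 174      # distinct weeks in the data
eol_weeks = 16   # weeks that include the extra EOL Clothing row

# Sub categories per week in each category.
# The clothing figure is inferred: 30 regular sub categories minus 3 + 5 + 8.
per_week = {"big": 3, "total": 5, "non clothing": 8, "clothing": 30 - (3 + 5 + 8)}

expected = {name: n * weeks for name, n in per_week.items()}
expected["clothing"] += eol_weeks  # assumption: EOL Clothing sits under clothing

print(expected)
print(sum(expected.values()))
```

The resulting dictionary can be compared against `df_sm['category'].value_counts().to_dict()` to automate the check.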

In [23]:
df_sm['sub category'].value_counts()

sub category
ladies tops                     174
home                            174
total for big                   174
big gifts                       174
big cards wrap                  174
big xmas cards                  174
donated total                   174
total for non clothing          174
non clothing promotion          174
kids non clothing               174
furniture                       174
electrical                      174
media                           174
books                           174
jewellery                       174
total for clothing              174
ladies knitwear                 174
clothing promotion              174
kids clothing                   174
mens coats jackets and suits    174
mens bottoms                    174
mens shoes accs                 174
mens tops                       174
ladies accs                     174
ladies shoes                    174
ladies coats                    174
ladies dresses                  174
ladies trousers

Looking at the sub category counts, every sub category has a count of 174 apart from eol clothing, which has 16 as expected.

Finally, we confirm that the dates are correct: there should be 174 weeks, each exactly one week apart. The check below prints the position of any date that does not match, so no output means every date is as expected.

In [27]:
from datetime import datetime, timedelta

#the dates in the dataframe, in order of first appearance (newest first)
sm_dates = df_sm['date week commencing'].unique()

#end point dates
earliest = datetime(2022, 3, 27)
latest = datetime(2025, 7, 20)

#create list of expected dates, stepping back one week at a time from the latest
dates = []
current = latest
while current >= earliest:
    dates.append(current.strftime("%d/%m/%Y"))
    current -= timedelta(days=7)

In [29]:
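#check whether the expected dates are the same as the dataframe dates
#the lengths must match first, otherwise a pairwise comparison would hide missing weeks
assert len(sm_dates) == len(dates), "unexpected number of weeks"

#print the position of any date that differs from the expected one
for i, (actual, expected) in enumerate(zip(sm_dates, dates)):
    if actual != expected:
        print(i)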
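
As an alternative to building the expected list by hand, the weekly spacing can be checked with pandas datetime tools. This is a sketch, not the method used above: the helper `weekly_gaps_ok` is hypothetical, and it assumes the dates are strings in `dd/mm/YYYY` format as they appear in the dataframe.

```python
import pandas as pd

def weekly_gaps_ok(date_strings, fmt="%d/%m/%Y"):
    """Return True if the unique dates, once sorted, are exactly 7 days apart."""
    parsed = pd.to_datetime(pd.Series(date_strings).drop_duplicates(), format=fmt)
    gaps = parsed.sort_values().diff().dropna()
    return bool((gaps == pd.Timedelta(days=7)).all())

# Hypothetical examples: the second list skips a week.
good = ["20/07/2025", "13/07/2025", "06/07/2025"]
bad = ["20/07/2025", "13/07/2025", "29/06/2025"]
print(weekly_gaps_ok(good))
print(weekly_gaps_ok(bad))

# The expected range from the end points above, generated in one call.
expected_range = pd.date_range("2022-03-27", "2025-07-20", freq="7D")
print(len(expected_range))
```

On the real data, `weekly_gaps_ok(df_sm['date week commencing'])` together with a check that there are 174 unique dates would cover the same ground as the loop above.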

In [31]:
df_sm.head()

| | date week commencing | category | sub category | Dept £ | Dept % | No. items sold | Average selling price £ | No. of bays | % of space | Average sale per bay £ |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20/07/2025 | clothing | ladies tops | 451.37 | 12.11 | 72 | 6.27 | 2.0 | 8.33 | 225.69 |
| 1 | 20/07/2025 | clothing | ladies knitwear | 54.75 | 1.47 | 9 | 6.08 | 0.5 | 2.08 | 109.5 |
| 2 | 20/07/2025 | clothing | ladies skirts | 85.81 | 2.3 | 12 | 7.15 | 0.5 | 2.08 | 171.62 |
| 3 | 20/07/2025 | clothing | ladies trousers | 186.06 | 4.99 | 29 | 6.42 | 1.0 | 4.17 | 186.06 |
| 4 | 20/07/2025 | clothing | ladies dresses | 140.25 | 3.76 | 15 | 9.35 | 3.0 | 12.5 | 46.75 |
