# Workshop 8 - Pandas Retail Store Analysis
## Karyn M.
## DAB July

### Part 1: Data Loading and Exploration

### 1. Loading Data: Import all CSV files into Pandas DataFrames and display the first few rows of each.
### 2. Exploratory Analysis: Use descriptive statistics to get a feel for the data:
        ### Obtain summary statistics of numerical columns in the products, sales, and inventory dataframes
        ### Check for missing values in all dataframes
        ###  Display the data types of each column in all dataframes
### 3. Basic Information Retrieval:
        ### How many unique products are in the product catalog?
        ### What are the top 5 most expensive products?
        ### Which store has the largest floor space?
        ### What is the distribution of customers by state?

In [4]:
import sqlite3

In [5]:
import pandas as pd
import glob

#A. Get all CSV file paths from a folder
csv_files = glob.glob("*.csv")

#B. Create a dictionary of DataFrames
dataframes = {file: pd.read_csv(file) for file in csv_files}

#C. Display the first few rows of each DataFrame
for file, df in dataframes.items():
    print(f"\n--- {file} ---")
    print(df.head())


--- wk8-products.csv ---
   product_id        product_name     category  subcategory    brand    price  \
0           1     Apple iPhone 13  Electronics  Smartphones    Apple   899.99   
1           2  Samsung Galaxy S21  electronics  Smartphones  Samsung   799.99   
2           3     Sony WH-1000XM4  ELECTRONICS   Headphones     Sony   349.99   
3           4         Dell XPS 13  Electronics      Laptops     Dell  1299.99   
4           5    Nike Classic Tee     Clothing       Shirts     Nike    24.99   

     cost  weight  
0  649.99    0.45  
1  539.99    0.50  
2  210.00    0.60  
3  899.99    2.80  
4   12.50    0.20  

--- wk8-sales.csv ---
   sale_id        date  store_id  customer_id  product_id  quantity    total  \
0        1  2022-01-15       3.0           12           5         2    49.98   
1        2  2022-01-16       1.0            5          10         1    49.99   
2        3  2022-01-18       2.0            8           3         1   349.99   
3        4  2022-01-20  

In [6]:
# See all keys (file names)
print(dataframes.keys())

dict_keys(['wk8-products.csv', 'wk8-sales.csv', 'wk8-customers.csv', 'wk8-stores.csv', 'wk8-inventory.csv'])


In [7]:
# Rename each DataFrame to an easier variable name
inventory_df = dataframes["wk8-inventory.csv"]
sales_df = dataframes["wk8-sales.csv"]
stores_df = dataframes["wk8-stores.csv"]
customers_df = dataframes["wk8-customers.csv"]
products_df = dataframes["wk8-products.csv"]

# check first rows to make sure everything loaded correctly
print(inventory_df.head())
print(sales_df.head())
print(stores_df.head())
print(customers_df.head())
print(products_df.head())

   inventory_id  store_id  product_id  quantity_in_stock last_restock_date  \
0             1         1           1                 25        2022-10-15   
1             2         1           5                 42        2022-11-02   
2             3         1          10                 18        2022-09-30   
3             4         1          15                  5        2022-10-20   
4             5         1          20                 12               NaN   

   reorder_level  
0           10.0  
1           15.0  
2            8.0  
3           10.0  
4           15.0  
   sale_id        date  store_id  customer_id  product_id  quantity    total  \
0        1  2022-01-15       3.0           12           5         2    49.98   
1        2  2022-01-16       1.0            5          10         1    49.99   
2        3  2022-01-18       2.0            8           3         1   349.99   
3        4  2022-01-20       4.0           20           7         1   349.99   
4        5  2022-

In [9]:
#Obtain summary statistics of numerical columns in the products, sales, and inventory dataframes

print(inventory_df.describe())
print(sales_df.describe())
print(products_df.describe())

       inventory_id   store_id  product_id  quantity_in_stock  reorder_level
count     75.000000  75.000000   75.000000          75.000000      69.000000
mean      38.000000   8.000000   12.733333          22.640000      12.681159
std       21.794495   4.349588    7.179801          13.923187       3.327659
min        1.000000   1.000000    1.000000           0.000000       8.000000
25%       19.500000   4.000000    6.500000          15.000000      10.000000
50%       38.000000   8.000000   13.000000          25.000000      10.000000
75%       56.500000  12.000000   19.000000          33.500000      15.000000
max       75.000000  15.000000   25.000000          45.000000      20.000000
         sale_id   store_id  customer_id  product_id   quantity        total
count  40.000000  39.000000    40.000000   40.000000  40.000000    40.000000
mean   20.500000   6.487179    14.075000   13.500000   1.625000   352.296500
std    11.690452   4.235637     8.303189    8.857852   0.952392   377.625776

In [10]:
# Check for missing values in all dataframes - Inventory
inventory_df.isnull().sum()


inventory_id         0
store_id             0
product_id           0
quantity_in_stock    0
last_restock_date    9
reorder_level        6
dtype: int64

In [11]:
# Check for missing values in all dataframes - Sales
sales_df.isnull().sum()

sale_id           0
date              1
store_id          1
customer_id       0
product_id        0
quantity          0
total             0
payment_method    1
dtype: int64

In [12]:
# Check for missing values in all dataframes - Stores
stores_df.isnull().sum()

store_id        0
store_name      0
address         0
city            0
state           0
zip_code        0
region          0
size_sqft       2
opening_date    1
dtype: int64

In [13]:
# Check for missing values in all dataframes - Customers
customers_df.isnull().sum()

customer_id          0
first_name           0
last_name            0
email                3
phone                2
address              0
city                 0
state                0
zip_code             1
registration_date    3
dtype: int64

In [14]:
# Check for missing values in all dataframes - Products
products_df.isnull().sum()

product_id      0
product_name    0
category        0
subcategory     0
brand           0
price           0
cost            0
weight          2
dtype: int64
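
The five separate checks above could also be run in one pass. This is a minimal sketch, assuming the DataFrames are kept in a dictionary; the `frames` dictionary and its tiny sample data are hypothetical stand-ins for the workshop files.

```python
import pandas as pd

# Hypothetical stand-ins for the workshop DataFrames
frames = {
    "products": pd.DataFrame({"product_id": [1, 2], "weight": [0.45, None]}),
    "stores": pd.DataFrame({"store_id": [1, 2], "size_sqft": [None, None]}),
}

# One Series of missing-value counts per DataFrame, keeping only columns with gaps
missing_report = {
    name: frame.isnull().sum().loc[lambda counts: counts > 0]
    for name, frame in frames.items()
}

for name, counts in missing_report.items():
    print(f"--- {name} ---")
    print(counts)
```

Filtering to columns with at least one gap keeps the report short when most columns are complete.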

In [18]:
# Display the data types of each column in all dataframes- Inventory
print (inventory_df.dtypes)

inventory_id           int64
store_id               int64
product_id             int64
quantity_in_stock      int64
last_restock_date     object
reorder_level        float64
dtype: object


In [19]:
# Display the data types of each column in all dataframes- Sales
print (sales_df.dtypes)

sale_id             int64
date               object
store_id          float64
customer_id         int64
product_id          int64
quantity            int64
total             float64
payment_method     object
dtype: object


In [20]:
# Display the data types of each column in all dataframes- Stores
print (stores_df.dtypes)

store_id          int64
store_name       object
address          object
city             object
state            object
zip_code         object
region           object
size_sqft       float64
opening_date     object
dtype: object


In [21]:
# Display the data types of each column in all dataframes- Customers
print (customers_df.dtypes)

customer_id           int64
first_name           object
last_name            object
email                object
phone                object
address              object
city                 object
state                object
zip_code             object
registration_date    object
dtype: object


In [22]:
# Display the data types of each column in all dataframes- Products
print (products_df.dtypes)

product_id        int64
product_name     object
category         object
subcategory      object
brand            object
price           float64
cost            float64
weight          float64
dtype: object
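
The dtype listings above show every date column stored as `object` (text). If date arithmetic or sorting is needed later, the columns can be converted with `pd.to_datetime`. This is a sketch on a hypothetical sample, not a step the notebook performs.

```python
import pandas as pd

# Hypothetical sample mirroring the text-typed 'date' column of sales_df
sample = pd.DataFrame({"date": ["2022-01-15", "2022-01-16", None]})

# errors="coerce" turns missing or unparseable entries into NaT instead of raising
sample["date"] = pd.to_datetime(sample["date"], errors="coerce")

print(sample.dtypes)
```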


In [25]:
# How many unique products are in the product catalog?
unique_products = products_df['product_id'].nunique()
print (unique_products)

30


In [28]:
# What are the top 5 most expensive products?
top_five = products_df.sort_values(by = 'price', ascending = False).head(5)
print (top_five)

    product_id        product_name        category  subcategory    brand  \
3            4         Dell XPS 13     Electronics      Laptops     Dell   
0            1     Apple iPhone 13     Electronics  Smartphones    Apple   
1            2  Samsung Galaxy S21     electronics  Smartphones  Samsung   
19          20    Dyson V11 Vacuum  HOME & KITCHEN   Appliances    Dyson   
26          27  Sony PlayStation 5     Electronics       Gaming     Sony   

      price    cost  weight  
3   1299.99  899.99    2.80  
0    899.99  649.99    0.45  
1    799.99  539.99    0.50  
19   599.99  375.00    6.70  
26   499.99  399.00    4.50  
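
The `category` column in the output above mixes capitalization (`Electronics`, `electronics`, `HOME & KITCHEN`), which would split one category into several groups. One possible fix, sketched here on sample values copied from that output, is to normalize the case before grouping or counting.

```python
import pandas as pd

# Sample values taken from the category column shown above
categories = pd.Series(["Electronics", "electronics", "ELECTRONICS", "HOME & KITCHEN"])

# Trim stray spaces, then capitalize the first letter of each word
normalized = categories.str.strip().str.title()

print(normalized.value_counts())
```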


In [29]:
# Which store has the largest floor space?

largest_space = stores_df.sort_values(by = 'size_sqft', ascending = False).head(1)
print (largest_space)

   store_id         store_name          address         city state zip_code  \
2         3  Los Angeles Plaza  789 Commerce St  Los Angeles    CA    90001   

  region  size_sqft opening_date  
2   West    55000.0   2004-03-10  


In [31]:
# What is the distribution of customers by state?
customer_locations = customers_df['state'].value_counts()
print (customer_locations)

state
TX                4
CA                3
NY                2
MA                1
NM                1
WI                1
KY                1
MD                1
Maryland          1
TN                1
MI                1
NV                1
Oregon            1
Washington        1
CO                1
California        1
IN                1
North Carolina    1
OH                1
FL                1
Texas             1
PA                1
AZ                1
IL                1
MO                1
Name: count, dtype: int64
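
The counts above treat `TX` and `Texas` (and `CA`/`California`, `MD`/`Maryland`) as different states, so the distribution is split. A minimal sketch of one way to merge them, assuming a small lookup table is acceptable; `STATE_ABBREVIATIONS` is a hypothetical helper that covers only the full names seen in this output.

```python
import pandas as pd

# Hypothetical lookup covering only the full state names that appear above
STATE_ABBREVIATIONS = {
    "California": "CA",
    "Maryland": "MD",
    "North Carolina": "NC",
    "Oregon": "OR",
    "Texas": "TX",
    "Washington": "WA",
}

# Small sample in the same mixed format as customers_df['state']
states = pd.Series(["TX", "Texas", "CA", "California", "MD", "Maryland"])

# Full names are mapped to abbreviations; values not in the lookup are left as-is
normalized = states.str.strip().replace(STATE_ABBREVIATIONS)

print(normalized.value_counts())
```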


In [None]:
df.isnull().sum()
print(inventory_df.head())
print(sales_df.head())
print(stores_df.head())
print(customers_df.head())
print(products_df.head())

In [None]:
### Part 2: Data Cleaning
### 1. Handling Missing Values:
        ### Identify all missing values in each dataset
        ### For numerical columns with missing values, replace them with the column mean
        ### For categorical columns with missing values, replace them with the most frequent value
        ### For date columns with missing values, use forward fill or backward fill as appropriate
### 2. Removing Duplicates:
        ### Check for and remove any duplicate entries in the customers and products dataframes
        ### Explain your approach for identifying duplicates

In [32]:
# Handling missing values 
# Identify missing values in all dataframes - Inventory
inventory_df.isnull().sum()

inventory_id         0
store_id             0
product_id           0
quantity_in_stock    0
last_restock_date    9
reorder_level        6
dtype: int64

In [37]:
# For numerical columns with missing values, replace them with the column mean : Inventory
# inventory_df['reorder_level'] 
# get mean of reorder level

inventory_df['reorder_level'].describe()

count    69.000000
mean     12.681159
std       3.327659
min       8.000000
25%      10.000000
50%      10.000000
75%      15.000000
max      20.000000
Name: reorder_level, dtype: float64

In [34]:
# Inventory

# get mean of reorder level
inventory_df['reorder_level'].mean()

np.float64(12.681159420289855)

In [35]:
# For numerical columns with missing values, replace them with the column mean



inventory_df['reorder_level'].fillna(inventory_df['reorder_level'].mean())

0     10.0
1     15.0
2      8.0
3     10.0
4     15.0
      ... 
70    10.0
71     8.0
72    10.0
73    15.0
74    10.0
Name: reorder_level, Length: 75, dtype: float64

In [38]:
## Actual DataFrame change: for numerical columns with missing values, replace them with the column mean

# This assignment permanently changes the DataFrame in memory (the CSV file is untouched)
inventory_df['reorder_level'] = inventory_df['reorder_level'].fillna(inventory_df['reorder_level'].mean())

In [39]:
# check that reorder level missing values are now filled
inventory_df.isnull().sum()

inventory_id         0
store_id             0
product_id           0
quantity_in_stock    0
last_restock_date    9
reorder_level        0
dtype: int64
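
When a table has several numeric columns with gaps, the column-mean rule can be applied to all of them in one call. This is a sketch under the assumption that every numeric column should get the same treatment; the sample frame is hypothetical.

```python
import pandas as pd

# Hypothetical sample shaped like inventory_df
inventory = pd.DataFrame({
    "inventory_id": [1, 2, 3],
    "quantity_in_stock": [25, 42, 18],
    "reorder_level": [10.0, None, 20.0],
    "last_restock_date": ["2022-10-15", None, "2022-09-30"],
})

# Mean of each numeric column; text and date columns are skipped
numeric_means = inventory.mean(numeric_only=True)

# fillna matches the means to columns by name, so non-numeric columns stay untouched
filled = inventory.fillna(numeric_means)

print(filled)
```

ID columns are numeric too, so this only makes sense when they have no missing values, as is the case in this dataset.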

In [40]:
# For date columns with missing values, use forward fill or backward fill as appropriate
# ffill() copies the previous row's date down; bfill() then covers any gap left at the very top
inventory_df['last_restock_date'] = inventory_df['last_restock_date'].ffill().bfill()


In [41]:
# Check that the last_restock_date missing values are now filled
inventory_df.isnull().sum()

inventory_id         0
store_id             0
product_id           0
quantity_in_stock    0
last_restock_date    0
reorder_level        0
dtype: int64
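
To make the forward fill then backward fill order concrete, here is a small sketch on hypothetical dates. It assumes neighbouring rows are a reasonable guide to the missing date, which depends on how the file is sorted.

```python
import pandas as pd

# Hypothetical date strings with gaps at the start, middle, and end
dates = pd.Series([None, "2022-10-15", None, "2022-11-02", None])

# Forward fill covers the middle and end gaps; backward fill covers the leading gap
filled = dates.ffill().bfill()

print(filled.tolist())
```

Forward fill alone would leave the first row empty because there is no earlier value to copy, which is why the backward fill is chained after it.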

In [42]:
# Handling missing values 
# Identify missing values in all dataframes - Sales
sales_df.isnull().sum()

sale_id           0
date              1
store_id          1
customer_id       0
product_id        0
quantity          0
total             0
payment_method    1
dtype: int64

In [44]:
# For categorical columns with missing values, replace them with the most frequent value: Sales
# Get the most frequent payment method
sales_df['payment_method'].mode()

0    Credit Card
Name: payment_method, dtype: object
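
`mode()` returns a Series rather than a single value because several values can tie for most frequent. A short sketch on hypothetical payment data shows what happens in a tie and why `[0]` is needed when filling.

```python
import pandas as pd

# Hypothetical payments where "Cash" and "Credit Card" tie at two each
payments = pd.Series(["Credit Card", "Cash", "Credit Card", None, "Cash"])

# Missing values are ignored; tied values are all returned, in sorted order
modes = payments.mode()

# Taking element 0 picks the first of the tied values
filled = payments.fillna(modes[0])

print(modes.tolist())
print(filled.tolist())
```

With a tie, the choice of `modes[0]` is arbitrary (alphabetical), so it is worth checking how many modes there are before relying on it.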

In [52]:
# For categorical columns with missing values, replace them with the most frequent value - payment_method: Sales
sales_df['payment_method'] = sales_df['payment_method'].fillna(sales_df['payment_method'].mode()[0])

In [54]:
# For date columns with missing values, use forward fill or backward fill as appropriate: Sales
sales_df['date'] = sales_df['date'].ffill().bfill()


In [56]:
# The missing store_id will be filled with "Unknown": Sales
# A store ID should not be replaced by the mode or mean, since each ID identifies a specific store
# Note: store_id is float64, so adding the string "Unknown" turns it into a mixed-type (object) column
sales_df['store_id'] = sales_df['store_id'].fillna("Unknown")


In [57]:

# Check that the sales_df missing values are addressed: Sales

sales_df.isnull().sum()

sale_id           0
date              0
store_id          0
customer_id       0
product_id        0
quantity          0
total             0
payment_method    0
dtype: int64

In [58]:
# Handling missing values 
# Identify missing values in all dataframes - Stores
stores_df.isnull().sum()

store_id        0
store_name      0
address         0
city            0
state           0
zip_code        0
region          0
size_sqft       2
opening_date    1
dtype: int64

In [62]:
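# For numerical columns with missing values, replace them with the column mean: Stores
# Note: this cell (In [62]) ran after the fill in In [61] below, so the count is already 15
# and the two filled rows hold the mean, which is why the 50% value equals the mean.

stores_df['size_sqft'].describe()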

count       15.000000
mean     30653.846154
std      10500.654087
min      16500.000000
25%      23000.000000
50%      30653.846154
75%      33750.000000
max      55000.000000
Name: size_sqft, dtype: float64

In [61]:
# Fill in the missing values for size_sqft using the mean:
stores_df['size_sqft'] = stores_df['size_sqft'].fillna(stores_df['size_sqft'].mean())


In [63]:
# For date columns with missing values, use forward fill or backward fill as appropriate: Stores
stores_df['opening_date'] = stores_df['opening_date'].ffill().bfill()


In [66]:

# Check that the missing values in the stores dataframe are resolved: Stores
stores_df.isnull().sum()

store_id        0
store_name      0
address         0
city            0
state           0
zip_code        0
region          0
size_sqft       0
opening_date    0
dtype: int64

In [67]:
# identify customers_df missing values: Customers

customers_df.isnull().sum()

customer_id          0
first_name           0
last_name            0
email                3
phone                2
address              0
city                 0
state                0
zip_code             1
registration_date    3
dtype: int64

In [68]:
# The missing email values will be filled with "Unknown": Customers
# An email should not be replaced by the mode or mean, since it is unique to a specific customer
customers_df['email'] = customers_df['email'].fillna("Unknown")


In [None]:
# The missing phone values will be filled with "Unknown": Customers
# A phone number should not be replaced by the mode or mean, since it is unique to a specific customer
customers_df['phone'] = customers_df['phone'].fillna("Unknown")

In [72]:
# For categorical columns with missing values, replace them with the most frequent value | Customers - zip_code

customers_df['zip_code'] = customers_df['zip_code'].fillna(customers_df['zip_code'].mode()[0])


In [71]:
# For date columns with missing values, use forward fill or backward fill as appropriate: Customers
customers_df['registration_date'] = customers_df['registration_date'].ffill().bfill()


In [73]:
# Check that the customers_df missing values are addressed: Customers

customers_df.isnull().sum()

customer_id          0
first_name           0
last_name            0
email                0
phone                0
address              0
city                 0
state                0
zip_code             0
registration_date    0
dtype: int64

In [74]:
# Handling missing values 
# Identify missing values in all dataframes - Products
products_df.isnull().sum()

product_id      0
product_name    0
category        0
subcategory     0
brand           0
price           0
cost            0
weight          2
dtype: int64

In [76]:
#get mean of products_df['weight']
products_df['weight'].describe()

count    28.000000
mean      3.835714
std       5.693236
min       0.200000
25%       0.500000
50%       1.150000
75%       3.750000
max      25.000000
Name: weight, dtype: float64

In [78]:
# For numerical columns with missing values, replace them with the column mean | Products
products_df['weight'] = products_df['weight'].fillna(products_df['weight'].mean())


In [79]:
# Handling missing values
# Check that the missing values are addressed - Products
products_df.isnull().sum()

product_id      0
product_name    0
category        0
subcategory     0
brand           0
price           0
cost            0
weight          0
dtype: int64
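
The duplicate-removal task listed for Part 2 is not carried out in the cells above. One possible approach is sketched here: check for exact duplicate rows first, then compare on a normalized key so that rows differing only in case or spacing are also caught. The sample data and the choice of `email` as the key are assumptions for illustration.

```python
import pandas as pd

# Hypothetical customers: row 3 repeats row 1 with different case and spacing
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "first_name": ["Ana", "Ben", "ana "],
    "last_name": ["Lee", "Kim", "LEE"],
    "email": ["ana@example.com", "ben@example.com", "ANA@example.com "],
})

# Step 1: exact duplicates (every column identical); none here
exact_duplicates = int(customers.duplicated().sum())

# Step 2: build a normalized key so formatting differences do not hide duplicates
email_key = customers["email"].str.strip().str.lower()

# Keep the first row for each key and drop the later repeats
deduped = customers.loc[~email_key.duplicated(keep="first")].reset_index(drop=True)

print(exact_duplicates)
print(deduped)
```

For products, a similar key could be built from `product_name` and `brand`. The ID column alone is a weak check because a re-entered record usually gets a new ID.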

## Discussion Questions:

### 1. What are the key advantages of using Pandas for data cleaning compared to other methods?

#### Advantage: With pandas/Python you can reuse the code on other datasets, which lets you work systematically and get consistent results.

#### Advantage: Pandas/Python can handle large datasets that might crash programs like Excel.

#### Advantage: Pandas has built-in functions that speed up data cleaning compared to Excel.

#### Advantage: Pandas/Python can pull from various data sources and file types.



### 2. How would your approach to handling missing values differ if the missing data was not random but had a pattern or meaning?

#### You would first investigate what the pattern reveals before filling anything in, since a mean or mode fill would hide it.

#### You would check whether the missing fields are sensitive (sex, illness, race, income), because the pattern of who leaves them blank can itself reveal insights.
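
One way to look for such a pattern is to record which rows were missing before any fill, then compare missing rates across groups. This is a sketch on hypothetical data; the `email_missing` column name is made up for the example.

```python
import pandas as pd

# Hypothetical customers where missing emails cluster in one state
customers = pd.DataFrame({
    "state": ["TX", "TX", "CA", "CA"],
    "email": ["a@example.com", None, None, None],
})

# Flag the gaps before filling, so the information survives imputation
customers["email_missing"] = customers["email"].isna()

# Share of missing emails per state; very uneven rates suggest non-random gaps
missing_rate_by_state = customers.groupby("state")["email_missing"].mean()

print(missing_rate_by_state)
```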

### 3. What types of data quality issues might not be immediately visible through simple DataFrame inspection methods?

#### You may miss discrepancies between columns, e.g. if the country is England then the capital can't be Paris.

#### You may miss wrong data types and inconsistent formatting (lowercase/uppercase, phone numbers captured with and without dashes, missing area codes).

#### You may miss duplicated data.

#### You may miss inconsistent units of measurement (pounds vs. grams).

#### You may miss misspellings or the inclusion of shorthand in categorical data.
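
Cross-column discrepancies like the first point usually need an explicit rule, because `head()` and `describe()` look at one column at a time. A minimal sketch, assuming the business rule that a product's cost should not exceed its price; both the rule and the sample rows are illustrative assumptions.

```python
import pandas as pd

# Hypothetical products; the third row has a cost above its price
products = pd.DataFrame({
    "product_id": [1, 2, 3],
    "price": [899.99, 24.99, 49.99],
    "cost": [649.99, 12.50, 60.00],
})

# Rows that break the assumed rule "cost <= price"
suspicious = products[products["cost"] > products["price"]]

print(suspicious)
```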






### 4. How would you document your data cleaning process to ensure reproducibility?

#### You can use comments or markdown to document the cleaning process step by step.

#### You can comment on why each piece of code is used, its purpose, and what you expect it to return.

#### You can use clearly named functions to reuse code and keep it short.

#### You may want to include justifications for decisions, e.g. why or how you filled in missing data.
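
These points can be combined by wrapping each cleaning step in a named function that also records what it did. This is only a sketch; `fill_with_mean` and `cleaning_log` are hypothetical names, not part of pandas.

```python
import pandas as pd


def fill_with_mean(frame, column, log):
    """Return a copy of frame with gaps in column filled by the column mean, and log the step."""
    missing_count = int(frame[column].isna().sum())
    mean_value = frame[column].mean()
    result = frame.copy()
    result[column] = result[column].fillna(mean_value)
    # Record what was done so the decision can be reviewed later
    log.append({
        "column": column,
        "method": "mean",
        "rows_filled": missing_count,
        "fill_value": mean_value,
    })
    return result


cleaning_log = []
products = pd.DataFrame({"weight": [0.5, None, 1.5]})
products = fill_with_mean(products, "weight", cleaning_log)

# The log doubles as documentation of the cleaning run
print(pd.DataFrame(cleaning_log))
```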



### 5. In what scenarios might it be better to remove rows with missing values rather than imputing them?

#### You could remove rows whose missing values are not critical to the analysis, where the exclusion will not have a significant impact on the outcome.

#### You could remove rows with missing data when the analysis must rest on factual data only (medical data, financial data).
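
When dropping is the better choice, `dropna` with `subset` limits the removal to rows missing a specific field, so gaps in unrelated columns do not cost extra rows. A minimal sketch on hypothetical sales data:

```python
import pandas as pd

# Hypothetical sales rows; the second one has no store_id
sales = pd.DataFrame({
    "sale_id": [1, 2, 3],
    "store_id": [3.0, None, 2.0],
    "total": [49.98, 49.99, 349.99],
})

# Drop only the rows where store_id is missing; the original frame is unchanged
complete_sales = sales.dropna(subset=["store_id"])

print(complete_sales)
```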