# Conclusions

## Data Quality - Operational Improvements

During the data validation and preparation stage it quickly became evident that the overall quality of the provided data was poor. The CSV sources were rejected in favour of the JSON sources because the CSV files had more missing data and were not consistent with the JSON data. It was assumed that the CSV files were the result of a first attempt to extract data from the JSON files.

More detailed examination of the JSON data revealed frequent gaps and inconsistencies. For example:

- Customer Reviews: Out of 39,000 reviews, 16,000 (roughly 40%) were duplicated across multiple customers. This could be the result of an error during extraction from operational systems; hopefully it does not reflect the production data in those systems or an attempt to create misleading reviews. Score rankings were also inconsistent, potentially because customers did not know whether the review score of 1 to 5 runs from poor to good or from good to poor. Sentiment analysis of the review text supported this observation.
- Customers: The date of birth for customers appears to be artificially generated and is of limited use for demographic analysis. This could be deliberate obfuscation and hopefully does not reflect the customer data held in the source operational systems.
- Sales & Stock: Approximately 2% to 3% of the data is inconsistent or incomplete, for example stock items without sales prices or descriptions.
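
The duplicated-review check described in the list above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the notebook code used in the analysis; the field names `customer_id` and `text` and the sample records are assumptions made for the example.

```python
from collections import defaultdict


def normalise(text):
    """Lower-case and collapse whitespace so trivial differences do not hide duplicates."""
    return " ".join(text.lower().split())


def find_shared_reviews(reviews):
    """Return review texts that appear under more than one customer."""
    customers_by_text = defaultdict(set)
    for review in reviews:
        customers_by_text[normalise(review["text"])].add(review["customer_id"])
    return {text: ids for text, ids in customers_by_text.items() if len(ids) > 1}


# Hypothetical sample records; the real JSON schema may differ.
sample_reviews = [
    {"customer_id": "c1", "text": "Great fit, would buy again"},
    {"customer_id": "c2", "text": "great fit,  would buy again"},
    {"customer_id": "c3", "text": "Colour faded after one wash"},
]

shared = find_shared_reviews(sample_reviews)
duplicated_rows = sum(1 for r in sample_reviews if normalise(r["text"]) in shared)
print(f"{duplicated_rows} of {len(sample_reviews)} reviews share text with another customer")
```

Grouping by normalised text rather than by exact string is a design choice for the sketch; a stricter or looser match may be more appropriate for the real review data.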

Because of these issues, more time was spent on creating and documenting processes (i.e. Jupyter Notebooks) to make the validation and preparation stage easily repeatable when better quality data is provided. The reasons for the many data quality issues need to be understood so that the data extraction can be repeated more successfully. If the root cause is the extraction, it can be easily addressed and appropriate validation checks put in place. If the causes lie in the source operational systems, they need to be addressed as part of JCP's planned overhaul of its customer and stock systems.
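
As an illustration of the kind of validation check that could be put in place after extraction, the sketch below flags stock records with a missing sale price or description. It is a hedged example only: the field names `sku`, `sale_price` and `description` and the sample records are hypothetical, and the real checks would need to follow the actual stock schema.

```python
def validate_stock_item(item):
    """Return a list of problems found in one stock record (empty if the record is clean)."""
    problems = []
    price = item.get("sale_price")
    if price is None:
        problems.append("missing sale_price")
    elif not isinstance(price, (int, float)) or price <= 0:
        problems.append("invalid sale_price")
    if not (item.get("description") or "").strip():
        problems.append("missing description")
    return problems


# Hypothetical records; field names are assumptions, not the real schema.
sample_stock = [
    {"sku": "A100", "sale_price": 24.99, "description": "Cotton shirt"},
    {"sku": "A101", "sale_price": None, "description": "Leather belt"},
    {"sku": "A102", "sale_price": 12.50, "description": "   "},
    {"sku": "A103", "description": ""},
]

report = {item["sku"]: validate_stock_item(item) for item in sample_stock}
failing = {sku: problems for sku, problems in report.items() if problems}
failure_rate = len(failing) / len(sample_stock)
print(f"{len(failing)} of {len(sample_stock)} records failed validation")
```

Running a check like this on every extract would make the 2% to 3% figure for incomplete sales and stock data easy to recompute when new data arrives.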

With better quality data, the data collation can be repeated and additional data analysis steps can then be developed.

## Business Observations

As stated earlier, the data analysis completed is limited by the poor quality of the data available. All conclusions must be heavily caveated and should probably not be acted on until the exercise can be repeated.

### Product & Price Range

JCP has a very large product range, with over 6,000 stock items and prices ranging from $3.61 up to $4,023.51. Within these broad ranges, 40% of the total $4m sales revenue is generated by only 100 products. In terms of price, 50% of the sales revenue comes from products priced over $500 each; conversely, 1,200 products are priced below $20 and generate less than $150k of revenue. The very high price, low volume products do not make a big contribution to overall sales revenue, and neither do the very low price, high volume products. The sweet spot appears to be products in the $20 to $4,500 price range.
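
The two measures used above, the share of revenue from the top N products and the revenue falling into each price band, can be computed along the following lines. This is a sketch under stated assumptions: the `(unit_price, revenue)` pairs are invented for illustration and are not JCP figures, and the band edges of $20 and $500 are simply the thresholds mentioned in the text.

```python
from bisect import bisect_right


def top_n_revenue_share(revenues, n):
    """Fraction of total revenue generated by the n highest-revenue products."""
    total = sum(revenues)
    if total == 0:
        return 0.0
    return sum(sorted(revenues, reverse=True)[:n]) / total


def revenue_by_price_band(products, edges):
    """Sum revenue per price band.

    products: iterable of (unit_price, revenue) pairs.
    edges: ascending band boundaries; band i holds prices with
           edges[i-1] <= price < edges[i], so there are len(edges) + 1 bands.
    """
    totals = [0.0] * (len(edges) + 1)
    for price, revenue in products:
        totals[bisect_right(edges, price)] += revenue
    return totals


# Hypothetical (unit_price, revenue) pairs, not taken from the JCP data.
sample_products = [
    (5.0, 100.0), (15.0, 50.0), (40.0, 300.0),
    (250.0, 450.0), (600.0, 700.0), (1200.0, 400.0),
]

share = top_n_revenue_share([revenue for _, revenue in sample_products], n=2)
bands = revenue_by_price_band(sample_products, edges=[20, 500])
print(f"Top 2 products generate {share:.1%} of revenue")
print(dict(zip(["under $20", "$20 to under $500", "$500 and over"], bands)))
```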

The product range should be reviewed to confirm whether rationalisation could improve overall profit by ceasing the sale of the very low price, higher volume products and the very high price, low volume products. The profit margin of individual products needs to be known before that analysis can be completed.

### Geographic Impact

The number of customers was distributed extremely evenly across all US states, ranging from 80 to 95 customers per state. In comparison, the total population of states (and territories) ranges from 300 to 39 million people. This could be yet another data quality issue or a very unusual geographic spread. The spread of sales revenue is similarly unusual: for example, the US Minor Outlying Islands, with a population of 300, spends as much as Alabama, with a population of 5 million, and the largest state, California, with a population of 39 million, has the lowest sales revenue.
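
One simple way to make this imbalance visible is to express customer counts per 100,000 residents. The sketch below uses the population figures quoted above; the customer counts are hypothetical values chosen from within the reported 80 to 95 range, since the per-state counts are not listed here.

```python
# Populations as quoted in the text above.
populations = {
    "US Minor Outlying Islands": 300,
    "Alabama": 5_000_000,
    "California": 39_000_000,
}

# Hypothetical customer counts, picked from the reported 80 to 95 range.
customers = {
    "US Minor Outlying Islands": 85,
    "Alabama": 90,
    "California": 88,
}


def per_100k(count, population):
    """Customers per 100,000 residents."""
    return count / population * 100_000


rates = {state: per_100k(customers[state], populations[state]) for state in populations}
highest = max(rates, key=rates.get)
lowest = min(rates, key=rates.get)

for state, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{state}: {rate:,.2f} customers per 100k residents")
print(f"Ratio of highest to lowest rate: {rates[highest] / rates[lowest]:,.0f}x")
```

A near-uniform count per state translates into per-capita rates that differ by several orders of magnitude, which is one reason to suspect the data rather than the market.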

The first action is to review the quality of the sales data to see whether these ranges are accurate and, if not, what the root cause of the data collection issues is. If the sales data is accurate, then a very detailed examination of sales patterns needs to be completed to understand why higher revenue is not obtained from states such as California.


# References {.unnumbered}

Microsoft (2023) https://devblogs.microsoft.com/python/data-wrangler-release/

Ncr, P.C. et al. (1999) ‘CRISP-DM 1.0’ 

Hotz (2024) https://www.datascience-pm.com/crisp-dm-2/

Modern Retail (2023) https://www.modernretail.co/operations/jcpenney-is-the-latest-department-store-to-announce-a-major-turnaround-plan/

