# COGS 108 - Data Checkpoint

## Ann Yang, Caleb Kim, Elisa Le, Dylan Cheng
> Analysis, Background research, Conceptualization, Data curation, Experimental investigation, Methodology, Project administration, Software, Visualization, Writing – original draft, Writing – review & editing

- Ann Yang: Writing, Conceptualization, Experimental investigation
- Caleb Kim: Writing, Methodology, Project administration
- Elisa Le: Writing, Methodology, Conceptualization
- Dylan Cheng: Writing, Background research, Experimental investigation, Software

## Research Question

<b> How accurately can product attributes such as brand name, intended gender, color, and product type predict the price of a fashion item? </b>

The Kaggle dataset, Fashion Clothing Products Dataset, includes the following columns: ProductID, ProductName, ProductBrand, Gender, Price (INR), NumImages, Description, and PrimaryColor. 

We will develop a predictive model that uses product features such as brand name, intended gender, color, and product type to estimate the price of fashion items. To do this, we will use multiple regression models and ANOVA tests. Alongside prediction, this is a statistical inference task aimed at identifying which product attributes exert the strongest influence on pricing. By examining regression coefficients and evaluating model fit, we can determine which features most meaningfully drive price variation, offering insight into how different product characteristics shape market value.
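
As a rough illustration of the ANOVA idea mentioned above, the sketch below computes a one-way F statistic and eta squared (the share of price variance explained by group membership) for a single categorical attribute. The brand labels and prices are made-up toy values, not rows from the dataset, and `one_way_anova_f` is a hypothetical helper name; in the actual analysis a statistics library would likely do this work.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F statistic, eta squared) for a dict of group label -> list of prices."""
    all_values = [v for values in groups.values() for v in values]
    grand_mean = mean(all_values)
    k = len(groups)        # number of groups
    n = len(all_values)    # total number of observations

    # Variation of group means around the grand mean, weighted by group size
    ss_between = sum(len(vals) * (mean(vals) - grand_mean) ** 2 for vals in groups.values())
    # Variation of individual prices around their own group mean
    ss_within = sum((v - mean(vals)) ** 2 for vals in groups.values() for v in vals)

    f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
    eta_sq = ss_between / (ss_between + ss_within)
    return f_stat, eta_sq


# Toy data (assumption): three hypothetical brands with three prices each
toy_prices = {
    "BrandA": [10.0, 12.0, 14.0],
    "BrandB": [20.0, 22.0, 24.0],
    "BrandC": [30.0, 32.0, 34.0],
}

f_stat, eta_sq = one_way_anova_f(toy_prices)
print(f"F = {f_stat:.1f}, eta squared = {eta_sq:.3f}")  # F = 75.0, eta squared = 0.962
```

A large F with a high eta squared, as in this toy case, would suggest that the attribute separates prices well; the real data will almost certainly be noisier.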


## Background and Prior Work

Through this project, we seek to understand which product features have the biggest impact on price variation for fast-fashion brands. We plan to use the Fashion Clothing Products Dataset on Kaggle to run tests, including regression models and ANOVA, that identify which features carry the most weight: https://www.kaggle.com/datasets/shivamb/fashion-clothing-products-catalog/data?select=myntra_products_catalog.csv. Ideally, we will be able to extrapolate our findings to fast fashion, or to the fashion industry as a whole.

Fast fashion especially interested us because such a wide array of fast fashion brands price their products differently. For example, Shein's clothing is cheaper on average than that of another fast fashion brand, Zara. This inspired us to investigate which other factors contribute to product pricing and whether we can use those findings to predict prices.

Through our research, we found that brands and retailers play a significant role in how a product is priced. Tsidulko states “Fashion prices are determined by both the brands that design, manufacture, and market clothing and the retailers that sell it.”<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1) He explains that brands have the biggest influence on prices because they can mark up prices and also set promotions and discounts on their products. Tsidulko also adds that “budget brands typically sell at low prices,”<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1) which shows that brand identity sets the tone for how much a consumer should pay for a product. This reinforces our belief that a fast fashion brand has a large influence on pricing.

A study by Kopalle et al.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2) found that product type also influences the pricing of a product. They find a difference in how products are priced, specifically between fashion and stable goods, which suggests that the category of a product has a significant impact on its price. This implies that products within the fashion industry vary in price by product type. An example would be the price of a shirt compared to the price of a shoe.

Another study<a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3), based on Amazon prices, showed a significant difference between the prices of products marketed to women and those marketed to men. Although a study based only on Amazon data is hard to generalize, this finding is still worth noting as we examine differences by intended gender in our dataset.

These sources work together to suggest that brand, product type, and intended gender all influence the pricing of fashion products.

1. <a name="cite_note-1"></a> [^](#cite_ref-1) Tsidulko, J. (2023) Fashion Pricing Strategy. https://www.oracle.com/retail/fashion/fashion-pricing-strategy/ 
2. <a name="cite_note-2"></a> [^](#cite_ref-2) Kopalle, P., Biswas, D., Chintagunta, P. K., Fan, J., Pauwels, K., Ratchford, B. T., & Sills, J. A. (2009). Retailer pricing and competitive effects. Journal of Retailing, 85(1), 56–70.  https://doi.org/10.1016/j.jretai.2008.11.005 
3. <a name="cite_note-3"></a> [^](#cite_ref-3) Springer, C. (2020) Paying the Pink Tax on a Blue Dress - Exploring Gender-Based Price-Premiums in Fashion Recommendations https://doi.org/10.1007/978-3-030-64266-2_12


## Hypothesis


<b> Product attributes, specifically brand name, intended gender, color, and product type, have a statistically significant effect on price, and variation in these features explains a meaningful portion of the overall price variance. We believe that we can then use these factors to predict prices. </b>

The studies above suggest that visible product attributes shape how fashion items are priced. Brand name captures differences in perceived quality and target audience, which allows some labels to charge more than their competitors. Intended gender reflects gender-based price differences in apparel, where products marketed to women or men can be priced differently even when they function similarly. Color can also signal trendiness and seasonality, influencing demand and therefore the price that retailers can sustain. Since these attributes all affect how products are perceived, we believe that brand, gender, color, and product type are correlated with the variation in fashion item prices.

# Data

### Dataset #1
  - Dataset Name: <b> Fashion Clothing Products Dataset </b>
  - Link to the dataset: <b> https://www.kaggle.com/datasets/shivamb/fashion-clothing-products-catalog </b>
  - Number of observations: <b> 12492 </b>
  - Number of variables: <b>8</b>
  - The variables that are most important to this project are ProductName and Price (INR). ProductName is important because it encompasses all of the different product features including Product Brand, the intended gender, product color, and the actual product itself. Price is important because this is the value that we're predicting.
  - Some shortcomings of our dataset are that it may lack the product features that best predict pricing and that it has a limited number of observations overall.

In [108]:
# Run this code every time when you're actively developing modules in .py files.  It's not needed if you aren't making modules
#
## this code is necessary for making sure that any modules we load are updated here 
## when their source code .py files are modified

%load_ext autoreload
%autoreload 2

In [109]:
# Setup code -- this only needs to be run once after cloning the repo!
# this code downloads the data from its source to the `data/00-raw/` directory
# if the data hasn't updated you don't need to do this again!

# if you don't already have these packages (you should!) uncomment this line
# %pip install requests tqdm

import sys
sys.path.append('./modules') # this tells python where to look for modules to import

import get_data # this is where we get the function we need to download data

# replace the urls and filenames in this list with your actual datafiles
# yes you can use Google drive share links or whatever
# format is a list of dictionaries; 
# each dict has keys of 
#   'url' where the resource is located
#   'filename' for the local filename where it will be stored 
datafiles = [
    { 'url': 'https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/airline-safety/airline-safety.csv', 'filename':'airline-safety.csv'},
    { 'url': 'https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/bad-drivers/bad-drivers.csv', 'filename':'bad-drivers.csv'}
]

get_data.get_raw(datafiles,destination_directory='data/00-raw/')

Successfully downloaded: airline-safety.csv
Successfully downloaded: bad-drivers.csv

### Fashion Clothing Products Dataset

This dataset encompasses a wide variety of products from many different brands. The dataset is fairly self-explanatory, detailing the following features of the fashion products: Product ID, Product Name, Product Brand, Intended Gender, Price in INR (Indian rupees), Number of Images, Product Description, and Primary Color.

The fields we are most concerned with for this project are the product's brand, its intended gender, its primary color, the product item itself, and its price. Although the columns of this dataset are fairly self-explanatory, here's a quick look at what each column contains.

The ProductID column is a unique ID for each product in the dataset, presumably assigned by the individual who created the dataset. The ProductName column details the brand, intended gender, color, and various other details of the product itself. The ProductBrand column details the brand of the product. The Gender column details the intended audience of the clothing or fashion piece. The Price (INR) column details the price in Indian rupees. This column is particularly important because it is the feature that we are trying to predict. The NumImages column details the number of images the product has on its website listing. The Description column is the product's description, presumably pulled from its listing page; it includes similar details to the ProductName column but goes into more detail regarding the fit, look, fade, buttons, etc. Finally, the PrimaryColor column details the primary color category of the product.

One major concern is whether the products in the dataset can be extrapolated to represent how clothing products are generally priced. Although there are many useful features for building the predictive model, one of the main drivers of a product's price is its brand name, and brands sit on a scale of popularity that cannot be objectively determined. Beyond this, our concerns are minor; one is how the ProductName column is structured and how our team can extract the keywords that describe each unique product, which we still need to work out.
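
One possible way to approach the ProductName concern above is sketched below. It scans a product name from the end for a word in a small vocabulary of product types and falls back to the last word when nothing matches. The `KNOWN_TYPES` vocabulary, the `extract_product_type` helper, and the example names are all hypothetical and would need to be built from the values that actually appear in the dataset.

```python
import re

# Hypothetical vocabulary (assumption); extend after inspecting real ProductName values
KNOWN_TYPES = ("perfume", "deodorant", "jeans", "shirt", "jacket", "bag")


def extract_product_type(product_name, known_types=KNOWN_TYPES):
    """Return the last known product-type word in the name, else the last word."""
    # Keep only letter-based tokens so sizes such as "50" do not interfere
    words = re.findall(r"[A-Za-z]+", product_name.lower())
    if not words:
        return None
    for word in reversed(words):
        if word in known_types:
            return word.capitalize()
    # Fallback: same behavior as taking the final word of the name
    return words[-1].capitalize()


# Made-up example names, not rows from the dataset
examples = [
    "BrandX Men Blue Slim Fit Jeans",
    "BrandY Women Eau de Parfum Perfume 50 ml",
    "BrandZ Unisex Black Tote",
]
for name in examples:
    print(name, "->", extract_product_type(name))
```

With pandas, a helper like this could be applied to the column via `df['ProductName'].apply(extract_product_type)`. The second example shows why a last-word rule alone can mislabel items such as fragrances.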

In [110]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE
import pandas as pd

df = pd.read_csv('myntra_products_catalog.csv')

# make data tidy: drop the free-text Description column
df_cols = df.columns
tidy_df = df.drop(columns='Description')

# dataset size as (rows, columns)
cdf_size = df.shape

# missing values in the dataset: rows with at least one null field
rows_null_val = tidy_df[tidy_df.isnull().any(axis=1)]
missing_data = 'Data seems to be missing only when the product is a scent'
missing_num = 894  # hard-coded count; len(rows_null_val) gives the live value

# clean dataset
def product_type(product_name):
    """Return the last word of the product name as a rough product type."""
    return product_name.split(' ')[-1]

# conversion factor applied to Price (INR); 0.011 is presumably an approximate INR-to-USD rate
price_conv = 0.011

clean_df = tidy_df.dropna()
clean_df = clean_df.drop_duplicates(subset='ProductName')
clean_df = clean_df.assign(ProductType=clean_df['ProductName'].apply(product_type))
# might want to bring this back to add more columns in regards to the product itself
clean_df = clean_df.drop(columns='ProductName')
clean_df = clean_df.rename(columns={'Price (INR)': 'Price'})
clean_df['Price'] = (clean_df['Price'] * price_conv).round(2)

clean_df.head()

# for next week, clean ProductType column for products where the last word isn't the actual product (i.e. fragrances, etc)
#clean_df.value_counts()

|   | ProductID | ProductBrand | Gender | Price | NumImages | PrimaryColor | ProductType |
|---|---|---|---|---|---|---|---|
| 0 | 10017413 | DKNY | Unisex | 129.2 | 7 | Black | Bag |
| 1 | 10016283 | EthnoVogue | Women | 63.91 | 7 | Beige | Jacket |
| 2 | 10009781 | SPYKAR | Women | 9.89 | 7 | Pink | Jeans |
| 3 | 10015921 | Raymond | Men | 61.59 | 5 | Blue | Suit |
| 4 | 10017833 | Parx | Men | 8.35 | 5 | White | Shirt |


## Ethics

Instructions: Keep the contents of this cell. For each item on the checklist
-  put an X there if you've considered the item
-  IF THE ITEM IS RELEVANT place a short paragraph after the checklist item discussing the issue.
  
Items on this checklist are meant to provoke discussion among good-faith actors who take their ethical responsibilities seriously. Your teams will document these discussions and decisions for posterity using this section.  You don't have to solve these problems, you just have to acknowledge any potential harm no matter how unlikely.

Here is a [list of real world examples](https://deon.drivendata.org/examples/) for each item in the checklist that you can refer to.

[![Deon badge](https://img.shields.io/badge/ethics%20checklist-deon-brightgreen.svg?style=popout-square)](http://deon.drivendata.org/)


### A. Data Collection
 - [X] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?


 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?

> Yes, one source of bias we identified is that our dataset was web-scraped. This means that we only get a snapshot in time of certain products. Additionally, product attributes such as NumImages and Description completeness will vary by product type and price tier. This may introduce bias that affects the observed price relationships, so we will treat our findings as correlational.

 - [X] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?

> Though the dataset doesn't contain personally identifiable information because it is product focused, we will not publish any raw URLs or full text content in the report. We will instead present findings at a high level and exclude any information that would not help us answer our research question.

 - [X] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?
       
> Gender is considered a sensitive attribute, especially in the context of fashion. There may be pricing differences between gendered products, so we will check whether model error differs across the gender groups. Overall, we will mention these concerns in our final write-up and do our best to evaluate price distributions by gender.

### B. Data Storage
 - [X] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?
 - [X] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?
 - [X] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?

### C. Analysis
 - [X] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?
 - [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?
> To examine the data for possible sources of bias, we will check for imbalances across attributes such as "ProductBrand," "Gender," and "PrimaryColor" because certain brands may dominate the dataset. We will also check our dataset for missing fields and consider controlling for those attributes to see whether our conclusions change.

 - [X] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?
> We understand the importance of clear visualizations, summary statistics, and reports, and we will choose ones appropriate for skewed price distributions. This avoids misleading interpretations driven by outliers, since fashion products span a wide range of prices. We will clearly define our terms, such as predictability, and account for outliers to make sure our conclusions are not exaggerated.

 - [X] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?
 - [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?
> We will do our best to document how we analyze the data, including how we handle any missing descriptions in the dataset. This will make the analysis reproducible if we need to re-run it later.
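
The imbalance check described in C.2 and the outlier concern in C.3 could look roughly like the sketch below. The gender labels and prices are toy values chosen for illustration, and `category_shares` is a hypothetical helper; with pandas, `value_counts(normalize=True)` on a column gives the same kind of share table.

```python
from collections import Counter
from statistics import mean, median


def category_shares(values):
    """Return each category's share of the rows, largest category first."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.most_common()}


# Toy values (assumption), not taken from the dataset
genders = ["Women", "Women", "Women", "Men", "Unisex"]
prices = [8.0, 9.0, 10.0, 11.0, 130.0]  # one expensive outlier

shares = category_shares(genders)
print(shares)  # {'Women': 0.6, 'Men': 0.2, 'Unisex': 0.2}

# With a skewed distribution the mean is pulled toward the outlier,
# so reporting the median alongside it gives a more honest summary.
print(f"mean = {mean(prices):.1f}, median = {median(prices):.1f}")  # mean = 33.6, median = 10.0
```

A category that holds most of the rows, or a mean far from the median, would be worth flagging in the write-up before interpreting any model results.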

### D. Modeling
 - [X] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?
> Even if we remove "Gender" from the model, other attributes may still be affected by gendered marketing. In other words, the model may infer gender from other clues such as brand or description. We will interpret our results (especially those related to gender) cautiously, noting that there may be pricing patterns in the catalog, to avoid suggesting that one gender's products are "worth more."
 - [X] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)
> We can compute error metrics such as MAE and RMSE separately for each Gender group to see whether performance differs. If one group has consistently higher error, we will document that limitation. Additionally, we will check whether a class imbalance or differing price ranges explain the gap.

 - [X] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?
> We plan to report metrics such as MAE, RMSE, and R^2 because each one captures a different kind of error; RMSE, for example, penalizes large overpricing errors more heavily. Comparing against baselines (such as the overall mean price) will help us contextualize our results.
 - [X] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?
> We will explain the decisions the models make clearly, using interpretable models such as Ridge regression. This will allow us to explain which attributes may be driving predictions. Brand and gender can be sensitive features, so their role needs to be justified and explainable should there be confusion.
 - [X] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?
> Because this model predicts price from the catalog attributes, not other features such as the quality or consumer behavior, prices may be influenced by factors not in the dataset. To account for this, our predictions will be treated as approximate, and we will include these limitations in our final project writeup. 
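
The per-group evaluation described in D.2 and D.3 might be sketched as follows. The rows are toy (group, actual price, predicted price) triples invented for illustration, and the helper names are hypothetical; a library such as scikit-learn provides equivalent metric functions.

```python
from math import sqrt


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error; penalizes large misses more than MAE does."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def r_squared(actual, predicted):
    """Share of price variance explained relative to predicting the mean."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Toy rows (assumption): (gender group, actual price, predicted price)
rows = [
    ("Women", 10.0, 12.0),
    ("Women", 20.0, 18.0),
    ("Men", 10.0, 10.0),
    ("Men", 30.0, 26.0),
]

actual = [a for _, a, _ in rows]
predicted = [p for _, _, p in rows]
# Baseline: always predict the overall mean price
baseline = [sum(actual) / len(actual)] * len(actual)

overall = {
    "mae": mae(actual, predicted),
    "rmse": rmse(actual, predicted),
    "r2": r_squared(actual, predicted),
    "baseline_mae": mae(actual, baseline),
}

# Same metrics computed separately for each group
by_group = {}
for group in sorted({g for g, _, _ in rows}):
    group_actual = [a for g, a, _ in rows if g == group]
    group_pred = [p for g, _, p in rows if g == group]
    by_group[group] = {"mae": mae(group_actual, group_pred),
                       "rmse": rmse(group_actual, group_pred)}

print(overall)
print(by_group)
```

In this toy case both groups share an MAE of 2.0, but the Men group has a higher RMSE because one prediction misses by a larger amount, which is the kind of gap the checklist items ask us to document.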

### E. Deployment
 - [X] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?
 - [X] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?
 - [X] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?
 - [X] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?
> There may be the case that the model could be interpreted to justify discriminatory pricing, especially for gendered pricing. To account for this risk, we will clearly state that this model only reflects a catalog pattern for a specific segment of time, and is not concerned with "fairness."


## Team Expectations 

* As a team, we will communicate through iMessages and respond to each other within 24 hours, or 6 hours if a deadline is approaching. We will schedule meetings as needed either virtually or in-person based on team members’ preference for that week. We will aim to meet weekly, but we will be flexible about rescheduling meetings.
* Feedback should be given in a polite way, but still be straightforward. An example would be “I think X could be an issue because of Y. What do you think?”
* When making decisions, we will aim to get a unanimous vote by compromising. If a unanimous vote is impossible, we will resort to taking the majority vote. Each member will have equal authority in each decision. If a teammate is not responsive when we need to make a decision, the teammate will not be able to participate in the vote unless they respond in time.
* Team members will specialize in tasks, but always have the opportunity to help with other tasks with confirmation. Separate tasks will be assigned to people with each assignment and all teammates must agree with task assignments. Our team will check the list of current tasks and progress by checking GitHub issues.
* We will make a plan and create deadlines based on assignments. We will update if plans ever change.
* A struggling member should message the group chat as soon as they have difficulties that they believe might hinder their performance on a task. Another teammate with more expertise should step in and help, dividing the issue up with other knowledgeable teammates.
* Expectations will be kept in a Google Doc.

## Project Timeline Proposal 

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 2/6  |  3:30 PM | Background research, research question, and hypothesis finalized. | Align on project scope, success criteria, and dataset selection. | 
| 2/11  |  3:30 PM |  Datasets gathered and documented (features, size, source). | Review dataset limitations and finalize analysis plan. | 
| 2/18  | 3:30 PM  | Data cleaning and preprocessing completed.  | Walk through cleaning decisions and confirm dataset readiness.   |
| 2/25  | 3:30 PM  | Initial EDA and at least two visualizations created. | Interpret early trends and decide on final visualizations.   |
| 3/4  | 3:30 PM  | Remaining visualization(s) and core analysis completed. | Discuss results and connect findings back to hypothesis. |
| 3/11  | 3:30 PM  |Draft of overview, results, conclusion, and ethics section written.| Peer review writing, identify gaps, and plan video structure. |
| 3/18  | Before 11:59 PM  | Final edits completed; notebook polished and video recorded. | Final walkthrough, submission check, Turn it in! |