# Report on Nike Product Dataset

#### Brief Description of the Dataset and Summary of Attributes

This dataset contains information on 112 Nike products collected by web scraping. It covers several attributes that are useful for product analysis:

- **Product Details**: Product URL, name, subtitle, brand, model number, and color.
- **Pricing and Availability**: Product price, currency, and availability status (like "InStock" or "OutOfStock").
- **Ratings and Reviews**: Consumer feedback in the form of average ratings and review counts, although these attributes are sparsely populated (only 23 of the 112 products have them).
- **Descriptions**: Two versions of the product description: a summarized plain-text version and a raw version with embedded HTML.
- **Images and Sizes**: Links to product images and a summary of available sizes.
- **Miscellaneous Data**: A unique ID for each product and a timestamp recording when the data was scraped.

The dataset is rich in product-related attributes but sparse in consumer interaction data, like ratings and reviews. This limits insight into customer preferences but leaves room for product-level analysis.
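
The sparsity described above can be quantified from the non-null counts that `data.info()` reports for this dataset. The following is a minimal sketch: the counts are copied from that output, while the helper name `missing_share` is only illustrative.

```python
# Non-null counts as reported by data.info() for the 112-row dataset.
TOTAL_ROWS = 112
non_null_counts = {
    "color": 110,
    "availability": 108,
    "avg_rating": 23,
    "review_count": 23,
    "images": 108,
    "available_sizes": 56,
}


def missing_share(non_null, total=TOTAL_ROWS):
    """Return the percentage of rows without a value, rounded to one decimal."""
    return round(100 * (total - non_null) / total, 1)


missing_pct = {col: missing_share(n) for col, n in non_null_counts.items()}

# List the columns from most to least sparse.
for col, pct in sorted(missing_pct.items(), key=lambda item: item[1], reverse=True):
    print(f"{col:<16} {pct:5.1f}% missing")
```

With the DataFrame loaded, `data.isna().mean() * 100` should give the same per-column percentages without typing the counts by hand.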



#### Initial Plan for Data Exploration

The plan was to analyze the structure of the dataset systematically and identify meaningful trends:

1. **Understanding Attributes**: Assess data completeness by checking for missing values, anomalies, and unusual patterns.
2. **Exploratory Analysis**:
   - Examine the distribution of prices and its relationship with availability.
   - Compare the sparse rating data with review counts.
   - Examine the variety of sizes and its relationship with stock status.
3. **Text Analysis**:
   - Identify keywords in product descriptions that reflect how the products are branded.
   - Analyze correlations between descriptive terms, product types, and pricing.
4. **Handling Missing Data**: Plan for missing values in columns such as `avg_rating`, `review_count`, and `available_sizes` via imputation, removal, or categorical inference.
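
The keyword identification in the text-analysis step could be sketched with the standard library alone. This is an illustration rather than the method used in the notebook: the stop-word list, the function names, and the sample descriptions are all assumptions, not values from the dataset.

```python
import re
from collections import Counter
from typing import List

# Hypothetical stop-word list; a real analysis would likely use a fuller one.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to", "with", "you", "your"}


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    # Hyphenated terms such as "sweat-wicking" are kept as a single token.
    words = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    return [word for word in words if word not in STOP_WORDS]


def keyword_counts(descriptions: List[str]) -> Counter:
    """Count how often each keyword appears across all descriptions."""
    counts = Counter()
    for text in descriptions:
        counts.update(tokenize(text))
    return counts


# Made-up descriptions for illustration only (not rows from the dataset).
sample_descriptions = [
    "Soft cotton tee with a relaxed fit for everyday comfort.",
    "Sweat-wicking fabric keeps you dry and the relaxed fit moves with you.",
]

counts = keyword_counts(sample_descriptions)
print(counts.most_common(3))
```

On the real data, the same function could be fed the cleaned `description` column, for example `keyword_counts(data["description"].tolist())`.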
 
#### Actions Taken for Data Cleaning and Feature Engineering
 
1. **Cleaning Data**:
   - Renamed columns to a consistent format and converted scrape timestamps into a standard datetime format.
   - Addressed missing values in columns such as `color`, `availability`, and `available_sizes`. For example, missing sizes were marked as "Unknown."
   - Cleaned the `raw_description` column by removing HTML tags to make the data suitable for text analysis.
2. **Feature Engineering**:
   - Created a binary variable indicating stock status (1 = In Stock, 0 = Out of Stock).
   - Calculated the number of available sizes per product as a new numerical feature.
   - Classified product prices into three categories, Low (<$50), Medium ($50–$100), and High (>$100), to detect price trends.
   - Derived a keyword-frequency feature from product descriptions through text processing.
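
The cleaning and feature-engineering steps above might look roughly like the following sketch. Several details are assumptions: the function names are illustrative, the `|` separator for `available_sizes` is a guess at the format, and treating any availability value other than "InStock" (including missing ones) as 0 is one possible choice, not necessarily the one made in the notebook.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(raw):
    """Remove HTML tags and collapse runs of whitespace."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    # Joining with a space keeps text from adjacent block elements apart.
    return " ".join(" ".join(parser.parts).split())


def stock_flag(availability):
    """Return 1 for "InStock" and 0 otherwise (assumption: missing counts as 0)."""
    return 1 if availability == "InStock" else 0


def size_count(available_sizes, sep="|"):
    """Count sizes in a delimited string; the separator is an assumed format."""
    if available_sizes == "Unknown":
        return 0
    return len([size for size in available_sizes.split(sep) if size.strip()])


def price_category(price):
    """Bucket a price into Low (<$50), Medium ($50-$100), or High (>$100)."""
    if price < 50:
        return "Low"
    if price <= 100:
        return "Medium"
    return "High"


# Made-up record for illustration only (not a row from the dataset).
record = {
    "raw_description": "<p>Soft <b>cotton</b> tee.</p><ul><li>Relaxed fit</li></ul>",
    "availability": "InStock",
    "available_sizes": "S | M | L",
    "price": 35.0,
}
print(strip_html(record["raw_description"]))
print(stock_flag(record["availability"]),
      size_count(record["available_sizes"]),
      price_category(record["price"]))
```

On the DataFrame, each helper could be applied column-wise, for example `data["price"].map(price_category)`; `pd.cut` is an alternative for the price buckets.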

#### Main Findings and Insights

- **Price Distribution**: Most products fall in the middle category ($50–$100), with a relatively small share at the premium level above $100. Low-priced products are often unavailable, possibly because demand for them is high.
- **Ratings and Reviews**: Review data are scarce, but most products with high ratings (4.5+) tend to have multiple reviews. This suggests a possible correlation between product popularity and customer engagement.
- **Sizes and Availability**: Products that offer more diverse sizes (for example, size range of XS to XL) tend to remain in stock for a longer period than those with limited or unspecified size ranges.
- **Stock Trends**: Stock availability appears correlated with pricing, as higher-priced items are more likely to remain in stock, possibly due to lower demand.

#### Hypotheses Formulated

1. **Pricing and Availability**: Higher-priced products are more likely to be available in stock compared to lower-priced items.
2. **Customer Engagement**: Products with higher average ratings also have a higher number of reviews, indicating that popularity influences feedback volume.
3. **Size Range and Stock Status**: More varied sizes increase a product's chances of still being in stock, implying it is a more versatile product.

**Testing the Hypothesis: Pricing and Availability**

To test the hypothesis that higher-priced items are more likely to be in stock, we set up the following hypotheses:

- **Null Hypothesis (H₀)**: There is no significant association between price and stock status.
- **Alternative Hypothesis (H₁)**: Higher-priced items are more likely to remain in stock.

We use a chi-squared test of independence between price category and stock status to determine whether the association is statistically significant. Because the test detects an association but not its direction, the in-stock rate of each price category indicates which way any association runs.

**Pending Results:** The statistical test that decides whether to reject the null hypothesis is carried out below.
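
As an illustration of the mechanics only, the sketch below computes the chi-squared statistic by hand for a 3×2 table of price category against stock status. The counts are made up and are not results from this dataset, and the function names are illustrative. The closed-form p-value used here holds only for 2 degrees of freedom, which is what a 3×2 table gives.

```python
import math


def chi_squared_statistic(table):
    """Return (chi2, dof) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, dof


def p_value_two_dof(chi2):
    """Survival function of the chi-squared distribution with exactly 2 dof."""
    return math.exp(-chi2 / 2)


# Hypothetical counts: rows are Low, Medium, High; columns are
# (in stock, out of stock). These are NOT taken from the dataset.
observed = [
    [10, 10],
    [30, 20],
    [20, 10],
]

chi2, dof = chi_squared_statistic(observed)
p_value = p_value_two_dof(chi2)
print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p_value:.3f}")

# Reject H0 at the 5% level only if the p-value is below 0.05.
print("reject H0" if p_value < 0.05 else "fail to reject H0")
```

In the notebook itself, the table could instead be built with `pd.crosstab` on the price-category and stock-flag columns and passed to `scipy.stats.chi2_contingency`, which returns the statistic, p-value, degrees of freedom, and expected counts for tables of any size.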

#### Recommendations for Future Work

1. **Data Enrichment**:
   - Collect more consumer feedback data, such as ratings and reviews, to improve customer behavior analysis.
   - Obtain historical sales data to track stock movement and project future trends.
2. **Advanced Analytics**:
   - Use machine learning to predict stock status based on features such as price, size, and description keywords.
   - Analyze review text for actionable insights into consumer sentiment.
3. **Additional Variables**:
   - Extend the analysis to seasonal demand, promotional events, and competitor pricing drawn from external data sources.

#### Data Quality Summary

This dataset is a good starting point for product-level analysis; its structured attributes include price, description, and availability. However, ratings and reviews are too sparse (23 of 112 products) to support a detailed analysis of customer satisfaction. Size information is missing for half of the products (only 56 of 112 rows have `available_sizes`), and consumer interaction data are scarce. Additional information, such as detailed sales records or marketing campaign data, would substantially improve the quality of insights. Historical data on sales trends and promotional events should be requested to complement the available dataset.

In [3]:
import pandas as pd

# Load the uploaded dataset to examine its structure and contents
file_path = r"D:\projects\-IBM-Machine-Learning\nike_data_2022_09.csv"
data = pd.read_csv(file_path)

# Display the first few rows and basic information about the dataset
data.head(), data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 112 entries, 0 to 111
Data columns (total 17 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   url              112 non-null    object 
 1   name             112 non-null    object 
 2   sub_title        112 non-null    object 
 3   brand            112 non-null    object 
 4   model            112 non-null    int64  
 5   color            110 non-null    object 
 6   price            112 non-null    float64
 7   currency         112 non-null    object 
 8   availability     108 non-null    object 
 9   description      112 non-null    object 
 10  raw_description  112 non-null    object 
 11  avg_rating       23 non-null     float64
 12  review_count     23 non-null     float64
 13  images           108 non-null    object 
 14  available_sizes  56 non-null     object 
 15  uniq_id          112 non-null    object 
 16  scraped_at       112 non-null    object 
dtypes: float64(3), i

(                                                 url  \
 0  https://www.nike.com/t/dri-fit-team-minnesota-...   
 1  https://www.nike.com/t/club-américa-womens-dri...   
 2  https://www.nike.com/t/sportswear-swoosh-mens-...   
 3  https://www.nike.com/t/dri-fit-one-luxe-big-ki...   
 4  https://www.nike.com/t/paris-saint-germain-rep...   
 
                                       name  \
 0  Nike Dri-FIT Team (MLB Minnesota Twins)   
 1                             Club América   
 2                   Nike Sportswear Swoosh   
 3                    Nike Dri-FIT One Luxe   
 4    Paris Saint-Germain Repel Academy AWF   
 
                                            sub_title brand     model  \
 0                          Men's Long-Sleeve T-Shirt  Nike  14226571   
 1           Women's Nike Dri-FIT Soccer Jersey Dress  Nike  13814665   
 2                                     Men's Overalls  Nike  13015648   
 3  Big Kids' (Girls') Printed Tights (Extended Size)  Nike  13809796   
 4     