#### Purpose of the Notebook:
- To explore the cleaned dataset for insights that could enhance the semantic search functionality.
- To visualize data distributions and relationships that can inform feature selection and model tuning.

#### Goals:
- Understand data distribution and quality.
- Identify patterns or trends in product attributes that might influence search relevance.
- Assess how different attributes correlate with each other, potentially guiding the semantic search algorithm.
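The correlation goal can be sketched as follows. This is a minimal example, assuming the numeric columns are `price`, `avg_rating`, and `review_count` (the names the dataset reports when it is loaded). The toy rows and the helper name `numeric_correlations` are invented for illustration and are not taken from the dataset or the notebook.

```python
import pandas as pd

# Column names match the dataset; the values are made-up toy rows (assumption).
toy = pd.DataFrame({
    "price": [10.0, 20.0, 30.0, 40.0],
    "avg_rating": [4.0, 4.2, 4.4, 4.6],
    "review_count": [40.0, 30.0, 20.0, 10.0],
})


def numeric_correlations(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the pairwise Pearson correlation of the numeric columns."""
    return frame.select_dtypes(include="number").corr()


corr = numeric_correlations(toy)
print(corr.round(2))
```

On the real data, `sns.heatmap(numeric_correlations(df), annot=True)` would render the same matrix as a heatmap.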

#### Insights Sought:
- Distribution of prices, ratings, and colors.
- Relationship between product descriptions and categories, to check semantic relevance.
- Size-availability patterns that might affect user queries.
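The distribution questions above could be approached along these lines. This is a sketch under stated assumptions: the toy rows are invented, the helper `count_sizes` is hypothetical, and the format of `parsed_sizes` (a stringified Python list per product) is a guess that should be checked against the actual CSV before reuse.

```python
import ast
from collections import Counter

import pandas as pd

# Toy rows for illustration only; the real values live in the CSV.
toy = pd.DataFrame({
    "price": [25.0, 40.0, 40.0, 90.0],
    "avg_rating": [4.5, 3.8, 4.1, 4.9],
    "color": ["Black", "White", "Black", "Red"],
    # ASSUMPTION: each cell is a stringified list of sizes.
    "parsed_sizes": ["['S', 'M']", "['M']", "['M', 'L']", "[]"],
})

# Numeric distribution: count, mean, std, min, quartiles, max.
price_summary = toy["price"].describe()

# Categorical distribution: number of products per color.
color_counts = toy["color"].value_counts()


def count_sizes(column: pd.Series) -> Counter:
    """Count how many products offer each size."""
    counts = Counter()
    for raw in column:
        # literal_eval safely parses the stringified list (assumed format).
        counts.update(ast.literal_eval(raw))
    return counts


size_counts = count_sizes(toy["parsed_sizes"])

print(price_summary)
print(color_counts)
print(size_counts)
```

The same calls apply to `avg_rating`, and `sns.histplot(df["price"])` would show the numeric distribution graphically once the real data is loaded.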


In [1]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
import nltk
from nltk.corpus import stopwords

In [3]:
# Download necessary NLTK data
nltk.download('stopwords')

# Load the cleaned dataset
df = pd.read_csv('../data/semantic_search_ready_data.csv')

# Basic dataset info
df.info()  # info() prints the summary itself and returns None, so no print() wrapper is needed

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 112 entries, 0 to 111
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   name            112 non-null    object 
 1   sub_title       112 non-null    object 
 2   color           112 non-null    object 
 3   price           112 non-null    float64
 4   description     112 non-null    object 
 5   avg_rating      112 non-null    float64
 6   review_count    112 non-null    float64
 7   parsed_sizes    112 non-null    object 
 8   dominant_color  112 non-null    object 
dtypes: float64(3), object(6)
memory usage: 8.0+ KB


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
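
With the stopword list available and the data loaded, the `description` text can be reduced to its most frequent meaningful terms. The sketch below is one possible approach, not the notebook's confirmed method: the helper `top_terms`, the toy descriptions, and the tiny stopword set are all hypothetical stand-ins so the example runs without NLTK data.

```python
import re
from collections import Counter


def top_terms(descriptions, stop_words, n=5):
    """Return the n most frequent non-stopword tokens across all descriptions."""
    counts = Counter()
    for text in descriptions:
        # Lowercase and keep alphabetic runs only (a deliberately simple tokenizer).
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in stop_words)
    return counts.most_common(n)


# Hypothetical stand-ins; in the notebook these would come from NLTK and the CSV.
toy_stop_words = {"a", "the", "for", "and", "with"}
toy_descriptions = [
    "A running shoe for the road",
    "The classic running shoe with a leather upper",
]

terms = top_terms(toy_descriptions, toy_stop_words, n=2)
print(terms)
```

In the notebook itself, the inputs would be `df["description"]` and `set(stopwords.words("english"))`, and the resulting counts could be passed to `WordCloud().generate_from_frequencies(dict(terms))`.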
