Step 1: Data Loading and Initial Inspection 📊
First, we'll load the dataset and get a quick overview, as demonstrated previously. This confirms that the data loaded correctly and helps us understand its structure.

In [2]:
import pandas as pd

# Load the dataset
df = pd.read_csv('C:/Users/ayesh/Downloads/ai_product_descriptions_dataset.csv')

# Display the first 5 rows
print("First 5 rows of the dataset:")
print(df.head())

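# Display concise summary (df.info() prints directly and returns None,
# so wrapping it in print() would add a stray "None" to the output)
print("\nConcise summary of the dataset:")
df.info()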

First 5 rows of the dataset:
   Product_ID                         Title     Category       Brand  \
0        1001    Wireless Bluetooth Earbuds  Electronics   SoundBeat   
1        1002           Men's Running Shoes      Fashion  SprintFlex   
2        1003          Smart LED TV 43 inch  Electronics    VivoView   
3        1004        Organic Green Tea Pack      Grocery  NatureLeaf   
4        1005  Portable Power Bank 10000mAh  Electronics   ChargePro   

                                        Key_Features  \
0  Bluetooth 5.3; Noise Cancellation; IPX5 Water ...   
1       Breathable Mesh; Lightweight; Anti-slip Sole   
2                   4K Ultra HD; Smart Apps; HDMI x3   
3            100% Organic; Antioxidant Rich; 25 Bags   
4              Fast Charging; Dual USB; Compact Size   

                                               Specs  \
0              Battery: 24h; Weight: 50g; Range: 10m   
1             Size: 7-12; Weight: 250g; Color: Black   
2  Screen: 43in; Resolution: 3840

Step 2: Text Preprocessing 🧹
Text data often needs cleaning before analysis. This involves converting text to lowercase, removing punctuation, and potentially removing common "stop words" (like "the", "is", "a") that don't add much meaning.

In [3]:
import re
from nltk.corpus import stopwords
import nltk

# Download stop words if not already downloaded
# (nltk.data.find raises LookupError when the resource is missing)
try:
    nltk.data.find('corpora/stopwords')
except LookupError:
    nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation (note: "5.3" becomes "53", "anti-slip" becomes "antislip")
    words = text.split() # Split into words
    words = [word for word in words if word not in stop_words] # Remove stop words
    return ' '.join(words)

# Apply preprocessing to the 'Description' column
df['Processed_Description'] = df['Description'].apply(preprocess_text)

print("\nDescriptions after preprocessing (first 5):")
for i, desc in enumerate(df['Processed_Description'].head()):
    print(f"Product {df['Product_ID'][i]}: {desc}")


Descriptions after preprocessing (first 5):
Product 1001: enjoy crystalclear sound bluetooth 53 technology ipx5 water resistance active lifestyles perfect workouts travel
Product 1002: designed performance comfort running shoes feature breathable mesh upper antislip sole stability terrains
Product 1003: immerse stunning 4k visuals smart features give access popular streaming apps
Product 1004: enjoy refreshing healthy cup green tea packed natural antioxidants pack contains 25 tea bags premium quality leaves
Product 1005: stay charged onthego compact powerful 10000mah power bank supports fast charging multiple devices simultaneously
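
The output above shows a side effect of stripping every punctuation character: "5.3" becomes "53" and "crystal-clear" becomes "crystalclear". If those tokens matter, one option is to tokenize with a pattern that keeps hyphens and periods inside a word. The sketch below is only an illustration: the function name `preprocess_text_keep_tokens` and the small `STOP_WORDS` set are assumptions standing in for the NLTK list used above.

```python
import re

# Small illustrative stop word set (an assumption); in the notebook this
# would be the NLTK English list loaded in the cell above.
STOP_WORDS = {"the", "is", "a", "and", "for", "with", "on", "all", "these"}

# A token is a run of letters/digits, optionally joined by single
# hyphens or periods (e.g. "5.3", "crystal-clear", "on-the-go").
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:[.\-][a-z0-9]+)*")

def preprocess_text_keep_tokens(text):
    """Lowercase, tokenize, and drop stop words, keeping inner '.' and '-'."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)

example = "Enjoy crystal-clear sound with Bluetooth 5.3 technology."
print(preprocess_text_keep_tokens(example))
```

Keep in mind that `TfidfVectorizer` splits on punctuation with its default `token_pattern`, so carrying these tokens through Step 3 would also require passing a custom `token_pattern` there.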


In [4]:
pip install nltk

Collecting nltk
  Using cached nltk-3.9.1-py3-none-any.whl.metadata (2.9 kB)
Using cached nltk-3.9.1-py3-none-any.whl (1.5 MB)
Installing collected packages: nltk
Successfully installed nltk-3.9.1
Note: you may need to restart the kernel to use updated packages.


Step 3: Feature Extraction (Text Vectorization) ➡️🔢
To analyze text quantitatively, we need to convert it into numerical representations. A common method is TF-IDF (Term Frequency-Inverse Document Frequency), which reflects how important a word is to a document in a collection or corpus.

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=100) # Limiting to top 100 features for simplicity

# Fit and transform the processed descriptions
tfidf_matrix = tfidf_vectorizer.fit_transform(df['Processed_Description'])

# Get feature names (words)
feature_names = tfidf_vectorizer.get_feature_names_out()

print("\nTF-IDF matrix shape:", tfidf_matrix.shape)
print("First 10 features (alphabetical order):")
print(feature_names[:10])


TF-IDF matrix shape: (10, 100)
First 10 features (alphabetical order):
['10000mah' '15l' '25' '4k' '53' 'access' 'active' 'antioxidants'
 'antislip' 'apps']
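
To make the TF-IDF weights less of a black box, here is a minimal hand-rolled sketch of the calculation on three toy documents. It uses the smoothed formula that scikit-learn documents as the `TfidfVectorizer` default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The helper `tfidf_rows` and the toy documents are assumptions for illustration; this is not the library's implementation, and its whitespace tokenization is simpler than the vectorizer's.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) with one L2-normalized TF-IDF row per document."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        # Scale each row to unit length; guard against an all-zero row.
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

docs = ["green tea pack", "green power bank", "tea bags"]
vocab, rows = tfidf_rows(docs)
for term, weight in zip(vocab, rows[0]):
    print(f"{term:>6}: {weight:.3f}")
```

In the first document, "pack" appears in only one document and so receives a higher weight than "green" or "tea", which each appear in two.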


Step 4: Simple Analysis: Highest-Scoring Words 📈
Now that we have numerical representations, we can perform simple analyses, such as finding the words with the highest total TF-IDF score across the entire corpus. This ranks words by summed weight rather than by raw frequency, but it can still give insights into common themes across product descriptions.

In [5]:
import numpy as np

# Sum the TF-IDF scores for each word across all documents
word_scores = np.asarray(tfidf_matrix.sum(axis=0)).flatten()

# Create a DataFrame of words and their TF-IDF scores
word_df = pd.DataFrame({'word': feature_names, 'score': word_scores})

# Sort by score in descending order
word_df = word_df.sort_values(by='score', ascending=False)

print("\nTop 10 most important words across all descriptions (by TF-IDF score):")
print(word_df.head(10))


Top 10 most important words across all descriptions (by TF-IDF score):
           word     score
98        water  0.630335
78  performance  0.598423
69        mouse  0.585043
97          use  0.548852
44     features  0.532390
99     workouts  0.532254
96       travel  0.515955
16         boil  0.481184
95          tea  0.462217
38        enjoy  0.456088
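
A summed TF-IDF score is not the same thing as a frequency count. For comparison, the sketch below counts raw occurrences and document frequency with the standard library, using two processed descriptions copied from the Step 2 output. The variable names are assumptions for illustration.

```python
from collections import Counter

# Two processed descriptions copied from the Step 2 output.
processed = [
    "enjoy crystalclear sound bluetooth 53 technology ipx5 water resistance "
    "active lifestyles perfect workouts travel",
    "enjoy refreshing healthy cup green tea packed natural antioxidants "
    "pack contains 25 tea bags premium quality leaves",
]

# Term frequency: total occurrences across all descriptions.
term_counts = Counter(tok for doc in processed for tok in doc.split())
# Document frequency: number of descriptions containing the term.
doc_counts = Counter(tok for doc in processed for tok in set(doc.split()))

for term, count in term_counts.most_common(3):
    print(f"{term}: {count} occurrence(s) in {doc_counts[term]} description(s)")
```

Here "tea" and "enjoy" both occur twice, but "tea" is concentrated in one description while "enjoy" is spread across two, which is the kind of distinction TF-IDF weighting is designed to capture.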


Step 5: Generating "Optimized" Descriptions (Hypothesis Formulation) 💡
For A/B testing, you need a control (original description) and a variant (optimized description). In a real scenario, this optimization would come from insights (e.g., adding keywords, making it more concise, highlighting benefits). Here, we'll simulate a simple optimization: making descriptions slightly more "benefit-oriented" or "action-oriented" by appending a phrase.

In [6]:
# Create a 'Variant_Description' column by simulating an "optimization"
# This is a placeholder for a real optimization strategy (e.g., using NLP models to generate better text)
df['Variant_Description'] = df['Description'].apply(lambda x: x + " Experience the difference today!")

print("\nOriginal vs. Variant Descriptions (first 2 examples):")
print("Product ID:", df['Product_ID'][0])
print("Original:", df['Description'][0])
print("Variant :", df['Variant_Description'][0])
print("\nProduct ID:", df['Product_ID'][1])
print("Original:", df['Description'][1])
print("Variant :", df['Variant_Description'][1])


Original vs. Variant Descriptions (first 2 examples):
Product ID: 1001
Original: Enjoy crystal-clear sound with Bluetooth 5.3 technology and IPX5 water resistance for active lifestyles. Perfect for workouts and travel.
Variant : Enjoy crystal-clear sound with Bluetooth 5.3 technology and IPX5 water resistance for active lifestyles. Perfect for workouts and travel. Experience the difference today!

Product ID: 1002
Original: Designed for performance and comfort, these running shoes feature a breathable mesh upper and anti-slip sole for stability on all terrains.
Variant : Designed for performance and comfort, these running shoes feature a breathable mesh upper and anti-slip sole for stability on all terrains. Experience the difference today!


Step 6: A/B Test Setup 🧪
This is where we define the experiment. We'll assign users (or product views) to either the Control Group (sees original description) or the Variant Group (sees optimized description). We'll simulate user assignment and a conversion metric.

In [7]:
import numpy as np

# Simulate user IDs for the A/B test
num_users = 1000
user_ids = np.arange(1, num_users + 1)
np.random.shuffle(user_ids) # Shuffle to simulate random assignment

# Assign users to Control (A) or Variant (B) group
# We'll split them roughly 50/50
split_point = num_users // 2
control_group_users = user_ids[:split_point]
variant_group_users = user_ids[split_point:]

# Simulate a DataFrame for user interactions in the A/B test
# This would typically come from a logging system
ab_test_data = pd.DataFrame({
    'user_id': user_ids,
    'group': ['control'] * len(control_group_users) + ['variant'] * len(variant_group_users),
    'product_id_viewed': np.random.choice(df['Product_ID'], num_users), # Simulate users viewing a random product
})

# Simulate conversions with a group-specific rate
# Control group conversion rate: 8%
# Variant group conversion rate: 12% (simulating a positive uplift due to optimization)
conversion_rate = np.where(ab_test_data['group'] == 'control', 0.08, 0.12)
# A user converts when a uniform random draw falls below their group's rate
ab_test_data['conversion'] = (np.random.random(num_users) < conversion_rate).astype(int)


print("\nSimulated A/B Test Data (first 10 rows):")
print(ab_test_data.head(10))

print(f"\nControl group size: {len(control_group_users)}")
print(f"Variant group size: {len(variant_group_users)}")


Simulated A/B Test Data (first 10 rows):
   user_id    group  product_id_viewed  conversion
0      523  control               1004           0
1      115  control               1010           0
2       94  control               1010           0
3      649  control               1007           0
4      511  control               1009           0
5      863  control               1010           0
6      404  control               1010           0
7      346  control               1006           0
8      938  control               1005           1
9      740  control               1004           0

Control group size: 500
Variant group size: 500
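
The shuffle-and-split above works for a one-off simulation. In a live experiment, assignment usually also needs to be stable, so that the same user sees the same description on every visit. One common approach is to hash the user ID. The sketch below is an illustration under assumptions: the function name `assign_group` and the experiment label `"description_test"` are hypothetical.

```python
import hashlib

def assign_group(user_id, experiment="description_test", variant_share=0.5):
    """Deterministically map a user to 'control' or 'variant'."""
    # Including the experiment name gives each experiment its own split.
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Interpret the first 8 hex digits as a number in [0, 1).
    bucket = int(digest[:8], 16) / 16**8
    return "variant" if bucket < variant_share else "control"

groups = [assign_group(uid) for uid in range(1, 1001)]
print("control:", groups.count("control"), "variant:", groups.count("variant"))
```

Unlike the exact 500/500 split above, the group sizes here are only approximately equal, which is normal for hash-based bucketing.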


Step 7: Statistical Analysis of A/B Test Results 🔬
After running the experiment for a sufficient period and collecting enough data, we analyze the results to determine if the variant performed significantly better than the control. We'll use statistical hypothesis testing.

In [8]:
from scipy import stats

# Calculate conversion rates for each group
control_conversions = ab_test_data[ab_test_data['group'] == 'control']['conversion'].sum()
control_total = len(ab_test_data[ab_test_data['group'] == 'control'])
control_cr = control_conversions / control_total if control_total > 0 else 0

variant_conversions = ab_test_data[ab_test_data['group'] == 'variant']['conversion'].sum()
variant_total = len(ab_test_data[ab_test_data['group'] == 'variant'])
variant_cr = variant_conversions / variant_total if variant_total > 0 else 0

print(f"\nControl Group Conversion Rate: {control_cr:.4f}")
print(f"Variant Group Conversion Rate: {variant_cr:.4f}")

# Perform a Z-test for two proportions
# H0 (Null Hypothesis): There is no difference in conversion rates between groups.
# H1 (Alternative Hypothesis): There is a difference in conversion rates between groups.

# Number of successes (conversions) and observations (total users) for each group
n_control = control_total
x_control = control_conversions
n_variant = variant_total
x_variant = variant_conversions

# Calculate pooled proportion
p_pooled = (x_control + x_variant) / (n_control + n_variant)

# Calculate standard error
se = np.sqrt(p_pooled * (1 - p_pooled) * (1/n_control + 1/n_variant))

# Calculate Z-statistic
if se == 0: # Avoid division by zero if sample sizes are tiny or p_pooled is 0 or 1
    z_stat = 0
else:
    z_stat = (variant_cr - control_cr) / se

# Calculate p-value (two-tailed test)
p_value = 2 * (1 - stats.norm.cdf(abs(z_stat)))

print(f"Z-statistic: {z_stat:.4f}")
print(f"P-value: {p_value:.4f}")

# Determine statistical significance
alpha = 0.05 # Significance level
if p_value < alpha:
    print("\nResult: The difference is statistically significant. Reject the null hypothesis.")
    # The test is two-tailed, so check the sign before claiming an improvement
    if z_stat > 0:
        print("Conclusion: The Variant (optimized description) likely led to a higher conversion rate.")
    else:
        print("Conclusion: The Variant (optimized description) likely led to a lower conversion rate.")
else:
    print("\nResult: The difference is NOT statistically significant. Fail to reject the null hypothesis.")
    print("Conclusion: There is no strong evidence that the Variant improved conversion rate.")


Control Group Conversion Rate: 0.0840
Variant Group Conversion Rate: 0.1360
Z-statistic: 2.6277
P-value: 0.0086

Result: The difference is statistically significant. Reject the null hypothesis.
Conclusion: The Variant (optimized description) likely led to a higher conversion rate.
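
A p-value says whether a difference is detectable, not how large it is. A confidence interval for the uplift complements the test. The sketch below computes a simple Wald interval for the difference in conversion rates, using the counts implied by the rates printed above (0.0840 × 500 = 42 and 0.1360 × 500 = 68). The function name `diff_confidence_interval` is an assumption, and the Wald interval is one of several possible choices.

```python
from math import sqrt
from statistics import NormalDist

def diff_confidence_interval(x_control, n_control, x_variant, n_variant, level=0.95):
    """Wald confidence interval for (variant rate - control rate)."""
    p_c = x_control / n_control
    p_v = x_variant / n_variant
    # Unpooled standard error: the interval does not assume the rates are equal.
    se = sqrt(p_c * (1 - p_c) / n_control + p_v * (1 - p_v) / n_variant)
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    diff = p_v - p_c
    return diff - z_crit * se, diff + z_crit * se

# Counts implied by the conversion rates printed above.
low, high = diff_confidence_interval(42, 500, 68, 500)
print(f"Uplift: {68/500 - 42/500:.4f}, 95% CI: [{low:.4f}, {high:.4f}]")
```

An interval that excludes zero agrees with the significant z-test, and its width shows how uncertain the size of the uplift still is at 500 users per group.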


Step 8: Decision Making and Deployment 🚀
Based on the A/B test analysis:

If the variant performs significantly better: Deploy the optimized descriptions to all users.

If there's no significant difference or the control performs better: Stick with the original, or iterate on a new variant.
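
Before iterating on a new variant, it can help to estimate how many users each group needs for the test to have a reasonable chance of detecting the expected uplift. The sketch below uses the standard normal-approximation formula for comparing two proportions. The function name `users_per_group` and the default 80% power are assumptions for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist

def users_per_group(p_control, p_variant, alpha=0.05, power=0.80):
    """Approximate users needed per group for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    p_bar = (p_control + p_variant) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_power * sqrt(p_control * (1 - p_control) + p_variant * (1 - p_variant))
    ) ** 2
    return ceil(numerator / (p_variant - p_control) ** 2)

# Rates used in the Step 6 simulation: 8% control, 12% variant.
print(users_per_group(0.08, 0.12))
```

Smaller expected uplifts require substantially more users, so this estimate is worth running before choosing how long to run the next experiment.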

Step 9: Monitoring and Iteration 🔄
After deployment, continuously monitor the performance of the new descriptions to ensure the uplift is sustained. This also feeds into future optimization cycles.
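
One simple way to monitor the deployed descriptions is to compare each day's conversion rate against a lower control limit derived from the rate measured during the test. The sketch below is a rough illustration under assumptions: the function name `flag_low_days`, the 3-standard-error threshold, and the daily figures are all hypothetical, and only the 0.136 baseline comes from the variant rate reported in Step 7.

```python
from math import sqrt

def flag_low_days(daily_counts, baseline_rate, z=3.0):
    """Return indices of days whose conversion rate is below the lower control limit."""
    flagged = []
    for day, (conversions, visitors) in enumerate(daily_counts):
        if visitors == 0:
            continue  # No traffic, nothing to evaluate
        # Standard error of a proportion at the baseline rate for this day's traffic
        se = sqrt(baseline_rate * (1 - baseline_rate) / visitors)
        lower_limit = baseline_rate - z * se
        if conversions / visitors < lower_limit:
            flagged.append(day)
    return flagged

# Hypothetical post-deployment data: (conversions, visitors) per day.
daily = [(140, 1000), (131, 1000), (90, 1000), (138, 1000)]
print(flag_low_days(daily, baseline_rate=0.136))
```

A flagged day is a prompt to investigate, not proof of a regression, since seasonality or traffic mix can also move the rate.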