### 📌 Loading the Original Cleaned Dataset for Bootstrapping

The cleaned dataset of real reviews is loaded here as the source for bootstrap-based synthetic data generation.


In [1]:
import pandas as pd
import numpy as np

# loading original cleaned dataset

df = pd.read_csv("apr_finaldata.csv")

print(df.shape)
df.head()


(8856, 11)


Unnamed: 0,product_id,discounted_price,actual_price,discount_percentage,rating,rating_count,user_id,user_name,review_id,review_title,review_content
0,B07JW9H4J1,399.0,1099.0,64.0,4.2,24269.0,AG3D6O4STAQKAY2UVGEUV46KN35Q,Manav,R3HXWT0LRP0NMF,Satisfied,Looks durable Charging is fine tooNo complains
1,B07JW9H4J1,399.0,1099.0,64.0,4.2,24269.0,AHMY5CWJMMK5BJRBBSNLYT3ONILA,Adarsh gupta,R2AJM3LFTLZHFO,Charging is really fast,Charging is really fast
2,B07JW9H4J1,399.0,1099.0,64.0,4.2,24269.0,AHCTC6ULH4XB6YHDY6PCH2R772LQ,Sundeep,R6AQJGUP6P86,Value for money,good product.
3,B07JW9H4J1,399.0,1099.0,64.0,4.2,24269.0,AGYHHIERNXKA6P5T7CZLXKVPT7IQ,S.Sayeed Ahmed,R1KD19VHEDV0OR,Product review,Till now satisfied with the quality.
4,B07JW9H4J1,399.0,1099.0,64.0,4.2,24269.0,AG4OGOFWXJZTQ2HKYIOCOY3KXF2Q,jaspreet singh,R3C02RMYQMK6FC,Good quality,This is a good product . The charging speed is...


In [2]:
df['rating'].unique()

array(['4.2', '4.0', '3.9', '4.1', '4.3', '4.4', '4.5', '3.7', '3.3',
       '3.6', '3.4', '3.8', '3.5', '4.6', '3.2', '5.0', '4.7', '3.0',
       '2.8', '4', '3.1', '4.8', '2.3', '|', '2', '3', '2.6', '2.9'],
      dtype=object)

- The `rating` column contains an unusual entry, `'|'`. Rows with that value are dropped from the dataset.

In [3]:
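# drop rows with the stray '|' rating; .copy() keeps later column assignments on an independent frame
df = df[df['rating'] != '|'].copy()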


In [4]:
df['rating'].unique()

array(['4.2', '4.0', '3.9', '4.1', '4.3', '4.4', '4.5', '3.7', '3.3',
       '3.6', '3.4', '3.8', '3.5', '4.6', '3.2', '5.0', '4.7', '3.0',
       '2.8', '4', '3.1', '4.8', '2.3', '2', '3', '2.6', '2.9'],
      dtype=object)
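Filtering on the literal `'|'` works for this file. A more general option, sketched below, is to coerce the column to numeric and drop whatever fails to parse. The `toy` frame is hypothetical and only mimics the string-typed `rating` column shown above.

```python
import pandas as pd

# Hypothetical toy frame mimicking the string-typed rating column
toy = pd.DataFrame({"rating": ["4.2", "4", "|", "3.9", "4.0"]})

# Entries that are not numbers (such as '|') become NaN
parsed = pd.to_numeric(toy["rating"], errors="coerce")
is_valid = parsed.notna()

# Keep only the rows that parsed, and store the numeric values
clean = toy.loc[is_valid].assign(rating=parsed[is_valid])

print(clean["rating"].tolist())
print("rows dropped:", int((~is_valid).sum()))
```

This catches any non-numeric token, not only `'|'`, at the cost of silently dropping rows, so printing the number of dropped rows is a reasonable safeguard.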

**Check Class Imbalance in Ratings and Sentiment**

This cell computes the percentage-wise rating distribution of the original data, which serves as the statistical reference for synthetic data generation. It records the **baseline rating behavior** against which the synthetic dataset can be compared.



In [5]:
rating_dist = (
    df['rating'].value_counts(normalize=True)
      .sort_index()
      * 100
)

print("Rating distribution (%)")
print(rating_dist)


Rating distribution (%)
rating
2       0.022601
2.3     0.056504
2.6     0.079105
2.8     0.101706
2.9     0.079105
3       0.079105
3.0     0.135609
3.1     0.226014
3.2     0.124308
3.3     0.949260
3.4     0.745847
3.5     1.988925
3.6     2.542660
3.7     2.825178
3.8     6.328399
3.9     8.543338
4       3.740536
4.0     7.695785
4.1    16.951068
4.2    15.662787
4.3    15.583682
4.4     8.441632
4.5     5.243530
4.6     1.084868
4.7     0.418126
4.8     0.203413
5.0     0.146909
Name: proportion, dtype: float64


In [6]:
df['rating'] = pd.to_numeric(df['rating'], errors='coerce')
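The distribution printed in cell 5 lists `4` and `4.0` (and `3` and `3.0`) as separate ratings because the column was still text at that point. The sketch below, on a hypothetical `raw` series, shows how counting after numeric conversion merges those labels.

```python
import pandas as pd

# Hypothetical ratings stored as text, with two spellings of the same value
raw = pd.Series(["4", "4.0", "4.0", "3", "3.0", "4.1"], name="rating")

# Counting the strings treats '4' and '4.0' as different categories
as_text = raw.value_counts().sort_index()

# Counting after conversion merges them into a single numeric value
as_number = pd.to_numeric(raw).value_counts().sort_index()

print(as_text)
print(as_number)
```

Running the conversion before the distribution check would give one row per distinct numeric rating.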

**Create Rating Bins (For Balancing)**

Ratings are divided into four ordinal categories using `pd.cut`:
  - **Low** (0–2.5)
  - **Medium** (2.5–3.5)
  - **High** (3.5–4.5)
  - **Very_High** (4.5–5.1)

*Note: This step is essential because **bootstrap resampling can amplify minority or majority classes**, depending on the sampling strategy. Detecting imbalance beforehand allows us to plan for **balanced synthetic data generation**.*
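The ranges above share their boundary values, so it helps to see which bin an edge rating lands in. By default `pd.cut` uses right-closed intervals, so a rating equal to an upper edge stays in the lower bin. A small sketch with a hypothetical `sample` series:

```python
import pandas as pd

edges = [0, 2.5, 3.5, 4.5, 5.1]
labels = ["Low", "Medium", "High", "Very_High"]

# Hypothetical ratings chosen to sit on and around the bin edges
sample = pd.Series([2.0, 2.5, 3.5, 3.6, 4.5, 4.6, 5.0])

# Intervals are (0, 2.5], (2.5, 3.5], (3.5, 4.5], (4.5, 5.1];
# include_lowest=True additionally admits a rating of exactly 0
binned = pd.cut(sample, bins=edges, labels=labels, include_lowest=True)

print(pd.DataFrame({"rating": sample, "rating_bin": binned}))
```

Here 2.5 falls in `Low`, 3.5 in `Medium`, and 4.5 in `High`, while 4.6 is the first value in `Very_High`.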

In [7]:
df['rating_bin'] = pd.cut(
    df['rating'],
    bins=[0, 2.5, 3.5, 4.5, 5.1],
    labels=['Low', 'Medium', 'High', 'Very_High'],
    include_lowest=True
)

# Check imbalance before balancing
# (value_counts(dropna=False) would additionally count ratings that fell outside the bins)
print("Original rating_bin distribution:")
print(df['rating_bin'].value_counts(normalize=True) * 100)

Original rating_bin distribution:
rating_bin
High         93.558594
Medium        4.508984
Very_High     1.853317
Low           0.079105
Name: proportion, dtype: float64


In [8]:
df['rating'].dtype


dtype('float64')

**Balancing the Original Data**

Each rating bin is upsampled with replacement to the size of the largest bin (`High`). This ensures that *minority rating classes are not underrepresented*, which is critical for:
- Fair model training,
- Stable recommendation learning,
- Avoiding bias toward majority rating classes.


In [9]:
from sklearn.utils import resample

# Size per class
bin_sizes = df['rating_bin'].value_counts()
print("Original bin sizes:")
print(bin_sizes)

max_size = bin_sizes.max()
print("\nTarget size per bin:", max_size)

balanced_chunks = []

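# explicit observed=True: iterate only over bins that occur, without relying on the pandas default for categorical groupers
for rating_class, group_df in df.groupby('rating_bin', observed=True):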
    if len(group_df) == 0:
        continue

    group_upsampled = resample(
        group_df,
        replace=True,
        n_samples=max_size,
        random_state=42
    )

    balanced_chunks.append(group_upsampled)

df_balanced = pd.concat(balanced_chunks, ignore_index=True)

print("\nBalanced shape:", df_balanced.shape)

print("\nBalanced rating_bin distribution:")
print(df_balanced['rating_bin'].value_counts(normalize=True) * 100)


Original bin sizes:
rating_bin
High         8279
Medium        399
Very_High     164
Low             7
Name: count, dtype: int64

Target size per bin: 8279

Balanced shape: (33116, 12)

Balanced rating_bin distribution:
rating_bin
Low          25.0
Medium       25.0
High         25.0
Very_High    25.0
Name: proportion, dtype: float64




**Stratified Bootstrapping to Generate Synthetic Data**

In [10]:
original_n = len(df)
target_n = original_n * 12

print("Original rows:", original_n)
print("Target synthetic rows:", target_n)

synthetic = df_balanced.sample(
    n=target_n,
    replace=True,
    random_state=123
)

print("Synthetic shape:", synthetic.shape)


Original rows: 8849
Target synthetic rows: 106188
Synthetic shape: (106188, 12)



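This approach increases the data volume twelvefold (from 8,849 to 106,188 rows) for model training, while preserving:
- The **balanced class distribution** created earlier (approximately, because the rows are drawn at random),
- The **empirical variability** of the original dataset, since every synthetic row is a copy of an original row.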
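The draw in cell 10 is a simple random sample from `df_balanced`, so the bin shares land near 25% rather than exactly on it. If exact per-bin quotas are wanted, one option is to sample within each bin. A minimal sketch, assuming pandas 1.1 or later for `DataFrameGroupBy.sample`; `toy_balanced` and `per_bin` are hypothetical names.

```python
import pandas as pd

# Hypothetical balanced frame: 5 rows in each of the four bins
toy_balanced = pd.DataFrame({
    "rating_bin": ["Low", "Medium", "High", "Very_High"] * 5,
    "row": range(20),
})

per_bin = 50  # rows to draw from each bin

# Sampling inside each group fixes the number of rows per bin
stratified = toy_balanced.groupby("rating_bin").sample(
    n=per_bin, replace=True, random_state=123
)

shares = stratified["rating_bin"].value_counts(normalize=True) * 100
print(shares)
```

With equal quotas, every bin contributes exactly the same share of the output.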



**Sanity Checks on Synthetic Data**

In [11]:
#------------------------
# Rating Distribution
#------------------------
print("Synthetic rating distribution (%):")
print(
    synthetic['rating']
    .value_counts(normalize=True)
    .sort_index() * 100
)

#------------------------
# Rating-bin distribution
#------------------------
print("\nSynthetic rating_bin distribution (%):")
print(
    synthetic['rating_bin']
    .value_counts(normalize=True) * 100
)


Synthetic rating distribution (%):
rating
2.0     7.091197
2.3    17.839116
2.6     0.499115
2.8     0.571628
2.9     0.477455
3.0     1.101819
3.1     1.235545
3.2     0.625306
3.3     5.234113
3.4     4.120051
3.5    11.224432
3.6     0.714770
3.7     0.694052
3.8     1.663088
3.9     2.238483
4.0     3.253663
4.1     4.413870
4.2     4.179380
4.3     4.379968
4.4     2.296870
4.5     1.244020
4.6    14.351904
4.7     5.730403
4.8     2.877915
5.0     1.941839
Name: proportion, dtype: float64

Synthetic rating_bin distribution (%):
rating_bin
Medium       25.089464
High         25.078163
Low          24.930312
Very_High    24.902060
Name: proportion, dtype: float64


**Save Synthetic Dataset**

In [12]:
synthetic.to_csv("synthetic_stratified_rating_1Lakh.csv", index=False)

from google.colab import files
files.download("synthetic_stratified_rating_1Lakh.csv")
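The `google.colab` import above only exists inside Colab, so the cell fails elsewhere. A hedged sketch of a more portable save step follows; the toy frame and temporary path are hypothetical stand-ins for `synthetic` and its output file.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the synthetic frame
toy = pd.DataFrame({"rating": [4.2, 3.9], "rating_bin": ["High", "High"]})

# Write to a temporary directory so the sketch leaves no stray files
out_path = os.path.join(tempfile.mkdtemp(), "synthetic_toy.csv")
toy.to_csv(out_path, index=False)

# Offer a browser download only when running inside Colab
try:
    from google.colab import files
except ImportError:
    files = None

if files is not None:
    files.download(out_path)
else:
    print("Saved to", out_path)
```

Note that CSV does not store dtypes, so a categorical `rating_bin` column comes back as plain strings when the file is read again.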

