### Word Embedding Models (BERT)

#### 
Use pre-trained deep learning models, such as BERT or RoBERTa, to get rich, contextualized word embeddings for the `review_text` column.

**What This Means:** 
Word embeddings represent the semantic meaning of text in a high-dimensional space.
By averaging token embeddings, you get a single vector representing the entire review.
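
The averaging step can be illustrated with a small NumPy sketch. The token vectors and mask below are made-up toy values, not model output, and `mean_pool` is a hypothetical helper name. It averages only over real tokens, so padded slots do not dilute the review vector.

```python
import numpy as np

# Toy stand-in for token embeddings: 2 reviews, 3 token slots, 4 dimensions.
token_vectors = np.array([
    [[1.0, 1.0, 1.0, 1.0], [3.0, 3.0, 3.0, 3.0], [9.0, 9.0, 9.0, 9.0]],
    [[2.0, 2.0, 2.0, 2.0], [4.0, 4.0, 4.0, 4.0], [6.0, 6.0, 6.0, 6.0]],
])
# 1 marks a real token, 0 marks padding (the first review has one padded slot).
attention_mask = np.array([[1, 1, 0], [1, 1, 1]])


def mean_pool(vectors, mask):
    """Average token vectors per review, ignoring padded positions."""
    mask = mask[..., None].astype(vectors.dtype)  # shape: (reviews, tokens, 1)
    summed = (vectors * mask).sum(axis=1)         # sum over real tokens only
    counts = np.clip(mask.sum(axis=1), 1, None)   # avoid division by zero
    return summed / counts


review_vectors = mean_pool(token_vectors, attention_mask)
print(review_vectors.shape)  # (2, 4): one 4-dimensional vector per review
```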

**Why Use These?**
Embeddings capture more nuanced meanings compared to TF-IDF.
Useful for complex tasks where semantic understanding matters.
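
To make the contrast with TF-IDF concrete, here is a small sketch on three invented review snippets (they are not taken from the dataset). TF-IDF only sees word overlap, so two phrases that describe the same fit problem in different words come out as completely dissimilar.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented examples: the first two describe a similar problem with no shared words.
toy_reviews = [
    "runs small in the shoulders",
    "too tight across my chest",
    "runs small in the hips",
]

tfidf_matrix = TfidfVectorizer().fit_transform(toy_reviews)
similarities = cosine_similarity(tfidf_matrix)

# No shared vocabulary, so TF-IDF similarity is exactly zero.
print(similarities[0, 1])
# Shared words ("runs", "small", "in", "the") give a clearly positive similarity.
print(similarities[0, 2])
```

A contextual embedding model would be expected to place the first two snippets closer together, which is the nuance the bullet points above refer to.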

**Tools:** 
Use Hugging Face's transformers library to easily load and apply pre-trained models.

####
We also explored advanced techniques like embeddings and BERT. Embeddings convert words into dense numeric vectors, capturing semantic relationships and contextual meaning. BERT, in particular, is a state-of-the-art NLP model that processes text bidirectionally to understand the relationships between words.
For example, BERT can distinguish between “tight shoulders” and “tight hips,” making it highly effective for analyzing fit-specific feedback. 

In [1]:
# Standard library
import string
import zipfile

# Data handling and plotting
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
from nltk.corpus import stopwords

# Classical ML (scikit-learn) and dimensionality reduction
import umap
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.manifold import TSNE
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC

# Transformer models
import torch
from transformers import AutoModel, AutoTokenizer, RobertaModel, RobertaTokenizer

In [2]:
# Open the zip file
with zipfile.ZipFile('data.zip') as z:
    # Open the JSON file
    with z.open('data/renttherunway_final_data.json') as f:
        df_rtr = pd.read_json(f, lines=True)

df_rtr.head()

|  | fit | user_id | bust size | item_id | weight | rating | rented for | review_text | body type | review_summary | category | height | size | age | review_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | fit | 420272 | 34d | 2260466 | 137lbs | 10.0 | vacation | An adorable romper! Belt and zipper were a lit... | hourglass | So many compliments! | romper | 5' 8" | 14 | 28.0 | April 20, 2016 |
| 1 | fit | 273551 | 34b | 153475 | 132lbs | 10.0 | other | I rented this dress for a photo shoot. The the... | straight & narrow | I felt so glamourous!!! | gown | 5' 6" | 12 | 36.0 | June 18, 2013 |
| 2 | fit | 360448 |  | 1063761 |  | 10.0 | party | This hugged in all the right places! It was a ... |  | It was a great time to celebrate the (almost) ... | sheath | 5' 4" | 4 | 116.0 | December 14, 2015 |
| 3 | fit | 909926 | 34c | 126335 | 135lbs | 8.0 | formal affair | I rented this for my company's black tie award... | pear | Dress arrived on time and in perfect condition. | dress | 5' 5" | 8 | 34.0 | February 12, 2014 |
| 4 | fit | 151944 | 34b | 616682 | 145lbs | 10.0 | wedding | I have always been petite in my upper body and... | athletic | Was in love with this dress !!! | gown | 5' 9" | 12 | 27.0 | September 26, 2016 |


In [3]:
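# Combine review_summary and review_text.
# Missing values are replaced with empty strings so the combined text is never NaN.
df_rtr['combined_text'] = (df_rtr['review_summary'].fillna('') + " " + df_rtr['review_text'].fillna('')).str.strip()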

In [4]:
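# Encode Target Variable: Convert the 'fit' categories into numerical labels

# Desired ordinal order: small = 0, fit = 1, large = 2
custom_order = np.array(['small', 'fit', 'large'])

# Create LabelEncoder and set classes_ directly.
# Calling le.fit() instead would sort the classes alphabetically (fit, large, small).
le = LabelEncoder()
le.classes_ = custom_order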

# Transform target variable
df_rtr['fit_encoded'] = le.transform(df_rtr['fit'])

In [5]:
# Train-test split
X = df_rtr['combined_text']
y = df_rtr['fit_encoded']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [6]:
# Load pre-trained model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
model = AutoModel.from_pretrained('distilbert-base-uncased')
model = model.to('cuda' if torch.cuda.is_available() else 'cpu')  # Use GPU if available

In [7]:
# Tokenize and batch process
texts = df_rtr['combined_text'].tolist()
batch_size = 8  # Adjust based on memory availability
all_embeddings = []

In [8]:
# Process in batches
device = next(model.parameters()).device  # Same device the model was moved to
for i in range(0, len(texts), batch_size):
    batch_texts = texts[i:i + batch_size]
    inputs = tokenizer(batch_texts, padding=True, truncation=True, max_length=128, return_tensors='pt')
    inputs = {key: val.to(device) for key, val in inputs.items()}

    with torch.no_grad():
        outputs = model(**inputs)

    # Average token embeddings, using the attention mask so padding tokens are ignored
    mask = inputs['attention_mask'].unsqueeze(-1).float()
    summed = (outputs.last_hidden_state * mask).sum(dim=1)
    batch_embeddings = (summed / mask.sum(dim=1).clamp(min=1)).cpu().numpy()
    all_embeddings.append(batch_embeddings)



In [9]:
# Concatenate all embeddings into a single array
embeddings = np.vstack(all_embeddings)

####
Use these embeddings as features for classification or clustering.
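
As a rough sketch of the classification route, the block below trains a logistic regression on embedding-shaped features. The arrays are random stand-ins (named `demo_embeddings` and `demo_labels` so they do not clash with the notebook's variables), and the classifier settings are assumptions, so the scores it prints carry no meaning; only the workflow is the point.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Hypothetical stand-ins: 300 "reviews" with 768-dimensional vectors
# (DistilBERT's hidden size) and labels 0/1/2 for small/fit/large.
demo_embeddings = rng.normal(size=(300, 768))
demo_labels = rng.integers(0, 3, size=300)

# Split the embedding matrix itself so rows and labels stay aligned.
emb_train, emb_test, lab_train, lab_test = train_test_split(
    demo_embeddings, demo_labels, test_size=0.2, random_state=42, stratify=demo_labels
)

clf = LogisticRegression(max_iter=1000)
clf.fit(emb_train, lab_train)
demo_pred = clf.predict(emb_test)

print(classification_report(lab_test, demo_pred, zero_division=0))
```

With the real data, `embeddings` and `df_rtr['fit_encoded']` would take the place of the stand-ins, since both are computed over the full DataFrame in the same row order.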

#### Clustering and Visualization:

Apply dimensionality reduction to embeddings:

In [None]:
tsne = TSNE(n_components=2, random_state=42)
reduced_embeddings = tsne.fit_transform(embeddings) 
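
t-SNE can be slow on a large number of high-dimensional rows. One common workaround, sketched below on random stand-in data, is to draw a subsample and compress it with PCA before running t-SNE. The sample size (200) and the 50 PCA components are illustrative assumptions, not values used in this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(42)

# Hypothetical stand-in for the `embeddings` array (500 rows, 768 dimensions).
demo_embeddings = rng.normal(size=(500, 768))

# Draw a random subsample without replacement to keep t-SNE tractable.
sample_idx = rng.choice(len(demo_embeddings), size=200, replace=False)
sample = demo_embeddings[sample_idx]

# Compress to 50 dimensions first, then project to 2-D with t-SNE.
compressed = PCA(n_components=50, random_state=42).fit_transform(sample)
demo_2d = TSNE(n_components=2, perplexity=30, init="pca", random_state=42).fit_transform(compressed)

print(demo_2d.shape)  # (200, 2)
```

Keeping `sample_idx` around makes it possible to color the points later with the matching labels, for example `y.iloc[sample_idx]`.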

In [None]:
# Create a TF-IDF vectorizer
vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')

# Fit and transform the training data
X_train_vectorized = vectorizer.fit_transform(X_train)

tsne = TSNE(n_components=2, random_state=42, init='random')
reduced_embeddings = tsne.fit_transform(X_train_vectorized)  # Rows align one-to-one with y_train; overwrites the embedding projection above

In [None]:
# Vectorize text using TF-IDF
vectorizer = TfidfVectorizer(max_features=5000, stop_words="english")
X_train_vectorized = vectorizer.fit_transform(X_train)

# TSNE with sparse input (using init="random")
tsne = TSNE(n_components=2, random_state=42, init="random")
reduced_embeddings = tsne.fit_transform(X_train_vectorized)

# Visualize the TSNE output
plt.scatter(reduced_embeddings[:, 0], reduced_embeddings[:, 1], c=y_train, cmap="viridis", s=10, alpha=0.7)
plt.colorbar(label="Fit Categories")
plt.xlabel("TSNE Component 1")
plt.ylabel("TSNE Component 2")
plt.title("TSNE Clusters Colored by Fit")
plt.show()

In [None]:
# Scatter plot of the reduced embeddings
plt.scatter(reduced_embeddings[:, 0], reduced_embeddings[:, 1], s=10, alpha=0.7)
plt.title("TSNE Visualization of Embeddings")
plt.xlabel("TSNE Component 1")
plt.ylabel("TSNE Component 2")
plt.show()

####
Visualize clusters:

In [None]:
plt.scatter(reduced_embeddings[:, 0], reduced_embeddings[:, 1], c=y_train, cmap='viridis')
plt.colorbar()
plt.show()

#### Combine with Other Features:

Concatenate embeddings with numeric features like age or rating for richer analytics:

In [None]:
# embeddings is already a NumPy array (built with np.vstack), so no .numpy() call is needed
final_features = np.hstack([embeddings, df_rtr[['age', 'rating']].values])
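
The `age` and `rating` columns can contain missing values and sit on a different scale from the embedding dimensions. The sketch below shows one possible way to handle that before stacking: median imputation followed by standardization. The data is a small made-up stand-in, and both preprocessing choices are assumptions rather than steps taken in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical stand-ins: 4 reviews with 8-dimensional embeddings and two numeric columns.
demo_embeddings = rng.normal(size=(4, 8))
demo_df = pd.DataFrame({
    "age": [28.0, 36.0, np.nan, 34.0],
    "rating": [10.0, 10.0, 8.0, np.nan],
})

# Fill missing values with the column median, then scale to zero mean and unit variance.
numeric = SimpleImputer(strategy="median").fit_transform(demo_df[["age", "rating"]])
numeric = StandardScaler().fit_transform(numeric)

# Append the two numeric columns to the right of the embedding columns.
demo_features = np.hstack([demo_embeddings, numeric])
print(demo_features.shape)  # (4, 10)
```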

####
While powerful, BERT’s computational cost and complexity make it less practical for this relatively straightforward task. Simpler methods like TF-IDF paired with Logistic Regression proved more efficient and equally effective in our case.
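
For reference, a TF-IDF plus Logistic Regression baseline of the kind mentioned above can be sketched in a few lines. The texts and labels here are invented toy examples, and the default hyperparameters are assumptions; this is not the tuned pipeline whose results are being compared.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Invented toy reviews; labels follow the notebook's encoding (0 = small, 1 = fit, 2 = large).
toy_texts = [
    "runs small, too tight in the shoulders",
    "very tight, had to size up",
    "fit perfectly, true to size",
    "great fit and very comfortable",
    "way too big, had to size down",
    "runs large and loose in the hips",
]
toy_labels = [0, 0, 1, 1, 2, 2]

# Vectorize the text and classify it in a single estimator.
baseline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
baseline.fit(toy_texts, toy_labels)

toy_prediction = baseline.predict(["a bit tight in the chest"])
print(toy_prediction)
```

Because the whole pipeline runs on sparse word counts, it trains in seconds on a CPU, which is the efficiency advantage the conclusion points to.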