# Lexical Patterns in Sustainable vs Fast Fashion Products

This project investigates whether product text contains identifiable lexical markers that distinguish sustainable fashion items from fast fashion items. In this notebook the product name serves as the description text.

We analyze lexical patterns, check for dataset-source bias, and evaluate a TF-IDF logistic regression classifier.

In [None]:
import pandas as pd
import numpy as np
import re

from sklearn.utils import resample
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

In [None]:
# Note: `eco` is loaded here but not used in the cells below.
eco = pd.read_csv("/content/sustainable_eco.csv.csv", engine="python", on_bad_lines="skip")
india = pd.read_csv("/content/sustainable_india.csv.csv", engine="python", on_bad_lines="skip")
fashion = pd.read_csv("/content/fashion_general.csv.csv", engine="python", on_bad_lines="skip")

print("Loaded successfully")

Loaded successfully


In [None]:
india_clean = india.rename(columns={
    "Name": "product_name",
    "Brand": "brand",
    "Price": "price"
})

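india_clean["category"] = None
# The product name is used as the description text.
india_clean["description"] = india_clean["product_name"]
# Every row from this source is labeled sustainable (1).
india_clean["sustainability_label"] = 1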

india_clean = india_clean[
    ["product_name", "brand", "price", "category", "description", "sustainability_label"]
]

In [None]:
fashion_clean = fashion.rename(columns={
    "productDisplayName": "product_name",
    "masterCategory": "category"
})

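fashion_clean["brand"] = "unknown"
fashion_clean["price"] = None
# The product name is used as the description text.
fashion_clean["description"] = fashion_clean["product_name"]
# Every row from this source is labeled non-sustainable (0).
fashion_clean["sustainability_label"] = 0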

fashion_clean = fashion_clean[
    ["product_name", "brand", "price", "category", "description", "sustainability_label"]
]

In [None]:
final_df = pd.concat([india_clean, fashion_clean], ignore_index=True)

print(final_df["sustainability_label"].value_counts())
print(final_df.shape)

sustainability_label
0    44424
1      379
Name: count, dtype: int64
(44803, 6)


In [None]:
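majority = final_df[final_df.sustainability_label == 0]
minority = final_df[final_df.sustainability_label == 1]

# Downsample the majority class (without replacement) to the minority size.
majority_downsampled = resample(
    majority,
    replace=False,
    n_samples=len(minority),
    random_state=42
)

# Rows are not shuffled here: all class-0 rows come first, then all class-1 rows.
balanced_df = pd.concat([majority_downsampled, minority])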

print(balanced_df["sustainability_label"].value_counts())

sustainability_label
0    379
1    379
Name: count, dtype: int64


In [None]:
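def clean_text_func(text):
    """Lowercase the text, drop "khadi" and selected brand names, keep letters only."""
    text = str(text).lower()
    # Remove "khadi" and a few brand names so the classifier cannot rely on them.
    text = text.replace("khadi", "")

    brands_to_remove = ["nike", "adidas", "puma", "park avenue", "urban yoga"]
    for b in brands_to_remove:
        text = text.replace(b, "")

    # Replace every non-letter character with a space, then collapse whitespace.
    text = re.sub(r"[^a-zA-Z ]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()

    return text

balanced_df["clean_text"] = balanced_df["description"].apply(clean_text_func)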

In [None]:
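# Note: the vectorizer is fit on all rows before the train-test split, so the
# vocabulary and IDF weights are computed with the test rows included.
tfidf = TfidfVectorizer(max_features=300)
X = tfidf.fit_transform(balanced_df["clean_text"])
y = balanced_df["sustainability_label"].values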

In [None]:
# 80/20 split; it is not stratified, so the test set is not exactly balanced.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

In [None]:
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

In [None]:
y_pred = model.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[81  1]
 [ 1 69]]
              precision    recall  f1-score   support

           0       0.99      0.99      0.99        82
           1       0.99      0.99      0.99        70

    accuracy                           0.99       152
   macro avg       0.99      0.99      0.99       152
weighted avg       0.99      0.99      0.99       152



In [None]:
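feature_names = tfidf.get_feature_names_out()
coefficients = model.coef_[0]

# Positive coefficients push a prediction toward class 1 (sustainable),
# negative coefficients toward class 0 (non-sustainable).
top_positive = sorted(zip(coefficients, feature_names), reverse=True)[:15]
top_negative = sorted(zip(coefficients, feature_names))[:15]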

print("Top words predicting Sustainable (1):")
for coef, word in top_positive:
    print(word, coef)

print("\nTop words predicting Non-Sustainable (0):")
for coef, word in top_negative:
    print(word, coef)

Top words predicting Sustainable (1):
cotton 2.6036906351162794
east 1.955582958001724
linen 1.9158980274909503
dress 1.8322004670337393
phosphene 1.7976751521942287
lafaani 1.7976751521942287
denim 1.7742316108312166
cupro 1.7141108496654691
by 1.7132676506772782
studio 1.4667254541823136
amala 1.4667254541823136
pirani 1.457097137560792
anush 1.457097137560792
hemp 1.4369617854302752
house 1.3848946677458707

Top words predicting Non-Sustainable (0):
men -3.503870184274295
women -2.6880972324726398
black -2.216498853601138
printed -1.5221396778549436
white -1.184026529375331
blue -1.1630324977310913
brown -1.1014798394012069
sunglasses -1.0709370445935738
shoes -0.9291523980067732
wrangler -0.893598610056401
watch -0.890027652385932
earrings -0.8831945218690509
casual -0.7734908817708164
red -0.765885486851077
grey -0.7629677728465692


In [None]:
from sklearn.model_selection import cross_val_score

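# 5-fold cross-validation. For a classifier, an integer cv uses StratifiedKFold
# without shuffling, so the folds follow the row order of balanced_df.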
cv_scores = cross_val_score(
    LogisticRegression(max_iter=1000),
    X,
    y,
    cv=5,
    scoring="f1"
)

print("Cross-Validation F1 Scores:", cv_scores)
print("Mean F1 Score:", cv_scores.mean())
print("Std Deviation:", cv_scores.std())

Cross-Validation F1 Scores: [0.67826087 0.89051095 0.86567164 0.95890411 0.52427184]
Mean F1 Score: 0.7835238829021213
Std Deviation: 0.1595271963049026


## Cross-Validation Analysis

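The single train-test split produced an F1 score of 0.99, but 5-fold cross-validation showed substantial variance: per-fold F1 ranged from 0.52 to 0.96 (mean ≈ 0.78, std ≈ 0.16).

This indicates that lexical separation is present but not uniform across folds. The instability likely arises from dataset-source bias and the small number of sustainable samples (379). Several of the strongest sustainable predictors (for example phosphene, lafaani, amala) look like brand or label names rather than material terms, which suggests the model partly learns which source a product came from. The folds are also unshuffled, so each fold may cover a different slice of the source data.

The gap between the two estimates shows why a single split can give an overly optimistic evaluation, and why cross-validation is the safer basis for conclusions.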
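
One way to probe how much of the fold-to-fold variance comes from row order is to shuffle the folds and refit TF-IDF inside each training fold. The cell below is a minimal sketch of that idea, not the evaluation used above: the helper name `shuffled_cv_f1` and the toy texts are hypothetical, and only the vectorizer and model settings (`max_features=300`, `max_iter=1000`) are taken from this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline


def shuffled_cv_f1(texts, labels, n_splits=5, seed=42):
    """Return per-fold F1 scores with TF-IDF refit inside every fold.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    # The pipeline refits the vectorizer on each training fold, so the
    # held-out fold does not influence the vocabulary or IDF weights.
    pipeline = make_pipeline(
        TfidfVectorizer(max_features=300),
        LogisticRegression(max_iter=1000),
    )
    # shuffle=True mixes the rows before splitting; stratification keeps
    # the class ratio the same in every fold.
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return cross_val_score(pipeline, texts, labels, cv=cv, scoring="f1")


# Toy stand-in data (hypothetical, not drawn from the notebook's datasets).
toy_texts = [
    "organic cotton dress",
    "linen shirt handwoven",
    "hemp tote bag",
    "cupro wrap dress",
    "cotton denim jacket",
] * 2 + [
    "men black printed tshirt",
    "women white casual shoes",
    "blue sunglasses men",
    "brown watch women",
    "grey casual shoes",
] * 2
toy_labels = [1] * 10 + [0] * 10

toy_scores = shuffled_cv_f1(toy_texts, toy_labels)
print("Per-fold F1:", toy_scores)
print("Mean:", toy_scores.mean(), "Std:", toy_scores.std())
```

On the real data the call would be `shuffled_cv_f1(balanced_df["clean_text"], balanced_df["sustainability_label"])`. If the spread across folds shrinks noticeably after shuffling, row order likely contributed to the instability. Shuffling does not remove source bias itself, since every sustainable row still comes from one dataset.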