In [None]:
!wget -c https://zenodo.org/records/3355823/files/ecommerceDataset.csv

--2025-09-24 05:47:56--  https://zenodo.org/records/3355823/files/ecommerceDataset.csv
Resolving zenodo.org (zenodo.org)... 188.185.45.92, 188.185.43.25, 188.185.48.194, ...
Connecting to zenodo.org (zenodo.org)|188.185.45.92|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 36949114 (35M) [text/plain]
Saving to: ‘ecommerceDataset.csv’


2025-09-24 05:48:25 (1.24 MB/s) - ‘ecommerceDataset.csv’ saved [36949114/36949114]



The command **!wget -c https://zenodo.org/records/3355823/files/ecommerceDataset.csv** downloads the file at the given URL.

In this case, the file is ecommerceDataset.csv, a CSV hosted on Zenodo, a research data repository.

Here is a breakdown of the command:

!wget: wget is a command-line tool that retrieves files from the web. The leading exclamation mark ! tells the Jupyter notebook to run the line as a shell command.

-c: This flag stands for "continue". If a partial copy of the file already exists, for example from an interrupted download, wget resumes from where it left off instead of starting over.
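
Because `-c` exists to recover from interrupted downloads, it can be worth confirming that the file on disk is complete before loading it. The following is a minimal sketch, assuming the byte count reported in the wget log (36949114) is the expected size; the helper name `download_is_complete` is hypothetical and not part of the original notebook.

```python
import os

EXPECTED_BYTES = 36949114  # size reported in the wget log for ecommerceDataset.csv


def download_is_complete(path, expected_bytes=EXPECTED_BYTES):
    """Return True if the file exists and has exactly the expected size."""
    return os.path.isfile(path) and os.path.getsize(path) == expected_bytes


print(download_is_complete("ecommerceDataset.csv"))
```

If this prints `False`, re-running the wget cell with `-c` should pick up the remaining bytes.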

# **Step 1: Loading and Exploring the Data**

In [None]:
import pandas as pd
data = pd.read_csv("ecommerceDataset.csv",
                   names=['Labels','Description'])
data.head()

|   | Labels | Description |
|---|---|---|
| 0 | Household | Paper Plane Design Framed Wall Hanging Motivat... |
| 1 | Household | SAF 'Floral' Framed Painting (Wood, 30 inch x ... |
| 2 | Household | SAF 'UV Textured Modern Art Print Framed' Pain... |
| 3 | Household | SAF Flower Print Framed Painting (Synthetic, 1... |
| 4 | Household | Incredible Gifts India Wooden Happy Birthday U... |


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50425 entries, 0 to 50424
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Labels       50425 non-null  object
 1   Description  50424 non-null  object
dtypes: object(2)
memory usage: 788.0+ KB


In [None]:
print(data['Labels'].value_counts())

Labels
Household                 19313
Books                     11820
Electronics               10621
Clothing & Accessories     8671
Name: count, dtype: int64
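
The raw counts are easier to judge as proportions. The sketch below rebuilds the counts from the output above (so it runs without the CSV) and computes each class's share and the ratio between the largest and smallest class. On the real DataFrame, `data['Labels'].value_counts(normalize=True)` should give the same shares directly.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series(
    {
        "Household": 19313,
        "Books": 11820,
        "Electronics": 10621,
        "Clothing & Accessories": 8671,
    },
    name="count",
)

# Share of each class in the whole dataset
shares = (counts / counts.sum()).round(3)
print(shares)

# How much larger the biggest class is than the smallest one
imbalance_ratio = counts.max() / counts.min()
print(f"Largest/smallest class ratio: {imbalance_ratio:.2f}")
```

The classes are moderately imbalanced, which is one reason to read the per-class precision and recall in Step 5 rather than accuracy alone.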


# **Step 2: Text Preprocessing**
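Preprocessing cleans and normalizes the raw descriptions before they are vectorized.

**Common steps include:**

- Removing punctuation and special characters
- Converting text to lowercase
- Removing stop words
- Lemmatization (reducing words to their dictionary form, e.g. "gifts" becomes "gift")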

In [None]:
import pandas as pd
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer



# Download necessary NLTK data
nltk.download('stopwords')
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    if not isinstance(text, str):
        return ""
    # Replace every non-word character (anything other than letters, digits, underscore) with a space
    text = re.sub(r'\W', ' ', text)
    # Convert text to lowercase
    text = text.lower()
    # Remove stop words and lemmatize
    text = ' '.join([lemmatizer.lemmatize(word) for word in text.split() if word not in stop_words])
    return text

# Ensure all entries in the 'Description' column are strings and fill missing values with an empty string
data['Description'] = data['Description'].fillna('').astype(str)

# Apply preprocessing to the 'Description' column
data['cleaned_description'] = data['Description'].apply(preprocess_text)

# Display the first few rows to ensure preprocessing worked
print(data.head())


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


      Labels                                        Description  \
0  Household  Paper Plane Design Framed Wall Hanging Motivat...   
1  Household  SAF 'Floral' Framed Painting (Wood, 30 inch x ...   
2  Household  SAF 'UV Textured Modern Art Print Framed' Pain...   
3  Household  SAF Flower Print Framed Painting (Synthetic, 1...   
4  Household  Incredible Gifts India Wooden Happy Birthday U...   

                                 cleaned_description  
0  paper plane design framed wall hanging motivat...  
1  saf floral framed painting wood 30 inch x 10 i...  
2  saf uv textured modern art print framed painti...  
3  saf flower print framed painting synthetic 13 ...  
4  incredible gift india wooden happy birthday un...  
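
The cleaned output still contains numbers such as `30` and `10`, because `\W` only matches characters that are not letters, digits, or underscores. The sketch below compares that pattern with a stricter letters-only pattern on a hypothetical product title (not taken from the dataset). Which one is preferable depends on whether digits carry signal for the categories, so treat the second pattern as an option rather than a correction.

```python
import re

# Hypothetical product title, used only for illustration
sample = "Wood_Frame (30 inch x 10 inch), 100% Cotton!"

# Pattern used in preprocess_text: digits and underscores survive
tokens_w = re.sub(r'\W', ' ', sample).lower().split()
print(tokens_w)

# Stricter alternative: keep ASCII letters only
tokens_alpha = re.sub(r'[^A-Za-z]+', ' ', sample).lower().split()
print(tokens_alpha)
```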


# **Step 3: Vectorization**
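Machine learning models need numeric input, so we convert the cleaned text into numerical features.

We'll use scikit-learn's TfidfVectorizer, limiting the vocabulary to the 5,000 most frequent terms (max_features=5000).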
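
To make the weighting concrete, here is a small pure-Python sketch of TF-IDF on a hypothetical three-document corpus. It assumes scikit-learn's default settings: raw term counts, the smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each document vector. It is an illustration of the formula, not a replacement for `TfidfVectorizer`.

```python
import math
from collections import Counter

# Hypothetical, already-cleaned mini corpus
docs = ["framed wall painting", "wooden wall clock", "python programming book"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
df = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf(tokens):
    """Return the L2-normalized TF-IDF weights for one tokenized document."""
    tf = Counter(tokens)
    # Smoothed IDF, as documented for scikit-learn's smooth_idf=True
    weights = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}


vec = tfidf(tokenized[0])
print(vec)
```

In the first document, "wall" appears in two of the three documents and therefore receives a lower weight than "framed" or "painting", which each appear in only one.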

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(max_features=5000)
X = tfidf_vectorizer.fit_transform(data['cleaned_description']).toarray()

# The label column will be our target variable
y = data['Labels']
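
Two details of the cell above are worth knowing. First, `.toarray()` turns the sparse TF-IDF matrix into a dense array, which for 50425 rows and up to 5000 float64 columns is roughly 2 GB; scikit-learn's LogisticRegression also accepts the sparse matrix directly. Second, fitting the vectorizer on the full dataset lets the IDF weights see descriptions that later end up in the test set. The sketch below shows one possible alternative on hypothetical toy texts: split first, fit the vectorizer on the training texts only, and keep the result sparse. The variable names are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy texts standing in for data['cleaned_description']
texts = [
    "framed wall painting",
    "wooden wall clock",
    "python programming book",
    "history novel paperback",
]
text_train, text_test = train_test_split(texts, test_size=0.25, random_state=42)

vectorizer = TfidfVectorizer(max_features=5000)
# Vocabulary and IDF weights are learned from the training texts only
X_train_sparse = vectorizer.fit_transform(text_train)
# Held-out texts reuse that vocabulary; unseen words are ignored
X_test_sparse = vectorizer.transform(text_test)

print(type(X_train_sparse).__name__, X_train_sparse.shape, X_test_sparse.shape)

# Rough size of a dense float64 array with the notebook's dimensions
dense_gb = 50425 * 5000 * 8 / 1e9
print(f"Dense 50425 x 5000 float64 array: about {dense_gb:.1f} GB")
```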


# **Step 4: Model Building**
We'll split the data into training (80%) and testing (20%) sets and then build a Logistic Regression model.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the model
model = LogisticRegression()
model.fit(X_train, y_train)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
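
The message above means the solver stopped at its iteration limit before converging; LogisticRegression defaults to `max_iter=100`. The warning itself suggests raising `max_iter`. The sketch below does that on a small synthetic dataset (generated with `make_classification`, so it is not the e-commerce data) and checks that the solver finished within the new limit. The value 1000 is an assumption, not a tuned setting.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 4-class data, used only to demonstrate the parameter
X_demo, y_demo = make_classification(
    n_samples=200,
    n_features=20,
    n_informative=5,
    n_classes=4,
    random_state=42,
)

# Same estimator as in the notebook, with a higher iteration limit
model_demo = LogisticRegression(max_iter=1000)
model_demo.fit(X_demo, y_demo)

# n_iter_ holds the number of iterations the solver actually used
print("Iterations used:", model_demo.n_iter_)
print("Converged within limit:", bool((model_demo.n_iter_ < model_demo.max_iter).all()))
```

On the real data, the equivalent change would be `LogisticRegression(max_iter=1000)` in the cell above; the metrics reported in Step 5 come from the original, non-converged fit and may shift slightly after such a change.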


# **Step 5: Evaluation**
We'll evaluate the model using accuracy, precision, recall, and F1-score.

In [None]:
from sklearn.metrics import accuracy_score, classification_report

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
print(f'Accuracy: {accuracy_score(y_test, y_pred)}')
print('Classification Report:')
print(classification_report(y_test, y_pred))


Accuracy: 0.9648983639067923
Classification Report:
                        precision    recall  f1-score   support

                 Books       0.97      0.96      0.96      2387
Clothing & Accessories       0.98      0.98      0.98      1744
           Electronics       0.96      0.94      0.95      2067
             Household       0.96      0.98      0.97      3887

              accuracy                           0.96     10085
             macro avg       0.97      0.96      0.97     10085
          weighted avg       0.96      0.96      0.96     10085
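
The classification report summarizes each class but does not show which classes get confused with each other. A confusion matrix does. The sketch below uses a handful of hypothetical labels and predictions (not the notebook's `y_test` and `y_pred`) to show how per-class precision and recall follow from the matrix.

```python
from sklearn.metrics import confusion_matrix

labels = ["Books", "Clothing & Accessories", "Electronics", "Household"]

# Hypothetical true labels and predictions, for illustration only
y_true_demo = ["Books", "Books", "Electronics", "Electronics",
               "Household", "Household", "Clothing & Accessories"]
y_pred_demo = ["Books", "Household", "Electronics", "Household",
               "Household", "Household", "Clothing & Accessories"]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=labels)
print(cm)

# Precision: correct predictions divided by all predictions of that class (column sums)
precision = cm.diagonal() / cm.sum(axis=0)
# Recall: correct predictions divided by all true members of that class (row sums)
recall = cm.diagonal() / cm.sum(axis=1)

for name, p, r in zip(labels, precision, recall):
    print(f"{name}: precision={p:.2f}, recall={r:.2f}")
```

With the notebook's variables, `confusion_matrix(y_test, y_pred)` should produce the corresponding matrix for the real test set.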

