<a href="https://colab.research.google.com/github/IrmaM00/product-category-prediction/blob/main/notebook/product_reviews_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Importing libraries

In [5]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

Loading and inspecting the data

In [7]:
url = "https://raw.githubusercontent.com/IrmaM00/product-category-prediction/main/data/products.csv"
df = pd.read_csv(url)
print("Dataset shape (rows, columns):", df.shape)

Dataset shape (rows, columns): (35311, 8)


Dropping missing values and unnecessary columns

In [8]:
df = df.dropna()
print("Dataset shape after dropping NA:", df.shape)

Dataset shape after dropping NA: (34760, 8)


In [9]:
df = df.drop(columns=['product ID', 'Merchant ID', '_Product Code', 'Number_of_Views', 'Merchant Rating', ' Listing Date  '])
df.head()

   Product Title                                      Category Label
0  apple iphone 8 plus 64gb silver                    Mobile Phones
1  apple iphone 8 plus 64 gb spacegrau                Mobile Phones
2  apple mq8n2b/a iphone 8 plus 64gb 5.5 12mp sim...  Mobile Phones
3  apple iphone 8 plus 64gb space grey                Mobile Phones
4  apple iphone 8 plus gold 5.5 64gb 4g unlocked ...  Mobile Phones


Standardization: normalizing column names and merging similarly named categories

In [10]:
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

In [11]:
# print(df["category_label"].value_counts())
mapping = {
    "Fridges": "Fridge Freezers",
    "fridge": "Fridge Freezers",
    "Freezers": "Fridge Freezers",
    "Mobile Phone": "Mobile Phones",
    "CPU": "CPUs"
}

df["category_label"] = df["category_label"].astype(object).replace(mapping).astype("category")
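To see what the mapping step does, here is a minimal sketch on a handful of made-up labels (the `labels` series is hypothetical and not taken from products.csv). Labels that appear as keys in the mapping are rewritten to the canonical name, and every other label passes through unchanged.

```python
import pandas as pd

# Toy labels (hypothetical, not from products.csv) with inconsistent spellings.
labels = pd.Series(["Fridges", "fridge", "Freezers", "Mobile Phone", "CPU", "TVs"])

mapping = {
    "Fridges": "Fridge Freezers",
    "fridge": "Fridge Freezers",
    "Freezers": "Fridge Freezers",
    "Mobile Phone": "Mobile Phones",
    "CPU": "CPUs",
}

# Exact-match replacement: only whole labels listed in the mapping change.
merged = labels.replace(mapping).astype("category")
counts = merged.value_counts()
print(counts)
```

Running `value_counts()` before and after the replacement on the real column is a quick way to confirm that no near-duplicate category names remain.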

Preparing the data for ML

In [12]:
X = df["product_title"]
y = df["category_label"]

In [13]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
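The `stratify=y` argument keeps the class proportions the same in the training and test sets, which matters here because the categories are not equally frequent. A small sketch on invented, imbalanced data (the names `X_toy` and `y_toy` are assumptions for illustration only):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 80 titles in class "A", 20 in class "B".
X_toy = pd.Series([f"title {i}" for i in range(100)])
y_toy = pd.Series(["A"] * 80 + ["B"] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Share of each class in both splits; with stratification they should match.
train_share = y_tr.value_counts(normalize=True)
test_share = y_te.value_counts(normalize=True)
print(train_share)
print(test_share)
```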

Vectorization and models

In [14]:
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

In [15]:
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": MultinomialNB(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Support Vector Machine": LinearSVC()
}
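As an alternative arrangement, not what this notebook does, the vectorizer and a classifier can be bundled into a single scikit-learn pipeline so that raw titles go in and predicted categories come out. The training titles below are invented for illustration, and LinearSVC is used only as an example estimator.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy training set with two categories.
titles = [
    "apple iphone 8 plus 64gb silver",
    "samsung galaxy s9 64gb black",
    "apple iphone 7 32gb gold",
    "bosch fridge freezer white",
    "samsung fridge freezer silver",
    "beko freezer white",
]
cats = ["Mobile Phones"] * 3 + ["Fridge Freezers"] * 3

# The pipeline applies TF-IDF and then the classifier in both fit and predict.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(titles, cats)

pred = clf.predict(["apple iphone 8 64gb"])
print(pred)
```

Keeping both steps in one object avoids accidentally transforming new titles with a different vectorizer than the one used in training.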

Evaluating the models

In [16]:
for name, model in models.items():
    model.fit(X_train_vec, y_train)
    y_pred = model.predict(X_test_vec)
    print(f"\n{name} - Classification Report:")
    print(classification_report(y_test, y_pred))

# Support Vector Machine and Naive Bayes give the best results


Logistic Regression - Classification Report:
                  precision    recall  f1-score   support

            CPUs       1.00      1.00      1.00       759
 Digital Cameras       1.00      0.99      1.00       532
     Dishwashers       0.97      0.92      0.95       675
 Fridge Freezers       0.94      0.99      0.96      2226
      Microwaves       1.00      0.94      0.97       461
   Mobile Phones       0.99      0.99      0.99       805
             TVs       0.99      0.97      0.98       700
Washing Machines       0.99      0.92      0.96       794

        accuracy                           0.97      6952
       macro avg       0.98      0.97      0.97      6952
    weighted avg       0.97      0.97      0.97      6952


Naive Bayes - Classification Report:
                  precision    recall  f1-score   support

            CPUs       1.00      1.00      1.00       759
 Digital Cameras       1.00      1.00      1.00       532
     Dishwashers       0.99      0.90     