**Imports**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


**Loading the data**

In [2]:
df = pd.read_csv("../data/products.csv")
df.columns = df.columns.str.strip()
df.head()


| | product ID | Product Title | Merchant ID | Category Label | _Product Code | Number_of_Views | Merchant Rating | Listing Date |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | apple iphone 8 plus 64gb silver | 1 | Mobile Phones | QA-2276-XC | 860.0 | 2.5 | 5/10/2024 |
| 1 | 2 | apple iphone 8 plus 64 gb spacegrau | 2 | Mobile Phones | KA-2501-QO | 3772.0 | 4.8 | 12/31/2024 |
| 2 | 3 | apple mq8n2b/a iphone 8 plus 64gb 5.5 12mp sim... | 3 | Mobile Phones | FP-8086-IE | 3092.0 | 3.9 | 11/10/2024 |
| 3 | 4 | apple iphone 8 plus 64gb space grey | 4 | Mobile Phones | YI-0086-US | 466.0 | 3.4 | 5/2/2022 |
| 4 | 5 | apple iphone 8 plus gold 5.5 64gb 4g unlocked ... | 5 | Mobile Phones | NZ-3586-WP | 4426.0 | 1.6 | 4/12/2023 |


In [3]:
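df.info()  # column dtypes and non-null counts (printed directly)
df.isna().sum()  # not displayed: a cell echoes only its last expression; wrap in print() to see it
df["Category Label"].value_counts().head(10)  # ten most frequent categories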


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35311 entries, 0 to 35310
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   product ID       35311 non-null  int64  
 1   Product Title    35139 non-null  object 
 2   Merchant ID      35311 non-null  int64  
 3   Category Label   35267 non-null  object 
 4   _Product Code    35216 non-null  object 
 5   Number_of_Views  35297 non-null  float64
 6   Merchant Rating  35141 non-null  float64
 7   Listing Date     35252 non-null  object 
dtypes: float64(2), int64(2), object(4)
memory usage: 2.2+ MB


Category Label
Fridge Freezers     5495
Washing Machines    4036
Mobile Phones       4020
CPUs                3771
TVs                 3564
Fridges             3457
Dishwashers         3418
Digital Cameras     2696
Microwaves          2338
Freezers            2210
Name: count, dtype: int64

**Minimal cleaning**

In [4]:
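# Drop rows missing the title or the label; copy so the assignment below
# works on an independent DataFrame and cannot trigger SettingWithCopyWarning.
df = df.dropna(subset=["Product Title", "Category Label"]).copy()
df["Product Title"] = df["Product Title"].str.lower()  # lowercase all titles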


**Feature Engineering**

In [5]:
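# Hand-crafted features derived from the title text
df["title_length"] = df["Product Title"].str.len()  # number of characters
df["word_count"] = df["Product Title"].str.split().str.len()  # whitespace-separated tokens
df["has_number"] = df["Product Title"].str.contains(r"\d").astype(int)  # 1 if any digit is present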


These features help distinguish technical products from non-technical ones.
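
To make the three features concrete, here is a small sketch that computes them for two titles. The first title comes from the df.head() output above; the second is made up for contrast, and the names titles and features are illustrative only.

```python
import pandas as pd

# First title is from the df.head() output; the second is an invented example.
titles = pd.Series([
    "apple iphone 8 plus 64gb silver",
    "stainless steel kettle",
])

features = pd.DataFrame({
    "title_length": titles.str.len(),                        # characters in the title
    "word_count": titles.str.split().str.len(),              # whitespace-separated tokens
    "has_number": titles.str.contains(r"\d").astype(int),    # 1 if any digit appears
})
print(features)
```

For the first title this gives 31 characters, 6 words, and has_number = 1; the invented second title has 22 characters, 3 words, and has_number = 0.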

**Data preparation**

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer


**Full pipeline**

In [7]:
X = df[["Product Title", "title_length", "word_count", "has_number"]]
y = df["Category Label"]

preprocessor = ColumnTransformer(
    transformers=[
        ("text", TfidfVectorizer(stop_words="english", max_features=5000), "Product Title"),
        ("num", "passthrough", ["title_length", "word_count", "has_number"])
    ]
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
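
The numeric columns are passed through unscaled, while TF-IDF values lie in [0, 1], and that mismatch can slow down linear solvers. The following is a minimal sketch, not the original setup, of standardizing those columns inside the same ColumnTransformer. The choice of StandardScaler, the toy rows, and the names scaled_preprocessor, toy, and toy_labels are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

NUM_COLS = ["title_length", "word_count", "has_number"]

# Same layout as the preprocessor above, but the numeric columns are
# standardized instead of passed through (an assumption, not the original).
scaled_preprocessor = ColumnTransformer(
    transformers=[
        ("text", TfidfVectorizer(stop_words="english", max_features=5000), "Product Title"),
        ("num", StandardScaler(), NUM_COLS),
    ]
)

# Tiny made-up dataset, only to show that the pipeline fits and predicts.
toy = pd.DataFrame({
    "Product Title": [
        "apple iphone 8 plus 64gb silver",
        "samsung galaxy s9 dual sim 64gb",
        "bosch freestanding dishwasher white",
        "beko slimline dishwasher 45cm",
    ],
})
toy["title_length"] = toy["Product Title"].str.len()
toy["word_count"] = toy["Product Title"].str.split().str.len()
toy["has_number"] = toy["Product Title"].str.contains(r"\d").astype(int)
toy_labels = ["Mobile Phones", "Mobile Phones", "Dishwashers", "Dishwashers"]

model = Pipeline([
    ("prep", scaled_preprocessor),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(toy, toy_labels)
preds = model.predict(toy)

# After fitting, each scaled numeric column has (approximately) zero mean.
fitted_scaler = model.named_steps["prep"].named_transformers_["num"]
scaled_num = fitted_scaler.transform(toy[NUM_COLS])
print(scaled_num.mean(axis=0))
```

Whether scaling changes accuracy on the real data is not something this sketch can show; it only illustrates where the scaler would plug in.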


**Model 1 – Logistic Regression**

In [8]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

model_lr = Pipeline([
    ("prep", preprocessor),
    ("clf", LogisticRegression(max_iter=1000))
])

model_lr.fit(X_train, y_train)
pred_lr = model_lr.predict(X_test)

print(accuracy_score(y_test, pred_lr))
print(classification_report(y_test, pred_lr))


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=1000).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.9414529914529914
                  precision    recall  f1-score   support

             CPU       0.00      0.00      0.00        14
            CPUs       0.98      1.00      0.99       726
 Digital Cameras       0.99      0.99      0.99       535
     Dishwashers       0.90      0.96      0.93       684
        Freezers       0.98      0.84      0.91       422
 Fridge Freezers       0.94      0.91      0.92      1087
         Fridges       0.81      0.91      0.86       702
      Microwaves       0.99      0.94      0.97       464
    Mobile Phone       0.00      0.00      0.00        17
   Mobile Phones       0.96      0.98      0.97       795
             TVs       0.96      0.99      0.98       724
Washing Machines       0.96      0.95      0.96       821
          fridge       0.00      0.00      0.00        29

        accuracy                           0.94      7020
       macro avg       0.73      0.73      0.73      7020
    weighted avg       0.94      0.94      0.94      7020

  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])


**Model 2 – LinearSVC**

In [9]:
from sklearn.svm import LinearSVC

model_svc = Pipeline([
    ("prep", preprocessor),
    ("clf", LinearSVC())
])

model_svc.fit(X_train, y_train)
pred_svc = model_svc.predict(X_test)

print(accuracy_score(y_test, pred_svc))
print(classification_report(y_test, pred_svc))


0.9534188034188035


                  precision    recall  f1-score   support

             CPU       0.00      0.00      0.00        14
            CPUs       0.98      1.00      0.99       726
 Digital Cameras       1.00      0.99      1.00       535
     Dishwashers       0.91      0.95      0.93       684
        Freezers       0.97      0.91      0.94       422
 Fridge Freezers       0.95      0.94      0.94      1087
         Fridges       0.86      0.91      0.88       702
      Microwaves       0.98      0.97      0.98       464
    Mobile Phone       0.00      0.00      0.00        17
   Mobile Phones       0.97      0.99      0.98       795
             TVs       0.98      0.99      0.99       724
Washing Machines       0.97      0.95      0.96       821
          fridge       0.00      0.00      0.00        29

        accuracy                           0.95      7020
       macro avg       0.74      0.74      0.74      7020
    weighted avg       0.95      0.95      0.95      7020




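Some very rare categories (CPU, Mobile Phone, fridge) were never predicted on the test set, which triggered an UndefinedMetricWarning: with no predicted samples, precision for those classes is undefined and is reported as 0.00. These labels appear to be inconsistent spellings of CPUs, Mobile Phones, and Fridges, which would also explain why the macro average (0.73 and 0.74) sits well below the weighted average.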
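
The rare labels in the reports above (CPU, Mobile Phone, fridge) look like variants of larger categories. One way to handle them is to merge them before splitting the data. The sketch below is an assumption rather than part of the original notebook: the mapping LABEL_FIXES and the helper normalize_labels are hypothetical, and the mapping of fridge to Fridges in particular should be checked against the actual rows.

```python
import pandas as pd

# Hypothetical mapping from rare label variants to their presumed canonical
# categories. Verify each pair against the data before applying it.
LABEL_FIXES = {
    "CPU": "CPUs",
    "Mobile Phone": "Mobile Phones",
    "fridge": "Fridges",
}


def normalize_labels(labels: pd.Series) -> pd.Series:
    """Strip surrounding whitespace, then collapse known label variants."""
    cleaned = labels.str.strip()
    return cleaned.replace(LABEL_FIXES)


# Made-up sample covering each variant seen in the classification report.
sample = pd.Series(["CPUs", "CPU", " Mobile Phone", "fridge", "TVs"])
normalized = normalize_labels(sample)
print(normalized.value_counts())
```

Applied to the real data, this would be a single line such as df["Category Label"] = normalize_labels(df["Category Label"]), placed in the cleaning step before the train/test split.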

LinearSVC achieved higher accuracy (0.953 vs. 0.941 for Logistic Regression), so it was chosen as the final model.
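
Accuracy alone can hide differences on rare classes, so it may be worth comparing macro F1 as well. The sketch below uses made-up labels and predictions (y_true, pred_a, pred_b) and a hypothetical helper summarize. It relies on the zero_division parameter of scikit-learn's f1_score to score classes that are never predicted as 0 without emitting a warning.

```python
from sklearn.metrics import accuracy_score, f1_score


def summarize(y_true, y_pred):
    """Return accuracy and macro-averaged F1 for one set of predictions."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        # zero_division=0 scores classes with no predictions as 0, silently.
        "macro_f1": f1_score(y_true, y_pred, average="macro", zero_division=0),
    }


# Made-up example: both models get 5 of 6 right, but model A never predicts
# the rare class while model B does.
y_true = ["TVs", "TVs", "TVs", "CPUs", "CPUs", "fridge"]
pred_a = ["TVs", "TVs", "TVs", "CPUs", "CPUs", "CPUs"]
pred_b = ["TVs", "TVs", "CPUs", "CPUs", "CPUs", "fridge"]

result_a = summarize(y_true, pred_a)
result_b = summarize(y_true, pred_b)
print(result_a)
print(result_b)
```

In this toy case the two accuracies are identical while the macro F1 values differ, which is the kind of gap that accuracy by itself would not reveal.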

**Saving the model**

In [10]:
import pickle

with open("../models/model.pkl", "wb") as f:
    pickle.dump(model_svc, f)
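
The saved pipeline expects a DataFrame with the same four columns used in training, so raw titles need the same feature engineering before prediction. The following is a sketch of the round trip under stated assumptions: build_features is a hypothetical helper, stand_in is a tiny model trained on made-up rows in place of model_svc, and a temporary file replaces ../models/model.pkl so the example runs on its own.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

NUM_COLS = ["title_length", "word_count", "has_number"]


def build_features(titles):
    """Hypothetical helper: rebuild the four input columns from raw titles."""
    s = pd.Series(list(titles)).str.lower()
    return pd.DataFrame({
        "Product Title": s,
        "title_length": s.str.len(),
        "word_count": s.str.split().str.len(),
        "has_number": s.str.contains(r"\d").astype(int),
    })


# Stand-in for model_svc, trained on made-up rows so the sketch is runnable.
stand_in = Pipeline([
    ("prep", ColumnTransformer([
        ("text", TfidfVectorizer(), "Product Title"),
        ("num", "passthrough", NUM_COLS),
    ])),
    ("clf", LinearSVC()),
])
train_titles = [
    "apple iphone 8 plus 64gb",
    "samsung galaxy s9 64gb",
    "bosch dishwasher 60cm white",
    "beko dishwasher slimline 45cm",
]
train_labels = ["Mobile Phones", "Mobile Phones", "Dishwashers", "Dishwashers"]
stand_in.fit(build_features(train_titles), train_labels)

# Save and reload; the temporary path stands in for ../models/model.pkl.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model.pkl"
    with open(path, "wb") as f:
        pickle.dump(stand_in, f)
    with open(path, "rb") as f:
        loaded = pickle.load(f)

new_rows = build_features(["Siemens Dishwasher 60cm", "Apple iPhone 8 64GB gold"])
loaded_preds = loaded.predict(new_rows)
original_preds = stand_in.predict(new_rows)
print(list(loaded_preds))
```

Pickle files can execute arbitrary code when loaded, so only files from a trusted source should be unpickled. The scikit-learn version used for loading should also match the one used for saving.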
