<a href="https://colab.research.google.com/github/Edenshmuel/PapaJohns_Data_Science_Project/blob/Nadav/Predicting_New_Categories.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Importing libraries and reading data

In [1]:
from google.colab import drive
import os
import shutil
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from xgboost import XGBClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

In [2]:
def reconnect_to_drive():
    # Disconnect if there is an existing connection
    try:
        drive.flush_and_unmount()
        print("📤 Previous connection to Drive was lost")
    except Exception:
        print("ℹ️ There was no previous connection")

    # Remove the /content/drive folder if it exists
    drive_mount_point = '/content/drive'
    if os.path.exists(drive_mount_point):
        shutil.rmtree(drive_mount_point)
        print("🗑️ Old mount point removed")

    # Connect to Drive
    drive.mount(drive_mount_point)
    print("📂 Connected to Drive")

reconnect_to_drive()

📤 Previous connection to Drive was lost
Mounted at /content/drive
📂 Connected to Drive


In [3]:
cleaned_data  = pd.read_csv('/content/drive/MyDrive/Final_Project_PapaJohns/cleaned_data.csv')
category_mapping = pd.read_csv('/content/drive/MyDrive/Final_Project_PapaJohns/category_mapping.csv')
desc_encoding_map = pd.read_csv('/content/drive/MyDrive/Final_Project_PapaJohns/desc_encoding_map.csv')

In [4]:
# Replace missing category names with 'לא מוגדר' ("undefined")
category_mapping['קטגוריה'] = category_mapping['קטגוריה'].fillna('לא מוגדר')

## Merging the tables and creating the training table

In [5]:
merged = cleaned_data.merge(desc_encoding_map, left_on='clean_desc_encoded', right_on='code', how='left')
merged = merged.merge(category_mapping, left_on='category_encoded', right_on='קוד', how='left')
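
A left merge silently keeps rows whose key has no match and fills them with NaN, so it can help to check how many rows failed to find a category. The sketch below runs on toy tables: the values and the `description` column name are invented for illustration, and only `clean_desc_encoded`, `category_encoded`, `code`, `קוד`, and `קטגוריה` mirror the notebook. It uses the `validate` and `indicator` options of `DataFrame.merge`.

```python
import pandas as pd

# Toy stand-ins for the three CSV files (values are invented).
cleaned_toy = pd.DataFrame({
    "clean_desc_encoded": [0, 1, 2],
    "category_encoded": [1, 2, 9],  # 9 has no entry in the category mapping
})
desc_map_toy = pd.DataFrame({
    "code": [0, 1, 2],
    "description": ["pizza", "cola", "mystery item"],  # hypothetical column name
})
category_map_toy = pd.DataFrame({
    "קוד": [1, 2],
    "קטגוריה": ["מנה עיקרית", "שתייה"],
})

# validate="many_to_one" raises if a lookup table contains duplicate keys.
merged_toy = cleaned_toy.merge(
    desc_map_toy, left_on="clean_desc_encoded", right_on="code",
    how="left", validate="many_to_one")

# indicator=True adds a "_merge" column that marks unmatched rows as "left_only".
merged_toy = merged_toy.merge(
    category_map_toy, left_on="category_encoded", right_on="קוד",
    how="left", validate="many_to_one", indicator=True)

unmatched = merged_toy[merged_toy["_merge"] == "left_only"]
print(f"{len(unmatched)} row(s) without a category match")
```

In this notebook, unmatched rows would later be removed by the `.dropna()` call, so a check like this mainly shows how much data is lost at that step.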

### 🔍 Why we use only `item_description` as input

In this classification task, the goal is to predict the **category of a new product** based solely on its textual description — for example: `"Coca Cola"`, `"Papa Deal"`, or `"Greek Salad"`.

We focus only on `item_description` for the following reasons:

- ✅ It is the **only available information** when a **new product** is added to the system.
- ✅ It contains meaningful linguistic patterns (e.g., "pizza", "drink", "sauce") that are useful for text classification.
- ❌ We ignore features like `clean_desc_encoded`, `quantity`, or `date`, since they are either:
  - Not available for new products,
  - Or irrelevant for categorizing based on name/description alone.

This approach lets the model:
- Generalize to products it has **never seen before**, as long as their names share words with known products.
- Run **in real time**, using only the name provided during product creation.

In [6]:
# Keep only the description text (stored in the CSV's unnamed first column) and the category name
model_data = merged[['Unnamed: 0', 'קטגוריה']].rename(columns={
    'Unnamed: 0': 'item_description',
    'קטגוריה': 'category'}).dropna()

In [7]:
# Remove the 'לא מוגדר' ("undefined") category from the training data
model_data = model_data[model_data['category'] != 'לא מוגדר']

## Mapping categories from the file

In [8]:
# Category Mapping: Text to Code
category_to_index = dict(zip(category_mapping['קטגוריה'], category_mapping['קוד']))
index_to_category = {v: k for k, v in category_to_index.items()}

In [9]:
# Remove the 'לא מוגדר' ("undefined") category, code 0, from both mappings
category_to_index.pop('לא מוגדר', None)
index_to_category.pop(0, None)

'לא מוגדר'

## Splitting the data

In [10]:
# Encode the categories from names to numeric codes
y_encoded = model_data['category'].map(category_to_index)

X = model_data['item_description']

# XGBoost expects class labels that start at 0, so shift the codes down by 1
y_adjusted = y_encoded - 1
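
Subtracting 1 works as long as every code from 1 to the highest one appears in the training data. Recent versions of XGBoost's scikit-learn wrapper reject labels that are not exactly 0..n-1, so a gap (for example, a category with no rows) would break training. The sketch below, on invented codes, shows a contiguity check and one possible fallback using `np.unique`; it is not part of the original notebook.

```python
import numpy as np

codes = np.array([1, 2, 3, 3, 6])  # toy category codes; 4 and 5 never occur
shifted = codes - 1                # the notebook's approach gives [0, 1, 2, 2, 5]

# Labels are usable only if the distinct values are exactly 0..n-1.
distinct = np.unique(shifted)
is_contiguous = np.array_equal(distinct, np.arange(len(distinct)))

# Fallback: remap whichever codes are present onto 0..n-1 and keep
# present_codes so predictions can be mapped back afterwards.
present_codes, remapped = np.unique(codes, return_inverse=True)
restored = present_codes[remapped]

print(is_contiguous, remapped.tolist(), restored.tolist())
```

With the fallback, decoding a prediction becomes `present_codes[prediction]` instead of `prediction + 1`.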

In [11]:
# Split into Train/Test
X_train, X_test, y_train, y_test = train_test_split(X, y_adjusted, test_size=0.2, random_state=42, stratify=y_adjusted)

## Model building and training

In [12]:
model = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', XGBClassifier(eval_metric='mlogloss'))])

model.fit(X_train, y_train)



## Predicting and returning category names

In [13]:
y_pred = model.predict(X_test)

In [14]:
# Restore the original encoding
y_pred_orig = y_pred + 1
y_test_orig = y_test + 1

In [15]:
# Convert back to category names
y_pred_labels = [index_to_category[i] for i in y_pred_orig]
y_test_labels = [index_to_category[i] for i in y_test_orig]

print(classification_report(y_test_labels, y_pred_labels))

              precision    recall  f1-score   support

         אחר       1.00      1.00      1.00      2636
  מנה עיקרית       1.00      1.00      1.00     19972
       קינוח       1.00      0.99      1.00      1218
        רוטב       0.99      1.00      0.99      3977
       שתייה       1.00      0.99      1.00      3781
       תוספת       1.00      1.00      1.00     14142

    accuracy                           1.00     45726
   macro avg       1.00      1.00      1.00     45726
weighted avg       1.00      1.00      1.00     45726
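
Scores this close to 1.00 can occur when identical description strings land in both the training and the test set, because the model then only has to recall names it has already seen. Whether that is the case here is not shown in this notebook. One way to check is to compare `set(X_train) & set(X_test)`, and one way to measure performance on unseen names is to split by description, so that each distinct name falls entirely on one side. The sketch below uses invented rows and scikit-learn's `GroupShuffleSplit`.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Invented rows: the same product name appears several times, as it would
# if every row were a separate order line (an assumption about the data).
toy = pd.DataFrame({
    "item_description": ["cola", "cola", "cola", "pizza",
                         "pizza", "salad", "salad", "brownie"],
    "category": ["drink", "drink", "drink", "main",
                 "main", "side", "side", "dessert"],
})

# Group by description so each distinct name goes wholly to train or to test.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(splitter.split(
    toy["item_description"], toy["category"], groups=toy["item_description"]))

train_names = set(toy.iloc[train_idx]["item_description"])
test_names = set(toy.iloc[test_idx]["item_description"])
overlap = train_names & test_names
print(f"names shared between train and test: {len(overlap)}")
```

A grouped split gives up stratification, so rare categories may be missing from the test set; on the real data that trade-off would need to be checked.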



## New product prediction function (with a confidence safeguard)

### 🧠 How the Model Handles Unknown or New Categories

This classification model is designed to predict the category of a product **based solely on its description** (e.g., `"Greek Salad"`, `"Papa Deal"`, `"Coca Cola"`).

#### 🟢 Standard Behavior:
- The model uses a trained `TF-IDF + XGBoost` pipeline to predict the **most likely category** from the known set (e.g., `'Main Dish'`, `'Drink'`, `'Dessert'`).
- These categories are based on the `category_mapping.csv` file and aligned with the internal system codes (1–6).

#### ⚠️ Special Handling for New/Unknown Products:
- If the model is **not confident enough** in its prediction (the top class probability is **below a threshold**, 0.6 by default),  
  it will **not return a specific category**.
- Instead, it returns a special label: **"⚠️ Category not recognized – Unclassified"**.

#### ✅ Why this is important:
- It ensures that **new or unusual products** (like limited-time offers or misspelled items) are not forced into incorrect categories.
- It also allows system operators to **review and manually classify** such items, or update the model over time.

> In summary: the model is capable of both confidently classifying known products and flagging new or unclear ones as "Unclassified".
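
The thresholding rule described above can be sketched as a small pure function, separate from any user interaction. The function name, the `UNCLASSIFIED` constant, and the toy mapping are hypothetical; the `+ 1` shift assumes the same 1-based category codes used elsewhere in this notebook.

```python
import numpy as np

UNCLASSIFIED = "Unclassified"  # hypothetical label for low-confidence inputs

def label_or_unclassified(probas, index_to_category, threshold=0.6):
    """Return (label, confidence) for one row of class probabilities."""
    probas = np.asarray(probas, dtype=float)
    best = int(np.argmax(probas))       # 0-based class index
    confidence = float(probas[best])
    if confidence < threshold:
        return UNCLASSIFIED, confidence
    # Shift back to the 1-based codes used as keys in the mapping.
    return index_to_category[best + 1], confidence

# Toy mapping and probability rows, invented for illustration.
toy_mapping = {1: "Main Dish", 2: "Drink", 3: "Dessert"}
confident = label_or_unclassified([0.05, 0.90, 0.05], toy_mapping)
uncertain = label_or_unclassified([0.40, 0.35, 0.25], toy_mapping)
print(confident, uncertain)
```

Keeping the decision in a function like this makes the threshold easy to test and tune without going through `input()`.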

In [22]:
def predict_from_input(model, index_to_category, threshold=0.6):
    print("🔍 Enter a product description (or type 'סיום' to quit):")
    while True:
        user_input = input("📝 Description: ").strip()
        if user_input.lower() == 'סיום':
            print("👋 Exiting prediction mode.")
            break

        # Predict class probabilities for the entered description
        probas = model.predict_proba([user_input])[0]
        max_proba = np.max(probas)
        predicted_index = np.argmax(probas)
        original_index = predicted_index + 1  # shift back to the original 1-based category codes

        if max_proba < threshold:
            print("⚠️ Category not recognized – Unclassified")
        else:
            category = index_to_category[original_index]
            print(f"✅ Predicted category: {category} (Confidence: {max_proba:.2f})")

## Example of a prediction:

In [25]:
predict_from_input(model, index_to_category)

🔍 Enter a product description (or type 'סיום' to quit):
📝 Description: עוגת שוקולד
✅ Predicted category: קינוח (Confidence: 0.99)
📝 Description: רוטב סלסה
✅ Predicted category: רוטב (Confidence: 1.00)
📝 Description: סיום
👋 Exiting prediction mode.
