<a href="https://colab.research.google.com/github/StefanRaduMaris/AI-model-product_classifier/blob/main/notebook/product_classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🧠 AI Product Categorization Model – Project Description

This notebook presents the full workflow of developing an AI model for product categorization. The model is designed to automatically classify products into predefined categories based on their characteristics (such as name, description, or other features). This approach can be used for e-commerce, inventory management, or retail analytics.

## 📌 Project Objectives

- Build a machine learning model capable of classifying products into categories.
- Process and prepare product data for model training.
- Train and evaluate different algorithms to find the best-performing model.
- Provide accurate and reliable product category predictions.

## 🔍 Step 1: Problem Definition

The goal of this project is to automatically assign a category to each product.
We aim to solve a classification problem, where the expected output is a category label (e.g., Electronics, Clothing, Food, etc.).
This model helps improve product catalog organization and reduces manual tagging work.

## 📊 Step 2: Data Collection & Exploration

We load a dataset containing products and their corresponding categories.
In this step, we:

- Examine product features (e.g., name, description, price).
- Check the distribution of categories.
- Identify missing or inconsistent values.
- Visualize data to understand common patterns across product types.

By exploring the dataset, we gain insights into which features are most helpful for categorization.
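
The checks listed above can be sketched on a small table. The rows below are made-up toy data for illustration, not values from the real dataset, and the column names are assumptions:

```python
import pandas as pd

# Toy data for illustration only (not taken from products.csv)
toy = pd.DataFrame({
    "Product Title": ["apple iphone 8", "bosch dishwasher", None, "samsung tv 55"],
    "Category Label": ["Mobile Phones", "Dishwashers", "Mobile Phones", None],
})

# Share of each category among the rows that have a label
category_share = toy["Category Label"].value_counts(normalize=True)

# Fraction of missing values in each column
missing_share = toy.isna().mean()

print(category_share)
print(missing_share)
```

Looking at shares rather than raw counts makes it easier to judge class imbalance and whether the missing values are a small or large part of the data.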

In [32]:
# pandas lets us read the CSV file directly from GitHub
import pandas as pd

# Raw URL of the data file in the GitHub repository
url = 'https://raw.githubusercontent.com/StefanRaduMaris/AI-model-product_classifier/refs/heads/main/data/products.csv'

# Read the CSV file
df = pd.read_csv(url)

# First look at the data
print(df.head(10))

# Check the size of the dataset (number of rows and columns)
print(f"We have {df.shape[0]} rows and {df.shape[1]} columns")

# Check data types and missing values
df.info()  # prints its report directly and returns None
print(df.isna().sum())


   product ID                                      Product Title  Merchant ID  \
0           1                    apple iphone 8 plus 64gb silver            1   
1           2                apple iphone 8 plus 64 gb spacegrau            2   
2           3  apple mq8n2b/a iphone 8 plus 64gb 5.5 12mp sim...            3   
3           4                apple iphone 8 plus 64gb space grey            4   
4           5  apple iphone 8 plus gold 5.5 64gb 4g unlocked ...            5   
5           6  apple iphone 8 plus gold 5.5 64gb 4g unlocked ...            6   
6           7               apple iphone 8 plus 64 gb space grey            7   
7           8                apple iphone 8 plus 64gb space grey            8   
8           9                apple iphone 8 plus 64gb space grey            9   
9          10                apple iphone 8 plus 64gb space grey           10   

   Category Label _Product Code  Number_of_Views  Merchant Rating  \
0   Mobile Phones    QA-2276-XC        

Overall, the dataset is relatively complete: several columns contain missing values, but only in a small proportion of rows. It provides a rich foundation for analyzing product performance, merchant behavior, category distribution, and marketplace dynamics. Most importantly, the column names are inconsistent (mixed capitalization, stray spaces, and underscores), so we rename them to make the rest of the work easier.
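
This kind of header cleanup can also be sketched programmatically. The helper `clean_column_name` below is a hypothetical name introduced for illustration, and the column list is copied from the printed header above. Note that it does not fix capitalization (for example `product ID`), which still needs an explicit mapping:

```python
import pandas as pd

def clean_column_name(name: str) -> str:
    """Strip stray spaces/underscores at the ends and turn inner underscores into spaces."""
    return name.strip(" _").replace("_", " ")

# Empty frame that only carries the raw header names seen in the output above
raw = pd.DataFrame(columns=["product ID", " Category Label", "_Product Code", "Number_of_Views"])

# rename() accepts a function and applies it to every column name
cleaned = raw.rename(columns=clean_column_name)
print(list(cleaned.columns))
```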

In [33]:
# Normalize the column names: fix capitalization, stray spaces, and underscores.
# Columns that are already clean (e.g. "Product Title") need no entry.
df = df.rename(columns={
    "product ID": "Product ID",
    " Category Label": "Category Label",
    "_Product Code": "Product Code",
    "Number_of_Views": "Number of Views",
    " Listing Date": "Listing Date",  # the raw name has a leading space
})

# Save a local copy with the cleaned header
df.to_csv("products.csv", index=False)

print(df.isna().sum())

Product ID           0
Product Title      172
Merchant ID          0
Category Label      44
Product Code        95
Number of Views     14
Merchant Rating    170
Listing Date        59
dtype: int64


## 🧹 Step 3: Data Preprocessing

To ensure accurate classification, we prepare the dataset by:

- Cleaning text fields (removing symbols, converting to lowercase).
- Removing duplicate entries.
- Checking which columns should be used in our model.
- Handling missing values.
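
The text-cleaning step can be sketched as follows. This is a minimal illustration, not the notebook's actual cleaning code: `clean_title` is a hypothetical helper, the sample title is made up, and the regular expression assumes ASCII-only titles (it would also split decimals such as `5.5` into `5 5`):

```python
import re

def clean_title(title: str) -> str:
    """Lowercase a title, replace symbols with spaces, and collapse whitespace."""
    lowered = title.lower()
    # Anything that is not a lowercase letter, digit, or whitespace becomes a space
    no_symbols = re.sub(r"[^a-z0-9\s]", " ", lowered)
    # Collapse runs of whitespace and trim the ends
    return re.sub(r"\s+", " ", no_symbols).strip()

# Made-up example title
print(clean_title("Apple MQ8N2B/A iPhone 8 Plus, 64GB (Silver)"))
```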



So the first question is: which categories do we have?

In [34]:
df['Category Label'].value_counts()

Category Label,count
Fridge Freezers,5495
Washing Machines,4036
Mobile Phones,4020
CPUs,3771
TVs,3564
Fridges,3457
Dishwashers,3418
Digital Cameras,2696
Microwaves,2338
Freezers,2210


It looks like some categories appear under several labels: Freezers, fridge, Fridges, and Fridge Freezers; CPU and CPUs; Mobile Phone and Mobile Phones. Let's merge each group into a single label.

In [43]:
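# Map every label variant to one canonical name.
# The replacement is limited to the label column so other columns stay untouched.
label_fixes = {
    "Freezers": "Fridge Freezers",
    "fridge": "Fridge Freezers",
    "Fridges": "Fridge Freezers",
    "Mobile Phone": "Mobile Phones",
    "CPU": "CPUs",
}
df['Category Label'] = df['Category Label'].replace(label_fixes)
df['Category Label'].value_counts()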



Category Label,count
Fridge Freezers,11285
Mobile Phones,4075
Washing Machines,4036
CPUs,3855
TVs,3564
Dishwashers,3418
Digital Cameras,2696
Microwaves,2338
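
Label variants like the ones merged above can also be surfaced automatically instead of being spotted by eye. The sketch below groups labels by a crude normalized key (lowercase, trailing "s" removed). `label_key` is a hypothetical helper and the label list is a hand-picked sample. It would not detect that Freezers belongs with Fridge Freezers, since that merge is a judgment call:

```python
import pandas as pd

# Hand-picked sample of labels for illustration
labels = pd.Series(["Fridge Freezers", "fridge", "Fridges", "CPU", "CPUs",
                    "Mobile Phone", "Mobile Phones", "TVs"])

def label_key(label: str) -> str:
    """Crude normalization: trim, lowercase, and drop trailing 's' characters."""
    return label.strip().lower().rstrip("s")

# Collect the distinct spellings that share the same normalized key
variants = labels.groupby(labels.map(label_key)).unique()

# Keys with more than one spelling are candidates for merging
suspicious = variants[variants.map(len) > 1]
print(suspicious)
```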


Let's check the NaN values again. It is important to keep missing values out of the data we feed to the model, because they can hurt its precision.

In [44]:
df.isna().sum()

Column,Missing values
Product ID,0
Product Title,172
Merchant ID,0
Category Label,44
Product Code,95
Number of Views,14
Merchant Rating,170
Listing Date,59


We have an important decision to make here. Should we remove the rows with missing values? That depends on whether the column is important for our model: if a column will not be used in the model, I will simply drop the column instead of removing the rows that contain NaN values.
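
One way to apply this decision is to select the model columns first and only then drop incomplete rows. The sketch below uses made-up toy data, and the choice of `Product Title` and `Category Label` as model columns is an assumption for illustration, not a conclusion of this notebook:

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "Product Title": ["apple iphone 8", None, "bosch dishwasher", "intel core i5"],
    "Category Label": ["Mobile Phones", "Mobile Phones", None, "CPUs"],
    "Merchant Rating": [4.5, 3.0, 4.0, None],
})

# Assumed model columns: the text input and the target label
model_columns = ["Product Title", "Category Label"]

# Drop unused columns first, then rows that are incomplete in the remaining ones
model_df = toy[model_columns].dropna()
print(len(toy), "->", len(model_df))
```

The last row survives even though its `Merchant Rating` is missing, because that column is no longer part of the table when `dropna()` runs.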

Before starting to analyze the data, I want to check for duplicates. Let's check the results.

In [48]:
df[df.duplicated()]

Unnamed: 0,Product ID,Product Title,Merchant ID,Category Label,Product Code,Number of Views,Merchant Rating,Listing Date
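
Keep in mind that `df.duplicated()` compares whole rows, so a unique identifier column is enough to make every row look distinct. A minimal sketch on made-up toy data shows the difference between a full-row check and a check limited to selected columns. Which columns define a "duplicate" is an assumption here and depends on the modeling goal:

```python
import pandas as pd

# Toy data for illustration only: same title and label, different IDs
toy = pd.DataFrame({
    "Product ID": [1, 2, 3],
    "Product Title": ["apple iphone 8 plus 64gb space grey"] * 2 + ["samsung tv 55"],
    "Category Label": ["Mobile Phones", "Mobile Phones", "TVs"],
})

# Full-row comparison: the unique IDs hide the repeated content
full_row_dupes = toy.duplicated().sum()

# Comparison limited to the columns that describe the product
content_dupes = toy.duplicated(subset=["Product Title", "Category Label"]).sum()

print(full_row_dupes, content_dupes)
```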


No fully identical rows were found, so let's start examining which columns influence the category the most.