## Task: Product Category Prediction from Title
### ✍️ Author: Sladjan Jeremic / SladjanJ
In this project, a machine learning model is developed to automatically suggest a product's category from its title (e.g. "Apple iPhone 7 32GB" → "Mobile Phones"). The goal is to automate product categorization in an online store in order to reduce manual work, speed up the creation of new listings, and lower the risk of human error.

This Jupyter notebook walks through the key steps of the project: loading and exploring a real-world dataset with tens of thousands of products; preparing and cleaning the data; engineering features, primarily from the `Product Title` field; transforming text with methods such as TF-IDF; training and comparing several classification models; and evaluating them with accuracy, precision, recall, and F1-score. The best model is then selected, retrained, and saved for later use in dedicated scripts for training and interactive category prediction.
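
To make the workflow above concrete, here is a minimal sketch of a title-to-category pipeline. It assumes scikit-learn is available; the four toy titles, their labels, and the choice of `LogisticRegression` are illustrative assumptions, not the notebook's real data or its final model.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy sample standing in for data/products.csv
toy = pd.DataFrame({
    "Product Title": [
        "apple iphone 8 plus 64gb silver",
        "samsung galaxy s8 64gb black smartphone",
        "bosch serie 4 freestanding dishwasher white",
        "beko slimline dishwasher 10 place settings",
    ],
    "Category Label": ["Mobile Phones", "Mobile Phones", "Dishwashers", "Dishwashers"],
})

# TF-IDF turns each title into a sparse vector of weighted word counts;
# the classifier (an assumed choice here) maps those vectors to categories.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(toy["Product Title"], toy["Category Label"])

# Predict the category of an unseen title
prediction = model.predict(["apple iphone 7 32gb"])[0]
print(prediction)
```

Wrapping the vectorizer and the classifier in one pipeline means the same text transformation is applied at training and at prediction time, which is convenient when the model is later saved and reused in a script.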

### Step 1 – Importing libraries 🧰
In this first step, the required Python libraries for data loading, exploration, and modeling will be imported. As the project evolves, additional libraries will be added here so that all dependencies are clearly grouped at the top of the notebook.

In [23]:
import pandas as pd

### Step 2 – Loading and exploring the data 📊
In this step, the product dataset is loaded from the `data/products.csv` file into a DataFrame, and the first rows are displayed. This gives an initial overview of the available columns and of the structure and content of the data before any cleaning or modeling.

In [None]:
df = pd.read_csv("data/products.csv")
print(df.head())

    product ID                                      Product Title  \
0            1                    apple iphone 8 plus 64gb silver   
1            2                apple iphone 8 plus 64 gb spacegrau   
2            3  apple mq8n2b/a iphone 8 plus 64gb 5.5 12mp sim...   
3            4                apple iphone 8 plus 64gb space grey   
4            5  apple iphone 8 plus gold 5.5 64gb 4g unlocked ...   
5            6  apple iphone 8 plus gold 5.5 64gb 4g unlocked ...   
6            7               apple iphone 8 plus 64 gb space grey   
7            8                apple iphone 8 plus 64gb space grey   
8            9                apple iphone 8 plus 64gb space grey   
9           10                apple iphone 8 plus 64gb space grey   
10          11  apple iphone 8 plus 5.5 single sim 4g 64gb silver   
11          12    sim free iphone 8 plus 64gb by apple space grey   
12          13           apple iphone 8 plus 64gb gold smartphone   
13          14    apple iphone 8 p

### Initial data overview 🔍

The first rows of the dataset show that each product has an ID, a textual title, a merchant identifier, a target category label, a product code, engagement information (number of views), a merchant rating and a listing date. The `Product Title` column will be the main source of information for text-based features, while `Category Label` will be used as the target variable for model training. At first glance, the sample rows do not show obvious missing values, but this will be confirmed more systematically in the next steps using summary statistics and null-value checks.
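
The systematic checks mentioned above could look like the following sketch. The rows are hypothetical; only the column names mirror the dataset, and the label header is spelled `' Category Label'` with a leading space, as in the notebook's cleaning cell.

```python
import pandas as pd

# Hypothetical rows; only the column names follow the notebook's dataset
sample = pd.DataFrame({
    "Product Title": ["apple iphone 8 plus 64gb silver", "bosch serie 4 dishwasher", None],
    " Category Label": ["Mobile Phones", "Dishwashers", "Mobile Phones"],
})

# Listing the headers as Python strings exposes stray spaces in their names
print(sample.columns.tolist())

# Number of missing values per column
missing = sample.isna().sum()
print(missing)

# Distribution of the target classes, counting missing labels as well
label_counts = sample[" Category Label"].value_counts(dropna=False)
print(label_counts)
```

Checking the class distribution early is useful because a strongly imbalanced target affects how metrics such as accuracy should be read later on.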


### Step 3 – Data cleaning and preprocessing 🧼

In [36]:
# df.info() prints its report directly and returns None, so it is not wrapped in print()
df.info()
print("-"*50)
# Missing values per column before cleaning
print(df.isna().sum())
print("-"*50)
# Drop rows lacking a title or a label; the label header has a leading space in the raw file
df = df.dropna(subset=['Product Title', ' Category Label'])
# Missing values per column after cleaning
print(df.isna().sum())

<class 'pandas.core.frame.DataFrame'>
Index: 35096 entries, 0 to 35310
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   product ID       35096 non-null  int64  
 1   Product Title    35096 non-null  object 
 2   Merchant ID      35096 non-null  int64  
 3    Category Label  35096 non-null  object 
 4   _Product Code    35002 non-null  object 
 5   Number_of_Views  35082 non-null  float64
 6   Merchant Rating  34926 non-null  float64
 7    Listing Date    35038 non-null  object 
dtypes: float64(2), int64(2), object(4)
memory usage: 2.4+ MB
--------------------------------------------------
product ID           0
Product Title        0
Merchant ID          0
 Category Label      0
_Product Code       94
Number_of_Views     14
Merchant Rating    170
 Listing Date       58
dtype: int64
--------------------------------------------------
product ID           0
Product Title        0
Merchant ID          0
 C

### Data cleaning summary 🧼

Rows with missing values in the key columns `Product Title` and `Category Label` have been removed, so every remaining record has both a product title and a target category label. Note that the raw header of the label column carries a leading space (`' Category Label'`), which is why the cleaning code refers to it in that form. Since this project predicts categories solely from product titles, the other columns (product code, number of views, merchant rating and listing date) are not used in the initial modeling phase, even though some of them still contain a small number of missing values.
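
As a sketch of the cleaning step described above, the example below strips stray whitespace from the header names, drops incomplete rows, and keeps only the two modeling columns. The rows and the names `raw`, `cleaned` and `model_df` are hypothetical, and renaming the headers is an optional convenience rather than something the notebook itself does.

```python
import pandas as pd

# Hypothetical rows; the headers imitate the raw file, including the leading space
raw = pd.DataFrame({
    "Product Title": ["apple iphone 8 plus 64gb silver", None, "bosch serie 4 dishwasher"],
    " Category Label": ["Mobile Phones", "Mobile Phones", None],
    "Merchant Rating": [4.5, None, 3.9],
})

# Strip leading/trailing whitespace from every column name
cleaned = raw.rename(columns=str.strip)

# Remove rows that lack a title or a label
cleaned = cleaned.dropna(subset=["Product Title", "Category Label"])

# Keep only the feature and target columns used for modeling
model_df = cleaned[["Product Title", "Category Label"]].reset_index(drop=True)
print(model_df)
```

Missing values in the unused columns, such as `Merchant Rating` here, need no handling because those columns are dropped from the modeling frame.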
