<a href="https://colab.research.google.com/github/Bojescu/product_category_classifier/blob/main/notebook/eda_and_modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 1. Load and Inspect the Dataset

The dataset is loaded directly from the GitHub repository using its **raw URL**.  
Column names are renamed to Python-friendly `snake_case` identifiers, removing embedded spaces and stray leading or trailing whitespace and underscores.  

**Steps performed:**
- Load the CSV file into a Pandas DataFrame.
- Rename columns (`product ID` → `product_id`, `Product Title` → `title`, etc.).
- Print the dataset shape (rows × columns).
- Display the first 5 rows for a quick overview.
- Show dataset information (`dtypes`, non-null counts) to identify missing values and data types.


In [9]:
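import pandas as pd
from IPython.display import display  # explicit import so the cell also runs outside a notebook

# Raw URL of the CSV file in the GitHub repository
url = "https://raw.githubusercontent.com/Bojescu/product_category_classifier/main/data/products.csv"

# Map the raw headers (some carry stray spaces or a leading underscore) to snake_case names
df = pd.read_csv(url).rename(columns={
    'product ID':'product_id',
    'Product Title':'title',
    'Merchant ID':'merchant_id',
    ' Category Label':'category',
    '_Product Code':'product_code',
    'Number_of_Views':'views',
    'Merchant Rating':'merchant_rating',
    ' Listing Date  ':'listing_date',
})

# Dataset shape as (rows, columns)
print(df.shape)

print("\nFirst 5 rows:")
display(df.head())

# Column dtypes and non-null counts
print("\nDataset info:")
df.info()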


(35311, 8)

First 5 rows:


|   | product_id | title | merchant_id | category | product_code | views | merchant_rating | listing_date |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | apple iphone 8 plus 64gb silver | 1 | Mobile Phones | QA-2276-XC | 860.0 | 2.5 | 5/10/2024 |
| 1 | 2 | apple iphone 8 plus 64 gb spacegrau | 2 | Mobile Phones | KA-2501-QO | 3772.0 | 4.8 | 12/31/2024 |
| 2 | 3 | apple mq8n2b/a iphone 8 plus 64gb 5.5 12mp sim... | 3 | Mobile Phones | FP-8086-IE | 3092.0 | 3.9 | 11/10/2024 |
| 3 | 4 | apple iphone 8 plus 64gb space grey | 4 | Mobile Phones | YI-0086-US | 466.0 | 3.4 | 5/2/2022 |
| 4 | 5 | apple iphone 8 plus gold 5.5 64gb 4g unlocked ... | 5 | Mobile Phones | NZ-3586-WP | 4426.0 | 1.6 | 4/12/2023 |



Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35311 entries, 0 to 35310
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   product_id       35311 non-null  int64  
 1   title            35139 non-null  object 
 2   merchant_id      35311 non-null  int64  
 3   category         35267 non-null  object 
 4   product_code     35216 non-null  object 
 5   views            35297 non-null  float64
 6   merchant_rating  35141 non-null  float64
 7   listing_date     35252 non-null  object 
dtypes: float64(2), int64(2), object(4)
memory usage: 2.2+ MB
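
The explicit rename map above can also be approximated with a generic rule. The following is a minimal sketch, not the notebook's method: `normalize_column_name` is a hypothetical helper, and it is applied to an empty toy DataFrame that only reuses some of the raw header strings from the rename map.

```python
import re

import pandas as pd


def normalize_column_name(name: str) -> str:
    """Convert a raw header such as ' Listing Date  ' to 'listing_date'."""
    # Trim surrounding whitespace and underscores, then lowercase.
    cleaned = name.strip().strip("_").lower()
    # Collapse every run of non-alphanumeric characters into one underscore.
    return re.sub(r"[^0-9a-z]+", "_", cleaned).strip("_")


# Raw headers copied from the rename map; the frame itself holds no data.
raw_columns = [
    "product ID",
    "Product Title",
    " Category Label",
    "_Product Code",
    "Number_of_Views",
    " Listing Date  ",
]
toy = pd.DataFrame(columns=raw_columns)

# DataFrame.rename accepts a callable that is applied to each column label.
toy = toy.rename(columns=normalize_column_name)
print(list(toy.columns))
```

A mechanical rule like this yields names such as `product_title` and `category_label`, whereas the explicit map in the notebook chooses shorter names (`title`, `category`). The explicit map is the safer choice when the final names matter downstream.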


## 2. Missing Values and Data Types

This step counts the missing values in each column and inspects each column's data type.  
This summary helps to:
- Detect columns with null values that may require cleaning or imputation.
- Confirm that each column has the expected data type.
- Provide an overview of dataset completeness before feature engineering.

The resulting table displays:
- **dtype**: the data type of the column.  
- **non_null**: the number of non-missing entries.  
- **nulls**: the number of missing entries.



In [10]:
# Missing values per column
df_info = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "non_null": df.notna().sum(),
    "nulls": df.isna().sum(),
})
df_info


|   | dtype | non_null | nulls |
|---|---|---|---|
| product_id | int64 | 35311 | 0 |
| title | object | 35139 | 172 |
| merchant_id | int64 | 35311 | 0 |
| category | object | 35267 | 44 |
| product_code | object | 35216 | 95 |
| views | float64 | 35297 | 14 |
| merchant_rating | float64 | 35141 | 170 |
| listing_date | object | 35252 | 59 |
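
The summary can be extended with a missing-value percentage and followed by a dtype clean-up. The sketch below runs on a small made-up DataFrame rather than the real `df`. `summarize_missing` is a hypothetical helper, and the `%m/%d/%Y` date format is an assumption based on the sample rows shown earlier (for example `12/31/2024`).

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the notebook's `df`; all values are made up.
toy = pd.DataFrame({
    "title": ["phone a", None, None, "phone d"],
    "views": [860.0, np.nan, 3092.0, 466.0],
    "listing_date": ["5/10/2024", "12/31/2024", None, "not a date"],
})


def summarize_missing(frame: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, non-null count, null count and null share per column."""
    summary = pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "non_null": frame.notna().sum(),
        "nulls": frame.isna().sum(),
    })
    # Share of missing rows, as a percentage of all rows.
    summary["null_pct"] = (summary["nulls"] / len(frame) * 100).round(2)
    # Put the columns with the most missing values first.
    return summary.sort_values("nulls", ascending=False)


summary = summarize_missing(toy)
print(summary)

# Convert columns to the types one would expect for this kind of data.
typed = toy.assign(
    # Nullable integer dtype keeps missing values as <NA> instead of forcing float64.
    views=toy["views"].astype("Int64"),
    # Unparseable or missing dates become NaT rather than raising an error.
    listing_date=pd.to_datetime(
        toy["listing_date"], format="%m/%d/%Y", errors="coerce"
    ),
)
print(typed.dtypes)
```

In the real dataset the largest gap is `title`, with 172 of 35,311 rows missing (about 0.49%). Note that `errors="coerce"` can raise the null count of a date column, because malformed strings turn into `NaT`, so it is worth comparing the counts before and after conversion.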
