# Machine learning project
## Predicting the category of a product
**Author**: Sanjin Jareb

## Step 1:
Load the dataset we will work with.

In [34]:
import pandas as pd

df = pd.read_csv('products.csv')

# Display the first 5 rows of the DataFrame
print(df.head())

   product ID                                      Product Title  Merchant ID  \
0           1                    apple iphone 8 plus 64gb silver            1   
1           2                apple iphone 8 plus 64 gb spacegrau            2   
2           3  apple mq8n2b/a iphone 8 plus 64gb 5.5 12mp sim...            3   
3           4                apple iphone 8 plus 64gb space grey            4   
4           5  apple iphone 8 plus gold 5.5 64gb 4g unlocked ...            5   

   Category Label _Product Code  Number_of_Views  Merchant Rating  \
0   Mobile Phones    QA-2276-XC            860.0              2.5   
1   Mobile Phones    KA-2501-QO           3772.0              4.8   
2   Mobile Phones    FP-8086-IE           3092.0              3.9   
3   Mobile Phones    YI-0086-US            466.0              3.4   
4   Mobile Phones    NZ-3586-WP           4426.0              1.6   

   Listing Date    
0       5/10/2024  
1      12/31/2024  
2      11/10/2024  
3        5/2/2022 

## Step 2:
Inspect the values in the DataFrame and standardize them.

In [35]:
# Display the data types of each column
print(df.dtypes)
# Check for missing values in each column
print(df.isnull().sum())

product ID           int64
Product Title       object
Merchant ID          int64
 Category Label     object
_Product Code       object
Number_of_Views    float64
Merchant Rating    float64
 Listing Date       object
dtype: object
product ID           0
Product Title      172
Merchant ID          0
 Category Label     44
_Product Code       95
Number_of_Views     14
Merchant Rating    170
 Listing Date       59
dtype: int64


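As we can see, several columns contain missing values. "Product Title" is the input for our prediction and "Category Label" is the target, so we will delete every row that is missing a value in either of these columns: **"Product Title", "Category Label"**. The other columns with missing values are not needed for this task, so we will drop them entirely.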
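
To illustrate the row-deletion rule, here is a small sketch on a made-up DataFrame (the rows and values are hypothetical, not taken from `products.csv`). Only rows missing a title or a label are removed; a missing value in another column does not cause a row to be dropped.

```python
import pandas as pd

# Hypothetical toy data: one missing title, one missing label, one missing rating
toy = pd.DataFrame({
    "Product Title": ["apple iphone 8", None, "bosch dishwasher", "lg tv"],
    "Category Label": ["Mobile Phones", "Mobile Phones", None, "TVs"],
    "Merchant Rating": [2.5, 4.8, 3.9, None],
})

# Drop rows only when one of the two important columns is missing
cleaned = toy.dropna(subset=["Product Title", "Category Label"]).reset_index(drop=True)

print(cleaned)
print(len(toy) - len(cleaned), "rows dropped")
```

The row for "lg tv" survives even though its rating is missing, because `subset` restricts the check to the listed columns.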

In [36]:
# Standardize column names by stripping leading/trailing spaces and the leading underscore
df.columns = df.columns.str.strip().str.lstrip('_')

# Delete the columns we do not need
columns_to_delete = ['Number_of_Views', 'Merchant Rating', 'Listing Date', 'Product Code']
df.drop(columns=columns_to_delete, inplace=True)

# Deleting rows with missing values in important columns
important_columns = ['Product Title', 'Category Label']
df.dropna(subset=important_columns, inplace=True)

# Resetting the index after row deletions
df.reset_index(drop=True, inplace=True)
# Making sure there are no missing values
print(df.isnull().sum())

product ID        0
Product Title     0
Merchant ID       0
Category Label    0
dtype: int64


The DataFrame now contains only the columns that matter for this research. We display it to get a clear view of the new DataFrame.

In [37]:
print(df.head(10))

   product ID                                      Product Title  Merchant ID  \
0           1                    apple iphone 8 plus 64gb silver            1   
1           2                apple iphone 8 plus 64 gb spacegrau            2   
2           3  apple mq8n2b/a iphone 8 plus 64gb 5.5 12mp sim...            3   
3           4                apple iphone 8 plus 64gb space grey            4   
4           5  apple iphone 8 plus gold 5.5 64gb 4g unlocked ...            5   
5           6  apple iphone 8 plus gold 5.5 64gb 4g unlocked ...            6   
6           7               apple iphone 8 plus 64 gb space grey            7   
7           8                apple iphone 8 plus 64gb space grey            8   
8           9                apple iphone 8 plus 64gb space grey            9   
9          10                apple iphone 8 plus 64gb space grey           10   

  Category Label  
0  Mobile Phones  
1  Mobile Phones  
2  Mobile Phones  
3  Mobile Phones  
4  Mobile Pho

In [38]:
# Show the category counts before standardization
print(df['Category Label'].value_counts())

# Standardize names in the category column
df['Category Label'] = df['Category Label'].replace({
    'Fridges': 'Fridge Freezers',
    'Freezers': 'Fridge Freezers',
    'fridge': 'Fridge Freezers',
    'Mobile Phone': 'Mobile Phones',
    'CPUs': 'CPU',
})
print('-'*40)
print(df['Category Label'].value_counts())

Category Label
Fridge Freezers     5470
Washing Machines    4015
Mobile Phones       4002
CPUs                3747
TVs                 3541
Fridges             3436
Dishwashers         3405
Digital Cameras     2689
Microwaves          2328
Freezers            2201
fridge               123
CPU                   84
Mobile Phone          55
Name: count, dtype: int64
----------------------------------------
Category Label
Fridge Freezers     11230
Mobile Phones        4057
Washing Machines     4015
CPU                  3831
TVs                  3541
Dishwashers          3405
Digital Cameras      2689
Microwaves           2328
Name: count, dtype: int64
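
The label clean-up above can be sketched in isolation. The example below uses a hypothetical handful of labels and the same mapping, then checks that no unexpected label remains. The `expected` set is an assumption added for illustration.

```python
import pandas as pd

# Hypothetical sample of raw labels containing the variants seen above
labels = pd.Series(["Fridges", "fridge", "Freezers", "Mobile Phone", "CPUs", "TVs"])

mapping = {
    "Fridges": "Fridge Freezers",
    "Freezers": "Fridge Freezers",
    "fridge": "Fridge Freezers",
    "Mobile Phone": "Mobile Phones",
    "CPUs": "CPU",
}

# replace() with a dict swaps exact matches and leaves other values untouched
normalized = labels.replace(mapping)
counts = normalized.value_counts()
print(counts)

# Assumed set of valid categories for this toy sample
expected = {"Fridge Freezers", "Mobile Phones", "CPU", "TVs"}
unexpected = set(normalized) - expected
print("Unexpected labels:", unexpected)
```

Note that merging "Fridges" and "Freezers" into "Fridge Freezers" is a modeling choice: it makes that class much larger than the others, which is one reason to look at per-class metrics later.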


## Step 3: 
Now it is time to train and test several models. Based on their performance, we will then choose one to work with.
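
Because the categories differ a lot in size, the split in this step uses `stratify=y`. The sketch below, on hypothetical toy data with an 80/20 class ratio, shows what stratification is meant to guarantee: both the training and the test part keep roughly the same class proportions.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 80 phone titles and 20 CPU titles
titles = [f"phone {i}" for i in range(80)] + [f"cpu {i}" for i in range(20)]
labels = ["Mobile Phones"] * 80 + ["CPU"] * 20

# stratify=labels keeps the class ratio the same in both parts
X_train, X_test, y_train, y_test = train_test_split(
    titles, labels, test_size=0.2, random_state=42, stratify=labels
)

print("train:", Counter(y_train))
print("test: ", Counter(y_test))
```

Without stratification, a small class could end up under-represented in the test set, which would make its per-class scores less reliable.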

In [41]:
# Splitting the dataset into features and target variable
X = df[['Product Title']]
y = df['Category Label']

# Train-test split with stratification
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Preprocessing and model pipeline
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

preprocessor = ColumnTransformer(
    transformers=[
        ("title", TfidfVectorizer(), "Product Title")
    ]
)

# Import the models to test
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report
 
 
# List of classifiers
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": MultinomialNB(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Support Vector Machine": LinearSVC()
}

# Training and evaluating each model
for name, model in models.items():
    pipeline = Pipeline([
        ("preprocessor", preprocessor),
        ("classifier", model)
    ])
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    # Display the classification report for each model so we can compare their performance
    print(classification_report(y_test, y_pred, zero_division=0))

                  precision    recall  f1-score   support

             CPU       1.00      0.99      1.00       766
 Digital Cameras       1.00      0.99      0.99       538
     Dishwashers       0.96      0.91      0.94       681
 Fridge Freezers       0.94      0.99      0.97      2246
      Microwaves       1.00      0.94      0.97       466
   Mobile Phones       0.99      0.99      0.99       812
             TVs       0.98      0.98      0.98       708
Washing Machines       0.99      0.92      0.95       803

        accuracy                           0.97      7020
       macro avg       0.98      0.97      0.97      7020
    weighted avg       0.97      0.97      0.97      7020

                  precision    recall  f1-score   support

             CPU       1.00      1.00      1.00       766
 Digital Cameras       1.00      0.99      0.99       538
     Dishwashers       0.98      0.89      0.93       681
 Fridge Freezers       0.94      1.00      0.97      2246
      Micr
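
One detail of the preprocessor used in this step is worth a short illustration. `TfidfVectorizer` expects a one-dimensional sequence of strings, so the column is passed to `ColumnTransformer` as the plain string `"Product Title"` rather than as a list. A minimal sketch on a hypothetical three-row DataFrame:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy titles, not taken from the real dataset
toy = pd.DataFrame({
    "Product Title": [
        "apple iphone 8 plus",
        "bosch fridge freezer",
        "samsung tv 55 inch",
    ]
})

# A string column selector hands the vectorizer a 1-D column of text
preprocessor = ColumnTransformer(
    transformers=[("title", TfidfVectorizer(), "Product Title")]
)

matrix = preprocessor.fit_transform(toy)
print(matrix.shape)  # (number of titles, number of distinct tokens)
```

Each row of the result is one title, and each column is the TF-IDF weight of one token learned from the titles.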

## Conclusion
The last model (***LinearSVC***) is the best choice here, for three reasons:  
1. Its accuracy is the highest (0.98)  
2. Its macro F1 score is the highest (0.98)
3. It has the best balance across all classes
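
The comparison above was read from the printed reports. As an alternative, the two headline metrics could be collected into one table. The sketch below does this on hypothetical toy titles with only three of the models, so its numbers say nothing about the real dataset. All data and variable names in it are assumptions for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for the real train/test split
train_titles = [
    "apple iphone 8 plus 64gb", "samsung galaxy s9 phone", "nokia phone dual sim",
    "intel core i7 cpu", "amd ryzen 5 cpu processor", "intel core i5 processor",
    "samsung 55 inch tv", "lg oled tv 4k", "sony smart tv 43 inch",
]
train_labels = ["Mobile Phones"] * 3 + ["CPU"] * 3 + ["TVs"] * 3
test_titles = ["apple iphone phone", "amd cpu processor", "lg smart tv"]
test_labels = ["Mobile Phones", "CPU", "TVs"]

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": MultinomialNB(),
    "Support Vector Machine": LinearSVC(),
}

rows = []
for name, model in models.items():
    # Same idea as in Step 3: TF-IDF features followed by a classifier
    pipe = make_pipeline(TfidfVectorizer(), model)
    pipe.fit(train_titles, train_labels)
    pred = pipe.predict(test_titles)
    rows.append({
        "model": name,
        "accuracy": accuracy_score(test_labels, pred),
        "macro_f1": f1_score(test_labels, pred, average="macro", zero_division=0),
    })

# One row per model, best macro F1 first
summary = pd.DataFrame(rows).sort_values("macro_f1", ascending=False)
print(summary.to_string(index=False))
```

A table like this labels each result with its model name, which avoids having to match unlabeled reports to models by their position in the output.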