# Introduction

In this notebook, we build a TF-IDF + random forest model. It is a baseline that considers only text, specifically the product title.

How well can we predict the labels using only the title?

We hope that the combined models can beat the accuracy of this title-only model.

## Mounting directory

In [None]:
# mount the google drive
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
# change the working directory
%cd /content/gdrive/MyDrive/CPSC-4830 Group Project/Final/data

/content/gdrive/.shortcut-targets-by-id/1M6IvhLF8X3hPFkYeB6xMlQmOoHJZt56Z/CPSC-4830/Final/data


## Import Libraries

In [None]:
#import library
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.layers import Input, Dense, SimpleRNN
tf.config.run_functions_eagerly(True)

from sklearn.model_selection import train_test_split

import pandas as pd
import numpy as np
from PIL import Image
import matplotlib.pyplot as plt
%matplotlib inline

## Loading Data

In [None]:
train = pd.read_csv('final_train.csv', 
                    usecols = ["posting_id", "image", "image_phash", "title", "label_group", "title_translate"])
train.head(2)

Unnamed: 0,posting_id,image,image_phash,title,label_group,title_translate
0,train_129225211,0000a68812bc7e98c42888dfb1c07da0.jpg,94974f937d4c2433,Paper Bag Victoria Secret,249114794,Victoria's Secret Paper Bag
1,train_3386243561,00039780dfc94d01db8676fe789ecd05.jpg,af3f9460c2838f0f,"Double Tape 3M VHB 12 mm x 4,5 m ORIGINAL / DO...",2937985045,Double Tape 3M VHB 12 mm x 4.5 m ORIGINAL / DO...


In [None]:
validate = pd.read_csv('final_validation.csv', 
                       usecols = ["posting_id", "image", "image_phash", "title", "label_group", "title_translate"])
validate.head(2)

Unnamed: 0,posting_id,image,image_phash,title,label_group,title_translate
0,train_1003554842,560a5c3577fb22be2ac82c0e97558158.jpg,f3c78fce8c3050f0,Mustika Ratu Minyak Cem-Ceman 175 ml,3044373336,Mustika Ratu Oil Cem-Ceman 175 ml
1,train_523363809,dd1f14c7a734ff28b67062ae4f8529c6.jpg,af919a66c49d688b,Snobby Kelambu Box Bayi Snobby 1 Tiang KBX 1201,873493898,Snobby Baby Mosquito Net Snobby 1 Pole KBX 1201


In [None]:
print(f"Observation in Train: {len(train)}")
print(f"Observation in Validation: {len(validate)}")

Observation in Train: 29603
Observation in Validation: 4647


# Title pre-processing

Clean the text of the titles: remove emoji, punctuation, special characters, and stopwords, and convert everything to lowercase.

In [None]:
train['title_translate']

0                              Victoria's Secret Paper Bag
1        Double Tape 3M VHB 12 mm x 4.5 m ORIGINAL / DO...
2              Maling TTS Canned Pork Luncheon Meat 397 gr
3        Short sleeve Batik negligee - Random / Mixed P...
4                        Nescafe \xc3\x89clair Latte 220ml
                               ...                        
29598    Battery Battery Xiaomi Redmi Note 3 BM46 BM-46...
29599    Washable 75 gsm Non-Woven Spunbond Fabric Mask...
29600    KHANZAACC Robot RE101S 1.2mm Subwoofer Bass Me...
29601    Broth NON MSG HALAL Mama Kamu Free-range Chick...
29602    LEAK COATING FLEX TAPE / MAGIC ISOLATION / LEA...
Name: title_translate, Length: 29603, dtype: object

In [None]:
validate['title_translate']

0                       Mustika Ratu Oil Cem-Ceman 175 ml
1         Snobby Baby Mosquito Net Snobby 1 Pole KBX 1201
2                                           mini stoppers
3                      F916/304 Jelly Slides Wedges Shoes
4       \xe3\x80\x90CELEB\xe3\x80\x91100 Pcs Korean St...
                              ...                        
4642    Sweety Silver Pants BOYS GIRLS M30 L28 XL26 XX...
4643                                           Mayonnaise
4644        Fair N Pink Body Serum natural body whitening
4645                               Wholesale mask brushes
4646         Light Blue Skinny Pencil Men's Jeans 27 - 38
Name: title_translate, Length: 4647, dtype: object

In [None]:
import re, string
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


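TF-IDF (Term Frequency-Inverse Document Frequency) works on tokens as strings, so case matters unless the text is normalized: 'APPLE' is not counted the same as 'apple'. scikit-learn's `TfidfVectorizer` lowercases by default, but we lowercase explicitly during cleaning so the behavior does not depend on that setting.

We also don't want punctuation, stopwords, or short words of one or two characters.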
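To illustrate the point about case, here is a minimal sketch on two made-up documents (not taken from the dataset). It compares the vocabulary that `TfidfVectorizer` learns with lowercasing switched off against the default setting.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two toy documents, invented for illustration only
docs = ["APPLE juice", "apple pie"]

# With lowercasing disabled, 'APPLE' and 'apple' become separate features
cased = TfidfVectorizer(lowercase=False).fit(docs)
cased_vocab = sorted(cased.vocabulary_)

# With the default (lowercase=True), both fold into a single 'apple' feature
folded = TfidfVectorizer().fit(docs)
folded_vocab = sorted(folded.vocabulary_)

print(cased_vocab)   # ['APPLE', 'apple', 'juice', 'pie']
print(folded_vocab)  # ['apple', 'juice', 'pie']
```

Splitting one product word across two features dilutes its weight, which is why the cleaning step lowercases every title.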

In [None]:
# function to clean the translated titles before TF-IDF vectorization
# Build the stopword set once; calling stopwords.words() for every word is slow
STOPWORDS = set(stopwords.words('english'))

def clean_title(title):
    # Remove everything except ASCII letters, digits and whitespace, then lowercase
    clean1 = re.sub(r'[^a-zA-Z0-9\s]', '', title).lower()
    # Split the cleaned string into words
    clean2 = re.split(r'\W+', clean1)
    # Remove stopwords and words of one or two characters
    title_cleaned = [word for word in clean2 if word not in STOPWORDS and len(word) > 2]
    # Join the cleaned words using a space separator
    title_cleaned = ' '.join(title_cleaned)
    return title_cleaned

In [None]:
# clean the translated titles in the train and validation datasets
train['title_cleaned'] = train['title_translate'].apply(clean_title)
validate['title_cleaned'] = validate['title_translate'].apply(clean_title)

In [None]:
train.head(2)

Unnamed: 0,posting_id,image,image_phash,title,label_group,title_translate,title_cleaned
0,train_129225211,0000a68812bc7e98c42888dfb1c07da0.jpg,94974f937d4c2433,Paper Bag Victoria Secret,249114794,Victoria's Secret Paper Bag,victorias secret paper bag
1,train_3386243561,00039780dfc94d01db8676fe789ecd05.jpg,af3f9460c2838f0f,"Double Tape 3M VHB 12 mm x 4,5 m ORIGINAL / DO...",2937985045,Double Tape 3M VHB 12 mm x 4.5 m ORIGINAL / DO...,double tape vhb original double foam tape


In [None]:
validate.head(2)

Unnamed: 0,posting_id,image,image_phash,title,label_group,title_translate,title_cleaned
0,train_1003554842,560a5c3577fb22be2ac82c0e97558158.jpg,f3c78fce8c3050f0,Mustika Ratu Minyak Cem-Ceman 175 ml,3044373336,Mustika Ratu Oil Cem-Ceman 175 ml,mustika ratu oil cemceman 175
1,train_523363809,dd1f14c7a734ff28b67062ae4f8529c6.jpg,af919a66c49d688b,Snobby Kelambu Box Bayi Snobby 1 Tiang KBX 1201,873493898,Snobby Baby Mosquito Net Snobby 1 Pole KBX 1201,snobby baby mosquito net snobby pole kbx 1201


# Label Encoding

In [None]:
# label pre-processing: map each label_group to an integer class index
from sklearn.preprocessing import LabelEncoder 
le = LabelEncoder() 

# Convert labels to integers using LabelEncoder
y_train = le.fit_transform(train['label_group'])
y_train.shape

(29603,)

In [None]:
# Create a mapping table between integer labels and their original label_group values
label_mapping = {i: label for i, label in enumerate(le.classes_)}

labels_map = {v: k for k, v in label_mapping.items()}

# Convert the mapping table to a DataFrame
label_mapping_df = pd.DataFrame(list(label_mapping.items()), columns=['Encoded Label', 'Original Label'])

# Print the DataFrame
print(label_mapping_df)

       Encoded Label  Original Label
0                  0          258047
1                  1          297977
2                  2          645628
3                  3          801176
4                  4          887886
...              ...             ...
11009          11009      4292154092
11010          11010      4292520070
11011          11011      4292939171
11012          11012      4293276364
11013          11013      4294197112

[11014 rows x 2 columns]


In [None]:
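# Encode the validation labels with the encoder already fitted on the training labels.
# Re-fitting here (fit_transform) would assign different integers to the same label_group.
# transform() raises ValueError if validation contains a label_group not seen in training.
y_val = le.transform(validate['label_group'])

y_val.shape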

(4647,)
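Training and validation labels need to share a single integer mapping. The sketch below uses made-up label values (not from the dataset) to show how the integers differ when a `LabelEncoder` is fitted separately on a validation set that contains only some of the training labels.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up label_group values, for illustration only
train_labels = [100, 200, 300]
val_labels = [200, 300]  # a subset of the training labels

# Encoder fitted on the training labels and reused for validation
le_demo = LabelEncoder().fit(train_labels)
consistent = le_demo.transform(val_labels).tolist()

# A separate encoder fitted on the validation labels alone
refit = LabelEncoder().fit_transform(val_labels).tolist()

print(consistent)  # [1, 2]
print(refit)       # [0, 1]
```

With a separately fitted encoder, label 200 becomes 0 instead of 1, so predictions from a model trained on the training encoding would be scored against the wrong classes.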

## Pre-Processing

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
X_text_train = train['title_cleaned']
X_text_val = validate['title_cleaned']

# define the TfidfVectorizer to transform the text input
tfidf = TfidfVectorizer()

text_train = tfidf.fit_transform(X_text_train)
df_train = pd.DataFrame(text_train.toarray())

text_val = tfidf.transform(X_text_val)
df_val = pd.DataFrame(text_val.toarray())

In [None]:
text_train.shape

In [None]:
text_val.shape

## Apply PCA

This step takes about 46 minutes and needs about 45 GB of RAM.

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(df_train)
train_scale = scaler.transform(df_train)
val_scale = scaler.transform(df_val)

In [None]:
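from sklearn.decomposition import PCA
# A float in (0, 1) keeps enough components to explain that share of the variance (here 95%)
pca = PCA(.95)
# Fit on the training data only, then apply the same projection to both sets
pca.fit(train_scale)
X_train = pca.transform(train_scale)
X_val = pca.transform(val_scale)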

In [None]:
X_train.shape

(29603, 13572)

In [None]:
X_val.shape

(4647, 13572)
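The number of columns above follows from `PCA(.95)`: a fractional `n_components` asks scikit-learn to keep the smallest number of components whose cumulative explained variance reaches that fraction. A minimal sketch on random toy data (the shapes and scaling are assumptions, not the notebook's data):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Toy data: 200 rows, 10 columns with decreasing spread
X_toy = rng.normal(size=(200, 10)) * np.arange(10, 0, -1)

# Let PCA choose the number of components for 95% explained variance
pca_95 = PCA(0.95).fit(X_toy)

# Reproduce the choice by hand from the full spectrum
full = PCA().fit(X_toy)
cumulative = np.cumsum(full.explained_variance_ratio_)
k = int(np.searchsorted(cumulative, 0.95) + 1)  # smallest k reaching 0.95

print(pca_95.n_components_, k)
```

Both numbers should agree, and `cumulative[k - 1]` is the share of variance actually retained.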

In [None]:
train_df_out = pd.DataFrame(X_train)
train_df_out.to_csv("X_train.csv")

In [None]:
val_df_out = pd.DataFrame(X_val)
val_df_out.to_csv("X_val.csv")

# The Model

In [None]:
X_train = pd.read_csv("X_train.csv")
X_train.drop(columns = "Unnamed: 0", inplace = True)
X_train.head(2)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,13562,13563,13564,13565,13566,13567,13568,13569,13570,13571
0,-0.253422,-0.031177,-0.02864,-0.026968,-0.037289,-0.209858,-0.03223,-0.009957,-0.070927,-0.017181,...,0.424732,0.029016,0.03845,-0.343325,-0.242661,0.195032,-0.385963,-0.2481,-0.363485,0.160394
1,-0.277367,-0.035601,-0.036659,-0.019529,-0.047698,-0.2293,-0.034749,0.00743,-0.066559,-0.023683,...,-0.68338,-0.861281,0.452239,-0.773648,1.034412,0.102345,-0.392756,0.017074,-0.506227,0.060359


In [None]:
X_val = pd.read_csv("X_val.csv")
X_val.drop(columns = "Unnamed: 0", inplace = True)
X_val.head(2)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,13562,13563,13564,13565,13566,13567,13568,13569,13570,13571
0,-0.285498,-0.037147,-0.033968,-0.030544,-0.042355,-0.245498,-0.039771,-0.01056,-0.081412,-0.021569,...,-0.389414,-0.519038,-0.127282,0.952183,-0.22143,1.010981,0.859436,-0.553723,0.22291,1.128213
1,-0.271178,-0.03759,-0.034202,-0.032166,-0.040076,-0.231526,-0.040037,-0.011772,-0.067312,-0.022284,...,0.796135,-2.014444,-0.720362,0.885356,1.175679,0.482628,-0.073147,-0.881133,-1.327702,0.005276


In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

## Attempt 1: Crashed after 8 hours with 8 of 10 trees built

In [None]:
rf = RandomForestClassifier(n_estimators = 10, max_depth = 1000, verbose = 5, n_jobs = -1)
rf_model = rf.fit(X_train, y_train)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.


building tree 1 of 10
building tree 2 of 10
building tree 3 of 10
building tree 4 of 10
building tree 5 of 10
building tree 6 of 10
building tree 7 of 10
building tree 8 of 10


In [None]:
y_pred = rf_model.predict(X_val)
accuracy = accuracy_score(y_val, y_pred)

## Attempt 2: Reduced dataset to 5000 columns

In [None]:
X_train2 = X_train.iloc[:, :5000]
X_val2 = X_val.iloc[:, :5000]

In [None]:
X_train2.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,4990,4991,4992,4993,4994,4995,4996,4997,4998,4999
0,-0.253422,-0.031177,-0.02864,-0.026968,-0.037289,-0.209858,-0.03223,-0.009957,-0.070927,-0.017181,...,1.522785,-0.268626,0.996037,-0.33824,0.328293,-0.241916,-0.49556,-0.287368,0.007088,-0.287823
1,-0.277367,-0.035601,-0.036659,-0.019529,-0.047698,-0.2293,-0.034749,0.00743,-0.066559,-0.023683,...,0.372919,0.144309,0.105368,-0.000711,0.107727,0.049159,-0.113521,-0.022424,-0.19273,0.789605
2,-0.322093,-0.041445,-0.037522,-0.034789,-0.047873,-0.286515,-0.045221,-0.013196,-0.097433,-0.025667,...,-1.04258,-3.832689,0.586747,-0.264264,0.06435,0.188845,2.706882,2.412411,-0.345094,0.256879
3,-0.378518,-0.054549,-0.052149,-0.048437,-0.068742,-0.342469,-0.064422,-0.018635,-0.112161,-0.030143,...,0.487114,0.988579,-0.141452,1.207346,0.177311,-0.056699,0.371752,0.152829,-0.765678,1.201429
4,-0.255591,-0.03071,-0.028802,-0.026325,-0.035623,-0.215221,-0.03075,-0.010814,-0.068941,-0.017107,...,-0.587278,-2.96356,1.443044,-4.730629,-6.399069,0.152627,0.107068,7.361832,-3.695039,6.611881


In [None]:
rf = RandomForestClassifier(n_estimators = 4, max_depth = 800, verbose = 5, n_jobs = -1)
rf_model = rf.fit(X_train2, y_train)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.


building tree 1 of 4
building tree 2 of 4
building tree 3 of 4
building tree 4 of 4


[Parallel(n_jobs=-1)]: Done   2 out of   4 | elapsed: 111.9min remaining: 111.9min
[Parallel(n_jobs=-1)]: Done   4 out of   4 | elapsed: 127.9min remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   4 out of   4 | elapsed: 127.9min finished


In [None]:
y_pred = rf_model.predict(X_val2)

[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   2 out of   4 | elapsed:    1.1s remaining:    1.1s
[Parallel(n_jobs=4)]: Done   4 out of   4 | elapsed:    1.3s remaining:    0.0s
[Parallel(n_jobs=4)]: Done   4 out of   4 | elapsed:    1.3s finished


In [None]:
accuracy = accuracy_score(y_val, y_pred)
accuracy

0.00021519259737465033
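To put this accuracy in context, the arithmetic below converts it back into a count of correct predictions, using the validation size and the number of label groups reported earlier. The uniform-guessing rate is only a rough reference point and assumes every class is equally likely.

```python
n_val = 4647        # validation rows reported earlier
n_classes = 11014   # label groups in the mapping table
accuracy = 0.00021519259737465033

# Number of validation rows predicted correctly
correct = round(accuracy * n_val)

# Expected accuracy of uniform random guessing (an assumption for comparison)
chance = 1 / n_classes

print(correct)  # 1
print(f"{chance:.6f}")
```

In other words, the model got exactly one validation row right.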

# Best Parameters

A grid search for the best hyperparameters is simply not practical here, given how long it takes to build even a single tree.

In [None]:
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import GridSearchCV

In [None]:
param = {
    'criterion': ['gini', 'entropy'], # log_loss and entropy are both for the Shannon information gain
    'n_estimators': [50, 100, 150, 200],
    'max_depth': [None, 300] # None (not the string 'none') means no depth limit
}

rfc = RandomForestClassifier()
cv = RepeatedKFold(n_splits=5, n_repeats=2, random_state=123)
# verbose is a GridSearchCV argument; passed to fit() it would be forwarded to
# RandomForestClassifier.fit, which does not accept it
search_rfc = GridSearchCV(rfc, param, scoring='accuracy', n_jobs=-1, cv=cv, verbose=5)

result_rfc = search_rfc.fit(text_train, y_train)

# summarize result
print('Best Score: %s' % result_rfc.best_score_)
print('Best Hyperparameters: %s' % result_rfc.best_params_)
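A rough count of the work involved shows why this search is impractical. The sketch below mirrors the grid and cross-validation settings above and counts the model fits and trees required, not counting the final refit on the best parameters.

```python
from itertools import product

# Same grid shape as above: 2 criteria x 4 forest sizes x 2 depth settings
grid = {
    'criterion': ['gini', 'entropy'],
    'n_estimators': [50, 100, 150, 200],
    'max_depth': [None, 300],
}
n_splits, n_repeats = 5, 2  # RepeatedKFold settings

candidates = list(product(*grid.values()))
n_candidates = len(candidates)
n_fits = n_candidates * n_splits * n_repeats

# Position 1 of each candidate tuple is n_estimators
total_trees = sum(c[1] for c in candidates) * n_splits * n_repeats

print(n_candidates, n_fits, total_trees)  # 16 160 20000
```

Against the tree-building times logged in the attempts above, 20,000 trees is far out of reach.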