<a href="https://colab.research.google.com/github/Mohamed-Fedi-Belaid/Product_Category_Prediction_using_ML/blob/main/Copy_of_E_commerce_Text_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Full Machine Learning Pipeline

**We are expected to build an API that can predict possible categories for any product. We will use the product name and its description to find relevant categories.**

 E-commerce Text Classification

With TF-IDF and Word2Vec

**Overview**

The objective of the project is to classify e-commerce products into four categories based on their descriptions on e-commerce platforms. The categories are: Electronics, Household, Books, and Clothing & Accessories. We carried out the following steps in this notebook:

    Performed basic exploratory data analysis, comparing the distributions of the number of characters, number of words, and average word-length of descriptions of products from different categories.
    Employed several text normalization techniques on product descriptions.
    Used TF-IDF vectorizer on the normalized product descriptions for text vectorization, compared the baseline performance of several classifiers, and performed hyperparameter tuning on the support vector machine classifier with linear kernel.
    In a separate direction, applied only a few selected text normalization processes to the raw product descriptions, namely conversion to lowercase and substitution of contractions; used Google's pre-trained Word2Vec model on the tokens obtained from the partially normalized descriptions to get the embeddings, which were then converted to compressed sparse row (CSR) format; compared the baseline performance of several classifiers; and performed hyperparameter tuning on the XGBoost classifier.
    Employed the model with the highest validation accuracy to predict the labels of the test observations and obtained a test accuracy of 0.948939.
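
The TF-IDF step listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's actual code: the toy descriptions and labels are made up, `make_pipeline` is used for brevity, and `C=1.0` is a placeholder rather than a tuned value.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy descriptions standing in for the normalized product descriptions
# (invented for illustration, not taken from the dataset).
train_texts = [
    "wireless bluetooth headphones with noise cancellation",
    "usb charger cable for smartphone and laptop",
    "stainless steel kitchen knife set",
    "cotton bed sheet with pillow covers",
    "paperback novel by a bestselling author",
    "textbook on linear algebra for students",
    "men cotton casual shirt slim fit",
    "women leather handbag with zip pocket",
]
train_labels = [
    "Electronics", "Electronics",
    "Household", "Household",
    "Books", "Books",
    "Clothing & Accessories", "Clothing & Accessories",
]
val_texts = ["bluetooth speaker with usb charger", "hardcover novel for young readers"]
val_labels = ["Electronics", "Books"]

# TF-IDF vectorization followed by a support vector machine with a linear kernel.
model = make_pipeline(TfidfVectorizer(), SVC(kernel="linear", C=1.0))
model.fit(train_texts, train_labels)

# Validation accuracy is the quantity used to compare candidate models.
val_pred = model.predict(val_texts)
val_accuracy = accuracy_score(val_labels, val_pred)
print(list(val_pred), val_accuracy)
```

Wrapping the vectorizer and classifier in one pipeline keeps the vocabulary fitted on the training set only, so the validation and test descriptions are transformed with the same learned terms.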

**Contents**


    Introduction
        E-commerce Product Categorization
        Text Classification
        Data
        Project Objective
    Exploratory Data Analysis
        Class Frequencies
        Number of Characters
        Number of Words
        Average Word-length
    Train-Validation-Test Split
    Text Normalization
        Conversion to Lowercase
        Removal of Whitespaces
        Removal of Punctuations
        Removal of Unicode Characters
        Substitution of Acronyms
        Substitution of Contractions
        Removal of Stop Words
        Spelling Correction
        Stemming and Lemmatization
        Discardment of Non-alphabetic Words
        Retainment of Relevant Parts of Speech
        Removal of Additional Stop Words
        Integration of the Processes
        Implementation on Product Description
    TF-IDF Model
        Text Vectorization
        TF-IDF Baseline Modeling
        TF-IDF Hyperparameter Tuning
    Word2Vec Model
        Partial Text Normalization
        Word Embedding
        Word2Vec Baseline Modeling
        Word2Vec Hyperparameter Tuning
    Final Prediction and Evaluation
    Acknowledgements
    References



Importing libraries

In [None]:
# File system management
import time, psutil, os

# Data manipulation
import numpy as np
import pandas as pd

# Plotting and visualization
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
sns.set_theme()
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)

# NLP
import string, re, nltk
from string import punctuation
from nltk.tokenize import word_tokenize, RegexpTokenizer
from nltk.corpus import stopwords
!pip install num2words
from num2words import num2words
!pip install pyspellchecker
from spellchecker import SpellChecker
from nltk.stem.porter import PorterStemmer
import spacy
from nltk.stem import WordNetLemmatizer

# TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer

# Scipy
import scipy
from scipy import sparse
from scipy.sparse import csr_matrix

# Train-test split and cross validation
from sklearn.model_selection import train_test_split, ParameterGrid

# Classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import RidgeClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import AdaBoostClassifier

# Model evaluation
from sklearn import metrics
from sklearn.metrics import accuracy_score

# Others
import json
import gensim
from sklearn.decomposition import TruncatedSVD
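
The `gensim` and `csr_matrix` imports above support the Word2Vec direction described in the overview, where token embeddings are turned into a CSR feature matrix. The sketch below shows one common way to do that, averaging the vectors of the known tokens in each description. It rests on assumptions: the averaging rule is not confirmed by the text, `toy_vectors` is a hypothetical 3-dimensional lookup standing in for Google's pre-trained model, and `embed` is an illustrative helper name.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 3-dimensional vectors standing in for the pre-trained
# Word2Vec embeddings (the real model would be loaded with gensim).
toy_vectors = {
    "phone": np.array([1.0, 0.0, 0.0]),
    "charger": np.array([0.0, 1.0, 0.0]),
    "novel": np.array([0.0, 0.0, 1.0]),
}
DIM = 3


def embed(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


# Tokenized, partially normalized descriptions (invented for illustration).
docs = [["phone", "charger"], ["novel", "unknownword"], ["unknownword"]]

# Stack one averaged vector per description and store the result in CSR format.
X = csr_matrix(np.vstack([embed(d, toy_vectors, DIM) for d in docs]))
print(X.shape)
print(X.toarray())
```

Out-of-vocabulary tokens are skipped here, and a description with no known tokens maps to the zero vector; other choices, such as dropping such rows, are equally plausible.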