# Project 6: Develop an automatic classification engine for consumer goods.
*Pierre-eloi Ragetly*

This project is part of the Data Scientist path offered by OpenClassrooms.



In [1]:
# Import usual libraries
import numpy as np
import pandas as pd
import os
import time

# to make this notebook's output stable across runs
np.random.seed(89)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams.update({'axes.titleweight': 'bold',
                     'axes.titlesize': 16,
                     'axes.labelsize': 14,
                     'xtick.labelsize': 12,
                     'ytick.labelsize': 12})

# Where to save the figures
def save_fig(fig_id, tight_layout=True):
    folder_path = "charts"
    # create the output folder on first use
    os.makedirs(folder_path, exist_ok=True)
    path = os.path.join(folder_path, fig_id + ".png")
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format='png', dpi=300)

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Text-processing" data-toc-modified-id="Text-processing-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Text processing</a></span><ul class="toc-item"><li><span><a href="#Tokenization" data-toc-modified-id="Tokenization-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Tokenization</a></span></li><li><span><a href="#Remove-numbers" data-toc-modified-id="Remove-numbers-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Remove numbers</a></span></li><li><span><a href="#Lower-casing" data-toc-modified-id="Lower-casing-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Lower casing</a></span></li><li><span><a href="#Stop-words-removal" data-toc-modified-id="Stop-words-removal-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Stop words removal</a></span></li><li><span><a href="#Lemmatization" data-toc-modified-id="Lemmatization-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Lemmatization</a></span></li></ul></li></ul></div>

# Get data

In [2]:
data = (pd.read_csv("data/Flipkart/flipkart_com-ecommerce_sample_1050.csv")
          .set_index('uniq_id'))

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1050 entries, 55b85ea15a1536d46b7190ad6fff8ce7 to f2f027ad6a6df617c9f125173da71e44
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   crawl_timestamp          1050 non-null   object 
 1   product_url              1050 non-null   object 
 2   product_name             1050 non-null   object 
 3   product_category_tree    1050 non-null   object 
 4   pid                      1050 non-null   object 
 5   retail_price             1049 non-null   float64
 6   discounted_price         1049 non-null   float64
 7   image                    1050 non-null   object 
 8   is_FK_Advantage_product  1050 non-null   bool   
 9   description              1050 non-null   object 
 10  product_rating           1050 non-null   object 
 11  overall_rating           1050 non-null   object 
 12  brand                    712 non-null    object 
 13  product_specifications  

The most promising attribute for automating the classification of goods is the *description* feature. However, since it contains free text, it cannot be used directly. Let's see how to handle it.

## Text processing

Before applying any Machine Learning algorithm to text data, the text must be transformed into something the algorithm can digest. This process is called text preprocessing and includes several steps:
1. Tokenization &ndash; convert sentences to words;
2. Remove unnecessary punctuation and numbers;
3. Lower casing &ndash; convert each word to lower case;
4. Remove stop words &ndash; frequent words such as "the", "a", "is";
5. Use *Stemming* or *Lemmatization* to convert each word to its base form.
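
The steps above can be sketched as a single function. This is only a minimal illustration, not the code used in the rest of this notebook: it relies on the standard `re` module instead of NLTK, `DEMO_STOP_WORDS` is a tiny made-up list, and the `normalize` argument is a hypothetical hook where a stemmer or lemmatizer could be plugged in.

```python
import re

# Tiny illustrative stop-word set (an assumption; a real list is much longer).
DEMO_STOP_WORDS = {"the", "a", "is", "and", "of", "for"}


def preprocess(text, stop_words=DEMO_STOP_WORDS, normalize=lambda w: w):
    """Apply the five preprocessing steps to one document."""
    # 1. tokenization (the regex also drops punctuation, part of step 2)
    tokens = re.findall(r"\w+", text)
    # 2. drop every token that is not purely alphabetic (numbers, "350ml", ...)
    tokens = [t for t in tokens if t.isalpha()]
    # 3. lower casing
    tokens = [t.lower() for t in tokens]
    # 4. stop words removal
    tokens = [t for t in tokens if t not in stop_words]
    # 5. stemming / lemmatization hook (identity by default)
    return [normalize(t) for t in tokens]


print(preprocess("The Bottle is made of 100 percent Steel"))
# ['bottle', 'made', 'percent', 'steel']
```

Each step is detailed in the following sections.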

### Tokenization

Tokenization is the process of splitting text into smaller units, called tokens. The easiest way is whitespace tokenization, which splits the text on the whitespace between words.  
The most commonly used function is `word_tokenize()` from the *NLTK* (Natural Language ToolKit) Python library. This function splits tokens based on whitespace and some punctuation marks like `.` and `'`, but **not all** of them. Moreover, the way contractions like "isn't" are split depends on the contraction itself, which can make the stop words removal process (see section 1.4) really painful. For these reasons, it is preferable to use regular expressions (regex) and split the text by keeping alphanumeric characters only.
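
To make the difference concrete, here is a small sketch comparing whitespace splitting with the `\w+` pattern. It uses the standard `re` module rather than NLTK so that it runs without any extra download; `re.findall` with the same pattern is expected to give the same tokens as `RegexpTokenizer(r'\w+')`. The sample sentence is made up.

```python
import re

text = "This isn't a 100% cotton T-shirt."

# whitespace tokenization: punctuation stays glued to the words
whitespace_tokens = text.split()
# regex tokenization: keep only runs of word characters (letters, digits, "_")
regex_tokens = re.findall(r"\w+", text)

print(whitespace_tokens)
# ['This', "isn't", 'a', '100%', 'cotton', 'T-shirt.']
print(regex_tokens)
# ['This', 'isn', 't', 'a', '100', 'cotton', 'T', 'shirt']
```

Note that the regex approach also breaks contractions and hyphenated words into pieces, and that `\w` still matches digits and underscores.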

In [4]:
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
# one list of tokens per product description
tokens = data['description'].apply(tokenizer.tokenize)

### Remove numbers

Numbers are of no interest here and are consequently removed. This can be done by iterating over all tokens and keeping only those that are purely alphabetic, using the Python string method `isalpha()`.
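
A quick sketch of what this filter keeps and drops (the sample tokens are made up). Be aware that `isalpha()` also discards tokens that merely *contain* a digit or an underscore, not only pure numbers.

```python
tokens = ["pack", "of", "2", "mugs", "350ml", "size_xl"]

# keep only tokens made exclusively of letters
words_only = [t for t in tokens if t.isalpha()]

print(words_only)
# ['pack', 'of', 'mugs']
```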

In [5]:
# keep only purely alphabetic tokens; apply() preserves the uniq_id index
words = tokens.apply(lambda toks: [w for w in toks if w.isalpha()])

### Lower casing

Two words like "Text" and "text" mean exactly the same thing, yet they would be counted as two different words. Consequently, it is highly advised to convert all words to lower case.
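
As a small illustration (with made-up tokens), counting words before and after lower casing shows how the variants collapse into a single entry:

```python
from collections import Counter

tokens = ["Cotton", "cotton", "COTTON", "shirt"]

counts_raw = Counter(tokens)
counts_lower = Counter(t.lower() for t in tokens)

print(len(counts_raw))         # 4 distinct "words"
print(counts_lower["cotton"])  # 3 occurrences of the same word
```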

In [6]:
words = words.apply(lambda ws: [w.lower() for w in ws])

### Stop words removal

*Stop words* are words that do not contribute to the deeper meaning of a sentence and so do not really help to distinguish two different documents. Worse, they add noise and may significantly degrade the performance of the model. For this reason, they are removed.   
Stop words usually refer to the **most common** words such as "and", "the" or "a". But there is no *single universal list* of stop words: the list may change depending on the application.  
As for tokenization, NLTK provides a list of common stop words for a variety of languages, such as English. This list can be found in the `stopwords` corpus (`nltk.corpus.stopwords`).
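
Since the list depends on the application, a generic list can be extended with domain-specific words. The sketch below is only an illustration: `base_stop_words` is a short stand-in for the NLTK list, and `domain_stop_words` contains hypothetical e-commerce terms, not words actually selected in this project.

```python
# short stand-in for a generic stop-word list (assumption)
base_stop_words = {"and", "the", "a", "of", "for", "with"}
# hypothetical words that appear in most product descriptions
domain_stop_words = {"buy", "online", "product", "price"}

all_stop_words = base_stop_words | domain_stop_words

tokens = ["buy", "the", "steel", "bottle", "online", "for", "best", "price"]
kept = [t for t in tokens if t not in all_stop_words]

print(kept)
# ['steel', 'bottle', 'best']
```

Storing the stop words in a `set` keeps each membership test fast, which matters when filtering many documents.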

In [7]:
from nltk.corpus import stopwords

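# requires the corpus to be downloaded once: nltk.download('stopwords')
stop_words = stopwords.words('english')
print(stop_words)
# filter out stop words (a set makes each membership test fast)
stop_set = set(stop_words)
words = words.apply(lambda ws: [w for w in ws if w not in stop_set])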

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

### Lemmatization

The purpose of stemming and lemmatization is to reduce words like "studies" to a root form ("studi") or to a common base form ("study"), respectively. Although it is much easier to develop a stemmer than a lemmatizer (which requires deep linguistic knowledge to build the lemma of each word), the latter is preferred here: it reduces noise further, so the results should be more accurate.
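
The difference between the two approaches can be sketched with two toy functions. These are deliberately simplistic stand-ins written for this illustration, not NLTK's Porter stemmer or WordNet lemmatizer: the "stemmer" blindly strips suffixes with a few rules, while the "lemmatizer" looks words up in a small hand-written vocabulary.

```python
# hand-written vocabulary for the toy lemmatizer (assumption)
TOY_LEMMAS = {"studies": "study", "studying": "study", "feet": "foot"}


def toy_stem(word):
    """Rule-based: replace the first matching suffix, without any vocabulary."""
    for suffix, replacement in (("ies", "i"), ("ing", ""), ("s", "")):
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


def toy_lemma(word):
    """Vocabulary-based: return the known base form, else the word itself."""
    return TOY_LEMMAS.get(word, word)


for w in ("studies", "studying", "feet"):
    print(w, "->", toy_stem(w), "|", toy_lemma(w))
# studies -> studi | study
# studying -> study | study
# feet -> feet | foot
```

The stemmer may produce strings that are not real words ("studi") and misses irregular forms ("feet"), whereas the lemmatizer is only as good as its vocabulary.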

In [8]:
from nltk.stem import WordNetLemmatizer

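# requires the WordNet corpus to be downloaded once: nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
# lemmatize() treats each word as a noun unless a `pos` argument is given
words = words.apply(lambda ws: [lemmatizer.lemmatize(w) for w in ws])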