# **Predict Future Sales**
## *Translation/Text Processing*

This challenge is the final project for the ["How to win a data science competition"](https://www.coursera.org/learn/competitive-data-science/home/welcome) Coursera course.

In [this competition](https://www.kaggle.com/c/competitive-data-science-predict-future-sales) we will work with a challenging time-series dataset consisting of daily sales data, kindly provided by one of the largest Russian software firms, [1C Company](http://1c.ru/eng/title.htm). The task is to **predict total sales for every product and store in the next month**. Various files with supplemental information are provided, but the texts are in Russian.

Names may contain a lot of information about items and shops, which should provide valuable insight when building a model to predict sales. To get the most out of them, in this notebook we will **translate the Russian names of items, categories and shops**; we'll also remove punctuation and stopwords, look for named entities, create new columns with information of interest and generate new files.

### Tasks covered
- [x] Process Russian texts and create new variables
- [ ] Perform some EDA and data mining for deeper understanding, and prepare data
- [ ] Create model and evaluate results

### Content
* [Libraries](#Libraries).
* [Data loading](#Data-loading).
* [Translation](#Translation).
* [Preprocessing](#Preprocessing).
    + [Categories](#Categories).
    + [Shops](#Shops).
    + [Items](#Items).
* [Saving files](#Saving-files).

## Libraries

In [None]:
import os
import pandas as pd
import numpy as np
from textblob import TextBlob
import re
import string
import warnings
from sklearn.preprocessing import LabelEncoder
import nltk
import spacy
# load spaCy's small English pipeline
spacy_nlp = spacy.load('en_core_web_sm')

## Data loading

In [None]:
folder = 'input/competitive-data-science-predict-future-sales'
path = f'../{folder}'
files = os.listdir(path)
print(files)

In [None]:
# loading by explicit file name, since os.listdir() returns entries in arbitrary order
items = pd.read_csv(os.path.join(path, 'items.csv'))
categories = pd.read_csv(os.path.join(path, 'item_categories.csv'))
shops = pd.read_csv(os.path.join(path, 'shops.csv'))

## Translation

We will be using `TextBlob` for this task, a straightforward text-processing library that I've used before and that works well for this purpose.

In [None]:
def translateText(text, from_lang, to_lang):
    """
    Function for translating text.
    It first translates and then removes non-ASCII characters.
    If the first try is not successful, it removes punctuation before translating again,
    as punctuation may cause bad request HTTP errors (probably due to an unresolved bug).
    If this is not successful either, it returns NA.
    
    Parameters
    ----------
    text: str
        Text for translating.
    from_lang: str
        Original language.
    to_lang: str
        Desired language.
    """
    
    try:
        # Translating.
        text = str(TextBlob(text).translate(from_lang=from_lang, to=to_lang))
        # Removing non-ascii characters.
        text = re.sub('[^\x00-\x7F]+', '', text) 
        
    except Exception: # some punctuation symbols currently cause a bad request error, we'll remove them and try again
        text = text.translate(str.maketrans('', '', string.punctuation))
        
        try:
            # Translating.
            text = str(TextBlob(text).translate(from_lang=from_lang, to=to_lang))
            # Removing non-ascii characters.
            text = re.sub('[^\x00-\x7F]+', '', text)
            
        except Exception: 
            text = np.nan # we'll return NA in case of any other error so we can detect it later
    
    return text
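The retry logic above can be exercised without calling the translation service. The sketch below is an offline illustration only: `fake_translate` is a hypothetical stand-in for the `TextBlob` call that fails whenever the text contains punctuation, and `translate_with_fallback` is an illustrative name rather than part of this notebook. It returns `None` where `translateText` returns NA.

```python
import re
import string


def fake_translate(text):
    """Hypothetical stand-in for the real translation call (assumption):
    it fails on any punctuation and otherwise 'translates' by upper-casing."""
    if any(ch in string.punctuation for ch in text):
        raise ValueError("bad request")
    return text.upper()


def always_fails(text):
    """Hypothetical translator that simulates a service outage."""
    raise ValueError("service unavailable")


def translate_with_fallback(text, translate=fake_translate):
    """Try the text as-is, retry without punctuation, return None if both fail."""
    without_punctuation = text.translate(str.maketrans('', '', string.punctuation))
    for candidate in (text, without_punctuation):
        try:
            translated = translate(candidate)
        except Exception:
            continue  # move on to the next candidate
        # Keep only ASCII characters, as translateText does.
        return re.sub(r'[^\x00-\x7F]+', '', translated)
    return None


print(translate_with_fallback('shop one'))
print(translate_with_fallback('shop "one"!'))
print(translate_with_fallback('shop one', translate=always_fails))
```

Passing the translator in as an argument makes it easy to test the fallback path without network access.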

In [None]:
# applying translation for shops and categories
categories['category_name_en'] = categories.item_category_name.apply(lambda x: translateText(x, 'ru', 'en'))
shops['shop_name_en'] = shops.shop_name.apply(lambda x: translateText(x, 'ru', 'en'))

<div class="info alert-block alert-info">
    💻 I created this notebook on my local machine and have already executed it. To speed up running this notebook in Kaggle, I will be using my locally translated files.
</div>

In [None]:
# pretranslated files in local
shops = pd.read_csv('../input/new-shops/new_shops.csv')
categories = pd.read_csv('../input/new-categories/new_categories.csv')

Translation for shops and categories is correct; there are no missing values. Great!

In [None]:
display(shops.head(3))
print(shops.shape)
print('\nUntranslated values: \n', shops.shop_name_en.isna().sum())
display(categories.head(3))
print(categories.shape)
print('\nUntranslated values: \n', categories.category_name_en.isna().sum())

<div class="alert alert-block alert-warning">
    <b>WARNING:</b> Translating the item names takes ~2 hours to run.
    <br><em>Executed with MSI PS42 Modern 8RC model.</em>
</div>

In [None]:
%%time
# translating items
# Uncomment the following line to try it, but only if you can leave it running for a while: it is a slow task that consumes a lot of CPU and memory

#items['item_name_en'] = items.item_name.apply(lambda x: translateText(x, 'ru', 'en'))

There are 321 untranslated items, which is quite good given the total number of observations, but let's dig deeper to see why they weren't translated.

In [None]:
# pretranslated files in local
items = pd.read_csv('../input/new-items/new_items.csv').drop(['Unnamed: 0'], axis=1)

In [None]:
display(items.head())
print(items.shape)
print('\nUntranslated values: \n', items.item_name_en.isna().sum())

In [None]:
def detectLanguage(text):
    """
    Detects language in a text.
    
    Parameters
    ----------
    text: str
    """
    try:
        text = TextBlob(text).detect_language()
    except Exception:
        pass
    
    return text

In [None]:
items['language'] = items[items.item_name_en.isna()].item_name.apply(lambda x: detectLanguage(x))

In the table below we can see that all **untranslated items are either already in English** or have a **numeric name**. This is also fantastic news! We'll use their original name. (Some of the strings are so short that the algorithm can only make a rough guess at the language. It's important to note that these **language detection algorithms work best with longer strings of text**, as they tend to rely on stopwords as a primary source of knowledge.)

In [None]:
display(items[(items.item_name_en.isna()) & (items.language != 'en')][['item_name', 'language']])

items['item_name_en'] = items.item_name_en.fillna(items.item_name)
del items['language']

We have now translated all our datasets! Let's get rid of the names in Russian.

In [None]:
del shops['shop_name'], categories['item_category_name'], items['item_name']

## Preprocessing

### Categories

From the category values below, we can infer that every category name consists of either a general category plus a subcategory, or just a single category. Let's extract them to create two new variables.

In [None]:
# overall view of the categories, 12 per row
pd.DataFrame(categories.category_name_en.values.reshape(-1, 12))

In [None]:
categories['category_name'] = categories.category_name_en.apply(lambda x: x.split(' - ')[0])
categories['subcategory_name'] = categories.category_name_en.apply(lambda x: x.split(' - ')[-1])

categories.head()
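As a small illustration of the split above, the sketch below applies the same `' - '` rule to a few made-up category names (the sample strings are assumptions, not rows from the dataset). `split_category` is an illustrative helper name.

```python
def split_category(name, sep=' - '):
    """Return (general category, subcategory) using the first and last pieces."""
    parts = name.split(sep)
    return parts[0], parts[-1]


samples = ['Games - PS4', 'Delivery of goods', 'Books - Audiobooks - Digital']
for sample in samples:
    print(sample, '->', split_category(sample))
```

Note that a name without a separator yields the same value for both parts, and that any middle piece is dropped when a name contains more than one separator.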

We'll also normalise the names by lowercasing them, as casing differs between some categories; we'll keep only alphanumeric characters and spaces; and we'll manually correct two categories. Lastly, we'll create temporary ID variables for these new groups with label encoding.

In [None]:
def onlyAlphaNumeric(text):
    """
    Puts to lowercase format and removes non-alphanumeric characters except spaces.
    
    Parameters
    ----------
    text: str
    """
    text = text.lower()
    text = re.sub(r'</i>|<b>|<i>|</b>', '', text) # also deleting some html notation
    text = re.sub(r'[^A-Za-z0-9 ]+', '', text)
    
    return text
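The `quot` fragments that are removed later in this notebook most likely appear when an HTML entity such as `&quot;` loses its `&` and `;` in the step above. A possible alternative, sketched here under the assumption that the translated names still contain the raw entities, is to decode them with the standard library's `html.unescape` before stripping punctuation. The function name `normalise_name` and the sample string are illustrative only.

```python
import html
import re


def normalise_name(text):
    """Decode HTML entities, drop simple HTML tags, keep lowercase alphanumerics and spaces."""
    text = html.unescape(text).lower()       # '&quot;' becomes '"'
    text = re.sub(r'</?[bi]>', '', text)     # removes <b>, </b>, <i>, </i>
    text = re.sub(r'[^a-z0-9 ]+', '', text)  # drops every other character
    return text


sample = 'Moscow TC &quot;Budenovsky&quot; <b>(pav. A2)</b>'
print(normalise_name(sample))
```

Decoding first avoids having to delete the substring `quot` afterwards, which could also clip legitimate words that happen to contain it.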

In [None]:
cat_columns = [col for col in categories.columns if 'name' in col]
categories[cat_columns] = categories[cat_columns].applymap(lambda x: onlyAlphaNumeric(x))

# manual corrections; .loc assigns directly on the dataframe
# (label slices are inclusive, so 32:34 covers rows 32, 33 and 34)
categories.loc[32:34, 'category_name'] = 'payment cards'
categories.loc[32, 'subcategory_name'] = 'cinema music games'
categories.loc[34, 'subcategory_name'] = 'number'
    
categories['category_id'] = LabelEncoder().fit_transform(categories.category_name.values)
categories['subcategory_id'] = LabelEncoder().fit_transform(categories.subcategory_name.values)

display(categories.head())
print('Unique values: \n', categories.nunique())
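For intuition, `LabelEncoder` assigns integer codes according to the sorted order of the unique values. The dependency-free sketch below mimics that behaviour on made-up names; `label_encode` is an illustrative helper, not the scikit-learn implementation.

```python
def label_encode(values):
    """Map each distinct value to its rank among the sorted unique values."""
    codes = {value: code for code, value in enumerate(sorted(set(values)))}
    return [codes[value] for value in values], codes


names = ['games', 'books', 'payment cards', 'games', 'books']
encoded, mapping = label_encode(names)
print(encoded)
print(mapping)
```

The codes only reflect alphabetical order, so they work as identifiers rather than as ordinal quantities.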

In [None]:
del categories['category_name_en']

### Shops

For the shops dataset, we'll perform the same text processing operations. In this case, though, **we're creating a city variable** (the city tends to be the first token of the name) and **a shop type variable** (the type is sometimes specified within the name). With a **little bit of research**, one can **work out the different types of shopping centers included**. For example, a TC (from the Russian ТЦ) is a common shopping center, whereas a SEC (or TRK, depending on the translation, from the Russian ТРЦ or ТРК) is a shopping and entertainment complex, including cinemas and other leisure activities.

In [None]:
shops_original_en = shops.copy()

shops.head()

In [None]:
# only alphanumeric strings
# 'quot' fragments come from quotation marks present before the translation;
# they are removed right after this step
cat_columns = [col for col in shops.columns if 'name' in col]
shops[cat_columns] = shops[cat_columns].applymap(lambda x: onlyAlphaNumeric(x))

# also eliminating spelled quotations that may be left
shops[cat_columns] = shops[cat_columns].applymap(lambda x: re.sub('quot', '', x))

# cities
def takeCityName(text):
    # the city is taken to be the first word longer than two characters
    for word in text.split():
        if len(word) > 2:
            return word

    return '' # no word is long enough
        
shops['shop_city_name'] = shops.shop_name_en.apply(lambda x: takeCityName(x))
shops['shop_city_name'] = np.where(shops['shop_city_name'] == 'digital', 'online', shops['shop_city_name'])

# shop types (conditions are checked in order, so the first match wins)
shops['shop_type_name'] = shops.shop_name_en.apply(lambda x: 
                                                   'megacenter' if re.findall(r'trk|xl|sec', x)
                                                   else 'online' if 'online' in x
                                                   else 'special' if re.findall(r'outbound|sale', x)
                                                   else 'center' if re.findall(r'tc |shopping center', x)
                                                   else 'unspecified')

# ids
shops['shop_city_id'] = LabelEncoder().fit_transform(shops.shop_city_name.values)
shops['shop_type_id'] = LabelEncoder().fit_transform(shops.shop_type_name.values)   

display(shops.head())
print('Unique values: \n', shops.nunique())
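One caveat with the substring patterns above: `sec` or `xl` also match inside longer words, and `tc ` needs a trailing space, so it misses a name that ends in `tc`. A hedged variant with word boundaries is sketched below. The function name `shop_type` and the sample names are assumptions for illustration, and whether the stricter patterns suit the real shop names better should be checked against the data.

```python
import re

# (label, pattern) pairs checked in order, mirroring the priorities used above.
SHOP_TYPE_RULES = [
    ('megacenter', re.compile(r'\b(trk|xl|sec)\b')),
    ('online', re.compile(r'\bonline\b')),
    ('special', re.compile(r'\b(outbound|sale)\b')),
    ('center', re.compile(r'\btc\b|\bshopping center\b')),
]


def shop_type(name):
    """Return the first matching shop type for a lowercase, cleaned shop name."""
    for label, pattern in SHOP_TYPE_RULES:
        if pattern.search(name):
            return label
    return 'unspecified'


samples = ['moscow trk atrium', 'kazan tc', 'online shop emergency', 'second street store']
for name in samples:
    print(name, '->', shop_type(name))
```

With plain substring matching, the last sample would be labelled `megacenter` because `second` contains `sec`.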

In [None]:
del shops['shop_name_en']

Alternatively, cities could be detected with **Named Entity Recognition** models such as the ones included in the `spaCy` library. As shown below, though, they're mostly unreliable for our task, even though spaCy is one of the most powerful NLP libraries, with industrial strength, excelling at large-scale information processing.

In [None]:
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    for shop in shops_original_en.shop_name_en[15:23]:
        doc = spacy_nlp(shop)
        # render the whole text, highlighting each entity's location and type
        spacy.displacy.render(doc, style='ent', jupyter=True)

### Items

Now, let's process the items dataframe. In the previous cases the names consisted mostly of nouns, but item names may also contain adverbs, determiners and other so-called **stopwords**, i.e., the most common words in a language. We'll remove them in order to keep only **nouns**, which in essence **carry most of the meaning** of the name, and other useful parts of speech or content words. We won't create a new ID for this new column: even though this process reduces cardinality somewhat, it remains very high, and a meaningless ID would be of no use. We'll also add 3 new columns containing the total number of words in the name, the number of nouns and the number of stopwords.

We'll utilize the prepared name for yielding more insights during the EDA phase. I'll soon upload a notebook for it.

In [None]:
# printing stopwords examples with nltk
stopwords_en = nltk.corpus.stopwords.words("english")
print('10 examples of stopwords: ', stopwords_en[:10])

In [None]:
items.tail()

In [None]:
def addTextMetrics(df, column):
    """Function for adding text metrics like length of name, number of nouns, etc. to a Dataframe.
    
    Parameters
    ----------
    df: pandas dataframe
    column: str
        Column of df for deriving text metrics from.
    
    Notes
    -----
    It uses a preloaded spacy language core, spacy_nlp.
    """      
    
    # adding basic name metrics
    df[f'{column}_number_words'] = df[column].apply(lambda x: len(spacy_nlp(x)))
    df[f'{column}_number_stopwords'] = df[column].apply(lambda x: len([token for token in spacy_nlp(x) 
                                                                       if token.is_stop]))
    df[f'{column}_number_nouns'] = df[column].apply(lambda x: len([token for token in spacy_nlp(x) 
                                                                   if token.pos_ == 'NOUN']))
    
    return df
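To make the word and stopword counts concrete without loading spaCy, here is a minimal sketch that uses whitespace tokenisation and a tiny hand-picked stopword set. Both are simplifying assumptions: spaCy's tokenizer and its `is_stop` list differ, and noun counts need a part-of-speech tagger, so they are left out. `text_metrics` is an illustrative name.

```python
# Tiny illustrative stopword set (assumption); real lists are much longer.
STOPWORDS = {'the', 'of', 'a', 'an', 'and', 'for', 'in', 'to'}


def text_metrics(name):
    """Return the word count, the stopword count and the name without stopwords."""
    words = name.lower().split()
    kept = [word for word in words if word not in STOPWORDS]
    return {
        'number_words': len(words),
        'number_stopwords': len(words) - len(kept),
        # Fall back to the full name if fewer than two words survive.
        'cleaned': ' '.join(kept if len(kept) >= 2 else words),
    }


print(text_metrics('the lord of the rings'))
print(text_metrics('a game'))
```

The fallback keeps very short names intact instead of reducing them to a single word.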

In [None]:
def removeStopWords(text, content=False, outliers=False):
    """ Removing stopwords and more non-useful words.
    
    Parameters
    ----------
    text: str
    content: bool
        If True, leaves only content words: nouns, proper nouns, adjectives, verbs and numbers.
    outliers: bool
        If True, removes words whose length falls outside the 5th to 95th percentile range
        of the word lengths in the name.
         
    Notes
    -----
    It uses a preloaded spacy language core, spacy_nlp.
    """      
        
    # Transforming to spacy.doc.Doc object for easier processing.
    doc = spacy_nlp(text)
    
    if not content:
        # removing stop_words with Spacy
        tokens = [token.text for token in doc if not token.is_stop]

        # if removing stopwords leaves us with fewer than two tokens, we take the original name
        if len(tokens) < 2:
            tokens = [token.text for token in doc]
    else:
        # leaving only content words
        tokens = [token.text for token in doc if token.pos_ in ['NOUN', 'PROPN', 'ADJ', 'VERB', 'NUM']]
        if len(tokens) < 2: # if this is met, we take any remaining alphabetic tokens
            tokens = [token.text for token in doc if token.is_alpha]
    
    if outliers and tokens:
        # tokens are plain strings at this point, so len() gives the word length
        lengths = [len(token) for token in tokens]
        low, high = np.percentile(lengths, [5, 95])
        # Removing outliers in word-length:
        words = [word for word in tokens if low <= len(word) <= high]

        text = ' '.join(words) # return in string format
    else:
        text = ' '.join(tokens)
 
    return text
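The `outliers` option keeps the words whose length lies between the 5th and 95th percentiles of the word lengths in the name. A dependency-free sketch of that filter follows. `percentile` is an illustrative helper that interpolates linearly between the closest ranks, mirroring NumPy's default `linear` method, and the sample tokens are made up.

```python
def percentile(values, q):
    """q-th percentile (0-100) with linear interpolation between closest ranks."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def drop_length_outliers(tokens, low=5, high=95):
    """Keep the tokens whose length is within the [low, high] percentile range."""
    if not tokens:
        return []
    lengths = [len(token) for token in tokens]
    shortest, longest = percentile(lengths, low), percentile(lengths, high)
    return [token for token in tokens if shortest <= len(token) <= longest]


tokens = ['a', 'game', 'disc', 'play', 'sony', 'supercalifragilistic']
print(drop_length_outliers(tokens))
print(drop_length_outliers(['ab', 'abcdefgh']))
```

The second call shows a side effect worth keeping in mind: for names with only a couple of words of different lengths, the interpolated bounds fall strictly between the two lengths and both words are removed.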

<div class="alert alert-block alert-warning">
    <b>WARNING:</b> The next cell takes ~10 minutes to run.
    <br><em>Executed with MSI PS42 Modern 8RC model.</em>
</div>

In [None]:
%%time
# only alphanumeric strings
cat_columns = [col for col in items.columns if 'name' in col]
items[cat_columns] = items[cat_columns].applymap(lambda x: onlyAlphaNumeric(x))

# also eliminating spelled quotations that may be left
items[cat_columns] = items[cat_columns].applymap(lambda x: re.sub('quot', '', x))

# adding 3 new columns
items = addTextMetrics(items, 'item_name_en')

# only removing stopwords and other non content words. 
# Outliers in word length are not necessary to be removed for this task.
items['item_name'] = items.item_name_en.apply(lambda x: removeStopWords(x, content=True))

# we'll fill the remaining blank names with their original English name
items['item_name'] = np.where(items.item_name == '', items.item_name_en, items.item_name)

In [None]:
display(items.tail())
print('Unique values: \n', items.nunique())

In [None]:
del items['item_name_en']

## Saving files

Awesome! ✨ Let's save the results.

In [None]:
#items.to_csv(path + '/items_english.csv', index=False) # default comma-separated
#categories.to_csv(path + '/categories_english.csv', index=False)
#shops.to_csv(path + '/shops_english.csv', index=False)

---

## 💡 **Stay tuned for the next part: [In-depth EDA] Predict Future Sales**

Also, if you have any questions or comments, please feel welcome to share them!