# Outline

[1. Loading Data](#1.-Loading-Data)

[2. Exploratory Data Analysis](#2.-Exploratory-Data-Analysis)

[3. Pipeline](#3.-Pipeline)

[4. Evaluation](#4.-Evaluation)

[5. Product](#5.-Product)

# 1. Loading Data

- Get the data into the notebook in the best possible form for analysis.

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import cyrtranslit
from sklearn import preprocessing, model_selection, metrics, feature_selection, ensemble, linear_model, cross_decomposition, feature_extraction, decomposition, compose
import lightgbm as lgb
from scipy import stats
import time
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
import os
color = sns.color_palette()
%matplotlib inline

In [3]:
train = pd.read_csv('../train.csv.zip',compression='zip')
test = pd.read_csv('../test.csv.zip',compression='zip')

# 2. Exploratory Data Analysis

- What’s happening with the data.
    - It's big.
    - It's in Russian, written in the Cyrillic alphabet.

- Why it’s interesting.
    - The outcome variable is far from normally distributed. There are three distinct groups: Zero Range, Lower Range, and Upper Range.

- What features you intend to take advantage of for your modeling.
    - Titles and descriptions offer many NLP opportunities.
    - Plenty of categorical data to binarize.

In [None]:
# EDA

# 3. Pipeline

- Linked stages.
- Efficient and easy to implement.

In [39]:
# Empty frames, indexed like the raw data, that collect the engineered features
train_features = pd.DataFrame(index=train.index)
test_features = pd.DataFrame(index=test.index)

## 3.1. Cross-Decomposition of TF-IDF Vectors with BiGrams

This stage extracts TF-IDF vectors, treating the text of each ad as a document. Tokens for both unigrams and bigrams are included. The resulting matrix is then reduced by cross-decomposition to the smallest number of components that retains its predictive power. The procedure is applied to titles and descriptions separately, and a separate set of components is kept for each.

In [40]:
path = 'feature_engineering/1.tfidf_ngrams/feature_dumps/'
print('Loading and joining feature sets:...')
for file in os.listdir(path):
    print(file)
    if file[:4] == 'test':
        test_features = test_features.join(joblib.load(path+file))
    else:
        train_features = train_features.join(joblib.load(path+file))

Loading and joining feature sets:...
train_descr_idfngram.sav
train_title_idfngram.sav
test_descr_idfngram.sav
test_title_idfngram.sav


## 3.2. Discretized Vector Cross-Decomposition

This stage splits the dependent variable into discrete ranges and builds a vocabulary for each range. Each vocabulary is then vectorized and cross-decomposed independently. The resulting components for each vocabulary reflect the presence of terms that are common in a particular range of the target.

In [41]:
path = 'feature_engineering/2.discrete-decomp/feature_dumps/'
print('Loading and joining feature sets:...')
for file in os.listdir(path):
    print(file)
    if file[:4] == 'test':
        test_features = test_features.join(joblib.load(path+file))
    else:
        train_features = train_features.join(joblib.load(path+file))

Loading and joining feature sets:...
test_title_zeroidf.sav
test_title_lowcnt.sav
train_title_zeroidf.sav
train_title_lowcnt.sav
test_title_lowidf.sav
test_title_zerocnt.sav
train_title_zerocnt.sav
train_title_upidf.sav
train_title_upcnt.sav
train_title_lowidf.sav
test_title_upcnt.sav
test_title_upidf.sav


## 3.3. Discretized Vector Sums

As in the previous procedure, vocabularies are created for discrete ranges of the target. However, instead of decomposing the vectors of those vocabularies, their frequencies are simply summed along the row axis of the term frequency matrix. This yields a single variable per vocabulary, representing the aggregate frequency of that vocabulary's terms in each ad.
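The row-wise sum can be sketched as follows. The toy titles, range cut points, and column names are hypothetical and only serve to show the shape of the result.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data.
titles = np.array([
    "broken phone parts",
    "old phone broken",
    "kids bike used",
    "used bike cheap",
    "new laptop sealed",
    "sealed new phone",
])
target = np.array([0.0, 0.0, 0.2, 0.3, 0.8, 0.9])

# Hypothetical cut points for the three target ranges.
ranges = {
    "zero": target == 0,
    "low": (target > 0) & (target < 0.5),
    "up": target >= 0.5,
}

sums = pd.DataFrame(index=range(len(titles)))
for name, mask in ranges.items():
    # Vocabulary of the ads in this target range.
    vocab = sorted(CountVectorizer().fit(titles[mask]).vocabulary_)
    # Term counts for every ad, restricted to that vocabulary.
    counts = CountVectorizer(vocabulary=vocab).fit_transform(titles)
    # Sum across the vocabulary's columns: one aggregate count per ad.
    sums[name + "_sum"] = np.asarray(counts.sum(axis=1)).ravel()

print(sums)
```

Each ad ends up with three numbers, one per vocabulary, instead of a set of decomposed components.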

In [42]:
path = 'feature_engineering/3.vector-sums/feature_dumps/'
print('Loading and joining feature sets:...')
for file in os.listdir(path):
    print(file)
    if file[:4] == 'test':
        test_features = test_features.join(pd.read_pickle(path+file,compression='zip'))
    else:
        train_features = train_features.join(pd.read_pickle(path+file,compression='zip'))

Loading and joining feature sets:...
test_sums.pkl
train_sums.pkl


## 3.4. Sentiment Analysis

The NLP library `polyglot` offers multi-language tools, including sentiment analysis and named-entity recognition for Russian. Here it supplies a polarity feature for ad titles.

In [43]:
path = 'feature_engineering/4.sentiment/feature_dumps/'
print('Loading and joining feature sets:...')
for file in os.listdir(path):
    print(file)
    if file[:4] == 'test':
        test_features = test_features.join(joblib.load(path+file))
    else:
        train_features = train_features.join(joblib.load(path+file))

Loading and joining feature sets:...
test_title_polarity.sav
train_title_polarity.sav


## 3.5. Categorical Features

### 3.5.1. Binary CountVectorizer

Several categorical variables in this data have thousands of unique values, so binarizing them in dense format would inflate the dimensionality unreasonably. A binary `CountVectorizer` does the heavy lifting of populating dummy counts in sparse format, and partial least squares regression (`PLSR`) reduces the numerous columns to a few core components.


### 3.5.2. Target-Sorted Label Encodings
Additionally, each feature is label-encoded with one particular consideration. Label encoding is normally discouraged for machine learning because the algorithm may interpret the arbitrary code numbers as meaningful. However, the codes do carry useful information if the categorical values are first sorted by their mean outcome value: each label's code then approximates the rank of its target outcome.
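The encoding can be sketched with pandas as below. The frames, column names, and the `-1` placeholder for categories unseen in training are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy frames.
toy_train = pd.DataFrame({
    "category": ["cars", "toys", "cars", "phones", "toys", "phones"],
    "target": [0.9, 0.1, 0.7, 0.4, 0.2, 0.5],
})
toy_test = pd.DataFrame({"category": ["phones", "cars", "boats"]})

# Rank categories by their mean target in the training data (0 = lowest mean).
means = toy_train.groupby("category")["target"].mean().sort_values()
codes = {category: code for code, category in enumerate(means.index)}

toy_train["category_code"] = toy_train["category"].map(codes)
# Categories unseen in training have no code; -1 is an arbitrary placeholder.
toy_test["category_code"] = toy_test["category"].map(codes).fillna(-1).astype(int)

print(codes)
```

Here `toys` has the lowest mean target and `cars` the highest, so the codes increase with the expected outcome rather than following alphabetical order.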

In [44]:
path = 'feature_engineering/5.categorical/feature_dumps/'
print('Loading and joining feature sets:...')
for file in os.listdir(path):
    print(file)
    if file[:4] == 'test':
        test_features = test_features.join(joblib.load(path+file))
    else:
        train_features = train_features.join(joblib.load(path+file))

Loading and joining feature sets:...
train_codes.sav
test_categ_catnamplsr.sav
train_regio_regplsr.sav
test_city_cityplsr.sav
train_param_p1plsr.sav
test_param_p3plsr.sav
test_param_p2plsr.sav
train_categ_catnamplsr.sav
test_regio_regplsr.sav
train_param_p3plsr.sav
test_codes.sav
train_paren_ptcatplsr.sav
train_param_p2plsr.sav
test_param_p1plsr.sav
train_city_cityplsr.sav
test_paren_ptcatplsr.sav


## 3.6. Other Features

- Imputations.
- Missing indicators.
- Day-of-week dummies.
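The three kinds of features listed above might look like this in pandas. The column names `price` and `activation_date`, the toy values, and the use of median imputation are assumptions for the sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a numeric column and a date column.
ads = pd.DataFrame({
    "price": [100.0, np.nan, 250.0, np.nan],
    "activation_date": pd.to_datetime(
        ["2017-03-15", "2017-03-16", "2017-03-18", "2017-03-19"]
    ),
})

other = pd.DataFrame(index=ads.index)
# Record missingness first, so the information survives imputation.
other["price_missing"] = ads["price"].isna().astype(int)
# Median imputation is one simple choice among many.
other["price_imputed"] = ads["price"].fillna(ads["price"].median())
# One dummy column per weekday present in the data (Monday = 0).
weekday = pd.get_dummies(ads["activation_date"].dt.dayofweek, prefix="dow")
other = other.join(weekday)

print(other.columns.tolist())
```

Any statistic used for imputation would be computed on the training data and reused for the test data.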

In [45]:
path = 'feature_engineering/6.other/feature_dumps/'
print('Loading and joining feature sets:...')
for file in os.listdir(path):
    print(file)
    if file[:4] == 'test':
        test_features = test_features.join(joblib.load(path+file))
    else:
        train_features = train_features.join(joblib.load(path+file))

Loading and joining feature sets:...
train_othfeat.sav
test_othfeat.sav


# 4. Evaluation

- Evaluate and compare multiple models through robust analysis of residuals and error.
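A comparison of this kind could be sketched as follows. The synthetic data, the two candidate models, and the use of RMSE on out-of-fold predictions are assumptions chosen for illustration, not the evaluation actually performed.

```python
import numpy as np
from sklearn import ensemble, linear_model, metrics, model_selection

# Synthetic stand-in for the engineered features and target.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
y = 0.5 * X[:, 0] + rng.normal(scale=0.1, size=200)

# Hypothetical candidate models.
models = {
    "ridge": linear_model.Ridge(alpha=1.0),
    "forest": ensemble.RandomForestRegressor(n_estimators=50, random_state=0),
}

results = {}
for name, model in models.items():
    # Out-of-fold predictions, so residuals reflect unseen data.
    pred = model_selection.cross_val_predict(model, X, y, cv=5)
    residuals = y - pred
    results[name] = {
        "rmse": float(np.sqrt(metrics.mean_squared_error(y, pred))),
        "mean_residual": float(residuals.mean()),
    }

for name, scores in results.items():
    print(name, scores)
```

The stored residuals can also be plotted against the predictions to check whether errors concentrate in one range of the target.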

# 5. Product

- Why it was chosen.
- Why it works.
- What problem it solves.
- How it will run in a production environment.
- How to maintain it going forward.