# Intro
I already did a first iteration of this `TFIDF PLSR` procedure in a previous notebook. Now I'm simply going to compare results across different approaches, namely:
- Using Russian stopwords.
- Using NGrams.
- Not transforming terms to Lowercase.
- Decomposing more CSR columns.
- Training $PLSR$ with larger samples of the CSR matrix.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

import nltk
import cyrtranslit
from sklearn import preprocessing, model_selection, metrics, feature_selection, ensemble, linear_model, cross_decomposition, feature_extraction, decomposition
from sklearn.pipeline import Pipeline
import lightgbm as lgb
import time
color = sns.color_palette()

%matplotlib inline

In [2]:
train = pd.read_pickle('../../train.pkl',compression='zip')

test = pd.read_pickle('../../test.pkl',compression='zip')

---
# Vectorize Titles with Russian StopWords
**Summary**

- $R^2$ score is lower than, but pretty close to, the one without Russian stopwords.
- The error is also slightly worse. Note that it is computed with `metrics.mean_squared_error`, so strictly it is $MSE$ rather than $RMSE$. It's still worth getting the score on the full data.
- The score on the full train data was also slightly lower, although one run with a different random sample gave a higher one. This score was the most stable across runs. So far it is inconclusive whether removing stopwords made any difference for titles.
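
One possible reason for the small effect, sketched below: the default `token_pattern` of `TfidfVectorizer` only keeps tokens of two or more characters, so single-letter stopwords such as "и" or "в" never reach the vocabulary in the first place. The short stopword list and the titles here are hypothetical stand-ins for NLTK's full list and the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# A few Russian stopwords hard-coded for the sketch; the notebook uses NLTK's full list.
ru_stop_small = ["и", "в", "на", "не", "что"]
titles = ["Продам велосипед и шлем", "Куртка на зиму", "Диван в хорошем состоянии"]

with_stop = TfidfVectorizer().fit(titles)
without_stop = TfidfVectorizer(stop_words=ru_stop_small).fit(titles)

# Vocabulary sizes before and after stopword removal.
print(len(with_stop.vocabulary_), len(without_stop.vocabulary_))
```

Only the two-letter stopword "на" is actually removed here, which hints at why short titles may change little when a stopword list is applied.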

In [3]:
# Sample of russian stopwords from NLTK
nltk.corpus.stopwords.words('russian')[:10]

['и', 'в', 'во', 'не', 'что', 'он', 'на', 'я', 'с', 'со']

In [4]:
ru_stop = nltk.corpus.stopwords.words('russian')

In [5]:
vec = feature_extraction.text.TfidfVectorizer(
    stop_words=ru_stop,
    lowercase=True,
    #min_df=0.00003,
    max_features=8600
)
# Fitting on train and test as merged lists
vec.fit(train['title'].values.tolist() + test['title'].values.tolist())
print(len(vec.get_feature_names()))

8600


In [6]:
# Word counts for train titles. CSR Matrix, tokens ordered alphabetically
counts = vec.transform(train['title'].values.tolist())

In [7]:
index = np.random.choice(len(train),size=int(1e5))

# (FIRST) Select those indices from the Sparse matrix and then turn into an array.
sample = counts[index].toarray()
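
A side note on the sampling step: `np.random.choice` draws with replacement by default, so the 100k sample can contain repeated rows. A minimal sketch of the difference, with small stand-in sizes (assumptions) and NumPy's `Generator` API:

```python
import numpy as np

n_rows = 1_000      # stand-in for len(train)
sample_size = 500   # stand-in for int(1e5)

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable

with_repl = rng.choice(n_rows, size=sample_size)                    # default: replace=True
without_repl = rng.choice(n_rows, size=sample_size, replace=False)  # every row at most once

print(len(np.unique(with_repl)), len(np.unique(without_repl)))
```

Whether duplicates matter for the PLSR fit is not something this notebook tests; `replace=False` is simply the option to reach for if unique rows are wanted.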

In [8]:
# More components didn't offer a significant gain 
reduce = cross_decomposition.PLSRegression(n_components=10)

# Fit undersampled array against the same indices for the target.
reduce.fit(sample,train.iloc[index].deal_probability)

# PLSR score reducing random sample into n components.
reduce.score(sample,train.iloc[index].deal_probability)

0.26151920926493344

In [9]:
y_pred = reduce.predict(sample)

# mean_squared_error returns the MSE; take its square root for the RMSE.
metrics.mean_squared_error(train.iloc[index].deal_probability,y_pred)

0.050281080394525314

In [54]:
# Reduce all CSR values in batches
reduced = pd.DataFrame()
lower = 0
for idx in np.arange(0,int(len(train)*1.1),int(1.1e5)):
    if idx > len(train):
        idx = len(train)
    upper = idx
    if upper > lower:
        #print(lower,upper)
        sample = counts[lower:upper].toarray()
        sample = reduce.transform(sample)
        reduced = reduced.append(pd.DataFrame(sample))
        lower = idx

In [39]:
linear = linear_model.LinearRegression()
linear.fit(reduced,train.deal_probability)
linear.score(reduced,train.deal_probability)

0.14190194473492246

---

## Stopwords for Descriptions
**Summary**

- In the grand scheme of things, removing stopwords makes descriptions almost as predictive as titles.
- For some reason, overfitting is less severe with descriptions than with titles: $R^2$ drops from 0.19 on the PLSR sample to 0.13 on the full train data for descriptions, versus 0.26 to 0.14 for titles.
- Perhaps this is because descriptions contain more stopwords, so they benefit more from removing them.

In [5]:
vec = feature_extraction.text.TfidfVectorizer(
    stop_words=ru_stop,
    lowercase=True,
    #min_df=0.00003,
    max_features=4500
)
# Fitting on train and test as merged lists
vec.fit(train['description'].astype(str).tolist() + test['description'].astype(str).tolist())
print(len(vec.get_feature_names()))

4500


In [6]:
# Word counts for train. CSR Matrix, tokens ordered alphabetically
counts = vec.transform(train['description'].astype(str).tolist())

In [7]:
index = np.random.choice(len(train),size=int(1e5))

# (FIRST) Select those indices from the Sparse matrix and then turn into an array.
sample = counts[index].toarray()

In [8]:
# More components didn't offer a significant gain 
reduce = cross_decomposition.PLSRegression(n_components=10)

# Fit undersampled array against the same indices for the target.
reduce.fit(sample,train.iloc[index].deal_probability)

# PLSR score reducing random sample to n components.
reduce.score(sample,train.iloc[index].deal_probability)

0.19728699139064343

- That's a higher score than for descriptions without stopwords (`0.1954`).

In [10]:
# Reduce all CSR values in batches
reduced = pd.DataFrame()
lower = 0
for i,idx in enumerate(np.arange(0,int(len(train)*1.1),int(1.1e5))):
    if idx > len(train):
        idx = len(train)
    upper = idx
    if upper > lower:
        #print(lower,upper)
        sample = counts[lower:upper].toarray()
        sample = reduce.transform(sample)
        reduced = reduced.append(pd.DataFrame(sample))
        lower = idx
    else:
        lower = idx

In [11]:
reduced.shape

(1503424, 10)

In [11]:
linear = linear_model.LinearRegression()
linear.fit(reduced,train.deal_probability)
linear.score(reduced,train.deal_probability)

0.13796255999177576

---
# NGram Ranges on Titles
**Summary**

- (2,2) and (3,3) performed incredibly poorly. Those examples were deleted.
- (1,2) performs significantly better than (2,2). It brings scores almost on par with unigrams, but still slightly below.
- That's because (1,2) lets unigrams and bigrams compete on frequency. Only bigrams that are frequent enough to rank within the `max_features` most frequent terms make the cut.
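
The competition between unigrams and bigrams can be seen on a toy corpus. This is a sketch with made-up titles; `max_features` keeps the terms with the highest total counts across the corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# "red", "car" and the bigram "red car" each occur three times; everything else once.
titles = ["red car", "red car", "red car fast", "blue bike"]

vec = TfidfVectorizer(ngram_range=(1, 2), max_features=3).fit(titles)
kept = sorted(vec.vocabulary_)
print(kept)
```

The bigram `red car` survives because it is as frequent as its unigrams, while rare unigrams such as `fast` and `blue` are pushed out of the cap.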

In [116]:
vec = feature_extraction.text.TfidfVectorizer(
    #stop_words=ru_stop,
    lowercase=True,
    max_features=8600,
    ngram_range=(1,2)
)
# Fitting on train and test as merged lists
vec.fit(train['title'].values.tolist() + test['title'].values.tolist())
print(len(vec.get_feature_names()))

8600


In [110]:
# Word counts for train. CSR Matrix, tokens ordered alphabetically
counts = vec.transform(train['title'].values.tolist())

In [111]:
index = np.random.choice(len(train),size=int(1e5))

# (FIRST) Select those indices from the Sparse matrix and then turn into an array.
sample = counts[index].toarray()

In [112]:
# More components didn't offer a significant gain 
reduce = cross_decomposition.PLSRegression(n_components=10)

# Fit undersampled array against the same indices for the target.
reduce.fit(sample,train.iloc[index].deal_probability)

# PLSR score reducing random sample to n components.
reduce.score(sample,train.iloc[index].deal_probability)

0.25612322051624004

In [113]:
# Reduce all CSR values in batches
reduced = pd.DataFrame()
lower = 0
for i,idx in enumerate(np.arange(0,int(len(train)*1.1),int(1.1e5))):
    if idx > len(train):
        idx = len(train)
    upper = idx
    if upper > lower:
        #print(lower,upper)
        sample = counts[lower:upper].toarray()
        sample = reduce.transform(sample)
        reduced = reduced.append(pd.DataFrame(sample))
        lower = idx
    else:
        lower = idx

In [114]:
linear = linear_model.LinearRegression()
linear.fit(reduced,train.deal_probability)
linear.score(reduced,train.deal_probability)

0.14216173687810452

---
## NGram Ranges on Descriptions

**Summary**

- Without stopwords, got `0.1945` on PLSR sample and `0.1333` on full train.
- With stopwords, got `0.1922` on PLSR sample and `0.1360` on full train. Again, stopwords improved full score on descriptions only.
- However, the highest score for descriptions was in the previous section without NGram Ranges. (`0.1379`)

In [11]:
vec = feature_extraction.text.TfidfVectorizer(
    stop_words=ru_stop,
    lowercase=True,
    max_features=4500,
    ngram_range=(1,2)
)
# Fitting on train and test as merged lists
vec.fit(train['description'].astype(str).tolist() + test['description'].astype(str).tolist())
print(len(vec.get_feature_names()))

4500


In [12]:
# Word counts for train. CSR Matrix, tokens ordered alphabetically
counts = vec.transform(train['description'].astype(str).tolist())

In [13]:
index = np.random.choice(len(train),size=int(1e5))

# (FIRST) Select those indices from the Sparse matrix and then turn into an array.
sample = counts[index].toarray()

In [14]:
# More components didn't offer a significant gain 
reduce = cross_decomposition.PLSRegression(n_components=10)

# Fit undersampled array against the same indices for the target.
reduce.fit(sample,train.iloc[index].deal_probability)

# PLSR score reducing random sample to n components.
reduce.score(sample,train.iloc[index].deal_probability)

0.19229064958193207

In [15]:
# Reduce all CSR values in batches
reduced = pd.DataFrame()
lower = 0
for i,idx in enumerate(np.arange(0,int(len(train)*1.1),int(1.1e5))):
    if idx > len(train):
        idx = len(train)
    upper = idx
    if upper > lower:
        #print(lower,upper)
        sample = counts[lower:upper].toarray()
        sample = reduce.transform(sample)
        reduced = reduced.append(pd.DataFrame(sample))
        lower = idx
    else:
        lower = idx

In [16]:
linear = linear_model.LinearRegression()
linear.fit(reduced,train.deal_probability)
linear.score(reduced,train.deal_probability)

0.13605336534332912

---
# Lowercase on Titles

**Summary**

- Setting `lowercase` to `False` for the first time. Maybe ads with uppercase text have a higher deal probability.
- Score on the PLSR sample was `0.2582`, slightly below the `0.266x` benchmark.
- On the full train data the score was `0.1356`, slightly below the `0.1379` benchmark.
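
To make the effect of `lowercase=False` concrete, here is a small sketch on two hypothetical titles that differ only in casing:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

titles = ["Продам iPhone", "продам iphone"]

lowered = TfidfVectorizer(lowercase=True).fit(titles)
cased = TfidfVectorizer(lowercase=False).fit(titles)

# Vocabulary size with and without case folding.
print(len(lowered.vocabulary_), len(cased.vocabulary_))
```

With case folding the two titles share one vocabulary. Without it, every case variant becomes its own feature and competes for a slot under `max_features`.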

In [17]:
vec = feature_extraction.text.TfidfVectorizer(
    #stop_words=ru_stop,
    lowercase=False,
    max_features=8600,
    #ngram_range=(1,2)
)
# Fitting on train and test as merged lists
vec.fit(train['title'].values.tolist() + test['title'].values.tolist())
print(len(vec.get_feature_names()))

8600


In [18]:
# Word counts for train. CSR Matrix, tokens ordered alphabetically
counts = vec.transform(train['title'].values.tolist())

In [19]:
index = np.random.choice(len(train),size=int(1e5))

# (FIRST) Select those indices from the Sparse matrix and then turn into an array.
sample = counts[index].toarray()

In [20]:
# More components didn't offer a significant gain 
reduce = cross_decomposition.PLSRegression(n_components=10)

# Fit undersampled array against the same indices for the target.
reduce.fit(sample,train.iloc[index].deal_probability)

# PLSR score reducing random sample to n components.
reduce.score(sample,train.iloc[index].deal_probability)

0.2582151950790056

In [21]:
# Reduce all CSR values in batches
reduced = pd.DataFrame()
lower = 0
for i,idx in enumerate(np.arange(0,int(len(train)*1.1),int(1.1e5))):
    if idx > len(train):
        idx = len(train)
    upper = idx
    if upper > lower:
        #print(lower,upper)
        sample = counts[lower:upper].toarray()
        sample = reduce.transform(sample)
        reduced = reduced.append(pd.DataFrame(sample))
        lower = idx
    else:
        lower = idx

In [22]:
linear = linear_model.LinearRegression()
linear.fit(reduced,train.deal_probability)
linear.score(reduced,train.deal_probability)

0.13565958748292106

In [23]:
model_selection.cross_val_score(cv=4,
                                estimator=linear,
                                X=reduced,
                                y=train.deal_probability)

array([0.13729185, 0.13308195, 0.13636996, 0.13581638])

---
# Observations
- All parameters that increase the total number of features tend to produce lower scores. Examples are `lowercase=False` and `ngram_range=(1,2)`. On the other hand, `stop_words` increased the score on descriptions, and it naturally produces fewer features.
- Producing more features to represent the same data means more of those features must be retained for PLSR to start with the same amount of information. Changes that produce more features from the same data increase specificity, but that specificity requires more processing, and my RAM forces me to cap the number of retained features at around 8000 for titles and 4500 for descriptions.
- Another factor at play is that I've been filtering features with `max_features` rather than `min_df`. `max_features` keeps only the most frequent terms. `min_df` drops terms that don't appear frequently enough, which removes some meaningless terms, while `max_df` drops terms that appear so often that they are probably noise and carry little meaning too. I should combine these three parameters with all the previous ones.

---
# Special Combos
>If I want to reap the rewards of some specificity (by using `lowercase=False` or ngrams) I should first **make memory space.**

- With 100k datapoints, I can only load ~8000 features into PLSR at once for titles, and ~4500 for descriptions. Therefore I should filter some terms out with `min_df`/`max_df` and then see whether adding specificity improves scores.
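
A toy sketch of how `min_df` and `max_df` filter terms by document frequency. The corpus and thresholds are assumptions chosen so the effect is easy to see; the real thresholds in this notebook are tiny fractions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["sale phone new", "sale phone used", "sale bike new", "sale bike rare"]

# min_df=2 drops terms found in fewer than 2 documents ("used", "rare");
# max_df=0.75 drops terms found in more than 75% of documents ("sale").
vec = TfidfVectorizer(min_df=2, max_df=0.75).fit(docs)
kept = sorted(vec.vocabulary_)
print(kept)
```

If `max_features` is set as well, it applies to whatever survives the document-frequency filters, so filtering first frees slots under the cap.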

In [5]:
vec = feature_extraction.text.TfidfVectorizer(
    stop_words=ru_stop,
    lowercase=False,
    max_features=8600,
    ngram_range=(1,2),
    min_df=0.000045
)
# Fitting on train and test as merged lists
vec.fit(train['title'].values.tolist() + test['title'].values.tolist())
print(len(vec.get_feature_names()))

8600


**Filtering terms**

- Started with only `stop_words` and `min_df` and got 6237 features. That leaves good room before the 8600 cap.
- Then setting `lowercase=False` gave 6616 total features. That means around 400 features are the same terms with different casing. (More specificity)
- As expected, enabling `ngram_range=(1,2)` at this point filled the rest of the cap at 8600 terms.

In [6]:
# Word counts for train titles. CSR Matrix, tokens ordered alphabetically
counts = vec.transform(train['title'].values.tolist())

In [7]:
index = np.random.choice(len(train),size=int(1e5))

# (FIRST) Select those indices from the Sparse matrix and then turn into an array.
sample = counts[index].toarray()

In [8]:
# More components didn't offer a significant gain 
reduce = cross_decomposition.PLSRegression(n_components=10)

# Fit undersampled array against the same indices for the target.
reduce.fit(sample,train.iloc[index].deal_probability)

# PLSR score reducing random sample to N components.
reduce.score(sample,train.iloc[index].deal_probability)

0.2453587440424907

In [9]:
# Reduce all CSR values in batches
reduced = pd.DataFrame()
lower = 0
for i,idx in enumerate(np.arange(0,int(len(train)*1.1),int(1.1e5))):
    if idx > len(train):
        idx = len(train)
    upper = idx
    if upper > lower:
        #print(lower,upper)
        sample = counts[lower:upper].toarray()
        sample = reduce.transform(sample)
        reduced = reduced.append(pd.DataFrame(sample))
        lower = idx
    else:
        lower = idx

In [11]:
linear = linear_model.LinearRegression()
linear.fit(reduced,train.deal_probability)
linear.score(reduced,train.deal_probability)

0.13463212179427675

In [12]:
model_selection.cross_val_score(cv=4,
                                estimator=linear,
                                X=reduced,
                                y=train.deal_probability)

array([0.13529843, 0.13176643, 0.1363894 , 0.1349529 ])

- That didn't produce a higher train score. We should find out where the primary information value in this data lies.

---
# Greatest Information Value

>To know whether more specificity adds predictive power, I must be able to process more features in order to assess that gain. To process more features, here I run several rounds of PLSR to handle significantly larger CSR matrices.

**Summary**

- This approach **increased scores** without the use of `lowercase=False` or `NGrams`, which suggests **there is information value in processing more unigram counts from the CSR.**
- In a later step, I will allow the extraction of `NGrams` and `lowercase=False` features in addition to the ones extracted in this step, and evaluate any score increases.
- Meanwhile, to **combat overfitting**, PLSR is being trained with larger samples of the CSR (400k), a four-fold increase from the 100k size used previously. The training size is now about a quarter of the whole train data (1.5 million).


In [5]:
vec = feature_extraction.text.TfidfVectorizer(
    stop_words=ru_stop,
    #lowercase=False,
    #max_features=8600,
    #ngram_range=(1,2),
    min_df=0.000005,
    #max_df=0.0005
)
# Fitting on train and test as merged lists
vec.fit(train['title'].values.tolist() + test['title'].values.tolist())
print(len(vec.get_feature_names()))

29333


**Filtering Features**

- I'll leave around 30k features, then create separate PLSR decompositions for separate column ranges.
- I'll also increase the sample size used to train PLSR and handle fewer features at once to ensure there's enough RAM.
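
The RAM trade-off is simple arithmetic once the slices are converted with `.toarray()`. The sketch below assumes 8-byte `float64` values (the default dtype of `TfidfVectorizer`) and ignores any extra copies PLSR makes internally. `dense_gb` is a hypothetical helper, not part of any library.

```python
def dense_gb(rows, cols, bytes_per_value=8):
    """Size in GB (10**9 bytes) of a dense rows x cols array of float64 values."""
    return rows * cols * bytes_per_value / 1e9

# Shapes taken from the cells in this notebook.
shapes = {
    "100k x 8600 (title sample, earlier sections)": (100_000, 8_600),
    "100k x 4500 (description sample)": (100_000, 4_500),
    "400k x 2000 (PLSR fit sample per column batch)": (400_000, 2_000),
    "310k x 2000 (transform batch)": (310_000, 2_000),
}
for label, (rows, cols) in shapes.items():
    print(label, round(dense_gb(rows, cols), 2), "GB")
```

Under these assumptions, quadrupling the rows while cutting the columns to 2000 keeps each dense array in roughly the same range as the earlier 100k by 8600 sample.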

In [6]:
# Word counts for train titles. CSR Matrix, tokens ordered alphabetically
counts = vec.transform(train['title'].values.tolist())

In [7]:
counts.shape

(1503424, 29333)

**Notes**

- The counts CSR matrix is now larger than any I've run PLSR on before.
- Sample size will increase to 400k, and I will process 2k, maybe 3k, features at once.
- The random index for each PLSR training sample will vary for each batch of features, because it is only used in the `fit` step of PLSR. The indexes for transforming columns and rows in batches, however, follow a strict order.

In [8]:
# Reduce all CSR values in batches
t = time.time()
reduced = pd.DataFrame(index=train.index)
low_col = 0
# Start iteration with columns
for col in np.arange(0,int(counts.shape[1]*1.05),2000):
    # Limiting the edge case of the last values
    if col > counts.shape[1]:
        col = counts.shape[1]
    up_col = col
    
    if up_col > low_col:
        # Train PLSR on a large sample of those columns from CSR
        index = np.random.choice(len(train),size=int(4e5))
        sample = counts[index,low_col:up_col].toarray()
        reduce = cross_decomposition.PLSRegression(n_components=5)
        reduce.fit(sample,train.iloc[index].deal_probability)
        print('Score for feature range:',reduce.score(sample,train.iloc[index].deal_probability))
        
        # Nested indexes iteration
        components = pd.DataFrame()
        low_idx = 0
        for idx in np.arange(0,int(len(train)*1.1),int(3.1e5)):
            # Limiting the edge case of the last values
            if idx > len(train):
                idx = len(train)
            up_idx = idx

            if up_idx > low_idx:
                print('Indexes:',low_idx,up_idx,'Columns:',low_col,up_col)
                sample = counts[low_idx:up_idx,low_col:up_col].toarray()
                print('Sample shape:',sample.shape)
                sample = reduce.transform(sample)
                components = components.append(pd.DataFrame(sample))
                low_idx = idx
        components.reset_index(drop=True,inplace=True)
        components.columns = ['col_{}-{}_{}'.format(low_col,up_col,i) for i in range(0,5)]
        reduced = reduced.join(components)
        print(reduced.shape,'\n')
        low_col = col
print(time.time()-t)

Score for feature range: 0.026233317577446624
Indexes: 0 310000 Columns: 0 2000
Sample shape: (310000, 2000)
Indexes: 310000 620000 Columns: 0 2000
Sample shape: (310000, 2000)
Indexes: 620000 930000 Columns: 0 2000
Sample shape: (310000, 2000)
Indexes: 930000 1240000 Columns: 0 2000
Sample shape: (310000, 2000)
Indexes: 1240000 1503424 Columns: 0 2000
Sample shape: (263424, 2000)
(1503424, 5) 

Score for feature range: 0.013189517053110555
Indexes: 0 310000 Columns: 2000 4000
Sample shape: (310000, 2000)
Indexes: 310000 620000 Columns: 2000 4000
Sample shape: (310000, 2000)
Indexes: 620000 930000 Columns: 2000 4000
Sample shape: (310000, 2000)
Indexes: 930000 1240000 Columns: 2000 4000
Sample shape: (310000, 2000)
Indexes: 1240000 1503424 Columns: 2000 4000
Sample shape: (263424, 2000)
(1503424, 10) 

Score for feature range: 0.01896583078454539
Indexes: 0 310000 Columns: 4000 6000
Sample shape: (310000, 2000)
Indexes: 310000 620000 Columns: 4000 6000
Sample shape: (310000, 2000)
Inde

In [9]:
reduced.shape

(1503424, 75)

reduced.to_pickle('train_nlp_features.pkl',compression='zip')

**Funny remark**

- I really sweated for this 3-point increase in $R^2$, lol.

In [16]:
linear = linear_model.LinearRegression()
linear.fit(reduced,train.deal_probability)
linear.score(reduced,train.deal_probability)

0.16947041244877115

In [17]:
model_selection.cross_val_score(cv=4,
                                estimator=linear,
                                X=reduced,
                                y=train.deal_probability)

array([0.17082166, 0.16651347, 0.16958629, 0.17033808])

In [20]:
# mean_squared_error returns the MSE (no square root), so the scorer is named accordingly.
mse = metrics.make_scorer(metrics.mean_squared_error)
model_selection.cross_val_score(cv=4,
                                estimator=linear,
                                X=reduced,
                                y=train.deal_probability,
                                scoring=mse
                               )

array([0.05602679, 0.0562293 , 0.05612842, 0.05636757])
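
The scorer above wraps `mean_squared_error`, so these values are MSE. Their square root, roughly 0.237, would be the RMSE. A minimal sketch of getting RMSE directly from `cross_val_score` with scikit-learn's built-in scoring strings, on synthetic regression data (an assumption, not the notebook's features):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
linear = LinearRegression()

# Built-in error scorers are negated so that "greater is better" holds.
neg_mse = cross_val_score(linear, X, y, cv=4, scoring="neg_mean_squared_error")
neg_rmse = cross_val_score(linear, X, y, cv=4, scoring="neg_root_mean_squared_error")

mse = -neg_mse
rmse = -neg_rmse
print(mse, rmse)
```

The per-fold RMSE is the square root of the per-fold MSE, so either scorer can be converted to the other after the fact.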

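In [4]:
# Fresh kernel: re-import libraries and reload the features pickled above.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
reduced = pd.read_pickle('train_nlp_features.pkl',compression='zip')
plt.figure(figsize=(20,14))
sns.heatmap(reduced.corr())
plt.show()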