# Meta Models - Stagewise

In this notebook we prototype a set of meta models. We take the s-values produced by the chainRec algorithm trained with stagewise sampling and use them as input features to our model, combined with book-level features, in the hope of improving its predictive accuracy.

In [1]:
import json
import os
import random
import numpy as np
import pandas as pd
from gensim.models import Word2Vec
import scipy.sparse as sp

random.seed(42)
np.random.seed(42)

### Loading S-Values

We load the S-Values from the ChainRec model.

In [2]:
MAPPING_DIR = '../mappings/'

s_values_df = pd.read_csv(MAPPING_DIR+"goodreads_s_values_stage.csv")

In [3]:
cols_to_keep = ['user_number', 'item_number', 's1', 's2', 's3', 's4']

s_values_df = s_values_df[cols_to_keep]

In [4]:
s_values_df['user_number'] = s_values_df['user_number'].apply(lambda x: str(x))
s_values_df['item_number'] = s_values_df['item_number'].apply(lambda x: str(x))
s_values_df['user_item_id'] = s_values_df['user_number'] + "-" + s_values_df['item_number']

### Load Data

In [5]:
OUTPUT_DATA_DIR = "../output_data/"

train_df_processed = pd.read_csv(OUTPUT_DATA_DIR+"text_processed_training.csv")
val_df_processed = pd.read_csv(OUTPUT_DATA_DIR+"text_processed_validation.csv")
test_df_processed = pd.read_csv(OUTPUT_DATA_DIR+"text_processed_testing.csv")


### Combining Features

We now combine the features engineered on the raw data with the trained s values.

Both datasets contain interactions keyed by user id and book id. However, the chainRec model expects the ids to be numeric, so we apply the same id mapping to the raw data and then join the two datasets to get our augmented set of features.

In [9]:
def load_mapping(mapping_file):
    """Loads the mapping from `mapping_file`.
    
    Parameters
    ----------
    mapping_file: str
        The name of the mapping file to import.
    
    Returns
    -------
    pd.DataFrame
        The DataFrame created from the mapping.
    
    """
    return pd.read_csv(os.path.join("../mappings", "{}.csv".format(mapping_file)))

In [10]:
user_map = load_mapping("user_map")
book_map = load_mapping("book_map")

In [11]:
book_map['book_id'] = book_map['book_id'].apply(lambda x: str(x))

In [12]:
def create_user_item_id(data_df, u_map, i_map):
    data_df['book_id'] = data_df['book_id'].apply(lambda x: str(x))
    data_df = pd.merge(data_df, u_map, how="left", on=["user_id"])
    data_df = pd.merge(data_df, i_map, how="left", on=["book_id"])
    data_df['user_number'] = data_df['user_number'].apply(lambda x: str(x))
    data_df['book_number'] = data_df['book_number'].apply(lambda x: str(x))
    data_df['user_item_id'] = data_df['user_number'] + "-" + data_df['book_number']
    return data_df.drop(columns=['user_number', 'book_number'])

In [13]:
train_df = create_user_item_id(train_df_processed, user_map, book_map)
val_df = create_user_item_id(val_df_processed, user_map, book_map)
test_df = create_user_item_id(test_df_processed, user_map, book_map)

In [14]:
s_values_df.drop(columns=['user_number', 'item_number'], inplace=True)

In [15]:
train_df_s = pd.merge(train_df, s_values_df, how='left', on=['user_item_id'])
val_df_s = pd.merge(val_df, s_values_df, how='left', on=['user_item_id'])
test_df_s = pd.merge(test_df, s_values_df, how='left', on=['user_item_id'])

In [16]:
train_df_s.drop(columns=['user_item_id'], inplace=True)
val_df_s.drop(columns=['user_item_id'], inplace=True)
test_df_s.drop(columns=['user_item_id'], inplace=True)
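
Because the join above is a left merge, interactions with no matching s-values are kept with missing `s1`..`s4`. The following is a minimal sketch of how that coverage could be checked on toy data; the frames and values are illustrative assumptions, not the notebook's data, and only the `user_item_id` key format is taken from the cells above.

```python
import pandas as pd

# Toy interactions and toy s-values (hypothetical values for illustration).
interactions = pd.DataFrame({
    "user_number": ["0", "0", "1"],
    "book_number": ["10", "11", "10"],
    "recommended": [1, 0, 1],
})
s_values = pd.DataFrame({
    "user_item_id": ["0-10", "1-10"],
    "s1": [0.3, -1.2],
})

# Build the same "<user>-<item>" key used in the notebook.
interactions["user_item_id"] = (
    interactions["user_number"] + "-" + interactions["book_number"]
)

# indicator=True adds a `_merge` column recording whether each row found a match.
merged = pd.merge(
    interactions, s_values, how="left", on="user_item_id", indicator=True
)

# Rows marked "left_only" had no s-values and carry NaN in the s columns.
n_missing = int((merged["_merge"] == "left_only").sum())
print("rows without s-values:", n_missing)
```

Running a check like this before dropping `user_item_id` shows how many rows reach the models with missing s-values.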

### Saving Augmented Data

We save the augmented train, validation, and test datasets for easier access in future models.

In [None]:
train_df_s.to_csv(OUTPUT_DATA_DIR+"training_s_vals_stage.csv", index=False)
val_df_s.to_csv(OUTPUT_DATA_DIR+"validation_s_vals_stage.csv", index=False)
test_df_s.to_csv(OUTPUT_DATA_DIR+"testing_s_vals_stage.csv", index=False)

In [17]:
pd.set_option('display.max_columns', None)

### Non Language Features

We select the set of non language features which will be combined with the vectorized language features to get a full set of model features.

We also scale the count features. Because counts tend to span several orders of magnitude, we apply a log transform to compress them onto a roughly linear scale; non-positive counts are mapped to 0.

In [18]:
columns_to_keep = ['text_reviews_count', 'is_ebook', 'average_rating', 'num_pages',
                   'ratings_count', 'is_translated', 'is_in_series', 'series_length', 
                   'is_paperback', 'is_hardcover', 'is_audio', 'from_penguin', 
                   'from_harpercollins', 'from_university_press', 'from_vintage',
                   'from_createspace', 'author_a', 'author_b', 'author_c',
                   'author_d', 'author_e', 'author_f', 's1', 's2', 's3', 's4',
                   'shelved_count', 'read_count', 'rated_count', 'recommended_count']
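# Copy the selections so that later in-place transforms do not write
# to a slice of the merged DataFrames.
X_train_reg = train_df_s[columns_to_keep].copy()
X_val_reg = val_df_s[columns_to_keep].copy()
X_test_reg = test_df_s[columns_to_keep].copy()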

In [19]:
def log_transform_columns(data_df, cols):
    """Applies a log transform to `cols` in `data_df`.

    Parameters
    ----------
    data_df: pd.DataFrame
        The DataFrame in which the columns will be transformed.
    cols: collection
        The columns in `data_df` to be log scaled.

    Returns
    -------
    pd.DataFrame
        The DataFrame obtained from `data_df` after log scaling
        the columns `cols`.

    """
    for col in cols:
        data_df[col] = data_df[col].apply(lambda x: np.log(x) if x > 0 else 0)
    return data_df
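
The function above applies the transform element by element and writes into the DataFrame it receives. As a sketch of an alternative, the same rule (log for positive values, 0 otherwise) can be written with NumPy masks on a copy. The name `log_transform_columns_vectorized` and the toy frame are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def log_transform_columns_vectorized(data_df, cols):
    """Return a copy of `data_df` with log(x) applied to the positive
    entries of `cols`; all other entries in those columns become 0."""
    out = data_df.copy()
    for col in cols:
        values = out[col].to_numpy(dtype=float)
        positive = values > 0
        transformed = np.zeros_like(values)
        # np.log is only evaluated where the count is positive.
        transformed[positive] = np.log(values[positive])
        out[col] = transformed
    return out


# Toy frame (hypothetical values) to show the behaviour.
toy = pd.DataFrame({"ratings_count": [0, 1, 100], "num_pages": [10, 20, 30]})
scaled = log_transform_columns_vectorized(toy, ["ratings_count"])
print(scaled)
```

Returning a copy leaves the input untouched, which avoids writing into a slice of another DataFrame.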

In [20]:
log_transform_cols = ['text_reviews_count', 'ratings_count', 'shelved_count', 'read_count', 'rated_count', 'recommended_count']
X_train_reg = log_transform_columns(X_train_reg, log_transform_cols)
X_val_reg = log_transform_columns(X_val_reg, log_transform_cols)
X_test_reg = log_transform_columns(X_test_reg, log_transform_cols)


### Non-Language Models

We first try some models that use no text-based features, only the book-level features augmented with the s values.

##### Unscaled S Values

First we use the raw s values, with no sigmoid applied. All features are min-max scaled before fitting.

In [21]:
from sklearn.preprocessing import MinMaxScaler

min_max_scaler = MinMaxScaler()

X_train_reg1 = min_max_scaler.fit_transform(X_train_reg)
X_val_reg1 = min_max_scaler.transform(X_val_reg)
X_test_reg1 = min_max_scaler.transform(X_test_reg)

In [22]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

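reg_model = LogisticRegression(max_iter=200)
reg_model.fit(X_train_reg1, train_df_s['recommended'])

# NOTE: `predict` returns hard 0/1 labels, so these AUCs are computed from
# thresholded predictions; `predict_proba(...)[:, 1]` would score the ranking.
train_AUC = roc_auc_score(train_df_s['recommended'], reg_model.predict(X_train_reg1))
val_AUC = roc_auc_score(val_df_s['recommended'], reg_model.predict(X_val_reg1))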

print("Training AUC: {}".format(train_AUC))
print("Validation AUC: {}".format(val_AUC))

Training AUC: 0.7555134796990775
Validation AUC: 0.6239107606851506


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
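
The AUC values in this notebook are computed from the output of `predict`, which is a hard 0/1 label. The sketch below illustrates, on toy data, how an AUC computed from labels can differ from one computed from scores. The helper `pairwise_auc` is a hand-rolled stand-in used only to keep the example dependency-free, and the probabilities are made-up values; with scikit-learn, scores would typically come from `predict_proba(X)[:, 1]`.

```python
import numpy as np


def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive is scored higher; ties count as half a win."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size


y = np.array([0, 0, 1, 1])
proba = np.array([0.1, 0.3, 0.4, 0.9])   # hypothetical predicted probabilities
labels = (proba >= 0.5).astype(int)      # what a 0.5-threshold `predict` would give

auc_from_proba = pairwise_auc(y, proba)    # the ranking is perfect here
auc_from_labels = pairwise_auc(y, labels)  # thresholding introduces ties

print(auc_from_proba, auc_from_labels)
```

Thresholding collapses every score to one of two values, so most positive/negative pairs become ties and the AUC moves toward 0.5.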


In [28]:
reg_df = pd.DataFrame({'feature': columns_to_keep,
                       'regression_coefficient': reg_model.coef_[0]})

reg_df.head(30)

| | feature | regression_coefficient |
|---|---|---|
| 0 | text_reviews_count | 0.012054 |
| 1 | is_ebook | 0.176147 |
| 2 | average_rating | 1.981782 |
| 3 | num_pages | -0.722584 |
| 4 | ratings_count | 3.416591 |
| 5 | is_translated | -0.233676 |
| 6 | is_in_series | -0.198069 |
| 7 | series_length | 0.245437 |
| 8 | is_paperback | -0.036842 |
| 9 | is_hardcover | -0.0103 |


In [29]:
from xgboost import XGBClassifier

xg_cls = XGBClassifier(
    objective='binary:logistic', learning_rate=0.1,
    max_depth=1, n_estimators=1000)

xg_cls.fit(X_train_reg1, train_df_processed['recommended'])
train_AUC = roc_auc_score(
    train_df_processed['recommended'], xg_cls.predict(X_train_reg1))
val_AUC = roc_auc_score(
    val_df_processed['recommended'], xg_cls.predict(X_val_reg1))

print("Training AUC: {}".format(train_AUC))
print("Validation AUC: {}".format(val_AUC))

Training AUC: 0.7646718514149099
Validation AUC: 0.6272306679271902


In [30]:
from sklearn.ensemble import RandomForestClassifier

ranfor_model = RandomForestClassifier(n_estimators=1000, max_depth=20)
ranfor_model.fit(X_train_reg1, train_df_processed['recommended'])

train_AUC = roc_auc_score(
    train_df_processed['recommended'], ranfor_model.predict(X_train_reg1))
val_AUC = roc_auc_score(
    val_df_processed['recommended'], ranfor_model.predict(X_val_reg1))

print("Training AUC: {}".format(train_AUC))
print("Validation AUC: {}".format(val_AUC))

Training AUC: 0.9094766807386248
Validation AUC: 0.6251176803100694


##### Scaled S Values

Next we try the models after scaling the s values with a sigmoid function, which maps each value into the interval (0, 1).

In [31]:
def sigmoid(val):
    """Applies the sigmoid function to `val`.
    
    The sigmoid function has the form
    f(x) = 1 / (1 + exp(-x))
    
    Parameters
    ----------
    val: float
        The operand to the sigmoid function.
    
    Returns
    -------
    float
        The result of applying the sigmoid
        function to `val`.
    
    """
    return 1 / (1 + np.exp(-val))

In [32]:
def scale_s_values(data_df):
    """Applies the sigmoid function to the s values in `data_df`.
    
    Parameters
    ----------
    data_df: pd.DataFrame
        The DataFrame for which the operation is performed.
    
    Returns
    -------
    pd.DataFrame
        The DataFrame that results from scaling the s values
        in `data_df`.
    
    """
    for s_col in ["s1", "s2", "s3", "s4"]:
        data_df[s_col] = data_df[s_col].apply(lambda x: sigmoid(x))
    return data_df
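
`scale_s_values` writes the scaled columns back into the DataFrame it is given. As a sketch of an alternative, the sigmoid can be applied to all four columns at once on a copy. The name `scale_s_values_copy` and the toy values are assumptions for illustration; the column names `s1`..`s4` are those used above.

```python
import numpy as np
import pandas as pd

S_COLS = ["s1", "s2", "s3", "s4"]


def scale_s_values_copy(data_df, s_cols=S_COLS):
    """Return a copy of `data_df` with the sigmoid applied to `s_cols`."""
    out = data_df.copy()
    # np.exp works element-wise on the selected columns.
    out[s_cols] = 1.0 / (1.0 + np.exp(-out[s_cols]))
    return out


# Toy frame (hypothetical values).
toy = pd.DataFrame({
    "s1": [0.0, 2.0],
    "s2": [-1.0, 1.0],
    "s3": [0.5, -0.5],
    "s4": [3.0, -3.0],
    "num_pages": [100, 200],
})
scaled = scale_s_values_copy(toy)
print(scaled)
```

With this variant the unscaled and sigmoid-scaled feature sets remain separate objects, so either can be reused later.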

In [33]:
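# NOTE: this scaler is instantiated but never fitted or applied in this cell,
# so the models below see the non-s features without min-max scaling.
min_max_scaler = MinMaxScaler()

# scale_s_values modifies its argument in place and returns it, so each
# X_*_reg2 refers to the same DataFrame as the corresponding X_*_reg.
X_train_reg2 = scale_s_values(X_train_reg)
X_val_reg2 = scale_s_values(X_val_reg)
X_test_reg2 = scale_s_values(X_test_reg)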


In [34]:
reg_model = LogisticRegression(max_iter=1000)
reg_model.fit(X_train_reg2, train_df_s['recommended'])

train_AUC = roc_auc_score(train_df_s['recommended'], reg_model.predict(X_train_reg2))
val_AUC = roc_auc_score(val_df_s['recommended'], reg_model.predict(X_val_reg2))

print("Training AUC: {}".format(train_AUC))
print("Validation AUC: {}".format(val_AUC))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Training AUC: 0.7589434221982165
Validation AUC: 0.6253907107734591


In [35]:
reg_df = pd.DataFrame({'feature': columns_to_keep,
                       'regression_coefficient': reg_model.coef_[0]})

reg_df.head(30)

| | feature | regression_coefficient |
|---|---|---|
| 0 | text_reviews_count | 0.007644 |
| 1 | is_ebook | -0.165985 |
| 2 | average_rating | 0.972058 |
| 3 | num_pages | -0.000301 |
| 4 | ratings_count | 0.401848 |
| 5 | is_translated | -0.411388 |
| 6 | is_in_series | -0.210058 |
| 7 | series_length | 0.231697 |
| 8 | is_paperback | -0.14542 |
| 9 | is_hardcover | -0.092811 |


In [36]:
from xgboost import XGBClassifier

xg_cls = XGBClassifier(
    objective='binary:logistic', learning_rate=0.1,
    max_depth=1, n_estimators=1000)

xg_cls.fit(X_train_reg2, train_df_processed['recommended'])
train_AUC = roc_auc_score(
    train_df_processed['recommended'], xg_cls.predict(X_train_reg2))
val_AUC = roc_auc_score(
    val_df_processed['recommended'], xg_cls.predict(X_val_reg2))

print("Training AUC: {}".format(train_AUC))
print("Validation AUC: {}".format(val_AUC))

Training AUC: 0.7646718514149099
Validation AUC: 0.6272306679271902


In [37]:
from sklearn.ensemble import RandomForestClassifier

ranfor_model = RandomForestClassifier(n_estimators=1000, max_depth=15)
ranfor_model.fit(X_train_reg2, train_df_processed['recommended'])

train_AUC = roc_auc_score(
    train_df_processed['recommended'], ranfor_model.predict(X_train_reg2))
val_AUC = roc_auc_score(
    val_df_processed['recommended'], ranfor_model.predict(X_val_reg2))

print("Training AUC: {}".format(train_AUC))
print("Validation AUC: {}".format(val_AUC))

Training AUC: 0.8330892712883677
Validation AUC: 0.6291754874352941


### Word2Vec

We create vector embeddings for the tokens in the book descriptions using Word2Vec. The vectors of the tokens in a description that appear in the Word2Vec vocabulary are then averaged to get a single vector for the book.

In [38]:
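book_df = train_df_s[['book_id', 'cleaned_text']]
book_df = book_df.drop_duplicates(subset=['book_id'])

# Replace missing descriptions with the empty string.
book_df['cleaned_text'] = book_df['cleaned_text'].fillna("")

# NOTE: each element passed here is a plain string, which gensim iterates
# character by character; pass `text.split()` per book to train on words.
# `size` is the gensim 3.x name for the vector dimension.
w2v = Word2Vec(list(book_df['cleaned_text']), size=200, window=10, min_count=1)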

In [39]:
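def create_book_vector(book_text, vec_length):
    """Creates a vector for the book given by `book_text`.

    The word vectors for each token of `book_text` found in the
    Word2Vec vocabulary are averaged to build a vector for the book.

    Parameters
    ----------
    book_text: str
        The book text for which the vector is generated.
    vec_length: int
        The length of the zero vector returned when no token
        of `book_text` is in the vocabulary.

    Returns
    -------
    np.array
        A vector for the book.

    """
    # NOTE: iterating over a string yields single characters, which matches
    # how `w2v` was trained above; use `str(book_text).split()` in both
    # places to work with whole words instead.
    text_vecs = [word for word in str(book_text) if word in w2v.wv.vocab]
    if len(text_vecs) > 0:
        # Index through `w2v.wv`; indexing the model directly is deprecated.
        return np.mean(w2v.wv[text_vecs], axis=0)
    return np.zeros(vec_length)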
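
The averaging step can be illustrated without a trained model. The sketch below splits the text on whitespace and averages the vectors of known tokens, falling back to a zero vector when none are found. The function name `average_word_vectors`, the plain dict of vectors, and the toy values are assumptions for illustration; with a trained gensim model, `w2v.wv` could presumably play the role of the dict.

```python
import numpy as np


def average_word_vectors(text, vectors, vec_length):
    """Average the vectors of the whitespace-separated tokens of `text`
    that are present in `vectors`; return zeros if none are present."""
    tokens = [tok for tok in str(text).split() if tok in vectors]
    if not tokens:
        return np.zeros(vec_length)
    return np.mean([vectors[tok] for tok in tokens], axis=0)


# Toy 3-dimensional "embeddings" (hypothetical values).
toy_vectors = {
    "dragon": np.array([1.0, 0.0, 2.0]),
    "castle": np.array([3.0, 2.0, 0.0]),
}

book_vec = average_word_vectors("dragon castle unknownword", toy_vectors, 3)
empty_vec = average_word_vectors("", toy_vectors, 3)
print(book_vec, empty_vec)
```

Tokens missing from the vocabulary are skipped, so they do not pull the average toward zero.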

In [40]:
train_df_s['book_vector'] = train_df_s['cleaned_text'].apply(lambda x: create_book_vector(x, 200))
val_df_s['book_vector'] = val_df_s['cleaned_text'].apply(lambda x: create_book_vector(x, 200))
test_df_s['book_vector'] = test_df_s['cleaned_text'].apply(lambda x: create_book_vector(x, 200))


In [41]:
def create_book_vec_df(book_vecs, indices):
    """Creates a dataframe from `book_vecs`.

    Each numpy array in `book_vecs` is converted to a
    row in the resulting dataframe.

    Parameters
    ----------
    book_vecs: list
        A list of numpy arrays where each array corresponds
        to the book vector for a book.
    indices: np.array
        A numpy array of indices for the DataFrame.

    Returns
    -------
    pd.DataFrame
        The DataFrame obtained from converting `book_vecs`
        to a dataframe.

    """
    book_vec_df = pd.DataFrame(np.vstack(book_vecs))
    book_vec_df.columns = ["word" + str(col) for col in book_vec_df.columns]
    book_vec_df.index = indices
    return book_vec_df

In [42]:
train_wv = create_book_vec_df(train_df_s['book_vector'], train_df_s.index)
val_wv = create_book_vec_df(val_df_s['book_vector'], val_df_s.index)
test_wv = create_book_vec_df(test_df_s['book_vector'], test_df_s.index)

In [43]:
X_train_wv_reg = pd.concat([train_wv, X_train_reg2], axis=1)
X_val_wv_reg = pd.concat([val_wv, X_val_reg2], axis=1)
X_test_wv_reg = pd.concat([test_wv, X_test_reg2], axis=1)

In [44]:
reg_model = LogisticRegression(max_iter=1000)
reg_model.fit(X_train_wv_reg, train_df_s['recommended'])

train_AUC = roc_auc_score(train_df_s['recommended'], reg_model.predict(X_train_wv_reg))
val_AUC = roc_auc_score(val_df_s['recommended'], reg_model.predict(X_val_wv_reg))

print("Training AUC: {}".format(train_AUC))
print("Validation AUC: {}".format(val_AUC))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Training AUC: 0.7585883831799561
Validation AUC: 0.6259733105473987


In [45]:
xg_cls = XGBClassifier(
    objective='binary:logistic', learning_rate=0.1,
    max_depth=2, n_estimators=2000)

xg_cls.fit(X_train_wv_reg, train_df_processed['recommended'])
train_AUC = roc_auc_score(
    train_df_processed['recommended'], xg_cls.predict(X_train_wv_reg))
val_AUC = roc_auc_score(
    val_df_processed['recommended'], xg_cls.predict(X_val_wv_reg))

print("Training AUC: {}".format(train_AUC))
print("Validation AUC: {}".format(val_AUC))

Training AUC: 0.7931655735517011
Validation AUC: 0.6251816633278595


In [46]:
from sklearn.ensemble import RandomForestClassifier

ranfor_model = RandomForestClassifier(n_estimators=1000, max_depth=15)
ranfor_model.fit(X_train_wv_reg, train_df_processed['recommended'])

train_AUC = roc_auc_score(
    train_df_processed['recommended'], ranfor_model.predict(X_train_wv_reg))
val_AUC = roc_auc_score(
    val_df_processed['recommended'], ranfor_model.predict(X_val_wv_reg))

print("Training AUC: {}".format(train_AUC))
print("Validation AUC: {}".format(val_AUC))

Training AUC: 0.8292951634251359
Validation AUC: 0.6291268112318548
