<a href="https://colab.research.google.com/github/Jlokkerbol/masterclass/blob/main/spacy_pipeline_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install -U pendulum spacy

Collecting pendulum
  Downloading pendulum-2.1.2-cp37-cp37m-manylinux1_x86_64.whl (155 kB)
[?25l[K     |██▏                             | 10 kB 15.3 MB/s eta 0:00:01[K     |████▎                           | 20 kB 19.0 MB/s eta 0:00:01[K     |██████▍                         | 30 kB 18.2 MB/s eta 0:00:01[K     |████████▌                       | 40 kB 11.3 MB/s eta 0:00:01[K     |██████████▋                     | 51 kB 9.0 MB/s eta 0:00:01[K     |████████████▊                   | 61 kB 9.0 MB/s eta 0:00:01[K     |██████████████▉                 | 71 kB 8.3 MB/s eta 0:00:01[K     |█████████████████               | 81 kB 9.2 MB/s eta 0:00:01[K     |███████████████████             | 92 kB 8.7 MB/s eta 0:00:01[K     |█████████████████████▏          | 102 kB 8.0 MB/s eta 0:00:01[K     |███████████████████████▎        | 112 kB 8.0 MB/s eta 0:00:01[K     |█████████████████████████▍      | 122 kB 8.0 MB/s eta 0:00:01[K     |███████████████████████████▌    | 133 kB 8.0 M

In [None]:
!python -m spacy download nl_core_news_lg

Collecting nl-core-news-lg==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/nl_core_news_lg-3.2.0/nl_core_news_lg-3.2.0-py3-none-any.whl (572.6 MB)
Installing collected packages: nl-core-news-lg
Successfully installed nl-core-news-lg-3.2.0
✔ Download and installation successful
You can now load the package via spacy.load('nl_core_news_lg')


In [None]:
from imblearn.ensemble import BalancedRandomForestClassifier, BalancedBaggingClassifier    
import pandas as pd
import pendulum
from sklearn.base import TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
import spacy


# only use 2019 data as example
df = pd.read_parquet("https://github.com/jads-nl/public-lectures/blob/main/nlp/data/dutch-restaurant-reviews-per-year/reviewYear%3D2019/058d741d776d45f18e0ccc51f71173dc.parquet?raw=true")

# initialize the spaCy model
nlp = spacy.load("nl_core_news_lg")

# Dutch Restaurant reviews

## Objective
Predict whether a review comes from a 'detractor', so that the restaurant owner can look up interesting (negative) feedback and act on it.

## Data preparation

### Select main columns

In [None]:
reviews = df.loc[:, ['restoId', 'reviewerId', 'reviewerFame', 'reviewerNumReviews', 'reviewText']].copy()

### Format date columns

In [None]:
def parse_date(date):
    return pendulum.from_format(date, fmt="D MMM YYYY", locale="nl")

reviews["reviewDate"] = df.reviewDate.apply(parse_date).dt.date

### Format numerical columns

In [None]:
def clean_price(string):
    "Remove euro sign and whitespace in price"
    if string:
        return float(string.split(" ")[-1])
    else:
        return 0


reviews["avgPrice"] = df["avgPrice"].fillna(0).apply(clean_price)


# numerical columns have comma as decimal separator --> cast to floats
numerical_cols = [
    "scoreFood",
    "scoreService",
    "scoreDecor",
    "reviewScoreOverall",
    "scoreTotal",
]
for col in numerical_cols:
    reviews[col] = pd.to_numeric(df[col])

### Format ordinal columns

In [None]:
map_scores = {
    "waitingTimeScore": {
        None: 0,
        "Hoog tempo": 1,
        "Kort": 2,
        "Redelijk": 3,
        "Kan beter": 4,
        "Lang": 5,
    },
    "valueForPriceScore": {
        None: 0,
        "Erg gunstig": 1,
        "Gunstig": 2,
        "Redelijk": 3,
        "Precies goed": 4,
        "Kan beter": 5,
    },
    "noiseLevelScore": {
        None: 0,
        "Erg rustig": 1,
        "Rustig": 2,
        "Precies goed": 3,
        "Rumoerig": 4,
    },
    "reviewerFame": {
        None: 0,
        "Proever": 1,
        "Fijnproever": 2,
        "Expertproever": 3,
        "Meesterproever": 4
    }
}

for col in map_scores.keys():
    reviews[col] = (
        df[col].apply(lambda x: map_scores[col].get(x, None)).astype("Int64")
    )

## Text pre-processing

### Filter reviews that are short or in process

In [None]:
def validate_review(review):
    if review == '- Recensie is momenteel in behandeling -' or len(review) < 4:
        return False
    else:
        return True
    

reviews['is_valid'] = reviews.reviewText.apply(validate_review)

### Add simple features

In [None]:
reviews['review_char_length_'] = df.reviewText.apply(lambda x: len(x))

### Tokenize and create Document-Term Matrix

We will use [pandas sparse data structures](https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html) to save memory. Note that the cell below takes about 9 minutes to complete.

In [None]:
%%time
def tokenize_simple(text):
    """Tokenizer returning lowercase tokens with no stop words, no punctuation and no words with encoding errors"""
    doc = nlp(text)
    return [token.lower_ for token in doc if not (token.is_stop or token.is_punct or ("\\" in token.lower_))]


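# some abbreviations aren't in spaCy's default Dutch stop words list, so we add them;
# set.update() works in place and returns None, so build the list in a separate step
nlp.Defaults.stop_words.update(['n', 't'])
stop_words = list(nlp.Defaults.stop_words)

count_vectorizer = CountVectorizer(tokenizer=tokenize_simple, stop_words=stop_words, ngram_range=(1,1))
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
dtm = pd.DataFrame.sparse.from_spmatrix(count_vectorizer.fit_transform(reviews.reviewText), columns=count_vectorizer.get_feature_names_out())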



CPU times: user 9min 14s, sys: 2.04 s, total: 9min 16s
Wall time: 9min 16s


We will only keep words in the DTM that occur more than twice across all reviews. This reduces the width of the DTM.

In [None]:
# boolean mask over the DTM columns: True for tokens whose total count exceeds 2
token_filter = dtm.sum() > 2
print(f"Full DTM: {dtm.shape}")
print(f"Filtered DTM: {dtm.loc[:, token_filter].shape}")

Full DTM: (47048, 32823)
Filtered DTM: (47048, 11050)
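
To make the filtering step concrete, here is a minimal sketch on a toy corpus using only the standard library. The three example reviews and the whitespace tokenizer are assumptions for illustration; the notebook itself uses spaCy tokens and a sparse pandas DataFrame. The threshold mirrors the `dtm.sum() > 2` comparison in the cell above.

```python
from collections import Counter

# Hypothetical toy corpus; the real notebook tokenizes Dutch reviews with spaCy.
docs = [
    "lekker eten en goede bediening",
    "eten was koud en bediening traag",
    "lekker eten maar lang wachten",
]

# One Counter per document: each one is a row of the document-term matrix.
rows = [Counter(doc.split()) for doc in docs]

# Total count per token over all documents (the column sums of the DTM).
totals = Counter()
for row in rows:
    totals.update(row)

# Keep only tokens whose total count exceeds the threshold.
min_total = 2
vocabulary = sorted(token for token, n in totals.items() if n > min_total)

# Dense version of the filtered matrix: one list of counts per document.
filtered_dtm = [[row[token] for token in vocabulary] for row in rows]

print(vocabulary)
print(filtered_dtm)
```

Only "eten" appears three times in this toy corpus, so it is the single column that survives the filter; words that occur exactly twice are dropped.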


## Binary classification: `is_detractor`

### Define Y

In [None]:
reviews["is_detractor"] = reviews.reviewScoreOverall <= 6

### Train-test split

(The cell below takes about two minutes.)

In [None]:
%%time
X = reviews[reviews.is_valid].drop(columns=["reviewDate", "reviewText", "scoreFood", "scoreService", "scoreDecor", "reviewScoreOverall", "scoreTotal", "is_detractor"])
y = reviews[reviews.is_valid].is_detractor

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in sss.split(X,y):
    X_train, X_test = X.iloc[train_index, :], X.iloc[test_index, :]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    dtm_train, dtm_test = dtm.loc[reviews.is_valid, token_filter].iloc[train_index, :], dtm.loc[reviews.is_valid, token_filter].iloc[test_index, :]

CPU times: user 1min 41s, sys: 2.68 s, total: 1min 44s
Wall time: 1min 44s


### BaggingClassifier - without DTM


In [None]:
#%%time
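# pass the base learner positionally: the keyword was renamed from
# `base_estimator` to `estimator` in scikit-learn 1.2
clf1 = BaggingClassifier(DecisionTreeClassifier(),
                         bootstrap=False,
                         random_state=0)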
clf1.fit(X_train, y_train)

BaggingClassifier(base_estimator=DecisionTreeClassifier(), bootstrap=False,
                  random_state=0)

We will skip fine-tuning the model, since our purpose is only to compare it with a model that adds text features. We compare the models on balanced accuracy, defined as the average of the recall obtained on each class.

In [None]:
balanced_accuracy_score(y_test, clf1.predict(X_test)).round(2)

0.66
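
As a small illustration of why balanced accuracy is used here rather than plain accuracy, the sketch below computes both by hand on made-up labels. The function `balanced_accuracy` and the label lists are hypothetical helpers for this example only; the notebook itself uses `balanced_accuracy_score` from scikit-learn.

```python
def balanced_accuracy(y_true, y_pred):
    """Average of the per-class recall, for any number of classes."""
    recalls = []
    for label in sorted(set(y_true)):
        # indices of the samples that truly belong to this class
        idx = [i for i, y in enumerate(y_true) if y == label]
        correct = sum(1 for i in idx if y_pred[i] == label)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical labels: 8 non-detractors (False) and 2 detractors (True).
y_true = [False] * 8 + [True] * 2
# A classifier that always predicts "not a detractor".
y_pred = [False] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                           # 0.8
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

The always-negative classifier looks reasonable on plain accuracy but scores no better than chance on balanced accuracy, because its recall on the detractor class is zero.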

### BaggingClassifier - with DTM

(The cell below takes about 25 minutes.)

In [None]:
%%time
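# pass the base learner positionally: the keyword was renamed from
# `base_estimator` to `estimator` in scikit-learn 1.2
clf2 = BaggingClassifier(DecisionTreeClassifier(),
                         bootstrap=False,
                         random_state=0)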
clf2.fit(dtm_train.join(X_train), y_train) 

  "pandas.DataFrame with sparse columns found."


CPU times: user 25min 31s, sys: 11.3 s, total: 25min 42s
Wall time: 25min 37s


In [None]:
balanced_accuracy_score(y_test, clf2.predict(dtm_test.join(X_test))).round(3)

  "pandas.DataFrame with sparse columns found."


0.704

## Closing remarks

We have illustrated how a simple bag-of-words model can improve a classifier that otherwise uses only structured data: balanced accuracy on the test set rose from 0.66 to 0.704. We haven't tuned either model; this is a like-for-like comparison with identical parameters.

Note that working with text requires more engineering: you need to decide how to store and process the data, because it can quickly outgrow the memory of your (virtual) machine. Even this simple model uses over 11,000 features from a truncated document-term matrix.
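
As a rough back-of-the-envelope check on the memory point, the sketch below estimates the size of the filtered document-term matrix if it were stored densely. The shape is the one printed earlier in the notebook; the 8 bytes per cell is an assumption (a 64-bit integer count), and the actual sparse footprint depends on how many cells are non-zero, which is not computed here.

```python
# Shape of the filtered document-term matrix as printed earlier in the notebook.
n_rows, n_cols = 47048, 11050

# Assumption: each cell is a 64-bit integer (8 bytes) in a dense array.
bytes_per_cell = 8

n_cells = n_rows * n_cols
dense_gb = n_cells * bytes_per_cell / 1e9

print(f"{n_cells:,} cells, about {dense_gb:.1f} GB if stored densely")
```

Under these assumptions the dense matrix alone would take roughly 4 GB, which is why the notebook keeps the counts in sparse columns.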