The data contains the review texts, metadata (city, bank name, author, date, and so on), and ratings on a scale from 1 to 5. The reviews are stored as JSON records, one per line of a bz2-compressed file, and are loaded into the list responses.

In [None]:
import json
import bz2
import regex
from tqdm import tqdm
from scipy import sparse

In [None]:
import pandas as pd
import numpy as np
import nltk
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [None]:
responses = []
with bz2.BZ2File('banki_responses.json.bz2', 'r') as thefile:
    for row in tqdm(thefile):
        resp = json.loads(row)
        if not resp['rating_not_checked'] and (len(resp['text'].split()) > 0):
            responses.append(resp)

201030it [02:24, 1386.86it/s]


I will classify texts into two classes, distinguishing between highly negative reviews (rating 1) and positive reviews (rating 5).

Let's select N1 reviews with a rating of 1 and N2 reviews with a rating of 5 from the entire dataset. Below, N1 holds the first 15,000 reviews with rating 1 and N2 holds all reviews with rating 5, which keeps the two classes roughly balanced. I will use sklearn.model_selection.train_test_split to split the selected documents into training and test sets.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report

In [None]:
df = pd.json_normalize(responses)[['city', 'bank_name', 'author', 'datetime', 'rating_grade', 'title', 'text', 'bank_license', 'num_comments', 'rating_not_checked']]

In [None]:
df.head()

Unnamed: 0,city,bank_name,author,datetime,rating_grade,title,text,bank_license,num_comments,rating_not_checked
0,г. Москва,Бинбанк,uhnov1,2015-06-08 12:50:54,,Жалоба,Добрый день! Я не являюсь клиентом банка и пор...,лицензия № 2562,0,False
1,г. Новосибирск,Сбербанк России,Foryou,2015-06-08 11:09:57,,Не могу пользоваться услугой Сбербанк он-лайн,Доброго дня! Являюсь держателем зарплатной кар...,лицензия № 1481,0,False
2,г. Москва,Бинбанк,Vladimir84,2015-06-05 20:14:28,,Двойное списание за один товар.,Здравствуйте! Дублирую свое заявление от 03.0...,лицензия № 2562,1,False
3,г. Ставрополь,Сбербанк России,643609,2015-06-05 13:51:01,,Меняют проценты комиссии не предупредив и не ...,Добрый день!! Я открыл расчетный счет в СберБа...,лицензия № 1481,2,False
4,г. Челябинск,ОТП Банк,anfisa-2003,2015-06-05 10:58:12,,Верните денежные средства за страховку,"04.03.2015 г. взяла кредит в вашем банке, заяв...",лицензия № 2766,1,False


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 153499 entries, 0 to 153498
Data columns (total 10 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   city                138325 non-null  object 
 1   bank_name           153499 non-null  object 
 2   author              153479 non-null  object 
 3   datetime            153499 non-null  object 
 4   rating_grade        88658 non-null   float64
 5   title               153499 non-null  object 
 6   text                153499 non-null  object 
 7   bank_license        153498 non-null  object 
 8   num_comments        153499 non-null  int64  
 9   rating_not_checked  153499 non-null  bool   
dtypes: bool(1), float64(1), int64(1), object(7)
memory usage: 10.7+ MB


In [None]:
df['rating_grade'].value_counts()

rating_grade
1.0    47387
5.0    14713
2.0    13509
3.0     9261
4.0     3788
Name: count, dtype: int64

In [None]:
N1 = df[df['rating_grade'] == 1][:15000]

In [None]:
N2 = df[df['rating_grade'] == 5]

In [None]:
n1n2 = pd.concat([N1, N2])

In [None]:
X_train, X_test, y_train, y_test = train_test_split(n1n2['text'], n1n2['rating_grade'], test_size=0.2, random_state=42)

Let's use a text classification algorithm to address the task and establish a baseline. I'll compare different text vectorization approaches: using only unigrams, bigrams, trigrams, or character n-grams.
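To make the four feature types concrete, here is a minimal pure-Python sketch of what word and character n-grams look like for a single string. The helpers `word_ngrams` and `char_ngrams` are hypothetical names used only for illustration, and the whitespace tokenizer is a simplifying assumption; scikit-learn's own tokenizer differs in details.

```python
def word_ngrams(text, n_min, n_max):
    """Return word n-grams for every n in [n_min, n_max], joined by spaces."""
    tokens = text.lower().split()  # simplistic whitespace tokenizer (assumption)
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def char_ngrams(text, n_min, n_max):
    """Return character n-grams for every n in [n_min, n_max]."""
    text = text.lower()
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            grams.append(text[i:i + n])
    return grams


review = "card was blocked"
unigrams = word_ngrams(review, 1, 1)   # analogous to ngram_range=(1, 1)
bigrams = word_ngrams(review, 2, 2)    # analogous to ngram_range=(2, 2)
trigrams = word_ngrams(review, 3, 3)   # analogous to ngram_range=(3, 3)
char_grams = char_ngrams("card", 2, 3) # analogous to analyzer='char', ngram_range=(2, 3)
print(unigrams, bigrams, trigrams, char_grams)
```

Note that `ngram_range=(2, 2)` keeps only bigrams, so unigram information is dropped entirely; a range such as `(1, 2)` would keep both.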

In [None]:
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 1))

In [None]:
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)

In [None]:
X_test_tfidf = tfidf_vectorizer.transform(X_test)
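For intuition about the weights in these matrices, the sketch below computes tf-idf by hand on a toy corpus. It assumes the smoothed idf formula and L2 normalization that scikit-learn uses by default; the function name `tfidf`, the whitespace tokenizer, and the example documents are illustrative assumptions, not part of the notebook.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts in this document
        weights = {term: freq * idf[term] for term, freq in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


docs = ["bank blocked card", "bank returned money", "bank is great"]
vectors = tfidf(docs)
print(vectors[0])
```

A term that occurs in every document ("bank") receives a lower weight than terms specific to one document ("card"), which is why tf-idf tends to emphasize the distinctive words of a review.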

In [None]:
rfc = RandomForestClassifier()

In [None]:
rfc.fit(X_train_tfidf, y_train)

In [None]:
rfc_test = rfc.predict(X_test_tfidf)

In [None]:
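# classification_report expects (y_true, y_pred). The report printed below came from
# the reversed call, so its precision/recall columns are swapped and support counts predictions.
print(classification_report(y_test, rfc_test))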

              precision    recall  f1-score   support

         1.0       0.96      0.93      0.94      3085
         5.0       0.92      0.96      0.94      2858

    accuracy                           0.94      5943
   macro avg       0.94      0.94      0.94      5943
weighted avg       0.94      0.94      0.94      5943



Unigrams give a strong baseline: accuracy 0.94 and F1 0.94 for both classes.
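A note on reading such reports: classification_report takes the true labels first and the predictions second. Reversing the two arguments swaps precision and recall for every class. The pure-Python sketch below demonstrates this on toy labels; `precision_recall` is a hypothetical helper written for illustration, not a scikit-learn function.

```python
def precision_recall(y_true, y_pred, label):
    """Precision and recall for one class, computed from scratch."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    predicted = sum(1 for p in y_pred if p == label)  # how often the class was predicted
    actual = sum(1 for t in y_true if t == label)     # how often the class really occurs
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall


y_true = [1, 1, 1, 1, 5, 5]
y_pred = [1, 1, 5, 5, 5, 5]

correct = precision_recall(y_true, y_pred, label=1)
swapped = precision_recall(y_pred, y_true, label=1)  # arguments reversed
print(correct, swapped)
```

Accuracy and F1 are symmetric in the two arguments, so they are unaffected by the order; only the precision and recall columns (and the support counts) change.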

In [None]:
tfidf_vectorizer_bi = TfidfVectorizer(ngram_range=(2, 2))

In [None]:
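# Use the bigram vectorizer defined above; the original cell called the unigram
# `tfidf_vectorizer` here, so the report printed below may reflect unigram features.
X_train_tfidf_bi = tfidf_vectorizer_bi.fit_transform(X_train)

In [None]:
X_test_tfidf_bi = tfidf_vectorizer_bi.transform(X_test)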

In [None]:
rfc_bi = RandomForestClassifier()

In [None]:
rfc_bi.fit(X_train_tfidf_bi, y_train)

In [None]:
rfc_test_bi = rfc_bi.predict(X_test_tfidf_bi)

In [None]:
# y_true goes first; the report printed below came from the reversed call
print(classification_report(y_test, rfc_test_bi))

              precision    recall  f1-score   support

         1.0       0.95      0.89      0.92      3188
         5.0       0.88      0.95      0.92      2755

    accuracy                           0.92      5943
   macro avg       0.92      0.92      0.92      5943
weighted avg       0.92      0.92      0.92      5943




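The reported scores are slightly lower than with unigrams (accuracy 0.92 versus 0.94). However, these numbers were produced by a cell that fitted the unigram `tfidf_vectorizer` instead of `tfidf_vectorizer_bi`, so the gap may not reflect the effect of bigrams; the report should be regenerated with the bigram vectorizer before drawing a conclusion.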

In [None]:
tfidf_vectorizer_three = TfidfVectorizer(ngram_range=(3, 3))

In [None]:
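# Use the trigram vectorizer defined above; the original cell called the unigram
# `tfidf_vectorizer` here, so the report printed below may reflect unigram features.
X_train_tfidf_three = tfidf_vectorizer_three.fit_transform(X_train)

In [None]:
X_test_tfidf_three = tfidf_vectorizer_three.transform(X_test)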

In [None]:
rfc_three = RandomForestClassifier()

In [None]:
rfc_three.fit(X_train_tfidf_three, y_train)

In [None]:
rfc_test_three = rfc_three.predict(X_test_tfidf_three)

In [None]:
# y_true goes first; the report printed below came from the reversed call
print(classification_report(y_test, rfc_test_three))

              precision    recall  f1-score   support

         1.0       0.95      0.89      0.92      3189
         5.0       0.88      0.95      0.91      2754

    accuracy                           0.92      5943
   macro avg       0.92      0.92      0.92      5943
weighted avg       0.92      0.92      0.92      5943




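With trigrams the reported quality is essentially unchanged relative to the bigram run (accuracy 0.92; F1 for class 5 moves from 0.92 to 0.91). These numbers were produced by a cell that fitted the unigram `tfidf_vectorizer` instead of `tfidf_vectorizer_three`, so they need to be regenerated before they can be attributed to trigrams.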

In [None]:
tfidf_vectorizer_ng = TfidfVectorizer(analyzer='char', ngram_range=(2, 5))

In [None]:
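# Use the character n-gram vectorizer defined above; the original cell called the unigram
# `tfidf_vectorizer` here, so the report printed below may reflect unigram features.
X_train_tfidf_ng = tfidf_vectorizer_ng.fit_transform(X_train)

In [None]:
X_test_tfidf_ng = tfidf_vectorizer_ng.transform(X_test)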

In [None]:
rfc_ng = RandomForestClassifier()

In [None]:
rfc_ng.fit(X_train_tfidf_ng, y_train)

In [None]:
rfc_test_ng = rfc_ng.predict(X_test_tfidf_ng)

In [None]:
# y_true goes first; the report printed below came from the reversed call
print(classification_report(y_test, rfc_test_ng))

              precision    recall  f1-score   support

         1.0       0.96      0.89      0.92      3209
         5.0       0.88      0.95      0.91      2734

    accuracy                           0.92      5943
   macro avg       0.92      0.92      0.92      5943
weighted avg       0.92      0.92      0.92      5943



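With character n-grams the reported quality matches the bigram and trigram runs (accuracy 0.92) and stays below the unigram baseline (0.94). These numbers were also produced by a cell that fitted the unigram `tfidf_vectorizer` instead of `tfidf_vectorizer_ng`, so they need to be regenerated before the feature types can be ranked.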

Let's compare how the quality of solving the task changes when latent topics are used as features:

The first approach combines tf-idf weighting (sklearn.feature_extraction.text.TfidfTransformer; the code below uses TfidfVectorizer, which performs counting and tf-idf weighting in one step) with truncated singular value decomposition, also known as latent semantic analysis (sklearn.decomposition.TruncatedSVD).

The second approach uses topic modeling with LDA (sklearn.decomposition.LatentDirichletAllocation).

I will use accuracy and the F-measure (F1) to evaluate classification quality.
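For reference, the standard definitions of these metrics for a binary problem, written in terms of true/false positives and negatives (TP, FP, TN, FN) with respect to the positive class. With the default settings of f1_score, the positive class is the label 1, which here corresponds to the rating-1 reviews.

```latex
\begin{align}
\text{accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{precision} &= \frac{TP}{TP + FP} \\
\text{recall}    &= \frac{TP}{TP + FN} \\
F_1 &= \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
\end{align}
```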

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation
from sklearn.metrics import accuracy_score, f1_score
from sklearn.svm import SVC

In [None]:
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 1))
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

In [None]:
pipeline_tfidf_svd = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('svd', TruncatedSVD()),  # default n_components=2, i.e. only two latent dimensions
    ('svm', SVC(kernel='linear'))
])

In [None]:
pipeline_tfidf_svd.fit(X_train, y_train)

In [None]:
y_pred_tfidf_svd = pipeline_tfidf_svd.predict(X_test)

In [None]:
accuracy_tfidf_svd = accuracy_score(y_test, y_pred_tfidf_svd)
accuracy_tfidf_svd

0.8732963149924281

In [None]:
f1_tfidf_svd = f1_score(y_test, y_pred_tfidf_svd)
f1_tfidf_svd

0.8743114672008013

In [None]:
lda = LatentDirichletAllocation(n_components=10, random_state=42)
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2)
tf = tf_vectorizer.fit_transform(X_train)
lda.fit(tf)

In [None]:
tf_test = tf_vectorizer.transform(X_test)

In [None]:
lda_features = lda.transform(tf_test)

In [None]:
rfc_lda = RandomForestClassifier()

In [None]:
rfc_lda.fit(lda.transform(tf), y_train)

In [None]:
y_pred_lda = rfc_lda.predict(lda_features)

In [None]:
accuracy_lda = accuracy_score(y_test, y_pred_lda)
accuracy_lda

0.922598014470806

In [None]:
f1_lda = f1_score(y_test, y_pred_lda)
f1_lda

0.9233077692564189

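The second approach performs better here: LDA topic features with a random forest reach accuracy 0.923 and F1 0.923, versus 0.873 and 0.874 for tf-idf with SVD and a linear SVM. The comparison is not fully controlled, since the two pipelines use different classifiers and different numbers of latent dimensions (10 LDA topics versus the TruncatedSVD default of 2 components).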