# Gamer's negative chat recognition

> "Using text preprocessing to detect negative chat messages from gamers"

- toc: true
- branch: master
- badges: true
- comments: true
- categories: [text, chinese, classification, binary, count, vectorizer, kaggle, colab, negative chat, gamer, translation]
- hide: false

In [None]:
# Installing the modules

!pip3 install catboost
!pip3 install googletrans==3.1.0a0
!pip3 install tensorflow_text==2.4.1

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting catboost
  Downloading catboost-1.0.6-cp37-none-manylinux1_x86_64.whl (76.6 MB)
Installing collected packages: catboost
Successfully installed catboost-1.0.6
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting googletrans==3.1.0a0
  Downloading googletrans-3.1.0a0.tar.gz (19 kB)
Collecting httpx==0.13.3
  Downloading httpx-0.13.3-py3-none-any.whl (55 kB)
Collecting sniffio
  Downloading sniffio-1.2.0-py3-none-any.whl (10 kB)
Collecting httpcore==0.9.*
  Downloading httpcore-0.9.1-py3-none-any.whl (42 kB)
Collecting hstspreload
  Downloading hstspreload-2021.12.1-py3-none-any.whl (1.3 MB)

In [None]:
# Required modules

import tqdm
import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text

from xgboost import XGBClassifier
from catboost import CatBoostClassifier

from zipfile import ZipFile
from googletrans import Translator
from matplotlib import pyplot as plt

from sklearn.cluster import KMeans
from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [None]:
# Config

tqdm.tqdm.pandas()
%matplotlib inline
plt.rcParams['figure.figsize'] = (12, 12)

In [None]:
# Create kaggle folder

!mkdir -p ~/.kaggle
!cp ./kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

In [None]:
# Download the competition data (this also confirms the API token works)

!kaggle competitions download -c gamers-negative-chat-recognition

Downloading gamers-negative-chat-recognition.zip to /content
  0% 0.00/1.18M [00:00<?, ?B/s]
100% 1.18M/1.18M [00:00<00:00, 116MB/s]


In [None]:
# Extracting the data

with ZipFile('/content/gamers-negative-chat-recognition.zip', 'r') as zf:
    zf.extractall('./')

In [None]:
# Load the train data

train = pd.read_csv('./train.csv')
train.head()

| | qid | text | label |
|---|---|---|---|
| 0 | 100001 | 我去送了个人头，结果啥也没那到。 | 1 |
| 1 | 100002 | 我送人头给你们发育发育 | 1 |
| 2 | 100003 | 我送你爷爷们多好 | 1 |
| 3 | 100004 | 我送你一个黄金分割率。 | 1 |
| 4 | 100005 | 我现在非常想送人头。 | 1 |


In [None]:
# Inspecting the train

train.info()
train.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60000 entries, 0 to 59999
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   qid     60000 non-null  int64 
 1   text    60000 non-null  object
 2   label   60000 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 1.4+ MB


| | qid | label |
|---|---|---|
| count | 60000.0 | 60000.0 |
| mean | 130000.5 | 0.377967 |
| std | 17320.652413 | 0.484883 |
| min | 100001.0 | 0.0 |
| 25% | 115000.75 | 0.0 |
| 50% | 130000.5 | 0.0 |
| 75% | 145000.25 | 1.0 |
| max | 160000.0 | 1.0 |


In [None]:
# Load the test data

test = pd.read_csv('./test.csv')
test.head()

| | qid | text |
|---|---|---|
| 0 | 160001 | 我这局送的人头有没有上局多 |
| 1 | 160002 | 没事，我送的和你差不多 |
| 2 | 160003 | 我送看你还咋赢 |
| 3 | 160004 | 送成狗，我野区都是人家的，玩你马 |
| 4 | 160005 | 我他喵的不挂机就不错。 |


In [None]:
# Distribution of train

print(train['label'].value_counts(normalize=True))
train['label'].value_counts()

0    0.622033
1    0.377967
Name: label, dtype: float64


0    37322
1    22678
Name: label, dtype: int64

In [None]:
# Translate the Chinese text into English

translator = Translator(service_urls=['translate.googleapis.com'])

train['en_translated'] = train['text'].progress_apply(lambda x: translator.translate(x, src='zh-cn', dest='en').text)
test['en_translated'] = test['text'].progress_apply(lambda x: translator.translate(x, src='zh-cn', dest='en').text)
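
Translating every row with its own API call is slow, and a single network error aborts the whole `progress_apply`. The following is a minimal sketch of one way to soften that, not what the notebook actually ran: it translates each distinct message once and retries on failure. The names `translate_unique`, `translate_fn` and `max_retries` are hypothetical; `translate_fn` stands in for a call such as `lambda s: translator.translate(s, src='zh-cn', dest='en').text`.

```python
from typing import Callable, Dict, Iterable, List, Optional


def translate_unique(
    texts: Iterable[str],
    translate_fn: Callable[[str], str],
    cache: Optional[Dict[str, Optional[str]]] = None,
    max_retries: int = 3,
) -> List[Optional[str]]:
    """Translate each distinct text once and reuse the result for repeats.

    `translate_fn` is a hypothetical stand-in for the real translator call.
    Texts that still fail after `max_retries` attempts map to None.
    """
    cache = {} if cache is None else cache
    results: List[Optional[str]] = []
    for text in texts:
        if text not in cache:
            translated = None
            for _ in range(max_retries):
                try:
                    translated = translate_fn(text)
                    break
                except Exception:
                    # Network or rate-limit error: try again.
                    continue
            cache[text] = translated
        results.append(cache[text])
    return results


# Offline usage with a fake translator so the sketch runs without network access.
calls: List[str] = []


def fake_translate(text: str) -> str:
    calls.append(text)
    return text.upper()


translated = translate_unique(["gg", "afk", "gg"], fake_translate)
print(translated, calls)
```

With the real data, the returned list could be assigned back to the `en_translated` column, and rows that came back as `None` would need to be handled before vectorizing.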

In [None]:
# Separate out features and labels

X = train['en_translated']
y = train['label']

In [None]:
# Train Test Split

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=88)

### Approach 1

* Use a count vectorizer (bag of words) to convert the translated text into token counts, then fit a machine learning classifier (here CatBoost) on those counts.
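
To make the bag-of-words step concrete, here is a small pure-Python sketch of what a count vectorizer computes. It is an illustration only, not scikit-learn's implementation: it reuses the default token pattern of `CountVectorizer` (lowercased words of two or more characters), and the function names and example sentences are made up.

```python
import re
from collections import Counter
from typing import Dict, List, Tuple

# Default token pattern of scikit-learn's CountVectorizer:
# words of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def count_vectorize(docs: List[str]) -> Tuple[List[str], List[List[int]]]:
    """Return a sorted vocabulary and one row of token counts per document."""
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    index: Dict[str, int] = {tok: i for i, tok in enumerate(vocab)}
    rows: List[List[int]] = []
    for doc in docs:
        row = [0] * len(vocab)
        for tok, n in Counter(tokenize(doc)).items():
            row[index[tok]] = n
        rows.append(row)
    return vocab, rows


# Hypothetical translated chat lines.
docs = ["I feed feed again", "You feed too"]
vocab, rows = count_vectorize(docs)
print(vocab)
print(rows)
```

Each document becomes a fixed-length vector indexed by the vocabulary, which is the kind of input the classifier in the pipeline receives (as a sparse matrix in the real vectorizer).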

In [None]:
# Adding vectorizers and model to the pipeline

pipe = Pipeline([
    ('count_vec', CountVectorizer()),
    ('pac', CatBoostClassifier(verbose=2))
])

In [None]:
# Fitting the model

pipe.fit(X_train, y_train)

Learning rate set to 0.053805
0:	learn: 0.6852330	total: 94.1ms	remaining: 1m 34s
2:	learn: 0.6729092	total: 179ms	remaining: 59.6s
4:	learn: 0.6632516	total: 264ms	remaining: 52.6s
6:	learn: 0.6551210	total: 345ms	remaining: 49s
8:	learn: 0.6484885	total: 428ms	remaining: 47.1s
10:	learn: 0.6437421	total: 503ms	remaining: 45.2s
12:	learn: 0.6397520	total: 587ms	remaining: 44.5s
14:	learn: 0.6355606	total: 662ms	remaining: 43.5s
16:	learn: 0.6324941	total: 742ms	remaining: 42.9s
18:	learn: 0.6296412	total: 825ms	remaining: 42.6s
20:	learn: 0.6266962	total: 902ms	remaining: 42s
22:	learn: 0.6246321	total: 980ms	remaining: 41.6s
24:	learn: 0.6230272	total: 1.06s	remaining: 41.6s
26:	learn: 0.6211850	total: 1.25s	remaining: 44.9s
28:	learn: 0.6197541	total: 1.44s	remaining: 48.3s
30:	learn: 0.6183261	total: 1.63s	remaining: 51.1s
32:	learn: 0.6167174	total: 1.89s	remaining: 55.4s
34:	learn: 0.6156047	total: 2.13s	remaining: 58.6s
36:	learn: 0.6146077	total: 2.37s	remaining: 1m 1s
38:	lear

Pipeline(steps=[('count_vec', CountVectorizer()),
                ('pac',
                 <catboost.core.CatBoostClassifier object at 0x7f547d1e7290>)])

### Approach 2

* Use TF-IDF to convert the translated text into weighted term features, then fit the same machine learning classifier on them.
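
As a rough illustration of how TF-IDF differs from raw counts, the sketch below weights each term count by a smoothed inverse document frequency, `idf = ln((1 + n) / (1 + df)) + 1`, and L2-normalizes each row. This is the formula `TfidfVectorizer` documents for its default settings, but the code is a simplified stand-in with made-up names and example sentences, not the library's implementation.

```python
import math
import re
from collections import Counter
from typing import Dict, List

TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tfidf_vectorize(docs: List[str]) -> List[Dict[str, float]]:
    """Return one {token: weight} mapping per document.

    Weight = term count * smoothed idf, followed by L2 normalization.
    """
    tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + k)) + 1.0 for tok, k in df.items()}
    rows: List[Dict[str, float]] = []
    for toks in tokenized:
        weights = {tok: n * idf[tok] for tok, n in Counter(toks).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({tok: w / norm for tok, w in weights.items()} if norm else {})
    return rows


# Hypothetical translated chat lines.
docs = ["feed feed again", "you feed too"]
rows = tfidf_vectorize(docs)
for row in rows:
    print({tok: round(w, 3) for tok, w in row.items()})
```

In the second document, "feed" appears in both documents and so gets a lower weight than "you" or "too", which appear in only one.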

In [None]:
# Adding vectorizers and model to the pipeline

pipe = Pipeline([
    ('tfidf_vec', TfidfVectorizer(stop_words='english')),
    ('pac', CatBoostClassifier(verbose=2))
])

In [None]:
# Fitting the model

pipe.fit(X_train, y_train)

Learning rate set to 0.053805
0:	learn: 0.6851285	total: 154ms	remaining: 2m 34s
2:	learn: 0.6728493	total: 401ms	remaining: 2m 13s
4:	learn: 0.6636953	total: 645ms	remaining: 2m 8s
6:	learn: 0.6563560	total: 892ms	remaining: 2m 6s
8:	learn: 0.6512494	total: 1.15s	remaining: 2m 6s
10:	learn: 0.6447860	total: 1.4s	remaining: 2m 5s
12:	learn: 0.6408191	total: 1.65s	remaining: 2m 5s
14:	learn: 0.6375778	total: 1.9s	remaining: 2m 4s
16:	learn: 0.6348335	total: 2.15s	remaining: 2m 4s
18:	learn: 0.6321941	total: 2.4s	remaining: 2m 4s
20:	learn: 0.6294812	total: 2.65s	remaining: 2m 3s
22:	learn: 0.6273261	total: 2.89s	remaining: 2m 2s
24:	learn: 0.6256978	total: 3.15s	remaining: 2m 2s
26:	learn: 0.6238937	total: 3.4s	remaining: 2m 2s
28:	learn: 0.6220156	total: 3.64s	remaining: 2m 1s
30:	learn: 0.6208215	total: 3.88s	remaining: 2m 1s
32:	learn: 0.6195032	total: 4.14s	remaining: 2m 1s
34:	learn: 0.6183077	total: 4.39s	remaining: 2m 1s
36:	learn: 0.6172713	total: 4.64s	remaining: 2m
38:	learn: 

Pipeline(steps=[('tfidf_vec', TfidfVectorizer(stop_words='english')),
                ('pac',
                 <catboost.core.CatBoostClassifier object at 0x7f547c06c990>)])

In [None]:
# Calculating the score of the TF-IDF pipeline (`pipe` now refers to Approach 2)

print(f"F1 Score of Train: {f1_score(y_train, pipe.predict(X_train))}")
print(f"F1 Score of Valid: {f1_score(y_valid, pipe.predict(X_valid))}")

F1 Score of Train: 0.4901241957204848
F1 Score of Valid: 0.4431593364784659
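
For context on these numbers, it can help to compare them with a trivial baseline. The sketch below computes binary F1 by hand as `2*TP / (2*TP + FP + FN)` and evaluates a model that predicts label 1 for every row, using the label counts printed by `value_counts()` above. The helper name `f1_binary` is made up, and the baseline is computed on the full training distribution as an approximation of the validation split.

```python
from typing import Sequence


def f1_binary(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for the positive class (label 1): 2*TP / (2*TP + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Label counts reported by value_counts() on the training set.
n_label_0, n_label_1 = 37322, 22678
y_true = [0] * n_label_0 + [1] * n_label_1

# Baseline: predict label 1 for every row.
all_positive_f1 = f1_binary(y_true, [1] * len(y_true))
print(f"All-positive baseline F1: {all_positive_f1:.4f}")
```

This baseline comes out at roughly 0.55, which is higher than the validation scores reported here, so the default 0.5 decision threshold is probably worth revisiting before comparing feature sets.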


In [None]:
# Test predictions (when the pipeline is used)

test_pred = pipe.predict(test['en_translated'])

### Approach 3

* Use TensorFlow Hub sentence embeddings of the translated text as features for a classifier.

In [None]:
# Train/valid split of the original Chinese text (same random_state, so rows match the English split; not used below)

X_chinese = train['text']
X_train_ch, X_valid_ch, y_train, y_valid = train_test_split(X_chinese, y, test_size=0.2, random_state=88)

In [None]:
# Get the embeddings

embed = hub.load("https://tfhub.dev/google/nnlm-en-dim128-with-normalization/2")

train_embeddings = embed(X_train)
valid_embeddings = embed(X_valid)
test_embeddings = embed(test['en_translated'])
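
Embedding a whole split in one call works for a small model, but it can run out of memory with larger encoders. A minimal sketch of batching the calls follows; `embed_in_batches`, `iter_batches` and `fake_embed` are hypothetical names, and `embed_fn` is assumed to be any callable that maps a batch of strings to a list of vectors.

```python
from typing import Callable, Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `batch_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[Sequence[str]], List[List[float]]],
    batch_size: int = 1024,
) -> List[List[float]]:
    """Apply `embed_fn` batch by batch and concatenate the rows in order."""
    vectors: List[List[float]] = []
    for batch in iter_batches(texts, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors


# Offline stand-in for the TF Hub model: a 2-d "embedding" per sentence
# (character count and word count), so the sketch runs without TensorFlow.
def fake_embed(batch: Sequence[str]) -> List[List[float]]:
    return [[float(len(s)), float(s.count(" ") + 1)] for s in batch]


vectors = embed_in_batches(["gg wp", "afk", "nice try team"], fake_embed, batch_size=2)
print(vectors)
```

With the real model, `embed_fn` could plausibly be something like `lambda b: embed(tf.constant(list(b))).numpy().tolist()`, applied to `list(X_train)`; treat that wiring as an assumption to verify rather than tested code.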

In [None]:
# Apply the classification part

model = CatBoostClassifier(verbose=2)
model.fit(train_embeddings.numpy(), y_train)

Learning rate set to 0.053805
0:	learn: 0.6883766	total: 97.1ms	remaining: 1m 36s
2:	learn: 0.6797808	total: 238ms	remaining: 1m 18s
4:	learn: 0.6728903	total: 405ms	remaining: 1m 20s
6:	learn: 0.6671913	total: 560ms	remaining: 1m 19s
8:	learn: 0.6623465	total: 707ms	remaining: 1m 17s
10:	learn: 0.6580477	total: 853ms	remaining: 1m 16s
12:	learn: 0.6545860	total: 981ms	remaining: 1m 14s
14:	learn: 0.6513731	total: 1.13s	remaining: 1m 13s
16:	learn: 0.6487816	total: 1.26s	remaining: 1m 12s
18:	learn: 0.6465716	total: 1.4s	remaining: 1m 12s
20:	learn: 0.6444539	total: 1.54s	remaining: 1m 11s
22:	learn: 0.6426027	total: 1.68s	remaining: 1m 11s
24:	learn: 0.6410458	total: 1.82s	remaining: 1m 11s
26:	learn: 0.6396086	total: 1.96s	remaining: 1m 10s
28:	learn: 0.6381567	total: 2.1s	remaining: 1m 10s
30:	learn: 0.6366694	total: 2.24s	remaining: 1m 10s
32:	learn: 0.6354118	total: 2.37s	remaining: 1m 9s
34:	learn: 0.6342199	total: 2.51s	remaining: 1m 9s
36:	learn: 0.6330833	total: 2.66s	remainin

<catboost.core.CatBoostClassifier at 0x7fe585ff80d0>

In [None]:
# Calculating the score

print(f"F1 Score of Train: {f1_score(y_train, model.predict(train_embeddings.numpy()))}")
print(f"F1 Score of Valid: {f1_score(y_valid, model.predict(valid_embeddings.numpy()))}")

F1 Score of Train: 0.6707428301185691
F1 Score of Valid: 0.4661957618567104


In [None]:
# Test Predictions

test_pred = model.predict(test_embeddings.numpy())

In [None]:
# Load sample submission

submission = pd.read_csv('sample_submission.csv')
submission['label'] = test_pred
submission.to_csv('output.csv', index=False)

In [None]:
# Submitting to kaggle

!kaggle competitions submit -c gamers-negative-chat-recognition -f output.csv -m "TFHUB-nnlm-eng128 norm with catboost"

100% 155k/155k [00:00<00:00, 763kB/s]
Successfully submitted to gamer's negative chat recognition(消极游戏聊天内容检测)

The submission above scored an F1 of 0.14079 on Kaggle. It used the NNLM English 128-dimensional embeddings (with normalization) from TensorFlow Hub on the translated text, with a CatBoostClassifier on top.