# Gamer's negative chat recognition

> "Use text preprocessing in detecting the gamers negative chat recognition"

- toc: true
- branch: master
- badges: true
- comments: true
- categories: [text, chinese, classification, binary, count, vectorizer, kaggle, colab, negative chat, gamer, translation]
- hide: false

In [None]:
# Installing the modules

!pip3 install catboost
!pip3 install googletrans==3.1.0a0
!pip3 install tensorflow_text==2.4.1

In [None]:
# Required modules

import tqdm
import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text

from xgboost import XGBClassifier
from catboost import CatBoostClassifier

from zipfile import ZipFile
from googletrans import Translator
from matplotlib import pyplot as plt

from sklearn.cluster import KMeans
from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [None]:
# Config

tqdm.tqdm.pandas()
%matplotlib inline
plt.rcParams['figure.figsize'] = (12, 12)

In [None]:
# Create kaggle folder

!mkdir -p ~/.kaggle
!cp ./kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

In [None]:
# Download the competition data with the Kaggle API

!kaggle competitions download -c gamers-negative-chat-recognition

Downloading gamers-negative-chat-recognition.zip to /content
  0% 0.00/1.18M [00:00<?, ?B/s]
100% 1.18M/1.18M [00:00<00:00, 116MB/s]


In [None]:
# Extracting the data

with ZipFile('/content/gamers-negative-chat-recognition.zip', 'r') as zf:
    zf.extractall('./')

In [None]:
# Load the train data

train = pd.read_csv('./train.csv')
train.head()

Unnamed: 0,qid,text,label
0,100001,我去送了个人头，结果啥也没那到。,1
1,100002,我送人头给你们发育发育,1
2,100003,我送你爷爷们多好,1
3,100004,我送你一个黄金分割率。,1
4,100005,我现在非常想送人头。,1


In [None]:
# Inspecting the train

train.info()
train.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60000 entries, 0 to 59999
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   qid     60000 non-null  int64 
 1   text    60000 non-null  object
 2   label   60000 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 1.4+ MB


Unnamed: 0,qid,label
count,60000.0,60000.0
mean,130000.5,0.377967
std,17320.652413,0.484883
min,100001.0,0.0
25%,115000.75,0.0
50%,130000.5,0.0
75%,145000.25,1.0
max,160000.0,1.0


In [None]:
# Load the test data

test = pd.read_csv('./test.csv')
test.head()

Unnamed: 0,qid,text
0,160001,我这局送的人头有没有上局多
1,160002,没事，我送的和你差不多
2,160003,我送看你还咋赢
3,160004,送成狗，我野区都是人家的，玩你马
4,160005,我他喵的不挂机就不错。


In [None]:
# Label distribution in the training data

print(train['label'].value_counts(normalize=True))
train['label'].value_counts()

0    0.622033
1    0.377967
Name: label, dtype: float64


0    37322
1    22678
Name: label, dtype: int64

## Translation of the Chinese negative chat

I used the Google Translate API (through the `googletrans` package) to translate the Chinese chat messages into English.

In [None]:
# Translate the Chinese text into English

translator = Translator(service_urls=['translate.googleapis.com'])

train['en_translated'] = train['text'].progress_apply(lambda x: translator.translate(x, src='zh-cn', dest='en').text)
test['en_translated'] = test['text'].progress_apply(lambda x: translator.translate(x, src='zh-cn', dest='en').text)
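
Calling the translation service once per row can be slow, and chat data may repeat the same message many times. A minimal sketch of one way to reduce the number of requests, assuming duplicates exist: translate each distinct string once and reuse the result. `translate_unique` and `fake_translate` are hypothetical names used only for illustration; in this notebook the callable would wrap `translator.translate(...).text`.

```python
from typing import Callable, Dict, Iterable, List


def translate_unique(texts: Iterable[str], translate_fn: Callable[[str], str]) -> List[str]:
    """Translate each distinct string once and reuse the cached result."""
    cache: Dict[str, str] = {}
    translated = []
    for text in texts:
        if text not in cache:
            cache[text] = translate_fn(text)  # only reached for unseen strings
        translated.append(cache[text])
    return translated


# Hypothetical stand-in for the real translator so the sketch runs offline.
calls = []


def fake_translate(text: str) -> str:
    calls.append(text)
    return text.upper()


result = translate_unique(["gg", "afk", "gg"], fake_translate)
print(result)      # one output per input row
print(len(calls))  # number of times the translator was actually called
```

With a pandas column, the same helper could be applied as `translate_unique(train['text'], my_translate)`, where `my_translate` is whatever wrapper you write around the translator.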

In [None]:
# Separate the features and labels

X = train['en_translated']
y = train['label']

In [None]:
# Train Test Split

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=88)

## Model Building

### Approach 1

* Use a count vectorizer (`CountVectorizer`) to convert the text into numeric features, then apply a machine learning algorithm.
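
To make the idea concrete, here is a hand-rolled sketch of what a count vectorizer produces: one column per vocabulary word, one row per document, and each cell holding a word count. It is an illustration only, not scikit-learn's implementation (`CountVectorizer` applies its own token pattern and returns a sparse matrix); `bag_of_words` and the sample sentences are made up.

```python
from collections import Counter
from typing import List, Tuple


def bag_of_words(docs: List[str]) -> Tuple[List[str], List[List[int]]]:
    """Return a sorted vocabulary and a document-term count matrix."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)  # missing words count as 0
        matrix.append([counts[word] for word in vocab])
    return vocab, matrix


docs = ["feed the enemy", "do not feed", "feed feed feed"]
vocab, matrix = bag_of_words(docs)
print(vocab)
for row in matrix:
    print(row)
```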

In [None]:
# Adding vectorizers and model to the pipeline

pipe = Pipeline([
    ('count_vec', CountVectorizer()),
    ('pac', CatBoostClassifier(verbose=2))
])

In [None]:
# Fitting the model

pipe.fit(X_train, y_train)

### Approach 2

* Use TF-IDF (`TfidfVectorizer`) to convert the text into numeric features, then apply a machine learning algorithm.
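
TF-IDF reweights raw counts so that words appearing in many documents contribute less. The sketch below assumes the smoothed formula that scikit-learn documents for the `TfidfVectorizer` defaults, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and leaves out the L2 row normalization for brevity. `tfidf_weights` is a hypothetical helper and the sentences are made up.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_weights(docs: List[str]) -> List[Dict[str, float]]:
    """Unnormalized TF-IDF per document using the smoothed idf formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    return [
        {t: count * idf[t] for t, count in Counter(toks).items()}
        for toks in tokenized
    ]


weights = tfidf_weights(["feed the enemy", "do not feed", "feed again"])
# "feed" occurs in every document, so it receives the lowest idf.
print(weights[0])
```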

In [None]:
# Adding vectorizers and model to the pipeline

pipe = Pipeline([
    ('tfidf_vec', TfidfVectorizer()),
    ('pac', CatBoostClassifier(verbose=2))
])

In [None]:
# Fitting the model

pipe.fit(X_train, y_train)

In [None]:
# Calculating the score (when the pipeline is used)

print(f"F1 Score of Train: {f1_score(y_train, pipe.predict(X_train))}")
print(f"F1 Score of Valid: {f1_score(y_valid, pipe.predict(X_valid))}")

F1 Score of Train: 0.4901241957204848
F1 Score of Valid: 0.4431593364784659
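
As a reference point, F1 is the harmonic mean of precision and recall, which can be written as F1 = 2TP / (2TP + FP + FN). The sketch below computes it from confusion counts for a hypothetical classifier that assigns label 1 to every row of the full training set, using the class counts from the `value_counts()` output earlier in this notebook. `f1_from_counts` is an illustrative helper, not a scikit-learn function.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when the denominator is zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Always predicting label 1 on the full training set:
# every positive row is a true positive, every negative row a false positive.
n_pos, n_neg = 22678, 37322
baseline = f1_from_counts(tp=n_pos, fp=n_neg, fn=0)
print(f"All-positive baseline F1: {baseline:.4f}")
```

On the full training set this baseline comes out at roughly 0.55, so a validation F1 below that level suggests the model is not yet separating the classes well. The validation split may have a slightly different class balance, so treat the number as approximate.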


In [None]:
# Test predictions (when the pipeline is used)

test_pred = pipe.predict(test['en_translated'])

### Approach 3

* Use TensorFlow Hub sentence embeddings as features, then train a classifier on them.

In [None]:
# Chinese text train and valid split

X_chinese = train['text']
X_train_ch, X_valid_ch, y_train, y_valid = train_test_split(X_chinese, y, test_size=0.2, random_state=88)

In [None]:
# Get the embeddings

embed = hub.load("https://tfhub.dev/google/nnlm-en-dim128-with-normalization/2")

train_embeddings = embed(X_train)
valid_embeddings = embed(X_valid)
test_embeddings = embed(test['en_translated'])
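
Embedding tens of thousands of sentences in a single call may use a lot of memory. A minimal sketch of one way around that, assuming the embedding function accepts a list of strings and returns one vector per string: feed the texts in fixed-size batches and collect the results. `batched`, `embed_in_batches`, and `fake_embed` are hypothetical names; `fake_embed` only stands in for the real model so the example runs without TensorFlow.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[Sequence[str]], List[List[float]]],
    batch_size: int = 1024,
) -> List[List[float]]:
    """Embed texts batch by batch and return one vector per input text."""
    vectors: List[List[float]] = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors


def fake_embed(batch: Sequence[str]) -> List[List[float]]:
    # Hypothetical stand-in: two toy features (length, number of spaces).
    return [[float(len(t)), float(t.count(" "))] for t in batch]


vectors = embed_in_batches(["gg wp", "afk", "report mid please"], fake_embed, batch_size=2)
print(len(vectors))
```

In this notebook, `embed_fn` might be a small wrapper that passes each batch to `embed` and converts the result with `.numpy()`; that wrapper is not shown here.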

In [None]:
# Apply the classification part

model = CatBoostClassifier(verbose=2)
model.fit(train_embeddings.numpy(), y_train)

In [None]:
# Calculating the score

print(f"F1 Score of Train: {f1_score(y_train, model.predict(train_embeddings.numpy()))}")
print(f"F1 Score of Valid: {f1_score(y_valid, model.predict(valid_embeddings.numpy()))}")

F1 Score of Train: 0.6707428301185691
F1 Score of Valid: 0.4661957618567104


In [None]:
# Test Predictions

test_pred = model.predict(test_embeddings.numpy())

In [None]:
# Load sample submission

submission = pd.read_csv('sample_submission.csv')
submission['label'] = test_pred
submission.to_csv('output.csv', index=False)

In [None]:
# Submitting to kaggle

!kaggle competitions submit -c gamers-negative-chat-recognition -f output.csv -m "TFHUB-nnlm-eng128 norm with catboost"

100% 155k/155k [00:00<00:00, 763kB/s]
Successfully submitted to gamer's negative chat recognition(消极游戏聊天内容检测)

The F1 scores on the train and validation partitions are of little use here, because the text in `train.csv` is very different from the text in `test.csv`.

The approach that gave the leaderboard result (F1 score: 0.14079) was to take embeddings from the Universal Sentence Encoder and then use a `CatBoostClassifier` to produce the predictions.
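
One rough way to quantify how different the two files are is to measure how much of the test vocabulary never appears in the training data. Below is a minimal sketch at the character level, a simplifying assumption that avoids Chinese word segmentation. `char_vocab`, `unseen_ratio`, and the toy strings are illustrative only.

```python
from typing import Iterable, Set


def char_vocab(texts: Iterable[str]) -> Set[str]:
    """Collect the distinct non-whitespace characters across all texts."""
    return {ch for text in texts for ch in text if not ch.isspace()}


def unseen_ratio(train_texts: Iterable[str], test_texts: Iterable[str]) -> float:
    """Share of distinct test characters that never occur in the training texts."""
    train_chars = char_vocab(train_texts)
    test_chars = char_vocab(test_texts)
    if not test_chars:
        return 0.0
    return len(test_chars - train_chars) / len(test_chars)


# Toy samples; in the notebook these would be train['text'] and test['text'].
train_sample = ["我送人头", "我现在想送"]
test_sample = ["没事，我送的"]
ratio = unseen_ratio(train_sample, test_sample)
print(f"Unseen test characters: {ratio:.2%}")
```

A high ratio on the real data would support the observation that validation scores computed on `train.csv` say little about performance on `test.csv`.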