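### Language Detection

This section builds a simple language-identification classifier with multinomial naive Bayes.

Data description:

- The data is in data.csv in the current directory and covers six languages: English, French, German, Spanish, Italian, and Dutch.

A sample row (the text, then a comma and a two-letter language code):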

    1 december wereld aids dag voorlichting in zuidafrika over bieten taboes en optimisme,nl

**Outline:**

- Data preparation
- Train/test split
- Noise removal
- Feature extraction (CountVectorizer)
- Model construction and training
- Model evaluation


### 1. Data preparation

In [1]:
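# Read one "text,label" row per line.
# Assumption: data.csv is UTF-8 encoded; the platform default encoding can garble accented characters.
with open('data.csv', encoding='utf-8') as in_f:
    lines = in_f.readlines()
# The last two characters are the language code; the comma before them is dropped.
dataset = [(line.strip()[:-3], line.strip()[-2:]) for line in lines]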

In [2]:
dataset[0:5]

[('1 december wereld aids dag voorlichting in zuidafrika over bieten taboes en optimisme',
  'nl'),
 ('1 millón de afectados ante las inundaciones en sri lanka unicef está distribuyendo ayuda de emergencia srilanka',
  'es'),
 ('1 millón de fans en facebook antes del 14 de febrero y paty miki dani y berta se tiran en paracaídas qué harías tú porunmillondefans',
  'es'),
 ('1 satellite galileo sottoposto ai test presso lesaestec nl galileo navigation space in inglese',
  'it'),
 ('10 der welt sind bei', 'de')]
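
The slicing in the cell above relies on every row ending with a comma and a two-letter code. A slightly more defensive variant splits on the last comma instead, so commas inside the text are left alone and rows without a label are reported. This is only a sketch: `parse_row` is a hypothetical helper name, not part of the original notebook.

```python
def parse_row(line):
    """Split one 'text,label' row on its LAST comma (hypothetical helper)."""
    text, sep, label = line.strip().rpartition(",")
    if not sep or not label:
        raise ValueError(f"row has no label: {line!r}")
    return text, label


row = "1 december wereld aids dag voorlichting in zuidafrika over bieten taboes en optimisme,nl\n"
text, label = parse_row(row)
print(label)  # nl
```

With this helper, the list comprehension would become `dataset = [parse_row(line) for line in lines]`.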

### 2. Train/test split

In [3]:
from sklearn.model_selection import train_test_split
x, y = list(zip(*dataset))
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1)

In [4]:
len(x_train)

6799
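
Since no `test_size` is passed, `train_test_split` holds out 25% of the rows by default. After a random split it is worth checking that each language keeps a similar share in both parts. The sketch below uses only the standard library; `label_shares` is a hypothetical helper and the labels are made-up toy values, not the real dataset.

```python
from collections import Counter


def label_shares(labels):
    """Return each label's share of the list as a fraction between 0 and 1."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in sorted(counts.items())}


# Toy labels standing in for y_train and y_test.
toy_train = ["en", "en", "nl", "nl", "es", "es", "es", "es"]
toy_test = ["en", "nl", "es", "es"]
print(label_shares(toy_train))
print(label_shares(toy_test))
```

In the notebook this would be called as `label_shares(y_train)` and `label_shares(y_test)`. If the shares differ noticeably, passing `stratify=y` to `train_test_split` keeps the class proportions equal in both parts.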

### 3. Noise removal

In [5]:
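import re

# Compiled once; raw strings avoid invalid-escape warnings for \S and \w.
# Matches URLs, @mentions, and #hashtags.
noise_pattern = re.compile("|".join([r"http\S+", r"@\w+", r"#\w+"]))

def remove_noise(document):
    """Strip URLs, @mentions, and #hashtags, then trim surrounding whitespace."""
    clean_text = noise_pattern.sub("", document)
    return clean_text.strip()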

remove_noise("Trump images are now more popular than cat gifs. @trump #trends http://www.trumptrends.html")

'Trump images are now more popular than cat gifs.'
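
To see what each alternative in the pattern matches, and one limitation worth knowing about, here is a small standalone check. The sample strings are made up for illustration.

```python
import re

# Same three alternatives as remove_noise: URLs, @mentions, #hashtags.
noise_pattern = re.compile(r"http\S+|@\w+|#\w+")

tweet = "check this @user #tag https://example.com/page now"
print(noise_pattern.findall(tweet))
# ['@user', '#tag', 'https://example.com/page']

# Removing matches leaves runs of spaces behind; split/join collapses them.
cleaned = " ".join(noise_pattern.sub("", tweet).split())
print(cleaned)  # check this now

# Limitation: '@\w+' also fires inside e-mail addresses.
email_result = noise_pattern.sub("", "mail me at bob@example.com")
print(email_result)  # mail me at bob.com
```

For tweet-like text this is usually acceptable, but the pattern would need tightening for data that contains e-mail addresses.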

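### 4. Feature extraction

With the noise removed, extract features from the text: counts of 1-grams and 2-grams. Because the vectorizer uses analyzer='char_wb', these are character n-grams taken within word boundaries.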

In [6]:
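from sklearn.feature_extraction.text import CountVectorizer  # text feature extraction
# Build the CountVectorizer; see the scikit-learn API reference for the parameters.
# Note: a custom preprocessor replaces the built-in preprocessing step, so
# lowercase=True has no effect here unless remove_noise lowercases the text itself.
vec = CountVectorizer(
    lowercase=True,
    analyzer='char_wb',
    ngram_range=(1, 2),
    max_features=1000,
    preprocessor=remove_noise
)
# Fit the vectorizer (learn the n-gram vocabulary) on the training text
vec.fit(x_train)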

CountVectorizer(analyzer='char_wb', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=1000, min_df=1,
        ngram_range=(1, 2),
        preprocessor=<function remove_noise at 0x0000013583E78D90>,
        stop_words=None, strip_accents=None,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, vocabulary=None)
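
To make the `char_wb` setting more concrete, the sketch below approximates how character n-grams are taken inside word boundaries: each whitespace-separated word is padded with one space on each side, and n-grams never span two words. It is a simplified re-implementation for illustration only, not scikit-learn's own code, and `char_wb_ngrams` is a hypothetical name.

```python
from collections import Counter


def char_wb_ngrams(text, min_n=1, max_n=2):
    """Character n-grams inside each word, with one padding space per side."""
    ngrams = []
    for word in text.split():
        padded = f" {word} "
        for n in range(min_n, max_n + 1):
            # A word shorter than n still contributes one (short) n-gram.
            for start in range(max(len(padded) - n + 1, 1)):
                ngrams.append(padded[start:start + n])
    return ngrams


print(char_wb_ngrams("de"))
# [' ', 'd', 'e', ' ', ' d', 'de', 'e ']

# Counting the n-grams gives the kind of count features the classifier sees.
print(Counter(char_wb_ngrams("der welt")))
```

With `max_features=1000`, the vectorizer keeps only the 1000 n-grams that occur most often across the training corpus.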

### 5. Model construction and training

In [7]:
from sklearn.naive_bayes import MultinomialNB  # multinomial naive Bayes model

classifier = MultinomialNB()
classifier.fit(vec.transform(x_train), y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
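
The idea behind the model can be sketched in plain Python: each class gets a log prior from its share of the training documents, each token gets a smoothed log likelihood per class (add-alpha smoothing, matching the `alpha=1.0` and `fit_prior=True` shown in the output above), and prediction picks the class with the highest total score. This is a toy illustration of the technique, not scikit-learn's implementation; `train_nb`, `predict_nb`, and the token lists are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Fit a tiny multinomial naive Bayes model on lists of tokens."""
    vocab = sorted({tok for doc in docs for tok in doc})
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)
    model = {}
    for label, n_docs in Counter(labels).items():
        total = sum(token_counts[label].values())
        log_prior = math.log(n_docs / len(labels))
        # Add-alpha smoothing keeps unseen tokens from getting zero probability.
        log_likelihood = {
            tok: math.log((token_counts[label][tok] + alpha) / (total + alpha * len(vocab)))
            for tok in vocab
        }
        model[label] = (log_prior, log_likelihood)
    return model


def predict_nb(model, doc):
    """Return the label with the highest log prior plus summed log likelihoods."""
    def score(label):
        log_prior, log_likelihood = model[label]
        # Tokens outside the training vocabulary are ignored.
        return log_prior + sum(log_likelihood[tok] for tok in doc if tok in log_likelihood)
    return max(model, key=score)


# Toy "documents" made of character bigrams.
docs = [["th", "he"], ["th", "at"], ["ij", "ee"], ["ij", "en"]]
labels = ["en", "en", "nl", "nl"]
model = train_nb(docs, labels)
print(predict_nb(model, ["th", "ij", "th"]))  # en
print(predict_nb(model, ["ij"]))              # nl
```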

### 6. Model evaluation

In [9]:
classifier.score(vec.transform(x_test), y_test)

0.9770621967357741

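With only a small dataset (6,799 training sentences), the classifier reaches about 97.7% accuracy on the held-out test set, which is fairly high. More training data would likely improve it further.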
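
A single accuracy figure can hide languages that are confused with each other more often than the rest. As a rough sketch, per-language accuracy can be computed from the true and predicted labels with the standard library alone. `per_label_accuracy` is a hypothetical helper, and the labels below are toy values rather than the notebook's actual predictions.

```python
from collections import Counter


def per_label_accuracy(y_true, y_pred):
    """Fraction of correctly predicted samples for each true label."""
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / totals[label] for label in sorted(totals)}


toy_true = ["en", "en", "nl", "nl", "de", "de", "de", "de"]
toy_pred = ["en", "en", "nl", "de", "de", "de", "de", "nl"]
print(per_label_accuracy(toy_true, toy_pred))
# {'de': 0.75, 'en': 1.0, 'nl': 0.5}
```

In the notebook, the predictions would come from `classifier.predict(vec.transform(x_test))`. scikit-learn's `sklearn.metrics.classification_report` reports comparable per-class figures.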

### Consolidated code

In [10]:
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB


class LanguageDetector:
    """Character n-gram CountVectorizer followed by a naive Bayes classifier."""

    # Matches URLs, @mentions, and #hashtags.
    NOISE_PATTERN = re.compile(r"http\S+|@\w+|#\w+")

    def __init__(self, classifier=None):
        # Create the default classifier per instance instead of sharing one default object.
        self.classifier = classifier if classifier is not None else MultinomialNB()
        # Same vectorizer settings as the step-by-step version (character n-grams).
        self.vectorizer = CountVectorizer(
            analyzer='char_wb',
            ngram_range=(1, 2),
            max_features=1000,
            preprocessor=self._remove_noise,
        )

    def _remove_noise(self, document):
        return self.NOISE_PATTERN.sub("", document).strip()

    def features(self, X):
        return self.vectorizer.transform(X)

    def fit(self, X, y):
        self.vectorizer.fit(X)
        self.classifier.fit(self.features(X), y)

    def predict(self, x):
        # x is a single string; the result is a one-element array of labels.
        return self.classifier.predict(self.features([x]))

    def score(self, X, y):
        return self.classifier.score(self.features(X), y)

In [11]:
# Assumption: data.csv is UTF-8 encoded.
with open('data.csv', encoding='utf-8') as in_f:
    lines = in_f.readlines()
dataset = [(line.strip()[:-3], line.strip()[-2:]) for line in lines]
x, y = zip(*dataset)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1)

language_detector = LanguageDetector()
language_detector.fit(x_train, y_train)
print(language_detector.predict('This is an English sentence'))
print(language_detector.score(x_test, y_test))

['en']
0.9770621967357741
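
An alternative to a hand-written wrapper class is scikit-learn's own `Pipeline`, which chains the vectorizer and the classifier and exposes `fit`, `predict`, and `score` directly. The following is a minimal sketch under that assumption; `build_detector` is a hypothetical name, and the four training sentences are toy data used only so the example runs without data.csv.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

NOISE_PATTERN = re.compile(r"http\S+|@\w+|#\w+")


def remove_noise(document):
    """Strip URLs, @mentions, and #hashtags, then trim whitespace."""
    return NOISE_PATTERN.sub("", document).strip()


def build_detector():
    """Chain the character n-gram vectorizer and the naive Bayes classifier."""
    return make_pipeline(
        CountVectorizer(
            analyzer="char_wb",
            ngram_range=(1, 2),
            max_features=1000,
            preprocessor=remove_noise,
        ),
        MultinomialNB(),
    )


# Toy training data, far too small for a meaningful model.
toy_texts = [
    "this is a short english sentence",
    "where is the train station",
    "dit is een korte nederlandse zin",
    "waar is het station",
]
toy_labels = ["en", "en", "nl", "nl"]

detector = build_detector()
detector.fit(toy_texts, toy_labels)
prediction = detector.predict(["is this the station"])
print(prediction)
```

With the real data, `detector.fit(x_train, y_train)` and `detector.score(x_test, y_test)` would take the place of the toy calls. Unlike `LanguageDetector.predict`, the pipeline's `predict` expects a list of strings.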
