# **Classifying writers by their text.**
### *Solving the problem of attributing a literary text to its author, using the Natural Language Toolkit (NLTK) and vectorized data pipelines.*
![](https://www.gov.spb.ru/static/writable/cache/b4/3b/b43b747007089117d13a5518d40255eb.jpg)

## 1. Loading the dataframe and preprocessing the data

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('train_data.csv')
test_df = pd.read_csv("test_no_answers.csv")

#### The dataset is in Russian, so don't be alarmed by the samples below.

In [3]:
df.head()

| | id | text | author |
|---|---|---|---|
| 0 | 0 | -Бабушка!- вскричала малютка.- Возьми меня с с... | Dostoevsky |
| 1 | 1 | Знал ли Скрудж об этом? Разумеется, знал. Да и... | Dostoevsky |
| 2 | 2 | -С праздником, дядя, с радостью! Дай вам Бог в... | Dostoevsky |
| 3 | 3 | Мы высказали только главную передовую мысль на... | Dostoevsky |
| 4 | 4 | I. Отдел литературный. Повести, романы, расска... | Dostoevsky |


#### Clean the data by first removing punctuation marks and then stopwords. It is worth trying different cleaning variants here, because this step has the largest impact on accuracy.

In [4]:
import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords
from string import punctuation
russian_stopwords = stopwords.words("russian")

def remove_punct(text):
    """Replace every ASCII punctuation character with a space."""
    # string.punctuation covers exactly the code points 33-47, 58-64, 91-96 and 123-126
    table = str.maketrans(punctuation, ' ' * len(punctuation))
    return text.translate(table)

def cleaning(dataframe):
    """Add a 'Post_clean' column: lowercased text without punctuation and stopwords."""
    def clean(text):
        # split() with no argument splits on any whitespace and drops empty tokens
        tokens = remove_punct(text.lower()).split()
        return ' '.join(token for token in tokens if token not in russian_stopwords)
    dataframe['Post_clean'] = dataframe['text'].map(clean)

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
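
To make the effect of these cleaning steps concrete, here is a minimal sketch that applies them to a single string. The tiny `DEMO_STOPWORDS` set and the `clean_text` helper are hypothetical and exist only for this illustration; the notebook itself uses NLTK's full Russian stopword list.

```python
from string import punctuation

# Hypothetical mini stopword list for illustration only;
# the notebook uses NLTK's full Russian list instead.
DEMO_STOPWORDS = {"и", "в", "не", "на", "с", "меня"}

# Map every ASCII punctuation character to a space.
PUNCT_TABLE = str.maketrans(punctuation, " " * len(punctuation))

def clean_text(text, stop_words):
    """Lowercase, strip punctuation, and drop stopwords from one string."""
    tokens = text.lower().translate(PUNCT_TABLE).split()
    return " ".join(t for t in tokens if t not in stop_words)

sample = "-Бабушка!- вскричала малютка.- Возьми меня с собой!"
cleaned = clean_text(sample, DEMO_STOPWORDS)
print(cleaned)
```

Only ASCII punctuation is handled in this sketch, so characters such as «, » or … would survive and may deserve separate treatment.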


#### Let's see what we got after cleaning

In [5]:
cleaning(df)
cleaning(test_df)
df.head()

| | id | text | author | Post_clean |
|---|---|---|---|---|
| 0 | 0 | -Бабушка!- вскричала малютка.- Возьми меня с с... | Dostoevsky | бабушка вскричала малютка возьми собой знаю уй... |
| 1 | 1 | Знал ли Скрудж об этом? Разумеется, знал. Да и... | Dostoevsky | знал скрудж разумеется знал могло иначе скрудж... |
| 2 | 2 | -С праздником, дядя, с радостью! Дай вам Бог в... | Dostoevsky | праздником дядя радостью дай бог благ земных р... |
| 3 | 3 | Мы высказали только главную передовую мысль на... | Dostoevsky | высказали главную передовую мысль нашего журна... |
| 4 | 4 | I. Отдел литературный. Повести, романы, расска... | Dostoevsky | i отдел литературный повести романы рассказы м... |


## 2. Learning models and building a data pipeline

In [6]:
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(df['Post_clean'], df['author'], test_size=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
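
The two consecutive splits leave roughly 72% of the rows for training, 18% for testing and 10% for validation. The sketch below reproduces that arithmetic, relying on the fact that `train_test_split` rounds a fractional `test_size` up. The total of 1734 rows is an assumption inferred from the support totals in the reports (312 and 174) and from the first test id; `split_sizes` is a hypothetical helper.

```python
import math

def split_sizes(n_samples, test_size):
    """Sizes produced for a float test_size: the held-out part is
    rounded up, and the remainder is kept for training."""
    n_held_out = math.ceil(test_size * n_samples)
    return n_samples - n_held_out, n_held_out

n_total = 1734  # assumption: inferred from the reports, not stated in the notebook
n_rest, n_valid = split_sizes(n_total, 0.1)   # first split: 10% for validation
n_train, n_test = split_sizes(n_rest, 0.2)    # second split: 20% of the rest for testing
print(n_train, n_test, n_valid)
```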

In [7]:
from sklearn.pipeline import Pipeline
# Pipeline combines a transformer and a model into one estimator, which simplifies the code and makes it easier to read
from sklearn.feature_extraction.text import TfidfVectorizer
# TfidfVectorizer converts texts into numeric vectors in which each component reflects how important
# a vocabulary word is in the given text (the vocabulary size determines the dimensionality of the vector)
from sklearn.linear_model import SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn.model_selection import GridSearchCV
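
To show what the TF-IDF weights look like, here is a minimal pure-Python sketch. It assumes the smoothed formula that `TfidfVectorizer` uses by default, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row. Unlike the real vectorizer, it tokenises on whitespace only, and the three-document corpus is made up for the example.

```python
import math
from collections import Counter

def tfidf(docs):
    """Smoothed, L2-normalised TF-IDF for whitespace-tokenised documents."""
    tokenised = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(tok for toks in tokenised for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenised:
        counts = Counter(toks)
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocab, rows

docs = ["город ночь город", "ночь дождь", "дождь море"]  # toy corpus
vocab, matrix = tfidf(docs)
for term, weight in zip(vocab, matrix[0]):
    print(f"{term}: {weight:.3f}")
```

In the first document, "город" gets the largest weight because it occurs twice there and in no other document, while words absent from the document get zero.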

#### We build the pipelines: first we vectorize the data, then we apply a model. We try several different models so that we can compare them at the end.

In [8]:
sgd_ppl_clf = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('sgd_clf', SGDClassifier(random_state=42))])
knb_ppl_clf = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('knb_clf', KNeighborsClassifier(n_neighbors=10))])
sgd_ppl_clf.fit(X_train, y_train)
knb_ppl_clf.fit(X_train, y_train)

Pipeline(steps=[('tfidf', TfidfVectorizer()),
                ('knb_clf', KNeighborsClassifier(n_neighbors=10))])

#### We compute metrics both for the overall result and for each author

In [9]:
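predicted_sgd = sgd_ppl_clf.predict(X_test)
# Note: classification_report expects (y_true, y_pred). The predictions are passed first here,
# so in these reports the precision and recall columns are swapped relative to the usual reading,
# and support counts predicted labels rather than true ones.
print(metrics.classification_report(predicted_sgd, y_test))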

              precision    recall  f1-score   support

      Akunin       0.98      0.86      0.91        92
    Bulychev       0.98      0.96      0.97        55
      Chehov       0.79      0.86      0.83        36
  Dostoevsky       0.27      0.30      0.29        10
       Gogol       0.40      0.60      0.48        10
        King       0.96      0.95      0.95        55
   Pratchett       0.94      0.98      0.96        48
      Remark       0.75      1.00      0.86         6

    accuracy                           0.89       312
   macro avg       0.76      0.81      0.78       312
weighted avg       0.90      0.89      0.89       312
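
Since `classification_report` takes the true labels first and the predictions second, the argument order matters for how the columns are read. The sketch below uses made-up labels and a hypothetical `precision_recall` helper to show that swapping the two arguments swaps precision and recall for a class.

```python
def precision_recall(y_true, y_pred, label):
    """Precision and recall for one class from two parallel label lists."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)
    actual = sum(t == label for t in y_true)
    return tp / predicted, tp / actual

# Hypothetical labels, not taken from the dataset.
y_true = ["Gogol", "Gogol", "Gogol", "Gogol", "King", "King"]
y_pred = ["Gogol", "Gogol", "King", "King", "King", "King"]

conventional = precision_recall(y_true, y_pred, "Gogol")  # (true, predicted) order
swapped = precision_recall(y_pred, y_true, "Gogol")       # arguments reversed
print(conventional, swapped)
```

Accuracy is unaffected by the order, but the per-class support then counts predicted labels, which explains why the support for an author can differ between two models evaluated on the same split.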



In [10]:
predicted_knb = knb_ppl_clf.predict(X_test)
print(metrics.classification_report(predicted_knb, y_test))

              precision    recall  f1-score   support

      Akunin       0.91      0.84      0.88        88
    Bulychev       0.89      0.94      0.91        51
      Chehov       0.67      0.81      0.73        32
  Dostoevsky       0.45      0.42      0.43        12
       Gogol       0.60      0.69      0.64        13
        King       0.83      0.82      0.83        55
   Pratchett       0.92      0.87      0.89        53
      Remark       0.88      0.88      0.88         8

    accuracy                           0.83       312
   macro avg       0.77      0.78      0.77       312
weighted avg       0.84      0.83      0.83       312



In [11]:
sgd_ppl_clf = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2))),
    ('sgd_clf', SGDClassifier(penalty='elasticnet', class_weight='balanced', random_state=42))])
sgd_ppl_clf.fit(X_train, y_train)
predicted_sgd = sgd_ppl_clf.predict(X_test)
print(metrics.classification_report(predicted_sgd, y_test))

              precision    recall  f1-score   support

      Akunin       0.98      0.92      0.95        86
    Bulychev       0.93      0.98      0.95        51
      Chehov       0.79      0.89      0.84        35
  Dostoevsky       0.27      0.30      0.29        10
       Gogol       0.53      0.57      0.55        14
        King       0.98      0.85      0.91        62
   Pratchett       0.90      0.96      0.93        47
      Remark       0.88      1.00      0.93         7

    accuracy                           0.88       312
   macro avg       0.78      0.81      0.79       312
weighted avg       0.89      0.88      0.89       312
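
The `class_weight='balanced'` setting used in this cell reweights classes inversely to their frequency, using `n_samples / (n_classes * class_count)`. A minimal sketch of that formula follows; the label counts are hypothetical, and `balanced_class_weights` is an illustrative helper rather than a scikit-learn function.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}

# Hypothetical label distribution, not the real training counts.
labels = ["Akunin"] * 6 + ["King"] * 3 + ["Remark"] * 1
weights = balanced_class_weights(labels)
for author, weight in weights.items():
    print(f"{author}: {weight:.2f}")
```

Rare authors get larger weights, so their misclassifications cost more during training.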



In [12]:
predicted_sgd_val = sgd_ppl_clf.predict(X_valid)
print(metrics.classification_report(predicted_sgd_val, y_valid))

              precision    recall  f1-score   support

      Akunin       0.98      0.85      0.91        61
    Bulychev       0.89      1.00      0.94        25
      Chehov       0.62      0.77      0.69        13
  Dostoevsky       0.56      0.56      0.56         9
       Gogol       0.46      0.67      0.55         9
        King       1.00      0.88      0.94        34
   Pratchett       0.91      1.00      0.95        20
      Remark       1.00      1.00      1.00         3

    accuracy                           0.87       174
   macro avg       0.80      0.84      0.82       174
weighted avg       0.89      0.87      0.87       174
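
The `macro avg` and `weighted avg` rows differ because the first averages the per-author scores equally, while the second weights each author by its support. As a rough check, the sketch below recomputes both averages for the precision column of the validation report above. The inputs are the rounded values from the table, so the result is approximate.

```python
# Per-author (precision, support), copied from the validation report above.
rows = {
    "Akunin": (0.98, 61),
    "Bulychev": (0.89, 25),
    "Chehov": (0.62, 13),
    "Dostoevsky": (0.56, 9),
    "Gogol": (0.46, 9),
    "King": (1.00, 34),
    "Pratchett": (0.91, 20),
    "Remark": (1.00, 3),
}

total_support = sum(s for _, s in rows.values())
# Macro average: every author counts equally.
macro = sum(p for p, _ in rows.values()) / len(rows)
# Weighted average: authors with more samples count more.
weighted = sum(p * s for p, s in rows.values()) / total_support

print(f"macro avg precision:    {macro:.2f}")
print(f"weighted avg precision: {weighted:.2f}")
```

The macro average is lower because weakly recognised authors with few samples, such as Gogol and Dostoevsky, pull it down as much as the large classes pull it up.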



## 3. (Optional) Create a prediction for our test data

In [13]:
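# The model was trained on the cleaned column, so predict on it rather than on the raw text
final = sgd_ppl_clf.predict(test_df['Post_clean'])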

In [14]:
answer = pd.DataFrame(data=final, columns=['author'])
answer.insert(0, 'id', test_df['id'])

In [15]:
answer.to_csv('submission.csv', index=False)

In [16]:
answer

| | id | author |
|---|---|---|
| 0 | 1734 | Dostoevsky |
| 1 | 1735 | Pratchett |
| 2 | 1736 | Akunin |
| 3 | 1737 | Chehov |
| 4 | 1738 | King |
| ... | ... | ... |
| 325 | 2059 | King |
| 326 | 2060 | Chehov |
| 327 | 2061 | Pratchett |
| 328 | 2062 | Pratchett |


## **4. The results of our work**
#### As you can see, the linear model (SGDClassifier) outperformed k-nearest neighbors: 0.89 versus 0.83 accuracy on the test split. I also experimented with a neural network in a separate notebook; in theory, it should be considerably more capable. Try your own variants, it would be interesting to see the results! Thanks for reading!
![](http://ic.pics.livejournal.com/choodo7/60378694/98257/98257_900.jpg)