# Text Classification

---
<a name="0"/>

### Contents:
* 1. [Data import](#1)
* 2. [Initial analysis](#2)
* 3. [Preparing and training the models](#3)
* 4. [Result](#4)

In [5]:
!pip install datasets

Import libraries

In [6]:
import pandas as pd
import numpy as np
import warnings
from datasets import load_dataset

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

Settings

In [7]:
warnings.filterwarnings("ignore")

---
<a name="1"/>

### 1. Data import
[contents](#0)

Load the IMDB dataset from HuggingFace (https://huggingface.co/datasets/imdb).

In [9]:
# The Hub "imdb" dataset ships its own train/test/unsupervised splits,
# so no data_files mapping is needed.
dataset = load_dataset("imdb")

In [19]:
columns = ['review', 'sentiment']

In [16]:
# Function that converts a dataset split into a DataFrame

In [20]:
def split_text(text):
    # Collect all rows first and build the DataFrame once; appending with
    # df.loc inside the loop gets very slow on 25,000 rows per split.
    rows = [(line['text'], line['label']) for line in text]
    return pd.DataFrame(rows, columns=columns)

In [22]:
train_data = split_text(dataset['train'])

In [23]:
test_data = split_text(dataset['test'])

In [24]:
data = pd.concat([train_data, test_data], axis=0)

In [25]:
print(f"Dataset shape: {data.shape}")
data.head()

Dataset shape: (50000, 2)


|   | review | sentiment |
|---|--------|-----------|
| 0 | I rented I AM CURIOUS-YELLOW from my video sto... | 0 |
| 1 | "I Am Curious: Yellow" is a risible and preten... | 0 |
| 2 | If only to avoid making this type of film in t... | 0 |
| 3 | This film was probably inspired by Godard's Ma... | 0 |
| 4 | Oh, brother...after hearing about this ridicul... | 0 |


---
<a name="2"/>

### 2. Initial analysis
[contents](#0)

In [26]:
print(f"Number of duplicates in the dataset: {data.duplicated().sum()} \n")

data = data.drop_duplicates(ignore_index=True)
print(f'''Dataset shape: {data.shape} \n
Class balance: \n {data['sentiment'].value_counts()} \n
General information:''')

data.info()

Number of duplicates in the dataset: 418 

Dataset shape: (49582, 2) 

Class balance: 
 1    24884
0    24698
Name: sentiment, dtype: int64 

General information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49582 entries, 0 to 49581
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     49582 non-null  object
 1   sentiment  49582 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 774.8+ KB


In [27]:
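# The labels are already 0/1, so LabelEncoder leaves them unchanged.
le = LabelEncoder()
data['sentiment'] = le.fit_transform(data['sentiment'])

# int16 is enough for binary labels and reduces memory usage.
data['sentiment'] = data['sentiment'].astype('int16')
data.info()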

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49582 entries, 0 to 49581
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     49582 non-null  object
 1   sentiment  49582 non-null  int16 
dtypes: int16(1), object(1)
memory usage: 484.3+ KB


Split the data into features and target

In [28]:
X = data.review.values
y = data.sentiment.values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

---
<a name="3"/>

### 3. Preparing and training the models
[contents](#0)

### Text cleaning and standardization

Remove the `<br />` tags, strip leading and trailing whitespace, and convert the text to lowercase.

In [29]:
from sklearn.base import TransformerMixin

class predictors(TransformerMixin):
    def transform(self, X, **transform_params):
        return [clean_text(text) for text in X]

    def fit(self, X, y=None, **fit_params):
        return self

    def get_params(self, deep=True):
        return {}

def clean_text(text):
    text = text.replace("<br />", " ")
    return text.strip().lower()
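
As a quick check of the cleaning step, the sketch below applies the same `clean_text` logic to a made-up review string. The sample text is an assumption for illustration and is not taken from the dataset.

```python
def clean_text(text):
    # Same logic as the notebook cell: replace <br /> tags, trim, lowercase.
    text = text.replace("<br />", " ")
    return text.strip().lower()

# Hypothetical review fragment in the style of the raw IMDB texts.
sample = "  Great movie!<br /><br />Would watch AGAIN.  "
cleaned = clean_text(sample)
print(repr(cleaned))
```

Only the exact spelling `<br />` is replaced. Each tag becomes one space, so consecutive tags leave several spaces in a row, which the tokenizer later ignores.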

### Preprocessing with NLTK

In [30]:
nltk.download("punkt")
nltk.download("stopwords")

# Build the stop-word set and the stemmer once instead of on every call.
stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def nltk_preprocess_text(text):
    # Tokenize, keep alphabetic non-stop-word tokens in lowercase, then stem.
    tokens = word_tokenize(text)
    filtered_tokens = [token.lower() for token in tokens if token.isalpha() and token.lower() not in stop_words]
    stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens]
    return " ".join(stemmed_tokens)


### Vectorization

In [31]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [32]:
vectorizer = TfidfVectorizer(preprocessor=nltk_preprocess_text)
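
To make the vectorization step concrete, here is a minimal pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings (`smooth_idf=True`, `norm="l2"`): raw term counts multiplied by `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The toy corpus and the helper name `tfidf_rows` are assumptions for illustration, and whitespace splitting stands in for the real tokenizer.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) with smoothed, L2-normalized TF-IDF weights."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: the number of documents each term appears in.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        weights = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows

# Toy corpus of already-preprocessed (stemmed) reviews; purely illustrative.
docs = ["good movi", "bad movi", "good good act"]
vocab, rows = tfidf_rows(docs)
print(vocab)
for doc, row in zip(docs, rows):
    print(doc, [round(w, 3) for w in row])
```

Terms that occur in fewer documents (here `bad` and `act`) get a larger weight than terms shared across documents (`movi`, `good`), which is what lets the classifier focus on distinctive words.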

### Classification

In [33]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()

### Pipeline

In [34]:
pipe = Pipeline([
    ("cleaner" , predictors()),
    ("vectorizer" , vectorizer),
    ("predictor" , clf)])
pipe.fit(X_train,y_train)

---
<a name="4"/>

### 4. Result
[contents](#0)

In [35]:
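from sklearn import metrics

predicted = pipe.predict(X_test)

# classification_report expects (y_true, y_pred). The arguments are reversed here,
# so in this report precision and recall are swapped and support counts predicted
# labels rather than true ones; accuracy is unaffected.
print(metrics.classification_report(predicted, y_test))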

              precision    recall  f1-score   support

           0       0.88      0.90      0.89      7155
           1       0.90      0.88      0.89      7720

    accuracy                           0.89     14875
   macro avg       0.89      0.89      0.89     14875
weighted avg       0.89      0.89      0.89     14875



The TF-IDF vectorizer, NLTK preprocessing, and logistic regression handled the task well: an accuracy of `0.89` on the 14,875 held-out reviews is a good result.
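
For reference, the per-class numbers in such a report come from simple counts. The sketch below computes precision and recall for one class by hand on made-up labels (the helpers `precision_recall` and `accuracy` and the sample lists are assumptions for illustration). It also shows that swapping the true and predicted labels swaps the two metrics while accuracy stays the same, which matters because scikit-learn's `classification_report` takes `y_true` first and `y_pred` second.

```python
def precision_recall(y_true, y_pred, label):
    """Precision and recall for `label`, computed from raw counts."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)  # TP + FP
    actual = sum(t == label for t in y_true)     # TP + FN
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall

def accuracy(y_true, y_pred):
    """Share of positions where the prediction matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical labels: 1 = positive review, 0 = negative review.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

p, r = precision_recall(y_true, y_pred, label=1)
p_swapped, r_swapped = precision_recall(y_pred, y_true, label=1)
acc = accuracy(y_true, y_pred)

print(f"correct order: precision={p:.2f} recall={r:.2f}")
print(f"swapped order: precision={p_swapped:.2f} recall={r_swapped:.2f}")
print(f"accuracy: {acc:.2f}")
```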

### Thank you for your attention =)