
In [4]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.model_selection import train_test_split

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/levlazutin/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Loading the dataset

In [5]:
df = pd.read_csv("text.csv")
print(df.head())
print(df['category'].value_counts())

        category                                               text
0           tech  tv future in the hands of viewers with home th...
1       business  worldcom boss  left books alone  former worldc...
2          sport  tigers wary of farrell  gamble  leicester say ...
3          sport  yeading face newcastle in fa cup premiership s...
4  entertainment  ocean s twelve raids box office ocean s twelve...
category
sport            511
business         510
politics         417
tech             401
entertainment    386
Name: count, dtype: int64


In [6]:
df

,category,text
0,tech,tv future in the hands of viewers with home th...
1,business,worldcom boss left books alone former worldc...
2,sport,tigers wary of farrell gamble leicester say ...
3,sport,yeading face newcastle in fa cup premiership s...
4,entertainment,ocean s twelve raids box office ocean s twelve...
...,...,...
2220,business,cars pull down us retail figures us retail sal...
2221,politics,kilroy unveils immigration policy ex-chatshow ...
2222,entertainment,rem announce new glasgow concert us band rem h...
2223,politics,how political squabbles snowball it s become c...


Text preprocessing

In [7]:
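# stop_words was already built in the import cell; it is rebuilt here so this cell runs on its own
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def preprocess(text):
    """Lowercase, keep only a-z and whitespace, drop stop words, stem the rest."""
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)  # also deletes digits, hyphens and apostrophes
    tokens = text.split()
    tokens = [stemmer.stem(t) for t in tokens if t not in stop_words]
    return ' '.join(tokens)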

df['clean_text'] = df['text'].apply(preprocess)

In [8]:
df

,category,text,clean_text
0,tech,tv future in the hands of viewers with home th...,tv futur hand viewer home theatr system plasma...
1,business,worldcom boss left books alone former worldc...,worldcom boss left book alon former worldcom b...
2,sport,tigers wary of farrell gamble leicester say ...,tiger wari farrel gambl leicest say rush make ...
3,sport,yeading face newcastle in fa cup premiership s...,yead face newcastl fa cup premiership side new...
4,entertainment,ocean s twelve raids box office ocean s twelve...,ocean twelv raid box offic ocean twelv crime c...
...,...,...,...
2220,business,cars pull down us retail figures us retail sal...,car pull us retail figur us retail sale fell j...
2221,politics,kilroy unveils immigration policy ex-chatshow ...,kilroy unveil immigr polici exchatshow host ro...
2222,entertainment,rem announce new glasgow concert us band rem h...,rem announc new glasgow concert us band rem an...
2223,politics,how political squabbles snowball it s become c...,polit squabbl snowbal becom commonplac argu bl...
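The pattern `[^a-z\s]` in `preprocess` deletes punctuation outright, so hyphenated words are fused: `ex-chatshow` becomes `exchatshow` in the output above. The sketch below uses only the standard library to contrast that behaviour with one possible alternative that replaces such characters with a space. The helper names `strip_delete` and `strip_to_space` and the sample string are hypothetical and not part of the notebook.

```python
import re

def strip_delete(text):
    """Same cleaning step as in preprocess(): drop every character that is not a-z or whitespace."""
    return re.sub(r'[^a-z\s]', '', text.lower())

def strip_to_space(text):
    """Hypothetical variant: turn those characters into spaces, then collapse whitespace runs."""
    spaced = re.sub(r'[^a-z\s]', ' ', text.lower())
    return ' '.join(spaced.split())

sample = "ex-chatshow host unveils $5.4bn offer"
print(strip_delete(sample))    # hyphenated parts are fused into one token
print(strip_to_space(sample))  # hyphenated parts stay separate tokens
```

Switching to the second variant would change the vocabulary, so the TF-IDF matrix and any score computed from it would probably differ from the values shown in this notebook.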


Vectorization

In [9]:
vectorizer = TfidfVectorizer(max_features=1000)
X = vectorizer.fit_transform(df['clean_text'])

In [10]:
feature_names = vectorizer.get_feature_names_out()
df_vectorized = pd.DataFrame(X.toarray(), columns=feature_names)
df_vectorized.head()

,abil,abl,accept,access,accord,account,accus,achiev,across,act,...,worth,would,write,wrong,year,yearold,yet,york,young,yuko
0,0.0,0.0,0.0,0.0,0.02983,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.031284,0.0,0.033588,0.0,0.0,0.0
1,0.078688,0.0,0.0,0.0,0.0,0.334367,0.069332,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.08748,0.0,...,0.0,0.08496,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.079211,0.0,0.0,0.0,0.094033,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.106087,0.0,0.0
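As a rough illustration of what the TF-IDF columns above contain, the sketch below recomputes the weighting in plain Python. It assumes the defaults that `TfidfVectorizer` is documented to use (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2 row normalisation) and splits on whitespace only, which simplifies the vectorizer's own tokenizer. `tfidf_rows` is a hypothetical helper, and the three short documents are made up from tokens seen in this notebook's output.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) of smoothed-IDF, L2-normalised TF-IDF weights."""
    tokenised = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(tok for toks in tokenised for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenised:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        # Scale each row to unit Euclidean length (guard against empty documents).
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows

vocab, rows = tfidf_rows(["tv futur hand viewer", "tv drama attack", "bank profit jump"])
for row in rows:
    print([round(w, 3) for w in row])
```

In the first toy document `tv` gets a lower weight than `futur` because `tv` also occurs in the second document, which is the effect that lets rarer terms dominate the vectors.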


Clustering and comparison with the labels

In [11]:
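# Encode the category names as integer codes (assigned in sorted category order)
true_labels = df['category'].astype('category').cat.codes
# One cluster per category; random_state fixes the centroid initialisation
kmeans = KMeans(n_clusters=true_labels.nunique(), random_state=42)
clusters = kmeans.fit_predict(X)

# ARI compares the two partitions and does not depend on how cluster ids are numbered
ari = adjusted_rand_score(true_labels, clusters)
f"Adjusted Rand Index: {ari:.4f}"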

'Adjusted Rand Index: 0.8517'
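K-means numbers its clusters arbitrarily, so cluster ids cannot be compared to the label codes one by one; the Adjusted Rand Index instead counts pairs of documents that are grouped the same way in both partitions. The sketch below is a plain-Python version of the standard pair-counting formula, written only to show that property. `adjusted_rand_index` is a hypothetical helper, and returning 1.0 for degenerate inputs is an assumption of this sketch.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Pair-counting ARI between two labelings of the same items."""
    n = len(labels_a)
    if n < 2:
        return 1.0  # assumption: treat trivial input as perfect agreement
    # Pairs placed together in both partitions (contingency-table cells).
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together within each partition on its own.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    maximum = (sum_a + sum_b) / 2
    if maximum == expected:
        return 1.0  # assumption: both partitions are a single group
    return (sum_cells - expected) / (maximum - expected)

print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # same grouping, ids swapped
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # groupings that cut across each other
```

The first call scores a perfect match even though every id differs, which is why the notebook can pass raw cluster ids to `adjusted_rand_score` without first mapping them to categories.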

Splitting into training, validation, and test sets

In [12]:
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, true_labels, test_size=0.2, random_state=42, stratify=true_labels
)

X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=0.25, random_state=42, stratify=y_train_val
)

f"Train: {X_train.shape[0]}, Val: {X_val.shape[0]}, Test: {X_test.shape[0]}"

'Train: 1335, Val: 445, Test: 445'
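The two calls produce a 60/20/20 split: the first holds out 20% for testing, and the second takes 25% of the remaining 80%, which is another 20% of the whole. The arithmetic can be checked with the short sketch below. `split_sizes` is a hypothetical helper, and it assumes the held-out part is rounded up with the remainder going to the other part, which is how `train_test_split` is understood to behave for fractional sizes. The row count 2225 is the sum of the category counts printed earlier.

```python
from fractions import Fraction
from math import ceil

def split_sizes(n_rows, test_frac, val_frac_of_rest):
    """Sizes of train/val/test after two successive hold-out splits."""
    n_test = ceil(n_rows * test_frac)        # first split: hold out the test set
    n_rest = n_rows - n_test
    n_val = ceil(n_rest * val_frac_of_rest)  # second split: hold out validation from the rest
    n_train = n_rest - n_val
    return n_train, n_val, n_test

# Exact fractions avoid floating-point rounding surprises before ceil().
print(split_sizes(2225, Fraction(1, 5), Fraction(1, 4)))
```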

In [13]:
# true_labels is a pandas Series, so each split keeps the original row labels of df
train_idx = y_train.index
val_idx = y_val.index
test_idx = y_test.index

# Select rows and columns in a single .loc call, then renumber the rows from 0
cols = ['category', 'text', 'clean_text']
df_train = df.loc[train_idx, cols].reset_index(drop=True)
df_val = df.loc[val_idx, cols].reset_index(drop=True)
df_test = df.loc[test_idx, cols].reset_index(drop=True)
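Recovering the text rows this way works because the label Series carries the DataFrame's index through shuffling. The toy example below imitates that with pandas alone: `codes.iloc[[3, 0]]` stands in for a shuffled subset such as `y_train`, and all names and rows are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "category": ["tech", "sport", "business", "sport"],
    "text": ["tv future", "fa cup", "retail sales", "six nations"],
})

# Integer codes follow the sorted category names: business=0, sport=1, tech=2.
codes = toy["category"].astype("category").cat.codes

# A shuffled subset of the labels keeps the original row labels (3 and 0).
picked = codes.iloc[[3, 0]]

# Those labels select the matching rows; reset_index renumbers them from 0.
subset = toy.loc[picked.index, ["category", "text"]].reset_index(drop=True)
print(codes.tolist())
print(subset)
```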


In [14]:
df_train

,category,text,clean_text
0,sport,ireland call up uncapped campbell ulster scrum...,ireland call uncap campbel ulster scrumhalf ki...
1,tech,call for action on internet scam phone compani...,call action internet scam phone compani enough...
2,business,chinese wine tempts italy s illva italy s illv...,chines wine tempt itali illva itali illva saro...
3,entertainment,us to raise tv indecency fines us politician...,us rais tv indec fine us politician propos tou...
4,business,news corp makes $5.4bn fox offer news corporat...,news corp make bn fox offer news corpor seek b...
...,...,...,...
1330,business,reliance unit loses anil ambani anil ambani t...,relianc unit lose anil ambani anil ambani youn...
1331,tech,online games play with politics after bubbling...,onlin game play polit bubbl time onlin game br...
1332,sport,sydney return for henin-hardenne olympic champ...,sydney return heninhardenn olymp champion just...
1333,business,yukos bankruptcy not us matter russian autho...,yuko bankruptci us matter russian author abid ...


In [15]:
df_val

,category,text,clean_text
0,business,mixed reaction to man utd offer shares in manc...,mix reaction man utd offer share manchest unit...
1,politics,choose hope over fear - kennedy voters will ha...,choos hope fear kennedi voter clear choic poli...
2,tech,local net tv takes off in austria an austrian ...,local net tv take austria austrian villag test...
3,politics,labour s eu propaganda a taxpayer subsidise...,labour eu propaganda taxpay subsidis propagand...
4,business,renault boss hails great year strong sales o...,renault boss hail great year strong sale outsi...
...,...,...,...
440,entertainment,itunes now selling band aid song ipod owners c...,itun sell band aid song ipod owner download ba...
441,sport,keegan hails comeback king fowler manchester c...,keegan hail comeback king fowler manchest citi...
442,politics,anti-terror plan faces first test plans to all...,antiterror plan face first test plan allow hom...
443,politics,candidate resigns over bnp link a prospective ...,candid resign bnp link prospect candid uk inde...


In [16]:
df_test

,category,text,clean_text
0,business,profits jump at china s top bank industrial an...,profit jump china top bank industri commerci b...
1,business,china had role in yukos split-up china lent ru...,china role yuko splitup china lent russia bn b...
2,entertainment,shark tale dvd is us best-seller oscar-nominat...,shark tale dvd us bestsel oscarnomin anim shar...
3,business,firms pump billions into pensions employers ha...,firm pump billion pension employ spent billion...
4,business,saudi ministry to employ women women will be e...,saudi ministri employ women women employ saudi...
...,...,...,...
440,sport,safin cool on wimbledon newly-crowned australi...,safin cool wimbledon newlycrown australian ope...
441,politics,csa could close says minister ministers wou...,csa could close say minist minist would rule s...
442,politics,david blunkett in quotes david blunkett - who ...,david blunkett quot david blunkett resign home...
443,entertainment,muslim group attacks tv drama 24 a british mus...,muslim group attack tv drama british muslim gr...
