In this section we look at simple ways of processing text for classification. Most courses introduce the topic with the 20 Newsgroups dataset; here we use a Kaggle dataset for financial sentiment analysis instead. Please download the dataset from [here](https://www.kaggle.com/ankurzing/sentiment-analysis-for-financial-news). Place the file somewhere accessible and change the path in the `pd.read_csv` call of the second code cell accordingly.

I still suggest browsing the analysis of the 20 Newsgroups dataset shown in the [sklearn docs](https://scikit-learn.org/0.19/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-plot-topics-extraction-with-nmf-lda-py).

In [None]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

Note that the `cp437` encoding used below is rare, so do not worry about it. If you ever need to specify an encoding when reading data, it will most likely be `utf-8` or similar.
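
To see why the codec matters, the short sketch below decodes the same bytes under two encodings. The byte string is made up for illustration and is not taken from the dataset.

```python
# A byte string whose last byte is a valid cp437 character
# but is not valid UTF-8 on its own.
raw = b"caf\x82"

# cp437 assigns a character to every byte value, so decoding succeeds.
as_cp437 = raw.decode("cp437")

# UTF-8 rejects a lone 0x82 byte (a continuation byte without a lead byte).
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(as_cp437, utf8_ok)
```

`pd.read_csv(..., encoding=...)` accepts the same codec names, so a decode error while loading a file is usually a hint to try a different `encoding` value.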

Also note that I apply the label encoder _before_ the train/test split rather than after it. This is one of the few transformations where the order doesn't matter: we are only converting label strings to integers, which, in this case at least, won't cause any data leakage.
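
A minimal sketch of why the order is harmless here, using toy labels rather than the real `sentiment` column: `LabelEncoder` stores the unique labels in sorted order, so the integer ids depend only on which label strings exist, not on any statistics of the data. The split below is hypothetical and happens to contain every class.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the `sentiment` column (illustrative values only).
all_labels = ["neutral", "negative", "positive", "neutral", "positive", "negative"]
train_labels = all_labels[:4]  # hypothetical training split containing all three classes

le_all = LabelEncoder().fit(all_labels)
le_train = LabelEncoder().fit(train_labels)

# Classes are kept in sorted order, so both encoders learn the same mapping.
print(list(le_all.classes_))
print(list(le_all.classes_) == list(le_train.classes_))

# inverse_transform maps the integer ids back to the original strings.
print(list(le_all.inverse_transform([0, 1, 2])))
```

If a class were missing from the training split, an encoder fitted only on that split would assign different ids, which is one more reason to fit it on the full label column.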

In [None]:
import pandas as pd
from google.colab import drive

# Mount Google Drive so the CSV can be read from Colab
drive.mount('/content/drive')

# The file has no header row: column 0 is the sentiment label, column 1 the sentence
df = pd.read_csv('/content/drive/MyDrive/all-data.csv',
                 encoding='cp437',
                 header=None,
                 names=['sentiment', 'text'])

# Convert the sentiment strings to integer class ids
le = LabelEncoder()
df["y"] = le.fit_transform(df["sentiment"])

df

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Unnamed: 0,sentiment,text,y
0,neutral,"According to Gran , the company has no plans t...",1
1,neutral,Technopolis plans to develop in stages an area...,1
2,negative,The international electronic industry company ...,0
3,positive,With the new production plant the company woul...,2
4,positive,According to the company 's updated strategy f...,2
...,...,...,...
4841,negative,LONDON MarketWatch -- Share prices ended lower...,0
4842,neutral,Rinkuskiai 's beer sales fell by 6.5 per cent ...,1
4843,negative,Operating profit fell to EUR 35.4 mn from EUR ...,0
4844,negative,Net sales of the Paper segment decreased to EU...,0


In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp(df["text"].values[0])
print(doc)

for entity in doc.ents:
    print(entity.text, entity.label_)

According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .
Gran PERSON
Russia GPE


In [None]:
doc.ents

(Gran, Russia)

In [None]:
from tqdm.auto import tqdm
tqdm.pandas()  # registers progress_map / progress_apply on pandas objects

df["ents"] = df["text"].progress_map(lambda text: [(entity.text, entity.label_)
                                          for entity in nlp(text).ents])


In [None]:
df.sample(5)

Unnamed: 0,sentiment,text,y,ents
3870,positive,The newly created position has been establishe...,2,"[(Amer Sports ', PERSON)]"
3485,neutral,`` This agreement is a direct result of LCC 's...,1,"[(LCC, ORG), (earlier this year, DATE), (Dean ..."
4219,negative,"More than 14,000 customers were left powerless .",0,"[(More than 14,000, CARDINAL)]"
3561,neutral,Curators have divided their material into eigh...,1,"[(eight, CARDINAL)]"
1273,neutral,"Jun. 14 , 2009 ( AOL Weblogs delivered by News...",1,"[(Jun., PERSON), (14 , 2009, DATE), (AOL, ORG)..."


In [None]:
df["ent_types"] = df["ents"].progress_map(lambda x: set(ent[1] for ent in x))


The types of entities and their definitions can be seen [here](https://spacy.io/api/annotation#named-entities).

In [None]:
df.sample(5)

Unnamed: 0,sentiment,text,y,ents,ent_types
2630,neutral,"The Finnish company is building a 800,000 mt-y...",1,"[(Finnish, NORP), (800,000, CARDINAL), (mt-yea...","{DATE, GPE, CARDINAL, NORP}"
3075,neutral,Other details were not provided .,1,[],{}
1370,neutral,The cooperation will involve Arena Partners bu...,1,"[(Arena Partners, ORG), (35 %, PERCENT), (Alma...","{PERCENT, ORG}"
912,positive,This combined with foreign investments creates...,2,"[(Solteq, PERSON)]",{PERSON}
4431,negative,The recent troubles simply make NETeller cheap...,0,"[(NETeller, ORG)]",{ORG}


In [None]:
def replace_text(text, entities):
    for ent_name, ent_type in entities:
        text = text.replace(ent_name, ent_type)

    return text

df["format_text"] = df.progress_apply(lambda x: replace_text(x["text"], x["ents"]), axis=1)


In [None]:
df.sample(5)

Unnamed: 0,sentiment,text,y,ents,ent_types,format_text
2576,neutral,Talvivaara is listed on the London Stock Excha...,1,"[(Talvivaara, PERSON), (the London Stock Excha...","{PERSON, CARDINAL, ORG}",PERSON is listed on ORG and NASDAQ PERSON and ...
1738,positive,`` I am very pleased and proud of our performa...,2,"[(last year, DATE), (Juha Rantanen, PERSON)]","{DATE, PERSON}",`` I am very pleased and proud of our performa...
478,neutral,Aspocomp intends to set up a plant to manufact...,1,[],{},Aspocomp intends to set up a plant to manufact...
959,positive,Operating profit improved by 16.7 % to EUR 7.7...,2,"[(16.7 %, PERCENT), (EUR, ORG), (7.7, CARDINAL)]","{PERCENT, CARDINAL, ORG}",Operating profit improved by PERCENT to ORG CA...
3263,neutral,The Group 's business sectors are Building Con...,1,"[(Group, ORG), (Building Construction , Infras...",{ORG},"The ORG 's business sectors are ORG , and ORG ."
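
`str.replace` swaps every occurrence of an entity string, including matches inside longer words, and a later replacement can also touch a label inserted earlier. One possible alternative, sketched below, replaces by character offsets instead. It assumes the entity tuples also carry start and end offsets (spaCy exposes these as `ent.start_char` and `ent.end_char`); the helper name `replace_spans` and the sample sentence are hypothetical.

```python
def replace_spans(text, spans):
    """Replace each (start, end, label) character span in `text` with its label."""
    # Process spans from the end of the string so earlier offsets stay valid.
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + label + text[end:]
    return text


sentence = "Orion sold shares in Orionvest ."
# Only the first "Orion" is tagged; offsets are character positions in `sentence`.
spans = [(0, 5, "ORG")]

by_offset = replace_spans(sentence, spans)
by_string = sentence.replace("Orion", "ORG")

print(by_offset)
print(by_string)
```

The offset version leaves the untagged word alone, while the string version also rewrites part of it.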


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

In [None]:
train_df, test_df = train_test_split(df, stratify=df["y"], test_size=0.1)

In [None]:
tfidf_vectorizer = TfidfVectorizer(max_df=0.99,
                                min_df=5,
                                lowercase=True,
                                stop_words='english')
train_tfidf = tfidf_vectorizer.fit_transform(train_df["format_text"].values)

In [None]:
train_tfidf

<4361x1431 sparse matrix of type '<class 'numpy.float64'>'
	with 36117 stored elements in Compressed Sparse Row format>

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import keras
from keras.layers import Dense

# Small feed-forward network on top of the TF-IDF features
mod = keras.Sequential()
mod.add(keras.Input((train_tfidf.shape[1],)))  # input layer: one feature per vocabulary term
mod.add(Dense(units=100, activation='relu'))
mod.add(Dense(units=250, activation='relu'))
mod.add(Dense(units=300, activation='relu'))
mod.add(Dense(units=250, activation='relu'))
mod.add(Dense(units=100, activation='relu'))
mod.add(Dense(units=3, activation='softmax'))  # output layer: 3 neurons for the 3 classes

# `y` holds integer class ids, so use the sparse variant of cross-entropy;
# CategoricalCrossentropy expects one-hot targets of shape (None, 3) and
# raises a shape-mismatch ValueError when given integer labels
mod.compile(optimizer=keras.optimizers.Adam(0.01),
            loss=keras.losses.SparseCategoricalCrossentropy(),
            metrics=['acc'])

mod.fit(train_tfidf, train_df["y"], epochs=10)

# Inspect the integer targets (0 = negative, 1 = neutral, 2 = positive)
train_df["y"]

Unnamed: 0,y
4794,1
2920,2
4774,2
2586,1
4561,0
...,...
4233,0
1735,2
3996,0
122,2
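
The target column holds one integer class id per row, while a 3-unit softmax layer outputs three probabilities per row. Categorical cross-entropy compares the prediction with a one-hot target of the same shape; the sparse variant indexes the prediction with the integer id directly. The pure-Python sketch below uses made-up probabilities and illustrative helper names to show that both give the same loss for the same label.

```python
import math


def one_hot(class_id, num_classes):
    """Turn an integer class id into a one-hot list, e.g. 2 -> [0.0, 0.0, 1.0]."""
    return [1.0 if i == class_id else 0.0 for i in range(num_classes)]


def categorical_ce(target_one_hot, probs):
    """Cross-entropy for a one-hot target: -sum(t_i * log(p_i))."""
    return -sum(t * math.log(p) for t, p in zip(target_one_hot, probs))


def sparse_categorical_ce(class_id, probs):
    """Cross-entropy for an integer target: -log(p[class_id])."""
    return -math.log(probs[class_id])


probs = [0.2, 0.7, 0.1]  # made-up softmax output for one sentence
label = 1                # integer class id, as produced by LabelEncoder

dense_loss = categorical_ce(one_hot(label, 3), probs)
sparse_loss = sparse_categorical_ce(label, probs)
print(dense_loss, sparse_loss)
```

In Keras terms, integer targets pair with `keras.losses.SparseCategoricalCrossentropy`, whereas `keras.losses.CategoricalCrossentropy` needs the labels one-hot encoded first, for example with `keras.utils.to_categorical`.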



In [None]:
# Vectorize the test split with the vocabulary learned on the training split
test_tfidf = tfidf_vectorizer.transform(test_df["format_text"])

prediction = mod.predict(test_tfidf)
# To score the network, take the most probable class in each row:
# accuracy_score(test_df["y"], prediction.argmax(axis=1))
prediction.shape

16/16 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step


(485, 1)

In [None]:
import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download('punkt_tab')  # tokenizer data required by word_tokenize

ps = PorterStemmer()


def stem_sentence(text):
    """Tokenize `text` and reduce every token to its Porter stem."""
    return " ".join(ps.stem(word) for word in word_tokenize(text))


train_df["processed_text"] = train_df["format_text"].progress_map(stem_sentence)
test_df["processed_text"] = test_df["format_text"].map(stem_sentence)
test_df

In [None]:
tfidf_vectorizer = TfidfVectorizer(max_df=0.99,
                                min_df=5,
                                lowercase=True,
                                stop_words='english')
train_tfidf = tfidf_vectorizer.fit_transform(train_df["processed_text"].values)

In [None]:
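# `multi_class` is deprecated in recent scikit-learn releases; with the default
# lbfgs solver a multiclass target is already fitted as a multinomial model
model = LogisticRegression()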
model.fit(train_tfidf, train_df["y"])

test_tfidf = tfidf_vectorizer.transform(test_df["processed_text"])
test_preds = model.predict(test_tfidf)
accuracy_score(test_df["y"], test_preds)

In [None]:
train_tfidf

## Ngrams
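
An n-gram is a run of n consecutive tokens. With `ngram_range=(1, 2)` the vectorizer builds features from single tokens and from adjacent pairs, so the model sees short phrases in addition to single words. The helper below is a rough, hypothetical illustration of the idea on whitespace tokens; `TfidfVectorizer` applies its own tokenization and stop-word filtering before forming n-grams.

```python
def word_ngrams(tokens, min_n=1, max_n=2):
    """Return all n-grams of `tokens` with min_n <= n <= max_n, joined by spaces."""
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of length n over the token list.
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


tokens = "net sales decreased".split()  # toy tokens, not taken from the dataset
grams = word_ngrams(tokens)
print(grams)
```

Three tokens give three unigrams and two bigrams, which is why the feature matrix grows once bigrams are enabled.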

In [None]:
tfidf_vectorizer = TfidfVectorizer(max_df=0.99,
                                   min_df=5,
                                   lowercase=True,
                                   stop_words='english',
                                   ngram_range=(1, 2)
                                  )
train_tfidf = tfidf_vectorizer.fit_transform(train_df["processed_text"].values)

train_tfidf.shape

In [None]:
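# `multi_class` is deprecated in recent scikit-learn releases; with the default
# lbfgs solver a multiclass target is already fitted as a multinomial model
model = LogisticRegression()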
model.fit(train_tfidf, train_df["y"])

test_tfidf = tfidf_vectorizer.transform(test_df["processed_text"])
test_preds = model.predict(test_tfidf)
accuracy_score(test_df["y"], test_preds)

In [None]:
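# Ten highest-weighted features for each class
idxs = (-model.coef_).argsort(axis=-1)[:, :10]
words = tfidf_vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
for class_id, top_idx in enumerate(idxs):
    print(le.inverse_transform([class_id]))
    print([words[j] for j in top_idx])
    print("=" * 10)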
