In this section we look at simple ways of processing text for classification. Most courses introduce this topic with the 20 Newsgroups dataset, but we will use a Kaggle dataset for financial sentiment analysis instead. Please download the dataset from [here](https://www.kaggle.com/ankurzing/sentiment-analysis-for-financial-news). Place it somewhere convenient and change the file path on the first line of cell 2 accordingly.

That said, I do suggest browsing the analysis of the 20 Newsgroups dataset shown in the [sklearn docs](https://scikit-learn.org/0.19/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-plot-topics-extraction-with-nmf-lda-py).

In [1]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

Note that the `cp437` encoding used below is rare, so do not worry about it. If you ever need to specify an encoding when reading data, it will most likely be `utf-8` or similar.
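As a small illustration of why the encoding argument matters at all, here is a sketch that uses only the standard library. The sample string is made up and is not taken from the dataset. A non-ASCII character stored as `cp437` bytes fails to decode as `utf-8`, which is the kind of error that prompts an explicit `encoding=` argument in the first place.

```python
# Hypothetical sample text; not taken from the dataset.
sample = "Pöyry"
raw = sample.encode("cp437")  # the bytes as they would sit in a cp437 file

try:
    raw.decode("utf-8")
    decoded_as_utf8 = True
except UnicodeDecodeError:
    # The byte used for "ö" in cp437 is not valid UTF-8 on its own.
    decoded_as_utf8 = False

round_trip = raw.decode("cp437")  # decoding with the right codec restores the text
print(decoded_as_utf8, round_trip)
```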

Also note that I apply the label encoder _before_ the train/test split rather than after it. This is probably one of the few transformations where the order doesn't matter: we are only converting labels to numbers, which, in this case at least, won't cause any data leakage.
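To make the leakage argument concrete, here is a minimal sketch of the idea behind label encoding. It is an illustration in plain Python, not `LabelEncoder`'s actual implementation, and the helper name `fit_label_mapping` is hypothetical. The mapping depends only on the set of distinct label names, so fitting on all rows or on a training subset gives the same result, assuming every class appears in the subset.

```python
def fit_label_mapping(labels):
    """Map each distinct label to an integer, in sorted order."""
    return {label: idx for idx, label in enumerate(sorted(set(labels)))}

# Made-up label lists standing in for the full data and a training subset.
all_labels = ["neutral", "neutral", "negative", "positive", "positive"]
train_only = ["neutral", "negative", "positive"]

mapping_all = fit_label_mapping(all_labels)
mapping_train = fit_label_mapping(train_only)

print(mapping_all)
print(mapping_all == mapping_train)
```

Nothing about the text, or about how often each label occurs, enters the mapping, which is why the order relative to the split is harmless here.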

In [2]:
df = pd.read_csv("/tmp/all-data.csv", 
                 encoding='cp437', 
                 header=None, 
                 names=["sentiment", "text"])
le = LabelEncoder()
df["y"] = le.fit_transform(df["sentiment"])
df

| | sentiment | text | y |
|---|---|---|---|
| 0 | neutral | According to Gran , the company has no plans t... | 1 |
| 1 | neutral | Technopolis plans to develop in stages an area... | 1 |
| 2 | negative | The international electronic industry company ... | 0 |
| 3 | positive | With the new production plant the company woul... | 2 |
| 4 | positive | According to the company 's updated strategy f... | 2 |
| ... | ... | ... | ... |
| 4841 | negative | LONDON MarketWatch -- Share prices ended lower... | 0 |
| 4842 | neutral | Rinkuskiai 's beer sales fell by 6.5 per cent ... | 1 |
| 4843 | negative | Operating profit fell to EUR 35.4 mn from EUR ... | 0 |
| 4844 | negative | Net sales of the Paper segment decreased to EU... | 0 |


In [3]:
import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp(df["text"].values[0])
print(doc)

for entity in doc.ents:
    print(entity.text, entity.label_)



According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .
Gran PERSON
Russia GPE


In [4]:
from tqdm.auto import tqdm
tqdm.pandas()

df["ents"] = df["text"].progress_map(lambda text: [(entity.text, entity.label_) 
                                          for entity in nlp(text).ents])


In [5]:
df["ent_types"] = df["ents"].progress_map(lambda x: set(ent[1] for ent in x))



In [6]:
def replace_text(text, entities):
    for ent, ent_type in entities:
        text = text.replace(ent, ent_type)
        
    return text

df["format_text"] = df.progress_apply(lambda x: replace_text(x["text"], x["ents"]), axis=1)
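As a quick check of what `replace_text` does, the sketch below repeats the function and applies it to the first sentence of the dataset with the two entities spaCy reported for it earlier. The second call uses a made-up sentence to show a caveat: `str.replace` swaps every matching substring, so an entity string that also occurs inside a longer word gets replaced there too.

```python
def replace_text(text, entities):
    # Replace each entity string with its entity type, one entity at a time.
    for ent, ent_type in entities:
        text = text.replace(ent, ent_type)
    return text

sentence = ("According to Gran , the company has no plans to move all "
            "production to Russia , although that is where the company is growing .")
masked = replace_text(sentence, [("Gran", "PERSON"), ("Russia", "GPE")])
print(masked)

# Made-up example of the substring caveat; not taken from the dataset.
caveat = replace_text("Nokian Tyres and Nokia", [("Nokia", "ORG")])
print(caveat)  # the "Nokia" inside "Nokian" is replaced as well
```

If this ever becomes a problem, one option is to replace by character offsets instead (spaCy entities expose `start_char` and `end_char`), though the simple version is enough for this section.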



In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split

In [8]:
train_df, test_df = train_test_split(df, stratify=df["y"], test_size=0.1)

In [9]:
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=42)
trainX, trainY = ros.fit_resample(train_df.drop("y", axis=1), train_df["y"])

In [11]:
trainX

| | sentiment | text | ents | ent_types | format_text |
|---|---|---|---|---|---|
| 0 | neutral | Rapala VMC Corporation Rapala , a leading fish... | [(Rapala VMC Corporation Rapala, ORG), (Pelton... | {PERCENT, ORG, GPE} | ORG , a leading fishing tackle and sporting go... |
| 1 | positive | Shareholders of Rakvere Lihakombinaat decided ... | [(mid-July, DATE)] | {DATE} | Shareholders of Rakvere Lihakombinaat decided ... |
| 2 | neutral | Approximately SEK 166 million in repayments ha... | [(166 million, CARDINAL), (Stockholm, GPE), (8... | {CARDINAL, LOC, GPE} | Approximately SEK CARDINAL in repayments has b... |
| 3 | neutral | The serial bond is part of the plan to refinan... | [] | {} | The serial bond is part of the plan to refinan... |
| 4 | neutral | Results are expected late in 2006 . | [(2006, DATE)] | {DATE} | Results are expected late in DATE . |
| ... | ... | ... | ... | ... | ... |
| 7768 | positive | The company 's net profit rose 11.4 % on the y... | [(11.4 %, PERCENT), (the year, DATE), (82.2 mi... | {PERCENT, MONEY, DATE, CARDINAL} | The company 's net profit rose PERCENT on DATE... |
| 7769 | positive | Industry Investment is very interested in Glas... | [(Glaston, GPE)] | {GPE} | Industry Investment is very interested in GPE ... |
| 7770 | positive | Operating profit was EUR 9.8 mn , compared to ... | [(2009, DATE)] | {DATE} | Operating profit was EUR 9.8 mn , compared to ... |
| 7771 | positive | Vaisala Oyj Press Release September 30 , 2010 ... | [(Vaisala Oyj, PERSON), (September 30 , 2010, ... | {PERSON, DATE, GPE} | PERSON Press Release DATE GPE has signed a con... |


In [13]:
trainY.value_counts()

2    2591
1    2591
0    2591
Name: y, dtype: int64
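The equal class counts above are the result of random oversampling. The sketch below shows the general idea in plain Python: rows from the smaller classes are duplicated at random until every class is as large as the biggest one. It is a simplified illustration, not imblearn's implementation, and the function name `random_oversample` and the toy data are made up.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=42):
    """Duplicate randomly chosen rows of smaller classes up to the largest class size."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())

    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Sample with replacement to fill the gap to the target size.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels

# Toy data: class 1 dominates, classes 0 and 2 have one row each.
rows = ["a", "b", "c", "d", "e", "f"]
labels = [1, 1, 1, 1, 0, 2]
new_rows, new_labels = random_oversample(rows, labels)
print(Counter(new_labels))
```

Note that only the training split is oversampled in this notebook. The test split is left untouched, so duplicated rows never end up in the evaluation.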

In [23]:
tfidf_vectorizer = TfidfVectorizer(max_df=0.99, 
                                min_df=5,
                                lowercase=True,
                                stop_words='english')
train_tfidf = tfidf_vectorizer.fit_transform(trainX["format_text"].values)

In [24]:
train_tfidf

<7773x2263 sparse matrix of type '<class 'numpy.float64'>'
	with 72577 stored elements in Compressed Sparse Row format>
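The 2263 columns are the terms that survive the `min_df` and `max_df` filters (after stop-word removal). As a rough sketch of what those two parameters mean, assuming the documented behaviour that an integer `min_df` is an absolute document count and a float `max_df` is a proportion of documents, the filtering can be written in plain Python. The helper `filter_vocabulary` and the toy documents are hypothetical, and tokenisation, lowercasing and the TF-IDF weighting itself are left out.

```python
from collections import Counter

def filter_vocabulary(tokenized_docs, min_df=5, max_df=0.99):
    """Keep terms whose document frequency lies within [min_df, max_df * n_docs]."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return sorted(term for term, df in doc_freq.items()
                  if min_df <= df <= max_df * n_docs)

# Toy corpus: "profit" is in every document, "fell" and "sharply" in only one.
docs = [
    ["profit", "rose"],
    ["profit", "fell"],
    ["profit", "rose", "sharply"],
    ["profit"],
]
vocab = filter_vocabulary(docs, min_df=2, max_df=0.99)
print(vocab)
```

In the toy corpus, "profit" is dropped for being too common and "fell" and "sharply" for being too rare, which mirrors why the vocabulary here is much smaller than the number of distinct words in the dataset.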

In [32]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

model = LogisticRegression(multi_class="multinomial")
model.fit(train_tfidf, trainY)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression(multi_class='multinomial')
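The warning above means the solver stopped at its iteration limit before converging. One common remedy, as the message itself suggests, is to raise `max_iter`. The sketch below shows this on synthetic data from `make_classification`, used here only as a stand-in for the TF-IDF features; the value 1000 is an arbitrary choice, not a tuned one. Newer scikit-learn releases deprecate the `multi_class` argument, so the sketch leaves it out and relies on the default, which already fits a multinomial model for multiclass targets with the `lbfgs` solver.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic three-class data; a stand-in, not the financial news dataset.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)

# max_iter defaults to 100; a larger limit gives the solver room to converge.
model = LogisticRegression(max_iter=1000)
model.fit(X, y)

print(model.n_iter_)  # iterations actually used
```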

In [33]:
test_tfidf = tfidf_vectorizer.transform(test_df["format_text"])
test_preds = model.predict(test_tfidf)
accuracy_score(test_df["y"], test_preds)

0.7443298969072165
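A single accuracy number is hard to judge on imbalanced data. The test split is stratified but not oversampled, so it is still dominated by the neutral class: 2591 of the roughly 4361 training rows were neutral before oversampling, so always predicting neutral would already score about 0.59. The sketch below, with made-up labels and hypothetical helper names, shows how a majority-class baseline and per-class recall can be computed to put the accuracy in context. In practice `sklearn.metrics.classification_report` and `confusion_matrix` give the same kind of breakdown.

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Fraction of each true class that was predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in sorted(totals)}

def majority_baseline_accuracy(y_true):
    """Accuracy of always predicting the most frequent class."""
    return Counter(y_true).most_common(1)[0][1] / len(y_true)

# Made-up labels and predictions; not the notebook's actual results.
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 2, 2]
y_pred = [1, 1, 1, 1, 1, 0, 0, 1, 2, 1]

recall = per_class_recall(y_true, y_pred)
baseline = majority_baseline_accuracy(y_true)
print(recall, baseline)
```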