In this section we look at simple ways of processing text for classification. Most courses introduce this topic with the 20 Newsgroups dataset, but we will use a Kaggle dataset for financial sentiment analysis instead. Please download the dataset from [here](https://www.kaggle.com/ankurzing/sentiment-analysis-for-financial-news). Save it somewhere convenient and change the file path in the first line of cell 2 (the `pd.read_csv` call) accordingly.

However, I do suggest that you browse the analysis of the 20 Newsgroups dataset shown in the [sklearn docs](https://scikit-learn.org/0.19/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-plot-topics-extraction-with-nmf-lda-py).

In [3]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

Note that the `cp437` encoding used to read this file is rare, so do not worry about it. If you ever need to specify an encoding when reading data, it will most likely be `utf-8` or similar.
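To see why the encoding matters at all, here is a minimal sketch using only the standard library. The byte string is a made-up example, not taken from the dataset: the byte `0x9C` is the pound sign in `cp437`, but on its own it is not valid UTF-8.

```python
# A made-up byte string (not from the dataset): 0x9C is the pound sign in cp437.
raw = b"Profit rose to \x9c5 mn"

# Decoding with the matching codec recovers the text.
decoded = raw.decode("cp437")
print(decoded)

# The same bytes are not valid UTF-8, so a strict decode raises an error.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)
```

This is the kind of failure that passing `encoding=...` to `pd.read_csv` is meant to avoid.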

Also note that I apply the label encoder before the train/test split rather than after it. This is one of the few transformations where the order doesn't matter: we are only converting labels to numbers, so, in this case at least, it won't cause any data leakage.
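As a rough illustration of why this is safe, the sketch below mimics the mapping in plain Python, assuming (as `LabelEncoder` does) that classes are numbered in sorted order. The `labels` list is a made-up sample. The mapping depends only on which label names exist, not on any statistic computed from the text.

```python
# Made-up sample of labels; in the notebook the real column is df["sentiment"].
labels = ["neutral", "negative", "positive", "neutral"]

# Number the distinct classes in sorted order (LabelEncoder sorts its classes too).
classes = sorted(set(labels))
to_id = {name: i for i, name in enumerate(classes)}

encoded = [to_id[name] for name in labels]
decoded = [classes[i] for i in encoded]  # the inverse mapping

print(to_id)
print(encoded)
```

The resulting ids match the `y` column in the notebook: negative is 0, neutral is 1, positive is 2.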

In [4]:
df = pd.read_csv("/tmp/all-data.csv", 
                 encoding='cp437', 
                 header=None, 
                 names=["sentiment", "text"])
le = LabelEncoder()
df["y"] = le.fit_transform(df["sentiment"])
df

| | sentiment | text | y |
| --- | --- | --- | --- |
| 0 | neutral | According to Gran , the company has no plans t... | 1 |
| 1 | neutral | Technopolis plans to develop in stages an area... | 1 |
| 2 | negative | The international electronic industry company ... | 0 |
| 3 | positive | With the new production plant the company woul... | 2 |
| 4 | positive | According to the company 's updated strategy f... | 2 |
| ... | ... | ... | ... |
| 4841 | negative | LONDON MarketWatch -- Share prices ended lower... | 0 |
| 4842 | neutral | Rinkuskiai 's beer sales fell by 6.5 per cent ... | 1 |
| 4843 | negative | Operating profit fell to EUR 35.4 mn from EUR ... | 0 |
| 4844 | negative | Net sales of the Paper segment decreased to EU... | 0 |


In [9]:
import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp(df["text"].values[0])
print(doc)

for entity in doc.ents:
    print(entity.text, entity.label_)

According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .
Gran PERSON
Russia GPE


In [11]:
doc.ents

[Gran, Russia]

In [15]:
from tqdm.auto import tqdm
tqdm.pandas()

df["ents"] = df["text"].progress_map(lambda text: [(entity.text, entity.label_) 
                                          for entity in nlp(text).ents])

In [17]:
df.sample(5)

| | sentiment | text | y | ents |
| --- | --- | --- | --- | --- |
| 2961 | neutral | In the sinter plant , limestone and coke breez... | 1 | [] |
| 376 | positive | The disposal of Autotank will also strengthen ... | 2 | [(Autotank, ORG), (Aspo, ORG), (Gustav Nyberg,... |
| 2800 | neutral | An acquisition of TeliaSonera would be France ... | 1 | [(TeliaSonera, ORG), (France Telecom 's, ORG),... |
| 1498 | neutral | And when it has lifted the veil on the various... | 1 | [] |
| 522 | neutral | The order consists of capacity expansion , mai... | 1 | [] |


In [18]:
df["ent_types"] = df["ents"].progress_map(lambda x: set(ent[1] for ent in x))

The types of entities and their definitions can be seen [here](https://spacy.io/api/annotation#named-entities).

In [19]:
df.sample(5)

| | sentiment | text | y | ents | ent_types |
| --- | --- | --- | --- | --- | --- |
| 201 | positive | Finnish software developer Done Solutions Oyj ... | 2 | [(Finnish, NORP), (Done Solutions Oyj, ORG), (... | {NORP, ORG, MONEY, DATE} |
| 1025 | neutral | An additional amount , capped at EUR12m , is p... | 1 | [(2007, DATE)] | {DATE} |
| 1593 | neutral | s already good position in the technical build... | 1 | [(s, ORG), (Ostrobothnia, GPE)] | {ORG, GPE} |
| 4274 | neutral | The resignation will be in effect immediately . | 1 | [] | {} |
| 2688 | neutral | There are currently some ten shops selling Tik... | 1 | [(ten, CARDINAL), (Tikkurila, ORG), (Kazakhsta... | {CARDINAL, ORG, GPE} |


In [22]:
df.loc[1593, "text"].replace("market", "M")

's already good position in the technical building services M in Ostrobothnia .'

In [23]:
def replace_text(text, entities):
    for ent, ent_type in entities:
        text = text.replace(ent, ent_type)
        
    return text

df["format_text"] = df.progress_apply(lambda x: replace_text(x["text"], x["ents"]), axis=1)
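One caveat with `str.replace`: it substitutes every occurrence of the entity string, including occurrences inside other words. A very short entity such as `s` (which does appear among the extracted entities) would therefore rewrite every letter s in the sentence. A possible alternative, sketched here under the assumption that character offsets are stored for each entity (spaCy exposes them as `ent.start_char` and `ent.end_char`), is to replace by position. `replace_spans` is a hypothetical helper, and the offsets below are written by hand rather than produced by spaCy.

```python
def replace_spans(text, spans):
    """Replace each (start, end, label) character span with its label."""
    # Work from right to left so earlier offsets stay valid after each substitution.
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + label + text[end:]
    return text

text = "s already good position in the technical building services market in Ostrobothnia ."

# Hand-written stand-ins for the offsets spaCy would report.
gpe_start = text.index("Ostrobothnia")
spans = [(0, 1, "ORG"), (gpe_start, gpe_start + len("Ostrobothnia"), "GPE")]

# String-based replacement touches every "s", including those inside words.
naive = text.replace("s", "ORG").replace("Ostrobothnia", "GPE")
by_offset = replace_spans(text, spans)

print(naive)
print(by_offset)
```

The results reported in the rest of this section were produced with the string-based `replace_text`, so switching to offsets would change the numbers somewhat.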

In [24]:
df.sample(5)

| | sentiment | text | y | ents | ent_types | format_text |
| --- | --- | --- | --- | --- | --- | --- |
| 2107 | positive | Finnish-owned contract manufacturer of electro... | 2 | [(Finnish, NORP), (Elcoteq Hungary Kft, PERSON... | {PERSON, NORP, CARDINAL} | NORP-owned contract manufacturer of electronic... |
| 2292 | positive | Operating profit totaled EUR 17.7 mn compared ... | 2 | [(EUR, ORG), (17.7 mn, QUANTITY), (EUR, ORG), ... | {MONEY, ORG, QUANTITY, DATE} | Operating profit totaled ORG QUANTITY compared... |
| 4171 | neutral | Another problem is cola-flavoured long drinks . | 1 | [] | {} | Another problem is cola-flavoured long drinks . |
| 1163 | neutral | Aldata to Share Space Optimization Vision at A... | 1 | [(Apollo User Group, ORG), (2009, DATE)] | {ORG, DATE} | Aldata to Share Space Optimization Vision at O... |
| 1968 | neutral | It started with software that was capable of r... | 1 | [] | {} | It started with software that was capable of r... |


In [25]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split

In [26]:
train_df, test_df = train_test_split(df, stratify=df["y"], test_size=0.1)

In [27]:
tfidf_vectorizer = TfidfVectorizer(max_df=0.99,
                                   min_df=5,
                                   lowercase=True,
                                   stop_words='english')
train_tfidf = tfidf_vectorizer.fit_transform(train_df["format_text"].values)
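The `min_df` and `max_df` arguments prune the vocabulary by document frequency: an integer `min_df` is a minimum document count, and a float `max_df` is a maximum share of documents. The sketch below shows that filtering step in plain Python. It is an illustration of the idea, not scikit-learn's implementation; `filter_vocabulary` is a hypothetical helper, the documents are made up, and the thresholds are smaller than the ones used in the notebook so that the toy data shows an effect.

```python
from collections import Counter

# Made-up, already tokenized documents.
docs = [
    ["profit", "rose", "org"],
    ["profit", "fell", "org"],
    ["sales", "rose", "org"],
    ["sales", "fell", "org"],
]

def filter_vocabulary(docs, min_df, max_df):
    """Keep terms found in at least min_df documents (a count) and in
    at most a max_df share of the documents (a proportion)."""
    n_docs = len(docs)
    # Document frequency: the number of documents containing each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return sorted(
        term for term, count in doc_freq.items()
        if count >= min_df and count / n_docs <= max_df
    )

vocab = filter_vocabulary(docs, min_df=2, max_df=0.99)
print(vocab)
```

Here `org` is dropped because it occurs in every document, which is the kind of term a `max_df` just below 1 removes.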

In [28]:
train_tfidf

<4361x1451 sparse matrix of type '<class 'numpy.float64'>'
	with 36070 stored elements in Compressed Sparse Row format>

In [29]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

model = LogisticRegression(multi_class="multinomial")
model.fit(train_tfidf, train_df["y"])

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='multinomial', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [30]:
test_tfidf = tfidf_vectorizer.transform(test_df["format_text"])
test_preds = model.predict(test_tfidf)
accuracy_score(test_df["y"], test_preds)

0.7463917525773196

In [32]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize  # needs NLTK's Punkt tokenizer data

ps = PorterStemmer()
def stem_sentence(text):
    # Tokenize, stem each token, and join the stems back into one string.
    return " ".join([ps.stem(word) for word in word_tokenize(text)])

# Copy the splits first so the new column is set on independent DataFrames,
# which avoids pandas' SettingWithCopyWarning.
train_df = train_df.copy()
test_df = test_df.copy()
train_df["processed_text"] = train_df["format_text"].progress_map(stem_sentence)
test_df["processed_text"] = test_df["format_text"].map(stem_sentence)

In [33]:
tfidf_vectorizer = TfidfVectorizer(max_df=0.99,
                                   min_df=5,
                                   lowercase=True,
                                   stop_words='english')
train_tfidf = tfidf_vectorizer.fit_transform(train_df["processed_text"].values)

In [35]:
model = LogisticRegression(multi_class="multinomial")
model.fit(train_tfidf, train_df["y"])

test_tfidf = tfidf_vectorizer.transform(test_df["processed_text"])
test_preds = model.predict(test_tfidf)
accuracy_score(test_df["y"], test_preds)

0.7649484536082474

In [36]:
train_tfidf

<4361x1223 sparse matrix of type '<class 'numpy.float64'>'
	with 39328 stored elements in Compressed Sparse Row format>

## Ngrams

In [38]:
tfidf_vectorizer = TfidfVectorizer(max_df=0.99, 
                                   min_df=5,
                                   lowercase=True,
                                   stop_words='english',
                                   ngram_range=(1, 2) 
                                  )
train_tfidf = tfidf_vectorizer.fit_transform(train_df["processed_text"].values)

train_tfidf.shape

(4361, 2283)
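With `ngram_range=(1, 2)` the vectorizer counts pairs of adjacent tokens as well as single tokens, which is why the number of features grows. A minimal plain-Python sketch of how such features are formed follows; `word_ngrams` is a hypothetical helper and the tokens are made-up stemmed words.

```python
def word_ngrams(tokens, ngram_range=(1, 2)):
    """Return all n-grams of the token list for n in the inclusive range."""
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        # Slide a window of n tokens across the sequence.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

tokens = ["oper", "profit", "fell"]
print(word_ngrams(tokens))
```

Two-word features such as `oper profit` are what show up as bigrams in the fitted vocabulary.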

In [39]:
model = LogisticRegression(multi_class="multinomial")
model.fit(train_tfidf, train_df["y"])

test_tfidf = tfidf_vectorizer.transform(test_df["processed_text"])
test_preds = model.predict(test_tfidf)
accuracy_score(test_df["y"], test_preds)

0.7546391752577319

In [41]:
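# Rank features by coefficient, largest first, and keep the top 10 per class.
idxs = (-model.coef_).argsort(axis=-1)[:, :10]
# get_feature_names() was removed in scikit-learn 1.2;
# on newer versions use get_feature_names_out() instead.
words = tfidf_vectorizer.get_feature_names()
for class_id, idx in enumerate(idxs):
    print(le.inverse_transform([class_id]))
    print([words[j] for j in idx])
    print("="*10)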


['negative']
['decreas', 'fell', 'drop', 'lower', 'declin', 'loss', 'lay', 'staff', 'mn', 'cut']
['neutral']
['includ', 'disclos', 'stake', 'valu', 'rang', 'cardin oper', 'ha cardin', 'busi', 'approxim', 'publish']
['positive']
['increas', 'rose', 'improv', 'sign', 'grew', 'expand', 'effici', 'posit', 'doubl', 'award']
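The ranking in the last cell can be read as follows: for each class, sort the features by coefficient from largest to smallest and keep the first ten. A plain-Python sketch of the same idea, using made-up coefficients and a hypothetical `top_features` helper:

```python
# Made-up coefficients: one row per class, one column per feature.
coef = [
    [0.9, -0.4, 0.1],   # class 0
    [-0.2, 0.3, 0.0],   # class 1
    [-0.7, 0.1, 0.8],   # class 2
]
words = ["fell", "includ", "rose"]

def top_features(row, names, k):
    """Names of the k features with the largest coefficients in this row."""
    order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
    return [names[j] for j in order[:k]]

top = [top_features(row, words, k=2) for row in coef]
print(top)
```

A large positive coefficient means the feature pushes the prediction towards that class, which is why words like `fell` rank highly for the negative class.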
