<a href="https://colab.research.google.com/github/Satyamaadi/python/blob/master/OnePipelineManyClassifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook aims to give you a brief overview of performing text classification using Naive Bayes, Logistic Regression and Support Vector Machines. We will be using a dataset called "Economic news article tone and relevance" from Figure-Eight which consists of approximately 8000 news articles, which were tagged as relevant or not relevant to the US Economy. Our goal in this notebook is to explore the process of training and testing text classifiers for this problem, using this data set and two text classification algorithms: Multinomial Naive Bayes and Logistic Regression, implemented in sklearn.

In [1]:
!wget https://raw.githubusercontent.com/practical-nlp/practical-nlp-code/master/Ch4/Data/Full-Economic-News-DFE-839861.csv

--2022-06-20 16:43:44--  https://raw.githubusercontent.com/practical-nlp/practical-nlp-code/master/Ch4/Data/Full-Economic-News-DFE-839861.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12383529 (12M) [text/plain]
Saving to: ‘Full-Economic-News-DFE-839861.csv’


2022-06-20 16:43:45 (99.5 MB/s) - ‘Full-Economic-News-DFE-839861.csv’ saved [12383529/12383529]



In [2]:
import pandas as pd

In [7]:
df = pd.read_csv('Full-Economic-News-DFE-839861.csv',encoding='ISO-8859-1')

In [8]:
df.head()

Unnamed: 0,_unit_id,_golden,_unit_state,_trusted_judgments,_last_judgment_at,positivity,positivity:confidence,relevance,relevance:confidence,articleid,date,headline,positivity_gold,relevance_gold,text
0,842613455,False,finalized,3,12/5/15 17:48,3.0,0.64,yes,0.64,wsj_398217788,8/14/91,Yields on CDs Fell in the Latest Week,,,NEW YORK -- Yields on most certificates of dep...
1,842613456,False,finalized,3,12/5/15 16:54,,,no,1.0,wsj_399019502,8/21/07,The Morning Brief: White House Seeks to Limit ...,,,The Wall Street Journal Online</br></br>The Mo...
2,842613457,False,finalized,3,12/5/15 1:59,,,no,1.0,wsj_398284048,11/14/91,Banking Bill Negotiators Set Compromise --- Pl...,,,WASHINGTON -- In an effort to achieve banking ...
3,842613458,False,finalized,3,12/5/15 2:19,,0.0,no,0.675,wsj_397959018,6/16/86,Manager's Journal: Sniffing Out Drug Abusers I...,,,The statistics on the enormous costs of employ...
4,842613459,False,finalized,3,12/5/15 17:48,3.0,0.3257,yes,0.64,wsj_398838054,10/4/02,Currency Trading: Dollar Remains in Tight Rang...,,,NEW YORK -- Indecision marked the dollar's ton...


In [13]:
df.shape
df['relevance'].value_counts()/df.shape[0]

no          0.821375
yes         0.177500
not sure    0.001125
Name: relevance, dtype: float64

In [14]:
df = df[df['relevance']!='not sure']

In [15]:
df['relevance'] = df['relevance'].map({'yes':1,'no':0})

In [16]:
df.head(6)

Unnamed: 0,_unit_id,_golden,_unit_state,_trusted_judgments,_last_judgment_at,positivity,positivity:confidence,relevance,relevance:confidence,articleid,date,headline,positivity_gold,relevance_gold,text
0,842613455,False,finalized,3,12/5/15 17:48,3.0,0.64,1,0.64,wsj_398217788,8/14/91,Yields on CDs Fell in the Latest Week,,,NEW YORK -- Yields on most certificates of dep...
1,842613456,False,finalized,3,12/5/15 16:54,,,0,1.0,wsj_399019502,8/21/07,The Morning Brief: White House Seeks to Limit ...,,,The Wall Street Journal Online</br></br>The Mo...
2,842613457,False,finalized,3,12/5/15 1:59,,,0,1.0,wsj_398284048,11/14/91,Banking Bill Negotiators Set Compromise --- Pl...,,,WASHINGTON -- In an effort to achieve banking ...
3,842613458,False,finalized,3,12/5/15 2:19,,0.0,0,0.675,wsj_397959018,6/16/86,Manager's Journal: Sniffing Out Drug Abusers I...,,,The statistics on the enormous costs of employ...
4,842613459,False,finalized,3,12/5/15 17:48,3.0,0.3257,1,0.64,wsj_398838054,10/4/02,Currency Trading: Dollar Remains in Tight Rang...,,,NEW YORK -- Indecision marked the dollar's ton...
5,842613460,False,finalized,3,12/4/15 23:15,3.0,0.6783,1,1.0,wsj_905654974,11/23/11,"Stocks Fall Again; BofA, Alcoa Slide",,,"Stocks declined, as investors weighed slower-t..."


In [17]:
df = df[['text','relevance']]

In [18]:
df.head()

Unnamed: 0,text,relevance
0,NEW YORK -- Yields on most certificates of dep...,1
1,The Wall Street Journal Online</br></br>The Mo...,0
2,WASHINGTON -- In an effort to achieve banking ...,0
3,The statistics on the enormous costs of employ...,0
4,NEW YORK -- Indecision marked the dollar's ton...,1


In [24]:
from sklearn.feature_extraction import _stop_words

In [26]:
stopwords = _stop_words.ENGLISH_STOP_WORDS
def clean(doc): # doc is a string of text
    doc = doc.replace("</br>", " ") # This text contains a lot of <br/> tags.
    doc = "".join([char for char in doc if char not in string.punctuation and not char.isdigit()])
    doc = " ".join([token for token in doc.split() if token not in stopwords])
    # remove punctuation and numbers
    return doc

In [27]:
import sklearn
from sklearn.model_selection import train_test_split

In [28]:
X = df.text
y = df.relevance

In [29]:
X_train,X_test ,y_train,y_test = train_test_split(X,y,random_state=1)

In [30]:
from sklearn.feature_extraction.text import CountVectorizer

In [32]:
import string
vec = CountVectorizer(preprocessor=clean)
X_train_dtm = vec.fit_transform(X_train)

In [33]:
X_test_dtm = vec.fit_transform(X_test)

In [38]:
X_test_dtm.shape

(1998, 27292)

In [40]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB() # instantiate a Multinomial Naive Bayes model
%time nb.fit(X_train_dtm, y_train) # train the model(timing it with an IPython "magic command")
y_pred_class = nb.predict(X_train_dtm) # make class predictions for X_test_dtm

CPU times: user 18.4 ms, sys: 83 µs, total: 18.5 ms
Wall time: 45.9 ms


In [44]:
df.text[:10],df.relevance[:10],y_pred_class[:10]

(0    NEW YORK -- Yields on most certificates of dep...
 1    The Wall Street Journal Online</br></br>The Mo...
 2    WASHINGTON -- In an effort to achieve banking ...
 3    The statistics on the enormous costs of employ...
 4    NEW YORK -- Indecision marked the dollar's ton...
 5    Stocks declined, as investors weighed slower-t...
 6    TORONTO -- Royal Bank of Canada and Bank of Mo...
 7    Many people think that the monster of health-c...
 8    Sequenom Inc., a genomics-based biotechnology ...
 9    The U.S. dollar declined against most major fo...
 Name: text, dtype: object, 0    1
 1    0
 2    0
 3    0
 4    1
 5    1
 6    0
 7    0
 8    0
 9    1
 Name: relevance, dtype: int64, array([0, 1, 0, 0, 1, 1, 0, 0, 1, 0]))

In [55]:
print(nb.score(X_train_dtm,y_train))

0.8471550141832137
