*Predicting the **newsgroup category** of a news article, out of the 20 newsgroups in the dataset*

*Extracting feature vectors from a list of raw texts by passing them to text feature extractors such as **CountVectorizer** or **TfidfVectorizer** with custom parameters*

In [1]:
import pandas as pd #DataFrame manipulation library

#Loader for the 20 newsgroups dataset: about 18000 newsgroup posts on 20 topics
from sklearn.datasets import fetch_20newsgroups 
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from sklearn.linear_model import LogisticRegression

##### Transforming a processed Text to Document Term Matrix/ Term Document Matrix/ Term Frequency Document Matrix

In [2]:
ds_ng = fetch_20newsgroups(subset='train', shuffle=True, random_state=42) #load the training subset into ds_ng
#subset='train' because the dataset is split into two subsets: training (or development) and testing (for performance evaluation)
#The split between the train and test sets is based on whether a message was posted before or after a specific date
ds_ng.keys() #The loaded dataset is a dictionary-like Bunch object

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

##### Considering a small sample from dataset

In [3]:
ds_ng.data[15:16]

['From: mathew <mathew@mantis.co.uk>\nSubject: Re: <Political Atheists?\nOrganization: Mantis Consultants, Cambridge. UK.\nX-Newsreader: rusnews v1.01\nLines: 22\n\nkmr4@po.CWRU.edu (Keith M. Ryan) writes:\n> ( I am almost sure that Zyklon-B is immediate and painless method of \n> death. If not, insert soem other form. )\n> \n>         And, ethnic and minority groups have been killed, mutilated and \n> exterminated through out history, so I guess it was not unusual.\n> \n>         So, you would agree that the holocost would be allowed under the US \n> Constitution?  [ in so far, the punishment. I doubt they recieved what would \n> be considered a "fair" trial by US standards.\n\nDon\'t be so sure.  Look what happened to Japanese citizens in the US during\nWorld War II.  If you\'re prepared to say "Let\'s round these people up and\nstick them in a concentration camp without trial", it\'s only a short step to\ngassing them without trial.  After all, it seems that the Nazis originally\non

In [4]:
corpus = ['Look what happened to Japanese citizens in the US during\nWorld War II.',
          'If you\'re prepared to say "Let\'s round these people up and\nstick them in a concentration camp without trial"',
          'it\'s only a short step to\ngassing them without trial.',
          'After all, it seems that the Nazis originally\nonly intended to imprison the Jews;',
          'the Final Solution was dreamt up partly\nbecause they couldn\'t afford to run the camps because of the devastation\ncaused by Goering\'s Total War.']

In [5]:
count_vect = CountVectorizer(stop_words='english') #Initializing the vectorizer
                          #   stop_words='english' removes a predefined list of English stop words
                          #  ,strip_accents='ascii' strips accents from characters (this is not stemming; CountVectorizer does not stem)
                          #  ,token_pattern is the regular expression that defines a token; by default, runs of 2 or more word characters
                          #  ,analyzer='word' (default) tokenizes by word; 'char' or 'char_wb' builds character n-grams whose length is set by ngram_range
dtm = count_vect.fit_transform(corpus) # fit and transform
print(dtm.shape)
dtm

(5, 36)


<5x36 sparse matrix of type '<class 'numpy.int64'>'
	with 38 stored elements in Compressed Sparse Row format>

##### Sparse Matrix : 
* A matrix in which most values are 0 (a common rule of thumb is more than 90%)
* A matrix produced by a text transformation is sparse because each document contains only a small fraction of the words in the corpus
* A sparse matrix is not displayed cell by cell: it stores only the non-zero values together with their positions (row and column indices), which saves space
* Every cell that is not stored is implicitly 0
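
As a rough illustration of the Compressed Sparse Row format named in the output above, the sketch below stores only the non-zero cells of a small matrix in three lists. `to_csr` is a hypothetical helper written for this note, not a scipy or scikit-learn function, and the example matrix is made up.

```python
# Minimal pure-Python sketch of Compressed Sparse Row (CSR) storage.
# `to_csr` is a hypothetical helper, not part of scipy or scikit-learn.

def to_csr(dense_rows):
    """Return (data, indices, indptr) holding only the non-zero cells."""
    data, indices, indptr = [], [], [0]
    for row in dense_rows:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)    # the non-zero value itself
                indices.append(col)   # the column it sits in
        indptr.append(len(data))      # offset in `data` where the next row starts
    return data, indices, indptr


dense = [
    [0, 0, 2, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 3],
]
data, indices, indptr = to_csr(dense)
print(data)     # [2, 1, 3]
print(indices)  # [2, 0, 3]
print(indptr)   # [0, 1, 1, 3]
```

Only 3 of the 12 cells are stored; the zeros are implied by their absence.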

##### todense() :
* Rebuilds the full matrix from the stored indices: each stored value goes into its cell and every other cell is filled with 0
* Sparse Matrix : analogous to a zipped file
* Dense Matrix : analogous to the unzipped file
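
The expansion step can be sketched in plain Python as follows. `to_dense` is a hypothetical helper, and the three input lists are assumed to follow the CSR layout (values, column indices, row offsets); this is not the scipy implementation of `todense()`.

```python
# Minimal pure-Python sketch of expanding CSR storage back to a dense matrix.
# `to_dense` is a hypothetical helper, not the scipy implementation.

def to_dense(data, indices, indptr, n_cols):
    """Rebuild full rows: stored values go in their cells, the rest stay 0."""
    rows = []
    for r in range(len(indptr) - 1):
        row = [0] * n_cols                      # start from an all-zero row
        for k in range(indptr[r], indptr[r + 1]):
            row[indices[k]] = data[k]           # place each stored value
        rows.append(row)
    return rows


dense = to_dense(data=[2, 1, 3], indices=[2, 0, 3], indptr=[0, 1, 1, 3], n_cols=4)
for row in dense:
    print(row)
# [0, 0, 2, 0]
# [0, 0, 0, 0]
# [1, 0, 0, 3]
```

The dense form uses memory for every cell, so for a matrix the size of the full corpus it is usually better to keep the sparse form.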

In [6]:
#5x36 sparse matrix was transformed above
dtm.todense()

matrix([[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
        [0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
         0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1,
         1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0]])

*The **Term Frequency Matrix** generated above is called a **Bag of Words** because the position of each token/word in the document is not recorded*    
*The representation ignores word order, so two sentences containing the same words in a different order get identical vectors*
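
A small sketch of that order-insensitivity, using only the standard library. `bag_of_words` is a hypothetical helper that splits on whitespace, which is simpler than the tokenization CountVectorizer performs, and the two sentences are made up.

```python
from collections import Counter

# Hypothetical helper: lowercase, split on whitespace, count each word.
def bag_of_words(text):
    return Counter(text.lower().split())


a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")

print(a["the"])  # 2
print(a == b)    # True: same counts although the sentences mean different things
```

The two sentences mean different things but are indistinguishable to a model that sees only these counts.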

In [7]:
count_vect.vocabulary_
#Dictionary mapping each token to its column index in the matrix

{'look': 19,
 'happened': 12,
 'japanese': 16,
 'citizens': 4,
 'world': 35,
 'war': 34,
 'ii': 13,
 'prepared': 24,
 'say': 27,
 'let': 18,
 'round': 25,
 'people': 23,
 'stick': 31,
 'concentration': 5,
 'camp': 1,
 'trial': 33,
 'short': 28,
 'step': 30,
 'gassing': 10,
 'nazis': 20,
 'originally': 21,
 'intended': 15,
 'imprison': 14,
 'jews': 17,
 'final': 9,
 'solution': 29,
 'dreamt': 8,
 'partly': 22,
 'couldn': 6,
 'afford': 0,
 'run': 26,
 'camps': 2,
 'devastation': 7,
 'caused': 3,
 'goering': 11,
 'total': 32}
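
Entries such as 'couldn' and 'goering' follow from the default tokenization. The sketch below applies the default `token_pattern` documented for scikit-learn, `(?u)\b\w\w+\b`, to a lowercased sentence and numbers the tokens alphabetically, which matches the index order seen above. `tokenize` is a hypothetical helper, and stop-word removal is left out, so 'they' survives here although `stop_words='english'` would drop it.

```python
import re

# Default token_pattern documented for CountVectorizer:
# runs of two or more word characters, so one-letter pieces are dropped.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

# Hypothetical helper: lowercase, then apply the pattern (no stop-word removal).
def tokenize(text):
    return re.findall(TOKEN_PATTERN, text.lower())


tokens = tokenize("they couldn't afford Goering's Total War.")
print(tokens)
# ['they', 'couldn', 'afford', 'goering', 'total', 'war']

# Indices are assigned in alphabetical order of the tokens.
vocabulary = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}
print(vocabulary)
# {'afford': 0, 'couldn': 1, 'goering': 2, 'they': 3, 'total': 4, 'war': 5}
```

The apostrophe splits "couldn't" into "couldn" and "t", and the single-letter "t" and "s" do not match the pattern.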

In [8]:
ds_ng['target_names'] #News topics

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [9]:
ds_ng.target[:10] #Numerical Labels of the Target

array([ 7,  4,  4,  1, 14, 16, 13,  3,  2,  4])

##### Considering the entire data from dataset

##### Converting to a Bag of Words 
* Assign a fixed integer *ID to each word* occurring in any document of the training set (building a dictionary from words to integer indices)
* For each document *#i*, count the number of occurrences of each word *w* and store it in *X[i,j]* as the value of feature *#j*, where *j* is the index of the word *w* in the dictionary
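
The two steps above can be sketched in plain Python as follows. `build_vocabulary` and `count_matrix` are hypothetical helpers, the two documents are made up, and whitespace splitting stands in for the vectorizer's tokenization; the result is a dense list of lists rather than a sparse matrix.

```python
# Hypothetical helpers illustrating the two bag-of-words steps.

def build_vocabulary(docs):
    """Step 1: map every distinct word to a fixed integer index."""
    words = {word for doc in docs for word in doc.lower().split()}
    return {word: j for j, word in enumerate(sorted(words))}


def count_matrix(docs, vocab):
    """Step 2: X[i][j] = occurrences of word j in document i."""
    X = [[0] * len(vocab) for _ in docs]
    for i, doc in enumerate(docs):
        for word in doc.lower().split():
            X[i][vocab[word]] += 1
    return X


docs = ["the cat sat", "the cat saw the dog"]
vocab = build_vocabulary(docs)
X = count_matrix(docs, vocab)

print(vocab)  # {'cat': 0, 'dog': 1, 'sat': 2, 'saw': 3, 'the': 4}
print(X)      # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```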

##### Using **CountVectorizer**

In [10]:
count_vect = CountVectorizer(stop_words='english') #Initializing Vectorizer
x_train_counts = count_vect.fit_transform(ds_ng.data) # learns the vocabulary and transforms the raw-text corpus into a term-frequency matrix
print(x_train_counts.shape)
x_train_counts

(11314, 129796)


<11314x129796 sparse matrix of type '<class 'numpy.int64'>'
	with 1300729 stored elements in Compressed Sparse Row format>

*The 11314 training documents contain 129796 unique tokens after English stop words are removed*

##### Training the Classifier

In [11]:
clf = LogisticRegression().fit(x_train_counts, ds_ng.target) #initializing the model and fitting it on input and target

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


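*The Logistic Regression model is trained with the TF matrix as input and the numerical category labels as the target. The warning above means the solver reached its iteration limit before converging; as the message suggests, increasing max_iter or scaling the data addresses this*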

##### Predicting which of the 20 newsgroups (target_names) each new article belongs to

In [12]:
docs_new = ['I am very interested in your catalog, but E-mail to you bounces.',
            'I was wondering if any of you out there want to play outdoor games',
            'Too much\nhypocricy exists behind the walls of "God\'s house"',
            'Everything seems to work fine except for the\n> Windows Drivers']

In [13]:
x_new_counts = count_vect.transform(docs_new)
predicted = clf.predict(x_new_counts)
for doc,category in zip(docs_new, predicted):
    print('%r : %s' % (doc,ds_ng.target_names[category]))

'I am very interested in your catalog, but E-mail to you bounces.' : misc.forsale
'I was wondering if any of you out there want to play outdoor games' : rec.sport.baseball
'Too much\nhypocricy exists behind the walls of "God\'s house"' : soc.religion.christian
'Everything seems to work fine except for the\n> Windows Drivers' : comp.os.ms-windows.misc


##### Using **TfidfVectorizer**

In [14]:
count_vect_Tfidf = TfidfVectorizer(stop_words='english') #Initializing Vectorizer
x_train_counts_Tfidf = count_vect_Tfidf.fit_transform(ds_ng.data) # learns the vocabulary and transforms the raw-text corpus into a TF-IDF matrix
print(x_train_counts_Tfidf.shape)
x_train_counts_Tfidf

(11314, 129796)


<11314x129796 sparse matrix of type '<class 'numpy.float64'>'
	with 1300729 stored elements in Compressed Sparse Row format>
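
The matrix has the same shape and the same stored positions as the count matrix, but the values are now floating-point weights. A rough sketch of the weighting, assuming scikit-learn's documented defaults (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row): `tfidf_rows` is a hypothetical helper working on a small made-up count matrix, not the library code.

```python
import math

# Hypothetical helper: TF-IDF weighting of a dense count matrix, assuming
# smoothed idf = ln((1 + n_docs) / (1 + df)) + 1 and L2-normalized rows.
def tfidf_rows(counts):
    n_docs = len(counts)
    n_terms = len(counts[0])
    # df[j]: number of documents that contain term j
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + df[j])) + 1 for j in range(n_terms)]
    weighted_rows = []
    for row in counts:
        weighted = [row[j] * idf[j] for j in range(n_terms)]
        norm = math.sqrt(sum(w * w for w in weighted))
        weighted_rows.append([w / norm if norm else 0.0 for w in weighted])
    return weighted_rows


counts = [
    [1, 1, 0],  # term 0 occurs in every document, terms 1 and 2 in one each
    [1, 0, 1],
    [1, 0, 0],
]
weights = tfidf_rows(counts)
for row in weights:
    print([round(w, 3) for w in row])
```

In the first row both terms occur once, yet the rarer term receives the larger weight; this is the effect that separates TF-IDF from raw counts.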

In [15]:
clf_Tfidf = LogisticRegression().fit(x_train_counts_Tfidf, ds_ng.target) #initializing the model and fitting it on input and target

In [16]:
docs_news = ['I am very interested in your catalog, but E-mail to you bounces.',
             'I was wondering if any of you out there want to play outdoor games',
             'I was wondering if any of you out there want to sell or rent games',
             'Too much\nhypocricy exists behind the walls of "God\'s house"',
             'Everything seems to work fine except for the\n> Windows Drivers']

In [17]:
x_new_counts_Tfidf = count_vect_Tfidf.transform(docs_news)
predicted_Tfidf = clf_Tfidf.predict(x_new_counts_Tfidf)
for doc,category in zip(docs_news, predicted_Tfidf):
    print('%r : %s' % (doc,ds_ng.target_names[category]))

'I am very interested in your catalog, but E-mail to you bounces.' : misc.forsale
'I was wondering if any of you out there want to play outdoor games' : rec.sport.hockey
'I was wondering if any of you out there want to sell or rent games' : misc.forsale
'Too much\nhypocricy exists behind the walls of "God\'s house"' : soc.religion.christian
'Everything seems to work fine except for the\n> Windows Drivers' : comp.os.ms-windows.misc


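*TfidfVectorizer can give more specific predictions because it down-weights words that are common across documents. On these sample sentences the two models differ only on the "outdoor games" sentence (rec.sport.baseball with CountVectorizer, rec.sport.hockey with TfidfVectorizer), so confirming an accuracy gain would require scoring both models on the test subset*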