## Subject 2: Text NewsGroup Classification 
1. Download the newsgroups data set using the code below. 
2. Construct a text classifier that predicts the target variable (newsgroups.target) from the input data (newsgroups.data).
3. We will evaluate your classifier against a hold-out data set, so be sure to construct a classification function that can receive a single string.

In [1]:
from sklearn.datasets import fetch_20newsgroups
newsgroups = fetch_20newsgroups(subset='train')

In [2]:
#newsgroups.data[:4]

In [3]:
newsgroups.target[:2]

array([7, 4])

In [4]:
list(newsgroups.target_names)

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

## Extracting features from text files 
A text document is an ordered sequence of words. Before we can run machine learning algorithms on it, we need to convert each document into a numerical feature vector.
Scikit-learn provides a high-level component, ‘CountVectorizer’, which builds these feature vectors for us.

In [5]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts_vect = count_vect.fit_transform(newsgroups.data)
X_train_counts_vect.shape

(11314, 130107)

Calling 'count_vect.fit_transform(newsgroups.data)' learns the vocabulary dictionary and returns a document-term matrix of shape [n_samples, n_features].
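To make the document-term matrix concrete, here is a minimal sketch on a two-document toy corpus. The corpus and the names `toy_docs`, `toy_vect`, and `toy_counts` are illustrative assumptions, not part of the newsgroups data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (illustrative only, not part of the newsgroups data).
toy_docs = [
    "the car is fast",
    "the bike is fast and the bike is light",
]

toy_vect = CountVectorizer()
# fit_transform learns the vocabulary and returns a sparse matrix
# with one row per document and one column per vocabulary term.
toy_counts = toy_vect.fit_transform(toy_docs)

# vocabulary_ maps each term to its column index.
vocab = toy_vect.vocabulary_

print(toy_counts.shape)       # (number of documents, number of distinct terms)
print(sorted(vocab))          # the learned terms
print(toy_counts.toarray())   # raw counts per document and term
```

Each cell holds how often a term occurs in a document, so the row for the second document has a count of 2 in the column for "bike".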

TF: term frequency.

TF-IDF: term frequency times inverse document frequency.

In [6]:
from sklearn.feature_extraction.text import TfidfTransformer
Tfidf_transformer = TfidfTransformer()
X_train_tfidf = Tfidf_transformer.fit_transform(X_train_counts_vect)
X_train_tfidf.shape
#dimension of the Document-Term matrix

(11314, 130107)

## Running ML algorithms
Various algorithms can be used for text classification. We will start with the simplest one, Naive Bayes (NB).

In [7]:
from sklearn.naive_bayes import MultinomialNB
NBML = MultinomialNB().fit(X_train_tfidf, newsgroups.target)

In [8]:
from sklearn.pipeline import Pipeline
# Chain vectorizer, TF-IDF transformer and classifier into one estimator
text_Pipeline = Pipeline([('vect', CountVectorizer()),
                          ('tfidf', TfidfTransformer()),
                          ('clf', MultinomialNB()),
])
text_Pipeline = text_Pipeline.fit(newsgroups.data, newsgroups.target)

Performance of NB Classifier: Now we will test the performance of the NB classifier on the test set.

In [9]:
import numpy as np
newsgroups_test = fetch_20newsgroups(subset='test', shuffle=True)
predicted = text_Pipeline.predict(newsgroups_test.data)
np.mean(predicted == newsgroups_test.target)

0.7738980350504514

We have newsgroups, which is the training data set, and newsgroups_test, which is the testing data set.


## Support Vector Machines (SVM) 
Let’s try a different algorithm, a linear SVM trained with stochastic gradient descent, and see if we can get better performance.

In [10]:
from sklearn.linear_model import SGDClassifier
# loss='hinge' makes SGDClassifier train a linear SVM
text_clf_svm = Pipeline([('vect', CountVectorizer()),
                         ('tfidf', TfidfTransformer()),
                         ('clf-svm', SGDClassifier(loss='hinge', penalty='l2',
                                                   alpha=1e-3, random_state=42)),
])
_ = text_clf_svm.fit(newsgroups.data, newsgroups.target)
predicted_svm = text_clf_svm.predict(newsgroups_test.data)
np.mean(predicted_svm == newsgroups_test.target)

0.8240839086563994

## Grid Search
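Almost all classifiers have parameters that can be tuned to improve performance. Scikit-learn provides a very useful tool for this, ‘GridSearchCV’. Parameters of a pipeline step are addressed as 'step name', two underscores, 'parameter name', for example 'clf__alpha'.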

In [11]:
from sklearn.model_selection import GridSearchCV
parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': (True, False),
              'clf__alpha': (1e-2, 1e-3),}

gs_clf = GridSearchCV(text_Pipeline, parameters, n_jobs=-1)
gs_clf = gs_clf.fit(newsgroups.data, newsgroups.target)



In [12]:
gs_clf.best_score_
gs_clf.best_params_

{'clf__alpha': 0.01, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)}
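The grid search mechanics can be tried quickly on a toy corpus before running the full search, which is slow on the newsgroups data. The sketch below is an illustration under assumptions: the corpus, the labels, and the names `toy_pipeline`, `toy_grid`, and `toy_search` are made up, and `cv=2` is chosen only because the corpus is tiny.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus with two classes (illustrative only).
toy_texts = [
    "engine oil and brakes", "new tires for the car",
    "car engine repair", "brakes and tires",
    "goalie saved the puck", "hockey team scored",
    "puck hit the goalie", "hockey stick and puck",
]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

toy_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])

# Keys follow the <step name>__<parameter name> convention.
toy_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [1e-2, 1.0],
}

# Every parameter combination is scored with 2-fold cross-validation.
toy_search = GridSearchCV(toy_pipeline, toy_grid, cv=2)
toy_search.fit(toy_texts, toy_labels)

print(toy_search.best_params_)  # best combination found
print(toy_search.best_score_)   # its mean cross-validated accuracy
```

After fitting, `toy_search.best_estimator_` is a pipeline refitted on all the data with the best parameters, so it can be used directly for prediction.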

## Prepare the field 

In [13]:
import numpy as np
import re
import nltk
nltk.download('stopwords')
import pickle
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\rober\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [14]:
X,y=newsgroups.data,newsgroups.target

## Text Preprocessing 

In [15]:
documents = []
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

stemmer = WordNetLemmatizer()

for sen in range(0, len(X)):
    # Remove all the special characters
    document = re.sub(r'\W', ' ', str(X[sen]))
    
    # remove all single characters
    document = re.sub(r'\s+[a-zA-Z]\s+', ' ', document)
    
    # Remove a single character at the start of the document
    # ('^' must stay unescaped so that it anchors to the start of the string)
    document = re.sub(r'^[a-zA-Z]\s+', ' ', document)
    
    # Substituting multiple spaces with single space
    document = re.sub(r'\s+', ' ', document, flags=re.I)
    
    # Removing prefixed 'b'
    document = re.sub(r'^b\s+', '', document)
    
    # Converting to Lowercase
    document = document.lower()
    
    # Lemmatization
    document = document.split()

    document = [stemmer.lemmatize(word) for word in document]
    document = ' '.join(document)
    
    documents.append(document)

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\rober\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## Converting Text to Numbers 
Machines, unlike humans, cannot understand the raw text. Machines can only see numbers. Particularly, statistical techniques such as machine learning can only deal with numbers. Therefore, we need to convert our text into numbers.

Different approaches exist to convert text into the corresponding numerical form. The Bag of Words Model and the Word Embedding Model are two of the most commonly used approaches. In this article, we will use the bag of words model to convert our text to numbers.

### Bag of Words 

In [16]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(max_features=1500, min_df=5, max_df=0.7, stop_words=stopwords.words('english'))
X = vectorizer.fit_transform(documents).toarray()

## Finding TFIDF

Term frequency = (Number of Occurrences of a word)/(Total words in the document)

IDF(word) = Log((Total number of documents)/(Number of documents containing the word))
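The two formulas above can be checked by hand on a small example. The sketch below uses plain Python on a made-up, already tokenized corpus; the helper names `term_frequency` and `inverse_document_frequency` are hypothetical. Note that scikit-learn's TfidfTransformer, with its default settings, uses a smoothed IDF, ln((1 + n) / (1 + df)) + 1, and L2-normalizes each row, so its numbers will differ from this textbook version.

```python
import math

# Toy tokenized corpus (illustrative only).
docs = [
    ["car", "fast", "car"],
    ["bike", "fast"],
    ["bike", "light", "bike", "cheap"],
]

def term_frequency(word, doc):
    # (number of occurrences of the word) / (total words in the document)
    return doc.count(word) / len(doc)

def inverse_document_frequency(word, docs):
    # log((total number of documents) / (number of documents containing the word))
    containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / containing)

tf_car = term_frequency("car", docs[0])              # 2 of 3 words are "car"
idf_car = inverse_document_frequency("car", docs)    # "car" occurs in 1 of 3 documents
idf_fast = inverse_document_frequency("fast", docs)  # "fast" occurs in 2 of 3 documents

print(tf_car, idf_car, tf_car * idf_car)
print(idf_fast)
```

A word that appears in fewer documents gets a larger IDF, so "car" is weighted more heavily than "fast" here.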

In [17]:
# from sklearn.feature_extraction.text import TfidfTransformer
tfidfconverter = TfidfTransformer()
X = tfidfconverter.fit_transform(X).toarray()

## Training and Testing Sets 
As with any other supervised machine learning problem, we need to split our data into training and testing sets.

In [18]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

## Training Text Classification Model and Predicting Newsgroups
We will use the Random Forest algorithm to train our model.
To do so, we will use the RandomForestClassifier class from the sklearn.ensemble module.

In [19]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=1000, random_state=0)
classifier.fit(X_train, y_train) 

y_pred = classifier.predict(X_test)

## Evaluating the Model 
To evaluate the performance of a classification model such as the one that we just trained, we can use metrics such as the confusion matrix, F1 measure, and the accuracy.

In [20]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import seaborn as sn
# plt.title('Confusion Matrix')
# sn.heatmap(y_test, annot=True, vmin=0.0, vmax=100.0, fmt='.2f', cmap=cmap)
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))
print(accuracy_score(y_test, y_pred))

[[ 66   1   0   0   1   0   0   0   0   1   0   0   0   2   0  10   0   2
    4   1]
 [  1  91   9   4   3   5   0   2   0   3   0   0   6   3   1   0   0   0
    0   0]
 [  0   2  89   4   0   4   0   0   0   0   0   5   4   0   5   0   0   0
    0   0]
 [  0   4   5  85   2   2   7   2   1   3   1   0  10   2   1   2   1   0
    0   0]
 [  0   3   3  15  81   3   3   0   1   1   0   0   5   2   2   0   0   0
    1   0]
 [  1   7  16   1   0  85   0   1   0   0   0   0   4   0   4   1   0   0
    0   0]
 [  0   1   2   9   0   0  83   2   0   2   0   1   2   1   0   0   0   1
    1   0]
 [  1   1   0   2   0   2   6  90   4   0   2   2   2   2   2   0   0   0
    1   0]
 [  0   1   0   1   2   0   3   5 111   2   0   0   0   0   0   0   1   0
    2   0]
 [  0   1   1   0   1   1   0   0   0 104   7   1   1   1   1   0   0   0
    0   0]
 [  0   0   0   0   0   1   0   1   0  10 103   0   0   0   0   0   0   0
    0   0]
 [  0   1   0   1   2   2   0   0   0   0   0 118   1   2   1   0

## Saving and Loading the Model 

In [119]:
# Save the model
with open('text_classifier_model', 'wb') as picklefile:
    pickle.dump(classifier, picklefile)

# Load the model
with open('text_classifier_model', 'rb') as training_model:
    model_loaded = pickle.load(training_model)
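The pickle file above holds only the classifier, which expects the TF-IDF features produced by the fitted vectorizer and transformer, so those would have to be saved as well before new raw text could be classified. One possible approach, sketched below on toy data, is to pickle a whole Pipeline as a single object. The toy texts, the labels, and the names `toy_model`, `blob`, and `restored` are assumptions for illustration, and `pickle.dumps` is used in place of a file only to keep the example self-contained.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy training data (illustrative only).
toy_texts = ["engine and wheels", "goalie and puck",
             "brakes and engine", "puck and stick"]
toy_labels = [0, 1, 0, 1]

# The pipeline bundles feature extraction and the classifier together.
toy_model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])
toy_model.fit(toy_texts, toy_labels)

# Serialize to bytes and restore; pickle.dump / pickle.load work the same way with files.
blob = pickle.dumps(toy_model)
restored = pickle.loads(blob)

# The restored pipeline accepts raw strings and predicts like the original.
same = list(restored.predict(["engine"])) == list(toy_model.predict(["engine"]))
print(same)
```

Because the restored object contains the fitted vectorizer, it can be applied to raw strings without repeating any preprocessing setup.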

In [120]:
# import pandas as pd
# import seaborn as sn
# confusion_matrix = pd.crosstab(X, y_pred, rownames=['Actual'], colnames=['Predicted'], margins = True)
# sn.heatmap(confusion_matrix, annot=True)

## Classification Function

In [22]:
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

def text_classification_fct(information):
    X, y = information.data, information.target
    #Text Preprocessing
    documents = []

    stemmer = WordNetLemmatizer()

    for sen in range(0, len(X)):
   
        document = re.sub(r'\W', ' ', str(X[sen]))
    
        document = re.sub(r'\s+[a-zA-Z]\s+', ' ', document)
    
        document = re.sub(r'^[a-zA-Z]\s+', ' ', document)
       
        document = re.sub(r'\s+', ' ', document, flags=re.I)
    
        document = re.sub(r'^b\s+', '', document)
    
        document = document.lower()
    
        document = document.split()

        document = [stemmer.lemmatize(word) for word in document]
        document = ' '.join(document)
    
        documents.append(document)
        
    #Converting Text to Numbers
    
    vectorizer = CountVectorizer(max_features=1500, min_df=5, max_df=0.7, stop_words=stopwords.words('english'))
    X = vectorizer.fit_transform(documents).toarray()
    
    #Finding TFIDF
    tfidfconverter = TfidfTransformer()
    X = tfidfconverter.fit_transform(X).toarray()
    
    #Training and Testing Sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    
    #Training the text classification model and predicting newsgroups
    classifier = RandomForestClassifier(n_estimators=1000, random_state=0)
    # X_train and y_train are NumPy arrays, which have no to_frame() method
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    
    #Evaluating the Model
    
    print(confusion_matrix(y_test,y_pred))
    print(classification_report(y_test,y_pred))
    print(accuracy_score(y_test, y_pred))
    

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\rober\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
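The task statement asks for a classification function that can receive a single string, whereas the function above retrains and evaluates a model on a whole data set. A minimal sketch of such a function is shown below. The name `classify_text` and the toy training data are hypothetical; in this notebook one could presumably pass a fitted pipeline such as text_clf_svm together with newsgroups.target_names instead of the toy stand-ins.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def classify_text(text, model, target_names):
    """Return the predicted label name for a single raw string."""
    # predict expects an iterable of documents, so wrap the string in a list.
    label_index = model.predict([text])[0]
    return target_names[label_index]

# Toy stand-ins for the fitted pipeline and newsgroups.target_names (illustrative only).
toy_names = ["rec.autos", "rec.sport.hockey"]
toy_texts = [
    "the engine and the brakes need service",
    "the goalie saved the puck",
    "new tires and engine oil",
    "the team scored on the power play",
]
toy_targets = [0, 1, 0, 1]

toy_model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])
toy_model.fit(toy_texts, toy_targets)

result = classify_text("the goalie stopped the puck", toy_model, toy_names)
print(result)
```

Keeping the vectorizer inside the pipeline means the single string goes through exactly the same feature extraction as the training data.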
