### Activity 3.01: Developing End-to-End Text Classifiers

For this activity, you will build an end-to-end classifier that figures out whether a news article is political or not.

In [1]:
import re
import string
import warnings
from collections import Counter

import matplotlib as mpl
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

%matplotlib inline

# Download the NLTK resources used for tokenization and lemmatization
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/LNonyane/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/LNonyane/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/LNonyane/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


#### 2. Read the dataset and clean it.

In [2]:
review_data = pd.read_csv('data/news_political_dataset.csv')
review_data.head()

Unnamed: 0,headline,short_description,is_political
0,Will Smith Joins Diplo And Nicky Jam For The 2...,Of course it has a song.,0
1,Hugh Grant Marries For The First Time At Age 57,The actor and his longtime girlfriend Anna Ebe...,0
2,Jim Carrey Blasts 'Castrato' Adam Schiff And D...,The actor gives Dems an ass-kicking for not fi...,0
3,Julianna Margulies Uses Donald Trump Poop Bags...,"The ""Dietland"" actress said using the bags is ...",0
4,Morgan Freeman 'Devastated' That Sexual Harass...,"""It is not right to equate horrific incidents ...",0


Use a lambda function to extract tokens from each 'headline' of this DataFrame, lemmatize them, and join them back into a single string. The join function concatenates a list of words into one sentence. Before tokenizing, use a regular expression (re) to replace anything other than letters, digits, and whitespace with a blank space; the pattern also replaces underscores.

In [3]:
lemmatizer = WordNetLemmatizer()
review_data['cleaned_headline'] = review_data['headline'].apply(
    lambda x: ' '.join(
        lemmatizer.lemmatize(word.lower())
        for word in word_tokenize(re.sub(r'([^\s\w]|_)+', ' ', str(x)))
    )
)
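
To see what the regular expression in the cleaning step does on its own, here is a minimal sketch that uses only the standard library. The names `NON_ALNUM` and `strip_punctuation` and the sample headline are hypothetical, `str.split()` is a simplified stand-in for `word_tokenize`, and lemmatization is left out.

```python
import re

# Same pattern as in the cleaning step: one or more characters that are
# neither whitespace nor word characters, or an underscore.
NON_ALNUM = re.compile(r'([^\s\w]|_)+')


def strip_punctuation(text):
    """Replace punctuation runs with a space, lowercase, collapse whitespace."""
    spaced = NON_ALNUM.sub(' ', str(text))
    # Assumption: str.split() stands in for nltk's word_tokenize here.
    return ' '.join(token.lower() for token in spaced.split())


sample = "Trump's 2nd_term: What's Next?"  # hypothetical headline
print(strip_punctuation(sample))
```

Note that apostrophes are replaced as well, so a possessive such as "Trump's" becomes two tokens.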

#### 3. Create a TF-IDF matrix from the cleaned headlines.

In [4]:
review_data['cleaned_headline']

0        will smith join diplo and nicky jam for the 20...
1          hugh grant marries for the first time at age 57
2        jim carrey blast castrato adam schiff and demo...
3        julianna margulies us donald trump poop bag to...
4        morgan freeman devastated that sexual harassme...
                               ...                        
69500    girl with the dragon tattoo india release canc...
69501    maria sharapova stunned by victoria azarenka i...
69502    giant over patriot jet over colt among most im...
69503    aldon smith arrested 49ers linebacker busted f...
69504    dwight howard rip teammate after magic loss to...
Name: cleaned_headline, Length: 69505, dtype: object

In [5]:
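tfidf_model = TfidfVectorizer(max_features=500)  # keep the 500 most frequent terms
# fit_transform returns a sparse matrix; todense() converts it to a dense one
tfidf_df = pd.DataFrame(tfidf_model.fit_transform(review_data['cleaned_headline']).todense())
# Column indices follow the alphabetically sorted vocabulary
tfidf_df.columns = sorted(tfidf_model.vocabulary_)
tfidf_df.head()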

Unnamed: 0,000,10,11,2014,2015,2016,24,abortion,about,actually,...,work,worker,world,worst,would,wrong,year,york,you,your
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.395334,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
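
The weights in this table can be approximated by hand. The following is a rough sketch, not scikit-learn's implementation: it assumes the vectorizer's default settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2-normalised rows), splits on whitespace, and keeps every term instead of limiting the vocabulary. The function name `tfidf_rows` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return (vocabulary, rows) using smoothed IDF and L2-normalised rows."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: the number of documents each term appears in.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        weights = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows


docs = ["trump win vote", "trump tweet", "hugh grant marries"]  # toy corpus
vocab, rows = tfidf_rows(docs)
for doc, row in zip(docs, rows):
    print(doc, [round(w, 3) for w in row])
```

In the first toy document, "trump" receives a lower weight than "win" because it also appears in the second document, which is the behaviour TF-IDF is designed to produce.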


In [6]:
review_data['is_political'].value_counts()

0    36766
1    32739
Name: is_political, dtype: int64

#### 4. Divide the data into training and validation sets.

In [7]:
# Use sklearn's train_test_split function to create training and validation sets.
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(
    tfidf_df, review_data['is_political'],
    test_size=0.2,
    random_state=42,
    stratify=review_data['is_political'],  # preserve the class ratio in both sets
)

##### Logistic Regression Classifier

In [8]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()  # logistic regression model instance
logreg.fit(X_train, y_train)  # train the model
predicted_labels = logreg.predict(X_valid)
logreg.predict_proba(tfidf_df)[:, 1]  # probability of the political class for every headline

array([0.07850945, 0.17472651, 0.95455872, ..., 0.06039255, 0.25981654,
       0.22160182])

Use the crosstab function of pandas to compare the predictions of our classification model with the actual classes ('is_political', in this case) of the headlines in the validation set.

In [9]:
#review_data['predicted_labels'] = predicted_labels # create feature 'predicted_labels' in df review_data and assign predicted_labels values
pd.crosstab(y_valid, predicted_labels)

col_0,0,1
is_political,Unnamed: 1_level_1,Unnamed: 2_level_1
0,6652,701
1,1267,5281
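
As a reading aid for the crosstab, the sketch below copies the four logistic regression counts into a dictionary keyed by `(actual, predicted)` and derives the validation accuracy. The variable names are illustrative only.

```python
# Counts copied from the logistic regression crosstab.
# Rows are actual classes, columns are predicted classes.
confusion = {
    (0, 0): 6652,  # actual non-political, predicted non-political
    (0, 1): 701,   # actual non-political, predicted political
    (1, 0): 1267,  # actual political, predicted non-political
    (1, 1): 5281,  # actual political, predicted political
}

total = sum(confusion.values())
correct = confusion[(0, 0)] + confusion[(1, 1)]  # the diagonal
accuracy = correct / total
print(f"validation rows: {total}, accuracy: {accuracy:.3f}")
```

The diagonal cells are the correct predictions; the off-diagonal cells are the two kinds of error.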


##### Naive Bayes Classifier

In [10]:
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()  # Gaussian naive Bayes model instance
nb.fit(X_train, y_train)  # train the model
predicted_labels = nb.predict(X_valid)
nb.predict_proba(tfidf_df)[:, 1]

array([0.00000000e+000, 7.03653397e-033, 1.00000000e+000, ...,
       1.92238545e-171, 2.08783703e-009, 4.26497627e-018])

In [11]:
#review_data['predicted_labels'] = predicted_labels
pd.crosstab(y_valid, predicted_labels)

col_0,0,1
is_political,Unnamed: 1_level_1,Unnamed: 2_level_1
0,6169,1184
1,1284,5264


##### KNN Classifier

In [12]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=3) # instance of KNeighborsClassifier class

knn.fit(X_train, y_train) # train the model

#review_data['predicted_labels_knn'] = knn.predict(tfidf_df) # predicting on independent variables

pd.crosstab(y_valid, knn.predict(X_valid))

col_0,0,1
is_political,Unnamed: 1_level_1,Unnamed: 2_level_1
0,6282,1071
1,2064,4484


##### Decision Tree Classifier

In [13]:
from sklearn import tree
dtc = tree.DecisionTreeClassifier()  # base model with default settings
dtc = dtc.fit(X_train, y_train)  # train the model
predicted_labels = dtc.predict(X_valid)  # predict labels for the validation set

In [14]:
# compare results of the classification model with actual classes
pd.crosstab(y_valid, predicted_labels)

col_0,0,1
is_political,Unnamed: 1_level_1,Unnamed: 2_level_1
0,6086,1267
1,1370,5178


##### Random Forest Classifier

In [15]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=20, max_depth=4, max_features='sqrt', random_state=1)
rfc.fit(X_train, y_train)  # train the model
pd.crosstab(y_valid, rfc.predict(X_valid))

col_0,0,1
is_political,Unnamed: 1_level_1,Unnamed: 2_level_1
0,7103,250
1,3576,2972


##### XGB Classifier

In [16]:
from xgboost import XGBClassifier
xgb_clf = XGBClassifier(n_estimators=20, learning_rate=0.03,
                        max_depth=5, subsample=0.6,
                        colsample_bytree=0.6, reg_alpha=10,
                        seed=42)
xgb_clf.fit(X_train, y_train)
pd.crosstab(y_valid, xgb_clf.predict(X_valid))

col_0,0,1
is_political,Unnamed: 1_level_1,Unnamed: 2_level_1
0,7116,237
1,3471,3077


We have seen how to build end-to-end classifiers in phases. First, the text corpus was cleaned, tokenized, and lemmatized. Next, features were extracted using TF-IDF, and the dataset was divided into training and validation sets. Several classification models were then trained on the same features: logistic regression, naive Bayes, KNN, a decision tree, a random forest, and XGBoost. Finally, each model's performance on the validation set was compared using a confusion matrix, from which measures such as accuracy, precision, recall, and F1 score can be derived.
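
To compare the six models side by side, the confusion matrices reported in this notebook can be converted into summary metrics. The sketch below copies the counts from the crosstabs as `(TN, FP, FN, TP)` tuples and treats the political class as positive; the names `CONFUSION` and `summarize` are hypothetical helpers, not part of the original activity.

```python
# (TN, FP, FN, TP) copied from the validation crosstabs in this notebook.
CONFUSION = {
    "Logistic Regression": (6652, 701, 1267, 5281),
    "Naive Bayes": (6169, 1184, 1284, 5264),
    "KNN": (6282, 1071, 2064, 4484),
    "Decision Tree": (6086, 1267, 1370, 5178),
    "Random Forest": (7103, 250, 3576, 2972),
    "XGBoost": (7116, 237, 3471, 3077),
}


def summarize(tn, fp, fn, tp):
    """Derive standard metrics for the positive (political) class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


metrics = {name: summarize(*counts) for name, counts in CONFUSION.items()}
ranked = sorted(metrics.items(), key=lambda kv: kv[1]["accuracy"], reverse=True)
for name, m in ranked:
    print(f"{name:20s} acc={m['accuracy']:.3f} prec={m['precision']:.3f} "
          f"rec={m['recall']:.3f} f1={m['f1']:.3f}")
```

On these counts, logistic regression has the highest accuracy, while the random forest and XGBoost models, as configured here, are precise but miss more than half of the political headlines.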

In [19]:
word_importances = pd.DataFrame({'word':X_train.columns,'importance':xgb_clf.feature_importances_})
word_importances.sort_values('importance', ascending = False).head(4)

Unnamed: 0,word,importance
443,trump,0.336932
158,gop,0.079485
77,clinton,0.067346
346,republican,0.062968
