# Text classification 

Text classification is a type of supervised machine learning. The workflow is as follows. 

- First, you prepare hand-labeled outputs.
- Then, you preprocess the data, and build and train a model on it.
- After checking the model with cross-validation, you apply it to the entire dataset.

Note that this is a much simplified version of what I did for my dissertation chapter: the data are smaller and there is less exploration.
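The three steps can be sketched end to end on a toy example. This is only an illustration under stated assumptions: the documents, labels, and variable names (`labeled_texts`, `unlabeled_texts`, `pipeline`) are made up, and it uses k-fold cross-validation, whereas the notebook below checks its models on a single 70/30 train/test split.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Step 1: hypothetical hand-labeled documents (1 = sports, 0 = politics).
labeled_texts = [
    "the team won the game", "a great match and a late goal",
    "the coach praised the players", "fans cheered the final score",
    "the senate passed the bill", "voters went to the polls",
    "the mayor signed the budget", "the council debated the tax",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Hypothetical unlabeled documents that the trained model will label.
unlabeled_texts = ["the players won the match", "the senate debated the budget"]

# Step 2: preprocessing (bag of words) and the classifier in one pipeline,
# so the vocabulary is re-learned inside each cross-validation fold.
pipeline = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))

# Step 3a: estimate out-of-sample accuracy with 4-fold cross-validation.
cv_scores = cross_val_score(pipeline, labeled_texts, labels, cv=4)
print("Accuracy per fold:", cv_scores)

# Step 3b: refit on all labeled data, then label the remaining documents.
pipeline.fit(labeled_texts, labels)
predicted = pipeline.predict(unlabeled_texts)
print(predicted)
```

With only eight documents the fold scores are not meaningful; the point is the order of operations.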

## Load libraries

You don't need to load this many libraries. The code in this notebook only uses pandas and scikit-learn.

In [1]:
%matplotlib inline
import os
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import numpy as np # for large data 
import pandas as pd # for data manipulation and analysis
from subprocess import check_output
import matplotlib.pyplot as plt
import seaborn as sns
import gensim, nltk, re, string, xgboost, textblob, spacy

# nltk

from nltk import word_tokenize, sent_tokenize # for tokenization
from nltk.corpus import stopwords # for stop words

stop = stopwords.words('english')

from nltk.stem import LancasterStemmer, WordNetLemmatizer # stemmer and lemmatizer
from nltk import tokenize
from collections import Counter

import contractions
import inflect
from bs4 import BeautifulSoup

import unicodedata

from pattern.en import tag
from nltk.corpus import wordnet as wn

# models 

from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics, svm
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer # for bag of words, tfidf 
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn import decomposition, ensemble 

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
sns.set()

from gensim.test.utils import common_texts, get_tmpfile
from gensim.models import Word2Vec # for word2vec

from keras.preprocessing import text, sequence 
from keras import layers, models, optimizers

Using TensorFlow backend.


## Load data


In [2]:

aa_newspapers = pd.read_csv("/home/jae/PS239T/14_supervised-machine-learning/data/african_american_newspapers.csv")


In [14]:
aa_newspapers

| Unnamed: 0.1 | Unnamed: 0 | author | date | source | text | year | linked_progress | linked_hurt |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | | 1968-11-23 | Sun Reporter (1968-1979) | friday nov. at p.m. rev. l. s. rubin pastor ... | 1968 | 0 | 0 |
| 1 | 2 | | 1968-11-13 | Oakland Post (1968-1981) | we have a large building an ante bellum buildi... | 1968 | 1 | 1 |
| 2 | 3 | | 1968-11-13 | Oakland Post (1968-1981) | ktvu's 'televoters' were back to being pretty ... | 1968 | 0 | 1 |
| 3 | 4 | | 1968-12-18 | Oakland Post (1968-1981) | washington d.c. washington's appointed mayor ... | 1968 | 0 | 0 |
| 4 | 5 | | 1968-12-28 | Sun Reporter (1968-1979) | spokesmen for the congress of racial equality ... | 1968 | 1 | 0 |
| 5 | 6 | | 1968-12-18 | Oakland Post (1968-1981) | the lowell junior high school fire that caused... | 1968 | 0 | 1 |
| 6 | 7 | | 1968-10-09 | Oakland Post (1968-1981) | sacramento one of the first negro mayors in t... | 1968 | 1 | 0 |
| 7 | 8 | | 1968-10-31 | Sacramento Observer (1968-1975) | robert weaver head of the department of housin... | 1968 | 1 | 0 |
| 8 | 9 | | 1968-09-11 | Oakland Post (1968-1981) | the new zellerbach auditorium at the universit... | 1968 | 0 | 0 |
| 9 | 10 | Fleming, Tom | 1968-11-23 | Sun Reporter (1968-1979) | three san francisco police officers were shot ... | 1968 | 0 | 1 |


## Preprocessing

### Remove special characters, punctuation, numbers, and whitespace

In [3]:
# Strip newlines, commas, colons, slashes, backslashes, hyphens, and double quotes.
# The backslash is escaped in the pattern: a lone backslash is not a valid regular expression.
aa_newspapers['text'] = aa_newspapers['text'].str.replace(r'[\n,:/\\"-]', '', regex = True)
aa_newspapers['text'] = aa_newspapers['text'].str.replace(r'\d+', '', regex = True) # remove numbers
aa_newspapers['text'] = aa_newspapers['text'].str.strip() # remove leading and trailing whitespace
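To see what this cleaning does to a single string, here is a minimal sketch using only the standard `re` module. The names `STRIP_CHARS`, `DIGITS`, `clean_text`, and the sample string are made up for illustration and are not part of the notebook.

```python
import re

# Characters the notebook strips: newline, comma, colon, slash, backslash,
# double quote, and hyphen (placed last so it is read literally).
STRIP_CHARS = re.compile(r'[\n,:/\\"-]')
DIGITS = re.compile(r"\d+")

def clean_text(text):
    """Delete the listed characters and all digits, then trim outer whitespace."""
    text = STRIP_CHARS.sub("", text)
    text = DIGITS.sub("", text)
    return text.strip()

# Hypothetical sample string.
sample = ' San Francisco-Oakland: 53 percent said "no"\n'
print(clean_text(sample))
# prints: San FranciscoOakland  percent said no
```

Because the characters are deleted rather than replaced with a space, hyphenated words and words separated only by a newline are joined, which is why `FranciscoOakland` and `week.San` appear in the output below. Substituting a space instead of an empty string would keep them apart, at the cost of extra whitespace.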

In [4]:
aa_newspapers['text'][2]

"KTVU's 'televoters' were back to being pretty uptight this week.San Francisco's Mayor Alioto should not apologize to the Black Panthers for his remarks about the Sunday night bombings in the opinion of  percent.Teachers should not strike if violence is not curbed in high schools   percent to .Fiftythree percent did not agree with President Johnson's announcement on Vietnam; and  percent do not think the bombing halt will result in peace in Vietnam ... As a matter of fact you might say Don't give up the ship was the byword of viewers polled by Channel  in its weekly question and answer period.However the San FranciscoOakland TV station stressed the continuing nonscientific not necessarily representative of community opinion nature of its poll."

### Lower case

In [5]:
aa_newspapers['text'] = aa_newspapers['text'].str.lower()

In [6]:
aa_newspapers['text'][2]

"ktvu's 'televoters' were back to being pretty uptight this week.san francisco's mayor alioto should not apologize to the black panthers for his remarks about the sunday night bombings in the opinion of  percent.teachers should not strike if violence is not curbed in high schools   percent to .fiftythree percent did not agree with president johnson's announcement on vietnam; and  percent do not think the bombing halt will result in peace in vietnam ... as a matter of fact you might say don't give up the ship was the byword of viewers polled by channel  in its weekly question and answer period.however the san franciscooakland tv station stressed the continuing nonscientific not necessarily representative of community opinion nature of its poll."

## Classification

### Building and training a model

I adapted some code from [here](https://link.springer.com/chapter/10.1007/978-1-4842-2388-8_4#Sec2).

In [7]:
## Extract features 

In [8]:

vectorizer = CountVectorizer(
    min_df = 1,
    ngram_range = (1,2),
    max_features = 5000,
    binary = True)


In [9]:

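def dtm_train(data, text, column, date):
    """Build bag-of-words features and return a stratified 70/30 train/test split.

    data: DataFrame; text: name of the text column; column: name of the label
    column; date: name of the column to stratify on (here, 'year').
    """
    # Bag-of-words features from the global `vectorizer` defined above.
    # Note: the vocabulary is fit on all rows, including those that end up in the test set.
    features = vectorizer.fit_transform(data[text])

    response = data[column].values

    # Split into train/test datasets, keeping each year's share roughly equal in both.
    X_train, X_test, y_train, y_test = train_test_split(features, response,
                                                        test_size = 0.3,
                                                        random_state = 1234,
                                                        stratify = data[date])

    return X_train, y_train, X_test, y_test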
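`dtm_train` fits the vectorizer on every row before splitting, so the vocabulary is learned partly from documents that later serve as test data. If you want the test set to be fully unseen, one alternative is to split first and fit the vectorizer on the training rows only. The sketch below is an assumption about how that could look, not the method used in this notebook; `dtm_train_split_first` and the toy DataFrame are hypothetical.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

def dtm_train_split_first(data, text, column, strata, vectorizer):
    """Split rows first, then learn the vocabulary from the training rows only."""
    train, test = train_test_split(
        data, test_size=0.3, random_state=1234, stratify=data[strata])
    X_train = vectorizer.fit_transform(train[text])  # vocabulary learned here
    X_test = vectorizer.transform(test[text])        # reused, not refit
    return X_train, train[column].values, X_test, test[column].values

# Hypothetical toy data with the same column roles as the notebook.
toy = pd.DataFrame({
    "text": [
        "city council meets today", "mayor signs new budget",
        "school board votes tonight", "police chief speaks out",
        "church hosts community dinner", "voters back housing plan",
        "teachers discuss strike vote", "station polls local viewers",
        "pastor leads sunday service", "students rally for jobs",
    ],
    "label": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    "year": [1968] * 5 + [1969] * 5,
})

X_train, y_train, X_test, y_test = dtm_train_split_first(
    toy, "text", "label", "year", CountVectorizer())
print(X_train.shape, X_test.shape)
```

Words that occur only in the test documents are simply ignored by `transform`, which mirrors what happens when the model is later applied to new text.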

In [10]:


# linked progress
aa_X_train_lp, aa_y_train_lp, aa_X_test_lp, aa_y_test_lp = dtm_train(
    aa_newspapers, 'text', 'linked_progress', 'year')

# linked hurt
aa_X_train_lh, aa_y_train_lh, aa_X_test_lh, aa_y_test_lh = dtm_train(
    aa_newspapers, 'text', 'linked_hurt', 'year')

### Fit a model

In [11]:
# logistic regression
def fit_logistic_regression(X_train, y_train):
    model = LogisticRegression(max_iter = 4000)
    model.fit(X_train, y_train)
    return model

# random forest
def fit_random_forest(X_train, y_train):
    model = RandomForestClassifier(n_estimators = 200, max_depth = 3, random_state = 42)
    model.fit(X_train, y_train)
    return model 

# naive bayes
def fit_bayes(X_train, y_train):
    model = MultinomialNB()
    model.fit(X_train, y_train)
    return model

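def test_model(model, X_train, y_train, X_test, y_test):
    """Print the test-set confusion matrix and accuracy (X_train and y_train are unused)."""
    y_pred = model.predict(X_test)
    # Rows are true classes, columns are predicted classes.
    print(confusion_matrix(y_test, y_pred))
    print('Accuracy: ', model.score(X_test, y_test))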

In [12]:

## linked progress
aa_lp = fit_logistic_regression(aa_X_train_lp, aa_y_train_lp)
aa_lp_rf = fit_random_forest(aa_X_train_lp, aa_y_train_lp)
aa_lp_bayes = fit_bayes(aa_X_train_lp, aa_y_train_lp)

## linked hurt 
aa_lh = fit_logistic_regression(aa_X_train_lh, aa_y_train_lh)
aa_lh_rf = fit_random_forest(aa_X_train_lh, aa_y_train_lh)
aa_lh_bayes = fit_bayes(aa_X_train_lh, aa_y_train_lh)

### Cross-validation check

In [13]:
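# African American newspaper sample: evaluate on the held-out 30% test set
# Within each label, results print in this order: logistic regression, random forest, naive Bayes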

## linked progress
print("linked progress")
test_model(aa_lp, aa_X_train_lp, aa_y_train_lp, aa_X_test_lp, aa_y_test_lp)
test_model(aa_lp_rf, aa_X_train_lp, aa_y_train_lp, aa_X_test_lp, aa_y_test_lp)
test_model(aa_lp_bayes, aa_X_train_lp, aa_y_train_lp, aa_X_test_lp, aa_y_test_lp)

## linked hurt
print("linked hurt")
test_model(aa_lh, aa_X_train_lh, aa_y_train_lh, aa_X_test_lh, aa_y_test_lh)
test_model(aa_lh_rf, aa_X_train_lh, aa_y_train_lh, aa_X_test_lh, aa_y_test_lh)
test_model(aa_lh_bayes, aa_X_train_lh, aa_y_train_lh, aa_X_test_lh, aa_y_test_lh)

linked progress
[[206  18]
 [ 53  26]]
Accuracy:  0.7656765676567657
[[224   0]
 [ 79   0]]
Accuracy:  0.7392739273927392
[[166  58]
 [ 22  57]]
Accuracy:  0.735973597359736
linked hurt
[[241   5]
 [ 36  21]]
Accuracy:  0.8646864686468647
[[246   0]
 [ 57   0]]
Accuracy:  0.8118811881188119
[[213  33]
 [ 10  47]]
Accuracy:  0.858085808580858
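Accuracy alone can hide how a model treats the rarer positive class. The sketch below recomputes accuracy, precision, and recall from the confusion matrices printed above, which appear in the order the models were tested (logistic regression, random forest, naive Bayes). It uses plain Python; the names `reported` and `summarize` are made up for illustration.

```python
# Confusion matrices copied from the output above.
# scikit-learn's layout is [[TN, FP], [FN, TP]] (rows: true, columns: predicted).
reported = {
    ("linked_progress", "logistic regression"): [[206, 18], [53, 26]],
    ("linked_progress", "random forest"): [[224, 0], [79, 0]],
    ("linked_progress", "naive bayes"): [[166, 58], [22, 57]],
    ("linked_hurt", "logistic regression"): [[241, 5], [36, 21]],
    ("linked_hurt", "random forest"): [[246, 0], [57, 0]],
    ("linked_hurt", "naive bayes"): [[213, 33], [10, 47]],
}

def summarize(matrix):
    """Return (accuracy, precision, recall) for the positive class."""
    (tn, fp), (fn, tp) = matrix
    accuracy = (tn + tp) / (tn + fp + fn + tp)
    recall = tp / (tp + fn)
    # Precision is undefined when the model never predicts the positive class.
    precision = tp / (tp + fp) if (tp + fp) > 0 else float("nan")
    return accuracy, precision, recall

summary = {key: summarize(matrix) for key, matrix in reported.items()}
for (label, model), (acc, prec, rec) in summary.items():
    print(f"{label:16s} {model:20s} "
          f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f}")
```

The random forest predicts the negative class for every test document, so its accuracy is just the share of negatives in the test set (224/303 and 246/303) and its recall is 0. Naive Bayes has the highest recall on both labels (57/79 and 47/57), while logistic regression has the highest accuracy.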
