# Homework 10, Applying Machine Learning To Sentiment Analysis
# Matt Briskey

### 1. Explain the idea of bag-of-words model.

The bag-of-words model allows us to represent text as numerical feature vectors.  The model involves:

1. Tokenization: The first step is to break down the text into individual words or tokens. Punctuation marks and other non-alphanumeric characters are typically removed, and the text is usually converted to lowercase to avoid different representations of the same word due to capitalization.

2. Vocabulary construction: After tokenization, a unique vocabulary is constructed by collecting all the distinct words from the entire collection of documents. Each word is treated as a feature.

3. Vectorization: To create the numerical representation of a document, a vector is formed, with each element of the vector representing the occurrence or frequency of a word from the vocabulary in the document. 

This approach ignores the order and structure of sentences and instead treats each document as a "bag" of words. Bag-of-words feature vectors are said to be sparse: the unique words in any one document are only a small subset of the full vocabulary, so most entries in each vector are zero.
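The three steps above can be sketched in plain Python. This is a minimal illustration on a made-up three-sentence corpus, not the code used later in this notebook; the names `docs`, `tokenize`, `vocabulary`, and `vectorize` are hypothetical.

```python
import re
from collections import Counter

# Toy corpus (made up for illustration)
docs = [
    "The sun is shining",
    "The weather is sweet",
    "The sun is shining, the weather is sweet",
]

def tokenize(text):
    # Step 1: lowercase and keep only runs of word characters
    return re.findall(r"\w+", text.lower())

# Step 2: vocabulary = every distinct token, mapped to a column index
vocabulary = {
    word: idx
    for idx, word in enumerate(sorted({tok for doc in docs for tok in tokenize(doc)}))
}

def vectorize(text):
    # Step 3: one count per vocabulary word, in column order
    counts = Counter(tokenize(text))
    return [counts.get(word, 0) for word in vocabulary]

bag = [vectorize(doc) for doc in docs]
for row in bag:
    print(row)
```

Even in this tiny corpus the first two rows contain zeros; with a realistic vocabulary of thousands of words, almost every entry would be zero, which is the sparsity described above.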

### 2. What are the two methods to treat the meaningless frequently occurring words?

1. Term frequency-inverse document frequency (TF-IDF) - down-weights frequently occurring words that carry little meaning. TF-IDF is a numerical weight that reflects how important a word is to a document relative to a collection of documents (the corpus). It combines the frequency of the word in the current document (term frequency) with the rarity of the word across the corpus (inverse document frequency). Words that appear often in one document but rarely elsewhere in the corpus are treated as more informative about that document's content or sentiment. Words that occur in many documents are usually less informative, because they are common across topics and sentiments.

2. Stop-word removal - Stop words are words that are extremely common in all sorts of texts and carry little or no information that helps distinguish between classes of documents. Examples of stop words are "is", "and", "has", and "like".
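Both methods can be illustrated on a toy corpus. The sketch below is an assumption-laden example, not the notebook's pipeline: the corpus and the three-word stop list are made up, and the weighting uses the smoothed form idf(t) = ln((1 + n) / (1 + df(t))) + 1 (to my knowledge the default in scikit-learn's `TfidfVectorizer`) without the length normalization that scikit-learn applies afterwards.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration)
docs = [
    "the sun is shining",
    "the weather is sweet",
    "the sun is shining and the weather is sweet",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# --- Method 1: TF-IDF down-weighting ---
# Document frequency: in how many documents each term appears
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    # Smoothed inverse document frequency
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(tokens):
    # Raw term count times idf, without length normalization
    return {term: count * idf(term) for term, count in Counter(tokens).items()}

weights = tfidf(tokenized[0])

# --- Method 2: stop-word removal ---
stop_words = {"and", "is", "the"}  # hand-picked for this toy corpus

def remove_stop_words(tokens):
    return [tok for tok in tokens if tok not in stop_words]

filtered = remove_stop_words(tokenized[2])
```

In the first document, "the" and "is" occur in every document and get the minimum weight of 1.0, while "sun" and "shining" occur in only two of the three documents and are weighted higher. Stop-word removal is the blunter alternative: it drops the listed words entirely instead of shrinking their weight.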

### 3. Classify the documents in fetch_20newsgroups.

In [1]:
# Import the dataset

from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']
 
X, y = fetch_20newsgroups(categories=categories, shuffle=True, random_state=1, return_X_y=True)

In [2]:
# Create a dataframe with the data 

import pandas as pd

df = pd.DataFrame({'text': X, 'category': y})

df

      text                                                category
0     From: jaeger@buphy.bu.edu (Gregg Jaeger)\nSubj...   0
1     From: young@serum.kodak.com (Rich Young)\nSubj...   2
2     From: mdw33310@uxa.cso.uiuc.edu (Michael D. Wa...   3
3     Subject: Re: Looking for Tseng VESA drivers\nF...   1
4     From: cfaks@ux1.cts.eiu.edu (Alice Sanders)\nS...   2
...   ...                                                 ...
2252  From: kmr4@po.CWRU.edu (Keith M. Ryan)\nSubjec...   0
2253  From: luis.nobrega@filebank.cts.com (Luis Nobr...   1
2254  From: e_p@unl.edu (edgar pearlstein)\nSubject:...   3
2255  From: atterlep@vela.acs.oakland.edu (Cardinal ...   3


In [3]:
# Check the types of the data

df.dtypes

text        object
category     int64
dtype: object

In [4]:
# Get a count of each category

df.category.value_counts()

3    599
2    594
1    584
0    480
Name: category, dtype: int64

In [5]:
# Show the first raw text before preprocessing 

df.text[0][:500]

'From: jaeger@buphy.bu.edu (Gregg Jaeger)\nSubject: Re: The Inimitable Rushdie (Re: An Anecdote about Islam\nOrganization: Boston University Physics Department\nLines: 63\n\nIn article <1993Apr14.121134.12187@monu6.cc.monash.edu.au> darice@yoyo.cc.monash.edu.au (Fred Rice) writes:\n\n>>In article <C5C7Cn.5GB@ra.nrl.navy.mil> khan@itd.itd.nrl.navy.mil (Umar Khan) writes:\n\n>I just borrowed a book from the library on Khomeini\'s fatwa etc.\n\n>I found this useful passage regarding the legitimacy of the "fatwa'

In [6]:
import numpy as np

# Set print precision
np.set_printoptions(precision=2)

### Cleaning the data

In [7]:
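# Regex to clean HTML-style markup and punctuation from the raw text, keeping emoticons

import re
def preprocessor(text):
    # Strip anything in angle brackets (HTML tags; this also removes the
    # bracketed message IDs in the newsgroup headers)
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons such as :) ;-( =D before punctuation is removed
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)',
                           text)
    # Lowercase, collapse non-word characters to spaces, then append the
    # emoticons (with the '-' nose removed) at the end
    text = (re.sub(r'[\W]+', ' ', text.lower()) +
            ' '.join(emoticons).replace('-', ''))
    return text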

In [8]:
# Show the first text after preprocessing

preprocessor(df.text[0][:500])

'from jaeger buphy bu edu gregg jaeger subject re the inimitable rushdie re an anecdote about islam organization boston university physics department lines 63 in article darice yoyo cc monash edu au fred rice writes in article khan itd itd nrl navy mil umar khan writes i just borrowed a book from the library on khomeini s fatwa etc i found this useful passage regarding the legitimacy of the fatwa'

In [9]:
# Use the Regex preprocessor to clean all of the texts

df['text'] = df['text'].apply(preprocessor)

In [10]:
# Show a sample of the cleaned text

df.head()

   text                                                category
0  from jaeger buphy bu edu gregg jaeger subject ...   0
1  from young serum kodak com rich young subject ...   2
2  from mdw33310 uxa cso uiuc edu michael d walke...   3
3  subject re looking for tseng vesa drivers from...   1
4  from cfaks ux1 cts eiu edu alice sanders subje...   2


### Processing documents into tokens

In [11]:
# Download stopwords

import nltk

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\16145\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [12]:
# Tokenizer

def tokenizer(text):
    return text.split()

In [13]:
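# stopwords, PorterStemmer

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

stop = stopwords.words('english')

# Define the tokenizer_porter function: split on whitespace, then stem each word
def tokenizer_porter(data):
    return [porter.stem(word) for word in data.split()]

# Apply the tokenizer_porter function to each text in the 'text' column
tokenized_texts = df['text'].apply(tokenizer_porter)

# Remove stopwords from each tokenized text.
# Note: the tokens are already stemmed at this point, so a stop word whose stem
# differs from its listed form will not be filtered here.
# filtered_texts is for inspection only; the pipeline below tokenizes the
# cleaned text itself and applies its own stop-word list.
filtered_texts = tokenized_texts.apply(lambda words: [w for w in words if w not in stop])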


### Training a logistic regression model for document classification

In [14]:
# Split the data into train and test sets (50/50)

X_train, X_test, y_train, y_test = train_test_split(
    df['text'], df['category'], test_size=0.5, random_state=1)

In [15]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV

# Term frequency inverse document frequency (TF-IDF)

tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None,
                        stop_words='english'
                        )

param_grid = [{
  'vect__tokenizer': [tokenizer, tokenizer_porter],
  'clf__penalty': ['l1', 'l2']},
]

lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(random_state=0, solver='liblinear'))])

gs_lr_tfidf = GridSearchCV(lr_tfidf,
                           param_grid,
                           scoring='accuracy',
                           cv=5,
                           verbose=1,
                           n_jobs=-1)

In [16]:
%%time

import warnings
warnings.filterwarnings("ignore", category=UserWarning)

gs_lr_tfidf.fit(X_train, y_train)

Fitting 5 folds for each of 4 candidates, totalling 20 fits
Wall time: 22.5 s


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('vect',
                                        TfidfVectorizer(lowercase=False,
                                                        stop_words='english')),
                                       ('clf',
                                        LogisticRegression(random_state=0,
                                                           solver='liblinear'))]),
             n_jobs=-1,
             param_grid=[{'clf__penalty': ['l1', 'l2'],
                          'vect__tokenizer': [<function tokenizer at 0x00000205451B05E0>,
                                              <function tokenizer_porter at 0x00000205451B0790>]}],
             scoring='accuracy', verbose=1)

In [17]:
print('Best parameter set: %s ' % gs_lr_tfidf.best_params_)
print('CV Accuracy: %.3f' % gs_lr_tfidf.best_score_)

Best parameter set: {'clf__penalty': 'l2', 'vect__tokenizer': <function tokenizer_porter at 0x00000205451B0790>} 
CV Accuracy: 0.942


In [18]:
clf = gs_lr_tfidf.best_estimator_
print('Test Accuracy: %.3f' % clf.score(X_test, y_test))

Test Accuracy: 0.945


### Model inspection

In [19]:
# Pull the fitted logistic regression step out of the best pipeline
log_reg = clf.named_steps['clf']

In [20]:
log_reg.intercept_, log_reg.coef_

(array([-1.94, -0.69, -1.04, -1.4 ]),
 array([[-0.05, -0.12, -0.07, ..., -0.03, -0.01,  0.03],
        [ 0.69,  0.27,  0.08, ..., -0.07, -0.04, -0.01],
        [-0.35,  0.01,  0.04, ...,  0.15,  0.07, -0.01],
        [-0.31, -0.18, -0.07, ..., -0.03, -0.02, -0.01]]))

In [21]:
# Mapping from each term to its column index in the TF-IDF matrix
vocab = clf.named_steps['vect'].vocabulary_

In [22]:
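# coef_ has shape (n_classes, n_features) = (4, 18647); reshape(-1) flattens it
# row by row, so positions 0..18646 hold the coefficients for class 0
coefs_series = pd.Series(log_reg.coef_.reshape(-1), name='Coefficients')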

In [23]:
vocab_series = pd.Series(vocab, name='Vocabulary')

In [24]:
coefs_series

0       -0.045698
1       -0.120636
2       -0.071563
3       -0.000987
4       -0.000237
           ...   
74583   -0.015923
74584   -0.009579
74585   -0.030197
74586   -0.019083
74587   -0.014871
Name: Coefficients, Length: 74588, dtype: float64

In [25]:
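# Match each term's column index against the flattened coefficient index.
# Vocabulary indices only span the first row of coef_, so every term is paired
# with its class-0 coefficient.
word_df = pd.merge(
    vocab_series,
    coefs_series,
    left_on='Vocabulary',
    right_index=True
).drop('Vocabulary', axis=1).reset_index()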

In [26]:
word_df.sort_values('Coefficients').head(10)

        index  Coefficients
418   graphic     -0.956557
36     christ     -0.847417
360        ca     -0.809215
1254     imag     -0.794275
78     church     -0.714078
3755      msg     -0.700208
355      pitt     -0.697575
2188      sin     -0.697156
88      thank     -0.656602
356    gordon     -0.650394


In [27]:
word_df.sort_values('Coefficients').tail(10)

         index  Coefficients
3019   caltech      1.244766
5131    jaeger      1.254913
3170   livesey      1.334555
154          t      1.350427
3228  okcforum      1.390929
467    atheist      1.978709
1043   atheism      2.087479
1625     moral      2.315052
5130     islam      2.329774
3017     keith      3.017815
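Because the coefficient matrix was flattened before the merge, the two tables above most likely reflect only the first row of `coef_`, that is class 0 (which appears to be `alt.atheism`, assuming the categories are indexed alphabetically). That would explain why terms such as "atheism" and "atheist" rank highest. To inspect every class, each row of the matrix can be ranked separately. The sketch below shows the idea on made-up stand-in values; `vocabulary`, `coef`, and `top_terms` are hypothetical names, and with the fitted model the inputs would be `vocab` and the rows of `log_reg.coef_`.

```python
# Toy stand-ins (made-up values, not the fitted model):
# vocabulary maps term -> column index; coef has one row per class
vocabulary = {"atheism": 0, "graphic": 1, "church": 2}
coef = [
    [2.0, -0.9, -0.7],   # class 0
    [-0.5, 1.8, -0.4],   # class 1
]

def top_terms(class_idx, k=2):
    # Invert the vocabulary so column indices can be mapped back to terms
    index_to_term = {idx: term for term, idx in vocabulary.items()}
    row = coef[class_idx]
    # Rank column indices by coefficient, largest first
    ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
    return [(index_to_term[j], row[j]) for j in ranked[:k]]

for class_idx in range(len(coef)):
    print(class_idx, top_terms(class_idx))
```

Ranking per row keeps each class's coefficients paired with the right terms, instead of relying on the position of values in a flattened array.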
