# Natural Language Processing

The focus of this week was using the Naive Bayes classifier to label text.  In this assignment, we will first build our own Naive Bayes classifier on the text samples from the lectures.  Then, we will explore the algorithm with `sklearn`.  Recall our sentences dealing with **TV** and **radio** below.  


In [96]:
sents = ['TV programs are not interesting -- TV is annoying.',
'Kids like TV'
,'We receive TV by radio waves'
,'It is interesting to listen to the radio'
,'On the waves, kids programs are rare.'
,'The kids listen to the radio; it is rare.']

### Q1: Building a DataFrame

To begin, let's organize our sentences in a familiar `DataFrame`.  

In [2]:
import pandas as pd

In [3]:
data = pd.DataFrame(sents, columns =['sents'])

In [4]:
data.head()

Unnamed: 0,sents
0,TV programs are not interesting -- TV is annoy...
1,Kids like TV
2,We receive TV by radio waves
3,It is interesting to listen to the radio
4,"On the waves, kids programs are rare."


In [5]:
### GRADED
### PROBLEM I: Create a DataFrame with one column
## named "sents", containing the six sentences above.  Save your dataframe to 
## ans_1 below.

sents_df = pd.DataFrame(sents, columns =['sents'])
ans_1 = sents_df

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


### Q2: Vectorization

An important piece of the work in building a classifier will be using word counts in each class.  We will utilize the built-in `CountVectorizer` from sklearn to accomplish this task.  To begin, we create a **document term matrix** that will contain the count of each word in each sentence.
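To make the idea of a document term matrix concrete, here is a minimal pure-Python sketch that builds one by hand. It assumes the same tokenization rule as the default `CountVectorizer` (lowercase the text, keep runs of two or more word characters); the helper name `tokenize` is our own and is not part of sklearn.

```python
import re
from collections import Counter

sents = ['TV programs are not interesting -- TV is annoying.',
         'Kids like TV',
         'We receive TV by radio waves',
         'It is interesting to listen to the radio',
         'On the waves, kids programs are rare.',
         'The kids listen to the radio; it is rare.']

def tokenize(text):
    # Assumption: lowercase, then keep tokens of two or more word characters
    return re.findall(r'\b\w\w+\b', text.lower())

# One Counter of token counts per sentence
counts = [Counter(tokenize(s)) for s in sents]

# The sorted vocabulary fixes the column order
vocab = sorted(set().union(*counts))

# Dense document term matrix: one row per sentence, one column per word
dtm = [[c[w] for w in vocab] for c in counts]

# Number of non-zero cells, which is all a sparse format needs to store
nonzero = sum(1 for row in dtm for v in row if v)
print(len(dtm), len(vocab), nonzero)  # 6 20 38
```

Six rows, twenty columns, and 38 non-zero cells out of 120: most entries are zero, which is why a sparse representation is attractive for larger corpora.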

In [26]:
### GRADED
### Problem 2
## Create a document term matrix
## using the .fit_transform() method
## of the CountVectorizer on your DataFrame 
## From above.  Save your results to 
## ans_2 below
from sklearn.feature_extraction.text import CountVectorizer
cvect = CountVectorizer()

term_matrix = cvect.fit_transform(ans_1['sents'])
ans_2= term_matrix

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


In [27]:
type(ans_2)

scipy.sparse.csr.csr_matrix

In [28]:
ans_2

<6x20 sparse matrix of type '<class 'numpy.int64'>'
	with 38 stored elements in Compressed Sparse Row format>

### Q3: Back to array

The result of our `CountVectorizer` transformation is a sparse matrix.  This is convenient for storing massive arrays with many zero entries, as we would find with a larger corpus.  We want to examine the words and counts, so let's convert the sparse matrix back to a dense array using the `.toarray()` method.

In [29]:
### GRADED
### PROBLEM III
## After fitting your CountVectorizer
## convert the object back to an array 
## with the .toarray() method.  Save your 
## results to ans_3
# ans_2 is the sparse document term matrix from Problem 2;
# no new CountVectorizer is needed to convert it
fit_wrds = ans_2
dtm = fit_wrds.toarray()
ans_3 = dtm

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


In [30]:
ans_3

array([[1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 2, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1],
       [0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 2, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0],
       [0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 2, 1, 0, 0, 0]],
      dtype=int64)

### Q4: As a `DataFrame`

Now, we can use our `CountVectorizer`'s `.get_feature_names()` method to apply titles to the columns of a `DataFrame`, and the values will be those from our array above.  (In scikit-learn 1.0 and later this method is called `.get_feature_names_out()`; the old name was removed in version 1.2.)  This time we will write a function that takes in our original text and returns a `DataFrame` with the counts of each individual word in each document, and the words as the column headings.  See the image below for a plotted version of your final DataFrame using seaborn's `heatmap`.

![](table.png)

In [31]:
### GRADED
### PROBLEM IV
## Complete the function 
## make_dtmdf below
def make_dtmdf(sents):
    '''
    This function will take in a list
    of sentences and return a document
    term matrix as a DataFrame.
    '''
    data = pd.DataFrame(sents, columns =['sents'])
    cvect = CountVectorizer()
    fit_wrds = cvect.fit_transform(data['sents'])
    dtm = fit_wrds.toarray()
    
    return pd.DataFrame(dtm, columns = cvect.get_feature_names())

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


In [32]:
make_dtmdf(sents)

Unnamed: 0,annoying,are,by,interesting,is,it,kids,like,listen,not,on,programs,radio,rare,receive,the,to,tv,waves,we
0,1,1,0,1,1,0,0,0,0,1,0,1,0,0,0,0,0,2,0,0
1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,1,0,0
2,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,1,1,1
3,0,0,0,1,1,1,0,0,1,0,0,0,1,0,0,1,2,0,0,0
4,0,1,0,0,0,0,1,0,0,0,1,1,0,1,0,1,0,0,1,0
5,0,0,0,0,1,1,1,0,1,0,0,0,1,1,0,2,1,0,0,0


### Q5: Trimming the Vocabulary

As we can see, our text has some unnecessary words.  For example, whether a sentence contains the word "to" tells us very little about whether it should be classified as TV or radio.  These are called *stop words*, and the `CountVectorizer` has an argument `stop_words` that we can utilize to eliminate this uninformative vocabulary.  Let's add this element to our function.
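As a rough illustration of what stop word removal does to the vocabulary, the sketch below filters tokens against a small stop list. The set `STOP_WORDS` is a hand-picked assumption containing only the nine words that matter for our six sentences; sklearn's built-in `'english'` list is much longer.

```python
import re

# Hypothetical, hand-picked stop list for this example only
STOP_WORDS = {'are', 'by', 'is', 'it', 'not', 'on', 'the', 'to', 'we'}

sents = ['TV programs are not interesting -- TV is annoying.',
         'Kids like TV',
         'We receive TV by radio waves',
         'It is interesting to listen to the radio',
         'On the waves, kids programs are rare.',
         'The kids listen to the radio; it is rare.']

def tokenize(text, stop_words=frozenset()):
    # Lowercase, keep tokens of two or more word characters, drop stop words
    tokens = re.findall(r'\b\w\w+\b', text.lower())
    return [t for t in tokens if t not in stop_words]

vocab = sorted({t for s in sents for t in tokenize(s, STOP_WORDS)})
print(vocab)
```

Under this assumption the vocabulary shrinks from 20 words to 11, leaving only words that plausibly carry information about the topic.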

In [35]:
### GRADED
### PROBLEM V
def make_dtmdf(sents, stopwords = True):
    '''
    This function will take in a list
    of sentences and return a document
    term matrix as a DataFrame. By default,
    it will remove the stopwords from the texts.
    '''
    # Only the vectorizer differs between the two cases
    if stopwords:
        cvect = CountVectorizer(stop_words = 'english')
    else:
        cvect = CountVectorizer()
        
    data = pd.DataFrame(sents, columns =['sents'])
    fit_wrds = cvect.fit_transform(data['sents'])
    dtm = fit_wrds.toarray()
    final_df = pd.DataFrame(dtm, columns = cvect.get_feature_names())
        
    return final_df

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


In [38]:
make_dtmdf(sents, True)

Unnamed: 0,annoying,interesting,kids,like,listen,programs,radio,rare,receive,tv,waves
0,1,1,0,0,0,1,0,0,0,2,0
1,0,0,1,1,0,0,0,0,0,1,0
2,0,0,0,0,0,0,1,0,1,1,1
3,0,1,0,0,1,0,1,0,0,0,0
4,0,0,1,0,0,1,0,1,0,0,1
5,0,0,1,0,1,0,1,1,0,0,0


### Q6: Adding a Label Column

Now, we extend our function to accept a list of labels for the sentences and return a `DataFrame` that also contains the label of each sentence in a column titled "labels".

In [154]:
### GRADED
def make_dtmdf(sents, labels, stopwords = True):
    '''
    This function will take in a list
    of sentences and a list of labels, and
    return a document term matrix as a DataFrame
    with the labels in a column named labels.
    By default, it will remove the stopwords from the texts.
    '''
    # Only the vectorizer differs between the two cases
    if stopwords:
        cvect = CountVectorizer(stop_words = 'english')
    else:
        cvect = CountVectorizer()
        
    data = pd.DataFrame(sents, columns =['sents'])
    fit_wrds = cvect.fit_transform(data['sents'])
    dtm = fit_wrds.toarray()
    final_df = pd.DataFrame(dtm, columns = cvect.get_feature_names())
        
    final_df['labels'] = labels
    return final_df

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


In [97]:
labels = ['TV', 'TV', 'radio', 'radio', 'TV', 'radio']

In [46]:
make_dtmdf(sents, labels, stopwords = True)

Unnamed: 0,annoying,interesting,kids,like,listen,programs,radio,rare,receive,tv,waves,labels
0,1,1,0,0,0,1,0,0,0,2,0,TV
1,0,0,1,1,0,0,0,0,0,1,0,TV
2,0,0,0,0,0,0,1,0,1,1,1,radio
3,0,1,0,0,1,0,1,0,0,0,0,radio
4,0,0,1,0,0,1,0,1,0,0,1,TV
5,0,0,1,0,1,0,1,1,0,0,0,radio


In [48]:
test = make_dtmdf(sents, labels, stopwords = True)

In [49]:
test[test['labels'] == 'TV']

Unnamed: 0,annoying,interesting,kids,like,listen,programs,radio,rare,receive,tv,waves,labels
0,1,1,0,0,0,1,0,0,0,2,0,TV
1,0,0,1,1,0,0,0,0,0,1,0,TV
4,0,0,1,0,0,1,0,1,0,0,1,TV


In [58]:
ans = (test.groupby('labels').sum() + 1).T

In [62]:
ans['radio'].sum()

22


### Q7: Computing our Priors

The priors are the probabilities that a sentence belongs to each class before we look at its words, that is, the fraction of training sentences carrying each label.  Assign the correct values for `P(TV)` and `P(radio)` to `p_tv` and `p_radio` respectively.
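In general, priors can be estimated directly from a list of labels. The sketch below shows one way to do this; the function name `class_priors` and the example labels are our own illustration and are not part of the assignment.

```python
from collections import Counter

def class_priors(labels):
    # Fraction of training examples that carry each label
    counts = Counter(labels)
    total = len(labels)
    return {c: n / total for c, n in counts.items()}

# Hypothetical labels, chosen only to illustrate the calculation
example_labels = ['spam', 'ham', 'ham', 'ham']
print(class_priors(example_labels))  # {'spam': 0.25, 'ham': 0.75}
```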

In [47]:
### GRADED
p_tv = 1/2
p_radio = 1/2

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


### Q8: Conditional Probabilities

Recall from our lectures that the following formula yields the conditional probability that a given word occurs given a label.

$$\hat{P}(w | c) = \frac{count(w, c) + 1}{count(c) + |vocabulary|}$$

First, we write a function to compute the numerator, `count(w, c) + 1`, for every word and class.  This should return a `DataFrame` as shown below.

![](ans_8.png)

In [155]:
### GRADED
def count_w_c(dtm):
    '''
    This function takes in a 
    document term matrix including
    a class labels column named labels.
    We return the add-one smoothed count
    of each word in each class, count(w, c) + 1,
    with words as rows and classes as columns.
    '''
    numerator = (dtm.groupby('labels').sum() + 1).T
       
    return numerator

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


### Q9: Counting labels

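Now, we turn our attention to the denominator.  Here, we want the `count(c)` term, the total number of words in each class.  Because the function below sums the add-one smoothed counts from Q8, the value it returns for each class is the full denominator, `count(c) + |vocabulary|`.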

In [78]:
### GRADED
def count_c(sents, labels, stopwords = True):
    '''
    This function takes in a list
    of sentences, and a list of class labels.  
    
    It returns the smoothed word totals for each class,
    count(c) + |vocabulary|, in the order TV, radio.
    '''
    if stopwords:
        cvect = CountVectorizer(stop_words = 'english')
    else:
        cvect = CountVectorizer()
        
    data = pd.DataFrame(sents, columns = ['sents'])
    fit_wrds = cvect.fit_transform(data['sents'])
    dtm = fit_wrds.toarray()
    final_df = pd.DataFrame(dtm, columns = cvect.get_feature_names())
        
    final_df['labels'] = labels
    
    # Summing count(w, c) + 1 over the vocabulary gives count(c) + |vocabulary|
    numerator = (final_df.groupby('labels').sum() + 1).T
    tv = numerator.sum()['TV']
    radio = numerator.sum()['radio']
    
    return tv, radio

In [79]:
count_c(sents, labels, stopwords = True)

(23, 22)

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


### Q10: Conditional Probabilities

Now, we want to compute the conditional probability of each word in a new sentence given each class.  This is the formula given by:

$$\hat{P}(w | c) = \frac{count(w, c) + 1}{count(c) + |vocabulary|}$$

We define a function named `conditional_probabilities` that will return a `DataFrame` with the smoothed counts and conditional probabilities, per class, for each word of the new sentence.
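Before writing the pandas version, it may help to check the formula by hand. The following is a minimal pure-Python sketch using the four training sentences from the docstring example below; the function name `smoothed_word_probs` is our own, and no stop word list is applied here since none of these words would be removed.

```python
import re
from collections import Counter

def smoothed_word_probs(sents, labels):
    """Return {class: {word: P(word | class)}} using add-one smoothing."""
    tokenized = [re.findall(r'\b\w\w+\b', s.lower()) for s in sents]
    vocab = sorted({w for doc in tokenized for w in doc})
    probs = {}
    for c in sorted(set(labels)):
        # count(w, c): occurrences of each word in documents labelled c
        counts = Counter(w for doc, lab in zip(tokenized, labels)
                         if lab == c for w in doc)
        # count(c) + |vocabulary|
        denom = sum(counts.values()) + len(vocab)
        probs[c] = {w: (counts[w] + 1) / denom for w in vocab}
    return probs

train = ['Chinese Beijing Chinese',
         'Chinese Chinese Shanghai',
         'Chinese Macao',
         'Tokyo Japan Chinese']
probs = smoothed_word_probs(train, ['c', 'c', 'c', 'j'])
print(round(probs['c']['chinese'], 6), round(probs['j']['chinese'], 6))
# 0.428571 0.222222
```

For class `c` the word "chinese" appears 5 times among 8 words with a vocabulary of 6, giving (5 + 1) / (8 + 6) = 6/14; for class `j` it is (1 + 1) / (3 + 6) = 2/9.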


In [161]:
### GRADED
### Complete the conditional_probabilities
## function below that takes in a list of sentences,
## a list of labels, and a list with a new sentence
## to be classified.  The function will return the conditional
## probabilities of the new sentence's words for each class.
def conditional_probabilities(sents, labels, new, stopwords = True):
    '''
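    Use the formula above and your earlier
    functions to compute the conditional probability
    for each word being in each class.  Columns a and b
    hold count(w, c) + 1 for the two classes in sorted
    label order; p_w_a and p_w_b hold P(w | c).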
    
    Example:
    sents = ['Chinese Beijing Chinese',
    'Chinese Chinese Shanghai', 
    'Chinese Macao', 
    'Tokyo Japan Chinese']
    labels = ['c', 'c', 'c', 'j']
    new = ['Chinese Chinese Chinese Tokyo Japan']
    
    ans = conditional_probabilities(sents, labels, new)
    
    print(ans) --> 	a	b	p_w_a	p_w_b
            chinese	6	2	0.428571	0.222222
            japan	1	2	0.071429	0.222222
            tokyo	1	2	0.071429	0.222222
    '''
    if stopwords:
        cvect = CountVectorizer(stop_words = 'english')
    else:
        cvect = CountVectorizer()
        
    fit_wrds = cvect.fit_transform(sents)
    dtm = fit_wrds.toarray()
    final_df = pd.DataFrame(dtm, columns = cvect.get_feature_names())   
    final_df['labels'] = labels
    
    # count(w, c) + 1 for every vocabulary word, one column per class
    ans = (final_df.groupby('labels').sum() + 1).T
    # Column sums give count(c) + |vocabulary| for each class
    totals = ans.sum()
    # Divide each class column by its own total (works for any two labels)
    probs = ans / totals
    
    # Words of the new sentence that are in the training vocabulary,
    # tokenized with the same settings as the fitted vectorizer
    analyze = cvect.build_analyzer()
    vocab = set(ans.index)
    index = sorted({w for doc in new for w in analyze(doc) if w in vocab})
    
    full_df = pd.concat([ans, probs], axis = 1)
    full_df.columns = ['a', 'b', 'p_w_a', 'p_w_b']
    
    return full_df.loc[index]

In [None]:
### GRADED
###
### AUTOGRADER TEST - DO NOT REMOVE
###


In [162]:
conditional_probabilities(['Chinese Beijing Chinese',
    'Chinese Chinese Shanghai', 
    'Chinese Macao', 
    'Tokyo Japan Chinese'], ['c', 'c', 'c', 'j'], ['Chinese Chinese Chinese Tokyo Japan'], stopwords = True)

Unnamed: 0,a,b,p_w_a,p_w_b
chinese,6,2,0.428571,0.222222
japan,1,2,0.071429,0.222222
tokyo,1,2,0.071429,0.222222


### Q11: Completing the Task

Once we have these conditional probabilities, we use them to score each class by combining the priors with the conditional probabilities for each word's occurrence.  For example, if we have a new sentence:

`I saw the radio on the TV`

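we compute two scores, one for `TV` and one for `radio`, using only the words that remain after stop word removal and that appear in our vocabulary.  Each score is proportional to the probability of the class given the sentence, and we assign the class with the larger score.  These would be calculated with: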

$$P(c_{tv}) \times P(radio | c_{tv}) \times P( TV | c_{tv})$$

$$P(c_{radio}) \times P(radio | c_{radio}) \times P( TV | c_{radio})$$
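The two products above can be evaluated with a short pure-Python sketch. It assumes a hand-picked stop list (`STOP_WORDS`, only the nine words relevant to our sentences, not sklearn's full list) and skips words that are not in the training vocabulary; the helper `class_score` is our own name.

```python
import re
from collections import Counter
from math import prod

# Hypothetical, hand-picked stop list for this example only
STOP_WORDS = {'are', 'by', 'is', 'it', 'not', 'on', 'the', 'to', 'we'}

sents = ['TV programs are not interesting -- TV is annoying.',
         'Kids like TV',
         'We receive TV by radio waves',
         'It is interesting to listen to the radio',
         'On the waves, kids programs are rare.',
         'The kids listen to the radio; it is rare.']
labels = ['TV', 'TV', 'radio', 'radio', 'TV', 'radio']

def tokenize(text):
    tokens = re.findall(r'\b\w\w+\b', text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

docs = [tokenize(s) for s in sents]
vocab = {w for d in docs for w in d}

def class_score(new_sentence, c):
    # count(w, c) over all training documents with label c
    counts = Counter(w for d, lab in zip(docs, labels) if lab == c for w in d)
    denom = sum(counts.values()) + len(vocab)   # count(c) + |vocabulary|
    prior = labels.count(c) / len(labels)
    # Words outside the training vocabulary are skipped
    words = [w for w in tokenize(new_sentence) if w in vocab]
    return prior * prod((counts[w] + 1) / denom for w in words)

new = 'I saw the radio on the TV'
scores = {c: class_score(new, c) for c in ('TV', 'radio')}
print(scores)
```

Under these assumptions the `TV` score is 1/2 × 1/23 × 4/23 ≈ 0.0038 and the `radio` score is 1/2 × 4/22 × 2/22 ≈ 0.0083.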



In [101]:
### GRADED
### PROBLEM 11
## Use your conditional probabilities
## and priors to compute the class of the 
## given sentence.  Which class should we
## label it, 'TV' or 'radio'?  Save your
## answer to ans_11 below
new = ['I saw the radio on the TV']
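# With stop words removed: TV    -> 1/2 * 1/23 * 4/23 ≈ 0.0038
#                          radio -> 1/2 * 4/22 * 2/22 ≈ 0.0083
ans_11 = 'radio'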

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


## Part II: `sklearn` and Spam

Shifting to an existing implementation from `sklearn`, we will work through a text classification project using a dataset from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection) dealing with SMS spam. Our goal is to classify messages as **spam** or **ham**. 

```
The collection is composed by just one text file, where each line has the correct class followed by the raw message. We offer some examples bellow: 

ham What you doing?how are you? 
ham Ok lar... Joking wif u oni... 
ham dun say so early hor... U c already then say... 
ham MY NO. IN LUTON 0125698789 RING ME IF UR AROUND! H* 
ham Siva is in hostel aha:-. 
ham Cos i was out shopping wif darren jus now n i called him 2 ask wat present he wan lor. Then he started guessing who i was wif n he finally guessed darren lor. 
spam FreeMsg: Txt: CALL to No: 86888 & claim your reward of 3 hours talk time to use from your phone now! ubscribe6GBP/ mnth inc 3hrs 16 stop?txtStop 
spam Sunshine Quiz! Win a super Sony DVD recorder if you canname the capital of Australia? Text MQUIZ to 82277. B 
spam URGENT! Your Mobile No 07808726822 was awarded a L2,000 Bonus Caller Prize on 02/09/03! This is our 2nd attempt to contact YOU! Call 0871-872-9758 BOX95QU 

Note: the messages are not chronologically sorted.

```

In [102]:
import pandas as pd

In [103]:
sms = pd.read_table('../resource/asnlib/public/docs/SMSSpamCollection.txt', names = ['spam', 'message'])

In [104]:
sms.head()

Unnamed: 0,spam,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


### Q12: train/test split

We begin by splitting our data into a training and a testing set.

In [107]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

In [132]:
### GRADED
### Problem 12
## Train/Test split. Recall our target is spam!
## Use sklearns train_test_split() with a random seed = 24
## to split the data into training and testing set.  Save your
## answers as X_train, X_test, y_train, y_test
X = sms.message
Y = sms.spam
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state = 24)

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


### Q13: Vectorizer Fit


In [133]:
### GRADED
### PROBLEM 13
## Fit a CountVectorizer to the 
## train dataset saved in Q12
cvect = CountVectorizer()
cvect.fit(X_train)
ans = cvect.get_feature_names()

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


### Q14: Transform Train

In [134]:
### GRADED
### PROBLEM 14
## Using your fit CountVectorizer
## Transform your training data and save the 
## results to X_train_cvect
X_train_cvect = cvect.transform(X_train)

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


### Q15: Transform Test

In [135]:
### GRADED
### PROBLEM 15
## Using the same CountVectorizer instance
## transform your testing data and save the results
## to X_test_cvect
X_test_cvect = cvect.transform(X_test)

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


### Q16: Building Model

Now, we implement the Multinomial Naive Bayes estimator from `sklearn`.  This is the recommended option for textual count features.

```
The multinomial Naive Bayes classifier is suitable for classification with
discrete features (e.g., word counts for text classification). The
multinomial distribution normally requires integer feature counts. However,
in practice, fractional counts such as tf-idf may also work.
```

In [118]:
from sklearn.naive_bayes import MultinomialNB

In [136]:
### GRADED
### PROBLEM 16
## Use the Instantiated Naive Bayes Classifier 
## and fit your training data
nbayes = MultinomialNB()
nbayes.fit(X_train_cvect, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


### Q17: Assessing the fit

In [137]:
### GRADED
### PROBLEM 17
## Use the .score() method
## to assess the fit on the testing
## data.  Save your score to ans_17 below
ans_17 = nbayes.score(X_test_cvect, y_test)

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


In [138]:
ans_17

0.9842067480258435

In [139]:
sms.spam.value_counts(normalize = True)

ham     0.865937
spam    0.134063
Name: spam, dtype: float64

### Q18: Excluding `stopwords`

We didn't remove stopwords as we did in our Naive Bayes classifier from Part I.  Let's now write a function that takes in a predictor array X and a target array y, and returns the score after fitting a `MultinomialNB` to the vocabulary with the stopwords removed.  

In [140]:
### GRADED
### PROBLEM 18
## Complete the function below
def nbayes_fit(X, y, stopwords = True):
    '''
    This function takes in a pandas series of text
    and one of a target variable.  
    The function fits a MultinomialNB with CountVectorized
    text having the stopwords removed.
    '''
    # Split the arguments that were passed in (y, not the global Y)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 24)
    
    if stopwords:
        cvect = CountVectorizer(stop_words = 'english')
    else:
        cvect = CountVectorizer()
    
    X_train_cvect = cvect.fit_transform(X_train)
    X_test_cvect = cvect.transform(X_test)
    
    nbayes = MultinomialNB()
    nbayes.fit(X_train_cvect, y_train)
    
    return nbayes.score(X_test_cvect, y_test)

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


In [141]:
nbayes_fit(sms.message, sms.spam, stopwords = True)

0.9849246231155779

### Q19: Incorporating `ngrams`

The `CountVectorizer` also has an `ngram_range` parameter that allows us to incorporate bigrams, trigrams, and any range of n-grams.  We set it with a `(min_n, max_n)` tuple; for example, `ngram_range=(1, 2)` keeps both single words and bigrams.  Now, let's incorporate **bigrams** into our model, and see if the score improves.
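To see what features a range of n-grams produces, here is a small pure-Python sketch. The function `ngrams` and its parameters `n_min` and `n_max` are our own illustration of the idea and are not sklearn's implementation.

```python
def ngrams(tokens, n_min=1, n_max=2):
    """Return all n-grams, joined by spaces, for n_min <= n <= n_max."""
    out = []
    for n in range(n_min, n_max + 1):
        # Slide a window of length n across the token list
        for i in range(len(tokens) - n + 1):
            out.append(' '.join(tokens[i:i + n]))
    return out

print(ngrams(['kids', 'like', 'tv']))
# ['kids', 'like', 'tv', 'kids like', 'like tv']
```

Each additional n-gram length adds many new columns to the document term matrix, so the vocabulary grows quickly as the upper bound increases.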

In [143]:
### GRADED
## Complete the function below
def nbayes_fit(X, y, stopwords = True, ngrams = 2):
    '''
    This function takes in a pandas series of text
    and one of a target variable.  
    The function fits a MultinomialNB with CountVectorized
    text having the stopwords removed.  We also use bigrams
    by incorporating the ngram_range of the CountVectorizer
    up through the ngrams argument value.  This defaults to 2 (bigrams).
    '''
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 24)
    
    if stopwords:
        cvect = CountVectorizer(stop_words = 'english', ngram_range=(1, ngrams))
    else:
        cvect = CountVectorizer(ngram_range=(1, ngrams))
    
    X_train_cvect = cvect.fit_transform(X_train)
    X_test_cvect = cvect.transform(X_test)
    
    nbayes = MultinomialNB()
    nbayes.fit(X_train_cvect, y_train)
    
    return nbayes.score(X_test_cvect, y_test)

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


In [145]:
nbayes_fit(X, Y, stopwords = True, ngrams = 3)

0.9849246231155779

### `tfidf` Vectorizer

In addition to the `CountVectorizer` we have an option to use the **term frequency inverse document frequency** vectorization approach.  Here, rather than using pure word counts, we down-weight words that appear in many documents and so give more weight to rarer words.  From the [user guide](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

Tf means term-frequency while tf–idf means term-frequency times inverse document-frequency: 

$$\text{tf-idf}(t, d) = tf(t, d) \times idf(t)$$

Using the TfidfTransformer’s default settings, `TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)` the term frequency, the number of times a term occurs in a given document, is multiplied with idf component, which is computed as

$$\text{idf}(t) = \log \frac{1 + n}{1 + df(t)} + 1$$
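The formula above can be checked by hand with a short sketch. It assumes the natural logarithm and the default `l2` normalization quoted above; the function names `smooth_idf` and `tfidf_l2` are our own, and the document frequencies are counted from the six TV and radio sentences in Part I without stop word removal.

```python
import math

def smooth_idf(n_docs, doc_freq):
    # idf(t) = ln((1 + n) / (1 + df(t))) + 1
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

def tfidf_l2(term_counts, doc_freqs, n_docs):
    # tf * idf for each term, then scale the vector to unit Euclidean length
    weights = {t: tf * smooth_idf(n_docs, doc_freqs[t])
               for t, tf in term_counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

# 'Kids like TV': "kids" occurs in 3 of the 6 sentences, "like" in 1, "tv" in 3
doc_freqs = {'kids': 3, 'like': 1, 'tv': 3}
vec = tfidf_l2({'kids': 1, 'like': 1, 'tv': 1}, doc_freqs, n_docs=6)
print(vec)
```

Although each word occurs once in the sentence, "like" receives the largest weight because it appears in the fewest documents.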



In [146]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [147]:
tfidif = TfidfVectorizer()

### Q20: Revising Function with `tfidf`



In [148]:
### GRADED
###PROBLEM 20
def tfidif_bayes(X, y, stopwords = True, ngrams = 2):
    '''
    This function takes in a pandas series of text
    and one of a target variable.  
    The function fits a MultinomialNB with TfidfVectorizer
    text having the stopwords removed.  We also use bigrams
    by incorporating the ngram_range of the TfidfVectorizer
    up through the ngrams argument value.  This defaults to 2 (bigrams).
    '''
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 24)
    
    if stopwords:
        cvect = TfidfVectorizer(stop_words = 'english', ngram_range=(1, ngrams))
    else:
        cvect = TfidfVectorizer(ngram_range=(1, ngrams))
    
    X_train_cvect = cvect.fit_transform(X_train)
    X_test_cvect = cvect.transform(X_test)
    
    nbayes = MultinomialNB()
    nbayes.fit(X_train_cvect, y_train)
    
    return nbayes.score(X_test_cvect, y_test)

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


In [153]:
tfidif_bayes(X, Y, stopwords = True, ngrams = 2)

0.9612347451543432

Not bad, although this score is slightly below the 0.985 we obtained with raw counts.  We can apply many of our earlier ideas around searching for the ideal hyperparameters in order to attempt improvement.  