# Week 12 (4/25-5/1)

## Text classification with Naive Bayes

* Text processing
* Naive Bayes classification
* Laplace smoothing
* Word clouds

## Project 

* [Text with naive Bayes](../Projects/text_with_naive_bayes/text_with_naive_bayes.ipynb)

## Resources

### 1. Word counts

### 2. Stop words

## Exercises

### Exercise 1

In [42]:
from collections import Counter
import re
from sklearn.model_selection import train_test_split
import pandas as pd


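# load the reviews and hold out half of them for testing
movies = pd.read_csv('movie_reviews.zip')
train_df, test_df = train_test_split(movies, test_size=0.5, random_state=1)
# concatenate all training reviews with the same sentiment into one string
grouped = train_df.groupby(by='sentiment')['review'].sum()

# count lowercase words separately for each sentiment
words = {}
for k in grouped.index:
    words[k] = Counter(re.findall(r"[a-z]+", grouped[k].replace("<br />", "").lower()))
# rows are words, columns are sentiments; a word missing from a class gets count 0
word_counts = pd.DataFrame(words).fillna(0).astype(int)
word_counts.to_csv("word_counts.csv")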

| | negative | positive |
|---|---|---|
| this | 20294 | 17756 |
| show | 1390 | 1791 |
| proved | 49 | 76 |
| to | 34325 | 33630 |
| be | 7241 | 6262 |
| ... | ... | ... |
| virginny | 0 | 2 |
| seeping | 0 | 1 |
| lode | 0 | 2 |
| spooner | 0 | 1 |


In [43]:
word_counts.head(10) 

| | negative | positive |
|---|---|---|
| this | 20294 | 17756 |
| show | 1390 | 1791 |
| proved | 49 | 76 |
| to | 34325 | 33630 |
| be | 7241 | 6262 |
| a | 39146 | 42334 |
| waste | 674 | 50 |
| of | 34273 | 38891 |
| minutes | 1099 | 426 |
| precious | 27 | 24 |


### Exercise 2

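Write a function `rev_probs()` that takes the text of a review as its argument and returns the logarithms of the probabilities that the review is positive and that it is negative, based on the training data. The function should use naive Bayes with Laplace smoothing to compute the probabilities. As in the example below, the two values can be left unnormalized, since only their relative size matters for classification.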
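
As a rough guide, the quantity to compute can be written as follows, under the usual naive Bayes assumption that words are independent given the class. The symbols are notation introduced here for illustration, not taken from the course materials: $c$ is a class (positive or negative), $w_1, \dots, w_n$ are the words of the review, $n_{w,c}$ is the number of times word $w$ occurs in training reviews of class $c$, and $V$ is the vocabulary.

$$
\log P(c \mid w_1, \dots, w_n) = \log P(c) + \sum_{i=1}^{n} \log P(w_i \mid c) - \log P(w_1, \dots, w_n),
\qquad
P(w \mid c) = \frac{n_{w,c} + 1}{\sum_{w' \in V} n_{w',c} + |V|}
$$

The term $\log P(w_1, \dots, w_n)$ is the same for both classes, so it can be dropped when the two results are only compared with each other. The $+1$ in the numerator is the Laplace smoothing: it keeps $P(w \mid c)$ positive for a word that never occurs in class $c$, so its logarithm stays finite.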

**Example:**

In [22]:
# sample review

review = """I saw this recent Woody Allen film because I\'m a fan of 
his work and I make it a point to try to see everything he does, though 
the reviews of this film led me to expect a disappointing effort. They were right. 
This is a confused movie that can\'t decide whether it wants to be a comedy, 
a romantic fantasy, or a drama about female mid-life crisis. It fails at all three.
<br /><br />Alice (Mia Farrow) is a restless middle aged woman who has married into 
great wealth and leads a life of aimless luxury with her rather boring husband and 
their two small children. This rather mundane plot concept is livened up with such 
implausibilities as an old Chinese folk healer who makes her invisible with some magic 
herbs, and the ghost of a former lover (with whom she flies over Manhattan). If these 
additions sound too fantastic for you, how about something more prosaic, like an affair 
with a saxophone player?<br /><br />I was never quite sure of what this mixed up muddle 
was trying to say. There are only a handful of truly funny moments in the film, 
and the endingis a really preposterous touch of Pollyanna.<br /><br />Rent \'Crimes and 
Misdemeanors\' instead, a superbly well-done film that suceeds in combining comedy with 
a serious consideration of ethics and morals. Or go back to "Annie Hall" or "Manhattan"."""

In [38]:
from sklearn.model_selection import train_test_split
import pandas as pd
import re
import numpy as np

movies = pd.read_csv('movie_reviews.zip')
train_df, test_df = train_test_split(movies, test_size=0.5, random_state=1)

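# Laplace smoothing: add 1 to every word count
wc = word_counts + 1

# drop stop words (stopwords.txt holds a comma-separated list)
with open("stopwords.txt") as f:
    stops = f.read().split(",")
wc = wc.drop(stops, errors="ignore")

# prior probability of each class, estimated from the training labels
class_probs = train_df["sentiment"].value_counts()/len(train_df)

# base-10 log of the smoothed probability of each word given the class
log_wc = np.log10(wc/wc.sum())

def rev_probs(rev):
    # keep only the words that appear in the training vocabulary
    words = [w for w in re.findall(r"[a-z']+", rev.replace("<br />", "").lower()) if w in log_wc.index]
    # sum of the word log-probabilities plus the log prior, for each class
    return sum([log_wc.loc[w] for w in words]) + np.log10(class_probs)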

In [39]:
rev_probs(review)

negative   -427.788529
positive   -429.078611
dtype: float64
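
The effect of Laplace smoothing is easier to check by hand on a very small data set. The following is a minimal sketch, not the course solution: the four toy reviews, the helper `tokenize()` and the function `toy_rev_probs()` are hypothetical names and data made up for illustration, and stop words are not removed.

```python
import re
from collections import Counter

import numpy as np
import pandas as pd

# Hypothetical toy training data (not taken from movie_reviews.zip).
train = pd.DataFrame({
    "review": ["a great great film", "a boring film", "great fun", "boring and bad"],
    "sentiment": ["positive", "negative", "positive", "negative"],
})


def tokenize(text):
    """Split text into lowercase alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


# Raw word counts: rows are words, columns are classes.
counts = pd.DataFrame({
    label: Counter(tokenize(" ".join(group["review"])))
    for label, group in train.groupby("sentiment")
}).fillna(0).astype(int)

# Laplace smoothing: add 1 to every count, then normalize each column.
smoothed = counts + 1
log_likelihood = np.log10(smoothed / smoothed.sum())

# Class priors estimated from the label frequencies.
log_prior = np.log10(train["sentiment"].value_counts(normalize=True))


def toy_rev_probs(text):
    """Return the unnormalized base-10 log-probability of each class."""
    words = [w for w in tokenize(text) if w in log_likelihood.index]
    return log_likelihood.loc[words].sum() + log_prior


scores = toy_rev_probs("great film")
print(scores)
```

In this toy data the word "great" never occurs in a negative review, so without the added 1 its negative-class probability would be 0 and the logarithm would be infinite. With smoothing both scores are finite, and the positive score is the larger of the two.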

### Exercise 3

Create a DataFrame in which every row corresponds to one post. The columns should list the name of the newsgroup, the post author, the post subject, and the body of the post. Here is a sample:

In [7]:

from zipfile import ZipFile
import pandas as pd
import io

with ZipFile("newsgroups.csv.zip", 'r') as zipped:
    txt = zipped.read("newsgroups.csv").decode(encoding='utf8', errors='ignore')
    
df = pd.read_csv(io.StringIO(txt))

In [8]:
df.head(5)

| | newsgroup | from | subject | body |
|---|---|---|---|---|
| 0 | rec.autos | gwm@spl1.spl.loral.com (Gary W. Mahan) | Re: Are BMW's worth the price? | >sure sounds like they got a ringer. the 325i... |
| 1 | sci.med | davec@ecst.csuchico.edu (Dave Childs) | Dental Fillings question | I have been hearing bad thing about amalgam de... |
| 2 | alt.atheism | "Robert Knowles" \<p00261@psilink.com> | Re: Islamic marriage? | >DATE: Tue, 6 Apr 1993 00:11:49 GMT\n>FROM: ... |
| 3 | rec.sport.baseball | sepinwal@mail.sas.upenn.edu (Alan Sepinwall) | Re: WFAN | In article \<1993Apr16.174843.28111@cabell.vcu.... |
| 4 | talk.religion.misc | rwd4f@poe.acc.Virginia.EDU (Rob Dobson) | Re: A Message for you Mr. President: How do yo... | In article \<visser.735284180@convex.convex.com... |


After creating the DataFrame, save it to a CSV file.
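
One possible way to approach this is sketched below. It rests on an assumption that is not stated in the exercise: that each raw post is a text message with `From:` and `Subject:` header lines, a blank line, and then the body. The two posts in `raw_posts` are hypothetical examples, and the real files may be laid out differently. The sketch writes to an in-memory buffer so that it runs on its own; in practice a file name would be passed to `to_csv()` instead.

```python
import email
import io

import pandas as pd

# Hypothetical raw posts keyed by newsgroup name (made up for illustration).
raw_posts = {
    "rec.autos": "From: alice@example.com (Alice)\nSubject: Re: Tires\n\nAny advice on winter tires?\n",
    "sci.med": "From: bob@example.com (Bob)\nSubject: Vitamin C\n\nDoes it help with colds?\n",
}

rows = []
for newsgroup, raw in raw_posts.items():
    # Parse the headers and the body of one post.
    msg = email.message_from_string(raw)
    rows.append({
        "newsgroup": newsgroup,
        "from": msg["From"],
        "subject": msg["Subject"],
        "body": msg.get_payload().strip(),
    })

posts = pd.DataFrame(rows)

# index=False keeps the row index out of the CSV, so reading the file
# back does not add an extra "Unnamed: 0" column.
buffer = io.StringIO()
posts.to_csv(buffer, index=False)

# Read the CSV text back to check the round trip.
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```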