# Your name: Bin Xie

## Names of people you worked with

# Section 1: Naive Bayes

Thanks to Lucas Champollion and Frans Adriaans.

In this exercise, you will be using a Naïve Bayes classifier to analyze movie reviews. This will require a few assumptions on our part:

-- First, we're assuming that every review can be classified as either positive or negative (no neutral reviews).

-- Second, documents can be represented as a bag of words, with sequential order being unimportant.

-- Third, the probabilities of two words x and y appearing in a document are independent (this is the naïve part of Naïve Bayes). This is saying that the probability of encountering the word "butter" is unchanged by seeing the word "peanut", even though we generally have the intuition that p(butter|peanut) is higher than, for example, p(butter|supernova).

We know that assumptions 2 and 3 are probably wrong, but they make things much simpler.
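
Taken together, the three assumptions lead to a compact decision rule. The formula below is the standard textbook form of Naïve Bayes, given as a sketch; the symbols $c$, $w_i$ and $n$ are notation introduced here for illustration and do not appear elsewhere in the exercise.

$$
\hat{c} \;=\; \underset{c \,\in\, \{\text{pos},\, \text{neg}\}}{\arg\max}\; P(c) \prod_{i=1}^{n} P(w_i \mid c)
$$

Here $w_1, \dots, w_n$ are the words (features) observed in a review. The product is where the independence assumption enters: each word contributes its own factor $P(w_i \mid c)$, regardless of which other words are present.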

The  first  thing  needed  for  a  classifier  is  a  set  of  labeled  reviews  to  use  to  train  the  model.  NLTK  has  a  corpus  of  2000  labeled  movie  reviews  that  should  work  well  for  this.  Run  the  cell  below  to  import  and  transform  the  reviews  into  a  convenient  form.

In [1]:
import  nltk
nltk.download('movie_reviews')
nltk.download('punkt')


[nltk_data] Downloading package movie_reviews to
[nltk_data]     /Users/binxie/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
[nltk_data] Downloading package punkt to /Users/binxie/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [3]:
import random
#  This  part  is  a  bit  complicated  but  basically  we're  taking  the  corpus  of  movie  reviews  and  transforming  them  into  a  useful  form.
from nltk.corpus import  movie_reviews
movies  =  [(list(movie_reviews.words(fileid)),
             category) for  category  in movie_reviews.categories() for  fileid  in movie_reviews.fileids(category)]
random.seed(8823)
random.shuffle(movies)


What  this  gives  us  is  a  list  of  pairs  in  the  form  (review,label).  These  pairs  are  a  Python  data  structure  called  a  tuple.  The  review  is  a  list  of  words,  and  the  label  is  a  string  that  either  says  'pos'  or  'neg'. Let's take a look at the first review.

In [4]:
# Each item in the list has the structure (review, label), where review is a list of words and label is a string
print(movies[0])

(['delicatessen', '(', 'directors', ':', 'marc', 'caro', '/', 'jean', '-', 'pierre', 'jeunet', ';', 'screenwriters', ':', 'gilles', 'adrien', '/', 'marc', 'caro', ';', 'cinematographer', ':', 'darius', 'khondji', ';', 'editor', ':', 'herve', 'schneid', ';', 'cast', ':', 'dominique', 'pinon', '(', 'louison', ')', ',', 'marie', '-', 'laure', 'dougnac', '(', 'julie', 'clapet', ')', ',', 'jean', '-', 'claude', 'dreyfus', '(', 'clapet', '-', 'the', 'butcher', ')', ',', 'karin', 'viard', '(', 'mademoiselle', 'plusse', ')', ',', 'ticky', 'holgado', '(', 'marcel', 'tapioca', ')', ',', 'anne', '-', 'marie', 'pisani', '(', 'madame', 'tapioca', ')', ',', 'jacques', 'mathou', '(', 'roger', ')', ',', 'rufus', '(', 'robert', 'kube', ')', ',', 'howard', 'vernon', '(', 'frog', 'man', ')', ',', 'edith', 'ker', '(', 'granny', ')', ',', 'boban', 'janevski', '(', 'young', 'rascal', ')', ',', 'mikael', 'todde', '(', 'young', 'rascal', ')', ',', 'chick', 'ortega', '(', 'postman', ')', ',', 'silvie', 'laguna
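
The (review, label) pairs described above can be taken apart with tuple unpacking. The following is a minimal sketch on a made-up list; `toy_movies` is a hypothetical stand-in for `movies`, so it runs without downloading the corpus.

```python
from collections import Counter

# Toy stand-in for the `movies` list: each item is a (review, label) tuple,
# where review is a list of word tokens and label is 'pos' or 'neg'.
toy_movies = [
    (["a", "great", "film"], "pos"),
    (["dull", "and", "predictable"], "neg"),
    (["great", "cast", ",", "great", "story"], "pos"),
]

# Tuple unpacking gives direct access to both parts of a pair.
first_review, first_label = toy_movies[0]

# Count how many reviews carry each label.
label_counts = Counter(label for _review, label in toy_movies)

print(first_label, len(first_review), dict(label_counts))
```

The same unpacking pattern applied to `movies[0]` would give the word list and the label of the first real review.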

### Question 1

This looks like gibberish, but: what movie is this a review of?  Is it positive or negative?  How can you tell?  (What words stand out to you?)

### Your answer to Question 1

The movie name is "delicatessen".

This review is negative.

In the review, we can find words and phrases such as "tasteless", "failed to reach my funny bone", and "left me mostly annoyed", which express a negative attitude.

## Question 2

Please write some Python code that prints out the movie review in a more visually appealing way, and then run it.

In [7]:
def format_review(movie):
    content = ' '.join(movie[0])
    attitude = movie[1]
    return content, attitude

content, attitude = format_review(movies[0])
print(content)
print(attitude)

delicatessen ( directors : marc caro / jean - pierre jeunet ; screenwriters : gilles adrien / marc caro ; cinematographer : darius khondji ; editor : herve schneid ; cast : dominique pinon ( louison ) , marie - laure dougnac ( julie clapet ) , jean - claude dreyfus ( clapet - the butcher ) , karin viard ( mademoiselle plusse ) , ticky holgado ( marcel tapioca ) , anne - marie pisani ( madame tapioca ) , jacques mathou ( roger ) , rufus ( robert kube ) , howard vernon ( frog man ) , edith ker ( granny ) , boban janevski ( young rascal ) , mikael todde ( young rascal ) , chick ortega ( postman ) , silvie laguna ( aurore interligator ) , howard vernon ( frog man ) ; runtime : 96 ; miramax / constellation / ugc / hatchette premiere ; 1991 - france ) reviewed by dennis schwartz a black comedy set in the near future in a boarding house run by a depraved butcher . the comedy is played more in comic strip style for entertaining value than for deeper satire , as it features mostly zany sophomor

Once  we've  got  the  training  data,  we  need  to  decide  what  our  feature  set  should  be.  First,  we  can  find  out  some  of  the  most  common  words  in  the  reviews.

In [8]:
frequencies  = nltk.FreqDist(w.lower() for  w  in movie_reviews.words())
most_common  = frequencies.most_common(100)  
#note:  the  syntax  of "most_common" varies  between  versions
for word in  most_common:
    print(word)

(',', 77717)
('the', 76529)
('.', 65876)
('a', 38106)
('and', 35576)
('of', 34123)
('to', 31937)
("'", 30585)
('is', 25195)
('in', 21822)
('s', 18513)
('"', 17612)
('it', 16107)
('that', 15924)
('-', 15595)
(')', 11781)
('(', 11664)
('as', 11378)
('with', 10792)
('for', 9961)
('his', 9587)
('this', 9578)
('film', 9517)
('i', 8889)
('he', 8864)
('but', 8634)
('on', 7385)
('are', 6949)
('t', 6410)
('by', 6261)
('be', 6174)
('one', 5852)
('movie', 5771)
('an', 5744)
('who', 5692)
('not', 5577)
('you', 5316)
('from', 4999)
('at', 4986)
('was', 4940)
('have', 4901)
('they', 4825)
('has', 4719)
('her', 4522)
('all', 4373)
('?', 3771)
('there', 3770)
('like', 3690)
('so', 3683)
('out', 3637)
('about', 3523)
('up', 3405)
('more', 3347)
('what', 3322)
('when', 3258)
('which', 3161)
('or', 3148)
('she', 3141)
('their', 3122)
(':', 3042)
('some', 2985)
('just', 2905)
('can', 2882)
('if', 2799)
('we', 2775)
('him', 2633)
('into', 2623)
('even', 2565)
('only', 2495)
('than', 2474)
('no', 2472)
('go

### Question 3


Which of these words do you think are especially common in movie reviews (vs. English texts in general)?  (Please discuss 5-10 words).

Which of these words do you think might indicate that the movie review is positive/negative?  (Please discuss 5-10 words.)

### Your answer to Question 3

For the common words in movie reviews, words like "not", "like", "than", "good", "most", "much", "well", "very" and "first" are used frequently. When people write movie reviews, they prefer to compare the movie with other movies. Thus, words like "than", "most", "first" are really common. Also, people tend to use words like "good", "well", "very", "not" to give a direct view on the movie.

Words like "good" and "well" usually indicate that the movie review is positive. Words like "not" usually indicate that the movie review is negative, since we learned from the paper that people tend to use "not" before some positive words.

At  this  point,  we  can  now  enter  the  feature  engineering  stage.  Normally  we  would  select  words  that  we  think  will  be  informative.  For  the  sake  of  the  example,  we  have  preselected  some  words  more  or  less  at  random.  For  now  it  is  OK  to  leave  these  as  is,  but  later on we  will  ask  you  to  improve  on  this  choice.

In [11]:
#  Picking  some  random  set  of  features  to  start  with  here:
features  =  ["the","bad","only","old","almost","good"]

Now  we  can  define  a  function  that  takes  a  review  and  extracts  our  features  from  it  (that  is,  given  a  review,  for  each  of  the  words  we  have  just  picked,  it  checks  whether  the  review  contains  it).  Run  the  cell  below  to  define  this  function.

In [12]:
def extract_features(review, features):
    # Map each feature word to True/False depending on whether it occurs in the review
    doc_features = {}
    for word in features:
        doc_features[word] = (word in review)
    return doc_features
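
As a quick check of what the feature extractor returns, here is a small self-contained sketch on a made-up review. The names `toy_features` and `toy_review` are hypothetical, and this variant converts the review to a set first, which is a design choice of the sketch rather than part of the original cell.

```python
def extract_features(review, features):
    """Map each feature word to True/False depending on whether it occurs in the review."""
    words = set(review)  # set lookups are fast; a list would be scanned once per feature
    return {word: (word in words) for word in features}

# Hypothetical toy inputs, not taken from the corpus.
toy_features = ["the", "bad", "good"]
toy_review = ["the", "plot", "is", "good", "."]

toy_vector = extract_features(toy_review, toy_features)
print(toy_vector)  # {'the': True, 'bad': False, 'good': True}
```

The result is the same as with the list-based version; the set only matters for speed when there are many features or long reviews.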

We  don't  actually  care  about  the  movie  reviews  themselves  but  only  about  the  features  they  contain,  together  with  the  information  about  whether  the  reviews  are  positive  or  negative.  With  the  feature  extractor  function  just  defined,  we  can  turn  all  of  the  movie  reviews  into  feature  vectors.  The  feature  vector  of  a  given  review  consists  of  our  word  list  together  with  information  about  whether  the  review  contains  each  of  these  words.  Run  the  cell  below  to  look  at  the  feature  vector  of  the  first  review.  It  consists  of  a  list  of  features  together  with  their  values  (True  means  that  the  word  occurs  in  the  review,  False  means  that  it  doesn't)  together  with  a  label  (pos  means  that  the  review  is  positive,  neg  means  it  is  negative)

In [15]:
#  Extracting  features:
movie_features  =  [(extract_features(review,features),category) 
                    for(review,category)  in  movies]

print(len(movie_features))
print(movie_features[0])

2000
({'the': True, 'bad': False, 'only': True, 'old': False, 'almost': True, 'good': True}, 'neg')


### Question 4

What just happened here?  What does "2000" mean?  What does "good: True" mean?  What does "neg" mean?

### Your answer to Question 4

This program extracts a feature vector from each movie review.

2000 means that there are 2000 movie reviews in total, so we get 2000 feature vectors.

"good: True" means that the corresponding movie review has word "good". "neg" means the label of the movie review is negative.

### Training the model...

The NLTK NaiveBayesClassifier can now be trained on the feature vectors we have just extracted. To do this, we split our list of movie review feature vectors into training data (reviews 100 and up) and test data (reviews 0-99). We hand the training data to the classifier's train() function. What goes on inside this function is not shown in the code below, but it is essentially what we talked about in the lecture, except that instead of learning the difference between ham and spam, the classifier learns the difference between positive and negative reviews. Run the cell below to do this. If the training step works well, you will not see any result except that the counter to the left of the cell will be filled with a number. But our code stores the resulting classifier as "our_first_classifier".

In [16]:
 # split  into  training  and  testing  sets:

movies_training,  movies_test  =  movie_features[100:],  movie_features[:100]

our_first_classifier  =  nltk.NaiveBayesClassifier.train(movies_training)
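
To make the training step less of a black box, here is a from-scratch sketch of a Naïve Bayes classifier over True/False features. It is not NLTK's implementation: the function names `train_naive_bayes` and `classify`, the toy data, and the smoothing constant of 0.5 are all assumptions made for this illustration.

```python
import math
from collections import defaultdict

def train_naive_bayes(labeled_featuresets, smoothing=0.5):
    """Count labels and (feature, value) occurrences per label from (dict, label) pairs."""
    label_counts = defaultdict(int)
    # value_counts[label][(feature, value)] = number of training items with that setting
    value_counts = defaultdict(lambda: defaultdict(int))
    for featureset, label in labeled_featuresets:
        label_counts[label] += 1
        for feature, value in featureset.items():
            value_counts[label][(feature, value)] += 1
    return label_counts, value_counts, smoothing

def classify(model, featureset):
    """Return the label with the highest log posterior score."""
    label_counts, value_counts, smoothing = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior P(label)
        for feature, value in featureset.items():
            seen = value_counts[label].get((feature, value), 0)
            # Boolean features have two possible values, hence 2 * smoothing.
            score += math.log((seen + smoothing) / (count + 2 * smoothing))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training data in the same (features, label) shape as movie_features.
toy_training = [
    ({"good": True, "bad": False}, "pos"),
    ({"good": True, "bad": False}, "pos"),
    ({"good": False, "bad": True}, "neg"),
    ({"good": False, "bad": True}, "neg"),
]

model = train_naive_bayes(toy_training)
prediction_a = classify(model, {"good": True, "bad": False})
prediction_b = classify(model, {"good": False, "bad": True})
print(prediction_a, prediction_b)
```

Working with log probabilities avoids multiplying many small numbers together, and the smoothing term keeps a feature value that was never seen with a label from forcing that label's score to zero.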

Now we can see how well our classifier has learned to distinguish between positive and negative reviews. We will first try out how well it fares on the reviews that it has just used to learn that difference, that is, the training set (movies_training). Run the cell below to tell our classifier to classify the training set and check how well it did. The result is a fraction between 0 and 1, so the code multiplies it by 100 to get a percentage.

In [17]:
print("Accuracy  of  the  first  classifier (percent) on training set:")
print(100*nltk.classify.accuracy(our_first_classifier,movies_training))


Accuracy  of  the  first  classifier (percent) on training set:
62.36842105263158


### Question 5

What just happened here?  What does "63%" mean?

### Your answer to Question 5

We split our 2000 feature vectors into a training set and a test set, and fed the training set to the NaiveBayesClassifier model. We then used the trained model to predict the labels of the training set, and computed the accuracy by comparing the predicted labels with the ground-truth labels.

63% means that the model predicted the correct label for about 63% of the training reviews and the wrong label for the other 37%. The exact value printed above is 62.37%.

Next, run the cell below to try out our classifier on a test set which contains entirely new data (the reviews 0-99 that we have previously set aside for this purpose). We will use this as a baseline below when we ask you to select your own features for a new classifier.

In [18]:
baseline  =  nltk.classify.accuracy(our_first_classifier,movies_test)
print("Accuracy  of  the  first  classifier (percent) on test set:")
print(100*  baseline)

Accuracy  of  the  first  classifier (percent) on test set:
71.0


### Question 6

What just happened here?  What does "71%" mean?

### Your answer to Question 6

We used our trained model to predict the labels of test data.

71% means that the model predicted the correct label for 71% of the test reviews and the wrong label for the other 29%.

## Keep reading....

Next,  you  will  see  how  well  the  trained  classifier  does  with  a  set  of  entirely  new  movie  reviews.  You  will  run  the  classifier  on  the  reviews  below,  and  then  calculate  its  precision  and  recall.  First,  run  the  cell  below  to  enter  a  new  set  of  movie  reviews  into  Python  (they  will  be  stored  as  "new_test_set").

In [19]:
#  New  movie  reviews  (from  Rotten  Tomatoes,  various  movies)

test1  =  "After  the  pleasant  surprise  that  was  the  first  film  of  the  new  Planet  of  the  Apes  series,  the   expectations  for  the  sequel,  or  middle  part  of  the  trilogy,  were  somewhat  bigger.  Thankfully,  everyone  involved  was  fully  aware  of  that  and  delivered  another  smart  blockbuster  with  a  lot  of  vital  commentary  on  the  futility  of  war  and  violent  conflicts.  The  film  doesn't  want  you  to  pick  a  side  too  easily  as  hostility  between  the  last  remaining  humans  on  Earth  and  the  intelligent  apes  arise.  There  are  decent  and  bad  characters  on  both  sides.  This  makes  for  an  interesting  ride,  as  the  conflict  spins  more  and  more  into  chaos  and  there  is  little  anyone  can  do  against  it,  after  a  point  of  no  return.  Once  again,  the  CGI  is  incredible,  thanks  to  great  motion  capture  acting  and  the  accompanying  special  effects.  Thankfully,  the  human  actors  are  en  par,  especially  Gary  Oldman  only  takes  two  short  scenes  to  make  a  strong  point  for  being  one  of  the  best  of  his  generation.  The  gloomy  atmosphere,  the  great  cinematography,  it  all  adds  up  to  an  intelligent  and  pretty  damn  entertaining  continuation  of  the  story.  If  there  is  one  complaint  it  would  have  to  be  that  the  ending  is  merely  a  cliffhanger  for  what's  next  in  part  three.  But  at  least  we  all  have  something  to  look  forward  to."#  positive

test2  =  "I  will  start  off  this  review  with  a  caveat  that  I  am  not  the  biggest  fan  of  Michael  Bay  films,  its  not  that  I  don't  like  any  of  his  films  but  i  am  just  not  the  biggest  fan.  This  is  a  Michael  Bay  film  from  beginning  to  end.  For  people  that  like  Michael  Bay's  style  and  the  other  films  in  this  franchise  will  certainly  love  this  film.  I  actually  did  enjoy  the  first  film  in  this  franchise  but  every  installment  after  has  been  worse.  The  first  criticism  is  that  Michael  Bay's  directing  style  and  cliches  were  so  heavy  handed  in  this  film  that  it  became  its  own  character  and  became  a  spoof  of  itself,  it  took  me  out  of  the  film;  from  slow  motion  sequences,  to  low  camera  angles,  and  one  liners  that  did  not  quite  hit  hard  enough.  The   product  placement  in  this  film  was  also  very  in  your  face  and  often  took  me  out  of  the  film.  The  dialogue  was  cringe  worthy  and  I  often  felt  like  Peter  Cullen  (voice  of  Optimus  Prime)  did  not  want  to  say  half  the  lines.  The  plot  was  extremely  convoluted  partly  due  to  this  movie  being  just  way  too  long  (nearly  3  hours).  There  is  one  positive  for  this  film  though  and  it  barely  counts.  Even  if  your  not  a  fan  of  Michael  Bay,  you  can  never  argue  with  the  amazing  visuals  and  intense  action  sequences  that  he  brings  to  the  screen  though  after  a  while  of  things  just  blowing  up  I  began  to  get  bored.  Overall  this  film  is  is  a  steaming  pile  of  crap  and  for  people  who  are  not  me  you  need  to  be  big  fan  of  Michael  Bay  and  other  films  in  this  franchise  to  truly  enjoy  it.  Though  even  if  you  are  a  fan  of  Michael  Bay  it  is  going  to  be  hard  to  enjoy  this  film  as  it  is  one  of  the  worst  movies  I  have  ever  seen."#  negative

test3  =  "Both  leads  are  playing  their  stereotypical  roles,  but  they  feel  very  comfortable  in  it.  Really  the  best  part  of  this  film  is  watching  these  two  actors  go  head  to  head  in  some  really  good  scenes.  Aside  from  those  few  scenes  though,  most  of  the  film  is  so  schmaltzy  and  predictable  that  it  doesn't  make  sense  for  the  film  to  be  as  long  as  it  is.  The  direction  is  so  flawed  in  its  portrayal  of  several  characters  and  too  misguided  in  others  that  its  hard  to  take  many  performances  seriously.  There's  an  entire  sub  plot  with  Farmiga  that  could  have  been  completely  removed,  and  there  is  a  brother  with  a  disability  character  who  borders  on  offensive  for  much  of  the  film.  Billy  Bob  Thornton  is  so  underutilized  in  this  film  that  I  don't  know  why  he  signed  on,  and  the  same  is  true  for  Vincent  D'Onofrio.  I  don't  know  why  the  film  is  as  long  as  it  is,  and  it  feels  so  self-indulgent  for  the  director  most  of  the  time.  If  it  weren't  for  Robert  Downey  Jr.  and  Duvall's  performances,  this  film  wouldn't  have   almost  anything  going  for  it.  I  was  surprised  that  the  courtroom  scenes  were  as  lackluster  as  they  were,  I  was  really  expecting  to  enjoy  those.  They  just  fell  flat  most  of  the  time.  It's  overall  inoffensive,  but  it  is  nothing  spectacular."#  negative

test4  =  "David  Ayer,  fresh  off  of  a  weird  mixture  of  directing  \"Sabotage\"  and  \"End  of  Watch\"  (which  he  also  wrote  and  produced),  \"Fury\"  could  have  gone  either  way,  but  I  must  say,  this  film  is  extremely  impressive.  Every  crew  member  aboard  that  tank  gives  it  their  all  in  their  performances  and  that  is  definitely  the  dividing  line  between  whether  or  not  this  film  would  be  good  or  bad.  Brad  Pitt,  Logan  Lerman,  Shia  Labeouf,  John  Bernthal,  and  Michael  Pena  are  all  believable  in  their  roles.  Normally  I  wouldn't  waste  my  time  listing  every  cast  member,  but  there  is  not  one  bad  performance  here  and  everyone  deserves  recognition  for  their  work.  Yes,  a  few  of  them  are  a  little  underdeveloped,  but  you  understand  theirt  motivations  and  pride  for  their  country  the  whole  way  through.  I  wa  immersed  in  these  immaculatly  shot  war  sequences  that  will  have  your  heart  pumping.  It  has  been  a  while  that  I  was  so  immersed  in  a  film  like  I  was  with  \"Fury\"  and  that  is  saying  something.  The  brutally  honest  emotions  given  by  all  the  characters  throughout  this  film  are  terrific  and  you  will  not  even  think  this  film  is  135  Minutes  long,  because  the  experience  is  immersive.  \"Fury\"  is  the  best  war  film  I  have  seen  in  a  very  long  time.  It  has  a  few  nitpicking  scenes,  but  other  than  that  it  blew  me  away.  \"Fury\"  hit's  it's  target."#  positive

test5  =  "This  new  film  fuses  together  everything  good  about  the  original  films,  as  well  as  the  recent  Marvel  films,  and  does  so  with  gusto.  There's  just  so  much  to  love  about  this  film,  from  the  reassembled  cast,  to  the  asides  for  fans  of  the  comics,  to  the  awe  inspiring   action  and  it  all  works  well  together.  This  film  comes  on  the  heels  of  the  rights  transferring  from  Fox  to  Marvel,  and  it  shows  in  the  production  value,  which  obviously  has  help  from  Marvel  Studios,  to  set  up  for  their  newly  announced  2016  film  for  the  X-Men  canon.  It's  just  brilliantly  constructed,  bringing  all  your  favorite  characters  together,  while  also  showing  new  information  and  new  characters  for  us  to  love.  Most  of  what  we  see  comes  directly  from  the  comics,  and  that's  something  to  rejoice  over,  but  it's  also  pure,  perfect,  psychological  action  thriller.  This  is  the  new  breed  of  X-Men,  and  they\'re  far  more  intelligent  and  calculated  than  ever  before."#  positive

test6  =  "Derivative,  needlessly  shaky,  poorly  acted  and  devoid  of  excitement,  Earth  to  Echo  is  an  adventure  film  which  lacks  a  relatively  vital  component;  any  sense  of  adventure.  Apart  from  Astro's  performance  as  Tuck,  and  a  viscerally  compelling  sequence  involving  Reese  C.  Hartwig's  character  Munch,  Earth  to  Echo  is  a  film  which  shall  displease  fans  of  the  alien-discovery  category  of  film,  as  well  as  kids  desiring  a  film  full  of  varied  and  interesting  action.  Relatively  impressive  visual  effects  save  the  film  from  an  entirely  poor  rating,  though  this  is  still  surely  one  to  miss."#  negative

test7=  "Before  viewing  this  film,  I  lowered  my  expectations,  knowing  that  the  film  was  probably  going  to  be  all  dick  and  fart  jokes.  Not  only  was  that  exactly  what  this  film  is,  but  it  is  also  savagely  racist,  and  Seth  McFarlane's  presence  is  very  off-putting,  because  he  is  a  much  better  voice  actor.  He  thought  he  could  put  together  a  sloppy  old-fashion  western  comedy  with  modern-day  lingo  thrown  in,  and  normally  a  movie  that  does  that  is  hit  or  miss,  but  this  just  misses  almost  every  single  time.  \"A  Million  Ways  To  Die  In  The  West\"  is  easily  the  worst  comedy   that  has  come  from  2014  so  far.  While  not  being  a  fan  of  Family  Guy  should  not  affect  my  viewings  on  this  film,  it  feels  like  the  same  stupid  humour  that  is  present  there,  just  a  lot  more  gross-out  stuff.  Don't  get  me  wrong,  I  laughed  at  \"Ted\"  as  much  as  the  next  guy,  but  this  just  feels  like  the  decline  of  McFarlane's  career.  With  poor  writing  and  sloppy  directing,  there  is  not  much  to  like  here  and  it  will  hardly  gain  a  single  laugh."#  negative

test8  =  "This  movie  is  definitely  for  a  more  mature  audience,  but  I  give  this  movie  a  round  of  applause.  It  provides  a  comedic  effect  to  serious  situations  of  life  and  it  also  shares  its  awkward  moments  that  seem  to  be  very  natural  in  life  and  that  is  why  I  give  this  film  a  high  rating  because  it  mirrors  life  as  we  know  it.  This  indie  movie  was  great  for  Jenny  Slate  to  star  in...this  is  good  for  her  resume  and  is  just  good  for  her  in  general.  I  hope  she  gets  many  more  films  to  come  later  on  in  her  future  career.  She  seems  as  if  she  can  further  develop  into  a  multi-dimensional  actress."#  positive

test9  =  "Romantic,  inspiring,  and  strongly  performed,  Belle  is  a  period  piece  that  transcends  its  trappings,  and  becomes  a  film  that  has  a  lot  to  say  about  life  and  the  way  we  see  ourselves.  Powerfully  led  by  actress  Gugu  Mbatha-Raw,  the  entire  cast  of  Belle  finds  the  humanity  in  their  characters,  and  every  character  feels  like  a  real  person.  Watching  Dido  struggle  with  her  self-worth  and  the  problem  of  racism  in  the  world  is  so  captivating  and  enthralling  that  you  don't  want  to  look  away.  I'm  not  even  a  big  fan  of  period-piece  romances,  but  this  film  had  my  heart  crying  out  for  Dido  to  find  true  love.  It's  an  incredibly  sweet  and  earnest  film,  and  it  deserves  every  sweet  moment  it  has."#  positive

test10  =  "Let's  not  waste  too  much  time  assessing  the  insipidness  that  contains  TASM2.  With  very  few  redeeming  qualities,  the  follow  up  to  the  Andrew  Garfield  starring  reboot  is  even  worse  than  its  predecessor.  What  were  studio  heads  thinking  (were  they  even  thinking).  When  movies  are  made  to  treat  the  audience  as  lab  rats,  testing  to  see  when  enough  is  enough,  it  can  only  spell  eventual  doom.  This  series  not  only  condescend  to  its  intended  audience  but  down  right  insults  the  average  viewer  with  continuously  pretentious  tongue  and  cheek  self-aggrandizing  winking.  Its  saving  grace  is  its  top-notch  production  values,  sadly  used  to  promote  a  frivolous  film."#  negative

new_test_set = [test1,test2,test3,test4,test5,test6,test7,test8,test9,test10]

review_num = 1
for t in new_test_set:
    print("############# REVIEW NUMBER " + str(review_num) + "#############")
    print(t)
    review_num += 1

############# REVIEW NUMBER 1#############
After  the  pleasant  surprise  that  was  the  first  film  of  the  new  Planet  of  the  Apes  series,  the   expectations  for  the  sequel,  or  middle  part  of  the  trilogy,  were  somewhat  bigger.  Thankfully,  everyone  involved  was  fully  aware  of  that  and  delivered  another  smart  blockbuster  with  a  lot  of  vital  commentary  on  the  futility  of  war  and  violent  conflicts.  The  film  doesn't  want  you  to  pick  a  side  too  easily  as  hostility  between  the  last  remaining  humans  on  Earth  and  the  intelligent  apes  arise.  There  are  decent  and  bad  characters  on  both  sides.  This  makes  for  an  interesting  ride,  as  the  conflict  spins  more  and  more  into  chaos  and  there  is  little  anyone  can  do  against  it,  after  a  point  of  no  return.  Once  again,  the  CGI  is  incredible,  thanks  to  great  motion  capture  acting  and  the  accompanying  special  effects.  Thankfully,

According  to  the  original  source,  the  ten  reviews  above  can  be  categorized  as  follows  (you  may  convince  yourself  that  this  is  accurate  by  reading  them;  this  will  also  give  you  ideas  for  new  features  you  could  use):

1  :  pos  

2  :  neg  

3  :  neg  

4  :  pos  

5  :  pos  

6  :  neg  

7  :  neg  

8  :  pos  

9  :  pos  

10  :  neg

We are not going to tell the classifier about these labels; we are just going to store them (as "true_labels") so you can use them to assess the classifier's performance later on. Run the cell below to store these values and ask our classifier to classify the reviews on its own:

In [20]:
true_labels=['pos','neg','neg','pos','pos','neg','neg','pos','pos','neg']

In [21]:
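# Number the reviews from 1 and print the classifier's label for each
for i, review in enumerate(new_test_set, start=1):
    tokens = nltk.word_tokenize(review)
    print(str(i) + ':' + our_first_classifier.classify(extract_features(tokens, features)))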


1:neg
2:pos
3:pos
4:neg
5:pos
6:pos
7:pos
8:neg
9:pos
10:pos


If you look at what the classifier has returned and compare it with the labels we have given you for the reviews, you will notice that it has done a lousy job. In this section we will ask you to come up with a way to assess the classifier's performance by computing its accuracy.

In [22]:
correctly_classified_reviews = 0
incorrectly_classified_reviews = 0
# Compute the numbers of correctly and incorrectly classified reviews
for review, the_truth in zip(new_test_set, true_labels):
    what_classifier_thinks = our_first_classifier.classify(extract_features(nltk.word_tokenize(review), features))
    if what_classifier_thinks == the_truth:
        correctly_classified_reviews += 1
    else:
        incorrectly_classified_reviews += 1

# In Python 3, / is true division, so no conversion to float is needed
accuracy = correctly_classified_reviews / (correctly_classified_reviews + incorrectly_classified_reviews)
print("Accuracy  of  the  first  classifier  on  the  ten  reviews  given  above (percent):")
print(100 * accuracy)

Accuracy  of  the  first  classifier  on  the  ten  reviews  given  above (percent):
20.0
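
Accuracy is only one way to summarize these ten predictions. The sketch below also computes precision and recall for the 'pos' label, using the true labels and the classifier output listed above. The helper name `precision_recall` is hypothetical, and the predictions are copied by hand from the printed output rather than recomputed.

```python
true_labels = ['pos', 'neg', 'neg', 'pos', 'pos', 'neg', 'neg', 'pos', 'pos', 'neg']
# Predictions copied from the classifier output shown above (reviews 1 to 10).
predicted_labels = ['neg', 'pos', 'pos', 'neg', 'pos', 'pos', 'pos', 'neg', 'pos', 'pos']

def precision_recall(true, predicted, target="pos"):
    """Precision and recall for one target label."""
    pairs = list(zip(true, predicted))
    true_positives = sum(1 for t, p in pairs if t == target and p == target)
    predicted_target = sum(1 for _t, p in pairs if p == target)
    actual_target = sum(1 for t, _p in pairs if t == target)
    precision = true_positives / predicted_target if predicted_target else 0.0
    recall = true_positives / actual_target if actual_target else 0.0
    return precision, recall

accuracy = sum(t == p for t, p in zip(true_labels, predicted_labels)) / len(true_labels)
precision, recall = precision_recall(true_labels, predicted_labels)
print(accuracy, round(precision, 3), recall)  # 0.2 0.286 0.4
```

Of the seven reviews predicted as positive, only two really are positive (precision 2/7), and of the five truly positive reviews, only two were found (recall 2/5).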


### Question 7

What just happened here?  What does "20%" mean?

Suggest some possible sources of differences between performance on the test set from NLTK and the Rotten Tomatoes test set in the cell below (we're just asking you to think about this a bit; there's not a specific answer we're looking for here).

### Your answer to Question 7

We used our trained model to predict the labels of these new 10 movie reviews.

20% means that among these 10 movie reviews, we predict the labels of 2 movie reviews correctly and predict the wrong labels for other 8 movie reviews.

For the feature vector, we chose a handful of words more or less at random from the common words in the NLTK dataset. Therefore, this feature vector is not well suited to the Rotten Tomatoes test set.
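
One further difference that may be worth checking: the NLTK reviews printed earlier appear to be entirely lowercase, while the Rotten Tomatoes reviews keep their capitalization, and the membership test in the feature extractor is case-sensitive. The sketch below illustrates the effect on a made-up sentence. Whitespace splitting stands in for `nltk.word_tokenize` here so that the example has no dependencies, which is an assumption of the sketch.

```python
features = ["the", "bad", "only", "old", "almost", "good"]

def extract_features(review, features):
    """Same logic as the extractor defined earlier: True if the word occurs in the review."""
    return {word: (word in review) for word in features}

# Hypothetical example sentence; str.split() replaces the real tokenizer in this sketch.
tokens = "The film is Good but Only in parts".split()

as_is = extract_features(tokens, features)
lowered = extract_features([t.lower() for t in tokens], features)

print(as_is["the"], lowered["the"])    # False True
print(as_is["good"], lowered["good"])  # False True
```

Lowercasing the tokens before feature extraction would make the new reviews comparable to the training data in this respect; whether it changes the accuracy on the ten reviews would have to be tested.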

# Section 2: Refining the features

The  classifier  using  the  features  given  above  had  results  that  were  not  particularly  great.  In  the  section  below,  you  will  be  revising  your  feature  set  and  training  a  new  classifier  to  improve  the  results.

Which features were most informative? NLTK provides an easy way of assessing this. Run the cell below to find out.

In [23]:
our_first_classifier.show_most_informative_features(12)

Most Informative Features
                     bad = True              neg : pos    =      1.9 : 1.0
                     bad = False             pos : neg    =      1.5 : 1.0
                    only = False             pos : neg    =      1.2 : 1.0
                    only = True              neg : pos    =      1.1 : 1.0
                  almost = True              pos : neg    =      1.1 : 1.0
                     old = True              pos : neg    =      1.1 : 1.0
                  almost = False             neg : pos    =      1.0 : 1.0
                    good = False             neg : pos    =      1.0 : 1.0
                     old = False             neg : pos    =      1.0 : 1.0
                    good = True              pos : neg    =      1.0 : 1.0
                     the = True              pos : neg    =      1.0 : 1.0
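
Each row above compares how likely a feature value is under one label versus the other; for example, "neg : pos = 1.9 : 1.0" for bad = True reads as "about 1.9 times as likely in negative reviews as in positive ones". The sketch below shows roughly how such a ratio can be computed. The counts are hypothetical (not taken from the corpus), and the helper `likelihood_ratio` and its smoothing constant are assumptions of this illustration, not NLTK's code.

```python
def likelihood_ratio(count_with_value, total_by_label, smoothing=0.5):
    """Return (more_likely_label, ratio) for one feature=value setting."""
    probs = {
        label: (count_with_value[label] + smoothing) / (total_by_label[label] + 2 * smoothing)
        for label in total_by_label
    }
    high = max(probs, key=probs.get)
    low = min(probs, key=probs.get)
    return high, probs[high] / probs[low]

# Hypothetical counts: number of training reviews per label,
# and how many of them contain the word "bad".
totals = {"pos": 100, "neg": 100}
contains_bad = {"pos": 20, "neg": 40}

label, ratio = likelihood_ratio(contains_bad, totals)
print(label, round(ratio, 2))
```

A ratio close to 1.0 means the feature value is about equally likely under both labels and so carries little information for the classifier.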


### Question 8

What does this output mean?  Explain in prose.

### Your answer to Question 8

From this output, we can find that movie reviews which have words like "bad" and "only" are more likely to be negative. Movie reviews which have words like "almost" and "old" are more likely to be positive. Words like "good" and "the" have no influence on the attitude of movie reviews.

### Question 9

How might you improve the performance of the model? What features might be useful to add or remove? Try out different lists of features by editing and then running the cell immediately below, and running the following cells. For example, if you want to try the features "good", "bad", "fantastic", and "spoiler", you would modify the first line of the cell below so that it says: student_defined_features = ["good", "bad", "fantastic", "spoiler"]. You may enter as many features as you wish. Try to come up with predictive ones, that is, words that you expect to find either almost only in positive reviews or almost only in negative reviews.

You should expand your feature set until you get some performance improvement compared with the initial feature set when run against movies 0-99 from the NLTK collection. (That is, your performance should improve compared with the "baseline" you have computed in Question 1.) You may run the cells below as many times as you like. The more improvement you get, the better. You may find it to be quite hard to get more than 1% improvement. On the other hand, a judicious choice of words may lead to a 5% or even 10% improvement.

Have  a  look  at  the  movie  reviews  above  or  on rottentomatoes.com if  you  need  inspiration,  but  please  be  a  good  sport  and  don't  try  and  research  the  best  predictors  (it  would  spoil  the  fun)!

A student who did this assignment at NYU suggested the following features. You can adapt this code to see how well your own features do. How well do they do? Can you improve on it?

In [72]:

student_defined_features  =  ['perfect', 'dark', 'disney', 'beautiful', 'throughout', 'sometimes', 'fiction', 'simple', 'strong', 'classic', 'heart', 'oscar', 'wonderful', 'extremely', 'novel', 'scream', 'future', 'effective', 'attention', 'emotional', 'hilarious', 'excellent', 'eventually', 'america', 'particularly', 'reality', 'leads', 'power', 'enjoy', 'happy', 'important', 'powerful', 'perfectly', 'brilliant', 'impressive', 'entertainment', 'success', 'intelligent', 'solid', 'worst', 'boring', 'stupid', 'worse', 'none', 'poor', 'thriller', 'talent', 'laugh', 'hell', 'apparently', 'interest', 'lack', 'mess', 'annoying', 'predictable', 'cool', 'waste', 'fails', 'ridiculous', 'serious', 'terrible', 'laughs', 'awful', 'dull', 'dumb']

#  Extracting  features:

movie_features2  =  [(extract_features(review,student_defined_features),category)  for  (review,category)  in  movies]

#  split  into  training  and  testing  sets:
movies_training2,  movies_test2  =  movie_features2[100:],  movie_features2[:100]

our_second_classifier  =  nltk.NaiveBayesClassifier.train(movies_training2)

#  See  how  well  our  second  classifier  fares  on  the  test  set:

accuracy_of_student_classifier  =  nltk.classify.accuracy(our_second_classifier,movies_test2)

improvement  =  accuracy_of_student_classifier  -  baseline

print("Accuracy  of  the  student-defined  classifier:",  100*accuracy_of_student_classifier,  "%")

print("Baseline  (accuracy  of  the  classifier  we  defined  above):",  100*baseline,  "%")

print("Improvement  compared  with  baseline:",  100*improvement, "%")

Accuracy  of  the  student-defined  classifier: 82.0 %
Baseline  (accuracy  of  the  classifier  we  defined  above): 71.0 %
Improvement  compared  with  baseline: 10.999999999999998 %
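
The trailing digits in `10.999999999999998` come from floating-point arithmetic, not from the classifier. A minimal sketch of how the same result could be reported more readably is shown below. The two accuracy values are copied from the output above and hard-coded so that the snippet runs on its own; `report` is a name introduced here for illustration.

```python
# Values copied from the output above.
accuracy_of_student_classifier = 0.82
baseline = 0.71

improvement = accuracy_of_student_classifier - baseline

# Format to one decimal place to hide floating-point noise.
report = f"Improvement compared with baseline: {100 * improvement:.1f} percentage points"
print(report)
```

Reporting the difference in percentage points also avoids confusion with a relative improvement over the baseline.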


In [60]:
from nltk.corpus import stopwords

nltk.download('stopwords')  # needed once before stopwords.words() can be used

# Build the stop-word set once; calling stopwords.words() for every word is very slow.
english_stopwords = set(stopwords.words('english'))

positive_movies = [movie for movie in movies if movie[1] == 'pos']
negative_movies = [movie for movie in movies if movie[1] == 'neg']

# Keep alphabetic tokens only (drops punctuation and numbers), then drop stop words.
positive_reviews = [word for movie in positive_movies for word in movie[0]
                    if word.isalpha() and word not in english_stopwords]
negative_reviews = [word for movie in negative_movies for word in movie[0]
                    if word.isalpha() and word not in english_stopwords]

In [70]:
def get_top_words(reviews, num):
    """Return the num most frequent words as (word, count) pairs."""
    frequencies = nltk.FreqDist(reviews)
    return frequencies.most_common(num)

top_positive = [pair[0] for pair in get_top_words(positive_reviews, 500)]
top_negative = [pair[0] for pair in get_top_words(negative_reviews, 500)]

# Keep only the words that are in one top-500 list but not in the other.
positive_features = [w for w in top_positive if w not in top_negative]
negative_features = [w for w in top_negative if w not in top_positive]

print(positive_features)
print(negative_features)

['war', 'perfect', 'jackie', 'dark', 'disney', 'beautiful', 'throughout', 'others', 'sometimes', 'fiction', 'simple', 'strong', 'classic', 'heart', 'voice', 'oscar', 'child', 'wonderful', 'tom', 'husband', 'extremely', 'genre', 'novel', 'scream', 'future', 'science', 'aliens', 'truman', 'history', 'tells', 'effective', 'attention', 'emotional', 'tale', 'hilarious', 'excellent', 'eventually', 'experience', 'sets', 'america', 'particularly', 'george', 'wars', 'reality', 'leads', 'co', 'power', 'enjoy', 'de', 'light', 'change', 'viewer', 'taking', 'form', 'cameron', 'among', 'elements', 'deal', 'whether', 'released', 'told', 'happy', 'important', 'parents', 'feature', 'powerful', 'living', 'perfectly', 'feeling', 'easily', 'easy', 'ryan', 'meet', 'personal', 'brings', 'battle', 'release', 'brilliant', 'impressive', 'score', 'usually', 'entertainment', 'success', 'ben', 'intelligent', 'art', 'taken', 'forced', 'age', 'leave', 'solid', 'political', 'similar', 'using', 'events', 'due', 'chri

### Your answer to Question 9

I separated the movie reviews into positive and negative ones and removed stop words and punctuation from both groups. I then took the 500 most frequent words in the positive reviews and the 500 most frequent words in the negative reviews, removed the words that appeared in both lists, and chose suitable features from what remained.

With my features, the accuracy of the model improved by 11 percentage points over the baseline, from 71.0% to 82.0%.
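
The selection procedure described above can be sketched on a toy example. The two word lists are invented for illustration, `top_words` is a hypothetical helper, and `collections.Counter` stands in for `nltk.FreqDist` so that the snippet runs without NLTK or the corpus.

```python
from collections import Counter

def top_words(words, num):
    """Return the num most frequent words, most frequent first."""
    return [word for word, _ in Counter(words).most_common(num)]

# Toy data (made up): words taken from positive and from negative reviews.
positive_words = ["great"] * 4 + ["film"] * 3 + ["wonderful"] * 2 + ["plot"]
negative_words = ["boring"] * 4 + ["film"] * 3 + ["awful"] * 2 + ["plot"]

top_positive = top_words(positive_words, 3)
top_negative = top_words(negative_words, 3)

# Words that are frequent in both classes do not separate them, so drop them.
shared = set(top_positive) & set(top_negative)
positive_features = [w for w in top_positive if w not in shared]
negative_features = [w for w in top_negative if w not in shared]

print(positive_features)
print(negative_features)
```

Here "film" is frequent in both classes and is dropped, which mirrors why words such as "the" or "good" were uninformative for the first classifier.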

### Question 10

Please explain in your own words (drawing on the book/Wikipedia/etc for inspiration): what is "Naive Bayes" and how does it work?  What are "features"?  What are "classes"?  How does the computer "learn" from the data?  Why is it "Naive"?  How does it relate to Reverend Bayes?  Your answer should be 1 paragraph.

### Your answer to Question 10



### Question 11

Please spend a bit more time exploring ONE of these tools:

1)  http://liwc.wpengine.com/

2)  http://politeness.cornell.edu/ 

3)  http://genderpredict.co/ 

Write 1 paragraph about why and how the tool was built, how it works, and how successful you think it is (on what sort of data?)

### Your answer to Question 11

(PLEASE PUT YOUR ANSWER HERE)

### Question 12

Please read Paul Graham's classic essay "A plan for spam" (http://www.paulgraham.com/spam.html) and indicate 3 things you learned from it.

### Your answer to Question 12

(PLEASE PUT YOUR ANSWER HERE)

### Question 13

Please read these two articles (which engage with each other):
    
(1)  https://www.reuters.com/article/us-amazon-com-jobs-automation-insight/amazon-scraps-secret-ai-recruiting-tool-that-showed-bias-against-women-idUSKCN1MK08G

(2)  https://phys.org/news/2018-11-amazon-sexist-hiring-algorithm-human.html (watch the embedded video too!)
    
Please indicate 3 things you learned (total), and mention how these articles relate to ideas from our class.


### Your answer to Question 13

goes here

### Question 14

Please write your own question and answer it!  It can be a programming question or a conceptual/writing question.


### Your question and answer for Question 14

(PLEASE PUT YOUR QUESTION AND ANSWER HERE)