# Movie Reviews Corpus Analysis in NLTK<br>
## Natalia Wojarnik

In [35]:
#Loading important libraries and modules used for the analysis

import nltk
from nltk.corpus import movie_reviews
import pandas as pd

In [57]:
len(nltk.corpus.movie_reviews.fileids())

2000

In [58]:
movie_reviews.categories()

['neg', 'pos']

## Average word and sentence length and language diversity<br>

<p>The corpus contains 2000 movie reviews split into two categories: negative and positive. Using pandas, I will compute some basic statistics for each review and present them in a dataframe: the average word length, the average sentence length and the linguistic diversity (here, the average number of times each distinct word is used).</p>

In [37]:
stats = []
for fileid in movie_reviews.fileids():
    words = movie_reviews.words(fileid)
    num_chars = len(movie_reviews.raw(fileid))   # characters, including whitespace
    num_words = len(words)                       # tokens, including punctuation
    num_sents = len(movie_reviews.sents(fileid))
    num_vocab = len(set(word.lower() for word in words))  # distinct lowercased tokens
    # Rounded ratios: characters per token, tokens per sentence, tokens per distinct token
    stats.append([fileid, round(num_chars/num_words), round(num_words/num_sents), round(num_words/num_vocab)])

In [38]:
df_stats = pd.DataFrame(stats, columns = ['Text ID', 'Average Word Length', 'Average Sentence Length', 'Linguistic Diversity'])
df_stats

      Text ID              Average Word Length  Average Sentence Length  Linguistic Diversity
0     neg/cv000_29416.txt  5                    20                       2
1     neg/cv001_19502.txt  5                    22                       2
2     neg/cv002_17424.txt  5                    23                       2
3     neg/cv003_12683.txt  5                    27                       2
4     neg/cv004_12641.txt  5                    24                       2
...   ...                  ...                  ...                      ...
1995  pos/cv995_21821.txt  5                    18                       3
1996  pos/cv996_11592.txt  5                    16                       2
1997  pos/cv997_5046.txt   5                    24                       2
1998  pos/cv998_14111.txt  5                    19                       3
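
<p>Because every ratio is rounded to an integer, most reviews end up with identical values. The sketch below computes the same three ratios without rounding on a toy text. The function name <code>text_stats</code> and the sample sentences are illustrative assumptions, not part of the analysis above.</p>

```python
def text_stats(raw, words, sents):
    """Return (chars per token, tokens per sentence, tokens per distinct token)."""
    vocab = {w.lower() for w in words}
    return (len(raw) / len(words),
            len(words) / len(sents),
            len(words) / len(vocab))

# Toy data standing in for movie_reviews.raw/words/sents of a single file
raw = "the film is good . the cast is good ."
sents = [["the", "film", "is", "good", "."], ["the", "cast", "is", "good", "."]]
words = [w for s in sents for w in s]

avg_word_len, avg_sent_len, diversity = text_stats(raw, words, sents)
print(round(avg_word_len, 2), round(avg_sent_len, 2), round(diversity, 2))
```

<p>Keeping one or two decimals would make differences between reviews visible that the integer table hides.</p>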


## Conditional frequency distribution<br>

<p>Using the nltk module, I will compute a conditional frequency distribution over the corpus to analyze the use of the following words: <em>good, bad, amazing, awful, no, not</em>.</p>

In [60]:
cond_fd = nltk.ConditionalFreqDist(
    (sentiment, word)
    for sentiment in movie_reviews.categories()
    for word in movie_reviews.words(categories=sentiment))
sentiments = ['neg', 'pos']
target_words = ['good', 'bad', 'amazing', 'awful', 'no', 'not']
cond_fd.tabulate(samples = target_words, conditions = sentiments)

       good     bad amazing   awful      no     not 
neg    1163    1034      67     111    1411    2651 
pos    1248     361     117      21    1061    2926 
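
<p>The distribution is built from (condition, sample) pairs, here (sentiment, word). As a rough illustration of that bookkeeping, the plain-Python sketch below counts words separately per condition. The helper <code>conditional_counts</code> and the toy pairs are assumptions made for the example; it is not how nltk implements the class.</p>

```python
from collections import Counter, defaultdict

def conditional_counts(pairs):
    """Count samples separately for each condition, given (condition, sample) pairs."""
    counts = defaultdict(Counter)
    for condition, sample in pairs:
        counts[condition][sample] += 1
    return counts

# Toy (sentiment, word) pairs; the real analysis draws them from movie_reviews
pairs = [("neg", "bad"), ("neg", "bad"), ("neg", "good"),
         ("pos", "good"), ("pos", "good"), ("pos", "not")]

counts = conditional_counts(pairs)
for sentiment in ("neg", "pos"):
    print(sentiment, [counts[sentiment][w] for w in ("good", "bad", "not")])
```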


## Category balance<br>
<p>To check if the categories are balanced, I will count the total number of words in the positive category and the negative category.</p>

In [41]:
neg_words = movie_reviews.words(categories = 'neg')
pos_words = movie_reviews.words(categories = 'pos')
neg_words_count = len(neg_words)
pos_words_count = len(pos_words)
print(f"Total number of words in the positive review category: {pos_words_count}")
print(f"Total number of words in the negative review category: {neg_words_count}")

Total number of words in the positive review category: 832564
Total number of words in the negative review category: 751256
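
<p>The positive category is somewhat larger than the negative one. One way to make the word counts independent of category size is to express them per 10,000 tokens. The sketch below does this with the counts and totals reported above; the helper name <code>per_10k</code> is hypothetical.</p>

```python
# Raw counts from the frequency table and the category totals reported above
counts = {
    "neg": {"good": 1163, "bad": 1034, "amazing": 67, "awful": 111, "no": 1411, "not": 2651},
    "pos": {"good": 1248, "bad": 361, "amazing": 117, "awful": 21, "no": 1061, "not": 2926},
}
totals = {"neg": 751256, "pos": 832564}

def per_10k(category, word):
    """Occurrences of `word` per 10,000 tokens of the given category."""
    return 10_000 * counts[category][word] / totals[category]

for word in counts["neg"]:
    print(f"{word:8} neg={per_10k('neg', word):6.2f}  pos={per_10k('pos', word):6.2f}")
```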


## Positive and negative indicators<br>

<p>I assume that the total number of words in each category is close enough for the raw frequency table to be compared fairly.</p>
<p>To check which words are the best indicators of a good review and of a bad review, I define two functions: pos_indicator and neg_indicator. Since the categories are roughly balanced, each function takes the count of the target word in one category (pos for pos_indicator, neg for neg_indicator), divides it by the total count of the word in both categories, and reports the result as a percentage.</p>

In [42]:
def pos_indicator(word):
    """Print the percentage of all occurrences of `word` that fall in positive reviews."""
    word_pos_count = len([w for w in movie_reviews.words(categories = 'pos') if w == word])
    word_neg_count = len([w for w in movie_reviews.words(categories = 'neg') if w == word])
    count_total = word_neg_count + word_pos_count
    print(round(100*word_pos_count/count_total), "%")

In [43]:
pos_indicator('amazing')

64 %


In [44]:
pos_indicator('good')

52 %


In [45]:
def neg_indicator(word):
    """Print the percentage of all occurrences of `word` that fall in negative reviews."""
    word_pos_count = len([w for w in movie_reviews.words(categories = 'pos') if w == word])
    word_neg_count = len([w for w in movie_reviews.words(categories = 'neg') if w == word])
    count_total = word_neg_count + word_pos_count
    print(round(100*word_neg_count/count_total), "%")

In [46]:
neg_indicator('bad')

74 %


In [47]:
neg_indicator('awful')

84 %
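
<p>Both functions scan the whole corpus on every call. The same percentages can also be read off counts that are already available, as in the sketch below. It uses the numbers from the conditional frequency table; the single function <code>indicator</code> is a hypothetical replacement for the pair above, not the code that produced the outputs.</p>

```python
# Counts copied from the conditional frequency table above
counts = {
    "neg": {"good": 1163, "bad": 1034, "amazing": 67, "awful": 111},
    "pos": {"good": 1248, "bad": 361, "amazing": 117, "awful": 21},
}

def indicator(word, category):
    """Percentage of all occurrences of `word` that fall in `category`."""
    total = counts["neg"][word] + counts["pos"][word]
    if total == 0:
        raise ValueError(f"{word!r} does not occur in either category")
    return round(100 * counts[category][word] / total)

print(indicator("amazing", "pos"), "%")
print(indicator("awful", "neg"), "%")
```

<p>Returning the number instead of printing it also makes the result easy to store in a dataframe or compare across words.</p>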


## Results<br>

<p>From the analysis, it can be concluded that a larger share of the occurrences of 'amazing' falls in positive reviews (64%) than is the case for 'good' (52%). Therefore, 'amazing' is a better indicator of the positive category.<br>I used the word 'good' in the comparison to find the best indicator; however, the counts for this word in both categories are surprisingly close.</p>
<p>Similarly, a larger share of the occurrences of 'awful' falls in negative reviews (84%) than is the case for 'bad' (74%). Hence, 'awful' is a better indicator of the negative category.</p>

## Further analysis - Concordances<br>

<p>The frequency distributions were especially surprising for 4 target words: <em>not, no and bad</em> in the positive category and <em>good</em> in the negative category. To take a closer look at their usage, I check the concordances for those words in their less expected categories, respectively.</p>

In [48]:
text_pos = nltk.Text(pos_words)
text_pos.concordance('not')

Displaying 25 of 2926 matches:
dley ( robbie coltrane , the world is not enough ) calls in inspector frederick
y finished ( both color and music had not been finalized , so no comments about
have . _election , a good film , does not live up to its hype . what makes _ele
ts , and yet both films were probably not even aware of each other , made from 
real acting was involved and there is not an original or inventive bone in it '
imental and at times terribly mushy , not to mention very manipulative . but oh
h a huge special - effects budget but not enough money to hire any recognizable
d dubbing and supporting characters , not to mention the hideous title sequence
powerful tribes , proclaiming himself not tribal , not regional , but a nationa
es , proclaiming himself not tribal , not regional , but a national leader . hi
straightforward manner that teaches , not preaches . it concentrates on the goo
 tradition , the supporting cast does not outshine the star , complementing his
f the fig

In [49]:
text_pos.concordance('no')

Displaying 25 of 1061 matches:
and music had not been finalized , so no comments about marilyn manson ) , but
ming from mtv films , i should expect no less . . . but the film starts off li
er relationship ? even so , there ' s no logical reason why mr . m has an affa
 hours and then collect the profits . no real acting was involved and there is
re , director steven spielberg wastes no time , taking us into the water on a 
vies all rolled into one , and it ' s no wonder it took america by storm in th
ngs i like about jackie chan movies : no political correctness ) . he is joine
 oppression . iron monkey succeeds as no kung fu film since drunken master 2 .
 and minds of its audience . while in no way the equal of a masterpiece like d
 track of who is ahead . the race has no rules ? whichever contestant reaches 
, nick shaffer ( breckin meyer ) is a no - nonsense lawyer - in - training who
 rat race " is a riot , with terrific no - holds - barred performances from th
 framing device ) , i

<p>The concordance shows that the word 'not', when used in the positive category, does not always carry a negative meaning; it is very often used as a plain negation particle: 'when not laughing aloud', 'can not experience human', 'i do not advocate', 'i ' m not trying', 'are not stupid'. The last example in particular is evidence of language ambiguity. Even though the particle 'not' is logically negative, it actually reverses another word's negative connotation ('stupid'). Some occurrences are still surprising, e.g. 'he is not happy with the final product'. However, without more context, it is not possible to draw clear conclusions.</p>
<p>The word 'no' is often used in contexts similar to those of 'not': 'there is is no book of shadows', 'there is no goddamned blair witch', 'i like about jackie chan movies : no political correctness'. Even if something is negated in the review, it does not mean that the overall feeling about the movie is negative. The word 'no' often indicates the absence of a feature, and that absence is treated as a positive thing about a particular movie. The fixed phrase 'no matter... (if, how, who)' seems to be a frequent occurrence, but it does not indicate either the positivity or the negativity of the whole review.</p>
<p>In conclusion, 'no' and 'not', despite being negation particles, cannot be used as indicators of negative reviews. The surprising frequency counts are easier to understand after the analysis of concordances.</p>
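
<p>One way to put a number on how often a negation particle sits next to a sentiment word is to look for a negator in a small window before each occurrence. The sketch below is a simple heuristic under stated assumptions: the <code>NEGATORS</code> list, the window size and the toy sentence are all illustrative, and contractions tokenized as 'wasn ' t' are not detected.</p>

```python
NEGATORS = {"not", "no", "never"}  # assumed list, chosen for illustration only

def negated_share(tokens, target, window=3):
    """Fraction of `target` occurrences with a negator among the `window` preceding tokens."""
    hits = total = 0
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        total += 1
        # Look only at the few tokens immediately before the target word
        if NEGATORS & set(tokens[max(0, i - window):i]):
            hits += 1
    return hits / total if total else 0.0

tokens = "they are not stupid . it is certainly not bad . a bad idea".split()
print(negated_share(tokens, "bad"))
print(negated_share(tokens, "stupid"))
```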

In [50]:
text_pos.concordance('bad')

Displaying 25 of 361 matches:
ccent , but it actually wasn ' t half bad . the film , however , is all good . 
kly , distract from the story . it is bad enough that mr . m doesn ' t like tra
 domination . the situation goes from bad to worse as the army mutinies , the r
d the violence that comes with it are bad things . what makes this all interest
 trying to say city of angels is that bad . it had a lot going for it , but som
urposeful , clear - cut good guys and bad guys , puts the fictional characters 
ld in his portrayal of a cop - gone - bad . gibson , of course , was just being
d karaoke jam . the cable guy has its bad spots , like most any movie . i didn 
n life ( he muses later that it ' s a bad thing to make decisions when you ' re
ngs , while depression , paranoia and bad times find their expression in more n
al . in the end , i can ' t call it a bad effort . it ' s less artistic , not s
and less original , but certainly not bad . it ' s simply different . in fact ,
age . stil

<p>The word 'bad' is often used with negation particles in positive reviews, which reverses its meaning ('wasn ' t as bad as', 'but certainly not bad', 'the film is by no means bad'). 'Bad' also occurs in contexts where it describes the nature of characters or their actions, not the feeling about the movie: 'the big bad wolf', 'a villain who does bad things', 'good guys and bad guys', 'fighting the bad guys', ' matrix skipping ' bad guy'.</p>
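
<p>The descriptive uses such as 'bad guys' can be spotted by counting which word follows the target. The sketch below does this on a toy token list; <code>following_words</code> is a hypothetical helper, and the tokens are made up from the phrases quoted above.</p>

```python
from collections import Counter

def following_words(tokens, target):
    """Count the tokens that immediately follow each occurrence of `target`."""
    return Counter(nxt for cur, nxt in zip(tokens, tokens[1:]) if cur == target)

tokens = "good guys and bad guys . the big bad wolf . fighting the bad guys".split()
print(following_words(tokens, "bad").most_common())
```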

In [51]:
text_neg = nltk.Text(neg_words)
text_neg.concordance('good')

Displaying 25 of 1163 matches:
 highway & memento ) , but there are good and bad ways of making all types of 
e more sense . the actors are pretty good for the most part , although wes ben
me schnazzy cgi , and the occasional good gore shot , like picking into someon
gh too much of this mess to find the good . aside from the fact that children 
 are . if you ' re in the mood for a good suspense film , though , stake out s
hat still does not make for a really good science fiction experience . ghosts 
ce without breathing gear ( which is good for the film ' s budget ) . it is ne
n of joaquin phoenix ( who ' s quite good and by far the film ' s most interes
alo in a romantic comedy -- it was a good idea a couple years ago with the tru
mailed me too , saying clueless is a good movie and that i ' m the only one wh
 love the movie . the preview looked good and of course i ' m crazymadinlove w
rgence of late has been surprisingly good in terms of comedy . what makes movi
the big gorilla . " s

<p>The analysis of the word 'good' in negative reviews leads to similar conclusions. 'Good' is semantically reversed using negations: 'without a good script', 'nobody gives a good performance', 'does not make for a really good science fiction'. There are many cases where an adverb, an adjective or a phrase used in the same sentence as 'good' changes the sense completely or softens the negative feeling about the movie: 'the video is fair to good for', 'occasional good gore shot', 'there are good and bad ways of', 'are pretty good for the most part'. This indicates that, when writing reviews, we tend to be more careful when expressing negative opinions, and that this is reflected in the language.</p>

## Further analysis - Common contexts<br>

In [52]:
# common_contexts() prints its result itself and returns None, so it is not wrapped in print()
print("Negative category:\n")
text_neg.common_contexts(['not','no'])
print("\nPositive category:\n")
text_pos.common_contexts(['not','no'])

Negative category:

._only is_. is_only is_one ._one s_one s_so ,_one a_- -_- the_-
but_one is_, s_- are_interesting was_one is_lost and_one has_one are_"

Positive category:

._only or_, is_one a_- is_. -_- is_bad is_right *_* ._one s_easy ,_-
,_one is_" -_, s_" since_one for_just but_one are_"


In [53]:
# common_contexts() prints its result itself and returns None, so it is not wrapped in print()
print("Negative category:\n")
text_neg.common_contexts(['bad','awful'])
print("\nPositive category:\n")
text_pos.common_contexts(['bad','awful'])

Negative category:

as_as so_that so_, is_. t_, that_. pretty_movie is_, just_. be_.
how_it was_in ._acting was_. are_, plain_. pretty_, ._, so_i are_.

Positive category:

are_, are_.


In [54]:
# common_contexts() prints its result itself and returns None, so it is not wrapped in print()
print("Negative category:\n")
text_neg.common_contexts(['good','amazing'])
print("\nPositive category:\n")
text_pos.common_contexts(['good','amazing'])

Negative category:

is_, so_about was_, s_to the_thing

Positive category:

pretty_, is_, quite_, so_that are_, so_. that_. s_, is_. "_"
and_performances is_in truly_. both_and the_things were_, also_.
how_she is_as


<p>The analysis of common contexts of words with negative connotations in positive reviews, and of the opposite case, shows that positive reviews include some negative comments as well, but overall the reviews remain positive. The same holds in reverse for negative reviews.<br>The case of 'not' and 'no' shows that only analyzing the particles in context can give reliable results and a correct sentiment label.</p>
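
<p>Each entry in the outputs above, such as <code>is_.</code>, is a context: the word before and the word after the target. The sketch below shows the idea in plain Python by collecting (previous, next) pairs for every word and intersecting them for two words. It is an illustration on toy tokens with hypothetical helper names, and it ignores how often each context occurs.</p>

```python
from collections import defaultdict

def contexts(tokens):
    """Map each lowercased token to the set of (previous, next) token pairs around it."""
    ctx = defaultdict(set)
    low = [t.lower() for t in tokens]
    for prev, word, nxt in zip(low, low[1:], low[2:]):
        ctx[word].add((prev, nxt))
    return ctx

def shared_contexts(tokens, word_a, word_b):
    """Contexts in which both words occur, formatted as prev_next."""
    ctx = contexts(tokens)
    return sorted(f"{p}_{n}" for p, n in ctx[word_a] & ctx[word_b])

tokens = "the plot is bad . the acting is awful . a bad movie".split()
print(shared_contexts(tokens, "bad", "awful"))
```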

## Further analysis - Collocations<br>

In [55]:
text_neg.collocations()

special effects; new york; high school; van damme; hong kong; even
though; blair witch; box office; bruce willis; action sequences; looks
like; years ago; science fiction; las vegas; last year; keanu reeves;
starship troopers; running time; urban legend; jackie chan


In [56]:
text_pos.collocations()

special effects; star wars; new york; pulp fiction; science fiction;
star trek; phantom menace; even though; high school; united states;
boogie nights; hong kong; jackie chan; starship troopers; supporting
cast; jackie brown; private ryan; martial arts; motion picture; box
office


<p>There are a few interesting observations.<br><br>
In the positive reviews, movie titles occur more often than in the negative reviews. We don't know which movies were actually reviewed, so it might mean that those titles (star wars [phantom menace], pulp fiction, star trek, boogie nights, private ryan, jackie brown) are used as a sort of benchmark to compare a good movie to. These titles mostly do not appear among the negative collocations, so they are not necessarily the subject of the review; they may simply be mentioned as a reference.<br><br>
The actors' names occur more often in the negative category. Only one name occurs in the positive reviews ('jackie chan'), and it also appears in the negative category. This might suggest that the acting and the main cast of a movie are leading factors in giving a negative review.</p>
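
<p>To give a feel for where such word pairs come from, the sketch below counts adjacent pairs in a toy token list and keeps those that repeat. Raw counts are a simplification of whatever scoring <code>collocations()</code> applies; the stopword list, the threshold and the helper <code>frequent_bigrams</code> are assumptions made for the example.</p>

```python
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "in", "is", "to"}  # tiny assumed list for the demo

def frequent_bigrams(tokens, min_count=2):
    """Adjacent word pairs seen at least `min_count` times, skipping stopwords and punctuation."""
    def keep(tok):
        return tok.isalpha() and tok not in STOPWORDS
    pairs = Counter((a, b) for a, b in zip(tokens, tokens[1:]) if keep(a) and keep(b))
    return [(pair, n) for pair, n in pairs.most_common() if n >= min_count]

tokens = ("the special effects in star wars . special effects and "
          "the cast of star wars . the special effects").split()
print(frequent_bigrams(tokens))
```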