# Project Assignment #2

## Step 2: Create a section in your notebook that introduces you (or your group) and introduces your project

- If you (or your group) have made any changes to your research project idea or selected other/additional data, modify your project description accordingly.

This project was made by Crystal Huynh, Larry Qu, and Nelson Truong.

### Research Question: What common speech patterns can be found in positive and negative commentary?
Our group wanted to explore how people tend to speak when talking in an extremely positive and negative manner, especially in the context of movie reviews. Are there specific words that are frequently used when speaking in a positive sentiment versus a negative one? How often do those words show up in speech? How polarizing are these frequently used words? Do people tend to speak more when they are feeling positive or negative about something? This is the first step that we want to take when analyzing the speech patterns of movie reviews.

### Why it Matters
With this information, we can later see whether these patterns match typical human behavior/speech, or if they are more specific in an online, more anonymous environment. We can use our findings to potentially see how polarizing these spaces are and how they affect our society in the long run as this type of analysis can also be used on things like social media comments and news articles where commentary can be even more polarizing. As we explore speech patterns in settings outside of normal, in person conversation, we can get a better idea of how people truly think and feel in certain environments or how people will think and feel given what we already know.

### Data Source
[IMDB Dataset of 50K Movie Reviews](https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews)
[More Dataset Information](http://ai.stanford.edu/~amaas/data/sentiment/)

We are using the IMDB Dataset of 50K Movie Reviews Large Movie Review Dataset from kaggle which contains 50K movie reviews. This dataset can be used for binary sentiment classification as it contains highly polar movie reviews for training and testing.

https://gist.github.com/mkulakowski2/4289441

https://gist.github.com/mkulakowski2/4289437

In order to indentify the positive and negative words, we can easily scan the web for premade datasets. These two shown above are datasets containing negative and positive words, respectively, that we can use to analyze our movie reviews with.

### Project Scope
The intended analysis is to see what are the common "positive" and "negative" words being used and how often they are used. The resulting visualizations for this would probably be something like a bar chart highlighting the ten most common "positive" and "negative" words and how often they appear in the dataset. We can also how long "positive" reviews are compared to "negative" reviews by checking the word count for each review and plotting the total word count frequency in a separate bar chart. Another analysis we could do is categorize common "positive" and "negative" words to see how polarizing they really are and plot this on another bar chart to visualize the scale of how positive "positive" reviews are and how negative "negative" reviews are.


### Expected Insights
We expect to see more polarizing "positive" and "negative" reviews to be lengthier in word count as people tend to speak more than they are feeling extremely "positive" or "negative" which is typical in normal, everyday, real life behavior. However, many people tend to not show their true thoughts and feelings when speaking in public, so perhaps in spaces like movie reviews, people's wording may be more polarizing than if they were to be giving the review in person. In other words, their online, anonymous reviews may be a lot more emotional and biased than what people would typically say aloud. We might see that although online commentary may reflect human behavior in that it represents what people actually are thinking and feeling, it may not be representative of what people would normally say to others. However, organizations can use these findings to gauge a more truthful reaction from the public for their respective initiatives and campaigns to see what they need to improve upon in order to generate a positive response.

## Step 3: Create a section in your notebook that introduces your data

- This can include summary information about the data, summary statistics where appropriate, and an identification of where there may be missing/incorrect/outlier data
- It should include at least 4 exploratory data visualizations
- This section does not need to be polished but it should be a good foundation

In [1]:
import pandas as pd

In [2]:
imdb_df = pd.read_csv('IMDB Dataset.csv')
imdb_df

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


In [3]:
imdb_df.describe()

Unnamed: 0,review,sentiment
count,50000,50000
unique,49582,2
top,Loved today's show!!! It was a variety and not...,positive
freq,5,25000


In [4]:
imdb_df['sentiment'].value_counts()

positive    25000
negative    25000
Name: sentiment, dtype: int64

In [None]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from string import punctuation
nltk.download('stopwords')
from nltk.probability import FreqDist


#most common words used for positive and negative sentiments
#positiveReviews = []
#negativeReviews = []
positiveText = ""
negativeText = ""


for i in range(len(imdb_df['review'])):
    if imdb_df['sentiment'][i] == "positive":
        #positiveReviews.append(imdb_df['review'][i])
        positiveText += imdb_df['review'][i]
    else:
        #negativeReviews.append(imdb_df['review'][i])
        negativeText += imdb_df['review'][i]

#print(positiveReviews[:5])

#tokenize the words
sent = sent_tokenize(positiveText)
words = []
for s in sent:
    for w in word_tokenize(s):
        words.append(w)
        
#remove stopwords
myStopWords = list(punctuation) + stopwords.words('english')
wordsNoStop = []
for i in words:
    if i.lower() not in myStopWords:
        wordsNoStop.append(i)
        
print("30 most common words in positive reviews")

freq = FreqDist(wordsNoStop)
for j in sorted(freq, key=freq.get, reverse=True)[:30]:
    print(j,freq[j])
    

#tokenize the words
sent = sent_tokenize(negativeText)
words = []
for s in sent:
    for w in word_tokenize(s):
        words.append(w)
        
#remove stopwords
myStopWords = list(punctuation) + stopwords.words('english')
wordsNoStop = []
for i in words:
    if i.lower() not in myStopWords:
        wordsNoStop.append(i)
        
print("30 most common words in negative reviews")

negfreq = FreqDist(wordsNoStop)
for j in sorted(negfreq, key=negfreq.get, reverse=True)[:30]:
    print(j,negfreq[j])
  
#onlyPositive = []
#for pword in freq:
#    if pword not in negfreq:
#        onlyPositive.append((pword, freq[pword]))
        
#onlyNegative = []
#for negword in negfreq:
#    if negword not in freq:
#        onlyNegative.append((negword, freq[negword]))
        
#onlyPositive.sort(key = lambda x: -1*x[1]) 
#onlyNegative.sort(key = lambda x: -1*x[1]) 

    
#print("30 most common words in positive reviews")
#print(onlyPositive[:30])

#print("30 most common words in negative reviews")
#print(onlyNegative[:30])

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


30 most common words in positive reviews
br 97952
's 63002
film 39754
movie 35757
'' 31962
`` 31591
n't 26471
one 23831
like 16414
good 13759
story 12042
great 11790
time 11453
see 11437
would 11098
really 10443
well 9931
also 9373
much 8747
even 8331
first 8278
people 8154
get 8023
... 7898
love 7747
best 7536
way 7413
films 7230
many 7084
could 7056


# Data Exploration

The following cells will be exploring a small subset of the dataset (first 1000 reviews, first 5000 reviews, etc) to try to gain an idea of what kind of dataset we are working with. Expanding upon the previous section where we looked at the 30 most frequent words in positive and negative reviews, we tried to create data visualizations to better represent these differences. Throughout our data exploration, we used a lot of tokenization and lemmatization to try to get the frequency of different words used in the movie reviews. This ended up in a lot of weird "words" such as br or '' which we all know aren't actually words. So, we tried to clean those words up where we could see them and then created the different bar charts of word frequency.

In [None]:
import nltk
from nltk.corpus import stopwords
from string import punctuation
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
from nltk.stem import WordNetLemmatizer

In [None]:
# Compile first 1000 reviews into one giant string
reviews = ""
for ind in range(1000):
    reviews = reviews + " " + imdb_df["review"][ind]
    
# Create a list of stop words
stop_words = stopwords.words("english") + list(punctuation)

# Tokenize and convert all the reviews to lowercase
words = word_tokenize(reviews.lower())

# Remove all unwanted words
completewords = [w for w in words if w not in stop_words]

# Lemmatize the list of words
completestemmed = [WordNetLemmatizer().lemmatize(w) for w in completewords]

# Find the frequency of all words
freq = FreqDist(completestemmed)

# Turn this list into a pandas dataframe
wordlist = []
wordfreq = []
for i in sorted(freq, key=freq.get, reverse=True):
    wordlist.append(i)
    wordfreq.append(freq[i])
    
df = pd.DataFrame({'wordlist':wordlist, 'wordfreq':wordfreq})

In [None]:
df

In [None]:
# Purge weird words
df = df.drop(labels=0)
df = df.drop(labels=1)
df = df.drop(labels=4)
df = df.drop(labels=5)
df = df.drop(labels=6)
df

In [None]:
# Mean
df['wordfreq'].mean()

In [None]:
# Median
df['wordfreq'].median()

In [None]:
# Mode
df['wordfreq'].mode()

In [None]:
# Histogram
df.plot.hist(bins=1000, xlim=(0, 150), ylim=(0,2000))

In [None]:
# Barchart of the 17 most frequent words
bar_df = df.loc[0:20, ["wordlist", "wordfreq"]]
bar_df.plot.barh(x="wordlist", y="wordfreq")
#ax = bar_df.plot.bar(x='wordlist', y='wordfreq', rot=0)

In [None]:
# Barchart of the 19 least frequent words
bar_df = df.loc[18490:, ["wordlist", "wordfreq"]]
bar_df.plot.barh(x="wordlist", y="wordfreq")

In [None]:
# Look only at the first 1000 positive reviews
reviews = ""
for ind in range(5000):
    if imdb_df["sentiment"][ind] == "positive":
        reviews = reviews + " " + imdb_df["review"][ind]
stop_words = stopwords.words("english") + list(punctuation)
words = word_tokenize(reviews.lower())
completewords = [w for w in words if w not in stop_words]
completestemmed = [WordNetLemmatizer().lemmatize(w) for w in completewords]
freq = FreqDist(completestemmed)
wordlist = []
wordfreq = []
for i in sorted(freq, key=freq.get, reverse=True):
    wordlist.append(i)
    wordfreq.append(freq[i])
df = pd.DataFrame({'wordlist':wordlist, 'wordfreq':wordfreq})

In [None]:
df

In [None]:
# Clean up the dataframe and remove weird words
df = df.drop(labels=0)
df = df.drop(labels=1)
df = df.drop(labels=4)
df = df.drop(labels=5)
df = df.drop(labels=7)
df

In [None]:
# Barchart
bar_df = df.loc[0:50, ["wordlist", "wordfreq"]]
bar_df.plot.barh(x="wordlist", y="wordfreq", figsize=(30,30), fontsize=20)

In [None]:
# Look only at the negative reviews
reviews = ""
for ind in range(5000):
    if imdb_df["sentiment"][ind] == "negative":
        reviews = reviews + " " + imdb_df["review"][ind]
stop_words = stopwords.words("english") + list(punctuation)
words = word_tokenize(reviews.lower())
completewords = [w for w in words if w not in stop_words]
completestemmed = [WordNetLemmatizer().lemmatize(w) for w in completewords]
freq = FreqDist(completestemmed)
wordlist = []
wordfreq = []
for i in sorted(freq, key=freq.get, reverse=True):
    wordlist.append(i)
    wordfreq.append(freq[i])
df = pd.DataFrame({'wordlist':wordlist, 'wordfreq':wordfreq})

In [None]:
df

In [None]:
# Clean up dataframe and drop weird words
df = df.drop(labels=0)
df = df.drop(labels=1)
df = df.drop(labels=4)
df = df.drop(labels=5)
df = df.drop(labels=6)
df

In [None]:
# Barchart
bar_df = df.loc[0:30, ["wordlist", "wordfreq"]]
bar_df.plot.barh(x="wordlist", y="wordfreq", figsize=(30,30), fontsize=20)

We can see that in the reviews with negative sentiment, they are using very succinct words such as good or bad. This is probably due to the fact that the reviewer didn't like the movie so they don't want to waste time writing an indepth review about the movie. However, you look at the most frequent words for reviews with positive sentiment and you can see similar succint words. It turns out succinct words are just positive in general but the positive sentiment reviews includes more synonomous with good such as great or love.

## Step 4: Create a section in your notebook that includes the beginnings of your analysis

- This should include code for data analysis
- Include an additional 4 data visualizations of information directly relevant to your study topic
    - Or describe 4 data visualizations that you want to make
- This section does not need to be polished but it should be a good foundation

In [None]:
import nltk
from nltk.sentiment import vader
nltk.download('vader_lexicon')

In [None]:
sia = vader.SentimentIntensityAnalyzer()

In [None]:
#testing syntax stuffs

#imdb_df.head(1)
#imdb_df["review"].iloc[0]
#review1 = imdb_df["review"].iloc[0]
#sia.polarity_scores(review1)
#sia.polarity_scores(review1)["compound"]

In [None]:
#for loop for the whole dataset

#k = 0
#for index, row in imdb_df.iterrows():
    #review = imdb_df["review"].iloc[k]
    #print('Review ' + str(k) + "'s compound polarity score is " + str(sia.polarity_scores(review)["compound"]))
    #k+=1

In [None]:
k = 0
data = []
for index, row in imdb_df.iterrows():
    while k < 50:
        review = imdb_df["review"].iloc[k]
        print('Review ' + str(k) + "'s compound polarity score is " + str(sia.polarity_scores(review)["compound"]))
        data.append(sia.polarity_scores(review)["compound"])
        k+=1

In [None]:
data_df = pd.DataFrame(data)
data_df

In [None]:
data_df.plot(kind='line')

In [None]:
grouped = imdb_df.groupby(imdb_df.sentiment)

positive = grouped.get_group("positive")
positive

In [None]:
k = 0
posdata = []
for index, row in positive.iterrows():
    while k < 50:
        review = positive["review"].iloc[k]
        print('Review ' + str(k) + "'s compound polarity score is " + str(sia.polarity_scores(review)["compound"]))
        posdata.append(sia.polarity_scores(review)["compound"])
        k+=1

In [None]:
posdata_df = pd.DataFrame(posdata)
posdata_df.plot(kind='line')

In [None]:
grouped = imdb_df.groupby(imdb_df.sentiment)

negative = grouped.get_group("negative")
negative

In [None]:
k = 0
negdata = []
for index, row in negative.iterrows():
    while k < 50:
        review = negative["review"].iloc[k]
        print('Review ' + str(k) + "'s compound polarity score is " + str(sia.polarity_scores(review)["compound"]))
        negdata.append(sia.polarity_scores(review)["compound"])
        k+=1

In [None]:
negdata_df = pd.DataFrame(negdata)
negdata_df.plot(kind='line')

In [None]:
posdata_df.mean()

In [None]:
negdata_df.mean()

In [None]:
import numpy as np
import matplotlib.pyplot as plt
   
n=1
r = np.arange(n)
width = 0.25

x = [posdata_df.mean()[0], negdata_df.mean()[0]]

ax1 = plt.subplot()
ax1.set_xticks([1,2])

# plt.bar(r, posdata_df.mean(), color = 'b',
#         width = width, edgecolor = 'black',
#         label='average positive sentiment')
# plt.bar(r + width, negdata_df.mean(), color = 'r',
#         width = width, edgecolor = 'black',
#         label='average negative sentiment')
plt.bar([1,2], x)

ax1.set_xticklabels(['average positive sentiment', 'average negative sentiment'])
plt.ylabel("Average Sentiment Value")
  
plt.title("Average Sentiment Values")  
plt.text(1, 0.6, str(round(x[0], 3)))
plt.text(2, 0.02, str(round(x[1], 3)))
  
plt.show()