### Problem 1 ~ 5

Use the IMDB sentiment label (positive or negative) from the movie review data (`IMDB Dataset.csv`) as the dependent variable. Build a model whose independent variable is the sentiment score of each review, computed with a sentiment dictionary (lexicon), and then evaluate the model. For convenience, use only the first 500 reviews. Of these, the first 70% serve as training data for building the model and the remaining 30% serve as test data. Accuracy (the proportion of correct predictions) is the evaluation metric.
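
The evaluation protocol described above (a positional 70/30 split, with accuracy as the metric) can be sketched without any library. The helper names `positional_split` and `accuracy` and the toy labels are illustrative assumptions, not part of the assignment.

```python
def positional_split(items, train_percent=70):
    """Split a sequence by position: the first part trains, the rest tests."""
    # Integer arithmetic avoids floating-point surprises from int(n * 0.7).
    cut = len(items) * train_percent // 100
    return items[:cut], items[cut:]


def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Toy labels (hypothetical), standing in for the 500 review sentiments.
labels = ["positive", "negative"] * 5
train, test = positional_split(labels)
print(len(train), len(test))  # 7 3

toy_pred = ["negative", "negative", "positive"]  # made-up predictions
print(accuracy(test, toy_pred))
```

With 500 reviews the same rule puts the cut at row 350, which matches the `[:350]` and `[350:]` slices used in the cells further down.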

First, load the data and preprocess the review sentences as follows.

In [22]:
import numpy as np
import pandas as pd

review = pd.read_csv('../Data/IMDBDataset.csv')
review.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


Second, find the words in each review that match words in the NRC dictionary (using `NRC-Emotion-Lexicon-Wordlevel-v0.92.txt` from Sogang Cyber Campus).
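
The matching step can be illustrated on a single sentence before running it over the reviews. This is a minimal sketch using only the standard library: `toy_lexicon`, `toy_stop_words`, and `match_lexicon` are hypothetical names, and the tiny lexicon is invented for illustration rather than taken from the NRC file.

```python
import re

# Hypothetical miniature lexicon: word -> set of NRC-style labels.
toy_lexicon = {
    "love": {"positive", "joy"},
    "hell": {"negative", "fear"},
    "young": {"positive", "anticipation"},
}
toy_stop_words = {"the", "a", "of", "and", "is", "in"}


def match_lexicon(text, lexicon, stop_words):
    """Lowercase, tokenize on word characters, drop stop words, keep lexicon hits."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in stop_words and t in lexicon]


matched = match_lexicon(
    "The young hero is in love, and the villain is in Hell.",
    toy_lexicon,
    toy_stop_words,
)
print(matched)  # ['young', 'love', 'hell']
```

Repeated words are kept, so a word that appears twice in a review is counted twice in the match list.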

In [23]:
from nltk.tokenize import RegexpTokenizer
from nltk.stem.porter import PorterStemmer
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

NRC = pd.read_csv('../Data/NRC-Emotion-Lexicon-Wordlevel-v0.92.txt', 
                  engine = "python", 
                  header = None, 
                  sep = "\t")
NRC = NRC[(NRC != 0).all(axis=1)]  # keep only rows with no zero entries, i.e. word-label pairs flagged as associated
NRC = NRC.reset_index(drop=True)

# data preprocessing
tokenizer = RegexpTokenizer(r'[\w]+')  # raw string avoids an invalid-escape warning
stop_words = stopwords.words('english')
p_stemmer = PorterStemmer()

n = 500
score_list = []

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/junghunlee/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [24]:
match_review_lexion = []
nrc_words = set(NRC[0])  # build the lookup set once instead of a list per review
for row in review['review'][0:n]:
    raw = row.lower()
    tokens = tokenizer.tokenize(raw)
    stopped_tokens = [i for i in tokens if i not in stop_words] # remove stopwords
    match_words = [x for x in stopped_tokens if x in nrc_words] # match w/ lexicon
    match_review_lexion.append(match_words)

#### Problem 1

Regarding the 10th review document (index: 9), how many words match words in the NRC dictionary?

In [26]:
match_review_lexion[9]

['gut', 'laughter', 'young', 'love', 'hell']

#### Problem 2

How many positive words are in the 10th review document?

In [27]:
positive_words = []
negative_words = []
n = 500

nrc_words = set(NRC[0])  # build the lookup set once instead of a list per review

for row in review['review'][0:n]:
    raw = row.lower()
    tokens = tokenizer.tokenize(raw)
    stopped_tokens = [i for i in tokens if i not in stop_words] # remove stopwords
    match_words = [x for x in stopped_tokens if x in nrc_words] # match w/ lexicon

    emotion=[]
    for i in match_words:
        temp = list(NRC.iloc[np.where(NRC[0] == i)[0],1])  # emotion list
        for j in temp:
            emotion.append(j)
    sentiment_result1 = pd.Series(emotion).value_counts()
    neg = sentiment_result1.get('negative', 0)  # negative word count (0 if absent)
    pos = sentiment_result1.get('positive', 0)  # positive word count (0 if absent)
    score = pos - neg
    positive_words.append(pos)
    negative_words.append(neg)
    score_list.append(score)

In [28]:
positive_words[9]

3

#### Problem 3

How many negative words are in the 10th (index : 9) review document?

In [29]:
negative_words[9]

1

#### Problem 4

Now, assign scores of 1 and -1 to words that match positive and negative words in the NRC dictionary, respectively. Then generate the sentiment score by summing these scores, and use it as the independent variable. Evaluate the prediction accuracy of a logistic regression model fitted on this sentiment score.
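
The scoring rule can be checked by hand on a few words. The sketch below is an assumption-laden illustration: `toy_rows` is an invented stand-in for the filtered NRC table (word, label pairs), and `sentiment_score` is a hypothetical helper, not a function from the notebook.

```python
from collections import Counter

# Hypothetical lexicon rows as (word, label) pairs, like the filtered NRC table.
toy_rows = [
    ("love", "positive"), ("love", "joy"),
    ("hell", "negative"), ("hell", "fear"),
    ("young", "positive"),
    ("gut", "disgust"),
]


def sentiment_score(matched_words, rows, pos_weight=1, neg_weight=1):
    """Return pos_weight * (#positive labels) - neg_weight * (#negative labels)."""
    labels = Counter(
        label for word in matched_words for w, label in rows if w == word
    )
    # Counter returns 0 for labels that never occur, so no try/except is needed.
    return pos_weight * labels["positive"] - neg_weight * labels["negative"]


words = ["gut", "young", "love", "hell"]
print(sentiment_score(words, toy_rows))  # 1  (2 positive labels - 1 negative label)
```

Labels other than positive and negative (joy, fear, and so on) are counted but do not enter the score.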

In [33]:
len(score_list)

500

In [34]:
df_score = pd.DataFrame([score_list], index = ['lexion']).T
df_score = pd.concat([df_score, review.sentiment[:500]], axis = 1)
df_score

Unnamed: 0,lexion,sentiment
0,-11,positive
1,13,positive
2,4,positive
3,-3,negative
4,17,positive
...,...,...
495,-8,negative
496,2,negative
497,1,negative
498,9,negative


In [35]:
from sklearn.linear_model import LogisticRegression

X_train = df_score['lexion'][:350] # 70%
X_test = df_score['lexion'][350:] # 30%

y_train = df_score['sentiment'][:350]
y_test = df_score['sentiment'][350:]

model = LogisticRegression()
model.fit(np.array(X_train).reshape(-1,1), y_train)

In [36]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

y_pred = model.predict(np.array(X_test).reshape(-1,1))
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

    negative       0.69      0.66      0.68        77
    positive       0.66      0.68      0.67        73

    accuracy                           0.67       150
   macro avg       0.67      0.67      0.67       150
weighted avg       0.67      0.67      0.67       150


In [38]:
print(f'The model test score is {model.score(np.array(X_test).reshape(-1,1), np.array(y_test)) : .6f}')

The model test score is  0.673333


#### Problem 5

Now, assign scores of 2 and -1 to words that match positive and negative words in the NRC dictionary, respectively. Then generate the sentiment score by summing these scores, and use it as the independent variable. Evaluate the prediction accuracy of a logistic regression model fitted on this sentiment score.
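
As a quick check on the weighting, the rule can be written out and applied to the 10th review, whose counts were obtained in Problems 2 and 3 (3 positive words, 1 negative word). The symbols $w_{\text{pos}}$, $w_{\text{neg}}$, $n_{\text{pos}}$, and $n_{\text{neg}}$ are notation introduced here for illustration only.

```latex
\begin{aligned}
\text{score} &= w_{\text{pos}}\, n_{\text{pos}} - w_{\text{neg}}\, n_{\text{neg}} \\
\text{Problem 4 } (w_{\text{pos}} = 1,\ w_{\text{neg}} = 1): \quad & 1 \cdot 3 - 1 \cdot 1 = 2 \\
\text{Problem 5 } (w_{\text{pos}} = 2,\ w_{\text{neg}} = 1): \quad & 2 \cdot 3 - 1 \cdot 1 = 5
\end{aligned}
```

Doubling the positive weight shifts every review's score upward by its positive count, so reviews with many matched words move the most.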

In [39]:
n = 500
score_list = []

nrc_words = set(NRC[0])  # build the lookup set once instead of a list per review

for row in review['review'][0:n]:
    raw = row.lower()
    tokens = tokenizer.tokenize(raw)
    stopped_tokens = [i for i in tokens if i not in stop_words] # remove stopwords
    match_words = [x for x in stopped_tokens if x in nrc_words] # match w/ lexicon

    emotion=[]
    for i in match_words:
        temp = list(NRC.iloc[np.where(NRC[0] == i)[0],1])  # emotion list
        for j in temp:
            emotion.append(j)
    sentiment_result1 = pd.Series(emotion).value_counts()
    neg = sentiment_result1.get('negative', 0)  # negative word count (0 if absent)
    pos = 2 * sentiment_result1.get('positive', 0)  # positive word count, weighted by 2
    score = pos - neg
    score_list.append(score)

In [40]:
len(score_list)

500

In [41]:
df_score = pd.DataFrame([score_list], index = ['lexion']).T
df_score = pd.concat([df_score, review.sentiment[:500]], axis = 1)
df_score

Unnamed: 0,lexion,sentiment
0,-1,positive
1,27,positive
2,12,positive
3,-1,negative
4,36,positive
...,...,...
495,1,negative
496,7,negative
497,7,negative
498,24,negative


In [42]:
X_train = df_score['lexion'][:350] # 70%
X_test = df_score['lexion'][350:] # 30%

y_train = df_score['sentiment'][:350]
y_test = df_score['sentiment'][350:]

model = LogisticRegression()
model.fit(np.array(X_train).reshape(-1,1), y_train)

In [43]:
y_pred = model.predict(np.array(X_test).reshape(-1,1))
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

    negative       0.62      0.73      0.67        77
    positive       0.65      0.53      0.59        73

    accuracy                           0.63       150
   macro avg       0.64      0.63      0.63       150
weighted avg       0.64      0.63      0.63       150


In [44]:
print(f'The model test score is {model.score(np.array(X_test).reshape(-1,1), np.array(y_test)) : .6f}')

The model test score is  0.633333


### Problem 6 ~ 7

Using the first 2000 rows of the movie metadata (`movies_metadata.csv`), analyze the movie overviews. First, create a variable `long` that is 1 if `runtime` is greater than 100 and 0 otherwise.
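
One way to build such an indicator is to compare the column against the threshold and cast the boolean result to integers. The sketch below uses a made-up `toy` frame rather than the real metadata, and assumes that rows with a missing runtime should be coded 0.

```python
import numpy as np
import pandas as pd

# Hypothetical runtimes; the real values come from `movies_metadata.csv`.
toy = pd.DataFrame({"runtime": [81.0, 104.0, 101.0, 100.0, np.nan]})

# The comparison yields booleans; NaN > 100 is False, so missing runtimes become 0.
toy["long"] = (toy["runtime"] > 100).astype(int)

print(toy["long"].tolist())  # [0, 1, 1, 0, 0]
print(int(toy["long"].sum()))
```

A runtime of exactly 100 is coded 0 because the comparison is strict.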

In [58]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re

metadata = pd.read_csv('../Data/movies_metadata.csv', engine = 'python')[:2000]
metadata.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


#### Problem 6 

How many movies are classified as long = 1?

In [59]:
metadata['long'] = 0
metadata.loc[metadata['runtime'] > 100, 'long'] = 1

In [60]:
metadata[['runtime','long']]

Unnamed: 0,runtime,long
0,81.0,0
1,104.0,1
2,101.0,1
3,127.0,1
4,106.0,1
...,...,...
1995,103.0,1
1996,96.0,0
1997,112.0,1
1998,86.0,0


In [61]:
len(metadata[metadata['long'] == 1])

1072

In [62]:
metadata['long'].value_counts()

long
1    1072
0     928
Name: count, dtype: int64

#### Problem 7

Construct a logistic regression model with `vote_average` from the movie metadata as the predictor variable ($X$) and the `long` variable obtained above as the response variable ($y$). Then evaluate the model accuracy. As before, the first 70% of the dataset is used as training data to build the model, and the remaining 30% is used as test data. Accuracy (the proportion of correct predictions) is the evaluation metric.
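
The same workflow can be tried on a handful of invented points to see the required shapes. This is a sketch under stated assumptions: the arrays `x` and `y` are hypothetical and unrelated to the movie data, and the default `LogisticRegression` settings are used.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical single predictor and binary response (not taken from the dataset).
x = np.array([2.0, 8.0, 3.0, 7.0, 1.0, 9.0, 4.0, 7.5, 1.5, 8.5])
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

cut = len(x) * 7 // 10  # first 70% of rows for training
# scikit-learn expects a 2-D feature matrix, hence reshape(-1, 1).
X_train, X_test = x[:cut].reshape(-1, 1), x[cut:].reshape(-1, 1)
y_train, y_test = y[:cut], y[cut:]

clf = LogisticRegression().fit(X_train, y_train)
y_pred = clf.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f"test accuracy: {acc:.3f}")
```

For a classifier, `clf.score(X_test, y_test)` reports the same accuracy as `accuracy_score(y_test, y_pred)`.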

In [63]:
X_train = metadata['vote_average'][:1400] # 70%
X_test = metadata['vote_average'][1400:] # 30%

y_train = metadata['long'][:1400]
y_test = metadata['long'][1400:]

model = LogisticRegression()
model.fit(np.array(X_train).reshape(-1,1), y_train)

In [64]:
y_pred = model.predict(np.array(X_test).reshape(-1,1))
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.75      0.33      0.46       309
           1       0.56      0.88      0.68       291

    accuracy                           0.60       600
   macro avg       0.65      0.61      0.57       600
weighted avg       0.66      0.60      0.57       600


In [65]:
print(f'The model test score is {model.score(np.array(X_test).reshape(-1,1), np.array(y_test)) : .6f}')

The model test score is  0.600000


### Problem 8 ~ 9

Using the first 2000 overviews from the movie metadata (`movies_metadata.csv`), analyze the data with an LDA-based topic model. Limit the TF-IDF vectorizer to a maximum of 1000 features. Apply the topic model from the sklearn library to find which of topics 1 to 7 (index: 0 to 6) the first document falls under. Use 7 topics, `random_state = 777`, `learning_method = 'online'`, and `max_iter = 5`.

In [66]:
metadata['overview'] = metadata['overview'].fillna('')  # replace missing overviews with an empty string
metadata['overview'].isnull().sum() #  Null check again

0

data preprocessing

In [68]:
overview_list = []

for i in range(len(metadata['overview'])):
    overview1 = re.sub(r'[^a-zA-Z]', ' ', metadata['overview'][i])  # A-Z, not A-z, so only letters are kept
    overview2 = overview1.lower()
    overview_list.append(overview2)

In [69]:
titles = metadata['original_title']

In [70]:
synopses = [re.sub(r'[^a-zA-Z]', ' ', overview) for overview in metadata['overview']]
synopses = [line.lower() for line in synopses]

vectorize using sklearn TF-IDF

In [74]:
from sklearn.feature_extraction.text import TfidfVectorizer
import urllib.request
import nltk
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

vectorizer = TfidfVectorizer(
    stop_words = 'english', 
    max_features = 1000
)
X = vectorizer.fit_transform(synopses)

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/junghunlee/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [75]:
from sklearn.decomposition import LatentDirichletAllocation

lda_model = LatentDirichletAllocation(
    n_components = 7, # 7 topics
    learning_method = 'online',
    random_state = 777,
    max_iter = 5
)

lda_top = lda_model.fit_transform(X)

CPU times: user 572 ms, sys: 4.74 ms, total: 576 ms
Wall time: 611 ms


In [78]:
lda_top[0]

array([0.03111175, 0.03111176, 0.11279337, 0.03111314, 0.73163452,
       0.03111188, 0.03112359])

#### Problem 8

How many documents belong to topic 3 (index: 2)?

In [82]:
lda_top

array([[0.03111175, 0.03111176, 0.11279337, ..., 0.73163452, 0.03111188,
        0.03112359],
       [0.03035427, 0.03036649, 0.23805431, ..., 0.61016171, 0.03035436,
        0.03035414],
       [0.03235602, 0.03235601, 0.03239337, ..., 0.62125018, 0.0323561 ,
        0.21693185],
       ...,
       [0.02778766, 0.02778755, 0.02791853, ..., 0.83312438, 0.02778772,
        0.02779088],
       [0.03038962, 0.03060413, 0.03047763, ..., 0.81735965, 0.03038947,
        0.03038978],
       [0.02956066, 0.0295605 , 0.1856207 , ..., 0.66655776, 0.02956071,
        0.02956312]])

In [81]:
dominant_topics = np.argmax(lda_top, axis = 1)
topic_count = np.sum(dominant_topics == 2)
print(f"Number of documents in Topic 3: {topic_count}")

Number of documents in Topic 3: 352


In [86]:
for i in range(7) :
    topic_count = np.sum(dominant_topics == i)
    print(f"Number of documents in Topic {i + 1}: {topic_count}")

Number of documents in Topic 1: 24
Number of documents in Topic 2: 7
Number of documents in Topic 3: 352
Number of documents in Topic 4: 8
Number of documents in Topic 5: 1601
Number of documents in Topic 6: 0
Number of documents in Topic 7: 8
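
The per-topic loop above can also be collapsed into a single call. The sketch below uses a small invented `doc_topic` matrix in place of `lda_top`; `np.bincount` with `minlength` keeps topics that no document selects, which would otherwise be dropped from the end of the result.

```python
import numpy as np

# Hypothetical document-topic matrix: 4 documents x 3 topics, rows sum to 1.
doc_topic = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.6, 0.3, 0.1],
    [0.2, 0.1, 0.7],
])

dominant = doc_topic.argmax(axis=1)  # most probable topic per document
counts = np.bincount(dominant, minlength=doc_topic.shape[1])

print(counts.tolist())  # [2, 0, 2]
```

Position `k` of `counts` holds the number of documents whose dominant topic has index `k`, so topic 3 corresponds to `counts[2]`.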


#### Problem 9

What is the top word for topic 2 (index: 1)?

In [79]:
terms = vectorizer.get_feature_names_out()

def get_topics(components, feature_names, n = 5):
    for idx, topic in enumerate(components):
        print("Topic %d:" % (idx+1), 
              [(feature_names[i], topic[i].round(2)) for i in topic.argsort()[:-n - 1:-1]])

get_topics(lda_model.components_, terms)

Topic 1: [('violent', 4.36), ('freddy', 4.33), ('president', 3.99), ('lisa', 2.5), ('process', 1.12)]
Topic 2: [('space', 5.24), ('power', 2.68), ('mind', 2.5), ('private', 2.15), ('prevent', 1.81)]
Topic 3: [('new', 11.12), ('gets', 9.86), ('killer', 9.58), ('man', 9.5), ('angeles', 9.39)]
Topic 4: [('government', 6.0), ('alien', 4.74), ('games', 4.05), ('planet', 3.44), ('amy', 3.08)]
Topic 5: [('young', 43.61), ('life', 41.83), ('family', 36.41), ('film', 34.72), ('new', 33.55)]
Topic 6: [('judge', 0.19), ('jimmy', 0.16), ('park', 0.16), ('crime', 0.16), ('famous', 0.16)]
Topic 7: [('max', 8.24), ('alex', 5.88), ('scheme', 4.01), ('steal', 3.38), ('sarah', 3.27)]
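
When only one topic's top word is needed, it can be looked up directly instead of printing every topic. This is a minimal sketch: `toy_terms`, `toy_components`, and `top_word` are hypothetical stand-ins for `terms`, `lda_model.components_`, and a helper that the notebook does not define.

```python
import numpy as np

# Hypothetical vocabulary and topic-word weights (2 topics x 4 terms).
toy_terms = np.array(["alien", "family", "space", "young"])
toy_components = np.array([
    [0.2, 3.1, 0.4, 2.5],
    [1.7, 0.1, 4.0, 0.3],
])


def top_word(components, feature_names, topic_index):
    """Return the term with the largest weight in the given topic row."""
    return feature_names[components[topic_index].argmax()]


print(top_word(toy_components, toy_terms, 1))  # space
```

With the notebook's own objects the equivalent call would be `top_word(lda_model.components_, terms, 1)`.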
