Positive/Negative Liste bereitgestellt von:
;   Minqing Hu and Bing Liu. "Mining and Summarizing Customer Reviews." 
;       Proceedings of the ACM SIGKDD International Conference on Knowledge 
;       Discovery and Data Mining (KDD-2004), Aug 22-25, 2004, Seattle, 
;       Washington, USA, 
;   Bing Liu, Minqing Hu and Junsheng Cheng. "Opinion Observer: Analyzing 
;       and Comparing Opinions on the Web." Proceedings of the 14th 
;       International World Wide Web conference (WWW-2005), May 10-14, 
;       2005, Chiba, Japan.

In [1]:
#Load the libraries
import numpy as np
import pandas as pd
import re
# https://online.stat.psu.edu/stat504/lesson/1/1.7
from sklearn.metrics import classification_report,accuracy_score
from preprocessing import preprocesser_text

In [2]:
positive_words = pd.read_csv('data/positive-words.txt', skiprows=29, header=None, names=['words'])
positive_words

Unnamed: 0,words
0,a+
1,abound
2,abounds
3,abundance
4,abundant
...,...
2001,youthful
2002,zeal
2003,zenith
2004,zest


In [3]:
positive_words = preprocesser_text(positive_words, to_prepro='words')
positive_words.head(5)

Pandas Apply: 100%|██████████| 2006/2006 [00:00<00:00, 8392.99it/s]
Pandas Apply: 100%|██████████| 2006/2006 [00:00<00:00, 336093.87it/s]
Pandas Apply: 100%|██████████| 2006/2006 [00:00<00:00, 33414.11it/s]
Pandas Apply: 100%|██████████| 2006/2006 [00:00<00:00, 3525.57it/s]


Unnamed: 0,words
0,a+
1,abound
2,abound
3,abund
4,abund


In [4]:
negative_words = pd.read_csv('data/negative-words.txt', skiprows=29, header=None, names=['words'])
negative_words.head(5)

Unnamed: 0,words
0,2-faced
1,2-faces
2,abnormal
3,abolish
4,abominable


In [5]:
negative_words = preprocesser_text(negative_words, to_prepro='words')
negative_words.head(5)

Pandas Apply: 100%|██████████| 4783/4783 [00:00<00:00, 11609.35it/s]
Pandas Apply: 100%|██████████| 4783/4783 [00:00<00:00, 531440.73it/s]
Pandas Apply: 100%|██████████| 4783/4783 [00:00<00:00, 36231.06it/s]
Pandas Apply: 100%|██████████| 4783/4783 [00:01<00:00, 3817.15it/s]


Unnamed: 0,words
0,2-face
1,2-face
2,abnorm
3,abolish
4,abomin


In [6]:
negative_words.drop_duplicates(inplace=True)
positive_words.drop_duplicates(inplace=True)

In [7]:
#importing the training data
imdb_data=pd.read_csv('data/IMDB Dataset.csv')
print(imdb_data.shape)
imdb_data.head(10)

(50000, 2)


Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
5,"Probably my all-time favorite movie, a story o...",positive
6,I sure would like to see a resurrection of a u...,positive
7,"This show was an amazing, fresh & innovative i...",negative
8,Encouraged by the positive comments about this...,negative
9,If you like original gut wrenching laughter yo...,positive


In [8]:
imdb_data = preprocesser_text(imdb_data)

Pandas Apply: 100%|██████████| 50000/50000 [00:07<00:00, 6325.88it/s]
Pandas Apply: 100%|██████████| 50000/50000 [00:00<00:00, 347226.12it/s]
Pandas Apply: 100%|██████████| 50000/50000 [03:07<00:00, 266.58it/s]
Pandas Apply: 100%|██████████| 50000/50000 [00:55<00:00, 905.62it/s]


In [9]:
norm_test_reviews=imdb_data.iloc[40000:]
norm_train_reviews=imdb_data.iloc[:40000]

In [11]:
def check_sentiment_by_counting(tokens, positive=True, negative=True, return_as_str=False, threshold=0):
    if positive:
        positive_n = len(np.intersect1d(tokens.split(), positive_words.values))
    if negative: 
        negative_n = len(np.intersect1d(tokens.split(), negative_words.values))
    if return_as_str:
        return 'positive' if positive_n - negative_n > threshold else 'negative'
    if positive:
        return positive_n
    if negative:
        return negative_n

def count_positive_negative_words(df):
    positive = df['review'].swifter.apply(check_sentiment_by_counting, positive=True, negative=False)
    negative = df['review'].swifter.apply(check_sentiment_by_counting, positive=False, negative=True)
    print("Positive and Negative Words: ", positive.sum(), negative.sum())
    return positive, negative


In [26]:
positive, negative = count_positive_negative_words(norm_test_reviews)

Pandas Apply: 100%|██████████| 10000/10000 [00:12<00:00, 825.29it/s]
Pandas Apply: 100%|██████████| 10000/10000 [00:23<00:00, 424.30it/s]

Positive and Negative Words:  113219 88027





In [19]:
norm_test_reviews['sentiment_pred'] = norm_test_reviews['review'].swifter.apply(check_sentiment_by_counting, return_as_str=True, threshold=113219/88027)

Pandas Apply: 100%|██████████| 10000/10000 [00:35<00:00, 283.99it/s]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  norm_test_reviews['sentiment_pred'] = norm_test_reviews['review'].swifter.apply(check_sentiment_by_counting, return_as_str=True, threshold=0.5) #113219/88027)


In [20]:
accuracy_score(norm_test_reviews['sentiment_pred'], norm_test_reviews['sentiment'])

0.6787

In [17]:
norm_test_reviews['sentiment_pred'].value_counts()

positive    5662
negative    4338
Name: sentiment_pred, dtype: int64

In [18]:
#Classification report for tfidf features
lr_tfidf_report=classification_report(norm_test_reviews['sentiment_pred'], norm_test_reviews['sentiment'],target_names=['Positive','Negative'])
print(lr_tfidf_report)

              precision    recall  f1-score   support

    Positive       0.62      0.72      0.67      4338
    Negative       0.76      0.67      0.71      5662

    accuracy                           0.69     10000
   macro avg       0.69      0.69      0.69     10000
weighted avg       0.70      0.69      0.69     10000

