## Loading the IMDB dataset

In [173]:
import pandas as pd
import warnings

warnings.filterwarnings('ignore')  # Suppress warnings to keep the output readable

# source: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
imdb_data = pd.read_csv('../misc/IMDB Dataset.csv')
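# Work on an independent copy of the first 1000 reviews so later column assignments are safe
imdb_data = imdb_data.head(1000).copy()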
imdb_data.head(5)

                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive


## Analyzing the IMDB dataset

In [174]:
imdb_data['sentiment'].value_counts()

positive    501
negative    499
Name: sentiment, dtype: int64

In [175]:
print(f"Sentiment: {imdb_data['sentiment'][5]}\n")
print(imdb_data['review'][5])

Sentiment: positive

Probably my all-time favorite movie, a story of selflessness, sacrifice and dedication to a noble cause, but it's not preachy or boring. It just never gets old, despite my having seen it some 15 or more times in the last 25 years. Paul Lukas' performance brings tears to my eyes, and Bette Davis, in one of her very few truly sympathetic roles, is a delight. The kids are, as grandma says, more like "dressed-up midgets" than children, but that only makes them more fun to watch. And the mother's slow awakening to what's happening in the world and under her own roof is believable and startling. If I had a dozen thumbs, they'd all be "up" for this movie.


In [176]:
from nltk.corpus import stopwords
from nltk.tokenize.toktok import ToktokTokenizer

tokenizer = ToktokTokenizer()

# English stop words, stored as a set for fast membership tests
stopword_set = set(stopwords.words('english'))

def remove_stopwords(text, is_lower_case=False):
    """Return `text` with English stop words removed."""
    tokens = [token.strip() for token in tokenizer.tokenize(text)]
    if is_lower_case:
        # The text is already lower-cased, so tokens are compared directly
        filtered_tokens = [token for token in tokens if token not in stopword_set]
    else:
        # Lower-case only for the comparison; the original casing is kept
        filtered_tokens = [token for token in tokens if token.lower() not in stopword_set]
    return ' '.join(filtered_tokens)

# Apply the function to the review column
imdb_data['review'] = imdb_data['review'].apply(remove_stopwords)
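
To illustrate the effect of this filter without the NLTK corpus, here is a minimal sketch. The stop-word set `TOY_STOPWORDS`, the helper `toy_remove_stopwords` and the whitespace tokenization are assumptions made for this example; the notebook itself uses NLTK's English list and `ToktokTokenizer`.

```python
# Hypothetical, hand-picked stop words; NLTK's English list is much longer
TOY_STOPWORDS = {"the", "a", "is", "of", "and"}

def toy_remove_stopwords(text):
    # Split on whitespace (a simplification of ToktokTokenizer)
    tokens = text.split()
    # Compare in lower case, but keep the original casing of the kept tokens
    kept = [t for t in tokens if t.lower() not in TOY_STOPWORDS]
    return " ".join(kept)

toy_result = toy_remove_stopwords("The plot is a mix of drama and comedy")
print(toy_result)  # plot mix drama comedy
```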

In [177]:
import spacy

nlp = spacy.load('en_core_web_sm')

# Replace every token by its lemma; nlp.pipe processes the reviews in batches
imdb_data['review'] = [
    " ".join(token.lemma_ for token in doc)
    for doc in nlp.pipe(imdb_data['review'])
]

## Converting the reviews to TF-IDF vectors

In [210]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
reviews = tfidf.fit_transform(imdb_data.review)
reviews.shape

(1000, 15051)
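
The shape reads as (number of documents, vocabulary size). A minimal sketch on a made-up three-document corpus (`toy_corpus` is an assumption, not part of the dataset) shows how the columns arise from the distinct terms:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical miniature corpus, for illustration only
toy_corpus = [
    "great movie great cast",
    "boring movie",
    "great fun",
]

toy_tfidf = TfidfVectorizer()
toy_matrix = toy_tfidf.fit_transform(toy_corpus)

# One row per document, one column per distinct term
toy_vocab = sorted(toy_tfidf.vocabulary_)
print(toy_matrix.shape)
print(toy_vocab)
```

With the default settings each row is L2-normalized, so documents of different length stay comparable.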

## Converting the sentiment labels to a binary representation

In [211]:
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
sentiments = lb.fit_transform(imdb_data.sentiment)
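
Which label becomes 0 and which becomes 1 follows from the alphabetical order of the class names. A small sketch with made-up labels (`toy_lb` and `toy_encoded` are names introduced for this example):

```python
from sklearn.preprocessing import LabelBinarizer

toy_lb = LabelBinarizer()
toy_encoded = toy_lb.fit_transform(["positive", "negative", "positive"])

# classes_ is sorted; the position of a class is its encoded value
print(toy_lb.classes_)
# The result is a column of shape (n_samples, 1); ravel() flattens it
print(toy_encoded.ravel())
```

Here `negative` is encoded as 0 and `positive` as 1.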

## Splitting into training and test sets

In [212]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(reviews, sentiments, test_size=0.2, random_state=42)

print(X_train.shape)

(800, 15051)
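
Note that the vectorizer is fitted on all 1000 reviews before the split, so the test reviews also contribute to the vocabulary and the IDF weights. One possible alternative, sketched here on made-up data (`toy_texts` and `toy_labels` are assumptions), is to split the raw texts first and let a pipeline fit the vectorizer on the training part only:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical miniature dataset, for illustration only
toy_texts = ["great film", "awful film", "great cast", "awful plot",
             "great story", "awful acting", "great fun", "awful mess"]
toy_labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Split the raw texts first ...
txt_train, txt_test, lab_train, lab_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels)

# ... then fit vectorizer and classifier together on the training texts only
model = make_pipeline(TfidfVectorizer(), SVC())
model.fit(txt_train, lab_train)

# predict() applies the already fitted vectorizer to the unseen texts
toy_pred = model.predict(txt_test)
print(toy_pred)
```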


In [213]:
from sklearn.naive_bayes import MultinomialNB
from sklearn import svm

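# Alternative classifier: clf = MultinomialNB().fit(X_train, y_train.ravel())
clf = svm.SVC()
# ravel() flattens the (n, 1) label column into the 1-D array that fit() expects
clf.fit(X_train, y_train.ravel())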

In [214]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

sentiments_pred = clf.predict(X_test)
print(sentiments_pred)
# LabelBinarizer sorts the classes alphabetically (0 = negative, 1 = positive),
# so the report names are taken from lb.classes_ in that order
target_names = list(lb.classes_)
print(classification_report(y_test, sentiments_pred, target_names=target_names))

[1 1 0 0 0 1 1 1 0 1 1 1 0 1 0 0 1 0 1 1 1 0 1 1 0 1 0 1 1 1 0 0 1 1 1 1 0
 1 0 1 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0 1 0 0 0 1 0 1 0 1 1 0 1 1 1 0 0 1 1 1
 1 1 0 0 0 1 1 1 0 1 0 0 0 1 0 0 1 1 1 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1
 1 1 1 1 0 0 1 1 1 1 1 1 1 0 0 1 0 1 1 1 1 1 1 0 0 1 0 1 0 1 0 0 0 1 1 0 1
 1 1 1 1 0 0 1 0 1 1 0 1 1 0 1 1 0 0 0 1 0 1 0 1 0 0 0 1 1 1 0 1 1 1 1 0 1
 0 0 1 0 1 1 0 1 0 1 0 1 1 0 0]
              precision    recall  f1-score   support

    negative       0.84      0.73      0.78       104
    positive       0.75      0.85      0.80        96

    accuracy                           0.79       200
   macro avg       0.79      0.79      0.79       200
weighted avg       0.80      0.79      0.79       200
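
A confusion matrix can complement the report by showing the raw counts behind precision and recall. A minimal sketch with made-up labels (`toy_true` and `toy_pred` are assumptions; in the notebook the arguments would be `y_test` and `sentiments_pred`):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels (0 = negative, 1 = positive)
toy_true = [0, 0, 1, 1, 1, 0]
toy_pred = [0, 1, 1, 1, 0, 0]

# Rows are the true classes, columns the predicted classes
toy_cm = confusion_matrix(toy_true, toy_pred, labels=[0, 1])
print(toy_cm)
```

The diagonal holds the correctly classified reviews; the off-diagonal entries are the two kinds of error.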

