## Loading the IMDB dataset

In [127]:
import pandas as pd
import warnings

warnings.filterwarnings('ignore')  # Suppress warnings to keep the output readable

# Source: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
imdb_data = pd.read_csv('../misc/IMDB Dataset.csv')
n = 2000
# Keep only the first n reviews; copy() so later column assignments do not act on a view
imdb_data = imdb_data.head(n).copy()

imdb_data.head(5)

|   | review | sentiment |
|---|--------|-----------|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


## Key figures of the IMDB dataset

In [128]:
import re

def count_words(string):
    words = re.findall(r'\w+', string)
    return len(words)

imdb_data['num_words'] = imdb_data['review'].apply(count_words)
avg_words = imdb_data['num_words'].mean()



print("Class distribution:")
print(imdb_data['sentiment'].value_counts())
print("")
print(f"Average number of words per review:\n{int(avg_words)}")

Class distribution:
positive    1005
negative     995
Name: sentiment, dtype: int64

Average number of words per review:
234
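
The `\w+` pattern used in `count_words` treats every run of letters, digits, and underscores as one word. The sketch below, on made-up strings that are not taken from the dataset, suggests two consequences: contractions are split at the apostrophe, and tag names such as `br` are counted while the text still contains HTML. Since the count above is taken before the HTML tags are removed, the reported average probably overstates the number of real words slightly.

```python
import re

def count_words(string):
    # Same approach as above: every \w+ run counts as one word
    return len(re.findall(r'\w+', string))

# Made-up examples, for illustration only
contraction_count = count_words("I couldn't stand it")      # I, couldn, t, stand, it
html_count = count_words("Nice. <br /><br />The end")       # Nice, br, br, The, end
plain_count = count_words("Nice. The end")                  # Nice, The, end

print(contraction_count, html_count, plain_count)
```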


## Example of an IMDB review

In [129]:
print(f"Sentiment: {imdb_data['sentiment'][1]}\n")
print(imdb_data['review'][1])

Sentiment: positive

A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master's of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional 'dream' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell's murals decorating every surface) are 

## Removing superfluous HTML tags

In [130]:
import re


# Source: https://www.kaggle.com/code/lakshmi25npathi/sentiment-analysis-of-imdb-movie-reviews
def cleanhtml(raw_html):
    # Non-greedy match so that each tag (e.g. <br />) is removed individually
    CLEANR = re.compile('<.*?>')
    cleantext = re.sub(CLEANR, '', raw_html)
    return cleantext

imdb_data['review'] = imdb_data['review'].apply(cleanhtml)
print(imdb_data.review[1])


A wonderful little production. The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master's of comedy and his life. The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional 'dream' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell's murals decorating every surface) are terribly well done.
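
The pattern `<.*?>` is non-greedy, so it stops at the first closing `>` and removes one tag at a time. A minimal sketch on a made-up string follows; the helper name `strip_tags` and the sample text are illustrative assumptions, not part of the notebook. It also shows a trade-off: replacing a tag with the empty string can glue neighbouring words together when no whitespace surrounds the tag, while replacing it with a space can leave a stray space before punctuation.

```python
import re

TAG_PATTERN = re.compile(r'<.*?>')  # non-greedy: matches one tag at a time

def strip_tags(text, replacement=''):
    """Remove HTML tags; hypothetical helper for illustration only."""
    cleaned = TAG_PATTERN.sub(replacement, text)
    # Collapse runs of whitespace left behind by removed tags
    return re.sub(r'\s+', ' ', cleaned).strip()

sample = "Great film.<br /><br />Loved the <b>ending</b>!"

glued = strip_tags(sample)        # tags replaced by nothing
spaced = strip_tags(sample, ' ')  # tags replaced by a space
print(glued)
print(spaced)
```

In the review shown above the tags are surrounded by spaces, so the empty-string replacement used in `cleanhtml` works without gluing words.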


## Removing stop words

In [131]:
from nltk.corpus import stopwords
from nltk.tokenize.toktok import ToktokTokenizer


tokenizer = ToktokTokenizer()
# Load the English stop word list (requires the NLTK 'stopwords' corpus to be downloaded)
stopword_list = stopwords.words('english')

# Remove stop words from a text
def remove_stopwords(text, is_lower_case=False):
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopword_list]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]
    filtered_text = ' '.join(filtered_tokens)    
    return filtered_text

imdb_data['review'] = imdb_data['review'].apply(remove_stopwords)
print(imdb_data['review'][1])


wonderful little production. filming technique unassuming- old-time-BBC fashion gives comforting , sometimes discomforting , sense realism entire piece. actors extremely well chosen- Michael Sheen " got polari " voices pat ! truly see seamless editing guided references Williams ' diary entries , well worth watching terrificly written performed piece. masterful production one great master ' comedy life. realism really comes home little things : fantasy guard , rather use traditional ' dream ' techniques remains solid disappears. plays knowledge senses , particularly scenes concerning Orton Halliwell sets ( particularly flat Halliwell ' murals decorating every surface ) terribly well done .


## Lemmatization

In [132]:
import spacy

nlp = spacy.load('en_core_web_sm')

lemma_text_list = []
for doc in nlp.pipe(imdb_data['review']):
    lemma_text_list.append(" ".join(token.lemma_ for token in doc))

imdb_data['review'] = lemma_text_list
print(imdb_data['review'][1])


wonderful little production . film technique unassuming- old - time - BBC fashion give comforting , sometimes discomforting , sense realism entire piece . actor extremely well chosen- Michael Sheen " get polari " voice pat ! truly see seamless edit guide reference Williams ' diary entry , well worth watch terrificly write perform piece . masterful production one great master ' comedy life . realism really come home little thing : fantasy guard , rather use traditional ' dream ' technique remain solid disappear . play knowledge sense , particularly scene concern Orton Halliwell set ( particularly flat Halliwell ' mural decorate every surface ) terribly well do .


## Splitting into training and test sets

In [133]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(imdb_data.review, imdb_data.sentiment, test_size=0.2, random_state=42)

print("Training data:")
print(len(X_train))
print("----------------")
print("Test data:")
print(len(X_test))

Training data:
1600
----------------
Test data:
400
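
A plain random split does not guarantee that both classes appear in the test set in the same proportion as in the full data. If that matters, `train_test_split` accepts a `stratify` argument. The following is a minimal sketch on made-up data (the `toy_X` and `toy_y` names are assumptions for illustration), not the split used in this notebook.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Made-up, perfectly balanced toy data
toy_X = [f"review {i}" for i in range(20)]
toy_y = ['positive'] * 10 + ['negative'] * 10

# stratify=toy_y keeps the class proportions identical in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print(train_counts, test_counts)
```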


## Converting to TF-IDF vectors

In [134]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

print(X_train_tfidf.shape)
print(X_test_tfidf.shape)

(1600, 18894)
(400, 18894)
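
Each of the 18894 columns corresponds to one term of the training vocabulary, and each row holds the TF-IDF weights of one review. To make the weighting concrete, here is a pure-Python sketch of the formula that `TfidfVectorizer` applies with its default settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2 normalization per document). The function name `tfidf_rows` and the toy documents are assumptions for illustration; tokenization and lowercasing are left out.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Smoothed, L2-normalized TF-IDF weights for pre-tokenized documents."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in df}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        weights = {t: counts[t] * idf[t] for t in counts}
        # Scale each document vector to unit length (assumes a non-empty document)
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

# Made-up toy corpus
toy_docs = [
    "wonderful film wonderful cast".split(),
    "boring film".split(),
]
rows = tfidf_rows(toy_docs)
for row in rows:
    print({t: round(w, 3) for t, w in sorted(row.items())})
```

In the second toy document, "boring" receives a higher weight than "film" because "film" occurs in every document and is therefore less informative.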


## Converting the sentiment labels to a binary representation

In [135]:
from sklearn.preprocessing import LabelBinarizer

# Sentiment labels before encoding
print(f"{y_train[:5]}")
print("")
lb = LabelBinarizer()
y_train = lb.fit_transform(y_train)
y_test = lb.transform(y_test)  # reuse the mapping fitted on the training labels
print("Binary representation:")
print(f"{y_train[:5]}")

968    positive
240    negative
819    negative
692    negative
420    negative
Name: sentiment, dtype: object

Binary representation:
[[1]
 [0]
 [0]
 [0]
 [0]]
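
The output shows that `negative` is encoded as 0 and `positive` as 1. This follows from `LabelBinarizer` storing its classes in sorted order in `classes_`. A minimal sketch on made-up labels (the `lb_demo` and `toy_labels` names are assumptions for illustration):

```python
from sklearn.preprocessing import LabelBinarizer

# Made-up labels, for illustration only
toy_labels = ['positive', 'negative', 'negative', 'positive']

lb_demo = LabelBinarizer()
encoded = lb_demo.fit_transform(toy_labels)

# classes_ is sorted, so index 0 is 'negative' and index 1 is 'positive'
print(lb_demo.classes_)
print(encoded.ravel())

# Map the 0/1 values back to the original string labels
decoded = lb_demo.inverse_transform(encoded)
print(decoded)
```

Any list of display names passed to a report for these encoded labels therefore has to be given in the order `negative`, `positive`.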


## Training a classifier (support vector machine) on the training data

In [136]:
from sklearn.metrics import classification_report
from sklearn import svm

clf = svm.SVC()
clf.fit(X_train_tfidf, y_train.ravel())  # SVC expects a 1-D label array
sentiments_pred = clf.predict(X_test_tfidf)
# LabelBinarizer encodes negative as 0 and positive as 1, so the names must follow that order
target_names = ['negative', 'positive']
print(classification_report(y_test, sentiments_pred, target_names=target_names))

              precision    recall  f1-score   support

    negative       0.88      0.82      0.85       195
    positive       0.84      0.89      0.87       205

    accuracy                           0.86       400
   macro avg       0.86      0.86      0.86       400
weighted avg       0.86      0.86      0.86       400



## Swapping the classifier

In [156]:
from sklearn.neighbors import KNeighborsClassifier

neigh = KNeighborsClassifier(n_neighbors=20)
neigh.fit(X_train_tfidf, y_train.ravel())  # 1-D label array, as for the SVM
sentiments_pred = neigh.predict(X_test_tfidf)
# Same label order as before: 0 = negative, 1 = positive
target_names = ['negative', 'positive']
print(classification_report(y_test, sentiments_pred, target_names=target_names))


              precision    recall  f1-score   support

    negative       0.66      0.92      0.77       195
    positive       0.88      0.55      0.67       205

    accuracy                           0.73       400
   macro avg       0.77      0.73      0.72       400
weighted avg       0.77      0.73      0.72       400
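
With 20 neighbours on this 2000-review subset, k-nearest neighbours reaches 0.73 accuracy compared with 0.86 for the SVM. When several classifiers are to be compared on the same features, one option is to wrap the vectorizer and the estimator in a scikit-learn `Pipeline`, so that only the final step changes. The sketch below uses a tiny made-up corpus and a hypothetical `candidates` dictionary. It illustrates the pattern only and is not a benchmark.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Tiny made-up corpus, for illustration only (1 = positive, 0 = negative)
toy_texts = [
    "wonderful film great cast", "great fun wonderful story",
    "boring film terrible plot", "terrible boring waste time",
]
toy_labels = [1, 1, 0, 0]

# Hypothetical set of estimators to compare
candidates = {
    "svm": SVC(),
    "knn": KNeighborsClassifier(n_neighbors=1),
}

predictions = {}
for name, estimator in candidates.items():
    # The vectorizer is refitted inside each pipeline on the training texts
    model = Pipeline([("tfidf", TfidfVectorizer()), ("clf", estimator)])
    model.fit(toy_texts, toy_labels)
    predictions[name] = model.predict(["wonderful great story"])
    print(name, predictions[name])
```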



## Classifying example sentences

In [158]:
def classify_example(example, clf):
    # Vectorize with the TF-IDF model that was fitted on the training data
    X_example = tfidf.transform([example])

    # 1 = positive, 0 = negative (see the label encoding above)
    label = "Positive" if clf.predict(X_example)[0] == 1 else "Negative"
    print(f"The review \n\"{example}\"\ngot classified as: {label}")
    print("")

example1 = "I hated the movie so much. I couldnt stand a second of it and it bored me"
example2 = "The movie was very boring"
example3 = "The movie was fun to watch"

classify_example(example1, clf)
classify_example(example2, clf)
classify_example(example3, clf)

The review 
"I hated the movie so much. I couldnt stand a second of it and it bored me"
got classified as: Negative

The review 
"The movie was very boring"
got classified as: Negative

The review 
"The movie was fun to watch"
got classified as: Positive

