## Loading the IMDB dataset

In [1235]:
import pandas as pd
import warnings

warnings.filterwarnings('ignore')  # Suppress warnings to keep the output readable

# Source: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
imdb_data = pd.read_csv('../misc/IMDB Dataset.csv')
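n = 100  # number of reviews to keep; matches the shapes printed below
imdb_data_unprocessed = imdb_data.head(n).copy()  # raw text, used later for the sentiment scores
imdb_data = imdb_data.head(n).copy()  # independent copy that the preprocessing steps overwrite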

imdb_data.head(5)

                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive


## Analyzing the IMDB dataset

In [1236]:
print(imdb_data.shape)
print(imdb_data['sentiment'].value_counts())

(100, 2)
negative    58
positive    42
Name: sentiment, dtype: int64


In [1237]:
print(f"Sentiment: {imdb_data['sentiment'][1]}\n")
print(imdb_data['review'][1])

Sentiment: positive

A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master's of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional 'dream' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell's murals decorating every surface) are 

## Removing superfluous HTML tags and special characters

In [1238]:
import re

# Source: https://www.kaggle.com/code/lakshmi25npathi/sentiment-analysis-of-imdb-movie-reviews
HTML_TAG_PATTERN = re.compile(r'<.*?>')  # non-greedy, so each tag is matched on its own

def cleanhtml(raw_html):
    # Strip HTML tags such as <br />
    return HTML_TAG_PATTERN.sub('', raw_html)

def remove_special_characters(text):
    # Replace every run of non-alphanumeric characters with a single space
    return re.sub(r"[^a-zA-Z0-9]+", ' ', text)

def denoise_text(text):
    text = cleanhtml(text)
    text = remove_special_characters(text)
    return text

imdb_data['review'] = imdb_data['review'].apply(denoise_text)
print(imdb_data.review[1])


A wonderful little production The filming technique is very unassuming very old time BBC fashion and gives a comforting and sometimes discomforting sense of realism to the entire piece The actors are extremely well chosen Michael Sheen not only has got all the polari but he has all the voices down pat too You can truly see the seamless editing guided by the references to Williams diary entries not only is it well worth the watching but it is a terrificly written and performed piece A masterful production about one of the great master s of comedy and his life The realism really comes home with the little things the fantasy of the guard which rather than use the traditional dream techniques remains solid then disappears It plays on our knowledge and our senses particularly with the scenes concerning Orton and Halliwell and the sets particularly of their flat with Halliwell s murals decorating every surface are terribly well done 
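One detail of this cleaning step is worth illustrating: tags are replaced with the empty string, so two words separated only by a tag end up glued together. The following is a minimal sketch, independent of the notebook's functions; the sample sentence, the `denoise` helper and its `tag_replacement` parameter are made up for illustration.

```python
import re

TAG_PATTERN = re.compile(r"<.*?>")                 # non-greedy: one tag per match
NON_ALNUM_PATTERN = re.compile(r"[^a-zA-Z0-9]+")   # runs of special characters

def denoise(text, tag_replacement=""):
    """Remove HTML tags, then collapse special characters into single spaces."""
    without_tags = TAG_PATTERN.sub(tag_replacement, text)
    return NON_ALNUM_PATTERN.sub(" ", without_tags)

# Hypothetical review fragment with no whitespace around the tags
sample = "Great film<br /><br />It's a must-see."

glued = denoise(sample)                        # tags replaced by ""
spaced = denoise(sample, tag_replacement=" ")  # tags replaced by " "

print(repr(glued))
print(repr(spaced))
```

In the first variant "film" and "It" merge into one token, which would later become a separate vocabulary entry. Replacing tags with a space avoids this, since the second pattern collapses the extra whitespace anyway.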


## Removing stop words

In [1239]:
from nltk.corpus import stopwords
from nltk.tokenize.toktok import ToktokTokenizer

tokenizer = ToktokTokenizer()
stopword_list = stopwords.words('english')  # English stop word list (all lowercase)

# Remove stop words
def remove_stopwords(text, is_lower_case=True):
    # is_lower_case=True assumes the text is already lowercased. The reviews are not,
    # so capitalized stop words such as "The" or "A" are not matched and stay in the text.
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopword_list]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text

imdb_data['review'] = imdb_data['review'].apply(remove_stopwords)
print(imdb_data['review'][1])


A wonderful little production The filming technique unassuming old time BBC fashion gives comforting sometimes discomforting sense realism entire piece The actors extremely well chosen Michael Sheen got polari voices pat You truly see seamless editing guided references Williams diary entries well worth watching terrificly written performed piece A masterful production one great master comedy life The realism really comes home little things fantasy guard rather use traditional dream techniques remains solid disappears It plays knowledge senses particularly scenes concerning Orton Halliwell sets particularly flat Halliwell murals decorating every surface terribly well done
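The output above still contains "A", "The", "You" and "It". A likely reason is that the stop word list is lowercase while the comparison is case-sensitive. The sketch below reproduces the effect in plain Python; the tiny `STOP_WORDS` set and the `case_sensitive` parameter are illustrative assumptions, not NLTK's list or API.

```python
# Tiny illustrative subset; NLTK's English list is much longer
STOP_WORDS = {"a", "the", "is", "it", "and"}

def remove_stopwords(text, stop_words=STOP_WORDS, case_sensitive=True):
    """Drop stop words from a whitespace-tokenized text."""
    tokens = text.split()
    if case_sensitive:
        kept = [t for t in tokens if t not in stop_words]
    else:
        kept = [t for t in tokens if t.lower() not in stop_words]
    return " ".join(kept)

sample = "A wonderful little production The filming technique is very unassuming"

strict = remove_stopwords(sample)                         # "A" and "The" survive
relaxed = remove_stopwords(sample, case_sensitive=False)  # they are removed

print(strict)
print(relaxed)
```

In the notebook, the case-insensitive behaviour corresponds to calling `remove_stopwords` with `is_lower_case=False`. Doing so would change the texts and therefore the vocabulary size and results reported further down.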


## Performing lemmatization

In [1240]:
import spacy

nlp = spacy.load('en_core_web_sm')

# nlp.pipe processes the reviews in batches; each token is replaced by its lemma
lemma_text_list = [
    " ".join(token.lemma_ for token in doc)
    for doc in nlp.pipe(imdb_data['review'])
]

imdb_data['review'] = lemma_text_list
print(imdb_data['review'][1])


a wonderful little production the filming technique unassume old time BBC fashion give comfort sometimes discomforting sense realism entire piece the actor extremely well choose Michael Sheen get polari voice pat you truly see seamless edit guide reference Williams diary entry well worth watch terrificly write perform piece a masterful production one great master comedy life the realism really come home little thing fantasy guard rather use traditional dream technique remain solid disappear it play knowledge sense particularly scene concern Orton Halliwell set particularly flat Halliwell mural decorate every surface terribly well do


## Splitting into training and test sets

In [1241]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(imdb_data.review, imdb_data.sentiment, test_size=0.2, random_state=42)

print("Training data:")
print(len(X_train))
print("----------------")
print("Test data:")
print(len(X_test))

Training data:
80
----------------
Test data:
20


## Converting to TF-IDF vectors

In [1242]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

print(X_train_tfidf.shape)
print(X_test_tfidf.shape)

(80, 3283)
(20, 3283)
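Both matrices have 3283 columns because the vocabulary and the IDF weights are learned from the training reviews only and then reused for the test reviews. The sketch below re-implements that idea in plain Python under the defaults documented for `TfidfVectorizer` (lowercasing, tokens of at least two word characters, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2 normalization). The helper names `fit_idf` and `transform` and the example sentences are made up; this is not scikit-learn's implementation.

```python
import math
import re
from collections import Counter

TOKEN = re.compile(r"\b\w\w+\b")  # tokens of at least two word characters

def tokenize(doc):
    return TOKEN.findall(doc.lower())

def fit_idf(train_docs):
    """Learn vocabulary and smoothed IDF weights from the training documents."""
    n = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(tokenize(doc)))  # count each term once per document
    return {term: math.log((1 + n) / (1 + df)) + 1 for term, df in doc_freq.items()}

def transform(doc, idf):
    """Weight term counts by IDF and L2-normalize; unseen terms are ignored."""
    counts = Counter(t for t in tokenize(doc) if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

train = ["great film great cast", "boring film"]   # hypothetical training texts
idf = fit_idf(train)
vec = transform("great new film", idf)             # "new" is not in the vocabulary

print(sorted(idf))
print(vec)
```

The word "new" never appears in the training texts, so it has no column and is dropped. "film" occurs in every training document and therefore receives a lower weight than "great".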


## Converting the sentiment labels to a binary representation

In [1243]:
from sklearn.preprocessing import LabelBinarizer

# Show the sentiment labels before and after binarization
print(f"{y_train[:5]}")
print("")
lb = LabelBinarizer()
y_train = lb.fit_transform(y_train)
y_test = lb.transform(y_test)  # reuse the mapping fitted on the training labels
print("Binary representation:")
print(f"{y_train[:5]}")

55    negative
88    negative
26    positive
42    negative
69    negative
Name: sentiment, dtype: object

Binary representation:
[[0]
 [0]
 [1]
 [0]
 [0]]
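The output shows that `negative` is encoded as 0 and `positive` as 1, which is the alphabetical order of the class names. The sketch below illustrates in plain Python why the mapping learned from the training labels should be reused for the test labels. The helpers `fit_classes` and `encode` and the label lists are hypothetical and only mimic the idea; they are not the `LabelBinarizer` API.

```python
def fit_classes(labels):
    """Return the distinct labels in sorted order; the position is the code."""
    return sorted(set(labels))

def encode(labels, classes):
    """Map each label to its position in the fitted class list."""
    index = {label: i for i, label in enumerate(classes)}
    return [index[label] for label in labels]

train_labels = ["negative", "negative", "positive", "negative"]
test_labels = ["positive", "positive"]  # happens to contain only one class

classes = fit_classes(train_labels)
train_encoded = encode(train_labels, classes)
test_encoded = encode(test_labels, classes)                     # mapping reused
refit_encoded = encode(test_labels, fit_classes(test_labels))   # mapping refitted

print(classes, train_encoded, test_encoded, refit_encoded)
```

In this simplified sketch, refitting on the test labels assigns code 0 to `positive`, which silently contradicts the training encoding. Fitting once and reusing the result keeps both sets consistent.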


## Training a classifier (support vector machine) on the training data

In [1244]:
from sklearn.metrics import classification_report
from sklearn import svm

clf = svm.SVC()
clf.fit(X_train_tfidf, y_train.ravel())  # SVC expects a 1-D label array
sentiments_pred = clf.predict(X_test_tfidf)
# LabelBinarizer encodes classes alphabetically: 0 = negative, 1 = positive
target_names = ['negative', 'positive']
print(classification_report(y_test, sentiments_pred, target_names=target_names))

              precision    recall  f1-score   support

    negative       0.30      1.00      0.46         6
    positive       0.00      0.00      0.00        14

    accuracy                           0.30        20
   macro avg       0.15      0.50      0.23        20
weighted avg       0.09      0.30      0.14        20
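The pattern in this report (recall 1.00 in the first row, zeros in the second, accuracy 0.30) is what one would obtain if the classifier predicted label 0 for every test review. This is an interpretation of the numbers, not something the notebook verifies. The sketch below recomputes the metrics for that assumed case in plain Python; the label lists are reconstructed from the support column (6 and 14), and `precision_recall_f1` is a hypothetical helper.

```python
def precision_recall_f1(y_true, y_pred, label):
    """Compute precision, recall and F1 for one label, returning 0.0 on empty denominators."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Assumed scenario: 6 test reviews with label 0, 14 with label 1, all predicted as 0
y_true = [0] * 6 + [1] * 14
y_pred = [0] * 20

report = {label: precision_recall_f1(y_true, y_pred, label) for label in (0, 1)}
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

for label, (p, r, f1) in report.items():
    print(f"label {label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy:.2f}")
```

The rounded values agree with the report, which suggests that the TF-IDF features alone did not let the SVM separate the classes on 80 training reviews.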



## Adding further features (feature union)

In [1245]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

def get_sentiment_scores(analyzer, reviews):
    scores = []
    for sentence in reviews:
        vs = analyzer.polarity_scores(sentence)
        scores.append(vs['compound'])
    return scores

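analyzer = SentimentIntensityAnalyzer()
# The scores are computed on the raw reviews. Reusing test_size and random_state yields the
# same row split as before, so the scores line up with the rows of the TF-IDF matrices.
X_train, X_test, y_train, y_test = train_test_split(imdb_data_unprocessed.review, imdb_data_unprocessed.sentiment, test_size=0.2, random_state=42)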
print(X_train)
X_train_scores = get_sentiment_scores(analyzer, X_train)
X_test_scores = get_sentiment_scores(analyzer, X_test)

print(len(X_train_scores))
print(len(X_test_scores))

55    As someone has already mentioned on this board...
88    Nicholas Walker is Paul, the local town Revera...
26    "The Cell" is an exotic masterpiece, a dizzyin...
42    Of all the films I have seen, this one, The Ra...
69    This film laboured along with some of the most...
                            ...                        
60    What happened? What we have here is basically ...
71    Honestly - this short film sucks. the dummy us...
14    This a fantastic movie of three prisoners who ...
92    Deanna Durbin, Nan Grey and Barbara Read are "...
51    ***SPOILERS*** All too, in real life as well a...
Name: review, Length: 80, dtype: object
80
20


In [1246]:
from scipy import sparse
import numpy as np

X_train_scores = np.asarray(X_train_scores)
X_train_scores = np.expand_dims(X_train_scores, axis = 1)


In [1247]:
from scipy import sparse
import numpy as np

print(X_train_scores.shape)
print(X_train_tfidf.shape)
X_train_union = sparse.hstack([X_train_tfidf, X_train_scores])
print(X_train_union.shape)

(80, 1)
(80, 3283)
(80, 3284)
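The shapes show what the horizontal stack does: the score column is appended to the right of the TF-IDF matrix, so every review gains one feature. The plain-Python sketch below mirrors that operation on small dense lists; `hstack_column` and the numbers are made up for illustration and do not replace `sparse.hstack`.

```python
def hstack_column(rows, column):
    """Append one value per row as an additional last column."""
    if len(rows) != len(column):
        raise ValueError("feature matrix and column must have the same number of rows")
    return [row + [value] for row, value in zip(rows, column)]

# Hypothetical data: 2 reviews, 3 TF-IDF features each, plus one sentiment score per review
tfidf_rows = [[0.0, 0.7, 0.7], [1.0, 0.0, 0.0]]
scores = [0.85, -0.40]

combined = hstack_column(tfidf_rows, scores)
print(len(combined), len(combined[0]))
print(combined)
```

The stack pairs rows purely by position, so it is only meaningful when both inputs list the reviews in the same order.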


In [1248]:
X_test_scores = np.asarray(X_test_scores)
X_test_scores = np.expand_dims(X_test_scores, axis = 1)
X_test_union = sparse.hstack([X_test_tfidf, X_test_scores])

In [1249]:
from sklearn.metrics import classification_report
from sklearn import svm

clf = svm.SVC()
clf.fit(X_train_union, y_train)
sentiments_pred = clf.predict(X_test_union)
# String labels are sorted alphabetically, so the first report row is 'negative'
target_names = ['negative', 'positive']
print(classification_report(y_test, sentiments_pred, target_names=target_names))

              precision    recall  f1-score   support

    negative       0.33      0.33      0.33         6
    positive       0.71      0.71      0.71        14

    accuracy                           0.60        20
   macro avg       0.52      0.52      0.52        20
weighted avg       0.60      0.60      0.60        20

