## Loading the IMDB dataset

In [658]:
import pandas as pd
import warnings

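warnings.filterwarnings('ignore')  # Suppress warnings to keep the output readable

# Source: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
imdb_data = pd.read_csv('../misc/IMDB Dataset.csv')
# Work on a subset to keep runtimes short; .copy() avoids chained-assignment issues later.
# Note: the outputs recorded in this notebook stem from a run with 800 rows.
imdb_data = imdb_data.head(5000).copy()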
imdb_data.head(5)

   review                                             sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive


## Exploring the IMDB dataset

In [659]:
print(imdb_data.shape)
print(imdb_data['sentiment'].value_counts())

(800, 2)
negative    410
positive    390
Name: sentiment, dtype: int64


In [660]:
print(f"Sentiment: {imdb_data['sentiment'][1]}\n")
print(imdb_data['review'][1])

Sentiment: positive

A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master's of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional 'dream' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell's murals decorating every surface) are 

## Removing superfluous HTML tags and special characters

In [661]:
import re

# Source: https://www.kaggle.com/code/lakshmi25npathi/sentiment-analysis-of-imdb-movie-reviews
CLEANR = re.compile(r'<.*?>')  # Matches HTML tags such as <br />

def cleanhtml(raw_html):
    # Remove HTML tags
    return CLEANR.sub('', raw_html)

def remove_between_square_brackets(text):
    # Remove bracketed passages such as [1]
    return re.sub(r'\[[^]]*\]', '', text)

def remove_special_characters(text):
    # Replace every run of non-alphanumeric characters with a single space
    return re.sub(r"[^a-zA-Z0-9]+", ' ', text)

def denoise_text(text):
    text = cleanhtml(text)
    text = remove_special_characters(text)
    # Note: the brackets are already gone at this point, so this call has no effect in this order
    text = remove_between_square_brackets(text)
    return text

imdb_data['review'] = imdb_data['review'].apply(denoise_text)
print(imdb_data.review[1])

A wonderful little production The filming technique is very unassuming very old time BBC fashion and gives a comforting and sometimes discomforting sense of realism to the entire piece The actors are extremely well chosen Michael Sheen not only has got all the polari but he has all the voices down pat too You can truly see the seamless editing guided by the references to Williams diary entries not only is it well worth the watching but it is a terrificly written and performed piece A masterful production about one of the great master s of comedy and his life The realism really comes home with the little things the fantasy of the guard which rather than use the traditional dream techniques remains solid then disappears It plays on our knowledge and our senses particularly with the scenes concerning Orton and Halliwell and the sets particularly of their flat with Halliwell s murals decorating every surface are terribly well done 
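
The order of the cleaning steps matters: once every non-alphanumeric character has been replaced by a space, no square brackets remain for a later bracket filter to match. The following is a minimal sketch, not the notebook's pipeline, that contrasts the two orders. The function names `denoise_brackets_first` and `denoise_special_first` and the sample string are made up for illustration.

```python
import re

HTML_TAG = re.compile(r'<.*?>')
BRACKETED = re.compile(r'\[[^\]]*\]')
NON_ALNUM = re.compile(r'[^a-zA-Z0-9]+')

def denoise_brackets_first(text):
    """Strip HTML tags, then bracketed passages, then special characters."""
    text = HTML_TAG.sub('', text)
    text = BRACKETED.sub('', text)
    return NON_ALNUM.sub(' ', text)

def denoise_special_first(text):
    """Strip HTML tags, then special characters, then (too late) bracketed passages."""
    text = HTML_TAG.sub('', text)
    text = NON_ALNUM.sub(' ', text)   # removes '[' and ']' but keeps the words between them
    return BRACKETED.sub('', text)    # nothing left to match

# Hypothetical sample review
sample = 'Great film [spoiler ahead]!<br />Loved it.'
print(repr(denoise_brackets_first(sample)))
print(repr(denoise_special_first(sample)))
```

In the first variant the bracketed passage disappears entirely; in the second its words stay in the text.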


## Removing stop words

In [662]:
from nltk.corpus import stopwords
from nltk.tokenize.toktok import ToktokTokenizer

tokenizer = ToktokTokenizer()
stopword_list = stopwords.words('english')  # Load the English stop word list (all entries are lowercase)

# Remove stop words
def remove_stopwords(text, is_lower_case=True):
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        # Assumes the text is already lowercase; capitalized stop words such as "The" are kept
        filtered_tokens = [token for token in tokens if token not in stopword_list]
    else:
        # Case-insensitive comparison against the lowercase stop word list
        filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]
    return ' '.join(filtered_tokens)

imdb_data['review'] = imdb_data['review'].apply(remove_stopwords)
print(imdb_data['review'][1])

A wonderful little production The filming technique unassuming old time BBC fashion gives comforting sometimes discomforting sense realism entire piece The actors extremely well chosen Michael Sheen got polari voices pat You truly see seamless editing guided references Williams diary entries well worth watching terrificly written performed piece A masterful production one great master comedy life The realism really comes home little things fantasy guard rather use traditional dream techniques remains solid disappears It plays knowledge senses particularly scenes concerning Orton Halliwell sets particularly flat Halliwell murals decorating every surface terribly well done
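
In the output above, capitalized stop words such as "A", "The", "It" and "You" survive, because the tokens are compared to a lowercase stop word list without being lowercased first. A minimal sketch of a case-insensitive filter follows. It uses plain whitespace tokenization and a tiny, hypothetical stop word set instead of the NLTK list, and the function name `remove_stopwords_ci` is made up for illustration.

```python
# Hypothetical, tiny stop word set for illustration only
STOP_WORDS = {'a', 'the', 'is', 'it', 'you', 'and'}

def remove_stopwords_ci(text, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is a stop word; keep the casing of the rest."""
    tokens = text.split()
    return ' '.join(tok for tok in tokens if tok.lower() not in stop_words)

print(remove_stopwords_ci('A wonderful little production The filming technique is unassuming'))
```

In the notebook's function, passing `is_lower_case=False` selects the equivalent case-insensitive branch.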


## Lemmatization

In [668]:
import spacy

nlp = spacy.load('en_core_web_sm')

lemma_text_list = []
for doc in nlp.pipe(imdb_data['review']):
    lemma_text_list.append(" ".join(token.lemma_ for token in doc))

imdb_data['review'] = lemma_text_list
print(imdb_data['review'][1])

a wonderful little production the filming technique unassume old time BBC fashion give comfort sometimes discomforte sense realism entire piece the actor extremely well choose Michael Sheen get polari voice pat you truly see seamless edit guide reference Williams diary entry well worth watch terrificly write perform piece a masterful production one great master comedy life the realism really come home little thing fantasy guard rather use traditional dream technique remain solid disappear it play knowledge sense particularly scene concern Orton Halliwell set particularly flat Halliwell mural decorate every surface terribly well do


## Converting the reviews to TF-IDF vectors

In [669]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
reviews = tfidf.fit_transform(imdb_data.review)
reviews.shape

(800, 13295)
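
The shape reads as one row per review and one column per vocabulary term. To make the weighting itself tangible, here is a minimal pure-Python sketch of TF-IDF with a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, and L2-normalized rows, which to my understanding matches the default settings of `TfidfVectorizer`. The tokenization is simplified to lowercasing and whitespace splitting, and the function name `tfidf` and the toy corpus are assumptions for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows) with L2-normalized TF-IDF weights per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows

# Toy corpus, made up for illustration
vocab, rows = tfidf(['good film', 'bad film', 'good good acting'])
print(vocab)
for row in rows:
    print([round(w, 3) for w in row])
```

Terms that occur in few documents receive a higher weight than terms that occur in many, which is why "bad" outweighs "film" in the second toy document.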

## Converting the sentiment labels to a binary representation

In [670]:
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
sentiments = lb.fit_transform(imdb_data.sentiment)

# Show the sentiment labels
print(imdb_data.sentiment.head(5))
print("")
print("Binary representation:")
print(sentiments[:5])

0    positive
1    positive
2    positive
3    negative
4    positive
Name: sentiment, dtype: object

Binary representation:
[[1]
 [1]
 [1]
 [0]
 [1]]
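
`LabelBinarizer` keeps its classes in sorted order (available as `lb.classes_`), so with two string labels the alphabetically first one, `negative`, becomes 0 and `positive` becomes 1, as the output above shows. The mapping can be sketched in plain Python as follows; `binarize_labels` is a hypothetical helper, not part of scikit-learn.

```python
def binarize_labels(labels):
    """Map two string classes to 0/1, with the classes sorted alphabetically."""
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError('expected exactly two classes')
    index = {cls: i for i, cls in enumerate(classes)}
    return classes, [index[label] for label in labels]

# The first five labels from the output above
classes, encoded = binarize_labels(['positive', 'positive', 'positive', 'negative', 'positive'])
print(classes)
print(encoded)
```

This ordering matters later on: any list of class names passed to a report has to follow the order 0, 1, that is negative first, then positive.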


## Splitting into training and test sets

In [671]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(reviews, sentiments, test_size=0.2, random_state=42)

print("Training data:")
print(X_train.shape)
print(y_train.shape)
print("----------------")
print("Test data:")
print(X_test.shape)
print(y_test.shape)

Training data:
(640, 13295)
(640, 1)
----------------
Test data:
(160, 13295)
(160, 1)


## Training a classifier (support vector machine) on the training data

In [672]:
from sklearn.metrics import classification_report
from sklearn import svm

clf = svm.SVC()
clf.fit(X_train, y_train.ravel())  # SVC expects a 1-D label array
sentiments_pred = clf.predict(X_test)
# Names must follow the label order 0, 1; LabelBinarizer encodes negative as 0 and positive as 1
target_names = ['negative', 'positive']
print(classification_report(y_test, sentiments_pred, target_names=target_names))

              precision    recall  f1-score   support

    negative       0.90      0.73      0.81        88
    positive       0.73      0.90      0.81        72

    accuracy                           0.81       160
   macro avg       0.82      0.82      0.81       160
weighted avg       0.82      0.81      0.81       160
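
Each class row in the report lists precision, recall, F1 score and support. As a reading aid, the following minimal sketch computes these four values by hand for one class. The label lists are toy data made up for illustration, not the notebook's predictions, and `per_class_metrics` is a hypothetical helper.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1, support) for a single class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # true positives
    fp = sum(1 for t, p in pairs if t != label and p == label)  # false positives
    fn = sum(1 for t, p in pairs if t == label and p != label)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1, tp + fn  # support = number of true instances

# Toy labels: 0 = negative, 1 = positive
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]
for label in (0, 1):
    print(label, per_class_metrics(y_true, y_pred, label))
```

Precision answers "how many predictions of this class were right", recall answers "how many instances of this class were found", and support is the number of true instances of the class in the test set.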

