## Loading the IMDB dataset

In [50]:
import pandas as pd
import warnings

warnings.filterwarnings('ignore')  # Suppress warnings to keep the output readable

# source: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
imdb_data = pd.read_csv('../misc/IMDB Dataset.csv')
imdb_data.head(10)

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
5,"Probably my all-time favorite movie, a story o...",positive
6,I sure would like to see a resurrection of a u...,positive
7,"This show was an amazing, fresh & innovative i...",negative
8,Encouraged by the positive comments about this...,negative
9,If you like original gut wrenching laughter yo...,positive


## Analyzing the IMDB dataset

In [51]:
imdb_data['sentiment'].value_counts()

positive    25000
negative    25000
Name: sentiment, dtype: int64

In [52]:
print(f"Sentiment: {imdb_data['sentiment'][5]}\n")
print(imdb_data['review'][5])

Sentiment: positive

Probably my all-time favorite movie, a story of selflessness, sacrifice and dedication to a noble cause, but it's not preachy or boring. It just never gets old, despite my having seen it some 15 or more times in the last 25 years. Paul Lukas' performance brings tears to my eyes, and Bette Davis, in one of her very few truly sympathetic roles, is a delight. The kids are, as grandma says, more like "dressed-up midgets" than children, but that only makes them more fun to watch. And the mother's slow awakening to what's happening in the world and under her own roof is believable and startling. If I had a dozen thumbs, they'd all be "up" for this movie.


## Splitting into training and test sets

In [76]:
from sklearn.model_selection import train_test_split

reviews_train, reviews_test, sentiments_train, sentiments_test = train_test_split(imdb_data.review, imdb_data.sentiment, test_size=0.2, random_state=42)

print("Training split:")
print(sentiments_train.value_counts())
print("---------------------------")
print("Test split:")
print(sentiments_test.value_counts())

Training split:
negative    20039
positive    19961
Name: sentiment, dtype: int64
---------------------------
Test split:
positive    5039
negative    4961
Name: sentiment, dtype: int64
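
The counts above show that a plain random split leaves the two classes slightly uneven in each partition. If exact class balance matters, `train_test_split` accepts a `stratify` argument. The following is a minimal sketch on a hypothetical toy DataFrame (`toy` is not part of the notebook and merely stands in for `imdb_data`):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for imdb_data (10 positive, 10 negative)
toy = pd.DataFrame({
    "review": [f"review {i}" for i in range(20)],
    "sentiment": ["positive", "negative"] * 10,
})

# stratify= preserves the class proportions in both partitions
r_train, r_test, s_train, s_test = train_test_split(
    toy["review"], toy["sentiment"],
    test_size=0.2, random_state=42, stratify=toy["sentiment"],
)

train_counts = s_train.value_counts().to_dict()
test_counts = s_test.value_counts().to_dict()
print(train_counts)
print(test_counts)
```

With a perfectly balanced source such as this one, both partitions come out balanced as well. Whether the small imbalance of the unstratified split matters depends on the downstream model and metric.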


## Converting to TF-IDF vectors

In [77]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
reviews_train_vec = tfidf.fit_transform(reviews_train)
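# Only transform the test set: reuse the vocabulary and IDF weights learned from the training data.
# Calling fit_transform here would build a different vocabulary with a mismatched feature count.
reviews_test_vec = tfidf.transform(reviews_test)

print('Tfidf_train:', reviews_train_vec.shape)
print('Tfidf_test:', reviews_test_vec.shape)

Tfidf_train: (40000, 93003)
Tfidf_test: (10000, 93003)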
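
To illustrate why the training and test matrices need the same feature space, here is a small sketch on hypothetical miniature corpora (`train_docs` and `test_docs` are made up for illustration and stand in for `reviews_train` and `reviews_test`). The vectorizer is fitted once on the training texts and then only applied to the test texts:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical miniature corpora standing in for reviews_train / reviews_test
train_docs = ["a great movie", "a terrible movie", "great acting"]
test_docs = ["terrible acting", "an unseen masterpiece"]

vectorizer = TfidfVectorizer()
train_vec = vectorizer.fit_transform(train_docs)  # learns vocabulary and IDF weights
test_vec = vectorizer.transform(test_docs)        # reuses them, learns nothing new

print(sorted(vectorizer.vocabulary_))
print(train_vec.shape, test_vec.shape)

# Words that never occurred in the training data are ignored,
# so a document made up only of unseen words becomes an all-zero row.
print(test_vec[1].sum())
```

Both matrices have the same number of columns, so a classifier trained on `train_vec` can be applied to `test_vec` directly. Note that the default tokenizer drops single-character tokens such as "a".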


## Converting the sentiment labels to a binary representation

In [87]:
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
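sentiments_train_vec = lb.fit_transform(sentiments_train)
# Reuse the encoder fitted on the training labels so both sets share the same mapping
sentiments_test_vec = lb.transform(sentiments_test)

print(sentiments_train[:5])  # raw labels, for comparison with the encoded values below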

print("Training Labels (Sentiment)")
print(sentiments_train_vec.shape)
print(sentiments_train_vec[:5])
print("---------------------------")
print("Test Labels (Sentiment)")
print(sentiments_test_vec.shape)
print(sentiments_test_vec[:5])

39087    negative
30893    negative
45278    positive
16398    negative
13653    negative
Name: sentiment, dtype: object
Training Labels (Sentiment)
(40000, 1)
[[0]
 [0]
 [1]
 [0]
 [0]]
---------------------------
Test Labels (Sentiment)
(10000, 1)
[[1]
 [1]
 [0]
 [1]
 [0]]
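
The output above suggests that `negative` is encoded as 0 and `positive` as 1. One way to check the mapping explicitly is the `classes_` attribute together with `inverse_transform`. This is a minimal sketch on hypothetical toy labels (`train_labels` and `test_labels` are illustrative and not part of the notebook):

```python
from sklearn.preprocessing import LabelBinarizer

# Hypothetical toy labels standing in for sentiments_train / sentiments_test
train_labels = ["negative", "positive", "negative"]
test_labels = ["positive", "negative"]

encoder = LabelBinarizer()
train_enc = encoder.fit_transform(train_labels)  # learns the label-to-integer mapping
test_enc = encoder.transform(test_labels)        # applies the same mapping

# classes_ is sorted; for two classes, the position is the encoded value
print(encoder.classes_)
print(test_enc.ravel())

# inverse_transform maps the encoded values back to the original labels
decoded = encoder.inverse_transform(test_enc)
print(decoded)
```

Fitting the encoder only once guards against the two sets receiving different mappings, which could otherwise happen if one of them were missing a class.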
