## 🅰 Task a) Stop-word filtering by genre



In [1]:
from collections import Counter

import nltk
from nltk.corpus import brown, stopwords

# Fetch the stop-word list (assumes the Brown corpus is already installed locally)
nltk.download('stopwords')

genres = ['news', 'romance', 'hobbies']
stop_words = set(stopwords.words('english'))

for genre in genres:
    words = brown.words(categories=genre)
    total_words = len(words)  # counts every token, including punctuation
    # Lowercase, then keep alphabetic tokens that are not stop words
    filtered = [w.lower() for w in words if w.isalpha() and w.lower() not in stop_words]
    print(f"\nGenre: {genre}")
    print(f"- Original: {total_words} words")
    print(f"- After stop-word filter: {len(filtered)} words")

    freq = Counter(filtered)
    print(f"- Most frequent genre-typical words: {freq.most_common(5)}")


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Jonas\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!



Genre: news
- Original: 100554 words
- After stop-word filter: 46867 words
- Most frequent genre-typical words: [('said', 406), ('would', 246), ('new', 241), ('one', 213), ('last', 177)]

Genre: romance
- Original: 70022 words
- After stop-word filter: 27570 words
- Most frequent genre-typical words: [('said', 331), ('would', 247), ('could', 195), ('like', 189), ('one', 182)]

Genre: hobbies
- Original: 82345 words
- After stop-word filter: 37688 words
- Most frequent genre-typical words: [('one', 286), ('new', 149), ('may', 143), ('time', 133), ('first', 126)]


### Most frequent genre-typical words per genre

| Genre   | Word 1     | Word 2       | Word 3       | Word 4     | Word 5      |
|---------|------------|--------------|--------------|------------|-------------|
| news    | said (406) | would (246)  | new (241)    | one (213)  | last (177)  |
| romance | said (331) | would (247)  | could (195)  | like (189) | one (182)   |
| hobbies | one (286)  | new (149)    | may (143)    | time (133) | first (126) |
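
As a quick cross-check on the table above, several of the top words occur in more than one genre, so raw frequency alone does not isolate genre-specific vocabulary. The sketch below uses only the words from the table; the variable names are illustrative and not part of the original notebook.

```python
from collections import Counter

# Top-5 words per genre, copied from the table above.
top_words = {
    "news": ["said", "would", "new", "one", "last"],
    "romance": ["said", "would", "could", "like", "one"],
    "hobbies": ["one", "new", "may", "time", "first"],
}

# Count in how many genres' top-5 lists each word appears.
genre_count = Counter(word for words in top_words.values() for word in words)

# Keep only the words that appear in exactly one genre's list.
unique_top_words = {
    genre: [w for w in words if genre_count[w] == 1]
    for genre, words in top_words.items()
}

for genre, words in unique_top_words.items():
    print(f"{genre}: {words}")
```

Only `one` appears in all three lists, while `said`, `would` and `new` are each shared by two genres. This overlap is the kind of effect a weighting scheme such as TF-IDF is meant to reduce.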


## 📊 Analysis of stop-word filtering (Task a)


### 🔢 Before/after comparison

| Genre     | Original word count | After stop-word filtering | Reduction (%) |
|-----------|---------------------|---------------------------|---------------|
| news      | 100,554             | 46,867                    | approx. 53%   |
| romance   | 70,022              | 27,570                    | approx. 61%   |
| hobbies   | 82,345              | 37,688                    | approx. 54%   |
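
The percentages in the table follow directly from the token counts. The following is a minimal sketch of that arithmetic, using the counts printed by the first cell; the function and variable names are illustrative. Note that the filter in the first cell also drops non-alphabetic tokens such as punctuation and numbers, so the reduction is not caused by stop words alone.

```python
# (original token count, tokens left after filtering), taken from the cell output
counts = {
    "news": (100554, 46867),
    "romance": (70022, 27570),
    "hobbies": (82345, 37688),
}


def reduction_percent(original, remaining):
    """Share of tokens removed by the filter, in percent."""
    return 100 * (original - remaining) / original


reductions = {genre: reduction_percent(o, r) for genre, (o, r) in counts.items()}

for genre, pct in reductions.items():
    print(f"{genre}: {pct:.1f}% removed")
```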




## 🅱 Task b) Determining TF-IDF values
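
The cell for this task passes `max_df=0.9` and `min_df=2` to `TfidfVectorizer`. With only three genre documents, these limits have a strong effect. The pure-Python sketch below is not scikit-learn's own code, and its toy documents are made up; it only mimics the documented pruning rule, in which a float `max_df` is a proportion of all documents and an integer `min_df` is an absolute document count.

```python
from collections import Counter

# Toy stand-ins for the three genre documents (made up for illustration).
docs = {
    "news": "committee tax said one",
    "romance": "baby said one",
    "hobbies": "design tax one",
}


def surviving_terms(docs, min_df=2, max_df=0.9):
    """Mimic document-frequency pruning: integer min_df is an absolute
    document count, float max_df is a proportion of all documents."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for text in docs.values() for term in set(text.split()))
    max_count = max_df * n_docs  # 0.9 * 3 = 2.7, i.e. at most 2 documents
    return sorted(t for t, c in df.items() if min_df <= c <= max_count)


print(surviving_terms(docs))
```

Under this rule only terms that occur in exactly two of the three documents survive. That would explain why each term in the output has a TF-IDF value of 0.0 in one genre, so the "lowest values" lists are uninformative.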


In [2]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# One document per genre: all tokens of the genre joined into a single string
genre_docs = [' '.join(brown.words(categories=genre)) for genre in genres]

# Note: with three documents, max_df=0.9 drops terms found in all genres
# and min_df=2 drops terms found in only one genre
vectorizer = TfidfVectorizer(stop_words='english', lowercase=True, max_df=0.9, min_df=2)
X = vectorizer.fit_transform(genre_docs)
feature_names = vectorizer.get_feature_names_out()

# Rows are terms, columns are genres
df = pd.DataFrame(X.T.toarray(), index=feature_names, columns=genres)
for genre in genres:
    print(f"\n--- Lowest TF-IDF values in '{genre}' ---")
    print(df[genre].sort_values().head(5))
    print(f"\n--- Highest TF-IDF values in '{genre}' (typical!) ---")
    print(df[genre].sort_values(ascending=False).head(5))



--- Lowest TF-IDF values in 'news' ---
reasoned       0.0
proves         0.0
propriety      0.0
constantine    0.0
constructed    0.0
Name: news, dtype: float64

--- Highest TF-IDF values in 'news' (typical!) ---
000           0.299216
kennedy       0.183771
committee     0.181415
university    0.174347
tax           0.164922
Name: news, dtype: float64

--- Lowest TF-IDF values in 'romance' ---
000         0.0
miami       0.0
michigan    0.0
mid         0.0
midwest     0.0
Name: romance, dtype: float64

--- Highest TF-IDF values in 'romance' (typical!) ---
phil    0.274213
god     0.184796
mike    0.184796
hadn    0.178835
baby    0.178835
Name: romance, dtype: float64

--- Lowest TF-IDF values in 'hobbies' ---
sly         0.0
mood        0.0
morality    0.0
daley       0.0
moses       0.0
Name: hobbies, dtype: float64

--- Highest TF-IDF values in 'hobbies' (typical!) ---
design     0.190615
af         0.171553
site       0.149315
program    0.139784
000        0.139784
Name: h