TF-IDF

In [6]:
%pip install spacy
%pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl

Note: you may need to restart the kernel to use updated packages.

# Data pre-processing
Data pre-processing consists of text pre-processing and linguistic processing, which together bring the data into a form suitable for modelling.

In [2]:
import pandas as pd
import spacy

# Load the cleaned dataset
df = pd.read_csv('../datasets/complaints_data_cleaned.csv', usecols=["text"])
print(df.head(10))
print(df.shape)

                                                text
0  I used to love Comcast. Until all these consta...
1  I'm so over Comcast! The worst internet provid...
2  If I could give them a negative star or no sta...
3  I've had the worst experiences so far since in...
4  Check your contract when you sign up for Comca...
5  Thank God. I am changing to Dish. They gave me...
6  I Have been a long time customer and only have...
7  There is a malfunction on the DVR manager whic...
8  Charges overwhelming. Comcast service rep was ...
9  I have had cable, DISH, and U-verse, etc. in t...
(5626, 1)


# Text pre-processing
Text pre-processing covers text cleaning, which is carried out through noise reduction and standardisation.
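As a minimal sketch of the two cleaning steps, the following standard-library example separates noise reduction from standardisation. The helper name `clean_text` and the regular expressions are illustrative assumptions; they are not the spaCy-based filter used in this notebook.

```python
import re

def clean_text(text: str) -> str:
    """Hypothetical helper: basic noise reduction followed by standardisation."""
    text = re.sub(r"https?://\S+", " ", text)   # noise: drop URLs
    text = re.sub(r"[^A-Za-z\s]", " ", text)    # noise: drop digits, punctuation, symbols
    text = text.lower()                         # standardisation: lowercase
    return re.sub(r"\s+", " ", text).strip()    # standardisation: collapse whitespace

print(clean_text("I'd rather pay $10 extra!  See https://example.com"))
```

A regex-only approach like this splits contractions such as "I'd" into fragments, which is one reason a tokenizer-based pipeline can be preferable.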

In [None]:
## Text cleaning
import pandas as pd
import spacy
# Load the dataset (raw text)
df = pd.read_csv('../datasets/complaints_data.csv', usecols=["text"], nrows=5627)  # limit the number of rows

df["text"] = df["text"].fillna("")                                                 # replace missing values with empty strings

## spaCy pipeline (text cleaning, tokenisation, lemmatisation)
nlp = spacy.load("en_core_web_sm")                                          # English model (small version)

df["cleaned"] = [[token.lemma_.lower()
                  for token in doc
                  # Filter block
                  if (not token.is_stop and                                 # Standard: stop-word filter (general)
                      not token.is_punct and                                # Standard: punctuation filter
                      not token.like_num and                                # Standard: number filter (simple numbers)
                                                                            # Custom:   stop-word filter (custom)
                      len(token.text) > 2 and                               # Custom:   keep words with at least 3 characters
                      not any(char in token.text for char in ':/-–—') and   # Custom:   filter for date/time tokens
                      token.is_ascii and                                    # Custom:   emoji filter (drops all non-ASCII tokens)
                      token.pos_ != "PRON" and                              # Custom:   pronoun filter
                      token.text.lower() not in ["meh", "ugh"])]            # Custom:   word filter (needs tuning, e.g. for "complaint", "comcast")
                 for doc in nlp.pipe(df["text"], batch_size=50)]            # batch processing (texts are buffered in batches of 50)

print(f"Verarbeitet: {len(df)} Beschwerden")

## Spelling correction: not implemented
## Named Entity Recognition (NER): not implemented

# Display the result
df.head(10)[["text", "cleaned"]].style.set_properties(
    **{'text-align': 'left', 'width': '1000px', 'max-width': '1500px', 'font-size': '12px'}
)

Verarbeitet: 5627 Beschwerden


,text,cleaned
0,"I used to love Comcast. Until all these constant updates. My internet and cable crash a lot at night, and sometimes during the day, some channels don't even work and on demand sometimes don't play either. I wish they will do something about it. Because just a few mins ago, the internet have crashed for about 20 mins for no reason. I'm tired of it and thinking about switching to Wow or something. Please do not get Xfinity.","['love', 'comcast', 'constant', 'update', 'internet', 'cable', 'crash', 'lot', 'night', 'day', 'channel', 'work', 'demand', 'play', 'wish', 'min', 'ago', 'internet', 'crash', 'min', 'reason', 'tired', 'think', 'switch', 'wow', 'xfinity']"
1,I'm so over Comcast! The worst internet provider. I'm taking online classes and multiple times was late with my assignments because of the power interruptions in my area that lead to poor quality internet service. Definitely switching to Verizon. I'd rather pay $10 extra then dealing w/ Comcast and non stopping internet problems.,"['comcast', 'bad', 'internet', 'provider', 'take', 'online', 'class', 'multiple', 'time', 'late', 'assignment', 'power', 'interruption', 'area', 'lead', 'poor', 'quality', 'internet', 'service', 'definitely', 'switch', 'verizon', 'pay', 'extra', 'deal', 'comcast', 'non', 'stop', 'internet', 'problem']"
2,"If I could give them a negative star or no stars on this review I would. I have never worked with any industry with as bad of customer service as Comcast. It is not a matter of money because I make well enough above and beyond to afford their services but they are a legitimate ripoff. I think they are the biggest scam of since the mortgage industry's major meltdown and I hope I move somewhere where Comcast does not exist. The disregard to want to help or do the right thing is honestly astounding. If you have to call, which you do FOR ALL ISSUES - billing, connection/service, adding or removing service, errors, it does not matter you will be transferred minimum of 4 times. Everyone says the same thing and passes the issues to the next person and no one resolves the problem.They offer promotional packages in small timeframes and can never access them again so they then upgrade you without you wishing and change your billing. It has been 5 months and I have been overcharged $40 a month since I started with them. The blatant rudeness that must make you qualified to do this job is the type of quality service that gets you this review. So... Dear Comcast, you suck. 
Sincerely, a customer who cannot wait to never use your service again.","['negative', 'star', 'star', 'review', 'work', 'industry', 'bad', 'customer', 'service', 'comcast', 'matter', 'money', 'afford', 'service', 'legitimate', 'ripoff', 'think', 'big', 'scam', 'mortgage', 'industry', 'major', 'meltdown', 'hope', 'comcast', 'exist', 'disregard', 'want', 'help', 'right', 'thing', 'honestly', 'astounding', 'issues', 'billing', 'connection', 'service', 'add', 'remove', 'service', 'error', 'matter', 'transfer', 'minimum', 'time', 'say', 'thing', 'pass', 'issue', 'person', 'resolve', 'problem', 'offer', 'promotional', 'package', 'small', 'timeframe', 'access', 'upgrade', 'wish', 'change', 'billing', 'month', 'overcharge', 'month', 'start', 'blatant', 'rudeness', 'qualified', 'job', 'type', 'quality', 'service', 'get', 'review', 'dear', 'comcast', 'suck', 'sincerely', 'customer', 'wait', 'use', 'service']"
3,"I've had the worst experiences so far since install on 10/4/16. Nothing but problems. Two no shows on scheduled service appointments, extreme difficulty in adding boxes to the second floor. What is so difficult about adding boxes to an existing account? No thank you, I'm not starting a second account for the second floor of the same house! A separate bundle package? All I wanted was just to add a few boxes. Apparently this is not possible. Well then, I guess it's not possible to remain a customer!","['bad', 'experience', 'far', 'install', 'problem', 'show', 'schedule', 'service', 'appointment', 'extreme', 'difficulty', 'add', 'box', 'floor', 'difficult', 'add', 'box', 'exist', 'account', 'thank', 'start', 'account', 'floor', 'house', 'separate', 'bundle', 'package', 'want', 'add', 'box', 'apparently', 'possible', 'guess', 'possible', 'remain', 'customer']"
4,"Check your contract when you sign up for Comcast as their advertised offers do not match the contract they issue. I signed up for $49.99 150Mbps internet for 2 years, however my contract has $19.99 for 25Mbps internet for 2 years. They say there is an add on in place for $30 which boost it to Blast! Pro, however this isn't part of the contract, which means that Comcast can increase the price whenever they want within the 2 years. This means I haven't received the advertised rate. Comcast has so far refused to issue corrected contract, or issue in writing that the $30 will remain at that price for 2 years. I just have to trust them. So watch out, Comcast is doing the usual illegal practices, I'm guessing to catch people out and hope they don't notice and end up paying more than they should.","['check', 'contract', 'sign', 'comcast', 'advertised', 'offer', 'match', 'contract', 'issue', 'sign', '150mbps', 'internet', 'year', 'contract', '25mbps', 'internet', 'year', 'add', 'place', 'boost', 'blast', 'pro', 'contract', 'mean', 'comcast', 'increase', 'price', 'want', 'year', 'mean', 'receive', 'advertised', 'rate', 'comcast', 'far', 'refuse', 'issue', 'correct', 'contract', 'issue', 'write', 'remain', 'price', 'year', 'trust', 'watch', 'comcast', 'usual', 'illegal', 'practice', 'guess', 'catch', 'people', 'hope', 'notice', 'end', 'pay']"
5,Thank God. I am changing to Dish. They gave me awesome pricing and super people to deal with. You can actually understand what they are saying. I'm so excited to finally be able to return this equipment although still haven't received the home security yet as promised 4 times. Go to h*ll Comcast. You have made me miserable and cause me to miss many hours of work with your promises.,"['thank', 'god', 'change', 'dish', 'give', 'awesome', 'pricing', 'super', 'people', 'deal', 'actually', 'understand', 'say', 'excited', 'finally', 'able', 'return', 'equipment', 'receive', 'home', 'security', 'promise', 'time', 'h*ll', 'comcast', 'miserable', 'cause', 'miss', 'hour', 'work', 'promise']"
6,"I Have been a long time customer and only have Xfinity as my ISP for a while now. While I was in the local Walmart on November 4, 2016, there were customer representatives from Xfinity running promotions for and in the Salt Lake City area. Spoke with a representative and was able to get and signed a contract for Pro Blast at $50.00 a month with no contract or early termination fees. I received an email from Xfinity stating the changes that would be made to my account. It stated that not only would it be under contract for 24 months but there would be early termination fees. This is not what I had originally signed up for and it specifically states this on the contract that I signed. Contacted Xfinity customer service and was told since they cannot see the contract over the phone that I would need to go to Xfinity store in person. Went to Xfinity store on November 8, 2016 and was told that it would be under contract and there was no way around it. Because of this I have cancelled the upgrade and went back to my original plan. It's plain and simple. When a contract is signed it should be honored for what is stated on it. Xfinity is dishonest and not trustworthy. Therefore I will be looking and changing my ISP as soon as possible to another company. 
Xfinity does not deserve a paycheck from me or anyone else that I know.","['long', 'time', 'customer', 'xfinity', 'isp', 'local', 'walmart', 'november', 'customer', 'representative', 'xfinity', 'run', 'promotion', 'salt', 'lake', 'city', 'area', 'speak', 'representative', 'able', 'sign', 'contract', 'pro', 'blast', 'month', 'contract', 'early', 'termination', 'fee', 'receive', 'email', 'xfinity', 'state', 'change', 'account', 'state', 'contract', 'month', 'early', 'termination', 'fee', 'originally', 'sign', 'specifically', 'state', 'contract', 'sign', 'contacted', 'xfinity', 'customer', 'service', 'tell', 'contract', 'phone', 'need', 'xfinity', 'store', 'person', 'go', 'xfinity', 'store', 'november', 'tell', 'contract', 'way', 'cancel', 'upgrade', 'go', 'original', 'plan', 'plain', 'simple', 'contract', 'sign', 'honor', 'state', 'xfinity', 'dishonest', 'trustworthy', 'look', 'change', 'isp', 'soon', 'possible', 'company', 'xfinity', 'deserve', 'paycheck', 'know']"
7,"There is a malfunction on the DVR manager which is preventing us from adding more recordings. Customer service is fairly certain that the problem is from the signal from their system to ours, but protocol demands that they access our home before investigating that option. Since we work, that cannot be done until next Saturday. Customer service tech agreed that this seems illogical since logic would dictate that one would investigate the most probably malfunction first, but insists they must follow protocol. This is extremely frustrating. After 35 years as a customer of Comcast & their predecessors, I am investigating alternatives.","['malfunction', 'dvr', 'manager', 'prevent', 'add', 'recording', 'customer', 'service', 'fairly', 'certain', 'problem', 'signal', 'system', 'protocol', 'demand', 'access', 'home', 'investigate', 'option', 'work', 'saturday', 'customer', 'service', 'tech', 'agree', 'illogical', 'logic', 'dictate', 'investigate', 'probably', 'malfunction', 'insist', 'follow', 'protocol', 'extremely', 'frustrating', 'year', 'customer', 'comcast', 'predecessor', 'investigate', 'alternative']"
8,Charges overwhelming. Comcast service rep was so ignorant and rude when I call to resolve my issue with my bill. I emailed Tom ** his rep was rude to me. None of the representative was helpful. They all just pass me on to other people. I am cutting my service with Comcast.,"['charge', 'overwhelming', 'comcast', 'service', 'rep', 'ignorant', 'rude', 'resolve', 'issue', 'bill', 'email', 'tom', 'rep', 'rude', 'representative', 'helpful', 'pass', 'people', 'cut', 'service', 'comcast']"
9,"I have had cable, DISH, and U-verse, etc. in the past. All are eh... but you know what? Comcast takes the cake. I have never been driven to take time out of my day just to gripe online for all to see. But consumers, stay away! So my first terrible experience with Comcast is that they took 5 phones and 2 months to come out and bury the lines they had to lay in my front yard to get the cable needed into my house. Finally got someone when my special needs neighbor tripped and fell!Now 3 months into my contract, I have had my internet, phone, and TV go out for HOURS at a time. I would spend 3 hours on with a tech when it will come back up after the technician resets the router manually for the 3rd or 4th time. I have had it, I work from home occasionally and this is a huge inconvenience! The hardware is faulty, I understand that sometimes you get a lemon... but 3 months! 3 months! I have had it. Worst company ever. Crappy equipment and terrible customer service, and worse is the technicians they hire! Not a clue! Comcast should send a technician out here to switch out this equipment before I set a bonfire to it.","['cable', 'dish', 'verse', 'etc', 'past', 'know', 'comcast', 'take', 'cake', 'drive', 'time', 'day', 'gripe', 'online', 'consumer', 'stay', 'away', 'terrible', 'experience', 'comcast', 'take', 'phone', 'month', 'come', 'bury', 'line', 'lay', 'yard', 'cable', 'need', 'house', 'finally', 'get', 'special', 'need', 'neighbor', 'trip', 'fell!now', 'month', 'contract', 'internet', 'phone', 'hour', 'time', 'spend', 'hour', 'tech', 'come', 'technician', 'reset', 'router', 'manually', 'time', 'work', 'home', 'occasionally', 'huge', 'inconvenience', 'hardware', 'faulty', 'understand', 'lemon', 'month', 'month', 'bad', 'company', 'crappy', 'equipment', 'terrible', 'customer', 'service', 'bad', 'technician', 'hire', 'clue', 'comcast', 'send', 'technician', 'switch', 'equipment', 'set', 'bonfire']"


# Linguistic processing
Linguistic processing applies lexical, syntactic, and semantic processing steps to prepare the data for data preparation. The phase includes steps such as .... that annotate the language data and ..... 

## Vocabulary construction
Mapping of the filtered words (tokens) to IDs.

Word frequency thresholds (handled in the spaCy pipeline)
Possibly nouns only?

## Syntactic processing
## Semantic processing
### Semantic parsing
#### Named Entity Recognition (NER)
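The mapping from tokens to IDs, combined with a frequency threshold, can be sketched with the standard library alone. The function `build_vocabulary` and its `min_count` parameter are hypothetical names chosen for illustration; the notebook itself uses scikit-learn's `CountVectorizer` for this step.

```python
from collections import Counter

def build_vocabulary(token_lists, min_count=1):
    """Hypothetical helper: map tokens to integer IDs, most frequent first."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    kept = sorted(
        (tok for tok, c in counts.items() if c >= min_count),  # frequency threshold
        key=lambda tok: (-counts[tok], tok),                   # by count, ties alphabetical
    )
    return {tok: idx for idx, tok in enumerate(kept)}

# Toy token lists (illustrative only, not taken from the dataset)
docs = [["internet", "crash", "internet"], ["bill", "internet"], ["bill", "service"]]
vocab = build_vocabulary(docs, min_count=2)
print(vocab)
```

Sorting by descending count makes low IDs correspond to frequent words, which is a design choice of this sketch rather than a requirement.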

In [None]:
# Vocabulary construction
from sklearn.feature_extraction.text import CountVectorizer

# Join each token list into one string (the vectorizer expects strings)
df["text_cleaned"] = [" ".join(tokens) for tokens in df["cleaned"]]

vectorizer = CountVectorizer()                                                  # fit_transform learns the vocabulary and builds the count matrix in one step
X = vectorizer.fit_transform(df["text_cleaned"])

## Vocabulary
vocabulary = vectorizer.get_feature_names_out()                                 # extract the vocabulary
print(f"Vokabulargröße: {len(vocabulary)} Token (Wörter)")                      # print the vocabulary size

word_counts = X.sum(axis=0).A1                                                  # frequencies (sum per column)
vocab_df = pd.DataFrame({
    'word': vocabulary,
    'frequency': word_counts
}).sort_values('frequency', ascending=False)
vocab_df.index.name = 'ID'                                                      # label the ID column
vocab_df.head(50)

Vokabulargröße: 11947 Token (Wörter)


ID,word,frequency
2170,comcast,15230
9477,service,15149
10553,tell,7859
1729,call,7768
10730,time,6219
2725,customer,5726
1343,bill,5601
5655,internet,5240
9289,say,5215
6893,month,4946


# Data preparation
During data preparation, features are generated and selected. This is done through feature generation (featurization) and feature selection.

## Feature generation (featurization)
Feature generation is the process of deriving new, informative features from raw or pre-processed text. Unstructured data is converted by feature encoding into numerical or categorical representations that machine learning models can use.

# Vectorization
Vectorization is the feature encoding of text data. 
## Frequency-based vectors
### TF-IDF
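To make the weighting concrete, here is a small hand-rolled TF-IDF sketch. It assumes the smoothed variant that scikit-learn's `TfidfVectorizer` uses by default (`smooth_idf=True`, `norm="l2"`), that is idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalisation of each document vector. The function name `tfidf` and the toy documents are illustrative assumptions.

```python
import math
from collections import Counter

def tfidf(docs):
    """Hypothetical helper: smoothed TF-IDF with L2-normalised rows."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for doc in docs:
        tf = Counter(doc)                                  # raw term counts
        weights = {t: tf[t] * idf[t] for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows

# Toy token lists (illustrative only)
docs = [["internet", "crash", "internet"], ["bill", "service"], ["internet", "bill"]]
rows = tfidf(docs)
for i, row in enumerate(rows, start=1):
    print(i, {t: round(w, 3) for t, w in sorted(row.items())})
```

In the second toy document, "service" receives a higher weight than "bill" because it occurs in fewer documents, which is the effect the IDF term is meant to produce.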

In [None]:
# Vectorization
## Frequency-based vectors
### TF-IDF
# Requires vocab_df and df["text_cleaned"] from the vocabulary construction cell; run that cell first.
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Vocabulary in the format scikit-learn expects: {word: index}
# Note: these indices are frequency ranks, not the IDs shown in vocab_df's index.
vocabulary = {word: idx for idx, word in enumerate(vocab_df['word'])}

# Inverse mapping: {index: word}
idx_to_word = {idx: word for word, idx in vocabulary.items()}

# TF-IDF with the vocabulary built above
vectorizer = TfidfVectorizer(vocabulary=vocabulary)

# Compute TF-IDF on the cleaned tokens
tfidf_matrix = vectorizer.fit_transform(df["text_cleaned"])

print(f"TF-IDF Matrix Form: {tfidf_matrix.shape}\n")

# Flatten the sparse matrix into one row per non-zero entry
result_data = []
cx = tfidf_matrix.tocoo()
for i, j, v in zip(cx.row, cx.col, cx.data):
    result_data.append({
        'Beschwerde'    : i+1,                                                                                      # 1-based complaint index
        'ID-Vokabular'  : j,
        'Token (Wort)'  : idx_to_word[j], 
        'TF-IDF Score'  : v
    })

result_df = pd.DataFrame(result_data).sort_values(['Beschwerde', 'TF-IDF Score'], ascending=[True, False])
print(result_df.head(30).to_string(index=False))                                                                    # hide the pandas index column

NameError: name 'vocab_df' is not defined

## Word embeddings
### GloVe (Global Vectors for Word Representation)
GloVe stands for "Global Vectors for Word Representation". It is another vectorization method that is frequently used in NLP to represent semantic and syntactic information in a vector space. Whereas Word2Vec is a predictive model, GloVe is an unsupervised, count-based approach. It was developed because Pennington et al. (2014) concluded that the skip-gram approach in Word2Vec does not fully exploit the statistical information on word co-occurrence. They therefore combined the skip-gram approach with the advantages of matrix factorization. The GloVe model uses a co-occurrence matrix that holds information about word context. The resulting model was shown to outperform related models, in particular on named entity recognition and similarity tasks (Pennington et al., 2014).

(iu. DLBDSEAIS01_D, 2023, p. 70)
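The co-occurrence matrix that GloVe starts from can be sketched as follows. This is only the counting step, under the assumption of a symmetric context window with 1/distance weighting as described by Pennington et al. (2014); it does not train any vectors. The function name `cooccurrence` and the toy sentences are hypothetical.

```python
from collections import defaultdict

def cooccurrence(token_lists, window=2):
    """Hypothetical helper: symmetric, distance-weighted co-occurrence counts."""
    counts = defaultdict(float)
    for tokens in token_lists:
        for i, word in enumerate(tokens):
            # Look back at most `window` positions; each pair is added in both directions
            for j in range(max(0, i - window), i):
                context = tokens[j]
                weight = 1.0 / (i - j)          # closer words contribute more
                counts[(word, context)] += weight
                counts[(context, word)] += weight
    return dict(counts)

# Toy token lists (illustrative only)
docs = [["internet", "service", "crash"], ["internet", "crash"]]
counts = cooccurrence(docs, window=2)
print(counts[("internet", "crash")])
```

Here "internet" and "crash" are two positions apart in the first list (weight 0.5) and adjacent in the second (weight 1.0), so their entry sums to 1.5.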

### Word embeddings with Word2Vec

Bias in the dataset?

In [None]:
## Word embeddings with Word2Vec
import numpy as np
from gensim.models import Word2Vec

print("Starte Word2Vec-Training...")

# 1. Preparation: tokenised data from df["cleaned"]
sentences = df["cleaned"].tolist()  # list of token lists

# 2. Train the Word2Vec model
w2v_model = Word2Vec(
    sentences=sentences,
    vector_size=200,           # dimensionality of the word vectors
    window=5,                  # context window (±5 words)
    min_count=2,               # ignore words that occur fewer than 2 times
    workers=4,                 # number of worker threads
    sg=0,                      # 0=CBOW, 1=Skip-gram
    epochs=10
)

print(f"✓ Word2Vec-Modell trainiert!")
print(f"  Vokabular-Größe: {len(w2v_model.wv)}")
print(f"  Vektor-Dimension: {w2v_model.vector_size}")

# 3. Build an embedding matrix for the whole vocabulary
embedding_matrix = np.zeros((len(vocab_df), 200))

for idx, word in enumerate(vocab_df['word'].values):
    if word in w2v_model.wv:
        embedding_matrix[idx] = w2v_model.wv[word]
    else:
        # Word not in the model: use a small random vector
        embedding_matrix[idx] = np.random.randn(200) * 0.01

print(f"✓ Embedding-Matrix erstellt: {embedding_matrix.shape}")

# 4. Build document vectors (mean of all word vectors)
def get_document_vector(tokens, model, vector_size=200):
    """Compute the mean vector of a document; zeros if no token is in the model."""
    vectors = [model.wv[token] for token in tokens if token in model.wv]
    if vectors:
        return np.mean(vectors, axis=0)
    else:
        return np.zeros(vector_size)

# Document vectors for all complaints
document_embeddings = np.array([
    get_document_vector(tokens, w2v_model, 200) 
    for tokens in df["cleaned"]
])

print(f"✓ Dokumentenvektoren erstellt: {document_embeddings.shape}")

# 5. Examples: find similar words
print("\n--- Ähnliche Wörter (Word2Vec) ---")
test_words = ['internet', 'comcast', 'time']
for word in test_words:
    if word in w2v_model.wv:
        similar = w2v_model.wv.most_similar(word, topn=10)
        print(f"\nÄhnlich zu '{word}':")
        for similar_word, score in similar:
            print(f"  {similar_word}: {score:.3f}")
    else:
        print(f"\n'{word}' nicht im Modell")

Starte Word2Vec-Training...
✓ Word2Vec-Modell trainiert!
  Vokabular-Größe: 7085
  Vektor-Dimension: 200
✓ Embedding-Matrix erstellt: (11975, 200)
✓ Dokumentenvektoren erstellt: (5627, 200)

--- Ähnliche Wörter (Word2Vec) ---

Ähnlich zu 'internet':
  cutter: 0.615
  wifi: 0.613
  dsl: 0.581
  blast: 0.581
  landline: 0.578
  1yr: 0.572
  6mbs: 0.570
  mbps: 0.557
  advertised: 0.538
  100mbps: 0.528

Ähnlich zu 'comcast':
  service: 0.464
  arlington: 0.462
  company: 0.455
  shreveport: 0.453
  inclination: 0.452
  satisfied: 0.451
  lawyers: 0.448
  fraudulently: 0.443
  24h: 0.439
  wth: 0.428

Ähnlich zu 'time':
  hour: 0.682
  countless: 0.633
  dozen: 0.623
  hrs: 0.587
  numerous: 0.577
  phone: 0.575
  half: 0.572
  multiple: 0.562
  occasion: 0.557
  course: 0.550
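The scores listed above are cosine similarities between word vectors, which is what gensim's `most_similar` reports. As a minimal sketch, the measure can be computed by hand as below; the function name `cosine_similarity` and the three toy vectors are illustrative assumptions and are unrelated to the trained model.

```python
import math

def cosine_similarity(u, v):
    """Hypothetical helper: cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0                      # convention of this sketch for zero vectors
    return dot / (norm_u * norm_v)

# Toy 3-dimensional "word vectors" (made up for illustration)
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]   # same direction as vec_a
vec_c = [0.0, 0.0, 3.0]   # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))
print(cosine_similarity(vec_a, vec_c))
```

Because only the direction of the vectors matters, a score near 1 indicates that two words appear in similar contexts regardless of how frequent either word is.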


# Latent Dirichlet Allocation (LDA)
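Before LDA is trained in this section, the token lists are filtered by document frequency and converted to a bag-of-words corpus. The following standard-library sketch illustrates both ideas. The function names are hypothetical, and the thresholds only approximate the intent of the `no_below` and `no_above` parameters; gensim's exact rounding may differ.

```python
from collections import Counter

def filter_by_document_frequency(token_lists, no_below=1, no_above=0.99):
    """Hypothetical helper: keep tokens in >= no_below docs and <= no_above share of docs."""
    n_docs = len(token_lists)
    doc_freq = Counter(tok for tokens in token_lists for tok in set(tokens))
    return {tok for tok, d in doc_freq.items()
            if d >= no_below and d <= no_above * n_docs}

def to_bow(tokens, token2id):
    """Hypothetical helper: sorted list of (token_id, count) pairs for known tokens."""
    counts = Counter(tok for tok in tokens if tok in token2id)
    return sorted((token2id[tok], n) for tok, n in counts.items())

# Toy token lists (illustrative only)
docs = [["comcast", "bill", "bill"], ["comcast", "internet"], ["comcast", "service", "internet"]]
kept = filter_by_document_frequency(docs, no_below=2, no_above=0.9)
token2id = {tok: idx for idx, tok in enumerate(sorted(kept))}
corpus = [to_bow(doc, token2id) for doc in docs]
print(token2id, corpus)
```

With these toy thresholds, "comcast" is dropped for appearing in every document and "bill" and "service" for appearing in only one, so the first document becomes empty. This is why a corpus should be checked for empty documents after filtering.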

In [None]:
# Latent Dirichlet Allocation (LDA)
from gensim.models import LdaModel
from gensim.corpora import Dictionary
import pandas as pd

print("=" * 80)
print("Latent Dirichlet Allocation (LDA)")
print("=" * 80)

print("\n1. Überprüfe Datenverfügbarkeit...")

# Check whether the 'cleaned' column exists
if "cleaned" in df.columns:
    print("   ✓ 'cleaned' Spalte gefunden")
    cleaned_tokens = df["cleaned"].tolist()
else:
    print("   ✗ 'cleaned' Spalte nicht gefunden!")
    print("   Nutze 'text_cleaned' und konvertiere...")
    cleaned_tokens = [tokens.split() for tokens in df["text_cleaned"]]

print(f"   ✓ {len(cleaned_tokens)} Dokumente geladen")

# Check for empty documents
non_empty_count = sum(1 for doc in cleaned_tokens if len(doc) > 0)
empty_count = len(cleaned_tokens) - non_empty_count
print(f"   - Nicht-leere Dokumente: {non_empty_count}")
print(f"   - Leere Dokumente: {empty_count}")

# Debug: show the first tokens
if non_empty_count > 0:
    sample_doc = next((doc for doc in cleaned_tokens if len(doc) > 0), None)
    if sample_doc:
        print(f"   - Beispiel-Tokens (erste 10): {sample_doc[:10]}")

if non_empty_count == 0:
    print("   ✗ FEHLER: Alle Dokumente sind leer!")
else:
    # Remove empty documents
    cleaned_tokens = [doc for doc in cleaned_tokens if len(doc) > 0]
    print(f"   → Nach Filterung: {len(cleaned_tokens)} nicht-leere Dokumente")
    
    print(f"\n2. Erstelle Dictionary aus Token-Liste...")
    dictionary = Dictionary(cleaned_tokens)
    print(f"   Wörter VOR Filter: {len(dictionary)}")
    
    # Debug: show the most frequent words BEFORE filtering
    print(f"\n   Top 10 häufigste Wörter:")
    word_freq = {}
    for doc in cleaned_tokens:
        for word in doc:
            word_freq[word] = word_freq.get(word, 0) + 1
    
    sorted_words = sorted(word_freq.items(), key=lambda x: x[1], reverse=True)[:10]
    for word, freq in sorted_words:
        print(f"     '{word}': {freq}x")
    
    # MINIMAL FILTER: as mild as possible
    print(f"\n   Wende MINIMALE Filter an:")
    print(f"   - no_below=1 (Wort kommt mindestens 1x vor)")
    print(f"   - no_above=0.99 (Wort kommt in max. 99% der Dokumente vor)")
    
    initial_dict_size = len(dictionary)
    dictionary.filter_extremes(no_below=1, no_above=0.99, keep_n=100000)
    print(f"   Wörter NACH Filter: {len(dictionary)} (von {initial_dict_size})")
    
    if len(dictionary) == 0:
        print(f"\n   ⚠️ WARNUNG: Dictionary ist leer nach Filter!")
        print(f"   → Verwende Dictionary OHNE Filter...")
        dictionary = Dictionary(cleaned_tokens)
        print(f"   Dictionary (kein Filter): {len(dictionary)} Wörter")
    
    if len(dictionary) > 0:
        print(f"\n3. Erstelle Corpus (Bag-of-Words)...")
        corpus = [dictionary.doc2bow(doc) for doc in cleaned_tokens]
        
        # Debug: corpus statistics
        corpus_lengths = [len(doc) for doc in corpus]
        non_empty_corpus = sum(1 for doc in corpus if len(doc) > 0)
        
        print(f"   Corpus Größe: {len(corpus)} Dokumente")
        print(f"   Nicht-leere Dokumente im Corpus: {non_empty_corpus}")
        print(f"   Durchschn. Terme pro Dokument: {sum(corpus_lengths)/len(corpus_lengths):.1f}")
        print(f"   Min/Max Terme pro Dokument: {min(corpus_lengths)}/{max(corpus_lengths)}")
        
        # Show sample corpus entries
        sample_corpus = [doc for doc in corpus if len(doc) > 0][:3]
        print(f"\n   Sample Corpus-Dokumente (erste 3):")
        for i, doc in enumerate(sample_corpus):
            terms = [(dictionary[term_id], freq) for term_id, freq in doc[:5]]
            print(f"     Doc {i}: {len(doc)} Terme - {terms}")
        
        if non_empty_corpus > 0:
            # Remove documents with 0 terms for LDA
            filtered_corpus = [doc for doc in corpus if len(doc) > 0]
            filtered_tokens = [tokens for tokens, doc in zip(cleaned_tokens, corpus) if len(doc) > 0]
            
            print(f"\n4. Trainiere LDA Modell...")
            print(f"   Trainings-Daten: {len(filtered_corpus)} Dokumente mit {len(dictionary)} Wörtern im Dictionary")
            print(f"   Parameter:")
            print(f"   - Topics: 10")
            print(f"   - Passes: 20")
            print(f"   - Iterations: 400")
            print(f"   (Dies kann 5-15 Minuten dauern...)\n")
            
            try:
                model = LdaModel(
                    corpus=filtered_corpus,
                    id2word=dictionary,  # pass the Dictionary itself; id2token is only filled lazily
                    num_topics=10,
                    random_state=42,
                    chunksize=2000,
                    passes=20,
                    iterations=400,
                    per_word_topics=True,
                    minimum_probability=0.0,
                    alpha='auto',
                    eta='auto'
                )
                
                print("\n" + "=" * 80)
                print("✓ LDA MODELL ERFOLGREICH TRAINIERT!")
                print("=" * 80)
                print(f"\nModell-Statistiken:")
                print(f"  - Dictionary Größe: {len(dictionary)} Wörter")
                print(f"  - Corpus Größe: {len(filtered_corpus)} Dokumente")
                print(f"  - Anzahl Topics: 10")
                
                # Show top words per topic
                print(f"\n{'='*80}")
                print("TOP WORDS PER TOPIC (mit Gewichtungen)")
                print(f"{'='*80}")
                
                for idx in range(10):
                    terms = model.show_topic(idx, topn=10)
                    print(f"\nTopic {idx}:")
                    for term, weight in terms:
                        print(f"  {term:25s} {weight:.4f}")
                        
            except Exception as e:
                print(f"\n✗ FEHLER beim LDA-Training:")
                print(f"  {type(e).__name__}: {str(e)}")
                import traceback
                traceback.print_exc()
        else:
            print(f"\n✗ FEHLER: Alle Dokumente im Corpus sind leer!")
    else:
        print(f"\n✗ FEHLER: Dictionary ist leer und konnte nicht rekonstruiert werden!")

LATENT DIRICHLET ALLOCATION (LDA)

1. Überprüfe Datenverfügbarkeit...
   ✓ 'cleaned' Spalte gefunden
   ✓ 5627 Dokumente geladen
   - Nicht-leere Dokumente: 5627
   - Leere Dokumente: 0
   - Beispiel-Tokens (erste 10): ['love', 'comcast', 'constant', 'update', 'internet', 'cable', 'crash', 'lot', 'night', 'day']
   → Nach Filterung: 5627 nicht-leere Dokumente

2. Erstelle Dictionary aus Token-Liste...
   Wörter VOR Filter: 12628

   Top 10 häufigste Wörter:
     'comcast': 15215x
     'service': 15169x
     'tell': 7876x
     'call': 7783x
     'time': 6235x
     'customer': 5732x
     'bill': 5619x
     'internet': 5247x
     'say': 5219x
     'month': 4956x

   Wende MINIMALE Filter an:
   - no_below=1 (Wort kommt mindestens 1x vor)
   - no_above=0.99 (Wort kommt in max. 99% der Dokumente vor)
   Wörter NACH Filter: 12628 (von 12628)

3. Erstelle Corpus (Bag-of-Words)...
   Corpus Größe: 5627 Dokumente
   Nicht-leere Dokumente im Corpus: 5627
   Durchschn. Terme pro Dokument: 56.7
  

# BERTopic

In [9]:
import sklearn
import numpy as np
import pandas as pd
from bertopic import BERTopic

print("=" * 80)
print("BERTOPIC - Topic Modeling")
print("=" * 80)

# Prepare the data

docs = [" ".join(tokens) for tokens in df["cleaned"]]               # input for BERTopic (cleaned complaints joined into strings)
print(f"   ✓ {len(docs)} Dokumente aus 'cleaned' erstellt")

if docs is not None and len(docs) > 0:
    print(f"\n✓ Daten vorbereitet:")
    print(f"  Anzahl Dokumente (Beschwerden): {len(docs)}")
    print(f"  Erste 10 Beschwerden:")
    for i, doc in enumerate(docs[:10]):
        preview = doc[:80] + "..." if len(doc) > 80 else doc
        print(f"    {i+1}. {preview}")

    print(f"\n2. Starte BERTopic-Training...")
    print("   (Dies kann einige Minuten dauern...)\n")

    topic_model = BERTopic(language="english", calculate_probabilities=True)
    topics, probs = topic_model.fit_transform(docs)

    print(f"\n" + "=" * 80)
    print(f"✓ BERTopic abgeschlossen!")
    print(f"=" * 80)
    print(f"  Topics gefunden: {len(set(topics)) - (1 if -1 in topics else 0)}")  # exclude the outlier topic (-1) only if it is present
    print(f"  Dokumente verarbeitet: {len(topics)}")
else:
    print("\n✗ BERTopic konnte nicht trainiert werden (keine Daten)!")




BERTOPIC - Topic Modeling
   ✓ 5627 Dokumente aus 'cleaned' erstellt

✓ Daten vorbereitet:
  Anzahl Dokumente (Beschwerden): 5627
  Erste 10 Beschwerden:
    1. love comcast constant update internet cable crash lot night day channel work dem...
    2. comcast bad internet provider take online class multiple time late assignment po...
    3. negative star star review work industry bad customer service comcast matter mone...
    4. bad experience far install problem show schedule service appointment extreme dif...
    5. check contract sign comcast advertised offer match contract issue sign 150mbps i...
    6. thank god change dish give awesome pricing super people deal actually understand...
    7. long time customer xfinity isp local walmart november customer representative xf...
    8. malfunction dvr manager prevent add recording customer service fairly certain pr...
    9. charge overwhelming comcast service rep ignorant rude resolve issue bill email t...
    10. cable dish verse et


In [8]:
topic_model.get_topic(1)

NameError: name 'topic_model' is not defined