# LDA topic modelling

### LDA models

For each portal I create a topic model to check what are most prominent and talked about topics. Models will be evaluated based on coherence scores, interpretability, pyLDAvis visualizations . After creating optimal topic models (high coherence score + interpretability, optimal distributions on plots I will check wether created topics are simmilar between the portals. I want to find ca. 5 clear, interpretable topics that occure in multiple portals and further analyse them

### Emotional evaluation of topics

Second part of the notebook will be dedicated to evaluating with what emotions are the topics covered in various portals. For this purpouse I will use the emotional annotation dictionary provided by Polish Wordnet. Each article will be assigned with emotions that are most prominent in words used in it, and emotional valuations carried by those words. 
Interpretation of these scores and texts will a base for researching wether different topics evoke different emotions in written texts among various newscasters.


## Part 1: topic modelling

In [1]:
# basic libraries
import pandas as pd
import numpy as np
import random
import datetime
import re
import pprint
# plotting
import matplotlib.pyplot as plt
import seaborn as sns
import pyLDAvis
import pyLDAvis.gensim

#statistics
from scipy import stats
import scipy.stats as ss

# NLP
from sklearn.feature_extraction.text import CountVectorizer
from wordcloud import WordCloud, STOPWORDS 
from langdetect import detect
import spacy
from collections import Counter
import nltk
from nltk.collocations import *

# LDA Multicore model
import gensim
import gensim.corpora as corpora
from gensim.models import CoherenceModel
from gensim.models.tfidfmodel import TfidfModel
from gensim.corpora.dictionary import Dictionary

scipy.sparse.sparsetools is a private module for scipy.sparse, and should not be used.
  _deprecated()


In [2]:
# reading data, saving dataframes in dict for easier processing

data = {"gazeta_pl": pd.read_csv(r"C:\Users\szklarnia\Desktop\datascience\projekty\monitoring prasy\scraping_data\gazeta_pl.csv",sep="^"),
       "oko_press": pd.read_csv(r"C:\Users\szklarnia\Desktop\datascience\projekty\monitoring prasy\scraping_data\OKO_press_all.csv",sep="^"),
       "onet": pd.read_csv(r"C:\Users\szklarnia\Desktop\datascience\projekty\monitoring prasy\scraping_data\Onet.csv",sep="^",error_bad_lines=False),
       "wp":pd.read_csv(r"C:\Users\szklarnia\Desktop\datascience\projekty\monitoring prasy\scraping_data\wp.csv",sep="^",error_bad_lines=False),
       "wPolityce":pd.read_csv(r"C:\Users\szklarnia\Desktop\datascience\projekty\monitoring prasy\scraping_data\wPolityce.csv",sep="^"),
       "niezalezna":pd.read_csv(r"C:\Users\szklarnia\Desktop\datascience\projekty\monitoring prasy\scraping_data\Niezalezna.csv",sep="^").rename({"Content\r":"Content"},axis=1),
       "krypol":pd.read_csv(r"C:\Users\szklarnia\Desktop\datascience\projekty\monitoring prasy\scraping_data\krypol.csv",sep="^")}
        



  and should_run_async(code)
b'Skipping line 1195: expected 5 fields, saw 9\n'


In [56]:
data["niezalezna"].Content = [re.sub(r"<.+>"," ",x) for x in data["niezalezna"].Content]

  and should_run_async(code)
  self._context.run(self._callback, *self._args)


In [58]:
data["niezalezna"].Content[1]

  and should_run_async(code)


'niemcy twierdzą  że kwestia reparacji za szkody wyrządzone polsce podczas ii wojny światowej jest zamknięta  odmienną opinię w tej sprawie przedstawił dziś wiceminister kultury  dziedzictwa narodowego i sportu jarosław sellin    sprawa reparacji wojennych nie jest zamknięta  polska jest na szarym końcu  jeżeli chodzi o państwa  które reparacjami niemieckimi zostały obdarowane   powiedział dziś sellin  wiceminister skomentował w polskim radiu 24 wypowiedź ambasadora republiki federalnej niemiec w polsce arndta freytaga von loringhovena  który powiedział  że sprawa reparacji wojennych jest  politycznie i prawnie zamknięta   a  niemcy już wypłaciły polsce 2 mld euro reparacji za ii wojnę światową  polska na szarym końcuwedług wiceszefa mkidn   ta sprawa zamknięta nie jest   nawet z punktu widzenia prawnego  bo nawet komunistyczne państwa  które te umowy reparacyjne z zachodnimi niemcami podpisywały  nie wypełniły wewnętrznie swoich zobowiązań  bo np  część reparacji  która szła dla związ

In [3]:
def preprocess_texts(data,npl):
    # removing punctuation
    punctuation = ["/",".",":",",",")","(","\"","_","-","?","!","...","„","”","–","—","…","[","]","^","'"]
    for sign in punctuation:
        data.Content = [x.replace(sign," ") for x in data.Content.astype("str")]
    data.Content = [x.lower() for x in data.Content]
    
    
    f = open("polish.stopwords.txt","r",encoding="utf8")
    stopwords = f.readlines()
    stopwords = [x.replace("\n","") for x in stopwords] + [" ","  ","   ","    ","mieć","miec","zostac","zostać","r","osoba","być","byc","polski","mln",
                                                      "swój","móc","moc","mówić","mowic","proc","rok","stycznia","styczeń","styczniem","2021","2019","2020"]
    # create list of word representation of each text
    list_of_words = [[y.strip() for y in x.split(" ") if y not in stopwords] for x in data.Content]
    
    # create text representation with collocations
    
    bigram = gensim.models.Phrases(list_of_words, min_count=5)
    bigram_mod = gensim.models.phrases.Phraser(bigram)
    
    data["content_with_bigrams"] = [bigram_mod[x] for x in list_of_words]
    
    
    # process texts with spacy
    data["processed"] = [nlp(x) for x in data["content_with_bigrams"].astype("str")]
    
    # lemmatization

    allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']
    for key in data:
        data["lemmatized"] = [[x.lemma_ for x in article if x.pos_ in allowed_postags if x.lemma_ not in stopwords] for article in data.processed]
    
    return data

  and should_run_async(code)


In [4]:
# loading spacy model, processing texts

nlp = spacy.load("pl_core_news_md",disable=['parser', 'ner'])
for key in data.keys():
    print(key)
    data[key] = preprocess_texts(data[key],nlp)

  and should_run_async(code)


gazeta_pl
oko_press
onet
wp
wPolityce
niezalezna
krypol


In [59]:
data["niezalezna"] = preprocess_texts(data["niezalezna"],nlp)

  and should_run_async(code)


In [5]:
data["oko_press"]

  and should_run_async(code)


Unnamed: 0.1,Unnamed: 0,Title,Author,Date,Link,Content,content_with_bigrams,processed,lemmatized
0,0,Nigdy nie będziesz szła sama. Ogólnopolski Str...,Paweł Leszkowicz,28 lutego 2021,https://oko.press/ogolnopolski-strajk-kobiet-j...,nigdy nie będziesz szła sama to pierwsza wysta...,"[będziesz_szła, pierwsza, wystawa, kraju, doku...","([, ', będziesz_szła, ', ,, ', pierwsza, ', ,,...","[będziesz_szła, pierwszy, wystawa, kraj, dokum..."
1,1,"Obajtek klnie, bo jest chory? Stowarzyszenie S...",OKO.press,28 lutego 2021,https://oko.press/obajtek-klnie-bo-jest-chory-...,nie można każdego przekleństwa tłumaczyć tikam...,"[każdego, przekleństwa, tłumaczyć, tikami, prz...","([, ', każdego, ', ,, ', przekleństwa, ', ,, '...","[przekleństwo, tłumaczyć, tik, przekleństwo, w..."
2,2,Lichocka skarży do sądu aktywistów za billboar...,Maria Pankowska,28 lutego 2021,https://oko.press/lichocka-skarzy-do-sadu-akty...,posłanka pis joanna lichocka domaga się ukaran...,"[posłanka, pis, joanna_lichocka, domaga, ukara...","([, ', posłanka, ', ,, ', pis, ', ,, ', joanna...","[posłanek, pis, joanna_lichocka, domagać, ukar..."
3,3,"O wuju, który nie był wujem: Spychalski broni ...",Michał Danielewski,28 lutego 2021,https://oko.press/pis-broni-obajtka-metoda-na-...,gazeta wyborcza ujawniła nagrania świadczące o...,"[gazeta_wyborcza, ujawniła, nagrania, świadczą...","([, ', gazeta_wyborcza, ', ,, ', ujawniła, ', ...","[gazeta_wyborcza, ujawnić, nagranie, świadczyć..."
4,4,Lewica chce znieść przedawnienia przestępstw s...,Sebastian Klauziński,28 lutego 2021,https://oko.press/lewica-pedofilia-przedawnienie/,dzwonili do nas 70 latkowie którzy płakali w ...,"[dzwonili, 70, latkowie, płakali, słuchawkę, z...","([, ', dzwonili, ', ,, ', 70, ', ,, ', latkowi...","[dzwonić, 70, latkowie, płakać, słuchawka, zap..."
...,...,...,...,...,...,...,...,...,...
287,287,Były ambasador z nadania PiS ostro o sytuacji ...,Maria Pankowska,2 lutego 2021,https://oko.press/byly-ambasador-z-nadania-pis...,wystarczy że sekretarka na polecenie ministr...,"[wystarczy, sekretarka, polecenie, ministra, n...","([, ', wystarczy, ', ,, ', sekretarka, ', ,, '...","[wystarczyć, sekretarka, polecenie, minister, ..."
288,288,Kopalnia zniszczy „europejską Amazonię” w Pols...,Robert Jurszo,1 lutego 2021,https://oko.press/kopalnia-zagraza-europejskie...,poleski park narodowy to kraina dzikiej przyro...,"[poleski, park_narodowy, kraina, dzikiej, przy...","([, ', poleski, ', ,, ', park_narodowy, ', ,, ...","[poleski, park_narodowy, kraina, dziki, przyro..."
289,289,Mieszkam w Afganistanie. Zapraszam Przyłębską ...,Jagoda Grondecka,1 lutego 2021,https://oko.press/zapraszam-magister-przylebsk...,mam nadzieję że aborcyjny wyrok trybunału prz...,"[nadzieję, aborcyjny, wyrok_trybunału, przyłęb...","([, ', nadzieję, ', ,, ', aborcyjny, ', ,, ', ...","[nadzieja, aborcyjny, wyrok_trybunału, przyłęb..."
290,290,Minister Ziobro w poszukiwaniu prawdy. Analizu...,Krzysztof Izdebski,1 lutego 2021,https://oko.press/analizujemy-rzadowy-projekt-...,ustawę przygotowano w ministerstwie sprawiedli...,"[ustawę, przygotowano, ministerstwie_sprawiedl...","([, ', ustawę, ', ,, ', przygotowano, ', ,, ',...","[ustawa, przygotować, ministerstwie_sprawiedli..."


In [6]:
def test_ldas(texts,dictionary,corpus,min_topics,max_topics):
    """
    Function computes several LDA models and evaluates them based on Coherence. As LDA's need to be evaluated further
    function saves each model for further evaluation
    
    texts - collection of lemmatized texts to perform lda on
    dictionary - id2word representation of words in texts
    corpus - corpus representation of texts 
    min_topics, max_topics - minimum and maximum nubmers of topics to test"""
    
    # dicts for storing models and evaluation scores
    scores = {}
    ldas = {}
    
    # creating models with topics ranging from minimum to maximum 
    for i in range(min_topics,max_topics,2):
        # lda model
        lda = gensim.models.ldamulticore.LdaMulticore(corpus=corpus,
                                       id2word=dictionary,
                                       num_topics=i,
                                       workers=5,
                                       passes = 10)
        
        # coherence
        coherence_model_lda = CoherenceModel(model=lda, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_lda = coherence_model_lda.get_coherence()
        # perplexity
        perplexity = lda.log_perplexity(corpus)
        
        # saving results
        scores[i] = [coherence_lda,perplexity]
        ldas[i] = lda
        
            
    return scores, ldas
        

  and should_run_async(code)


In [7]:
# for each dataset creating topic models with number of topics ranging from 5 to 40

models = {}
scores = {}

for key in data:
    print(key)
    
    # creating variables for test_ldas function
    dictionary = Dictionary(data[key].lemmatized)
    corpus= [dictionary.doc2bow(text) for text in data[key].lemmatized]
    texts = data[key].lemmatized
    
    # testing
    score,lda = test_ldas(texts,dictionary,corpus,5,40)

    # saving results
    models[key] = lda
    scores[key] = score
    

  and should_run_async(code)
  self._context.run(self._callback, *self._args)


gazeta_pl
oko_press
onet
wp
wPolityce
niezalezna
krypol


In [60]:

# creating variables for test_ldas function
dictionary = Dictionary(data["niezalezna"].lemmatized)
corpus= [dictionary.doc2bow(text) for text in data["niezalezna"].lemmatized]
texts = data["niezalezna"].lemmatized

# testing
score,lda = test_ldas(texts,dictionary,corpus,5,40)

# saving results
models["niezalezna"] = lda
scores["niezalezna"] = score

  and should_run_async(code)
  self._context.run(self._callback, *self._args)


In [8]:
# viewing scores
scores

  and should_run_async(code)


{'gazeta_pl': {5: [0.40459265605880157, -8.841336881342095],
  7: [0.42328953895121, -8.85377928871909],
  9: [0.41735599221782294, -8.855912838164343],
  11: [0.3862752548497339, -8.899521788861955],
  13: [0.40973813172249746, -8.888006355726787],
  15: [0.4194339073654829, -8.890476775284048],
  17: [0.42574532590868275, -8.910663450162994],
  19: [0.48869801536944707, -8.906176871693711],
  21: [0.41036205967447775, -8.97764251355891],
  23: [0.4516816139424929, -8.97722898690684],
  25: [0.4306490002002974, -8.983912797541405],
  27: [0.4458729528453009, -9.039430876078649],
  29: [0.44748791930545195, -9.05333531103463],
  31: [0.4550626094457277, -9.031018332043516],
  33: [0.48977696404575505, -9.105741509024353],
  35: [0.43292626884232516, -9.139288198239388],
  37: [0.45212785396457644, -9.133756174094302],
  39: [0.43596570225907383, -9.138857870773144]},
 'oko_press': {5: [0.35183216179662957, -8.82748254708079],
  7: [0.3169489834152884, -8.820476776998547],
  9: [0.37320

# Further evaluation of models

I checked couple of models with the best score for each portal, but only left the best ones in the notebook not to make it messy. LDAvis also allows to check keywords for each topic, so it allows to check interpretability simultainously. 

## Gazeta.pl - checking numbers of topics: 9,19,23

In [12]:
dictionary = Dictionary(data["gazeta_pl"].lemmatized)
corpus= [dictionary.doc2bow(text) for text in data["gazeta_pl"].lemmatized]

lda_display = pyLDAvis.gensim.prepare(models["gazeta_pl"][23], corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)

  and should_run_async(code)
  self._context.run(self._callback, *self._args)


9 topics - Good distribution. Topics hard to interpret, very broad.  

19 topics - Worse distribution, but nice interpretability for majority of topics.  
23 topics - distribution on the plot quite clustered. Good interpretability.  

19 or 23 will be chosen as best

## OKO_press, testing: 17,25,33 topics

In [15]:
dictionary = Dictionary(data["oko_press"].lemmatized)
corpus= [dictionary.doc2bow(text) for text in data["oko_press"].lemmatized]

lda_display = pyLDAvis.gensim.prepare(models["oko_press"][33], corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)

  and should_run_async(code)
  self._context.run(self._callback, *self._args)


13 topics - Clusters really overlapping, to broad to interpret  
19 topics - Clusters overlapping, some topics very interpretable, some still to broad 
25 topics - distribution again clustered, in some cases two topics still merged as one  
33 topics - distribution far from ideal, but best interpretability  

33 - best

## Onet - 13,21,27

In [19]:
dictionary = Dictionary(data["onet"].lemmatized)
corpus= [dictionary.doc2bow(text) for text in data["onet"].lemmatized]

lda_display = pyLDAvis.gensim.prepare(models["onet"][27], corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)

  and should_run_async(code)
  self._context.run(self._callback, *self._args)


21 or 27


## WP - 21,35

In [42]:
dictionary = Dictionary(data["wp"].lemmatized)
corpus= [dictionary.doc2bow(text) for text in data["wp"].lemmatized]

lda_display = pyLDAvis.gensim.prepare(models["wp"][35], corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)

  and should_run_async(code)
  self._context.run(self._callback, *self._args)


# wpolityce 21,31

In [23]:
dictionary = Dictionary(data["wPolityce"].lemmatized)
corpus= [dictionary.doc2bow(text) for text in data["wPolityce"].lemmatized]

lda_display = pyLDAvis.gensim.prepare(models["wPolityce"][31], corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)

  and should_run_async(code)
  self._context.run(self._callback, *self._args)


In [26]:
dictionary = Dictionary(data["niezalezna"].lemmatized)
corpus= [dictionary.doc2bow(text) for text in data["niezalezna"].lemmatized]

lda_display = pyLDAvis.gensim.prepare(models["niezalezna"][21], corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)

  and should_run_async(code)
  self._context.run(self._callback, *self._args)


In [28]:
dictionary = Dictionary(data["krypol"].lemmatized)
corpus= [dictionary.doc2bow(text) for text in data["krypol"].lemmatized]

lda_display = pyLDAvis.gensim.prepare(models["krypol"][15], corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)

  and should_run_async(code)
  self._context.run(self._callback, *self._args)


In [43]:
best_models = {"gazeta_pl":models["gazeta_pl"][23],
              "oko_press":models["oko_press"][33],
              "onet":models["onet"][27],
              "wp":models["wp"][35],
              "wPolityce":models["wPolityce"][31],
              "niezalezna":models["niezalezna"][21],
              "krypol":models["krypol"][15]}

  and should_run_async(code)


## Identifying well interpretable topics that occure in more than one portal

I want to find couple of well interpretable, not too broad topics, that could be polarising and covered differently in different portals. I will be inspecting keywords for each topic in each portal to find them

In [49]:
best_models["niezalezna"].print_topics(-1,num_words=20)

  and should_run_async(code)


[(0,
  '0.003*"poinformować" + 0.003*"nowy" + 0.003*"zespół" + 0.003*"mecz" + 0.003*"powiedzieć" + 0.002*"dodać" + 0.002*"1" + 0.002*"bbc" + 0.002*"pierwszy" + 0.002*"podkreślić" + 0.002*"chiński" + 0.002*"czas" + 0.002*"grupa" + 0.002*"m_in" + 0.002*"wynik" + 0.002*"członek" + 0.002*"sprawa" + 0.002*"chcieć" + 0.001*"wielki" + 0.001*"sąd"'),
 (1,
  '0.003*"linia" + 0.003*"nowy" + 0.003*"żołnierz" + 0.003*"kwarantanna" + 0.002*"pociąg" + 0.002*"trasa" + 0.002*"granica" + 0.002*"c" + 0.002*"miejsce" + 0.002*"kraj" + 0.002*"wskazać" + 0.002*"sobota" + 0.002*"kolejowy" + 0.002*"m_in" + 0.002*"godzina" + 0.002*"polsce" + 0.002*"transport" + 0.002*"temperatura" + 0.002*"wysokość" + 0.002*"powiedzieć"'),
 (2,
  '0.006*"sprawa" + 0.005*"prokuratura" + 0.004*"miasto" + 0.004*"m_in" + 0.004*"spółka" + 0.004*"dotyczyć" + 0.004*"projekt" + 0.003*"sąd" + 0.003*"teren" + 0.003*"prowadzić" + 0.003*"umowa" + 0.003*"postępowanie" + 0.002*"praca" + 0.002*"mieszkaniec" + 0.002*"nowy" + 0.002*"zmiana" + 

Tomasz Greniuch IPN affair: gazeta_pl- topic[0]; oko_press - topic[22]  
Educational COVID restrictions: gazeta_pl - topic[4]; oko_press - topic[1], onet - topic[0], wPolityce - topic[21]  
Andrzej Dymer church peadofilia: gazeta_pl - topic[8]; oko_press - topic[10]  
Media Tax: gazeta_pl - topic[12], oko_press - topic[5], onet - topic[11], wPolityce - topic[17,26]  
Trump Impeachment: gazeta_pl - topic[16], oko_press - topic[29]  
Porozumienie split in government: gazeta_pl - topic[17], oko_press - topic[12], wp - topic[22], onet - topic[4], wPolityce - topic[22]  
Abortion: gazeta_pl - topic[20], oko_press - topic[16], wPolityce - topic[25]  
Vaccination program: gazeta_pl - topic[14], oko_press - topic[13], onet - topic[23], wPolityce - topic[10,13]  
Academic Evaluations: oko_press - topic[2]  

In [None]:
def get_dominant_topic(lda_model,text):


    doc = dictionary.doc2bow(text)
    
    top_list = lda_model.get_document_topics(doc)
    
    tpcs = {}
    
    for top in top_list:
        tpcs[top[0]] = top[1]
    

    return tpcs

In [None]:
for key in data.keys():
    
    lda_model = topics[key]
    data[key]["dom_topic"] = [get_dominant_topic(lda_model,text) for text in data[key].lemmatized]
    

In [None]:
data["onet"].dom_topic.value_counts()