# Extracting Topics from News Articles on [meduza.io](https://meduza.io/en)

In this notebook, we will extract topics from news articles on the [Meduza website](https://meduza.io/en) using scikit-learn.

Meduza is a Riga-based online newspaper writing about Russia.

First, we need to collect links to the news articles. <br>Luckily, Meduza has an [RSS feed](https://meduza.io/rss/en/all) that is pretty easy to parse using BeautifulSoup4.

In [1]:
import requests
from bs4 import BeautifulSoup as bs
import re
from pprint import pprint
import pandas as pd


# Meduza filters requests by User-Agent, so we send the User-Agent string of the Chrome browser
user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
headers = {'User-Agent': user_agent}

# collecting the page
r = requests.get('https://meduza.io/rss/en/all', headers=headers)

# parsing the page, extracting URLs
soup = bs(r.text, 'xml')
links = [l.text for l in soup.find_all('link') if 'https://meduza.io/en/' in l.text]
print(f'{len(links)} links were collected')

30 links were collected
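
The same link extraction can also be sketched with the standard library alone, which is handy for checking the filtering logic offline. The feed snippet below is a made-up stand-in for the real RSS document (its titles and URLs are hypothetical), and `extract_links` is an illustrative helper, not part of the notebook's code.

```python
import xml.etree.ElementTree as ET

# A tiny, hypothetical RSS document standing in for the real feed
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>https://meduza.io/</link>
    <item>
      <title>First article</title>
      <link>https://meduza.io/en/news/2020/03/13/first-article</link>
    </item>
    <item>
      <title>Second article</title>
      <link>https://meduza.io/en/feature/2020/03/13/second-article</link>
    </item>
  </channel>
</rss>"""


def extract_links(rss_text, prefix="https://meduza.io/en/"):
    """Return the text of every <link> element that contains the given prefix."""
    root = ET.fromstring(rss_text)
    return [el.text.strip() for el in root.iter("link")
            if el.text and prefix in el.text]


sample_links = extract_links(SAMPLE_RSS)
print(f"{len(sample_links)} links were collected")
```

The channel-level `<link>` is dropped because it does not contain the `/en/` prefix, which is the same effect the substring check has in the cell above.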


We have collected links to the 30 latest articles. Now let's fetch these web pages and parse their content.

After examining several pages [[1]](https://meduza.io/en/feature/2020/03/13/selling-vedomosti)[[2]](https://meduza.io/en/feature/2020/03/13/a-faction-s-a-faction-but-i-have-my-conscience-too)[[3]](https://meduza.io/en/feature/2020/03/13/the-kremlin-lies-to-kids),
I found that all article-related text elements contain `'SimpleBlock-'` in their class attribute. <br>
The article title is always the first element with `'Title-root'` in its class attribute.
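
To make the class-based filter concrete, the sketch below tests the two regular expressions used by the parser against a few class names. The class names are invented for illustration and are not copied from the real pages. The sketch also assumes that BeautifulSoup applies a compiled pattern to each CSS class with `search`, which is how its documentation describes class matching.

```python
import re

# The same patterns the parser uses for paragraphs and for the title
paragraphs = re.compile('^SimpleBlock-.*')
titles = re.compile('.*Title-root$')

# Hypothetical class names, invented for this example
sample_classes = [
    'SimpleBlock-module_p',    # looks like an article paragraph
    'RichTitle-root',          # looks like an article title
    'Timestamp-root',          # ends in "-root" but is not a title
    'Footer-SimpleBlock-x',    # contains "SimpleBlock-" but not at the start
]

# For every class name, record whether each pattern matches
results = {
    cls: (bool(paragraphs.search(cls)), bool(titles.search(cls)))
    for cls in sample_classes
}

for cls, (is_paragraph, is_title) in results.items():
    print(f'{cls}: paragraph={is_paragraph}, title={is_title}')
```

Note that the `^` anchor makes the paragraph pattern stricter than a plain "contains" check: a class that merely includes `SimpleBlock-` somewhere in the middle is rejected.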


The page structure is quite straightforward, so let's try to parse it.

In [2]:
def parse_meduza_page(link):
    '''Extracting data from a news article page'''
    # collecting the page
    r = requests.get(link, headers=headers)
    # loading page's html to BeautifulSoup4
    soup = bs(r.text, 'lxml')
    
    # building regular expressions to filter paragraphs and header
    paragraphs = re.compile('^SimpleBlock-.*')
    titles = re.compile('.*Title-root$')
    
    # extracting the article text and its title
    pagetext = ' '.join(p.text for p in soup.find_all(True, paragraphs))
    title = soup.find(True, titles).text
    # extracting the timestamp
    timestamp = soup.find(True, 'Timestamp-root').text

    # returning the parsed data as a dict
    return dict(pagetext=pagetext,
                title=title,
                timestamp=timestamp,
                url=r.url)

In [3]:
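articles = [parse_meduza_page(l) for l in links]
df = pd.DataFrame.from_records(articles)
# index=False keeps an extra unnamed index column out of the file on reload
df.to_csv('./meduza.csv', index=False)
# df = pd.read_csv('./meduza.csv')

# Styler.hide(axis='index') replaces hide_index(), which pandas 2.0 removed
df.head(3).style.hide(axis='index')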

pagetext,title,timestamp,url
"On March 12, the State Council of Tatarstan voted in favor of major changes to the Russian Constitution, falling into step with other regional parliaments whose approval was needed to move toward a nationwide vote on the amendments. In every region, legislators voted for the changes almost unanimously — even on a nationwide scale, there were only a few negative votes. One of them was the singular “no” among Tatarstan’s deputies: Rkail Zaidullin, a member of the United Russia faction (but not the associated party, which is politically dominant in Russia). Zaidullin said he specifically objected to the clause in the proposed amendments that calls ethnically Russian people foundational to the Russian state. Meduza special correspondent Andrey Pertsev spoke with the legislator and asked him to elaborate on his views. Did you vote against the constitutional amendments because of the clause on the Russian ethnos as foundational to the Russian state? Yes, and I spoke during the hearing to criticize [that clause]. After all, we’re not the only ones seeing a lot of debates over that change; the same thing is happening in other national republics. It’s just that most deputies are part of a “united” party: They might be against it in their souls, but they vote along the party line. You’re also part of the United Russia faction, but you voted no. Did others try to convince you to vote in favor of the amendments? I have been speaking out about that clause for a long time. Even on a logical level, I couldn’t find any way to approve it — that would have meant going against myself. A faction’s a faction, but I have my conscience, too. Just before the [voting] session, when the [United Russia] faction members got together and decided to vote as a bloc, I abstained — I’m not in the party, see. My colleagues in the faction took my vote calmly. They know I’m a writer, a free person; they also react just fine to my speeches because they know I say what I think and do what I say. 
There was no pressure on me either before or after the vote. Does the Constitution’s current language about Russia as a multiethnic people meet your standards? Exactly — right now, that’s what it says: Russia is a federation, and that doesn’t infringe on or detract from the rights of the Russian ethnos in any way. Nobody doubts the greatness [of the Russian people] — we’re all children of Russian culture and literature, of Pushkin, of Tolstoy. Why write that into the Constitution? We shouldn’t be thinking in sixteenth-century categories! Right now, the separation of church and state is codified, but the amendments have a clause about God over a thousand-year history. There’s no need for that in the Constitution, in my opinion. Sulustaana Myraan, a deputy in the Yakutian parliament from United Russia, not only voted against the amendments but also resigned her post. Have you heard about her? I have, and her opinion is noteworthy as well, but it’s never too late [to resign]. The amendments are being passed as a set, and that set includes clauses on the Russian government’s system of power: the State Council, zeroing out presidential term counts, and so on. What do you think about those measures? I’m critical of them as well. There should be alternation of power in a democratic society, right? I’m not familiar with all the intricacies of politics, of course, but why do we need a State Council, too? We’ll have to rename our own State Council [i.e. the legislature of Tatarstan]. Maybe they did think it all up just to zero out the [presidential] terms. I don’t know. But they want to introduce the concept of federal territories into the Constitution. If a given territory is considered important, then it can be controlled directly by the federal government separately from the other subjects of the [Russian] Federation. That’s also problematic, and it’s against federalism. 
I’m for the Russian Federation, so I’m against amendments like those.","‘A faction’s a faction, but I have my conscience, too’ Why a Tatar legislator from the bloc representing Russia’s ruling party voted no on the new constitutional amendments","6:30 pm, March 13, 2020",https://meduza.io/en/feature/2020/03/13/a-faction-s-a-faction-but-i-have-my-conscience-too
"Effective on March 16, Russia is imposing new restrictions on air travel to countries inside the European Union, as part of the expanding effort to contain the spread of coronavirus. According to an announcement from Russia’s federal task force, air travel will now be limited to flights destined only for the capital cities of EU nations, though the new rules do not apply to charter flights, according to the news agency Interfax. Russia is introducing identical limits on air travel to Norway and Switzerland, maintaining flights from Moscow to Geneva and Oslo, as well as charter flights.","Russia restricts air travel to the EU, allowing only charter flights and planes headed to capital cities","5:19 pm, March 13, 2020",https://meduza.io/en/news/2020/03/13/russia-restricts-air-travel-to-the-eu-allowing-only-charter-flights-and-planes-headed-to-capital-cities
"As Russia’s confirmed COVID-19 case count continues to hover in the mid-double digits, large companies and cities outside Moscow are beginning to take precautions. St. Petersburg Governor Alexander Beglov announced that from March 16 to April 30, all events involving more than 1,000 people will be banned. The surrounding Leningrad region, meanwhile, banned events of over 300 people following its first confirmed case: a woman from Kudrovo who had recently visited Italy. Large Russian corporations, including state corporations, are beginning to take precautions as well. In many cases, those measures remain relatively mild. The oil and gas company Gazprom recommended that its employees “refrain from traveling frequently abroad,” an anonymous source told Interfax. While employees will continue to work in person, meetings involving more than 10 people will be held by videoconference. At Russian Railways, meanwhile, corporate leaders asked their employees to suspend nonessential travel outside Russia but did not appear to take additional preventative measures. Yandex, a Russian tech giant whose services echo Google’s, has gone further than its peers, telling its employees to work from home for at least a week if possible. The company has also told workers to avoid public transit if they do come to work, offering to compensate them for taxi rides. The company previously cancelled events and recommended moving meetings of more than 15 – 20 people online.",Major Russian companies take mostly minor measures against coronavirus as St. Petersburg bans large events,"3:38 pm, March 13, 2020",https://meduza.io/en/news/2020/03/13/major-russian-companies-take-mostly-minor-measures-against-coronavirus-as-st-petersburg-bans-large-events


Success! 

Now we will use Non-negative Matrix Factorization (NMF) to extract topics from the articles. First, we will use `TfidfVectorizer`, which transforms the text into a word-count matrix and then normalizes it to a tf-idf representation (which is recommended for NMF).
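
To show what the tf-idf weighting does, here is a minimal pure-Python sketch. It assumes scikit-learn's documented defaults (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of every row) and uses plain whitespace splitting instead of the library's tokenizer. The three toy documents and the helper name `tfidf_rows` are made up for the example.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration
docs = [
    "putin signs amendments",
    "coronavirus cases rise",
    "putin comments on coronavirus",
]


def tfidf_rows(documents):
    """Return (vocabulary, rows) with L2-normalized tf-idf weights."""
    tokenized = [d.split() for d in documents]
    vocab = sorted({t for doc in tokenized for t in doc})
    n_docs = len(tokenized)
    # document frequency: in how many documents each term occurs
    df = Counter(t for doc in tokenized for t in set(doc))
    # smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0
        rows.append([x / norm for x in raw])
    return vocab, rows


vocab, rows = tfidf_rows(docs)
for term, weight in zip(vocab, rows[0]):
    print(f'{term}: {weight:.3f}')
```

In the first document, "signs" receives a larger weight than "putin" because "putin" also appears in another document and is therefore less distinctive.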

We will exclude stop words (like 'and', 'the', 'I') as well as terms that occur in more than 95% of the articles or in fewer than 2 articles. We will include single words and two-word combinations: `ngram_range=(1,2)`.
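
The effect of these settings on the vocabulary can be sketched in plain Python. This is a simplified illustration, not the library's implementation: it splits on whitespace, takes the stop-word list as an argument, and assumes the documented meaning of the thresholds (an integer `min_df` is a document count, a float `max_df` is a share of documents). The helper names and the toy documents are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n_min=1, n_max=2):
    """Build all word n-grams with n_min <= n <= n_max."""
    out = []
    for n in range(n_min, n_max + 1):
        out.extend(' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return out


def filter_vocabulary(documents, min_df=2, max_df=0.95, stop_words=frozenset()):
    """Keep terms found in at least min_df documents and at most max_df of them."""
    n_docs = len(documents)
    df = Counter()
    for doc in documents:
        # stop words are dropped before the n-grams are built
        tokens = [t for t in doc.lower().split() if t not in stop_words]
        df.update(set(ngrams(tokens)))
    return sorted(t for t, c in df.items() if min_df <= c <= max_df * n_docs)


toy_docs = [
    'the state duma votes',
    'the state duma adjourns',
    'the virus spreads',
]
kept = filter_vocabulary(toy_docs, stop_words=frozenset({'the'}))
print(kept)
```

Terms that occur in a single document, such as "virus" or "duma votes", are dropped by `min_df=2`, while the bigram "state duma" survives because it appears in two documents.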

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

tfidf_vectorizer = TfidfVectorizer(max_df=0.95, 
                                   min_df=2, 
                                   ngram_range=(1,2), 
                                   stop_words='english')

tfidf = tfidf_vectorizer.fit_transform(df['pagetext'])
# get_feature_names_out() supersedes get_feature_names(), which scikit-learn 1.2 removed
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()

Now to the model building! 

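We'll use NMF with slight regularization (`alpha=.1`), which might help with topic interpretability. Note that with `l1_ratio=0`, as in the code below, scikit-learn applies a pure L2 penalty; an L1 (lasso) penalty, which pushes small coefficients to exactly zero, would require `l1_ratio=1`. As the dataset contains only 30 observations, we will use a small number of components and hope they will be interpretable.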
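
For intuition about what NMF computes, here is a small pure-Python sketch that factorizes a non-negative matrix `V` into `W` and `H` with the classic multiplicative update rules. It is an illustration under simplifying assumptions: no regularization, a toy matrix invented for the example, and a different solver from scikit-learn's default (coordinate descent). All function names are hypothetical.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0]))) ** 0.5


def nmf_multiplicative(v, k, n_iter=200, seed=42, eps=1e-9):
    """Factorize V (n x m) into W (n x k) and H (k x m), all non-negative."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    initial_error = frobenius_error(v, w, h)
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return w, h, initial_error, frobenius_error(v, w, h)


# Toy document-term matrix with two obvious "topics" (columns 0-1 and 2-3)
V = [[1, 1, 0, 0],
     [2, 2, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 3, 3]]
W, H, initial_error, final_error = nmf_multiplicative(V, k=2)
print(f'error before: {initial_error:.3f}, after: {final_error:.3f}')
```

Rows of `H` play the role of `model.components_` (term weights per topic), and rows of `W` correspond to what `nmf.transform` returns (topic weights per article).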

The code below fits the model and then prints the top 20 words for every component.

In [5]:
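# note: newer scikit-learn releases replace `alpha` with `alpha_W` / `alpha_H` (scaled differently); `alpha` was removed in 1.2
nmf = NMF(n_components=7, alpha=.1, l1_ratio=0, random_state=42).fit(tfidf)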

def display_topics(model, feature_names, no_top_words):
    """Show top words for every topic in the model"""
    for topic_idx, topic in enumerate(model.components_):
        print(f"Topic: {topic_idx+1}")
        print([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]])

display_topics(nmf, tfidf_feature_names, 20)

Topic: 1
['duma', 'state duma', 'state', 'putin', 'term', 'amendment', 'elections', 'constitutional', 'tereshkova', 'snap', 'term clock', 'clock', 'proposal', 'constitutional reform', 'reform', 'presidential', 'reform legislation', 'zero', 'term limits', 'vladimir']
Topic: 2
['cases', 'russia', 'coronavirus', 'china', 'italy', 'new', 'italian', 'countries', 'people', 'infected', 'region', 'moscow', 'virus', 'tested', 'positive', 'tested positive', 'number', 'patients', 'disease', 'recorded']
Topic: 3
['amendments', 'vote', 'constitution', 'constitutional', 'nationwide', 'russia', 'nationwide vote', 'legislation', 'article', 'court', 'law', 'voting', 'regional', 'putin', 'nation', 'reforms', 'russia constitution', 'vladimir putin', 'territories', 'constituent territories']
Topic: 4
['oil', 'meduza', 'russian', 'agreement', 'case', 'sources', 'business', 'percent', 'alexey', 'buyers', 'ruble', 'prices', 'confirmed', 'network case', 'network', 'million', 'russia', 'media', 'price', 'said']

Most of the topics seem to be quite interpretable.
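
The slice `argsort()[:-no_top_words - 1:-1]` in `display_topics` can look cryptic: it sorts the indices by ascending weight and then walks backwards from the end, which yields the indices of the largest weights first. A plain-Python sketch of the same idea follows, with a made-up five-word vocabulary and made-up weights; `top_indices` is a hypothetical helper.

```python
def top_indices(weights, k):
    """Indices of the k largest weights, largest first."""
    # ascending order of indices, like numpy's argsort
    order = sorted(range(len(weights)), key=lambda i: weights[i])
    # take the last k entries in reverse order
    return order[:-k - 1:-1]


# Hypothetical vocabulary and topic weights
feature_names = ['cases', 'duma', 'oil', 'vote', 'virus']
topic = [0.9, 0.0, 0.1, 0.05, 0.7]

top_words = [feature_names[i] for i in top_indices(topic, 2)]
print(top_words)
```

When weights are tied, the order among the tied terms may differ from numpy's, so this sketch is meant only to explain the slicing.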

We can examine the performance of the model by comparing the most prominent topic for every article with the article title. 
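
Picking "the most prominent topic" simply means taking, for each article, the topic with the largest weight in its row of the transformed matrix. The sketch below shows that step in plain Python on made-up weights; `dominant_topic` and the numbers are hypothetical and only mirror what a row-wise `idxmax` does.

```python
def dominant_topic(weights_per_article, topic_names):
    """Return the name of the highest-weighted topic for every article."""
    labels = []
    for row in weights_per_article:
        # index of the largest weight; the first one wins on ties
        best = max(range(len(row)), key=row.__getitem__)
        labels.append(topic_names[best])
    return labels


# Hypothetical topic weights for three articles and two topics
toy_topics = ['COVID-19', 'Market crash']
toy_weights = [
    [0.8, 0.1],
    [0.0, 0.4],
    [0.3, 0.3],   # a tie: the first topic is chosen
]

labels = dominant_topic(toy_weights, toy_topics)
print(labels)
```

An article whose weights are nearly tied does not clearly belong to one topic, so it can be worth inspecting the full row rather than only the top label.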

In [6]:
# naming topics by hand
topics = ["Zeroing out Putin's terms", 'COVID-19', "Russia's constitutional amendments", 'Market crash',
          'Mass public events in Moscow?', 'Flight cancellations', 'Kremlin-related?']

# comparing the top-1 topic with the article title
rates = pd.DataFrame(nmf.transform(tfidf), columns=topics)
# Styler.hide(axis='index') replaces hide_index(), which pandas 2.0 removed
pd.DataFrame([df.title.values, rates.idxmax(axis=1).values]).T.style.hide(axis='index')

0,1
"‘A faction’s a faction, but I have my conscience, too’ Why a Tatar legislator from the bloc representing Russia’s ruling party voted no on the new constitutional amendments",Russia's constitutional amendments
"Russia restricts air travel to the EU, allowing only charter flights and planes headed to capital cities",Flight cancellations
Major Russian companies take mostly minor measures against coronavirus as St. Petersburg bans large events,Mass public events in Moscow?
Russia's number of confirmed coronavirus infections reaches 45,COVID-19
"Kremlin spokesman says Putin is safe from coronavirus, but don't expect to see his medical records",Kremlin-related?
The Kremlin lies to kids Putin’s spokesman refuses to change language on the administration’s website for children that says presidents are prohibited from running for a third consecutive term,Kremlin-related?
‘Constitutional Gymnastics’: Russia's strange initiative to keep Vladimir Putin in office for years to come,Russia's constitutional amendments
Selling ‘Vedomosti’ Sources say two media entrepreneurs with tangled political histories are buying Russia’s leading business newspaper,Market crash
"Citing coronavirus concerns, Russia closes its borders to Italian nationals and other foreigners arriving from Italy",COVID-19
34 people have now tested positive for coronavirus in Russia,COVID-19


The model did quite a good job: most topics correlate nicely with article titles!

This approach can be applied to larger datasets and to a range of practical tasks such as automated keyword generation, dimensionality reduction for text data, and the development of recommendation systems.