# Acquiring raw texts for Finger-tweets and Finger-news

The Twitter API and the Finnish Wikinews (Wikiuutiset) are queried to fetch the raw texts that make up Finger-tweets and Finger-news, respectively. The outputs of these queries are then annotated in Label Studio to create the actual corpora.

In [1]:
import pandas as pd

## Finger-tweets
I use [Twarc](https://twarc-project.readthedocs.io/en/latest/) for this. Note that you need a [Twitter API key](https://developer.twitter.com/en/docs/twitter-api/getting-started/getting-access-to-the-twitter-api) and have to [configure Twarc](https://twarc-project.readthedocs.io/en/latest/twarc2_en_us/#configure) before you can query the API.

In [None]:
from twarc import Twarc
# initializing Twarc
t = Twarc()

#### The block below runs a query against the Twitter search API under these conditions:
1. The tweet contains the conjunction "ja" ("and"). The query cannot be empty, so I decided to use a stopword.
2. Twitter has identified the tweet as Finnish. The identification may fail, especially for short tweets, but it is good enough.

The output is saved as JSON lines to whatever output path you pass (the part after `>`).

Please note that the query returns tens of thousands of tweets and runs for quite a while. It is also hard to cancel. Interrupting the kernel should work, but if nothing else does, close the Jupyter notebook session.

In [None]:
!twarc search 'ja' --lang fi > input_data/finnish_tweet_search.jsonl
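
Because the command-line search is hard to interrupt, one option is to run the search from Python and stop after a fixed number of tweets. The sketch below is only an illustration and not what was used to build the corpus: `write_jsonl` and its `limit` parameter are hypothetical names, and the function accepts any iterable of tweet dictionaries. The commented usage line assumes that the `search` method of Twarc v1 takes a `lang` argument in the same way the command line takes `--lang`.

```python
import json
from itertools import islice


def write_jsonl(tweets, path, limit=None):
    """Write at most `limit` tweets from an iterable to `path` as JSON lines.

    Returns the number of tweets written. `limit=None` writes everything.
    """
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        # islice stops pulling from the iterable after `limit` items,
        # so a long-running search generator is not exhausted
        for tweet in islice(tweets, limit):
            f.write(json.dumps(tweet, ensure_ascii=False) + "\n")
            count += 1
    return count


# Hypothetical usage with the Twarc instance `t` created above:
# write_jsonl(t.search("ja", lang="fi"),
#             "input_data/finnish_tweet_search.jsonl", limit=20000)
```

The limit of 20000 is an arbitrary placeholder; any value comfortably above the intended sample size would do.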

#### Reading the JSON back as a Pandas dataframe
Filtering and sampling are applied below.

In [5]:
tweets = pd.read_json("input_data/finnish_tweet_search.jsonl", lines=True)

Making sure each tweet is in Finnish and is an original one, that is, not a:
1. Retweet
2. Quote retweet
3. Reply

In [6]:
# keep only Finnish tweets that are not retweets, quote retweets or replies
filtered = tweets[pd.isnull(tweets['retweeted_status'])
                  & pd.isnull(tweets['quoted_status'])
                  & pd.isnull(tweets['in_reply_to_screen_name'])
                  & (tweets['lang'] == 'fi')]

In [9]:
print("Original length is:", len(tweets), "\nFiltered length is:", len(filtered))

Original length is: 56700 
Filtered length is: 8915
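
The same conditions can also be checked while reading the JSON lines file, before anything is loaded into Pandas. This is a sketch under assumptions rather than the method used here: `is_original` and `read_original_tweets` are hypothetical helpers, and a missing key is treated the same as a `null` value, which mirrors what `pd.isnull` sees after `read_json`.

```python
import json


def is_original(tweet):
    """Return True for a Finnish tweet that is not a retweet, quote or reply."""
    return (
        tweet.get("retweeted_status") is None
        and tweet.get("quoted_status") is None
        and tweet.get("in_reply_to_screen_name") is None
        and tweet.get("lang") == "fi"
    )


def read_original_tweets(path):
    """Yield the original tweets from a JSON lines file, one dict at a time."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip empty lines
            tweet = json.loads(line)
            if is_original(tweet):
                yield tweet
```

Passing `list(read_original_tweets(path))` to `pd.DataFrame` would then give a frame that contains only the rows that survive the filter.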


In [10]:
# most of these columns are not needed for the corpus
tweets.columns

Index(['created_at', 'id', 'id_str', 'full_text', 'truncated',
       'display_text_range', 'entities', 'metadata', 'source',
       'in_reply_to_status_id', 'in_reply_to_status_id_str',
       'in_reply_to_user_id', 'in_reply_to_user_id_str',
       'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place',
       'contributors', 'retweeted_status', 'is_quote_status', 'retweet_count',
       'favorite_count', 'favorited', 'retweeted', 'lang',
       'possibly_sensitive', 'extended_entities', 'quoted_status_id',
       'quoted_status_id_str', 'quoted_status'],
      dtype='object')

In [None]:
# only keep Tweet identifier and text
output_df = filtered[['id_str', 'full_text']]
# randomly sample a thousand tweets (pass random_state=<seed> to make the sample reproducible)
output_df = output_df.sample(n=1000)

Export to CSV

In [None]:
output_df.to_csv("input_data/filtered_finnish_tweets_sample1000.csv", index=False)

## Finger-news

I have collected a list of URLs by hand ("wikiuutiset_2011_urls.txt"). For each URL, `urllib` is used to fetch the HTML of the page, which is then parsed with [BeautifulSoup](https://beautiful-soup-4.readthedocs.io/en/latest/) to extract the article title and text.

In [2]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

In [3]:
path_to_url_list = 'input_data/wikiuutiset_2011_urls.txt'
with open(path_to_url_list) as f:
    # strip trailing newlines and skip empty lines
    urls = [line.strip() for line in f if line.strip()]

In [4]:
len(urls)

42

In [5]:
def get_wikinews_article(url):
    # read the page source and parse it into a BeautifulSoup object
    source = urlopen(url).read()
    soup = BeautifulSoup(source, 'lxml')

    # get the title and start the text string with it, followed by a line break
    title = soup.find('title').text
    text = title + '\n'
    for paragraph in soup.find_all('p'):
        # append each paragraph to the text string
        text += paragraph.text

    # remove bracketed citations such as [1] from the whole text in one pass
    text = re.sub(r'\[.*?\]+', '', text)

    return title, text
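
To see what the citation pattern does in isolation, here is a small sketch. The sample sentence is made up for illustration, and `CITATION_PATTERN` is a hypothetical name for the same regular expression used in `get_wikinews_article`.

```python
import re

# same pattern as in get_wikinews_article, compiled once for reuse
CITATION_PATTERN = re.compile(r'\[.*?\]+')

# made-up sentence with two bracketed markers
sample = "Hallitus on tutkinut tapausta[1] vuodesta 2008[lähde?]."
cleaned = CITATION_PATTERN.sub('', sample)
print(cleaned)
# Hallitus on tutkinut tapausta vuodesta 2008.
```

Note that the pattern removes any bracketed span on a single line, not only numeric citations, so editorial remarks in square brackets would be dropped as well.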

For example, below are the title and text of one article.

In [21]:
title, text = get_wikinews_article(urls[15])
print(title)
print(text)

Wall Streetin miljardööri Raj Rajaratnam syyllistyi sisäpiirikauppaan – Wikiuutiset
Wall Streetin miljardööri Raj Rajaratnam syyllistyi sisäpiirikauppaan – Wikiuutiset
11. toukokuuta 2011
 Wall Streetin sijoittaja  Raj Rajaratnam tuomittiin syylliseksi laittomista sisäpiirikaupoista.  Srilankalainen miljardööri hyötyi syyttäjien mukaan 60 miljoonaa dollaria eli yli 40 miljoonaa euroa laittomista sisäpiirivihjeistä.
 Wikipedian mukaan FBI pidätti hänet 16. lokakuuta 2009 epäiltynä sisäpiirikaukoista. Hallitus on tutkinut tapausta vuodesta 2008. Rajaratnam sai laittomia vihjeitä yrityksiltä kuten  Goldman Sachs. Syyttäjät nauhottivat salaa Rajaratnamin puheluita. Nauhoilla vihjeitä antoi mm. Goldman Sachsin johtaja   Rajat Gupta. Tuomiosta voi valittaa.



Looping through the URLs, acquiring titles and texts.

In [6]:
titles = []
texts = []

for url in urls:
    title, text = get_wikinews_article(url)
    titles.append(title)
    texts.append(text)
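
If one of the pages cannot be fetched, the loop above stops with an exception. A more forgiving variant is sketched below, assuming that skipping a failed URL is acceptable for the corpus. `collect_articles`, its `fetch` parameter and the `delay` between requests are hypothetical additions and not part of the original workflow.

```python
import time
from urllib.error import URLError


def collect_articles(urls, fetch, delay=1.0):
    """Call `fetch(url)` for every URL and skip the ones that fail.

    `fetch` is expected to return a (title, text) tuple, like
    get_wikinews_article does.
    Returns (titles, texts, fetched_urls, failed_urls).
    """
    titles, texts, fetched, failed = [], [], [], []
    for url in urls:
        try:
            title, text = fetch(url)
        except URLError as error:
            # HTTPError is a subclass of URLError, so HTTP errors land here too
            print("Skipping", url, "because of", error)
            failed.append(url)
            continue
        titles.append(title)
        texts.append(text)
        fetched.append(url)
        time.sleep(delay)  # short pause between successful requests
    return titles, texts, fetched, failed
```

It could be called as `collect_articles(urls, get_wikinews_article)`. When building the DataFrame afterwards, the list of fetched URLs should be used for the `url` column so that all columns have the same length.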

Collecting the results into a Pandas DataFrame.

In [7]:
wikinews_articles = pd.DataFrame({'title':titles,'text':texts,'url':urls})

In [8]:
wikinews_articles.head(2)

| | title | text | url |
|---|---|---|---|
| 0 | Vuodenvaihteen jälkeen 336 kuntaa – Wikiuutiset | Vuodenvaihteen jälkeen 336 kuntaa – Wikiuutise... | https://fi.wikinews.org/wiki/Vuodenvaihteen_j%... |
| 1 | Kolavian Tupolev Tu-154 -kone tuhoutui tulipal... | Kolavian Tupolev Tu-154 -kone tuhoutui tulipal... | https://fi.wikinews.org/wiki/Kolavian_Tupolev_... |


Export to CSV

In [9]:
wikinews_articles.to_csv('input_data/wikinews_2011.csv', index=False)