# Using `stanza` to parse unsegmented texts from a cleaned dataframe of Pile data

I'm using `stanza` instead of the raw Stanford CoreNLP model because it is designed to run in Python, so I don't have to interface with Java, which was causing considerable headaches. Other Python interface packages such as `nltk` still require instantiating a Java CoreNLP server, and on WSL I can't specify a port that actually works for that. The CoreNLP parser would work on kay, woods, or kuno, but I'm already using Python and pandas to clean things up, create document IDs, and filter to the data sets we are interested in. The command line tool also requires text file input; while it's possible to save every document to a text file and parse it that way, doing so has the following drawbacks:

- requires writing a lot of temporary, unneeded files
- doesn't allow for adding comments along the way 
- doesn't allow for sentence ID creation and annotating in the step between parsing and writing
- generally is more restrictive in its output, so there would need to be another round of processing/reformatting after the parsing step. 

With this approach, the text can be pulled directly from the dataframe together with the metadata needed to annotate it.
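As a rough illustration of the annotation step between parsing and writing, the comment lines for each sentence could be assembled from the dataframe's `text_id` with a small helper. This is only a sketch: `sentence_comments` and its `first_sent_number` parameter are hypothetical names, the sentences are placeholders rather than parser output, and starting sentence numbers at 1 is an assumption, not a settled choice.

```python
def sentence_comments(text_id, sent_index, sent_text, first_sent_number=1):
    """Build CoNLL-U style comment lines for one sentence.

    sent_index is the 0-based position of the sentence in its document;
    first_sent_number is the number given to the first sentence
    (assumed to be 1 here; pass 0 to keep 0-based IDs).
    """
    lines = []
    if sent_index == 0:
        # only the first sentence of a document carries the document ID
        lines.append(f"# newdoc id = {text_id}")
    lines.append(f"# sent_id = {text_id}_{sent_index + first_sent_number}")
    # collapse internal whitespace so the comment stays on a single line
    lines.append(f"# text = {' '.join(sent_text.split())}")
    return lines


# placeholder sentences standing in for one segmented document
text_id = "PiCC_test_00000"
sentences = ["First placeholder sentence.", "Second placeholder\nsentence."]

for i, sent in enumerate(sentences):
    print("\n".join(sentence_comments(text_id, i, sent)))
```

Keeping the ID logic in one function makes it easy to switch between 0-based and 1-based sentence numbering without touching the parsing loop.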


## import modules

In [1]:
# -*- coding: utf-8 -*-
import stanza
from stanza.utils.conll import CoNLL
import pandas as pd
pd.set_option('display.max_colwidth', 120)

## load processed dataframe

In [2]:
# load processed dataframe
pcc = pd.read_pickle('/home/andrea/litotes/process_pile/pcc_table.pkl.gz')
pcc


| | text_id | text | pile_set_name | pile_set_code |
|---|---|---|---|---|
| 0 | PiCC_test_00000 | Mud Hens pitcher Evan Reed charged with sexual assault Mud Hens pitcher Evan Reed was charged July 30 with sexual a... | Pile-CC | PiCC |
| 1 | PiCC_test_00001 | I'm getting about the same thing trying to update "tf" (team fortress 2) on Ubuntu 7.10 (just updated it yesterday).... | Pile-CC | PiCC |
| 2 | PiCC_test_00002 | Mounting tensions with Syria sink US stocks NEW YORK (AP) -- Fears of an escalating conflict in Syria rippled acros... | Pile-CC | PiCC |
| 3 | PiCC_test_00003 | Upcoming Events Catholic Theologians Call to Abolish the Death Penalty In the wake of the September 21st execution... | Pile-CC | PiCC |
| 4 | PiCC_test_00004 | Tag Archives: west texas Post navigation In the summer of 1980, if I remember right, we traveled from Kansas to no... | Pile-CC | PiCC |
| ... | ... | ... | ... | ... |
| 52785 | PiCC_test_52785 | These are all examples of street harassment. It's a serious problem, and yet it happens every day in every city arou... | Pile-CC | PiCC |
| 52786 | PiCC_test_52786 | There is an online service called fiverr. fiverr lets you hire professionals in all different industries for a low p... | Pile-CC | PiCC |
| 52787 | PiCC_test_52787 | Milton Friedman had no idea that his six-day trip to Chile in March 1975 would generate so much controversy. He was ... | Pile-CC | PiCC |
| 52788 | PiCC_test_52788 | Age may not hurt it as said. What will hurt you tring to sell it, is the fact that no one will ship that package any... | Pile-CC | PiCC |


In [3]:
illus = pcc.sample(5)
print(illus.text.iloc[0][:800])

Seattle's free-ride zone is ending; 'funeral' is set for Friday

Members of the Transit Riders Union shown earlier this month. The group has been protesting the elimination of the ride-free area.

Jake EllisonKPLU

A cultural shift is taking place in Seattle. It's the elimination of a free-ride zone downtown, for bus riders.

It's been in place for four decades. And on Friday (Sept. 28) it will go away.

Kevin Desmond is the head of King County Metro. He says the deficit the system is facing is about $6o million, annually.

"To be perfectly honest, we're trying to save the service for all the people who use it," Desmond says. "So, the ride-fee area had its place for 40 years. We are the only city in the United States now that has such a ride-free area. Cities throughout the world, you pay 


In [4]:
illus[['text_id', 'pile_set_code', 'pile_set_name', 'text']]

| | text_id | pile_set_code | pile_set_name | text |
|---|---|---|---|---|
| 5352 | PiCC_test_05352 | PiCC | Pile-CC | Seattle's free-ride zone is ending; 'funeral' is set for Friday Members of the Transit Riders Union shown earlier t... |
| 11234 | PiCC_test_11234 | PiCC | Pile-CC | Two more fabulously cute images and two more awesome card designs...as always I love your detailing to complement th... |
| 34827 | PiCC_test_34827 | PiCC | Pile-CC | Judge orders Lev Tahor youths into care of children's aid but most have already left Canada CHATHAM, Ont. -- An Ont... |
| 47219 | PiCC_test_47219 | PiCC | Pile-CC | Ross, like Bruce Bowen, has found minutes in the league strictly for his defensive capabilities. Unlike Bowen, Ross ... |
| 11300 | PiCC_test_11300 | PiCC | Pile-CC | Restaurant blames closure on Fremont Bridge work View from Bandoleone now an orange fence The view for Bandoleone ... |


## load and initiate language model

In [5]:
# TODO : add notes here on the one-time download step
# stanza.download('en')
# load language model
nlp = stanza.Pipeline('en', processors='tokenize,mwt,pos,lemma,depparse')
# TODO : change POS to XPOS; remove extra features?

2021-11-22 15:30:47 INFO: Loading these models for language: en (English):
| Processor | Package  |
------------------------
| tokenize  | combined |
| pos       | combined |
| lemma     | combined |
| depparse  | combined |

2021-11-22 15:30:47 INFO: Use device: cpu
2021-11-22 15:30:47 INFO: Loading: tokenize
2021-11-22 15:30:47 INFO: Loading: pos
2021-11-22 15:30:48 INFO: Loading: lemma
2021-11-22 15:30:48 INFO: Loading: depparse
2021-11-22 15:30:49 INFO: Done loading processors!
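On the one-time download step: `stanza.download('en')` only needs to run once per machine. A minimal sketch of guarding that call is shown here, under the assumption that models live in `~/stanza_resources/<lang>/` (the default location as far as I know; a custom `model_dir` would change this). The helper names `model_is_present` and `ensure_model` are hypothetical, not part of `stanza`.

```python
from pathlib import Path


def model_is_present(lang="en", model_dir=None):
    """Return True if a non-empty directory for `lang` exists under model_dir.

    Assumption: stanza stores models in ~/stanza_resources/<lang>/ by default.
    """
    root = Path(model_dir) if model_dir is not None else Path.home() / "stanza_resources"
    lang_dir = root / lang
    return lang_dir.is_dir() and any(lang_dir.iterdir())


def ensure_model(lang="en", model_dir=None):
    """Download the model for `lang` only if it is not already on disk.

    Returns True if a download was triggered, False otherwise.
    """
    if model_is_present(lang, model_dir):
        return False
    import stanza  # imported lazily so the directory check has no dependencies
    if model_dir is None:
        stanza.download(lang)
    else:
        stanza.download(lang, model_dir=str(model_dir))
    return True
```

Calling `ensure_model('en')` before building the pipeline would then skip the network request on every run after the first. The directory check is deliberately crude: it does not verify that every processor's model file is present or current.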


In [6]:
# open output file for conll formatted data
with open(f"{pcc.at[0, 'pile_set_code']}_sample.conll", mode='w', encoding='utf-8') as conllout:
    
    # for each text in the pile subset...
    # here, just the sample as an illustration
    for ix in illus.index:
        text = illus.text[ix]
        # look up the document ID once per text rather than once per sentence
        text_id = illus.text_id[ix]

        # create doc (with parsing) 
        doc = nlp(text)
        
        # add comments to sentences (info pulled from dataframe)
        for s in doc.sentences:
            print(s.text)
            
            if s.id == 0:
                # "newdoc id" will be the text_id from the pile subset
                s.add_comment(f'# newdoc id = {text_id}')

            # TODO : fix numbering of tokens (or see if necessary to start at 1)
            # "sent_id" will be doc/text id with _[sentence number] appended
            s.add_comment(f'# sent_id = {text_id}_{s.id}')
            
            # this adds the full text string to the output file
            s.add_comment(f'# text = {s.text}')
        
        conllstr = CoNLL.doc2conll_text(doc)
        # write conll formatted string of doc to output file
        conllout.write(conllstr)

Seattle's free-ride zone is ending; 'funeral' is set for Friday
Members of the Transit Riders Union shown earlier this month.
The group has been protesting the elimination of the ride-free area.
Jake EllisonKPLU
A cultural shift is taking place in Seattle.
It's the elimination of a free-ride zone downtown, for bus riders.
It's been in place for four decades.
And on Friday (Sept. 28) it will go away.
Kevin Desmond is the head of King County Metro.
He says the deficit the system is facing is about $6o million, annually.
"To be perfectly honest, we're trying to save the service for all the people who use it," Desmond says.
"So, the ride-fee area had its place for 40 years.
We are the only city in the United States now that has such a ride-free area.
Cities throughout the world, you pay fare to use transit. "
He says even Portland, Ore., recently phased out theirs for buses and the Max light-rail line.
But advocates for transit riders say many social services are located near the free ride