# V1: Python string parsing

What will we learn?
  * writing a Jupyter notebook in Python and Markdown
  * reviewing Python's basic data structures
  * working with Python strings
  * obtaining textual information by web scraping


## 1.1 Data types

Let's briefly review Python's data types.

In [1]:
import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


In [4]:
# data types (note: `3, 4j` below is two elements, the int 3 and the complex 4j)
x = [3+4j, '3+4j', 3, 4j, {3+4j}, "3+4j", 4.3, [3, 5], [4j], (1, 2)]
for i in x:
    print(type(i))
    

<class 'complex'>
<class 'str'>
<class 'int'>
<class 'complex'>
<class 'set'>
<class 'str'>
<class 'float'>
<class 'list'>
<class 'list'>
<class 'tuple'>
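
The output above shows that `3+4j` and `'3+4j'` look alike but have different types. As a small sketch of how a string can be turned into a number (the variable names here are illustrative, not part of the original notebook):

```python
# The same characters can be a number or a string, depending on the quotes.
as_text = '3+4j'
as_number = complex(as_text)   # complex() parses strings such as '3+4j' (no spaces inside)

print(type(as_text), type(as_number))
print(as_number == 3+4j)       # True

# isinstance() is the usual way to check a value's type.
values = [3, 4.3, '4.3', 4j]
numeric = [v for v in values if isinstance(v, (int, float, complex))]
print(numeric)                 # [3, 4.3, 4j]

# int() parses numeric strings as well.
print(int('42') + 1)           # 43
```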


In [6]:
# Examples of splitting strings (split, partition, slicing)
tekst = 'Ana ima jabuku, Marko ima krušku.'

# Split on whitespace
rijeci = tekst.split()
print(rijeci)

# Split on the comma
dijelovi = tekst.split(',')
print(dijelovi)

# Partition on the first occurrence of the word 'ima'
prvi, sep, ostatak = tekst.partition('ima')
print(prvi, '|', sep, '|', ostatak)

# Slicing: the first 10 characters
print(tekst[:10])

['Ana', 'ima', 'jabuku,', 'Marko', 'ima', 'krušku.']
['Ana ima jabuku', ' Marko ima krušku.']
Ana  | ima |  jabuku, Marko ima krušku.
Ana ima ja
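
The pieces returned by `split(',')` above still carry a leading space and a trailing period. A short sketch of a few related string methods that help with such cleanup (`strip`, `maxsplit`, `rpartition`, `join`); the extra variable names are only illustrative:

```python
tekst = 'Ana ima jabuku, Marko ima krušku.'

# Split on the comma, then strip spaces and periods from both ends of each piece.
recenice = [dio.strip(' .') for dio in tekst.split(',')]
print(recenice)        # ['Ana ima jabuku', 'Marko ima krušku']

# maxsplit limits the number of cuts: only the first space is used here.
ime, ostatak = tekst.split(' ', maxsplit=1)
print(ime)             # Ana

# rpartition searches from the right, so it splits at the last 'ima'.
lijevo, sep, desno = tekst.rpartition('ima')
print(repr(desno))     # ' krušku.'

# join is the inverse of split.
print('-'.join(tekst.split()[:3]))   # Ana-ima-jabuku,
```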


## Data structures in Python



In [7]:
# Examples of Python's basic data structures (other than strings)

# List
lista = [1, 2, 3, 4]
print('List:', lista)

# Tuple
torka = (1, 2, 3, 4)
print('Tuple:', torka)

# Set (duplicates are dropped)
skup = {1, 2, 2, 3}
print('Set:', skup)

# Dictionary (dict)
rjecnik = {'ime': 'Ana', 'godine': 25}
print('Dictionary:', rjecnik)

List: [1, 2, 3, 4]
Tuple: (1, 2, 3, 4)
Set: {1, 2, 3}
Dictionary: {'ime': 'Ana', 'godine': 25}
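
The four structures differ mainly in mutability, ordering, and how elements are looked up. A minimal sketch of those differences, reusing the same variable names (the key `'grad'` is made up for the example):

```python
lista = [1, 2, 3, 4]
torka = (1, 2, 3, 4)
skup = {1, 2, 2, 3}
rjecnik = {'ime': 'Ana', 'godine': 25}

# Lists are ordered and mutable.
lista.append(5)
print(lista[0], lista[-1])      # 1 5

# Tuples are ordered but immutable: item assignment raises TypeError.
try:
    torka[0] = 99
except TypeError as err:
    print('tuple is immutable:', err)

# Sets keep unique elements and support fast membership tests.
print(2 in skup, len(skup))     # True 3

# Dictionaries map keys to values; get() returns a default for a missing key.
print(rjecnik['ime'])                  # Ana
print(rjecnik.get('grad', 'unknown'))  # unknown
```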


**Object-oriented programming (OOP)** is a style of programming in which a program is built from objects. An object combines data (attributes) with the functions (methods) that operate on that data. OOP makes code easier to organize and reuse.
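
As a minimal illustration of attributes and methods, here is a sketch of a small class. `Person`, `name`, `age`, and `introduce` are hypothetical names chosen for this example only:

```python
class Person:
    """A hypothetical class used only for illustration."""

    def __init__(self, name, age):
        # Attributes: the data stored on each object.
        self.name = name
        self.age = age

    def introduce(self):
        # Method: a function that works with the object's own data.
        return f'{self.name} is {self.age} years old.'


# Two objects of the same class, each with its own data.
ana = Person('Ana', 25)
marko = Person('Marko', 31)

print(ana.introduce())   # Ana is 25 years old.
print(marko.age)         # 31
```

The same class definition is reused for every object, which is the sense in which OOP supports code reuse.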

In [None]:
# !conda install nltk scikit-learn numpy pandas matplotlib -y  # Anaconda, Python 3
!pip install nltk scikit-learn numpy pandas matplotlib  # plain Python (pip)

In [None]:
# download the data used by NLTK
import nltk
nltk.download('punkt_tab')
nltk.download('brown')
nltk.download('universal_tagset')

In `data\rjecnik.txt` you are given a list of words with grammatical and semantic features. From the text, extract only the nouns, together with their description and grammatical features, and save them to a JSON file in the following format:
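
The target format is not spelled out here. Judging only by the field names used in this notebook's extraction code (`lemma`, `pos`, `gender`, `inflection`, `definition`), one record might look roughly like the sketch below. The entry values are invented for illustration and do not come from the data file:

```python
import json

# Hypothetical noun entry: the values are invented; only the keys
# follow the group names used in the notebook's regular expression.
primjer = {
    'lemma': 'kuća',
    'pos': 'im.',
    'gender': 'ž',
    'inflection': 'G mn kuća',
    'definition': 'zgrada za stanovanje',
}

# ensure_ascii=False keeps characters such as 'ć' and 'ž' readable in the output.
print(json.dumps(primjer, ensure_ascii=False, indent=4))
```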
  

In [None]:
import re
import json

# Read the whole dictionary text into memory.
with open('data/ocr.txt', 'r', encoding='utf8') as ocr:
    content = ocr.read()

# Entries are separated by a blank line.
entries = re.split(r'\n\n', content)

# A noun entry: lemma, the tag "im." (imenica = noun), gender,
# the inflection inside 〈...〉, and then the definition.
regex = re.compile(
    r'(?P<lemma>\w+)\s+(?P<pos>im\.)\s+(?P<gender>.*?)\s+〈(?P<inflection>.*?)〉(?P<definition>.*)',
    re.DOTALL,  # let '.' match newlines, since an entry may span several lines
)

for i, data in enumerate(entries):
    print(f'\n\nentry {i}: ', data)
    mObj = regex.match(data)

    if mObj:
        print('\n**Pattern found: ', end=' ')
        lex = mObj.groupdict()
        print(lex)

        # One JSON file per lemma; ensure_ascii=False keeps Croatian characters readable.
        with open(f"data/{lex['lemma']}.json", "w", encoding='utf8') as outfile:
            json.dump(lex, outfile, ensure_ascii=False, indent=4)
    else:
        print('Not a noun')
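
The loop above needs `data/ocr.txt` to run. To experiment with the pattern without the file, a small in-memory sketch may help. The two sample entries are invented, and the pattern is a slightly stricter variant (escaped dot in `im\.`, non-greedy groups, values stripped of surrounding whitespace), which assumes a particular entry layout that the real data may not follow:

```python
import re

# Hypothetical sample text: two entries separated by a blank line.
sample = (
    'kuća im. ž 〈G mn kuća〉 zgrada za stanovanje\n'
    '\n'
    'trčati gl. nesvrš. 〈prez. trčim〉 brzo se kretati'
)

# Assumed layout: lemma, "im." tag, gender, 〈inflection〉, definition.
pattern = re.compile(
    r'(?P<lemma>\w+)\s+(?P<pos>im\.)\s+(?P<gender>.*?)\s+〈(?P<inflection>.*?)〉(?P<definition>.*)',
    re.DOTALL,
)

nouns = []
for entry in re.split(r'\n\n', sample):
    m = pattern.match(entry)
    if m:
        # Strip stray whitespace around each captured field.
        nouns.append({key: value.strip() for key, value in m.groupdict().items()})

print(len(nouns))            # 1
print(nouns[0]['lemma'])     # kuća
print(nouns[0]['definition'])
```

Only the first entry is kept, because the second one carries the tag `gl.` rather than `im.`.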