# V1: Python string parsing

What will we learn?
  * writing a Jupyter notebook in Python and Markdown
  * reviewing Python's basic data structures
  * working with Python strings
  * obtaining textual information by web scraping


## 1.1 Data types

Let's briefly review Python's data types.

In [1]:
import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


In [4]:
# data types (note: `3,4j` in the list is two elements, the int 3 and the complex 4j)
x = [3+4j, '3+4j', 3,4j, {3+4j}, "3+4j", 4.3, [3,5], [4j], (1,2)]
for i in x:
    print(type(i))

<class 'complex'>
<class 'str'>
<class 'int'>
<class 'complex'>
<class 'set'>
<class 'str'>
<class 'float'>
<class 'list'>
<class 'list'>
<class 'tuple'>
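
The output above lists ten types although the list literal looks like it has nine items. A minimal sketch of why, using a shortened list (the variable names `items`, `type_names` and `parsed` are illustrative only):

```python
# Inside a list literal the comma separates elements, so `3,4j` is two items:
# the int 3 and the complex number 4j.
items = [3+4j, 3,4j]
type_names = [type(item).__name__ for item in items]
print(type_names)

# A string that looks like a complex number stays a str until it is parsed.
parsed = complex('3+4j')
print(type(parsed).__name__, parsed == 3+4j)
```

Quoting decides between `str` and a number, and the comma decides where one element ends and the next begins.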


In [6]:
# Examples of splitting strings (split, partition, slicing)
tekst = 'Ana ima jabuku, Marko ima krušku.'

# Split on whitespace
rijeci = tekst.split()
print(rijeci)

# Split on the comma
dijelovi = tekst.split(',')
print(dijelovi)

# Partition at the first occurrence of the word 'ima'
prvi, sep, ostatak = tekst.partition('ima')
print(prvi, '|', sep, '|', ostatak)

# Slicing: the first 10 characters
print(tekst[:10])

['Ana', 'ima', 'jabuku,', 'Marko', 'ima', 'krušku.']
['Ana ima jabuku', ' Marko ima krušku.']
Ana  | ima |  jabuku, Marko ima krušku.
Ana ima ja
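
A few related string methods are often useful alongside `split` and `partition`. The following is a small sketch on the same sentence, not part of the original exercise; the variable names are chosen for illustration:

```python
text = 'Ana ima jabuku, Marko ima krušku.'

# maxsplit limits the number of splits; the remainder stays in the last item.
head, rest = text.split(' ', 1)
print(head, '|', rest)

# strip() removes the whitespace left over after splitting on a comma.
clauses = [part.strip() for part in text.split(',')]
print(clauses)

# rpartition() splits at the last occurrence of the separator.
before, sep, after = text.rpartition('ima')
print(before, '|', sep, '|', after)

# join() is the inverse of split() when words are separated by single spaces.
rejoined = ' '.join(text.split())
print(rejoined == text)
```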


## Data structures in Python



In [7]:
# Examples of Python's basic data structures (other than strings)

# List
lista = [1, 2, 3, 4]
print('List:', lista)

# Tuple
torka = (1, 2, 3, 4)
print('Tuple:', torka)

# Set (duplicates are dropped)
skup = {1, 2, 2, 3}
print('Set:', skup)

# Dictionary (dict)
rjecnik = {'ime': 'Ana', 'godine': 25}
print('Dictionary:', rjecnik)

List: [1, 2, 3, 4]
Tuple: (1, 2, 3, 4)
Set: {1, 2, 3}
Dictionary: {'ime': 'Ana', 'godine': 25}


**Object-oriented programming (OOP)** is a style of programming in which a program is built from objects. An object combines data (attributes) with the functions (methods) that operate on that data. OOP makes code easier to organize and reuse.
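
As a minimal sketch of the idea, the hypothetical `Person` class below bundles two attributes with two methods. The class and its method names are invented for illustration and do not come from the course material:

```python
class Person:
    """Bundles data (name, age) with the methods that work on that data."""

    def __init__(self, name, age):
        # Attributes: the data stored on each object.
        self.name = name
        self.age = age

    def birthday(self):
        # A method that changes the object's own state.
        self.age += 1
        return self.age

    def describe(self):
        # A method that only reads the object's state.
        return f'{self.name} is {self.age} years old'


ana = Person('Ana', 25)
ana.birthday()
description = ana.describe()
print(description)
```

Each `Person` object carries its own `name` and `age`, so the same methods can be reused for any number of people.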

In [None]:
# !conda install nltk scikit-learn numpy pandas matplotlib -y # Anaconda, Python 3
!pip install nltk scikit-learn numpy pandas matplotlib # plain Python (pip)

In [None]:
# download the data needed by nltk
import nltk
nltk.download('punkt_tab')
nltk.download('brown')
nltk.download('universal_tagset')

The file `data\rjecnik.txt` contains a list of words with grammatical and semantic features. From the text, extract only the nouns, together with their description and grammatical features, and save them to a JSON file in the following format.
  

In [None]:
import re
import json

with open('data/ocr.txt', 'r', encoding='utf8') as ocr:
    content = ocr.read()

# entries are separated by blank lines
entries = re.split(r'\n\n', content)

# lemma, POS tag "im." (noun), gender, inflection inside 〈...〉, then the definition;
# the dot in "im." is escaped and the inner groups are non-greedy
regex = r'(?P<lemma>\w+)\s+(?P<pos>im\.)\s+(?P<gender>.*?)\s+〈(?P<inflection>.*?)〉(?P<definition>.*)'

for i, data in enumerate(entries):
    print(f'\n\nentry {i}: ', data)
    # re.DOTALL lets "." match newlines, so an entry may span several lines
    mObj = re.match(regex, data, re.DOTALL)

    if mObj:
        print('\n**Pattern found: ', end=' ')
        lex = mObj.groupdict()
        print(lex)

        jsonObj = json.dumps(lex, ensure_ascii=False, indent=4)

        with open(f"data/{lex['lemma']}.json", "w", encoding='utf8') as outfile:
            outfile.write(jsonObj)
    else:
        print('Not a noun')
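
To see how named groups turn one entry into a JSON record without needing the data file, here is a self-contained sketch. The entry string is made up for illustration and is not taken from the dictionary; the pattern assumes the layout "lemma, `im.`, gender, inflection in 〈〉, definition":

```python
import json
import re

# Hypothetical entry, invented for this example (not from the dictionary file).
sample_entry = 'kuća im. ž. 〈G kuće〉 building in which people live'

# Named groups (?P<name>...) make the match usable as a dictionary.
pattern = (r'(?P<lemma>\w+)\s+(?P<pos>im\.)\s+(?P<gender>.*?)\s+'
           r'〈(?P<inflection>.*?)〉(?P<definition>.*)')

match = re.match(pattern, sample_entry, re.DOTALL)
record = {key: value.strip() for key, value in match.groupdict().items()}

# ensure_ascii=False keeps Croatian letters readable in the JSON text.
json_text = json.dumps(record, ensure_ascii=False, indent=4)
print(json_text)
```

Escaping the dot in `im\.` matters: an unescaped `.` would accept any character after `im`.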

1. The attachment contains an excerpt from the School Dictionary of the Croatian Language (Školski rječnik hrvatskog jezika). Your task is as follows:
  * Extract the verbs with their grammatical features and description, as shown in the example `gakati.json`
  * Extract the adjectives with their grammatical features, as shown in the example `gadan.json`

In [None]:
import json
import re

import unidecode  # third-party package: pip install unidecode

with open('SK_rjecnik.txt', encoding='utf8') as f:
    text = f.read()

text = text.replace('\t', '')

# entries are separated by blank lines; join each entry into a single line
entries = [entry.replace('\n', ' ') for entry in re.split(r'\n\n', text)]

# verbs: lemma, POS tag "gl.", aspect, transitivity, inflection inside 〈...〉
regex = r'((\w+)\s+(gl\.)\s+(nesvrš\.|svrš\.)\s+(prijel\./neprijel\.|neprijel\.|prijel\.)\s+(〈.*?〉))'

for data in entries:
    mObj = re.match(regex, data, re.DOTALL)
    if mObj:
        lex = {}
        # unidecode strips diacritics, so values and file names are plain ASCII
        lex['rijec'] = unidecode.unidecode(mObj.group(2))
        lex['vrsta'] = unidecode.unidecode(mObj.group(3))
        lex['vid'] = unidecode.unidecode(mObj.group(4))
        lex['prijelaznost'] = unidecode.unidecode(mObj.group(5))
        lex['gramaticka obiljezja'] = unidecode.unidecode(mObj.group(6))
        # the definition is everything after the matched header
        lex['definicija'] = data.partition(mObj.group(1))[2]

        with open(f"{lex['rijec']}.json", 'w', encoding='utf8') as outfile:
            outfile.write(json.dumps(lex))

# adjectives: lemma, POS tag "prid.", inflection inside 〈...〉
regex2 = r'((\w+)\s+(prid\.)\s+(〈.*?〉))'

for data2 in entries:
    mObj = re.match(regex2, data2, re.DOTALL)
    if mObj:
        lex2 = {}
        lex2['rijec'] = unidecode.unidecode(mObj.group(2))
        lex2['vrsta'] = unidecode.unidecode(mObj.group(3))
        lex2['gramaticka obiljezja'] = unidecode.unidecode(mObj.group(4))
        lex2['definicija'] = data2.partition(mObj.group(1))[2]

        with open(f"{lex2['rijec']}.json", 'w', encoding='utf8') as outfile:
            outfile.write(json.dumps(lex2))

entries

2. Implement sentiment analysis using naive Bayes for the movie reviews in `nltk.corpus.movie_reviews`. As the features of each review, use whether the review contains each of the $K=2000$ most frequent words of the `movie_reviews` corpus. The classifier's accuracy must be at least 60%. Report precision, recall and the $F_1$ score for each of the positive and negative sentiment categories, and show the confusion matrix.

In [None]:
# your solution ...
import nltk
from nltk.corpus import movie_reviews
from nltk.probability import FreqDist
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np
 
nltk.download('movie_reviews')
 
all_words = [word.lower() for word in movie_reviews.words()]
freq_dist = FreqDist(all_words)
most_common_words = [word for word, _ in freq_dist.most_common(2000)]
 

def extract_features(review_words):
    review_words_set = set(review_words)
    features = {word: (word in review_words_set) for word in most_common_words}
    return features
 

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
 
# Shuffle: put the documents in random order (fixed seed for reproducibility)
np.random.seed(42)
np.random.shuffle(documents)

feature_sets = [(extract_features(words), category) for words, category in documents]
train_set, test_set = feature_sets[:1600], feature_sets[1600:]
 
# Train the naive Bayes classifier
classifier = NaiveBayesClassifier.train(train_set)

# Evaluate the model
print("Classifier accuracy (%):", accuracy(classifier, test_set) * 100)

# Show the most informative features
classifier.show_most_informative_features(10)

# Predictions on the test set
y_true = [label for _, label in test_set]
y_pred = [classifier.classify(features) for features, _ in test_set]

# Evaluation metrics
print("\nClassification report:")
print(classification_report(y_true, y_pred, target_names=movie_reviews.categories()))

print("\nConfusion matrix:")
print(confusion_matrix(y_true, y_pred))
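
Precision, recall and $F_1$ can all be read off a confusion matrix. The sketch below uses invented counts (not results from this exercise) and assumes the scikit-learn layout, where rows are true classes and columns are predicted classes:

```python
# Hypothetical confusion matrix for the labels ['neg', 'pos'];
# rows = true class, columns = predicted class. The counts are made up.
labels = ['neg', 'pos']
matrix = [[150, 50],
          [30, 170]]


def per_class_metrics(matrix, index):
    """Return (precision, recall, f1) for the class at the given index."""
    tp = matrix[index][index]
    fp = sum(row[index] for row in matrix) - tp  # predicted as class, but wrong
    fn = sum(matrix[index]) - tp                 # class members that were missed
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


total = sum(sum(row) for row in matrix)
overall_accuracy = sum(matrix[i][i] for i in range(len(matrix))) / total
print('accuracy:', overall_accuracy)

metrics = {label: per_class_metrics(matrix, i) for i, label in enumerate(labels)}
for label, (precision, recall, f1) in metrics.items():
    print(f'{label}: precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}')
```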

3. Do the following:
 1. Compute the TF-IDF vector for every sentence in the document. Display the resulting vector for each sentence.

 2. Apply the K-Means algorithm to the resulting TF-IDF vectors. Assume K = 3. Print a table showing which cluster each sentence belongs to.

 3. You are given the following sentence: `Reinforcement learning is used also in natural language processing` Find which cluster this sentence belongs to.
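
Before reaching for a library, it can help to compute TF-IDF by hand. The sketch below uses the textbook definition, term frequency times $\ln(N/\mathrm{df})$, on three shortened toy documents. Note that scikit-learn's `TfidfVectorizer` by default uses a smoothed IDF and L2-normalises each row, so its numbers differ from these:

```python
import math

# Toy documents: shortened, lower-cased stand-ins for real sentences.
docs = [
    'machine learning algorithms use data',
    'deep learning models require data',
    'natural language processing',
]
tokenized = [doc.split() for doc in docs]
vocabulary = sorted({token for doc in tokenized for token in doc})


def idf(term):
    # Document frequency: in how many documents the term occurs.
    df = sum(term in doc for doc in tokenized)
    return math.log(len(tokenized) / df)


def tfidf_vector(doc):
    # Term frequency is the term's share of the document's tokens.
    return [doc.count(term) / len(doc) * idf(term) for term in vocabulary]


vectors = [tfidf_vector(doc) for doc in tokenized]
for doc, vector in zip(docs, vectors):
    print(doc, [round(weight, 3) for weight in vector])
```

Words that occur in many documents (here `learning` and `data`) get a low weight, and words unique to one document get the highest.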

In [None]:
# your solution
import re  # used by the tokenizer below

import nltk
import pandas as pd
from nltk.stem.porter import PorterStemmer  # stemmer for English words
from sklearn.cluster import KMeans  # clustering algorithm
from sklearn.feature_extraction.text import CountVectorizer  # turns documents into token-count vectors
from sklearn.feature_extraction.text import TfidfVectorizer  # turns documents into TF-IDF weighted vectors

nltk.download('wordnet')  # corpus required by WordNetLemmatizer below

sents = ['Machine learning algorithms use data to make predictions.','Deep learning models require large amounts of labeled data.','Natural language processing techniques analyze textual data.','Milena came home after finishing her workout, immediately took off her backpack, and washed her hands.','She sat down at the table to eat.','Then she focused on her homework, not thinking about tomorrow’s match.','How can you accentuate words in English?','Do you want to learn a new language quickly and efficiently?','Exploring English syntax: embark on an adventure through English sentence structure!']
print(sents)

count_vectorizer = CountVectorizer()
# tokenize and count
X = count_vectorizer.fit_transform(sents)
# use pandas for display
#pd.DataFrame(X.toarray())
# nicer display: label the columns with the vocabulary terms
pd.DataFrame(X.toarray(),columns=count_vectorizer.get_feature_names_out())

# new vectorizer, this time dropping English stop words
count_vectorizer = CountVectorizer(stop_words='english')
X = count_vectorizer.fit_transform(sents)
pd.DataFrame(X.toarray(),columns=count_vectorizer.get_feature_names_out())

from nltk.stem import WordNetLemmatizer
# lemmatizer
lemmatizer = WordNetLemmatizer()
# tokenizer
def tokenizer(text):
    words = re.sub(r'[^A-Za-z0-9\-]', " ", text).lower().split()  # keep only the actual words in the text
    words = [lemmatizer.lemmatize(word) for word in words]
    return words

# vectorizer
count_vectorizer = CountVectorizer(stop_words='english',tokenizer=tokenizer)
X = count_vectorizer.fit_transform(sents)
pd.DataFrame(X.toarray(),columns=count_vectorizer.get_feature_names_out())

# use_idf=False with norm='l1' yields relative term frequencies (TF only, no IDF weighting)
tfidf_vectorizer = TfidfVectorizer(stop_words='english', tokenizer=tokenizer, use_idf=False, norm='l1')
X = tfidf_vectorizer.fit_transform(sents)
df=pd.DataFrame(X.toarray(), columns = tfidf_vectorizer.get_feature_names_out())

print(sents)
df


In [None]:
# clustering algorithm
from sklearn.cluster import KMeans

# compute the TF-IDF vectors first, then cluster them
vectorizer = TfidfVectorizer(use_idf=True, tokenizer=tokenizer, stop_words='english')
X = vectorizer.fit_transform(sents)

# number of clusters: 3
number_of_clusters = 3
km = KMeans(n_clusters=number_of_clusters)
km.fit(X)

results = pd.DataFrame()
results['text'] = sents
results['category'] = km.labels_  # cluster labels
results

In [None]:
sents = ['Machine learning algorithms use data to make predictions.','Deep learning models require large amounts of labeled data.','Natural language processing techniques analyze textual data.','Milena came home after finishing her workout, immediately took off her backpack, and washed her hands.','She sat down at the table to eat.','Then she focused on her homework, not thinking about tomorrow’s match.','How can you accentuate words in English?','Do you want to learn a new language quickly and efficiently?','Exploring English syntax: embark on an adventure through English sentence structure!']
print(sents)

# TF-IDF vectors (bag-of-words counts with TF-IDF weighting)
vectorizer = TfidfVectorizer(use_idf=True, stop_words='english')
X = vectorizer.fit_transform(sents)


from sklearn.cluster import KMeans
# number of clusters
number_of_clusters = 3
km = KMeans(n_clusters=number_of_clusters)
km.fit(X)

# print the result
results = pd.DataFrame()
results['text'] = sents
results['category'] = km.labels_
print(results)

# new data point
text = ['Reinforcement learning is used also in natural language processing']
x = vectorizer.transform(text)
# predicted cluster
predicted_cluster = km.predict(x)


print(f"'{text[0]}' belongs to cluster {predicted_cluster[0]}")
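
Conceptually, predicting the cluster of a new vector with a fitted K-Means model means finding the nearest centroid. The sketch below shows that step in plain Python with made-up two-dimensional centroids. Real TF-IDF vectors have one dimension per vocabulary term, and the function name `nearest_centroid` is hypothetical:

```python
import math

# Hypothetical centroids in 2-D, invented for illustration.
centroids = [(0.9, 0.1), (0.1, 0.9), (0.5, 0.5)]


def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to the point (Euclidean)."""
    distances = [math.dist(point, centroid) for centroid in centroids]
    return distances.index(min(distances))


new_point = (0.8, 0.3)
cluster = nearest_centroid(new_point, centroids)
print(f'{new_point} belongs to cluster {cluster}')
```

Two caveats apply to the real exercise. Cluster numbers are arbitrary labels that can change between runs unless `random_state` is fixed. Words in the new sentence that were not in the fitted vocabulary are ignored by `vectorizer.transform`.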