# Exercise 4 - Segmentation

## Approach

The algorithm is *brute force*: to find the correct cuts it generates a large number of candidate cut sequences at random and assigns a score to each of them.
The score of a cut sequence is the sum, over all segments, of the occurrence counts of the n most frequent terms in each segment.
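
To get a feeling for how much of the search space a random search can cover, the number of distinct cut sequences can be counted directly. The sketch below assumes a document of 254 lines split into 4 topics and a budget of 20000 random draws (the values used later in this notebook); `count_cut_sequences` is a hypothetical helper, not part of the notebook.

```python
from math import comb

def count_cut_sequences(num_lines, ntopic):
    # The interior cuts are ntopic - 1 distinct positions chosen among 1 .. num_lines - 1
    return comb(num_lines - 1, ntopic - 1)

total = count_cut_sequences(254, 4)
n_iteration = 20000

print(total)                         # number of distinct cut sequences
print(f'{n_iteration / total:.2%}')  # upper bound on the share explored by random draws
```

With these values there are 2,667,126 candidate cut sequences, so 20000 random draws cover at most about 0.75% of them (fewer if some draws repeat).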

The underlying idea is that each topic has a set of words that occur many times, so the algorithm looks for
the cuts that maximize these counts and therefore isolate the most frequent terms of each segment.

The number of cuts (equivalently, the number of topics) must be given to the algorithm as input.

- **COOCCURRENCE**: in this notebook it corresponds to the frequency of the most-used words in a segment
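
The scoring idea can be illustrated on a tiny document. This is a minimal sketch, not the notebook's code: the toy lines are invented, and `segment_score` and `segmentation_score` are hypothetical names that mirror the description above.

```python
from collections import Counter

def segment_score(segment, n):
    # Sum of the counts of the n most frequent words in one segment
    counts = Counter(word for line in segment for word in line)
    return sum(count for _, count in counts.most_common(n))

def segmentation_score(lines, cuts, n):
    # cuts includes 0 and len(lines); consecutive pairs delimit the segments
    return sum(segment_score(lines[a:b], n) for a, b in zip(cuts, cuts[1:]))

# Toy document (invented): two lines about gorillas, two about quantum computing
toy = [
    ['gorilla', 'gorilla', 'forest'],
    ['gorilla', 'ape'],
    ['quantum', 'qubit', 'quantum'],
    ['quantum', 'gate'],
]

for cut in (1, 2, 3):
    print(cut, segmentation_score(toy, [0, cut, len(toy)], n=1))
```

The cut at 2, which separates the two topics, obtains the highest score (6, against 5 for the cut at 1 and 4 for the cut at 3).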

## Datasets

We used two different datasets to test the algorithm and probe its limits.

1. *data/segmentation_eng*: the dataset consists of excerpts from Wikipedia paragraphs on 4 different topics:
      - Gorillas
      - Quantum Computing
      - Astronomy
      - Kendrick Lamar Biography

    The correct cuts are at lines 59/60, 102/103 and 181/182.

2. *data/segmentation_eng_sametopic*: the sametopic dataset contains two very similar topics, *astronomy* and *astrophysics*:
      - Gorillas
      - Astrophysics
      - Quantum Computing
      - Astronomy

    The correct cuts are at lines 59/60, 89/90 and 133/134.


### Imports

In [6]:
from collections import Counter
import random

from gensim.utils import simple_preprocess
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

### Methods

In [7]:
def remove_stop_words_ita(phrase):
    stop_words = stopwords.words('italian')
    phrase = phrase.split()
    phrase = [word for word in phrase if word not in stop_words]
    return phrase

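def get_text_from_file(path):
    '''
    Read the file line by line, drop the English stop words and
    return one list of preprocessed tokens per line.
    '''
    file = []
    stop_words = set(stopwords.words('english'))
    with open(path, 'r') as f:
        for row in f:
            filtered_s = [w for w in word_tokenize(row) if w.lower() not in stop_words]
            # simple_preprocess expects a string: it lowercases, removes accents and tokenizes
            file.append(simple_preprocess(' '.join(filtered_s), deacc=True))
    return file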

def cooccurrence(text, n_most_words):
    '''
    Calculates the cooccurrence value of a sequence of words.
    This value corresponds to the sum of the occurrences of the n most frequently used words in the word list.
    '''
    c = Counter()
    for row in text:
        c.update(row)

    # Sum the counts of the n most common words
    return sum(count for _, count in c.most_common(n_most_words))

def extract_segment(file, start, end):
    '''
    Given the first and the last line, extract the segment.
    '''
    segment = []
    for i in range(start, end):
        segment.append(file[i])
    return segment

def get_cuts(ntopic):
    '''
    Generate ntopic - 1 distinct random cuts between 1 and num_lines - 1, in increasing order.
    The returned list starts with 0 and ends with the number of lines in the document.
    Relies on the global variable num_lines set in the Data Processing cell.
    '''
    # Sampling without replacement and sorting yields strictly increasing cuts
    cut = [0] + sorted(random.sample(range(1, num_lines), ntopic - 1))
    cut.append(num_lines)

    return cut

def load_file(file_path):
    file = get_text_from_file(file_path)
    c = Counter()
    num_lines = len(file) # Number of lines in the file (one token list per line)

    for row in file:
        c.update(row)

    return num_lines, file, c
    

### Algorithm

In [8]:
def max_scores(file, ntopic, n_most_words, n_iteration): # n_most_words = number of most used words, used in cooccurrence()
    scores = [0]*ntopic
    best_score = 0  # avoids shadowing the built-in max
    best_cut = []

    for _ in range(n_iteration):
        # First generate the random cuts
        cut = get_cuts(ntopic)

        # Extract the segments defined by the random cuts and
        # calculate the cooccurrence value of each segment
        for i in range(len(cut)-1):
            text = extract_segment(file, cut[i], cut[i+1])
            scores[i] = cooccurrence(text, n_most_words)

        # Evaluate the result and store the best one
        sum_scores = sum(scores)
        if sum_scores > best_score:
            best_score = sum_scores
            best_cut = cut

    return best_cut, best_score
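
For short documents the random draws can be replaced by a truly exhaustive enumeration of all cut sequences. The following is a sketch under that assumption, not the notebook's method: `score` and `exhaustive_best_cuts` are hypothetical helpers and the toy data is invented. The number of combinations grows quickly with the number of lines, so this is only practical for small inputs.

```python
from collections import Counter
from itertools import combinations

def score(lines, cuts, n):
    # Sum over the segments of the counts of the n most frequent words
    total = 0
    for a, b in zip(cuts, cuts[1:]):
        counts = Counter(w for line in lines[a:b] for w in line)
        total += sum(c for _, c in counts.most_common(n))
    return total

def exhaustive_best_cuts(lines, ntopic, n):
    # Enumerate every strictly increasing choice of ntopic - 1 interior cuts
    best_cuts, best_score = None, -1
    for inner in combinations(range(1, len(lines)), ntopic - 1):
        cuts = [0, *inner, len(lines)]
        s = score(lines, cuts, n)
        if s > best_score:
            best_cuts, best_score = cuts, s
    return best_cuts, best_score

# Toy document (invented): three topics, two lines each
toy = [
    ['gorilla', 'gorilla'], ['gorilla', 'ape'],
    ['quantum', 'quantum'], ['quantum', 'qubit'],
    ['star', 'star'], ['star', 'galaxy'],
]

print(exhaustive_best_cuts(toy, ntopic=3, n=1))
```

On this toy input the enumeration returns the cuts `[0, 2, 4, 6]` with a score of 9, which matches the three topic boundaries.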

### Data Processing

In [9]:
n_most_words = 3
n_topics = 4
n_iteration = 20000 # Number of random cut sequences to try - a higher value takes more time but is more likely to find a better result
c = Counter()

# Load the file: returns the number of lines, the tokenized lines and the word counts of the whole file
file_data = load_file('../data/segmentation_eng.txt')

num_lines = file_data[0]
file = file_data[1]

for row in file:
    c.update(row)
    
print(c.most_common(30))

[('lamar', 78), ('quantum', 63), ('gorilla', 34), ('gorillas', 31), ('released', 30), ('also', 27), ('astronomy', 25), ('album', 25), ('first', 24), ('dre', 19), ('classical', 18), ('mixtape', 17), ('algorithm', 16), ('stars', 16), ('silverback', 15), ('early', 15), ('astronomical', 15), ('ray', 15), ('song', 15), ('years', 14), ('females', 14), ('made', 14), ('new', 14), ('computers', 14), ('computer', 14), ('wavelengths', 14), ('video', 14), ('males', 13), ('known', 13), ('may', 13)]


In [10]:
res = max_scores(file, n_topics, n_most_words, n_iteration)

print(f'\nBest cut at lines -->  {res[0]}\nthe max sum is    -->  {res[1]}')



Best cut at lines -->  [0, 59, 102, 178, 254]
the max sum is    -->  363
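
The result can be compared with the correct cuts listed in the Datasets section. The sketch below assumes that a cut value `k` means "between line `k` and line `k + 1`", so the reference cuts 59/60, 102/103 and 181/182 become `[59, 102, 181]`; `boundary_errors` is a hypothetical helper.

```python
def boundary_errors(predicted, reference):
    # Absolute distance, in lines, between each predicted interior cut and the matching reference cut
    inner = predicted[1:-1]  # drop the leading 0 and the trailing number of lines
    if len(inner) != len(reference):
        raise ValueError('predicted and reference must contain the same number of cuts')
    return [abs(p - r) for p, r in zip(inner, reference)]

predicted = [0, 59, 102, 178, 254]  # best cuts printed by the run above
reference = [59, 102, 181]          # correct cuts for segmentation_eng

print(boundary_errors(predicted, reference))
```

Under this assumption the first two cuts are exact and the third is off by 3 lines.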


# Just a Try - Dynamic Cutting Algorithm
## Does not work

Instead of generating the cut values at random, we choose the best cut iteratively.

In [11]:
def get_scores_2_seg(seg1, seg2, most_words):
    # Combined cooccurrence score of two adjacent segments
    return cooccurrence(seg1, most_words) + cooccurrence(seg2, most_words)

In [12]:
def max_scores_dyn(file, ntopic, n_most_words, n_iteration, n_limit): # n_most_words = number of most used words, used in cooccurrence()
    
    max_scores, limit, direction = [0]*ntopic, [0]*ntopic, [1]*ntopic
    cut = get_cuts(ntopic)
    # cut = [0, 41, 100, 160, 254]
    
    for _ in range(n_iteration):
        for k in range(1,len(cut)-1):
            # get the scores of the two segments divided by a cut
            rel_score = get_scores_2_seg(extract_segment(file, cut[k-1], cut[k]), extract_segment(file, cut[k], cut[k+1]), n_most_words)
            
            # NOTE: max_scores[k] is read here but max_scores[k-1] is written below,
            # so the last entry is never updated and stays at 0
            if(rel_score > max_scores[k]):
                max_scores[k-1] = rel_score
                limit[k] = 0
            # Update the value of cut --> Try a direction
            elif limit[k] < n_limit:
                cut[k] = cut[k] + (1 * direction[k])
                limit[k] = limit[k] + 1
            # Change direction of search --> Wrong direction
            else:
                cut[k] = cut[k] - (limit[k] * (direction[k])) #! to be checked carefully
                direction[k] *= -1
                limit[k] = 0
            
    return cut, max_scores
    

In [13]:
res = max_scores_dyn(file, 4, 3, 100000, 40)

print(f'\nBest cut -->  {res[0]}\nthe max value is  -->  {res[1]}')


Best cut -->  [0, 94, 144, 189, 254]
the max value is  -->  [181, 78, 164, 0]
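
One way to make the iterative idea concrete is a coordinate-ascent style local search: each cut in turn is moved to the best position between its two neighbours, and the sweep is repeated until nothing changes. This is a different technique from `max_scores_dyn`, sketched here on invented toy data; `score` and `refine_cuts` are hypothetical helpers, and a local search of this kind may stop at a local optimum.

```python
from collections import Counter

def score(lines, cuts, n):
    # Sum over the segments of the counts of the n most frequent words
    total = 0
    for a, b in zip(cuts, cuts[1:]):
        counts = Counter(w for line in lines[a:b] for w in line)
        total += sum(c for _, c in counts.most_common(n))
    return total

def refine_cuts(lines, cuts, n, max_rounds=100):
    cuts = list(cuts)
    for _ in range(max_rounds):
        changed = False
        for k in range(1, len(cuts) - 1):
            best_pos, best_score = cuts[k], score(lines, cuts, n)
            # Try every position strictly between the neighbouring cuts
            for pos in range(cuts[k - 1] + 1, cuts[k + 1]):
                s = score(lines, cuts[:k] + [pos] + cuts[k + 1:], n)
                if s > best_score:
                    best_pos, best_score = pos, s
            if best_pos != cuts[k]:
                cuts[k] = best_pos
                changed = True
        if not changed:  # no cut moved during a full sweep
            break
    return cuts, score(lines, cuts, n)

# Toy document (invented): three topics, two lines each
toy = [
    ['gorilla', 'gorilla'], ['gorilla', 'ape'],
    ['quantum', 'quantum'], ['quantum', 'qubit'],
    ['star', 'star'], ['star', 'galaxy'],
]

print(refine_cuts(toy, [0, 1, 5, 6], n=1))
```

Starting from the deliberately wrong cuts `[0, 1, 5, 6]`, the sweep moves them to `[0, 2, 4, 6]` with a score of 9 on this toy input.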


In [14]:
max_scores_per_cut, scores = [0, 0, 0], [0, 0, 0] # not named max_scores, which would overwrite the function
sum_score = 0
best_score = 0
# cut = [0, 10, 45, num_lines]

#* These parameters are tied to the single iterations, i.e. to the search for the cuts of the individual segments
limit = [0, 0, 0] # Used to avoid going too far away from the cut
direction = [1, 1, 1] # Direction of the search: 1 means "top" and -1 means "bottom"

cut = [0, 40, 120, num_lines] #* Real: [59, 102]

# Just a Try - Segmentation without number of topics
## Does not work

In [15]:
best_score = 0
best_cut = []
best_n_topic = 0

for _ in range(2):
    ntopic = random.randint(2, 10)
    res = max_scores(file, ntopic, n_most_words, n_iteration)
    
    if best_score < res[1]:
        best_score = res[1]
        best_cut = res[0]
        best_n_topic = ntopic
        
print(f'max is {best_score} at {best_cut} with {best_n_topic} topics')
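
One caveat when comparing raw scores across different numbers of topics: splitting a segment can never lower the score, because the top-n counts of each part are at least that part's share of the counts of the whole segment's top-n words. The comparison therefore tends to favour more topics. The sketch below illustrates this on invented toy data with two real topics; `score` is a hypothetical helper that mirrors the scoring used in this notebook.

```python
from collections import Counter

def score(lines, cuts, n):
    # Sum over the segments of the counts of the n most frequent words
    total = 0
    for a, b in zip(cuts, cuts[1:]):
        counts = Counter(w for line in lines[a:b] for w in line)
        total += sum(c for _, c in counts.most_common(n))
    return total

# Toy document (invented): two topics, two lines each
toy = [
    ['gorilla', 'gorilla', 'forest'],
    ['ape', 'ape', 'gorilla'],
    ['quantum', 'quantum', 'gate'],
    ['qubit', 'qubit', 'quantum'],
]

for cuts in ([0, 4], [0, 2, 4], [0, 1, 2, 3, 4]):
    print(len(cuts) - 1, 'segments ->', score(toy, cuts, n=1))
```

Here one segment scores 3, the correct two-topic split scores 6, and one segment per line scores 8, so the highest score does not identify the right number of topics.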