### Goal: determine how repetitive each song's lyrics are

To measure how repetitive a song is, we will use the LZ77 algorithm of Abraham Lempel and Jacob Ziv, which finds repeated substrings in a text and replaces each repeat with a reference to its earlier occurrence. We first compress each lyrics file with gzip, which uses DEFLATE, a combination of LZ77 and Huffman coding. We then run infgen, a program that reports statistics about a gzip-compressed file. From those statistics we estimate how repetitive a song is: the more the lyrics shrink under compression, the more repetition they contain.

References:

- https://jvns.ca/blog/2015/02/22/how-gzip-uses-huffman-coding/
- https://github.com/madler/infgen/blob/master/infgen.c
- https://github.com/colinmorris/lalala
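
As a quick illustration of why compression can stand in for repetitiveness, the sketch below compares how much a repetitive text and a varied text shrink under DEFLATE. It uses Python's `zlib` module rather than gzip plus infgen, so it is not the pipeline used in this notebook; `deflate_ratio` is a hypothetical helper, and both sample strings are made up for the example.

```python
import zlib

def deflate_ratio(text):
    """Return uncompressed bytes / compressed bytes for `text` (hypothetical helper)."""
    raw = text.encode("utf-8")
    # zlib wraps the same DEFLATE stream that gzip uses, with a smaller header.
    compressed = zlib.compress(raw, 9)
    return len(raw) / len(compressed)

# Made-up sample texts: one line repeated many times versus one line with little repetition.
repetitive = "la la la la la la la la\n" * 20
varied = "The quick brown fox jumps over the lazy dog while seven wizards quietly judge boxing\n"

print("repetitive:", round(deflate_ratio(repetitive), 2))
print("varied:    ", round(deflate_ratio(varied), 2))
```

The repetitive text should come out with a much higher ratio, because LZ77 can encode almost all of it as references to earlier copies.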

### On infgen usage

With the -s option, infgen will generate statistics comments, all of which
    begin with "! stats ".  There are statistics for each deflate block, and
    summary statistics after the last deflate block.  
    
    After the last deflate block, total statistics are output.  They all begin
    with "! stats total ".  The block input and output amounts are summed for
    example as: "! stats total inout 93232233:0 (55120762) 454563840", with the
    same format as "! stats inout", except without the reach.
    "! stats total block average 34162.3 uncompressed" states for example that
    the average number of uncompressed bytes per block was 34162.3.  Similarly
    "! stats total block average 4142.5 symbols" states that there were 4142.5
    symbols on average per block.  "! stats total literals 6.9 bits each"
    states that there were 6.9 bits used on average per literal.  Lastly the
    matches are summed: "! stats total matches 95.2% (33314520 x 13.0)" with
    the same format as "! stats matches".
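
The total lines described above can be picked out with regular expressions. The following is a minimal sketch, not part of infgen: `parse_totals` and the dictionary keys are names chosen here for illustration, and the interpretation of the `inout` fields (input bytes and bits, symbol count, uncompressed bytes) is an assumption based on the infgen comments. The sample lines are the ones quoted in the usage text.

```python
import re

# Patterns for two of the "! stats total" lines quoted in the infgen usage text.
TOTAL_INOUT = re.compile(r"! stats total inout (\d+):(\d+) \((\d+)\) (\d+)")
TOTAL_MATCHES = re.compile(r"! stats total matches ([\d.]+)% \((\d+) x ([\d.]+)\)")

def parse_totals(lines):
    """Collect summary statistics from infgen -s output lines (hypothetical helper)."""
    totals = {}
    for line in lines:
        m = TOTAL_INOUT.match(line)
        if m:
            # Field meanings are an assumption; see the infgen source for the definition.
            totals["input_bytes"] = int(m.group(1))
            totals["input_bits"] = int(m.group(2))
            totals["symbols"] = int(m.group(3))
            totals["uncompressed_bytes"] = int(m.group(4))
        m = TOTAL_MATCHES.match(line)
        if m:
            totals["match_percent"] = float(m.group(1))
            totals["n_matches"] = int(m.group(2))
            totals["avg_match_length"] = float(m.group(3))
    return totals

sample = [
    "! stats total inout 93232233:0 (55120762) 454563840",
    "! stats total matches 95.2% (33314520 x 13.0)",
]
totals = parse_totals(sample)
print(totals)
```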

In [38]:
import pandas as pd
import re
import os

In [3]:
lyrics = pd.read_csv('lyrics.nosync/source.csv')

In [3]:
lyrics.shape

(231786, 2)

In [4]:
lyrics.head()

Unnamed: 0,lyrics,song_id
0,\n\nA: Prince Rupert Awakes\n\nFarewell the te...,1760739
1,\n\nI'll stay with you until the end\nI say yo...,1165091
2,\n\n[Hook: Vidal Garcia]\nIf you was my bitch\...,2956155
3,\n\nTodo color Tropicana\n\n,3639039
4,\n\nIntéressé.e par l'explication des paroles ...,2824460


In [5]:
test = lyrics['lyrics'][5].split('\n')

In [6]:
with open('test_file.txt', 'w') as f:
    for item in test:
        f.write("%s\n" % item)

In [7]:
for i,j in lyrics.iterrows():
    print(j['lyrics'])
    print(j['song_id'])
    break



A: Prince Rupert Awakes

Farewell the temple master's bells
His kiosk and his black worm seed
Courtship solely of his word
With Eden guaranteed
For now Prince Rupert's tears of glass
Make saffron sabbath eyelids bleed
Scar the sacred tablet of wax
On which the Lizards feed

Wake your reason's hollow vote
Wear your blizzard season coat
Burn a bridge and burn a boat
Stake a Lizard by the throat

Go Polonius or kneel
The reapers name their harvest dawn
All your tarnished devil's spoons
Will rust beneath our corn
Now bears Prince Rupert's garden roam
Across his rain tree shaded lawn
Lizard bones become the clay
And there a swan is born

Wake your reason's hollow vote
Wear your blizzard season coat
Burn a bridge and burn a boat
Stake a Lizard by the throat

Gone soon Piepowder's moss-weed court
Round which upholstered Lizards sold
Visions to their leaden flock
Of rainbows' ends and gold
Now tales Prince Rupert's peacock brings
Of walls and trumpets thousand fold
Prophets chained for burni

In [9]:
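# Write each song's lyrics to a separate text file named <song_id>.txt
for index, row in lyrics.iterrows():
    lyrics_data = row.lyrics
    song_id = row.song_id

    # Set the encoding explicitly: the lyrics contain non-ASCII characters.
    with open(str(song_id)+'.txt', 'w', encoding='utf-8') as f:
        f.write(lyrics_data)
    # Stop after the first song while testing; remove to process every song.
    break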

In [25]:
# Gzip every .txt file in the current folder (each x.txt is replaced by x.txt.gz)
! gzip *.txt
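
The same step can also be done from Python with the standard `gzip` module, which avoids depending on the shell. This is a sketch under assumptions: `gzip_file` is a hypothetical helper, and unlike the `gzip` command it leaves the original `.txt` file in place.

```python
import gzip
import shutil
from pathlib import Path

def gzip_file(path):
    """Compress `path` to `path` + '.gz' and return the new path (hypothetical helper)."""
    src = Path(path)
    dst = src.with_name(src.name + ".gz")
    # Stream the bytes so large files are not loaded into memory at once.
    with open(src, "rb") as f_in, gzip.open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return dst

# Example usage: compress every .txt file in the current folder.
# for txt in Path(".").glob("*.txt"):
#     gzip_file(txt)
```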

In [27]:
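# Extract infgen data about compression.
# The redirect target must be one concrete file name, not a glob such as `> *.txt`.
# With a single .gz file present, the input glob expands to that file;
# the loop below runs infgen once per .gz file.
! ./infgen -s *.txt.gz > infgen-out.txt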

In [51]:
# Find all files with .gz extension
gz_files = [f for f in os.listdir() if f.endswith('.gz')]

In [70]:
import os

for i in gz_files:
    print("Processing:", i)
    cmd = './infgen -s {file} > {file}-infgen-out.txt'.format(file=i)
    print('Running command: ' + cmd)
    os.system(cmd)

Processing: 1760739.txt.gz
Running command: ./infgen -s 1760739.txt.gz > 1760739.txt.gz-infgen-out.txt
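
An alternative to building a shell string for `os.system` is `subprocess.run` with an argument list, which sidesteps shell quoting and returns the exit code. The sketch below is an assumption about how this could look, not the code used above: `run_to_file` and `infgen_command` are hypothetical helpers, and only the `-s` flag already used in this notebook is passed to infgen.

```python
import subprocess

def run_to_file(cmd, out_path):
    """Run `cmd` (a list of arguments) and write its stdout to `out_path`."""
    with open(out_path, "w") as out:
        result = subprocess.run(cmd, stdout=out, check=False)
    return result.returncode

def infgen_command(gz_path, infgen="./infgen"):
    """Build the argument list for `infgen -s <file>` (path to infgen is an assumption)."""
    return [infgen, "-s", gz_path]

# Example usage, mirroring the loop above:
# for gz in gz_files:
#     code = run_to_file(infgen_command(gz), gz + "-infgen-out.txt")
#     if code != 0:
#         print("infgen failed for", gz)
```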


In [47]:
! ./infgen -s "1760739.txt.gz" > hey.txt

In [78]:
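def parse_ratio(f):
    '''Estimate the compression ratio from infgen -s output lines, as a proxy for song repetitiveness'''
    # Compile the patterns once instead of on every line.
    literals_re = re.compile(r'! stats literals \d+\.\d+ bits each \(\d+/(\d+)\)')
    symbols_re = re.compile(r'! stats total inout \d+:\d+ \((\d+)\)')
    uncomp_re = re.compile(r'! stats total block average (\d+)\.\d+ uncompressed')

    matches = 0
    n_literals = 0
    n_symbols = 0
    uncomp = None

    for line in f:
        # Each back-reference is printed as "match <length> <distance>".
        if line.startswith('match'):
            matches += 1

        m = literals_re.match(line)
        if m:
            # Reported once per deflate block, so sum across blocks.
            n_literals += int(m.group(1))

        m = symbols_re.match(line)
        if m:
            n_symbols = int(m.group(1))

        m = uncomp_re.match(line)
        if m:
            # Average uncompressed bytes per block; this equals the file size
            # only when the file fits in a single deflate block.
            uncomp = int(m.group(1))

    if uncomp is None:
        raise ValueError('no "! stats total block average ... uncompressed" line found')
    assert matches + n_literals == n_symbols

    # 1 byte per literal, 3 bytes per match.
    pseudosize = matches * 3 + n_literals
    ratio = uncomp / pseudosize

    return ratio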
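
The size estimate at the end of `parse_ratio` charges one byte per literal and three bytes per match, then divides the uncompressed size by that total. A small worked example, with made-up counts chosen only to show the arithmetic (`pseudo_ratio` is a hypothetical helper):

```python
def pseudo_ratio(uncompressed_bytes, n_literals, n_matches):
    """Uncompressed size divided by the 1-byte-per-literal, 3-bytes-per-match estimate."""
    pseudosize = n_matches * 3 + n_literals
    return uncompressed_bytes / pseudosize

# Illustrative numbers only: 1000 bytes of lyrics encoded as 200 literals and 50 matches.
# pseudosize = 50 * 3 + 200 = 350, so the ratio is 1000 / 350.
print(round(pseudo_ratio(1000, 200, 50), 2))  # 2.86
```

A song with fewer literals and more (or longer) matches for the same length gets a smaller pseudosize and therefore a higher ratio.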

In [80]:
def ratio_by_file(filename):
    '''Read an infgen output file and return its compression ratio'''
    # Read the raw lines directly: pd.read_table would treat the first line
    # as a header and split lines on tabs.
    with open(filename) as f:
        return parse_ratio(f)