# Formatting the LPP txt files

In this notebook, we go from the raw LPP txt files to a word-based, tab-separated file (`.tsv`) through the following steps:
- Tokenizing the text into words
- Removing the blank space between a word and a following `:`
- Capitalizing the first word of each sentence
- Removing the blank space between `-` and the following word (dialogue)
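
The steps listed above can be sketched in plain Python on a single string. This is only an illustration, not the code the notebook runs: `clean_and_tokenize` and the sample sentence are hypothetical, and `str.split()` splits on any whitespace, which only approximates the space-based splitting done with `sed` in the next section.

```python
def clean_and_tokenize(text: str) -> list[str]:
    """Hypothetical helper: apply the cleaning steps to one string."""
    text = text.replace(" :", ":")   # remove the space before ":"
    text = text.replace("- ", "-")   # attach a dialogue dash to the next word
    words = text.split()             # whitespace tokenization, no empty tokens

    capitalize_next = True           # the very first word starts a sentence
    for k, w in enumerate(words):
        if capitalize_next:
            words[k] = w.capitalize()
        # A period in the current word marks the end of a sentence
        capitalize_next = "." in w
    return words


sample = "il disait : - bonjour. le livre est ici."
print(clean_and_tokenize(sample))
```

Note that `str.capitalize()` only affects the first character, so a word that starts with a dash or a quotation mark keeps its lowercase first letter.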

### Bash commands preprocessing

In [None]:
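# Remove the space before ":" and the space after a dialogue dash, in place.
# Both substitutions run in a single pass so that the .bak files keep the untouched originals
# (a second -pi.bak run would overwrite them with already-modified text).
!perl -pi.bak -e 's/ :/:/g; s/- /-/g' *.txt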

# Word tokenizing: one word per line, empty lines dropped
!for f in `seq 1 9` ; do sed 's/ /\n/g' text_french_run$f.txt | awk 'length($0) > 0 ' > new_test_run$f.txt; done


### Python commands tokenizing

In [100]:
import pandas as pd
import numpy as np
import copy

In [None]:
for i in np.arange(1,10):
    with open(f'./text_lpp/new_test_run{i}.txt') as temp_file:

        lpp = temp_file.read().splitlines() 


    df = pd.DataFrame(lpp)
    next_cap = False

    for index, word in df.iterrows():
        # Capitalize the first word of the file and any word that follows a sentence-final word
        if index == 0 or next_cap:
            df.at[index, 0] = word[0].capitalize()
        # A period in the current word marks the end of a sentence
        next_cap = '.' in str(word[0])

    df.columns = ['word']
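    # One word every 0.35 s, starting at 0.7 s. Building the onsets from an integer range
    # guarantees exactly one onset per row, which np.arange with a float step does not.
    df['onset'] = 0.7 + 0.35 * np.arange(df.shape[0])
    df['duration'] = np.ones(df.shape[0])*0.3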
    
    df.to_csv(f'./txt_clean/run{i}_clean.tsv',sep='\t',index=False)
    
    
    # Create a second dataframe in which the black screen after the end of a sentence lasts longer.

    df_sentence_end = df.copy()
    end_of_sentence_delay = 0.2

    for index, row in df.iterrows():
        if '.' in str(row.word):
            # Delay every onset after this sentence-final word; its own onset is unchanged,
            # so the extra pause falls after the end of the sentence
            df_sentence_end.loc[index + 1:, 'onset'] += end_of_sentence_delay

    # i is the run number from the outer loop
    df_sentence_end.to_csv(f'./txt_clean_end_of_sentence/run{i}_clean_sentence.tsv',sep='\t',index=False)

        
     

Added 0.2 to index 42 for word vécues. 

Added 0.2 to index 43 for word Ça 

Added 0.2 to index 44 for word représentait 

Added 0.2 to index 45 for word un 

Added 0.2 to index 46 for word serpent 

Added 0.2 to index 47 for word boa 

Added 0.2 to index 48 for word qui 

Added 0.2 to index 49 for word avalait 

Added 0.2 to index 50 for word un 

Added 0.2 to index 51 for word fauve. 

Added 0.2 to index 52 for word Voilà 

Added 0.2 to index 53 for word la 

Added 0.2 to index 54 for word copie 

Added 0.2 to index 55 for word du 

Added 0.2 to index 56 for word dessin. 

Added 0.2 to index 57 for word On 

Added 0.2 to index 58 for word disait 

Added 0.2 to index 59 for word dans 

Added 0.2 to index 60 for word le 

Added 0.2 to index 61 for word livre: 

Added 0.2 to index 62 for word "les 

Added 0.2 to index 63 for word serpents 

Added 0.2 to index 64 for word boas 

Added 0.2 to index 65 for word avalent 

Added 0.2 to index 66 for word leur 

Added 0.2 to index 67 for word 

In [125]:
df_sentence_end

Unnamed: 0,word,onset,duration
0,Il,0.7,0.3
1,y,1.05,0.3
2,"avait,",1.4,0.3
3,à,1.75,0.3
4,côté,2.1,0.3
...,...,...,...
1646,que,610.2,0.3
1647,ça,610.55,0.3
1648,a,610.9,0.3
1649,tellement,611.25,0.3
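
The timing columns and the end-of-sentence delay can also be computed without an explicit loop. The sketch below is a minimal illustration and makes one assumption: the extra 0.2 s pause should fall after each sentence-final word, so only the words that follow it are shifted. The helper `add_sentence_end_delay` and the five-word sample are hypothetical.

```python
import numpy as np
import pandas as pd


def add_sentence_end_delay(df: pd.DataFrame, delay: float = 0.2) -> pd.DataFrame:
    """Hypothetical helper: delay every word that follows a sentence-final word."""
    out = df.copy()
    # True for words containing a literal period
    ends = out["word"].astype(str).str.contains(".", regex=False)
    # Number of sentence-final words strictly before each row
    n_before = ends.cumsum() - ends.astype(int)
    out["onset"] = out["onset"] + delay * n_before
    return out


words = ["Il", "dort.", "Elle", "lit.", "Fin"]
df = pd.DataFrame({"word": words})
df["onset"] = 0.7 + 0.35 * np.arange(len(df))  # one word every 0.35 s
df["duration"] = 0.3

shifted = add_sentence_end_delay(df)
print(shifted)

# Sanity check: no word starts before the previous one has ended
gaps = np.diff(shifted["onset"].to_numpy())
print((gaps >= shifted["duration"].to_numpy()[:-1]).all())
```

The cumulative sum replaces the nested loop: each row is shifted by the delay times the number of sentence ends seen before it, so the cost grows linearly with the number of words.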
