# Tutorial 

## Part-I

In this live session, we will **(a)** read from text files, **(b)** store the data in an intermediate structure, **(c)** traverse chunks of the input, **(d)** eliminate trailing whitespace and stop words, and **(e)** write the processed data to output files. The target language is Turkish.


### * imports

In [None]:
# for reading file names in a directory
import glob

### ** globals

In [None]:
input_path = "./in/"
output_path = "./out/"

### (a-b) read input from files and store in a dictionary

In [None]:
def read_and_store_input(input_path):
    # initialize dictionary
    data = {}
    # read file names from the input directory
    input_files = glob.glob(input_path+'mtdb/*.txt')
    for file_path in input_files:
        # create an empty list keyed by the name of the current file
        file_name = file_path.split('/')[-1]
        data[file_name] = []
        # open the file in a with block so that it is closed after reading
        with open(file_path, 'r') as content:
            # populate the list with content from file
            for line in content:
                # keep only lines that have content
                if line.strip() != "":
                    data[file_name].append(line)
    return data

data = read_and_store_input(input_path)
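
To make the shape of the resulting dictionary concrete, here is a minimal, self-contained sketch that runs the same idea on a throwaway directory. The helper name `read_non_empty_lines`, the file `sample.txt`, and the UTF-8 encoding are assumptions made for illustration; they are not part of the session's data.

```python
import glob
import os
import tempfile

def read_non_empty_lines(directory):
    """Map each .txt file name in `directory` to its list of non-blank lines."""
    data = {}
    for file_path in sorted(glob.glob(os.path.join(directory, '*.txt'))):
        # assumption: the files are UTF-8 encoded
        with open(file_path, 'r', encoding='utf-8') as content:
            data[os.path.basename(file_path)] = [
                line for line in content if line.strip() != ""
            ]
    return data

# build a throwaway directory with one hypothetical sample file
with tempfile.TemporaryDirectory() as tmp_dir:
    with open(os.path.join(tmp_dir, 'sample.txt'), 'w', encoding='utf-8') as f:
        f.write("ilk satır\n\n   \nikinci satır   \n")
    sample = read_non_empty_lines(tmp_dir)

# each key is a file name, each value is the list of lines that have content
print(sample)
```

Blank and whitespace-only lines are dropped, but the kept lines still carry their trailing whitespace and newline at this stage.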

### (c) traverse the content

In [None]:
# which files do we have content for?
for key in data:
    print(key)

In [None]:
# let's see the contents of a specific file, say '10370000.txt'
# note: output is limited to 20000 characters; only the last 20000 characters are displayed
for index, line in enumerate(data['10370000.txt']):
    print("LINE #%s: %s" %(index,line))

In [None]:
# too much whitespace, let's eliminate the excess
for index, line in enumerate(data['10370000.txt']):
    print("LINE #%s: %s" %(index,line.strip()))

In [None]:
# let's access a specific line in a specific file
# remember that lines with no content were excluded while reading the input files, so these indexes do not match the line numbers in the original files.
file_name = '10370000.txt'
line_index = 101
print("LINE #%s from FILE '%s': %s" %(line_index, file_name, data[file_name][line_index].strip()))
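
If the original line numbers matter later, one option is to store them next to the text while filtering. The sketch below is only an illustration of that idea; the function name `non_empty_lines_with_numbers` and the sample lines are hypothetical.

```python
def non_empty_lines_with_numbers(lines):
    """Return (original_line_number, text) pairs, skipping blank lines.

    Line numbers are 1-based, matching what a text editor would show.
    """
    return [(number, line.strip())
            for number, line in enumerate(lines, start=1)
            if line.strip() != ""]

# hypothetical file content: lines 2 and 3 are blank
raw_lines = ["birinci\n", "\n", "   \n", "dördüncü\n"]
numbered = non_empty_lines_with_numbers(raw_lines)

for list_index, (line_number, text) in enumerate(numbered):
    print("list index %s -> file line %s: %s" % (list_index, line_number, text))
```

Here the second stored entry sits at list index 1 but comes from line 4 of the file, which is exactly the mismatch described in the comment above.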

### (d) eliminate trailing whitespace and stop words

In [None]:
# read stop words from a file into a list
# taken from --> https://github.com/sgsinclair/trombone/blob/master/src/main/resources/org/voyanttools/trombone/keywords/stop.tr.turkish-lucene.txt
stop_words = []
with open(input_path+'stop_words', 'r') as stop_words_file:
    for line in stop_words_file:
        # skip comment lines and blank lines
        if line[0] != '#' and line.strip() != "":
            stop_words.append(line.strip())

# traverse the data
for file_name, content in data.items():
    for index, line in enumerate(content):
        # strip leading and trailing whitespace
        line = line.strip()
        # eliminate stop words
        new_line = ' '.join(word for word in line.split() if word not in stop_words)
        content[index] = new_line

# let's compare with the earlier output
for index, line in enumerate(data['10370000.txt']):
    print("LINE #%s: %s" %(index,line))
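
The comparison `word not in stop_words` is case-sensitive, so a capitalized word such as one at the start of a sentence is not removed. For Turkish, plain `str.lower()` also mishandles the dotted and dotless i (`'I'` should become `'ı'`, and `'İ'` should become `'i'`). The following is a rough sketch of a case-insensitive variant, assuming the stop words are stored in lowercase; `turkish_lower`, `remove_stop_words`, and the sample data are hypothetical names introduced here.

```python
def turkish_lower(text):
    """Lowercase `text` using the Turkish dotted/dotless i mapping."""
    # map the two problematic capitals first, then let str.lower() do the rest
    return text.replace('İ', 'i').replace('I', 'ı').lower()

def remove_stop_words(line, stop_words):
    """Drop every word whose Turkish-lowercased form is in `stop_words`."""
    return ' '.join(word for word in line.split()
                    if turkish_lower(word) not in stop_words)

# hypothetical stop-word set and sample sentence
sample_stop_words = {'için', 've', 'bir'}
cleaned = remove_stop_words("  İçin Ve kitap BİR kalem  ", sample_stop_words)
print(cleaned)
```

Using a set instead of a list also makes each membership test faster. Words with attached punctuation (for example `ve,`) would still slip through; handling those needs a tokenization step that is outside the scope of this sketch.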


### (e) export the preprocessed data

In [None]:
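# the output directory must already exist; open() does not create it
for file_name, content in data.items():
    # the with block closes each file even if a write fails
    with open(output_path+file_name, 'w') as output_file:
        for line in content:
            output_file.write(line+'\n')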
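
The export step fails if the output directory does not exist. A possible variant, shown here as a sketch rather than as the session's method, creates the directory first and writes with an explicit encoding. The function name `export_data`, the sample data, and the UTF-8 encoding are assumptions.

```python
import os
import tempfile

def export_data(data, output_dir):
    """Write each list of lines in `data` to a file of the same name in `output_dir`."""
    # create the directory if it is missing; do nothing if it already exists
    os.makedirs(output_dir, exist_ok=True)
    for file_name, content in data.items():
        # assumption: UTF-8 is the desired output encoding
        with open(os.path.join(output_dir, file_name), 'w', encoding='utf-8') as output_file:
            for line in content:
                output_file.write(line + '\n')

# hypothetical preprocessed data, written to a throwaway directory
sample_data = {'sample.txt': ['kitap kalem', 'ikinci satır']}
with tempfile.TemporaryDirectory() as tmp_dir:
    target_dir = os.path.join(tmp_dir, 'out')
    export_data(sample_data, target_dir)
    # read the file back to check what was written
    with open(os.path.join(target_dir, 'sample.txt'), encoding='utf-8') as f:
        written = f.read()
print(repr(written))
```

Reading the file back is a cheap way to confirm that every line ends with exactly one newline.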