# Notebook 4: reading files
### Kasper Fyhn Jacobsen

This is the simple version of an interactive program to get some stats on text files. The source code is [here](https://github.com/KasperFyhn/Playing-around/blob/master/src/simple_text_stats.py) on GitHub.

I hit an error when reading the contents of the Austen text, but only on my Windows desktop and not on my Mac. I haven't pinned down the exact cause, but it appears to be the character encoding: when no `encoding` is given, `open()` uses a platform-dependent default, which on Windows is typically a legacy code page rather than UTF-8. Hence the explicit `encoding='utf-8'` argument in the call to `open()`.
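
To illustrate the kind of failure described above, here is a minimal sketch. It is not a reconstruction of the actual error: the sample sentence, the helper name `read_with` and the choice of `cp1252` as the "wrong" encoding are all assumptions made for the example. It writes a short UTF-8 file containing curly quotes and then reads it back with two different encodings.

```python
import locale
import os
import tempfile

def read_with(path, encoding):
    """Read a whole file with an explicit encoding and close it afterwards."""
    with open(path, encoding=encoding) as f:
        return f.read()

# Hypothetical sample text with curly quotes, stored as UTF-8 bytes.
sample = 'She said \u201cyes\u201d.'
with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as tmp:
    tmp.write(sample.encode('utf-8'))
    path = tmp.name

# The encoding open() falls back to when none is given (platform-dependent).
print('Default encoding here:', locale.getpreferredencoding(False))

# Reading with the encoding the file was written in works.
utf8_text = read_with(path, 'utf-8')

# Reading the same bytes as cp1252 (assumed here as a typical Windows
# default) fails, because one byte of the closing quote is undefined there.
try:
    read_with(path, 'cp1252')
    cp1252_failed = False
except UnicodeDecodeError as e:
    cp1252_failed = True
    print('cp1252 could not decode the file:', e)

os.remove(path)
```

Passing `encoding` explicitly makes the result independent of the machine the notebook runs on.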

The actual functionality boils down to the following two functions.

In [2]:
from string import punctuation
from collections import Counter
import os

def file_to_text(file):
    """Try to open a .txt or .rtf file and return the raw text. If an
    error occurs, return -1."""

    # try to open the file at the given pathname and read it as UTF-8;
    # the with-block makes sure the file is closed again
    try:
        with open(file, encoding='utf-8') as f:
            return f.read()
    # in case of a file or decoding error, report it and return -1
    except (OSError, UnicodeDecodeError) as e:
        print("A problem was encountered. Please, check the pathname and the file's encoding.")
        print('Error message:', e)
        return -1

def text_stats(text):
    """Prints number of tokens and types, type-to-token ratio and
    the top 10 most frequent words of a given text"""
    
    # clean the raw text
    text = ''.join(c for c in text if c not in punctuation) # get rid of punctuation
    text = text.lower() # convert all characters to lower case    
    # add to a tokens list the words that consist only of alphabetic chars
    tokens = [w for w in text.split() if w.isalpha()]   
    # make a set of types
    types = set(tokens)   
    # calculate the frequency of each type with a Counter object
    freqs = Counter(tokens)    
    # calculate type-to-token ratio (0 if the text contains no tokens)
    ttr = len(types)/len(tokens) if tokens else 0
    # report results
    print('Tokens:', len(tokens))
    print('Types:', len(types))
    print('Type-to-token ratio:', ttr)
    print('The 10 most frequent words:')
    print(freqs.most_common(10))
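
As a quick check of the arithmetic, here is a small sketch that follows the same cleaning steps but returns the numbers instead of printing them. The function name `compute_stats` and the sample sentence are assumptions made for this example and are not part of the original program.

```python
from collections import Counter
from string import punctuation

def compute_stats(text):
    """Return (token count, type count, type-to-token ratio, frequencies)."""
    # strip ASCII punctuation and lower-case, as in text_stats
    cleaned = ''.join(c for c in text if c not in punctuation).lower()
    # keep only purely alphabetic words
    tokens = [w for w in cleaned.split() if w.isalpha()]
    freqs = Counter(tokens)  # one key per type
    ttr = len(freqs) / len(tokens) if tokens else 0
    return len(tokens), len(freqs), ttr, freqs

# Hypothetical sample sentence: 7 tokens, of which "hello" occurs twice.
n_tokens, n_types, ttr, freqs = compute_stats('Hello, hello! This is a simple test.')
print(n_tokens, n_types, round(ttr, 3))
print(freqs.most_common(1))
```

With 7 tokens and 6 types, the ratio is 6/7, or roughly 0.857. Note that removing punctuation this way also merges contractions and hyphenated words, so "don't" is counted as "dont".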

Rather than having to change the code for every new text, I wrote a small loop that imitates a UNIX command line interface, though it only supports the commands listed in the print statements.

I'm not sure if there's a prettier way to do this. In Java, for example, the branching would be done with a `switch` statement rather than a chain of `elif` branches, which is easier to read, but the difference is mainly aesthetic.
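
One common alternative to a long `elif` chain in Python is a dictionary that maps each command word to a handler function. The sketch below shows the idea for two of the commands; the names `cmd_cd`, `cmd_ls`, `COMMANDS` and `dispatch` are assumptions made for this example, not part of the program in this notebook.

```python
import os

def cmd_cd(args):
    """Change the working directory; join args so paths may contain spaces."""
    os.chdir(' '.join(args))

def cmd_ls(args):
    """Print the names of the entries in the current directory."""
    for entry in os.scandir():
        print(entry.name)

# Map each command word to the function that handles it.
COMMANDS = {'cd': cmd_cd, 'ls': cmd_ls}

def dispatch(line):
    """Run one command line; return False if it is empty or unknown."""
    parts = line.split()
    if not parts:
        return False
    handler = COMMANDS.get(parts[0])
    if handler is None:
        print('Command not recognized')
        return False
    handler(parts[1:])
    return True

dispatch('ls')          # lists the current directory
dispatch('frobnicate')  # prints "Command not recognized"
```

Adding a command then means adding one function and one dictionary entry, without touching the loop. Whether this reads better than `elif` is largely a matter of taste for a handful of commands.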

In [3]:
# have the user find a text file by navigating through the file system
# as in a UNIX command line interface
print('Please, navigate to your text file by using the commands:')
print('Change directory: cd <pathname> (".." to go up)')
print('List entries in directory: ls')
print('Load file (from current dir or abs path): load <filename>')
print('Quit: quit')
# run infinite loop
while True:
    try:
        command = input(os.getcwd() + ': ').split() # get command
        if command[0] == 'cd': # change directory
            os.chdir(' '.join(command[1:])) # join in case of spaces
        elif command[0] == 'ls': # list items in dir
            for entry in os.scandir():
                print(entry.name)
        elif command[0] == 'load': # load file
            text_path = ' '.join(command[1:]) # join in case of spaces
            text = file_to_text(text_path)
            if text != -1:
                text_stats(text)
        elif command[0] == 'quit':
            break # break the loop and end the program
        else:
            print('Command not recognized')
    except Exception as e: # a bare except would also swallow Ctrl-C
        print('An error occurred in the command:', e)

Please, navigate to your text file by using the commands:
Change directory: cd <pathname> (".." to go up)
List entries in directory: ls
Load file (from current dir or abs path): load <filename>
Quit: quit
C:\Users\Kasper Fyhn Jacobsen\Dropbox\Child Language Acquisition\Notebooks: cd ..\..\..
C:\Users\Kasper Fyhn Jacobsen: cd Desktop
C:\Users\Kasper Fyhn Jacobsen\Desktop: ls
Austen-Pride.txt
desktop.ini
test.txt
C:\Users\Kasper Fyhn Jacobsen\Desktop: load test.txt
Tokens: 7
Types: 6
Type-to-token ratio: 0.8571428571428571
The 10 most frequent words:
[('hello', 2), ('this', 1), ('is', 1), ('a', 1), ('simple', 1), ('test', 1)]
C:\Users\Kasper Fyhn Jacobsen\Desktop: load Austen-Pride.txt
Tokens: 117256
Types: 6572
Type-to-token ratio: 0.05604830456437197
The 10 most frequent words:
[('the', 4306), ('to', 4109), ('of', 3587), ('and', 3434), ('her', 2190), ('a', 1926), ('in', 1847), ('was', 1834), ('i', 1750), ('she', 1682)]
C:\Users\Kasper Fyhn Jacobsen\Desktop: quit
