# Notebook 5: for-loops
#### Kasper Fyhn Jacobsen

## Summary of Hoff's and Hart and Risley's findings
Both the article by Hoff and the article by Hart and Risley find quantitative evidence for socioeconomic status being a cause for slower growth in vocabulary development in children in their first years of learning, which in turn can have an effect on IQ and language skills later in life. Hoff does this by comparing two points in time ten weeks apart and narrowing down correlational effects to SES and vocabulary growth. Hart and Risley do it with a longitudinal study spanning over two and a half years which show the differing trajectories of vocabulary growth between different families of different SES’s and substantial gap in cumulative stimulus measured in words.

## Quiz
### Question 1: Adapting the line
Let's make it short, yet understandable, and generalizable:

In [1]:
from string import punctuation as punct

# get a sentence
sent_orig = input('Please input a sentence and hit enter: ')
# clean it and keep the original for later
sent = sent_orig.lower()
sent = ''.join(c for c in sent if c not in punct)
# tokenize and calculate mean word length
tokens = sent.split()
words = len(tokens)
chars = sum(len(word) for word in tokens)
mean_word_len = chars / words
# report results
print(f'The mean length of words in the sentence "{sent_orig}" is {mean_word_len}.')

Please input a sentence and hit enter: The farmer killed the duckling.
The mean length of words in the sentence "The farmer killed the duckling." is 5.2.


### Question 2 and 3: Doing calculations for many texts
As you might have noticed, I am a fan of generalizable (and thereby reusable) code, i.e. code that is made in such a way that it can work for any similar problems.

Of course, this quickly leads to the idea of object-oriented programming. So, in this case, we can make a class which is in some way a representation of our text with some attributes and methods. Then, ideally, as we keep on working with texts like this, the class can be extended and improved without having to change all the scripts that use it.

Another good thing about making a class like this is that it makes for a nice and easy-to-handle data structure; all info tied to a specific text is stored with the text in little pack, i.e. an object.

So, following some the things that we have done with texts so far, we can make a class like this:

In [3]:
from string import punctuation as punct
from collections import Counter

class TextStats:
    '''This class holds a text and can return different stats about it.'''
  
    def __init__(self, filepath):
        try:
            file = open(filepath, 'r', encoding='utf-8')
            # save name, raw text and cleaned text
            self.name = file.name
            text = file.read()
            self.raw_text = text
            self.clean_text = ''.join(c for c in text if c not in punct)
            self.clean_text = self.clean_text.lower()
            # make tokens and types lists
            self.tokens = self.clean_text.split()
            self.types = set(self.tokens)
        except IOError as e:
            print('An error occured when loading: ' + filepath)
            print('Error message:', e)

    def ttr(self):
        '''Return type-to-token ratio'''
        
        return len(self.types) / len(self.tokens)
    
    def mean_word_length(self):
        '''Return mean word length, only counting alpha tokens.'''
          
        tokens = [word for word in self.tokens if word.isalpha()]
        chars = sum(len(word) for word in tokens)
        words = len(tokens)
        return chars / words
    
    def word_freqs(self, n=None):
        '''Return a list of n tuples (all if n is not passed with a specific
        number) of types and their frequencies, i.e. number of tokens. The list
        is sorted from the most frequent and decreasing.'''
        
        freqs = Counter(self.tokens)
        if n > len(self.tokens): # if n is larger than the number of tokens
            return freqs.most_common() # return list of all elements
        else:
            return freqs.most_common(n)
    
    def print_stats(self):
        '''Print a stat summary for the text including number of tokens and
        types, type-to-token ratio, mean word length and ...'''
        
        spc = 7
        print(f'Stats for {self.name}')
        print(f'Tokens:\t\t{len(self.tokens):{spc}}')
        print(f'Types:\t\t{len(self.types):{spc}}')
        print(f'Type-to-token:\t{self.ttr():{spc}.3}')
        print(f'Mean word lgth:\t{self.mean_word_length():{spc}.3}')

Now, we have a class into which we can load all texts in a given directory, e.g. all the Austen texts. This script report both the mean word length of all novels as well as the average of the mean word lengths (which are different things as far as I know!).

In [5]:
import os

# set the path to the directory
path = input('Please, paste in the absolute path to the folder of texts: ')
os.chdir(path)

# make a list of TextStats object, one for each text
tstats = [TextStats(file) for file in os.listdir()]

# print stats for each text
for text in tstats:
    text.print_stats()
    print()

# calculate and report mean word length of all novels
tokens_all = [word for word in text.tokens
              for text in tstats
              if word.isalpha()]
chars = sum(len(word) for word in tokens_all)
words = len(tokens_all)
mean_all = chars/words
print(f'Mean word length of all: {mean_all:.3}')

# calculate and report "the grand average", i.e. average of all mean values
avs = [text.mean_word_length() for text in tstats]
grand_av = sum(avs) / len(avs)
print(f'Average of mean word lengths: {grand_av:.3}')

Please, paste in the absolute path to the folder of texts: C:\Users\Kasper Fyhn Jacobsen\Dropbox\Child Language Acquisition\Jane-Austen
Stats for Austen-Emma.txt
Tokens:		 157440
Types:		  11054
Type-to-token:	 0.0702
Mean word lgth:	   4.33

Stats for Austen-Mansfield.txt
Tokens:		 159540
Types:		   9406
Type-to-token:	  0.059
Mean word lgth:	   4.32

Stats for Austen-Northanger.txt
Tokens:		  77070
Types:		   7223
Type-to-token:	 0.0937
Mean word lgth:	   4.39

Stats for Austen-Persuasion.txt
Tokens:		  83281
Types:		   6004
Type-to-token:	 0.0721
Mean word lgth:	   4.38

Stats for Austen-Pride.txt
Tokens:		 121533
Types:		   7818
Type-to-token:	 0.0643
Mean word lgth:	    4.4

Stats for Austen-Sense.txt
Tokens:		 118563
Types:		   7417
Type-to-token:	 0.0626
Mean word lgth:	   4.43

Mean word length of all: 4.43
Average of mean word lengths: 4.38
