# Week 2
### Kasper Fyhn Jacobsen

## Question 1: Clark, chapter 1

In her first chapter, Clark introduces some of the issues that researchers in CLA work with, namely the complexity at different levels of language (phonology, morphosyntax, concepts, and the ability to make it all work in context) that children must master in order to become competent speakers. She emphasizes how language is intertwined with other aspects of a child's life, such as cognitive and social abilities. While introducing the different "levels" of language, she gives examples of how data on what children get right or wrong can indicate how they come to acquire rules, and how such observations inform accounts of the process of acquisition. Finally, she presents a short history of CLA research and declares her own stance in the debate: process-oriented, interaction-focused and, as far as possible, based on empirical data.

## Question 2: Pop quiz
1. What does "set()" do?

This function returns a set built from the iterable passed as argument (here a list), i.e. an unordered collection of items in which each item can occur only once.
2. What does the line "a[0] = a[0].lower()" do?

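It replaces the first item of the list, namely the string 'The', with a copy in which all characters are lowercase. The call "a[0].lower()" returns the new string, and the assignment stores it back in position 0 of the list.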
3. What does the line "types = len(set(a))" do?

It assigns to the variable types the number of types in the list a. This happens in the following sequence: 1) create a set from the list a, which removes duplicate items, 2) get the number of items in the set, 3) assign this value to the variable.
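
The three answers can be checked with a small sketch. The contents of the list a are an assumption made up for illustration; the quiz itself only tells us that its first item is the string 'The'.

```python
# Hypothetical example list; only the first item 'The' is given in the quiz.
a = ['The', 'cat', 'saw', 'the', 'other', 'cat']

# 1. set() keeps each distinct item once; 'The' and 'the' still count as different.
unique_before = set(a)

# 2. Replace the first item with its lowercase version, in place.
a[0] = a[0].lower()

# 3. Count the types: the number of distinct items after lowercasing.
types = len(set(a))

print(len(unique_before), a[0], types)
```

Lowercasing the first item merges 'The' with 'the', so the number of types drops by one.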

## Question 3


Like last week, I figured I would try to follow the assignment but add some extra things, mostly just to play around with some coding.
I set out to make something that people on stackoverflow.com actually say one really should not do: a URL reader which cleans the text of scripts and HTML tags. I did it anyway, mostly because it is just for practice at this point. Finally, some simple calculations are made and reported. The code ended up being a bit longer than I expected, so just have a glimpse at it; it would be unfair to ask you to review all of it.

The full source code can be seen [here](https://github.com/KasperFyhn/Playing-around/blob/master/src/url_reader.py) on GitHub.

P.S.: You might have to follow [these simple instructions](https://stackoverflow.com/questions/42098126/mac-osx-python-ssl-sslerror-ssl-certificate-verify-failed-certificate-verify) to access https pages through Python; I got some errors on my Mac, though not on my Windows desktop. It has something to do with OpenSSL and certificates.

### Retrieving text from a URL
The first part was defining a function which prompts the user for a URL and handles some of the errors that can very easily occur when working with user input, and even more so when accessing URLs. This makes up the first lines of code:

In [1]:
import urllib.request as URL

def URL_to_text():
    """Prompt the user for a URL and return the raw text retrieved from the URL.
    If any error occurs, return -1."""
   
    # get a URL from the user and try to open it
    try:
        url = input('Please, type in/paste a URL and hit Enter.\n')
        url = URL.urlopen(url).read()
        url = url.decode()
        return url
    # in case of an error, report it and return -1
    except (URL.URLError, URL.HTTPError, ValueError) as e:
        print("A problem was encountered. Please, check the URL.")
        print('Error message:', e)
        return -1
    
# keep prompting the user for a URL until it has been properly retrieved
text = -1
while text == -1:
    text = URL_to_text()

Please, type in/paste a URL and hit Enter.
https://www.ordbogen.com
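
As a side note, the fetching step can also be written so that it takes the URL as an argument instead of prompting, which makes it easier to try out. The following is only a sketch under some assumptions: the name `fetch_text` is hypothetical, it returns `None` rather than -1 on failure, and it decodes with the charset announced by the server, falling back to UTF-8. The `data:` URL in the usage lines is just a way to try the function without a network connection.

```python
import urllib.error
import urllib.request


def fetch_text(url):
    """Return the decoded text behind url, or None if it cannot be retrieved.

    Hypothetical helper: a variant of URL_to_text() without the prompt.
    """
    try:
        with urllib.request.urlopen(url) as response:
            raw = response.read()
            # use the charset from the response headers if there is one
            charset = response.headers.get_content_charset() or 'utf-8'
        return raw.decode(charset)
    # URLError also covers HTTPError; ValueError covers malformed URLs
    # and undecodable bytes; LookupError covers unknown charset names
    except (urllib.error.URLError, ValueError, LookupError) as e:
        print('A problem was encountered. Please, check the URL.')
        print('Error message:', e)
        return None


# usage: a data: URL needs no network access
sample = fetch_text('data:text/plain;charset=utf-8,hello%20world')
broken = fetch_text('not a url')
```

The prompting loop would then become `while text is None: text = fetch_text(input(...))`.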


### Cleaning the text
The next part is cleaning the text. This is done with a combination of built-in functions and some regular expressions:

In [2]:
import re
from string import punctuation

# clean the raw source text
text = text.lower() # convert all characters to lower case
text = re.sub(r'<head>.*?</head>', '', text, flags=re.DOTALL) # remove the HTML head
text = re.sub(r'<script.*?</script>', '', text, flags=re.DOTALL) # remove JavaScript parts 
text = re.sub(r'<.*?>', ' ', text) # clean from other HTML tags
text = re.sub(r'(\\t)+', ' ', text) # clean from "spelled out" tabs
text = re.sub(r'(\\n)+', '\n', text) # clean from "spelled out" carriage returns
text = ''.join(c for c in text if c not in punctuation) # get rid of punctuation

# add to a tokens list the words that consist only of alphabetic chars
tokens = [w for w in text.split() if w.isalpha()]
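
For comparison, here is a minimal sketch of the approach usually recommended instead of regexes: letting the standard library's `html.parser` walk through the page. The class name `TextExtractor` and the function `html_to_tokens` are made up for this illustration, and the sketch has not been run against the page used in this notebook. One possible benefit is that the parser converts entities such as `&nbsp;` to real characters before the punctuation is stripped.

```python
from html.parser import HTMLParser
from string import punctuation


class TextExtractor(HTMLParser):
    """Collect the text of a page, skipping head, script and style content."""

    SKIPPED = {'head', 'script', 'style'}

    def __init__(self):
        super().__init__()  # entities are converted to characters by default
        self.parts = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_tokens(source):
    """Return the lowercase alphabetic words found in an HTML string."""
    parser = TextExtractor()
    parser.feed(source)
    parser.close()
    text = ' '.join(parser.parts).lower()
    text = ''.join(c for c in text if c not in punctuation)
    return [w for w in text.split() if w.isalpha()]


# a made-up snippet, not taken from the real page
page = ('<html><head><title>Skip me</title></head><body>'
        '<script>var x = 1;</script>'
        '<p>Åbent&nbsp;&nbsp;mandag, tirsdag!</p></body></html>')
words = html_to_tokens(page)
```

Because `&nbsp;` becomes a non-breaking space, which `split()` treats as whitespace, the words on either side of it are kept apart instead of being glued together with "nbsp".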

With web pages being fairly complex nowadays, the regexes will not always clean properly, so I also wanted to give the user an opportunity to manually clean out some unwanted words. For this, I wrote a somewhat longer function which essentially builds a "kill list" chosen by the user. Then, using another function, each occurrence of the unwanted words in the tokens list is removed.

In [3]:
def remove_all(iterable, val):
    """Remove all occurrences of the passed value from the passed iterable
    and return it as a list"""
    
    return [x for x in iterable if x != val]

def manually_clean(tokens):
    """Give the user an opportunity to clean out undesired words manually and
    return a cleaned list of words"""
    
    # ask the user for help to clean script "leftovers"
    print('This is a run-through of manual cleaning of the text.\n'
          'WARNING: This will generate a long list of words.')
    # get all raw types and sort them from longest to shortest
    raw_types = sorted(set(tokens), key=len, reverse=True)
    # print each type together with its index, from longest to shortest
    for word in enumerate(raw_types):
        print(word, end='  ')
    # ask the user to give the indices of the words that are to be "killed"
    print('\nPlease type in with numbers the words that should be deleted: "x-y" for ranges'
          ' of words (both ends included), "x" for a single word, separate with ",".'
          '\nExample: 0-5,9,15')
    kill_indices = input()
    # parse the indices and add the words to a "kill list"
    kill_indices = kill_indices.split(sep=',') # split the input into a list
    kill_list = []
    for index in kill_indices: # for each item in the list
        index = index.strip() # tolerate spaces around the item
        try:
            if '-' in index: # if a range is given ...
                rng = index.split(sep='-')
                lo = int(rng[0])
                hi = int(rng[1])
                for i in range(lo, hi + 1): # ... add all words in the range, hi included
                    kill_list.append(raw_types[i])
            elif index.isdigit(): # if just an index is given ...
                i = int(index)
                kill_list.append(raw_types[i]) #... just add the word
            else:
                print('Unable to parse:', index) # report in case of a nonsensical item
        except (ValueError, IndexError):
            print('Unable to parse:', index) # try to continue if an item causes an error
    # when the kill list is completed, remove all occurrences of words in the kill list
    # from var tokens
    for word in kill_list:
        tokens = remove_all(tokens, word)
    # return the cleaned list
    return tokens

# ask the user if s/he would like to clean the rest of the text manually
if input('Would you like to check the types and clean out manually?  y/n') == 'y':
    tokens = manually_clean(tokens)

Would you like to check the types and clean out manually?  y/ny
This is a run-through of manual cleaning of the text.
(0, 'internetforbindelsenbspnbsp')  (1, 'brugerhenvendelsernbspnbsp')  (2, 'retskrivningsordbogen')  (3, 'fortrolighedspolitik')  (4, 'feedbackordbogencom')  (5, 'regnskabsordbøgerne')  (6, 'driftsforstyrrelser')  (7, 'uregelmæssigheder')  (8, 'betydningsordbog')  (9, 'sprogmedarbejder')  (10, 'ordbogenbrugere')  (11, 'specialordbøger')  (12, 'internationale')  (13, 'ejendomsordbog')  (14, 'ordbogsprogram')  (15, 'internetvindue')  (16, 'almensproglige')  (17, 'specialiserede')  (18, 'mudderkastning')  (19, 'afgangsprøver')  (20, 'oversættelser')  (21, 'åbentnbspnbsp')  (22, 'mandagtorsdag')  (23, 'musikbegreber')  (24, 'fremmedordbog')  (25, 'synonymordbog')  (26, 'musikordbogen')  (27, 'betaversionen')  (28, 'offlineadgang')  (29, 'handelsvilkår')  (30, 'emailnbspnbsp')  (31, 'cookiepolitik')  (32, 'skriveordbog')  (33, 'datagrundlag')  (34, 'sprogcentret')  (35, 'reg
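
The index parsing is the fiddliest part of the function, so here it is sketched on its own, separated from the prompting so that it can be tried without typing anything in. The name `parse_indices` and the `size` parameter are assumptions for this illustration, and the sketch treats "x-y" as including both x and y and silently ignores indices outside the list.

```python
def parse_indices(spec, size):
    """Turn a string like '0-5,9,15' into a sorted list of valid indices.

    Hypothetical helper: ranges include both ends, indices outside
    0..size-1 are ignored, and unparseable items are reported and skipped.
    """
    chosen = set()
    for item in spec.split(','):
        item = item.strip()
        try:
            if '-' in item:
                # a range such as "3-7"
                lo, hi = (int(part) for part in item.split('-', 1))
            else:
                # a single index such as "9"
                lo = hi = int(item)
        except ValueError:
            print('Unable to parse:', item)
            continue
        chosen.update(i for i in range(lo, hi + 1) if 0 <= i < size)
    return sorted(chosen)


# usage with a made-up list of ten types
picked = parse_indices('0-2, 5, x, 99', 10)
```

With the indices in hand, the kill list is simply `[raw_types[i] for i in picked]`.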

### Reporting results
Finally, the program does some simple calculations similar to the ones in Notebooks 2 and 3.

In [4]:
from collections import Counter

# make a set of types       
types = set(tokens)

# calculate the frequency of each type with a Counter object
freqs = Counter(tokens)

# calculate type-to-token ratio
ttr = len(types)/len(tokens)

# report results
print('\nTokens:', len(tokens))
print('Types:', len(types))
print('Type-to-token ratio:', ttr)
print('The 10 most frequent words:')
print(freqs.most_common(10))

# prompt the user to choose if the types should be printed
if input('\nDo you want to see the types? y/n   ') == 'y':
    print(types)


Tokens: 702
Types: 295
Type-to-token ratio: 0.4202279202279202
The 10 most frequent words:
[('du', 29), ('at', 16), ('og', 14), ('dansk', 13), ('eller', 12), ('til', 12), ('bruger', 11), ('har', 11), ('for', 10), ('er', 10)]

Do you want to see the types? y/n   y
{'betydningsordbog', 'afgangsprøver', 'hedder', 'derfor', 'opret', 'forekomme', 'online', 'adgangskode', 'nysgerrige', 'den', 'brugernavn', 'redaktionen', 'oversættelser', 'husk', 'skriveordbog', 'dansk', 'ordbøger', 'direkte', 'privat', 'klem', 'ordbogenbrugere', 'regnskaber', 'som', 'hjælp', 'dagtimerne', 'vidste', 'købe', 'have', 'åbentnbspnbsp', 'sprogmedarbejder', 'supporten', 'oversat', 'søg', 'besvare', 'skal', 'brugerhenvendelsernbspnbsp', 'uddannelse', 'biblioteker', 'over', 'blive', 'kommentar', 'jeg', 'dette', 'hvis', 'kun', 'erhverv', 'plejer', 'mandagtorsdag', 'klik', 'musikbegreber', 'tysk', 'sproglige', 'slå', 'feedbackordbogencom', 'speciel', 'har', 'support', 'gratis', 'fremmedordbog', 'læse', 'ros', 'fransk'
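
The calculations above can also be gathered in one small function, sketched here on a made-up token list. The name `summarise` is hypothetical, and the guard against an empty list is an addition: it matters if the cleaning happens to remove every token, in which case the division in the cell above would fail.

```python
from collections import Counter


def summarise(tokens, top=10):
    """Return token count, type count, type-to-token ratio and top words."""
    freqs = Counter(tokens)
    n_tokens = len(tokens)
    n_types = len(freqs)  # one Counter key per type
    # avoid dividing by zero when there are no tokens at all
    ttr = n_types / n_tokens if n_tokens else 0.0
    return {'tokens': n_tokens, 'types': n_types, 'ttr': ttr,
            'most_common': freqs.most_common(top)}


# a made-up token list: 5 tokens, 3 types
report = summarise(['du', 'har', 'du', 'at', 'du'])
print(report)
```

On this toy list the ratio is 3 types over 5 tokens; the same division applied to the run above is 295 / 702, which gives the reported 0.42.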