# Flow Control, Data Types and IO

Python is a small core language extended by a large collection of modules (packages). You can guess the meaning of many of the keywords; others are less obvious.

## The Zen of Python


    The Zen of Python, by Tim Peters

    Beautiful is better than ugly.
    Explicit is better than implicit.
    Simple is better than complex.
    Complex is better than complicated.
    Flat is better than nested.
    Sparse is better than dense.
    Readability counts.
    Special cases aren't special enough to break the rules.
    Although practicality beats purity.
    Errors should never pass silently.
    Unless explicitly silenced.
    In the face of ambiguity, refuse the temptation to guess.
    There should be one-- and preferably only one --obvious way to do it.
    Although that way may not be obvious at first unless you're Dutch.
    Now is better than never.
    Although never is often better than *right* now.
    If the implementation is hard to explain, it's a bad idea.
    If the implementation is easy to explain, it may be a good idea.
    Namespaces are one honking great idea -- let's do more of those!
    
    # Running this statement in an interpreter prints the text above:
    import this

## Python Language Keywords


### Logical
* ```False```    
* ```None```     
* ```True```     
* ```and```      
* ```is```
* ```not```      
* ```or```       
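
A small sketch of how the logical keywords behave; the variable names are purely illustrative. Note that `is` tests object identity while `==` tests equality, and that `and` and `or` short-circuit and return one of their operands.

```python
a = [1, 2, 3]
b = [1, 2, 3]

same_value = a == b     # True: the two lists hold equal contents
same_object = a is b    # False: they are two distinct list objects

result = None
is_missing = result is None     # the idiomatic test for None
label = result or "default"     # `or` returns the first truthy operand

# `and` only evaluates its right-hand side when the left-hand side is truthy.
both = (len(a) == 3) and (not is_missing)   # False, because is_missing is True
```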

### Importing 
* ```import```   
* ```from```     

### Conditions, looping and structure
* ```if```       
* ```else```     
* ```elif```     
* ```for```      
* ```break```
* ```continue```
* ```in```
* ```pass```
* ```return```
* ```while```
* ```with...as```
* ```yield```

### Error handling
* ```assert```
* ```except```   
* ```finally```  
* ```raise```    
* ```try```      
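
The looping and error-handling keywords are easiest to understand in combination. The following is a minimal sketch; `classify`, `safe_divide`, `parse_age` and `attempts` are hypothetical names invented for illustration.

```python
def classify(values):
    """Label numbers by sign, skipping non-numbers and stopping at None."""
    labels = []
    for value in values:
        if value is None:
            break               # leave the loop entirely
        if not isinstance(value, (int, float)):
            continue            # skip to the next item
        if value < 0:
            labels.append("negative")
        elif value == 0:
            labels.append("zero")
        else:
            labels.append("positive")
    return labels


attempts = []  # records every call to safe_divide


def safe_divide(x, y):
    """Return x / y, or None when y is zero."""
    try:
        return x / y
    except ZeroDivisionError:
        return None
    finally:
        attempts.append((x, y))   # runs whether or not an exception occurred


def parse_age(text):
    """Convert text to a non-negative integer, raising ValueError otherwise."""
    age = int(text)               # int() itself raises ValueError on bad input
    if age < 0:
        raise ValueError("age must be non-negative")
    return age


print(classify([3, "a", 0, -2, None, 5]))  # ['positive', 'zero', 'negative']
```

The `finally` clause runs even though both other branches of `safe_divide` return, which is why it is the usual place for cleanup code.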

### Function and class definition 
* ```def```      
* ```lambda```   
* ```class```    
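
A short sketch of the three definition keywords; `square`, `by_length` and `Tally` are illustrative names only.

```python
def square(x):
    """A named function defined with def."""
    return x * x


# lambda creates a small anonymous function from a single expression.
# Here it serves as the sort key.
by_length = sorted(["pear", "fig", "banana"], key=lambda word: len(word))


class Tally:
    """A tiny class holding a running count."""

    def __init__(self):
        self.count = 0

    def increment(self):
        self.count += 1
        return self.count


tally = Tally()
tally.increment()
tally.increment()
```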

### Other 
* ```del```      
* ```global```   
* ```nonlocal``` 
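
These three keywords all concern name binding. The sketch below uses hypothetical helper names (`make_counter`, `set_total`, `get_total`) to show each one.

```python
def make_counter():
    count = 0

    def step():
        nonlocal count      # rebind the variable in the enclosing function
        count += 1
        return count

    return step


def set_total(value):
    global total            # create or rebind the module-level name
    total = value


def get_total():
    return total            # reading a global needs no declaration


counter = make_counter()
first, second = counter(), counter()

set_total(10)

inventory = {"apples": 3, "pears": 5}
del inventory["apples"]     # del removes a key; it can also unbind a name
```

Without the `nonlocal` declaration, `count += 1` would raise `UnboundLocalError`, because assignment makes a name local to the function by default.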

In typical Python fashion, you can obtain a list of the keywords from the ```keyword``` module.

    import keyword
    keyword.kwlist
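
The same module can also test a single string. As a hedged sketch, the helper `is_valid_name` below (a name made up for this example) combines `keyword.iskeyword` with `str.isidentifier` to check whether a string could be used as a variable name.

```python
import keyword


def is_valid_name(name):
    """True if name is a legal identifier that is not a reserved keyword."""
    return name.isidentifier() and not keyword.iskeyword(name)


print(is_valid_name("total"))   # True
print(is_valid_name("class"))   # False: reserved keyword
print(is_valid_name("2fast"))   # False: identifiers cannot start with a digit
```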

## Python Built-In Functions

Python has surprisingly few built-in functions. Much is delegated to modules; for example, most mathematical functions live in the ```math``` module.
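
As a brief illustration of functionality living in a module rather than among the built-ins, the sketch below imports from the standard `math` module in both styles. The variable names are illustrative.

```python
import math              # access names as math.<name>
from math import sqrt    # or bring a single name into scope

hypotenuse = sqrt(3 ** 2 + 4 ** 2)    # 5.0
circle_area = math.pi * 2 ** 2
rounded_down = math.floor(2.7)        # 2

# A few numeric helpers are built in and need no import.
largest = max(3, 7, 5)                # 7
total = sum([1, 2, 3])                # 6
magnitude = abs(-4.5)                 # 4.5
```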

### Python special types 

* ```dict()```
* ```dir()```
* ```len()```
* ```list()```
* ```range()```
* ```reversed()```
* ```set()```
* ```slice()```
* ```sorted()```
* ```tuple()```
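
A quick tour of several of these functions, as a sketch with made-up data:

```python
numbers = list(range(2, 10, 3))            # [2, 5, 8]: start, stop, step
backwards = list(reversed(numbers))        # [8, 5, 2]
unique_letters = set("hello")              # {'h', 'e', 'l', 'o'}
ordered = sorted(unique_letters)           # ['e', 'h', 'l', 'o']
pair = tuple([1, 2])                       # (1, 2): an immutable sequence
ages = dict(ada=36, grace=85)              # {'ada': 36, 'grace': 85}

every_other = "abcdef"[slice(1, 5, 2)]     # same as "abcdef"[1:5:2], i.e. 'bd'

print(len(unique_letters))                 # 4
print("append" in dir(numbers))            # True: dir() lists attribute names
```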

### Types and type conversion 
* ```ascii()```
* ```bin()```
* ```bytearray()```
* ```bytes()```
* ```chr()```
* ```complex()```
* ```float()```
* ```format()```
* ```hex()```
* ```int()```
* ```oct()```
* ```ord()```
* ```str()```
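
The conversion functions are simplest to learn by trying them. A minimal sketch with illustrative values:

```python
whole = int("42")                 # 42
from_hex = int("ff", 16)          # 255: the second argument is the base
real = float("3.5")               # 3.5
text = str(10)                    # '10'

binary = bin(10)                  # '0b1010'
octal = oct(8)                    # '0o10'
hexadecimal = hex(255)            # '0xff'

code_point = ord("A")             # 65
character = chr(97)               # 'a'

two_places = format(3.14159, ".2f")   # '3.14'
z = complex(1, 2)                     # (1+2j)
raw = bytes("hi", "utf-8")            # b'hi'
```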

### Who knows? 
* ```enumerate()```
* ```getattr()```
* ```globals()```
* ```hasattr()```
* ```iter()```
* ```locals()```
* ```next()```
* ```open()```
* ```print()```
* ```repr()```
* ```setattr()```
* ```type()```
* ```zip()```
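
Several of these functions turn up constantly in everyday code. The sketch below uses made-up data and a temporary directory (so that nothing is left behind) to show iteration helpers, attribute lookup and basic file IO with `open()` and `print()`.

```python
import os
import tempfile

names = ["ada", "grace"]
years = [1815, 1906]

pairs = list(zip(names, years))               # [('ada', 1815), ('grace', 1906)]
numbered = list(enumerate(names, start=1))    # [(1, 'ada'), (2, 'grace')]

iterator = iter(names)
first = next(iterator)                        # 'ada'

has_upper = hasattr("abc", "upper")           # True
shouted = getattr("abc", "upper")()           # 'ABC': look up a method by name
kind = type(3.0)                              # <class 'float'>
quoted = repr("hi")                           # "'hi'": includes the quotes

# Write two lines to a file, then read them back.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "notes.txt")
    with open(path, "w", encoding="utf-8") as handle:
        print("first line", file=handle)
        print("second line", file=handle)
    with open(path, encoding="utf-8") as handle:
        lines = [line.rstrip("\n") for line in handle]
```

Opening files inside a `with` block closes them automatically, even if an exception is raised part-way through.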


## Modules

### Built-In Modules

Over 200 modules are part of the base Python distribution. Where a standard module exists, use it. Browse the documentation to understand the types of problems the modules solve, and look for a standard module whenever you feel you are doing something standard. Do not try to go through them all; wait until you have a problem to solve for motivation.

https://docs.python.org/3/py-modindex.html
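
As one hedged example of reaching for a standard module, counting words is a standard task, and `collections.Counter` already solves it. The sentence is made up, and the hand-written loop is shown only for comparison.

```python
from collections import Counter

words = "the cat and the hat and the bat".split()

# Hand-rolled counting with a plain dict.
manual = {}
for word in words:
    manual[word] = manual.get(word, 0) + 1

# The standard-library equivalent.
counts = Counter(words)

print(counts.most_common(1))   # [('the', 3)]
```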

| Module       | Description                                                                             |
|:--------------|:----------------------------------------------------------------------------------------|
| \_\_future\_\_    | Future statement definitions                                                            |
| \_\_main\_\_      | The environment where the top-level script is run.                                      |
| \_dummy\_thread | Drop-in replacement for the \_thread module.                                             |
| \_thread       | Low-level threading API.                                                                |
| abc                          | Abstract base classes according to PEP 3119.                                                                         |
| aifc                         | Read and write audio files in AIFF or AIFC format.                                                                   |
| argparse                     | Command-line option and argument parsing library.                                                                    |
| array                        | Space efficient arrays of uniformly typed numeric values.                                                            |
| ast                          | Abstract Syntax Tree classes and manipulation.                                                                       |
| asynchat                     | Support for asynchronous command/response protocols.                                                                 |
| asyncio                      | Asynchronous I/O.                                                                                                    |
| asyncore                     | A base class for developing asynchronous socket handling services.                                                   |
| atexit                       | Register and execute cleanup functions.                                                                              |
| audioop                      | Manipulate raw audio data.                                                                                           |
| base64                       | RFC 3548: Base16, Base32, Base64 Data Encodings; Base85 and Ascii85                                                  |
| bdb                          | Debugger framework.                                                                                                  |
| binascii                     | Tools for converting between binary and various ASCII-encoded binary representations.                                |
| binhex                       | Encode and decode files in binhex4 format.                                                                           |
| bisect                       | Array bisection algorithms for binary searching.                                                                     |
| builtins                     | The module that provides the built-in namespace.                                                                     |
| bz2                          | Interfaces for bzip2 compression and decompression.                                                                  |
| calendar                     | Functions for working with calendars, including some emulation of the Unix cal program.                              |
| cgi                          | Helpers for running Python scripts via the Common Gateway Interface.                                                 |
| cgitb                        | Configurable traceback handler for CGI scripts.                                                                      |
| chunk                        | Module to read IFF chunks.                                                                                           |
| cmath                        | Mathematical functions for complex numbers.                                                                          |
| cmd                          | Build line-oriented command interpreters.                                                                            |
| code                         | Facilities to implement read-eval-print loops.                                                                       |
| codecs                       | Encode and decode data and streams.                                                                                  |
| codeop                       | Compile (possibly incomplete) Python code.                                                                           |
| collections                  | Container datatypes                                                                                                  |
| colorsys                     | Conversion functions between RGB and other color systems.                                                            |
| compileall                   | Tools for byte-compiling all Python source files in a directory tree.                                                |
| concurrent                   | Execute computations concurrently using threads or processes.                                                        |
| configparser                 | Configuration file parser.                                                                                           |
| contextlib                   | Utilities for with-statement contexts.                                                                               |
| contextvars                  | Context Variables                                                                                                    |
| copy                         | Shallow and deep copy operations.                                                                                    |
| copyreg                      | Register pickle support functions.                                                                                   |
| cProfile                     |                                                                                                                      |
| crypt (Unix)                 | The crypt() function used to check Unix passwords.                                                                   |
| csv                          | Write and read tabular data to and from delimited files.                                                             |
| ctypes                       | A foreign function library for Python.                                                                               |
| dataclasses                  | Generate special methods on user-defined classes.                                                                    |
| datetime                     | Basic date and time types.                                                                                           |
| dbm                          | Interfaces to various Unix "database" formats.                                                                       |
| decimal                      | Implementation of the General Decimal Arithmetic Specification.                                                      |
| difflib                      | Helpers for computing differences between objects.                                                                   |
| dis                          | Disassembler for Python bytecode.                                                                                    |
| distutils                    | Support for building and installing Python modules into an existing Python installation.                             |
| doctest                      | Test pieces of code within docstrings.                                                                               |
| dummy_threading              | Drop-in replacement for the threading module.                                                                        |
| email                        | Package supporting the parsing, manipulating, and generating email messages.                                         |
| encodings                    |                                                                                                                      |
| ensurepip                    | Bootstrapping the "pip" installer into an existing Python installation or virtual environment.                       |
| enum                         | Implementation of an enumeration class.                                                                              |
| errno                        | Standard errno system symbols.                                                                                       |
| faulthandler                 | Dump the Python traceback.                                                                                           |
| fcntl (Unix)                 | The fcntl() and ioctl() system calls.                                                                                |
| filecmp                      | Compare files efficiently.                                                                                           |
| fileinput                    | Loop over standard input or a list of files.                                                                         |
| fnmatch                      | Unix shell style filename pattern matching.                                                                          |
| formatter                    | Deprecated: Generic output formatter and device interface.                                                           |
| fractions                    | Rational numbers.                                                                                                    |
| ftplib                       | FTP protocol client (requires sockets).                                                                              |
| functools                    | Higher-order functions and operations on callable objects.                                                           |
| gc                           | Interface to the cycle-detecting garbage collector.                                                                  |
| getopt                       | Portable parser for command line options; support both short and long option names.                                  |
| getpass                      | Portable reading of passwords and retrieval of the userid.                                                           |
| gettext                      | Multilingual internationalization services.                                                                          |
| glob                         | Unix shell style pathname pattern expansion.                                                                         |
| grp (Unix)                   | The group database (getgrnam() and friends).                                                                         |
| gzip                         | Interfaces for gzip compression and decompression using file objects.                                                |
| hashlib                      | Secure hash and message digest algorithms.                                                                           |
| heapq                        | Heap queue algorithm (a.k.a. priority queue).                                                                        |
| hmac                         | Keyed-Hashing for Message Authentication (HMAC) implementation                                                       |
| html                         | Helpers for manipulating HTML.                                                                                       |
| http                         | HTTP status codes and messages                                                                                       |
| imaplib                      | IMAP4 protocol client (requires sockets).                                                                            |
| imghdr                       | Determine the type of image contained in a file or byte stream.                                                      |
| imp                          | Deprecated: Access the implementation of the import statement.                                                       |
| importlib                    | The implementation of the import machinery.                                                                          |
| inspect                      | Extract information and source code from live objects.                                                               |
| io                           | Core tools for working with streams.                                                                                 |
| ipaddress                    | IPv4/IPv6 manipulation library.                                                                                      |
| itertools                    | Functions creating iterators for efficient looping.                                                                  |
| json                         | Encode and decode the JSON format.                                                                                   |
| keyword                      | Test whether a string is a keyword in Python.                                                                        |
| lib2to3                      | The 2to3 library                                                                                                     |
| linecache                    | This module provides random access to individual lines from text files.                                              |
| locale                       | Internationalization services.                                                                                       |
| logging                      | Flexible event logging system for applications.                                                                      |
| lzma                         | A Python wrapper for the liblzma compression library.                                                                |
| macpath                      | Mac OS 9 path manipulation functions.                                                                                |
| mailbox                      | Manipulate mailboxes in various formats                                                                              |
| mailcap                      | Mailcap file handling.                                                                                               |
| marshal                      | Convert Python objects to streams of bytes and back (with different constraints).                                    |
| math                         | Mathematical functions (sin() etc.).                                                                                 |
| mimetypes                    | Mapping of filename extensions to MIME types.                                                                        |
| mmap                         | Interface to memory-mapped files for Unix and Windows.                                                               |
| modulefinder                 | Find modules used by a script.                                                                                       |
| msilib (Windows)             | Creation of Microsoft Installer files, and CAB files.                                                                |
| msvcrt (Windows)             | Miscellaneous useful routines from the MS VC++ runtime.                                                              |
| multiprocessing              | Process-based parallelism.                                                                                           |
| netrc                        | Loading of .netrc files.                                                                                             |
| nis (Unix)                   | Interface to Sun's NIS (Yellow Pages) library.                                                                       |
| nntplib                      | NNTP protocol client (requires sockets).                                                                             |
| numbers                      | Numeric abstract base classes (Complex, Real, Integral, etc.).                                                       |
| operator                     | Functions corresponding to the standard operators.                                                                   |
| optparse                     | Deprecated: Command-line option parsing library.                                                                     |
| os                           | Miscellaneous operating system interfaces.                                                                           |
| ossaudiodev (Linux, FreeBSD) | Access to OSS-compatible audio devices.                                                                              |
| parser                       | Access parse trees for Python source code.                                                                           |
| pathlib                      | Object-oriented filesystem paths                                                                                     |
| pdb                          | The Python debugger for interactive interpreters.                                                                    |
| pickle                       | Convert Python objects to streams of bytes and back.                                                                 |
| pickletools                  | Contains extensive comments about the pickle protocols and pickle-machine opcodes, as well as some useful functions. |
| pipes (Unix)                 | A Python interface to Unix shell pipelines.                                                                          |
| pkgutil                      | Utilities for the import system.                                                                                     |
| platform                     | Retrieves as much platform identifying data as possible.                                                             |
| plistlib                     | Generate and parse Mac OS X plist files.                                                                             |
| poplib                       | POP3 protocol client (requires sockets).                                                                             |
| posix (Unix)                 | The most common POSIX system calls (normally used via module os).                                                    |
| pprint                       | Data pretty printer.                                                                                                 |
| profile                      | Python source profiler.                                                                                              |
| pstats                       | Statistics object for use with the profiler.                                                                         |
| pty (Linux)                  | Pseudo-Terminal Handling for Linux.                                                                                  |
| pwd (Unix)                   | The password database (getpwnam() and friends).                                                                      |
| py_compile                   | Generate byte-code files from Python source files.                                                                   |
| pyclbr                       | Supports information extraction for a Python class browser.                                                          |
| pydoc                        | Documentation generator and online help system.                                                                      |
| queue                        | A synchronized queue class.                                                                                          |
| quopri                       | Encode and decode files using the MIME quoted-printable encoding.                                                    |
| random                       | Generate pseudo-random numbers with various common distributions.                                                    |
| re                           | Regular expression operations.                                                                                       |
| readline (Unix)              | GNU readline support for Python.                                                                                     |
| reprlib                      | Alternate repr() implementation with size limits.                                                                    |
| resource (Unix)              | An interface to provide resource usage information on the current process.                                           |
| rlcompleter                  | Python identifier completion, suitable for the GNU readline library.                                                 |
| runpy                        | Locate and run Python modules without importing them first.                                                          |
| sched                        | General purpose event scheduler.                                                                                     |
| secrets                      | Generate secure random numbers for managing secrets.                                                                 |
| select                       | Wait for I/O completion on multiple streams.                                                                         |
| selectors                    | High-level I/O multiplexing.                                                                                         |
| shelve                       | Python object persistence.                                                                                           |
| shlex                        | Simple lexical analysis for Unix shell-like languages.                                                               |
| shutil                       | High-level file operations, including copying.                                                                       |
| signal                       | Set handlers for asynchronous events.                                                                                |
| site                         | Module responsible for site-specific configuration.                                                                  |
| smtpd                        | A SMTP server implementation in Python.                                                                              |
| smtplib                      | SMTP protocol client (requires sockets).                                                                             |
| sndhdr                       | Determine type of a sound file.                                                                                      |
| socket                       | Low-level networking interface.                                                                                      |
| socketserver                 | A framework for network servers.                                                                                     |
| spwd (Unix)                  | The shadow password database (getspnam() and friends).                                                               |
| sqlite3                      | A DB-API 2.0 implementation using SQLite 3.x.                                                                        |
| ssl                          | TLS/SSL wrapper for socket objects                                                                                   |
| stat                         | Utilities for interpreting the results of os.stat(), os.lstat() and os.fstat().                                      |
| statistics                   | Mathematical statistics functions                                                                                    |
| string                       | Common string operations.                                                                                            |
| stringprep                   | String preparation, as per RFC 3454                                                                                  |
| struct                       | Interpret bytes as packed binary data.                                                                               |
| subprocess                   | Subprocess management.                                                                                               |
| sunau                        | Provide an interface to the Sun AU sound format.                                                                     |
| symbol                       | Constants representing internal nodes of the parse tree.                                                             |
| symtable                     | Interface to the compiler's internal symbol tables.                                                                  |
| sys                          | Access system-specific parameters and functions.                                                                     |
| sysconfig                    | Python's configuration information                                                                                   |
| syslog (Unix)                | An interface to the Unix syslog library routines.                                                                    |
| tabnanny                     | Tool for detecting white space related problems in Python source files in a directory tree.                          |
| tarfile                      | Read and write tar-format archive files.                                                                             |
| telnetlib                    | Telnet client class.                                                                                                 |
| tempfile                     | Generate temporary files and directories.                                                                            |
| termios (Unix)               | POSIX style tty control.                                                                                             |
| test                         | Regression tests package containing the testing suite for Python.                                                    |
| textwrap                     | Text wrapping and filling                                                                                            |
| threading                    | Thread-based parallelism.                                                                                            |
| time                         | Time access and conversions.                                                                                         |
| timeit                       | Measure the execution time of small code snippets.                                                                   |
| tkinter                      | Interface to Tcl/Tk for graphical user interfaces                                                                    |
| token                        | Constants representing terminal nodes of the parse tree.                                                             |
| tokenize                     | Lexical scanner for Python source code.                                                                              |
| trace                        | Trace or track Python statement execution.                                                                           |
| traceback                    | Print or retrieve a stack traceback.                                                                                 |
| tracemalloc                  | Trace memory allocations.                                                                                            |
| tty (Unix)                   | Utility functions that perform common terminal control operations.                                                   |
| turtle                       | An educational framework for simple graphics applications                                                            |
| turtledemo                   | A viewer for example turtle scripts                                                                                  |
| types                        | Names for built-in types.                                                                                            |
| typing                       | Support for type hints (see PEP 484).                                                                                |
| unicodedata                  | Access the Unicode Database.                                                                                         |
| unittest                     | Unit testing framework for Python.                                                                                   |
| urllib                       | URL handling modules.                                                                                                |
| uu                           | Encode and decode files in uuencode format.                                                                          |
| uuid                         | UUID objects (universally unique identifiers) according to RFC 4122                                                  |
| venv                         | Creation of virtual environments.                                                                                    |
| warnings                     | Issue warning messages and control their disposition.                                                                |
| wave                         | Provide an interface to the WAV sound format.                                                                        |
| weakref                      | Support for weak references and weak dictionaries.                                                                   |
| webbrowser                   | Easy-to-use controller for Web browsers.                                                                             |
| winreg (Windows)             | Routines and objects for manipulating the Windows registry.                                                          |
| winsound (Windows)           | Access to the sound-playing machinery for Windows.                                                                   |
| wsgiref                      | WSGI Utilities and Reference Implementation.                                                                         |
| xdrlib                       | Encoders and decoders for the External Data Representation (XDR).                                                    |
| xml                          | Package containing XML processing modules                                                                            |
| xmlrpc                       | XMLRPC server and client modules.                                                                                    |
| zipapp                       | Manage executable Python zip archives                                                                                |
| zipfile                      | Read and write ZIP-format archive files.                                                                             |
| zipimport                    | Support for importing Python modules from ZIP archives.                                                              |
| zlib                         | Low-level interface to compression and decompression routines compatible with gzip                                   |

### First-Tier Modules 

Used all the time. Major modules, often with commercial support. Massive functionality. Definitive in their field. 

* numpy
* scipy
* scikit-learn
* matplotlib
* pandas

### Second Tier

Used often. Great functionality. Serious modules. Not quite definitive in their field. Highly subjective!

* seaborn
* requests
* bs4
* lxml
* IPython
* pathlib
* fuzzywuzzy

### Third Tier
* aggregate
* thousands of roll-your-own solutions on github

## My Module Use

Top 25 out of nearly 2000 import statements. Mostly built-in plus first tier.

| Package | n | pct |
|:--|:--|:--|
| sys | 103 | 0.051811 |
| os | 99 | 0.049799 |
| numpy | 94 | 0.047284 |
| logging | 79 | 0.039738 |
| re | 78 | 0.039235 |
| pandas | 72 | 0.036217 |
| matplotlib.pyplot | 62 | 0.031187 |
| io | 43 | 0.021630 |
| collections | 40 | 0.020121 |
| warnings | 35 | 0.017606 |
| scipy.stats | 34 | 0.017103 |
| time | 33 | 0.016600 |
| datetime | 32 | 0.016097 |
| hashlib | 26 | 0.013078 |
| IPython.core.display | 25 | 0.012575 |
| json | 25 | 0.012575 |
| itertools | 21 | 0.010563 |
| great | 18 | 0.009054 |
| codecs | 18 | 0.009054 |
| vispy | 18 | 0.009054 |

## Strings

In [None]:
x = 'Stephen Mildenhall'

In [None]:
x

In [None]:
x[0:7]

In [None]:
x[:7]

In [None]:
x[8:]

In [None]:
x[8:-4]

In [None]:
x[slice(8,-4)]

In [None]:
x[:]

In [None]:
x[::2]

In [None]:
x[::-1]

In [None]:
y = x + ", St. John's University"
y

In [None]:
type(x)

In [None]:
dir(x)

In [None]:
?y.upper

In [None]:
y.upper()
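
The slicing cells above all follow the same `[start:stop:step]` pattern. A minimal sketch that restates them as checks, using a variable called `name` (chosen here for illustration) in place of `x`:

```python
name = 'Stephen Mildenhall'

# a slice is [start:stop:step]; stop is excluded and omitted parts take defaults
first = name[0:7]
assert first == name[:7] == 'Stephen'

# negative indices count back from the end of the string
assert name[8:-4] == 'Milden'
assert name[-4:] == 'hall'

# the bracket notation is shorthand for a slice object
assert name[slice(8, -4)] == name[8:-4]

# a negative step walks backwards; slicing always returns a new string
backwards = name[::-1]
assert backwards == 'llahnedliM nehpetS'
print(first, len(name))
```

Because strings are immutable, none of these operations change `name` itself.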

# Exercise 

1. Extract every third element of the string y.

2. Extract every third element of the string y and return those characters in reverse order.

3. Look at ```dir(y)``` and use ```?y.xxx``` to figure out the purpose of another function on the list. Pick something starting with a-z, not __

## Lists

In [None]:
s = y.split()
s

In [None]:
type(s)

In [None]:
s[0]

In [None]:
s[-1]

In [None]:
# strings morph into lists
s2 = list(x)
print(s2)

In [None]:
m = [123]
m

In [None]:
m + [345]

In [None]:
m

In [None]:
# raises TypeError: can only concatenate list (not "int") to list
m + 345

In [None]:
m.append(346)
m

In [None]:
m = m + [234]
m

In [None]:
m = m + [435, 67]
m

In [None]:
m[4] = 'some'
m

In [None]:
m.append('thing')
m

In [None]:
# length of a list
len(m)

In [None]:
# alters m in place
m.pop(), m

In [None]:
m.pop(2), m

In [None]:
m.index('some'), m.index(346)

In [None]:
# raises ValueError: the string '346' is not in the list, only the integer 346
m.index('346')

In [None]:
# iterate over a list 
for i in m:
    print(i, type(i))
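
The cells above mix operations that build a new list (`+`) with operations that change a list in place (`append`, `pop`). A short sketch of the difference, with variable names (`base`, `alias`, `copy_of_base`) invented for illustration:

```python
base = [123]
combined = base + [345]      # + builds a new list; base is unchanged
assert base == [123] and combined == [123, 345]

base.append(346)             # append mutates base in place and returns None
assert base == [123, 346]

alias = base                 # a second name for the same list object
alias.append('some')
assert base == [123, 346, 'some']

copy_of_base = base[:]       # a full slice makes a shallow copy
copy_of_base.pop()           # removing from the copy leaves base alone
assert base == [123, 346, 'some'] and copy_of_base == [123, 346]
```

This is why `m + [345]` on its own left `m` unchanged, while `m.append(346)` did not need an assignment.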

# Exercise

If ```m = [10, 20, 30, 40]``` what is the result of

* ```m[2]```
* ```m.pop(2)```
* What is the value of ```m``` after the  pop operation?



## List Comprehensions

In [None]:
[i*i for i in range(10)]

In [None]:
[i*i for i in range(10) if i % 3 == 0]

In [None]:
[[i*j for i in range(5)] for j in range(5)]
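
A comprehension is a compact way of writing a loop that appends to a list. The sketch below spells out the filtered example as an explicit loop and checks that both forms agree; the names `squares`, `squares_loop` and `table` are illustrative.

```python
# comprehension form, as in the cell above
squares = [i * i for i in range(10) if i % 3 == 0]

# the same result spelled out as an explicit loop
squares_loop = []
for i in range(10):
    if i % 3 == 0:
        squares_loop.append(i * i)

assert squares == squares_loop == [0, 9, 36, 81]

# nested comprehension: the outer loop (j) builds rows, the inner loop (i) fills each row
table = [[i * j for i in range(5)] for j in range(5)]
assert table[2] == [0, 2, 4, 6, 8]
```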

# Exercise 

Write a function to return just the vowels in a string. It should be case insensitive. For example if ```fun``` is your function then: 

    fun('abracadabra')
        >>> ['a', 'a', 'a', 'a', 'a']
        
    fun('Casualty Actuarial Society')
        >>> ['a', 'u', 'a', 'u', 'a', 'i', 'a', 'o', 'i', 'e']
        
    fun('n vwls') 
        >>> []

What happens if you run ```fun(123)```?

## Dictionaries 

In [None]:
d = {'first': 'Stephen', 'middle': 'John'}

In [None]:
d

In [None]:
d['first']

In [None]:
d['last'] = 'Mildenhall'
d

In [None]:
d = dict(first='Stephen', middle='John', last='Mildenhall')
d

In [None]:
d['age'] = 55
d['hair'] = 'brown'
d['books'] = ['Thinking Fast and Slow', 'Girl on the Train', 'Probability']

In [None]:
d

In [None]:
# iterate over a dictionary 
for k, v in d.items():
    print(k, v)

In [None]:
# access keys and values
d.keys(), d.values()

In [None]:
# dictionary comprehensions
d = {i: i*i for i in range(10)}
d

In [None]:
d[5]
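
Looking up a key that is not in a dictionary raises `KeyError`. A brief sketch of the usual ways to check first or supply a default, using a separate dictionary called `person` so the `d` above is left alone:

```python
person = {'first': 'Stephen', 'middle': 'John'}
person['last'] = 'Mildenhall'          # assignment adds a new key

assert 'last' in person                # membership tests look at keys
assert person.get('age') is None       # get returns None instead of raising KeyError
assert person.get('age', 0) == 0       # or a default of your choosing

# dictionaries remember insertion order (guaranteed since Python 3.7)
assert list(person) == ['first', 'middle', 'last']

squares = {i: i * i for i in range(10)}
assert squares[5] == 25
```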

In [None]:
# In and Out are IPython's built-in input list and output dictionary...very handy 
In[-5:]
# type(Out), type(In)

In [None]:
for i in In[-5:]:
    print(i, '\n')

In [None]:
Out[127]
# Out[137]  # careful with what has output

## Tuples and Sets and Functions

In [None]:
t = (2, 4)

In [None]:
t

In [None]:
t[1]

In [None]:
# raises TypeError: tuples are immutable
t[1] = 5
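
Item assignment fails because tuples cannot be changed after they are created. A small sketch that catches the error and shows the usual workaround of building a new tuple; the names `pair`, `message` and `changed` are illustrative.

```python
pair = (2, 4)

try:
    pair[1] = 5                      # tuples do not support item assignment
except TypeError as err:
    message = str(err)

assert 'item assignment' in message

# build a new tuple instead of modifying the old one
changed = pair[:1] + (5,)            # note the trailing comma: (5,) is a one-element tuple
assert changed == (2, 5) and pair == (2, 4)

# tuples unpack neatly into separate names
a, b = pair
assert (a, b) == (2, 4)
```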

In [None]:
dir(t)

In [None]:
# just the public names (those not starting with an underscore)
print([i for i in dir(t) if i[0] != '_'])

In [None]:
# make a function of same for future use 
def wdid(ob):
    print([i for i in dir(ob) if i[0] != '_'])

In [None]:
# apply to list 
wdid(list)

In [None]:
# better function with documentation 
def wdid(ob):
    '''
    wdid(ob)
    
    What does it do? Prints the "normal" methods of an object.
    Arguments:
    ob:    object to query 
    '''
    print([i for i in dir(ob) if i[0] != '_'])

In [None]:
?wdid

In [None]:
# aka
help(wdid)

In [None]:
wdid(tuple)

In [None]:
wdid(dict)

In [None]:
wdid(wdid)

In [None]:
dir(wdid)

In [None]:
s = list('Stephen Mildenhall')
s

In [None]:
# unique elements in the list (sets are unordered; use sorted(set(s)) for a sorted list)
set(s)
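
A set keeps one copy of each element but makes no promise about order, so any apparent ordering in the output above is incidental. A minimal sketch, with the names `letters` and `unique` chosen for illustration:

```python
letters = list('Stephen Mildenhall')
unique = set(letters)

# duplicates collapse: 18 characters, 12 distinct ones (including the space)
assert len(letters) == 18
assert len(unique) == 12

# sorted() accepts any iterable and returns an ordered list
ordered = ''.join(sorted(unique))
assert ordered == ' MSadehilnpt'     # space and capitals sort before lower case

# membership tests on a set are fast
assert 'M' in unique and 'm' not in unique
```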

In [None]:
# count characters in a string with a dictionary 
d = dict()
for c in s:
    if c in d:
        d[c] += 1
    else:
        d[c] = 1
d

In [None]:
def counter(s):
    '''
    count elements of iterable s
    '''
    d = dict()
    for c in s:
        if c in d:
            d[c] += 1
        else:
            d[c] = 1
    return d

In [None]:
?counter

In [None]:
s = 'count characters in a string with a dictionary'
counter(s)

In [None]:
s.split()

In [None]:
# same function counts words 
counter(s.split())
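
The standard library already ships this pattern as `collections.Counter`, a dictionary subclass built for tallying. A short sketch on the same sentence, offered as an alternative to the hand-written `counter` rather than a replacement for the exercise of writing it:

```python
from collections import Counter

sentence = 'count characters in a string with a dictionary'

char_counts = Counter(sentence)            # tally characters
word_counts = Counter(sentence.split())    # tally words

assert char_counts['t'] == sentence.count('t')
assert word_counts['a'] == 2

# missing keys count as zero instead of raising KeyError
assert word_counts['zebra'] == 0

# most_common(n) returns the n largest tallies as (item, count) pairs
assert word_counts.most_common(1) == [('a', 2)]
```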

# Exercise

Write a function to return the number of different characters in a string.  It should be case sensitive. For example if ```fun``` is your function then: 

    fun('abracadabra')
        >>> 5
        
    fun('Casualty Actuarial Society')
        >>> 15
        
    fun('n vwls') 
        >>> 6

## Extra Credit

Don't count spaces. To test that a character ```i``` is not a space, use ```i != ' '```.

What happens if you run ```fun(123)```?

In [None]:
def fun(x):
    return len([i for i in set(x) if i != ' '])

# Let's do something interesting... 

## Word count for web pages
* Retrieve web page
* Extract text
* Break into words
* Count 

In [None]:
# need some dark arts...
import requests
import bs4

In [None]:
wdid(requests)

In [None]:
?requests.get

In [None]:
# optional pause for something more advanced... 
for name in [i for i in dir(requests) if i[0] != '_']:
    print(f'\n\n{name}\n{"="*len(name)}\n')
    print(getattr(requests, name).__doc__)

In [None]:
# pip install if not available 

In [None]:
url = 'https://en.wikipedia.org/wiki/Actuary'

In [None]:
r = requests.get(url)

In [None]:
wdid(r)

In [None]:
?r.content

In [None]:
?r.text

In [None]:
r.encoding

In [None]:
txt = r.text

In [None]:
len(txt)

In [None]:
len(txt.split())

In [None]:
counter(txt.split())

In [None]:
# need to tidy up and just get text 
soup = bs4.BeautifulSoup(txt, 'lxml')

In [None]:
ctxt = soup.text
ctxt[:1000]

In [None]:
print(ctxt[:1000])

In [None]:
counter(ctxt.split())
# contains a lot of garbage

In [None]:
def get_text_req(r):
    '''
    Tidy up a requests response with Beautiful Soup and return the text of the page body.
    Returns None if the page has no body. Google for help ... it is helpful to know something about HTML.
    '''
    
    tree = bs4.BeautifulSoup(r.text, 'lxml')

    body = tree.body
    if body is None:
        return None

    # the two biggest causes of mess are script and style elements 
    # delete them 
    for tag in body.select('script'):
        # remove script elements 
        tag.decompose()
        
    for tag in body.select('style'):
        tag.decompose()

    text = body.get_text(separator='\n')
    return text
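
To see the script and style stripping idea without a network call or third-party packages, here is a rough sketch that uses the standard library's `html.parser` in place of Beautiful Soup. The class `TextOnly` and the `sample_html` string are made up for illustration, and unlike `get_text_req` this version does not restrict itself to the page body.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0      # > 0 while inside a script or style element
        self.chunks = []         # text fragments found so far

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # keep non-empty text that is not inside a skipped element
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


sample_html = ('<html><head><style>p {color: red}</style></head>'
               '<body><p>An actuary measures risk.</p>'
               '<script>var x = 1;</script><p>Second paragraph.</p></body></html>')

parser = TextOnly()
parser.feed(sample_html)
parser.close()
plain = '\n'.join(parser.chunks)
print(plain)
```

For real pages Beautiful Soup is far more forgiving of malformed HTML, which is why the function above uses it.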

In [None]:
text = get_text_req(r)

In [None]:
text[:1000]

In [None]:
print(text[:1000])

In [None]:
counter(text.split())

In [None]:
# the more common words (count above 25), picked out with a comprehension 
[(k, v) for k, v in counter(text.split()).items() if v > 25]

In [None]:
# strip out garbage
wdid(str)

In [None]:
# what does isalpha give us? 
''.join(sorted([i  for i in set(text.lower()) if i.isalpha()]))

In [None]:
# what is it omitting? 
''.join(sorted(set([i for i in text.lower() if not i.isalpha()])))

In [None]:
def super_counter(str_in, min_length=4):
    '''
    super_counter: 
        split str_in into words and count
        only count words >= min_length
        case insensitive 
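        replace every non-letter character with a space 
    '''
    # lower case
    str_in = str_in.lower()
    
    # advanced: replace anything that is not a letter or a space with a space
    # (note: isalpha() is also True for non-ASCII letters, so this is not strictly a-z)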
    str_in = ''.join([i if i == ' ' or i.isalpha() else ' ' for i in str_in])
    
    # strip to list of words of length >= min_length
    low = [w for w in str_in.split(' ') if len(w) >= min_length]
    
    # count, as before 
    dow = dict()
    for w in low:
        if w in dow:
            dow[w] += 1
        else:
            dow[w] = 1
            
    # return 
    return dow

In [None]:
d = super_counter(text)

In [None]:
# try to make sorted list of most frequent words
fw = [(k, v) for k, v in d.items() if v > 10]
fw

In [None]:
sorted(fw)

In [None]:
?sorted

In [None]:
# lambda functions: on-the-fly functions 
f = lambda x : x * x
f(3)

# Quiz

What does the following lambda function calculate when given an integer argument? Why? 

In [None]:
f = lambda x : x * f(x-1) if x else 1
f(6)

In [None]:
# BTW
f(100)

In [None]:
# want to sort on the second element of each tuple: 
sorted(fw, key=lambda x : x[1], reverse=True )
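
The `key` argument tells `sorted` what to compare, and `reverse=True` flips the order. A small sketch on a hand-made list of `(word, count)` pairs; the data in `pairs` is invented purely for illustration.

```python
pairs = [('risk', 12), ('actuary', 40), ('exam', 12), ('society', 25)]

# default: tuples compare element by element, so this sorts alphabetically by word
by_word = sorted(pairs)
assert by_word[0] == ('actuary', 40)

# sort on the count instead; ties keep their original relative order (the sort is stable)
by_count = sorted(pairs, key=lambda x: x[1], reverse=True)
assert by_count == [('actuary', 40), ('society', 25), ('risk', 12), ('exam', 12)]

# a tuple key gives a tie-break: count descending, then word ascending
by_count_then_word = sorted(pairs, key=lambda x: (-x[1], x[0]))
assert by_count_then_word == [('actuary', 40), ('society', 25), ('exam', 12), ('risk', 12)]

# sorted() returns a new list; the original is untouched
assert pairs[0] == ('risk', 12)
```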

In [None]:
?fw.sort

In [None]:
# enhance original function 
def super_counter(str_in, min_length=4, top=10):
    '''
    super_counter: 
        split str_in into words and count
        only count words >= min_length
        return top words in desc order of frequency 
        case insensitive 
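        replace every non-letter character with a space 
    '''
    # lower case
    str_in = str_in.lower()
    
    # advanced: replace anything that is not a letter or a space with a space
    # (note: isalpha() is also True for non-ASCII letters, so this is not strictly a-z)
    str_in = ''.join([i if i == ' ' or i.isalpha() else ' ' for i in str_in])
    
    # strip to list of words of length >= min_length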
    low = [w for w in str_in.split(' ') if len(w) >= min_length]
    
    # count, as before 
    dow = dict()
    for w in low:
        if w in dow:
            dow[w] += 1
        else:
            dow[w] = 1
            
    # convert to list for sorting 
    wl = list(dow.items())
    
    # sort in place 
    wl.sort(key=lambda x: x[1], reverse=True)
    
    return wl[:top]

In [None]:
super_counter(text, top=50)
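
For comparison, much the same result can be had in a few lines with `re.findall` and `collections.Counter`. This is a sketch rather than a drop-in replacement: the function name `top_words` and the `sample` sentence are invented here, and the pattern `[a-z]+` keeps ASCII letters only, whereas `isalpha()` in `super_counter` also keeps accented letters.

```python
import re
from collections import Counter


def top_words(text_in, min_length=4, top=10):
    """Return the `top` most frequent words of at least `min_length` letters."""
    # lower-case, then pull out runs of the letters a-z
    words = re.findall(r'[a-z]+', text_in.lower())
    # tally the words that are long enough and return the largest counts
    return Counter(w for w in words if len(w) >= min_length).most_common(top)


sample = "Risk and reward: the actuary prices risk; RISK is the actuary's business."
result = top_words(sample, top=2)
print(result)
```

Writing the counting loop by hand, as above, is still worthwhile for learning how dictionaries work.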

In [None]:
# uber function 
def word_count_from_url(url, min_length=4, top=50):
    r = requests.get(url)
    text = get_text_req(r)
    return super_counter(text, min_length, top)

In [None]:
word_count_from_url('https://en.wikipedia.org/wiki/New_York_City')

In [None]:
# uber function 
def wiki_word_count(page_name, min_length=4, top=50):
    # convenience wrapper: build the Wikipedia URL and reuse word_count_from_url
    return word_count_from_url('https://en.wikipedia.org/wiki/' + page_name, min_length, top)

In [None]:
wiki_word_count('Probability')

In [None]:
# what words come after actuary? 
tls = [i for i in text.lower().split() if i.isalpha()]
next_word = []
for w, nw in zip(tls[:-1], tls[1:]):
    if w[0:6] == 'actuar':
        print(f'{w:<10s}\t{nw}')
        next_word.append(nw)
len(next_word)
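
The `zip(tls[:-1], tls[1:])` idiom pairs each word with the one that follows it. A self-contained sketch on a made-up sentence (`sample` is invented so the result can be checked by eye), with the following words tallied by `collections.Counter`:

```python
from collections import Counter

sample = 'the actuary said the actuarial exam was hard and the actuary said so'
words = sample.split()

# pair every word with its successor; zip stops at the end of the shorter sequence
pairs = list(zip(words, words[1:]))
assert pairs[0] == ('the', 'actuary')

# tally the words that follow anything starting with 'actuar'
following = Counter(nw for w, nw in pairs if w.startswith('actuar'))
assert following == {'said': 2, 'exam': 1}

for word, n in following.most_common():
    print(f'{word:<10s}\t{n}')
```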