# Fundamentals of Computing

Before moving to higher-level aspects of software, it is best to review the basic concepts of processing data.  Some of this material will seem obvious, while other aspects can seem unnecessarily low-level, especially for analysts who are accustomed to writing scripts.  However, software engineering requires this finer understanding of how programs work with the hardware.  It will soon become apparent that this level of detail is appropriate for data science.

## Types of Data

Data comes in many forms, which we often refer to by application source or file extension.  All data have a schema: the metadata describing the structural format, the definition of the values, and any other information necessary for maintaining and processing the data.  Schema information may be implied, as with the version of a database or API, or written explicitly within the file, as with eXtensible Markup Language (XML).

Schema consistency also relates to versioning of the application that maintains the data.  Application version is typically defined by an increasing sequence of values with the format MAJOR.MINOR.PATCH, where each part signals a kind of change:

* MAJOR version - incompatible API changes
* MINOR version - add functionality in a backwards-compatible manner
* PATCH version - make backwards-compatible bug fixes
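The MAJOR.MINOR.PATCH rules above can be sketched in a few lines.  This is a minimal illustration, not a library API: `Version`, `parse_version`, and `is_schema_compatible` are hypothetical helpers, and the compatibility rule (same MAJOR, producer not older than consumer) is an assumption chosen for the example.

```python
from typing import NamedTuple


class Version(NamedTuple):
    """A MAJOR.MINOR.PATCH version (hypothetical helper, not a library API)."""
    major: int
    minor: int
    patch: int


def parse_version(text: str) -> Version:
    """Parse a string such as '2.4.1' into comparable integer parts."""
    major, minor, patch = (int(part) for part in text.strip().split("."))
    return Version(major, minor, patch)


def is_schema_compatible(producer: str, consumer: str) -> bool:
    """Assume compatibility when MAJOR matches and the producer is not older."""
    p, c = parse_version(producer), parse_version(consumer)
    return p.major == c.major and p >= c


# Parts are compared as integers, so 1.10.0 is newer than 1.9.3
print(parse_version("1.10.0") > parse_version("1.9.3"))
print(is_schema_compatible("2.4.1", "2.1.0"))
print(is_schema_compatible("3.0.0", "2.9.9"))
```

Comparing the parts as integers matters: a plain string comparison would rank '1.9.3' above '1.10.0'.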

We can loosely categorize data by the consistency and complexity of their schema: structured, semi-structured, and unstructured.  While the definitions can seem blurry for some types of data, this is a useful lens for viewing the medium.

__Structured__  This data is typically tabular in form.  Examples include a database table or a file in the format Comma Separated Values (.csv), Tab Delimited Values (.tdv, .tsv, .txt), Microsoft Excel (.xlsx) or one of its many derivatives.  These schemas only change with major versions (the first number) of the software applications that produce them.  In the case of a SQL database, making a major version change is expensive for both the consumers of the data and the maintainers of the database, so it rarely occurs for more mature sources.

Traditionally, statisticians sought 'long-formatted' data: many records with a handful (<30) of columns, one for each variable.  In the last few decades, however, the opposite has become common: a few records that are very costly to obtain, each containing hundreds of variables.  This 'wide' data requires approaches completely different from traditional methods.

__Semi-structured__ This data is often nested in a tree-like structure, and can be more difficult to parse into a form useful for typical statistical models.  Important formats include JavaScript Object Notation (.json) from web app APIs, HyperText Markup Language (.html) of served web pages, eXtensible Markup Language (.xml) of older database systems, and Portable Document Format (.pdf) used in most business files.  The eXtensible Business Reporting Language (.xbrl) has become more prevalent in financial accounting over the last few years.

Schemas may be descriptive, partial, or evolving, which can lead to even more difficulties in longitudinal projects.  Because the format is mutable, the schema usually provides versioning information and other metadata explicitly within the file to support consumers.  While this type of data may seem unfamiliar to new data practitioners, it is ubiquitous in the information technology field because of the predominance of web technologies producing data for humans and machines.
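To make the parsing difficulty concrete, here is a minimal sketch that flattens a nested JSON document into a single record suitable for a table.  The payload, its field names, and the `flatten` helper are hypothetical; real API responses usually need extra handling for missing keys and empty containers.

```python
import json


def flatten(node, prefix=""):
    """Recursively flatten nested dicts and lists into {dotted_path: value}."""
    if isinstance(node, dict):
        items = node.items()
    elif isinstance(node, list):
        items = enumerate(node)
    else:
        return {prefix: node}          # leaf value
    flat = {}
    for key, value in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(value, path))
    return flat


# Hypothetical API response that carries its schema version inside the payload
raw = '{"schema_version": "1.2.0", "account": {"id": 7, "tags": ["a", "b"]}}'
record = flatten(json.loads(raw))
print(record)
```

Each dotted path becomes a candidate column name, which shows why an evolving schema is troublesome: a renamed or newly nested field silently changes the set of columns.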

__Unstructured__ Such data is usually used by AI practitioners because of the unprocessed openness of its form.  Natural language text may come from communications data, such as email (.eml), calendar data (.ics), and reports (.txt, .docx, .pdf).  It may also be image data, which comes in a wide variety of formats (.jpg, .gif, .png).  This raw data allows more modern models, such as neural networks, to perform feature extraction (or variable selection) as part of the algorithm, rather than separating the tasks into multiple components as is typical of traditional machine learning routines.

This data comes unprocessed, directly from the source, which is often a device (or many devices) producing the data.  Schema information may be available for specific devices that create images, but this concept is less relevant when applied to people writing to others.  While a separate feature-extraction step becomes moot, preprocessing becomes much more important.  This standardization takes a variety of forms for images and text.
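As one illustration of text standardization, the sketch below normalizes Unicode, lowercases, removes punctuation, and splits the result into tokens.  The function name and the specific steps are assumptions for the example; actual pipelines vary by language and task.

```python
import re
import unicodedata


def standardize_text(text: str) -> list[str]:
    """Normalize Unicode, lowercase, drop punctuation, and split into tokens."""
    text = unicodedata.normalize("NFKC", text)   # unify equivalent code points
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)         # replace punctuation with spaces
    return text.split()                          # split on any run of whitespace


print(standardize_text("Re: Q3  Report -- FINAL (v2).docx"))
```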

## Properties of Big Data

Big Data poses many challenges, which led to the evolution of many disciplines, including data science and data engineering.


__Volume__

* Typically many records in long, tabular format
* Sometimes too many columns, or wide format
* Text and images may be spread across many individual files

__Velocity__

* Often refers to ETL pipelines
* Data processing can be resource-intensive

__Variety__

* Tabular, JSON, or document formats
* Images, email, chat, chat images, and many other forms

## Constraints and Challenges

__Von Neumann architecture__

* Limitation: instructions and data share one path to memory, so processing and storage access cannot occur at the same time
* There are many device types for the five components
    * Memory: L1 and L2 cache, main memory
    * Storage: SSD and disk reads and writes
    * Bus / Route: local, bytes over a network, packets across a continental cable
* 'Latency numbers every programmer should know'

Major trend: balancing speed with cost is achieved by memory getting cheaper over time (latency numbers)
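For a sense of scale, the sketch below tabulates a few of the widely circulated 'Latency numbers every programmer should know'.  The values are approximate, order-of-magnitude figures from around 2012 and differ by hardware generation; they are meant only to show the relative gaps between cache, memory, storage, and network.

```python
# Approximate, order-of-magnitude latencies in nanoseconds (circa 2012 figures)
LATENCY_NS = {
    "L1 cache reference": 0.5,
    "L2 cache reference": 7,
    "Main memory reference": 100,
    "SSD random read": 150_000,
    "Disk seek": 10_000_000,
    "Packet CA -> Netherlands -> CA": 150_000_000,
}

# Express every latency as a multiple of the fastest tier
baseline = LATENCY_NS["L1 cache reference"]
for name, ns in LATENCY_NS.items():
    print(f"{name:<32} {ns:>13,.1f} ns  {ns / baseline:>13,.0f}x L1")
```

The ratios are the useful part: a round trip across a continental cable costs hundreds of millions of cache references, which is why moving computation to the data is usually cheaper than moving data to the computation.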

​

## Solutions: Hardware

* Large server: Grid 128GB
* Many large servers: distributed processing
    * Hadoop: copy the function to the data on disk
    * Spark: copy the function to the data in memory
* Intel vectorization: chip performs common operations on a whole array
* Nvidia GPU: chip-level distributed operations
* Combinations of these
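The idea of copying the function to the data, rather than the data to the function, can be sketched without any cluster.  The following is a pure-Python stand-in, not the Hadoop or Spark API: the `partitions` list and `count_words` function are hypothetical, and a plain loop takes the place of remote execution.

```python
from collections import Counter
from functools import reduce

# Hypothetical partitions: each list stands in for the data held by one server
partitions = [
    ["spark", "hadoop", "spark"],
    ["dask", "spark"],
    ["hadoop", "dask", "dask"],
]


def count_words(partition):
    """The function that is shipped to, and run beside, each partition."""
    return Counter(partition)


# Map: run the function where each partition lives (here, simply in a loop)
partial_counts = [count_words(p) for p in partitions]

# Reduce: only the small per-partition summaries are combined centrally
total = reduce(lambda a, b: a + b, partial_counts, Counter())
print(dict(total))
```

Only the per-partition summaries cross the network in this pattern, which is typically far less data than the partitions themselves.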

## Solutions: Python Software

Constraints

* IO-bound: files and network connections (threading, asyncio)
* CPU-bound: matrix inversion (multiprocessing)

__Concurrency__ Managing multiple threads at the same time, with only one running at any instant, on a single processor with shared memory.

__Parallelism__ Running multiple jobs simultaneously, each in its own process with its own core and memory.  This avoids the limitation of the GIL in CPython, which allows only a single thread to execute Python bytecode at a time.
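A small experiment illustrates why threads help with IO-bound work even though only one thread runs Python code at a time.  In this sketch, `fake_download` is a hypothetical stand-in for a network call; it only sleeps, which releases the GIL in the same way that waiting on real IO does.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(task_id, delay=0.2):
    """Stand-in for an IO-bound call; sleeping releases the GIL like real IO."""
    time.sleep(delay)
    return task_id


tasks = range(4)

# One task after another: total time is roughly the sum of the waits
start = time.perf_counter()
sequential = [fake_download(t) for t in tasks]
sequential_s = time.perf_counter() - start

# Four threads waiting concurrently: total time is roughly one wait
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as executor:
    threaded = list(executor.map(fake_download, tasks))
threaded_s = time.perf_counter() - start

print(f"sequential: {sequential_s:.2f}s, threaded: {threaded_s:.2f}s")
```

The same experiment with a CPU-bound function would show little or no gain from threads, which is the case where multiprocessing is the usual choice.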

### Linux

Get a basic idea of the machine you are working on.

In [None]:
#bring installed packages up to date
! sudo apt-get update && sudo apt-get upgrade

In [10]:
#os version
! cat /etc/os-release

PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
NAME="Debian GNU/Linux"
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"


In [11]:
#kernel version
! uname -r

5.15.49-linuxkit


In [12]:
#number of cores
! python -c 'import multiprocessing as mp; print(mp.cpu_count())'

4


In [9]:
! free -h

               total        used        free      shared  buff/cache   available
Mem:           9.7Gi       1.6Gi       5.1Gi       313Mi       3.1Gi       7.5Gi
Swap:          1.0Gi        22Mi       1.0Gi


In [15]:
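#disk usage per directory, sorted by size
#the two home paths come from other environments and do not exist on this machine
! du --max-depth=1 --human-readable /home/vagrant/ | sort --human-numeric-sort
! du -d1 -h /home/ubuntu | sort -h
#total size of the current directory
! du -sh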

du: cannot access '/home/vagrant/': No such file or directory
du: cannot access '/home/ubuntu': No such file or directory
464K	.


In [17]:
! df -h

Filesystem      Size  Used Avail Use% Mounted on
overlay         110G  102G  3.0G  98% /
tmpfs            64M     0   64M   0% /dev
shm              64M   24K   64M   1% /dev/shm
/dev/vda1       110G  102G  3.0G  98% /vscode
grpcfuse        466G  356G  111G  77% /workspaces/ai-for-banking-and-finance
tmpfs           4.9G     0  4.9G   0% /proc/acpi
tmpfs           4.9G     0  4.9G   0% /sys/firmware


Inodes keep track of all the files on a Linux system.  An inode stores a file's metadata (owner, permissions, size, timestamps, and the location of its data blocks), but not the file name or the actual content.  Because every file consumes one inode, a filesystem holding many small files can run out of inodes before it runs out of space.

In [18]:
! df -i		#inodes

Filesystem         Inodes   IUsed      IFree IUse% Mounted on
overlay           7340032 3398540    3941492   47% /
tmpfs             1275320      16    1275304    1% /dev
shm               1275320       7    1275313    1% /dev/shm
/dev/vda1         7340032 3398540    3941492   47% /vscode
grpcfuse       1154981254 1381414 1153599840    1% /workspaces/ai-for-banking-and-finance
tmpfs             1275320       1    1275319    1% /proc/acpi
tmpfs             1275320       1    1275319    1% /sys/firmware
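The split between a file's name and its inode can be observed from Python.  The sketch below creates a temporary file, gives it a second name with a hard link, and compares the metadata; the file names are hypothetical, and it assumes a filesystem that supports hard links.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "report.txt")      # hypothetical file name
    with open(original, "w") as handle:
        handle.write("hello")

    alias = os.path.join(tmp, "alias.txt")
    os.link(original, alias)                         # second name, same inode

    # Metadata lives in the inode, so both names report the same values
    info, alias_info = os.stat(original), os.stat(alias)
    same_inode = info.st_ino == alias_info.st_ino
    link_count = info.st_nlink
    size_bytes = info.st_size

print(f"same inode: {same_inode}, names: {link_count}, size: {size_bytes} bytes")
```

Both names resolve to one inode, so the file consumes a single entry in the `df -i` count regardless of how many names point to it.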


### CPU and available processes

Don't be confused by 'Thread(s) per core'.  This refers to virtual cores.  On most Intel and AMD chips, each physical core is separated into, at most, two virtual cores.  So, if a CPU is dual core with two threads per core, it will have 4 virtual cores (displayed here as 'threads').

In [25]:
! lscpu | head -n 10

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   39 bits physical, 48 bits virtual
CPU(s):                          4
On-line CPU(s) list:             0-3
Thread(s) per core:              1
Core(s) per socket:              4
Socket(s):                       1
Vendor ID:                       GenuineIntel


If you want to use the number of CPUs to calculate the number of processes to spawn, use cpu_count to find the number of CPUs:

In [2]:
import psutil
psutil.cpu_count()

4

However, using CPU utilization to calculate the number of spawned processes could be a better approach.  To check CPU utilization:

In [3]:
import psutil
psutil.cpu_times_percent(interval=1, percpu=False)

scputimes(user=0.8, nice=0.0, system=1.0, idle=98.2, iowait=0.0, irq=0.0, softirq=0.0, steal=0.0, guest=0.0, guest_nice=0.0)

This gives you the CPU usage, which you could use, for example, to decide whether to spawn a new process.  It might be a good idea to keep an eye on memory and swap too.
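One way to turn utilization into a process count is sketched below.  The heuristic and the `suggest_workers` name are assumptions for illustration, not a recommendation from psutil or multiprocessing; the sketch uses the standard library's 1-minute load average in place of `psutil.cpu_times_percent` so that it runs without extra packages.

```python
import os


def suggest_workers(cpu_count: int, load_1m: float, reserve: int = 1) -> int:
    """Hypothetical heuristic: use the CPUs that are not already busy.

    load_1m is the 1-minute load average, roughly the number of busy CPUs.
    reserve leaves headroom for the operating system and the notebook itself.
    """
    idle = cpu_count - round(load_1m) - reserve
    return max(1, min(cpu_count, idle))      # always at least one worker


cpus = os.cpu_count() or 1
try:
    load_1m = os.getloadavg()[0]             # Unix only
except (AttributeError, OSError):
    load_1m = 0.0                            # assumption: treat unknown load as idle

print(suggest_workers(cpus, load_1m))
```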

For further discussion, see the answer by Radan on [stackoverflow](https://stackoverflow.com/questions/52311339/python-3-multiprocessing-how-many-processes-should-i-use), which also recommends the question 'Limit total CPU usage in python multiprocessing'.

See also 'How many processes should I run in parallel', answered by unbuntu on [stackoverflow](https://stackoverflow.com/questions/23816546/how-many-processes-should-i-run-in-parallel).

The following cell spreads 100 tasks across a pool with one worker process per CPU.

In [22]:
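import time
import multiprocessing as mp

def func(task):
    #must accept the task argument; otherwise every call fails silently
    time.sleep(0.01)    #short stand-in for real work
    return task


num_workers = mp.cpu_count()    #one worker per CPU

y = 100     #number of tasks
with mp.Pool(num_workers) as pool:
    results = [pool.apply_async(func, args = (task,)) for task in range(y)]
    #.get() waits for each task and re-raises any worker exception
    done = [r.get() for r in results]

print(len(done))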

100

### Threads within Processes

What is the maximum number of threads I should use?

This is answered by [Robert Gamble](https://stackoverflow.com/questions/344203/maximum-number-of-threads-per-process-in-linux) on stackoverflow.

In [6]:
import threading
import time


def mythread():
    time.sleep(1000)    #keep each thread alive while the others are created

def test_thread_count():
    threads = 0     #thread counter
    y = 1000000     #a MILLION of 'em!
    for i in range(y):
        try:
            x = threading.Thread(target=mythread, daemon=True)
            x.start()       #start each thread
            threads += 1    #count only threads that actually started
        except RuntimeError:    #too many throws a RuntimeError
            break
    print("{} threads created.\n".format(threads))

test_thread_count()

78044 threads created.



In [5]:
#kern.num_taskthreads is a macOS key, so this fails on Linux
! sysctl kern.num_taskthreads

sysctl: cannot stat /proc/sys/kern/num_taskthreads: No such file or directory


Linux does not have a separate limit on threads per process, just a limit on the total number of processes on the system (threads are essentially just processes with a shared address space on Linux), which you can view like this:

In [7]:
! cat /proc/sys/kernel/threads-max

78761


The cited answer gives the default as the number of memory pages/4; the exact formula depends on the kernel version, but the limit scales with the amount of memory.  A page, memory page, or virtual page is a fixed-length contiguous block of virtual memory, described by a single entry in the page table.  It is the smallest unit of data for memory management in a virtual memory operating system.
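The page size and page count behind this limit can be read from Python.  This is a minimal sketch that assumes a Linux system: the `SC_PAGE_SIZE` and `SC_PHYS_PAGES` names are looked up through `os.sysconf`, and the threads-max file is read only if it exists.

```python
import os
from pathlib import Path

page_size = os.sysconf("SC_PAGE_SIZE")       # bytes per memory page
phys_pages = os.sysconf("SC_PHYS_PAGES")     # pages of physical memory
total_gib = page_size * phys_pages / 2**30

print(f"page size: {page_size} bytes, pages: {phys_pages}, memory: {total_gib:.1f} GiB")

limit_file = Path("/proc/sys/kernel/threads-max")   # Linux only
if limit_file.exists():
    threads_max = int(limit_file.read_text())
    print(f"threads-max: {threads_max} ({phys_pages / threads_max:.0f} pages per thread)")
```

The last figure shows how many memory pages the kernel budgets per thread on the machine at hand, which is a quick way to check whatever formula your kernel version uses.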

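You can increase this as follows, provided you have root privileges and /proc is writable (it is not in this container, so the write fails):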

In [8]:
! echo 100000 > /proc/sys/kernel/threads-max

/bin/bash: line 1: /proc/sys/kernel/threads-max: Read-only file system


There is also a limit on the number of processes (and hence threads) that a single user may create, see ulimit/getrlimit for details regarding these limits.
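The per-user limit can be inspected with the standard library's `resource` module, which wraps getrlimit.  This sketch assumes a Unix system where `RLIMIT_NPROC` is defined; `describe` is a hypothetical helper for printing.

```python
import resource

# The soft limit is enforced now; the hard limit is the ceiling it may be raised to
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)


def describe(limit):
    """Render a limit value, mapping the 'no limit' sentinel to text."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


print(f"max processes/threads for this user: soft={describe(soft)}, hard={describe(hard)}")
```

In a shell, `ulimit -u` reports the same soft limit.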

## Available Frameworks

__Scala Spark__

* R / Py wrapper
* Automatic step optimizations
* Difficult to install => AWS EMR Cluster
* Available in future AWS

__Python Dask__

* Native Python
* Easily installed
* Used on Grid
* Available now