# Introduction

This notebook goes through the fundamentals. You should already know all of this, and if you don't, now is the time! I won't cover basic operations (`+`, `**`, `%`, ...), control flow statements (`if`/`else`, `for` and `while` loops) or the basic built-in types (`int`, `float`). Bytes and RAM usage are discussed in more detail in the advanced concepts section.

## Built-in types

The built-in types in Python are listed here: https://www.informit.com/articles/article.aspx?p=453682&seqNum=5. I won't go into much detail for `str` and `int`; just be aware of the built-in number types when working with Python. Most libraries infer types automatically (for example, a NaN in a pandas DataFrame column forces that column to float), and this can lead to inefficient RAM usage. The NumPy data types page (https://numpy.org/doc/stable/user/basics.types.html) gives an idea of which integer type to use depending on your data. (For example, RGB values lie in [0, 255], so `uint8` is the natural type for images.)
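
To make the inference point concrete, here is a minimal sketch. The series values and the 64x64 image shape are made up for illustration; only the dtype behaviour matters.

```python
import numpy as np
import pandas as pd

# A column of integers without missing values keeps an integer dtype.
clean = pd.Series([1, 2, 3])
print(clean.dtype)

# A single missing value makes pandas fall back to float64.
with_nan = pd.Series([1, 2, np.nan])
print(with_nan.dtype)

# Hypothetical 64x64 RGB image: values fit in [0, 255], so uint8 is enough.
image_default = np.zeros((64, 64, 3))                # NumPy defaults to float64
image_uint8 = np.zeros((64, 64, 3), dtype=np.uint8)
print(image_default.nbytes, "bytes vs", image_uint8.nbytes, "bytes")
```

The float64 version of the same image takes 8 times more memory, simply because each value uses 8 bytes instead of 1.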


## Built-in data structures

A good understanding of built-in data structures is imperative to feel comfortable in Python. Next is a reminder of how to use the most prominent data structures.

### List

A list is a mutable built-in data type in Python that can store multiple items of any type (`int`, `float`, other lists, ...), including mixed types within the same list.



In [None]:
# init list
x = [1,2,3,4,5]
print(x)


# slicing
first_position = x[0]
print("first_position",first_position)

last_position = x[-1]
print("last_position",last_position)

interval_position = x[0:2]
print("interval_position",interval_position)

# mutable part

# add an object (here the list [3] is appended as a single element)
x.append([3])
print("appended",x)
# remove the first element equal to the argument
x.remove([3])
print("removed",x)

x[0] = 10
print(x)
print(dir(x))

### Tuples

Tuples are just like lists, except that they are immutable.

In [None]:
x = (1,2,3,4,5)
print(x)

# slicing
first_position = x[0]
print("first_position",first_position)

last_position = x[-1]
print("last_position",last_position)

interval_position = x[0:2]
print("interval_position",interval_position)

# Misc
print("Where is 2 ?", x.index(2))
print(dir(x))

# immutable

# x[0] = 10  # would raise a TypeError

# you can't append or change a value, but you can rebind the name to a new tuple built from the previous one (note the different id)
print(id(x))
x += x
print(x)
print(id(x))

### Dict

A dictionary is a collection of key-value pairs. Since Python 3.7 it is guaranteed to preserve insertion order (in CPython 3.6 this was an implementation detail, and in earlier versions dictionaries were unordered); read more about order in dict: [1](https://stackoverflow.com/questions/39980323/are-dictionaries-ordered-in-python-3-6),[2](https://gandenberger.org/2018/03/10/ordered-dicts-vs-ordereddict/),[3](https://medium.com/junior-dev/python-dictionaries-are-ordered-now-but-how-and-why-5d5a40ee327f). It is mutable and does not allow duplicate keys.

In [None]:
# {key: value} keys must be hashable (str, int, tuple, ...), values can be any Python object
x = {"First name": "Kevin",
    "Last name": "Wirtz",
    "Lecture": "Advanced Programming"}
print(x)

# mutable
x.update({"First name": "other"})
print(x)

print(dir(x))
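
A short sketch of the three dictionary properties mentioned above (key uniqueness, insertion order, hashable keys). The `inventory` dictionary is a made-up example.

```python
# Keys only need to be hashable: strings, numbers and tuples all work.
inventory = {"apple": 3, 42: "answer", (0, 0): "origin"}
print(inventory[(0, 0)])

# Assigning to an existing key overwrites its value instead of adding a duplicate.
inventory["apple"] = 10
print(len(inventory))

# Since Python 3.7, iteration follows insertion order.
inventory["pear"] = 1
print(list(inventory))

# A mutable object such as a list is not hashable and cannot be used as a key.
try:
    inventory[[1, 2]] = "nope"
except TypeError as err:
    print("TypeError:", err)
```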

### Set

A set is unordered, unindexed and contains only unique elements. It is mutable: you can add and remove elements (unlike tuples), but the elements themselves must be hashable. A set therefore can't contain a list or a dict, since they are mutable, but it can contain a tuple.

In [None]:
myset = {1,3,2,3,(1,2,3),(1,2,3),4}
print(myset)

myset.update([10,(1,2,3)])
print(myset)
print(dir(myset))
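
The following sketch shows that a set can be modified in place, that its elements must be hashable, and the usual set algebra. The sets `a` and `b` are arbitrary examples.

```python
a = {1, 2, 3}
b = {3, 4}

# Sets are mutable: elements can be added and removed in place.
a.add(10)
a.discard(1)

# Classic set algebra.
union = a | b
common = a & b
only_in_a = a - b
print(union, common, only_in_a)

# Elements must be hashable, so a list is rejected while a tuple is accepted.
try:
    a.add([1, 2])
except TypeError as err:
    print("TypeError:", err)

# frozenset is the immutable counterpart of set.
frozen = frozenset(a)
print(frozen)
```

If you need a set that really cannot change (for example to use it as a dictionary key), `frozenset` is the built-in to reach for.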

## Useful libraries

### Numpy
NumPy is the library of choice when working with arrays in Python. Why:
- NumPy's core is written in C, so it's fast.
- It is compatible with many other libraries, in particular machine learning libraries.
- It ships built-in functions (random array generation, for example).
- Universal functions allow vectorization (applying a function to a whole array instead of iterating over each element in Python).

In [None]:
import numpy as np

# Basic operation

x = np.array([[1,2,3],[4,5,6],[7,8,9]])
print(x)
print(x.shape)

y = np.array([[4,8,1,2,4,10,13,12,11]])
y = y.reshape((3,3))
print(y)
print(y.shape)

#addition
print(x+y)

# subtraction
print(x-y)

# matrix product
print(np.matmul(x,y))

# Hadamard (element-wise) product
print(np.multiply(x,y))

# "Division": multiplication by the inverse of y
print(np.matmul(x,np.linalg.inv(y)))

# Hadamard (element-wise) division
print(np.divide(x,y))

RAM efficiency is really important to take into account when scaling up your application. To understand how much space a NumPy object occupies, you need to understand the concept of strides.

In [None]:
# Ram efficiency

x = np.zeros(shape=(4,4), dtype=np.int32)
print(repr(x))
print(x.dtype)
print(x.shape)
# reminder: 8 bits = 1 byte
# strides[0] = number of bytes to step from one row to the next
# strides[1] = number of bytes to step from one element to the next within a row
# int32 = 32 bits = 4 bytes, so here strides = (4 * 4, 4) = (16, 4)

print(x.strides)
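
As a complement, this small sketch compares the strides of the same 4x4 array for two dtypes and shows that a transpose only swaps the strides without copying the data.

```python
import numpy as np

x32 = np.zeros((4, 4), dtype=np.int32)
x64 = np.zeros((4, 4), dtype=np.int64)

# Row stride = 4 elements * itemsize, column stride = itemsize.
print(x32.strides)  # (16, 4)
print(x64.strides)  # (32, 8)

# Transposing swaps the strides; the underlying buffer is shared, not copied.
print(x32.T.strides)  # (4, 16)
print(np.shares_memory(x32, x32.T))
```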

Now imagine a square 100000x100000 matrix stored as int64 versus int8:

In [None]:
print("int64 takes", 100000**2 * 8 / 1e9, "GB of RAM")
print("int8 takes", 100000**2 * 1 / 1e9, "GB of RAM")

This does not mean you should use int8 everywhere (reminder: int8 ranges from -128 to 127), but don't accept the default float64 (or int64) every time; think about your problem and the values your matrix holds.
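
One possible way to pick a type is to check the value range with `np.iinfo`. The helper `smallest_int_dtype` below is hypothetical (it is not a NumPy function) and the `ages` array is made-up data; treat it as a sketch, not a general-purpose tool.

```python
import numpy as np

# np.iinfo reports the representable range of an integer dtype.
for dtype in (np.int8, np.uint8, np.int16, np.int32):
    info = np.iinfo(dtype)
    print(dtype.__name__, info.min, info.max)


def smallest_int_dtype(values):
    """Hypothetical helper: smallest integer dtype able to hold every value."""
    lo, hi = int(np.min(values)), int(np.max(values))
    candidates = (np.uint8, np.int8, np.uint16, np.int16,
                  np.uint32, np.int32, np.int64)
    for candidate in candidates:
        info = np.iinfo(candidate)
        if info.min <= lo and hi <= info.max:
            return candidate
    raise ValueError("values do not fit in a 64-bit signed integer")


ages = np.array([0, 17, 42, 99])
compact = ages.astype(smallest_int_dtype(ages))
print(ages.dtype, ages.nbytes, "bytes ->", compact.dtype, compact.nbytes, "bytes")
```

Keep in mind that a narrow dtype silently wraps around on overflow during arithmetic, so leave some headroom if the values will be summed or multiplied later.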

Another way to manage RAM issues is to use sparse matrices. Sparse matrices are useful when most entries of a matrix are 0: only the non-zero values and their positions are stored.

In [None]:
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import inv
import numpy as np
x = csr_matrix([[1, 2, 20], [0, 0, 3], [4, 0, 5]])
print("x: \n", x)

y = csr_matrix([[1, 2, 2, 0, 0, 3, 4, 0, 5]])
y = y.reshape((3,3))
print("y: \n", y)

print("x+y: \n",x+y)
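# matrix product
print("x*y: \n",x.dot(y))
# "division": product with the inverse of y (inv is more efficient on the CSC format)
print("x*inv(y): \n",x.dot(inv(y.tocsc())))
# element-wise division through the public operator; the result is dense and 0/0 entries are nan
print("x/y (element-wise): \n",x / y)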

In [None]:
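# size comparison
from scipy.sparse import csr_matrix
import numpy as np
x = csr_matrix((25000, 25000), dtype = np.int8)
y = np.zeros((25000, 25000), dtype = np.int8)
# sys.getsizeof only sees the Python wrapper of a sparse matrix, so sum the underlying arrays instead
sparse_bytes = x.data.nbytes + x.indices.nbytes + x.indptr.nbytes
print("size of sparse in bytes :",sparse_bytes)
print("size of numpy in bytes :",y.nbytes)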

Another option is NumPy's `memmap`. Instead of holding the whole array in RAM, it stores the array in a file on disk and loads only the parts you access.

In [None]:
import numpy as np

nrows, ncols = 1000000, 100

f = np.memmap('memmapped.dat', dtype=np.float32,
              mode='w+', shape=(nrows, ncols))

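# fill the file column by column
for i in range(ncols):
    f[:, i] = np.random.rand(nrows)

# x is a view on the last column, still backed by the file
x = f[:, -1]

# write pending changes to disk, then drop the reference
f.flush()
del f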
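
Once written, the file can be reopened later without loading it entirely. The sketch below uses a temporary directory and a much smaller array than above so that it runs quickly; the sizes and the `path` variable are assumptions for illustration.

```python
import os
import tempfile
import numpy as np

nrows, ncols = 1000, 10  # kept small for the example
path = os.path.join(tempfile.mkdtemp(), "memmapped.dat")

# Write phase: create the file and fill it.
writer = np.memmap(path, dtype=np.float32, mode="w+", shape=(nrows, ncols))
writer[:] = np.arange(nrows * ncols, dtype=np.float32).reshape(nrows, ncols)
writer.flush()  # push pending changes to disk
del writer

# Read phase: the file holds raw bytes only, so dtype and shape must be given again.
reader = np.memmap(path, dtype=np.float32, mode="r", shape=(nrows, ncols))
last_column = np.array(reader[:, -1])  # copy just this slice into RAM
print(last_column[:3])
print(os.path.getsize(path), "bytes on disk")
```

Because the file contains no metadata, it is up to you to remember the dtype and shape used when writing it.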

Another important concept for optimizing computation speed is vectorization. In NumPy this is done with universal functions. NumPy defines a "universal function" ("ufunc" for short) as a function that operates on each element of an array, or combines single elements from several input arrays. A ufunc accepts arrays with different numbers of dimensions, or even scalar values, as inputs and returns a new array. The process by which array elements are matched up is called broadcasting.

In [None]:
a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])

# element-wise addition of two arrays
print(np.add(a, b))
# the scalar 100 is broadcast to every element
print(np.add(a, 100))
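
Broadcasting also works between arrays of different shapes, as long as the dimensions are compatible. A minimal sketch with made-up arrays:

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])      # shape (2, 3)
row = np.array([10, 20, 30])        # shape (3,)
column = np.array([[100], [200]])   # shape (2, 1)

# The 1-D row is stretched across both rows of the matrix.
print(np.add(matrix, row))

# The (2, 1) column is stretched across the three columns.
print(matrix + column)

# Shapes that cannot be aligned raise an error.
try:
    matrix + np.array([1, 2])
except ValueError as err:
    print("ValueError:", err)
```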


In addition to all the standard basic math operations (+,-,*,/), NumPy offers many additional classes of functions:

- Linear algebra
- Special math functions (trig, exp/log, polynomials)
- Cumulative functions
- Logical (bool) operations
- Random number generation

Most of these functions are implemented using compiled C code, so they execute much faster than regular Python code. It is a good idea to be familiar with the array functions that NumPy offers so you don't reinvent the wheel in your own code.

One big limitation of NumPy is that creating or processing an array element by element in plain Python, without ufuncs, is significantly slower. On the other hand, expressing everything with ufuncs can quickly become unreadable.
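
To get a feel for the gap, here is a rough timing sketch. The functions `python_loop` and `vectorized` are illustrative names, and the exact timings depend on your machine.

```python
import timeit
import numpy as np

values = np.random.rand(100_000)


def python_loop(arr):
    """Element-by-element loop executed by the Python interpreter."""
    out = np.empty_like(arr)
    for i in range(arr.size):
        out[i] = arr[i] ** 2 + 1.0
    return out


def vectorized(arr):
    """Same computation expressed with ufuncs, executed in compiled code."""
    return arr ** 2 + 1.0


# Both versions compute the same result.
print(np.allclose(python_loop(values), vectorized(values)))

loop_time = timeit.timeit(lambda: python_loop(values), number=3)
vec_time = timeit.timeit(lambda: vectorized(values), number=3)
print(f"loop: {loop_time:.4f}s  vectorized: {vec_time:.4f}s")
```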

### Pandas

"pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
built on top of the Python programming language."

It is built on the NumPy package and its key data structure is called the DataFrame. DataFrames allow you to store and manipulate tabular data in rows of observations and columns of variables.

Basically: you have a CSV and you want stats and plots ASAP? Use pandas.

DataFrame API reference: https://pandas.pydata.org/pandas-docs/stable/reference/frame.html

In [None]:
import pandas as pd

df = pd.read_csv("data/california_housing_train.csv")
print(df.head())
print(df.columns)

In [None]:
df.plot.scatter(x= "median_house_value",
                y = "total_bedrooms")

In [None]:
# Slicing

# Get column
# df[3] does not work: [] with a single value selects a column by label
print("Get column: \n", df["longitude"])
# select rows
print("Get rows: \n", df[3:5])
# select rows and columns by position
print("Select rows and columns:", df.iloc[0:3,1:3])
# df.iloc[0, ['longitude', 'latitude']] does not work: iloc is positional, use loc for labels
print("Select by labels:", df.loc[0, ['longitude', 'latitude']])

# Query by index (loc uses square brackets, not a function call)
print("Search row with index", df.loc[1])

# get a boolean array for a condition
print("Boolean array with index where condition is respected", df["latitude"]>34)

# get data where this condition is true
print("Get df where condition is true:",df[df["latitude"]>34])
print("Double condition", df[(df["latitude"]>34) & (df["longitude"]>-115)])
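
If you don't have the CSV at hand, the difference between `iloc` (positions) and `loc` (labels) can be tried on a tiny DataFrame. The values and the string index below are invented for illustration; only the column names are borrowed from the dataset above.

```python
import pandas as pd

# Hypothetical mini dataset with a non-numeric index to make labels obvious.
df_small = pd.DataFrame(
    {"longitude": [-114.3, -118.2, -122.4],
     "latitude": [34.2, 33.9, 37.8],
     "total_bedrooms": [1283, 190, 720]},
    index=["a", "b", "c"],
)

# iloc is purely positional: first row, first two columns.
print(df_small.iloc[0, 0:2])

# loc is label-based: row "a", columns selected by name.
print(df_small.loc["a", ["longitude", "latitude"]])

# Boolean masks are combined with & and |, each condition in parentheses.
north = df_small[(df_small["latitude"] > 34) & (df_small["longitude"] > -120)]
print(north)
```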

### Matplotlib

## Know your machine

![friends](img/know_your_machine.png)