## 0. Python Introduction

- Authors : Kefia Ali
- Contributors : Rhaiti Jamal

# Introduction to Python

- General presentation of the language
    - Philosophy
    - Evolutions & Successes
- Practical exploration of the language through a concrete example
    - Computing the GCD of two numbers: discovering the basics of the language through different implementations
    - Introduction to NumPy & Pandas 

## Basics

- Very powerful and intuitive object-oriented programming language
- Interpreted: no separate compilation phase (unlike C or Java) => easy to set up
- Simple, high-level syntax (blocks are delimited by indentation) => easy to learn
- Powerful, with an amazing ecosystem => the more people use a language, the more powerful it becomes
- Flexible => easy to adapt to your needs (few constraints from syntax or data types)

- Strong and dynamic typing:
    - Strong: every value has a type, and the type matters when we perform an operation
    - Dynamic: the type is determined only at runtime
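
As a minimal sketch of these two properties (the variable names are illustrative only), the snippet below shows that mixing incompatible types raises an error instead of converting silently, and that a name can be rebound to a value of another type at runtime:

```python
# Strong typing: mixing incompatible types is an error, not a silent conversion.
try:
    result = "1" + 1
except TypeError as exc:
    result = f"TypeError: {exc}"
print(result)

# An explicit conversion makes the intent clear.
total = int("1") + 1
print(total)

# Dynamic typing: the same name can be rebound to objects of different types.
value = 5
first_type = type(value).__name__
value = "five"
second_type = type(value).__name__
print(first_type, second_type)
```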

- Very powerful native data structures 
     - (lists, dictionaries, sets, iterators, etc.)
     - Pay attention to mutable and immutable types
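
The warning about mutable and immutable types can be illustrated with a short sketch (names are hypothetical): a list shared under two names changes for both, whereas a tuple refuses item assignment.

```python
# Lists are mutable: two names bound to the same list see the same changes.
original = [1, 2, 3]
alias = original          # no copy is made here
alias.append(4)
print(original)           # the change made through `alias` is visible

# An explicit copy breaks the link.
independent = list(original)
independent.append(5)
print(original, independent)

# Tuples are immutable: item assignment raises TypeError.
point = (1, 2)
try:
    point[0] = 10
    tuple_changed = True
except TypeError:
    tuple_changed = False
print(tuple_changed)
```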

#### Print hello world: Java vs Python

`class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello, World!"); 
    }
}`

In [None]:
print("Hello, World!")

#### Native data structures 

##### Text type => str

In [None]:
my_str = "hello"
print(my_str)

##### Numeric types  => int, float, etc.

In [None]:
# Integer data type
my_int = 5
print(my_int)
# Float data type
my_float = 3.99
print(my_float)

In [None]:
print(my_int + my_float)

##### Sequence type  => list, tuple, etc.

In [None]:
# for a list, we place elements inside square brackets [] , separated by commas
# List data type
my_list = [1, 2, 3, 4]
print(my_list)

In [None]:
# Print the length
print(len(my_list))

In [None]:
# Print the first & the last element
print(my_list[0])
print(my_list[-1])

In [None]:
# Modify the first element
my_list[0] = 5
print(my_list)

In [None]:
my_list2 = my_list + [7, 8]
print(my_list2)

In [None]:
# Add an element
my_list2.append(9)
print(my_list2)
# Remove an element
my_list2.remove(9)
print(my_list2)

In [None]:
# We can use a list to drive iterative instructions
for i in my_list:
    print(i)

##### Mapping type  => dict

- each element of the dict is a key/value pair
- keys must be unique; a value can be of any type (string, number, user-defined type, ...) and can be repeated

In [None]:
my_dict1 = {
    "name": "jamal",
    "job": "data_engineer",
    "company": "capgemini"
}
print(my_dict1)
print(my_dict1["name"])

In [None]:
my_dict1["phone"] = 12345
my_dict1["job"] = "consultant"
print(my_dict1)

In [None]:
my_dict2 = {
    "profil": my_dict1,
    "id": 1,
    "sectors" : ["energy", "industry"]
}
print(my_dict2)

In [None]:
# iterating over a dict 
for key, value in my_dict1.items():
    print(key, value)

#### Duck typing 

- we rely on the interface of a value rather than on its type (*If it walks like a duck, and it quacks like a duck, then it must be a duck.*)

- When we use duck typing, we focus on the presence of a given method or attribute
- We can call `len()` on any Python object that defines a `.__len__()` method:

In [None]:
class Magic:
    def __len__(self):
        return 1

l = (1, 2, 3)
print(len(l))

m = Magic()
print(len(m))

## Some functionalities of standard libraries

The standard library provides useful functionalities:
- text handling (`upper`, `lower`, `regexp`, `diff`, `unicode`)
- numerical manipulation (`random`, `decimal`, `sqrt`, `trigo`)
- file system, processing, threading, network, etc. (web server in 2 lines)
- data persistence (`pickle`, `json`, `csv`, `sqlite`, `zlib`)
- os interface (`os`, `io`, `logging`)
- UI library (`Tkinter` as a standard)
- Modules / Packages management (very important)

#### Functionalities examples

In [None]:
# Text handling
my_text = "Hello To Everyone !"
print(my_text.lower())
print(my_text.upper())
print(my_text.count("o"))

In [None]:
# Numerical manipulation
import random
random_int = random.randint(1,10)
print(random_int)

In [None]:
help(random)

In [None]:
import math
print(math.sqrt(9))
print(math.factorial(4))

In [None]:
# System
import os
current_path = os.getcwd()
print(current_path)
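
Two other items from the list above, regular expressions and data persistence, can be sketched in the same spirit. This is only an illustration using the standard `re` and `json` modules; the sample string and dictionary are made up for the example.

```python
import json
import re

# Regular expressions: extract all integers from a string.
text = "gcd(20, 6) = 2"
numbers = [int(n) for n in re.findall(r"\d+", text)]
print(numbers)

# Data persistence: serialize a dict to a JSON string and read it back.
profile = {"name": "jamal", "sectors": ["energy", "industry"]}
encoded = json.dumps(profile)
decoded = json.loads(encoded)
print(encoded)
print(decoded == profile)
```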

## Python uses 

- Widely used in many fields
    - Computation / Data / Stats : `Numpy`, `Pandas`, `scikit-learn`, `TensorFlow`, `PyTorch`
    - Web : `Flask`, `Django`, `FastAPI`
    - Admin / Automation / Cloud : `Ansible`, `awscli`, `azure-cli`

## Evolution

- Perpetual evolution
- Very active community supported by companies / universities (Google, Dropbox, etc.)
- A big shift in 2008 with Python 3.0 (`py3k`): a backward-incompatible version released to correct design errors
- About 10 years to migrate all community projects / frameworks (Python 2 reached end of life in 2020)
- Async approach (`asyncio`, `async` / `await`) built into the language to handle I/O-bound concurrency despite the `GIL`
- Recent communication by Guido van Rossum on work in progress on Python performance (Microsoft)

- Good introduction : https://learnxinyminutes.com/docs/python/
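
To give an idea of the async approach mentioned above, here is a minimal sketch using `asyncio` from the standard library. The `fake_request` coroutine and its delays are hypothetical stand-ins for real I/O; the point is only that the three waits overlap. Note that this helps with waiting on I/O, not with CPU-bound work.

```python
import asyncio
import time


async def fake_request(name, delay):
    # asyncio.sleep stands in for a network or disk wait (hypothetical workload).
    await asyncio.sleep(delay)
    return f"{name} done"


async def main():
    # The three waits overlap instead of running one after another.
    return await asyncio.gather(
        fake_request("a", 0.1),
        fake_request("b", 0.1),
        fake_request("c", 0.1),
    )


start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start
print(results, f"{elapsed:.2f}s")
```

Inside a Jupyter cell, where an event loop is already running, `await main()` replaces `asyncio.run(main())`.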

## Practical exploration (gcd)

- gcd: greatest common divisor of two numbers (in French: PGCD, Plus Grand Commun Diviseur)
- Explanation:
    - gcd(a, b) with a = n*b + m (m is the remainder of the division of a by b)
    - if m != 0
        - gcd(a, b) divides m, because it divides both a and n*b
        - if d divides m and b, then d divides a
        => gcd(a, b) = gcd(b, m)
    - if m == 0 => gcd(a, b) = b
- Algorithm:
    - while a >= b, a = a - b
    - if a == 0 => res = b
    - otherwise while b >= a, b = b - a
      - if b == 0 => res = a
      - otherwise while ...
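
Before implementing the algorithm by hand, the identity gcd(a, b) = gcd(b, m) can be checked numerically. The sketch below uses `math.gcd` from the standard library as a reference; the sample pairs are arbitrary.

```python
import math

# Check gcd(a, b) == gcd(b, m) with m = a % b on a few pairs.
pairs = [(20, 6), (12, 9), (35, 14), (17, 5), (12, 4)]
for a, b in pairs:
    n, m = divmod(a, b)          # a = n * b + m
    assert a == n * b + m
    if m != 0:
        assert math.gcd(a, b) == math.gcd(b, m)
    else:
        assert math.gcd(a, b) == b

# Trace of the successive reductions for (20, 6).
a, b = 20, 6
steps = []
while b != 0:
    steps.append((a, b))
    a, b = b, a % b
print(steps, "->", a)
```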

In [None]:
# Naive approach (using a language to express an algorithm)
# Iterative instructions

a, b = 20, 6
res = None

while True:
    a = a - b  # we assume that a >= b
    if a == 0:
        res = b
        break
    elif a < b:
        a, b = b, a

print(res)

In [None]:
# add prints to visualize each step

a, b = 20, 6
res = None

i = 1
while True:
    print(f"step {i}")
    a = a - b  # we assume that a >= b
    print(f"new a: {a}, b: {b}")
    if a == 0:
        res = b
        print(f"GCD is equal to {b}")
        break
    elif a < b:
        print("switch values between a and b")
        print(f"new a: {b}, new b: {a}")
        a, b = b, a
    i += 1

print(res)

### Take away
- Easy and simple syntax (we assign as done on paper)
- `if`, `else`, `elif`
- `while`, `break`, `continue`
- `print` is your best friend

In [None]:
# To reuse a piece of logic, you need building blocks: here, a function
# Iterative version
def gcd1(a, b):
    a, b = max(a, b), min(a, b)
    while True:
        a = a - b
        if a == 0:
            return b
        elif a < b:
            a, b = b, a

In [None]:
gcd1(12, 9)

### Take away
- Functions: an important building block to reuse and factor out logic
- Classes are the level above (they capture a concept: data plus behavior)
- Modules (`.py` files) and packages (hierarchies of modules) group functions and classes so they can be shared / published (PyPI)
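
To make the step from function to class concrete, here is a small sketch. The `Ratio` class is hypothetical and exists only for this illustration; it uses `math.gcd` from the standard library so that the snippet is self-contained.

```python
import math


class Ratio:
    """Hypothetical example class: a fraction kept in lowest terms."""

    def __init__(self, numerator, denominator):
        if denominator == 0:
            raise ValueError("denominator must not be zero")
        # Reduce the fraction once, when the object is created.
        divisor = math.gcd(numerator, denominator)
        self.numerator = numerator // divisor
        self.denominator = denominator // divisor

    def __repr__(self):
        return f"Ratio({self.numerator}, {self.denominator})"


r = Ratio(20, 6)
print(r)
```

In real code, `fractions.Fraction` from the standard library already covers this need.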

In [None]:
# Improvement: we avoid `while True` in production code (a mistake in the exit condition means an infinite loop)
# A function can call itself (recursion)

def _gcd2(s, b):
    while b >= s:
        b -= s
    if b == 0:
        return s
    else:
        return _gcd2(b, s)


def gcd2(a, b):
    return _gcd2(min(a, b), max(a, b))

In [None]:
gcd2(20, 6)

In [None]:
# More readable / better use of the language (modulo operator, multiple assignment)
# No `while True` here either: the loop has an explicit exit condition

def gcd3(a, b):
    s, b = min(a, b), max(a, b)
    while s != 0:
        s, b = b % s, s
    return b

### Take away
- Python is a very rich language (operators, built-in functions, multiple assignment)
- Code readability is an important quality criterion (we spend more time maintaining code than writing it)
- A standard (PEP 8) and tools exist to check and format the code

In [None]:
import math

math.gcd(20, 6)

### Take away
- Before starting a development, check whether a library already does it
- As Software Engineers, Data Engineers or Data Scientists, our job is more to find the best combination of components than to create new things

In [None]:
import random


N = 100000

random.seed()
A = [random.randint(1, 1000000000) for i in range(N)]
B = [random.randint(1, 1000000000) for i in range(N)]

In [None]:
import timeit


def bench(f):
    def ff():
        r = [f(a, b) for a, b in zip(A, B)]
        print(f"gcd({A[0]}, {B[0]}) = {r[0]}")
        return r
    t = timeit.timeit(ff, number=1)
    print(f"exec time : {t:.2f}s")

### Take away
- String formatting (f-strings)
- List comprehensions (the same syntax exists for dicts)
- Nested functions
- Builtins and the standard library are very powerful (`min`, `max`, `zip`, `timeit`)
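
The three language features listed above can be tried in isolation. This is a minimal sketch; `make_multiplier` and the other names are made up for the example.

```python
# String formatting with f-strings (format spec after the colon).
ratio = 2 / 3
formatted = f"ratio = {ratio:.2f}"
print(formatted)

# List comprehension and its dict counterpart.
squares_list = [n * n for n in range(5)]
squares_dict = {n: n * n for n in range(5)}
print(squares_list, squares_dict)


# Nested function: `inner` remembers `factor` from the enclosing call.
def make_multiplier(factor):
    def inner(x):
        return x * factor
    return inner


double = make_multiplier(2)
print(double(21))
```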

In [None]:
bench(gcd1)

In [None]:
bench(gcd2)

In [None]:
bench(gcd3)

In [None]:
bench(math.gcd)

### Take away
- The closer you stay to the standard library, the better the performance
- In data engineering, we must be vigilant because performance quickly becomes critical

In [None]:
import random


N = 10000000

random.seed()
A = [random.randint(1, 1000000000) for i in range(N)]
B = [random.randint(1, 1000000000) for i in range(N)]

In [None]:
def log(a, b, r):
    print(f"gcd({a}, {b}) = {r}")


def naive():
    R = [math.gcd(a, b) for a, b in zip(A, B)]
    for a, b, r in zip(A, B, R):
        log(a, b, r)
        break


def lazy():
    R = (math.gcd(a, b) for a, b in zip(A, B))
    for a, b, r in zip(A, B, R):
        log(a, b, r)
        break

In [None]:
print(f"naive exec time : {timeit.timeit(naive, number=1):.2f}s")
print()
print(f"lazy exec time : {timeit.timeit(lazy, number=1):.2f}s")

### Take away
- There are "lazy" data structures that avoid intermediate storage
- They consist of iterators, and chains of iterators, that iterate over existing data
- Very common pattern in Python (iterating over database records, aggregating CSV files larger than memory)
- Compatible with builtins and the standard library (`sorted`, `zip`, `enumerate`, etc.)
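
A smaller sketch makes the difference between eager and lazy structures visible without a long benchmark. `N_DEMO` is an arbitrary size chosen for the demonstration, and `itertools.islice` is used to read only the first items.

```python
import sys
from itertools import islice

N_DEMO = 100_000  # small, hypothetical size chosen for the demonstration

eager = [n * n for n in range(N_DEMO)]     # all values are stored
lazy_gen = (n * n for n in range(N_DEMO))  # values are produced on demand

# The list grows with N_DEMO; the generator object stays small.
print(sys.getsizeof(eager), sys.getsizeof(lazy_gen))

# islice reads only the first 3 items from the generator.
first_three = list(islice(lazy_gen, 3))
print(first_three)

# A generator is consumed as it is read: the next item continues from there.
next_value = next(lazy_gen)
print(next_value)
```

The last two lines show the main trade-off: a generator can only be traversed once.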

## Other data structures provided by external libraries

## Numpy

### Take away
- Python's basic data structures are not optimized for numerical computation
- The `Numpy` package is the basis for all numerical computation libraries in Python
    - Uniformly typed block data structure (the array)
    - Flexibility to shape the data to your needs
    - Rich library of operations to manipulate these arrays
    - Utilities to build arrays (timeseries, random, range, etc.)
    - Utilities to load input data from files and to serialize output data
- Nice introduction : http://datacamp-community-prod.s3.amazonaws.com/da466534-51fe-4c6d-b0cb-154f4782eb54
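
As a bridge with the GCD exercise, the sketch below applies NumPy's element-wise `np.gcd` to two whole arrays in a single call and cross-checks the result against `math.gcd`. The array size and the seed are arbitrary choices for the example.

```python
import math
import numpy as np

rng = np.random.default_rng(0)           # seeded for reproducibility
A_arr = rng.integers(1, 1_000_000_000, size=1_000)
B_arr = rng.integers(1, 1_000_000_000, size=1_000)

# One call computes the GCD of every pair, element by element.
R_arr = np.gcd(A_arr, B_arr)

# Cross-check against the pure-Python version on the same data.
expected = [math.gcd(int(a), int(b)) for a, b in zip(A_arr, B_arr)]
print(R_arr[:5], R_arr.tolist() == expected)
```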

In [None]:
! pip install numpy
import numpy as np

In [None]:
help(np)

In [None]:
l1_to_array = [[1,2], [3,4], [5,6]]
a = np.array(l1_to_array)
print(a)
print(a.dtype)
print(a.shape)

In [None]:
a.reshape((2, 3))

In [None]:
l2_to_array = [[7,8], [9,10], [11,12]]
b = np.array(l2_to_array)
print(b)

In [None]:
print(a)

In [None]:
print(a+b)

In [None]:
print(a*b)

In [None]:
print(a/b)

In [None]:
a.dot(b.reshape(2, 3))

In [None]:
#a[3] = 1  # No resizing (careful): index 3 is out of bounds and raises IndexError

In [None]:
a[0,0] = 0
print(a)

In [None]:
a[0] = 1.4  # Careful: values are silently cast to the array's integer dtype (1.4 is stored as 1)
print(a)

### Take away
- Numpy is a real Swiss army knife for numerical processing while keeping the flexibility of Python
- Be careful to stay within Numpy's internal structures (go through the Numpy API rather than converting back to Python objects)
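
The casting pitfall and the advice to stay within the NumPy API can be sketched as follows. The array values are the same kind of toy data as above; `astype` and `sum` are standard array methods.

```python
import numpy as np

ints = np.array([[1, 2], [3, 4], [5, 6]])
ints[0] = 1.4                  # stored as 1: the array keeps its integer dtype
print(ints.dtype, ints[0])

# Converting explicitly to float keeps the fractional part.
floats = ints.astype(float)
floats[0] = 1.4
print(floats.dtype, floats[0])

# Staying inside the NumPy API: a vectorized column sum instead of a Python loop.
column_sums = floats.sum(axis=0)
print(column_sums)
```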

In [None]:
#! pip install numpy
#! pip install matplotlib
import numpy as np
import matplotlib.pyplot as plt


x1 = np.arange(0.0, 2.0, 0.1)
y1 = np.sin(2 * np.pi * x1)

x2 = np.arange(0.0, 2.0, 0.01)
y2 = np.sin(2 * np.pi * x2)

fig, ax = plt.subplots()

ax.plot(x1, y1, color="red")
ax.plot(x2, y2, color="blue")
ax.grid()

plt.show()

### Linear regression

In [None]:
#! pip install scikit-learn
from sklearn.linear_model import LinearRegression

In [None]:
# Define the raw data
x = np.array([0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50]).reshape((-1, 1))
y = np.array([5, 10, 17, 18, 20, 14, 18, 32, 22, 38, 47])
plt.scatter(x,y)

In [None]:
model = LinearRegression()

In [None]:
model = model.fit(x, y)

In [None]:
y_prediction = model.predict(x)

In [None]:
plt.scatter(x, y)
plt.plot(x, y_prediction, c="orange", label="prediction")
plt.legend()  # display the "prediction" label

### Take away
- Very powerful libraries (here `matplotlib`) exist to visualize `numpy` arrays

### Pandas
Pandas is an open-source tool for data manipulation, analysis and cleaning. It is well suited to different kinds of data and to managing multiple columns with different types, such as:

- Tabular data with heterogeneously-typed columns
- Ordered and unordered time series data
- Arbitrary matrix data with row & column labels
- Unlabelled data
- Any other form of observational or statistical data sets

Nice introduction : https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf

### Basic functionalities

Pandas is built on top of NumPy and makes it possible to:
- Structure incoming raw data from databases, files, etc.
- Handle multiple operations on data sets, such as slicing, filtering, grouping, etc.
- Group data for aggregations and transformations
- Perform data cleaning and handle missing values
- Preprocess data before modeling
- Structure the output predictions of models
- Export output data to databases, files, etc.

In [None]:
#! pip install pandas
import pandas as pd


#### Python Pandas Operations

Using pandas, you can perform many operations on series and data frames: handling missing data, grouping, etc. Some of the common operations for data manipulation are demonstrated below.

##### Create a dataframe with input file

In [None]:
## Import data if we have csv, excel files in the same folder

# excel file
#df = pd.read_excel("my_excel_file.xlsx")
# csv file
#df = pd.read_csv("my_csv_file.csv")

In [None]:
# use a csv file hosted on the internet
# read all medals of the Winter Olympics between 1924 and 2006
df = pd.read_csv('http://winterolympicsmedals.com/medals.csv')

In [None]:
df

##### Create a dataframe from a dictionary

In [None]:
dict_to_df = {
    "City": ["Chamonix", "Turin", "Paris"],
    "Sport": ["Skating", "Bobsleigh", "Ice Hockey"],
    "Medal": ["Silver", "Gold", "Bronze"]
}

In [None]:
df_new = pd.DataFrame(dict_to_df)

In [None]:
df_new

#### Viewing data

In [None]:
num = 10

In [None]:
df.head(num)

In [None]:
df.tail(num)

#### Add new columns

In [None]:
import random

In [None]:
df["random_num"] = [random.randint(1,10) for i in range(len(df))]

In [None]:
df

#### Get stats

In [None]:
df.describe()

#### Sort df by specific column(s)

In [None]:
df.sort_values(by="random_num")

In [None]:
df.sort_values(by="random_num", ascending=False)

In [None]:
df.sort_values(by=["Sport", "NOC"])

#### Filter data

In [None]:
df.loc[:, ["Year", "City"]]

In [None]:
df.loc[0:20, ["Year", "City"]]

In [None]:
df[df["Year"] > 2000]

In [None]:
df[(df["Year"] > 2000) & (df["Medal"] == "Gold")]

In [None]:
df[df["Medal"].isin(["Gold", "Silver"])]

In [None]:
df[~df["Medal"].isin(["Bronze"])]

#### Aggregate data

In [None]:
df.groupby(["Year"])["Medal"].count().to_frame()

In [None]:
df.groupby(["Year", "Sport"])["Medal"].count().to_frame()

In [None]:
# aggregations
df.groupby(["Year"]).agg({"Event": "nunique", "NOC": "nunique", "Medal": "count"})

#### Missing data

In [None]:
df_2 = df.copy()
df_2["fill_missing_data"] = np.nan

In [None]:
df_2

In [None]:
df_2.fillna(value=0)
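
One detail worth checking on a small example: `fillna` returns a new DataFrame and leaves the original untouched unless the result is assigned. The `scores` table below is hypothetical and only serves this illustration.

```python
import numpy as np
import pandas as pd

# Small hypothetical table with one missing value.
scores = pd.DataFrame({"team": ["a", "b", "c"], "points": [3.0, np.nan, 1.0]})

filled = scores.fillna(value=0)                   # returns a new DataFrame
missing_before = int(scores["points"].isna().sum())  # the original is unchanged
print(missing_before)
print(filled["points"].tolist())

# Alternative: drop the incomplete rows instead of filling them.
complete = scores.dropna()
print(len(complete))
```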

#### Export data

In [None]:
df_2.to_excel("new_export.xlsx")

#### More complex commands

In [None]:
# number of medals won by each country, for each medal category
df.pivot_table(index='NOC', columns='Medal', values='Event', aggfunc='count')

In [None]:
(df
     .groupby(["NOC", "Medal"])
     .agg({"Event": "count"})
     .reset_index(level=[1])
     .pivot(columns="Medal")
     .fillna(0)
)

In [None]:
# 1st version

df_filter = df.loc[df.NOC.isin(['AUT', 'FRA', 'CHN', 'USA', 'FIN']), :].reset_index(drop=True)
df_pivot = df_filter.pivot_table(index='Year', columns='NOC', values='Medal', aggfunc='count')
df_pivot = df_pivot.fillna(0)
df_pivot.plot()

#### To improve code readability, we use Pandas method chaining.
Method chaining is a programming style in which several method calls are invoked in sequence, each call performing an action on the object returned by the previous one.
- To deep dive : https://towardsdatascience.com/the-unreasonable-effectiveness-of-method-chaining-in-pandas-15c2109e3c69

In [None]:
# 2nd version

(
    df
    .loc[df.NOC.isin(['AUT', 'FRA', 'CHN', 'USA', 'FIN'])]
    .pivot_table(index='Year', columns='NOC', values='Medal', aggfunc='count')
    .fillna(0)
    .plot()
)
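
The same chaining style can be tried without the remote CSV. The sketch below builds a hypothetical miniature of the medals table (same column names, invented rows) and counts gold medals per country in one chained expression.

```python
import pandas as pd

# Hypothetical miniature of the medals table (same column names, invented rows).
medals = pd.DataFrame({
    "Year": [2002, 2002, 2006, 2006, 2006],
    "NOC": ["USA", "FRA", "USA", "USA", "AUT"],
    "Medal": ["Gold", "Silver", "Gold", "Bronze", "Gold"],
})

gold_per_country = (
    medals
    .loc[lambda d: d["Medal"] == "Gold"]      # keep only gold medals
    .groupby("NOC")["Medal"]                  # group by country
    .count()                                  # count medals per group
    .sort_values(ascending=False)             # largest first
)
print(gold_per_country)
```

Passing a `lambda` to `.loc` lets the filter refer to the intermediate result rather than to the original variable, which keeps the chain self-contained.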

### Numpy or Pandas ?

The main differences between Numpy and Pandas lie in their data structures, memory consumption, and usage.

- Numpy mainly works with numerical data, whereas Pandas works with tabular data.
- The data structures in Pandas are the Series (1 dimension) and the DataFrame (2 dimensions); the 3-dimensional Panel has been removed from recent versions. Numpy arrays can have up to n dimensions.
- Numpy consumes less memory than Pandas.
- As a rule of thumb reported in some benchmarks (see the link below), Pandas performs better on data with 500K rows or more, whereas Numpy performs better with 50K rows or fewer.
- Pandas is more widely used in industry than Numpy.
- Good read : http://gouthamanbalaraman.com/blog/numpy-vs-pandas-comparison.html
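
The memory point can be checked roughly on a single integer column. This is only a sketch: the array size is arbitrary, and the exact overhead of a DataFrame depends on the pandas version and on the index type.

```python
import numpy as np
import pandas as pd

values = np.arange(1_000, dtype=np.int64)
frame = pd.DataFrame({"value": values})

array_bytes = values.nbytes                                 # raw buffer only
frame_bytes = int(frame.memory_usage(index=True).sum())     # column plus index

print(array_bytes, frame_bytes)
```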