# Intro to Python and Jupyter

## 1. Quick Jupyter Notebook guide

Use the **arrow keys** to move up and down between cells
***

Use **Enter** to activate a cell (Edit mode) and **ESC** to deactivate it (Command mode)
***

Use **CTRL+Enter** to execute a cell (render Markdown or run Python code)
***

While a cell is deactivated (you are in Command mode), use **M** to turn it into a Markdown cell or **Y** to turn it back into a code (Python) cell
___

**A** adds a cell above and **B** adds a cell below (**ALT+Enter** executes the current cell and inserts a new one below)
___

Delete a cell using **DD** (press **D** twice)
___

More shortcuts pop up after pressing **H** (or **CTRL+SHIFT+H**)

---

Interactive command palette - **CTRL+SHIFT+C**

## 2. Markdown guide

[Markdown Cheatsheet](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet)

## 3. Python intro

### 3.1. IPython

**Getting help**

In [None]:
help(len)

In [None]:
len.__doc__

In [None]:
# IPython extended help function
?len

In [None]:
def square(a):
    """Return the square of a."""
    return a**2

In [None]:
# Access source code
square??
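Outside IPython the `?` and `??` shortcuts are not available, but the standard-library `inspect` module exposes similar information. A minimal sketch, reusing the `square` function from above:

```python
import inspect


def square(a):
    """Return the square of a."""
    return a**2


# Roughly what `square?` shows: the call signature and the docstring
signature = str(inspect.signature(square))
doc = inspect.getdoc(square)
print(f"square{signature}")
print(doc)
```

`inspect.getsource(square)` is the closest counterpart of `square??`; it only works when Python can read the defining source back, for example from a file or a notebook cell.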

**Magic functions**

In [None]:
# Measure the average execution time of a statement
%timeit L = [n ** 2 for n in range(1000)]
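In a plain Python script the same measurement can be approximated with the standard-library `timeit` module. The sketch below is an assumption-laden stand-in for `%timeit`: the loop and repeat counts are fixed by hand (the magic chooses them automatically), and `build_squares` is a hypothetical helper name.

```python
import timeit


def build_squares():
    return [n ** 2 for n in range(1000)]


# Run 5 rounds of 1000 calls each; each entry is the total time of one round
round_times = timeit.repeat(build_squares, number=1000, repeat=5)
# Best per-call time in seconds (the minimum is least disturbed by other processes)
best_per_call = min(round_times) / 1000
print(f"best of 5: {best_per_call * 1e6:.1f} microseconds per call")
```

Unlike `%timeit`, which reports a mean and standard deviation, this sketch reports the fastest round.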

In [None]:
#Help on magic functions
%magic

In [None]:
%%latex
$$x_i=\sum^N_{i=1}{e^i}$$

**Shell functions**

In [None]:
%pwd

In [None]:
# %cd - change directory
# %ls - list
# %cp - copy e.g. cp file1.csv folder/.
# %mkdir - make directory e.g. mkdir results
# %rm - remove file (or directory) e.g. rm python.py, rm -r folder
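The shell-style magics above only work inside IPython. In a regular script, the same file operations can be sketched with `pathlib` and `shutil`. The file and folder names below are hypothetical, and everything happens in a temporary directory so nothing on disk is touched:

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)

    results = root / "results"
    results.mkdir()                      # like: %mkdir results

    src = root / "file1.csv"
    src.write_text("a,b\n1,2\n")
    shutil.copy(src, results)            # like: %cp file1.csv results/.

    listing = sorted(p.name for p in results.iterdir())  # like: %ls results
    print(listing)

    src.unlink()                         # like: %rm file1.csv
    shutil.rmtree(results)               # like: %rm -r results
    remaining = sorted(p.name for p in root.iterdir())
```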

In [None]:
!git

In [None]:
!curl --help
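The `!` prefix hands the rest of the line to the system shell. A rough plain-Python counterpart is `subprocess.run`. The sketch below calls the current Python interpreter instead of `git` or `curl`, only so that it does not depend on which tools are installed:

```python
import subprocess
import sys

# Run an external command and capture what it prints
result = subprocess.run(
    [sys.executable, "--version"],
    capture_output=True,
    text=True,
    check=True,  # raise CalledProcessError on a non-zero exit code
)
version_line = result.stdout.strip()
print(version_line)
```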

### 3.2. ML in Python

#### Python resources
Official tutorial

https://docs.python.org/3/tutorial/

'Think' Series by Allen B. Downey

https://greenteapress.com/wp/think-python-2e/


Free programming books on GoalKicker:

https://goalkicker.com/PythonBook/

Online courses: DataCamp, Coursera, Codecademy

**Data Science and ML in Python**

Scikit-learn documentation

https://scikit-learn.org/stable/

Seaborn documentation

https://seaborn.pydata.org/

Python Data Science Handbook

https://jakevdp.github.io/PythonDataScienceHandbook/

Medium platform blogs (e.g. Towards Data Science)

**Modules for Data Science in Python**

**NumPy** - provides numerical array data structures and related utilities (e.g. linear algebra tools) https://numpy.org/

**Pandas** - Python DataFrames + reading/writing datasets https://pandas.pydata.org/

**Matplotlib/Seaborn** - plotting, data visualization https://matplotlib.org/

**Scikit-learn** - ML models, evaluation metrics, preprocessing

**'Production' ML project lifecycle (from "ML Engineering" by Andriy Burkov)**
![](./ML_Project_Cycle.PNG)

## 4. Overfitting and dataset splits

Splitting the data into a training set and a validation set helps to detect overfitting and gives a more realistic estimate of performance, because the model is evaluated on data it has never seen before.

[Related to Bias vs. Variance Tradeoff (Underfitting vs. Overfitting)](https://jakevdp.github.io/PythonDataScienceHandbook/05.03-hyperparameters-and-model-validation.html#The-Bias-variance-trade-off)

![](https://jakevdp.github.io/PythonDataScienceHandbook/figures/05.03-validation-curve.png)

In [None]:
%pip install numpy matplotlib scikit-learn --upgrade

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import random
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline, Pipeline

In [None]:
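def PolynomialReg(X: np.ndarray, y: np.ndarray, degree: int = 2) -> Pipeline:
    """Fit a polynomial regression of the given degree on 1-D inputs X."""
    # scikit-learn expects a 2-D feature array, so reshape X into one column
    return make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(
        X.reshape(-1, 1), y
    )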
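To see what the `PolynomialFeatures` step feeds into the linear model, it can be applied on its own. A small sketch on two made-up inputs (`x_demo` is hypothetical, not the notebook's data):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x_demo = np.array([2.0, 3.0])  # hypothetical inputs
# Each row becomes [1, x, x**2]; the leading 1 is the bias column
expanded = PolynomialFeatures(degree=2).fit_transform(x_demo.reshape(-1, 1))
print(expanded)
```

Polynomial regression is therefore ordinary linear regression on these expanded columns, which is why a high degree gives the model so much flexibility.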

In [None]:
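def true_process_with_noise(x: np.ndarray, seed: int = 123) -> np.ndarray:
    # Seed NumPy's global generator so the noise is reproducible
    np.random.seed(seed)
    # Logarithmic trend plus Gaussian noise with standard deviation 0.4
    return 2 * np.log(x) + 0.4 * np.random.normal(size=x.size) + 2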
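Seeding through `np.random.seed` changes NumPy's global random state as a side effect. An alternative sketch uses a local `np.random.default_rng` generator instead. `noisy_log_process` is a hypothetical name, and because the generator differs, it produces different noise values than the function above:

```python
import numpy as np


def noisy_log_process(x: np.ndarray, seed: int = 123) -> np.ndarray:
    # A local generator leaves NumPy's global random state untouched
    rng = np.random.default_rng(seed)
    return 2 * np.log(x) + 0.4 * rng.normal(size=x.size) + 2


x_demo = np.linspace(2, 10, 5)
first = noisy_log_process(x_demo)
second = noisy_log_process(x_demo)           # same seed, same noise
different = noisy_log_process(x_demo, seed=7)
print(np.array_equal(first, second))
```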

In [None]:
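n = 40
random.seed(24)
# Hold out 20% of the indices as the test set
test = random.sample(range(n), int(n * 0.2))
# Boolean mask: True for training points, False for test points
mask = np.ones(n, bool)
mask[test] = False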

In [None]:
x = np.linspace(2, 10, n)
# No extra seed is needed here: the noise is seeded inside true_process_with_noise
y = true_process_with_noise(x)
X_train = x[mask]
y_train = y[mask]
X_test = x[~mask]
y_test = y[~mask]
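The manual mask-based split above can also be written with scikit-learn's `train_test_split`. This is a sketch rather than a drop-in replacement: it regenerates similar data with its own generator, uses different variable names (`X_tr`, `X_te`, ...) to avoid overwriting the notebook's arrays, and selects different held-out points than the mask does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

x_all = np.linspace(2, 10, 40)
rng = np.random.default_rng(123)
y_all = 2 * np.log(x_all) + 0.4 * rng.normal(size=x_all.size) + 2

# Hold out 20% of the points; random_state makes the shuffle reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    x_all, y_all, test_size=0.2, random_state=24
)
print(X_tr.size, X_te.size)
```

Note that `train_test_split` shuffles by default, so the returned arrays are not sorted by `x`; sort them before drawing line plots.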

In [None]:
plt.plot(X_train, y_train, "o", label="train")
plt.plot(X_test, y_test, "go", label="test")
plt.legend();

In [None]:
# Train polynomial regression with degree 1 and 20
model1 = PolynomialReg(X_train, y_train, 1)
model20 = PolynomialReg(X_train, y_train, 20)
# Predict values on train and test data
pred_train1 = model1.predict(X_train.reshape(-1, 1))
pred_train20 = model20.predict(X_train.reshape(-1, 1))
pred_test1 = model1.predict(X_test.reshape(-1, 1))
pred_test20 = model20.predict(X_test.reshape(-1, 1))

In [None]:
# Training data
fig, ax = plt.subplots(1, 2, figsize=(16, 4))
ax[0].plot(X_train, y_train, "o")
ax[0].plot(X_train, pred_train1)
ax[0].set_title("Degree 1")
ax[1].plot(X_train, y_train, "o")
ax[1].plot(X_train, pred_train20)
ax[1].set_title("Degree 20");

In [None]:
# Test data
fig, ax = plt.subplots(1, 2, figsize=(16, 4))
ax[0].plot(X_test, y_test, "o")
ax[0].plot(X_test, pred_test1)
ax[0].plot(X_test, [np.mean(y_test)] * X_test.size, "k--")
ax[0].set_title("Degree 1")
ax[1].plot(X_test, y_test, "o")
ax[1].plot(X_test, [np.mean(y_test)] * X_test.size, "k--")
ax[1].plot(X_test, pred_test20)
ax[1].set_title("Degree 20");

In [None]:
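def R_squared(y_pred, y_true):
    """Return the coefficient of determination as a percentage string."""
    y_mean = np.mean(y_true)
    ssres = np.sum((y_true - y_pred) ** 2)  # residual sum of squares
    sstot = np.sum((y_true - y_mean) ** 2)  # total sum of squares
    r_square = 1 - ssres / sstot
    return str(round(r_square * 100, 1)) + "%"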
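The same quantity is available as `sklearn.metrics.r2_score`. The sketch below checks a numeric version of the formula against it on a few made-up values. `r_squared_value` is a hypothetical helper, and note that `r2_score` takes its arguments in the opposite order (`y_true` first).

```python
import numpy as np
from sklearn.metrics import r2_score


def r_squared_value(y_pred, y_true):
    # R^2 = 1 - SS_res / SS_tot, returned as a plain float
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot


y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])   # made-up targets
y_pred_demo = np.array([2.5, 5.5, 7.0, 9.5])   # made-up predictions

manual = r_squared_value(y_pred_demo, y_true_demo)
library = r2_score(y_true_demo, y_pred_demo)
print(manual, library)
```

R² can be negative on test data: that happens whenever the model predicts worse than the constant mean of `y_true`, which is the dashed line in the test plots.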

In [None]:
print("R^2 degree 1 on training data:", R_squared(pred_train1, y_train))
print("R^2 degree 20 on training data:", R_squared(pred_train20, y_train))
print("R^2 degree 1 on test data:", R_squared(pred_test1, y_test))
print("R^2 degree 20 on test data:", R_squared(pred_test20, y_test))
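Comparing only degrees 1 and 20 shows the two extremes. A sweep over several degrees gives a rough version of the validation curve pictured earlier. The sketch below is self-contained and makes its own assumptions: it regenerates similar data with a local generator, holds out every fifth point, and fits with `numpy.polynomial.Polynomial.fit` rather than the scikit-learn pipeline.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
x = np.linspace(2, 10, 40)
y = 2 * np.log(x) + 0.4 * rng.normal(size=x.size) + 2

# Hypothetical split: every fifth point (8 of 40) is held out for testing
train_mask = np.ones(x.size, dtype=bool)
train_mask[np.arange(2, x.size, 5)] = False


def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot


scores = {}
for degree in (1, 2, 5, 10):
    # Polynomial.fit rescales x internally, which keeps high degrees stable
    poly = Polynomial.fit(x[train_mask], y[train_mask], deg=degree)
    scores[degree] = (
        r2(y[train_mask], poly(x[train_mask])),
        r2(y[~train_mask], poly(x[~train_mask])),
    )
    print(f"degree {degree:2d}: train R^2 = {scores[degree][0]:.3f}, "
          f"test R^2 = {scores[degree][1]:.3f}")
```

The training score can only improve as the degree grows, so the degree should be chosen where the test (validation) score is highest, not where the training score is highest.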

Handling overfitting:
- Gather more records (rows)
- Gather/produce more features (columns)
- Use a less powerful (less flexible) model
- Use special data preparation or training techniques (balancing data, cross-validation)
- **Use regularization techniques**
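As one illustration of the last point, ridge regression adds an L2 penalty on the coefficients to the least-squares objective. The sketch below compares an unpenalized and a ridge-penalized polynomial fit. The data, the degree (10), the scaling step, and `alpha=1.0` are arbitrary choices for the example, not values taken from the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
x = np.linspace(2, 10, 40)
y = 2 * np.log(x) + 0.4 * rng.normal(size=x.size) + 2
X = x.reshape(-1, 1)


def make_model(regressor, degree=10):
    # Scale the polynomial features so the penalty treats them comparably
    return make_pipeline(
        PolynomialFeatures(degree, include_bias=False),
        StandardScaler(),
        regressor,
    )


plain = make_model(LinearRegression()).fit(X, y)
ridge = make_model(Ridge(alpha=1.0)).fit(X, y)

# The penalty shrinks the coefficient vector of the final estimator
plain_norm = np.linalg.norm(plain.steps[-1][1].coef_)
ridge_norm = np.linalg.norm(ridge.steps[-1][1].coef_)
print(f"coefficient norm without penalty: {plain_norm:.3g}")
print(f"coefficient norm with ridge:      {ridge_norm:.3g}")
```

Smaller coefficients give a smoother curve that follows the noise less closely. In practice `alpha` is usually tuned with cross-validation, for example with `RidgeCV`.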