
# Intro
This file is meant to get you familiar with the **"proper"** way of working with Python code. It covers how to install the necessary packages, how to create a requirements.txt file for others, how to document and properly format your code, and more.

[About markdown](https://www.markdownguide.org/basic-syntax/)

## Packages

Python comes with a package manager called *pip*. However, it is not recommended to install a project's requirements (all the packages needed to run the code) into the default Python interpreter. For that reason, a handy tool became popular: [anaconda](https://anaconda.org/anaconda/python). Install it on your own.

Anaconda essentially allows you to have multiple Python interpreters installed on the same machine, each with different versions and packages. Each interpreter is wrapped in what's called an environment. As high-level programmers, we'll use the bash commands that anaconda provides to manage them. On Windows, most commands are the same in cmd, although I recommend the "Anaconda Prompt (Anaconda3)" application, which is installed automatically alongside anaconda; it is easier to start with.

A properly installed anaconda adds a *"(base)"* prefix to your shell's PS1 variable:
```bash
(base) username@hostname:directory$
```

1. Create a new environment with Python 3.11 and the "ipykernel" package installed:
   ```bash
   conda create --name satelity python=3.11 ipykernel
   ```
2. Activate it, which means we are now using the Python interpreter inside this environment:
   ```bash
   conda activate satelity
   ```
3. After steps 1 and 2, PS1 should indicate that we're using the "satelity" environment, like this:
   ```bash
   (satelity) username@hostname:directory$
   ```
   You can run *conda info* to double check.
4. If you use VS Code and have it in your path, you can now run
   ```bash
   code project_directory
   ```
   and VS Code will automatically use that interpreter. Otherwise, you will need to select it manually (upper-right corner of VS Code, look for *select kernel*).
5. From now on, we can run shell commands from Jupyter's Python cells by adding a *!* in front of the command.

In [1]:
# example: run a shell command from a Python cell
!python --version

Python 3.11.6
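
The same check can be done from Python itself. The sketch below is a minimal helper (the name `describe_interpreter` is made up for illustration) that reports which interpreter the notebook is running on. It assumes that conda exports the `CONDA_DEFAULT_ENV` variable on activation; the variable may be unset if the kernel was started some other way.

```python
import os
import sys


def describe_interpreter() -> dict:
    """Collect basic facts about the interpreter running this code."""
    return {
        # full path of the python executable, e.g. .../envs/satelity/bin/python
        "executable": sys.executable,
        # "major.minor.micro" version string
        "version": ".".join(str(part) for part in sys.version_info[:3]),
        # name of the active conda environment, or None when the variable is not set
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }


for key, value in describe_interpreter().items():
    print(f"{key}: {value}")
```

If the `executable` path does not point inside the "satelity" environment, the notebook is most likely using a different kernel than the one you activated.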


In [2]:
# Let's import some libraries
try:
    import numpy as np
except ImportError as e:  # raised when the package is not installed
    print("Error: ", e)

Error:  No module named 'numpy'


In [3]:
# So let's install it
!pip install numpy

Collecting numpy
  Using cached numpy-1.26.1-cp311-cp311-macosx_11_0_arm64.whl.metadata (115 kB)
Using cached numpy-1.26.1-cp311-cp311-macosx_11_0_arm64.whl (14.0 MB)
Installing collected packages: numpy
Successfully installed numpy-1.26.1
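
A plain `!pip` runs whichever pip comes first on the shell's path, which is usually, but not necessarily, the one that belongs to the notebook's interpreter. A more defensive variant, sketched below with hypothetical helper names, checks whether a module can be imported and builds a pip command bound to the current interpreter.

```python
import importlib.util
import sys


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


def pip_install_command(package: str) -> list:
    """Build a pip command that targets the interpreter running this code."""
    return [sys.executable, "-m", "pip", "install", package]


if not is_installed("numpy"):
    # pass the list to subprocess.run(..., check=True) to actually install
    print(" ".join(pip_install_command("numpy")))
```

The command is only printed here, so running the cell has no side effects.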


In [4]:
# let's install and import some more
!pip install pandas matplotlib seaborn scikit-learn

Collecting pandas
  Using cached pandas-2.1.1-cp311-cp311-macosx_11_0_arm64.whl.metadata (18 kB)
Collecting matplotlib
  Using cached matplotlib-3.8.0-cp311-cp311-macosx_11_0_arm64.whl.metadata (5.8 kB)
Collecting seaborn
  Using cached seaborn-0.13.0-py3-none-any.whl.metadata (5.3 kB)
Collecting scikit-learn
  Using cached scikit_learn-1.3.2-cp311-cp311-macosx_12_0_arm64.whl.metadata (11 kB)
Collecting pytz>=2020.1 (from pandas)
  Using cached pytz-2023.3.post1-py2.py3-none-any.whl.metadata (22 kB)
Collecting tzdata>=2022.1 (from pandas)
  Using cached tzdata-2023.3-py2.py3-none-any.whl (341 kB)
Collecting contourpy>=1.0.1 (from matplotlib)
  Using cached contourpy-1.1.1-cp311-cp311-macosx_11_0_arm64.whl.metadata (5.9 kB)
Collecting cycler>=0.10 (from matplotlib)
  Using cached cycler-0.12.1-py3-none-any.whl.metadata (3.8 kB)
Collecting fonttools>=4.22.0 (from matplotlib)
  Using cached fonttools-4.43.1-cp311-cp311-macosx_10_9_universal2.whl.metadata (152 kB)
Collecting kiwisolver>=1.

In [5]:
# you don't need to understand this

import numpy as np
from sklearn.cluster import HDBSCAN
from sklearn.datasets import load_digits
from sklearn.metrics import v_measure_score

X, true_labels = load_digits(return_X_y=True)
print(f"number of digits: {len(np.unique(true_labels))}")

hdbscan = HDBSCAN(min_cluster_size=15).fit(X)
non_noisy_labels = hdbscan.labels_[hdbscan.labels_ != -1]
print(f"number of clusters found: {len(np.unique(non_noisy_labels))}")

print(v_measure_score(true_labels[hdbscan.labels_ != -1], non_noisy_labels))

number of digits: 10
number of clusters found: 10
0.9751818034688537


Now, when someone wants to use our code, they will need several packages. To make their life easier, we can save the package names and let future users install them all automatically. We use the *pipreqsnb* package to do that.

In [6]:
!pip install pipreqsnb

Collecting pipreqsnb
  Using cached pipreqsnb-0.2.4-py3-none-any.whl
Collecting pipreqs (from pipreqsnb)
  Using cached pipreqs-0.4.13-py2.py3-none-any.whl (33 kB)
Collecting docopt (from pipreqs->pipreqsnb)
  Using cached docopt-0.6.2-py2.py3-none-any.whl
Collecting yarg (from pipreqs->pipreqsnb)
  Using cached yarg-0.1.9-py2.py3-none-any.whl (19 kB)
Collecting requests (from yarg->pipreqs->pipreqsnb)
  Using cached requests-2.31.0-py3-none-any.whl.metadata (4.6 kB)
Collecting charset-normalizer<4,>=2 (from requests->yarg->pipreqs->pipreqsnb)
  Using cached charset_normalizer-3.3.1-cp311-cp311-macosx_11_0_arm64.whl.metadata (33 kB)
Collecting idna<4,>=2.5 (from requests->yarg->pipreqs->pipreqsnb)
  Using cached idna-3.4-py3-none-any.whl (61 kB)
Collecting urllib3<3,>=1.21.1 (from requests->yarg->pipreqs->pipreqsnb)
  Using cached urllib3-2.0.7-py3-none-any.whl.metadata (6.6 kB)
Collecting certifi>=2017.4.17 (from requests->yarg->pipreqs->pipreqsnb)
  Using cached certifi-2023.7.22-py3

In [7]:
# this will scan all .py/.ipynb files in current directory,
# check what packages you use (and their versions) and create requirements.txt
!pipreqsnb --savepath requirements.txt .

pipreqs  --savepath requirements.txt .
Please, verify manually the final list of requirements.txt to avoid possible dependency confusions.
Please, verify manually the final list of requirements.txt to avoid possible dependency confusions.
INFO: Successfully saved requirements file in requirements.txt
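
To give an idea of what such a scan involves, here is a minimal sketch that collects the top-level module names imported by a piece of source code. It is not how pipreqs is actually implemented; the function name `imported_modules` is made up for this example, and it ignores notebooks as well as the mapping from import names to package names (for example, `sklearn` is distributed as scikit-learn).

```python
import ast


def imported_modules(source: str) -> set:
    """Return the top-level names of all modules imported in `source`."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # "import os.path" -> "os"
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # "from sklearn.cluster import HDBSCAN" -> "sklearn"; relative imports are skipped
            modules.add(node.module.split(".")[0])
    return modules


example = "import numpy as np\nfrom sklearn.cluster import HDBSCAN\nimport os.path\n"
print(sorted(imported_modules(example)))  # prints ['numpy', 'os', 'sklearn']
```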


In [8]:
# note: it may list more packages than we imported here, because it scans the entire directory
!cat requirements.txt

numpy==1.26.1
pandas==2.1.1
scikit_learn==1.3.2


In [9]:
# now, we can install them all at once, reading from file
!pip install -r requirements.txt
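
After installing, it can be useful to confirm that the environment really matches the file. The sketch below is one possible way to do it with the standard library only; the helper names are hypothetical, it handles only the simple `name==version` lines shown above, and the requirements are given as a string so the example does not depend on any file.

```python
from importlib import metadata
from typing import Optional


def parse_requirements(text: str) -> dict:
    """Map package names to pinned versions for lines of the form name==version."""
    pins = {}
    for line in text.splitlines():
        line = line.split("#")[0].strip()  # drop comments and whitespace
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip()] = version.strip()
    return pins


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


requirements = "numpy==1.26.1\npandas==2.1.1\nscikit_learn==1.3.2\n"
for name, wanted in parse_requirements(requirements).items():
    print(f"{name}: wanted {wanted}, installed {installed_version(name)}")
```

To check a real file, read it with `open("requirements.txt").read()` and pass the result to `parse_requirements`.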



## Code formatting
[Here](https://pep8.org) is how you should format your code. But who'd want to spend time on that... We will use a package that formats the code for you or checks whether it is properly formatted.

In [10]:
!pip install jupyter-black

Collecting jupyter-black
  Using cached jupyter_black-0.3.4-py3-none-any.whl (8.5 kB)
Collecting black>=21 (from jupyter-black)
  Using cached black-23.10.1-py3-none-any.whl.metadata (66 kB)
Collecting tokenize-rt>=4 (from jupyter-black)
  Using cached tokenize_rt-5.2.0-py2.py3-none-any.whl.metadata (4.1 kB)
Collecting click>=8.0.0 (from black>=21->jupyter-black)
  Using cached click-8.1.7-py3-none-any.whl.metadata (3.0 kB)
Collecting mypy-extensions>=0.4.3 (from black>=21->jupyter-black)
  Using cached mypy_extensions-1.0.0-py3-none-any.whl (4.7 kB)
Collecting pathspec>=0.9.0 (from black>=21->jupyter-black)
  Using cached pathspec-0.11.2-py3-none-any.whl.metadata (19 kB)
Using cached black-23.10.1-py3-none-any.whl (184 kB)
Using cached tokenize_rt-5.2.0-py2.py3-none-any.whl (5.8 kB)
Using cached click-8.1.7-py3-none-any.whl (97 kB)
Using cached pathspec-0.11.2-py3-none-any.whl (29 kB)
Installing collected packages: tokenize-rt, pathspec, mypy-extensions, click, black, jupyter-black
Su

In [11]:
!black

Usage: black [OPTIONS] SRC ...

One of 'SRC' or 'code' is required.


In [12]:
# we will use it on a single file; passing "." instead
# would reformat all files in the current directory

# check if something is wrong
!black --diff --color bad_formatting.py

--- bad_formatting.py	2023-10-25 22:22:23.551670+00:00
+++ bad_formatting.py	2023-10-25 22:26:00.492596+00:00
@@ -6,12 +6,11 @@
             print('x is greater than y')
 else:
     print('x is not greater than y')
 """
 
-x=5
-y=10
-if x>y:
-            print('x is greater than y')
+x = 5
+y = 10
+if x > y:
+    print("x is greater than y")
 else:
-    print('x is not greater than y')
-
+    print("x is not greater than y")
would reformat bad_formatting.py

All done! ✨ 🍰 ✨
1 file would be reformatted.


In [13]:
# fix it
!black bad_formatting.py

reformatted bad_formatting.py

All done! ✨ 🍰 ✨
1 file reformatted.


## Code documentation
In order to scale a project well, code documentation is required. There are ways to generate it automatically from comments, but that requires a certain comment style. Below is a fairly simple example; get familiar with the basics [here](https://www.sphinx-doc.org/en/master/usage/restructuredtext/index.html).

In [14]:
from typing import List, Optional

import pandas as pd

DATASETS_FOLDER = "datasets"


def load_flights(
    years: str | List[str] = "all",
    cols: Optional[List[str]] = None,
    dir: str = DATASETS_FOLDER,
) -> pd.DataFrame:
    """
    Loads flight data into memory

    .. warning:: This function does nothing

    :param years: "all" for all available data, or a List of str from {"1987", ..., "2008"} for specific years
    :param cols: desired columns to be loaded, if None entire data is loaded
    :param dir: target data directory
    :returns: DataFrame with loaded data
    :raises ValueError: if years is not "all" or a List[str] from {"1987", ..., "2008"}
    """

    pass
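
A docstring is most useful when the code keeps the promises it makes. As an illustration, the sketch below shows how the documented `ValueError` contract for `years` could be enforced. The helper `validate_years` and the constant `VALID_YEARS` are hypothetical and not part of the function above; they only follow what its docstring describes.

```python
from typing import List, Union

VALID_YEARS = [str(year) for year in range(1987, 2009)]  # "1987" ... "2008"


def validate_years(years: Union[str, List[str]] = "all") -> List[str]:
    """
    Checks the ``years`` argument and expands it to a list of years

    :param years: "all" or a List of str from {"1987", ..., "2008"}
    :returns: List of the selected years
    :raises ValueError: if years is not "all" or a List[str] from {"1987", ..., "2008"}
    """
    if isinstance(years, str):
        if years == "all":
            return list(VALID_YEARS)
        raise ValueError(f'years must be "all" or a list of years, got {years!r}')
    unknown = [year for year in years if year not in VALID_YEARS]
    if unknown:
        raise ValueError(f"unknown years: {unknown}")
    return list(years)


print(validate_years(["1987", "2008"]))  # ['1987', '2008']
print(len(validate_years("all")))  # 22
```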