(python-for-ml)=
# Python libraries

This page briefly describes selected Python packages and libraries that are useful for data analysis and machine learning.

```{tip}
In most cases, installing a Python package only requires running `pip install <package_name>` in the terminal.
```

````{admonition} The main tool
:class: note
To install [Jupyter](https://jupyter.org/), run
```
pip install notebook
```

or

```
pip install jupyterlab
```

````

## Cloud resources

You can also run Jupyter notebooks in the cloud. Here are several popular services:

* [Google Colab](https://colab.research.google.com/)
* [Kaggle notebooks](https://www.kaggle.com/code)
* [Binder](https://mybinder.org/)
* [DataSphere](https://yandex.cloud/ru/services/datasphere)

## Data Analysis

```{figure} https://miro.medium.com/v2/resize:fit:1400/1*2EHqvZVV4qNjRqrHiBK9-A.png
:align: center
```

### Pandas

[Pandas](https://pandas.pydata.org/docs/) is a very popular library for data manipulation and analysis, providing data structures like [DataFrames](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) for handling structured data effectively. 

```python
import pandas as pd

pd.read_csv("../datasets/ISLP/Publication.csv").drop("Unnamed: 0", axis=1)
```

|  | posres | multi | clinend | mech | sampsize | budget | impact | time | status |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | R01 | 39876 | 8.016941 | 44.016 | 11.203285 | 1 |
| 1 | 0 | 0 | 1 | R01 | 39876 | 8.016941 | 23.494 | 15.178645 | 1 |
| 2 | 0 | 0 | 1 | R01 | 8171 | 7.612606 | 8.391 | 24.410678 | 1 |
| 3 | 0 | 0 | 1 | Contract | 24335 | 11.771928 | 15.402 | 2.595483 | 1 |
| 4 | 0 | 0 | 1 | Contract | 33357 | 76.517537 | 16.783 | 8.607803 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 239 | 0 | 0 | 0 | R01 | 4105 | 2.703653 | 5.355 | 65.018480 | 1 |
| 240 | 1 | 0 | 0 | R44 | 181 | 1.117084 | 0.000 | 66.989733 | 0 |
| 241 | 0 | 0 | 0 | K23 | 104 | 0.472321 | 0.000 | 9.987680 | 0 |
| 242 | 0 | 0 | 0 | R21 | 69 | 0.404710 | 0.000 | 21.979466 | 0 |
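
The example above depends on a local CSV file. As a self-contained sketch of typical DataFrame operations, the snippet below builds a small table by hand, filters it, and aggregates by group. The values are made up for illustration and are not taken from the ISLP dataset; only the column names are borrowed from it.

```python
import pandas as pd

# Hypothetical toy data (assumption): NOT the ISLP Publication dataset.
df = pd.DataFrame(
    {
        "mech": ["R01", "R01", "Contract", "K23"],
        "budget": [8.0, 7.5, 11.5, 0.5],
        "status": [1, 1, 1, 0],
    }
)

# Boolean mask: keep only the rows with status == 1.
published = df[df["status"] == 1]

# Split by funding mechanism and average the budget within each group.
mean_budget = published.groupby("mech")["budget"].mean()
print(mean_budget)
```

The result is a `Series` indexed by `mech`, so individual groups can be looked up with `mean_budget["R01"]`.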


### Polars

[Polars](https://pola-rs.github.io/polars/) is a DataFrame library written in Rust and designed for high-performance data analysis; it is often a faster alternative to `pandas`.

```{figure} https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c144f4a-7b53-4ba3-82d2-4e6b2b9811e6_2103x1962.png
:align: center
```

```python
import polars as pl

pl.read_csv("../datasets/ISLP/Publication.csv").drop("")
```

| posres | multi | clinend | mech | sampsize | budget | impact | time | status |
|---|---|---|---|---|---|---|---|---|
| i64 | i64 | i64 | str | i64 | f64 | f64 | f64 | i64 |
| 0 | 0 | 1 | "R01" | 39876 | 8.0169405 | 44.016 | 11.203285 | 1 |
| 0 | 0 | 1 | "R01" | 39876 | 8.0169405 | 23.494 | 15.178645 | 1 |
| 0 | 0 | 1 | "R01" | 8171 | 7.612606 | 8.391 | 24.410678 | 1 |
| 0 | 0 | 1 | "Contract" | 24335 | 11.771928 | 15.402 | 2.595483 | 1 |
| 0 | 0 | 1 | "Contract" | 33357 | 76.517537 | 16.783 | 8.607803 | 1 |
| … | … | … | … | … | … | … | … | … |
| 0 | 0 | 0 | "R01" | 4105 | 2.703653 | 5.355 | 65.01848 | 1 |
| 1 | 0 | 0 | "R44" | 181 | 1.117084 | 0.0 | 66.989733 | 0 |
| 0 | 0 | 0 | "K23" | 104 | 0.472321 | 0.0 | 9.98768 | 0 |
| 0 | 0 | 0 | "R21" | 69 | 0.40471 | 0.0 | 21.979466 | 0 |


### SQL

**SQL** (**S**tructured **Q**uery **L**anguage) is essential for managing and querying relational databases, which are foundational in many data-driven applications. SQL enables efficient data retrieval, manipulation, and storage, making it crucial for everything from small-scale projects to enterprise-level applications.

Top database engines (according to the [Stack Overflow Developer Survey 2024](https://survey.stackoverflow.co/2024/)):

```{figure} ../images/stackoverflow-dev-survey-2024-database.png
:align: center
```

In Python, several libraries offer robust support for interacting with SQL databases, allowing developers to seamlessly integrate SQL operations into their code.

* The [sqlite3](https://docs.python.org/3/library/sqlite3.html) module provides a built-in interface to SQLite databases, making it well suited to small and medium-sized applications

* [SQLAlchemy](https://www.sqlalchemy.org/) allows developers to work with databases using Python objects, abstracting away the complexity of raw SQL while supporting a wide range of databases

* [Pandas SQL I/O](https://pandas.pydata.org/docs/user_guide/io.html#sql-queries) functions such as `pandas.read_sql` and `DataFrame.to_sql` read from and write to SQL databases, integrating SQL operations with Pandas DataFrames for data analysis

* [PyMySQL](https://pymysql.readthedocs.io/en/latest/) is a pure-Python MySQL client library that enables Python applications to connect to MySQL and MariaDB databases, execute SQL queries, and manage database connections

* [Psycopg2](https://www.psycopg.org/docs/) is a popular PostgreSQL adapter for Python
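
As a minimal sketch of running SQL from Python, the snippet below uses the standard-library `sqlite3` module with an in-memory database. The table name `grants` and its rows are hypothetical and chosen only for illustration.

```python
import sqlite3

# An in-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table and rows (assumption), used only for illustration.
conn.execute("CREATE TABLE grants (mech TEXT, budget REAL)")
conn.executemany(
    "INSERT INTO grants VALUES (?, ?)",
    [("R01", 8.0), ("R01", 7.5), ("Contract", 11.5)],
)

# Parameterized query: the ? placeholder keeps values out of the SQL string.
rows = conn.execute(
    "SELECT mech, COUNT(*), AVG(budget) FROM grants "
    "WHERE budget > ? GROUP BY mech ORDER BY mech",
    (5.0,),
).fetchall()
conn.close()
print(rows)
```

The other drivers in the list follow the same DB-API pattern of connecting, executing, and fetching, although their placeholder syntax may differ.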

## Data Visualization

### Matplotlib

[Matplotlib](https://matplotlib.org/stable/) is a versatile plotting library that allows for the creation of static, animated, and interactive visualizations in Python.
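
A minimal sketch of a static plot with the object-oriented Matplotlib interface. The `Agg` backend and the in-memory buffer are assumptions made so that the snippet runs without a display; in a notebook the figure would normally be rendered inline.

```python
import io

import matplotlib

matplotlib.use("Agg")  # assumption: headless backend, no GUI window needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

# One figure with a single set of axes.
fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(x, np.sin(x), label="sin(x)")
ax.plot(x, np.cos(x), label="cos(x)", linestyle="--")
ax.set_xlabel("x")
ax.set_ylabel("value")
ax.legend()

# Render to PNG bytes; pass a file name instead to write an image file.
buf = io.BytesIO()
fig.savefig(buf, format="png", dpi=100)
```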

### Seaborn 
[Seaborn](https://seaborn.pydata.org/) is built on top of Matplotlib and provides a high-level interface for creating attractive and informative statistical graphics.

### Plotly

[Plotly](https://plotly.com/python/) is an interactive visualization library that supports a wide range of chart types and offers tools for creating web-based visualizations.

### Geopandas

[Geopandas](https://geopandas.org/en/stable/) extends Pandas to handle spatial data, making it easy to work with geographic datasets and create geographic visualizations.

### Altair

[Altair](https://altair-viz.github.io/) is a declarative statistical visualization library that allows users to build complex visualizations using simple syntax, based on Vega and Vega-Lite.

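```python
import altair as alt
import pandas as pd

source = pd.read_csv("../datasets/ISLP/Auto.csv")

# Interval selection that is dragged along the x-axis
brush = alt.selection_interval(encodings=['x'])

# Scatter plot: points outside the selection are drawn in light gray
points = alt.Chart(source).mark_point().encode(
    x='horsepower:Q',
    y='mpg:Q',
    size='acceleration',
    color=alt.condition(brush, 'origin:N', alt.value('lightgray'))
).add_params(brush)

# Bar chart of counts per origin, restricted to the selected points
bars = alt.Chart(source).mark_bar().encode(
    y='origin:N',
    color='origin:N',
    x='count(origin):Q'
).transform_filter(brush)

# Stack the two charts vertically
points & bars
```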

## Mathematics

```{figure} https://ucarecdn.com/809f51d2-0f2b-490e-8f92-e18193629245/
:align: center
```

### NumPy
[NumPy](https://numpy.org/doc/) offers support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays efficiently.
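
A small sketch of vectorized array operations: centering the columns of a matrix via broadcasting and then forming a matrix product. The numbers are made up for illustration.

```python
import numpy as np

# A 3x2 matrix: three samples with two features each (made-up values).
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# Broadcasting subtracts the length-2 vector of column means from every row.
centered = X - X.mean(axis=0)

# Matrix product: the 2x2 Gram matrix of the centered data.
gram = centered.T @ centered
print(gram)
```

No explicit Python loop is needed; the arithmetic is applied to whole arrays at once.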

### SciPy
[SciPy](https://docs.scipy.org/doc/scipy/) builds on NumPy by adding a suite of algorithms for optimization, integration, statistics, and other tasks in scientific computing.
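
Two of the task families mentioned above, optimization and integration, can be sketched as follows. The objective function `f` is a hypothetical example chosen because its minimum is known in advance.

```python
import numpy as np
from scipy import integrate, optimize


def f(p):
    """Hypothetical objective with its minimum at (1, -2)."""
    x, y = p
    return (x - 1) ** 2 + (y + 2) ** 2


# Unconstrained minimization starting from the origin (default solver).
result = optimize.minimize(f, x0=np.zeros(2))

# Numerical integration of sin(x) over [0, pi]; the exact value is 2.
area, abs_err = integrate.quad(np.sin, 0, np.pi)
print(result.x, area)
```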

### SymPy
[SymPy](https://docs.sympy.org/latest/index.html) is a Python library for symbolic mathematics, providing tools for algebraic manipulation, calculus, and other mathematical operations.
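
A brief sketch of symbolic differentiation, integration, and equation solving. The expressions are arbitrary examples.

```python
import sympy as sp

x = sp.symbols("x")

# Product rule applied symbolically.
derivative = sp.diff(x**2 * sp.sin(x), x)

# Indefinite integral (SymPy omits the constant of integration).
antiderivative = sp.integrate(2 * x, x)

# Roots of the quadratic x^2 - 5x + 6.
roots = sp.solve(x**2 - 5 * x + 6, x)
print(derivative, antiderivative, roots)
```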

### NetworkX
[NetworkX](https://networkx.org/documentation/stable/) is a library for the creation, manipulation, and study of complex networks and graphs, with tools for analyzing their structure and behavior.
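
A minimal sketch of building a weighted graph and querying its structure. The node names and weights are made up for illustration.

```python
import networkx as nx

# Undirected graph with hypothetical weighted edges.
G = nx.Graph()
G.add_weighted_edges_from(
    [("A", "B", 1.0), ("B", "C", 2.0), ("A", "C", 5.0), ("C", "D", 1.0)]
)

# Weighted shortest path between two nodes and its total weight.
path = nx.shortest_path(G, "A", "D", weight="weight")
length = nx.shortest_path_length(G, "A", "D", weight="weight")

# Number of neighbors of each node.
degrees = dict(G.degree())
print(path, length, degrees)
```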

## Classical machine learning

```{figure} https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSQUepHxnnIWJ_ptY9-wX6Mj-RIssE39vnevw&s
:align: center
```

### Scikit-Learn
[scikit-learn](https://scikit-learn.org/stable/) (imported as `sklearn`) is a widely used library that implements many classical machine learning algorithms.
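
Most estimators in the library share the same `fit` / `predict` / `score` interface. The sketch below trains a classifier on the bundled Iris dataset; the choice of logistic regression, the split ratio, and the random seed are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Hold out a quarter of the samples for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scaling and the classifier are chained into a single estimator.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}")
```

Swapping in another estimator, for example a random forest, would only change the last step of the pipeline.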

### XGBoost
[XGBoost](https://xgboost.readthedocs.io/en/stable/) is an optimized gradient boosting library designed to deliver high performance, particularly for decision tree algorithms.

### CatBoost
[CatBoost](https://catboost.ai/docs/) is a gradient boosting library that is particularly strong in handling categorical features and is optimized for performance on structured datasets.

## Deep learning

```{figure} https://miro.medium.com/v2/resize:fit:1400/0*T6W0rRy8vgFU_K7Z.png
:align: center
```

### PyTorch
[PyTorch](https://pytorch.org/docs/stable/index.html) is a deep learning framework that emphasizes flexibility and ease of use, particularly for research, with dynamic computation graphs.
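
To illustrate dynamic computation graphs, the sketch below fits a line by plain gradient descent, letting autograd compute the gradients. The data, learning rate, and step count are assumptions chosen for illustration; real training code would normally use `torch.nn` modules and an optimizer from `torch.optim`.

```python
import torch

# Made-up data following y = 2x + 1.
x = torch.linspace(-1, 1, 20).unsqueeze(1)
y = 2 * x + 1

# Parameters tracked by autograd.
w = torch.zeros(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

for _ in range(200):
    # The graph is rebuilt on every forward pass.
    loss = ((x * w + b - y) ** 2).mean()
    loss.backward()  # fills w.grad and b.grad

    # Manual gradient step, excluded from graph tracking.
    with torch.no_grad():
        w -= 0.1 * w.grad
        b -= 0.1 * b.grad
        w.grad.zero_()
        b.grad.zero_()

print(w.item(), b.item())
```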

### TensorFlow
[TensorFlow](https://www.tensorflow.org/guide) is an end-to-end open-source platform for machine learning, particularly deep learning, offering a comprehensive ecosystem of tools and libraries.

### Keras
[Keras](https://keras.io/) is a high-level neural networks API designed for rapid development and experimentation with deep learning models; since Keras 3 it runs on top of TensorFlow, JAX, or PyTorch.

### JAX
[JAX](https://jax.readthedocs.io/en/latest/) is a library for high-performance numerical computing and machine learning research that combines automatic differentiation with just-in-time compilation for CPUs, GPUs, and TPUs.

## Computer vision

```{figure} https://www.repeato.app/wp-content/uploads/2024/06/AI-computer-vision-automation.jpg
:align: center
```

### OpenCV
[OpenCV](https://docs.opencv.org/) (Open Source Computer Vision Library) is a comprehensive and widely used library that provides tools for real-time computer vision, image processing, and machine learning.

### Scikit-image
[skimage](https://scikit-image.org/docs/stable/) is a collection of algorithms for image processing, built on top of SciPy, that provides easy-to-use tools for segmentation, filtering, morphology, and other computer vision tasks.

### Pillow
[Pillow](https://pillow.readthedocs.io/en/stable/) is a Python Imaging Library (PIL) fork that adds image processing capabilities to your Python interpreter, enabling tasks like opening, manipulating, and saving many different image file formats.

### Detectron2
[Detectron2](https://detectron2.readthedocs.io/) is a high-performance library developed by Facebook AI Research for object detection, segmentation, and other computer vision tasks, built on top of PyTorch.

## Natural language processing (NLP)

```{figure} https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQ9-VgaTI779jH8wtJpiJPcqYlke_4KNLIn2Q&s
:align: center
```

### NLTK
[NLTK (Natural Language Toolkit)](https://www.nltk.org/) is a comprehensive library for natural language processing, offering tools for text processing, tokenization, parsing, and more.

### Gensim
[Gensim](https://radimrehurek.com/gensim/) is used for topic modeling, word embeddings, and document similarity, with implementations for various NLP algorithms.

### FastText
[FastText](https://fasttext.cc/) is a library developed by Facebook for efficient text classification and learning of word representations, with a focus on scalability.

### Transformers
[Transformers](https://huggingface.co/docs/transformers/index) is a library from Hugging Face that provides implementations of state-of-the-art NLP models, including BERT, GPT, and others.

### OpenAI API
The [OpenAI API](https://platform.openai.com/docs/introduction) provides Python bindings for accessing OpenAI's GPT models and other large language models, enabling a wide range of NLP applications.

### LangChain
[LangChain](https://python.langchain.com/) is a framework for building applications that leverage language models, with tools for prompt engineering, memory management, and more.

## Reinforcement Learning

```{figure} https://www.kdnuggets.com/wp-content/uploads/awan_reinforcement_learning_newbies_1.png
:align: center
```

### Stable Baselines3

[Stable Baselines3](https://stable-baselines3.readthedocs.io/en/master/) is a set of implementations of popular reinforcement learning algorithms, built on top of PyTorch, with a focus on usability and reproducibility.

### Gymnasium

[Gymnasium](https://gymnasium.farama.org/), the maintained fork of OpenAI Gym, is a toolkit for developing and comparing reinforcement learning algorithms, offering a variety of environments.

### Ray RLlib

[Ray RLlib](https://docs.ray.io/en/latest/rllib/index.html) is a scalable library for reinforcement learning that integrates with the Ray distributed computing framework, offering a wide range of RL algorithms.

### OpenAI Baselines

[OpenAI Baselines](https://github.com/openai/baselines) is a collection of high-quality implementations of various reinforcement learning algorithms, designed for performance and ease of use.