# Setting up your environment
In the PHM data challenge, you will perform tasks like data analysis and training machine learning models. Therefore, you need to set up a suitable environment. This guide will walk you through installing the essential tools and editors.

# The tools we need:
When dealing with data and training machine learning models, you have many tools to choose from. However, a few key components are essential, and you will need them to complete your tasks.


## Editors
A text editor is essential for writing code. You can choose any editor you like, such as Vim/Emacs, Notepad++, Sublime Text, or a modern IDE from JetBrains. Choose the one you are most familiar and comfortable with, as this will be your primary workspace for writing, debugging, and running code.

If you do not have a preference, I recommend [Visual Studio Code (VS Code)](https://code.visualstudio.com/). It is free, cross-platform, built on an open-source codebase, and offers excellent support for various coding tasks through its extensive library of extensions.

***Installing VS Code*** is straightforward: download the latest version from the [official website](https://code.visualstudio.com/) and run the installer. You will also need a few extensions that provide language support. For example, to add Python support, open the 'Extensions' view, search for 'Python', and install the official extension published by Microsoft.

![The Python extension for VS Code](./imgs/vscode_plugins.png)

Now that your text editor is ready, let's install the programming language toolchain for our data analysis and machine learning tasks.

## Programming environment
Today, many toolchains are available for data science and machine learning. One of the most popular is [**Anaconda**](https://www.anaconda.com/). I strongly recommend it unless you are already proficient with MATLAB and its machine learning packages.

The Anaconda distribution is an all-in-one solution for data science. It includes the `conda` package and environment manager, the Python programming language, and many essential data science libraries.
 
There are two common options for installing a Conda-based environment:
- **Full Anaconda Distribution**: This installs the `conda` manager, Python, and a large collection of commonly used libraries. It is convenient, but the full installation requires a significant amount of disk space (around 3-4 GB).
- **Miniconda**: This installs only the `conda` manager, Python, and a few essential packages. This approach saves disk space while still providing the full power of `conda`, but it requires you to install other libraries manually. This is a great option if you are familiar with `conda` and `pip` commands.

![Anaconda Distribution and Miniconda](./imgs/anaconda&miniconda.png)

Alternatively, you can install Python directly from the [**Python Official Website**](https://www.python.org/). With this approach, you would typically use `pip` to manage your libraries and a tool like `venv` (built-in) or the newer `uv` to manage virtual environments.
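
Whichever route you choose, it helps to confirm which interpreter is actually running your code, especially when several Python installations coexist on one machine. The sketch below is one way to do that using only the standard library. `describe_environment` is a hypothetical helper name, not part of Conda or `venv`, and the Conda check assumes that `conda activate` has set the `CONDA_DEFAULT_ENV` variable, as it normally does.

```python
import os
import platform
import sys


def describe_environment():
    """Return basic facts about the Python interpreter running this code."""
    return {
        "executable": sys.executable,
        "version": platform.python_version(),
        # Inside a venv, sys.prefix points at the environment, while
        # sys.base_prefix points at the installation it was created from.
        "in_venv": sys.prefix != sys.base_prefix,
        # Assumption: `conda activate` sets CONDA_DEFAULT_ENV.
        # None means no Conda environment appears to be active.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

If the printed executable path is not the one you expect, your editor or terminal is probably using a different interpreter from the one you installed your libraries into.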

## Essential Libraries
To analyze data and train machine learning models, you will need several libraries. The full **Anaconda** distribution includes most of them, but if you use **Miniconda** or another Python installation, you will need to install them manually.

To check your installed libraries, you can type the `conda list` command in your terminal. To check for a specific package, use `conda list <package_name>`.
 
The following libraries are essential. If they are not already installed, you need to install them using `conda install` or `pip install`.
- `numpy`: The fundamental package for N-dimensional arrays in Python. It is significantly faster than standard Python lists for numerical operations.
- `scipy`: A core library for scientific computing that provides modules for linear algebra, optimization, interpolation, and more.
- `pandas`: A library providing high-performance, easy-to-use data structures (like the DataFrame) and data analysis tools. It's essential for loading, cleaning, and manipulating structured data.
- `matplotlib` & `seaborn`: `matplotlib` is the standard plotting library in Python. `seaborn` is built on top of it and provides a higher-level, more user-friendly interface for creating attractive statistical graphics.
- `scikit-learn` (imported as `sklearn`): The most popular library for classical machine learning. It provides a wide range of algorithms for classification, regression, clustering, and more.
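
To see which of these libraries are already present in the active environment, you can also ask Python itself. The following is a minimal sketch, assuming the packages were installed with `pip` or `conda` in the usual way so that their metadata is recorded. `installed_versions` and `REQUIRED` are hypothetical names chosen for this example.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as used by `pip install` / `conda install`.
REQUIRED = ["numpy", "scipy", "pandas", "matplotlib", "seaborn", "scikit-learn"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(REQUIRED).items():
        print(f"{name:<14}{ver or 'NOT INSTALLED'}")
```

Any package reported as missing needs to be installed before you continue.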

You can install these libraries with one of the following commands:
```bash
# Using conda
conda install numpy scipy pandas scikit-learn matplotlib seaborn

# Or using pip
pip install numpy scipy pandas scikit-learn matplotlib seaborn
```
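
Once the installation finishes, a quick end-to-end run is a reasonable way to confirm that the libraries work together. The sketch below is an illustration only: it trains a small classifier on synthetic data generated on the spot, not on the challenge data, and `run_smoke_test` is a hypothetical helper name. It exercises `numpy`, `pandas`, and `scikit-learn`; the plotting libraries and `scipy` are not covered.

```python
def run_smoke_test(n_samples=200, seed=0):
    """Train a tiny classifier on synthetic data and return its test accuracy."""
    # Imports live inside the function so that a missing library raises
    # an ImportError naming the package when the function is called.
    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(seed)
    half = n_samples // 2

    # Two well-separated 2-D Gaussian clusters, one per class (synthetic data).
    features = np.vstack([
        rng.normal(loc=-2.0, scale=1.0, size=(half, 2)),
        rng.normal(loc=2.0, scale=1.0, size=(half, 2)),
    ])
    labels = np.array([0] * half + [1] * half)

    # Round-trip through a DataFrame to exercise pandas as well.
    frame = pd.DataFrame(features, columns=["x1", "x2"])
    frame["label"] = labels

    x_train, x_test, y_train, y_test = train_test_split(
        frame[["x1", "x2"]], frame["label"], test_size=0.25, random_state=seed
    )
    model = LogisticRegression().fit(x_train, y_train)
    return model.score(x_test, y_test)
```

Calling `run_smoke_test()` returns a value between 0 and 1. Because the two clusters are far apart, an accuracy close to 1 suggests the stack is working; an `ImportError` tells you which library still needs to be installed.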

For deep learning, you will need to install a specialized framework. The most prominent libraries are **PyTorch** and **TensorFlow**. TensorFlow is known for its robust production deployment capabilities, while PyTorch is often praised for its flexibility and ease of use in research.

For this data challenge, I recommend **PyTorch**. When installing it, you must pay close attention to GPU support for hardware acceleration.
- **NVIDIA GPU**: First, check your NVIDIA driver version and the highest CUDA version it supports by running `nvidia-smi` in your terminal (Command Prompt or PowerShell). Then, visit the [PyTorch website](https://pytorch.org/get-started/locally/), select the appropriate options for your system, and run the generated command.
- **AMD GPU**: Installing [PyTorch with AMD GPU support (ROCm)](https://rocm.docs.amd.com/projects/radeon/en/latest/docs/install/wsl/install-pytorch.html) can be challenging, especially on Windows. It is often easier to work on a Linux machine or use a computer in our laboratory.

![CUDA version check](./imgs/cuda.png)
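
If you prefer to run the driver check from Python, for example inside a notebook, the sketch below calls `nvidia-smi` through `subprocess`. It is a minimal example under stated assumptions: `query_nvidia_gpus` is a hypothetical helper name, and the function only reports the GPU name and driver version. The CUDA version shown in the header of the plain `nvidia-smi` output is the highest version the driver supports, and you still need to read it from there.

```python
import shutil
import subprocess


def query_nvidia_gpus():
    """Return one 'name, driver_version' string per GPU, or None if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None  # nvidia-smi is not on PATH, so no NVIDIA driver tools were found.
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            check=True,
            timeout=10,
        )
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired, OSError):
        return None  # The tool exists but could not report any GPU.
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    gpus = query_nvidia_gpus()
    if gpus is None:
        print("nvidia-smi not found or failed; no NVIDIA driver detected.")
    else:
        for index, info in enumerate(gpus):
            print(f"GPU {index}: {info}")
```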

After installing PyTorch, you can verify that it can detect and use your GPU by running the following Python code. This script checks for CUDA availability and prints details about the GPU if one is found.

```python
import torch

# Check CUDA support; this must be True if you want to use CUDA for hardware acceleration.
cuda_available = torch.cuda.is_available()
print(f"can torch utilise my GPU? {cuda_available}")

# The remaining queries raise an error without a usable GPU, so guard them.
if cuda_available:
    # Number of GPUs visible to PyTorch
    print(f"Number of GPUs: {torch.cuda.device_count()}")
    # Index of the GPU currently in use
    current = torch.cuda.current_device()
    print(f"Current GPU index: {current}")
    # Name of that GPU
    print(f"GPU Name: {torch.cuda.get_device_name(current)}")
```

Example output:

```text
can torch utilise my GPU? True
Number of GPUs: 1
Current GPU index: 0
GPU Name: NVIDIA GeForce GTX 1070
```
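
Detecting a GPU is not quite the same as computing on it. As a further check, the sketch below runs a small matrix multiplication on the GPU when one is available and falls back to the CPU otherwise. `matmul_on_best_device` is a hypothetical helper name, and the matrix size is an arbitrary choice for illustration.

```python
def matmul_on_best_device(size=256):
    """Multiply two random matrices on the GPU if available, else on the CPU."""
    # Imported here so that a missing PyTorch installation raises
    # an ImportError when the function is called.
    import torch

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    a = torch.randn(size, size, device=device)
    b = torch.randn(size, size, device=device)
    product = a @ b
    return device.type, tuple(product.shape)
```

Calling `matmul_on_best_device()` returns the device type that was used (`"cuda"` or `"cpu"`) together with the shape of the result. The same `torch.device(...)` pattern is a common way to write code that runs on machines with and without a GPU.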
