# Beginner Series Tutorial 2: NumPy and Tabular Data

In this tutorial we will go over the fundamentals of loading, manipulating, saving and visualizing tabular data using three of the most popular Python libraries for the job: `numpy`, `pandas` and `matplotlib`. By the end, you should be comfortable with:
1. Using `numpy` to _efficiently_ process numeric data
2. Understanding `pandas`, a tabular-data library built on top of `numpy`, and what it is useful for

# Introduction

## What is `numpy`?

The name `numpy` (NumPy) stands for **num**erical **py**thon. You can find its excellent documentation [here](https://numpy.org/doc/stable/). It is a general-purpose, open-source library for storing and computing with numerical data. I will outline a few of its features as described in the [first pages of its documentation](https://numpy.org/doc/stable/user/whatisnumpy.html).

1. NumPy facilitates the creation of fixed-size arrays of the _same datatype_.
2. NumPy has advanced mathematical operations built in. These are implemented efficiently in pre-compiled code.
3. NumPy supports _vectorization_, allowing for fewer lines of code and faster execution.

It is also worth noting that NumPy is extremely well-tested, actively maintained, and serves as the foundation for almost _all_ numerical scientific code written in Python. If you are doing numerical science in Python, you're either using NumPy or using a library that uses NumPy.
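As a rough illustration of the "same datatype" and vectorization points above, here is a minimal sketch. The data and the variable names (`values`, `squares_loop`, `squares_vec`, `mixed`) are made up for this example and are not part of the tutorial's dataset.

```python
import numpy as np

# Hypothetical example data (an assumption for illustration only).
values = [1.5, 2.0, 3.5, 4.0]

# Pure-Python approach: loop over the list one element at a time.
squares_loop = [x * x for x in values]

# NumPy approach: a single vectorized expression over the whole array.
arr = np.array(values)
squares_vec = arr * arr

# Both approaches give the same numbers.
print(np.allclose(squares_vec, squares_loop))  # True

# Every element of an array shares one datatype. Mixing an integer and a
# float in the input produces an array in which both are stored as floats.
mixed = np.array([1, 2.5])
print(arr.dtype)    # float64
print(mixed.dtype)  # float64
```

The vectorized line has no explicit loop: the looping happens inside NumPy's compiled code, which is where the speed-up on large arrays comes from.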

## What is the point of this tutorial?

This tutorial serves as one of a few intended to get everyone interested in the AI/ML series on the same page. If you're a beginner, we hope that you'll benefit from the overview of basic concepts, and the references to various documentation pages which can provide more detail where this tutorial is sparse. If you're not a beginner, we hope that this tutorial might help you use NumPy more efficiently, and possibly teach you something new! Regardless, keep the following in mind:

> The bread and butter of PyTorch (the library we'll be using for machine learning applications) is the `Tensor` object. This object closely mirrors the NumPy array: much of the syntax is the same, and the two can be converted into one another. A solid, fundamental understanding of NumPy is critical to understanding the machine learning code we will be discussing in future tutorials.

Also, keep the following in mind during the tutorial:
1. The notebook is meant to be a standalone document. While we will present this live, everything you'll need should be self-contained in the notebook, so don't worry if you miss something.
2. Everything we present here will be _fast_ unless stated otherwise. What do we mean by fast? Essentially, it's "as optimized as possible" in Python.

## Importing/installing NumPy

NumPy is not part of the Python standard library (the modules bundled with every Python installation, such as `os`). It can be installed with `pip install numpy`. It comes pre-installed on Google Colab, so all we need to do there is import it. We use `np` as shorthand for NumPy (a near-universal convention).

In [2]:
import numpy as np

# The basics

The core of NumPy is the `ndarray` object (see [here](https://numpy.org/doc/stable/user/absolute_beginners.html#more-information-about-arrays)). This stands for "N-dimensional array" and is basically a container for numerical data. It's easiest to see this by example. For instance, here is an `ndarray` object of one dimension (a vector):

In [6]:
v = np.array([1, 2, 3])
v  # Note that Jupyter Notebooks allow for "rendering" by simply typing the object at the end of the cell

array([1, 2, 3])
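To make the "container" idea a little more concrete, here is a small sketch that inspects the same vector. It only uses standard `ndarray` attributes; nothing here is specific to this tutorial.

```python
import numpy as np

v = np.array([1, 2, 3])

# The object returned by np.array is an ndarray.
print(type(v))  # <class 'numpy.ndarray'>

# ndim is the number of dimensions (the "N" in N-dimensional array).
print(v.ndim)   # 1

# len() gives the length along the first (here, only) dimension.
print(len(v))   # 3
```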

Possibly the most useful tool for debugging NumPy code is the `.shape` attribute. Checking the shape of an `ndarray` is often a quick and easy way to make sure your arrays are doing what they're supposed to do. This will be especially important when considering broadcasting. For now, we have initialized an `ndarray` vector, so we expect our shape to have only one dimension:

In [7]:
v.shape

(3,)

Indeed, we see that `v.shape` is a tuple with a single entry: the array has one dimension, and that dimension has three elements. How about a 2-dimensional array?

In [8]:
X = np.array([[1, 2, 3], [4, 5, 6]])
X

array([[1, 2, 3],
       [4, 5, 6]])

As before, let's check the shape:

In [9]:
X.shape

(2, 3)

Following standard matrix convention, the first entry of `X.shape` is the number of rows and the second is the number of columns.
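Since shape checks come up so often when debugging, here is a minimal sketch of how one might use them in practice. The helper `check_shape` is hypothetical (it is not a NumPy function), and is just one way of turning a shape expectation into an explicit check.

```python
import numpy as np

X = np.array([[1, 2, 3], [4, 5, 6]])

# Because .shape is a tuple, it can be unpacked into rows and columns.
n_rows, n_cols = X.shape
print(n_rows, n_cols)  # 2 3

# .size is the total number of elements, i.e. rows * columns here.
print(X.size)  # 6


def check_shape(array, expected):
    """Hypothetical helper: raise an error if the shape is unexpected."""
    if array.shape != expected:
        raise ValueError(f"expected shape {expected}, got {array.shape}")
    return array


# Passes silently, because X really has 2 rows and 3 columns.
check_shape(X, (2, 3))
```

Calling `check_shape(X, (3, 2))` instead would raise a `ValueError`, which tends to be much easier to track down than a wrong result further along in the code.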
