# Big Four python Data Science Libraries

**Overview of major python data science libraries**

```{attention}
Download this notebook, put it in your lesson 03 folder, and follow along!
```

<div style="max-width:720px"><div style="position:relative;padding-bottom:56.25%"><iframe id="kaltura_player" src='https://cdnapisec.kaltura.com/p/1751071/embedPlaykitJs/uiconf_id/55382703?iframeembed=true&amp;entry_id=1_9icpietf&amp;config%5Bprovider%5D=%7B%22widgetId%22%3A%221_mquxf28s%22%7D&amp;config%5Bplayback%5D=%7B%22startTime%22%3A0%7D'  allowfullscreen webkitallowfullscreen mozAllowFullScreen allow="autoplay *; fullscreen *; encrypted-media *" sandbox="allow-downloads allow-forms allow-same-origin allow-scripts allow-top-navigation allow-pointer-lock allow-popups allow-modals allow-orientation-lock allow-popups-to-escape-sandbox allow-presentation allow-top-navigation-by-user-activation" title="EAS-G 690 Week 3 - Big Four Lecture Demo" style="position:absolute;top:0;left:0;width:100%;height:100%;border:0"></iframe></div></div>

```{note}
Click [here](https://iu.mediaspace.kaltura.com/media/t/1_9icpietf) if there are issues with the embedded video above.
```

Four python libraries are ubiquitous in data science; we'll use them extensively in this course, and you'll likely keep using them if you continue doing data science with python.  These are:
1. [**NumPy**](https://numpy.org/): Numerical python, for working with arrays and matrices of numbers.
2. [**Pandas**](https://pandas.pydata.org/): python Data Analysis Library, for working with tabular data (dataframes).
3. [**SciPy**](https://www.scipy.org/): Scientific python, for scientific and technical computing.
4. [**Matplotlib**](https://matplotlib.org/): Plotting library for creating static, animated, and interactive visualizations in python.

There are many, *many* other python libraries for data science that you may end up using.  For example, see [pangeo](https://pangeo.io/) for a collection of tools for geoscience data analysis and visualization, [xarray](http://xarray.pydata.org/en/stable/) for working with labeled multi-dimensional arrays, and [seaborn](https://seaborn.pydata.org/) for statistical data visualization.

For the purposes of this part of the course, we'll focus on these four libraries, as they are foundational to most other data science libraries, and they are widely used in the data science community.

## What are these libraries?

As we discovered in [](../02_modules_vscode_git/02_new_module.md), python *modules* are simply files that contain python code that is meant to be reused.  We defined our own module (containing crude implementations of the basic trig functions) just to get an idea.  Modules like `numpy` define many functions, classes, and variables that are useful for numerical computing.
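
As a small illustration of what "functions, classes, and variables" means in practice, the sketch below pulls one of each out of `numpy`.  It assumes `numpy` is installed in the active environment and uses the conventional `np` alias; the names `sine_of_right_angle` and `angles` are made up for this example.

```python
import numpy as np

# A variable defined by the module: the constant pi
print(np.pi)

# A function defined by the module: sine (angle given in radians)
sine_of_right_angle = np.sin(np.pi / 2)
print(sine_of_right_angle)

# A class defined by the module: the n-dimensional array type
angles = np.array([0.0, np.pi / 2, np.pi])
print(type(angles))
```

Compare `np.sin` with the hand-written trig functions from the previous lesson: the idea is the same, but the `numpy` version has been tested and optimized by many contributors.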

The benefits of these modules are that they

* have been developed and tested by many people over many years, so they are reliable and efficient.
* provide a wide range of functionality, so you don't have to reinvent the wheel for common tasks.
* are often optimized for performance, so they can handle large datasets and complex computations efficiently.
* are widely used in the data science community, so there is a large community of users and developers who can provide support and resources.
* are well documented, so you can easily find information on how to use them, and AI tools like ChatGPT and GitHub Copilot generally have enough context from the documentation to use them effectively.

Let's demonstrate with an example.

### numpy - for avoiding slow `for` loops

In [None]:
""" Take the square of items in a list. """

In [None]:
""" Test squaring with numpy."""

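For reference, here is one possible way the two cells above might be filled in; it is a sketch, not necessarily what is shown in the lecture video.  It squares the same numbers with a plain `for` loop and with a `numpy` array operation, then times both with the standard library's `timeit`.  The data, the helper names `square_with_loop` and `square_with_numpy`, and the repetition count are all arbitrary choices for illustration.

```python
import timeit

import numpy as np

# Example data (arbitrary): the integers 0 through 99,999
values = list(range(100_000))
# Ask for 64-bit integers explicitly so the squares cannot overflow
arr = np.array(values, dtype=np.int64)


def square_with_loop(items):
    """Square each item one at a time in a for loop."""
    result = []
    for item in items:
        result.append(item ** 2)
    return result


def square_with_numpy(array):
    """Square every element at once with a vectorized operation."""
    return array ** 2


# Both approaches produce the same numbers
squares_loop = square_with_loop(values)
squares_numpy = square_with_numpy(arr)
print(squares_loop[:5])
print(squares_numpy[:5])

# Time 10 repetitions of each; exact timings depend on your machine
loop_time = timeit.timeit(lambda: square_with_loop(values), number=10)
numpy_time = timeit.timeit(lambda: square_with_numpy(arr), number=10)
print(f"for loop: {loop_time:.4f} s")
print(f"numpy:    {numpy_time:.4f} s")
```

On most machines the `numpy` version should come out noticeably faster, because the loop over elements runs in compiled code rather than in the python interpreter.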
## Where do these libraries live?

When you install python, you get the interpreter along with python's standard library.  When you install a full distribution like Anaconda, you also get a larger set of libraries that are commonly used in data science; a minimal installer like Miniforge instead gives you python plus the `conda` and `mamba` package managers, and you add the libraries you need yourself.  When you use `mamba` or `conda` to install additional libraries, they are installed into the currently active python environment and are available only while that environment is active.
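
One way to check which environment a notebook or script is actually running in is to ask the interpreter itself.  This sketch uses only the standard library's `sys` module; the paths it prints will be different on every machine.

```python
import sys

# Full path of the python interpreter that is running this code
print(sys.executable)

# Root directory of the environment that interpreter belongs to
print(sys.prefix)
```

With a `conda` or `mamba` environment active, `sys.prefix` typically points at that environment's folder, which is a quick sanity check that the notebook is using the kernel you intended.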

Python also looks for modules in the current working directory, so if you have a module in the same directory as your script or notebook, python will find it there first.  We saw this in [](../02_modules_vscode_git/02_new_module.md) when we created our own module.
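
The search order described above can be inspected directly.  As a minimal sketch, the standard library's `sys.path` holds the list of locations python checks, in order, when it runs an `import` statement; the entries shown will depend on your installation and on how the code was launched.

```python
import sys

# sys.path is the ordered list of locations python searches on import
for location in sys.path:
    print(repr(location))
```

An empty string `''` in this list stands for the current working directory.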

In [None]:
""" Show where numpy is installed. """

In [None]:
""" Show where pandas is installed. """

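For reference, here is one possible way to fill in the two cells above; treat it as a sketch rather than the lecture's exact code.  It uses the standard library's `importlib.util.find_spec` to look up where a module would be imported from without failing if the module is missing.  The helper name `module_location` is made up for this example.

```python
import importlib.util


def module_location(module_name):
    """Return the file a module would be imported from, or None if not found."""
    spec = importlib.util.find_spec(module_name)
    return None if spec is None else spec.origin


for name in ("numpy", "pandas"):
    print(f"{name}: {module_location(name)}")
```

If the module is already imported, its `__file__` attribute gives the same information, e.g. `import numpy` followed by `print(numpy.__file__)`.  In either case the path should sit inside the folder of the active environment.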