# Week 3: The data science ecosystem

<img width = "250" src="./images/hazel.png" align="right" style="padding-left:10px">

Congrats. You have learned the foundations of Python! So far, we have studied the Python standard library, but have not looked at all at the external libraries that people typically use for data analysis.

This week you will start this journey, going through a crash course on the most essential tools you will need to analyze and visualize data. Since we are done with ATBS, it is time to delegate our flipped teaching responsibilities to multiple sources. I basically scoured the internet for readable, beginner-friendly, but practical introductions to the topics we are covering. I think I have found some really good resources, but if you find any additional material that you like, please let me know!

- [1: Overview of Python's data science ecosystem](#intro)
- [2: Virtual Environments](#virtual)
- [3: Numpy for numerical computing](#numpy)
- [4: Matplotlib for plotting](#matplotlib)
- [5: Pandas for analysis of tabular data](#pandas)

Note each of these libraries could fill multiple weeks of discussion. Our goal isn't to become an expert in matplotlib or pandas. Rather, I want you to become comfortable running basic commands, loading the libraries, and especially creating your own *analysis environments* in Python that you can quickly build up. This will be crucial for larger-scale analysis projects, like the one we will tackle in the final week with Deep Lab Cut.  

Incidentally, there is a good book on Python for data science that is free and available online: [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/). Unfortunately, the material in that book is *way* too dense for a rapid crash course. However, when you get stuck or have questions, I encourage you to use it as a reference (along with Google).

My recommendation for picking up the material this week is the same as before. There is really no major break in method between learning to use the tools in the standard library and the tools in third-party libraries: make lots of cells and mess around with code. Tweak and twiddle with the examples in the videos and web pages to see what happens. Coding is largely muscle memory, so it is really important to literally *type code* rather than just read and understand the tutorials. If you run into error messages, first just inspect your code to make sure there isn't an obvious mistake (an unclosed paren or quotation mark). Second, read the error message: they are sometimes helpful. Barring that, Google the error message: there are great resources online.
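To make the "read the error message" advice concrete, here is a small sketch with a deliberate typo. The variable names are made up for illustration. When Python hits an error, the last line of the message it prints has the form `ErrorType: description`, and that line is usually the best thing to read first (and to paste into Google).

```python
# A deliberate mistake: the variable name is misspelled when we use it.
heart_rate = 72

try:
    print(hart_rate)
except NameError as err:
    # Rebuild the "ErrorType: description" text that appears on the
    # last line of a normal error message.
    message = f"{type(err).__name__}: {err}"
    print(message)
```

Try removing the `try`/`except` lines and running just the `print(hart_rate)` line on its own to see the full error message as Jupyter displays it.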

<a id ="intro"></a>
# 1: Overview of Python's data science ecosystem
You will often hear about the Python *data science stack*, which is the main set of Python libraries that people use to analyze and visualize their data. It includes libraries for numerical computing, plotting, statistical tests, and much more. If you want to analyze data, there is a core set of libraries to be familiar with. This does not mean you need to be an expert: it is more about being comfortable enough to install and use the library's basic functionality (one secret with these libraries is that just a handful of commands is often enough).

The data science stack isn't some crisply defined list defined by some committee of developers. Indeed, *it really isn't a stack at all*. While I will write/talk about the data science stack, this metaphor implies some foundation where one piece builds on another. The data analysis tools form more of an **ecosystem**, an interconnected web of software packages all geared toward helping you analyze your data. Here is a cross-section through some of the resources available in Python:

<img width = "650" src="./images/ds_ecosystem.jpg" style="padding-top:30px; padding-bottom:30px">

As you can see, there are a few core packages in the center (e.g., Numpy): these are the packages with which most people are familiar, and that many other packages depend upon. Then there are more peripheral packages located more at the edges of the web. These are the ones that you will pick up when you have a particular need (e.g., you don't need to learn Tensorflow unless you are doing a particular type of machine learning project). The more peripheral packages aren't lower quality; they are just *specialized* for particular tasks that many people will never need. All the packages are typically very high quality: that's how they make it into the ecosystem.

I can't stress enough: you **do not need** to learn all these packages. That would be a waste of time, especially when starting out. My plan is to help you navigate this web, cut through the complexity, and find your bearings. More specifically, our goal is threefold:
- Informally discuss the main packages in the data science ecosystem. That way, in the future, you can have an idea what tool to use when you have a specific need. I do not recommend that you pre-emptively learn a package that you *might* need someday, unless it is serving some useful pedagogical purpose.
- Discuss and create *virtual environments*, which are extremely important in any serious analysis project (and you will use them in the final two weeks).
- Learn in more detail about three important tools in the data science ecosystem.

## Synopsis of specific packages
Before jumping into learning more specifics, let's do a quick overview of the main packages in the ecosystem.

- **numpy**: NumPy is a numerical computing library that supports extremely efficient array computations (as we will see, arrays are basically containers for numbers and are extremely useful for representing data). It also has basic mathematical operations and statistical calculations on those numerical arrays. NumPy forms the basis for almost all packages that perform operations on numbers.
- **matplotlib**: Matplotlib is the core plotting library for Python. 
- **scipy**: SciPy is a scientific computing package that is basically one abstraction level up from NumPy. It includes basic statistics, optimization, numerical integration, and other scientific computing tools. It is a sort of grab-bag for useful computations that you will use across multiple projects. 

The above tools are sort of the core of the Python data science ecosystem. If you do any heavy data analysis, I can pretty much *guarantee* you will use the above libraries. The other libraries typically build on the above: I would consider them more peripheral, in the sense that you may or may not use them, depending on your specialization and the type of data you use. They include:

- **Pandas**: an extremely user-friendly package for handling tabular data (basically, data arranged in something like an Excel spreadsheet). Pandas includes tools for analysis as well as plotting such data. It builds extensively on numpy and Matplotlib. Below, we will learn the basics of Pandas, partly because it is used extensively in Neuroscience, and because a lot of data in neuroscience is stored in the `csv` format.
- **Pillow**: A simple image processing library (initially, there was PIL (Python Imaging Library), and Pillow is based on it). Allows for reading/writing images, converting between image types, and applying simple filters (e.g., contrast/smoothing) to images. 
- **OpenCV**: An extremely powerful computer vision library. It is written in C++, but a Python wrapper lets you use it from ordinary Python code. It allows you to capture video from your web camera, read and write movies, and has tools for extensive processing and filtering of your images and movies, including machine vision in real time (e.g., facial recognition). 
- **scikit-learn**: The main machine learning library when you want to do traditional (not-neural network based) machine learning. It has tools for classification, clustering, and regression. It has a consistent interface for these problems, and is fairly intuitive to use.
- **tensorflow/pytorch**: These are the two main *deep learning* packages that use artificial neural networks to let you tap into the power of cutting edge machine learning algorithms. Tensorflow was developed by Google and has been around longer, while Pytorch was developed by Facebook and is a bit easier to use and learn. Such approaches tend to work great for complex machine vision problems, but are often not necessary (i.e., scikit-learn is often more than enough). 
- **Plotly / Bokeh / Seaborn (etc)**: while Matplotlib is a great plotting library, there are some limitations. Because of this, there has been [a proliferation of plotting libraries in Python](https://geo-python-site.readthedocs.io/en/stable/lessons/L7/python-plotting.html). I've just listed some of the more popular ones here. *Seaborn* is built on Matplotlib and provides an intuitive interface to generate beautiful plots even with complex data, plots that would take many many lines of code using Matplotlib.  *Plotly* and *Bokeh* allow you to embed beautiful interactive plots in Jupyter notebooks: they are basically Python interfaces with a Javascript (web-page friendly) back end. Plotly and Bokeh can also be integrated into dashboards that you can share online, or even in the cloud. 
- **sympy**: the symbolic mathematics library for Python. This is not used that much by people tussling with real messy data. However, if you need an *exact* solution to an equation, then you can use `sympy`. 
- **statsmodels**: for specialized or advanced statistical analysis: if SciPy doesn't have what you need, check statsmodels. This used to be part of SciPy, but branched off into its own specialized package.
- **PyMC3**: A tool specialized for Bayesian modeling and inference. 

## How to navigate the ecosystem
The above is a guided tour of the main components of the Python data science ecosystem. Do *not* worry about memorizing everything in there: just try to remember the general types of tools available. Then, when you need one, you will know that you can find it. 

Honestly, there is a good chance you will *never* use some of the libraries in the secondary stack, especially if you are not a full-time programmer. I *am* a full-time programmer and I have never explicitly imported `PyMC3` or `sympy`. They are perfectly good packages, I just haven't needed them yet.  

Below, we will focus on learning the basics of just three of the packages in the data science stack. This will give you a sense for how to install, import, and work with different libraries. Once you've done it a couple of times, your skill will generalize to new libraries when they are needed.

In general with Python, for important tasks (e.g., web scraping) there will typically be a canonical framework out there that does it really well. Finding it is typically easy with Google. There are also curated lists of great Python packages: my favorite is the [awesome python](https://github.com/vinta/awesome-python) list, which is fairly comprehensive and organized by general category. 

### What about Jupyter, github, and IDEs?
While Jupyter notebooks aren't part of your first-order analysis workflow (i.e., you don't ever *import* Jupyter in your code), they have become the de facto scaffolding for workflows, so I thought it was important to include them in the figure above. 

With their extremely convenient web interface for coding, they have become absorbed into all the major cloud computing platforms (AWS, Google Colab, and Azure). Further, Jupyter notebooks are the tool of choice for book authors, and for creators of new libraries who want to get people on board quickly. It used to be you had to worry about what IDE someone was using when you shared code -- now you know you can just provide a Jupyter notebook, and things will be fine. 

If this class were for software developers, **git** and **github** would also be in the diagram. By making it extremely easy to track and share code, such tools have been crucial for speeding up development of analysis software, and removing arbitrariness from the process. Since this class is for neuroscientists, who often will not use a version control system, I decided to leave it out. 

Similarly, as discussed in `week0.ipynb`,  *integrated development environments* (IDEs) are extremely important components of any programmer's workflow: Jupyter notebooks simply cannot compare in terms of power/speed/efficiency for writing and refactoring code. Jupyter notebooks have been a boon for *communicating* code. Our purpose here is more didactic, and Jupyter notebooks are often sufficient for most analysis purposes: if you ever feel the need to go beyond the notebook, to something more powerful, then there are plenty available (see Section 2 of `week0.ipynb` where we went over some options).

<a id ="virtual"></a>
<img width = "250" src="./images/sandbox.png" align="right" style="padding-left:10px">
# 2: Virtual environments: your Python sandbox

Before we jump into numpy, we need to learn about *virtual environments* in Python. This will be a practical, whirlwind tour of anaconda virtual environments, just enough to get us started. We'll start with an introduction and background material before building our own virtual environment.

## Background on virtual environments

Since in the final two classes we will go well beyond the standard library, forging fairly deeply into the data science jungle, we need to start thinking about how to intelligently manage the environment in which we carry out our work in Python. We need to talk about *virtual environments*.

A virtual environment is an isolated playground or sandbox that consists of a Python version and set of software packages that will allow you to carry out a task. That task might be extremely general ("Learn about the data science ecosystem") or specific ("Set up deep lab cut and run it on my fly data"). Each of your different virtual environments will have a different *name* and can easily be *activated* within conda at the command line. Once you are within this environment, you are then free to use all the resources (import the packages and tools in that environment) and ... see what possibilities exist within that self-contained Python universe! 

<img width="180" src="./images/virt_env_thumbsup.jpg" align="right" style="padding-left:10px">

As you start to build projects that depend on sometimes incompatible combinations of software packages, it will become more important to keep these projects isolated from each other. For instance, you might have one project that uses Python 3.6 and Tensorflow v2. Another project that you are working on may use Python 3.8 and Pytorch. These two sets of packages, these different environments, will not play well together. It would be agonizing to tear down a precious tensorflow environment that you spent hours building up, just to build a Pytorch analysis platform on the same machine. 

Wouldn't it be easier to build up temporary *virtual* environments, one for each project: an isolated environment that you could activate and work within like a sandbox when you needed to, and could deactivate when you needed another environment?  That way, you wouldn't have to worry about making breaking changes to your system every time you want to start a new project.

Also, what if, once you had this veritable house of cards built up, you wanted to *share* it with someone in a convenient way? Wouldn't it be nice if you could share the entire set of software dependencies, so that someone could reconstruct your sandbox on their own computer?  This would not only make things more convenient, but make for more reproducible science!

Virtual environments to the rescue! They allow you to do all these things, in an intuitive and easy way. Since you have been using anaconda all this time, it will be extremely easy to get this working! In fact, while we haven't discussed it, you have been working within a virtual environment this whole time! You have been working within your anaconda `base` environment. What anaconda allows you to do is to create new more specialized environments.

To see this, open up your anaconda prompt (whatever command line you use to open Jupyter notebooks). You should see the word `base` in parentheses before the command line prompt. This means you are in your base anaconda environment. If you want to see what packages are installed in an environment, just use the command `conda list` and it will list all the stuff you have installed.
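If you are curious, you can get a rough equivalent of `conda list` from inside a notebook cell. The following is a minimal sketch, assuming Python 3.8 or later (where `importlib.metadata` is part of the standard library). The helper name `installed_packages` is made up for this example. It only sees Python packages, so expect it to show a subset of what `conda list` reports.

```python
from importlib import metadata


def installed_packages():
    """Return a sorted list of (name, version) pairs for the current environment."""
    pairs = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        version = dist.metadata.get("Version", "unknown")
        if name:  # skip entries whose metadata has no name
            pairs.add((name, version))
    # Sort alphabetically, ignoring upper/lower case.
    return sorted(pairs, key=lambda pair: pair[0].lower())


packages = installed_packages()
print(f"{len(packages)} Python packages found")
for name, version in packages[:10]:  # show only the first ten
    print(name, version)
```

The list you get depends on which environment the notebook was started from, which is exactly the point of the sections that follow.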

We can see the overall situation with conda virtual environments in the following image, created by User Interface Designer [Krisztina Szerovay](https://krisztina.szerovay.hu/):
<img width = "700" src="./images/virtual_environments.jpg">

This figure deserves close study. It shows the big picture view of where we are, and where we are headed. First, at the top, you have set up the conda package manager which handles your environments. You did this in your very first class. On the left is the `base` environment (called the `root` environment in the figure) which is the default environment that contains the Python standard library and any other packages you might want in your base environment. You have been working in your base environment since day 1.

On the right hand side are environments (labeled *Additional environment*). Each environment contains a version of Python that can be different from that in the base, and software packages that can be completely different from your base packages. You name each such isolated environment whatever you like (ideally some clear name that will remind you what it is for). These environments are your playgrounds, the sandboxes in which you can work on your different projects.

## Creating and activating your first environment
So that's general background, for those who want to understand virtual environments. Let's get some practical experience!

Creating new environments is really easy in conda. Just use the command:

    conda create -n <env_name> 
    
Where `<env_name>` is the name of the new environment you want to create. As mentioned previously, you typically create virtual environments for some new task. I try to pick environment names so that it is pretty clear what that environment is for. For this class, let's create a `datasci` virtual environment with the following command:

    conda create -n datasci
    
You will see a bunch of stuff, and be asked if you want to proceed. Enter `y` and hit your ENTER key, and you now have your virtual environment! To see a list of the virtual environments on your system, use the following command:

    conda env list
    
You should see your `base` environment and `datasci` environment (as well as any others you might have created in the past).

Now we are ready to play, right? **Not yet**. You must first *activate* the environment: The single biggest mistake that beginners make when working with virtual environments is forgetting to activate them before trying to use them, and then wondering why their software is broken. 

So -- you have *created* a virtual environment named `datasci`, but you aren't actually working inside of it yet. You need to *activate* that environment using the `activate` command:

    conda activate datasci

Now you should notice that the name in parentheses before your command line prompt has switched from `(base)` to `(datasci)`. This is how you know what conda environment you are in. It should look something like this:

<img width = "400" src="./images/conda_venv_activate.jpg">

If this seems strange or esoteric, it will soon become second nature to you. The main things to remember at this point:
- The general concept of a virtual environment and why they are important.
- Before working in an environment you have created you first have to activate the environment using the `activate` command.  
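Since forgetting to activate is such a common mistake, it can help to check from inside Python which environment you are actually running in. The following is a minimal sketch using only the standard library. The helper name `describe_environment` is made up for this example, and it relies on the `CONDA_DEFAULT_ENV` environment variable, which conda sets when you activate an environment.

```python
import os
import sys


def describe_environment():
    """Collect a few facts about the Python that is running this code."""
    return {
        # conda sets this variable when an environment is activated;
        # it is absent if Python was started outside of conda.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV", "(not set)"),
        # Folder that holds this environment's Python and packages.
        "prefix": sys.prefix,
        # Full path of the Python interpreter being used.
        "executable": sys.executable,
        "python_version": ".".join(str(n) for n in sys.version_info[:3]),
    }


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```

If you run this in a notebook that was started after `conda activate datasci`, the `conda_env` line should say `datasci` and the paths should point inside that environment's folder. If it says `base`, you probably forgot to activate.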

## Building the environment: installing packages
Now that we are inside of our datasci environment (make sure it is activated!), we can finally do stuff there. The first thing you want to do is actually install some packages!

We haven't gotten anything inside our datasci environment yet, as we have to build it up from scratch. You can see what you have so far using `conda list`. We are going to install four things for our tour through the data science ecosystem: jupyter, numpy, matplotlib, and pandas.  

> Side tip: if these were packages I had never used before, I would simply google `conda install <package name>` to figure out the recommended installation instructions. There are two main package managers for Python, `pip` and `conda`. If possible you want to try to install things using `conda` when you are using the conda package manager. If an installer *isn't* available for something, then you can use pip and it typically works out just fine. 


Let's build up our datasci environment: 

    conda activate datasci # just in case you didn't already do this
    conda install -c conda-forge numpy matplotlib pandas notebook
 
You will be prompted to be sure you want to install these libraries (and their many, many dependencies). Enter 'y' and sit back for a few minutes while conda installs everything. What we have done (after activating our virtual environment) is to tell conda to install the four packages into the virtual environment. 

> Note: the `-c conda-forge` part of the command is telling conda to install these packages from `conda-forge`, a community-maintained "channel" (package repository) where many developers publish conda installers for their software. 

Once you've done the above, you've successfully created your datasci environment! You can now enter `jupyter notebook` at the command line prompt as you have been doing all along, and your Jupyter server will start. The cool thing is that now it will start *within the context of your new virtual environment*, so you will be able to import numpy, matplotlib, and pandas. In an environment where those packages have not been installed (for example, a freshly created, empty one), trying to import them will give you an error.  
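A quick way to confirm that the install worked is to ask Python whether it can find each package, without actually importing it. This is a minimal sketch using the standard library's `importlib.util.find_spec`. The helper name `check_packages` is made up for this example.

```python
from importlib.util import find_spec


def check_packages(names):
    """Map each top-level package name to True if it can be imported here."""
    return {name: find_spec(name) is not None for name in names}


status = check_packages(["numpy", "matplotlib", "pandas"])
for name, available in status.items():
    print(f"{name}: {'installed' if available else 'MISSING'}")
```

If any of the three shows up as `MISSING`, the most likely culprit is that the notebook was started from a different environment than the one you installed into.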

We will work within this environment below. For now, just pat yourself on the back: you've just created your first virtual environment!

## Deactivating and uninstalling environments
What about when you are done working within a virtual environment? Say you want to *deactivate* an environment and go back to your base environment? This is very simple within conda. Just enter:

    conda deactivate
    
And your environment will go back to base.

Finally, what if your virtual environment is fubar and you want to delete it and start from scratch? This can happen when you have a system of dependencies that has gone down a horrible garden path, and you can't get out. Maybe you are getting errors that you can't fix. It is sometimes easier to just start over and install things in a different order. To remove a virtual environment from your computer, use the following command:

    conda remove -n <env_name> --all
    
Just be sure you really want to do this.

## To learn more about virtual environments
Virtual environments are a big topic, and there are lots of resources. To learn about conda virtual environments in more depth (like how to save/share an environment), see the following:
- https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html
- https://www.freecodecamp.org/news/why-you-need-python-environments-and-how-to-manage-them-with-conda-85f155f4353c/

To learn about the pip package manager, and how to manage virtual env
More about anaconda. Our goal is not to spend too much time on this, but to learn by doing, so let's use our data science virtual environment to start exploring the data science ecosystem!

<a id ="numpy"></a>
# 3: Numpy for numerical computing

## Imports
You’ve already done this (mention parts in ATBS).
The following is a really good summary of imports:
- https://medium.com/cold-brew-code/a-quick-guide-to-understanding-pythons-import-statement-505eea2d601f

Other references:
- https://www.digitalocean.com/community/tutorials/how-to-import-modules-in-python-3
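As a quick refresher, here is a sketch of the three import styles you will run into most often. It uses only standard library modules so that it runs in any environment. The third-party libraries in this notebook are imported the same way, usually under their conventional aliases (`import numpy as np`, `import matplotlib.pyplot as plt`, `import pandas as pd`).

```python
# Style 1: import the whole module and use its name as a prefix.
import math
print(math.sqrt(16))

# Style 2: import the module under a shorter alias. This is the style
# you will see most with third-party libraries (e.g. numpy as np).
import statistics as st
print(st.mean([1, 2, 3, 4]))

# Style 3: import a single name directly from a module.
from random import randint
roll = randint(1, 6)  # a random whole number from 1 to 6, inclusive
print(roll)
```

Styles 1 and 2 keep it obvious where each function comes from, which is why most data science code prefers them over importing individual names.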


<a id ="matplotlib"></a>
# 4: Plotting with matplotlib

<a id ="pandas"></a>
# 5: Pandas for tabular data

<span style="color:red">
    <h1>Congratulations!!!</h1>
</span>
<img width = "150" src="./images/yippee.jpg" align="left" style="padding-right:10px">



Further reading
