<a href="https://colab.research.google.com/github/davanstrien/Intro-to-Jupyter-Notebooks-The-Weird-and-Wonderful/blob/draft/01_environment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Python libraries in Colab 

This notebook covers some aspects of Python packaging inside a Colab notebook. It is particularly aimed at people less familiar with Python who want to run a Jupyter notebook they have been sent or found online, when that notebook hasn't been deliberately set up to run inside Colab.

We'll take a slightly crude approach to getting a notebook to run. We won't explain Python packages or libraries in any proper detail in this notebook.

### What is already available
        
Some Python libraries are always available in Colab. These include the 'standard library': modules that ship with Python by default and cover a wide range of tasks. Some notebooks use nothing else. You can read more about the standard library in the Python [docs](https://docs.python.org/3/library/index.html).


As a simple example, `print` will work because it is part of the Python language (we don't import it).

In [1]:
print('This will work')

This will work


So will basic addition

In [2]:
1 + 1

2

and the standard Python 'data structures' like lists

In [3]:
data = [1,2,3,4]

We can import functionality from the standard library. Here we will import `random`

In [4]:
import random

We can then use `random` in our code

In [5]:
random.sample(data, 1)

[1]

### Python libraries included with Colab

There are some libraries which Colab makes available by default even though they are not part of the Python Standard Library. This includes popular libraries and libraries used in data analysis work. I am not sure of the exact criteria used by Google to determine if a package should be available by default. 

A library which is included by default is pandas

In [6]:
import pandas as pd

In [7]:
df = pd.DataFrame(data)

In [8]:
df

|   | 0 |
|---|---|
| 0 | 1 |
| 1 | 2 |
| 2 | 3 |
| 3 | 4 |


Another heavily used Python package is `requests`, which is also included by default.

In [9]:
import requests

### Python libraries not included with Colab

At some point you will probably run across a notebook which uses a Python library that is not included in Colab by default. As an example we'll try to import [marc2excel](https://pypi.org/project/marc2excel/), a library for converting between MARC and Excel.

Let's see what happens if we try to import this library

In [10]:
import marc2excel

ModuleNotFoundError: ignored

😬 we get an error! More specifically we get a `ModuleNotFoundError`. This means that the module we're trying to import isn't found by Python. 

If we try to run someone's notebook on Colab when they haven't specially set it up to run there, we'll often get errors like this. Often this is fairly easy to fix (not always...).
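As a minimal sketch of what this error looks like from Python's side, the snippet below catches it and reports which module was missing. The helper name `try_import` and the made-up module name are assumptions for illustration only.

```python
import importlib


def try_import(module_name):
    """Import a module by name; return None instead of raising if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as err:
        # err.name holds the name of the module Python could not find
        print(f"Could not import {module_name!r}: missing module {err.name!r}")
        return None


try_import("random")                        # standard library, so this succeeds
try_import("a_module_that_does_not_exist")  # hypothetical name, so this fails
```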

We can use `pip` to install the package. `pip` is a tool for managing Python packages. We can use it to install various external libraries and packages. We won't dwell too much on this here. Normally we would install a library by running

`pip install libraryname` 

on the command line.

Inside a Colab environment we need to add a `!` before commands which would normally run on the command line.

In [11]:
!pip install marc2excel

Collecting marc2excel
  Downloading https://files.pythonhosted.org/packages/03/99/baecdaeb8c6268faa56678fa2fddf6dc76302ef4e06117a8f86331e9c84e/marc2excel-1.2.1.tar.gz
Collecting pymarc<4.0,>=3.1.5
  Downloading https://files.pythonhosted.org/packages/13/8a/ddf96fff7324aef1bbe6bfc99ba5a63bcf4a0f892b5fbe6764e3e3ad44f9/pymarc-3.2.0.tar.gz (223kB)
     |████████████████████████████████| 225kB 7.2MB/s
Collecting click<7.0,>=6.6
  Downloading https://files.pythonhosted.org/packages/34/c1/8806f99713ddb993c5366c362b2f908f18269f8d792aff1abfd700775a77/click-6.7-py2.py3-none-any.whl (71kB)
     |████████████████████████████████| 71kB 4.9MB/s
Building wheels for collected packages: marc2excel, pymarc
  Building wheel for marc2excel (setup.py) ... done
  Created wheel for marc2excel: filename=marc2excel-1.2.1-cp37-none-any.whl size=7279 sha256=1f58814097aa6d31660e1e5003a958d7cc600e96e4a6c0a659697c7eb683e822
  Stored in directory: /root/.cache/pip/wheels/86/eb/f
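The same install can also be started from Python code instead of a `!` command. This is a sketch under the assumption that `pip` is available for the Python interpreter running the notebook; the helper names `pip_install_command` and `pip_install` are made up for illustration. Using `sys.executable -m pip` makes sure the package is installed for the same Python that the notebook is using.

```python
import subprocess
import sys


def pip_install_command(package):
    """Build the command that installs `package` for the Python running this code."""
    return [sys.executable, "-m", "pip", "install", package]


def pip_install(package):
    """Run pip in a subprocess; raises CalledProcessError if the install fails."""
    subprocess.run(pip_install_command(package), check=True)


print(pip_install_command("marc2excel"))
# pip_install("marc2excel")  # uncomment to actually install (needs network access)
```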

Now if we try importing again

In [12]:
import marc2excel

We no longer get an error. 

Often notebooks will do most imports in one cell at the top of the notebook. In this case the imports run in order, and the cell stops with an error at the first import that fails:

In [13]:
import pandas as pd
from textblob import TextBlob
import ISS_Info
import twint

ModuleNotFoundError: ignored

Above you will see we get an error again. The error message also tells us which library failed to import, in this case `ISS_Info`. We can fix this by installing the package with `pip`.

In [14]:
!pip install ISS_Info

Collecting ISS_Info
  Downloading https://files.pythonhosted.org/packages/ac/99/a8541c76eaa2c24b66954396d175435d2cfbe10dd4a1040806c7f85ef2c2/ISS_Info-1.0.0-py3-none-any.whl
Installing collected packages: ISS-Info
Successfully installed ISS-Info-1.0.0


If we try running the imports again (in reality we'd probably re-run the cell above, but we want to keep this notebook in order)

In [15]:
import pandas as pd
from textblob import TextBlob
import ISS_Info
import twint

ModuleNotFoundError: ignored

Now the import fails on the next missing library, `twint`. Again we can fix this by using `pip` to install it.

In [16]:
!pip install twint

Collecting twint
  Downloading https://files.pythonhosted.org/packages/69/e1/4daa62fbae8a34558015c227a8274bb2598e0fc6e330bdeb8484ed154ce7/twint-2.1.20.tar.gz
Collecting aiohttp
  Downloading https://files.pythonhosted.org/packages/88/c0/5890b4c8b04a79b7360e8fe4490feb0bb3ab179743f199f0e6220cebd568/aiohttp-3.7.4.post0-cp37-cp37m-manylinux2014_x86_64.whl (1.3MB)
     |████████████████████████████████| 1.3MB 14.5MB/s
Collecting aiodns
  Downloading https://files.pythonhosted.org/packages/ab/72/991ee33a517df69c6cd6f3486cfe9b6329557cb55acaa8cefac33c2aa4d2/aiodns-3.0.0-py3-none-any.whl
Collecting cchardet
  Downloading https://files.pythonhosted.org/packages/80/72/a4fba7559978de00cf44081c548c5d294bf00ac7dcda2db405d2baa8c67a/cchardet-2.1.7-cp37-cp37m-manylinux2010_x86_64.whl (263kB)
     |████████████████████████████████| 266kB 40.7MB/s
Collecting elasticsearch
  Downloading https://files.pythonhosted.org/packages/8f/5c/60a32dfc24da07703b5b32d9935bcc36786a

Now if we import again we should get no errors 

In [17]:
import pandas as pd
from textblob import TextBlob
import ISS_Info
import twint
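Rather than discovering missing libraries one failed import at a time, it is possible to check a whole list up front. The sketch below uses `importlib.util.find_spec` from the standard library; the helper name `find_missing` is an assumption for illustration. Note that the name used in `import` is not always the same as the name used with `pip install`, so the result is only a starting point.

```python
import importlib.util


def find_missing(module_names):
    """Return the top-level module names that Python cannot currently find."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None  # None means "not installed"
    ]


# The import names used in the cell above
wanted = ["pandas", "textblob", "ISS_Info", "twint"]
print(find_missing(wanted))
```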

## Versions of packages

One slight complication is that sometimes the version of a library available in Colab isn't the version you want. We won't dig into versioning in too much detail here but we'll quickly look at some common scenarios. 

Let's see if we can import `tqdm`, a package for creating progress bars.

In [18]:
from tqdm.auto import tqdm

In [19]:
data = [1,2,3,4,5,6,7,8,9,10]

In [20]:
import time 

def double(x):
    time.sleep(1)
    return x * 2

In [21]:
for number in tqdm(data):
    print(double(number))

HBox(children=(FloatProgress(value=0.0, max=10.0), HTML(value='')))

2
4
6
8
10
12
14
16
18
20



We want to speed this up whilst keeping our progress bar. We heard we could use `thread_map` from `tqdm`'s contrib module.

In [22]:
from tqdm.contrib.concurrent import thread_map

ModuleNotFoundError: ignored

We get an error. This time the problem isn't that the `tqdm` library is missing. Instead, we can't import a particular part of the library that we think should be available. We can find out the version of a Python package like this:

In [23]:
import tqdm
tqdm.__version__

'4.41.1'
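Not every package defines `__version__`, so another option is to ask Python's packaging metadata. This is a sketch assuming Python 3.8 or newer, where `importlib.metadata` is part of the standard library; `installed_version` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is not installed."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


print(installed_version("tqdm"))
```

Here the name is the one used with `pip install`, which can differ from the name used in `import`.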

If we compare that to the package currently on PyPI (the main place where Python packages are published), we can see whether we have an older version: https://pypi.org/project/tqdm/

At the time of writing this notebook the PyPI version of tqdm is `4.60.0`, so we might need to upgrade to get our missing `thread_map` function. We can upgrade quite easily by running `pip install` with the `--upgrade` flag.
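When comparing versions in code, plain string comparison can mislead, because `"4.9.0"` sorts after `"4.60.0"` as text. The sketch below assumes simple dotted numeric versions like the ones in this notebook; `version_tuple` is a hypothetical helper, and version strings with extra parts (such as pre-release tags) would need a proper version parser instead.

```python
def version_tuple(version_string):
    """Turn a version like '4.41.1' into (4, 41, 1) so it compares numerically."""
    return tuple(int(part) for part in version_string.split("."))


installed = "4.41.1"  # the version reported in this notebook
latest = "4.60.0"     # the PyPI version at the time the notebook was written

# True means the installed version is older than the latest one
print(version_tuple(installed) < version_tuple(latest))
```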

In [24]:
!pip install tqdm --upgrade

Collecting tqdm
  Downloading https://files.pythonhosted.org/packages/72/8a/34efae5cf9924328a8f34eeb2fdaae14c011462d9f0e3fcded48e1266d1c/tqdm-4.60.0-py2.py3-none-any.whl (75kB)
     |████████████████████████████████| 81kB 4.8MB/s
Installing collected packages: tqdm
  Found existing installation: tqdm 4.41.1
    Uninstalling tqdm-4.41.1:
      Successfully uninstalled tqdm-4.41.1
Successfully installed tqdm-4.60.0


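You'll see a warning in the output above telling us to restart the runtime so that we're using the version of tqdm we just installed. We can use the `RESTART RUNTIME` button to do this. Restarting clears everything we have defined so far, so we re-run the imports and definitions below.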
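The reason a restart is needed is that Python keeps every imported module in memory for the rest of the session. The sketch below illustrates this with `json` from the standard library as a stand-in for any library that was imported before an upgrade.

```python
import importlib
import sys

import json  # stand-in for a library imported before it was upgraded

# Once imported, the module object is cached in sys.modules
print("json" in sys.modules)

# Importing again hands back the cached object rather than re-reading files from disk
again = importlib.import_module("json")
print(again is json)
```

Because the old code stays cached like this, installing a newer version on disk does not change what the running session uses until the runtime restarts.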

In [1]:
import tqdm

In [2]:
tqdm.__version__

'4.60.0'

In [3]:
from tqdm.contrib.concurrent import thread_map

In [4]:
data = [1,2,3,4,5,6,7,8,9,10]

In [5]:
import time 

def double(x):
    time.sleep(1)
    return x * 2

Now we can use the code from the most recent version of `tqdm`

In [6]:
thread_map(double, data)

  0%|          | 0/10 [00:00<?, ?it/s]

[2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
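To get a feel for what `thread_map` does, here is a rough equivalent without the progress bar, using `ThreadPoolExecutor` from the standard library. This is a sketch, not tqdm's actual implementation; the shortened sleep and the choice of `max_workers=10` are assumptions made for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def double(x):
    time.sleep(0.1)  # shortened stand-in for the slow step
    return x * 2


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Each call to double() runs in its own thread, so the sleeps overlap
with ThreadPoolExecutor(max_workers=10) as executor:
    # executor.map returns results in the same order as the input
    results = list(executor.map(double, data))

print(results)
```

Because the waiting happens in parallel, the whole list takes roughly as long as one call rather than ten.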

### Installing specific versions of a library

There might be times when you rely on a specific version of a Python library. You can install a specific version by adding `==` and a version number after the package name.

In [None]:
!pip install pandas==0.25.3

You can see above that we start to get some more errors here. This example is a little extreme because I've chosen quite an old pandas version; often, installing a specific version of a package will be okay. Just be aware that you can occasionally run into issues with conflicting requirements.
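After installing a pinned version it can be useful to confirm that the pin actually took effect. A minimal sketch, assuming Python 3.8 or newer and a simple `name==version` string; `matches_pin` is a hypothetical helper and does not understand other specifiers such as `>=`.

```python
from importlib.metadata import PackageNotFoundError, version


def matches_pin(requirement):
    """Return True if a 'name==version' requirement matches what is installed."""
    name, _, wanted = requirement.partition("==")
    try:
        return version(name.strip()) == wanted.strip()
    except PackageNotFoundError:
        # The package is not installed at all
        return False


print(matches_pin("pandas==0.25.3"))
```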

## Non-Python packages

Sometimes you may need other non-Python tools in your environment. Underneath, Colab runs on Linux, so we can use some Linux tools 'out of the box'. Let's prove that this is some type of Linux:

In [None]:
!uname -a

We can also use an ordinary Linux command such as `pwd`, which prints the current working directory:

In [None]:
!pwd
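The same information is also available from Python itself, without a `!` command. This sketch uses only the standard library; the output depends on the machine it runs on, so none is shown here.

```python
import os
import platform

# Name of the operating system, for example 'Linux' on Colab
print(platform.system())
# Kernel release string, part of what `uname -a` reports
print(platform.release())
# Current working directory, the same information as `pwd`
print(os.getcwd())
```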

We can use some commands like `wget` in the usual way (only remembering to put a `!` at the front of our command)

In [None]:
!wget data.bl.uk/
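If you would rather stay in Python, the standard library can download a file too. This is a sketch; the helper name `download`, the output filename, and the `https://` scheme in the example call are assumptions for illustration.

```python
import urllib.request


def download(url, filename):
    """Save the resource at `url` to `filename` and return the local path."""
    local_path, _headers = urllib.request.urlretrieve(url, filename)
    return local_path


# Example call (needs network access; the https:// scheme is an assumption):
# download("https://data.bl.uk/", "index.html")
```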

### Installing external packages (non-Python)

What if we want a non-Python package that isn't included by default? Let's see if youtube-dl is available. This is a tool for downloading videos from YouTube (we will only download Creative Commons videos!). If we run the command for this tool

In [None]:
!youtube-dl

We get an error. If we go to the [GitHub repository](https://github.com/ytdl-org/youtube-dl) for the tool, you'll see install instructions for various platforms. We can often use the Linux install instructions and just add a `!` at the start of each line.

In [None]:
!sudo curl -L https://yt-dl.org/downloads/latest/youtube-dl -o /usr/local/bin/youtube-dl
!sudo chmod a+rx /usr/local/bin/youtube-dl
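One way to check whether a command-line tool is available, before or after installing it, is `shutil.which` from the standard library. The helper name `tool_available` is an assumption for illustration.

```python
import shutil


def tool_available(command):
    """Return True if `command` can be found on the PATH."""
    return shutil.which(command) is not None


print(tool_available("youtube-dl"))
print(tool_available("wget"))
```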

Now we should be able to use the tool 

In [None]:
!youtube-dl https://www.youtube.com/watch?v=FSdIoJdSnig

This doesn't cover every possible issue you could run into when setting up the required libraries, packages and tools inside Colab, but it should give a good starting point.