# Installation

Conceived in the late 1980s as a teaching and scripting language, Python has since become an essential tool for many programmers, engineers, researchers, and data scientists across academia and industry.
As computational and data-focused scientists, we have found Python to be a near-perfect fit for the types of problems we face, whether it's extracting meaning from large social network datasets, scraping and munging/wrangling data sources from the Web, or automating day-to-day research tasks.

The appeal of Python is in its simplicity and beauty, as well as the convenience of the large ecosystem of domain-specific tools that have been built on top of it.

For example, most of the Python code in scientific computing and data science is built around a group of mature and useful packages:

- [NumPy](https://numpy.org) provides efficient storage and computation for multi-dimensional data arrays.
- [SciPy](https://scipy.org) contains a wide array of numerical tools such as numerical integration and interpolation.
- [Pandas](https://pandas.pydata.org) provides a DataFrame object along with a powerful set of methods to manipulate, filter, group, and transform data.
- [Matplotlib](https://matplotlib.org) provides a useful interface for creation of publication-quality plots and figures.
- [Scikit-Learn](https://scikit-learn.org) provides a uniform toolkit for applying common machine learning algorithms to data.
- [IPython/Jupyter](https://jupyter.org) provides an enhanced terminal and an interactive notebook environment that is useful for exploratory analysis, as well as creation of interactive, executable documents. For example, the manuscript for this report was composed entirely in Jupyter notebooks.

## Installation and Practical Considerations

Installing Python and the suite of libraries that enable scientific computing is straightforward whether you use Windows, Linux, or Mac OS X. This section will outline some of the considerations when setting up your computer.


### Python 2 vs Python 3

This report uses the syntax of Python 3, which contains language enhancements that are not compatible with the *2.x* series of Python.
Though Python 3.0 was first released in 2008, adoption was relatively slow at first, particularly in the scientific and web development communities.
This was primarily because it took some time for many of the essential packages and toolkits to be made compatible with the new language internals.
Since early 2014, however, stable releases of the most important tools in the data science ecosystem have been fully compatible with both Python 2 and 3.
In fact, many major projects are now deprecating Python 2, and so this course will use Python 3 syntax.
Even so, the vast majority of code snippets here will also work with little if any modification in Python 2.
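
To make the incompatibility concrete, here is a minimal sketch of three well-known differences between the two series. The variable names are only illustrative, and the Python 2 behavior is described in comments for comparison rather than executed.

```python
# Runs under Python 3; Python 2 behavior is noted in comments only.

# print is a function in Python 3 (Python 2 used a statement: print "hello")
print("hello")

# / is true division in Python 3 (in Python 2, 7 / 2 gave 3 for integers)
true_quotient = 7 / 2

# // is floor division in both series
floor_quotient = 7 // 2

# range() returns a lazy range object in Python 3, not a list
numbers = range(3)

print(true_quotient, floor_quotient, list(numbers))  # prints: 3.5 3 [0, 1, 2]
```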

### Installation with conda

Though there are various ways to install Python, the one I would **strongly suggest**, particularly if you wish to eventually use the data science tools mentioned above, is via the cross-platform Anaconda distribution:

 - [Anaconda](https://www.anaconda.com/distribution/) gives you Python and the Python standard library, a command-line tool called ``conda`` which allows you to easily install many third-party packages and libraries, and additionally bundles a suite of other pre-installed third-party packages geared toward scientific computing.

To get started, download and install Anaconda following our instructions, making sure to choose a version with Python 3.7 or later.

When finished, run the following script to ensure all packages for the course are available.

In [None]:
import scipy
import numpy as np
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
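
If one of the imports above fails, it can help to see at a glance which packages are missing. The following is a minimal sketch that uses only the standard library; the helper name `find_missing` is hypothetical, and the package list is simply copied from the import cell above.

```python
import importlib.util
import sys

# Package names taken from the import cell above.
REQUIRED = ["scipy", "numpy", "pandas", "networkx", "matplotlib", "seaborn"]


def find_missing(names):
    """Return the names that the import system cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# The course assumes Python 3.7 or later.
python_ok = sys.version_info >= (3, 7)
missing = find_missing(REQUIRED)

print("Python 3.7+:", python_ok)
print("Missing packages:", missing or "none")
```

Any package reported as missing can usually be installed with ``conda install <name>`` from the system shell.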

# Launching

## Launching the IPython Shell

Start by launching the IPython interpreter by typing **``ipython``** on the command-line; alternatively, if you've installed a distribution like Anaconda or EPD, there may be a launcher specific to your system.

Once you do this, you should see a prompt like the following:
```
Python 3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 7.22.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]:
```
With that, you're ready to follow along.

## Launching the Jupyter Notebook

The Jupyter notebook is a browser-based graphical interface to the IPython shell, and builds on it a rich set of dynamic display capabilities.
As well as executing Python/IPython statements, the notebook allows the user to include formatted text, static and dynamic visualizations, mathematical equations, JavaScript widgets, and much more.
Furthermore, these documents can be saved in a way that lets other people open them and execute the code on their own systems.

Though the Jupyter notebook is viewed and edited through your web browser window, it must connect to a running Python process in order to execute code.
This process (known as a "kernel") is managed by the notebook server, which can be started by running the following command in your system shell:

```
$ jupyter notebook
```


This command will launch a local web server that will be visible to your browser.
It immediately spits out a log showing what it is doing; that log will look something like this:

```
$ jupyter notebook
[I 09:44:13.661 NotebookApp] Serving notebooks from local directory: C:\Users\YifangMa
[I 09:44:13.665 NotebookApp] Jupyter Notebook 6.3.0 is running at:
[I 09:44:13.665 NotebookApp] http://localhost:8888/
[I 09:44:13.665 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
```

Upon issuing the command, your default browser should automatically open and navigate to the listed local URL;
the exact address will depend on your system.
If the browser does not open automatically, you can open a window and manually open this address (*http://localhost:8888/* in this example).

# Help and Documentation

- Accessing Documentation with ``?`` or ``help``
- Accessing Source Code with ``??``
- Exploring Modules with Tab-Completion

In [11]:
# Python's built-in help() function looks up an object's documentation and prints it.
help(range)

Help on class range in module builtins:

class range(object)
 |  range(stop) -> range object
 |  range(start, stop[, step]) -> range object
 |  
 |  Return an object that produces a sequence of integers from start (inclusive)
 |  to stop (exclusive) by step.  range(i, j) produces i, i+1, i+2, ..., j-1.
 |  start defaults to 0, and stop is omitted!  range(4) produces 0, 1, 2, 3.
 |  These are exactly the valid indices for a list of 4 elements.
 |  When step is given, it specifies the increment (or decrement).
 |  
 |  Methods defined here:
 |  
 |  __bool__(self, /)
 |      self != 0
 |  
 |  __contains__(self, key, /)
 |      Return key in self.
 |  
 |  __eq__(self, value, /)
 |      Return self==value.
 |  
 |  __ge__(self, value, /)
 |      Return self>=value.
 |  
 |  __getattribute__(self, name, /)
 |      Return getattr(self, name).
 |  
 |  __getitem__(self, key, /)
 |      Return self[key].
 |  
 |  __gt__(self, value, /)
 |      Return self>value.
 |  
 |  __hash__(self, /)
 |

Because finding help on an object is so common and useful, IPython introduces the ``?`` character as a shorthand for accessing this documentation and other relevant information:

In [12]:
range?

In [13]:
#This notation works for just about anything, including object methods:
L = [1, 2, 3]
L.insert?

In [1]:
def square(a):
    """Return the square of a."""
    return a ** 2

In [4]:
square?
#square??
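
Outside IPython, roughly the same information is available through the standard library. This is a minimal sketch, not what IPython does internally: `inspect.getdoc` returns the docstring and `inspect.signature` returns the call signature, the two main pieces that ``square?`` displays.

```python
import inspect


def square(a):
    """Return the square of a."""
    return a ** 2


# The docstring shown by `square?`
doc = inspect.getdoc(square)

# The call signature shown by `square?`
sig = str(inspect.signature(square))

print(f"square{sig}: {doc}")  # prints: square(a): Return the square of a.
```

The ``??`` form additionally shows the source code. For functions defined in a file, `inspect.getsource` gives a similar result.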

In [None]:
#IPython's other useful interface is the use of the tab key for 
#auto-completion and exploration of the contents of objects, 
#modules, and name-spaces.
L.<TAB>

In [None]:
L._<TAB>
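
Tab completion is interactive, so it cannot be reproduced in a script. As a rough, non-interactive approximation, the built-in `dir()` lists the same attribute names; the split into `public` and `private` below is only an illustration of the difference between ``L.<TAB>`` and ``L._<TAB>``.

```python
L = [1, 2, 3]

# Names that do not start with an underscore, similar to L.<TAB>
public = [name for name in dir(L) if not name.startswith("_")]

# Names that start with an underscore, similar to L._<TAB>
private = [name for name in dir(L) if name.startswith("_")]

print(public)
print(len(private), "names starting with an underscore")
```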

# IPython Magic Commands

## Timing Code Execution: ``%timeit``

- ``%time``: Time the execution of a single statement
- ``%timeit``: Time repeated execution of a single statement for more accuracy

Another example of a useful magic function is ``%timeit``, which will automatically determine the execution time of the single-line Python statement that follows it.
For example, we may want to check the performance of a list comprehension:

In [5]:
%timeit L = [n ** 2 for n in range(1000)]

227 µs ± 2.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [None]:
import random
L = [random.random() for i in range(100000)]
%timeit L.sort()

The benefit of ``%timeit`` is that for short commands it will automatically perform multiple runs in order to attain more robust results.
For multiline statements, adding a second ``%`` sign will turn this into a cell magic that can handle multiple lines of input.
For example, here's the equivalent construction with a ``for``-loop:

In [6]:
%%timeit
L = []
for n in range(1000):
    L.append(n ** 2)


268 µs ± 12.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
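
The ``%timeit`` magic is only available inside IPython. In a plain Python script, a similar comparison can be sketched with the standard-library `timeit` module. The function names, the reduced repeat count, and the choice to report the best run are assumptions made for this example; ``%timeit`` itself reports a mean and standard deviation.

```python
import timeit


def with_comprehension():
    return [n ** 2 for n in range(1000)]


def with_loop():
    L = []
    for n in range(1000):
        L.append(n ** 2)
    return L


# repeat = number of independent runs; number = loops per run
for func in (with_comprehension, with_loop):
    runs = timeit.repeat(func, repeat=5, number=200)
    best = min(runs) / 200  # seconds per loop in the fastest run
    print(f"{func.__name__}: {best * 1e6:.1f} microseconds per loop (best of 5 runs)")
```

Exact timings will differ from machine to machine, so only the relative ordering is meaningful.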


## Profiling Full Scripts: ``%prun``

A program is made of many single statements, and sometimes timing these statements in context is more important than timing them on their own.
Python contains a built-in code profiler (which you can read about in the Python documentation), but IPython offers a much more convenient way to use this profiler, in the form of the magic function ``%prun``.

By way of example, we'll define a simple function that does some calculations:

In [9]:
def sum_of_lists(N):
    total = 0
    for i in range(5):
        L = [j ^ (j >> i) for j in range(N)]
        total += sum(L)
    return total

Now we can call ``%prun`` with a function call to see the profiled results:

In [12]:
%prun sum_of_lists(1000000)

 

The result is a table that indicates, in order of total time on each function call, where the execution is spending the most time. In this case, the bulk of execution time is in the list comprehension inside ``sum_of_lists``.
From here, we could start thinking about what changes we might make to improve the performance in the algorithm.
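
For reference, the built-in profiler can also be driven directly, without IPython. The sketch below uses the standard-library `cProfile` and `pstats` modules with a smaller `N` to keep the run short; sorting by `"tottime"` is a choice made for this example.

```python
import cProfile
import io
import pstats


def sum_of_lists(N):
    total = 0
    for i in range(5):
        L = [j ^ (j >> i) for j in range(N)]
        total += sum(L)
    return total


# Profile a single call
profiler = cProfile.Profile()
profiler.enable()
result = sum_of_lists(100000)
profiler.disable()

# Collect the report in a string, sorted by time spent inside each function
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("tottime")
stats.print_stats(5)  # show at most the five most expensive entries
report = stream.getvalue()
print(report)
```

The layout of the table depends on the Python version. In recent releases, for instance, list comprehensions may no longer appear as a separate entry.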

## Profiling Memory Use: ``%memit`` and ``%mprun``

Another aspect of profiling is the amount of memory an operation uses.
This can be evaluated with an IPython extension, the ``memory_profiler``.
We start by ``pip``-installing the extension:

```
$ pip install memory_profiler
```

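The same installation can be run from within the notebook by prefixing the command with ``!``, after which we can use IPython to load the extension: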

In [15]:
!pip install memory_profiler

Collecting memory_profiler
  Downloading memory_profiler-0.58.0.tar.gz (36 kB)
Building wheels for collected packages: memory-profiler
  Building wheel for memory-profiler (setup.py): started
  Building wheel for memory-profiler (setup.py): finished with status 'done'
  Created wheel for memory-profiler: filename=memory_profiler-0.58.0-py3-none-any.whl size=30183 sha256=fbb4f6ada6097d4210fef480c2531f7e10f71d63f3779bba7e130c0b474533cf
  Stored in directory: c:\users\yifang ma\appdata\local\pip\cache\wheels\6a\37\3e\d9e8ebaf73956a3ebd2ee41869444dbd2a702d7142bcf93c42
Successfully built memory-profiler
Installing collected packages: memory-profiler
Successfully installed memory-profiler-0.58.0


In [16]:
%load_ext memory_profiler

In [17]:
%memit sum_of_lists(1000000)

peak memory: 123.91 MiB, increment: 71.34 MiB
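
If installing an extension is not an option, the standard-library `tracemalloc` module gives a rough alternative. This is a sketch of a different technique, not what ``%memit`` does: `tracemalloc` only tracks memory allocated by Python itself, so its numbers will not match the process-level figures reported above. A smaller `N` is used to keep the run short.

```python
import tracemalloc


def sum_of_lists(N):
    total = 0
    for i in range(5):
        L = [j ^ (j >> i) for j in range(N)]
        total += sum(L)
    return total


# Trace Python-level allocations made during the call
tracemalloc.start()
result = sum_of_lists(100000)
current, peak = tracemalloc.get_traced_memory()  # both values are in bytes
tracemalloc.stop()

print(f"peak traced memory: {peak / 2**20:.2f} MiB")
```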


## Other Useful Magic Commands

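- ``%matplotlib inline``: Display Matplotlib figures inline in the notebook
- ``%conda``: Run the ``conda`` package manager within the current kernel
- ``%pwd``: Return the current working directory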

Full list: https://ipython.readthedocs.io/en/stable/interactive/magics.html

# How to Run Python Scripts

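- ``python test.py``: Run the script from the system shell
- ``Spyder``: Open the script in the Spyder IDE and run it from there
- ``Jupyter Notebook``: Run the code cell by cell in a notebook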
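
As a minimal sketch, a script such as ``test.py`` might look like the following. The contents are hypothetical and chosen only for illustration. The ``if __name__ == "__main__":`` guard runs `main()` when the file is executed with ``python test.py``, but not when it is imported from another module or notebook.

```python
# test.py (hypothetical contents, for illustration only)


def square(a):
    """Return the square of a."""
    return a ** 2


def main():
    # Print a small table of squares
    for n in range(4):
        print(n, square(n))


if __name__ == "__main__":
    main()
```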

# Resources for Further Learning

I have tried to cover the pieces and patterns in the Python language that will be most useful to a data scientist using Python, but this has by no means been a complete introduction. If you'd like to go deeper in understanding the Python language itself and how to use it effectively, you may find the following book useful:

- [*Dive Into Python*](http://www.diveintopython.net/) by Mark Pilgrim. This is a free online book that provides a ground-up introduction to the Python language.

To dig more into Python tools for data science and scientific computing:

- [*The Python Data Science Handbook*](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas. This book starts precisely where this mini-text leaves off, and provides a comprehensive guide to the essential tools in Python's data science stack, from data munging and manipulation to machine learning.

- [*Python for Data Analysis*](http://shop.oreilly.com/product/0636920023784.do) by Wes McKinney, creator of the Pandas package. This book covers the Pandas library in detail, as well as giving useful information on some of the other tools that enable it.