<a href="https://colab.research.google.com/github/4dsolutions/clarusway_data_analysis/blob/main/python_warm_up/warmup_3rd_party_datascience.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a><br/>
[![nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/4dsolutions/clarusway_data_analysis/blob/main/python_warm_up/warmup_3rd_party_datascience.ipynb)


# 3rd Party Python Libraries:  numpy and pandas

<a data-flickr-embed="true" href="https://www.flickr.com/photos/kirbyurner/52563704012/in/album-72177720296706479/" title="LMS Dashboard"><img src="https://live.staticflickr.com/65535/52563704012_71ef4beb8a_b.jpg" width="1024" height="354" alt="LMS Dashboard"></a><script async src="//embedr.flickr.com/assets/client-code.js" charset="utf-8"></script>

Python Warm-up Notebooks:

*  [Introduction to Python](warmup_python_intro.ipynb)
*  [3rd Party Libraries](warmup_3rd_party_datascience.ipynb)   (you are here)
*  [Object Types](warmup_data_structures.ipynb)
*  [Object Oriented Paradigm](warmup_object_oriented.ipynb)
*  [Calling Callables and Type Checking](warmup_callables.ipynb)
*  [Class and Static Methods, Properties](warmup_object_oriented2.ipynb)
*  [SQLite3 and Context Managers](warmup_object_sql.ipynb)
*  [Iterators and Generators](warmup_generators.ipynb) 

Once you have these two packages installed (see Installation Tips below), you will be able to import them and start exploring what they do.

In [1]:
import numpy as np
import pandas as pd

Notice how we're able to assign an alias to the package as we import it.  An alias is like a nickname, a shorter way to refer to something.  

Now we're able to write `np.array` instead of the longer `numpy.array` when we want to reach into the numpy namespace, to use its array builder.
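
As a quick check of what the alias means, the sketch below confirms that `np` and `numpy` are simply two names for the same module. The variable names `same_module` and `same_function` are only for illustration.

```python
import numpy
import numpy as np

# Both names refer to the very same module object
same_module = np is numpy

# So np.array and numpy.array are the same function
same_function = np.array is numpy.array

print(same_module, same_function)
```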

## numpy n-dimensional arrays

In [2]:
table = np.array(range(1, 11)).reshape(2,5)
table

array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10]])

That looks a little complicated.  What's going on?

The `range` type is native to Python. A range object represents a sequence of evenly spaced integers (consecutive by default) without storing them all in a list, like this:

In [3]:
one_to_ten = range(1, 11)  # not inclusive of upper bound
one_to_ten

range(1, 11)

Now that we have named our range object, let's feed it to the list type, so we can see the contents:

In [4]:
list(one_to_ten)  # this will convert it to a list type object

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
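
A range object computes its values on demand instead of storing them, which is why we needed `list` to see the contents. Here is a minimal sketch of what a range can do without being converted. The name `big` and the sizes chosen are only for illustration.

```python
one_to_ten = range(1, 11)

# A range knows its length and supports indexing and membership tests
# without first building a list of all its elements
length = len(one_to_ten)
first, last = one_to_ten[0], one_to_ten[-1]
has_seven = 7 in one_to_ten

# Even a huge range is cheap to create, since values are computed on demand
big = range(10**12)
big_last = big[-1]

print(length, first, last, has_seven, big_last)
```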

Wrapping a range in `list` or `tuple` is a handy way to see its contents:

In [5]:
list(range(2, 21, 2))     # start at 2, step by 2

[2, 4, 6, 8, 10, 12, 14, 16, 18, 20]

In [6]:
tuple(range(20, -1, -2))  # start at 20, step by -2

(20, 18, 16, 14, 12, 10, 8, 6, 4, 2, 0)

In [7]:
list(range(-5, 6))

[-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5]

The time has come to take our native Python range object, and use it to create something new, a numpy array.  

This is a vastly more powerful type of object that we use throughout our data analysis and visualization courses, and not only there.

Remember `table`?  We created that up above.  This Notebook remembers the names we have defined so far.

In [8]:
table  # 2 rows, 5 columns (10 elements)

array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10]])

In [9]:
type(table)  # nd stands for "n dimensional" i.e. as many as we want

numpy.ndarray

In [10]:
table.ndim # two dimensional

2

OK, now it's time to use `one_to_ten` (range type) to make yet another n-dimensional array.

In [11]:
np.array(one_to_ten)

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

In [12]:
np.array(one_to_ten).ndim  # one dimensional

1

In [13]:
# print(dir(np.array(one_to_ten)))  # so many superpowers!
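
To give a taste of those superpowers, here is a small sketch showing a few things an array can do that a plain range or list cannot. The variable names are only for illustration, and this is a sampling, not a tour of everything `dir` would list.

```python
import numpy as np

arr = np.array(range(1, 11))

# Arithmetic applies to every element at once, no loop needed
doubled = arr * 2
squares = arr ** 2

# Summary methods come built in
total = arr.sum()
average = arr.mean()

# Attributes describe the array itself
shape = arr.shape

print(doubled, squares, total, average, shape)
```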

A numpy array may be shaped into rows and columns.  

The range object, used as input, gives us an array we call "one dimensional" i.e. it doesn't have any columns (yet). 

However, once a range is fed to `np.array`, we get a new type of object that's quite happy to be turned into rows and columns, as long as the shape we ask for matches the number of elements we give (rows times columns must equal 10 in this case).

In [14]:
np.array(one_to_ten).reshape(5,2)

array([[ 1,  2],
       [ 3,  4],
       [ 5,  6],
       [ 7,  8],
       [ 9, 10]])
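
The sketch below illustrates the size rule: a reshape succeeds only when the requested shape accounts for every element. It also uses `-1`, which asks numpy to work out one dimension for us. The flag name `failed` is only for illustration.

```python
import numpy as np

arr = np.array(range(1, 11))   # 10 elements

# Passing -1 asks numpy to infer that dimension from the total size
two_rows = arr.reshape(2, -1)
print(two_rows.shape)

# 3 x 4 would need 12 elements, so this request raises ValueError
failed = False
try:
    arr.reshape(3, 4)
except ValueError as err:
    failed = True
    print("reshape failed:", err)
```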

## range versus arange and linspace

The native `range` type is extremely useful in Python, and like `list`, it's a sequence.  However, `range` works only with integers.

numpy augments our ability to create ranges.  `np.arange` is a lot like `range`, except that the start, stop and step may be floating point numbers, and the result is an array.

In [15]:
np.arange(0, 4, 0.1)

array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. , 1.1, 1.2,
       1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2. , 2.1, 2.2, 2.3, 2.4, 2.5,
       2.6, 2.7, 2.8, 2.9, 3. , 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8,
       3.9])

`np.linspace` lets you specify the start and stop points (like on a number line) and say how many evenly spaced points you want, including those two.  The results are floating point numbers by default.

In [16]:
np.linspace(0, 4, 20)

array([0.        , 0.21052632, 0.42105263, 0.63157895, 0.84210526,
       1.05263158, 1.26315789, 1.47368421, 1.68421053, 1.89473684,
       2.10526316, 2.31578947, 2.52631579, 2.73684211, 2.94736842,
       3.15789474, 3.36842105, 3.57894737, 3.78947368, 4.        ])
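
To make the contrast concrete, here is a small side-by-side sketch: `arange` takes a step and leaves out the stop value, while `linspace` takes a count and includes it. The variable names are only for illustration.

```python
import numpy as np

# arange: give the step; the stop value is excluded
by_step = np.arange(0, 4, 1)

# linspace: give the number of points; the stop value is included
by_count = np.linspace(0, 4, 5)

# retstep=True also reports the spacing numpy worked out
points, step = np.linspace(0, 4, 20, retstep=True)

print(by_step, by_count, step)
```

With 20 points from 0 to 4 there are 19 gaps, so the reported step is 4/19, which matches the spacing in the output above.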

In data analysis work, we often reach for `arange` or `linspace` before we reach for `range`.  

With some additional work though, `range` may be used in a list comprehension to get fractional output.

In [17]:
[x/10 for x in range(11)]  

[0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

List comprehension syntax is quite powerful and will later become our jumping off point to set and dictionary comprehensions, as well as "generator expressions".
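
As a brief preview, the same syntax carries over to the other forms with only a change of brackets. This is a minimal sketch with made-up variable names; later notebooks cover these in more depth.

```python
# List comprehension: square brackets
tenths = [x / 10 for x in range(11)]

# Set comprehension: curly braces, duplicates collapse
remainders = {x % 3 for x in range(10)}

# Dictionary comprehension: curly braces with key: value pairs
squares = {x: x ** 2 for x in range(1, 6)}

# Generator expression: parentheses, values produced one at a time
total = sum(x ** 2 for x in range(1, 6))

print(tenths, remainders, squares, total)
```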

## pandas DataFrames

Now that we have a two-dimensional numpy array named `table`, let's turn it into a pandas DataFrame just to see what that looks like:

In [18]:
df = pd.DataFrame(table)
df

|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 | 5 |
| 1 | 6 | 7 | 8 | 9 | 10 |


Interesting.  The DataFrame uses the array as data, but adds a frame of row and column numbers.

We may use those row and column numbers to access the data. 

In [19]:
df.iloc[0, 1]  # row 0, column 1

2

In [20]:
df.iloc[1, 4]  # row 1, column 4

10

The numpy array allows this too.  So why did we need a DataFrame?

In [21]:
table[0,1]

2

In [22]:
table[1,4]

10

In `pandas` we can go beyond merely numeric rows and columns, and use labels of our choosing.

In [23]:
df.index = ("Row 1", "Row 2")
df.columns = ("A", "B", "C", "D", "E")
df

|       | A | B | C | D | E |
|-------|---|---|---|---|---|
| Row 1 | 1 | 2 | 3 | 4 | 5 |
| Row 2 | 6 | 7 | 8 | 9 | 10 |


Now we're able to access the data using these new names for the rows and columns, making the code we write more robust and easier to read.  

DataFrames make more sense when they tell us about what they contain.

In [24]:
df.loc['Row 2', 'D']

9

In [25]:
df.loc['Row 1', 'B']

2
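
To tie the two access styles together, the sketch below rebuilds the labeled DataFrame in one step and reads the same cell by label and by position. Passing `index` and `columns` to the constructor is an alternative to assigning them afterwards; the other variable names are only for illustration.

```python
import numpy as np
import pandas as pd

table = np.array(range(1, 11)).reshape(2, 5)
df = pd.DataFrame(table,
                  index=["Row 1", "Row 2"],
                  columns=["A", "B", "C", "D", "E"])

# Labels with .loc, positions with .iloc: same cell either way
by_label = df.loc["Row 2", "D"]
by_position = df.iloc[1, 3]

# A whole column or a whole row comes back as a pandas Series
column_d = df["D"]
row_1 = df.loc["Row 1"]

print(by_label, by_position)
print(column_d.tolist(), row_1.tolist())
```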

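DataFrames need not start out as numpy arrays.  pandas can also read a table straight from a database, here a SQLite file named `roller_coasters.db`:

In [26]: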
import sqlite3 as sql
conn = sql.connect('roller_coasters.db')

In [27]:
# help(pd.read_sql)

In [28]:
coasters = pd.read_sql("SELECT * FROM Coasters", conn)

In [29]:
conn.close()

In [30]:
coasters

|     | Name | Park | State | Country | Duration | Speed | Height | VertDrop | Length | Yr_Opened | Inversions |
|-----|------|------|-------|---------|----------|-------|--------|----------|--------|-----------|------------|
| 0 | Top Thrill Dragster | Cedar Point | Ohio | USA | 60 | 120.0 | 420.0 | 400.00 | 2800.00 | 2003 | 0 |
| 1 | Superman The Escape | Six Flags Magic Mountain | California | USA | 28 | 100.0 | 415.0 | 328.10 | 1235.00 | 1997 | 0 |
| 2 | Millennium Force | Cedar Point | Ohio | USA | 165 | 93.0 | 310.0 | 300.00 | 6595.00 | 2000 | 0 |
| 3 | Goliath | Six Flags Magic Mountain | California | USA | 180 | 85.0 | 235.0 | 255.00 | 4500.00 | 2000 | 0 |
| 4 | Titan | Space World | Kitakyushu | Japan | 180 | 71.5 | 166.0 | 178.00 | 5019.67 | 1994 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 71 | Oblivion | Alton Towers | Alton | England | 75 | 68.0 | 65.0 | 180.00 | 1222.00 | 1998 | 0 |
| 72 | Stunt Fall | Warner Bros. Movie World | San Martin de la Vega | Spain | 92 | 65.6 | 191.6 | 177.00 | 1204.00 | 2002 | 3 |
| 73 | Hayabusa | Tokyo SummerLand | Tokyo | Japan | 108 | 60.3 | 137.8 | 124.67 | 2559.10 | 1992 | 0 |
| 74 | Top Gun | Paramount Canada's Wonderland | Vaughan | Canada | 125 | 56.0 | 102.0 | 93.00 | 2170.00 | 1995 | 5 |
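
If you don't have `roller_coasters.db` at hand, the same pattern can be tried against a throwaway in-memory database. The sketch below is a stand-in, not the real file: it copies three rows and three columns from the output above, and the column types in the `CREATE TABLE` statement are assumptions. It also uses `contextlib.closing` so the connection is closed even if a query fails.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# A few rows copied from the output above; the real table has more columns
rows = [
    ("Top Thrill Dragster", "Cedar Point", 120.0),
    ("Superman The Escape", "Six Flags Magic Mountain", 100.0),
    ("Millennium Force", "Cedar Point", 93.0),
]

# closing() guarantees conn.close() runs, even if a query raises
with closing(sqlite3.connect(":memory:")) as conn:
    # Column types here are assumed, not taken from the real database
    conn.execute("CREATE TABLE Coasters (Name TEXT, Park TEXT, Speed REAL)")
    conn.executemany("INSERT INTO Coasters VALUES (?, ?, ?)", rows)
    fast = pd.read_sql(
        "SELECT Name, Speed FROM Coasters WHERE Speed >= ? ORDER BY Speed DESC",
        conn,
        params=(100,),
    )

print(fast)
```

Letting SQL do the filtering and sorting means only the rows we care about ever reach the DataFrame.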


## Installation Tips

Many computer languages come with onboard utilities for accessing 3rd party modules and packages.  Python's native installer is `pip` (sometimes invoked as `pip3`).  Python's main online repository is [PyPI](https://pypi.org/) (the Python Package Index).

Another Python distro we can recommend (many do) is from [Anaconda](https://www.anaconda.com/products/distribution).  This free open source alternative to the Python at [Python.org](https://python.org) has its own installer named `conda`.  The advantage of this path is that it already includes a lot of 3rd party packages, such as numpy and pandas.
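
Whichever route you take, one way to check whether the two packages are available is to ask Python's import machinery. This is a minimal sketch; the `status` dictionary is only for illustration.

```python
import importlib.util

# find_spec returns None when a package cannot be found
status = {}
for name in ("numpy", "pandas"):
    status[name] = importlib.util.find_spec(name) is not None
    print(name, "is installed" if status[name] else "is missing")
```

If either one is reported missing, running `pip install numpy pandas` (or `conda install numpy pandas` in an Anaconda environment) from a terminal would typically add it.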