# NumPy
* The core library for manipulating and cleaning data in Python is pandas.
* The core library underneath pandas is NumPy (Numerical Python).
* NumPy is all about n-dimensional arrays and operating on them quickly.
* We won't talk about NumPy much beyond today, but it's worth a quick look...

In [None]:
import numpy as np


In [None]:
# Things we can do with an ndarray include:
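A minimal sketch of what a cell like this might show. The array `a` and its values are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

# Build a one-dimensional ndarray from a Python list (values are illustrative)
a = np.array([1, 2, 3])
print(a)        # the array contents
print(a.ndim)   # number of dimensions
print(a.shape)  # length along each dimension
print(a.sum())  # ndarrays carry methods such as sum(), mean(), max()
```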


In [None]:
# Let's create something multidimensional
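One way to do this, as a sketch: pass a list of lists to `np.array`. The name `b` and the values are assumptions chosen for illustration.

```python
import numpy as np

# A list of lists becomes a two-dimensional array (a matrix)
b = np.array([[1, 2, 3], [4, 5, 6]])
print(b)
print(b.ndim)   # two dimensions
print(b.shape)  # two rows, three columns
```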


In [None]:
# What is this ndarray filled with?


In [None]:
# What is this really?
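The two questions above can probably be answered by inspecting the array's `dtype`. This is a sketch under that assumption; the array `b` is illustrative, and the exact integer width you see depends on your platform.

```python
import numpy as np

b = np.array([[1, 2, 3], [4, 5, 6]])

# dtype describes the single type shared by every element of the array
print(b.dtype)        # an integer dtype; the width is platform dependent
print(b.dtype.name)   # the dtype's name as a string
print(type(b[0, 0]))  # each element is a NumPy scalar, not a plain Python int
```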


In [None]:
# And some automatic typecasting is available
c = np.array([2.2, 5, 1.1])
print(c.dtype)
print(c)
#l=[2.2, 5, 1.1]
#print(type(l[0]))
#print(type(l[1]))
#print(type(l[2]))


In [None]:
# You'll see this code a lot in examples
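The original cell is empty, so this is a guess at the kind of code meant here: helper functions that create pre-filled arrays show up in many examples. The names `d`, `e`, and `f` are hypothetical.

```python
import numpy as np

d = np.zeros((2, 3))      # 2x3 array filled with 0.0
e = np.ones((2, 3))       # 2x3 array filled with 1.0
f = np.random.rand(2, 3)  # 2x3 array of uniform random floats in [0, 1)
print(d)
print(e)
print(f)
```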


In [None]:
# just like range! ten to fifty by twos!
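A sketch of the `np.arange` call this comment describes. Like `range`, the stop value is excluded, so the last element is 48, not 50. The name `g` is an assumption.

```python
import numpy as np

# start at 10, stop before 50, step by 2
g = np.arange(10, 50, 2)
print(g)
```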

In [None]:
# last one
# 15 numbers from 0 (inclusive) to 2 (inclusive)
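This description matches `np.linspace`, which includes both endpoints by default. A minimal sketch, with `h` as an assumed name:

```python
import numpy as np

# 15 evenly spaced values, starting at 0 and ending at 2 (both included)
h = np.linspace(0, 2, 15)
print(h)
```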

## Array Operations
* We can do many things with arrays, such as mathematical manipulation (addition, subtraction, squaring, exponentiation), and we can work with boolean arrays, whose values are `True` or `False`.
* We can also do matrix manipulation such as product, transpose, inverse, and so forth.
* Key difference: arithmetic operations are **element-wise**.

In [None]:
a = np.array([10,20,30,40])
b = np.array([1, 2, 3,4])
c = a - b  # element-wise subtraction
c # this is the same as `display(c)`
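A few more operations from the list above, as a sketch. The matrix `m` is a hypothetical example; `a` and `b` reuse the values from the cell.

```python
import numpy as np

a = np.array([10, 20, 30, 40])
b = np.array([1, 2, 3, 4])

print(a + b)   # element-wise addition: 11, 22, 33, 44
print(a * b)   # element-wise product, NOT a matrix product: 10, 40, 90, 160
print(b ** 2)  # element-wise square: 1, 4, 9, 16
print(a > 25)  # element-wise comparison yields a boolean array

# Matrix operations use dedicated operators and functions
m = np.array([[1, 2], [3, 4]])
print(m @ m)             # matrix product
print(m.T)               # transpose
print(np.linalg.inv(m))  # inverse
```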

## Example
* Metrication in the US ([Wiki article](https://en.wikipedia.org/wiki/Metrication_in_the_United_States))

* Might want to convert, for our international audience...

![weather forecast](datasets/weather.jpg)

In [None]:
fahrenheit = np.array([32,27,32,21,29,16])
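The conversion itself can be sketched in one line using the standard formula C = (F - 32) * 5/9. The name `celsius` is an assumption.

```python
import numpy as np

fahrenheit = np.array([32, 27, 32, 21, 29, 16])

# One expression converts every element: 32 and 5/9 are applied to each value
celsius = (fahrenheit - 32) * (5 / 9)
print(celsius)
```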



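* What's happening underneath is so beautiful. It's called *vectorization*: the operation is applied to the whole array at once. Stretching a single value (like `32`) across every element is called *broadcasting*.
* Each item in the ndarray can be operated on independently - computing one element never depends on the result for another.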
* I believe this is a data science **threshold concept**, and we should call it out:

*The fundamental idea behind array programming is that operations apply at once to an entire set of values. This makes it a high-level programming model as it allows the programmer to think and operate on whole aggregates of data, without having to resort to explicit loops of individual scalar operations.* (Wikipedia)



* Vectorization allows for:
1. Massive parallelization and thus efficiency
2. Increased readability
3. Added flexibility
4. Increased code quality
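To check the efficiency claim outside of a notebook, here is a rough sketch using the standard library `timeit` module instead of the `%%timeit` magic. The function names are hypothetical, and the timings will vary by machine, so treat the numbers as indicative only.

```python
import timeit
import numpy as np

fahrenheit = np.linspace(-10, 20, 1000)

def convert_vectorized():
    # One array expression, the loop runs inside NumPy
    return (fahrenheit - 32) * (5 / 9)

def convert_loop():
    # The same arithmetic, one Python-level operation per element
    return [(temp - 32) * (5 / 9) for temp in fahrenheit]

t_vec = timeit.timeit(convert_vectorized, number=200)
t_loop = timeit.timeit(convert_loop, number=200)
print(f"vectorized: {t_vec:.4f} s, loop: {t_loop:.4f} s")
```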

In [None]:
%%timeit fahrenheit = np.linspace(-10, 20, 1000)
# Solve this the numpy way, vectorized!
celsius = (fahrenheit - 32) * (5/9)

In [None]:
%%timeit fahrenheit = np.linspace(-10, 20, 1000)


In [None]:
# How else could we solve this iteratively?
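Two iterative options, sketched here as plain Python. The variable names are assumptions, and the final check only confirms that both agree with the vectorized answer.

```python
import numpy as np

fahrenheit = np.linspace(-10, 20, 1000)

# Option 1: an explicit loop that appends one converted value at a time
celsius_loop = []
for temp in fahrenheit:
    celsius_loop.append((temp - 32) * (5 / 9))

# Option 2: a list comprehension, the same work in one expression
celsius_comp = [(temp - 32) * (5 / 9) for temp in fahrenheit]

# Both agree with the vectorized result
celsius_vec = (fahrenheit - 32) * (5 / 9)
print(np.allclose(celsius_loop, celsius_vec))
print(np.allclose(celsius_comp, celsius_vec))
```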

In [None]:
%%timeit fahrenheit = np.linspace(-10, 20, 1000)


In [None]:
# Any other ideas on solving this?
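Two more possibilities, offered as a sketch: the built-in `map`, and `np.vectorize`, which wraps a scalar function so it accepts arrays. Note that `np.vectorize` is a convenience wrapper and still loops in Python, so it is not expected to match true vectorized arithmetic for speed. The helper `to_celsius` is hypothetical.

```python
import numpy as np

fahrenheit = np.linspace(-10, 20, 1000)

def to_celsius(temp):
    # Scalar conversion for a single temperature
    return (temp - 32) * (5 / 9)

# map applies the function lazily; list() forces the computation
celsius_map = list(map(to_celsius, fahrenheit))

# np.vectorize wraps the scalar function so it can be called on an array
celsius_vfunc = np.vectorize(to_celsius)(fahrenheit)

print(np.allclose(celsius_map, celsius_vfunc))
```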

In [None]:
%%timeit fahrenheit = np.linspace(-10, 20, 1000)


* How do we better understand this? What's going on underneath?

In [None]:
# map is lazy: each value is converted only when the loop asks for it
celsius = map(lambda temp: (temp - 32) * (5/9), fahrenheit)
for i in celsius:
    print(i)

# Final thoughts
1. Vectorization is the process of applying array programming techniques to data. When you vectorize an operation, the function is broadcast across the elements of an array, which allows the operations to be parallelized.
2. Vectorization is powerful, and what really matters here is learning to think in a vectorized way. This aligns well with functional programming methods, and is a key to being an effective data scientist. #loopsaredead


# Boolean masking
* This is a **critical concept** (and not difficult!) in this course.
* This will impact how you look at data and understand how queries work.

* A Boolean mask is analogous to a bitwise mask!

* Ok, it's really simple. You take a range of values which are `True (1)` or `False (0)` and you either AND them or OR them with another range of values which are `True` or `False`.

* See https://stackoverflow.com/questions/28282869/shift-masked-bits-to-the-lsb

In [None]:
#bitwise masking
a=np.random.randint(2, size=10)
b=np.random.randint(2, size=10)
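A sketch of how the two arrays could be combined. On arrays of 0s and 1s, `&` and `|` act bit by bit, element by element.

```python
import numpy as np

a = np.random.randint(2, size=10)  # ten random 0/1 values
b = np.random.randint(2, size=10)
print(a)
print(b)
print(a & b)  # 1 only where both a and b are 1
print(a | b)  # 1 where either a or b is 1
```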

In [None]:
#boolean masking
import random
a=[random.choice([True,False]) for x in range(0,10)]
b=[random.choice([True,False]) for x in range(0,10)]
c=[random.choice([True,False]) for x in range(0,10)]
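Plain Python lists do not support `&` or `|`, so one way to combine these masks is to convert them to NumPy arrays first. This is a sketch under that assumption, reusing the names from the cell.

```python
import random
import numpy as np

a = np.array([random.choice([True, False]) for _ in range(10)])
b = np.array([random.choice([True, False]) for _ in range(10)])
c = np.array([random.choice([True, False]) for _ in range(10)])

print(a & b)        # element-wise AND
print(a | b)        # element-wise OR
print((a & b) | c)  # masks can be chained
print(~a)           # element-wise NOT
```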


# Why are we doing this?
* It's really common to take an array of data and mask it to reveal a result.
* This works hand in hand with broadcasting! This is a highly parallelizable operation!
* We can broadcast individual values with comparison operators.

In [None]:
a = np.random.randint(5,size=10)
mask = [random.choice([True,False]) for x in range(0,10)]
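A sketch of applying the mask, and of building a mask with a broadcast comparison. The name `big` and the threshold `2` are assumptions for illustration.

```python
import random
import numpy as np

a = np.random.randint(5, size=10)
mask = [random.choice([True, False]) for _ in range(10)]

print(a)
print(a[mask])  # keeps only the elements where mask is True

# A comparison is broadcast across the array and builds a mask for us
big = a > 2
print(big)
print(a[big])   # equivalently: a[a > 2]
```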



# Indexing operator
* The indexing operator in numpy and pandas is incredibly overloaded. You can use it to
  * get a single item out of the array, e.g. `a[0]`
  * slice a range out of the array, e.g. `a[1:4]`
  * apply a boolean mask to an array, e.g. `a[[True, False, True]]` for a three-element array (the mask must be as long as the array)
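The three uses side by side, as a small sketch with illustrative values:

```python
import numpy as np

a = np.array([10, 20, 30, 40, 50])

print(a[0])    # single item: 10
print(a[1:4])  # slice: the elements at positions 1, 2 and 3
print(a[[True, False, True, False, True]])  # boolean mask, same length as a
```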

In [None]:
# Extended Topic
# Just for fun, here's how to implement the indexing operator yourself
class HomemadePsychologist:
    def __getitem__(self, key):
        print("""That's a good question, what do you think about the question '{}' Why do you think you are wondering about that?""".format(key))

psych = HomemadePsychologist()
psych["Are dogs fun?"]

In [None]:
a=np.random.randint(2, size=10)
a
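One thing to watch for, sketched with a hypothetical `values` array: an array of 0s and 1s is not a boolean mask until it is converted. Indexing with integers selects *positions*, so the two results below differ.

```python
import numpy as np

values = np.arange(10, 20)         # hypothetical data: 10, 11, ..., 19
a = np.random.randint(2, size=10)  # ten random 0/1 integers

picked = values[a.astype(bool)]  # boolean mask: keeps positions where a == 1
fancy = values[a]                # integer indexing: values[0] or values[1], ten times

print(picked)
print(fancy)
```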

You and boolean masking with numpy (and thus pandas!) will become good friends. :)