#  <center> Pandas, NumPy, Matplotlib </center>

<div class="alert alert-block alert-info">

# NumPy Arrays
 </div>
 
## NumPy Arrays overview

* Core (or Standard) Python Library provides lists and 1D arrays (array.array)

  * Lists are general containers for objects
  * Arrays are 1D containers for objects of the same type
  * Limited functionality
  * Some memory and performance overhead associated with these structures

* NumPy provides multidimensional arrays (numpy.ndarray)
  * Can store many elements of the same data type in multiple dimensions
  * cf. Fortran/C/C++ arrays
  * More functionality than Core Python, e.g. many convenient methods for array manipulation
  * Efficient storage and execution

* [Extensive online documentation](https://docs.scipy.org/doc/numpy/)

Let's begin our introduction by exploring how to create NumPy arrays.
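
Before creating arrays, the difference between a list and an `ndarray` can be made concrete. The following is a minimal sketch with illustrative values (the names `values`, `arr`, and `mixed` are not from the original notebook):

```python
import numpy as np

values = [1, 2, 3]
arr = np.array(values)

# Multiplying a list repeats it; multiplying an array scales each element.
print(values * 2)   # [1, 2, 3, 1, 2, 3]
print(arr * 2)      # [2 4 6]

# Every element of an ndarray shares one data type,
# so mixing integers and floats upcasts the integers.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)  # float64
```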

## Creating NumPy Arrays


In [3]:
import numpy as np

In [4]:
my_list = [1,2,3]
np.array(my_list)

array([1, 2, 3])

In [5]:
my_matrix = [[1,2,3],[4,5,6],[7,8,9]]
np.array(my_matrix)

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

### Using built-in Methods

There are many built-in functions for generating arrays:

In [6]:
np.arange(0,8)

array([0, 1, 2, 3, 4, 5, 6, 7])

In [7]:
np.zeros(3)

array([0., 0., 0.])

In [8]:
np.zeros((4,4))

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

In [9]:
np.ones(3)

array([1., 1., 1.])

In [10]:
np.eye(4)

array([[1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]])

In [11]:
np.linspace(0,20,10)

array([ 0.        ,  2.22222222,  4.44444444,  6.66666667,  8.88888889,
       11.11111111, 13.33333333, 15.55555556, 17.77777778, 20.        ])

In [12]:
np.random.rand(2)

array([0.61376164, 0.97317508])

In [13]:
np.random.rand(3,3)

array([[0.52062612, 0.20417339, 0.29124143],
       [0.26491589, 0.89025899, 0.27359915],
       [0.85060093, 0.81233277, 0.19795627]])

In [14]:
np.random.randn(3)

array([ 0.30780686,  2.00467905, -0.55733128])

In [15]:
np.random.randint(1,100)

33

In [16]:
np.random.randint(1,100,6)

array([68, 50, 33, 10, 66, 50])

## Array Attributes and Methods


In [21]:
arr = np.arange(25)
ranarr = np.random.randint(0,50,10)
# Examine key array attributes
print("Dimensions ", arr.ndim)   # Number of dimensions
print("Shape      ", arr.shape)  # number of elements in each dimension
print("Size       ", arr.size)   # total number of elements
print("Data type  ", arr.dtype)  # data type of the elements; np.arange over integers gives an integer type (int64 here)
print("Max", arr.max())
print("Max place", arr.argmax())
print("mean", arr.mean())

Dimensions  1
Shape       (25,)
Size        25
Data type   int64
Max 24
Max place 24
mean 12.0


In [18]:
arr.reshape(5,5)

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24]])

# NumPy Indexing and Selection


In [22]:
arr[8]

8

In [23]:
arr[1:5]

array([1, 2, 3, 4])

In [24]:
arr[:6]

array([0, 1, 2, 3, 4, 5])

In [25]:
arr_2d = np.array(([5,7,9],[10,12,14],[15,17,19]))

#Show
arr_2d

array([[ 5,  7,  9],
       [10, 12, 14],
       [15, 17, 19]])

In [26]:
arr_2d[1][0]

10

In [27]:
arr_2d[1,0]

10

In [28]:
arr_2d[:2,1:]

array([[ 7,  9],
       [12, 14]])

In [29]:
arr>5

array([False, False, False, False, False, False,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True])

In [30]:
arr[arr>5]

array([ 6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22,
       23, 24])
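
Boolean masks like `arr>5` can also be combined. The sketch below assumes the same `arr = np.arange(25)` as above; the names `between`, `edges`, and `clipped` are illustrative only:

```python
import numpy as np

arr = np.arange(25)

# Use & (and), | (or), ~ (not); each comparison needs its own parentheses.
between = arr[(arr > 5) & (arr < 10)]
print(between)        # [6 7 8 9]

edges = arr[(arr < 2) | (arr > 22)]
print(edges)          # [ 0  1 23 24]

# A mask can also select the elements to assign to (done on a copy here).
clipped = arr.copy()
clipped[clipped > 20] = 20
print(clipped.max())  # 20
```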

# NumPy Operations


In [31]:
def f(x):
    return x**3

x = np.array([1,2,3,4,5,6,7,8,9])
y = f(x)

print(y)

[  1   8  27  64 125 216 343 512 729]


## Universal Array Functions

NumPy comes with many [universal array functions](http://docs.scipy.org/doc/numpy/reference/ufuncs.html) (ufuncs): mathematical operations that are applied element by element across a whole array. Let's show some common ones:

In [32]:
#Taking Square Roots
np.sqrt(arr)

array([0.        , 1.        , 1.41421356, 1.73205081, 2.        ,
       2.23606798, 2.44948974, 2.64575131, 2.82842712, 3.        ,
       3.16227766, 3.31662479, 3.46410162, 3.60555128, 3.74165739,
       3.87298335, 4.        , 4.12310563, 4.24264069, 4.35889894,
       4.47213595, 4.58257569, 4.69041576, 4.79583152, 4.89897949])

In [33]:
np.sin(arr)

array([ 0.        ,  0.84147098,  0.90929743,  0.14112001, -0.7568025 ,
       -0.95892427, -0.2794155 ,  0.6569866 ,  0.98935825,  0.41211849,
       -0.54402111, -0.99999021, -0.53657292,  0.42016704,  0.99060736,
        0.65028784, -0.28790332, -0.96139749, -0.75098725,  0.14987721,
        0.91294525,  0.83665564, -0.00885131, -0.8462204 , -0.90557836])
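
To illustrate what "element by element" means, this sketch compares a ufunc with an explicit Python loop and shows two binary ufuncs. The arrays `a` and `b` are made up for the example:

```python
import math
import numpy as np

arr = np.arange(25)

# A ufunc applies the operation to every element without an explicit loop.
vectorized = np.sqrt(arr)
looped = np.array([math.sqrt(v) for v in arr])
print(np.allclose(vectorized, looped))  # True

# Binary ufuncs combine two arrays element by element.
a = np.array([1, 5, 3])
b = np.array([4, 2, 6])
print(np.maximum(a, b))  # [4 5 6]
print(np.add(a, b))      # [5 7 9]
```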

<div class="alert alert-block alert-info">

# Pandas
 </div>



A Series is built on top of the NumPy array object.
- A Series can be indexed by a label.
- It also doesn't need to hold numeric data; it can hold any arbitrary Python object.

Let's explore this concept through some examples:
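
A minimal sketch of both points, using hypothetical labels and values (`labels`, `ser`, and `objs` are names chosen for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical labels and values chosen for illustration.
labels = ['a', 'b', 'c']
ser = pd.Series(np.array([10, 20, 30]), index=labels)

print(ser['b'])      # 20  (lookup by label)
print(ser.iloc[0])   # 10  (lookup by position)

# A Series can also hold arbitrary Python objects; its dtype is then object.
objs = pd.Series([len, 'text', 3.5])
print(objs.dtype)    # object
```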

#  DataFrames: creating, reading, writing


In [115]:
import pandas as pd


In [116]:
df = pd.DataFrame(np.random.randn(4,3), index=['A','B','C','D'], columns=['X','Y','Z'])


In [117]:
df

          X         Y         Z
A  1.051060  1.170169  1.884653
B -1.674935 -1.099609  0.229938
C -0.137199 -0.710915 -0.351819
D -2.268068 -0.463825  1.374155


**CSV** Data can be stored in any of a number of different forms and formats. By far the most basic of these is the humble CSV file: plain text with one record per line and the values separated by commas. A CSV file can be loaded into a DataFrame with `pd.read_csv()`:


In [118]:
df1 = pd.read_csv('inputs/df1.csv')
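
Writing works the other way round with `to_csv()`. Because the contents of `inputs/df1.csv` are not shown here, the sketch below uses a hypothetical in-memory CSV string (`csv_text`) instead of a file on disk:

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = "name,score\nalice,90\nbob,85\n"

# read_csv accepts a path or any file-like object.
scores = pd.read_csv(io.StringIO(csv_text))
print(scores.shape)  # (2, 2)

# to_csv with no path returns the CSV as a string;
# index=False leaves out the row labels.
written = scores.to_csv(index=False)
round_trip = pd.read_csv(io.StringIO(written))
print(round_trip.equals(scores))  # True
```

Passing a file path to `to_csv()` instead writes the same text to disk.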

<div class="alert alert-block alert-info">

#  Selection, Assigning data and Indexing
</div>




## Selection

Let's learn the various methods to grab data from a DataFrame.
There are two ways of selecting a specific Series out of a `DataFrame`: attribute access (`df.Y`) and the indexing operator (`df['Y']`).

The indexing operator `[]` has the advantage that it can handle column names with reserved characters in them, such as spaces.
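
As a small sketch of the two selection styles, attribute access and the `[]` operator, consider a hypothetical frame `items` whose second column name contains a space:

```python
import pandas as pd

# Hypothetical frame; 'unit price' contains a space on purpose.
items = pd.DataFrame({'qty': [2, 5], 'unit price': [1.5, 4.0]})

# Both forms return the same Series for a simple column name.
print(items.qty.equals(items['qty']))  # True

# Only the indexing operator works when the name is not a valid identifier.
print(items['unit price'].tolist())    # [1.5, 4.0]
```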


In [119]:
df

          X         Y         Z
A  1.051060  1.170169  1.884653
B -1.674935 -1.099609  0.229938
C -0.137199 -0.710915 -0.351819
D -2.268068 -0.463825  1.374155


In [120]:
df.columns

Index(['X', 'Y', 'Z'], dtype='object')

In [121]:
df.index

Index(['A', 'B', 'C', 'D'], dtype='object')

In [122]:
df['Y']

A    1.170169
B   -1.099609
C   -0.710915
D   -0.463825
Name: Y, dtype: float64

In [123]:
# Pass a list of column names
df[['Y','Z']]

          Y         Z
A  1.170169  1.884653
B -1.099609  0.229938
C -0.710915 -0.351819
D -0.463825  1.374155


### Creating a new column:
**DataFrame Columns are just Series**

In [124]:
type(df['Z'])

pandas.core.series.Series

In [125]:
df['new'] = df['Z'] + df['Y']

In [126]:
df

          X         Y         Z       new
A  1.051060  1.170169  1.884653  3.054822
B -1.674935 -1.099609  0.229938 -0.869671
C -0.137199 -0.710915 -0.351819 -1.062734
D -2.268068 -0.463825  1.374155  0.910330


### Removing Columns

In [127]:
df.drop('new',axis=1)

          X         Y         Z
A  1.051060  1.170169  1.884653
B -1.674935 -1.099609  0.229938
C -0.137199 -0.710915 -0.351819
D -2.268068 -0.463825  1.374155


In [128]:
# Not inplace unless specified!
df

          X         Y         Z       new
A  1.051060  1.170169  1.884653  3.054822
B -1.674935 -1.099609  0.229938 -0.869671
C -0.137199 -0.710915 -0.351819 -1.062734
D -2.268068 -0.463825  1.374155  0.910330


In [129]:
df.drop('new',axis=1,inplace=True)
#or 
#df_newVer = df.drop('new',axis=1)


Rows can be dropped the same way, using `axis=0`:

In [130]:
df.drop('A',axis=0)

          X         Y         Z
B -1.674935 -1.099609  0.229938
C -0.137199 -0.710915 -0.351819
D -2.268068 -0.463825  1.374155


### Conditional Selection

An important feature of pandas is conditional selection using bracket notation, very similar to numpy:

In [131]:
df>0

       X      Y      Z
A   True   True   True
B  False  False   True
C  False  False  False
D  False  False   True


In [132]:
df[df>0]

          X         Y         Z
A  1.051060  1.170169  1.884653
B       NaN       NaN  0.229938
C       NaN       NaN       NaN
D       NaN       NaN  1.374155
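
A condition on a single column selects whole rows instead of masking individual cells. The sketch below uses a hypothetical frame with fixed values (rather than random ones) so that the result is reproducible:

```python
import pandas as pd

# Fixed values (instead of random ones) so the result is reproducible.
frame = pd.DataFrame({'X': [1.0, -2.0, 3.0], 'Y': [-1.0, 5.0, 6.0]},
                     index=['A', 'B', 'C'])

# A condition on one column yields a boolean Series that selects whole rows.
positive_x = frame[frame['X'] > 0]
print(positive_x.index.tolist())     # ['A', 'C']

# Combine conditions with & or |, wrapping each one in parentheses.
both_positive = frame[(frame['X'] > 0) & (frame['Y'] > 0)]
print(both_positive.index.tolist())  # ['C']
```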


<div class="alert alert-block alert-info">

#  Operations
</div>


There are lots of operations with pandas that will be really useful to you, but don't fall into any distinct category. Let's show them here in this lecture:

In [155]:
df = pd.DataFrame({'col1':[1,2,3,4],'col2':[444,555,666,444],'col3':['aa','cc','dd','ee']})
df.head()

   col1  col2 col3
0     1   444   aa
1     2   555   cc
2     3   666   dd
3     4   444   ee


### Info on Unique Values

In [156]:
df['col2'].unique()

array([444, 555, 666])

In [157]:
df['col2'].nunique()

3

In [158]:
df['col2'].value_counts()

444    2
555    1
666    1
Name: col2, dtype: int64

### Duplications

In [159]:
df.duplicated()#.sum()

0    False
1    False
2    False
3    False
dtype: bool

#### Drop duplication

In [160]:
df.drop_duplicates(inplace=True)

### Statistical Information

Aggregation methods such as `mean()`, `std()`, and `median()` reduce a numeric column to a single value:

In [161]:
df['col2'].mean() #.std() #.median()

527.25

#### Summary Function
Pandas provides many simple *summary functions* (not an official name) which restructure the data in some useful way. For example, the `describe()` method generates a high-level summary of the attributes of the given column or DataFrame. It is type-aware, meaning that its output changes based on the data type of the input. For numerical data it reports the count, mean, standard deviation, minimum, quartiles, and maximum; string data is summarized differently.

In [162]:
df.describe()

           col1        col2
count  4.000000    4.000000
mean   2.500000  527.250000
std    1.290994  106.274409
min    1.000000  444.000000
25%    1.750000  444.000000
50%    2.500000  499.500000
75%    3.250000  582.750000
max    4.000000  666.000000
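
Since the output of `describe()` depends on the data type, here is a sketch of what it returns for string data. The Series `colors` is hypothetical and has one repeated value so that the most frequent entry is unambiguous:

```python
import pandas as pd

# Hypothetical string column with one repeated value.
colors = pd.Series(['red', 'blue', 'red', 'green'])
summary = colors.describe()

print(summary['count'])   # 4
print(summary['unique'])  # 3
print(summary['top'])     # red
print(summary['freq'])    # 2
```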


### Applying Functions

In [163]:
def times2(x):
    return x*2

In [164]:
df['col1'].apply(times2)

0    2
1    4
2    6
3    8
Name: col1, dtype: int64

In [165]:
df['col3'].apply(len)

0    2
1    2
2    2
3    2
Name: col3, dtype: int64

In [166]:
df['col2'].sum()

2109

## Data Types 

You can use the `dtype` property to grab the type of a specific column, or `dtypes` to see the data types of all columns at once:

In [167]:
df['col2'].dtype 

dtype('int64')

In [168]:
df.dtypes

col1     int64
col2     int64
col3    object
dtype: object

## Missing Values

Entries with missing values are given the value `NaN`, short for "Not a Number". For technical reasons these `NaN` values are always of the `float64` `dtype`.

Pandas provides some methods specific to missing data. To select `NaN` entries you can use `pd.isnull()` (or its companion `pd.notnull()`). This is meant to be used thusly:

In [169]:
df = pd.DataFrame({'col1':[1,2,3,np.nan],
                   'col2':[np.nan,555,666,444],
                   'col3':['aaa','bbb','ccc','ddd']})
df.head()

   col1   col2 col3
0   1.0    NaN  aaa
1   2.0  555.0  bbb
2   3.0  666.0  ccc
3   NaN  444.0  ddd


In [170]:
df.isnull()#.sum()

    col1   col2   col3
0  False   True  False
1  False  False  False
2  False  False  False
3   True  False  False
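
The `float64` point can be checked directly. In this sketch (the names `whole` and `with_gap` are illustrative), a single `NaN` turns an integer column into a float column, and `notnull()` serves as a row filter:

```python
import numpy as np
import pandas as pd

whole = pd.Series([1, 2, 3])
with_gap = pd.Series([1, 2, np.nan])

print(whole.dtype)     # int64
print(with_gap.dtype)  # float64

# notnull() is the complement of isnull() and works as a row filter.
present = with_gap[with_gap.notnull()]
print(present.tolist())  # [1.0, 2.0]
```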


In [171]:
df.fillna('FILL')

   col1   col2 col3
0   1.0   FILL  aaa
1   2.0  555.0  bbb
2   3.0  666.0  ccc
3  FILL  444.0  ddd


In [172]:
df = pd.DataFrame({'A':[1,2,np.nan],
                  'B':[5,np.nan,np.nan],
                  'C':[1,2,3]})

In [173]:
df.dropna()

     A    B  C
0  1.0  5.0  1


In [174]:
df.dropna(axis=1)

   C
0  1
1  2
2  3


In [175]:
df.fillna(value='FILL VALUE')

            A           B  C
0         1.0         5.0  1
1         2.0  FILL VALUE  2
2  FILL VALUE  FILL VALUE  3


In [176]:
df['A'].fillna(value=df['A'].mean())

0    1.0
1    2.0
2    1.5
Name: A, dtype: float64

The `replace()` method is worth mentioning here because it's handy for replacing missing data which is given some kind of sentinel value in the dataset: things like "Unknown", "Undisclosed", "Invalid", and so on.
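
A minimal sketch of `replace()` on a sentinel value, assuming a hypothetical column `cities` in which "Unknown" marks a missing entry:

```python
import pandas as pd

# Hypothetical column where "Unknown" marks a missing entry.
cities = pd.Series(['Paris', 'Unknown', 'Lima'])

# replace() returns a new Series; the original is left unchanged.
cleaned = cities.replace('Unknown', 'Not recorded')
print(cleaned.tolist())  # ['Paris', 'Not recorded', 'Lima']
print(cities.tolist())   # ['Paris', 'Unknown', 'Lima']
```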


<div class="alert alert-block alert-info">

#  Matplotlib and seaborn
</div>

In [177]:
import numpy as np
%matplotlib notebook
import matplotlib.pyplot as plt

x = np.linspace(0, 2*np.pi, 100)
y = np.sin(x)
z = np.sin(x+np.pi/2)

In [178]:
fig = plt.figure()

plt.plot(x, y)
plt.plot(x, z)
plt.show()


In [179]:
fig = plt.figure()
ax = fig.add_subplot()  # Add a subplot (Axes) to a Figure

ax.plot(x, y)  # Using the plot method from the Axes instance
ax.plot(x, z)
plt.show()


In [180]:
fig = plt.figure()
ax = fig.add_subplot()

ax.set_xlim((x.max()*0.25, x.max()*0.75))
ax.set_xlabel("x")
ax.set_ylabel("my y label")
ax.set_title("Trigonometry")
ax.plot(x, y, x, z)
plt.show()


In [181]:
fig, ax = plt.subplots()  # we can create fig and ax on one line!

ax.set_xlabel("x")
ax.set_ylabel("y")
ax.plot(x, y, x, z)

# Add a legend
ax.legend(("sin(x)", "sin(x+pi/2)"))
plt.show()


In [182]:
import seaborn as sns