#  <center> Pandas and Numpy </center>

<div class="alert alert-block alert-danger">

# Terminology

**If it is hard, Do not Panic!**
____

**If it is easy, Do not underestimate it!**

___
**Do write your own code for the exercises -- do not copy-paste!** You will never become good at coding if you only read through other people's code and copy-paste it.
___

- There is a series of exercises in both the morning and afternoon classes. At the end of each day, I will put the solutions in the GitHub repo.
 
- There is an [Extra](./Extra.ipynb) notebook, referenced in <span style='color:green'> green boxes throughout this lecture</span>, that can be skipped (by you and me) in the interests of time. If and when you have the time, please read through these sections for your general education.
    
</div>

# Introduction to Numpy 
* [NumPy](https://numpy.org/) (or Numpy) is a Linear Algebra Library for Python.

NumPy supports:
- Multidimensional arrays (`ndarray`)
- Matrices and linear algebra operations
- Random number generation
- Fourier transforms
- Polynomials
- Tools for integrating with Fortran/C libraries

<div class="alert alert-block alert-info">

## Learning objectives 
- Learn what NumPy arrays are
- Learn basic array manipulations
- Learn what vectorized code is

 </div>

In [None]:
import numpy as np

# Numpy Arrays

### NumPy Arrays overview

* Core (or Standard) Python Library provides lists and 1D arrays (array.array)

  * Lists are general containers for objects
  * Arrays are 1D containers for objects of the same type
  * Limited functionality
  * Some memory and performance overhead associated with these structures

* NumPy provides multidimensional arrays (numpy.ndarray)
  * Can store many elements of the same data type in multiple dimensions
  * cf. Fortran/C/C++ arrays
  * More functionality than Core Python, e.g. many convenient methods for array manipulation
  * Efficient storage and execution

* [Extensive online documentation!](https://docs.scipy.org/doc/numpy/)

Let's begin our introduction by exploring how to create NumPy arrays.

## Creating NumPy Arrays

### From a Python List

We can create an array by directly converting a list or list of lists:

In [None]:
my_list = [1,2,3]
np.array(my_list)

In [None]:
my_matrix = [[1,2,3],[4,5,6],[7,8,9]]
np.array(my_matrix)

### Using built-in Methods

There are lots of built-in ways to generate Arrays

### arange

Return evenly spaced values within a given interval.

In [None]:
np.arange(0,8)

In [None]:
np.arange(0,11,2)

### zeros, ones and identity matrix

Generate arrays of zeros or ones, or an identity matrix:

In [None]:
np.zeros(3)

In [None]:
np.zeros((4,4))

In [None]:
np.ones(3)

In [None]:
np.eye(4)

### linspace
Return evenly spaced numbers over a specified interval.

In [None]:
np.linspace(0,20,10)
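
The difference between `arange` and `linspace` is easiest to see side by side. The sketch below, with values chosen only for illustration, shows that `arange` takes a step size and excludes the end point, while `linspace` takes a number of points and includes the end point by default.

```python
import numpy as np

# arange: start, stop (excluded), step
stepped = np.arange(0, 1, 0.25)
print(stepped)            # [0.   0.25 0.5  0.75]

# linspace: start, stop (included by default), number of points
spaced = np.linspace(0, 1, 5)
print(spaced)             # [0.   0.25 0.5  0.75 1.  ]
```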

## Random 

Numpy also has lots of ways to create random number arrays:

### rand -- from a uniform distribution
Create an array of the given shape and populate it with
random samples from a uniform distribution
over ``[0, 1)``.

In [None]:
np.random.rand(2)

In [None]:
np.random.rand(3,3)

### randn --  from a normal distribution
(mean=0, standard deviation=1)

In [None]:
np.random.randn(3)

### randint 
Return random integers from `low` (inclusive) to `high` (exclusive).

In [None]:
np.random.randint(1,100)

In [None]:
np.random.randint(1,100,6)
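
Random arrays change on every run, which makes results hard to reproduce. One possible way to get repeatable numbers is a seeded generator created with `np.random.default_rng`. This is a sketch under the assumption that a reasonably recent NumPy is installed; the seed value 42 is arbitrary.

```python
import numpy as np

# A generator seeded with a fixed value produces the same numbers on every run.
rng = np.random.default_rng(42)      # 42 is an arbitrary seed

uniform = rng.random(3)              # uniform samples over [0, 1)
normal = rng.standard_normal(3)      # mean 0, standard deviation 1
integers = rng.integers(1, 100, 6)   # low inclusive, high exclusive

# Re-creating the generator with the same seed reproduces the first draw.
rng_again = np.random.default_rng(42)
print(np.array_equal(uniform, rng_again.random(3)))   # True
```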

You can create 2d arrays with complex elements by specifying the data type.

In [None]:
alist = [[1, 2, 3], [4, 5, 6]]
mat = np.array(alist, complex)
print(mat)

## Array Attributes and Methods

Let's discuss some useful attributes and methods of an array:

In [None]:
arr = np.arange(25)
ranarr = np.random.randint(0,50,10)

In [None]:
arr

In [None]:
ranarr

In [None]:
# Examine key array attributes
print("Dimensions ", arr.ndim)   # Number of dimensions
print("Shape      ", arr.shape)  # number of elements in each dimension
print("Size       ", arr.size)   # total number of elements
print("Data type  ", arr.dtype)  # data type of the elements; arange with integer arguments gives an integer type (float64 is the default for e.g. np.zeros)

### Reshape
Returns an array containing the same data with a new shape.

In [None]:
arr.reshape(5,5)

In [None]:
# Vector
arr.shape

In [None]:
# Notice the two sets of brackets
arr.reshape(1,25)

In [None]:
arr.reshape(1,25).shape

In [None]:
arr.reshape(25,1)

In [None]:
arr.reshape(25,1).shape
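
When only one dimension is known, `reshape` can infer the other. A minimal sketch, reusing the 25-element array from above; the failing `(4, 6)` shape is only there to show what happens when the sizes do not match.

```python
import numpy as np

arr = np.arange(25)

# -1 asks NumPy to infer that dimension from the total size (25 / 5 = 5).
square = arr.reshape(5, -1)
print(square.shape)       # (5, 5)

# The sizes must multiply to the number of elements, otherwise reshape fails.
try:
    arr.reshape(4, 6)
except ValueError as err:
    print("reshape failed:", err)
```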

### dtype

You can also grab the data type of the object in the array:

In [None]:
arr.dtype

### max,min,argmax,argmin
These are useful methods for finding the max or min values, or for finding their index locations using `argmax` or `argmin`.

In [None]:
ranarr.max()

In [None]:
ranarr.argmax()

### mean, std

In [None]:
ranarr.std()
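
On a multidimensional array these methods also accept an `axis` argument. The small matrix `m` below is made up for illustration; the sketch shows a reduction over all elements, per column and per row.

```python
import numpy as np

m = np.array([[1, 9, 3],
              [7, 2, 8]])

print(m.mean())           # 5.0, over all six elements
print(m.max(axis=0))      # [7 9 8], maximum of each column
print(m.argmax(axis=1))   # [1 2], position of the maximum within each row
```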

# NumPy Indexing and Selection
In this section we will discuss how to select elements or groups of elements from an array.

In [None]:
#Creating sample array
arr = np.arange(2,11)
arr

## Bracket Indexing and Selection
The simplest way to pick one or some elements of an array looks very similar to python lists:

In [None]:
#Get a value at an index
arr[8]

In [None]:
#Get values in a range
arr[1:5]

In [None]:
#Get values in a range
arr[0:5]

In [None]:
arr[:6]

## Broadcasting

Numpy arrays differ from a normal Python list because of their ability to broadcast:

In [None]:
#Setting a value with index range (Broadcasting)
arr[0:5]=99

#Show
arr

In [None]:
# Reset array, 
arr = np.arange(2,11)

#Show
arr
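
Broadcasting is not limited to scalars: arrays of different but compatible shapes are stretched to a common shape. A minimal sketch, in which the names `column`, `row` and `plain` are made up for illustration, also shows that a plain list rejects the scalar slice assignment used above.

```python
import numpy as np

column = np.arange(3).reshape(3, 1)   # shape (3, 1)
row = np.array([10, 20, 30])          # shape (3,)

# The two shapes are stretched to a common shape (3, 3) before adding.
table = column + row
print(table)
# [[10 20 30]
#  [11 21 31]
#  [12 22 32]]

# A plain list does not accept a scalar for a slice assignment.
plain = [2, 3, 4, 5]
try:
    plain[0:2] = 99
except TypeError as err:
    print("list slice assignment failed:", err)
```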

In [None]:
#Important notes on Slices
slice_of_arr = arr[0:6]

#Show slice
slice_of_arr

In [None]:
#Change Slice
slice_of_arr[:]=99

#Show Slice again
slice_of_arr

Now note the changes also occur in our original array!

In [None]:
arr

Data is not copied: the slice is a view of the original array! This avoids unnecessary memory use when working with large arrays.

In [None]:
#To get a copy, need to be explicit
arr_copy = arr.copy()

arr_copy
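
Whether two arrays share their data can be checked with `np.shares_memory`. The following sketch contrasts a view with an explicit copy; the variable names are chosen for this example only.

```python
import numpy as np

arr = np.arange(2, 11)
view = arr[0:3]            # a slice is a view on the same memory
copy = arr[0:3].copy()     # an explicit copy owns its own memory

print(np.shares_memory(arr, view))   # True
print(np.shares_memory(arr, copy))   # False

copy[:] = -1               # leaves arr untouched
view[:] = 99               # writes through to arr
print(arr)                 # [99 99 99  5  6  7  8  9 10]
```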

## Indexing a 2D array (matrices)

The general format is **arr_2d[row][col]** or **arr_2d[row,col]**. I recommend usually using the comma notation for clarity.

In [None]:
arr_2d = np.array(([5,7,9],[10,12,14],[15,17,19]))

#Show
arr_2d

In [None]:
# Format is arr_2d[row][col] or arr_2d[row,col]

# Getting individual element value
arr_2d[1][0]

In [None]:
# Getting individual element value
arr_2d[1,0]

In [None]:
# 2D array slicing

#Shape (2,2) from top right corner
arr_2d[:2,1:]

In [None]:
#Shape bottom row
arr_2d[2] #arr_2d[2,:]

## Selection

Let's briefly go over how to use brackets for selection based off of comparison operators.

In [None]:
arr = np.arange(0,11)
arr

In [None]:
arr>5

In [None]:
arr[arr>=5]
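
Several comparisons can be combined into one mask. A short sketch of the idea, using the same `arr` as above (the `clipped` array is introduced only for this example):

```python
import numpy as np

arr = np.arange(0, 11)

# Combine masks with & (and), | (or) and ~ (not); each comparison needs parentheses.
print(arr[(arr > 2) & (arr < 8)])    # [3 4 5 6 7]
print(arr[(arr < 2) | (arr > 8)])    # [ 0  1  9 10]
print(arr[~(arr % 2 == 0)])          # [1 3 5 7 9]

# A mask can also be used on the left-hand side to assign in place.
clipped = arr.copy()
clipped[clipped > 5] = 5
print(clipped)                       # [0 1 2 3 4 5 5 5 5 5 5]
```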

# NumPy Operations
## Arithmetic

You can easily perform array with array arithmetic, or scalar with array arithmetic. Let's see some examples:

In [None]:
arr + arr

In [None]:
arr * arr

In [None]:
arr - arr

In [None]:
# Warning on division by zero, but not an error!
# Just replaced with nan
arr/arr

In [None]:
# Also a warning, not an error: division by zero gives infinity
1/arr
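
How NumPy reacts to these floating-point problems can be controlled locally with `np.errstate`. A minimal sketch, using a shorter array than above to keep the output small:

```python
import numpy as np

arr = np.arange(0, 4)

# Silence the warnings locally; the results are still nan and inf.
with np.errstate(divide="ignore", invalid="ignore"):
    ratio = arr / arr        # 0/0 gives nan
    inverse = 1 / arr        # 1/0 gives inf

print(np.isnan(ratio[0]), np.isinf(inverse[0]))   # True True

# Raising an exception instead is also possible.
try:
    with np.errstate(divide="raise"):
        1 / arr
except FloatingPointError as err:
    print("caught:", err)
```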

In [None]:
arr**3

In [None]:
def f(x):
    return x**3

x = np.array([1,2,3,4,5,6,7,8,9])
y = f(x)

print(y)

## Universal Array Functions

Numpy comes with many [universal array functions](http://docs.scipy.org/doc/numpy/reference/ufuncs.html), which are essentially just mathematical operations you can use to perform the operation across the array. Let's show some common ones:

In [None]:
#Taking Square Roots
np.sqrt(arr)

In [None]:
#Calculating exponential (e^)
np.exp(arr)

In [None]:
np.max(arr) #same as arr.max()

In [None]:
np.sin(arr)

In [None]:
np.log(arr)

### Linear algebra with numpy.linalg
 
Numpy provides some linear algebra capabilities, from matrix-vector product to matrix inversion and system solution

In [None]:
A = np.array([[1,2,3],[4,5,6],[7,8,8]])
B = np.array([1,2,1])

print(np.dot(A,B))

In [None]:
import numpy.linalg as la

In [None]:
n = la.norm(B)
print(n)

n = la.norm(A)
print(n)

d = la.det(A)
print(d)

And it is possible to solve linear systems, using low level C/Fortran code:

In [None]:
la.solve(A,B)

In [None]:
A_inv = la.inv(A)
print(A_inv)

In [None]:
#The eigen decomposition (of a square matrix) can also be computed:
eival, eivec = la.eig(A)
print(eival)
print(eivec)
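
A quick way to convince yourself that these routines do what they claim is to substitute the results back into the defining equations. The sketch below does this for the same `A` and `B`, using `np.allclose` to compare floating-point results.

```python
import numpy as np
import numpy.linalg as la

A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 8]])
B = np.array([1, 2, 1])

# The solution x of A x = B should reproduce B.
x = la.solve(A, B)
print(np.allclose(A @ x, B))                    # True

# A times its inverse should be the identity matrix.
A_inv = la.inv(A)
print(np.allclose(A @ A_inv, np.eye(3)))        # True

# Each column of eivec is an eigenvector: A v = lambda v
eival, eivec = la.eig(A)
print(np.allclose(A @ eivec, eivec * eival))    # True
```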

### Performance

Python has a convenient timing module called `timeit`.

It can be used to measure the execution time of small code snippets.

* From Python: `import timeit` and supply the code snippet as a string or a callable
* From IPython: use the magic command `%timeit`

`%timeit` runs your code in several repeats, each consisting of a number of loops, and reports a summary of the timings. The defaults depend on the IPython version: older versions used 3 repeats and reported the best time, while recent versions use 7 repeats and report the mean and standard deviation.
You can specify the number of loops per repeat (`-n`) and the number of repeats (`-r`):
```
%timeit -n <iterations> -r <repeats>  <code_snippet>
```

See

* `%timeit?` for more information
* https://docs.python.org/2/library/timeit.html
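
Outside IPython the same measurement can be made with the `timeit` module. The sketch below compares an explicit Python loop with a vectorized expression; the two helper functions are written for this example, and the actual timings depend on your machine, so no numbers are quoted.

```python
import timeit
import numpy as np

values = np.arange(10_000, dtype=np.int64)

def loop_sum_of_squares(data):
    """Sum of squares with an explicit Python loop."""
    total = 0
    for v in data:
        total += int(v) ** 2
    return total

def vectorized_sum_of_squares(data):
    """Sum of squares with a single vectorized expression."""
    return int(np.sum(data ** 2))

# Both versions compute the same number.
assert loop_sum_of_squares(values) == vectorized_sum_of_squares(values)

# timeit.timeit accepts a callable; number is how many times it is executed.
t_loop = timeit.timeit(lambda: loop_sum_of_squares(values), number=20)
t_vec = timeit.timeit(lambda: vectorized_sum_of_squares(values), number=20)
print(f"loop: {t_loop:.4f} s, vectorized: {t_vec:.4f} s")
```

On most machines the vectorized version should be considerably faster, because the loop runs in compiled code inside NumPy rather than in the Python interpreter.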

# Introduction to Pandas
[Pandas](https://pandas.pydata.org/) is an open source library that's built on top of `NumPy`.
 The name is derived from the term "panel data", an econometrics term for data sets that include observations over multiple time periods for the same individuals.

- It allows for fast analysis, data cleaning and preparation. It is often used to preprocess data for machine learning.
- It excels in performance and productivity for the user.
- It has built-in visualization features.
- It can work with data from a wide variety of sources.
- It allows importing data in various formats such as csv, excel, HTML, etc.
- It allows a range of data manipulation operations such as `groupby`, `join`, `merge`, `melt`, `concatenation` as well as data cleaning features such as filling, replacing or imputing null values.
- It is used for timeseries analysis.

<div class="alert alert-block alert-info">

## Learning objectives 
Today we will learn how to use pandas for data analysis. 
- Series
- DataFrames: Creating, reading and writing to `DataFrame`'s.
- Indexing of `DataFrame`'s and how to slice and reference them.
- Operations
- Extract information from your data through summary functions and maps.
- Grouping and sorting data.
- `DataType`'s and handling missing data.
- Renaming, merging, joining, and concatenating.
- Built-in visualization features (see [Extra](./Extra.ipynb)).
- Timeseries with Pandas (see [Extra](./Extra.ipynb)).
 </div>

In [None]:
#Importing library
import pandas as pd

from numpy.random import randn
np.random.seed(101)

#  Series
A Series is built on top of the NumPy array object.
- A Series can be indexed by a label.
- It also doesn't need to hold numeric data, it can hold any arbitrary Python Object.

Let's explore this concept through some examples:

## Creating a Series in different ways

You can convert a **list**, **numpy array**, or **dictionary** to a Series:

In [None]:
labels = ['x','y','z']
my_list = [100,200, 300] #python list
arr = np.array([100,200,300]) #NumPy array
dic = {'x':100,'y':200,'z':300}  #python dictionary

**List**

In [None]:
pd.Series(data = my_list) 

It looks a lot like a NumPy array, except that the index (0, 1, 2) is shown explicitly next to the data (100, 200, 300). The key feature of a pandas Series is that you can specify what you want that index to be.

In [None]:
pd.Series(data=my_list,index=labels)

**numpy array**

In [None]:
pd.Series(arr)

In [None]:
pd.Series(arr,labels)

**Dictionary**

In [None]:
pd.Series(dic)

### Data in a Series

A pandas Series can hold a variety of object types. Its entries are not limited to integers. For instance, here's a `series` whose values are strings:

In [None]:
pd.Series(data=labels)

In [None]:
# Even functions (although unlikely that you will use this) 
#This is just to demonstrate of pandas flexibility to work with various data type
pd.Series([sum,print,len])

## Using an Index

Let's see some examples of how to grab information from a Series. Let us create two Series, `ser1` and `ser2`:

In [None]:
ser1 = pd.Series([10,20,30,40],index = ['X', 'Y','Z', 'T'])                                   

In [None]:
ser1

In [None]:
ser2 = pd.Series([10,20,50,40],index = ['X', 'Y','M', 'T'])                                   

In [None]:
ser2

In [None]:
ser1['X'] #just pass in the index label

**Operations are then also done based off of index:**

*Note: labels that appear in only one of the two Series get the value `NaN` in the result. Because `NaN` is a floating-point value, the integers are converted to floats so that no information is lost.*

In [None]:
ser1 + ser2
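
If the `NaN` entries are not wanted, the `add` method with `fill_value` is one option. A minimal sketch with the same two Series, treating a missing label as 0:

```python
import pandas as pd

ser1 = pd.Series([10, 20, 30, 40], index=['X', 'Y', 'Z', 'T'])
ser2 = pd.Series([10, 20, 50, 40], index=['X', 'Y', 'M', 'T'])

# Labels present in only one Series ('Z' and 'M') give NaN with the + operator.
total = ser1 + ser2
print(total.isna().sum())        # 2

# add() with fill_value treats a missing label as 0 instead.
filled = ser1.add(ser2, fill_value=0)
print(filled['M'], filled['Z'])  # 50.0 30.0
```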

#  DataFrames: creating, reading, writing

A `DataFrame` is a table. It contains an array of individual entries, each of which has a certain value. Each entry corresponds to a row (or record) and a column.

# Creating
We are using the `pd.DataFrame()` constructor to generate these `DataFrame` objects.

In [None]:
df = pd.DataFrame(randn(4,3), index=['A','B','C','D'], columns=['X','Y','Z'])

# A more compact alternative:
#df = pd.DataFrame(randn(4,3),index='A B C D'.split(),columns='X Y Z'.split())

In [None]:
df

In [None]:
#type(df['X'])

In [None]:
data = pd.DataFrame({'Course':['NPP','NPP','EDMS','EDMS','CM','CM'],
       'Person':['Bob','Sam','Amy','Vanessa','Carl','Sarah'],
       'Marks':[70,75,80,65,60,90]},
       index=['A','B','C','D','E','F'])

In [None]:
data

# Reading and writing

Being able to create a `DataFrame` or `Series` by hand is useful. But, most of the time, we won't actually be creating our own data by hand. Instead, we'll be working with data that already exists.

**CSV** Data can be stored in any of a number of different forms and formats. By far the most basic of these is the humble CSV file. When you open a CSV file you get something that looks like this:

In [None]:
df1 = pd.read_csv('inputs/df1.csv')

The `pd.read_csv()` function is very versatile, with over 30 optional parameters you can specify. For example, you can see in this dataset that the CSV file has a built-in index, which pandas did not pick up on automatically. To make pandas use that column for the index (instead of creating a new one from scratch), we can specify an index_col.
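
The effect of `index_col` can be tried without a file on disk. The sketch below feeds a small made-up CSV text to `pd.read_csv` through `io.StringIO`; the column names `a` and `b` and the values are invented for illustration and are not the contents of `inputs/df1.csv`.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk; the contents are made up for this example.
csv_text = """,a,b
0,1,2
1,3,4
"""

# Without index_col, the unnamed first column becomes a regular column.
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.columns.tolist())      # ['Unnamed: 0', 'a', 'b']

# With index_col=0, the first column is used as the index.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(indexed.columns.tolist())  # ['a', 'b']
```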

In [None]:
#df.to_csv('example',index=False)

**Excel** Pandas can read and write Excel files. Keep in mind that this only imports data, not formulas or images; files containing images or macros may cause the `read_excel` method to fail.

In [None]:
#pd.read_excel('Excel_Sample.xlsx',sheet_name='Sheet1')

In [None]:
#df.to_excel('Excel_Sample.xlsx',sheet_name='Sheet1')

<div class="alert alert-block alert-success" style="color:red">

You can see how to read HTML and SQL files in [Extra](./Extra.ipynb) </div>

#  Selection, Assigning data and Indexing

## Selection

Let's learn the various methods to grab data from a DataFrame.
There are two ways of selecting a specific Series out of a `DataFrame`: the indexing operator `df['Y']` and attribute access `df.Y`.

The indexing operator `[]` does have the advantage that it can handle column names with reserved characters in them.

In [None]:
df

In [None]:
df.columns

In [None]:
df.index

In [None]:
df['Y']

In [None]:
# Pass a list of column names
df[['Y','Z']]

In [None]:
# Attribute access (NOT RECOMMENDED: it fails for column names that contain spaces or clash with DataFrame methods)
df.Y

In [None]:
df['Y'].iloc[0]  # first element by position; df['Y']['A'] selects it by label

### Index-based selection

Pandas indexing works in one of two paradigms. The first is index-based selection: selecting data based on its numerical position in the data. `iloc` follows this paradigm.

The second paradigm for attribute selection is the one followed by the `loc` operator: **label-based selection**. In this paradigm, it's the data index value, not its position, which matters.

Both `loc` and `iloc` are row-first, column-second. This is the opposite of the chained bracket indexing used above (`df[column][row]`), which is column-first, row-second.

In [None]:
df.loc['A']

In [None]:
df.loc['B','Y']

In [None]:
df.loc[['A','B'],['Z','Y']]

In [None]:
df.iloc[0,:] #or df.iloc[0]

In [None]:
df.iloc[:, 0]

In [None]:
df.iloc[:3, 0]

In [None]:
df.iloc[[1, 2], 0]
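
One difference between the two paradigms is easy to miss: slices are end-exclusive with `iloc` but end-inclusive with `loc`. A minimal sketch on a small frame built for this example (the values are arbitrary):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3),
                  index=['A', 'B', 'C', 'D'], columns=['X', 'Y', 'Z'])

# iloc slices by position and excludes the end, like Python lists.
print(df.iloc[0:2].index.tolist())      # ['A', 'B']

# loc slices by label and includes the end label.
print(df.loc['A':'B'].index.tolist())   # ['A', 'B']
print(df.loc['A':'C'].index.tolist())   # ['A', 'B', 'C']
```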

### Creating a new column:
**DataFrame Columns are just Series**

In [None]:
type(df['Z'])

In [None]:
df['new'] = df['Z'] + df['Y']

In [None]:
df

In [None]:
df['index_backwards'] = range(len(df), 0, -1)
df

### Removing Columns

In [None]:
df.drop('new',axis=1)

In [None]:
# Not inplace unless specified!
df

In [None]:
df.drop('new',axis=1,inplace=True)
#or 
#df_newVer = df.drop('new',axis=1)

Can also drop rows this way:

In [None]:
df.drop('A',axis=0)

**Permanently Removing a Column**

In [None]:
del df['index_backwards']

### Conditional Selection

An important feature of pandas is conditional selection using bracket notation, very similar to numpy:

In [None]:
df>0

In [None]:
df[df>0]

In [None]:
df[df['Z']>0]

In [None]:
df[df['Z']>0]['Y']

In [None]:
df[df['Z']>0][['Y','X']]

To combine two conditions, use `|` (or) and `&` (and), with each condition in parentheses:

In [None]:
df[(df['X']>0) & (df['Y'] > 1)]

In [None]:
df

Pandas comes with a few built-in conditional selectors, two of which we will highlight here.

The first is `isin`. `isin` lets you select data whose value "is in" a list of values.

The second is `isnull` (and its companion `notnull`). These methods let you highlight values which are (or are not) empty (`NaN`). For example, to keep only the rows whose `Z` value is not missing, here's what we would do:

In [None]:
df.loc[df['Z'].notnull()]
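
`isin` is used in the same way. The sketch below applies it to the course table created earlier, rebuilt here (without the custom index) so that the example runs on its own:

```python
import pandas as pd

data = pd.DataFrame({'Course': ['NPP', 'NPP', 'EDMS', 'EDMS', 'CM', 'CM'],
                     'Person': ['Bob', 'Sam', 'Amy', 'Vanessa', 'Carl', 'Sarah'],
                     'Marks': [70, 75, 80, 65, 60, 90]})

# Keep the rows whose Course is one of the listed values.
selected = data[data['Course'].isin(['NPP', 'CM'])]
print(selected['Person'].tolist())   # ['Bob', 'Sam', 'Carl', 'Sarah']

# isin combines with other conditions through & and |.
top = data[data['Course'].isin(['NPP', 'CM']) & (data['Marks'] > 70)]
print(top['Person'].tolist())        # ['Sam', 'Sarah']
```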

### More Index Details

Let's discuss some more features of indexing, including resetting the index or setting it to something else.

In [None]:
# Reset to default 0,1...n index
df.reset_index()

In [None]:
newind = 'one two three four'.split()

In [None]:
df['Hs_type'] = newind

In [None]:
df

In [None]:
df.set_index('Hs_type')

In [None]:
df.set_index('Hs_type',inplace=True)

<div class="alert alert-block alert-success">
    
See [Extra](./Extra.ipynb) for **Multi-Index and Index Hierarchy** </div>

#  Operations

There are lots of operations with pandas that will be really useful to you, but don't fall into any distinct category. Let's show them here in this lecture:

In [None]:
df = pd.DataFrame({'col1':[1,2,3,4],'col2':[444,555,666,444],'col3':['aa','cc','dd','ee']})
df.head()

### Info on Unique Values

In [None]:
df['col2'].unique()

In [None]:
df['col2'].nunique()

In [None]:
df['col2'].value_counts()

### Duplications

In [None]:
df.duplicated()#.sum()

#### Drop duplication

In [None]:
df.drop_duplicates(inplace=True)

### statistical information

Individual statistics such as `mean()`, `std()` and `median()` can be computed directly on a numerical column:

In [None]:
df['col2'].mean() #.std() #.median()

#### Summary Function
Pandas provides many simple *summary functions* (not an official name) which restructure the data in some useful way. For example, consider the `describe()` method, which generates a high-level summary of the given columns. It is type-aware, meaning that its output changes based on the data type of the input: numerical columns get the count, mean, standard deviation and quantiles, while string columns get the count, number of unique values, most frequent value and its frequency.

In [None]:
df.describe()

### Maps

A **map** is a term, borrowed from mathematics, for a function that takes one set of values and "maps" them to another set of values. In data science we often have a need for creating new representations from existing data, or for transforming data from the format it is in now to the format that we want it to be in later. Maps are what handle this work, making them extremely important for getting your work done!

There are two mapping methods that you will use often.

`map()` is the first, and slightly simpler one. For example, suppose that we wanted to re-center the values of `col1` so that their mean is 0. We can do this as follows:

In [None]:
df_mean = df['col1'].mean()
df['col1'].map(lambda p: p - df_mean)

### Applying Functions

In [None]:
def times2(x):
    return x*2

In [None]:
df['col1'].apply(times2)

In [None]:
df['col3'].apply(len)

In [None]:
df['col2'].sum()
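
Two further uses are worth knowing: `map()` accepts a dictionary as well as a function, and `apply()` on a whole `DataFrame` with `axis=1` works row by row. A minimal sketch on the same three columns; the dictionary and the `product` calculation are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4],
                   'col2': [444, 555, 666, 444],
                   'col3': ['aa', 'cc', 'dd', 'ee']})

# Series.map also accepts a dictionary; values without a key become NaN.
mapped = df['col3'].map({'aa': 'first', 'cc': 'second'})
print(mapped.tolist())    # ['first', 'second', nan, nan]

# DataFrame.apply with axis=1 calls the function once per row.
product = df.apply(lambda row: row['col1'] * row['col2'], axis=1)
print(product.tolist())   # [444, 1110, 1998, 1776]
```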

## Data Types 

You can use the `dtype` property to grab the type of a specific column. Or you can use `dtypes` to see all data types of columns

In [None]:
df['col2'].dtype 

In [None]:
df.dtypes

Data types tell us something about how pandas is storing the data internally. `float64` means that it's using a 64-bit floating point number; `int64` means a similarly sized integer instead, and so on.

One peculiarity to keep in mind (and on display very clearly here) is that columns consisting entirely of strings do not get their own type; they are instead given the object type.

It's possible to convert a column of one type into another wherever such a conversion makes sense by using the `astype()` function. For example, we may transform the `col2` column from its existing `int64` data type into a `float64` data type:

In [None]:
df['col2'].astype('float64')
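
Note that `astype()` returns a new Series and leaves the original column unchanged. The sketch below assigns the result back, and also shows `pd.to_numeric` with `errors='coerce'` as one option for text that cannot be converted; the Series `raw` is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'col2': [444, 555, 666]})

# astype returns a new Series; assign it back to change the column.
df['col2'] = df['col2'].astype('float64')
print(df['col2'].dtype)          # float64

# Strings that cannot be parsed make astype fail; to_numeric can turn them into NaN.
raw = pd.Series(['1', '2', 'x'])
numbers = pd.to_numeric(raw, errors='coerce')
print(numbers.tolist())          # [1.0, 2.0, nan]
```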

## Missing Values

Entries missing values are given the value `NaN`, short for "Not a Number". For technical reasons these `NaN` values are always of the `float64` `dtype`.

Pandas provides some methods specific to missing data. To select `NaN` entries you can use `pd.isnull()` (or its companion `pd.notnull()`). This is meant to be used thusly:

In [None]:
df = pd.DataFrame({'col1':[1,2,3,np.nan],
                   'col2':[np.nan,555,666,444],
                   'col3':['aaa','bbb','ccc','ddd']})
df.head()

In [None]:
df.isnull()#.sum()

In [None]:
df.fillna('FILL')

In [None]:
data = {'A':['Class1','Class1','Class1','Class2','Class2','Class2'],
     'B':['M1','M1','M2','M2','M1','M1'],
       'C':['x','y','x','y','x','y'],
       'D':[1,3,2,5,4,1]}

df = pd.DataFrame(data)
df.pivot_table(values='D',index=['A', 'B'],columns=['C'])

In [None]:
df = pd.DataFrame({'A':[1,2,np.nan],
                  'B':[5,np.nan,np.nan],
                  'C':[1,2,3]})

In [None]:
df.dropna()

In [None]:
df.dropna(axis=1)

In [None]:
df.dropna(thresh=2)

In [None]:
df.fillna(value='FILL VALUE')

In [None]:
df['A'].fillna(value=df['A'].mean())

The `replace()` method is worth mentioning here because it's handy for replacing missing data which is given some kind of sentinel value in the dataset: things like "Unknown", "Undisclosed", "Invalid", and so on.
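
As a sketch of that idea, the example below turns two sentinel strings into real missing values. The table and the `City` column are hypothetical and exist only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical data in which missing cities were recorded as sentinel strings.
df = pd.DataFrame({'Person': ['Bob', 'Sam', 'Amy', 'Carl'],
                   'City': ['Leeds', 'Unknown', 'York', 'Undisclosed']})

# Replace every sentinel with NaN so that isnull/fillna/dropna can see it.
cleaned = df['City'].replace(['Unknown', 'Undisclosed'], np.nan)
print(cleaned.isnull().tolist())   # [False, True, False, True]
```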

# Groupby

The `groupby()` method allows you to group rows of data together and call aggregate functions.

In [None]:
# Create dataframe
data = pd.DataFrame({'Course':['NPP','NPP','EDMS','EDMS','CM','CM'],
       'Person':['Bob','Sam','Amy','Vanessa','Carl','Sarah'],
       'Marks':[70,75,80,65,60,90]},
       index=['A','B','C','D','E','F'])

In [None]:
df = pd.DataFrame(data)

Now you can use the `.groupby()` method to group rows together based on a column name. For instance, let's group by `Course`. This will create a `DataFrameGroupBy` object:

In [None]:
df.groupby('Course')

You can save this object as a new variable:

In [None]:
by_course = df.groupby("Course")

And then call aggregate methods off the object:

In [None]:
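# Select the numeric column: a mean of the text column 'Person' is undefined
# and raises an error in recent pandas versions
by_course[['Marks']].mean()

In [None]:
df.groupby('Course')[['Marks']].mean()

In [None]:
by_course[['Marks']].std() #max() #min()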

In [None]:
by_course.count()

In [None]:
by_course.describe()

In [None]:
by_course.describe().transpose()

In [None]:
by_course.describe().transpose()['NPP']

Another groupby() method worth mentioning is agg(), which lets you run a bunch of different functions on your DataFrame simultaneously. For example, we can generate a simple statistical summary of the dataset as follows:

In [None]:
# Function names given as strings; 'count' equals len here because no value is missing
df.groupby(['Course']).agg(['count', 'min', 'max'])
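
`agg()` can also apply different functions to different columns and name the results. A minimal sketch using named aggregation on the same course table; the output column names `students`, `mean_marks` and `best_marks` are chosen for this example.

```python
import pandas as pd

data = pd.DataFrame({'Course': ['NPP', 'NPP', 'EDMS', 'EDMS', 'CM', 'CM'],
                     'Person': ['Bob', 'Sam', 'Amy', 'Vanessa', 'Carl', 'Sarah'],
                     'Marks': [70, 75, 80, 65, 60, 90]})

# Each keyword is an output column: (input column, aggregation function).
summary = data.groupby('Course').agg(
    students=('Person', 'count'),
    mean_marks=('Marks', 'mean'),
    best_marks=('Marks', 'max'),
)
print(summary)
```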

# Merging, Joining, and Concatenating

There are 3 main ways of combining DataFrames together: Merging, Joining and Concatenating. In this lecture we will discuss these 3 methods with examples.

In [None]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                        'B': ['B0', 'B1', 'B2', 'B3'],
                        'C': ['C0', 'C1', 'C2', 'C3'],
                        'D': ['D0', 'D1', 'D2', 'D3']},
                        index=[0, 1, 2, 3])

In [None]:
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                        'B': ['B4', 'B5', 'B6', 'B7'],
                        'C': ['C4', 'C5', 'C6', 'C7'],
                        'D': ['D4', 'D5', 'D6', 'D7']},
                         index=[4, 5, 6, 7]) 

In [None]:
df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                        'B': ['B8', 'B9', 'B10', 'B11'],
                        'C': ['C8', 'C9', 'C10', 'C11'],
                        'D': ['D8', 'D9', 'D10', 'D11']},
                        index=[8, 9, 10, 11])

## Concatenation

Concatenation basically glues together DataFrames. Keep in mind that dimensions should match along the axis you are concatenating on. You can use **pd.concat** and pass in a list of DataFrames to concatenate together:

In [None]:
pd.concat([df1,df2,df3])

In [None]:
pd.concat([df1,df2,df3],axis=1)
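
The frames above were given non-overlapping indexes on purpose. When the inputs carry the default index, the labels repeat after stacking; `ignore_index=True` is one way to renumber them. A small sketch with two frames made up for this example:

```python
import pandas as pd

top = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
bottom = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']})

# Both frames carry the default index 0, 1, so the labels repeat after stacking.
stacked = pd.concat([top, bottom])
print(stacked.index.tolist())      # [0, 1, 0, 1]

# ignore_index=True discards the old labels and numbers the rows afresh.
renumbered = pd.concat([top, bottom], ignore_index=True)
print(renumbered.index.tolist())   # [0, 1, 2, 3]
```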

In [None]:
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
   
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                          'C': ['C0', 'C1', 'C2', 'C3'],
                          'D': ['D0', 'D1', 'D2', 'D3']})    

## Merging

The **merge** function allows you to merge DataFrames together using a similar logic as merging SQL Tables together. For example:

In [None]:
pd.merge(left,right,how='inner',on='key')

In [None]:
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                        'A': ['A0', 'A1', 'A2', 'A3'],
                        'B': ['B0', 'B1', 'B2', 'B3']})
    
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                               'key2': ['K0', 'K0', 'K0', 'K0'],
                                  'C': ['C0', 'C1', 'C2', 'C3'],
                                  'D': ['D0', 'D1', 'D2', 'D3']})

In [None]:
pd.merge(left, right, on=['key1', 'key2'])

In [None]:
pd.merge(left, right, how='outer', on=['key1', 'key2'])

In [None]:
pd.merge(left, right, how='right', on=['key1', 'key2'])

In [None]:
pd.merge(left, right, how='left', on=['key1', 'key2'])
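
To see which input each row of a merge came from, `pd.merge` offers the `indicator` option, which adds a `_merge` column. A sketch with the same two-key frames:

```python
import pandas as pd

left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})

right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

# indicator=True adds a _merge column: 'both', 'left_only' or 'right_only'.
merged = pd.merge(left, right, how='outer', on=['key1', 'key2'], indicator=True)

print(len(merged))                                # 6
print((merged['_merge'] == 'both').sum())         # 3, the rows an inner merge keeps
print((merged['_merge'] == 'left_only').sum())    # 2
print((merged['_merge'] == 'right_only').sum())   # 1
```

The key pair (`K1`, `K0`) occurs once in `left` and twice in `right`, so it contributes two rows to the result.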

## Joining
Joining is a convenient method for combining the columns of two potentially differently-indexed DataFrames into a single result DataFrame.

In [None]:
left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']},
                      index=['K0', 'K1', 'K2']) 

right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
                    'D': ['D0', 'D2', 'D3']},
                      index=['K0', 'K2', 'K3'])

In [None]:
left.join(right)

In [None]:
left.join(right, how='outer')
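
`join` matches rows on the index and performs a left join by default. The sketch below checks which labels survive each variant and writes the same left join with `pd.merge` on both indexes for comparison.

```python
import pandas as pd

left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']},
                    index=['K0', 'K1', 'K2'])

right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
                      'D': ['D0', 'D2', 'D3']},
                     index=['K0', 'K2', 'K3'])

joined = left.join(right)                    # how='left' is the default
print(joined.index.tolist())                 # ['K0', 'K1', 'K2']
print(joined.loc['K1'].isnull().tolist())    # [False, False, True, True]

outer = left.join(right, how='outer')
print(outer.index.tolist())                  # ['K0', 'K1', 'K2', 'K3']

# The same left join written with merge on both indexes.
via_merge = pd.merge(left, right, left_index=True, right_index=True, how='left')
print(via_merge.index.tolist())              # ['K0', 'K1', 'K2']
```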

<div class="alert alert-block alert-success">
    
See **Built-in visualization features**
and **Timeseries with Pandas** in [Extra](./Extra.ipynb) </div>

# Further reading and references

 For more info on why you would want to use Arrays instead of lists, check out this great [StackOverflow post](http://stackoverflow.com/questions/993984/why-numpy-instead-of-python-lists).

 [Pandas extra](https://pandas.pydata.org/pandas-docs/stable/getting_started/tutorials.html).

Further, if you are interested in data science and machine learning then you need to know about Kaggle. You will use it a few times throughout the year. An introduction can be found here:

[![Two](http://img.youtube.com/vi/TNzDMOg_zsw/0.jpg)](https://youtu.be/TNzDMOg_zsw)