# Introduction to the NumPy Stack

# NumPy

NumPy is a Python package for fast numerical computing.  It provides an N-dimensional array object together with linear algebra routines, and it is the basis of nearly all modules used for Data Science in industry.  It will allow us to do fast and easy vector and matrix operations.  NumPy is highly optimized for numerical calculations, and much of its back-end is written in C, so it typically runs much faster than equivalent pure-Python code.

We start by importing the NumPy module.  By convention, as we import NumPy, we shorten its name to "np", as follows.

In [1]:
import numpy as np

Now that we've imported the module, the first object within NumPy we will look at is the array.

## NumPy Arrays

### Creating Arrays

NumPy arrays can have any number of dimensions, but we will mostly work with one-dimensional and two-dimensional arrays.  A one-dimensional array can be thought of as a vector, and a two-dimensional array can be thought of as a matrix.  We can create an array either by casting a Python list as an array or by generating one with NumPy's built-in functions.

In [2]:
# Converting a list into an array

l = [1,2,3]
a = np.array(l)

a

array([1, 2, 3])
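As a small illustration of how dimensions work, the sketch below builds arrays with one, two, and three axes and inspects their ```ndim``` and ```shape``` attributes.  The variable names are hypothetical and chosen only for this example.

```python
import numpy as np

# Hypothetical example arrays of increasing dimensionality
vector = np.array([1, 2, 3])                 # 1-D: shape (3,)
matrix = np.array([[1, 2, 3], [4, 5, 6]])    # 2-D: shape (2, 3)
tensor = np.zeros((2, 3, 4))                 # 3-D: shape (2, 3, 4)

# ndim is the number of axes; shape gives the length of each axis
for name, arr in [("vector", vector), ("matrix", matrix), ("tensor", tensor)]:
    print(name, arr.ndim, arr.shape)
```

Nested lists map directly onto axes: one level of nesting gives a vector, two levels give a matrix, and so on.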

We can generate an array using the ```arange()``` function, which takes three arguments: ```start```, ```stop```, and ```step```.  If nothing is passed for the ```start``` argument, it is assumed to be ```0```, and if nothing is passed for the ```step``` argument, the step is assumed to be ```1```.  Notice that the ```stop``` value is not included in the range generated (just like the normal ```range()``` function in Python).

In [3]:
np.arange(10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [4]:
np.arange(3, 10)

array([3, 4, 5, 6, 7, 8, 9])

In [5]:
np.arange(3, 11, 2)

array([3, 5, 7, 9])

There are also special functions for generating particular types of arrays that are useful in many calculations.  The function ```zeros``` generates an array of all zeros.

In [6]:
np.zeros(5)

array([0., 0., 0., 0., 0.])

In [7]:
np.zeros((4, 5))

array([[0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.]])

Similarly, the function ```ones``` generates an array of all ones.

In [8]:
np.ones(5)

array([1., 1., 1., 1., 1.])

In [9]:
np.ones((4,5))

array([[1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.]])

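We can also generate a sequence of evenly spaced values between two points using the ```linspace()``` function.  This function is similar to ```arange()```, and also takes three main arguments: ```start```, ```stop```, and ```num```.  Instead of a step size, ```num``` specifies how many points to generate, and unlike ```arange()```, the ```stop``` value is included by default.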

In [10]:
np.linspace(0, 10, 21)

array([ 0. ,  0.5,  1. ,  1.5,  2. ,  2.5,  3. ,  3.5,  4. ,  4.5,  5. ,
        5.5,  6. ,  6.5,  7. ,  7.5,  8. ,  8.5,  9. ,  9.5, 10. ])
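To make the difference between the two functions concrete, here is a minimal sketch comparing how ```linspace()``` and ```arange()``` treat the end of the interval.  The interval and point counts are arbitrary choices for illustration.

```python
import numpy as np

# linspace: ask for a number of points; the stop value is included by default
points = np.linspace(0, 1, 5)

# arange: ask for a step size; the stop value is excluded
steps = np.arange(0, 1, 0.25)

# endpoint=False makes linspace exclude the stop value as well
open_points = np.linspace(0, 1, 4, endpoint=False)

print(points)        # [0.   0.25 0.5  0.75 1.  ]
print(steps)         # [0.   0.25 0.5  0.75]
print(open_points)   # [0.   0.25 0.5  0.75]
```

When the number of points matters more than the exact spacing (for example, when building a plotting grid), ```linspace()``` is usually the more convenient choice.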

We can also easily create an identity matrix of any size using either the ```eye()``` function, or the ```identity()``` function.

In [11]:
np.eye(4)

array([[1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]])

In [12]:
np.identity(4)

array([[1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]])

Notice that these functions only require one argument.  Since an identity matrix must be square ($n \times n$), we can just specify ```n```.

We also have a variety of functions for creating randomly generated arrays.  These can all be found in the ```random``` submodule.

In [13]:
# Generate numbers from the Standard Normal distribution

np.random.randn(4,5)

array([[-0.90157392, -0.35810523,  1.19218259, -0.638735  ,  0.27005714],
       [ 0.7576297 ,  0.65659721, -0.71212639,  0.25641297, -1.42523939],
       [ 1.05494824, -0.39642985,  0.15274572,  1.04354782,  0.08127806],
       [ 0.17641942,  0.03843056, -0.14245144,  0.95498997,  0.56100064]])

In [14]:
# Generate numbers from a Uniform distribution between 0 and 1

np.random.random((4,5))

array([[0.47509625, 0.81733751, 0.3965635 , 0.50190242, 0.90600861],
       [0.13397918, 0.60897764, 0.93595223, 0.51261453, 0.52444688],
       [0.85979127, 0.67897626, 0.62116908, 0.66064461, 0.09287412],
       [0.73548006, 0.01003759, 0.14252235, 0.75739945, 0.96897057]])

We can also generate random numbers from many other named distributions.

In [15]:
np.random.laplace(size = 10)

array([-1.43989943,  0.33283284, -3.39478332, -0.05119575, -1.17939584,
       -2.49422989,  0.99744007, -1.96995493,  1.0396065 , -2.38949074])

In [16]:
np.random.laplace(size = (4,5))

array([[-0.02320204,  1.21183852, -0.5749149 ,  0.42625337,  1.40075635],
       [-0.66924381, -0.57472231,  2.49616142,  0.42475732, -1.15837486],
       [ 0.10106318,  0.97892206,  0.03703811,  0.0934123 , -0.45738134],
       [-2.29963187, -0.28431472,  0.36672991, -1.37577146,  1.97074355]])

In [17]:
np.random.lognormal(size = 10)

array([0.47320938, 0.16106393, 1.55520015, 0.42051955, 0.74289384,
       4.61965857, 0.30455112, 1.43367874, 1.64850603, 1.30792177])
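Because these arrays are random, the numbers will differ on every run.  When reproducible results are needed, one option (assuming NumPy 1.17 or later) is to create a seeded generator with ```np.random.default_rng()```.  The seed value below is arbitrary, and the variable names are hypothetical.

```python
import numpy as np

# Two generators created with the same (arbitrary) seed
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

# Both draw the same 2x3 sample from the standard normal distribution
sample_a = rng_a.normal(size=(2, 3))
sample_b = rng_b.normal(size=(2, 3))

print(np.array_equal(sample_a, sample_b))   # True

# The generator also offers the other distributions, e.g. uniform and laplace
uniform_sample = rng_a.uniform(size=5)
laplace_sample = rng_a.laplace(size=5)
```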

### Manipulating Arrays

Arrays can be manipulated in a variety of ways.  Let's look at the shape of an array, and then see how that can be changed.

In [18]:
arr1 = np.array([1,2,3,4,5,6])
arr2 = np.array([[1,2],[3,4]])

arr1

array([1, 2, 3, 4, 5, 6])

In [19]:
arr1.shape     # Check the shape attribute of the array

(6,)

In [20]:
arr2

array([[1, 2],
       [3, 4]])

In [21]:
arr2.shape

(2, 2)

We can use the ```reshape()``` method to change the shape of these arrays.

In [22]:
arr1.reshape(3,2)

array([[1, 2],
       [3, 4],
       [5, 6]])

Notice that this change doesn't happen "in place".

In [23]:
arr1

array([1, 2, 3, 4, 5, 6])

To change the shape of an array permanently, we need to store this change in a variable.  We can even store this change over the old array by assigning the reshaped array to the same variable name.

In [24]:
arr1 = arr1.reshape(3,2)

In [25]:
arr1

array([[1, 2],
       [3, 4],
       [5, 6]])

In [26]:
arr1.shape

(3, 2)
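Two details of ```reshape()``` are worth a quick sketch.  First, passing ```-1``` for one dimension asks NumPy to infer it from the size of the array.  Second, although ```reshape()``` does not change the original array's shape, for a simple contiguous array like the one below it returns a view that shares the same underlying data, so writing to the reshaped array also changes the original.  The variable names here are hypothetical.

```python
import numpy as np

original = np.arange(1, 7)            # array([1, 2, 3, 4, 5, 6])

# -1 asks NumPy to infer that dimension from the array's size
reshaped = original.reshape(-1, 2)    # shape (3, 2)

# The reshaped array is a view here, so the data is shared
reshaped[0, 0] = 100
print(original[0])                            # 100
print(np.shares_memory(original, reshaped))   # True

# An explicit copy is independent of the original
independent = original.reshape(3, 2).copy()
independent[0, 0] = -1
print(original[0])                            # still 100
```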

We can also perform mathematical operations with NumPy arrays.  In this way, NumPy arrays behave much like the mathematical counterparts they represent.

In [27]:
a1 = np.array([1,2,3])
a2 = np.array([4,5,6])

a1 + a2         # Vector Addition

array([5, 7, 9])

In [28]:
3 * a1          # Scalar Multiplication

array([3, 6, 9])

In [29]:
a1 * a2         # Element-wise multiplication

array([ 4, 10, 18])

In [30]:
a2 / a1         # Element-wise division

array([4. , 2.5, 2. ])

In [31]:
a1 ** 2         # Element-wise exponentiation

array([1, 4, 9], dtype=int32)

We can also apply functions to each element of an array.

In [32]:
np.exp(a1)     #e^a1

array([ 2.71828183,  7.3890561 , 20.08553692])

Arrays also support the dot product operation.  For two vectors, ```dot()``` returns the usual inner product (the sum of the element-wise products).  For a matrix and a vector, or for two matrices, ```dot()``` performs the corresponding matrix multiplication.  It can be called either as a function or as a method of the array itself.

In [33]:
np.dot(a1, a2)

32

In [34]:
a1.dot(a2)

32

In [35]:
m1 = np.array([[1,2,3], [4,5,6]])

m1 + a1

m1.dot(a1)

array([14, 32])

In [36]:
m2 = np.array([[1,2],[3,4],[5,6]])

m1.dot(m2)

array([[22, 28],
       [49, 64]])
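Two related points can be sketched briefly.  In Python 3.5 and later, the ```@``` operator performs the same matrix multiplication as ```dot()``` for one- and two-dimensional arrays.  Separately, adding a vector to a matrix with ```+``` is not a matrix operation at all: NumPy "broadcasts" the vector across every row.  The arrays below repeat the ones defined above so the example is self-contained.

```python
import numpy as np

m1 = np.array([[1, 2, 3], [4, 5, 6]])
m2 = np.array([[1, 2], [3, 4], [5, 6]])
a1 = np.array([1, 2, 3])

# The @ operator gives the same result as m1.dot(m2)
product = m1 @ m2
print(product)

# Broadcasting: a1 (shape (3,)) is added to each row of m1 (shape (2, 3))
broadcast_sum = m1 + a1
print(broadcast_sum)
```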

Another common matrix operation that NumPy makes very easy is taking the transpose.  This is available as a simple attribute of the array.

In [37]:
A = np.array([[1,2],[3,4]])
A

array([[1, 2],
       [3, 4]])

In [38]:
A.T

array([[1, 3],
       [2, 4]])

NumPy also has some very convenient functions for summarising vectors and matrices.  For example, NumPy has built-in functions for calculating the sum and mean of a vector or matrix.

In [39]:
np.sum(a1)

6

In [40]:
a1.sum()

6

In [41]:
np.mean(a1)

2.0

In [42]:
a1.mean()

2.0

In the case of matrices, we can either sum the entire array, or we can get the sum of each column or each row.

In [43]:
np.sum(A)

10

In [44]:
# Sum each column (axis=0 adds down the rows)

np.sum(A, axis=0)

array([4, 6])

In [45]:
# Sum each row (axis=1 adds across the columns)

np.sum(A, axis=1)

array([3, 7])
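The ```axis``` argument is easy to mix up, so here is a short sketch on a non-square matrix, where the two results have different lengths.  The ```axis``` value names the axis that is collapsed: ```axis=0``` collapses the rows and leaves one value per column, while ```axis=1``` collapses the columns and leaves one value per row.  The matrix ```M``` is a hypothetical example.

```python
import numpy as np

M = np.array([[1, 2, 3],
              [4, 5, 6]])      # shape (2, 3): 2 rows, 3 columns

col_sums = M.sum(axis=0)       # collapses the rows    -> one value per column
row_sums = M.sum(axis=1)       # collapses the columns -> one value per row

print(col_sums)   # [5 7 9]
print(row_sums)   # [ 6 15]
```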

### Indexing and Selection

NumPy arrays can be indexed using "slice" notation, just as with normal Python lists.

In [46]:
a1 = np.array([1,2,3,4,5,6,7,8,9])

a1[:5]

array([1, 2, 3, 4, 5])

In [47]:
a1[5:]

array([6, 7, 8, 9])

In [48]:
a1[-1]

9

In [49]:
a1[-4:-1]

array([6, 7, 8])

Matrices can be indexed using the double bracket notation, as with nested lists in Python, but they can also be indexed using the more natural two-dimensional index.

In [50]:
m = np.array([[1,2],[3,4]])

m[0][1]

2

In [51]:
m[0,1]

2

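Finally, indexing and selection can be done using Boolean conditions.  A comparison such as ```a1 > 3``` produces an array of ```True```/```False``` values, and indexing with that array keeps only the elements where the condition holds.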

In [52]:
a1

array([1, 2, 3, 4, 5, 6, 7, 8, 9])

In [53]:
a1[a1 > 3]

array([4, 5, 6, 7, 8, 9])

In [54]:
m

array([[1, 2],
       [3, 4]])

In [55]:
m[m < 4]

array([1, 2, 3])
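Conditions can also be combined.  The sketch below shows a compound mask built with ```&```, the use of ```np.where()``` to recover positions, and assignment through a mask.  The array ```values``` is a hypothetical example.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# Combine conditions with & (and) or | (or); the parentheses are required
mask = (values > 3) & (values % 2 == 0)
print(values[mask])              # [4 6 8]

# np.where returns the positions at which the condition holds
positions = np.where(values > 6)[0]
print(positions)                 # [6 7 8]

# A Boolean mask can also be used for assignment
clipped = values.copy()
clipped[clipped > 6] = 6
print(clipped)                   # [1 2 3 4 5 6 6 6 6]
```

Python's ```and``` and ```or``` keywords do not work element-wise on arrays, which is why the ```&``` and ```|``` operators are used instead.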

### More Matrix Operations

NumPy also provides access to many matrix operations from Linear Algebra in the ```linalg``` submodule.  We can find the inverse of a matrix.

In [56]:
A

array([[1, 2],
       [3, 4]])

In [57]:
np.linalg.inv(A)

array([[-2. ,  1. ],
       [ 1.5, -0.5]])

We can calculate the determinant of a matrix.

In [58]:
np.linalg.det(A)

-2.0000000000000004

We can create a diagonal matrix.

In [59]:
np.diag([1,2,3])

array([[1, 0, 0],
       [0, 2, 0],
       [0, 0, 3]])

And we can calculate the trace of a matrix.

In [60]:
np.trace(A)

5

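Finally, we can solve a linear system of the form $Ax = b$.  We could compute $x = A^{-1}b$ explicitly using the inverse, but the ```solve()``` function is generally preferred, since it avoids forming the inverse and is usually more accurate.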

In [61]:
A

array([[1, 2],
       [3, 4]])

In [62]:
b = np.array([1,2])
b

array([1, 2])

In [63]:
x = np.linalg.inv(A).dot(b)
x

array([2.22044605e-16, 5.00000000e-01])

Or

In [64]:
x = np.linalg.solve(A, b)
x

array([0. , 0.5])
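The two answers above differ by a tiny rounding error (```2.22e-16``` versus ```0```).  A minimal way to check a solution, sketched below, is to substitute it back into the system and compare with a tolerance using ```np.allclose()``` rather than testing for exact equality.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
b = np.array([1, 2])

x_solve = np.linalg.solve(A, b)
x_inv = np.linalg.inv(A) @ b

# Substitute the solution back into Ax = b and compare with a tolerance
print(np.allclose(A @ x_solve, b))     # True

# The two approaches agree up to floating-point rounding
print(np.allclose(x_solve, x_inv))     # True
```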

#### NumPy Problems

#### 1.
The admission fee at a small fair is 2.50 GBP for children and 5.00 GBP for adults.  On a given day, 2,400 tickets are sold and 8,000 GBP is collected.  How many children and how many adults attended?

In [66]:
A = np.array([[1, 1], [2.5, 5]])
b = np.array([[2400], [8000]])

x = np.linalg.solve(A, b)
print("Children: {c}\nAdults: {a}".format(c = x[0], a = x[1]))

Children: [1600.]
Adults: [800.]


In [68]:
x = np.array([2400, 8000])
print(x)
x = np.expand_dims(x, 1)
print(x)

[2400 8000]
[[2400]
 [8000]]


#### 2.
Create a $6 \times 6$ identity matrix.

In [69]:
np.eye(6)

array([[1., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 1.]])

#### 3.
Reverse the following vector.

In [70]:
v = np.array([1,2,3,4])

v_reverse = v[::-1]

v_reverse

array([4, 3, 2, 1])

#### 4.
Some people are standing in a row in a park. There are trees between them which cannot be moved (marked by -1). Your task is to write a function to rearrange the people by their heights in ascending order without moving the trees.

Example

For ```a = [-1, 150, 190, 170, -1, -1, 160, 180]```, the output should be
```sort_by_height(a) = [-1, 150, 160, 170, -1, -1, 180, 190]```

In [71]:
def sort_by_height(a):
    a[a > 0] = np.sort(a[a > 0])
    return a

a = np.array([-1, 150, 190, 170, -1, -1, 160, 180])
sort_by_height(a)

array([ -1, 150, 160, 170,  -1,  -1, 180, 190])
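Note that the function above modifies the array passed to it.  If the caller needs the original order afterwards, one possible variant (a sketch with the hypothetical name ```sort_by_height_copy```) works on a copy instead, and also accepts a plain Python list.

```python
import numpy as np

def sort_by_height_copy(a):
    """Return a sorted copy, leaving the input untouched."""
    result = np.array(a)              # np.array copies its input by default
    is_person = result != -1          # trees are marked by -1
    result[is_person] = np.sort(result[is_person])
    return result

heights = np.array([-1, 150, 190, 170, -1, -1, 160, 180])
sorted_heights = sort_by_height_copy(heights)

print(sorted_heights)   # [ -1 150 160 170  -1  -1 180 190]
print(heights)          # original order is preserved
```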

# Pandas

Pandas is an open-source library built on top of NumPy.  It introduces some new data structures to Python, including the DataFrame, which will be used to store general tabular data.  In order to use these tools, we have to import the Pandas library.  There is a strong convention in industry to reference Pandas by the shorthand "pd" as follows.

In [72]:
import pandas as pd

### Series

Before we get to DataFrames, we're first going to look at another, more basic data structure in Pandas, the Series.  A Series is much like a NumPy Array, but a Series is able to have axis labels.  This means a Series can now be indexed by a label.

In [73]:
l = [1,2,3]                              # An ordinary Python list
labels = ["a", "b", "c"]                 # A list of "labels"
a = np.array([10, 20, 30])               # A NumPy Array
d = {i:j for i, j in zip(labels, a)}     # An ordinary Python dictionary

We're going to use the objects defined above to demonstrate how we can create a Series.  First, we can simply pass in our Python list to the argument ```data```.

In [74]:
s = pd.Series(data = l)
s

0    1
1    2
2    3
dtype: int64

We see here that the Series has been created with a default index (a sequence of integers starting at 0), since we didn't pass any index labels to the function.  If we'd like the Series to have index labels, we simply pass in a list of labels for the function to use.

In [75]:
s = pd.Series(data = l, index = labels)
s

a    1
b    2
c    3
dtype: int64

Now our Series has index labels "a", "b", and "c".  We can also use a NumPy array as our data, as follows.

In [76]:
s = pd.Series(data = a, index = labels)
s

a    10
b    20
c    30
dtype: int32

The arguments ```data``` and ```index``` are actually positional arguments to the function, so we can just pass them in and Python will know what to do as long as we pass the data first and the label second.

In [77]:
s = pd.Series(a, labels)
s

a    10
b    20
c    30
dtype: int32

Finally, we can also pass a dictionary into the ```Series()``` function, and it will assign the values as the data and the keys as the index labels.

In [78]:
s = pd.Series(d)
s

a    10
b    20
c    30
dtype: int64

This added feature, the axis label, allows a Pandas Series to very efficiently and easily perform certain operations that would otherwise be more complex to execute.  For example, we can now index into a Series using the index label.

In [79]:
s["c"]

30

In [80]:
s[["a", "c"]]

a    10
c    30
dtype: int64

Being built upon NumPy Arrays, Series also support vectorized arithmetic operations, but these already powerful operations can be augmented further by the index label.  We see that we can perform vector addition with two Series as follows.

In [81]:
s1 = pd.Series([1,2,3])
s2 = pd.Series([4,5,6])

s1 + s2

0    5
1    7
2    9
dtype: int64

Now let's look at two Series with axis labels.

In [82]:
s1 = pd.Series([1,2,3,4,5], ["USA", "UK", "Canada", "Australia", "France"])
s2 = pd.Series([1,2,3,4,5], ["UK", "USA", "Australia", "France", "Germany"])

s3 = s1 + s2
s3

Australia    7.0
Canada       NaN
France       9.0
Germany      NaN
UK           3.0
USA          3.0
dtype: float64

In [83]:
np.isnan(s3)

Australia    False
Canada        True
France       False
Germany       True
UK           False
USA          False
dtype: bool

Even though each country in ```s1``` has a different "position" in ```s2```, Pandas was able to match up the values assigned to each country and perform this element-wise addition according to the axis label.  Notice that the countries that don't appear in both Series show up as "Not a Number" (```NaN```).  This can be very powerful and convenient for making sure all of our calculations line up as they should while dealing with large data sets.
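If a missing label should count as zero rather than produce ```NaN```, the ```add()``` method accepts a ```fill_value``` argument.  The sketch below uses two small hypothetical Series to compare the two behaviours.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["USA", "UK", "Canada"])
s2 = pd.Series([10, 20, 30], index=["UK", "USA", "Germany"])

# Plain addition: NaN wherever a label is missing on one side
plain = s1 + s2

# add() with fill_value=0: a missing entry is treated as 0
filled = s1.add(s2, fill_value=0)

print(plain)
print(filled)
```

Here ```plain``` has ```NaN``` for Canada and Germany, while ```filled``` keeps their single available values (3 and 30).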

### DataFrames

While the Pandas Series object has a lot of useful properties and features, the real workhorse of data manipulation in Python is the Pandas DataFrame, which is able to store general tabular data.  We can generate DataFrames for testing out concepts by casting dictionaries.

In [84]:
d = {i:[j, j+5, j*2] for i,j in zip(labels, a)}
d

{'a': [10, 15, 20], 'b': [20, 25, 40], 'c': [30, 35, 60]}

In [85]:
df = pd.DataFrame(d)
df

    a   b   c
0  10  20  30
1  15  25  35
2  20  40  60


#### Importing and Exporting Data

Most of the time, in real-world data science applications, our DataFrames will be the result of loading in some file given to us by a client (usually in some text format, like a .csv, or an Excel spreadsheet).  To appreciate the power of Pandas, let's first look at how we would load in such a data file without it, using only plain Python and NumPy.

In [89]:
data = []

with open("data_2d.csv") as f:
    for line in f:
        row = line.split(",")
        sample = list(map(float, row))   # map() is lazy in Python 3, so build a list
        data.append(sample)
        
data = np.array(data)

data[:10, :]

That wasn't exactly hard, but it wasn't as easy as it could be either.  Also, the object ```data``` is a NumPy array, so we're restricted in terms of the type of data we can store in it, because all of the elements of an array *must* be of the same type.  However, we know that a lot of real world data we encounter in industry is a mixture of data types.  One column might be numeric, while other columns are categorical variables, dates, or free text.  In this case, we definitely need a DataFrame.  We can read a file into a DataFrame quite easily.  Let's go ahead and load the same data set we just used to create a DataFrame.

In [None]:
data = pd.read_csv("data_2d.csv", header=None)
data.head(10)

Now *that* was easy.  And the object ```data``` is now a Pandas DataFrame, which has a number of useful methods and features.  We will save the discussion of most methods and features for the next section.  However, we will demonstrate now how to write a DataFrame to a csv file.

In [None]:
data.to_csv("data_2d_test_write.csv")

Now let's try and read this file back in.

In [None]:
data_test_read = pd.read_csv("data_2d_test_write.csv")
data_test_read.head(10)

We see that the method also wrote the index as a column in the csv file.  When we read the file back in, Pandas created another index, so now we have two.  To avoid this, we can use the ```index``` argument in the ```to_csv``` method.  By default, this argument is set to ```True```, but we will set it to ```False```.

In [None]:
data.to_csv("data_2d_test_write.csv", index = False)
data_test_read = pd.read_csv("data_2d_test_write.csv")
data_test_read.head(10)

Now when we read the file back in, we see that we don't have the pesky redundant indexes.
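The same round trip can be reproduced without touching the disk.  The sketch below writes a small hypothetical DataFrame to an in-memory text buffer (```io.StringIO```) instead of a file, once with the index and once without, and reads each version back.

```python
import io
import pandas as pd

df = pd.DataFrame({"a": [10, 15, 20], "b": [20, 25, 40]})

# index=True is the default, so the index is written as an unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
columns_with_index = pd.read_csv(with_index).columns.tolist()
print(columns_with_index)          # ['Unnamed: 0', 'a', 'b']

# With index=False the round trip reproduces the original DataFrame
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
round_trip = pd.read_csv(without_index)
print(round_trip.equals(df))       # True
```

If a file has already been written with its index, passing ```index_col=0``` to ```read_csv``` is another way to avoid the duplicate column.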

Another common data format we encounter in industry is the Excel workbook.  An Excel workbook can be thought of as a collection of "Sheets", and each Sheet can be thought of as a table (read DataFrame).  To read data directly from an Excel file, Pandas needs a separate reader package.  Older Pandas releases used ```xlrd``` for all Excel files; recent releases use ```openpyxl``` for .xlsx files and ```xlrd``` only for the legacy .xls format.  Both are usually included with the Anaconda distribution.  If you don't have them, simply open the command prompt or terminal and type the following command.

```conda install openpyxl xlrd```

Or, if you're not using Anaconda, use the normal pip install.

```pip install openpyxl xlrd```

In [None]:
data = pd.read_excel("excel_sample.xlsx", sheet_name = "Sheet1")
data.head(10)

Keep in mind that an Excel workbook may contain all kinds of structures under the hood, such as formulas, images, and macros.  Pandas can't read in these aspects of an Excel workbook; it can only read the data values in the cells.  When reading a sheet that contains them, the best case is that Pandas ignores them and simply reads in the data; the worst case is that the read fails with an error.

To save a DataFrame as an Excel file, simply call the corresponding method.

In [None]:
data.to_excel("excel_sample_write_test.xlsx", sheet_name = "Test")

In [None]:
data_test_read = pd.read_excel("excel_sample_write_test.xlsx", sheet_name = "Test")
data_test_read.head(10)

Pandas can also retrieve data from sources on the internet.  For this example, let's navigate to the web page https://www.fdic.gov/bank/individual/failed/banklist.html.  This page contains a table of failed banks with some additional information about each bank, such as the city, the bank's certification number, and when it failed.  Pandas can actually retrieve this table directly from the web page.

In [None]:
data = pd.read_html("https://www.fdic.gov/bank/individual/failed/banklist.html")

Let's check to see the type of our object ```data```.

In [None]:
type(data)

We see that ```data``` is actually an ordinary Python ```list``` instead of a DataFrame.  What the function ```pd.read_html``` has done is parse the entire web page looking for table markers.  Everywhere that the parser found a table marker in the html, it parsed the table and then stored that information in a list.  We now have to check what is in our list to see if we got what we were looking for.

In [None]:
len(data)

Luckily, we see that in this case the function only found one table, so we can simply extract it from our list as follows.

In [None]:
data = data[0]
data.head(10)

Please note that there are a lot of file formats out there (JSON, FWF, SAS, STATA, and many more), and Pandas has functions to read each of these formats.  The ones shown here are simply the most commonly encountered.  Pandas can also read data using SQL queries, but we'll cover this in detail when we cover SQL.  For now we will move on to the methods of DataFrames.

#### Basic DataFrame Methods

The first method you might've noticed is the ```head(n=5)``` method.  This allows us to get a quick glimpse at the structure of our table without trying to display the entire thing.  The default is for ```head()``` to output the first five rows of the DataFrame, but we can change the number of rows output by simply passing the number of rows we want to the method.

In [None]:
data.head()

Now let's look at basic methods for querying DataFrames for information.  Here we will use the iris data set.

In [None]:
iris_df = pd.read_csv("iris.csv")
iris_df.head()

First, we may want to format the names of our columns to better fit the style of the Python language.  There is also a more practical consideration in doing this.  Here we see that most of our column names contain the "." character, which Python also uses to access attributes and call methods.  Pandas lets us access a column as an attribute of the DataFrame (for example, ```df.column_name```), but this shortcut cannot work for a name like "Sepal.Length", because Python would look for an attribute named ```Sepal``` instead.  Names like these tend to cause errors or unexpected results later on, so we rename the columns with the ```rename``` method as follows.

In [None]:
iris_df \
    .rename(
        columns = {
            "Sepal.Length":"sepal_length",
            "Sepal.Width":"sepal_width",
            "Petal.Length":"petal_length",
            "Petal.Width":"petal_width",
            "Species":"species"},
        inplace = True
    )

In [None]:
iris_df.head()

Now that our columns are named as we would like them, we can look at methods for basic queries into the data.  We've seen that DataFrames are very similar to Python dictionaries.  Recall that a dictionary can be cast directly as a Pandas DataFrame and the keys will be interpreted as the column names, while the values will be taken as the columns themselves.  We can extend this analogy a little further by calling a column from our DataFrame in the same way we would call a value by its key name in a dictionary.

In [None]:
iris_df["sepal_length"].head()

Let's see the exact nature of this output.

In [None]:
type(iris_df["sepal_length"])

We see that each individual column in a Pandas DataFrame is actually a Pandas Series.  So, a DataFrame is just a collection of Series which share the same index.  It turns out the rows of the DataFrame can be thought of in this way as well.

In [None]:
iris_df.iloc[0]

In [None]:
type(iris_df.iloc[0])

Each row is a Pandas Series indexed by the column names in the DataFrame.

We can also call multiple columns at once by passing a list of column names into the brackets.  We will see that in this case the result is also a Pandas DataFrame, which is a sub frame of the original DataFrame.

In [None]:
iris_df[["sepal_length", "sepal_width"]].head()

In [None]:
type(iris_df[["sepal_length", "sepal_width"]])

Now let's look at methods for subsetting the rows of a DataFrame.  We just saw in the above example that we can call rows by their position using the ```iloc``` indexer.  Think of the ```loc``` as being "location", and the ```i``` as "integer."  So ```iloc``` selects rows (and columns) by their integer position, counting from 0.

In [None]:
iris_df.iloc[10]

In [None]:
iris_df.iloc[0:12]

We can use slice notation to call ranges of rows in a DataFrame.  Just as with columns, when we call a single row, we get a Series; when we call multiple rows, we get a DataFrame.  We can also use the method to call columns after subsetting rows, so we can subset our DataFrame in multiple ways simultaneously.

In [None]:
iris_df.iloc[0:10]["sepal_length"]

In [None]:
iris_df.iloc[0:10][["sepal_length", "sepal_width"]]

Often we won't be subsetting our DataFrames using row positions, but rather with logical conditions.  For this we will use the ```loc``` indexer, which selects by label or by a Boolean condition, to find the rows that match our condition(s).

In [None]:
iris_df.loc[iris_df["sepal_length"] > 5].head()

In [None]:
iris_df.loc[(iris_df["sepal_length"] > 5) & (iris_df["sepal_width"] > 4)]

We can also use these indexers to select rows and columns simultaneously, but we must use the same kind of selector on both axes.  That is, we must use integer positions with ```iloc```, and labels or Boolean conditions with ```loc```.

In [None]:
iris_df.iloc[0:10, 0:]

In [None]:
iris_df.loc[
    (iris_df["sepal_length"] > 5) &
    (iris_df["sepal_width"] > 4),
    ["petal_length",
     "petal_width",
     "species"]]

If we try to mix and match within the same method, we will get an error.

In [None]:
iris_df.iloc[0:10, "sepal_width"]

In [None]:
iris_df.loc[iris_df["sepal_width"] > 4, 3]
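When a mix of positions and labels really is needed, one workaround is to translate one kind of selector into the other first.  The sketch below uses a small hypothetical DataFrame with string row labels so that positions and labels are easy to tell apart.

```python
import pandas as pd

df = pd.DataFrame(
    {"height": [1.2, 3.4, 5.6], "width": [0.5, 0.7, 0.9]},
    index=["x", "y", "z"],
)

by_position = df.iloc[0, 1]           # first row, second column -> 0.5
by_label = df.loc["x", "width"]       # the same cell, by label   -> 0.5

# Row positions with a column label: convert the label to a position
width_pos = df.columns.get_loc("width")
first_two_widths = df.iloc[0:2, width_pos]

# A row condition with a column position: convert the position to a label
tall_widths = df.loc[df["height"] > 3, df.columns[1]]

print(by_position, by_label)
print(first_two_widths)
print(tall_widths)
```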

Pandas DataFrames also have some built-in methods for handling missing data.  This is an important consideration in any real world data science project.

In [None]:
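# Drop every row that contains at least one missing value
iris_df.dropna(inplace = True)

# Alternatively, fill the missing values in a column with the column mean
# (fillna returns a new Series, so assign the result back to keep it)
iris_df["sepal_length"] = iris_df["sepal_length"] \
    .fillna(iris_df["sepal_length"].mean())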
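Since the iris data may not contain any missing values, the effect of these methods is easier to see on a small hypothetical DataFrame with a few gaps.  The sketch below counts the missing values, then compares dropping incomplete rows with filling each gap with its column mean.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"length": [5.0, np.nan, 7.0],
                   "width": [3.0, 3.5, np.nan]})

# Count the missing values in each column
missing_counts = df.isna().sum()
print(missing_counts)

# Option 1: keep only the rows with no missing values
dropped = df.dropna()

# Option 2: replace each missing value with the mean of its column
filled = df.fillna(df.mean())

print(dropped)
print(filled)
```

Dropping rows is simple but discards data, while filling keeps every row at the cost of introducing estimated values; which is appropriate depends on the analysis.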

#### Two-Table Verbs

In [None]:
# how can be "left", "right", "inner", or "outer" (a full join); df1, df2, and "key" are placeholders
pd.merge(df1, df2, on=["key"], how="inner")
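As a runnable illustration of the join types, the sketch below merges two small hypothetical tables on a shared ```student_id``` column.  The table and column names are made up for this example.

```python
import pandas as pd

students = pd.DataFrame({"student_id": [1, 2, 3],
                         "name": ["Ann", "Bob", "Cy"]})
scores = pd.DataFrame({"student_id": [2, 3, 4],
                       "score": [88, 92, 75]})

# inner: only keys present in both tables (ids 2 and 3)
inner = pd.merge(students, scores, on="student_id", how="inner")

# left: every row of the left table (ids 1, 2, 3); score is NaN for id 1
left = pd.merge(students, scores, on="student_id", how="left")

# outer: keys from either table (ids 1, 2, 3, 4)
outer = pd.merge(students, scores, on="student_id", how="outer")

print(inner)
print(left)
print(outer)
```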


In [None]:
df = pd.DataFrame(
    {"subject":["Math", "Biology", "Western Civilization"],
    "Mid1":[99, 98, 100],
    "Mid2":[92, 97, 99],
    "Final":[99, 98, 100]})
df.head()

In [None]:
pd.melt(df, id_vars="subject",
        value_vars=["Mid1", "Mid2", "Final"],
        var_name = "exam",
        value_name = "score")

### matplotlib

In [None]:
import matplotlib.pyplot as plt
import seaborn as sn; sn.set()
%matplotlib inline

In [None]:
x = np.linspace(0, 10, 100)
y = 2.7 + 1.4*x + np.random.randn(100)

In [None]:
plt.figure(figsize=(16,12))
plt.plot(x, y)
plt.xlabel("X")
plt.ylabel("y")
plt.title("Fake Data")
plt.show()

In [None]:
u = 14 - 4.1*x + np.random.randn(100)

plt.figure(figsize=(16,12))
plt.scatter(x,y, label = "y")
plt.scatter(x,u, label = "u")
plt.legend()

In [None]:
z = np.random.randn(100)

plt.figure(figsize=(16,12))
plt.hist(z)
plt.xlim((-2,2))
plt.xlabel("This is whatever z is", fontsize = 18)