# Python Data Science Prep Class - Intro to Pandas 
#### (JPW Lecture)

In [14]:
# import Numpy with name as 'np' by convention.
# import Pandas with name as 'pd' by convention.  
# We can now use 'np' and 'pd' to access all Numpy and Pandas methods and attributes, respectively.
import numpy as np
import pandas as pd

## Creating a New DataFrame
***
There are many ways we can create a DataFrame.

### 1. We can just pass a list. 
Note the columns and rows (index) are not labeled but are simply given a number. This is just a demonstration to show how DataFrames parse incoming data and would likely never be done.  

In [15]:
# Passing in a single list creates a single column.
one_col = [10, 20, 30, 40, 50]
pd.DataFrame(data=one_col)

Unnamed: 0,0
0,10
1,20
2,30
3,40
4,50


In [16]:
# Passing in a nested list (note double brackets) creates a single row
one_row = [[10, 20, 30, 40, 50]]
pd.DataFrame(data=one_row)

Unnamed: 0,0,1,2,3,4
0,10,20,30,40,50


In [17]:
# Passing in a list of nested lists creates as many rows as there are nested lists
three_rows = [[10, 20, 30, 40, 50], [11, 21, 31, 41, 51], [12, 22, 32, 42, 52]]
pd.DataFrame(data=three_rows)

Unnamed: 0,0,1,2,3,4
0,10,20,30,40,50
1,11,21,31,41,51
2,12,22,32,42,52


### 2. We can create a _skeleton_ DataFrame by specifying the column and/or row names and dimensions at creation.  
Note we don't have to actually pass in any data to create a DataFrame.  We can just specify the structure of its rows (index) x columns and it will be created blank.

In [18]:
# Create skeleton DF
pd.DataFrame(index=['row_1', 'row_2', 'row_3'], columns=['col_1', 'col_2', 'col_3'])

Unnamed: 0,col_1,col_2,col_3
row_1,NaN,NaN,NaN
row_2,NaN,NaN,NaN
row_3,NaN,NaN,NaN


When we run this, we get a 3x3 DataFrame as expected, but since we passed no actual data in we get __`NaN`__ values.  
NaN stands for "__Not a Number__", and is equivalent to a null value.  

__Dense v. Sparse data__: To keep things simple here, a matrix that has NaNs for a given row or column is called "sparse" (it can be more technical than this, but that's the gist you need to take home).  If the DataFrame is mostly full,  it is called "dense."  In general, we will want to ideally work with "dense" data as most machine learning algorithms either perform better with it or actually require it.  The creation of a skeleton DF has uses for memory efficiency in certain use cases, but we will not encounter those today.  I only mention it here so you will be familiar with the terms and their meanings.  We will see how to deal with such "missing" data further below.
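As a rough illustration of the "how full is it?" question, here is a minimal sketch that counts the missing cells in the skeleton DataFrame from above. The variable names are made up for this example.

```python
import pandas as pd

# Same skeleton as above: labels only, no data.
skeleton = pd.DataFrame(index=['row_1', 'row_2', 'row_3'],
                        columns=['col_1', 'col_2', 'col_3'])

# isna() marks each missing cell as True; sum() counts them per column.
missing_per_column = skeleton.isna().sum()
print(missing_per_column)

# Fraction of all cells that are missing (1.0 means completely empty).
missing_fraction = skeleton.isna().to_numpy().mean()
print(missing_fraction)
```

A fraction near 0 would suggest a "dense" DataFrame; a fraction near 1 a "sparse" one.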

Normally we won't want an empty DataFrame, however.  So one way to fill it with data is to use the __`data=`__ argument when we create it, like so:
### 3. We can create filled DF with same skeleton as \#2

In [19]:
# Use same skeleton as above, but also give data to fill the DF with
data = np.arange(1, 10).reshape(3,3)
pd.DataFrame(data=data, index=['row_1', 'row_2', 'row_3'], columns=['col_1', 'col_2', 'col_3'])

Unnamed: 0,col_1,col_2,col_3
row_1,1,2,3
row_2,4,5,6
row_3,7,8,9


Here we use Numpy to create a list using __`np.arange`__, which stands for "array range", and will create an _array_ (Numpy's advanced version of a list) from the range you specify.  For whole numbers it holds the same values as __`list(range(1, 10))`__. 

In [20]:
np.arange(1, 10)    # returns a 1-dimensional array (list)

array([1, 2, 3, 4, 5, 6, 7, 8, 9])

In [21]:
list(range(1, 10))   # also returns a 1-dimensional list
# Note if you are on Python 2.x instead of 3.x, you don't have to use the enclosing 'list()' func.

[1, 2, 3, 4, 5, 6, 7, 8, 9]

There are two primary differences between Numpy's advanced __`arange()`__ function (and its cousin the __`linspace()`__ function) and the standard __`range()`__ wrapped in __`list()`__.

1. __`np.arange()`__ can take any value for a number (not just an `int`) and use any increment between them.  
    + `np.arange(.5, 1, .1)` will return `[.5, .6, .7, .8, .9]`, for example.  
    
2. The array returned by __`np.arange()`__ can be reshaped using Numpy's __`.reshape()`__ method for the array class.
    + `.reshape(r, c)`, where `r` = rows, `c` = cols.  
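
The two points above can be sketched as follows. This is only an illustration; the variable names are invented for the example.

```python
import numpy as np

# A non-integer step: start at 0.5, stop before 1, step by 0.1.
# Floating-point rounding can make the endpoint of arange surprising,
# so np.linspace (which takes a count instead of a step) is often safer.
tenths = np.arange(0.5, 1, 0.1)
five_points = np.linspace(0.5, 0.9, 5)   # 5 evenly spaced values, endpoints included

# reshape(r, c): 9 values rearranged into 3 rows and 3 columns.
grid = np.arange(1, 10).reshape(3, 3)
print(grid.shape)   # (3, 3)
print(grid[1])      # second row: [4 5 6]
```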
    
We need \#2 here to make our data fit the DF structure we have just created.  Let's see what happens when we try to pass in the exact same data, a list from 1 to 9, without changing its shape first.

In [22]:
# No reshaping of the data this time....
data = np.arange(1, 10)   
pd.DataFrame(data=data, index=['row_1', 'row_2', 'row_3'], columns=['col_1', 'col_2', 'col_3'])

ValueError: Shape of passed values is (1, 9), indices imply (3, 3)

Uh oh! We got "__`ValueError: Shape of passed values is (1, 9), indices imply (3, 3)`__"  Pandas is telling us, "hey, the DataFrame you created implies a shape of 3x3, but you gave me a 1x9 set of data.  Not cool. Never program again!"

Okay.  I made up the last part about never programming again, but Pandas absolutely said the other part.
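
If you would like to see the complaint without crashing your notebook, one option is to catch the `ValueError`. A minimal sketch follows; the `labels` helper dict is just a convenience invented for this example.

```python
import numpy as np
import pandas as pd

labels = dict(index=['row_1', 'row_2', 'row_3'],
              columns=['col_1', 'col_2', 'col_3'])

flat = np.arange(1, 10)          # 9 values in one dimension, not 3x3

try:
    pd.DataFrame(data=flat, **labels)
    failed = False
except ValueError as err:
    failed = True
    print('Pandas refused:', err)

# Reshaping first makes the data agree with the labels.
fixed = pd.DataFrame(data=flat.reshape(3, 3), **labels)
print(fixed.loc['row_2', 'col_3'])   # 6
```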

Just a reminder, you can simply pass in the 3x3 data argument without naming the rows or columns and Pandas will create a 3x3 DF without named rows or columns.

There are a couple more prominent ways to create a DataFrame.
### 4. Create a DataFrame by using a dictionary as the data

In [None]:
data_dict = {'Col_A': 11, 'Col_B': 22, 'Col_C': 33}
pd.DataFrame(data=data_dict, index=range(3))

Note how we set the values for one row of data and then extended it three times by passing in the __`index=range(3)`__ argument.  Pandas requires an __`index`__ value when you pass in a single value dictionary. (delete the `index=range(3)` argument and run the cell above, you'll be scolded).
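
Here is a small sketch of that scolding, with the error caught so the cell still finishes. The flag name `scolded` is made up for the example.

```python
import pandas as pd

data_dict = {'Col_A': 11, 'Col_B': 22, 'Col_C': 33}

try:
    pd.DataFrame(data=data_dict)      # all scalar values, no index given
    scolded = False
except ValueError as err:
    scolded = True
    print('Pandas says:', err)

# With an index, each scalar is repeated once per row label.
repeated = pd.DataFrame(data=data_dict, index=range(3))
print(repeated.shape)   # (3, 3)
```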

One way around this, and an approach which provides greater flexibility going forward, is to make the __values__ of the dictionary into a __list__.  Then, when Pandas reads this as data, it knows there are only as many rows as there are members in the list.

In [None]:
data_dict_list = {'Col_A': [11], 'Col_B': [22], 'Col_C': [33]}
pd.DataFrame(data=data_dict_list)

Boom.  Same dictionary as before but the values are now in a list and Pandas knows exactly how many rows to make.  One catch is that all lists must be the same length in the dictionary.  For one last point let's extend the lists in this dictionary to illustrate how easy it is to make a DataFrame from a dict.

In [None]:
extended_dict_list = {'Col_A': [11, 101, 1001, 10001], 'Col_B': [22, 202, 2002, 20002], 'Col_C': [33, 303, 3003, 30003]}
pd.DataFrame(data=extended_dict_list)

In general, using dictionaries or even lists of dictionaries tends to be more flexible than lists or lists of lists.

In [None]:
list_of_dicts = [{'Col_A': 'Ohhhhh'}, {'Col_B': 'Mahhhhh'}, {'Col_C': 'Gawddddd'}]
pd.DataFrame(data=list_of_dicts)
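
Part of that flexibility is that the dictionaries in the list don't need to share keys. A quick sketch of what happens with the list above: each dict becomes a row, each distinct key becomes a column, and any cell without a value is filled with `NaN`.

```python
import pandas as pd

list_of_dicts = [{'Col_A': 'Ohhhhh'}, {'Col_B': 'Mahhhhh'}, {'Col_C': 'Gawddddd'}]
frame = pd.DataFrame(data=list_of_dicts)

print(frame.shape)                 # (3, 3): one row per dict, one column per distinct key
print(frame.isna().sum().sum())    # 6 cells have no value
```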

## Inspecting the Data
***
Now that we know how to make a DataFrame from scratch, let's load one with data that already exists in it so we can get this party started.  By convention, when loading a DataFrame we usually call it __df__, and variations arise from that.  Pandas has many ways to load different formats of data, but the most prevalent is likely from spreadsheet files, like Excel's `.xlsx` format or `.csv` files.
If possible, always choose the `.csv` file to import into a DataFrame.  The `.xlsx` format carries a lot of overhead that Pandas has to strip away, so large `.csv` files parse and load much faster.

We will now import a `.csv` file into a new DataFrame using the __`read_csv()`__ function.  There are many options to how you import a `.csv` file (see [documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) here), most of which are dependent upon the data itself, but we will stick to a vanilla import here.  

The data is actual real-life data from my dating history, all anonymous.  I am choosing to use this data because it is manageable but also highly relatable -- we've all dated someone who was really intelligent but had no sense of humor, or who was great on paper but had no chemistry in person.  Using these real examples for an everyday topic will help demonstrate how effortless Pandas can make data wrangling to find deeper insights into the data.

In [27]:
# load the data from a csv file into a DataFrame called "df"
df = pd.read_csv('pandas_dating_demo_df_anon.csv')
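
If you don't have the file handy, you can still try `read_csv()` on text held in memory. The sketch below is an assumption-laden stand-in: the three rows are typed in by hand for illustration and are not the real file, and `sample` is a made-up name.

```python
import io
import pandas as pd

# Hypothetical three-row sample, typed in by hand for illustration only.
csv_text = """ID,Age,Wine
1,27,Red
2,27,Red
3,25,White
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)      # (3, 3)
print(sample.columns.tolist())
```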

When I load data into a DataFrame, the first thing I want to know are the dimensions of the data.  For this we can use the __`shape`__ attribute (no parentheses needed).  Another thing I always do is take a quick peek at the data, just to see what I'm working with.  What are some of the column names and what kind of data do they have? (There are other ways to do this, as we'll see, but I find just looking at the data visually is a great first step).  To do this we will use the __`head`__ method to look at the first N rows (5 by default).

In [31]:
# returns shape = (rows, cols)
df.shape

(50, 16)

In [32]:
# show first five rows -- you can scroll to the right to see more columns!
df.head()

Unnamed: 0,ID,Age,Height(in.),Attraction,Hair,Intellectual_Connection,Humor,Chemistry,Attitude,Wine,Politics,Income,Divorced,Kids,Second_Date,Like_This_Person?
0,1,27,67,4.0,Blonde,2.5,1.0,4.0,Neutral,Red,Left,Low,No,No,Yes,No
1,2,27,65,5.5,Blonde,6.5,4.5,3.0,Complainer,Red,Left,Low,No,No,Yes,No
2,3,25,61,1.0,Brunette,2.0,2.5,1.0,Negative,White,Left,Low,No,No,No,No
3,4,21,68,8.0,Brunette,7.5,7.0,8.0,Negative,White,Left,Low,No,No,Yes,Yes
4,5,27,65,7.0,Blonde,7.5,6.5,8.0,Positive,Red,Left,Medium,No,No,No,Yes


We can also use __`df.tail()`__ to see the bottom N rows.

In [33]:
# Show bottom 8 rows
df.tail(8)

Unnamed: 0,ID,Age,Height(in.),Attraction,Hair,Intellectual_Connection,Humor,Chemistry,Attitude,Wine,Politics,Income,Divorced,Kids,Second_Date,Like_This_Person?
42,43,27,67,8.0,Brunette,3.5,4.0,6.5,Positive,Red,Right,Medium,No,No,Yes,No
43,44,40,70,9.0,Blonde,6.5,9.0,9.0,Positive,Red,Left,Medium,Yes,Yes,Yes,Yes
44,45,30,71,5.0,Blonde,9.0,7.0,3.5,Negative,Red,Independent,Low,No,No,Yes,Yes
45,46,32,65,9.5,Blonde,5.0,5.0,7.0,Negative,Red,Left,High,Yes,No,No,Yes
46,47,36,66,4.5,Blonde,4.0,7.5,2.5,Complainer,Red,Left,Low,Yes,Yes,No,No
47,48,33,63,8.0,Brunette,5.0,6.0,9.5,Positive,Red,Left,Low,No,No,Yes,Yes
48,49,40,68,7.0,Brunette,4.0,0.0,2.5,Negative,Red,Left,Medium,Yes,Yes,Yes,No
49,50,33,64,4.5,Brunette,6.0,5.0,2.5,Neutral,Red,Left,Low,Yes,Yes,No,No


One thing we will want to do frequently is perform different calculations or operations on parts of the DataFrame.  To do this, we will need to know the __type__ of data that's in each column.  For example, just because we see a written number in a column does not mean that the number is _actually_ an `int` or a `float`.  It could very well be a string, which will cause problems for us when we try to divide it by another column later.  So we use __`df.info()`__ to see the __`dtype`__ (which stands for _data type_) of every column.  

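A note on display limits: for DataFrames that are over 60 rows long (the default `display.max_rows` setting), when you try to print out the DataFrame you will see only the top and bottom rows, with the remaining middle rows represented by an ellipsis, showing you more data is there but is not being printed.  Similarly, `df.info()` falls back to a condensed summary for very wide DataFrames (more than 100 columns by default).  Use __`df.info(verbose=True)`__ when you want to see the per-column info anyway.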

In [34]:
# show counts and dtypes for all columns - "object" means "string", for practical purposes.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 16 columns):
ID                         50 non-null int64
Age                        50 non-null int64
Height(in.)                50 non-null int64
Attraction                 50 non-null float64
Hair                       50 non-null object
Intellectual_Connection    50 non-null float64
Humor                      50 non-null float64
Chemistry                  50 non-null float64
Attitude                   50 non-null object
Wine                       50 non-null object
Politics                   50 non-null object
Income                     50 non-null object
Divorced                   50 non-null object
Kids                       50 non-null object
Second_Date                50 non-null object
Like_This_Person?          50 non-null object
dtypes: float64(4), int64(3), object(9)
memory usage: 6.3+ KB
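
To make the "number that is really a string" problem concrete, here is a minimal sketch on a made-up frame (the `prices` data is hypothetical and not part of the dating dataset).

```python
import pandas as pd

# Hypothetical column: the numbers were stored as text.
prices = pd.DataFrame({'price': ['10', '20', '30'], 'qty': [2, 4, 5]})

looks_numeric = pd.api.types.is_numeric_dtype(prices['price'])
print(looks_numeric)    # False

# pd.to_numeric converts the text to real numbers so arithmetic works.
prices['price'] = pd.to_numeric(prices['price'])
prices['unit_price'] = prices['price'] / prices['qty']
print(prices['unit_price'].tolist())   # [5.0, 5.0, 6.0]
```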


For numeric columns we will often also like to get a lay of the land by using __`df.describe()`__ for a quick view of summary statistics.  This helps give us a bird's-eye-view understanding of the numeric data we are working with.  If a DataFrame has mixed dtypes, with some columns being numeric and others being `object`, only the numeric columns will be reported in the `describe()` summary.  However, if a DataFrame is _entirely_ made of `object` columns, the `describe` method will instead give you counts for the categorical data (how many values, how many unique values, and the most frequent value).  

In [35]:
df.describe()

Unnamed: 0,ID,Age,Height(in.),Attraction,Intellectual_Connection,Humor,Chemistry
count,50.0,50.0,50.0,50.0,50.0,50.0,50.0
mean,25.5,30.16,65.24,5.94,5.12,5.22,5.45
std,14.57738,5.946565,2.938554,2.335703,2.282498,2.513555,2.659695
min,1.0,21.0,60.0,1.0,1.0,0.0,0.0
25%,13.25,25.25,63.0,4.0,3.125,3.5,3.5
50%,25.5,30.0,65.0,6.75,5.5,5.25,6.0
75%,37.75,34.0,67.0,8.0,6.5,7.0,7.5
max,50.0,45.0,74.0,9.5,10.0,10.0,10.0


Using `df.describe()` we see things like the average age of the women I've dated in recent years has been 30.2 years old.  The tallest person was 74 inches -- 6'2".  And the mean rating for "similar humor" is 5.2 out of 10 (p.s. no one said dating was fun).  
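
For the text-only case, here is a small sketch of what `describe()` reports. The tiny `cats` frame is hypothetical, typed in just for this example.

```python
import pandas as pd

# Hypothetical all-text frame.
cats = pd.DataFrame({'Hair': ['Blonde', 'Blonde', 'Brunette'],
                     'Wine': ['Red', 'Red', 'White']})

summary = cats.describe()
print(summary)          # rows: count, unique, top, freq

# Add a numeric column: by default only 'Age' is summarized now.
mixed = cats.assign(Age=[27, 27, 25])
print(mixed.describe().columns.tolist())   # ['Age']

# include='all' asks for every column, numeric or not.
print(mixed.describe(include='all'))
```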

Last, there are three primary parts to any DataFrame.

1. Data
2. Rows (index)
3. Columns

Fortunately, Pandas was designed with the versatility of Python in mind.  So the rows (called the _index_) and the columns are actually __objects__ that we can inspect, apply, and loop through.  They both have their own attributes to access them.  Yeah, you guessed it, it's __`df.index`__ and __`df.columns`__.  Both of these commands will return an object that basically functions like a list.  

In [36]:
# get all columns in the DF
df.columns

Index(['ID', 'Age', 'Height(in.)', 'Attraction', 'Hair',
       'Intellectual_Connection', 'Humor', 'Chemistry', 'Attitude', 'Wine',
       'Politics', 'Income', 'Divorced', 'Kids', 'Second_Date',
       'Like_This_Person?'],
      dtype='object')

In [37]:
# Get all rows in the DF
df.index

RangeIndex(start=0, stop=50, step=1)
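
Here is a quick sketch of that "functions like a list" behavior, using a tiny made-up frame called `small`.

```python
import pandas as pd

small = pd.DataFrame({'Age': [27, 25], 'Wine': ['Red', 'White']})

for name in small.columns:          # loop over column names like a list
    print(name, small[name].dtype)

print('Age' in small.columns)       # True
print(small.columns[0])             # Age
print(len(small.index))             # 2
print(list(small.index))            # [0, 1]
```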

A single row or column of data is called a __`Series`__.  This is an actual Pandas object, created with __`pd.Series`__, and can do many of the same things as a DataFrame.

When we are working with other DataFrames we will often want to simply keep a `Series` as itself. However, when we are working with various machine learning algorithms or other general data manipulation, we will frequently want _only the values_ in the column and not the column object itself.  To do this, we have two options: the `.values` attribute and the `.tolist()` method.  Note that `.values` also works with more than one column at a time, or even the entire DataFrame (where `df.values.tolist()` gives nested lists), but we will learn with the simplest case of a single column, or a `Series`.

I will show an example of a `Series` and the two methods associated with its values below.

In [41]:
# A Series is just a column from the bigger DataFrame (or row). We will see its name and its dtype below.
# Let's create a 'pol' Series from the Politics column
pol = df['Politics']
pol

0            Left
1            Left
2            Left
3            Left
4            Left
5            Left
6            Left
7           Right
8     Independent
9            Left
10          Right
11           Left
12          Right
13           Left
14    Independent
15           Left
16          Right
17           Left
18           Left
19          Right
20    Independent
21           Left
22          Right
23          Right
24           Left
25           Left
26          Right
27           Left
28          Right
29          Right
30           Left
31           Left
32           Left
33           Left
34          Right
35          Right
36           Left
37           Left
38           Left
39           Left
40           Left
41          Right
42          Right
43           Left
44    Independent
45           Left
46           Left
47           Left
48           Left
49           Left
Name: Politics, dtype: object

In [43]:
# the .values attribute returns a Numpy array of the members in the series. 
# We can do the same things to it as any array.
pol.values

array(['Left', 'Left', 'Left', 'Left', 'Left', 'Left', 'Left', 'Right',
       'Independent', 'Left', 'Right', 'Left', 'Right', 'Left',
       'Independent', 'Left', 'Right', 'Left', 'Left', 'Right',
       'Independent', 'Left', 'Right', 'Right', 'Left', 'Left', 'Right',
       'Left', 'Right', 'Right', 'Left', 'Left', 'Left', 'Left', 'Right',
       'Right', 'Left', 'Left', 'Left', 'Left', 'Left', 'Right', 'Right',
       'Left', 'Independent', 'Left', 'Left', 'Left', 'Left', 'Left'], dtype=object)

In [45]:
# We can also turn a series directly into a list.  We can do all the normal "list" things with this list, like indexing.
pol_list = pol.tolist()
pol_list[5:10]

['Left', 'Left', 'Right', 'Independent', 'Left']
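
The same idea extends to more than one column. A minimal sketch with a made-up two-column frame:

```python
import pandas as pd

small = pd.DataFrame({'Age': [27, 25, 21], 'Height': [67, 61, 68]})

matrix = small[['Age', 'Height']].values    # 2-D Numpy array, one row per DataFrame row
print(matrix.shape)                          # (3, 2)

rows = matrix.tolist()                       # nested plain-Python lists
print(rows[0])                               # [27, 67]
```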

### Accessing the Data
Did you notice how we plucked out a single column, 'Politics', from the entire DataFrame?  There are two ways, but the way demonstrated above is the safer of the two.

1. Indexing in with the column name (above).
    + `df['Politics']`
2. Accessing the column as an attribute.
    + `df.Politics`

The second option looks nicer, admittedly, but it has a potential fatal flaw: it cannot access column names with spaces in them!  Let's pretend the column was named "Political Party" instead of "Politics."  If we tried to access this column as an attribute, it would fail with a __`SyntaxError: invalid syntax`__ exception.  

The first method of indexing in with the column name has no such issue.  Let's create a duplicate column of the 'Politics' column with the name "Political Party" just to demonstrate.

In [50]:
# Creating new columns from existing data in Pandas is this easy.
df['Political Party'] = df['Politics']

# Access the new column with the Indexing method -- this works.
df['Political Party']

0           Left
1           Left
2           Left
3           Left
4           Left
5           Left
6           Left
7          Right
8    Independent
9           Left
Name: Political Party, dtype: object

In [52]:
# Access new column with the attribute method -- this fails.
df.Political Party

SyntaxError: invalid syntax (<ipython-input-52-48aceb8704d4>, line 1)
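
Spaces are not the only trap. A related case is a column whose name matches something a DataFrame already has, such as the `count()` method. The frame below is hypothetical and only meant to illustrate the collision.

```python
import pandas as pd

# Hypothetical frame with a column that shares its name with a DataFrame method.
votes = pd.DataFrame({'party': ['Left', 'Right'], 'count': [32, 14]})

print(votes['count'].tolist())    # [32, 14] -> the column
print(callable(votes.count))      # True     -> the built-in count() method, not the column
print(votes.party.tolist())       # ['Left', 'Right'] -> attribute access works when nothing collides
```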

If we wanted to access multiple columns at once (this will now return a DataFrame object instead of a Series object, since it will be more than a single column), all we have to do is pass in a list of columns.  It's that easy.

In [53]:
# Access multiple columns at once by passing in a list of desired cols.
columns_we_want = ['Age', 'Height(in.)', 'Politics']
df[columns_we_want]

Unnamed: 0,Age,Height(in.),Politics
0,27,67,Left
1,27,65,Left
2,25,61,Left
3,21,68,Left
4,27,65,Left
5,31,67,Left
6,37,65,Left
7,23,68,Right
8,34,61,Independent
9,26,63,Left


For this reason, it is good habit to use the Indexing method to access data in your Pandas DataFrames.  You will see both in the wild, and it isn't necessarily "bad" to use the other method, but if you have to learn one you might as well learn the one that has no problems associated with it.  Which leads us to our next point...

It is always best to have no spaces in column headers.  One of the first things I always do is rename headers to remove spaces.  In MS Excel, this is a huge time sink.  In Python and Pandas, it's a breeze.


In [54]:
# create a list of all column names
cols = df.columns.tolist()

# Use list comprehension to replace all spaces with underscores
cols = [col.replace(' ', '_') for col in cols]

# assign the fixed names to be the actual column names in the DF
df.columns = cols
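
The list comprehension above is one way to do it. Two other options, sketched here on a made-up empty frame called `messy`, are the vectorized `.str` methods on the column index and `rename()` with a function.

```python
import pandas as pd

messy = pd.DataFrame(columns=['Political Party', 'Second Date', 'Age'])

# Option 1: vectorized string method on the column Index.
cleaned = messy.copy()
cleaned.columns = cleaned.columns.str.replace(' ', '_')

# Option 2: rename() with a function applied to every column name.
renamed = messy.rename(columns=lambda name: name.replace(' ', '_'))

print(cleaned.columns.tolist())   # ['Political_Party', 'Second_Date', 'Age']
print(renamed.columns.tolist())   # ['Political_Party', 'Second_Date', 'Age']
```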

We just learned how to access an entire column -- a very common task -- in Pandas.  Great!  But what about accessing specific rows?  Or subsets of certain rows or columns?  What about a particular, single cell in the DataFrame?  Well, good news.  There are three ways to do that and they're pretty logical.  Examples follow below.

#### 1. `.loc`
This is the most commonly used method. It indexes into a DF using row and column __names__.  If you pass in a name that doesn't exist in the DF, you'll get an error.
Example: `df.loc['row_5', 'col_3']`

#### 2. `.iloc`
This is __integer__ indexing.  We pass in the integer position of the row or column we want, very similar to how we index into a list.  It doesn't matter what the names are.  A DF's rows and columns are indexed from 0 just like lists in Python.  If you pass in a position that doesn't exist, you'll get an error.
Example: `df.iloc[5, 3]`   -- this will give you the cell at row position 5 and column position 3 (the 6th row and 4th column, since counting starts at 0).

#### 3. `.ix`
This is a vestige of early versions of Pandas (it was deprecated in version 0.20 and removed entirely in 1.0, so you will only meet it in older code).  It can handle both integers and names.  Sounds great, right?  Well, there's a catch.  If your index (or columns, but usually it's the index) is named as a list of integers, such as 1, 2, 3, ... etc. for each row, then using `.ix` can give you unexpected (and undesired) results to your queries.  Say you use `df.ix[5, 3]` and your index (rows) are named as ascending integers, 1, 2, ....  Does your command mean you want the row at position 5, or the row named '5'?  If they happen to be the same row, great.  But if they aren't, which one did you mean?  So, it is strongly recommended to use either `.loc` or `.iloc`.  It follows one of the core principles of Pythonic programming: __Explicit Over Implicit__.  
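
To make the name-versus-position distinction concrete, here is a minimal sketch on a hypothetical frame whose rows are named 1, 2, 3 (so the names do not match the 0-based positions).

```python
import pandas as pd

# Hypothetical frame whose row labels are integers that do not match positions.
wines = pd.DataFrame({'Wine': ['Red', 'White', 'Rose']}, index=[1, 2, 3])

print(wines.loc[1, 'Wine'])     # 'Red'   -> the row *named* 1
print(wines.iloc[1, 0])         # 'White' -> the row at *position* 1 (the second row)

try:
    wines.loc[0, 'Wine']        # there is no row named 0
    found = True
except KeyError:
    found = False
print(found)                    # False
```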

Let's look at some quick examples below using our dating database.

In [55]:
# Wine preference for the person in the row named 3 and the column named 'Wine'.
df.loc[3, 'Wine']

'White'

In [61]:
# 'Wine' sits at integer position 9 (the 10th column, counting from 0).
# Now let's ask for the cell at row position 3 and column position 9.
df.iloc[3, 9]

'White'

Both queries return "White", but they get there by different routes.  In the first one we used `.loc`, which says "give me the row _named_ 3."  In the second we used `.iloc`, which says "give me the row at _position_ 3."  Remember, in Pandas just like in Python, lists (arrays) are 0-indexed, which means the first entry has an index of 0, so position 3 is the fourth row overall.  The two agree here only because of how this DataFrame's index happens to be named.  Let's use the `.index` attribute we learned earlier to look at the first five entries of the index and see the explanation.

In [65]:
df.index[:5].values

array([0, 1, 2, 3, 4])

These are the _names_ of each row!  Yes, they're integers.  And yes, they're the same as the positions, because `read_csv` gave us a default index that counts up from 0.  But these are the actual names of each row.  So `.loc[3]` gives us the row named 3, `.iloc[3]` gives us the row at position 3, and with this default index both are the fourth row overall.  As soon as the row names stop matching the positions, the two methods part ways.  To see that, let's give the first five rows some real names.

In [74]:
# create a temporary, dummy DF with the new names just for this demo
df_names = df.rename(index={0: 'Marie_Curie', 1: 'Rosalind_Franklin', 2: 'Sally_Ride', 3: "Joan_d'Arc", 4: 'My_Ex-Wife'})
df_names.iloc[:5, :]

Unnamed: 0,ID,Age,Height(in.),Attraction,Hair,Intellectual_Connection,Humor,Chemistry,Attitude,Wine,Politics,Income,Divorced,Kids,Second_Date,Like_This_Person?,Political_Party
Marie_Curie,1,27,67,4.0,Blonde,2.5,1.0,4.0,Neutral,Red,Left,Low,No,No,Yes,No,Left
Rosalind_Franklin,2,27,65,5.5,Blonde,6.5,4.5,3.0,Complainer,Red,Left,Low,No,No,Yes,No,Left
Sally_Ride,3,25,61,1.0,Brunette,2.0,2.5,1.0,Negative,White,Left,Low,No,No,No,No,Left
Joan_d'Arc,4,21,68,8.0,Brunette,7.5,7.0,8.0,Negative,White,Left,Low,No,No,Yes,Yes,Left
My_Ex-Wife,5,27,65,7.0,Blonde,7.5,6.5,8.0,Positive,Red,Left,Medium,No,No,No,Yes,Left


Now let's try our indexing again.

In [78]:
# Get Joan's wine pref.
df_names.loc["Joan_d'Arc", 'Wine']

'White'

In [80]:
df_names.iloc[0, 9]

'Red'

In [81]:
# this worked above but will fail now that we've renamed the rows -- there is no row named '3' anymore.
df_names.loc[3, 'Wine']

KeyError: 'the label [3] is not in the [index]'

Last, if we want, we can do broad indexing into a DF just like a list in order to select rows:

In [96]:
# This is almost the same as .loc, but doesn't include the endpoint
df[4:6]

Unnamed: 0,ID,Age,Height(in.),Attraction,Hair,Intellectual_Connection,Humor,Chemistry,Attitude,Wine,Politics,Income,Divorced,Kids,Second_Date,Like_This_Person?,Political_Party
4,5,27,65,7.0,Blonde,7.5,6.5,8.0,Positive,Red,Left,Medium,No,No,No,Yes,Left
5,6,31,67,2.5,Brunette,3.5,3.5,2.0,Complainer,Red,Left,Low,No,No,Yes,No,Left


In [95]:
# .loc includes the endpoint (the row named 6) when used this way
df.loc[4:6]

Unnamed: 0,ID,Age,Height(in.),Attraction,Hair,Intellectual_Connection,Humor,Chemistry,Attitude,Wine,Politics,Income,Divorced,Kids,Second_Date,Like_This_Person?,Political_Party
4,5,27,65,7.0,Blonde,7.5,6.5,8.0,Positive,Red,Left,Medium,No,No,No,Yes,Left
5,6,31,67,2.5,Brunette,3.5,3.5,2.0,Complainer,Red,Left,Low,No,No,Yes,No,Left
6,7,37,65,7.0,Brunette,2.5,4.5,5.0,Neutral,Red,Left,Low,Yes,Yes,No,No,Left


And again, we can select multiple rows or columns together by passing in a list as we did before.

In [91]:
# Multiple rows and columns by name 
df.loc[4:6, ['Age', 'Hair', 'Kids']]

Unnamed: 0,Age,Hair,Kids
4,27,Blonde,No
5,31,Brunette,No
6,37,Brunette,Yes


In [97]:
# By integer - doesn't include endpoint, just like normal python integer indexing
df.iloc[4:6]

Unnamed: 0,ID,Age,Height(in.),Attraction,Hair,Intellectual_Connection,Humor,Chemistry,Attitude,Wine,Politics,Income,Divorced,Kids,Second_Date,Like_This_Person?,Political_Party
4,5,27,65,7.0,Blonde,7.5,6.5,8.0,Positive,Red,Left,Medium,No,No,No,Yes,Left
5,6,31,67,2.5,Brunette,3.5,3.5,2.0,Complainer,Red,Left,Low,No,No,Yes,No,Left


In [102]:
# Multiple columns and rows by int -- same exact columns as above, just by their index number instead of name
df.iloc[4:6, [1,4,13]]

Unnamed: 0,Age,Hair,Kids
4,27,Blonde,No
5,31,Brunette,No

