The DataFrame is conceptually a two-dimensional Series object, with an index and multiple columns of content, each column having a label. In fact, the distinction between a column and a row is really only a conceptual one, and you can think of the DataFrame itself as simply a two-axis labeled array.

In [1]:
import pandas as pd

In [2]:
# Create 3 school records for students and their class grades. 
record1 = pd.Series({'Name': 'Alice',
                    'Class': 'Physics',
                    'Score': 85})
record2 = pd.Series({'Name': 'Jack',
                    'Class': 'Chemistry',
                    'Score': 82})
record3 = pd.Series({'Name': 'Helen',
                    'Class': 'Biology',
                    'Score': 90})


In [3]:
# Like a Series, the DataFrame object is indexed. Here I'll use a group of Series, where each Series
# represents a row of data. Just like the Series function, we can pass in our individual items
# in a list, and we can pass in our index values as the second argument.
df = pd.DataFrame([record1, record2, record3], index=['school1', 'school2', 'school1'])

df.head()

Unnamed: 0,Name,Class,Score
school1,Alice,Physics,85
school2,Jack,Chemistry,82
school1,Helen,Biology,90
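
To make the two labeled axes concrete, the short sketch below rebuilds a frame equivalent to the one above and inspects its row labels, column labels, and shape. The variable names `rows`, `row_labels`, and `col_labels` are illustrative only and do not appear in the original cells.

```python
import pandas as pd

# Each Series is one row; its keys become the column labels.
rows = [pd.Series({'Name': 'Alice', 'Class': 'Physics', 'Score': 85}),
        pd.Series({'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82}),
        pd.Series({'Name': 'Helen', 'Class': 'Biology', 'Score': 90})]
df = pd.DataFrame(rows, index=['school1', 'school2', 'school1'])

row_labels = list(df.index)    # labels along axis 0 (rows)
col_labels = list(df.columns)  # labels along axis 1 (columns)
print(row_labels)
print(col_labels)
print(df.shape)                # (number of rows, number of columns)
```

Note that the row labels do not have to be unique: school1 appears twice.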


In [4]:
# An alternative method is to use a list of dictionaries, where each dictionary
# represents a row of data.

students = [{'Name': 'Alice',
            'Class': 'Physics',
            'Score': 85},
           {'Name': 'Jack',
            'Class': 'Chemistry',
            'Score': 82},
           {'Name': 'Helen',
            'Class': 'Biology',
            'Score': 90}]

df = pd.DataFrame(students, index=['school1','school2','school1'])

df

Unnamed: 0,Name,Class,Score
school1,Alice,Physics,85
school2,Jack,Chemistry,82
school1,Helen,Biology,90
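
As a quick sanity check, the sketch below builds the frame both ways and compares the results cell by cell. The names `from_series` and `from_dicts` are made up for this illustration.

```python
import pandas as pd

records = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
           {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
           {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
labels = ['school1', 'school2', 'school1']

# Route 1: a list of Series, one per row.
from_series = pd.DataFrame([pd.Series(r) for r in records], index=labels)
# Route 2: a list of dictionaries, one per row.
from_dicts = pd.DataFrame(records, index=labels)

# Compare the labels and the cell values of the two frames.
same_columns = list(from_series.columns) == list(from_dicts.columns)
same_index = list(from_series.index) == list(from_dicts.index)
same_values = from_series.values.tolist() == from_dicts.values.tolist()
print(same_columns, same_index, same_values)
```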


In [8]:
# Similar to Series, we can extract data using the .iloc and .loc attributes. Because the
# DataFrame is two-dimensional, passing a single value to the .loc indexing operator returns
# a Series if there is only one row to return.

# For instance, if we wanted to select data associated with school2, we would just query the
# .loc attribute with one parameter.
df.loc['school2']

# The label school1 matches two rows, so the same kind of query returns a DataFrame instead.
type(df.loc['school1'])

pandas.core.frame.DataFrame
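
The return type of a single-label .loc lookup depends on how many rows carry that label. A minimal sketch, rebuilding the same sample data (the names `unique_row` and `repeated_rows` are illustrative):

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

unique_row = df.loc['school2']     # label matches exactly one row
repeated_rows = df.loc['school1']  # label matches two rows

print(type(unique_row).__name__)
print(type(repeated_rows).__name__)
```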

In [10]:
# One of the powers of the pandas DataFrame is that you can quickly select data based on multiple axes.
# For instance, if you wanted to just list the student names for school1, you would supply two
# parameters to .loc, one being the row index and the other being the column name.

# For instance
df.loc['school1','Name']

school1    Alice
school1    Helen
Name: Name, dtype: object
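
With two parameters, the shape of the result again depends on whether the row label is unique. The sketch below, on the same sample data, contrasts the two cases. The variable names are illustrative.

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

jack = df.loc['school2', 'Name']           # unique row label: a single value
school1_names = df.loc['school1', 'Name']  # repeated row label: a Series

print(jack)
print(list(school1_names))
```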

In [13]:
# The pandas developers have implemented this selection using the indexing operator and
# not as parameters to a function.

# What if we just wanted to select a single column? One option is to transpose the matrix. This
# pivots all the rows into columns and all the columns into rows, and is done with the T attribute.
df.T

Unnamed: 0,school1,school2,school1
Name,Alice,Jack,Helen
Class,Physics,Chemistry,Biology
Score,85,82,90


In [14]:
# Then we call .loc on the transpose to get the student names only
df.T.loc['Name']

school1    Alice
school2     Jack
school1    Helen
Name: Name, dtype: object

In [15]:
# However, since iloc and loc are used for row selection, pandas reserves the indexing operator
# directly on the DataFrame for column selection. In a pandas DataFrame, columns always have a name,
# so this selection is always label based and is not as confusing as it was when using the square
# bracket operator on Series objects. For those familiar with relational databases, this operator is
# analogous to column projection.
df['Name']

school1    Alice
school2     Jack
school1    Helen
Name: Name, dtype: object
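
The two routes to a column, via the transpose and via the indexing operator, should give the same values. The sketch below checks that on the sample data, and also shows that passing a list of column names projects several columns at once. The variable names are illustrative.

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

via_transpose = df.T.loc['Name']  # row 'Name' of the transposed frame
direct = df['Name']               # column 'Name' of the original frame
print(list(via_transpose) == list(direct))

# A list of column names returns a DataFrame with just those columns.
name_and_score = df[['Name', 'Score']]
print(list(name_and_score.columns))
```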

In [18]:
# This means that you get a KeyError if you try to use .loc with a column name
df.loc['Name']

KeyError: 'Name'
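
If you do want to reach a column through .loc, one option is to pass a full row slice first so that the column name lands in the second position. A small sketch on the same sample data follows; the `raised` flag is only there to make the failure visible.

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

raised = False
try:
    df.loc['Name']        # 'Name' is a column label, not a row label
except KeyError:
    raised = True
print(raised)

# All rows (the colon), then the column label as the second parameter.
names = df.loc[:, 'Name']
print(list(names))
```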

In [19]:
# Note that the result of a single column projection is a Series object
type(df['Name'])

pandas.core.series.Series

In [20]:
# Since the result of using the indexing operator is either a DataFrame or a Series, you can chain
# operations together. For instance, we can select all of the rows which relate to school1 using
# .loc, then project the Name column from just those rows.
df.loc['school1']['Name']

school1    Alice
school1    Helen
Name: Name, dtype: object
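
For reading data, chaining two indexing steps and passing both labels to a single .loc call select the same values. A minimal comparison on the sample data (the variable names are illustrative):

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

chained = df.loc['school1']['Name']  # two indexing steps, with an intermediate DataFrame
single = df.loc['school1', 'Name']   # one indexing step

print(list(chained))
print(list(single))
```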

In [21]:
# Chaining, by indexing on the return type of another index, can come with some costs and is
# best avoided if you can use another approach. In particular, chaining tends to cause pandas
# to return a copy of the DataFrame instead of a view on the DataFrame.
# For selecting data, this is not a big deal, though it might be slower than necessary.
# If you are changing data, though, this is an important distinction and can be a source of error.
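
When the goal is to change a value, a single .loc call with both the row and column label avoids the intermediate object altogether. The sketch below is one way to do that on the sample data. The new score of 99 is an arbitrary value chosen for illustration. The chained form is left as a comment, because it may operate on a temporary copy and leave df unchanged.

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

# Chained form, best avoided: the assignment may hit a temporary copy.
# df.loc['school2']['Score'] = 99

# Single-step form: row label and column label in one .loc call.
df.loc['school2', 'Score'] = 99
print(df.loc['school2', 'Score'])
```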


In [22]:
# As we saw, .loc does row selection, and it can take two parameters, the row index and the list
# of column names. The .loc attribute also supports slicing.

# If we wanted to select all rows, we can use a colon to indicate a full slice from beginning to end.
# This is just like slicing a list in Python. Then we can add the column name as the second
# parameter as a string. If we wanted to include multiple columns, we could pass them in a list,
# and pandas will bring back only the columns we have asked for.

# Here is an example, where we ask for all the names and scores for all schools using the .loc operator.
df.loc[:, ['Name', 'Score']]

Unnamed: 0,Name,Score
school1,Alice,85
school2,Jack,82
school1,Helen,90
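
Slices can also be used on the column axis, and .iloc offers the positional counterpart. The sketch below shows both on the sample data. One detail worth noticing is that label slices with .loc include the end label, while positional slices with .iloc exclude the end position. The variable names are illustrative.

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

# All rows, columns from 'Name' through 'Class' (both endpoints included).
col_range = df.loc[:, 'Name':'Class']
print(list(col_range.columns))

# First two rows by position (position 2 is excluded), all columns.
first_two = df.iloc[0:2]
print(list(first_two['Name']))
```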


In [23]:
# That's selecting and projecting data from a DataFrame based on row and column labels. The key
# concepts to remember are that the rows and columns are really just for our benefit. Underneath,
# this is just a two-axis labeled array, and transposing it is easy. Also, consider the issue
# of chaining carefully and try to avoid it, as it can cause unpredictable results where your intent was
# to obtain a view of the data, but instead pandas returns a copy to you.


In [24]:
# Dropping data
# The drop function takes a single parameter, which is the index or row label to drop. This is another
# tricky place for new users: the drop function doesn't change the DataFrame by default. Instead,
# the drop function returns to you a copy of the DataFrame with the given rows removed.

df.drop('school1')

Unnamed: 0,Name,Class,Score
school2,Jack,Chemistry,82


In [25]:
# But, if we look at our original DataFrame we see the data is still intact.
df

Unnamed: 0,Name,Class,Score
school1,Alice,Physics,85
school2,Jack,Chemistry,82
school1,Helen,Biology,90
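
Because drop returns a new object, the usual pattern is to bind the result to a name. A minimal sketch on the sample data follows; `remaining` is an illustrative name.

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

# Keep the result of drop; both rows labeled 'school1' are removed from it.
remaining = df.drop('school1')

print(len(df))                 # the original still has all of its rows
print(list(remaining.index))   # the returned frame has only the other label
```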


In [31]:
# Drop has two interesting parameters. The first is called inplace, and if it's
# set to True, the DataFrame will be updated in place, instead of a copy being returned.
# The second parameter is axis, which indicates the axis to drop from. By default, this value is 0,
# indicating the row axis. But you could change it to 1 if you want to drop a column.

# For example:
copy_df = df.copy()
# drop name column in this copy
copy_df.drop('Name', inplace=True, axis=1)
copy_df

Unnamed: 0,Class,Score
school1,Physics,85
school2,Chemistry,82
school1,Biology,90
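
As an alternative to axis=1, drop also accepts a columns keyword, which some readers find easier to follow. The sketch below shows both spellings without inplace, so the original frame is left untouched. The variable names are illustrative.

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

by_axis = df.drop('Name', axis=1)        # axis=1 means "look along the columns"
by_keyword = df.drop(columns=['Name'])   # same effect, with the axis spelled out

print(list(by_axis.columns))
print(list(by_keyword.columns))
print(list(df.columns))                  # the original keeps its Name column
```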


In [32]:
# There is a second way to drop a column, and that's directly through the use of the indexing
# operator, using the del keyword. This way of dropping data, however, takes effect immediately
# on the DataFrame and does not return a copy.
del copy_df['Class']
copy_df

Unnamed: 0,Score
school1,85
school2,82
school1,90
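
A related in-place option, not used in the cells above, is the pop method, which removes a column and hands it back as a Series. The sketch below works on a copy so that the original frame is unaffected. The name `scores` is illustrative.

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

copy_df = df.copy()
del copy_df['Class']            # removes the column in place, returns nothing
scores = copy_df.pop('Score')   # removes the column in place, returns it

print(list(copy_df.columns))
print(list(scores))
print(list(df.columns))         # the original frame is unchanged
```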


In [34]:
# Finally, adding a new column to the DataFrame is as easy as assigning it to some value using 
# the indexing operator. For instance, if we wanted to add a class ranking column with default
# value of None, we could do so by using the assignment operator after the square brackets.

df['Class Ranking'] = None
df

Unnamed: 0,Name,Class,Score,Class Ranking
school1,Alice,Physics,85,
school2,Jack,Chemistry,82,
school1,Helen,Biology,90,
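
The same assignment syntax accepts a full column of values instead of a single default. As an illustration only, the sketch below fills the column by ranking students on Score, highest first. Deriving the ranking from Score is an assumption made for this example, since the cell above only creates the empty column.

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

# Assumption for illustration: rank 1 goes to the highest score.
df['Class Ranking'] = df['Score'].rank(ascending=False).astype(int)

print(list(df.columns))
print(list(df['Class Ranking']))
```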
