The DataFrame is conceptually a two-dimensional series object, where there's an index and multiple columns of content, with each column having a label. In fact, the distinction between a column and a row is only a conceptual distinction, and you can think of the DataFrame itself as simply a two-axis labeled array.
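
To make the "two-axis labeled array" idea concrete, here is a minimal sketch. The small frame and its `Grade` column are hypothetical, chosen only to show that both axes carry labels and that transposing swaps them.

```python
import pandas as pd

# a small hypothetical frame: two row labels and two column labels
df = pd.DataFrame({'Score': [85, 82], 'Grade': ['B', 'B']},
                  index=['school1', 'school2'])

print(df.index)    # labels along the row axis (axis 0)
print(df.columns)  # labels along the column axis (axis 1)
print(df.shape)    # (number of rows, number of columns)

# transposing swaps the two axes, so the old column labels become row labels
transposed = df.T
print(transposed.index)
```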

In [2]:
import pandas as pd

In [3]:
record1 = pd.Series({'Name': 'Alice',
                     'Class': 'Physics',
                     'Score': 85})
record2 = pd.Series({'Name': 'Jack',
                     'Class': 'Chemistry',
                     'Score': 82})
record3 = pd.Series({'Name': 'Helen',
                     'Class': 'Biology',
                     'Score': 90})

Like the Series, the DataFrame object is indexed. We'll use a group of series, where each series represents a row of data. We can pass in our individual items in a list as the first argument and pass in our index values as the second argument, index.

In [4]:
df = pd.DataFrame([record1, record2, record3],
                  index=['school1', 'school2', 'school1'])
df.head()

          Name      Class  Score
school1  Alice    Physics     85
school2   Jack  Chemistry     82
school1  Helen    Biology     90


In [7]:
# an alternative method is to use a list of dictionaries, where each dictionary represents a row of data
students = [{'Name': 'Alice',
             'Class': 'Physics',
             'Score': 85},
            {'Name': 'Jack',
             'Class': 'Chemistry',
             'Score': 82},
            {'Name': 'Helen',
             'Class': 'Biology',
             'Score': 90}]
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])
df.head()

          Name      Class  Score
school1  Alice    Physics     85
school2   Jack  Chemistry     82
school1  Helen    Biology     90


Similar to the Series, we can extract data using the .iloc and .loc attributes. Because the DataFrame is two-dimensional, passing a single value to the .loc indexing operator returns a Series if only one row matches that label.

In [8]:
# if we want to select the data of school2, we just query the .loc attribute with one parameter.
df.loc['school2']

Name          Jack
Class    Chemistry
Score           82
Name: school2, dtype: object

In [9]:
# the returned series takes the row's index value (school2) as its name
# and we can check the data type of the return
type(df.loc['school2'])

pandas.core.series.Series

In [10]:
df.loc['school1']
# school1 labels 2 rows, so the return is not a new series but a new dataframe

          Name    Class  Score
school1  Alice  Physics     85
school1  Helen  Biology     90


In [11]:
type(df.loc['school1'])

pandas.core.frame.DataFrame
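
The .iloc attribute mentioned above works the same way, but by integer position instead of label. A minimal sketch, rebuilding the same students DataFrame so it runs on its own (the variable names `second_row`, `rows` and `one_row_frame` are just for illustration):

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

# .iloc selects by integer position, so a single integer always means one row
second_row = df.iloc[1]
print(type(second_row))

# a list of positions returns a DataFrame
rows = df.iloc[[0, 2]]
print(type(rows))

# wrapping a label in a list makes .loc return a DataFrame even for one row
one_row_frame = df.loc[['school2']]
print(type(one_row_frame))
```

Passing a list is a handy way to get a predictable return type when we don't know in advance whether a label is duplicated.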

In [12]:
# if we are interested in school1's student names
# we would supply 2 parameters to .loc, one being the row index and the other being the column name
df.loc['school1', 'Name']

school1    Alice
school1    Helen
Name: Name, dtype: object

What if we just want to select a single column?

In [14]:
# one way is to transpose the DataFrame, which turns columns into rows, so selecting a row of the transpose selects a column of the original
df.T

       school1    school2  school1
Name     Alice       Jack    Helen
Class  Physics  Chemistry  Biology
Score       85         82       90


In [15]:
# then we call .loc on the transpose to get the student names only
df.T.loc['Name']

school1    Alice
school2     Jack
school1    Helen
Name: Name, dtype: object

Because .iloc and .loc are used for row selection, pandas reserves the indexing operator directly on the DataFrame for column selection. In a DataFrame, columns always have a name, so this selection is always label based. For those familiar with relational databases, this operator is analogous to column projection.

In [16]:
df['Name']

school1    Alice
school2     Jack
school1    Helen
Name: Name, dtype: object

In [17]:
# this means that we get a key error if we use .loc with a column name
df.loc['Name']

KeyError: 'Name'

In [18]:
# notice that the result of a single column projection is a series object
type(df['Name'])

pandas.core.series.Series
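
The indexing operator also accepts a list of column names. A short sketch of the difference, using the same students data (the variable names are only for illustration): a single label gives a series, while a list gives a DataFrame, even if the list holds one name.

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

# a single column label projects one column as a Series
names = df['Name']
print(type(names))

# a list of labels projects several columns as a DataFrame
names_and_scores = df[['Name', 'Score']]
print(type(names_and_scores))

# a one-element list still returns a DataFrame, with a single column
names_frame = df[['Name']]
print(names_frame.shape)
```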

In [19]:
# because the result of using the indexing operator is either a DataFrame or series, we can chain operations together.
# we select all of the rows related to school1 using .loc, then project the name column for just those rows.
df.loc['school1']['Name']

school1    Alice
school1    Helen
Name: Name, dtype: object

In [21]:
# checking the data type
print(type(df.loc['school1']))  # a dataframe
print(type(df.loc['school1']['Name']))  # a series

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>


It is better to avoid chaining when another approach is available. Chaining tends to cause pandas to return a copy of the DataFrame instead of a view on the DataFrame. For selecting data, chaining might be slower, and when changing data it can be a source of error.
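
As a sketch of the alternative, the chained selection can be written as a single .loc call with a row label and a column label. The new score of 95 below is a made-up value used only to show an assignment through .loc.

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

# chained: first select the rows, then project the column from the result
chained = df.loc['school1']['Name']

# single call: row label and column label passed to .loc together
direct = df.loc['school1', 'Name']
print(chained.equals(direct))

# assigning through one .loc call updates the original DataFrame directly
df.loc['school2', 'Score'] = 95  # hypothetical new score
print(df.loc['school2', 'Score'])
```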

In [22]:
# the .loc attribute also supports slicing
# if we want to select all rows, we can use : to indicate a full slice from beginning to end.
# then we can add the column name as the second parameter as a string.
# if we want to include multiple columns, we can pass them in a list.

# here we ask for all of the names and scores for all schools using the .loc operator.
df.loc[:,['Name', 'Score']]

          Name  Score
school1  Alice     85
school2   Jack     82
school1  Helen     90


This means we get all rows, but only the Name and Score columns.
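
Columns can be sliced as well as listed. The sketch below assumes the column order Name, Class, Score from the students dictionaries. It shows that a label slice with .loc includes both endpoints, while a positional slice with .iloc excludes the stop position.

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

# all rows, every column from 'Name' through 'Class' (both ends included)
name_to_class = df.loc[:, 'Name':'Class']
print(list(name_to_class.columns))

# the positional equivalent: columns 0 and 1 (the stop position 2 is excluded)
first_two = df.iloc[:, 0:2]
print(first_two.equals(name_to_class))
```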

Dropping data

In its simplest form, the drop function takes a single parameter, which is the index or row label to drop. Notice that the drop function doesn't actually change the DataFrame; it returns a copy of the DataFrame with the given rows removed.

In [24]:
df.drop('school1')

         Name      Class  Score
school2  Jack  Chemistry     82


In [25]:
# the original dataframe, nothing changed
df

          Name      Class  Score
school1  Alice    Physics     85
school2   Jack  Chemistry     82
school1  Helen    Biology     90


Drop has two optional parameters worth noting. The first is called inplace, and if it's set to True, the DataFrame will be updated in place instead of a copy being returned. The second parameter is axis, which indicates the axis to drop from. By default this value is 0, indicating the row axis, but we can change it to 1 if we want to drop a column.

In [28]:
# a copy of df
copy_df = df.copy()

# dropping the name column in the copy; axis=1 tells drop that 'Name' is a column label rather than a row label
copy_df.drop("Name", inplace=True, axis=1)
copy_df

             Class  Score
school1    Physics     85
school2  Chemistry     82
school1    Biology     90


In [27]:
# or we can use the del keyword to drop a column; del modifies the DataFrame it is applied to (here copy_df) in place and does not return a new object
del copy_df['Class']
copy_df

         Score
school1     85
school2     82
school1     90
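
As a small sketch of the drop options discussed above: instead of passing axis, drop also accepts the index and columns keywords, which some readers find easier to follow. The example also shows that a call with inplace=True returns None, so its result should not be assigned back. The variable names are only for illustration.

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

# columns= is an alternative to passing the label together with axis=1
without_name = df.drop(columns=['Name'])

# index= is an alternative to passing the label together with axis=0
without_school2 = df.drop(index=['school2'])

# with inplace=True the copy is modified and the call itself returns None
copy_df = df.copy()
result = copy_df.drop('Name', axis=1, inplace=True)
print(result)

# the original DataFrame still has all of its rows and columns
print(df.shape)
```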


Adding a new column

In [31]:
# we use the indexing operator to add new columns to the dataframe.
# here we add a class ranking column with one value per row; assigning a scalar such as None instead fills every row with that default
# we use the assignment operator after the square brackets, and it affects the original dataframe.

df['ClassRanking'] = ['1', '3', '2']  # or df['ClassRanking'] = None
df

          Name      Class  Score  ClassRanking
school1  Alice    Physics     85             1
school2   Jack  Chemistry     82             3
school1  Helen    Biology     90             2


In [30]:
# deleting classRanking
del df['ClassRanking']
df

          Name      Class  Score
school1  Alice    Physics     85
school2   Jack  Chemistry     82
school1  Helen    Biology     90
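
To round this off, here is a minimal sketch of the two ways of filling a new column mentioned above, plus the assign method, which returns a new DataFrame rather than changing the original. The integer rankings are illustrative values.

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

# a scalar is broadcast to every row, giving the column a default value
df['ClassRanking'] = None
print(df['ClassRanking'].isna().all())

# assign returns a new DataFrame with the column set; df itself is unchanged
ranked = df.assign(ClassRanking=[1, 3, 2])
print(list(ranked['ClassRanking']))
print(df['ClassRanking'].isna().all())
```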
