# Algorithms for Data Mining - WS02: Introduction to Pandas

Instructor: Gerhard Neumann

Demonstrators: 
- Aiden Durrant <ADurrant@lincoln.ac.uk>;
- Deema Abdal Hafeth <dabdalhafeth@lincoln.ac.uk>

**pandas** is a Python package providing fast, flexible, and expressive data structures designed to work with both *relational* and *labeled* data. It is a fundamental high-level building block for doing practical, real-world data analysis in Python.

pandas is well suited for:

- Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
- Ordered and unordered (not necessarily fixed-frequency) time series data.
- Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
- Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure


Key features:
    
- Easy handling of **missing data**
- **Size mutability**: columns can be inserted and deleted from DataFrame and higher dimensional objects
- Automatic and explicit **data alignment**: objects can be explicitly aligned to a set of labels, or the data can be aligned automatically
- Powerful, flexible **group by functionality** to perform split-apply-combine operations on data sets
- Intelligent label-based **slicing, fancy indexing, and subsetting** of large data sets
- Intuitive **merging and joining** data sets
- Flexible **reshaping and pivoting** of data sets
- **Hierarchical labeling** of axes
- Robust **IO tools** for loading data from flat files, Excel files, databases, and HDF5
- **Time series functionality**: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc.

In [4]:
from IPython.display import HTML
HTML("<iframe src=http://pandas.pydata.org width=800 height=350></iframe>")



In [2]:
%matplotlib inline
import pandas as pd
import numpy as np

# Set some Pandas options (full option names avoid ambiguous matches in newer versions)
pd.set_option('display.notebook_repr_html', False)
pd.set_option('display.max_columns', 30)
pd.set_option('display.max_rows', 20)

## Pandas Data Structures

### Series

A **Series** is a single vector of data (like a NumPy array) with an *index* that labels each element in the vector.

In [None]:
counts = pd.Series([632, 1638, 569, 115])
counts

If an index is not specified, a default sequence of integers is assigned as the index. A NumPy array comprises the values of the `Series`, while the index is a pandas `Index` object.

In [None]:
counts.values

In [None]:
counts.index

We can assign meaningful labels to the index, if they are available:

In [None]:
bacteria = pd.Series([632, 1638, 569, 115], 
    index=['Firmicutes', 'Proteobacteria', 'Actinobacteria', 'Bacteroidetes'])

bacteria

These labels can be used to refer to the values in the `Series`.

In [None]:
bacteria['Actinobacteria']

In [None]:
bacteria[[name.endswith('bacteria') for name in bacteria.index]]

In [None]:
[name.endswith('bacteria') for name in bacteria.index]

Notice that the indexing operation preserved the association between the values and the corresponding indices.

We can still use positional indexing if we wish, via the `iloc` attribute:

In [None]:
bacteria.iloc[0]

We can give both the array of values and the index meaningful labels themselves:

In [None]:
bacteria.name = 'counts'
bacteria.index.name = 'phylum'
bacteria

NumPy's math functions and other operations can be applied to Series without losing the data structure.

In [None]:
np.log(bacteria)

We can also filter according to the values in the `Series`:

In [None]:
bacteria[bacteria>1000]

A `Series` can be thought of as an ordered key-value store. In fact, we can create one from a `dict`:

In [None]:
bacteria_dict = {'Firmicutes': 632, 'Proteobacteria': 1638, 'Actinobacteria': 569, 'Bacteroidetes': 115}
pd.Series(bacteria_dict)

### DataFrame

Inevitably, we want to be able to store, view and manipulate data that is *multivariate*, where for every index there are multiple fields or columns of data, often of varying data type.

A `DataFrame` is a tabular data structure, encapsulating multiple series like columns in a spreadsheet. Data are stored internally as a 2-dimensional object, but the `DataFrame` allows us to represent and manipulate higher-dimensional data.

In [9]:
data = pd.DataFrame({'value':[632, 1638, 569, 115, 433, 1130, 754, 555],
                     'patient':[1, 1, 1, 1, 2, 2, 2, 2],
                     'phylum':['Firmicutes', 'Proteobacteria', 'Actinobacteria', 
    'Bacteroidetes', 'Firmicutes', 'Proteobacteria', 'Actinobacteria', 'Bacteroidetes']})
data

   value  patient          phylum
0    632        1      Firmicutes
1   1638        1  Proteobacteria
2    569        1  Actinobacteria
3    115        1   Bacteroidetes
4    433        2      Firmicutes
5   1130        2  Proteobacteria
6    754        2  Actinobacteria
7    555        2   Bacteroidetes

Notice the columns of the `DataFrame` appear in the order in which the dict keys were given (older versions of pandas sorted them by column name). We can change the order by indexing them in the order we desire:

In [10]:
data[['phylum','value','patient']]

           phylum  value  patient
0      Firmicutes    632        1
1  Proteobacteria   1638        1
2  Actinobacteria    569        1
3   Bacteroidetes    115        1
4      Firmicutes    433        2
5  Proteobacteria   1130        2
6  Actinobacteria    754        2
7   Bacteroidetes    555        2

A `DataFrame` has a second index, representing the columns:

In [11]:
data.columns

Index(['value', 'patient', 'phylum'], dtype='object')

If we wish to access columns, we can do so either by dict-like indexing or by attribute:

In [12]:
data['value']

0     632
1    1638
2     569
3     115
4     433
5    1130
6     754
7     555
Name: value, dtype: int64

In [13]:
data.value

0     632
1    1638
2     569
3     115
4     433
5    1130
6     754
7     555
Name: value, dtype: int64

In [14]:
type(data.value)

pandas.core.series.Series

In [15]:
type(data[['value']])

pandas.core.frame.DataFrame

Notice this is different than with `Series`, where dict-like indexing retrieved a particular element (row). If we want access to a row in a `DataFrame`, we index its `loc` attribute (the older `ix` indexer is deprecated).


In [16]:
data.loc[3]


value                115
patient                1
phylum     Bacteroidetes
Name: 3, dtype: object

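It's important to note that the Series returned when a DataFrame is indexed is merely a **view** on the DataFrame, and not a copy of the data itself, at least in the pandas version used for this notebook (versions with Copy-on-Write enabled behave differently). So you must be cautious when manipulating this data: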

In [17]:
vals = data.value
vals

0     632
1    1638
2     569
3     115
4     433
5    1130
6     754
7     555
Name: value, dtype: int64

In [18]:
vals[5] = 0
vals

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


0     632
1    1638
2     569
3     115
4     433
5       0
6     754
7     555
Name: value, dtype: int64

In [19]:
data

   value  patient          phylum
0    632        1      Firmicutes
1   1638        1  Proteobacteria
2    569        1  Actinobacteria
3    115        1   Bacteroidetes
4    433        2      Firmicutes
5      0        2  Proteobacteria
6    754        2  Actinobacteria
7    555        2   Bacteroidetes

In [20]:
vals = data.value.copy()
vals[5] = 1000
data

   value  patient          phylum
0    632        1      Firmicutes
1   1638        1  Proteobacteria
2    569        1  Actinobacteria
3    115        1   Bacteroidetes
4    433        2      Firmicutes
5      0        2  Proteobacteria
6    754        2  Actinobacteria
7    555        2   Bacteroidetes

We can create or modify columns by assignment:

In [21]:
# Set a single cell with .loc; chained assignment (data.value[3] = 14) triggers a SettingWithCopyWarning
data.loc[3, 'value'] = 14
data


   value  patient          phylum
0    632        1      Firmicutes
1   1638        1  Proteobacteria
2    569        1  Actinobacteria
3     14        1   Bacteroidetes
4    433        2      Firmicutes
5      0        2  Proteobacteria
6    754        2  Actinobacteria
7    555        2   Bacteroidetes

In [22]:
data['year'] = 2013
data

   value  patient          phylum  year
0    632        1      Firmicutes  2013
1   1638        1  Proteobacteria  2013
2    569        1  Actinobacteria  2013
3     14        1   Bacteroidetes  2013
4    433        2      Firmicutes  2013
5      0        2  Proteobacteria  2013
6    754        2  Actinobacteria  2013
7    555        2   Bacteroidetes  2013

But note, we cannot use the attribute indexing method to add a new column:

In [23]:
data.treatment = 1
data

   value  patient          phylum  year
0    632        1      Firmicutes  2013
1   1638        1  Proteobacteria  2013
2    569        1  Actinobacteria  2013
3     14        1   Bacteroidetes  2013
4    433        2      Firmicutes  2013
5      0        2  Proteobacteria  2013
6    754        2  Actinobacteria  2013
7    555        2   Bacteroidetes  2013

In [24]:
data.treatment

1

Specifying a `Series` as a new column causes its values to be added according to the `DataFrame`'s index; rows whose labels are missing from the `Series` receive `NaN`:

In [None]:
treatment = pd.Series([0]*4 + [1]*2)
treatment

In [None]:
data['treatment'] = treatment
data

Other Python data structures (ones without an index) need to be the same length as the `DataFrame`:

In [None]:
month = ['Jan', 'Feb', 'Mar', 'Apr']
# Raises ValueError: the list has 4 items but the DataFrame has 8 rows
data['month'] = month

In [None]:
data['month'] = ['Jan']*len(data)
data

We can use `del` to remove columns, in the same way `dict` entries can be removed:

In [None]:
del data['month']
data

We can extract the underlying data as a simple `ndarray` by accessing the `values` attribute:

In [None]:
data.values

Notice that because of the mix of string and integer (and `NaN`) values, the dtype of the array is `object`. The dtype will automatically be chosen to be as general as needed to accommodate all the columns.

In [None]:
df = pd.DataFrame({'foo': [1,2,3], 'bar':[0.4, -1.0, 4.5]})
df.values

Pandas uses a custom data structure to represent the indices of Series and DataFrames.
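
As a brief illustration of that custom structure, the sketch below inspects the `Index` of a toy `Series` (the values reuse two counts from earlier; the short labels `'F'` and `'P'` are made up for the example). It shows one property worth knowing: an `Index` is immutable, so single labels cannot be reassigned, but the whole index can be replaced.

```python
import pandas as pd

# Toy Series reusing two of the phylum counts from above
s = pd.Series([632, 1638], index=['Firmicutes', 'Proteobacteria'])
idx = s.index
print(type(idx).__name__)

# Index objects are immutable: item assignment raises TypeError
try:
    idx[0] = 'Bacteroidetes'
    mutated = True
except TypeError:
    mutated = False
print(mutated)

# To change labels, assign a complete new index instead
s.index = ['F', 'P']
print(s)
```

Immutability is what makes it safe for several `Series` and `DataFrame` objects to share the same index.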

## Importing data

A key, but often under-appreciated, step in data analysis is importing the data that we wish to analyze. Though it is easy to load basic data structures into Python using built-in tools or those provided by packages like NumPy, it is non-trivial to import structured data well, and to easily convert this input into a robust data structure:

    genes = np.loadtxt("genes.csv", delimiter=",", dtype=[('gene', '|S10'), ('value', '<f4')])

Pandas provides a convenient set of functions for importing tabular data in a number of formats directly into a `DataFrame` object. These functions include a slew of options to perform type inference, indexing, parsing, iterating and cleaning automatically as data are imported.

Let's start with some more bacteria data, stored in csv format.

In [25]:
!cat data/microbiome.csv

Taxon,Patient,Tissue,Stool
Firmicutes,1,632,305
Firmicutes,2,136,4182
Firmicutes,3,1174,703
Firmicutes,4,408,3946
Firmicutes,5,831,8605
Firmicutes,6,693,50
Firmicutes,7,718,717
Firmicutes,8,173,33
Firmicutes,9,228,80
Firmicutes,10,162,3196
Firmicutes,11,372,32
Firmicutes,12,4255,4361
Firmicutes,13,107,1667
Firmicutes,14,96,223
Firmicutes,15,281,2377
Proteobacteria,1,1638,3886
Proteobacteria,2,2469,1821
Proteobacteria,3,839,661
Proteobacteria,4,4414,18
Proteobacteria,5,12044,83
Proteobacteria,6,2310,12
Proteobacteria,7,3053,547
Proteobacteria,8,395,2174
Proteobacteria,9,2651,767
Proteobacteria,10,1195,76
Proteobacteria,11,6857,795
Proteobacteria,12,483,666
Proteobacteria,13,2950,3994
Proteobacteria,14,1541,816
Proteobacteria,15,1307,53
Actinobacteria,1,569,648
Actinobacteria,2,1590,4
Actinobacteria,3,25,2
Actinobacteria,4,259,300
Actinobacteria,5,568,7
Actinobacteria,6,1102,9
Actinobacteria,7,678,377
Actinobacteria,8,260,58
Actinobacteria,9,424,233

This table can be read into a DataFrame using `read_csv`:

In [None]:
mb = pd.read_csv("data/microbiome.csv")
mb

Notice that `read_csv` automatically considered the first row in the file to be a header row.

We can override default behavior by customizing some of the arguments, like `header`, `names` or `index_col`.

In [None]:
pd.read_csv("data/microbiome.csv", header=None).head()

`read_csv` is just a convenience function for `read_table`, since csv is such a common format:

In [None]:
mb = pd.read_table("data/microbiome.csv", sep=',')

The `sep` argument can be customized as needed to accommodate arbitrary separators. For example, we can use a regular expression to match a variable amount of whitespace, which is unfortunately very common in some data formats:
    
    sep=r'\s+'
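
A minimal sketch of the whitespace separator in use. The text below is a hypothetical, unevenly spaced rendering of two rows of the microbiome table, read from an in-memory buffer rather than a file.

```python
import io
import pandas as pd

# Hypothetical whitespace-delimited text with uneven spacing
raw = "Taxon   Patient Tissue\nFirmicutes  1   632\nProteobacteria 1 1638\n"

# A raw string keeps the backslash in the regular expression intact
ws = pd.read_csv(io.StringIO(raw), sep=r'\s+')
print(ws)
```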

For a more useful index, we can specify the first two columns, which together provide a unique index to the data.
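
One way to do this is to pass a list of column names to `index_col`. The sketch below uses a four-row excerpt of the file shown above, held in an in-memory string (the variable names `csv_text` and `mb_idx` are ours).

```python
import io
import pandas as pd

# Four rows excerpted from data/microbiome.csv as listed above
csv_text = """Taxon,Patient,Tissue,Stool
Firmicutes,1,632,305
Firmicutes,2,136,4182
Proteobacteria,1,1638,3886
Proteobacteria,2,2469,1821
"""

# Taxon and Patient together identify each row
mb_idx = pd.read_csv(io.StringIO(csv_text), index_col=['Taxon', 'Patient'])
print(mb_idx.index.is_unique)

# Select one cell with a (Taxon, Patient) tuple
print(mb_idx.loc[('Proteobacteria', 2), 'Stool'])
```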

Most real-world data is incomplete, with values missing due to incomplete observation, data entry or transcription error, or other reasons. Pandas will automatically recognize and parse common missing data indicators, including `NA` and `NULL`.

In [None]:
!cat data/microbiome_missing.csv

In [None]:
pd.read_csv("data/microbiome_missing.csv").head(20)

Above, Pandas recognized `NA` and an empty field as missing data.

In [None]:
pd.isnull(pd.read_csv("data/microbiome_missing.csv")).head(20)

Unfortunately, there will sometimes be inconsistency with the conventions for missing data. In this example, there is a question mark "?" and a large negative number where there should have been a positive integer. We can specify additional symbols with the `na_values` argument:
   

In [None]:
pd.read_csv("data/microbiome_missing.csv", na_values=['?', -99999]).head(20)

These can be specified on a column-wise basis using an appropriate dict as the argument for `na_values`.
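
A small sketch of the dict form, assuming (for illustration only) that `?` marks missing values in `Tissue` and `-99999` does so in `Stool`. The three rows are a made-up variant of the microbiome data, not the contents of `microbiome_missing.csv`.

```python
import io
import pandas as pd

# Made-up rows with two different missing-value markers
csv_text = """Taxon,Patient,Tissue,Stool
Firmicutes,1,632,305
Firmicutes,2,?,4182
Firmicutes,3,1174,-99999
"""

# Each key is a column name; each value lists that column's extra NA markers
mb_na = pd.read_csv(
    io.StringIO(csv_text),
    na_values={'Tissue': ['?'], 'Stool': ['-99999']},
)
print(mb_na)
print(mb_na.isnull().sum())
```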

## Pandas Fundamentals

This section introduces the new user to the key functionality of Pandas that is required to use the software effectively.

For some variety, we will leave our digestive tract bacteria behind and employ some baseball data.

In [4]:
baseball = pd.read_csv("data/baseball.csv", index_col='id')
baseball.head()

          player  year  stint team  lg   g  ab  r   h  X2b  X3b  hr  rbi   sb  \
id                                                                              
88641  womacto01  2006      2  CHN  NL  19  50  6  14    1    0   1  2.0  1.0   
88643  schilcu01  2006      1  BOS  AL  31   2  0   1    0    0   0  0.0  0.0   
88645  myersmi01  2006      1  NYA  AL  62   0  0   0    0    0   0  0.0  0.0   
88649  helliri01  2006      1  MIL  NL  20   3  0   0    0    0   0  0.0  0.0   
88650  johnsra05  2006      1  NYA  AL  33   6  0   1    0    0   0  0.0  0.0   

        cs  bb   so  ibb  hbp   sh   sf  gidp  
id                                             
88641  1.0   4  4.0  0.0  0.0  3.0  0.0   0.0  
88643  0.0   0  1.0  0.0  0.0  0.0  0.0   0.0  
88645  0.0   0  0.0  0.0  0.0  0.0  0.0   0.0  
88649  0.0   0  2.0  0.0  0.0  0.0  0.0   0.0  
88650  0.0   0  4.0  0.0  0.0  0.0  0.0   0.0  

Notice that we specified the `id` column as the index, since it appears to be a unique identifier. We could try to create a unique index ourselves by combining `player` and `year`:

In [5]:
player_id = baseball.player + baseball.year.astype(str)
baseball_newind = baseball.copy()
baseball_newind.index = player_id
baseball_newind.head()

                  player  year  stint team  lg   g  ab  r   h  X2b  X3b  hr  \
womacto012006  womacto01  2006      2  CHN  NL  19  50  6  14    1    0   1   
schilcu012006  schilcu01  2006      1  BOS  AL  31   2  0   1    0    0   0   
myersmi012006  myersmi01  2006      1  NYA  AL  62   0  0   0    0    0   0   
helliri012006  helliri01  2006      1  MIL  NL  20   3  0   0    0    0   0   
johnsra052006  johnsra05  2006      1  NYA  AL  33   6  0   1    0    0   0   

               rbi   sb   cs  bb   so  ibb  hbp   sh   sf  gidp  
womacto012006  2.0  1.0  1.0   4  4.0  0.0  0.0  3.0  0.0   0.0  
schilcu012006  0.0  0.0  0.0   0  1.0  0.0  0.0  0.0  0.0   0.0  
myersmi012006  0.0  0.0  0.0   0  0.0  0.0  0.0  0.0  0.0   0.0  
helliri012006  0.0  0.0  0.0   0  2.0  0.0  0.0  0.0  0.0   0.0  
johnsra052006  0.0  0.0  0.0   0  4.0  0.0  0.0  0.0  0.0   0.0  

This looks okay, but let's check:

In [6]:
baseball_newind.index.is_unique

False

So, indices need not be unique. Our choice is not unique because some players change teams within years.

In [7]:
pd.Series(baseball_newind.index).value_counts()

claytro012007    2
hernaro012007    2
loftoke012007    2
coninje012007    2
wickmbo012007    2
wellsda012007    2
gomezch022007    2
benitar012007    2
sweenma012007    2
trachst012007    2
                ..
clemero022007    1
martipe022007    1
bondsba012007    1
thomafr042007    1
mabryjo012007    1
whiteri012007    1
sandere022007    1
perezne012007    1
helliri012006    1
oliveda022007    1
Length: 88, dtype: int64

The most important consequence of a non-unique index is that indexing by label will return multiple values for some labels:

In [None]:
baseball_newind.loc['wickmbo012007']

We will learn more about indexing below.

We can create a truly unique index by combining `player`, `team` and `year`:

In [None]:
player_unique = baseball.player + baseball.team + baseball.year.astype(str)
baseball_newind = baseball.copy()
baseball_newind.index = player_unique
baseball_newind.head()

In [None]:
baseball_newind.index.is_unique

We can create meaningful indices more easily using a hierarchical index; for now, we will stick with the numeric `id` field as our index.
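
For reference, a hierarchical index can be sketched with `set_index` on several columns. The frame below is a toy with hypothetical players and teams (`player_a`, `T1`, and so on), not rows from `baseball.csv`.

```python
import pandas as pd

# Toy data: player_a appears twice in 2007 because of a team change
toy = pd.DataFrame({
    'player': ['player_a', 'player_a', 'player_b'],
    'team':   ['T1', 'T2', 'T3'],
    'year':   [2007, 2007, 2007],
    'g':      [10, 20, 30],
})

# Keep the three fields as separate index levels instead of one joined string
toy_hier = toy.set_index(['player', 'year', 'team'])
print(toy_hier.index.is_unique)

# Selecting on the first level returns every row for that player
print(toy_hier.loc['player_a'])
```

Compared with concatenating strings, the separate levels stay individually queryable.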

## Indexing and Selection

Indexing works analogously to indexing in NumPy arrays, except we can use the labels in the `Index` object to extract values in addition to arrays of integers.

In [None]:
# Sample Series object
hits = baseball_newind.h
hits

In [None]:
# Numpy-style indexing
hits[:3]

In [None]:
# Indexing by label
hits[['womacto01CHN2006','schilcu01BOS2006']]

We can also slice with data labels, since they have an intrinsic order within the Index:

In [None]:
hits['womacto01CHN2006':'gonzalu01ARI2006']

In [None]:
hits['womacto01CHN2006':'gonzalu01ARI2006'] = 5
hits

In a `DataFrame` we can slice along either or both axes:

In [None]:
baseball_newind[['h','ab']]

In [None]:
baseball_newind[baseball_newind.ab>500]

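The indexing field `loc` allows us to select subsets of rows and columns by label in an intuitive way (it replaces the deprecated `ix` indexer):

In [None]:
baseball_newind.loc['gonzalu01ARI2006', ['h','X2b', 'X3b', 'hr']]

In [None]:
# loc is label-based, so look up the column labels at positions 5 to 7 first
baseball_newind.loc[['gonzalu01ARI2006','finlest01SFN2006'], baseball_newind.columns[5:8]]

In [None]:
baseball_newind.loc[:'myersmi01NYA2006', 'hr']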

Similarly, the cross-section method `xs` (not a field) extracts a single column or row *by label* and returns it as a `Series`:

In [None]:
baseball_newind.xs('myersmi01NYA2006')
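
Label-based and positional selection can be compared side by side in a small self-contained sketch. The three rows use the `h`, `ab` and `hr` values from the `baseball.head()` output above; the variable names are ours.

```python
import pandas as pd

df = pd.DataFrame(
    {'h': [14, 1, 0], 'ab': [50, 2, 0], 'hr': [1, 0, 0]},
    index=['womacto01CHN2006', 'schilcu01BOS2006', 'myersmi01NYA2006'],
)

# Same cells, selected by label and by integer position
by_label = df.loc['schilcu01BOS2006', ['h', 'hr']]
by_pos = df.iloc[1, [0, 2]]
print(by_label.equals(by_pos))

# Label slices include the end label; positional slices exclude the end
label_slice = df.loc[:'schilcu01BOS2006', 'hr']
pos_slice = df.iloc[:1, 2]
print(len(label_slice), len(pos_slice))
```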

## Operations

`DataFrame` and `Series` objects allow for several operations to take place either on a single object, or between two or more objects.

For example, we can perform arithmetic on the elements of two objects, such as combining baseball statistics across years:

In [8]:
hr2006 = baseball[baseball.year==2006].xs('hr', axis=1)
hr2006.index = baseball.player[baseball.year==2006]

hr2007 = baseball[baseball.year==2007].xs('hr', axis=1)
hr2007.index = baseball.player[baseball.year==2007]

In [9]:
hr2006 = pd.Series(baseball.hr[baseball.year==2006].values, index=baseball.player[baseball.year==2006])
hr2007 = pd.Series(baseball.hr[baseball.year==2007].values, index=baseball.player[baseball.year==2007])

In [10]:
hr_total = hr2006 + hr2007
hr_total

player
alomasa02   NaN
aloumo01    NaN
ausmubr01   NaN
benitar01   NaN
benitar01   NaN
biggicr01   NaN
bondsba01   NaN
cirilje01   NaN
cirilje01   NaN
claytro01   NaN
             ..
wellsda01   NaN
wellsda01   NaN
whiteri01   NaN
whitero02   NaN
wickmbo01   NaN
wickmbo01   NaN
williwo02   NaN
witasja01   NaN
womacto01   NaN
zaungr01    NaN
Length: 94, dtype: float64

Pandas' data alignment places `NaN` values for labels that do not overlap in the two Series. In fact, there are only 6 players that occur in both years.

In [11]:
hr_total[hr_total.notnull()]

player
finlest01     7.0
gonzalu01    30.0
johnsra05     0.0
myersmi01     0.0
schilcu01     0.0
seleaa01      0.0
dtype: float64

While we do want the operation to honor the data labels in this way, we probably do not want the missing values to be filled with `NaN`. We can use the `add` method to calculate player home run totals by using the `fill_value` argument to insert a zero for home runs where labels do not overlap:

In [12]:
hr2007.add(hr2006, fill_value=0)

player
alomasa02     0.0
aloumo01     13.0
ausmubr01     3.0
benitar01     0.0
benitar01     0.0
biggicr01    10.0
bondsba01    28.0
cirilje01     0.0
cirilje01     2.0
claytro01     0.0
             ... 
wellsda01     0.0
wellsda01     0.0
whiteri01     0.0
whitero02     4.0
wickmbo01     0.0
wickmbo01     0.0
williwo02     1.0
witasja01     0.0
womacto01     1.0
zaungr01     10.0
Length: 94, dtype: float64

Operations can also be **broadcast** between rows or columns.

For example, if we subtract the maximum number of home runs hit from the `hr` column, we get how many fewer than the maximum were hit by each player:

In [13]:
baseball.hr - baseball.hr.max()

id
88641   -34
88643   -35
88645   -35
88649   -35
88650   -35
88652   -29
88653   -20
88662   -35
89177   -35
89178   -34
         ..
89499   -34
89501   -35
89502   -33
89521    -7
89523   -25
89525   -35
89526   -35
89530   -32
89533   -22
89534   -35
Name: hr, Length: 100, dtype: int64

Or, looking at things row-wise, we can see how a particular player compares with the rest of the group with respect to important statistics:

In [14]:
baseball.loc[89521, "player"]


'bondsba01'

In [15]:
stats = baseball[['h','X2b', 'X3b', 'hr']]
diff = stats - stats.xs(89521)
diff[:10]

        h  X2b  X3b  hr
id                     
88641 -80  -13    0 -27
88643 -93  -14    0 -28
88645 -94  -14    0 -28
88649 -94  -14    0 -28
88650 -93  -14    0 -28
88652  11    7   12 -22
88653  65   38    2 -13
88662 -89  -13    0 -28
89177 -84  -11    0 -28
89178 -84  -14    0 -27

We can also apply functions to each column or row of a `DataFrame`:

In [16]:
stats.apply(np.median)

h      8.0
X2b    1.0
X3b    0.0
hr     0.0
dtype: float64

In [17]:
stat_range = lambda x: x.max() - x.min()
stats.apply(stat_range)

h      159
X2b     52
X3b     12
hr      35
dtype: int64

Let's use `apply` to calculate a meaningful baseball statistic, slugging percentage:

$$SLG = \frac{1B + (2 \times 2B) + (3 \times 3B) + (4 \times HR)}{AB}$$

And just for fun, we will format the resulting estimate.

In [None]:
slg = lambda x: (x['h']-x['X2b']-x['X3b']-x['hr'] + 2*x['X2b'] + 3*x['X3b'] + 4*x['hr'])/(x['ab']+1e-6)
baseball.apply(slg, axis=1).apply(lambda x: '%.3f' % x)
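
The same formula can also be written with whole-column arithmetic instead of a row-wise `apply`. This is an alternative sketch, not the approach used in the cell above. The two rows reuse the values for ids 88641 and 88645 from the `baseball.head()` output; `rows`, `singles`, `total_bases` and `slg_vec` are names introduced here.

```python
import pandas as pd

rows = pd.DataFrame(
    {'h': [14, 0], 'X2b': [1, 0], 'X3b': [0, 0], 'hr': [1, 0], 'ab': [50, 0]},
    index=[88641, 88645],
)

# Singles are hits that were not doubles, triples or home runs
singles = rows.h - rows.X2b - rows.X3b - rows.hr
total_bases = singles + 2 * rows.X2b + 3 * rows.X3b + 4 * rows.hr

# Without the 1e-6 offset, a player with no at-bats gets NaN rather than 0
slg_vec = total_bases / rows.ab
print(slg_vec.round(3))
```

Whether `NaN` or `0.000` is the better result for a player with zero at-bats depends on the analysis.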

## Data summarization

We often wish to summarize data in `Series` or `DataFrame` objects, so that they can more easily be understood or compared with similar data. The NumPy package contains several functions that are useful here, but several summarization or reduction methods are built into Pandas data structures.

In [None]:
baseball.sum()

Clearly, `sum` is more meaningful for some columns than others. For methods like `mean`, for which application to string variables is not just meaningless but impossible, we pass `numeric_only=True` so that these columns are excluded (older pandas versions excluded them automatically):

In [None]:
baseball.mean(numeric_only=True)

A useful summarization that gives a quick snapshot of multiple statistics for a `Series` or `DataFrame` is `describe`:

In [None]:
baseball.describe()

`describe` can detect non-numeric data and sometimes yield useful information about it.

In [None]:
baseball.player.describe()

## Writing Data to Files

As well as being able to read several data input formats, Pandas can also export data to a variety of storage formats. We will bring your attention to just a couple of these.

In [None]:
mb.to_csv("mb.csv")

The `to_csv` method writes a `DataFrame` to a comma-separated values (csv) file. Among other options, you can specify a custom delimiter (`sep` argument), how missing values are written (`na_rep` argument), whether the index is written (`index` argument), and whether the header is included (`header` argument).
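
A short sketch of these options, writing to an in-memory buffer so no file is created. The two-row frame is toy data (the second `Tissue` value is deliberately missing).

```python
import io
import numpy as np
import pandas as pd

small = pd.DataFrame({'Taxon': ['Firmicutes', 'Firmicutes'],
                      'Tissue': [632, np.nan]})

# Tab-separated, missing values written as NA, no index column
buf = io.StringIO()
small.to_csv(buf, sep='\t', na_rep='NA', index=False)
text = buf.getvalue()
print(text)
```

Passing a file path instead of `buf` writes the same text to disk.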