## 2.2 CREATING YOUR OWN DATA

Whether you are manually inputting data or creating a small test example, knowing how to create dataframes without loading data from a file is a useful skill. It is especially helpful when you are asking a question on Stack Overflow about an error, because a small self-contained example lets others reproduce the problem.

### 2.2.1 Creating a Series

The Pandas Series is a one-dimensional container, similar to the built-in Python list. It is the data type that represents each column of the DataFrame. Table 1.1 lists the possible dtypes for Pandas DataFrame columns. All values within a single column must share the same dtype. Since a dataframe can be thought of as a dictionary of Series objects, where each key is the column name and each value is a Series, we can conclude that a Series is very similar to a Python list, except that every element must have the same dtype. Those who have used the numpy library will recognize this as the same behavior shown by the ndarray.
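The "dictionary of Series" view can be checked directly. The following is a minimal sketch with made-up data; the variable and column names (`visits`, `weights`, `df`) are hypothetical and chosen only for illustration.

```python
import pandas as pd

# hypothetical example data: two Series, each holding a single dtype
visits = pd.Series([1, 2, 3])
weights = pd.Series([0.5, 1.5, 2.5])

# a dictionary of Series: each key becomes a column name
df = pd.DataFrame({'visits': visits, 'weight': weights})

# pulling a column back out gives a Series again
col = df['weight']
print(type(col))

# every column reports exactly one dtype
print(df.dtypes)
```

Each column reports its own dtype, and selecting a column returns a Series rather than a list.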

The easiest way to create a Series is to pass in a Python list. If we pass in a list of mixed types, Pandas uses the most general dtype that can represent every value. For a mix such as strings and integers, that dtype is object.
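Not every mix falls back to object. As a small sketch with made-up values (the variable names are illustrative only), a list that mixes integers and floats is stored as floats, while a list that mixes a string and an integer is stored as object:

```python
import pandas as pd

# integers and floats share a common numeric representation
mixed_numeric = pd.Series([1, 2.5])

# a string and an integer have no common type other than object
mixed_other = pd.Series(['banana', 42])

print(mixed_numeric.dtype)   # float64
print(mixed_other.dtype)     # object
```

The string and integer case is the one worked through in the next listing.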

In [3]:
import pandas as pd

s = pd.Series(['banana', 42])

print(s)

0    banana
1        42
dtype: object


Notice that the “row number” is shown on the left. This is actually the index of the Series. It is similar to the row name and row index we saw in Section 1.3.2 for dataframes, which means we can assign a “name” to each value in our Series.

In [2]:
# manually assign index values to a series
# by passing a Python list

s = pd.Series(['Wes McKinney', 'Creator of Pandas'],
              index=['Person', 'Who'])

print(s)

Person         Wes McKinney
Who       Creator of Pandas
dtype: object
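Once values carry labels, they can be looked up by label as well as by position. A short sketch using the same Series as above:

```python
import pandas as pd

s = pd.Series(['Wes McKinney', 'Creator of Pandas'],
              index=['Person', 'Who'])

# the labels we assigned
print(list(s.index))

# label-based lookup, with and without .loc
print(s['Person'])
print(s.loc['Who'])

# position-based lookup still works through .iloc
print(s.iloc[0])
```

Using `.loc` for labels and `.iloc` for positions keeps the two kinds of lookup explicit.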


Questions

1. What happens if you use other containers instead of a list, such as a tuple, a dict, or even the ndarray from the numpy library?

2. What happens if you pass an index along with the containers?

3. Does passing in an index when you use a dict overwrite the index? Or does it sort the values?
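One way to start exploring these questions is to build a few small Series and inspect them. The sketch below uses made-up values and hypothetical variable names; it is a starting point for experimentation, not a complete answer.

```python
import numpy as np
import pandas as pd

# other containers: tuple, ndarray, dict
from_tuple = pd.Series((10, 20))
from_array = pd.Series(np.array([1.5, 2.5]))
from_dict = pd.Series({'x': 10, 'y': 20})   # dict keys become the index

# a tuple plus an explicit index: labels are attached by position
with_index = pd.Series((10, 20), index=['first', 'second'])

# a dict plus an explicit index: values are selected by key,
# and labels missing from the dict become NaN
selected = pd.Series({'x': 10, 'y': 20}, index=['y', 'z'])

print(from_dict)
print(with_index)
print(selected)
```

Comparing `with_index` and `selected` shows that an index is treated differently depending on whether the data already carries its own labels.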

### 2.2.2 Creating a DataFrame

As mentioned in Section 1.1, a DataFrame can be thought of as a dictionary of Series objects. This is why dictionaries are the most common way of creating a DataFrame. Each key represents a column name, and each value holds the contents of that column.

In [4]:
scientists = pd.DataFrame({
    'Name': ['Rosaline Franklin', 'William Gosset'],
    'Occupation': ['Chemist', 'Statistician'],
    'Born': ['1920-07-25', '1876-06-13'],
    'Died': ['1958-04-16', '1937-10-16'],
    'Age': [37, 61]})

print(scientists)

   Age        Born        Died               Name    Occupation
0   37  1920-07-25  1958-04-16  Rosaline Franklin       Chemist
1   61  1876-06-13  1937-10-16     William Gosset  Statistician


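Notice that the column order is not guaranteed: the columns above are printed in alphabetical order rather than in the order we typed them. (With Python 3.7 or later and a recent version of Pandas, the insertion order of the dictionary is preserved instead.)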

If we look at the documentation for DataFrame (see footnote 1), we see that we can use the columns parameter to specify the column order. If we want to use the names as the row index, we can pass them to the index parameter.

1. DataFrame documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html

In [5]:
scientists = pd.DataFrame(
    data={'Occupation': ['Chemist', 'Statistician'],
          'Born': ['1920-07-25', '1876-06-13'],
          'Died': ['1958-04-16', '1937-10-16'],
          'Age': [37, 61]},
    index=['Rosaline Franklin', 'William Gosset'],
    columns=['Occupation', 'Born', 'Died', 'Age'])

print(scientists)

                     Occupation        Born        Died  Age
Rosaline Franklin       Chemist  1920-07-25  1958-04-16   37
William Gosset     Statistician  1876-06-13  1937-10-16   61
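When the data is a dictionary, the columns parameter can also pick out a subset of the keys, and the index labels can then be used for lookups. A minimal sketch reusing the data above (the name `subset` is hypothetical):

```python
import pandas as pd

data = {'Occupation': ['Chemist', 'Statistician'],
        'Born': ['1920-07-25', '1876-06-13'],
        'Died': ['1958-04-16', '1937-10-16'],
        'Age': [37, 61]}

# keep only two of the four keys, in the order given
subset = pd.DataFrame(
    data=data,
    index=['Rosaline Franklin', 'William Gosset'],
    columns=['Age', 'Occupation'])

print(list(subset.columns))

# look up a single value by row label and column name
print(subset.loc['William Gosset', 'Age'])   # 61
```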


The order is not guaranteed because Python dictionaries were historically unordered: before Python 3.7, the language did not guarantee that a dictionary keeps its insertion order. If we want an ordered dictionary regardless of the Python version, we can use OrderedDict from the collections module (see footnote 2). Doing so is not as simple as wrapping our dictionary literal in a call to OrderedDict, however, because on those older versions the dictionary would already have lost its order by the time it was created and passed to OrderedDict.

2. Collections module: https://docs.python.org/3.6/library/collections.html

In [6]:
from collections import OrderedDict

# note the round brackets after OrderedDict
# then we pass a list of 2-tuples

scientists = pd.DataFrame(OrderedDict([
    ('Name', ['Rosaline Franklin', 'William Gosset']),
    ('Occupation', ['Chemist', 'Statistician']),
    ('Born', ['1920-07-25', '1876-06-13']),
    ('Died', ['1958-04-16', '1937-10-16']),
    ('Age', [37, 61])
    ])
)

print(scientists)

                Name    Occupation        Born        Died  Age
0  Rosaline Franklin       Chemist  1920-07-25  1958-04-16   37
1     William Gosset  Statistician  1876-06-13  1937-10-16   61
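On Python 3.7 or later, where plain dictionaries keep their insertion order, and with a recent version of Pandas, a plain dictionary is expected to give the same column order as the OrderedDict. The following sketch assumes such an environment; on older versions the comparison may come out differently. The names `plain` and `ordered` are hypothetical.

```python
from collections import OrderedDict

import pandas as pd

# built from a plain dictionary literal
plain = pd.DataFrame({
    'Name': ['Rosaline Franklin', 'William Gosset'],
    'Occupation': ['Chemist', 'Statistician'],
    'Age': [37, 61]})

# built from an OrderedDict with the same keys in the same order
ordered = pd.DataFrame(OrderedDict([
    ('Name', ['Rosaline Franklin', 'William Gosset']),
    ('Occupation', ['Chemist', 'Statistician']),
    ('Age', [37, 61])]))

# assumption: Python 3.7+ and a recent Pandas
print(list(plain.columns))
print(plain.equals(ordered))
```

If both checks agree in your environment, the OrderedDict is only needed for compatibility with older setups.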
