In this session, we introduce the basic usage of the data manipulation library **Pandas** and the visualization libraries **Matplotlib** and **Seaborn** through some simple examples. All code in this tutorial can be run on Google Colab.

# Introducing Pandas Objects

Pandas is one of the most popular Python libraries for data analysis. It provides tools to read, clean, transform, and analyze data. If you've ever worked with a spreadsheet program like Excel or Google Sheets, you can think of Pandas as a way to do all of that, and much more, with code.



At the most basic level, Pandas objects can be thought of as enhanced versions of NumPy structured arrays in which the rows and columns are identified with labels rather than simple integer indices.
As we will see during the course of this chapter, Pandas provides a host of useful tools, methods, and functionality on top of the basic data structures, but nearly everything that follows will require an understanding of what these structures are.
Thus, before we go any further, let's introduce these three fundamental Pandas data structures: the ``Series``, ``DataFrame``, and ``Index``. Understanding them is the key to using the library effectively.

We will start our code sessions with the standard NumPy and Pandas imports:

In [1]:
import numpy as np
import pandas as pd

## The Pandas Series Object

A Pandas ``Series`` is a one-dimensional array of indexed data. Think of it as a single column in a spreadsheet.

It has two main components:
- **Values**: The actual data.
- **Index**: A label for each value.



Let's create a Series from a simple list.

In [2]:
# Create a Series from a list of numbers
data = pd.Series([0.25, 0.5, 0.75, 1.0])

# Display the Series. Notice the index on the left (0, 1, 2, 3) and values on the right
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

As we see in the output, the ``Series`` wraps both a sequence of values and a sequence of indices, which we can access with the ``values`` and ``index`` attributes.
The ``values`` are simply a familiar NumPy array:

In [3]:
# Access the values of the Series, which returns a NumPy array.
print(data.values)

# Check the type of the values attribute which should be a NumPy ndarray
print(type(data.values))

[0.25 0.5  0.75 1.  ]
<class 'numpy.ndarray'>


The ``index`` is an array-like object of type ``pd.Index``, which we'll discuss in more detail momentarily.

In [4]:
# Access the index of the Series.
# By default, it's a RangeIndex, starting from 0.
print(data.index)

RangeIndex(start=0, stop=4, step=1)


As with a NumPy array, data can be accessed by the associated index via the familiar Python square-bracket notation:

In [5]:
# Get the first element using its index.
data[0]

np.float64(0.25)

In [6]:
# Slice the Series to get elements from index 1 up to (but not including) index 3.
data[1:3]

1    0.50
2    0.75
dtype: float64

As we will see, though, the Pandas ``Series`` is much more general and flexible than the one-dimensional NumPy array that it emulates.

### ``Series`` as generalized NumPy array

From what we've seen so far, it may look like the ``Series`` object is basically interchangeable with a one-dimensional NumPy array.
The essential difference is the presence of the index: while the NumPy array has an *implicitly defined* integer index used to access the values, the Pandas ``Series`` has an *explicitly defined* index associated with the values.

This explicit index definition gives the ``Series`` object additional capabilities. The index need not be an integer, but can consist of values of any desired type.
For example, if we wish, we can use strings as an index:

In [7]:
# Create a Series with custom string labels for the index.
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])


print(data.values)
print(data.index)


[0.25 0.5  0.75 1.  ]
Index(['a', 'b', 'c', 'd'], dtype='object')


In [8]:
# Now we can access the data using these meaningful labels.
data['a']

np.float64(0.25)
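
The labels also do not have to be contiguous or sorted. The following is a minimal sketch with illustrative integer labels (the name `sparse` and the labels are assumptions, not part of the examples above):

```python
import pandas as pd

# Integer labels that are neither contiguous nor sorted (illustrative values).
sparse = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])

# With an explicit integer index, square brackets look up the label 5,
# not the element at position 5 (which does not exist here).
value_at_label_5 = sparse[5]
print(value_at_label_5)
```

Here the lookup returns the value stored under the label `5`, which is the second element of the underlying array.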

### Series as specialized dictionary

In this way, you can think of a Pandas ``Series`` a bit like a specialization of a Python dictionary.
A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a ``Series`` is a structure which maps typed keys to a set of typed values.
This typing is important: just as the type-specific compiled code behind a NumPy array makes it more efficient than a Python list for certain operations, the type information of a Pandas ``Series`` makes it much more efficient than Python dictionaries for certain operations.

Simply: A dictionary maps keys to values; a Series maps index labels to values.
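
To make the efficiency point concrete, here is a small sketch comparing the two structures on the same made-up data (the fruit prices are hypothetical). The dictionary needs an explicit loop, while the ``Series`` applies the operation to its typed values in one step:

```python
import pandas as pd

labels = ['apple', 'pear', 'plum']   # hypothetical keys
values = [1.0, 2.0, 4.0]             # hypothetical values

prices_dict = dict(zip(labels, values))
prices = pd.Series(values, index=labels)

# A dictionary needs a Python-level loop to transform every value.
doubled_dict = {key: value * 2 for key, value in prices_dict.items()}

# A Series applies the operation to the whole typed array at once
# and keeps the labels attached to the results.
doubled = prices * 2

print(doubled)
```

Both approaches produce the same label-to-value mapping; the ``Series`` version simply avoids the per-item loop.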

The ``Series``-as-dictionary analogy becomes even clearer when we construct a ``Series`` object directly from a Python dictionary. Pandas will automatically use the dictionary's keys as the index.

In [9]:
# A dictionary of state populations.
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}

# Create a Series from the dictionary. The keys become the index.
population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

By default, a ``Series`` will be created where the index is drawn from the dictionary keys, in the order they were inserted.
From here, typical dictionary-style item access can be performed:

In [10]:
# Access the population of California using its index label.
population['California']

np.int64(38332521)

Unlike a dictionary, though, the ``Series`` also supports array-style operations such as slicing:

In [11]:
# Select a range of data from 'California' to 'Illinois'.
population['California':'Illinois']

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64
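
One detail worth noting in the output above is that the stop label `'Illinois'` is included in the result. The sketch below contrasts this with position-based slicing through `iloc`, which excludes the stop position as ordinary Python slicing does. It rebuilds the `population` Series so that it runs on its own:

```python
import pandas as pd

population = pd.Series({'California': 38332521,
                        'Texas': 26448193,
                        'New York': 19651127,
                        'Florida': 19552860,
                        'Illinois': 12882135})

# Label-based slice: the stop label 'New York' is included.
by_label = population['California':'New York']

# Position-based slice via iloc: the stop position 2 is excluded.
by_position = population.iloc[0:2]

print(list(by_label.index))
print(list(by_position.index))
```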

We'll discuss some of the quirks of Pandas indexing and slicing in [Data Indexing and Selection](03.02-Data-Indexing-and-Selection.ipynb).

### Constructing Series objects

We've already seen a few ways of constructing a Pandas ``Series`` from scratch; all of them are some version of the following:

```python
>>> pd.Series(data, index=index)
```

where ``index`` is an optional argument, and ``data`` can be one of many entities.

For example, ``data`` can be a list or NumPy array, in which case ``index`` defaults to an integer sequence:

In [12]:
pd.Series([2, 4, 6])

0    2
1    4
2    6
dtype: int64

``data`` can be a scalar, which is repeated to fill the specified index:

In [13]:
pd.Series(5, index=[100, 200, 300])

100    5
200    5
300    5
dtype: int64

``data`` can be a dictionary, in which case ``index`` defaults to the dictionary keys in insertion order:

In [14]:
pd.Series({2:'a', 1:'b', 3:'c'})

2    a
1    b
3    c
dtype: object
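
As the output shows, the keys appear in the order in which they were written in the dictionary. If a sorted index is wanted, one option is the `sort_index` method. A minimal sketch (the variable names are illustrative):

```python
import pandas as pd

letters = pd.Series({2: 'a', 1: 'b', 3: 'c'})

# sort_index returns a new Series ordered by label; the original is unchanged.
letters_sorted = letters.sort_index()

print(list(letters.index))         # order in which the keys were written
print(list(letters_sorted.index))  # ascending label order
```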

In each case, the index can be explicitly set if a different result is preferred:

In [15]:
# Returns a Series where the index is in the order specified by the index parameter
pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 2, 1])

3    c
2    a
1    b
dtype: object

Notice that in this case, the ``Series`` is populated only with the explicitly identified keys.
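
The effect is easier to see when the requested index does not match the dictionary keys exactly. In this sketch (variable names are illustrative), a key left out of `index` is dropped, and a label with no matching key is filled with `NaN`:

```python
import pandas as pd

letters = {2: 'a', 1: 'b', 3: 'c'}

# Only the keys listed in index are kept; key 1 is dropped.
subset = pd.Series(letters, index=[3, 2])

# A label with no matching key (4 here) is filled with NaN.
with_missing = pd.Series(letters, index=[3, 4])

print(subset)
print(with_missing)
```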

## The Pandas DataFrame Object

The next fundamental structure in Pandas is the ``DataFrame``.
Like the ``Series`` object discussed in the previous section, the ``DataFrame`` can be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary.
We'll now take a look at each of these perspectives.

### DataFrame as a generalized NumPy array

If a ``Series`` is analogous to a one-dimensional array with flexible indices, a ``DataFrame`` is analogous to a two-dimensional array with both flexible row indices and flexible column names. If a ``Series`` is a single column, a ``DataFrame`` is a group of ``Series`` objects that all share the same index, i.e., the entire spreadsheet or data table.

A ``DataFrame`` is a two-dimensional structure with:
- Labeled columns.
- A shared index for all rows.

To demonstrate this, let's first construct a new ``Series`` listing the area of each of the five states discussed in the previous section:

In [16]:
# Create another dictionary for state areas.
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}

# Create a Series for area.
area = pd.Series(area_dict)
area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64

Now that we have this along with the ``population`` Series from before, we can use a dictionary to construct a single two-dimensional object containing this information:

In [17]:
# Create a DataFrame from our two Series.
states = pd.DataFrame({'population': population,
                       'area': area})
states

            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995
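
Because the columns share one index, pandas matches the values of each ``Series`` by label rather than by position when it builds the ``DataFrame``. A small sketch with made-up values (the names `left` and `right` are illustrative):

```python
import pandas as pd

left = pd.Series({'a': 1, 'b': 2, 'c': 3})
right = pd.Series({'c': 30, 'a': 10, 'b': 20})   # same labels, different order

# Values are matched on their labels, so each row pairs
# the entries that belong to the same label.
frame = pd.DataFrame({'left': left, 'right': right})

print(frame)
```

Each row pairs the entries that carry the same label, even though the two ``Series`` list those labels in a different order.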


Like the ``Series`` object, the ``DataFrame`` has an ``index`` attribute that gives access to the index labels:

In [18]:
# Access the DataFrame's row index.
states.index

Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

Additionally, the ``DataFrame`` has a ``columns`` attribute, which is an ``Index`` object holding the column labels:

In [19]:
# Access the DataFrame's column labels.
states.columns

Index(['population', 'area'], dtype='object')

Thus the ``DataFrame`` can be thought of as a generalization of a two-dimensional NumPy array, where both the rows and columns have a generalized index for accessing the data.
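
As a brief preview of what this means in practice, the sketch below reads the same cell once by integer position from the underlying array and once by row and column label with `loc`. It uses a two-row subset of the data above so that it runs on its own:

```python
import pandas as pd

states = pd.DataFrame(
    {'population': [38332521, 26448193], 'area': [423967, 695662]},
    index=['California', 'Texas'],
)

# A plain 2-D array is addressed by integer position ...
raw = states.to_numpy()
texas_area_by_position = raw[1, 1]

# ... while the DataFrame can address the same cell by row and column label.
texas_area_by_label = states.loc['Texas', 'area']

print(raw.shape, texas_area_by_position, texas_area_by_label)
```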

### DataFrame as specialized dictionary

Similarly, we can also think of a ``DataFrame`` as a specialization of a dictionary.
Where a dictionary maps a key to a value, a ``DataFrame`` maps a column name to a ``Series`` of column data.
For example, asking for the ``'area'`` key returns the ``Series`` object containing the areas we saw earlier:

In [20]:
# Select the 'area' column. The result is a Pandas Series!
states['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

Notice the potential point of confusion here: in a two-dimensional NumPy array, ``data[0]`` will return the first *row*. For a ``DataFrame``, ``data['col0']`` will return the first *column*.
Because of this, it is probably better to think about ``DataFrame``s as generalized dictionaries rather than generalized arrays, though both ways of looking at the situation can be useful.
We'll explore more flexible means of indexing ``DataFrame``s in [Data Indexing and Selection](03.02-Data-Indexing-and-Selection.ipynb).
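
The row-versus-column difference can be checked directly. This sketch uses a small made-up array and the hypothetical column names `col0` and `col1`:

```python
import numpy as np
import pandas as pd

array = np.array([[1, 2],
                  [3, 4],
                  [5, 6]])
frame = pd.DataFrame(array, columns=['col0', 'col1'])

first_row = array[0]          # NumPy: a single index selects a row
first_column = frame['col0']  # DataFrame: a single key selects a column

print(first_row)
print(first_column.to_numpy())
```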

### Constructing DataFrame objects

A Pandas ``DataFrame`` can be constructed in a variety of ways.
Here we'll give several examples.

#### From a single Series object

A ``DataFrame`` is a collection of ``Series`` objects, and a single-column ``DataFrame`` can be constructed from a single ``Series``:

In [21]:
# Create a single-column DataFrame from our population Series.
# We explicitly name the column 'population'.
pd.DataFrame(population, columns=['population'])

            population
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135


#### From a two-dimensional NumPy array

Given a two-dimensional array of data, we can create a ``DataFrame`` with any specified column and index names.
If omitted, an integer index will be used for each:

In [22]:
# Create a 3x2 NumPy array of random numbers.
random_data = np.random.rand(3, 2)

# Create a DataFrame, providing the data, column names, and index labels.
pd.DataFrame(
    data = random_data,
    columns = ['input', 'output'],
    index = ['a', 'b', 'c']
)


      input    output
a  0.953876  0.692547
b  0.071170  0.956330
c  0.754377  0.339923
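
For comparison, here is a sketch of what happens when the column and index names are omitted. It uses a small deterministic array (an assumption made so the result is reproducible) instead of random numbers:

```python
import numpy as np
import pandas as pd

fixed_data = np.arange(6).reshape(3, 2)   # deterministic stand-in for random data

# No columns or index given: both fall back to integer labels starting at 0.
unlabeled = pd.DataFrame(fixed_data)

print(unlabeled.index)
print(unlabeled.columns)
```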


Congratulations! This covers the fundamental objects in Pandas!