## Why Pandas?

In many real-world analytics projects, data comes in tabular format: each observation is recorded in a row, and the attributes of the observation are recorded in columns. Data analysts have traditionally used software that manipulates data in this format, such as Excel, Stata or SPSS. [Pandas](https://pandas.pydata.org/), a Python library for data analysis, similarly represents data in the form of a spreadsheet. Compared with applications like Excel or Stata, however, data manipulation in Pandas can be much more varied, can be programmed more easily, and can be performed more efficiently, which is critical in large-scale projects.

Pandas is a high-level library in the sense that it is built on top of NumPy. In effect, Pandas is a set of tools that makes it easier to use NumPy to work with real-world datasets: load data from a variety of sources (files or databases), clean the data (for example, remove duplicates and rows with missing values), select subsets of the data, group and visualize the data, perform statistical analysis of the data, and export the data to other file formats or databases.
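As a rough illustration of the steps listed above, here is a minimal sketch that loads, cleans, selects, groups and exports a tiny table. The CSV text and the column names (`name`, `subject`, `grade`) are made up for this example, and the table object it uses (a DataFrame) is introduced a little further down.

```python
import io
import pandas as pd

# Hypothetical CSV data, made up for illustration only
csv_text = """name,subject,grade
Joe,Maths,65.1
Emma,Maths,50.5
Emma,Maths,50.5
Max,Physics,
Lucy,Physics,72.4
"""

# Load: read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

# Clean: remove duplicate rows, then rows with missing values
clean = df.drop_duplicates().dropna()
print(len(clean))   # 3

# Select: keep only the rows with a grade above 60
passed = clean[clean["grade"] > 60]

# Group: mean grade per subject
mean_by_subject = clean.groupby("subject")["grade"].mean()
print(mean_by_subject)

# Export: to_csv returns a string when no path is given
exported = clean.to_csv(index=False)
```

In a real project the `io.StringIO` object would typically be replaced by a path to a file on disk.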

Like NumPy, Pandas is part of the Anaconda distribution of Python, and it should already be installed on your computer if you installed Anaconda. Otherwise, Pandas can be installed using the "[pip](https://realpython.com/what-is-pip/)" command (if you are using the standard Python distribution) or "[conda](https://docs.conda.io/en/latest/)" (if you are using Anaconda).

The two main data structures in Pandas are the DataFrame (analogous to a spreadsheet in Excel) and the Series (analogous to a single column in a spreadsheet).
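The relationship between the two structures can be seen in a small sketch: selecting one column of a DataFrame gives back a Series. The grades and names below are hypothetical values chosen only for illustration.

```python
import pandas as pd

# Hypothetical grades, for illustration only
df = pd.DataFrame({"Maths": [65.1, 50.5], "Physics": [70.0, 81.5]},
                  index=["Joe", "Emma"])

# Selecting a single column returns a Series that keeps the row index
maths = df["Maths"]
print(type(maths).__name__)   # Series
print(maths.name)             # Maths
```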

When Pandas is used inside a Jupyter notebook (the most common case), one normally imports pandas under the alias `pd` and numpy under the alias `np` to save on typing. If plots are needed, the "magic" command `%matplotlib inline` is also run so that matplotlib plots are displayed directly in the notebook.

In [1]:
import pandas as pd
import numpy as np

## Series

A Series in Pandas is effectively a column of a spreadsheet. For example, if we have a spreadsheet with students' grades in different subjects, a Series can encode grades of all students in a particular subject.

Let's create a Series object. We need to instantiate an object of the class Series and pass a Python list with values to it.

In [2]:
s = pd.Series([65.1, 50.5, 83.0, 72.4])

In [3]:
# let's print it
s

0    65.1
1    50.5
2    83.0
3    72.4
dtype: float64

As you can see, a Series object includes the values we passed as well as an index. Here, the index is a range of integers from 0 to 3, similar to the positions used to index Python lists and NumPy arrays. We can access the index and the values separately:

In [4]:
s.index

RangeIndex(start=0, stop=4, step=1)

In [5]:
s.values

array([65.1, 50.5, 83. , 72.4])

You will notice that the values are in fact a NumPy array.
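A quick way to check this is sketched below. It uses `to_numpy()`, which returns the same kind of array as `.values`, and then passes the result to an ordinary NumPy function.

```python
import numpy as np
import pandas as pd

s = pd.Series([65.1, 50.5, 83.0, 72.4])

# Both .values and .to_numpy() hand back a NumPy array
arr = s.to_numpy()
print(isinstance(arr, np.ndarray))   # True

# NumPy functions therefore work on the values directly
print(np.max(arr))                   # 83.0
```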

We can pass our custom index when we create a Series, and we can specify the name of the series.

In [6]:
s = pd.Series([65.1, 50.5, 83.0, 72.4],
              index=["Joe", "Emma", "Max", "Lucy"],
              name="Maths")
s

Joe     65.1
Emma    50.5
Max     83.0
Lucy    72.4
Name: Maths, dtype: float64

Here, we use strings as the index for the values.
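The name and the labels can be read back from the Series afterwards. The sketch below also shows that a membership test with `in` checks the index labels; the label `"Zoe"` is a hypothetical name that is not in the index.

```python
import pandas as pd

s = pd.Series([65.1, 50.5, 83.0, 72.4],
              index=["Joe", "Emma", "Max", "Lucy"],
              name="Maths")

print(s.name)          # Maths
print(list(s.index))   # ['Joe', 'Emma', 'Max', 'Lucy']

# Membership tests look at the index labels
print("Max" in s)      # True
print("Zoe" in s)      # False
```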

We can also create a Series from a Python dictionary:

In [7]:
population = {'California': 38332521,
              'Texas': 26448193,
              'New York': 19651127,
              'Florida': 19552860,
              'Illinois': 12882135}
s = pd.Series(population)
s

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

The keys of the dictionary become the index, and the dictionary values become the values of the Series object.
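When a dictionary is combined with an explicit `index` argument, only the listed keys are used, in the order given. The sketch below uses a shortened version of the dictionary above; `'Ohio'` is a hypothetical key added to show what happens when a label is missing from the dictionary.

```python
import pandas as pd

population = {'California': 38332521,
              'Texas': 26448193,
              'New York': 19651127}

# Only the listed keys are used, in the order given;
# 'Ohio' is not in the dictionary, so its value is NaN
subset = pd.Series(population, index=['Texas', 'California', 'Ohio'])
print(subset)
```

The result has dtype `float64` rather than `int64`, because NaN is a floating-point value.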

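Now that we have created a Series, we can use the index to look up specific values in it. We can do that by integer position (similar to indexing a list or an array) or by string label (similar to retrieving a value by key from a dictionary). Because this Series has string labels, the positional lookup is written with `.iloc`; recent versions of Pandas deprecate plain `s[1]` for this purpose:

In [8]:
s.iloc[1]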

26448193

In [9]:
s['Texas']

26448193
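Plain square brackets become ambiguous when the labels are themselves integers. The `.loc` (by label) and `.iloc` (by position) accessors make the intent explicit. A minimal sketch, using a hypothetical Series whose integer labels do not start at 0:

```python
import pandas as pd

# Hypothetical Series whose labels are integers that do not start at 0
s = pd.Series([10.0, 20.0, 30.0], index=[2, 3, 4])

print(s.loc[2])    # 10.0, looked up by label
print(s.iloc[2])   # 30.0, looked up by position
```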

We can use the index to select a subset of values, using syntax that is similar to slicing a list or an array:

In [10]:
s[1:3]

Texas       26448193
New York    19651127
dtype: int64

In [11]:
s['Texas':'Florida']

Texas       26448193
New York    19651127
Florida     19552860
dtype: int64
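Note the difference between the two outputs: a slice by position excludes its end point, as in Python lists, whereas a slice by label includes it. The sketch below repeats both slices with the explicit `.iloc` and `.loc` accessors and checks how many elements each returns.

```python
import pandas as pd

s = pd.Series({'California': 38332521,
               'Texas': 26448193,
               'New York': 19651127,
               'Florida': 19552860,
               'Illinois': 12882135})

by_position = s.iloc[1:3]             # stop position 3 is excluded
by_label = s.loc['Texas':'Florida']   # stop label 'Florida' is included

print(len(by_position))   # 2
print(len(by_label))      # 3
```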

A mathematical operation can be applied to a Series object, in the same way as to a NumPy array. Let's convert the population values to millions:

In [12]:
s = s/10**6
s

California    38.332521
Texas         26.448193
New York      19.651127
Florida       19.552860
Illinois      12.882135
dtype: float64
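The operation is applied element by element and the index is carried over to the result. NumPy universal functions such as `np.sqrt` behave the same way when given a Series, as this short sketch with two of the values above suggests:

```python
import numpy as np
import pandas as pd

s = pd.Series({'California': 38.332521, 'Texas': 26.448193})

# Arithmetic is applied element by element; the index is preserved
doubled = s * 2

# NumPy universal functions also accept a Series and return a Series
rooted = np.sqrt(s)
print(type(rooted).__name__)   # Series
print(list(rooted.index))      # ['California', 'Texas']
```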

As with a NumPy array, we can use methods of the Series to compute the sum of the values, their mean, standard deviation, etc.:

In [13]:
s.sum()

116.866836

In [14]:
s.mean()

23.3733672

In [15]:
s.std()

9.64038558044315
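One difference from NumPy is worth keeping in mind: `Series.std()` computes the sample standard deviation by default (it divides by n - 1, i.e. `ddof=1`), whereas `np.std` divides by n (`ddof=0`). The sketch below compares the two conventions on the population values used above.

```python
import numpy as np
import pandas as pd

s = pd.Series([38.332521, 26.448193, 19.651127, 19.552860, 12.882135])

sample_std = s.std()               # pandas default: ddof=1
population_std = s.std(ddof=0)     # same convention as NumPy's default
numpy_std = np.std(s.to_numpy())   # NumPy default: ddof=0

print(np.isclose(population_std, numpy_std))   # True
print(sample_std > population_std)             # True
```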