# Getting to Know Pandas’ Data Structures

## Understanding Series Objects

Python’s most basic data structure is the list, which is also a good starting point for getting to know pandas.Series objects. Create a new Series object based on a list:



In [0]:
import pandas as pd

In [41]:
revenues = pd.Series([2000,2500,3000])
revenues

0    2000
1    2500
2    3000
dtype: int64

You’ve used the list [2000, 2500, 3000] to create a Series object called revenues. A Series object wraps two components:
1.   A sequence of **values**
2.   A sequence of **identifiers**, which is the index

You can access these components with .values and .index, respectively:



In [42]:
revenues.values

array([2000, 2500, 3000])

In [43]:
revenues.index

RangeIndex(start=0, stop=3, step=1)

In [44]:
type(revenues.values)

numpy.ndarray

While Pandas builds on NumPy, a significant difference is in their indexing. Just like a NumPy array, a Pandas Series also has an integer index that’s implicitly defined. This implicit index indicates the element’s position in the Series.
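
To make the comparison concrete, here is a small sketch that builds a NumPy array and a Series from the same numbers. The names `arr` and `s` are illustrative and do not appear elsewhere in this tutorial.

```python
import numpy as np
import pandas as pd

arr = np.array([2000, 2500, 3000])
s = pd.Series(arr)

# Both support positional access to the first element
first_from_array = arr[0]
first_from_series = s.iloc[0]

# Only the Series exposes its implicit index as an object of its own
implicit_index = list(s.index)

print(first_from_array, first_from_series, implicit_index)
```

The implicit index is simply the run of positions 0, 1, 2, which is why it lines up with the array positions.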

However, a Series can also have an arbitrary type of index. You can think of this explicit index as labels for a specific row:

In [0]:
city_revenues = pd.Series([50000,60000,80000], index=['NY','NJ','FL'])

In [46]:
city_revenues

NY    50000
NJ    60000
FL    80000
dtype: int64

Here, the index is a list of US state abbreviations represented by strings. Python dictionaries can use strings as keys as well, and this is a handy analogy to keep in mind! You can use the code blocks above to distinguish between two types of Series:

1.   **revenues**: This Series behaves like a Python list because it only has a positional index.
2.   **city_revenues**: This Series acts like a Python dictionary because it features both a positional and a label index.
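
The two access styles can be sketched with `.loc` (labels) and `.iloc` (positions). This is a minimal illustration that reuses the city_revenues data from above; the result variable names are made up for the example.

```python
import pandas as pd

city_revenues = pd.Series([50000, 60000, 80000], index=["NY", "NJ", "FL"])

by_label = city_revenues.loc["NJ"]     # dictionary-style lookup by label
by_position = city_revenues.iloc[1]    # list-style lookup by position

# Label slices include the end label; positional slices exclude the end position
label_slice = city_revenues.loc["NY":"NJ"]
position_slice = city_revenues.iloc[0:1]

print(by_label, by_position, len(label_slice), len(position_slice))
```

Using `.loc` and `.iloc` explicitly avoids ambiguity about whether a key is meant as a label or as a position.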

Here’s how to construct a Series with a label index from a Python dictionary:

In [0]:
city_unemployment = pd.Series({'NY':10,'NJ':8})

In [48]:
city_unemployment

NY    10
NJ     8
dtype: int64

The dictionary keys become the index, and the dictionary values are the Series values.
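
As a side note, the constructor also accepts an index argument together with a dictionary. The sketch below assumes you want a specific label order and includes a label ("FL") that the dictionary does not contain; the name `selected` is illustrative.

```python
import pandas as pd

unemployment = {"NY": 10, "NJ": 8}

# index= selects and orders the labels; "FL" has no entry in the dictionary
selected = pd.Series(unemployment, index=["NJ", "NY", "FL"])

# Converting a Series back gives a plain dictionary again
round_trip = pd.Series(unemployment).to_dict()

print(selected)
print(round_trip)
```

The label without a dictionary entry ends up as NaN, which is the same mechanism Pandas uses for missing values elsewhere.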

Just like dictionaries, Series also support .keys() and the in keyword:

In [49]:
city_unemployment.keys()

Index(['NY', 'NJ'], dtype='object')

In [50]:
'NY' in city_unemployment

True

In [51]:
'FL' in city_unemployment

False
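
One detail worth checking for yourself: like a dictionary, the in keyword tests the index labels, not the values. The following sketch also uses `.get()`, which mirrors `dict.get()`; the variable names are illustrative.

```python
import pandas as pd

city_unemployment = pd.Series({"NY": 10, "NJ": 8})

label_found = "NY" in city_unemployment        # "NY" is an index label
value_as_label = 10 in city_unemployment       # 10 is a value, not a label
value_found = 10 in city_unemployment.values   # search the values explicitly

# .get() returns a fallback instead of raising KeyError for a missing label
fl_rate = city_unemployment.get("FL", default=None)

print(label_found, value_as_label, value_found, fl_rate)
```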

## Understanding DataFrame Objects

While a Series is a pretty powerful data structure, it has its **limitations**. For example, you can only store one attribute per key. As you’ve seen with the nba dataset, which features 23 columns, the Pandas Python library has more to offer with its **DataFrame**. This data structure is a sequence of Series objects that share the same index.

If you’ve followed along with the Series examples, then you should already have two Series objects with state abbreviations as keys:


1.   city_revenues
2.   city_unemployment


You can combine these objects into a DataFrame by providing a dictionary in the constructor. **The dictionary keys will become the column names, and the values should contain the Series objects:**

In [0]:
city = pd.DataFrame({"revenue": city_revenues, "unemployment":city_unemployment})

In [53]:
city

|    | revenue | unemployment |
|----|---------|--------------|
| FL | 80000   | NaN          |
| NJ | 60000   | 8.0          |
| NY | 50000   | 10.0         |


Note how Pandas replaced the missing unemployment value for FL with NaN.
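
A side effect of the NaN is visible in the output: the unemployment values appear as 8.0 and 10.0 rather than 8 and 10. The sketch below rebuilds the same DataFrame and inspects this; `unemployment_kind` and `missing_labels` are names introduced only for the illustration.

```python
import pandas as pd

city_revenues = pd.Series([50000, 60000, 80000], index=["NY", "NJ", "FL"])
city_unemployment = pd.Series({"NY": 10, "NJ": 8})
city = pd.DataFrame({"revenue": city_revenues, "unemployment": city_unemployment})

# NaN is a float, so the integer unemployment values are stored as floats
unemployment_kind = city["unemployment"].dtype.kind   # "f" means float

# isna() flags the missing entries, which can be used to find their labels
missing_labels = list(city.index[city["unemployment"].isna()])

print(unemployment_kind, missing_labels)
```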

The new DataFrame index is the union of the two Series indices:

In [54]:
city.values

array([[8.e+04,    nan],
       [6.e+04, 8.e+00],
       [5.e+04, 1.e+01]])

In [55]:
city.index

Index(['FL', 'NJ', 'NY'], dtype='object')

In [56]:
city.axes

[Index(['FL', 'NJ', 'NY'], dtype='object'),
 Index(['revenue', 'unemployment'], dtype='object')]

The axis marked with **0** is the **row** index, and the axis marked with 1 is the **column** index. This terminology is important to know because you’ll encounter several DataFrame methods that accept an axis parameter.
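
Here is a brief sketch of how the axis parameter changes a method's behavior, using `.count()` (which counts non-missing values) and `.drop()`. It rebuilds the city DataFrame so that it runs on its own; the result names are illustrative.

```python
import pandas as pd

city_revenues = pd.Series([50000, 60000, 80000], index=["NY", "NJ", "FL"])
city_unemployment = pd.Series({"NY": 10, "NJ": 8})
city = pd.DataFrame({"revenue": city_revenues, "unemployment": city_unemployment})

# axis=0 works down the rows and yields one result per column
non_missing_per_column = city.count(axis=0)

# axis=1 works across the columns and yields one result per row
non_missing_per_row = city.count(axis=1)

# For drop(), axis=1 means the label refers to a column
without_unemployment = city.drop("unemployment", axis=1)

print(non_missing_per_column, non_missing_per_row, sep="\n")
```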

In [57]:
city.axes[0]

Index(['FL', 'NJ', 'NY'], dtype='object')

In [58]:
city.axes[1]

Index(['revenue', 'unemployment'], dtype='object')

In [59]:
city.keys()

Index(['revenue', 'unemployment'], dtype='object')

## Combining Multiple Datasets

Now, you’ll take this one step further and use pd.concat() to combine city with another DataFrame. Say you’ve managed to gather some data on two more states:

In [0]:
further = pd.DataFrame(
    {"revenue":[30000,40000],"unemployment":[100,300]}, 
    index=['CA','PA']
)

In [61]:
All_concat_with_row = pd.concat([city, further], sort=False)
All_concat_with_row

|    | revenue | unemployment |
|----|---------|--------------|
| FL | 80000   | NaN          |
| NJ | 60000   | 8.0          |
| NY | 50000   | 10.0         |
| CA | 30000   | 100.0        |
| PA | 40000   | 300.0        |
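
When stacking rows like this, pd.concat() does not check whether a label already exists. The sketch below shows one way to guard against that with the verify_integrity parameter. The `overlap` DataFrame and its numbers are hypothetical and exist only to trigger the check.

```python
import pandas as pd

further = pd.DataFrame(
    {"revenue": [30000, 40000], "unemployment": [100, 300]},
    index=["CA", "PA"],
)
# Hypothetical second batch that repeats the "CA" label
overlap = pd.DataFrame({"revenue": [35000], "unemployment": [120]}, index=["CA"])

# By default, both "CA" rows are kept without any warning
stacked = pd.concat([further, overlap])

# verify_integrity=True raises ValueError when index labels overlap
try:
    pd.concat([further, overlap], verify_integrity=True)
    raised = False
except ValueError:
    raised = True

print(len(stacked), stacked.index.is_unique, raised)
```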


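Choice 1: You can also pass axis=1 to pd.concat() to add columns instead of rows. By default, Pandas keeps the union of the row labels and fills the gaps with NaN. If you want to keep only the labels that appear in both DataFrame objects, then you can set the **join** parameter to inner: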


In [62]:
city_countries = pd.DataFrame({
    "country": ["US", "US", "US", "US", "US"],
    "capital": [1, 1, 0, 0, 0]},
    index=["FL", "NJ", "NY", "CA", "PA"])
All_concat_with_column = pd.concat([All_concat_with_row, city_countries], axis = 1, join = "inner", sort=False)
All_concat_with_column

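|    | revenue | unemployment | country | capital |
|----|---------|--------------|---------|---------|
| FL | 80000   | NaN          | US      | 1       |
| NJ | 60000   | 8.0          | US      | 1       |
| NY | 50000   | 10.0         | US      | 0       |
| CA | 30000   | 100.0        | US      | 0       |
| PA | 40000   | 300.0        | US      | 0       |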
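
In the example above both DataFrame objects share exactly the same five labels, so join="inner" does not drop anything. The sketch below uses two small, hypothetical frames (`left` and `right`) whose labels only partly overlap, so the effect of the join parameter becomes visible.

```python
import pandas as pd

left = pd.DataFrame({"revenue": [80000, 60000, 50000]}, index=["FL", "NJ", "NY"])
# Hypothetical frame that lacks "FL" and adds "CA"
right = pd.DataFrame({"capital": [1, 0, 0]}, index=["NJ", "NY", "CA"])

# Default join="outer": union of labels, gaps filled with NaN
outer = pd.concat([left, right], axis=1)

# join="inner": only labels present in both frames
inner = pd.concat([left, right], axis=1, join="inner")

print(outer)
print(inner)
```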


Choice 2: You can use .merge() to implement a join operation similar to the one from SQL:



In [63]:
countries = pd.DataFrame({
    "population_millions": [17, 127, 37],
    "continent": ["North America", "North America","North America"]
    }, index= ["US", "US", "US"])
pd.merge(All_concat_with_column, countries, left_on="country", right_index=True)

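|    | revenue | unemployment | country | capital | population_millions | continent     |
|----|---------|--------------|---------|---------|---------------------|---------------|
| FL | 80000   | NaN          | US      | 1       | 17                  | North America |
| FL | 80000   | NaN          | US      | 1       | 127                 | North America |
| FL | 80000   | NaN          | US      | 1       | 37                  | North America |
| NJ | 60000   | 8.0          | US      | 1       | 17                  | North America |
| NJ | 60000   | 8.0          | US      | 1       | 127                 | North America |
| NJ | 60000   | 8.0          | US      | 1       | 37                  | North America |
| NY | 50000   | 10.0         | US      | 0       | 17                  | North America |
| NY | 50000   | 10.0         | US      | 0       | 127                 | North America |
| NY | 50000   | 10.0         | US      | 0       | 37                  | North America |
| CA | 30000   | 100.0        | US      | 0       | 17                  | North America |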
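
Each row appears three times in this output because the index of countries contains the label "US" three times, so every left row matches three right rows (with five left rows, that makes 15 result rows in total, of which the output above shows the first ones). The sketch below reproduces the effect on a smaller, hypothetical `cities` frame and shows how the validate parameter of pd.merge() can reject such duplicates.

```python
import pandas as pd

cities = pd.DataFrame({"country": ["US", "US"]}, index=["FL", "NJ"])
countries = pd.DataFrame(
    {"population_millions": [17, 127, 37]},
    index=["US", "US", "US"],  # duplicate labels, as in the example above
)

# 2 left rows x 3 matching right rows = 6 result rows
merged = pd.merge(cities, countries, left_on="country", right_index=True)

# validate="many_to_one" requires the right-hand keys to be unique
try:
    pd.merge(cities, countries, left_on="country", right_index=True,
             validate="many_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True

print(len(merged), raised)
```

If one row per country is what you intend, giving countries a unique index avoids the multiplication.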


In [64]:
pd.merge(All_concat_with_column, countries,
         left_on="country",
         right_index=True,
         how="left"
         )

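|    | revenue | unemployment | country | capital | population_millions | continent     |
|----|---------|--------------|---------|---------|---------------------|---------------|
| FL | 80000   | NaN          | US      | 1       | 17                  | North America |
| FL | 80000   | NaN          | US      | 1       | 127                 | North America |
| FL | 80000   | NaN          | US      | 1       | 37                  | North America |
| NJ | 60000   | 8.0          | US      | 1       | 17                  | North America |
| NJ | 60000   | 8.0          | US      | 1       | 127                 | North America |
| NJ | 60000   | 8.0          | US      | 1       | 37                  | North America |
| NY | 50000   | 10.0         | US      | 0       | 17                  | North America |
| NY | 50000   | 10.0         | US      | 0       | 127                 | North America |
| NY | 50000   | 10.0         | US      | 0       | 37                  | North America |
| CA | 30000   | 100.0        | US      | 0       | 17                  | North America |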
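
The output with how="left" matches the default (inner) merge here because every row has the country "US", which is present in countries. A difference only shows up when a key on the left has no match on the right. The following sketch uses hypothetical data (a "MX" row with no matching country entry) to illustrate this.

```python
import pandas as pd

# Hypothetical rows: "MX" has no entry in the countries frame below
cities = pd.DataFrame({"country": ["US", "MX"]}, index=["FL", "CUN"])
countries = pd.DataFrame({"continent": ["North America"]}, index=["US"])

# The default how="inner" keeps only rows whose key is found on both sides
inner = pd.merge(cities, countries, left_on="country", right_index=True)

# how="left" keeps every left row and fills unmatched columns with NaN
left = pd.merge(cities, countries, left_on="country", right_index=True, how="left")

print(inner)
print(left)
```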
