
In [1]:
series = {
    'index': [0, 1, 2, 3],
    'data': [145, 142, 38, 13],
    'name': 'songs'
}

In [2]:
def get(series, index):
  value_idx = series['index'].index(index)
  return series['data'][value_idx]

In [3]:
get(series=series, index=3)

13

In [4]:
songs = {
    'index': ['Paul', 'John', 'George', 'Ringo'],
    'data': [145, 142, 38, 13],
    'name': 'counts'
}

In [5]:
def get(series, index):
  value_idx = series['index'].index(index)
  return series['data'][value_idx]

In [6]:
get(series=songs, index='John')

142

Pandas series.

In [7]:
import pandas as pd
songs = pd.Series([145, 142, 38, 13], name='counts')

In [8]:
songs

0    145
1    142
2     38
3     13
Name: counts, dtype: int64

In [9]:
songs.index

RangeIndex(start=0, stop=4, step=1)

In [10]:
songs3 = pd.Series([145, 142, 38, 13],
                   name='counts',
                   index=['Paul', 'John', 'George', 'Ringo']
                   )

In [11]:
songs3

Paul      145
John      142
George     38
Ringo      13
Name: counts, dtype: int64

In [12]:
songs3.index

Index(['Paul', 'John', 'George', 'Ringo'], dtype='object')
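To connect this back to the dictionary model at the top, here is a minimal sketch in which `.loc` plays the role of the `get` helper. The variable name `beatles` is only illustrative and does not appear elsewhere in the notebook.

```python
import pandas as pd

# Same data as the dictionary-based model, now as a real Series
beatles = pd.Series([145, 142, 38, 13],
                    name='counts',
                    index=['Paul', 'John', 'George', 'Ringo'])

# Label-based lookup, analogous to get(series=songs, index='John')
john_count = beatles.loc['John']

# Position-based lookup, analogous to indexing the 'data' list directly
last_count = beatles.iloc[3]

print(john_count, last_count)
```

`.loc` looks values up by index label, while `.iloc` looks them up by position, so the two stay unambiguous even when the labels are integers.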

The actual data (or values) of a series does not have to be numeric or homogeneous. We can insert arbitrary Python objects into a series.

In [13]:
class Foo():
  pass

ringo = pd.Series(['Richard', 'Starkey', 13, Foo()], name='Ringo')

In [14]:
ringo

0                                    Richard
1                                    Starkey
2                                         13
3    <__main__.Foo object at 0x7c1136dfd5d0>
Name: Ringo, dtype: object
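A small sketch of what the `object` dtype means in practice: each element keeps its own Python type. The name `mixed` is made up for this example.

```python
import pandas as pd

# Strings and an integer in the same series
mixed = pd.Series(['Richard', 'Starkey', 13], name='Ringo')

# pandas falls back to the generic object dtype for mixed values
print(mixed.dtype)

# Each element is still the original Python object
types = [type(value).__name__ for value in mixed]
print(types)
```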

The NaN value

In [15]:
import numpy as np
nan_series = pd.Series([2, np.nan], index=["Ono", "Clapton"])

In [16]:
nan_series

Ono        2.0
Clapton    NaN
dtype: float64

When pandas sees numeric data (the number 2) together with np.nan, it coerces the 2 to a float value, because NaN is itself a float.
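The coercion can be checked directly by comparing dtypes with and without a missing value. This is only a sketch; `ints_only` and `with_nan` are illustrative names.

```python
import numpy as np
import pandas as pd

ints_only = pd.Series([2, 3])
with_nan = pd.Series([2, np.nan])

# Without missing values the data stays an integer dtype
print(ints_only.dtype)

# np.nan is a float, so the whole series is stored as float64
print(with_nan.dtype)
print(type(np.nan))
```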

Here is an example of how pandas ignores NaN values.
The `.count()` method, which counts the number of values in a series, disregards NaN.

In [17]:
nan_series.count()

1

Use `.size` to inspect the number of entries (including missing values).

In [18]:
nan_series.size

2
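The difference between `.size` and `.count()` is the number of missing entries. A quick sketch, assuming we only want that number; `ser` and `n_missing` are illustrative names:

```python
import numpy as np
import pandas as pd

ser = pd.Series([2, np.nan], index=['Ono', 'Clapton'])

# .isna() marks missing entries with True; summing counts them
n_missing = ser.isna().sum()

# Same number as total entries minus non-missing entries
same_as_difference = n_missing == ser.size - ser.count()

print(n_missing, same_as_difference)
```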

Optional integer support for NaN.

In [19]:
nan_series2 = pd.Series([2, np.nan], index=['Ono', 'Clapton'], dtype='Int64')
nan_series2

Ono           2
Clapton    <NA>
dtype: Int64

Operations on these series still ignore missing values (NaN or `<NA>`).

In [20]:
nan_series2.count()

1

In [21]:
nan_series2.size

2
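Aggregations behave the same way. The sketch below uses a slightly longer, made-up series so the effect is visible, and assumes the default `skipna=True` behaviour of `.sum()` and `.mean()`.

```python
import pandas as pd

# Nullable integer series with one missing entry
ser = pd.Series([2, pd.NA, 4], dtype='Int64')

total = ser.sum()      # missing entry is skipped
average = ser.mean()   # mean over the two present values

# Turning skipping off lets the missing value propagate
strict_total = ser.sum(skipna=False)

print(total, average, strict_total)
```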

Another way to change the dtype of a series is the `.astype` method.

In [22]:
nan_series

Ono        2.0
Clapton    NaN
dtype: float64

In [23]:
print(nan_series.astype('Int64'))


Ono           2
Clapton    <NA>
dtype: Int64


A series is similar to a NumPy array.

In [24]:
songs3

Paul      145
John      142
George     38
Ringo      13
Name: counts, dtype: int64

In [25]:
import numpy as np
numpy_ser = np.array([145, 142, 38, 13])

In [26]:
print(songs3.iloc[0])  # positional access; songs3[0] on a label index is deprecated in recent pandas
print(numpy_ser[0])

145
145


They both have methods in common:

In [27]:
songs3.mean()

84.5

In [28]:
numpy_ser.mean()

84.5

They also both have the notion of a boolean array. For a series, a boolean array is a series with the same index as the one we are working with, holding boolean values. It can be used as a mask to filter out items.

In [29]:
mask = songs3 > songs3.median()
mask

Paul       True
John       True
George    False
Ringo     False
Name: counts, dtype: bool

In [30]:
songs3[mask]

Paul    145
John    142
Name: counts, dtype: int64

In [31]:
mask2 = numpy_ser > np.median(numpy_ser)
mask2

array([ True,  True, False, False])

In [32]:
numpy_ser[mask2]

array([145, 142])
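Masks can also be combined. A minimal sketch using `&` (and) and `~` (not) on a series; the names `counts`, `above_median` and `even` are illustrative only.

```python
import pandas as pd

counts = pd.Series([145, 142, 38, 13],
                   name='counts',
                   index=['Paul', 'John', 'George', 'Ringo'])

above_median = counts > counts.median()
even = counts % 2 == 0

# Element-wise AND of two masks
both = counts[above_median & even]

# Element-wise NOT inverts a mask
not_above = counts[~above_median]

print(both)
print(not_above)
```

Python's `and`, `or` and `not` do not work element-wise here, which is why the operators `&`, `|` and `~` are used instead.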

Categorical data.
When we load data, we can indicate that the data is categorical. If we know that our data is limited to a few values, we might want to use categorical data.

Categorical values have a few benefits:


*   Use less memory than strings.
*   Improve performance.
*   Can have an ordering.
*   Can perform operations on categories.
*   Enforce membership on values.
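The memory claim can be checked with a rough sketch. It assumes a series with few distinct values repeated many times; the exact byte counts depend on the pandas version, so none are shown.

```python
import pandas as pd

# Five distinct labels repeated 1000 times
sizes = pd.Series(['m', 'l', 'xs', 's', 'xl'] * 1000)
as_category = sizes.astype('category')

# deep=True also counts the memory used by the string values
string_bytes = sizes.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(string_bytes, category_bytes)
```

The categorical version stores each distinct label once plus a small integer code per row, which is where the saving comes from.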



In [33]:
s = pd.Series(['m', 'l', 'xs', 's', 'xl'], dtype='category')
s

0     m
1     l
2    xs
3     s
4    xl
dtype: category
Categories (5, object): ['l', 'm', 's', 'xl', 'xs']

If this series represents the size, there is a natural ordering as a small is less than a medium. By default, categories don't have an ordering. We can verify this by inspecting the `.cat` attribute that has various properties.

In [34]:
s.cat.ordered

False

To convert a non-categorical series to an ordered category, we can create a type with the `CategoricalDtype` constructor and the appropriate parameters.
Then we pass this type into the `.astype` method:

In [36]:
s2 = pd.Series(['m', 'l', 'xs', 's', 'xl'])
size_type = pd.api.types.CategoricalDtype(categories=['s', 'm', 'l'], ordered=True)
s3 = s2.astype(size_type)
s3

0      m
1      l
2    NaN
3      s
4    NaN
dtype: category
Categories (3, object): ['s' < 'm' < 'l']

In this case, we limited the categories to just 's', 'm', and 'l', but the data had values that were not in those categories. Converting the data to a category type replaces those extra values with NaN.
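Membership is also enforced after conversion. The sketch below tries to assign a value outside the categories; the value `'xxl'` is made up, and both exception types are caught because the one raised has varied across pandas versions.

```python
import pandas as pd

size_type = pd.api.types.CategoricalDtype(categories=['s', 'm', 'l'],
                                          ordered=True)
sizes = pd.Series(['m', 'l', 's']).astype(size_type)

try:
    # 'xxl' is not one of the declared categories
    sizes.iloc[0] = 'xxl'
    rejected = False
except (TypeError, ValueError) as exc:
    rejected = True
    print(type(exc).__name__, exc)

print(rejected)
```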

Once the categories are ordered, we can do comparisons:

In [37]:
s3 > 's'

0     True
1     True
2    False
3    False
4    False
dtype: bool
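Ordering also affects sorting and `min`/`max`, which follow the category order rather than alphabetical order. A minimal sketch with all five sizes declared; the variable names are illustrative.

```python
import pandas as pd

size_type = pd.api.types.CategoricalDtype(
    categories=['xs', 's', 'm', 'l', 'xl'], ordered=True)
sizes = pd.Series(['m', 'l', 'xs', 's', 'xl']).astype(size_type)

# Sorted by the declared category order, not alphabetically
in_order = sizes.sort_values().tolist()

smallest = sizes.min()
largest = sizes.max()

print(in_order, smallest, largest)
```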

The prior example created a new *Series* from existing data that was not categorical. We can also add ordering information to existing categorical data. We just need to make sure that we specify all of the members of the category, or pandas will throw a *ValueError*:

In [40]:
new_s = s.cat.reorder_categories(['xs', 's', 'm', 'l', 'xl'], ordered=True)
new_s

0     m
1     l
2    xs
3     s
4    xl
dtype: category
Categories (5, object): ['xs' < 's' < 'm' < 'l' < 'xl']
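For contrast, a sketch of the failure case mentioned above: leaving out some of the existing categories when reordering.

```python
import pandas as pd

s = pd.Series(['m', 'l', 'xs', 's', 'xl'], dtype='category')

try:
    # 'xs' and 'xl' are missing from the new category list
    s.cat.reorder_categories(['s', 'm', 'l'], ordered=True)
    raised = False
except ValueError as exc:
    raised = True
    print(exc)

print(raised)
```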

| Command      | Description |
| ----------- | ----------- |
| `pd.Series(data=None, index=None, dtype=None, name=None, copy=None)`      | Create a series from data (sequence, dictionary or scalar).       |
| `s.index`   | Access index of series        |
| `s.astype(dtype, errors='raise')`   | Cast a series to `dtype`. To ignore errors (and return original object) use `errors='ignore'`        |
| `s[boolean_array]`   | Return values from `s` where boolean_array is `True`        |
| `s.cat.ordered`   | Determines if a categorical series is ordered        |
| `s.cat.reorder_categories(new_categories, ordered=None)`   | Reorder the categories of the series (optionally marking them as ordered). `new_categories` must include all existing categories. We usually set `ordered=True`       |