# Series

A pandas Series is the fundamental building block of data analysis in pandas and the closest analogue to an R vector. A Series is a one-dimensional labeled array: it combines the vectorized behavior of an R vector with an explicit index. Understanding Series is crucial because each column of a DataFrame is a Series.

In [2]:
import pandas as pd
import numpy as np

## Key Methods Summary

Here's a quick reference comparing R vector operations to pandas Series methods:

| Operation | R | Pandas Series |
|-----------|---|---------------|
| Create | `c(1, 2, 3)` | `pd.Series([1, 2, 3])` |
| Length | `length(x)` | `len(s)` or `s.size` |
| Sum | `sum(x)` | `s.sum()` |
| Mean | `mean(x)` | `s.mean()` |
| Summary | `summary(x)` | `s.describe()` |
| Unique | `unique(x)` | `s.unique()` |
| Sort | `sort(x)` | `s.sort_values()` |
| Missing check | `is.na(x)` | `s.isna()` |
| Value counts | `table(x)` | `s.value_counts()` |
| Subset | `x[x > 5]` | `s[s > 5]` |
| Apply function | `sapply(x, fun)` | `s.apply(fun)` |

## Creating Series

Just as R builds vectors with `c()`, pandas offers several ways to create a Series:

In [4]:
# Pandas: from a list
numbers = pd.Series([10, 20, 30, 40, 50])
numbers

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [6]:
# From a NumPy array (more efficient for large data)
arr = np.array([1.5, 2.5, 3.5, 4.5])
float_series = pd.Series(arr)
float_series

0    1.5
1    2.5
2    3.5
3    4.5
dtype: float64

In [8]:
# From a dictionary (keys become the index, values become the data)
dict_series = pd.Series({"a": 10, "b": 20, "c": 30})
dict_series

a    10
b    20
c    30
dtype: int64

In [9]:
# With explicit index (like R's named vectors)
# R: setNames(c(10, 20, 30), c("x", "y", "z"))
named_series = pd.Series([10, 20, 30], index=["x", "y", "z"])
named_series

x    10
y    20
z    30
dtype: int64
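
The `pd.Series` constructor also accepts optional `name` and `dtype` arguments, and a scalar value is repeated across an explicit index. A minimal sketch; the variable names and values are illustrative and do not come from the examples above:

```python
import pandas as pd

# Name the Series and force a float dtype (R: as.numeric(c(1, 2, 3)))
scores = pd.Series([1, 2, 3], name="score", dtype="float64")
print(scores)

# A scalar is repeated for every index label
# R: setNames(rep(0, 3), c("a", "b", "c"))
zeros = pd.Series(0, index=["a", "b", "c"])
print(zeros)
```

The `name` matters later: it becomes the column name when the Series is placed in a DataFrame.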

## The Index Concept

The biggest difference from R vectors is that every Series has an index. While R has names for vectors, pandas makes the index a core feature:

In [10]:
# Default index (0-based, like Python)
series = pd.Series([100, 200, 300])
print("Values:", series.values)  # Like unname() in R
print("Index:", series.index)

Values: [100 200 300]
Index: RangeIndex(start=0, stop=3, step=1)


In [11]:
# Custom index
grades = pd.Series([85, 92, 78, 95], 
                   index=["Alice", "Bob", "Charlie", "David"])
print(grades)

Alice      85
Bob        92
Charlie    78
David      95
dtype: int64


In [13]:
# Accessing by index (like R's named vector access)
print("Bob's grade:", grades["Bob"])  # R: grades["Bob"]
print("Using loc:", grades.loc["Bob"])  # More explicit
print("Using position:", grades.iloc[1])  # R: grades[2] (R is 1-indexed!)

Bob's grade: 92
Using loc: 92
Using position: 92
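
Slicing follows the same label/position split. One detail that often surprises R users: a label slice with `.loc` includes both endpoints, while a positional slice with `.iloc` excludes the stop position, as in ordinary Python. A short sketch that rebuilds `grades` so it runs on its own:

```python
import pandas as pd

grades = pd.Series([85, 92, 78, 95],
                   index=["Alice", "Bob", "Charlie", "David"])

# Label-based slice: both "Bob" and "Charlie" are included
by_label = grades.loc["Bob":"Charlie"]

# Position-based slice: position 1 up to, but not including, position 3
by_position = grades.iloc[1:3]          # R: grades[2:3]

print(by_label)
print(by_position)
```

Both slices select Bob and Charlie, even though the stop values look different.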


## Basic Attributes and Methods

Series objects have many useful attributes and methods that go beyond R's vector capabilities:

In [15]:
# Create a sample series
data = pd.Series([15, 28, 33, 45, 22, 38, 40, 55, 18, 30])

### Basic attributes

In [16]:
print(f"Length: {len(data)}")          # R: length(data)
print(f"Shape: {data.shape}")          # Returns tuple (n,)
print(f"Size: {data.size}")            # Same as length
print(f"Data type: {data.dtype}")      # R: typeof(data)

Length: 10
Shape: (10,)
Size: 10
Data type: int64


### Basic statistics

In [17]:
print(f"Mean: {data.mean()}")          # R: mean(data)
print(f"Median: {data.median()}")      # R: median(data)
print(f"Std Dev: {data.std()}")        # R: sd(data)
print(f"Min: {data.min()}")            # R: min(data)
print(f"Max: {data.max()}")            # R: max(data)

Mean: 32.4
Median: 31.5
Std Dev: 12.48287716122458
Min: 15
Max: 55


### Describe

In [19]:
# R's summary() equivalent
data.describe()

count    10.000000
mean     32.400000
std      12.482877
min      15.000000
25%      23.500000
50%      31.500000
75%      39.500000
max      55.000000
dtype: float64
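
When only some of these statistics are needed, `quantile()` computes specific percentiles directly, much like R's `quantile()`. A small sketch using the same numbers as `data`:

```python
import pandas as pd

data = pd.Series([15, 28, 33, 45, 22, 38, 40, 55, 18, 30])

# Quartiles only; the result is a Series indexed by the requested probabilities
quartiles = data.quantile([0.25, 0.5, 0.75])   # R: quantile(data, c(0.25, 0.5, 0.75))
print(quartiles)

# Interquartile range (R: IQR(data))
iqr = quartiles.loc[0.75] - quartiles.loc[0.25]
print("IQR:", iqr)
```

The quartiles match the `25%`, `50%`, and `75%` rows of `describe()` above.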

## Vectorized Operations


Like R, pandas Series support vectorized operations, making calculations efficient and readable:

In [20]:
s1 = pd.Series([10, 20, 30, 40])
s2 = pd.Series([1, 2, 3, 4])

### Arithmetic operations

In [21]:
print("Addition:", s1 + s2)          # R: s1 + s2
print("Multiplication:", s1 * s2)    # R: s1 * s2
print("Power:", s1 ** 2)             # R: s1^2

Addition: 0    11
1    22
2    33
3    44
dtype: int64
Multiplication: 0     10
1     40
2     90
3    160
dtype: int64
Power: 0     100
1     400
2     900
3    1600
dtype: int64


In [22]:
# With scalars
print("Add 5:", s1 + 5)              # R: s1 + 5
print("Divide by 10:", s1 / 10)     # R: s1 / 10

Add 5: 0    15
1    25
2    35
3    45
dtype: int64
Divide by 10: 0    1.0
1    2.0
2    3.0
3    4.0
dtype: float64


### Mathematical functions

In [23]:
print("Square root:", np.sqrt(s1))   # R: sqrt(s1)
print("Log:", np.log(s1))           # R: log(s1)

Square root: 0    3.162278
1    4.472136
2    5.477226
3    6.324555
dtype: float64
Log: 0    2.302585
1    2.995732
2    3.401197
3    3.688879
dtype: float64


## Boolean Operations and Filtering

In [24]:
temps = pd.Series([65, 72, 68, 75, 62, 78, 70], 
                  index=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"])

In [25]:
# Boolean comparisons (like R)
print("Greater than 70:")
print(temps > 70)                    # R: temps > 70

Greater than 70:
Mon    False
Tue     True
Wed    False
Thu     True
Fri    False
Sat     True
Sun    False
dtype: bool


### Using `[`

In [26]:
# Filtering (subsetting)
print("Days above 70:")
print(temps[temps > 70])

Days above 70:
Tue    72
Thu    75
Sat    78
dtype: int64


In [27]:
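# Combine conditions with & (and) or | (or); parenthesize each comparison
print("Between 65 and 75:")
print(temps[(temps > 65) & (temps < 75)])   # R: temps[temps > 65 & temps < 75]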

Between 65 and 75:
Tue    72
Wed    68
Sun    70
dtype: int64
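
For range and membership tests, pandas also has dedicated methods. The sketch below uses `between()`, which is inclusive on both ends by default, `isin()` as a counterpart to R's `%in%`, and `~` for negation. It rebuilds `temps` so it runs on its own:

```python
import pandas as pd

temps = pd.Series([65, 72, 68, 75, 62, 78, 70],
                  index=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"])

# Inclusive range: 65 <= temp <= 75
in_range = temps[temps.between(65, 75)]

# Membership test (R: temps[temps %in% c(62, 78)])
extremes = temps[temps.isin([62, 78])]

# Negation with ~ (R: temps[!(temps > 70)])
not_above_70 = temps[~(temps > 70)]

print(in_range)
print(extremes)
print(not_above_70)
```

Because `between()` includes its endpoints, Monday (65) and Thursday (75) appear here but not in the strict comparison above.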


### Using `.where`

In [28]:
print("Where > 70 (keeps original index):")
print(temps.where(temps > 70))  # NaN for False conditions

Where > 70 (keeps original index):
Mon     NaN
Tue    72.0
Wed     NaN
Thu    75.0
Fri     NaN
Sat    78.0
Sun     NaN
dtype: float64
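
`.where()` also takes a replacement value as its second argument, and `.mask()` is its inverse: it replaces values where the condition is True. A brief sketch, again rebuilding `temps`:

```python
import pandas as pd

temps = pd.Series([65, 72, 68, 75, 62, 78, 70],
                  index=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"])

# Keep values above 70, replace everything else with 0
# R: ifelse(temps > 70, temps, 0)
floored = temps.where(temps > 70, 0)

# Inverse: hide values above 70 (they become NaN), keep the rest
masked = temps.mask(temps > 70)

print(floored)
print(masked)
```

Both results keep all seven index labels, which is the main difference from filtering with `[`.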


## String Operations

For string data, pandas provides vectorized string methods through the `.str` accessor:

In [29]:
# String series
names = pd.Series(["Alice", "bob", "CHARLIE", "  David  ", "Eve"])

In [30]:
# R equivalent using stringr or base R
print("Uppercase:", names.str.upper())        # R: toupper(names)
print("Lowercase:", names.str.lower())        # R: tolower(names)
print("Length:", names.str.len())             # R: nchar(names)
print("Strip whitespace:", names.str.strip()) # R: trimws(names)

Uppercase: 0        ALICE
1          BOB
2      CHARLIE
3      DAVID  
4          EVE
dtype: object
Lowercase: 0        alice
1          bob
2      charlie
3      david  
4          eve
dtype: object
Length: 0    5
1    3
2    7
3    9
4    3
dtype: int64
Strip whitespace: 0      Alice
1        bob
2    CHARLIE
3      David
4        Eve
dtype: object


In [31]:
# Contains pattern (like grepl in R)
print(names.str.contains('a', case=False))    # R: grepl("a", names, ignore.case=TRUE)

0     True
1    False
2     True
3     True
4    False
dtype: bool
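
Each `.str` method returns a new Series, so cleaning steps can be chained, with `.str` repeated at each step. A small sketch that normalizes the same names:

```python
import pandas as pd

names = pd.Series(["Alice", "bob", "CHARLIE", "  David  ", "Eve"])

# Trim whitespace, then capitalize the first letter of each name
# R (stringr): str_to_title(str_trim(names))
clean = names.str.strip().str.title()
print(clean)

# Prefix test (R: startsWith(clean, "A"))
starts_with_a = clean.str.startswith("A")
print(starts_with_a)
```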


## Handling Missing Data

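Unlike R's universal `NA`, pandas represents missing values in more than one way, depending on the dtype: `NaN` for floats, `None` in object data, `NaT` for datetimes, and `pd.NA` for the nullable extension dtypes. Methods such as `isna()` treat all of them as missing: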

In [33]:
# Series with missing values
# R: c(1, 2, NA, 4, NA, 6)
s_missing = pd.Series([1, 2, np.nan, 4, np.nan, 6])

print("Original series:")
print(s_missing)

Original series:
0    1.0
1    2.0
2    NaN
3    4.0
4    NaN
5    6.0
dtype: float64
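
Missing values are not always written as `np.nan`. The sketch below shows that, with default settings, `None` in a numeric list is converted to `NaN`, and that the nullable `Int64` dtype stores `pd.NA` while keeping the other values as integers. The variable names are illustrative:

```python
import pandas as pd

# None in numeric data becomes NaN, and the dtype becomes float64
from_none = pd.Series([1, None, 3])

# The nullable integer dtype keeps whole numbers and uses pd.NA for gaps
nullable = pd.Series([1, pd.NA, 3], dtype="Int64")

# isna() flags the missing entry in both cases
print(from_none.isna().tolist())
print(nullable.isna().tolist())
print(from_none.dtype, nullable.dtype)
```

Note the capital `I` in `"Int64"`: the lowercase `"int64"` dtype cannot hold missing values.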


In [34]:
# Check for missing values
print("Is null:", s_missing.isna())      # R: is.na(s_missing)
print("Not null:", s_missing.notna())    # R: !is.na(s_missing)

Is null: 0    False
1    False
2     True
3    False
4     True
5    False
dtype: bool
Not null: 0     True
1     True
2    False
3     True
4    False
5     True
dtype: bool


In [35]:
# Operations with missing values (pandas skips NaN by default)
print("Sum:", s_missing.sum())           # R: sum(s_missing, na.rm=TRUE)
print("Mean:", s_missing.mean())         # R: mean(s_missing, na.rm=TRUE)

Sum: 13.0
Mean: 3.25
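
Skipping missing values by default is the opposite of R, where `sum()` and `mean()` return `NA` unless `na.rm = TRUE` is set. To reproduce R's default behavior, pass `skipna=False`. A short sketch:

```python
import numpy as np
import pandas as pd

s_missing = pd.Series([1, 2, np.nan, 4, np.nan, 6])

total_skipping = s_missing.sum()               # R: sum(x, na.rm = TRUE)
total_strict = s_missing.sum(skipna=False)     # R: sum(x)
n_present = s_missing.count()                  # R: sum(!is.na(x))

print(total_skipping)
print(total_strict)
print(n_present)
```

`count()` reports only the non-missing values, so it can differ from `len()`.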


In [36]:
# Fill missing values
print("Fill with 0:", s_missing.fillna(0))           # R: tidyr::replace_na(s_missing, 0)
print("Forward fill:", s_missing.ffill())            # R: zoo::na.locf(s_missing)
print("Drop missing:", s_missing.dropna())           # R: na.omit(s_missing)

Fill with 0: 0    1.0
1    2.0
2    0.0
3    4.0
4    0.0
5    6.0
dtype: float64
Forward fill: 0    1.0
1    2.0
2    2.0
3    4.0
4    4.0
5    6.0
dtype: float64
Drop missing: 0    1.0
1    2.0
3    4.0
5    6.0
dtype: float64



## Unique Values and Counting

Working with categorical-like data in Series:

In [37]:
# Series with repeated values
colors = pd.Series(["red", "blue", "red", "green", "blue", "red", "yellow"])

In [38]:
# Unique values
print("Unique values:", colors.unique())          # R: unique(colors)
print("Number of unique:", colors.nunique())      # R: length(unique(colors))

Unique values: ['red' 'blue' 'green' 'yellow']
Number of unique: 4


### Value counts

In [41]:
# Value counts (like R's table())
print(colors.value_counts())                      # R: table(colors)

red       3
blue      2
green     1
yellow    1
Name: count, dtype: int64


In [42]:
# As proportions
print(colors.value_counts(normalize=True))        # R: prop.table(table(colors))

red       0.428571
blue      0.285714
green     0.142857
yellow    0.142857
Name: proportion, dtype: float64
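
The result of `value_counts()` is itself a Series indexed by the distinct values, so everything covered so far (label access, filtering) applies to it. A short sketch with the same `colors` data:

```python
import pandas as pd

colors = pd.Series(["red", "blue", "red", "green", "blue", "red", "yellow"])

counts = colors.value_counts()

red_count = counts["red"]          # Count for one value (R: table(colors)["red"])
most_common = counts.idxmax()      # Label with the highest count
repeated = counts[counts > 1]      # Only values that appear more than once

print(red_count)
print(most_common)
print(repeated)
```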


## Sorting Series

Sorting operations with more control than R's `sort()`:

In [43]:
# Create unsorted series
values = pd.Series([30, 10, 40, 20, 50], index=["c", "a", "d", "b", "e"])

In [44]:
# Sort by values
print(values.sort_values())              # R: sort(values)

a    10
b    20
c    30
d    40
e    50
dtype: int64


In [45]:
print("Sort descending:")
print(values.sort_values(ascending=False))  # R: sort(values, decreasing=TRUE)

Sort descending:
e    50
d    40
c    30
b    20
a    10
dtype: int64


In [46]:
# Sort by index
print("Sort by index:")
print(values.sort_index())               # R: values[order(names(values))]

Sort by index:
a    10
b    20
c    30
d    40
e    50
dtype: int64
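
When only the largest or smallest few values are needed, `nlargest()` and `nsmallest()` avoid sorting and slicing in separate steps. A minimal sketch with the same `values`:

```python
import pandas as pd

values = pd.Series([30, 10, 40, 20, 50], index=["c", "a", "d", "b", "e"])

top_two = values.nlargest(2)       # R: head(sort(values, decreasing = TRUE), 2)
bottom_two = values.nsmallest(2)   # R: head(sort(values), 2)

print(top_two)
print(bottom_two)
```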


## Applying Functions

Applying custom functions to Series elements:

In [47]:
# Sample series
nums = pd.Series([1, 4, 9, 16, 25])

In [48]:
# Apply function element-wise
nums.apply(lambda x: x**0.5)

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
dtype: float64

In [50]:
# Apply custom function
def categorize(x):
    """Categorize numbers into size groups."""
    if x < 10:
        return "small"
    elif x < 20:
        return "medium"
    else:
        return "large"

nums.apply(categorize)

0     small
1     small
2     small
3    medium
4     large
dtype: object
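
For binning numbers like this, `pd.cut` is a vectorized alternative to `apply`, comparable to R's `cut()`. The sketch below reproduces the thresholds of `categorize` under the assumption that the bins should be closed on the left (`right=False`), so that 10 counts as "medium" and 20 as "large":

```python
import numpy as np
import pandas as pd

nums = pd.Series([1, 4, 9, 16, 25])

# right=False makes each bin closed on the left: [-inf, 10), [10, 20), [20, inf)
sizes = pd.cut(nums,
               bins=[-np.inf, 10, 20, np.inf],
               labels=["small", "medium", "large"],
               right=False)          # R: cut(nums, c(-Inf, 10, 20, Inf), right = FALSE)
print(sizes)
```

Unlike the `apply` version, the result has the `category` dtype, which is pandas' counterpart to an R factor.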

## Recode Values

In [53]:
# Map values (like dplyr's recode); values missing from the mapping become NaN
mapping = {1: "one", 4: "four", 9: "nine"}
nums.map(mapping)

0     one
1    four
2    nine
3     NaN
4     NaN
dtype: object
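
If unmapped values should be kept rather than turned into `NaN`, one option is to pass `map` a function that falls back to the original value. This is one possible approach, sketched with the same `nums` and `mapping`:

```python
import pandas as pd

nums = pd.Series([1, 4, 9, 16, 25])
mapping = {1: "one", 4: "four", 9: "nine"}

# dict.get(x, x) returns the original value when x has no entry in the mapping
recoded = nums.map(lambda x: mapping.get(x, x))
print(recoded)
```

The result mixes strings and numbers, so its dtype is `object`.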

## Series Alignment

One powerful feature of pandas Series is automatic alignment by index:

In [54]:
# Two series with different indices
s1 = pd.Series([10, 20, 30], index=["a", "b", "c"])
s2 = pd.Series([1, 2, 3, 4], index=["b", "c", "d", "e"])

print("Series 1:", s1)
print("Series 2:", s2)

Series 1: a    10
b    20
c    30
dtype: int64
Series 2: b    1
c    2
d    3
e    4
dtype: int64


In [55]:
# Operations align by index automatically
print("Addition (aligned by index):")
s1 + s2 # NaN where indices don't match

# R doesn't do this automatically - you'd need to align manually

Addition (aligned by index):


a     NaN
b    21.0
c    32.0
d     NaN
e     NaN
dtype: float64
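
When a label missing from one side should count as zero instead of producing `NaN`, the method form `add()` accepts a `fill_value`. A short sketch with the same two Series:

```python
import pandas as pd

s1 = pd.Series([10, 20, 30], index=["a", "b", "c"])
s2 = pd.Series([1, 2, 3, 4], index=["b", "c", "d", "e"])

# Labels present in only one Series are treated as 0 in the other
total = s1.add(s2, fill_value=0)
print(total)
```

The counterparts `sub()`, `mul()`, and `div()` take the same argument.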

## Converting to Other Types

Series can be easily converted to other Python/pandas data types:

In [56]:
# Sample series
s = pd.Series([10, 20, 30], index=["x", "y", "z"])

In [57]:
# To list (like R's as.vector())
print("To list:", s.tolist())           # R: as.vector(s)

To list: [10, 20, 30]


In [58]:
# To NumPy array (.to_numpy() is the recommended accessor; .values also works)
print("To array:", s.to_numpy())        # R: as.numeric(s)

To array: [10 20 30]


In [59]:
# To dictionary
print("To dict:", s.to_dict())          # R: setNames(as.list(s), names(s))

To dict: {'x': 10, 'y': 20, 'z': 30}


In [60]:
# To DataFrame
print("To DataFrame:")
print(s.to_frame(name="values"))        # R: data.frame(values = s)

To DataFrame:
   values
x      10
y      20
z      30


## Practical Tips

1. **Index awareness**: Always be mindful of the index. It is what makes Series more powerful than R vectors, but it can also cause surprises: arithmetic between two Series matches values by label, not by position.

2. **Method chaining**: Much like piping with `|>` or `%>%` in R, pandas encourages method chaining for readable workflows:

In [61]:
# Example of method chaining (tidy-style workflow)
result = (temps
          .where(temps > 65)      # Replace values <= 65 with NaN
          .dropna()               # Drop them (together: like dplyr::filter())
          .sort_values()          # Sort ascending, like dplyr::arrange()
          .head(3))               # Take the first 3 (the lowest), like dplyr::slice_head()

print("Chained operations result:")
print(result)

Chained operations result:
Wed    68.0
Sun    70.0
Tue    72.0
dtype: float64


3. **Copy vs view**: Depending on the pandas version and the operation, a result may share memory with the original (a view) rather than being an independent copy. Use `.copy()` when you need a copy that is safe to modify.

4. **Performance**: For large numeric operations, Series are much faster than Python lists but may be slower than R vectors for some operations.
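
To illustrate tip 3, the sketch below shows that changes made to a `.copy()` never reach the original Series. The names `original` and `backup` are illustrative:

```python
import pandas as pd

original = pd.Series([1, 2, 3], index=["a", "b", "c"])

# .copy() makes an independent Series with its own data
backup = original.copy()
backup.loc["a"] = 99

print(original.loc["a"])   # unchanged
print(backup.loc["a"])     # modified
```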

The Series is your workhorse in pandas. Master it, and DataFrames become much easier to understand, since a DataFrame is a collection of Series that share one index.