# Pandas series and data frames

In this lesson we introduce the two core objects in the pandas library, the pandas.Series and the pandas.DataFrame. The overall goal is to gain familiarity with these two objects, understand their relation to each other, and review Python data structures such as dictionaries and lists.

In [1]:
import pandas as pd
import numpy as np

## Series

A series is a one-dimensional array of indexed data.

The main difference between a pandas.Series and a NumPy array is that a pandas.Series has an index.

In [2]:
# A numpy array
arr = np.random.randn(4) # random values from std normal distribution
print(type(arr))
print(arr, "\n")

# A pandas series made from the previous array
s = pd.Series(arr)
print(type(s))
print(s)

<class 'numpy.ndarray'>
[ 0.89049409  0.02421328 -0.14519775 -0.03044247] 

<class 'pandas.core.series.Series'>
0    0.890494
1    0.024213
2   -0.145198
3   -0.030442
dtype: float64
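
To make the role of the index concrete, here is a minimal sketch. The values, the labels, and the name `temps` are made up for illustration and are not part of the lesson. A NumPy array is accessed by integer position, while a series can also be accessed by its index labels.

```python
import numpy as np
import pandas as pd

# Hypothetical values and labels, chosen only for illustration
arr = np.array([10.5, 20.1, 30.7])
temps = pd.Series(arr, index=['a', 'b', 'c'])

# NumPy array: access by integer position
first_from_array = arr[0]

# pandas.Series: access by index label
first_from_series = temps['a']

print(first_from_array, first_from_series)
print(temps.index)
```

Both lookups return the same element; only the way of addressing it differs.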


## Creating a pandas series

When creating a pandas.Series with pd.Series(data, index=index), the data parameter can be:

- a list or NumPy array,
- a Python dictionary, or
- a single number, boolean (True/False), or string.

In [6]:
# General form (data and index are placeholders); the index parameter is optional
s = pd.Series(data, index=index)

### Creating a pandas.Series from a NumPy array

In [7]:
# A series from a numpy array 
pd.Series(np.arange(3), index=[2023, 2024, 2025])

2023    0
2024    1
2025    2
dtype: int64

### Creating a pandas.Series from a list

If we don’t include an index, the default index is the integers [0, ..., len(data)-1].

In [8]:
# A series from a list of strings with default index
pd.Series(['EDS 220', 'EDS 222', 'EDS 223', 'EDS 242'])

0    EDS 220
1    EDS 222
2    EDS 223
3    EDS 242
dtype: object
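
As a quick check of the default index, the sketch below inspects the `index` attribute of a series built without an `index` argument. The variable name `courses` is an assumption made for this example.

```python
import pandas as pd

courses = pd.Series(['EDS 220', 'EDS 222', 'EDS 223', 'EDS 242'])

# With no index argument, pandas builds a RangeIndex from 0 to len(data) - 1
print(courses.index)
print(list(courses.index))
```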

In [9]:
#practice

pd.Series(['red', 'yellow', 'blue', 'green'])

0       red
1    yellow
2      blue
3     green
dtype: object

### Creating a pandas.Series from a dictionary

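Recall that a dictionary is a set of key-value pairs. When a series is created from a dictionary, the keys become the index and the values become the data.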

In [10]:
# Construct dictionary
d = {'key_0':2, 'key_1':'3', 'key_2':5}

# Initialize series using a dictionary
pd.Series(d)

# dtype is object because the values mix integers and a string

key_0    2
key_1    3
key_2    5
dtype: object
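
The following sketch shows what happens when a dictionary is combined with the `index` parameter. It reuses the dictionary from the cell above; the names `s_all` and `s_some` and the key `'key_9'` are hypothetical and added only for illustration. Under these assumptions, the index selects and orders the keys, and a label missing from the dictionary gets a NaN value.

```python
import pandas as pd

d = {'key_0': 2, 'key_1': '3', 'key_2': 5}

# Keys become the index, values become the data
s_all = pd.Series(d)

# Passing index picks keys in the given order; 'key_9' is a hypothetical
# key that is not in d, so its value is NaN
s_some = pd.Series(d, index=['key_2', 'key_0', 'key_9'])

print(list(s_all.index))
print(s_some)
```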

### Creating a pandas.Series from a single value

In [11]:
pd.Series(3.0, index = ['A', 'B', 'C'])

A    3.0
B    3.0
C    3.0
dtype: float64

### Operations on a series

In [None]:
# Define a series
s = pd.Series([98,73,65],index=['Andrea', 'Beth', 'Carolina'])

# Divide each element in series by 10
print(s /10, '\n')

# Take the exponential of each element in series
print(np.exp(s), '\n')

# Original series is unchanged
print(s)

In [13]:
# Define a series
s = pd.Series([98,73,65],index=['Andrea', 'Beth', 'Carolina'])

print(s)


Andrea      98
Beth        73
Carolina    65
dtype: int64


In [20]:
sdic = {'Andrea':98, 'Beth':73, 'Carolina': 65}

print(sdic)
print(type(sdic))

{'Andrea': 98, 'Beth': 73, 'Carolina': 65}
<class 'dict'>


In [23]:
# We can also produce a new pandas.Series of True/False values indicating whether each element in a series satisfies a condition:
s > 70

Andrea       True
Beth         True
Carolina    False
dtype: bool
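
A boolean series like this one is commonly used to filter the original series. The sketch below is one way to do that; the name `above_70` is an assumption made for this example.

```python
import pandas as pd

s = pd.Series([98, 73, 65], index=['Andrea', 'Beth', 'Carolina'])

# Boolean series: True where the condition holds
above_70 = s > 70

# Use the boolean series to keep only the matching elements
print(s[above_70])

# True counts as 1, so sum() gives the number of matches
print(above_70.sum())
```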

### Identifying missing values

In [24]:
# Series with NAs in it
s = pd.Series([1, 2, np.nan, 4, np.nan]) #nan means 'not a number'
s

0    1.0
1    2.0
2    NaN
3    4.0
4    NaN
dtype: float64

The `hasnans` attribute of a pandas.Series is True if there are any NA values in it and False otherwise:

In [25]:
# Check if series has NAs
s.hasnans

True

In [27]:
s.isna()

0    False
1    False
2     True
3    False
4     True
dtype: bool
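
Building on `isna()`, the sketch below counts the missing values and drops them. The names `n_missing` and `s_clean` are assumptions made for this example, and `dropna()` is just one possible way to handle missing values.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan, 4, np.nan])

# Count missing values: True counts as 1
n_missing = s.isna().sum()

# dropna() returns a new series without the missing values; s is unchanged
s_clean = s.dropna()

print(n_missing)
print(s_clean)
```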

### Check-in 1

The integer -999 is often used to represent missing values. Create a pandas.Series named s with four integer values, two of which are -999. The index of this series should be the letters A through D.


In [29]:
s = pd.Series([3,-999,6,-999], index = ['A', 'B', 'C', 'D'])

s


A      3
B   -999
C      6
D   -999
dtype: int64

In the pandas.Series documentation, look for the method mask(). Use this method to update the series s so that the -999 values are replaced by NA values. HINT: check the first example in the method’s documentation. https://pandas.pydata.org/docs/reference/api/pandas.Series.mask.html

In [32]:
s = s.mask(s == -999)
s

A    3.0
B    NaN
C    6.0
D    NaN
dtype: float64
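
Two details of `mask()` are worth checking: it returns a new series rather than modifying the original, and the result has a floating-point dtype because NaN is a floating-point value. The sketch below illustrates both; the names `raw` and `masked` are assumptions made for this example.

```python
import pandas as pd

raw = pd.Series([3, -999, 6, -999], index=['A', 'B', 'C', 'D'])

# mask() returns a new series; raw itself is not modified
masked = raw.mask(raw == -999)

print(raw.dtype)     # integer dtype, raw still contains -999
print(masked.dtype)  # float dtype, because NaN is a floating-point value
print(masked.isna().sum())
```

To keep the result, assign it back to a variable, for example `raw = raw.mask(raw == -999)`.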

## Data Frames

Each column of a pandas.DataFrame is a pandas.Series.

In fact, a pandas.DataFrame can be thought of as a dictionary of pandas.Series, with each column name being a key and the column’s values being the corresponding value.

### Creating a pandas.DataFrame

In [33]:
# Initialize dictionary with columns' data 
d = {'col_name_1' : pd.Series(np.arange(3)),
     'col_name_2' : pd.Series([3.1, 3.2, 3.3]),
     }

# Create data frame
df = pd.DataFrame(d)
df

   col_name_1  col_name_2
0           0         3.1
1           1         3.2
2           2         3.3
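
The dictionary analogy can be checked directly: selecting a column by name works like looking up a key and returns a pandas.Series. This is a minimal sketch; the name `col` is an assumption made for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col_name_1': pd.Series(np.arange(3)),
                   'col_name_2': pd.Series([3.1, 3.2, 3.3])})

# Selecting a column by name, like looking up a key in a dictionary
col = df['col_name_1']

print(type(col))
print(list(df.columns))

# The column shares the data frame's index
print(col.index.equals(df.index))
```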


We can change the index by assigning a new value to the index attribute of the data frame:

In [34]:
# Change index
df.index = ['a','b','c']
df

   col_name_1  col_name_2
a           0         3.1
b           1         3.2
c           2         3.3


### Check-in 2

We can access the data frame’s column names via the columns attribute. Update the column names to C1 and C2 by updating this attribute.

In [39]:
df = df.rename(columns = {'col_name_1':'C1'})

In [40]:
df = df.rename(columns = {'col_name_2':'C2'})

df

   C1   C2
a   0  3.1
b   1  3.2
c   2  3.3
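
The cells above use `rename()`, which gives the requested result. Since the exercise asks to update the `columns` attribute itself, an alternative is to assign a list of new names to it, one per column and in column order. This is a minimal sketch that rebuilds the data frame so it runs on its own.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col_name_1': np.arange(3),
                   'col_name_2': [3.1, 3.2, 3.3]},
                  index=['a', 'b', 'c'])

# Assign a list with one new name per column, in order
df.columns = ['C1', 'C2']

print(df.columns)
```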


## Summary of activity

In this activity, we defined the pandas.Series and the pandas.DataFrame. A series is a one-dimensional array of indexed data. A data frame can be thought of as a dictionary of series, with the column names as keys.