# Pandas Series

Before we get into DataFrames, we'll take a brief detour to explore pandas series. A series is a one-dimensional array that is very similar to a numpy ndarray. The main difference is that you can give a series custom index labels, and operations you perform on series automatically align the data based on those labels.

To create a new series, first load the numpy and pandas libraries:

In [1]:
import numpy as np
import pandas as pd

Note: It is common practice to import pandas with the shorthand "pd".

Define a new series by passing a collection of homogeneous data, such as an ndarray or a list, along with a list of associated index labels to pd.Series():

In [3]:
my_series = pd.Series(data = [2,3,5,4], #Data
                      index = ['a','b','c','d']) #Index
my_series

a    2
b    3
c    5
d    4
dtype: int64

You can also create a series from a dictionary, in which case the dictionary keys act as the labels and the values act as the data:

In [4]:
my_dict = {"x": 2, "a": 5, "b": 4, "c": 8}
my_series2 = pd.Series(my_dict)
my_series2

x    2
a    5
b    4
c    8
dtype: int64
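
If you pass an index along with a dictionary, pandas looks up each requested label among the dictionary keys. The sketch below reuses the dictionary from above; the label "z" is a made-up label that has no matching key, so it comes back as NaN and the values become floats:

```python
import pandas as pd

my_dict = {"x": 2, "a": 5, "b": 4, "c": 8}

# Keep only the labels we ask for, in the order we ask for them.
# "z" is a hypothetical label with no matching key, so it becomes NaN.
subset = pd.Series(my_dict, index=["a", "b", "z"])
print(subset)
```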

Similar to a dictionary, you can access items in a series by the labels:

In [5]:
my_series["a"]

2

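Positional (numeric) indexing is done with .iloc. Plain my_series[0] also returned 2 in older pandas versions, but integer keys on a label-indexed series are deprecated in recent releases:

In [6]:
my_series.iloc[0]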

2
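
When the labels are themselves integers, a bare integer in square brackets is ambiguous: it could mean a label or a position. The explicit .loc (label) and .iloc (position) accessors avoid the ambiguity. A minimal sketch, using a hypothetical series s that is not part of the examples above:

```python
import pandas as pd

# Hypothetical series whose labels are integers in a shuffled order
s = pd.Series([10, 20, 30], index=[2, 0, 1])

by_label = s.loc[0]      # the value stored under the label 0
by_position = s.iloc[0]  # the first value, whatever its label is

print(by_label, by_position)
```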

If you take a slice of a series, you get both the values and the labels contained in the slice:

In [7]:
my_series[1:3]

b    3
c    5
dtype: int64

As mentioned earlier, operations performed on two series align by label:

In [8]:
my_series + my_series

a     4
b     6
c    10
d     8
dtype: int64

In [9]:
my_series + my_series2

a     7.0
b     7.0
c    13.0
d     NaN
x     NaN
dtype: float64
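
Labels that appear in only one of the two series produce NaN, as "d" and "x" do above. If you would rather treat a missing value as 0, one option is the add() method with its fill_value argument. A small sketch that rebuilds the two series from above:

```python
import pandas as pd

my_series = pd.Series([2, 3, 5, 4], index=["a", "b", "c", "d"])
my_series2 = pd.Series({"x": 2, "a": 5, "b": 4, "c": 8})

# Treat a label that is missing from one side as 0 instead of producing NaN
total = my_series.add(my_series2, fill_value=0)
print(total)
```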

Other than labeling, series behave much like numpy's ndarrays. A series is even a valid argument to many of the numpy array functions we covered last time:

In [10]:
np.mean(my_series)
# numpy array functions generally work on series

3.5
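
Element-wise numpy functions also keep the labels: the result is a new series rather than a bare array. A brief sketch that rebuilds my_series so it runs on its own:

```python
import numpy as np
import pandas as pd

my_series = pd.Series([2, 3, 5, 4], index=["a", "b", "c", "d"])

# Element-wise numpy functions return a new series with the same labels
roots = np.sqrt(my_series)
print(roots)

# Reductions such as np.max collapse the series to a single value
largest = np.max(my_series)
print(largest)
```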

# DataFrame Creation and Indexing

A DataFrame is a 2D table with labeled columns that can each hold a different type of data. DataFrames are essentially a Python implementation of the kinds of tables you'd see in an Excel workbook or SQL database. DataFrames are the de facto standard data structure for working with tabular data in Python; we'll be using them a lot throughout the remainder of this guide.

You can create a DataFrame out of a variety of data sources, such as dictionaries, 2D numpy arrays and series, using the pd.DataFrame() function. Dictionaries provide an intuitive way to create DataFrames: when passed to pd.DataFrame(), a dictionary's keys become column labels and the values become the columns themselves:

In [11]:
# Create a dictionary with some different data types as values

my_dict = {"name" : ["Joe","Bob","Frans"],
           "age" : np.array([10,15,20]),
           "weight" : (75,123,239),
           "height" : pd.Series([4.5, 5, 6.1], 
                                index=["Joe","Bob","Frans"]),
           "siblings" : 1,
           "gender" : "M"}
df = pd.DataFrame(my_dict) # Convert the dict to DataFrame

df # Show the DataFrame


        name  age  weight  height  siblings gender
Joe      Joe   10      75     4.5         1      M
Bob      Bob   15     123     5.0         1      M
Frans  Frans   20     239     6.1         1      M


Notice that the values in the dictionary you use to make a DataFrame can be a variety of sequence objects, including lists, ndarrays, tuples and series. If you pass in a scalar value like a single number or string, that value is repeated for every row in the DataFrame (in this case gender is set to "M" and siblings is set to 1 for all records).
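
Scalars are stretched to fit, but sequences are not: if two sequences have different lengths, pandas cannot line them up and raises an error. A short sketch with made-up data that shows both cases:

```python
import pandas as pd

# A scalar is repeated to match the length of the sequences beside it
ok = pd.DataFrame({"name": ["Joe", "Bob", "Frans"], "gender": "M"})
print(ok)

# Sequences of different lengths cannot be lined up, so pandas raises
try:
    pd.DataFrame({"name": ["Joe", "Bob", "Frans"], "age": [10, 15]})
    mismatch_error = None
except ValueError as err:
    mismatch_error = err
    print("ValueError:", err)
```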

Also note that in the DataFrame above, the rows were automatically given indexes that align with the indexes of the series we passed in for the "height" column. If we did not use a series with index labels to create our DataFrame, it would be given numeric row index labels by default:

In [12]:
my_dict2 = {"name" : ["Joe","Bob","Frans"],
           "age" : np.array([10,15,20]),
           "weight" : (75,123,239),
           "height" :[4.5, 5, 6.1],
           "siblings" : 1,
           "gender" : "M"}

df2 = pd.DataFrame(my_dict2)   # Convert the dict to DataFrame

df2                            # Show the DataFrame

    name  age  weight  height  siblings gender
0    Joe   10      75     4.5         1      M
1    Bob   15     123     5.0         1      M
2  Frans   20     239     6.1         1      M


You can provide custom row labels when creating a DataFrame by adding the index argument:

In [13]:
df2 = pd.DataFrame(my_dict2, index = my_dict2["name"])

df2

        name  age  weight  height  siblings gender
Joe      Joe   10      75     4.5         1      M
Bob      Bob   15     123     5.0         1      M
Frans  Frans   20     239     6.1         1      M


A DataFrame behaves like a dictionary of Series objects that each have the same length and indexes. This means we can get, add and delete columns in a DataFrame the same way we would when dealing with a dictionary:

In [14]:
# Get a column by name

df2["weight"]

Joe       75
Bob      123
Frans    239
Name: weight, dtype: int64

Alternatively, you can get a column by label using "dot" notation:

In [15]:
df2.weight

Joe       75
Bob      123
Frans    239
Name: weight, dtype: int64
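
Dot notation is a convenience that only works when the column name is a valid Python identifier and does not collide with an existing DataFrame attribute or method. Bracket notation has neither restriction. A small sketch with a hypothetical frame (the column names are invented for illustration):

```python
import pandas as pd

# Hypothetical frame: one column name clashes with a DataFrame method,
# the other contains a space
df = pd.DataFrame({"count": [1, 2], "shoe size": [9, 11]})

# df.count is the DataFrame.count method, not the "count" column
print(callable(df.count))

# Brackets always reach the column, whatever it is called
count_values = df["count"].tolist()
shoe_values = df["shoe size"].tolist()
print(count_values, shoe_values)
```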

In [16]:
# Delete a column
del df2['name']

In [17]:
# Add a new column

df2["IQ"] = [130, 105, 115]

df2

       age  weight  height  siblings gender   IQ
Joe     10      75     4.5         1      M  130
Bob     15     123     5.0         1      M  105
Frans   20     239     6.1         1      M  115


Inserting a single value into a DataFrame causes it to populate across all the rows:

In [18]:
df2["Married"] = False

df2

       age  weight  height  siblings gender   IQ  Married
Joe     10      75     4.5         1      M  130    False
Bob     15     123     5.0         1      M  105    False
Frans   20     239     6.1         1      M  115    False


When inserting a Series into a DataFrame, rows are matched by index. Unmatched rows will be filled with NaN:

In [19]:
df2["College"] = pd.Series(["Havard"],
                          index = ["Frans"])

df2

       age  weight  height  siblings gender   IQ  Married College
Joe     10      75     4.5         1      M  130    False     NaN
Bob     15     123     5.0         1      M  105    False     NaN
Frans   20     239     6.1         1      M  115    False  Havard
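
Because the match is by label, the order of the values in the inserted series does not matter. The sketch below uses a small stand-in frame and a hypothetical "pet" column whose labels are listed in a different order than the rows:

```python
import pandas as pd

df = pd.DataFrame({"age": [10, 15, 20]}, index=["Joe", "Bob", "Frans"])

# Hypothetical column: the labels are listed in a different order than
# the rows of df, and "Joe" is missing
df["pet"] = pd.Series(["cat", "dog"], index=["Frans", "Bob"])
print(df)
```

A plain list, by contrast, has no labels and is matched to the rows by position.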


You can select rows, columns or both by label with df.loc[row, column]:

In [20]:
df2.loc["Joe"]  # Select row "Joe"

age            10
weight         75
height        4.5
siblings        1
gender          M
IQ            130
Married     False
College       NaN
Name: Joe, dtype: object

In [21]:
df2.loc["Joe","IQ"]     # Select row "Joe" and column "IQ"

130

In [22]:
df2.loc["Joe":"Bob" , "IQ":"College"]   # Slice by label

      IQ  Married College
Joe  130    False     NaN
Bob  105    False     NaN


Select rows or columns by numeric index with df.iloc[row, column]:

In [23]:
df2.iloc[0]          # Get row 0

age            10
weight         75
height        4.5
siblings        1
gender          M
IQ            130
Married     False
College       NaN
Name: Joe, dtype: object

In [24]:
df2.iloc[0, 5]       # Get row 0, column 5

130

In [25]:
df2.iloc[0:2, 5:8]   # Slice by numeric row and column index

      IQ  Married College
Joe  130    False     NaN
Bob  105    False     NaN
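
The label slice and the positional slice above return the same block, even though their end points look different. A label slice with .loc includes both end labels, while a positional slice with .iloc stops before the end position, as ordinary Python slices do. A minimal sketch on a small stand-in frame:

```python
import pandas as pd

# Small stand-in frame with the same row labels as above
df = pd.DataFrame({"age": [10, 15, 20], "IQ": [130, 105, 115]},
                  index=["Joe", "Bob", "Frans"])

by_label = df.loc["Joe":"Bob", "age":"IQ"]  # both end labels included
by_position = df.iloc[0:2, 0:2]             # stop positions excluded

print(by_label.equals(by_position))
```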


You can also select rows by passing in a sequence of boolean (True/False) values. Rows where the corresponding boolean is True are returned:

In [26]:
boolean_index = [False, True, True]  

df2[boolean_index] 

       age  weight  height  siblings gender   IQ  Married College
Bob     15     123     5.0         1      M  105    False     NaN
Frans   20     239     6.1         1      M  115    False  Havard


This sort of logical True/False indexing is useful for subsetting data when combined with logical operations. For example, say we wanted to get a subset of our DataFrame with all persons who are over 12 years old. We can do it with boolean indexing:

In [27]:
# Create a boolean sequence with a logical comparison
boolean_index = df2["age"] > 12

# Use the index to get the rows where age > 12
df2[boolean_index]

       age  weight  height  siblings gender   IQ  Married College
Bob     15     123     5.0         1      M  105    False     NaN
Frans   20     239     6.1         1      M  115    False  Havard


You can do this sort of indexing all in one operation without assigning the boolean sequence to a variable:

In [28]:
df2[ df2["age"] > 12 ]


       age  weight  height  siblings gender   IQ  Married College
Bob     15     123     5.0         1      M  105    False     NaN
Frans   20     239     6.1         1      M  115    False  Havard
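
Several conditions can be combined in the same expression. With pandas you use & for "and", | for "or" and ~ for "not", and each comparison needs its own parentheses. A sketch on a small stand-in frame (the weight cutoff of 200 is an arbitrary value chosen for illustration):

```python
import pandas as pd

df = pd.DataFrame({"age": [10, 15, 20], "weight": [75, 123, 239]},
                  index=["Joe", "Bob", "Frans"])

# Wrap each comparison in parentheses; & means "and", | means "or"
older_and_lighter = df[(df["age"] > 12) & (df["weight"] < 200)]
print(older_and_lighter)

# ~ negates a condition
not_older = df[~(df["age"] > 12)]
print(not_older)
```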


# Exploring DataFrames


Exploring data is an important first step in most data analyses. DataFrames come with a variety of functions to help you explore and summarize the data they contain.


In [31]:
x = pd.read_csv("https://raw.githubusercontent.com/Tublifet/Statistics-With-Python/main/dataset.csv")

type(x)

pandas.core.frame.DataFrame

In [33]:
x.shape # Check dimensions

(1009, 4)

In [34]:
x.head(6)

         Date  Open  Close     D/U/I
0   10/3/2018  57.5   58.0  INCREASE
1   10/4/2018  57.7   57.0  DECREASE
2   10/5/2018  57.0   56.1  DECREASE
3   10/8/2018  55.6   55.9  INCREASE
4   10/9/2018  55.9   56.7  INCREASE
5  10/10/2018  56.4   54.1  DECREASE


In [36]:
x.tail(6) # Check the last 6 rows

           Date   Open         Close     D/U/I
1003  9/28/2022  147.6         149.8  INCREASE
1004  9/29/2022  146.1         142.5  DECREASE
1005  9/30/2022  141.3         138.2  DECREASE
1006        NaN    NaN  Total D: 450       NaN
1007        NaN    NaN   Total U: 33       NaN
1008        NaN    NaN  Total I: 523       NaN
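
The last three rows are summary lines rather than daily records. One possible way to drop them is to keep only the rows that have a Date. The sketch below does this on a small hypothetical frame that mimics the layout of the tail shown above, not on the real dataset:

```python
import numpy as np
import pandas as pd

# Stand-in for the tail of the dataset: two daily rows plus a summary row
tail = pd.DataFrame({
    "Date": ["9/29/2022", "9/30/2022", np.nan],
    "Open": [146.1, 141.3, np.nan],
    "Close": ["142.5", "138.2", "Total D: 450"],
    "D/U/I": ["DECREASE", "DECREASE", np.nan],
})

# Keep only the rows where Date is present
daily = tail.dropna(subset=["Date"])
print(daily)
```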


In [37]:
x.index = x["Date"]

del x["Date"]

print(x.index[0:10])

Index(['10/3/2018', '10/4/2018', '10/5/2018', '10/8/2018', '10/9/2018',
       '10/10/2018', '10/11/2018', '10/12/2018', '10/15/2018', '10/16/2018'],
      dtype='object', name='Date')
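
The same re-indexing can be written in one step with set_index(), and the date strings can be parsed into real timestamps with pd.to_datetime(). A sketch on a hypothetical three-row frame; the format string assumes month/day/year dates like the ones printed above:

```python
import pandas as pd

# Hypothetical stand-in for the first rows of the dataset
prices = pd.DataFrame({
    "Date": ["10/3/2018", "10/4/2018", "10/5/2018"],
    "Open": [57.5, 57.7, 57.0],
})

# Parse the strings into timestamps (assumes month/day/year order)
prices["Date"] = pd.to_datetime(prices["Date"], format="%m/%d/%Y")

# Move the Date column into the index in one step
prices = prices.set_index("Date")
print(prices.index)
```

With a datetime index, rows sort chronologically instead of alphabetically.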


In [38]:
x.columns

Index(['Open', 'Close', 'D/U/I'], dtype='object')

In [39]:
x.describe()

              Open
count  1006.000000
mean    104.868390
std      44.832955
min      36.000000
25%      56.650000
50%     116.050000
75%     145.500000
max     182.600000


In [41]:
x.mean(numeric_only=True) # Get the mean of each numeric column

Open    104.86839
dtype: float64

In [42]:
x.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1009 entries, 10/3/2018 to nan
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Open    1006 non-null   float64
 1   Close   1009 non-null   object 
 2   D/U/I   1006 non-null   object 
dtypes: float64(1), object(2)
memory usage: 31.5+ KB
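
The info() output lists Close as object rather than float64, presumably because of the "Total ..." text entries seen in the last rows. One way to convert such a column is pd.to_numeric() with errors="coerce". The sketch below works on a hypothetical stand-in series, not on the real column:

```python
import pandas as pd

# Stand-in for the Close column: numeric strings plus a text entry
close = pd.Series(["142.5", "138.2", "Total D: 450"])

# errors="coerce" turns anything that is not a number into NaN
close_numeric = pd.to_numeric(close, errors="coerce")
print(close_numeric)

# NaN values are skipped by default when averaging
close_mean = close_numeric.mean()
print(close_mean)
```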
