# Example Sheet. Chapter 2. The Series and the DataFrame

**Book: From Social Science to Data Science** 

**Author: Bernie Hogan**

**Last revision: September 19, 2019**

This worksheet provides executable versions of the code featured in the chapter. The explanations are much more terse here, but you also get some extra steps and output because it is easier to show it on screen than in a text.

To keep the cells lightweight, I assume here that you will be running these cells in order. So if you dip in later and get an error like 

~~~ python
NameError: name 'pd' is not defined
~~~
Then you have to go back and ```import pandas as pd```. 

# The Series

In [1]:
# Creating a series - one way. 
from pandas import Series 
ser1 = Series()

# Creating a series - another way. 
import pandas as pd
ser1 = pd.Series()

## Making a series from a list

In [2]:
lweekdays = ["Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"]

sweekdays = pd.Series(lweekdays,name="Weekdays")

display(sweekdays)

0       Monday
1      Tuesday
2    Wednesday
3     Thursday
4       Friday
5     Saturday
6       Sunday
Name: Weekdays, dtype: object

## Making a series from a dictionary
In this case, the keys become the indices and the values become the values. 

In [3]:
dsleephours =  {"Sunday":8,
                "Monday":7,
                "Tuesday":5,
                "Wednesday":6,
                "Thursday":8,
                "Friday":9,
                "Saturday":8}

sleephours = pd.Series(dsleephours)

display(sleephours) 

Sunday       8
Monday       7
Tuesday      5
Wednesday    6
Thursday     8
Friday       9
Saturday     8
dtype: int64

## Working from an index
Two ways to get values out using indices: by index label and by position. 

In [4]:
# Get a value from the table by index label.
display(sleephours["Tuesday"])

# Get a value from the table by position.
display(sleephours[2])


5

5
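Square brackets accept both a label and a position here, which can be ambiguous, and newer pandas releases warn about the positional form. A minimal sketch of the explicit alternatives, using a shortened version of the series above: `.loc` always looks up by label and `.iloc` always looks up by position.

```python
import pandas as pd

# A shortened version of the sleephours series, for illustration only.
sleephours = pd.Series({"Sunday": 8, "Monday": 7, "Tuesday": 5})

by_label = sleephours.loc["Tuesday"]   # look up by index label
by_position = sleephours.iloc[2]       # look up by position (third entry)

print(by_label, by_position)
```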

## Working from values (and slicing)

In [5]:
display(sleephours > 7)

Sunday        True
Monday       False
Tuesday      False
Wednesday    False
Thursday      True
Friday        True
Saturday      True
dtype: bool

In [6]:
# This shows how to slice data down to rows that meet the criterion
display(sleephours[sleephours >= 8])

Sunday      8
Thursday    8
Friday      9
Saturday    8
dtype: int64

In [7]:
# Here we compute the proportion of days with 8 or more hours of sleep
days_sleep = len(sleephours[sleephours >= 8])
total_days = len(sleephours)
display(days_sleep / total_days)

# I'm printing it here in the proper style of using 'format()'
# And putting the number in a useful sentence to remember what I calculated
print("The proportion of days per week where the subject had 8 or more hours is {:.2f}".format
      (days_sleep/total_days))

0.5714285714285714

The proportion of days per week where the subject had 8 or more hours is 0.57


## Working with distributions 
The way to count the unique values in a series is to use the ```value_counts()``` method. This is like doing a frequency or tabulate command in other software.

In [8]:
display(sleephours.value_counts())

8    3
7    1
6    1
5    1
9    1
dtype: int64

In the snippet below we can see how to summarize a series using a boolean. ```(sleephours > 7)```  will give us the list of ```True``` and ```False``` values. Then using ```.value_counts()``` we can have those values summarised into a count of ```True``` and ```False```. 

In [9]:
(sleephours > 7).value_counts()

True     4
False    3
dtype: int64
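If you want shares rather than counts, `value_counts()` accepts `normalize=True`. The sketch below rebuilds the sleep data so that it runs on its own; the variable name `shares` is only for illustration.

```python
import pandas as pd

sleephours = pd.Series({"Sunday": 8, "Monday": 7, "Tuesday": 5,
                        "Wednesday": 6, "Thursday": 8, "Friday": 9,
                        "Saturday": 8})

# normalize=True returns proportions that sum to 1 instead of raw counts.
shares = (sleephours > 7).value_counts(normalize=True)
print(shares)
```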

Below we can see how easy it is to get a statistic out of a series. For ```sleephours``` we can simply say ```.mean()```. However, I also like to print this result inside of a ```"{}".format()``` so that we can format the result in an attractive way. I use a precision of 3, which without a type code such as ```f``` means three significant digits. 

In [10]:
print(sleephours.mean())

display("{:.3}".format(sleephours.mean()))

7.285714285714286


'7.29'
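The difference between the two precision styles is easy to miss, so here is a small sketch. The number is the mean from above written as 51 / 7 (the sum of the sleep hours over seven days).

```python
mean_hours = 51 / 7  # same value as sleephours.mean()

three_significant = "{:.3}".format(mean_hours)   # three significant digits
three_decimals = "{:.3f}".format(mean_hours)     # three digits after the decimal point

print(three_significant)  # 7.29
print(three_decimals)     # 7.286
```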

Some more simple descriptives.

In [11]:
display(sleephours.max())
display(sleephours.min())
display(sleephours.median())

9

5

8.0

## Differences in methods between a Series and a List

You can list all of the attributes and methods of a series by typing ```dir(SERIESNAME)```. 

In [12]:
ex_list = [] # Just an empty list
ex_series = pd.Series(ex_list) # Now an empty series

display("A list has {} methods.".format( len( dir(ex_list)))) 

display("A series has {} methods.".format( len( dir(ex_series)))) 

'A list has 46 methods.'

'A series has 457 methods.'

Below is a slightly more complicated way to print all the methods, but it does show lots of things at once:
1. It uses a list comprehension to select from a list 
2. It uses ```enumerate()``` to provide a counter for a for loop. It also uses this counter to create a new line every $4$ entries.
3. It uses ```"{:<24}"``` as a way to say: left-align this text and, if it is shorter than $24$ characters, pad it with spaces to $24$. 

In [32]:


def prettyPrint(OBJECT,system_methods=False):
    output = ""
    # Filter out names that begin with "_"
    if not system_methods:
        all_methods = [x for x in dir(OBJECT) if x[0] != "_"]
    else:
        all_methods = dir(OBJECT)
    
    #24 was hardcoded, but could probably be variable based on list in next version
    
    for c,method in enumerate(all_methods):
        output += "{:<24}".format(method)
        if not c%4: output += "\n"
    
    return output

print("The methods for a list:\n",prettyPrint(ex_list))
print()
print("The methods for a series:\n",prettyPrint(ex_series))

The methods for a list:
 append                  
clear                   copy                    count                   extend                  
index                   insert                  pop                     remove                  
reverse                 sort                    

The methods for a series:
 T                       
abs                     add                     add_prefix              add_suffix              
agg                     aggregate               align                   all                     
any                     append                  apply                   argmax                  
argmin                  argsort                 array                   as_matrix               
asfreq                  asof                    astype                  at                      
at_time                 autocorr                axes                    base                    
between                 between_time            bfill                   
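In the output above the first row holds a single name. That happens because the counter from ```enumerate()``` starts at 0, so the ```not c%4``` test is true straight away. Below is a minimal variant, assuming rows of four are what is wanted. The function name `pretty_columns` and its parameters are my own for this sketch and are not from the chapter.

```python
def pretty_columns(names, per_row=4, width=24):
    """Lay out names in left-aligned columns, per_row names on each line."""
    lines = []
    # Step through the list in chunks of per_row items.
    for start in range(0, len(names), per_row):
        row = names[start:start + per_row]
        # Pad each name to `width` characters, then trim trailing spaces.
        lines.append("".join("{:<{w}}".format(n, w=width) for n in row).rstrip())
    return "\n".join(lines)


list_methods = [x for x in dir([]) if not x.startswith("_")]
print(pretty_columns(list_methods))
```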

## Adding data to a Series

In [33]:
# Convert a list to a Series and append it to an existing Series.
# Step 1. Create Series 1. 
ldemo1 = ["Kermit","Piggy","Fozzie"]
sdemo1 = pd.Series(ldemo1) 

# Step 2. Create series 2. 
ldemo2 = ["Animal","Janice", "Dr. Teeth"]
sdemo2 = pd.Series(ldemo2) 

# Step 3. Append series 2. 
# Notice the 'ignore_index' argument. 
# Try running this without that argument (you will notice the index will be messed up)
sdemo1and2 = sdemo1.append(sdemo2,ignore_index=True)
display(sdemo1and2)

0       Kermit
1        Piggy
2       Fozzie
3       Animal
4       Janice
5    Dr. Teeth
dtype: object

In [34]:
# When appended without the ignore_index, the index is 0,1,2,0,1,2
sdemo1and2noindex = sdemo1.append(sdemo2)

display(sdemo1and2noindex)

0       Kermit
1        Piggy
2       Fozzie
0       Animal
1       Janice
2    Dr. Teeth
dtype: object

In [36]:
ldemo1 = ["Kermit","Piggy","Fozzie"]
sdemo1 = pd.Series(ldemo1) 

# The second way, let's append the data one new index at a time.
ldemo2 = ["Animal","Janice", "Dr. Teeth"]

for i in ldemo2: 
    sdemo1[len(sdemo1)] = i
display(sdemo1)

0       Kermit
1        Piggy
2       Fozzie
3       Animal
4       Janice
5    Dr. Teeth
dtype: object

### Comparing the two different ways using timeit. 

In the book we simply stated that appending the two together was quicker than using a for loop. We can, however, test that out by using ```%%timeit```, an IPython magic command available in Jupyter that reports how long it takes to run a cell (or, as ```%timeit```, a single line of code). 

In [37]:
%%timeit

ldemo1 = ["Kermit","Piggy","Fozzie"]
sdemo1 = pd.Series(ldemo1) 

ldemo2 = ["Animal","Janice", "Dr. Teeth"]
sdemo2 = pd.Series(ldemo2) 

sdemo1and2 = sdemo1.append(sdemo2,ignore_index=True)

416 µs ± 171 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [38]:
%%timeit 

ldemo1 = ["Kermit","Piggy","Fozzie"]
sdemo1 = pd.Series(ldemo1) 

ldemo2 = ["Animal","Janice", "Dr. Teeth"]

for i in ldemo2: 
    sdemo1[len(sdemo1)] = i

2.45 ms ± 30.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


There we have it: on this run, the append is roughly six times faster than the loop. You will see this sort of speed difference in managing data throughout this book and your work. Using magic commands like ```%timeit``` can help here.
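One caveat if you are running a recent pandas: `Series.append()` was deprecated and then removed in pandas 2.0, so the cells above will raise an `AttributeError` there. A minimal sketch of the same step using `pd.concat()`, which takes a list of Series and the same `ignore_index` argument:

```python
import pandas as pd

sdemo1 = pd.Series(["Kermit", "Piggy", "Fozzie"])
sdemo2 = pd.Series(["Animal", "Janice", "Dr. Teeth"])

# concat takes a list of Series; ignore_index=True renumbers the result 0..5.
sdemo1and2 = pd.concat([sdemo1, sdemo2], ignore_index=True)
print(sdemo1and2)
```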

## Deleting data from a Series 

In [39]:
sdemo = pd.Series(["Kermit","Piggy","Fozzie"])
del sdemo[1]
display(sdemo)

0    Kermit
2    Fozzie
dtype: object

In [40]:
# Notice how it deleted the data but preserved the index? 

# Let's create a new index
sdemo.index = range(len(sdemo))

display(sdemo)

0    Kermit
1    Fozzie
dtype: object
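The delete-then-reindex steps above can also be written as a single chained expression. This is a sketch of one alternative, not the chapter's method: `drop()` returns a new Series without the given label, and `reset_index(drop=True)` renumbers it from 0.

```python
import pandas as pd

sdemo = pd.Series(["Kermit", "Piggy", "Fozzie"])

# drop() leaves sdemo itself untouched and returns a new Series.
trimmed = sdemo.drop(1).reset_index(drop=True)
print(trimmed)
```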

## Working with missing data in a Series

In [41]:
# Creates a series of length 5 with no data. 
sdemo = pd.Series(index=[0,1,2,3,4])
display(sdemo)

0   NaN
1   NaN
2   NaN
3   NaN
4   NaN
dtype: float64

In [42]:
# Now adding some data to that series
sdemo[0] = "Kermit"
sdemo[3] = "Fozzie"
display(sdemo)

0    Kermit
1       NaN
2       NaN
3    Fozzie
4       NaN
dtype: object

### Getting rid of missing values 

In [43]:
display(sdemo.dropna())

0    Kermit
3    Fozzie
dtype: object

### Using fillna() to replace values

In [44]:
display(sdemo.fillna("Extra"))

0    Kermit
1     Extra
2     Extra
3    Fozzie
4     Extra
dtype: object

### Using isna() and notna() to filter the Series

In [45]:
display(sdemo.isna())

print()

display(sdemo[sdemo.notna()])

0    False
1     True
2     True
3    False
4     True
dtype: bool




0    Kermit
3    Fozzie
dtype: object
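Because `True` counts as 1 and `False` as 0, summing the boolean result of `isna()` gives the number of missing values. A short sketch, with the series rebuilt directly so that it runs on its own:

```python
import pandas as pd

sdemo = pd.Series(["Kermit", None, None, "Fozzie", None])

n_missing = sdemo.isna().sum()    # number of missing entries
n_present = sdemo.notna().sum()   # number of non-missing entries
print(n_missing, n_present)
```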

## Getting unique values in a Series

In [46]:
ser1 = pd.Series(["Kermit","Fozzie","Kermit","Piggy","Fozzie"])
display(ser1.unique())

array(['Kermit', 'Fozzie', 'Piggy'], dtype=object)

In [47]:
ser2 = pd.Series(ser1.unique()) # To transform back to a Series

print(type(ser2),ser2,sep="\n\n")

<class 'pandas.core.series.Series'>

0    Kermit
1    Fozzie
2     Piggy
dtype: object


## Sorting a Series

In [48]:
# Notice here we will sort by the values in the series. 
# Notice the inplace=True argument. 

ser1 = pd.Series( {"Kermit":"Frog",
                   "Piggy":"Pig",
                   "Fozzie":"Bear",
                   "Robin":"Frog"} )

ser1.sort_values(ascending=True,inplace=True)
display(ser1)

Fozzie    Bear
Kermit    Frog
Robin     Frog
Piggy      Pig
dtype: object

In [49]:
# Here instead of values we are sorting by index.

ser2 = ser1.sort_index(ascending=False)
display(ser2)

Robin     Frog
Piggy      Pig
Kermit    Frog
Fozzie    Bear
dtype: object
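The two cells above differ in one more way: the first sorts in place, while the second assigns the result to a new name. A small sketch of the default behaviour, where leaving out `inplace=True` returns a sorted copy and keeps the original order:

```python
import pandas as pd

ser = pd.Series({"Kermit": "Frog", "Piggy": "Pig",
                 "Fozzie": "Bear", "Robin": "Frog"})

sorted_ser = ser.sort_values()   # returns a new, sorted Series

print(list(ser.index))           # original order is unchanged
print(list(sorted_ser.index))
```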

## Changing Series Values 

### I: Adding, Multiplying, etc...

In [50]:
import numpy as np 
ser1 = pd.Series([1,np.NaN,7])

# See how each value is doubled, rather than the series being duplicated.
ser1 = ser1*2
display(ser1)

# Similarly how each value has four subtracted. 
ser1 = ser1-4
display(ser1)

0     2.0
1     NaN
2    14.0
dtype: float64

0    -2.0
1     NaN
2    10.0
dtype: float64

In [51]:
try: 
    ser1 = ser1 + "A" #Note that the Series is full of numbers so it throws an error
except TypeError: 
    print("If you try to add a character to a number it will throw an Error. It will not concatenate them.")

If you try to add a character to a number it will throw an Error. It will not concatenate them.


In [52]:
# This is a series full of strings so the concatenation works
ser2 = pd.Series(["Kermit","Piggy","Fozzie"])
ser2 = ser2 + " the Muppet"
display(ser2)

0    Kermit the Muppet
1     Piggy the Muppet
2    Fozzie the Muppet
dtype: object

### II: Recoding values using Map

In [53]:
import pandas as pd 

example_gender_list = ['Male', 
                       "Woman", 
                       "Female", 
                       "Female", 
                       'Male', 
                       'Man', 
                       'Male (sex)', 
                       "Woman", 
                       'Male (sex)',
                       "Female", 
                       "Female "]

gender_series = pd.Series(example_gender_list)
print(gender_series.unique())

['Male' 'Woman' 'Female' 'Man' 'Male (sex)' 'Female ']


In [54]:
gender_recode_dict = {"Male":"M", 
                 "Man":"M",
                 "Male (sex)": "M",
                 "Woman":"F",
                 "Female":"F",
                 "Female ":"F"}

gender_recode = gender_series.map(gender_recode_dict)

print(gender_recode)
print()
print(gender_recode.value_counts())

0     M
1     F
2     F
3     F
4     M
5     M
6     M
7     F
8     M
9     F
10    F
dtype: object

F    6
M    5
dtype: int64
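Two details of `map()` with a dictionary are worth seeing in isolation. Any value that is not a key in the dictionary becomes `NaN`, and stray whitespace (like the trailing space in `"Female "`) can be removed with `.str.strip()` before mapping. The sketch below uses a deliberately incomplete dictionary to show both.

```python
import pandas as pd

gender_series = pd.Series(["Male", "Female ", "Woman"])

# Deliberately incomplete: "Woman" has no entry.
partial_recode = {"Male": "M", "Female": "F"}

cleaned = gender_series.str.strip()    # "Female " becomes "Female"
recoded = cleaned.map(partial_recode)  # unmapped values become NaN

print(recoded)
print("Unmapped values:", recoded.isna().sum())
```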


### III: Defining your own recode using Lambda

In [55]:
ser1 = pd.Series([1,3,5],index=["one","three","five"]) 
ser1 = ser1.map(lambda val: val**2)
display(ser1)

one       1
three     9
five     25
dtype: int64

In [56]:
def has_email(TEXT): 
    import re
    # This is the function that we didn't cover, it is very complicated. 
    # You can see more regex examples in Chapter XX
    file_email = re.compile(r"[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*")
    if re.search(file_email, TEXT): return True
    else: return False
    
    
smessages = pd.Series(["Hey, catch me at bernie.hogan@oii.ox.ac.uk", 
                       "I once emailed steve@apple.com and got a reply", 
                       "I don't really use email",
                       "Is test@example a valid email?"])

result = smessages.map(lambda x: has_email(x)) 

display(result) 

0     True
1     True
2    False
3     True
dtype: bool

The important thing to understand here is how lambda works. I would not worry about that regular expression, nor would I even be able to draft that on my own. It comes from the HTML specification's definition of a valid email address, [featured here](https://html.spec.whatwg.org/multipage/input.html#valid-e-mail-address).
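To see what the lambda is doing, it helps to compare it with passing the function itself. In the sketch below, `mentions_at_sign` is a hypothetical stand-in for `has_email` that only looks for an "@" character; the point is that `map()` calls whatever function it is given once per value, so the wrapper lambda and the bare function name give the same result.

```python
import pandas as pd


def mentions_at_sign(text):
    # Hypothetical stand-in for has_email(): only checks for an "@" character.
    return "@" in text


smessages = pd.Series(["Hey, catch me at bernie.hogan@oii.ox.ac.uk",
                       "I don't really use email"])

with_lambda = smessages.map(lambda x: mentions_at_sign(x))
direct = smessages.map(mentions_at_sign)

print(direct.equals(with_lambda))
```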

# From Series to DataFrame 



In [57]:
import pandas as pd 

ser1 = pd.Series({"Kermit":"Frog", "Fozzie":"Bear", "Janice":"Hippy"})

display(ser1)

Kermit     Frog
Fozzie     Bear
Janice    Hippy
dtype: object

In [58]:
df1 = pd.DataFrame(ser1)

display(df1)

| | 0 |
|---|---|
| Kermit | Frog |
| Fozzie | Bear |
| Janice | Hippy |


In [59]:
ser1 = pd.Series({"Kermit":"Frog", "Fozzie":"Bear", "Janice":"Hippy"}, name="MuppetType")

df1 = pd.DataFrame(ser1)

display(df1)

| | MuppetType |
|---|---|
| Kermit | Frog |
| Fozzie | Bear |
| Janice | Hippy |


In [60]:
df1.columns = ["NewColumnName"]

display(df1)

| | NewColumnName |
|---|---|
| Kermit | Frog |
| Fozzie | Bear |
| Janice | Hippy |


## Getting Data into a DataFrame

### From a list of lists 

In [61]:
muppetList = [["Kermit","Frog",1955,"Male"], 
              ["Miss Piggy", "Pig", 1974, "Female"], 
              ["Gonzo", "Unknown", 1970, "Male"]]

muppetFrame1 = pd.DataFrame(muppetList)
display(muppetFrame1)

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | Kermit | Frog | 1955 | Male |
| 1 | Miss Piggy | Pig | 1974 | Female |
| 2 | Gonzo | Unknown | 1970 | Male |


In [62]:
muppetFrame1.set_index(0, inplace=True)
display(muppetFrame1)

| 0 | 1 | 2 | 3 |
|---|---|---|---|
| Kermit | Frog | 1955 | Male |
| Miss Piggy | Pig | 1974 | Female |
| Gonzo | Unknown | 1970 | Male |


In [63]:
df_orient_index = pd.DataFrame.from_dict({"Kermit":"Frog", "Fozzie":"Bear", "Janice":"Hippy"},orient="index",columns=["MuppetType"])

display(df_orient_index)

| | MuppetType |
|---|---|
| Kermit | Frog |
| Fozzie | Bear |
| Janice | Hippy |


In [64]:
df_orient_columns = pd.DataFrame.from_dict({"Kermit":["Frog"], "Fozzie":["Bear"], "Janice":["Hippy"]},orient="columns")

display(df_orient_columns)

| | Kermit | Fozzie | Janice |
|---|---|---|---|
| 0 | Frog | Bear | Hippy |


### A DataFrame from a dictionary

In [65]:
muppetDict = {"Kermit": ["Frog",1955,"Male"], 
              "Miss Piggy":["Pig", 1974, "Female"], 
              "Gonzo": ["Unknown", 1970, "Male"]}

muppetFrame2 = pd.DataFrame.from_dict(muppetDict,orient="index")
display(muppetFrame2)

| | 0 | 1 | 2 |
|---|---|---|---|
| Kermit | Frog | 1955 | Male |
| Miss Piggy | Pig | 1974 | Female |
| Gonzo | Unknown | 1970 | Male |


In [66]:
muppetFrame3 = pd.DataFrame.from_dict({"Kermit":"Frog", "Miss Piggy":"Pig", "Gonzo":"Unknown"},orient="index",columns=["MuppetType"])

muppet_year = pd.Series({"Gonzo":1970,"Kermit":1955,"Miss Piggy":1974})

muppetFrame3["MuppetYear"] = muppet_year

display(muppetFrame3)

| | MuppetType | MuppetYear |
|---|---|---|
| Kermit | Frog | 1955 |
| Miss Piggy | Pig | 1974 |
| Gonzo | Unknown | 1970 |
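Notice that `muppet_year` listed Gonzo first, yet each year still landed on the right row. When a Series is assigned as a new column, pandas matches it to the rows by index label. A plain list has no labels, so it is matched by position. The sketch below contrasts the two; the column names `YearFromSeries` and `YearFromList` are only for illustration.

```python
import pandas as pd

frame = pd.DataFrame({"MuppetType": ["Frog", "Pig", "Unknown"]},
                     index=["Kermit", "Miss Piggy", "Gonzo"])

# Series: matched to rows by index label, so the key order does not matter.
frame["YearFromSeries"] = pd.Series({"Gonzo": 1970, "Kermit": 1955,
                                     "Miss Piggy": 1974})

# List: matched by position, so the same ordering puts years on the wrong rows.
frame["YearFromList"] = [1970, 1955, 1974]

print(frame)
```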


In [67]:
muppet_gender = ["male","female","male"]

muppetFrame3["MuppetGender"] = muppet_gender

display(muppetFrame3)

| | MuppetType | MuppetYear | MuppetGender |
|---|---|---|---|
| Kermit | Frog | 1955 | male |
| Miss Piggy | Pig | 1974 | female |
| Gonzo | Unknown | 1970 | male |


In [68]:
muppetFrame3["MuppetDecade"] = muppetFrame3["MuppetYear"].map(lambda x: (x // 10)*10)

display(muppetFrame3)

muppetdf = muppetFrame3 

| | MuppetType | MuppetYear | MuppetGender | MuppetDecade |
|---|---|---|---|---|
| Kermit | Frog | 1955 | male | 1950 |
| Miss Piggy | Pig | 1974 | female | 1970 |
| Gonzo | Unknown | 1970 | male | 1970 |
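The decade column used `map()` with a lambda, which works for any function. For simple arithmetic like this, the same operators can also be applied to the whole column at once, as in the earlier section on adding and multiplying. A short sketch showing that both routes give the same result:

```python
import pandas as pd

years = pd.Series({"Kermit": 1955, "Miss Piggy": 1974, "Gonzo": 1970})

decade_map = years.map(lambda x: (x // 10) * 10)  # one value at a time
decade_vec = (years // 10) * 10                   # whole column at once

print(decade_vec)
print(decade_vec.equals(decade_map))
```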


## Returning Data from a DataFrame: Querying and Slicing

In [69]:
print(muppetdf.loc["Gonzo"])

MuppetType      Unknown
MuppetYear         1970
MuppetGender       male
MuppetDecade       1970
Name: Gonzo, dtype: object


In [70]:
print(muppetdf.iloc[2])

MuppetType      Unknown
MuppetYear         1970
MuppetGender       male
MuppetDecade       1970
Name: Gonzo, dtype: object


### Querying using iloc and loc

In [71]:
# xx.toDo -> examples of select item queries
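Until that cell is filled in, here is a minimal sketch of some select-item queries. It is my own illustration rather than the chapter's examples, and it rebuilds a two-column version of the muppet table so that it runs on its own. `.loc` takes labels (row first, then column) and `.iloc` takes integer positions.

```python
import pandas as pd

muppetdf = pd.DataFrame(
    {"MuppetType": ["Frog", "Pig", "Unknown"],
     "MuppetYear": [1955, 1974, 1970]},
    index=["Kermit", "Miss Piggy", "Gonzo"])

# A single cell, by labels and then by positions.
one_cell = muppetdf.loc["Gonzo", "MuppetYear"]
same_cell = muppetdf.iloc[2, 1]

# Several rows of one column, using a list of labels.
two_rows = muppetdf.loc[["Kermit", "Gonzo"], "MuppetType"]

# The first two rows, using a positional slice.
first_two = muppetdf.iloc[0:2]

print(one_cell, same_cell)
print(two_rows)
print(first_two)
```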

### Returning a slice of data

In [72]:
muppetdf["MuppetYear"] > 1967

Kermit        False
Miss Piggy     True
Gonzo          True
Name: MuppetYear, dtype: bool

In [73]:
muppetdf[muppetdf["MuppetYear"] > 1967]

| | MuppetType | MuppetYear | MuppetGender | MuppetDecade |
|---|---|---|---|---|
| Miss Piggy | Pig | 1974 | female | 1970 |
| Gonzo | Unknown | 1970 | male | 1970 |
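Boolean filters can be combined. The sketch below, on a rebuilt copy of the table, uses `&` for "and" (`|` would be "or"). Each condition needs its own parentheses because `&` binds more tightly than the comparison operators.

```python
import pandas as pd

muppetdf = pd.DataFrame(
    {"MuppetYear": [1955, 1974, 1970],
     "MuppetGender": ["male", "female", "male"]},
    index=["Kermit", "Miss Piggy", "Gonzo"])

# Rows that are both later than 1967 and male.
later_and_male = muppetdf[(muppetdf["MuppetYear"] > 1967) &
                          (muppetdf["MuppetGender"] == "male")]
print(later_and_male)
```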


### Exploring deep versus shallow copies 

In [74]:
# Attempt 1 (which will fail: chained indexing assigns to a temporary copy)
muppetdf.loc["Gonzo"]["MuppetType"] = "weirdo"

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


In [75]:
# Attempt 2 (which will succeed)
muppetdf.loc["Gonzo","MuppetType"] = "Weirdo"
display(muppetdf)

| | MuppetType | MuppetYear | MuppetGender | MuppetDecade |
|---|---|---|---|---|
| Kermit | Frog | 1955 | male | 1950 |
| Miss Piggy | Pig | 1974 | female | 1970 |
| Gonzo | Weirdo | 1970 | male | 1970 |


In [76]:
newmuppetdf = muppetdf

newmuppetdf.loc["Kermit","MuppetType"] = "Lizard" #change in newmuppetdf

display(muppetdf.loc["Kermit","MuppetType"]) #it appears in original muppetdf

'Lizard'
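The change shows up in `muppetdf` because `newmuppetdf = muppetdf` does not copy anything; it gives the same DataFrame a second name. To get an independent table, call `.copy()`, which makes a deep copy by default. A minimal sketch on a small rebuilt table (the names `same_object` and `independent` are only for illustration):

```python
import pandas as pd

muppetdf = pd.DataFrame({"MuppetType": ["Frog", "Pig"]},
                        index=["Kermit", "Miss Piggy"])

same_object = muppetdf          # a second name for the same DataFrame
independent = muppetdf.copy()   # a separate copy of the data

independent.loc["Kermit", "MuppetType"] = "Lizard"

print(same_object is muppetdf)                  # True
print(muppetdf.loc["Kermit", "MuppetType"])     # Frog, the original is untouched
print(independent.loc["Kermit", "MuppetType"])  # Lizard
```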