In [1]:
import pandas as pd

# Loops - Part 2

We've looked at looping previously; now we can apply those concepts to more complex problems involving datasets. For this we'll work with an ordinary tabular dataset, like a small version of what we'd use for machine learning.

In [2]:
df = pd.read_excel("../data/sportsref_download.xlsx", header=1)
df = df.head(32)
df.tail()

Unnamed: 0,Rk,Unnamed: 1,AvAge,GP,W,L,OL,PTS,PTS%,GF,...,PK%,SH,SHA,PIM/G,oPIM/G,S,S%,SA,SV%,SO
27,28.0,Vegas Golden Knights,28.2,6,2,4,0,4,0.333,13,...,80.0,1,0,9.0,8.8,208,6.3,202,0.901,0
28,29.0,Los Angeles Kings,28.2,6,1,4,1,3,0.25,14,...,58.82,0,1,7.2,10.0,218,6.4,184,0.891,0
29,30.0,Montreal Canadiens,28.3,7,1,6,0,2,0.143,11,...,64.0,0,0,7.7,7.4,192,5.7,201,0.876,0
30,31.0,Chicago Blackhawks,27.8,6,0,5,1,1,0.083,12,...,90.91,0,0,10.0,11.0,186,6.5,182,0.852,0
31,32.0,Arizona Coyotes,28.4,6,0,5,1,1,0.083,11,...,35.71,0,1,13.0,12.7,160,6.9,188,0.846,0


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 32 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Rk          32 non-null     float64
 1   Unnamed: 1  32 non-null     object 
 2   AvAge       32 non-null     float64
 3   GP          32 non-null     int64  
 4   W           32 non-null     int64  
 5   L           32 non-null     int64  
 6   OL          32 non-null     int64  
 7   PTS         32 non-null     int64  
 8   PTS%        32 non-null     float64
 9   GF          32 non-null     int64  
 10  GA          32 non-null     int64  
 11  SOW         32 non-null     float64
 12  SOL         32 non-null     float64
 13  SRS         32 non-null     float64
 14  SOS         32 non-null     float64
 15  GF/G        32 non-null     float64
 16  GA/G        32 non-null     float64
 17  PP          32 non-null     int64  
 18  PPO         32 non-null     int64  
 19  PP%         32 non-null     flo

## Series

A series is another data structure from pandas. A series can be thought of as roughly one column of a dataframe. In fact, if we slice a column from a dataframe, we get a series by default. Series are kind of like lists with much of the dataframe functionality.

A series keeps the convenient behaviour of a list and adds some advantages of its own:
<ul>
<li> Series have a type, like a column in a dataframe. </li>
<li> Series are mutable, like lists, and can be changed in place. </li>
<li> Series are iterable, like lists, and can be looped over. </li>
<li> Series have some "extra" methods to accomplish common tasks. </li>
    <ul>
    <li> There are methods to get the mean, median, mode, etc., as well as more elaborate ones like quantiles. </li>
    <li> There are methods to deal with missing or 'dirty' data values (e.g. dropna, fillna, drop_duplicates). </li>
    <li> There are interaction methods like head, tail, sample, etc. </li>
    </ul>
</ul>
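
As a rough illustration of the "extra" methods listed above, here is a minimal sketch on a small made-up series. The values are hypothetical and are not taken from the hockey dataset.

```python
import pandas as pd

# Hypothetical points totals with one missing value
points = pd.Series([12, 10, 10, None, 9, 8])

print(points.dtype)            # float64, because the missing value is stored as NaN
print(points.mean())           # statistics skip NaN values by default
print(points.median())
print(points.quantile(0.25))   # first quartile

cleaned = points.dropna()          # new series without the missing value
filled = points.fillna(0)          # new series with NaN replaced by 0
unique = points.drop_duplicates()  # keeps the first occurrence of each value

print(len(points), len(cleaned), len(unique))
print(points.head(2))              # first two rows, handy for a quick look
```

None of these calls change `points` itself; each returns a new series.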

![Series to Dataframe](../../images/series_dataframe.png "Series to Dataframe")
![Series to Dataframe](../images/series_dataframe.png "Series to Dataframe")

We can also declare a series using the constructor, pd.Series(), into which we can optionally pass another data structure, such as:
<ul>
<li> List </li>
<li> Dictionary </li>
<li> Numpy array </li>
</ul>

The documentation for Series from Pandas is reasonably clear: https://pandas.pydata.org/docs/reference/api/pandas.Series.html 
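
A minimal sketch of the constructor with each of the three inputs mentioned above. The variable names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# From a list: pandas assigns a default integer index 0, 1, 2
from_list = pd.Series([3, 1, 2])

# From a dictionary: the keys become the index labels
from_dict = pd.Series({"Oilers": 10, "Flames": 9})

# From a numpy array, with an optional name (similar to a column label)
from_array = pd.Series(np.array([0.5, 0.25]), name="pct")

print(from_list)
print(from_dict)
print(from_array)
```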

#### Using Series

We use series largely like we'd use a list, and conveniently many functions that we use will work on both. For example, we can use the len() function to get the length of a series, just like we would a list. When manipulating dataframes we commonly get a series by default if we slice a column or do something resulting in a one column dataframe. We can also use a series if we want to use some of its extra functionality, like cleaning up the data or doing some statistical calculations. 

For the most part, we don't strictly have to use a series; other data structures will do just fine. In practice, we use them pretty often when doing data science work, either because it is what we end up with or because we want to use some of the extra functionality. A series is marginally more resource intensive than a list, but not enough to worry about. When dealing with machine learning we may use a series regularly because we are often dealing with data in dataframes, and a series fits directly into that model of thinking.

In [4]:
# Type information

teamNames = df["Unnamed: 1"]
print(type(teamNames))
print(teamNames.dtype)
print(teamNames.head())


<class 'pandas.core.series.Series'>
object
0       Florida Panthers
1    Carolina Hurricanes
2        Edmonton Oilers
3        St. Louis Blues
4         Minnesota Wild
Name: Unnamed: 1, dtype: object


### In-Place vs. Not In-Place

Several methods that we use with a series or dataframe have an "in-place" argument, spelled inplace in the code, which controls whether the function call modifies the original object. By default, inplace is set to False, meaning that the original object is not modified, so we must capture the return value in a variable in order to keep it.

If we want to modify the original object, we can set inplace to True. In this case, there's no need to capture the return value, since the original object is modified.
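
The difference can be sketched on a toy dataframe. The column names and values here are hypothetical; only the `drop` method and its `inplace` argument come from pandas.

```python
import pandas as pd

# Toy dataframe with made-up values
standings = pd.DataFrame({"team": ["A", "B"], "W": [3, 1], "L": [1, 3]})

# Default (inplace=False): the original is untouched, so we keep the returned copy
trimmed = standings.drop(columns=["L"])
print(list(standings.columns))  # ['team', 'W', 'L']
print(list(trimmed.columns))    # ['team', 'W']

# inplace=True: the original object itself is changed, nothing needs to be captured
standings.drop(columns=["W"], inplace=True)
print(list(standings.columns))  # ['team', 'L']
```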

For the most part, this difference won't impact your overall program at all, but it is an easy place to create a bug. As well, creating a new object can be expensive in terms of memory and potentially time, especially if we are dealing with large items like machine learning datasets. When our objects are large, we likely want to avoid creating new objects if we can. 

<b>Note:</b> It is rarely a concern with regular variables in data science work, as our programs aren't that large, but if you can imagine a program running on a server that stays active for weeks at a time, this can be an impactful memory leak. Python does have a garbage collector, or automated system to free memory from unneeded items, but it is not perfect and there can be scenarios where it doesn't capture all the items that are no longer needed.

In [5]:
teamNames = teamNames.dropna()

In the example code below, check the "Variables" table in VS Code to see the size of each variable (we don't get the size in KB, but we could calculate it as the number of elements times the size of each element). If we want to see the size of a single variable, we can use the sys.getsizeof() function, which returns the size in bytes.

In [6]:
# Not in-place: df is unchanged, the trimmed copy is stored in df2
df2 = df.drop(columns=["Rk", "W", "L", "SHA", "S", "GF", "GA", "SO", "PIM/G", "oPIM/G", "S%"])

In [7]:
# In-place: df itself loses these columns
df.drop(columns=["Rk", "W", "L"], inplace=True)

In [8]:
# Mutable
df2.iloc[0,0] = "The bestest team!!"
df2.head()

Unnamed: 0,Unnamed: 1,AvAge,GP,OL,PTS,PTS%,SOW,SOL,SRS,SOS,...,GA/G,PP,PPO,PP%,PPA,PPOA,PK%,SH,SA,SV%
0,The bestest team!!,27.5,6,0,12,1.0,0.0,0.0,2.39,-0.11,...,2.0,5,25,20.0,4,27,85.19,1,189,0.937
1,Carolina Hurricanes,27.9,5,0,10,1.0,0.0,0.0,2.13,-0.67,...,1.6,6,19,31.58,2,20,90.0,0,149,0.946
2,Edmonton Oilers,29.4,5,0,10,1.0,1.0,0.0,1.94,-0.26,...,2.6,8,17,47.06,2,17,88.24,1,188,0.931
3,St. Louis Blues,28.8,5,0,10,1.0,0.0,0.0,1.83,-0.97,...,2.2,6,16,37.5,1,16,93.75,1,170,0.935
4,Minnesota Wild,29.4,6,0,10,0.833,0.0,0.0,0.53,0.2,...,3.0,4,22,18.18,8,26,69.23,0,167,0.892


In [9]:
import sys
print("DF size:", sys.getsizeof(df))
print("DF2 size:", sys.getsizeof(df2))

DF size: 9664
DF2 size: 7618
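
The "number of elements times element size" calculation mentioned above can be sketched for a single numeric series. The values are hypothetical; `size`, `dtype.itemsize`, and `nbytes` are standard pandas and numpy attributes.

```python
import sys
import pandas as pd

wins = pd.Series([4, 3, 3, 2, 0], dtype="int64")  # made-up values

# Number of elements times bytes per element gives the size of the raw data
print(wins.size, wins.dtype.itemsize)    # 5 elements, 8 bytes each
print(wins.size * wins.dtype.itemsize)   # 40
print(wins.nbytes)                       # the same figure, reported by pandas

# sys.getsizeof also counts the index and object overhead, so it reports more
print(sys.getsizeof(wins))
```

This is why the figures printed by sys.getsizeof() are larger than the raw data alone would suggest.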


## Enumeration

Enumeration is another tool for looping through iterable objects. It acts as a sort of combination of a for loop and a while loop: we automatically loop through every item without managing an index, like a for loop, and we also get an index or counter, like a while loop. The built-in enumerate() function takes an iterable as an argument and produces a tuple of the index and the item for each element, so we can apply it in the same way to many different types of iterables. We call it with enumerate(ITERABLE), and we can optionally pass a start value, which will be the index of the first item. Because each result is a tuple of the index and the value, we can capture both in a single variable, or we can capture them separately.
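
To make the comparison with a while loop concrete, here is a small sketch that builds the same (index, item) pairs both ways. The list of names is made up for illustration.

```python
teams = ["Oilers", "Flames", "Canucks"]  # hypothetical short list

# while loop: we manage the counter ourselves
i = 0
manual = []
while i < len(teams):
    manual.append((i, teams[i]))
    i += 1

# enumerate: the counter is managed for us
automatic = []
for index, team in enumerate(teams):
    automatic.append((index, team))

print(manual == automatic)  # True

# The optional start value sets the index of the first item
print(list(enumerate(teams, start=1)))
# [(1, 'Oilers'), (2, 'Flames'), (3, 'Canucks')]
```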

![Enumeration](../../images/enumeration.png "Enumeration")
![Enumeration](../images/enumeration.png "Enumeration")

<b>Note:</b> enumeration is seen fairly regularly when using large datasets for neural networks. 

### Enumerable Data Structures

We can use enumeration with any iterable data structure, including lists, dictionaries, series, and more, and it will work the same way. This gives us one additional layer of "abstraction" in our code - we don't need to worry about what the data structure is when using enumeration, as long as the data structure is iterable. This means we can swap out one data structure for another, and our code will still work. Yay for interchangeable parts!
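
A minimal sketch of that interchangeability: one helper function (the name `label_items` is made up) works unchanged on a list, a tuple, a series, and a dictionary.

```python
import pandas as pd

def label_items(items):
    """Return 'index: value' strings for any iterable."""
    return [f"{index}: {value}" for index, value in enumerate(items)]

names = ["Oilers", "Flames"]

print(label_items(names))                         # list
print(label_items(tuple(names)))                  # tuple
print(label_items(pd.Series(names)))              # series: iterates over the values
print(label_items({"Oilers": 10, "Flames": 9}))   # dictionary: iterates over the keys
```

Two details are worth keeping in mind: iterating a dictionary yields its keys rather than its values, and the number from enumerate() is a positional counter, not the series' own index labels.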

Again, looking forward to machine learning, it isn't uncommon to swap out data structures, particularly if we build something with a small dataset and then switch to a larger one later on.

### Index and Value

The enumeration always gives us both the index and the value when we are looping, which is convenient. In data science work we generally work with large datasets, and we also generally split and then recombine them back together, so the ability to manage where things are in a dataset is important. In particular, we may have data like images where the data is not in a tabular format, and we still need to track which image is which so we can align them with our predictions. Using enumeration will process through a dataset and maintain the index, so we can use that to align our data later.
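
As a rough sketch of why keeping the index matters, the loop below records the positions of the items that pass a check and then uses those positions to look the items up again. The data and the length check are stand-ins for a real dataset and a real prediction.

```python
# Hypothetical items; the length check stands in for a model's prediction
items = ["Oilers", "Canadiens", "Jets", "Hurricanes"]

flagged_positions = []
for index, name in enumerate(items):
    if len(name) > 6:
        flagged_positions.append(index)

print(flagged_positions)                       # [1, 3]
print([items[i] for i in flagged_positions])   # ['Canadiens', 'Hurricanes']
```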

In [10]:
for index, team in enumerate(teamNames):
    print(index, team)

0 Florida Panthers
1 Carolina Hurricanes
2 Edmonton Oilers
3 St. Louis Blues
4 Minnesota Wild
5 Washington Capitals
6 Buffalo Sabres
7 Calgary Flames
8 New York Rangers
9 San Jose Sharks
10 Columbus Blue Jackets
11 Pittsburgh Penguins
12 New York Islanders
13 Vancouver Canucks
14 Detroit Red Wings
15 Winnipeg Jets
16 Tampa Bay Lightning
17 Nashville Predators
18 Boston Bruins
19 Dallas Stars
20 New Jersey Devils
21 Anaheim Ducks
22 Philadelphia Flyers
23 Toronto Maple Leafs
24 Seattle Kraken
25 Ottawa Senators
26 Colorado Avalanche
27 Vegas Golden Knights
28 Los Angeles Kings
29 Montreal Canadiens
30 Chicago Blackhawks
31 Arizona Coyotes


We can also capture the values as a single tuple of index and value. In the example above, our tuple is "unpacked"; in the one below it is "packed", or all bundled up into one.

In [11]:
for team in enumerate(teamNames):
    print(team)

(0, 'Florida Panthers')
(1, 'Carolina Hurricanes')
(2, 'Edmonton Oilers')
(3, 'St. Louis Blues')
(4, 'Minnesota Wild')
(5, 'Washington Capitals')
(6, 'Buffalo Sabres')
(7, 'Calgary Flames')
(8, 'New York Rangers')
(9, 'San Jose Sharks')
(10, 'Columbus Blue Jackets')
(11, 'Pittsburgh Penguins')
(12, 'New York Islanders')
(13, 'Vancouver Canucks')
(14, 'Detroit Red Wings')
(15, 'Winnipeg Jets')
(16, 'Tampa Bay Lightning')
(17, 'Nashville Predators')
(18, 'Boston Bruins')
(19, 'Dallas Stars')
(20, 'New Jersey Devils')
(21, 'Anaheim Ducks')
(22, 'Philadelphia Flyers')
(23, 'Toronto Maple Leafs')
(24, 'Seattle Kraken')
(25, 'Ottawa Senators')
(26, 'Colorado Avalanche')
(27, 'Vegas Golden Knights')
(28, 'Los Angeles Kings')
(29, 'Montreal Canadiens')
(30, 'Chicago Blackhawks')
(31, 'Arizona Coyotes')


#### Index Offset

When performing enumeration we can also start our index at a different value than 0. This is useful if we are combining multiple datasets together, or if we are using a dataset that has an index that starts at a different value. One example is if we are dealing with large datasets where it is common to break them into "chunks" - if we were aiming to enumerate through all the items in the dataset, we would want to start our index at the correct value for the chunk we are working with.

In [12]:
for item in enumerate(teamNames, 20):
    print(item)

(20, 'Florida Panthers')
(21, 'Carolina Hurricanes')
(22, 'Edmonton Oilers')
(23, 'St. Louis Blues')
(24, 'Minnesota Wild')
(25, 'Washington Capitals')
(26, 'Buffalo Sabres')
(27, 'Calgary Flames')
(28, 'New York Rangers')
(29, 'San Jose Sharks')
(30, 'Columbus Blue Jackets')
(31, 'Pittsburgh Penguins')
(32, 'New York Islanders')
(33, 'Vancouver Canucks')
(34, 'Detroit Red Wings')
(35, 'Winnipeg Jets')
(36, 'Tampa Bay Lightning')
(37, 'Nashville Predators')
(38, 'Boston Bruins')
(39, 'Dallas Stars')
(40, 'New Jersey Devils')
(41, 'Anaheim Ducks')
(42, 'Philadelphia Flyers')
(43, 'Toronto Maple Leafs')
(44, 'Seattle Kraken')
(45, 'Ottawa Senators')
(46, 'Colorado Avalanche')
(47, 'Vegas Golden Knights')
(48, 'Los Angeles Kings')
(49, 'Montreal Canadiens')
(50, 'Chicago Blackhawks')
(51, 'Arizona Coyotes')


In [13]:
for item in enumerate(teamNames, -25420):
    print(item)

(-25420, 'Florida Panthers')
(-25419, 'Carolina Hurricanes')
(-25418, 'Edmonton Oilers')
(-25417, 'St. Louis Blues')
(-25416, 'Minnesota Wild')
(-25415, 'Washington Capitals')
(-25414, 'Buffalo Sabres')
(-25413, 'Calgary Flames')
(-25412, 'New York Rangers')
(-25411, 'San Jose Sharks')
(-25410, 'Columbus Blue Jackets')
(-25409, 'Pittsburgh Penguins')
(-25408, 'New York Islanders')
(-25407, 'Vancouver Canucks')
(-25406, 'Detroit Red Wings')
(-25405, 'Winnipeg Jets')
(-25404, 'Tampa Bay Lightning')
(-25403, 'Nashville Predators')
(-25402, 'Boston Bruins')
(-25401, 'Dallas Stars')
(-25400, 'New Jersey Devils')
(-25399, 'Anaheim Ducks')
(-25398, 'Philadelphia Flyers')
(-25397, 'Toronto Maple Leafs')
(-25396, 'Seattle Kraken')
(-25395, 'Ottawa Senators')
(-25394, 'Colorado Avalanche')
(-25393, 'Vegas Golden Knights')
(-25392, 'Los Angeles Kings')
(-25391, 'Montreal Canadiens')
(-25390, 'Chicago Blackhawks')
(-25389, 'Arizona Coyotes')
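
The chunking scenario described in this section can be sketched as follows. The list and the chunk size are made up; the point is that passing each chunk's offset as the start value keeps the indices in line with the full dataset.

```python
teams = ["Panthers", "Hurricanes", "Oilers", "Blues", "Wild"]  # hypothetical data
chunk_size = 2

rows = []
for offset in range(0, len(teams), chunk_size):
    chunk = teams[offset:offset + chunk_size]
    # Start counting at the chunk's offset so indices match the full list
    for index, team in enumerate(chunk, start=offset):
        rows.append((index, team))

print(rows)
print(rows == list(enumerate(teams)))  # True
```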


## Small Exercise

Create a function that takes in a Series of numbers (one of the columns from the dataset) and a Series of team names (another column from the dataset), and returns the team names in order of the numbers. There are lots of ways to do this, but you'll need to:
<ul>
<li> Use the number column to determine the correct order. </li>
<li> Use that order to set the team names in order. </li>
<li> Return the team names in order. </li>
<li> If this is easy, add some input validation to fail gracefully if the wrong data is passed in. </li>
</ul>

Right now, the data is sorted by the PTS column. We want to re-sort it according to whatever column we pass in, so if we supplied the PP% column, we should get the team with the highest PP% first and the team with the lowest PP% last.

Think of each step independently, and use some print statements to check what the values are at each step. Critically, this isn't something where there is a "correct answer": we have a few goals, there are many ways to get there, and we don't care which one we use. Plan it out with some pseudocode and then try to implement it.

In [14]:
#make sorter function


In [15]:
sorter(teamNames, df["SRS"])

0          Florida Panthers
1       Carolina Hurricanes
2       Philadelphia Flyers
3           Edmonton Oilers
4           St. Louis Blues
5             Boston Bruins
6       Washington Capitals
7            Calgary Flames
8            Buffalo Sabres
9             Winnipeg Jets
10      Pittsburgh Penguins
11          San Jose Sharks
12           Minnesota Wild
13            Anaheim Ducks
14    Columbus Blue Jackets
15      Nashville Predators
16        Vancouver Canucks
17       New York Islanders
18        New Jersey Devils
19      Tampa Bay Lightning
20             Dallas Stars
21        Los Angeles Kings
22        Detroit Red Wings
23         New York Rangers
24       Colorado Avalanche
25     Vegas Golden Knights
26           Seattle Kraken
27          Ottawa Senators
28      Toronto Maple Leafs
29          Arizona Coyotes
30       Montreal Canadiens
31       Chicago Blackhawks
dtype: object

## Exercise

Create a function called nextDown that returns the name of the next team below the one supplied, ranked by PTS:
<ul>
<li> The PTS column is the number of points a team has earned, and what defines their rank. The more points, the higher they are. </li>
    <ul>
    <li> E.g. if the team we are checking is the Edmonton Oilers, the answer should be the Buffalo Sabres, as that is the next team with fewer points; the teams in between are tied with Edmonton rather than lower in points. </li>
    <li> If we are checking the Florida Panthers, the Carolina Hurricanes are the next team down, as they have fewer points. </li>
    </ul>
<li> The function should take in the dataframe of standings and the team name you're checking as arguments. </li>
    <ul>
    <li> Think about what else you need (to select the data you're targeting), and how you can get it. </li>
    <li> You generally don't want things hard-coded. </li>
    </ul>
<li> Challenge 2 - Modify your function to also return, as a second return value, the columns in which the team that was originally supplied is greater than the average of all teams for that column. </li>
    <ul>
    <li> E.g. The Florida Panthers should return, "Carolina Hurricanes - ['PTS', 'PTS%', 'GF', 'SRS', 'GF/G', 'PP', 'PPO', 'PPOA', 'PK%', 'SH', 'SHA', 'PIM/G', 'oPIM/G', 'S', 'S%', 'SA', 'SV%']" or similar. </li>
    </ul>
</ul>

<b>Note:</b> There are a lot of ways to do this, try to use some enumeration. This is also a very good chance to include print statements for debugging - you're probably looping through the data a couple of times, start with a print() of what values are there and check to see that they are in line with what you expect, then build from that. 


In [16]:
df.head(10)

Unnamed: 0,Unnamed: 1,AvAge,GP,OL,PTS,PTS%,GF,GA,SOW,SOL,...,PK%,SH,SHA,PIM/G,oPIM/G,S,S%,SA,SV%,SO
0,Florida Panthers,27.5,6,0,12,1.0,27,12,0.0,0.0,...,85.19,1,1,10.8,11.8,210,12.9,189,0.937,0
1,Carolina Hurricanes,27.9,5,0,10,1.0,22,8,0.0,0.0,...,90.0,0,0,8.0,8.0,175,12.6,149,0.946,0
2,Edmonton Oilers,29.4,5,0,10,1.0,23,13,1.0,0.0,...,88.24,1,0,13.0,8.6,168,13.7,188,0.931,0
3,St. Louis Blues,28.8,5,0,10,1.0,25,11,0.0,0.0,...,93.75,1,0,9.0,10.2,174,14.4,170,0.935,1
4,Minnesota Wild,29.4,6,0,10,0.833,20,18,0.0,0.0,...,69.23,0,0,13.3,12.0,218,9.2,167,0.892,0
5,Washington Capitals,29.1,6,2,10,0.833,26,16,0.0,0.0,...,70.59,2,2,9.0,10.7,183,14.2,165,0.903,0
6,Buffalo Sabres,28.3,6,1,9,0.75,18,11,1.0,0.0,...,87.5,0,0,6.0,6.7,198,9.1,187,0.941,0
7,Calgary Flames,28.0,6,1,9,0.75,21,15,0.0,0.0,...,78.95,1,1,9.3,9.0,212,9.9,177,0.915,1
8,New York Rangers,26.2,7,1,9,0.643,15,18,0.0,0.0,...,78.26,0,1,10.0,10.9,189,7.9,220,0.918,0
9,San Jose Sharks,28.6,6,0,8,0.667,20,14,0.0,0.0,...,85.71,1,1,8.0,10.0,166,12.0,168,0.917,1


##### Suppress Warnings

One function that I used in my solution raises a warning because it is deprecated, or old. This block suppresses the warning so it doesn't pop up in the notebook and annoy my sensibilities. Warnings like this are somewhat common. When the libraries that we use change, the maintainers typically plan well ahead for changes that will cause old code to fail. So if a method is going to be removed or drastically changed, the old one is left in for a long time but marked as deprecated, and a warning is given. This gives us time to update our code before the old method is removed.

In [17]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [18]:
# write function




These are some test cases that I used for mine. 

In [19]:
nd1 = nextDown(df, "Edmonton Oilers")
print(nd1[0], "\n")
print(nd1[1])
#

Buffalo Sabres 

['AvAge', 'OL', 'PTS', 'PTS%', 'SOW', 'SRS', 'PP', 'PP%', 'PK%', 'S', 'SV%']


In [20]:
nd2 = nextDown(df, "Florida Panthers")
print(nd2[0], "\n")
print(nd2[1])

Carolina Hurricanes 

['PTS', 'PTS%', 'GF', 'SRS', 'GF/G', 'PP', 'PP%', 'PPOA', 'PK%', 'S%', 'SV%']


In [21]:
nd3 = nextDown(df, "Calgary Flames")
print(nd3[0], "\n")
print(nd3[1])

San Jose Sharks 

['AvAge', 'PTS', 'PTS%', 'GF', 'SRS', 'GF/G', 'PP', 'PPO', 'PP%', 'PK%', 'SH', 'SHA', 'S%', 'SV%', 'SO']
