# Iterating with .iterrows()
In the video, we discussed that .iterrows() returns each DataFrame row as an (index, pandas Series) tuple. But what does this mean? Let's explore with a few coding exercises.
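
As a minimal sketch of what that tuple looks like, the snippet below builds a small stand-in DataFrame and unpacks the first item that .iterrows() yields. The name `toy_df` is hypothetical, and its two rows are copied from the Pirates data used in this exercise.

```python
import pandas as pd

# Hypothetical two-row stand-in for pit_df
toy_df = pd.DataFrame({
    'Team': ['PIT', 'PIT'],
    'Year': [2012, 2011],
    'RS': [651, 610],
    'RA': [674, 712],
})

# .iterrows() is a generator; next() pulls the first (index, Series) tuple
first_item = next(toy_df.iterrows())
idx, row = first_item  # tuple unpacking, as in `for i, row in ...`

print(type(first_item))  # the pair itself is a plain tuple
print(idx)               # the row's index label
print(type(row))         # the row's data is a pandas Series
print(row['RS'])         # values are looked up by column name
```

Unpacking into two names (`idx, row`) and keeping the pair in one name (`first_item`) are the two styles the exercises below switch between.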

A pandas DataFrame has been loaded into your session called pit_df. This DataFrame contains the stats for the Major League Baseball team named the Pittsburgh Pirates (abbreviated as 'PIT') from the year 2008 to the year 2012. It has been printed into your console for convenience.

In [1]:
import pandas as pd 
pit_df = pd.read_csv('pit.csv')
pit_df

Unnamed: 0,Team,League,Year,RS,RA,W,G,Playoffs
0,PIT,NL,2012,651,674,79,162,0
1,PIT,NL,2011,610,712,72,162,0
2,PIT,NL,2010,587,866,57,162,0
3,PIT,NL,2009,636,768,62,161,0
4,PIT,NL,2008,735,884,67,162,0


In [2]:
# Iterate over pit_df and print each row
for i,row in pit_df.iterrows():
    print(row)

Team         PIT
League        NL
Year        2012
RS           651
RA           674
W             79
G            162
Playoffs       0
Name: 0, dtype: object
Team         PIT
League        NL
Year        2011
RS           610
RA           712
W             72
G            162
Playoffs       0
Name: 1, dtype: object
Team         PIT
League        NL
Year        2010
RS           587
RA           866
W             57
G            162
Playoffs       0
Name: 2, dtype: object
Team         PIT
League        NL
Year        2009
RS           636
RA           768
W             62
G            161
Playoffs       0
Name: 3, dtype: object
Team         PIT
League        NL
Year        2008
RS           735
RA           884
W             67
G            162
Playoffs       0
Name: 4, dtype: object


In [3]:
# Iterate over pit_df and print each index variable, then each row and its type
for i,row in pit_df.iterrows():
    print(i)
    print(row)
    print(type(row))

0
Team         PIT
League        NL
Year        2012
RS           651
RA           674
W             79
G            162
Playoffs       0
Name: 0, dtype: object
<class 'pandas.core.series.Series'>
1
Team         PIT
League        NL
Year        2011
RS           610
RA           712
W             72
G            162
Playoffs       0
Name: 1, dtype: object
<class 'pandas.core.series.Series'>
2
Team         PIT
League        NL
Year        2010
RS           587
RA           866
W             57
G            162
Playoffs       0
Name: 2, dtype: object
<class 'pandas.core.series.Series'>
3
Team         PIT
League        NL
Year        2009
RS           636
RA           768
W             62
G            161
Playoffs       0
Name: 3, dtype: object
<class 'pandas.core.series.Series'>
4
Team         PIT
League        NL
Year        2008
RS           735
RA           884
W             67
G            162
Playoffs       0
Name: 4, dtype: object
<class 'pandas.core.series.Series'>


In [4]:
# Use one variable instead of two to store the result of .iterrows()
for row_tuple in pit_df.iterrows():
    print(row_tuple)

(0, Team         PIT
League        NL
Year        2012
RS           651
RA           674
W             79
G            162
Playoffs       0
Name: 0, dtype: object)
(1, Team         PIT
League        NL
Year        2011
RS           610
RA           712
W             72
G            162
Playoffs       0
Name: 1, dtype: object)
(2, Team         PIT
League        NL
Year        2010
RS           587
RA           866
W             57
G            162
Playoffs       0
Name: 2, dtype: object)
(3, Team         PIT
League        NL
Year        2009
RS           636
RA           768
W             62
G            161
Playoffs       0
Name: 3, dtype: object)
(4, Team         PIT
League        NL
Year        2008
RS           735
RA           884
W             67
G            162
Playoffs       0
Name: 4, dtype: object)


In [5]:
# Print the row and type of each row
for row_tuple in pit_df.iterrows():
    print(row_tuple)
    print(type(row_tuple))

(0, Team         PIT
League        NL
Year        2012
RS           651
RA           674
W             79
G            162
Playoffs       0
Name: 0, dtype: object)
<class 'tuple'>
(1, Team         PIT
League        NL
Year        2011
RS           610
RA           712
W             72
G            162
Playoffs       0
Name: 1, dtype: object)
<class 'tuple'>
(2, Team         PIT
League        NL
Year        2010
RS           587
RA           866
W             57
G            162
Playoffs       0
Name: 2, dtype: object)
<class 'tuple'>
(3, Team         PIT
League        NL
Year        2009
RS           636
RA           768
W             62
G            161
Playoffs       0
Name: 3, dtype: object)
<class 'tuple'>
(4, Team         PIT
League        NL
Year        2008
RS           735
RA           884
W             67
G            162
Playoffs       0
Name: 4, dtype: object)
<class 'tuple'>


# Run differentials with .iterrows()
You've been hired by the San Francisco Giants as an analyst—congrats! The team's owner wants you to calculate a metric called the run differential for each season from the year 2008 to 2012. This metric is calculated by subtracting the total number of runs a team allowed in a season from the team's total number of runs scored in a season. 'RS' means runs scored and 'RA' means runs allowed.

The below function calculates this metric:

```python
def calc_run_diff(runs_scored, runs_allowed):

    run_diff = runs_scored - runs_allowed

    return run_diff
```
A DataFrame has been loaded into your session as giants_df and printed into the console. Let's practice using .iterrows() to add a run differential column to this DataFrame.

In [6]:
import pandas as pd 
giants_df = pd.read_csv('giants.csv')

def calc_run_diff(runs_scored, runs_allowed):

    run_diff = runs_scored - runs_allowed

    return run_diff

# Create an empty list to store run differentials
run_diffs = []

# Write a for loop and collect runs allowed and runs scored for each row
for i,row in giants_df.iterrows():
    runs_scored = row['RS']
    runs_allowed = row['RA']
    
    # Use the provided function to calculate run_diff for each row
    run_diff = calc_run_diff(runs_scored, runs_allowed)
    
    # Append each run differential to the output list
    run_diffs.append(run_diff)

giants_df['RD'] = run_diffs
print(giants_df)

  Team League  Year   RS   RA   W    G  Playoffs   RD
0  SFG     NL  2012  718  649  94  162         1   69
1  SFG     NL  2011  570  578  86  162         0   -8
2  SFG     NL  2010  697  583  92  162         1  114
3  SFG     NL  2009  657  611  88  162         0   46
4  SFG     NL  2008  640  759  72  162         0 -119
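
As a cross-check, the same function can also be called once on whole columns instead of once per row. The sketch below assumes a hypothetical `giants_toy` DataFrame rebuilt from the figures printed above and compares the two results.

```python
import pandas as pd

# Giants figures copied from the output above (giants_toy is a hypothetical name)
giants_toy = pd.DataFrame({
    'Year': [2012, 2011, 2010, 2009, 2008],
    'RS': [718, 570, 697, 657, 640],
    'RA': [649, 578, 583, 611, 759],
})

def calc_run_diff(runs_scored, runs_allowed):
    return runs_scored - runs_allowed

# Row by row with .iterrows(); int() converts each result to a plain Python int
loop_diffs = [int(calc_run_diff(row['RS'], row['RA']))
              for _, row in giants_toy.iterrows()]

# Whole columns at once: subtracting two Series works element by element
column_diffs = calc_run_diff(giants_toy['RS'], giants_toy['RA'])

# Both approaches should produce the same values in the same order
print(loop_diffs == column_diffs.tolist())
```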


# Iterating with .itertuples()
Remember, .itertuples() returns each DataFrame row as a special data type called a namedtuple. You can look up an attribute within a namedtuple with dot notation (for example, row.Year). Let's practice working with namedtuples.
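
To see how a namedtuple behaves on its own, here is a small sketch using Python's standard `collections.namedtuple`. The `Season` type and its field values are made up for illustration; pandas builds its own namedtuple type for you.

```python
from collections import namedtuple

# Hypothetical namedtuple type mirroring a few fields of a DataFrame row
Season = namedtuple('Season', ['Index', 'Year', 'W'])
season = Season(Index=0, Year=2012, W=93)

print(season.Year)     # attribute (dot) lookup by field name
print(season[1])       # positional lookup still works, as with any tuple
print(season._fields)  # the field names, in order
```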

A pandas DataFrame has been loaded into your session called rangers_df. This DataFrame contains the stats ('Team', 'League', 'Year', 'RS', 'RA', 'W', 'G', and 'Playoffs') for the Major League Baseball team named the Texas Rangers (abbreviated as 'TEX').

In [7]:
rangers_df = pd.read_csv('rangers.csv')
rangers_df

Unnamed: 0,Team,League,Year,RS,RA,W,G,Playoffs
0,TEX,AL,2012,808,707,93,162,1
1,TEX,AL,2011,855,677,96,162,1
2,TEX,AL,2010,787,687,90,162,1
3,TEX,AL,2009,784,740,87,162,0
4,TEX,AL,2008,901,967,79,162,0
5,TEX,AL,2007,816,844,75,162,0
6,TEX,AL,2006,835,784,80,162,0
7,TEX,AL,2005,865,858,79,162,0
8,TEX,AL,2004,860,794,89,162,0
9,TEX,AL,2003,826,969,71,162,0


In [8]:
# Loop over the DataFrame and print each row
for row_tuple in rangers_df.itertuples():
  print(row_tuple)

Pandas(Index=0, Team=' TEX', League=' AL', Year=2012, RS=808, RA=707, W=93, G=162, Playoffs=1)
Pandas(Index=1, Team=' TEX', League=' AL', Year=2011, RS=855, RA=677, W=96, G=162, Playoffs=1)
Pandas(Index=2, Team=' TEX', League=' AL', Year=2010, RS=787, RA=687, W=90, G=162, Playoffs=1)
Pandas(Index=3, Team=' TEX', League=' AL', Year=2009, RS=784, RA=740, W=87, G=162, Playoffs=0)
Pandas(Index=4, Team=' TEX', League=' AL', Year=2008, RS=901, RA=967, W=79, G=162, Playoffs=0)
Pandas(Index=5, Team=' TEX', League=' AL', Year=2007, RS=816, RA=844, W=75, G=162, Playoffs=0)
Pandas(Index=6, Team=' TEX', League=' AL', Year=2006, RS=835, RA=784, W=80, G=162, Playoffs=0)
Pandas(Index=7, Team=' TEX', League=' AL', Year=2005, RS=865, RA=858, W=79, G=162, Playoffs=0)
Pandas(Index=8, Team=' TEX', League=' AL', Year=2004, RS=860, RA=794, W=89, G=162, Playoffs=0)
Pandas(Index=9, Team=' TEX', League=' AL', Year=2003, RS=826, RA=969, W=71, G=162, Playoffs=0)
Pandas(Index=10, Team='TEX', League=' AL', Year=20

In [9]:
# Loop over the DataFrame and print each row's Index, Year and Wins (W)
for row in rangers_df.itertuples():
  i = row.Index
  year = row.Year
  wins = row.W
  print(i, year, wins)

0 2012 93
1 2011 96
2 2010 90
3 2009 87
4 2008 79
5 2007 75
6 2006 80
7 2005 79
8 2004 89
9 2003 71
10 2002 72
11 2001 73
12 2000 71
13 1999 95
14 1998 88
15 1997 77
16 1996 90
17 1993 86
18 1992 77
19 1991 85
20 1990 83
21 1989 83
22 1988 70
23 1987 75
24 1986 87
25 1985 62
26 1984 69
27 1983 77
28 1982 64
29 1980 76
30 1979 83
31 1978 87
32 1977 94
33 1976 76
34 1975 79
35 1974 83
36 1973 57


In [10]:
# Loop over the DataFrame and print each row's Index, Year and Wins (W)
for row in rangers_df.itertuples():
  i = row.Index
  year = row.Year
  wins = row.W
  
  # Check if rangers made Playoffs (1 means yes; 0 means no)
  if row.Playoffs == 1:
    print(i, year, wins)

0 2012 93
1 2011 96
2 2010 90
13 1999 95
14 1998 88
16 1996 90


# Run differentials with .itertuples()
The New York Yankees have made a trade with the San Francisco Giants for your analyst contract; you're a hot commodity! Your new boss has seen your work with the Giants and now wants you to do something similar with the Yankees data. He'd like you to calculate run differentials for the Yankees from the year 1962 to the year 2012 and find the season in which they had the best run differential.

You've remembered the function you used when working with the Giants and quickly write it down:

```python
def calc_run_diff(runs_scored, runs_allowed):

    run_diff = runs_scored - runs_allowed

    return run_diff
```

Let's use .itertuples() to loop over the yankees_df DataFrame (which has been loaded into your session) and calculate run differentials.

In [11]:
yankees_df = pd.read_csv('yankees.csv')
yankees_df

Unnamed: 0,Team,League,Year,RS,RA,W,G,Playoffs
0,NYY,AL,2012,804,668,95,162,1
1,NYY,AL,2011,867,657,97,162,1
2,NYY,AL,2010,859,693,95,162,1
3,NYY,AL,2009,915,753,103,162,1
4,NYY,AL,2008,789,727,89,162,0
5,NYY,AL,2007,968,777,94,162,1
6,NYY,AL,2006,930,767,97,162,1
7,NYY,AL,2005,886,789,95,162,1
8,NYY,AL,2004,897,808,101,162,1
9,NYY,AL,2003,877,716,101,163,1


In [12]:
def calc_run_diff(runs_scored, runs_allowed):

    run_diff = runs_scored - runs_allowed

    return run_diff

In [13]:
run_diffs = []

# Loop over the DataFrame and calculate each row's run differential
for row in yankees_df.itertuples():
    
    runs_scored = row.RS
    runs_allowed = row.RA

    run_diff = calc_run_diff(runs_scored, runs_allowed)
    
    run_diffs.append(run_diff)

# Append new column
yankees_df['RD'] = run_diffs
print(yankees_df)

    Team League  Year   RS   RA    W    G  Playoffs   RD
0    NYY     AL  2012  804  668   95  162         1  136
1    NYY     AL  2011  867  657   97  162         1  210
2    NYY     AL  2010  859  693   95  162         1  166
3    NYY     AL  2009  915  753  103  162         1  162
4    NYY     AL  2008  789  727   89  162         0   62
5    NYY     AL  2007  968  777   94  162         1  191
6    NYY     AL  2006  930  767   97  162         1  163
7    NYY     AL  2005  886  789   95  162         1   97
8    NYY     AL  2004  897  808  101  162         1   89
9    NYY     AL  2003  877  716  101  163         1  161
10   NYY     AL  2002  897  697  103  161         1  200
11   NYY     AL  2001  804  713   95  161         1   91
12   NYY     AL  2000  871  814   87  161         1   57
13   NYY     AL  1999  900  731   98  162         1  169
14   NYY     AL  1998  965  656  114  162         1  309
15   NYY     AL  1997  891  688   96  162         1  203
16   NYY     AL  1996  871  787
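
Once the 'RD' column exists, one way to find the season with the best run differential is `Series.idxmax()`. The sketch below uses a hypothetical `yankees_toy` DataFrame holding only three seasons copied from the output above, so its answer applies to those three rows and not necessarily to the full 1962 to 2012 range.

```python
import pandas as pd

# Three seasons copied from the output above; not the full data set
yankees_toy = pd.DataFrame({
    'Year': [2012, 2011, 1998],
    'RS': [804, 867, 965],
    'RA': [668, 657, 656],
})
yankees_toy['RD'] = yankees_toy['RS'] - yankees_toy['RA']

# idxmax() returns the index label of the largest RD value
best_label = yankees_toy['RD'].idxmax()

# .loc looks up the full row for that label
best_season = yankees_toy.loc[best_label]

print(best_season['Year'], best_season['RD'])
```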

# Analyzing baseball stats with .apply()
The Tampa Bay Rays want you to analyze their data.

They'd like the following metrics:

- The sum of each column in the data
- The total number of runs scored in each year's games ('RS' + 'RA' for each year)
- The 'Playoffs' column in text format rather than using 1's and 0's

The below function can be used to convert the 'Playoffs' column to text:

```python
def text_playoffs(num_playoffs): 
    if num_playoffs == 1:
        return 'Yes'
    else:
        return 'No' 
```

Use .apply() to get these metrics. A DataFrame (rays_df) has been loaded and printed to the console. This DataFrame is indexed on the 'Year' column.
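
The `axis` argument decides what .apply() hands to the function: each column (`axis=0`) or each row (`axis=1`). Here is a minimal sketch, assuming a hypothetical `rays_toy` DataFrame that holds only the 'RS' and 'RA' figures from rays_df and is indexed on the year.

```python
import pandas as pd

# RS and RA figures from rays_df, indexed on the year (rays_toy is a hypothetical name)
rays_toy = pd.DataFrame({
    'RS': [697, 707, 802, 803, 774],
    'RA': [577, 614, 649, 754, 671],
}, index=[2012, 2011, 2010, 2009, 2008])

# axis=0: the function receives each column, so the result has one value per column
col_totals = rays_toy.apply(sum, axis=0)

# axis=1: the function receives each row, so the result has one value per row
row_totals = rays_toy.apply(sum, axis=1)

print(col_totals)
print(row_totals)
```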

In [14]:
rays_df = pd.read_csv('rays.csv')
rays_df

Unnamed: 0,Year,RS,RA,W,Playoffs
0,2012,697,577,90,0
1,2011,707,614,91,1
2,2010,802,649,96,1
3,2009,803,754,84,0
4,2008,774,671,97,1


In [15]:
# Display rays_df indexed on 'Year' (the result is not assigned, so rays_df itself is unchanged)
rays_df.set_index('Year')

Year,RS,RA,W,Playoffs
2012,697,577,90,0
2011,707,614,91,1
2010,802,649,96,1
2009,803,754,84,0
2008,774,671,97,1


In [16]:
# Gather sum of all columns
stat_totals = rays_df.apply(sum, axis=0)
print(stat_totals)

Year        10050
RS           3783
RA           3265
W             458
Playoffs        3
dtype: int64


In [17]:
# Gather total runs scored in all games per year
total_runs_scored = rays_df[['RS', 'RA']].apply(sum, axis=1)
print(total_runs_scored)

0    1274
1    1321
2    1451
3    1557
4    1445
dtype: int64


In [18]:
def text_playoffs(num_playoffs): 
    if num_playoffs == 1:
        return 'Yes'
    else:
        return 'No' 
        
# Convert numeric playoffs to text by applying text_playoffs()
textual_playoffs = rays_df.apply(lambda row: text_playoffs(row['Playoffs']), axis=1)
print(textual_playoffs)

0     No
1    Yes
2    Yes
3     No
4    Yes
dtype: object


# Settle a debate with .apply()
Word has gotten to the Arizona Diamondbacks about your awesome analytics skills. They'd like for you to help settle a debate amongst the managers. One manager claims that the team has made the playoffs every year they have had a win percentage of 0.50 or greater. Another manager says this is not true.

Let's use the below function and the .apply() method to see which manager is correct.

```python
import numpy as np

def calc_win_perc(wins, games_played):
    win_perc = wins / games_played
    return np.round(win_perc,2)
```    

A DataFrame named dbacks_df has been loaded into your session.

In [19]:
dbacks_df = pd.read_csv('dbacks.csv')
dbacks_df

Unnamed: 0,Team,League,Year,RS,RA,W,G,Playoffs
0,ARI,NL,2012,734,688,81,162,0
1,ARI,NL,2011,731,662,94,162,1
2,ARI,NL,2010,713,836,65,162,0
3,ARI,NL,2009,720,782,70,162,0
4,ARI,NL,2008,720,706,82,162,0
5,ARI,NL,2007,712,732,90,162,1
6,ARI,NL,2006,773,788,76,162,0
7,ARI,NL,2005,696,856,77,162,0
8,ARI,NL,2004,615,899,51,162,0
9,ARI,NL,2003,717,685,84,162,0


In [20]:
# Display the first five rows of the DataFrame
print(dbacks_df.head())

  Team League  Year   RS   RA   W    G  Playoffs
0  ARI     NL  2012  734  688  81  162         0
1  ARI     NL  2011  731  662  94  162         1
2  ARI     NL  2010  713  836  65  162         0
3  ARI     NL  2009  720  782  70  162         0
4  ARI     NL  2008  720  706  82  162         0


In [21]:
import numpy as np

def calc_win_perc(wins, games_played):
    win_perc = wins / games_played
    return np.round(win_perc,2)

# Create a win percentage Series 
win_percs = dbacks_df.apply(lambda row: calc_win_perc(row['W'], row['G']), axis=1)
print(win_percs, '\n')

0     0.50
1     0.58
2     0.40
3     0.43
4     0.51
5     0.56
6     0.47
7     0.48
8     0.31
9     0.52
10    0.60
11    0.57
12    0.52
13    0.62
14    0.40
dtype: float64 



In [22]:
# Append a new column to dbacks_df
dbacks_df['WP'] = win_percs
print(dbacks_df, '\n')

# Display dbacks_df where WP is greater than or equal to 0.50
print(dbacks_df[dbacks_df['WP'] >= 0.50])

   Team League  Year   RS   RA    W    G  Playoffs    WP
0   ARI     NL  2012  734  688   81  162         0  0.50
1   ARI     NL  2011  731  662   94  162         1  0.58
2   ARI     NL  2010  713  836   65  162         0  0.40
3   ARI     NL  2009  720  782   70  162         0  0.43
4   ARI     NL  2008  720  706   82  162         0  0.51
5   ARI     NL  2007  712  732   90  162         1  0.56
6   ARI     NL  2006  773  788   76  162         0  0.47
7   ARI     NL  2005  696  856   77  162         0  0.48
8   ARI     NL  2004  615  899   51  162         0  0.31
9   ARI     NL  2003  717  685   84  162         0  0.52
10  ARI     NL  2002  819  674   98  162         1  0.60
11  ARI     NL  2001  818  677   92  162         1  0.57
12  ARI     NL  2000  792  754   85  162         0  0.52
13  ARI     NL  1999  908  676  100  162         1  0.62
14  ARI     NL  1998  665  812   65  162         0  0.40 

   Team League  Year   RS   RA    W    G  Playoffs    WP
0   ARI     NL  2012  734  68
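
The filtered output above is cut off, so here is one way the claim could be checked directly: look for seasons with a win percentage of at least 0.50 in which the team still missed the playoffs. The sketch assumes a hypothetical `dbacks_toy` DataFrame with four seasons copied from the printed data.

```python
import numpy as np
import pandas as pd

# Four seasons copied from the output above (dbacks_toy is a hypothetical name)
dbacks_toy = pd.DataFrame({
    'Year': [2012, 2011, 2008, 2007],
    'W': [81, 94, 82, 90],
    'G': [162, 162, 162, 162],
    'Playoffs': [0, 1, 0, 1],
})
dbacks_toy['WP'] = np.round(dbacks_toy['W'] / dbacks_toy['G'], 2)

# Seasons at or above 0.50 that still missed the playoffs
missed = (dbacks_toy['WP'] >= 0.50) & (dbacks_toy['Playoffs'] == 0)
counterexamples = dbacks_toy[missed]

print(counterexamples[['Year', 'WP', 'Playoffs']])
```

Any row in `counterexamples` contradicts the first manager's claim. In the data printed above, 2012 (WP 0.50) and 2008 (WP 0.51) are both such seasons.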

# Replacing .iloc with underlying arrays
Now that you have a better grasp of a DataFrame's internals, let's update one of your previous analyses to leverage a DataFrame's underlying arrays. You'll revisit the win percentage calculations you performed row by row with the .iloc indexer:

```python
def calc_win_perc(wins, games_played):
    win_perc = wins / games_played
    return np.round(win_perc,2)

win_percs_list = []

for i in range(len(baseball_df)):
    row = baseball_df.iloc[i]

    wins = row['W']
    games_played = row['G']

    win_perc = calc_win_perc(wins, games_played)

    win_percs_list.append(win_perc)

baseball_df['WP'] = win_percs_list
```
Let's update this analysis to use arrays instead of the .iloc indexer. A DataFrame (baseball_df) has been loaded into your session.

In [23]:
import numpy as np 
import pandas as pd 

baseball_df = pd.read_csv('baseball_stats.csv')

In [24]:
baseball_df.head()

Unnamed: 0,Team,League,Year,RS,RA,W,OBP,SLG,BA,Playoffs,RankSeason,RankPlayoffs,G,OOBP,OSLG
0,ARI,NL,2012,734,688,81,0.328,0.418,0.259,0,,,162,0.317,0.415
1,ATL,NL,2012,700,600,94,0.32,0.389,0.247,1,4.0,5.0,162,0.306,0.378
2,BAL,AL,2012,712,705,93,0.311,0.417,0.247,1,5.0,4.0,162,0.315,0.403
3,BOS,AL,2012,734,806,69,0.315,0.415,0.26,0,,,162,0.331,0.428
4,CHC,NL,2012,613,759,61,0.302,0.378,0.24,0,,,162,0.335,0.424


In [25]:
def calc_win_perc(wins, games_played):
    win_perc = wins / games_played
    return np.round(win_perc,2)

win_percs_list = []

for i in range(len(baseball_df)):
    row = baseball_df.iloc[i]

    wins = row['W']
    games_played = row['G']

    win_perc = calc_win_perc(wins, games_played)

    win_percs_list.append(win_perc)

baseball_df['WP'] = win_percs_list

# Pass the W and G columns (as pandas Series) to calc_win_perc
win_percs_np = calc_win_perc(baseball_df['W'], baseball_df['G'])

In [26]:
# Use the W array and G array to calculate win percentages
win_percs_np = calc_win_perc(baseball_df['W'].values, baseball_df['G'].values)

# Append a new column to baseball_df that stores all win percentages
baseball_df['WP'] = win_percs_np

print(baseball_df.head())

  Team League  Year   RS   RA   W    OBP    SLG     BA  Playoffs  RankSeason  \
0  ARI     NL  2012  734  688  81  0.328  0.418  0.259         0         NaN   
1  ATL     NL  2012  700  600  94  0.320  0.389  0.247         1         4.0   
2  BAL     AL  2012  712  705  93  0.311  0.417  0.247         1         5.0   
3  BOS     AL  2012  734  806  69  0.315  0.415  0.260         0         NaN   
4  CHC     NL  2012  613  759  61  0.302  0.378  0.240         0         NaN   

   RankPlayoffs    G   OOBP   OSLG    WP  
0           NaN  162  0.317  0.415  0.50  
1           5.0  162  0.306  0.378  0.58  
2           4.0  162  0.315  0.403  0.57  
3           NaN  162  0.331  0.428  0.43  
4           NaN  162  0.335  0.424  0.38  
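
To make the "underlying arrays" idea concrete, the sketch below shows that `.values` returns a NumPy array and that dividing two arrays works element by element. `toy_df` is a hypothetical stand-in with three rows copied from the output above. `.to_numpy()` is included because the pandas documentation recommends it over `.values`; both return an array here.

```python
import numpy as np
import pandas as pd

# Three rows copied from the head() output above (toy_df is a hypothetical name)
toy_df = pd.DataFrame({'W': [81, 94, 93], 'G': [162, 162, 162]})

wins = toy_df['W'].values        # NumPy array backing the column
games = toy_df['G'].to_numpy()   # equivalent accessor recommended by pandas

print(type(wins))

# Dividing two arrays works element by element, so no Python loop is needed
win_percs = np.round(wins / games, 2)
print(win_percs)
```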


# Bringing it all together: Predict win percentage
A pandas DataFrame (baseball_df) has been loaded into your session. For convenience, a dictionary describing each column within baseball_df has been printed into your console. You can reference these descriptions throughout the exercise.

You'd like to attempt to predict a team's win percentage for a given season by using the team's total runs scored in a season ('RS') and total runs allowed in a season ('RA') with the following function:

```python
def predict_win_perc(RS, RA):
    prediction = RS ** 2 / (RS ** 2 + RA ** 2)
    return np.round(prediction, 2)
```

Let's compare the approaches you've learned to calculate a predicted win percentage for each season (or row) in your DataFrame.

In [27]:
win_perc_preds_loop = []

def predict_win_perc(RS, RA):
    prediction = RS ** 2 / (RS ** 2 + RA ** 2)
    return np.round(prediction, 2)

# Use a loop and .itertuples() to collect each row's predicted win percentage
for row in baseball_df.itertuples():
    runs_scored = row.RS
    runs_allowed = row.RA
    win_perc_pred = predict_win_perc(runs_scored, runs_allowed)
    win_perc_preds_loop.append(win_perc_pred)

# Apply predict_win_perc to each row of the DataFrame
win_perc_preds_apply = baseball_df.apply(lambda row: predict_win_perc(row['RS'], row['RA']), axis=1)

# Calculate the win percentage predictions using NumPy arrays
win_perc_preds_np = predict_win_perc(baseball_df['RS'].values, baseball_df['RA'].values)
baseball_df['WP_preds'] = win_perc_preds_np
print(baseball_df.head())

  Team League  Year   RS   RA   W    OBP    SLG     BA  Playoffs  RankSeason  \
0  ARI     NL  2012  734  688  81  0.328  0.418  0.259         0         NaN   
1  ATL     NL  2012  700  600  94  0.320  0.389  0.247         1         4.0   
2  BAL     AL  2012  712  705  93  0.311  0.417  0.247         1         5.0   
3  BOS     AL  2012  734  806  69  0.315  0.415  0.260         0         NaN   
4  CHC     NL  2012  613  759  61  0.302  0.378  0.240         0         NaN   

   RankPlayoffs    G   OOBP   OSLG    WP  WP_preds  
0           NaN  162  0.317  0.415  0.50      0.53  
1           5.0  162  0.306  0.378  0.58      0.58  
2           4.0  162  0.315  0.403  0.57      0.50  
3           NaN  162  0.331  0.428  0.43      0.45  
4           NaN  162  0.335  0.424  0.38      0.39  
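
The three approaches differ in speed but should agree on the numbers. A small sketch of that check, assuming a hypothetical `toy_df` with three rows copied from the output above:

```python
import numpy as np
import pandas as pd

def predict_win_perc(RS, RA):
    prediction = RS ** 2 / (RS ** 2 + RA ** 2)
    return np.round(prediction, 2)

# Three rows copied from the head() output above (toy_df is a hypothetical name)
toy_df = pd.DataFrame({'RS': [734, 700, 712], 'RA': [688, 600, 705]})

# 1. Loop with .itertuples()
preds_loop = [predict_win_perc(row.RS, row.RA) for row in toy_df.itertuples()]

# 2. Row-wise .apply()
preds_apply = toy_df.apply(lambda row: predict_win_perc(row['RS'], row['RA']), axis=1)

# 3. NumPy arrays, one call for all rows
preds_np = predict_win_perc(toy_df['RS'].values, toy_df['RA'].values)

# All three should hold the same predictions
print(np.allclose(preds_loop, preds_apply) and np.allclose(preds_loop, preds_np))
```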
