In [1]:
from bs4 import BeautifulSoup as bs
import pickle
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()

In [2]:
with open('baseball_data.pkl','rb') as cellar:
    season_html = pickle.load(cellar)
len(season_html)

57

In [3]:
type(season_html)

list

These pages are so incredibly gross. I've scraped a lot of pig shit (literally) in my life, and I prefer that to this.

The tables of interest have these headers:
* MLB Detailed Standings
* Team Standard Batting
* Team Standard Pitching
* MLB Wins Above Avg By Position
* Team Fielding

I envision a dataframe storing the league summary for each statistic (columns) by year (rows). For the team data I first envisioned a three-dimensional numpy array, with team as the third axis (thickness...). However, I suspect a better approach would be a dictionary of dataframes, one for each team. baseball-reference uses a three-letter code for each team, although the code does change when a franchise moves. I don't anticipate this being a problem: there are few enough moves that I could concatenate a franchise's dataframes into a single entry later if I thought it useful.
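
A minimal sketch of the dictionary-of-dataframes idea, assuming I already have one dataframe per season with a `Tm` column. The `seasons` dict and the numbers in it are toy stand-ins, not parsed data, and `team_frames` is just a name I picked.

```python
import pandas as pd

# Toy stand-in for parsed data: one frame per season, one row per team.
seasons = {
    1962: pd.DataFrame({"Tm": ["SFG", "NYY"], "W": [103, 96]}),
    1963: pd.DataFrame({"Tm": ["SFG", "NYY"], "W": [88, 104]}),
}

# Stack the seasons into one long frame, tagging each row with its year.
stacked = pd.concat(
    [frame.assign(Year=year) for year, frame in seasons.items()],
    ignore_index=True,
)

# One dataframe per team code: stats as columns, years as the row index.
team_frames = {
    code: frame.drop(columns="Tm").set_index("Year")
    for code, frame in stacked.groupby("Tm")
}

print(team_frames["SFG"])
```

If a franchise moves and its code changes, the two entries could later be joined with `pd.concat([team_frames[old], team_frames[new]])`, where `old` and `new` are the two codes.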

In [4]:
season_html[0].text.find('MLB Detailed Standings')

-1

Checking the page source in my browser, then.

Oh, you hoser. Two spaces in the text.

In [5]:
season_html[0].text.find('MLB  Detailed Standings')

72595

(nervous laughter)
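
For next time, a whitespace-tolerant search would have found the caption without a trip to the page source. This is a small sketch using the standard `re` module; `page_text` is a made-up stand-in for `season_html[0].text`.

```python
import re

# Made-up stand-in for season_html[0].text; the real page is far longer.
page_text = "<caption>MLB  Detailed Standings</caption>"

# \s+ matches one or more whitespace characters, so one space or two both hit.
pattern = re.compile(r"MLB\s+Detailed\s+Standings")

match = pattern.search(page_text)
position = match.start() if match else -1  # mimic str.find's -1 on a miss
print(position)
```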

Let's check which parser argument BeautifulSoup needs.

In [6]:
soup = bs(season_html[0].text)
print(soup)

<!DOCTYPE html>
<html class="no-js" data-root="/home/br/build" data-version="klecko-" itemscope="" itemtype="https://schema.org/WebSite" lang="en">
<head>
<meta charset="utf-8"/>
<meta content="ie=edge" http-equiv="x-ua-compatible"/>
<meta content="width=device-width, initial-scale=1.0, maximum-scale=2.0" name="viewport"/>
<link href="https://d2p3bygnnzw9w3.cloudfront.net/req/202004032" rel="dns-prefetch"/>
<!-- no:cookie fast load the css.           -->
<script>function gup(n) {n = n.replace(/[\[]/, '\\[').replace(/[\]]/, '\\]'); var r = new RegExp('[\\?&]'+n+'=([^&#]*)'); var re = r.exec(location.search);   return re === null?'':decodeURIComponent(re[1].replace(/\+/g,' '));}; document.srdev = gup('srdev')</script>
<link crossorigin="" href="https://d2p3bygnnzw9w3.cloudfront.net" rel="preconnect"/>
<link crossorigin="" href="https://d3k2oh6evki4b7.cloudfront.net" rel="preconnect"/>
<link as="style" crossorigin="" href="https://d2p3bygnnzw9w3.cloudfront.net/req/202004081/css/br/sr-min.

Oops. I guess passing no parser at all works (BeautifulSoup then picks the best parser installed and warns that it guessed).

In [7]:
# set up a dictionary to call seasons by number
# first year is 1962
# format data via BeautifulSoup
season_soup = {}
for i, season in enumerate(season_html):
    season_soup[1962+i] = bs(season.text)

In [8]:
season_soup[1982].find('a')

<a href="https://www.sports-reference.com/"><svg height="15px" width="20px"><use xlink:href="#ic-sr-pennant"></use></svg> Sports Reference</a>

The analysis I want to conduct is time-based, so I want to have dataframes for the big leagues as a whole, maybe the individual leagues, and the individual teams where the stats are columns and the row index is years.

In order to get there, I can start by taking the data the way it comes, individual years of data with tables (dataframes) for standings/team record, batting, pitching, WAR, fielding. I can groupby league on these tables to get summary data rather than reading in the separate league pages... and in fact I can neglect this entirely for now.
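
A rough sketch of the groupby idea, assuming the standings end up in a dataframe with `Lg`, `W` and `L` columns. The four rows are taken from the 1962 standings; `league_totals` is a name I made up.

```python
import pandas as pd

# Four teams from the 1962 standings; scraped values arrive as strings.
standings = pd.DataFrame(
    {
        "Tm": ["SFG", "NYY", "LAD", "MIN"],
        "Lg": ["NL", "AL", "NL", "AL"],
        "W": ["103", "96", "102", "91"],
        "L": ["62", "66", "63", "71"],
    }
)

# Convert the counting stats to integers before aggregating.
standings[["W", "L"]] = standings[["W", "L"]].astype(int)

# One summary row per league, no separate league pages needed.
league_totals = standings.groupby("Lg")[["W", "L"]].sum()
print(league_totals)
```

Sums only make sense for counting stats; rate stats such as W-L% would need to be recomputed from the totals rather than summed or averaged.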

Let's tinker with the first table and start building up a dictionary of dictionaries where the top level key is the year / season and the second level key is the table caption.

In [9]:
print(season_soup[1962].find('table').find('caption').text)

MLB  Detailed Standings


In [10]:
season_dict={}
season_dict[1962]={}
for table in season_soup[1962].find_all('table'):
    season_dict[1962][table.find('caption').text]='a table'
print(season_dict)

{1962: {'MLB  Detailed Standings': 'a table', 'Team Standard Batting Table': 'a table'}}


In [11]:
print(len(season_soup[1962].find_all('table')))

2


`find_all('table')` sees nothing between the comment delimiters \<!-- and -->: BeautifulSoup parses that content as an HTML comment, not as tags. Yet at least some of that info displays on the webpage and contains critical data (the other three tables). This is insane. I'll come back to it; first, parse the first table.

The first table does this horrid thing where what displays in the table header on the webpage (the abbreviations for the stats) oscillates back and forth between a... wait. That's metadata garbage inside the tag, not the tag text. The actual text for the tag will do. Let's create a dataframe with those column names and the data beneath. Will need to watch out because the team names are links.

In [12]:
columns = [thing.text for thing in season_soup[1962].find('table').find('tr').find_all('th')]
print(columns)

['Rk', 'Tm', 'Lg', 'G', 'W', 'L', 'W-L%', 'R', 'RA', 'Rdiff', 'SOS', 'SRS', 'pythWL', 'Luck', 'Home', 'Road', 'ExInn', '1Run', 'vRHP', 'vLHP', '≥.500', '<.500']


In [13]:
rows = season_soup[1962].find('table').find_all('tr')
row = [thing.text for thing in rows[1].find_all('th')]
print(row)

['1']


Oh. Those are the ranks, which I don't care about anyway. I want:

In [14]:
row = [thing.text for thing in rows[1].find_all('td')]
print(row)

['SFG', 'NL', '165', '103', '62', '.624', '5.3', '4.2', '1.1', '-0.1', '1.0', '100-65', '3', '61-21', '42-41', '5-5', '26-18', '67-47', '36-15', '66-45', '37-17']


Now can I dump this into a dataframe?

In [15]:
rows = season_soup[1962].find('table').find_all('tr')
rows.pop(0)
columns = [thing.text for thing in season_soup[1962].find('table').find('tr').find_all('th')]
columns.pop(0)
test62 = []
for team in rows:
    row = [thing.text for thing in team.find_all('td')]
    test62.append(row)
test62_df = pd.DataFrame(test62, columns=columns)
test62_df.head()

Unnamed: 0,Tm,Lg,G,W,L,W-L%,R,RA,Rdiff,SOS,...,pythWL,Luck,Home,Road,ExInn,1Run,vRHP,vLHP,≥.500,<.500
0,SFG,NL,165,103,62,0.624,5.3,4.2,1.1,-0.1,...,100-65,3,61-21,42-41,5-5,26-18,67-47,36-15,66-45,37-17
1,NYY,AL,162,96,66,0.593,5.0,4.2,0.8,-0.1,...,94-68,2,50-30,46-36,10-6,28-26,64-52,32-14,42-30,54-36
2,LAD,NL,165,102,63,0.618,5.1,4.2,0.9,-0.1,...,97-68,5,54-29,48-34,8-5,23-18,65-42,37-21,60-51,42-12
3,MIN,AL,163,91,71,0.562,4.9,4.4,0.5,-0.1,...,89-73,2,45-36,46-35,9-6,21-25,81-65,10-6,39-33,52-38
4,CIN,NL,162,98,64,0.605,5.0,4.2,0.7,-0.1,...,93-69,5,58-23,40-41,12-6,30-22,70-39,28-25,58-50,40-14


In [16]:
test62_df

Unnamed: 0,Tm,Lg,G,W,L,W-L%,R,RA,Rdiff,SOS,...,pythWL,Luck,Home,Road,ExInn,1Run,vRHP,vLHP,≥.500,<.500
0,SFG,NL,165.0,103,62,0.624,5.3,4.2,1.1,-0.1,...,100-65,3.0,61-21,42-41,5-5,26-18,67-47,36-15,66-45,37-17
1,NYY,AL,162.0,96,66,0.593,5.0,4.2,0.8,-0.1,...,94-68,2.0,50-30,46-36,10-6,28-26,64-52,32-14,42-30,54-36
2,LAD,NL,165.0,102,63,0.618,5.1,4.2,0.9,-0.1,...,97-68,5.0,54-29,48-34,8-5,23-18,65-42,37-21,60-51,42-12
3,MIN,AL,163.0,91,71,0.562,4.9,4.4,0.5,-0.1,...,89-73,2.0,45-36,46-35,9-6,21-25,81-65,10-6,39-33,52-38
4,CIN,NL,162.0,98,64,0.605,5.0,4.2,0.7,-0.1,...,93-69,5.0,58-23,40-41,12-6,30-22,70-39,28-25,58-50,40-14
5,LAA,AL,162.0,86,76,0.531,4.4,4.4,0.1,0.0,...,82-80,4.0,40-41,46-35,10-9,28-26,56-52,30-24,32-40,54-36
6,PIT,NL,161.0,93,68,0.578,4.4,3.9,0.5,-0.1,...,89-72,4.0,51-30,42-38,7-4,32-22,64-50,29-18,50-57,43-11
7,DET,AL,161.0,85,76,0.528,4.7,4.3,0.4,0.0,...,87-74,-2.0,49-33,36-43,7-8,28-24,67-57,18-19,32-40,53-36
8,MLN,NL,162.0,86,76,0.531,4.5,4.1,0.4,0.0,...,88-74,-2.0,49-32,37-44,4-5,24-32,62-53,24-23,53-55,33-21
9,CHW,AL,162.0,85,77,0.525,4.4,4.1,0.3,0.0,...,86-76,-1.0,43-38,42-39,9-4,30-25,68-61,17-16,35-37,50-40


Well, that's a dataframe.
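
One thing to keep in mind: every value pulled out with `.text` is a string, so the stats will need converting before any arithmetic. Here is a hedged sketch of one way to do it; `convert_numeric_columns` is a helper name I invented, and `raw` is a two-row stand-in for the real table.

```python
import pandas as pd

# Two-row stand-in for the scraped standings; every value is a string.
raw = pd.DataFrame(
    {
        "Tm": ["SFG", "NYY"],
        "W": ["103", "96"],
        "W-L%": [".624", ".593"],
        "Home": ["61-21", "50-30"],
    }
)


def convert_numeric_columns(frame):
    """Return a copy in which every fully numeric column is converted."""
    converted = frame.copy()
    for name in converted.columns:
        try:
            converted[name] = pd.to_numeric(converted[name])
        except (ValueError, TypeError):
            # Text columns such as Tm or split records like 61-21 stay as-is.
            pass
    return converted


typed = convert_numeric_columns(raw)
print(typed.dtypes)
```

Converting column by column, rather than using `errors='coerce'` on the whole frame, keeps the split records (Home, Road, and so on) readable instead of turning them into NaN.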

I've manually deleted the offending \<!-- and --> from a test file while I wait for answers from the TAs.

In [17]:
with open('1962-test.html','r') as page:
    test62_html = page.read()
test62_soup = bs(test62_html)

In [18]:
test62_dict = {}
for table in test62_soup.find_all('table'):
    test62_dict[table.find('caption').text]=[]
print(test62_dict)

{'MLB  Detailed Standings': [], 'Postseason': [], 'Team Standard Batting Table': [], 'Team Standard Pitching Table': [], 'MLB Wins Above Avg By Position Table': [], 'Team Fielding Table': []}


In [19]:
for table in test62_soup.find_all('table'):
    print(table.find('caption').text)
    if table.find('caption').text == 'Postseason':
        continue  # skip the postseason table (pass would fall through and parse it anyway)
    list_holder = []
    rows = table.find_all('tr')
    columns = [thing.text for thing in rows[0].find_all('th')]
    rows.pop(0)
    if columns:
        columns.pop(0)
    for team in rows:
        row = [thing.text for thing in team.find_all('td')]
        list_holder.append(row)
    df = pd.DataFrame(list_holder,columns=columns)
    test62_dict[table.find('caption').text] = df
print(len(test62_dict))

MLB  Detailed Standings
Postseason
Team Standard Batting Table
Team Standard Pitching Table
MLB Wins Above Avg By Position Table
Team Fielding Table
6


In [20]:
print(test62_dict.keys())

dict_keys(['MLB  Detailed Standings', 'Postseason', 'Team Standard Batting Table', 'Team Standard Pitching Table', 'MLB Wins Above Avg By Position Table', 'Team Fielding Table'])


In [21]:
print(test62_dict['Team Standard Batting Table'])

    #Bat BatAge   R/G     G      PA      AB      R      H    2B    3B  ...  \
0     42   27.2  4.02   162    6159    5491    652   1363   225    34  ...   
1     33   27.8  4.42   160    6177    5530    707   1429   257    53  ...   
2     43   25.8  3.90   162    6174    5534    632   1398   196    56  ...   
3     40   28.9  4.36   162    6297    5514    707   1415   250    56  ...   
4     37   27.2  4.95   162    6275    5645    802   1523   252    40  ...   
5     41   27.7  4.21   162    6138    5484    682   1341   202    22  ...   
6     37   28.4  4.71   161    6235    5456    758   1352   191    36  ...   
7     43   29.5  3.65   162    6201    5558    592   1370   170    47  ...   
8     45   27.2  4.60   162    6301    5576    745   1467   220    58  ...   
9     43   27.7  4.43   162    6256    5499    718   1377   232    35  ...   
10    30   27.0  5.10   165    6363    5628    842   1510   192    65  ...   
11    36   26.6  4.90   163    6362    5561    798   1445   215 

Obviously this is a problem: the team column is missing. Dropping the first header works for the tables that lead with a rank, which I can skip, but in the team batting / pitching / fielding tables the first column is the team itself. When I tried to read in the 'th' element in front of the 'td' elements for the rows, the code fucked up royally. Let me leave all that working code as-is and try again down here.

In [22]:
test62_dict = {}
for table in test62_soup.find_all('table'):
    print(table.find('caption').text)
    if table.find('caption').text == 'Postseason':
        continue
    list_holder = []
    rows = table.find_all('tr')
    columns = [thing.text for thing in rows[0].find_all('th')]
    rows.pop(0)
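    for team in rows:
        # NOTE: find('th') returns the Tag itself, not its text, which is why
        # the first column prints as [[BAL]] below; .text would give 'BAL'
        row = [team.find('th')]
        for thing in team.find_all('td'):
            row.append(thing.text)
        list_holder.append(row)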
    df = pd.DataFrame(list_holder,columns=columns)
    test62_dict[table.find('caption').text] = df
print(test62_dict.keys())

MLB  Detailed Standings
Postseason
Team Standard Batting Table
Team Standard Pitching Table
MLB Wins Above Avg By Position Table
Team Fielding Table
dict_keys(['MLB  Detailed Standings', 'Team Standard Batting Table', 'Team Standard Pitching Table', 'MLB Wins Above Avg By Position Table', 'Team Fielding Table'])


In [23]:
print(test62_dict['Team Standard Batting Table'])

           Tm  #Bat BatAge   R/G     G      PA      AB      R      H    2B  \
0     [[BAL]]    42   27.2  4.02   162    6159    5491    652   1363   225   
1     [[BOS]]    33   27.8  4.42   160    6177    5530    707   1429   257   
2     [[CHC]]    43   25.8  3.90   162    6174    5534    632   1398   196   
3     [[CHW]]    40   28.9  4.36   162    6297    5514    707   1415   250   
4     [[CIN]]    37   27.2  4.95   162    6275    5645    802   1523   252   
5     [[CLE]]    41   27.7  4.21   162    6138    5484    682   1341   202   
6     [[DET]]    37   28.4  4.71   161    6235    5456    758   1352   191   
7     [[HOU]]    43   29.5  3.65   162    6201    5558    592   1370   170   
8     [[KCA]]    45   27.2  4.60   162    6301    5576    745   1467   220   
9     [[LAA]]    43   27.7  4.43   162    6256    5499    718   1377   232   
10    [[LAD]]    30   27.0  5.10   165    6363    5628    842   1510   192   
11    [[MIN]]    36   26.6  4.90   163    6362    5561    798   

In [24]:
print(test62_dict['Team Standard Pitching Table'])

           Tm    #P  PAge  RA/G     W     L  W-L%   ERA     G    GS  ...  \
0     [[BAL]]    17  28.0  4.20    77    85  .475  3.69   162   162  ...   
1     [[BOS]]    18  27.6  4.72    76    84  .475  4.22   160   160  ...   
2     [[CHC]]    19  26.6  5.10    59   103  .364  4.54   162   162  ...   
3     [[CHW]]    18  29.2  4.06    85    77  .525  3.73   162   162  ...   
4     [[CIN]]    16  28.4  4.23    98    64  .605  3.75   162   162  ...   
5     [[CLE]]    18  26.8  4.60    80    82  .494  4.14   162   162  ...   
6     [[DET]]    17  29.5  4.30    85    76  .528  3.81   161   161  ...   
7     [[HOU]]    17  28.9  4.43    64    96  .400  3.83   162   162  ...   
8     [[KCA]]    22  26.7  5.17    72    90  .444  4.79   162   162  ...   
9     [[LAA]]    18  27.3  4.36    86    76  .531  3.70   162   162  ...   
10    [[LAD]]    12  25.7  4.22   102    63  .618  3.62   165   165  ...   
11    [[MIN]]    18  26.7  4.37    91    71  .562  3.89   163   163  ...   
12    [[MLN]

In [25]:
print(test62_dict['Team Fielding Table'])

         Tm  #Fld  RA/G DefEff     G     GS     CG       Inn      Ch     PO  \
0   [[BAL]]    41  4.20   .715   162   1458   1127   13161.0    6217   4384   
1   [[BOS]]    33  4.72   .703   160   1440   1246   12939.0    6136   4307   
2   [[CHC]]    41  5.10   .694   162   1457   1215   12945.0    6372   4315   
3   [[CHW]]    40  4.06   .714   162   1458   1139   13065.0    6205   4335   
4   [[CIN]]    36  4.23   .703   162   1458   1179   13146.0    6212   4382   
5   [[CLE]]    39  4.60   .713   162   1458   1188   12969.0    6172   4325   
6   [[DET]]    36  4.30   .703   161   1449   1218   12993.0    5926   4327   
7   [[HOU]]    43  4.43   .680   162   1458   1107   13083.0    6308   4361   
8   [[KCA]]    43  5.17   .709   162   1458   1169   12906.0    6150   4294   
9   [[LAA]]    43  4.36   .702   162   1458   1165   13194.0    6350   4388   
10  [[LAD]]    30  4.22   .691   165   1485   1045   13398.0    6393   4466   
11  [[MIN]]    36  4.37   .708   163   1467   1130  

In [26]:
print(test62_dict['MLB Wins Above Avg By Position Table'])

       Rk     Total     All P       SP       RP     Non-P        C       1B  \
0     [1]   SFG15.1   DET10.0   CIN8.3   PIT4.4   SFG18.0   CLE3.1   KCA3.3   
1     [2]   CIN10.5    PIT9.5   DET5.4   DET4.4   NYY12.9   SFG2.6   CHW1.9   
2     [3]   NYY10.1    HOU7.4   MIN5.1   HOU3.7   LAD10.7   MLN2.5   DET1.5   
3     [4]    LAD9.3    STL6.5   STL4.5   CHW2.0    MLN6.5   NYY1.6   STL1.1   
4     [5]    STL9.1    CHW5.9   PIT4.3   STL1.6    MIN4.8   PHI1.4   SFG0.8   
5     [6]    DET8.4    CIN5.8   HOU3.1   BAL1.5    CIN4.7   LAD1.0   PHI0.8   
6     [7]    MIN7.3    MIN2.5   CHW2.8   LAA0.8    STL2.6   STL0.8   BOS0.5   
7     [8]    PIT6.5    LAA1.7   WSA2.4  BOS-0.4    PHI1.8   PIT0.4   CHC0.4   
8     [9]    MLN5.8   BAL-0.1   MLN2.0  WSA-1.2    BOS0.7   MIN0.3   NYY0.4   
9    [10]    CHW5.1   WSA-0.5   LAA1.2  LAD-1.6    KCA0.3   WSA0.1   BAL0.1   
10   [11]   BAL-1.0   MLN-0.7   NYY1.0  CIN-2.0   CHW-0.8  CIN-0.2  MIN-0.1   
11   [12]   BOS-2.3   LAD-1.4   LAD0.9  PHI-2.1   BA

Shit, that last table is horrible. Looking closer, *each table entry* has these div-left and div-right tags. I could split those with Soup somehow, but a look at the 1967 and 1968 data tells me that WAA is NOT looking promising. Overall pitching is -0.1 WAA in 1968, just like 1962. The starting / relief pitching split might be interesting... maybe. In any case, I have the league overall numbers if I want them already.
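
If I do come back to WAA, the fused cells could probably be split after the fact with a regular expression instead of digging through the div tags. This sketch assumes every cell is a three-letter uppercase team code followed by a signed decimal, as in the printout above; `split_waa_cell` is a hypothetical helper.

```python
import re

# Three uppercase letters, then an optionally negative number.
CELL = re.compile(r"^([A-Z]{3})(-?\d+(?:\.\d+)?)$")


def split_waa_cell(cell):
    """Split a fused cell like 'SFG15.1' into ('SFG', 15.1); None if no match."""
    found = CELL.match(cell.strip())
    if found is None:
        return None
    return found.group(1), float(found.group(2))


for sample in ["SFG15.1", "BAL-0.1", "LAD9.3"]:
    print(sample, "->", split_waa_cell(sample))
```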

In [27]:
st_csv = pd.read_csv('1962-stand.csv')

In [28]:
hit_csv = pd.read_csv('1962-bat.csv')
pitch_csv = pd.read_csv('1962-pitch.csv')
waa_csv = pd.read_csv('1962-waa.csv')
field_csv = pd.read_csv('1962-field.csv')

In [29]:
print(test62_dict['Team Standard Batting Table'])
print(hit_csv)

           Tm  #Bat BatAge   R/G     G      PA      AB      R      H    2B  \
0     [[BAL]]    42   27.2  4.02   162    6159    5491    652   1363   225   
1     [[BOS]]    33   27.8  4.42   160    6177    5530    707   1429   257   
2     [[CHC]]    43   25.8  3.90   162    6174    5534    632   1398   196   
3     [[CHW]]    40   28.9  4.36   162    6297    5514    707   1415   250   
4     [[CIN]]    37   27.2  4.95   162    6275    5645    802   1523   252   
5     [[CLE]]    41   27.7  4.21   162    6138    5484    682   1341   202   
6     [[DET]]    37   28.4  4.71   161    6235    5456    758   1352   191   
7     [[HOU]]    43   29.5  3.65   162    6201    5558    592   1370   170   
8     [[KCA]]    45   27.2  4.60   162    6301    5576    745   1467   220   
9     [[LAA]]    43   27.7  4.43   162    6256    5499    718   1377   232   
10    [[LAD]]    30   27.0  5.10   165    6363    5628    842   1510   192   
11    [[MIN]]    36   26.6  4.90   163    6362    5561    798   

In [30]:
print(test62_dict['Team Standard Pitching Table'])
print(pitch_csv)

           Tm    #P  PAge  RA/G     W     L  W-L%   ERA     G    GS  ...  \
0     [[BAL]]    17  28.0  4.20    77    85  .475  3.69   162   162  ...   
1     [[BOS]]    18  27.6  4.72    76    84  .475  4.22   160   160  ...   
2     [[CHC]]    19  26.6  5.10    59   103  .364  4.54   162   162  ...   
3     [[CHW]]    18  29.2  4.06    85    77  .525  3.73   162   162  ...   
4     [[CIN]]    16  28.4  4.23    98    64  .605  3.75   162   162  ...   
5     [[CLE]]    18  26.8  4.60    80    82  .494  4.14   162   162  ...   
6     [[DET]]    17  29.5  4.30    85    76  .528  3.81   161   161  ...   
7     [[HOU]]    17  28.9  4.43    64    96  .400  3.83   162   162  ...   
8     [[KCA]]    22  26.7  5.17    72    90  .444  4.79   162   162  ...   
9     [[LAA]]    18  27.3  4.36    86    76  .531  3.70   162   162  ...   
10    [[LAD]]    12  25.7  4.22   102    63  .618  3.62   165   165  ...   
11    [[MIN]]    18  26.7  4.37    91    71  .562  3.89   163   163  ...   
12    [[MLN]

In [31]:
print(test62_dict['Team Fielding Table'])
print(field_csv)

         Tm  #Fld  RA/G DefEff     G     GS     CG       Inn      Ch     PO  \
0   [[BAL]]    41  4.20   .715   162   1458   1127   13161.0    6217   4384   
1   [[BOS]]    33  4.72   .703   160   1440   1246   12939.0    6136   4307   
2   [[CHC]]    41  5.10   .694   162   1457   1215   12945.0    6372   4315   
3   [[CHW]]    40  4.06   .714   162   1458   1139   13065.0    6205   4335   
4   [[CIN]]    36  4.23   .703   162   1458   1179   13146.0    6212   4382   
5   [[CLE]]    39  4.60   .713   162   1458   1188   12969.0    6172   4325   
6   [[DET]]    36  4.30   .703   161   1449   1218   12993.0    5926   4327   
7   [[HOU]]    43  4.43   .680   162   1458   1107   13083.0    6308   4361   
8   [[KCA]]    43  5.17   .709   162   1458   1169   12906.0    6150   4294   
9   [[LAA]]    43  4.36   .702   162   1458   1165   13194.0    6350   4388   
10  [[LAD]]    30  4.22   .691   165   1485   1045   13398.0    6393   4466   
11  [[MIN]]    36  4.37   .708   163   1467   1130  

In [32]:
print(test62_dict['MLB Wins Above Avg By Position Table'])
print(waa_csv)

       Rk     Total     All P       SP       RP     Non-P        C       1B  \
0     [1]   SFG15.1   DET10.0   CIN8.3   PIT4.4   SFG18.0   CLE3.1   KCA3.3   
1     [2]   CIN10.5    PIT9.5   DET5.4   DET4.4   NYY12.9   SFG2.6   CHW1.9   
2     [3]   NYY10.1    HOU7.4   MIN5.1   HOU3.7   LAD10.7   MLN2.5   DET1.5   
3     [4]    LAD9.3    STL6.5   STL4.5   CHW2.0    MLN6.5   NYY1.6   STL1.1   
4     [5]    STL9.1    CHW5.9   PIT4.3   STL1.6    MIN4.8   PHI1.4   SFG0.8   
5     [6]    DET8.4    CIN5.8   HOU3.1   BAL1.5    CIN4.7   LAD1.0   PHI0.8   
6     [7]    MIN7.3    MIN2.5   CHW2.8   LAA0.8    STL2.6   STL0.8   BOS0.5   
7     [8]    PIT6.5    LAA1.7   WSA2.4  BOS-0.4    PHI1.8   PIT0.4   CHC0.4   
8     [9]    MLN5.8   BAL-0.1   MLN2.0  WSA-1.2    BOS0.7   MIN0.3   NYY0.4   
9    [10]    CHW5.1   WSA-0.5   LAA1.2  LAD-1.6    KCA0.3   WSA0.1   BAL0.1   
10   [11]   BAL-1.0   MLN-0.7   NYY1.0  CIN-2.0   CHW-0.8  CIN-0.2  MIN-0.1   
11   [12]   BOS-2.3   LAD-1.4   LAD0.9  PHI-2.1   BA

In [33]:
print(test62_dict['MLB  Detailed Standings'])
print(st_csv)

       Rk   Tm  Lg    G    W    L  W-L%     R    RA Rdiff  ...  pythWL Luck  \
0   [[1]]  SFG  NL  165  103   62  .624   5.3   4.2   1.1  ...  100-65    3   
1   [[2]]  NYY  AL  162   96   66  .593   5.0   4.2   0.8  ...   94-68    2   
2     [3]  LAD  NL  165  102   63  .618   5.1   4.2   0.9  ...   97-68    5   
3     [4]  MIN  AL  163   91   71  .562   4.9   4.4   0.5  ...   89-73    2   
4     [5]  CIN  NL  162   98   64  .605   5.0   4.2   0.7  ...   93-69    5   
5     [6]  LAA  AL  162   86   76  .531   4.4   4.4   0.1  ...   82-80    4   
6     [7]  PIT  NL  161   93   68  .578   4.4   3.9   0.5  ...   89-72    4   
7     [8]  DET  AL  161   85   76  .528   4.7   4.3   0.4  ...   87-74   -2   
8     [9]  MLN  NL  162   86   76  .531   4.5   4.1   0.4  ...   88-74   -2   
9    [10]  CHW  AL  162   85   77  .525   4.4   4.1   0.3  ...   86-76   -1   
10   [11]  STL  NL  163   84   78  .518   4.7   4.1   0.7  ...   92-70   -8   
11   [12]  CLE  AL  162   80   82  .494   4.2   4.6 

That all looks good to me. The parsing routine is functioning; now to pull in all the data from all years and shuffle it into league and team dataframes with rows for each season.
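
One wrinkle to clean up before scaling this out: the first column prints as `[[BAL]]` and `[[1]]` because the row list starts with the `th` Tag object rather than its text. A small sketch of the difference follows; the HTML snippet and its link target are invented for illustration.

```python
from bs4 import BeautifulSoup

# Invented miniature of one table row; the real markup carries more attributes.
html = """<table><caption>Team Standard Batting Table</caption>
<tr><th>Tm</th><th>#Bat</th></tr>
<tr><th><a href="/teams/BAL/1962.shtml">BAL</a></th><td>42</td></tr>
</table>"""

data_row = BeautifulSoup(html, "html.parser").find_all("tr")[1]

as_tag = [data_row.find("th")]                        # a Tag object
as_text = [data_row.find("th").get_text(strip=True)]  # a plain string

print(type(as_tag[0]).__name__, as_text)
```

Swapping `team.find('th')` for `team.find('th').text` in the parsing loop should give plain team codes and ranks in the dataframe.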

First I have to solve the problem with the \<!-- comment tags hiding valuable content.

In [34]:
print(soup.find('comment'))

None
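
That `None` makes sense in hindsight: `find('comment')` looks for a tag literally named `<comment>`, and there isn't one. BeautifulSoup stores HTML comments as `Comment` string nodes, so one possible approach is to collect those nodes and parse each one's contents as its own document. A sketch on an invented snippet:

```python
from bs4 import BeautifulSoup, Comment

# Invented miniature page: one visible table, one hidden inside a comment.
html = """<div><table><caption>Visible</caption></table>
<!-- <table><caption>Hidden</caption></table> -->
</div>"""

soup = BeautifulSoup(html, "html.parser")
tag_lookup = soup.find("comment")  # searches for a <comment> tag: None

captions = [cap.get_text() for cap in soup.find_all("caption")]

# Comment nodes are strings, so match them by type and re-parse their text.
comments = soup.find_all(string=lambda node: isinstance(node, Comment))
for comment in comments:
    inner = BeautifulSoup(comment, "html.parser")
    captions.extend(cap.get_text() for cap in inner.find_all("caption"))

print(captions)
```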


Joe gave me code that theoretically should have stripped the comment tags, but it completely failed to change the situation.

In [35]:
cleaned_text = season_html[0].text.replace("<!--","").replace("-->","")
with open('codetest1962.html','w') as testfile:
    testfile.write(cleaned_text)

That worked... the comment tags are gone. No reason why that code shouldn't work now.

In [37]:
cleaned_soup = bs(cleaned_text)
print(cleaned_soup.find('table').find('caption').text)

MLB  Detailed Standings


In [38]:
for table in cleaned_soup.find_all('table'):
    print(table.find('caption').text)

MLB  Detailed Standings
Postseason
Team Standard Batting Table
Team Standard Pitching Table
MLB Wins Above Avg By Position Table
Team Fielding Table
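
Putting the pieces together, here is a sketch of what a per-page parsing function might look like once this runs over every season. It is not tested against the real pages: `parse_tables` and `demo_page` are names I made up, and it assumes every table of interest has a caption and a single header row.

```python
from bs4 import BeautifulSoup
import pandas as pd


def parse_tables(page_text, skip=("Postseason",)):
    """Return {caption: DataFrame} for each captioned table in one page."""
    # Strip the comment markers so the hidden tables become ordinary tags.
    cleaned = page_text.replace("<!--", "").replace("-->", "")
    soup = BeautifulSoup(cleaned, "html.parser")
    frames = {}
    for table in soup.find_all("table"):
        caption = table.find("caption")
        if caption is None or caption.get_text(strip=True) in skip:
            continue
        rows = table.find_all("tr")
        if not rows:
            continue
        columns = [cell.get_text(strip=True) for cell in rows[0].find_all("th")]
        # Read th and td cells together so the rank or team column is kept.
        body = [
            [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
            for row in rows[1:]
        ]
        # Drop rows that do not line up with the header.
        body = [cells for cells in body if len(cells) == len(columns)]
        frames[caption.get_text(strip=True)] = pd.DataFrame(body, columns=columns)
    return frames


# Invented miniature page for a dry run.
demo_page = """
<table><caption>MLB  Detailed Standings</caption>
<tr><th>Rk</th><th>Tm</th><th>W</th></tr>
<tr><th>1</th><td><a href="#">SFG</a></td><td>103</td></tr>
</table>
<table><caption>Postseason</caption><tr><td>World Series</td></tr></table>
<!--
<table><caption>Team Fielding Table</caption>
<tr><th>Tm</th><th>#Fld</th></tr>
<tr><th><a href="#">BAL</a></th><td>41</td></tr>
</table>
-->
"""

demo_frames = parse_tables(demo_page)
print(list(demo_frames))
```

From there, something like `{1962 + i: parse_tables(page.text) for i, page in enumerate(season_html)}` would give the dictionary of dictionaries keyed by season and caption.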
