# Batting collapse frequency. Are England unique?

### Research Questions:
- How often do batting collapses happen?
- Does England collapse more often than other teams?
- Is this statement true? "Joe Root rarely stops a collapse, and is often a part of it"
- Which players are best at stopping a collapse?


### Methodology:
- Create a fall-of-wickets (FoW) table with these columns:
    - MatchID
    - Date
    - Batting team
    - Bowling team
    - Match type
    - Innings
    - Fall of Wicket 1 (runs, batsman)
    - Fall of Wicket 2 (runs, batsman)
    - etc.
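
The one-row-per-innings layout above can be sketched as a pivot of a long table (one row per wicket). This is a minimal sketch rather than the notebook's actual method: the rows are copied from the `df_fow.head()` output further down, the helper name `fow_long_to_wide` and the `FoW{n}_Runs` / `FoW{n}_Player` column names are hypothetical, and only the key columns present in that output (`MatchDate`, `Team`, `Innings`) are used.

```python
import pandas as pd

# Long-format rows copied from the df_fow.head() output in this notebook
df_long = pd.DataFrame({
    "MatchDate": ["2006-01-21"] * 5,
    "Team": ["Pakistan"] * 5,
    "Innings": ["1st"] * 5,
    "Wicket": [1, 2, 3, 4, 5],
    "Runs": [49, 65, 207, 216, 467],
    "Player": ["Malik", "Butt", "Khan", "Yousuf", "Afridi"],
})


def fow_long_to_wide(df):
    """Pivot one-row-per-wicket data into one row per innings (hypothetical helper)."""
    keys = ["MatchDate", "Team", "Innings"]
    wide = df.pivot(index=keys, columns="Wicket", values=["Runs", "Player"])
    # Flatten the (field, wicket) column pairs, e.g. ("Runs", 1) -> "FoW1_Runs"
    wide.columns = [f"FoW{wicket}_{field}" for field, wicket in wide.columns]
    return wide.reset_index()


df_wide = fow_long_to_wide(df_long)
print(df_wide.columns.tolist())
```

The long format is usually easier to filter and group; the wide format matches the table described in the methodology.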


### Problem breakdown:
- Data Source: howstat cricket scorecards
- Extract Fall of Wickets (FoW) from a single game
- Extract FoW from multiple games
- Extract FoW from all (relevant) games
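
Going from one game to many could look roughly like the sketch below. It assumes that other matches use the same URL pattern as the `MatchCode=1800` scorecard fetched in this notebook, which has not been verified here. The names `scorecard_url`, `fetch_html`, and `collect_pages` are hypothetical helpers, not part of `cricsheet`.

```python
import urllib.request

# URL pattern taken from the single-match request in this notebook;
# assuming other match codes follow the same pattern.
BASE_URL = "http://howstat.com/cricket/Statistics/Matches/MatchScorecard.asp?MatchCode={code}"


def scorecard_url(match_code):
    """Build the scorecard URL for one match code."""
    return BASE_URL.format(code=match_code)


def fetch_html(url, timeout=30):
    """Download one page and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def collect_pages(match_codes, fetch=fetch_html):
    """Fetch every match page, recording failures instead of stopping."""
    pages = {}
    failed = []
    for code in match_codes:
        try:
            pages[code] = fetch(scorecard_url(code))
        except OSError as exc:  # URLError and socket timeouts subclass OSError
            failed.append((code, exc))
    return pages, failed


# Usage (performs network requests):
# pages, failed = collect_pages([1800])
```

Passing `fetch` as a parameter keeps the loop testable without network access, and collecting failures separately means one missing scorecard does not abort a long run.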

In [1]:
import urllib.request  # a bare `import urllib` does not guarantee the `request` submodule is loaded
import re
import pandas as pd
import numpy as np

from dateutil.parser import parse
from bs4 import BeautifulSoup

In [2]:
with urllib.request.urlopen("http://howstat.com/cricket/Statistics/Matches/MatchScorecard.asp?MatchCode=1800") as url:
    s = url.read()

In [3]:
soup = BeautifulSoup(s, 'html.parser')

In [4]:
from cricsheet.io_html import scrape_howstat

In [5]:
d_match_info = scrape_howstat.parse_match_info(soup)

In [6]:
# Loop through each TextBlackBold8 element.
# If the text contains the word 'Innings', the element heads an innings scorecard, so parse it.
d_scorecards = {}
l_innings = []
for item in soup.find_all(class_="TextBlackBold8"):
    item_text = item.text.replace('\xa0', ' ').strip()
    if 'Innings' in item_text:
        # Parse the batting scorecard that follows this heading
        innings, df_scorecard = scrape_howstat.parse_scorecard(item)
        print(f'Parsed scorecard for: {innings} innings')
        l_innings.append(innings)
        d_scorecards[innings] = df_scorecard

Parsed scorecard for: Pakistan 1st innings
Parsed scorecard for: India 1st innings
Parsed scorecard for: Pakistan 2nd innings
Parsed scorecard for: India 2nd innings
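
The innings labels printed above combine a team name and an innings ordinal. A minimal sketch of splitting such a label into the separate `Team` and `Innings` values follows. It assumes every label has the form "<team> <ordinal>", optionally followed by the word "innings"; `split_innings_label` is a hypothetical helper, and how `scrape_howstat` does this internally is not shown in this notebook.

```python
import re

# Team name, then an ordinal such as 1st/2nd, then an optional trailing "innings"
_LABEL = re.compile(
    r"^(?P<team>.+?)\s+(?P<innings>\d+(?:st|nd|rd|th))(?:\s+innings)?$",
    re.IGNORECASE,
)


def split_innings_label(label):
    """Split e.g. 'Pakistan 1st' into ('Pakistan', '1st')."""
    match = _LABEL.match(label.strip())
    if match is None:
        raise ValueError(f"Unrecognised innings label: {label!r}")
    return match.group("team"), match.group("innings")


print(split_innings_label("Pakistan 1st"))
```

Raising on an unrecognised label makes unexpected page layouts visible early rather than producing silently wrong team names.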


In [7]:
# Loop through each Fall of Wickets section and parse the FoW record.
# Store the results in a dict keyed by the innings names from the parsed scorecards.
# This assumes there is exactly one FoW section per scorecard, in the same order.
l_fow = []
fow_sections = soup.find_all("td", string=re.compile('Fall of Wickets'))
for item in fow_sections:
    df_fow = scrape_howstat.parse_fall_of_wickets(item)
    l_fow.append(df_fow)
    print('Parsed FoW')

# Convert list of FoW dfs to dict; fail loudly if the counts do not match,
# otherwise d_fow would be left undefined for the later cells
if len(l_fow) != len(l_innings):
    raise ValueError(f'Found {len(l_fow)} FoW sections for {len(l_innings)} scorecards')
print(f'There are {len(l_fow)} innings with fall of wicket data')
d_fow = dict(zip(l_innings, l_fow))

Parsed FoW
Parsed FoW
Parsed FoW
Parsed FoW
There are 4 innings with fall of wicket data


In [8]:
df_scorecards = scrape_howstat.clean_scorecards(d_scorecards, d_match_info, l_innings)

In [9]:
df_fow = scrape_howstat.clean_fow(d_fow, d_match_info, l_innings)

In [10]:
df_scorecards.head()

| | MatchDate | Team | Innings | ScorecardIdx | Player | Details | R | BF | 4s | 6s | SR | % of Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2006-01-21 | Pakistan | 1st | 0 | Shoaib Malik | c Dravid b R P Singh | 19.0 | 33.0 | 4.0 | 0.0 | 57.58 | 3.23% |
| 1 | 2006-01-21 | Pakistan | 1st | 1 | Salman Butt | c †Dhoni b Khan | 37.0 | 57.0 | 7.0 | 0.0 | 64.91 | 6.29% |
| 2 | 2006-01-21 | Pakistan | 1st | 2 | Younis Khan | c Yuvraj Singh b R P Singh | 83.0 | 131.0 | 13.0 | 0.0 | 63.36 | 14.12% |
| 3 | 2006-01-21 | Pakistan | 1st | 3 | Mohammad Yousuf | c †Dhoni b R P Singh | 65.0 | 119.0 | 8.0 | 1.0 | 54.62 | 11.05% |
| 4 | 2006-01-21 | Pakistan | 1st | 4 | Inzamam-ul-Haq* | c †Dhoni b Khan | 119.0 | 193.0 | 12.0 | 1.0 | 61.66 | 20.24% |


In [11]:
df_fow.head()

| | MatchDate | Team | Innings | Wicket | Runs | Player |
|---|---|---|---|---|---|---|
| 0 | 2006-01-21 | Pakistan | 1st | 1 | 49 | Malik |
| 1 | 2006-01-21 | Pakistan | 1st | 2 | 65 | Butt |
| 2 | 2006-01-21 | Pakistan | 1st | 3 | 207 | Khan |
| 3 | 2006-01-21 | Pakistan | 1st | 4 | 216 | Yousuf |
| 4 | 2006-01-21 | Pakistan | 1st | 5 | 467 | Afridi |
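
With FoW data in this shape, counting collapses first needs a working definition, which this notebook does not yet give. The sketch below assumes one possible definition: a collapse is any stretch in which `n_wickets` wickets fall for at most `max_runs` runs. The function name `flag_collapses` and the default thresholds of 3 wickets and 30 runs are illustrative assumptions, not established values, and the `Runs` column is assumed to be numeric.

```python
import pandas as pd


def flag_collapses(df_fow, n_wickets=3, max_runs=30):
    """Flag wickets that complete a run of n_wickets lost for <= max_runs (assumed definition)."""
    keys = ["MatchDate", "Team", "Innings"]
    df = df_fow.sort_values(keys + ["Wicket"]).copy()
    # Team score when the wicket n_wickets earlier fell (NaN if there is none)
    score_before = df.groupby(keys)["Runs"].shift(n_wickets)
    # The innings starts at 0 runs, so the n-th wicket is measured from 0
    score_before = score_before.mask(df["Wicket"] == n_wickets, 0)
    df["RunsForLastN"] = df["Runs"] - score_before
    # Comparisons with NaN are False, so the earliest wickets are never flagged
    df["Collapse"] = df["RunsForLastN"] <= max_runs
    return df


# Rows copied from the df_fow.head() output above
df_example = pd.DataFrame({
    "MatchDate": ["2006-01-21"] * 5,
    "Team": ["Pakistan"] * 5,
    "Innings": ["1st"] * 5,
    "Wicket": [1, 2, 3, 4, 5],
    "Runs": [49, 65, 207, 216, 467],
    "Player": ["Malik", "Butt", "Khan", "Yousuf", "Afridi"],
})

flagged = flag_collapses(df_example)
print(flagged[["Wicket", "Runs", "RunsForLastN", "Collapse"]])
```

Grouping the flagged rows by `Team` would then give a per-team collapse count, and the `Player` column shows who was dismissed during each flagged stretch.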


In [12]:
# FoW
"""HTML structure: 
    innings + team name
    innings scorecard
    fall of wickets
    innings + team name
    innings scorecard
    fall of wickets
    
    etc.
    
    
Procedure:
Get this table:
    Read innings + team name (s)
    Read the scorecard (s)
    Read the FoW (s)
    Split by innings + team name.
Reformat into readable table
"""



