# Data Mining and Feature Engineering

Disclaimer: The following code was written in Python 3.5.2.

### Write raw html

If you're writing and debugging code, store the data locally. There are two good reasons:
1. Sending the same request to a server dozens or possibly hundreds of times puts an unnecessary burden on that server.
2. It's faster to read data from a local file than to download it again.

In [9]:
import requests

s = requests.Session()
s.headers

{'Accept-Encoding': 'gzip, deflate', 'User-Agent': 'python-requests/2.10.0', 'Connection': 'keep-alive', 'Accept': '*/*'}
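
One way to follow this advice is to write the response to disk the first time and read the file on every later run. The sketch below is a minimal version under some assumptions: `fetch_cached` and its `fetch` parameter are hypothetical names introduced here, not part of `requests`, and the cache path is whatever you pass in.

```python
import os


def fetch_cached(url, path, fetch=None):
    """Return the page at `url`, reading from `path` if it was saved before.

    `fetch` is any callable that takes a url and returns the page text.
    It defaults to a plain requests.get call.
    """
    if os.path.exists(path):
        # Cache hit: no request is sent to the server.
        with open(path, 'r', encoding='utf-8') as f:
            return f.read()

    if fetch is None:
        import requests  # imported lazily; only needed on a cache miss

        def fetch(u):
            response = requests.get(u)
            response.raise_for_status()
            return response.text

    html = fetch(url)
    with open(path, 'w', encoding='utf-8') as f:
        f.write(html)
    return html
```

Passing your own `fetch` callable makes it possible to try the caching logic without touching the network at all.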

### Load raw html and make soup

In [1]:
from bs4 import BeautifulSoup
with open('./../html/today\'s box scores.txt', 'rb') as raw_html:
    soup = BeautifulSoup(raw_html.read().decode(), 'html.parser')


### Collect team names and scores

In [2]:
from collections import namedtuple
Team = namedtuple('Team', ('name', 'points', 'won'))

game_summaries = soup.find(class_="game_summaries").find_all('div', {"class":'game_summary nohover'})

winner = game_summaries[0].find(class_='winner')
team1 = winner.a.text.strip()
team1_score = winner.find(class_='right').text.strip()
Team(team1, team1_score, True)

Team(name='Alcorn State', points='81', won=True)

### Clean it up

In [3]:
from collections import namedtuple
Team = namedtuple('Team', ('name', 'points', 'won'))

def collect_teams(game_summary):
    """
    Collect winning team data and losing team data.
    Return winning team and losing team in namedtuple('Team', ('name', 'points', 'won'))
    """
    team_data = game_summary.find(class_='winner')
    winner = Team(name=team_data.a.text.strip(),
                  points=team_data.find(class_='right').text.strip(),
                  won=True)
    team_data = game_summary.find(class_='loser')
    loser = Team(name=team_data.a.text.strip(),
                 points=team_data.find(class_='right').text.strip(),
                 won=False)
    return winner, loser

for game in game_summaries:
    winner, loser = collect_teams(game)
    print("{:<25} {:>4} {:>4} {:>25}".format(winner.name, winner.points, loser.points, loser.name))


Alcorn State                81   70               Alabama A&M
Alabama State               79   65                  Southern
Troy                        78   69      Arkansas-Little Rock
Arkansas-Pine Bluff         71   68              Prairie View
Arkansas State              74   62             South Alabama
Norfolk State               74   64              Coppin State
North Carolina State        84   82                      Duke
Maryland-Eastern Shore      86   79               Florida A&M
Georgia Southern            91   80          Coastal Carolina
Georgia State               83   72         Appalachian State
Green Bay                   83   73           Cleveland State
Holy Cross                  63   55                  American
Iona                        84   74                Quinnipiac
Mississippi Valley State   103   89            Texas Southern
Niagara                     91   84                  Canisius
North Carolina Central      74   39                    Howard
Oklahoma

We're missing the date. That seems like valuable information we should hold onto.

### Method 1
Write a function to grab the date from each page.

In [4]:
def get_date(soup):
    """
    Collect the date from the page heading.
    Return (month, day, year) as strings, e.g. ('Jan', '23', '2017').
    """
    raw = soup.find(class_="game_summaries").h2.text.strip()
    scores, date = raw.split('—')
    month, day, year = date.strip().replace(',', '').split(' ')
    return month, day, year
get_date(soup)

('Jan', '23', '2017')
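
The three strings are easier to sort and compare once they are a real date object. A minimal sketch, assuming the month is always an English three-letter abbreviation like the one shown above; `to_date` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
from datetime import datetime


def to_date(month, day, year):
    """Turn ('Jan', '23', '2017') into datetime.date(2017, 1, 23)."""
    # %b matches an abbreviated month name such as 'Jan' (English locale assumed).
    text = '{} {} {}'.format(month, day, year)
    return datetime.strptime(text, '%b %d %Y').date()


game_date = to_date('Jan', '23', '2017')
print(game_date.isoformat())  # 2017-01-23
```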

When you're about to do something difficult, remember: 

## DON'T

**D**emands: Does my project require this?  
**O**nline sources: Has someone else done it better?  
**N**etwork: Are my friends smarter than me?  
**T**ry something else.

### Method 2

Notice the URL takes date parameters.

> http://www.sports-reference.com/cbb/boxscores/index.cgi?month=01&day=22&year=2017 

We can just scrape the scores for a given date and use the date we specified. This solves the additional problem of navigating to other dates.

Remember: DON'T

In [5]:
url = "http://www.sports-reference.com/cbb/boxscores/index.cgi?month={}&day={}&year={}"
print('American Format:     ', url.format(1,1,2017))

# we can even flip the parameters around to make our url less ambiguous for unamerican communists.
url = "http://www.sports-reference.com/cbb/boxscores/index.cgi?year={year}&month={month}&day={day}"
print("International Format:", url.format(year=2017,month=1,day=27))

American Format:      http://www.sports-reference.com/cbb/boxscores/index.cgi?month=1&day=1&year=2017
International Format: http://www.sports-reference.com/cbb/boxscores/index.cgi?year=2017&month=1&day=27
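
Since the date goes into the URL, walking through a range of days is mostly a matter of date arithmetic. The following is a rough sketch of that idea; `boxscore_urls` is a hypothetical helper, and it only builds the URLs without requesting them.

```python
from datetime import date, timedelta

URL = ("http://www.sports-reference.com/cbb/boxscores/index.cgi"
       "?month={month}&day={day}&year={year}")


def boxscore_urls(start, end):
    """Yield (date, url) pairs for every day from start to end, inclusive."""
    current = start
    while current <= end:
        yield current, URL.format(month=current.month,
                                  day=current.day,
                                  year=current.year)
        current += timedelta(days=1)  # handles month and year rollover


for day, url in boxscore_urls(date(2017, 1, 30), date(2017, 2, 1)):
    print(day, url)
```

Yielding the date alongside the URL keeps the date attached to whatever gets scraped from that page, which is the point of Method 2.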


# Let's Pull It All Together

In [6]:
for game_html in soup.find_all(class_='teams'):
    winner, loser = collect_teams(game_html)
    print(winner)
    print(loser)
    print()  # empty line for easier viewing

Team(name='Alcorn State', points='81', won=True)
Team(name='Alabama A&M', points='70', won=False)

Team(name='Alabama State', points='79', won=True)
Team(name='Southern', points='65', won=False)

Team(name='Troy', points='78', won=True)
Team(name='Arkansas-Little Rock', points='69', won=False)

Team(name='Arkansas-Pine Bluff', points='71', won=True)
Team(name='Prairie View', points='68', won=False)

Team(name='Arkansas State', points='74', won=True)
Team(name='South Alabama', points='62', won=False)

Team(name='Norfolk State', points='74', won=True)
Team(name='Coppin State', points='64', won=False)

Team(name='North Carolina State', points='84', won=True)
Team(name='Duke', points='82', won=False)

Team(name='Maryland-Eastern Shore', points='86', won=True)
Team(name='Florida A&M', points='79', won=False)

Team(name='Georgia Southern', points='91', won=True)
Team(name='Coastal Carolina', points='80', won=False)

Team(name='Georgia State', points='83', won=True)
Team(name='Appalachian S
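
The pairs above still lack the date, and the points are strings. One possible way to bundle everything into a single record is sketched below. The `Game` tuple, the `make_game` helper, and the `margin` field are assumptions made for illustration; the date used is the one `get_date` returned earlier.

```python
from collections import namedtuple
from datetime import date

Team = namedtuple('Team', ('name', 'points', 'won'))
# Hypothetical record type: one row per game.
Game = namedtuple('Game', ('date', 'winner', 'loser', 'margin'))


def make_game(game_date, winner, loser):
    """Combine a date with a winner/loser pair and compute the margin."""
    # The scraped points are strings such as '81', so convert before subtracting.
    margin = int(winner.points) - int(loser.points)
    return Game(game_date, winner, loser, margin)


game = make_game(date(2017, 1, 23),
                 Team('Alcorn State', '81', True),
                 Team('Alabama A&M', '70', False))
print(game.date, game.winner.name, game.margin)  # 2017-01-23 Alcorn State 11
```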

# Storing Data
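
A plain CSV file is one simple option for keeping the scraped results. This is only a sketch under assumptions: the column names in `FIELDS`, the `games.csv` filename, and both helper functions are made up for illustration, and the two sample rows come from the output shown earlier.

```python
import csv

# Assumed column names; adjust to whatever you decide to keep.
FIELDS = ('date', 'winner', 'winner_points', 'loser', 'loser_points')


def write_games(path, rows):
    """Write row tuples matching FIELDS to a CSV file with a header row."""
    # newline='' lets the csv module control line endings itself.
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(FIELDS)
        writer.writerows(rows)


def read_games(path):
    """Read the file back as a list of dicts keyed by the header row."""
    with open(path, newline='', encoding='utf-8') as f:
        return list(csv.DictReader(f))


rows = [
    ('2017-01-23', 'Alcorn State', 81, 'Alabama A&M', 70),
    ('2017-01-23', 'Alabama State', 79, 'Southern', 65),
]
write_games('games.csv', rows)
print(read_games('games.csv')[0]['winner'])  # Alcorn State
```

Note that everything read back from a CSV file is a string, so the points need converting to `int` again after loading.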

'3.5.2 |Anaconda 4.1.1 (64-bit)| (default, Jul  5 2016, 11:41:13) [MSC v.1900 64 bit (AMD64)]'