<h2>Web Scraping Notebook</h2>

Scraped the NFL roster data from:
<a href="https://www.lineups.com/nfl/roster/new-england-patriots">https://www.lineups.com/nfl/roster/new-england-patriots</a>







In [1]:
# my imports
import requests
from lxml import html
import pandas as pd
from bs4 import BeautifulSoup
import re

<h3>Requesting data from website only once</h3>

Request the page once and save it on your local computer. Sending too many requests to the website can get you flagged as spam or blocked. If that happens, you can try changing your User-Agent header.
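The "request once, then reuse the local copy" idea can be sketched roughly as below. This is only an illustration: the helper name `fetch_once`, its parameters, and the User-Agent string are hypothetical and not part of the original notebook.

```python
from pathlib import Path


def fetch_once(url, cache_path, user_agent="Mozilla/5.0 (roster-notebook example)"):
    """Return the page HTML, downloading it only if no local copy exists."""
    cache = Path(cache_path)
    if cache.exists():
        # Local copy found: no request is sent to the website.
        return cache.read_text(encoding="utf-8")

    # Imported here so that reading a cached page works without requests installed.
    import requests

    # The User-Agent value is a placeholder (an assumption); substitute your own.
    response = requests.get(url, headers={"User-Agent": user_agent}, timeout=30)
    response.raise_for_status()  # stop on 4xx/5xx instead of caching an error page
    cache.write_text(response.text, encoding="utf-8")
    return response.text
```

Calling `fetch_once("https://www.lineups.com/nfl/roster/new-england-patriots", "nfl_ne_roster.html")` would hit the site on the first run only; later runs read the saved file.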

In [42]:
def pageRequestData():
    """Download the roster page and return (url, page, soup)."""
    # The URL of the page to scrape.
    url = "https://www.lineups.com/nfl/roster/new-england-patriots"

    # Send a GET request; .text gives the HTML document back as a string.
    page = requests.get(url).text

    # Parse that string into a BeautifulSoup tree of navigable Tag objects.
    soup = BeautifulSoup(page, 'lxml')

    # Return all three values so the call below can unpack them.
    return url, page, soup

# Run this only once, then work from the saved file:
# url, page, soup = pageRequestData()

<h3>GitHub cannot send page requests</h3>

So I need to save soup as an HTML file on my local machine and load it from there.


In [2]:
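filename = "nfl_ne_roster.html"

# Run once, right after pageRequestData(), to save the page:
# with open(filename, "w") as file:
#     file.write(str(soup))

# Read the saved HTML back in; the with block closes the file afterwards.
with open(filename, "r") as file:
    soupString = file.read()

# turn back into BeautifulSoup
soup = BeautifulSoup(soupString, "html.parser")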


<h3>Get headers for the Excel file</h3>

In [3]:
# Get headers
# soup.find('thead') returns the first <thead> element in the document.
headersInfo = soup.find('thead')

# In this table, each column title sits inside an <a> tag within the <thead>
# (on the original page, clicking one of these links sorts the table by that column),
# so find_all('a') returns one tag per column header.
headerCols = headersInfo.find_all('a')

# Next, collect the text of each header so it can become the column names
# of the DataFrame that gets exported to Excel.

# make a list for the headers
# The first element is "Team", which is not a column in the table.
headerColList = ["Team"]

for headerCol in headerCols:
#     print(headerCol.text.strip()) # to see the headerCols in good format
    headerColList.append(headerCol.text.strip())

print("----Spreadsheet Headers----")
print(headerColList) # The header for the dataframe for excel

----Spreadsheet Headers----
['Team', 'Pos', 'Name', 'Number', 'Rating', 'Ranking', 'Depth', 'Height', 'Weight', 'Age', 'Birthday', 'Exp.', 'Drafted', 'Draft Round', 'Draft Pick', 'College']
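
The header extraction above can be tried without the saved roster file. The sketch below runs the same `find` / `find_all` steps on a tiny made-up table; the HTML is a hypothetical stand-in, not markup copied from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the roster table (an assumption for illustration).
sample_html = """
<table>
  <thead>
    <tr>
      <th><a> Pos </a></th>
      <th><a>Name</a></th>
      <th><a>Number</a></th>
    </tr>
  </thead>
  <tbody>
    <tr class="t-content"><td>QB</td><td>Tom Brady</td><td>12</td></tr>
  </tbody>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# First <thead>, then every <a> inside it, then the stripped text of each link.
sample_thead = sample_soup.find("thead")
sample_headers = ["Team"] + [a.get_text(strip=True) for a in sample_thead.find_all("a")]

print(sample_headers)  # ['Team', 'Pos', 'Name', 'Number']
```

The list comprehension does the same work as the `for` loop in the cell above, and `get_text(strip=True)` removes the surrounding whitespace.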


<h3>Get data of each player in the roster data</h3>

In [38]:
team = "NE"

# This is the roster for the 2019-20 season.
# Some upcoming problems:
# - I will need ChromeDriver (or another tool) to switch the HTML buttons on the page.
# - I would also like to add player stats, but the stats pages show the 2018-2019 season
#   while the roster shows the current year, which hasn't started yet.
# The tables won't match up exactly, which is OK: players on the roster without
# stats from last season are new.

# Now get the player information (no stats) for the 2019-2020 roster.
# The first <tbody> element in the soup holds the table data.
PlayerRosterInfoData = soup.find('tbody')

# Make a list of all the 'tr' rows; each row is one player's info
t_content_rows = PlayerRosterInfoData.find_all('tr', {'class': 't-content'})
# print(len(t_content_rows)) # number of players in table

# Matches the first two whitespace-separated words of the name cell.
# Compiled once, outside the loops, as a raw string.
playerName_re = re.compile(r'\S+\s\S+')

playerData = []

# for each player
for each_player in t_content_rows:
    # each <td> holds one column of the player's row
    player_row = each_player.find_all('td')

    tempData = [team]

    # col_index avoids reusing (shadowing) an outer loop variable
    for col_index, col in enumerate(player_row):
        cell_text = col.text.strip()

        # the player's name at index 1 needs to be formatted correctly
        if col_index == 1:
            tempData.append(playerName_re.match(cell_text).group())
        # turn strings of digits into ints
        elif cell_text.isdigit():
            tempData.append(int(cell_text))
        else:
            tempData.append(cell_text)

    playerData.append(tempData)
print(playerData[0])


['NE', 'QB', 'Tom Brady', 12, 95, '#1 QB', 1, '6\'4"', 225, 42, '8/3/77', 20, 2000, 6, 199, 'Michigan']
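
To see what the cell cleaning above does to individual values, here is a small standalone sketch. The helper `clean_cell` and the sample strings are hypothetical; the real page's cell contents may differ. Unlike the loop above, this version falls back to the raw text when the name pattern does not match, which is a deliberate change for the sake of the example.

```python
import re

# Same pattern as in the cell above: two runs of non-whitespace
# separated by a single whitespace character.
player_name_re = re.compile(r"\S+\s\S+")


def clean_cell(text, is_name=False):
    """Convert one table cell's text roughly the way the loop above does."""
    text = text.strip()
    if is_name:
        match = player_name_re.match(text)
        # Fall back to the raw text if the pattern does not match,
        # e.g. for a single-word name.
        return match.group() if match else text
    # Only strings made up entirely of digits become ints.
    return int(text) if text.isdigit() else text


# Hypothetical cell contents, not copied from the real page.
print(clean_cell("Tom Brady T. Brady", is_name=True))  # Tom Brady
print(clean_cell(" 12 "))                              # 12
print(clean_cell("#1 QB"))                             # #1 QB
print(clean_cell("John Smith Jr.", is_name=True))      # John Smith
```

The last line shows a limitation of the pattern: it keeps only the first two words, so a suffix such as "Jr." or a three-word name is cut off.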


<h3>Put data into a DataFrame and export it as an Excel sheet</h3>

In [40]:
df = pd.DataFrame(playerData, columns = headerColList)
df.to_excel('output.xlsx')
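
Because "Team" is added by hand to both the header list and every row, it can be worth checking that each row has exactly as many values as there are headers before building the DataFrame. Depending on the mismatch, pandas may raise an error or fill the gaps with missing values. A minimal sketch of such a check, using a hypothetical helper `find_bad_rows` and made-up rows:

```python
def find_bad_rows(rows, headers):
    """Return (index, length) for every row whose length differs from the header count."""
    expected = len(headers)
    return [(i, len(row)) for i, row in enumerate(rows) if len(row) != expected]


example_headers = ["Team", "Pos", "Name", "Number"]
example_rows = [
    ["NE", "QB", "Tom Brady", 12],
    ["NE", "WR", "Example Player"],  # hypothetical row with a missing cell
]

print(find_bad_rows(example_rows, example_headers))  # [(1, 3)]
```

In the notebook this would be called as `find_bad_rows(playerData, headerColList)`; an empty list means every row lines up with the headers. Separately, `to_excel` writes the DataFrame index as the first spreadsheet column by default, and passing `index=False` leaves it out.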

<h4> Coding Notes </h4>

In [5]:
# NOTES while coding
# imports I tried
# from lxml.html.clean import clean_html
# from selenium import webdriver
# from selenium.webdriver.support.ui import Select
# from selenium.webdriver.common.keys import Keys

# the location of a webdriver on my computer
# this thing is probably what I need if I want to extract from complex websites where forms, and buttons, and parameters
# need to be changed on the website.
# driver = webdriver.Chrome('C:/NEW PROGRAMS/chromedriver_win32/chromedriver.exe')

# if I wanted to replace a certain letter in a string
# and how to make things lowercase
# for i, playerName in enumerate(players):
# #     replace spaces with a dash and make name lowercase
#     players[i] = players[i].replace(" ", "-").lower()

# An example of what an XPath copied from a website looks like:
# /html/body/app-root/app-nfl/app-roster/div/div/div[2]/div[2]/div/div/table/tbody/tr[1]

# get the current date and time (needs: import datetime)
# current_time = datetime.datetime.now()

# enumerated for loop
# for i, player in enumerate(playerData):
#     print (i, player)

# a beginner's way to drop spaces and newlines: loop over the text one character at a time
# for each in playerInfo[0].text_content():
#     if(each != ' ' and each != '\n'):
#         print(each)

# URL 
# baseURL = "https://www.lineups.com/nfl/player-stats/"
# tempUrl = baseURL + players[0]
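
The name-to-URL idea from the notes above can be written as a small function. This is a sketch only: `player_stats_url` is a hypothetical helper, and whether the site uses this exact URL pattern for every player has not been verified here.

```python
BASE_URL = "https://www.lineups.com/nfl/player-stats/"


def player_stats_url(player_name, base_url=BASE_URL):
    """Build a stats URL by lowercasing the name and replacing spaces with dashes."""
    slug = player_name.strip().replace(" ", "-").lower()
    return base_url + slug


print(player_stats_url("Tom Brady"))
# https://www.lineups.com/nfl/player-stats/tom-brady
```

Names containing apostrophes, periods, or suffixes may need extra handling before they match the site's URLs.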

In [6]:
# here page must be the Response object from requests.get(url), not the .text string
# tree = html.fromstring(page.content)

# playerInfo = tree.xpath('//tbody') 
# # print(playerInfo[0].text_content()[0:100])

# for each in playerInfo[0].text_content():
#     if(each != ' ' and each != '\n'):
#         print(each)

# print(type(playerInfo[0].text_content()))


# print(playerInfo[0].xpath('//td')[0])

# players = tree.xpath('//span[@class = "player-name-col-lg"]/text()')


# for i, playerName in enumerate(players):
# #     replace spaces with a dash and make name lowercase
#     players[i] = players[i].replace(" ", "-").lower()

# print(players[0])

# baseURL = "https://www.lineups.com/nfl/player-stats/"
# tempUrl = baseURL + players[0]

# print(tempUrl)