## Data Scrape - All-Star Rosters

This notebook collects the NBA All-Star rosters from 2006 through 2018 from Basketball Reference. The resulting list of players will be used to generate my target variable.

### Contents

- [Imports](#Imports)
- [Test Scrape](#Test-Scrape)
- [Complete Scrape](#Complete-Scrape)


### Imports

In [1]:
# Import libraries
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time

### Test Scrape

In [2]:
#Testing scrape on the 2018 NBA All-Star game
url = 'https://www.basketball-reference.com/allstar/NBA_2018.html'
rec = requests.get(url)

In [3]:
rec.status_code

200

In [4]:
soup = BeautifulSoup(rec.content, 'lxml')

In [5]:
#table for first team
table = soup.find_all('table')[1]

In [6]:
table.find('tbody').find_all('tr')[0].find('th').text

'LeBron James'

In [7]:
#looping through the first table to get players from the first team
#the 'Reserves' header row is skipped; the blank spacer row ('') is removed later
[tr.find('th').text for tr in table.find('tbody').find_all('tr') if tr.find('th').text != 'Reserves']

['LeBron James',
 'Kevin Durant',
 'Russell Westbrook',
 'Kyrie Irving',
 'Anthony Davis',
 '',
 'Paul George',
 'Andre Drummond',
 'Bradley Beal',
 'Victor Oladipo',
 'Kemba Walker',
 'Goran Dragić',
 'LaMarcus Aldridge']

In [8]:
#table for second team
table_2 = soup.find_all('table')[2]

In [9]:
table_2.find('tbody').find_all('tr')[0].find('th').text

'James Harden'

In [10]:
#looping through to get players from the second team
[tr.find('th').text for tr in table_2.find('tbody').find_all('tr') if tr.find('th').text != 'Reserves']

['James Harden',
 'DeMar DeRozan',
 'Stephen Curry',
 'Giannis Antetokounmpo',
 'Joel Embiid',
 '',
 'Kyle Lowry',
 'Klay Thompson',
 'Damian Lillard',
 'Draymond Green',
 'Karl-Anthony Towns',
 'Al Horford']
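
Both lists above still contain an empty string, which comes from a blank spacer row in each table. A minimal sketch of a filter that drops the blank row along with the 'Reserves' header follows. `clean_roster` and its `skip` parameter are hypothetical names that do not appear in the original notebook, and the sample list is shortened from the output above.

```python
def clean_roster(names, skip=('Reserves',)):
    """Return player names with section labels and blank rows removed."""
    cleaned = []
    for name in names:
        name = name.strip()  # blank spacer rows become '' and are skipped
        if name and name not in skip:
            cleaned.append(name)
    return cleaned


# Shortened sample of the raw row texts (illustration only)
raw_rows = ['LeBron James', 'Anthony Davis', 'Reserves', '', 'Paul George']
print(clean_roster(raw_rows))  # ['LeBron James', 'Anthony Davis', 'Paul George']
```

Filtering at this stage would make the later hard-coded removal of the blank name unnecessary, at the cost of one extra helper.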

### Complete Scrape

In [11]:
#getting All-Star rosters from 2006-2018
player_list = []

for i in range(2006, 2019):
    print(f'Scraping the {i} All-Star game')
    url = f'https://www.basketball-reference.com/allstar/NBA_{i}.html'
    rec = requests.get(url)
    if rec.status_code == 200:
        soup = BeautifulSoup(rec.content, 'lxml')
        
        #getting players for team 1
        table_1 = soup.find_all('table')[1]
        team_1 = [tr.find('th').text for tr in table_1.find('tbody').find_all('tr') if tr.find('th').text != ('Reserves')]
        player_list.extend(team_1)
        
        #getting players for team 2
        table_2 = soup.find_all('table')[2]
        team_2 = [tr.find('th').text for tr in table_2.find('tbody').find_all('tr') if tr.find('th').text != ('Reserves')]
        player_list.extend(team_2)
        
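    else:
        #reporting which season failed and why
        print(f'website error for {i}: status code {rec.status_code}')

    #pausing between requests to avoid overloading the site
    time.sleep(1)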

Scraping the 2006 All-Star game
Scraping the 2007 All-Star game
Scraping the 2008 All-Star game
Scraping the 2009 All-Star game
Scraping the 2010 All-Star game
Scraping the 2011 All-Star game
Scraping the 2012 All-Star game
Scraping the 2013 All-Star game
Scraping the 2014 All-Star game
Scraping the 2015 All-Star game
Scraping the 2016 All-Star game
Scraping the 2017 All-Star game
Scraping the 2018 All-Star game
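
The loop above only prints a message when a request fails, so a failed season is easy to miss in the output. One possible variation, sketched under assumptions, records the failed years and keeps the season next to each name in case a per-season label is ever needed. `scrape_seasons`, `fetch_players`, and `delay` are hypothetical names; `fetch_players` stands in for the request-and-parse step and is assumed to return a list of names, or `None` when the status code is not 200.

```python
import time


def scrape_seasons(years, fetch_players, delay=1.0):
    """Collect (year, name) pairs and a list of years that could not be fetched."""
    rows = []
    failed = []
    for year in years:
        players = fetch_players(year)  # assumed: list of names, or None on failure
        if players is None:
            failed.append(year)
        else:
            rows.extend((year, name) for name in players)
        time.sleep(delay)  # pause between requests
    return rows, failed


# Stand-in fetcher for illustration only: no network access,
# and the failure for 2007 is made up to show the failed list.
def fake_fetch(year):
    sample = {2006: ['Dwyane Wade', 'LeBron James']}
    return sample.get(year)


rows, failed = scrape_seasons([2006, 2007], fake_fetch, delay=0)
print(rows)    # [(2006, 'Dwyane Wade'), (2006, 'LeBron James')]
print(failed)  # [2007]
```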


In [12]:
#creating dataframe
all_star_df = pd.DataFrame(player_list, columns = ['Name'])

In [13]:
all_star_df.head()

| | Name |
|---|---|
| 0 | Dwyane Wade |
| 1 | LeBron James |
| 2 | Allen Iverson |
| 3 | Shaquille O'Neal |
| 4 | Vince Carter |


In [14]:
all_star_df.shape

(337, 1)

In [15]:
#dropping duplicates, since many players have made multiple All-Star games
all_star_df.drop_duplicates(inplace = True)

In [16]:
#sorting by name shows the blank entry that still needs to be removed
all_star_df.sort_values(by = 'Name').head()

| | Name |
|---|---|
| 5 | |
| 114 | Al Horford |
| 2 | Allen Iverson |
| 36 | Amar'e Stoudemire |
| 283 | Andre Drummond |


In [17]:
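#dropping the blank name (index 5 above) by value rather than by a hard-coded index
all_star_df.drop(all_star_df[all_star_df['Name'] == ''].index, inplace = True)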

In [18]:
#resetting index
all_star_df.reset_index(drop = True, inplace = True)
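
The same cleanup could also be applied to the list of names before the dataframe is built. The sketch below is a plain-Python equivalent, assuming blank entries are empty or whitespace-only strings; `unique_names` is a hypothetical helper. `dict.fromkeys` keeps the first occurrence of each name in its original order, which matches the default behaviour of `drop_duplicates`.

```python
def unique_names(names):
    """Drop blank entries, then remove duplicates while keeping first-seen order."""
    non_blank = (name.strip() for name in names if name.strip())
    return list(dict.fromkeys(non_blank))


# Short sample with one blank and one repeated name (illustration only)
sample = ['Dwyane Wade', 'LeBron James', '', 'LeBron James', 'Al Horford']
print(unique_names(sample))  # ['Dwyane Wade', 'LeBron James', 'Al Horford']
```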

In [19]:
#saving to csv file
all_star_df.to_csv('../Data_Files/all_star_rosters.csv')
