# Math 1376: Programming for Data Science
---

In [None]:
import numpy as np #We will use numpy in this lecture
import matplotlib.pyplot as plt
%matplotlib inline

## Module 5: Datasets, web scraping, and reading/writing files
---

Before we fully transition into the more "data science-y" part of the course, we describe some basic concepts for obtaining data to analyze in the Python environment.

### A high-level overview of web scraping
---

To discuss [web scraping](https://en.wikipedia.org/wiki/Web_scraping), we first require some terminology.

- *Fetching* refers to the downloading of a web page. This is what your browser (e.g., Chrome, Firefox, or Safari) does when you view the page, which is why you can still view the contents of a web page even if you lose your connection to the internet.

- *Scraping* refers to the extraction of data from the downloaded contents of a web page. 

You have probably acted as a *human web scraper* at some point by using copy/paste. The basic idea of a web scraper is to automate this otherwise mundane human process. However, automation raises some issues. As a human, you can use your judgment to decide which data are relevant on a given web page no matter the format. In other words, even if we are unsure of how the data we are looking for will appear, we *know it when we see it*. This is a uniquely human trait that an automated web scraper does not possess. Moreover, we only see what I would call *relevant* information on a web page. A web scraper will load all the data of a web page as raw HTML code through a fetch command, which will include *metadata* describing how things are supposed to look on the web page as well as any other "behind the scenes" information.

This generally means that in order to construct a useful web scraper, we have to understand something about the HTML structure of websites and how the data we wish to analyze from these websites are presented. This allows us to *scrape* only the data we want from this mess of HTML data. Consequently, web scrapers are usually built to explore web pages belonging to a single website. For instance, we may create a web scraper to collect statistics on various NBA players available from the website https://www.espn.com/nba/ where each individual athlete has a separate "profile" web page within this website (e.g., see [Jimmy Butler](https://www.espn.com/nba/player/_/id/6430/jimmy-butler)). The way in which the data are structured and potentially scaffolded across multiple sub-pages can vary significantly from how other websites present the same data (e.g., see [Jimmy Butler](https://www.nba.com/players/jimmy/butler/202710) on https://www.nba.com/).
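
To make the distinction between raw HTML and relevant data concrete, here is a minimal sketch that uses only Python's standard library `html.parser` module. The HTML string, the player names, and the class name `CellScraper` are all made up for illustration; a real scraper would obtain the string by fetching a page, and real pages are usually far messier than this.

```python
from html.parser import HTMLParser

# Hypothetical page contents; a real scraper would get this string by fetching a URL.
RAW_HTML = """
<html>
  <head><title>Roster</title><style>td { color: navy; }</style></head>
  <body>
    <table>
      <tr><th>Player</th><th>WT</th></tr>
      <tr><td>Player A</td><td>210</td></tr>
      <tr><td>Player B</td><td>245</td></tr>
    </table>
  </body>
</html>
"""


class CellScraper(HTMLParser):
    """Collect the text inside <th> and <td> tags, one list per table row."""

    def __init__(self):
        super().__init__()
        self.rows = []         # one list of cell strings per <tr>
        self._in_cell = False  # True while we are inside a <th> or <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("th", "td"):
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self._in_cell = False

    def handle_data(self, data):
        # Text outside of table cells (title, style rules, whitespace) is ignored.
        if self._in_cell:
            self.rows[-1][-1] += data


scraper = CellScraper()
scraper.feed(RAW_HTML)
print(scraper.rows)  # [['Player', 'WT'], ['Player A', '210'], ['Player B', '245']]
```

Notice that the title and the style rule are part of the fetched HTML but never reach `scraper.rows`. The scraper only works because it encodes an assumption about where this particular page keeps its data.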

Web scraping also comes with some important legal and ethical concerns. If you think of a web scraper as a *bot*, then once this *bot* is tuned to a website and turned loose, it can gather data much faster than even an army of human users could. Consequently, if you use the *bot* too aggressively (i.e., you *spam* the website with lots of requests for data), you may break the website. The legality of web scraping is also murky. You should check the *Terms and Conditions* of a website before you scrape it. Find and read the statements about the legal use of data, which often allow you to scrape data as long as it is not for commercial purposes. Some websites have a `robots.txt` file that may explicitly say whether or not they allow web scrapers.
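
As a rough illustration of how a `robots.txt` file can be checked programmatically, the sketch below uses the standard library module `urllib.robotparser`. The rules and the `example.com` addresses are hypothetical, and the helper name `allowed` is my own. A real check would read the file published by the website you intend to scrape.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; a real site publishes its own rules.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())  # parse the rules from a list of lines


def allowed(url, user_agent="*"):
    """Return True if the parsed rules permit user_agent to fetch url."""
    return rules.can_fetch(user_agent, url)


print(allowed("https://example.com/stats/roster.html"))      # True
print(allowed("https://example.com/private/salaries.html"))  # False
```

Passing a `robots.txt` check does not settle the legal question, so it complements reading the *Terms and Conditions* rather than replacing it. Pausing between requests (for example with `time.sleep`) is a simple way to avoid spamming a website.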

### A high-level overview of reading/writing data files
---

Suppose your institution already possesses the data you need to analyze in a file with a *typical format* that is understood by common data analysis tools (such as Excel or an SQL database). In this case, you do not need to go searching through websites, but you do need to know how to *read* the data from the file in order to analyze it. Your analysis may need to be added to this existing file or put in a new file to be used by others. In these cases, you need to know how to *write* data to a file in the correct format. You may even find yourself reading in data from a file, adding in new data from web scraping, and then saving that data along with any analysis of that data to an altogether new file.

The basic process of reading and writing files falls under what is referred to as [I/O (input/output)](https://docs.python.org/3/tutorial/inputoutput.html).
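
A minimal sketch of this read/write cycle, using only the standard library, might look as follows. The file name `roster.csv`, the column names, and the records are hypothetical, and the file is written to a temporary folder so that nothing is left behind.

```python
import csv
import os
import tempfile

# Hypothetical records; the column names are illustrative only.
rows = [
    {"Player": "Player A", "WT": "210"},
    {"Player": "Player B", "WT": "245"},
]

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "roster.csv")

    # Write: the with-statement closes the file even if an error occurs.
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["Player", "WT"])
        writer.writeheader()
        writer.writerows(rows)

    # Read the same file back into a list of dictionaries.
    with open(path, newline="") as f:
        loaded = list(csv.DictReader(f))

print(loaded == rows)  # True
```

Everything read from a CSV file comes back as a string, which is why the weights are stored as strings above. Converting columns to numerical types is a separate step.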

### Part (a): `read_html` and some `pandas` practice (Approx. Time: 1 hour)
---

### Some available web scraping tools and our focus
---

While there are many ways to perform web scraping and reading/writing data to files, we will focus on just a few to get you started while commenting on some of the other options that we do not have time to explore in detail. We first discuss web scraping tools and later discuss options for reading/writing data to files. 

- [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) is a module that you may want to learn if you are going to do a lot of web scraping. It is well [documented](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). If you plan on seriously developing web scrapers, then you should also learn more about HTML. There are some useful resources on getting started with Beautiful Soup, e.g., see [towards data science's introduction on web scraping](https://towardsdatascience.com/an-introduction-to-web-scraping-with-python-a2601e8619e5), [data camp's web scraping tutorial](https://www.datacamp.com/community/tutorials/web-scraping-using-python), or [medium's discussion on scraping data from a website](https://medium.com/backticks-tildes/how-to-scrape-data-from-a-website-ceda61204f67). However, fully appreciating and utilizing all that Beautiful Soup has to offer while simultaneously familiarizing ourselves with HTML tags requires more time than is available to us in this course, so we simply mention it as an option you should investigate further if you plan on doing a lot of web scraping.


- [Pandas](https://pandas.pydata.org/pandas-docs/stable/getting_started/overview.html) is a very impressive module that you may become intimately familiar with if you ever do any serious data analysis. The name is derived from the econometrics term [*panel data*](https://en.wikipedia.org/wiki/Panel_data#:~:text=In%20statistics%20and%20econometrics%2C%20panel,the%20same%20firms%20or%20individuals.) (sometimes referred to as longitudinal data by statisticians) that describes multi-dimensional data involving measurements over time. From the overview of pandas:
>  ...[provides] fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. [Pandas] aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.

   I highly recommend taking 3-5 minutes and just reading the full [overview of pandas](https://pandas.pydata.org/pandas-docs/stable/getting_started/overview.html). 
   Pandas is often used in conjunction with Beautiful Soup to turn scraped data into a [DataFrame](https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html), which is one of two types of data structures available in pandas (the other is called a Series). 
   However, pandas also has its own method for scraping specific types of data from HTML. 
   Specifically, the method [`read_html`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html) will read HTML tables into a *list* of DataFrame objects. 
   ***We will focus on using `read_html` to create lists of DataFrame objects in this notebook.***
   
   We will also use a fairly limited number of other basic methods in pandas in the context of analyzing these HTML tables of data scraped from the web using `read_html`. There is a rather thorough [10 minute introductory tutorial](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html) to pandas that I recommend going through. In fact, you can easily take that 10 minute tutorial and turn it into your own Jupyter notebook, which is a good exercise that I highly recommend.
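
Since `read_html` also accepts a file-like object, its behavior can be explored without an internet connection. The sketch below feeds it a small, made-up HTML string wrapped in `StringIO`. The table contents are hypothetical, and the call assumes that an HTML parser that pandas can use (such as `lxml`) is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical HTML containing two tables; the contents are made up for illustration.
html = """
<table>
  <tr><th>Player</th><th>HT</th><th>WT</th></tr>
  <tr><td>Player A</td><td>6-2</td><td>210</td></tr>
  <tr><td>Player B</td><td>6-5</td><td>245</td></tr>
</table>
<table>
  <tr><th>Coach</th><th>Role</th></tr>
  <tr><td>Coach C</td><td>Head Coach</td></tr>
</table>
"""

tables = pd.read_html(StringIO(html))  # one DataFrame per <table> element
print(type(tables), len(tables))
print(tables[0])
```

The result is a list with one DataFrame per table, so the first step with any new page is usually to check the length of that list and look at each entry.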

**Let's see how easy it is to use `read_html` in practice by reading in [roster data](http://www.denverbroncos.com/team/roster.html) from the Denver Broncos official website.**

In [None]:
import pandas as pd #import the pandas module

In [None]:
# Define a string variable, which we call url, containing the web address with data in an HTML table format
url = 'http://www.denverbroncos.com/team/roster.html'

In [None]:
df = pd.read_html(url) # Create a list of DataFrames stored as df (for DataFrame)

In [None]:
print(type(df)) # See, it is a list!
print(type(df[0])) # Each item in the list is a DataFrame

In [None]:
print(len(df)) # How many DataFrames are in this list?

In [None]:
print(df[0]) # See the structure of the DataFrame

In [None]:
# View the DataFrame in a visually more appealing way by not using print
df[0]

In [None]:
# Okay, I am tired of typing df[0] since there is only one DataFrame in this list

df = pd.read_html(url)[0] #This just returns the first (and only in this case) DataFrame in the list as df

In [None]:
df

### Now for some useful built-in methods in pandas to manipulate/interrogate DataFrame objects

In [None]:
col_names = df.columns
print(col_names)
print(type(col_names))

In [None]:
print(df.dtypes) # See what types of data types are stored in each column (we will discuss objects and classes in the next module)

In [None]:
# Data in a column is easily accessible using the dot convention
names = df.Player
print(names)

In [None]:
print(type(names)) # Each column of data defines a data type known as a Series

In [None]:
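# What is the average of numerical data found in this DataFrame?
# numeric_only=True skips text columns such as Player (needed in pandas 2.0 and later)
df.mean(numeric_only=True)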

In [None]:
# You can also just specify a specific column to compute a statistic
df.WT.mean()

In [None]:
# What is the deal with HT data? It is given as a string to represent units of feet followed by inches
print(df.HT[0])
print(type(df.HT[0]))

In [None]:
# Can we replace this data with numerical data in units of just inches so that we can better analyze it?

# Here is a new function for parsing strings
ht_data = df.HT[0].split('-') 
print(ht_data)

In [None]:
# Hmmmm, what if...a list comprehension! 
# (I am usually not this slick, but I just couldn't resist.)

# First, consider how to transform a single entry of the Series 
# into a list of the feet and inches as separate entries
ht_data = [int(data) for data in df.HT[0].split('-')]
print(ht_data)

In [None]:
# Now imagine that we want a list of lists for the feet and inches units of each player
# It would look something like this

ht_data = [[int(data) for data in df.HT[i].split('-')] for i in range(len(df.HT))]
print(ht_data)

In [None]:
# Now transform into inches
ht_data_inches = [12*data[0]+data[1] for data in ht_data]
print(ht_data_inches)

In [None]:
# Now transform data in df
df.HT = ht_data_inches
df

In [None]:
# Rename the HT column to be more informative
df.rename(columns = {'HT' : 'HT_in_inches'}, inplace=True) # inplace=True means we change the actual df variable
df

In [None]:
df.HT_in_inches

In [None]:
df.mean(numeric_only=True) # Now we can compute the means and get the mean HT
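
The list comprehensions above work well, but pandas also offers string methods that act on a whole column at once. Here is a sketch of the same feet-and-inches conversion on a small, hypothetical DataFrame called `demo` (so that it does not depend on the roster page being available).

```python
import pandas as pd

# Hypothetical heights in the same feet-inches string format as the HT column.
demo = pd.DataFrame({"Player": ["Player A", "Player B"], "HT": ["6-2", "5-11"]})

# Split every entry at the hyphen at once; expand=True returns one column per piece.
parts = demo["HT"].str.split("-", expand=True).astype(int)

# Column 0 holds the feet and column 1 holds the inches.
demo["HT_in_inches"] = 12 * parts[0] + parts[1]
print(demo["HT_in_inches"].tolist())  # [74, 71]
```

Either approach gives the same numbers. The column-wise version avoids the explicit loop over row indices.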

### What about cleaning up other types of data to be more amenable to statistical analysis?

Sometimes data is missing (Age data is sometimes not available) or is incongruent with the form of other types of data. 

For example, in the above data set, experience (listed as Exp) is given in years, except for rookies, who are denoted with an R instead of a 0 for their years of experience.

If you wanted to compute statistics relating to the experience of players on a roster (e.g., to try and quantify the "collective NFL wisdom" of the roster), then this is immediately problematic.

In [None]:
idx = np.where(df.Exp == 'R')[0] # Row positions of the rookies
print(idx)

In [None]:
df.loc[idx,'Exp'] = 0 # Writing zero entries for rookies experience

In [None]:
df.mean(numeric_only=True) # So what is the average experience?

In [None]:
df.Exp[0] # The data type is maybe not what you think for other entries

In [None]:
df.Exp[1] # Hmmm, we seem to be mixing ints with strings

In [None]:
df.Exp = df.Exp.astype(int) # We should make sure each Series of data in a single column of a DataFrame is the same data type
df.Exp[0]

In [None]:
df.mean(numeric_only=True) # Now we can get means for experience as well
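
The same cleanup can be written as a single chain of pandas operations. The sketch below uses a hypothetical Series named `exp` in place of the real Exp column and assumes that every entry is either the string 'R' or a string of digits.

```python
import pandas as pd

# Hypothetical experience column mixing year counts (as strings) with 'R' for rookies.
exp = pd.Series(["R", "3", "10", "R"], name="Exp")

# Replace the rookie marker, then convert the whole column to integers.
exp_years = pd.to_numeric(exp.replace("R", "0")).astype(int)
print(exp_years.tolist())  # [0, 3, 10, 0]
print(exp_years.mean())    # 3.25
```

Replacing 'R' with the string '0' (rather than the integer 0) keeps the column a single type until the conversion, which avoids the mix of ints and strings seen above.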

## Activity
---

***To-Do:*** Read in the Nuggets team roster data from ESPN. Transform the salary data into floats and relabel the column as `Salary_in_USD`, change missing salary data into `np.nan` (why would such data be missing?), change missing college data into the string 'N/A' (why is such information missing?), and similarly change WT and HT into floats and relabel these columns. Compute some statistics.

In [None]:
url = 'https://www.espn.com/nba/team/roster/_/name/den'
df = pd.read_html(url)[0]

In [None]:
df
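
As a starting point for the salary part of this activity, here is a hedged sketch on made-up data. The dollar-sign-and-comma format and the use of '--' for a missing salary are assumptions on my part, so inspect the actual table before relying on them.

```python
import pandas as pd

# Hypothetical salary strings; '--' stands in for a missing value (an assumption).
salary = pd.Series(["$1,500,000", "--", "$925,000"], name="Salary")

# Strip the dollar signs and commas, treating them as literal characters.
cleaned = salary.str.replace("$", "", regex=False).str.replace(",", "", regex=False)

# errors="coerce" turns anything that is not a number into NaN.
salary_usd = pd.to_numeric(cleaned, errors="coerce")
print(salary_usd.tolist())  # [1500000.0, nan, 925000.0]
print(salary_usd.mean())    # 1212500.0 (NaN entries are skipped by default)
```

The argument `regex=False` matters here because `$` has a special meaning in regular expressions.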

## Activity
---

***To-Do:*** Look at an individual player's statistics from ESPN. Check out the web page and see how many different tables are read into the list. Discuss these different tables. Figure out the url to read in postseason data. What if you tried to get this data from basketball-reference.com or nba.com?

In [None]:
url = 'https://www.espn.com/nba/player/stats/_/id/3112335/nikola-jokic'
df = pd.read_html(url)
len(df)

In [None]:
df[3]

In [None]:
url = 'https://www.basketball-reference.com/players/j/jokicni01.html'
df = pd.read_html(url)
len(df)

In [None]:
df[0]

In [None]:
url = 'http://bkref.com/pi/shareit/e3F6L'
df = pd.read_html(url)
len(df)

In [None]:
df[0]

In [None]:
# What is going on at nba.com?
url = 'https://stats.nba.com/player/203999/traditional/'
df = pd.read_html(url)
len(df)

### Part (b): Reading/writing data to files
---

***To-Do:*** Discuss the following tools:

- [pandas I/O options](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html)
- [pickle](https://docs.python.org/3/library/pickle.html#module-pickle)
- [basic file I/O in Python](https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files)
- [scipy I/O](https://docs.scipy.org/doc/scipy/reference/io.html), in particular `savemat` and `loadmat`
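
To seed that discussion, here is a small sketch of two of these options: a CSV round trip with pandas and a `pickle` round trip. The DataFrame `df_out`, the file names, and the note stored alongside it are all hypothetical, and the files are written to a temporary folder.

```python
import os
import pickle
import tempfile

import pandas as pd

# Hypothetical data standing in for a cleaned roster.
df_out = pd.DataFrame({"Player": ["Player A", "Player B"], "HT_in_inches": [74, 71]})

with tempfile.TemporaryDirectory() as folder:
    csv_path = os.path.join(folder, "roster.csv")
    pkl_path = os.path.join(folder, "roster.pkl")

    # CSV: plain text that other programs can open; index=False omits the row labels.
    df_out.to_csv(csv_path, index=False)
    df_csv = pd.read_csv(csv_path)

    # pickle: a Python-specific binary format that can store many kinds of Python objects.
    with open(pkl_path, "wb") as f:
        pickle.dump({"roster": df_out, "note": "heights in inches"}, f)
    with open(pkl_path, "rb") as f:
        restored = pickle.load(f)

print(df_csv)
print(restored["note"])  # heights in inches
```

A CSV file is the safer choice for sharing data with other tools, while `pickle` is convenient for saving Python objects between sessions. Only unpickle files that you trust, since loading a pickle file can execute arbitrary code.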