# Math 1376: Programming for Data Science
---

In [None]:
import numpy as np #We will use numpy in this lecture
import matplotlib.pyplot as plt
%matplotlib inline

## Module 05: Datasets, web scraping, and reading/writing files
---

Before we fully transition into the more "data science-y" part of the course, we describe some basic concepts for obtaining data to analyze in the Python environment.

<span style='background:rgba(255,255,0, 0.25); color:black'> Run the code cell below and click the "play" button to see the first recorded lecture associated with this notebook.</span>

In [None]:
# 1. Running this cell will embed the short recorded lecture associated with this part of the notebook
# 2. Press on the "play" button to start the video.

from IPython.display import YouTubeVideo

YouTubeVideo('fZR9gTbaVSA', width=800, height=450)

## The big picture
---

### A high-level overview of web scraping
---

To discuss [web scraping](https://en.wikipedia.org/wiki/Web_scraping), we first provide an analogy and then some useful terminology.

<span style='background:rgba(255,0,255, 0.25); color:black'> ***An analogy:*** </span>

- You have probably acted as a *human web scraper* at some point by using copy/paste. 

  For instance, suppose you wanted to complain about the Broncos roster construction to a friend. You may go to their [roster web page](http://www.denverbroncos.com/team/roster.html) and copy the roster data from it before pasting it in an email. 

  The basic idea of a web scraper is to automate this otherwise mundane human process.
  
  For instance, a single copy/paste of roster data into a single email may not seem that annoying. But imagine that you wanted to put the contents of various rosters into multiple emails because you are managing an email list for a sports website. You are going to get annoyed very quickly with having to constantly check for updates on the rosters across all NFL teams.
  
  However, automating this process raises some issues. 
  
  As a human, you can use your judgment as to what data are relevant on a given web page no matter the format. In other words, even if we are unsure of how the data we are looking for will appear, we *know it when we see it*. This is a uniquely human trait that an automated web scraper does not possess. 

<span style='background:rgba(255,0,255, 0.25); color:black'> ***Terminology:*** </span>

- *Fetching* refers to the downloading of a web page. In the analogy, this is like copying the entire web page. This is what your browser (e.g., Chrome, Firefox, or Safari) does when you view the page, which is why you can still view the contents of a web page even if you lose your connection to the internet. 

- *Scraping* refers to the extraction of data from the downloaded contents of a web page. In the analogy, this is determining what part of the copied web page data to paste. 
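
To make the distinction between fetching and scraping concrete, here is a minimal sketch that uses only Python's built-in `html.parser` module. The string `fetched_html` is a made-up stand-in for a page that has already been fetched (nothing is downloaded here), and the class name `TableCellScraper` is our own invention for this illustration.

```python
from html.parser import HTMLParser

# Stand-in for the result of a fetch. In practice, this string would be
# downloaded from a URL. The roster below is made up for illustration.
fetched_html = """
<html><head><title>Example roster</title></head>
<body>
  <p>Navigation, ads, and other content we do not care about.</p>
  <table>
    <tr><th>Player</th><th>Pos</th></tr>
    <tr><td>Player A</td><td>QB</td></tr>
    <tr><td>Player B</td><td>WR</td></tr>
  </table>
</body></html>
"""

class TableCellScraper(HTMLParser):
    """Collect the text of every table cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []        # one list of cell strings per <tr>
        self.in_cell = False  # True while we are inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.rows.append([])  # start a new row
        elif tag in ('td', 'th'):
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag in ('td', 'th'):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:  # ignore all text outside of table cells
            self.rows[-1].append(data.strip())

# The "scraping" step: keep only the table data from the fetched HTML
scraper = TableCellScraper()
scraper.feed(fetched_html)
print(scraper.rows)
```

Notice that the title and the paragraph of "other content" are part of what was fetched, but they never make it into `scraper.rows`. We had to tell the scraper which HTML tags hold the data we care about.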


<span style='background:rgba(255,0,255, 0.25); color:black'> ***The computers are only as smart as we are:*** </span>

- When we look at a web page, we only see what I would call *relevant* information on a web page. 

  A web scraper will load all the data of a web page as raw HTML code through a fetch command, which will include *metadata* describing how things are supposed to look on the web page as well as any other "behind the scenes" information. 
  
  This generally means that in order to construct a useful web scraper, we (meaning us as human beings) have to understand something about the HTML structure of websites and how the data we wish to analyze from these websites is presented. 
  
  This allows us to *scrape* only the data we want from this mess of HTML data. 
  
  Consequently, web scrapers are usually built to explore web pages belonging to a single website. 

- For instance, we may create a web scraper to optimally collect statistics on various NBA players available from the website https://www.espn.com/nba/ where each individual athlete has a separate "profile" web page within this website (e.g., see [Jimmy Butler](https://www.espn.com/nba/player/_/id/6430/jimmy-butler)). The way in which the data is structured and potentially scaffolded across multiple sub-pages can vary significantly from how other websites may present the data (e.g., see [Jimmy Butler](https://www.nba.com/players/jimmy/butler/202710) on https://www.nba.com/). 

<span style='background:rgba(255,0,255, 0.25); color:black'> ***Ethics:*** </span>

Web scraping also comes with some important legal and ethical concerns. If you think of a web scraper as a *bot*, then once this *bot* is tuned to a website and turned loose, it can gather data much faster than even an army of human users could. Consequently, if you use the *bot* too aggressively (i.e., you *spam* the website with lots of requests for data), you may overload or even break the website. Moreover, the legality of web scraping is murky. You should check the *Terms and Conditions* of a website before you scrape it. Find and read the statements about the legal use of data, which often (but not always) allow you to scrape data as long as it is not for commercial purposes. Some websites also have a `robots.txt` file that states which parts of the site automated programs may or may not access.
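
As a small illustration of the `robots.txt` idea, here is a sketch that uses the built-in `urllib.robotparser` module. The file contents, the bot name `my-course-bot`, and the `example.com` addresses are all made up for this example. For a real website, you would read its actual `robots.txt` file (and its terms of use) instead.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt file, given as a string so nothing is downloaded
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())  # parse the rules line by line

# Ask whether our (hypothetical) bot may fetch two different pages
allowed = rp.can_fetch('my-course-bot', 'https://example.com/team/roster')
blocked = rp.can_fetch('my-course-bot', 'https://example.com/private/data')

# The number of seconds the site asks bots to wait between requests
delay = rp.crawl_delay('my-course-bot')

print(allowed, blocked, delay)
```

A polite scraper would skip any page for which `can_fetch` returns `False` and would pause (e.g., with `time.sleep`) between requests. Keep in mind that `robots.txt` is a convention, not a legal document, so it does not replace reading the terms of use.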

### A high-level overview of reading/writing data files
---

Suppose your institution already possesses the data you need to analyze in a file with a *typical format* that is understood by common data analysis programs (such as Excel or SQL). In this case, you do not need to go searching through websites, but you do need to know how to *read* the data from the file in order to analyze it. Your analysis may need to be added to this existing file or put in a new file to be used by others. In these cases, you need to know how to *write* data to a file in the correct format. You may even find yourself reading in data from a file, adding in new data from web scraping, and then saving that data along with any analysis of that data to an altogether new file. 

The basic process of reading/writing files falls under what is referred to as [I/O (input/output)](https://docs.python.org/3/tutorial/inputoutput.html). 

## Learning Objectives
---

- Create code to scrape data directly from websites using pandas.

- Understand what a DataFrame is and how to manipulate it.

- Be able to create/read/write data files in different formats.

## Notebook contents <a id='Contents'>

* <a href='#read_html'>Part (a): `read_html` and some `pandas` practice</a>

    * <a href='#activity-nuggets'>Activity: Go Nuggets Go!</a>
        
    * <a href='#activity-Jokic'>Activity: Nothing is as consistent as Jokic! Not even his statistics...</a>

* <a href='#I/O'>Part (b): Reading/writing data to files</a>
    
* <a href='#activity-summary'>Activity: Summary</a>

## Part (a): `read_html` and some `pandas` practice <a id='read_html'>
---
    
**Expected time to completion: 6-9 hours**    

<span style='background:rgba(255,255,0, 0.25); color:black'> Run the code cell below and click the "play" button to see the first recorded lecture associated with this notebook.</span>

In [None]:
# 1. Running this cell will embed the short recorded lecture associated with this part of the notebook
# 2. Press on the "play" button to start the video.

from IPython.display import YouTubeVideo

YouTubeVideo('Z4WEbV4nk6c', width=800, height=450)

### Some available web scraping tools and our focus
---

While there are many ways to perform web scraping and reading/writing data to files, we will focus on just a few to get you started while commenting on some of the other options that we do not have time to explore in detail. 

We first discuss web scraping tools and later discuss options for reading/writing data to files. 

- [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) is a module that you may want to learn if you are going to do a lot of web scraping. 

    - It is well [documented](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). But, if you plan on becoming a serious web scraper developer, then you should also learn more about HTML. 

    - There are some useful resources on getting started with Beautiful Soup, e.g., see [towards data science's introduction on web scraping](https://towardsdatascience.com/an-introduction-to-web-scraping-with-python-a2601e8619e5), [data camp's web scraping tutorial](https://www.datacamp.com/community/tutorials/web-scraping-using-python), or [medium's discussion on scraping data from a website](https://medium.com/backticks-tildes/how-to-scrape-data-from-a-website-ceda61204f67). 


To fully appreciate and utilize all that Beautiful Soup has to offer while simultaneously familiarizing ourselves with HTML tags requires more time than is available to us in this course, so we simply mention it as an option you should investigate further if you plan on doing a lot of web scraping.
<br>

- [Pandas](https://pandas.pydata.org/pandas-docs/stable/getting_started/overview.html) is a very impressive module that you may become intimately familiar with if you ever do any serious data analysis. The name is derived from the econometrics term [*panel data*](https://en.wikipedia.org/wiki/Panel_data#:~:text=In%20statistics%20and%20econometrics%2C%20panel,the%20same%20firms%20or%20individuals.) (sometimes referred to as longitudinal data by statisticians) that describes multi-dimensional data involving measurements over time. From the overview of pandas:
>  ...[provides] fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. [Pandas] aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.

   I highly recommend taking 3-5 minutes and just reading the full [overview of pandas](https://pandas.pydata.org/pandas-docs/stable/getting_started/overview.html).
   
   Pandas is often used in conjunction with Beautiful Soup to turn scraped data into a [DataFrame](https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html), which is one of two types of data structures available in pandas (the other is called a Series). 
   
   However, pandas also has its own method for scraping specific types of data from HTML. 
   Specifically, the method [`read_html`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html) will read HTML tables into a *list* of DataFrame objects. 
   
   ***We will focus on using `read_html` to create lists of DataFrame objects in this notebook.***
   
   We will also use a fairly limited number of other basic methods in pandas in the context of analyzing these HTML tables of data scraped from the web using `read_html`. 

- There is a rather thorough [10 minute introductory tutorial](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html) to pandas that I recommend going through. 

  *In fact, you can easily take that 10 minute tutorial and turn it into your own Jupyter notebook, which is a good activity that I highly recommend.*

**Let's see how easy it is to use `read_html` in practice by reading in [roster data](http://www.denverbroncos.com/team/roster.html) from the Denver Broncos official website.**

In [None]:
import pandas as pd #import the pandas module

In [None]:
# Define a string variable, which we call url, containing the web address with data in an HTML table format
url = 'https://www.denverbroncos.com/team/players-roster/'

In [None]:
list_of_dfs = pd.read_html(url) # Create a list of DataFrames stored as list_of_dfs

In [None]:
print(type(list_of_dfs)) # See, it is a list!
print(type(list_of_dfs[0])) # Each item in the list is a DataFrame

In [None]:
print(len(list_of_dfs)) # How many DataFrames are in this list?

In [None]:
print(list_of_dfs[0]) # See the structure of the DataFrame

In [None]:
# View the DataFrame in a visually more appealing way by not using print
list_of_dfs[0]

In [None]:
# Okay, I am tired of typing list_of_dfs[0] since there is only one DataFrame in this list

df_broncos = pd.read_html(url)[0] #This just returns the first (and only in this case) DataFrame in the list as df_broncos

In [None]:
df_broncos # view the DataFrame

In [None]:
df_broncos.head() #View the first 5 rows

In [None]:
df_broncos.head(10) #view the first 10 rows

In [None]:
df_broncos.shape #View how many rows and columns there are in the DataFrame

In [None]:
df_broncos.iloc[0,:] #View all data related to first row

In [None]:
df_broncos.iloc[[0],:] #View all data related to first row in nicer format

In [None]:
df_broncos.iloc[-5:,:] #view last 5 rows of data

In [None]:
# The iloc is expecting integer inputs for its indexing
df_broncos.iloc[-5:,[0, 3, 4]] #view names, HT, and WT

In [None]:
# The loc indexer is LABEL based: it selects rows by their index labels
# and columns by their names. What does this mean?
# Negative values are not present in the index labels, so 
# keep this in mind. The code below is interpreted as asking for
# all rows whose labels are -5 or larger (here, every row), which is
# fundamentally different from starting at the 5th row from the end.
df_broncos.loc[-5:,['Player', 'HT', 'WT']]

In [None]:
df_broncos.loc[len(df_broncos)-5:,['Player', 'HT', 'WT']] #using loc to view the last 5 rows (works because the index labels are 0, 1, 2, ...)

### <span style='background:rgba(0,255,255, 0.5); color:black'>Try it out for yourself.</span>

This is a suggested activity. Create some code cells below to explore differences between using `loc` and `iloc` to print out some specific rows (and columns) of data from the `df_broncos` DataFrame.
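
If you want a starting point, here is a small sketch on a made-up DataFrame (the names `df_demo`, `by_position`, `by_label`, `last_two`, and `no_match` are ours and are not part of the roster data). The index labels are deliberately chosen to *not* be 0, 1, 2, ... so that the difference between positions and labels is easier to see.

```python
import pandas as pd

# A tiny made-up roster whose index labels are 10, 20, 30, 40
df_demo = pd.DataFrame(
    {'Player': ['A', 'B', 'C', 'D'], 'WT': [210, 305, 190, 250]},
    index=[10, 20, 30, 40],
)

# iloc is POSITION based: positions 0 and 1 (the end point 2 is excluded)
by_position = df_demo.iloc[0:2]

# loc is LABEL based: labels 10 through 30 (the end point 30 is INCLUDED)
by_label = df_demo.loc[10:30]

# Negative positions count from the end with iloc
last_two = df_demo.iloc[-2:]

# There are no labels between 0 and 2, so this loc slice is empty
no_match = df_demo.loc[0:2]

print(by_position)
print(by_label)
print(last_two)
print(len(no_match))
```

Try the same kinds of slices on `df_broncos` and compare what you get. Since its index labels are 0, 1, 2, ..., the two indexers can look deceptively similar there.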

### Some style options for aid in visual inspection of data

In [None]:
# Maybe some useful visuals for numerical data
df_broncos.style.bar(subset=['Age', 'WT'])

In [None]:
# A different type of visual for some data
df_broncos.style.background_gradient(subset=['WT']) 

In [None]:
def highlight_greaterthan_300(s): # Highlight the 300lb players on the roster
    if s.WT >= 300.0:
        return ['background-color: yellow']*len(df_broncos.columns)
    else:
        return ['background-color: black']*len(df_broncos.columns)


df_broncos.style.apply(highlight_greaterthan_300, axis=1)

In [None]:
def highlight_WRs(s): # Highlight the wide receivers
    if s.Pos == 'WR':
        return ['background-color: yellow']*len(df_broncos.columns)
    else:
        return ['background-color: black']*len(df_broncos.columns)


df_broncos.style.apply(highlight_WRs, axis=1)

### <span style='background:rgba(0,255,255, 0.5); color:black'>Try it out for yourself.</span>

This is a suggested activity. Use the code cell below to highlight all the linebackers on the roster. There are three types on the roster. An LB is just a general linebacker. An OLB is an outside linebacker and an ILB is an inside linebacker. If you are clever about using slicing, you can avoid using logical `or` in the function.

### Now for some useful built-in methods in pandas to manipulate/interrogate DataFrame objects

In [None]:
col_names = df_broncos.columns
print(col_names)
print(type(col_names))

In [None]:
print(df_broncos.dtypes) # See what types of data types are stored in each column (we will discuss objects and classes in the next module)

In [None]:
# Data in a column is easily accessible using the dot convention
names = df_broncos.Player
print(names)

In [None]:
print(type(names)) # Each column of data defines a data type known as a Series

In [None]:
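# What is the average of numerical data found in this DataFrame?
# (numeric_only=True skips the non-numerical columns, which recent pandas versions require)
df_broncos.mean(numeric_only=True)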

In [None]:
# You can also just specify a specific column to compute a statistic
df_broncos.WT.mean()

In [None]:
# Try to take the mean of the height data (this will not work)
df_broncos.HT.mean()

In [None]:
# What is the deal with HT data? It is given as a string to represent units of feet followed by inches
print(df_broncos.HT[0])
print(type(df_broncos.HT[0]))

In [None]:
# Can we replace this data with numerical data in units of just inches so that we can better analyze it?

# Here is a new string method (split) for parsing strings
ht_data = df_broncos.HT[0].split('-') 
print(ht_data)

In [None]:
# Hmmmm, what if...a list comprehension! 
# (I am usually not this slick, but I just couldn't resist.)

# First, consider how to transform a single entry of the Series 
# into a list of the feet and inches as separate entries
ht_data = [int(data) for data in df_broncos.HT[0].split('-')]
print(ht_data)

In [None]:
# Now imagine that we want a list of lists for the feet and inches units of each player
# It would look something like this

ht_data = [[int(data) for data in df_broncos.HT[i].split('-')] for i in range(len(df_broncos.HT))]
print(ht_data)

In [None]:
# Now transform into inches
ht_data_inches = [12*data[0]+data[1] for data in ht_data]
print(ht_data_inches)

***There are lots of other ways we could have transformed the height data into inches.***

The use of two list comprehensions above was *completely* arbitrary. We could have created a user-defined function to do the same thing. We show this below. **You should add comments to this function to make sure you understand all the lines of code.**

- While list comprehensions are pretty slick, they do not necessarily make for the most "readable" code except to seasoned Python programmers. 

- In my opinion, it is best to opt for clarity over cleverness. I usually opt for writing code that is as straightforward and as easy to read/understand on a line-by-line basis as I can manage. 

In [None]:
def xform_broncos_ht_to_inches(ht_data):
    ht_data_inches = []
    for i in range(len(ht_data)):
        data = ht_data[i].split('-')
        ht_data_inches.append(int(data[0])*12 + int(data[1]))
    return ht_data_inches

In [None]:
ht_data_inches_redux = xform_broncos_ht_to_inches(df_broncos.HT)
print(ht_data_inches_redux)

In [None]:
# Since we actually know the structure of the data as x-y or x-yz where
# x is a single digit for the feet and y or yz are one or two digits for the inches,
# we could also just use string slicing techniques to do all the work for us
def xform_broncos_ht_to_inches_slicing(ht_data):
    ht_data_inches = []
    for data in ht_data:
        ht_data_inches.append(int(data[0])*12 + int(data[2:]))
    return ht_data_inches

In [None]:
ht_data_inches_yet_again = xform_broncos_ht_to_inches_slicing(df_broncos.HT)
print(ht_data_inches_yet_again)

**What is the point of all these last few code cells?** 

- There are *lots* of ways we can go about manipulating the data to get it into a format we want. 

- Okay, so which is best? There is no best. You just use the method that makes sense to you. 

If you can see a quick/easy way to do list comprehensions, then great, do that. If you don't see how to do that, then fine, try a user-defined function that formats things how you want. Does the data have a regular/predictable/repeating structure to it that allows you to use slicing techniques or other methods? Or, does it have some type of string character (or characters) that you need to get rid of? Maybe the answer to all of these is yes. Then, you have a lot of options. Maybe the answer to only some of these is yes. Then, you have fewer options. No matter what, *examining and thinking* about the format the data is in vs. the format you *want* it to be in will be key to choosing how to manipulate it.
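
As one more option, pandas has its own string methods that operate on an entire Series at once. The sketch below uses a small made-up Series (`ht_demo`, not the roster data) and assumes every entry has the form feet-inches.

```python
import pandas as pd

ht_demo = pd.Series(['6-4', '5-11', '6-0'])  # made-up heights

# Split every string at the '-' into two columns (0 = feet, 1 = inches)
# and then convert both columns from strings to integers
parts = ht_demo.str.split('-', expand=True).astype(int)

# Arithmetic on columns is applied to every row at once
ht_demo_inches = 12 * parts[0] + parts[1]
print(ht_demo_inches.tolist())  # [76, 71, 72]
```

This is not "better" than the list comprehensions or user-defined functions above. It is simply another tool that may make sense to you once the data is already in a Series.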

In [None]:
# Now transform data in df
df_broncos.HT = ht_data_inches
df_broncos

In [None]:
# Rename the HT column to be more informative
df_broncos.rename(columns = {'HT' : 'HT_in_inches'}, inplace=True) # inplace=True means we change the actual df variable
df_broncos

In [None]:
df_broncos.HT_in_inches

In [None]:
df_broncos.mean(numeric_only=True) # Now we can compute the means and get the mean HT

### What about cleaning up other types of data to be more amenable to statistical analysis?

Sometimes data is missing (Age data is sometimes not available) or is incongruent with the form of other types of data. 

For example, in the above data set, experience (listed as Exp) is given in years, except for rookies, who are denoted with an R instead of a 0 for their years of experience. 

If you wanted to compute statistics relating to the experience of players on a roster (e.g., to try and quantify the "collective NFL wisdom" of the roster), then this is immediately problematic.

In [None]:
idx = np.where(df_broncos.Exp[:] == 'R')[0]
print(idx)

In [None]:
df_broncos.loc[idx,'Exp'] = 0 # Writing zero entries for rookies experience

In [None]:
df_broncos.mean(numeric_only=True) # So what is the average experience?

In [None]:
df_broncos.Exp[0] # The data type is maybe not what you think for other entries

In [None]:
df_broncos.Exp[1] # Hmmm, we seem to be mixing ints with strings

In [None]:
df_broncos.Exp = df_broncos.Exp.astype(int) # We should make sure each Series of data in a single column of a DataFrame is the same data type
df_broncos.Exp[1]

In [None]:
df_broncos.mean(numeric_only=True) # Now we can get means for experience as well

<hr style="border:5px solid cyan"> </hr>

## <span style='background:rgba(0,255,255, 0.5); color:black'>Activity: Go Nuggets Go!</span><a id='activity-nuggets'>

- Read in the [Nuggets team roster data](https://www.espn.com/nba/team/roster/_/name/den) from ESPN. 

- Transform the salary data into floats, relabel the column as `Salary_in_USD`, and change any missing salary data into `np.nan` (why do you think such important data may be missing?).

- Change missing college data into a string 'N/A' (why do you think such (unimportant?) information may be missing?).

- Change WT and HT similarly into floats and relabel these columns as `WT_lbs` and `HT_inches`.

- Compute some statistics on certain columns and comment on results.

<hr style="border:5px solid cyan"> </hr>

<hr style="border:5px solid cyan"> </hr>

## <span style='background:rgba(0,255,255, 0.5); color:black'>Activity: Nothing is as consistent as Jokic! Not even his statistics...</span><a id='activity-Jokic'>
---

- Read in [Nikola Jokic's](https://www.espn.com/nba/player/stats/_/id/3112335/nikola-jokic) statistics from ESPN. Check out the web page and see how many different tables are read into the list. Show and discuss what some of these different tables contain. 

- Repeat the above to read in his postseason data (you will have to search on the web page for how to access his postseason data). 

- What if you tried to get this data from basketball-reference.com or nba.com? Try it. Comment on any difficulties you encounter and try to explain them. In fact, it may not even be possible to read in this data from a web page even though it looks like the data are available on the page. See if you can figure out why.

<hr style="border:5px solid cyan"> </hr>

## Part (b): Reading/writing data to files <a id='I/O'>
---

<span style='background:rgba(255,255,0, 0.25); color:black'> Run the code cell below and click the "play" button to see the second recorded lecture associated with this notebook.</span>

In [None]:
# 1. Running this cell will embed the short recorded lecture associated with this part of the notebook
# 2. Press on the "play" button to start the video.

from IPython.display import YouTubeVideo

YouTubeVideo('XkBgE395dgM', width=800, height=450)

There are several tools available for reading/writing data to files including, but not limited to, these popular options:

- [some basic I/O in Python](https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files)

- [`pickle`](https://docs.python.org/3/library/pickle.html#module-pickle)

- [Pandas I/O options](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html)

- [`scipy` I/O (in particular `savemat` and `loadmat`)](https://docs.scipy.org/doc/scipy/reference/io.html)

Below, we show some typical use cases of a few of these with some code comments. 

<span style='background:rgba(0,255,255, 0.5); color:black'> ***You should experiment and think here. Make your own activities. You need to make at least one of your own activities that you also solve in this part of the notebook. Make sure to demarcate your activity with cyan borders so that I can easily locate and grade it.*** </span>

An implicit/suggested/recommended activity for you is to expand/add onto the comments you see below (in code or in Markdown cells), read through the documentation and try out different options that you also comment on (in code or Markdown cells), and perhaps even create some Markdown cells that try to explain the pros/cons of the various methods, scenarios where you may prefer one to another, or anything else you think is useful to comment on. 

This is really about *you*. Maybe you already have experience working with various file formats containing data. Maybe you do not. Either way, think about the file types you may need, will encounter, or have encountered. This may require you to ask some professors, students, professionals you know, etc. so that you can get a good feel for what you may work on someday. Or, try searching online for what file types people typically deal with in practice (you can use some of the documentation and things discussed/linked to in this notebook as a starting point for your searches). 

### Starting with the basics
---

We will show the basic way of opening, writing, and reading a file's contents assuming the file contains the basic data type of strings.

Check out the [basic I/O in Python](https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files) for some more details.

We will quickly see though that there are some annoying limitations to this approach.

In [None]:
# The built-in open function does not require any import call
f = open('broncos_players', 'w') #'w' means we create/open a file in "write" mode

# This will DELETE any data that may exist in the file if that file was already 
# previously created, so be careful!!!!

After running the above code, go to the "main" Jupyter tab where you launched this notebook. 

Do you see the file broncos_players there? If not, try refreshing that tab.

In [None]:
# We can write strings to f, but not other variable types
# even if those variable types are objects that contain strings

# This will produce an error
f.write(df_broncos['Player'])

In [None]:
# If we want to create a file that stores all the names and colleges of players
# on individual lines, then we can do this
for (name,college) in zip(df_broncos['Player'],df_broncos['College']):
    var_str = 'Name: ' + name + '; College: ' + college + '\n'
    f.write(var_str)

Now go back to the main Jupyter tab where you launched this notebook and open this file to view its contents.

Does it appear empty?

In [None]:
f.close() #This prevents more data from being written to this file by accident.
# You should ALWAYS close your file object when you are done with it.

In [None]:
f.closed #Check if the file is closed

Now go back to the main Jupyter tab where you launched this notebook and open this file to view its contents (or refresh it if you left it open). 

Is the data there now? It should be.

Let's try reading it into this environment.

In [None]:
f.read() #This will produce an error! We closed the file!

In [None]:
# Let's try this correctly now.

# Open the file in a read-only mode so that we do not accidentally edit something
# as we examine its contents.

f = open('broncos_players', 'r')

In [None]:
f.read() #Yuck!

In [None]:
# Want to try again?
f.read() #Uh, huh?

When you read from a file, the file object keeps track of where it stopped reading (like an internal bookmark in the data). 

In [None]:
f.tell() # This is the "bookmark" position

To change the file object’s (i.e., "bookmark") position, use `f.seek(offset, whence)` where `offset` is how far off from a reference point (defined by the `whence`) you want to set this position.

A `whence` value of 0 sets the reference point as the beginning of the file, 1 uses the current file position, and 2 uses the end of the file as the reference point. 

`whence` can be omitted and defaults to 0, using the beginning of the file as the reference point. You always need to provide an `offset` value (it has no default value).
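
To see all three `whence` values in action, here is a small sketch using an in-memory binary buffer (`io.BytesIO`) holding made-up data, so no file on disk is needed. We use a binary buffer on purpose: for files opened in text mode, Python only allows seeks relative to the beginning of the file (with the exception of seeking to the very end with `seek(0, 2)`).

```python
import io

buf = io.BytesIO(b'0123456789')  # ten bytes of made-up data

buf.seek(3)        # whence omitted (0): 3 bytes from the beginning
a = buf.read(1)    # reads b'3' and moves the bookmark to position 4

buf.seek(2, 1)     # whence=1: 2 bytes forward from the current position
b = buf.read(1)    # reads b'6'

buf.seek(-1, 2)    # whence=2: 1 byte back from the end
c = buf.read(1)    # reads b'9'

pos = buf.tell()   # the bookmark is now at the end, position 10
print(a, b, c, pos)
```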

We are not doing anything that fancy here. We simply want to go back to the starting point of the file.

In [None]:
f.seek(0) #omitting whence to use default 0
f.tell()

In [None]:
# Now look at how this produces results in a nice list of strings
f.readlines()

In [None]:
f.close()

This data looks incomplete. Hmmm... Oh yeah! We should have included the player positions in it as well. This would be a nice data file that has player names, their colleges, and what position they play. 

How can we add the player position data to the end of each line?

From [stackoverflow](https://stackoverflow.com/questions/125703/how-to-modify-a-text-file)

> Unfortunately there is no way to insert into the middle of a file without re-writing it. As previous posters have indicated, you can append to a file or overwrite part of it using seek but if you want to add stuff at the beginning or the middle, you'll have to rewrite it.
<br>
<br>
   This is an operating system thing, not a Python thing. It is the same in all languages. 

The problem with `open` is that you can open a file in write mode (which will "wipe" the data clean in the file to be re-written), read mode (which will allow you to view but not edit the contents), read/write mode (which allows you to read contents as you write, but you can only "overwrite" existing data and not insert it in place), or you can append data at the end of the file. No good options with just `open` exist for editing lines of data *in place* that do not simply overwrite the existing data.
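
To make the modes described above concrete, here is a minimal sketch that writes to a throwaway file in a temporary directory (the file name `modes_demo.txt` and its contents are made up). It also uses the `with` statement, which closes the file for us automatically when the indented block ends, even if an error occurs inside the block.

```python
import os
import tempfile

# A throwaway file in a temporary directory so we do not clutter our folder
demo_path = os.path.join(tempfile.mkdtemp(), 'modes_demo.txt')

with open(demo_path, 'w') as f:   # 'w' creates the file (or wipes it clean)
    f.write('first line\n')

with open(demo_path, 'a') as f:   # 'a' adds to the END of the file
    f.write('second line\n')

with open(demo_path, 'r') as f:   # 'r' lets us read but not edit
    after_append = f.readlines()

with open(demo_path, 'w') as f:   # 'w' again: the old contents are gone!
    f.write('only line\n')

with open(demo_path, 'r') as f:
    after_rewrite = f.readlines()

print(after_append)   # ['first line\n', 'second line\n']
print(after_rewrite)  # ['only line\n']
print(f.closed)       # True, the with statement closed the file for us
```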

There are some workarounds (see the [stackoverflow](https://stackoverflow.com/questions/125703/how-to-modify-a-text-file) for some options), but we will not dwell on this further. 

We will simply leave this as a suggested activity for inserting the position data for each player in this data file. 

Before we move onto the use of Pandas I/O options, we suggest you read a bit more about saving data types more complicated than strings and potentially with more complex structure using [JSON](https://docs.python.org/3/tutorial/inputoutput.html#saving-structured-data-with-json) and [pickle](https://docs.python.org/3/library/pickle.html#module-pickle). 
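
To give a quick taste of both, here is a small sketch using a made-up player record (the dictionary `player` and its values are hypothetical). For simplicity, we convert to and from strings/bytes in memory with `dumps`/`loads`. The related functions `dump`/`load` do the same thing with file objects.

```python
import json
import pickle

# A made-up record mixing strings and integers
player = {'Player': 'Player A', 'Pos': 'QB', 'HT_in_inches': 76, 'Exp': 0}

# JSON: a human-readable text format that many other programs understand
json_text = json.dumps(player)
from_json = json.loads(json_text)
print(json_text)
print(from_json == player)      # True

# JSON only knows a few basic types, e.g., a tuple comes back as a list
print(json.loads(json.dumps((76, 71))))   # [76, 71]

# pickle: a Python-specific binary format that preserves Python types
pickled = pickle.dumps({'record': player, 'heights': (76, 71, 72)})
from_pickle = pickle.loads(pickled)
print(type(from_pickle['heights']))       # <class 'tuple'>
```

One important caution from the `pickle` documentation: only unpickle data that you trust, since loading a maliciously constructed pickle can execute arbitrary code.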

### [Pandas I/O](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html)
---

<span style='background:rgba(255,0,255, 0.25); color:black'> ***Key points:*** </span>

- If data is already in a pandas DataFrame object, then that object has several methods to write the data into files with various formats including csv (comma separated values) and Microsoft Excel formats. 

- Data can be read in directly to a DataFrame object using various methods available within pandas.

- If working with data in a panel format makes the most sense to you, then the pandas options are probably the best and easiest for you to use.

In [None]:
# Write Broncos roster data, in its entirety, to two different file types

df_broncos.to_csv('broncos_roster_data')

df_broncos.to_excel('broncos_roster_data_excel.xlsx') #Use .xlsx (needs the openpyxl package); recent pandas versions can no longer write legacy .xls files

In [None]:
# A csv file is all in text format, so we can read it in as a file with open
#
# But, why would we do this?

f = open('broncos_roster_data', 'r')

In [None]:
f.readlines()

In [None]:
f.close() # Eh, let's not do this anymore

In [None]:
# Read in data files with pandas functions because they format
# this data into a DataFrame

df_from_csv = pd.read_csv('broncos_roster_data')

In [None]:
df_from_csv #Look at the left two columns, can you explain what happened?

In [None]:
df_from_xls = pd.read_excel('broncos_roster_data_excel.xlsx')

In [None]:
df_from_xls #Does this look just like the original df? Check the leftmost columns again

### Some [`scipy` I/O (in particular `savemat` and `loadmat`)](https://docs.scipy.org/doc/scipy/reference/io.html) options
---

In a DataFrame, there is an "entry" for each row and column (sure, data may be missing, but this is just like an entry of `None` in a sense, so there is always something there even if it is a type of null data). 

In computational science, we often generate lots of different types of data in the form of arrays that can have different formats/data types and different shapes/sizes. We also often generate lots of "meta-data" describing details of how the problem was solved. Taken all together, this means a DataFrame is not necessarily suitable to describe all the data we generate for a typical problem solved in computational science. 

The `scipy.io` library offers some nice solutions to these types of problems where data may be saved in a `.mat` format (which may seem familiar to you if you ever used MATLAB). In such a format, we use a dictionary (`dict` data type) to associate each array of data with a particular name. We mentioned dictionaries briefly in Module 02. A dictionary is basically a lookup table: given a key (here, a name stored as a string), it returns the value associated with that key.

Below, we first show what a `dict` is and then how to save/load files in a `.mat` format.

In [None]:
# Create some very different types of data (this is arbitrary and silly)

a = np.array([0, 1, 2])

b = [[2,6,'ta da!'], np.array([5,6,7])]

c = 2.0

In [None]:
# Create a dict that maps to our different types of strange data
# Pay attention to the formats here

# We could use keywords for the data that are the variable names themselves
dict_1 = {'a': a, 'b': b, 'c': c} 

# Or, we can express the deepest appreciation for the work I put into these notebooks
dict_2 = {"I like you": a, "because your examples": b, "are very unusual":c}

In [None]:
# How to access the data in a dict

print(dict_1['a']) 
print()

print(dict_1['b'])
print()

print(dict_1['c'])
print()

print(dict_1['b'][0][2])

In [None]:
print(dict_2['I like you'])
print(dict_2['because your examples'])
print(dict_2['are very unusual'])

In [None]:
import scipy.io as sio # Now let's import scipy.io to get ready to read/write data to file

In [None]:
# Write some data to file using savemat

sio.savemat('my_nonsense_data_1.mat', dict_1)

sio.savemat('my_nonsense_data_2.mat', dict_2)

In [None]:
# Now read the data back in using loadmat

my_BS_1 = sio.loadmat('my_nonsense_data_1.mat') # Obviously the BS stands for Bachelor of Science

my_BS_2 = sio.loadmat('my_nonsense_data_2.mat')

In [None]:
# This looks "scary" but is not that bad.
# Basically, when we load a .mat file we get the original dict plus some extra
# meta-data about the format the data was saved in, including the time the file
# was created (see the '__header__' key in the dictionary)

print(my_BS_1) 

In [None]:
# To get back at the arrays, just use the right keyword
my_BS_1['a']

In [None]:
# Ah, but what if we have a lot of keywords and we do not want to print out the
# whole dang dict to figure it out? Luckily, dicts have some nice methods

print(my_BS_2.keys())

In [None]:
# We can turn anything in the dict into a new variable very easily

my_new_a = my_BS_2['I like you']

print(my_new_a)

In [None]:
# A dict is an iterable object that you can loop through like this

for key in my_BS_2:
    print(key, my_BS_2[key], '\n')
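
One detail worth checking when you read `.mat` files back in: arrays stored in this format are at least two-dimensional, so the shapes you load may not match the shapes you saved. The sketch below is a small illustration of this, assuming a made-up file name and a temporary directory. It also shows the `squeeze_me` option of `loadmat`, which removes the length-one dimensions, and one possible way to skip the meta-data entries whose names start with a double underscore.

```python
import os
import tempfile

import numpy as np
import scipy.io as sio

# Small example data: a 1-D array and a scalar
data = {'a': np.array([0, 1, 2]), 'c': 2.0}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'shape_demo.mat')  # hypothetical file name
    sio.savemat(path, data)

    # Default behavior: everything comes back as a 2-D array
    raw = sio.loadmat(path)

    # squeeze_me=True drops the dimensions of length one
    squeezed = sio.loadmat(path, squeeze_me=True)

print(raw['a'].shape, raw['c'].shape)
print(np.shape(squeezed['a']), np.shape(squeezed['c']))

# Keep only the names we saved, skipping meta-data such as '__header__'
user_keys = [key for key in raw if not key.startswith('__')]
print(user_keys)
```

Whether to squeeze is a judgment call: the default keeps the shapes predictable, while squeezing gives back something closer to what was originally saved.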

We end things here, but I hope you did some of the suggested/implicit activities in this section. This is a good part of the course to focus on asking your own questions and deciding for yourself where you need to focus more of your efforts. There are many directions we can go now as we get into more advanced topics. Too many, in fact. While I will select some very important things for you to do, I cannot possibly create an exhaustive list of what is important for you personally. Explore. Experiment. Learn. And, have fun! 

<hr style="border:5px solid cyan"> </hr>

## <span style='background:rgba(0,255,255, 0.5); color:black'>Activity: Summary</span> <a id='activity-summary'/>

Summarize some of the key takeaways/points from this notebook in a list below and prepare a few code examples related to these takeaways/points in the code cells below. You need to have at least one example for each of your summary points and you need at least three summary points.

In this notebook, we have seen the following:

- [Your summary point 1 goes here]




- [Your summary point 2 goes here]




- [Your summary point 3 goes here]

<hr style="border:5px solid cyan"> </hr>

### <a href='#Contents'>Click here to return to Notebook Contents</a>