# Exploratory Data Analysis for Tuesday, October 24


The goals for this notebook are:

1. Downloading/Reading/Appending/Writing Data


This notebook contains code cells that will allow you to practice and experiment as we proceed through today's lecture. We will pause frequently to allow you to complete exercises. Do not hesitate to ask questions on the fly.





The simplest way to work with Big Data is to have Big Data. Let's assume that you have some files on your own computers you would like to analyze. How do we load this data into pandas? 

First we need to find the location where the data is stored on our drive and relate that information to our Anaconda interpreter. A useful package in this regard is the `os` package. It houses functions for navigating the file system on your computer.

There is a file saved on the X:\ drive titled "TimeSeriesData". Move this file to your current working directory. If you do not know where your current working directory is, this is where the `os` module can be helpful. Use the following to find your current working directory:

In [8]:
import os 
os.getcwd()

'c:\\Users\\carso\\OneDrive - UW-Madison\\23_24_Fall\\econ695\\lecture'
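Beyond `os.getcwd()`, a few other `os` functions are handy for locating files. The following is a minimal sketch that works inside a temporary directory; the empty placeholder file is a hypothetical stand-in for the real data file.

```python
import os
import tempfile

# Work inside a throwaway directory so the sketch does not touch real files.
with tempfile.TemporaryDirectory() as tmp_dir:
    # os.path.join builds a path using the correct separator for the operating system.
    csv_path = os.path.join(tmp_dir, "TimeSeriesData.csv")

    # Before the file is created, os.path.exists reports False.
    existed_before = os.path.exists(csv_path)

    # Create an empty placeholder file (hypothetical stand-in for the real data).
    with open(csv_path, "w") as f:
        f.write("")

    existed_after = os.path.exists(csv_path)

    # os.listdir returns the names of the entries in a directory.
    entries = os.listdir(tmp_dir)

print(existed_before, existed_after, entries)
```

Checking `os.path.exists` before calling `pd.read_csv` is one way to tell a wrong path apart from a problem with the file's contents.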

The file contains several time series from the Federal Reserve's FRED repository. These include 

* Production of Total Industry in Japan (JPNPROINDQISMEI)
* Chain-type Price Index (PCECTPI)
* Real Gross Domestic Product (GDPC1)

Open the file using the code below. 

In [11]:
import pandas as pd
gdp = pd.read_csv("../data/TimeSeriesData.csv")       # read_csv is a part of Pandas, so we need the pd. 
print(type(gdp))   
print(gdp)

ParserError: Error tokenizing data. C error: Expected 2 fields in line 6, saw 4
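A `ParserError` like the one above is raised when a row contains more fields than the parser expected from the first line of the file. One common cause (an assumption here, since the layout of `TimeSeriesData.csv` is not shown) is descriptive text above the real header row. A minimal sketch, using made-up preamble lines and an in-memory file, where `skiprows` tells `read_csv` how many leading lines to ignore:

```python
import io
import pandas as pd

# Hypothetical file contents: three preamble lines, then the real header and data.
raw = (
    "Hypothetical preamble line 1\n"
    "Hypothetical preamble line 2\n"
    "Hypothetical preamble line 3\n"
    "DATE,GDPC1,PCECTPI\n"
    "1955-01-01,2683.766,15.755\n"
    "1955-04-01,2727.452,15.771\n"
)

# Reading the text as-is fails, because the first line suggests a single column.
try:
    pd.read_csv(io.StringIO(raw))
    failed_without_skiprows = False
except pd.errors.ParserError:
    failed_without_skiprows = True

# Skipping the three preamble lines lets pandas find the real header row.
df = pd.read_csv(io.StringIO(raw), skiprows=3)
print(df)
```

If this is the cause, opening the file in a text editor will show how many lines to skip.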


Since this is already time series data, we have no need for a separate index. To replace the index with the values in the date column, try the code below.

It would also have been possible to make this change while the file was being read into the dataframe. To see how, append `?` to the function name, i.e. `pd.read_csv?`, to display its documentation.

In [12]:
gdp_new_index = gdp.set_index('DATE')   # We could use 'inplace = True' if we didn't need a copy.

print(gdp_new_index.head())

NameError: name 'gdp' is not defined

The code below would have achieved the same end in one fewer step:

In [13]:
gdp_alternative = pd.read_csv("TimeSeriesData.csv", index_col = 0)  
print(gdp_alternative)

FileNotFoundError: [Errno 2] No such file or directory: 'TimeSeriesData.csv'
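To see what `index_col` does without depending on a file path, here is a minimal sketch that reads a few rows from an in-memory string. The rows are copied from the quarterly data shown in this notebook; the use of `parse_dates=True` to convert the index to dates is an optional extra and not part of the cells above.

```python
import io
import pandas as pd

# A few rows in CSV form, held in memory instead of on disk.
csv_text = (
    "DATE,GDPC1,PCECTPI\n"
    "1955-01-01,2683.766,15.755\n"
    "1955-04-01,2727.452,15.771\n"
    "1955-07-01,2764.128,15.834\n"
)

# index_col names the column to use as the index;
# parse_dates=True asks pandas to parse that index as dates.
gdp_sample = pd.read_csv(io.StringIO(csv_text), index_col="DATE", parse_dates=True)

print(gdp_sample.index)
print(gdp_sample.loc[pd.Timestamp("1955-04-01"), "GDPC1"])
```

With a date index, rows can be selected by date rather than by position.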


Reading Excel files is a similar process. 


In [16]:
gdp_alternative_xlsx = pd.read_excel("../data/TimeSeriesDataFull.xlsx") 



In [17]:
gdp_alternative_xlsx


Unnamed: 0,Data List: TB
0,Data Updated: 2018-03-14
1,FRED (Federal Reserve Economic Data)
2,Link: https://fred.stlouisfed.org
3,Help: https://fred.stlouisfed.org/help-faq
4,Economic Research Division
...,...
393,a float adjusted market capitalization that in...
394,"not considered available to ""ordinary"" investo..."
395,Wilshire Associates Incorporated. Reprinted wi...
396,information about the various indexes from Wil...


Not bad: we got something, but it isn't what we want. We are using the wrong sheet! By default, `read_excel` reads the first sheet in the workbook. To choose which sheet to open, reference its name as below:

In [19]:
gdp_xlsx = pd.read_excel("../data/TimeSeriesDataFull.xlsx", sheet_name='Quarterly') 
gdp_xlsx




Unnamed: 0,DATE,GDPC1,JPNPROINDQISMEI,PCECTPI
0,1955-01-01,2683.766,999.000000,15.755
1,1955-04-01,2727.452,999.000000,15.771
2,1955-07-01,2764.128,999.000000,15.834
3,1955-10-01,2780.762,999.000000,15.878
4,1956-01-01,2770.032,999.000000,15.943
...,...,...,...,...
247,2016-10-01,16851.420,99.125073,111.583
248,2017-01-01,16903.240,99.291726,112.198
249,2017-04-01,17031.085,101.324890,112.273
250,2017-07-01,17163.894,101.724856,112.699


In many cases you will be working with only a subset of the variables provided in a dataset, so it can be useful to skip the extraneous variables when importing your data. The `usecols` argument selects which columns to read. Practice with the code below by trying different combinations of columns.

In [20]:
gdp_xlsx = pd.read_excel("../data/TimeSeriesDataFull.xlsx", sheet_name='Quarterly', usecols = [0,1,3]) 
gdp_xlsx




Unnamed: 0,DATE,GDPC1,PCECTPI
0,1955-01-01,2683.766,15.755
1,1955-04-01,2727.452,15.771
2,1955-07-01,2764.128,15.834
3,1955-10-01,2780.762,15.878
4,1956-01-01,2770.032,15.943
...,...,...,...
247,2016-10-01,16851.420,111.583
248,2017-01-01,16903.240,112.198
249,2017-04-01,17031.085,112.273
250,2017-07-01,17163.894,112.699
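Columns can also be selected by name rather than by position. The sketch below uses `pd.read_csv` on an in-memory string (a stand-in chosen so the example runs without the Excel file); the rows are copied from the output above.

```python
import io
import pandas as pd

# Two rows of the quarterly data in CSV form, held in memory.
csv_text = (
    "DATE,GDPC1,JPNPROINDQISMEI,PCECTPI\n"
    "2017-04-01,17031.085,101.324890,112.273\n"
    "2017-07-01,17163.894,101.724856,112.699\n"
)

# usecols accepts column names as well as integer positions.
subset = pd.read_csv(io.StringIO(csv_text), usecols=["DATE", "GDPC1", "PCECTPI"])
print(subset)
```

Selecting by name tends to be more robust than selecting by position if the column order in the source file changes.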


# Retrieving Online Files

[Requests](https://docs.python-requests.org/en/latest/) is a Python library that you can use to, among other things, download files from the web using URLs.

In [21]:
# import the requests library
import requests

# URL of the image to be downloaded
image_url = "https://matthewfriedman945902870.files.wordpress.com/2018/08/ajhike.jpg?w=2000&h="

# send an HTTPS GET request to the server and store the response object in a
a = requests.get(image_url)

# open a new local file in binary write mode; the file object is called b
# (the extension is .jpg to match the image format in the URL)
with open("AJ_Hiking.jpg", 'wb') as b:
    # write the raw bytes of the response body (a.content) to the file
    b.write(a.content)

The code block above downloads an image from the web. To confirm it worked, check your local directory. If you need help determining where to look, again use the `os` module to get the current working directory.



In [22]:
import os 
os.getcwd()

'c:\\Users\\carso\\OneDrive - UW-Madison\\23_24_Fall\\econ695\\lecture'

`a.content` holds the entire response body in memory as a single bytes object, which is not ideal for large files. In that case it usually makes more sense to save the download directly to disk. One way, of many, to accomplish this uses `urllib`, a package in the Python standard library that collects several modules for working with URLs. Like Requests, it can fetch web content. It offers a very simple interface in the form of the `urlopen` function and can fetch content over several protocols. The `urlretrieve` function used below saves the contents of a URL directly to a local file.



In [23]:
import urllib.request 

img_url2 = 'https://matthewfriedman945902870.files.wordpress.com/2018/07/the-badlands.jpg?w=2000&h='

urllib.request.urlretrieve(img_url2, "Badlands.jpg")

('Badlands.jpg', <http.client.HTTPMessage at 0x17f81091e80>)
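For large files, another option is to copy the response to disk in chunks so the whole file never sits in memory at once. The sketch below is one possible approach, not the only one: `stream_to_file` is a hypothetical helper name, and the demonstration uses a local `file://` URL built from a temporary file so that it runs without a network connection.

```python
import shutil
import tempfile
import urllib.request
from pathlib import Path


def stream_to_file(url, destination, chunk_size=64 * 1024):
    """Copy the body of `url` to `destination`, `chunk_size` bytes at a time."""
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out_file:
        # copyfileobj reads and writes in chunks instead of loading everything at once.
        shutil.copyfileobj(response, out_file, chunk_size)
    return destination


with tempfile.TemporaryDirectory() as tmp_dir:
    # Create a local source file and refer to it with a file:// URL.
    source = Path(tmp_dir) / "source.bin"
    source.write_bytes(b"x" * 200_000)

    target = Path(tmp_dir) / "copy.bin"
    stream_to_file(source.as_uri(), target)
    copied_size = target.stat().st_size

print(copied_size)
```

The same helper should accept an `https://` URL in place of the `file://` URL used here.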

You want to work with Big Data, not pictures, so let's try to download some files in a more conventional format for data analysis: comma-separated values. Comma-separated values files are commonly referenced by their file extension (csv). Here is an example of some code that can be used to download a web-hosted csv file.

In [24]:

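req = requests.get('https://www.princeton.edu/~mwatson/Stock-Watson_4E/mydata.csv')

# The raw bytes of the response body
url_content = req.content

# Write the bytes to a local file; the with block closes the file automatically
with open('downloaded.csv', 'wb') as csv_file:
    csv_file.write(url_content)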

Let's say the file you are interested in accessing is located inside a zip folder. The ZIP file format is a common archive and compression standard. To access the file you will first need to unzip it. The [`zipfile`](https://docs.python.org/3/library/zipfile.html) module provides tools to create, read, write, append, and list a ZIP file.

In [25]:
import requests, zipfile, io
r = requests.get('https://www.princeton.edu/~mwatson/Stock-Watson_4E/Earnings_and_Height.zip')
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall()

The `io` module provides Python’s main facilities for dealing with various types of I/O. It is useful in performing file-related reading and writing operations. While you can use the normal `read()` and `write()` methods to read from and write to a file, this module gives us a lot more flexibility regarding these operations. `zipfile.ZipFile` is the class for reading and writing ZIP files.

`io.BytesIO()` plays a role similar to `open()`, except that the data lives in an in-memory buffer rather than in a file on disk. The resulting object can be used anywhere a binary file object is expected, so when a function expects a file object to read from or write to (as `zipfile.ZipFile` does), you can pass an in-memory buffer instead of a file. Here it lets us open the downloaded archive without first saving it to disk.
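To make the role of `io.BytesIO()` concrete, here is a minimal sketch that runs without a download. It builds a small ZIP archive in memory (the file name and contents are made up for illustration), then reopens the raw bytes the same way the cell above reopens `r.content`.

```python
import io
import zipfile

# Build a small ZIP archive entirely in memory (hypothetical contents).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("example.csv", "DATE,GDPC1\n1955-01-01,2683.766\n")

# These raw bytes play the role of r.content from a download.
zip_bytes = buffer.getvalue()

# Wrap the bytes in BytesIO so ZipFile can treat them like a file.
with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
    # namelist() shows what is inside before anything is extracted.
    member_names = archive.namelist()
    # read() returns one member's bytes without writing to disk.
    first_line = archive.read("example.csv").decode("utf-8").splitlines()[0]

print(member_names, first_line)
```

Calling `namelist()` first is a cheap way to inspect an unfamiliar archive before extracting it.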


Suppose we wanted to save these files somewhere other than the current working directory. To accomplish this we would enter the path to the desired destination directory as such:

In [28]:
import requests, zipfile, io
r = requests.get('https://www.princeton.edu/~mwatson/Stock-Watson_4E/Earnings_and_Height.zip')
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall('c:\\Users\\carso\\OneDrive - UW-Madison\\23_24_Fall\\econ695\\lecture\\princeton')
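Hard-coded paths such as the one above only work on one machine. A minimal sketch of a more portable pattern, assuming you are free to build the destination with `os.path.join`; the temporary base directory and the archive contents are placeholders for illustration.

```python
import io
import os
import tempfile
import zipfile

# Build a tiny archive in memory to stand in for a downloaded ZIP file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("example.txt", "placeholder")

with tempfile.TemporaryDirectory() as base_dir:
    # Join path components instead of typing separators by hand.
    destination = os.path.join(base_dir, "princeton")
    # Create the folder if it does not exist yet.
    os.makedirs(destination, exist_ok=True)

    with zipfile.ZipFile(io.BytesIO(buffer.getvalue())) as archive:
        archive.extractall(destination)

    extracted = os.listdir(destination)

print(extracted)
```

In the notebook, `base_dir` could be replaced by `os.getcwd()` so that the files land in a subfolder of the current working directory.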

Suppose you wanted to investigate something related to historical property sale prices recorded by the Wisconsin Department of Revenue. A good place to start would be an internet search. Open a browser and see what data you can find. Is the data easily accessible? Attempt to assemble records of sales going back five years. Take 10 minutes and remember that this is supposed to be practice with batch downloading. Ask for help as necessary. 

In [35]:
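import requests, zipfile, io

years = ["2018","2019","2020","2021","2022","2023"]
months = ["01","02","03","04","05","06","07","08","09","10","11","12"]
for year in years:
    for month in months:
        # Stop at October 2023. break only exits the inner loop,
        # which is enough here because 2023 is the last year in the list.
        if year == "2023" and month == "10":
            break
        # Download the ZIP archive for this year and month
        r = requests.get(f"https://www.revenue.wi.gov/SLFReportsHistSales/{year}{month}CSV.zip")
        # Open the archive from memory and extract its contents to the Revenue folder
        z = zipfile.ZipFile(io.BytesIO(r.content))
        z.extractall('c:\\Users\\carso\\OneDrive - UW-Madison\\23_24_Fall\\econ695\\lecture\\Revenue')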
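When batch downloading, it can help to build and inspect the list of URLs before requesting anything. The sketch below does only that and makes no network requests. The URL pattern is the one used in the loop above, while `monthly_urls` and its parameters are hypothetical names introduced for illustration.

```python
from itertools import product

# URL pattern taken from the download loop above.
BASE_URL = "https://www.revenue.wi.gov/SLFReportsHistSales/{year}{month}CSV.zip"


def monthly_urls(first_year, last_year, last_month):
    """Return one URL per month from January of first_year through last_month of last_year."""
    urls = []
    # product iterates over every (year, month) pair, year by year.
    for year, month in product(range(first_year, last_year + 1), range(1, 13)):
        # Stop once we pass the final month of the final year.
        if year == last_year and month > last_month:
            break
        urls.append(BASE_URL.format(year=year, month=f"{month:02d}"))
    return urls


urls = monthly_urls(2018, 2023, 9)
print(len(urls))
print(urls[0])
print(urls[-1])
```

Printing the first and last URL is a quick check that the date range is what you intended before the downloads start.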