### Convert tarfile (*.tgz, *.tar.gz) to pandas DataFrame

This demo downloads the California Housing data set from the website below and parses it into a pandas DataFrame.
URL: https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The tgz file contains a folder with two files (folder/filename):
    * CaliforniaHousing/cal_housing.data
    * CaliforniaHousing/cal_housing.domain
    
The `.data` file contains the core data we're after and the `.domain` file contains column names and data types in a colon-separated list.

In [1]:
# Standard library imports
import tarfile
from io import BytesIO
import urllib.request as ureq

# Third-party library import
import pandas as pd

# Target url
tar_url = "https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.tgz"

We'll use `.urlopen()` from `urllib.request` to bring our data into a `BytesIO` object.

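**Note**: A `BytesIO` object created from initial bytes already starts at position zero, so the `seek(0)` call below is only a defensive reset. Seeking becomes necessary when the stream has been written to or read from before it is handed to another reader.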

In [3]:
with ureq.urlopen(tar_url) as resp:
    bio = BytesIO(resp.read())

bio.seek(0) # Change stream position to start of the stream.

0
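
To see when rewinding actually matters, here is a minimal sketch using a made-up payload (the byte string is only a placeholder, not the real download). It compares a buffer created from initial bytes with one that was filled by `.write()`.

```python
from io import BytesIO

payload = b"example payload"  # placeholder bytes, not the real archive

# A buffer created from initial bytes starts at position 0.
fresh = BytesIO(payload)
start_pos = fresh.tell()

# A buffer filled with .write() is left at the end of the written data.
written = BytesIO()
written.write(payload)
end_pos = written.tell()

# Rewind before handing the buffer to a reader.
written.seek(0)
rewound_pos = written.tell()

print(start_pos, end_pos, rewound_pos)
```

In the first case the seek is a no-op; in the second, skipping it would make a subsequent read return nothing.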

We'll open our tarfile by passing the `BytesIO` object to the `fileobj` parameter in `tarfile.open()`.  All other parameters remain at their default settings.

Once opened, since we know which files to look for, we can set variables to those results.  We'll then:
    * extract the file into memory
    * read the file data
    * decode into unicode

**Note**: If we didn't know the file names, we could add a `print(f.name)` call at the top of the for-loop body to list them.

In [4]:
# The context manager closes the archive once both files have been read.
with tarfile.open(fileobj=bio) as tar:
    for f in tar:
        if f.name.endswith(".domain"):
            col_data = tar.extractfile(f).read().decode("utf-8")
        elif f.name.endswith(".data"):
            body_data = tar.extractfile(f).read().decode("utf-8")
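
As a sketch of how to discover member names without knowing them up front, the example below builds a tiny gzip-compressed archive in memory and lists what is inside. The file names and contents (`demo/a.data`, `demo/a.domain`) are hypothetical stand-ins so the example runs without a download.

```python
import tarfile
from io import BytesIO

# Hypothetical files standing in for the real archive contents.
files = [("demo/a.data", b"1,2\n"), ("demo/a.domain", b"x: continuous.\n")]

# Write a small .tgz into an in-memory buffer.
buf = BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar_out:
    for name, payload in files:
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tar_out.addfile(info, BytesIO(payload))

buf.seek(0)  # the buffer was written to, so rewind before reading

# The default read mode detects the compression automatically.
with tarfile.open(fileobj=buf) as tar_in:
    member_names = tar_in.getnames()
    contents = {
        m.name: tar_in.extractfile(m).read().decode("utf-8")
        for m in tar_in.getmembers()
        if m.isfile()
    }

print(member_names)
```

Collecting everything into a dictionary keyed by member name is one option when the archive is small; for large archives, reading only the members you need is likely the better choice.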

In [5]:
# Column data
print(col_data)

longitude: continuous.
latitude: continuous.
housingMedianAge: continuous. 
totalRooms: continuous. 
totalBedrooms: continuous. 
population: continuous. 
households: continuous. 
medianIncome: continuous. 
medianHouseValue: continuous. 



We'll split up our column data and then create an enumerated dictionary to update names in our pandas DataFrame once that is processed.

In [8]:
cols = [i.split(":")[0] for i in col_data.strip().split("\n")]
col_dict = dict(enumerate(cols))
print("\n".join([f"{k}: {v}" for k,v in col_dict.items()]))

0: longitude
1: latitude
2: housingMedianAge
3: totalRooms
4: totalBedrooms
5: population
6: households
7: medianIncome
8: medianHouseValue
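
The `.domain` file also carries a data type after each colon, which the step above discards. If that information were useful, it could be kept with a sketch like the following. The sample string mimics the `name: type.` layout shown earlier, and `col_types` is a name introduced here for illustration.

```python
# Sample text in the same "name: type." layout as the .domain file.
sample_domain = (
    "longitude: continuous.\n"
    "latitude: continuous.\n"
    "housingMedianAge: continuous. \n"
)

col_types = {}
for line in sample_domain.strip().splitlines():
    # partition() splits on the first colon only.
    name, _, kind = line.partition(":")
    # Drop surrounding whitespace and the trailing period from the type.
    col_types[name.strip()] = kind.strip().rstrip(".")

print(col_types)
```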


For our core data, we'll just strip off any whitespace at the start or end of the string and split each line since values are already separated by commas.

In [11]:
lines = body_data.strip().split("\n")
print("\n".join(lines[:2]))

-122.230000,37.880000,41.000000,880.000000,129.000000,322.000000,126.000000,8.325200,452600.000000
-122.220000,37.860000,21.000000,7099.000000,1106.000000,2401.000000,1138.000000,8.301400,358500.000000


We can then create a DataFrame from that data.

Because all of the data ends up in the first column, we will split the data into new columns at each comma.  The column count of nine should match our column dictionary (`col_dict`), so we can then rename each column by mapping its integer position to the corresponding name.

In [16]:
df = pd.DataFrame(lines)
df.head()

                                                   0
0  -122.230000,37.880000,41.000000,880.000000,129...
1  -122.220000,37.860000,21.000000,7099.000000,11...
2  -122.240000,37.850000,52.000000,1467.000000,19...
3  -122.250000,37.850000,52.000000,1274.000000,23...
4  -122.250000,37.850000,52.000000,1627.000000,28...


In [17]:
df = (df.iloc[:, 0] # Select the first column by position
      .str.split(",", expand=True) # Split by comma into separate columns. Don't forget the '.str' part first!
      .rename(columns=col_dict) # Map our new column names
     )
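
A different route, not the one this notebook takes, is to let `pd.read_csv` parse the decoded text directly by wrapping it in a `StringIO`. The sketch below uses a two-row, three-column sample in place of `body_data` and `cols`; the `sample_*` names are placeholders.

```python
from io import StringIO

import pandas as pd

# Small stand-ins for `cols` and `body_data`.
sample_cols = ["longitude", "latitude", "housingMedianAge"]
sample_body = (
    "-122.230000,37.880000,41.000000\n"
    "-122.220000,37.860000,21.000000\n"
)

# header=None: the text has no header row; names= supplies the column labels.
sample_df = pd.read_csv(StringIO(sample_body), header=None, names=sample_cols)

print(sample_df.dtypes)
```

With this approach the values are parsed as numbers during loading, so a separate type conversion should not be needed.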

Then, we can take a look at the first few rows of our data to visually inspect results and validate the information.

For example, if totalBedrooms ended up being the first column name, we could tell it was wrong: a bedroom count cannot be negative, and it should not have a non-zero fractional part.

In [19]:
df.head()

   longitude  latitude  housingMedianAge  totalRooms  totalBedrooms  population  households  medianIncome  medianHouseValue
0    -122.23     37.88              41.0       880.0          129.0       322.0       126.0        8.3252          452600.0
1    -122.22     37.86              21.0      7099.0         1106.0      2401.0      1138.0        8.3014          358500.0
2    -122.24     37.85              52.0      1467.0          190.0       496.0       177.0        7.2574          352100.0
3    -122.25     37.85              52.0      1274.0          235.0       558.0       219.0        5.6431          341300.0
4    -122.25     37.85              52.0      1627.0          280.0       565.0       259.0        3.8462          342200.0
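
The visual check described above can also be written as a few programmatic sanity checks. This is a rough sketch on a small sample that assumes the values have already been converted to floats; the latitude and longitude bounds are an approximate bounding box for California chosen for illustration, not values taken from the data set's documentation.

```python
import pandas as pd

# Two sample rows with float values (subset of columns).
sample = pd.DataFrame(
    {
        "longitude": [-122.23, -122.22],
        "latitude": [37.88, 37.86],
        "totalBedrooms": [129.0, 1106.0],
    }
)

# Approximate bounding box for California (an assumption for illustration).
lon_ok = sample["longitude"].between(-125, -114).all()
lat_ok = sample["latitude"].between(32, 42).all()

# Counts should be non-negative whole numbers.
beds = sample["totalBedrooms"]
beds_ok = ((beds >= 0) & (beds % 1 == 0)).all()

print(lon_ok, lat_ok, beds_ok)
```

If the column names had been mapped in the wrong order, at least one of these checks would likely fail.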


### Misc. Metrics
Although it isn't essential to this notebook, it's usually a good idea to do some exploratory analysis of the data unless you're 100% certain of what you're dealing with.

Let's first check to see if we have any missing values.

In [20]:
df.isna().sum()

longitude           0
latitude            0
housingMedianAge    0
totalRooms          0
totalBedrooms       0
population          0
households          0
medianIncome        0
medianHouseValue    0
dtype: int64

Then, let's check our data types.  Because the values came from splitting strings, every column is stored as `object`, so we'll convert them to float before computing any further metrics.

In [22]:
df.dtypes

longitude           object
latitude            object
housingMedianAge    object
totalRooms          object
totalBedrooms       object
population          object
households          object
medianIncome        object
medianHouseValue    object
dtype: object

In [23]:
df = df.astype(float)  # convert every column to float in one call

df.dtypes

longitude           float64
latitude            float64
housingMedianAge    float64
totalRooms          float64
totalBedrooms       float64
population          float64
households          float64
medianIncome        float64
medianHouseValue    float64
dtype: object
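
`.astype(float)` raises an error if any value cannot be parsed. When the input is less clean than this data set, one alternative is `pd.to_numeric` with `errors="coerce"`, which turns unparseable values into `NaN` so they can be counted. The sketch below uses a made-up frame (`raw`) with one deliberately bad value.

```python
import pandas as pd

# Made-up string data with one value that is not a number.
raw = pd.DataFrame({"a": ["1.5", "2.0", "oops"], "b": ["3", "4", "5"]})

# Apply pd.to_numeric column by column; bad values become NaN.
converted = raw.apply(pd.to_numeric, errors="coerce")

# Count values per column that failed to convert.
bad_counts = converted.isna().sum()

print(bad_counts)
```

Checking for missing values after the conversion, rather than before, also catches entries such as empty strings that `.isna()` would not flag while the data is still text.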

Now, we can check out other descriptive measures for each numeric column pretty easily in pandas using `.describe()`.

For this demo, we can skip over latitude and longitude columns.

In [29]:
no_latlong_cols = [c for c in df.columns if c not in ["longitude", "latitude"]]
df_desc = df.loc[:, no_latlong_cols].describe().T
n = df_desc.loc["totalRooms", "count"]  # keep the row count before dropping the column
df_desc = df_desc.drop(columns="count")
df_desc

                           mean            std      min       25%       50%       75%       max
housingMedianAge      28.639486      12.585558      1.0      18.0      29.0      37.0      52.0
totalRooms          2635.763081    2181.615252      2.0   1447.75    2127.0    3148.0   39320.0
totalBedrooms        537.898014     421.247906      1.0     295.0     435.0     647.0    6445.0
population          1425.476744    1132.462122      3.0     787.0    1166.0    1725.0   35682.0
households            499.53968     382.329753      1.0     280.0     409.0     605.0    6082.0
medianIncome           3.870671       1.899822   0.4999    2.5634    3.5348   4.74325   15.0001
medianHouseValue  206855.816909  115395.615874  14999.0  119600.0  179700.0  264725.0  500001.0
