# Denison CS-181/DA-210 Homework

---

## Tabular Operations: `pandas` Access

> Transition from mimicking operations with a close template from the book to
> working with different data sets and written specifications of what needs to
> be done to access data in `pandas` data frames.

In [2]:
import os
import io
import sys
import re
import pandas as pd
import numpy

from contextlib import redirect_stdout

def add_modules():
    """
    Starting at the current directory and proceeding up the file system
    tree, search for a directory named `modules`.  If found, and if not
    already there, add to the Python module search path.
    
    Params: None
    
    Return: None
    """
    directory = "."
    levels = 0
    while not os.path.isdir(os.path.join(directory, "modules")) and \
          levels < 5:
        directory = os.path.join(directory, "..")
        levels += 1
    module_path = os.path.abspath(os.path.join(directory, "modules"))
    if os.path.isdir(module_path):
        if not module_path in sys.path:
            sys.path.append(module_path)

add_modules()
import util

datadir = util.resolve_dir("tabulardata")

### Data Set

The data for these exercises is real FAA data for the collection of New York City airports.  It contains separate tables regarding flights, planes, and airports, covering all flights between NYC and any destination for the entire year of 2013.  This is a well-known data set in the R and data science community, known as `nycflights13`.

We will see, as we work through these exercises, that the data is of significant size.  Effective debugging of processing on larger data sets involves first trying your solution on a smaller subset of the data.
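
One way to try a solution on a smaller set of the data is to read only the first few rows of a file.  The sketch below uses the `nrows` parameter of `pd.read_csv`; the inline CSV text and the names `csv_text`, `sample`, and `full` are made up for illustration and stand in for a large file, not the real flights data.

```python
import io
import pandas as pd

# Hypothetical stand-in for a large CSV file (not the real flights data).
csv_text = "flight,origin,dest\n1,JFK,LAX\n2,LGA,ORD\n3,EWR,MIA\n4,JFK,SFO\n"

# Read only the first two data rows while developing a solution.
sample = pd.read_csv(io.StringIO(csv_text), nrows=2)
print(sample)

# Once the logic works on the sample, read the whole file.
full = pd.read_csv(io.StringIO(csv_text))
print(len(sample), len(full))
```

Calling `head()` on a frame that is already loaded gives a similar small subset without rereading the file.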

The files we will use are in the `"tabulardata"` directory, referred to by the Python variable `datadir`.  The files are `"airports.csv"`, `"planes.csv"`, and `"flights.csv"`.

Because we want to acclimate to the process of data exploration, some of the questions will ask you to look at the data and reflect on what you observe, using manually graded Markdown cells.

**Q1** In the data directory, you will find `"airports.csv"`.  Read this into a data frame, assigning to `airports0`, with **no index**.  Print the first five rows of the data frame.

In [3]:
result = io.StringIO()
with redirect_stdout(result):
    airports_path = os.path.join(datadir, "airports.csv")
    airports0 = pd.read_csv(airports_path)
    print(airports0.head())
print(result.getvalue())

   faa                           name        lat        lon   alt  tz dst  \
0  04G              Lansdowne Airport  41.130472 -80.619583  1044  -5   A   
1  06A  Moton Field Municipal Airport  32.460572 -85.680028   264  -6   A   
2  06C            Schaumburg Regional  41.989341 -88.101243   801  -6   A   
3  06N                Randall Airport  41.431912 -74.391561   523  -5   A   
4  09J          Jekyll Island Airport  31.074472 -81.427778    11  -5   A   

              tzone  
0  America/New_York  
1   America/Chicago  
2   America/Chicago  
3  America/New_York  
4  America/New_York  



In [4]:
# Testing cell

assert len(airports0) == 1458
assert result.getvalue() == '   faa                           name        lat        lon   alt  tz dst  \\\n0  04G              Lansdowne Airport  41.130472 -80.619583  1044  -5   A   \n1  06A  Moton Field Municipal Airport  32.460572 -85.680028   264  -6   A   \n2  06C            Schaumburg Regional  41.989341 -88.101243   801  -6   A   \n3  06N                Randall Airport  41.431912 -74.391561   523  -5   A   \n4  09J          Jekyll Island Airport  31.074472 -81.427778    11  -5   A   \n\n              tzone  \n0  America/New_York  \n1   America/Chicago  \n2   America/Chicago  \n3  America/New_York  \n4  America/New_York  \n'

**Q2** An index for the rows should, generally, give a unique means of referring to rows.  In `airports0`, we would guess that the `faa` column is unique.  Project the `faa` column as a Series, assigning to `faas`.  Then assign to `unique_faa` the result of invoking the `unique()` method on `faas`.  Compare the `len()` of `unique_faa` to the `len()` of `airports0`, adding a comment in the solution cell with what you observe.

In [5]:
faas = airports0['faa']
unique_faa = faas.unique()
comp_len = lambda x, y: "equal" if x == y else ("unique shorter" if x < y else "unique longer")
comp_len(len(unique_faa), len(airports0))
# With the lambda function we can see that the lengths of airports0 and
# unique_faa are the same, which makes sense because each airport has
# a unique 3-character abbreviation.

'equal'

In [6]:
# Testing cell

assert list(faas.head()) == ['04G', '06A', '06C', '06N', '09J']
assert list(unique_faa[:5]) == ['04G', '06A', '06C', '06N', '09J']
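
Comparing lengths is one way to check that a column could serve as an index.  As a possible shortcut, a Series also has an `is_unique` attribute and a `nunique()` method.  The sketch below uses two small made-up Series (`codes` and `dupes`) rather than the airport data.

```python
import pandas as pd

# Hypothetical code columns: one with distinct values, one with a repeat.
codes = pd.Series(["04G", "06A", "06C"])
dupes = pd.Series(["04G", "06A", "04G"])

print(codes.is_unique)              # every value is distinct
print(dupes.is_unique)              # "04G" appears twice
print(dupes.nunique(), len(dupes))  # distinct count versus row count
```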

**Q3** Repeat construction of a data frame from `"airports.csv"`, but **at construction time** use the `faa` column as the index.  Assign to `airports`.  Print the first five rows.

In [7]:
result = io.StringIO()
with redirect_stdout(result):
    airports = pd.read_csv(airports_path, index_col=['faa'])
    print(airports.head())
print(result.getvalue())

                              name        lat        lon   alt  tz dst  \
faa                                                                      
04G              Lansdowne Airport  41.130472 -80.619583  1044  -5   A   
06A  Moton Field Municipal Airport  32.460572 -85.680028   264  -6   A   
06C            Schaumburg Regional  41.989341 -88.101243   801  -6   A   
06N                Randall Airport  41.431912 -74.391561   523  -5   A   
09J          Jekyll Island Airport  31.074472 -81.427778    11  -5   A   

                tzone  
faa                    
04G  America/New_York  
06A   America/Chicago  
06C   America/Chicago  
06N  America/New_York  
09J  America/New_York  



In [8]:
# Testing cell

assert len(airports) == 1458
assert result.getvalue() == '                              name        lat        lon   alt  tz dst  \\\nfaa                                                                      \n04G              Lansdowne Airport  41.130472 -80.619583  1044  -5   A   \n06A  Moton Field Municipal Airport  32.460572 -85.680028   264  -6   A   \n06C            Schaumburg Regional  41.989341 -88.101243   801  -6   A   \n06N                Randall Airport  41.431912 -74.391561   523  -5   A   \n09J          Jekyll Island Airport  31.074472 -81.427778    11  -5   A   \n\n                tzone  \nfaa                    \n04G  America/New_York  \n06A   America/Chicago  \n06C   America/Chicago  \n06N  America/New_York  \n09J  America/New_York  \n'

**Q4** Project the three locational columns of `lat`, `lon`, and `alt` (with all rows), making a separate **copy** of the result, and assigning to `locations`.  

In [9]:
locations = airports0[['lat','lon','alt']].copy()
print(locations)

            lat         lon   alt
0     41.130472  -80.619583  1044
1     32.460572  -85.680028   264
2     41.989341  -88.101243   801
3     41.431912  -74.391561   523
4     31.074472  -81.427778    11
...         ...         ...   ...
1453  35.083228 -108.791778  6454
1454  41.298669  -72.925992     7
1455  39.736667  -75.551667     0
1456  38.897460  -77.006430    76
1457  40.750500  -73.993500    35

[1458 rows x 3 columns]


In [10]:
# Testing cell

assert locations.shape == (1458, 3)
assert list(locations.columns) == ['lat', 'lon', 'alt']
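
To see why a separate copy matters, the sketch below changes a value in a copied projection and confirms that the original frame is untouched.  The frame `df` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical frame with the same column names as the locational columns.
df = pd.DataFrame({"lat": [41.1, 32.5],
                   "lon": [-80.6, -85.7],
                   "alt": [1044, 264]})

# Project two columns and take an independent copy.
subset = df[["lat", "alt"]].copy()

# Changing the copy does not change the original frame.
subset.loc[0, "alt"] = 0
print(df.loc[0, "alt"])
print(subset.loc[0, "alt"])
```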

**Q5** From `airports`, select the contiguous set of rows from the airport `CMH` up to and including the airport `COD`.  Make a separate copy of the result, and assign to `airport_subset`.  Print the result.

In [11]:
result = io.StringIO()
with redirect_stdout(result):
    airport_subset = airports['CMH':'COD'].copy()
    print(airport_subset)
print(result.getvalue())

                                 name        lat         lon   alt  tz dst  \
faa                                                                          
CMH                Port Columbus Intl  39.997972  -82.891889   815  -5   A   
CMI                         Champaign  40.039250  -88.278056   754  -6   A   
CMX  Houghton County Memorial Airport  47.168400  -88.489100  1095  -5   A   
CNM          Cavern City Air Terminal  32.337472 -104.263278  3295  -7   A   
CNW                         Tstc Waco  31.637831  -97.074139   470  -6   A   
CNY                 Canyonlands Field  38.755000 -109.754722  4555  -7   A   
COD                  Yellowstone Rgnl  44.520194 -109.023806  5102  -7   A   

                tzone  
faa                    
CMH  America/New_York  
CMI   America/Chicago  
CMX  America/New_York  
CNM    America/Denver  
CNW   America/Chicago  
CNY    America/Denver  
COD    America/Denver  



In [12]:
# Testing cell

assert airport_subset.shape == (7, 7)
assert result.getvalue() == '                                 name        lat         lon   alt  tz dst  \\\nfaa                                                                          \nCMH                Port Columbus Intl  39.997972  -82.891889   815  -5   A   \nCMI                         Champaign  40.039250  -88.278056   754  -6   A   \nCMX  Houghton County Memorial Airport  47.168400  -88.489100  1095  -5   A   \nCNM          Cavern City Air Terminal  32.337472 -104.263278  3295  -7   A   \nCNW                         Tstc Waco  31.637831  -97.074139   470  -6   A   \nCNY                 Canyonlands Field  38.755000 -109.754722  4555  -7   A   \nCOD                  Yellowstone Rgnl  44.520194 -109.023806  5102  -7   A   \n\n                tzone  \nfaa                    \nCMH  America/New_York  \nCMI   America/Chicago  \nCMX  America/New_York  \nCNM    America/Denver  \nCNW   America/Chicago  \nCNY    America/Denver  \nCOD    America/Denver  \n'
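
Slicing by row label includes **both** endpoints, unlike positional slicing, where the stop position is excluded.  The sketch below contrasts the two on a small made-up frame; the altitude values are illustrative only.

```python
import pandas as pd

# Hypothetical frame indexed by airport code.
df = pd.DataFrame({"alt": [815, 754, 1095, 3295]},
                  index=["CMH", "CMI", "CMX", "CNM"])

# Label-based slice: the start and stop labels are both included.
by_label = df.loc["CMI":"CMX"]

# Position-based slice: the stop position is excluded.
by_position = df.iloc[1:2]

print(list(by_label.index))
print(list(by_position.index))
```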

**Q6** Create a dataframe for `"planes.csv"`, assigning to `planes0`.  Then create `planes` by starting from `planes0` and explicitly setting an index that corresponds to the tail number for the plane.

In [13]:
planes0 = pd.read_csv(os.path.join(datadir, 'planes.csv'))
planes = planes0.set_index('tailnum')
print(planes0.head())
print(planes.head())

  tailnum    year                     type      manufacturer      model  \
0  N10156  2004.0  Fixed wing multi engine           EMBRAER  EMB-145XR   
1  N102UW  1998.0  Fixed wing multi engine  AIRBUS INDUSTRIE   A320-214   
2  N103US  1999.0  Fixed wing multi engine  AIRBUS INDUSTRIE   A320-214   
3  N104UW  1999.0  Fixed wing multi engine  AIRBUS INDUSTRIE   A320-214   
4  N10575  2002.0  Fixed wing multi engine           EMBRAER  EMB-145LR   

   engines  seats  speed     engine  
0        2     55    NaN  Turbo-fan  
1        2    182    NaN  Turbo-fan  
2        2    182    NaN  Turbo-fan  
3        2    182    NaN  Turbo-fan  
4        2     55    NaN  Turbo-fan  
           year                     type      manufacturer      model  \
tailnum                                                                 
N10156   2004.0  Fixed wing multi engine           EMBRAER  EMB-145XR   
N102UW   1998.0  Fixed wing multi engine  AIRBUS INDUSTRIE   A320-214   
N103US   1999.0  Fixed wing m

In [14]:
# Testing cell

assert planes0.shape == (3322, 9)
assert planes.shape == (3322, 8)

**Q7** Determine which columns of `planes` have missing data, showing your code along with a comment in the code cell answering: for each such column, how many rows are missing?

In [27]:
# Count the missing entries in each column of planes
D = {}
for collabel, colseries in planes.items():
    D[collabel] = int(colseries.isna().sum())
print(D)
# For year there are 70 missing and for speed there are 3299 missing;
# none are missing for the rest.

{'year': 70, 'type': 0, 'manufacturer': 0, 'model': 0, 'engines': 0, 'seats': 0, 'speed': 3299, 'engine': 0}


**Q8** The `isna()` method is one that can be used on a Series column vector, and results in a Boolean Series whose value is `True` if the value for an entry is missing.  Use this method to obtain a data frame consisting of the selection of rows that have missing data for the `year` column.  Assign to `planes_missing_year`.  Print the first five rows of this dataframe.

In [16]:
result = io.StringIO()
with redirect_stdout(result):
    planes_missing_year = planes[planes['year'].isna()]
    print(planes_missing_year.head())
print(result.getvalue())

         year                     type      manufacturer      model  engines  \
tailnum                                                                        
N14558    NaN  Fixed wing multi engine           EMBRAER  EMB-145LR        2   
N15555    NaN  Fixed wing multi engine           EMBRAER  EMB-145LR        2   
N15574    NaN  Fixed wing multi engine           EMBRAER  EMB-145LR        2   
N174US    NaN  Fixed wing multi engine  AIRBUS INDUSTRIE   A321-211        2   
N177US    NaN  Fixed wing multi engine  AIRBUS INDUSTRIE   A321-211        2   

         seats  speed     engine  
tailnum                           
N14558      55    NaN  Turbo-fan  
N15555      55    NaN  Turbo-fan  
N15574      55    NaN  Turbo-fan  
N174US     199    NaN  Turbo-jet  
N177US     199    NaN  Turbo-jet  



In [17]:
# Testing cell

assert planes_missing_year.shape == (70, 8)
assert result.getvalue() == '         year                     type      manufacturer      model  engines  \\\ntailnum                                                                        \nN14558    NaN  Fixed wing multi engine           EMBRAER  EMB-145LR        2   \nN15555    NaN  Fixed wing multi engine           EMBRAER  EMB-145LR        2   \nN15574    NaN  Fixed wing multi engine           EMBRAER  EMB-145LR        2   \nN174US    NaN  Fixed wing multi engine  AIRBUS INDUSTRIE   A321-211        2   \nN177US    NaN  Fixed wing multi engine  AIRBUS INDUSTRIE   A321-211        2   \n\n         seats  speed     engine  \ntailnum                           \nN14558      55    NaN  Turbo-fan  \nN15555      55    NaN  Turbo-fan  \nN15574      55    NaN  Turbo-fan  \nN174US     199    NaN  Turbo-jet  \nN177US     199    NaN  Turbo-jet  \n'
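
The Boolean Series produced by `isna()` can be used directly as a row selector, and because `True` counts as 1, summing it gives the number of missing entries.  The sketch below shows both uses on a small made-up frame; the tail numbers and values are hypothetical.

```python
import numpy
import pandas as pd

# Hypothetical frame with two missing years.
df = pd.DataFrame({"year": [2004.0, numpy.nan, 1999.0, numpy.nan],
                   "seats": [55, 55, 182, 199]},
                  index=["N1", "N2", "N3", "N4"])

mask = df["year"].isna()        # Boolean Series, True where year is missing
missing = df[mask]              # rows selected by the mask
n_missing = int(mask.sum())     # number of True entries
per_column = df.isna().sum()    # missing count for every column at once

print(missing)
print(n_missing)
print(per_column)
```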

**Q9** Find the unique values for the `type` field in `planes`.  Assign to `unique_plane_type` and print this.

In [18]:
result = io.StringIO()
with redirect_stdout(result):
    unique_plane_type = planes['type'].unique()
    print(unique_plane_type)
print(result.getvalue())

['Fixed wing multi engine' 'Fixed wing single engine' 'Rotorcraft']



In [19]:
# Testing cell

assert len(unique_plane_type) == 3
assert result.getvalue() == "['Fixed wing multi engine' 'Fixed wing single engine' 'Rotorcraft']\n"

**Q10** Find the subset of rows from `planes` where type is `"Rotorcraft"` and project the columns of manufacturer, model, engines, and seats.  Use a single expression to perform **both** projection and selection.  Assign to `rotors` and print the result.

In [20]:
result = io.StringIO()
with redirect_stdout(result):
    rotors = planes.loc[planes["type"]=="Rotorcraft",["manufacturer", "model", "engines", "seats"]]
    print(rotors)
print(result.getvalue())

                   manufacturer  model  engines  seats
tailnum                                               
N347AA                 SIKORSKY  S-76A        2     14
N365AA               AGUSTA SPA  A109E        2      8
N393AA                     BELL    230        2     11
N508AA                     BELL   206B        1      5
N537JB   ROBINSON HELICOPTER CO    R66        1      5



In [21]:
# Testing cell

assert rotors.shape == (5, 4)
assert result.getvalue() == '                   manufacturer  model  engines  seats\ntailnum                                               \nN347AA                 SIKORSKY  S-76A        2     14\nN365AA               AGUSTA SPA  A109E        2      8\nN393AA                     BELL    230        2     11\nN508AA                     BELL   206B        1      5\nN537JB   ROBINSON HELICOPTER CO    R66        1      5\n'
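
The general pattern for combining selection and projection is `df.loc[row_mask, column_list]`: the first position picks rows and the second picks columns.  The sketch below applies the pattern to a small made-up frame, not to the real `planes` data.

```python
import pandas as pd

# Hypothetical aircraft table.
df = pd.DataFrame({"type": ["Rotorcraft", "Fixed wing", "Rotorcraft"],
                   "model": ["S-76A", "A320", "206B"],
                   "seats": [14, 182, 5],
                   "engine": ["Turbo-shaft", "Turbo-fan", "Turbo-shaft"]},
                  index=["N1", "N2", "N3"])

# Rows where type is "Rotorcraft", restricted to two columns, in one expression.
small = df.loc[df["type"] == "Rotorcraft", ["model", "seats"]]
print(small)
```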

**Q11** When a value is **missing** in a field in a `pandas` dataframe, the entry is typically a special floating-point value that does not correspond to any "valid" number.  It is displayed as `NaN`, which stands for "Not a Number".  Our goal in the next two questions is to process the `tzone` column from the `airports` table, and extract the "city" portion of the time zone.  However, some of the values in that column are missing, so we have to write code that can adapt and do something reasonable in such a case.

Start by defining a `lambda` function of one argument, `v`, that, if the argument is a string data type (i.e., `isinstance(v, str)`), splits the argument on the `/` character and returns the **second** field.  If, however, the argument is not a string, the lambda yields `numpy.nan`.  Assign this lambda to `tz_city`.

In [22]:
tz_city = lambda v: v.split("/")[1] if (isinstance(v,str)) else numpy.nan

In [23]:
# Testing cell

assert numpy.isnan(tz_city(5))
assert numpy.isnan(tz_city(2.5))
assert tz_city('America/New York') == 'New York'
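
One caution when testing for missing values: `NaN` never compares equal to anything, including itself, so `==` and `!=` cannot detect it.  The sketch below shows the comparison behavior and uses `pd.isna` as a check that works for both floats and strings; the variable `value` is made up for illustration.

```python
import numpy
import pandas as pd

value = numpy.nan

# Equality tests against NaN do not detect a missing value.
print(value == numpy.nan)
print(value != numpy.nan)

# pd.isna works for NaN and for ordinary values such as strings.
print(pd.isna(value))
print(pd.isna("America/Chicago"))
```

This is why `tz_city` tests the type with `isinstance(v, str)` rather than comparing its argument to `numpy.nan`.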

**Q12** Create a Series that is a column vector whose values are the "city" portion of the `tzone` field for each of the rows of `airports`.  Do this by applying the unary function, `tz_city`, to the `tzone` column of `airports`.  Assign the result to `tz_cities`, and then print the result of converting the **unique** values from this vector to a list.

In [24]:
result = io.StringIO()
with redirect_stdout(result):
    # Apply the unary function tz_city to every entry of the tzone column
    tz_cities = airports["tzone"].apply(tz_city)
    # Convert the unique values to a list and print it
    print(list(tz_cities.unique()))
print(result.getvalue())

['New_York', 'Chicago', 'Los_Angeles', 'Vancouver', 'Phoenix', 'Anchorage', 'Denver', 'Honolulu', 'Chongqing', nan]



In [25]:
# Testing cell

assert result.getvalue() == "['New_York', 'Chicago', 'Los_Angeles', 'Vancouver', 'Phoenix', 'Anchorage', 'Denver', 'Honolulu', 'Chongqing', nan]\n"
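
Applying a unary function with `apply` is what Q12 asks for.  As a possible alternative, the `.str` accessor on a Series can do the same split while leaving missing entries as `NaN`.  The sketch below compares the two approaches on a small made-up Series; `extract_city` is a hypothetical stand-in for `tz_city`.

```python
import numpy
import pandas as pd

# Hypothetical time zone column with one missing entry.
tzone = pd.Series(["America/New_York", "America/Chicago",
                   numpy.nan, "America/Chicago"])

# Unary function in the style of tz_city (illustrative stand-in).
extract_city = lambda v: v.split("/")[1] if isinstance(v, str) else numpy.nan

via_apply = tzone.apply(extract_city)     # element-by-element function call
via_str = tzone.str.split("/").str[1]     # vectorized string methods

print(list(via_apply.unique()))
print(list(via_str.dropna()))
```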