# Denison CS-181/DA-210 Homework

---

## Tabular Transformations Exercises

In [30]:
import os
import io
import sys
import re
import pandas as pd

from contextlib import redirect_stdout

def add_modules():
    """
    Starting at the current directory and proceeding up the file system
    tree, search for a directory named `modules`.  If found, and if not
    already there, add to the Python module search path.
    
    Params: None
    
    Return: None
    """
    directory = "."
    levels = 0
    while not os.path.isdir(os.path.join(directory, "modules")) and \
          levels < 5:
        directory = os.path.join(directory, "..")
        levels += 1
    module_path = os.path.abspath(os.path.join(directory, "modules"))
    if os.path.isdir(module_path):
        if module_path not in sys.path:
            sys.path.append(module_path)

add_modules()
import util

datadir = util.resolve_dir("tabulardata")

**Q1** Make the following into a `pandas` data frame, assigning it to variable `df1`.  Do not specify an index.

    {'bar': ['one','two','one','two','one','two'],
     'baz': ['A', 'A', 'B', 'B', 'C', 'C'],
     'foo': [1, 2, 3, 4, 5, 6]}
     
Suppose the values `A`, `B`, and `C` from column `baz` should be column headers (so it currently takes more than one row to describe a single observation, and `A`, `B`, and `C` are really variables), with the cell values coming from the `foo` column.

Transform/reshape `df1` to obtain a tidy version of this data.  Your result **should only have one Index level for column labels**.  Assign to `transform1` and print the result.

In [31]:
# Student Experimentation Cell



In [32]:
# Solution Cell

result = io.StringIO()
with redirect_stdout(result):
    df1 = pd.DataFrame.from_dict({'bar':['one','two','one','two','one','two'],
                       'baz':['A','A','B','B','C','C'],
                       'foo':[1,2,3,4,5,6]})
    transform1 = df1.pivot(index='bar', columns='baz',values='foo')
    print(transform1)
print(result.getvalue())

baz  A  B  C
bar         
one  1  3  5
two  2  4  6



In [33]:
# Testing Cell

assert isinstance(df1, pd.core.frame.DataFrame)
assert isinstance(transform1, pd.core.frame.DataFrame)

assert transform1.shape == (2, 3)
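
The pivoted frame has a single level of column labels, but that level still carries the name `baz`, and `bar` has become the row index. If a flat table with an ordinary integer index is preferred, one possible follow-up is sketched below. It is not part of the graded solution, and the name `flat1` is hypothetical.

```python
import pandas as pd

df1 = pd.DataFrame({'bar': ['one', 'two', 'one', 'two', 'one', 'two'],
                    'baz': ['A', 'A', 'B', 'B', 'C', 'C'],
                    'foo': [1, 2, 3, 4, 5, 6]})

# Pivot as in the solution cell: one row per `bar`, one column per `baz` value.
transform1 = df1.pivot(index='bar', columns='baz', values='foo')

# Hypothetical follow-up: move `bar` back into a regular column and clear
# the leftover "baz" name from the column Index.
flat1 = transform1.reset_index()
flat1.columns.name = None

print(flat1.columns.nlevels)
print(list(flat1.columns))
```

Either form satisfies the shape of a tidy table here; the choice mostly affects how later merges and selections are written.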


**Q2** Make the following into a `pandas` data frame (with no index).  Assign it to `df2`.

    {'A': {0: 2, 1: 4, 2: 6},
    'B': {0: 'a', 1: 'b', 2: 'c'},
    'C': {0: 1, 1: 3, 2: 5},
    'D': {0: 1, 1: 2, 2: 4}}

Suppose further that we have determined that column labels `A` and `C` are really *values* of a *variable* called `X`.  Transform `df2` into a tidy form, making sure that `X` is appropriately labeled in the result.  You need not specify a value name.  Assign to `transform2`.  You need not print the result.

In [34]:
# Student Experimentation Cell
df2

   A  B  C  D
0  2  a  1  1
1  4  b  3  2
2  6  c  5  4


In [35]:
# Solution Cell

df2 = pd.DataFrame.from_dict({'A':{0:2,1:4,2:6},
                             'B':{0:'a',1:'b',2:'c'},
                             'C':{0:1,1:3,2:5},
                             'D':{0:1,1:2,2:4}})
transform2 = df2.melt(id_vars = ['B','D'], value_vars = ['A','C'], var_name = 'X')
transform2

   B  D  X  value
0  a  1  A      2
1  b  2  A      4
2  c  4  A      6
3  a  1  C      1
4  b  2  C      3
5  c  4  C      5


In [36]:
# Testing cell

assert isinstance(df2, pd.core.frame.DataFrame)
assert isinstance(transform2, pd.core.frame.DataFrame)

assert transform2.shape == (6, 4)
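
In the melt above, the `id_vars` columns are repeated for each melted column, the old column labels land in `X`, and the cell values land in a column named `value` (the default when no value name is given). One way to convince yourself that nothing was lost is to pivot the long form back to wide, as in this sketch. The name `wide2` is hypothetical, and passing a list as `index` to `pivot` assumes a reasonably recent pandas, as the Q5 solution also does.

```python
import pandas as pd

df2 = pd.DataFrame({'A': [2, 4, 6],
                    'B': ['a', 'b', 'c'],
                    'C': [1, 3, 5],
                    'D': [1, 2, 4]})

# Same melt as the solution cell: A and C become values of the variable X.
transform2 = df2.melt(id_vars=['B', 'D'], value_vars=['A', 'C'], var_name='X')

# Undo the melt: spread the X labels back out into columns.
wide2 = transform2.pivot(index=['B', 'D'], columns='X', values='value').reset_index()
wide2.columns.name = None

# Put the columns back in their original order for a side-by-side look.
print(wide2[['A', 'B', 'C', 'D']])
```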


**Q3** Consider the file `restaurants_gender.csv`, which holds rating data aggregated by gender; each row maps an id, restaurant, and gender to an average rating.  Relative to this aggregation, the data is tidy as it stands.  Transform the `restaurants_gender` data into a matrix presentation with gender down one axis (as the row-label index) and restaurant across the other axis (as the column-label Index), a form that might make for good presentation. Store the result as `rest_matrix`.

In [37]:
# Solution cell
restaurants_gender = pd.read_csv(os.path.join(datadir, 'restaurants_gender.csv'))
rest_matrix = restaurants_gender.pivot(index = 'gender', columns = 'restaurant')
rest_matrix

            id     rating
restaurant   A  B       A   B
gender
F            1  3      82  57
M            2  4      79  68
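
Because the pivot in the solution cell does not name a `values` column, every remaining column (`id` as well as `rating`) is spread across the restaurants, which gives the two-level column Index shown above. If only the ratings are wanted in the matrix, one option is to pass `values='rating'`. The sketch below rebuilds the four rows by hand from the displayed output so that it runs without the CSV; that reconstruction and the name `rating_matrix` are assumptions.

```python
import pandas as pd

# Rows rebuilt by hand from the output displayed above (an assumption about
# the file's contents, used only so the sketch runs without the CSV).
restaurants_gender = pd.DataFrame({'id': [1, 2, 3, 4],
                                   'restaurant': ['A', 'A', 'B', 'B'],
                                   'gender': ['F', 'M', 'F', 'M'],
                                   'rating': [82, 79, 57, 68]})

# Naming `values` keeps only the rating, so the columns have a single level.
rating_matrix = restaurants_gender.pivot(index='gender', columns='restaurant',
                                         values='rating')
print(rating_matrix)
```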


In [38]:
# Testing cell
assert True


**Q4** Consider the file `ratings.csv`.  It has columns for first name, last name, `RatingA`, used for rating a particular restaurant (A), and `RatingB`, used for rating a different restaurant (B).  The name of a "rater" should be a single variable (a string with first name, a space, and then last name).  The particular restaurants (`A` and `B`) are *values* of the data set.  Transform the given dataset into a tidy data set, naming it `ratings_tidy`.  Do not give the new data set a row label index.  You need not print the result.

Note that this question involves both transformation and mutations to normalize this data.

In [39]:
# Solution cell

ratings_tidy = pd.read_csv(os.path.join(datadir, 'ratings.csv'))
ratings_tidy['First'] = ratings_tidy['First'] +' '+ ratings_tidy['Last']
del ratings_tidy['Last']
ratings_tidy.rename(columns={'First':'Name'}, inplace = True)
ratings_tidy


          Name  RatingA  RatingB
0  Hamid Hirst       73       52
1   Tanya Hale       57       72
2    Wei Chang       85       69


In [40]:
# Testing cell
assert True
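
The solution cell above handles the mutation (combining the two name columns) but leaves `RatingA` and `RatingB` as separate columns, so the restaurants are still column labels rather than values. A possible remaining step is a melt, sketched here on rows copied from the displayed output. The column names `Restaurant` and `Rating` and the frame name `ratings_long` are hypothetical; the exercise does not prescribe them.

```python
import pandas as pd

# Rows copied from the output displayed above so the sketch runs without the CSV.
ratings = pd.DataFrame({'Name': ['Hamid Hirst', 'Tanya Hale', 'Wei Chang'],
                        'RatingA': [73, 57, 85],
                        'RatingB': [52, 72, 69]})

# Melt the two rating columns into (Restaurant, Rating) pairs.
# 'Restaurant' and 'Rating' are hypothetical column names.
ratings_long = ratings.melt(id_vars='Name', var_name='Restaurant',
                            value_name='Rating')

# Strip the 'Rating' prefix so the values are just 'A' and 'B'.
ratings_long['Restaurant'] = ratings_long['Restaurant'].str.replace(
    'Rating', '', regex=False)
print(ratings_long)
```

Each row of the result then describes one rater's rating of one restaurant, which is the observation the question describes.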


**Q5** Read `us_rent_income.csv` into a dataframe and examine the data. Note: this dataset contains the estimated income and rent in each state, along with the margin of error for each of these quantities. This data came from the US Census.

Read the data into a data frame, `rent_income`, with no index.  Your goal is to make this into a tidy data set.  Hint: In a normalized version of this data set, the unique row index would be based on `GEOID` and `NAME` (which correspond one-for-one, and either could be the independent variable for the data).

Name your tidy data set `rent_income_tidy`.  It is ok to have a two level column Index.  No need to print the result.

In [41]:
# Student Experimentation Cell



In [44]:
# Solution cell

rent_income = pd.read_csv(os.path.join(datadir, 'us_rent_income.csv'))
rent_income_tidy = rent_income.pivot(index = ['GEOID','NAME'], columns = 'variable', values = ['estimate','moe'])
rent_income_tidy

                            estimate             moe
variable                      income    rent  income  rent
GEOID NAME
1     Alabama                24476.0   747.0   136.0   3.0
2     Alaska                 32940.0  1200.0   508.0  13.0
4     Arizona                27517.0   972.0   148.0   4.0
5     Arkansas               23789.0   709.0   165.0   5.0
6     California             29454.0  1358.0   109.0   3.0
8     Colorado               32401.0  1125.0   109.0   5.0
9     Connecticut            35326.0  1123.0   195.0   5.0
10    Delaware               31560.0  1076.0   247.0  10.0
11    District of Columbia   43198.0  1424.0   681.0  17.0
12    Florida                25952.0  1077.0    70.0   3.0


In [43]:
# Testing cell

assert rent_income_tidy.shape == (52, 4)
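
The question allows a two-level column Index, which is what pivoting on two value columns produces. If single-level column names are easier to work with downstream, one possible flattening step is sketched below on two states copied from the output above. The long layout (`GEOID`, `NAME`, `variable`, `estimate`, `moe`) is inferred from the solution's `pivot` call, and the names `long_rows`, `wide`, and `flat` are hypothetical.

```python
import pandas as pd

# Two states copied from the output above, arranged in the long layout the
# solution's pivot call implies (GEOID, NAME, variable, estimate, moe).
long_rows = pd.DataFrame({
    'GEOID': [1, 1, 2, 2],
    'NAME': ['Alabama', 'Alabama', 'Alaska', 'Alaska'],
    'variable': ['income', 'rent', 'income', 'rent'],
    'estimate': [24476.0, 747.0, 32940.0, 1200.0],
    'moe': [136.0, 3.0, 508.0, 13.0],
})

# Same pivot as the solution cell: one row per state, two column levels.
wide = long_rows.pivot(index=['GEOID', 'NAME'], columns='variable',
                       values=['estimate', 'moe'])

# Hypothetical flattening step: join the two column levels with '_'.
flat = wide.copy()
flat.columns = ['_'.join(pair) for pair in flat.columns]
flat = flat.reset_index()
print(flat)
```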