# Denison CS-181/DA-210 Homework

---

## Aggregation Exercises

In [47]:
import os
import io
import sys
import re
import pandas as pd

from contextlib import redirect_stdout

def add_modules():
    """
    Starting at the current directory and proceeding up the file system
    tree, search for a directory named `modules`.  If found, and if not
    already there, add to the Python module search path.
    
    Params: None
    
    Return: None
    """
    directory = "."
    levels = 0
    while not os.path.isdir(os.path.join(directory, "modules")) and \
          levels < 5:
        directory = os.path.join(directory, "..")
        levels += 1
    module_path = os.path.abspath(os.path.join(directory, "modules"))
    if os.path.isdir(module_path):
        if module_path not in sys.path:
            sys.path.append(module_path)

add_modules()
import util

datadir = util.resolve_dir("tabulardata")

**Q1** Read the CSV file `indicators2.csv` in the `"tabulardata"` data directory into a data frame named `indicators`, using the `code` column for the index.  Using individual aggregation operations on the respective `Series`, find the median and mean of `gdp`, the mean of `life` (life expectancy), and the median of `land` (land area).  Assign to `median_gdp`, `mean_gdp`, `mean_life`, and `median_land`, respectively.  **Without additional annotation** and in a single print invocation, print these four values.

In [48]:
result = io.StringIO()
with redirect_stdout(result):
    indicators = pd.read_csv(os.path.join(datadir, "indicators2.csv"), index_col = "code")
    median_gdp = indicators["gdp"].median()
    mean_gdp = indicators["gdp"].mean()
    mean_life = indicators["life"].mean()
    median_land = indicators["land"].median()
    print(median_gdp, mean_gdp, mean_life, median_land)
print(result.getvalue())

20.395 320.5834313725491 70.48905 95300.0



In [49]:
assert isinstance(median_gdp, float)
assert isinstance(mean_gdp, float)
assert isinstance(mean_life, float)
assert isinstance(median_land, float)
assert result.getvalue() == '20.395 320.583431372549 70.48904999999999 95300.0\n'

AssertionError: 
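
The printed values and the expected string differ only in the trailing digits of two floats (`320.5834313725491` versus `320.583431372549`, and `70.48905` versus `70.48904999999999`), so an exact string comparison fails even though the numbers agree to many decimal places. The sketch below is one way to check this. The helper `close_enough` is hypothetical and not part of the assignment's test cell; the two strings are copied from the output and the assertion above.

```python
import math

# Strings copied from the notebook: what was printed, and what the test expected.
printed = "20.395 320.5834313725491 70.48905 95300.0\n"
expected = "20.395 320.583431372549 70.48904999999999 95300.0\n"

def close_enough(a, b, rel_tol=1e-9):
    """Hypothetical helper: compare two whitespace-separated lists of floats."""
    xs = [float(token) for token in a.split()]
    ys = [float(token) for token in b.split()]
    return len(xs) == len(ys) and all(
        math.isclose(x, y, rel_tol=rel_tol) for x, y in zip(xs, ys)
    )

exact_match = printed == expected                # exact text comparison
tolerant_match = close_enough(printed, expected) # numeric comparison with tolerance
print(exact_match, tolerant_match)
```

Comparing parsed numbers with a tolerance avoids depending on how a particular platform or library version rounds the last digit.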

**Q2** Using a single invocation of the `agg` method **on a DataFrame**, perform the same computation, assign the result to `metrics`, and print it.

In [50]:
result = io.StringIO()
with redirect_stdout(result):
    D = {'gdp':['median','mean'],'life':'mean','land':'median'}
    metrics = indicators.agg(D)
    print(metrics)
print(result.getvalue())

               gdp      life     land
mean    320.583431  70.48905      NaN
median   20.395000       NaN  95300.0



In [51]:
assert isinstance(metrics, pd.core.frame.DataFrame)
assert result.getvalue() == '               gdp      life     land\nmean    320.583431  70.48905      NaN\nmedian   20.395000       NaN  95300.0\n'
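
The `NaN` entries in the result appear because each column was asked for a different set of aggregations, and `agg` fills the combinations that were not requested. A minimal sketch on a made-up three-row frame (the values and the name `toy` are assumptions for illustration, not the homework data):

```python
import pandas as pd

# Made-up data using the same column names as the homework frame.
toy = pd.DataFrame(
    {"gdp": [1.0, 2.0, 6.0], "life": [60.0, 70.0, 80.0], "land": [10.0, 30.0, 20.0]},
    index=["AAA", "BBB", "CCC"],
)

# One dictionary, one call: each column gets only the aggregations listed for it.
toy_metrics = toy.agg({"gdp": ["median", "mean"], "life": "mean", "land": "median"})
print(toy_metrics)

# Cells for aggregations that were not requested for a column are NaN.
unrequested = toy_metrics.loc["median", "life"]
```

Looking up a value with `toy_metrics.loc["mean", "gdp"]` is safer than relying on row order, which may differ between pandas versions.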

**Q3** Perform the aggregation functions of max, min, median, count, and size on the `cell` column in a single operation. Do this twice: first by invoking the `agg` method on the DataFrame (appropriately), and then by first projecting the Series and invoking the `agg` method on the Series.  Name the first `metrics1` and the second `metrics2`.  In **separate print invocations** print each.  In a comment in your solution cell, tell me the data type of `metrics1`, and the data type of `metrics2`.

In [52]:
result = io.StringIO()
with redirect_stdout(result):
    metrics1 = indicators.agg({'cell':['max', 'min', 'median', 'count', 'size']})
    metrics2 = indicators['cell'].agg(['max', 'min', 'median', 'count', 'size'])
    print(metrics1)
    print(metrics2)
print(result.getvalue())
# metrics1 is a DataFrame, while metrics2 is a Series

          cell
max     859.00
min       0.00
median    4.61
count   207.00
size    217.00
max       859.00
min         0.00
median      4.61
count     207.00
size      217.00
Name: cell, dtype: float64



In [53]:
# Testing cell
assert True
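
In the output above, `count` (207) is smaller than `size` (217) because `count` skips missing values while `size` counts every row. The sketch below shows this on a made-up four-element Series containing one `NaN`; the data and variable names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up column with one missing value.
cell = pd.Series([859.0, 0.0, np.nan, 4.61], name="cell")

# Aggregating the projected Series yields a Series.
as_series = cell.agg(["count", "size"])

# Aggregating through the DataFrame with a dict of lists yields a DataFrame.
as_frame = cell.to_frame().agg({"cell": ["count", "size"]})

# The difference between size and count is the number of missing entries.
missing = int(as_series["size"] - as_series["count"])
print(type(as_series).__name__, type(as_frame).__name__, missing)
```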

**Q4** In the data directory, you will find a file `sat.csv`, based on data from www.onlinestatbook.com. This data contains the high school GPA, SAT math score, SAT verbal score, and college GPAs, for a number of students. Read the data into a pandas dataframe `dfsat`. We are going to experiment using `agg` to compute the average of all of the numeric columns in the data set.

The four variations:

1. Pass a string naming an aggregation function to `agg`, assigning the result to `average1`.
2. Pass a list consisting of a string naming an aggregation function, assigning the result to `average2`.
3. Pass a dictionary with keys for each column and mapping to a string for the aggregation function, assigning the result to `average3`.
4. Pass a dictionary mapping each of the columns to a list consisting of a string naming the aggregation function.  Assign to `average4`.

In separate print statements, print each of these results.

In [54]:
result = io.StringIO()
with redirect_stdout(result):
    dfsat = pd.read_csv(os.path.join(datadir, "sat.csv"))
    average1 = dfsat.agg('mean')
    average2 = dfsat.agg(['mean'])
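    # Note: {'mean'} is a set literal, not a dict. pandas treats it as list-like,
    # so this call behaves like variation 2, not the column-to-string dict of variation 3.
    average3 = dfsat.agg({'mean'})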
    average4 = dfsat.agg({'hs_GPA':['mean'],'math_SAT':['mean'],'verb_SAT':['mean'], 'college_GPA':['mean']})
    print(average1)
    print(average2)
    print(average3)
    print(average4)
    
    print(type(average1))
    print(type(average2))
    print(type(average3))
    print(type(average4))
print(repr(result.getvalue()))

"hs_GPA           3.076381\nmath_SAT       623.076190\nverb_SAT       598.600000\ncollege_GPA      3.172857\ndtype: float64\n        hs_GPA   math_SAT  verb_SAT  college_GPA\nmean  3.076381  623.07619     598.6     3.172857\n        hs_GPA   math_SAT  verb_SAT  college_GPA\nmean  3.076381  623.07619     598.6     3.172857\n        hs_GPA   math_SAT  verb_SAT  college_GPA\nmean  3.076381  623.07619     598.6     3.172857\n<class 'pandas.core.series.Series'>\n<class 'pandas.core.frame.DataFrame'>\n<class 'pandas.core.frame.DataFrame'>\n<class 'pandas.core.frame.DataFrame'>\n"


In [55]:
# Testing cell
assert True

**Q5** Compare and contrast the four cases in the previous question, including observations on data types and equivalencies.

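`average1` is a `Series` indexed by column name, while `average2`, `average3`, and `average4` are one-row `DataFrame`s whose single row is labeled `mean`. All four hold the same mean values, so they are equivalent in content and differ only in container type and shape. Note that `average3` was computed from the set `{'mean'}` rather than a dictionary mapping each column to a string, which is why it matches `average2`; a dictionary of column-to-string mappings would return a `Series` like `average1`.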
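
The return types of the four variations can be checked on a small made-up frame. The data below is an assumption for illustration (two of the `sat.csv` column names with invented values); the point is only which argument forms give a `Series` and which give a `DataFrame`.

```python
import pandas as pd

# Made-up data, not the contents of sat.csv.
toy = pd.DataFrame({"hs_GPA": [3.0, 4.0], "math_SAT": [600, 700]})

by_string = toy.agg("mean")                                           # string
by_list = toy.agg(["mean"])                                           # list of strings
by_dict_str = toy.agg({"hs_GPA": "mean", "math_SAT": "mean"})         # dict of strings
by_dict_list = toy.agg({"hs_GPA": ["mean"], "math_SAT": ["mean"]})    # dict of lists

for name, value in [
    ("string", by_string),
    ("list", by_list),
    ("dict of strings", by_dict_str),
    ("dict of lists", by_dict_list),
]:
    print(name, type(value).__name__)
```

A scalar function specification (a string, or a dictionary of strings) collapses each column to one value and gives a `Series`; wrapping the function name in a list adds a row dimension labeled by the function, giving a `DataFrame`.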

**Q6** The `indicators` data set in `indicators2.csv` has been augmented to include income categories for each country, and also has added the column, `gdp_percap`, which is the per capita gdp.

Write code to **filter** `indicators` to just the countries that are designated as `"High income"`, and then aggregate the `gdp_percap` column to obtain mean, median, and max, and aggregate `life` by the same three aggregations.  Use aggregation on the projected Series.

Assign the result to `highincome_stats` and print the result.

In [56]:
result = io.StringIO()
with redirect_stdout(result):
    indicators = pd.read_csv(os.path.join(datadir, "indicators2.csv"))
    high = indicators[indicators["income"] == "High income"]
    highincome_stats = high.agg({"gdp_percap":["mean","median","max"], "life":["mean",'median','max']})
    print(highincome_stats)
print(repr(result.getvalue()))

'        gdp_percap       life\nmean     37.790349  78.434328\nmedian   33.696600  79.100000\nmax     133.750000  82.980000\n'


In [57]:
# Testing cell
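
The solution cell above aggregates through the DataFrame with a dictionary. Since the question asks for aggregation on the projected Series, the following sketch shows that form on made-up data and checks that both routes give the same numbers. The frame `toy` and its values are assumptions for illustration, not the contents of `indicators2.csv`.

```python
import pandas as pd

# Made-up rows using the homework's column names.
toy = pd.DataFrame(
    {
        "income": ["High income", "Low income", "High income", "Low income"],
        "gdp_percap": [40.0, 0.5, 30.0, 0.7],
        "life": [80.0, 58.0, 78.0, 60.0],
    }
)

# Filter rows with a Boolean mask.
high = toy[toy["income"] == "High income"]

# Project each column to a Series, then aggregate the Series.
funcs = ["mean", "median", "max"]
gdp_stats = high["gdp_percap"].agg(funcs)
life_stats = high["life"].agg(funcs)

# Combine the two Series into one table, one column per variable.
stats = pd.DataFrame({"gdp_percap": gdp_stats, "life": life_stats})

# The DataFrame-level call, for comparison.
frame_stats = high.agg({"gdp_percap": funcs, "life": funcs})
same = bool((stats.loc[funcs] == frame_stats.loc[funcs]).all().all())
print(stats)
```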


**Q7** Repeat the filtering and aggregation from the last question, but this time filter by the income category `"Low income"`.  

Use aggregation on the projected Series.

Assign the result to `lowincome_stats` and print the result.

After you have given the code, write a short paragraph comparing and contrasting what you learned through the results of these last two questions.

In [58]:
result = io.StringIO()
with redirect_stdout(result):
    low = indicators[indicators["income"] == "Low income"]
    lowincome_stats = low.agg({'gdp_percap':['mean','median','max'], 'life':['mean','median','max']})
    print(lowincome_stats)
print(repr(result.getvalue()))

'        gdp_percap       life\nmean      0.593454  58.969677\nmedian    0.557800  57.460000\nmax       1.562100  72.110000\n'


In [59]:
# Testing cell


Income category is strongly associated with both economic output per person and life expectancy. Mean `gdp_percap` is 37.79 for high-income countries versus 0.59 for low-income countries, roughly 64 times higher, and even the low-income maximum (1.56) is far below the high-income median (33.70). Note that `gdp_percap` measures output per person, not the cost of living. Mean life expectancy is 78.43 years in high-income countries versus 58.97 years in low-income countries, a gap of about 19.5 years. These aggregates show a correlation rather than a cause: on their own they do not establish that income level is what raises or lowers lifespan.
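
The size of the gap between the two groups can be worked out directly from the printed means. The numbers below are copied from the two outputs above; the variable names are chosen here for illustration.

```python
# Means copied from the printed highincome_stats and lowincome_stats tables.
high_mean_life = 78.434328
low_mean_life = 58.969677
high_mean_gdp = 37.790349
low_mean_gdp = 0.593454

# Absolute gap in mean life expectancy between the two income groups.
life_gap = high_mean_life - low_mean_life

# Ratio of mean per capita GDP, high income to low income.
gdp_ratio = high_mean_gdp / low_mean_gdp

print(round(life_gap, 2), round(gdp_ratio, 1))
```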