# Course Solutions

1. [Pandas Numpy Intro](#1.-Pandas-Numpy-Intro)
2. [Series - Boolean Indexing](#2.-Series---Boolean-Indexing)
3. [The DataFrame](#3.-The-DataFrame)
4. [More on DataFrames](#4.-More-on-DataFrames)

# 1. Pandas Numpy Intro

In [3]:
import pandas as pd
import numpy as np

pd.options.display.max_columns = 40

### Problem 1
<span  style="color:green; font-size:16px">What type of object is returned from the values of the index of a Series?</span>

In [5]:
# your code here

# The values of the index of a series are a numpy ndarray

s = pd.Series([1, 3, 3, 77])

type(s.index.values)

numpy.ndarray

### Problem 2
<span  style="color:green; font-size:16px">Create a 3 element pandas Series using the Series constructor with characters as the index and numbers as the values. Output the Series.</span>

In [6]:
# your code here

In [8]:
s = pd.Series(data=[4, 9, 100], index=['b', 'c', 'g'])
s

b      4
c      9
g    100
dtype: int64

### Problem 3
<span  style="color:green; font-size:16px">Another way to create a series is to pass a dictionary to the pandas series constructor. The keys of the dictionary become the Series index and the dictionary values become the Series values. Create a dictionary with at least 3 elements and use it to create a series. Output the Series.</span>

In [11]:
# your code here
d = {'Houston':'South', 'Dallas':'North', 'El Paso':'West'}
s = pd.Series(d)

s

Dallas     North
El Paso     West
Houston    South
dtype: object

In [13]:
idx = range(10, 110) # generate values for the index
values = np.random.rand(100) # generate random numbers for the values

s = pd.Series(data=values, index=idx)

### Problem 4
<span  style="color:green; font-size:16px">Output to the screen the first 10 numbers in the Series above. Remember to only use **loc** and **iloc** when accessing Series elements.</span>

In [14]:
# your code here
s.iloc[:10]

10    0.035355
11    0.727694
12    0.310739
13    0.476686
14    0.790694
15    0.403608
16    0.372937
17    0.708290
18    0.672696
19    0.740472
dtype: float64

### Problem 5
<span  style="color:green; font-size:16px">Output elements with labels 40, 50 and 99 from the Series above.</span>

In [15]:
# your code here
s.loc[[40, 50, 99]]

40    0.718009
50    0.813319
99    0.345779
dtype: float64

### Problem 6
<span  style="color:green; font-size:16px">Output the last ten elements of the Series.</span>

In [16]:
# your code here
s.iloc[-10:]

100    0.763810
101    0.509787
102    0.176679
103    0.962719
104    0.649265
105    0.247904
106    0.823940
107    0.054817
108    0.480611
109    0.484518
dtype: float64

### Problem 7
<span  style="color:green; font-size:16px">Output every 15th element of the Series.</span>

In [17]:
# your code here
s.iloc[::15]

10     0.035355
25     0.897934
40     0.718009
55     0.055821
70     0.941857
85     0.214580
100    0.763810
dtype: float64

### Problem 8
<span  style="color:green; font-size:16px">Write a function that accepts a single argument. The argument will be a Series. Have the function return the difference between the largest and smallest Series value. Run your function with the Series above.</span>

In [19]:
# your code here
def min_max(series):
    return series.max() - series.min()

min_max(s)

0.96213063980758462

### Problem 9
<span  style="color:green; font-size:16px">If two Series are added with no indices in common, what will be the outcome? Check your answer by coding this situation.</span>

In [21]:
# your code here
# all missing values
s1 = pd.Series(data=[1,2,3], index=list('abc'))
s2 = pd.Series(data=[1,2,3], index=list('def'))

s1 + s2

a   NaN
b   NaN
c   NaN
d   NaN
e   NaN
f   NaN
dtype: float64
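
The all-NaN result comes from index alignment: `+` matches values by label, and a label present in only one Series has nothing to pair with. As a small sketch of one way around this, the `add` method accepts a `fill_value` that stands in for the missing side. The variable names `plain_sum` and `filled_sum` are illustrative only.

```python
import pandas as pd

s1 = pd.Series(data=[1, 2, 3], index=list('abc'))
s2 = pd.Series(data=[1, 2, 3], index=list('def'))

# The + operator aligns on index labels; labels found in only one Series give NaN
plain_sum = s1 + s2

# Series.add with fill_value=0 treats a label missing from one side as 0
filled_sum = s1.add(s2, fill_value=0)
print(filled_sum)
```

The same `fill_value` argument is available on `sub`, `mul` and `div`.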

### Problem 10
<span  style="color:green; font-size:16px">What if the two series from problem 9 were subtracted, multiplied or divided together?</span>

In [23]:
# your code here
# same thing would happen
s1 / s2

a   NaN
b   NaN
c   NaN
d   NaN
e   NaN
f   NaN
dtype: float64

### Problem 11
<span  style="color:green; font-size:16px">Create two Series that have 3 elements each and, when added together, yield a Series with 4 total elements, none of them missing.</span>

In [24]:
# your code here
s1 = pd.Series(data=[1,2,3], index=list('aab'))
s2 = pd.Series(data=[1,2,3], index=list('abb'))

s1 + s2

a    2
a    3
b    5
b    6
dtype: int64
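
Why four rows? When labels repeat, alignment pairs every occurrence of a label on the left with every occurrence on the right. The sketch below counts those pairings for the two Series from this problem; `left_counts`, `right_counts` and `expected_rows` are names made up for the illustration.

```python
import pandas as pd

s1 = pd.Series(data=[1, 2, 3], index=list('aab'))
s2 = pd.Series(data=[1, 2, 3], index=list('abb'))

total = s1 + s2

# Count how often each label appears on each side
left_counts = s1.index.value_counts()
right_counts = s2.index.value_counts()

# Each shared label yields (left count) x (right count) rows in the result
expected_rows = int((left_counts * right_counts).sum())
print(expected_rows, len(total))
```

Here `a` appears 2 x 1 times and `b` appears 1 x 2 times, which accounts for the four rows.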

# 2. Series - Boolean Indexing

### Problem 1
<span  style="color:green; font-size:16px">Create a function that accepts a stock ticker symbol (amzn for example) and returns the percentage of prices within 1, 2, and 3 standard deviations from the mean. Use your function to return results for different stocks (tsla, fb, slb, gm, etc...)</span>

In [3]:
from pandas_datareader import data

In [4]:
# your code here
def stock_pct_finder(symbol):
    prices = data.DataReader(symbol, 'google')
    close = prices['Close']
    close_change = close.pct_change()
    close_change_drop = close_change.dropna()
    
    n_obs = close_change_drop.size
    mean = close_change_drop.mean()
    std = close_change_drop.std()
    
    criteria = abs(close_change_drop - mean) / std < 1
    close_1std = close_change_drop[criteria]
    
    criteria = abs(close_change_drop - mean) / std < 2
    close_2std = close_change_drop[criteria]

    criteria = abs(close_change_drop - mean) / std < 3
    close_3std = close_change_drop[criteria]
    
    pct_within1 = round(close_1std.size / n_obs, 3)
    pct_within2 = round(close_2std.size / n_obs, 3)
    pct_within3 = round(close_3std.size / n_obs, 3)

    return pct_within1, pct_within2, pct_within3

In [5]:
stock_pct_finder('amzn')

(0.787, 0.957, 0.985)

In [31]:
stock_pct_finder('fb')

(0.793, 0.958, 0.989)

In [32]:
stock_pct_finder('slb')

(0.741, 0.945, 0.987)

In [33]:
stock_pct_finder('tsla')

(0.788, 0.951, 0.984)
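
The three near-identical blocks in `stock_pct_finder` can be folded into a loop. The sketch below is one possible refactoring, not the course's solution: it takes a Series of returns directly so it runs without a network connection, the helper name `pct_within_std` is hypothetical, and the data is synthetic normal noise rather than real prices.

```python
import numpy as np
import pandas as pd

def pct_within_std(returns, n_stds=(1, 2, 3)):
    """Return the fraction of values within k standard deviations of the mean."""
    z_scores = (returns - returns.mean()).abs() / returns.std()
    # The mean of a boolean Series is the fraction of True values
    return tuple(round((z_scores < k).mean(), 3) for k in n_stds)

# Synthetic, normally distributed "returns" stand in for downloaded prices
rng = np.random.default_rng(0)
fake_returns = pd.Series(rng.normal(loc=0.0, scale=0.02, size=10_000))
print(pct_within_std(fake_returns))
```

For normally distributed data the three fractions should land near 0.68, 0.95 and 0.997. The stock results above put more mass within one standard deviation and less within three, which points to heavier tails than a normal distribution.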

### Problem 2
<span  style="color:green; font-size:16px">Write a one line expression that returns the maximum number of standard deviations from the mean for any one day return.</span>

In [35]:
prices = data.DataReader('amzn', 'google')
close = prices['Close']
close_change = close.pct_change()
close_change_drop = close_change.dropna()

n_obs = close_change_drop.size
mean = close_change_drop.mean()
std = close_change_drop.std()

In [37]:
# your code here
max(abs(close_change_drop - mean) / std)

7.6175368993564412

In [39]:
# can also do this in stages

# calculate absolute z-scores (standard deviations from mean)
z_scores = abs(close_change_drop - mean) / std

# inspect the z-scores
z_scores.head()

Date
2010-01-05    0.227934
2010-01-06    0.943016
2010-01-07    0.889263
2010-01-08    1.260480
2010-01-11    1.231939
Name: Close, dtype: float64

In [42]:
# use max method
z_scores.max()

7.6175368993564412

In [41]:
# can use builtin python function
max(z_scores)

7.6175368993564412

### Problem 3
<span  style="color:green; font-size:16px">Create a Series, **`tsla`**, by getting the closing price of Tesla motors (symbol TSLA) using the pandas_datareader library. Series  **`tsla`** will be continually referred to for the rest of the problems. Output the first 10 values.</span>

In [6]:
# your code here
tsla = data.DataReader('tsla', 'google')['Close']

tsla.head()

Date
2010-06-29    23.89
2010-06-30    23.83
2010-07-01    21.96
2010-07-02    19.20
2010-07-06    16.11
Name: Close, dtype: float64

### Problem 4
<span  style="color:green; font-size:16px">There are many dozens of Series methods and attributes. Use the following methods correctly with variable **tsla** : **`rank, cummax, sample, describe, quantile, idxmax, empty`**. Make sure to read the documentation for correct usage. Use a new code cell for each command and remember to use the **`head`** method to restrict long output.</span>

In [54]:
# your code here
tsla.rank().head()

Date
2010-06-29    135.5
2010-06-30    132.0
2010-07-01     89.0
2010-07-02     16.5
2010-07-06      2.0
Name: Close, dtype: float64

In [69]:
tsla.cummax().head(10)

Date
2010-06-29    23.89
2010-06-30    23.89
2010-07-01    23.89
2010-07-02    23.89
2010-07-06    23.89
2010-07-07    23.89
2010-07-08    23.89
2010-07-09    23.89
2010-07-12    23.89
2010-07-13    23.89
Name: Close, dtype: float64

In [70]:
tsla.sample(10)

Date
2016-10-07    196.61
2011-12-13     29.45
2011-03-14     23.25
2012-02-23     34.53
2011-02-14     23.08
2010-12-09     32.05
2011-02-18     23.18
2011-10-25     28.25
2013-06-20    100.65
2012-06-27     31.96
Name: Close, dtype: float64

In [58]:
tsla.describe()

count    1616.000000
mean      127.608045
std        93.593492
min        15.800000
25%        29.480000
50%       142.170000
75%       217.915000
max       286.040000
Name: Close, dtype: float64

In [60]:
tsla.quantile(.2), tsla.quantile(.8)

(28.239999999999998, 224.81999999999999)

In [61]:
tsla.idxmax()

Timestamp('2014-09-04 00:00:00')

In [62]:
tsla.empty

False

### Problem 5
<span  style="color:green; font-size:16px"> Sort **`tsla`** by descending values and save it to variable **`tsla1`**. Output the first 10 values to check that the sort happened correctly.</span>

In [71]:
# your code here
tsla1 = tsla.sort_values(ascending=False)

tsla1.head(10)

Date
2014-09-04    286.04
2014-09-02    284.12
2015-07-20    282.26
2014-09-08    282.11
2014-09-03    281.19
2014-09-10    281.10
2014-09-11    280.31
2015-07-02    280.02
2015-07-06    279.72
2014-09-12    279.20
Name: Close, dtype: float64

### Problem 6
<span  style="color:green; font-size:16px">Use the **`sort_index`** method to re-sort **`tsla1`** back to its original ordering. Store this Series to **`tsla2`**. Use the **`equals`** method to check that **`tsla2`** does in fact equal **`tsla`**.</span>

In [81]:
# your code here
tsla2 = tsla1.sort_index()

tsla2.equals(tsla)

True

### Problem 7
<span  style="color:green; font-size:16px">Instead of using the **`equals`** method, test equality between **`tsla`** and **`tsla2`** by using the comparison operator **`==`** and the **`all`** method.</span>

In [82]:
# your code here
(tsla == tsla2).all()

True
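
The two checks agree here because `tsla` has no missing values. They differ when NaN is present, since NaN never compares equal to itself element-wise. A minimal sketch with a made-up Series:

```python
import numpy as np
import pandas as pd

a = pd.Series([1.0, np.nan, 3.0])
b = a.copy()

# NaN == NaN is False element-wise, so the == / all check fails
elementwise_match = (a == b).all()

# equals treats NaNs in the same positions as equal
equals_match = a.equals(b)
print(elementwise_match, equals_match)
```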

### Problem 8
<span  style="color:green; font-size:16px">Slicing! Using iloc, slice series **`tsla`**:<ol><li>Retrieve the first 6 elements</li><li>Retrieve every 18th element</li><li>Reverse the series</li><li>Write two ways to get every 4th element starting from the 993rd element to the 593rd</li><li>Chain your slicing multiple times by getting every other element, then every third element, then every 4th element, then every 5th element</li></ol>
</span>

Use **`head`** method to shorten long output.

In [84]:
# your code here (use keyboard shortcut ESC + B to make more cells below for each slice)
tsla.iloc[:6]

Date
2010-06-29    23.89
2010-06-30    23.83
2010-07-01    21.96
2010-07-02    19.20
2010-07-06    16.11
2010-07-07    15.80
Name: Close, dtype: float64

In [87]:
tsla.iloc[::18].head(10)

Date
2010-06-29    23.89
2010-07-26    20.95
2010-08-19    18.79
2010-09-15    21.98
2010-10-11    20.24
2010-11-04    24.90
2010-12-01    34.35
2010-12-28    26.41
2011-01-24    24.49
2011-02-17    23.60
Name: Close, dtype: float64

In [91]:
tsla.iloc[::-1].head(10)

Date
2016-11-25    196.65
2016-11-23    193.14
2016-11-22    191.17
2016-11-21    184.52
2016-11-18    185.02
2016-11-17    188.66
2016-11-16    183.93
2016-11-15    183.77
2016-11-14    181.45
2016-11-11    188.56
Name: Close, dtype: float64

In [92]:
tsla.iloc[993:592:-4].head()

Date
2014-06-10    202.30
2014-06-04    203.99
2014-05-29    210.24
2014-05-22    204.88
2014-05-16    191.56
Name: Close, dtype: float64

In [93]:
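# second way: the same slice with negative bounds (subtract the length from each bound)
tsla.iloc[993 - tsla.size:592 - tsla.size:-4].head()

Date
2014-06-10    202.30
2014-06-04    203.99
2014-05-29    210.24
2014-05-22    204.88
2014-05-16    191.56
Name: Close, dtype: float64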
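
A positive position `i` and the negative position `i - n` (with `n` the length) point at the same element, so any `iloc` slice can be rewritten with negative bounds. The sketch below checks this on a small made-up Series rather than on `tsla`.

```python
import pandas as pd

s = pd.Series(range(20))
n = s.size

# Every 4th element walking backwards from position 15, stopping before position 4
forward_bounds = s.iloc[15:4:-4]

# The same positions counted from the end of the Series
negative_bounds = s.iloc[15 - n:4 - n:-4]

print(forward_bounds.tolist(), negative_bounds.tolist())
```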

In [94]:
tsla.iloc[::2].iloc[::3].iloc[::4].iloc[::5]

Date
2010-06-29     23.89
2010-12-17     31.36
2011-06-10     27.86
2011-11-30     32.74
2012-05-23     31.02
2012-11-14     31.38
2013-05-09     69.40
2013-10-29    164.47
2014-04-23    207.99
2014-10-13    224.59
2015-04-07    203.25
2015-09-25    256.91
2016-03-18    232.74
2016-09-08    197.36
Name: Close, dtype: float64

### Problem 9
<span  style="color:green; font-size:16px"> How many days did Tesla close above 100 and below 150?</span>

In [96]:
# your code here
criteria = (tsla > 100) & (tsla < 150)

#inspect data
tsla[criteria].head(15)

Date
2013-05-28    110.33
2013-05-29    104.63
2013-05-30    104.95
2013-06-07    102.04
2013-06-10    100.05
2013-06-14    100.30
2013-06-17    102.20
2013-06-18    103.39
2013-06-19    104.68
2013-06-20    100.65
2013-06-24    101.49
2013-06-25    102.40
2013-06-26    105.72
2013-06-27    109.25
2013-06-28    107.36
Name: Close, dtype: float64

In [97]:
tsla[criteria].size

90

In [98]:
# or using sum method
criteria.sum()

90

In [101]:
# one line
((tsla > 100) & (tsla < 150)).sum()

90

### Problem 10
<span  style="color:green; font-size:16px"> How many days did Tesla close below 50 or above 200?</span>

In [104]:
# your code here
((tsla < 50) | (tsla > 200)).sum()

1299

### Problem 11
<span  style="color:green; font-size:16px"> Look up the definition of interquartile range and smartly slice your way from Series **`tsla`** until you are left with the IQR.</span>

In [7]:
# your code here 
# a few ways to do this
n = tsla.size
first_q = n // 4
third_q = n // 4 * 3

tsla.sort_values().iloc[first_q:third_q].head()

Date
2012-07-24    29.84
2011-10-28    29.87
2010-11-18    29.89
2012-06-15    29.91
2012-08-10    29.94
Name: Close, dtype: float64

In [8]:
# can use the quantile method
q1 = tsla.quantile(.25)
q3 = tsla.quantile(.75)

criteria = (tsla >= q1) & (tsla <= q3)

tsla[criteria].sort_values().head()

Date
2011-10-28    29.87
2010-11-18    29.89
2012-06-15    29.91
2012-08-10    29.94
2012-08-22    29.95
Name: Close, dtype: float64
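
The two approaches can differ by an element or two at the edges, because `quantile` interpolates between values while the positional slice cuts at whole positions. As a small sketch on made-up numbers, the `between` method expresses the same inclusive range test, and the IQR itself is just the distance between the two quartiles.

```python
import pandas as pd

prices = pd.Series([10.0, 12.0, 15.0, 18.0, 20.0, 25.0, 30.0, 40.0])

q1 = prices.quantile(.25)
q3 = prices.quantile(.75)

# between is inclusive on both ends by default, like (prices >= q1) & (prices <= q3)
middle_half = prices[prices.between(q1, q3)]

# The IQR as a single number is the width of that middle band
iqr = q3 - q1
print(middle_half.tolist(), iqr)
```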

### Problem 12
<span  style="color:green; font-size:16px">Use the **`idxmax`** method to find the index label of the highest closing price. Find out how many trading days it has been since tsla recorded its highest closing price.</span>

In [9]:
# your code here
highest_index = tsla.idxmax()

highest_index

Timestamp('2014-09-04 00:00:00')

In [10]:
tsla.loc[highest_index:].size

634

### Problem 13
<span  style="color:green; font-size:16px">Assign Series tsla to tsla3 **`tsla3 = tsla`** and then change the value of the 0th element of **`tsla3`** to 100. What has happened to **`tsla`**? We didn't talk about series assignment, but does it work as you would expect?</span>

In [119]:
# your code here
tsla3 = tsla

tsla3.iloc[0] = 100

tsla.head() # mutation!

Date
2010-06-29    100.00
2010-06-30     23.83
2010-07-01     21.96
2010-07-02     19.20
2010-07-06     16.11
Name: Close, dtype: float64

### Problem 14
<span  style="color:green; font-size:16px">Assign every 2nd element of **`tsla3`** the mean of **`tsla3`**. What happened to **`tsla`**? Is it the same size? Now assign elements 100 to 900 of **`tsla`** the variance of **`tsla`**. How much has each Series changed?</span>

In [120]:
# your code here
tsla3.iloc[::2] = tsla3.mean()

In [121]:
# inspect - looks like it mutated tsla as well
tsla.head(10)

Date
2010-06-29    127.655142
2010-06-30     23.830000
2010-07-01    127.655142
2010-07-02     19.200000
2010-07-06    127.655142
2010-07-07     15.800000
2010-07-08    127.655142
2010-07-09     17.400000
2010-07-12    127.655142
2010-07-13     18.140000
Name: Close, dtype: float64

In [122]:
tsla.iloc[100:900] = tsla.var()

In [124]:
# they are same objects
tsla is tsla3

True

In [126]:
# they still are same length. Each of the elements 100 - 900 got replaced by the variance
tsla.size

1616

### Problem 15
<span  style="color:green; font-size:16px">Get the original **`tsla`** dataset back by running the code in Problem 3 again. Now redo Problem 13, except assign **`tsla3`** to be a copy of **`tsla`** using the **`copy`** Series method.</span>

In [131]:
# your code here
tsla = data.DataReader('tsla', 'google')['Close']

tsla3 = tsla.copy()

In [133]:
tsla3.iloc[0] = 100

tsla.head() # no mutation! It's a different object

Date
2010-06-29    23.89
2010-06-30    23.83
2010-07-01    21.96
2010-07-02    19.20
2010-07-06    16.11
Name: Close, dtype: float64
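
Problems 13 to 15 come down to the difference between a second name for one object and a separate copy. The sketch below shows both side by side on a tiny made-up Series; the names `alias` and `independent` are illustrative.

```python
import pandas as pd

original = pd.Series([23.89, 23.83, 21.96])

alias = original                # a second name for the same object
independent = original.copy()   # a new object with its own data

alias.iloc[0] = 100             # also visible through `original`
independent.iloc[1] = -1        # leaves `original` untouched

print(alias is original, independent is original)
print(original.tolist())
```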

### Problem 16
<span  style="color:green; font-size:16px">Use the inversion operator, the tilde (**`~`**), to output a Series of all **`tsla`** values greater than 270.</span>

In [140]:
# your code here
tsla[~(tsla <= 270)]

Date
2014-09-02    284.12
2014-09-03    281.19
2014-09-04    286.04
2014-09-05    277.39
2014-09-08    282.11
2014-09-09    278.48
2014-09-10    281.10
2014-09-11    280.31
2014-09-12    279.20
2015-07-02    280.02
2015-07-06    279.72
2015-07-17    274.66
2015-07-20    282.26
2015-08-05    270.13
Name: Close, dtype: float64

### Problem 17
<span  style="color:green; font-size:16px">Read the documentation on the **`where`** Series method and use it with **`tsla`** to change all values less than 100 to 0.</span>

In [11]:
# where keeps values that satisfy the condition and replaces the rest
tsla.where(tsla >= 100, 0)

Date
2010-06-29      0.00
2010-06-30      0.00
2010-07-01      0.00
2010-07-02      0.00
2010-07-06      0.00
2010-07-07      0.00
2010-07-08      0.00
2010-07-09      0.00
2010-07-12      0.00
2010-07-13      0.00
2010-07-14      0.00
2010-07-15      0.00
2010-07-16      0.00
2010-07-19      0.00
2010-07-20      0.00
2010-07-21      0.00
2010-07-22      0.00
2010-07-23      0.00
2010-07-26      0.00
2010-07-27      0.00
2010-07-28      0.00
2010-07-29      0.00
2010-07-30      0.00
2010-08-02      0.00
2010-08-03      0.00
2010-08-04      0.00
2010-08-05      0.00
2010-08-06      0.00
2010-08-09      0.00
2010-08-10      0.00
               ...  
2017-01-27    252.95
2017-01-30    250.63
2017-01-31    251.93
2017-02-01    249.24
2017-02-02    251.55
2017-02-03    251.33
2017-02-06    257.77
2017-02-07    257.48
2017-02-08    262.08
2017-02-09    269.20
2017-02-10    269.23
2017-02-13    280.60
2017-02-14    280.98
2017-02-15    279.76
2017-02-16    268.95
2017-02-17    272.23
2017-02-

### Problem 18
<span  style="color:green; font-size:16px">Read the documentation on the **`mask`** Series method and use it with **`tsla`** to change all values greater than or equal to 100 to 0.</span>

In [146]:
# your code here
tsla.mask(tsla >= 100, 0)

Date
2010-06-29    23.89
2010-06-30    23.83
2010-07-01    21.96
2010-07-02    19.20
2010-07-06    16.11
2010-07-07    15.80
2010-07-08    17.46
2010-07-09    17.40
2010-07-12    17.05
2010-07-13    18.14
2010-07-14    19.84
2010-07-15    19.89
2010-07-16    20.64
2010-07-19    21.91
2010-07-20    20.30
2010-07-21    20.22
2010-07-22    21.00
2010-07-23    21.29
2010-07-26    20.95
2010-07-27    20.55
2010-07-28    20.72
2010-07-29    20.35
2010-07-30    19.94
2010-08-02    20.92
2010-08-03    21.95
2010-08-04    21.26
2010-08-05    20.45
2010-08-06    19.59
2010-08-09    19.60
2010-08-10    19.03
              ...  
2016-10-14     0.00
2016-10-17     0.00
2016-10-18     0.00
2016-10-19     0.00
2016-10-20     0.00
2016-10-21     0.00
2016-10-24     0.00
2016-10-25     0.00
2016-10-26     0.00
2016-10-27     0.00
2016-10-28     0.00
2016-10-31     0.00
2016-11-01     0.00
2016-11-02     0.00
2016-11-03     0.00
2016-11-04     0.00
2016-11-07     0.00
2016-11-08     0.00
2016-11-09     
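
`where` and `mask` are mirror images: `where` keeps the values where the condition is True, and `mask` replaces them. The sketch below shows this on three made-up prices, including one that sits exactly on the boundary.

```python
import pandas as pd

prices = pd.Series([50.0, 100.0, 150.0])

# where keeps values where the condition is True and replaces the rest
kept_high = prices.where(prices >= 100, 0)

# mask replaces values where the condition is True and keeps the rest
kept_low = prices.mask(prices >= 100, 0)

# mask with a condition matches where with the inverted condition
same = kept_low.equals(prices.where(~(prices >= 100), 0))
print(kept_high.tolist(), kept_low.tolist(), same)
```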

# 3. The DataFrame

In [14]:
import pandas as pd
import numpy as np
df_coh = pd.read_csv('data/coh_employee.csv')

### Problem 1
<span  style="color:green; font-size:16px">What object is returned from **`df_coh.describe(include=['object']).T`**?</span>

In [162]:
# your code here

# another dataframe
type(df_coh.describe(include=['object']).T)

pandas.core.frame.DataFrame

### Problem 2
<span  style="color:green; font-size:16px">What object are the values of an **`Index`**?</span>

In [163]:
# your code here
# it's a numpy array
type(df_coh.index.values)

numpy.ndarray

### Problem 3
<span  style="color:green; font-size:16px">Retrieve the LAST_NAME column and assign it to a variable. Count the number of employees with a last name of 'Johnson'.</span>

In [166]:
# your code here
last_name = df_coh['LAST_NAME']
criteria = last_name == 'JOHNSON'

criteria.sum()

30

In [167]:
# one line
(df_coh['LAST_NAME'] == 'JOHNSON').sum()

30

In [169]:
# also one line
sum(df_coh['LAST_NAME'] == 'JOHNSON')

30

### Problem 4
<span  style="color:green; font-size:16px">Count all the occurrences of each value in the column PAY_GRADE. Remember there is a single Series method that does this.</span>

In [171]:
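# your code here
# value_counts is the single method; this cell applies it to RACE,
# and df_coh['PAY_GRADE'].value_counts() answers the prompt as worded
df_coh['RACE'].value_counts()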

Black or African American            700
White                                665
Hispanic/Latino                      480
Asian/Pacific Islander               107
American Indian or Alaskan Native     11
Others                                 2
Name: RACE, dtype: int64

### Problem 5
<span  style="color:green; font-size:16px">How many employees are White Females or 'Black or African American' Males and make more than 100,000? Create variables for each of three categories below before putting them all together in one criteria.</span>

In [18]:
# your code here
white_female = (df_coh['RACE'] == 'White') & (df_coh['GENDER'] == 'Female')
black_male = (df_coh['RACE'] == 'Black or African American') & (df_coh['GENDER'] == 'Male')
over_100k = df_coh['BASE_SALARY'] > 100000

criteria = (white_female | black_male) & over_100k

criteria.sum()


20
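
The parentheses around `white_female | black_male` matter because `&` binds more tightly than `|`. The sketch below uses a three-row made-up frame (not the real employee data) to show how the count changes when they are dropped.

```python
import pandas as pd

df = pd.DataFrame({
    'RACE': ['White', 'White', 'Black or African American'],
    'GENDER': ['Female', 'Male', 'Male'],
    'BASE_SALARY': [80000, 130000, 120000],
})

white_female = (df['RACE'] == 'White') & (df['GENDER'] == 'Female')
black_male = (df['RACE'] == 'Black or African American') & (df['GENDER'] == 'Male')
over_100k = df['BASE_SALARY'] > 100000

# Intended meaning: either group, and in both cases over 100k
grouped = (white_female | black_male) & over_100k

# Without parentheses this reads as white_female | (black_male & over_100k)
ungrouped = white_female | black_male & over_100k

print(grouped.sum(), ungrouped.sum())
```

The ungrouped version counts the white female employee earning 80,000, which is not what the question asks for.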

### Problem 6
<span  style="color:green; font-size:16px">What is the third most common department?</span>

In [178]:
# your code here
# from the Series below it is Public Works & Engineering-PWE
df_coh['DEPARTMENT'].value_counts()

Houston Police Department-HPD     638
Houston Fire Department (HFD)     384
Public Works & Engineering-PWE    343
Health & Human Services           110
Houston Airport System (HAS)      106
Parks & Recreation                 74
Solid Waste Management             43
Library                            36
Fleet Management Department        36
Admn. & Regulatory Affairs         29
Municipal Courts Department        28
Human Resources Dept.              24
Houston Emergency Center (HEC)     23
Housing and Community Devp.        22
General Services Department        22
Legal Department                   17
Dept of Neighborhoods (DON)        17
City Council                       11
Finance                            10
Houston Information Tech Svcs       9
Planning & Development              7
City Controller's Office            5
Mayor's Office                      5
Convention and Entertainment        1
Name: DEPARTMENT, dtype: int64

In [179]:
# Can do this programmatically
df_coh['DEPARTMENT'].value_counts().index.values[2]

'Public Works & Engineering-PWE'
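
This works because `value_counts` returns its result sorted from most to least common, so position 2 holds the third most common label. A small sketch on made-up department names:

```python
import pandas as pd

departments = pd.Series(['Police', 'Police', 'Police', 'Fire', 'Fire', 'Library'])

counts = departments.value_counts()   # sorted from most to least common
third_label = counts.index[2]         # label at position 2
third_count = counts.iloc[2]          # its count
print(third_label, third_count)
```

When two departments tie (as Library and Fleet Management do above, at 36 each), their relative order in the result should not be relied on.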

### Problem 7
<span  style="color:green; font-size:16px">Who makes more money, 'Black or African American' Females or White Males? Use RACE and GENDER columns.</span>

In [203]:
criteria = (df_coh['RACE'] == 'Black or African American') & (df_coh['GENDER'] == 'Female')
black_female = df_coh[criteria]

criteria = (df_coh['RACE'] == 'White') & (df_coh['GENDER'] == 'Male')
white_male = df_coh[criteria]

black_female['BASE_SALARY'].mean(), white_male['BASE_SALARY'].mean()

(48915.42123287671, 63940.38811881188)

### Problem 8
<span  style="color:green; font-size:16px">Who makes more money, 'POLICE OFFICER', 'ENGINEER/OPERATOR' or 'ELECTRICIAN'? Use the POSITION_TITLE column.</span>

In [204]:
# your code here
police = df_coh[df_coh['POSITION_TITLE'] == 'POLICE OFFICER']
engineer = df_coh[df_coh['POSITION_TITLE'] == 'ENGINEER/OPERATOR']
electrician = df_coh[df_coh['POSITION_TITLE'] == 'ELECTRICIAN']


police['BASE_SALARY'].mean(), engineer['BASE_SALARY'].mean(), electrician['BASE_SALARY'].mean()

(52592.99456521739, 62606.36781609195, 49816.0)

### Problem 9
<span  style="color:green; font-size:16px">Select employee 28693.</span>

In [181]:
# your code here
criteria = df_coh['UNIQUE_ID'] == 28693
df_coh[criteria]

|  | UNIQUE_ID | FIRST_NAME | LAST_NAME | POSITION_NUMBER | POSITION_JOB_CODE | POSITION_TITLE | BUSINESS_AREA | DEPARTMENT | PAY_GRADE | FUND_ID | ... | EMPLOYMENT_SUB_GROUP | EXEMPT | GENDER | EMPLOYMENT_STATUS | CIVIL_SERVICE_TYPE | HIRE_DATE | JOB_DATE | COMP_DATE | EEOJ | SNAPSHOT_DATE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2 | 28693 | JONATHAN | JORSCH JR. | 30063393 | 108.0 | POLICE OFFICER | 1000 | Houston Police Department-HPD | PA03 | 1000 | ... | Non-Exempt Excptn | N | Male | Active | CLASSIFIED | 2015-02-03 | 2015-02-03 | 2014-02-03 | D: Protective Service Workers | 2016-06-01 |


In [182]:
# can do it in one line
df_coh[df_coh['UNIQUE_ID'] == 28693]

|  | UNIQUE_ID | FIRST_NAME | LAST_NAME | POSITION_NUMBER | POSITION_JOB_CODE | POSITION_TITLE | BUSINESS_AREA | DEPARTMENT | PAY_GRADE | FUND_ID | ... | EMPLOYMENT_SUB_GROUP | EXEMPT | GENDER | EMPLOYMENT_STATUS | CIVIL_SERVICE_TYPE | HIRE_DATE | JOB_DATE | COMP_DATE | EEOJ | SNAPSHOT_DATE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2 | 28693 | JONATHAN | JORSCH JR. | 30063393 | 108.0 | POLICE OFFICER | 1000 | Houston Police Department-HPD | PA03 | 1000 | ... | Non-Exempt Excptn | N | Male | Active | CLASSIFIED | 2015-02-03 | 2015-02-03 | 2014-02-03 | D: Protective Service Workers | 2016-06-01 |


### Problem 10
<span  style="color:green; font-size:16px">Select rows 7, 77 and 777 from **`df_coh`**.</span>

In [180]:
# your code here
df_coh.iloc[[7, 77, 777]]

|  | UNIQUE_ID | FIRST_NAME | LAST_NAME | POSITION_NUMBER | POSITION_JOB_CODE | POSITION_TITLE | BUSINESS_AREA | DEPARTMENT | PAY_GRADE | FUND_ID | ... | EMPLOYMENT_SUB_GROUP | EXEMPT | GENDER | EMPLOYMENT_STATUS | CIVIL_SERVICE_TYPE | HIRE_DATE | JOB_DATE | COMP_DATE | EEOJ | SNAPSHOT_DATE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 7 | 28212 | JEFFREY | DELLING | 30043726 | 520.3 | CARPENTER | 2800 | Houston Airport System (HAS) | 14 | 8001 | ... | Non Exempt Postv | N | Male | Active | CIVILIAN | 2013-11-04 | 2013-11-04 | 2013-11-04 | G: Skilled Craft Workers | 2016-06-01 |
| 77 | 23457 | DANIEL | RIVERA | 30036326 | 108.0 | POLICE OFFICER | 1000 | Houston Police Department-HPD | PA03 | 1000 | ... | Non-Exempt Excptn | N | Male | Active | CLASSIFIED | 2010-03-29 | 2011-03-29 | 2010-03-29 | D: Protective Service Workers | 2016-06-01 |
| 777 | 31942 | RAMON | GUILLEN | 30064705 | 108.6 | POLICE OFFICER,PROBATIONARY | 1000 | Houston Police Department-HPD | PA02 | 1000 | ... | Non-Exempt Excptn | N | Male | Active | CLASSIFIED | 2015-10-12 | 2016-04-27 | 2015-10-12 | D: Protective Service Workers | 2016-06-01 |


### Problem 11
<span  style="color:green; font-size:16px">Select employees with IDs 3105, 24767 and 31578.</span>

In [183]:
df_coh = df_coh.set_index('UNIQUE_ID')

In [184]:
# your code here
df_coh.loc[[3105, 24767, 31578]]

| UNIQUE_ID | FIRST_NAME | LAST_NAME | POSITION_NUMBER | POSITION_JOB_CODE | POSITION_TITLE | BUSINESS_AREA | DEPARTMENT | PAY_GRADE | FUND_ID | FUND_NAME | ... | EMPLOYMENT_SUB_GROUP | EXEMPT | GENDER | EMPLOYMENT_STATUS | CIVIL_SERVICE_TYPE | HIRE_DATE | JOB_DATE | COMP_DATE | EEOJ | SNAPSHOT_DATE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3105 | FRANCISCO | ORTIZ | 30001931 | 106.4 | POLICE SERGEANT | 1000 | Houston Police Department-HPD | PA06 | 1000 | General Fund | ... | Non-Exempt Excptn | N | Male | Active | CLASSIFIED | 1983-09-12 | 1998-11-28 | 1984-08-01 | C: Technicians | 2016-06-01 |
| 24767 | JOSHUA | YERIAN | 30022228 | 103.3 | FIRE FIGHTER | 1200 | Houston Fire Department (HFD) | FD03 | 1000 | General Fund | ... | NE Suprsn Excptn | N | Male | Active | CLASSIFIED | 2011-08-22 | 2012-11-22 | 2011-08-22 | D: Protective Service Workers | 2016-06-01 |
| 31578 | MARK | SIMS | 30028122 | 513.4 | SEMI-SKILLED LABORER | 2100 | Solid Waste Management | 06 | 1000 | General Fund | ... | Non Exempt Postv | N | Male | Active | CIVILIAN | 2015-08-31 | 2015-08-31 | 2015-08-31 | H: Service/Maintenance | 2016-06-01 |


### Problem 12
<span  style="color:green; font-size:16px">Brackets are used to select disjoint rows or columns. Select rows 10, 100 and 500 along with columns 3, 7 and 20.</span>

In [185]:
# your code here
df_coh.iloc[[10, 100, 500], [3, 7, 20]]

| UNIQUE_ID | POSITION_JOB_CODE | PAY_GRADE | EMPLOYMENT_SUB_GROUP |
| --- | --- | --- | --- |
| 18633 | 103.3 | FD03 | NE Suprsn Excptn |
| 12214 | 103.1 | FD05 | NE Suprsn Excptn |
| 32485 | 721.1 | 12 | Non Exempt Postv |


### Problem 13
<span  style="color:green; font-size:16px">Use label based indexing to select employee IDs 12214 and 8317 along with columns HIRE_DATE and GENDER.</span>

In [188]:
# your code here
df_coh.loc[[12214, 8317], ['HIRE_DATE', 'GENDER']]

| UNIQUE_ID | HIRE_DATE | GENDER |
| --- | --- | --- |
| 12214 | 2000-03-21 | Male |
| 8317 | 1994-04-11 | Male |


### Problem 14: Advanced
<span  style="color:green; font-size:16px">Use label based indexing to select employee IDs 12214 and 8317 along with all columns except FIRST_NAME and POSITION_NUMBER. [Use this stackoverflow answer for guidance.](http://stackoverflow.com/a/37441204/3707607)</span>

In [195]:
# your code here
df_coh.loc[[12214, 8317], df_coh.columns.difference(['FIRST_NAME', 'POSITION_NUMBER'])]

| UNIQUE_ID | BASE_SALARY | BUSINESS_AREA | CIVIL_SERVICE_TYPE | COMP_DATE | COST_CENTER | COST_CENTER_NAME | DEPARTMENT | EEOJ | EMPLOYMENT_STATUS | EMPLOYMENT_SUB_GROUP | EMPLOYMENT_TYPE | ETHNICITY | EXEMPT | FUND_ID | FUND_NAME | GENDER | HIRE_DATE | JOB_DATE | LAST_NAME | MANAGER_NAME | ORG_UNIT | ORG_UNIT_NAME | PAY_GRADE | POSITION_JOB_CODE | POSITION_TITLE | RACE | REPORTS_TO_POSITION | SNAPSHOT_DATE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 12214 | 66523.0 | 1200 | CLASSIFIED | 2000-03-21 | 1200030001 | HFD-Deployment | Houston Fire Department (HFD) | B: Professionals | Active | NE Suprsn Excptn | Full Time | WHITE, NOT OF HISPANIC ORIGIN | N | 1000 | General Fund | Male | 2000-03-21 | 2009-09-29 | MADDIN |  | 12.1210 | HFD-FIRE & EMS OPERATIONS | FD05 | 103.1 | CAPTAIN | White |  | 2016-06-01 |
| 8317 | 66614.0 | 1000 | CLASSIFIED | 1994-04-11 | 1000010027 | HPD-North Ptrl | Houston Police Department-HPD | D: Protective Service Workers | Active | Non-Exempt Excptn | Full Time | WHITE, NOT OF HISPANIC ORIGIN | N | 1000 | General Fund | Male | 1994-04-11 | 2014-04-12 | JOHNSON | JOHN NICKELL | 10.1622.0230 | HPD-NORTH PATROL | PA04 | 108.2 | SENIOR POLICE OFFICER | White | 30005911.0 | 2016-06-01 |


### Problem 15
<span  style="color:green; font-size:16px">Use **`.iat`** correctly and explain what happened.</span>

In [19]:
# your code here
# selects the 11th row and 6th column value
df_coh.iat[10, 5]

'FIRE FIGHTER'

### Problem 16
<span  style="color:green; font-size:16px">Use the **`timeit`** magic command to see the speed difference between **`.at`** and **`.loc`** for the same selection. How much faster is **`.at`**?</span>

In [200]:
%timeit df_coh.loc[5123, 'GENDER']

The slowest run took 7.24 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 176 µs per loop


In [201]:
%timeit df_coh.at[5123, 'GENDER']

The slowest run took 28.19 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 6.51 µs per loop


**`.at`** is about 27 times faster here (176 µs versus 6.51 µs per loop)
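
The speed gap reflects what each accessor is built for: `.loc` handles scalars, lists, slices and boolean arrays, while `.at` only looks up a single cell. The sketch below, on a two-row made-up frame, shows that both return the same scalar for the same labels.

```python
import pandas as pd

df = pd.DataFrame(
    {'GENDER': ['Male', 'Female'], 'BASE_SALARY': [50000.0, 60000.0]},
    index=[5123, 8317],
)

via_loc = df.loc[5123, 'GENDER']   # general-purpose label indexer
via_at = df.at[5123, 'GENDER']     # scalar-only label accessor

# .loc also accepts lists and slices; .at takes exactly one row and one column label
subset = df.loc[[5123, 8317], ['GENDER']]
print(via_loc, via_at, subset.shape)
```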

### Problem 17
<span  style="color:green; font-size:16px">Create new columns **`BONUS`** and **`TOTAL_COMP`**. Use column **`RANDOM_BONUS`** to calculate the bonus.</span>

In [205]:
df_coh['RANDOM_BONUS'] = np.random.rand(len(df_coh)) * .1

In [207]:
# your code here
df_coh['BONUS'] = df_coh['RANDOM_RAISE'] * df_coh['BASE_SALARY']
df_coh['TOTAL_COMP'] = df_coh['BASE_SALARY'] + df_coh['BONUS']

df_coh.iloc[:3, -5:]

Unnamed: 0_level_0,EEOJ,SNAPSHOT_DATE,RANDOM_RAISE,BONUS,TOTAL_COMP
UNIQUE_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
9172,A: Officials & Administrators,2016-06-01,0.06776,8257.394152,130119.394152
12311,E: Para-Professionals,2016-06-01,0.085042,2221.733171,28346.733171
28693,D: Protective Service Workers,2016-06-01,0.041886,1896.560524,47175.560524
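An alternative that leaves the original DataFrame untouched is `assign`, which returns a new frame with the extra columns. This is a sketch on made-up numbers (the `toy` frame is hypothetical); later keyword arguments can refer to columns created by earlier ones:

```python
import pandas as pd

# Hypothetical salaries and bonus rates
toy = pd.DataFrame({
    "BASE_SALARY": [100000.0, 50000.0],
    "RANDOM_BONUS": [0.05, 0.10],
})

# assign returns a copy; each lambda receives the frame built so far
with_comp = toy.assign(
    BONUS=lambda d: d["RANDOM_BONUS"] * d["BASE_SALARY"],
    TOTAL_COMP=lambda d: d["BASE_SALARY"] + d["BONUS"],
)

print(with_comp)
```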


### Problem 18
<span  style="color:green; font-size:16px">Use the **`EXPERIENCE_LEVEL`** column to determine if more experienced employees make more money.</span>

In [210]:
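# pd.to_datetime parses the dates; astype(np.datetime64) may fail on recent pandas
df_coh['HIRE_DATE'] = pd.to_datetime(df_coh['HIRE_DATE'])
today = pd.to_datetime('today')
experience = today - df_coh['HIRE_DATE']
# divide by an average year; np.timedelta64(1, 'Y') may be rejected by recent pandas
years_experience = experience / pd.Timedelta(days=365.2425)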
df_coh['YEARS_EXPERIENCE'] = years_experience
exp_categories = pd.cut(years_experience, bins=[0, 5, 15, 100], labels=['Novice', 'Experienced', 'Senior'])
df_coh['EXPERIENCE_LEVEL'] = exp_categories

In [212]:
# your code here
novice = df_coh[df_coh['EXPERIENCE_LEVEL'] == 'Novice']['BASE_SALARY']
exper = df_coh[df_coh['EXPERIENCE_LEVEL'] == 'Experienced']['BASE_SALARY']
senior = df_coh[df_coh['EXPERIENCE_LEVEL'] == 'Senior']['BASE_SALARY']

novice.mean(), exper.mean(), senior.mean()

(44987.484, 55264.92867981791, 63638.224209078406)

Yes: the mean base salary increases with each experience level.
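The three separate filters can also be written as a single `groupby`. The sketch below uses a hypothetical `toy` frame with made-up salaries; `observed=True` is passed only to keep the result limited to categories that actually occur:

```python
import pandas as pd

# Hypothetical data: years of experience and salaries are made up
toy = pd.DataFrame({
    "YEARS_EXPERIENCE": [1.0, 3.0, 8.0, 12.0, 20.0, 30.0],
    "BASE_SALARY": [40000.0, 50000.0, 52000.0, 58000.0, 60000.0, 70000.0],
})

# Same bins and labels as the cell above
toy["EXPERIENCE_LEVEL"] = pd.cut(
    toy["YEARS_EXPERIENCE"],
    bins=[0, 5, 15, 100],
    labels=["Novice", "Experienced", "Senior"],
)

# One mean per experience level, in category order
mean_by_level = toy.groupby("EXPERIENCE_LEVEL", observed=True)["BASE_SALARY"].mean()
print(mean_by_level)
```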

# 4. More on DataFrames

In [4]:
college = pd.read_csv('data/college.csv', index_col='INSTNM')

### Problem 1
<span  style="color:green; font-size:16px">Re-read the college.csv file into the variable **`college2`**. Use the documentation of the **`read_csv`** function to assign the index column INSTNM on read, skip the first 20 rows but keep the header.</span>

In [5]:
college2 = pd.read_csv('data/college.csv', index_col='INSTNM', header=0, skiprows=range(1,21))
college2.head()

INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
George C Wallace State Community College-Hanceville,Hanceville,AL,0.0,0.0,0.0,0,,,0.0,4920.0,0.863,0.0612,0.0362,0.0065,0.0089,0.0,0.0,0.0059,0.0183,0.4203,1,0.5026,0.4192,0.3229,28800,11186
George C Wallace State Community College-Selma,Selma,AL,0.0,0.0,0.0,0,,,0.0,1513.0,0.1956,0.7449,0.0026,0.004,0.0013,0.0,0.0033,0.004,0.0443,0.384,1,0.7645,0.0,0.3318,24200,PrivacySuppressed
Herzing University-Birmingham,Birmingham,AL,0.0,0.0,0.0,0,,,0.0,302.0,0.3543,0.5265,0.0166,0.0066,0.0,0.0,0.0563,0.0,0.0397,0.5497,1,0.6541,0.7736,0.7813,42300,23216.5
Huntingdon College,Montgomery,AL,0.0,0.0,0.0,1,510.0,490.0,0.0,1149.0,0.6388,0.1993,0.0252,0.0078,0.0122,0.0017,0.0261,0.0061,0.0827,0.2097,1,0.3982,0.7153,0.1937,36500,26230
Heritage Christian University,Florence,AL,0.0,0.0,0.0,1,,,0.0,62.0,0.7419,0.1129,0.0484,0.0,0.0323,0.0161,0.0,0.0161,0.0323,0.4355,1,0.6087,0.4493,0.5942,PrivacySuppressed,PrivacySuppressed
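Passing a list-like to `skiprows` drops those specific file line numbers, so starting the range at 1 leaves line 0 (the header) in place. A small self-contained sketch using an in-memory CSV (the school and city names are made up):

```python
import io

import pandas as pd

# Hypothetical five-row CSV with a header line
csv_text = "INSTNM,CITY\n" + "\n".join(
    f"School {i},City {i}" for i in range(1, 6)
)

# range(1, 4) skips file lines 1-3, i.e. the first three data rows;
# line 0 is not skipped, so header=0 still finds the column names
skipped = pd.read_csv(
    io.StringIO(csv_text),
    index_col="INSTNM",
    header=0,
    skiprows=range(1, 4),
)

print(skipped)
```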


### Problem 2
<span  style="color:green; font-size:16px">Take a close look at the **`min`** and **`max`** columns. Many columns range from 0 to 1. What kind of data do you think they represent?</span>

They are probably either binary indicator variables (only 0 or 1) or proportions stored as real numbers between 0 and 1.
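One way to tell the two cases apart is to count distinct values: an indicator has at most two, while a proportion usually has many. A sketch with made-up values for two column names borrowed from the dataset:

```python
import pandas as pd

# Hypothetical values; only the column names come from college.csv
toy = pd.DataFrame({
    "HBCU": [0.0, 1.0, 0.0, 0.0],      # indicator: only 0 or 1
    "PCTPELL": [0.0, 0.35, 0.7, 1.0],  # proportion between 0 and 1
})

# Both columns share the same min and max ...
summary = toy.describe().T[["min", "max"]]

# ... but differ in how many distinct values they hold
distinct = toy.nunique()

print(summary)
print(distinct)
```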

### Problem 3
<span  style="color:green; font-size:16px">Sort first by **`STABBR`** ascending and then by **`CITY`** descending. Read the docs on **`sort_values`** to learn how to sort two columns at the same time.</span>

In [6]:
# your code here
college.sort_values(by=['STABBR', 'CITY'], ascending=[True, False]).head()

INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
Alaska Christian College,Soldotna,AK,0.0,0.0,0.0,1,,,0.0,68.0,0.0588,0.0,0.0147,0.0,0.7794,0.0,0.0147,0.0,0.1324,0.0735,1,0.8868,0.6792,0.2264,,PrivacySuppressed
AVTEC-Alaska's Institute of Technology,Seward,AK,0.0,0.0,0.0,0,,,0.0,889.0,0.5388,0.0112,0.0427,0.0157,0.1879,0.0112,0.0529,0.0,0.1395,0.6817,1,0.0737,0.0664,0.7127,33500.0,PrivacySuppressed
Alaska Bible College,Palmer,AK,0.0,0.0,0.0,1,,,0.0,27.0,0.8519,0.0,0.037,0.0,0.0741,0.0,0.037,0.0,0.0,0.1481,1,0.3571,0.2857,0.4286,,PrivacySuppressed
University of Alaska Southeast,Juneau,AK,0.0,0.0,0.0,0,,,0.0,1428.0,0.4748,0.0119,0.0623,0.0357,0.1029,0.0147,0.0686,0.0049,0.2241,0.5112,1,0.1769,0.1996,0.555,37400.0,16875
University of Alaska Fairbanks,Fairbanks,AK,0.0,0.0,0.0,0,,,0.0,5536.0,0.4259,0.021,0.0522,0.0126,0.1284,0.0027,0.0401,0.011,0.306,0.3887,1,0.2263,0.255,0.4519,36200.0,19355
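The `ascending` list is matched position by position with the `by` list. A four-row sketch (hypothetical rows) makes the mixed ordering easier to see than the full dataset does:

```python
import pandas as pd

# Hypothetical rows, two states with two cities each
toy = pd.DataFrame({
    "STABBR": ["TX", "AK", "TX", "AK"],
    "CITY": ["Austin", "Juneau", "Waco", "Seward"],
})

# STABBR sorts A-Z; within each state, CITY sorts Z-A
ordered = toy.sort_values(by=["STABBR", "CITY"], ascending=[True, False])

print(ordered)
```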


### Problem 4
<span  style="color:green; font-size:16px">Rename column **`HBCU`** to **`HISTORICALLY_BLACK`**, **`STABBR`** to **`STATE_ABBR`** and index **`Alabama State University`** to **`ASU`** all in one line of code. </span>

In [7]:
# your code here
college.rename(columns={'HBCU':'HISTORICALLY_BLACK', 'STABBR':'STATE_ABBR'}, 
               index={'Alabama State University':'ASU'})

INSTNM,CITY,STATE_ABBR,HISTORICALLY_BLACK,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0000,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888
University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,0.5922,0.2600,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.0100,0.2607,1,0.3460,0.5214,0.2422,39700,21941.5
Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,1.0,291.0,0.2990,0.4192,0.0069,0.0034,0.0000,0.0000,0.0000,0.0000,0.2715,0.4536,1,0.6801,0.7795,0.8540,40100,23370
University of Alabama in Huntsville,Huntsville,AL,0.0,0.0,0.0,0,595.0,590.0,0.0,5451.0,0.6988,0.1255,0.0382,0.0376,0.0143,0.0002,0.0172,0.0332,0.0350,0.2146,1,0.3072,0.4596,0.2640,45500,24097
ASU,Montgomery,AL,1.0,0.0,0.0,0,425.0,430.0,0.0,4811.0,0.0158,0.9208,0.0121,0.0019,0.0010,0.0006,0.0098,0.0243,0.0137,0.0892,1,0.7347,0.7554,0.1270,26600,33118.5
The University of Alabama,Tuscaloosa,AL,0.0,0.0,0.0,0,555.0,565.0,0.0,29851.0,0.7825,0.1119,0.0348,0.0106,0.0038,0.0009,0.0261,0.0268,0.0026,0.0844,1,0.2040,0.4010,0.0853,41900,23750
Central Alabama Community College,Alexander City,AL,0.0,0.0,0.0,0,,,0.0,1592.0,0.7255,0.2613,0.0044,0.0025,0.0044,0.0000,0.0000,0.0000,0.0019,0.3882,1,0.5892,0.3977,0.3153,27500,16127
Athens State University,Athens,AL,0.0,0.0,0.0,0,,,0.0,2991.0,0.7823,0.1200,0.0191,0.0053,0.0157,0.0010,0.0174,0.0057,0.0334,0.5517,1,0.4088,0.6296,0.6410,39000,18595
Auburn University at Montgomery,Montgomery,AL,0.0,0.0,0.0,0,486.0,509.0,0.0,4304.0,0.5328,0.3376,0.0074,0.0221,0.0044,0.0016,0.0297,0.0397,0.0246,0.2853,1,0.4192,0.5803,0.2930,35000,21335
Auburn University,Auburn,AL,0.0,0.0,0.0,0,575.0,588.0,0.0,20514.0,0.8507,0.0704,0.0248,0.0227,0.0074,0.0000,0.0000,0.0100,0.0140,0.0862,1,0.1610,0.3494,0.0415,45700,21831
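Note that `rename` returns a new DataFrame by default, so the result has to be assigned if it is needed later. A minimal sketch on a hypothetical two-row frame:

```python
import pandas as pd

# Hypothetical two-row frame using names from the problem
toy = pd.DataFrame(
    {"HBCU": [1.0, 0.0]},
    index=["Alabama State University", "Auburn University"],
)

# Labels missing from the mappings are left as they are
renamed = toy.rename(
    columns={"HBCU": "HISTORICALLY_BLACK"},
    index={"Alabama State University": "ASU"},
)

print(list(toy.columns), list(renamed.columns))
```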


### Problem 5
<span  style="color:green; font-size:16px">Sort the index in-place. Output the head of the DataFrame.</span>

In [8]:
# your code here
college.sort_index(inplace=True)

In [9]:
college.head()

INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
A & W Healthcare Educators,New Orleans,LA,0.0,0.0,0.0,0,,,0.0,40.0,0.0,0.975,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.125,1,0.7018,0.8596,0.6667,,19022.5
A T Still University of Health Sciences,Kirksville,MO,0.0,0.0,0.0,0,,,0.0,,,,,,,,,,,,1,,,,219800,PrivacySuppressed
ABC Beauty Academy,Garland,TX,0.0,0.0,0.0,0,,,0.0,30.0,0.0,0.0333,0.0333,0.9333,0.0,0.0,0.0,0.0,0.0,0.0,0,0.7857,0.0,0.8286,,PrivacySuppressed
ABC Beauty College Inc,Arkadelphia,AR,0.0,0.0,0.0,0,,,0.0,38.0,0.2895,0.6579,0.0526,0.0,0.0,0.0,0.0,0.0,0.0,0.2105,1,0.9815,1.0,0.4688,PrivacySuppressed,16500
AI Miami International University of Art and Design,Miami,FL,0.0,0.0,0.0,0,,,0.0,2778.0,0.0324,0.0198,0.4773,0.0018,0.0,0.0,0.0018,0.0025,0.4644,0.2185,1,0.5507,0.6966,0.3262,29900,31000
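With `inplace=True` the method modifies the DataFrame and returns `None`, which is why the solution calls `head` in a separate cell. A sketch on three hypothetical rows:

```python
import pandas as pd

# Hypothetical rows in unsorted index order
toy = pd.DataFrame(
    {"UGDS": [30.0, 40.0, 38.0]},
    index=[
        "ABC Beauty Academy",
        "A & W Healthcare Educators",
        "ABC Beauty College Inc",
    ],
)

# The call mutates toy and hands back None
result = toy.sort_index(inplace=True)

print(result)
print(toy)
```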


### Problem 6
<span  style="color:green; font-size:16px">Use the **`max`** method across the rows for DataFrame **`college_ugds`**. Take the results and apply the pandas **`cut`** function to create a Series with 3 category labels on how 'diverse' the school is.</span>

In [10]:
college_ugds = college.filter(like='UGDS_')

In [11]:
# your code here
max_race = college_ugds.max(axis='columns')

diversity = pd.cut(max_race, bins=[0, .4, .7, 1], labels=['High', 'Medium', 'Low'])

diversity.head(15)

INSTNM
A & W Healthcare Educators                                Low
A T Still University of Health Sciences                   NaN
ABC Beauty Academy                                        Low
ABC Beauty College Inc                                 Medium
AI Miami International University of Art and Design    Medium
AIB College of Business                                Medium
AOMA Graduate School of Integrative Medicine              NaN
ASA College                                              High
ASI Career Institute                                   Medium
ASM Beauty World Academy                                  Low
ATA Career Education                                   Medium
ATA College                                              High
ATEP at IVC                                               NaN
ATI College-Norwalk                                    Medium
ATS Institute of Technology                               Low
dtype: category
Categories (3, object): [High < Medium < Low]
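By default `pd.cut` builds right-closed bins, so with `bins=[0, .4, .7, 1]` a value of exactly 0 falls outside the first bin and becomes missing. The sketch below, on made-up shares, shows the edge behaviour and the `include_lowest` option:

```python
import pandas as pd

# Hypothetical maximum-share values, including both bin edges
shares = pd.Series([0.0, 0.4, 0.41, 0.7, 1.0])
labels = ["High", "Medium", "Low"]

# Default: bins are (0, 0.4], (0.4, 0.7], (0.7, 1]
default = pd.cut(shares, bins=[0, .4, .7, 1], labels=labels)

# include_lowest=True also places the left-most edge in the first bin
closed = pd.cut(shares, bins=[0, .4, .7, 1], labels=labels, include_lowest=True)

print(default.tolist())
print(closed.tolist())
```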

### Problem 7
<span  style="color:green; font-size:16px">Use the **`select_dtypes`** method on the **`college`** DataFrame to select only the numeric columns. Save this DataFrame to **`college_num`**. Google is your friend here.</span>

In [12]:
# your code here
college_num = college.select_dtypes(include=[np.number])

college_num.head()

INSTNM,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV
A & W Healthcare Educators,0.0,0.0,0.0,0,,,0.0,40.0,0.0,0.975,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.125,1,0.7018,0.8596,0.6667
A T Still University of Health Sciences,0.0,0.0,0.0,0,,,0.0,,,,,,,,,,,,1,,,
ABC Beauty Academy,0.0,0.0,0.0,0,,,0.0,30.0,0.0,0.0333,0.0333,0.9333,0.0,0.0,0.0,0.0,0.0,0.0,0,0.7857,0.0,0.8286
ABC Beauty College Inc,0.0,0.0,0.0,0,,,0.0,38.0,0.2895,0.6579,0.0526,0.0,0.0,0.0,0.0,0.0,0.0,0.2105,1,0.9815,1.0,0.4688
AI Miami International University of Art and Design,0.0,0.0,0.0,0,,,0.0,2778.0,0.0324,0.0198,0.4773,0.0018,0.0,0.0,0.0018,0.0025,0.4644,0.2185,1,0.5507,0.6966,0.3262
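Columns that look numeric but contain text markers such as `PrivacySuppressed` are stored as strings and are therefore excluded, which is consistent with `MD_EARN_WNE_P10` and `GRAD_DEBT_MDN_SUPP` being absent from the output above. A sketch with hypothetical rows:

```python
import pandas as pd

# Hypothetical rows; the last column mixes digits with a text marker
toy = pd.DataFrame({
    "CITY": ["Miami", "Garland"],
    "UGDS": [2778.0, 30.0],
    "CURROPER": [1, 0],
    "MD_EARN_WNE_P10": ["29900", "PrivacySuppressed"],
})

# 'number' selects integer and float columns and skips string columns
numeric_only = toy.select_dtypes(include="number")

print(list(numeric_only.columns))
```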


### Problem 8
<span  style="color:green; font-size:16px">Use **`filter`** to slim your DataFrame down to the **SAT** columns. Then look up how to use the **`dropna`** method and return a DataFrame that has no missing values. Use the style **`bar`** on the top 10 rows of this DataFrame.</span>

In [13]:
# your code here
sat = college.filter(like='SAT').dropna()

sat.head(10).style.bar()

INSTNM,SATVRMID,SATMTMID
Abilene Christian University,530,545
Abraham Baldwin Agricultural College,465,460
Adams State University,475,509
Adelphi University,550,565
Adrian College,500,490
Adventist University of Health Sciences,473,453
Alabama A & M University,424,420
Alabama State University,425,430
Alaska Pacific University,555,503
Albany College of Pharmacy and Health Sciences,555,610
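`filter(like='SAT')` keeps every column whose name contains the substring, and `dropna()` with no arguments drops any row that has at least one missing value. A sketch on three hypothetical rows (the styling step is left out because it only affects notebook display):

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the middle school reports no SAT scores
toy = pd.DataFrame(
    {
        "CITY": ["Waco", "Chicago", "Queens"],
        "SATVRMID": [610.0, np.nan, 540.0],
        "SATMTMID": [620.0, np.nan, 560.0],
    },
    index=["School A", "School B", "School C"],
)

# Keep the SAT columns, then drop rows with any missing value
sat_toy = toy.filter(like="SAT").dropna()

print(sat_toy)
```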


### Problem 9
<span  style="color:green; font-size:16px">How many colleges have more than 10,000 students and are religiously affiliated?</span>

In [15]:
college[(college.RELAFFIL == 1) & (college.UGDS > 10000)]

INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
Baylor University,Waco,TX,0.0,0.0,0.0,1,610.0,620.0,0.0,13801.0,0.6402,0.0733,0.1413,0.0625,0.0036,0.0005,0.0451,0.0309,0.0024,0.0162,1,0.2135,0.452,0.0245,48200,25131
Brigham Young University-Idaho,Rexburg,ID,0.0,0.0,0.0,1,515.0,505.0,0.0,23865.0,0.8011,0.0048,0.0303,0.0094,0.0035,0.0044,0.0569,0.0659,0.0238,0.3462,1,0.4733,0.2138,0.371,38800,11000
Brigham Young University-Provo,Provo,UT,0.0,0.0,0.0,1,630.0,630.0,0.0,27163.0,0.832,0.005,0.0563,0.0195,0.0037,0.0058,0.0344,0.0314,0.0118,0.0981,1,0.3702,0.1921,0.122,57200,11000
DePaul University,Chicago,IL,0.0,0.0,0.0,1,,,0.0,15858.0,0.5518,0.0832,0.1756,0.0778,0.0008,0.0017,0.0388,0.0292,0.0411,0.1438,1,0.3504,0.578,0.2019,50300,23500
Indiana Wesleyan University-Marion,Marion,IN,0.0,0.0,0.0,1,530.0,525.0,0.0,10218.0,0.7531,0.1825,0.0307,0.0065,0.0024,0.0008,0.0206,0.0023,0.0012,0.0762,1,0.3816,0.7019,0.6919,46300,24160
Kennesaw State University,Kennesaw,GA,0.0,0.0,0.0,1,545.0,535.0,0.0,23058.0,0.6082,0.1904,0.0755,0.0339,0.0023,0.0016,0.042,0.0186,0.0273,0.2397,1,0.4067,0.5462,0.2518,40000,22750
Liberty University,Lynchburg,VA,0.0,0.0,0.0,1,525.0,510.0,0.0,49340.0,0.5121,0.155,0.0166,0.0093,0.0059,0.0022,0.0227,0.0135,0.2626,0.4458,1,0.4984,0.6648,0.6265,35600,23250
Loyola University Chicago,Chicago,IL,0.0,0.0,0.0,1,575.0,580.0,0.0,10042.0,0.6028,0.0376,0.132,0.1105,0.001,0.0021,0.0604,0.0352,0.0183,0.0833,1,0.2817,0.6092,0.0804,50700,25000
Saint Leo University,Saint Leo,FL,0.0,0.0,0.0,1,,,0.0,11976.0,0.3823,0.3696,0.1104,0.0124,0.005,0.0022,0.0145,0.0256,0.0781,0.3059,1,0.4828,0.6032,0.7228,42100,25000
St John's University-New York,Queens,NY,0.0,0.0,0.0,1,540.0,560.0,0.0,10878.0,0.338,0.189,0.1452,0.182,0.0024,0.0033,0.0448,0.053,0.0423,0.0297,1,0.3009,0.5599,0.038,52700,25910


In [17]:
# final answer
college[(college.RELAFFIL == 1) & (college.UGDS > 10000)].shape[0]

10
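Because `True` counts as 1, the count can also be taken directly from the boolean mask without building the filtered DataFrame. A sketch on four hypothetical rows:

```python
import pandas as pd

# Hypothetical schools: affiliation flag and undergraduate enrollment
toy = pd.DataFrame({
    "RELAFFIL": [1, 1, 0, 1],
    "UGDS": [13801.0, 62.0, 29851.0, 49340.0],
})

# Each condition needs its own parentheses because & binds tightly
mask = (toy["RELAFFIL"] == 1) & (toy["UGDS"] > 10000)

# Summing a boolean Series counts the True values
count = int(mask.sum())

print(count)
```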