# The DataFrame
The vast majority of your work as a data analyst will be with tabular data, organized in rows and columns. The **DataFrame** is the primary pandas data structure and can be thought of as a collection of Series. The rows and the columns of a DataFrame each have their own **Index**. The **Index** for the columns is simply referred to as the **column names**. Operations on DataFrames can be applied to all elements, by row, or by column. The technical term **axis** is used to refer to the horizontal and vertical components of the frame.

The row axis is numbered 0 and the column axis is numbered 1, a convention borrowed from numpy, where ndarrays can have multiple axes numbered beginning with 0. The **`axis`** argument shows up in most DataFrame methods, meaning you can choose to do the operation over the columns or the rows.
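
As a minimal sketch of what the **`axis`** argument does, the toy frame below (made-up data, not part of the dataset used later) is summed along each axis.

```python
import pandas as pd

# A small, made-up frame used only for illustration
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# axis=0 works down the rows, giving one result per column
col_totals = toy.sum(axis=0)

# axis=1 works across the columns, giving one result per row
row_totals = toy.sum(axis=1)

print(col_totals)
print(row_totals)
```

Many other methods, such as `mean` or `max`, accept `axis` in the same way.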

Just as with Series, alignment of indices silently takes place behind the scenes for DataFrames, so care needs to be taken when operating on two different DataFrames at the same time.
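
The following sketch illustrates that alignment with two made-up frames whose row labels only partly overlap. The names `left`, `right` and `total` are placeholders chosen for this example.

```python
import pandas as pd

# Two made-up frames whose row labels only partly overlap
left = pd.DataFrame({'x': [1, 2, 3]}, index=['a', 'b', 'c'])
right = pd.DataFrame({'x': [10, 20, 30]}, index=['b', 'c', 'd'])

# Addition first aligns on both the row and the column labels;
# labels present in only one frame produce missing values (NaN)
total = left + right
print(total)
```

Only the rows labeled `b` and `c` exist in both frames, so only those rows receive a numeric result.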

### Manually construct a DataFrame
It is quite rare that you will actually need to construct a DataFrame manually.  Most frequently you will be reading external flat files, getting data from the web or reading from relational databases. Nonetheless, it does occur.

Just as **`pd.Series`** is technically a constructor, **`pd.DataFrame`** constructs a DataFrame. One common way to use it is to pass a dictionary with strings as the **keys** and lists as the **values**.

In [173]:
# Let's import our packages.
import pandas as pd
import numpy as np

In [174]:
# create dataframe from a dictionary of lists. The keys are the column names NOT the indices
df = pd.DataFrame({'name':['Ted', 'Ned', 'Jed'], 'Phone':['Samsung', 'Samsung', 'IOS'], 'Favorite Number':[99, 7, 4]})
df

Unnamed: 0,Favorite Number,Phone,name
0,99,Samsung,Ted
1,7,Samsung,Ned
2,4,IOS,Jed


### Closely Examine Output
It is clear that there are three columns and three rows to the DataFrame. What's not as obvious is that there exists an index for the rows with labels 0,1,2. The column names (which are also Index objects) are Favorite Number, Phone and name.

### Why did the column order get changed?
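If you look closely, the order of the columns in the output above does not match the order in which they were written in the dictionary. In the Python and pandas versions used to produce this output, dictionaries were unordered and pandas sorted the keys, so the column order was unlikely to match the order of the keys in the dictionary. (Dictionaries in Python 3.7 and later preserve insertion order, and recent pandas versions keep that order.)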

It is possible to specify the order explicitly in the constructor with the **`columns`** parameter.

In [175]:
# Let's fix the column order 
df = pd.DataFrame({'name':['Ted', 'Ned', 'Jed'], 'Phone':['Samsung', 'Samsung', 'IOS'], 'Favorite Number':[99, 7, 4]},
                 columns=['name', 'Phone', 'Favorite Number'])
df

Unnamed: 0,name,Phone,Favorite Number
0,Ted,Samsung,99
1,Ned,Samsung,7
2,Jed,IOS,4


### More info on the DataFrame
When a Series is outputted, the data type of the values is printed to the screen. This doesn't happen with a DataFrame. Use the **`info`** method to get more info.

In [176]:
# get more info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
name               3 non-null object
Phone              3 non-null object
Favorite Number    3 non-null int64
dtypes: int64(1), object(2)
memory usage: 152.0+ bytes


There is a lot of information above: the object type (DataFrame), the row index type (a RangeIndex), the column names and their types ('object' is used for strings), and the number of non-**null** values in each column. At the bottom are a summary of the data types (2 object and 1 integer) and the memory usage.
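
If only the data types are of interest, a lighter-weight option is the `dtypes` attribute. The sketch below rebuilds a small made-up frame (named `df_small` here so it stands alone) and also uses `select_dtypes` to keep only the numeric columns.

```python
import pandas as pd

# A small frame rebuilt here so the example is self-contained
df_small = pd.DataFrame({'name': ['Ted', 'Ned', 'Jed'],
                         'Favorite Number': [99, 7, 4]})

# dtypes returns a Series: the index holds the column names,
# the values are the data types
print(df_small.dtypes)

# select_dtypes keeps only the columns of the requested types
numeric_only = df_small.select_dtypes(include=['number'])
print(numeric_only.columns.tolist())
```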

### DataFrame by Pieces
Each DataFrame has both a column **Index** and a row **Index** and values that can be represented by a 2 dimensional NumPy array. Let's see those pieces.

In [177]:
# Row index is retrieved by .index
df.index

RangeIndex(start=0, stop=3, step=1)

In [178]:
# column index is just .columns
df.columns

Index(['name', 'Phone', 'Favorite Number'], dtype='object')

In [179]:
# the values are a numpy array
df.values

array([['Ted', 'Samsung', 99],
       ['Ned', 'Samsung', 7],
       ['Jed', 'IOS', 4]], dtype=object)

### Get summary statistics for numeric columns
Basic summary statistics for numeric columns can be retrieved through the **`describe`** method. By default, **`describe`** ignores non-numeric columns.

In [180]:
df.describe()

Unnamed: 0,Favorite Number
count,3.0
mean,36.666667
std,54.003086
min,4.0
25%,5.5
50%,7.0
75%,53.0
max,99.0


Different summary statistics for columns with strings ('object' columns) can be retrieved with the **`include`** argument:

In [181]:
# get summary statistics on only the 'objects'
df.describe(include=['object'])

Unnamed: 0,name,Phone
count,3,3
unique,3,2
top,Ned,Samsung
freq,1,2


### Creating a DataFrame from NumPy
An easy way to create a DataFrame is from a 2-d numpy array. Below, a 10x5 numpy array is used, with both the column names and the row index provided.

In [182]:
df = pd.DataFrame(np.random.rand(10,5), columns=['one', 'two', 'three', 'four', 'five'], index=list('abcdefghij'))
df

Unnamed: 0,one,two,three,four,five
a,0.103241,0.386309,0.113734,0.45232,0.885414
b,0.047796,0.464106,0.562185,0.614653,0.900322
c,0.972248,0.67083,0.731325,0.703621,0.955253
d,0.814749,0.987684,0.103421,0.07862,0.232815
e,0.706825,0.361361,0.431343,0.364113,0.554111
f,0.335477,0.004456,0.709579,0.544189,0.777761
g,0.854552,0.277256,0.908826,0.358026,0.815923
h,0.181461,0.913917,0.023292,0.936569,0.48771
i,0.082133,0.274848,0.186008,0.879219,0.329196
j,0.824497,0.044611,0.895049,0.748504,0.983914


## Your data is horrible. Can we speed it up and get some real data?
Yes, there is a dataset that was downloaded from the [city of Houston](http://data.ohouston.org/dataset/city-of-houston-current-employee-roster) that has employee information. Information for 2000 employees, such as name, department, salary, race and more, is provided. The **`read_csv`** pandas function can read in csv files. There are many options in the function for handling the wide variety of csvs that you are likely to encounter in the wild.

In [183]:
# There are numerous options to read_csv but this is already a clean dataset so we don't use them now
df_coh = pd.read_csv('data/coh_employee.csv')
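
For a messier file, a few of those **`read_csv`** options might be combined as in the sketch below. The CSV text is made up and read from memory with `io.StringIO`, so nothing here depends on the Houston file.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a messier file
raw = io.StringIO(
    "id;name;salary\n"
    "1;Ann;50000\n"
    "2;Bob;unknown\n"
)

messy = pd.read_csv(
    raw,
    sep=';',                   # fields are separated by semicolons
    na_values=['unknown'],     # treat this string as a missing value
    usecols=['id', 'salary'],  # read only some of the columns
)
print(messy)
```

Because `'unknown'` is converted to a missing value, the `salary` column is read as numeric rather than as strings.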

### Standard operations on a new dataset
When reading in a new dataset, I usually run a standard set of commands to get basic information on the dataset: **`shape`**, **`head`**, **`info`** and **`describe`**.

In [184]:
# get the dimensions of the dataset
df_coh.shape

(2000, 31)

In [185]:
# let's get a small glimpse of it
df_coh.head(10)

Unnamed: 0,UNIQUE_ID,FIRST_NAME,LAST_NAME,POSITION_NUMBER,POSITION_JOB_CODE,POSITION_TITLE,BUSINESS_AREA,DEPARTMENT,PAY_GRADE,FUND_ID,FUND_NAME,COST_CENTER,COST_CENTER_NAME,REPORTS_TO_POSITION,MANAGER_NAME,BASE_SALARY,ORG_UNIT,ORG_UNIT_NAME,ETHNICITY,RACE,EMPLOYMENT_TYPE,EMPLOYMENT_SUB_GROUP,EXEMPT,GENDER,EMPLOYMENT_STATUS,CIVIL_SERVICE_TYPE,HIRE_DATE,JOB_DATE,COMP_DATE,EEOJ,SNAPSHOT_DATE
0,9172,LILLIAN,WARDEN,30027193,306.2,ASSISTANT DIRECTOR (EX LVL),1600,Municipal Courts Department,32,1000,General Fund,1600010001,MCD-Admin Services,30044747.0,NELLY SANTOS,121862.0,16.1410,MCD-ADMINISTRATIVE SERVICES,HISPANIC,Hispanic/Latino,Full Time,Exempt Excptn,E,Female,Active,CIVILIAN,2006-06-12,2012-10-13,1998-05-13,A: Officials & Administrators,2016-06-01
1,12311,ESPERANZA,RODRIGUEZ,30044399,901.2,LIBRARY ASSISTANT,3400,Library,05,1000,General Fund,3400070001,HPL-Neigh Lib Syst,30044159.0,MIGDALIA SEPULVEDA,26125.0,34.7001.27,HPL-NEIGHBORHOOD LIBRARY SERVICES,HISPANIC,Hispanic/Latino,Full Time,Non Exempt Postv,N,Female,Active,CIVILIAN,2000-07-19,2010-09-18,2000-07-19,E: Para-Professionals,2016-06-01
2,28693,JONATHAN,JORSCH JR.,30063393,108.0,POLICE OFFICER,1000,Houston Police Department-HPD,PA03,1000,General Fund,1000010068,HPD-Vehicular Crimes,30060941.0,CATRINA WILLIAMS,45279.0,10.1687.0130,HPD-VEHICULAR CRIMES DIVISION,"WHITE, NOT OF HISPANIC ORIGIN",White,Full Time,Non-Exempt Excptn,N,Male,Active,CLASSIFIED,2015-02-03,2015-02-03,2014-02-03,D: Protective Service Workers,2016-06-01
3,2359,JOHN,HARE,30007418,103.2,ENGINEER/OPERATOR,1200,Houston Fire Department (HFD),FD04,1000,General Fund,1200030001,HFD-Deployment,,,63166.0,12.1210,HFD-FIRE & EMS OPERATIONS,"WHITE, NOT OF HISPANIC ORIGIN",White,Full Time,NE Suprsn Excptn,N,Male,Active,CLASSIFIED,1982-02-08,1991-05-25,1982-02-08,D: Protective Service Workers,2016-06-01
4,5123,RUSSELL,GALBREATH,30006294,523.2,ELECTRICIAN,2500,General Services Department,18,2105,Maintenance Renewal and Replacement Fund,2500100003,GSD-MRR-Maint,30008673.0,JOHN BOGNEY,56347.0,25.1416,GSD-PROPERTY MGMT MAINT & OPER,"WHITE, NOT OF HISPANIC ORIGIN",White,Full Time,Non Exempt Postv,N,Male,Active,CIVILIAN,1989-06-19,1994-10-22,1989-06-19,G: Skilled Craft Workers,2016-06-01
5,3688,CRAIG,PENNAMON,30005060,108.2,SENIOR POLICE OFFICER,1000,Houston Police Department-HPD,PA04,1000,General Fund,1000010007,HPD-NightCmd/Secur,30060536.0,SOTHEN SOR,66614.0,10.1781.0040,HPD-NIGHT COMMAND/SECURITY OPERATIONS,"BLACK, NOT OF HISPANIC ORIGIN",Black or African American,Full Time,Non-Exempt Excptn,N,Male,Active,CLASSIFIED,1984-11-26,2005-03-26,1984-11-26,D: Protective Service Workers,2016-06-01
6,25295,PETER,TRUONG,30049153,778.4,ENGINEER,2000,Public Works & Engineering-PWE,26,8300,PWE-W & S System Operating Fund,2000040009,PWE-WWO Contr Mgmt,30049155.0,MONCEF TIHAMI,71680.0,20.1846.06,PWE-ENGINEERING SUPPORT,ASIAN OR PACIFIC ISLANDER,Asian/Pacific Islander,Full Time,Exempt Excptn,E,Male,Active,CIVILIAN,2012-03-26,2012-03-26,2012-03-26,B: Professionals,2016-06-01
7,28212,JEFFREY,DELLING,30043726,520.3,CARPENTER,2800,Houston Airport System (HAS),14,8001,HAS-Revenue Fund,2800040017,HAS-IAH-AMG A&G,30047472.0,SHANTEL WOODS,42390.0,28.1377.3000,HAS-IAH SCHEDULED MAINTENANCE,"WHITE, NOT OF HISPANIC ORIGIN",White,Full Time,Non Exempt Postv,N,Male,Active,CIVILIAN,2013-11-04,2013-11-04,2013-11-04,G: Skilled Craft Workers,2016-06-01
8,8010,ROBERT,OAKES,30051248,306.3,DEPUTY ASSISTANT DIRECTOR (EXECUTIVE LEV,2000,Public Works & Engineering-PWE,30,2301,Building Inspection Fund,2000060004,PWE-BldgOfficialOffi,30043470.0,EARL GREER,107962.0,20.2123.01,PWE-OFFICE OF THE BLDG OFFICIAL,"WHITE, NOT OF HISPANIC ORIGIN",White,Full Time,Exempt Excptn,E,Male,Active,CIVILIAN,1993-11-15,2013-01-05,1993-11-15,A: Officials & Administrators,2016-06-01
9,32504,DAMIAN,LEFFLER,30019784,926.3,AIRPORT OPERATIONS COORDINATOR,2800,Houston Airport System (HAS),20,8001,HAS-Revenue Fund,2800060007,HAS-HOU-SEP-ID Badge,30015643.0,FLETCHER CLARK III,44616.0,28.1585.1200,HAS-HOU SEP SECURITY,"WHITE, NOT OF HISPANIC ORIGIN",,Full Time,Non Exempt Postv,N,Male,Active,CIVILIAN,2016-03-14,2016-03-14,2016-03-14,B: Professionals,2016-06-01


### Changing display settings
If you scroll to the right, you will notice that not all 31 columns are displayed to the screen. Pandas comes with default values for a couple dozen display settings to help control output. One of these settings is the number of columns displayed to the screen. The options can all be found under **`pd.options.display`**. All the display settings that can be changed are listed below.

In [186]:
dir(pd.options.display)

['chop_threshold',
 'colheader_justify',
 'column_space',
 'date_dayfirst',
 'date_yearfirst',
 'encoding',
 'expand_frame_repr',
 'float_format',
 'height',
 'large_repr',
 'latex',
 'line_width',
 'max_categories',
 'max_columns',
 'max_colwidth',
 'max_info_columns',
 'max_info_rows',
 'max_rows',
 'max_seq_items',
 'memory_usage',
 'mpl_style',
 'multi_sparse',
 'notebook_repr_html',
 'pprint_nest_depth',
 'precision',
 'show_dimensions',
 'unicode',
 'width']

In [187]:
# Let's see the current setting
pd.options.display.max_columns

40

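By default, only 20 of the 31 columns are printed out in the notebook (the value of 40 shown above most likely reflects a session in which the setting had already been changed). We can adjust this setting easily by reassigning its value. More info on [pandas options settings](http://pandas.pydata.org/pandas-docs/stable/options.html).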

In [188]:
# reassign max_columns
pd.options.display.max_columns = 40
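
As an aside, the same setting can also be changed with `pd.set_option`, or only temporarily with `pd.option_context`. The sketch below uses a made-up wide frame named `wide`.

```python
import pandas as pd

# A made-up frame with 30 columns
wide = pd.DataFrame([range(30)])

# set_option has the same effect as assigning to
# pd.options.display.max_columns
pd.set_option('display.max_columns', 40)

# option_context changes a setting only inside the with block
with pd.option_context('display.max_columns', 5):
    inside = pd.get_option('display.max_columns')
    print(wide)  # printed with the middle columns elided

after = pd.get_option('display.max_columns')
print(inside, after)
```

The temporary form is handy when a single wide table needs to be inspected without altering the settings for the rest of the notebook.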

In [189]:
# and now inspect the dataframe again
df_coh.head()

Unnamed: 0,UNIQUE_ID,FIRST_NAME,LAST_NAME,POSITION_NUMBER,POSITION_JOB_CODE,POSITION_TITLE,BUSINESS_AREA,DEPARTMENT,PAY_GRADE,FUND_ID,FUND_NAME,COST_CENTER,COST_CENTER_NAME,REPORTS_TO_POSITION,MANAGER_NAME,BASE_SALARY,ORG_UNIT,ORG_UNIT_NAME,ETHNICITY,RACE,EMPLOYMENT_TYPE,EMPLOYMENT_SUB_GROUP,EXEMPT,GENDER,EMPLOYMENT_STATUS,CIVIL_SERVICE_TYPE,HIRE_DATE,JOB_DATE,COMP_DATE,EEOJ,SNAPSHOT_DATE
0,9172,LILLIAN,WARDEN,30027193,306.2,ASSISTANT DIRECTOR (EX LVL),1600,Municipal Courts Department,32,1000,General Fund,1600010001,MCD-Admin Services,30044747.0,NELLY SANTOS,121862.0,16.1410,MCD-ADMINISTRATIVE SERVICES,HISPANIC,Hispanic/Latino,Full Time,Exempt Excptn,E,Female,Active,CIVILIAN,2006-06-12,2012-10-13,1998-05-13,A: Officials & Administrators,2016-06-01
1,12311,ESPERANZA,RODRIGUEZ,30044399,901.2,LIBRARY ASSISTANT,3400,Library,05,1000,General Fund,3400070001,HPL-Neigh Lib Syst,30044159.0,MIGDALIA SEPULVEDA,26125.0,34.7001.27,HPL-NEIGHBORHOOD LIBRARY SERVICES,HISPANIC,Hispanic/Latino,Full Time,Non Exempt Postv,N,Female,Active,CIVILIAN,2000-07-19,2010-09-18,2000-07-19,E: Para-Professionals,2016-06-01
2,28693,JONATHAN,JORSCH JR.,30063393,108.0,POLICE OFFICER,1000,Houston Police Department-HPD,PA03,1000,General Fund,1000010068,HPD-Vehicular Crimes,30060941.0,CATRINA WILLIAMS,45279.0,10.1687.0130,HPD-VEHICULAR CRIMES DIVISION,"WHITE, NOT OF HISPANIC ORIGIN",White,Full Time,Non-Exempt Excptn,N,Male,Active,CLASSIFIED,2015-02-03,2015-02-03,2014-02-03,D: Protective Service Workers,2016-06-01
3,2359,JOHN,HARE,30007418,103.2,ENGINEER/OPERATOR,1200,Houston Fire Department (HFD),FD04,1000,General Fund,1200030001,HFD-Deployment,,,63166.0,12.1210,HFD-FIRE & EMS OPERATIONS,"WHITE, NOT OF HISPANIC ORIGIN",White,Full Time,NE Suprsn Excptn,N,Male,Active,CLASSIFIED,1982-02-08,1991-05-25,1982-02-08,D: Protective Service Workers,2016-06-01
4,5123,RUSSELL,GALBREATH,30006294,523.2,ELECTRICIAN,2500,General Services Department,18,2105,Maintenance Renewal and Replacement Fund,2500100003,GSD-MRR-Maint,30008673.0,JOHN BOGNEY,56347.0,25.1416,GSD-PROPERTY MGMT MAINT & OPER,"WHITE, NOT OF HISPANIC ORIGIN",White,Full Time,Non Exempt Postv,N,Male,Active,CIVILIAN,1989-06-19,1994-10-22,1989-06-19,G: Skilled Craft Workers,2016-06-01


In [190]:
# continue by getting the 'info'
df_coh.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 31 columns):
UNIQUE_ID               2000 non-null int64
FIRST_NAME              2000 non-null object
LAST_NAME               2000 non-null object
POSITION_NUMBER         2000 non-null int64
POSITION_JOB_CODE       2000 non-null float64
POSITION_TITLE          2000 non-null object
BUSINESS_AREA           2000 non-null int64
DEPARTMENT              2000 non-null object
PAY_GRADE               2000 non-null object
FUND_ID                 2000 non-null int64
FUND_NAME               2000 non-null object
COST_CENTER             2000 non-null int64
COST_CENTER_NAME        2000 non-null object
REPORTS_TO_POSITION     1688 non-null float64
MANAGER_NAME            1678 non-null object
BASE_SALARY             1886 non-null float64
ORG_UNIT                2000 non-null object
ORG_UNIT_NAME           2000 non-null object
ETHNICITY               1998 non-null object
RACE                    1965 non-null ob

There are 3 columns with float values, 5 with integers and 23 with strings. There are **missing** values in this dataset. Since it is known that there are 2000 rows, any column that has fewer than 2000 **non-null** values has some missing values. For instance, the **REPORTS_TO_POSITION** column has 1688 non-null values, which means it must have 312 missing ones.
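
Rather than subtracting by hand, the missing values can be counted directly. The following is a small sketch on made-up data. The same calls should work on `df_coh`.

```python
import numpy as np
import pandas as pd

# A made-up frame with a few missing values
toy = pd.DataFrame({'manager': ['Ann', None, 'Cy', None],
                    'salary': [50000.0, np.nan, 61000.0, 47000.0]})

# isnull marks each missing value as True; summing counts them per column
missing = toy.isnull().sum()
print(missing)

# count returns the number of non-null values, as reported by info
non_null = toy.count()
print(non_null)
```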

In [191]:
# get summary statistics. describe defaults to only numeric columns
# the only meaningful column here is BASE_SALARY
df_coh.describe()

Unnamed: 0,UNIQUE_ID,POSITION_NUMBER,POSITION_JOB_CODE,BUSINESS_AREA,FUND_ID,COST_CENTER,REPORTS_TO_POSITION,BASE_SALARY
count,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,1688.0,1886.0
mean,16247.6525,30032170.0,372.6786,2135.35,2401.2215,2135389000.0,30031670.0,55767.931601
std,9509.483839,21823.83,283.143263,1660.687906,2607.231936,1660714000.0,23372.96,21693.706679
min,5.0,30000010.0,47.0,1000.0,1000.0,1000010000.0,30000040.0,24960.0
25%,7981.0,30011070.0,108.0,1000.0,1000.0,1000010000.0,30006610.0,40170.0
50%,15316.0,30029440.0,303.5,1200.0,1000.0,1200050000.0,30033950.0,54461.0
75%,24480.5,30054360.0,568.15,2800.0,2301.0,2800020000.0,30055020.0,66614.0
max,32967.0,30067370.0,978.2,9000.0,9000.0,9000160000.0,30067370.0,275000.0


In [192]:
# describe just the string columns
# these are much more interesting
df_coh.describe(include=['object'])

Unnamed: 0,FIRST_NAME,LAST_NAME,POSITION_TITLE,DEPARTMENT,PAY_GRADE,FUND_NAME,COST_CENTER_NAME,MANAGER_NAME,ORG_UNIT,ORG_UNIT_NAME,ETHNICITY,RACE,EMPLOYMENT_TYPE,EMPLOYMENT_SUB_GROUP,EXEMPT,GENDER,EMPLOYMENT_STATUS,CIVIL_SERVICE_TYPE,HIRE_DATE,JOB_DATE,COMP_DATE,EEOJ,SNAPSHOT_DATE
count,2000,2000,2000,2000,2000,2000,2000,1678,2000.0,2000,1998,1965,2000,2000,1999,2000,2000,2000,2000,1997,2000,1999,2000
unique,961,1371,330,24,80,39,448,1201,1181.0,478,5,6,5,17,2,2,2,3,999,947,1084,8,1
top,MICHAEL,JOHNSON,SENIOR POLICE OFFICER,Houston Police Department-HPD,PA03,General Fund,HFD-Deployment,HERMAN GONZALES,12.121,HFD-FIRE & EMS OPERATIONS,"BLACK, NOT OF HISPANIC ORIGIN",Black or African American,Full Time,Non Exempt Postv,N,Male,Active,CIVILIAN,2016-03-28,2002-01-05,1986-03-03,D: Protective Service Workers,2016-06-01
freq,44,30,220,638,184,1304,297,56,288.0,289,715,700,1954,792,1681,1397,1991,1100,11,34,10,728,2000


### Transpose a DataFrame
Occasionally it's useful to transpose a DataFrame, that is, to swap the columns and the rows. In this case, the output will be much, much more readable. Use the **`.T`** attribute.

In [193]:
df_coh.describe(include=['object']).T

Unnamed: 0,count,unique,top,freq
FIRST_NAME,2000,961,MICHAEL,44
LAST_NAME,2000,1371,JOHNSON,30
POSITION_TITLE,2000,330,SENIOR POLICE OFFICER,220
DEPARTMENT,2000,24,Houston Police Department-HPD,638
PAY_GRADE,2000,80,PA03,184
FUND_NAME,2000,39,General Fund,1304
COST_CENTER_NAME,2000,448,HFD-Deployment,297
MANAGER_NAME,1678,1201,HERMAN GONZALES,56
ORG_UNIT,2000,1181,12.1210,288
ORG_UNIT_NAME,2000,478,HFD-FIRE & EMS OPERATIONS,289


### Always know your Index
It's important to be aware of what your (row) index is in a DataFrame. The column index is usually obvious as it's just the column names.

In [194]:
# when read_csv is not given an index column, the default is to make the index integers starting from 0
df_coh.index

RangeIndex(start=0, stop=2000, step=1)

In [195]:
# See the values of a RangeIndex
df_coh.index.values

array([   0,    1,    2, ..., 1997, 1998, 1999], dtype=int64)

In [196]:
# the values of the column index are also available as a numpy array
df_coh.columns.values

array(['UNIQUE_ID', 'FIRST_NAME', 'LAST_NAME', 'POSITION_NUMBER',
       'POSITION_JOB_CODE', 'POSITION_TITLE', 'BUSINESS_AREA',
       'DEPARTMENT', 'PAY_GRADE', 'FUND_ID', 'FUND_NAME', 'COST_CENTER',
       'COST_CENTER_NAME', 'REPORTS_TO_POSITION', 'MANAGER_NAME',
       'BASE_SALARY', 'ORG_UNIT', 'ORG_UNIT_NAME', 'ETHNICITY', 'RACE',
       'EMPLOYMENT_TYPE', 'EMPLOYMENT_SUB_GROUP', 'EXEMPT', 'GENDER',
       'EMPLOYMENT_STATUS', 'CIVIL_SERVICE_TYPE', 'HIRE_DATE', 'JOB_DATE',
       'COMP_DATE', 'EEOJ', 'SNAPSHOT_DATE'], dtype=object)

## The [ ] is completely different for DataFrames than for Series
The primary use of the brackets is to retrieve one or more columns from a DataFrame. Simply write the name of the column inside the brackets and a Series will be returned. This behavior of [ ] is completely different from its behavior for a Series, where it is used to retrieve value(s) by the index.

In [197]:
# Access just the FIRST_NAME column
# shorten output with head
# returns a Series

df_coh['FIRST_NAME'].head(10)

0      LILLIAN
1    ESPERANZA
2     JONATHAN
3         JOHN
4      RUSSELL
5        CRAIG
6        PETER
7      JEFFREY
8       ROBERT
9       DAMIAN
Name: FIRST_NAME, dtype: object

### Use a list to access multiple columns

In [198]:
# get three columns - put names in a list
# Returns a DataFrame

df_coh[['FIRST_NAME', 'GENDER', 'RACE']].head()

Unnamed: 0,FIRST_NAME,GENDER,RACE
0,LILLIAN,Female,Hispanic/Latino
1,ESPERANZA,Female,Hispanic/Latino
2,JONATHAN,Male,White
3,JOHN,Male,White
4,RUSSELL,Male,White


In [199]:
# If you want to return a single column as a dataframe use a list with one column name
df_coh[['FIRST_NAME']].head(10)

Unnamed: 0,FIRST_NAME
0,LILLIAN
1,ESPERANZA
2,JONATHAN
3,JOHN
4,RUSSELL
5,CRAIG
6,PETER
7,JEFFREY
8,ROBERT
9,DAMIAN


### You have a Series when you access a single column
It's important to know that when you use the [ ] operator for a DataFrame and pass a single column name into it, you are back to having a Series. All the Series functions will work as before.

In [200]:
# use the same Series functions as before

# use max series method
df_coh['BASE_SALARY'].max()

275000.0

In [201]:
# sum total salary
df_coh['BASE_SALARY'].sum()

105178319.0

In [202]:
# find the standard deviation in salaries
df_coh['BASE_SALARY'].std()

21693.706679449504

In [203]:
# use boolean selection to select salaries > 100,000
salaries = df_coh['BASE_SALARY']
criteria = salaries > 100000

salaries[criteria].head(10)

0      121862.0
8      107962.0
11     180416.0
43     165216.0
66     100791.0
169    120916.0
178    210588.0
186    110881.0
217    102019.0
237    130416.0
Name: BASE_SALARY, dtype: float64

In [204]:
# you can also use the criteria to filter the entire dataframe
# All columns from those with a salary > 100,000

df_coh[criteria].head()

Unnamed: 0,UNIQUE_ID,FIRST_NAME,LAST_NAME,POSITION_NUMBER,POSITION_JOB_CODE,POSITION_TITLE,BUSINESS_AREA,DEPARTMENT,PAY_GRADE,FUND_ID,FUND_NAME,COST_CENTER,COST_CENTER_NAME,REPORTS_TO_POSITION,MANAGER_NAME,BASE_SALARY,ORG_UNIT,ORG_UNIT_NAME,ETHNICITY,RACE,EMPLOYMENT_TYPE,EMPLOYMENT_SUB_GROUP,EXEMPT,GENDER,EMPLOYMENT_STATUS,CIVIL_SERVICE_TYPE,HIRE_DATE,JOB_DATE,COMP_DATE,EEOJ,SNAPSHOT_DATE
0,9172,LILLIAN,WARDEN,30027193,306.2,ASSISTANT DIRECTOR (EX LVL),1600,Municipal Courts Department,32,1000,General Fund,1600010001,MCD-Admin Services,30044747.0,NELLY SANTOS,121862.0,16.1410,MCD-ADMINISTRATIVE SERVICES,HISPANIC,Hispanic/Latino,Full Time,Exempt Excptn,E,Female,Active,CIVILIAN,2006-06-12,2012-10-13,1998-05-13,A: Officials & Administrators,2016-06-01
8,8010,ROBERT,OAKES,30051248,306.3,DEPUTY ASSISTANT DIRECTOR (EXECUTIVE LEV,2000,Public Works & Engineering-PWE,30,2301,Building Inspection Fund,2000060004,PWE-BldgOfficialOffi,30043470.0,EARL GREER,107962.0,20.2123.01,PWE-OFFICE OF THE BLDG OFFICIAL,"WHITE, NOT OF HISPANIC ORIGIN",White,Full Time,Exempt Excptn,E,Male,Active,CIVILIAN,1993-11-15,2013-01-05,1993-11-15,A: Officials & Administrators,2016-06-01
11,4560,LUTHER,HARRELL,30002735,724.2,"CHIEF PHYSICIAN,MD",3800,Health & Human Services,35,1000,General Fund,3800050025,HHS-CHS AreaOprSppt,30048040.0,RISHA JONES,180416.0,38.5038,HHS-FAMILY HLTH ADMN,"BLACK, NOT OF HISPANIC ORIGIN",Black or African American,Full Time,Exempt Excptn,E,Male,Active,CIVILIAN,1987-05-22,1999-08-28,1987-05-22,A: Officials & Administrators,2016-06-01
43,27959,MICHAEL,GONZALEZ,30058271,656.6,ASSOCIATE EMS PHYSICIAN DIRECTOR,1200,Houston Fire Department (HFD),35,2010,Essential Public Health Services,1200030004,HFD-Medical Dir/Q.A.,30009179.0,DAVID PERSSE,165216.0,12.1225.03,HFD-MEDICAL DIRECTION & Q.A.,HISPANIC,Hispanic/Latino,Full Time,Exempt Excptn,E,Male,Active,CIVILIAN,2013-08-31,2013-08-31,2013-08-31,A: Officials & Administrators,2016-06-01
66,32229,SALLY,ABOUASSAF,30058430,706.5,"PUBLIC HEALTH DENTIST,DDS",3800,Health & Human Services,29,2010,Essential Public Health Services,3800070010,HHS-GeriatricOralHlt,30021232.0,DEBORAH GARCIA,100791.0,38.5023,HHS-ORAL HEALTH,"WHITE, NOT OF HISPANIC ORIGIN",White,Full Time,Exempt Excptn,E,Female,Active,CIVILIAN,2015-12-28,2015-12-28,2015-12-28,B: Professionals,2016-06-01


### More interesting boolean logic with DataFrames
Many more interesting questions can now be asked and answered with boolean operations on the data, in much the same way as in the Series notebook.

In [205]:
# find people who make more than 100,000 and are female
criteria = (df_coh['BASE_SALARY'] > 100000) & (df_coh['GENDER'] == 'Female')

df_coh[criteria].head()

Unnamed: 0,UNIQUE_ID,FIRST_NAME,LAST_NAME,POSITION_NUMBER,POSITION_JOB_CODE,POSITION_TITLE,BUSINESS_AREA,DEPARTMENT,PAY_GRADE,FUND_ID,FUND_NAME,COST_CENTER,COST_CENTER_NAME,REPORTS_TO_POSITION,MANAGER_NAME,BASE_SALARY,ORG_UNIT,ORG_UNIT_NAME,ETHNICITY,RACE,EMPLOYMENT_TYPE,EMPLOYMENT_SUB_GROUP,EXEMPT,GENDER,EMPLOYMENT_STATUS,CIVIL_SERVICE_TYPE,HIRE_DATE,JOB_DATE,COMP_DATE,EEOJ,SNAPSHOT_DATE
0,9172,LILLIAN,WARDEN,30027193,306.2,ASSISTANT DIRECTOR (EX LVL),1600,Municipal Courts Department,32,1000,General Fund,1600010001,MCD-Admin Services,30044747.0,NELLY SANTOS,121862.0,16.1410,MCD-ADMINISTRATIVE SERVICES,HISPANIC,Hispanic/Latino,Full Time,Exempt Excptn,E,Female,Active,CIVILIAN,2006-06-12,2012-10-13,1998-05-13,A: Officials & Administrators,2016-06-01
66,32229,SALLY,ABOUASSAF,30058430,706.5,"PUBLIC HEALTH DENTIST,DDS",3800,Health & Human Services,29,2010,Essential Public Health Services,3800070010,HHS-GeriatricOralHlt,30021232.0,DEBORAH GARCIA,100791.0,38.5023,HHS-ORAL HEALTH,"WHITE, NOT OF HISPANIC ORIGIN",White,Full Time,Exempt Excptn,E,Female,Active,CIVILIAN,2015-12-28,2015-12-28,2015-12-28,B: Professionals,2016-06-01
237,10297,MARIA,IRSHAD,30057398,306.2,ASSISTANT DIRECTOR (EXECUTIVE LEVEL),6500,Admn. & Regulatory Affairs,32,8700,Parking Mangement Operating Fund,6500090001,ARA-Parking Cust Ser,30030918.0,ERNESTINA PAEZ,130416.0,65.1700,ARA-PARKING MANAGEMENT,ASIAN OR PACIFIC ISLANDER,Asian/Pacific Islander,Full Time,Exempt Excptn,E,Female,Active,CIVILIAN,2002-05-24,2013-07-20,2002-05-24,A: Officials & Administrators,2016-06-01
366,29194,ELEANOR,HOLMAN,30060598,306.3,DEPUTY ASSISTANT DIRECTOR (EXECUTIVE LEV,5000,Mayor's Office,30,2429,Houston Civic Events Fund,5000030001,MYR-Hou Civic Events,30062588.0,SUSAN CHRISTIAN,110000.0,50.1430.05,MYR-SPECIAL EVENTS,"WHITE, NOT OF HISPANIC ORIGIN",White,Full Time,Exempt Excptn,E,Female,Active,CIVILIAN,2014-05-13,2014-05-13,2014-05-13,A: Officials & Administrators,2016-06-01
522,24914,KERTECIA,MOND,30049334,306.3,DEPUTY ASSISTANT DIRECTOR (EX LVL),2800,Houston Airport System (HAS),30,8001,HAS-Revenue Fund,2800010003,HAS-DO Intern Audit,30047588.0,BALRAM BHEODARI,110686.0,28.1113.1000,HAS-DO INTERNAL AUDIT,"BLACK, NOT OF HISPANIC ORIGIN",Black or African American,Full Time,Exempt Excptn,E,Female,Active,CIVILIAN,2011-11-07,2011-11-07,2011-11-07,A: Officials & Administrators,2016-06-01
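
The criteria above were combined with `&` ("and"). As a sketch on made-up data, the example below also shows `|` ("or") and `~` ("not"). When comparisons are written inline, as in the cell above, each one needs its own parentheses because these operators bind more tightly than `>` and `==`.

```python
import pandas as pd

# Made-up data for illustration
toy = pd.DataFrame({'salary': [120000, 45000, 101000, 98000],
                    'gender': ['Female', 'Male', 'Male', 'Female']})

high = toy['salary'] > 100000
female = toy['gender'] == 'Female'

# rows where either condition holds
high_or_female = toy[high | female]

# rows with a high salary that are not female
high_not_female = toy[high & ~female]

print(high_or_female)
print(high_not_female)
```

Storing each condition in its own variable, as done here, keeps longer criteria readable.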


In [206]:
df_coh['RACE'].value_counts()

Black or African American            700
White                                665
Hispanic/Latino                      480
Asian/Pacific Islander               107
American Indian or Alaskan Native     11
Others                                 2
Name: RACE, dtype: int64

In [207]:
# your code here
# (uncomment each line and complete the right-hand side)
# white_female = 
# black_male = 
# over_100k = 

# criteria = 

### Problem 6
<span  style="color:green; font-size:16px">What is the third most common department?</span>

In [None]:
# your code here

### Do men or women make more money?
Boolean indexing can help answer this question. We first create two new DataFrames that are filtered for each gender and then find the mean of the salary. Men make about $5,000 more.

In [None]:
men = df_coh[df_coh['GENDER'] == 'Male']
women = df_coh[df_coh['GENDER'] == 'Female']

men['BASE_SALARY'].mean(), women['BASE_SALARY'].mean()

### All unique values of a column
Many times it's nice to know all the possible values that a column can take. I usually use the **`value_counts`** method, which counts the frequencies, but the **`unique`** method is also available.

In [None]:
# Get all unique values of RACE
df_coh['RACE'].unique()

In [None]:
# but I prefer value_counts which returns frequencies as well
df_coh['RACE'].value_counts()
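
One difference worth knowing when a column has missing values, as **RACE** does: `unique` includes the missing value, while `value_counts` leaves it out unless `dropna=False` is passed. Here is a small sketch on a made-up Series.

```python
import numpy as np
import pandas as pd

# A made-up Series with one missing value
s = pd.Series(['White', 'Others', np.nan, 'White'])

# unique includes the missing value
distinct = s.unique()
print(distinct)

# value_counts excludes missing values by default
counts = s.value_counts()
print(counts)

# dropna=False adds a count for the missing values
counts_all = s.value_counts(dropna=False)
print(counts_all)
```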

### Problem 7
<span  style="color:green; font-size:16px">Who makes more money, 'Black or African American' Females or White Males? Use RACE and GENDER columns.</span>

In [None]:
# your code here

### Problem 8
<span  style="color:green; font-size:16px">Who makes more money, 'POLICE OFFICER', 'ENGINEER/OPERATOR' or 'ELECTRICIAN'? Use the POSITION_TITLE column.</span>

In [None]:
# your code here

### Is there a better way to answer these last couple of questions?
Yes! The **groupby** method allows for grouping of these categorical variables and will be explained in greater detail in the next notebook.
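
As a rough preview only, here is a sketch of the idea on made-up data. With `df_coh`, the analogous call would presumably group by **GENDER** and average **BASE_SALARY**.

```python
import pandas as pd

# Made-up data for illustration
toy = pd.DataFrame({'gender': ['Male', 'Female', 'Male', 'Female'],
                    'salary': [60000.0, 52000.0, 40000.0, 58000.0]})

# One mean per distinct value of gender, in a single call
mean_by_gender = toy.groupby('gender')['salary'].mean()
print(mean_by_gender)
```

This replaces the pattern of building one filtered DataFrame per category and computing each mean separately.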

### Selecting a single employee
Many datasets have a column with an integer that uniquely identifies each row.  In database speak this column is called the table's **primary key**. The primary key allows for easy and direct access to each employee.

In the **df_coh** table it appears that the first column, **UNIQUE_ID**, is the primary key. Most good datasets will have a **data dictionary** that describes each column of the table so you won't have to guess what the primary key is. The data dictionary (a.k.a. metadata) for the current dataset can be [found here.](http://data.ohouston.org/dataset/city-of-houston-current-employee-roster/resource/98448c04-e76f-4fa0-8916-12786e6e5883)

### Ensuring uniqueness
The most important aspect of a primary key is its uniqueness.

In [None]:
# ensure that unique_id is unique
# use nunique Series method

df_coh['UNIQUE_ID'].nunique()

In [None]:
# 2000 is the total number of rows so unique_id is indeed unique
len(df_coh)
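
A more direct check, sketched here on made-up ID values, is the `is_unique` attribute of a Series. The `duplicated` method shows which values repeat an earlier one.

```python
import pandas as pd

# Made-up ID columns: one unique, one with a repeated value
ids = pd.Series([9172, 12311, 28693])
ids_with_repeat = pd.Series([9172, 12311, 9172])

print(ids.is_unique)
print(ids_with_repeat.is_unique)

# duplicated flags every value that has already appeared earlier
repeats = ids_with_repeat.duplicated()
print(repeats)
```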

### Using boolean indexing to select an employee
Now that it is clear that **UNIQUE_ID** is indeed unique, let's practice selecting Employee **28693** using boolean indexing.

### Problem 9
<span  style="color:green; font-size:16px">Select employee 28693.</span>

In [None]:
# your code here

### Taking advantage of the index
Selecting an employee with **`df_coh[df_coh['UNIQUE_ID'] == 28693]`** is a bit cumbersome. A more elegant way involves replacing the current index with **UNIQUE_ID**. This is easily done with the **set_index** method and will allow for more straightforward employee selections.

In [None]:
df_coh = pd.read_csv('data/coh_employee.csv')

In [None]:
# use set_index
# must assign result to save it
df_coh = df_coh.set_index('UNIQUE_ID')

In [None]:
# inspect frame
df_coh.head()

### Inspecting the new DataFrame output
The index is now meaningful. The index, which previously was just a range beginning at 0, is now the employee ID. The name **UNIQUE_ID** is now the **name** of the index. The values of the index are still displayed in **bold**, a reminder that these values are part of the index and not a column.

You should notice the blank first row in the new DataFrame above. This is actually not a row at all but extra space that is formed because the index now has a name.

In [None]:
# the index object has a name attribute
df_coh.index.name

In [None]:
# if that first empty row bothers you, you can remove the name of the index
# (assigning None is used here because `del df_coh.index.name` fails in recent pandas versions)
df_coh.index.name = None

In [None]:
# index still has the same meaning but does not have a name.
# empty row space is now gone

df_coh.head()
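
The round trip can be sketched on a tiny made-up frame: `set_index` returns a new frame and leaves the original alone, and `reset_index` moves the index back into a column.

```python
import pandas as pd

# A tiny made-up frame; column order is given explicitly
toy = pd.DataFrame({'UNIQUE_ID': [9172, 12311],
                    'FIRST_NAME': ['LILLIAN', 'ESPERANZA']},
                   columns=['UNIQUE_ID', 'FIRST_NAME'])

# set_index returns a new frame; toy itself is unchanged
indexed = toy.set_index('UNIQUE_ID')

# remove the index name
indexed.index.name = None

# reset_index moves the index back into a column; because the
# index has no name, the new column is called 'index'
restored = indexed.reset_index()
print(restored)
```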

### Setting the index on read
When first reading the dataset using **`read_csv`**, the argument **`index_col`** can be given the integer position of the column to use as the index.

In [None]:
df_coh = pd.read_csv('data/coh_employee.csv', index_col=0)
df_coh.head()
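
**`index_col`** also accepts a column name, which can be easier to read than a position. The sketch below uses made-up CSV text read from memory instead of the real file.

```python
import io
import pandas as pd

# Made-up CSV text with the same first two columns as the dataset
raw = io.StringIO(
    "UNIQUE_ID,FIRST_NAME\n"
    "9172,LILLIAN\n"
    "12311,ESPERANZA\n"
)

# Name the index column instead of giving its integer position
by_name = pd.read_csv(raw, index_col='UNIQUE_ID')
print(by_name.index.name)
print(by_name.loc[12311, 'FIRST_NAME'])
```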

### Selecting certain rows
Selecting columns was just covered above and is done using the [ ] operator. Rows can be selected using **`.iloc`** and **`.loc`** as was done with the Series.

In [None]:
# get the first 3 rows by integer location
df_coh.iloc[:3]

In [None]:
# get row 155
# returns a Series

df_coh.iloc[155]

### Problem 10
<span  style="color:green; font-size:16px">Select rows 7, 77 and 777 from **`df_coh`**.</span>

In [None]:
# your code here

In [None]:
# Select using the index label
df_coh.loc[28693]

In [None]:
# the above returns a Series
# if you want a DataFrame, wrap labels in a list

df_coh.loc[[28693]]

Notice how selecting an employee with the **`UNIQUE_ID`** in the index with **`.loc`** is easier than with boolean indexing.

In [None]:
# it is actually possible to slice from one index label to another
# but you must know the ordering

df_coh.loc[9172:28693]

In [None]:
# the same slice as above but reversed. Nothing is selected
df_coh.loc[28693:9172]
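
Label slices behave differently from integer slices, which is easier to see on a few rows. The sketch below uses a hypothetical four-row frame whose index labels are in a deliberately unsorted order.

```python
import pandas as pd

# made-up frame with unique, unsorted integer labels
toy = pd.DataFrame({'FIRST_NAME': ['ANN', 'BOB', 'CAL', 'DEE']},
                   index=[9172, 12311, 28693, 2359])

forward = toy.loc[9172:28693]    # label slice: both end labels are included
backward = toy.loc[28693:9172]   # start label sits after the stop label: empty result
positional = toy.iloc[0:2]       # integer slice: the stop position is excluded

print(len(forward), len(backward), len(positional))
```

The slice follows the order in which the labels appear in the index, not their numeric order, which is why knowing the ordering matters.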

### Problem 11
<span  style="color:green; font-size:16px">Select employees with IDs 3105, 24767 and 31578.</span>

In [None]:
# your code here

### Selecting a subset of rows and columns together
We have now looked at selecting subsets of columns and rows independently. We will now select subsets of both columns and rows together with **`.iloc`** and **`.loc`**. Within the brackets, a **comma** is used to separate the row references (which come first) from the column references.

In [None]:
# using integer location select rows 100:105 and the first 5 columns
df_coh.iloc[100:105, :5]
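
As a rough illustration of the rows-then-columns pattern, the sketch below selects the same block twice on a made-up frame, once by position and once by label. The `toy` frame and its values are hypothetical.

```python
import pandas as pd

# made-up frame with employee-style columns
toy = pd.DataFrame({'FIRST_NAME': ['ANN', 'BOB', 'CAL', 'DEE'],
                    'GENDER': ['Female', 'Male', 'Male', 'Female'],
                    'BASE_SALARY': [50000, 42000, 61000, 58000]},
                   index=[9172, 12311, 28693, 2359])

# rows go before the comma, columns after it
block_by_position = toy.iloc[1:3, :2]
block_by_label = toy.loc[[12311, 28693], ['FIRST_NAME', 'GENDER']]

print(block_by_position.equals(block_by_label))
```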

### Problem 12
<span  style="color:green; font-size:16px">A list inside the brackets is used to select disjoint rows or columns. Select rows 10, 100 and 500 along with columns 3, 7 and 20.</span>

In [None]:
# your code here

### Problem 13
<span  style="color:green; font-size:16px">Use label based indexing to select employee IDs 12214 and 8317 along with columns HIRE_DATE and GENDER.</span>

In [None]:
# your code here

### Problem 14: Advanced
<span  style="color:green; font-size:16px">Use label based indexing to select employee IDs 12214 and 8317 along with all columns except FIRST_NAME and POSITION_NUMBER. [Use this stackoverflow answer for guidance.](http://stackoverflow.com/a/37441204/3707607)</span>

In [None]:
# your code here

### Some confusion with brackets vs .iloc vs .loc
It was written earlier that only column names go inside of brackets for DataFrames. This isn't entirely true. It is possible to pass a slice to the brackets to return rows. I would highly recommend against doing this. Use brackets only for column names, **.loc** for label based selection and **.iloc** for integer position selection.

In [None]:
# select the first 5 rows with brackets - bad idea!
df_coh[:5]

### Using **`.ix`** to mix both integer and label indexing
The **`.ix`** operator provides some flexibility if a case ever arises where selection needs to occur with both label and integer. I tend to avoid this operator as it is still ambiguous whether you are using integer or label selection when the index is integers. Note that **`.ix`** was deprecated in pandas 0.20 and removed in pandas 1.0, so it is not available in current versions. [Check the docs for more info](http://pandas.pydata.org/pandas-docs/stable/indexing.html#different-choices-for-indexing)
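
On pandas versions without **`.ix`**, one way to mix a row label with a column position is to translate one of them so that a single indexer can do the job. This is a sketch on a made-up frame, not the only possible approach.

```python
import pandas as pd

# made-up frame with integer labels in the index
toy = pd.DataFrame({'FIRST_NAME': ['ANN', 'BOB', 'CAL'],
                    'GENDER': ['Female', 'Male', 'Male'],
                    'BASE_SALARY': [50000, 42000, 61000]},
                   index=[9172, 12311, 28693])

# row by label, column by position: turn the position into a name for .loc
via_loc = toy.loc[28693, toy.columns[2]]

# or turn the label into a position for .iloc
via_iloc = toy.iloc[toy.index.get_loc(28693), 2]

print(via_loc, via_iloc)
```

Either form makes it explicit which side is a label and which is a position, which removes the ambiguity described above.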

### Selecting single elements with `.at` and `.iat`
In the rare case that you would like to select exactly one cell of data, you can use **`.at`** for label based selection and **`.iat`** for integer based selection. They work analogously to **`.loc`** and **`.iloc`** and don't provide any extra functionality, but they are faster than their counterparts. Technically, a single element is called a **scalar** value.

In [None]:
# scalar selection
df_coh.at[5123, 'GENDER']

In [None]:
# scalar selection with .loc
df_coh.loc[5123, 'GENDER']
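
For a side-by-side view of all four scalar selections, here is a small sketch on a made-up frame (the names and IDs are hypothetical).

```python
import pandas as pd

toy = pd.DataFrame({'FIRST_NAME': ['ANN', 'BOB', 'CAL'],
                    'GENDER': ['Female', 'Male', 'Male']},
                   index=[9172, 12311, 28693])

by_label = toy.at[28693, 'GENDER']   # row label, column name
by_position = toy.iat[2, 1]          # row position, column position

# .loc and .iloc return the same scalar
same_as_loc = toy.loc[28693, 'GENDER']
same_as_iloc = toy.iloc[2, 1]

print(by_label, by_position, same_as_loc, same_as_iloc)
```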

### Problem 15
<span  style="color:green; font-size:16px">Use **`.iat`** correctly and explain what happened.</span>

In [None]:
# your code here

### Problem 16
<span  style="color:green; font-size:16px">Use the **`timeit`** magic command to see the speed difference between **`.at`** and **`.loc`** for the same selection. How much faster is **`.at`**?</span>

In [None]:
# your code here

### Adding new columns to the DataFrame
A very common task during analysis is to add new columns to the working DataFrame. Usually, some operation is performed using the existing columns with the outcome added as a new column.

In [None]:
# add a constant value as a new column
# the new column is always last

df_coh['NEW_CONSTANT_COLUMN'] = 5

In [None]:
#inspect dataframe
# look at first three rows and last 5 columns
# sure enough, there is our new constant column

df_coh.iloc[:3, -5:]

In [None]:
# making a new column: FULL_NAME
# columns of object type holding strings can be concatenated just like strings

df_coh['FULL_NAME'] = df_coh['FIRST_NAME'] + ' ' + df_coh['LAST_NAME']

In [None]:
# inspect the name columns to see if new column is as intended
df_coh[['FIRST_NAME', 'LAST_NAME', 'FULL_NAME']].head()

### Adding a random numeric column
Let's say the city of Houston wants to give everyone a random bonus ranging anywhere between 0 and 10% of their current base salary. To begin, we will make a column called **RANDOM_BONUS** and assign each employee a random number between 0 and .1.

In [None]:
# create a numpy array the same length as the DataFrame
# with random values between 0 and .1

df_coh['RANDOM_BONUS'] = np.random.rand(len(df_coh)) * .1

In [None]:
# inspect last columns

df_coh.iloc[:3, -5:]
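
Because the numbers are random, every run of the cell above produces a different column. If repeatable results are wanted, one option is a seeded numpy generator. The sketch below assumes a made-up three-row frame and an arbitrary seed of 0.

```python
import numpy as np
import pandas as pd

# made-up salaries for illustration
toy = pd.DataFrame({'BASE_SALARY': [50000.0, 42000.0, 61000.0]})

# a generator with a fixed seed gives the same numbers on every run
rng = np.random.default_rng(0)
toy['RANDOM_BONUS'] = rng.uniform(0, 0.1, size=len(toy))

print(toy)
```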

### Data Types
Every column of the DataFrame has a data type. This was seen at the top of the notebook with the **`info`** method. To get a cleaner view of the data types, use the **dtypes** attribute. The most common data types are **int64, float64, bool, datetime64, timedelta64 and object**. These are [all numpy data types](https://docs.scipy.org/doc/numpy/user/basics.types.html).

The **object** is a catch-all for any strings, complex Python data types like lists, dictionaries, etc... or any mix of strings and numbers.

The **datetime64** data type holds specific moments in time such as Jan 24, 2017 5:30 p.m. **timedelta64** holds amounts of time such as 6 days, 4 hours, 14 minutes and x nanoseconds. Both **datetime64** and **timedelta64** have nanosecond precision. These time data types will be discussed in great detail in later notebooks.

In [None]:
# only look at the data types
df_coh.dtypes
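
To see how pandas picks these types, the sketch below builds a made-up frame with one column per common data type. The column names are hypothetical and chosen only for this illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'n_items': [1, 2],                 # whole numbers
    'ratio': [0.5, 1.5],               # decimals
    'flag': [True, False],             # booleans
    'mixed': ['a', 1],                 # strings mixed with numbers
    'when': pd.to_datetime(['2017-01-24 17:30', '2017-01-25 09:00']),
})

# subtracting datetimes produces a timedelta column
toy['elapsed'] = toy['when'] - toy['when'].iloc[0]

print(toy.dtypes)
```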

### Changing Data Types
Occasionally, your data will not be of the desired type and will need to be changed to a different type. Perhaps the most common situation is when dealing with dates that are read in as strings and thus are objects. The Series method **astype** will attempt to force a column to a different type.

The argument passed to **astype** must be the name of the new data type as either a string or a numpy object. The **HIRE_DATE** column is currently a string and will be converted to a **datetime64** and replace the old column. The other date columns will be converted in the same cell.

In [None]:
# using a string (the unit in brackets is required by newer pandas versions)
df_coh['HIRE_DATE'] = df_coh['HIRE_DATE'].astype('datetime64[ns]')

# using the numpy object
df_coh['COMP_DATE'] = df_coh['COMP_DATE'].astype(np.dtype('datetime64[ns]'))

# yet another way using a function with flexibility to do more things
df_coh['SNAPSHOT_DATE'] = pd.to_datetime(df_coh['SNAPSHOT_DATE'])
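
One example of the extra flexibility of **`to_datetime`** is its **`errors`** argument. The sketch below uses a made-up Series containing one bad value and shows that `errors='coerce'` turns it into a missing value instead of raising an error.

```python
import pandas as pd

# made-up date strings, the last one is not a valid date
raw = pd.Series(['2013-11-04', '2010-03-29', 'not a date'])

# unparseable entries become NaT (the missing value for datetimes)
parsed = pd.to_datetime(raw, errors='coerce')

print(parsed)
print(parsed.isna().sum())
```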

# Case Study: Do people with more experience make more money?
To answer this question, the number of years of experience needs to be calculated from the column **HIRE_DATE**. **datetime64** columns can be subtracted from one another. The current date will be needed and can be retrieved from the **to_datetime** pandas function.

In [None]:
# get today's date
today = pd.to_datetime('today')

today

In [None]:
# subtract the hire date from today to get the number of days of experience
experience = today - df_coh['HIRE_DATE']

# print out head of series
experience.head()

### Converting to years
Notice that the data type is now **timedelta64**, which just represents an amount of time in days. To convert this to years an esoteric command must be run. [See here for more detail](http://pandas.pydata.org/pandas-docs/stable/timedeltas.html#frequency-conversion)

In [None]:
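# convert to years by dividing by the length of an average year
# (newer pandas versions reject the 'Y' unit of np.timedelta64)
years_experience = experience / pd.Timedelta(days=365.2425)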

# inspect and check that it makes sense
years_experience.head()

In [None]:
# Make a new column
df_coh['YEARS_EXPERIENCE'] = years_experience
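
A quick way to check that such a conversion makes sense is to run it on a couple of known dates. This sketch assumes two made-up hire dates and a fixed reference date of 2016-06-01 instead of today, so the result does not change from day to day.

```python
import pandas as pd

# made-up hire dates and a fixed reference date
hire = pd.Series(pd.to_datetime(['2000-03-21', '2013-11-04']))
snapshot = pd.Timestamp('2016-06-01')

experience = snapshot - hire     # timedelta64 Series

# dividing by a one-year Timedelta gives plain floats
years = experience / pd.Timedelta(days=365.2425)

# the same thing through the number of whole days
years_from_days = experience.dt.days / 365.2425

print(years)
```

The first employee was hired a little over 16 years before the reference date, so a value slightly above 16 is expected.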

### Creating categories for years of experience
It's possible to divide numerical columns into different categories based on their value. The pandas **cut** function accepts a Series or an array and a list of the edges of the **bins**. Each category can be given a **label** as well. A Series is returned that is of **categorical** type, which is unique to pandas. [More on categorical data](http://pandas.pydata.org/pandas-docs/stable/categorical.html)

In [None]:
# create Series of categorical data
exp_categories = pd.cut(years_experience, bins=[0, 5, 15, 100], labels=['Novice', 'Experienced', 'Senior'])

In [None]:
# inspect Series
exp_categories.head(10)

In [None]:
# get some summary statistics
exp_categories.value_counts()

In [None]:
# Create new column
df_coh['EXPERIENCE_LEVEL'] = exp_categories
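
It is worth knowing how values that sit exactly on a bin edge are treated. By default **`cut`** includes the right edge of each bin and excludes the left one. The sketch below applies the same bins and labels to a few made-up values chosen to land on the edges.

```python
import pandas as pd

# made-up years of experience, several sitting exactly on a bin edge
years = pd.Series([0.0, 0.5, 5.0, 5.1, 15.0, 40.0])

levels = pd.cut(years, bins=[0, 5, 15, 100],
                labels=['Novice', 'Experienced', 'Senior'])

# 0.0 is outside the first bin (0, 5] and becomes missing
# 5.0 and 15.0 fall into the bin that ends at that edge
print(pd.DataFrame({'years': years, 'level': levels}))
```

If values on the lowest edge should be kept, `cut` has an `include_lowest` argument that can be set to `True`.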

# Section end: You should know...
* A DataFrame has an index for both the columns and the rows
* Import data with **`pd.read_csv`**
* Use **`shape, head, info, describe`** as standard operations to inspect a newly imported DataFrame
* Select a single column (a Series) with brackets
* Select multiple columns with a list of column names in brackets
* Be explicit when selecting rows with **`iloc`** and **`loc`**
* Select rows and columns together with **`iloc`** and **`loc`** by separating row and column references with a **comma**
* Use boolean indexing with multiple column comparisons
* Use **`set_index`** to move a column into the index
* Primary keys are good choices for the index
* Know the most common numpy data types **`int64, float64, bool, datetime64, timedelta64 and object`**
* Change data types with the **`astype`** method
* Create new columns
* Use the **`cut`** function to create categorical variables

# Problems
If you have not run the code from the beginning of the notebook then run the code in the block below before beginning the problems. Most of the problems will be about the city of Houston employee dataset.

In [273]:
# run this code block first
df_coh = pd.read_csv('data/coh_employee.csv', index_col=0)
df_coh['RANDOM_BONUS'] = np.random.rand(len(df_coh)) * .1
df_coh['HIRE_DATE'] = df_coh['HIRE_DATE'].astype('datetime64[ns]')
today = pd.to_datetime('today')
experience = today - df_coh['HIRE_DATE']
years_experience = experience / pd.Timedelta(days=365.2425)
df_coh['YEARS_EXPERIENCE'] = years_experience
exp_categories = pd.cut(years_experience, bins=[0, 5, 15, 100], labels=['Novice', 'Experienced', 'Senior'])
df_coh['EXPERIENCE_LEVEL'] = exp_categories

### Problem 1
<span  style="color:green; font-size:16px">What object is returned by the code **`df_coh.describe(include=['object']).T`**?</span>

In [274]:
df_coh.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2000 entries, 9172 to 21962
Data columns (total 33 columns):
FIRST_NAME              2000 non-null object
LAST_NAME               2000 non-null object
POSITION_NUMBER         2000 non-null int64
POSITION_JOB_CODE       2000 non-null float64
POSITION_TITLE          2000 non-null object
BUSINESS_AREA           2000 non-null int64
DEPARTMENT              2000 non-null object
PAY_GRADE               2000 non-null object
FUND_ID                 2000 non-null int64
FUND_NAME               2000 non-null object
COST_CENTER             2000 non-null int64
COST_CENTER_NAME        2000 non-null object
REPORTS_TO_POSITION     1688 non-null float64
MANAGER_NAME            1678 non-null object
BASE_SALARY             1886 non-null float64
ORG_UNIT                2000 non-null object
ORG_UNIT_NAME           2000 non-null object
ETHNICITY               1998 non-null object
RACE                    1965 non-null object
EMPLOYMENT_TYPE         2000 non-nu

In [275]:
df_coh.describe(include=['object']).T

Unnamed: 0,count,unique,top,freq
FIRST_NAME,2000,961,MICHAEL,44
LAST_NAME,2000,1371,JOHNSON,30
POSITION_TITLE,2000,330,SENIOR POLICE OFFICER,220
DEPARTMENT,2000,24,Houston Police Department-HPD,638
PAY_GRADE,2000,80,PA03,184
FUND_NAME,2000,39,General Fund,1304
COST_CENTER_NAME,2000,448,HFD-Deployment,297
MANAGER_NAME,1678,1201,HERMAN GONZALES,56
ORG_UNIT,2000,1181,12.1210,288
ORG_UNIT_NAME,2000,478,HFD-FIRE & EMS OPERATIONS,289


In [276]:
df_coh.describe(include=['float64']).T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
POSITION_JOB_CODE,2000.0,372.6786,283.143263,47.0,108.0,303.5,568.15,978.2
REPORTS_TO_POSITION,1688.0,30031670.0,23372.95736,30000040.0,30006610.0,30033950.0,30055020.0,30067370.0
BASE_SALARY,1886.0,55767.93,21693.706679,24960.0,40170.0,54461.0,66614.0,275000.0
RANDOM_BONUS,2000.0,0.05092327,0.02894,2.945107e-05,0.02552164,0.05210021,0.07705689,0.09992049
YEARS_EXPERIENCE,2000.0,14.13133,10.581658,0.5366298,4.659918,11.54165,22.52202,57.95875


### Problem 2
<span  style="color:green; font-size:16px">What object are the values of an **`Index`**?</span>

In [277]:
type(df_coh.index.values[0])

numpy.int64

### Problem 3
<span  style="color:green; font-size:16px">Retrieve the LAST_NAME column and assign it to a variable. Count the number of employees with a last name of 'JOHNSON'. Do the problem again in one line of code.</span>

In [278]:
last_names = df_coh.LAST_NAME
(last_names == 'JOHNSON').sum()

30

In [279]:
df_coh[df_coh.LAST_NAME == 'JOHNSON'].shape[0]

30

In [280]:
(df_coh.LAST_NAME == 'JOHNSON').sum()

30

### Problem 4
<span  style="color:green; font-size:16px">Count all the occurrences of each value in the column RACE. Remember there is a single Series method that does this.</span>

In [281]:
df_coh.RACE.value_counts()

Black or African American            700
White                                665
Hispanic/Latino                      480
Asian/Pacific Islander               107
American Indian or Alaskan Native     11
Others                                 2
Name: RACE, dtype: int64

### Problem 5
<span  style="color:green; font-size:16px">How many employees are White Females or 'Black or African American' Males and make more than 100,000? Create a variable for each of the three conditions before putting them all together in one criterion.</span>

In [282]:
# your code here
white_female = (df_coh.GENDER == 'Female') & (df_coh.RACE == 'White')
black_male = (df_coh.GENDER == 'Male') & (df_coh.RACE == 'Black or African American')
over_100k = df_coh.BASE_SALARY > 100000

criteria = (white_female | black_male) & over_100k
criteria.sum()

20

### Problem 6
<span  style="color:green; font-size:16px">What is the third most common department?</span>

In [283]:
# df_coh.DEPARTMENT.value_counts().iloc[:3]
df_coh.DEPARTMENT.value_counts().index[2]

'Public Works & Engineering-PWE'

### Problem 7
<span  style="color:green; font-size:16px">Who makes more money, 'Black or African American' Females or White Males? Use RACE and GENDER columns.</span>

In [284]:
criteria = (df_coh.RACE == 'Black or African American') & (df_coh.GENDER == 'Female')
print('Black or African American Females: ${}'.format(int(df_coh.loc[criteria, 'BASE_SALARY'].mean())))

criteria = (df_coh.RACE == 'White') & (df_coh.GENDER == 'Male')
print('White Males: ${}'.format(int(df_coh.loc[criteria, 'BASE_SALARY'].mean())))

Black or African American Females: $48915
White Males: $63940


In [285]:
criteria = (df_coh.RACE == 'Black or African American') & (df_coh.GENDER == 'Female') & (df_coh.POSITION_TITLE == 'SENIOR POLICE OFFICER')
print('Black or African American Females: ${}'.format(int(df_coh.loc[criteria, 'BASE_SALARY'].mean())))

criteria = (df_coh.RACE == 'White') & (df_coh.GENDER == 'Male') & (df_coh.POSITION_TITLE == 'SENIOR POLICE OFFICER')
print('White Males: ${}'.format(int(df_coh.loc[criteria, 'BASE_SALARY'].mean())))

Black or African American Females: $65903
White Males: $66341


In [286]:
criteria = (df_coh.RACE == 'Black or African American') & (df_coh.GENDER == 'Female') & (df_coh.POSITION_TITLE == 'POLICE OFFICER')
print('Black or African American Females: ${}'.format(int(df_coh.loc[criteria, 'BASE_SALARY'].mean())))

criteria = (df_coh.RACE == 'White') & (df_coh.GENDER == 'Male') & (df_coh.POSITION_TITLE == 'POLICE OFFICER')
print('White Males: ${}'.format(int(df_coh.loc[criteria, 'BASE_SALARY'].mean())))

Black or African American Females: $55105
White Males: $52873


In [287]:
criteria = (df_coh.RACE == 'Black or African American') & (df_coh.GENDER == 'Female') & (df_coh.POSITION_TITLE == 'SENIOR POLICE OFFICER')
print('Black or African American Females: {} years'.format(int(df_coh.loc[criteria, 'YEARS_EXPERIENCE'].mean())))

criteria = (df_coh.RACE == 'White') & (df_coh.GENDER == 'Male') & (df_coh.POSITION_TITLE == 'SENIOR POLICE OFFICER')
print('White Males: {} years'.format(int(df_coh.loc[criteria, 'YEARS_EXPERIENCE'].mean())))

Black or African American Females: 23 years
White Males: 28 years


In [288]:
criteria = (df_coh.RACE == 'Black or African American') & (df_coh.GENDER == 'Female') & (df_coh.POSITION_TITLE == 'POLICE OFFICER')
print('Black or African American Females: {} years'.format(int(df_coh.loc[criteria, 'YEARS_EXPERIENCE'].mean())))

criteria = (df_coh.RACE == 'White') & (df_coh.GENDER == 'Male') & (df_coh.POSITION_TITLE == 'POLICE OFFICER')
print('White Males: {} years'.format(int(df_coh.loc[criteria, 'YEARS_EXPERIENCE'].mean())))

Black or African American Females: 8 years
White Males: 7 years


In [289]:
df_coh.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2000 entries, 9172 to 21962
Data columns (total 33 columns):
FIRST_NAME              2000 non-null object
LAST_NAME               2000 non-null object
POSITION_NUMBER         2000 non-null int64
POSITION_JOB_CODE       2000 non-null float64
POSITION_TITLE          2000 non-null object
BUSINESS_AREA           2000 non-null int64
DEPARTMENT              2000 non-null object
PAY_GRADE               2000 non-null object
FUND_ID                 2000 non-null int64
FUND_NAME               2000 non-null object
COST_CENTER             2000 non-null int64
COST_CENTER_NAME        2000 non-null object
REPORTS_TO_POSITION     1688 non-null float64
MANAGER_NAME            1678 non-null object
BASE_SALARY             1886 non-null float64
ORG_UNIT                2000 non-null object
ORG_UNIT_NAME           2000 non-null object
ETHNICITY               1998 non-null object
RACE                    1965 non-null object
EMPLOYMENT_TYPE         2000 non-nu

### Problem 8
<span  style="color:green; font-size:16px">Who makes more money, 'POLICE OFFICER', 'ENGINEER/OPERATOR' or 'ELECTRICIAN'? Use the POSITION_TITLE column.</span>

In [290]:
print('Police officer: ${}'.format(int(df_coh.loc[df_coh.POSITION_TITLE=='POLICE OFFICER', 'BASE_SALARY'].mean())))
print('Engineer/Operator: ${}'.format(int(df_coh.loc[df_coh.POSITION_TITLE=='ENGINEER/OPERATOR', 'BASE_SALARY'].mean())))
print('Electrician: ${}'.format(int(df_coh.loc[df_coh.POSITION_TITLE=='ELECTRICIAN', 'BASE_SALARY'].mean())))

Police officer: $52592
Engineer/Operator: $62606
Electrician: $49816


### Problem 9
<span  style="color:green; font-size:16px">Select employee 28693.</span>

In [291]:
df_coh.loc[28693]

FIRST_NAME                                   JONATHAN
LAST_NAME                                  JORSCH JR.
POSITION_NUMBER                              30063393
POSITION_JOB_CODE                                 108
POSITION_TITLE                         POLICE OFFICER
BUSINESS_AREA                                    1000
DEPARTMENT              Houston Police Department-HPD
PAY_GRADE                                        PA03
FUND_ID                                          1000
FUND_NAME                                General Fund
COST_CENTER                                1000010068
COST_CENTER_NAME                 HPD-Vehicular Crimes
REPORTS_TO_POSITION                       3.00609e+07
MANAGER_NAME                         CATRINA WILLIAMS
BASE_SALARY                                     45279
ORG_UNIT                                 10.1687.0130
ORG_UNIT_NAME           HPD-VEHICULAR CRIMES DIVISION
ETHNICITY               WHITE, NOT OF HISPANIC ORIGIN
RACE                        

### Problem 10
<span  style="color:green; font-size:16px">Select rows 7, 77 and 777 from **`df_coh`**.</span>

In [292]:
df_coh.iloc[[7, 77, 777],:]

UNIQUE_ID,FIRST_NAME,LAST_NAME,POSITION_NUMBER,POSITION_JOB_CODE,POSITION_TITLE,BUSINESS_AREA,DEPARTMENT,PAY_GRADE,FUND_ID,FUND_NAME,COST_CENTER,COST_CENTER_NAME,REPORTS_TO_POSITION,MANAGER_NAME,BASE_SALARY,ORG_UNIT,ORG_UNIT_NAME,ETHNICITY,RACE,EMPLOYMENT_TYPE,EMPLOYMENT_SUB_GROUP,EXEMPT,GENDER,EMPLOYMENT_STATUS,CIVIL_SERVICE_TYPE,HIRE_DATE,JOB_DATE,COMP_DATE,EEOJ,SNAPSHOT_DATE,RANDOM_BONUS,YEARS_EXPERIENCE,EXPERIENCE_LEVEL
28212,JEFFREY,DELLING,30043726,520.3,CARPENTER,2800,Houston Airport System (HAS),14,8001,HAS-Revenue Fund,2800040017,HAS-IAH-AMG A&G,30047472.0,SHANTEL WOODS,42390.0,28.1377.3000,HAS-IAH SCHEDULED MAINTENANCE,"WHITE, NOT OF HISPANIC ORIGIN",White,Full Time,Non Exempt Postv,N,Male,Active,CIVILIAN,2013-11-04,2013-11-04,2013-11-04,G: Skilled Craft Workers,2016-06-01,0.076101,3.107524,Novice
23457,DANIEL,RIVERA,30036326,108.0,POLICE OFFICER,1000,Houston Police Department-HPD,PA03,1000,General Fund,1000010051,HPD-Vice,30059301.0,JAMES WALKER,52514.0,10.1675.0050,HPD-VICE,HISPANIC,Hispanic/Latino,Full Time,Non-Exempt Excptn,N,Male,Active,CLASSIFIED,2010-03-29,2011-03-29,2010-03-29,D: Protective Service Workers,2016-06-01,0.051389,6.71061,Experienced
31942,RAMON,GUILLEN,30064705,108.6,"POLICE OFFICER,PROBATIONARY",1000,Houston Police Department-HPD,PA02,1000,General Fund,1000010028,HPD-Northeast Patrol,30058137.0,MICHAEL THIAC,42000.0,10.1642,HPD-NORTHEAST DIVISION,HISPANIC,Hispanic/Latino,Full Time,Non-Exempt Excptn,N,Male,Active,CLASSIFIED,2015-10-12,2016-04-27,2015-10-12,D: Protective Service Workers,2016-06-01,0.088206,1.171824,Novice


### Problem 11
<span  style="color:green; font-size:16px">Select employees with IDs 3105, 24767 and 31578.</span>

In [293]:
df_coh.loc[[3105, 24767, 31578],:]

UNIQUE_ID,FIRST_NAME,LAST_NAME,POSITION_NUMBER,POSITION_JOB_CODE,POSITION_TITLE,BUSINESS_AREA,DEPARTMENT,PAY_GRADE,FUND_ID,FUND_NAME,COST_CENTER,COST_CENTER_NAME,REPORTS_TO_POSITION,MANAGER_NAME,BASE_SALARY,ORG_UNIT,ORG_UNIT_NAME,ETHNICITY,RACE,EMPLOYMENT_TYPE,EMPLOYMENT_SUB_GROUP,EXEMPT,GENDER,EMPLOYMENT_STATUS,CIVIL_SERVICE_TYPE,HIRE_DATE,JOB_DATE,COMP_DATE,EEOJ,SNAPSHOT_DATE,RANDOM_BONUS,YEARS_EXPERIENCE,EXPERIENCE_LEVEL
3105,FRANCISCO,ORTIZ,30001931,106.4,POLICE SERGEANT,1000,Houston Police Department-HPD,PA06,1000,General Fund,1000010028,HPD-Northeast Patrol,30067353.0,JOSLYN JOHNSON,81239.0,10.1642.0030,HPD-NORTHEAST PATROL,HISPANIC,Hispanic/Latino,Full Time,Non-Exempt Excptn,N,Male,Active,CLASSIFIED,1983-09-12,1998-11-28,1984-08-01,C: Technicians,2016-06-01,0.010904,33.254619,Senior
24767,JOSHUA,YERIAN,30022228,103.3,FIRE FIGHTER,1200,Houston Fire Department (HFD),FD03,1000,General Fund,1200030001,HFD-Deployment,,,48190.0,12.1210,HFD-FIRE & EMS OPERATIONS,"WHITE, NOT OF HISPANIC ORIGIN",White,Full Time,NE Suprsn Excptn,N,Male,Active,CLASSIFIED,2011-08-22,2012-11-22,2011-08-22,D: Protective Service Workers,2016-06-01,0.034396,5.31154,Experienced
31578,MARK,SIMS,30028122,513.4,SEMI-SKILLED LABORER,2100,Solid Waste Management,06,1000,General Fund,2100020001,SWM-Maintenance,30028112.0,JUAN RODRIGUEZ,27622.0,21.2010.02,SWM-MAINTENANCE,"BLACK, NOT OF HISPANIC ORIGIN",Black or African American,Full Time,Non Exempt Postv,N,Male,Active,CIVILIAN,2015-08-31,2015-08-31,2015-08-31,H: Service/Maintenance,2016-06-01,0.035088,1.286816,Novice


### Problem 12
<span  style="color:green; font-size:16px">A list inside the brackets is used to select disjoint rows or columns. Select rows 10, 100 and 500 along with columns 3, 7 and 20.</span>

In [294]:
df_coh.iloc[[10, 100, 500], [3, 7, 20]]

UNIQUE_ID,POSITION_JOB_CODE,PAY_GRADE,EMPLOYMENT_SUB_GROUP
18633,103.3,FD03,NE Suprsn Excptn
12214,103.1,FD05,NE Suprsn Excptn
32485,721.1,12,Non Exempt Postv


### Problem 13
<span  style="color:green; font-size:16px">Use label based indexing to select employee IDs 12214 and 8317 along with columns HIRE_DATE and GENDER.</span>

In [295]:
df_coh.loc[[12214, 8317], ['HIRE_DATE', 'GENDER']]

UNIQUE_ID,HIRE_DATE,GENDER
12214,2000-03-21,Male
8317,1994-04-11,Male


### Problem 14: Advanced
<span  style="color:green; font-size:16px">Use label based indexing to select employee IDs 12214 and 8317 along with all columns except FIRST_NAME and POSITION_NUMBER. [Use this stackoverflow answer for guidance.](http://stackoverflow.com/a/37441204/3707607)</span>

In [296]:
df_coh.loc[[12214, 8317], [col for col in df_coh.columns if col not in ['FIRST_NAME', 'POSITION_NUMBER']]]

UNIQUE_ID,LAST_NAME,POSITION_JOB_CODE,POSITION_TITLE,BUSINESS_AREA,DEPARTMENT,PAY_GRADE,FUND_ID,FUND_NAME,COST_CENTER,COST_CENTER_NAME,REPORTS_TO_POSITION,MANAGER_NAME,BASE_SALARY,ORG_UNIT,ORG_UNIT_NAME,ETHNICITY,RACE,EMPLOYMENT_TYPE,EMPLOYMENT_SUB_GROUP,EXEMPT,GENDER,EMPLOYMENT_STATUS,CIVIL_SERVICE_TYPE,HIRE_DATE,JOB_DATE,COMP_DATE,EEOJ,SNAPSHOT_DATE,RANDOM_BONUS,YEARS_EXPERIENCE,EXPERIENCE_LEVEL
12214,MADDIN,103.1,CAPTAIN,1200,Houston Fire Department (HFD),FD05,1000,General Fund,1200030001,HFD-Deployment,,,66523.0,12.1210,HFD-FIRE & EMS OPERATIONS,"WHITE, NOT OF HISPANIC ORIGIN",White,Full Time,NE Suprsn Excptn,N,Male,Active,CLASSIFIED,2000-03-21,2009-09-29,2000-03-21,B: Professionals,2016-06-01,0.007683,16.73135,Senior
8317,JOHNSON,108.2,SENIOR POLICE OFFICER,1000,Houston Police Department-HPD,PA04,1000,General Fund,1000010027,HPD-North Ptrl,30005911.0,JOHN NICKELL,66614.0,10.1622.0230,HPD-NORTH PATROL,"WHITE, NOT OF HISPANIC ORIGIN",White,Full Time,Non-Exempt Excptn,N,Male,Active,CLASSIFIED,1994-04-11,2014-04-12,1994-04-11,D: Protective Service Workers,2016-06-01,0.001741,22.675346,Senior


In [297]:
df_coh.loc[[12214, 8317], df_coh.columns.difference(['FIRST_NAME', 'POSITION_NUMBER'])]

UNIQUE_ID,BASE_SALARY,BUSINESS_AREA,CIVIL_SERVICE_TYPE,COMP_DATE,COST_CENTER,COST_CENTER_NAME,DEPARTMENT,EEOJ,EMPLOYMENT_STATUS,EMPLOYMENT_SUB_GROUP,EMPLOYMENT_TYPE,ETHNICITY,EXEMPT,EXPERIENCE_LEVEL,FUND_ID,FUND_NAME,GENDER,HIRE_DATE,JOB_DATE,LAST_NAME,MANAGER_NAME,ORG_UNIT,ORG_UNIT_NAME,PAY_GRADE,POSITION_JOB_CODE,POSITION_TITLE,RACE,RANDOM_BONUS,REPORTS_TO_POSITION,SNAPSHOT_DATE,YEARS_EXPERIENCE
12214,66523.0,1200,CLASSIFIED,2000-03-21,1200030001,HFD-Deployment,Houston Fire Department (HFD),B: Professionals,Active,NE Suprsn Excptn,Full Time,"WHITE, NOT OF HISPANIC ORIGIN",N,Senior,1000,General Fund,Male,2000-03-21,2009-09-29,MADDIN,,12.1210,HFD-FIRE & EMS OPERATIONS,FD05,103.1,CAPTAIN,White,0.007683,,2016-06-01,16.73135
8317,66614.0,1000,CLASSIFIED,1994-04-11,1000010027,HPD-North Ptrl,Houston Police Department-HPD,D: Protective Service Workers,Active,Non-Exempt Excptn,Full Time,"WHITE, NOT OF HISPANIC ORIGIN",N,Senior,1000,General Fund,Male,1994-04-11,2014-04-12,JOHNSON,JOHN NICKELL,10.1622.0230,HPD-NORTH PATROL,PA04,108.2,SENIOR POLICE OFFICER,White,0.001741,30005911.0,2016-06-01,22.675346


### Problem 15
<span  style="color:green; font-size:16px">Use **`.iat`** correctly and explain what happened.</span>

In [298]:
df_coh.iat[0, 0] # only accesses one element

'LILLIAN'

### Problem 16
<span  style="color:green; font-size:16px">Use the **`timeit`** magic command to see the speed difference between **`.at`** and **`.loc`** for the same selection. How much faster is **`.at`**?</span>

In [299]:
%timeit df_coh.loc[12214, 'FIRST_NAME']
%timeit df_coh.at[12214, 'FIRST_NAME']

1000 loops, best of 3: 413 µs per loop
10000 loops, best of 3: 64.1 µs per loop


### Problem 17
<span  style="color:green; font-size:16px">Create new columns **`BONUS`** and **`TOTAL_COMP`**. Use column **`RANDOM_BONUS`** to calculate the bonus.</span>

In [300]:
df_coh['BONUS'] = df_coh.BASE_SALARY * df_coh.RANDOM_BONUS
df_coh['TOTAL_COMP'] = df_coh.BASE_SALARY + df_coh.BONUS
df_coh[['BASE_SALARY', 'TOTAL_COMP']].head()

UNIQUE_ID,BASE_SALARY,TOTAL_COMP
9172,121862.0,129413.917518
12311,26125.0,26357.207125
28693,45279.0,45467.584378
2359,63166.0,63641.195027
5123,56347.0,58492.645153


### Problem 18
<span  style="color:green; font-size:16px">Use the **`EXPERIENCE_LEVEL`** column to determine if more experienced employees make more money.</span>

In [301]:
df_coh.groupby('EXPERIENCE_LEVEL')['TOTAL_COMP'].mean()

EXPERIENCE_LEVEL
Novice         47312.792996
Experienced    57909.641936
Senior         66913.667730
Name: TOTAL_COMP, dtype: float64

In [302]:
df_coh['YEARS_EXPERIENCE_AM'] = pd.to_datetime('today') - df_coh.HIRE_DATE
df_coh['YEARS_EXPERIENCE_AM'] = df_coh['YEARS_EXPERIENCE_AM'] / pd.Timedelta(days=365.2425)
df_coh[['YEARS_EXPERIENCE', 'YEARS_EXPERIENCE_AM']].head()

UNIQUE_ID,YEARS_EXPERIENCE,YEARS_EXPERIENCE_AM
9172,10.505349,10.505349
12311,16.402801,16.402801
28693,1.859039,1.859039
2359,34.845342,34.845342
5123,27.485848,27.485848
