<img src="img/01.png">

#### According to [REF1](../README.md):

Like the Series object discussed in the previous section, the DataFrame can be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary.

# 02. Pandas - Data Frames

## 02.01 Data Frames - basics

A Pandas DataFrame is a two-dimensional table of indexed data in which every column is a `Series`.
* Create a `DataFrame` object

<img src="img/02.png">
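The two views quoted above (dictionary-like and array-like) can be made concrete with a small table. This is only a minimal sketch: the frame `toy`, its column names and its values are made up for illustration and are not taken from the dataset used below.

```python
import pandas as pd

# Hypothetical toy data, not part of the HR dataset
toy = pd.DataFrame(
    {"age": [32, 33, 31], "pay_rate": [28.5, 23.0, 29.0]},
    index=["Mia", "William", "Tyrone"],
)

# Dictionary view: a column name maps to a Series
age = toy["age"]
print(type(age).__name__)   # Series
print(list(toy.keys()))     # ['age', 'pay_rate']

# Array view: a two-dimensional block of values with a shape
values = toy.to_numpy()
print(values.shape)         # (3, 2)
print(values[0, 1])         # 28.5
```

Both views refer to the same data; which one is more convenient depends on whether you think in terms of named columns or positions.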

## 02.02 Data Frames - preprocessing


In [1]:
import numpy as np
import pandas as pd
from IPython.display import display

In [2]:
# Dataset comes from:
# https://www.kaggle.com/rhuebner/human-resources-data-set#core_dataset.csv
# reading a CSV file is one of the most common ways to create a DataFrame
df = pd.read_csv("../92_data/emplyees.csv")

In [3]:
# to keep the screen clean we display only the head of dataframe (default is 5 rows)
df.head()

Unnamed: 0,Employee Name,Employee Number,State,Zip,DOB,Age,Sex,MaritalDesc,CitizenDesc,Hispanic/Latino,...,Date of Hire,Date of Termination,Reason For Term,Employment Status,Department,Position,Pay Rate,Manager Name,Employee Source,Performance Score
0,"Brown, Mia",1103024456,MA,1450,11/24/1985,32,Female,Married,US Citizen,No,...,10/27/2008,,N/A - still employed,Active,Admin Offices,Accountant I,28.5,Brandon R. LeBlanc,Diversity Job Fair,Fully Meets
1,"LaRotonda, William",1106026572,MA,1460,4/26/1984,33,Male,Divorced,US Citizen,No,...,1/6/2014,,N/A - still employed,Active,Admin Offices,Accountant I,23.0,Brandon R. LeBlanc,Website Banner Ads,Fully Meets
2,"Steans, Tyrone",1302053333,MA,2703,9/1/1986,31,Male,Single,US Citizen,No,...,9/29/2014,,N/A - still employed,Active,Admin Offices,Accountant I,29.0,Brandon R. LeBlanc,Internet Search,Fully Meets
3,"Howard, Estelle",1211050782,MA,2170,9/16/1985,32,Female,Married,US Citizen,No,...,2/16/2015,4/15/2015,N/A - still employed,Active,Admin Offices,Administrative Assistant,21.5,Brandon R. LeBlanc,Pay Per Click - Google,N/A- too early to review
4,"Singh, Nan",1307059817,MA,2330,5/19/1988,29,Female,Single,US Citizen,No,...,5/1/2015,,N/A - still employed,Active,Admin Offices,Administrative Assistant,16.56,Brandon R. LeBlanc,Website Banner Ads,N/A- too early to review


In [4]:
# investigate the shape of the dataframe: (rows, columns)
print("Number of rows (employees) = {}".format(df.shape[0]))
print("Number of columns          = {}".format(df.shape[1]))

Number of rows (employees) = 301
Number of columns         = 21


In [5]:
# column names
df.columns

Index(['Employee Name', 'Employee Number', 'State', 'Zip', 'DOB', 'Age', 'Sex',
       'MaritalDesc', 'CitizenDesc', 'Hispanic/Latino', 'RaceDesc',
       'Date of Hire', 'Date of Termination', 'Reason For Term',
       'Employment Status', 'Department', 'Position', 'Pay Rate',
       'Manager Name', 'Employee Source', 'Performance Score'],
      dtype='object')

In [6]:
# display one employee
df.loc[1]

Employee Name               LaRotonda, William  
Employee Number                       1106026572
State                                         MA
Zip                                         1460
DOB                                    4/26/1984
Age                                           33
Sex                                         Male
MaritalDesc                             Divorced
CitizenDesc                           US Citizen
Hispanic/Latino                               No
RaceDesc               Black or African American
Date of Hire                            1/6/2014
Date of Termination                          NaN
Reason For Term             N/A - still employed
Employment Status                         Active
Department                         Admin Offices
Position                            Accountant I
Pay Rate                                      23
Manager Name                  Brandon R. LeBlanc
Employee Source               Website Banner Ads
Performance Score                    Fully Meets
Name: 1, dtype: object

In [7]:
# DataFrame.describe() is very useful to get basic intuition about the numerical data
df.describe().round(2)

Unnamed: 0,Employee Number,Zip,Age,Pay Rate
count,301.0,301.0,301.0,301.0
mean,1205421000.0,6705.2,38.55,30.72
std,182661600.0,17167.53,8.94,15.22
min,602000300.0,1013.0,25.0,14.0
25%,1102024000.0,1901.0,31.0,20.0
50%,1204033000.0,2132.0,37.0,24.0
75%,1401065000.0,2421.0,44.0,43.0
max,1988300000.0,98052.0,67.0,80.0


In [8]:
# count values of categorical columns
for c_name in df.columns:
    series = df[c_name]
    if series.dtype.kind == 'O': # strings are recognized as (O)bjects in pandas
        display(series.value_counts())

Langton, Enrico       1
Johnston, Yen         1
Friedman, Gerry       1
Ivey, Rose            1
Pelech, Emil          1
                     ..
Power, Morissa        1
King, Janet           1
Brown, Mia            1
Peterson, Ebonee      1
Becker, Renee         1
Name: Employee Name, Length: 301, dtype: int64

MA    266
CT      6
TX      3
VT      2
FL      1
CA      1
WA      1
MT      1
CO      1
KY      1
GA      1
PA      1
OR      1
AL      1
OH      1
IN      1
VA      1
RI      1
NH      1
UT      1
ID      1
ND      1
TN      1
NV      1
ME      1
NY      1
AZ      1
NC      1
Name: State, dtype: int64

9/22/1976     2
7/7/1984      2
9/9/1965      2
6/5/1967      1
5/6/1989      1
             ..
1/18/1952     1
11/9/1972     1
11/15/1976    1
7/20/1968     1
4/24/1970     1
Name: DOB, Length: 298, dtype: int64

Female    174
Male      126
male        1
Name: Sex, dtype: int64

Single       127
Married      119
Divorced      30
Separated     14
widowed       11
Name: MaritalDesc, dtype: int64

US Citizen             285
Eligible NonCitizen     12
Non-Citizen              4
Name: CitizenDesc, dtype: int64

No     271
Yes     27
no       2
yes      1
Name: Hispanic/Latino, dtype: int64

White                               190
Black or African American            54
Asian                                31
Two or more races                    18
American Indian or Alaska Native      4
Hispanic                              4
Name: RaceDesc, dtype: int64

1/10/2011    14
3/30/2015    12
1/5/2015     11
9/29/2014    11
5/16/2011    10
             ..
8/16/2012     1
7/4/2016      1
5/2/2011      1
1/9/2006      1
6/10/2011     1
Name: Date of Hire, Length: 93, dtype: int64

4/7/2012     2
9/26/2011    2
4/4/2014     2
8/19/2013    2
1/9/2012     2
            ..
5/25/2016    1
4/6/2013     1
3/15/2015    1
7/2/2012     1
1/12/2014    1
Name: Date of Termination, Length: 93, dtype: int64

N/A - still employed                188
Another position                     20
unhappy                              14
N/A - Has not started yet            11
more money                           11
career change                         9
hours                                 9
attendance                            7
relocation out of area                5
return to school                      5
performance                           4
military                              4
retiring                              4
maternity leave - did not return      3
medical issues                        3
no-call, no-show                      3
gross misconduct                      1
Name: Reason For Term, dtype: int64

Active                    174
Voluntarily Terminated     88
Terminated for Cause       14
Leave of Absence           14
Future Start               11
Name: Employment Status, dtype: int64

Production              208
IT/IS                    41
Sales                    31
Software Engineering     10
Admin Offices            10
Executive Office          1
Name: Department, dtype: int64

Production Technician I         136
Production Technician II         57
Area Sales Manager               27
Production Manager               14
Database Administrator           13
Software Engineer                 9
Network Engineer                  9
Sr. Network Engineer              5
Sr. DBA                           4
IT Support                        4
Sales Manager                     3
Accountant I                      3
Administrative Assistant          3
Sr. Accountant                    2
Shared Services Manager           2
IT Manager - DB                   2
IT Manager - Infra                1
Director of Sales                 1
President & CEO                   1
Software Engineering Manager      1
Director of Operations            1
IT Manager - Support              1
CIO                               1
IT Director                       1
Name: Position, dtype: int64

Michael Albert        22
Elijiah Gray          22
Kelley Spirea         22
Kissy Sullivan        22
David Stanley         21
Brannon Miller        21
Ketsia Liebig         21
Amy Dunn              21
Webster Butler        21
Janet King            19
Simon Roup            17
John Smith            14
Peter Monroe          14
Lynn Daneault         13
Alex Sweetwater        9
Brandon R. LeBlanc     7
Jennifer Zamora        6
Eric Dougall           4
Debra Houlihan         3
Board of Directors     2
Name: Manager Name, dtype: int64

Employee Referral                         31
Diversity Job Fair                        29
Search Engine - Google Bing Yahoo         25
Monster.com                               24
Pay Per Click - Google                    21
Professional Society                      19
Newspager/Magazine                        18
MBTA ads                                  17
Billboard                                 16
Vendor Referral                           15
Glassdoor                                 14
Website Banner Ads                        13
Word of Mouth                             13
On-campus Recruiting                      12
Social Networks - Facebook Twitter etc    11
Other                                      9
Internet Search                            6
Information Session                        4
Company Intranet - Partner                 1
On-line Web application                    1
Careerbuilder                              1
Pay Per Click                              1
Name: Employee Source, dtype: int64

Fully Meets                 172
N/A- too early to review     37
90-day meets                 31
Exceeds                      28
Needs Improvement            15
Exceptional                   9
PIP                           9
Name: Performance Score, dtype: int64
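As an alternative to checking `dtype.kind` inside the loop, the text columns can be selected up front with `DataFrame.select_dtypes`. The sketch below runs on a hypothetical toy frame (`toy` and its values are made up), so it only illustrates the idea and is not the cell used above.

```python
import pandas as pd

# Hypothetical toy data with two text columns and one numeric column
toy = pd.DataFrame({
    "Sex": ["Female", "Male", "male", "Female"],
    "State": ["MA", "MA", "CT", "MA"],
    "Age": [32, 33, 31, 29],
})

# Keep only the non-numeric columns, then count values per column
text_columns = toy.select_dtypes(exclude="number")
counts = {name: text_columns[name].value_counts() for name in text_columns.columns}

for name, series in counts.items():
    print(series, end="\n\n")
```

Excluding `"number"` rather than including a specific string dtype keeps the sketch independent of how a given pandas version stores strings.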

## 02.03 Data Frames - add / select data 

In [9]:
# It is possible to select columns like attributes, but personally I do not recommend it:
# imagine you have a column named `mean`; what will happen if you call `DataFrame.mean`?
df.Age

0      32
1      33
2      31
3      32
4      29
       ..
296    38
297    31
298    34
299    34
300    51
Name: Age, Length: 301, dtype: int64

In [10]:
# Calling column like a python dict is much more intuitive for me
# as you may notice it returns a `Series` object
df['Age']

0      32
1      33
2      31
3      32
4      29
       ..
296    38
297    31
298    34
299    34
300    51
Name: Age, Length: 301, dtype: int64
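The `mean` question raised above can be checked directly. A minimal sketch, assuming a hypothetical frame `toy` whose column happens to share its name with a DataFrame method:

```python
import pandas as pd

# Hypothetical frame: the column `mean` collides with `DataFrame.mean`
toy = pd.DataFrame({"mean": [1.0, 2.0, 3.0], "Age": [32, 33, 31]})

# Attribute access resolves to the method, not the column
print(callable(toy.mean))        # True

# Bracket access always returns the column
print(toy["mean"].tolist())      # [1.0, 2.0, 3.0]

# Attribute access still works when there is no name collision
print(toy.Age.tolist())          # [32, 33, 31]
```

Bracket access also works for column names that are not valid Python identifiers, such as `Pay Rate`.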

In [11]:
# of course all the tricks from previous tutorial work as well
df.loc[2:10:3, ['Age', 'Position']]

Unnamed: 0,Age,Position
2,31,Accountant I
5,30,Administrative Assistant
8,30,Sr. Accountant


In [12]:
# now we select the data we would like to work with
selected_columns = ['Employee Name', 'Age', 'Sex', 'MaritalDesc', 'Department', 'Position', 'Pay Rate', 'Manager Name']
df = df[selected_columns]
df.head()

Unnamed: 0,Employee Name,Age,Sex,MaritalDesc,Department,Position,Pay Rate,Manager Name
0,"Brown, Mia",32,Female,Married,Admin Offices,Accountant I,28.5,Brandon R. LeBlanc
1,"LaRotonda, William",33,Male,Divorced,Admin Offices,Accountant I,23.0,Brandon R. LeBlanc
2,"Steans, Tyrone",31,Male,Single,Admin Offices,Accountant I,29.0,Brandon R. LeBlanc
3,"Howard, Estelle",32,Female,Married,Admin Offices,Administrative Assistant,21.5,Brandon R. LeBlanc
4,"Singh, Nan",29,Female,Single,Admin Offices,Administrative Assistant,16.56,Brandon R. LeBlanc


`pd.Series` objects have a `str` accessor for working with strings efficiently: it applies string methods to the whole series at once. Several of these methods, such as `str.contains`, treat the pattern as a regular expression by default.

* change "male" to "Male"

In [13]:
df['Sex'].value_counts()

Female    174
Male      126
male        1
Name: Sex, dtype: int64

In [14]:
df['Sex'] = df['Sex'].str.capitalize()

In [15]:
# Now everything seems to be OK
df['Sex'].value_counts()

Female    174
Male      127
Name: Sex, dtype: int64
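The same `str` accessor also handles pattern-based operations and missing values. The sketch below uses a hypothetical series `status` (values made up to mimic the inconsistent casing seen in `MaritalDesc`); it is an illustration, not part of the preprocessing above.

```python
import pandas as pd

# Hypothetical values with inconsistent casing and one missing entry
status = pd.Series(["Single", "widowed", "Married", None])

# Vectorized string methods pass missing values through instead of raising
capitalized = status.str.capitalize()
print(capitalized.tolist())

# `str.contains` treats the pattern as a regular expression by default;
# `na=False` makes missing entries count as "no match"
starts_with_w = status.str.contains(r"^[wW]", na=False)
print(starts_with_w.tolist())    # [False, True, False, False]

# `str.replace` is given `regex=True` explicitly to use a pattern
cleaned = status.str.replace(r"^w", "W", regex=True)
print(cleaned.tolist())
```

Passing `regex=` explicitly avoids depending on the default, which has differed between pandas versions for `str.replace`.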

* Split column `Employee Name` into `First_Name` and `Last_Name`

In [16]:
# first we need to check that every name contains `,` to make sure we can split on this pattern
mask_comma = df['Employee Name'].str.contains(',')

# select the names without a comma using a single `.loc` call
df_view = df.loc[~mask_comma, 'Employee Name']
df_view

272    Jeremy Prater
Name: Employee Name, dtype: object

In [17]:
# all we need to do is:
df_view.str.replace(pat = ' ', repl = ', ', n=1)

272    Jeremy, Prater
Name: Employee Name, dtype: object

In [18]:
# Careful, this is the tricky part!

# ####### !IMPORTANT! #######
# Chained indexing (`][`) is a bad habit when assigning in pandas, because the
# assignment may land on a temporary copy. See:
# https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

# this is the WRONG way:
# df[~mask_comma]['Employee Name'] = df_view.str.replace(pat=' ', repl=', ', n=1)

# this is the correct way:
df.loc[~mask_comma, 'Employee Name'] = df_view.str.replace(pat=' ', repl=', ', n=1)

In [19]:
# check if everything went OK
df.loc[~mask_comma, 'Employee Name']

272    Jeremy, Prater
Name: Employee Name, dtype: object
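The difference between the two assignment styles can be reproduced on a tiny frame. This is a sketch under assumptions: `toy` and its values are hypothetical, and the chained variant is left commented out, as in the cell above.

```python
import pandas as pd

# Hypothetical data: one name with a comma, one without
toy = pd.DataFrame({"name": ["Brown, Mia", "Jeremy Prater"], "age": [32, 40]})
no_comma = ~toy["name"].str.contains(",")

# Chained indexing: `toy[no_comma]` builds a new object first, so the
# assignment would land on that temporary object instead of `toy`
# toy[no_comma]["name"] = "Jeremy, Prater"

# Single `.loc` call: rows and column are selected in one step on `toy` itself
toy.loc[no_comma, "name"] = toy.loc[no_comma, "name"].str.replace(
    " ", ", ", n=1, regex=False
)
print(toy["name"].tolist())    # ['Brown, Mia', 'Jeremy, Prater']
```

Chained indexing is harmless for reading, but using `.loc` everywhere keeps reads and writes consistent.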

In [20]:
# now we can split the 'Employee Name'
additional_cols = df['Employee Name'].str.split(pat=',', n=1, expand=True)

In [21]:
# (optional) the same columns collected in a dict; the next cell assigns them directly
additional_cols_dict = {'last_name': additional_cols[0], 'first_name': additional_cols[1]}

In [22]:
df['Last_Name']  = additional_cols[0]
df['First_Name'] = additional_cols[1]

In [23]:
df.head()

Unnamed: 0,Employee Name,Age,Sex,MaritalDesc,Department,Position,Pay Rate,Manager Name,Last_Name,First_Name
0,"Brown, Mia",32,Female,Married,Admin Offices,Accountant I,28.5,Brandon R. LeBlanc,Brown,Mia
1,"LaRotonda, William",33,Male,Divorced,Admin Offices,Accountant I,23.0,Brandon R. LeBlanc,LaRotonda,William
2,"Steans, Tyrone",31,Male,Single,Admin Offices,Accountant I,29.0,Brandon R. LeBlanc,Steans,Tyrone
3,"Howard, Estelle",32,Female,Married,Admin Offices,Administrative Assistant,21.5,Brandon R. LeBlanc,Howard,Estelle
4,"Singh, Nan",29,Female,Single,Admin Offices,Administrative Assistant,16.56,Brandon R. LeBlanc,Singh,Nan


In [24]:
# the last thing to do is to reorder the columns and drop the old `Employee Name` column
selected_columns_fln = ['First_Name', 'Last_Name'] + selected_columns[1:]
df = df[selected_columns_fln]
df.head()

Unnamed: 0,First_Name,Last_Name,Age,Sex,MaritalDesc,Department,Position,Pay Rate,Manager Name
0,Mia,Brown,32,Female,Married,Admin Offices,Accountant I,28.5,Brandon R. LeBlanc
1,William,LaRotonda,33,Male,Divorced,Admin Offices,Accountant I,23.0,Brandon R. LeBlanc
2,Tyrone,Steans,31,Male,Single,Admin Offices,Accountant I,29.0,Brandon R. LeBlanc
3,Estelle,Howard,32,Female,Married,Admin Offices,Administrative Assistant,21.5,Brandon R. LeBlanc
4,Nan,Singh,29,Female,Single,Admin Offices,Administrative Assistant,16.56,Brandon R. LeBlanc


In [25]:
# the interesting part is to check whether the splitting
# removed whitespace from the beginning and end of `First_Name` and `Last_Name`
# to check this we need to map a function over the columns
# btw `.apply` is one of the MOST important concepts in this notebook

whitespace_check = df[['First_Name', 'Last_Name']].apply(lambda x: x.str.contains(pat=r'\s'))
whitespace_check.head()

Unnamed: 0,First_Name,Last_Name
0,True,False
1,True,False
2,True,False
3,True,False
4,True,False


In [26]:
# Let's check the scale of the problem
whitespace_check.sum()

First_Name    298
Last_Name       3
dtype: int64
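By default `DataFrame.apply` calls the function once per column (`axis=0`), which is why `x.str` is available inside the lambda above. With `axis=1` the function receives one row at a time instead. A minimal sketch on a hypothetical frame `toy` (values made up):

```python
import pandas as pd

# Hypothetical data with a leading space in one first name
toy = pd.DataFrame({
    "First_Name": [" Mia", "Leigh Ann"],
    "Last_Name": ["Brown", "Smith"],
})

# axis=0 (default): the function receives each column as a Series
leading_space = toy.apply(lambda col: col.str.startswith(" "))
print(leading_space.sum().to_dict())     # {'First_Name': 1, 'Last_Name': 0}

# axis=1: the function receives each row as a Series
full_name = toy.apply(
    lambda row: f"{row['First_Name'].strip()} {row['Last_Name']}", axis=1
)
print(full_name.tolist())                # ['Mia Brown', 'Leigh Ann Smith']
```

Column-wise application keeps the work vectorized inside each call, so it is usually the faster of the two on large frames.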

In [27]:
# to be sure, let's look at one value and see what went wrong
df.loc[0, 'First_Name']

' Mia'

In [28]:
# let's apply the fix: strip whitespace from both ends
df[['First_Name', 'Last_Name']] = df[['First_Name', 'Last_Name']].apply(lambda x:x.str.strip())

In [29]:
# much better!
df.loc[0, 'First_Name']

'Mia'

In [30]:
# let's check once again
whitespace_check = df[['First_Name', 'Last_Name']].apply(lambda x: x.str.contains(pat=r'\s'))
df[whitespace_check.any(axis=1)]

Unnamed: 0,First_Name,Last_Name,Age,Sex,MaritalDesc,Department,Position,Pay Rate,Manager Name
5,Leigh Ann,Smith,30,Female,Married,Admin Offices,Administrative Assistant,20.5,Brandon R. LeBlanc
6,Brandon R,LeBlanc,33,Male,Married,Admin Offices,Shared Services Manager,55.0,Janet King
43,Karthikeyan,Ait Sidi,42,Male,Married,IT/IS,Sr. DBA,62.0,Simon Roup
44,Claudia N,Carr,31,Female,Single,IT/IS,Sr. DBA,61.3,Simon Roup
55,Webster L,Butler,34,Male,Single,Production,Production Manager,55.0,Janet King
66,Courtney E,Wallace,62,Female,Married,Production,Production Manager,33.5,Janet King
67,Wilson K,Adinolfi,34,Male,Single,Production,Production Technician I,20.0,Michael Albert
75,Francesco A,Barone,34,Male,Single,Production,Production Technician I,16.76,Kelley Spirea
80,Lowan M,Biden,59,Female,Divorced,Production,Production Technician I,22.0,Ketsia Liebig
87,Donovan E,Chang,34,Male,Single,Production,Production Technician I,22.0,Webster Butler


Now everything looks much better! The remaining whitespace in `First_Name` and `Last_Name` looks like reasonable data (middle initials and two-part names). We can move forward!
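The split and strip steps can also be combined into one pass with `Series.str.extract` and named groups. This is an alternative sketch, not the approach used above; the series `names` holds a few hypothetical sample values in the same "Last, First" layout.

```python
import pandas as pd

# Hypothetical sample names in "Last, First" form
names = pd.Series(["Brown, Mia", "Smith, Leigh Ann", "Ait Sidi, Karthikeyan"])

# Named groups become column names; the `\s*` parts absorb the whitespace
# around the comma and at both ends of the string
pattern = r"^\s*(?P<Last_Name>[^,]+?)\s*,\s*(?P<First_Name>.+?)\s*$"
parts = names.str.extract(pattern)
print(parts)
```

A name without a comma does not match the pattern and yields missing values in both columns, so the comma check from earlier is still worth doing first.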

## 02.04 Data Frames - Grouping  (<-- this will be your best friend!)

`pd.DataFrame.groupby`

In [31]:
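# most basic `groupby` use
# let's check the mean `Age` and `Pay Rate` for every `Position`
# (`numeric_only=True` skips the text columns, which recent pandas versions refuse to average)
simple1_group = df.groupby(['Position']).mean(numeric_only=True).round(2)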
simple1_group.sort_values('Age')

Position,Age,Pay Rate
Sales Manager,29.33,56.75
Administrative Assistant,30.33,19.52
IT Manager - Infra,31.0,63.0
Accountant I,32.0,26.83
Shared Services Manager,33.0,55.0
Software Engineer,33.67,51.07
Network Engineer,33.78,39.68
Sr. Accountant,34.0,34.95
Director of Operations,34.0,60.0
Database Administrator,34.54,39.48
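When different columns need different statistics, `groupby(...).agg` with named aggregation keeps the result columns readable. A minimal sketch on a hypothetical frame `toy`; the values and the output names (`mean_age`, `max_pay`, `headcount`) are made up for illustration.

```python
import pandas as pd

# Hypothetical data with two positions
toy = pd.DataFrame({
    "Position": ["Accountant I", "Accountant I", "Sr. DBA", "Sr. DBA"],
    "Age": [32, 34, 42, 30],
    "Pay Rate": [28.5, 23.0, 62.0, 61.0],
})

# Each keyword is an output column: (input column, aggregation)
summary = toy.groupby("Position").agg(
    mean_age=("Age", "mean"),
    max_pay=("Pay Rate", "max"),
    headcount=("Age", "count"),
)
print(summary)
```

Only the columns named in `agg` are touched, so text columns never reach a numeric aggregation.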


In [34]:
# we can group by more than one column
simple2_group = df.groupby(['Sex','Department'])['Age'].agg(['mean', 'count']).round(2)
simple2_group

Sex,Department,mean,count
Female,Admin Offices,31.83,6
Female,Executive Office,63.0,1
Female,IT/IS,38.26,19
Female,Production,39.34,127
Female,Sales,35.73,15
Female,Software Engineering,32.83,6
Male,Admin Offices,32.5,4
Male,IT/IS,37.41,22
Male,Production,38.53,81
Male,Sales,41.38,16


In [38]:
# sometimes it is useful to play with the index/column using `stack` and `unstack` methods
simple2_group.unstack(0)

Department,mean (Female),mean (Male),count (Female),count (Male)
Admin Offices,31.83,32.5,6.0,4.0
Executive Office,63.0,,1.0,
IT/IS,38.26,37.41,19.0,22.0
Production,39.34,38.53,127.0,81.0
Sales,35.73,41.38,15.0,16.0
Software Engineering,32.83,39.25,6.0,4.0
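What `unstack` and `stack` do to the index is easier to follow on a small example. This is a sketch on a hypothetical frame `toy` (values made up), grouped the same way as above.

```python
import pandas as pd

# Hypothetical data: two sexes, two departments
toy = pd.DataFrame({
    "Sex": ["Female", "Female", "Male", "Male"],
    "Department": ["IT/IS", "Sales", "IT/IS", "Sales"],
    "Age": [38, 36, 37, 41],
})
by_group = toy.groupby(["Sex", "Department"])["Age"].mean()

# unstack moves an index level into the columns (level 0 is `Sex` here)
wide_form = by_group.unstack(0)
print(wide_form)

# stack moves the column level back into the index, now as the innermost level
long_form = wide_form.stack()
print(list(long_form.index.names))    # ['Department', 'Sex']
```

The round trip returns the same values, but the order of the index levels changes because `stack` always appends the moved level last.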


In [43]:
# `stack` moves the column labels into the index; no columns are left, so we get a `Series` :)
simple2_group.stack()

Sex     Department                 
Female  Admin Offices         mean      31.83
                              count      6.00
        Executive Office      mean      63.00
                              count      1.00
        IT/IS                 mean      38.26
                              count     19.00
        Production            mean      39.34
                              count    127.00
        Sales                 mean      35.73
                              count     15.00
        Software Engineering  mean      32.83
                              count      6.00
Male    Admin Offices         mean      32.50
                              count      4.00
        IT/IS                 mean      37.41
                              count     22.00
        Production            mean      38.53
                              count     81.00
        Sales                 mean      41.38
                              count     16.00
        Software Engineering  mean      39.2