
# Data Analysis Follow Along

### Previously we learned how to import and clean data in data frames using pandas. Now we're going to learn how to operate on that data to pull out useful information.

#### We will start by learning some of the built in functions Pandas has to offer

In [1]:
#first things first:
import numpy as np
import pandas as pd

We will practice on data from College Board collected in 2010, about the average SAT scores for different schools:

In [2]:
sat = pd.read_csv('SAT_2010.csv')

In [3]:
sat #see what the Data set contains

Unnamed: 0,DBN,School Name,Number of Test Takers,Critical Reading Mean,Mathematics Mean,Writing Mean
0,01M292,Henry Street School for International Studies,31.0,391.0,425.0,385.0
1,01M448,University Neighborhood High School,60.0,394.0,419.0,387.0
2,01M450,East Side Community High School,69.0,418.0,431.0,402.0
3,01M458,SATELLITE ACADEMY FORSYTH ST,26.0,385.0,370.0,378.0
4,01M509,CMSP HIGH SCHOOL,,,,
...,...,...,...,...,...,...
455,75R025,South Richmond High School,10.0,407.0,421.0,400.0
456,75X012,PS12X LEWIS AND CLARK SCHOOL,,,,
457,75X754,P754 X - Jeffrey M. Rapport School for Career ...,,,,
458,76K460,John Jay High School,9.0,390.0,381.0,398.0


If we only want to see the first 'n' items of the table, where n is some positive integer, we can use the .head(n) function.

To display only the first 3 items of our data frame 'sat':

In [5]:
sat.head(3)

Unnamed: 0,DBN,School Name,Number of Test Takers,Critical Reading Mean,Mathematics Mean,Writing Mean
0,01M292,Henry Street School for International Studies,31.0,391.0,425.0,385.0
1,01M448,University Neighborhood High School,60.0,394.0,419.0,387.0
2,01M450,East Side Community High School,69.0,418.0,431.0,402.0


Now let's use the function .info(), which prints out summary information about a Data Frame. How it works is best understood by seeing its output:

In [6]:
#now use the .info() function, and see what information it gives you:
sat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 460 entries, 0 to 459
Data columns (total 6 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   DBN                    460 non-null    object 
 1   School Name            460 non-null    object 
 2   Number of Test Takers  386 non-null    float64
 3   Critical Reading Mean  386 non-null    float64
 4   Mathematics Mean       386 non-null    float64
 5   Writing Mean           386 non-null    float64
dtypes: float64(4), object(2)
memory usage: 21.7+ KB


Now we can easily see what each column is, and how much data it contains. What makes some columns contain 460 pieces of information while others contain only 386?

Let's try the function .isnull(), which returns a data frame of the same shape filled with Boolean values: True where a value is null (NaN) and False where it is not:

In [None]:
sat.isnull()

The mismatch in data quantity from column to column comes from some of the values being null.
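Scanning a full table of True/False values by eye is tedious. As a minimal sketch, assuming a small made-up table (the values below are invented and not taken from SAT_2010.csv), summing the Boolean frame counts the nulls in each column:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the SAT table (values invented for illustration)
toy = pd.DataFrame({
    'School Name': ['A', 'B', 'C', 'D'],
    'Number of Test Takers': [31.0, np.nan, 69.0, np.nan],
    'Writing Mean': [385.0, np.nan, 402.0, np.nan],
})

# True counts as 1 when summed, so this counts the nulls in each column
null_counts = toy.isnull().sum()
print(null_counts)

# Non-null count per column, the same kind of number .info() reports
non_null_counts = toy.notnull().sum()
print(non_null_counts)
```

On the real data, the same pattern would be `sat.isnull().sum()`.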

Let's consider another possible issue: how can we tell if certain data has repeats? We can use the function .nunique() to see how many unique values are in each column:

In [None]:
sat.nunique()

The number of unique DBN values (an arbitrary code given to each school) is less than the number of unique school names. What does this mean, and why might it be problematic for the School Board?

How can we find which DBN is repeated?

We can use the function .value_counts(). However, we need to use it on one specific column, which we select using the same syntax we learned for data cleaning:

In [None]:
sat['DBN'].value_counts()

Now we can see there are two DBN Values that repeat. Notice that the .value_counts() function automatically sorts in order from greatest to least.
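When a column has hundreds of distinct values, it can help to keep only the entries that occur more than once. A minimal sketch, assuming a made-up series of codes (the `codes` values are hypothetical):

```python
import pandas as pd

# Hypothetical codes, not taken from SAT_2010.csv
codes = pd.Series(['X1', 'X2', 'X2', 'X3', 'X3', 'X3', 'X4'])

counts = codes.value_counts()   # sorted from most to least frequent
repeated = counts[counts > 1]   # keep only codes that occur more than once
print(repeated)
```

The result of .value_counts() is itself a Series, so the usual Boolean filtering syntax applies to it.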

Practice pulling the data from the data frame where the DBN is one of the repeated values to see what schools share their values:

In [None]:
#your code here

The repeated values are given to what looks like two branches of the same two schools. However, the extra repeats all have null values, so removing them won't skew the data in any way.

Clear out all the schools (rows) that contain null data using the methods learned in the Data Cleaning lesson, and make sure to reset the index as well. (Do this in-place to permanently alter the Data Frame.)

In [None]:
#Your code here

In [None]:
sat #Does the result look correct?

In [None]:
sat.info() #make sure each column now only contains 386 pieces of data

In [None]:
sat.nunique()

Similar to .nunique() is the function .unique(). This function is applied to a single column at a time, using the same syntax as .value_counts().
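To make the difference concrete, here is a minimal sketch on a made-up series (the job titles are hypothetical): .unique() returns the distinct values themselves, while .nunique() returns only how many there are.

```python
import pandas as pd

# Hypothetical job titles for illustration
jobs = pd.Series(['Analyst', 'Engineer', 'Analyst', 'Manager', 'Engineer'])

distinct_jobs = jobs.unique()   # array of distinct values, in order of first appearance
n_distinct = jobs.nunique()     # just the count of distinct values

print(distinct_jobs)
print(n_distinct)
```

One detail to keep in mind: by default .nunique() does not count NaN, whereas .unique() includes NaN in its output if the column contains one.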

We will use a data set of employee data for this:

In [11]:
HR_Data = pd.read_csv('HR.csv')

In [None]:
HR_Data.head(5) #get a look at what data this contains

Use .unique() on the column 'Position' to find the different jobs that exist in the data set:

In [None]:
#your code here

.nunique() can also be used on a single column. Below is one way of finding the number of unique jobs using .unique(), but try using .nunique() to find the same value:

In [None]:
len(HR_Data['Position'].unique())

In [None]:
#find the same information with nunique()

Some data sets have a lot of columns. Rather than manually scrolling through the output of .head() or .info(), we can simply use .columns to get the column names (returned as a pandas Index of strings):

In [13]:
HR_Data.columns
#Note: we do not use (), this is not a function, but an attribute of the data frame object, just like 'shape'

Index(['Employee_Name', 'EmpID', 'PayRate', 'Position', 'State', 'Zip', 'Sex',
       'MaritalDesc', 'DateofHire', 'EmploymentStatus', 'Department',
       'ManagerName', 'EngagementSurvey', 'EmpSatisfaction',
       'SpecialProjectsCount', 'DaysLateLast30'],
      dtype='object')
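The Index object shown above behaves much like a list. As a small sketch, assuming a made-up frame that borrows a few of the HR column names, it can be converted to a plain Python list or searched directly:

```python
import pandas as pd

# Hypothetical frame with a few of the HR column names
df = pd.DataFrame({'EmpID': [1, 2], 'PayRate': [28.5, 23.0], 'Position': ['A', 'B']})

cols = df.columns                # a pandas Index object
col_list = df.columns.tolist()   # a plain Python list of strings

print(type(cols).__name__)
print(col_list)
print('PayRate' in df.columns)   # membership test works on the Index directly
```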

We can sort the data by a column using .sort_values('column name').
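A minimal sketch on a made-up frame (the names and pay values are hypothetical) shows the default ascending order and the `ascending=False` option. Note that .sort_values() returns a new data frame and leaves the original untouched unless you reassign it:

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({'name': ['Ann', 'Bob', 'Cy'], 'pay': [30.0, 21.5, 45.0]})

by_pay = df.sort_values('pay')                        # ascending by default
by_pay_desc = df.sort_values('pay', ascending=False)  # highest first

print(by_pay)
print(by_pay_desc)
```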

Try sorting HR_Data by EmpID:

In [None]:
#your code here

We can use a variety of other functions on single columns to find information about them, such as .max(), .min(), .mean(), .median(), .sum(), and more, using the same syntax we used for .unique() or .value_counts(). Let's practice:

In [14]:
#What's the average pay rate?

31.28480645161292

In [None]:
#in the SAT data, what was the lowest average writing score for a school?
#your code here

But what school does this data belong to? We can use the syntax we used for cleaning to pull that specific row:

In [9]:
sat[sat['Writing Mean'] == sat['Writing Mean'].min()]

Unnamed: 0,DBN,School Name,Number of Test Takers,Critical Reading Mean,Mathematics Mean,Writing Mean
173,10X342,International School of Liberal Arts,46.0,333.0,336.0,285.0
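An alternative to the Boolean filter is .idxmin() (or .idxmax()), which returns the index label of the smallest (or largest) value. A minimal sketch, assuming a made-up table with invented scores:

```python
import pandas as pd

# Hypothetical scores, not taken from SAT_2010.csv
toy = pd.DataFrame({
    'School Name': ['A', 'B', 'C'],
    'Writing Mean': [385.0, 285.0, 402.0],
})

row_label = toy['Writing Mean'].idxmin()   # index label of the smallest value
lowest_row = toy.loc[row_label]            # the full row as a Series

print(lowest_row['School Name'])
```

The two approaches differ when there are ties: the Boolean filter returns every row that matches the minimum, while .idxmin() returns only the first one.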


Find the name of the employee in our HR_Data with the highest pay:

In [None]:
#Your code here

### Practice

The total SAT score is calculated by simply adding up the scores for reading, writing, and math. Let's try to get a weighted average of the total SAT score for all the students. We say 'weighted average' because each school had a different number of students who contributed to its average score.

To get the weighted average, first find the average total SAT score for each school by summing its averages for math, reading, and writing. Then multiply this number by the number of students that contributed to the school's score. Take the sum of this product over all schools, and divide it by the total number of students from all schools.

Challenge: do this all in one line

Ans: 1335.922

In [None]:
#your code here
#weighted average: sum(school total * test takers) / total test takers
((sat['Critical Reading Mean'] + sat['Mathematics Mean'] + sat['Writing Mean']) * sat['Number of Test Takers']).sum() / sat['Number of Test Takers'].sum()
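To check the logic by hand, here is a minimal sketch with two made-up schools (the column names `total_mean` and `takers` are hypothetical). It also shows NumPy's `np.average` with its `weights` argument, which performs the same calculation:

```python
import numpy as np
import pandas as pd

# Two made-up schools: per-school mean total score and number of test takers
toy = pd.DataFrame({'total_mean': [1200.0, 1500.0], 'takers': [10, 30]})

# Manual version: sum of (score * weight) divided by sum of weights
manual = (toy['total_mean'] * toy['takers']).sum() / toy['takers'].sum()

# NumPy shortcut that performs the same calculation
shortcut = np.average(toy['total_mean'], weights=toy['takers'])

print(manual, shortcut)   # (1200*10 + 1500*30) / 40 = 1425.0
```

The plain mean of the two school scores would be 1350, but the larger school pulls the weighted average up to 1425.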

### Correlations in Data with .corr()

The last built-in function we'll look at is .corr().

This function finds the correlation between each pair of numeric columns in a data frame, and outputs the results as a matrix:

In [None]:
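sat.corr(numeric_only=True) #numeric_only=True skips text columns like 'School Name'; pandas 2.0+ raises an error without it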

Notice that the rows and columns of this output are the same.

Each value in the matrix lies between -1 and 1. A value of 0 means that the two columns have no linear correlation, i.e. there is no straight-line trend relating them. A value of 1 means that the two columns are perfectly positively correlated: when one increases, the other increases in exact proportion (this is why every column has a correlation of 1 with itself). A negative correlation means the two columns are related in that an increase in one goes with a decrease in the other.
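A small sketch with made-up numbers can make these values concrete. It uses Series.corr() to compare one column against three others: a straight line going up, a straight line going down, and a symmetric U shape.

```python
import pandas as pd

# Made-up values for illustration
x = pd.Series([1, 2, 3, 4, 5])

same_direction = x.corr(2 * x + 10)   # a straight line with positive slope
opposite = x.corr(-3 * x)             # a straight line with negative slope
curved = x.corr((x - 3) ** 2)         # a symmetric U shape: related, but not linearly

print(same_direction, opposite, curved)
```

The first two come out at (approximately) 1 and -1 even though the columns are not equal, and the third comes out at (approximately) 0 even though the second column is computed directly from the first. Correlation measures only the straight-line part of a relationship.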

Take a look at the correlation within the HR_Data data frame:

In [15]:
HR_Data.corr(numeric_only=True) #only the numeric columns are included

Unnamed: 0,EmpID,PayRate,Zip,EngagementSurvey,EmpSatisfaction,SpecialProjectsCount,DaysLateLast30
EmpID,1.0,-0.02031,0.130735,-0.00572,-0.017726,-0.171329,
PayRate,-0.02031,1.0,0.391945,-0.024223,0.061525,0.448484,
Zip,0.130735,0.391945,1.0,-0.000982,-0.064478,-0.096083,
EngagementSurvey,-0.00572,-0.024223,-0.000982,1.0,-0.017123,-0.040253,
EmpSatisfaction,-0.017726,0.061525,-0.064478,-0.017123,1.0,0.036456,
SpecialProjectsCount,-0.171329,0.448484,-0.096083,-0.040253,0.036456,1.0,
DaysLateLast30,,,,,,,


As a disclaimer, this data is made up for the purposes of practicing data science.

One way to confirm this is to look at the correlation of the employees' zip codes with other data. A zip code should be nearly random for any given employee, as it's assigned arbitrarily based on location. However, we see a somewhat significant correlation (about 0.39) between the zip code and the pay rate. Why is this? Perhaps the PayRate was generated based on the zip code when the data set was made, or vice versa. Perhaps there is a legitimate reason for it: maybe people in a certain location have higher pay rates because a building with executives sits in a region with a higher zip code, while buildings with lower-level employees sit in regions with smaller zip codes.

The math that generates the output comes from statistics. We don't have time to prove it, but the general formula is:

$ \rho_{X,Y} = \frac{E[XY]-E[X]E[Y]}{\sqrt{E[X^2]-[E[X]]^2} \sqrt{E[Y^2]-[E[Y]]^2}} $
 
Where $\rho_{X,Y}$ is the correlation value of data sets (columns) X and Y, $E[X]$ is the 'expectation value' (fancy term for mean) of the values of X, and $E[Y]$ is the 'expectation value' of the values of Y. Likewise, $E[XY]$ is the average of every corresponding pair of data points x and y multiplied together, and $E[X^2]$ and $E[Y^2]$ are the averages of the squared values of X and Y.
 
If you don't understand this yet, it's okay. You won't need to program this yourself; just know that the numbers generated by .corr() are mathematics and not magic.
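For the curious, the formula can be checked numerically. The following is a minimal sketch on made-up values (the helper `E` is a hypothetical name introduced here), comparing the hand-computed result with what pandas returns:

```python
import numpy as np
import pandas as pd

# Made-up values for two columns X and Y
X = pd.Series([1.0, 2.0, 4.0, 7.0])
Y = pd.Series([3.0, 5.0, 4.0, 10.0])

def E(values):
    """Expectation value, i.e. the plain mean."""
    return np.mean(values)

numerator = E(X * Y) - E(X) * E(Y)
denominator = np.sqrt(E(X**2) - E(X)**2) * np.sqrt(E(Y**2) - E(Y)**2)
rho = numerator / denominator

print(rho, X.corr(Y))   # the two numbers should agree up to rounding
```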

### .apply() and User-Defined / Lambda Functions

Let's look again at our HR_Data:

In [16]:
HR_Data.head(3)

Unnamed: 0,Employee_Name,EmpID,PayRate,Position,State,Zip,Sex,MaritalDesc,DateofHire,EmploymentStatus,Department,ManagerName,EngagementSurvey,EmpSatisfaction,SpecialProjectsCount,DaysLateLast30
0,"Brown, Mia",1103024456,28.5,Accountant I,MA,1450,F,Married,10/27/2008,Active,Admin Offices,Brandon R. LeBlanc,2.04,2,6,0.0
1,"LaRotonda, William",1106026572,23.0,Accountant I,MA,1460,M,Divorced,1/6/2014,Active,Admin Offices,Brandon R. LeBlanc,5.0,4,4,0.0
2,"Steans, Tyrone",1302053333,29.0,Accountant I,MA,2703,M,Single,9/29/2014,Active,Admin Offices,Brandon R. LeBlanc,3.9,5,5,0.0


Say we want to make a column of data that contains only the year an employee was hired. How can we extract this data from the 'DateofHire' column?

This is where the .apply() function comes in. .apply() takes a function as its argument and applies it to every value within the series (either a row or a column).

Maybe we could find a built-in function to do this, but it's easy to write a function ourselves for our specific formatting.

Let's first define a function that uses split('/') on a value from the 'DateofHire' column to get a list of \[month, day, year\], and ultimately returns just the year:

In [17]:
def year_extract(date_string):
    return date_string.split('/')[2]

In [18]:
#test it out
test_data = '2/3/2022'
print(year_extract(test_data))

2022


Now let's apply the function to create a new column that contains just the year the employee was hired:

In [19]:
HR_Data['YearofHire'] = HR_Data['DateofHire'].apply(year_extract)
#we don't need to pass in any arguments to our function 'year_extract', each value in the column is passed in as the argument

In [20]:
HR_Data.head(4)

Unnamed: 0,Employee_Name,EmpID,PayRate,Position,State,Zip,Sex,MaritalDesc,DateofHire,EmploymentStatus,Department,ManagerName,EngagementSurvey,EmpSatisfaction,SpecialProjectsCount,DaysLateLast30,YearofHire
0,"Brown, Mia",1103024456,28.5,Accountant I,MA,1450,F,Married,10/27/2008,Active,Admin Offices,Brandon R. LeBlanc,2.04,2,6,0.0,2008
1,"LaRotonda, William",1106026572,23.0,Accountant I,MA,1460,M,Divorced,1/6/2014,Active,Admin Offices,Brandon R. LeBlanc,5.0,4,4,0.0,2014
2,"Steans, Tyrone",1302053333,29.0,Accountant I,MA,2703,M,Single,9/29/2014,Active,Admin Offices,Brandon R. LeBlanc,3.9,5,5,0.0,2014
3,"Howard, Estelle",1211050782,21.5,Administrative Assistant,MA,2170,F,Married,2/16/2015,Terminated for Cause,Admin Offices,Brandon R. LeBlanc,3.24,3,4,,2015
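For comparison, pandas does have a built-in route for this: parse the strings with `pd.to_datetime` and read the year through the `.dt` accessor. A minimal sketch, assuming dates written as month/day/year like the ones above:

```python
import pandas as pd

# A few dates in the same month/day/year format as 'DateofHire'
dates = pd.Series(['10/27/2008', '1/6/2014', '9/29/2014'])

# Parse the strings into real datetimes, then read off the year
years = pd.to_datetime(dates, format='%m/%d/%Y').dt.year

print(years.tolist())
```

One difference worth noticing: year_extract returns the year as a string (for example '2008'), while `.dt.year` returns it as an integer, which matters if you later want to sort or do arithmetic on the column.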


Now try writing two functions that take the Employee_Name and return the first and last names, respectively:

In [None]:
def extract_first(name):
    pass #your code here

def extract_last(name):
    pass #your code here

Now apply these functions to add two separate columns, 'FirstName' and 'LastName':

In [None]:
#your code here

In [None]:
HR_Data.head(3) #check your output

This method of defining a function and then using it works just fine, but it can be a bit clunky. Because we only use each function once, we can abbreviate our code by passing in a 'lambda function'. This does the same thing we just did, except that we don't give our function a name or keep it around in memory; instead we briefly pass in the formula to be used that one time.

The syntax for lambda functions is different from what we've seen, so let's start by re-creating the year_extract function and break down the differences:

In [None]:
HR_Data['YearofHireLambda'] = HR_Data['DateofHire'].apply(lambda year: year.split('/')[2])

In [None]:
HR_Data.head(3)

We did the exact same thing, but all in one line!
How did the syntax translate? We wrote:

`lambda year : year.split('/')[2]`

Think of 'lambda' as our 'def', and 'year' as the name of our input variable, so that `lambda year :` is like `def nameless_function(year):`

The left side of the colon is our function definition, and the right side is the expression that gets returned. See how it's the same expression we returned in the original function definition.

We can abbreviate further. Most times when people write lambda functions, they simply use 'x' and 'y' as their variables.

For example, `lambda x: x.split(',')[0]` may be used to get the last names if this lambda function were applied to the 'Employee_Name' column.
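To see that a named function and a lambda really are interchangeable, here is a minimal sketch on two names in the same 'Last, First' format (the function name `before_comma` is hypothetical):

```python
import pandas as pd

# Two names in the same 'Last, First' format as 'Employee_Name'
names = pd.Series(['Brown, Mia', 'Steans, Tyrone'])

def before_comma(text):
    # Everything before the first comma
    return text.split(',')[0]

with_def = names.apply(before_comma)
with_lambda = names.apply(lambda x: x.split(',')[0])

print(with_lambda.tolist())
print(with_def.equals(with_lambda))   # both approaches give identical results
```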

When working on your own code, if you don't feel comfortable with lambda functions right away it's okay. You can still use the first method that gives it an identifier. Later if you feel up to the challenge, try to go back through and adapt them to lambda functions.

### Practice

In [None]:
HR_Data = pd.read_csv('HR.csv')
HR_Data.drop(axis=1,inplace=True,labels='Department') #modifying our table as if the department column were not included

In the HR data set, there are 5 positions related to IT. First, try using .unique() to see the different positions, then use .apply() with a function of your own devising to generate a new data frame containing only the people in IT positions.

Export this as a new csv file 'HR_IT_Data.csv' without the index. (I'm not going to give you this answer; try using the help() function or Google. Self-teaching is the best way to learn past the basics, as teaching every function is impossible in our amount of time.)

Challenge: Do this all in one line

In [None]:
#your code here

In [22]:
#check your answer
HR_IT = pd.read_csv('HR_IT_Data.csv')
HR_IT

Unnamed: 0,Employee_Name,EmpID,PayRate,Position,State,Zip,Sex,MaritalDesc,DateofHire,EmploymentStatus,ManagerName,EngagementSurvey,EmpSatisfaction,SpecialProjectsCount,DaysLateLast30
0,"Foss, Jason",1192991000,65.0,IT Director,MA,1460,M,Single,4/15/2011,Active,Jennifer Zamora,3.76,5,5,0.0
1,"Roup,Simon",1106026933,62.0,IT Manager - DB,MA,2481,M,Single,1/20/2013,Active,Jennifer Zamora,2.96,5,7,0.0
2,"Ruiz, Ricardo",1001175250,21.0,IT Manager - DB,MA,1915,M,Divorced,1/9/2012,Voluntarily Terminated,Jennifer Zamora,2.51,4,6,
3,"Monroe, Peter",1011022863,63.0,IT Manager - Infra,MA,2134,M,Married,2/15/2012,Active,Jennifer Zamora,2.39,3,6,0.0
4,"Dougall, Eric",1101023754,64.0,IT Manager - Support,MA,1886,M,Single,1/5/2014,Active,Jennifer Zamora,1.99,5,5,0.0
5,"Clayton, Rick",1301052902,28.99,IT Support,MA,2170,M,Single,9/5/2012,Active,Eric Dougall,2.55,3,6,0.0
6,"Galia, Lisa",1501072093,31.4,IT Support,CT,6040,F,Single,5/1/2010,Active,Eric Dougall,1.21,4,5,0.0
7,"Lindsay, Leonara",602000312,26.0,IT Support,CT,6070,F,Single,1/21/2011,Active,Eric Dougall,4.64,4,5,0.0
8,"Soto, Julia",1203032263,27.49,IT Support,MA,2360,F,Married,6/10/2011,Active,Eric Dougall,4.3,3,5,0.0
9,"Bacong, Alejandro",1212052023,45.0,IT Support,MA,1886,M,Divorced,1/5/2015,Active,Peter Monroe,1.84,5,6,0.0


## Group by

The last thing we will learn is 'Group-By'.
This will enable us to group data by certain values. For example, we could group people in the HR_IT data by their manager name, in order to see how different managers affect employee satisfaction or engagement.

In [23]:
HR_IT.groupby('ManagerName')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000018250BA11F0>

This returns an object that, by itself, doesn't display any data. However, we can apply functions to it, such as .mean(), .sum(), and others.
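Although the groupby object prints as an opaque handle, a few calls let you look inside it. A minimal sketch on a made-up team (the names and values are hypothetical):

```python
import pandas as pd

# Hypothetical team data for illustration
team = pd.DataFrame({
    'manager': ['Ana', 'Ana', 'Ben', 'Ben', 'Ben'],
    'satisfaction': [4, 5, 3, 2, 4],
})

grouped = team.groupby('manager')

sizes = grouped.size()                # number of rows in each group
ben_rows = grouped.get_group('Ben')   # the rows belonging to one group

# Iterating yields (group key, sub-frame) pairs
for key, frame in grouped:
    print(key, len(frame))
```

Nothing is computed until you ask for it, which is why the object on its own has nothing to show.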

In [24]:
IT_Manager = HR_IT.groupby('ManagerName') #give our object a name

In [25]:
IT_Manager.mean(numeric_only=True) #find the mean values across each manager group (numeric columns only)

ManagerName,EmpID,PayRate,Zip,EngagementSurvey,EmpSatisfaction,SpecialProjectsCount,DaysLateLast30
Eric Dougall,1151789000.0,28.47,4160.0,3.175,3.5,5.25,0.0
Jennifer Zamora,1082448000.0,55.0,1975.2,2.722,4.4,5.8,0.0
Peter Monroe,1350850000.0,40.75,2153.0,3.1925,4.5,6.0,0.0
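Averages of columns like EmpID or Zip are not meaningful, so it is often cleaner to pick the columns of interest before aggregating. A minimal sketch on a made-up team (all names and numbers are hypothetical), which also shows .agg() for using a different statistic per column:

```python
import pandas as pd

# Hypothetical team data for illustration
team = pd.DataFrame({
    'manager': ['Ana', 'Ana', 'Ben', 'Ben'],
    'pay': [30.0, 50.0, 20.0, 24.0],
    'satisfaction': [4, 5, 3, 2],
    'employee': ['a', 'b', 'c', 'd'],   # text column we do not want to average
})

# Select the numeric columns first, then aggregate
means = team.groupby('manager')[['pay', 'satisfaction']].mean()

# Different statistics per column in one call
summary = team.groupby('manager').agg({'pay': 'max', 'satisfaction': 'mean'})

print(means)
print(summary)
```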


This is powerful. We can show analytically who is performing well and who is performing poorly, all with just a few lines of code. If you were to present this to a board or an executive, you could clean it up, screenshot it, and put it in a PowerPoint with an explanation of how you obtained the data. It is clear, precise, and objective.

### Practice
Try running this same process on the entire set of HR_Data and sort by the sum of employee satisfaction and engagement:

In [None]:
#your code here

Is there a correlation between the average pay rates of employees under a manager and the sum of their engagement and satisfaction averages?