# Pandas DataFrame

A pandas `DataFrame` represents a table of data. Each column of the table is a `Series` object, and all columns share a common index.

We first need to import the `pandas` package. We usually import the `numpy` package as well.

In [1]:
import numpy as np
import pandas as pd

## Creating a DataFrame

### From a dict of equal-length lists or arrays: `pandas.DataFrame()`

In [2]:
data = {'state': ['ohio', 'ohio', 'ohio', 'nevada', 'nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.6, 1.7, 1.4, 2.4, 2.9]   
}

In [3]:
df = pd.DataFrame(data)
df

,state,year,pop
0,ohio,2000,1.6
1,ohio,2001,1.7
2,ohio,2002,1.4
3,nevada,2001,2.4
4,nevada,2002,2.9
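
As a quick check of the claim that each column is a Series sharing the table's index, here is a minimal sketch. The name `df_states` is only chosen for this illustration, and the `columns` argument at the end is an optional extra that fixes the column order.

```python
import pandas as pd

# Same toy data as in the cell above
data = {'state': ['ohio', 'ohio', 'ohio', 'nevada', 'nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.6, 1.7, 1.4, 2.4, 2.9]}
df_states = pd.DataFrame(data)

# Selecting one column gives back a Series ...
pop = df_states['pop']
print(type(pop).__name__)

# ... whose index is the DataFrame's index
print(pop.index.equals(df_states.index))

# The column order can be chosen explicitly with the `columns` parameter
df_reordered = pd.DataFrame(data, columns=['year', 'state', 'pop'])
print(list(df_reordered.columns))
```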


### From a csv file: `pandas.read_csv()`

Before this, make sure you've downloaded and placed the 'students.csv' data file in the same folder as this ipynb file.

In [4]:
students = pd.read_csv('students.csv')
students

,Name,hw1,hw2,program
0,Dorian,10.0,10.0,MSIS
1,Jeannine,6.0,7.0,MSIS
2,Iluminada,2.0,,MBA
3,Luci,7.0,7.0,MSIS
4,Jenny,8.0,,
5,Demetria,2.0,4.0,MSIS
6,Michael,6.0,10.0,MBA
7,Garland,9.0,1.0,MSIS
8,Shelby,1.0,10.0,MSIS
9,Mercy,5.0,6.0,MSIS


In [11]:
students['hw1_high_score'] = students.hw1.apply(lambda x : 'high' if x>6 else 'low')
students

,Name,hw1,hw2,program,hw1_high_score
0,Dorian,10.0,10.0,MSIS,high
1,Jeannine,6.0,7.0,MSIS,low
2,Iluminada,2.0,,MBA,low
3,Luci,7.0,7.0,MSIS,high
4,Jenny,8.0,,,high
5,Demetria,2.0,4.0,MSIS,low
6,Michael,6.0,10.0,MBA,low
7,Garland,9.0,1.0,MSIS,high
8,Shelby,1.0,10.0,MSIS,low
9,Mercy,5.0,6.0,MSIS,low
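
The cell above labels each hw1 score with `apply` and a lambda. A vectorized alternative is `numpy.where`; the sketch below compares the two on a hypothetical four-row table (not the real students.csv). Note that a missing score compares as `False`, so both versions label it 'low'.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few rows of students.csv
toy = pd.DataFrame({'Name': ['Dorian', 'Jeannine', 'Luci', 'John'],
                    'hw1': [10.0, 6.0, 7.0, np.nan]})

# Element-wise version, as in the cell above
toy['via_apply'] = toy.hw1.apply(lambda x: 'high' if x > 6 else 'low')

# Vectorized version: np.where picks 'high' where the mask is True, else 'low'
toy['via_where'] = np.where(toy.hw1 > 6, 'high', 'low')

print(toy)
```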


Recall that a pandas Series has two index systems: the **positional index** and the **index** (the labels). The same extends to DataFrame. The **positional index** always exists and is simply the integers 0, 1, ... . Setting the **index** labels, however, is optional: if we don't specify them, pandas uses the integers 0, 1, ... as labels too.

In the example above, we haven't set a column as the index yet (and in practice we often don't bother). But we can always do so with the method `DataFrame.set_index()`. Let's designate 'Name' as the index:

In [12]:
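# students.set_index('Name', inplace=True)   # inplace=True: modify students itself instead of returning a new DataFrame
df = students[['Name', 'hw1', 'hw2', 'program']].set_index('Name')   # df from In [3] has no 'Name' column, so start from students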
df

Name,hw1,hw2,program
Dorian,10.0,10.0,MSIS
Jeannine,6.0,7.0,MSIS
Iluminada,2.0,,MBA
Luci,7.0,7.0,MSIS
Jenny,8.0,,
Demetria,2.0,4.0,MSIS
Michael,6.0,10.0,MBA
Garland,9.0,1.0,MSIS
Shelby,1.0,10.0,MSIS
Mercy,5.0,6.0,MSIS
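
The sketch below, on a hypothetical three-row table, illustrates two points about `set_index()`: it returns a new DataFrame and leaves the original untouched, and `reset_index()` moves the index back into an ordinary column.

```python
import pandas as pd

toy = pd.DataFrame({'Name': ['Dorian', 'Jeannine', 'Iluminada'],
                    'hw1': [10.0, 6.0, 2.0]})

indexed = toy.set_index('Name')      # returns a new DataFrame
print(list(toy.columns))             # the original still has 'Name' as a column
print(list(indexed.columns))         # 'Name' moved from the columns to the index

restored = indexed.reset_index()     # moves the index back into a column
print(restored.equals(toy))
```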


### Using `DataFrame.to_csv()` to write to a file 'temp.csv'

In [None]:
df.to_csv("temp.csv")
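
By default `to_csv()` writes the index as the first column of the file. The sketch below shows a round trip on a hypothetical two-row table; it writes to an in-memory `io.StringIO` buffer instead of 'temp.csv' so that it runs without touching the disk, and uses `index_col` when reading so that 'Name' becomes the index again.

```python
import io
import pandas as pd

toy = pd.DataFrame({'hw1': [10.0, 6.0]},
                   index=pd.Index(['Dorian', 'Jeannine'], name='Name'))

buffer = io.StringIO()        # in-memory stand-in for 'temp.csv'
toy.to_csv(buffer)            # the index is written as the first column
print(buffer.getvalue())

buffer.seek(0)                # rewind before reading
back = pd.read_csv(buffer, index_col='Name')   # restore 'Name' as the index
print(back)
```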

### Missing Value
`NaN` ("not a number") marks a missing value.

Example: set Dorian's hw1 grade to a missing value

In [None]:
df.loc['Dorian', 'hw1'] = np.nan

In [None]:
df
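
Once a table contains missing values, it helps to know how to detect them and how aggregates treat them. A minimal sketch on a hypothetical three-row table:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'hw1': [10.0, 6.0, 2.0], 'hw2': [10.0, 7.0, np.nan]},
                   index=['Dorian', 'Jeannine', 'Iluminada'])

toy.loc['Dorian', 'hw1'] = np.nan    # same assignment as above

print(toy.isna())                    # True where a value is missing
print(toy.isna().sum())              # number of missing values per column
print(toy.hw1.mean())                # NaN is skipped: (6 + 2) / 2
print(np.nan == np.nan)              # NaN never compares equal, so use isna()
```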

## index, columns, values

`DataFrame.index` returns the index labels

In [14]:
df.index

Index(['Dorian', 'Jeannine', 'Iluminada', 'Luci', 'Jenny', 'Demetria',
       'Michael', 'Garland', 'Shelby', 'Mercy', 'John'],
      dtype='object', name='Name')

`DataFrame.columns` returns the column names (as an `Index` object)

In [15]:
df.columns

Index(['hw1', 'hw2', 'program'], dtype='object')

`DataFrame.values` returns a 2-dimensional ndarray of values

In [16]:
df.values

array([[10.0, 10.0, 'MSIS'],
       [6.0, 7.0, 'MSIS'],
       [2.0, nan, 'MBA'],
       [7.0, 7.0, 'MSIS'],
       [8.0, nan, nan],
       [2.0, 4.0, 'MSIS'],
       [6.0, 10.0, 'MBA'],
       [9.0, 1.0, 'MSIS'],
       [1.0, 10.0, 'MSIS'],
       [5.0, 6.0, 'MSIS'],
       [nan, 10.0, 'MSIS']], dtype=object)
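
The `dtype=object` in the output above appears because the columns have mixed types. The sketch below, on a hypothetical two-row table, uses `DataFrame.to_numpy()` (a method equivalent to `.values` for this purpose) to show that numeric-only columns give a float array instead; `DataFrame.shape` is included as a related attribute.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'hw1': [10.0, 6.0], 'hw2': [10.0, 7.0],
                    'program': ['MSIS', 'MBA']},
                   index=['Dorian', 'Jeannine'])

print(toy.shape)                      # (number of rows, number of columns)
print(toy.index.tolist())
print(toy.columns.tolist())

# Mixed column types force a generic object array ...
print(toy.to_numpy().dtype)
# ... while numeric-only columns give a float array
print(toy[['hw1', 'hw2']].to_numpy().dtype)
```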

## Selecting an element or a subset

### df.iloc[x, y]

Access one specific value, e.g. retrieve the element at the third row and second column

Access using the positional index. 
<ul>
<li><b>x</b> is the information needed to select the rows: positional index or range of integers</li>
<li><b>y (optional)</b> is the information needed to select the columns: positional index or range of integers</li>
</ul>

In [17]:
df.iloc[2, 1]

nan

Access one row by specifying a positional index, e.g. retrieve the third row

In [18]:
df.iloc[2,:]

hw1        2.0
hw2        NaN
program    MBA
Name: Iluminada, dtype: object

Or, more simply:

In [19]:
df.iloc[2]

hw1        2.0
hw2        NaN
program    MBA
Name: Iluminada, dtype: object

Access one column by specifying the positional index of the column, e.g. retrieve the second column

In [20]:
df.iloc[:, 1]

Name
Dorian       10.0
Jeannine      7.0
Iluminada     NaN
Luci          7.0
Jenny         NaN
Demetria      4.0
Michael      10.0
Garland       1.0
Shelby       10.0
Mercy         6.0
John         10.0
Name: hw2, dtype: float64

Access a subset of rows and columns, e.g. retrieve the first five rows and the columns from the second-to-last one to the end

In [21]:
df.iloc[:5, -2:]

Name,hw2,program
Dorian,10.0,MSIS
Jeannine,7.0,MSIS
Iluminada,,MBA
Luci,7.0,MSIS
Jenny,,
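
Positional slices follow the usual Python rule that the end position is excluded. The sketch below, on a hypothetical four-row table, also shows lists of positions and negative positions.

```python
import pandas as pd

toy = pd.DataFrame({'hw1': [10.0, 6.0, 2.0, 7.0],
                    'hw2': [10.0, 7.0, 4.0, 7.0],
                    'program': ['MSIS', 'MSIS', 'MBA', 'MSIS']},
                   index=['Dorian', 'Jeannine', 'Iluminada', 'Luci'])

first_two = toy.iloc[:2]             # positions 0 and 1; end position 2 is excluded
print(first_two)

corners = toy.iloc[[0, 3], [0, 2]]   # lists pick arbitrary rows and columns
print(corners)

last_cell = toy.iloc[-1, -1]         # negative positions count from the end
print(last_cell)
```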


### df.loc[x, y]

Access using the index labels. 
<ul>
<li><b>x</b> is the information needed to select the rows: label index, range of index labels, or boolean masks</li>
<li><b>y (optional)</b> is the information needed to select the columns: label index, range of index labels, or boolean masks</li>
</ul>

Access one specific value by specifying index label and column name

In [22]:
df

Name,hw1,hw2,program
Dorian,10.0,10.0,MSIS
Jeannine,6.0,7.0,MSIS
Iluminada,2.0,,MBA
Luci,7.0,7.0,MSIS
Jenny,8.0,,
Demetria,2.0,4.0,MSIS
Michael,6.0,10.0,MBA
Garland,9.0,1.0,MSIS
Shelby,1.0,10.0,MSIS
Mercy,5.0,6.0,MSIS


Retrieve Garland's hw2 score

In [23]:
df.loc['Garland', 'hw2']

1.0

Access one row by specifying its index label, e.g. retrieve all of Garland's information

In [25]:
df.loc['Garland',:]

hw1         9.0
hw2         1.0
program    MSIS
Name: Garland, dtype: object

or, more simply:

In [26]:
df.loc['Garland']

hw1         9.0
hw2         1.0
program    MSIS
Name: Garland, dtype: object

Access one column by specifying its column label, e.g. retrieve everyone's hw1 score

In [27]:
df.loc[:, 'hw1']

Name
Dorian       10.0
Jeannine      6.0
Iluminada     2.0
Luci          7.0
Jenny         8.0
Demetria      2.0
Michael       6.0
Garland       9.0
Shelby        1.0
Mercy         5.0
John          NaN
Name: hw1, dtype: float64

For the above operation of retrieving a single column, pandas provides two shorter forms:

In [28]:
df['hw1']  # a column name inside the brackets selects that column

Name
Dorian       10.0
Jeannine      6.0
Iluminada     2.0
Luci          7.0
Jenny         8.0
Demetria      2.0
Michael       6.0
Garland       9.0
Shelby        1.0
Mercy         5.0
John          NaN
Name: hw1, dtype: float64

In [29]:
df.hw1

Name
Dorian       10.0
Jeannine      6.0
Iluminada     2.0
Luci          7.0
Jenny         8.0
Demetria      2.0
Michael       6.0
Garland       9.0
Shelby        1.0
Mercy         5.0
John          NaN
Name: hw1, dtype: float64

In [30]:
df[['hw1', 'hw2']]  # a list of column names selects several columns

Name,hw1,hw2
Dorian,10.0,10.0
Jeannine,6.0,7.0
Iluminada,2.0,
Luci,7.0,7.0
Jenny,8.0,
Demetria,2.0,4.0
Michael,6.0,10.0
Garland,9.0,1.0
Shelby,1.0,10.0
Mercy,5.0,6.0
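
One difference from `.iloc` is worth a quick illustration: label slices in `.loc` include both endpoints. The sketch below, on a hypothetical four-row table, also shows lists of labels and a boolean mask combined with a column label.

```python
import pandas as pd

toy = pd.DataFrame({'hw1': [10.0, 6.0, 2.0, 7.0],
                    'hw2': [10.0, 7.0, 4.0, 7.0],
                    'program': ['MSIS', 'MSIS', 'MBA', 'MSIS']},
                   index=['Dorian', 'Jeannine', 'Iluminada', 'Luci'])

# Label slices include BOTH endpoints (unlike positional slices)
block = toy.loc['Jeannine':'Luci', 'hw1':'hw2']
print(block)

# A list of labels picks rows or columns in the order given
picked = toy.loc[['Luci', 'Dorian'], ['program', 'hw1']]
print(picked)

# A boolean mask for the rows combined with a column label
hw1_of_good_hw2 = toy.loc[toy.hw2 > 5, 'hw1']
print(hw1_of_good_hw2)
```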


### Boolean mask
The code is similar to what we used on Series. A boolean mask retrieves the rows that satisfy the conditions we impose.

Determine whether each name starts with 'J'

In [32]:
(df.index >= 'J') & (df.index < 'K')

array([False,  True, False, False,  True, False, False, False, False,
       False,  True])

Retrieve rows with names starting with 'J'

In [31]:
df[(df.index >= 'J') & (df.index < 'K')]  # a boolean mask inside the brackets selects rows

Name,hw1,hw2,program
Jeannine,6.0,7.0,MSIS
Jenny,8.0,,
John,,10.0,MSIS
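
The range comparison above works, but the string method `Index.str.startswith()` states the intent more directly. The sketch below uses a hypothetical four-row table and also shows how masks combine with `&` (and), `|` (or) and `~` (not).

```python
import pandas as pd

toy = pd.DataFrame({'hw1': [10.0, 6.0, 8.0, 9.0]},
                   index=pd.Index(['Dorian', 'Jeannine', 'Jenny', 'Garland'],
                                  name='Name'))

mask = toy.index.str.startswith('J')   # array of True/False, one per row
print(mask)

j_rows = toy[mask]
print(j_rows)

# Each condition needs its own parentheses when combined
high_not_j = toy[(toy.hw1 > 7) & ~mask]
print(high_not_j)
```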


### Exercise

Retrieve Shelby's hw1 grade

In [33]:
df.loc['Shelby','hw1']

1.0

Retrieve all information about Shelby

In [34]:
df.loc['Shelby']

hw1         1.0
hw2        10.0
program    MSIS
Name: Shelby, dtype: object

Find all information about those students that obtained the highest grade in hw2. 
+ Caution: there are possible ties. 
+ Hint: use .max() and boolean mask

In [37]:
df[df.hw2==df.hw2.max()]
#df[condition]

Name,hw1,hw2,program
Dorian,10.0,10.0,MSIS
Michael,6.0,10.0,MBA
Shelby,1.0,10.0,MSIS
John,,10.0,MSIS


Find the average hw1 score of those students who got a hw2 score greater than 5.

In [41]:
df[df.hw2>5].hw1.mean()
#df[df.hw2>5].mean()['hw1']

5.833333333333333

## Sorting
Similar to the code on Series, except that now we need to specify which column or columns are used for sorting.

### `DataFrame.sort_values()`

Sort the table based on the values of a set of columns (parameter <b>by</b>). 

Sorting by one column, e.g. sorting the data by hw1 score in descending order

In [42]:
df.sort_values(by='hw1', ascending=False)

Name,hw1,hw2,program
Dorian,10.0,10.0,MSIS
Garland,9.0,1.0,MSIS
Jenny,8.0,,
Luci,7.0,7.0,MSIS
Jeannine,6.0,7.0,MSIS
Michael,6.0,10.0,MBA
Mercy,5.0,6.0,MSIS
Iluminada,2.0,,MBA
Demetria,2.0,4.0,MSIS
Shelby,1.0,10.0,MSIS


Sorting by several columns, e.g. by hw1 descending and, in case of ties, by hw2 ascending

In [43]:
df.sort_values(by=['hw1', 'hw2'], ascending=[False, True])

Name,hw1,hw2,program
Dorian,10.0,10.0,MSIS
Garland,9.0,1.0,MSIS
Jenny,8.0,,
Luci,7.0,7.0,MSIS
Jeannine,6.0,7.0,MSIS
Michael,6.0,10.0,MBA
Mercy,5.0,6.0,MSIS
Demetria,2.0,4.0,MSIS
Iluminada,2.0,,MBA
Shelby,1.0,10.0,MSIS
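
In the output above, Iluminada (missing hw2) comes after Demetria even though hw2 is sorted in ascending order. That is because `sort_values()` puts missing values last by default, whatever the direction. A minimal sketch on a hypothetical three-row table, including the `na_position` parameter that changes this:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'hw1': [2.0, 2.0, 9.0], 'hw2': [np.nan, 4.0, 1.0]},
                   index=['Iluminada', 'Demetria', 'Garland'])

# Default: missing values go last, whatever the sort direction
asc = toy.sort_values(by='hw2')
desc = toy.sort_values(by='hw2', ascending=False)
print(asc)
print(desc)

# na_position='first' moves them to the top instead
na_first = toy.sort_values(by='hw2', na_position='first')
print(na_first)
```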


### Sort objects by label: `DataFrame.sort_index()`
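
No example is given for this method, so here is a minimal sketch on a hypothetical three-row table. Like `sort_values()`, `sort_index()` returns a new DataFrame and leaves the original unchanged.

```python
import pandas as pd

toy = pd.DataFrame({'hw1': [10.0, 6.0, 2.0]},
                   index=['Dorian', 'Jeannine', 'Iluminada'])

by_label = toy.sort_index()                      # alphabetical order of the labels
by_label_desc = toy.sort_index(ascending=False)  # reverse order
print(by_label)
print(by_label_desc)
print(toy.index.tolist())                        # the original is left unchanged
```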

### `DataFrame.head()`

`head()` returns the first n rows; `tail()` returns the last n rows. In both cases n defaults to 5.

In [45]:
df.head(2)

Name,hw1,hw2,program
Dorian,10.0,10.0,MSIS
Jeannine,6.0,7.0,MSIS


### `DataFrame.tail()`

In [44]:
df.tail()

Name,hw1,hw2,program
Michael,6.0,10.0,MBA
Garland,9.0,1.0,MSIS
Shelby,1.0,10.0,MSIS
Mercy,5.0,6.0,MSIS
John,,10.0,MSIS


### Exercise

Sort the MSIS students by hw2 descending.

In [46]:
df[df.program=='MSIS'].sort_values('hw2', ascending=False)

Name,hw1,hw2,program
Dorian,10.0,10.0,MSIS
Shelby,1.0,10.0,MSIS
John,,10.0,MSIS
Jeannine,6.0,7.0,MSIS
Luci,7.0,7.0,MSIS
Mercy,5.0,6.0,MSIS
Demetria,2.0,4.0,MSIS
Garland,9.0,1.0,MSIS


Show <b>only</b> the field <i>hw1</i> of the students with the highest hw2 grade

In [47]:
df[df.hw2==df.hw2.max()].hw1

Name
Dorian     10.0
Michael     6.0
Shelby      1.0
John        NaN
Name: hw1, dtype: float64

## Aggregate functions: `DataFrame.mean()`, `DataFrame.min()`, `DataFrame.max()`, etc.

Aggregate functions for DataFrame are
+ similar to the corresponding functions for Series, and
+ applied to each column (axis = 0, which is the default) or to each row (axis = 1). Numeric aggregators only make sense on numeric data, so select the numeric columns first (or pass `numeric_only=True`).

The average for each hw: `DataFrame.mean()`

In [54]:
df_num = df[['hw1','hw2']]
df_num

Name,hw1,hw2
Dorian,10.0,10.0
Jeannine,6.0,7.0
Iluminada,2.0,
Luci,7.0,7.0
Jenny,8.0,
Demetria,2.0,4.0
Michael,6.0,10.0
Garland,9.0,1.0
Shelby,1.0,10.0
Mercy,5.0,6.0


In [55]:
df_num.mean()  # program is not numerical, so we can't take the mean of that column

hw1    5.600000
hw2    7.222222
dtype: float64

The average for each student

In [56]:
df_num.mean(axis=1)

Name
Dorian       10.0
Jeannine      6.5
Iluminada     2.0
Luci          7.0
Jenny         8.0
Demetria      3.0
Michael       8.0
Garland       5.0
Shelby        5.5
Mercy         5.5
John         10.0
dtype: float64
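
The row averages above silently skip missing grades: Iluminada's average of 2.0 is just her hw1 score. The sketch below, on a hypothetical three-row table, shows the `skipna` parameter that controls this and uses `count()` to see how many grades each average is based on.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'hw1': [10.0, 2.0, 8.0], 'hw2': [10.0, np.nan, np.nan]},
                   index=['Dorian', 'Iluminada', 'Jenny'])

col_means = toy.mean()                       # one value per column (axis=0)
row_means = toy.mean(axis=1)                 # one value per row, NaN skipped
strict = toy.mean(axis=1, skipna=False)      # NaN if any grade in the row is missing
n_grades = toy.count(axis=1)                 # non-missing grades per student

print(col_means)
print(row_means)
print(strict)
print(n_grades)
```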

### Exercise

Compute the spread (i.e., highest minus lowest hw grade) of each student

Hint: use `DataFrame.max(axis=1)` to find each row/student's max, and use `DataFrame.min(axis=1)` to find the min.

In [58]:
df_num.max(axis=1) - df_num.min(axis=1)

Name
Dorian       0.0
Jeannine     1.0
Iluminada    0.0
Luci         0.0
Jenny        0.0
Demetria     2.0
Michael      4.0
Garland      8.0
Shelby       9.0
Mercy        1.0
John         0.0
dtype: float64

Who has the largest spread?

In [59]:
(df_num.max(axis=1) - df_num.min(axis=1)).nlargest(1)

Name
Shelby    9.0
dtype: float64
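
Two alternatives to `nlargest(1)` are sketched below on a hypothetical three-row table: `idxmax()` returns only the label of the largest value, and a boolean mask against the maximum returns every student in case of ties.

```python
import pandas as pd

toy = pd.DataFrame({'hw1': [9.0, 1.0, 6.0], 'hw2': [1.0, 10.0, 10.0]},
                   index=['Garland', 'Shelby', 'Michael'])

spread = toy.max(axis=1) - toy.min(axis=1)

winner = spread.idxmax()                      # label of the largest spread
print(winner)

all_winners = spread[spread == spread.max()]  # keeps ties, if there are any
print(all_winners)
```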

## Modifying DataFrames

Make a copy of the data frame: `DataFrame.copy()`

In [60]:
df1 = df.copy()
df1

Name,hw1,hw2,program
Dorian,10.0,10.0,MSIS
Jeannine,6.0,7.0,MSIS
Iluminada,2.0,,MBA
Luci,7.0,7.0,MSIS
Jenny,8.0,,
Demetria,2.0,4.0,MSIS
Michael,6.0,10.0,MBA
Garland,9.0,1.0,MSIS
Shelby,1.0,10.0,MSIS
Mercy,5.0,6.0,MSIS


### Add rows

A new student has joined. His name is Oliver and he is in the MSIS program; his hw1 is missing and his hw2 score is 8.

In [61]:
df1.loc['Oliver', :] = [np.nan, 8, 'MSIS']
df1

Name,hw1,hw2,program
Dorian,10.0,10.0,MSIS
Jeannine,6.0,7.0,MSIS
Iluminada,2.0,,MBA
Luci,7.0,7.0,MSIS
Jenny,8.0,,
Demetria,2.0,4.0,MSIS
Michael,6.0,10.0,MBA
Garland,9.0,1.0,MSIS
Shelby,1.0,10.0,MSIS
Mercy,5.0,6.0,MSIS


Exercise: A new student has joined. Her name is Caroline and she got 4 in hw2. She is not in any program yet.

In [64]:
df1.loc['Caroline', :] = [np.nan, 4, np.nan]
df1

Name,hw1,hw2,program
Dorian,10.0,10.0,MSIS
Jeannine,6.0,7.0,MSIS
Iluminada,2.0,,MBA
Luci,7.0,7.0,MSIS
Jenny,8.0,,
Demetria,2.0,4.0,MSIS
Michael,6.0,10.0,MBA
Garland,9.0,1.0,MSIS
Shelby,1.0,10.0,MSIS
Mercy,5.0,6.0,MSIS
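
When a row is added with `.loc` and a list, the values are matched to the columns by position. The sketch below repeats that on a hypothetical two-row table and then shows `pd.concat()` as an alternative, not used elsewhere in this notebook, which matches values by column name and fills whatever is absent with NaN.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'hw1': [10.0, 6.0], 'hw2': [10.0, 7.0],
                    'program': ['MSIS', 'MSIS']},
                   index=['Dorian', 'Jeannine'])

# Values are matched to columns by position: hw1, hw2, program
toy.loc['Oliver'] = [np.nan, 8, 'MSIS']

# pd.concat matches by column name; columns missing from new_rows become NaN
new_rows = pd.DataFrame({'hw2': [4.0]}, index=['Caroline'])
toy = pd.concat([toy, new_rows])

print(toy)
```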


### Add columns

Add an "empty" column <b>hw3</b>

In [65]:
df1.loc[:, 'hw3'] = np.nan
df1

Name,hw1,hw2,program,hw3
Dorian,10.0,10.0,MSIS,
Jeannine,6.0,7.0,MSIS,
Iluminada,2.0,,MBA,
Luci,7.0,7.0,MSIS,
Jenny,8.0,,,
Demetria,2.0,4.0,MSIS,
Michael,6.0,10.0,MBA,
Garland,9.0,1.0,MSIS,
Shelby,1.0,10.0,MSIS,
Mercy,5.0,6.0,MSIS,


In [67]:
# Or:
df1['hw3'] = np.nan

In [None]:
df1.hw4 = np.nan  # this does not add a column: attribute assignment only works for columns that already exist

In [69]:
df1['hw1']
df1.hw1  # attribute access only works for columns that already exist

Name
Dorian       10.0
Jeannine      6.0
Iluminada     2.0
Luci          7.0
Jenny         8.0
Demetria      2.0
Michael       6.0
Garland       9.0
Shelby        1.0
Mercy         5.0
John          NaN
Oliver        NaN
Caroline      NaN
Name: hw1, dtype: float64
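
The difference between the two notations can be checked directly. In the sketch below, on a hypothetical two-row table, assigning to a new attribute leaves the columns unchanged, while bracket assignment creates the column.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'hw1': [10.0, 6.0]}, index=['Dorian', 'Jeannine'])

toy.hw4 = np.nan                       # sets a plain Python attribute, not a column
created_by_attribute = 'hw4' in toy.columns
print(created_by_attribute)

toy['hw4'] = np.nan                    # bracket assignment creates the column
created_by_brackets = 'hw4' in toy.columns
print(created_by_brackets)

toy.hw1 = toy.hw1 + 1                  # attribute assignment works for existing columns
print(toy['hw1'].tolist())
```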

### Add calculated columns

Let's add a column with the final grade. It is computed as 0.2\*hw1 + 0.8\*hw2.

In [70]:
df1['final'] = 0.2*df1.hw1 + 0.8*df1.hw2
df1

Name,hw1,hw2,program,hw3,final
Dorian,10.0,10.0,MSIS,,10.0
Jeannine,6.0,7.0,MSIS,,6.8
Iluminada,2.0,,MBA,,
Luci,7.0,7.0,MSIS,,7.0
Jenny,8.0,,,,
Demetria,2.0,4.0,MSIS,,3.6
Michael,6.0,10.0,MBA,,9.2
Garland,9.0,1.0,MSIS,,2.6
Shelby,1.0,10.0,MSIS,,8.2
Mercy,5.0,6.0,MSIS,,5.8
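
Notice that the final grade is missing wherever hw1 or hw2 is missing, because arithmetic with NaN gives NaN. The sketch below shows this on a hypothetical two-row table. The second column treats a missing homework as a score of 0; that grading policy is only an assumption made for this illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'hw1': [10.0, 2.0], 'hw2': [10.0, np.nan]},
                   index=['Dorian', 'Iluminada'])

# NaN propagates: Iluminada has no hw2, so her final grade is NaN
toy['final'] = 0.2 * toy.hw1 + 0.8 * toy.hw2

# Assumed policy for illustration only: a missing homework counts as 0
toy['final_zero'] = 0.2 * toy.hw1.fillna(0) + 0.8 * toy.hw2.fillna(0)

print(toy)
```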


## `DataFrame.groupby()`

Similar to `group_by` in R and GROUP BY in SQL, the `groupby()` operation on a pandas DataFrame splits a dataset into multiple subsets, so that subsequent operations can be performed on each subset separately.

Consider the following dataset that comes from a survey of graduate students, asking them about their various computer programming skills.

In [71]:
df = pd.read_csv('partially_cleaned_survey.csv')
df.head()

,Expert,BachTime,Program,ProgSkills,C,CPP,CS,Java,Python,JS,R,SQL,SAS,Excel,Classification,Clustering,Job,Languages
0,1,longer than 1 year ago but less than 3 years ago,MSIS,4,1,1,0.0,1,1.0,1.0,0.0,1.0,0.0,1,4,4,0.0,7.0
1,0,over 5 years ago,MSIS,3,1,1,0.0,1,0.0,0.0,0.0,1.0,0.0,1,2,2,0.5,5.0
2,0,longer than 3 years ago but less than 5 years ago,MSIS,3,0,0,0.0,1,1.0,0.0,0.0,1.0,0.0,1,3,3,0.0,4.0
3,0,over 5 years ago,MSIS,3,1,0,0.0,1,1.0,0.0,1.0,1.0,0.0,1,2,3,0.0,6.0
4,0,longer than 3 years ago but less than 5 years ago,MSIS,3,1,0,0.0,1,1.0,0.0,0.0,1.0,0.0,1,1,1,0.0,5.0


If we want to compute the average of every numeric column, we can do so by:

In [72]:
df.mean(numeric_only=True)  # skip text columns such as BachTime and Program


Expert            0.049180
ProgSkills        2.852459
C                 0.590164
CPP               0.442623
CS                0.083333
Java              0.737705
Python            0.433333
JS                0.383333
R                 0.200000
SQL               0.850000
SAS               0.100000
Excel             0.950820
Classification    1.868852
Clustering        1.836066
Job               0.352459
Languages         4.732143
dtype: float64

But what if we want to compute the average values for all columns *for different programs separately*, including 'MSIS', 'Supply Chain Mgmt & Analytics', 'MBA', 'Faculty!', 'Business Man', 'Master of Finance'?

The method `groupby` splits the data into groups based on some criteria, e.g. by the values of a column. We can then aggregate other columns separately for each group.

First, call the groupby method to divide data into groups based on the values in 'Program', which returns a DataFrameGroupBy object

In [73]:
df.groupby(by='Program')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000023084B81460>

Second, apply the aggregate function 'mean' to the DataFrameGroupBy object

In [74]:
df.groupby(by='Program').mean(numeric_only=True)  # text columns such as BachTime are skipped

Program,Expert,ProgSkills,C,CPP,CS,Java,Python,JS,R,SQL,SAS,Excel,Classification,Clustering,Job,Languages
Business Man,0.0,1.0,0.0,0.0,,0.0,0.0,1.0,1.0,0.0,0.0,1.0,2.0,3.0,1.0,
Faculty!,1.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,5.0,5.0,1.0,4.0
MBA,0.0625,2.5,0.3125,0.375,0.125,0.375,0.375,0.333333,0.4375,0.625,0.133333,1.0,1.9375,1.9375,0.65625,4.066667
MSIS,0.025,3.075,0.7,0.475,0.075,0.975,0.487179,0.4,0.102564,0.948718,0.075,0.925,1.775,1.75,0.2,5.081081
Master of Finance,0.0,4.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,6.0
Supply Chain Mgmt & Analytics,0.0,1.5,0.5,0.5,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,2.0,1.0,0.5,3.0
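
To see what the grouped object holds before it is aggregated, we can loop over it. The sketch below uses a hypothetical four-row table in place of the survey data: each group is an ordinary DataFrame containing the rows of one program, and `mean()` then produces one row per group.

```python
import pandas as pd

# Hypothetical stand-in for the survey data
toy = pd.DataFrame({'Program': ['MSIS', 'MBA', 'MSIS', 'MBA'],
                    'Job': [0.0, 1.0, 0.5, 0.5],
                    'SQL': [1.0, 0.0, 1.0, 1.0]})

grouped = toy.groupby(by='Program')

# Split: each group is a DataFrame holding the rows of one program
for program, rows in grouped:
    print(program, len(rows))

# Apply and combine: one row of means per program
means = grouped.mean()
print(means)
```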


#### Exercise: Display the mean of all columns, grouping by the Job situation

In [75]:
df.groupby(by='Job').mean(numeric_only=True)  # text columns such as BachTime are skipped

Job,Expert,ProgSkills,C,CPP,CS,Java,Python,JS,R,SQL,SAS,Excel,Classification,Clustering,Languages
0.0,0.0625,2.90625,0.625,0.40625,0.09375,0.78125,0.451613,0.40625,0.15625,0.870968,0.0625,0.96875,1.84375,1.84375,4.733333
0.5,0.0,3.066667,0.666667,0.6,0.066667,0.933333,0.4,0.2,0.142857,0.933333,0.2,1.0,1.733333,1.6,5.071429
1.0,0.071429,2.5,0.428571,0.357143,0.076923,0.428571,0.428571,0.538462,0.357143,0.714286,0.076923,0.857143,2.071429,2.071429,4.333333


### Aggregate only some columns

Oftentimes, we don't want to aggregate all columns. For example, we want to find only the average of Job, grouped by Program.

In [76]:
df.groupby(by='Program')['Job'].mean()

Program
Business Man                     1.00000
Faculty!                         1.00000
MBA                              0.65625
MSIS                             0.20000
Master of Finance                0.00000
Supply Chain Mgmt & Analytics    0.50000
Name: Job, dtype: float64

Or several columns, passed as a list. For example, we want to find the average of Job, C, and R, grouped by Program.

In [77]:
df.groupby(by='Program')[['Job', 'C', 'R']].mean()  # note the double brackets: a list of column names


Program,Job,C,R
Business Man,1.0,0.0,1.0
Faculty!,1.0,1.0,0.0
MBA,0.65625,0.3125,0.4375
MSIS,0.2,0.7,0.102564
Master of Finance,0.0,1.0,0.0
Supply Chain Mgmt & Analytics,0.5,0.5,0.0
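
When different summaries are needed at once, the `agg()` method of the grouped object accepts a list of function names, or a dict mapping each column to its own function. The sketch below uses a hypothetical four-row table in place of the survey data.

```python
import pandas as pd

# Hypothetical stand-in for the survey data
toy = pd.DataFrame({'Program': ['MSIS', 'MBA', 'MSIS', 'MBA'],
                    'Job': [0.0, 1.0, 0.5, 0.5],
                    'C': [1, 0, 1, 1]})

# Several aggregates of one column: one output column per function
job_stats = toy.groupby('Program')['Job'].agg(['mean', 'min', 'max'])
print(job_stats)

# A different aggregate for each column
per_column = toy.groupby('Program').agg({'Job': 'mean', 'C': 'sum'})
print(per_column)
```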


### Exercises

For each Job situation (0=no job, 0.5=part time, 1=full time), find the proportion of students that know SQL (hint: mean) 

In [78]:
df.groupby(by='Job').SQL.mean()  # the mean of a 0/1 column is the proportion of 1s

Job
0.0    0.870968
0.5    0.933333
1.0    0.714286
Name: SQL, dtype: float64

For each program, count how many students know SQL (hint: sum)

In [79]:
df.groupby(by='Program').SQL.sum()

Program
Business Man                      0.0
Faculty!                          1.0
MBA                              10.0
MSIS                             37.0
Master of Finance                 1.0
Supply Chain Mgmt & Analytics     2.0
Name: SQL, dtype: float64

Considering only the students who know SQL, find for each Program the proportion of students who know Java

In [81]:
df[df.SQL==1].groupby(by='Program').Java.mean()

Program
Faculty!                         0.000000
MBA                              0.600000
MSIS                             0.972973
Master of Finance                0.000000
Supply Chain Mgmt & Analytics    0.000000
Name: Java, dtype: float64