### Note: this is a copy of the original pandas_part2 notebook that includes the coding that was done in class

# Pandas part 2

This notebook continues to work with the pandas dataframe library. It builds on and assumes familiarity with the concepts in _pandas_part1.ipynb_.

#### data files used in this notebook

This notebook uses data files from the pandas_data_1 zip file on Brightspace. Unzip it to create a folder called `pandas_data_1` with a number of files inside.

This notebook generally assumes that the data folder is located one level up from the notebook location so that we can load using relative path syntax such as:

`pd.read_csv('../pandas_data_1/salary.csv')`

That line would look for a folder called pandas_data_1 that is one level up from the notebook and then inside of that folder it looks for salary.csv file.
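
As a rough sketch of the same idea, the relative path can be built and checked with the `os` library before loading anything. The folder and file names are the ones assumed above; `data_path` is just an illustrative variable name.

```python
import os

# build the relative path piece by piece; '..' means "one level up"
data_path = os.path.join('..', 'pandas_data_1', 'salary.csv')
print(data_path)

# check that the file is really there before trying to read it
if not os.path.exists(data_path):
    print('could not find', data_path, 'starting from', os.getcwd())
```

If the message prints, the data folder is probably not one level up from wherever the notebook is running.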

### Key concepts:

- (review) loading data from files into pandas dataframes
- (review) accessing columns in the dataframe
- making dataframes from dictionaries

- getting information about a dataframe
    - size of the data
    - listing the columns
    - finding all of the values that appear in a column

- adding and removing columns and rows from a dataframe
- handling missing data
- data organization: "tidy" data
- combining dataframes (concatenating and merging)
- processing columns: broadcasting and handling all column values at once
- selecting data based on conditions
- saving dataframes to csv files


#### First, import any libraries we're using in this session

- pandas: dataframe library
- numpy: many numeric functions
- numpy.random: a numpy sub library that gives access to many random number functions
- os: library that gives access to functions for dealing with directories and files

In [3]:
# os functions will be available using os. notation
import os 

# functions for these libraries can be 
# accessed using the short name and a dot
# (i.e., pd.read_csv() or np.array())
import pandas as pd
import numpy as np
import numpy.random as npr


## Reminder: loading data from a csv file

The pandas_part1.ipynb notebook has more info on loading data from file. 

Pandas `read_csv()` function takes as input the name (and possibly location) of a file to read.

Store the return value of `read_csv()` in a variable; that variable then holds a **dataframe** object.

In [4]:
# a filename
fname = 'salary.csv'

# read the file using pd.read_csv()
# ../ means go up one level in the directory
# structure:
salary_df = pd.read_csv('../pandas_data_1/salary.csv')

# just put the data frame name to see the contents:
salary_df



Unnamed: 0,salary,gender,departm,years,age,publications
0,86285,0,bio,26.0,64.0,72
1,77125,0,bio,28.0,58.0,43
2,71922,0,bio,10.0,38.0,23
3,70499,0,bio,16.0,46.0,64
4,66624,0,bio,11.0,41.0,23
...,...,...,...,...,...,...
72,53662,1,neuro,1.0,31.0,3
73,57185,1,stat,9.0,39.0,7
74,52254,1,stat,2.0,32.0,9
75,61885,1,math,23.0,60.0,9


### Access a column of data in the dataframe

Accessing the values in a column is like getting values in a dictionary. Put the dataframe variable name with square brackets and the column name you want.

```df_name['column name']```

In [11]:
# get the 'age' and 'publications' columns
# by passing a list of column names
col_list = ['age','publications']

salary_df[col_list]

Unnamed: 0,age,publications
0,64.0,72
1,58.0,43
2,38.0,23
3,46.0,64
4,41.0,23
...,...,...
72,31.0,3
73,39.0,7
74,32.0,9
75,60.0,9


### Creating a dataframe from a dictionary

If you remember the section we did on dictionaries you may see the nice link between dataframes and dictionaries.

You can create a dataframe from a dictionary using the pandas DataFrame() function with the dictionary as input.

```python
session_data = {'stimulus': ['a.jpg', 'l.jpg', 'p.jpg'],
               'response': [1, 1, 2], 
               'response time': [1.3, 1.8, .9]}

data_df = pd.DataFrame(session_data)

```


First: make a dictionary called `asd` that has keys **age** and **salary** and uses the a and s lists in the next cell as values.

In [15]:
# two lists (square brackets, comma separated entries)
a = [10,27,45,23,21]
s = [0,23000,100000,35000,60000]

In [18]:
# make a dictionary using {'key':'value'} syntax
asd = {'age': a, 
       'salary': s}

asd

{'age': [10, 27, 45, 23, 21], 'salary': [0, 23000, 100000, 35000, 60000]}

Use the dictionary as input to `pd.DataFrame()`

In [19]:
# create df1 using the dictionary of age and salary
df1 = pd.DataFrame(asd)
df1

Unnamed: 0,age,salary
0,10,0
1,27,23000
2,45,100000
3,23,35000
4,21,60000


In [21]:
df1['salary']

0         0
1     23000
2    100000
3     35000
4     60000
Name: salary, dtype: int64

## Getting information about your dataframe

### dataframe shape

Each dataframe object has a shape attribute attached to it.

```python
df.shape
```

This returns a _tuple_ of two values. A tuple is like a list that you can't edit. 

The first entry is the number of rows, the second is the number of columns.
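
Here is a small sketch of working with the shape tuple, using a made-up three-row dataframe (`small_df` is not part of the course data). It shows unpacking both values at once and what happens if you try to edit a tuple.

```python
import pandas as pd

# a small example dataframe: 3 rows and 2 columns
small_df = pd.DataFrame({'age': [10, 27, 45],
                         'salary': [0, 23000, 100000]})

# unpack the shape tuple into two variables in one step
nrows, ncols = small_df.shape
print(nrows, ncols)

# tuples cannot be edited: assigning to an entry raises a TypeError
try:
    small_df.shape[0] = 100
except TypeError as err:
    print('cannot change a tuple:', err)
```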



In [22]:
# get the dataframe rows and 
# columns using df.shape
salary_df.shape

(77, 6)

In [27]:
if salary_df.shape[0] != 77:
    print('wrong number of records')

In [28]:
# access the first (number of rows) or 
# second entry (number of columns) using 
# square-bracket indexing:
salary_df.shape[0]

# or put the shape in a variable and access it from there:
nrows = salary_df.shape[0]
ncols = salary_df.shape[1]

print(nrows)
print(ncols)

77
6


#### Exercise: check if our dataframe has some exact number of rows

Check if the number of rows in a dataframe is equal to some value. If it is not, print an error message like "dataframe does not have the correct number of rows"

In [31]:
salary_df.head(2)

Unnamed: 0,salary,gender,departm,years,age,publications
0,86285,0,bio,26.0,64.0,72
1,77125,0,bio,28.0,58.0,43


In [32]:
salary_df.drop(columns='departm')

Unnamed: 0,salary,gender,years,age,publications
0,86285,0,26.0,64.0,72
1,77125,0,28.0,58.0,43
2,71922,0,10.0,38.0,23
3,70499,0,16.0,46.0,64
4,66624,0,11.0,41.0,23
...,...,...,...,...,...
72,53662,1,1.0,31.0,3
73,57185,1,9.0,39.0,7
74,52254,1,2.0,32.0,9
75,61885,1,23.0,60.0,9


In [29]:
desired_length = 100

if salary_df.shape[0] != desired_length:
    print("dataframe does not have the correct number of rows")

dataframe does not have the correct number of rows


<div class="alert alert-info">
**Methods versus properties** <br>
    
The `.shape` property doesn't include a final `()`, unlike methods we have learned about such as `.drop()`, which require parentheses.  

This reflects that shape is a property (or attribute) of the dataframe while `.drop()` is a method.  

Why something is one or the other can be confusing.  However, it often helps to think of the distinction as being like the one between nouns and verbs in language.  Properties (nouns) are static descriptors of a dataset such as the size of the dataset or the column names.  

In contrast, methods (verbs) are things that require computation or modification of the dataframe, like deleting things or performing computations on columns or rows.
</div>
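
A quick sketch of the difference, using a made-up dataframe (`small_df` is only for illustration). The last part shows the error you get if you put parentheses after a property.

```python
import pandas as pd

small_df = pd.DataFrame({'age': [10, 27, 45],
                         'salary': [0, 23000, 100000]})

# a property: no parentheses, it just reports something about the dataframe
print(small_df.shape)

# a method: parentheses are needed because it does some work
print(small_df.head(2))

# calling a property as if it were a method fails,
# because the shape tuple is not something that can be called
try:
    small_df.shape()
except TypeError as err:
    print('error:', err)
```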

### Listing the dataframe columns

In addition to .shape each dataframe has a .columns attribute. This returns an Index object (an array-like collection) holding all the column names in the dataframe.

This can be used much like a list: you can index into it and use membership operators to see what is (or is not) in the set of columns.

In [33]:
salary_df.columns

Index(['salary', 'gender', 'departm', 'years', 'age', 'publications'], dtype='object')

In [34]:
# use square-bracket indexing to get individual values
salary_df.columns[2]

'departm'

In [35]:
cols = salary_df.columns

In [39]:
a = 'broadway'

'mercer' in a

False

In [42]:
l = ['a.jpg', 'l.jpg', 'p.jpg']

In [44]:
'a.jpg' in l

True

In [46]:
salary_df.columns

Index(['salary', 'gender', 'departm', 'years', 'age', 'publications'], dtype='object')

In [47]:
# use the 'in' operator to check whether there is a
# column called 'institution' in the dataframe
'institution' in salary_df.columns

False

You can convert the column array to a list using the .tolist() method attached to the column index:

In [49]:
col_list = salary_df.columns.tolist()
col_list

['salary', 'gender', 'departm', 'years', 'age', 'publications']

<div class="alert alert-info">
<strong>Chaining methods</strong><br>
Dataframes in pandas are what is known as an object-oriented structure. 

This means that most of the functionality of a dataframe is attached to the dataframe variables themselves rather than living in external functions.  

Most pandas methods return either the dataframe itself or an altered copy of it.  

Thus you can "chain" operations and methods together to make the code more concise.  Chaining means calling multiple methods in a row in a single expression.  

That's what lets us write something like:

`salary_df.columns.tolist()`

The `columns` attribute immediately evaluates to the column index, and that object has a `tolist()` method attached to it, so we can chain the two together.
</div>
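
Here is a sketch of a longer chain on a small made-up dataframe (the values and the name `small_df` are only for illustration). Wrapping the chain in parentheses lets us put one method per line.

```python
import pandas as pd

small_df = pd.DataFrame({'salary': [50000, 60000, 70000],
                         'departm': ['bio', 'chem', 'math'],
                         'years': [1.0, 5.0, 10.0]})

# each method returns a new dataframe, so the next method
# can be called directly on that result
result = (small_df
          .rename(columns={'departm': 'department'})
          .drop(columns='years')
          .head(2))
print(result)

# the original dataframe is unchanged
print(small_df.columns.tolist())
```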

### Finding out which values appear in a column

Sometimes it's useful to get a list of all of the possible values in a column without getting _all_ of the individual data values.

For example, in our salary_df we might like to know which departments (the 'departm' column) appear in the dataframe.

When you access a dataframe column you can append .unique() to it and you will get an array containing each value that appears in the column at least once.

In [51]:
salary_df['departm']

0       bio
1       bio
2       bio
3       bio
4       bio
      ...  
72    neuro
73     stat
74     stat
75     math
76     math
Name: departm, Length: 77, dtype: object

In [52]:
salary_df['departm'].unique()

array(['bio', 'chem', 'geol', 'neuro', 'stat', 'physics', 'math'],
      dtype=object)

In [54]:
# use membership to check if we have any data for
# the 'neuro' department:
'neuro' in salary_df['departm'].unique()

True

### Renaming columns

You can rename columns using the dataframe .rename() method.

It takes in a dictionary where keys are the original column name(s) and the values are the new column names.

In [55]:
# make a copy of the dataframe so we don't change the original
salary_copy = salary_df.copy()
salary_copy.head(3)


Unnamed: 0,salary,gender,departm,years,age,publications
0,86285,0,bio,26.0,64.0,72
1,77125,0,bio,28.0,58.0,43
2,71922,0,bio,10.0,38.0,23


In [56]:
# first make a dictionary giving the 
# old (keys) and new (values) column names
rename_dict = {'departm': 'department', 
              'years': 'years_of_work'}

Use rename() and output the result to a new variable or use inplace=True to make the changes stick.

In [64]:
# rename() returns a new dataframe, so without inplace=True
# the changes only stick if we store the output in a variable
rename_dict = {'departm': 'department', 
              'years': 'years_of_work'}

new_df = salary_copy.rename(columns=rename_dict)
new_df

Unnamed: 0,salary,gender,department,years_of_work,age,publications
0,86285,0,bio,26.0,64.0,72
1,77125,0,bio,28.0,58.0,43
2,71922,0,bio,10.0,38.0,23
3,70499,0,bio,16.0,46.0,64
4,66624,0,bio,11.0,41.0,23
...,...,...,...,...,...,...
72,53662,1,neuro,1.0,31.0,3
73,57185,1,stat,9.0,39.0,7
74,52254,1,stat,2.0,32.0,9
75,61885,1,math,23.0,60.0,9


In [61]:
# do it with inplace=True

salary_copy.rename(columns=rename_dict, inplace=True)
salary_copy.head()

Unnamed: 0,salary,gender,department,years_of_work,age,publications
0,86285,0,bio,26.0,64.0,72
1,77125,0,bio,28.0,58.0,43
2,71922,0,bio,10.0,38.0,23
3,70499,0,bio,16.0,46.0,64
4,66624,0,bio,11.0,41.0,23


## Adding and deleting things from a dataframe

Sometimes after we read in a dataset we might want to add new rows or columns or delete rows and columns from a dataframe.  

One way to do this is to edit the original CSV file that we read in. 

However, there is an important principle I want to emphasize throughout this class: 

<div class="alert alert-danger" role="alert">
  <strong>ALWAYS do everything in code!</strong> <br>
    What does it mean to do everything in code?  It means that if you go to your data file that you got from some place and then delete some of the data by hand in Google Sheets or Excel, there is no record of it. Once you save the file the data will be deleted and no one will know you did this.
  
  Instead, if you keep your data files as "raw" as possible and modify them using code, your code will document ALL of the steps you did in your analysis including the step of deleting data.  
    
"Excluding" (not DELETEing) data is sometimes justified but the important thing is we want to document all our steps honestly and truthfully when doing analysis.  Using code to do every single step of an analysis helps us accomplish this.
</div>



### Delete a column from a dataframe using the df.drop() method

Dataframe objects have a drop() method attached to them:

`df.drop(columns = 'name of column to delete', inplace=True/False)`

Set `inplace` to True to make changes to the calling dataframe, False to output the results into a new dataframe variable.

To delete multiple columns the input to `columns=` should be a list of column names:

`df.drop(columns = ['col_name_1', 'col_name_2'], inplace=True/False)`

In [65]:
salary_df.head()

Unnamed: 0,salary,gender,departm,years,age,publications
0,86285,0,bio,26.0,64.0,72
1,77125,0,bio,28.0,58.0,43
2,71922,0,bio,10.0,38.0,23
3,70499,0,bio,16.0,46.0,64
4,66624,0,bio,11.0,41.0,23


In [66]:
# use drop() method to get rid of 
# the gender column in our salary_df
# use inplace=True to make the changes 
# stick
salary_df.drop(columns = 'gender', inplace=True)

# take a look and the gender column is gone
salary_df.head()

Unnamed: 0,salary,departm,years,age,publications
0,86285,bio,26.0,64.0,72
1,77125,bio,28.0,58.0,43
2,71922,bio,10.0,38.0,23
3,70499,bio,16.0,46.0,64
4,66624,bio,11.0,41.0,23


In [69]:
# delete multiple columns
copy_df = salary_df.copy()
copy_df.head()

copy_df.drop(columns=['publications', 'years'], inplace=True)
copy_df

Unnamed: 0,salary,departm,age
0,86285,bio,64.0
1,77125,bio,58.0
2,71922,bio,38.0
3,70499,bio,46.0
4,66624,bio,41.0
...,...,...,...
72,53662,neuro,31.0
73,57185,stat,39.0
74,52254,stat,32.0
75,61885,math,60.0


### Deleting rows

To delete a row you can use the `.drop()` method to drop a particular item using its row index value.  By default `.drop()` is not an "in place" operation; instead it returns a new dataframe with the particular rows removed.

In [70]:
# make a copy of the salary_df dataframe
working_df = salary_df.copy()
working_df.head()

Unnamed: 0,salary,departm,years,age,publications
0,86285,bio,26.0,64.0,72
1,77125,bio,28.0,58.0,43
2,71922,bio,10.0,38.0,23
3,70499,bio,16.0,46.0,64
4,66624,bio,11.0,41.0,23


If we want to delete the first row we can drop it using the index it has, in this case 0.

To preserve the change we either need to assign the output to a new variable or set inplace=True

In [71]:
# drop doesn't stick because no new variable
# and no use of inplace=True
working_df.drop(0)
working_df.head()

Unnamed: 0,salary,departm,years,age,publications
0,86285,bio,26.0,64.0,72
1,77125,bio,28.0,58.0,43
2,71922,bio,10.0,38.0,23
3,70499,bio,16.0,46.0,64
4,66624,bio,11.0,41.0,23


In [None]:
# make it stick with inplace=True
working_df.drop(0, inplace=True)
working_df.head()

You can also remove multiple rows at once by passing a list of their index values to the drop() method:

In [72]:
list_to_drop = [2,4,6]
working_df.drop(list_to_drop).head()

Unnamed: 0,salary,departm,years,age,publications
1,77125,bio,28.0,58.0,43
3,70499,bio,16.0,46.0,64
5,64451,bio,23.0,60.0,44
7,59344,bio,5.0,40.0,11
8,58560,bio,8.0,38.0,8


### Rows, columns and axes

An alternative way to delete a column uses the `.drop()` method with an additional argument that refers to the "axis" you are dropping from.

Our dataframe has two axes: the rows and the columns. The rows are along axis 0 and the columns are along axis 1.

Here are a couple of examples of dropping rows or columns with an explicit axis.  Note that the case of a column name must match, and you need to specify `axis=1` to drop columns instead of rows.
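
As a side-by-side sketch, the `axis` argument and the `index=` / `columns=` keywords of `.drop()` give the same results. The dataframe here is made up for illustration.

```python
import pandas as pd

small_df = pd.DataFrame({'salary': [100, 200, 300],
                         'years': [1, 2, 3],
                         'age': [30, 40, 50]})

# axis=0 refers to rows: drop the row whose index label is 1
no_row = small_df.drop(1, axis=0)

# axis=1 refers to columns: drop the column named 'years'
no_col = small_df.drop('years', axis=1)

# the index= and columns= keywords do the same thing without axis
same_row = small_df.drop(index=1)
same_col = small_df.drop(columns='years')

print(no_row.equals(same_row))
print(no_col.equals(same_col))
```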

In [74]:
working_df.head()

Unnamed: 0,salary,departm,years,age,publications
1,77125,bio,28.0,58.0,43
2,71922,bio,10.0,38.0,23
3,70499,bio,16.0,46.0,64
4,66624,bio,11.0,41.0,23
5,64451,bio,23.0,60.0,44


In [77]:
working_df.drop(3, axis=0)


Unnamed: 0,salary,departm,years,age,publications
1,77125,bio,28.0,58.0,43
2,71922,bio,10.0,38.0,23
4,66624,bio,11.0,41.0,23
5,64451,bio,23.0,60.0,44
6,64366,bio,23.0,53.0,22
...,...,...,...,...,...
72,53662,neuro,1.0,31.0,3
73,57185,stat,9.0,39.0,7
74,52254,stat,2.0,32.0,9
75,61885,math,23.0,60.0,9


Use a list of column names to drop more than one at a time:

In [78]:
working_df.drop(['years','age'], axis=1)

Unnamed: 0,salary,departm,publications
1,77125,bio,43
2,71922,bio,23
3,70499,bio,64
4,66624,bio,23
5,64451,bio,44
...,...,...,...
72,53662,neuro,3
73,57185,stat,7
74,52254,stat,9
75,61885,math,9


### Deleting rows with missing data

For many analyses you need to discard any records with missing data. 

For example: dropping any trial from an experiment where a subject didn't give a response before a timer deadline.  

The column coding what the response was might be "empty" or "missing" and you would want to use the `NaN` value to indicate it was missing.  

To totally delete rows with any missing value use the dataframe `dropna()` method:




In [88]:
# first make a dataframe from a dictionary with some missing values
df = pd.DataFrame({"age": [10,27,45,23,None], 
                   "salary": [0,23000,None,35000,60000]})
df

Unnamed: 0,age,salary
0,10.0,0.0
1,27.0,23000.0
2,45.0,
3,23.0,35000.0
4,,60000.0


As you can see, the salary value is missing for the row with index 2 (age 45) and the age value is missing for the row with index 4.  To drop any row with a missing value in any column:

In [83]:
df.head()

Unnamed: 0,age,salary
0,10.0,0.0
1,27.0,23000.0
2,45.0,
3,23.0,35000.0
4,,60000.0


In [81]:
df.dropna(inplace=True)
df

Unnamed: 0,age,salary
0,10.0,0.0
1,27.0,23000.0
3,23.0,35000.0


To drop only those rows missing data in one or some of the columns use the `dropna(subset=['column_name'])` syntax.

In [90]:
# drop rows missing data in the age or the salary column
df.dropna(subset= ['age','salary'], inplace=True)
df

Unnamed: 0,age,salary
0,10.0,0.0
1,27.0,23000.0
3,23.0,35000.0


There are several other tricks for dropping missing data besides this.  For example, you can delete rows with only a specific column value missing, etc...  However for now this should work for us.
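
For the curious, here is a sketch of two of those other options, the `how` and `thresh` arguments of `dropna()`, plus a way to count missing values. The `trials` dataframe and its column names are made up for illustration.

```python
import pandas as pd
import numpy as np

trials = pd.DataFrame({'response': [1, np.nan, 2, np.nan],
                       'rt': [1.3, np.nan, np.nan, 0.9]})

# how='all' drops a row only if every column is missing
all_missing_dropped = trials.dropna(how='all')
print(all_missing_dropped)

# thresh=2 keeps only rows with at least 2 non-missing values
complete_enough = trials.dropna(thresh=2)
print(complete_enough)

# isna().sum() counts the missing values in each column
missing_counts = trials.isna().sum()
print(missing_counts)
```

Neither call uses `inplace=True`, so `trials` itself still has all four rows afterwards.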

## Replacing values

In addition to changing column names, there are many situations where you might want to replace values in the dataframe.

Maybe you have abbreviations that aren't informative, maybe you want to convert string labels into numeric codes or vice versa, and so on.

The simplest way to do it is to make a dictionary where the keys are the things you want to replace and the values are the things to replace with:

```replace_dict = {'old_thing': 'new_thing'}```

To replace every occurrence of 1 in a dataframe with 'yes' and 2 with 'no':

```replace_dict = {1: 'yes', 2: 'no'}```

In [91]:
# replace 'bio' with biology and stat with statistics
# in the salary_df
salary_df.head()

Unnamed: 0,salary,departm,years,age,publications
0,86285,bio,26.0,64.0,72
1,77125,bio,28.0,58.0,43
2,71922,bio,10.0,38.0,23
3,70499,bio,16.0,46.0,64
4,66624,bio,11.0,41.0,23


In [95]:
d = {'bio': 'biology', 'stat': 'statistics'}

salary_df.replace(d, inplace=True)

In [96]:
salary_df['departm'].unique()

array(['biology', 'chem', 'geol', 'neuro', 'statistics', 'physics',
       'math'], dtype=object)

To replace the values in one column, pass a nested dictionary where keys are column names and then the value is a dictionary of the old and new values just like in the previous example

For example, to change values of 0 to 'male' in only the gender column:

```replace_dict = {'gender': {0: 'male'}}```

In [99]:
# use replace_dict to change things only in the 'gender' column
# reload dataframe to recover the dropped gender column
salary_df = pd.read_csv('../pandas_data_1/salary.csv')

replace_dict = {'gender': {0: 'male'}}

salary_df.replace(replace_dict)

Unnamed: 0,salary,gender,departm,years,age,publications
0,86285,male,bio,26.0,64.0,72
1,77125,male,bio,28.0,58.0,43
2,71922,male,bio,10.0,38.0,23
3,70499,male,bio,16.0,46.0,64
4,66624,male,bio,11.0,41.0,23
...,...,...,...,...,...,...
72,53662,1,neuro,1.0,31.0,3
73,57185,1,stat,9.0,39.0,7
74,52254,1,stat,2.0,32.0,9
75,61885,1,math,23.0,60.0,9


In [102]:
# make a replacement dictionary that has one set
# of replacements for 'gender' column and another 
# for 'departm'

d1 = {'gender': {0: 'male', 1: 'female', 2: 'nonbinary'},
     'departm': {'bio':'biology', 'stat': 'statistics'}
    }


salary_df.replace(d1)

Unnamed: 0,salary,gender,departm,years,age,publications
0,86285,male,biology,26.0,64.0,72
1,77125,male,biology,28.0,58.0,43
2,71922,male,biology,10.0,38.0,23
3,70499,male,biology,16.0,46.0,64
4,66624,male,biology,11.0,41.0,23
...,...,...,...,...,...,...
72,53662,female,neuro,1.0,31.0,3
73,57185,female,statistics,9.0,39.0,7
74,52254,female,statistics,2.0,32.0,9
75,61885,female,math,23.0,60.0,9


## Combining dataframes

In a later section we will spend more time with **concatenating** and **merging** dataframes but we'll introduce the idea now.

You can easily combine the rows of two different dataframes into one like stacking dataframes together.  

This might be useful in psychology research for instance if each participant in your experiment had their own data file and you want to read each file into a dataframe and then combine them together to make one big dataframe with all the data from your experiment.


Use `pd.concat()` with input of a list of dataframes.
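
Here is a sketch of that per-participant workflow. To keep it self-contained it uses `io.StringIO` objects as stand-ins for real csv files; the participant IDs, column names, and values are all made up. With real data you would loop over file names instead.

```python
import io
import pandas as pd

# stand-ins for two participants' csv files; with real data these
# would be file names passed to pd.read_csv() instead
fake_files = {
    'sub-01': io.StringIO("stimulus,response\na.jpg,1\nl.jpg,2\n"),
    'sub-02': io.StringIO("stimulus,response\na.jpg,2\nl.jpg,2\n"),
}

frames = []
for sub_id, f in fake_files.items():
    one_df = pd.read_csv(f)
    # record which participant each row came from
    one_df['ID'] = sub_id
    frames.append(one_df)

# stack all of the participants' dataframes into one
all_data = pd.concat(frames)
print(all_data)
```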

In [103]:
# make two dataframes with same columns and different values
a = [10,27,45,23,21]
s = [0,23000,100000,35000,60000]
df1 = pd.DataFrame({'age': a, 'salary': s})

a = [23,21,43,16]
s = [23000,31000,35000,8000]
df2 = pd.DataFrame({'age': a, 'salary': s})



In [104]:
df1

Unnamed: 0,age,salary
0,10,0
1,27,23000
2,45,100000
3,23,35000
4,21,60000


In [105]:
df2

Unnamed: 0,age,salary
0,23,23000
1,21,31000
2,43,35000
3,16,8000


In [106]:
# combine them, stacked vertically using pd.concat(list_of_frames)
df_list = [df1, df2]

In [109]:
df_combined = pd.concat(df_list)
df_combined

Unnamed: 0,age,salary
0,10,0
1,27,23000
2,45,100000
3,23,35000
4,21,60000
0,23,23000
1,21,31000
2,43,35000
3,16,8000


This worked well because they have the same columns.  If they have different columns then the missing entries of either are filled in with `NaN` which is the code for "missing values" in pandas. We will revisit this in a minute. 

Note that in the df_combined output above the index (row labels) repeated: it goes from 0 to 4 and then from 0 to 3. Those are the row index values from the two dataframes that were concatenated. Reset them to start at zero like this:

In [112]:
df_combined = df_combined.reset_index()
df_combined

Unnamed: 0,index,age,salary
0,0,10,0
1,1,27,23000
2,2,45,100000
3,3,23,35000
4,4,21,60000
5,0,23,23000
6,1,21,31000
7,2,43,35000
8,3,16,8000


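By default df.reset_index() is not an inplace operation, so to preserve the change we need to assign the output to a variable or pass inplace=True as another input to df.reset_index(). It also keeps the old row labels in a new column called `index`; pass `drop=True` if you want to discard them instead.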
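
If you don't need the old row labels at all, here is a sketch of two ways to get a clean 0, 1, 2, ... index, using two small made-up dataframes: `reset_index(drop=True)` after concatenating, or `ignore_index=True` inside `pd.concat()` itself.

```python
import pandas as pd

df_a = pd.DataFrame({'age': [10, 27], 'salary': [0, 23000]})
df_b = pd.DataFrame({'age': [23, 21], 'salary': [23000, 31000]})

# drop=True throws away the old row labels instead of
# keeping them in a new 'index' column
stacked = pd.concat([df_a, df_b]).reset_index(drop=True)
print(stacked)

# ignore_index=True renumbers the rows during the concat itself
stacked2 = pd.concat([df_a, df_b], ignore_index=True)
print(stacked2)
```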

In [113]:
# combine them vertically
df_combined2 = pd.concat([df1, df2])
df_combined2


Unnamed: 0,age,salary
0,10,0
1,27,23000
2,45,100000
3,23,35000
4,21,60000
0,23,23000
1,21,31000
2,43,35000
3,16,8000


In [114]:
# reset the index in place and verify that the changes stuck:
df_combined2.reset_index(inplace=True)
df_combined2


Unnamed: 0,index,age,salary
0,0,10,0
1,1,27,23000
2,2,45,100000
3,3,23,35000
4,4,21,60000
5,0,23,23000
6,1,21,31000
7,2,43,35000
8,3,16,8000


### Concatenating dataframes that have different columns

You can also concatenate dataframes that have different columns and you'll simply get a NaN value where there was no  data:

In [115]:
df1 = pd.DataFrame({"age": [10,27,45,23,21], 
                    "salary": [0,23000,100000,35000,60000]})

df2 = pd.DataFrame({"age": [60,70,53,56,80], 
                    "height": [5.2,6.0,5.7,3.4,4.6]})


In [119]:
# combine df1 and df2 into a single frame using pd.concat()
ndf = pd.concat([df1, df2])
ndf

Unnamed: 0,age,salary,height
0,10,0.0,
1,27,23000.0,
2,45,100000.0,
3,23,35000.0,
4,21,60000.0,
0,60,,5.2
1,70,,6.0
2,53,,5.7
3,56,,3.4
4,80,,4.6


In the example above, df1 had a column for 'salary' that was not in df2, and df2 had 'height' which was not in df1, so the combined dataframe `ndf` has columns for salary and height with NaN for the missing values.

We will talk more about dealing with "missing" values shortly, but basically missing values in pandas allow for incomplete rows: you might not have information about every single field of a row and so you can use `NaN` (stands for Not-a-number in computer speak) to represent missing values.

<div class="alert alert-danger" role="alert">
  <strong>The dangers of recoding missing values!</strong> <br>
    Missing some measurements in a dataset is very common and how to handle those missing values is critically important. Whenever possible you want to encode the missing data with a value that cannot impact your analyses. So, for example, if we just put a zero everywhere we had no data and then computed the average of the age column in our data our calculation would be off. Using NaN (which is a special floating point value) can help alleviate this possibility.
</div>
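
A tiny sketch of the problem, with made-up ages: three known values and one missing value coded either as 0 or as NaN.

```python
import pandas as pd
import numpy as np

# three known ages and one missing, coded two different ways
coded_zero = pd.Series([20, 30, 40, 0])
coded_nan = pd.Series([20, 30, 40, np.nan])

print(coded_zero.mean())  # the 0 is treated as a real age
print(coded_nan.mean())   # NaN is skipped by default
```

The first mean comes out as 22.5 because the zero is averaged in, while the second is 30.0, the average of the three ages we actually know.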



## Adding columns

Adding a new column to a dataframe is easy. 

You just assign some values to a new column name.  

First we will create a dataframe with two random columns. `np.random.rand()` gives random numbers between 0 and 1 (uniformly sampled). The input is the number of samples you want.

In [125]:
np.random.rand(10)

array([0.83175855, 0.21937908, 0.44817212, 0.65290053, 0.94833599,
       0.36309193, 0.79916662, 0.66262115, 0.73327594, 0.92906925])

In [126]:
df = pd.DataFrame({"col_1": np.random.rand(10), 
                   "col_2": np.random.rand(10)})
df.head()

Unnamed: 0,col_1,col_2
0,0.922259,0.004777
1,0.5301,0.218629
2,0.35307,0.186518
3,0.979585,0.69702
4,0.520212,0.879456


Now we simply assign a new column `df['sum']` and define it to be the sum of two columns.

In [127]:
df['sum'] = df['col_1'] + df['col_2']
df

Unnamed: 0,col_1,col_2,sum
0,0.922259,0.004777,0.927035
1,0.5301,0.218629,0.748729
2,0.35307,0.186518,0.539588
3,0.979585,0.69702,1.676605
4,0.520212,0.879456,1.399668
5,0.996248,0.021721,1.017969
6,0.361529,0.835326,1.196855
7,0.809722,0.181465,0.991187
8,0.80288,0.89381,1.69669
9,0.324573,0.576887,0.90146


You can also define new columns to be a constant value:

In [128]:
df['constant2'] = 3
df

Unnamed: 0,col_1,col_2,sum,constant2
0,0.922259,0.004777,0.927035,3
1,0.5301,0.218629,0.748729,3
2,0.35307,0.186518,0.539588,3
3,0.979585,0.69702,1.676605,3
4,0.520212,0.879456,1.399668,3
5,0.996248,0.021721,1.017969,3
6,0.361529,0.835326,1.196855,3
7,0.809722,0.181465,0.991187,3
8,0.80288,0.89381,1.69669,3
9,0.324573,0.576887,0.90146,3


Column values can be strings as well

In [129]:
df['ID'] = 'sub-99'
df['date'] = 'jan-10-2022'
df

Unnamed: 0,col_1,col_2,sum,constant2,ID,date
0,0.922259,0.004777,0.927035,3,sub-99,jan-10-2022
1,0.5301,0.218629,0.748729,3,sub-99,jan-10-2022
2,0.35307,0.186518,0.539588,3,sub-99,jan-10-2022
3,0.979585,0.69702,1.676605,3,sub-99,jan-10-2022
4,0.520212,0.879456,1.399668,3,sub-99,jan-10-2022
5,0.996248,0.021721,1.017969,3,sub-99,jan-10-2022
6,0.361529,0.835326,1.196855,3,sub-99,jan-10-2022
7,0.809722,0.181465,0.991187,3,sub-99,jan-10-2022
8,0.80288,0.89381,1.69669,3,sub-99,jan-10-2022
9,0.324573,0.576887,0.90146,3,sub-99,jan-10-2022


There are of course some limitations and technicalities here but for the most part you just name a new column and define it as above. **The main thing is that the new column either needs to have as many values as there are rows in your dataframe or just a single value that will be expanded to fill the column.**
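
Here is a sketch of that rule in action on a made-up three-row dataframe: a single value works, a list with one entry per row works, and a list of the wrong length raises an error.

```python
import pandas as pd

small_df = pd.DataFrame({'col_1': [0.1, 0.2, 0.3]})

# a single value is expanded to fill every row
small_df['ID'] = 'sub-99'

# a list with one entry per row also works
small_df['trial'] = [1, 2, 3]

# a list with the wrong number of entries raises a ValueError
try:
    small_df['bad'] = [1, 2]
except ValueError as err:
    print('could not add column:', err)

print(small_df)
```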

## Things you can do with dataframes

The main goal of getting your data into a dataframe is that it enables several methods for manipulating your data in powerful ways.

### Sorting

Often it can help us understand a dataset better if we can sort the rows of the dataset according to the values in one or more columns.  For instance in the salary dataset we have been considering it is hard to know who is the highest and lowest paid faculty.  One approach would be to sort the values.

In [None]:
df = pd.read_csv('../pandas_data_1/salary.csv')
df.head()

We can sort this dataset in ascending order with:

In [None]:
df.sort_values('salary')

Now we can easily see from this output that 44,687 is the lowest salary and 112,800 is the highest.  `sort_values()` is **not** an inplace operation so the original dataframe is still unsorted and we have to store the sorted result in a new dataframe variable if we want to keep working with it.

In [None]:
df.head()  # still unsorted

We can sort the other way by adding an additional parameter telling pandas NOT to sort in ascending order (that is, to use descending order):

In [None]:
df.sort_values('salary', ascending=False)

And if you sort by two columns it will do them in order (so the first listed column is sorted first then the second):

In [None]:
df.sort_values(['salary','age'])

In this dataset it is mostly the same to sort by salary first then age because most people don't have the same salary so that already provides an order.  However if we do it the other way, i.e., age first then salary, it will order by people's age and then, for the people who are the same age, sort by salary.

In [None]:
df.sort_values(['age','salary'])

As you can see in this shortened output there are several people who are 32 in the database and their salaries are ordered from smallest to biggest.
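
The same tie-breaking behavior can be seen in a small sketch with made-up data (the `people` dataframe is only for illustration), where two people share an age. The last line also shows that `ascending=` accepts a list, one entry per sort column.

```python
import pandas as pd

people = pd.DataFrame({'age': [32, 45, 32, 28],
                       'salary': [60000, 90000, 52000, 48000]})

# sort by age first; ties in age are then ordered by salary
by_age = people.sort_values(['age', 'salary'])
print(by_age)

# the original dataframe keeps its original order
print(people)

# a list for ascending= sets the direction for each column separately:
# youngest first, but highest salary first within the same age
mixed = people.sort_values(['age', 'salary'], ascending=[True, False])
print(mixed)
```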

### Arithmetic

Perhaps one of the most useful features of dataframes (and spreadsheets) is the ability to create formulas that compute new values based on the rows and columns. 

For instance if you had a dataframe that had rows for students and each column was the grade on an assignment a common operation might be to compute the average grade as a new column. 

Let's take a look at a simple example of this and then discuss arithmetic operations in Pandas more generally.

In [130]:
grades_df = pd.DataFrame({'student':['001','002','003'], 
                          'assignment1': [90, 80, 70], 
                          'assignment2': [82,84,96], 
                          'assignment3': [89,75,89]})
grades_df

Unnamed: 0,student,assignment1,assignment2,assignment3
0,001,90,82,89
1,002,80,84,75
2,003,70,96,89


This is not necessarily the easiest way to **enter** this data (you might prefer to use a spreadsheet for that), but you could read in a csv to load the grades for instance.  Next you would want to create the average grade for each student.

In [131]:
grades_df['average']=(grades_df['assignment1']+
                      grades_df['assignment2']+
                      grades_df['assignment3'])/3
grades_df

Unnamed: 0,student,assignment1,assignment2,assignment3,average
0,001,90,82,89,87.0
1,002,80,84,75,79.666667
2,003,70,96,89,85.0


We added up the column for assignment 1, 2, and 3 and then divided by three.  

Then we assigned that resulting value to a new column called average.  

A very nice feature is that we did all the values at once rather than having to loop through each entry in the dataframe. 

This is called **broadcasting** which is a feature of many programming languages that automatically detects when you are doing arithmetic operations on collections of numbers and then does that operation for **each entry** rather than for just the first one.

We can also broadcast the addition of constant values to a column.  For instance to give all the students a five point bonus we could do this:

In [132]:
grades_df['average'] = grades_df['average']+5
grades_df

Unnamed: 0,student,assignment1,assignment2,assignment3,average
0,1,90,82,89,92.0
1,2,80,84,75,84.666667
2,3,70,96,89,90.0


Again, here it added 5 to *each entry* of the `average` column rather than just the first row.

Almost any mathematical expression can be built from the columns in this way.  You might also be interested in functions computed down the columns rather than across them; we will consider those in more detail in a later section.
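As one more illustration of building an expression from columns, the sketch below computes a weighted score. The weights are hypothetical (made up here so that the third assignment counts double), and the `weighted` column name is likewise just for this example. It also shows that the plain average can be computed across columns in one call with `mean(axis=1)`.

```python
import pandas as pd

grades = pd.DataFrame({'student': ['001', '002', '003'],
                       'assignment1': [90, 80, 70],
                       'assignment2': [82, 84, 96],
                       'assignment3': [89, 75, 89]})

# hypothetical weights, made up for illustration:
# the third assignment counts twice as much as the others
grades['weighted'] = (0.25 * grades['assignment1'] +
                      0.25 * grades['assignment2'] +
                      0.50 * grades['assignment3'])

# the unweighted average across the assignment columns, one value per row
cols = ['assignment1', 'assignment2', 'assignment3']
grades['average'] = grades[cols].mean(axis=1)

print(grades)
```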

#### Get the mean of some column using df['col_name'].mean()

In [136]:
# get the average across students for assignment 1
a1mean = grades_df['assignment1'].mean()
a1mean

80.0

In [137]:
a2mean = grades_df['assignment2'].mean()


In [138]:
a1mean>a2mean

False

#### Get the std of some column using df['col_name'].std()

In [139]:
grades_df['assignment2'].std()

7.571877794400365

#### Get minimum and maximum values in a column using df['col_name'].min() and df['col_name'].max()

In [None]:
grades_df['assignment3'].max()

In [143]:
grades_df['assignment3'].min()

75
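One detail worth knowing about the `std()` result above: pandas computes the *sample* standard deviation by default (it divides by n-1, i.e. `ddof=1`), whereas NumPy's `np.std` divides by n by default. The sketch below uses the assignment 2 grades to show the difference; the variable names are made up for this illustration.

```python
import numpy as np
import pandas as pd

# the assignment 2 grades from the example above
a2 = pd.Series([82, 84, 96], name='assignment2')

sample_std = a2.std()               # pandas default: ddof=1 (divide by n-1)
population_std = a2.std(ddof=0)     # divide by n instead
numpy_std = np.std(a2.to_numpy())   # numpy default: ddof=0

print(sample_std)       # matches the value shown in the cell above
print(population_std)   # smaller, and equal to numpy_std
print(numpy_std)
```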

## Selecting

"Selecting" is grabbing subsets of a dataframe's rows conditioned on the **values** in some of the columns.  

This is used when we are interested in selecting rows that meet a particular criterion.  

For instance, in the professor salary dataset we read in, we might want to select only the rows that represent people whose age is greater than 50.

**Logical operations on columns** Selecting data based on conditions builds on the logical operations we already learned.

In [146]:
# save salary_df in df variable for use in next cells
df = salary_df

In [164]:
df['gender'].dtype

dtype('int64')

In [152]:
# check if the age column is greater than 50
# the return is a series of Boolean values
selector = df['age']>50
selector

0      True
1      True
2     False
3     False
4     False
      ...  
72    False
73    False
74    False
75     True
76    False
Name: age, Length: 77, dtype: bool

We get a column of `True`/`False` values which reflect a test, for each row, of whether the age value is greater than 50.  

If it is, then `True` is entered into the new column at that row position and if it isn't then `False` is entered in.

We can write more complex logical operations as well.  For instance:

In [153]:
(df['age']>50) & (df['age']<70)

0      True
1      True
2     False
3     False
4     False
      ...  
72    False
73    False
74    False
75     True
76    False
Name: age, Length: 77, dtype: bool

This expression does a logical `and` due to the `&` symbol and will be true for a particular dataframe row if the age is greater than 50 AND less than 70.
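Two practical details about `&` are easy to trip over, so here is a small sketch. It uses a hypothetical `people` dataframe with made-up ages (not the salary dataset). The parentheses around each comparison matter because `&` binds more tightly than `>` and `<`, and Python's `and` keyword cannot be used in place of `&` on a series.

```python
import pandas as pd

# hypothetical ages, made up for illustration (not the salary dataset)
people = pd.DataFrame({'age': [64, 58, 38, 46, 71]})

# each comparison is wrapped in parentheses because & binds
# more tightly than > and <
in_range = (people['age'] > 50) & (people['age'] < 70)
print(in_range.tolist())   # [True, True, False, False, False]

# Python's `and` keyword does not work element-wise on a series;
# pandas raises a ValueError because the truth value is ambiguous
try:
    (people['age'] > 50) and (people['age'] < 70)
except ValueError as err:
    print('ValueError:', err)
```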

The examples so far used a single column, but we can also make combinations using multiple columns.  For instance we could identify all the rows corresponding to professors that are under 50 and work in the 'bio' department.

In [154]:
(df['age']<50) & (df['departm']=='bio')

0     False
1     False
2      True
3      True
4      True
      ...  
72    False
73    False
74    False
75    False
76    False
Length: 77, dtype: bool

If you want to make an "or" you use the '|' (pipe) character instead of the '&'.

In [155]:
# check if each row is a person under age 50 and/or whether they
# have at least 10 years of work
(df['age']<50) | (df['years']>=10)

0     True
1     True
2     True
3     True
4     True
      ... 
72    True
73    True
74    True
75    True
76    True
Length: 77, dtype: bool
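When the printed output is truncated like this, it can be hard to tell how many rows actually passed the test. A minimal sketch, using a hypothetical `staff` dataframe with made-up values: because `True` counts as 1 and `False` as 0, calling `sum()` on the boolean series counts the matching rows.

```python
import pandas as pd

# hypothetical rows, made up for illustration (not the salary dataset)
staff = pd.DataFrame({'age':   [64, 38, 55, 31],
                      'years': [26, 4, 8, 1]})

# True if the person is under 50 OR has at least 10 years of work
young_or_experienced = (staff['age'] < 50) | (staff['years'] >= 10)
print(young_or_experienced.tolist())   # [True, True, False, True]

# True counts as 1 and False as 0, so sum() counts the matching rows
n_matching = young_or_experienced.sum()
print(n_matching)   # 3
```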

Now that we have this boolean column we can use it to select subsets of the original dataframe:

In [159]:
df[df['age']<35]

Unnamed: 0,salary,gender,departm,years,age,publications
22,47021,0,chem,4.0,34.0,12
23,44687,0,chem,4.0,34.0,19
52,53656,0,stat,2.0,32.0,4
56,72044,0,physics,2.0,32.0,16
70,55949,1,chem,4.0,34.0,12
72,53662,1,neuro,1.0,31.0,3
74,52254,1,stat,2.0,32.0,9
76,49542,1,math,3.0,33.0,5


The previous line selects all the professors under 35.  

The outer part is `df[]`, and inside the brackets we provide the boolean column/series we just discussed.  

You can break them into two steps if you like:

In [160]:
under35 = df['age']<35
df[under35]

Unnamed: 0,salary,gender,departm,years,age,publications
22,47021,0,chem,4.0,34.0,12
23,44687,0,chem,4.0,34.0,19
52,53656,0,stat,2.0,32.0,4
56,72044,0,physics,2.0,32.0,16
70,55949,1,chem,4.0,34.0,12
72,53662,1,neuro,1.0,31.0,3
74,52254,1,stat,2.0,32.0,9
76,49542,1,math,3.0,33.0,5


This makes clear that we define the "rule" for what things are true and false in this column (`under35`) and then use it to select rows from the original dataframe. Pandas will show all the rows where the selector is True and ignore the ones where it's False.
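The same two-step pattern works with combined conditions. The sketch below uses a hypothetical `toy` dataframe (values made up, with column names borrowed from the salary data) and also shows that the selected rows keep their original index labels, which is why the row numbers in the outputs above are not consecutive.

```python
import pandas as pd

# hypothetical stand-in for the salary data (values made up)
toy = pd.DataFrame({'departm': ['bio', 'chem', 'bio', 'bio'],
                    'age':     [38, 34, 61, 45]})

# step 1: build the rule (a boolean series, one entry per row)
young_bio = (toy['age'] < 50) & (toy['departm'] == 'bio')

# step 2: use the rule to keep only the rows where it is True
selected = toy[young_bio]
print(selected)

# the selected rows keep their original index labels
print(selected.index.tolist())   # [0, 3]
```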

You will use this a lot in data analysis. Often a data file from a single subject in an experiment has trials you want to skip or analyze, or you might want to select trials from particular subjects, or trials that meet a certain requirement (e.g., a reaction time that was too long).  It is important to bookmark this concept, and we will return to it several times throughout the semester.

## A common workflow

Let's put today's materials together and do the following common steps in data processing:

    1) read in the salary.csv dataset
        - pd.read_csv()

    2) rename the 'years' column to be 'years_of_work'
        - df.rename(columns=rename_dict)
        
    3) recode the 'gender' column to be 0: male, 1: female, 2: nonbinary
    
    4) add a column 'pubs_per_year' that is the number of 'publications' per 'years_of_work'
        - df['new_column'] = whatever_values_you_want_for_column
    
    5) subset the dataframe by selecting only those rows where the person is in the 'neuro' department
        - df[boolean_selector]
        
    6) save the dataframe back out with a new name
        - df.to_csv()

In [None]:
# step 3: recode gender; replace() returns a new dataframe and leaves df unchanged
df.replace({'gender': {0:'male', 1:'female', 2:'nonbinary'}})
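The six steps listed above can be sketched end to end as follows. This is one possible version, not the only way to do it. So that the cell runs on its own, it reads a few made-up rows from an in-memory string instead of the real `salary.csv`, and it writes to a temporary folder; the output file name `salary_neuro.csv` and the variable names are assumptions for this illustration.

```python
import io
import os
import tempfile
import pandas as pd

# stand-in for salary.csv: a few made-up rows with the same column names
csv_text = """salary,gender,departm,years,age,publications
80000,0,bio,20.0,60.0,50
52000,1,neuro,1.0,31.0,3
65000,0,neuro,14.0,50.0,35
46000,0,chem,4.0,34.0,12
"""

# 1) read in the data
#    (with the real file: pd.read_csv('../pandas_data_1/salary.csv'))
df = pd.read_csv(io.StringIO(csv_text))

# 2) rename the 'years' column to 'years_of_work'
df = df.rename(columns={'years': 'years_of_work'})

# 3) recode the 'gender' column; assign the result to keep the change
df = df.replace({'gender': {0: 'male', 1: 'female', 2: 'nonbinary'}})

# 4) add a column with publications per year of work (broadcast division)
df['pubs_per_year'] = df['publications'] / df['years_of_work']

# 5) select only the rows from the 'neuro' department
neuro_df = df[df['departm'] == 'neuro']

# 6) save the result under a new name (here in a temporary folder)
out_path = os.path.join(tempfile.mkdtemp(), 'salary_neuro.csv')
neuro_df.to_csv(out_path, index=False)

print(neuro_df)
```

Passing `index=False` to `to_csv()` leaves the row labels out of the saved file, which avoids an extra unnamed column when the file is read back in.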

### Save dataframes using the df.to_csv() function with input of a path and filename