# Pandas Series

`pandas` is a Python package that provides data structures for working with relational and labeled data. It is designed to be efficient and intuitive.

The two main data structures in pandas are <b>Series</b> and <b>DataFrame</b>. A Series is a one-dimensional labeled array, while a DataFrame is a two-dimensional labeled table. This module focuses on Series.

Try writing the code that imports the `pandas` package and gives it the alias `pd`:

In [5]:
import pandas as pd

Note: as you are getting familiar with Python programming, I'll frequently leave code blocks empty (even when it is not an exercise) for you to fill in -- **there is a big difference between reading code and writing code yourself!**

First, we load the data set in <i>students.csv</i> that is in the current folder, store it in a DataFrame called <i>df</i>, and use the students' name column as the index for easy identification. We'll talk more about pandas in the next module.

In [6]:
df = pd.read_csv('students.csv', index_col='Name')
# Or select the index column by position: index_col=0
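
If you do not have <i>students.csv</i> at hand, the same call can be tried on an in-memory file. The sketch below is only an illustration: the CSV layout (a `Name` column followed by `hw1`, `hw2`, `program`) is an assumption inferred from the table displayed in this notebook, and `csv_text` is a made-up variable name holding the first five rows.

```python
import io
import pandas as pd

# Inline stand-in for students.csv (assumed layout, first five students only).
csv_text = """Name,hw1,hw2,program
Dorian,10,10,MSIS
Jeannine,6,7,MSIS
Iluminada,2,,MBA
Luci,7,7,MSIS
Jenny,8,,
"""

# io.StringIO lets read_csv treat a string as if it were a file.
df = pd.read_csv(io.StringIO(csv_text), index_col='Name')

print(df)
print(df.index.name)   # the Name column became the row index
```

Empty fields in the CSV text are read as missing values.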

In [7]:
df

| Name | hw1 | hw2 | program |
|---|---|---|---|
| Dorian | 10.0 | 10.0 | MSIS |
| Jeannine | 6.0 | 7.0 | MSIS |
| Iluminada | 2.0 | NaN | MBA |
| Luci | 7.0 | 7.0 | MSIS |
| Jenny | 8.0 | NaN | NaN |
| Demetria | 2.0 | 4.0 | MSIS |
| Michael | 6.0 | 10.0 | MBA |
| Garland | 9.0 | 1.0 | MSIS |
| Shelby | 1.0 | 10.0 | MSIS |
| Mercy | 5.0 | 6.0 | MSIS |


In [5]:
df.head()

| Name | hw1 | hw2 | program |
|---|---|---|---|
| Dorian | 10.0 | 10.0 | MSIS |
| Jeannine | 6.0 | 7.0 | MSIS |
| Iluminada | 2.0 | NaN | MBA |
| Luci | 7.0 | 7.0 | MSIS |
| Jenny | 8.0 | NaN | NaN |


Note that NaN indicates a missing value.

# Series

In this lecture, we will focus mostly on the column <i>hw1</i>. Let's make a Series of the hw1 scores.

A Series is a one-dimensional array of data (<b>values</b>) and an associated array of data labels (<b>index</b>). In this example, the <b>index</b> is the student name and the <b>value</b> is the score in hw1.
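
To make the values/index pairing concrete, here is a minimal sketch that builds a small Series by hand. The three students and scores are copied from the hw1 column; the variable name `s` is just an illustrative choice.

```python
import pandas as pd

# Three values with their labels, taken from the hw1 column.
s = pd.Series([10.0, 6.0, 2.0],
              index=['Dorian', 'Jeannine', 'Iluminada'],
              name='hw1')

print(s.index.tolist())   # the labels
print(s.values)           # the underlying NumPy array of values
print(s['Jeannine'])      # look up a value by its label
```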

You can access a column of <i>df</i> using either bracket notation **[' ']** or dot notation **.**

In [8]:
hw1 = df['hw1']
# Or equivalently:
# hw1 = df.hw1

In [7]:
hw1

Name
Dorian       10.0
Jeannine      6.0
Iluminada     2.0
Luci          7.0
Jenny         8.0
Demetria      2.0
Michael       6.0
Garland       9.0
Shelby        1.0
Mercy         5.0
John          NaN
Name: hw1, dtype: float64

Check the data type of hw1

In [8]:
type(hw1)

pandas.core.series.Series

## Properties of a Series: index and values

The `index` property returns the labels as an Index object, and the `values` property returns the data as a NumPy ndarray.

In [9]:
hw1.index

Index(['Dorian', 'Jeannine', 'Iluminada', 'Luci', 'Jenny', 'Demetria',
       'Michael', 'Garland', 'Shelby', 'Mercy', 'John'],
      dtype='object', name='Name')

In [10]:
hw1.values

array([10.,  6.,  2.,  7.,  8.,  2.,  6.,  9.,  1.,  5., nan])

The length of hw1, a.k.a., the number of elements:

In [11]:
len(hw1)

11

## Summary statistics using the describe() method

In [12]:
hw1.describe()

count    10.000000
mean      5.600000
std       3.098387
min       1.000000
25%       2.750000
50%       6.000000
75%       7.750000
max      10.000000
Name: hw1, dtype: float64

<div class="alert alert-block alert-info"> 
**Tech Note**: 
You can also use ***tab*** for **auto-completion**. Type the beginning of a function name, then press ***tab***. It will complete the function name or bring up a pop-up window with the matching choices.
</div>

<div class="alert alert-block alert-info"> 
**Tech Note**: To bring up the **on-line help** for a particular function, type the function name, then press ***shift-tab***. 
</div>

## Aggregate functions (max, min, mean, ...)

An aggregate function performs a calculation on a set of values, and returns a single value. `pandas.Series` offers several such aggregate functions.

The maximum grade among all students

In [13]:
hw1.max()

10.0

The minimum grade among all students

In [14]:
hw1.min()

1.0

The average grade among all students

In [15]:
hw1.mean()

5.6

<div class="alert alert-block alert-info"> 
**Tech Note**: To see which methods and attributes are available for an object (in this case **hw1**, a **Series**), type ***hw1.*** and then press ***tab***.
</div>

Exercise: read the above tech note, and find the function to calculate the median grade 

In [16]:
hw1.median()

6.0

Exercise: The sum of all grades

In [17]:
hw1.sum()

56.0
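
One detail worth seeing in code: the aggregates above skip the missing value (John's NaN) by default. The sketch below rebuilds hw1 by hand so that it runs without the CSV file; the `names` list is a helper introduced only for this example.

```python
import numpy as np
import pandas as pd

names = ['Dorian', 'Jeannine', 'Iluminada', 'Luci', 'Jenny', 'Demetria',
         'Michael', 'Garland', 'Shelby', 'Mercy', 'John']
hw1 = pd.Series([10, 6, 2, 7, 8, 2, 6, 9, 1, 5, np.nan],
                index=pd.Index(names, name='Name'), name='hw1')

print(len(hw1))                 # 11: every row, including the NaN
print(hw1.count())              # 10: only the non-missing values
print(hw1.mean())               # 5.6: NaN is skipped by default
print(hw1.mean(skipna=False))   # nan: one NaN makes the whole result NaN
```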

## Selection

## <i>.iloc[...]</i>: position-based selection 

Selects rows by integer position, counting from 0. It is like accessing the elements of a list, with one big difference: besides single positions and <b>slices</b>, we can also pass a list of positions.

#### Using one index value

Access the 4th value (position 3, because positions start at 0). It returns a single value.

In [18]:
hw1.iloc[3]

7.0

Exercise: access the last value.

In [19]:
hw1.iloc[-1]

nan

#### Using a list of positions or a slice

Retrieve multiple values: 1st, 2nd and 5th.

In [21]:
hw1.iloc[[0,1,4]]

Name
Dorian      10.0
Jeannine     6.0
Jenny        8.0
Name: hw1, dtype: float64

**Caution!**
+ The above code returns a Series object.
+ Selecting with a list of positions returns a new Series (a copy), so modifying the result does not change `hw1`.

<div class="alert alert-block alert-info"> 
**Tech Note** : Python uses the [ ] operator for both indexing and for constructing a list. The outer [  ] in hw1.iloc[[0,1,4]] is performing the indexing, and the inner is creating a list.
</div>

Retrieve all elements from the 3rd to the 7th (included). It returns a Series. <b>Caution!</b> The slice `2:7` below covers positions 2 through 6: the start position is included, but the end position 7 is excluded.

In [22]:
hw1.iloc[[2,3,4,5,6]]

Name
Iluminada    2.0
Luci         7.0
Jenny        8.0
Demetria     2.0
Michael      6.0
Name: hw1, dtype: float64

In [24]:
hw1.iloc[2:7]

Name
Iluminada    2.0
Luci         7.0
Jenny        8.0
Demetria     2.0
Michael      6.0
Name: hw1, dtype: float64
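
The two cells above select the same rows. A short sketch to check this, with hw1 rebuilt by hand so it runs on its own (`names`, `explicit`, and `sliced` are names made up for this example):

```python
import numpy as np
import pandas as pd

names = ['Dorian', 'Jeannine', 'Iluminada', 'Luci', 'Jenny', 'Demetria',
         'Michael', 'Garland', 'Shelby', 'Mercy', 'John']
hw1 = pd.Series([10, 6, 2, 7, 8, 2, 6, 9, 1, 5, np.nan],
                index=pd.Index(names, name='Name'), name='hw1')

explicit = hw1.iloc[[2, 3, 4, 5, 6]]   # list of positions
sliced = hw1.iloc[2:7]                 # slice: start included, end excluded

print(sliced.equals(explicit))   # True: same labels and same values
print(len(sliced))               # 5 rows, positions 2 through 6
```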

Exercise: Select the "second to last" student of the Series. Make sure to retrieve both the name and the grade.

In [25]:
hw1.iloc[[-2]]

Name
Mercy    5.0
Name: hw1, dtype: float64

## <i>s[...]</i>: label-based selection 

Selects rows using the index labels (a single label, a slice of labels, or a Boolean selection). It is like accessing a dictionary, with one big difference: we can also access the values using <b>slices</b> and <b>Boolean selection</b>.

#### Using a label value

Find Luci's hw1 grade.

In [26]:
hw1['Luci']

7.0

#### Using a slice of label values (rarely used)

Find the grades from Luci's to Michael's. Unlike positional slices, a slice of labels includes both endpoints.

In [28]:
hw1['Luci':'Michael']

Name
Luci        7.0
Jenny       8.0
Demetria    2.0
Michael     6.0
Name: hw1, dtype: float64
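
The difference between label slices and positional slices is easy to miss, so here is a small side-by-side sketch. hw1 is rebuilt by hand, and `by_label` / `by_position` are names chosen for this example only.

```python
import numpy as np
import pandas as pd

names = ['Dorian', 'Jeannine', 'Iluminada', 'Luci', 'Jenny', 'Demetria',
         'Michael', 'Garland', 'Shelby', 'Mercy', 'John']
hw1 = pd.Series([10, 6, 2, 7, 8, 2, 6, 9, 1, 5, np.nan],
                index=pd.Index(names, name='Name'), name='hw1')

by_label = hw1['Luci':'Michael']   # label slice: 'Michael' is included
by_position = hw1.iloc[3:6]        # positional slice: position 6 is excluded

print(by_label.index.tolist())      # ['Luci', 'Jenny', 'Demetria', 'Michael']
print(by_position.index.tolist())   # ['Luci', 'Jenny', 'Demetria']

# .loc is the explicit label-based accessor and gives the same result here.
print(hw1.loc['Luci':'Michael'].equals(by_label))   # True
```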

Exercise: What is Michael's hw1 score?

In [34]:
hw1['Michael']

6.0

Now change `[]` to `[[]]` in the code and observe how the output differs from above:

In [35]:
hw1[['Michael']]

Name
Michael    6.0
Name: hw1, dtype: float64
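
To spell out the difference between the last two cells: a single label returns just the value, while a list containing one label returns a one-row Series that keeps the name. A minimal sketch, with hw1 rebuilt by hand (`one_value` and `one_row` are illustrative names):

```python
import numpy as np
import pandas as pd

names = ['Dorian', 'Jeannine', 'Iluminada', 'Luci', 'Jenny', 'Demetria',
         'Michael', 'Garland', 'Shelby', 'Mercy', 'John']
hw1 = pd.Series([10, 6, 2, 7, 8, 2, 6, 9, 1, 5, np.nan],
                index=pd.Index(names, name='Name'), name='hw1')

one_value = hw1['Michael']     # a single number, the label is dropped
one_row = hw1[['Michael']]     # a Series with one row, the label is kept

print(one_value)
print(isinstance(one_row, pd.Series))   # True
print(one_row.index.tolist())           # ['Michael']
```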

## Boolean selection

The comparison operators `>`, `<`, `>=`, `<=`, `==`, and `!=` can be used to create a Series of Booleans that identifies the elements whose values satisfy a certain condition.

<b>Problem</b>: Find the students whose grade is greater than or equal to 6

First, create a boolean Series

In [30]:
hw1 >= 6

Name
Dorian        True
Jeannine      True
Iluminada    False
Luci          True
Jenny         True
Demetria     False
Michael       True
Garland       True
Shelby       False
Mercy        False
John         False
Name: hw1, dtype: bool

Second, select only those students who have a "True" in the boolean Series above

In [29]:
hw1[hw1>=6]

Name
Dorian      10.0
Jeannine     6.0
Luci         7.0
Jenny        8.0
Michael      6.0
Garland      9.0
Name: hw1, dtype: float64

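We can combine multiple conditions using `&` for AND and `|` for OR. Each condition must be wrapped in parentheses because `&` and `|` bind more tightly than the comparison operators. For example, select those students whose hw1 score is less than 5 or greater than 9: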


In [31]:
hw1[(hw1<5)|(hw1>9)]

Name
Dorian       10.0
Iluminada     2.0
Demetria      2.0
Shelby        1.0
Name: hw1, dtype: float64

In [32]:
hw1[(hw1>5)&(hw1<9)]

Name
Jeannine    6.0
Luci        7.0
Jenny       8.0
Michael     6.0
Name: hw1, dtype: float64
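
A few more things you can do with Boolean masks, sketched under the same hand-built hw1. Note how the missing value behaves: any comparison with NaN is False, so John shows up when the mask is negated with `~`. The use of `between` here is an optional shortcut, not something the exercises require.

```python
import numpy as np
import pandas as pd

names = ['Dorian', 'Jeannine', 'Iluminada', 'Luci', 'Jenny', 'Demetria',
         'Michael', 'Garland', 'Shelby', 'Mercy', 'John']
hw1 = pd.Series([10, 6, 2, 7, 8, 2, 6, 9, 1, 5, np.nan],
                index=pd.Index(names, name='Name'), name='hw1')

mask = hw1 >= 6
print(mask['John'])    # False: comparing NaN with anything gives False

# ~ negates the mask: rows where the condition is False, John included.
print(hw1[~mask])

# between(6, 8) is shorthand for (hw1 >= 6) & (hw1 <= 8).
print(hw1[hw1.between(6, 8)])
```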

Exercise: Compute the average hw1 grade among those students whose grade is less than or equal to 6


In [36]:
hw1[hw1<=6]

Name
Jeannine     6.0
Iluminada    2.0
Demetria     2.0
Michael      6.0
Shelby       1.0
Mercy        5.0
Name: hw1, dtype: float64

In [33]:
hw1[hw1<=6].mean()

3.6666666666666665

## More Series methods

### rank

Ranks each row based on its value (by default, low values get low rank numbers). It does **NOT** reorder the Series, and the rank number is **NOT** the original value.
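
As a quick illustration of `rank` with its default settings, here is a sketch on a hand-built copy of hw1. With the defaults, tied values share the average of the ranks they would occupy, and a missing value gets no rank.

```python
import numpy as np
import pandas as pd

names = ['Dorian', 'Jeannine', 'Iluminada', 'Luci', 'Jenny', 'Demetria',
         'Michael', 'Garland', 'Shelby', 'Mercy', 'John']
hw1 = pd.Series([10, 6, 2, 7, 8, 2, 6, 9, 1, 5, np.nan],
                index=pd.Index(names, name='Name'), name='hw1')

ranks = hw1.rank()        # ascending: the lowest grade gets rank 1
print(ranks)              # same row order as hw1, values replaced by ranks

print(ranks['Shelby'])    # 1.0: the lowest grade
print(ranks['Iluminada']) # 2.5: tied with Demetria for ranks 2 and 3
print(ranks['John'])      # nan: missing values are not ranked
```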

### idxmax and idxmin

Find the index label of the row with the maximum or minimum value


In [37]:
hw1.idxmax()

'Dorian'

In [39]:
hw1.idxmin()

'Shelby'
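
Since `idxmax` and `idxmin` return labels, the result can be fed straight back into label-based selection. A small sketch on a hand-built hw1 (`best` and `worst` are illustrative names):

```python
import numpy as np
import pandas as pd

names = ['Dorian', 'Jeannine', 'Iluminada', 'Luci', 'Jenny', 'Demetria',
         'Michael', 'Garland', 'Shelby', 'Mercy', 'John']
hw1 = pd.Series([10, 6, 2, 7, 8, 2, 6, 9, 1, 5, np.nan],
                index=pd.Index(names, name='Name'), name='hw1')

best = hw1.idxmax()     # label of the highest grade
worst = hw1.idxmin()    # label of the lowest grade (NaN is ignored)

print(best, hw1[best])      # Dorian 10.0
print(worst, hw1[worst])    # Shelby 1.0
```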

### sort_values

Sort by values


In [40]:
hw1.sort_values()

Name
Shelby        1.0
Iluminada     2.0
Demetria      2.0
Mercy         5.0
Jeannine      6.0
Michael       6.0
Luci          7.0
Jenny         8.0
Garland       9.0
Dorian       10.0
John          NaN
Name: hw1, dtype: float64

In [41]:
hw1.sort_values(ascending=False)

Name
Dorian       10.0
Garland       9.0
Jenny         8.0
Luci          7.0
Jeannine      6.0
Michael       6.0
Mercy         5.0
Iluminada     2.0
Demetria      2.0
Shelby        1.0
John          NaN
Name: hw1, dtype: float64

### sort_index

Sort by index

In [56]:
hw1.sort_index()

Name
Demetria      2.0
Dorian       10.0
Garland       9.0
Iluminada     2.0
Jeannine      6.0
Jenny         8.0
John          NaN
Luci          7.0
Mercy         5.0
Michael       6.0
Shelby        1.0
Name: hw1, dtype: float64

### nlargest and nsmallest

Finds the n items with largest or smallest value


In [44]:
hw1.nlargest(3)

Name
Dorian     10.0
Garland     9.0
Jenny       8.0
Name: hw1, dtype: float64

In [45]:
hw1.nsmallest(3)

Name
Shelby       1.0
Iluminada    2.0
Demetria     2.0
Name: hw1, dtype: float64

Find the student with the second-highest score: take the top 2 students, then retrieve the one with the smallest score among them.

In [47]:
hw1.nlargest(2).nsmallest(1)

Name
Garland    9.0
Name: hw1, dtype: float64

In [48]:
hw1.nlargest(3).nsmallest(2)

Name
Jenny      8.0
Garland    9.0
Name: hw1, dtype: float64
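
There is usually more than one way to get the same answer. The sketch below, on a hand-built hw1, compares `nlargest` with sorting and then selecting by position; the variable names are made up for the example.

```python
import numpy as np
import pandas as pd

names = ['Dorian', 'Jeannine', 'Iluminada', 'Luci', 'Jenny', 'Demetria',
         'Michael', 'Garland', 'Shelby', 'Mercy', 'John']
hw1 = pd.Series([10, 6, 2, 7, 8, 2, 6, 9, 1, 5, np.nan],
                index=pd.Index(names, name='Name'), name='hw1')

top3 = hw1.nlargest(3)
top3_sorted = hw1.sort_values(ascending=False).head(3)
print(top3.index.tolist())          # ['Dorian', 'Garland', 'Jenny']
print(top3_sorted.index.tolist())   # same three students

# Second highest: sort in descending order, then take position 1.
second = hw1.sort_values(ascending=False).iloc[[1]]
print(second)
```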

### head and tail

Returns the first (or last) rows according to the positional index


In [49]:
hw1.head(5)

Name
Dorian       10.0
Jeannine      6.0
Iluminada     2.0
Luci          7.0
Jenny         8.0
Name: hw1, dtype: float64
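
For completeness, a short sketch of `tail`, which mirrors `head` from the other end of the Series. hw1 is rebuilt by hand so the example runs on its own.

```python
import numpy as np
import pandas as pd

names = ['Dorian', 'Jeannine', 'Iluminada', 'Luci', 'Jenny', 'Demetria',
         'Michael', 'Garland', 'Shelby', 'Mercy', 'John']
hw1 = pd.Series([10, 6, 2, 7, 8, 2, 6, 9, 1, 5, np.nan],
                index=pd.Index(names, name='Name'), name='hw1')

print(hw1.tail(3))       # the last three rows: Shelby, Mercy, John
print(len(hw1.head()))   # 5: both methods default to five rows
```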

## Exercises

Explore the parameters of the method "rank" to solve this question. Find the rank of each student (1=best, 10=worst) and deal with ties in the way that makes most sense to you. *Hint:* use `ascending=False, method='min'`

In [53]:
hw1.rank(ascending=False, method='min').sort_values() 
# method='min': tied students all get the lowest (best) rank of their group, and the ranks they would otherwise occupy are skipped

Name
Dorian        1.0
Garland       2.0
Jenny         3.0
Luci          4.0
Jeannine      5.0
Michael       5.0
Mercy         7.0
Iluminada     8.0
Demetria      8.0
Shelby       10.0
John          NaN
Name: hw1, dtype: float64

Who got the 4th highest grade? Return both name and grade. (there are multiple ways to solve this)

In [52]:
hw1.nlargest(4).nsmallest(1)

Name
Luci    7.0
Name: hw1, dtype: float64

Retrieve the row of the person who comes last in alphabetical order.

In [61]:
hw1.sort_index().iloc[[-1]]

Name
Shelby    1.0
Name: hw1, dtype: float64

Retrieve the name only of the person who comes last in alphabetical order.

In [59]:
hw1.sort_index().index[-1]

'Shelby'

Retrieve the grade only of the person who comes last in alphabetical order.

In [62]:
hw1.sort_index().iloc[-1]

1.0

Among those whose name starts with ‘J’, who got the highest grade?

In [63]:
# Keep only the students whose name starts with 'J', then take the highest grade
hw1[hw1.index.str.startswith('J')].nlargest(1)

Name
Jenny    8.0
Name: hw1, dtype: float64

## Operations on one Series

### Operations between a scalar and a Series

Operations between a Series and a scalar (a single number) are performed element-wise on the values.

<b>Example</b>: It's Christmas time! As a gift, we want to increase everyone's grade by 5. What will the new grades be?

In [9]:
hw1+5

Name
Dorian       15.0
Jeannine     11.0
Iluminada     7.0
Luci         12.0
Jenny        13.0
Demetria      7.0
Michael      11.0
Garland      14.0
Shelby        6.0
Mercy        10.0
John          NaN
Name: hw1, dtype: float64

What if we wanted to multiply each grade by 2?

In [11]:
hw1*2

Name
Dorian       20.0
Jeannine     12.0
Iluminada     4.0
Luci         14.0
Jenny        16.0
Demetria      4.0
Michael      12.0
Garland      18.0
Shelby        2.0
Mercy        10.0
John          NaN
Name: hw1, dtype: float64
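
Keep in mind that these operations return a new Series and leave hw1 untouched. A minimal sketch on a hand-built hw1; `curved` is a name invented for this example.

```python
import numpy as np
import pandas as pd

names = ['Dorian', 'Jeannine', 'Iluminada', 'Luci', 'Jenny', 'Demetria',
         'Michael', 'Garland', 'Shelby', 'Mercy', 'John']
hw1 = pd.Series([10, 6, 2, 7, 8, 2, 6, 9, 1, 5, np.nan],
                index=pd.Index(names, name='Name'), name='hw1')

curved = hw1 + 5          # assign the result to keep it

print(curved['Dorian'])   # 15.0
print(hw1['Dorian'])      # 10.0: the original Series is unchanged
print(curved['John'])     # nan: a missing value stays missing
```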

### abs

Returns the absolute value of each element

In [12]:
hw1.abs()  # negative values become positive; these grades are all non-negative, so nothing changes

Name
Dorian       10.0
Jeannine      6.0
Iluminada     2.0
Luci          7.0
Jenny         8.0
Demetria      2.0
Michael       6.0
Garland       9.0
Shelby        1.0
Mercy         5.0
John          NaN
Name: hw1, dtype: float64

## Operations between two Series

Operations between two Series are performed element-wise on those elements with the same index label.

Let's create a Series of the hw2 grades. Remember that we have a dataframe object, <i>df</i>

In [13]:
hw2 = df['hw2']
hw2

Name
Dorian       10.0
Jeannine      7.0
Iluminada     NaN
Luci          7.0
Jenny         NaN
Demetria      4.0
Michael      10.0
Garland       1.0
Shelby       10.0
Mercy         6.0
John         10.0
Name: hw2, dtype: float64

The operation is executed between elements *with the same index label*. If either value is missing, the result is NaN. For example, let's add up the hw1 and hw2 grades.

In [14]:
hw1+hw2

Name
Dorian       20.0
Jeannine     13.0
Iluminada     NaN
Luci         14.0
Jenny         NaN
Demetria      6.0
Michael      16.0
Garland      10.0
Shelby       11.0
Mercy        11.0
John          NaN
dtype: float64
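
Two points about this alignment can be checked with a short sketch. First, matching is done by label, so the row order of the two Series does not matter. Second, if you would rather treat a missing grade as 0 than get NaN, the `add` method accepts a `fill_value`. Both Series are rebuilt by hand here, and treating a missing homework as 0 is only one possible grading policy, shown as an assumption.

```python
import numpy as np
import pandas as pd

names = ['Dorian', 'Jeannine', 'Iluminada', 'Luci', 'Jenny', 'Demetria',
         'Michael', 'Garland', 'Shelby', 'Mercy', 'John']
idx = pd.Index(names, name='Name')
hw1 = pd.Series([10, 6, 2, 7, 8, 2, 6, 9, 1, 5, np.nan], index=idx, name='hw1')
hw2 = pd.Series([10, 7, np.nan, 7, np.nan, 4, 10, 1, 10, 6, 10], index=idx, name='hw2')

# hw2 is reordered alphabetically, yet each student is still matched by name.
total = hw1 + hw2.sort_index()
print(total['Michael'])    # 16.0

# fill_value=0 replaces a missing grade with 0 before adding.
filled = hw1.add(hw2, fill_value=0)
print(filled['John'])      # 10.0 instead of NaN
```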

Compute each student's average grade across hw1 and hw2

In [15]:
(hw1+hw2)/2

Name
Dorian       10.0
Jeannine      6.5
Iluminada     NaN
Luci          7.0
Jenny         NaN
Demetria      3.0
Michael       8.0
Garland       5.0
Shelby        5.5
Mercy         5.5
John          NaN
dtype: float64

## Exercises

<p>The average grade of hw1 is too low. We want to normalize it to 8. To this end, do the following <b>in one single command</b>:
<ol>
<li>decrease everyone's grade by the average grade (this will set the new average to 0)</li>
<li>increase everyone's grade by 8</li>
</ol>
</p>

In [17]:
hw1_new = hw1-hw1.mean()+8
hw1_new

Name
Dorian       12.4
Jeannine      8.4
Iluminada     4.4
Luci          9.4
Jenny        10.4
Demetria      4.4
Michael       8.4
Garland      11.4
Shelby        3.4
Mercy         7.4
John          NaN
Name: hw1, dtype: float64

To verify, check that the new mean is 8 (up to floating-point rounding):

In [20]:
hw1_new.mean()

8.000000000000002

Compute the average of hw1 and hw2 for each student. Which student has the average closest to 6.7?


In [27]:
hw_a =(hw1+hw2)/2
hw_a

Name
Dorian       10.0
Jeannine      6.5
Iluminada     NaN
Luci          7.0
Jenny         NaN
Demetria      3.0
Michael       8.0
Garland       5.0
Shelby        5.5
Mercy         5.5
John          NaN
dtype: float64

In [28]:
(hw_a-6.7).abs().nsmallest(1)

Name
Jeannine    0.2
dtype: float64
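
The cell above returns the distance from 6.7, not the average itself. If you want the student's name and actual average, one alternative is `idxmin`, sketched below on hand-built copies of hw1 and hw2 (`closest` is an illustrative name).

```python
import numpy as np
import pandas as pd

names = ['Dorian', 'Jeannine', 'Iluminada', 'Luci', 'Jenny', 'Demetria',
         'Michael', 'Garland', 'Shelby', 'Mercy', 'John']
idx = pd.Index(names, name='Name')
hw1 = pd.Series([10, 6, 2, 7, 8, 2, 6, 9, 1, 5, np.nan], index=idx, name='hw1')
hw2 = pd.Series([10, 7, np.nan, 7, np.nan, 4, 10, 1, 10, 6, 10], index=idx, name='hw2')

hw_a = (hw1 + hw2) / 2

# idxmin returns the label of the smallest distance, skipping NaN rows.
closest = (hw_a - 6.7).abs().idxmin()
print(closest, hw_a[closest])   # Jeannine 6.5
```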