# Querying Series

In this session we will learn about querying a Series, iterating through Series elements, merging two Series objects together, and the importance of thinking about parallelisation when doing data science programming.

A pandas Series can be queried either by index position (the integer pandas assigns to each entry, starting from zero) or by index label. If the Series was created without explicit index labels, pandas uses the positions as labels, so the index position and the index label are effectively the same values. To query by numeric position, starting at zero, use the __iloc__ attribute. To query by index label, use the __loc__ attribute.

Note that querying here means accessing the values of the Series through index positions or index labels, rather than accessing the positions or labels themselves.

Let's see examples of how we can use them.

In [1]:
import pandas as pd

In [2]:
import numpy as np

In [3]:
sampleDictionary={"Ashish":"Maths","Choudhary":"Chemistry","Ashutosh":"Computer"}
newConstruct=pd.Series(sampleDictionary)
print(newConstruct)

Ashish           Maths
Choudhary    Chemistry
Ashutosh      Computer
dtype: object


In [4]:
#Let's now query with iloc.
newConstruct.iloc[2]

'Computer'

In [5]:
newConstruct.iloc[0]

'Maths'

In [6]:
#Let's now query with loc.
newConstruct.loc["Ashutosh"]

'Computer'

In [7]:
newConstruct.loc["Ashish"]

'Maths'

Try to remember it this way: __loc__ is for labels and __iloc__ is for integer positions.

Keep in mind that iloc and loc are not methods; they are attributes. So you query them with square brackets, called the indexing operator, rather than with parentheses.
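Because loc and iloc are used with the indexing operator, they accept more than a single key. The following is a minimal sketch, assuming a Series like the one above; the names `subjects`, `first_two_by_position`, `first_two_by_label` and `picked` are hypothetical and chosen only for illustration.

```python
import pandas as pd

subjects = pd.Series(
    {"Ashish": "Maths", "Choudhary": "Chemistry", "Ashutosh": "Computer"}
)

# A positional slice with iloc excludes the end position (here: positions 0 and 1).
first_two_by_position = subjects.iloc[0:2]

# A label slice with loc includes both end labels.
first_two_by_label = subjects.loc["Ashish":"Choudhary"]

# A list of labels returns the entries in the order requested.
picked = subjects.loc[["Ashutosh", "Ashish"]]

print(first_two_by_position)
print(first_two_by_label)
print(picked)
```

Each of these returns a new Series rather than a single value.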


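Pandas tries to make our code a bit more readable by providing a sort of smart syntax: the indexing operator can be used directly on the Series itself. For instance, if you pass in an integer and the index labels are not integers, the indexing operator behaves as if you had queried via the iloc attribute. Note that recent pandas releases (2.x) deprecate this positional fallback and emit a FutureWarning, so using iloc explicitly is the more future-proof choice.

Let's understand via an example.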

In [8]:
newConstruct[2]

'Computer'

In [9]:
newConstruct[0]

'Maths'


If you pass in an index label instead, the indexing operator queries as if you had used the label-based loc attribute.

Let's see an example to understand.

In [10]:
newConstruct["Ashutosh"]

'Computer'

In [11]:
newConstruct["Choudhary"]

'Chemistry'

So what happens if your index labels are themselves integers? This is a bit more complicated, because an integer key could be meant either as an index position or as an index label. When the index holds integers, the indexing operator treats the key as a label, which may not be what you intended. So you need to be careful when using the indexing operator on the Series itself. The safer option is to be more explicit and use the iloc or loc attributes directly.

Let's see a few examples.

In [12]:
sampleDictionary={99:"House",100:"Sex",98:"Food"}
newConstruct=pd.Series(sampleDictionary)
print(newConstruct)

99     House
100      Sex
98      Food
dtype: object


In [13]:
newConstruct[0]

KeyError: 0

As you can see, we get an error. Because the index labels are integers, the indexing operator treats 0 as a label rather than as a position, and since no entry is labelled 0, pandas raises a KeyError.
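Being explicit removes the ambiguity. The following is a minimal sketch, assuming a Series with integer labels similar to the one above; the name `itemsByCode` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical Series whose index labels are integers.
itemsByCode = pd.Series({99: "House", 100: "Garden", 98: "Food"})

# iloc always counts positions from zero, whatever the labels are.
first_by_position = itemsByCode.iloc[0]

# loc always looks up the label itself.
by_label = itemsByCode.loc[98]

print(first_by_position)
print(by_label)
```

Here `iloc[0]` returns the first entry, while `loc[98]` returns the entry labelled 98, which happens to be the last one.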

# Working With Queried Data

Now we know how to extract data from the Series data structure. Let's now see a few examples where we can use this data and perform operations.

###### Example 1:
Suppose we have the marks scored by a student in a Series. Our task is to find the average marks scored by the student.

In [14]:
marks=[90,74,100,45,67]
marksSeries=pd.Series(marks)

totalMark=0
totalSubject=0
for i in marksSeries:
    totalMark=totalMark+i
    totalSubject=totalSubject+1
average=totalMark/totalSubject
print(average)

75.2


This works, but a Python-level loop scales poorly as the Series grows.

Pandas and the underlying numpy library support a method of computation called vectorization. Vectorization works with most of the functions in the numpy library, including the sum function. Let's solve this question using the sum function of numpy.

In [15]:
length=len(marksSeries)
#np.sum accepts the Series directly and returns the sum of its elements,
#so no explicit loop is needed.
totalMarks=np.sum(marksSeries)
average=totalMarks/length
print(average)

75.2
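A Series also provides its own sum() and mean() methods, so the same average can be computed without calling numpy directly. This is a minimal sketch using the marks from the example above; the snake_case names are hypothetical.

```python
import pandas as pd

marks_series = pd.Series([90, 74, 100, 45, 67])

# Sum divided by the number of elements, using Series methods only.
average_from_sum = marks_series.sum() / len(marks_series)

# mean() performs the same computation in one call.
average_from_mean = marks_series.mean()

print(average_from_sum)
print(average_from_mean)
```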


Both code cells above do the same thing. To understand why we wrote the second version at all, we need to measure which one is faster.


Here, we're going to use what's called a cell magic function. Cell magics start with two percent signs and apply to the code in the current Jupyter cell. The one we're going to use is called timeit. It runs our code repeatedly to determine, on average, how long it takes.


The general syntax to be used is 

###### %%timeit -n < Number >
###### < code >

The number sets how many times timeit executes the code in each run; the reported time is the average per loop across the runs.

Let's apply it to our previous code and see how it works.

In [52]:
%%timeit -n 100

totalMark=0
totalSubject=0
for i in marks:
    totalMark=totalMark+i
    totalSubject=totalSubject+1
average=totalMark/totalSubject


886 ns ± 146 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [53]:
%%timeit -n 100

length=len(marks)
totalMarks=np.sum(marksSeries)
totalMarks/length

102 µs ± 4.78 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


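Look carefully at the units. The loop took about 886 ns, while the numpy version took about 102 µs, so on this five-element example the loop is actually more than a hundred times faster. Two things explain this: the loop cell iterates over the plain Python list marks rather than over the Series, and calling np.sum on a Series carries a fixed overhead that dominates when there are only a handful of elements. The benefit of vectorization shows up on large data, where a single call processes the whole array in compiled code instead of running a Python-level loop. That is why one should be aware of parallel computing features and start thinking in terms of operations on whole arrays. Put more simply, vectorization lets one operation be applied to many values at once, and with high performance chips, especially graphics cards, you can get dramatic speedups. Modern graphics cards can run thousands of instructions in parallel.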
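How the comparison changes with data size can be checked outside a notebook with Python's standard timeit module. The sketch below is only an illustration under stated assumptions: `large_marks`, `loop_average` and `vectorized_average` are hypothetical names, and the data is randomly generated rather than taken from the example above. No timings are shown because they depend on the machine.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical data: 200,000 random marks between 0 and 100.
rng = np.random.default_rng(seed=0)
large_marks = pd.Series(rng.integers(0, 101, size=200_000))


def loop_average(series):
    """Average computed one element at a time in a Python loop."""
    total = 0
    count = 0
    for value in series:
        total += value
        count += 1
    return total / count


def vectorized_average(series):
    """Average computed with a single vectorized numpy call."""
    return np.sum(series) / len(series)


runs = 3
loop_seconds = timeit.timeit(lambda: loop_average(large_marks), number=runs)
vectorized_seconds = timeit.timeit(lambda: vectorized_average(large_marks), number=runs)

print(f"loop:       {loop_seconds / runs:.6f} s per run")
print(f"vectorized: {vectorized_seconds / runs:.6f} s per run")
```

Changing `size` is a simple way to see where the fixed overhead of the numpy call stops mattering.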

###### Broadcasting
With broadcasting, you can apply an operation to every value in the Series at once. For instance, if we wanted to increase the marks scored in each subject by 2, we could do so quickly by adding 2 directly to the Series object (the += operator works as well).

Let's see that.

In [54]:
marksSeries=marksSeries+2
print(marksSeries)

0     92
1     76
2    102
3     47
4     69
dtype: int64


As you can see, every value has changed. The procedural way of doing this would be to iterate through all of the items in the Series and increase the values one by one.
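Broadcasting is not limited to addition. The following is a minimal sketch, assuming the same five marks as above before any increase; the names `bonus_marks` and `passed` and the threshold of 70 are hypothetical.

```python
import pandas as pd

marks_series = pd.Series([90, 74, 100, 45, 67])

# Arithmetic with a scalar returns a new Series; marks_series is unchanged here.
bonus_marks = marks_series + 2

# A comparison is broadcast too and gives a boolean Series.
passed = marks_series >= 70

# Augmented assignment applies the same operation to marks_series itself.
marks_series += 2

print(bonus_marks)
print(passed)
print(marks_series)
```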

Pandas does support iterating through a Series much like a dictionary, allowing you to unpack labels and values easily. For that we use a special method called items() (named iteritems() in older pandas versions; that name was removed in pandas 2.0). Let's see the syntax.

###### for < Index Variable >,< DataVariable > in < Series >.items():
###### < code >

Let's see how it's done.

In [63]:
marksDictionary={"English":65,"Math":100,"Science":74,"Hindi":80,"Computer":92}
marksSeries=pd.Series(marksDictionary)
print(marksSeries)

English      65
Math        100
Science      74
Hindi        80
Computer     92
dtype: int64


In [58]:
#Let's iterate through the series.
#items() replaces iteritems(), which was removed in pandas 2.0.
for subject,marks in marksSeries.items():
    print(subject)
    print(marks)

English
65
Math
100
Science
74
Hindi
80
Computer
92


Let's now try to update the values of the marks Series while iterating.

###### Example 1

In [64]:
#items() is the current name for iteritems().
for label,value in marksSeries.items():
    value=value+2
print(marksSeries)

English      65
Math        100
Science      74
Hindi        80
Computer     92
dtype: int64


It's important to observe that the code above does not update the values of the Series: value is only a local variable holding a copy of each element, so reassigning it leaves the Series untouched.

###### Example 2

In [65]:
#items() is the current name for iteritems().
for label,value in marksSeries.items():
    marksSeries.loc[label]=value+2
print(marksSeries)

English      67
Math        102
Science      76
Hindi        82
Computer     94
dtype: int64


As you can see, this time the changes are visible, because each new value is written back into the Series through loc.
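For comparison, the loop-based update and the broadcast version can be placed side by side. This is a minimal sketch, assuming the marks dictionary used above; the names `looped` and `vectorized` are hypothetical.

```python
import pandas as pd

marksDictionary = {"English": 65, "Math": 100, "Science": 74, "Hindi": 80, "Computer": 92}

# Procedural update: write each new value back through loc.
looped = pd.Series(marksDictionary)
for label, value in looped.items():
    looped.loc[label] = value + 2

# Broadcast update: one expression, no explicit loop.
vectorized = pd.Series(marksDictionary) + 2

# Both approaches end with the same labels and values.
print(looped.equals(vectorized))
```

The two results are identical; the broadcast form is simply shorter and leaves the looping to pandas.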

In [1]:
#The notes here are still not complete