# Pandas Series
The Series is a core data structure of pandas. We can think of it as a cross between a list and a dictionary. The items are all stored in order, and each item has an index label, much like a dictionary key, so data can be retrieved by label. A Series is also useful when it comes to merging columns of data.

In [28]:
import pandas as pd

To create a Series, we pass an array-like object, such as a list, to `pd.Series`. pandas generates the index automatically.


In [29]:
students = ['Alice', 'Mark', 'Jack']
pd.Series(students)

0    Alice
1     Mark
2     Jack
dtype: object

In [30]:
numbers = [1, 2, 3]
pd.Series(numbers)

0    1
1    2
2    3
dtype: int64

# None Value

In a list of strings, a `None` value is kept as a `None` object, and the Series has dtype `object`.

In [31]:
names=['Micheal', 'Mark', None]
pd.Series(names)

0    Micheal
1       Mark
2       None
dtype: object

# NaN (Not a Number)
In a list of numeric values (int or float), `None` is converted to NaN. NaN is a **floating-point number**, so all the **integers** are converted to floats as well. If a list of integers ends up with `dtype` float64, it's probably because there is missing data.

In [32]:
marks = [84, 94, None]
pd.Series(marks)

0    84.0
1    94.0
2     NaN
dtype: float64

In [33]:
numbers = [1, 2, None]
pd.Series(numbers)

0    1.0
1    2.0
2    NaN
dtype: float64

Data scientists use None and NaN to represent missing values for strings and numbers, respectively. But **NaN** is not equivalent to None: an equality test between them returns **False**. NaN does not even compare equal to itself.

In [34]:
import numpy as np
np.nan == None

False

In [35]:
np.nan == np.nan

False

In [36]:
# Instead of ==, use NumPy's isnan function to test for NaN
np.isnan(np.nan)

True
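
To check a whole Series for missing values, pandas offers `isna`, which treats both `None` and NaN as missing. The following is a minimal sketch; the Series contents are illustrative and reuse the sample values from the cells above.

```python
import numpy as np
import pandas as pd

# Illustrative data: one Series of strings and one of numbers, each with a gap
names = pd.Series(["Michael", "Mark", None])
marks = pd.Series([84, 94, None])

# isna flags missing entries element-wise, whether they are None or NaN
names_missing = pd.isna(names)
marks_missing = marks.isna()

print(names_missing.tolist())  # [False, False, True]
print(marks_missing.tolist())  # [False, False, True]

# It also works on scalars, where == gives misleading answers
print(pd.isna(None), pd.isna(np.nan))  # True True
```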

# Series implementation using Dictionary

A dictionary stores key-value pairs. When we convert it into a Series, the index is not a run of incremental integers; instead, the dictionary keys become the index labels.

In [37]:
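# Avoid naming the variable `dict`, which would shadow the built-in type
student_subjects = {"Alex": "Physics",
                    "John": "Math",
                    "Jack": "Chemistry"}
s = pd.Series(student_subjects)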
s

Alex      Physics
John         Math
Jack    Chemistry
dtype: object

In [38]:
s.index

Index(['Alex', 'John', 'Jack'], dtype='object')

# List of tuples as a Series

In [39]:
students=[("Alex", "Brown"), ("Christpher", "Logan"), ("Mack", "Author")]
pd.Series(students)


0          (Alex, Brown)
1    (Christpher, Logan)
2         (Mack, Author)
dtype: object

# Separate Index creation

In [40]:
s = pd.Series(["Physics", "Math", "Chemistry"], index =['Alex', 'Mark', 'Jenny'])
s


Alex       Physics
Mark          Math
Jenny    Chemistry
dtype: object

In [41]:
# Avoid naming the variable `dict`, which would shadow the built-in type
student_subjects = {"Alex": "Math",
                    "Jenny": "Physics",
                    "John": "Chemistry"}
s = pd.Series(student_subjects, index=["Alex", "Jenny", "Author"])
s

Alex         Math
Jenny     Physics
Author        NaN
dtype: object

When we pass an explicit index while converting a dictionary, any index label that doesn't exist in the dictionary gets a **NaN** value. Dictionary keys that are not in the index list, such as `John` here, are left out of the Series.

# Query with loc and iloc

`iloc` queries by numeric position. To see the value at a given position, use `iloc` with the integer position.

`loc` queries by label. To see the value for a key in the index, use `loc`.

**`iloc` and `loc`** are **attributes**, not methods, so we use square brackets **`[]`** rather than parentheses **`()`**. This is called the **indexing operator**.

**`iloc`**: **Integer value for index position**

**`loc`**: **Object value for index label**

In [42]:
import pandas as pd

In [43]:
students = {'Alice': 'Physics',
            'John': 'Chemistry',
            'Jenny': 'Math'}
s = pd.Series(students)
s
            

Alice      Physics
John     Chemistry
Jenny         Math
dtype: object

In [44]:
s.iloc[2]

'Math'

In [45]:
s.loc['Jenny']

'Math'
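
The difference between `loc` and `iloc` matters most when the index labels are themselves integers. The sketch below uses a hypothetical Series (`codes`, with made-up labels 10, 20, 30) to show that a label and a position are not the same thing.

```python
import pandas as pd

# Hypothetical Series whose labels are integers that do not start at 0
codes = pd.Series(["Physics", "Chemistry", "Math"], index=[10, 20, 30])

by_label = codes.loc[10]     # label 10 is the first item
by_position = codes.iloc[0]  # position 0 is also the first item
last_item = codes.iloc[-1]   # negative positions count from the end

print(by_label, by_position, last_item)  # Physics Physics Math

# There is no label 0, so loc raises KeyError
try:
    codes.loc[0]
except KeyError:
    print("no label 0")
```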

# For loop / Iteration with Series

In [46]:
grades = pd.Series([40,80,90,78,89, None])
total=0
for i in grades:
    total += i
print(total)

nan
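
The total above is `nan` because the `None` entry became NaN, and any arithmetic with NaN yields NaN. As a small sketch of two ways around this, the built-in `Series.sum()` skips NaN by default, and `dropna()` removes the missing entries before looping.

```python
import math
import pandas as pd

grades = pd.Series([40, 80, 90, 78, 89, None])

# Plain Python addition propagates NaN
loop_total = 0
for g in grades:
    loop_total += g

# Series.sum() skips NaN by default (skipna=True)
skip_total = grades.sum()

# Dropping missing values first makes the loop give the same answer
clean_total = 0
for g in grades.dropna():
    clean_total += g

print(loop_total)   # nan
print(skip_total)   # 377.0
print(clean_total)  # 377.0
```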


In [47]:
grades = pd.Series([40,80,90,78,89])
total=0
for i in grades:
    total += i
print(total)

377


In [48]:
grades = pd.Series([40,80,90,78,89])
total=0
for i in grades:
    total += i
    average = total/len(grades)

print(total)
print(average)

377
75.4


# Numpy Vectorization to compute sum/avg

In [49]:
import pandas as pd
import numpy as np
grades = pd.Series([40,80,90,78,89])

# Named `total` to avoid shadowing the built-in sum()
total = np.sum(grades)
avg = np.mean(grades)
print(total)
print(avg)


377
75.4


# Numpy Random Parameters and Timer for Performance

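**`np.random.randint(0, 1000, 10000)`** = **`np.random.randint(low, high, size)`**: `low` is the smallest value that can be drawn (inclusive), `high` is the upper bound (exclusive), and `size` is how many values to generate. Hard-coded numbers like these are known as magic values.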

In [50]:
numbers = pd.Series(np.random.randint(10, 1000, 10000))

numbers.head()


0    502
1    347
2    146
3    221
4    486
dtype: int64

In [51]:
len(numbers)

10000

`-n 100` makes `%%timeit` execute the cell 100 times in each timing run.

In [52]:
%%timeit -n 100  
total = 0
for i in numbers:
    total += i
    avg = total/len(numbers)
avg


8.99 ms ± 106 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [53]:
%%timeit -n 100

total = 0
for i in numbers:
    total+=i

total/len(numbers)
    

1.16 ms ± 48.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


# Parallelism when calculating the average using NumPy
Vectorization lets the computer apply one operation to many values at a time.

NumPy runs the loop in compiled code instead of the Python interpreter, which makes the task much faster. Modern GPUs can do thousands of tasks in parallel.

In [54]:
%%timeit -n 100
avg = np.sum(numbers)/len(numbers)
avg

67.6 µs ± 2.61 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
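
Outside Jupyter, where the `%%timeit` magic is not available, the same comparison can be sketched with the standard-library `timeit` module. The helper names `loop_mean` and `vectorized_mean` are illustrative, and the timings will differ from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

numbers = pd.Series(np.random.randint(10, 1000, 10000))

def loop_mean(values):
    """Average computed with a plain Python loop."""
    total = 0
    for v in values:
        total += v
    return total / len(values)

def vectorized_mean(values):
    """Average computed with NumPy's vectorized sum."""
    return np.sum(values) / len(values)

# Time 20 calls of each version
loop_time = timeit.timeit(lambda: loop_mean(numbers), number=20)
vec_time = timeit.timeit(lambda: vectorized_mean(numbers), number=20)
print(f"loop: {loop_time:.4f}s  vectorized: {vec_time:.4f}s")
```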


# Incremental Iteration
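Every item in a Series can be incremented by 2 in a single operation. Much like a dictionary, a Series also lets us unpack each label and value while iterating.
We can use **`iteritems()`** for that as well (removed in pandas 2.0 in favor of **`items()`**)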

In [55]:
# Series values before they are incremented by 2
numbers.head()

0    502
1    347
2    146
3    221
4    486
dtype: int64

In [56]:
numbers+=2
numbers.head()

0    504
1    349
2    148
3    223
4    488
dtype: int64

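**`.iteritems()`** returns each label and value as a pair. It was removed in pandas 2.0; **`.items()`** is the current name.

**`.set_value()`** changed a single value, for example to increment it by 2. It was removed in pandas 1.0.

Nowadays **`.at[]`** is the replacement for **`.set_value()`**. Like `loc`, it is an accessor used with square brackets.

In [57]:
# items() yields (label, value) pairs; .at[] sets one value by label
for label, value in numbers.items():
    numbers.at[label] = value + 2
print(numbers)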

0       506
1       351
2       150
3       225
4       490
       ... 
9995     57
9996    645
9997    856
9998    351
9999    909
Length: 10000, dtype: int64


In [58]:
%%timeit -n 10
num = pd.Series(np.random.randint(0,1000,1000))
for label, value in num.items():  # iteritems() was removed in pandas 2.0
    num.loc[label] = value+2


130 ms ± 2.84 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [59]:
%%timeit -n 10
num = pd.Series(np.random.randint(0,1000,1000))
num+2

225 µs ± 28.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
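
The two timed cells do the same work, which can be checked directly. This sketch assumes a recent pandas (it uses `items()` and `.at[]`) and compares the loop result against the broadcast result.

```python
import numpy as np
import pandas as pd

num = pd.Series(np.random.randint(0, 1000, 1000))

# Loop version, writing into a copy so the original stays unchanged
looped = num.copy()
for label, value in num.items():
    looped.at[label] = value + 2

# Broadcast version: one operation applied to every item
broadcast = num + 2

# equals() compares values, index, and dtype
print(looped.equals(broadcast))
```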


# Add new index in a Pandas Series

In [60]:
s = [1,2,4,3]
s=pd.Series(s)
s['history'] = 4
s

0          1
1          2
2          4
3          3
history    4
dtype: int64

# Different Index Value and Append

In [61]:
student_class =pd.Series ({"Alice": 'Physics',
                 "John": 'Math',
                 "Mark": 'Chemistry'})

student_class

Alice      Physics
John          Math
Mark     Chemistry
dtype: object

In [62]:
kelly_class = pd.Series(['Physics', 'Math', 'Chemistry'], index = ['Kelly', 'Kelly', 'Kelly'])
kelly_class

Kelly      Physics
Kelly         Math
Kelly    Chemistry
dtype: object

In [63]:
# Series.append was removed in pandas 2.0; pd.concat gives the same result
all_student_classes = pd.concat([student_class, kelly_class])
all_student_classes

Alice      Physics
John          Math
Mark     Chemistry
Kelly      Physics
Kelly         Math
Kelly    Chemistry
dtype: object

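**Note:** appending does not modify either original Series; it creates a whole new **Series**. The **`append`** method was removed in pandas 2.0, and `pd.concat` is its replacement.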
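
As a small sketch of both points, the originals keep their length after the combination, and querying a repeated label with `loc` returns a Series rather than a single value. `pd.concat` is used here as an assumption about the reader's pandas version.

```python
import pandas as pd

student_class = pd.Series({"Alice": "Physics",
                           "John": "Math",
                           "Mark": "Chemistry"})
kelly_class = pd.Series(["Physics", "Math", "Chemistry"],
                        index=["Kelly", "Kelly", "Kelly"])

all_student_classes = pd.concat([student_class, kelly_class])

# The originals are untouched; the combined Series is a new object
print(len(student_class))        # 3
print(len(all_student_classes))  # 6

# A unique label gives one value; a repeated label gives a Series
alice = all_student_classes.loc["Alice"]
kelly = all_student_classes.loc["Kelly"]
print(alice)          # Physics
print(kelly.tolist()) # ['Physics', 'Math', 'Chemistry']
```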

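# Boolean Mask
A boolean mask is an array of True/False values. It has one dimension for a Series and multiple dimensions for a DataFrame.

**`index_col=0`** = use the first column of the CSV (`Serial No.`) as the index instead of creating a new one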

In [64]:
import pandas as pd
df = pd.read_csv("resources/week-2/datasets/Admission_Predict.csv", index_col=0)
# strip surrounding whitespace and convert the column names to lower case
df.columns=[x.lower().strip() for x in df.columns]
df.head(5)


| Serial No. | gre score | toefl score | university rating | sop | lor | cgpa | research | chance of admit |
|---|---|---|---|---|---|---|---|---|
| 1 | 337 | 118 | 4 | 4.5 | 4.5 | 9.65 | 1 | 0.92 |
| 2 | 324 | 107 | 4 | 4.0 | 4.5 | 8.87 | 1 | 0.76 |
| 3 | 316 | 104 | 3 | 3.0 | 3.5 | 8.0 | 1 | 0.72 |
| 4 | 322 | 110 | 3 | 3.5 | 2.5 | 8.67 | 1 | 0.8 |
| 5 | 314 | 103 | 2 | 2.0 | 3.0 | 8.21 | 0 | 0.65 |


In [65]:
df['chance of admit'] >0.7

Serial No.
1       True
2       True
3       True
4       True
5      False
       ...  
396     True
397     True
398     True
399    False
400     True
Name: chance of admit, Length: 400, dtype: bool

Note: the comparison only returns a boolean Series, not the matching rows.
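
Because `True` counts as 1 and `False` as 0, a mask can also be summed or averaged. The sketch below stands in for the CSV with the first five `chance of admit` values from the table above, typed in by hand.

```python
import pandas as pd

# First five rows of the column shown above, as a stand-in for the CSV file
chance = pd.Series([0.92, 0.76, 0.72, 0.80, 0.65],
                   index=pd.Index([1, 2, 3, 4, 5], name="Serial No."),
                   name="chance of admit")

mask = chance > 0.7

print(mask.tolist())  # [True, True, True, True, False]
print(mask.sum())     # 4   (number of rows that pass)
print(mask.mean())    # 0.8 (share of rows that pass)
```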

# Use of where

`where` returns the full details of every row: it keeps the values where the condition on the column is True and fills the other rows with NaN.

In [66]:
df.where(df['chance of admit']>0.7).head()

| Serial No. | gre score | toefl score | university rating | sop | lor | cgpa | research | chance of admit |
|---|---|---|---|---|---|---|---|---|
| 1 | 337.0 | 118.0 | 4.0 | 4.5 | 4.5 | 9.65 | 1.0 | 0.92 |
| 2 | 324.0 | 107.0 | 4.0 | 4.0 | 4.5 | 8.87 | 1.0 | 0.76 |
| 3 | 316.0 | 104.0 | 3.0 | 3.0 | 3.5 | 8.0 | 1.0 | 0.72 |
| 4 | 322.0 | 110.0 | 3.0 | 3.5 | 2.5 | 8.67 | 1.0 | 0.8 |
| 5 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |


In [67]:
df.where(df['chance of admit']>0.7).dropna().head()

| Serial No. | gre score | toefl score | university rating | sop | lor | cgpa | research | chance of admit |
|---|---|---|---|---|---|---|---|---|
| 1 | 337.0 | 118.0 | 4.0 | 4.5 | 4.5 | 9.65 | 1.0 | 0.92 |
| 2 | 324.0 | 107.0 | 4.0 | 4.0 | 4.5 | 8.87 | 1.0 | 0.76 |
| 3 | 316.0 | 104.0 | 3.0 | 3.0 | 3.5 | 8.0 | 1.0 | 0.72 |
| 4 | 322.0 | 110.0 | 3.0 | 3.5 | 2.5 | 8.67 | 1.0 | 0.8 |
| 6 | 330.0 | 115.0 | 5.0 | 4.5 | 3.0 | 9.34 | 1.0 | 0.9 |


In [68]:
# where() and dropna() together: indexing with the boolean mask gives the same rows

In [71]:
df[df['chance of admit']>0.7].head()

| Serial No. | gre score | toefl score | university rating | sop | lor | cgpa | research | chance of admit |
|---|---|---|---|---|---|---|---|---|
| 1 | 337 | 118 | 4 | 4.5 | 4.5 | 9.65 | 1 | 0.92 |
| 2 | 324 | 107 | 4 | 4.0 | 4.5 | 8.87 | 1 | 0.76 |
| 3 | 316 | 104 | 3 | 3.0 | 3.5 | 8.0 | 1 | 0.72 |
| 4 | 322 | 110 | 3 | 3.5 | 2.5 | 8.67 | 1 | 0.8 |
| 6 | 330 | 115 | 5 | 4.5 | 3.0 | 9.34 | 1 | 0.9 |
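
The two approaches select the same rows but differ in dtype: `where` first inserts NaN, which turns integer columns into floats, while indexing with the mask leaves them as integers. A minimal sketch on a hand-typed sample (two columns from the first six rows shown above, standing in for the CSV):

```python
import pandas as pd

# Hand-typed stand-in for the CSV: two columns, first six rows
sample = pd.DataFrame(
    {"gre score": [337, 324, 316, 322, 314, 330],
     "chance of admit": [0.92, 0.76, 0.72, 0.80, 0.65, 0.90]},
    index=pd.Index([1, 2, 3, 4, 5, 6], name="Serial No."),
)
mask = sample["chance of admit"] > 0.7

via_where = sample.where(mask).dropna()  # NaN inserted, then dropped
via_index = sample[mask]                 # rows filtered directly

print(via_where["gre score"].dtype)  # float64
print(via_index["gre score"].dtype)  # int64
print(via_where.index.tolist())      # [1, 2, 3, 4, 6]
print(via_index.index.tolist())      # [1, 2, 3, 4, 6]
```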


In [70]:
df['chance of admit']

Serial No.
1      0.92
2      0.76
3      0.72
4      0.80
5      0.65
       ... 
396    0.82
397    0.84
398    0.91
399    0.67
400    0.95
Name: chance of admit, Length: 400, dtype: float64

In [72]:
df[['chance of admit', 'cgpa']].head(3)

| Serial No. | chance of admit | cgpa |
|---|---|---|
| 1 | 0.92 | 9.65 |
| 2 | 0.76 | 8.87 |
| 3 | 0.72 | 8.0 |


Note: each of these forms mimics the functionality of either `.loc[]` or `.where().dropna()`.

# Multiple Boolean Masks and Multiple Criteria

# AND `&`

In [75]:
df[df['chance of admit']>0.7] & df[df['chance of admit']]

KeyError: "None of [Float64Index([0.92, 0.76, 0.72,  0.8, 0.65,  0.9, 0.75, 0.68,  0.5, 0.45,\n              ...\n              0.64, 0.71, 0.84, 0.77, 0.89, 0.82, 0.84, 0.91, 0.67, 0.95],\n             dtype='float64', length=400)] are in the [columns]"
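
The `KeyError` comes from `df[df['chance of admit']]`: without a comparison, the float values of the column are used as column labels, and no such columns exist. The `&` operator is also meant to combine boolean masks, not filtered DataFrames. A minimal sketch of the usual pattern follows, on a hand-typed sample; the second criterion (`< 0.9`) is an assumption chosen for illustration, since the cell above does not state one.

```python
import pandas as pd

# Hand-typed stand-in for the CSV: first five rows of one column
sample = pd.DataFrame(
    {"chance of admit": [0.92, 0.76, 0.72, 0.80, 0.65]},
    index=pd.Index([1, 2, 3, 4, 5], name="Serial No."),
)

# Build each mask with a comparison, combine them with &, then index once.
# The parentheses are required because & binds more tightly than > and <.
both = (sample["chance of admit"] > 0.7) & (sample["chance of admit"] < 0.9)
selected = sample[both]

print(both.tolist())            # [False, True, True, True, False]
print(selected.index.tolist())  # [2, 3, 4]
```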