<a href="https://colab.research.google.com/github/SinghReena/MachineLearning/blob/master/3_Series_datastructure.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Series Datastructure


**Why Series?**

The series is one of the core data structures in pandas. 

A dataframe is the main data structure used in data science. For now, think of an Excel spreadsheet as a dataframe. Each column of the data is a Series object. When we manipulate a column of data (for cleaning, for feature extraction in machine learning, etc.), we are modifying a Series object. Manipulating a dataframe is similar to manipulating a Series object, but it affects multiple columns simultaneously. So learning manipulations on a one-dimensional Series object is fundamental to data science.

**What is a Series?**

You can think of a Series as a cross between a list and a dictionary. 
- The items are all stored in an order. So it is like a list.
- You can assign labels to the rows with which you can retrieve them. In this way, it is like a dictionary.

An easy way to visualize a Series is as two columns of data. The first is the special index, like the keys in a dictionary, while the second is your actual data, like the values in a dictionary. 

The data column has a label of its own, which can be retrieved using the `.name` attribute. This is different from dictionaries and is useful when it comes to merging multiple columns of data.
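
To make the two-column picture concrete, here is a minimal sketch. The student names, subjects, and the name `"subject"` are made-up illustration values, not data used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical data: the subject each student is enrolled in
subjects = pd.Series(
    ["Physics", "Chemistry", "Math"],
    index=["Alice", "Bob", "Charlie"],
    name="subject",
)

print(subjects.index.tolist())   # the labels, like dictionary keys
print(subjects.values.tolist())  # the data, like dictionary values
print(subjects.name)             # the label of the data column
```

The index and the values are stored separately, and the name travels with the Series when it becomes a column of a dataframe.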

We will cover **five topics** in Series datastructure.

1. Creating a new Series object 
  - Using lists
  - Using dictionaries
  - Explicit values and index
2. Types of the object stored in Series
  - Typecasting
  - `None` vs. `NaN`
3. Indexing into the Series object
4. Querying into a Series object
5. Operations on Series object: broadcasting.




All the entries of a Series are elements of the same type.

In [None]:
import pandas as pd


## Creating a Series from Lists

One of the easiest ways to create a Series is to pass in an array-like object, such as a list of values.


In [None]:

# Let us create a list of three students, Alice, Bob, and Charlie, all as strings
students = ['Alice','Bob','Charlie']

# Now we will call the series function in pandas and pass in the students
s = pd.Series(students)
print(s)

0      Alice
1        Bob
2    Charlie
dtype: object



The type of `s` is `Series`. The `dtype` shown above is `object`, which pandas uses for arbitrary Python objects such as strings.


In [None]:
type(s)

pandas.core.series.Series

In [None]:
# You can also create a little list of numbers
numbers = [1,4,8]
pd.Series(numbers)

0    1
1    4
2    8
dtype: int64

Each element of the Series is a 64-bit integer.

## Typecasting in Series

The elements of a Series are typecast to the "bigger" type, that is, the most general type that can represent every element.
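
As a rough sketch of this rule, the snippet below builds three small Series and prints the `dtype` pandas picks for each. The variable names are illustrative only.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])        # all integers
mixed_num = pd.Series([1, 2.5])    # an integer and a float
mixed_obj = pd.Series([1, "two"])  # an integer and a string

print(ints.dtype)       # an integer dtype
print(mixed_num.dtype)  # float64: the integer is upcast to a float
print(mixed_obj.dtype)  # object: no common numeric type exists
```

When no numeric type fits every element, pandas falls back to `object`, which can hold any Python value but loses the speed of numeric arrays.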

In [None]:
# python typecasting


In [None]:
pd.Series([1,2,3])

0    1
1    2
2    3
dtype: int64

What if we create a Series with an integer and a float?

Pandas will typecast the integer to a float.

In [None]:
pd.Series([1, 2.0])

0    1.0
1    2.0
dtype: float64

In [None]:
# 4.21 cannot be coerced into an int without losing information, so this raises a ValueError.
# By contrast, 2.0 can be coerced to an integer.
s = pd.Series([1, 4.21], dtype = int)

ValueError: ignored

### Use the `astype` method to convert types

In [None]:
# astype(int) is neither floor nor ceiling.
# It truncates toward zero, keeping only the integer part of the number.
s = pd.Series([1, 4.21, 5.9, -5.6])
s.astype(int)

In [None]:
import numpy as np
np.trunc(s)

### `None` type in python.

In Python, a lack of data can be represented by `None`. It is equivalent to null in other languages.

In databases, a null value is used to represent missing data. For example, if you have a phone column in the database and its value is null for a row, it can mean that 
  - the person does not have a phone.
  - the person has a phone, but we do not have a record of it.
  - the phone number is not relevant to this record.
  - we have not asked for it.

We use `None` to represent missing data in dataframes. For numerical values we use `NaN`.



In [None]:
def foo():
  return 

def bar():
  return 5

print(foo())
print(bar())

In [None]:
# length of a string returns an integer
len("hello")


In [None]:
# strip returns a string
s = "hello     ".strip()
print(s)

hello


In [None]:
l = ["a", 'b', 'c']
ret_val = l.remove('a')
print(ret_val)
print(l)

In [None]:
# In Python, we have the None type to indicate a lack of data. 
# In pandas, if we create a Series from a list of strings and one element
# is None, pandas stores it as None.
students = ['Don', 'Ken', None]
pd.Series(students)


### `NaN`: the None type for numerical values

If we create a list of numbers (integers or floats) and put in the None type,
pandas automatically converts it to a special floating-point value designated as NaN,
which stands for 'Not a Number'.

In [None]:
# Let us create a list with a None value in it
# In pandas, integers can be typecast to floats like we saw before

numbers=[1, 2, None] 

pd.Series(numbers)

### `NaN` is not `None`

NaN is similar to None, but it's a numeric value and treated differently for efficiency reasons.


In [None]:
import numpy as np

np.nan == None

In [None]:
np.nan == np.nan

In [None]:
None == None

In [None]:
# Instead, you need to use a special function to test for the presence of NaN.

np.isnan(np.nan)
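
Because equality tests do not work for `NaN`, a common approach is to use `isna`, which treats both `None` and `NaN` as missing. The following is a small sketch with made-up values.

```python
import numpy as np
import pandas as pd

names = pd.Series(["Don", "Ken", None])  # missing entry in text data
scores = pd.Series([1, 2, None])         # None becomes NaN, dtype float64

print(names.isna().tolist())   # [False, False, True]
print(scores.isna().tolist())  # [False, False, True]
print(scores.isna().sum())     # number of missing entries: 1
print(pd.isna(None), pd.isna(np.nan))  # True True
```

Unlike `np.isnan`, which expects numeric input, `pd.isna` and `Series.isna` also work on Series of strings.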

***

## Creating a Series with dictionaries

In [None]:
# Example using some data of students and their classes 

students_subjects = {'Alice':'Physics',
                'Bob': 'Chemistry',
                'Charlie':'Math'}
s= pd.Series(students_subjects)
s

In [None]:
# Get the index object using the index attribute
s.index

In [None]:
import numpy as np

a = np.array([1, 2, 3], dtype=float)
type(a)


In [None]:
type(s.index)

In [None]:
s2 = pd.Series(["a", "b", "c"])
print(s2)
s2.index

In [None]:
s3 = pd.Series(["a", "b","c"], index = [100, 200, 300])
s3.index


In [None]:
type(s2.index)

In [None]:
type(s3.index)

In [None]:
# Let's create a more complex type of data, a list of tuples.
students = [('Alice','yellow'),('Bob','Green'),('Charlie','Blue')]
s4 = pd.Series(students)
type(s4.index)

## Creating a Series with a specified index.

In [None]:
# You can also separate your index creation from the data by passing in the index
# explicitly as a list to the Series
s = pd.Series(['Physics', 'Chemistry', 'Biology'], index=['Alice', 'Bob', 'Charlie'])
s

## Index for a Series

In [None]:
s=pd.Series(['Physics','Chemistry','Biology'],index=['Alice','Bob', None])
s.index

In [None]:
s.index[1]

In [None]:
sh = pd.Series({'Alice':'Physics',
                'Bob': 'Chemistry',
                'Charlie':'Math'})

sh

We have explored the pandas Series data structure. You've seen how to create a Series from lists and dictionaries, how indices on data work, and the way that pandas typecasts data, including missing values.

**Exercise 1**

Create Series variables for four subjects and four grades for four students using their names as index.\
Hint:  How many series objects do you need? What is common for all these series objects?\
Try to create the series objects using different methods.

**Exercise 2**

Print the index for the series objects.  Are they `equal`?

## Querying a Series

In [None]:
# A pandas Series can be queried either by the index position or the index label.
# If you don't give an index to the Series, the position and the label are effectively the same values.
# To query by numeric location, starting at zero, use the iloc attribute. To query by the index label,
# use the loc attribute.

# Example 1. Students enrolled in classes using dictionary

import pandas as pd
import numpy as np
students_classes ={'Alice':'Physics', 'Charlie':'Social Science', 'Bob':'Math'}
s= pd.Series(students_classes)
s

### Query the series like a hash table

Search for index in the Series like searching for keys in a dictionary.

Use the index to find the value.

In [None]:
# Is the index present in the Series
'Bob' in s

True

In [None]:
'Gary' in s

In [None]:
# If the index is present, get the value.
s['Bob']

In [None]:
# Raises a KeyError, just like in a dictionary.
s['Gary']

In [None]:
# subsetting.
# Note that the labels in our list need not be in the same order as in the Series.
s[['Bob', 'Charlie','Alice'] ]

### Index is just like a list of values.

In [None]:
s.index

In [None]:

students_classes ={'Alice':'Physics', 'Charlie':'Social Science', 'Bob':'Math'}
m= pd.Series(students_classes)
m['Bob']

'Math'

In [None]:
m.index[1]

'Charlie'

In [None]:
m[   m.index[1]  ]  # == s['Charlie']

'Social Science'

In [None]:
# If you want to see the third entry, use the iloc
# attribute with the parameter 2
s.iloc[2]   # == s[s.index[2]]

In [None]:
# If you want to see what class Bob has, use the loc attribute with a parameter
# of 'Bob'.
s.loc['Bob']

Keep in mind that `iloc` and `loc` are not methods; they are attributes. So you don't use parentheses to query them. You use square brackets instead, which is called the indexing operator.
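
As a short sketch of the difference, the snippet below reads the same rows by position and by label. The name `enrolled` is illustrative and reuses the student data from the example above.

```python
import pandas as pd

enrolled = pd.Series({"Alice": "Physics", "Charlie": "Social Science", "Bob": "Math"})

print(enrolled.iloc[0])       # first row, by position
print(enrolled.loc["Alice"])  # same row, by label

# Slices differ: position slices exclude the end, label slices include it
print(enrolled.iloc[0:2].tolist())
print(enrolled.loc["Alice":"Charlie"].tolist())
```

Both slices return the first two rows here, even though the position slice stops before position 2 and the label slice stops at `'Charlie'`.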

In [None]:
# If you pass in a label, the indexing operator queries as if you had used the label-based loc attribute
s['Bob']

**Exercise**

```
s = pd.Series(['USA', 'UK', 'Belgium', 'Uganda'], index = [1001, 1002, 1003, 1004])
```

- What is the country in the first position?
- What is the country with code 1002?



### Querying with indices when indices are integers

If your index is a list of integers, it is a bit more complicated.\
pandas cannot determine automatically whether you intend to 
query by index position or index label.\
So you need to be careful when using the indexing operator on the Series itself. 
The safer option is to use the `iloc` and `loc` attributes.
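
A minimal sketch of the explicit approach, assuming a made-up Series whose integer labels do not start at zero:

```python
import pandas as pd

# Hypothetical integer labels that are not positions
codes = pd.Series(["red", "green", "blue"], index=[10, 20, 30])

print(codes.loc[10])     # by label    -> 'red'
print(codes.iloc[0])     # by position -> 'red'
print(codes.iloc[2])     # by position -> 'blue'
print(0 in codes.index)  # False: 0 is a position here, not a label
```

Writing `loc` or `iloc` every time makes the intent clear to the reader as well as to pandas.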


In [None]:

class_code ={100:'Science',101:'Math',102:'History',103:'Geography'}
s=pd.Series(class_code)

s.loc[100]  # label-based lookup works; s[0] would raise a KeyError because 0 is not a label


'Science'

## Manipulating all the elements of a Series

### Values in a Series are stored as a numpy array.

`five_numbers` is a Series object, but the values in the Series object are stored as a numpy array. We will see in the next section what this means for broadcasting, filtering, etc.

In [None]:
five_numbers = pd.Series([2, 3, 6, 8, 9])
five_numbers.values

array([2, 3, 6, 8, 9])

In [None]:
type(five_numbers)

pandas.core.series.Series

In [None]:
type(five_numbers.values)

numpy.ndarray

In [None]:
# Let's create a bigger series holding the numbers 0 to 99.
import numpy as np

numbers = pd.Series(np.arange(100))
print("len = ", len(numbers))
print("the first five elements are: ")
numbers.head()

len =  100
the first five elements are: 


0    0
1    1
2    2
3    3
4    4
dtype: int64

### Avoid explicit loops as much as possible.  Try to use pandas idioms.

In [None]:
num = pd.Series(np.random.randint(0,1000,10000))
num

0       169
1       421
2       996
3        22
4       665
       ... 
9995     24
9996    357
9997    836
9998    874
9999     45
Length: 10000, dtype: int64

In [None]:
%%timeit -n100
# The timeit cell magic runs our code several times to determine, on average, how long it takes.
# The magic must be the first line of the cell. The -n option sets the number of loops per run;
# if it is omitted, timeit chooses a suitable number automatically.

total = 0
for number in num:
    total+=number

total/len(num)

100 loops, best of 3: 1.23 ms per loop


In [None]:
# Timeit ran the code and it doesn't seem to take very long at all. 
# Now let's try with vectorization

In [None]:
%%timeit -n 100
total = np.sum(num)
total/len(num)

100 loops, best of 3: 88.1 µs per loop
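
The two approaches compute the same number; only the speed differs. The sketch below checks that on a seeded random sample. The names `sample`, `loop_mean`, and `vector_mean` are illustrative, and `Series.mean()` is used as one possible vectorized idiom.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the run is repeatable
sample = pd.Series(rng.integers(0, 1000, size=10_000))

# Explicit Python loop
total = 0
for value in sample:
    total += value
loop_mean = total / len(sample)

# Vectorized: one call, no Python-level loop
vector_mean = sample.mean()

print(loop_mean, vector_mean)
```

Timings vary by machine, so measure with `%%timeit` on your own data rather than relying on the figures shown above.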


In [None]:
total = np.sum(numbers)
total

4950

In [None]:
num =np.sum(numbers)
num


4950

### Filtering


In [None]:
five_numbers = pd.Series([2, 3, 6, 8, 9])
five_numbers

0    2
1    3
2    6
3    8
4    9
dtype: int64

In [None]:
# which indices have values > 3
five_numbers > 3


0    False
1    False
2     True
3     True
4     True
dtype: bool

This is like searching for values in a dictionary.  But no loops!

In [None]:
# give me the values that are > 3
five_numbers[five_numbers > 3]

# you can think of this as subsetting, if you knew the indices.
# example: five_numbers[[2, 3, 4]]

2    6
3    8
4    9
dtype: int64

In [None]:
# list only even numbers that are greater than 4
five_numbers = pd.Series([2, 3, 6, 8, 9])
print(five_numbers)
five_numbers[five_numbers %2 == 0 ][five_numbers > 4]

0    2
1    3
2    6
3    8
4    9
dtype: int64


2    6
3    8
dtype: int64

In [None]:
(five_numbers %2 == 0) & (five_numbers > 4)

0    False
1    False
2     True
3     True
4    False
dtype: bool

In [None]:
five_numbers[(five_numbers %2 == 0) & (five_numbers > 4)]

2    6
3    8
dtype: int64

In [None]:
students_classes = pd.Series({'Alice':'Physics', 'Charlie':'Social Science', 'Bob':'Math'})

# The == operator and the .eq() method are equivalent; only the last expression is displayed.
students_classes == 'Math'
students_classes.eq('Math')

Alice      False
Charlie    False
Bob         True
dtype: bool
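
Boolean masks like the ones above can also be combined with `|` (or) and inverted with `~` (not). This is a brief sketch on the same five numbers; the parentheses are needed because these operators bind more tightly than comparisons.

```python
import pandas as pd

values = pd.Series([2, 3, 6, 8, 9])

small_or_odd = values[(values < 3) | (values % 2 == 1)]
not_even = values[~(values % 2 == 0)]

print(small_or_odd.tolist())  # [2, 3, 9]
print(not_even.tolist())      # [3, 9]
```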

### Broadcasting

You can apply an operation to every value in the Series in just one step.
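
As a small sketch of broadcasting, the snippet below applies a scalar operation to a made-up Series called `prices`. Each operation returns a new Series and leaves the original unchanged.

```python
import pandas as pd

prices = pd.Series([10.0, 20.0, 30.0], index=["a", "b", "c"])

doubled = prices * 2              # the scalar is applied to every element
shifted = prices - prices.mean()  # the mean is broadcast as a scalar too

print(doubled.tolist())  # [20.0, 40.0, 60.0]
print(shifted.tolist())  # [-10.0, 0.0, 10.0]
print(prices.tolist())   # [10.0, 20.0, 30.0], original unchanged
```

In-place operators such as `+=` modify the Series itself instead of returning a new one.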


In [None]:
# For instance, if you want to increase every value by 2,
# you can do so quickly using the += operator directly on the Series object.

numbers.head()

0    0
1    1
2    2
3    3
4    4
dtype: int64

In [None]:
# Increase everything in the series by 2
numbers += 2
numbers.head()

0    2
1    3
2    4
3    5
4    6
dtype: int64

In [None]:
# You can get the [e^x for x in numbers]

np.exp(numbers)

0     7.389056e+00
1     2.008554e+01
2     5.459815e+01
3     1.484132e+02
4     4.034288e+02
          ...     
95    1.338335e+42
96    3.637971e+42
97    9.889030e+42
98    2.688117e+43
99    7.307060e+43
Length: 100, dtype: float64

In [None]:
numbers.values

array([  2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,  14,
        15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,  27,
        28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,  39,  40,
        41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,  52,  53,
        54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,  65,  66,
        67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,  78,  79,
        80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,  91,  92,
        93,  94,  95,  96,  97,  98,  99, 100, 101])

The procedural way of doing this would be to iterate through all of the items in the Series and increase the values directly. Pandas does support iterating through a Series like a dictionary, allowing you to unpack values easily. In particular, the `items()` method returns each label and value (older pandas versions also offered `iteritems()`, which was removed in pandas 2.0). 

In [None]:
# We can use the items() method, which returns a label and value

five_numbers = pd.Series(np.arange(5), index=['a', 'b', 'c', 'd', 'e'])
print(five_numbers)


for label, value in five_numbers.items():
  print("old value: ", label, value)
  # update the entry by assigning to its label
  five_numbers[label] = value + 2
  # now check the result of this computation
  print("new value: ", label, five_numbers[label])

#numbers.head()

a    0
b    1
c    2
d    3
e    4
dtype: int64
old value:  a 0
new value:  a 2
old value:  b 1
new value:  b 3
old value:  c 2
new value:  c 4
old value:  d 3
new value:  d 5
old value:  e 4
new value:  e 6


In [None]:
h  = {'a':1, 'b':2, 'c':3, 'd':4}

for k, v in h.items():
  print(k, v)

In [None]:
2 in h.values()

In [None]:
five_numbers['e'] 

In [None]:
3 in five_numbers.values

In [None]:
movies=pd.Series({100:"Cameron",200:"scorsese"})

In [None]:
np.where(movies.values == "Cameron")

In [None]:
movies.index[0]

In [None]:
movies

In [None]:
movies.index

In [None]:
np.where(five_numbers >= 10)

In [None]:
five_numbers.index[np.where(five_numbers >= 10)]

### Combining Series

Operations are easy when the Series have the same *indices*.

In [None]:
quiz1 = pd.Series([39, 55, 74], index = ['Alice', 'Charlie', 'Bob'])
quiz2 =  pd.Series([50, 75, 14], index = ['Alice', 'Charlie', 'Bob'])
quiz3 =  pd.Series([5, 7, 10], index = ['Alice', 'Charlie', 'Bob'])

quiz1 + quiz2 + quiz3

Alice       94
Charlie    137
Bob         98
dtype: int64
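
When the indices differ, pandas aligns the Series by label before operating, and any label missing from one side produces `NaN`. The sketch below uses made-up quiz scores; `fill_value=0` is one possible choice for treating a missed quiz as zero.

```python
import pandas as pd

quiz_a = pd.Series([39, 55, 74], index=["Alice", "Charlie", "Bob"])
quiz_b = pd.Series([50, 14], index=["Alice", "Bob"])  # Charlie missed this quiz

aligned = quiz_a + quiz_b                  # Charlie has no match -> NaN
filled = quiz_a.add(quiz_b, fill_value=0)  # treat the missing score as 0

print(aligned)
print(filled)
```

Alignment is by label and not by position, so the order of the rows in each Series does not matter.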

## More functions in Series.

The goal of this colab is to give a conceptual overview of the Series data structure.  This is not a comprehensive list of all the functions in the series.  Please see the documentation. 
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html

I will cover some more functions that might be of use more often:
- reindex
- drop
- fillna
- unique, nunique
- value_counts, count
- slicing
- nlargest and nsmallest
- to_list()
- isna
- series.agg(['mean', 'count'])
- concatenating two series (`pd.concat`)
- .le, .eq
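
Several of the functions in this list are not demonstrated in the cells that follow. Here is a hedged sketch of a few of them on a made-up `grades` Series with one missing value.

```python
import numpy as np
import pandas as pd

grades = pd.Series([88.0, np.nan, 72.0, 95.0],
                   index=["Alice", "Bob", "Charlie", "Dana"])

print(grades.isna().tolist())             # which entries are missing
print(grades.count())                     # number of non-missing entries
print(grades.fillna(0).tolist())          # replace NaN with 0
print(grades.drop("Bob").index.tolist())  # remove a row by label
print(grades.nlargest(2).index.tolist())  # labels of the two highest grades
print(grades.le(88).tolist())             # element-wise <=
print(grades.iloc[1:3].tolist())          # slicing by position
print(grades.reindex(["Dana", "Alice", "Eve"]))  # reorder; unknown label gives NaN
```

None of these calls modifies `grades`; each returns a new object.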



There are also some cheatsheets you can find on the web. For example:\
https://towardsdatascience.com/20-examples-to-master-pandas-series-bc4c68200324

In [None]:
print(numbers)
numbers.to_list()

0       2
1       3
2       4
3       5
4       6
     ... 
95     97
96     98
97     99
98    100
99    101
Length: 100, dtype: int64


[2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
 12, 13, 14, 15, 16, 17, 18, 19, 20, 21,
 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
 32, 33, 34, 35, 36, 37, 38, 39, 40, 41,
 42, 43, 44, 45, 46, 47, 48, 49, 50, 51,
 52, 53, 54, 55, 56, 57, 58, 59, 60, 61,
 62, 63, 64, 65, 66, 67, 68, 69, 70, 71,
 72, 73, 74, 75, 76, 77, 78, 79, 80, 81,
 82, 83, 84, 85, 86, 87, 88, 89, 90, 91,
 92, 93, 94, 95, 96, 97, 98, 99, 100, 101]

In [None]:
np.array([1,2,3])

array([1, 2, 3])

In [None]:
s = pd.Series([1, 1, 1, 3, 3, 3, 1, 2, 3, 2, 3])
s.unique()

array([1, 3, 2])

In [None]:
s.nunique()

3

In [None]:
s.value_counts()


3    5
1    4
2    2
dtype: int64

In [None]:
s.nsmallest(8)

0    1
1    1
2    1
6    1
7    2
9    2
3    3
4    3
dtype: int64

In [None]:
s.agg(["mean", "median"])

mean      2.090909
median    2.000000
dtype: float64

In [None]:
s2 = pd.Series([6, 7])

In [None]:
s3 = pd.concat([s, s2])  # Series.append was removed in pandas 2.0
s3

0     1
1     1
2     1
3     3
4     3
5     3
6     1
7     2
8     3
9     2
10    3
0     6
1     7
dtype: int64

In [None]:
s.index
