## <center><b>Python for Data Science</b></center>
## <center><b>Lesson 26</b></center>
## <center><b>Pandas -- Part Two</b></center>
## <center><b>Pandas Series (Notes)</b></center>

![7.jpg](attachment:7.jpg)

<font size="6"><center>[Link: Pandas Documentation](https://pandas.pydata.org/docs/)</center></font>

##  <span style="color:red">TABLE OF CONTENTS</span>

1. [What Is A Pandas Series?](#1)<br>
2. [Creating a Pandas Series](#2)<br>
a. [Creating a Pandas Series from a List](#2a)<br>
b. [Creating a Pandas Series from a Dictionary](#2b)<br>
c. [Creating a Pandas Series from a NumPy Array](#2c)<br>
d. [Getting a Series out of a Pandas DataFrame](#2d)<br>
&emsp;● [Creating a DataFrame from a dictionary](#2di)<br>
&emsp;● [Getting a Series out of the Pandas DataFrame using dictionary syntax](#2dii)<br>
&emsp;● [Getting a Series out of the Pandas DataFrame using dot notation](#2diii)<br>
&emsp;● [Getting the Series by iterating through columns of a DataFrame](#2div)<br>
e. [Creating a Pandas Series from the Pandas read_csv() function / read_table() function](#2e)<br>
3. [Series Helper Functions](#3)<br>
4. [Iterating Over Series](#4)<br>
5. [Retrieving Elements from a Series](#5)<br>
a. [Retrieving elements with position](#5a)<br>
b. [Retrieving elements with index label](#5b)<br>
6. [Pandas Series Attributes](#6)<br>
a. [Values, Index, and Is_Unique](#6a)<br>
b. [Data Type, Size, Shape, and ndim](#6b)<br>
7. [Pandas Series Methods](#7)<br>
a. [Showing Rows ... Head() and Tail()](#7a)<br>
b. [Performing Aggregations](#7b)<br>
c. [Counting Values](#7c)<br>
d. [Sorting by values or index labels](#7d)<br>
e. [Working with missing values](#7e)<br>
f. [Searching values](#7f)<br>
g. [Logical operator methods](#7g)<br>
8. [Pandas Series and Working with Python Built-In Functions](#8)<br>


<div class="alert alert-block alert-warning">
    <b><font size="4">Files needed for this presentation:</font></b>
</div>

- [Employee_Attrition.csv](https://docs.google.com/spreadsheets/d/19WAkVGQTTgUpWiuvZFqAD4S3XVmkCHRh5iGH_Zcg6fY/edit?usp=sharing)
- [student_data.csv](https://drive.google.com/file/d/1t49j1Wvd0QBORZZhMKBmBMVnuoy68NYp/view?usp=share_link)

In [3]:
# set up notebook to display multiple output in one cell

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

print('The notebook is set up to display multiple output in one cell.')

The notebook is set up to display multiple output in one cell.


In [2]:
# Import libraries

import pandas as pd
import numpy as np

# The Data Structures / Objects Provided by Pandas

1. Pandas DataFrame (2-Dimensional)
2. <code style="background:yellow;color:black">Pandas Series (1-Dimensional)</code>
3. Pandas Index

<a class="anchor" id="1"></a>
# <span style="color:blue"><b>1. What Is A Pandas Series?</b></span>

<b>Technical Definition</b> 

A Pandas Series is a one-dimensional labeled array capable of holding any data type.

- The data can be of any type (integers, strings, floating-point numbers, Python objects, etc.).
- A Series has a single dtype, so its data is homogeneous; values of mixed types are stored with the generic object dtype.
- A Series always has an index, which gives it both ndarray-like and dict-like properties.
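
The "ndarray-like and dict-like" behavior can be sketched as follows. The `temps` Series and its values are made up for illustration only.

```python
import pandas as pd

# Hypothetical data: temperatures labelled by weekday
temps = pd.Series([33, 19, 15], index=['Mon', 'Tue', 'Wed'])

# dict-like: look up a value by label and test whether a label exists
tue_value = temps['Tue']
has_wed = 'Wed' in temps

# ndarray-like: vectorized arithmetic and positional slicing
doubled = temps * 2
first_two = temps.iloc[:2]

print(tue_value, has_wed)
print(doubled.tolist())
print(first_two.tolist())
```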

<b>How You Should Understand It</b>

You can think of a Pandas Series as a single column in an Excel (or Google Sheets) spreadsheet.

In terms of Pandas data structures, a Series represents a single column in memory, which is either independent or belongs to a Pandas DataFrame.

A Series is somewhat like a more powerful version of the Python list.

Pandas Series are the building blocks of a Pandas DataFrame.

![image.png](attachment:image.png)

<a class="anchor" id="2"></a>
# <span style="color:blue"><b>2. Creating a Pandas Series</b></span>

![4.jpg](attachment:4.jpg)

<a class="anchor" id="2a"></a>
## <span style="color:red"><b><i>a. Creating a Pandas Series from a List</i></b></span>

In [None]:
series_list = pd.Series(['Accounting','Finance', 'HR', 'IT','Marketing', 'Management', 'R&D'])
series_list

<b>NOTE</b>

- By default, the Series is given a row index of sequential integers starting from 0.

In [None]:
companies = ['Google', 'Microsoft', 'Facebook', 'Apple', 'Tesla', 'Amazon']
pd.Series(companies)

<b>NOTES</b>

- All values appear in exactly the same order as in the original Python list.

- The dtype is object, which is the Pandas dtype used here for strings.

- The labels shown on the left are the index. In this case, it resembles the positions of a Python list. But one of the key advantages of a Pandas Series is that the index labels do not have to be numeric; they can be any hashable type, such as strings.

<b>NOTE</b>

- We can use the index argument to specify a custom index:

In [None]:
# Pass a numeric custom index

companies = ['Google', 'Microsoft', 'Facebook', 'Apple', 'Tesla', 'Amazon']
companies_series = pd.Series(companies, index = [100,101,102,103,104,105])
companies_series
companies_series.index

In [None]:
# Pass a string custom index

companies = ['Google', 'Microsoft', 'Facebook', 'Apple', 'Tesla', 'Amazon']
companies_series = pd.Series(companies, index = ['GOOGL','MSFT','FB','AAPL', 'TSLA', 'AMZN'])
companies_series
companies_series.index

In [None]:
# define the data and index as lists

temperature = [33, 19, 15, 89, 11, -5, 9]                  # the list for the data
days = ['Mon','Tue','Wed','Thu','Fri','Sat','Sun']         # the list for the index

# create the series 
series_from_list = pd.Series(data = temperature, index = days)
series_from_list
series_from_list.index

In [None]:
# define the index as a NumPy array

cities = ['Brookfield', 'Waukesha', 'Elm Grove', 'Milwaukee', 'Wauwatosa', 'West Allis', 'Menomonee Falls', 'New Berlin']
array = np.arange(0,8)

cities_series = pd.Series(cities, index = array)
cities_series

<a class="anchor" id="2b"></a>
## <span style="color:red"><b><i>b. Creating a Pandas Series from a Dictionary</i></b></span>

- If we create a Series from a Python dictionary, each key becomes a row index label and each value becomes the value at that label.

In [None]:
# Creating a Series from a dictionary

# Avoid naming the variable `dict`, which would shadow the Python built-in
letters_dict = {'a': 10, 'b': 20, 'c': 30}
dict_series = pd.Series(letters_dict)
dict_series
dict_series.index

In [None]:
# Creating a Series from a dictionary

sample_dict = {'a' : [1,2,3], 'b': [4,5], 'c':6, 'd': "Hello World"}

series_from_dict = pd.Series(sample_dict)
series_from_dict
series_from_dict.index

In [None]:
# Creating a Series from a dictionary

companies = {
    'a': 'Google',
    'b': 'Microsoft',
    'c': 'Facebook',
    'd': 'Apple'
}

pd.Series(companies)
pd.Series(companies).index

<b>NOTE</b>

- If an index is also specified, only the values whose keys match the index labels are pulled out, in the order given by the index. A label with no matching key gets NaN.

In [None]:
# Creating a Series from a dictionary

companies = {
    'a': 'Google',
    'b': 'Microsoft',
    'c': 'Facebook',
    'd': 'Apple'
    }

pd.Series(companies, index=['a', 'b', 'd'])

In [None]:
our_dict = {'Mon': 33, 'Tue': 19, 'Wed': 15, 'Thu': 89, 'Fri': 11, 'Sat': -5, 'Sun': 9}
series_from_dict1 = pd.Series(our_dict)
series_from_dict1
series_from_dict1.index

print()

weekend = ['Sat', 'Sun']

series_from_dict2 = pd.Series(our_dict, index = weekend)
series_from_dict2
series_from_dict2.index
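
When the index contains a label that is not a key in the dictionary, Pandas fills that position with NaN. A minimal sketch, using a made-up `temps_by_day` dictionary:

```python
import pandas as pd

temps_by_day = {'Sat': -5, 'Sun': 9}

# 'Mon' is not a key in the dictionary, so its value becomes NaN
selected = pd.Series(temps_by_day, index=['Sat', 'Sun', 'Mon'])
print(selected)

# NaN is a float, so the integer values are upcast to float64
print(selected.dtype)
print(selected.isna().tolist())
```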

<a class="anchor" id="2c"></a>
## <span style="color:red"><b><i>c. Creating a Pandas Series from a NumPy Array</i></b></span>

In [None]:
# Series from a numpy array

my_array = np.linspace(0,12,10)
print(my_array)

print()

series_from_ndarray = pd.Series(my_array)
series_from_ndarray

<a class="anchor" id="2d"></a>
## <span style="color:red"><b><i>d. Getting a Series out of a Pandas DataFrame</i></b></span>

<a class="anchor" id="2di"></a>
### i. Creating a DataFrame from a dictionary

In [None]:
# Creating a DataFrame from a dictionary

my_dict = { 
'name' : ["Tom Smith", "Sara Jones", "Bob White", "Mary Johnson", "Paula Black"],
'age' : [50, 45, 38, 47, 52],
'designation': ["CEO", "VP", "SVP", "AM", "DEV"]
}

df = pd.DataFrame(my_dict, 
index = [
"First -> ",
"Second -> ", 
"Third -> ", 
"Fourth -> ", 
"Fifth -> "])

df

- A DataFrame provides two ways of accessing a column: dictionary syntax, df['column_name'], or dot notation, df.column_name.

- Each time we use either form to get a single column, we get a Pandas Series.

- In the example above, we can get a Pandas Series (i.e. a single column) just by accessing the column.

<a class="anchor" id="2dii"></a>
### ii. Getting a Series out of the Pandas DataFrame using dictionary syntax

In [None]:
# Getting a Series out of the Pandas DataFrame using dictionary syntax ... i.e. getting columns out of the Pandas DataFrame

name_series = df['name']
name_series

age_series = df['age']
age_series

designation_series = df['designation']
designation_series

<a class="anchor" id="2diii"></a>
### iii. Getting a Series out of the Pandas DataFrame using dot notation

In [None]:
# Getting a Series out of the Pandas DataFrame using dot notation ... i.e. getting columns out of the Pandas DataFrame

name_series = df.name
name_series

age_series = df.age
age_series

designation_series = df.designation
designation_series
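
Dot notation only works when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute or method. The sketch below uses a hypothetical `staff` DataFrame, built just for this illustration, to show why dictionary syntax is the safer default.

```python
import pandas as pd

# Hypothetical DataFrame whose column names are awkward for dot notation
staff = pd.DataFrame({
    'count': [3, 5],             # clashes with the DataFrame.count() method
    'job title': ['CEO', 'VP'],  # contains a space, so staff.job title is a syntax error
})

# Dictionary syntax always returns the column as a Series
count_column = staff['count']
title_column = staff['job title']

# Dot notation resolves to the method here, not to the column
dot_result = staff.count

print(type(count_column).__name__)
print(callable(dot_result))
```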

<a class="anchor" id="2div"></a>
### iv. Getting the Series by iterating through columns of a DataFrame

- What if we don’t know the names of the columns?

- The column names are available in df.columns, so we can loop over them and look up each column to get it as a Pandas Series.

In [None]:
# Get the columns of a DataFrame

df.columns

# Iterating through columns of a DataFrame to get Pandas Series

series_columns = []
for col_name in df.columns:
    series_columns.append(df[col_name])

series_columns 

In [None]:
# Alternate approach

df.columns

for col_name in df.columns:
    series_column = df[col_name]
    print(series_column)
    print()
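
As an alternative to looking up each column by name, `DataFrame.items()` yields the name and the Series together. A small sketch with a made-up `people` DataFrame:

```python
import pandas as pd

people = pd.DataFrame({
    'name': ['Tom Smith', 'Sara Jones'],
    'age': [50, 45],
})

# items() yields one (column name, Series) pair per column
columns_by_name = {}
for col_name, column in people.items():
    columns_by_name[col_name] = column
    print(col_name, type(column).__name__, column.tolist())
```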

<a class="anchor" id="2e"></a>
## <span style="color:red"><b><i>e. Creating a Pandas Series from the Pandas read_csv() function / read_table() function</i></b></span>

In [4]:
pd.read_table('Employee_Attrition.csv', sep= ';')

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1465,36,No,Travel_Frequently,884,Research & Development,23,2,Medical,1,2061,...,3,80,1,17,3,3,5,2,0,3
1466,39,No,Travel_Rarely,613,Research & Development,6,1,Medical,1,2062,...,1,80,1,9,5,3,7,7,1,7
1467,27,No,Travel_Rarely,155,Research & Development,4,3,Life Sciences,1,2064,...,2,80,1,6,0,3,6,2,0,3
1468,49,No,Travel_Frequently,1023,Sales,2,3,Medical,1,2065,...,4,80,0,17,3,2,9,6,0,8


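- If we want the data to be imported into a Series instead of a DataFrame, we can select a single column with usecols and then squeeze the result.

- Older versions of Pandas accepted squeeze=True in read_csv() / read_table() to convert a DataFrame of one column into a Series. That argument was deprecated in Pandas 1.4 and removed in 2.0. Calling .squeeze('columns') on the one-column DataFrame does the same thing and works in both old and new versions.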

In [5]:
department_df = pd.read_table('Employee_Attrition.csv', sep= ';', usecols=['Department'])
department_df
department_df.info()

Unnamed: 0,Department
0,Sales
1,Research & Development
2,Research & Development
3,Research & Development
4,Research & Development
...,...
1465,Research & Development
1466,Research & Development
1467,Research & Development
1468,Sales


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 1 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Department  1470 non-null   object
dtypes: object(1)
memory usage: 11.6+ KB


In [6]:
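# squeeze=True was removed from read_table() in pandas 2.0; squeeze the one-column DataFrame instead
department_series = pd.read_table('Employee_Attrition.csv', sep=';', usecols=['Department']).squeeze('columns')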
department_series
department_series.index

0                        Sales
1       Research & Development
2       Research & Development
3       Research & Development
4       Research & Development
                 ...          
1465    Research & Development
1466    Research & Development
1467    Research & Development
1468                     Sales
1469    Research & Development
Name: Department, Length: 1470, dtype: object

RangeIndex(start=0, stop=1470, step=1)
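
The same single-column read can be tried without the course file. The sketch below assumes a tiny in-memory CSV (the `csv_text` string is made up) and uses `DataFrame.squeeze('columns')`, which works in current Pandas versions where the `squeeze` keyword of the readers is no longer available.

```python
import io
import pandas as pd

# Hypothetical stand-in for a semicolon-separated file on disk
csv_text = "Age;Department\n41;Sales\n49;Research & Development\n37;Sales\n"

# Read only one column (still a DataFrame), then squeeze it into a Series
one_column_df = pd.read_csv(io.StringIO(csv_text), sep=';', usecols=['Department'])
department = one_column_df.squeeze('columns')

print(type(one_column_df).__name__)
print(type(department).__name__)
print(department.tolist())
```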

<a class="anchor" id="3"></a>
# <span style="color:blue"><b>3. Series Helper Functions</b></span>

In [None]:
# Create a series

our_dict = {'Mon': 33, 'Tue': 19, 'Wed': 33, 'Thu': 89, 'Fri': 19, 'Sat': -5, 'Sun': 9}
series_from_dict = pd.Series(our_dict)
series_from_dict

In [None]:
# Getting the mean of a Series
series_from_dict.mean()

# Getting the median of a Series
series_from_dict.median()

# Getting the standard deviation of a Series
series_from_dict.std()

# Getting the size of the Series
series_from_dict.size

# Getting all unique items in a series
series_from_dict.unique()

# Getting a python list out of a Series
series_from_dict.tolist()

<a class="anchor" id="4"></a>
# <span style="color:blue"><b>4. Iterating Over Series</b></span>

- Just like many other data structures in Python, a Series can be iterated over with a simple for loop. The loop yields the values, not the index labels.

In [None]:
series_from_dict

print()

for value in series_from_dict:
    print(value + 5)

- We can also iterate over the row index labels of a Series by using keys().

In [None]:
for row_index in series_from_dict.keys():
    print(row_index)
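
To get the label and the value in the same loop, `Series.items()` can be used, much like `dict.items()`. A short sketch with made-up data:

```python
import pandas as pd

temps = pd.Series({'Mon': 33, 'Tue': 19, 'Wed': 15})

# items() yields (index label, value) pairs
pairs = []
for day, temp in temps.items():
    pairs.append((day, temp))
    print(f'{day}: {temp}')
```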

<a class="anchor" id="5"></a>
# <span style="color:blue"><b>5. Retrieving Elements from a Series</b></span>

In [None]:
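# squeeze('columns') turns each one-column DataFrame into a Series
# (the squeeze=True argument was removed from read_table() in pandas 2.0)
department_series = pd.read_table('Employee_Attrition.csv', sep=';', usecols=['Department']).squeeze('columns')
department_series

age_series = pd.read_table('Employee_Attrition.csv', sep=';', usecols=['Age']).squeeze('columns')
age_series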

<a class="anchor" id="5a"></a>
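## <span style="color:red"><b><i>a. Retrieving elements with position</i></b></span>

- Pass an integer, or a slice of integers, in square brackets to retrieve elements. With the default RangeIndex used here, labels and positions coincide, so department_series[0] returns the first element.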

In [None]:
# retrieving a specific element

department_series[0]
age_series[100]

In [None]:
# retrieving the first n elements

department_series[:7]
age_series[:12]

In [None]:
# Retrieving the last n elements

department_series[-4:]
age_series[-8:]

In [None]:
# retrieving elements within a range

department_series[101:107]
age_series[563:570]

In [None]:
# retrieving elements by step

department_series[::100]
age_series[25:1471:125]
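
The examples above rely on the default index, where label 0 is also position 0. When the index holds other integers, plain brackets look up labels, and `iloc` is the explicit way to ask for a position. A sketch with a hypothetical `ages` Series:

```python
import pandas as pd

# Hypothetical Series whose integer labels do not match the positions
ages = pd.Series([41, 49, 37, 33], index=[100, 101, 102, 103])

first_by_position = ages.iloc[0]   # first element, whatever its label
by_label = ages[100]               # plain brackets look up the label 100
last_two = ages.iloc[-2:]          # positional slice

# 0 is not a label in this index, so a plain ages[0] fails
try:
    ages[0]
    zero_found = True
except KeyError:
    zero_found = False

print(first_by_position, by_label, last_two.tolist(), zero_found)
```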

<a class="anchor" id="5b"></a>
## <span style="color:red"><b><i>b. Retrieving elements with index label</i></b></span>

- A Pandas Series is a one-dimensional labeled array, so we can also access elements by index label.

In [None]:
# Create a series

companies = ['Google', 'Microsoft', 'Facebook', 'Apple', 'Tesla', 'Amazon']
companies_series = pd.Series(companies, index = ['GOOGL','MSFT','FB','AAPL', 'TSLA', 'AMZN'])
companies_series

In [None]:
# Retrieving a single element using an index label

companies_series['AAPL']

In [None]:
# Retrieving multiple elements using a list of index labels.

companies_series[['GOOGL', 'FB', 'TSLA']]
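
Label-based access can also be written explicitly with `loc`. One detail worth knowing is that a label slice includes both endpoints, unlike a positional slice. A small sketch (the Series is rebuilt here so the block runs on its own):

```python
import pandas as pd

companies = pd.Series(
    ['Google', 'Microsoft', 'Facebook', 'Apple'],
    index=['GOOGL', 'MSFT', 'FB', 'AAPL'],
)

single = companies.loc['MSFT']

# A label slice includes BOTH endpoints
label_slice = companies.loc['MSFT':'AAPL']

# A positional slice excludes the stop position
position_slice = companies.iloc[1:3]

print(single)
print(label_slice.tolist())
print(position_slice.tolist())
```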

<a class="anchor" id="6"></a>
# <span style="color:blue"><b>6. Pandas Series Attributes</b></span>

- Objects in Python have attributes and methods.

- Attributes let us look up information about an object without modifying it.

- Methods actually do something with the object: they may manipulate it, add values to it, or perform a calculation on the object’s values.

- A Pandas Series is just one type of Python object. In this section, we will cover some of the commonly used attributes of a Pandas Series.

<a class="anchor" id="6a"></a>
## <span style="color:red"><b><i>a. Values, Index, and Is_Unique</i></b></span>

- The <b>values attribute</b> returns an array of all the values within the series.

In [None]:
companies_series

print()

companies_series.values

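- The <b>index attribute</b> returns the Index object that holds the labels. For a Series with the default index, such as department_series, this is a RangeIndex.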

In [None]:
department_series

print()

department_series.values
department_series.index
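
The type of object returned by `index` depends on how the Series was built. A brief sketch with two small, made-up Series:

```python
import pandas as pd

default_index = pd.Series(['Sales', 'HR', 'IT']).index
custom_index = pd.Series(['Google', 'Apple'], index=['GOOGL', 'AAPL']).index

# The default index is a RangeIndex; custom labels are held in a general Index
print(type(default_index).__name__)
print(type(custom_index).__name__)
print(list(custom_index))
```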

- The <b>is_unique attribute</b> returns a boolean (True or False). 

- It is a really convenient way to check if every series value is unique or not.

In [None]:
department_series.is_unique
companies_series.is_unique

<a class="anchor" id="6b"></a>
## <span style="color:red"><b><i>b. Data Type, Size, Shape, and ndim</i></b></span>

- The <b>dtype attribute</b> returns the data type.

- When it gives us 'O', that is short for object.

In [None]:
companies_series.dtype
age_series.dtype

- The <b>size attribute</b> returns the number of items in a Series.

In [None]:
companies_series.size
age_series.size

- The <b>shape attribute</b> returns the dimensions as a tuple. Because a Series is one-dimensional, the tuple has a single element, the number of rows, for example (6,).

In [None]:
companies_series.shape
age_series.shape
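
To see how the shape of a Series differs from that of a DataFrame, the sketch below converts a small, made-up Series to a one-column DataFrame with `to_frame()`:

```python
import pandas as pd

ages = pd.Series([41, 49, 37])

series_shape = ages.shape            # one-element tuple: (rows,)
frame_shape = ages.to_frame().shape  # two-element tuple: (rows, columns)

print(series_shape)
print(frame_shape)
```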

- We also have the <b>ndim attribute</b>, which is short for the number of dimensions. A Series is always a one-dimensional object, so ndim is always 1.

In [None]:
companies_series.ndim
age_series.ndim

<a class="anchor" id="7"></a>
# <span style="color:blue"><b>7. Pandas Series Methods</b></span>

- As mentioned, a method actually does something with the object.

- It may manipulate it, add values to it, or perform a calculation on the object’s values.

<a class="anchor" id="7a"></a>
## <span style="color:red"><b><i>a. Showing Rows ... Head() and Tail()</i></b></span>

- The <b>head()</b> and <b>tail() methods</b> return the first and last n rows, respectively.

- n defaults to 5 if you don’t give any value. 

- They are useful for quickly verifying data, for example after sorting or appending rows.

In [None]:
companies_series.head()
age_series.tail()

In [None]:
companies_series.head(7)
age_series.tail(10)

In [None]:
# A negative n excludes rows: head(-2) drops the last 2, tail(-10) drops the first 10
companies_series.head(-2)
age_series.tail(-10)
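
Negative values of n are easier to follow on a tiny example. The sketch below uses a made-up Series of letters:

```python
import pandas as pd

letters = pd.Series(list('abcdef'))

all_but_last_two = letters.head(-2)    # same rows as letters.iloc[:-2]
all_but_first_two = letters.tail(-2)   # same rows as letters.iloc[2:]

print(all_but_last_two.tolist())
print(all_but_first_two.tolist())
```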

<a class="anchor" id="7b"></a>
## <span style="color:red"><b><i>b. Performing Aggregations</i></b></span>

- We can perform aggregation on a Series, such as <b>mean()</b>, <b>sum()</b>, <b>product()</b>, <b>max()</b>, <b>min()</b>, and <b>median()</b>.

In [None]:
age_series.mean()
age_series.median()
age_series.max()
age_series.min()
age_series.sum()

- If we need multiple aggregations, we can pass their names in a list to the <b>agg() method</b>.

In [None]:
age_series.agg(['mean', 'median', 'max', 'min'])
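
When `agg()` receives a list, the result is itself a Series indexed by the aggregation names, so individual results can be picked out by label. A sketch with made-up ages:

```python
import pandas as pd

ages = pd.Series([41, 49, 37, 33])

# The result is a Series whose index labels are the function names
summary = ages.agg(['mean', 'max', 'min'])

print(summary)
print(summary['max'])
```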

<a class="anchor" id="7c"></a>
## <span style="color:red"><b><i>c. Counting Values</i></b></span>

- The <b>unique()</b> and <b>nunique() methods</b> return the unique values and the number of unique values, respectively.

In [None]:
companies_series.unique()
companies_series.nunique()

department_series.unique()
department_series.nunique()

- The <b>value_counts() method</b> returns the number of occurrences of each unique value in a Series. 

- It is useful to get an overview of the distribution of values.

In [None]:
companies_series.value_counts()
department_series.value_counts()
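
For a view of the distribution as proportions rather than raw counts, `value_counts()` accepts `normalize=True`. A sketch with a small, made-up `departments` Series:

```python
import pandas as pd

departments = pd.Series(['Sales', 'R&D', 'R&D', 'HR'])

counts = departments.value_counts()                 # occurrences per value
shares = departments.value_counts(normalize=True)   # fraction of the total

print(counts)
print(shares)
```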

<a class="anchor" id="7d"></a>
## <span style="color:red"><b><i>d. Sorting by values or index labels</i></b></span>

- The <b>sort_values() method</b> sorts a Series by its values, in ascending or descending order.

In [None]:
# ascending by default

companies_series.sort_values()
department_series.sort_values()

In [None]:
# To sort it in descending order

companies_series.sort_values(ascending = False)
department_series.sort_values(ascending = False)

- When inplace=True, the Series is modified in place: the method returns None and the original object is now sorted.

- When inplace=False, which is the default, the original object is left unchanged and the method returns a sorted copy.

In [None]:
# To modify the original series

# companies_series.sort_values(inplace = True)
# department_series.sort_values(inplace = True)
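
The difference between the two settings of inplace can be checked directly. The sketch below uses a small, made-up `names` Series so that the lesson's own Series are left untouched:

```python
import pandas as pd

names = pd.Series(['Tesla', 'Apple', 'Google'])
original_order = names.tolist()

# Default (inplace=False): a sorted copy is returned, names is unchanged
sorted_copy = names.sort_values()
unchanged = names.tolist() == original_order

# inplace=True: names itself is sorted and the call returns None
result = names.sort_values(inplace=True)

print(sorted_copy.tolist(), unchanged)
print(result, names.tolist())
```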

- The <b>sort_index() method</b> sorts a Series by index label. It is similar to <b>sort_values()</b>.

In [None]:
# ascending by default
companies_series.sort_index()

# To sort it in descending order
companies_series.sort_index(ascending = False)

# To modify the original series (descending order)
companies_series.sort_index(ascending = False, inplace = True)
companies_series

# To modify the original series again (back to ascending order)
companies_series.sort_index(inplace = True)
companies_series

<a class="anchor" id="7e"></a>
## <span style="color:red"><b><i>e. Working with missing values</i></b></span>

- The <b>isna() method</b> returns a boolean Series of the same size, indicating which values are missing.

In [None]:
companies_series.isna()

- We can count the number of missing values by chaining the result with the <b>sum() method</b>.

In [None]:
companies_series.isna().sum()

- The <b>count() method</b> returns the number of non-missing values in a Series.
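
A quick sketch of how `count()`, `isna().sum()`, and `size` relate, using a made-up `states` Series that contains two missing values:

```python
import numpy as np
import pandas as pd

states = pd.Series(['WI', np.nan, 'IL', None])

non_missing = states.count()     # values that are present
missing = states.isna().sum()    # values that are missing
total = states.size              # all entries, missing or not

print(non_missing, missing, total)
```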

In [None]:
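# squeeze=True was removed from read_table() in pandas 2.0; squeeze the one-column DataFrame instead
student_series = pd.read_table('student_data.csv', sep=',', usecols=['State']).squeeze('columns')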
student_series

In [None]:
student_series.isna()

In [None]:
student_series.isna().sum()

<a class="anchor" id="7f"></a>
## <span style="color:red"><b><i>f. Searching values</i></b></span>

- The <b>nlargest()</b> and <b>nsmallest() methods</b> return the n largest and n smallest values in a Series.

- By default, they return 5 results if you don’t give any value.

In [None]:
age_series.nsmallest()
age_series.nlargest(12)

<a class="anchor" id="7g"></a>
## <span style="color:red"><b><i>g. Logical operator methods</i></b></span>

- <b>gt()</b> => greater than
- <b>ge()</b> => greater than or equal to
- <b>eq()</b> => equal to
- <b>le()</b> => less than or equal to
- <b>lt()</b> => less than
- <b>ne()</b> => not equal to

***

- These are equivalent to >, >=, ==, <=, < and != respectively, but they also accept a fill_value that is substituted for missing values before the comparison.

In [None]:
age_series.ge(30, fill_value=0)
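
The effect of fill_value only shows up when the Series has missing values. A minimal sketch, assuming a made-up `ages` Series with one NaN:

```python
import numpy as np
import pandas as pd

ages = pd.Series([25, np.nan, 40])

# With the plain operator, a comparison against NaN is False
plain = ages >= 30

# With fill_value, the missing entry is treated as 35 before comparing
filled = ages.ge(30, fill_value=35)

print(plain.tolist())
print(filled.tolist())
```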

<a class="anchor" id="8"></a>
# <span style="color:blue"><b>8. Pandas Series and Working with Python Built-In Functions</b></span>

- <b>len()</b> and <b>type()</b> are Python built-in functions that return the number of elements in an object and the type of an object, respectively.

In [None]:
len(companies_series)

type(age_series)
type(department_series)

- <b>dir()</b> is short for directory.

- If we pass a Series to it, it returns a list of all the available attributes and methods.

In [None]:
dir(companies_series)

- To convert a Series into a plain Python list of its values, we can use the <b>built-in list() function</b>.

- It does the opposite of what we did when we passed a list to Series(); the index labels are dropped.

In [None]:
list(companies_series)

- The Python <b>in keyword</b> returns a boolean that tells us whether the value we provide exists in a collection.

- It returns True if the value exists and False if it does not.

In [None]:
# Create a series

prices = [10, 5, 3, 2.5, 8, 11]
series = pd.Series(prices)
series

In [None]:

100 in series
2.5 in series
4 in series

- 2.5 in series returns False because, by default, Pandas looks among the index labels, not the actual values within the Series. For the same reason, 4 in series returns True: 4 is one of the index labels.

- To search the values instead, add the <b>values attribute</b>.

In [None]:
2.5 in series.values
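
The label-versus-value behavior of `in` can be summarized in one place. The sketch below rebuilds the prices Series under the name `prices` and also shows `isin()`, a Series method that tests each element for membership:

```python
import pandas as pd

prices = pd.Series([10, 5, 3, 2.5, 8, 11])

label_hit = 4 in prices               # 4 is an index label
label_miss = 2.5 in prices            # 2.5 is a value, not a label
value_hit = 2.5 in prices.values      # searches the stored values

# isin() returns a boolean Series, one entry per element
mask = prices.isin([2.5, 11])

print(label_hit, label_miss, value_hit)
print(mask.tolist())
```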