## Intro

### In this lesson, you will learn about...

- Pandas Series
- Attributes
- Binning values
- Summarizing a series
- Vectorized operation using a user-defined function

### By the end of this lesson, you should be able to...

- Create a new series
- Perform vectorized operations on a series
- Access attributes of a series
- Describe values of a series (.describe, .value_counts)
- Peek into the series (.head, .tail, .sample)
- Sort values (.sort_values, .sort_index)
- Test for values in the series (.isin, .any, .all)
- Perform string manipulation (.str)
- Apply a user-defined function to all items in a series (.apply)
- Bin continuous data to convert it to discrete (pd.cut)
- Plot series values (.plot)

### Agenda

1. About Pandas Series
2. Series Part 1
    - Create a Series
    - Vectorized Operations
    - Series Attributes: .index, .values, .dtype, .name, .size, .shape
    - Series Methods: .head, .tail, .sample, .astype, .value_counts, .describe, .nlargest, .nsmallest, .sort_values, .sort_index
3. Exercises, part I
4. Series Part II
    - Indexing and Subsetting
    - Series Attribute: .str
    - Series Methods: .any, .all, .isin, .apply
5. Exercises, part II
6. Series Part III
    - Binning
    - Plotting
7. Exercises, part III

## 1. About Pandas Series

A pandas Series object is a one-dimensional, labeled array made up of an index (autogenerated and starting at 0 unless you supply one) and data of a single data type.

A couple of important things to note about a Series:

- If I try to make a pandas Series using multiple data types, like int and string values, all of the data will be stored with the generic `object` data type; the Series will no longer be treated as a numeric (int) Series.
- A pandas Series can be created in several ways; we will look at a few of these ways below. However, **it will most often be created by selecting a single column from a pandas DataFrame, in which case the Series retains the same index as the DataFrame.** We will dive into this in the next two lessons: DataFrames and Advanced DataFrames.
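
To make the point about the retained index concrete, here is a minimal sketch. The DataFrame, its column names, and its index labels are made up for illustration and are not part of the lesson's dataset.

```python
import pandas as pd

# Hypothetical toy DataFrame with a non-default index (all names are illustrative).
df = pd.DataFrame(
    {"reaction": [249.56, 258.70, 250.80], "days": [0, 1, 2]},
    index=[101, 102, 103],
)

# Selecting a single column returns a Series.
reaction = df["reaction"]

print(type(reaction))
print(reaction.index.tolist())           # same labels as df.index
print(reaction.index.equals(df.index))   # the index came along with the column
```

The Series keeps the labels 101, 102, 103 rather than restarting at 0, which is what lets values be matched back to their rows in the DataFrame.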

---

Numpy vs. Pandas

- Numpy: Python library for representing n-dimensional arrays.
- Pandas: Python library, built upon Numpy, for representing series and dataframes which are tabular structures.

---

Series vs. Dataframes

- Series: a one-dimensional, labeled array. A series has row names but no column name.
- Dataframes: 2-d structures that represent datasets. Imagine a table with rows and columns. A dataframe has row names and column names.

---

Series vs. List

- A Series contains an index, which can be thought of as a row name (often a row number) and is a way to reference items. The index is stored with other meta-information (information about the Series).
- The elements of a Series are of a specific data type. The data type is inferred, but can be manually specified.
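
A short sketch of both differences, assuming arbitrary labels `'a'`, `'b'`, `'c'` chosen only for illustration:

```python
import pandas as pd

values = [2, 3, 5]

# A plain list can only be indexed by integer position.
first_from_list = values[0]

# A Series can carry custom labels and an explicitly specified data type.
labeled = pd.Series(values, index=["a", "b", "c"], dtype="float64")

print(labeled)
print(labeled["b"])    # look up by label instead of position
print(labeled.dtype)   # float64, because we asked for it
```

Without the `dtype` argument, pandas would infer `int64` from the list of whole numbers.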

## 2. Series Part I

- Create a Series
- Series data types
- Vectorized Operations
- Series Attributes: .index, .values, .dtype, .name, .size, .shape
- Series Methods: .head, .tail, .sample, .astype, .value_counts, .describe, .nlargest, .nsmallest, .sort_values, .sort_index

Import Pandas

`import pandas as pd`

In [1]:
import pandas as pd
import numpy as np
from pydataset import data

### Create a Series

In practice, a Series will most often be created by selecting a single column from a pandas DataFrame, in which case the Series retains the same index as the DataFrame. Below, we create a Series in four ways:

1. from a list
2. from a numpy array
3. from a dictionary
4. from a dataframe

From a List

In [2]:
my_list = [2, 3, 5]
type(my_list)

list

Using an index to access a value in a list is possible, but those indices are integers representing position and cannot be changed to a name, datetime, etc.

In [3]:
my_list[0]

2

Create a series from a list using `pd.Series(my_list)`, similar to how you would convert a list to an array with `np.array(my_list)`.

*Notice how the `S` is capitalized.*

In [4]:
my_series = pd.Series(my_list)

What kind of object is that?

In [5]:
type(my_series)

pandas.core.series.Series

What's inside the series?

`my_series`

In [6]:
my_series


0    2
1    3
2    5
dtype: int64

- 3 rows, with the row indices (or row names) as [0, 1, 2]
- the values are [2, 3, 5]
- the datatype is int64 (i.e. 64-bit integers, which can store very large whole numbers)

From an array

In [7]:
my_array = np.array([8.0, 13.0, 21.0])

# create series from array
my_series = pd.Series(my_array)

type(my_series)

pandas.core.series.Series

In [8]:
my_series

0     8.0
1    13.0
2    21.0
dtype: float64

- 3 rows, with the row indices as [0, 1, 2]
- the values are [8.0, 13.0, 21.0]
- the datatype is float64

**From a dictionary**

In [9]:
labeled_series = pd.Series({'a' : 0, 'b' : 1.5, 'c' : 2, 'd': 3.5, 'e': 4, 'f': 5.5})
labeled_series

a    0.0
b    1.5
c    2.0
d    3.5
e    4.0
f    5.5
dtype: float64

**From a dataframe**

In [10]:
sleep_df = data('sleepstudy')
sleep_df.head()

| | Reaction | Days | Subject |
| --- | --- | --- | --- |
| 1 | 249.56 | 0 | 308 |
| 2 | 258.7047 | 1 | 308 |
| 3 | 250.8006 | 2 | 308 |
| 4 | 321.4398 | 3 | 308 |
| 5 | 356.8519 | 4 | 308 |


option 1: `.column_name`

In [11]:
sleep_series = sleep_df.Reaction
type(sleep_series)
# my_series

pandas.core.series.Series

option 2: single bracket `[]`

In [12]:
sleep_series = sleep_df['Reaction']
type(sleep_series)

pandas.core.series.Series

In the next lesson, we will learn about dataframes, but notice if I use double brackets to select the column, I end up with a dataframe, not a series.

In [13]:
my_dataframe_that_resembles_a_series = sleep_df[['Reaction']]
type(my_dataframe_that_resembles_a_series)

pandas.core.frame.DataFrame

In [14]:
sleep_series

1      249.5600
2      258.7047
3      250.8006
4      321.4398
5      356.8519
         ...   
176    329.6076
177    334.4818
178    343.2199
179    369.1417
180    364.1236
Name: Reaction, Length: 180, dtype: float64

In [15]:
my_dataframe_that_resembles_a_series

| | Reaction |
| --- | --- |
| 1 | 249.5600 |
| 2 | 258.7047 |
| 3 | 250.8006 |
| 4 | 321.4398 |
| 5 | 356.8519 |
| ... | ... |
| 176 | 329.6076 |
| 177 | 334.4818 |
| 178 | 343.2199 |
| 179 | 369.1417 |


### **Summary**

From a list, array, or dictionary:

- `myseries = pd.Series(<list or array or dictionary>)`

From existing dataframe:

- `myseries = df['col_for_series']`
- `myseries = df.col_for_series`

### Pandas data types

Data types you will see in series and dataframes:

- int: integer, whole number values
- float: decimal numbers
- bool: true or false values
- object: strings
- category: a fixed set of string values

A series gets its data type in one of two ways:

- inferring
- using `astype()`

### Inferring

In [16]:
pd.Series([True, False, True])

0     True
1    False
2     True
dtype: bool

In [17]:
pd.Series(['I', 'Love', 'Codeup'])

0         I
1      Love
2    Codeup
dtype: object

In [18]:
my_series = pd.Series([1, 3, 'five'])
my_series

0       1
1       3
2    five
dtype: object

In [19]:
# filter out 'five' from the series and reassign
my_new_series = my_series[my_series != 'five']

my_new_series

0    1
1    3
dtype: object

### Using astype()

In [20]:
my_new_series.astype('int')

0    1
1    3
dtype: int64

What would happen if we tried to change a series to a datatype that it cannot convert the values to?

`# my_series.astype('int')`
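
One way to answer that question safely is to wrap the call in `try`/`except`. This is a minimal sketch that rebuilds the mixed-type series from above; the variable `conversion_failed` is introduced here only for illustration.

```python
import pandas as pd

my_series = pd.Series([1, 3, "five"])

try:
    my_series.astype("int")
    conversion_failed = False
except ValueError as err:
    # 'five' cannot be parsed as an integer, so pandas raises an error
    # instead of guessing a value.
    conversion_failed = True
    print("Could not convert:", err)
```

The original series is left untouched by the failed conversion, so its dtype is still `object`.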

The Subject column in the sleep study dataframe is an ID representing a person/subject; therefore, we should store the values as strings, which pandas reports as the 'object' dtype.

In [21]:
sleep_subj_series = sleep_df['Subject'].astype('str')
sleep_subj_series

1      308
2      308
3      308
4      308
5      308
      ... 
176    372
177    372
178    372
179    372
180    372
Name: Subject, Length: 180, dtype: object

### **Summary**

- Pandas will infer datatypes
- You can change datatypes when creating the series, `pd.Series(mylist).astype('int')`, or later using `astype(x)`, where x can be 'float', 'int', 'str', etc., e.g. `myseries.astype('str')`
- `astype('str')` will show the series dtype as object.

### Vectorized Operations

Like numpy arrays, pandas series are vectorized by default. E.g., we can easily use the basic arithmetic operators to manipulate every element in the series.

1. arithmetic operations
2. comparison operations

In [22]:
fibi_series = pd.Series([0, 1, 1, 2, 3, 5, 8])

fibi_series.head()

0    0
1    1
2    1
3    2
4    3
dtype: int64

In [23]:
fibi_series + 1

0    1
1    2
2    2
3    3
4    4
5    6
6    9
dtype: int64

In [24]:
fibi_series / 2

0    0.0
1    0.5
2    0.5
3    1.0
4    1.5
5    2.5
6    4.0
dtype: float64

In [25]:
fibi_series >= 5

0    False
1    False
2    False
3    False
4    False
5     True
6     True
dtype: bool

In [26]:
(fibi_series >= 3) & (fibi_series % 2 == 0)

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

### **Summary**

- Just as in Numpy, we can perform an operation on every element of a series by applying the operator to the series itself, e.g. `s + 1`, `s / 2`, `s == 3`; the operation is evaluated for each element.
- A series is always returned:
    - a series of booleans if we give a conditional statement.
    - a series of transformed values if we do an arithmetic operation.
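
For comparison, the sketch below writes the same two operations both ways: vectorized and as explicit Python loops. The loop versions are shown only to illustrate what the vectorized expressions compute.

```python
import pandas as pd

s = pd.Series([0, 1, 1, 2, 3, 5, 8])

# Vectorized: the operator is applied to every element, no loop needed.
plus_one = s + 1
is_big_and_even = (s >= 3) & (s % 2 == 0)

# The same results written as explicit Python loops, for comparison only.
plus_one_loop = [x + 1 for x in s]
is_big_and_even_loop = [x >= 3 and x % 2 == 0 for x in s]

print(plus_one.tolist() == plus_one_loop)
print(is_big_and_even.tolist() == is_big_and_even_loop)
```

Note the parentheses around each comparison: in Python, `&` binds more tightly than `>=` and `==`, so leaving them out changes the meaning of the expression.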

### Series Attributes

**Attributes** return useful information about a Series' properties; they don't perform operations or calculations with the Series. Attributes are accessed using dot notation (also called attribute access), as we will see in the examples below. In Jupyter Notebook, you can quickly see a list of available attributes by typing the series name followed by a period and then pressing the tab key.

There are several components that make up a pandas Series, and I can easily access each component by using attributes.

`.index`

The index allows us to reference items in the series. In our fibi_series, the index consists of the numbers 0-6.

`fibi_series.index`

RangeIndex(start=0, stop=7, step=1)

`.values`

The values are my data.

In [27]:
# The values are stored in a NumPy array. Hello vectorized operations!

fibi_series.values

array([0, 1, 1, 2, 3, 5, 8])

`.dtype`

The dtype is the data type of the elements in the Series. In our fibi_series, the data type is int64; it was inferred from the data we used.

Pandas has several main data types we will work with:

- int: integer, whole number values
- float: decimal numbers
- bool: true or false values
- object: strings
- category: a fixed and limited set of string values

`fibi_series.dtype`

dtype('int64')

`.name`

The name is an optional human-friendly name for the Series.

Our Series doesn't have a name, but we can give it one:

In [28]:
fibi_series.name = 'Fibonacci'
fibi_series

0    0
1    1
2    1
3    2
4    3
5    5
6    8
Name: Fibonacci, dtype: int64

`.size`

The .size attribute returns an int representing the number of rows in the Series. NULL values are included.

`fibi_series.size`

7

`.shape`

The .shape attribute returns a tuple representing the rows and columns when used on a two-dimensional structure like a DataFrame, but it can also be used on a Series to return its number of rows. NULL values are included.

`fibi_series.shape`

(7,)
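
To see what "NULL values are included" means in practice, the sketch below uses a small made-up series with one missing value and contrasts the `.size` and `.shape` attributes with the `.count()` method, which counts only non-missing observations.

```python
import pandas as pd

# A small illustrative Series with one missing value (None becomes NaN).
with_missing = pd.Series([1.0, None, 3.0])

print(with_missing.size)     # counts every row, including the missing one
print(with_missing.shape)    # one-element tuple: (number of rows,)
print(with_missing.count())  # a method, not an attribute: non-missing values only
```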

### Series Methods

**Methods** used on pandas Series objects often return new Series objects and, by default, leave the original Series unchanged; some also offer an `inplace` parameter, which defaults to `inplace=False`.

If I want to save any manipulations or transformations I make on my Series, I can either assign the result to a variable or, where the method supports it, adjust my parameters (inplace=True).
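
A minimal sketch of the "assign the result to a variable" option, using `.sort_values()` on a small made-up series:

```python
import pandas as pd

s = pd.Series([5, 0, 8, 1])

# sort_values returns a new Series; s itself is not changed.
s.sort_values()
print(s.tolist())               # still in the original order

# To keep the result, assign it to a variable.
sorted_s = s.sort_values()
print(sorted_s.tolist())
print(sorted_s.index.tolist())  # index labels travel with their values
```

The sorted series keeps each value's original index label, which is why its index is no longer in order.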

- `.head()`: returns the first 5 rows of the series by default (fewer if the series is shorter)

`fibi_series.head()`

In [29]:
fibi_series.head()

0    0
1    1
2    1
3    2
4    3
Name: Fibonacci, dtype: int64

- `.tail()`: returns the last 5 rows of the series

`fibi_series.tail()`

In [30]:
fibi_series.tail()

2    1
3    2
4    3
5    5
6    8
Name: Fibonacci, dtype: int64

- `.sample()`: returns a random sample of rows in the Series; n = 1 by default. Again, the index is retained.

In [31]:
sleep_df = data('sleepstudy')
sleep_days_series = sleep_df.Days

In [32]:
sleep_days_series.sample(5)

180    9
40     9
133    2
81     0
166    5
Name: Days, dtype: int64
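
Because the sample is random, rerunning the cell above will usually give different rows. If a repeatable draw is needed, `.sample()` accepts a `random_state` argument. A minimal sketch, using a made-up series and an arbitrary seed value of 42:

```python
import pandas as pd

days = pd.Series(range(10))

# Passing the same random_state makes the "random" draw repeatable.
first_draw = days.sample(3, random_state=42)
second_draw = days.sample(3, random_state=42)

print(first_draw)
print(first_draw.equals(second_draw))
```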

- `.value_counts()`: count number of records/items/rows containing each unique value (think "group by")

In [33]:
sleep_days_series.value_counts() # think 'group-by' 

0    18
1    18
2    18
3    18
4    18
5    18
6    18
7    18
8    18
9    18
Name: Days, dtype: int64

In SQL, this would look like:

`select Days, count(Subject) from my_df group by Days;`
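
In the sleep study data every day appears equally often, so the ordering of the result is not obvious. The sketch below uses made-up data with uneven counts to show that `.value_counts()` lists the most frequent value first, and that `normalize=True` returns proportions instead of counts.

```python
import pandas as pd

# Made-up data for illustration.
colors = pd.Series(["red", "blue", "red", "green", "red", "blue"])

counts = colors.value_counts()                # most frequent value first
shares = colors.value_counts(normalize=True)  # proportions instead of counts

print(counts)
print(shares)
```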

### **Descriptive stats**

Pandas has a number of methods that can be used to view summary statistics about our data. The table below [taken from here](https://pandas.pydata.org/pandas-docs/stable/basics.html#descriptive-statistics) provides a summary of some of the most commonly used methods.

| Function | Description |
| --- | --- |
| count | Number of non-NA observations |
| sum | Sum of values |
| mean | Mean of values |
| median | Arithmetic median of values |
| min | Minimum |
| max | Maximum |
| mode | Mode |
| abs | Absolute Value |
| std | Bessel-corrected sample standard deviation |
| quantile | Sample quantile (value at %) |

In [34]:
sleep_df = data('sleepstudy')
sleep_reaction_time_series = sleep_df.Reaction

In [35]:
{
    'count': sleep_reaction_time_series.count(),
    'sum': sleep_reaction_time_series.sum(),
    'mean': sleep_reaction_time_series.mean(),
    'median': sleep_reaction_time_series.median()

}

{'count': 180,
 'sum': 53731.42049999999,
 'mean': 298.50789166666664,
 'median': 288.6508}
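
The remaining methods from the table work the same way. Here is a sketch on a rebuilt copy of the Fibonacci values (the variable name `fib` is used here only so the example is self-contained):

```python
import pandas as pd

fib = pd.Series([0, 1, 1, 2, 3, 5, 8])

summary = {
    "min": fib.min(),
    "max": fib.max(),
    "mode": fib.mode().tolist(),  # mode returns a Series, since ties are possible
    "std": fib.std(),             # sample standard deviation
    "q75": fib.quantile(0.75),    # value at the 75th percentile
}
print(summary)
```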

- `.describe()`: returns a series of descriptive statistics on a pandas Series. The information it returns depends on the data type of the elements in the Series.

In [36]:
sleep_reaction_time_series.describe()

count    180.000000
mean     298.507892
std       56.328757
min      194.332200
25%      255.375825
50%      288.650800
75%      336.752075
max      466.353500
Name: Reaction, dtype: float64

In [37]:
print(fibi_series)
fibi_series.describe()

0    0
1    1
2    1
3    2
4    3
5    5
6    8
Name: Fibonacci, dtype: int64


count    7.000000
mean     2.857143
std      2.794553
min      0.000000
25%      1.000000
50%      2.000000
75%      4.000000
max      8.000000
Name: Fibonacci, dtype: float64
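
Both examples above are numeric. To illustrate how the output of `.describe()` depends on the data type, here is a sketch on a small made-up series of strings, where the summary reports the count, the number of unique values, the most common value, and its frequency.

```python
import pandas as pd

# Made-up string data for illustration.
words = pd.Series(["a", "b", "a", "c"])

text_summary = words.describe()
print(text_summary)
```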

`.nlargest()`, `.nsmallest()`

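These methods allow me to return the n largest or n smallest values from a pandas Series. I can set the `keep` parameter to `'first'`, `'last'`, or `'all'` to deal with duplicate largest or smallest values; this is quite handy.

The default argument for `keep` is `'first'`, which is used in the `.nlargest()` example below; the `.nsmallest()` example uses `keep='all'` so that every tied value is returned.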

In [38]:
fibi_series.nlargest(n=3,keep='first')

6    8
5    5
4    3
Name: Fibonacci, dtype: int64

In [39]:
fibi_series.nsmallest(n=2,keep='all')

0    0
1    1
2    1
Name: Fibonacci, dtype: int64
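
To compare the three `keep` options side by side, the sketch below rebuilds the Fibonacci values and asks for the 2 smallest each time. The dictionary `results` is introduced here only for illustration.

```python
import pandas as pd

fib = pd.Series([0, 1, 1, 2, 3, 5, 8])

# The value 1 appears twice (index 1 and index 2), so asking for the
# 2 smallest values leaves a tie for second place.
results = {
    keep: fib.nsmallest(n=2, keep=keep)
    for keep in ["first", "last", "all"]
}

for keep, result in results.items():
    # Show which index labels were selected under each option.
    print(keep, sorted(result.index))
```

With `'first'` the tie goes to the earlier occurrence, with `'last'` to the later one, and with `'all'` both tied rows are kept, so the result can contain more than n rows.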