# Pandas basics part 1 - Series

## Prerequisites 
- Python basics parts 1 and 2
- Numpy basics parts 1, 2 and 3

### References 
- https://wesmckinney.com/book/pandas-basics

## Learning objectives 
- Create Pandas data series 
- Perform mathematical operations on pandas series
- Create non-numeric pandas series
- Select portions of series based on numeric and non-numeric data
- Count the number of times each value appears in a series

#### Import the Pandas and Numpy modules
- When we use pandas, we almost always use numpy too
- import pandas as pd. Every time you create a pandas object or use a built-in pandas function, you prefix it with "pd."
- import numpy as np. Every time you create a numpy object or use a built-in numpy function, you prefix it with "np."

In [1]:
# Run this code block to import the pandas and numpy libraries
import pandas as pd
import numpy as np

### Series with numerical values
- Pandas series are one-dimensional objects, much like numpy arrays.
- Unlike a numpy array, each data point in the series is labeled with an _index_ value (index values are usually, but not necessarily, unique)
  - The index could have meaning, like a date time or a name, or it could be numeric.
  - The index of the series is preserved if the series is sorted or sliced
- Data in a series can be numeric or categorical
- Run the code block below to create a series with an array of five numbers and an unspecified index

#### Create a series from a list or an array with an auto-generated index
- You can create a series from a named list or array, or type the data in directly as we have done below

In [2]:
# run this code block to create an example series
ex_series = pd.Series([13, 42, 40, 25, 18])
print(ex_series)

0    13
1    42
2    40
3    25
4    18
dtype: int64


#### Create a series with a meaningful index
- Suppose the series above represents the weekly hours worked of employees at a small coffee shop and the employer would like to index the series by employee last name. _This probably is not a good idea; a unique employee ID would be better, but let's stick with names for this example_
- Run the code block below to create an indexed series

In [3]:
# Run this code to create an indexed series. Put the data first, then the index.
# You can leave off data= and index=, but it's good practice to include the keyword arguments
ind_series = pd.Series(data=[13, 42, 40, 25, 18], index=['McDaniel', 'Tang', 'Aucejo', 'Manelli', 'Townsend'])
print(ind_series)

McDaniel    13
Tang        42
Aucejo      40
Manelli     25
Townsend    18
dtype: int64


#### Create a series from a dictionary
- You can also create a series from a dictionary. The dictionary keys become the index and the dictionary values become the data.
- Run the code block below to create a similar data series (we add one more employee) from a dictionary
- Note that pd.Series(emp_dict) only returns the series; to store it, you have to assign it to a variable

In [4]:
# Run this code block to create both the dictionary and the data series.  
emp_dict = {'McDaniel': 13, 'Tang': 42, 'Aucejo': 40, 'Manelli': 25, 'Townsend': 18, 'Burns': 39.5}
emp_series = pd.Series(emp_dict)
print(emp_series)

McDaniel    13.0
Tang        42.0
Aucejo      40.0
Manelli     25.0
Townsend    18.0
Burns       39.5
dtype: float64


- Notice how including one 'float' in the data changes the series data type from int64 to float64
- In the codeblock below, experiment with creating series from lists of data, from numpy arrays, from dictionaries, and with different data types
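
As a starting point for that experiment, here is a minimal sketch. The variable names _arr_series_ and _mixed_series_ are made up for illustration and do not appear elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# A series built from a numpy array takes its data type from the array
arr_series = pd.Series(np.array([1.5, 2.5, 3.5]))
print(arr_series.dtype)    # float64

# A list that mixes integers with one float is stored as float64
mixed_series = pd.Series([1, 2, 3.5])
print(mixed_series.dtype)  # float64
```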

#### Add data to series
- Add more data to the series by assigning a value to a new index value.
    - _seriesname_[_newindexvalue_] = _newvalue_ will add a new row to the series with index value _newindexvalue_. If _newindexvalue_ is already in the index, the existing value is overwritten instead.
- Try adding a new row to emp_series with the index 'Warner' and the data value 41
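
One possible way to do this is sketched below. The block rebuilds emp_series from the dictionary used earlier so that it can run on its own.

```python
import pandas as pd

# Rebuild the example series so this block is self-contained
emp_series = pd.Series({'McDaniel': 13, 'Tang': 42, 'Aucejo': 40,
                        'Manelli': 25, 'Townsend': 18, 'Burns': 39.5})

# Assigning to a label that is not yet in the index appends a new row
emp_series['Warner'] = 41
print(emp_series)
```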

#### Length or size of a series
- You can find the number of entries in a pandas series two ways
  - _seriesname_.size: Series are one-dimensional objects, so size returns an integer with the number of rows in the series.
  - len(_seriesname_) returns the 'length' of the series, which is also the number of rows.
- Try this out on emp_series in the code block below 
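
A short sketch of both approaches, using the original six-employee series rebuilt here so the block runs on its own:

```python
import pandas as pd

emp_series = pd.Series({'McDaniel': 13, 'Tang': 42, 'Aucejo': 40,
                        'Manelli': 25, 'Townsend': 18, 'Burns': 39.5})

# size is an attribute (no parentheses); len() is a Python built-in function
print(emp_series.size)   # 6
print(len(emp_series))   # 6
```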

#### Mathematical operations on series 
- Just like with numpy arrays, you can add, subtract, multiply, divide, and raise to a power by a _scalar_. The operation is applied to every entry and the index is unchanged.
- Try performing mathematical operations with scalars on emp_series in the codeblock below
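
A few scalar operations are sketched below. The result names (_doubled_, _over_40_, _squared_) are only illustrative.

```python
import pandas as pd

emp_series = pd.Series({'McDaniel': 13, 'Tang': 42, 'Aucejo': 40,
                        'Manelli': 25, 'Townsend': 18, 'Burns': 39.5})

# Each operation is applied to every entry; the index is preserved
doubled = emp_series * 2     # twice the hours
over_40 = emp_series - 40    # hours above (positive) or below (negative) 40
squared = emp_series ** 2    # hours squared

print(doubled)
```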

#### Numpy functions on series
- You can also use built-in numpy functions on pandas series just like you could with numpy arrays
- Try out np.log(), np.exp(), np.abs(), np.sqrt() on emp_series in the codeblock below
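
For example, the sketch below applies two of these functions. The result is again a pandas series with the same index as the input; the names _root_hours_ and _log_hours_ are made up for this example.

```python
import numpy as np
import pandas as pd

emp_series = pd.Series({'McDaniel': 13, 'Tang': 42, 'Aucejo': 40,
                        'Manelli': 25, 'Townsend': 18, 'Burns': 39.5})

# numpy functions work element by element and return a series
root_hours = np.sqrt(emp_series)
log_hours = np.log(emp_series)

print(root_hours['Manelli'])   # 5.0, the square root of 25
print(type(log_hours))
```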

### Mathematical and statistical operations on series
- The data in the series _emp_series_ is all numerical, so we can apply mathematical and statistical methods to the series, similar to a numpy array. Note that it is usually more convenient to use the series methods than the numpy functions
  - _seriesname_.sum() returns the sum of the elements in the series
  - _seriesname_.mean() returns the mean of the elements in the series
  - _seriesname_.median() returns the median value of the elements in the series
  - _seriesname_.var() returns the variance of the elements in the series
  - _seriesname_.std() returns the standard deviation of the elements in the series
  - _seriesname_.max() returns the maximum value of the elements in the series
  - _seriesname_.min() returns the minimum value of the elements in the series
- If there is even one non-numeric entry in the series, you will receive an error when you try any of the above. We'll address this in the 'data cleaning' section
- Experiment with mathematical and statistical operations in the code block below.
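
A minimal sketch of these methods on the six-employee series follows. One detail worth knowing when comparing results with numpy: the pandas var() and std() methods compute the sample versions (dividing by n - 1) by default, while np.var() and np.std() applied to a plain numpy array divide by n by default.

```python
import numpy as np
import pandas as pd

emp_series = pd.Series({'McDaniel': 13, 'Tang': 42, 'Aucejo': 40,
                        'Manelli': 25, 'Townsend': 18, 'Burns': 39.5})

print(emp_series.sum())      # 177.5
print(emp_series.median())   # 32.25
print(emp_series.max())      # 42.0
print(emp_series.min())      # 13.0
print(emp_series.mean())
print(emp_series.std())

# The pandas result matches numpy when numpy is told to divide by n - 1
sample_std = np.std(emp_series.to_numpy(), ddof=1)
print(sample_std)
```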

### Mathematical operations with two series
- You can add, subtract, multiply, divide and raise to a power element by element with two series
- pandas matches entries by index value rather than by position, so the two series should have the same index values
  - If an index value appears in only one of the two series, the returned series will have a 'NaN' value at that index value.
- Run the code below to create a second series representing hours in a second week

In [5]:
# Run this code block to create both the dictionary and the data series for workers in a coffee shop in a second week
emp_dict2 = {'Tang': 41, 'Aucejo': 40, 'Manelli': 27, 'Townsend': 10, 'Burns': 28.5, 'Summers': 32}
emp_series2 = pd.Series(emp_dict2)
print(emp_series2)

Tang        41.0
Aucejo      40.0
Manelli     27.0
Townsend    10.0
Burns       28.5
Summers     32.0
dtype: float64


- Notice how the index value 'McDaniel' is not in the series emp_series2 and there is a new index value, 'Summers,' in emp_series2
- In the codeblock below, create a new series equal to the sum of emp_series and emp_series2.
  - Notice how the values at the index values 'McDaniel' and 'Summers' in the new series are NaN.  _We'll address strategies for dealing with missing values in the data cleaning section_
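
One way to build that sum is sketched below. The name _total_hours_ is made up for this example, and both series are rebuilt so the block runs on its own.

```python
import pandas as pd

emp_series = pd.Series({'McDaniel': 13, 'Tang': 42, 'Aucejo': 40,
                        'Manelli': 25, 'Townsend': 18, 'Burns': 39.5})
emp_series2 = pd.Series({'Tang': 41, 'Aucejo': 40, 'Manelli': 27,
                         'Townsend': 10, 'Burns': 28.5, 'Summers': 32})

# Entries are matched by index value; labels found in only one series give NaN
total_hours = emp_series + emp_series2
print(total_hours)
```

The result has seven rows, one for every name that appears in either week, and only the five employees present in both weeks have a numeric total.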

### Series with non-numerical data
- Series can contain categorical data like gender, ethnicity or race, state, etc.
- The dictionary below contains the employees and their genders

In [6]:
# run the code block below to create a series with the genders of the employees
gender_dict = {'McDaniel': 'Female', 'Tang': 'Female', 'Aucejo': 'Male', 'Manelli': 'Male', 'Townsend': 'Male', 'Burns': 'Male'}
gender_series = pd.Series(gender_dict)
print(gender_series)

McDaniel    Female
Tang        Female
Aucejo        Male
Manelli       Male
Townsend      Male
Burns         Male
dtype: object


- As we did with numerical series, you can append new entries to the series. Include a new row by running gender_series['Summers'] = 'Female' in the codeblock below.

#### Selecting series based on numeric conditions
- Often, we are interested in selecting portions of our data based on their values
- _seriesname_ > _z_ returns a series with the same index as _seriesname_ with the boolean value _True_ for all entries in the series that are greater than the value _z_ and _False_ otherwise.
- Use the following comparison operators:
    - '>' for greater than
    - '>=' for greater than or equal to
    - '==' for equal to
    - '<' for less than
    - '<=' for less than or equal to
- In the codeblock below, find which entries in emp_series are greater than 30, less than 30, equal to 30, less than or equal to 30, and greater than or equal to 40
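
Two of these comparisons are sketched below. The names _over_30_ and _at_least_40_ are only illustrative.

```python
import pandas as pd

emp_series = pd.Series({'McDaniel': 13, 'Tang': 42, 'Aucejo': 40,
                        'Manelli': 25, 'Townsend': 18, 'Burns': 39.5})

# Each comparison returns a boolean series with the same index as emp_series
over_30 = emp_series > 30
at_least_40 = emp_series >= 40

print(over_30)       # True for Tang, Aucejo and Burns
print(at_least_40)   # True for Tang and Aucejo
```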

#### Creating new series based on numeric conditions
- We can create a subset of the series based on boolean values in a series with the same index values
- For example, _zmask = seriesname > z_ creates a series with the boolean value True for all entries in the series that are greater than the value z and False otherwise.  The index values of _zmask_ are the same as the index values for _seriesname_.
- Next, we create a new series _zseries = seriesname[zmask]_. The new series contains only the rows where _seriesname_ > z.
    - You can also do this in one step: _zseries = seriesname[seriesname > z]_ produces the same series
- Try creating a new series called _emp30_ that contains only the employees in emp_series that have more than 30 hours. Experiment with different values
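
Both the two-step and one-step versions are sketched below. The names _mask30_ and _emp30_one_step_ are made up for this example.

```python
import pandas as pd

emp_series = pd.Series({'McDaniel': 13, 'Tang': 42, 'Aucejo': 40,
                        'Manelli': 25, 'Townsend': 18, 'Burns': 39.5})

# Two steps: build the boolean mask, then use it to select rows
mask30 = emp_series > 30
emp30 = emp_series[mask30]

# One step: put the condition directly inside the brackets
emp30_one_step = emp_series[emp_series > 30]

print(emp30)   # rows for Tang, Aucejo and Burns
```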

#### Selecting and creating series based on non-numeric values
- We can check if values are equal to some string.  _seriesname == 'string'_ returns a series with the boolean value 'True' for all rows that are 'string' and 'False' otherwise
- Once we have the series with the same index as _seriesname_ and boolean values 'True' or 'False', we can create a new series that only contains the desired entries.
  - With two lines of code: _smask = seriesname == 'string'_ and _strseries = seriesname[smask]_
  - Or with one line of code: _strseries = seriesname[seriesname == 'string']_
- In the code block below, create a new series female_series from gender_series that only contains the rows that are 'Female'
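
A minimal sketch of this selection, with gender_series rebuilt from the dictionary above so the block runs on its own:

```python
import pandas as pd

gender_series = pd.Series({'McDaniel': 'Female', 'Tang': 'Female',
                           'Aucejo': 'Male', 'Manelli': 'Male',
                           'Townsend': 'Male', 'Burns': 'Male'})

# Keep only the rows whose value is exactly 'Female' (the match is case-sensitive)
female_series = gender_series[gender_series == 'Female']
print(female_series)
```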

- We can also select rows if they are equal to more than one element
- Suppose we want to check if series values are equal to either 'string1' or 'string2'.
- First, we create a list with the relevant values, _stringlist = ['string1', 'string2']_
- _seriesname.isin(stringlist)_ returns a series with the boolean value 'True' if the entry in the series is equal to either 'string1' or 'string2' and 'False' otherwise.
- Our example series _gender_series_ is not very interesting because there are only two values: 'Male' and 'Female'.  Try creating some new entries, like 'Other' or 'None' and experiment selecting and creating subsets of the series.
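
The sketch below shows one such experiment. The extra entry for 'Warner' with the value 'Other' is invented purely for illustration, as are the names _keep_list_ and _selected_.

```python
import pandas as pd

gender_series = pd.Series({'McDaniel': 'Female', 'Tang': 'Female',
                           'Aucejo': 'Male', 'Manelli': 'Male',
                           'Townsend': 'Male', 'Burns': 'Male'})

# Hypothetical extra entry so that the series has a third value
gender_series['Warner'] = 'Other'

# isin() is True wherever the entry matches any value in the list
keep_list = ['Female', 'Other']
selected = gender_series[gender_series.isin(keep_list)]
print(selected)
```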

#### Counting the values of a series 
- We can count the number of times each value appears in the series with the _value_counts()_ method
- _seriesname.value_counts()_ returns a series whose index is the distinct entries in _seriesname_ and whose values are the number of times each entry appears, sorted from most to least frequent.
- Try gender_series.value_counts() in the codeblock below.  This works for both numeric and non-numeric data, so try it out with emp_series as well.
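
A short sketch using the original six-employee gender series. The name _gender_counts_ is made up for this example.

```python
import pandas as pd

gender_series = pd.Series({'McDaniel': 'Female', 'Tang': 'Female',
                           'Aucejo': 'Male', 'Manelli': 'Male',
                           'Townsend': 'Male', 'Burns': 'Male'})

# The result is itself a series: values become the index, counts become the data
gender_counts = gender_series.value_counts()
print(gender_counts)            # Male 4, Female 2
print(gender_counts['Male'])    # 4
```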

Up next: Pandas dataframes