# Jupyter formatting syntax
bold is ** before and after
italic is _ or * before and after
underline is <ins> </ins>
adding an image is html: <img src="filename" align = "right/left/center" width=### height=### />


# Manipulating Data With Pandas <img src="images/SWC22-Pandas.PythonPandasLogo.jpg" align = "right" width=200 height=200/>
- Pandas is a data analysis library built on top of NumPy.
- Pandas provides data structures and operations for manipulating data using DataFrames.
- DataFrames are multidimensional arrays with attached row and column labels.
- DataFrames can include heterogeneous types and/or missing data.
- Pandas also provides functions for handling data in a similar fashion to database frameworks and spreadsheet programs.
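
A minimal sketch of the points above, assuming a small made-up table (the column names `site` and `reading` and the row labels are hypothetical): one column holds text, the other holds floats with one missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: a text column and a numeric column with a gap.
samples = pd.DataFrame({"site": ["A", "B", "C"],
                        "reading": [1.5, np.nan, 3.0]},
                       index=["row1", "row2", "row3"])

print(samples.dtypes)             # one dtype per column
print(samples["reading"].isna())  # True only for the missing entry
```

Each column keeps its own type, and the missing reading is stored as NaN rather than causing an error.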



In [None]:
# use NumPy and Pandas
import numpy as np
import pandas as pd
print("Pandas version is", pd.__version__)

# The Series Object
- A Pandas **Series** is a one-dimensional array of indexed data. It can be created from a list or array.
    - A Series wraps both a sequence of values and a sequence of indices, which can be accessed through the values and index attributes.
    - The values are simply a familiar NumPy array.


In [None]:
atad = pd.Series([0.52, 0.8, 0.63, 4.0], index = ['a', 'b', 'c', 'd'])
print(atad)
print()
print(atad.values)

# The Series Index
- The Series index is an array-like object of type pd.Index
    - Like with a NumPy array, data can be accessed by the associated index using square-bracket notation
    - The Pandas Series is much more general and flexible than the one-dimensional NumPy array that it emulates.
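
As a rough illustration of that flexibility, the sketch below uses integer labels that are neither sorted nor contiguous (the labels are chosen only for this example). A plain NumPy array could not be indexed this way.

```python
import pandas as pd

# Labels chosen for illustration: integers, neither sorted nor contiguous.
s = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])

print(s.index)
print(s[5])   # looks up the label 5, not the sixth position
```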


In [None]:
data = pd.Series([0.25, 0.5, 0.79, 1.0])
data

In [None]:
#index of the data series
data.index

In [None]:
#the element at index 1
data[1]

In [None]:
#a slice of a series (start:stop)
data[1:3]

# Python Dictionaries and Pandas Series
- A Pandas Series is similar to a specialized Python dictionary. 
    - A dictionary maps arbitrary keys to a set of arbitrary values; a Series maps typed keys to a set of typed values.
    - This type information makes a Pandas Series much more efficient than a Python dictionary for certain operations.
- Construct a Series object directly from a Python dictionary:


In [None]:
#create a dictionary of key:value pairs
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population_dict

In [None]:
#create a Pandas series from a Python dictionary
population = pd.Series(population_dict)
population

In [None]:
#notice the difference between printing the Python dictionary and printing the Pandas Series
#the Series prints one index/value pair per line, as if looping over its elements
print("Dictionary:")
print(population_dict)
print("\nSeries:")
print(population)


### Dictionary-style item access can be used with a Series:



In [None]:
population['California']

- Unlike a dictionary, the Series also supports NumPy array-style operations such as slicing:

In [None]:
population['California':'New York']

# Creating a Series
- Creating a Series is almost always some form of
            pd.Series(data, index = index)
- where index is an optional argument, and data can be one of many entities (e.g., list, dictionary, NumPy array).

In [None]:
# simple scalar series
pd.Series([2, 4, 6])

In [None]:
# scalar series, fill with 5's and specify index
pd.Series(5, index=[100, 200, 300])

In [None]:
# simple dictionary-based series
pd.Series({2:'a', 1:'b', 3:'c'})

In [None]:
# populate using only specified keys (by index)
print(pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 2]))
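
The cells above cover lists, scalars and dictionaries. A NumPy array should work the same way; the sketch below assumes illustrative values and labels.

```python
import numpy as np
import pandas as pd

values = np.array([10, 20, 30])               # illustrative values
s = pd.Series(values, index=["x", "y", "z"])  # labels chosen for this example

print(s)
print(s["y"])
```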

# DataFrames
- The DataFrame can also be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary.
- A DataFrame is comparable to a two-dimensional array with both flexible row indices and flexible column names. 
- Just as you might think of a two-dimensional array as an ordered sequence of aligned one-dimensional columns, you can think of a DataFrame as a sequence of aligned Series objects.
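
A small sketch of what "aligned" means here, using two hypothetical Series (`height` and `weight`) whose labels are listed in a different order: the DataFrame matches rows by label, not by position.

```python
import pandas as pd

# Two hypothetical Series whose labels are in a different order.
height = pd.Series({"x": 1.0, "y": 2.0, "z": 3.0})
weight = pd.Series({"z": 30.0, "x": 10.0, "y": 20.0})

df = pd.DataFrame({"height": height, "weight": weight})
print(df)   # each row pairs the values that share a label
```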



### Construct a new area Series which parallels the population Series created earlier, then create a two-dimensional DataFrame using those objects

In [None]:
#recall the population_dict from above
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}

In [None]:
#and the Pandas Series created from that dictionary
population = pd.Series(population_dict)
population

In [None]:
#create a new area dictionary for the same states
area_dict = {'California': 423967,
             'Texas': 695662,
             'New York': 141297,
             'Florida': 170312, 
             'Illinois': 149995}

In [None]:
#create a Pandas Series from the area dictionary
area = pd.Series(area_dict)
area

In [None]:
#create a DataFrame from the two Series
states = pd.DataFrame({'population': population,
                       'area': area})
states

### DataFrame attributes

- DataFrames have an index attribute and a columns attribute

In [None]:
#index refers to the row headings
states.index

In [None]:
states.columns

In [None]:
#DataFrames use column names as keys to the Series stored in each column
states['area']

### A DataFrame from a list of Dictionaries

In [None]:
data = [{'a': i, 'b': 2 * i}
        for i in range(3)]
data



In [None]:
pd.DataFrame(data)

## Missing Values
- Missing values are filled with NaN ("not-a-number")
- This behavior is important: in data science, missing values can affect analytical results and should be handled consistently.
- https://towardsdatascience.com/navigating-the-hell-of-nans-in-python-71b12558895b


In [None]:
data=[{'a': 1, 'b': 2}, 
      {'b': 3, 'c': 4}]
data

In [None]:
pd.DataFrame(data)
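
One possible way to inspect and handle the NaN entries produced above is sketched below. Filling with 0 and dropping incomplete columns are choices made only for illustration; the right treatment depends on the data.

```python
import pandas as pd

df = pd.DataFrame([{"a": 1, "b": 2}, {"b": 3, "c": 4}])

print(df.isna())                  # True where a value is missing
print(df.fillna(0))               # replace every NaN with 0
print(df.dropna(axis="columns"))  # keep only columns with no missing values
```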

# A DataFrame from a 2D Array
- Given a two-dimensional array of data, a DataFrame can be created with any specified column and index names.
    - If the names are omitted, an integer index will be used for each.


In [None]:
pd.DataFrame(np.random.rand(3, 2),
      columns=['foo', 'bar'],
      index=['a', 'b', 'c'])
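
For comparison, a short sketch of the default case, where neither `columns` nor `index` is passed (the array of zeros is just a placeholder):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((3, 2)))   # no names supplied

print(df.index)     # integer labels for the rows
print(df.columns)   # integer labels for the columns
```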

# The Pandas Index is an Object
- Both the Series and DataFrame objects in Pandas contain an explicit index that lets you reference and modify data
- A Pandas Index is itself an object that may contain repeated values
- It can be thought of either as an immutable array or as an ordered set (technically a multiset -- a set which allows multiple instances of each of its elements)
- This has some interesting consequences in operations available on Index objects. 


In [None]:
ind = pd.Index([2, 3, 5, 7, 11])
ind

## Array-like Behavior of a Pandas Index
- The Index object in many ways operates like an array. 
    - Indexing notation to retrieve values or slices
    - Index objects have many of the attributes familiar from NumPy arrays


In [None]:
ind[1]

In [None]:
ind[::2]
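
The NumPy-style attributes mentioned above can be checked directly. A brief sketch using the same index values:

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

# size, shape, ndim and dtype mirror the attributes of a NumPy array
print(ind.size, ind.shape, ind.ndim, ind.dtype)
```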

## A Pandas Index is Immutable
- One difference between Index objects and NumPy arrays is that indices are immutable; they cannot be modified in place.
- This makes it safer to share indices between multiple DataFrames and arrays, without the potential for side effects from inadvertent index modification.


In [None]:
# an Index is immutable, so this assignment raises a TypeError
ind[1] = 0
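
Running the assignment above stops the cell with a TypeError. If the notebook needs to keep running, one option is to catch the error, as in this sketch (the `modified` flag is a hypothetical name used only to record the outcome):

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

try:
    ind[1] = 0            # attempt an in-place modification
    modified = True
except TypeError as err:
    modified = False      # the Index refused the assignment
    print("TypeError:", err)

print(ind)                # the Index is unchanged
```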

## The Pandas Index as an Ordered Set
- Pandas objects are designed to facilitate operations such as joins across datasets
- The Index object follows many of the conventions used by Python’s built-in set data structure, so that unions, intersections, differences, and other combinations can be computed.


In [None]:
#create two index objects
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

In [None]:
#indA & indB --> intersection: the & usage is deprecated, use the intersection method

indA.intersection(indB)


In [None]:
# indA ^ indB --> symmetric difference ("exclusive OR"): the ^ usage is deprecated

indA.symmetric_difference(indB)

In [None]:
# indA | indB  --> union: The | usage is deprecated

indA.union(indB)
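
The text also mentions differences, which the cells above do not show. A short sketch using the `difference` method on the same two indices:

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

print(indA.difference(indB))   # values in indA that are not in indB
print(indB.difference(indA))   # values in indB that are not in indA
```

Unlike union and intersection, the result depends on which index the method is called on.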


# Series as a Dictionary
- Like a dictionary, the series object provides a mapping from a collection of keys to a collection of values

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                index=['a', 'b', 'c', 'd'])

In [None]:
data

In [None]:
data['b']

### We can use dictionary-like expressions and methods to examine keys/indices and values

In [None]:
'a' in data

In [None]:
data.keys()

In [None]:
list(data.items())

### Series objects can even be modified with a dictionary-like syntax
- Just as a dictionary can be extended by assigning to a new key, a Series can be extended by assigning a value to a new index label

In [None]:
data

In [None]:
data['e'] = 1.25
data

# Series as a One-dimensional Array
- A Series provides array-style item selection using the same basic mechanisms as NumPy arrays, including slices, masking, and fancy indexing

In [None]:
#slicing by explicit index
#NOTE: includes last index!!

data['a':'c']


In [None]:
#slicing by implicit integer index
#NOTE: does NOT include the last index!

data[0:2]

In [None]:
#masking
data[(data > 0.3) & (data < 0.8)]

In [None]:
#fancy indexing -- non-contiguous
data[['a', 'e']]

# Indexers
### Pandas provides special <ins>indexer</ins> attributes that explicitly expose certain indexing schemes to help prevent confusion when indexing.

- Without an indexer, a single key such as data[1] is looked up in the explicit Series index.
- Without an indexer, a slice such as data[0:2] uses the implicit Python-style positional index.


In [None]:
data = pd.Series(['a', 'b', 'c'], 
                 index=[1, 3, 5])


In [None]:
# explicit index
print(data[1])

In [None]:
# implicit index when slicing --> up to but not including the last index
print(data[0:2])

#### The loc attribute allows indexing and slicing that always references the explicit index

In [None]:
data.loc[1]

#### The iloc attribute allows indexing and slicing that always references the implicit Python-style index

In [None]:
data.iloc[1]

#### _The explicit nature of loc and iloc makes them very useful in maintaining clean and readable code, especially in the case of integer indexes_
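
The difference is easiest to see when slicing. The sketch below reuses the same integer-indexed Series and applies the same slice through each indexer:

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

print(data.loc[1:3])    # labels 1 through 3, end label included
print(data.iloc[1:3])   # positions 1 and 2, end position excluded
```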

# DataFrame as Dictionary
- The individual series that make up the columns of the DataFrame can be accessed via dictionary-style indexing of the column name

In [None]:
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop})

In [None]:
data

In [None]:
data['area']

In [None]:
#we can use attribute-style access with column names that are strings

data.area

In [None]:
#use the **is** operator to compare identities of each object

data.area is data['area']

In [None]:
#data.pop is the DataFrame method, not the 'pop' column; it removes and returns the 'area' column
data.pop('area')


In [None]:
data.columns
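
Attribute-style access is not available for every column. A column named 'pop' collides with the DataFrame `pop` method, which is why `data.pop('area')` above calls the method rather than selecting a column. A minimal sketch with a two-row version of the table:

```python
import pandas as pd

df = pd.DataFrame({"area": [423967, 695662],
                   "pop": [38332521, 26448193]},
                  index=["California", "Texas"])

print(callable(df.pop))   # True: df.pop is the method, not the column
print(df["pop"])          # dictionary-style access returns the column
```

For this reason, dictionary-style access is the safer habit when column names might clash with method names.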

### Dictionary-style syntax can also be used to modify the DataFrame

In [None]:
#first, add back the column that was popped previously

data['area'] = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
data

In [None]:
#compute density and add the density column to the DataFrame

data['density'] = data['pop'] / data['area']

data

# DataFrame as a Two-dimensional (2D) array
- The DataFrame can be viewed as an enhanced two-dimensional array
    - View the underlying data array using the ndarray **values** attribute

In [None]:
data.values

- Transpose the array using the **T** attribute


In [None]:
data.T
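
One consequence of the array view is that a single integer index on the values array selects a row, whereas a single key on the DataFrame selects a column. A sketch with a three-row version of the table:

```python
import pandas as pd

df = pd.DataFrame({"area": [423967, 695662, 141297],
                   "pop": [38332521, 26448193, 19651127]},
                  index=["California", "Texas", "New York"])

print(df.values[0])   # first row of the underlying array
print(df["area"])     # a single key on the DataFrame selects a column
print(df.T.shape)     # transposing swaps rows and columns
```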
