# Manipulating Data With Pandas <img src="images/SWC22-Pandas.PythonPandasLogo.jpg" align = "right" width=200 height=200/>
- Pandas is a data analysis library built on top of NumPy.
- Pandas provides data structures and operations for manipulating data using DataFrames.
- DataFrames are multidimensional arrays with attached row and column labels.
- DataFrames can include heterogeneous types and/or missing data.
- Pandas also provides functions for handling data in a similar fashion to database frameworks and spreadsheet programs.
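
As a minimal sketch of the last two points, the snippet below builds a small DataFrame whose columns hold different types and whose numeric column has a gap. The column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one text column, one numeric column with a gap
frame = pd.DataFrame({
    "name": ["alpha", "beta", "gamma"],
    "score": [1.5, np.nan, 3.0],
})

print(frame.dtypes)            # each column keeps its own dtype
print(frame["score"].isna())   # True where the value is missing
```

Each column carries its own type, and the missing entry is stored as NaN rather than raising an error.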



In [32]:
# use NumPy and Pandas
import numpy as np
import pandas as pd
print("Pandas version is", pd.__version__)

Pandas version is 1.4.2


# The Series Object
- A Pandas **Series** is a one-dimensional array of indexed data. It can be created from a list or array.
    - A Series wraps both a sequence of values and a sequence of indices, which can be accessed with the values and index attributes.
    - The values are simply a familiar NumPy array.


In [34]:
atad = pd.Series([0.52, 0.8, 0.63, 4.0], index = ['a', 'b', 'c', 'd'])
print(atad)
print()
print(atad.values)

a    0.52
b    0.80
c    0.63
d    4.00
dtype: float64

[0.52 0.8  0.63 4.  ]


# The Series Index
- The Series index is an array-like object of type pd.Index.
    - As with a NumPy array, data can be accessed by the associated index using square-bracket notation.
    - The Pandas Series is much more general and flexible than the one-dimensional NumPy array that it emulates.


In [37]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

In [41]:
#index of the data series
data.index

RangeIndex(start=0, stop=4, step=1)

In [42]:
#the element at index 1
data[1]

0.5

In [48]:
#a slice of a series (start:stop)
data[1:3]

1    0.50
2    0.75
dtype: float64
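
One way to see the added flexibility is that the index does not have to be the positions 0, 1, 2, and so on. The sketch below assumes an arbitrary, non-contiguous integer index chosen only for illustration; the name flexible is hypothetical.

```python
import pandas as pd

# The index can hold any labels; here, non-contiguous integers
flexible = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])

print(flexible[5])        # square brackets look up the label 5, not position 5
print(flexible.loc[3])    # .loc makes label-based access explicit
print(flexible.iloc[0])   # .iloc makes position-based access explicit
```

Using .loc and .iloc avoids ambiguity when the labels are themselves integers.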

# Python Dictionaries and Pandas Series
- A Pandas Series is similar to a specialized Python dictionary. 
    - A dictionary maps arbitrary keys to a set of arbitrary values; a Series maps typed keys to a set of typed values.
    - This type information makes a Pandas Series much more efficient than a Python dictionary for certain operations.
- Construct a Series object directly from a Python dictionary:


In [49]:
#create a dictionary of key:value pairs
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population_dict

{'California': 38332521,
 'Texas': 26448193,
 'New York': 19651127,
 'Florida': 19552860,
 'Illinois': 12882135}

In [50]:
#create a Pandas series from a Python dictionary
population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [51]:
#notice the difference between printing the Python dictionary and printing the Pandas Series
#the Series prints each index/value pair on its own line, followed by the dtype.
print("Dictionary:")
print(population_dict)
print("\nSeries:")
print(population)


Dictionary:
{'California': 38332521, 'Texas': 26448193, 'New York': 19651127, 'Florida': 19552860, 'Illinois': 12882135}

Series:
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64
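
The typed values are what allow arithmetic on a whole Series at once. The following sketch compares the two approaches on a two-state subset of the data; the names small_dict, millions_dict, and millions_series are introduced here only for illustration.

```python
import pandas as pd

small_dict = {'California': 38332521, 'Texas': 26448193}
small = pd.Series(small_dict)

# With a dictionary, arithmetic needs an explicit loop over the items
millions_dict = {state: count / 1e6 for state, count in small_dict.items()}

# With a Series, the same arithmetic applies to every value at once
millions_series = small / 1e6

print(millions_dict)
print(millions_series)
```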


# Python Dictionaries and Pandas Series
- Dictionary-style item access can be used with a Series:



In [55]:
population['California']

38332521

- Unlike a dictionary, the Series also supports NumPy array-style operations such as slicing:

In [56]:
population['California':'New York']

California    38332521
Texas         26448193
New York      19651127
dtype: int64
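
Note that the label slice above includes its end label, while a positional slice excludes its stop position. A small sketch of the difference, using the explicit .loc and .iloc accessors (the names by_label and by_position are illustrative):

```python
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127, 'Florida': 19552860,
                        'Illinois': 12882135})

# Label-based slice: the end label 'New York' is included
by_label = population.loc['California':'New York']

# Position-based slice: position 2 is excluded, as with a Python list
by_position = population.iloc[0:2]

print(by_label)
print(by_position)
```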

# Creating a Series
- Creating a Series is almost always some form of
            pd.Series(data, index=index)
- where index is an optional argument, and data can be one of many entities (e.g., list, dictionary, NumPy array).

In [57]:
# simple series built from a list of scalars
pd.Series([2, 4, 6])

0    2
1    4
2    6
dtype: int64

In [59]:
# a single scalar is repeated to fill the specified index
pd.Series(5, index=[100, 200, 300])

100    5
200    5
300    5
dtype: int64

In [30]:
# simple dictionary-based series
pd.Series({2:'a', 1:'b', 3:'c'})

2    a
1    b
3    c
dtype: object


In [31]:
# populate using only specified keys (by index)
print(pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 2]))

3    c
2    a
dtype: object
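
The same pattern works when the data is a NumPy array. The sketch below also shows what happens when the requested index contains a key the dictionary does not have; all names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# A Series built from a NumPy array with an explicit index
values = np.array([10.0, 20.0, 30.0])
from_array = pd.Series(values, index=['x', 'y', 'z'])

# Requesting a key that the dictionary lacks yields a missing value (NaN)
partial = pd.Series({'a': 1, 'b': 2}, index=['a', 'c'])

print(from_array)
print(partial)
```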


# DataFrames
- Like the Series, the DataFrame can be thought of either as a generalization of a NumPy array or as a specialization of a Python dictionary.
- A DataFrame is comparable to a two-dimensional array with both flexible row indices and flexible column names. 
- Just as you might think of a two-dimensional array as an ordered sequence of aligned one-dimensional columns, you can think of a DataFrame as a sequence of aligned Series objects.



### Construct a new area Series which parallels the population Series created earlier, then create a two-dimensional DataFrame using those objects

In [63]:
#recall the population_dict from above
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}

In [66]:
#and the Pandas Series created from that dictionary
population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [65]:
#create a new area dictionary for the same states
area_dict = {'California': 423967,
             'Texas': 695662,
             'New York': 141297,
             'Florida': 170312, 
             'Illinois': 149995}

In [67]:
#create a Pandas Series from the area dictionary
area = pd.Series(area_dict)
area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64

In [69]:
#create a DataFrame from the two Series
states = pd.DataFrame({'population': population,
                       'area': area})
states

            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995
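
The word "aligned" matters here: the Series are matched by index label, not by position. A minimal sketch, assuming two Series whose labels differ in order and coverage (the subset of states is chosen only for illustration):

```python
import pandas as pd

# Two Series with different label order and one label missing from the first
population = pd.Series({'California': 38332521, 'Texas': 26448193})
area = pd.Series({'Texas': 695662, 'California': 423967, 'Florida': 170312})

# Rows are matched by label; labels absent from a Series become NaN
aligned = pd.DataFrame({'population': population, 'area': area})
print(aligned)
```

Florida has an area but no population in this example, so its population entry is filled with NaN.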


### DataFrame attributes

- DataFrames have an index attribute (the row labels) and a columns attribute (the column labels).

In [75]:
#index refers to the row headings
states.index

Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

In [74]:
states.columns

Index(['population', 'area'], dtype='object')

In [73]:
#a DataFrame uses a column name as the key to select that column as a Series
states['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64
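
The dictionary-like and array-like views of a DataFrame can be checked directly. This sketch rebuilds a two-state version of states so that it runs on its own; the reduced data is an assumption made for brevity.

```python
import pandas as pd

# Outer keys become columns, inner keys become the row index
states = pd.DataFrame({'population': {'California': 38332521, 'Texas': 26448193},
                       'area': {'California': 423967, 'Texas': 695662}})

print(type(states['area']))      # a single column is a pandas Series
print('area' in states)          # membership tests check column names, like dictionary keys
print(states.to_numpy().shape)   # the underlying values form a two-dimensional array
```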

### A DataFrame from a list of Dictionaries

In [77]:
data = [{'a': i, 'b': 2 * i}
        for i in range(3)]
data



[{'a': 0, 'b': 0}, {'a': 1, 'b': 2}, {'a': 2, 'b': 4}]

In [78]:
pd.DataFrame(data)

   a  b
0  0  0
1  1  2
2  2  4
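
The dictionaries in the list do not need to share the same keys. As a small sketch with made-up records, any key missing from a dictionary is filled with NaN in that row:

```python
import pandas as pd

# Hypothetical records with partially overlapping keys
records = [{'a': 1, 'b': 2}, {'b': 3, 'c': 4}]
frame = pd.DataFrame(records)

print(frame)
```

Every key that appears in any record becomes a column, and the gaps are marked as missing data.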

