# NB 7: Tidy data and Pandas

### Pandas Overview
Pandas is a Python library used for data manipulation and analysis, especially for tabular data. It provides two main structures:

1. Series: A one-dimensional array-like object that can hold any data type, like a column in a table.
2. DataFrame: A two-dimensional, table-like data structure where rows and columns are labeled. It's the primary structure for handling data in Pandas.

In [5]:
import pandas as pd  # Standard idiom for loading pandas
from pandas import DataFrame, Series

### Series Objects

A [`Series`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html) object is a column-oriented object that we will use to store a variable of a tibble.

##### Creating a Series

In [186]:
import pandas as pd

# Create a Series with explicit labels
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])  # without `index=`, entries are numbered 0..n-1
s

a    10
b    20
c    30
dtype: int64

In [13]:
from pandas import DataFrame, Series

obj = Series([-1, 2, -3, 4, -5])
print(obj)

0   -1
1    2
2   -3
3    4
4   -5
dtype: int64


Index provides a convenient way to reference individual elements of the Series.

By default, a `Series` has an index that is akin to `range()` in standard Python, and effectively numbers the entries from 0 to n-1, where n is the length of the Series. A Series object also becomes list-like in how you reference its elements.

You can also use more complex index objects, like lists of integers and conditional masks.

In [16]:
I = [0, 2, 3]
obj[I] # Also: obj[[0, 2, 3]]

0   -1
2   -3
3    4
dtype: int64

In [18]:
I_pos = obj > 0
print(type(I_pos), I_pos)

<class 'pandas.core.series.Series'> 0    False
1     True
2    False
3     True
4    False
dtype: bool


In [20]:
print(obj[I_pos])

1    2
3    4
dtype: int64


However, the index can be a more general structure, which effectively turns a Series object into something that is "dictionary-like."

In [31]:
obj3 = Series([      1,    -2,       3,     -4,        5,      -6],
              ['alice', 'bob', 'carol', 'dave', 'esther', 'frank'])
print(f'{obj3} \n')
print("* obj3['bob']: {}\n".format(obj3['bob']))

alice     1
bob      -2
carol     3
dave     -4
esther    5
frank    -6
dtype: int64 

* obj3['bob']: -2
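
To make the "dictionary-like" comparison concrete, here is a minimal sketch of a few dict-style operations on a labeled Series. The name `scores` is a stand-in that rebuilds the same data as `obj3` so the snippet runs on its own.

```python
import pandas as pd

# Same data as obj3, rebuilt so the snippet is self-contained.
scores = pd.Series([1, -2, 3, -4, 5, -6],
                   index=['alice', 'bob', 'carol', 'dave', 'esther', 'frank'])

has_bob = 'bob' in scores         # membership tests the index labels, like dict keys
has_zed = 'zed' in scores
labels = list(scores.keys())      # .keys() returns the index
pairs = dict(scores.items())      # (label, value) pairs, like dict.items()
fallback = scores.get('zed', 0)   # .get() returns a default instead of raising KeyError

print(has_bob, has_zed, labels[:2], pairs['carol'], fallback)
```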



You can construct a Series from a dictionary directly:

In [36]:
peeps = {'alice': 1, 'carol': 3, 'esther': 5, 'bob': -2, 'dave': -4, 'frank': -6}
obj4 = Series(peeps)
print(obj4)

alice     1
carol     3
esther    5
bob      -2
dave     -4
frank    -6
dtype: int64


In [38]:
mujeres = [0, 2, 4] # list of integer offsets
print("* las mujeres of `obj3` at offsets {}:\n{}\n".format(mujeres, obj3.iloc[mujeres]))  # .iloc selects by position

* las mujeres of `obj3` at offsets [0, 2, 4]:
alice     1
carol     3
esther    5
dtype: int64



Basic arithmetic works on `Series` as vector-like operations.

In [48]:
print(obj3, "\n")
print(obj3 + 5, "\n")
print(obj3 + 5 > 0, "\n")
print((-2.5 * obj3) + (obj3 + 5))

alice     1
bob      -2
carol     3
dave     -4
esther    5
frank    -6
dtype: int64 

alice      6
bob        3
carol      8
dave       1
esther    10
frank     -1
dtype: int64 

alice      True
bob        True
carol      True
dave       True
esther     True
frank     False
dtype: bool 

alice      3.5
bob        8.0
carol      0.5
dave      11.0
esther    -2.5
frank     14.0
dtype: float64


A Series object also supports vector-style operations with automatic alignment based on index values.

In [61]:
print(obj3, "\n")

obj_l = obj3.iloc[mujeres]  # positional selection: alice, carol, esther
print(obj_l, "\n")


print(obj3 + obj_l, "\n")

alice     1
bob      -2
carol     3
dave     -4
esther    5
frank    -6
dtype: int64 

alice     1
carol     3
esther    5
dtype: int64 

alice      2.0
bob        NaN
carol      6.0
dave       NaN
esther    10.0
frank      NaN
dtype: float64 
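
If the `NaN` entries produced by alignment are not wanted, one option is the `Series.add()` method with `fill_value`. The sketch below uses two small made-up Series, `full` and `part`, rather than `obj3`.

```python
import pandas as pd

full = pd.Series([1, -2, 3], index=['alice', 'bob', 'carol'])
part = pd.Series([10, 30], index=['alice', 'carol'])

plain = full + part                     # 'bob' has no partner, so the result is NaN there
filled = full.add(part, fill_value=0)   # treat the missing partner as 0 instead

print(plain, "\n")
print(filled)
```

Whether 0 is the right stand-in depends on the data; for other operations a different neutral value may be more appropriate.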




Another useful transformation is the `.apply(fun)` method. It returns a new Series in which the function `fun` has been applied to each element; the original Series is left unchanged.

In [68]:
print(f'{obj3} \n')
print(obj3.apply(abs)) #apply abs value function to all of series and return a copy

alice     1
bob      -2
carol     3
dave     -4
esther    5
frank    -6
dtype: int64 

alice     1
bob       2
carol     3
dave      4
esther    5
frank     6
dtype: int64
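
`.apply()` also accepts your own functions. The sketch below assumes a hypothetical helper, `sign_label`, and a small made-up Series `values`. It also shows that for common operations such as absolute value, a built-in vectorized method gives the same result.

```python
import pandas as pd

values = pd.Series([1, -2, 3], index=['alice', 'bob', 'carol'])

def sign_label(x):
    """Hypothetical helper: describe the sign of one element."""
    return 'positive' if x > 0 else 'non-positive'

labels = values.apply(sign_label)   # element-wise; returns a new Series
magnitudes = values.abs()           # built-in vectorized equivalent of .apply(abs)

print(labels)
print(values.tolist())  # the original Series is unchanged
```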


A `Series` may be _named_, too.

In [75]:
obj3.name = 'peep'

print(f'{obj3}\n')
print(obj3.name)

alice     1
bob      -2
carol     3
dave     -4
esther    5
frank    -6
Name: peep, dtype: int64

peep


### DataFrame Objects

A pandas `DataFrame` object is a table whose columns are Series objects, all keyed on the same index. It's the perfect container for what we have been referring to as a tibble.
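
As a small illustration of "columns are Series keyed on the same index", the sketch below builds a frame from two Series and checks that a column's index matches the frame's. The names `age`, `salary`, and `people` are made up for this example.

```python
import pandas as pd

idx = ['alice', 'bob', 'carol']
age = pd.Series([25, 30, 35], index=idx)
salary = pd.Series([50000, 60000, 70000], index=idx)

people = pd.DataFrame({'age': age, 'salary': salary})

same_index = people['age'].index.equals(people.index)  # each column shares the frame's index
print(people)
print(type(people['age']).__name__, same_index)
```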

##### Creating a DataFrame

In [194]:
#Creating a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'Salary': [50000, 60000, 70000]}
df = pd.DataFrame(data)
print(df)

      Name  Age  Salary
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   35   70000


In [90]:
#Creating dataframe
cafes = DataFrame({'name': ['east pole', 'chrome yellow', 'brash', 'taproom', '3heart', 'spiller park pcm', 'refuge', 'toptime'],
                   'zip': [30324, 30312, 30318, 30317, 30306, 30308, 30303, 30318],
                   'poc': ['jared', 'kelly', 'matt', 'jonathan', 'nhan', 'dale', 'kitti', 'nolan']})
print("type:", type(cafes), "\n")
print(cafes, "\n")
display(cafes) # Or just `cafes` as the last line of a cell

type: <class 'pandas.core.frame.DataFrame'> 

               name    zip       poc
0         east pole  30324     jared
1     chrome yellow  30312     kelly
2             brash  30318      matt
3           taproom  30317  jonathan
4            3heart  30306      nhan
5  spiller park pcm  30308      dale
6            refuge  30303     kitti
7           toptime  30318     nolan 



               name    zip       poc
0         east pole  30324     jared
1     chrome yellow  30312     kelly
2             brash  30318      matt
3           taproom  30317  jonathan
4            3heart  30306      nhan
5  spiller park pcm  30308      dale
6            refuge  30303     kitti
7           toptime  30318     nolan


##### Indexing and Slicing in DataFrames

In [101]:
#DataFrames have named columns, which are stored as an `Index` 
print(cafes.columns)

#Each column is a named Series:
print(type(cafes['zip'])) 

Index(['name', 'zip', 'poc'], dtype='object')
<class 'pandas.core.series.Series'>


Selecting columns: You can also select multiple columns by passing a list of column names.

In [198]:
# Select multiple columns
target_fields = ['zip', 'poc'] 
cafes[target_fields]

     zip       poc
0  30324     jared
1  30312     kelly
2  30318      matt
3  30317  jonathan
4  30306      nhan
5  30308      dale
6  30303     kitti
7  30318     nolan


In [212]:
#Slices apply to rows.
cafes[1::2] #select every other row, starting from index 1

               name    zip       poc
1     chrome yellow  30312     kelly
3           taproom  30317  jonathan
5  spiller park pcm  30308      dale
7           toptime  30318     nolan


In [218]:
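cafes2 = cafes[['poc', 'zip']].copy()  # explicit copy, so later column assignments act on an independent frame
cafes2.index = cafes['name']  # assign the name column as the index
cafes2.index.name = None      # drop the leftover 'name' label on the index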
cafes2

                       poc    zip
east pole            jared  30324
chrome yellow        kelly  30312
brash                 matt  30318
taproom           jonathan  30317
3heart                nhan  30306
spiller park pcm      dale  30308
refuge               kitti  30303
toptime              nolan  30318
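
A possible alternative to assigning `.index` by hand is `DataFrame.set_index()`, which returns a new frame indexed by the chosen column. This is a sketch on a two-row made-up frame, `shops`, not a replacement for the cell above.

```python
import pandas as pd

shops = pd.DataFrame({'name': ['east pole', 'brash'],
                      'zip': [30324, 30318],
                      'poc': ['jared', 'matt']})

by_name = shops.set_index('name')[['poc', 'zip']]  # returns a new frame; `shops` is untouched
by_name.index.name = None                          # drop the 'name' label on the index

print(by_name)
```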


###### Accessing Rows and Columns using:
`.loc[]` - label-based selection

`.iloc[]` - integer-based selection (position)
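
The difference between the two is easiest to see when the index labels are integers that do not match row positions. The frame below, `frame`, is a made-up example with labels 10, 20, 30.

```python
import pandas as pd

# Integer labels that deliberately do not match positions 0, 1, 2.
frame = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie']}, index=[10, 20, 30])

by_label = frame.loc[10, 'name']    # row whose label is 10
by_position = frame.iloc[0, 0]      # first row, first column

try:
    frame.loc[0, 'name']            # no row is labeled 0
    label_zero_found = True
except KeyError:
    label_zero_found = False

print(by_label, by_position, label_zero_found)
```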

In [237]:
print(df, '\n')
print(df.loc[:, 'Name']) # Select all rows for the 'Name' column
print(df.iloc[:, 1])  # Select all rows for the second (Age) column

      Name  Age  Salary
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   35   70000 

0      Alice
1        Bob
2    Charlie
Name: Name, dtype: object
0    25
1    30
2    35
Name: Age, dtype: int64


In [109]:
# You can access subsets of rows by index label using the .loc indexer:
print(cafes2.loc[['chrome yellow', '3heart']], '\n')

# Alternatively, you can use integer positions via the .iloc indexer.
print(cafes2.iloc[[1, 3]])  # positions 1 and 3 (the 2nd and 4th rows)

                 poc    zip
chrome yellow  kelly  30312
3heart          nhan  30306 

                    poc    zip
chrome yellow     kelly  30312
taproom        jonathan  30317


##### Slicing Data:

In [246]:
# Slicing by label - loc (both endpoints are included)
print(df.loc[0:1, 'Name':'Salary'])  # Rows labeled 0 through 1, columns Name through Salary

# Slicing by position - iloc (the stop position is excluded)
print(df.iloc[0:2, 0:2])  # First two rows and first two columns

    Name  Age  Salary
0  Alice   25   50000
1    Bob   30   60000
    Name  Age
0  Alice   25
1    Bob   30
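
Note that the two slice styles treat their endpoints differently: `.loc` includes the stop label, while `.iloc` excludes the stop position, as Python lists do. A minimal sketch, rebuilding the same three-row frame under the made-up name `frame`:

```python
import pandas as pd

frame = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                      'Age': [25, 30, 35],
                      'Salary': [50000, 60000, 70000]})

label_slice = frame.loc[0:1, 'Name':'Age']   # labels 0 and 1, columns Name and Age (both ends included)
position_slice = frame.iloc[0:1, 0:1]        # position 0 only (the stop is excluded)

print(label_slice.shape)
print(position_slice.shape)
```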


In [113]:
#Adding columns
cafes2['rating'] = 4.0
cafes2['price'] = '$$'
cafes2

                       poc    zip  rating price
east pole            jared  30324     4.0    $$
chrome yellow        kelly  30312     4.0    $$
brash                 matt  30318     4.0    $$
taproom           jonathan  30317     4.0    $$
3heart                nhan  30306     4.0    $$
spiller park pcm      dale  30308     4.0    $$
refuge               kitti  30303     4.0    $$
toptime              nolan  30318     4.0    $$


#### Using `.apply()` Method
The `apply()` method applies a function along one axis of a DataFrame, or to each element of a Series. It is a flexible tool for transforming data.
- Column operations (`axis=0`, the default): the function receives each column as a Series.
- Row operations (`axis=1`): the function receives each row as a Series.
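
The two axis settings can be compared side by side. This sketch uses a small numeric frame with made-up values; `col_max` and `row_sum` are illustrative names.

```python
import pandas as pd

frame = pd.DataFrame({'Age': [25, 30, 35],
                      'Salary': [50000, 60000, 70000]})

col_max = frame.apply(lambda col: col.max())           # axis=0: one result per column
row_sum = frame.apply(lambda row: row.sum(), axis=1)   # axis=1: one result per row

print(col_max, "\n")
print(row_sum)
```

The result is indexed by column names in the first case and by row labels in the second.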

In [272]:
#Row Operations 
df['Salary_after_tax'] = df.apply(lambda row: row['Salary'] * 0.7, axis=1)
print(df, '\n')

      Name  Age  Salary  Salary_after_tax
0    Alice   25   50000           35000.0
1      Bob   30   60000           42000.0
2  Charlie   35   70000           49000.0 



In [115]:
#Vector arithmetic on columns
prices_as_ints = cafes2['price'].apply(lambda s: len(s))
print(prices_as_ints, '\n')

cafes2['value'] = cafes2['rating'] / prices_as_ints
print(cafes2)

east pole           2
chrome yellow       2
brash               2
taproom             2
3heart              2
spiller park pcm    2
refuge              2
toptime             2
Name: price, dtype: int64 

                       poc    zip  rating price  value
east pole            jared  30324     4.0    $$    2.0
chrome yellow        kelly  30312     4.0    $$    2.0
brash                 matt  30318     4.0    $$    2.0
taproom           jonathan  30317     4.0    $$    2.0
3heart                nhan  30306     4.0    $$    2.0
spiller park pcm      dale  30308     4.0    $$    2.0
refuge               kitti  30303     4.0    $$    2.0
toptime              nolan  30318     4.0    $$    2.0


In the above example, vector arithmetic works because all the Series objects involved have identical indexes. Because the columns are Series objects, values are implicitly matched on their index labels.

This won't work in the following example:

In [142]:
cafes3 = cafes2.copy()
is_fancy = cafes3['zip'].isin({30306, 30308})
# Alternative:
#is_fancy = cafes3['zip'].apply(lambda z: z in {30306, 30308})
print(is_fancy, '\n')
display(cafes3[is_fancy])

east pole           False
chrome yellow       False
brash               False
taproom             False
3heart               True
spiller park pcm     True
refuge              False
toptime             False
Name: zip, dtype: bool 



                   poc    zip  rating price  value
3heart            nhan  30306     4.0    $$    2.0
spiller park pcm  dale  30308     4.0    $$    2.0


In [150]:
# Concatenation: add an extra $ to the price of fancy restaurants
cafes3[is_fancy]['price'] += '$'   # Chained indexing: triggers the warning below and leaves cafes3 unchanged

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cafes3[is_fancy]['price'] += '$' #Error


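When you select rows with a boolean mask, as in `cafes3[is_fancy]`, you get a copy of the original data, not a reference to a subset of it, so the update above lands on a temporary object. Therefore, we'll need a different strategy: use `.loc` to select and assign in a single step.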
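
The copy behavior can be checked directly. The sketch below uses a three-row made-up frame, `shops`, edits a boolean-mask selection, and confirms the original is unchanged before applying the `.loc` form. The copy warning is silenced here only to keep the output short.

```python
import warnings
import pandas as pd

shops = pd.DataFrame({'price': ['$$', '$$', '$$']}, index=['a', 'b', 'c'])
mask = pd.Series([False, True, True], index=shops.index)

with warnings.catch_warnings():
    warnings.simplefilter('ignore')           # silence the copy warning for this demonstration
    subset = shops[mask]                      # boolean selection produces a new object
    subset['price'] = subset['price'] + '$'   # changes `subset` only

after_copy_edit = shops['price'].tolist()     # still all '$$'

shops.loc[mask, 'price'] += '$'               # one indexing step on the original frame
after_loc_edit = shops['price'].tolist()

print(after_copy_edit)
print(after_loc_edit)
```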

In [148]:
cafes3.loc[is_fancy, 'price'] += '$'
cafes3

                       poc    zip  rating price  value
east pole            jared  30324     4.0    $$    2.0
chrome yellow        kelly  30312     4.0    $$    2.0
brash                 matt  30318     4.0    $$    2.0
taproom           jonathan  30317     4.0    $$    2.0
3heart                nhan  30306     4.0   $$$    2.0
spiller park pcm      dale  30308     4.0   $$$    2.0
refuge               kitti  30303     4.0    $$    2.0
toptime              nolan  30318     4.0    $$    2.0


**A different approach.** Let's see if we can solve this problem in other ways to see what may or may not work.

In [156]:
cafes4 = cafes2.copy() # Start over

#Attempt 1:
fancy_shops = cafes4.index[is_fancy]
print(fancy_shops, '\n')

fancy_markup = Series(['$'] * len(fancy_shops), index=fancy_shops)
print(fancy_markup, '\n')

cafes4['price'] + fancy_markup  # Incorrect: index alignment yields NaN for every shop missing from fancy_markup

Index(['3heart', 'spiller park pcm'], dtype='object') 

3heart              $
spiller park pcm    $
dtype: object 



3heart              $$$
brash               NaN
chrome yellow       NaN
east pole           NaN
refuge              NaN
spiller park pcm    $$$
taproom             NaN
toptime             NaN
dtype: object

In [158]:
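# Attempt 2: build a markup Series that covers every shop
cafes4 = cafes2.copy()
# True * '$' is '$' and False * '$' is '', so only fancy shops get an extra '$'
cafes4['price'] += Series([x * '$' for x in is_fancy.tolist()], index=is_fancy.index)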
cafes4

                       poc    zip  rating price  value
east pole            jared  30324     4.0    $$    2.0
chrome yellow        kelly  30312     4.0    $$    2.0
brash                 matt  30318     4.0    $$    2.0
taproom           jonathan  30317     4.0    $$    2.0
3heart                nhan  30306     4.0   $$$    2.0
spiller park pcm      dale  30308     4.0   $$$    2.0
refuge               kitti  30303     4.0    $$    2.0
toptime              nolan  30318     4.0    $$    2.0


More on **apply() for DataFrame objects**. As with a Series, there is a `DataFrame.apply()` procedure. However, its meaning is a bit more nuanced because a DataFrame is generally 2-D rather than 1-D.

In [168]:
print(cafes4.apply(lambda x: repr(type(x))), '\n') # What does this do? What does the output tell you?
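# The axis parameter controls whether the function receives columns (axis=0, the default) or rows (axis=1)
print(cafes4.apply(lambda x: repr(type(x)), axis=1), '\n') # What does this do? What does the output tell you?
# Verify what the function receives when axis=1 by printing the 'east pole' row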
cafes4.apply(lambda x: print('==> ' + x.name + '\n' + repr(x)) if x.name == 'east pole' else None, axis=1);

poc       <class 'pandas.core.series.Series'>
zip       <class 'pandas.core.series.Series'>
rating    <class 'pandas.core.series.Series'>
price     <class 'pandas.core.series.Series'>
value     <class 'pandas.core.series.Series'>
dtype: object 

east pole           <class 'pandas.core.series.Series'>
chrome yellow       <class 'pandas.core.series.Series'>
brash               <class 'pandas.core.series.Series'>
taproom             <class 'pandas.core.series.Series'>
3heart              <class 'pandas.core.series.Series'>
spiller park pcm    <class 'pandas.core.series.Series'>
refuge              <class 'pandas.core.series.Series'>
toptime             <class 'pandas.core.series.Series'>
dtype: object 

==> east pole
poc       jared
zip       30324
rating      4.0
price        $$
value       2.0
Name: east pole, dtype: object


**Exercise.** Use `DataFrame.apply()` to update the `'value'` column in `cafes4`, which is out of date given the update of the prices.

In [171]:
print(cafes4, '\n') # Verify visually that `'value'` is out of date


def calc_value(row):
    return row['rating'] / len(row['price'])

cafes4['value'] = cafes4.apply(calc_value, axis=1)
print(cafes4)

                       poc    zip  rating price  value
east pole            jared  30324     4.0    $$    2.0
chrome yellow        kelly  30312     4.0    $$    2.0
brash                 matt  30318     4.0    $$    2.0
taproom           jonathan  30317     4.0    $$    2.0
3heart                nhan  30306     4.0   $$$    2.0
spiller park pcm      dale  30308     4.0   $$$    2.0
refuge               kitti  30303     4.0    $$    2.0
toptime              nolan  30318     4.0    $$    2.0 

                       poc    zip  rating price     value
east pole            jared  30324     4.0    $$  2.000000
chrome yellow        kelly  30312     4.0    $$  2.000000
brash                 matt  30318     4.0    $$  2.000000
taproom           jonathan  30317     4.0    $$  2.000000
3heart                nhan  30306     4.0   $$$  1.333333
spiller park pcm      dale  30308     4.0   $$$  1.333333
refuge               kitti  30303     4.0    $$  2.000000
toptime              nolan  30318     4.0    $$  2.000000
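
The row-wise `apply()` above is one way to solve the exercise. A possible alternative, sketched here on a three-row made-up frame `shops`, is to work on whole columns with the `.str.len()` accessor, which avoids calling a Python function once per row.

```python
import pandas as pd

shops = pd.DataFrame({'rating': [4.0, 4.0, 4.0],
                      'price': ['$$', '$$$', '$$']},
                     index=['east pole', '3heart', 'refuge'])

by_apply = shops.apply(lambda row: row['rating'] / len(row['price']), axis=1)
by_columns = shops['rating'] / shops['price'].str.len()   # whole-column arithmetic

print(by_columns)
```

Both forms give the same values; the column form relies on the two columns sharing an index.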

#### Grouping and Aggregating
You can group data based on one or more columns and then apply aggregation functions.

In [None]:
print(df, '\n')

df_grouped = df.groupby('Age')['Salary'].mean()  # Group by Age and calculate mean Salary
print(df_grouped)
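
Grouping is more informative when a key repeats. In `df` every `Age` is unique, so each group holds one row. The sketch below uses a made-up frame, `shops`, in which one zip code appears twice; the ratings are invented for illustration.

```python
import pandas as pd

shops = pd.DataFrame({'name': ['brash', 'toptime', 'refuge', '3heart'],
                      'zip': [30318, 30318, 30303, 30306],
                      'rating': [4.5, 3.5, 4.0, 5.0]})   # ratings are made up for illustration

mean_by_zip = shops.groupby('zip')['rating'].mean()               # one value per zip code
summary = shops.groupby('zip')['rating'].agg(['mean', 'count'])   # several aggregates at once

print(mean_by_zip, "\n")
print(summary)
```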

#### Merging and Joining DataFrames
You can combine DataFrames using merge operations, similar to SQL joins:

In [293]:
# Creating two DataFrames to merge
df1 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
df2 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Salary': [50000, 60000]})

# Merging based on 'Name' column
df_merged = pd.merge(df1, df2, on='Name')
print(df_merged)
print(df1)  # the two inputs, for comparison
print(df2)

    Name  Age  Salary
0  Alice   25   50000
1    Bob   30   60000
    Name  Age
0  Alice   25
1    Bob   30
    Name  Salary
0  Alice   50000
1    Bob   60000
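
The SQL analogy extends to the join type, which `pd.merge()` selects with the `how` parameter. In the example above every name appears in both frames, so the join type makes no difference. The sketch below uses two made-up frames, `ages` and `salaries`, whose keys only partly overlap.

```python
import pandas as pd

ages = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]})
salaries = pd.DataFrame({'Name': ['Alice', 'Bob', 'Dana'],
                         'Salary': [50000, 60000, 80000]})  # 'Dana' is a made-up row

inner = pd.merge(ages, salaries, on='Name')               # default how='inner': names in both frames
left = pd.merge(ages, salaries, on='Name', how='left')    # every row of `ages`; missing Salary is NaN
outer = pd.merge(ages, salaries, on='Name', how='outer')  # names from either frame

print(inner, "\n")
print(left, "\n")
print(outer)
```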
