# Pandas Notes

Pandas Overview
Pandas is a Python library used for data manipulation and analysis, especially for tabular data. It provides two main structures:

1. Series: A one-dimensional array-like object that can hold any data type, like a column in a table.
2. DataFrame: A two-dimensional, table-like data structure where rows and columns are labeled. It's the primary structure for handling data in Pandas. Selecting a column with double brackets (`df[['col']]`) returns a DataFrame, while single brackets (`df['col']`) return a Series.
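
As a quick check of the bracket rule, here is a minimal sketch. The table `demo` and its columns `x` and `y` are made up for illustration and do not appear elsewhere in these notes.

```python
import pandas as pd

# Hypothetical two-column table used only for illustration
demo = pd.DataFrame({'x': [1, 2, 3], 'y': [4.0, 5.0, 6.0]})

single = demo['x']      # single brackets: one column as a Series
double = demo[['x']]    # double brackets: a one-column DataFrame

print(type(single).__name__, single.shape)
print(type(double).__name__, double.shape)
```

The Series is one-dimensional, while the one-column DataFrame keeps a second (column) dimension.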

In [316]:
import pandas as pd  # Standard idiom for loading pandas
from pandas import DataFrame, Series

### Series Objects

A [`Series`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html) object is a column-oriented object that we will use to store a variable of a tibble.

##### Creating a Series

In [319]:
import pandas as pd

#Create a Series
s = pd.Series([10, 20, 30], index =['a', 'b', 'c']) #default index if not specified is numbered index
s

a    10
b    20
c    30
dtype: int64

In [320]:
from pandas import DataFrame, Series

obj = Series([-1, 2, -3, 4, -5])
print(obj)

0   -1
1    2
2   -3
3    4
4   -5
dtype: int64


Index provides a convenient way to reference individual elements of the Series.

By default, a `Series` has an index that is akin to `range()` in standard Python, and effectively numbers the entries from 0 to n-1, where n is the length of the Series. A Series object also becomes list-like in how you reference its elements.

You can also use more complex index objects, like lists of integers and conditional masks.

In [322]:
I = [0, 2, 3]
obj[I] # Also: obj[[0, 2, 3]]

0   -1
2   -3
3    4
dtype: int64

In [323]:
I_pos = obj > 0
print(type(I_pos), I_pos)

<class 'pandas.core.series.Series'> 0    False
1     True
2    False
3     True
4    False
dtype: bool


In [324]:
print(obj[I_pos])

1    2
3    4
dtype: int64


However, the index can be a more general structure, which effectively turns a Series object into something that is "dictionary-like."

In [326]:
obj3 = Series([      1,    -2,       3,     -4,        5,      -6],
              ['alice', 'bob', 'carol', 'dave', 'esther', 'frank'])
print(f'{obj3} \n')
print("* obj3['bob']: {}\n".format(obj3['bob']))

alice     1
bob      -2
carol     3
dave     -4
esther    5
frank    -6
dtype: int64 

* obj3['bob']: -2



You can construct a Series from a dictionary directly:

In [328]:
peeps = {'alice': 1, 'carol': 3, 'esther': 5, 'bob': -2, 'dave': -4, 'frank': -6}
obj4 = Series(peeps)
print(obj4)

alice     1
carol     3
esther    5
bob      -2
dave     -4
frank    -6
dtype: int64


In [329]:
mujeres = [0, 2, 4] # list of integer offsets
print("* las mujeres of `obj3` at offsets {}:\n{}\n".format(mujeres, obj3.iloc[mujeres])) # .iloc makes the positional lookup explicit

* las mujeres of `obj3` at offsets [0, 2, 4]:
alice     1
carol     3
esther    5
dtype: int64





Basic arithmetic works on `Series` as vector-like operations.

In [331]:
print(obj3, "\n")
print(obj3 + 5, "\n")
print(obj3 + 5 > 0, "\n")
print((-2.5 * obj3) + (obj3 + 5))

alice     1
bob      -2
carol     3
dave     -4
esther    5
frank    -6
dtype: int64 

alice      6
bob        3
carol      8
dave       1
esther    10
frank     -1
dtype: int64 

alice      True
bob        True
carol      True
dave       True
esther     True
frank     False
dtype: bool 

alice      3.5
bob        8.0
carol      0.5
dave      11.0
esther    -2.5
frank     14.0
dtype: float64


A Series object also supports vector-style operations with automatic alignment based on index values.

In [333]:
print(obj3, "\n")

obj_l = obj3.iloc[mujeres] # positional selection on a label-indexed Series
print(obj_l, "\n")


print(obj3 + obj_l, "\n")

alice     1
bob      -2
carol     3
dave     -4
esther    5
frank    -6
dtype: int64 

alice     1
carol     3
esther    5
dtype: int64 

alice      2.0
bob        NaN
carol      6.0
dave       NaN
esther    10.0
frank      NaN
dtype: float64 
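
If the NaN entries produced by alignment are unwanted, one option is `Series.add` with `fill_value`. The sketch below is not part of the original notebook; the names `scores` and `subset` are illustrative stand-ins for `obj3` and `obj_l`.

```python
import pandas as pd

scores = pd.Series([1, -2, 3], index=['alice', 'bob', 'carol'])
subset = scores.iloc[[0, 2]]          # alice and carol only

plain = scores + subset                      # bob has no partner, so bob becomes NaN
filled = scores.add(subset, fill_value=0)    # a missing partner is treated as 0

print(plain)
print(filled)
```

Whether 0 is a sensible stand-in depends on the data; here it simply leaves unmatched entries unchanged.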





Another useful transformation is the `.apply(fun)` method. It returns a copy of the Series where the function fun has been applied to each element.

In [335]:
print(f'{obj3} \n')
print(obj3.apply(abs)) #apply abs value function to all of series and return a copy

alice     1
bob      -2
carol     3
dave     -4
esther    5
frank    -6
dtype: int64 

alice     1
bob       2
carol     3
dave      4
esther    5
frank     6
dtype: int64


A `Series` may be _named_, too.

In [337]:
obj3.name = 'peep'

print(f'{obj3}\n')
print(obj3.name)

alice     1
bob      -2
carol     3
dave     -4
esther    5
frank    -6
Name: peep, dtype: int64

peep


### DataFrame Objects

A pandas `DataFrame` object is a table whose columns are Series objects, all keyed on the same index. It's the perfect container for what we have been referring to as a tibble.

##### Creating a DataFrame

In [213]:
#Creating a DataFrame
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'Molly'],
        'Age': [25, 33, 41, 36],
        'Salary': [60000, 78000, 98000, 80000]}
df = pd.DataFrame(data)
print(df)

      Name  Age  Salary
0    Alice   25   60000
1      Bob   33   78000
2  Charlie   41   98000
3    Molly   36   80000


In [320]:
import numpy as np  # needed for np.mean below

#Adding columns
df['Level'] = ['Entry', 'Mid-level', 'Senior', 'Mid-level']
df['Bonus'] = df['Salary'] * 0.10
display(df)

#Deleting columns
df2 = df.copy()
df2.drop('Bonus', axis=1, inplace=True) #inplace=True modifies df2 itself; axis=1 means column

avg_salary = df[['Salary']].apply(np.mean) #entire column

#Aggregated column: Total Pay
df[['Salary', 'Bonus']].apply(sum, axis=1) #each row (result is not stored)
df['Total_Salary'] = df[['Salary']].apply(lambda x: x*0.10 + x, axis=1) #each row
display(df)

#Grouping certain columns by agg
grouped_df = df.groupby('Level')[['Total_Salary', 'Age']].mean()
display(grouped_df)

#Filtering
filtered_df = df[(df['Salary'] <= 80000) & (df['Age'] < 27)]
filtered_df

#Delete df
del filtered_df

Unnamed: 0,Name,Age,Salary,Level,Bonus,Total_Salary
0,Alice,25,60000,Entry,6000.0,66000.0
1,Bob,33,78000,Mid-level,7800.0,85800.0
2,Charlie,41,98000,Senior,9800.0,107800.0
3,Molly,36,80000,Mid-level,8000.0,88000.0


Unnamed: 0,Name,Age,Salary,Level,Bonus,Total_Salary
0,Alice,25,60000,Entry,6000.0,66000.0
1,Bob,33,78000,Mid-level,7800.0,85800.0
2,Charlie,41,98000,Senior,9800.0,107800.0
3,Molly,36,80000,Mid-level,8000.0,88000.0


Unnamed: 0_level_0,Total_Salary,Age
Level,Unnamed: 1_level_1,Unnamed: 2_level_1
Entry,66000.0,25.0
Mid-level,86900.0,34.5
Senior,107800.0,41.0


In [342]:
#Creating dataframe
cafes = DataFrame({'name': ['east pole', 'chrome yellow', 'brash', 'taproom', '3heart', 'spiller park pcm', 'refuge', 'toptime'],
                   'zip': [30324, 30312, 30318, 30317, 30306, 30308, 30303, 30318],
                   'poc': ['jared', 'kelly', 'matt', 'jonathan', 'nhan', 'dale', 'kitti', 'nolan']})
print("type:", type(cafes), "\n")
print(cafes, "\n")
display(cafes) # Or just `cafes` as the last line of a cell

type: <class 'pandas.core.frame.DataFrame'> 

               name    zip       poc
0         east pole  30324     jared
1     chrome yellow  30312     kelly
2             brash  30318      matt
3           taproom  30317  jonathan
4            3heart  30306      nhan
5  spiller park pcm  30308      dale
6            refuge  30303     kitti
7           toptime  30318     nolan 



Unnamed: 0,name,zip,poc
0,east pole,30324,jared
1,chrome yellow,30312,kelly
2,brash,30318,matt
3,taproom,30317,jonathan
4,3heart,30306,nhan
5,spiller park pcm,30308,dale
6,refuge,30303,kitti
7,toptime,30318,nolan


##### Indexing and Slicing in DataFrames

In [344]:
#DataFrames have named columns, which are stored as an `Index` 
print(cafes.columns)

#Each column is a named Series:
print(type(cafes['zip'])) 

Index(['name', 'zip', 'poc'], dtype='object')
<class 'pandas.core.series.Series'>


Selecting columns: You can also select multiple columns by passing a list of column names.

In [346]:
# Select multiple columns
target_fields = ['zip', 'poc'] 
cafes[target_fields]

Unnamed: 0,zip,poc
0,30324,jared
1,30312,kelly
2,30318,matt
3,30317,jonathan
4,30306,nhan
5,30308,dale
6,30303,kitti
7,30318,nolan


In [347]:
#Slices apply to rows.
cafes[1::2] #select every other row, starting from index 1

Unnamed: 0,name,zip,poc
1,chrome yellow,30312,kelly
3,taproom,30317,jonathan
5,spiller park pcm,30308,dale
7,toptime,30318,nolan


In [348]:
cafes2 = cafes[['poc', 'zip']]
cafes2.index = cafes['name'] #assign the name column as the index
cafes2.index.name = None     
cafes2

Unnamed: 0,poc,zip
east pole,jared,30324
chrome yellow,kelly,30312
brash,matt,30318
taproom,jonathan,30317
3heart,nhan,30306
spiller park pcm,dale,30308
refuge,kitti,30303
toptime,nolan,30318


###### Accessing Rows and Columns using:
`.loc[]` - label-based selection

`.iloc[]` - integer-based selection (position)

In [350]:
print(df, '\n')
print(df.loc[:, 'Name']) # Select all rows for the 'Name' column
print(df.iloc[:, 1])  # Select all rows for the second (Age) column

      Name  Age  Salary
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   35   70000 

0      Alice
1        Bob
2    Charlie
Name: Name, dtype: object
0    25
1    30
2    35
Name: Age, dtype: int64


In [351]:
#You can access subsets of rows using the .loc field and index values:
print(cafes2.loc[['chrome yellow', '3heart']], '\n')

#Alternatively, you can use integer offsets via the .iloc field index positions.
print(cafes2.iloc[[1, 3]]) #row index 1 & 3(2nd and 4th row)

                 poc    zip
chrome yellow  kelly  30312
3heart          nhan  30306 

                    poc    zip
chrome yellow     kelly  30312
taproom        jonathan  30317


##### Slicing Data:

In [353]:
# Slicing by label - loc
print(df.loc[0:1, 'Name':'Salary'])  # Rows 0 to 1 and columns from Name to Salary

# Slicing by position - iloc
print(df.iloc[0:2, 0:2])  # First two rows and first two columns

    Name  Age  Salary
0  Alice   25   50000
1    Bob   30   60000
    Name  Age
0  Alice   25
1    Bob   30
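
One detail worth checking in the slicing output: `.loc` slices include both endpoints, while `.iloc` slices exclude the stop position, as ordinary Python slices do. A minimal sketch, using an illustrative frame `people` that mirrors the three-row `df` printed above:

```python
import pandas as pd

people = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                       'Age': [25, 30, 35],
                       'Salary': [50000, 60000, 70000]})

by_label = people.loc[0:1, 'Name':'Age']   # labels 0 AND 1; columns Name through Age
by_position = people.iloc[0:1, 0:1]        # position 0 only; first column only

print(by_label.shape)
print(by_position.shape)
```

The same `0:1` therefore selects two rows with `.loc` but only one row with `.iloc`.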


In [354]:
#Adding columns
cafes2['rating'] = 4.0
cafes2['price'] = '$$'
cafes2

Unnamed: 0,poc,zip,rating,price
east pole,jared,30324,4.0,$$
chrome yellow,kelly,30312,4.0,$$
brash,matt,30318,4.0,$$
taproom,jonathan,30317,4.0,$$
3heart,nhan,30306,4.0,$$
spiller park pcm,dale,30308,4.0,$$
refuge,kitti,30303,4.0,$$
toptime,nolan,30318,4.0,$$


#### Using `.apply()` Method
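The apply() method allows you to apply a function along an axis of a DataFrame or to each element of a Series. It’s a powerful tool for transforming data.
- Column Operations (`axis = 0`, the default): the function receives each column in turn (it runs vertically, down the rows)
- Row Operations (`axis = 1`): the function receives each row in turn (it runs horizontally, across the columns)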
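
The two axis settings can be compared side by side. This is a minimal sketch on a made-up frame `pay`; the `Bonus` values are illustrative only.

```python
import pandas as pd

pay = pd.DataFrame({'Salary': [50000, 60000], 'Bonus': [5000, 6000]})

# axis=0 (default): the lambda receives one column at a time
per_column = pay.apply(lambda col: col.max() - col.min())

# axis=1: the lambda receives one row at a time
per_row = pay.apply(lambda row: row['Salary'] + row['Bonus'], axis=1)

print(per_column)
print(per_row)
```

The first result is indexed by column name, the second by row label.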

In [523]:
display(df_merged)
print(df_merged[['Salary']].apply(sum)) #sum of entire column
print(df_merged[['Salary']].apply(sum, axis = 1)) #sum of each row

Unnamed: 0,Name,Age,Salary
0,Alice,25,50000
1,Bob,30,60000


Salary    110000
dtype: int64
0    50000
1    60000
dtype: int64


In [356]:
#Row Operations 
df['Salary_after_tax'] = df.apply(lambda row: row['Salary'] * 0.7, axis=1)
print(df, '\n')

      Name  Age  Salary  Salary_after_tax
0    Alice   25   50000           35000.0
1      Bob   30   60000           42000.0
2  Charlie   35   70000           49000.0 



In [357]:
#Vector arithmetic on columns
prices_as_ints = cafes2['price'].apply(lambda s: len(s))
print(prices_as_ints, '\n')

cafes2['value'] = cafes2['rating'] / prices_as_ints
print(cafes2)

east pole           2
chrome yellow       2
brash               2
taproom             2
3heart              2
spiller park pcm    2
refuge              2
toptime             2
Name: price, dtype: int64 

                       poc    zip  rating price  value
east pole            jared  30324     4.0    $$    2.0
chrome yellow        kelly  30312     4.0    $$    2.0
brash                 matt  30318     4.0    $$    2.0
taproom           jonathan  30317     4.0    $$    2.0
3heart                nhan  30306     4.0    $$    2.0
spiller park pcm      dale  30308     4.0    $$    2.0
refuge               kitti  30303     4.0    $$    2.0
toptime              nolan  30318     4.0    $$    2.0


In the above example, vector arithmetic works because all the Series objects involved have identical indexes. Because the columns are Series objects, an implicit matching on the index values takes place.

This won't work in the following example:

In [359]:
cafes3 = cafes2.copy()
is_fancy = cafes3['zip'].isin({30306, 30308})
# Alternative:
#is_fancy = cafes3['zip'].apply(lambda z: z in {30306, 30308})
print(is_fancy, '\n')
display(cafes3[is_fancy])

east pole           False
chrome yellow       False
brash               False
taproom             False
3heart               True
spiller park pcm     True
refuge              False
toptime             False
Name: zip, dtype: bool 



Unnamed: 0,poc,zip,rating,price,value
3heart,nhan,30306,4.0,$$,2.0
spiller park pcm,dale,30308,4.0,$$,2.0


In [360]:
#Concatenation: Add extra $ to price for fancy restaurants
cafes3[is_fancy]['price'] += '$'   #WARNING: SettingWithCopyWarning; cafes3 is not updated

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cafes3[is_fancy]['price'] += '$'   #ERROR


Chained indexing such as `cafes3[is_fancy]['price']` first builds a filtered copy of the rows, so the `+=` updates that temporary copy rather than the original data. Therefore, we'll need a different strategy: select the rows and the column in a single `.loc` call.

In [362]:
cafes3.loc[is_fancy, 'price'] += '$'
cafes3

Unnamed: 0,poc,zip,rating,price,value
east pole,jared,30324,4.0,$$,2.0
chrome yellow,kelly,30312,4.0,$$,2.0
brash,matt,30318,4.0,$$,2.0
taproom,jonathan,30317,4.0,$$,2.0
3heart,nhan,30306,4.0,$$$,2.0
spiller park pcm,dale,30308,4.0,$$$,2.0
refuge,kitti,30303,4.0,$$,2.0
toptime,nolan,30318,4.0,$$,2.0


**A different approach.** Let's see if we can solve this problem in other ways to see what may or may not work.

In [364]:
cafes4 = cafes2.copy() # Start over

#Attempt 1:
fancy_shops = cafes4.index[is_fancy]
print(fancy_shops, '\n')

fancy_markup = Series(['$'] * len(fancy_shops), index=fancy_shops)
print(fancy_markup, '\n')

cafes4['price'] + fancy_markup  #Incorrect: shops missing from fancy_markup come out as NaN after index alignment

Index(['3heart', 'spiller park pcm'], dtype='object') 

3heart              $
spiller park pcm    $
dtype: object 



3heart              $$$
brash               NaN
chrome yellow       NaN
east pole           NaN
refuge              NaN
spiller park pcm    $$$
taproom             NaN
toptime             NaN
dtype: object

In [365]:
#Attempt 2
cafes4 = cafes2.copy()
cafes4['price'] += Series([x * '$' for x in is_fancy.tolist()], index=is_fancy.index)
cafes4

Unnamed: 0,poc,zip,rating,price,value
east pole,jared,30324,4.0,$$,2.0
chrome yellow,kelly,30312,4.0,$$,2.0
brash,matt,30318,4.0,$$,2.0
taproom,jonathan,30317,4.0,$$,2.0
3heart,nhan,30306,4.0,$$$,2.0
spiller park pcm,dale,30308,4.0,$$$,2.0
refuge,kitti,30303,4.0,$$,2.0
toptime,nolan,30318,4.0,$$,2.0
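
A further possibility, not tried in the original notebook, is `Series.where`, which keeps a value where a condition holds and substitutes another value elsewhere. The sketch below uses a small stand-in frame `shops` rather than `cafes4`.

```python
import pandas as pd

shops = pd.DataFrame({'zip': [30324, 30306, 30308],
                      'price': ['$$', '$$', '$$']},
                     index=['east pole', '3heart', 'spiller park pcm'])
fancy = shops['zip'].isin([30306, 30308])

# Keep the old price where the shop is NOT fancy; otherwise use price + '$'
shops['price'] = shops['price'].where(~fancy, shops['price'] + '$')
print(shops)
```

Because both operands share the same index, no NaN values are introduced by alignment.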


More on **apply() for DataFrame objects**. As with a Series, there is a `DataFrame.apply()` procedure. However, its meaning is a bit more nuanced because a DataFrame is generally 2-D rather than 1-D.

In [367]:
print(cafes4.apply(lambda x: repr(type(x))), '\n') # What does this do? What does the output tell you?
#axis parameter useful
print(cafes4.apply(lambda x: repr(type(x)), axis=1), '\n') # What does this do? What does the output tell you?
#verify what you get when axis=1
cafes4.apply(lambda x: print('==> ' + x.name + '\n' + repr(x)) if x.name == 'east pole' else None, axis=1);

poc       <class 'pandas.core.series.Series'>
zip       <class 'pandas.core.series.Series'>
rating    <class 'pandas.core.series.Series'>
price     <class 'pandas.core.series.Series'>
value     <class 'pandas.core.series.Series'>
dtype: object 

east pole           <class 'pandas.core.series.Series'>
chrome yellow       <class 'pandas.core.series.Series'>
brash               <class 'pandas.core.series.Series'>
taproom             <class 'pandas.core.series.Series'>
3heart              <class 'pandas.core.series.Series'>
spiller park pcm    <class 'pandas.core.series.Series'>
refuge              <class 'pandas.core.series.Series'>
toptime             <class 'pandas.core.series.Series'>
dtype: object 

==> east pole
poc       jared
zip       30324
rating      4.0
price        $$
value       2.0
Name: east pole, dtype: object


**Exercise.** Use `DataFrame.apply()` to update the `'value'` column in `cafes4`, which is out of date given the update of the prices.

In [369]:
print(cafes4, '\n') # Verify visually that `'value'` is out of date


def calc_value(row):
    return row['rating'] / len(row['price'])

cafes4['value'] = cafes4.apply(calc_value, axis=1)
print(cafes4)

                       poc    zip  rating price  value
east pole            jared  30324     4.0    $$    2.0
chrome yellow        kelly  30312     4.0    $$    2.0
brash                 matt  30318     4.0    $$    2.0
taproom           jonathan  30317     4.0    $$    2.0
3heart                nhan  30306     4.0   $$$    2.0
spiller park pcm      dale  30308     4.0   $$$    2.0
refuge               kitti  30303     4.0    $$    2.0
toptime              nolan  30318     4.0    $$    2.0 

                       poc    zip  rating price     value
east pole            jared  30324     4.0    $$  2.000000
chrome yellow        kelly  30312     4.0    $$  2.000000
brash                 matt  30318     4.0    $$  2.000000
taproom           jonathan  30317     4.0    $$  2.000000
3heart                nhan  30306     4.0   $$$  1.333333
spiller park pcm      dale  30308     4.0   $$$  1.333333
refuge               kitti  30303     4.0    $$  2.000000
toptime              nolan  30318     4.0    $$  2.000000

## Index Objects 
A pandas Index is "list-like." It holds the row labels and has a number of useful operations, including set-like operations (e.g., testing for membership, intersection, union, difference).

In [371]:
from pandas import Index

#index names
cafes4.index 

Index(['east pole', 'chrome yellow', 'brash', 'taproom', '3heart',
       'spiller park pcm', 'refuge', 'toptime'],
      dtype='object')

Index values can be duplicated, so there can be more than one row with the same index value. We can use the `Index.duplicated()` function to identify all the duplicate values.

In [373]:
idx = pd.Index(['Labrador', 'Beagle', 'Labrador',
                      'Lhasa', 'Husky', 'Beagle']) #row names(index)

# Identify all duplicated occurrence of values
idx.duplicated(keep = False)

array([ True,  True,  True, False, False,  True])

The `isin()` method returns a boolean mask array with one True/False entry per index value, indicating whether that value is among the values passed to `isin()`.

In [375]:
# boolean mask/membership
cafes4.index.isin(['brash', '3heart'])

array([False, False,  True, False,  True, False, False, False])

In [376]:
cafes4.index

# create a union with a new value
union = cafes4.index.union(['chattahoochee'])
print(union, '\n')

# return the difference
difference = cafes4.index.difference(['chattahoochee', 'starbucks', 'bar crema'])
print(difference, '\n')

#change the index of a DataFrame
cafes5 = cafes4.reindex(Index(['3heart', 'east pole', 'brash', 'starbucks']))
display(cafes4)
display(cafes5)

Index(['3heart', 'brash', 'chattahoochee', 'chrome yellow', 'east pole',
       'refuge', 'spiller park pcm', 'taproom', 'toptime'],
      dtype='object') 

Index(['3heart', 'brash', 'chrome yellow', 'east pole', 'refuge',
       'spiller park pcm', 'taproom', 'toptime'],
      dtype='object') 



Unnamed: 0,poc,zip,rating,price,value
east pole,jared,30324,4.0,$$,2.0
chrome yellow,kelly,30312,4.0,$$,2.0
brash,matt,30318,4.0,$$,2.0
taproom,jonathan,30317,4.0,$$,2.0
3heart,nhan,30306,4.0,$$$,1.333333
spiller park pcm,dale,30308,4.0,$$$,1.333333
refuge,kitti,30303,4.0,$$,2.0
toptime,nolan,30318,4.0,$$,2.0


Unnamed: 0,poc,zip,rating,price,value
3heart,nhan,30306.0,4.0,$$$,1.333333
east pole,jared,30324.0,4.0,$$,2.0
brash,matt,30318.0,4.0,$$,2.0
starbucks,,,,,


Another useful operation is dropping the index (and replacing it with the default, integers).

In [378]:
# Reset the index from the string names to integers in new df
cafes6 = cafes4.reset_index(drop=True)
print(cafes6, '\n')

# Create a new column from the index of the other df
cafes6['name'] = cafes4.index
print(cafes6, '\n')

        poc    zip  rating price     value
0     jared  30324     4.0    $$  2.000000
1     kelly  30312     4.0    $$  2.000000
2      matt  30318     4.0    $$  2.000000
3  jonathan  30317     4.0    $$  2.000000
4      nhan  30306     4.0   $$$  1.333333
5      dale  30308     4.0   $$$  1.333333
6     kitti  30303     4.0    $$  2.000000
7     nolan  30318     4.0    $$  2.000000 

        poc    zip  rating price     value              name
0     jared  30324     4.0    $$  2.000000         east pole
1     kelly  30312     4.0    $$  2.000000     chrome yellow
2      matt  30318     4.0    $$  2.000000             brash
3  jonathan  30317     4.0    $$  2.000000           taproom
4      nhan  30306     4.0   $$$  1.333333            3heart
5      dale  30308     4.0   $$$  1.333333  spiller park pcm
6     kitti  30303     4.0    $$  2.000000            refuge
7     nolan  30318     4.0    $$  2.000000           toptime 
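
The two steps above (drop the index, then copy it into a column) can also be done in one chain. This is a sketch on an illustrative two-row frame `by_name`; note that the new column lands first rather than last.

```python
import pandas as pd

by_name = pd.DataFrame({'poc': ['jared', 'kelly'], 'zip': [30324, 30312]},
                       index=['east pole', 'chrome yellow'])

# Name the index, then move it into a regular column
flat = by_name.rename_axis('name').reset_index()
print(flat)
```

Calling `reset_index()` without `drop=True` keeps the old index as a column, which is why it needs a name first.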



Another useful operation is gluing DataFrame objects together:

The `pd.concat()` function stacks DataFrames that share column names and can also combine DataFrames whose column names differ; columns missing from one input are filled with NaN.

In [381]:
# Split based on price:

# create a boolean mask
is_cheap = cafes4['price'] <= '$$'

# create 2 df's, based on the mask
cafes_cheap = cafes4[is_cheap]   #cheap
cafes_pricey = cafes4[~is_cheap]  #not cheap

display(cafes_cheap)
display(cafes_pricey)

# Never mind; recombine
pd.concat([cafes_cheap, cafes_pricey])

Unnamed: 0,poc,zip,rating,price,value
east pole,jared,30324,4.0,$$,2.0
chrome yellow,kelly,30312,4.0,$$,2.0
brash,matt,30318,4.0,$$,2.0
taproom,jonathan,30317,4.0,$$,2.0
refuge,kitti,30303,4.0,$$,2.0
toptime,nolan,30318,4.0,$$,2.0


Unnamed: 0,poc,zip,rating,price,value
3heart,nhan,30306,4.0,$$$,1.333333
spiller park pcm,dale,30308,4.0,$$$,1.333333


Unnamed: 0,poc,zip,rating,price,value
east pole,jared,30324,4.0,$$,2.0
chrome yellow,kelly,30312,4.0,$$,2.0
brash,matt,30318,4.0,$$,2.0
taproom,jonathan,30317,4.0,$$,2.0
refuge,kitti,30303,4.0,$$,2.0
toptime,nolan,30318,4.0,$$,2.0
3heart,nhan,30306,4.0,$$$,1.333333
spiller park pcm,dale,30308,4.0,$$$,1.333333
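
The recombination above uses two frames with identical columns. To see what happens when the columns differ, here is a minimal sketch with two made-up one-row frames, `left` and `right`.

```python
import pandas as pd

left = pd.DataFrame({'poc': ['jared'], 'zip': [30324]}, index=['east pole'])
right = pd.DataFrame({'poc': ['nhan'], 'rating': [4.0]}, index=['3heart'])

stacked = pd.concat([left, right])               # default join='outer': union of columns
shared = pd.concat([left, right], join='inner')  # only columns present in both inputs

print(stacked)
print(shared)
```

With the default outer join, each row has NaN in the columns its source frame lacked.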


#### Merging and Joining DataFrames
You can combine DataFrames using merge operations, similar to SQL joins:

The `merge()` function in pandas is also useful. In SQL, this is a select and join of two tables.

The main interface for this is the `pd.merge` function, which you will use extensively in class.

Parameters:
```
pandas.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
```

The 'left' and 'right' parameters define the two dataframes to merge. Remember which df is left and which is right.

The 'how' parameter designates the type of join operation to perform (left, right, outer, inner, cross). Default = inner.

The 'on' parameter gives the column(s) or index level names to join on. Whatever you designate for this parameter must be contained in both the left and right df's.

You can use the 'left_on' and 'right_on' parameters when the two dataframes do not have identically-named columns, but they have differently-named columns that have the same values, and you can join on them. We will show an example of this below.

In [511]:
# Creating two DataFrames to merge
df1 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
df2 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Salary': [50000, 60000]})

# Merging based on 'Name' column
df_merged = pd.merge(df1, df2, on='Name')
print(df_merged)

    Name  Age  Salary
0  Alice   25   50000
1    Bob   30   60000


If the "on" parameter is not specified, the default behavior of pd.merge() is that it looks for one or more matching column names between the two inputs, and uses this as the key. 
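
A minimal sketch of this default, using two small illustrative frames (`ages` and `pay`) whose only shared column is `Name`:

```python
import pandas as pd

ages = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
pay = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Salary': [50000, 60000]})

implicit = pd.merge(ages, pay)             # key inferred from the shared column 'Name'
explicit = pd.merge(ages, pay, on='Name')  # same key, stated explicitly

print(implicit.equals(explicit))
```

Stating `on` explicitly is usually the safer habit, since an unexpected shared column would otherwise silently become part of the key.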

Options for specifying how to merge:
1. You can explicitly specify the name of the key column using the `on` keyword, which takes a column name or a list of column names. Note that this only works if both the left and right dataframes have the specified column name.
2. The `left_on` and `right_on` keywords are useful when you want to merge two datasets with different column names.

In [386]:
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue','Chris','Rich'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR','IT','Engineering']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})
df3 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'salary': [70000, 80000, 120000, 90000]})
display(df1, df2, df3)

Unnamed: 0,employee,group
0,Bob,Accounting
1,Jake,Engineering
2,Lisa,Engineering
3,Sue,HR
4,Chris,IT
5,Rich,Engineering


Unnamed: 0,employee,hire_date
0,Lisa,2004
1,Bob,2008
2,Jake,2012
3,Sue,2014


Unnamed: 0,name,salary
0,Bob,70000
1,Jake,80000
2,Lisa,120000
3,Sue,90000


In [387]:
# option 1, using 'on'
display(df1, df2, pd.merge(df1, df2, on='employee'))

# option 2 above, using left_on and right_on
# recall the designation of the left and right df's, and then using the correct *_on column names
display(df1, df3, pd.merge(df1, df3, left_on="employee", right_on="name"))

Unnamed: 0,employee,group
0,Bob,Accounting
1,Jake,Engineering
2,Lisa,Engineering
3,Sue,HR
4,Chris,IT
5,Rich,Engineering


Unnamed: 0,employee,hire_date
0,Lisa,2004
1,Bob,2008
2,Jake,2012
3,Sue,2014


Unnamed: 0,employee,group,hire_date
0,Bob,Accounting,2008
1,Jake,Engineering,2012
2,Lisa,Engineering,2004
3,Sue,HR,2014


Unnamed: 0,employee,group
0,Bob,Accounting
1,Jake,Engineering
2,Lisa,Engineering
3,Sue,HR
4,Chris,IT
5,Rich,Engineering


Unnamed: 0,name,salary
0,Bob,70000
1,Jake,80000
2,Lisa,120000
3,Sue,90000


Unnamed: 0,employee,group,name,salary
0,Bob,Accounting,Bob,70000
1,Jake,Engineering,Jake,80000
2,Lisa,Engineering,Lisa,120000
3,Sue,HR,Sue,90000
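
The `left_on`/`right_on` result keeps both key columns (`employee` and `name`). If that duplication is unwanted, one option is to drop the redundant column afterwards. A sketch with illustrative frames `staff` and `wages`:

```python
import pandas as pd

staff = pd.DataFrame({'employee': ['Bob', 'Jake'],
                      'group': ['Accounting', 'Engineering']})
wages = pd.DataFrame({'name': ['Bob', 'Jake'], 'salary': [70000, 80000]})

merged = pd.merge(staff, wages, left_on='employee', right_on='name')
tidy = merged.drop(columns='name')   # 'name' repeats 'employee' after the merge
print(tidy)
```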


'how' parameter: 

inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys. Returns only keys in both the left and right df's.

outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically. The outer join fills in all missing values with NAs.

In [389]:
# outer join
display(df1, df2, pd.merge(df1, df2, on='employee', how='outer'))

Unnamed: 0,employee,group
0,Bob,Accounting
1,Jake,Engineering
2,Lisa,Engineering
3,Sue,HR
4,Chris,IT
5,Rich,Engineering


Unnamed: 0,employee,hire_date
0,Lisa,2004
1,Bob,2008
2,Jake,2012
3,Sue,2014


Unnamed: 0,employee,group,hire_date
0,Bob,Accounting,2008.0
1,Jake,Engineering,2012.0
2,Lisa,Engineering,2004.0
3,Sue,HR,2014.0
4,Chris,IT,
5,Rich,Engineering,
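
The other `how` options follow the same pattern. The sketch below, on small made-up frames `groups` and `hires`, shows a left join and uses the `indicator` parameter from the signature above to record where each row came from.

```python
import pandas as pd

groups = pd.DataFrame({'employee': ['Bob', 'Chris'], 'group': ['Accounting', 'IT']})
hires = pd.DataFrame({'employee': ['Bob', 'Sue'], 'hire_date': [2008, 2014]})

# Left join: keep every row of `groups`, fill missing hire dates with NaN
left_join = pd.merge(groups, hires, on='employee', how='left')

# Outer join with a `_merge` column: 'both', 'left_only', or 'right_only'
audit = pd.merge(groups, hires, on='employee', how='outer', indicator=True)

print(left_join)
print(audit[['employee', '_merge']])
```

The `_merge` column is a convenient way to find keys that failed to match on one side.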


#### Pandas Grouping and Aggregation
You can group data based on one or more columns and then apply aggregation functions. The groupby() function is typically used to aggregate conditionally on some row label or index. The function is similar in usage to the SQL command 'group by'.

A groupby() operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.

```
out = dataframe.groupby(by=columnname).function()

df.groupby(by=["b"]).sum()
```
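
To make the split, apply, and combine steps concrete, the sketch below performs them by hand on a made-up frame `toy` and compares the result with the one-line `groupby` form.

```python
import pandas as pd

toy = pd.DataFrame({'key': ['A', 'B', 'A', 'B'], 'data': [1, 2, 3, 4]})

# Split: one sub-frame per key; apply: sum each; combine: collect into a Series
manual = pd.Series({k: part['data'].sum() for k, part in toy.groupby('key')})

# The same computation in one groupby expression
grouped = toy.groupby('key')['data'].sum()

print(manual.to_dict())
print(grouped.to_dict())
```

The manual loop is only for illustration; the `groupby` expression is the idiomatic form.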

In [391]:
print(df, '\n')

df_grouped = df.groupby('Age')['Salary'].mean()  # Group by Age and calculate mean Salary
print(df_grouped)

      Name  Age  Salary  Salary_after_tax
0    Alice   25   50000           35000.0
1      Bob   30   60000           42000.0
2  Charlie   35   70000           49000.0 

Age
25    50000.0
30    60000.0
35    70000.0
Name: Salary, dtype: float64


###### Aggregation

In [393]:
import pandas as pd
import numpy as np

# for a Series
rng = np.random.RandomState(42)  # set a random starting point
agg_series = pd.Series(rng.rand(5))
display(agg_series) #series of 5 random numbers 

# aggregate the entire column
print(agg_series.sum())
print(agg_series.mean())

0    0.374540
1    0.950714
2    0.731994
3    0.598658
4    0.156019
dtype: float64

2.811925491708157
0.5623850983416314


In [394]:
# for a dataframe
df = pd.DataFrame({'A': rng.rand(5),
                   'B': rng.rand(5)})
display(df)

#If axis is not specified, the default (axis=0) aggregates down each column.

print(df.mean()) #aggregate entire columns
print(df.mean(axis='columns')) #aggregate within each row

Unnamed: 0,A,B
0,0.155995,0.020584
1,0.058084,0.96991
2,0.866176,0.832443
3,0.601115,0.212339
4,0.708073,0.181825


A    0.477888
B    0.443420
dtype: float64
0    0.088290
1    0.513997
2    0.849309
3    0.406727
4    0.444949
dtype: float64


If we don't specify an axis argument, the default (`axis=0`) aggregates down the rows, producing one value per column.

###### Grouping & Aggregation

In [397]:
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)}, columns=['key', 'data'])
display(df)

# groupby and aggregate by a single column
display(df.groupby('key').sum())
display(df.groupby('key').mean())

Unnamed: 0,key,data
0,A,0
1,B,1
2,C,2
3,A,3
4,B,4
5,C,5


Unnamed: 0_level_0,data
key,Unnamed: 1_level_1
A,3
B,5
C,7


Unnamed: 0_level_0,data
key,Unnamed: 1_level_1
A,1.5
B,2.5
C,3.5


In [398]:
# groupby and aggregate by multiple columns

df2 = pd.DataFrame({'key1': ['A', 'B', 'C', 'A', 'B', 'C','A', 'B', 'C', 'A', 'B', 'C'],
                    'key2': ['far', 'far', 'far', 'near', 'near', 'near','far', 'far', 'far', 'near', 'near', 'near'],
                   'data': range(12)}, columns=['key1', 'key2', 'data'])
display(df2)

#multiple columns
display(df2.groupby(['key1','key2']).sum())
display(df2.groupby(['key1','key2']).sum())

Unnamed: 0,key1,key2,data
0,A,far,0
1,B,far,1
2,C,far,2
3,A,near,3
4,B,near,4
5,C,near,5
6,A,far,6
7,B,far,7
8,C,far,8
9,A,near,9


Unnamed: 0_level_0,Unnamed: 1_level_0,data
key1,key2,Unnamed: 2_level_1
A,far,6
A,near,12
B,far,8
B,near,14
C,far,10
C,near,16


Unnamed: 0_level_0,Unnamed: 1_level_0,data
key1,key2,Unnamed: 2_level_1
A,far,6
A,near,12
B,far,8
B,near,14
C,far,10
C,near,16
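
Several aggregates can be computed in one pass with `.agg`, which the cells above do not use. This sketch works on a shortened, illustrative version of the two-key frame, named `toy`.

```python
import pandas as pd

toy = pd.DataFrame({'key1': ['A', 'B', 'A', 'B'],
                    'key2': ['far', 'far', 'near', 'near'],
                    'data': [0, 1, 3, 4]})

# One row per (key1, key2) pair, one column per aggregate
summary = toy.groupby(['key1', 'key2'])['data'].agg(['sum', 'mean'])
print(summary)

# Turn the two-level row index back into ordinary columns
flat = summary.reset_index()
print(flat)
```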


There is a convenience method, `.describe()`, that computes several common aggregates for each column and returns the result. This is a good function to use when you are performing exploratory data analysis (EDA).

In [400]:
df5 = pd.DataFrame({'A': rng.rand(10),
                   'B': rng.rand(10),
                   'C': rng.rand(10),
                   'D': rng.rand(10)})
df5.describe()

Unnamed: 0,A,B,C,D
count,10.0,10.0,10.0,10.0
mean,0.36015,0.489559,0.415477,0.640281
std,0.148051,0.351583,0.30671,0.262852
min,0.139494,0.04645,0.034389,0.184854
25%,0.291458,0.177812,0.156224,0.526729
50%,0.335302,0.553325,0.372383,0.630211
75%,0.450039,0.740768,0.636969,0.864904
max,0.611853,0.965632,0.90932,0.969585


#### Apply, Boolean Mask, & Groupby:
```
Apply: dataframe.apply(function)
Group: dataframe.groupby(by=columnname).function()
Both: dataframe.groupby(by=columnname).apply(function)
```

Apply
- axis=0 (default) applies function to each column.  
- axis=1 applies function to each row.
- gives us the result for the columns IT IS ABLE TO OPERATE ON, otherwise returns an error.

Group By
-  it will NOT INCLUDE the columns that the function cannot operate on in the result (behavior of older pandas releases; pandas 2.0 and later raise a TypeError unless `numeric_only=True` is passed).

Together
-  returns the function result for the columns/rows IT IS ABLE TO OPERATE ON; however, sum() will concatenate strings and mean() gives an error.

In [627]:
#Data set
data = {
    'SEASON_ID': [22020, 22020, 22020, 22018, 22019, 22019],
    'PLAYER_ID': [203932, 1628988, 1630174, 1627846, 1629690, 1629699],
    'PLAYER_NAME': ['Aaron Gordon', 'Aaron Holiday', 'Aaron Nesmith', 'Abdel Nader', 'Adam Mokoka', 'Daph J'],
    'GAMES_PLAYED': [50, 66, 46, 24, 14, 20],
    'MINUTES': [1383.78, 1176.086667, 668.731667, 355.25, 56.178333, 1383.79],
    'POINTS': [618, 475, 218, 160, 15, 329],
    'REBOUNDS': [284, 89, 127, 62, 5, 29],
    'PLUS_MINUS': [60, 3, -7, 28, -8, 9]
}

sports_df = pd.DataFrame(data)
display(sports_df)

Unnamed: 0,SEASON_ID,PLAYER_ID,PLAYER_NAME,GAMES_PLAYED,MINUTES,POINTS,REBOUNDS,PLUS_MINUS
0,22020,203932,Aaron Gordon,50,1383.78,618,284,60
1,22020,1628988,Aaron Holiday,66,1176.086667,475,89,3
2,22020,1630174,Aaron Nesmith,46,668.731667,218,127,-7
3,22018,1627846,Abdel Nader,24,355.25,160,62,28
4,22019,1629690,Adam Mokoka,14,56.178333,15,5,-8
5,22019,1629699,Daph J,20,1383.79,329,29,9


We can also use apply() on multiple columns or on the entire dataframe, but every column it receives must be one the function can operate on. If we apply a function to an incompatible column/row, apply() raises an error.
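
A small sketch of that failure mode, assuming a made-up frame called `mixed` (not part of the notes' data) that combines a string column with a numeric one:

```python
import pandas as pd

# Hypothetical frame mixing a string column with a numeric one
mixed = pd.DataFrame({'name': ['a', 'b', 'c'],
                      'score': [1.0, 2.0, 3.0]})

try:
    # 'name' holds strings, so taking its mean fails
    mixed.apply(lambda col: col.mean())
    failed = False
except TypeError as err:
    failed = True
    print('apply raised:', type(err).__name__)

# Selecting only the compatible columns first avoids the error
score_mean = mixed[['score']].apply(lambda col: col.mean())
```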

In [629]:
#Mean of an entire column
sports_df[['GAMES_PLAYED']].apply(np.mean)  #double brackets keep the selection a data frame
sports_df[['GAMES_PLAYED', 'POINTS', 'REBOUNDS']].apply(np.mean) #average games, points, and rebounds

GAMES_PLAYED     36.666667
POINTS          302.500000
REBOUNDS         99.333333
dtype: float64

In [648]:
#sports_df.groupby(by= 'POINTS').sum()

sports_df.groupby(by = 'SEASON_ID').apply(np.sum, axis=0) #Note: it concatenates the strings for PLAYER_NAME
#sports_df.groupby(by = 'SEASON_ID').apply(np.mean, axis=0) #ERROR: mean cannot average the PLAYER_NAME strings

SEASON_ID,SEASON_ID,PLAYER_ID,PLAYER_NAME,GAMES_PLAYED,MINUTES,POINTS,REBOUNDS,PLUS_MINUS
22018,22018,1627846,Abdel Nader,24,355.25,160,62,28
22019,44038,3259389,Adam MokokaDaph J,34,1439.968333,344,34,1
22020,66060,3463094,Aaron GordonAaron HolidayAaron Nesmith,162,3228.598334,1311,500,56


Depending on the function we are using (sum vs. mean, for example), some columns may be included in (or excluded from) the returned data frame. So we may get results that we are not expecting.

To fix this:

1. Create a new dataframe by keeping only the columns necessary for that particular analysis.

2. (Optional) Set the columns you group by as indices on the new dataframe. (This ensures that you are not grouping extraneous columns, for functions such as sum().)

3. Perform the required groupby/apply/function on the new dataframe.

4. (Optional) Set the index columns to be regular columns.

Below are additional steps that the (exam/homework) exercise may require:
- Merge the returned dataframe with the other dataframe(s) required by the analysis.
- Drop the extraneous columns in the new/merged dataframe.
- Rename the remaining columns, per the exercise requirements.
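
The merge, drop, and rename steps might look like the sketch below. Both frames are invented for illustration: `totals` stands in for a groupby result, and the `teams` lookup table with its `TEAM` and `CITY` columns is a hypothetical second data source.

```python
import pandas as pd

# Hypothetical per-player totals, standing in for a groupby result
totals = pd.DataFrame({'PLAYER_ID': [1, 2],
                       'POINTS': [618, 475],
                       'REBOUNDS': [284, 89]})

# Hypothetical lookup table; TEAM and CITY are made-up columns
teams = pd.DataFrame({'PLAYER_ID': [1, 2],
                      'TEAM': ['AAA', 'BBB'],
                      'CITY': ['Here', 'There']})

# Merge on the shared key
merged = totals.merge(teams, on='PLAYER_ID', how='left')

# Drop a column the analysis does not need
merged = merged.drop(columns=['CITY'])

# Rename the remaining columns to whatever the exercise specifies
merged = merged.rename(columns={'POINTS': 'total_points',
                                'REBOUNDS': 'total_rebounds'})
```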

**Exercise 1** Return a dataframe that summarizes the total minutes, games played, points, and rebounds for each player, over the 4 seasons.

In [678]:
#sports_df
#baseline: summing every column also totals SEASON_ID and PLAYER_ID, which are not meaningful sums
display(sports_df.groupby(by = 'PLAYER_NAME').sum())

#step 1: create new df with columns required
sports_stats_test = sports_df[['PLAYER_NAME', 'MINUTES', 'GAMES_PLAYED', 'POINTS', 'REBOUNDS']]

#step 2 (optional): set the grouping columns to be indexes
sports_stats_test = sports_stats_test.set_index(['PLAYER_NAME']) #set index to player name instead of default integers



PLAYER_NAME,SEASON_ID,PLAYER_ID,GAMES_PLAYED,MINUTES,POINTS,REBOUNDS,PLUS_MINUS
Aaron Gordon,22020,203932,50,1383.78,618,284,60
Aaron Holiday,22020,1628988,66,1176.086667,475,89,3
Aaron Nesmith,22020,1630174,46,668.731667,218,127,-7
Abdel Nader,22018,1627846,24,355.25,160,62,28
Adam Mokoka,22019,1629690,14,56.178333,15,5,-8
Daph J,22019,1629699,20,1383.79,329,29,9
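
The four steps listed earlier can be carried through to completion roughly as follows. This is a sketch on a three-player subset of the sample data; the names `stats` and `player_totals` are placeholders chosen here, and grouping on the index level is one possible way to do step 3, not necessarily the intended solution.

```python
import pandas as pd

# Subset of the sample data (first three players only)
sports_df = pd.DataFrame({
    'SEASON_ID': [22020, 22020, 22020],
    'PLAYER_NAME': ['Aaron Gordon', 'Aaron Holiday', 'Aaron Nesmith'],
    'GAMES_PLAYED': [50, 66, 46],
    'MINUTES': [1383.78, 1176.086667, 668.731667],
    'POINTS': [618, 475, 218],
    'REBOUNDS': [284, 89, 127],
})

# step 1: keep only the columns the exercise asks for
stats = sports_df[['PLAYER_NAME', 'MINUTES', 'GAMES_PLAYED', 'POINTS', 'REBOUNDS']]

# step 2 (optional): move the grouping column into the index
stats = stats.set_index('PLAYER_NAME')

# step 3: group on the index level and total every remaining column
player_totals = stats.groupby(level='PLAYER_NAME').sum()

# step 4 (optional): turn the index back into a regular column
player_totals = player_totals.reset_index()
```

Because SEASON_ID was left out in step 1, it is never summed, so the result contains only the requested totals.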
