# Intro to Pandas

In [4]:
import pandas as pd
import numpy as np

# Series

A Series is a one-dimensional labeled array that behaves much like a NumPy array and a dictionary. It can be created from any of the following:

- NumPy ndarray (one-dimensional)
- Dictionary
- Scalar value


### Pandas Series from numpy ndarray

In [9]:
# Here, we specify the index 
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
s

a    0.759940
b   -0.605697
c   -0.233597
d   -0.521836
e    1.066798
dtype: float64

In [6]:
s.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [8]:
# Here, we let Pandas create a default index
pd.Series(np.random.randn(5))

0   -0.652703
1    1.396967
2   -1.692087
3   -0.933317
4   -1.161064
dtype: float64

### Pandas Series From dictionary

In [10]:
d = {'b': 1, 'a': 0, 'c': 2}
pd.Series(d)

b    1
a    0
c    2
dtype: int64

### Pandas Series From scalar value

If data is a scalar value, an index must be provided. The value will be repeated to match the length of the index.



In [11]:
pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])

a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64

# DataFrames

### From Dictionary of Series or Dictionaries

The resulting index will be the union of the indexes of the various Series.

In [20]:
d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
         'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])} 

df = pd.DataFrame(d)

df

Unnamed: 0,one,two
a,1.0,1.0
b,2.0,2.0
c,3.0,3.0
d,,4.0


In [21]:
pd.DataFrame(d, index=['d', 'b', 'a'])

Unnamed: 0,one,two
d,,4.0
b,2.0,2.0
a,1.0,1.0


In [22]:
pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])

Unnamed: 0,two,three
d,4.0,
b,2.0,
a,1.0,


### From Dictionary of ndarrays or lists

The ndarrays must all be of the same length. If an index is passed, it must also be of the same length as the arrays. If no index is passed, the result will be range(n), where n is the array length.



In [16]:
d = {'one': [1., 2., 3., 4.],
         'two': [4., 3., 2., 1.]}

pd.DataFrame(d)

Unnamed: 0,one,two
0,1.0,4.0
1,2.0,3.0
2,3.0,2.0
3,4.0,1.0


In [17]:
pd.DataFrame(d, index=['a', 'b', 'c', 'd'])

Unnamed: 0,one,two
a,1.0,4.0
b,2.0,3.0
c,3.0,2.0
d,4.0,1.0
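
The length rule above can be checked with a minimal sketch: lists of different lengths are rejected with a ValueError. The column names and values here are made up for illustration.

```python
import pandas as pd

# Hypothetical data: 'one' has three values, 'two' only two.
mismatched = {'one': [1., 2., 3.],
              'two': [4., 3.]}

try:
    pd.DataFrame(mismatched)
    raised = False
except ValueError as err:
    # pandas refuses to build a DataFrame from columns of unequal length
    raised = True
    print("ValueError:", err)
```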


### Column selection, addition, deletion

You can treat a DataFrame semantically like a dictionary of like-indexed Series objects. Getting, setting, and deleting columns works with the same syntax as the analogous dictionary operations:

For example, add a new column 'D' that is the product of columns 'A' and 'B', then delete column 'C':

In [43]:
df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [5, 6, 7, 8],
    'C': [9, 10, 11, 12]
})

df['D'] = df['A'] * df['B']

df

Unnamed: 0,A,B,C,D
0,1,5,9,5
1,2,6,10,12
2,3,7,11,21
3,4,8,12,32


In [28]:
del df['C']

df

Unnamed: 0,A,B,D
0,1,5,5
1,2,6,12
2,3,7,21
3,4,8,32


Multiply every element of df by 10 and store the result back in df:

In [45]:
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
})

df = df*10

df

Unnamed: 0,A,B
0,10,40
1,20,50
2,30,60


#### Scalar

When inserting a scalar value, it will naturally be propagated to fill the column:

In [30]:
df['foo'] = 'bar'

df

Unnamed: 0,A,B,D,foo
0,1,5,5,bar
1,2,6,12,bar
2,3,7,21,bar
3,4,8,32,bar


When inserting a Series that does not have the same index as the DataFrame, it will be conformed to the DataFrame’s index.

In [31]:
df['one_trunc'] = df['A'][:2]

df

Unnamed: 0,A,B,D,foo,one_trunc
0,1,5,5,bar,1.0
1,2,6,12,bar,2.0
2,3,7,21,bar,
3,4,8,32,bar,


In [34]:
df1 = pd.DataFrame(np.random.randn(8, 3), index=range(8), columns=list('ABC'))

df1

Unnamed: 0,A,B,C
0,-0.386851,-1.607079,0.578029
1,-0.360151,0.323313,1.792127
2,0.234993,-1.222968,0.350741
3,-0.630497,-1.793681,-1.52976
4,0.147146,-1.591079,-0.182208
5,0.970677,0.729362,0.242854
6,0.899637,1.960144,-1.287631
7,-0.747163,-0.916021,0.891135


In [35]:
df1 * 5 + 2

Unnamed: 0,A,B,C
0,0.065745,-6.035396,4.890146
1,0.199247,3.616564,10.960635
2,3.174966,-4.114839,3.753707
3,-1.152483,-6.968403,-5.648802
4,2.73573,-5.955393,1.088958
5,6.853383,5.646808,3.214268
6,6.498183,11.800719,-4.438156
7,-1.735817,-2.580105,6.455676


In [36]:
1 / df1

Unnamed: 0,A,B,C
0,-2.584974,-0.622247,1.730017
1,-2.776617,3.092981,0.557996
2,4.255442,-0.817683,2.851104
3,-1.586052,-0.557513,-0.653697
4,6.79597,-0.628504,-5.488223
5,1.030209,1.371062,4.117707
6,1.11156,0.510167,-0.77662
7,-1.338395,-1.091678,1.122164


### Boolean

#### Boolean Operators

- & is AND - True when both are True. 
- | is OR - True when at least one is True.
- ^ is XOR - True when exactly one is True (not both or neither)

In [None]:
df1 = pd.DataFrame({'a': [1, 0, 1], 'b': [0, 1, 1]}, dtype=bool)

df2 = pd.DataFrame({'a': [0, 1, 1], 'b': [1, 1, 0]}, dtype=bool)

df1 & df2

Unnamed: 0,a,b
0,False,False
1,False,True
2,True,False


In [42]:
df1 | df2

Unnamed: 0,a,b
0,True,True
1,True,True
2,True,True
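
The cells above cover & and |, but not ^. A small sketch of XOR on the same kind of boolean frames follows; the variable names flags_left and flags_right are placeholders chosen here so that df1 and df2 are not overwritten.

```python
import pandas as pd

flags_left = pd.DataFrame({'a': [1, 0, 1], 'b': [0, 1, 1]}, dtype=bool)
flags_right = pd.DataFrame({'a': [0, 1, 1], 'b': [1, 1, 0]}, dtype=bool)

# XOR: True where exactly one of the two frames is True
xor_result = flags_left ^ flags_right
print(xor_result)
```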


### Datatypes in Pandas

For the most part, Pandas uses NumPy arrays and dtypes for Series or individual columns of a DataFrame. NumPy provides support for float, int, bool, timedelta64[ns] and datetime64[ns].

However, NumPy has no native support for some kinds of data (for example categoricals, integers with missing values, or time-zone-aware datetimes), so Pandas extends NumPy's type system in a few places. The following table lists the most common Pandas extension types:

In [58]:
datatypes = {
    'Kind of Data': ['Categorical', 'nullable integer', 'Strings', 'Boolean (with NA)', 'any'],
    'Data Type': ['CategoricalDtype', 'Int64Dtype', 'StringDtype', 'BooleanDtype', 'object dtype'],
    'String Aliases': ['category', '"Int8", "UInt8", "Int16", "UInt16"...', 'string', 'boolean', 'object']
}

datatypes_df = pd.DataFrame(datatypes)

datatypes_df



Unnamed: 0,Kind of Data,Data Type,String Aliases
0,Categorical,CategoricalDtype,category
1,nullable integer,Int64Dtype,"""Int8"", ""UInt8"", ""Int16"", ""UInt16""..."
2,Strings,StringDtype,string
3,Boolean (with NA),BooleanDtype,boolean
4,any,object dtype,object
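
To make the "nullable integer" row concrete, here is a minimal sketch comparing default inference with the 'Int64' alias. The values are arbitrary sample data.

```python
import pandas as pd

# Default inference: the missing value forces the column to float64.
default_ints = pd.Series([1, 2, None])

# Nullable extension dtype: values stay integers and the gap becomes pd.NA.
nullable_ints = pd.Series([1, 2, None], dtype='Int64')

print(default_ints.dtype)
print(nullable_ints.dtype)
print(nullable_ints)
```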


In [None]:
dft = pd.DataFrame({'A': np.random.rand(3),
                        'B': 1,
                        'C': 'foo',
                        'D': pd.Timestamp('20010102'),
                        'E': pd.Series([1.0] * 3).astype('float32'),
                        'F': False,
                        'G': pd.Series([1] * 3, dtype='int8')})



dft


Unnamed: 0,A,B,C,D,E,F,G
0,0.671847,1,foo,2001-01-02,1.0,False,1
1,0.267029,1,foo,2001-01-02,1.0,False,1
2,0.252666,1,foo,2001-01-02,1.0,False,1


In [48]:
dft.dtypes

A          float64
B            int64
C           object
D    datetime64[s]
E          float32
F             bool
G             int8
dtype: object

#### Strings in Pandas

- object dtype
    - Can hold any Python object, including strings.
- StringDtype
    - Dedicated to strings (introduced in Pandas 1.0.0, released in 2020)
    
It is recommended to use StringDtype for strings because an object column can hold any mix of Python objects, not only strings.
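
A short sketch of the difference, using made-up values: an object column accepts anything, while the 'string' alias produces a dedicated StringDtype column.

```python
import pandas as pd

# An object column silently accepts a mix of types.
mixed = pd.Series(['foo', 42, 3.14])

# Requesting the 'string' alias gives a dedicated StringDtype column.
text = pd.Series(['foo', 'bar', None], dtype='string')

print(mixed.dtype)
print(text.dtype)
print(text.str.upper())  # the missing entry stays missing
```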

### Converting Datatypes

#### astype()



In [59]:
df1 = pd.DataFrame(np.random.randn(8, 1), columns=['A'], dtype='float32')

df1.dtypes

A    float32
dtype: object

In [60]:
df1 = df1.astype('float64')

In [61]:
df1.dtypes

A    float64
dtype: object

You can also call .astype() on a subset of columns by passing a dictionary, or on a single column (a Series).

In [62]:
dft1 = pd.DataFrame({'a': [1, 0, 1], 'b': [4, 5, 6], 'c': [7, 8, 9]})

dft1 = dft1.astype({'a': bool, 'c': np.float64})

dft1

Unnamed: 0,a,b,c
0,True,4,7.0
1,False,5,8.0
2,True,6,9.0


In [63]:
dft1.dtypes

a       bool
b      int64
c    float64
dtype: object

### Functions / Accessors of Pandas objects

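- value_counts() 
    - Counts how often each unique value occurs.
    - Available on Series; newer Pandas versions also offer DataFrame.value_counts(), which counts unique rows.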
- mode() 
    - Mode of the values in a Series or DataFrame.
- .dt
    - Essentially a ‘datetime’ accessor that works on series in Pandas.
- .str
    - Series is equipped with a set of string processing methods that make it easy to operate on each element of the array.
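
The four items above can be sketched on two tiny Series. The fruit names and dates are invented sample data.

```python
import pandas as pd

fruits = pd.Series(['apple', 'banana', 'apple', 'cherry'])

counts = fruits.value_counts()   # occurrences of each unique value
most_common = fruits.mode()      # a Series, because ties are possible
upper = fruits.str.upper()       # .str applies string methods element-wise

dates = pd.Series(pd.to_datetime(['2021-01-15', '2022-06-30']))
years = dates.dt.year            # .dt exposes datetime components

print(counts)
print(most_common)
print(upper)
print(years)
```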


### Sorting


There are three types of sorting in Pandas:

- Sorting by index labels
- Sorting by column values
- Sorting by a combination of both

In [2]:
# Sort by index

import pandas as pd
import numpy as np

df = pd.DataFrame({
        'one': pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
        'two': pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
        'three': pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})


unsorted_df = df.reindex(index=['a', 'd', 'c', 'b'],
                          columns=['three', 'two', 'one'])

unsorted_df

Unnamed: 0,three,two,one
a,,3.353389,1.193548
d,-0.833807,-0.332141,
c,1.030907,0.229597,0.018438
b,0.179965,0.380328,-0.250276


In [5]:
# Sort DataFrame by index 
unsorted_df.sort_index()

Unnamed: 0,three,two,one
a,,3.353389,1.193548
b,0.179965,0.380328,-0.250276
c,1.030907,0.229597,0.018438
d,-0.833807,-0.332141,


In [6]:
# Sort DataFrame by index in descending order
unsorted_df.sort_index(ascending=False)

Unnamed: 0,three,two,one
d,-0.833807,-0.332141,
c,1.030907,0.229597,0.018438
b,0.179965,0.380328,-0.250276
a,,3.353389,1.193548


In [8]:
# Sort DataFrame by column names ( axis = 1 )

unsorted_df.sort_index(axis=1)

Unnamed: 0,one,three,two
a,1.193548,,3.353389
d,,-0.833807,-0.332141
c,0.018438,1.030907,0.229597
b,-0.250276,0.179965,0.380328


In [9]:
# Sort Series by index

unsorted_df['three'].sort_index()

a         NaN
b    0.179965
c    1.030907
d   -0.833807
Name: three, dtype: float64

In [None]:
# Sort by values

df1 = pd.DataFrame({'one': [2, 1, 1, 1],
                        'two': [1, 3, 2, 4],
                        'three': [5, 4, 3, 2]})

df1

Unnamed: 0,one,two,three
0,2,1,5
1,1,3,4
2,1,2,3
3,1,4,2


In [13]:
# Sort DataFrame by column "two"
df1.sort_values(by='two')

Unnamed: 0,one,two,three
0,2,1,5
2,1,2,3
1,1,3,4
3,1,4,2


In [14]:
# Sort DataFrame by columns "one" and "two"
df1[['one', 'two', 'three']].sort_values(by=['one', 'two'])

Unnamed: 0,one,two,three
2,1,2,3
1,1,3,4
3,1,4,2
0,2,1,5
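
The third kind of sorting, a combination of column values and index labels, is not shown above. One way to sketch it, assuming the index has a name: sort_values() accepts index level names alongside column labels in by. The frame and the index name 'label' are hypothetical.

```python
import pandas as pd

# Hypothetical frame whose index has a name, so it can be referenced in `by`.
scores = pd.DataFrame({'one': [2, 1, 1]},
                      index=pd.Index(['a', 'c', 'b'], name='label'))

# Sort by column 'one' first, then break ties using the index level 'label'.
by_both = scores.sort_values(by=['one', 'label'])
print(by_both)
```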


### Boolean Comparators

Series and DataFrame have the binary comparison methods eq, ne, le, lt, ge, and gt, whose behavior is vectorized:
- eq 
    - == equals to
- ne
    - !=  not equals to
- le
    - <= less than or equals to
- lt 
    - < less than
- ge
    - '>=' greater than or equals to
- gt
    - '>' greater than


In [26]:
df = pd.DataFrame({
        'one': pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
        'two': pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
        'three': pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})

df2 = df.copy()

# gt used as a method: compare df against df2 element-wise in a single call
df.gt(df2)

Unnamed: 0,one,two,three
a,False,False,False
b,False,False,False
c,False,False,False
d,False,False,False


In [19]:
# ne used as a method: element-wise 'not equal'; NaN never equals NaN, so cells with missing values show True
df2.ne(df)

Unnamed: 0,one,two,three
a,False,False,True
b,False,False,False
c,False,False,False
d,True,False,False


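The following attribute and methods summarize a whole object (or a boolean result such as df > 0) instead of comparing element by element:

- empty
    - True if the object has no elements.
- any()
    - True if at least one value is truthy.
- all()
    - True if every value is truthy.
- bool() 
    - Truth value of a single-element boolean object (deprecated since Pandas 2.1).
- equals()
    - True if two objects have the same shape and elements; NaNs in the same location count as equal.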

In [28]:
print(df)

# are all values in each column greater than 0
(df > 0).all()

        one       two     three
a  0.883036  2.279542       NaN
b -1.330833 -0.891217 -0.996467
c  0.339572 -1.096066 -1.941777
d       NaN -0.402539  0.925562


one      False
two      False
three    False
dtype: bool

In [31]:
print(df)

# are any values in each column greater than 0
(df > 0).any()

        one       two     three
a  0.883036  2.279542       NaN
b -1.330833 -0.891217 -0.996467
c  0.339572 -1.096066 -1.941777
d       NaN -0.402539  0.925562


one      True
two      True
three    True
dtype: bool

### Objects comparison

Comparing a pandas data structure with a scalar value performs the comparison element-wise:

In [32]:
pd.Series(['foo', 'bar', 'baz']) == 'foo'

0     True
1    False
2    False
dtype: bool

In [33]:
pd.Series(['foo', 'bar', 'baz']) == pd.Index(['foo', 'bar', 'qux'])

0     True
1     True
2    False
dtype: bool

In [34]:
(df + df).equals(df * 2)

True
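
The reason equals() is used above instead of == combined with all() is that NaN never compares equal to itself. A minimal sketch with made-up values, which also shows the empty attribute:

```python
import numpy as np
import pandas as pd

with_nan = pd.Series([1.0, np.nan])

# Element-wise comparison: NaN == NaN is False, so all() fails.
elementwise_all = (with_nan == with_nan).all()

# equals() treats NaNs in the same location as equal.
same_content = with_nan.equals(with_nan.copy())

# empty is an attribute, not a method.
is_empty = pd.DataFrame().empty

print(elementwise_all, same_content, is_empty)
```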

### Descriptive Statistics

There are a large number of methods for computing descriptive statistics and other related operations on Series and DataFrame.

- describe()
    - Count
        - Number of non-null observations.
    - Mean
        - Arithmetic mean.
    - Std
        - Standard deviation.
    - Min
        - Minimum value.
    - 25%, 50% (median), 75%: Percentiles (quartiles).
    - Max
        - Maximum value.
    - For non-numeric data, it provides count, unique, top, and freq.
- Measures of Central Tendency
    - mean()
    - median()
    - mode()
- Measures of Dispersion/Variability
    - std()
        - Calculates the standard deviation. By default, it calculates the sample standard deviation (ddof=1).
    - var()
        - Calculates the variance.
    - min()
        - Finds the minimum value.
    - max()
        - Finds the maximum value.
    - quantile()
        - Calculates quantiles (e.g., 0.25 for the first quartile).
- Other aggregations and transformations
    - count()
        - Counts non-null observations.
    - sum()
        - Calculates the sum of values.
    - abs()
        - Returns the absolute value of each element.
    - prod()
        - Calculates the product of values.
    - cumsum()
        - Calculates the cumulative sum.
    - cumprod()
        - Calculates the cumulative product.
    - nunique()
        - Returns the number of unique values.
    - idxmin() / idxmax()
        - Returns the index label of the minimum/maximum value.
    - argmin() / argmax()
        - Returns the integer position of the minimum/maximum value (Series only).


Generally speaking, **these methods take an axis as an argument and the axis can be specified by name or integer:**

In [35]:
# Aggregation for each column
df.mean(0)

one     -0.036075
two     -0.027570
three   -0.670894
dtype: float64

In [36]:
# Aggregation for each row
df.mean(1)

a    1.581289
b   -1.072839
c   -0.899424
d    0.261511
dtype: float64
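
Because df holds random numbers, the results above change on every run. The following sketch uses a small fixed frame (the name values and its contents are invented) so that a few of the listed methods, and the axis argument given by name, can be checked by hand.

```python
import pandas as pd

values = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [10, 20, 30, 40]})

summary = values.describe()                # count, mean, std, min, quartiles, max
col_means = values.mean(axis='index')      # same as axis=0: one result per column
row_means = values.mean(axis='columns')    # same as axis=1: one result per row

sample_std = values['x'].std()             # ddof=1 by default
population_std = values['x'].std(ddof=0)
running_total = values['x'].cumsum()
label_of_max = values['x'].idxmax()

print(summary)
print(col_means)
print(row_means)
```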

### Iterations

When iterating over a Series, it is treated as array-like, and basic iteration produces the values.

Iterating over a DataFrame follows the dict-like convention of iterating over the keys of the object, i.e. the column labels.

In short, basic iteration (for i in object) produces:

- Series: values
- DataFrame: column labels

In [37]:
df = pd.DataFrame({'col1': np.random.randn(3),
                     'col2': np.random.randn(3)}, index=['a', 'b', 'c'])


for col in df:
        print(col)

col1
col2


#### DataFrame Iteration Methods

- items()
    - To iterate over the (key, value) pairs.
- iterrows()
    - Iterate over the rows of a DataFrame as (index, Series) pairs. 
    - This converts the rows to Series objects, which can change the dtypes and has some performance implications.
- itertuples()
    - Iterate over the rows of a DataFrame as namedtuples of the values. 
    - This is a lot faster than iterrows() and is in most cases preferable to use to iterate over the values of a DataFrame.
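
The three methods can be compared on a tiny frame. This is only a sketch; the frame and its values are made up. Note how iterrows() upcasts the integer column to float because each row becomes a single Series.

```python
import pandas as pd

frame = pd.DataFrame({'col1': [1, 2], 'col2': [0.5, 1.5]}, index=['a', 'b'])

# items(): (column label, column as Series)
for label, column in frame.items():
    print(label, column.dtype)

# iterrows(): (index label, row as Series); mixed columns share one dtype
for idx, row in frame.iterrows():
    print(idx, row['col1'], row.dtype)

# itertuples(): namedtuples that keep each column's own type
for row in frame.itertuples():
    print(row.Index, row.col1, row.col2)

first_row_dtype = next(frame.iterrows())[1].dtype
first_tuple = next(frame.itertuples())
```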


### Merging

#### merge() & concat()

concat() joins Pandas objects together along an axis. Stacking rows this way is similar to UNION ALL in SQL, because duplicate rows are kept:

##### pandas.concat() (like SQL UNION ALL)

In [3]:
import pandas as pd 
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4))

df

Unnamed: 0,0,1,2,3
0,0.117737,-0.485657,0.683541,-0.131472
1,0.263717,1.200925,0.760706,0.333891
2,-0.596264,-0.067011,-0.929175,0.822631
3,-0.90862,0.067446,-1.6968,-1.232521
4,-1.414758,1.826359,1.420804,0.189378
5,-1.21252,1.21064,0.046005,0.625258
6,1.207507,0.519099,-0.746933,-0.372751
7,-1.250094,-1.32837,0.01246,-0.361438
8,0.333165,-1.243732,-1.549012,0.915562
9,1.738738,-0.324127,-0.621795,-0.942039


In [4]:
#break it into pieces

pieces = [df[:3], df[3:7], df[7:]]

pieces

[          0         1         2         3
 0  0.117737 -0.485657  0.683541 -0.131472
 1  0.263717  1.200925  0.760706  0.333891
 2 -0.596264 -0.067011 -0.929175  0.822631,
           0         1         2         3
 3 -0.908620  0.067446 -1.696800 -1.232521
 4 -1.414758  1.826359  1.420804  0.189378
 5 -1.212520  1.210640  0.046005  0.625258
 6  1.207507  0.519099 -0.746933 -0.372751,
           0         1         2         3
 7 -1.250094 -1.328370  0.012460 -0.361438
 8  0.333165 -1.243732 -1.549012  0.915562
 9  1.738738 -0.324127 -0.621795 -0.942039]

In [5]:
# Concat pieces

pd.concat(pieces)

Unnamed: 0,0,1,2,3
0,0.117737,-0.485657,0.683541,-0.131472
1,0.263717,1.200925,0.760706,0.333891
2,-0.596264,-0.067011,-0.929175,0.822631
3,-0.90862,0.067446,-1.6968,-1.232521
4,-1.414758,1.826359,1.420804,0.189378
5,-1.21252,1.21064,0.046005,0.625258
6,1.207507,0.519099,-0.746933,-0.372751
7,-1.250094,-1.32837,0.01246,-0.361438
8,0.333165,-1.243732,-1.549012,0.915562
9,1.738738,-0.324127,-0.621795,-0.942039
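
One detail of the SQL analogy is worth a sketch: concat() keeps duplicate rows, so plain UNION semantics need an explicit drop_duplicates(). The two frames below are hypothetical.

```python
import pandas as pd

first = pd.DataFrame({'key': ['foo', 'bar']})
second = pd.DataFrame({'key': ['bar', 'baz']})

# Like UNION ALL: 'bar' appears twice; ignore_index renumbers the rows.
union_all = pd.concat([first, second], ignore_index=True)

# Like UNION: remove the repeated row afterwards.
union = union_all.drop_duplicates().reset_index(drop=True)

print(union_all)
print(union)
```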


##### pandas.merge() (like SQL JOIN)

merge() performs an inner join by default.

To do other types of joins, such as **outer, left, or right, use the how parameter.**


In [8]:
left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})

In [9]:
pd.merge(left, right, on='key')

Unnamed: 0,key,lval,rval
0,foo,1,4
1,bar,2,5


In [10]:
pd.merge(left, right, on='key', how='outer')

Unnamed: 0,key,lval,rval
0,bar,2,5
1,foo,1,4


You have two DataFrames, sales and targets, with a common column region. 

How would you merge these DataFrames to include all regions from sales and only matching regions from targets? Write the code to achieve this.

In [14]:
pd.merge(sales, targets , on='region', how='left')

NameError: name 'sales' is not defined
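
The merge call itself is fine; the cell fails only because sales and targets were never defined in this notebook. A sketch with invented sample data (the regions and numbers are placeholders) shows the expected behavior of how='left':

```python
import pandas as pd

# Hypothetical data: 'east' has no target, 'west' has no sales.
sales = pd.DataFrame({'region': ['north', 'south', 'east'],
                      'revenue': [100, 150, 90]})
targets = pd.DataFrame({'region': ['north', 'south', 'west'],
                        'target': [120, 140, 80]})

# Keep every region from sales; attach targets only where the region matches.
merged = pd.merge(sales, targets, on='region', how='left')
print(merged)
```

Regions found only in targets ('west') are dropped, and regions without a match ('east') get NaN in the target column.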

Given a DataFrame df with columns department, employee, hours_worked, and salary, 

how would you group by department to find the total hours worked and the average salary for each department? 

In [15]:
df.groupby('department').agg({
    'hours_worked': 'sum',
    'salary': 'mean'
})

KeyError: 'department'
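
The KeyError appears because the df currently in memory has no department column. The same agg() call works on a frame that has the described columns; the frame staff below and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical employee data with the four columns named in the question.
staff = pd.DataFrame({
    'department': ['eng', 'eng', 'ops'],
    'employee': ['ann', 'bob', 'cy'],
    'hours_worked': [40, 35, 38],
    'salary': [100.0, 80.0, 60.0],
})

# One aggregation per column: total hours and average salary per department.
per_department = staff.groupby('department').agg({
    'hours_worked': 'sum',
    'salary': 'mean'
})
print(per_department)
```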

### Grouping


In [11]:
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                             'foo', 'bar', 'foo', 'foo'],
                        'B': ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                       'C': np.random.randn(8),
                       'D': np.random.randn(8)})

In [12]:
df.groupby('A').sum()

Unnamed: 0_level_0,B,C,D
A,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,onethreetwo,2.429543,-0.203567
foo,onetwotwoonethree,-3.108232,-1.195885

Because column B holds strings, sum() concatenates them within each group. To sum only the numeric columns, select them first (df.groupby('A')[['C', 'D']].sum()) or pass numeric_only=True.


## Reshaping

### Stacking
The stack() method "compresses" a level in the DataFrame's columns.

stack() does not build a pivot table: it moves the column labels into a new innermost level of the row index and returns a Series (or DataFrame) with a MultiIndex.

In [37]:
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
                         'foo', 'foo', 'qux', 'qux'],
                        ['one', 'two', 'one', 'two',
                         'one', 'two', 'one', 'two']]))

index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

index


MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           names=['first', 'second'])

In [38]:
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])

df

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B
first,second,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,one,1.134107,0.156934
bar,two,0.466833,-0.664006
baz,one,0.374504,-0.313771
baz,two,0.341261,-0.002158
foo,one,-0.830245,0.266927
foo,two,-0.05453,0.206631
qux,one,0.32059,-1.508668
qux,two,-0.112587,-0.025687


In [45]:
df2 = df[:4]

Use the stack() function to "compress" the columns into the index.


In [46]:
stacked = df2.stack()

stacked

# Datatype is a float64

first  second   
bar    one     A    1.134107
               B    0.156934
       two     A    0.466833
               B   -0.664006
baz    one     A    0.374504
               B   -0.313771
       two     A    0.341261
               B   -0.002158
dtype: float64

With a **"stacked" DataFrame or Series (having a MultiIndex as the index)**, 
the inverse operation of stack() is **unstack(), which by default unstacks the last level**:

In [47]:
stacked.unstack()

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B
first,second,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,one,1.134107,0.156934
bar,two,0.466833,-0.664006
baz,one,0.374504,-0.313771
baz,two,0.341261,-0.002158


In [48]:
stacked.unstack(1)

Unnamed: 0_level_0,second,one,two
first,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,A,1.134107,0.466833
bar,B,0.156934,-0.664006
baz,A,0.374504,0.341261
baz,B,-0.313771,-0.002158


In [56]:
stacked.unstack(0)

Unnamed: 0_level_0,first,bar,baz
second,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
one,A,1.134107,0.374504
one,B,0.156934,-0.313771
two,A,0.466833,0.341261
two,B,-0.664006,-0.002158


### Pivot Tables

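pivot_table() creates a spreadsheet-style pivot table as a DataFrame, aggregating the values for each combination of index and column keys.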

In [62]:
df = pd.DataFrame({    'A': ['one', 'one', 'two', 'three'] * 3,
                       'B': ['A', 'B', 'C'] * 4,
                       'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                       'D': np.random.randn(12),
                       'E': np.random.randn(12)})

In [65]:
pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'], aggfunc='sum')

Unnamed: 0_level_0,C,bar,foo
A,B,Unnamed: 2_level_1,Unnamed: 3_level_1
one,A,0.321466,1.176736
one,B,-0.969991,-0.897635
one,C,0.716673,1.684672
three,A,0.668352,
three,B,,-0.655055
three,C,-1.223354,
two,A,,0.20074
two,B,0.492775,
two,C,,-0.434969


### Apply Functions

- tablewise function application: pipe()
- row or column-wise function application: apply()

#### Tablewise Function Application

When functions operate on an entire DataFrame, use pipe() to chain them. Without pipe(), the calls have to be nested:

In [66]:
def extract_city_name(df):
    """
    Chicago, IL -> Chicago for city_name column
    """
    df['city_name'] = df['city_and_code'].str.split(",").str.get(0)
    return df

def add_country_name(df, country_name=None):
    """
    Chicago -> ChicagoUS for city_and_country column
    """
    col = 'city_name'
    df['city_and_country'] = df[col] + country_name
    return df




df_p = pd.DataFrame({'city_and_code': ['Chicago, IL']})

add_country_name(extract_city_name(df_p), country_name='US')

Unnamed: 0,city_and_code,city_name,city_and_country
0,"Chicago, IL",Chicago,ChicagoUS


Instead of nesting the calls as above, we can use the pipe() function to chain the custom functions:

In [67]:
(df_p.pipe(extract_city_name)
         .pipe(add_country_name, country_name="US"))

Unnamed: 0,city_and_code,city_name,city_and_country
0,"Chicago, IL",Chicago,ChicagoUS


#### Row or Column-wise Function Application

When a function should operate on individual rows or columns, use apply().

apply() runs the function along one axis of the DataFrame: axis=0 (the default) passes each column to the function, and axis=1 passes each row.

In [69]:
df = pd.DataFrame({
        'one': pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
        'two': pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
        'three': pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})


df

Unnamed: 0,one,two,three
a,-0.913029,0.46884,
b,-1.239461,1.565533,-0.729133
c,0.459077,0.436876,-0.623462
d,,0.005138,-1.245232


In [None]:
# numpy mean function
df.apply(np.mean)

one     -0.564471
two      0.619097
three   -0.865942
dtype: float64

In [None]:
# numpy mean function, applied row-wise
df.apply(np.mean, axis=1)

a   -0.222095
b   -0.134354
c    0.090831
d   -0.620047
dtype: float64

In [72]:
# own lambda function
df.apply(lambda x: x.max() - x.min())

one      1.698538
two      1.560394
three    0.621771
dtype: float64

You can use apply() to apply your own function:

In [73]:
def own_function(x):
    return x*x

df.apply(own_function)

Unnamed: 0,one,two,three
a,0.833622,0.219811,
b,1.536264,2.450892,0.531635
c,0.210752,0.190861,0.388704
d,,2.6e-05,1.550603


In [77]:
def subtract_and_divide(x, sub, divide=1):
    return (x - sub) / divide

df.apply(subtract_and_divide, args=(5,3))

Unnamed: 0,one,two,three
a,-1.97101,-1.510387,
b,-2.07982,-1.144822,-1.909711
c,-1.513641,-1.521041,-1.874487
d,,-1.664954,-2.081744


args must be a tuple. Therefore, even if you pass only one extra argument, you have to write it as a one-element tuple with a trailing comma:

args=(5,)
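
A brief sketch of the one-element tuple in use, on a made-up frame named small. It also shows that keyword arguments given to apply() are forwarded to the function.

```python
import pandas as pd

def subtract_and_divide(x, sub, divide=1):
    return (x - sub) / divide

small = pd.DataFrame({'x': [8, 11]})  # hypothetical sample values

# One positional extra argument: note the trailing comma in args.
one_positional = small.apply(subtract_and_divide, args=(5,))

# Extra keyword arguments are passed through to the function as well.
with_keyword = small.apply(subtract_and_divide, args=(5,), divide=3)

print(one_positional)
print(with_keyword)
```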

### Lambda

A lambda is a small, anonymous function that can take any number of arguments but only has one expression.
Pass a lambda to apply() to define a one-off function inline.


#### Example


Find movies with at least one character with a name starting with "Dr.". 
 
Use apply to create a new column indicating whether a movie has such a character, and then count the number of these movies.

In [78]:
cast = pd.read_csv('/Users/mitchellpalmer/Projects/Lighthouse Lab Projects/Python Practices/Pandas/Pandas_exercise/data/imdb_pandas/cast.csv', index_col=None)
cast.head()

Unnamed: 0,title,year,name,type,character,n
0,Closet Monster,2015,Buffy #1,actor,Buffy 4,
1,Suuri illusioni,1985,Homo $,actor,Guests,22.0
2,Battle of the Sexes,2017,$hutter,actor,Bobby Riggs Fan,10.0
3,Secret in Their Eyes,2015,$hutter,actor,2002 Dodger Fan,
4,Steve Jobs,2015,$hutter,actor,1988 Opera House Patron,


In [79]:
cast['has_dr_character'] = cast['character'].apply(lambda x: str(x).startswith("Dr.") if pd.notnull(x) else False)


In [80]:
dr_movies = cast[cast['has_dr_character']]
dr_movie_count = dr_movies[['title', 'year']].drop_duplicates().shape[0]

print("Number of movies with at least one 'Dr.' character:", dr_movie_count)


Number of movies with at least one 'Dr.' character: 18676


- cast['character'] is a Series (i.e., one column from your DataFrame).

- .apply(lambda x: ...) means:
    - For each row in the 'character' column,
    - The value in that row is passed into the lambda function as the variable x.
- lambda x: str(x).startswith("Dr.") if pd.notnull(x) else False
    - If the value is NaN, the pd.notnull(x) check fails and the lambda safely returns False.
    - Otherwise, str(x) converts x to a string (in case it's not already a string),
    - and startswith("Dr.") checks if that string starts with "Dr.".
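
As an alternative sketch, the same flag can be computed without apply() by using the .str accessor, whose startswith() takes an na argument for missing values. The Series characters below is invented sample data, not the cast file.

```python
import pandas as pd

# Hypothetical character names, including a missing value.
characters = pd.Series(['Dr. Watson', 'Doctor Who', None, 'Dr. No'])

# The apply/lambda version from the example above.
with_apply = characters.apply(
    lambda x: str(x).startswith('Dr.') if pd.notnull(x) else False)

# Vectorized version: na=False makes missing values count as False.
vectorized = characters.str.startswith('Dr.', na=False)

print(with_apply.tolist())
print(vectorized.tolist())
```

Both approaches give the same flags here; the vectorized form is shorter and avoids a Python-level call per row.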

