# <font color=hotpink> Python Pandas - I </font>

* Pandas is a Python library for working with structured data stored as rows and columns.
* According to the Wikipedia page on Pandas, the name is derived from "*panel data*", a term for multidimensional structured data sets.
* It is a robust toolkit for analyzing, filtering, manipulating, aggregating, merging, pivoting, and cleaning data.
* Metaphorically, it can be called "*Excel for Python*" or "*Excel on steroids*".

In [1]:
import pandas as pd
import numpy as np

In [2]:
pd.__version__

'1.5.2'

In [3]:
np.__version__

'1.24.0'

## <font color=fe7401> Series </font>

* A one-dimensional labeled array that can hold data of any type.
* A Series combines features of the Python list and dictionary data structures.
* Like a list, it stores data in order; like a dictionary, it assigns an identifier (label) to each value.
* By default, Pandas assigns the integer positions 0, 1, 2, ... as the index labels.
* Unlike a dictionary, a Series can have duplicate keys.


In [4]:
# creation of series

ice_cream = ['Vanilla', 'Chocolate', 'Strawberry']
pd.Series(ice_cream)

0       Vanilla
1     Chocolate
2    Strawberry
dtype: object

In [5]:
# series of int data type

pd.Series([11, 22, 33, 44])

0    11
1    22
2    33
3    44
dtype: int64

In [6]:
# series of heterogeneous dtype

pd.Series(["Apple", 12, 99.56, True])

0    Apple
1       12
2    99.56
3     True
dtype: object

In [7]:
# series from dictionary

runs_score = {
    "Sachin": 99,
    "Virat": 56,
    "Brown": 89
}

pd.Series(runs_score)

Sachin    99
Virat     56
Brown     89
dtype: int64

In [8]:
# parameter - the name given to an expected input
# argument - the concrete value passed for a parameter

# Difficulty (parameter) - Easy, Medium, Hard (arguments)
# pd.Series(data = dataSrc, index = idx)
# parameters: data, index
# arguments: dataSrc, idx

In [9]:
# duplicate keys allowed in series

weekdays = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']
fruits = ['Apple', 'Mango', 'Kiwi', 'Strawberry', 'Apple']

pd.Series(data = weekdays, index = fruits)   # keyword args

Apple            Monday
Mango           Tuesday
Kiwi          Wednesday
Strawberry     Thursday
Apple            Friday
dtype: object
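
To make the contrast with a dictionary concrete, here is a minimal sketch using made-up labels and values (not the notebook's variables). A `dict` keeps only the last value for a repeated key, while a Series keeps every entry.

```python
import pandas as pd

# Hypothetical data: the label "Apple" appears twice.
labels = ["Apple", "Mango", "Apple"]
values = ["Monday", "Tuesday", "Friday"]

# In a dict, the second "Apple" overwrites the first.
as_dict = dict(zip(labels, values))

# In a Series, both "Apple" entries survive.
as_series = pd.Series(values, index=labels)

print(as_dict)
print(as_series)

# Looking up a duplicated label returns a Series with every match.
print(as_series["Apple"].tolist())
```

`as_series.index.is_unique` can be used to check whether a lookup may return more than one value.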

### <font color=blue> Methods on Series </font>

In [10]:
prices = pd.Series([12, 23.56, 8.5, None])
prices

0    12.00
1    23.56
2     8.50
3      NaN
dtype: float64

In [11]:
# count() ignores missing values (None / NaN)

prices.count()

3

In [12]:
prices.sum()

44.06

In [13]:
prices.product()

2403.12

In [14]:
prices.mean()

14.686666666666667

In [15]:
prices.isnull()

0    False
1    False
2    False
3     True
dtype: bool

In [16]:
prices.describe()

count     3.000000
mean     14.686667
std       7.881277
min       8.500000
25%      10.250000
50%      12.000000
75%      17.780000
max      23.560000
dtype: float64
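
The results above all skip the missing value. A short sketch of how that choice can be made explicit, reusing the same four prices; `skipna` is a standard parameter of the reduction methods.

```python
import math
import pandas as pd

prices = pd.Series([12, 23.56, 8.5, None])

n_total = len(prices)      # every slot, including the missing one
n_valid = prices.count()   # only the non-missing values

total_skipping_nan = prices.sum()            # skipna=True is the default
total_with_nan = prices.sum(skipna=False)    # NaN propagates into the result

print(n_total, n_valid)
print(total_skipping_nan, total_with_nan)
print(math.isnan(total_with_nan))
```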

### <font color=blue> Attributes on Series </font>

* Driving a car is an action (method), whereas the car's color is a detail (attribute).

In [17]:
adjectives = pd.Series(["Good", "Handsome", "Smart", "Tall"])
adjectives

0        Good
1    Handsome
2       Smart
3        Tall
dtype: object

In [18]:
adjectives.size

4

In [19]:
adjectives.is_unique

True

In [20]:
adjectives.values

array(['Good', 'Handsome', 'Smart', 'Tall'], dtype=object)

In [21]:
type(adjectives.values)

numpy.ndarray

In [22]:
adjectives.index

RangeIndex(start=0, stop=4, step=1)

In [23]:
type(adjectives.index)

pandas.core.indexes.range.RangeIndex

In [24]:
adjectives.dtype

dtype('O')

### <font color=blue> Import Series with pd.read_csv() </font>

* `pd.read_csv()` returns a DataFrame object.
* In a notebook, a DataFrame renders as a table with alternating row colors, whereas a Series renders as plain black-and-white text in the cell output.

In [25]:
pd.read_csv("./datasets/pokemon.csv")  

Unnamed: 0,Pokemon,Type
0,Bulbasaur,Grass
1,Ivysaur,Grass
2,Venusaur,Grass
3,Charmander,Fire
4,Charmeleon,Fire
...,...,...
716,Yveltal,Dark
717,Zygarde,Dragon
718,Diancie,Rock
719,Hoopa,Psychic


In [26]:
# to get a Series, i.e. a 1-D array, squeeze the single-column DataFrame

pokemon = pd.read_csv("./datasets/pokemon.csv", usecols=["Pokemon"]).squeeze("columns")
pokemon

0       Bulbasaur
1         Ivysaur
2        Venusaur
3      Charmander
4      Charmeleon
          ...    
716       Yveltal
717       Zygarde
718       Diancie
719         Hoopa
720     Volcanion
Name: Pokemon, Length: 721, dtype: object
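
The same read-then-squeeze pattern can be tried without the dataset file. The sketch below assumes a tiny in-memory CSV (the three rows are illustrative, not the full `pokemon.csv`) and reads it through `io.StringIO`.

```python
import io
import pandas as pd

# Illustrative stand-in for ./datasets/pokemon.csv
csv_text = "Pokemon,Type\nBulbasaur,Grass\nCharmander,Fire\nSquirtle,Water\n"

# read_csv always returns a DataFrame, even for a single column.
frame = pd.read_csv(io.StringIO(csv_text), usecols=["Pokemon"])

# squeeze("columns") collapses a one-column DataFrame into a Series.
series = frame.squeeze("columns")

print(type(frame).__name__, type(series).__name__)
print(series.name)
```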

### <font color=blue> head and tail methods on Series </font>

In [27]:
pokemon.head()

0     Bulbasaur
1       Ivysaur
2      Venusaur
3    Charmander
4    Charmeleon
Name: Pokemon, dtype: object

In [28]:
pokemon.head(2)

0    Bulbasaur
1      Ivysaur
Name: Pokemon, dtype: object

In [29]:
pokemon.tail()

716      Yveltal
717      Zygarde
718      Diancie
719        Hoopa
720    Volcanion
Name: Pokemon, dtype: object

### <font color=blue> Views versus Copies </font>

* Some operations in pandas (and NumPy) return views of the original data, while others return copies.
* Put simply, a view is a subset of the original object (DataFrame or Series) that stays linked to the original data, while a copy is an entirely new object.
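
The distinction is easiest to see in NumPy, where the rules are fixed: basic slicing returns a view and integer-array indexing returns a copy. The sketch below uses a made-up array. Whether a given pandas operation returns a view or a copy depends on the operation and the pandas version, so treat this only as an illustration of the concept.

```python
import numpy as np

original = np.array([10, 20, 30, 40])

view = original[:2]         # basic slicing -> view on the same memory
copied = original[[0, 1]]   # integer-array indexing -> independent copy

view[0] = 99      # writes through to `original`
copied[1] = -1    # leaves `original` untouched

print(original)
print(np.shares_memory(original, view))
print(np.shares_memory(original, copied))
```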

### <font color=blue>Passing Series to Python built-in functions </font>

In [30]:
len(pokemon)

721

In [31]:
sorted(pokemon)

['Abomasnow',
 'Abra',
 'Absol',
 'Accelgor',
 'Aegislash',
 'Aerodactyl',
 'Aggron',
 'Aipom',
 'Alakazam',
 'Alomomola',
 'Altaria',
 'Amaura',
 'Ambipom',
 'Amoonguss',
 'Ampharos',
 'Anorith',
 'Arbok',
 'Arcanine',
 'Arceus',
 'Archen',
 'Archeops',
 'Ariados',
 'Armaldo',
 'Aromatisse',
 'Aron',
 'Articuno',
 'Audino',
 'Aurorus',
 'Avalugg',
 'Axew',
 'Azelf',
 'Azumarill',
 'Azurill',
 'Bagon',
 'Baltoy',
 'Banette',
 'Barbaracle',
 'Barboach',
 'Basculin',
 'Bastiodon',
 'Bayleef',
 'Beartic',
 'Beautifly',
 'Beedrill',
 'Beheeyem',
 'Beldum',
 'Bellossom',
 'Bellsprout',
 'Bergmite',
 'Bibarel',
 'Bidoof',
 'Binacle',
 'Bisharp',
 'Blastoise',
 'Blaziken',
 'Blissey',
 'Blitzle',
 'Boldore',
 'Bonsly',
 'Bouffalant',
 'Braixen',
 'Braviary',
 'Breloom',
 'Bronzong',
 'Bronzor',
 'Budew',
 'Buizel',
 'Bulbasaur',
 'Buneary',
 'Bunnelby',
 'Burmy',
 'Butterfree',
 'Cacnea',
 'Cacturne',
 'Camerupt',
 'Carbink',
 'Carnivine',
 'Carracosta',
 'Carvanha',
 'Cascoon',
 'Castform',


In [32]:
max(pokemon)

'Zygarde'

In [33]:
dict(pokemon)

{0: 'Bulbasaur',
 1: 'Ivysaur',
 2: 'Venusaur',
 3: 'Charmander',
 4: 'Charmeleon',
 5: 'Charizard',
 6: 'Squirtle',
 7: 'Wartortle',
 8: 'Blastoise',
 9: 'Caterpie',
 10: 'Metapod',
 11: 'Butterfree',
 12: 'Weedle',
 13: 'Kakuna',
 14: 'Beedrill',
 15: 'Pidgey',
 16: 'Pidgeotto',
 17: 'Pidgeot',
 18: 'Rattata',
 19: 'Raticate',
 20: 'Spearow',
 21: 'Fearow',
 22: 'Ekans',
 23: 'Arbok',
 24: 'Pikachu',
 25: 'Raichu',
 26: 'Sandshrew',
 27: 'Sandslash',
 28: 'Nidoran',
 29: 'Nidorina',
 30: 'Nidoqueen',
 31: 'Nidoran♂',
 32: 'Nidorino',
 33: 'Nidoking',
 34: 'Clefairy',
 35: 'Clefable',
 36: 'Vulpix',
 37: 'Ninetales',
 38: 'Jigglypuff',
 39: 'Wigglytuff',
 40: 'Zubat',
 41: 'Golbat',
 42: 'Oddish',
 43: 'Gloom',
 44: 'Vileplume',
 45: 'Paras',
 46: 'Parasect',
 47: 'Venonat',
 48: 'Venomoth',
 49: 'Diglett',
 50: 'Dugtrio',
 51: 'Meowth',
 52: 'Persian',
 53: 'Psyduck',
 54: 'Golduck',
 55: 'Mankey',
 56: 'Primeape',
 57: 'Growlithe',
 58: 'Arcanine',
 59: 'Poliwag',
 60: 'Poliwhirl',


In [34]:
type(pokemon)

pandas.core.series.Series

### <font color=blue> sort_values method </font>

In [35]:
pokemon.sort_values().head()

459    Abomasnow
62          Abra
358        Absol
616     Accelgor
680    Aegislash
Name: Pokemon, dtype: object

In [36]:
pokemon.sort_values(ascending = False).head()

717     Zygarde
633    Zweilous
40        Zubat
569       Zorua
570     Zoroark
Name: Pokemon, dtype: object

### <font color=blue> sort_index method </font>

In [37]:
poke_idx = pd.read_csv("./datasets/pokemon.csv", index_col = "Type").squeeze("columns")
poke_idx

Type
Grass       Bulbasaur
Grass         Ivysaur
Grass        Venusaur
Fire       Charmander
Fire       Charmeleon
              ...    
Dark          Yveltal
Dragon        Zygarde
Rock          Diancie
Psychic         Hoopa
Fire        Volcanion
Name: Pokemon, Length: 721, dtype: object

In [38]:
poke_idx.sort_index()

Type
Bug        Leavanny
Bug           Burmy
Bug       Scolipede
Bug        Genesect
Bug      Kricketune
            ...    
Water       Panpour
Water          Seel
Water       Dewgong
Water      Politoed
Water      Chinchou
Name: Pokemon, Length: 721, dtype: object

### <font color=blue> Check for inclusion using 'in' keyword </font>

* By default, the `in` keyword checks a Series' index labels, not its values.

In [39]:
pokemon.head()

0     Bulbasaur
1       Ivysaur
2      Venusaur
3    Charmander
4    Charmeleon
Name: Pokemon, dtype: object

In [40]:
'Bulbasaur' in pokemon

False

In [41]:
4 in pokemon

True

In [42]:
4 in pokemon.index

True

In [43]:
'Bulbasaur' in pokemon.values

True
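
A compact sketch of the membership checks above on a made-up three-element Series, plus `isin`, which tests several candidate values at once and returns a boolean Series.

```python
import pandas as pd

names = pd.Series(["Bulbasaur", "Ivysaur", "Venusaur"])

label_found = 1 in names                        # checks the index labels
value_in_series = "Bulbasaur" in names          # still checks the index
value_in_values = "Bulbasaur" in names.values   # checks the data

# Element-wise membership test against a list of candidates.
mask = names.isin(["Ivysaur", "Mew"])

print(label_found, value_in_series, value_in_values)
print(mask.tolist())
```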

### <font color=blue> Extract Series value by Indexing </font>

* With the default integer index, a single negative key such as `sr[-1]` or `sr[-20]` is looked up as a label and raises a `KeyError`, but negative slicing such as `sr[-20 : -10]` works by position.

In [44]:
pokemon.head()

0     Bulbasaur
1       Ivysaur
2      Venusaur
3    Charmander
4    Charmeleon
Name: Pokemon, dtype: object

In [45]:
pokemon[3]

'Charmander'

In [46]:
pokemon[:2]

0    Bulbasaur
1      Ivysaur
Name: Pokemon, dtype: object

In [47]:
pokemon[: 4 : 2]

0    Bulbasaur
2     Venusaur
Name: Pokemon, dtype: object

In [48]:
# pokemon[-1] raises KeyError (-1 is not a label in the index)

pokemon[-5 : -1]

716    Yveltal
717    Zygarde
718    Diancie
719      Hoopa
Name: Pokemon, dtype: object

In [49]:
pokemon[-5 :]

716      Yveltal
717      Zygarde
718      Diancie
719        Hoopa
720    Volcanion
Name: Pokemon, dtype: object

In [50]:
pokemon[[1, 2, 720]]

1        Ivysaur
2       Venusaur
720    Volcanion
Name: Pokemon, dtype: object
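
When the last element is needed, positional access through `iloc` avoids the label lookup. A minimal sketch on a made-up Series with the default integer index:

```python
import pandas as pd

names = pd.Series(["Bulbasaur", "Ivysaur", "Venusaur", "Charmander"])

# iloc is purely positional, so negative positions count from the end.
last = names.iloc[-1]
last_two = names.iloc[-2:]

# Plain brackets treat -1 as a label, which the index 0..3 does not contain.
try:
    names[-1]
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(last)
print(last_two.tolist())
print(lookup_failed)
```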

### <font color=blue> Extract Series Value by Index </font>

In [51]:
poke_idx.head()

Type
Grass     Bulbasaur
Grass       Ivysaur
Grass      Venusaur
Fire     Charmander
Fire     Charmeleon
Name: Pokemon, dtype: object

In [52]:
poke_idx["Fire"][0]

'Charmander'

In [53]:
poke_idx["Grass"].head(2)

Type
Grass    Bulbasaur
Grass      Ivysaur
Name: Pokemon, dtype: object

In [54]:
poke_idx[1]

'Ivysaur'
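
Plain brackets on a string-labeled Series mix two behaviors: a string key is a label, while an integer key falls back to position. Recent pandas releases warn about that fallback, so the sketch below (made-up data) uses `loc` for labels and `iloc` for positions, which states the intent explicitly.

```python
import pandas as pd

by_type = pd.Series(
    ["Bulbasaur", "Ivysaur", "Charmander"],
    index=["Grass", "Grass", "Fire"],
    name="Pokemon",
)

second = by_type.iloc[1]       # position 1, regardless of labels
fire = by_type.loc["Fire"]     # unique label -> single value
grass = by_type.loc["Grass"]   # repeated label -> Series of matches

print(second)
print(fire)
print(grass.tolist())
```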

### <font color=blue> get method </font>

* Its advantage over bracket indexing is that we can provide a fallback value, which is returned if the key is not present.

In [55]:
pokemon.get(3)

'Charmander'

In [56]:
poke_idx.get("Grass")

Type
Grass     Bulbasaur
Grass       Ivysaur
Grass      Venusaur
Grass        Oddish
Grass         Gloom
            ...    
Grass       Chespin
Grass     Quilladin
Grass    Chesnaught
Grass        Skiddo
Grass        Gogoat
Name: Pokemon, Length: 66, dtype: object

In [57]:
poke_idx.get("Nor-World", default = "N/A")

'N/A'

In [58]:
# if any of the requested labels is missing from the index, the default value is returned

poke_idx.get(["Grass", "Nor-World"], default = "N/A")

'N/A'
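
A side-by-side sketch of `get` and bracket indexing for a missing label, on a made-up two-element Series. Without an explicit `default`, `get` returns `None`.

```python
import pandas as pd

by_type = pd.Series(["Bulbasaur", "Charmander"], index=["Grass", "Fire"])

found = by_type.get("Grass", default="N/A")
missing = by_type.get("Nor-World", default="N/A")
no_default = by_type.get("Nor-World")   # falls back to None

# Bracket indexing raises instead of falling back.
try:
    by_type["Nor-World"]
    raised = False
except KeyError:
    raised = True

print(found, missing, no_default, raised)
```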

In [59]:
del(poke_idx)

### <font color=blue> Overwrite a Series Value </font>

* If the label being assigned is not present, it is created and appended to the Series.

In [60]:
sr = pd.read_csv("./datasets/pokemon.csv", usecols=["Pokemon", "Type"], index_col="Pokemon").squeeze("columns")
sr

Pokemon
Bulbasaur       Grass
Ivysaur         Grass
Venusaur        Grass
Charmander       Fire
Charmeleon       Fire
               ...   
Yveltal          Dark
Zygarde        Dragon
Diancie          Rock
Hoopa         Psychic
Volcanion        Fire
Name: Type, Length: 721, dtype: object

In [61]:
sr["Bulbasaur"] = "Water"
sr.head()

Pokemon
Bulbasaur     Water
Ivysaur       Grass
Venusaur      Grass
Charmander     Fire
Charmeleon     Fire
Name: Type, dtype: object

In [62]:
sr[[0, 2]] = "Shadow"
sr.head()

Pokemon
Bulbasaur     Shadow
Ivysaur        Grass
Venusaur      Shadow
Charmander      Fire
Charmeleon      Fire
Name: Type, dtype: object

In [63]:
sr["Gauravior"] = "Universe"
sr

Pokemon
Bulbasaur       Shadow
Ivysaur          Grass
Venusaur        Shadow
Charmander        Fire
Charmeleon        Fire
                ...   
Zygarde         Dragon
Diancie           Rock
Hoopa          Psychic
Volcanion         Fire
Gauravior     Universe
Name: Type, Length: 722, dtype: object
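
The same three overwrites can be written with explicit indexers, which keeps label-based and position-based assignment apart. This is a sketch on a made-up three-row Series, not the full dataset.

```python
import pandas as pd

types = pd.Series(
    ["Grass", "Grass", "Fire"],
    index=["Bulbasaur", "Ivysaur", "Charmander"],
    name="Type",
)

types.loc["Bulbasaur"] = "Water"       # overwrite by label
types.iloc[[0, 2]] = "Shadow"          # overwrite by position
types.loc["Gauravior"] = "Universe"    # unknown label -> new entry appended

print(types)
```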

In [64]:
del(sr)

### <font color=blue> copy method </font>

* The `copy()` method returns a copy of the Series or DataFrame.
* By default, the copy is a "deep copy", meaning that changes made to the original object will NOT be reflected in the copy, and vice versa.

In [65]:
pokemon_df = pd.read_csv("./datasets/pokemon.csv", usecols=["Pokemon"])
pokemon_df.head(3)

Unnamed: 0,Pokemon
0,Bulbasaur
1,Ivysaur
2,Venusaur


In [66]:
pokemon_ser = pokemon_df.squeeze("columns")
pokemon_ser.head(3)

0    Bulbasaur
1      Ivysaur
2     Venusaur
Name: Pokemon, dtype: object

In [67]:
pokemon_ser[0] = "Change"
pokemon_ser.head(1)

0    Change
Name: Pokemon, dtype: object

In [68]:
pokemon_df.head(1)

Unnamed: 0,Pokemon
0,Change


In [69]:
# to prevent this side effect, we use the copy method
del(pokemon_ser)
pokemon_df = pd.read_csv("./datasets/pokemon.csv", usecols=["Pokemon"])
pokemon_ser = pokemon_df.squeeze("columns").copy()
pokemon_ser.head(1)

0    Bulbasaur
Name: Pokemon, dtype: object

In [70]:
pokemon_ser[0] = "Let's change"
pokemon_ser.head(1)

0    Let's change
Name: Pokemon, dtype: object

In [71]:
pokemon_df.head(1)

Unnamed: 0,Pokemon
0,Bulbasaur
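
A minimal sketch of the guarantee `copy()` gives, on a made-up Series. Whether an object obtained without `copy()` shares data with its source can vary by pandas version and settings, which is one reason an explicit copy is the safer choice when the original must stay intact.

```python
import pandas as pd

original = pd.Series(["Bulbasaur", "Ivysaur"], name="Pokemon")

# A deep copy (the default) owns its own data.
independent = original.copy()
independent.iloc[0] = "Changed"

print(original.iloc[0])
print(independent.iloc[0])
```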


In [72]:
del(pokemon_df)
del(pokemon_ser)

### <font color=blue> Broadcasting </font>

In [73]:
google = pd.read_csv("./datasets/google_stock_price.csv")
google.head(3)

Unnamed: 0,Stock Price
0,50.12
1,54.1
2,54.65


In [74]:
# just like broadcasting in NumPy

google = google + 10
google.head(3)

Unnamed: 0,Stock Price
0,60.12
1,64.1
2,64.65
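
Broadcasting a scalar is one half of pandas arithmetic; the other half is that two Series are aligned by index label before the operation. A sketch with made-up values showing both:

```python
import pandas as pd

prices = pd.Series([50.12, 54.10, 54.65])

# A scalar is applied to every element.
shifted = prices + 10

# Two Series are matched label by label; unmatched labels give NaN.
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20], index=["y", "z"])
aligned = a + b

print(shifted)
print(aligned)
```

If missing labels should count as zero instead, `a.add(b, fill_value=0)` is one option.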


In [75]:
del(google) 

### <font color=blue> value_counts method </font>

* Return a Series containing counts of unique values.

In [76]:
poke_idx = pd.read_csv("./datasets/pokemon.csv", index_col="Pokemon").squeeze("columns")
poke_idx.head()

Pokemon
Bulbasaur     Grass
Ivysaur       Grass
Venusaur      Grass
Charmander     Fire
Charmeleon     Fire
Name: Type, dtype: object

In [77]:
poke_idx.value_counts()

Water       105
Normal       93
Grass        66
Bug          63
Fire         47
Psychic      47
Rock         41
Electric     36
Ground       30
Poison       28
Dark         28
Fighting     25
Dragon       24
Ghost        23
Ice          23
Steel        22
Fairy        17
Flying        3
Name: Type, dtype: int64
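
Two commonly used options of `value_counts` are `normalize` (shares instead of counts) and `dropna` (whether missing values get their own row). A sketch on a made-up Series:

```python
import pandas as pd

types = pd.Series(["Water", "Grass", "Water", "Fire", None, "Water"])

counts = types.value_counts()                     # missing values excluded
shares = types.value_counts(normalize=True)       # fractions of non-missing
with_missing = types.value_counts(dropna=False)   # missing values counted too

print(counts)
print(shares)
print(with_missing)
```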

### <font color=blue> apply method </font>

* Invoke a specified function on every value of a Series.

In [78]:
poke_idx.head(5)

Pokemon
Bulbasaur     Grass
Ivysaur       Grass
Venusaur      Grass
Charmander     Fire
Charmeleon     Fire
Name: Type, dtype: object

In [79]:
def rank_pokemon(pokemon_type):
    if pokemon_type in ["Grass", "Bug"]:
        return "Classic"
    else:
        return "Unique"

In [80]:
poke_idx.apply(rank_pokemon).head()

Pokemon
Bulbasaur     Classic
Ivysaur       Classic
Venusaur      Classic
Charmander     Unique
Charmeleon     Unique
Name: Type, dtype: object

### <font color=blue> map method </font>
* Map values of Series according to an input mapping or function.

In [81]:
mappings = {
    "Grass": "Classic",
    "Water": "Classic",
    "Fire": "Classic",
    "Dark": "Unique",
    "Psychic": "Unique"
}

In [82]:
poke_idx.head()

Pokemon
Bulbasaur     Grass
Ivysaur       Grass
Venusaur      Grass
Charmander     Fire
Charmeleon     Fire
Name: Type, dtype: object

In [83]:
poke_idx.map(mappings)

Pokemon
Bulbasaur     Classic
Ivysaur       Classic
Venusaur      Classic
Charmander    Classic
Charmeleon    Classic
               ...   
Yveltal        Unique
Zygarde           NaN
Diancie           NaN
Hoopa          Unique
Volcanion     Classic
Name: Type, Length: 721, dtype: object

In [84]:
# we can also pass a Series as the mapping

mappings_ser = pd.Series(mappings)
mappings_ser

Grass      Classic
Water      Classic
Fire       Classic
Dark        Unique
Psychic     Unique
dtype: object

In [85]:
poke_idx.map(mappings_ser)

Pokemon
Bulbasaur     Classic
Ivysaur       Classic
Venusaur      Classic
Charmander    Classic
Charmeleon    Classic
               ...   
Yveltal        Unique
Zygarde           NaN
Diancie           NaN
Hoopa          Unique
Volcanion     Classic
Name: Type, Length: 721, dtype: object
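
Values with no entry in the mapping come back as `NaN`, as the Zygarde and Diancie rows show. Two ways to supply a fallback are sketched below on made-up data; the label "Unknown" is an arbitrary choice for illustration.

```python
import pandas as pd

types = pd.Series(
    ["Grass", "Fire", "Dragon"],
    index=["Bulbasaur", "Charmander", "Zygarde"],
)
mapping = {"Grass": "Classic", "Fire": "Classic"}

mapped = types.map(mapping)                     # "Dragon" -> NaN
filled = types.map(mapping).fillna("Unknown")   # patch the gaps afterwards

# Alternative: resolve the fallback inside the function passed to apply.
via_apply = types.apply(lambda t: mapping.get(t, "Unknown"))

print(mapped)
print(filled)
```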

In [86]:
del(mappings)
del(mappings_ser)

## <font color=#fe7401> DataFrames </font>

* A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array or a table with rows and columns.

### <font color=blue> Knowing a Series/DataFrame </font>


* The `describe` method helps in understanding a Series.
* It reports summary statistics such as the mean and standard deviation.
* For a DataFrame, the two main methods for inspecting the data are `info` and `describe`.
* By default, `describe` summarizes only the numeric columns.

In [87]:
df = pd.DataFrame({'temp':pd.Series(28 + 10*np.random.randn(10)), 
                'rain':pd.Series(100 + 50*np.random.randn(10)),
             'location':list('AAAAABBBBB')})
df

Unnamed: 0,temp,rain,location
0,31.043461,94.706951,A
1,24.293718,40.016916,A
2,49.014004,60.64645,A
3,29.672352,107.967292,A
4,22.313781,138.697785,A
5,16.647243,187.910014,B
6,13.658539,150.717186,B
7,26.59161,102.786701,B
8,33.228899,127.169316,B
9,16.487881,128.133249,B


In [88]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   temp      10 non-null     float64
 1   rain      10 non-null     float64
 2   location  10 non-null     object 
dtypes: float64(2), object(1)
memory usage: 368.0+ bytes


In [89]:
df.describe()

Unnamed: 0,temp,rain
count,10.0,10.0
mean,26.295149,113.875186
std,10.365909,43.009427
min,13.658539,40.016916
25%,18.063877,96.726888
50%,25.442664,117.568304
75%,30.700684,136.056651
max,49.014004,187.910014


In [90]:
# use the include argument to whitelist the data types to include in the result
# the call below summarizes the `location` column

df.describe(include=['object'])

Unnamed: 0,location
count,10
unique,2
top,A
freq,5
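
To see numeric and text summaries in one table, `describe` also accepts `include="all"`. A sketch with fixed, made-up values (so the numbers are reproducible, unlike the random frame above):

```python
import pandas as pd

weather = pd.DataFrame({
    "temp": [31.0, 24.5, 49.0, 16.5],
    "location": list("AAAB"),
})

numeric_only = weather.describe()             # default: numeric columns only
everything = weather.describe(include="all")  # every column in one table

print(numeric_only.columns.tolist())
print(everything)
```

Statistics that do not apply to a column, such as `mean` for `location`, show up as `NaN` in the combined table.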


### <font color=blue> Methods and Attributes: Series vs. DataFrames </font>

In [91]:
nba = pd.read_csv("./datasets/nba.csv")
nba

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0
...,...,...,...,...,...,...,...,...,...
453,Shelvin Mack,Utah Jazz,8.0,PG,26.0,6-3,203.0,Butler,2433333.0
454,Raul Neto,Utah Jazz,25.0,PG,24.0,6-1,179.0,,900000.0
455,Tibor Pleiss,Utah Jazz,21.0,C,26.0,7-3,256.0,,2900000.0
456,Jeff Withey,Utah Jazz,24.0,C,26.0,7-0,231.0,Kansas,947276.0


In [92]:
s = pd.Series([11, 22, 33, 44, 55])
s

0    11
1    22
2    33
3    44
4    55
dtype: int64

In [93]:
s.head(2)

0    11
1    22
dtype: int64

In [94]:
nba.tail(2)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
456,Jeff Withey,Utah Jazz,24.0,C,26.0,7-0,231.0,Kansas,947276.0
457,,,,,,,,,


In [95]:
s.index

RangeIndex(start=0, stop=5, step=1)

In [96]:
nba.index

RangeIndex(start=0, stop=458, step=1)

In [97]:
s.values

array([11, 22, 33, 44, 55], dtype=int64)

In [98]:
nba.values

array([['Avery Bradley', 'Boston Celtics', 0.0, ..., 180.0, 'Texas',
        7730337.0],
       ['Jae Crowder', 'Boston Celtics', 99.0, ..., 235.0, 'Marquette',
        6796117.0],
       ['John Holland', 'Boston Celtics', 30.0, ..., 205.0,
        'Boston University', nan],
       ...,
       ['Tibor Pleiss', 'Utah Jazz', 21.0, ..., 256.0, nan, 2900000.0],
       ['Jeff Withey', 'Utah Jazz', 24.0, ..., 231.0, 'Kansas', 947276.0],
       [nan, nan, nan, ..., nan, nan, nan]], dtype=object)

In [99]:
s.shape

(5,)

In [100]:
nba.shape

(458, 9)

In [101]:
s.dtype

dtype('int64')

In [102]:
s.dtypes

dtype('int64')

In [103]:
# nba.dtype -> Error

nba.dtypes

Name         object
Team         object
Number      float64
Position     object
Age         float64
Height       object
Weight      float64
College      object
Salary      float64
dtype: object

In [104]:
# s.columns -> Error

nba.columns

Index(['Name', 'Team', 'Number', 'Position', 'Age', 'Height', 'Weight',
       'College', 'Salary'],
      dtype='object')

In [105]:
s.info()

<class 'pandas.core.series.Series'>
RangeIndex: 5 entries, 0 to 4
Series name: None
Non-Null Count  Dtype
--------------  -----
5 non-null      int64
dtypes: int64(1)
memory usage: 168.0 bytes


In [106]:
nba.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 458 entries, 0 to 457
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Name      457 non-null    object 
 1   Team      457 non-null    object 
 2   Number    457 non-null    float64
 3   Position  457 non-null    object 
 4   Age       457 non-null    float64
 5   Height    457 non-null    object 
 6   Weight    457 non-null    float64
 7   College   373 non-null    object 
 8   Salary    446 non-null    float64
dtypes: float64(4), object(5)
memory usage: 32.3+ KB


In [107]:
s.describe()

count     5.000000
mean     33.000000
std      17.392527
min      11.000000
25%      22.000000
50%      33.000000
75%      44.000000
max      55.000000
dtype: float64

In [108]:
nba.describe()

Unnamed: 0,Number,Age,Weight,Salary
count,457.0,457.0,457.0,446.0
mean,17.678337,26.938731,221.522976,4842684.0
std,15.96609,4.404016,26.368343,5229238.0
min,0.0,19.0,161.0,30888.0
25%,5.0,24.0,200.0,1044792.0
50%,13.0,26.0,220.0,2839073.0
75%,25.0,30.0,240.0,6500000.0
max,99.0,40.0,307.0,25000000.0


In [109]:
s.axes

[RangeIndex(start=0, stop=5, step=1)]

In [110]:
nba.axes

[RangeIndex(start=0, stop=458, step=1),
 Index(['Name', 'Team', 'Number', 'Position', 'Age', 'Height', 'Weight',
        'College', 'Salary'],
       dtype='object')]

In [111]:
s.hasnans

False

In [112]:
# nba.hasnans -> Error

nba.isnull().any()    # DataFrame.isnull is an alias for DataFrame.isna.

Name        True
Team        True
Number      True
Position    True
Age         True
Height      True
Weight      True
College     True
Salary      True
dtype: bool

In [113]:
nba.tail(1)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
457,,,,,,,,,


In [114]:
nba.isnull().sum()

Name         1
Team         1
Number       1
Position     1
Age          1
Height       1
Weight       1
College     85
Salary      12
dtype: int64

### <font color=blue> idxmax method </font>

* Return index of first occurrence of maximum over requested axis.


In [115]:
sample_df = pd.DataFrame(data = np.random.normal(loc = 100, scale = 30, size = 100), columns = ["Values"])
sample_df

Unnamed: 0,Values
0,96.677966
1,96.697213
2,81.053520
3,82.398728
4,106.062393
...,...
95,113.206976
96,128.926721
97,108.024799
98,116.878853


In [116]:
# idxmin() is the complementary method (index of the minimum)

sample_df.idxmax()

Values    47
dtype: int64

In [117]:
sample_df.nlargest(n = 2, columns = "Values")

Unnamed: 0,Values
47,196.270627
30,188.014668
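
`idxmax` returns an index label, which can then be fed to `loc` to retrieve the whole row. A deterministic sketch with made-up values and string labels:

```python
import pandas as pd

scores = pd.DataFrame(
    {"Values": [96.7, 81.1, 196.3, 106.1, 188.0]},
    index=["a", "b", "c", "d", "e"],
)

top_label = scores["Values"].idxmax()   # label of the largest value
low_label = scores["Values"].idxmin()   # label of the smallest value
top_row = scores.loc[top_label]         # full row for that label

top_two = scores.nlargest(n=2, columns="Values")

print(top_label, low_label)
print(top_two)
```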


### <font color=blue> Difference between shared methods </font>

In [118]:
s.head(10)

0    11
1    22
2    33
3    44
4    55
dtype: int64

In [119]:
s.sum()

165

In [120]:
rev = pd.read_csv("./datasets/revenue.csv")
rev

Unnamed: 0,Date,New York,Los Angeles,Miami
0,1/1/16,985,122,499
1,1/2/16,738,788,534
2,1/3/16,14,20,933
3,1/4/16,730,904,885
4,1/5/16,114,71,253
5,1/6/16,936,502,497
6,1/7/16,123,996,115
7,1/8/16,935,492,886
8,1/9/16,846,954,823
9,1/10/16,54,285,216


In [121]:
# summing column-wise (the default)

rev.sum(numeric_only = True)

New York       5475
Los Angeles    5134
Miami          5641
dtype: int64

In [122]:
rev.sum(axis = 0, numeric_only = True)

New York       5475
Los Angeles    5134
Miami          5641
dtype: int64

In [123]:
# a more intuitive way

rev.sum(axis = "index", numeric_only = True)

New York       5475
Los Angeles    5134
Miami          5641
dtype: int64

In [124]:
# summing row-wise

rev.sum(numeric_only = True, axis = 1)

0    1606
1    2060
2     967
3    2519
4     438
5    1935
6    1234
7    2313
8    2623
9     555
dtype: int64

In [125]:
rev.sum(numeric_only = True, axis = "columns")

0    1606
1    2060
2     967
3    2519
4     438
5    1935
6    1234
7    2313
8    2623
9     555
dtype: int64
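
The `axis` argument names the axis that gets collapsed: `"index"` collapses the rows and leaves one result per column, while `"columns"` collapses the columns and leaves one result per row. A sketch using the first two rows of the revenue table:

```python
import pandas as pd

revenue = pd.DataFrame({
    "New York": [985, 738],
    "Los Angeles": [122, 788],
    "Miami": [499, 534],
})

per_city = revenue.sum(axis="index")    # one total per column
per_day = revenue.sum(axis="columns")   # one total per row

print(per_city)
print(per_day)
```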

### <font color="blue"> Select a column from the DataFrame </font>

In [126]:
nba.head()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0


In [127]:
# attribute-style access does not always work,
# e.g. when the column name contains a whitespace character

print(type(nba.Age))
nba.Age

<class 'pandas.core.series.Series'>


0      25.0
1      25.0
2      27.0
3      22.0
4      29.0
       ... 
453    26.0
454    24.0
455    26.0
456    26.0
457     NaN
Name: Age, Length: 458, dtype: float64

In [128]:
nba.College

0                  Texas
1              Marquette
2      Boston University
3          Georgia State
4                    NaN
             ...        
453               Butler
454                  NaN
455                  NaN
456               Kansas
457                  NaN
Name: College, Length: 458, dtype: object

In [129]:
# bracket access works even when the column name contains whitespace

nba["College"]

0                  Texas
1              Marquette
2      Boston University
3          Georgia State
4                    NaN
             ...        
453               Butler
454                  NaN
455                  NaN
456               Kansas
457                  NaN
Name: College, Length: 458, dtype: object

In [130]:
nba["College"].head().values

array(['Texas', 'Marquette', 'Boston University', 'Georgia State', nan],
      dtype=object)
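
Besides whitespace, attribute access also breaks when a column name collides with an existing DataFrame attribute or method. The sketch below uses a made-up frame with a column named `count` to show both cases.

```python
import pandas as pd

revenue = pd.DataFrame({
    "New York": [985, 738],
    "Miami": [499, 534],
    "count": [1, 2],
})

miami = revenue.Miami             # works: a valid, non-clashing identifier
new_york = revenue["New York"]    # whitespace: brackets are required

shadowed = revenue.count          # the DataFrame.count method, not the column
count_col = revenue["count"]      # brackets always return the column

print(new_york.tolist())
print(callable(shadowed))
print(count_col.tolist())
```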

####  Selecting two or more columns from the DataFrame

In [131]:
nba.head()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0


In [132]:
# the result is a DataFrame with columns in the order passed; a column name that doesn't exist raises KeyError

col_to_select = ["Team", "College", "Position"]
nba[col_to_select].head()

Unnamed: 0,Team,College,Position
0,Boston Celtics,Texas,PG
1,Boston Celtics,Marquette,SF
2,Boston Celtics,Boston University,SG
3,Boston Celtics,Georgia State,SG
4,Boston Celtics,,PF


In [133]:
# we can also use the double square bracket format

nba[["Age", "Position"]].head()

Unnamed: 0,Age,Position
0,25.0,PG
1,25.0,SF
2,27.0,SG
3,22.0,SG
4,29.0,PF


In [134]:
# nba["College", "Team"] -> KeyError

nba[col_to_select][["College", "Team"]].head()

Unnamed: 0,College,Team
0,Texas,Boston Celtics
1,Marquette,Boston Celtics
2,Boston University,Boston Celtics
3,Georgia State,Boston Celtics
4,,Boston Celtics


### <font color=blue> Add a new column to the DataFrame </font>

* We can add columns in the following ways:
   1. Add columns at the end of the table.
   2. Add columns at a specific position using the `insert` method.
   3. Add columns with the `loc` indexer.
   4. Add columns with the `assign` method.

In [135]:
# adding a column at the end of the DataFrame, using the assignment operator

nba["MyCol1"] = "Fills"
nba.head()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary,MyCol1
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0,Fills
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0,Fills
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,,Fills
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0,Fills
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0,Fills


In [136]:
# we can add multiple columns as well
nba[["MyCol2", "MyCol3"]] = "N/A"
nba.head()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary,MyCol1,MyCol2,MyCol3
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0,Fills,,
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0,Fills,,
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,,Fills,,
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0,Fills,,
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0,Fills,,


In [137]:
# now let's drop these columns

nba.drop(["MyCol1", "MyCol2", "MyCol3"], axis = 1, inplace = True)
nba.head()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0


In [138]:
# adding a column to the DataFrame at a specified position using the insert method

nba.insert(loc = 2, column = "Col_X", value = "NaN")
nba.head()

Unnamed: 0,Name,Team,Col_X,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,,30.0,SG,27.0,6-5,205.0,Boston University,
3,R.J. Hunter,Boston Celtics,,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,,8.0,PF,29.0,6-10,231.0,,5000000.0


In [139]:
nba.drop(labels = ["Col_X"], inplace = True, axis = "columns")

In [140]:
nba.head()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0
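
The list at the top of this section also mentions `loc` and `assign`, which the cells here do not demonstrate. A minimal sketch on a made-up two-row frame; the column names `Team` and `AgeNextYear` are chosen only for illustration.

```python
import pandas as pd

players = pd.DataFrame({
    "Name": ["Avery Bradley", "Jae Crowder"],
    "Age": [25.0, 25.0],
})

# loc with a new column label creates the column in place.
players.loc[:, "Team"] = "Boston Celtics"

# assign returns a new DataFrame and leaves `players` unchanged.
with_extra = players.assign(AgeNextYear=players["Age"] + 1)

print(players.columns.tolist())
print(with_extra.columns.tolist())
```

Because `assign` does not modify the original, it fits well in method chains.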


### <font color=blue> Create a new column from an existing column </font>

In [141]:
nba["Age after a Decade"] = nba["Age"] + 10

In [142]:
nba["Salary Hike"] = nba["Salary"].mul(0.05)

In [143]:
nba.head()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary,Age after a Decade,Salary Hike
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0,35.0,386516.85
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0,35.0,339805.85
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,,37.0,
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0,32.0,57432.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0,39.0,250000.0


#### Review of value_counts

In [144]:
nba["Position"].value_counts()

SG    102
PF    100
PG     92
SF     85
C      78
Name: Position, dtype: int64

### <font color=blue> Drop Rows with Null values </font>

In [145]:
nba.tail()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary,Age after a Decade,Salary Hike
453,Shelvin Mack,Utah Jazz,8.0,PG,26.0,6-3,203.0,Butler,2433333.0,36.0,121666.65
454,Raul Neto,Utah Jazz,25.0,PG,24.0,6-1,179.0,,900000.0,34.0,45000.0
455,Tibor Pleiss,Utah Jazz,21.0,C,26.0,7-3,256.0,,2900000.0,36.0,145000.0
456,Jeff Withey,Utah Jazz,24.0,C,26.0,7-0,231.0,Kansas,947276.0,36.0,47363.8
457,,,,,,,,,,,


In [146]:
# remove every row that contains at least one NaN value

nba.dropna()
# nba.dropna(how="any")    # same as above

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary,Age after a Decade,Salary Hike
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0,35.0,386516.85
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0,35.0,339805.85
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0,32.0,57432.00
6,Jordan Mickey,Boston Celtics,55.0,PF,21.0,6-8,235.0,LSU,1170960.0,31.0,58548.00
7,Kelly Olynyk,Boston Celtics,41.0,C,25.0,7-0,238.0,Gonzaga,2165160.0,35.0,108258.00
...,...,...,...,...,...,...,...,...,...,...,...
449,Rodney Hood,Utah Jazz,5.0,SG,23.0,6-8,206.0,Duke,1348440.0,33.0,67422.00
451,Chris Johnson,Utah Jazz,23.0,SF,26.0,6-6,206.0,Dayton,981348.0,36.0,49067.40
452,Trey Lyles,Utah Jazz,41.0,PF,20.0,6-10,234.0,Kentucky,2239800.0,30.0,111990.00
453,Shelvin Mack,Utah Jazz,8.0,PG,26.0,6-3,203.0,Butler,2433333.0,36.0,121666.65


In [147]:
# remove only the rows in which every column value is NaN

nba.dropna(how = "all")   # only the last row is removed

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary,Age after a Decade,Salary Hike
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0,35.0,386516.85
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0,35.0,339805.85
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,,37.0,
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0,32.0,57432.00
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0,39.0,250000.00
...,...,...,...,...,...,...,...,...,...,...,...
452,Trey Lyles,Utah Jazz,41.0,PF,20.0,6-10,234.0,Kentucky,2239800.0,30.0,111990.00
453,Shelvin Mack,Utah Jazz,8.0,PG,26.0,6-3,203.0,Butler,2433333.0,36.0,121666.65
454,Raul Neto,Utah Jazz,25.0,PG,24.0,6-1,179.0,,900000.0,34.0,45000.00
455,Tibor Pleiss,Utah Jazz,21.0,C,26.0,7-3,256.0,,2900000.0,36.0,145000.00


In [148]:
# remove rows that have NaN in any of the listed columns

nba.dropna(subset = ["College", "Salary Hike"])

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary,Age after a Decade,Salary Hike
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0,35.0,386516.85
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0,35.0,339805.85
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0,32.0,57432.00
6,Jordan Mickey,Boston Celtics,55.0,PF,21.0,6-8,235.0,LSU,1170960.0,31.0,58548.00
7,Kelly Olynyk,Boston Celtics,41.0,C,25.0,7-0,238.0,Gonzaga,2165160.0,35.0,108258.00
...,...,...,...,...,...,...,...,...,...,...,...
449,Rodney Hood,Utah Jazz,5.0,SG,23.0,6-8,206.0,Duke,1348440.0,33.0,67422.00
451,Chris Johnson,Utah Jazz,23.0,SF,26.0,6-6,206.0,Dayton,981348.0,36.0,49067.40
452,Trey Lyles,Utah Jazz,41.0,PF,20.0,6-10,234.0,Kentucky,2239800.0,30.0,111990.00
453,Shelvin Mack,Utah Jazz,8.0,PG,26.0,6-3,203.0,Butler,2433333.0,36.0,121666.65
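
The `dropna` variants above are easier to compare on a frame small enough to check by eye. A minimal sketch, assuming a hypothetical four-row frame `toy`; the `thresh` parameter is an addition that the cells above do not use:

```python
import numpy as np
import pandas as pd

# hypothetical toy data: row 0 is complete, row 3 is entirely NaN
toy = pd.DataFrame({
    "Name": ["A", "B", "C", np.nan],
    "College": ["Texas", np.nan, "Duke", np.nan],
    "Salary": [100.0, 200.0, np.nan, np.nan],
})

any_dropped = toy.dropna()                         # keeps row 0 only
all_dropped = toy.dropna(how="all")                # drops only row 3
subset_dropped = toy.dropna(subset=["College"])    # drops rows 1 and 3
thresh_kept = toy.dropna(thresh=2)                 # keeps rows with at least 2 non-NaN values

print(any_dropped.index.tolist(), all_dropped.index.tolist())
```

All four calls return a new DataFrame and leave `toy` untouched.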


In [149]:
nba.drop(["Age after a Decade", "Salary Hike"], inplace = True, axis = 1)

### <font color=blue>Fill in Missing DataFrame Values with the fillna Method</font>

In [150]:
nba.tail()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
453,Shelvin Mack,Utah Jazz,8.0,PG,26.0,6-3,203.0,Butler,2433333.0
454,Raul Neto,Utah Jazz,25.0,PG,24.0,6-1,179.0,,900000.0
455,Tibor Pleiss,Utah Jazz,21.0,C,26.0,7-3,256.0,,2900000.0
456,Jeff Withey,Utah Jazz,24.0,C,26.0,7-0,231.0,Kansas,947276.0
457,,,,,,,,,


In [151]:
# replace every NaN value with 0; this returns a new df and leaves nba unchanged

nba.fillna(0)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,0.0
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,0,5000000.0
...,...,...,...,...,...,...,...,...,...
453,Shelvin Mack,Utah Jazz,8.0,PG,26.0,6-3,203.0,Butler,2433333.0
454,Raul Neto,Utah Jazz,25.0,PG,24.0,6-1,179.0,0,900000.0
455,Tibor Pleiss,Utah Jazz,21.0,C,26.0,7-3,256.0,0,2900000.0
456,Jeff Withey,Utah Jazz,24.0,C,26.0,7-0,231.0,Kansas,947276.0


In [152]:
nba.tail()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
453,Shelvin Mack,Utah Jazz,8.0,PG,26.0,6-3,203.0,Butler,2433333.0
454,Raul Neto,Utah Jazz,25.0,PG,24.0,6-1,179.0,,900000.0
455,Tibor Pleiss,Utah Jazz,21.0,C,26.0,7-3,256.0,,2900000.0
456,Jeff Withey,Utah Jazz,24.0,C,26.0,7-0,231.0,Kansas,947276.0
457,,,,,,,,,


In [153]:
# fill the NaN values of the College column with "Unknown"; this returns a new Series

nba["College"].fillna("Unknown")

0                  Texas
1              Marquette
2      Boston University
3          Georgia State
4                Unknown
             ...        
453               Butler
454              Unknown
455              Unknown
456               Kansas
457              Unknown
Name: College, Length: 458, dtype: object

In [154]:
nba.tail()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
453,Shelvin Mack,Utah Jazz,8.0,PG,26.0,6-3,203.0,Butler,2433333.0
454,Raul Neto,Utah Jazz,25.0,PG,24.0,6-1,179.0,,900000.0
455,Tibor Pleiss,Utah Jazz,21.0,C,26.0,7-3,256.0,,2900000.0
456,Jeff Withey,Utah Jazz,24.0,C,26.0,7-0,231.0,Kansas,947276.0
457,,,,,,,,,


In [155]:
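# to persist the change in the df, either pass inplace = True (shown here) or assign the result back (next cell)
# note: newer pandas releases warn about inplace = True on a selected column, so assignment is the safer form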

nba["College"].fillna("Unknown", inplace = True)
nba.tail()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
453,Shelvin Mack,Utah Jazz,8.0,PG,26.0,6-3,203.0,Butler,2433333.0
454,Raul Neto,Utah Jazz,25.0,PG,24.0,6-1,179.0,Unknown,900000.0
455,Tibor Pleiss,Utah Jazz,21.0,C,26.0,7-3,256.0,Unknown,2900000.0
456,Jeff Withey,Utah Jazz,24.0,C,26.0,7-0,231.0,Kansas,947276.0
457,,,,,,,,Unknown,


In [156]:
nba["Salary"] = nba["Salary"].fillna(0)
nba.tail()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
453,Shelvin Mack,Utah Jazz,8.0,PG,26.0,6-3,203.0,Butler,2433333.0
454,Raul Neto,Utah Jazz,25.0,PG,24.0,6-1,179.0,Unknown,900000.0
455,Tibor Pleiss,Utah Jazz,21.0,C,26.0,7-3,256.0,Unknown,2900000.0
456,Jeff Withey,Utah Jazz,24.0,C,26.0,7-0,231.0,Kansas,947276.0
457,,,,,,,,Unknown,0.0
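
The two column-wise fills above can also be expressed in one call by passing a dictionary to `DataFrame.fillna`. A minimal sketch, assuming a hypothetical three-row frame `toy`:

```python
import numpy as np
import pandas as pd

# hypothetical toy data with one gap per column
toy = pd.DataFrame({
    "College": ["Texas", np.nan, "Duke"],
    "Salary": [100.0, np.nan, 300.0],
})

# one fill value per column; returns a new DataFrame
filled = toy.fillna({"College": "Unknown", "Salary": 0})

print(filled)
```

Columns not named in the dictionary are left as they are.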


### <font color=blue> astype method </font>

* `astype` cannot convert NaN values to an integer dtype and raises an error instead. So, it's good to first deal with NaN values using either `fillna` or `dropna`.
* If a numerical column contains a NaN value, pandas stores the whole column as a *float* data type, even when every other value is an integer.
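
A minimal sketch of the two points above, assuming a hypothetical `ages` Series. The nullable `"Int64"` dtype in the last line is an alternative that this notebook does not use:

```python
import numpy as np
import pandas as pd

ages = pd.Series([25.0, np.nan, 27.0])   # hypothetical values; NaN forces float64

try:
    ages.astype("int64")
    conversion_failed = False
except ValueError:
    # pandas refuses to cast NaN to a NumPy integer dtype
    conversion_failed = True

after_fill = ages.fillna(0).astype("int64")   # works once the NaN is handled
nullable = ages.astype("Int64")               # nullable integer dtype keeps the gap as <NA>

print(conversion_failed, after_fill.tolist(), nullable.dtype)
```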

In [157]:
nba = pd.read_csv("./datasets/nba.csv").dropna(how = "all")
nba

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0
...,...,...,...,...,...,...,...,...,...
452,Trey Lyles,Utah Jazz,41.0,PF,20.0,6-10,234.0,Kentucky,2239800.0
453,Shelvin Mack,Utah Jazz,8.0,PG,26.0,6-3,203.0,Butler,2433333.0
454,Raul Neto,Utah Jazz,25.0,PG,24.0,6-1,179.0,,900000.0
455,Tibor Pleiss,Utah Jazz,21.0,C,26.0,7-3,256.0,,2900000.0


In [158]:
nba["Age"].hasnans

False

In [159]:
nba["Age"].astype("int16")

0      25
1      25
2      27
3      22
4      29
       ..
452    20
453    26
454    24
455    26
456    26
Name: Age, Length: 457, dtype: int16

In [160]:
# the conversion above returns a new Series; the original df is unchanged

nba["Age"].dtype

dtype('float64')

In [161]:
# to change the original df use assignment operator

nba["Age"] = nba["Age"].astype("int16")

In [162]:
nba["Age"].dtype

dtype('int16')

In [163]:
nba["Salary"].hasnans

True

In [164]:
# now convert the Salary column dtype to string
# first fill the missing values

nba["Salary"] = nba["Salary"].fillna(value = 0)

In [165]:
nba.head()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27,6-5,205.0,Boston University,0.0
3,R.J. Hunter,Boston Celtics,28.0,SG,22,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29,6-10,231.0,,5000000.0


In [166]:
# float -> int -> str

nba["Salary"] = nba["Salary"].astype("int").astype("str")
nba.head()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25,6-2,180.0,Texas,7730337
1,Jae Crowder,Boston Celtics,99.0,SF,25,6-6,235.0,Marquette,6796117
2,John Holland,Boston Celtics,30.0,SG,27,6-5,205.0,Boston University,0
3,R.J. Hunter,Boston Celtics,28.0,SG,22,6-5,185.0,Georgia State,1148640
4,Jonas Jerebko,Boston Celtics,8.0,PF,29,6-10,231.0,,5000000


####  category datatype

* Why use categorical datatypes in pandas?
    1. Memory usage: for string columns with many repeated values, categories can drastically reduce the memory required to store the data.
    2. Runtime performance: there are optimizations in place which can improve execution speed for certain operations.

In [167]:
# mind the memory-usage

nba.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 457 entries, 0 to 456
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Name      457 non-null    object 
 1   Team      457 non-null    object 
 2   Number    457 non-null    float64
 3   Position  457 non-null    object 
 4   Age       457 non-null    int16  
 5   Height    457 non-null    object 
 6   Weight    457 non-null    float64
 7   College   373 non-null    object 
 8   Salary    457 non-null    object 
dtypes: float64(2), int16(1), object(6)
memory usage: 33.0+ KB


In [168]:
# it's taking around 30 KB, before category

nba["Position"].memory_usage(deep = True)

30541

In [169]:
# object dtype (the values are Python strings)

nba["Position"].dtype

dtype('O')

In [170]:
nba["Position"].value_counts()

SG    102
PF    100
PG     92
SF     85
C      78
Name: Position, dtype: int64

In [171]:
nba["Position"].nunique()

5

In [172]:
nba["Position"].astype("category")

0      PG
1      SF
2      SG
3      SG
4      PF
       ..
452    PF
453    PG
454    PG
455     C
456     C
Name: Position, Length: 457, dtype: category
Categories (5, object): ['C', 'PF', 'PG', 'SF', 'SG']

In [173]:
# as there are only 5 unique values, we can store them as categories instead of repeating the same strings in every row

nba["Position"] = nba["Position"].astype("category")

In [174]:
nba["Position"].dtype

CategoricalDtype(categories=['C', 'PF', 'PG', 'SF', 'SG'], ordered=False)

In [175]:
# previously the df memory usage was 33 KB and now it's 30 KB

nba.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 457 entries, 0 to 456
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   Name      457 non-null    object  
 1   Team      457 non-null    object  
 2   Number    457 non-null    float64 
 3   Position  457 non-null    category
 4   Age       457 non-null    int16   
 5   Height    457 non-null    object  
 6   Weight    457 non-null    float64 
 7   College   373 non-null    object  
 8   Salary    457 non-null    object  
dtypes: category(1), float64(2), int16(1), object(5)
memory usage: 30.1+ KB


In [176]:
# now it's taking 4KB, category dtype is useful for repeated data

nba["Position"].memory_usage(deep = True)

4579
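
The drop from 30541 to 4579 bytes comes from how a categorical column is stored: each distinct label is kept once, and every row holds only a small integer code pointing to it. A minimal sketch, assuming a hypothetical `positions` Series (the exact byte counts depend on the pandas version and platform, so none are quoted here):

```python
import pandas as pd

# hypothetical column: 5 distinct labels repeated 200 times each
positions = pd.Series(["SG", "PF", "PG", "SF", "C"] * 200)
as_cat = positions.astype("category")

text_bytes = positions.memory_usage(index=False, deep=True)
category_bytes = as_cat.memory_usage(index=False, deep=True)

print(text_bytes, category_bytes)
print(as_cat.cat.categories.tolist())   # the distinct labels, stored once
print(as_cat.cat.codes.dtype)           # small integer codes, one per row
```

The saving shrinks as the share of distinct values grows, so a column of mostly unique strings is likely to gain little from the conversion.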

### <font color=blue> Sort a DataFrame with the sort_values method </font>

In [177]:
nba = pd.read_csv("./datasets/nba.csv")
nba.head()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0


In [178]:
# sorting wrt name in descending order

nba.sort_values(by = "Name", ascending = False)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
237,Zaza Pachulia,Dallas Mavericks,27.0,C,32.0,6-11,275.0,,5200000.0
271,Zach Randolph,Memphis Grizzlies,50.0,PF,34.0,6-9,260.0,Michigan State,9638555.0
402,Zach LaVine,Minnesota Timberwolves,8.0,PG,21.0,6-5,189.0,UCLA,2148360.0
270,Xavier Munford,Memphis Grizzlies,14.0,PG,24.0,6-3,180.0,Rhode Island,
386,Wilson Chandler,Denver Nuggets,21.0,SF,29.0,6-8,225.0,DePaul,10449438.0
...,...,...,...,...,...,...,...,...,...
404,Adreian Payne,Minnesota Timberwolves,33.0,PF,25.0,6-10,237.0,Michigan State,1938840.0
328,Aaron Harrison,Charlotte Hornets,9.0,SG,21.0,6-6,210.0,Kentucky,525093.0
356,Aaron Gordon,Orlando Magic,0.0,PF,20.0,6-9,220.0,Arizona,4171680.0
152,Aaron Brooks,Chicago Bulls,0.0,PG,31.0,6-0,161.0,Oregon,2250000.0


In [179]:
# sorting wrt Name, Team with Name as asc and Team as desc

nba.sort_values(by = ["Name", "Team"], ascending = [True, False])

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
152,Aaron Brooks,Chicago Bulls,0.0,PG,31.0,6-0,161.0,Oregon,2250000.0
356,Aaron Gordon,Orlando Magic,0.0,PF,20.0,6-9,220.0,Arizona,4171680.0
328,Aaron Harrison,Charlotte Hornets,9.0,SG,21.0,6-6,210.0,Kentucky,525093.0
404,Adreian Payne,Minnesota Timberwolves,33.0,PF,25.0,6-10,237.0,Michigan State,1938840.0
312,Al Horford,Atlanta Hawks,15.0,C,30.0,6-10,245.0,Florida,12000000.0
...,...,...,...,...,...,...,...,...,...
270,Xavier Munford,Memphis Grizzlies,14.0,PG,24.0,6-3,180.0,Rhode Island,
402,Zach LaVine,Minnesota Timberwolves,8.0,PG,21.0,6-5,189.0,UCLA,2148360.0
271,Zach Randolph,Memphis Grizzlies,50.0,PF,34.0,6-9,260.0,Michigan State,9638555.0
237,Zaza Pachulia,Dallas Mavericks,27.0,C,32.0,6-11,275.0,,5200000.0
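
A minimal sketch of multi-key sorting on a hypothetical four-row frame `toy`. The `na_position` parameter is an addition to what the cells above show; it controls where missing values land:

```python
import numpy as np
import pandas as pd

# hypothetical toy data
toy = pd.DataFrame({
    "Team": ["Jazz", "Celtics", "Jazz", "Celtics"],
    "Salary": [200.0, np.nan, 100.0, 300.0],
})

# Team ascending, then Salary descending within each team
by_team = toy.sort_values(by=["Team", "Salary"], ascending=[True, False])

# NaN is placed last by default; na_position="first" moves it to the top
nan_first = toy.sort_values(by="Salary", na_position="first")

print(by_team)
print(nan_first)
```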


### <font color=blue> sort_index method to get the original df back </font>

In [180]:
nba.head()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0


In [181]:
# now let's sort and assign the result back to nba, then try to get the original order back

nba = nba.sort_values(by = ["Name", "Team"])
nba.head()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
152,Aaron Brooks,Chicago Bulls,0.0,PG,31.0,6-0,161.0,Oregon,2250000.0
356,Aaron Gordon,Orlando Magic,0.0,PF,20.0,6-9,220.0,Arizona,4171680.0
328,Aaron Harrison,Charlotte Hornets,9.0,SG,21.0,6-6,210.0,Kentucky,525093.0
404,Adreian Payne,Minnesota Timberwolves,33.0,PF,25.0,6-10,237.0,Michigan State,1938840.0
312,Al Horford,Atlanta Hawks,15.0,C,30.0,6-10,245.0,Florida,12000000.0


In [182]:
# sort_index restores the original row order after the sort, because the index labels stay with their rows

nba = nba.sort_index()
nba.head()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0
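
Restoring the order with `sort_index` works because each row keeps its index label while it moves. A minimal sketch, assuming a hypothetical three-row frame `toy`; `reset_index(drop=True)` is shown as the opposite choice, where the old labels are thrown away:

```python
import pandas as pd

toy = pd.DataFrame({"Name": ["Zaza", "Aaron", "Kobe"]})   # hypothetical names

by_name = toy.sort_values(by="Name")           # index labels travel with their rows: 1, 2, 0
restored = by_name.sort_index()                # back to 0, 1, 2
renumbered = by_name.reset_index(drop=True)    # keeps the sorted order, discards the old labels

print(by_name.index.tolist(), restored.index.tolist())
```

Once the index has been reset, the original order can no longer be recovered with `sort_index`.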


### <font color=blue> Rank Series Values with the rank Method </font>

* Suppose we want to rank the Salary column, with the highest salary as rank 1 and so on.
* First, replace the NaN values with 0.
* Then convert the Salary dtype to int.
* Now assign a rank to each salary value using the `rank` method.
* Cross-verify by sorting the df by the Salary column in descending order.

In [183]:
nba["Salary"].hasnans

True

In [184]:
nba["Salary"].isna().sum()

12

In [185]:
nba["Salary"] = nba["Salary"].fillna(value = 0)
nba.head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,0.0


In [186]:
nba["Salary"].dtype

dtype('float64')

In [187]:
nba["Salary"] = nba["Salary"].astype("int")
nba.Salary.dtype

dtype('int32')

In [188]:
nba.head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,0


In [189]:
nba["Salary Rank"] = nba["Salary"].rank(ascending = False)
nba.head()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary,Salary Rank
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337,97.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117,110.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,0,452.5
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640,322.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000,147.0


In [190]:
nba.sort_values(by = "Salary", ascending = False)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary,Salary Rank
109,Kobe Bryant,Los Angeles Lakers,24.0,SF,37.0,6-6,212.0,,25000000,1.0
169,LeBron James,Cleveland Cavaliers,23.0,SF,31.0,6-8,250.0,,22970500,2.0
33,Carmelo Anthony,New York Knicks,7.0,SF,32.0,6-8,240.0,Syracuse,22875000,3.0
251,Dwight Howard,Houston Rockets,12.0,C,30.0,6-11,265.0,,22359364,4.0
339,Chris Bosh,Miami Heat,1.0,PF,32.0,6-11,235.0,Georgia Tech,22192730,5.0
...,...,...,...,...,...,...,...,...,...,...
269,Ray McCallum,Memphis Grizzlies,5.0,PG,24.0,6-3,190.0,Detroit,0,452.5
409,Greg Smith,Minnesota Timberwolves,4.0,PF,25.0,6-10,250.0,Fresno State,0,452.5
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,0,452.5
264,Jordan Farmar,Memphis Grizzlies,4.0,PG,29.0,6-2,180.0,UCLA,0,452.5
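
The fractional rank 452.5 comes from the default tie handling, `method="average"`: the 12 rows whose missing salary was filled with 0 are tied for positions 447 to 458, and each receives the mean of those positions. A minimal sketch of the tie-handling options on a hypothetical `salaries` Series (the `method` argument is not used in the cells above):

```python
import pandas as pd

salaries = pd.Series([300, 100, 100, 0])   # hypothetical values with one tie

average_rank = salaries.rank(ascending=False)                  # default method="average"
min_rank = salaries.rank(ascending=False, method="min")        # ties share the lowest rank
dense_rank = salaries.rank(ascending=False, method="dense")    # like "min", but without gaps

print(average_rank.tolist())
print(min_rank.tolist())
print(dense_rank.tolist())
```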


## <font color=fe7401> Filtering Data </font>

### <font color=blue> Memory Optimization </font>

* Categorize the data wherever appropriate.
* Convert date columns stored as strings to datetime objects.
* Convert the string True/False column to bool type.

In [191]:
df = pd.read_csv("./datasets/employees.csv")
df.head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,8/6/1993,12:42 PM,97308,6.945,True,Marketing
1,Thomas,Male,3/31/1996,6:53 AM,61933,4.17,True,
2,Maria,Female,4/23/1993,11:17 AM,130590,11.858,False,Finance
3,Jerry,Male,3/4/2005,1:00 PM,138705,9.34,True,Finance
4,Larry,Male,1/24/1998,4:47 PM,101004,1.389,True,Client Services


In [192]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   First Name         933 non-null    object 
 1   Gender             855 non-null    object 
 2   Start Date         1000 non-null   object 
 3   Last Login Time    1000 non-null   object 
 4   Salary             1000 non-null   int64  
 5   Bonus %            1000 non-null   float64
 6   Senior Management  933 non-null    object 
 7   Team               957 non-null    object 
dtypes: float64(1), int64(1), object(6)
memory usage: 62.6+ KB


In [193]:
# pd.to_datetime converts argument to datetime.

df["Start Date"] = pd.to_datetime(df["Start Date"])

In [194]:
# only a time is given, so pandas prefixes it with the current date

df["Last Login Time"] = pd.to_datetime(df["Last Login Time"])

In [195]:
# caution: astype("bool") converts NaN to True, so the 67 missing values become True here
df["Senior Management"] = df["Senior Management"].astype("bool")

In [196]:
df["Gender"] = df["Gender"].astype("category")

In [197]:
df.head(3)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,2023-02-06 12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,2023-02-06 06:53:00,61933,4.17,True,
2,Maria,Female,1993-04-23,2023-02-06 11:17:00,130590,11.858,False,Finance


In [198]:
# memory usage reduced

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   First Name         933 non-null    object        
 1   Gender             855 non-null    category      
 2   Start Date         1000 non-null   datetime64[ns]
 3   Last Login Time    1000 non-null   datetime64[ns]
 4   Salary             1000 non-null   int64         
 5   Bonus %            1000 non-null   float64       
 6   Senior Management  1000 non-null   bool          
 7   Team               957 non-null    object        
dtypes: bool(1), category(1), datetime64[ns](2), float64(1), int64(1), object(2)
memory usage: 49.1+ KB
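
Two details of the conversions above are worth checking on a small example: what `astype("bool")` does with missing values, and how an explicit date format removes guesswork. A minimal sketch, assuming hypothetical `flags` and date values; the nullable `"boolean"` dtype is an alternative that this notebook does not use:

```python
import numpy as np
import pandas as pd

flags = pd.Series([True, np.nan, False], dtype="object")   # hypothetical column with a gap

as_bool = flags.astype("bool")          # NaN is truthy, so it silently becomes True
as_nullable = flags.astype("boolean")   # nullable boolean dtype keeps the gap as <NA>

# an explicit format avoids guessing between month-first and day-first dates
start = pd.to_datetime(pd.Series(["8/6/1993", "3/31/1996"]), format="%m/%d/%Y")

print(as_bool.tolist())
print(as_nullable.isna().tolist())
print(start.dt.month.tolist())
```

Whether a missing flag should count as True, False, or unknown depends on the analysis, so it is probably best to decide that explicitly rather than rely on the cast.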


### <font color = blue> Filter a dataframe based on one condition </font>

In [199]:
# all above code in a single cell

df = pd.read_csv("datasets/employees.csv", parse_dates = ["Start Date", "Last Login Time"])

df["Senior Management"] = df["Senior Management"].astype("bool")
df["Gender"] = df["Gender"].astype("category")

df.head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,2023-02-06 12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,2023-02-06 06:53:00,61933,4.17,True,
2,Maria,Female,1993-04-23,2023-02-06 11:17:00,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,2023-02-06 13:00:00,138705,9.34,True,Finance
4,Larry,Male,1998-01-24,2023-02-06 16:47:00,101004,1.389,True,Client Services


In [200]:
# suppose we want to keep only the rows whose Team is Finance
# the comparison is broadcast to every row and returns a boolean Series

df["Team"] == "Finance"

0      False
1      False
2       True
3       True
4      False
       ...  
995    False
996     True
997    False
998    False
999    False
Name: Team, Length: 1000, dtype: bool

In [201]:
# writing the condition inline can get confusing, so consider storing the boolean Series in a variable
# pandas keeps only the rows where the boolean Series is True

df[df["Team"] == "Finance"]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
2,Maria,Female,1993-04-23,2023-02-06 11:17:00,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,2023-02-06 13:00:00,138705,9.340,True,Finance
7,,Female,2015-07-20,2023-02-06 10:43:00,45906,11.598,True,Finance
14,Kimberly,Female,1999-01-14,2023-02-06 07:13:00,41426,14.543,True,Finance
46,Bruce,Male,2009-11-28,2023-02-06 22:47:00,114796,6.796,False,Finance
...,...,...,...,...,...,...,...,...
907,Elizabeth,Female,1998-07-27,2023-02-06 11:12:00,137144,10.081,False,Finance
954,Joe,Male,1980-01-19,2023-02-06 16:06:00,119667,1.148,True,Finance
987,Gloria,Female,2014-12-08,2023-02-06 05:08:00,136709,10.331,True,Finance
992,Anthony,Male,2011-10-16,2023-02-06 08:35:00,112769,11.625,True,Finance


In [202]:
# pandas converts the string-formatted date to a datetime itself,
# but specifying the date format explicitly is good practice

mask = df["Start Date"] > "01/01/2000"
df[mask].head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
3,Jerry,Male,2005-03-04,2023-02-06 13:00:00,138705,9.34,True,Finance
7,,Female,2015-07-20,2023-02-06 10:43:00,45906,11.598,True,Finance
8,Angela,Female,2005-11-22,2023-02-06 06:29:00,95570,18.523,True,Engineering
9,Frances,Female,2002-08-08,2023-02-06 06:51:00,139852,7.524,True,Business Development
13,Gary,Male,2008-01-27,2023-02-06 23:40:00,109831,5.831,False,Sales


In [203]:
mask = df["Team"] != "Sales"
df[mask].head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,2023-02-06 12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,2023-02-06 06:53:00,61933,4.17,True,
2,Maria,Female,1993-04-23,2023-02-06 11:17:00,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,2023-02-06 13:00:00,138705,9.34,True,Finance
4,Larry,Male,1998-01-24,2023-02-06 16:47:00,101004,1.389,True,Client Services
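
A minimal sketch of single-condition filtering on a three-row sample in the shape of the employees data (`staff` is a hypothetical name). The `.loc` form and the `~` negation are additions to what the cells above show:

```python
import pandas as pd

staff = pd.DataFrame({
    "First Name": ["Maria", "Jerry", "Gary"],
    "Team": ["Finance", "Finance", "Sales"],
    "Salary": [130590, 138705, 109831],
})

is_finance = staff["Team"] == "Finance"   # boolean Series, one value per row

finance_rows = staff[is_finance]                       # all columns of the matching rows
finance_names = staff.loc[is_finance, "First Name"]    # .loc can also pick columns
not_finance = staff[~is_finance]                       # ~ inverts the mask

print(finance_names.tolist(), not_finance["First Name"].tolist())
```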


### <font color = blue> Filtering data with more than one condition </font>

* And criteria filtering: `&`
* Or criteria filtering: `|`

In [204]:
df.head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,2023-02-06 12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,2023-02-06 06:53:00,61933,4.17,True,
2,Maria,Female,1993-04-23,2023-02-06 11:17:00,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,2023-02-06 13:00:00,138705,9.34,True,Finance
4,Larry,Male,1998-01-24,2023-02-06 16:47:00,101004,1.389,True,Client Services


In [205]:
# filtering male employees of the Marketing team

mask1 = df["Team"] == "Marketing"
mask2 = df["Gender"] == "Male"

df[mask1 & mask2].head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,2023-02-06 12:42:00,97308,6.945,True,Marketing
21,Matthew,Male,1995-09-05,2023-02-06 02:12:00,100612,13.645,False,Marketing
26,Craig,Male,2000-02-27,2023-02-06 07:45:00,37598,7.757,True,Marketing
74,Thomas,Male,1995-06-04,2023-02-06 14:24:00,62096,17.029,False,Marketing
77,Charles,Male,2004-09-14,2023-02-06 20:13:00,107391,1.26,True,Marketing


In [206]:
# filtering employees with (Bonus more than 5% or Salary greater than 100000) and Start Date after 26th Jan, 2005

mask1 = df["Bonus %"] > 5
mask2 = df["Salary"] > 100000
mask3 = df["Start Date"] > pd.to_datetime("26/01/2005", format = "%d/%m/%Y")  # this is a good practice, date format specification

df[(mask1 | mask2) & mask3].head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
3,Jerry,Male,2005-03-04,2023-02-06 13:00:00,138705,9.34,True,Finance
7,,Female,2015-07-20,2023-02-06 10:43:00,45906,11.598,True,Finance
8,Angela,Female,2005-11-22,2023-02-06 06:29:00,95570,18.523,True,Engineering
13,Gary,Male,2008-01-27,2023-02-06 23:40:00,109831,5.831,False,Sales
16,Jeremy,Male,2010-09-21,2023-02-06 05:56:00,90370,7.369,False,Human Resources
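
The parentheses around `mask1 | mask2` matter, because `&` binds tighter than `|` in Python. A minimal sketch, assuming a hypothetical four-row frame `staff`, where the last row separates the two readings:

```python
import pandas as pd

# hypothetical toy data in the shape of the employees columns used above
staff = pd.DataFrame({
    "Bonus %": [9.34, 1.26, 18.523, 17.029],
    "Salary": [138705, 107391, 95570, 62096],
    "Start Date": pd.to_datetime(
        ["2005-03-04", "2004-09-14", "2005-11-22", "1995-06-04"]
    ),
})

cutoff = pd.Timestamp("2005-01-26")

bonus_ok = staff["Bonus %"] > 5
salary_ok = staff["Salary"] > 100000
recent = staff["Start Date"] > cutoff

grouped = staff[(bonus_ok | salary_ok) & recent]
ungrouped = staff[bonus_ok | salary_ok & recent]   # parsed as bonus_ok | (salary_ok & recent)

print(grouped.index.tolist(), ungrouped.index.tolist())
```

Naming each mask, as the cell above does, also avoids a related pitfall: comparisons written inline need their own parentheses, since `&` and `|` bind tighter than `>` and `==`.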


### <font color=blue> Check for inclusion with the isin() method </font>

* `isin` checks whether each element of the Series or DataFrame is contained in `values` and returns a boolean mask.
* `values`: Series | DataFrame | Sequence | Mapping

In [207]:
# extract all rows whose Team is Sales, Product or Marketing

mask1 = df["Team"] == "Sales"
mask2 = df["Team"] == "Product"
mask3 = df["Team"] == "Marketing"

df[mask1 | mask2 | mask3].head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,2023-02-06 12:42:00,97308,6.945,True,Marketing
6,Ruby,Female,1987-08-17,2023-02-06 16:20:00,65476,10.012,True,Product
13,Gary,Male,2008-01-27,2023-02-06 23:40:00,109831,5.831,False,Sales
15,Lillian,Female,2016-06-05,2023-02-06 06:09:00,59414,1.256,False,Product
17,Shawn,Male,1986-12-07,2023-02-06 19:45:00,111737,6.414,False,Product


In [208]:
# the same filter can be written more compactly using the isin() method

mask = df["Team"].isin(["Sales", "Product", "Marketing"])
df[mask].head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,2023-02-06 12:42:00,97308,6.945,True,Marketing
6,Ruby,Female,1987-08-17,2023-02-06 16:20:00,65476,10.012,True,Product
13,Gary,Male,2008-01-27,2023-02-06 23:40:00,109831,5.831,False,Sales
15,Lillian,Female,2016-06-05,2023-02-06 06:09:00,59414,1.256,False,Product
17,Shawn,Male,1986-12-07,2023-02-06 19:45:00,111737,6.414,False,Product
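
A minimal sketch of `isin` on a hypothetical `teams` Series, including the negated mask and the treatment of a missing value:

```python
import numpy as np
import pandas as pd

teams = pd.Series(["Marketing", np.nan, "Finance", "Sales", "Product"])   # hypothetical values

wanted = teams.isin(["Sales", "Product", "Marketing"])
others = ~wanted   # everything else, including the missing value

print(wanted.tolist())
print(others.tolist())
```

A NaN entry is not in the list, so it is False in `wanted` and True in the negated mask; combine with `notnull()` if missing rows should be excluded from both.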


### <font color=blue> Check for null values with the isnull() and notnull() methods </font>

In [209]:
df.head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,2023-02-06 12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,2023-02-06 06:53:00,61933,4.17,True,
2,Maria,Female,1993-04-23,2023-02-06 11:17:00,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,2023-02-06 13:00:00,138705,9.34,True,Finance
4,Larry,Male,1998-01-24,2023-02-06 16:47:00,101004,1.389,True,Client Services


In [210]:
df["Team"].isnull()

0      False
1       True
2      False
3      False
4      False
       ...  
995    False
996    False
997    False
998    False
999    False
Name: Team, Length: 1000, dtype: bool

In [211]:
df[ df["Team"].isnull() ].head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
1,Thomas,Male,1996-03-31,2023-02-06 06:53:00,61933,4.17,True,
10,Louise,Female,1980-08-12,2023-02-06 09:01:00,63241,15.132,True,
23,,Male,2012-06-14,2023-02-06 16:19:00,125792,5.042,True,
32,,Male,1998-08-21,2023-02-06 14:27:00,122340,6.417,True,
91,James,,2005-01-26,2023-02-06 23:00:00,128771,8.309,False,


In [212]:
df[ df["Team"].notnull() ].head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,2023-02-06 12:42:00,97308,6.945,True,Marketing
2,Maria,Female,1993-04-23,2023-02-06 11:17:00,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,2023-02-06 13:00:00,138705,9.34,True,Finance
4,Larry,Male,1998-01-24,2023-02-06 16:47:00,101004,1.389,True,Client Services
5,Dennis,Male,1987-04-18,2023-02-06 01:35:00,115163,10.125,False,Legal
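
Because the masks are boolean, summing them counts the missing and present values directly. A minimal sketch on a hypothetical `teams` Series:

```python
import numpy as np
import pandas as pd

teams = pd.Series(["Marketing", np.nan, "Finance", np.nan])   # hypothetical values

missing = teams.isnull()              # isna() is an alias with the same result
n_missing = int(missing.sum())        # True counts as 1
n_present = int(teams.notnull().sum())

print(n_missing, n_present)
```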


### <font color=blue> Check for inclusion within a range of values with the between() method </font>
* Both bounds are inclusive by default.

In [213]:
mask = df["Salary"].between(35000, 40000)
df[mask].head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
25,,Male,2012-10-08,2023-02-06 01:12:00,37076,18.576,True,Client Services
26,Craig,Male,2000-02-27,2023-02-06 07:45:00,37598,7.757,True,Marketing
38,Stephanie,Female,1986-09-13,2023-02-06 01:52:00,36844,5.574,True,Business Development
63,Matthew,Male,2013-01-02,2023-02-06 22:33:00,35203,18.04,False,Human Resources
82,Steven,Male,1980-03-30,2023-02-06 21:20:00,35095,8.379,True,Client Services


In [214]:
mask = df["Start Date"].between("01-04-2000", "01-04-2001")
df[mask].head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
26,Craig,Male,2000-02-27,2023-02-06 07:45:00,37598,7.757,True,Marketing
50,Nancy,Female,2000-09-23,2023-02-06 08:05:00,94976,13.83,True,Engineering
92,Linda,Female,2000-05-25,2023-02-06 17:45:00,119009,12.506,True,Business Development
109,Christopher,Male,2000-04-22,2023-02-06 10:15:00,37919,11.449,False,
160,Kathy,Female,2000-03-18,2023-02-06 19:26:00,149563,16.991,True,Finance
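
Judging by the rows returned above, the strings "01-04-2000" and "01-04-2001" were read month-first, as January 4. Passing `pd.Timestamp` bounds in ISO format removes that ambiguity. A minimal sketch, assuming hypothetical `salaries` and `dates` values; the string form of `inclusive` assumes pandas 1.3 or later:

```python
import pandas as pd

salaries = pd.Series([35000, 37598, 40000, 40001])   # hypothetical values

both = salaries.between(35000, 40000)                           # default: both bounds included
neither = salaries.between(35000, 40000, inclusive="neither")   # both bounds excluded

dates = pd.to_datetime(pd.Series(["2000-01-04", "2000-02-27", "2001-01-05"]))
in_range = dates.between(pd.Timestamp("2000-01-04"), pd.Timestamp("2001-01-04"))

print(both.tolist(), neither.tolist(), in_range.tolist())
```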


### <font color=blue> Check for duplicate dataFrame rows with the duplicated() method </font>

In [215]:
df1 = df.sort_values(by = "First Name")
df1.head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
101,Aaron,Male,2012-02-17,2023-02-06 10:20:00,61602,11.849,True,Marketing
327,Aaron,Male,1994-01-29,2023-02-06 18:48:00,58755,5.097,True,Marketing
440,Aaron,Male,1990-07-22,2023-02-06 14:53:00,52119,11.343,True,Client Services
937,Aaron,,1986-01-22,2023-02-06 19:39:00,63126,18.424,False,Client Services
137,Adam,Male,2011-05-21,2023-02-06 01:45:00,95327,15.12,False,Distribution


In [216]:
df1["First Name"].duplicated()

101    False
327     True
440     True
937     True
137    False
       ...  
902     True
925     True
946     True
947     True
951     True
Name: First Name, Length: 1000, dtype: bool

In [217]:
# duplicated() marks every repeat as True except the first occurrence; the mask then selects those repeated rows

df1[ df1["First Name"].duplicated() ]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
327,Aaron,Male,1994-01-29,2023-02-06 18:48:00,58755,5.097,True,Marketing
440,Aaron,Male,1990-07-22,2023-02-06 14:53:00,52119,11.343,True,Client Services
937,Aaron,,1986-01-22,2023-02-06 19:39:00,63126,18.424,False,Client Services
141,Adam,Male,1990-12-24,2023-02-06 20:57:00,110194,14.727,True,Product
302,Adam,Male,2007-07-05,2023-02-06 11:59:00,71276,5.027,True,Human Resources
...,...,...,...,...,...,...,...,...
902,,Male,2001-05-23,2023-02-06 19:52:00,103877,6.322,True,Distribution
925,,Female,2000-08-23,2023-02-06 16:19:00,95866,19.388,True,Sales
946,,Female,1985-09-15,2023-02-06 01:50:00,133472,16.941,True,Distribution
947,,Male,2012-07-30,2023-02-06 15:07:00,107351,5.329,True,Marketing


In [218]:
df1["First Name"].duplicated(keep = "last")

101     True
327     True
440     True
937    False
137     True
       ...  
902     True
925     True
946     True
947     True
951    False
Name: First Name, Length: 1000, dtype: bool

In [219]:
df1[ df1["First Name"].duplicated(keep = "last") ]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
101,Aaron,Male,2012-02-17,2023-02-06 10:20:00,61602,11.849,True,Marketing
327,Aaron,Male,1994-01-29,2023-02-06 18:48:00,58755,5.097,True,Marketing
440,Aaron,Male,1990-07-22,2023-02-06 14:53:00,52119,11.343,True,Client Services
137,Adam,Male,2011-05-21,2023-02-06 01:45:00,95327,15.120,False,Distribution
141,Adam,Male,1990-12-24,2023-02-06 20:57:00,110194,14.727,True,Product
...,...,...,...,...,...,...,...,...
890,,Male,2015-11-24,2023-02-06 03:11:00,145329,7.100,True,Finance
902,,Male,2001-05-23,2023-02-06 19:52:00,103877,6.322,True,Distribution
925,,Female,2000-08-23,2023-02-06 16:19:00,95866,19.388,True,Sales
946,,Female,1985-09-15,2023-02-06 01:50:00,133472,16.941,True,Distribution


In [220]:
# Mark all duplicates as True

df1["First Name"].duplicated(keep = False)

101    True
327    True
440    True
937    True
137    True
       ... 
902    True
925    True
946    True
947    True
951    True
Name: First Name, Length: 1000, dtype: bool

In [221]:
# keep only First Name values that occur exactly once; every row of a repeated name is removed, including its first occurrence
# `~` negates the boolean mask

df1[ ~ df1["First Name"].duplicated(keep = False)]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
8,Angela,Female,2005-11-22,2023-02-06 06:29:00,95570,18.523,True,Engineering
688,Brian,Male,2007-04-07,2023-02-06 22:47:00,93901,17.821,True,Legal
190,Carol,Female,1996-03-19,2023-02-06 03:39:00,57783,9.129,False,Finance
887,David,Male,2009-12-05,2023-02-06 08:48:00,92242,15.407,False,Legal
5,Dennis,Male,1987-04-18,2023-02-06 01:35:00,115163,10.125,False,Legal
495,Eugene,Male,1984-05-24,2023-02-06 10:54:00,81077,2.117,False,Sales
33,Jean,Female,1993-12-18,2023-02-06 09:07:00,119082,16.18,False,Business Development
832,Keith,Male,2003-02-12,2023-02-06 15:02:00,120672,19.467,False,Legal
291,Tammy,Female,1984-11-11,2023-02-06 10:30:00,132839,17.463,True,Client Services
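
The three `keep` options can be compared side by side on a small Series. This is only an illustrative sketch with made-up names, not the employee dataset.

```python
import pandas as pd

names = pd.Series(["Aaron", "Adam", "Aaron", "Brian", "Aaron"])

first = names.duplicated()              # keep="first": the first "Aaron" is not flagged
last = names.duplicated(keep="last")    # the last "Aaron" is not flagged
every = names.duplicated(keep=False)    # every "Aaron" is flagged

# Negating the keep=False mask leaves only values that occur exactly once
only_unique = names[~every]

print(first.tolist())
print(last.tolist())
print(every.tolist())
print(only_unique.tolist())
```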


### <font color=blue> Delete duplicate DataFrame rows with the drop_duplicates() method </font>

* Return DataFrame with duplicate rows removed.

In [222]:
df1.head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
101,Aaron,Male,2012-02-17,2023-02-06 10:20:00,61602,11.849,True,Marketing
327,Aaron,Male,1994-01-29,2023-02-06 18:48:00,58755,5.097,True,Marketing
440,Aaron,Male,1990-07-22,2023-02-06 14:53:00,52119,11.343,True,Client Services
937,Aaron,,1986-01-22,2023-02-06 19:39:00,63126,18.424,False,Client Services
137,Adam,Male,2011-05-21,2023-02-06 01:45:00,95327,15.12,False,Distribution


In [223]:
len(df1)

1000

In [224]:
df1.drop_duplicates().head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
101,Aaron,Male,2012-02-17,2023-02-06 10:20:00,61602,11.849,True,Marketing
327,Aaron,Male,1994-01-29,2023-02-06 18:48:00,58755,5.097,True,Marketing
440,Aaron,Male,1990-07-22,2023-02-06 14:53:00,52119,11.343,True,Client Services
937,Aaron,,1986-01-22,2023-02-06 19:39:00,63126,18.424,False,Client Services
137,Adam,Male,2011-05-21,2023-02-06 01:45:00,95327,15.12,False,Distribution


In [225]:
# the length is unchanged because no two rows match in every column

len(df1.drop_duplicates())

1000

In [226]:
# keep first duplicate and remove others

df1.drop_duplicates(subset = ["First Name"])

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
101,Aaron,Male,2012-02-17,2023-02-06 10:20:00,61602,11.849,True,Marketing
137,Adam,Male,2011-05-21,2023-02-06 01:45:00,95327,15.120,False,Distribution
300,Alan,Male,1988-06-26,2023-02-06 03:54:00,111786,3.592,True,Engineering
372,Albert,Male,1997-02-01,2023-02-06 16:20:00,67827,19.717,True,Engineering
988,Alice,Female,2004-10-05,2023-02-06 09:34:00,47638,11.209,False,Human Resources
...,...,...,...,...,...,...,...,...
433,Wanda,Female,2008-07-20,2023-02-06 13:44:00,65362,7.132,True,Legal
177,Wayne,Male,2012-04-07,2023-02-06 08:00:00,102652,14.085,True,Distribution
820,William,Male,1993-11-18,2023-02-06 12:27:00,54058,5.182,True,Human Resources
450,Willie,Male,2009-08-22,2023-02-06 13:03:00,55038,19.691,False,Legal


In [227]:
# keep last duplicate value and remove others

df1.drop_duplicates(subset = ["First Name"], keep = "last")

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
937,Aaron,,1986-01-22,2023-02-06 19:39:00,63126,18.424,False,Client Services
538,Adam,Male,2010-10-08,2023-02-06 21:53:00,45181,3.491,False,Human Resources
610,Alan,Male,2012-02-17,2023-02-06 00:26:00,41453,10.084,False,Product
959,Albert,Male,1992-09-19,2023-02-06 02:35:00,45094,5.850,True,Business Development
693,Alice,Female,1995-10-16,2023-02-06 21:19:00,92799,2.782,False,Sales
...,...,...,...,...,...,...,...,...
512,Wanda,Female,1993-04-06,2023-02-06 03:11:00,78883,19.695,False,
637,Wayne,Male,2009-09-02,2023-02-06 01:37:00,126956,18.396,False,Human Resources
127,William,Male,2002-09-29,2023-02-06 16:09:00,66521,5.830,False,Human Resources
652,Willie,Male,2009-12-05,2023-02-06 05:39:00,141932,1.017,True,Engineering


In [228]:
# rows are compared on both columns together: a row is dropped only if its (First Name, Senior Management) pair has already appeared

df1.drop_duplicates(subset = ["First Name", "Senior Management"])

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
101,Aaron,Male,2012-02-17,2023-02-06 10:20:00,61602,11.849,True,Marketing
937,Aaron,,1986-01-22,2023-02-06 19:39:00,63126,18.424,False,Client Services
137,Adam,Male,2011-05-21,2023-02-06 01:45:00,95327,15.120,False,Distribution
141,Adam,Male,1990-12-24,2023-02-06 20:57:00,110194,14.727,True,Product
300,Alan,Male,1988-06-26,2023-02-06 03:54:00,111786,3.592,True,Engineering
...,...,...,...,...,...,...,...,...
820,William,Male,1993-11-18,2023-02-06 12:27:00,54058,5.182,True,Human Resources
127,William,Male,2002-09-29,2023-02-06 16:09:00,66521,5.830,False,Human Resources
450,Willie,Male,2009-08-22,2023-02-06 13:03:00,55038,19.691,False,Legal
175,Willie,Male,1998-02-17,2023-02-06 20:20:00,146651,1.451,True,Engineering
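
To make the effect of `subset` and `keep` easier to follow, here is a small sketch on a hypothetical four-row frame (the values are invented for illustration). The original index labels are preserved, so they show which rows survived.

```python
import pandas as pd

staff = pd.DataFrame({
    "First Name": ["Aaron", "Aaron", "Adam", "Aaron"],
    "Senior Management": [True, False, True, True],
    "Salary": [100, 200, 300, 400],
})

# Compare on one column: the first row of each name is kept
by_name = staff.drop_duplicates(subset=["First Name"])

# Same comparison, but the last row of each name is kept
by_name_last = staff.drop_duplicates(subset=["First Name"], keep="last")

# Compare on the pair of columns: row 3 repeats the pair of row 0
by_pair = staff.drop_duplicates(subset=["First Name", "Senior Management"])

print(by_name.index.tolist())
print(by_name_last.index.tolist())
print(by_pair.index.tolist())
```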


### <font color=blue> Identify and count unique values with the unique() and nunique() methods </font>

* unique(): Return unique values of Series object
* nunique(): Count number of distinct elements

In [229]:
df.head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,2023-02-06 12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,2023-02-06 06:53:00,61933,4.17,True,
2,Maria,Female,1993-04-23,2023-02-06 11:17:00,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,2023-02-06 13:00:00,138705,9.34,True,Finance
4,Larry,Male,1998-01-24,2023-02-06 16:47:00,101004,1.389,True,Client Services


In [230]:
df["Gender"].unique()

['Male', 'Female', NaN]
Categories (2, object): ['Female', 'Male']

In [231]:
# unique() includes NaN in its result, whereas nunique() ignores NaN by default

df["Gender"].nunique()

2

In [232]:
# count NaN as a distinct value too

df["Gender"].nunique(dropna = False)

3
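
The NaN handling described above can be checked on a tiny Series. This sketch uses a plain object-dtype Series with invented values rather than the categorical `Gender` column.

```python
import numpy as np
import pandas as pd

gender = pd.Series(["Male", "Female", np.nan, "Male"])

distinct_values = gender.unique()             # array that includes NaN
count_default = gender.nunique()              # NaN is skipped
count_with_nan = gender.nunique(dropna=False) # NaN counts as one value

print(len(distinct_values), count_default, count_with_nan)
```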

## <font color=fe7401> Data Extraction </font>

In [233]:
bond = pd.read_csv("./datasets/jamesbond.csv")
bond.head()

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
0,Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
1,From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
2,Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
3,Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
4,Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


### <font color=blue> Use the set_index and reset_index methods to define a new DataFrame index </font>

* set_index(): set the DataFrame index using existing columns.
* reset_index(): reset the index

In [234]:
# similar to >>> bond = pd.read_csv("./datasets/jamesbond.csv", index_col = "Film")

bond.set_index("Film", inplace = True)
bond.head()

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


In [235]:
# set_index() replaces the current index ("Film"), which is discarded
# drop = True (the default) removes "Year" from the columns once it becomes the index

bond.set_index("Year").head()

Unnamed: 0_level_0,Actor,Director,Box Office,Budget,Bond Actor Salary
Year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1962,Sean Connery,Terence Young,448.8,7.0,0.6
1963,Sean Connery,Terence Young,543.8,12.6,1.6
1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
1965,Sean Connery,Terence Young,848.1,41.9,4.7
1967,David Niven,Ken Hughes,315.0,85.0,


In [236]:
bond.head()

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


In [237]:
bond.reset_index().head()

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
0,Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
1,From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
2,Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
3,Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
4,Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
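
Both methods take a `drop` parameter with a different meaning in each. The sketch below uses a two-row frame built from values shown above; the variable names are chosen for illustration only.

```python
import pandas as pd

films = pd.DataFrame({"Film": ["Dr. No", "Goldfinger"], "Year": [1962, 1964]})

by_film = films.set_index("Film")                  # "Film" moves from the columns to the index
kept_column = films.set_index("Film", drop=False)  # "Film" is both index and column

restored = by_film.reset_index()             # index goes back to being a column
discarded = by_film.reset_index(drop=True)   # index is thrown away instead

print(list(by_film.columns), list(kept_column.columns))
print(list(restored.columns), list(discarded.columns))
```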


### <font color=blue> Retrieve rows by index label with `loc []` Accessor </font>

* `loc` is a property of pandas DataFrame and Series objects.
* **label-based selection**
* Similar to list slicing, except that a list slice excludes its stop value whereas a `loc` slice includes its stop label.

In [238]:
bond = pd.read_csv("./datasets/jamesbond.csv", index_col = "Film")
bond.head()

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


In [239]:
# note the square brackets: loc is indexed, not called like a method

bond.loc["Thunderball"]

Year                          1965
Actor                 Sean Connery
Director             Terence Young
Box Office                   848.1
Budget                        41.9
Bond Actor Salary              4.7
Name: Thunderball, dtype: object

In [240]:
bond.loc[ ["Casino Royale", "Dr. No"] ]

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6


In [241]:
bond.loc["GoldenEye" :]

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
GoldenEye,1995,Pierce Brosnan,Martin Campbell,518.5,76.9,5.1
Tomorrow Never Dies,1997,Pierce Brosnan,Roger Spottiswoode,463.2,133.9,10.0
The World Is Not Enough,1999,Pierce Brosnan,Michael Apted,439.5,158.3,13.5
Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Quantum of Solace,2008,Daniel Craig,Marc Forster,514.2,181.4,8.1
Skyfall,2012,Daniel Craig,Sam Mendes,943.5,170.2,14.5
Spectre,2015,Daniel Craig,Sam Mendes,726.7,206.3,


In [242]:
# the stop label is included

bond.loc["GoldenEye" : "Skyfall"]

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
GoldenEye,1995,Pierce Brosnan,Martin Campbell,518.5,76.9,5.1
Tomorrow Never Dies,1997,Pierce Brosnan,Roger Spottiswoode,463.2,133.9,10.0
The World Is Not Enough,1999,Pierce Brosnan,Michael Apted,439.5,158.3,13.5
Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Quantum of Solace,2008,Daniel Craig,Marc Forster,514.2,181.4,8.1
Skyfall,2012,Daniel Craig,Sam Mendes,943.5,170.2,14.5


In [243]:
bond.loc["GoldenEye" : "Skyfall" : 2]

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
GoldenEye,1995,Pierce Brosnan,Martin Campbell,518.5,76.9,5.1
The World Is Not Enough,1999,Pierce Brosnan,Michael Apted,439.5,158.3,13.5
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Skyfall,2012,Daniel Craig,Sam Mendes,943.5,170.2,14.5


In [244]:
sample_df = pd.DataFrame(data = {"Apple": [120, 55, 86], "Mango": [56, 78, 95]}, index = [1, 2, 3])
sample_df

Unnamed: 0,Apple,Mango
1,120,56
2,55,78
3,86,95


In [245]:
# bond.loc[ [1, 3] ] -> raises KeyError because its index labels are strings, not integers

sample_df.loc[ [1, 3] ]

Unnamed: 0,Apple,Mango
1,120,56
3,86,95
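
A short sketch contrasting a Python list slice with a `loc` label slice, and showing what happens when the requested labels are absent. The frame reuses the Apple/Mango values with string labels `"x"`, `"y"`, `"z"`, which are an assumption made for this example.

```python
import pandas as pd

fruit = pd.DataFrame({"Apple": [120, 55, 86], "Mango": [56, 78, 95]},
                     index=["x", "y", "z"])

prices = [120, 55, 86]
list_slice = prices[0:2]           # stop position 2 is excluded -> 2 items
label_slice = fruit.loc["x":"z"]   # stop label "z" is included -> 3 rows

try:
    fruit.loc[[1, 3]]              # these integer labels do not exist in the index
    missing_label_error = None
except KeyError as exc:
    missing_label_error = exc

print(len(list_slice), len(label_slice), type(missing_label_error).__name__)
```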


### <font color=blue> Retrieve rows by index position with `iloc []` Accessor </font>

* **integer position-based selection**
* Difference between `iloc` and `loc`:
    1. `loc` gets rows (and/or columns) with particular labels. `iloc` gets rows (and/or columns) at integer positions.
        ``` 
        bond.iloc["Skyfall"] -> raises TypeError, because iloc accepts only integer positions
        ```
    2. A slice passed to `iloc` excludes its stop position, unlike a slice passed to `loc`, which includes its stop label.

In [246]:
bond.head()

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


In [247]:
# the stop position is excluded

bond.iloc[1:3]

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2


In [248]:
bond.reset_index(inplace = True)

In [249]:
bond.head()

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
0,Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
1,From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
2,Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
3,Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
4,Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


In [250]:
bond.iloc[ [0, 3, 5] ]

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
0,Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
3,Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
5,You Only Live Twice,1967,Sean Connery,Lewis Gilbert,514.2,59.9,4.4


In [251]:
bond.iloc[ : 3]

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
0,Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
1,From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
2,Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2


In [252]:
bond.iloc[ : : 10]

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
0,Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
10,The Spy Who Loved Me,1977,Roger Moore,Lewis Gilbert,533.0,45.1,
20,The World Is Not Enough,1999,Pierce Brosnan,Michael Apted,439.5,158.3,13.5
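
The difference between labels and positions is easiest to see when the index holds integers that do not start at 0. The sketch below rebuilds the small Apple/Mango frame with index `[1, 2, 3]` for that purpose.

```python
import pandas as pd

sample = pd.DataFrame({"Apple": [120, 55, 86], "Mango": [56, 78, 95]},
                      index=[1, 2, 3])

by_label = sample.loc[1, "Apple"]    # label 1 is the first row
by_position = sample.iloc[1, 0]      # position 1 is the second row

label_rows = sample.loc[1:2]         # labels 1 and 2, stop included
position_rows = sample.iloc[1:2]     # position 1 only, stop excluded

print(by_label, by_position)
print(label_rows.index.tolist(), position_rows.index.tolist())
```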


### <font color=blue>Passing a Second Argument to the loc and iloc Accessors</font>

In [253]:
bond.head()

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
0,Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
1,From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
2,Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
3,Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
4,Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


In [254]:
bond.set_index("Film", inplace = True)
bond.head()

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


In [255]:
bond.loc["Goldfinger"]

Year                         1964
Actor                Sean Connery
Director             Guy Hamilton
Box Office                  820.4
Budget                       18.6
Bond Actor Salary             3.2
Name: Goldfinger, dtype: object

In [256]:
# the first argument selects rows by index label, the second selects columns by name

bond.loc["Goldfinger", "Actor"]

'Sean Connery'

In [257]:
bond.loc["Goldfinger", ["Box Office", "Actor", "Budget"]]

Box Office           820.4
Actor         Sean Connery
Budget                18.6
Name: Goldfinger, dtype: object

In [258]:
bond.loc["Goldfinger", "Actor":"Box Office"]

Actor         Sean Connery
Director      Guy Hamilton
Box Office           820.4
Name: Goldfinger, dtype: object

In [259]:
bond.loc["Quantum of Solace":, :"Actor"]

Unnamed: 0_level_0,Year,Actor
Film,Unnamed: 1_level_1,Unnamed: 2_level_1
Quantum of Solace,2008,Daniel Craig
Skyfall,2012,Daniel Craig
Spectre,2015,Daniel Craig


In [260]:
bond.head()

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


In [261]:
# iloc works the same way, but takes integer positions instead of row labels and column names

bond.iloc[2, [2, 1]]

Director    Guy Hamilton
Actor       Sean Connery
Name: Goldfinger, dtype: object

In [262]:
bond.shape

(26, 6)

In [263]:
bond.iloc[:2, :-1]

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6
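
As a compact recap, the sketch below selects the same cell and comparable slices with both accessors. It uses a three-row excerpt of the values shown above, so it runs without the CSV file.

```python
import pandas as pd

films = pd.DataFrame(
    {
        "Year": [1962, 1963, 1964],
        "Actor": ["Sean Connery", "Sean Connery", "Sean Connery"],
        "Budget": [7.0, 12.6, 18.6],
    },
    index=["Dr. No", "From Russia with Love", "Goldfinger"],
)

cell_by_label = films.loc["Goldfinger", "Budget"]   # row label, column name
cell_by_position = films.iloc[2, 2]                 # row position, column position

columns_by_label = films.loc[:, "Year":"Actor"]     # column slice, stop included
all_but_last = films.iloc[:2, :-1]                  # first two rows, last column dropped

print(cell_by_label, cell_by_position)
print(list(columns_by_label.columns), all_but_last.shape)
```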


### <font color=blue> Set New Value for a Specific Cell or Cells In a Row </font>

In [264]:
bond.loc["Dr. No", "Actor"] = "Mr. Actor"

In [265]:
bond.head()

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Dr. No,1962,Mr. Actor,Terence Young,448.8,7.0,0.6
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


In [266]:
bond.loc["Dr. No" : "Goldfinger", ["Budget", "Year"]] = [7, 2004]

In [267]:
bond.head()

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Dr. No,2004,Mr. Actor,Terence Young,448.8,7.0,0.6
From Russia with Love,2004,Sean Connery,Terence Young,543.8,7.0,1.6
Goldfinger,2004,Sean Connery,Guy Hamilton,820.4,7.0,3.2
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


In [268]:
bond.loc["Dr. No" : "Goldfinger", ["Budget", "Year"]] = [[17, 2004], [8, 2005], [5, 2000]]
bond.head()

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Dr. No,2004,Mr. Actor,Terence Young,448.8,17.0,0.6
From Russia with Love,2005,Sean Connery,Terence Young,543.8,8.0,1.6
Goldfinger,2000,Sean Connery,Guy Hamilton,820.4,5.0,3.2
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
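
The two assignments above differ in how the right-hand side is spread over the selection: a flat list with one value per selected column is repeated for every selected row, while a nested list supplies one inner list per row. A minimal sketch on a four-row excerpt (values taken from the output above):

```python
import pandas as pd

films = pd.DataFrame(
    {"Year": [1962, 1963, 1964, 1965], "Budget": [7.0, 12.6, 18.6, 41.9]},
    index=["Dr. No", "From Russia with Love", "Goldfinger", "Thunderball"],
)

rows = slice("Dr. No", "Goldfinger")  # label slice, stop included

# One value per selected column, repeated on each of the three rows
films.loc[rows, ["Budget", "Year"]] = [7, 2004]
after_flat_list = films.copy()

# One inner list per row, each holding [Budget, Year]
films.loc[rows, ["Budget", "Year"]] = [[17, 2004], [8, 2005], [5, 2000]]

print(after_flat_list)
print(films)
```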


### <font color=blue> Set Multiple Values in a DataFrame </font>

In [269]:
# suppose we want to replace "Sean Connery" with "Sir Sean Connery" wherever it appears in the Actor column

bond["Actor"] == "Sean Connery"

Film
Dr. No                             False
From Russia with Love               True
Goldfinger                          True
Thunderball                         True
Casino Royale                      False
You Only Live Twice                 True
On Her Majesty's Secret Service    False
Diamonds Are Forever                True
Live and Let Die                   False
The Man with the Golden Gun        False
The Spy Who Loved Me               False
Moonraker                          False
For Your Eyes Only                 False
Never Say Never Again               True
Octopussy                          False
A View to a Kill                   False
The Living Daylights               False
Licence to Kill                    False
GoldenEye                          False
Tomorrow Never Dies                False
The World Is Not Enough            False
Die Another Day                    False
Casino Royale                      False
Quantum of Solace                  False
Skyfall    

In [270]:
bool_ser_sean_connery = bond["Actor"] == "Sean Connery"
bond[bool_ser_sean_connery]

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
From Russia with Love,2005,Sean Connery,Terence Young,543.8,8.0,1.6
Goldfinger,2000,Sean Connery,Guy Hamilton,820.4,5.0,3.2
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
You Only Live Twice,1967,Sean Connery,Lewis Gilbert,514.2,59.9,4.4
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
Never Say Never Again,1983,Sean Connery,Irvin Kershner,380.0,86.0,


In [271]:
bond[bool_ser_sean_connery]["Actor"]

Film
From Russia with Love    Sean Connery
Goldfinger               Sean Connery
Thunderball              Sean Connery
You Only Live Twice      Sean Connery
Diamonds Are Forever     Sean Connery
Never Say Never Again    Sean Connery
Name: Actor, dtype: object

In [272]:
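# chained indexing assigns to a temporary copy, so `bond` itself is not updated;
# pandas 1.x typically emits a SettingWithCopyWarning here rather than raising an error
# bond[bool_ser_sean_connery]["Actor"] = "Sir Sean Connery" # -> no effect on bond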

In [273]:
# pass the boolean Series to loc[] so the assignment is made on the original DataFrame

bond.loc[bool_ser_sean_connery, "Actor"] = "Sir Sean Connery"
bond.head()

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Dr. No,2004,Mr. Actor,Terence Young,448.8,17.0,0.6
From Russia with Love,2005,Sir Sean Connery,Terence Young,543.8,8.0,1.6
Goldfinger,2000,Sir Sean Connery,Guy Hamilton,820.4,5.0,3.2
Thunderball,1965,Sir Sean Connery,Terence Young,848.1,41.9,4.7
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
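
The sketch below separates the two behaviours on a three-row toy frame: filtering with a boolean mask produces a new object, so changes to it stay local, whereas `loc[mask, column]` writes into the original. The explicit `.copy()` is used here to make the independence of the filtered frame obvious.

```python
import pandas as pd

films = pd.DataFrame(
    {"Actor": ["Sean Connery", "David Niven", "Sean Connery"]},
    index=["Goldfinger", "Casino Royale", "Thunderball"],
)

is_connery = films["Actor"] == "Sean Connery"

# Editing a filtered copy leaves the original untouched
subset = films[is_connery].copy()
subset["Actor"] = "Changed in copy"
original_after_copy_edit = films["Actor"].tolist()

# Assigning through loc with the mask updates the original
films.loc[is_connery, "Actor"] = "Sir Sean Connery"

print(original_after_copy_edit)
print(films["Actor"].tolist())
```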


### <font color=blue> Rename Index Labels or Columns in a DataFrame </font>

* Use the `df.rename()` method, which returns a new DataFrame by default.
* Columns can also be renamed by assigning to the `df.columns` attribute; the assigned list must contain exactly one name per existing column.
* Assigning to `df.columns` modifies the DataFrame in place.

In [274]:
# rename the index labels: Goldfinger becomes Gold Finger, and Dr. No becomes DrNo

bond.rename(mapper = {
    "Goldfinger": "Gold Finger",
    "Dr. No": "DrNo"
}).head()

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
DrNo,2004,Mr. Actor,Terence Young,448.8,17.0,0.6
From Russia with Love,2005,Sir Sean Connery,Terence Young,543.8,8.0,1.6
Gold Finger,2000,Sir Sean Connery,Guy Hamilton,820.4,5.0,3.2
Thunderball,1965,Sir Sean Connery,Terence Young,848.1,41.9,4.7
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


In [275]:
# equivalent ways to write the rename above

bond.rename(mapper = {
    "Goldfinger": "Gold Finger",
    "Dr. No": "DrNo"
}, axis = 0)

bond.rename(mapper = {
    "Goldfinger": "Gold Finger",
    "Dr. No": "DrNo"
}, axis = "index")

bond.rename(index = {
    "Goldfinger": "Gold Finger",
    "Dr. No": "DrNo"
}).head()  # when passing index=, do not pass axis as well

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
DrNo,2004,Mr. Actor,Terence Young,448.8,17.0,0.6
From Russia with Love,2005,Sir Sean Connery,Terence Young,543.8,8.0,1.6
Gold Finger,2000,Sir Sean Connery,Guy Hamilton,820.4,5.0,3.2
Thunderball,1965,Sir Sean Connery,Terence Young,848.1,41.9,4.7
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


In [276]:
# column names can be renamed in the same way

bond.rename(columns = {"Bond Actor Salary" : "Salary"}).head()

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Dr. No,2004,Mr. Actor,Terence Young,448.8,17.0,0.6
From Russia with Love,2005,Sir Sean Connery,Terence Young,543.8,8.0,1.6
Goldfinger,2000,Sir Sean Connery,Guy Hamilton,820.4,5.0,3.2
Thunderball,1965,Sir Sean Connery,Terence Young,848.1,41.9,4.7
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


In [277]:
# another way to rename column is using df.columns attribute

bond.columns

Index(['Year', 'Actor', 'Director', 'Box Office', 'Budget',
       'Bond Actor Salary'],
      dtype='object')

In [278]:
# assigning to df.columns changes the DataFrame in place

bond.columns = ['Released', 'Actor', 'Director', 'Box Office', 'Cost', 'Salary']
bond.head()

Unnamed: 0_level_0,Released,Actor,Director,Box Office,Cost,Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Dr. No,2004,Mr. Actor,Terence Young,448.8,17.0,0.6
From Russia with Love,2005,Sir Sean Connery,Terence Young,543.8,8.0,1.6
Goldfinger,2000,Sir Sean Connery,Guy Hamilton,820.4,5.0,3.2
Thunderball,1965,Sir Sean Connery,Terence Young,848.1,41.9,4.7
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
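
A brief sketch of `rename()` on a two-row toy frame: `index=` and `columns=` can be combined in one call, a function such as `str.upper` may be passed instead of a dictionary, and the original frame is left unchanged because `rename()` returns a new object by default.

```python
import pandas as pd

films = pd.DataFrame(
    {"Year": [1962, 1964], "Bond Actor Salary": [0.6, 3.2]},
    index=["Dr. No", "Goldfinger"],
)

# Dictionaries map old labels to new ones; unlisted labels stay as they are
renamed = films.rename(index={"Dr. No": "DrNo"},
                       columns={"Bond Actor Salary": "Salary"})

# A callable is applied to every column label
upper = films.rename(columns=str.upper)

print(list(renamed.index), list(renamed.columns))
print(list(upper.columns))
print(list(films.columns))  # original labels are still in place
```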


### <font color=blue> Delete Rows or Columns from a DataFrame </font>

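* This can be done using either:
    1. `df.drop()`: removes rows or columns and returns a new DataFrame by default
    2. `df.pop()`: in-place operation; removes a single column and returns it
    3. `del df["Col_to_remove"]`: in-place operation; removes a single column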

In [279]:
# drop a row using index label

bond.drop("Dr. No").head(3)

Unnamed: 0_level_0,Released,Actor,Director,Box Office,Cost,Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
From Russia with Love,2005,Sir Sean Connery,Terence Young,543.8,8.0,1.6
Goldfinger,2000,Sir Sean Connery,Guy Hamilton,820.4,5.0,3.2
Thunderball,1965,Sir Sean Connery,Terence Young,848.1,41.9,4.7


In [280]:
bond.drop("Dr. No", axis = "index").head(3)

Unnamed: 0_level_0,Released,Actor,Director,Box Office,Cost,Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
From Russia with Love,2005,Sir Sean Connery,Terence Young,543.8,8.0,1.6
Goldfinger,2000,Sir Sean Connery,Guy Hamilton,820.4,5.0,3.2
Thunderball,1965,Sir Sean Connery,Terence Young,848.1,41.9,4.7


In [281]:
bond.drop(index = "Dr. No").head(2)

Unnamed: 0_level_0,Released,Actor,Director,Box Office,Cost,Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
From Russia with Love,2005,Sir Sean Connery,Terence Young,543.8,8.0,1.6
Goldfinger,2000,Sir Sean Connery,Guy Hamilton,820.4,5.0,3.2


In [282]:
# dropping a column

bond.drop("Actor", axis = "columns").head(3)

Unnamed: 0_level_0,Released,Director,Box Office,Cost,Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Dr. No,2004,Terence Young,448.8,17.0,0.6
From Russia with Love,2005,Terence Young,543.8,8.0,1.6
Goldfinger,2000,Guy Hamilton,820.4,5.0,3.2


In [283]:
# the same can be written with the columns parameter, which also accepts a list

bond.drop(columns = ["Actor", "Released"]).head(3)

Unnamed: 0_level_0,Director,Box Office,Cost,Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Dr. No,Terence Young,448.8,17.0,0.6
From Russia with Love,Terence Young,543.8,8.0,1.6
Goldfinger,Guy Hamilton,820.4,5.0,3.2


In [284]:
# pop() removes the column in place and returns it as a Series

actor = bond.pop("Actor")
actor

Film
Dr. No                                    Mr. Actor
From Russia with Love              Sir Sean Connery
Goldfinger                         Sir Sean Connery
Thunderball                        Sir Sean Connery
Casino Royale                           David Niven
You Only Live Twice                Sir Sean Connery
On Her Majesty's Secret Service      George Lazenby
Diamonds Are Forever               Sir Sean Connery
Live and Let Die                        Roger Moore
The Man with the Golden Gun             Roger Moore
The Spy Who Loved Me                    Roger Moore
Moonraker                               Roger Moore
For Your Eyes Only                      Roger Moore
Never Say Never Again              Sir Sean Connery
Octopussy                               Roger Moore
A View to a Kill                        Roger Moore
The Living Daylights                 Timothy Dalton
Licence to Kill                      Timothy Dalton
GoldenEye                            Pierce Brosnan
Tomorro

In [285]:
bond.head()

Unnamed: 0_level_0,Released,Director,Box Office,Cost,Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Dr. No,2004,Terence Young,448.8,17.0,0.6
From Russia with Love,2005,Terence Young,543.8,8.0,1.6
Goldfinger,2000,Guy Hamilton,820.4,5.0,3.2
Thunderball,1965,Terence Young,848.1,41.9,4.7
Casino Royale,1967,Ken Hughes,315.0,85.0,


In [286]:
del bond["Cost"]

In [287]:
bond.head(2)

Unnamed: 0_level_0,Released,Director,Box Office,Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Dr. No,2004,Terence Young,448.8,0.6
From Russia with Love,2005,Terence Young,543.8,1.6
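
The three ways of removing a column used above differ in whether they modify the DataFrame. The sketch below contrasts them on a small hypothetical frame (`films` and its values are made up for illustration and are not the bond dataset).

```python
import pandas as pd

# Hypothetical frame used only for illustration
films = pd.DataFrame(
    {
        "Actor": ["A", "B", "C"],
        "Year": [2001, 2002, 2003],
        "Budget": [1.0, 2.0, 3.0],
    },
    index=pd.Index(["Film 1", "Film 2", "Film 3"], name="Film"),
)

# drop() returns a new DataFrame; films itself is left unchanged
without_actor = films.drop(columns="Actor")
cols_after_drop = list(films.columns)

# pop() removes the column from films and returns it as a Series
actor = films.pop("Actor")
cols_after_pop = list(films.columns)

# del removes the column in place and returns nothing
del films["Budget"]
cols_after_del = list(films.columns)

print(cols_after_drop, cols_after_pop, cols_after_del)
```

Because `drop()` leaves the original untouched, its result has to be assigned to a name to be kept.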


### <font color= blue> Create Random Sample with the sample Method </font>

* Returns a random sample of rows (the default) or columns from a Series or DataFrame.

In [288]:
bond.sample()

Unnamed: 0_level_0,Released,Director,Box Office,Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
On Her Majesty's Secret Service,1969,Peter R. Hunt,291.5,0.6


In [289]:
bond.sample(3)

Unnamed: 0_level_0,Released,Director,Box Office,Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Casino Royale,1967,Ken Hughes,315.0,
Moonraker,1979,Lewis Gilbert,535.0,
Casino Royale,2006,Martin Campbell,581.5,3.3


In [290]:
bond.sample(frac = 0.25)

Unnamed: 0_level_0,Released,Director,Box Office,Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Goldfinger,2000,Guy Hamilton,820.4,3.2
The Spy Who Loved Me,1977,Lewis Gilbert,533.0,
Live and Let Die,1973,Guy Hamilton,460.3,
The Living Daylights,1987,John Glen,313.5,5.2
Diamonds Are Forever,1971,Guy Hamilton,442.5,5.8
A View to a Kill,1985,John Glen,275.2,9.1


In [291]:
bond.sample(n = 2, axis = "columns").head()

Unnamed: 0_level_0,Director,Released
Film,Unnamed: 1_level_1,Unnamed: 2_level_1
Dr. No,Terence Young,2004
From Russia with Love,Terence Young,2005
Goldfinger,Guy Hamilton,2000
Thunderball,Terence Young,1965
Casino Royale,Ken Hughes,1967


### <font color=blue> Use the nsmallest / nlargest methods to get rows with smallest / largest values </font>

In [292]:
bond.head(2)

Unnamed: 0_level_0,Released,Director,Box Office,Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Dr. No,2004,Terence Young,448.8,0.6
From Russia with Love,2005,Terence Young,543.8,1.6


In [293]:
bond.nsmallest(n = 3, columns = ["Box Office", "Salary"])

Unnamed: 0_level_0,Released,Director,Box Office,Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Licence to Kill,1989,John Glen,250.9,7.9
A View to a Kill,1985,John Glen,275.2,9.1
On Her Majesty's Secret Service,1969,Peter R. Hunt,291.5,0.6


In [294]:
bond.nlargest(n = 2, columns = "Salary")

Unnamed: 0_level_0,Released,Director,Box Office,Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Die Another Day,2002,Lee Tamahori,465.4,17.9
Skyfall,2012,Sam Mendes,943.5,14.5


In [295]:
bond["Box Office"].nlargest(3)

Film
Skyfall        943.5
Thunderball    848.1
Goldfinger     820.4
Name: Box Office, dtype: float64
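
When several columns are passed, as in the `nsmallest` call above, the later columns are used to break ties in the earlier ones. The sketch below illustrates this, together with the `keep` parameter, on a hypothetical frame (`scores` and its values are made up):

```python
import pandas as pd

# Hypothetical frame: rows "b" and "c" tie on points
scores = pd.DataFrame(
    {"points": [10, 30, 30, 20], "fouls": [3, 1, 5, 2]},
    index=["a", "b", "c", "d"],
)

# One column: with the default keep="first", the first tied row wins
by_points = scores.nlargest(1, "points")

# Two columns: the tie on points is broken by the larger fouls value
tie_broken = scores.nlargest(1, ["points", "fouls"])

# keep="all" returns every row tied for the last place, even if that exceeds n
all_ties = scores.nlargest(1, "points", keep="all")

print(list(by_points.index), list(tie_broken.index), list(all_ties.index))
```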

### <font color=blue> Filter a DataFrame with the where Method </font>

In [296]:
bond = pd.read_csv("./datasets/jamesbond.csv", index_col = "Film")
bond.head()

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


In [297]:
mask1 = bond["Actor"] == "Sean Connery"
bond[mask1]

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
You Only Live Twice,1967,Sean Connery,Lewis Gilbert,514.2,59.9,4.4
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
Never Say Never Again,1983,Sean Connery,Irvin Kershner,380.0,86.0,


In [298]:
bond.where(mask1)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Dr. No,1962.0,Sean Connery,Terence Young,448.8,7.0,0.6
From Russia with Love,1963.0,Sean Connery,Terence Young,543.8,12.6,1.6
Goldfinger,1964.0,Sean Connery,Guy Hamilton,820.4,18.6,3.2
Thunderball,1965.0,Sean Connery,Terence Young,848.1,41.9,4.7
Casino Royale,,,,,,
You Only Live Twice,1967.0,Sean Connery,Lewis Gilbert,514.2,59.9,4.4
On Her Majesty's Secret Service,,,,,,
Diamonds Are Forever,1971.0,Sean Connery,Guy Hamilton,442.5,34.7,5.8
Live and Let Die,,,,,,
The Man with the Golden Gun,,,,,,


In [299]:
mask2 = bond["Box Office"] > 800
bond.where(mask1 & mask2)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Dr. No,,,,,,
From Russia with Love,,,,,,
Goldfinger,1964.0,Sean Connery,Guy Hamilton,820.4,18.6,3.2
Thunderball,1965.0,Sean Connery,Terence Young,848.1,41.9,4.7
Casino Royale,,,,,,
You Only Live Twice,,,,,,
On Her Majesty's Secret Service,,,,,,
Diamonds Are Forever,,,,,,
Live and Let Die,,,,,,
The Man with the Golden Gun,,,,,,
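
Unlike boolean indexing, `where()` keeps the original shape and fills the rows that fail the condition with NaN, which is also why the integer `Year` column is shown as float above. The sketch below, on a hypothetical frame (`df` is made up), shows how the two relate and how the `other` parameter replaces NaN with a chosen value:

```python
import pandas as pd

# Hypothetical frame used only for illustration
df = pd.DataFrame(
    {
        "actor": ["Sean Connery", "David Niven", "Sean Connery"],
        "box_office": [448.8, 315.0, 820.4],
    },
    index=["Film 1", "Film 2", "Film 3"],
)
mask = df["actor"] == "Sean Connery"

filtered = df[mask]       # non-matching rows are removed
masked = df.where(mask)   # same shape; non-matching rows become NaN

# Dropping the all-NaN rows recovers the rows kept by boolean indexing
same_rows = masked.dropna(how="all")

# other supplies a replacement value instead of NaN
filled = df["box_office"].where(df["box_office"] > 400, other=0)

print(masked.shape, list(same_rows.index), filled.tolist())
```

Note that `dropna(how="all")` would also drop rows that were entirely missing in the original data, so the equivalence holds only when no such rows exist.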


### <font color=blue> Filter a DataFrame with the query Method </font>

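* Before filtering a DataFrame with query, replace any spaces in the column names (for example with underscores). A name containing spaces cannot be written as a plain identifier in the query string unless it is wrapped in backticks.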

In [300]:
bond.head(1)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6


In [301]:
["_".join(col_name.strip().split()) for col_name in bond.columns]

['Year', 'Actor', 'Director', 'Box_Office', 'Budget', 'Bond_Actor_Salary']

In [302]:
bond.columns = ["_".join(col_name.strip().split()) for col_name in bond.columns]
bond.head(1)

Unnamed: 0_level_0,Year,Actor,Director,Box_Office,Budget,Bond_Actor_Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6


In [303]:
# queries can be written in Python syntax

bond.query(" Actor == 'Sean Connery' ")

Unnamed: 0_level_0,Year,Actor,Director,Box_Office,Budget,Bond_Actor_Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
You Only Live Twice,1967,Sean Connery,Lewis Gilbert,514.2,59.9,4.4
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
Never Say Never Again,1983,Sean Connery,Irvin Kershner,380.0,86.0,


In [304]:
bond.query(" Actor == '   Sean Connery '.strip() and Box_Office > 800")

Unnamed: 0_level_0,Year,Actor,Director,Box_Office,Budget,Bond_Actor_Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7


### <font color=blue> Create a Copy of a DataFrame with the copy Method </font>

In [305]:
bond = pd.read_csv("./datasets/jamesbond.csv", index_col = "Film")
bond.head()

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


In [306]:
directors = bond["Director"]
directors.head(3)

Film
Dr. No                   Terence Young
From Russia with Love    Terence Young
Goldfinger                Guy Hamilton
Name: Director, dtype: object

In [307]:
# if we now change a value through the directors variable, the change also shows up in the original df

directors["Goldfinger"]

'Guy Hamilton'

In [308]:
directors["Goldfinger"] = "Guy"

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  directors["Goldfinger"] = "Guy"


In [309]:
# here's the change

directors.head(3)

Film
Dr. No                   Terence Young
From Russia with Love    Terence Young
Goldfinger                         Guy
Name: Director, dtype: object

In [310]:
# original also affected

bond.head(3)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
Goldfinger,1964,Sean Connery,Guy,820.4,18.6,3.2


In [311]:
# this problem can be avoided by using the copy() method

directors = bond["Director"].copy()
directors.head(3)

Film
Dr. No                   Terence Young
From Russia with Love    Terence Young
Goldfinger                         Guy
Name: Director, dtype: object

In [312]:
directors["Dr. No"]

'Terence Young'

In [313]:
directors["Dr. No"] = "Young"

In [314]:
directors.head(3)

Film
Dr. No                           Young
From Russia with Love    Terence Young
Goldfinger                         Guy
Name: Director, dtype: object

In [315]:
# original is safe: "Dr. No" is unchanged ("Goldfinger" still shows the earlier edit)

bond.head(3)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
Goldfinger,1964,Sean Connery,Guy,820.4,18.6,3.2
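
The steps above can be condensed into a short sketch on a hypothetical two-row frame (`films` is made up for illustration). Whether an assignment through a plain column selection reaches the original depends on the pandas version and on whether Copy-on-Write is enabled, so the sketch only shows the explicit `copy()` case, which behaves the same either way.

```python
import pandas as pd

# Hypothetical frame used only for illustration
films = pd.DataFrame(
    {"Director": ["Terence Young", "Guy Hamilton"]},
    index=["Dr. No", "Goldfinger"],
)

# copy() gives the Series its own data, independent of films
directors = films["Director"].copy()
directors["Goldfinger"] = "Guy"

print(directors["Goldfinger"])                 # the copy is changed
print(films.loc["Goldfinger", "Director"])     # the original is not
```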


### <font color=brown> to be continued ...... </font>

[Pandas - II](http://localhost:8888/notebooks/Pandas%20-%20II.ipynb)