# **groupby**

- groupby() is a powerful function in pandas that allows you to split your data into groups based on some criteria, apply a function to each group, and then combine the results back into a single DataFrame. This is particularly useful for performing operations like aggregation, transformation, and filtering on subsets of your data.
- Groups are formed from the values in one or more columns. Generally there are two types of groups:
    - **Categorical groups**: These are groups formed based on categorical variables, such as 'Genre', 'Country', or 'Director'. Each unique value in the categorical column defines a group.
    - **Numerical groups**: These are groups formed based on numerical variables, such as 'Year' or 'Rating'. You can create bins or ranges to categorize numerical data into groups.
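- groupby() also works on columns with a numeric data type: each distinct value becomes its own group. For continuous values such as 'Rating', binning the column first (for example with pd.cut) usually gives more useful groups.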
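A minimal sketch of grouping on numeric data, using a small made-up frame (the titles and values are illustrative and not taken from the IMDB file). It groups directly on `Year`, then bins `Rating` with `pd.cut` and groups on the bins.

```python
import pandas as pd

# Hypothetical toy data, not taken from the IMDB dataset.
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D", "E"],
    "Year": [1994, 1994, 2008, 2008, 2010],
    "Rating": [9.3, 8.8, 9.0, 7.9, 8.4],
})

# Grouping directly on a numeric column: each distinct year is one group.
per_year = toy.groupby("Year")["Rating"].mean()

# Grouping on ranges: cut the ratings into bins, then group by the bins.
bins = pd.cut(toy["Rating"], bins=[7.5, 8.5, 9.5], labels=["7.5-8.5", "8.5-9.5"])
per_bin = toy.groupby(bins, observed=True)["Title"].count()

print(per_year)
print(per_bin)
```

The bin edges and labels here are arbitrary choices for the example; any edges that cover the data would work the same way.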

In [1]:
import pandas as pd
import numpy as np

In [2]:
movies = pd.read_csv('../DataSets/imdb-top-1000.csv')

In [3]:
movies.head()

Unnamed: 0,Series_Title,Released_Year,Runtime,Genre,IMDB_Rating,Director,Star1,No_of_Votes,Gross,Metascore
0,The Shawshank Redemption,1994,142,Drama,9.3,Frank Darabont,Tim Robbins,2343110,28341469.0,80.0
1,The Godfather,1972,175,Crime,9.2,Francis Ford Coppola,Marlon Brando,1620367,134966411.0,100.0
2,The Dark Knight,2008,152,Action,9.0,Christopher Nolan,Christian Bale,2303232,534858444.0,84.0
3,The Godfather: Part II,1974,202,Crime,9.0,Francis Ford Coppola,Al Pacino,1129952,57300000.0,90.0
4,12 Angry Men,1957,96,Crime,9.0,Sidney Lumet,Henry Fonda,689845,4360000.0,96.0


In [4]:
genres = movies.groupby('Genre')

## Aggregation Functions
- Aggregation applies a function to each group and combines the results into a single DataFrame or Series. Common aggregation functions include:
    - sum()
    - mean()
    - median()
    - std()
    - var()
    - count()
    - min()
    - max()
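Several of these functions can be applied in one call with `agg()`. The sketch below uses a made-up frame (hypothetical values, not the IMDB data) to show a list of functions on one column, and named aggregation with a different function per column.

```python
import pandas as pd

# Hypothetical toy data, not taken from the IMDB dataset.
toy = pd.DataFrame({
    "Genre": ["Drama", "Drama", "Action", "Action", "Action"],
    "Gross": [10.0, 30.0, 5.0, 15.0, 40.0],
    "Rating": [9.0, 8.0, 7.5, 8.5, 8.0],
})

# Several aggregations on one column at once.
gross_stats = toy.groupby("Genre")["Gross"].agg(["sum", "mean", "max"])

# Named aggregation: one function per column, with explicit output names.
summary = toy.groupby("Genre").agg(
    total_gross=("Gross", "sum"),
    avg_rating=("Rating", "mean"),
)

print(gross_stats)
print(summary)
```

The output column names `total_gross` and `avg_rating` are chosen for this example only.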

In [8]:
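genres.sum()
# Note: sum() concatenates the string columns (Series_Title, Director, Star1, and Released_Year, which is stored as text here); only the numeric columns give meaningful totals.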

Genre,Series_Title,Released_Year,Runtime,IMDB_Rating,Director,Star1,No_of_Votes,Gross,Metascore
Action,The Dark KnightThe Lord of the Rings: The Retu...,2008200320102001200219991980197719621954200019...,22196,1367.3,Christopher NolanPeter JacksonChristopher Nola...,Christian BaleElijah WoodLeonardo DiCaprioElij...,72282412,32632260000.0,10499.0
Adventure,InterstellarBack to the FutureInglourious Bast...,2014198520091981196819621959201319751963194819...,9656,571.5,Christopher NolanRobert ZemeckisQuentin Tarant...,Matthew McConaugheyMichael J. FoxBrad PittJürg...,22576163,9496922000.0,5020.0
Animation,Sen to Chihiro no kamikakushiThe Lion KingHota...,2001199419882016201820172008199719952019200920...,8166,650.3,Hayao MiyazakiRoger AllersIsao TakahataMakoto ...,Daveigh ChaseRob MinkoffTsutomu TatsumiRyûnosu...,21978630,14631470000.0,6082.0
Biography,Schindler's ListGoodfellasHamiltonThe Intoucha...,1993199020202011200220171995198420182013201320...,11970,698.6,Steven SpielbergMartin ScorseseThomas KailOliv...,Liam NeesonRobert De NiroLin-Manuel MirandaÉri...,24006844,8276358000.0,6023.0
Comedy,GisaengchungLa vita è bellaModern TimesCity Li...,2019199719361931200919641940200120001973196019...,17380,1224.7,Bong Joon HoRoberto BenigniCharles ChaplinChar...,Kang-ho SongRoberto BenigniCharles ChaplinChar...,27620327,15663870000.0,9840.0
Crime,The GodfatherThe Godfather: Part II12 Angry Me...,1972197419571994200219991995199120192006199519...,13524,857.8,Francis Ford CoppolaFrancis Ford CoppolaSidney...,Marlon BrandoAl PacinoHenry FondaJohn Travolta...,33533615,8452632000.0,6706.0
Drama,The Shawshank RedemptionFight ClubForrest Gump...,1994199919941975202019981946201420061998198819...,36049,2299.7,Frank DarabontDavid FincherRobert ZemeckisMilo...,Tim RobbinsBrad PittTom HanksJack NicholsonSur...,61367304,35409970000.0,19208.0
Family,E.T. the Extra-TerrestrialWilly Wonka & the Ch...,19821971,215,15.6,Steven SpielbergMel Stuart,Henry ThomasGene Wilder,551221,439110600.0,158.0
Fantasy,Das Cabinet des Dr. CaligariNosferatu,19201922,170,16.0,Robert WieneF.W. Murnau,Werner KraussMax Schreck,146222,782726700.0,0.0
Film-Noir,The Third ManThe Maltese FalconShadow of a Doubt,194919411943,312,23.9,Carol ReedJohn HustonAlfred Hitchcock,Orson WellesHumphrey BogartTeresa Wright,367215,125910500.0,287.0


In [10]:
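genres.min()
# min() is computed column by column, so the values in one row do not necessarily come from the same movie.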

Genre,Series_Title,Released_Year,Runtime,IMDB_Rating,Director,Star1,No_of_Votes,Gross,Metascore
Action,300,1924,45,7.6,Abhishek Chaubey,Aamir Khan,25312,3296.0,33.0
Adventure,2001: A Space Odyssey,1925,88,7.6,Akira Kurosawa,Aamir Khan,29999,61001.0,41.0
Animation,Akira,1940,71,7.6,Adam Elliot,Adrian Molina,25229,128985.0,61.0
Biography,12 Years a Slave,1928,93,7.6,Adam McKay,Adrien Brody,27254,21877.0,48.0
Comedy,(500) Days of Summer,1921,68,7.6,Alejandro G. Iñárritu,Aamir Khan,26337,1305.0,45.0
Crime,12 Angry Men,1931,80,7.6,Akira Kurosawa,Ajay Devgn,27712,6013.0,47.0
Drama,1917,1925,64,7.6,Aamir Khan,Abhay Deol,25088,3600.0,28.0
Family,E.T. the Extra-Terrestrial,1971,100,7.8,Mel Stuart,Gene Wilder,178731,4000000.0,67.0
Fantasy,Das Cabinet des Dr. Caligari,1920,76,7.9,F.W. Murnau,Max Schreck,57428,337574718.0,
Film-Noir,Shadow of a Doubt,1941,100,7.8,Alfred Hitchcock,Humphrey Bogart,59556,449191.0,94.0


Q : Find the top 3 genres by total earnings.

In [11]:
movies.groupby('Genre').sum()['Gross'].sort_values(ascending=False).head(3)

Genre
Drama     3.540997e+10
Action    3.263226e+10
Comedy    1.566387e+10
Name: Gross, dtype: float64

In [12]:
# another way
movies.groupby('Genre')['Gross'].sum().sort_values(ascending=False).head(3)

Genre
Drama     3.540997e+10
Action    3.263226e+10
Comedy    1.566387e+10
Name: Gross, dtype: float64

* `Which method is better?`
* ans : The second method is better. It selects the 'Gross' column before aggregating, so only that one column is summed. The first method sums every column (including concatenating the string columns) and then discards everything except 'Gross'.
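The two orderings can be compared side by side on a small made-up frame (hypothetical values, not the IMDB data). This is only a sketch of the idea: both return the same numbers, and the difference is how many columns get aggregated along the way.

```python
import pandas as pd

# Hypothetical toy data, not taken from the IMDB dataset.
toy = pd.DataFrame({
    "Genre": ["Drama", "Drama", "Action"],
    "Director": ["X", "Y", "Z"],
    "Gross": [10.0, 30.0, 5.0],
})

# Aggregates every column first, then picks one column from the result.
wide_then_pick = toy.groupby("Genre").sum()["Gross"]

# Picks the column first, so only 'Gross' is aggregated.
pick_then_sum = toy.groupby("Genre")["Gross"].sum()

# Check that both routes produce the same Series.
print(wide_then_pick.equals(pick_then_sum))
```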

Q : Find the genre with the highest average IMDB rating.

In [15]:
movies.groupby('Genre')['IMDB_Rating'].mean().sort_values(ascending=False).head(1)

Genre
Western    8.35
Name: IMDB_Rating, dtype: float64

Q : Find the most popular director (measured by total number of votes).

In [17]:
movies.groupby('Director')['No_of_Votes'].sum().sort_values(ascending=False).head(1)

Director
Christopher Nolan    11578345
Name: No_of_Votes, dtype: int64

Q : Find the highest-rated movie of each genre.

In [20]:
# max() returns only the highest rating in each genre, not the movie itself
movies.groupby('Genre')['IMDB_Rating'].max()

Genre
Action       9.0
Adventure    8.6
Animation    8.6
Biography    8.9
Comedy       8.6
Crime        9.2
Drama        9.3
Family       7.8
Fantasy      8.1
Film-Noir    8.1
Horror       8.5
Mystery      8.4
Thriller     7.8
Western      8.8
Name: IMDB_Rating, dtype: float64

In [21]:
# another way: idxmax() gives the row label of the top-rated movie in each genre, and loc returns the full rows
movies.loc[movies.groupby('Genre')['IMDB_Rating'].idxmax()]

Unnamed: 0,Series_Title,Released_Year,Runtime,Genre,IMDB_Rating,Director,Star1,No_of_Votes,Gross,Metascore
2,The Dark Knight,2008,152,Action,9.0,Christopher Nolan,Christian Bale,2303232,534858444.0,84.0
21,Interstellar,2014,169,Adventure,8.6,Christopher Nolan,Matthew McConaughey,1512360,188020017.0,74.0
23,Sen to Chihiro no kamikakushi,2001,125,Animation,8.6,Hayao Miyazaki,Daveigh Chase,651376,10055859.0,96.0
7,Schindler's List,1993,195,Biography,8.9,Steven Spielberg,Liam Neeson,1213505,96898818.0,94.0
19,Gisaengchung,2019,132,Comedy,8.6,Bong Joon Ho,Kang-ho Song,552778,53367844.0,96.0
1,The Godfather,1972,175,Crime,9.2,Francis Ford Coppola,Marlon Brando,1620367,134966411.0,100.0
0,The Shawshank Redemption,1994,142,Drama,9.3,Frank Darabont,Tim Robbins,2343110,28341469.0,80.0
688,E.T. the Extra-Terrestrial,1982,115,Family,7.8,Steven Spielberg,Henry Thomas,372490,435110554.0,91.0
321,Das Cabinet des Dr. Caligari,1920,76,Fantasy,8.1,Robert Wiene,Werner Krauss,57428,337574718.0,
309,The Third Man,1949,104,Film-Noir,8.1,Carol Reed,Orson Welles,158731,449191.0,97.0
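The `idxmax()` plus `loc` pattern can be broken into two steps. The sketch below uses a made-up frame (hypothetical titles and ratings) and includes a tie to show that `idxmax()` keeps the first row that reaches the maximum.

```python
import pandas as pd

# Hypothetical toy data, not taken from the IMDB dataset.
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Genre": ["Drama", "Drama", "Action", "Action"],
    "Rating": [9.3, 8.8, 9.0, 9.0],
})

# Step 1: for each genre, the index label of the row with the highest rating.
# 'C' and 'D' tie in Action; idxmax() reports the first one it meets.
best_idx = toy.groupby("Genre")["Rating"].idxmax()

# Step 2: use those labels to pull the full rows out of the original frame.
best_rows = toy.loc[best_idx]

print(best_idx)
print(best_rows)
```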


Q : Find the number of movies for each lead actor (Star1).

In [24]:
movies['Star1'].value_counts()

Star1
Tom Hanks            12
Robert De Niro       11
Al Pacino            10
Clint Eastwood       10
Humphrey Bogart       9
                     ..
Phil Harris           1
David Hemmings        1
John Lennon           1
Tallulah Bankhead     1
Bruce Lee             1
Name: count, Length: 660, dtype: int64

In [25]:
movies.groupby('Star1')['Series_Title'].count().sort_values(ascending=False)

Star1
Tom Hanks               12
Robert De Niro          11
Clint Eastwood          10
Al Pacino               10
Humphrey Bogart          9
                        ..
Zbigniew Zamachowski     1
Zooey Deschanel          1
Çetin Tekindor           1
Éric Toledano            1
Aaron Taylor-Johnson     1
Name: Series_Title, Length: 660, dtype: int64

## Groupby Attributes and Methods

- Find the total number of groups

In [26]:
len(movies.groupby('Genre'))

14

In [27]:
movies.groupby('Genre').ngroups

14

In [38]:
movies['Genre'].nunique()

14

- Find the number of items in each group

In [41]:
movies.groupby('Genre').size()

Genre
Action       172
Adventure     72
Animation     82
Biography     88
Comedy       155
Crime        107
Drama        289
Family         2
Fantasy        2
Film-Noir      3
Horror        11
Mystery       12
Thriller       1
Western        4
dtype: int64

In [43]:
movies['Genre'].value_counts()
# size() and value_counts() give the same counts here. size() orders the result by genre name, while value_counts() orders it by count (descending). For a single column, value_counts() is the more direct call because no groupby object is needed.

Genre
Drama        289
Action       172
Comedy       155
Crime        107
Biography     88
Animation     82
Adventure     72
Mystery       12
Horror        11
Western        4
Film-Noir      3
Fantasy        2
Family         2
Thriller       1
Name: count, dtype: int64
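The counting methods used so far differ in how they treat missing values and in how they order the result. A minimal sketch on a made-up frame (hypothetical values, with one missing Metascore):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the IMDB dataset.
toy = pd.DataFrame({
    "Genre": ["Drama", "Drama", "Action", "Drama"],
    "Metascore": [80.0, np.nan, 84.0, 90.0],
})

# size(): number of rows per group, missing values included.
sizes = toy.groupby("Genre").size()

# count(): number of non-null values in the selected column.
counts = toy.groupby("Genre")["Metascore"].count()

# value_counts(): same totals as size(), ordered by count instead of by name.
freq = toy["Genre"].value_counts()

print(sizes)
print(counts)
print(freq)
```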

`first() , last() and nth()`
- These methods allow you to retrieve the first, last, or nth item from each group.
- They are useful for quickly accessing specific rows within each group without having to iterate through the groups manually.
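One detail worth checking on a small example: `first()` takes the first non-null value of each column, so its result is not always an actual row of the data. The sketch below (made-up values, one missing Metascore) compares it with `head(1)`, which returns the real first row of each group.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the IMDB dataset.
toy = pd.DataFrame({
    "Genre": ["Fantasy", "Fantasy", "Drama"],
    "Title": ["A", "B", "C"],
    "Metascore": [np.nan, 70.0, 80.0],
})
g = toy.groupby("Genre")

# first(): per column, the first non-null value in each group.
# Fantasy gets Title 'A' from row 0 but Metascore 70.0 from row 1.
firsts = g.first()

# head(1): the actual first row of each group, missing values included.
first_rows = g.head(1)

print(firsts)
print(first_rows)
```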

In [45]:
genres = movies.groupby('Genre')

In [47]:
genres.first()
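# This returns the first non-null value of each column within each group, following the order of the original DataFrame. If the first row of a group has a missing value in a column, the value is taken from a later row.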

Genre,Series_Title,Released_Year,Runtime,IMDB_Rating,Director,Star1,No_of_Votes,Gross,Metascore
Action,The Dark Knight,2008,152,9.0,Christopher Nolan,Christian Bale,2303232,534858444.0,84.0
Adventure,Interstellar,2014,169,8.6,Christopher Nolan,Matthew McConaughey,1512360,188020017.0,74.0
Animation,Sen to Chihiro no kamikakushi,2001,125,8.6,Hayao Miyazaki,Daveigh Chase,651376,10055859.0,96.0
Biography,Schindler's List,1993,195,8.9,Steven Spielberg,Liam Neeson,1213505,96898818.0,94.0
Comedy,Gisaengchung,2019,132,8.6,Bong Joon Ho,Kang-ho Song,552778,53367844.0,96.0
Crime,The Godfather,1972,175,9.2,Francis Ford Coppola,Marlon Brando,1620367,134966411.0,100.0
Drama,The Shawshank Redemption,1994,142,9.3,Frank Darabont,Tim Robbins,2343110,28341469.0,80.0
Family,E.T. the Extra-Terrestrial,1982,115,7.8,Steven Spielberg,Henry Thomas,372490,435110554.0,91.0
Fantasy,Das Cabinet des Dr. Caligari,1920,76,8.1,Robert Wiene,Werner Krauss,57428,337574718.0,
Film-Noir,The Third Man,1949,104,8.1,Carol Reed,Orson Welles,158731,449191.0,97.0


In [49]:
genres.last()
# This returns the last non-null value of each column within each group, following the order of the original DataFrame.

Genre,Series_Title,Released_Year,Runtime,IMDB_Rating,Director,Star1,No_of_Votes,Gross,Metascore
Action,Escape from Alcatraz,1979,112,7.6,Don Siegel,Clint Eastwood,121731,43000000.0,76.0
Adventure,Kelly's Heroes,1970,144,7.6,Brian G. Hutton,Clint Eastwood,45338,1378435.0,50.0
Animation,The Jungle Book,1967,78,7.6,Wolfgang Reitherman,Phil Harris,166409,141843612.0,65.0
Biography,Midnight Express,1978,121,7.6,Alan Parker,Brad Davis,73662,35000000.0,59.0
Comedy,Breakfast at Tiffany's,1961,115,7.6,Blake Edwards,Audrey Hepburn,166544,679874270.0,76.0
Crime,The 39 Steps,1935,86,7.6,Alfred Hitchcock,Robert Donat,51853,302787539.0,93.0
Drama,Lifeboat,1944,97,7.6,Alfred Hitchcock,Tallulah Bankhead,26471,852142728.0,78.0
Family,Willy Wonka & the Chocolate Factory,1971,100,7.8,Mel Stuart,Gene Wilder,178731,4000000.0,67.0
Fantasy,Nosferatu,1922,94,7.9,F.W. Murnau,Max Schreck,88794,445151978.0,
Film-Noir,Shadow of a Doubt,1943,108,7.8,Alfred Hitchcock,Teresa Wright,59556,123353292.0,94.0


In [51]:
genres.nth(2)
# This will return the third row (index 2) of each group. If a group has fewer than three rows, it will not appear in the result.

Unnamed: 0,Series_Title,Released_Year,Runtime,Genre,IMDB_Rating,Director,Star1,No_of_Votes,Gross,Metascore
4,12 Angry Men,1957,96,Crime,9.0,Sidney Lumet,Henry Fonda,689845,4360000.0,96.0
8,Inception,2010,148,Action,8.8,Christopher Nolan,Leonardo DiCaprio,2067042,292576195.0,74.0
11,Forrest Gump,1994,142,Drama,8.8,Robert Zemeckis,Tom Hanks,1809221,330252182.0,82.0
18,Hamilton,2020,160,Biography,8.6,Thomas Kail,Lin-Manuel Miranda,55291,440984783.0,90.0
46,Hotaru no haka,1988,89,Animation,8.5,Isao Takahata,Tsutomu Tatsumi,235231,150734678.0,94.0
51,Modern Times,1936,87,Comedy,8.5,Charles Chaplin,Charles Chaplin,217881,163245.0,96.0
93,Inglourious Basterds,2009,153,Adventure,8.3,Quentin Tarantino,Brad Pitt,1267869,120540719.0,69.0
115,Per qualche dollaro in più,1965,132,Western,8.3,Sergio Leone,Clint Eastwood,232772,15000000.0,74.0
119,Vertigo,1958,128,Mystery,8.3,Alfred Hitchcock,James Stewart,364368,3200000.0,100.0
271,The Thing,1982,109,Horror,8.1,John Carpenter,Kurt Russell,371271,13782838.0,57.0


`get_group()`
- This method allows you to retrieve a specific group from the grouped object.
- You need to pass the group name (the value of the column you grouped by) as an argument to this method.

In [53]:
genres.get_group('Western')
# This will return all rows where the 'Genre' is 'Western'.

Unnamed: 0,Series_Title,Released_Year,Runtime,Genre,IMDB_Rating,Director,Star1,No_of_Votes,Gross,Metascore
12,"Il buono, il brutto, il cattivo",1966,161,Western,8.8,Sergio Leone,Clint Eastwood,688390,6100000.0,90.0
48,Once Upon a Time in the West,1968,165,Western,8.5,Sergio Leone,Henry Fonda,302844,5321508.0,80.0
115,Per qualche dollaro in più,1965,132,Western,8.3,Sergio Leone,Clint Eastwood,232772,15000000.0,74.0
691,The Outlaw Josey Wales,1976,135,Western,7.8,Clint Eastwood,Clint Eastwood,65659,31800000.0,69.0


In [54]:
# another way
movies[movies['Genre'] == 'Western']

Unnamed: 0,Series_Title,Released_Year,Runtime,Genre,IMDB_Rating,Director,Star1,No_of_Votes,Gross,Metascore
12,"Il buono, il brutto, il cattivo",1966,161,Western,8.8,Sergio Leone,Clint Eastwood,688390,6100000.0,90.0
48,Once Upon a Time in the West,1968,165,Western,8.5,Sergio Leone,Henry Fonda,302844,5321508.0,80.0
115,Per qualche dollaro in più,1965,132,Western,8.3,Sergio Leone,Clint Eastwood,232772,15000000.0,74.0
691,The Outlaw Josey Wales,1976,135,Western,7.8,Clint Eastwood,Clint Eastwood,65659,31800000.0,69.0


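- Both approaches return the same rows. get_group() is convenient when a groupby object already exists, because it can reuse the grouping instead of scanning the column again. For a one-off lookup, the boolean filter is just as direct.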
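A small sketch (made-up data) that checks the two lookups against each other, and also shows that a groupby object can be iterated to visit every group in turn:

```python
import pandas as pd

# Hypothetical toy data, not taken from the IMDB dataset.
toy = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Genre": ["Western", "Drama", "Western"],
})
g = toy.groupby("Genre")

via_group = g.get_group("Western")
via_mask = toy[toy["Genre"] == "Western"]

# Compare the rows selected by each approach.
print(via_group.index.equals(via_mask.index))

# Iterating yields (group name, sub-DataFrame) pairs.
group_sizes = {name: len(frame) for name, frame in g}
print(group_sizes)
```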

`groups`
- This attribute returns a dictionary where the keys are the group names and the values are the indices of the rows that belong to each group.

In [55]:
genres.groups

{'Action': [2, 5, 8, 10, 13, 14, 16, 29, 30, 31, 39, 42, 44, 55, 57, 59, 60, 63, 68, 72, 106, 109, 129, 130, 134, 140, 142, 144, 152, 155, 160, 161, 166, 168, 171, 172, 177, 181, 194, 201, 202, 216, 217, 223, 224, 236, 241, 262, 275, 294, 308, 320, 325, 326, 331, 337, 339, 340, 343, 345, 348, 351, 353, 356, 357, 362, 368, 369, 375, 376, 390, 410, 431, 436, 473, 477, 479, 482, 488, 493, 496, 502, 507, 511, 532, 535, 540, 543, 564, 569, 570, 573, 577, 582, 583, 602, 605, 608, 615, 623, ...], 'Adventure': [21, 47, 93, 110, 114, 116, 118, 137, 178, 179, 191, 193, 209, 226, 231, 247, 267, 273, 281, 300, 301, 304, 306, 323, 329, 361, 366, 377, 402, 406, 415, 426, 458, 470, 497, 498, 506, 513, 514, 537, 549, 552, 553, 566, 576, 604, 609, 618, 638, 647, 675, 681, 686, 692, 711, 713, 739, 755, 781, 797, 798, 851, 873, 884, 912, 919, 947, 957, 964, 966, 984, 991], 'Animation': [23, 43, 46, 56, 58, 61, 66, 70, 101, 135, 146, 151, 158, 170, 197, 205, 211, 213, 219, 229, 230, 242, 245, 246, 270, 33
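The values in `groups` are index labels, not positions. The sketch below uses a made-up frame with a non-default index to make the difference visible, and compares `groups` with the related `indices` attribute, which holds integer positions.

```python
import pandas as pd

# Hypothetical toy data with a custom index, not taken from the IMDB dataset.
toy = pd.DataFrame(
    {"Genre": ["Western", "Drama", "Western"]},
    index=[10, 20, 30],
)
g = toy.groupby("Genre")

# .groups: group name -> index labels of the rows in that group.
labels = {name: list(idx) for name, idx in g.groups.items()}

# .indices: group name -> integer positions of the rows in that group.
positions = {name: list(pos) for name, pos in g.indices.items()}

print(labels)
print(positions)
```

With the default RangeIndex used by `movies`, labels and positions coincide, which is why the output above looks like plain row numbers.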

`describe()`
- This method provides a summary of statistics for each group, similar to the describe() method for a DataFrame.

In [57]:
genres.describe()

,Runtime,Runtime,Runtime,Runtime,Runtime,Runtime,Runtime,Runtime,IMDB_Rating,IMDB_Rating,...,Gross,Gross,Metascore,Metascore,Metascore,Metascore,Metascore,Metascore,Metascore,Metascore
Genre,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
Action,172.0,129.046512,28.500706,45.0,110.75,127.5,143.25,321.0,172.0,7.949419,...,267443700.0,936662225.0,143.0,73.41958,12.421252,33.0,65.0,74.0,82.0,98.0
Adventure,72.0,134.111111,33.31732,88.0,109.0,127.0,149.0,228.0,72.0,7.9375,...,199807000.0,874211619.0,64.0,78.4375,12.345393,41.0,69.75,80.5,87.25,100.0
Animation,82.0,99.585366,14.530471,71.0,90.0,99.5,106.75,137.0,82.0,7.930488,...,252061200.0,873839108.0,75.0,81.093333,8.813646,61.0,75.0,82.0,87.5,96.0
Biography,88.0,136.022727,25.514466,93.0,120.0,129.0,146.25,209.0,88.0,7.938636,...,98299240.0,753585104.0,79.0,76.240506,11.028187,48.0,70.5,76.0,84.5,97.0
Comedy,155.0,112.129032,22.946213,68.0,96.0,106.0,124.5,188.0,155.0,7.90129,...,81078090.0,886752933.0,125.0,78.72,11.82916,45.0,72.0,79.0,88.0,99.0
Crime,107.0,126.392523,27.689231,80.0,106.5,122.0,141.5,229.0,107.0,8.016822,...,71021630.0,790482117.0,87.0,77.08046,13.099102,47.0,69.5,77.0,87.0,100.0
Drama,289.0,124.737024,27.74049,64.0,105.0,121.0,137.0,242.0,289.0,7.957439,...,116446100.0,924558264.0,241.0,79.701245,12.744687,28.0,72.0,82.0,89.0,100.0
Family,2.0,107.5,10.606602,100.0,103.75,107.5,111.25,115.0,2.0,7.8,...,327332900.0,435110554.0,2.0,79.0,16.970563,67.0,73.0,79.0,85.0,91.0
Fantasy,2.0,85.0,12.727922,76.0,80.5,85.0,89.5,94.0,2.0,8.0,...,418257700.0,445151978.0,0.0,,,,,,,
Film-Noir,3.0,104.0,4.0,100.0,102.0,104.0,106.0,108.0,3.0,7.966667,...,62730680.0,123353292.0,3.0,95.666667,1.527525,94.0,95.0,96.0,96.5,97.0


`sample()`
- This method allows you to randomly sample a specified number of rows from each group.

In [62]:
genres.sample(2)
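# Intended to return 2 random rows from each genre group. It raises the error below because sampling is without replacement by default and some groups (for example Thriller, with 1 movie) have fewer than 2 rows.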

ValueError: Cannot take a larger sample than population when 'replace=False'
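Two possible ways around this error, sketched on a made-up frame where one group has a single row. Neither is the only option; which one fits depends on whether repeated rows are acceptable.

```python
import pandas as pd

# Hypothetical toy data, not taken from the IMDB dataset.
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Genre": ["Drama", "Drama", "Drama", "Thriller"],
})

# Option 1: shuffle all rows, then keep at most 2 per genre.
# Small groups simply contribute fewer rows; nothing is repeated.
up_to_two = toy.sample(frac=1, random_state=0).groupby("Genre").head(2)

# Option 2: sample with replacement, so every group yields exactly 2 rows.
# A group with a single row returns that row twice.
with_repeats = toy.groupby("Genre").sample(n=2, replace=True, random_state=0)

print(up_to_two)
print(with_repeats)
```

The `random_state=0` seed is only there to make the example reproducible.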

In [None]:
# first()/last() -> nth item
# get_group -> vs filtering
# groups
# describe
# sample
# nunique