
# groupby Object
* groupby() is one of the most powerful features in Pandas.
* It returns a GroupBy object: a special Pandas object that holds the grouped data and waits for an aggregation such as sum(), mean(), or count().
* Groups are formed from the distinct values of one or more columns.
* groupby() is generally applied to categorical columns to summarize numeric data across categories, but it can technically be used on any column type.

A groupby operation follows three steps (split-apply-combine):

1. Split the data into groups
2. Apply some operation to each group
3. Combine the results
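
The three steps can be sketched by hand on a small, made-up DataFrame and compared with the one-line groupby() version. The toy frame and its column names are illustrative assumptions, not part of the IMDB dataset.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the three steps.
toy = pd.DataFrame({
    "genre": ["Drama", "Action", "Drama", "Action", "Comedy"],
    "gross": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# 1. Split: one sub-DataFrame per distinct genre.
pieces = {name: part for name, part in toy.groupby("genre")}

# 2. Apply: reduce each piece to a single number.
totals = {name: part["gross"].sum() for name, part in pieces.items()}

# 3. Combine: collect the per-group results into one Series.
manual = pd.Series(totals, name="gross").sort_index()

# groupby() performs the same three steps in one expression.
auto = toy.groupby("genre")["gross"].sum()
```

Both manual and auto should hold the same totals per genre; the second form is what the rest of this notebook uses.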



In [10]:
import pandas as pd
import numpy as np

In [11]:
movies = pd.read_csv('/content/imdb-top-1000.csv')
matches = pd.read_csv('/content/deliveries.csv')

In [12]:
movies.head(1)

,Series_Title,Released_Year,Runtime,Genre,IMDB_Rating,Director,Star1,No_of_Votes,Gross,Metascore
0,The Shawshank Redemption,1994,142,Drama,9.3,Frank Darabont,Tim Robbins,2343110,28341469.0,80.0


In [20]:
# Group the movies by the Genre column
genres = movies.groupby('Genre')

In [99]:
# Applying built-in aggregate functions on groupby object
genres.sum(numeric_only=True).head(5)

Genre,Runtime,IMDB_Rating,No_of_Votes,Gross,Metascore
Action,22196,1367.3,72282412,32632260000.0,10499.0
Adventure,9656,571.5,22576163,9496922000.0,5020.0
Animation,8166,650.3,21978630,14631470000.0,6082.0
Biography,11970,698.6,24006844,8276358000.0,6023.0
Comedy,17380,1224.7,27620327,15663870000.0,9840.0


In [41]:
# Find the top 3 genres by total gross earnings
# Alternative: movies.groupby('Genre').sum(numeric_only=True)['Gross'].sort_values(ascending=False).head(3)
movies.groupby('Genre')['Gross'].sum().sort_values(ascending=False).head(3)

Genre,Gross
Drama,35409970000.0
Action,32632260000.0
Comedy,15663870000.0


In [43]:
# Find the genre with the highest average IMDB rating
movies.groupby('Genre')['IMDB_Rating'].mean().sort_values(ascending=False).head(1)

Genre,IMDB_Rating
Western,8.35


In [48]:
# Find the most popular director, measured by total number of votes
movies.groupby('Director')['No_of_Votes'].sum().sort_values(ascending=False).head(1)

Director,No_of_Votes
Christopher Nolan,11578345


In [51]:
# Preview the data before counting the movies done by each actor
movies.head(2)

,Series_Title,Released_Year,Runtime,Genre,IMDB_Rating,Director,Star1,No_of_Votes,Gross,Metascore
0,The Shawshank Redemption,1994,142,Drama,9.3,Frank Darabont,Tim Robbins,2343110,28341469.0,80.0
1,The Godfather,1972,175,Crime,9.2,Francis Ford Coppola,Marlon Brando,1620367,134966411.0,100.0


In [61]:
# Find the number of movies done by each lead actor (Star1)
# movies.groupby('Star1')['Series_Title'].count().sort_values(ascending=False)
movies['Star1'].value_counts()

Star1,count
Tom Hanks,12
Robert De Niro,11
Al Pacino,10
Clint Eastwood,10
Humphrey Bogart,9
...,...
Junko Iwao,1
Fernanda Montenegro,1
Eli Marienthal,1
Til Schweiger,1


# groupby attributes and methods

## len()
* len() returns the number of groups created by groupby().
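
As a small sketch on made-up data (the toy frame is an assumption for illustration), the group count from len() should match the ngroups attribute of the groupby object and, when the key column has no missing values, the number of distinct keys:

```python
import pandas as pd

# Hypothetical toy data with three distinct genres.
toy = pd.DataFrame({
    "genre": ["Drama", "Action", "Drama", "Comedy"],
    "rating": [8.1, 7.9, 8.5, 7.7],
})
grouped = toy.groupby("genre")

n_groups = len(grouped)            # number of groups
via_attribute = grouped.ngroups    # attribute holding the same count
n_unique = toy["genre"].nunique()  # distinct keys in the column
```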

In [63]:
len(movies.groupby('Genre'))

14

## nunique()
1. DataFrame/Series
* Returns the number of distinct (unique) values.
2. groupby object
* Returns the number of distinct values in each column, counted separately within each group.
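
The difference between the two forms can be seen on a small, made-up frame (the column names and values are illustrative assumptions): the Series call gives one number for the whole column, while the groupby call gives one number per group.

```python
import pandas as pd

# Hypothetical toy data: Drama has two different directors, Action has one.
toy = pd.DataFrame({
    "genre": ["Drama", "Drama", "Drama", "Action", "Action"],
    "director": ["A", "A", "B", "C", "C"],
})

# Series.nunique(): distinct directors across the whole column.
overall = toy["director"].nunique()

# GroupBy nunique(): distinct directors inside each genre.
per_genre = toy.groupby("genre")["director"].nunique()
```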

In [68]:
# On a single column (a Series)
movies['Genre'].nunique()

14

In [94]:
genres.nunique().head(4)

Genre,Series_Title,Released_Year,Runtime,IMDB_Rating,Director,Star1,No_of_Votes,Gross,Metascore
Action,172,61,78,15,123,121,172,172,50
Adventure,72,49,58,10,59,59,72,72,33
Animation,82,35,41,11,51,77,82,82,29
Biography,88,44,56,13,76,72,88,88,40


## size()
* size() returns the number of rows in each group.
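
size() is easy to confuse with count(). A minimal sketch on made-up data (the toy frame is an assumption) shows the difference: size() counts every row in a group, while count() counts only the non-null values of a column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with some missing Metascore values.
toy = pd.DataFrame({
    "genre": ["Drama", "Drama", "Action", "Action", "Action"],
    "metascore": [80.0, np.nan, 70.0, np.nan, np.nan],
})

# size(): rows per group, missing values included.
rows_per_group = toy.groupby("genre").size()

# count(): non-null metascore values per group.
non_null_per_group = toy.groupby("genre")["metascore"].count()
```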

In [80]:
movies.groupby('Genre').size().head(4)

Genre,0
Action,172
Adventure,72
Animation,82
Biography,88


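## value_counts()
* Returns the number of rows for each distinct value of a column, much like size() on a groupby object.
* The difference is that value_counts() sorts the result by count in descending order, while size() orders it by group label.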
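
The ordering difference can be checked on a small, made-up column (the toy frame is an illustrative assumption, chosen so that no two genres tie on count):

```python
import pandas as pd

# Hypothetical toy data: Drama x3, Comedy x2, Action x1.
toy = pd.DataFrame({
    "genre": ["Comedy", "Drama", "Action", "Drama", "Drama", "Comedy"],
})

# size(): same counts, ordered alphabetically by group label.
by_label = toy.groupby("genre").size()

# value_counts(): same counts, ordered from most to least frequent.
by_count = toy["genre"].value_counts()
```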

In [81]:
movies['Genre'].value_counts().head(4)

Genre,count
Drama,289
Action,172
Comedy,155
Crime,107


## first(), last() and nth()
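* first() returns the first non-null value of each column in each group.
* last() returns the last non-null value of each column in each group.
* nth(n) returns the row at position n in each group. Positions are zero-based, and groups with too few rows are skipped.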
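
A small sketch on made-up data (the toy frame and its titles are assumptions for illustration) shows the zero-based position used by nth() and what happens to a group that is too short:

```python
import pandas as pd

# Hypothetical toy data: Comedy has only one row.
toy = pd.DataFrame({
    "genre": ["Action", "Action", "Drama", "Drama", "Comedy"],
    "title": ["a1", "a2", "d1", "d2", "c1"],
})
grouped = toy.groupby("genre")

firsts = grouped["title"].first()  # first title per genre
lasts = grouped["title"].last()    # last title per genre

# nth(1) picks the SECOND row of each group (position 1).
# Comedy has no second row, so it does not appear in the result.
second = grouped.nth(1)
```

The index of the nth() result depends on the pandas version (group labels in older releases, original row labels in newer ones), so the sketch only relies on the selected values.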

In [77]:
genres = movies.groupby('Genre')
genres.first().head(5)

Genre,Series_Title,Released_Year,Runtime,IMDB_Rating,Director,Star1,No_of_Votes,Gross,Metascore
Action,The Dark Knight,2008,152,9.0,Christopher Nolan,Christian Bale,2303232,534858444.0,84.0
Adventure,Interstellar,2014,169,8.6,Christopher Nolan,Matthew McConaughey,1512360,188020017.0,74.0
Animation,Sen to Chihiro no kamikakushi,2001,125,8.6,Hayao Miyazaki,Daveigh Chase,651376,10055859.0,96.0
Biography,Schindler's List,1993,195,8.9,Steven Spielberg,Liam Neeson,1213505,96898818.0,94.0
Comedy,Gisaengchung,2019,132,8.6,Bong Joon Ho,Kang-ho Song,552778,53367844.0,96.0


In [79]:
genres.nth(7).head(5) # nth() is zero-based, so this returns the 8th movie from each group that has at least 8 movies

,Series_Title,Released_Year,Runtime,Genre,IMDB_Rating,Director,Star1,No_of_Votes,Gross,Metascore
28,The Silence of the Lambs,1991,118,Crime,8.6,Jonathan Demme,Jodie Foster,1270197,130742922.0,85.0
29,Star Wars,1977,121,Action,8.6,George Lucas,Mark Hamill,1231473,322740140.0,90.0
34,Whiplash,2014,106,Drama,8.5,Damien Chazelle,Miles Teller,717585,13092000.0,88.0
70,Mononoke-hime,1997,134,Animation,8.4,Hayao Miyazaki,Yôji Matsuda,343171,2375308.0,76.0
95,Amélie,2001,122,Comedy,8.3,Jean-Pierre Jeunet,Audrey Tautou,703810,33225499.0,69.0


## get_group()
* get_group() returns all the rows that belong to one particular group.
* The result is an ordinary DataFrame, not a groupby object.
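
On made-up data (the toy frame is an assumption for illustration), get_group() should select the same rows as a plain boolean filter on the key column:

```python
import pandas as pd

# Hypothetical toy data with two Horror rows.
toy = pd.DataFrame({
    "genre": ["Horror", "Drama", "Horror", "Comedy"],
    "title": ["h1", "d1", "h2", "c1"],
})
grouped = toy.groupby("genre")

# Rows of the Horror group, returned as a DataFrame.
horror = grouped.get_group("Horror")

# The same rows selected with a boolean mask.
filtered = toy[toy["genre"] == "Horror"]
```

get_group() is convenient when the groupby object already exists; for a one-off selection the boolean filter is usually simpler.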

In [84]:
genres.get_group('Horror').head(5)

,Series_Title,Released_Year,Runtime,Genre,IMDB_Rating,Director,Star1,No_of_Votes,Gross,Metascore
49,Psycho,1960,109,Horror,8.5,Alfred Hitchcock,Anthony Perkins,604211,32000000.0,97.0
75,Alien,1979,117,Horror,8.4,Ridley Scott,Sigourney Weaver,787806,78900000.0,89.0
271,The Thing,1982,109,Horror,8.1,John Carpenter,Kurt Russell,371271,13782838.0,57.0
419,The Exorcist,1973,122,Horror,8.0,William Friedkin,Ellen Burstyn,362393,232906145.0,81.0
544,Night of the Living Dead,1968,96,Horror,7.9,George A. Romero,Duane Jones,116557,89029.0,89.0


## sample()
* Randomly selects rows from each group (one row per group by default).
* Instead of sampling from the whole DataFrame, it samples inside each group separately.
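
A minimal sketch on made-up data (the toy frame is an assumption) of drawing more than one row per group with the n parameter, and of making the draw repeatable with random_state:

```python
import pandas as pd

# Hypothetical toy data: four Action rows and three Drama rows.
toy = pd.DataFrame({
    "genre": ["Action"] * 4 + ["Drama"] * 3,
    "title": ["a1", "a2", "a3", "a4", "d1", "d2", "d3"],
})
grouped = toy.groupby("genre")

# Two random rows from every group; the seed makes the draw repeatable.
picked = grouped.sample(n=2, random_state=0)
picked_again = grouped.sample(n=2, random_state=0)

# Count how many sampled rows came from each genre.
per_genre = toy.loc[picked.index, "genre"].value_counts().to_dict()
```

Without replacement (the default), n must not exceed the size of the smallest group.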

In [97]:
genres.sample().head(4)

,Series_Title,Released_Year,Runtime,Genre,IMDB_Rating,Director,Star1,No_of_Votes,Gross,Metascore
262,Jurassic Park,1993,127,Action,8.1,Steven Spielberg,Sam Neill,867615,402453882.0,68.0
638,Diarios de motocicleta,2004,126,Adventure,7.8,Walter Salles,Gael García Bernal,96703,16756372.0,75.0
405,Akira,1988,124,Animation,8.0,Katsuhiro Ôtomo,Mitsuo Iwata,164918,553171.0,
159,A Beautiful Mind,2001,135,Biography,8.2,Ron Howard,Russell Crowe,848920,170742341.0,72.0


## .groups
The groups attribute of a GroupBy object returns a dictionary mapping each group label to the index labels of the rows in that group.
* It is an attribute, not a method, so it is accessed without parentheses.
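
On a small, made-up frame (an illustrative assumption) the mapping is short enough to inspect directly:

```python
import pandas as pd

# Hypothetical toy data with a default 0..3 integer index.
toy = pd.DataFrame({"genre": ["Action", "Drama", "Action", "Comedy"]})

# Dictionary-like mapping: group label -> index labels of its rows.
mapping = toy.groupby("genre").groups

labels = sorted(mapping.keys())        # the group labels
action_rows = list(mapping["Action"])  # index labels of the Action rows
```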

In [98]:
# genres.groups