# PANDAS MINI PROJECT

You have data on the 100 top-rated movies from the past decade, along with information about each movie, its actors, and the voters who rated these movies online. In this assignment, you will look for interesting insights into these movies and their voters using Python.

## Task 1: Reading the data

### Subtask 1.1: Read the Movies Data.

Read the movies data file provided and store it in a dataframe named `movies`.

In [2]:
import pandas as pd
import numpy as np
from sklearn import linear_model
import matplotlib.pyplot as plt

#1.1. Read the “Movies” Data.
movies=pd.read_csv('C:/Users/pc/Desktop/Data Science with ML/pandas/Movie+Assignment+Data.csv')
movies.head(5)

Unnamed: 0,Title,title_year,budget,Gross,actor_1_name,actor_2_name,actor_3_name,actor_1_facebook_likes,actor_2_facebook_likes,actor_3_facebook_likes,...,Votes3044M,Votes3044F,Votes45A,Votes45AM,Votes45AF,Votes1000,VotesUS,VotesnUS,content_rating,Country
0,La La Land,2016,30000000,151101803,Ryan Gosling,Emma Stone,Amiée Conn,14000,19000,11000,...,7.9,7.8,7.6,7.6,7.5,7.1,8.3,8.1,PG-13,USA
1,Zootopia,2016,150000000,341268248,Ginnifer Goodwin,Jason Bateman,Idris Elba,2800,28000,27000,...,7.8,8.1,7.8,7.8,8.1,7.6,8.0,8.0,PG,USA
2,Lion,2016,12000000,51738905,Dev Patel,Nicole Kidman,Rooney Mara,33000,96000,9800,...,7.9,8.2,8.0,7.9,8.4,7.1,8.1,8.0,PG-13,Australia
3,Arrival,2016,47000000,100546139,Amy Adams,Jeremy Renner,Forest Whitaker,35000,5300,13433,...,7.8,7.8,7.6,7.6,7.7,7.3,8.0,7.9,PG-13,USA
4,Manchester by the Sea,2016,9000000,47695371,Casey Affleck,Michelle Williams,Kyle Chandler,518,71000,3300,...,7.7,7.7,7.6,7.6,7.6,7.1,7.9,7.8,R,USA


### Subtask 1.2: Inspect the Dataframe
Inspect the dataframe for dimensions, null-values, and summary of different numeric columns.

In [3]:
#1.2. Inspect the Dataframe for dimensions, null-values, and summary of different numeric columns.
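print(" Row and Columns are" ,movies.shape)
print(" NULL values are :",movies.isnull().sum())
# Call describe() to get the numeric summary; printing movies.describe without parentheses only shows the bound method
print(movies.describe())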

 Row and Columns are (100, 62)
 NULL values are : Title             0
title_year        0
budget            0
Gross             0
actor_1_name      0
                 ..
Votes1000         0
VotesUS           0
VotesnUS          0
content_rating    0
Country           0
Length: 62, dtype: int64
<bound method NDFrame.describe of                                            Title  title_year     budget  \
0                                     La La Land        2016   30000000   
1                                       Zootopia        2016  150000000   
2                                           Lion        2016   12000000   
3                                        Arrival        2016   47000000   
4                          Manchester by the Sea        2016    9000000   
..                                           ...         ...        ...   
95                                      Whiplash        2014    3300000   
96                               Before Midnight        2013    3000000
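
The three inspection steps can be sketched on a small made-up frame. The `sample` data below is hypothetical and is not the assignment's CSV; it only illustrates what `shape`, `isnull().sum()`, and `describe()` return.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the movies data
sample = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "budget": [10.0, 20.0, None],
    "Gross": [15.0, 5.0, 40.0],
})

n_rows, n_cols = sample.shape          # dimensions as (rows, columns)
null_counts = sample.isnull().sum()    # number of missing values per column
summary = sample.describe()            # count, mean, std, min, quartiles, max of numeric columns

print(n_rows, n_cols)
print(null_counts)
print(summary)
```

Note that `describe` has to be called with parentheses; without them Python prints the method object instead of the summary table.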

## Task 2: Data Analysis

Now that we have loaded the dataset and inspected it, we see that most of the data is in place. As of now, no data cleaning is required, so let's start with some data manipulation, analysis, and visualisation to get various insights about the data.

### Subtask 2.1: Reduce those Digits!
The numbers in the `budget` and `Gross` columns are too large to read comfortably. Let's first convert the unit of these two columns from `$` to `million $`.

In [4]:
#2.1. Reduce the digits shown for "budget" and "Gross" for readability (See notebook for details)
# Only the display precision changes here; the stored values remain in dollars
pd.set_option('display.precision', 2)
movies.head(4)


Unnamed: 0,Title,title_year,budget,Gross,actor_1_name,actor_2_name,actor_3_name,actor_1_facebook_likes,actor_2_facebook_likes,actor_3_facebook_likes,...,Votes3044M,Votes3044F,Votes45A,Votes45AM,Votes45AF,Votes1000,VotesUS,VotesnUS,content_rating,Country
0,La La Land,2016,30000000,151101803,Ryan Gosling,Emma Stone,Amiée Conn,14000,19000,11000,...,7.9,7.8,7.6,7.6,7.5,7.1,8.3,8.1,PG-13,USA
1,Zootopia,2016,150000000,341268248,Ginnifer Goodwin,Jason Bateman,Idris Elba,2800,28000,27000,...,7.8,8.1,7.8,7.8,8.1,7.6,8.0,8.0,PG,USA
2,Lion,2016,12000000,51738905,Dev Patel,Nicole Kidman,Rooney Mara,33000,96000,9800,...,7.9,8.2,8.0,7.9,8.4,7.1,8.1,8.0,PG-13,Australia
3,Arrival,2016,47000000,100546139,Amy Adams,Jeremy Renner,Forest Whitaker,35000,5300,13433,...,7.8,7.8,7.6,7.6,7.7,7.3,8.0,7.9,PG-13,USA
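
A minimal sketch of the unit conversion itself, assuming a plain division by one million is what the subtask intends. It works on a copy of a two-row sample (values taken from the first rows shown above), because the later cells in this notebook keep working with dollar values. The names `sample` and `in_millions` are illustrative only.

```python
import pandas as pd

# Two rows copied from the head of the movies table shown above
sample = pd.DataFrame({
    "Title": ["La La Land", "Zootopia"],
    "budget": [30000000, 150000000],
    "Gross": [151101803, 341268248],
})

# Work on a copy so the original dollar values stay untouched
in_millions = sample.copy()
in_millions[["budget", "Gross"]] = in_millions[["budget", "Gross"]] / 1_000_000

print(in_millions.round(2))
```

If the conversion were applied to `movies` directly, the `profit` values computed later would also be in millions.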


### Subtask 2.2: Let's Talk Profit!

1. Create a new column called `profit` which contains the difference of the two columns: `Gross` and `budget`
2. Sort the dataframe using the `profit` column as reference.
3. Extract the top ten profiting movies in descending order and store them in a new dataframe named `top10`
4. Record your observations
5. Extract the movies with a `negative profit` and store them in a new dataframe named `neg_profit`

In [5]:
#Create a new column called profit which contains the difference of “gross” and “budget”
profit1=movies['Gross']-movies['budget']
movies['profit']=profit1
movies.head(3)

Unnamed: 0,Title,title_year,budget,Gross,actor_1_name,actor_2_name,actor_3_name,actor_1_facebook_likes,actor_2_facebook_likes,actor_3_facebook_likes,...,Votes3044F,Votes45A,Votes45AM,Votes45AF,Votes1000,VotesUS,VotesnUS,content_rating,Country,profit
0,La La Land,2016,30000000,151101803,Ryan Gosling,Emma Stone,Amiée Conn,14000,19000,11000,...,7.8,7.6,7.6,7.5,7.1,8.3,8.1,PG-13,USA,121101803
1,Zootopia,2016,150000000,341268248,Ginnifer Goodwin,Jason Bateman,Idris Elba,2800,28000,27000,...,8.1,7.8,7.8,8.1,7.6,8.0,8.0,PG,USA,191268248
2,Lion,2016,12000000,51738905,Dev Patel,Nicole Kidman,Rooney Mara,33000,96000,9800,...,8.2,8.0,7.9,8.4,7.1,8.1,8.0,PG-13,Australia,39738905


In [6]:
#Sort the data frame using the profit column as a reference, highest profit first.
sorted_movies=movies.sort_values(by='profit', ascending=False)
#Extract the top ten profiting movies in descending order and store them in a new dataframe named "top10"
top10=sorted_movies.head(10)
print(top10)

                                         Title  title_year     budget  \
97  Star Wars: Episode VII - The Force Awakens        2015  245000000   
11                                The Avengers        2012  220000000   
47                                    Deadpool        2016   58000000   
32             The Hunger Games: Catching Fire        2013  130000000   
12                                 Toy Story 3        2010  200000000   
8                        The Dark Knight Rises        2012  250000000   
45                              The Lego Movie        2014   60000000   
1                                     Zootopia        2016  150000000   
41                               Despicable Me        2010   69000000   
18                                  Inside Out        2015  175000000   

        Gross       actor_1_name       actor_2_name           actor_3_name  \
97  936662225        Doug Walker         Rob Walker                      0   
11  623279547    Chris Hemsworth  Robert

In [7]:
#Extract the movies with a negative profit and store them in a new data frame named "neg_profit"
neg_profit=movies[movies['profit']<0]
neg_profit

Unnamed: 0,Title,title_year,budget,Gross,actor_1_name,actor_2_name,actor_3_name,actor_1_facebook_likes,actor_2_facebook_likes,actor_3_facebook_likes,...,Votes3044F,Votes45A,Votes45AM,Votes45AF,Votes1000,VotesUS,VotesnUS,content_rating,Country,profit
7,Tangled,2010,260000000,200807262,Brad Garrett,Donna Murphy,M.C. Gainey,799,553,284,...,8.0,7.7,7.6,7.9,6.9,7.9,7.7,PG,USA,-59192738
17,Edge of Tomorrow,2014,178000000,100189501,Tom Cruise,Lara Pulver,Noah Taylor,10000,854,509,...,7.7,7.8,7.8,7.8,7.5,8.0,7.8,PG-13,USA,-77810499
22,Hugo,2011,170000000,73820094,Chloë Grace Moretz,Christopher Lee,Ray Winstone,17000,16000,1000,...,7.4,7.5,7.5,7.6,7.4,7.7,7.5,PG,USA,-96179906
28,X-Men: First Class,2011,160000000,146405371,Jennifer Lawrence,Michael Fassbender,Oliver Platt,34000,13000,1000,...,7.8,7.6,7.5,7.7,7.3,7.8,7.7,PG-13,USA,-13594629
39,The Little Prince,2015,81200000,1339152,Jeff Bridges,James Franco,Mackenzie Foy,12000,11000,6000,...,7.9,7.5,7.4,7.9,6.6,7.7,7.7,PG,France,-79860848
46,Scott Pilgrim vs. the World,2010,60000000,31494270,Anna Kendrick,Kieran Culkin,Ellen Wong,10000,1000,719,...,7.2,7.1,7.1,7.0,6.6,7.8,7.4,PG-13,USA,-28505730
56,Rush,2013,38000000,26903709,Chris Hemsworth,Olivia Wilde,Alexandra Maria Lara,26000,10000,471,...,7.9,7.8,7.8,7.8,7.1,7.9,8.1,R,UK,-11096291
66,Warrior,2011,25000000,13651662,Tom Hardy,Frank Grillo,Kevin Dunn,27000,798,581,...,8.0,7.7,7.7,7.5,7.1,8.2,8.1,PG-13,USA,-11348338
82,Flipped,2010,14000000,1752214,Madeline Carroll,Rebecca De Mornay,Aidan Quinn,1000,872,767,...,7.7,7.4,7.3,7.6,6.4,7.5,7.7,PG,USA,-12247786
89,Amour,2012,8900000,225377,Isabelle Huppert,Emmanuelle Riva,Jean-Louis Trintignant,678,432,319,...,7.9,7.9,7.8,8.1,7.2,7.9,7.8,PG-13,France,-8674623
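
As an alternative to sorting and then taking `head`, pandas also offers `nlargest`. The sketch below uses a made-up four-row frame (`sample`, `top2` are illustrative names) to show the profit column, the top-n selection, and the negative-profit filter together.

```python
import pandas as pd

# Hypothetical budgets and grosses, in arbitrary units
sample = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "budget": [10, 50, 30, 5],
    "Gross": [40, 20, 90, 6],
})
sample["profit"] = sample["Gross"] - sample["budget"]

top2 = sample.nlargest(2, "profit")        # rows with the highest profit, largest first
neg_profit = sample[sample["profit"] < 0]  # loss-making titles only

print(top2[["Title", "profit"]])
print(neg_profit[["Title", "profit"]])
```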


In [8]:
#2.3. You might have noticed the column “MetaCritic” in this dataset. Second, you also have
#another column “IMDb_rating” which tells you the IMDb rating of a movie. Your task is to find
#out the highest rated movies which have been liked by critics and audiences alike.

#1. Firstly you will notice that the MetaCritic score is on a scale of 100 whereas the
#IMDb_rating is on a scale of 10. First convert the MetaCritic column to a scale of 10.
movies['MetaCritic'] = movies['MetaCritic'] / 10
movies.iloc[0:3,8:17]

Unnamed: 0,actor_2_facebook_likes,actor_3_facebook_likes,IMDb_rating,genre_1,genre_2,genre_3,MetaCritic,Runtime,CVotes10
0,19000,11000,8.2,Comedy,Drama,Music,9.3,128,74245
1,28000,27000,8.1,Animation,Adventure,Comedy,7.8,108,53626
2,96000,9800,8.1,Biography,Drama,,6.9,118,23325


### Subtask 2.3: The General Audience and the Critics


You might have noticed the column `MetaCritic` in this dataset. Metacritic is a popular website where an average score is determined from the scores given by top-rated critics. You also have another column, `IMDb_rating`, which gives the IMDb rating of a movie. This rating is the average of hundreds of thousands of ratings from the general audience.

As a part of this subtask, you are required to find out the highest rated movies which have been liked by critics and audiences alike.

1. First, notice that the `MetaCritic` score is on a scale of `100`, whereas the `IMDb_rating` is on a scale of `10`. Convert the `MetaCritic` column to a scale of 10.
2. Now, to find the movies that have been liked by critics and audiences alike and also have a high overall rating, you need to:

    - Create a new column `Avg_rating` which holds the average of the `MetaCritic` and `IMDb_rating` columns.
    - Retain only the movies in which the absolute difference (using the abs() function) between the `IMDb_rating` and `MetaCritic` columns is less than 0.5. Refer to this link to see how the abs() function works: https://www.geeksforgeeks.org/abs-in-python/
    - Sort these values in descending order of `Avg_rating`, retain only the movies with a rating equal to or greater than 8, and store these movies in a new dataframe `UniversalAcclaim`.

In [9]:
#2. Now, you have to find out the movies which have been liked by both critics and
#audiences alike and also have a high rating overall. (See notebook for details)

# Flag movies rated above 8 on MetaCritic (10-point scale) and above 7.7 on IMDb
liked_Movie = (movies['MetaCritic'] > 8) & (movies['IMDb_rating'] > 7.7)

# Assign the boolean mask to a new column in the DataFrame
movies['liked_Movie'] = liked_Movie

# Sort the DataFrame so that flagged movies (True) come first
liked_movies_sorted = movies.sort_values(by='liked_Movie', ascending=False)

# Show the first row after sorting
liked_movies_sorted.iloc[0]

Title               La La Land
title_year                2016
budget                30000000
Gross                151101803
actor_1_name      Ryan Gosling
                      ...     
VotesnUS                   8.1
content_rating           PG-13
Country                    USA
profit               121101803
liked_Movie               True
Name: 0, Length: 64, dtype: object
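
The steps listed for this subtask (average rating, absolute difference below 0.5, average of at least 8) can be sketched as follows. The ratings in `sample` are made up and both columns are assumed to be on a 10-point scale already; this is an illustration of the steps, not a result from the assignment data.

```python
import pandas as pd

# Hypothetical ratings, both on a 10-point scale
sample = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "IMDb_rating": [8.2, 8.1, 7.9, 8.4],
    "MetaCritic": [8.4, 6.9, 7.8, 8.1],
})

# Average of the critic and audience scores
sample["Avg_rating"] = (sample["IMDb_rating"] + sample["MetaCritic"]) / 2

# Keep movies where critics and audience differ by less than 0.5 points
close = sample[(sample["IMDb_rating"] - sample["MetaCritic"]).abs() < 0.5]

# Highest average first, then keep only averages of 8 or more
ranked = close.sort_values("Avg_rating", ascending=False)
UniversalAcclaim = ranked[ranked["Avg_rating"] >= 8]

print(UniversalAcclaim)
```

Here movie B is dropped because critics and audience disagree by more than 0.5, and movie C is dropped because its average is below 8.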

### Subtask 2.4: Find the Most Popular Trios - I
You're a producer looking to make a blockbuster movie. There will primarily be three lead roles in your movie and you wish to cast the most popular actors for them. Since you don't want to take a risk, you will cast a trio that has already acted together in a movie. The metric you've chosen to measure popularity is the number of Facebook likes of each of these actors.

The dataframe has three columns to help you with this, viz. `actor_1_facebook_likes`, `actor_2_facebook_likes`, and `actor_3_facebook_likes`. Your objective is to find the trios with the highest combined number of Facebook likes. That is, the sum of `actor_1_facebook_likes`, `actor_2_facebook_likes` and `actor_3_facebook_likes` should be `maximum`. Find the `top 5` popular trios, and output their `names` in a list.

In [10]:
# Group by actor names and sum their Facebook likes
popular_trio = movies.groupby(['actor_1_name', 'actor_2_name', 'actor_3_name']).agg({
    'actor_1_facebook_likes': 'sum',
    'actor_2_facebook_likes': 'sum',
    'actor_3_facebook_likes': 'sum'
})

# Calculate the total likes for each trio
popular_trio['total_likes'] = popular_trio.sum(axis=1)

# Sort the trios by total likes in descending order and get the top 5
top_5_trios = popular_trio['total_likes'].sort_values(ascending=False).head(5)

# Get the names of the top 5 popular trios as a list of (actor_1, actor_2, actor_3) tuples
top_5_trios_names = top_5_trios.index.tolist()

print(top_5_trios)

actor_1_name          actor_2_name        actor_3_name        
Dev Patel             Nicole Kidman       Rooney Mara             138800
Benedict Cumberbatch  Chiwetel Ejiofor    Rachel McAdams           86600
Leonardo DiCaprio     Tom Hardy           Joseph Gordon-Levitt     79000
Jennifer Lawrence     Peter Dinklage      Hugh Jackman             76000
Casey Affleck         Michelle Williams   Kyle Chandler            74818
Name: total_likes, dtype: int64
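
Since the subtask asks for the names in a list, here is a minimal sketch that sums the likes row by row and returns the names as nested lists. It uses only the first three rows of the table shown in Task 1, so the ranking is illustrative; `like_cols`, `name_cols`, and `top_trios` are names introduced for this example.

```python
import pandas as pd

like_cols = ["actor_1_facebook_likes", "actor_2_facebook_likes", "actor_3_facebook_likes"]
name_cols = ["actor_1_name", "actor_2_name", "actor_3_name"]

# First three rows of the movies table shown earlier
sample = pd.DataFrame({
    "actor_1_name": ["Ryan Gosling", "Ginnifer Goodwin", "Dev Patel"],
    "actor_2_name": ["Emma Stone", "Jason Bateman", "Nicole Kidman"],
    "actor_3_name": ["Amiée Conn", "Idris Elba", "Rooney Mara"],
    "actor_1_facebook_likes": [14000, 2800, 33000],
    "actor_2_facebook_likes": [19000, 28000, 96000],
    "actor_3_facebook_likes": [11000, 27000, 9800],
})

# Combined likes of the three actors in each movie
sample["total_likes"] = sample[like_cols].sum(axis=1)

# Two most-liked trios in this sample, returned as lists of names
top_trios = sample.nlargest(2, "total_likes")[name_cols].values.tolist()
print(top_trios)
```

This row-wise sum assumes each trio appears in only one movie; the `groupby` used above additionally adds up likes when the same trio appears in several rows.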


### Subtask 2.5: Find the Most Popular Trios - II
In the previous subtask you found the most popular trios based on the total number of Facebook likes. Let's add a small condition to make sure that all three actors are popular: none of the three actors' Facebook likes should be less than half of either of the other two. For example, the following is a valid combo:

actor_1_facebook_likes: 70000
actor_2_facebook_likes: 40000
actor_3_facebook_likes: 50000
But the below one is not:

actor_1_facebook_likes: 70000
actor_2_facebook_likes: 40000
actor_3_facebook_likes: 30000
since in this case, actor_3_facebook_likes is 30000, which is less than half of actor_1_facebook_likes.

Having this condition ensures that you aren't getting any unpopular actor in your trio, since the total likes calculated in the previous question say nothing about the individual popularity of each actor in the trio.

You can manually inspect the top 5 popular trios you found in the previous subtask and check how many of them satisfy this condition. Also, which is the most popular trio after applying the condition above? Write your answers in the markdown cell provided below.
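
The condition can be checked with a small helper. Comparing the smallest count with half of the largest is enough: if the least-liked actor has at least half the likes of the most-liked one, every other pairwise comparison holds as well. The function name `is_balanced_trio` is made up for this sketch; the like counts are the two examples above plus two trios whose counts appear in the Task 1 table.

```python
def is_balanced_trio(likes):
    """Return True when no actor has fewer than half the likes of any other actor."""
    return min(likes) >= max(likes) / 2

valid_example = is_balanced_trio([70000, 40000, 50000])    # 40000 >= 35000
invalid_example = is_balanced_trio([70000, 40000, 30000])  # 30000 < 35000

# Like counts from the Lion and Manchester by the Sea rows shown in Task 1
lion_trio = is_balanced_trio([33000, 96000, 9800])
manchester_trio = is_balanced_trio([518, 71000, 3300])

print(valid_example, invalid_example, lion_trio, manchester_trio)
```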

In [11]:
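def popular_trio_condition(row):
    # Sort the three like counts so that likes[0] is the smallest and likes[2] the largest
    likes = sorted([row['actor_1_facebook_likes'], row['actor_2_facebook_likes'], row['actor_3_facebook_likes']])
    # The smallest count must be at least half of each of the other two
    return likes[0] >= likes[1] / 2 and likes[0] >= likes[2] / 2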

# Apply the condition to the DataFrame
movies['is_popular_trio'] = movies.apply(popular_trio_condition, axis=1)

# Filter the DataFrame for popular trios
popular_trios = movies[movies['is_popular_trio']]

# Group by actor names and sum their Facebook likes
popular_trio_likes = popular_trios.groupby(['actor_1_name', 'actor_2_name', 'actor_3_name']).agg({
    'actor_1_facebook_likes': 'sum',
    'actor_2_facebook_likes': 'sum',
    'actor_3_facebook_likes': 'sum'
})

# Calculate the total likes for each popular trio
popular_trio_likes['total_likes'] = popular_trio_likes.sum(axis=1)

# Sort the popular trios by total likes in descending order
sorted_popular_trios = popular_trio_likes['total_likes'].sort_values(ascending=False)

# Get the names of the top 5 popular trios after applying the condition
top_5_popular_trios_after_condition = sorted_popular_trios.head(5).index.tolist()

# Find the most popular trio after applying the condition
most_popular_trio_after_condition = sorted_popular_trios.index[0]

# Output the results
print("Top 5 popular trios after applying the condition:", top_5_popular_trios_after_condition)
print("Most popular trio after applying the condition:", most_popular_trio_after_condition)

Top 5 popular trios after applying the condition: [('Leonardo DiCaprio', 'Tom Hardy', 'Joseph Gordon-Levitt'), ('Jennifer Lawrence', 'Peter Dinklage', 'Hugh Jackman'), ('Tom Hardy', 'Christian Bale', 'Joseph Gordon-Levitt'), ('Chris Hemsworth', 'Robert Downey Jr.', 'Scarlett Johansson'), ('Robert Downey Jr.', 'Scarlett Johansson', 'Chris Evans')]
Most popular trio after applying the condition: ('Leonardo DiCaprio', 'Tom Hardy', 'Joseph Gordon-Levitt')


### Subtask 2.6: R-Rated Movies

Although R-rated movies are restricted for viewers under 18, there are still vote counts from that age group. Among all the R-rated movies that have been voted on by the under-18 age group, find the top 10 movies with the highest number of votes, i.e. `CVotesU18`, from the `movies` dataframe. Store these in a dataframe named `PopularR`.

In [15]:
r_movies = movies[movies['content_rating'] == "R"]
# Sort the R-rated movies by 'CVotesU18' in descending order and get the top 10
PopularR = r_movies.sort_values(by='CVotesU18', ascending=False).head(10)
print(PopularR)

                                              Title  title_year     budget  \
47                                         Deadpool        2016   58000000   
36                          The Wolf of Wall Street        2013  100000000   
35                                 Django Unchained        2012  100000000   
29                               Mad Max: Fury Road        2015  150000000   
95                                         Whiplash        2014    3300000   
31                                     The Revenant        2015  135000000   
40                                   Shutter Island        2010   80000000   
43                                        Gone Girl        2014   61000000   
65                         The Grand Budapest Hotel        2014   25000000   
72  Birdman or (The Unexpected Virtue of Ignorance)        2014   18000000   

        Gross       actor_1_name         actor_2_name       actor_3_name  \
47  363024263      Ryan Reynolds            Ed Skrein     Stefan 

## Task 3: Demographic analysis

If you look at the last columns in the dataframe, most of them relate to the demographics of the voters (in the last subtask, i.e., 2.6, you made use of one of these columns: `CVotesU18`). We also have three genre columns indicating the genres of a particular movie. We will use these columns extensively in the third and final stage of the assignment, where we analyse the voters across all demographics and see how they vary across genres. So without further ado, let's get started with demographic analysis.

### Subtask 3.1: Dataframe & Genres

There are 3 columns in the dataframe - `genre_1`, `genre_2`, and `genre_3`. As a part of this subtask, you need to aggregate a few values over these 3 columns.

1. First create a new dataframe `df_by_genre` that contains `genre_1`, `genre_2`, and `genre_3` and all the columns related to `CVotes/Votes` from the `movies` dataframe. There are `47` columns to be extracted in total.
2. Now, add a column called `cnt` to the dataframe `df_by_genre` and initialize it to `one`. You will realise the use of this column by the end of this subtask.
3. Group the dataframe `df_by_genre` by `genre_1`, find the sum of all the numeric columns (`cnt` and the `CVotes` and `Votes` columns), and store it in a dataframe `df_by_g1`.
4. Perform the same operation for `genre_2` and `genre_3` and store the results in dataframes `df_by_g2` and `df_by_g3` respectively.
5. Now that you have 3 dataframes obtained by grouping over `genre_1`, `genre_2`, and `genre_3` separately, it's time to combine them. For this, add the three dataframes and store the result in a new dataframe `df_add`, so that the corresponding values of `Votes/CVotes` get added for each genre. There is a function called `add()` in pandas which lets you do this. You can refer to this link to see how this function works: https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.DataFrame.add.html
6. After aggregation, the column `cnt` keeps track of the number of occurrences of each genre. Subset the genres that have at least 10 movies into a new dataframe `genre_top10` based on the `cnt` column value.
7. Now, take the mean of all the numeric columns by dividing them by the column value `cnt` and store the result back in the same dataframe. We will be using this dataframe for further analysis in this task unless it is explicitly mentioned to use the dataframe `movies`.
8. Since the number of votes can't be a fraction, cast all the `CVotes` related columns to integers. Also, round all the `Votes` related columns to two digits after the decimal point.
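
The eight steps can be sketched end to end on a tiny made-up frame. Everything here is an assumption for illustration: the data in `toy`, the use of only two genre columns, and a threshold of 2 movies instead of 10. Passing `numeric_only=True` to `sum` is one way to keep the string genre columns out of the group-wise totals.

```python
import pandas as pd

# Hypothetical mini data set: two genre columns, one vote count, one mean rating
toy = pd.DataFrame({
    "genre_1": ["Action", "Drama", "Action", "Drama"],
    "genre_2": ["Drama", None, "Sci-Fi", "Action"],
    "CVotes10": [100, 50, 300, 150],
    "Votes1829": [8.0, 7.5, 8.5, 7.0],
})
toy["cnt"] = 1  # one row is one movie, so summing cnt counts movies per genre

# Sum the numeric columns per genre; rows with a missing genre are skipped by groupby
by_g1 = toy.groupby("genre_1").sum(numeric_only=True)
by_g2 = toy.groupby("genre_2").sum(numeric_only=True)

# Align on the genre index; fill_value=0 keeps genres that are missing from one frame
df_add = by_g1.add(by_g2, fill_value=0)

# Keep genres with at least 2 movies here (the assignment uses 10)
frequent = df_add[df_add["cnt"] >= 2].copy()

# Turn sums into per-movie means by dividing every column except cnt by cnt
value_cols = frequent.columns.drop("cnt")
frequent[value_cols] = frequent[value_cols].div(frequent["cnt"], axis=0)

# Vote counts become integers, mean ratings are rounded to two decimals
frequent["CVotes10"] = frequent["CVotes10"].astype(int)
frequent["Votes1829"] = frequent["Votes1829"].round(2)

print(frequent)
```

In this toy data Sci-Fi occurs only once and is filtered out, while Action and Drama each occur three times across the two genre columns.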

In [17]:
import pandas as pd

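# Select the three genre columns plus the CVotes*/Votes* columns.
# Note: this list holds 45 columns; adding 'VotesUS' and 'VotesnUS' would give the 47 the task mentions.
df_by_genre = movies[['genre_1', 'genre_2', 'genre_3', 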
                      'CVotes10', 'CVotes09', 'CVotes08', 'CVotes07', 'CVotes06', 'CVotes05', 'CVotes04', 'CVotes03', 'CVotes02', 'CVotes01', 
                      'CVotesMale', 'CVotesFemale', 'CVotesU18', 'CVotesU18M', 'CVotesU18F', 
                      'CVotes1829', 'CVotes1829M', 'CVotes1829F', 
                      'CVotes3044', 'CVotes3044M', 'CVotes3044F', 
                      'CVotes45A', 'CVotes45AM', 'CVotes45AF', 
                      'CVotes1000', 'CVotesUS', 'CVotesnUS', 'VotesM', 'VotesF', 'VotesU18', 'VotesU18M', 'VotesU18F', 
                      'Votes1829', 'Votes1829M', 'Votes1829F', 
                      'Votes3044', 'Votes3044M', 'Votes3044F', 
                      'Votes45A', 'Votes45AM', 'Votes45AF', 'Votes1000']]

print(df_by_genre.head())

     genre_1    genre_2 genre_3  CVotes10  CVotes09  CVotes08  CVotes07  \
0     Comedy      Drama   Music     74245     71191     64640     38831   
1  Animation  Adventure  Comedy     53626     70912    102352     57261   
2  Biography      Drama     NaN     23325     29830     40564     20296   
3      Drama    Mystery  Sci-Fi     55533     87850    109536     65440   
4      Drama        NaN     NaN     18191     33532     46596     29626   

   CVotes06  CVotes05  CVotes04  ...  Votes1829  Votes1829M  Votes1829F  \
0     17377      8044      3998  ...        8.4         8.4         8.2   
1     16719      4539      1467  ...        8.2         8.1         8.4   
2      5842      1669       558  ...        8.1         8.0         8.4   
3     26913     10556      5057  ...        8.2         8.2         8.1   
4     11879      4539      1976  ...        8.0         8.1         7.8   

   Votes3044  Votes3044M  Votes3044F  Votes45A  Votes45AM  Votes45AF  \
0        7.9         7.9  

In [18]:
# Work on an explicit copy so that adding a column does not trigger SettingWithCopyWarning
df_by_genre = df_by_genre.copy()

# Add a column cnt and initialize it to 1
df_by_genre['cnt'] = 1

# Display the updated dataframe
print(df_by_genre.head())

     genre_1    genre_2 genre_3  CVotes10  CVotes09  CVotes08  CVotes07  \
0     Comedy      Drama   Music     74245     71191     64640     38831   
1  Animation  Adventure  Comedy     53626     70912    102352     57261   
2  Biography      Drama     NaN     23325     29830     40564     20296   
3      Drama    Mystery  Sci-Fi     55533     87850    109536     65440   
4      Drama        NaN     NaN     18191     33532     46596     29626   

   CVotes06  CVotes05  CVotes04  ...  Votes1829M  Votes1829F  Votes3044  \
0     17377      8044      3998  ...         8.4         8.2        7.9   
1     16719      4539      1467  ...         8.1         8.4        7.8   
2      5842      1669       558  ...         8.0         8.4        8.0   
3     26913     10556      5057  ...         8.2         8.1        7.8   
4     11879      4539      1976  ...         8.1         7.8        7.7   

   Votes3044M  Votes3044F  Votes45A  Votes45AM  Votes45AF  Votes1000  cnt  
0         7.9         


In [19]:
# Group the dataframe df_by_genre by genre_1 and find the sum
df_by_g1 = df_by_genre.groupby('genre_1').sum()

# Display the dataframe df_by_g1
print(df_by_g1.head())

                                                     genre_2  \
genre_1                                                        
Action     AdventureThrillerAdventureSci-FiAdventureAdven...   
Adventure  FantasyFantasyDramaDramaDramaDramaDramaComedyB...   
Animation  AdventureAdventureAdventureAdventureActionActi...   
Biography  DramaComedyDramaDramaDramaDramaComedyDramaDram...   
Comedy      DramaDramaDramaFantasyDramaDramaDramaDramaHorror   

                                                     genre_3  CVotes10  \
genre_1                                                                  
Action     FantasySci-FiThrillerSci-FiSci-FiSci-FiSci-FiS...   2928407   
Adventure  FamilySci-FiThrillerSci-FiWesternDramaDramaDra...   1058779   
Animation  ComedyComedyComedyComedyAdventureAdventureCome...    681562   
Biography  CrimeThrillerSportHistoryDramaHistoryRomanceTh...    666831   
Comedy       MusicRomanceRomanceRomanceRomanceFantasyRomance    371217   

           CVotes09  CVotes08  C

In [20]:
# Group the dataframe df_by_genre by genre_2 and find the sum
df_by_g2 = df_by_genre.groupby('genre_2').sum()

# Group the dataframe df_by_genre by genre_3 and find the sum
df_by_g3 = df_by_genre.groupby('genre_3').sum()

# Display the dataframes df_by_g2 and df_by_g3
print("DataFrame df_by_g2:")
print(df_by_g2.head())

print("\nDataFrame df_by_g3:")
print(df_by_g3.head())

DataFrame df_by_g2:
                                                     genre_1  \
genre_2                                                        
Action                  AnimationAnimationAnimationAnimation   
Adventure  AnimationActionAnimationActionAnimationActionA...   
Biography                  ActionActionActionActionAdventure   
Comedy     BiographyActionActionBiographyAdventureAdventu...   
Crime                                                 Action   

                                                     genre_3  CVotes10  \
genre_2                                                                  
Action                  AdventureAdventureAdventureAdventure    238060   
Adventure  ComedyFantasyComedySci-FiComedyThrillerSci-FiS...   2297820   
Biography                          DramaDramaDramaDramaDrama    185172   
Comedy                      CrimeRomanceDramaDramaDramaDrama    428995   
Crime                                                  Drama     19576   

           C

In [21]:
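# The group-wise sums concatenated the string genre columns; coerce those to NaN so only numeric values are added
df_by_g1 = df_by_g1.apply(pd.to_numeric, errors='coerce')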
df_by_g2 = df_by_g2.apply(pd.to_numeric, errors='coerce')
df_by_g3 = df_by_g3.apply(pd.to_numeric, errors='coerce')

# Add the three dataframes and store the result in a new dataframe df_add
df_add = df_by_g1.add(df_by_g2, fill_value=0).add(df_by_g3, fill_value=0)

# Display the resulting dataframe df_add
print("DataFrame df_add:")
print(df_add.head())


DataFrame df_add:
           CVotes01  CVotes02  CVotes03  CVotes04  CVotes05  CVotes06  \
Action     171247.0   65573.0   95004.0  166970.0  393484.0  1.08e+06   
Adventure  173858.0   69737.0  103318.0  183070.0  438970.0  1.21e+06   
Animation   25193.0   10026.0   15733.0   30718.0   83069.0  2.51e+05   
Biography   51297.0   20613.0   29510.0   53718.0  138648.0  4.26e+05   
Comedy      88367.0   39391.0   56218.0   97469.0  226852.0  6.00e+05   

           CVotes07  CVotes08  CVotes09  CVotes10  ...  Votes45AM  VotesF  \
Action     2.92e+06  4.68e+06  3.55e+06  3.17e+06  ...      236.4   245.6   
Adventure  3.28e+06  5.26e+06  4.01e+06  3.59e+06  ...      290.4   304.5   
Animation  7.23e+05  1.15e+06  7.98e+05  6.82e+05  ...       84.1    89.3   
Biography  1.33e+06  2.23e+06  1.40e+06  8.52e+05  ...      137.9   141.8   
Comedy     1.59e+06  2.51e+06  1.77e+06  1.38e+06  ...      174.7   181.2   

           VotesM  VotesU18  VotesU18F  VotesU18M   cnt  genre_1  genre_2  \
Act

In [22]:
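# Step 6: Keep the genres that appear in at least 10 movies (copy to avoid writing into a slice of df_add)
genre_top10 = df_add[df_add['cnt'] >= 10].copy()

# Drop the non-numeric genre columns carried over from the group-wise sums
genre_top10 = genre_top10.drop(columns=['genre_1', 'genre_2', 'genre_3'], errors='ignore')

# Step 7: Take the mean of all the numeric columns by dividing them by cnt (cnt itself is left unchanged)
value_cols = genre_top10.columns.drop('cnt')
genre_top10[value_cols] = genre_top10[value_cols].div(genre_top10['cnt'], axis=0)

# Replace any remaining NaN values with 0 before casting
genre_top10 = genre_top10.fillna(0)

# Step 8: Cast the CVotes columns to integers and round the Votes columns to two decimals
cvotes_cols = [c for c in value_cols if c.startswith('CVotes')]
votes_cols = [c for c in value_cols if c.startswith('Votes')]
genre_top10[cvotes_cols] = genre_top10[cvotes_cols].astype(int)
genre_top10[votes_cols] = genre_top10[votes_cols].round(2)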

# Display the genre_top10 dataframe
print("DataFrame genre_top10:")
print(genre_top10)


DataFrame genre_top10:
           CVotes01  CVotes02  CVotes03  CVotes04  CVotes05  CVotes06  \
Action       5524.0    2115.0    3064.0    5386.0   12693.0   34688.0   
Adventure    4575.0    1835.0    2718.0    4817.0   11551.0   31896.0   
Animation    2290.0     911.0    1430.0    2792.0    7551.0   22825.0   
Biography    2849.0    1145.0    1639.0    2984.0    7702.0   23644.0   
Comedy       3842.0    1712.0    2444.0    4237.0    9863.0   26099.0   
Crime        3383.0    1544.0    2246.0    3842.0    8971.0   25308.0   
Drama        3250.0    1449.0    2078.0    3622.0    8497.0   23528.0   
Romance      3082.0    1476.0    2130.0    3762.0    8530.0   21637.0   
Sci-Fi       6731.0    2715.0    3876.0    6583.0   14951.0   39518.0   
Thriller     4433.0    1982.0    2918.0    5021.0   11534.0   32003.0   

           CVotes07  CVotes08  CVotes09  CVotes10  ...  Votes45AM  VotesF  \
Action      94262.0  150895.0  114433.0  102144.0  ...        7.0     7.0   
Adventure   86367.0