# **`INTRODUCTION TO THE PANDAS MODULE PROJECT DARA-2024`**

In this notebook we will analyse a CSV file containing the top 1000 films in the history of film making according to the `IMDB` website.

### Open the CSV file from your projects folder or file folders

In [195]:
import pandas as pd

# read_csv already returns a DataFrame, so no extra conversion is needed
df_movies = pd.read_csv("movie_dataset.csv")

display(df_movies)

OSError: [Errno 22] Invalid argument: 'movie_dataset.csv'
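If the read fails, as in the output above, it can help to confirm where Python is actually looking for the file before calling `pd.read_csv`. The sketch below is one possible way to do that. The helper name `load_csv` and the throwaway `demo.csv` file are hypothetical and are not part of the original notebook.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_csv(path):
    """Read a CSV after confirming that the file exists (hypothetical helper)."""
    csv_path = Path(path).expanduser().resolve()
    if not csv_path.is_file():
        # Show the fully resolved path so a wrong folder is easy to spot
        raise FileNotFoundError(f"No CSV file found at {csv_path}")
    return pd.read_csv(csv_path)


# Demo with a throwaway file so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.csv"
    demo_path.write_text("Title,Rating\nA,7.5\nB,8.1\n")
    demo = load_csv(demo_path)

print(demo.shape)
```

With the real dataset you would call `load_csv("movie_dataset.csv")` instead, assuming the file sits in the notebook's working directory.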

Check for `null` values, then decide how to handle them: drop the affected rows or fill the gaps.

In [176]:
df_movies.isnull().sum()

Rank                    0
Title                   0
Genre                   0
Description             0
Director                0
Actors                  0
Year                    0
Runtime (Minutes)       0
Rating                  0
Votes                   0
Revenue (Millions)    128
Metascore              64
dtype: int64

In [177]:
df_movies.dropna(inplace=True)
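Dropping rows is only one of the two options mentioned above. As a rough comparison, the sketch below applies both dropping and filling to a small made-up table. The toy values and the choice of the column mean as the fill value are assumptions for illustration, not what this notebook does.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the movie table (values are made up)
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Revenue (Millions)": [100.0, np.nan, 50.0, np.nan],
    "Metascore": [70.0, 60.0, np.nan, 80.0],
})

# Option 1: drop every row that has at least one missing value
dropped = toy.dropna()

# Option 2: keep all rows and fill each gap with that column's mean
numeric_cols = ["Revenue (Millions)", "Metascore"]
filled = toy.copy()
filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].mean())

print(len(dropped), len(filled))
```

Dropping keeps only complete rows, so it can shrink the table a lot, while filling keeps every row at the cost of inventing values. Either choice changes later averages, which is why some answers below are given as a range.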

Remove duplicates

In [178]:
df_movies.drop_duplicates(inplace=True)

In [179]:
display(df_movies)

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0
2,3,Split,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0
3,4,Sing,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0
4,5,Suicide Squad,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40.0
...,...,...,...,...,...,...,...,...,...,...,...,...
993,994,Resident Evil: Afterlife,"Action,Adventure,Horror",While still out to destroy the evil Umbrella C...,Paul W.S. Anderson,"Milla Jovovich, Ali Larter, Wentworth Miller,K...",2010,97,5.9,140900,60.13,37.0
994,995,Project X,Comedy,3 high school seniors throw a birthday party t...,Nima Nourizadeh,"Thomas Mann, Oliver Cooper, Jonathan Daniel Br...",2012,88,6.7,164088,54.72,48.0
996,997,Hostel: Part II,Horror,Three American college students studying abroa...,Eli Roth,"Lauren German, Heather Matarazzo, Bijou Philli...",2007,94,5.5,73152,17.54,46.0
997,998,Step Up 2: The Streets,"Drama,Music,Romance",Romantic sparks occur between two dance studen...,Jon M. Chu,"Robert Hoffman, Briana Evigan, Cassie Ventura,...",2008,98,6.2,70699,58.01,50.0


# Basic Attributes and Methods
Now, we will demonstrate some of the basic attributes and methods related to pandas.

- `df.info()`: method that reports the number of rows and columns, the non-null count and dtype of each column, and the memory usage. You can read more about it in the documentation.
- `df.head()`: method that returns the first 5 rows of the dataframe. Pass a number in the brackets to view that many rows instead.
- `df.tail()`: method that returns the last 5 rows of the dataframe. Pass a number in the brackets to view that many rows instead.
- `df.describe()`: method that returns summary statistics for the numerical columns. You can read more about this in the documentation.
- `df.shape`: attribute that holds the number of rows and columns as a tuple (can you remember the same attribute in `NumPy`?)
- `df.columns`: attribute that holds the column names as an `Index` object, which you can iterate over.

Let's see some code!

In [180]:
df_movies.info()

<class 'pandas.core.frame.DataFrame'>
Index: 838 entries, 0 to 999
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Rank                838 non-null    int64  
 1   Title               838 non-null    object 
 2   Genre               838 non-null    object 
 3   Description         838 non-null    object 
 4   Director            838 non-null    object 
 5   Actors              838 non-null    object 
 6   Year                838 non-null    int64  
 7   Runtime (Minutes)   838 non-null    int64  
 8   Rating              838 non-null    float64
 9   Votes               838 non-null    int64  
 10  Revenue (Millions)  838 non-null    float64
 11  Metascore           838 non-null    float64
dtypes: float64(3), int64(4), object(5)
memory usage: 85.1+ KB


In [181]:
df_movies.head()

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0
2,3,Split,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0
3,4,Sing,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0
4,5,Suicide Squad,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40.0


In [182]:
df_movies.describe()

Unnamed: 0,Rank,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
count,838.0,838.0,838.0,838.0,838.0,838.0,838.0
mean,485.247017,2012.50716,114.638425,6.81432,193230.3,84.564558,59.575179
std,286.572065,3.17236,18.470922,0.877754,193099.0,104.520227,16.952416
min,1.0,2006.0,66.0,1.9,178.0,0.0,11.0
25%,238.25,2010.0,101.0,6.3,61276.5,13.9675,47.0
50%,475.5,2013.0,112.0,6.9,136879.5,48.15,60.0
75%,729.75,2015.0,124.0,7.5,271083.0,116.8,72.0
max,1000.0,2016.0,187.0,9.0,1791916.0,936.63,100.0


Just run the remaining attributes in the same way, and voilà, you have it!
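For the attributes and methods not run above (`shape`, `columns` and `tail`), here is a minimal sketch on a made-up three-row table. The toy data is an assumption for illustration; on the real data you would use `df_movies` instead.

```python
import pandas as pd

# Small made-up table so the example runs on its own
toy = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Year": [2006, 2010, 2016],
    "Rating": [7.1, 6.4, 8.0],
})

n_rows, n_cols = toy.shape        # shape is a (rows, columns) tuple
column_names = list(toy.columns)  # columns is an Index; list() gives plain names
last_two = toy.tail(2)            # tail(n) returns the last n rows

print(n_rows, n_cols)
print(column_names)
print(last_two)
```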

To count the `null` values in each column specifically, we can chain the `isnull()` and `sum()` methods as follows.

In [183]:
#finding null values
df_movies.isnull().sum()

Rank                  0
Title                 0
Genre                 0
Description           0
Director              0
Actors                0
Year                  0
Runtime (Minutes)     0
Rating                0
Votes                 0
Revenue (Millions)    0
Metascore             0
dtype: int64

# 1. What is the highest rated movie in the dataset?

In [184]:
df_movies.loc[df_movies["Rating"].idxmax()]

Rank                                                                 55
Title                                                   The Dark Knight
Genre                                                Action,Crime,Drama
Description           When the menace known as the Joker wreaks havo...
Director                                              Christopher Nolan
Actors                Christian Bale, Heath Ledger, Aaron Eckhart,Mi...
Year                                                               2008
Runtime (Minutes)                                                   152
Rating                                                              9.0
Votes                                                           1791916
Revenue (Millions)                                               533.32
Metascore                                                          82.0
Name: 54, dtype: object
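Note that `idxmax()` returns only the first index label at which the maximum occurs, so a tie for the top rating would go unnoticed. The sketch below, on made-up data, shows one way to list every row that shares the maximum.

```python
import pandas as pd

# Made-up ratings with a deliberate tie at the top
toy = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Rating": [8.5, 9.0, 9.0],
})

# idxmax() gives the label of the first maximum only
first_best = toy.loc[toy["Rating"].idxmax(), "Title"]

# Comparing against max() keeps every row that shares the top rating
all_best = toy[toy["Rating"] == toy["Rating"].max()]

print(first_best)
print(list(all_best["Title"]))
```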

# 2. What is the average revenue of all movies in the dataset?
Note: since the answer will be affected by how you dealt with missing values, a range has been provided.

In [185]:
df_movies['Revenue (Millions)'].mean()

84.5645584725537

This figure is the mean revenue per movie in millions, not the total, and it is computed after the rows with missing values were dropped.

# 3. What is the average revenue from 2015 to 2017 in the dataset?
Note: since the answer will be affected by how you dealt with missing values, a range has been provided.


To calculate the average revenue of movies from 2015 to 2017, first filter the dataset for movies released during those years, then calculate the mean of the `Revenue (Millions)` column. Here's how you can do it:

In [186]:

# Filter the DataFrame for movies released between 2015 and 2017
movies_2015_to_2017 = df_movies[(df_movies['Year'] >= 2015) & (df_movies['Year'] <= 2017)]

# Calculate the average revenue for the filtered movies
average_revenue = movies_2015_to_2017['Revenue (Millions)'].mean()

# Display the result
average_revenue


64.49895765472313
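The two comparisons joined with `&` can also be written with `Series.between`, which includes both end points by default. The sketch below uses made-up numbers to show the idea. Since the `describe()` output earlier reports a maximum `Year` of 2016, the upper bound of 2017 matches no extra rows in this dataset.

```python
import pandas as pd

# Made-up years and revenues for illustration
toy = pd.DataFrame({
    "Year": [2014, 2015, 2016, 2016],
    "Revenue (Millions)": [10.0, 20.0, 30.0, 40.0],
})

# between() is inclusive on both ends, so 2015 and 2017 both count
in_range = toy["Year"].between(2015, 2017)
average_revenue = toy.loc[in_range, "Revenue (Millions)"].mean()

print(average_revenue)
```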

# 4. How many movies were released in the year 2016?

Count all movies that have 2016 in their `Year` column.


In [187]:
#Filter the Dataset for only movies that have the year 2016 and count how many rows are there.
num_movies_2016 = df_movies[df_movies['Year'] == 2016].shape[0]

#display the number of movies
num_movies_2016

198

# 5. How many movies were directed by Christopher Nolan?

Filter the dataset down to the rows whose `Director` column is Christopher Nolan, then count the number of rows. Here's how.

In [188]:
# filter the Dataset
movies_by_Christopher_Nolan = df_movies[df_movies['Director'] == 'Christopher Nolan'].shape[0]

#display the number of movies.
movies_by_Christopher_Nolan

5

# 6. How many movies in the Dataset have a rating of at least 8.0?

Use the same approach: filter the dataset, then count the remaining rows.

In [189]:
# filter the Dataset
movies_with_rating_above_eight = df_movies[df_movies['Rating'] >= 8.0].shape[0]

#display the number of movies.
movies_with_rating_above_eight


70
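Questions 4 to 6 all follow the same filter-then-count pattern. As an alternative sketch, a boolean mask can be summed directly, because `True` counts as 1, and two conditions can be combined with `&`. The toy data and the director names `X` and `Y` are made up for illustration.

```python
import pandas as pd

# Made-up directors and ratings
toy = pd.DataFrame({
    "Director": ["X", "Y", "X"],
    "Rating": [8.0, 7.9, 8.6],
})

mask = toy["Rating"] >= 8.0

count_by_shape = toy[mask].shape[0]  # filter first, then count the rows
count_by_sum = int(mask.sum())       # or sum the True values directly

# Conditions can be combined with & (each one wrapped in brackets)
combined = int(((toy["Director"] == "X") & mask).sum())

print(count_by_shape, count_by_sum, combined)
```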

# 7. What is the mean rating of movies directed by Christopher Nolan?

In [190]:
# filter the Dataset
movies_by_Christopher_Nolan = df_movies[df_movies['Director'] == 'Christopher Nolan']

#mean rating
mean_rating = movies_by_Christopher_Nolan['Rating'].mean()
mean_rating

8.680000000000001

# 8. Find the year with the highest average rating.

In [191]:
# Group by 'Year' and calculate the average rating for each year
average_ratings_by_year = df_movies.groupby('Year')['Rating'].mean()

# Find the year with the highest average rating
year_with_highest_avg_rating = average_ratings_by_year.idxmax()
highest_avg_rating = average_ratings_by_year.max()

year_with_highest_avg_rating, highest_avg_rating


(2006, 7.14390243902439)
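A yearly mean says nothing about how many movies it is based on. The sketch below, on made-up ratings, aggregates the mean and the count together so that both can be read side by side. The numbers are assumptions for illustration only.

```python
import pandas as pd

# Made-up ratings for two years
toy = pd.DataFrame({
    "Year": [2006, 2006, 2016, 2016, 2016],
    "Rating": [7.0, 8.0, 6.0, 7.0, 8.0],
})

# One row per year with the mean rating and the number of movies behind it
summary = (
    toy.groupby("Year")["Rating"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
best_year = summary.index[0]

print(summary)
print(best_year)
```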

# 9. What is the percentage increase in the number of movies made between 2006 and 2016?

In [192]:
# Count the number of movies for 2006 and 2016
num_movies_2006 = df_movies[df_movies['Year'] == 2006].shape[0]
num_movies_2016 = df_movies[df_movies['Year'] == 2016].shape[0]
# Calculate the percentage increase
percentage_increase = ((num_movies_2016 - num_movies_2006) / num_movies_2006) * 100

percentage_increase


382.9268292682927
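The calculation above is the usual percentage-change formula, (new - old) / old * 100. As a sketch, the per-year counts can also be read from a single `value_counts()` call instead of two separate filters. The helper name `percentage_increase` and the toy years are assumptions for illustration.

```python
import pandas as pd

# Made-up release years: two movies in 2006 and five in 2016
toy = pd.DataFrame({"Year": [2006, 2006, 2016, 2016, 2016, 2016, 2016]})

# value_counts() returns the number of rows for every year at once
counts = toy["Year"].value_counts()


def percentage_increase(old, new):
    """Percentage change from old to new (hypothetical helper)."""
    return (new - old) / old * 100


increase = percentage_increase(counts.loc[2006], counts.loc[2016])
print(increase)
```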

# 10. Find the most common actor in all the movies.
Note: the `Actors` column holds multiple actor names per movie. You must find a way to search for the most common actor across all the movies.

In [193]:
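# Step 1: Split the actor names into a list for each movie
# (names are split on ',' only, so surrounding spaces are kept)
df_movies['Actors'] = df_movies['Actors'].str.split(',')

# Step 2: Give every actor their own row
exploded_actors = df_movies.explode('Actors')
# Step 3: Count how often each actor appears
actor_counts = exploded_actors['Actors'].value_counts()

# Step 4: Find the most common actor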
most_common_actor = actor_counts.idxmax()
most_common_actor_count = actor_counts.max()

most_common_actor, most_common_actor_count


('Mark Wahlberg', 11)
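In the table shown earlier, the `Actors` strings separate names with a comma that is sometimes followed by a space and sometimes not. A plain `split(',')` can therefore leave a leading space on some names and count the same actor under two spellings. The sketch below strips the whitespace after exploding. It uses made-up names, so the result on the real dataset may differ from the output above.

```python
import pandas as pd

# Made-up names with inconsistent spacing after the commas
toy = pd.DataFrame({
    "Actors": [
        "Actor One, Actor Two",
        "Actor Two,Actor Three",
        "Actor Three, Actor Two",
    ]
})

actor_counts = (
    toy["Actors"]
    .str.split(",")   # one list of names per movie
    .explode()        # one row per name
    .str.strip()      # remove stray spaces around each name
    .value_counts()
)

most_common_actor = actor_counts.idxmax()
most_common_actor_count = actor_counts.max()

print(most_common_actor, most_common_actor_count)
```

Working on a chained copy like this also leaves the original column untouched, so the cell can be re-run safely.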

# 11. How many unique genres are there in the dataset?
Note: the `Genre` column has multiple genres per movie. You must find a way to identify them individually.

In [194]:
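# Split the comma-separated genre string into a list for each movie
df_movies['Genre'] = df_movies['Genre'].str.split(',')
# Give every genre its own row, then collect the distinct values
exploded_genres = df_movies.explode('Genre')
unique_genres = exploded_genres['Genre'].unique()
# Count the distinct genres
num_unique_genres = len(unique_genres)
num_unique_genres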


20
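As a more compact variant, the sketch below counts the distinct genres with `nunique()` without overwriting the `Genre` column, so the cell stays re-runnable. The toy genres are made up for illustration.

```python
import pandas as pd

# Made-up genre strings in the same comma-separated layout
toy = pd.DataFrame({"Genre": ["Action,Adventure", "Comedy", "Action,Comedy"]})

# Split, explode and strip in one chain; the original column is not modified
genres = toy["Genre"].str.split(",").explode().str.strip()

num_unique_genres = genres.nunique()
genre_names = sorted(genres.unique())

print(num_unique_genres)
print(genre_names)
```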