# Mastering DataFrame Mutations with Hollywood Data

## Introduction
In this lab, you will practice common data manipulation techniques on a real-world movie dataset, which is used in all the activities. Each row describes one movie: its name, release year, budget, gross earnings, and so on.

This lab covers the following:

- Creating new columns in a data frame by applying basic arithmetic operations such as addition, subtraction, and division to existing columns.
- Creating new columns in a data frame by applying boolean expressions such as < (less than), > (greater than), and == (equal to) to existing columns.
- Dropping rows based on certain conditions.
- Dropping single and multiple columns.

First, we load the dataset.

Use the pandas.read_csv() function to load the dataset into a pandas dataframe. Store the dataframe in a variable named 'df'.

In [16]:
import pandas as pd 
path_to_csv = "../../data/movies.csv"
df = pd.read_csv(path_to_csv)
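
If the CSV file is not at hand, the loading step can be tried on an in-memory sample. This is a minimal sketch, assuming a cut-down layout with only four of the columns; the name sample_df and the two rows (copied from the preview of the dataset) are for illustration only and stand in for the real file.

```python
import io
import pandas as pd

# Two rows typed by hand in a reduced column layout; this is not the real file.
sample_csv = io.StringIO(
    "name,year,budget,gross\n"
    "The Shining,1980,19000000.0,46998772.0\n"
    "Airplane!,1980,3500000.0,83453539.0\n"
)

# read_csv accepts a file path or any file-like object.
sample_df = pd.read_csv(sample_csv)

print(sample_df.shape)
print(sample_df.dtypes)
```

The numeric columns are parsed as numbers, which is what the arithmetic and comparison activities rely on.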

In [17]:
df.columns

Index(['name', 'rating', 'genre', 'year', 'released', 'score', 'votes',
       'director', 'writer', 'star', 'country', 'budget', 'gross', 'company',
       'runtime'],
      dtype='object')

In [18]:
df.head()

| | name | rating | genre | year | released | score | votes | director | writer | star | country | budget | gross | company | runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The Shining | R | Drama | 1980 | June 13, 1980 (United States) | 8.4 | 927000.0 | Stanley Kubrick | Stephen King | Jack Nicholson | United Kingdom | 19000000.0 | 46998772.0 | Warner Bros. | 146.0 |
| 1 | The Blue Lagoon | R | Adventure | 1980 | July 2, 1980 (United States) | 5.8 | 65000.0 | Randal Kleiser | Henry De Vere Stacpoole | Brooke Shields | United States | 4500000.0 | 58853106.0 | Columbia Pictures | 104.0 |
| 2 | Star Wars: Episode V - The Empire Strikes Back | PG | Action | 1980 | June 20, 1980 (United States) | 8.7 | 1200000.0 | Irvin Kershner | Leigh Brackett | Mark Hamill | United States | 18000000.0 | 538375067.0 | Lucasfilm | 124.0 |
| 3 | Airplane! | PG | Comedy | 1980 | July 2, 1980 (United States) | 7.7 | 221000.0 | Jim Abrahams | Jim Abrahams | Robert Hays | United States | 3500000.0 | 83453539.0 | Paramount Pictures | 88.0 |
| 4 | Caddyshack | R | Comedy | 1980 | July 25, 1980 (United States) | 7.3 | 108000.0 | Harold Ramis | Brian Doyle-Murray | Chevy Chase | United States | 6000000.0 | 39846344.0 | Orion Pictures | 98.0 |


## Activities

### 1. Create a new column revenue

Add a new column 'revenue' to the DataFrame that represents the difference between the 'gross' and 'budget' columns.

In [19]:
df['revenue'] = df['gross'] - df['budget']
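
Column arithmetic in pandas works element by element, row by row. The sketch below illustrates this on a small hand-made frame; the name toy and all of its values are hypothetical and chosen only to show a profit, a loss, and a missing budget.

```python
import pandas as pd

# Hypothetical values, not taken from the movie dataset.
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "budget": [10.0, 50.0, None],   # None is stored as NaN
    "gross": [25.0, 40.0, 30.0],
})

# Subtraction is applied to each row; a NaN operand gives a NaN result.
toy["revenue"] = toy["gross"] - toy["budget"]

print(toy)
```

Rows with a missing budget or gross therefore end up with a missing revenue, which carries over into any column derived from it.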

### 2. Create a new column percentage_profit

Create a new column percentage_profit that expresses each row's revenue as a percentage of its gross earnings. For example, if the revenue is 100 million and the gross is 200 million, then the percentage profit is 50%.

- Express the result as a percentage, that is, multiply the ratio by 100.

In [20]:
df['percentage_profit'] = df['revenue'] / df['gross'] * 100
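
The worked example from the task description can be checked directly. In this sketch the first row uses the figures from the text (revenue 100 million, gross 200 million); the second row is a hypothetical loss-making movie added to show that the result can be negative. The frame name toy is an assumption for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "gross": [200_000_000.0, 50_000_000.0],
    "revenue": [100_000_000.0, -25_000_000.0],  # second row is hypothetical
})

# Ratio of revenue to gross, scaled to a percentage.
toy["percentage_profit"] = toy["revenue"] / toy["gross"] * 100

print(toy["percentage_profit"].tolist())  # [50.0, -50.0]
```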

### 3. Create a new column high_budget_movie

Create a new column high_budget_movie with the value True if the movie's budget is greater than 100 million and False otherwise.

In [21]:
df['high_budget_movie'] = df['budget'] > 100_000_000
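
A comparison on a column returns a boolean Series with one True or False per row, which is what gets stored in the new column. The sketch below uses a hypothetical three-row frame named toy to show the boundary case and the behaviour for a missing value.

```python
import pandas as pd

# Hypothetical budgets: above the threshold, exactly on it, and missing.
toy = pd.DataFrame({"budget": [150_000_000.0, 100_000_000.0, None]})

# "Greater than" is strict, and a comparison against NaN evaluates to False.
toy["high_budget_movie"] = toy["budget"] > 100_000_000

print(toy["high_budget_movie"].tolist())  # [True, False, False]
```

A movie with an unknown budget is thus labelled False rather than left missing, which is worth keeping in mind when the flag is later used to drop rows.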

### 4. Create a new column successful_movie

Create a new column successful_movie with the value True if the movie's percentage_profit is greater than 50 and False otherwise.

In [22]:
df['successful_movie'] = df['percentage_profit'] > 50

### 5. High-Rated Movies

Create a new column high_rated_movie with the value True if the movie's score is greater than 8 and False otherwise.

In [23]:
df['high_rated_movie'] = df['score'] > 8

### 6. Create a new column is_new_release

Create a new column is_new_release that is True if the value of the year column is greater than 2020 and False otherwise.

In [24]:
df['is_new_release'] = df['year'] > 2020

### 7. Create a new column is_long_movie

Create a new column is_long_movie that is True if the value of the runtime column is greater than 150 minutes and False otherwise.

In [25]:
df['is_long_movie'] = df['runtime'] > 150

### 8. Drop unsuccessful movies

Drop all the rows where the successful_movie column value is False. Use the inplace parameter to make the changes permanent.

In [26]:
df.drop(df[~df['successful_movie']].index, inplace=True)
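
Dropping the rows that fail a condition gives the same result as keeping the rows that pass it. The sketch below compares the two on a hypothetical four-row frame named toy; the ~ operator negates a boolean Series.

```python
import pandas as pd

# Hypothetical flags, not taken from the movie dataset.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "successful_movie": [True, False, True, False],
})

# Option 1: look up the labels of the failing rows and drop them.
dropped = toy.drop(toy[~toy["successful_movie"]].index)

# Option 2: keep the passing rows with a boolean mask.
filtered = toy[toy["successful_movie"]]

print(dropped.equals(filtered))  # True
```

Both keep the original index labels (0 and 2 here). One difference to be aware of: drop removes rows by label, so with duplicate index labels it would remove every row sharing a dropped label, whereas the mask works row by row.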

### 9. Drop high-budget movies

Drop all the rows where the value of budget is greater than 100 million and store the new dataframe in the variable low_budget_df. Don't drop from the original dataframe.

In [27]:
low_budget_df = df.drop(df[df['high_budget_movie']].index)
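
The difference between dropping with and without inplace can be verified on a small frame. This is a sketch with hypothetical budgets; the names toy, low_budget, scratch, and result are made up for illustration.

```python
import pandas as pd

# Hypothetical budgets, chosen only to straddle the 100 million threshold.
toy = pd.DataFrame({"budget": [50_000_000.0, 150_000_000.0, 200_000_000.0]})

# Without inplace, drop returns a new DataFrame and toy keeps all its rows.
low_budget = toy.drop(toy[toy["budget"] > 100_000_000].index)

# With inplace=True, drop modifies the frame it is called on and returns None.
scratch = toy.copy()
result = scratch.drop(scratch[scratch["budget"] > 100_000_000].index, inplace=True)

print(len(toy), len(low_budget), len(scratch), result)  # 3 1 1 None
```

Because the inplace form returns None, assigning its result back to a variable would discard the data.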

### 10. Removing Low-Voted Movies

Drop all the rows where the value of votes is less than 1000 and store the new dataframe in the variable high_voted_df. Don't drop from the original dataframe.

In [28]:
high_voted_df = df.drop(df.loc[df['votes'] < 1000].index)

### 11. Drop the column budget

To remove the budget column from the movie dataframe, use the drop method and specify the column name budget. Be sure to specify the axis to indicate that it's a column and not a row. Additionally, set the inplace parameter to True to make the change permanent.

In [29]:
df.drop(['budget'], axis=1, inplace=True)

### 12. Drop the director and writer columns

To eliminate the director and writer columns from the movie dataframe, use the drop method and pass in the column names director and writer. Specify the axis to indicate that they are columns and not rows. Leave the inplace parameter at its default of False so that drop returns a new dataframe, and store that result in new_df without modifying the original dataframe.

- Note that in this activity you have to create a new dataframe named new_df.

In [30]:
new_df = df.drop(['director', 'writer'], axis=1)
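
The drop method also accepts a columns keyword, which can be used instead of passing labels together with axis=1. The sketch below shows both spellings on a hypothetical one-row frame named toy; the cell values are placeholders.

```python
import pandas as pd

# Placeholder values; only the column names matter here.
toy = pd.DataFrame({
    "name": ["A"],
    "budget": [1.0],
    "director": ["X"],
    "writer": ["Y"],
})

# Two equivalent ways to drop a single column without touching toy.
no_budget_axis = toy.drop(["budget"], axis=1)
no_budget_kw = toy.drop(columns="budget")

# Several columns at once: pass a list of names.
slim = toy.drop(columns=["director", "writer"])

print(list(no_budget_kw.columns))  # ['name', 'director', 'writer']
print(list(slim.columns))          # ['name', 'budget']
```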

### 13. Drop Out Low-Rated and Low-Voted Movies

Drop all the rows where the value of score is less than 5 and the value of votes is less than 1000. Drop the rows from the original dataframe df.

In [31]:
df.drop(df.loc[(df['score'] < 5) & (df['votes'] < 1000)].index, inplace=True)
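
When two conditions are combined with &, each comparison needs its own parentheses, because & binds more tightly than < in Python; without them the expression is parsed differently and typically fails with an error about an ambiguous truth value. The sketch below builds the combined mask step by step on hypothetical scores and votes (the frame name toy and its values are made up).

```python
import pandas as pd

# Hypothetical values covering all four combinations of the two conditions.
toy = pd.DataFrame({
    "score": [4.0, 4.5, 7.0, 3.0],
    "votes": [500, 5000, 800, 900],
})

# A row is dropped only when BOTH conditions hold.
to_drop = (toy["score"] < 5) & (toy["votes"] < 1000)

# Keeping the complement is equivalent to dropping the matching rows.
kept = toy[~to_drop]

print(to_drop.tolist())   # [True, False, False, True]
print(list(kept.index))   # [1, 2]
```

A low-scored movie with many votes (row 1) and a well-scored movie with few votes (row 2) both survive, since only one of the two conditions is met.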

### 14. Top High-Rated Movies

Create a new dataframe top_rated_movies that contains the top 5 high-rated movies. The dataframe should be sorted by the score column in descending order.

In [32]:
top_rated_movies = df.sort_values(by='score', ascending=False).iloc[:5]
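
As an alternative to sorting the whole frame and slicing, pandas provides nlargest, which returns the top rows for a column directly. The sketch below compares the two approaches on hypothetical scores with no ties; the frame name toy and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical scores, all distinct so the ordering is unambiguous.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E", "F", "G"],
    "score": [7.1, 9.0, 6.5, 8.8, 9.3, 5.0, 8.1],
})

top_sorted = toy.sort_values(by="score", ascending=False).iloc[:5]
top_nlargest = toy.nlargest(5, "score")

print(top_sorted["name"].tolist())      # ['E', 'B', 'D', 'G', 'A']
print(top_sorted.equals(top_nlargest))  # True
```

When several movies share the same score, the two approaches may order the tied rows differently, so the equality above should not be assumed for data with ties.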

### 15. Removing Specific Rows

Remove the rows with index labels 2 and 10 from the DataFrame. The call below returns a new DataFrame and leaves df itself unchanged.

In [33]:
rows_to_drop = [2, 10]
df.drop(index=rows_to_drop)

| | name | rating | genre | year | released | score | votes | director | writer | star | ... | gross | company | runtime | revenue | percentage_profit | high_budget_movie | successful_movie | high_rated_movie | is_new_release | is_long_movie |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The Shining | R | Drama | 1980 | June 13, 1980 (United States) | 8.4 | 927000.0 | Stanley Kubrick | Stephen King | Jack Nicholson | ... | 46998772.0 | Warner Bros. | 146.0 | 27998772.0 | 59.573412 | False | True | True | False | False |
| 1 | The Blue Lagoon | R | Adventure | 1980 | July 2, 1980 (United States) | 5.8 | 65000.0 | Randal Kleiser | Henry De Vere Stacpoole | Brooke Shields | ... | 58853106.0 | Columbia Pictures | 104.0 | 54353106.0 | 92.353845 | False | True | False | False | False |
| 3 | Airplane! | PG | Comedy | 1980 | July 2, 1980 (United States) | 7.7 | 221000.0 | Jim Abrahams | Jim Abrahams | Robert Hays | ... | 83453539.0 | Paramount Pictures | 88.0 | 79953539.0 | 95.806050 | False | True | False | False | False |
| 4 | Caddyshack | R | Comedy | 1980 | July 25, 1980 (United States) | 7.3 | 108000.0 | Harold Ramis | Brian Doyle-Murray | Chevy Chase | ... | 39846344.0 | Orion Pictures | 98.0 | 33846344.0 | 84.942157 | False | True | False | False | False |
| 5 | Friday the 13th | R | Horror | 1980 | May 9, 1980 (United States) | 6.4 | 123000.0 | Sean S. Cunningham | Victor Miller | Betsy Palmer | ... | 39754601.0 | Paramount Pictures | 95.0 | 39204601.0 | 98.616512 | False | True | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7645 | Birds of Prey | R | Action | 2020 | February 7, 2020 (United States) | 6.1 | 190000.0 | Cathy Yan | Christina Hodson | Margot Robbie | ... | 201858461.0 | Clubhouse Pictures (II) | 109.0 | 117358461.0 | 58.138985 | False | True | False | False | False |
| 7646 | The Invisible Man | R | Drama | 2020 | February 28, 2020 (United States) | 7.1 | 186000.0 | Leigh Whannell | Leigh Whannell | Elisabeth Moss | ... | 143151000.0 | Universal Pictures | 124.0 | 136151000.0 | 95.110059 | False | True | False | False | False |
| 7648 | Bad Boys for Life | R | Action | 2020 | January 17, 2020 (United States) | 6.6 | 140000.0 | Adil El Arbi | Peter Craig | Will Smith | ... | 426505244.0 | Columbia Pictures | 124.0 | 336505244.0 | 78.898266 | False | True | False | False | False |
| 7649 | Sonic the Hedgehog | PG | Action | 2020 | February 14, 2020 (United States) | 6.5 | 102000.0 | Jeff Fowler | Pat Casey | Ben Schwartz | ... | 319715683.0 | Paramount Pictures | 99.0 | 234715683.0 | 73.413878 | False | True | False | False | False |


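Or, selecting the rows by position with iloc (this matches the label-based version only when the rows at positions 2 and 10 still carry the labels 2 and 10):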

In [34]:
df.drop(df.iloc[[2, 10]].index)

| | name | rating | genre | year | released | score | votes | director | writer | star | ... | gross | company | runtime | revenue | percentage_profit | high_budget_movie | successful_movie | high_rated_movie | is_new_release | is_long_movie |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The Shining | R | Drama | 1980 | June 13, 1980 (United States) | 8.4 | 927000.0 | Stanley Kubrick | Stephen King | Jack Nicholson | ... | 46998772.0 | Warner Bros. | 146.0 | 27998772.0 | 59.573412 | False | True | True | False | False |
| 1 | The Blue Lagoon | R | Adventure | 1980 | July 2, 1980 (United States) | 5.8 | 65000.0 | Randal Kleiser | Henry De Vere Stacpoole | Brooke Shields | ... | 58853106.0 | Columbia Pictures | 104.0 | 54353106.0 | 92.353845 | False | True | False | False | False |
| 3 | Airplane! | PG | Comedy | 1980 | July 2, 1980 (United States) | 7.7 | 221000.0 | Jim Abrahams | Jim Abrahams | Robert Hays | ... | 83453539.0 | Paramount Pictures | 88.0 | 79953539.0 | 95.806050 | False | True | False | False | False |
| 4 | Caddyshack | R | Comedy | 1980 | July 25, 1980 (United States) | 7.3 | 108000.0 | Harold Ramis | Brian Doyle-Murray | Chevy Chase | ... | 39846344.0 | Orion Pictures | 98.0 | 33846344.0 | 84.942157 | False | True | False | False | False |
| 5 | Friday the 13th | R | Horror | 1980 | May 9, 1980 (United States) | 6.4 | 123000.0 | Sean S. Cunningham | Victor Miller | Betsy Palmer | ... | 39754601.0 | Paramount Pictures | 95.0 | 39204601.0 | 98.616512 | False | True | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7645 | Birds of Prey | R | Action | 2020 | February 7, 2020 (United States) | 6.1 | 190000.0 | Cathy Yan | Christina Hodson | Margot Robbie | ... | 201858461.0 | Clubhouse Pictures (II) | 109.0 | 117358461.0 | 58.138985 | False | True | False | False | False |
| 7646 | The Invisible Man | R | Drama | 2020 | February 28, 2020 (United States) | 7.1 | 186000.0 | Leigh Whannell | Leigh Whannell | Elisabeth Moss | ... | 143151000.0 | Universal Pictures | 124.0 | 136151000.0 | 95.110059 | False | True | False | False | False |
| 7648 | Bad Boys for Life | R | Action | 2020 | January 17, 2020 (United States) | 6.6 | 140000.0 | Adil El Arbi | Peter Craig | Will Smith | ... | 426505244.0 | Columbia Pictures | 124.0 | 336505244.0 | 78.898266 | False | True | False | False | False |
| 7649 | Sonic the Hedgehog | PG | Action | 2020 | February 14, 2020 (United States) | 6.5 | 102000.0 | Jeff Fowler | Pat Casey | Ben Schwartz | ... | 319715683.0 | Paramount Pictures | 99.0 | 234715683.0 | 73.413878 | False | True | False | False | False |
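
Dropping by label and dropping by position agree only while the index is still the default 0, 1, 2, ... sequence. Once earlier rows have been removed, labels and positions drift apart. The sketch below demonstrates this on a hypothetical five-row frame named toy.

```python
import pandas as pd

toy = pd.DataFrame({"name": ["A", "B", "C", "D", "E"]})

# Remove one row first, so the remaining labels are 0, 2, 3, 4.
toy = toy.drop(index=[1])

# By label: removes the row whose index label is 2 ("C").
by_label = toy.drop(index=[2])

# By position: iloc[[2]] is the third remaining row, whose label is 3 ("D").
by_position = toy.drop(toy.iloc[[2]].index)

print(by_label["name"].tolist())     # ['A', 'D', 'E']
print(by_position["name"].tolist())  # ['A', 'C', 'E']
```

Which of the two is appropriate depends on whether "row 2" is meant as the row labelled 2 or the third row currently in the frame.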


### 16. Sci-Fi Blockbusters

Create a new DataFrame named sci_fi_blockbusters containing the movies whose genre is Sci-Fi and whose 'gross' is greater than $150 million.

In [35]:
sci_fi_blockbusters = df.loc[(df['genre'] == 'Sci-Fi') & (df['gross'] > 150_000_000)]

### 17. Age of Movies

Create a new column age that contains the age of the movie in years. The age of the movie is calculated by subtracting the year column from the current year.

- Use 2023 as the current year.

In [36]:
df['age'] = 2023 - df['year']

### 18. Movies Released in Summer

Create a new DataFrame containing movies released in June, July, or August. Store the result in dataframe summer_movies.

In [37]:
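# na=False treats rows with a missing release date as non-matches,
# so the resulting mask is purely boolean and safe to index with.
summer_movies = df[df['released'].str.contains('June|July|August', na=False)]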
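
Another way to select summer releases is to extract the month name from the start of the released string and test it against a list. This is a sketch, assuming the released values follow the "Month day, year (Country)" pattern seen in the preview of the dataset; the frame name toy is hypothetical, its first three values are copied from that preview, and the fourth is a missing value added for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"released": [
    "June 13, 1980 (United States)",
    "May 9, 1980 (United States)",
    "July 2, 1980 (United States)",
    None,  # hypothetical missing release date
]})

# Capture the leading run of letters (the month name); missing values stay NaN.
month = toy["released"].str.extract(r"^([A-Za-z]+)", expand=False)

# isin returns False for NaN, so the mask is safe to index with.
summer = toy[month.isin(["June", "July", "August"])]

print(list(summer.index))  # [0, 2]
```

Anchoring the match at the start of the string means a month name appearing elsewhere in the text would not be picked up by accident.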