# Scenario
This is a continuation of the analysis you’re doing alongside Mandisa Nkosi to inform decision-making in politics, first introduced in 3.1.5 Practical activity: Create and merge the DataFrames. In the previous activity, Mandisa created pivot tables to analyse the data set.

The political party is now ready to buy advertising slots and has received quotes from a broadcaster. The fee for running an ad is a flat rate per 10 seconds of advertisement, with a minimum purchase of 60 seconds of advertising time per film. The broadcaster also has a promotional, reduced fee for ads run during any documentary film.
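
The pricing rules above can be sketched as a small function. The two rates below are hypothetical placeholders, because the broadcaster's actual quotes are not given in the scenario. The sketch also assumes that ad time is billed in whole 10-second blocks, rounded up.

```python
import math

# Hypothetical rates per 10-second block; the real quotes are not stated in the scenario.
STANDARD_RATE = 100.0
DOCUMENTARY_RATE = 80.0

MIN_SECONDS = 60     # minimum purchase per film (from the scenario)
BLOCK_SECONDS = 10   # billing unit (from the scenario)


def ad_cost(seconds, is_documentary=False):
    """Return the cost of advertising during one film."""
    # Enforce the 60-second minimum purchase.
    billed_seconds = max(seconds, MIN_SECONDS)
    # Assumption: a partial block is rounded up to a full 10-second block.
    blocks = math.ceil(billed_seconds / BLOCK_SECONDS)
    rate = DOCUMENTARY_RATE if is_documentary else STANDARD_RATE
    return blocks * rate


standard_cost = ad_cost(45)                          # billed as 60 seconds
documentary_cost = ad_cost(75, is_documentary=True)  # billed as 8 blocks
```

With real quotes substituted for the placeholder rates, the same function could be applied to each film once the documentaries have been identified.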

# Objective
To help the political party decide how they might best invest their budget, Mandisa will answer the following business questions:

What is the effect of adding 60 seconds (one minute) to each movie?
Which movies are documentaries?
Your objective is to answer these questions with the apply() and/or applymap() functions: use apply() when you're working on a pandas Series and applymap() when you need to transform every element of a pandas DataFrame.
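
As a rough illustration of the difference, the sketch below runs both on a tiny made-up sample rather than the real data set. Recent pandas releases (2.1 and later) rename DataFrame.applymap to DataFrame.map, so the sketch picks whichever one the installed version provides.

```python
import pandas as pd

# Small illustrative sample, not the real data set.
sample = pd.DataFrame({"Runtime": [148.0, 136.0], "IMDb": [8.8, 8.7]})

# Series.apply calls the function once for each value in one column.
runtime_plus_one = sample["Runtime"].apply(lambda minutes: minutes + 1)

# Element-wise function over every cell of a DataFrame.
# Newer pandas versions name this DataFrame.map; older ones only have applymap.
elementwise = getattr(sample, "map", None) or sample.applymap
rounded = elementwise(round)
```

Here `runtime_plus_one` is a Series and `rounded` is a DataFrame with the same shape as `sample`.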

In [1]:
# Import libraries

import pandas as pd
import numpy as np

# Import the Excel file

In [2]:
# Read the Excel data.
movies = pd.read_excel('movies_merge.xlsx')

# View the column names.
print(movies.columns)

Index(['ID', 'Title', 'Year', 'Age', 'IMDb', 'Rotten Tomatoes', 'Directors',
       'Genres', 'Country', 'Language', 'Runtime'],
      dtype='object')


# Import the CSV file

In [3]:
# Read the CSV file.
ott = pd.read_csv('ott_merge.csv')

# View the column names.

print(ott.columns)

Index(['ID', 'Netflix', 'Hulu', 'Prime Video', 'Disney+'], dtype='object')


# Validate and describe the data

In [4]:
# Validate movies.
movies.head()
print(movies.shape)

# Validate ott.
ott.head()
print(ott.shape)

(16744, 11)
(16744, 5)
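
Matching shapes only confirm that the row counts agree. A few further checks that are often useful before a merge are sketched below on a small made-up sample; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Small illustrative sample standing in for the movies DataFrame.
sample = pd.DataFrame({
    "ID": [1, 2, 3],
    "Title": ["Inception", "The Matrix", "Sharks of Lost Island"],
    "Runtime": [148.0, 136.0, np.nan],
})

# Number of missing values in each column.
missing_per_column = sample.isna().sum()

# True when every ID occurs exactly once, which a merge key should satisfy.
ids_are_unique = sample["ID"].is_unique

# Summary statistics for the numeric columns.
summary = sample.describe()
```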


# Combine the two DataFrames


In [6]:
# Merge the two DataFrames.
mov_ott = pd.merge(movies, ott, how='left', on = 'ID')

# Check that the DataFrames merged correctly.
print(mov_ott.columns)
print(mov_ott.shape)

Index(['ID', 'Title', 'Year', 'Age', 'IMDb', 'Rotten Tomatoes', 'Directors',
       'Genres', 'Country', 'Language', 'Runtime', 'Netflix', 'Hulu',
       'Prime Video', 'Disney+'],
      dtype='object')
(16744, 15)
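
The column list and shape suggest the merge worked. pd.merge can also check this itself through its `validate` and `indicator` parameters. The sketch below uses two small made-up DataFrames, not the real files.

```python
import pandas as pd

# Illustrative samples; ID 3 deliberately has no match in the second table.
movies_sample = pd.DataFrame({"ID": [1, 2, 3], "Title": ["A", "B", "C"]})
ott_sample = pd.DataFrame({"ID": [1, 2], "Netflix": [0, 1]})

merged = pd.merge(
    movies_sample,
    ott_sample,
    how="left",
    on="ID",
    validate="one_to_one",  # raise MergeError if an ID repeats on either side
    indicator=True,         # add a _merge column: 'both' or 'left_only'
)

# Rows from the left table that found no match in the right table.
unmatched = merged[merged["_merge"] == "left_only"]
```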


In [7]:
# View the DataFrame
mov_ott

Unnamed: 0,ID,Title,Year,Age,IMDb,Rotten Tomatoes,Directors,Genres,Country,Language,Runtime,Netflix,Hulu,Prime Video,Disney+
0,1,Inception,2010,13+,8.8,0.87,Christopher Nolan,"Action,Adventure,Sci-Fi,Thriller","United States,United Kingdom","English,Japanese,French",148.0,0,0,1,0
1,2,The Matrix,1999,18+,8.7,0.87,"Lana Wachowski,Lilly Wachowski","Action,Sci-Fi",United States,English,136.0,0,1,0,0
2,3,Avengers: Infinity War,2018,13+,8.5,0.84,"Anthony Russo,Joe Russo","Action,Adventure,Sci-Fi",United States,English,149.0,0,0,1,0
3,4,Back to the Future,1985,7+,8.5,0.96,Robert Zemeckis,"Adventure,Comedy,Sci-Fi",United States,English,116.0,1,0,0,0
4,5,"The Good, the Bad and the Ugly",1966,18+,8.8,0.97,Sergio Leone,Western,"Italy,Spain,West Germany",Italian,161.0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16739,16740,The Ghosts of Buxley Hall,1980,,6.2,,Bruce Bilson,"Comedy,Family,Fantasy,Horror",United States,English,120.0,0,0,1,0
16740,16741,The Poof Point,2001,7+,4.7,,Neal Israel,"Comedy,Family,Sci-Fi",United States,English,90.0,0,0,1,0
16741,16742,Sharks of Lost Island,2013,,5.7,,Neil Gelinas,Documentary,United States,English,,0,0,1,0
16742,16743,Man Among Cheetahs,2017,,6.6,,Richard Slater-Jones,Documentary,United States,English,,0,0,1,0


# What is the effect of adding 60 seconds (1 minute) to each movie?

In [9]:
# Determine the runtime of each movie.
mov_ott_runtime = mov_ott[['ID', 'Runtime', 'Genres']]

# View the output.
mov_ott_runtime

Unnamed: 0,ID,Runtime,Genres
0,1,148.0,"Action,Adventure,Sci-Fi,Thriller"
1,2,136.0,"Action,Sci-Fi"
2,3,149.0,"Action,Adventure,Sci-Fi"
3,4,116.0,"Adventure,Comedy,Sci-Fi"
4,5,161.0,Western
...,...,...,...
16739,16740,120.0,"Comedy,Family,Fantasy,Horror"
16740,16741,90.0,"Comedy,Family,Sci-Fi"
16741,16742,,Documentary
16742,16743,,Documentary


In [10]:
# Add 1 minute (60 seconds) to the runtime; Runtime is stored in minutes.
mov_ott_runtime['Runtime'].add(1)

0        149.0
1        137.0
2        150.0
3        117.0
4        162.0
         ...  
16739    121.0
16740     91.0
16741      NaN
16742      NaN
16743     33.0
Name: Runtime, Length: 16744, dtype: float64
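
Since the objective asks for apply(), the same result can also be produced that way. The sketch below, on a small made-up Series, suggests that apply() with a lambda matches add(1) and that missing runtimes stay missing in both cases.

```python
import numpy as np
import pandas as pd

# Illustrative runtimes in minutes, including one missing value.
runtime = pd.Series([148.0, 136.0, np.nan], name="Runtime")

# Vectorised addition, as used in the cell above.
added = runtime.add(1)

# The same calculation written with apply(); NaN + 1 is still NaN.
applied = runtime.apply(lambda minutes: minutes + 1)
```

For simple arithmetic the vectorised add() is usually the faster choice; apply() becomes useful when the per-value logic is more involved.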

# Which movies are documentaries?

In [11]:
# Create a new column with documentaries.
# na=False labels movies with a missing genre as 'Not Documentary'.
mov_ott_runtime['Gen_doc'] = np.where(mov_ott_runtime['Genres'].str.contains('Documentary', na=False),
                                      'Documentary', 'Not Documentary')

# View the DataFrame.
mov_ott_runtime

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  mov_ott_runtime['Gen_doc'] = np.where(mov_ott_runtime['Genres'].str.contains('Documentary', na=False),


Unnamed: 0,ID,Runtime,Genres,Gen_doc
0,1,148.0,"Action,Adventure,Sci-Fi,Thriller",Not Documentary
1,2,136.0,"Action,Sci-Fi",Not Documentary
2,3,149.0,"Action,Adventure,Sci-Fi",Not Documentary
3,4,116.0,"Adventure,Comedy,Sci-Fi",Not Documentary
4,5,161.0,Western,Not Documentary
...,...,...,...,...
16739,16740,120.0,"Comedy,Family,Fantasy,Horror",Not Documentary
16740,16741,90.0,"Comedy,Family,Sci-Fi",Not Documentary
16741,16742,,Documentary,Documentary
16742,16743,,Documentary,Documentary
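
The warning shown before the table appears because mov_ott_runtime was created by selecting columns from mov_ott, so pandas cannot tell whether the new column is meant for the subset or for the original. One common way to avoid it is to take an explicit copy when creating the subset. The sketch below uses made-up sample data and illustrative variable names.

```python
import numpy as np
import pandas as pd

# Illustrative sample standing in for mov_ott.
mov_sample = pd.DataFrame({
    "ID": [1, 2, 3],
    "Runtime": [148.0, np.nan, 90.0],
    "Genres": ["Action,Sci-Fi", "Documentary", None],
    "Title": ["A", "B", "C"],
})

# .copy() makes the subset an independent DataFrame.
runtime_sample = mov_sample[["ID", "Runtime", "Genres"]].copy()

# na=False treats a missing genre as "does not contain 'Documentary'".
is_doc = runtime_sample["Genres"].str.contains("Documentary", na=False)
runtime_sample["Gen_doc"] = np.where(is_doc, "Documentary", "Not Documentary")
```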


In [14]:
# Use apply() on the Series (determine the length of each string).
mov_ott_runtime.Gen_doc.apply(len)

0        15
1        15
2        15
3        15
4        15
         ..
16739    15
16740    15
16741    11
16742    11
16743    11
Name: Gen_doc, Length: 16744, dtype: int64
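
apply() also accepts a user-defined function, so the documentary label itself could be produced with apply() rather than np.where. The helper name below is hypothetical. Unlike str.contains, the sketch matches 'Documentary' as a whole entry in the comma-separated genre list.

```python
import numpy as np
import pandas as pd

# Illustrative genre strings, including one missing value.
genres = pd.Series(["Action,Sci-Fi", "Documentary,History", np.nan], name="Genres")


def label_documentary(value):
    """Label one Genres entry; missing values count as not a documentary."""
    if isinstance(value, str) and "Documentary" in value.split(","):
        return "Documentary"
    return "Not Documentary"


# Apply the function to every value in the Series.
gen_doc = genres.apply(label_documentary)

# Number of movies per label.
doc_counts = gen_doc.value_counts()
```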

# Challenge 

Challenge: Think of a way to help the political party save money. Mandisa suggests subtracting 0.01 minutes (0.6 seconds) from the Runtime of every movie, which could produce a saving if the party limits the number of adverts during a movie.

In [12]:
# Determine original runtime.
mov_ott_runtime[['ID', 'Runtime']]

Unnamed: 0,ID,Runtime
0,1,148.0
1,2,136.0
2,3,149.0
3,4,116.0
4,5,161.0
...,...,...
16739,16740,120.0
16740,16741,90.0
16741,16742,
16742,16743,


In [13]:
# Subtract 0.01 minutes from the runtime.
mov_ott_runtime['Runtime'].subtract(0.01)

0        147.99
1        135.99
2        148.99
3        115.99
4        160.99
          ...  
16739    119.99
16740     89.99
16741       NaN
16742       NaN
16743     31.99
Name: Runtime, Length: 16744, dtype: float64
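
Because Runtime is stored in minutes, a value given in seconds has to be converted before it is subtracted: 0.01 minutes is 0.6 seconds, whereas 6 seconds would be 0.1 minutes. A minimal sketch of doing the conversion explicitly, with a hypothetical helper named seconds_to_minutes and made-up sample runtimes:

```python
import numpy as np
import pandas as pd

SECONDS_PER_MINUTE = 60


def seconds_to_minutes(seconds):
    """Convert a duration in seconds to minutes."""
    return seconds / SECONDS_PER_MINUTE


# Illustrative runtimes in minutes, including one missing value.
runtime = pd.Series([148.0, 136.0, np.nan], name="Runtime")

# 0.6 seconds is the 0.01 minutes subtracted in the cell above.
shortened = runtime.subtract(seconds_to_minutes(0.6))

# For comparison, 6 seconds expressed in minutes.
six_seconds = seconds_to_minutes(6)
```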

# 3.2.6 Practical activity 

# Scenario
This is a continuation of the analysis you’re doing alongside Mandisa Nkosi to inform decision-making in politics, first introduced in 3.1.5 Practical activity: Create and merge the DataFrames. 

The political party has two questions in mind when it comes to running its ads:

Should the ads be more generic and appeal to as wide an audience as possible?
Should the ads be more provocative and be aired during films with the appropriate age rating?
They ask Mandisa for her input.

# Objective
Mandisa comes up with a few questions she feels might help the political party to make a decision and that she can answer through analysis:

What is the average rating per movie?
How many movies were released per content rating (age)?
How many movies were released per year?

# What is the average rating per movie?

In [15]:
# Create a new DataFrame with the rating columns.
mov_ott_ratings = mov_ott[['ID', 'IMDb', 'Rotten Tomatoes']]

# View the DataFrame.
mov_ott_ratings

Unnamed: 0,ID,IMDb,Rotten Tomatoes
0,1,8.8,0.87
1,2,8.7,0.87
2,3,8.5,0.84
3,4,8.5,0.96
4,5,8.8,0.97
...,...,...,...
16739,16740,6.2,
16740,16741,4.7,
16741,16742,5.7,
16742,16743,6.6,


In [16]:
# Replace missing values with 0.
mov_ott_ratings_final = mov_ott_ratings.fillna(0)

# View the DataFrame.
mov_ott_ratings_final

Unnamed: 0,ID,IMDb,Rotten Tomatoes
0,1,8.8,0.87
1,2,8.7,0.87
2,3,8.5,0.84
3,4,8.5,0.96
4,5,8.8,0.97
...,...,...,...
16739,16740,6.2,0.00
16740,16741,4.7,0.00
16741,16742,5.7,0.00
16742,16743,6.6,0.00


In [17]:
# Add a new column to the DataFrame indicating average rating.

# Average rating is ((IMDb / 10) + Rotten Tomatoes) / 2.
# IMDb is scored out of 10, so dividing by 10 puts both ratings on a 0-1 scale.
# Write a user-defined function.
def av_col2(df1, df2):
    df = (df1/10 + df2)/2
    return df

mov_ott_ratings_final['Rating'] = av_col2(mov_ott_ratings_final['IMDb'],
                                          mov_ott_ratings_final['Rotten Tomatoes'])

# View the DataFrame.
mov_ott_ratings_final    

Unnamed: 0,ID,IMDb,Rotten Tomatoes,Rating
0,1,8.8,0.87,0.875
1,2,8.7,0.87,0.870
2,3,8.5,0.84,0.845
3,4,8.5,0.96,0.905
4,5,8.8,0.97,0.925
...,...,...,...,...
16739,16740,6.2,0.00,0.310
16740,16741,4.7,0.00,0.235
16741,16742,5.7,0.00,0.285
16742,16743,6.6,0.00,0.330
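
Note that a movie with no Rotten Tomatoes score ends up with half of its scaled IMDb score, because the missing value was replaced with 0 before averaging. An alternative, which is not the approach used in this activity, is to average only the scores that exist. The sketch below compares the two on made-up sample rows.

```python
import numpy as np
import pandas as pd

# Illustrative ratings; the second movie has no Rotten Tomatoes score.
ratings = pd.DataFrame({"IMDb": [8.8, 6.2], "Rotten Tomatoes": [0.87, np.nan]})

# Put both ratings on a 0-1 scale.
scaled = pd.DataFrame({
    "IMDb": ratings["IMDb"] / 10,
    "Rotten Tomatoes": ratings["Rotten Tomatoes"],
})

# Approach used in the activity: missing scores count as 0.
rating_fill_zero = scaled.fillna(0).mean(axis=1)

# Alternative: mean(axis=1) skips missing values by default.
rating_skip_missing = scaled.mean(axis=1)
```

Which approach is appropriate depends on whether a missing score should count against a movie.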


# How many movies were released per content rating (age)?

In [18]:
# Category type count. 
def cat_cnt(df1):
    print(df1.value_counts())

# Number of movies released per 'Age'.
df = mov_ott['Age'].astype('category')

# View the output.
cat_cnt(df)

18+    3474
7+     1462
13+    1255
all     843
16+     320
Name: Age, dtype: int64
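
The counts above sum to 7,354 of the 16,744 rows, so most titles have no Age value; value_counts() leaves missing values out by default. The sketch below, on a made-up sample, shows one way to include them and to return the counts instead of printing them. The function name category_counts is hypothetical.

```python
import numpy as np
import pandas as pd

# Illustrative age ratings, including one missing value.
age = pd.Series(["18+", "7+", "18+", np.nan, "all"], name="Age")


def category_counts(series, include_missing=True):
    """Return counts per category instead of printing them."""
    return series.value_counts(dropna=not include_missing)


# Counts per rating, with missing values as their own row.
counts = category_counts(age)

# Proportions among the non-missing values.
shares = age.value_counts(normalize=True)
```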


# How many movies were released per year?

In [19]:
# Reuse cat_cnt() defined in the previous cell.

# Number of movies released per 'Year'.
df = mov_ott['Year'].astype('category')

# View the output.
cat_cnt(df)

2017    1401
2018    1285
2016    1206
2015    1065
2014     986
        ... 
1916       1
1912       1
1917       1
1924       1
1902       1
Name: Year, Length: 109, dtype: int64
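
value_counts() orders the result by frequency, which makes trends over time hard to read. A sketch of two alternative views, on a made-up sample of years: counts in chronological order and counts grouped by decade.

```python
import pandas as pd

# Illustrative release years.
year = pd.Series([2017, 2018, 2017, 1985, 2016, 2018, 2017], name="Year")

# Counts ordered by year instead of by frequency.
per_year = year.value_counts().sort_index()

# Counts per decade, for example 1985 falls in 1980.
per_decade = (year // 10 * 10).value_counts().sort_index()
```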
