### LSE Data Analytics Online Career Accelerator

# DA201: Data Analytics Using Python

## Practical activity: apply() function

**Scenario**

Mandisa Nkosi is working with a political party that needs to decide how best to invest its available advertising budget. Mandisa believes she can gain some insights into potential advertising avenues by analysing films that are available on streaming platforms.

This analysis uses the `movies_merge.xlsx` and `ott_merge.csv` data sets. To help the political party decide how it might best invest its budget, Mandisa will answer the following business questions:

- What is the effect of adding 60 seconds (1 minute) to each movie?
- Which movies are documentaries?

The insights gained from the analysis will inform the campaign, promotional materials, slogans, and language the political party will use to reach potential voters.

### Prepare your workstation

In [2]:
# Import Pandas and NumPy.
import pandas as pd
import numpy as np

# Load the Excel data using pd.read_excel.
movies = pd.read_excel('movies_merge.xlsx')

# Load the CSV data using pd.read_csv.
ott = pd.read_csv('ott_merge.csv')

# Check that the data imported correctly.
print(movies.columns)
print(movies.shape)
print(ott.columns)
print(ott.shape)

# Merge the two DataFrames on the shared 'ID' column.
mov_ott = pd.merge(movies, ott, how='left', on='ID')

# Check that the DataFrames merged correctly.
print(mov_ott.columns)
print(mov_ott.shape)

Index(['ID', 'Title', 'Year', 'Age', 'IMDb', 'Rotten Tomatoes', 'Directors',
       'Genres', 'Country', 'Language', 'Runtime'],
      dtype='object')
(16744, 11)
Index(['ID', 'Netflix', 'Hulu', 'Prime Video', 'Disney+'], dtype='object')
(16744, 5)
Index(['ID', 'Title', 'Year', 'Age', 'IMDb', 'Rotten Tomatoes', 'Directors',
       'Genres', 'Country', 'Language', 'Runtime', 'Netflix', 'Hulu',
       'Prime Video', 'Disney+'],
      dtype='object')
(16744, 15)
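
The shapes above suggest that every row in `movies` found exactly one match in `ott`. As a minimal sketch of how that could be checked more directly, the example below uses the `validate` and `indicator` parameters of `pd.merge()` on two small made-up DataFrames. The names `movies_demo`, `ott_demo`, `merged_demo`, and `unmatched` are hypothetical and are not part of the activity data.

```python
import pandas as pd

# Small stand-ins for the movies and ott DataFrames (values are made up).
movies_demo = pd.DataFrame({'ID': [1, 2, 3], 'Title': ['A', 'B', 'C']})
ott_demo = pd.DataFrame({'ID': [1, 2, 4], 'Netflix': [1, 0, 1]})

# validate='one_to_one' raises an error if 'ID' is duplicated on either side.
# indicator=True adds a '_merge' column showing where each row came from.
merged_demo = pd.merge(movies_demo, ott_demo, how='left', on='ID',
                       validate='one_to_one', indicator=True)

# Rows found only in the left DataFrame have no streaming data.
unmatched = merged_demo[merged_demo['_merge'] == 'left_only']

print(merged_demo.shape)
print(unmatched['ID'].tolist())
```

With a left merge, an unmatched ID keeps its row but receives `NaN` in the columns from the right DataFrame, so the row count alone does not reveal missing matches.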



## Question 1: What is the effect of adding 60 seconds (1 minute) to each movie?

In [3]:
# Select the ID, runtime, and genre columns.
mov_ott_runtime = mov_ott[['ID', 'Runtime', 'Genres']]

# View the output.
mov_ott_runtime

| | ID | Runtime | Genres |
|---|---|---|---|
| 0 | 1 | 148.0 | Action,Adventure,Sci-Fi,Thriller |
| 1 | 2 | 136.0 | Action,Sci-Fi |
| 2 | 3 | 149.0 | Action,Adventure,Sci-Fi |
| 3 | 4 | 116.0 | Adventure,Comedy,Sci-Fi |
| 4 | 5 | 161.0 | Western |
| ... | ... | ... | ... |
| 16739 | 16740 | 120.0 | Comedy,Family,Fantasy,Horror |
| 16740 | 16741 | 90.0 | Comedy,Family,Sci-Fi |
| 16741 | 16742 | NaN | Documentary |
| 16742 | 16743 | NaN | Documentary |


In [4]:
# Add 1 minute (60 seconds) to each runtime; Runtime is measured in minutes.
mov_ott_runtime['Runtime'].add(1)

0        149.0
1        137.0
2        150.0
3        117.0
4        162.0
         ...  
16739    121.0
16740     91.0
16741      NaN
16742      NaN
16743     33.0
Name: Runtime, Length: 16744, dtype: float64
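
The output shows two things worth noting: missing runtimes stay `NaN`, and `.add()` returns a new Series rather than changing the DataFrame. The following sketch illustrates both points on a small made-up Series. `runtime_demo` is a hypothetical name used only for illustration.

```python
import numpy as np
import pandas as pd

# Made-up runtimes in minutes, including one missing value.
runtime_demo = pd.Series([148.0, 136.0, np.nan], name='Runtime')

added_method = runtime_demo.add(1)   # Series method
added_operator = runtime_demo + 1    # arithmetic operator

# Both forms give the same result, and the missing value stays NaN.
print(added_method.equals(added_operator))
print(added_method.tolist())

# The original Series is unchanged because nothing was assigned back.
print(runtime_demo.tolist())
```

To keep the adjusted values, the result would need to be assigned to a column or a new variable.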


## Question 2: Which movies are documentaries?

In [5]:
# Work on an explicit copy so that adding a column does not raise
# a SettingWithCopyWarning.
mov_ott_runtime = mov_ott_runtime.copy()

# Create a new column that flags documentaries.
mov_ott_runtime['Gen_doc'] = np.where(mov_ott_runtime['Genres'].str.contains('Documentary'),
                                      'Documentary', 'Not Documentary')

# View the DataFrame.
mov_ott_runtime


| | ID | Runtime | Genres | Gen_doc |
|---|---|---|---|---|
| 0 | 1 | 148.0 | Action,Adventure,Sci-Fi,Thriller | Not Documentary |
| 1 | 2 | 136.0 | Action,Sci-Fi | Not Documentary |
| 2 | 3 | 149.0 | Action,Adventure,Sci-Fi | Not Documentary |
| 3 | 4 | 116.0 | Adventure,Comedy,Sci-Fi | Not Documentary |
| 4 | 5 | 161.0 | Western | Not Documentary |
| ... | ... | ... | ... | ... |
| 16739 | 16740 | 120.0 | Comedy,Family,Fantasy,Horror | Not Documentary |
| 16740 | 16741 | 90.0 | Comedy,Family,Sci-Fi | Not Documentary |
| 16741 | 16742 | NaN | Documentary | Documentary |
| 16742 | 16743 | NaN | Documentary | Documentary |
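
One caveat: if `Genres` contains missing values (an assumption, as none appear in the rows displayed here), `str.contains()` returns `NaN` for those rows, and `np.where()` may then label them unexpectedly. A minimal sketch of one way to make the behaviour explicit is to pass `na=False`, shown below on made-up data. `genres_demo`, `is_doc`, and `gen_doc_demo` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up genres, including one missing value.
genres_demo = pd.Series(['Action,Sci-Fi', 'Documentary,History', np.nan],
                        name='Genres')

# na=False makes str.contains() return False where the genre is missing.
is_doc = genres_demo.str.contains('Documentary', na=False)

# Label each row based on the Boolean mask.
gen_doc_demo = np.where(is_doc, 'Documentary', 'Not Documentary')

print(gen_doc_demo.tolist())
```

Whether a film with no recorded genre should count as 'Not Documentary' is a judgement call. The point is only that the choice is stated in the code.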


In [6]:
# Use apply() to determine the length of each string.
mov_ott_runtime.Gen_doc.apply(len)

0        15
1        15
2        15
3        15
4        15
         ..
16739    15
16740    15
16741    11
16742    11
16743    11
Name: Gen_doc, Length: 16744, dtype: int64
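
Here `apply(len)` calls Python's built-in `len()` once per value, which is why the result is 15 for 'Not Documentary' and 11 for 'Documentary'. As a small illustration on made-up data, the pandas string accessor `.str.len()` gives the same lengths. `gen_doc_demo` is a hypothetical name.

```python
import pandas as pd

# Two example labels matching those in the Gen_doc column.
gen_doc_demo = pd.Series(['Not Documentary', 'Documentary'], name='Gen_doc')

len_apply = gen_doc_demo.apply(len)   # calls len() on each value
len_str = gen_doc_demo.str.len()      # pandas string accessor

print(len_apply.tolist())
print(len_str.tolist())
```

One practical difference: `apply(len)` raises a `TypeError` if a value is a missing float `NaN`, whereas `.str.len()` returns `NaN` for that row.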


## Challenge

In [7]:
# Determine original runtime.
mov_ott_runtime[['ID', 'Runtime']]

| | ID | Runtime |
|---|---|---|
| 0 | 1 | 148.0 |
| 1 | 2 | 136.0 |
| 2 | 3 | 149.0 |
| 3 | 4 | 116.0 |
| 4 | 5 | 161.0 |
| ... | ... | ... |
| 16739 | 16740 | 120.0 |
| 16740 | 16741 | 90.0 |
| 16741 | 16742 | NaN |
| 16742 | 16743 | NaN |


In [8]:
# Subtract 0.01 from runtime.
mov_ott_runtime['Runtime'].subtract(0.01)

0        147.99
1        135.99
2        148.99
3        115.99
4        160.99
          ...  
16739    119.99
16740     89.99
16741       NaN
16742       NaN
16743     31.99
Name: Runtime, Length: 16744, dtype: float64

### Using lambda functions

In [9]:
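# Add 1 minute (60 seconds) to each runtime with a lambda function.
# Note: this overwrites mov_ott['Runtime'], so re-running the cell adds another minute.
mov_ott['Runtime'] = mov_ott['Runtime'].apply(lambda x: x + 1)

# View the output.
mov_ott['Runtime']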

0        149.0
1        137.0
2        150.0
3        117.0
4        162.0
         ...  
16739    121.0
16740     91.0
16741      NaN
16742      NaN
16743     33.0
Name: Runtime, Length: 16744, dtype: float64
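
A lambda is convenient for a one-off expression. When the same logic is reused with different values, a named function can be passed to `apply()` instead, with extra keyword arguments forwarded to it. The sketch below assumes a hypothetical helper `add_minutes` and a made-up Series `runtime_demo`. Neither is part of the activity.

```python
import numpy as np
import pandas as pd

def add_minutes(runtime, minutes=1):
    """Return the runtime increased by the given number of minutes."""
    return runtime + minutes

# Made-up runtimes in minutes, including one missing value.
runtime_demo = pd.Series([148.0, 90.0, np.nan], name='Runtime')

# Default behaviour: add one minute to each value.
plus_one = runtime_demo.apply(add_minutes)

# Keyword arguments given to apply() are passed on to the function.
plus_five = runtime_demo.apply(add_minutes, minutes=5)

print(plus_one.tolist())
print(plus_five.tolist())
```

Because the results are stored in new variables, running this code repeatedly does not keep increasing the original values.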

In [10]:
# Select the ID, IMDb, and Rotten Tomatoes columns.
mov_ott_ratings = mov_ott[['ID', 'IMDb', 'Rotten Tomatoes']]

# View the DataFrame.
mov_ott_ratings

# Replace missing values with 0.
mov_ott_ratings_final = mov_ott_ratings.fillna(0)

# View the DataFrame.
mov_ott_ratings_final

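| | ID | IMDb | Rotten Tomatoes |
|---|---|---|---|
| 0 | 1 | 8.8 | 0.87 |
| 1 | 2 | 8.7 | 0.87 |
| 2 | 3 | 8.5 | 0.84 |
| 3 | 4 | 8.5 | 0.96 |
| 4 | 5 | 8.8 | 0.97 |
| ... | ... | ... | ... |
| 16739 | 16740 | 6.2 | 0.00 |
| 16740 | 16741 | 4.7 | 0.00 |
| 16741 | 16742 | 5.7 | 0.00 |
| 16742 | 16743 | 6.6 | 0.00 |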


In [11]:
# Add a new column to the DataFrame indicating the average rating.
# Average rating is ((IMDb/10) + Rotten Tomatoes)/2, so both scores are on a 0-1 scale.
# Write a user-defined function.
def av_col2(df1, df2):
    df = (df1/10 + df2)/2
    return df

mov_ott_ratings_final['Rating'] = av_col2(mov_ott_ratings_final['IMDb'],
                                          mov_ott_ratings_final['Rotten Tomatoes'])

# View the DataFrame.
mov_ott_ratings_final 

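| | ID | IMDb | Rotten Tomatoes | Rating |
|---|---|---|---|---|
| 0 | 1 | 8.8 | 0.87 | 0.875 |
| 1 | 2 | 8.7 | 0.87 | 0.870 |
| 2 | 3 | 8.5 | 0.84 | 0.845 |
| 3 | 4 | 8.5 | 0.96 | 0.905 |
| 4 | 5 | 8.8 | 0.97 | 0.925 |
| ... | ... | ... | ... | ... |
| 16739 | 16740 | 6.2 | 0.00 | 0.310 |
| 16740 | 16741 | 4.7 | 0.00 | 0.235 |
| 16741 | 16742 | 5.7 | 0.00 | 0.285 |
| 16742 | 16743 | 6.6 | 0.00 | 0.330 |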
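
Because missing values were replaced with 0, a film with an IMDb score of 6.2 and a Rotten Tomatoes score of 0.00 (likely a filled-in missing value) ends up with a rating of 0.310, roughly half its IMDb score. As a hedged alternative, the sketch below keeps missing scores as `NaN` and averages only the scores that exist, using a row-wise `apply()` with `axis=1`. The names `ratings_demo` and `row_rating` are hypothetical, and the data are made up.

```python
import numpy as np
import pandas as pd

# Made-up ratings with missing values left as NaN.
ratings_demo = pd.DataFrame({'IMDb': [8.8, 6.2, np.nan],
                             'Rotten Tomatoes': [0.87, np.nan, 0.50]})

def row_rating(row):
    """Rescale IMDb to 0-1, then average whichever scores are present."""
    scores = pd.Series([row['IMDb'] / 10, row['Rotten Tomatoes']])
    return scores.mean()   # mean() skips NaN by default

# axis=1 passes each row to the function as a Series.
ratings_demo['Rating'] = ratings_demo.apply(row_rating, axis=1)

# Vectorised equivalent, shown for comparison.
vectorised = pd.concat([ratings_demo['IMDb'] / 10,
                        ratings_demo['Rotten Tomatoes']], axis=1).mean(axis=1)

print(ratings_demo)
print(np.allclose(ratings_demo['Rating'], vectorised))
```

Row-wise `apply()` is easy to read but slower on large DataFrames, so the vectorised form is usually preferable for a data set of this size.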


In [12]:
# Categorical count. 
def cat_cnt(df1):
    print(df1.value_counts())

# Number of movies per 'Age' rating.
df = mov_ott['Age'].astype('category')

# View the output.
cat_cnt(df)

18+    3474
7+     1462
13+    1255
all     843
16+     320
Name: Age, dtype: int64


In [13]:
# Reuse the cat_cnt() function defined in the previous cell.

# Number of movies released per 'Year'.
df = mov_ott['Year'].astype('category')

# View the output.
cat_cnt(df)

2017    1401
2018    1285
2016    1206
2015    1065
2014     986
        ... 
1916       1
1912       1
1917       1
1924       1
1902       1
Name: Year, Length: 109, dtype: int64
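
`cat_cnt()` prints the counts, so the result cannot be reused in later steps. A minimal sketch of a variant that returns the counts instead, with an option to sort by label rather than by frequency, might look like this. `cat_count`, its `by_label` parameter, and `year_demo` are hypothetical names, and the data are made up.

```python
import pandas as pd

def cat_count(series, by_label=False):
    """Return value counts, sorted by label if requested, else by frequency."""
    counts = series.value_counts()
    return counts.sort_index() if by_label else counts

# Made-up release years.
year_demo = pd.Series([2017, 2018, 2017, 2016, 2017, 2018], name='Year')

by_freq = cat_count(year_demo)                  # most common year first
by_year = cat_count(year_demo, by_label=True)   # chronological order

print(by_freq)
print(by_year)
```

Sorting by label makes it easier to read the counts as a timeline, which the frequency-ordered output above does not show directly.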
