### LSE Data Analytics Online Career Accelerator

# DA201: Data Analytics Using Python

## Practical activity: Reshaping a DataFrame

**Scenario**

Mandisa Nkosi is working with a political party that needs to decide how best to invest its available advertising budget. Mandisa believes she can gain some insights into potential advertising avenues by analysing films that are available on streaming platforms.

This analysis uses the `movies_merge.xlsx` and `ott_merge.csv` data sets. Using the pivot() function, you will organise the DataFrame by:

- the film release date and content rating
- the title of movies, the directors, and genres by content rating
- the title of movies, the release year, and language by content rating
- Netflix-screened movies based on language, runtime, and country
- the title of movies, specified language, potential runtime, and genres by content rating.

The insights gained from the analysis will inform the campaign, promotional materials, slogans, and language the political party will use to reach potential voters.

## Prepare your workstation

In [23]:
# Import Pandas and NumPy.
import pandas as pd
import numpy as np

# Load the Excel data using pd.read_excel.
movies = pd.read_excel('movies_merge.xlsx')

# Load the csv data using pd.read_csv.
ott = pd.read_csv('ott_merge.csv')

# Data imported correctly?
print(movies.columns)
print(movies.shape)
print(ott.columns)
print(ott.shape)

# Merge the two DataFrames.
mov_ott = pd.merge(movies, ott, how='left', on='ID')

# DataFrames merged correctly?
print(mov_ott.columns)
print(mov_ott.shape)

Index(['ID', 'Title', 'Year', 'Age', 'IMDb', 'Rotten Tomatoes', 'Directors',
       'Genres', 'Country', 'Language', 'Runtime'],
      dtype='object')
(16744, 11)
Index(['ID', 'Netflix', 'Hulu', 'Prime Video', 'Disney+'], dtype='object')
(16744, 5)
Index(['ID', 'Title', 'Year', 'Age', 'IMDb', 'Rotten Tomatoes', 'Directors',
       'Genres', 'Country', 'Language', 'Runtime', 'Netflix', 'Hulu',
       'Prime Video', 'Disney+'],
      dtype='object')
(16744, 15)
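The shape check above confirms the row and column counts. As a complementary check, the sketch below uses the `validate` and `indicator` parameters of `pd.merge` on two small toy DataFrames. The names `movies_demo`, `ott_demo`, and `merged_demo` and their values are illustrative assumptions, not part of the activity data.

```python
import pandas as pd

# Toy stand-ins for the movies and ott DataFrames (illustrative values only).
movies_demo = pd.DataFrame({'ID': [1, 2, 3],
                            'Title': ['Film A', 'Film B', 'Film C']})
ott_demo = pd.DataFrame({'ID': [1, 2, 4],
                         'Netflix': [1, 0, 1]})

# validate raises an error if 'ID' is duplicated on either side.
# indicator adds a '_merge' column recording where each row came from.
merged_demo = pd.merge(movies_demo, ott_demo, how='left', on='ID',
                       validate='one_to_one', indicator=True)

# View the merged rows and how many matched.
print(merged_demo)
print(merged_demo['_merge'].value_counts())
```

Rows marked `left_only` had no match in the right-hand DataFrame, so their streaming columns are filled with `NaN`.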


In [4]:
# View the DataFrame.
mov_ott

,ID,Title,Year,Age,IMDb,Rotten Tomatoes,Directors,Genres,Country,Language,Runtime,Netflix,Hulu,Prime Video,Disney+
0,1,Inception,2010,13+,8.8,0.87,Christopher Nolan,"Action,Adventure,Sci-Fi,Thriller","United States,United Kingdom","English,Japanese,French",148.0,0,0,1,0
1,2,The Matrix,1999,18+,8.7,0.87,"Lana Wachowski,Lilly Wachowski","Action,Sci-Fi",United States,English,136.0,0,1,0,0
2,3,Avengers: Infinity War,2018,13+,8.5,0.84,"Anthony Russo,Joe Russo","Action,Adventure,Sci-Fi",United States,English,149.0,0,0,1,0
3,4,Back to the Future,1985,7+,8.5,0.96,Robert Zemeckis,"Adventure,Comedy,Sci-Fi",United States,English,116.0,1,0,0,0
4,5,"The Good, the Bad and the Ugly",1966,18+,8.8,0.97,Sergio Leone,Western,"Italy,Spain,West Germany",Italian,161.0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16739,16740,The Ghosts of Buxley Hall,1980,,6.2,,Bruce Bilson,"Comedy,Family,Fantasy,Horror",United States,English,120.0,0,0,1,0
16740,16741,The Poof Point,2001,7+,4.7,,Neal Israel,"Comedy,Family,Sci-Fi",United States,English,90.0,0,0,1,0
16741,16742,Sharks of Lost Island,2013,,5.7,,Neil Gelinas,Documentary,United States,English,,0,0,1,0
16742,16743,Man Among Cheetahs,2017,,6.6,,Richard Slater-Jones,Documentary,United States,English,,0,0,1,0


## 1: The film release date and content rating

In [5]:
# Determine movies per year and age group.
movies.pivot(index='Title', columns='Age', 
             values='Year')

Age,NaN,13+,16+,18+,7+,all
Title,,,,,,
Inception,,2010.0,,,,
The Matrix,,,,1999.0,,
Avengers: Infinity War,,2018.0,,,,
Back to the Future,,,,,1985.0,
"The Good, the Bad and the Ugly",,,,1966.0,,
...,...,...,...,...,...,...
The Ghosts of Buxley Hall,1980.0,,,,,
The Poof Point,,,,,2001.0,
Sharks of Lost Island,2013.0,,,,,
Man Among Cheetahs,2017.0,,,,,
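The output above is wide and mostly `NaN` because each title has only one content rating. The sketch below reproduces the same reshaping on a toy DataFrame (`demo` and its values are illustrative assumptions). It also shows that `pivot()` raises a `ValueError` when an index/column pair is duplicated, and that `pivot_table()` is one possible fallback because it aggregates duplicates.

```python
import pandas as pd

# Toy data (illustrative values only).
demo = pd.DataFrame({'Title': ['Film A', 'Film B', 'Film C'],
                     'Age': ['13+', '18+', '13+'],
                     'Year': [2010, 1999, 2018]})

# One row per title, one column per content rating.
wide = demo.pivot(index='Title', columns='Age', values='Year')
print(wide)

# Repeat the first row to create a duplicate (Title, Age) pair.
dupes = pd.concat([demo, demo.iloc[[0]]], ignore_index=True)

# pivot() cannot reshape duplicated pairs.
try:
    dupes.pivot(index='Title', columns='Age', values='Year')
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print('pivot() failed:', err)

# pivot_table() aggregates duplicates; here it keeps the first value.
wide_agg = dupes.pivot_table(index='Title', columns='Age',
                             values='Year', aggfunc='first')
print(wide_agg)
```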


## 2: The title of movies, the directors, and genres by content rating

In [None]:
# Determine movies, directors, and genres per age group.
movies.pivot(index='Title', columns='Age', 
             values=['Directors', 'Genres'])
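
Passing a list to `values` produces columns with two levels: the value name and the content rating. The sketch below shows, on toy data, how the result might be navigated; `demo`, `genres_only`, and `cell` are hypothetical names used for illustration.

```python
import pandas as pd

# Toy data (illustrative values only).
demo = pd.DataFrame({'Title': ['Film A', 'Film B'],
                     'Age': ['13+', '18+'],
                     'Directors': ['Director X', 'Director Y'],
                     'Genres': ['Action', 'Western']})

# Two value columns give a two-level column index.
wide = demo.pivot(index='Title', columns='Age',
                  values=['Directors', 'Genres'])
print(wide.columns.nlevels)

# Select every content rating for one value column.
genres_only = wide['Genres']
print(genres_only)

# Select a single cell with a (value, rating) tuple.
cell = wide.loc['Film A', ('Directors', '13+')]
print(cell)
```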

## 3: The title of movies, the release year, and language by content rating

In [None]:
# Determine movies, year, and language per age group.
movies.pivot(index='Title', columns='Age', 
             values=['Year', 'Language'])

## 4: Netflix-screened movies based on language, runtime, and country

In [None]:
# Determine the language, runtime, and country of movies screened by Netflix.
mov_ott.pivot(index='Title', columns='Netflix', 
              values=['Language', 'Runtime', 'Country'])
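
Pivoting on `Netflix` spreads the values across a `0` column (not on Netflix) and a `1` column (on Netflix). If only the Netflix titles are of interest, one alternative is to filter the rows first. This is an assumption about the intent rather than part of the original activity, and `demo` and `netflix_only` are illustrative names.

```python
import pandas as pd

# Toy data (illustrative values only).
demo = pd.DataFrame({'Title': ['Film A', 'Film B', 'Film C'],
                     'Language': ['English', 'Italian', 'English'],
                     'Runtime': [148.0, 161.0, 90.0],
                     'Country': ['United States', 'Italy', 'United States'],
                     'Netflix': [1, 0, 1]})

# Keep only the rows flagged as available on Netflix.
netflix_only = demo.loc[demo['Netflix'] == 1,
                        ['Title', 'Language', 'Runtime', 'Country']]
print(netflix_only)
```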

## 5: The title of movies, specified language, potential runtime, and genres by content rating

In [None]:
# Determine the movies, language, runtime, and genres per age group.
mov_ott.pivot(index='Title', columns='Age', 
              values=['Language', 'Runtime','Genres'])

In [17]:
# Check the column names of the merged DataFrame.
print(mov_ott.columns)

Index(['ID', 'Title', 'Year', 'Age', 'IMDb', 'Rotten Tomatoes', 'Directors',
       'Genres', 'Country', 'Language', 'Runtime', 'Netflix', 'Hulu',
       'Prime Video', 'Disney+'],
      dtype='object')


In [21]:
# Create a DataFrame with the ID, Runtime, and Genres columns.

mov_ott_runtime = mov_ott[['ID', 'Runtime', 'Genres']]
mov_ott_runtime

,ID,Runtime,Genres
0,1,148.0,"Action,Adventure,Sci-Fi,Thriller"
1,2,136.0,"Action,Sci-Fi"
2,3,149.0,"Action,Adventure,Sci-Fi"
3,4,116.0,"Adventure,Comedy,Sci-Fi"
4,5,161.0,Western
...,...,...,...
16739,16740,120.0,"Comedy,Family,Fantasy,Horror"
16740,16741,90.0,"Comedy,Family,Sci-Fi"
16741,16742,,Documentary
16742,16743,,Documentary


In [22]:
# Add 1 minute (60 seconds) to 'Runtime'.

mov_ott_runtime['Runtime'].add(1)

0        149.0
1        137.0
2        150.0
3        117.0
4        162.0
         ...  
16739    121.0
16740     91.0
16741      NaN
16742      NaN
16743     33.0
Name: Runtime, Length: 16744, dtype: float64
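
Note that `add()` returns a new Series and leaves the original column unchanged, and that missing runtimes stay `NaN`. The sketch below illustrates both points on a toy Series; `runtime` and `plus_one` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy runtimes in minutes (illustrative values only).
runtime = pd.Series([148.0, 136.0, np.nan], name='Runtime')

# add() returns a new Series; the original is not modified.
plus_one = runtime.add(1)

print(runtime)
print(plus_one)
```

To keep the result, assign it back to a column or to a new variable.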

In [24]:
# Label each movie as 'Documentary' or 'Not Documentary' based on 'Genres'.
mov_ott_runtime['Gen_doc'] = np.where(mov_ott_runtime['Genres'].str.contains('Documentary'),
                                      'Documentary', 'Not Documentary')

# View the DataFrame.
mov_ott_runtime

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  mov_ott_runtime['Gen_doc'] = np.where(mov_ott_runtime['Genres'].str.contains('Documentary'),


,ID,Runtime,Genres,Gen_doc
0,1,148.0,"Action,Adventure,Sci-Fi,Thriller",Not Documentary
1,2,136.0,"Action,Sci-Fi",Not Documentary
2,3,149.0,"Action,Adventure,Sci-Fi",Not Documentary
3,4,116.0,"Adventure,Comedy,Sci-Fi",Not Documentary
4,5,161.0,Western,Not Documentary
...,...,...,...,...
16739,16740,120.0,"Comedy,Family,Fantasy,Horror",Not Documentary
16740,16741,90.0,"Comedy,Family,Sci-Fi",Not Documentary
16741,16742,,Documentary,Documentary
16742,16743,,Documentary,Documentary
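
The warning above appears because `mov_ott_runtime` was created by selecting columns from `mov_ott`, so Pandas cannot tell whether the assignment is meant to affect the original. A minimal sketch of one common way to avoid it is to take an explicit `.copy()` of the selection. The sketch also passes `na=False` to `str.contains()`, since a missing genre otherwise yields a missing value in the mask, which `np.where` may treat as true depending on the Pandas version. The names `source`, `subset`, and `is_doc` and the toy values are assumptions.

```python
import numpy as np
import pandas as pd

# Toy data with one missing genre (illustrative values only).
source = pd.DataFrame({'ID': [1, 2, 3],
                       'Runtime': [148.0, np.nan, 90.0],
                       'Genres': ['Action,Sci-Fi', 'Documentary', np.nan]})

# An explicit copy makes the subset independent of the source DataFrame.
subset = source[['ID', 'Runtime', 'Genres']].copy()

# na=False treats a missing genre as "does not contain 'Documentary'".
is_doc = subset['Genres'].str.contains('Documentary', na=False)

# Assigning to the copy does not touch the source DataFrame.
subset['Gen_doc'] = np.where(is_doc, 'Documentary', 'Not Documentary')
print(subset)
```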


In [29]:
# Count the number of characters in each 'Gen_doc' label.
mov_ott_runtime.Gen_doc.apply(len)

0        15
1        15
2        15
3        15
4        15
         ..
16739    15
16740    15
16741    11
16742    11
16743    11
Name: Gen_doc, Length: 16744, dtype: int64
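
The lengths above are 15 for `'Not Documentary'` and 11 for `'Documentary'`. As a possible alternative to `apply(len)`, the vectorised `.str.len()` accessor gives the same counts and returns `NaN` for missing values, whereas `len` would raise a `TypeError` on them. The Series `labels` below is a toy example, not the activity data.

```python
import pandas as pd

# Toy labels with one missing value (illustrative only).
labels = pd.Series(['Not Documentary', 'Documentary', None])

# .str.len() counts characters and passes missing values through as NaN.
lengths = labels.str.len()
print(lengths)
```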

In [32]:
# Subtract 0.01 minutes (0.6 seconds) from each movie's runtime.
mov_ott_runtime['Runtime'].subtract(0.01)

0        147.99
1        135.99
2        148.99
3        115.99
4        160.99
          ...  
16739    119.99
16740     89.99
16741       NaN
16742       NaN
16743     31.99
Name: Runtime, Length: 16744, dtype: float64
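
Because `Runtime` is stored in minutes, an adjustment given in seconds needs converting before it is subtracted. The sketch below shows the arithmetic on a toy Series, assuming a hypothetical adjustment of 6 seconds; `SECONDS_PER_MINUTE`, `seconds_to_remove`, and `shorter` are illustrative names.

```python
import pandas as pd

SECONDS_PER_MINUTE = 60

# Toy runtimes in minutes (illustrative values only).
runtime_min = pd.Series([148.0, 136.0], name='Runtime')

# Convert the adjustment from seconds to minutes: 6 / 60 = 0.1.
seconds_to_remove = 6
minutes_to_remove = seconds_to_remove / SECONDS_PER_MINUTE

# subtract() returns a new Series with the shortened runtimes.
shorter = runtime_min.subtract(minutes_to_remove)
print(shorter)
```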