### LSE Data Analytics Online Career Accelerator

# DA201: Data Analytics Using Python

## Practical activity: Create and merge the DataFrames

**Scenario**

Mandisa Nkosi is working with a political party that needs to decide how best to invest its available advertising budget. Mandisa believes she can gain some insights into potential advertising avenues by analysing films that are available on streaming platforms.

This analysis uses the `movies_merge.xlsx` and `ott_merge.csv` data sets. Your objectives at this stage are to prepare for analysis by:

- importing the Excel and CSV files into DataFrames
- viewing the DataFrames
- describing the DataFrames to understand the structures and data types
- merging the two DataFrames into a single DataFrame.

The insights gained from the analysis will inform the campaign, promotional materials, slogans, and language the political party will use to reach potential voters.

## 1. Import Pandas

In [1]:
# Import necessary package.
import pandas as pd

## 2. Import Excel file

In [3]:
# Load the Excel data using pd.read_excel.
movies = pd.read_excel('movies_merge.xlsx')

# View the column names.
print(movies.columns)

Index(['ID', 'Title', 'Year', 'Age', 'IMDb', 'Rotten Tomatoes', 'Directors',
       'Genres', 'Country', 'Language', 'Runtime'],
      dtype='object')


## 3. Import CSV file

In [4]:
# Load the CSV data using pd.read_csv.
ott = pd.read_csv('ott_merge.csv')

# View the column names.
print(ott.columns)

Index(['ID', 'Netflix', 'Hulu', 'Prime Video', 'Disney+'], dtype='object')
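
As a minimal sketch of what `pd.read_csv` does with a file like this one, the example below reads a few rows from an in-memory string instead of from disk. The `csv_text` contents and the `ott_sample` name are illustrative assumptions, not the real contents of `ott_merge.csv`.

```python
import io

import pandas as pd

# Hypothetical stand-in for the first rows of a file shaped like ott_merge.csv.
csv_text = """ID,Netflix,Hulu,Prime Video,Disney+
1,0,0,1,0
2,0,1,0,0
3,0,0,1,0
"""

# read_csv accepts a file path or any file-like object, such as StringIO.
ott_sample = pd.read_csv(io.StringIO(csv_text))

# The header row becomes the column names; each remaining line becomes a row.
print(ott_sample.columns.tolist())
print(ott_sample.shape)
```

Reading from a string like this can be a convenient way to experiment with import options before pointing the same call at the real file.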


## 4. Validate the DataFrames

In [5]:
# Check the shape and the first rows to confirm the data imported correctly.
print(movies.shape)
movies.head()

(16744, 11)


| | ID | Title | Year | Age | IMDb | Rotten Tomatoes | Directors | Genres | Country | Language | Runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Inception | 2010 | 13+ | 8.8 | 0.87 | Christopher Nolan | Action,Adventure,Sci-Fi,Thriller | United States,United Kingdom | English,Japanese,French | 148.0 |
| 1 | 2 | The Matrix | 1999 | 18+ | 8.7 | 0.87 | Lana Wachowski,Lilly Wachowski | Action,Sci-Fi | United States | English | 136.0 |
| 2 | 3 | Avengers: Infinity War | 2018 | 13+ | 8.5 | 0.84 | Anthony Russo,Joe Russo | Action,Adventure,Sci-Fi | United States | English | 149.0 |
| 3 | 4 | Back to the Future | 1985 | 7+ | 8.5 | 0.96 | Robert Zemeckis | Adventure,Comedy,Sci-Fi | United States | English | 116.0 |
| 4 | 5 | The Good, the Bad and the Ugly | 1966 | 18+ | 8.8 | 0.97 | Sergio Leone | Western | Italy,Spain,West Germany | Italian | 161.0 |


In [6]:
# Check the shape and the first rows to confirm the data imported correctly.
print(ott.shape)
ott.head()

(16744, 5)


| | ID | Netflix | Hulu | Prime Video | Disney+ |
|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 | 0 |
| 1 | 2 | 0 | 1 | 0 | 0 |
| 2 | 3 | 0 | 0 | 1 | 0 |
| 3 | 4 | 1 | 0 | 0 | 0 |
| 4 | 5 | 0 | 0 | 1 | 0 |
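
Both DataFrames have the same number of rows and share an `ID` column, so it may also be worth confirming that `ID` can safely act as a join key. The sketch below runs those checks on two small hand-made frames; `movies_sample` and `ott_sample` are hypothetical stand-ins for `movies` and `ott`.

```python
import pandas as pd

# Hypothetical three-row stand-ins for the movies and ott DataFrames.
movies_sample = pd.DataFrame({
    'ID': [1, 2, 4],
    'Title': ['Inception', 'The Matrix', 'Back to the Future'],
})
ott_sample = pd.DataFrame({'ID': [1, 2, 4], 'Netflix': [0, 0, 1]})

# A join key should not repeat within either DataFrame.
ids_unique = movies_sample['ID'].is_unique and ott_sample['ID'].is_unique

# Every ID in one DataFrame should also appear in the other.
ids_match = set(movies_sample['ID']) == set(ott_sample['ID'])

# Count missing values per column to spot incomplete imports.
missing = movies_sample.isna().sum()

print(ids_unique, ids_match)
print(missing)
```

If either check came back `False` on the real data, the row count after merging could differ from the row count before it.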


## 5. Describe the data types

In [7]:
# Determine the data types.
print(ott.dtypes)
print(movies.dtypes)

ID             int64
Netflix        int64
Hulu           int64
Prime Video    int64
Disney+        int64
dtype: object
ID                   int64
Title               object
Year                 int64
Age                 object
IMDb               float64
Rotten Tomatoes    float64
Directors           object
Genres              object
Country             object
Language            object
Runtime            float64
dtype: object
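
Once the data types are known, it can help to separate numeric columns from text columns, because only the numeric ones are meaningful inputs to sums and means. This is a minimal sketch using `select_dtypes` on a hypothetical `sample` frame that mimics a few of the `movies` columns.

```python
import pandas as pd

# Hypothetical sample with a mix of numeric and text columns.
sample = pd.DataFrame({
    'ID': [1, 2],
    'Title': ['Inception', 'The Matrix'],
    'IMDb': [8.8, 8.7],
    'Runtime': [148.0, 136.0],
})

# Columns holding integers or floats.
numeric_cols = sample.select_dtypes(include='number').columns.tolist()

# Everything else (here, the text column).
text_cols = sample.select_dtypes(exclude='number').columns.tolist()

print(numeric_cols)
print(text_cols)
```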


## 6. Combine the two DataFrames
### a) merge()

In [8]:
# Merge the two DataFrames.
df_mov_ott = pd.merge(movies, ott, how='left', on='ID')

# View the new DataFrame.
print(df_mov_ott.shape)
df_mov_ott.head()

(16744, 15)


| | ID | Title | Year | Age | IMDb | Rotten Tomatoes | Directors | Genres | Country | Language | Runtime | Netflix | Hulu | Prime Video | Disney+ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Inception | 2010 | 13+ | 8.8 | 0.87 | Christopher Nolan | Action,Adventure,Sci-Fi,Thriller | United States,United Kingdom | English,Japanese,French | 148.0 | 0 | 0 | 1 | 0 |
| 1 | 2 | The Matrix | 1999 | 18+ | 8.7 | 0.87 | Lana Wachowski,Lilly Wachowski | Action,Sci-Fi | United States | English | 136.0 | 0 | 1 | 0 | 0 |
| 2 | 3 | Avengers: Infinity War | 2018 | 13+ | 8.5 | 0.84 | Anthony Russo,Joe Russo | Action,Adventure,Sci-Fi | United States | English | 149.0 | 0 | 0 | 1 | 0 |
| 3 | 4 | Back to the Future | 1985 | 7+ | 8.5 | 0.96 | Robert Zemeckis | Adventure,Comedy,Sci-Fi | United States | English | 116.0 | 1 | 0 | 0 | 0 |
| 4 | 5 | The Good, the Bad and the Ugly | 1966 | 18+ | 8.8 | 0.97 | Sergio Leone | Western | Italy,Spain,West Germany | Italian | 161.0 | 0 | 0 | 1 | 0 |
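
The merged shape of (16744, 15) matches expectations: the row count is unchanged and the four platform columns have been added to the original eleven. One way to make such a merge self-checking is to use the `validate` and `indicator` parameters of `pd.merge`. The sketch below uses small hypothetical frames in which one film deliberately has no match.

```python
import pandas as pd

# Hypothetical frames: the film with ID 3 has no row in ott_sample.
movies_sample = pd.DataFrame({
    'ID': [1, 2, 3],
    'Title': ['Inception', 'The Matrix', 'Avengers: Infinity War'],
})
ott_sample = pd.DataFrame({'ID': [1, 2], 'Netflix': [0, 0]})

# validate raises an error if ID repeats on either side;
# indicator adds a _merge column showing where each row came from.
merged = pd.merge(
    movies_sample,
    ott_sample,
    how='left',
    on='ID',
    validate='one_to_one',
    indicator=True,
)

# Rows kept by the left join that found no partner in ott_sample.
unmatched = merged[merged['_merge'] == 'left_only']

print(merged)
print(unmatched['ID'].tolist())
```

With `how='left'`, unmatched films are kept and their platform columns are filled with `NaN`, so the `_merge` column is a quick way to find them.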


### b) concat()

In [9]:
# Concatenate the two DataFrames (axis=0 stacks the rows of ott beneath movies).
mov_ott_concat = pd.concat([movies, ott], axis=0)

# View the new DataFrame.
print(mov_ott_concat.shape)
mov_ott_concat.head()

(33488, 15)


| | ID | Title | Year | Age | IMDb | Rotten Tomatoes | Directors | Genres | Country | Language | Runtime | Netflix | Hulu | Prime Video | Disney+ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Inception | 2010.0 | 13+ | 8.8 | 0.87 | Christopher Nolan | Action,Adventure,Sci-Fi,Thriller | United States,United Kingdom | English,Japanese,French | 148.0 | NaN | NaN | NaN | NaN |
| 1 | 2 | The Matrix | 1999.0 | 18+ | 8.7 | 0.87 | Lana Wachowski,Lilly Wachowski | Action,Sci-Fi | United States | English | 136.0 | NaN | NaN | NaN | NaN |
| 2 | 3 | Avengers: Infinity War | 2018.0 | 13+ | 8.5 | 0.84 | Anthony Russo,Joe Russo | Action,Adventure,Sci-Fi | United States | English | 149.0 | NaN | NaN | NaN | NaN |
| 3 | 4 | Back to the Future | 1985.0 | 7+ | 8.5 | 0.96 | Robert Zemeckis | Adventure,Comedy,Sci-Fi | United States | English | 116.0 | NaN | NaN | NaN | NaN |
| 4 | 5 | The Good, the Bad and the Ugly | 1966.0 | 18+ | 8.8 | 0.97 | Sergio Leone | Western | Italy,Spain,West Germany | Italian | 161.0 | NaN | NaN | NaN | NaN |
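
The concatenated result has 33488 rows (twice 16744) because `axis=0` stacks one DataFrame beneath the other rather than matching rows on `ID`. Columns that exist in only one DataFrame are filled with missing values, which is also why `Year` is now shown as a float. The sketch below contrasts the two axes on small hypothetical frames named `left` and `right`.

```python
import pandas as pd

# Hypothetical two-row frames that share only the ID column.
left = pd.DataFrame({'ID': [1, 2], 'Title': ['Inception', 'The Matrix']})
right = pd.DataFrame({'ID': [1, 2], 'Netflix': [0, 0]})

# axis=0 appends rows; columns missing from one frame become NaN.
stacked = pd.concat([left, right], axis=0)

# axis=1 places frames side by side, aligned on the row index (not on ID).
# The duplicate ID column is dropped from the right frame first.
side_by_side = pd.concat([left, right.drop(columns='ID')], axis=1)

print(stacked.shape)
print(side_by_side.shape)
```

Note that `axis=1` only gives a sensible result when both frames list the films in the same order. `merge()` does not rely on row order, which makes it the safer choice for combining on a key such as `ID`.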


In [10]:
# Question 1: How many films from each release year (2012 to the present) are available on Netflix?

mo_gpby = df_mov_ott.groupby('Year')[['Netflix']].sum().reset_index()

mo_gpby[mo_gpby['Year']>=2012]

| | Year | Netflix |
|---|---|---|
| 100 | 2012 | 195 |
| 101 | 2013 | 200 |
| 102 | 2014 | 202 |
| 103 | 2015 | 252 |
| 104 | 2016 | 236 |
| 105 | 2017 | 294 |
| 106 | 2018 | 307 |
| 107 | 2019 | 138 |
| 108 | 2020 | 31 |
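
Because each platform column holds 1 when a film is available and 0 otherwise, summing the column within a year counts the films on that platform. An equivalent approach is to filter the rows first and group afterwards, sketched here on a hypothetical `sample` frame.

```python
import pandas as pd

# Hypothetical sample: one row per film, Netflix is a 0/1 availability flag.
sample = pd.DataFrame({
    'Year': [2011, 2012, 2012, 2012, 2013],
    'Netflix': [1, 1, 1, 0, 1],
})

# Keep only films released in 2012 or later.
recent = sample[sample['Year'] >= 2012]

# Summing the 0/1 flags counts the Netflix films per release year.
netflix_by_year = recent.groupby('Year')['Netflix'].sum().reset_index()

print(netflix_by_year)
```

Filtering first means the grouped result contains only the years of interest, so no second filtering step is needed.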


In [11]:
# Question 2: What is the average runtime of movies released each year?

mo_gpby1 = df_mov_ott.groupby('Year')[['Runtime']].mean().reset_index()

In [12]:
mo_gpby1[mo_gpby1['Year']>=2012]

| | Year | Runtime |
|---|---|---|
| 100 | 2012 | 90.511714 |
| 101 | 2013 | 91.54594 |
| 102 | 2014 | 92.995763 |
| 103 | 2015 | 92.914646 |
| 104 | 2016 | 93.99214 |
| 105 | 2017 | 94.460961 |
| 106 | 2018 | 94.635678 |
| 107 | 2019 | 93.410413 |
| 108 | 2020 | 93.976562 |
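
When reading these averages, keep in mind that `mean()` skips missing values by default, so a yearly average is based only on the films that have a runtime recorded. The sketch below, on a hypothetical `sample` frame, reports the count alongside the mean to make that visible.

```python
import pandas as pd

# Hypothetical sample: one 2012 film has no recorded runtime.
sample = pd.DataFrame({
    'Year': [2012, 2012, 2013],
    'Runtime': [90.0, float('nan'), 100.0],
})

# count reports how many non-missing runtimes went into each mean.
summary = sample.groupby('Year')['Runtime'].agg(['mean', 'count'])

print(summary)
```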


In [13]:
# Question 3: What are the highest and lowest Rotten Tomatoes scores that movies received each year?

mo_gpby2 = df_mov_ott.groupby('Year')[['Rotten Tomatoes']].agg(['max', 'min']).reset_index()

In [14]:
mo_gpby2[mo_gpby2['Year']>=2012]

| | Year | Rotten Tomatoes (max) | Rotten Tomatoes (min) |
|---|---|---|---|
| 100 | 2012 | 1.0 | 0.04 |
| 101 | 2013 | 1.0 | 0.02 |
| 102 | 2014 | 1.0 | 0.05 |
| 103 | 2015 | 1.0 | 0.05 |
| 104 | 2016 | 1.0 | 0.02 |
| 105 | 2017 | 1.0 | 0.04 |
| 106 | 2018 | 1.0 | 0.06 |
| 107 | 2019 | 1.0 | 0.05 |
| 108 | 2020 | 1.0 | 0.06 |
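
Passing a list of functions to `agg()` produces two-level column names, which can be awkward to filter and export. A possible alternative is named aggregation, which yields flat column names. In this sketch the `sample` frame and the output names `rt_max` and `rt_min` are hypothetical.

```python
import pandas as pd

# Hypothetical sample of Rotten Tomatoes scores by release year.
sample = pd.DataFrame({
    'Year': [2012, 2012, 2013, 2013],
    'Rotten Tomatoes': [1.0, 0.04, 0.9, 0.02],
})

# Named aggregation: new_column=(source_column, function).
rt_range = (
    sample.groupby('Year')
    .agg(
        rt_max=('Rotten Tomatoes', 'max'),
        rt_min=('Rotten Tomatoes', 'min'),
    )
    .reset_index()
)

print(rt_range)
```

The result has ordinary single-level columns (`Year`, `rt_max`, `rt_min`), so it can be filtered with the same `rt_range[rt_range['Year'] >= 2012]` pattern used in the earlier questions.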
