# **Exploratory Data Analysis: Pirated Movies Dataset**

---

## **General Objectives**

---

The general objective of this project is to analyze a [Kaggle Dataset](https://www.kaggle.com/datasets/arsalanrehman/movies-dataset-from-piracy-website) gathered from a piracy website with a user base of around 2M visitors per month. The data contains more than 20,000 movies from industries such as Hollywood, Bollywood, Anime, etc. The goal is to describe the data as well as possible, using data science methods and statistical approaches, and to produce a detailed report.

The steps for the creation of the report are as follows:

- Collecting and Understanding the Data.
- Data Prep and Transformation.
- Univariate Analysis.
- Multivariate Analysis.
- Questions, Insights and Answers.

## **Importing Packages and Collecting the Dataset**

---

In [8]:
# Importing libraries needed for the project:
import pandas as pd
import numpy as np
import seaborn as sns
import warnings
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator, FuncFormatter
from matplotlib.font_manager import FontProperties

# Setting seaborn plot parameters:
sns.set_theme(context='notebook', style='darkgrid')

# Filtering out warnings:
warnings.filterwarnings('ignore')

# Setting pandas dataframe visualization parameters:
pd.set_option('display.max_columns', 100)

print('Packages collected!')

Packages collected!


In [9]:
# Creating the Dataframe with original data:
data = pd.read_csv('data/movies_dataset.csv', sep=',')

- First rows:

In [10]:
data.head()

Unnamed: 0.1,Unnamed: 0,IMDb-rating,appropriate_for,director,downloads,id,industry,language,posted_date,release_date,run_time,storyline,title,views,writer
0,0,4.8,R,John Swab,304,372092,Hollywood / English,English,"20 Feb, 2023",Jan 28 2023,105,Doc\r\n facilitates a fragile truce between th...,Little Dixie,2794,John Swab
1,1,6.4,TV-PG,Paul Ziller,73,372091,Hollywood / English,English,"20 Feb, 2023",Feb 05 2023,84,Caterer\r\n Goldy Berry reunites with detectiv...,Grilling Season: A Curious Caterer Mystery,1002,John Christian Plummer
2,2,5.2,R,Ben Wheatley,1427,343381,Hollywood / English,"English,Hindi","20 Apr, 2021",Jun 18 2021,1h 47min,As the world searches for a cure to a disastro...,In the Earth,14419,Ben Wheatley
3,3,8.1,,Venky Atluri,1549,372090,Tollywood,Hindi,"20 Feb, 2023",Feb 17 2023,139,The life of a young man and his struggles agai...,Vaathi,4878,Venky Atluri
4,4,4.6,,Shaji Kailas,657,372089,Tollywood,Hindi,"20 Feb, 2023",Jan 26 2023,122,A man named Kalidas gets stranded due to the p...,Alone,2438,Rajesh Jayaraman


- Sample rows:

In [11]:
data.sample(5)

Unnamed: 0.1,Unnamed: 0,IMDb-rating,appropriate_for,director,downloads,id,industry,language,posted_date,release_date,run_time,storyline,title,views,writer
11875,11875,,,,1782,371740,Wrestling,English,"13 Feb, 2023",Feb 10 2023,,,WWE Smackdown 2023-02-10,4811,
3033,3033,8.8,,Xavier Manrique,75,371744,Hollywood / English,English,"13 Feb, 2023",Feb 03 2023,101.0,Follows\r\n a New York City family hiding out ...,Who Invited Charlie?,1746,Nicholas Schutt
6200,6200,6.6,TV-14,Simone Stock,763,371877,Hollywood / English,English,"15 Feb, 2023",Feb 11 2023,88.0,It follows Kara Robinson as she survives an ab...,The Girl Who Escaped: The Kara Robinson Story,7587,Haley Harris
3240,3240,6.0,R,Navot Papushado,1668,346905,Hollywood / English,"English,Russian","15 Jul, 2021",Jul 14 2021,114.0,Three generations of women fight back against ...,Gunpowder Milkshake,11285,"Navot Papushado, Ehud Lavski"
17069,17069,7.1,R,Elegance Bratton,472,371991,Hollywood / English,English,"17 Feb, 2023",Dec 02 2022,95.0,"A\r\n young, gay Black man, rejected by his mo...",The Inspection,6035,Elegance Bratton


## **Understanding the Dataset**

---

Before we can start describing and analyzing the Data, it's important to comprehend what variables are present in the Dataset and the values they hold. For that, we will use a simple markdown table explaining the contents of the Dataset:

| Variable Name         | Variable Contents                         | Variable Importance for Analysis | Comments About Variable                                                                          |
|-----------------------|-------------------------------------------|----------------------------------|--------------------------------------------------------------------------------------------------|
| **`Unnamed: 0`**      | Row id native to the Dataset              | 🔴 Irrelevant                     | Not needed, since pandas DataFrames already come with an index                                   |
| **`IMDb-rating`**     | Rating of the movie on IMDb               | 🟢 High                           | None                                                                                             |
| **`appropriate_for`** | Movie classification rating               | 🟡 Medium                         | Interesting possibilities for analysis. Be careful with the number of different rating systems   |
| **`director`**        | Name of the movie director                | 🟡 Medium                         | None                                                                                             |
| **`downloads`**       | Number of downloads per movie             | 🟢 High                           | None                                                                                             |
| **`id`**              | Unique id per movie                       | 🔴 Irrelevant                     | Same reason as the first variable                                                                |
| **`industry`**        | Industry that produced the movie          | 🟢 High                           | None                                                                                             |
| **`language`**        | Available languages for the movie         | 🟠 Low                            | Not very important for the analysis as a whole, but not totally irrelevant                       |
| **`posted_date`**     | When the movie was posted on the platform | 🟢 High                           | Very important metric for the analysis                                                           |
| **`release_date`**    | When the movie was released worldwide     | 🟢 High                           | In conjunction with the variable above, opens up lots of analytical possibilities                |
| **`run_time`**        | Runtime of the movie (minutes)            | 🟡 Medium                         | None                                                                                             |
| **`storyline`**       | Movie synopsis                            | 🔴 Irrelevant                     | For the purposes of this analysis, the movie storyline is not needed                             |
| **`title`**           | Movie title                               | 🟢 High                           | Without a name, there is no movie!                                                               |
| **`views`**           | Number of clicks per movie                | 🟢 High                           | Very important metric alongside downloads                                                        |
| **`writer`**          | List of all the movie writers             | 🟠 Low                            | None                                                                                             |


With this table, we can identify the more and less important variables of the Dataset. Before transforming or deleting any data, however, we will keep everything in place so we can still analyze and describe it. We begin by checking the Dataset dimensions with the **`.shape`** attribute:

In [25]:
# Checking dataset dimensions:
print(f'Total Variables: {data.shape[1]}\nTotal Rows: {data.shape[0]}')

Total Variables: 15
Total Rows: 20548


We can see that this is a rather large Dataset, containing more than 20k rows. Now, let's use the **`.info()`** method to gather more detailed information about these variables:

In [26]:
# Checking general information about the Dataset:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20548 entries, 0 to 20547
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Unnamed: 0       20548 non-null  int64  
 1   IMDb-rating      19707 non-null  float64
 2   appropriate_for  11072 non-null  object 
 3   director         18610 non-null  object 
 4   downloads        20547 non-null  object 
 5   id               20548 non-null  int64  
 6   industry         20547 non-null  object 
 7   language         20006 non-null  object 
 8   posted_date      20547 non-null  object 
 9   release_date     20547 non-null  object 
 10  run_time         18780 non-null  object 
 11  storyline        18847 non-null  object 
 12  title            20547 non-null  object 
 13  views            20547 non-null  object 
 14  writer           18356 non-null  object 
dtypes: float64(1), int64(2), object(12)
memory usage: 2.4+ MB


Now we have a lot more information about the specific variables of the Dataset:

**Null Values**

- We have null values in all columns except the **`Unnamed: 0`** and **`id`** variables.

**Data Types**

- In the Dataset, we have one **`float64`** variable, two **`int64`** variables and twelve **`object`** variables.

**Memory Usage**

- The Dataset, in its current form, consumes at least 2.4 MB of memory (the `+` in the output indicates that the contents of the object columns are not fully counted).

**Proposed Changes**

- Assess and deal with the null values;

- Change most of the data types to improve memory efficiency. This is also the most important change here, because most variables are not stored in an appropriate format, e.g. **`posted_date`** is an object instead of datetime, and **`run_time`** is an object instead of float or int.
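
As a minimal sketch of the data type changes proposed above, the snippet below converts a few string columns on a small hand-built sample rather than on the full Dataset. The date formats and the thousands-separator handling are assumptions based on the rows displayed earlier, not confirmed properties of every row, and the `sample` and `converted` names are illustrative only.

```python
import pandas as pd

# Small illustrative frame: values copied from rows shown earlier,
# stored as strings to mimic the object dtype reported by .info().
sample = pd.DataFrame({
    "posted_date": ["20 Feb, 2023", "20 Apr, 2021", "15 Jul, 2021"],
    "release_date": ["Jan 28 2023", "Jun 18 2021", "Jul 14 2021"],
    "downloads": ["304", "1427", "1668"],
    "views": ["2794", "14419", "11285"],
})

converted = sample.copy()

# The explicit formats are assumptions taken from the displayed rows;
# errors="coerce" turns any value that does not match into NaT.
converted["posted_date"] = pd.to_datetime(
    converted["posted_date"], format="%d %b, %Y", errors="coerce")
converted["release_date"] = pd.to_datetime(
    converted["release_date"], format="%b %d %Y", errors="coerce")

# Assumption: the counters may contain thousands separators such as "1,427".
# Values that still cannot be parsed become NaN instead of raising.
for col in ["downloads", "views"]:
    converted[col] = pd.to_numeric(
        converted[col].str.replace(",", "", regex=False), errors="coerce")

# deep=True also counts the memory held by the string contents.
before = sample.memory_usage(deep=True).sum()
after = converted.memory_usage(deep=True).sum()

print(converted.dtypes)
print(f"Memory before: {before} bytes, after: {after} bytes")
```

On the real Dataset, it would be worth counting how many values `errors="coerce"` turns into `NaT` or `NaN` before trusting the converted columns.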

Now, let's check the exact number of null values in each column, as well as the percentage of rows it represents:

In [29]:
# Checking total null values:
null_count = (data
              .isna()
              .sum()
              .sort_values(ascending=False))

null_count

appropriate_for    9476
writer             2192
director           1938
run_time           1768
storyline          1701
IMDb-rating         841
language            542
downloads             1
industry              1
posted_date           1
release_date          1
title                 1
views                 1
Unnamed: 0            0
id                    0
dtype: int64

In [33]:
# Check percentage of total size:
null_percentage = round((data.isna().sum()/data.shape[0]) * 100, 4)

null_percentage.sort_values(ascending=False)

appropriate_for    46.1164
writer             10.6677
director            9.4316
run_time            8.6042
storyline           8.2782
IMDb-rating         4.0929
language            2.6377
downloads           0.0049
industry            0.0049
posted_date         0.0049
release_date        0.0049
title               0.0049
views               0.0049
Unnamed: 0          0.0000
id                  0.0000
dtype: float64
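
The count and the percentage can also be computed together in a single table. The sketch below shows one possible way to do this on a tiny hand-built frame (the `sample` and `null_summary` names are illustrative, not part of the original notebook); it relies on the fact that the mean of a boolean mask is the fraction of `True` values.

```python
import pandas as pd

# Tiny illustrative frame with a few missing values.
sample = pd.DataFrame({
    "title": ["Little Dixie", "Vaathi", "Alone", None],
    "appropriate_for": ["R", None, None, None],
    "id": [372092, 372090, 372089, 371740],
})

# isna().sum() counts nulls per column; isna().mean() gives their share.
null_summary = (pd.DataFrame({
                    "null_count": sample.isna().sum(),
                    "null_percentage": (sample.isna().mean() * 100).round(4),
                })
                .sort_values("null_count", ascending=False))

print(null_summary)
```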

Immediately we can see that the variable **`appropriate_for`** is basically useless for the analysis. With 9,476 missing values, representing more than 46% of the total data size, this variable is too sparse to be reliably filled in. For that reason, it will be dropped in the cleaning step.

The **`writer`** column also has a considerable number of missing values: 2,192 nulls, representing about 10.7% of the total data size. As this variable was already deemed low priority in the description table above, it will also be dropped in the cleaning phase.

The same cannot be said of the **`director`** variable: although it has almost as many null values as the **`writer`** column, it has substantial potential for analysis and was labeled Medium priority in the table describing the variables.

The **`run_time`** and **`storyline`** variables have a similar number of missing values (about 8.6% and 8.3% of the total data, respectively). The first can and will be filled in using the **`mode()`** of the variable (this metric is not much affected by outliers and should maintain data consistency); the second will be dropped, as stated in the table describing the variables.
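
Because **`run_time`** appears in mixed formats in the rows shown earlier (`105`, `101.0`, `1h 47min`), the mode is only meaningful once the values share a single unit. The following is a rough sketch of that idea on a hand-built sample; `runtime_to_minutes` is a hypothetical helper, and the set of formats it handles is an assumption based only on the displayed rows.

```python
import re
import pandas as pd


def runtime_to_minutes(value):
    """Convert values such as '105', '101.0' or '1h 47min' to minutes."""
    if pd.isna(value):
        return float("nan")
    text = str(value).strip().lower()
    # Optional hours part followed by an optional minutes part.
    match = re.fullmatch(r"(?:(\d+)\s*h)?\s*(?:(\d+)\s*min)?", text)
    if match and (match.group(1) or match.group(2)):
        hours = int(match.group(1) or 0)
        minutes = int(match.group(2) or 0)
        return float(hours * 60 + minutes)
    try:
        # Plain numbers are assumed to already be in minutes.
        return float(text)
    except ValueError:
        return float("nan")


# Illustrative values mimicking the formats seen in the displayed rows.
sample = pd.Series(["105", "84", "1h 47min", None, "105", "101.0"],
                   name="run_time")

minutes = sample.map(runtime_to_minutes)

# mode() can return several values when there is a tie; take the first one.
most_common = minutes.mode().iloc[0]
filled = minutes.fillna(most_common)

print(filled)
```

Taking the first value returned by `mode()` is a simplification; on the full Dataset it is worth checking whether several run times are tied.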

All other variables have few or no null values and can be easily filled in or dropped (like the **`Unnamed: 0`** column).
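
The dropping decisions described above could look roughly like the sketch below. It runs on a tiny hand-built frame, and removing the row with a missing **`title`** is an assumption about how the single-null rows might be handled, not a decision stated in the analysis.

```python
import pandas as pd

# Tiny illustrative frame containing the columns marked for removal.
sample = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "appropriate_for": ["R", None, None],
    "storyline": ["synopsis A", None, "synopsis C"],
    "writer": ["John Swab", None, None],
    "title": ["Little Dixie", "Vaathi", None],
    "views": ["2794", "4878", None],
})

# Columns the analysis above decided to drop.
columns_to_drop = ["Unnamed: 0", "appropriate_for", "writer", "storyline"]

cleaned = (sample
           .drop(columns=columns_to_drop, errors="ignore")  # skip absent names
           .dropna(subset=["title"])                         # assumption: drop untitled rows
           .reset_index(drop=True))

print(cleaned)
```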


## **Data Prep, Cleaning and Feature Engineering**

---