# **General Objectives**

---

The general objective of this project is to analyze data from a [Kaggle Dataset](https://www.kaggle.com/datasets/arsalanrehman/movies-dataset-from-piracy-website) gathered from a piracy website with a user base of around 2M visitors per month. The data covers more than 20,000 movies from industries such as Hollywood, Bollywood, Anime, etc. The goal is to describe the data as well as possible, using data science methods and statistical approaches, and to produce a detailed report.

The steps for the creation of the report are as follows:

- Collecting and Understanding the Data.
- Data Prep and Transformation.
- Univariate Analysis.
- Multivariate Analysis.
- Questions, Insights and Answers.

# **Importing Packages and Collecting the Dataset**

---

In [364]:
# Importing libraries needed for the project:
import pandas as pd
import numpy as np
import seaborn as sns
import warnings
import matplotlib.pyplot as plt
import re
from matplotlib.pyplot import MaxNLocator, FuncFormatter
from matplotlib.font_manager import FontProperties

# Setting seaborn plot parameters:
sns.set_theme(context='notebook', style='darkgrid')

# Filtering out warnings:
warnings.filterwarnings('ignore')

# Setting pandas dataframe visualization parameters:
pd.set_option('display.max_columns', 100)

print('Packages collected!')

Packages collected!


In [365]:
# Creating the Dataframe with original data:
data = pd.read_csv('data/movies_dataset.csv', sep=',')

- First rows:

In [366]:
data.head()

Unnamed: 0.1,Unnamed: 0,IMDb-rating,appropriate_for,director,downloads,id,industry,language,posted_date,release_date,run_time,storyline,title,views,writer
0,0,4.8,R,John Swab,304,372092,Hollywood / English,English,"20 Feb, 2023",Jan 28 2023,105,Doc\r\n facilitates a fragile truce between th...,Little Dixie,2794,John Swab
1,1,6.4,TV-PG,Paul Ziller,73,372091,Hollywood / English,English,"20 Feb, 2023",Feb 05 2023,84,Caterer\r\n Goldy Berry reunites with detectiv...,Grilling Season: A Curious Caterer Mystery,1002,John Christian Plummer
2,2,5.2,R,Ben Wheatley,1427,343381,Hollywood / English,"English,Hindi","20 Apr, 2021",Jun 18 2021,1h 47min,As the world searches for a cure to a disastro...,In the Earth,14419,Ben Wheatley
3,3,8.1,,Venky Atluri,1549,372090,Tollywood,Hindi,"20 Feb, 2023",Feb 17 2023,139,The life of a young man and his struggles agai...,Vaathi,4878,Venky Atluri
4,4,4.6,,Shaji Kailas,657,372089,Tollywood,Hindi,"20 Feb, 2023",Jan 26 2023,122,A man named Kalidas gets stranded due to the p...,Alone,2438,Rajesh Jayaraman


- Sample rows:

In [367]:
data.sample(5)

Unnamed: 0.1,Unnamed: 0,IMDb-rating,appropriate_for,director,downloads,id,industry,language,posted_date,release_date,run_time,storyline,title,views,writer
14805,14805,5.4,,Michael Tully,1383,96616,Hollywood / English,English,"10 Jun, 2014",Jan 18 2014,1h 32min,A family vacation during the summer of 1985 ch...,Ping Pong Summer,10349,"Michael Tully, Michael Tully"
5584,5584,6.6,TV-14,Simone Stock,763,371877,Hollywood / English,English,"15 Feb, 2023",Feb 11 2023,88,It follows Kara Robinson as she survives an ab...,The Girl Who Escaped: The Kara Robinson Story,7573,Haley Harris
7210,7210,8.1,,Venky Atluri,1869,372090,Tollywood,Hindi,"20 Feb, 2023",Feb 17 2023,139,The life of a young man and his struggles agai...,Vaathi,6045,Venky Atluri
16311,16311,,,,14063,71434,Stage shows,Hindi,"30 Jul, 2013",Jul 29 2013,,,Yeh Hai Naya Zamana,29657,
6131,6131,5.9,,Brendan Walter,604,319072,Hollywood / English,English,"25 Nov, 2019",Sep 23 2018,1h 27min,"After the death of his fiancée, an American il...",Spell,7674,Barak Hardley


# **Understanding the Dataset**

---

Before we can start describing and analyzing the Data, it's important to comprehend what variables are present in the Dataset and the values they hold. For that, we will use a simple markdown table explaining the contents of the Dataset:

**Table 1.1: Variable Dictionary and Analysis**

| Variable Name             | Variable Contents                         | Variable Importance for Analysis | Comments About Variable                                                                        |
|---------------------------|-------------------------------------------|----------------------------------|------------------------------------------------------------------------------------------------|
| **`Unnamed: 0`**          | Row id native to the Dataset              | 🔴 Irrelevant                     | Not needed since pandas DataFrames already come with an index                                  |
| **`IMDb-rating`**         | Rating of the movie on IMDb               | 🟢 High                           | None                                                                                           |
| **`appropriate_for`**     | Movie classification rating               | 🟡 Medium                         | Interesting possibilities for analysis. Be careful with the number of different rating systems |
| **`director`**            | Name of movie director                    | 🟡 Medium                         | None                                                                                           |
| **`downloads`**           | Number of downloads per movie             | 🟢 High                           | None                                                                                           |
| **`id`**                  | Unique Id per movie                       | 🔴 Irrelevant                     | Same reason as the first variable                                                              |
| **`industry`**            | Industry that produced the movie          | 🟢 High                           | None                                                                                           |
| **`language`**            | Available languages for the movie         | 🟠 Low                            | Not very important for the analysis as a whole, but not totally irrelevant                     |
| **`posted_date`**         | When the movie was posted on the platform | 🟢 High                           | Very important metric for the analysis                                                         |
| **`release_date`**        | When the movie was released worldwide     | 🟢 High                           | In conjunction with the variable above, opens up lots of analytical possibilities              |
| **`run_time`**            | Runtime of the movie (minutes)            | 🟡 Medium                         | None                                                                                           |
| **`storyline`**           | Movie synopsis                            | 🔴 Irrelevant                     | For the purposes of this analysis, the movie storyline is not needed                           |
| **`title`**               | Movie title                               | 🟢 High                           | Without a name, there is no movie!                                                             |
| **`views`**               | Number of clicks per movie                | 🟢 High                           | Very important metric alongside downloads                                                      |
| **`writer`**              | List of all the movie writers             | 🟠 Low                            | None                                                                                           |


With this table, we can identify the more and the less important variables of the Dataset. Before we transform or delete any of this data, however, we will keep it so that we can still analyze and describe it. We will begin by checking the Dataset dimensions with the **`.shape`** attribute:

In [368]:
# Checking dataset dimensions:
print(f'Total Variables: {data.shape[1]}\nTotal Rows: {data.shape[0]}')

Total Variables: 15
Total Rows: 20548


We can see that this is a rather large Dataset, containing more than 20k entries. Now, let's use the **`.info()`** method to gather more detailed information about these variables:

In [369]:
# Checking general information about the Dataset:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20548 entries, 0 to 20547
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Unnamed: 0       20548 non-null  int64  
 1   IMDb-rating      19707 non-null  float64
 2   appropriate_for  11072 non-null  object 
 3   director         18610 non-null  object 
 4   downloads        20547 non-null  object 
 5   id               20548 non-null  int64  
 6   industry         20547 non-null  object 
 7   language         20006 non-null  object 
 8   posted_date      20547 non-null  object 
 9   release_date     20547 non-null  object 
 10  run_time         18780 non-null  object 
 11  storyline        18847 non-null  object 
 12  title            20547 non-null  object 
 13  views            20547 non-null  object 
 14  writer           18356 non-null  object 
dtypes: float64(1), int64(2), object(12)
memory usage: 2.4+ MB


Now we have a lot more information about the specific variables of the Dataset:

**Null Values**

- We have null values in all columns except the **`Unnamed: 0`** and **`id`** variables.

**Data Types**

- In the Dataset, we have one **`float64`** variable, two **`int64`** variables and twelve **`object`** variables.

**Memory Usage**

- The Dataset, in its current form, consumes over 2.4 MB of memory.
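
The `+` in `2.4+ MB` indicates a lower bound: by default pandas does not measure the strings that `object` columns point to. As a minimal sketch on a hypothetical three-row frame (the `sample` data below is an assumption for illustration, not the real dataset), `memory_usage(deep=True)` gives the fuller estimate:

```python
import pandas as pd

# Hypothetical miniature frame: one numeric column and one object (text) column
sample = pd.DataFrame({
    'views': [2794, 1002, 14419],
    'industry': pd.Series(['Hollywood / English', 'Hollywood / English', 'Tollywood'],
                          dtype='object'),
})

# Shallow estimate: object columns are counted only as references
shallow_bytes = sample.memory_usage(deep=False).sum()

# Deep estimate: also measures the Python strings behind those references
deep_bytes = sample.memory_usage(deep=True).sum()

# Casting a repetitive text column to 'category' stores each distinct label once
as_category = sample['industry'].astype('category')

print(shallow_bytes, deep_bytes)
print(as_category.cat.categories.tolist())
```

On a frame this small the category cast saves little or nothing. Any gain would be expected only on long columns with few distinct labels, which is worth checking on the real data before relying on it.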

**Proposed Changes**

- Assess and deal with the null values;

- Change most of the Data Types to improve memory efficiency. This is also the most important change in this case, because most of the variables are not in an appropriate data format, e.g. **`posted_date`** being an object data type instead of datetime and **`run_time`** being an object type instead of float or int.
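
As a minimal sketch of the datetime conversion proposed above, the two date layouts visible in the preview rows (`20 Feb, 2023` and `Jan 28 2023`) can be parsed with explicit formats. The `sample` frame is a hypothetical stand-in, and the format strings are assumptions that hold for the rows shown; other rows in the full dataset may use different layouts:

```python
import pandas as pd

# Hypothetical rows copied from the preview of the dataset
sample = pd.DataFrame({
    'posted_date': ['20 Feb, 2023', '20 Apr, 2021'],
    'release_date': ['Jan 28 2023', 'Jun 18 2021'],
})

# Explicit formats; errors='coerce' turns unparseable strings into NaT
sample['posted_date'] = pd.to_datetime(
    sample['posted_date'], format='%d %b, %Y', errors='coerce')
sample['release_date'] = pd.to_datetime(
    sample['release_date'], format='%b %d %Y', errors='coerce')

print(sample.dtypes)
```

Counting the `NaT` values after conversion is a quick way to see how many rows did not fit the assumed formats.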

Now, let's check the exact amount of null values in each column, as well as the percentage that it represents:

In [370]:
# Checking total null values:
null_count = (data
              .isna()
              .sum()
              .sort_values(ascending=False))

null_count

appropriate_for    9476
writer             2192
director           1938
run_time           1768
storyline          1701
IMDb-rating         841
language            542
downloads             1
industry              1
posted_date           1
release_date          1
title                 1
views                 1
Unnamed: 0            0
id                    0
dtype: int64

In [371]:
# Check percentage of total size:
null_percentage = round((data.isna().sum()/data.shape[0]) * 100, 4)

null_percentage.sort_values(ascending=False)

appropriate_for    46.1164
writer             10.6677
director            9.4316
run_time            8.6042
storyline           8.2782
IMDb-rating         4.0929
language            2.6377
downloads           0.0049
industry            0.0049
posted_date         0.0049
release_date        0.0049
title               0.0049
views               0.0049
Unnamed: 0          0.0000
id                  0.0000
dtype: float64
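
The two checks above can also be combined into a single summary table. The following is a sketch on a hypothetical four-row frame (`sample` and its column names are illustrative assumptions), relying on the fact that the mean of a boolean mask is the fraction of `True` values:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few missing values
sample = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'director': ['X', None, 'Y', None],
    'rating': [4.8, np.nan, 6.4, 5.2],
})

# isna().sum() counts nulls; isna().mean() gives the null fraction per column
null_summary = (pd.DataFrame({
                    'null_count': sample.isna().sum(),
                    'null_pct': (sample.isna().mean() * 100).round(4),
                })
                .sort_values('null_count', ascending=False))

print(null_summary)
```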

Immediately we can see that the variable **`appropriate_for`** is of little use for the analysis. With 9,476 missing values representing more than 46% of the total data size, this variable is too sparse to be reliably filled in. For that reason, it will be dropped in the cleaning step.

The **`writer`** column also has a considerable amount of missing values: 2,192 nulls, representing more than 10.5% of the total data size. As this variable was already deemed low priority in Table 1.1, it will also be dropped in the cleaning phase.

The same cannot be said about the **`director`** variable. Although it has almost the same amount of null values as the **`writer`** column, it has substantial potential for analysis, being labeled as Medium priority in Table 1.1.

The **`run_time`** and **`storyline`** variables both have a similar amount of missing values (around 8.5% of the total data). The first of them will be filled in using the **median** of the variable (as this metric is not much affected by the presence of outliers and should maintain the data consistency). The second will be dropped, as stated in Table 1.1.

All other variables have few or no null values and can be easily filled in or dropped (like the **`Unnamed: 0`** column).


# **Data Prep, Cleaning and Feature Engineering**

---

Here we will make all of the changes that are needed in the dataset, all of them being described in the section above. For a detailed view of what exactly we will be doing here:

- Drop useless or undesirable variables;

- Rename the variables if necessary;

- Handle missing (null) values;

- Change data types;

- Handle duplicates;

- Add new features to the dataset (feature engineering).

Let's begin with the dropping of useless variables.

## **_Dropping Variables_**

As indicated by Table 1.1 and our subsequent analysis of null values, we have some variables that will be dropped and some that can be filled to use in the analysis. The following are the variables that will be deleted from the dataset:

**Variables to Drop**

- **`Unnamed: 0`**

- **`id`**

- **`storyline`**

- **`writer`**

- **`appropriate_for`**

For the manipulation of the dataset, we will create a copy of the Dataframe:

In [372]:
# Creating manipulation copy:
pirated_films = data.copy()

Now, we can drop the undesirable variables:

In [373]:
# Dropping variables:
pirated_films = (pirated_films
                 .drop(['Unnamed: 0', 'id', 'storyline', 'writer', 'appropriate_for'], 
                       axis=1))

In [374]:
# Checking new data:
pirated_films.head()

Unnamed: 0,IMDb-rating,director,downloads,industry,language,posted_date,release_date,run_time,title,views
0,4.8,John Swab,304,Hollywood / English,English,"20 Feb, 2023",Jan 28 2023,105,Little Dixie,2794
1,6.4,Paul Ziller,73,Hollywood / English,English,"20 Feb, 2023",Feb 05 2023,84,Grilling Season: A Curious Caterer Mystery,1002
2,5.2,Ben Wheatley,1427,Hollywood / English,"English,Hindi","20 Apr, 2021",Jun 18 2021,1h 47min,In the Earth,14419
3,8.1,Venky Atluri,1549,Tollywood,Hindi,"20 Feb, 2023",Feb 17 2023,139,Vaathi,4878
4,4.6,Shaji Kailas,657,Tollywood,Hindi,"20 Feb, 2023",Jan 26 2023,122,Alone,2438


## **_Renaming Variables_**

With only the variables that we want left in the dataset, we will rename some of them to standardize the dataset and to follow the best naming convention possible (short and descriptive).

In [375]:
# Renaming variables:
pirated_films = pirated_films.rename(columns={'IMDb-rating': 'imdb_user_rating',
                                              'director': 'film_director',
                                              'language': 'available_langs',
                                              'posted_date': 'platform_post_date',
                                              'release_date': 'worldwide_release',
                                              'run_time': 'run_time_min',
                                              'title': 'movie_title',
                                              'views': 'total_views'})

# Checking data:
pirated_films.head()

Unnamed: 0,imdb_user_rating,film_director,downloads,industry,available_langs,platform_post_date,worldwide_release,run_time_min,movie_title,total_views
0,4.8,John Swab,304,Hollywood / English,English,"20 Feb, 2023",Jan 28 2023,105,Little Dixie,2794
1,6.4,Paul Ziller,73,Hollywood / English,English,"20 Feb, 2023",Feb 05 2023,84,Grilling Season: A Curious Caterer Mystery,1002
2,5.2,Ben Wheatley,1427,Hollywood / English,"English,Hindi","20 Apr, 2021",Jun 18 2021,1h 47min,In the Earth,14419
3,8.1,Venky Atluri,1549,Tollywood,Hindi,"20 Feb, 2023",Feb 17 2023,139,Vaathi,4878
4,4.6,Shaji Kailas,657,Tollywood,Hindi,"20 Feb, 2023",Jan 26 2023,122,Alone,2438


## **_Handling Null Values_**

Before we can determine the best practices for dealing with null values, let's check the null values of the variables still present in the dataset.

In [376]:
# Checking null values:
round((pirated_films.isna().sum()/pirated_films.shape[0]) * 100, 4)

imdb_user_rating      4.0929
film_director         9.4316
downloads             0.0049
industry              0.0049
available_langs       2.6377
platform_post_date    0.0049
worldwide_release     0.0049
run_time_min          8.6042
movie_title           0.0049
total_views           0.0049
dtype: float64

With that information, let's deal with these values, one variable at a time.

### **`imdb_user_rating`**

Let's get a better look at this variable:

In [377]:
pirated_films['imdb_user_rating'].describe()

count    19707.000000
mean         5.762151
std          1.374041
min          1.100000
25%          4.800000
50%          5.700000
75%          6.600000
max          9.900000
Name: imdb_user_rating, dtype: float64

For the fill method, we will use the **mean** of the total values of the variable:

In [378]:
# Defining mean:
mean_ratings = pirated_films['imdb_user_rating'].mean().round(1)

mean_ratings

5.8

In [379]:
# Placing mean in all null values at the variable:
pirated_films['imdb_user_rating'] = (pirated_films['imdb_user_rating']
                                     .fillna(mean_ratings))

pirated_films['imdb_user_rating'].isna().sum()

0

In [380]:
# Checking statistics:
pirated_films['imdb_user_rating'].describe()

count    20548.000000
mean         5.763700
std          1.345648
min          1.100000
25%          4.900000
50%          5.800000
75%          6.600000
max          9.900000
Name: imdb_user_rating, dtype: float64
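
Comparing the two `describe()` outputs, the standard deviation drops from about 1.374 to about 1.346 after the fill. That is an expected side effect of mean imputation: every filled value sits at the center of the distribution, so the spread shrinks. A minimal sketch on hypothetical ratings (the numbers are assumptions chosen to make the effect easy to see):

```python
import numpy as np
import pandas as pd

# Hypothetical ratings with two missing values
ratings = pd.Series([4.0, 5.0, 6.0, 7.0, np.nan, np.nan])

# Fill the gaps with the mean of the observed values
filled = ratings.fillna(ratings.mean())

# describe()/std() skip NaN, so "before" uses only the four observed values
std_before = ratings.std()
std_after = filled.std()

print(f'mean: {filled.mean():.2f}, std before: {std_before:.3f}, std after: {std_after:.3f}')
```

With roughly 4% of the ratings missing the effect in this dataset is small, but it is worth keeping in mind when interpreting dispersion statistics later on.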

### **`film_director`**

For this variable the procedure will be different, because it is a categorical (qualitative) variable. For these types of variables, a common fill method is to use the mode of the column, the most recurring value.

But this column refers to the film director. If we assigned all the films with a null director to the same director (the mode), we would be attributing them to someone who did not direct them, which would distort any per-director analysis. So, the null values will instead receive a string indicating that the director is unknown.

In [381]:
# Filling nulls with custom string:
pirated_films['film_director'] = pirated_films['film_director'].fillna('Not Assigned')

pirated_films['film_director'].isna().sum()

0

In [382]:
# Checking variable:
pirated_films['film_director'].describe()

count            20548
unique            9673
top       Not Assigned
freq              1938
Name: film_director, dtype: object

Here we can already make out a characteristic of the data: the most frequent value of the variable is the placeholder itself, with 1,938 films (about 9.4% of the dataset) having no Director assigned to them.
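
To quantify how common the placeholder is relative to real directors, `value_counts(normalize=True)` returns shares instead of raw counts. The sketch below uses a hypothetical five-row series, so its 40% share is only illustrative; in the real data the share would be 1938 / 20548, roughly 9.4%:

```python
import pandas as pd

# Hypothetical directors after filling nulls with the placeholder
directors = pd.Series(['Not Assigned', 'John Swab', 'Not Assigned',
                       'Paul Ziller', 'Ben Wheatley'])

# Relative frequency of each value, most frequent first
shares = directors.value_counts(normalize=True)
placeholder_share = shares['Not Assigned']

print(f'Share without a director: {placeholder_share:.1%}')
```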

### **`downloads` and all other 0.0049% null variables**

Let's check exactly how many null values are present in this variable. As the null check at the beginning of this section shows, it has only 0.0049% of null values, along with several other variables with the same amount of nulls. Let's have a closer look at this:

In [383]:
# Checking total values:
pirated_films['downloads'].isna().sum()

1

Apparently only one entry is missing the value, so could it be that all the other variables shown to have only 0.0049% of nulls are in fact missing it in the same entry? Let's test this by querying the exact entry that has the null value in the `downloads` variable and see if the other ones are empty too:

In [384]:
# Checking the null entry:
pirated_films.query('downloads.isna()')

Unnamed: 0,imdb_user_rating,film_director,downloads,industry,available_langs,platform_post_date,worldwide_release,run_time_min,movie_title,total_views
149,7.1,Not Assigned,,,,,,,,


As we suspected, all of the variables that had 0.0049% of null values are in fact only one entry. As it doesn't even have the movie name, this entry is completely useless for the analysis and it will be dropped.

In [385]:
# Grabbing the index of the null entry:
drop_index = pirated_films.loc[(pirated_films['downloads'].isna())].index

# Dropping the entry:
pirated_films = pirated_films.drop(drop_index)

In [386]:
# Checking the null entry:
pirated_films.query('downloads.isna()')

Unnamed: 0,imdb_user_rating,film_director,downloads,industry,available_langs,platform_post_date,worldwide_release,run_time_min,movie_title,total_views


We can also check if all the other variables that had 0.0049% of null values are now at 0%:

In [387]:
# Checking total nulls:
round((pirated_films.isna().sum()/pirated_films.shape[0]) * 100, 4)

imdb_user_rating      0.0000
film_director         0.0000
downloads             0.0000
industry              0.0000
available_langs       2.6330
platform_post_date    0.0000
worldwide_release     0.0000
run_time_min          8.5998
movie_title           0.0000
total_views           0.0000
dtype: float64

Indeed all other 0.0049% null variables are gone.
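
The same row removal can be expressed in one step with `dropna(subset=...)`, which drops only the rows that are null in the listed columns and leaves nulls in other columns untouched. A sketch on a hypothetical three-row frame (the values are illustrative assumptions):

```python
import pandas as pd

# Hypothetical frame: row 1 is empty everywhere, row 2 only lacks a runtime
sample = pd.DataFrame({
    'movie_title': ['Little Dixie', None, 'Vaathi'],
    'downloads': ['304', None, '1549'],
    'run_time_min': ['105', None, None],
})

# Drop rows missing a title or a download count; other nulls are kept
cleaned = sample.dropna(subset=['movie_title', 'downloads'])

print(cleaned)
```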

### **`available_langs`**

Another variable with null values refers to the available languages of the movie. As it is a categorical (qualitative) variable, we will use the mode, i.e. the most recurring value, as the fill method. But first, let's check the variable:

In [388]:
# Describing variable:
pirated_films['available_langs'].describe()

count       20006
unique       1168
top       English
freq        12657
Name: available_langs, dtype: object

As the describe method shows, the most recurring language in the dataset is English, so this is the value that will be used in place of the null entries.

In [389]:
# Defining mode:
mode_fill = pirated_films['available_langs'].mode()[0] # English

# Filling the nulls:
pirated_films['available_langs'] = pirated_films['available_langs'].fillna(mode_fill)

In [390]:
# Checking nulls:
round((pirated_films.isna().sum()/pirated_films.shape[0]) * 100, 4)

imdb_user_rating      0.0000
film_director         0.0000
downloads             0.0000
industry              0.0000
available_langs       0.0000
platform_post_date    0.0000
worldwide_release     0.0000
run_time_min          8.5998
movie_title           0.0000
total_views           0.0000
dtype: float64
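
The 1,168 unique values reported by `describe()` are likely combinations rather than distinct languages, since entries such as `English,Hindi` appear in the preview rows. As a sketch under that assumption (comma-separated values, shown here on a hypothetical series), individual languages can be counted by splitting and exploding the column:

```python
import pandas as pd

# Hypothetical language entries, some listing more than one language
langs = pd.Series(['English', 'English,Hindi', 'Hindi', 'English'])

# Split on commas, give each language its own row, trim stray spaces, then count
language_counts = (langs
                   .str.split(',')
                   .explode()
                   .str.strip()
                   .value_counts())

print(language_counts)
```

Here the raw column has three unique strings but only two actual languages, which is the kind of gap this approach would reveal on the real data.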

### **`run_time_min`**

This is the last variable with null values. Let's take a closer look at it, starting with the describe method:

In [391]:
# Describing variable:
pirated_films['run_time_min'].describe()

count     18780
unique      415
top          93
freq        652
Name: run_time_min, dtype: object

This variable will require more work. The current data type is object, but it must be numeric before we can fill the nulls. So, to start, let's clean the data and change the data type:

In [392]:
# Checking values:
pirated_films['run_time_min'].sample(30).unique()

array(['102 min', '1h 48min', '88', '84 min', '1h 28min', '83', '95',
       '119', '53', '115 min', '1h 25min', '90 min', '92', '1h 52min',
       '1h 59min', '97', nan, '85', '107', '108', '1h 34min', '139',
       '100', '114', '90', '93', '1h 32min', '1h 31min'], dtype=object)

Checking a random sample of the data, we can see that it is not in an appropriate format: although the variable name says it should be in minutes, many entries are still expressed in hours and minutes, and there are, of course, the null values.

To clean this variable, a more robust function to do string manipulation is necessary:

In [399]:
# Creating function:

def convert_runtime(value):
    '''
    Converts a movie runtime string into only minutes.

    Args
    -----
        value (str): The movie runtime string in different formats of hours and minutes.

    Returns
    --------
        int: The movie runtime in minutes. If the input value is not a string or
            does not match the expected format, the original value is returned.

    Example
    --------
        >>> convert_runtime("1h 47min")
        107
        >>> convert_runtime("120 min")
        120
        >>> convert_runtime("2h 5m")
        125
        >>> convert_runtime("3h 25 min")
        205
    '''
    if not isinstance(value, str):
        return value
    
    # Pattern 1: xxh/hours xxm/min
    match = re.search(r'(\d+)\s*h(?:h|ours)?\s*(\d+)\s*(?:m|min)?', value)
    if match:
        hours, minutes = map(int, match.groups())
        return hours * 60 + minutes

    # Pattern 2: xxh/hr xxm/min
    match = re.search(r'(\d+)\s*(?:h|hr)\s*(\d+)\s*(?:m|min)?', value)
    if match:
        hours, minutes = map(int, match.groups())
        return hours * 60 + minutes

    # Pattern 3: xxh/hr xxm
    match = re.search(r'(\d+)\s*(?:h|hr)\s*(\d+)\s*m', value)
    if match:
        hours, minutes = map(int, match.groups())
        return hours * 60 + minutes

    # Pattern 4: xxm/min
    match = re.search(r'(\d+)\s*(?:m|min)', value)
    if match:
        return int(match.group(1))

    # Pattern 5: xxh/hours (hours only, converted to minutes)
    match = re.search(r'(\d+)\s*(?:h|hours)', value)
    if match:
        return int(match.group(1)) * 60

    return value


In [394]:
# Applying the function:
pirated_films['run_time_min'] = pirated_films['run_time_min'].apply(convert_runtime)

Now let's do a quick random check on the variable values:

In [395]:
# Checking results:
pirated_films['run_time_min'].sample(10)

842       84
4904      73
4428     109
8064     122
7185     110
14478    100
241      143
15392     88
7711     117
14335    134
Name: run_time_min, dtype: object
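
For comparison, the same cleaning can be sketched in a vectorized way with `Series.str.extract` and named groups. This is an alternative under stated assumptions, not the function used in this notebook: the `pattern` below and the sample values are hypothetical and cover only the layouts seen in the random check (`102 min`, `1h 48min`, `88`, plus an hours-only value).

```python
import pandas as pd

# Hypothetical sample mirroring the formats seen in the random check
raw = pd.Series(['102 min', '1h 48min', '88', '2h', None])

# Optional hour part followed by an optional minute part
pattern = r'^\s*(?:(?P<hours>\d+)\s*h[a-z]*)?\s*(?:(?P<minutes>\d+)\s*(?:m[a-z]*)?)?\s*$'
parts = raw.str.extract(pattern)

# A missing part counts as zero; rows where nothing was captured stay missing
hours = pd.to_numeric(parts['hours']).fillna(0)
minutes = pd.to_numeric(parts['minutes']).fillna(0)
total_minutes = (hours * 60 + minutes).where(parts.notna().any(axis=1))

print(total_minutes.tolist())
```

Because missing values stay as `NaN`, the result is a float column that can be filled directly, without a temporary placeholder value.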

We can see that the data has been successfully transformed, and all the run times are now in minutes only.

Now, we will temporarily fill the null values with 0 as a string, so we can change the column data type to int. After that, we will fill the nulls (now the zeros) with the median of the values.

In [396]:
# Filling with zeros temporarily:
pirated_films['run_time_min'] = pirated_films['run_time_min'].fillna('0')

In [397]:
# Changing data type:
pirated_films['run_time_min'] = pirated_films['run_time_min'].astype('int')

Now that the variable is in the right data type, let's fill the nulls (zero values in this case) with the median of the values. The median is a good option here because this data is very likely to have outliers and to be skewed, and since the median is not strongly influenced by these factors, it makes a good filling method.

In [402]:
# Calculating the median value of the variable (ignoring the zeros):
nonzero_runtimes = pirated_films.loc[pirated_films['run_time_min'] != 0, 'run_time_min']
median = int(np.median(nonzero_runtimes))

# Replacing the zeros with the median value (.loc avoids chained assignment):
pirated_films.loc[pirated_films['run_time_min'] == 0, 'run_time_min'] = median
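
The choice of the median over the mean for this fill can be illustrated with a small sketch. The runtimes below are hypothetical, with one deliberately extreme value. Because `Series.median()` and `Series.mean()` skip `NaN` by default, the nulls can also be filled directly when the column is kept numeric:

```python
import pandas as pd

# Hypothetical runtimes in minutes: one missing value and one extreme outlier
run_times = pd.Series([90, 95, 100, None, 105, 600], name='run_time_min')

# Both statistics ignore NaN; only the mean is pulled up by the outlier
median_value = run_times.median()
mean_value = run_times.mean()

# Fill the gap with the median, then cast back to integers
filled = run_times.fillna(median_value).astype('int64')

print(f'median: {median_value}, mean: {mean_value}')
print(filled.tolist())
```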

Finally, we can check the null quantity of our final dataset:

In [403]:
round((pirated_films.isna().sum()/pirated_films.shape[0]) * 100, 4)

imdb_user_rating      0.0
film_director         0.0
downloads             0.0
industry              0.0
available_langs       0.0
platform_post_date    0.0
worldwide_release     0.0
run_time_min          0.0
movie_title           0.0
total_views           0.0
dtype: float64