# **Univariate Analysis**

---

In this notebook, we will focus on doing a Univariate Analysis of our dataset.

Univariate Analysis is a statistical technique that is used to examine the distribution and characteristics of a **single variable in isolation**, without considering the relationships between variables.

It involves summarizing and visualizing the distribution of a single variable using descriptive statistics such as measures of central tendency (e.g., mean, median, mode) and measures of variability (e.g., range, variance, standard deviation). For visualizing the statistical information, we will use graphical methods such as histograms and box plots.

This step is very important because it provides a basic understanding of the data and helps identify any patterns, trends, or outliers in the data.

In this EDA, as mentioned at the beginning of the project, the goal is detail, so we will conduct a thorough analysis of each variable in the dataset.

**Steps**

For each variable, we will analyze:

- Statistical information using the `describe()` method (categorical and numerical variables);

- The distribution of the data using histograms (numerical variables);

- The skewness of the data using the `skew()` method (numerical variables);

- The kurtosis of the data using the `kurt()` method (numerical variables).
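As a rough illustration of the numerical checks listed above, the sketch below runs `describe()`, `skew()` and `kurt()` on a small made-up Series. The values and the name `run_time` are assumptions for demonstration only and are not taken from the dataset; the histogram step is left out to keep the example short.

```python
import pandas as pd

# Made-up runtimes in minutes, for illustration only (not real dataset values).
run_time = pd.Series([84, 88, 105, 107, 122, 139], name='run_time_min')

summary = run_time.describe()   # count, mean, std, min, quartiles, max
skewness = run_time.skew()      # > 0: longer right tail, < 0: longer left tail
kurtosis = run_time.kurt()      # excess kurtosis: a normal distribution scores about 0

print(summary)
print(f'skewness: {skewness:.3f}')
print(f'kurtosis: {kurtosis:.3f}')
```

For a categorical column, `describe()` reports `count`, `unique`, `top` and `freq` instead, which is the form used for `movie_title` further down.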

Before we can begin, let's import our libraries and get our clean CSV.

In [1]:
# Importing libraries needed for the second step of the project:
import pandas as pd
import numpy as np
import seaborn as sns
import warnings
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator, FuncFormatter
from matplotlib.font_manager import FontProperties

# Setting seaborn plot parameters:
sns.set_theme(context='notebook', style='darkgrid')

# Filtering out warnings:
warnings.filterwarnings('ignore')

# Setting pandas dataframe visualization parameters:
pd.set_option('display.max_columns', 100)

print('Packages collected!')

Packages collected!


In [2]:
# Collecting data from previous step:
pirated_films = pd.read_csv('data/pirated_films.csv', sep=',')

pirated_films.head()

Unnamed: 0,movie_title,film_director,industry,available_langs,run_time_min,imdb_user_rating,worldwide_release,platform_post_date,days_to_piracy,avg_views_per_download,avg_downloads_per_day,downloads,total_views
0,Little Dixie,John Swab,Hollywood / English,English,105,4.8,2023-01-28,2023-02-20,23,9.190789,13.217391,304,2794
1,Grilling Season: A Curious Caterer Mystery,Paul Ziller,Hollywood / English,English,84,6.4,2023-02-05,2023-02-20,15,13.726027,4.866667,73,1002
2,In the Earth,Ben Wheatley,Hollywood / English,"English,Hindi",107,5.2,2021-06-18,2021-04-20,-59,10.104415,24.186441,1427,14419
3,Vaathi,Venky Atluri,Tollywood,Hindi,139,8.1,2023-02-17,2023-02-20,3,3.149128,516.333333,1549,4878
4,Alone,Shaji Kailas,Tollywood,Hindi,122,4.6,2023-01-26,2023-02-20,25,3.710807,26.28,657,2438


## **_Univariate Analysis: `movie_title`_**

### **Statistical View**

In [3]:
# Checking statistics:
pirated_films['movie_title'].describe()

count                                             20547
unique                                            16572
top       The Girl Who Escaped: The Kara Robinson Story
freq                                                402
Name: movie_title, dtype: object

With the `describe()` method, we can take a first look at the `movie_title` variable and see some statistical information:

- Of the 20,547 entries in the dataset, we have **16,572** unique movie titles, and 3,975 entries that repeat a movie title already present;

- The most frequent value in this variable is the movie title **_The Girl Who Escaped: The Kara Robinson Story_**, which shows up 402 times in the dataset.

We can check all of these records containing the same movie title to see why this happens:

In [4]:
# Checking recurring movie title:
pirated_films.query('movie_title == "The Girl Who Escaped: The Kara Robinson Story"').head(10)

Unnamed: 0,movie_title,film_director,industry,available_langs,run_time_min,imdb_user_rating,worldwide_release,platform_post_date,days_to_piracy,avg_views_per_download,avg_downloads_per_day,downloads,total_views
28,The Girl Who Escaped: The Kara Robinson Story,Simone Stock,Hollywood / English,English,88,6.6,2023-02-11,2023-02-15,4,9.776316,190.0,760,7430
40,The Girl Who Escaped: The Kara Robinson Story,Simone Stock,Hollywood / English,English,88,6.6,2023-02-11,2023-02-15,4,9.777632,190.0,760,7431
84,The Girl Who Escaped: The Kara Robinson Story,Simone Stock,Hollywood / English,English,88,6.6,2023-02-11,2023-02-15,4,9.778947,190.0,760,7432
128,The Girl Who Escaped: The Kara Robinson Story,Simone Stock,Hollywood / English,English,88,6.6,2023-02-11,2023-02-15,4,9.780263,190.0,760,7433
171,The Girl Who Escaped: The Kara Robinson Story,Simone Stock,Hollywood / English,English,88,6.6,2023-02-11,2023-02-15,4,9.781579,190.0,760,7434
215,The Girl Who Escaped: The Kara Robinson Story,Simone Stock,Hollywood / English,English,88,6.6,2023-02-11,2023-02-15,4,9.784211,190.0,760,7436
259,The Girl Who Escaped: The Kara Robinson Story,Simone Stock,Hollywood / English,English,88,6.6,2023-02-11,2023-02-15,4,9.785526,190.0,760,7437
303,The Girl Who Escaped: The Kara Robinson Story,Simone Stock,Hollywood / English,English,88,6.6,2023-02-11,2023-02-15,4,9.786842,190.0,760,7438
347,The Girl Who Escaped: The Kara Robinson Story,Simone Stock,Hollywood / English,English,88,6.6,2023-02-11,2023-02-15,4,9.789474,190.0,760,7440
391,The Girl Who Escaped: The Kara Robinson Story,Simone Stock,Hollywood / English,English,88,6.6,2023-02-11,2023-02-15,4,9.790789,190.0,760,7441


In [5]:
# Checking the last entries for this film:
pirated_films.query('movie_title == "The Girl Who Escaped: The Kara Robinson Story"').tail(10)

Unnamed: 0,movie_title,film_director,industry,available_langs,run_time_min,imdb_user_rating,worldwide_release,platform_post_date,days_to_piracy,avg_views_per_download,avg_downloads_per_day,downloads,total_views
17239,The Girl Who Escaped: The Kara Robinson Story,Simone Stock,Hollywood / English,English,88,6.6,2023-02-11,2023-02-15,4,10.287206,191.5,766,7880
17283,The Girl Who Escaped: The Kara Robinson Story,Simone Stock,Hollywood / English,English,88,6.6,2023-02-11,2023-02-15,4,10.288512,191.5,766,7881
17327,The Girl Who Escaped: The Kara Robinson Story,Simone Stock,Hollywood / English,English,88,6.6,2023-02-11,2023-02-15,4,10.292428,191.5,766,7884
17371,The Girl Who Escaped: The Kara Robinson Story,Simone Stock,Hollywood / English,English,88,6.6,2023-02-11,2023-02-15,4,10.29765,191.5,766,7888
17415,The Girl Who Escaped: The Kara Robinson Story,Simone Stock,Hollywood / English,English,88,6.6,2023-02-11,2023-02-15,4,10.298956,191.5,766,7889
17459,The Girl Who Escaped: The Kara Robinson Story,Simone Stock,Hollywood / English,English,88,6.6,2023-02-11,2023-02-15,4,10.300261,191.5,766,7890
17503,The Girl Who Escaped: The Kara Robinson Story,Simone Stock,Hollywood / English,English,88,6.6,2023-02-11,2023-02-15,4,10.293351,191.75,767,7895
17547,The Girl Who Escaped: The Kara Robinson Story,Simone Stock,Hollywood / English,English,88,6.6,2023-02-11,2023-02-15,4,10.295958,191.75,767,7897
17591,The Girl Who Escaped: The Kara Robinson Story,Simone Stock,Hollywood / English,English,88,6.6,2023-02-11,2023-02-15,4,10.297262,191.75,767,7898
17635,The Girl Who Escaped: The Kara Robinson Story,Simone Stock,Hollywood / English,English,88,6.6,2023-02-11,2023-02-15,4,10.298566,191.75,767,7899


After checking the first and last 10 entries for this movie title, it appears that a new entry was registered in the database each time the movie received a new view (or sometimes after a few more views, such as 3 or 5).
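One way to confirm which columns actually change between these repeated rows is to count the distinct values per column. The sketch below does this on a small made-up frame; `toy`, `Film A` and `Director A` are illustrative assumptions, with numbers modelled loosely on the output above.

```python
import pandas as pd

# Toy rows modelled on the repeated entries; names and values are illustrative.
toy = pd.DataFrame({
    'movie_title': ['Film A'] * 3,
    'film_director': ['Director A'] * 3,
    'run_time_min': [88, 88, 88],
    'downloads': [760, 760, 766],
    'total_views': [7430, 7431, 7880],
})

# Number of distinct values per column across the repeated rows.
distinct_per_column = toy.nunique()

# Columns with more than one distinct value are the ones that change between entries.
changing_columns = distinct_per_column[distinct_per_column > 1].index.tolist()
print(changing_columns)
```

On the real data, `toy` would be replaced by the rows returned by the `query()` call for the title being inspected.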

This is not ideal, as it will affect all subsequent analyses. Therefore, we will drop all entries except the last one, which contains the most recent number of total views. This lets us keep this film in the analysis while ensuring that we use its most recent data.

Since we know that the latest record for this film contains 7,899 views, we can use this information to drop all other rows except this one.

In [6]:
# Defining rows to drop:
rows_to_drop = pirated_films.query('movie_title == "The Girl Who Escaped: The Kara Robinson Story" & total_views != 7899')

# Dropping:
pirated_films = pirated_films.drop(rows_to_drop.index)
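The cell above relies on the literal `7899` read from the previous output. A minimal sketch of an alternative that looks up the row with the highest `total_views` through `idxmax()` instead is shown below on made-up rows. The `toy` frame and the titles `Film A` and `Film B` are assumptions for illustration, and the sketch assumes that the highest view count marks the latest record.

```python
import pandas as pd

# Toy frame: three entries of one film plus an unrelated film (illustrative values).
toy = pd.DataFrame({
    'movie_title': ['Film A', 'Film A', 'Film A', 'Film B'],
    'total_views': [7430, 7899, 7431, 1002],
})

title = 'Film A'
same_title = toy[toy['movie_title'] == title]

# Index label of the row with the highest total_views for this title.
latest_idx = same_title['total_views'].idxmax()

# Drop every other row for the title; rows for other titles are untouched.
deduped = toy.drop(same_title.index.difference([latest_idx]))
print(deduped)
```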

In [7]:
# Checking entries for this film:
pirated_films.query('movie_title == "The Girl Who Escaped: The Kara Robinson Story"')

Unnamed: 0,movie_title,film_director,industry,available_langs,run_time_min,imdb_user_rating,worldwide_release,platform_post_date,days_to_piracy,avg_views_per_download,avg_downloads_per_day,downloads,total_views
17635,The Girl Who Escaped: The Kara Robinson Story,Simone Stock,Hollywood / English,English,88,6.6,2023-02-11,2023-02-15,4,10.298566,191.75,767,7899


Let's check the statistics again.

In [8]:
# Checking statistics:
pirated_films['movie_title'].describe()

count      20146
unique     16572
top       Vaathi
freq         402
Name: movie_title, dtype: object

We can see the same situation happening again, where the same movie is present multiple times in the dataset with only the number of views changing. Let's identify all movie titles that show up in our dataset more than once:

In [9]:
# Grouping the data by movie_title and counting the number of occurrences:
grouped = (pirated_films
           .groupby('movie_title')
           .size()
           .sort_values(ascending=False))

# Creating a new dataframe with the movie_title counts:
df = pd.DataFrame({'movie_title': grouped.index, 
                   'count': grouped.values})
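As a side note, a similar table of title counts can be built in a single chain with `value_counts()`. The sketch below shows this on a made-up column; the `toy` frame is an assumption for illustration, and the ordering of titles with equal counts may differ from the `groupby` version.

```python
import pandas as pd

# Toy column with repeated titles (illustrative only).
toy = pd.DataFrame({
    'movie_title': ['Vaathi', 'Vaathi', 'Vaathi', 'Alone', 'Alone', 'Enemy']
})

# value_counts() sorts by frequency, highest first.
counts = (toy['movie_title']
          .value_counts()
          .rename_axis('movie_title')
          .reset_index(name='count'))

# Keeping only titles that appear more than once:
repeated = counts.query('count > 1')
print(repeated)
```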

In [10]:
# Checking:
df.query('count > 1')

Unnamed: 0,movie_title,count
0,Who Invited Charlie?,402
1,Vaathi,402
2,Consent,202
3,WWE Smackdown 2023-02-10,202
4,Vacation Home Nightmare,202
...,...,...
539,Trishna,2
540,Wonder Woman,2
541,Premature,2
542,Enemy,2


We have 544 titles that show up more than once in the dataset. Since some different movies share the same name, let's focus only on the clear outliers in this case, mainly the movies that show up hundreds of times.

In [11]:
# Checking movies that show up more than 50 times:
df.query('count > 50')

Unnamed: 0,movie_title,count
0,Who Invited Charlie?,402
1,Vaathi,402
2,Consent,202
3,WWE Smackdown 2023-02-10,202
4,Vacation Home Nightmare,202
5,Little Dixie,202
6,The Inspection,202
7,Marlowe,201
8,Baby Ruby,201
9,Shehzada,201


For all of these titles, we have to follow the same procedure as before, excluding all entries except the one that contains the most up-to-date number of `total_views`. Luckily, there is a pattern that can be easily observed in all of these unnecessary entries:

- All variables contain the same values (are duplicates), except the last 4 variables (`avg_views_per_download`, `avg_downloads_per_day`, `downloads`, `total_views`), which change because each entry adds one or more to the `total_views` variable, which in turn influences the other 3 variables.

Because of this pattern, we can drop all rows where the variables are duplicates and only the last 4 change. This is basically the same procedure we used above for ***The Girl Who Escaped: The Kara Robinson Story***, but this time applied across the whole dataset, which covers every title identified in the list above as showing up more than 50 times.

In [12]:
# Sorting dataset by the total_views:
pirated_films = pirated_films.sort_values(by=['total_views'], ascending=False)

# Dropping duplicates:
pirated_films = pirated_films.drop_duplicates(subset=pirated_films.columns[:-4], keep='first')

Here we've done the following:

- Sorted our dataset by the `total_views` column in descending order (highest to lowest).

- Dropped all entries that contain the same values in every column except the last 4, keeping only the first one, which holds the most recent `total_views` value (the highest, since we sorted the dataset in descending order).
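The effect of these two steps can be seen on a small made-up frame. In this sketch the identity columns are named explicitly through a hypothetical `identity_cols` list rather than sliced by position, which is one way to make the selection independent of column order; the `toy` frame and its values are assumptions for illustration.

```python
import pandas as pd

# Toy frame: three entries of Film A plus one entry of Film B (illustrative values).
toy = pd.DataFrame({
    'movie_title': ['Film A', 'Film A', 'Film A', 'Film B'],
    'film_director': ['Director A', 'Director A', 'Director A', 'Director B'],
    'downloads': [760, 767, 766, 73],
    'total_views': [7430, 7899, 7880, 1002],
})

# Hypothetical split: columns that identify a film, as opposed to its growing counters.
identity_cols = ['movie_title', 'film_director']

# Sort so the highest total_views comes first, then keep that first row per film.
latest = (toy
          .sort_values('total_views', ascending=False)
          .drop_duplicates(subset=identity_cols, keep='first'))
print(latest)
```

Only the row with the highest `total_views` survives for each film, which matches the behaviour described in the two steps above.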

Let's check the shape of the dataset again.

In [13]:
# Checking new shape:
pirated_films.shape

(17084, 13)

We can see that the 3K+ entries flagged as repeated titles by the `describe()` method at the beginning of this section were mostly made up of these unnecessary entries. With that, only a few more unnecessary entries should remain in the dataset. Let's repeat the earlier check, where we group the dataset and search for repeated movie titles.

In [14]:
# Grouping the data by movie_title and counting the number of occurrences:
grouped = pirated_films.groupby('movie_title').size().sort_values(ascending=False)

# Creating a new dataframe with the movie_title counts:
df = pd.DataFrame({'movie_title': grouped.index, 'count': grouped.values})

df.query('count > 1')

Unnamed: 0,movie_title,count
0,True Justice,5
1,Alone,5
2,Pinocchio,5
3,Sacrifice,5
4,Don,4
...,...,...
444,Sundown,2
445,Mercenaries,2
446,Dreamland,2
447,Love Story,2


Now a given movie title shows up at most 5 times in the dataset. As this number is very low, the impact on the analysis should be small, so we will keep these remaining entries in the dataset.
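If a closer look at these remaining repeats is ever needed, one possible check is to count how many distinct directors and release dates each repeated title has: more than one suggests different films that share a name, while exactly one suggests a leftover duplicate. The sketch below uses made-up rows; the `toy` frame and its titles, directors and dates are assumptions for illustration.

```python
import pandas as pd

# Toy frame: Title A is two different films, Title B is a leftover duplicate.
toy = pd.DataFrame({
    'movie_title': ['Title A', 'Title A', 'Title B', 'Title B', 'Title C'],
    'film_director': ['Director 1', 'Director 2', 'Director 3', 'Director 3', 'Director 4'],
    'worldwide_release': ['2023-01-26', '2020-09-18', '2022-09-08', '2022-09-08', '2023-02-17'],
})

# Per title: number of rows, distinct directors and distinct release dates.
per_title = toy.groupby('movie_title').agg(
    rows=('film_director', 'count'),
    directors=('film_director', 'nunique'),
    release_dates=('worldwide_release', 'nunique'),
)

# Keeping only titles that appear more than once:
repeated = per_title[per_title['rows'] > 1]
print(repeated)
```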

With this analysis, we observed:

- Some movie titles were recorded in the dataset in a way that was not ideal, with a new entry created just to account for a new view on the piracy website;

- Other movies had their data collected correctly, with only one entry in the dataset containing the most up-to-date number of views.

This shows the importance of a Univariate Analysis of all variables in our dataset. With it, we can describe our data and also find necessary changes and corrections that were missed in the cleaning phase; in this case, the presence of multiple duplicate entries that differ only in the number of views.

As this is a categorical variable and we are doing a Univariate Analysis, we don't have many further possibilities of analysis, so let's check the next variable in our dataset.

## **_Univariate Analysis: `film_director`_**