## Notebook Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

sns.set_style('whitegrid')

## Section: Data Analysis
In this section, we analyze the dataset end to end: we explore and visualize the data to understand its structure and main characteristics.

### 1.1 Exploratory Data Analysis (EDA)
- **Data Loading:** Load the dataset into the notebook.
- **Data Inspection:** Examine the structure and content of the dataset.
- **Descriptive Statistics:** Calculate and analyze basic statistics to summarize key features.

### 1.2 Data Visualization

- **Distribution Plots:** Visualize the distribution of key variables.
- **Correlation Analysis:** Explore relationships between different features.

### 1.3 Key Findings

- **Summarize** the key findings from the data analysis.

---


In [None]:
df = pd.read_csv('Netflix_Engagement.csv')
df.head()

In [None]:
# Dimension of the dataset
df.shape

In [None]:
df.info()

In [None]:
# % of the missing values.
df.isnull().sum() / df.shape[0] * 100

In [None]:
df.describe().T

In [None]:
df.describe(include='O').T

## **Movies Available Globally**

Let's find out how many movies are accessible globally.

In [None]:
df['Available Globally?'].value_counts()

In [None]:
# Share of titles by global availability (plt.pie normalizes the counts)
availability = df['Available Globally?'].value_counts()
custom_colors = ['#48dbfb', '#9980FA']
plt.pie(availability, labels=availability.index, autopct='%1.1f%%',
        colors=custom_colors, explode=[0.05, 0.05],
        shadow=True, startangle=90)
plt.show()

Here, we can see that a considerable share of movies is not globally accessible. This may reflect factors such as regional licensing restrictions, content distribution agreements, and cultural sensitivities that limit availability in certain areas.

## **Ratings**

In [None]:
sns.histplot(df['Rating'], bins=20, kde=True)
plt.axvline(x=df['Rating'].mean(), color='r', label='Mean', linestyle='dashed')
plt.title('Distribution of Rating')
plt.legend()
plt.show()



The histogram of 'Rating' is not symmetric: it is slightly skewed to the left, meaning a longer tail stretches toward low ratings while most values sit toward the higher end of the scale. Because the mean is pulled toward that tail by outliers and extreme values, it may not be the best measure of central tendency for imputing missing values.

Additionally, a considerable 22% of the values are missing, so filling every gap with the same number artificially deflates the variability and can bias any subsequent analysis. The mean is a reasonable choice when the data is roughly normally distributed and the percentage of missing values is small; neither condition appears to be met here.

The median, on the other hand, is less sensitive to skew and extreme values, which makes it the better choice for this column.

In [None]:
# Impute missing 'Rating' values with the median
df['Rating'] = df['Rating'].fillna(df['Rating'].median())
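
The reasoning behind the median imputation can be checked numerically. The sketch below uses synthetic, left-skewed ratings rather than the Netflix data, so the variable names (`ratings`, `ratings_missing`, `imputed`) and the generated values are assumptions for illustration only. It shows how to read the sign of the skewness, how the mean and median differ, and how filling with any constant shrinks the standard deviation.

```python
import numpy as np
import pandas as pd

# Synthetic, left-skewed "ratings" on a 0-10 scale (NOT the Netflix data).
rng = np.random.default_rng(0)
ratings = pd.Series(10 - rng.gamma(shape=2.0, scale=1.2, size=1000)).clip(0, 10)

# Remove about 22% of the values at random to mimic the missing share.
is_missing = rng.random(ratings.size) < 0.22
ratings_missing = ratings.mask(is_missing)

skewness = ratings_missing.skew()      # negative value => left-skewed
mean_val = ratings_missing.mean()      # NaNs are skipped by default
median_val = ratings_missing.median()

# Same imputation step as in the notebook, applied to the synthetic column.
imputed = ratings_missing.fillna(median_val)

print(f"skewness: {skewness:.2f}")
print(f"mean: {mean_val:.2f}, median: {median_val:.2f}")
print(f"std before: {ratings_missing.std():.2f}, after: {imputed.std():.2f}")
```

With a left-skewed column the mean falls below the median, and the standard deviation drops after imputation. That drop happens with a median fill as well as a mean fill, so it is worth keeping in mind when interpreting later spread-based statistics.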

## **Ratings vs Hours Viewed**

In [None]:
fig = px.scatter(
    df, x='Number of Ratings',
    y='Hours Viewed', color='Rating',
    size='Rating', trendline='lowess',
    width=700
)

fig.show()



* The scatter plot shows the relationship between the number of ratings and hours viewed. The data is heavily concentrated at the low end of the 'Number of Ratings' axis, meaning that most content has received few ratings.

* Interestingly, a cluster of points combines large 'Hours Viewed' values with a low 'Number of Ratings'. This could imply that some content has broad viewership but low viewer engagement in terms of leaving a rating.

* A few outliers with exceptionally high viewership deviate from the general trend.
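
Because a few extreme points dominate the scatter plot, a rank-based correlation can complement the visual reading. The sketch below is a minimal illustration on made-up, heavy-tailed data (the `demo` frame and its values are assumptions, not the Netflix data). It computes Spearman's correlation as the Pearson correlation of the ranks and shows that it is unchanged by a log transform.

```python
import numpy as np
import pandas as pd

# Made-up, heavy-tailed stand-in for the two columns (NOT the Netflix data).
rng = np.random.default_rng(1)
n = 500
num_ratings = rng.lognormal(mean=8, sigma=1.5, size=n)
hours = 1e4 * np.sqrt(num_ratings) * rng.lognormal(mean=0, sigma=1.0, size=n)
demo = pd.DataFrame({'Number of Ratings': num_ratings, 'Hours Viewed': hours})

cols = ['Number of Ratings', 'Hours Viewed']

# Pearson works on raw values, so extreme points can dominate it.
pearson = demo[cols].corr().loc[cols[0], cols[1]]

# Spearman is the Pearson correlation of the ranks.
spearman = demo[cols].rank().corr().loc[cols[0], cols[1]]

# Ranks do not change under a monotone transform such as log.
spearman_log = np.log(demo[cols]).rank().corr().loc[cols[0], cols[1]]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f} (log scale: {spearman_log:.3f})")
```

On the real data, the same two lines could be applied to `df[['Number of Ratings', 'Hours Viewed']]`, assuming both columns are numeric.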

## **Movies with Exceptionally High or Low Viewing Hours**

In [None]:
sns.displot(df['Hours Viewed'], log_scale=True, kde=True, color='green')
plt.title('Distribution of Hours Viewed', fontsize=14)
plt.show()

The high counts in the lowest bins show that most titles accumulate relatively few hours viewed, with progressively fewer titles reaching higher totals.


In [None]:
# Top 5 records based on Hours Viewed
top_5_hours_viewed = df.nlargest(5, 'Hours Viewed')
top_5_hours_viewed[['Title', 'Genre', 'Rating', 'Hours Viewed']]

In [None]:
# Bottom 5 records based on Hours Viewed
bottom_5_hours_viewed = df.nsmallest(5, 'Hours Viewed')
bottom_5_hours_viewed[['Title', 'Genre', 'Rating', 'Hours Viewed']]

It appears that some movies with a substantial number of viewing hours receive poor ratings, while others with far fewer viewing hours attain similar ratings. One possible reading is that heavily watched movies draw their audience through popularity rather than quality, so high viewership does not guarantee a high rating.
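
One way to make this observation systematic is to flag titles that sit in the top quartile of hours viewed and the bottom quartile of ratings. The sketch below uses a tiny made-up table; the titles, numbers, and the quartile thresholds are assumptions chosen for illustration, not values from the dataset.

```python
import pandas as pd

# Tiny hypothetical table; titles and numbers are made up for illustration.
demo = pd.DataFrame({
    'Title': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
    'Rating': [8.1, 4.2, 6.5, 3.9, 7.4, 6.9, 5.0, 7.8],
    'Hours Viewed': [900, 850, 120, 60, 300, 40, 700, 20],
})

# Thresholds: top 25% by hours viewed, bottom 25% by rating.
hours_cut = demo['Hours Viewed'].quantile(0.75)
rating_cut = demo['Rating'].quantile(0.25)

# Titles that are heavily watched but poorly rated.
watched_but_disliked = demo[(demo['Hours Viewed'] >= hours_cut) &
                            (demo['Rating'] <= rating_cut)]

print(watched_but_disliked)
```

In this toy table only title `B` meets both conditions. The same filter could be applied to `df` once the thresholds have been chosen to suit the analysis.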

## **Top-10 Best & Worst-Rated Movies**

In [None]:
top10 = df.nlargest(10, 'Rating')[['Title', 'Rating', 'Genre', 'Hours Viewed']]
fig = px.bar(top10, x='Title', y='Rating', color='Genre',
             hover_data=['Hours Viewed'], title='Top-Rated Movies',
             width=1000, height=700)
fig.show()

In [None]:
worst_rated = df.nsmallest(10, 'Rating')[['Title', 'Rating', 'Genre', 'Hours Viewed']]
fig = px.bar(worst_rated, x='Title', y='Rating',
             hover_data=['Hours Viewed'], color='Genre',
             title='Worst-Rated Movies', width=800)
fig.show()

## **Correlation Between Numerical Columns**

In [None]:
numeric_cols = ['Hours Viewed', 'Rating', 'Number of Ratings']
corr_matrix = df.loc[:, numeric_cols].corr()
# Hide the redundant upper triangle (the diagonal is kept)
mask = np.triu(np.ones_like(corr_matrix, dtype=bool), k=1)
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap="coolwarm",
            xticklabels=numeric_cols,
            yticklabels=numeric_cols,
            mask=mask)
plt.show()

## **Most Prevalent Genres**

In [None]:
# Ten most common genres by number of entries, then total hours viewed within them
top_genres = df.groupby('Genre').size().nlargest(10).reset_index(name='Count')
top_genres_hours_viewed = (df[df['Genre'].isin(top_genres['Genre'])]
                           .groupby('Genre')['Hours Viewed'].sum().reset_index())

In [None]:
fig = px.bar(top_genres_hours_viewed, x='Hours Viewed', y='Genre', color='Genre',
             title='Total Hours Viewed for Top 10 Genres by Number of Entries', 
             width=1000)
fig.update_layout(showlegend=False)
fig.show()

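Here, we can observe that, among the ten most common genres, short films attract the most total viewership. One possible explanation is that people appreciate the time-saving aspect while still enjoying the entertainment provided, although a genre's total also grows with the number of titles it contains.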
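
Since the chart sums hours over every title in a genre, a genre with many entries can lead on totals without leading per title. The sketch below, on a made-up table (genre names and numbers are assumptions for illustration), puts total hours, mean hours per title, and title counts side by side.

```python
import pandas as pd

# Made-up example: one genre has many titles, the other has few but larger ones.
demo = pd.DataFrame({
    'Genre': ['Short'] * 6 + ['Drama'] * 2,
    'Hours Viewed': [100] * 6 + [250] * 2,
})

# Total hours, mean hours per title, and number of titles for each genre.
summary = (demo.groupby('Genre')['Hours Viewed']
               .agg(total='sum', per_title='mean', titles='count')
               .sort_values('total', ascending=False))

print(summary)
```

Here `Short` has the larger total while `Drama` has the larger per-title mean, which is why it helps to inspect both before drawing conclusions about audience preference.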

## **Movies Released in the First Half of 2023**

In [None]:
# Movies Released in 2023
df['Release Date'] = pd.to_datetime(df['Release Date'])
movies_in_2023 = df[df['Release Date'].dt.year == 2023]
movies_in_2023.head()

In [None]:
movies_in_2023.shape

The dataset covers only the first half of 2023, and 335 movies were released within this timeframe.
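
The coverage of the 2023 subset can be verified by looking at the earliest and latest release dates and at the number of releases per month. The sketch below uses a handful of made-up dates (an assumption for illustration); on the real data the same calls would be applied to `movies_in_2023`.

```python
import pandas as pd

# Made-up release dates for illustration (NOT the Netflix data).
demo = pd.DataFrame({'Release Date': pd.to_datetime(
    ['2022-11-30', '2023-01-05', '2023-01-20', '2023-03-14', '2023-06-29'])})

# Same year filter as in the notebook.
in_2023 = demo[demo['Release Date'].dt.year == 2023]

# Date range actually covered by the subset.
first = in_2023['Release Date'].min()
last = in_2023['Release Date'].max()

# Number of releases per calendar month, in chronological order.
per_month = (in_2023['Release Date'].dt.to_period('M')
             .value_counts().sort_index())

print(first.date(), last.date())
print(per_month)
```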

In [None]:
# Color by 'Rating' so the continuous color scale actually applies
fig = px.scatter(movies_in_2023, x='Release Date', y='Rating',
                 color='Rating', color_continuous_scale='oranges',
                 title='Ratings of Films Released in 2023', hover_data=['Title'], width=900)

fig.update_layout(
    showlegend=False
)

fig.show()

* The ratings are quite spread out, which points to a variety of opinions on the films released.
* Most ratings seem to fall between 5 and 8, suggesting that most films are considered average to good.
* A few points sit at the very top and bottom, indicating some exceptionally well received and some poorly received films, respectively.
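
The "between 5 and 8" impression can be quantified as a share rather than read off the plot. The sketch below uses ten made-up ratings (an assumption for illustration); on the real data the same call would be applied to `movies_in_2023['Rating']`.

```python
import pandas as pd

# Ten made-up ratings for illustration (NOT the Netflix data).
ratings = pd.Series([4.5, 5.0, 6.2, 7.1, 7.9, 8.0, 8.6, 3.2, 6.8, 7.4])

# Fraction of ratings in the closed interval [5, 8].
share_mid = ratings.between(5, 8).mean()

# Quartiles give a plot-independent view of the spread.
q1, q3 = ratings.quantile([0.25, 0.75])

print(f"share in [5, 8]: {share_mid:.0%}")
print(f"interquartile range: {q1:.2f} to {q3:.2f}")
```

`Series.between` includes both endpoints by default, so ratings of exactly 5 or 8 count toward the share.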

In [None]:
plt.figure(figsize=(10, 6))
sns.lineplot(x='Release Date', y='Hours Viewed', data=movies_in_2023)
plt.title('Time Trend of Hours Viewed in 2023')
plt.show()