# Analysis of User Engagement on Netflix



## 1.1 Project Overview
Netflix, one of the world's leading streaming platforms, continually seeks to improve the user experience by understanding viewer engagement. This project explores the factors that drive user engagement on Netflix, focusing on metrics such as ratings and watch time.

## 1.2 Goals and Objectives
- Identify key factors influencing user engagement.
- Analyze how these factors vary across different regions.
- Provide recommendations to improve content recommendations and viewer satisfaction.

## 1.3 Methodology
We will combine data from the IMDb API, the TMDB API, and several Netflix-related datasets available on Kaggle. The analysis will involve data cleaning, exploratory data analysis, feature engineering, statistical analysis, and predictive modeling.


In [70]:
# Import libraries
import pandas as pd
import numpy as np  # data cleaning and manipulation

import matplotlib.pyplot as plt
import seaborn as sns  # data visualization

from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor  # analysis and modelling

import requests
import json  # API access

#import geopandas as gpd
import folium  # geographical analysis


In [71]:
netflix_titles = pd.read_csv(r"C:\Users\DELL\OneDrive\Desktop\Netflix_project\netflix_titles.csv")
#dataset_2 = pd.read_csv('path to dataset_2.csv')

In [72]:
# Display the initial rows of the dataset
netflix_titles.head() #dataset_2.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


In [73]:
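# The drop below is commented out, so 'description' is kept and info() still lists all 12 columns.
#column_drop = 'description'

#netflix_titles.drop(columns=[column_drop], inplace=True)
print("\nDataFrame after dropping the columns:\n")
# info() prints its report directly and returns None, which is why "None" follows the report.
print(netflix_titles.info())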


DataFrame after dropping the columns:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB
None


In [74]:
netflix_titles.rename(columns={"date_added": "date", "listed_in": "genre"}, inplace=True)
print(netflix_titles['date'].head())


0    September 25, 2021
1    September 24, 2021
2    September 24, 2021
3    September 24, 2021
4    September 24, 2021
Name: date, dtype: object



The `date` column (originally `date_added`) is stored as strings. Converting it to the pandas datetime data type makes date-related operations easier.

In [76]:
# errors='coerce' turns any value that does not match the format into NaT
netflix_titles['date'] = pd.to_datetime(netflix_titles['date'], format='%B %d, %Y', errors='coerce')

In [77]:
print('\n date column with datetime datatype: \n')
print(netflix_titles['date'].head())



 date column with datetime datatype: 

0   2021-09-25
1   2021-09-24
2   2021-09-24
3   2021-09-24
4   2021-09-24
Name: date, dtype: datetime64[ns]
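As an illustration of the date-related operations that a datetime column allows, here is a minimal sketch. It uses a small hypothetical series rather than the real dataset, and the names `year_added`, `month_added`, and `per_year` are assumptions introduced for this example only.

```python
import pandas as pd

# Hypothetical sample standing in for the converted 'date' column
dates = pd.Series(
    pd.to_datetime(["2021-09-25", "2021-09-24", "2020-01-01"]), name="date"
)

# The .dt accessor exposes calendar components of each timestamp
year_added = dates.dt.year
month_added = dates.dt.month_name()

# Number of titles added per year, ordered chronologically
per_year = year_added.value_counts().sort_index()
print(per_year)
print(month_added.tolist())
```

None of these operations work on the original string column, which is the practical reason for the conversion.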


Of the 8807 rows, the 'country' column has 7976 non-null values. Instead of deleting the rows with null values, we can fill them with the value 'Unknown' so that we don't lose valuable data.

In [78]:
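# Assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas versions
netflix_titles['country'] = netflix_titles['country'].fillna('Unknown')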

In [79]:
print("Having no 'null' values in the country column: \n")
print(netflix_titles['country'].info())

Having no 'null' values in the country column: 

<class 'pandas.core.series.Series'>
RangeIndex: 8807 entries, 0 to 8806
Series name: country
Non-Null Count  Dtype 
--------------  ----- 
8807 non-null   object
dtypes: object(1)
memory usage: 68.9+ KB
None
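Before choosing a fill strategy for the other columns, it can help to see how many values are missing in each one. The earlier `info()` output implies 2634 missing directors and 825 missing cast entries (8807 minus the non-null counts). The sketch below shows one way to tabulate this, using a small hypothetical frame; `df` and `summary` are names assumed for illustration.

```python
import pandas as pd

# Hypothetical miniature of the titles table
df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", None, None, "Y"],
    "country": ["India", None, "United States", "India"],
})

# Count missing values per column and express them as a share of all rows
missing = df.isna().sum()
summary = pd.DataFrame({
    "missing": missing,
    "share_pct": (missing / len(df) * 100).round(1),
}).sort_values("missing", ascending=False)
print(summary)
```

Applied to `netflix_titles`, the same pattern would list the columns with the most gaps first.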


In [80]:
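netflix_titles['rating'] = netflix_titles['rating'].fillna('Not Rated')
# release_year has no missing values (8807 non-null), so this fill changes nothing;
# a string placeholder would otherwise turn the integer column into object dtype.
netflix_titles['release_year'] = netflix_titles['release_year'].fillna('Unknown')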

In [81]:
missing_dates = netflix_titles['date'].isnull().sum()
print(f"\n Number of missing dates: {missing_dates}")


 Number of missing dates: 98
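The raw column had only 10 nulls (8797 non-null out of 8807), so 88 of the 98 missing dates were present as text but failed to match the format and were coerced to NaT. A possible way to find such rows is sketched below on a hypothetical sample. Stray leading whitespace is assumed here as the cause; that is a guess, not something the output above confirms.

```python
import pandas as pd

# Hypothetical sample: one clean value, one with leading whitespace,
# and one genuinely missing value
raw = pd.Series(["September 25, 2021", " August 4, 2017", None], name="date")

strict = pd.to_datetime(raw, format="%B %d, %Y", errors="coerce")

# Values that existed in the raw data but became NaT after parsing
failed = raw[raw.notna() & strict.isna()]
print(failed.tolist())

# Stripping whitespace before parsing recovers them (assumed cause)
cleaned = pd.to_datetime(raw.str.strip(), format="%B %d, %Y", errors="coerce")
print(cleaned.isna().sum())
```

Running the same comparison on the original `date_added` strings, before conversion, would show what the 88 unparsed values actually look like.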


In [53]:
duplicate_rows = netflix_titles[netflix_titles.duplicated()]
print(f"\nNumber of duplicate rows: {duplicate_rows.shape[0]}")



Number of duplicate rows: 0
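`duplicated()` with no arguments flags only rows that match in every column, so a unique `show_id` guarantees a count of zero even if the same title was entered twice. A rough sketch of a stricter check on a subset of columns follows. The sample data and the choice of key columns are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample: two rows describe the same title under different ids
df = pd.DataFrame({
    "show_id": ["s1", "s2", "s3"],
    "type": ["Movie", "Movie", "TV Show"],
    "title": ["Example", "Example", "Other"],
    "release_year": [2020, 2020, 2021],
})

# Full-row comparison finds nothing because show_id differs
full_dupes = df.duplicated().sum()

# Comparing only the descriptive columns flags both matching rows
key_dupes = df.duplicated(subset=["type", "title", "release_year"], keep=False)
print(full_dupes)
print(df[key_dupes])
```

Whether such rows are true duplicates or legitimate remakes would still need a manual look.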


In [83]:
netflix_titles.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   show_id       8807 non-null   object        
 1   type          8807 non-null   object        
 2   title         8807 non-null   object        
 3   director      6173 non-null   object        
 4   cast          7982 non-null   object        
 5   country       8807 non-null   object        
 6   date          8709 non-null   datetime64[ns]
 7   release_year  8807 non-null   int64         
 8   rating        8807 non-null   object        
 9   duration      8804 non-null   object        
 10  genre         8807 non-null   object        
 11  description   8807 non-null   object        
dtypes: datetime64[ns](1), int64(1), object(10)
memory usage: 825.8+ KB


In [86]:
show_movie_count = netflix_titles['type'].value_counts().reset_index()
show_movie_count

Unnamed: 0,type,count
0,Movie,6131
1,TV Show,2676
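The counts are easier to compare as shares of the catalogue. The sketch below rebuilds the two counts shown above as a Series and converts them to percentages; `share_pct` is a name introduced for this example.

```python
import pandas as pd

# Counts taken from the value_counts() output above
counts = pd.Series({"Movie": 6131, "TV Show": 2676}, name="count")

# Percentage of the catalogue in each category
share_pct = (counts / counts.sum() * 100).round(1)
print(share_pct)
```

On the full DataFrame, `netflix_titles['type'].value_counts(normalize=True)` should give the same proportions as fractions.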
