### 1. Load the Data

In [1]:
# Import pandas 
import pandas as pd

In [3]:
# Load the Netflix data and check the first 5 records
netflix_data = pd.read_csv("data/netflix_data.csv")
netflix_data.head(5)

| | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 81145628 | Movie | Norm of the North: King Sized Adventure | Richard Finn, Tim Maltby | Alan Marriott, Andrew Toth, Brian Dobson, Cole... | United States, India, South Korea, China | September 9, 2019 | 2019 | TV-PG | 90 min | Children & Family Movies, Comedies | Before planning an awesome wedding for his gra... |
| 1 | 80117401 | Movie | Jandino: Whatever it Takes | NaN | Jandino Asporaat | United Kingdom | September 9, 2016 | 2016 | TV-MA | 94 min | Stand-Up Comedy | Jandino Asporaat riffs on the challenges of ra... |
| 2 | 70234439 | TV Show | Transformers Prime | NaN | Peter Cullen, Sumalee Montano, Frank Welker, J... | United States | September 8, 2018 | 2013 | TV-Y7-FV | 1 Season | Kids' TV | With the help of three human allies, the Autob... |
| 3 | 80058654 | TV Show | Transformers: Robots in Disguise | NaN | Will Friedle, Darren Criss, Constance Zimmer, ... | United States | September 8, 2018 | 2016 | TV-Y7 | 1 Season | Kids' TV | When a prison ship crash unleashes hundreds of... |
| 4 | 80125979 | Movie | #realityhigh | Fernando Lebrija | Nesta Cooper, Kate Walsh, John Michael Higgins... | United States | September 8, 2017 | 2017 | TV-14 | 99 min | Comedies | When nerdy high schooler Dani finally attracts... |


### 2. Data Cleaning

#### a. Check the number of rows and columns

In [9]:
# Get the number of rows and columns in the dataset
rows, cols = netflix_data.shape

print("Number of Rows: ", rows)
print("Number of Columns: ", cols)

Number of Rows:  6234
Number of Columns:  12


#### b. Address missing values

In [5]:
# Check for missing values in each column
print(netflix_data.isnull().sum())

show_id            0
type               0
title              0
director        1969
cast             570
country          476
date_added        11
release_year       0
rating            10
duration           0
listed_in          0
description        0
dtype: int64


The "director", "cast", "country", "date_added" and "rating" columns all have missing values:

- director: 1969 missing values
- cast: 570 missing values
- country: 476 missing values
- date_added: 11 missing values
- rating: 10 missing values
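
To put counts like these in proportion, it can help to view them as a share of all rows. The sketch below is illustrative only: `toy` is a made-up stand-in for `netflix_data`, and `missing_summary` is a hypothetical name not used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for netflix_data (values are illustrative only)
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", np.nan, np.nan, "Y"],
    "rating": ["TV-MA", "TV-14", np.nan, "TV-PG"],
})

# Count and percentage of missing values per column, largest first
missing_summary = pd.DataFrame({
    "missing": toy.isnull().sum(),
    "percent": (toy.isnull().mean() * 100).round(1),
}).sort_values("missing", ascending=False)

print(missing_summary)
```

Applied to the real dataset, the same idea shows that the 1969 missing directors are roughly 32% of the 6234 rows, while the 10 missing ratings are well under 1%.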





There are a few ways to address this. First, when only a few values are missing, as in the "date_added" and "rating" columns, we can drop the affected rows. Second, we can fill the missing values with a placeholder, or with a summary statistic such as the most common value in the column. Lastly, for numeric columns with missing values, we can impute with the mean, median or mode.

More information on handling missing values can be found in the official pandas documentation: [https://pandas.pydata.org/docs/user_guide/missing_data.html], or in this useful article on Medium: [https://learner-cares.medium.com/handy-pandas-python-library-for-handling-missing-values-dc5f0d1ebf82]
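
As a rough illustration of these options, here is a minimal sketch on a toy DataFrame. The data is made up, and the `runtime_min` column is hypothetical (the Netflix dataset has no such numeric column); it is only there to show numeric imputation.

```python
import numpy as np
import pandas as pd

# Toy data; 'runtime_min' is a hypothetical numeric column for illustration
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", np.nan, "Y", np.nan],
    "rating": ["TV-MA", np.nan, "TV-MA", "TV-14"],
    "runtime_min": [90.0, np.nan, 100.0, 110.0],
})

# Option 1: drop the rows where a sparsely missing column is empty
dropped = toy.dropna(subset=["rating"])

# Option 2a: fill with a placeholder
with_placeholder = toy.assign(director=toy["director"].fillna("Unknown"))

# Option 2b: fill with the most common value (the mode)
most_common_rating = toy["rating"].mode()[0]
with_mode = toy.assign(rating=toy["rating"].fillna(most_common_rating))

# Option 3: impute a numeric column with its median
median_runtime = toy["runtime_min"].median()
with_median = toy.assign(runtime_min=toy["runtime_min"].fillna(median_runtime))

print(len(toy), len(dropped))
```

Each option returns a new DataFrame and leaves `toy` unchanged, which makes it easy to compare the strategies before committing to one.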

In [13]:
# First we check the data type for each column
data_types = netflix_data.dtypes
print(data_types)

show_id          int64
type            object
title           object
director        object
cast            object
country         object
date_added      object
release_year     int64
rating          object
duration        object
listed_in       object
description     object
dtype: object


In [14]:
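# Fill the missing values for director, cast and country with the placeholder 'Unknown'
# (assign the result back instead of calling fillna(inplace=True) on a column selection,
# which newer pandas versions warn about)
netflix_data['director'] = netflix_data['director'].fillna('Unknown')
netflix_data['cast'] = netflix_data['cast'].fillna('Unknown')
netflix_data['country'] = netflix_data['country'].fillna('Unknown')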

In [15]:
# Fill in the missing values for the date and rating columns with 'Not Available' and 'Unrated' respectively
netflix_data['date_added'] = netflix_data['date_added'].fillna('Not Available')
netflix_data['rating'] = netflix_data['rating'].fillna('Unrated')

In [16]:
# Convert 'date_added' to datetime format
netflix_data['date_added'] = pd.to_datetime(netflix_data['date_added'], errors='coerce')

In [17]:
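# Confirm the missing values have been handled
# (date_added still shows 11: the 'Not Available' placeholders cannot be parsed as dates,
# so errors='coerce' turned them into NaT)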
cleaned_netflix_data = netflix_data.isnull().sum()
cleaned_netflix_data

show_id          0
type             0
title            0
director         0
cast             0
country          0
date_added      11
release_year     0
rating           0
duration         0
listed_in        0
description      0
dtype: int64
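
The 11 remaining missing values in "date_added" come from the datetime conversion. The sketch below reproduces the effect on a small, made-up Series; `dates`, `parsed` and `missing_date_mask` are illustrative names only.

```python
import pandas as pd

# One real-looking date, one text placeholder, one true missing value
dates = pd.Series(["September 9, 2019", "Not Available", None])

# errors='coerce' converts anything that cannot be parsed into NaT
parsed = pd.to_datetime(dates, errors="coerce")

# Both the placeholder and the original missing value are now NaT
missing_date_mask = parsed.isna()
print(missing_date_mask.tolist())
```

In a datetime column, NaT is the natural marker for "no date", so a text placeholder adds nothing once the column is converted. If the rows with no date need to be excluded later, a mask like `missing_date_mask` can be used to filter them.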

### 3. Data Exploration

In [21]:
# Display descriptive statistics for numerical columns
numerical_summary = netflix_data.describe()
numerical_summary 

| | show_id | release_year |
|---|---|---|
| count | 6234.0 | 6234.0 |
| mean | 76703680.0 | 2013.35932 |
| std | 10942960.0 | 8.81162 |
| min | 247747.0 | 1925.0 |
| 25% | 80035800.0 | 2013.0 |
| 50% | 80163370.0 | 2016.0 |
| 75% | 80244890.0 | 2018.0 |
| max | 81235730.0 | 2020.0 |


In [22]:
# Summary statistics for numeric and categorical data
print(netflix_data.describe())
print(netflix_data.describe(include=['object']))

            show_id  release_year
count  6.234000e+03    6234.00000
mean   7.670368e+07    2013.35932
std    1.094296e+07       8.81162
min    2.477470e+05    1925.00000
25%    8.003580e+07    2013.00000
50%    8.016337e+07    2016.00000
75%    8.024489e+07    2018.00000
max    8.123573e+07    2020.00000
         type      title director     cast        country rating  duration  \
count    6234       6234     6234     6234           6234   6234      6234   
unique      2       6172     3302     5470            555     15       201   
top     Movie  Limitless  Unknown  Unknown  United States  TV-MA  1 Season   
freq     4265          3     1969      570           2032   2027      1321   

            listed_in                                        description  
count            6234                                               6234  
unique            461                                               6226  
top     Documentaries  A surly septuagenarian gets another chance at ...  
fre

### 4. Data Visualization

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Count genres, split if there are multiple genres per title
genres = netflix_data['listed_in'].str.split(', ').explode().value_counts().head(10)

# Plot top genres
plt.figure(figsize=(12, 6))
sns.barplot(x=genres.values, y=genres.index, palette='viridis')
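# The counts measure how many titles are listed under each genre, not how often they are watched
plt.title('Top 10 Genres on Netflix by Number of Titles')
plt.xlabel('Number of Titles')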
plt.ylabel('Genre')
plt.show()
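
To see what the split, explode and count chain does before plotting, here is a small sketch. It uses the "listed_in" values from the five rows shown in section 1; `listed_in` and `genre_counts` are illustrative names only.

```python
import pandas as pd

# 'listed_in' values taken from the first five rows of the dataset
listed_in = pd.Series([
    "Children & Family Movies, Comedies",
    "Stand-Up Comedy",
    "Kids' TV",
    "Kids' TV",
    "Comedies",
])

# Split each string into a list of genres, put each genre on its own row,
# then count how many titles are listed under each genre
genre_counts = listed_in.str.split(", ").explode().value_counts()

print(genre_counts)
```

A title listed under several genres is counted once per genre, so the counts add up to more than the number of titles (6 genre entries for 5 titles here).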