### Importing the required libraries

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [33]:
df = pd.read_csv("amazon_prime_titles.csv")
df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,The Grand Seduction,Don McKellar,"Brendan Gleeson, Taylor Kitsch, Gordon Pinsent",Canada,"March 30, 2021",2014,,113 min,"Comedy, Drama",A small fishing village must procure a local d...
1,s2,Movie,Take Care Good Night,Girish Joshi,"Mahesh Manjrekar, Abhay Mahajan, Sachin Khedekar",India,"March 30, 2021",2018,13+,110 min,"Drama, International",A Metro Family decides to fight a Cyber Crimin...
2,s3,Movie,Secrets of Deception,Josh Webber,"Tom Sizemore, Lorenzo Lamas, Robert LaSardo, R...",United States,"March 30, 2021",2017,,74 min,"Action, Drama, Suspense",After a man discovers his wife is cheating on ...
3,s4,Movie,Pink: Staying True,Sonia Anderson,"Interviews with: Pink, Adele, Beyoncé, Britney...",United States,"March 30, 2021",2014,,69 min,Documentary,"Pink breaks the mold once again, bringing her ..."
4,s5,Movie,Monster Maker,Giles Foster,"Harry Dean Stanton, Kieran O'Brien, George Cos...",United Kingdom,"March 30, 2021",1989,,45 min,"Drama, Fantasy",Teenage Matt Banting wants to work with a famo...


In [3]:
df.shape

(9668, 12)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9668 entries, 0 to 9667
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       9668 non-null   object
 1   type          9668 non-null   object
 2   title         9668 non-null   object
 3   director      7586 non-null   object
 4   cast          8435 non-null   object
 5   country       672 non-null    object
 6   date_added    155 non-null    object
 7   release_year  9668 non-null   int64 
 8   rating        9331 non-null   object
 9   duration      9668 non-null   object
 10  listed_in     9668 non-null   object
 11  description   9668 non-null   object
dtypes: int64(1), object(11)
memory usage: 906.5+ KB


### Let's check the sanity of the data

Let us now look at what percentage of the values in each column is missing

In [35]:
round(100*(df.isnull().sum(axis=0)/len(df.index)),2)

show_id          0.00
type             0.00
title            0.00
director        21.53
cast            12.75
country         93.05
date_added      98.40
release_year     0.00
rating           3.49
duration         0.00
listed_in        0.00
description      0.00
dtype: float64
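
The same per-column percentages can also be obtained from the mean of the boolean mask, which avoids dividing by the row count by hand. The sketch below is only an illustration: `toy` is a hypothetical stand-in frame, not the Amazon Prime data, and sorting the result is an added convenience.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the titles data (assumption).
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "director": ["x", np.nan, "y", np.nan],
    "country": [np.nan, np.nan, np.nan, "IN"],
})

# isna() yields a boolean frame; the mean of each column is the
# fraction of missing values, so multiplying by 100 gives a percentage.
missing_pct = (toy.isna().mean() * 100).round(2).sort_values(ascending=False)
print(missing_pct)
```

Sorting in descending order puts the sparsest columns first, which makes the candidates for dropping easy to spot.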

### Removing the unnecessary columns

The country and date_added columns are missing in 93.05% and 98.40% of the rows respectively, so we drop them

In [36]:
df.drop(['date_added','country'],axis=1,inplace=True)
df.columns

Index(['show_id', 'type', 'title', 'director', 'cast', 'release_year',
       'rating', 'duration', 'listed_in', 'description'],
      dtype='object')
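
Instead of naming the sparse columns by hand, they could be selected with a cutoff on the missing fraction. This is a minimal sketch under stated assumptions: `toy` is a hypothetical frame, and the 50% cutoff `MAX_MISSING` is an illustrative value, not one taken from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (assumption, not the real dataset).
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "director": ["x", np.nan, "y", "z"],
    "country": [np.nan, np.nan, np.nan, "IN"],
})

MAX_MISSING = 0.5  # assumed cutoff: drop columns missing in more than half the rows

# Fraction of missing values per column.
missing_frac = toy.isna().mean()

# Names of the columns whose missing fraction exceeds the cutoff.
sparse_cols = missing_frac[missing_frac > MAX_MISSING].index.tolist()

# drop() returns a new frame here, so the original stays available for comparison.
trimmed = toy.drop(columns=sparse_cols)
print(sparse_cols, list(trimmed.columns))
```

Returning a new frame rather than using `inplace=True` keeps the cell safe to re-run, since a second run does not fail on columns that are already gone.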

### Now let's check the other values

Let us look at the missing percentages for the remaining columns

In [38]:
round(100*(df.isnull().sum(axis=0)/len(df.index)),2)

show_id          0.00
type             0.00
title            0.00
director        21.53
cast            12.75
release_year     0.00
rating           3.49
duration         0.00
listed_in        0.00
description      0.00
dtype: float64

In [39]:
df.shape

(9668, 10)

### Let us look at the share of rows with no missing data

Out of the 9668 rows, let us look at the percentage of rows that have no missing values

In [40]:
(df.dropna(axis=0).shape[0]/len(df.index))*100

68.74224244931733
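
The share of complete rows can also be computed from a row-wise mask without building the filtered frame. The following is a small sketch on a hypothetical `toy` frame (an assumption for illustration), with a check that it agrees with the `dropna` approach used above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (assumption, not the real dataset).
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "director": ["x", np.nan, "y", np.nan],
    "rating": ["13+", np.nan, np.nan, "R"],
})

# True for rows in which every column holds a value.
complete_mask = toy.notna().all(axis=1)

# The mean of a boolean Series is the fraction of True entries.
complete_pct = complete_mask.mean() * 100

# Same quantity via dropna, as in the cell above.
via_dropna = len(toy.dropna()) / len(toy) * 100
print(complete_pct, via_dropna)
```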

### Let us look at rows which are completely empty

In [43]:
df.isnull().all(axis=1).sum()

0

### From the above cell we can infer that every row contains at least one value

### Let us now look at the rows having two or more empty columns

In [76]:
len(df[df.isnull().sum(axis=1) >= 2])

629

### Dropping these rows should have little effect on the data because they make up only a small portion (about 6.5%) of the original data

In [60]:
100*(len(df[df.isnull().sum(axis=1) >= 2])/len(df.index))

6.505999172527928
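
One way to actually remove such rows is the `thresh` argument of `dropna`, which keeps only rows with at least a given number of non-null values. The sketch below uses a hypothetical `toy` frame and an assumed limit of one missing value per row; it is an illustration, not the step this notebook performs.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (assumption, not the real dataset).
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "director": ["x", np.nan, "y", np.nan],
    "rating": ["13+", np.nan, np.nan, "R"],
})

MAX_MISSING_PER_ROW = 1  # assumed limit: rows with 2+ missing values are dropped

# Count of rows that have two or more missing values.
n_sparse_rows = int((toy.isna().sum(axis=1) >= 2).sum())

# thresh is the minimum number of non-null values a row needs to be kept.
min_non_null = toy.shape[1] - MAX_MISSING_PER_ROW
kept = toy.dropna(thresh=min_non_null)
print(n_sparse_rows, len(kept))
```

Expressing the rule as a minimum count of non-null values means the same line keeps working if more columns are dropped later.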

In [64]:
df.isnull().sum()

show_id            0
type               0
title              0
director        2082
cast            1233
release_year       0
rating           337
duration           0
listed_in          0
description        0
dtype: int64

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''