### Amazon Prime Video Data Analysis

In [2]:
import numpy as np 
import pandas as pd 

In [3]:
## load the dataset
df = pd.read_csv("amazon_prime_titles.csv")
df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,The Grand Seduction,Don McKellar,"Brendan Gleeson, Taylor Kitsch, Gordon Pinsent",Canada,"March 30, 2021",2014,,113 min,"Comedy, Drama",A small fishing village must procure a local d...
1,s2,Movie,Take Care Good Night,Girish Joshi,"Mahesh Manjrekar, Abhay Mahajan, Sachin Khedekar",India,"March 30, 2021",2018,13+,110 min,"Drama, International",A Metro Family decides to fight a Cyber Crimin...
2,s3,Movie,Secrets of Deception,Josh Webber,"Tom Sizemore, Lorenzo Lamas, Robert LaSardo, R...",United States,"March 30, 2021",2017,,74 min,"Action, Drama, Suspense",After a man discovers his wife is cheating on ...
3,s4,Movie,Pink: Staying True,Sonia Anderson,"Interviews with: Pink, Adele, Beyoncé, Britney...",United States,"March 30, 2021",2014,,69 min,Documentary,"Pink breaks the mold once again, bringing her ..."
4,s5,Movie,Monster Maker,Giles Foster,"Harry Dean Stanton, Kieran O'Brien, George Cos...",United Kingdom,"March 30, 2021",1989,,45 min,"Drama, Fantasy",Teenage Matt Banting wants to work with a famo...


In [4]:
## Drop exact duplicate rows
df.drop_duplicates(inplace=True)

In [6]:
df.columns

Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description'],
      dtype='object')

In [7]:
## Standardize column names
df.columns = df.columns.str.lower().str.replace(" ", "_")

In [10]:
## Handle Missing Values
string_cols = ["director", "cast", "country", "rating", "duration", "listed_in", "description"]

for col in string_cols:
    df[col] = df[col].fillna("Unknown").str.strip()
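
Before filling, it can help to see how many values each column is actually missing. The sketch below runs the same fill-and-strip step on a small made-up frame (`toy` and its values are hypothetical, not taken from the dataset) and compares the missing counts before and after.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "director": ["Don McKellar", None, "  Josh Webber "],
    "rating": [None, "13+", None],
})

# Count missing values per column before any filling.
missing_before = toy.isna().sum()

# Same fill-and-strip step as in the cell above, applied to the toy frame.
for col in ["director", "rating"]:
    toy[col] = toy[col].fillna("Unknown").str.strip()

# After filling, no column should report missing values.
missing_after = toy.isna().sum()

print(missing_before)
print(missing_after)
```

Keeping the "before" counts around makes it easier to judge later how much of a column is really the `"Unknown"` placeholder.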

In [11]:
## Clean Date Column: date_added
df["date_added"] = pd.to_datetime(df["date_added"], errors="coerce")

df["year_added"] = df["date_added"].dt.year
df["month_added"] = df["date_added"].dt.month
df["day_added"] = df["date_added"].dt.day

# Assign the result back instead of calling fillna(inplace=True) on a column selection;
# the inplace form acts on a temporary copy and stops working in pandas 3.0.
df["year_added"] = df["year_added"].fillna(df["year_added"].mode()[0])
# 0 is a placeholder meaning "month/day unknown", not a real calendar value.
df["month_added"] = df["month_added"].fillna(0)
df["day_added"] = df["day_added"].fillna(0)
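
One possible alternative for the date parts, shown on made-up data: pandas' nullable `Int64` dtype keeps a missing year as `<NA>` rather than turning the whole column into floats such as `2021.0`. The `toy` frame is hypothetical, and the explicit `"%B %d, %Y"` format is an assumption based on the sample rows shown earlier.

```python
import pandas as pd

# Toy data; the "%B %d, %Y" format is an assumption based on the sample rows.
toy = pd.DataFrame({"date_added": ["March 30, 2021", None, "not a date"]})
toy["date_added"] = pd.to_datetime(
    toy["date_added"], format="%B %d, %Y", errors="coerce"
)

# Nullable integers keep missing years as <NA> instead of forcing floats.
toy["year_added"] = toy["date_added"].dt.year.astype("Int64")

# Assignment form of fillna: the result is written back to the column explicitly.
toy["month_added"] = toy["date_added"].dt.month.fillna(0).astype(int)

print(toy)
```

Whether to impute the year with the mode or leave it missing is an analysis decision; leaving it as `<NA>` keeps imputed and observed values distinguishable.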

In [12]:
## Clean Duration Column
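def get_duration_value(x):
    """Return the leading number of a duration such as "113 min", or NaN."""
    try:
        return int(x.split()[0])
    except (AttributeError, IndexError, ValueError):
        # Non-string input, empty string, or a non-numeric first token (e.g. "Unknown")
        return np.nan

def get_duration_unit(x):
    """Return the unit token of a duration such as "min", or "Unknown"."""
    try:
        return x.split()[1]
    except (AttributeError, IndexError):
        # Non-string input or no second token
        return "Unknown"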

df["duration_value"] = df["duration"].apply(get_duration_value)
df["duration_unit"] = df["duration"].apply(get_duration_unit)

# Fill unparseable durations with the median (computed across minutes and seasons together).
# Assign the result back; fillna(inplace=True) on a column selection stops working in pandas 3.0.
df["duration_value"] = df["duration_value"].fillna(df["duration_value"].median())

# Normalize units
df["duration_unit"] = df["duration_unit"].replace({
    "min": "Minutes",
    "Season": "Seasons"
})
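
The duration parsing can also be done without `apply`, using a single regular expression. This is a minimal sketch on a made-up Series; the pattern assumes every valid duration looks like `<number> <unit>`, as in the sample rows, and the names `durations` and `parts` are hypothetical.

```python
import pandas as pd

# Made-up examples covering the formats seen in the sample rows.
durations = pd.Series(["113 min", "4 Seasons", "1 Season", "Unknown"])

# Named groups become the columns "value" and "unit"; non-matching rows give NaN.
parts = durations.str.extract(r"^(?P<value>\d+)\s+(?P<unit>\w+)$")

# Convert the numeric part and normalize the unit labels.
parts["value"] = pd.to_numeric(parts["value"])
parts["unit"] = (
    parts["unit"]
    .replace({"min": "Minutes", "Season": "Seasons"})
    .fillna("Unknown")
)

print(parts)
```

Rows that do not match the pattern stay `NaN` in `value`, so the decision about how to impute them remains explicit.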


In [13]:
## Split Multi-valued Columns
df["genre_list"] = df["listed_in"].apply(lambda x: [g.strip() for g in x.split(",")])
df["country_list"] = df["country"].apply(lambda x: [c.strip() for c in x.split(",")])

In [14]:
## Extract Genre Count
df["genre_count"] = df["genre_list"].apply(len)
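
With the genres stored as lists, one common follow-up is counting how often each genre occurs. The sketch below uses `DataFrame.explode` on a small made-up frame (`toy` is hypothetical) to turn each list element into its own row before counting.

```python
import pandas as pd

# Toy frame with the same "listed_in" layout as the dataset (values are made up).
toy = pd.DataFrame({
    "title": ["a", "b", "c"],
    "listed_in": ["Comedy, Drama", "Drama, International", "Documentary"],
})

# Same split as above: one list of stripped genre names per title.
toy["genre_list"] = toy["listed_in"].apply(
    lambda x: [g.strip() for g in x.split(",")]
)

# explode() gives one row per (title, genre) pair, which value_counts() can tally.
genre_counts = toy.explode("genre_list")["genre_list"].value_counts()

print(genre_counts)
```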

In [15]:
## Clean Text Columns
text_columns = ["title", "director", "cast", "listed_in", "description"]
for col in text_columns:
    df[col] = df[col].str.lower().str.strip()

In [16]:
## Remove Outliers
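# Keep rows strictly below the 99th percentile of duration_value.
# .copy() makes the result an independent frame, avoiding SettingWithCopyWarning on later assignments.
df = df[df["duration_value"] < df["duration_value"].quantile(0.99)].copy()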
df

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,year_added,month_added,day_added,duration_value,duration_unit,genre_list,country_list,genre_count
0,s1,Movie,the grand seduction,don mckellar,"brendan gleeson, taylor kitsch, gordon pinsent",Canada,2021-03-30,2014,Unknown,113 min,"comedy, drama",a small fishing village must procure a local d...,2021.0,3.0,30.0,113,Minutes,"[Comedy, Drama]",[Canada],2
1,s2,Movie,take care good night,girish joshi,"mahesh manjrekar, abhay mahajan, sachin khedekar",India,2021-03-30,2018,13+,110 min,"drama, international",a metro family decides to fight a cyber crimin...,2021.0,3.0,30.0,110,Minutes,"[Drama, International]",[India],2
2,s3,Movie,secrets of deception,josh webber,"tom sizemore, lorenzo lamas, robert lasardo, r...",United States,2021-03-30,2017,Unknown,74 min,"action, drama, suspense",after a man discovers his wife is cheating on ...,2021.0,3.0,30.0,74,Minutes,"[Action, Drama, Suspense]",[United States],3
3,s4,Movie,pink: staying true,sonia anderson,"interviews with: pink, adele, beyoncé, britney...",United States,2021-03-30,2014,Unknown,69 min,documentary,"pink breaks the mold once again, bringing her ...",2021.0,3.0,30.0,69,Minutes,[Documentary],[United States],1
4,s5,Movie,monster maker,giles foster,"harry dean stanton, kieran o'brien, george cos...",United Kingdom,2021-03-30,1989,Unknown,45 min,"drama, fantasy",teenage matt banting wants to work with a famo...,2021.0,3.0,30.0,45,Minutes,"[Drama, Fantasy]",[United Kingdom],2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9663,s9664,Movie,pride of the bowery,joseph h. lewis,"leo gorcey, bobby jordan",Unknown,NaT,1940,7+,60 min,comedy,new york city street principles get an east si...,2021.0,0.0,0.0,60,Minutes,[Comedy],[Unknown],1
9664,s9665,TV Show,planet patrol,unknown,"dick vosburgh, ronnie stevens, libby morris, m...",Unknown,NaT,2018,13+,4 Seasons,tv shows,"this is earth, 2100ad - and these are the adve...",2021.0,0.0,0.0,4,Seasons,[TV Shows],[Unknown],1
9665,s9666,Movie,outpost,steve barker,"ray stevenson, julian wadham, richard brake, m...",Unknown,NaT,2008,R,90 min,action,"in war-torn eastern europe, a world-weary grou...",2021.0,0.0,0.0,90,Minutes,[Action],[Unknown],1
9666,s9667,TV Show,maradona: blessed dream,unknown,"esteban recagno, ezequiel stremiz, luciano vit...",Unknown,NaT,2021,TV-MA,1 Season,"drama, sports","the series tells the story of diego maradona, ...",2021.0,0.0,0.0,1,Seasons,"[Drama, Sports]",[Unknown],2
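
Because `duration_value` mixes minutes (movies) and seasons (TV shows), a single 99th-percentile cutoff is driven almost entirely by movie runtimes. A possible refinement, sketched on made-up numbers, is to compute the cutoff separately per `duration_unit`. The `toy` frame and its values are hypothetical.

```python
import pandas as pd

# Made-up durations: four movies (minutes) and four shows (seasons).
toy = pd.DataFrame({
    "duration_value": [90, 100, 110, 600, 1, 2, 3, 30],
    "duration_unit": ["Minutes"] * 4 + ["Seasons"] * 4,
})

# Single global cutoff, as in the cell above.
global_trim = toy[toy["duration_value"] < toy["duration_value"].quantile(0.99)]

# Per-unit cutoff: each row is compared with the 99th percentile of its own unit.
cutoff = toy.groupby("duration_unit")["duration_value"].transform(
    lambda s: s.quantile(0.99)
)
trimmed = toy[toy["duration_value"] < cutoff].copy()

print(len(global_trim), len(trimmed))
```

In this toy data the global cutoff removes only the 600-minute movie, while the per-unit cutoff also removes the 30-season show.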


In [17]:
## Convert release_year to int
df["release_year"] = pd.to_numeric(df["release_year"], errors='coerce')
# Fill gaps with the median, then round and cast so the column really is integer-typed.
# Assigning the result back replaces fillna(inplace=True), which stops working in pandas 3.0.
df["release_year"] = df["release_year"].fillna(df["release_year"].median()).round().astype(int)


In [19]:
## Create the final DataFrame (a copy of df; no columns are dropped at this step)
final_df = df.copy()
final_df

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,year_added,month_added,day_added,duration_value,duration_unit,genre_list,country_list,genre_count
0,s1,Movie,the grand seduction,don mckellar,"brendan gleeson, taylor kitsch, gordon pinsent",Canada,2021-03-30,2014,Unknown,113 min,"comedy, drama",a small fishing village must procure a local d...,2021.0,3.0,30.0,113,Minutes,"[Comedy, Drama]",[Canada],2
1,s2,Movie,take care good night,girish joshi,"mahesh manjrekar, abhay mahajan, sachin khedekar",India,2021-03-30,2018,13+,110 min,"drama, international",a metro family decides to fight a cyber crimin...,2021.0,3.0,30.0,110,Minutes,"[Drama, International]",[India],2
2,s3,Movie,secrets of deception,josh webber,"tom sizemore, lorenzo lamas, robert lasardo, r...",United States,2021-03-30,2017,Unknown,74 min,"action, drama, suspense",after a man discovers his wife is cheating on ...,2021.0,3.0,30.0,74,Minutes,"[Action, Drama, Suspense]",[United States],3
3,s4,Movie,pink: staying true,sonia anderson,"interviews with: pink, adele, beyoncé, britney...",United States,2021-03-30,2014,Unknown,69 min,documentary,"pink breaks the mold once again, bringing her ...",2021.0,3.0,30.0,69,Minutes,[Documentary],[United States],1
4,s5,Movie,monster maker,giles foster,"harry dean stanton, kieran o'brien, george cos...",United Kingdom,2021-03-30,1989,Unknown,45 min,"drama, fantasy",teenage matt banting wants to work with a famo...,2021.0,3.0,30.0,45,Minutes,"[Drama, Fantasy]",[United Kingdom],2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9663,s9664,Movie,pride of the bowery,joseph h. lewis,"leo gorcey, bobby jordan",Unknown,NaT,1940,7+,60 min,comedy,new york city street principles get an east si...,2021.0,0.0,0.0,60,Minutes,[Comedy],[Unknown],1
9664,s9665,TV Show,planet patrol,unknown,"dick vosburgh, ronnie stevens, libby morris, m...",Unknown,NaT,2018,13+,4 Seasons,tv shows,"this is earth, 2100ad - and these are the adve...",2021.0,0.0,0.0,4,Seasons,[TV Shows],[Unknown],1
9665,s9666,Movie,outpost,steve barker,"ray stevenson, julian wadham, richard brake, m...",Unknown,NaT,2008,R,90 min,action,"in war-torn eastern europe, a world-weary grou...",2021.0,0.0,0.0,90,Minutes,[Action],[Unknown],1
9666,s9667,TV Show,maradona: blessed dream,unknown,"esteban recagno, ezequiel stremiz, luciano vit...",Unknown,NaT,2021,TV-MA,1 Season,"drama, sports","the series tells the story of diego maradona, ...",2021.0,0.0,0.0,1,Seasons,"[Drama, Sports]",[Unknown],2


In [20]:
## Export cleaned dataset
final_df.to_csv("amazon_prime_cleaned_preprocessed.csv", index=False)

print("DATA CLEANING + PREPROCESSING COMPLETED.")
final_df.head()

DATA CLEANING + PREPROCESSING COMPLETED.


Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,year_added,month_added,day_added,duration_value,duration_unit,genre_list,country_list,genre_count
0,s1,Movie,the grand seduction,don mckellar,"brendan gleeson, taylor kitsch, gordon pinsent",Canada,2021-03-30,2014,Unknown,113 min,"comedy, drama",a small fishing village must procure a local d...,2021.0,3.0,30.0,113,Minutes,"[Comedy, Drama]",[Canada],2
1,s2,Movie,take care good night,girish joshi,"mahesh manjrekar, abhay mahajan, sachin khedekar",India,2021-03-30,2018,13+,110 min,"drama, international",a metro family decides to fight a cyber crimin...,2021.0,3.0,30.0,110,Minutes,"[Drama, International]",[India],2
2,s3,Movie,secrets of deception,josh webber,"tom sizemore, lorenzo lamas, robert lasardo, r...",United States,2021-03-30,2017,Unknown,74 min,"action, drama, suspense",after a man discovers his wife is cheating on ...,2021.0,3.0,30.0,74,Minutes,"[Action, Drama, Suspense]",[United States],3
3,s4,Movie,pink: staying true,sonia anderson,"interviews with: pink, adele, beyoncé, britney...",United States,2021-03-30,2014,Unknown,69 min,documentary,"pink breaks the mold once again, bringing her ...",2021.0,3.0,30.0,69,Minutes,[Documentary],[United States],1
4,s5,Movie,monster maker,giles foster,"harry dean stanton, kieran o'brien, george cos...",United Kingdom,2021-03-30,1989,Unknown,45 min,"drama, fantasy",teenage matt banting wants to work with a famo...,2021.0,3.0,30.0,45,Minutes,"[Drama, Fantasy]",[United Kingdom],2
