# Exploratory Data Analysis of Prime movies and TV shows

Let's import the relevant libraries

In [36]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [37]:
df = pd.read_csv("/kaggle/input/amazon-prime-movies-and-tv-shows/amazon_prime_titles.csv")
df.head(10)

### Investigating Data
Let's first check the total number of rows and columns

In [38]:
df.shape

In [39]:
df.describe()

In [40]:
df.columns

Let's look at the number of missing entries in each column of the dataframe

In [41]:
df.isnull().sum()
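
Raw counts are easier to judge as a share of all rows. The sketch below uses a small made-up frame (`toy` and its values are hypothetical, not taken from the Prime dataset) to show one way of expressing missing entries as a percentage per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Prime titles data
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "country": ["India", np.nan, np.nan, "Canada"],
    "rating": ["13+", "16+", np.nan, "ALL"],
})

# The mean of the boolean mask is the fraction of missing values per column
missing_pct = (toy.isnull().mean() * 100).round(1)
print(missing_pct.sort_values(ascending=False))
```

The same expression should work on `df` directly, which makes it easier to see which columns are mostly empty.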

### Data Cleaning
Here we will follow a few steps to make the data suitable for EDA:
* Drop the "show_id" column
* Check for duplicate records of movies/TV shows and delete duplicates
* Replace NaN values in "cast" and "director" columns with "Unavailable"
* Replace NaN values in "rating" with mode of the column
* As the "country" column has many missing values, fill some of the missing entries based on the genre of the movie/show. For example, for the genre Anime the country can be filled in as "Japan", and for the genre Western as "United States"
* For missing values in "date_added", replace them with January 1st of the title's release year
* Convert "date_added" into datetime datatype 

In [42]:
df.drop("show_id" , axis=1 , inplace=True)
df.head()

In [43]:
df = df.drop_duplicates(["title"])
print("Duplicates removed")
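
The message above does not say how many rows were dropped. A minimal sketch of counting duplicates before removing them, on a hypothetical toy frame (the titles here are invented for illustration):

```python
import pandas as pd

# Hypothetical toy frame; "Alpha" appears twice
toy = pd.DataFrame({
    "title": ["Alpha", "Beta", "Alpha", "Gamma"],
    "type": ["Movie", "TV Show", "Movie", "Movie"],
})

# duplicated() flags every repeat of a title after its first occurrence
n_duplicates = int(toy.duplicated(subset=["title"]).sum())
deduped = toy.drop_duplicates(subset=["title"])

print(f"{n_duplicates} duplicate title(s) removed, {len(deduped)} rows remain")
```

Note that matching on "title" alone treats a movie and a TV show with the same name as duplicates, which may or may not be what is wanted.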

In [44]:
df["cast"] = df["cast"].replace(np.nan , "Unavailable")
df["director"] = df["director"].replace(np.nan , "Unavailable")

In [45]:
df["rating"] = df["rating"].fillna(df["rating"].mode()[0])

In [46]:
# drop=True discards the old index instead of adding it as a new "index" column
df = df.reset_index(drop=True)
df.head()

In [47]:
# Fill each missing date with January 1st of that row's own release year
df["date_added"] = df["date_added"].fillna("January 01, " + df["release_year"].astype(str))

In [48]:
import re
months = {'January':1 , 'February':2 , 'March':3 , 'April':4 , 'May':5 , 'June':6 , 'July':7 , 'August':8 , 
          'September':9 , 'October':10 , 'November':11 , 'December':12}

date_list = []

for i in df["date_added"]:
    # Capture month name, day and year from strings of the form "Month DD, YYYY"
    month, day, year = re.search(r'([a-zA-Z]+)\s([0-9]+),\s([0-9]+)', i).groups()
    # Build a zero-padded year-month-day string so the datetime conversion is unambiguous
    dates = '{}-{:02d}-{:02d}'.format(year, months[month], int(day))
    date_list.append(dates)

In [49]:
df["date_added_cleaned"] = date_list

In [50]:
df['date_added_cleaned']=df['date_added_cleaned'].astype('datetime64[ns]')
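
As an alternative to parsing the date pieces by hand, pandas can parse such strings in one call. This is only a sketch on made-up values, and it assumes the dates follow the "Month DD, YYYY" pattern, possibly with stray whitespace:

```python
import pandas as pd

# Hypothetical sample strings in the assumed "Month DD, YYYY" layout
raw = pd.Series([" March 30, 2021", "January 01, 2014"])

# Strip surrounding whitespace, then parse with an explicit format
parsed = pd.to_datetime(raw.str.strip(), format="%B %d, %Y")
print(parsed)
```

An explicit `format` makes the parse fail loudly on unexpected strings instead of guessing the day/month order.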

In [51]:
for j in df.index:
    # NaN never compares equal to anything (even itself), so test with pd.isna
    if pd.isna(df.loc[j, 'country']):
        # Lower-case the genre string once so the match is case-insensitive
        genres = df.loc[j, 'listed_in'].lower()
        if 'anime' in genres:
            print(j)
            df.loc[j, 'country'] = 'Japan'
        elif 'western' in genres:
            print(j)
            df.loc[j, 'country'] = 'United States'
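
The same genre-based fill can also be written without an explicit loop. The sketch below runs on a hypothetical toy frame; it detects missing countries with `isna()` (a comparison against `np.nan` is always false) and matches genres case-insensitively.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; only rows with a missing country should change
toy = pd.DataFrame({
    "country": [np.nan, np.nan, "India", np.nan],
    "listed_in": ["Anime, Kids", "Western", "Anime", "Drama"],
})

missing = toy["country"].isna()
is_anime = toy["listed_in"].str.contains("anime", case=False)
is_western = toy["listed_in"].str.contains("western", case=False)

# Anime takes priority over Western, mirroring an if/elif chain
toy.loc[missing & is_anime, "country"] = "Japan"
toy.loc[missing & ~is_anime & is_western, "country"] = "United States"
print(toy)
```

Rows whose genre matches neither keyword keep their missing country.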


In [52]:
df.head()

Now that the data is cleaned, we can proceed with the analysis.

### Exploratory Data Analysis

Looking at the data, we can come up with the following questions:
1. What does each rating category mean?
2. How many movies and TV shows does Amazon Prime have in each rating category?
3. Which rating categories are the most common among TV shows and movies?
4. Which categories are the most common in each country?
5. Which is more common on Prime: movies or TV shows?
6. What type of content has Prime focused on in recent years?

In [53]:
df["rating"].unique()

Let's find out what each of the above categories means:
* 13+: A stronger caution for parents that the content may not be appropriate for children under 13 (pre-teen ages).
* 18+ or AGES_18_: The content is restricted to viewers 18 years and over (adults).
* 16+ or AGES_16_: The content is restricted to viewers 16 years and over.
* TV-MA: This program is specifically designed to be viewed by adults and therefore may be unsuitable for children under 17.
* TV-14: This program contains some material that many parents would find unsuitable for children under 14 years of age.
* TV-PG: This program contains material that parents may find unsuitable for younger children.
* R: Under 17 requires an accompanying parent or adult guardian. Parents are urged to learn more about the film before taking their young children with them.
* PG-13: Some material may be inappropriate for children under 13 (pre-teenagers). Parents are urged to be cautious.
* NOT_RATE or UNRATED: The film has not been submitted for a rating, or is an uncut version of a film that was submitted.
* PG: Some material may not be suitable for children. May contain some material parents might not like for their young children.
* TV-Y7 or 7+: This program is designed for children age 7 and above.
* TV-G: This program is suitable for all ages.
* TV-Y: Programs rated TV-Y are designed to be appropriate for children of all ages. The thematic elements portrayed in programs with this rating are specifically designed for a very young audience, including children ages 2-6.
* TV-Y7-FV: Recommended for ages 7 and older, with the unique advisory that the program contains fantasy violence.
* G or ALL or ALL_AGES: All ages admitted. Nothing that would offend parents for viewing by children.
* NC-17: No one 17 and under admitted. Clearly adult. Children are not admitted.

In [54]:
# We can combine some rating categories into a single category
# For example: NOT_RATE and UNRATED can both be labelled UNRATED

rating_map = {
    "NOT_RATE": "UNRATED",
    "AGES_18_": "18+",
    "AGES_16_": "16+",
    "16": "16+",
    "ALL_AGES": "G",
}
# Ratings that are not keys in the mapping are left unchanged
df["rating"] = df["rating"].replace(rating_map)


In [55]:
plt.figure(figsize=(25,10))
sns.set_theme(style="darkgrid")
sns.countplot(x="rating", hue="type" , data=df, palette="Set2")
plt.legend(loc='upper center', title="Type", fontsize=15)

In [56]:
# To find the most popular rating category

df["rating"].value_counts().plot(color = "red")

In [57]:
plt.figure(figsize=(10,6))
df["rating"].value_counts(normalize =True).plot(kind="bar", color = "purple")
plt.xlabel("Ratings category")
plt.ylabel("Frequency distribution")
plt.title("Distribution of Prime movies & shows across ratings" , fontsize=20)

**Movies and shows rated 13+ are the most common on Prime, followed by 16+ and others**

In [58]:
df["country"].value_counts().sort_values(ascending=False)

**As many entries list a group of countries rather than a single one, let's focus on the top 4 countries for analysis**
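
Another way to handle entries that list several countries is to split them so each country is counted separately. This is a sketch on invented values, and it assumes multi-country entries are comma-separated, which should be checked against the real column first:

```python
import pandas as pd

# Hypothetical "country" values; some rows list more than one country
toy = pd.Series(
    ["United States", "United States, India", None, "India",
     "United Kingdom, United States"],
    name="country",
)

# Split on commas, give each country its own row, trim spaces, then count
per_country = (
    toy.dropna()
       .str.split(",")
       .explode()
       .str.strip()
       .value_counts()
)
print(per_country)
```

With this approach a co-produced title counts once for every country it lists, so the totals can exceed the number of titles.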

In [59]:
top4_countries = ["United States", "India", "United Kingdom", "Canada"]
df_top4 = df[df["country"].isin(top4_countries)]
df_top4.head(10)

In [60]:
plt.figure(figsize=(6,4))
sns.countplot(x="country" , hue="type" , data=df_top4, palette="hls")
plt.title("Comparing the content type available in top 4 countries", fontsize=12)

In [61]:
plt.figure(figsize=(20,8))
sns.set_theme(style = "white")
sns.stripplot(x="country" , y="rating" , hue="type", data=df_top4 , order=["United States" , "India" , "United Kingdom" , "Canada"])
plt.legend(loc="upper left" , title="Type" , fontsize=12)
plt.title("Distribution of rating-based movies and shows in top 4 countries", fontsize=15)

**We observe that in India, Amazon Prime has considerably more Movies than TV Shows. Next in line is the United States, but here the number of TV Shows is the highest of the top four countries.**

In [62]:
df["type"].value_counts(normalize=True)

In [63]:
df["year"]=df["date_added_cleaned"].dt.year
df.head()

In [64]:
df.groupby("year")["type"].value_counts(normalize=True)*100
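
The yearly shares can be easier to compare as a table with one row per year and one column per type. A minimal sketch using `pd.crosstab` on hypothetical toy data (the years and counts are made up):

```python
import pandas as pd

# Hypothetical toy data: year added and content type
toy = pd.DataFrame({
    "year": [2020, 2020, 2021, 2021, 2021, 2021],
    "type": ["Movie", "TV Show", "Movie", "Movie", "Movie", "TV Show"],
})

# normalize="index" makes each row (year) sum to 1; scale to percentages
share = pd.crosstab(toy["year"], toy["type"], normalize="index") * 100
print(share)
```

Each row then reads as the Movie versus TV Show split for that year.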

**From the above analysis we can conclude that Movies make up about 81% of the content on Amazon Prime and TV Shows the remaining 19%.** 

**Out of all the content available on Prime, movies and shows rated 13+ are the most common, followed by 16+ and others.**
**To be more specific, there are considerably more 13+ rated movies than 13+ rated TV shows.**

**Amazon Prime has the most content available from the United States, followed by India, the United Kingdom and Canada**
