# Amazon Prime Data Analysis

Amazon Prime offers a lot of data to analyze, because the platform has consistently adapted to changing business needs, shifting its business model from on-demand video rental toward the production of its own original shows.

I will look at some of the most important aspects of the Amazon Prime data to understand what works best for the business. Some of the most important questions that I can explore with this data are:

1. Understand what content is available
2. Understand the similarities between the content
3. Understand the network between actors and directors
4. Understand what exactly Amazon Prime is focusing on
5. Analyze the sentiment of the content available on Amazon Prime

# Amazon Prime Data Analysis with Python

The dataset I use for this Amazon Prime data analysis task consists of TV shows and movies available on Amazon Prime as of 2021. The dataset is provided by Kaggle.


I'll start this Amazon Prime data analysis task by importing the dataset and all the Python libraries needed for it:

In [3]:
import numpy as np
import pandas as pd
import plotly.express as px
from textblob import TextBlob

In [4]:
df=pd.read_csv("C:\\Users\\velsa\\Downloads\\amazon_prime_titles.csv\\amazon_prime_titles.csv")
df.shape

(9668, 12)

So the data consists of 9668 rows and 12 columns. Now let's look at the column names:

In [4]:
df.columns

Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description'],
      dtype='object')

In [5]:
df.head(5)

| | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | s1 | Movie | The Grand Seduction | Don McKellar | Brendan Gleeson, Taylor Kitsch, Gordon Pinsent | Canada | March 30, 2021 | 2014 | | 113 min | Comedy, Drama | A small fishing village must procure a local d... |
| 1 | s2 | Movie | Take Care Good Night | Girish Joshi | Mahesh Manjrekar, Abhay Mahajan, Sachin Khedekar | India | March 30, 2021 | 2018 | 13+ | 110 min | Drama, International | A Metro Family decides to fight a Cyber Crimin... |
| 2 | s3 | Movie | Secrets of Deception | Josh Webber | Tom Sizemore, Lorenzo Lamas, Robert LaSardo, R... | United States | March 30, 2021 | 2017 | | 74 min | Action, Drama, Suspense | After a man discovers his wife is cheating on ... |
| 3 | s4 | Movie | Pink: Staying True | Sonia Anderson | Interviews with: Pink, Adele, Beyoncé, Britney... | United States | March 30, 2021 | 2014 | | 69 min | Documentary | Pink breaks the mold once again, bringing her ... |
| 4 | s5 | Movie | Monster Maker | Giles Foster | Harry Dean Stanton, Kieran O'Brien, George Cos... | United Kingdom | March 30, 2021 | 1989 | | 45 min | Drama, Fantasy | Teenage Matt Banting wants to work with a famo... |
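Several of the first five rows have no value in the `rating` column, so it may be worth counting the missing values per column before going further. The following is a minimal sketch on a small made-up frame (`sample` and its values are assumptions, not the real dataset); on the real data the same call would be applied to `df`.

```python
import pandas as pd

# Toy frame standing in for the real dataset (all values are made up)
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", None, "Y", None],
    "rating": [None, "13+", None, None],
})

# Count missing values per column, largest first
missing_counts = sample.isna().sum().sort_values(ascending=False)
print(missing_counts)
```

Columns with many missing values, such as `director` or `cast` in the cells that follow, are the ones that need a placeholder or a `dropna()` before counting.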


# Distribution of Content:

To begin the task of analyzing Amazon Prime data, I'll start by looking at the distribution of content ratings on Amazon Prime:

In [6]:
z = df.groupby(['rating']).size().reset_index(name='counts')
pieChart = px.pie(z, values='counts', names='rating', title='Distribution of content Ratings on Amazon Prime',
                   color_discrete_sequence=px.colors.qualitative.Set3)
pieChart.show()

The pie chart above shows that the largest share of rated content on Amazon Prime is categorized as "13+", which means that much of the content available on Amazon Prime is intended for teenage and adult audiences.
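One thing to keep in mind when reading the pie chart: `groupby` drops rows whose key is missing, so titles without a rating do not appear in it at all. The sketch below, on a made-up `sample` frame (an assumption, not the real data), compares the grouped counts with `value_counts(dropna=False, normalize=True)`, which keeps the missing ratings as their own category and reports shares instead of counts.

```python
import pandas as pd

# Toy ratings column with two missing values (values are made up)
sample = pd.DataFrame(
    {"rating": ["13+", "13+", "16+", None, "ALL", None, "13+", "18+"]}
)

# groupby() silently drops rows whose key is missing ...
grouped = sample.groupby("rating").size()

# ... while value_counts(dropna=False) keeps them as a separate category
shares = sample["rating"].value_counts(dropna=False, normalize=True)

print(grouped.sum(), "of", len(sample), "rows have a rating")
print(shares)
```

Looking at the shares also makes it easy to check whether the biggest category is a true majority (more than half of the titles) or simply the largest slice.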

# Top 5 Actors and Directors:

Now let's see the 5 directors with the most titles on this platform:

In [7]:
# Replace missing directors with a placeholder that is filtered out below
df['director'] = df['director'].fillna('No Director Specified')
# One row per director; note that splitting on ',' keeps the space after each comma
filtered_directors = df['director'].str.split(',', expand=True).stack()
filtered_directors = filtered_directors.to_frame()
filtered_directors.columns = ['Director']
# Count titles per director and drop the placeholder
directors = filtered_directors.groupby(['Director']).size().reset_index(name='Total Content')
directors = directors[directors.Director != 'No Director Specified']
directors = directors.sort_values(by=['Total Content'], ascending=False)
# Keep the five largest, then sort ascending so the longest bar is drawn at the top
directorsTop5 = directors.head(5)
directorsTop5 = directorsTop5.sort_values(by=['Total Content'])
fig1 = px.bar(directorsTop5, x='Total Content', y='Director', title='Top 5 Directors on Amazon Prime')
fig1.show()


The graph above shows that the 5 directors with the most titles on this platform are:

   1. Mark Knight
   2. Cannis Holder
   3. Moonbug Entertainment
   4. Jay Chapman
   5. Arthur van Merwijk
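The counting code above splits on `','` only, so every name after the first in a row keeps a leading space, and `" Bob Ray"` would be counted separately from `"Bob Ray"`. A possible alternative, sketched here on made-up data, uses `explode()` and `str.strip()`. The helper name `count_names` and the `sample` frame are hypothetical, and on the real dataset the resulting ranking may differ from the list above.

```python
import pandas as pd

# Toy frame (made-up names) with comma-separated directors and a missing value
sample = pd.DataFrame({
    "title": ["T1", "T2", "T3"],
    "director": ["Ann Lee, Bob Ray", "Bob Ray", None],
})

def count_names(frame, column):
    """Count titles per name in a comma-separated column (hypothetical helper)."""
    names = (
        frame[column]
        .dropna()          # skip rows with no value instead of using a placeholder
        .str.split(",")    # one list of names per row
        .explode()         # one row per name
        .str.strip()       # remove the space left after each comma
    )
    names = names[names != ""]  # ignore empty fragments
    return names.value_counts()

director_counts = count_names(sample, "director")
print(director_counts.head(5))
```

The same helper could be reused for the `cast` column, which has the same comma-separated layout.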

Now let's have a look at the 5 actors who appear in the most titles on this platform:

In [7]:
# Replace missing cast lists with a placeholder that is filtered out below
df['cast'] = df['cast'].fillna('No Cast Specified')
# One row per actor; note that splitting on ',' keeps the space after each comma
filtered_cast = df['cast'].str.split(',', expand=True).stack()
filtered_cast = filtered_cast.to_frame()
filtered_cast.columns = ['Actor']
# Count titles per actor and drop the placeholder
actors = filtered_cast.groupby(['Actor']).size().reset_index(name='Total Content')
actors = actors[actors.Actor != 'No Cast Specified']
actors = actors.sort_values(by=['Total Content'], ascending=False)
# Keep the five largest, then sort ascending so the longest bar is drawn at the top
actorsTop5 = actors.head(5)
actorsTop5 = actorsTop5.sort_values(by=['Total Content'])
fig2 = px.bar(actorsTop5, x='Total Content', y='Actor', title='Top 5 Actors on Amazon Prime')
fig2.show()

The plot above shows that the 5 actors who appear in the most titles on Amazon Prime are:

  1. Maggie Binkley
  2. 1
  3. Gene Autry
  4. Nassar
  5. Champion
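The entry `1` in the list above is evidently not an actor's name, which suggests that the `cast` column contains some placeholder values. A rough way to screen them out, assuming that entries made up only of digits are placeholders, is sketched below. The names and counts are made up, and the digit rule is an assumption that would need checking against the real data.

```python
import pandas as pd

# Made-up cast entries, including a numeric placeholder with and without a leading space
names = pd.Series(["Actor A", "1", "Actor B", " 1", "Actor C", "Actor A"])

cleaned = names.str.strip()

# Drop entries that consist only of digits (assumed to be placeholders)
looks_like_name = ~cleaned.str.isdigit()
actor_counts = cleaned[looks_like_name].value_counts()
print(actor_counts)
```

A rule like this only catches numeric placeholders; other non-person entries would still need a manual look.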

# Analyzing Content on Amazon Prime:

The next thing to analyze from this data is the trend of production over the years on Amazon Prime:

In [8]:
df1 = df[['type','release_year']]
df1 = df1.rename(columns={"release_year":"Release Year"})
df2 = df1.groupby(['Release Year','type']).size().reset_index(name='Total Content')
df2 = df2[df2['Release Year']>=2010]
fig3 = px.line(df2, x='Release Year',y='Total Content',color='type',title='Trend of content produced over the years on Amazon Prime')
fig3.show()

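The line graph above shows a decline in the number of titles for both movies and TV shows at the end of the period. Note that the chart counts titles by release year, not by the year Amazon produced or added them, and that the dataset only covers the catalogue as of 2021, so the latest year may be incomplete.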
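To read the exact numbers behind a line chart like this one, the grouped counts can be reshaped into a table with one column per content type, and `diff()` then gives the year-over-year change. This is a small sketch on made-up rows; the `sample` frame and its values are assumptions, and only the column names `type` and `release_year` come from the dataset.

```python
import pandas as pd

# Toy rows standing in for the real dataset (values are made up)
sample = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "Movie", "TV Show", "Movie"],
    "release_year": [2019, 2019, 2019, 2020, 2020, 2021],
})

# One row per release year, one column per content type
counts = (
    sample.groupby(["release_year", "type"])
    .size()
    .unstack(fill_value=0)
)

# Year-over-year change; negative values mark a decline
change = counts.diff()
print(counts)
print(change)
```

A negative entry in `change` for both columns in the same year is what a simultaneous decline for movies and TV shows would look like in numbers.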

Finally, to conclude the analysis, I will analyze the sentiment of the content descriptions on Amazon Prime:

In [13]:
dfx = df[['release_year','description']]
dfx = dfx.rename(columns={'release_year':'Release Year'})
# Label each description by the sign of its TextBlob polarity score
for index, row in dfx.iterrows():
    description = row['description']
    testimonial = TextBlob(description)
    p = testimonial.sentiment.polarity
    if p == 0:
        sent = 'Neutral'
    elif p > 0:
        sent = 'Positive'
    else:
        sent = 'Negative'
    # Write the label to the current row only
    dfx.loc[index, 'Sentiment'] = sent

dfx=dfx.groupby(['Release Year','Sentiment']).size().reset_index(name='Total Content')

dfx=dfx[dfx['Release Year']>=2010]

fig4 = px.bar(dfx,x='Release Year',y='Total Content',color='Sentiment',title='Sentiment of content on Amazon Prime')
fig4.show()

So the graph above shows that, in every release year shown, the amount of positive content is greater than the neutral and negative content combined.
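A claim like this can be checked directly from the counts instead of being read off the chart. The sketch below uses made-up polarity scores in place of TextBlob output; `label_polarity`, the `scores` frame and its values are assumptions for illustration. It applies the same sign-based labelling as the loop above and then tests, year by year, whether positive titles outnumber the rest.

```python
import pandas as pd

def label_polarity(p):
    """Map a polarity score to a label by its sign (hypothetical helper)."""
    if p == 0:
        return "Neutral"
    return "Positive" if p > 0 else "Negative"

# Made-up polarity scores standing in for TextBlob results
scores = pd.DataFrame({
    "Release Year": [2019, 2019, 2019, 2020, 2020, 2020, 2020],
    "polarity": [0.5, 0.2, -0.1, 0.0, 0.3, -0.4, 0.1],
})
scores["Sentiment"] = scores["polarity"].apply(label_polarity)

# One row per year, one column per sentiment label
table = (
    scores.groupby(["Release Year", "Sentiment"])
    .size()
    .unstack(fill_value=0)
)

# True where positive titles outnumber neutral and negative ones combined
positive_wins = table["Positive"] > table.drop(columns="Positive").sum(axis=1)
print(positive_wins)
```

If `positive_wins.all()` is true on the real data, the statement holds for every year in the table; any `False` row points to a year where it does not.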