**Basic Analysis: Exploring the Netflix Dataset**

This is the second part of the Netflix data analysis project.

Following the data cleaning phase, we now focus on extracting insights using NumPy and pandas. The analysis covers:

1. **Distribution of Content Types** 
Analyzing the breakdown of movies versus TV shows in the dataset.

2. **Frequent Directors** 
Identifying the directors with the highest number of productions.

3. **Movie Durations**
Calculating total, average, maximum, and minimum durations of movies.

4. **Country Productions** 
Determining which countries have the most and least Netflix productions.

5. **Release Dates and Years** 
Analyzing when titles were added to Netflix and the years in which they were originally released.

6. **Keyword Occurrences** 
Counting how often specific keywords appear in the descriptions, such as "murder", "comedy", and "love".

This analysis aims to provide a deeper understanding of the dataset and lays the groundwork for further visualization and insights.

In [17]:
import numpy as np
import pandas as pd 

#importing the dataset
data = pd.read_csv("C:/Users/HP/Desktop/ANUDIP/Python/Project Python Anudip/cleaned_netflix_database_Anudip.csv")

*Distribution of Content Types*

In [18]:
# Count the number of titles per content type (pandas value_counts on the "type" column)
data['type'].value_counts()

type
Movie      6131
TV Show    2676
Name: count, dtype: int64

*Movies make up roughly 70% of the library (6,131 of 8,807 titles) and TV shows the remaining 30% (2,676 titles), so the catalogue is weighted heavily toward films.*
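
The raw counts are easier to compare as percentages. The sketch below rebuilds the two counts from the output above as a small Series (the name `type_counts` is introduced here for illustration) and converts them to shares of the total. On the full dataset, `data['type'].value_counts(normalize=True)` should give the same proportions directly.

```python
import pandas as pd

# Counts copied from the value_counts() output shown above
type_counts = pd.Series({"Movie": 6131, "TV Show": 2676}, name="count")

# Convert each count to a percentage of all titles, rounded to one decimal
type_share = (type_counts / type_counts.sum() * 100).round(1)
print(type_share)
```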

*Most Frequent Director*

In [19]:
# Count how many titles each director is credited with (sorted in descending order)
director_counts = data['director'].value_counts()

# Director who appears most frequently, with the number of occurrences
most_frequent_director = director_counts.idxmax()
max_occurrences = director_counts.max()

print(f"The director with the most movies is: {most_frequent_director} with {max_occurrences} occurrences.")

# Director with the second-highest number of occurrences
# (value_counts sorts in descending order, so position 1 is the runner-up)
second_highest_director = director_counts.index[1]
second_highest_count = director_counts.iloc[1]

print(f"The director with the second most movies is: {second_highest_director} with {second_highest_count} occurrences.")

The director with the most movies is: Unknown with 2634 occurrences.
The director with the second most movies is: Rajiv Chilaka with 19 occurrences.


*"Unknown" appears to be the placeholder for titles with no director listed, so it tops the count with 2,634 entries. The most frequent named director is Rajiv Chilaka with 19 titles. The counts cover both movies and TV shows, because the 'director' column is not filtered by type.*

*Top Five Directors*

In [20]:
# List the five most frequent values in the 'director' column
director_counts = data['director'].value_counts()
print(director_counts.head(5))

director
Unknown                   2634
Rajiv Chilaka               19
Raúl Campos, Jan Suter      18
Suhas Kadav                 16
Marcus Raboy                16
Name: count, dtype: int64


*Listing the five most frequent values puts the top directors in context: after the "Unknown" placeholder, no director or directing team has more than 19 titles, so the catalogue is spread across many filmmakers.*
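
Two details in the output above affect the ranking: the "Unknown" placeholder wins by default, and co-directed titles such as "Raúl Campos, Jan Suter" are counted as one combined name. The following is a minimal sketch of one way to handle both, using a small hypothetical sample rather than the real column. It assumes co-directors are always separated by a comma followed by a space.

```python
import pandas as pd

# Hypothetical sample standing in for data['director']
directors = pd.Series([
    "Unknown", "Unknown", "Unknown",
    "Rajiv Chilaka", "Rajiv Chilaka",
    "Raúl Campos, Jan Suter",
    "Jan Suter",
])

# Drop the placeholder so it cannot win the ranking
known = directors[directors != "Unknown"]

# Split co-directed titles so each person is credited individually
# (assumes ", " is the separator between names)
individual = known.str.split(", ").explode()

director_counts = individual.value_counts()
print(director_counts.head(3))
```

Whether to credit co-directors individually is a design choice. Keeping the combined string treats a directing team as a single entity, which may be preferable for some questions.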

*Most and Least Common Ratings*

In [21]:
print("Highest 5 ratings")
print(data['rating'].value_counts().head(5))

print("Lowest Ratings")
print(data['rating'].value_counts().tail(5))

Highest 5 ratings
rating
TV-MA    3214
TV-14    2160
TV-PG     863
R         799
PG-13     490
Name: count, dtype: int64
Lowest Ratings
rating
NR          80
G           41
TV-Y7-FV     6
NC-17        3
UR           3
Name: count, dtype: int64


*The 'rating' column holds maturity ratings (such as TV-MA or PG-13), not viewer scores, so these counts show which audience classifications are most common. TV-MA (3,214 titles) and TV-14 (2,160 titles) dominate, which points to a catalogue aimed largely at adult and teenage audiences.*

*The least common ratings, such as NC-17 and UR with 3 titles each, are rarely used classifications. They say nothing about viewer dissatisfaction.*

*Analysis of Movie Durations*

In [22]:
# Summary statistics for the 'Duration_Movies' column (in minutes), ignoring missing values
movie_durations = data['Duration_Movies'].dropna()

# 1. Total duration of movies
total_movies_duration = np.sum(movie_durations)
print(f'Total duration of Movies is: {total_movies_duration} mins')

# 2. Average duration of movies
avg_movies_duration = np.mean(movie_durations)
print(f'\nAverage duration of movies is: {avg_movies_duration} mins')

# 3. Maximum duration of movies, with the title of the longest movie
longest_title = data.loc[data['Duration_Movies'].idxmax(), 'title']
print(f'\nMaximum duration of Movies is: {data["Duration_Movies"].max()} mins with movie name {longest_title}')

# 4. Minimum duration of movies, with the title of the shortest movie
shortest_title = data.loc[data['Duration_Movies'].idxmin(), 'title']
print(f'\nMinimum duration of Movies is: {data["Duration_Movies"].min()} mins with movie name {shortest_title}')

Total duration of Movies is: 610209.0 mins

Average duration of movies is: 99.57718668407311 mins

Maximum duration of Movies is: 312.0 mins with movie name Black Mirror: Bandersnatch

Minimum duration of Movies is: 3.0 mins with movie name Silent


- Total Duration of Movies: *All movies together run 610,209 minutes, roughly 10,170 hours of content.*

- Average Duration of Movies: *The average movie runs about 99.6 minutes, which gives a baseline for what counts as a short or long title in this dataset.*

- Maximum Duration of Movies: *The longest entry is Black Mirror: Bandersnatch at 312 minutes, an outlier worth checking before it skews comparisons.*

- Minimum Duration of Movies: *The shortest entry, Silent, runs only 3 minutes, which shows that the "Movie" type also includes very short films.*
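
The four statistics can also be computed in a single call. The sketch below uses a small hypothetical DataFrame with the same 'Duration_Movies' column name; `Series.agg` skips missing values, so no explicit `dropna()` is needed. The `count` entry shows how many rows actually contributed to the average.

```python
import pandas as pd

# Hypothetical sample; None marks a row without a movie duration (e.g. a TV show)
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "Duration_Movies": [90.0, None, 120.0, 3.0],
})

# Compute several summary statistics at once; missing values are ignored
summary = sample["Duration_Movies"].agg(["count", "sum", "mean", "max", "min"])
print(summary)

# Look up the title that belongs to the longest duration
longest_title = sample.loc[sample["Duration_Movies"].idxmax(), "title"]
print(longest_title)
```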

*Countries with Most and Least Netflix Productions*

In [23]:
print("Country with most Netflix Productions")
print(data['country'].value_counts().head(5))

print("\nCountry with lowest Netflix Productions")
print(data['country'].value_counts().tail(5))

Country with most Netflix Productions
country
United States     3649
India              972
United Kingdom     419
Japan              245
South Korea        199
Name: count, dtype: int64

Country with lowest Netflix Productions
country
Russia, Spain                                    1
Croatia, Slovenia, Serbia, Montenegro            1
Japan, Canada                                    1
United States, France, South Korea, Indonesia    1
Canada, Mexico, Germany, South Africa            1
Name: count, dtype: int64


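- Highest: *The United States leads with 3,649 titles, followed by India (972) and the United Kingdom (419), which shows where most of the catalogue originates.*

- Lowest: *Every entry in the bottom five is a multi-country co-production that appears once, because the 'country' column stores co-productions as a single comma-separated string. Many combinations tie at a count of 1, so this list is an arbitrary sample of them, not a ranking of the least represented countries.*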
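
To count individual countries instead of country combinations, the comma-separated strings can be split before counting. This is a minimal sketch on a hypothetical sample, assuming that a comma is the only separator used in the 'country' column.

```python
import pandas as pd

# Hypothetical sample standing in for data['country']
countries = pd.Series([
    "United States",
    "United States, France",
    "Russia, Spain",
    "India",
    "Japan, Canada",
    "Canada",
])

# Split on commas, give each country its own row, trim spaces, then count
per_country = (
    countries.str.split(",")
    .explode()
    .str.strip()
    .value_counts()
)
print(per_country)
```

With this approach a co-production counts once for every participating country, so the per-country totals add up to more than the number of titles.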

*Dates with Highest and Lowest Releases*

In [24]:
print("Date with highest releases")
print(data['date_added'].value_counts().nlargest())

print("\nDate with lowest releases")
print(data['date_added'].value_counts().nsmallest())

Date with highest releases
date_added
1/1/2020      120
11/1/2019      91
3/1/2018       75
12/31/2019     74
10/1/2018      71
Name: count, dtype: int64

Date with lowest releases
date_added
1/11/2020    1
9/4/2021     1
8/21/2021    1
4/20/2017    1
4/26/2017    1
Name: count, dtype: int64


- Highest Releases: *The busiest dates all fall on the first or last day of a month (for example, 1/1/2020 with 120 titles), which suggests that content is often added in batches at month boundaries.*

- Lowest Releases: *Many dates have only a single addition, so the five dates shown are an arbitrary selection among ties and should not be read as a trend.*

*Years with Highest and Lowest Number of Releases*

In [25]:
print("Year with highest number of releases")
print(data['release_year'].value_counts().nlargest())

print("\nYear with lowest number of releases")
print(data['release_year'].value_counts().nsmallest())

Year with highest number of releases
release_year
2018    1147
2017    1032
2019    1030
2020     953
2016     902
Name: count, dtype: int64

Year with lowest number of releases
release_year
1961    1
1925    1
1959    1
1966    1
1947    1
Name: count, dtype: int64


- Highest Year: *The 'release_year' column holds the year a title was originally released, not the year it was added to Netflix. Most of the catalogue is recent, with 2018 (1,147 titles) the most common release year.*

- Lowest Year: *Several years before 1970 are represented by a single title each. The five shown are an arbitrary selection among those ties.*
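
Because 'release_year' and 'date_added' measure different things, growth of the library is better read from the year a title was added. The sketch below derives that year on a hypothetical sample. The `year_added` column name is introduced here for illustration, and the month/day/year format string is an assumption based on the dates printed earlier.

```python
import pandas as pd

# Hypothetical sample with the same column names as the dataset
sample = pd.DataFrame({
    "date_added": ["1/1/2020", "11/1/2019", "12/31/2019", "1/11/2020"],
    "release_year": [2018, 2019, 1961, 2020],
})

# Parse the date strings (assumed to be month/day/year) and extract the year
added = pd.to_datetime(sample["date_added"], format="%m/%d/%Y")
sample["year_added"] = added.dt.year

# Titles added per year, in chronological order
added_per_year = sample["year_added"].value_counts().sort_index()
print(added_per_year)

# Years between original release and arrival on the platform
lag_years = (sample["year_added"] - sample["release_year"]).tolist()
print(lag_years)
```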

*Monthly Content Addition*

In [26]:
# Extract the month name from the date each title was added
month_added = pd.to_datetime(data['date_added']).dt.month_name()

# Count titles per month, split by content type
pivot_table = pd.pivot_table(data=data, index=month_added, columns='type', aggfunc='size', fill_value=0)
print(pivot_table)

type        Movie  TV Show
date_added                
April         550      214
August        519      236
December      547      266
February      382      181
January       546      202
July          565      262
June          492      236
March         529      213
May           439      193
November      498      207
October       545      215
September     519      251


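*The pivot table shows how many movies and TV shows were added in each month. July (565 movies) and December (266 TV shows) are the busiest months, and February is the quietest for both types. The months appear in alphabetical order because the index holds month names as strings.*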
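
Seasonal patterns are easier to read when the months are in calendar order. The sketch below reorders a small pivot that reuses three rows from the output above. It relies on `calendar.month_name` from the standard library and assumes an English locale, so that its names match those produced by `dt.month_name()`.

```python
import calendar
import pandas as pd

# Three rows copied from the pivot table above, in alphabetical order
pivot = pd.DataFrame(
    {"Movie": [550, 382, 546], "TV Show": [214, 181, 202]},
    index=["April", "February", "January"],
)

# calendar.month_name starts with an empty string at position 0, so skip it
month_order = list(calendar.month_name)[1:]

# Keep only the months present in the pivot, in calendar order
ordered = pivot.reindex([m for m in month_order if m in pivot.index])
print(ordered)
```

On the full table, where all twelve months are present, `pivot_table.reindex(month_order)` would be enough.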

*Keyword Occurrences in Descriptions*

In [27]:
# Check if the 'description' column contains the keyword 'murder'
contains_murder = data['description'].str.contains('murder', case=False, na=False)

# Count the number of rows that contain the keyword 'murder'
murder_count = contains_murder.sum()

print(f"Total number of description containing the keyword 'murder': {murder_count}")

Total number of description containing the keyword 'murder': 291


In [28]:
# Check if the 'description' column contains the keyword 'comedy' or 'funny'
contains_comedy = data['description'].str.contains('comedy|funny', case=False, na=False)

# Count the number of rows that contain either keyword
comedy_count = contains_comedy.sum()

print(f"Total number of descriptions containing the keyword 'comedy' or 'funny': {comedy_count}")

Total number of descriptions containing the keyword 'comedy' or 'funny': 216


In [29]:
# Check if the 'description' column contains the keyword 'love'
contains_love = data['description'].str.contains('love', case=False, na=False)

# Count the number of rows that contain the keyword 'love'
love_count = contains_love.sum()

print(f"Total number of descriptions containing the keyword 'love': {love_count}")

Total number of descriptions containing the keyword 'love': 704


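*Of the three keyword searches, "love" is the most common (704 descriptions), followed by "murder" (291) and "comedy" or "funny" (216). Because str.contains matches substrings, a search for "love" also counts words such as "beloved" or "glove", so these totals are upper bounds on whole-word matches.*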
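
To count only whole-word matches, the keyword can be wrapped in regular-expression word boundaries (`\b`). The sketch below compares both approaches on a few hypothetical descriptions. The sample sentences are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical descriptions, including one missing value
descriptions = pd.Series([
    "A beloved teacher retires.",
    "Two strangers fall in love.",
    "A detective loses a glove at the scene.",
    "Love, loss and a road trip.",
    None,
])

# Substring match: also hits "beloved" and "glove"
substring_hits = descriptions.str.contains("love", case=False, na=False)

# Whole-word match: \b marks a word boundary on each side of the keyword
whole_word_hits = descriptions.str.contains(r"\blove\b", case=False, na=False, regex=True)

print(substring_hits.sum())
print(whole_word_hits.sum())
```

The whole-word pattern would also skip inflected forms such as "loves" or "lover", so the right choice depends on whether those should count.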