# Netflix Data Analysis Project

## Project Overview
This project analyzes the Netflix titles dataset (`netflix_titles.csv`) to provide insights into content distribution, trends, and patterns.

## Table of Contents
1. Data Loading and Initial Exploration
2. Data Cleaning and Preprocessing
3. Feature Engineering
4. Exploratory Data Analysis
5. Conclusion and Recommendations

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Set plotting style
plt.style.use('default')
sns.set_palette('husl')

# Increase figure sizes
plt.rcParams['figure.figsize'] = (12, 8)

## 1. Data Loading and Initial Exploration

In [None]:
# Load the Netflix dataset
df = pd.read_csv('netflix_titles.csv')

# Display basic information about the dataset
print('Dataset shape:', df.shape)
print('\nBasic statistics:\n', df.describe())
print('\nMissing values:\n', df.isnull().sum())

# Display first few rows
display(df.head())

## 2. Data Cleaning and Preprocessing

In [None]:
# Handle missing values
# For categorical columns, we'll fill with 'Unknown'
categorical_cols = ['director', 'cast', 'country', 'rating']
for col in categorical_cols:
    df[col] = df[col].fillna('Unknown')

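# For date_added, parse mixed date formats (format='mixed' needs pandas >= 2.0),
# then fall back to January 1 of release_year where the date is missing or unparseable
df['date_added'] = pd.to_datetime(df['date_added'], format='mixed', errors='coerce')
df['date_added'] = df['date_added'].fillna(pd.to_datetime(df['release_year'].astype(str), format='%Y'))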

# Convert duration to numeric by extracting the leading integer
# Note: the result mixes units (minutes for movies, seasons for TV shows)
df['duration'] = pd.to_numeric(df['duration'].str.extract(r'(\d+)', expand=False), errors='coerce')

# Create year and month columns from date_added
df['year_added'] = df['date_added'].dt.year
df['month_added'] = df['date_added'].dt.month
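
Because `duration` holds minutes for some rows and seasons for others, a single numeric column is hard to interpret. A minimal sketch of one alternative, keeping each unit in its own column. The sample rows and the column names `duration_minutes` and `duration_seasons` are hypothetical, chosen only for illustration:

```python
import pandas as pd

# Hypothetical sample mimicking the raw 'type' and 'duration' columns
sample = pd.DataFrame({
    'type': ['Movie', 'TV Show', 'TV Show', 'Movie'],
    'duration': ['90 min', '1 Season', '3 Seasons', None],
})

# Extract the leading integer; rows without digits become NaN
value = pd.to_numeric(
    sample['duration'].str.extract(r'(\d+)', expand=False), errors='coerce'
)

# Keep each unit in its own column (assumed names, not from the original notebook)
sample['duration_minutes'] = value.where(sample['type'] == 'Movie')
sample['duration_seasons'] = value.where(sample['type'] == 'TV Show')
```

With this split, summary statistics on `duration_minutes` describe movies only, and `duration_seasons` describes TV shows only.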

## 3. Feature Engineering

In [None]:
# Create new features
# 1. Content age
df['content_age'] = df['year_added'] - df['release_year']

# 2. Content type category (mirrors 'type'; any value other than these two becomes NaN)
df['content_type'] = df['type'].map({'Movie': 'Movie', 'TV Show': 'TV Show'})

# 3. Genre count
def count_genres(listed_in):
    return len(str(listed_in).split(',')) if pd.notna(listed_in) else 0
df['genre_count'] = df['listed_in'].apply(count_genres)

# 4. Cast size (rows filled with the 'Unknown' placeholder count as 0)
def count_cast(cast):
    if pd.isna(cast) or cast == 'Unknown':
        return 0
    return len(str(cast).split(','))
df['cast_size'] = df['cast'].apply(count_cast)
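
The counting features above can also be computed without a Python-level `apply`. A minimal sketch, assuming comma-separated strings and the `'Unknown'` placeholder introduced during cleaning. The helper name `count_items` and the sample values are hypothetical:

```python
import pandas as pd

def count_items(series: pd.Series, placeholder: str = 'Unknown') -> pd.Series:
    """Count comma-separated items per row; missing or placeholder rows count as 0."""
    counts = series.str.split(',').str.len()
    is_empty = series.isna() | (series == placeholder)
    return counts.mask(is_empty, 0).astype(int)

# Hypothetical sample of a 'cast'-like column
sample = pd.Series(['Actor A, Actor B', 'Unknown', None, 'Actor C'])
cast_size = count_items(sample)
```

Treating the placeholder as zero keeps titles with no recorded cast from being counted as having one cast member.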

## 4. Exploratory Data Analysis

In [None]:
# Content Type Distribution
plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='content_type')
plt.title('Distribution of Content Types')
plt.xlabel('Content Type')
plt.ylabel('Count')
plt.show()

# Yearly Content Addition (years in chronological order rather than sorted by count)
plt.figure(figsize=(12, 6))
year_order = sorted(df['year_added'].dropna().unique())
sns.countplot(data=df, x='year_added', order=year_order)
plt.title('Content Added by Year')
plt.xticks(rotation=45)
plt.show()

# Top Countries (excluding rows where country was filled with 'Unknown')
top_countries = df.loc[df['country'] != 'Unknown', 'country'].value_counts().head(10)
plt.figure(figsize=(12, 6))
sns.barplot(x=top_countries.values, y=top_countries.index)
plt.title('Top 10 Countries by Number of Titles')
plt.xlabel('Number of Titles')
plt.ylabel('Country')
plt.show()

# Content Type Distribution by Year
plt.figure(figsize=(15, 6))
sns.countplot(data=df, x='year_added', hue='content_type', palette='husl')
plt.title('Content Type Distribution Over Years')
plt.xlabel('Year Added')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.legend(title='Content Type')
plt.show()

# Genre distribution (strip the space left after each comma so identical genres share one label)
genres = df['listed_in'].str.split(',').explode().str.strip()
genre_counts = genres.value_counts().head(15)

plt.figure(figsize=(12, 8))
sns.barplot(x=genre_counts.values, y=genre_counts.index)
plt.title('Top 15 Genres in Netflix Content')
plt.xlabel('Number of Titles')
plt.ylabel('Genre')
plt.show()
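
The split-and-explode approach used for genres may also apply to `country`, on the assumption (not confirmed in this notebook) that some titles list several countries in one comma-separated string. A minimal sketch with made-up rows:

```python
import pandas as pd

# Hypothetical rows; the multi-country format is an assumption about the data
sample = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'country': ['United States, India', 'India', 'Unknown'],
})

# Drop the placeholder, split on commas, and give each country its own row
countries = (
    sample.loc[sample['country'] != 'Unknown', 'country']
    .str.split(',')
    .explode()
    .str.strip()
)
country_counts = countries.value_counts()
```

Counting this way credits a co-produced title to every country it lists, instead of treating each country combination as its own category.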

## 5. Conclusion and Recommendations

Based on the analysis:
1. The Content Added by Year plot shows the number of titles added growing over the years
2. Movies and TV shows are unevenly represented in the catalog (Distribution of Content Types plot)
3. A small number of countries account for most of the content (Top 10 Countries plot)
4. A few genres appear far more often than others (Top 15 Genres plot)
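
The first two observations can be checked numerically rather than read off the plots. A minimal sketch, assuming a frame with `year_added` and `content_type` columns like the one built earlier. The sample rows and the derived column names are made up for illustration:

```python
import pandas as pd

# Hypothetical sample standing in for the cleaned DataFrame
sample = pd.DataFrame({
    'year_added': [2018, 2018, 2019, 2019, 2019],
    'content_type': ['Movie', 'TV Show', 'Movie', 'Movie', 'TV Show'],
})

# Titles per year, broken down by content type
yearly = pd.crosstab(sample['year_added'], sample['content_type'])

# Total per year, year-over-year change, and the movie share of each year
yearly['total'] = yearly.sum(axis=1)
yearly['yoy_change'] = yearly['total'].diff()
yearly['movie_share'] = yearly['Movie'] / yearly['total']
```

A consistently positive `yoy_change` would support the growth claim, and `movie_share` shows how the balance between movies and TV shows shifts from year to year.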

Recommendations:
1. Continue diversifying content across different regions
2. Maintain balance between movies and TV shows
3. Explore emerging genres and trends
4. Focus on quality content production