# 📊 Netflix Data Analysis Project

This project explores a dataset of Netflix Movies and TV Shows, applying data cleaning, ETL techniques, and exploratory data analysis (EDA) to uncover trends and visual insights about Netflix's content library.

In [1]:
# Install missing packages if needed
%pip install matplotlib seaborn plotly

# Import essential libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px


In [12]:
import os
import pandas as pd

# Define your path here - update as needed
csv_path = 'netflix_titles.csv'

print("Current working directory:", os.getcwd())
print("Files in directory:", os.listdir())

if not os.path.exists(csv_path):
    print(f"File not found: {csv_path}")
    print("Please update the 'csv_path' variable to the correct location of 'netflix_titles.csv'.")
else:
    df = pd.read_csv(csv_path)
    print("Data loaded successfully! Here are the first 5 rows:")
    print(df.head())


Current working directory: c:\Users\jmcde\OneDrive\Desktop\vscode-projects\Netflix-shows-movies\Notebooks
Files in directory: ['.gitignore', 'attribution.md', 'etl_netflix.log', 'Netflix_analysis_summary.ipynb', 'netflix_cleaned.csv', 'Netflix_Data_Analysis_Tidy.ipynb', 'netflix_etl.ipynb', 'README.md', 'Screenshots']
File not found: netflix_titles.csv
Please update the 'csv_path' variable to the correct location of 'netflix_titles.csv'.
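
The recorded output above shows the lookup failing, which leaves `df` undefined for every later cell. As a minimal sketch of a fail-fast alternative, the helper below tries several locations in order and raises if none exists. The name `resolve_csv_path` and the idea of a candidate list are assumptions for illustration and are not part of the original notebook.

```python
from pathlib import Path


def resolve_csv_path(candidates):
    """Return the first candidate that is an existing file.

    `candidates` is a hypothetical, ordered list of locations to try.
    Raises FileNotFoundError so later cells never run without data.
    """
    for candidate in candidates:
        path = Path(candidate)
        if path.is_file():
            return path
    tried = ", ".join(str(c) for c in candidates)
    raise FileNotFoundError(f"None of these files exist: {tried}")
```

It could be used as `df = pd.read_csv(resolve_csv_path([csv_path]))`, with further entries added to the list for wherever the raw file is actually kept.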


## 🧹 Data Cleaning & ETL Process

Steps performed:
- Checked for missing values
- Removed duplicate records
- Converted relevant data types
- Extracted numerical values from the 'duration' column for movies

In [None]:
# Check for missing values
df.isnull().sum()
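
Raw counts are easier to judge alongside the share of rows affected. The sketch below shows one way to report both, using a toy frame with illustrative values rather than the real Netflix data; the column names `missing_count` and `missing_pct` are assumptions.

```python
import pandas as pd

# Toy frame standing in for the Netflix data (values are illustrative only).
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", None, None, "Y"],
    "rating": ["TV-MA", "PG", None, "TV-14"],
})

# Missing values per column, as a count and as a percentage of rows.
missing = pd.DataFrame({
    "missing_count": toy.isnull().sum(),
    "missing_pct": (toy.isnull().mean() * 100).round(1),
}).sort_values("missing_count", ascending=False)

print(missing)
```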

In [None]:
# Remove duplicates
df.drop_duplicates(inplace=True)

In [None]:
# Convert date_added to datetime format
df['date_added'] = pd.to_datetime(df['date_added'])
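
Date parsing can fail or behave inconsistently if some values carry stray whitespace or are not dates at all. The sketch below is one defensive variant, assuming the dates follow a "Month day, year" layout; the sample values are made up for illustration. Unparsable entries become `NaT` instead of raising.

```python
import pandas as pd

# Illustrative values only: one clean date, one with a leading space,
# one missing value, and one string that is not a date.
raw_dates = pd.Series(["September 25, 2021", " August 4, 2017", None, "not a date"])

# Strip whitespace first, then parse with an explicit format.
# errors="coerce" turns anything unparsable into NaT.
parsed = pd.to_datetime(raw_dates.str.strip(), format="%B %d, %Y", errors="coerce")

print(parsed)
```

Counting `parsed.isna().sum()` before and after the conversion shows how many rows were lost to coercion.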

In [None]:
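# Extract the number of minutes from 'duration' (e.g. '90 min' -> 90) for movies
movies = df[df['type'] == 'Movie'].copy()
# expand=False returns a Series rather than a one-column DataFrame
movies['duration_num'] = movies['duration'].str.extract(r'(\d+)', expand=False)
# Movies with no parsable duration are recorded as 0 minutes
movies['duration_num'] = movies['duration_num'].fillna('0').astype(int)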
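
Filling missing durations with 0 keeps the column as integers, but it also pulls averages down and puts points at zero on any plot. As an alternative sketch, not the approach used above, the missing values can be left as `NaN` with `pd.to_numeric`. The toy frame below uses made-up values.

```python
import pandas as pd

# Toy frame with illustrative values only.
toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "Movie"],
    "duration": ["90 min", "2 Seasons", None, "125 min"],
})

movies_toy = toy[toy["type"] == "Movie"].copy()

# Pull out the digits, then convert; missing durations stay NaN
# and are skipped by aggregations such as mean().
minutes = movies_toy["duration"].str.extract(r"(\d+)", expand=False)
movies_toy["duration_num"] = pd.to_numeric(minutes, errors="coerce")

print(movies_toy["duration_num"].mean())
```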

## 📊 Exploratory Data Analysis (EDA)

### 🔥 Correlation Heatmap

In [None]:
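# Create a correlation heatmap of the numeric movie columns.
# `df` has release_year as its only numeric column, so use `movies`,
# which also carries duration_num.
numeric_cols = movies.select_dtypes(include='number')
corr = numeric_cols.corr()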

plt.figure(figsize=(8,6))
sns.heatmap(corr, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Heatmap')
plt.show()
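
To make the heatmap's input concrete, the sketch below computes a correlation matrix on a toy frame with two numeric columns. The values are invented so that duration falls steadily as the year rises; they say nothing about the real data.

```python
import pandas as pd

# Toy data: duration drops by a fixed amount for each step in year.
toy_movies = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "release_year": [2000, 2005, 2010, 2015],
    "duration_num": [120, 110, 100, 90],
})

# select_dtypes drops the text column; corr() returns a square matrix
# of pairwise Pearson correlations between the remaining columns.
corr_toy = toy_movies.select_dtypes(include="number").corr()

print(corr_toy)
```

With only one numeric column the matrix would be a single cell equal to 1, which is why the heatmap needs at least two numeric fields to be informative.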

### 🎥 Movie Duration vs Release Year by Rating (Scatter Plot)

In [None]:
# Scatter plot of movie duration vs release year by rating
fig = px.scatter(
    movies,
    x='release_year',
    y='duration_num',
    color='rating',
    labels={'duration_num': 'Duration (minutes)', 'release_year': 'Release Year'},
    title='Movie Duration vs Release Year by Rating',
    color_discrete_sequence=px.colors.qualitative.Pastel
)
fig.show()

### 🍩 Distribution of Titles by Rating (Pie Chart)

In [None]:
# Pie chart of distribution by rating
fig = px.pie(
    df,
    names='rating',
    title='Distribution of Netflix Titles by Rating',
    hole=0.3
)
fig.show()

## 📌 Key Insights & Findings

- The majority of Netflix titles are rated 'TV-MA'.
- Content releases have increased notably after 2010.
- No strong correlation exists between numeric fields like release year and movie duration.
- 'TV Shows' record their duration as a season count (for example '2 Seasons') rather than in minutes, so the duration analysis covers movies only.
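
The first two findings can be checked with simple counts. The sketch below shows one way to do so on a toy frame; the values are invented and the 2010 cut-off is taken from the finding above. On the real data, `toy` would be replaced by `df`.

```python
import pandas as pd

# Toy frame with illustrative values only.
toy = pd.DataFrame({
    "rating": ["TV-MA", "TV-MA", "TV-14", "PG", "TV-MA", "TV-14"],
    "release_year": [2008, 2015, 2018, 2009, 2019, 2019],
})

# Share of titles per rating, and the most common rating.
rating_share = toy["rating"].value_counts(normalize=True)
top_rating = rating_share.idxmax()

# Titles per release year, split at 2010.
titles_per_year = toy["release_year"].value_counts().sort_index()
after_2010 = titles_per_year[titles_per_year.index > 2010].sum()
up_to_2010 = titles_per_year[titles_per_year.index <= 2010].sum()

print(top_rating, rating_share[top_rating], up_to_2010, after_2010)
```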

## 📊 Conclusion

This exploratory analysis successfully uncovered trends in Netflix's content library, including rating distributions and content durations over time. Visualisations provided insights into title counts by rating, content duration trends, and relationships between numeric features. Future projects could explore genre trends or content availability by region.

## 📜 Code Attribution
This project was developed using a combination of official documentation, community resources, and public tutorials to guide the data cleaning, exploratory analysis, and visualisation processes. The following resources were referenced:

- [Pandas Documentation](https://pandas.pydata.org/docs/) — for data cleaning, handling missing values, and DataFrame manipulation.
- [Seaborn Documentation](https://seaborn.pydata.org/) — for creating correlation heatmaps and understanding plot customisation.
- [Plotly Express Documentation](https://plotly.com/python/plotly-express/) — for building interactive scatter and pie charts.
- [Matplotlib Documentation](https://matplotlib.org/stable/contents.html) — for customising figure layouts, titles, and axis labels.
- [Stack Overflow](https://stackoverflow.com/questions/38834378/extract-numbers-from-string-in-pandas) — for extracting numeric values from strings in Pandas.
- [Real Python: Data Cleaning with Pandas](https://realpython.com/python-data-cleaning-numpy-pandas/) — for strategies on dealing with missing, inconsistent, and duplicate data.
- [DataCamp: Exploratory Data Analysis in Python](https://www.datacamp.com/courses/exploratory-data-analysis-in-python) — for concepts around visualising distributions, relationships, and categorical data.
- [Kaggle: Netflix Movies and TV Shows Dataset Projects](https://www.kaggle.com/datasets/shivamb/netflix-shows/code) — reviewed for inspiration on Netflix data analysis approaches and commonly explored variables.
- [Towards Data Science: Building ETL Pipelines in Python](https://towardsdatascience.com/building-etl-pipelines-with-python-2c3035e3898e) — for ETL workflow design and data transformation strategies.
- [Regular Expressions 101](https://regex101.com/) — for testing and refining regex patterns used in data extraction.
- [Jupyter Notebook Best Practices](https://realpython.com/jupyter-notebook-introduction/) — for structuring well-organised and readable notebooks.
- [The Data Visualisation Catalogue](https://datavizcatalogue.com/) — consulted for selecting appropriate visualisation types for different data relationships.
- [Datawrapper Academy: Data Storytelling](https://academy.datawrapper.de/article/179-how-to-tell-a-story-with-data) — for principles on effectively presenting insights and visual narratives.
- [Python Official Regular Expressions HOWTO](https://docs.python.org/3/howto/regex.html) — for understanding regular expressions in Python.
- [DataCamp: Working with Categorical Data in Python](https://www.datacamp.com/courses/categorical-data-in-python) — for handling and visualising categorical variables like content ratings.
- [Towards Data Science: A Practical Guide to Exploratory Data Analysis](https://towardsdatascience.com/a-practical-guide-to-exploratory-data-analysis-8b8bb597c2f0) — for best practices in conducting exploratory data analysis.
- [Analytics Vidhya: Comprehensive Guide to Data Cleaning in Python](https://www.analyticsvidhya.com/blog/2021/05/a-comprehensive-guide-to-data-cleaning-in-python/) — for advanced data cleaning strategies and techniques.
- [Kaggle Learn: Data Visualisation](https://www.kaggle.com/learn/data-visualization) — for practical hands-on guidance in creating visualisations using Matplotlib and Seaborn.
- [Medium: How to Structure Data Science Projects](https://medium.com/swlh/how-to-structure-data-science-projects-d73eb30f2721) — for ideas on organising data science projects for readability, reproducibility, and collaboration.
- [Google Developers: Data Analysis Best Practices](https://developers.google.com/machine-learning/data-prep/overview) — for general data analysis workflow and preparation best practices.

All other code was written independently by James McDermott.

Inline comments within the Jupyter notebook cite any adapted external code snippets as appropriate.
