# Analysis Plan

To analyze the Netflix titles dataset (`netflix_titles.csv`), we will follow these steps:

1. **Data Overview**
    - Display the first few rows of the dataset to understand its structure.
    - Get basic information about the dataset including the number of entries, column names, and data types.
    - Generate summary statistics for numerical columns.

2. **Data Cleaning and Preprocessing**
    - Handle missing values by either filling them with appropriate values or dropping the rows/columns.
    - Convert data types of columns if necessary (e.g., converting `date_added` to datetime).
    - Extract additional features if needed (e.g., extracting the year from `date_added`).

3. **Exploratory Data Analysis (EDA)**
    - Analyze the distribution of categorical variables such as `type`, `country`, `rating`, etc.
    - Visualize the distribution of numerical variables such as `release_year` and `duration` (once `duration` is parsed into a number).
    - Explore relationships between different variables (e.g., the relationship between `release_year` and `rating`).

4. **Insights and Conclusions**
    - Summarize the key findings from the EDA.
    - Draw conclusions based on the analysis.
    - Provide recommendations or insights for further investigation.

Following these steps should give us a clear understanding of the Netflix titles dataset and surface insights worth investigating further.

# Data overview

In [14]:
# importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [15]:
# Read the Netflix CSV into a DataFrame
netflix_df = pd.read_csv('netflix_titles.csv')

In [16]:
# Get basic information about the dataset
print("Number of entries:", len(netflix_df))
print("Column names:", netflix_df.columns.tolist())
print("Data types:\n", netflix_df.dtypes)


Number of entries: 8807
Column names: ['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added', 'release_year', 'rating', 'duration', 'listed_in', 'description']
Data types:
 show_id         object
type            object
title           object
director        object
cast            object
country         object
date_added      object
release_year     int64
rating          object
duration        object
listed_in       object
description     object
dtype: object


In [21]:
# Display the first few rows of the dataset
netflix_df.head()

| | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | s1 | Movie | Dick Johnson Is Dead | Kirsten Johnson | NaN | United States | September 25, 2021 | 2020 | PG-13 | 90 min | Documentaries | As her father nears the end of his life, filmm... |
| 1 | s2 | TV Show | Blood & Water | NaN | Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban... | South Africa | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, TV Dramas, TV Mysteries | After crossing paths at a party, a Cape Town t... |
| 2 | s3 | TV Show | Ganglands | Julien Leclercq | Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi... | NaN | September 24, 2021 | 2021 | TV-MA | 1 Season | Crime TV Shows, International TV Shows, TV Act... | To protect his family from a powerful drug lor... |
| 3 | s4 | TV Show | Jailbirds New Orleans | NaN | NaN | NaN | September 24, 2021 | 2021 | TV-MA | 1 Season | Docuseries, Reality TV | Feuds, flirtations and toilet talk go down amo... |
| 4 | s5 | TV Show | Kota Factory | NaN | Mayur More, Jitendra Kumar, Ranjan Raj, Alam K... | India | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, Romantic TV Shows, TV ... | In a city of coaching centers known to train I... |
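
The preview shows that `listed_in` (and likewise `cast`) packs several values into one comma-separated string, so a plain `value_counts()` would count whole combinations and not individual genres. A minimal sketch of one way to count the individual values is below. The helper name `count_listed_values` is hypothetical, the first three rows are copied from the preview, and the fourth row is invented purely so that some counts exceed one.

```python
import pandas as pd

# First three rows are taken from the head() preview; the last row is a
# hypothetical example added only to make the counts differ.
sample = pd.DataFrame({
    "title": [
        "Dick Johnson Is Dead",
        "Blood & Water",
        "Jailbirds New Orleans",
        "Hypothetical Title",
    ],
    "listed_in": [
        "Documentaries",
        "International TV Shows, TV Dramas, TV Mysteries",
        "Docuseries, Reality TV",
        "International TV Shows, Docuseries",
    ],
})


def count_listed_values(df, column):
    """Count each comma-separated value in `column` individually."""
    values = (
        df[column]
        .dropna()          # skip rows with no value at all
        .str.split(",")    # one list of values per row
        .explode()         # one value per row
        .str.strip()       # remove the space left after each comma
    )
    return values.value_counts()


genre_counts = count_listed_values(sample, "listed_in")
print(genre_counts)
```

The same helper could be pointed at `cast` or `country` on the full DataFrame, assuming those columns use the same comma-separated layout throughout.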


In [18]:
netflix_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB
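
The non-null counts above imply the missing values directly: `director` lacks 8807 - 6173 = 2634 entries, `country` 831, `cast` 825, `date_added` 10, `rating` 4, and `duration` 3. A small helper can produce the same figures as a table. The sketch below is one possible approach; `missing_summary` is a hypothetical name, and the stand-in frame holds only the five rows shown by `head()`.

```python
import pandas as pd


def missing_summary(df):
    """Return missing-value counts and percentages per column, largest first."""
    missing = df.isna().sum()
    summary = pd.DataFrame({
        "missing": missing,
        "percent": (missing / len(df) * 100).round(2),
    })
    # Keep only columns that actually have gaps
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Stand-in data: three columns of the five rows shown by head()
sample = pd.DataFrame({
    "show_id": ["s1", "s2", "s3", "s4", "s5"],
    "director": ["Kirsten Johnson", None, "Julien Leclercq", None, None],
    "country": ["United States", "South Africa", None, None, "India"],
})

summary = missing_summary(sample)
print(summary)
```

On the full dataset the call would be `missing_summary(netflix_df)`.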


In [None]:
# Generate summary statistics for numerical columns
netflix_df.describe()

| | release_year |
|---|---|
| count | 8807.0 |
| mean | 2014.180198 |
| std | 8.819312 |
| min | 1925.0 |
| 25% | 2013.0 |
| 50% | 2017.0 |
| 75% | 2019.0 |
| max | 2021.0 |
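
`describe()` reports only `release_year` because `duration` is stored as text such as `90 min` or `2 Seasons`. A minimal sketch of splitting it into numeric columns follows. The function name `split_duration` and the output column names are hypothetical, and the pattern assumes every value looks like the ones in the preview.

```python
import pandas as pd


def split_duration(duration):
    """Split strings like '90 min' or '2 Seasons' into two numeric columns."""
    parts = duration.str.extract(
        r"^\s*(?P<value>\d+)\s*(?P<unit>min|Seasons?)\s*$"
    )
    value = pd.to_numeric(parts["value"], errors="coerce")
    is_minutes = parts["unit"].eq("min")
    is_seasons = parts["unit"].str.startswith("Season", na=False)
    return pd.DataFrame({
        # Each column is filled only for rows with the matching unit
        "duration_minutes": value.where(is_minutes),
        "duration_seasons": value.where(is_seasons),
    })


# Values taken from the head() preview, plus one missing entry
sample = pd.Series(["90 min", "2 Seasons", "1 Season", None], name="duration")
parsed = split_duration(sample)
print(parsed)
```

Keeping minutes and seasons in separate columns avoids mixing two units in one histogram or summary statistic.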


# Data cleaning and exploratory analysis

In [20]:
import seaborn as sns

# Work on a copy so the raw DataFrame stays available for comparison
df = netflix_df.copy()

# Generate summary statistics for numerical columns
summary_statistics = df.describe()
print("Summary statistics for numerical columns:\n", summary_statistics)

# Handle missing values
# Assumption: text columns with many gaps get an 'Unknown' placeholder
df = df.fillna({'director': 'Unknown', 'cast': 'Unknown', 'country': 'Unknown'})

# Assumption: the few rows missing date_added, rating, or duration are dropped
df = df.dropna(subset=['date_added', 'rating', 'duration'])

# Convert data types of columns if necessary
# Strip stray whitespace first; unparseable dates become NaT
df['date_added'] = pd.to_datetime(df['date_added'].str.strip(), errors='coerce')

# Extract additional features
df['year_added'] = df['date_added'].dt.year
# duration is text ('90 min', '2 Seasons'), so pull out the leading number
df['duration_value'] = df['duration'].str.extract(r'(\d+)', expand=False).astype(float)

# Exploratory Data Analysis (EDA)

# Analyze the distribution of categorical variables
# country has many distinct values, so only the 20 most frequent are plotted
categorical_columns = ['type', 'country', 'rating']
for column in categorical_columns:
    top_values = df[column].value_counts().index[:20]
    plt.figure(figsize=(10, 5))
    sns.countplot(data=df[df[column].isin(top_values)], x=column, order=top_values)
    plt.title(f'Distribution of {column}')
    plt.xticks(rotation=90)
    plt.show()

# Visualize the distribution of numerical variables
# Minutes and seasons are different units, so each type is plotted separately
for content_type, unit in [('Movie', 'minutes'), ('TV Show', 'seasons')]:
    subset = df.loc[df['type'] == content_type, 'duration_value'].dropna()
    plt.figure(figsize=(10, 5))
    sns.histplot(subset, kde=True)
    plt.title(f'Distribution of duration ({unit}) for {content_type}')
    plt.show()

# Explore relationships between different variables
# release_year is numeric and rating is categorical, so the boxes are horizontal
plt.figure(figsize=(10, 5))
sns.boxplot(data=df, x='release_year', y='rating')
plt.title('Relationship between Release Year and Rating')
plt.xticks(rotation=90)
plt.show()
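
The relationship between `release_year` and `rating` can also be summarized as a table of counts, which is sometimes easier to read than a plot. The sketch below uses `pd.crosstab` on the `release_year` and `rating` values of the five rows shown by `head()`; it is an illustration on stand-in data, not a result for the full dataset.

```python
import pandas as pd

# release_year and rating for the five rows shown by head()
sample = pd.DataFrame({
    "release_year": [2020, 2021, 2021, 2021, 2021],
    "rating": ["PG-13", "TV-MA", "TV-MA", "TV-MA", "TV-MA"],
})

# Number of titles for each (release_year, rating) pair
counts = pd.crosstab(sample["release_year"], sample["rating"])

# Share of each rating within a release year (each row sums to 1)
shares = pd.crosstab(sample["release_year"], sample["rating"], normalize="index")

print(counts)
print(shares)
```

On the full data, grouping `release_year` into decades before tabulating would likely keep the table compact, since the years range from 1925 to 2021.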