# Netflix Dataset Analysis — Major Project
**Dataset expected:** `netflix.csv`  
**Project goal:** Content Trends Analysis for Strategic Recommendations (Movies vs TV Shows, genres, country contributions)  

**How to use:**  
1. Upload `netflix.csv` to the same folder as this notebook.  
2. Run cells top-to-bottom.  
3. Fill observation placeholders (where prompted) and export as PDF or submit `.ipynb`.  

*Notebook includes:* Data cleaning, EDA (plots + tables), genre & country trends, clustering example, and strategic recommendations.  
Dark theme styling is included for notebook display cells.  


In [None]:
from IPython.display import HTML
HTML('''
<style>
body { background:#0b0f14; color:#e6eef3; }
div.output_area pre { color: #e6eef3; background:#071024; }
.rendered_html { color: #e6eef3; }
.jp-OutputArea-output pre { color:#e6eef3; background:#071024; }
</style>
''')

In [None]:
# Setup: imports and basic configuration
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.preprocessing import MultiLabelBinarizer
import warnings
warnings.filterwarnings('ignore')

print("Libraries loaded. Make sure 'netflix.csv' is in the same folder as this notebook.")

In [None]:
# Load dataset
df = pd.read_csv('netflix.csv')
print('Dataset loaded. Shape:', df.shape)
df.head()

In [None]:
# Quick inspection
df.info()
df.describe(include='all').T

In [None]:
# Preprocessing
# Common column names handled: title, type, director, cast, country, date_added, release_year, rating, duration, listed_in

# 1) Standardize column names (lowercase, strip)
df.columns = df.columns.str.strip().str.lower()

# 2) Parse dates if present
if 'date_added' in df.columns:
    df['date_added'] = pd.to_datetime(df['date_added'], errors='coerce')
    df['year_added'] = df['date_added'].dt.year

# 3) Ensure release_year exists
if 'release_year' not in df.columns and 'year' in df.columns:
    df['release_year'] = df['year']

# 4) Normalize text fields
for col in ['director', 'cast', 'country', 'listed_in']:
    if col in df.columns:
        df[col] = df[col].fillna('').astype(str).str.strip()

# 5) Create genres list from listed_in
if 'listed_in' in df.columns:
    df['genres'] = df['listed_in'].apply(lambda x: [g.strip() for g in x.split(',')] if x else [])
else:
    df['genres'] = [[] for _ in range(len(df))]

# 6) Extract primary_genre and num_genres
df['primary_genre'] = df['genres'].apply(lambda x: x[0] if len(x)>0 else 'Unknown')
df['num_genres'] = df['genres'].apply(len)

# 7) Normalize duration: duration_min (minutes, movies) and seasons (season count, TV shows)
def parse_duration(x, unit):
    """Return the leading integer of a duration string if it mentions `unit`, else None."""
    if pd.isna(x):
        return None
    s = str(x)
    if unit not in s:
        return None
    try:
        return int(s.split(' ')[0])
    except ValueError:
        return None

if 'duration' in df.columns:
    # Keep minutes and season counts in separate columns so the two units are never mixed
    df['duration_min'] = df['duration'].apply(lambda x: parse_duration(x, 'min'))
    df['seasons'] = df['duration'].apply(lambda x: parse_duration(x, 'Season'))

# 8) Fill missing country
if 'country' in df.columns:
    df['country'] = df['country'].replace('', 'Unknown')

print('Preprocessing done. Sample:')
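# Show only the preview columns that actually exist, so a missing column does not raise KeyError
preview_cols = ['title','type','release_year','year_added','primary_genre','num_genres','country']
df[[c for c in preview_cols if c in df.columns]].head()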

In [None]:
# EDA: Movies vs TV Shows overall and over time
if 'type' in df.columns:
    type_counts = df['type'].value_counts()
    print('Overall counts by type:\n', type_counts)

    if 'year_added' in df.columns:
        year_type = df.groupby(['year_added','type']).size().unstack(fill_value=0)
        year_type['Total'] = year_type.sum(axis=1)
        display(year_type.tail(10))
        fig = px.area(year_type.reset_index(), x='year_added', y=[c for c in year_type.columns if c!='Total'],
                      title='Titles added per year by Type (Movies vs TV Shows)')
        fig.show()
else:
    print('Column `type` not found in dataset.')

In [None]:
# Top genres (multi-label handling)
from collections import Counter
all_genres = Counter(g for sub in df['genres'] for g in sub)
top_genres = all_genres.most_common(20)
pd.DataFrame(top_genres, columns=['genre','count']).set_index('genre')

In [None]:
# Time series for top 6 genres
top6 = [g for g,c in top_genres[:6]]
df_expl = df.explode('genres')
if 'year_added' in df.columns:
    ts = df_expl[df_expl['genres'].isin(top6)].groupby(['year_added','genres']).size().unstack(fill_value=0)
    fig = px.line(ts.reset_index(), x='year_added', y=top6, markers=True, title='Top 6 Genres Over Time')
    fig.show()
else:
    print('`year_added` column missing; cannot plot genre time series.')

In [None]:
# Country-wise contributions (top countries)
if 'country' in df.columns:
    # handle multiple countries per row: take the first non-empty country as primary for simplicity
    def first_country(x):
        parts = [p.strip() for p in x.split(',') if p.strip()]
        return parts[0] if parts else 'Unknown'
    df['primary_country'] = df['country'].apply(first_country)
    country_counts = df['primary_country'].value_counts().head(20)
    country_counts_df = country_counts.reset_index()
    country_counts_df.columns = ['country','count']
    display(country_counts_df)
    fig = px.bar(country_counts_df, x='count', y='country', orientation='h', title='Top 20 Countries by Titles')
    fig.show()
else:
    print('Column `country` not found.')
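
Taking only the first listed country undercounts co-producing countries. As a rough alternative, the sketch below credits every listed country by splitting and exploding the column. The `sample` frame and the `count_all_countries` helper are hypothetical names used for illustration; they are not part of the notebook above.

```python
import pandas as pd

# Hypothetical miniature of the `country` column; real values come from netflix.csv.
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "country": ["United States, India", "India", ", France", ""],
})

def count_all_countries(frame: pd.DataFrame, col: str = "country") -> pd.Series:
    """Count titles per country, crediting every country listed on a row."""
    countries = (
        frame[col]
        .fillna("")
        .str.split(",")   # one list of country names per title
        .explode()        # one row per (title, country) pair
        .str.strip()
    )
    countries = countries[countries != ""]  # drop blanks such as the one in ", France"
    return countries.value_counts()

all_country_counts = count_all_countries(sample)
print(all_country_counts)
```

In this notebook, empty countries were already replaced with `'Unknown'` during preprocessing, so `count_all_countries(df)` would report `Unknown` as its own category unless it is filtered out first.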

In [None]:
# Genre co-occurrence matrix (heatmap)
mlb = MultiLabelBinarizer()
genre_matrix = mlb.fit_transform(df['genres'])
genre_df = pd.DataFrame(genre_matrix, columns=mlb.classes_)
co_occ = pd.DataFrame(genre_df.T.dot(genre_df), index=mlb.classes_, columns=mlb.classes_)
# show top 12 genres co-occurrence heatmap
top12 = [g for g,c in top_genres[:12]]
co_small = co_occ.loc[top12, top12]
fig = px.imshow(co_small, text_auto=True, title='Genre Co-occurrence (Top 12)')
fig.show()

In [None]:
# Clustering example: KMeans on genre multi-hot vectors
# We'll cluster titles by their genre multi-hot vectors (k=5 clusters as a starting point)
if genre_df.shape[1] > 0:
    X = genre_df.values
    k = 5
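    # n_init is set explicitly so results do not depend on the scikit-learn version's default
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)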
    labels = kmeans.fit_predict(X)
    df['genre_cluster'] = labels
    display(df[['title','type','primary_genre','genre_cluster']].head())
    # cluster sizes
    display(df['genre_cluster'].value_counts().sort_index())
else:
    print('No genres available for clustering.')
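
The value `k = 5` is only a starting point. One common way to compare candidate values of `k` is the silhouette score. The sketch below runs on a small synthetic multi-hot matrix; `X_demo` and `silhouette_by_k` are illustrative names, and the synthetic data is an assumption, not the Netflix data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic multi-hot "genre" matrix: three base genre patterns, 20 titles each,
# with about 5% of the entries flipped at random. Purely illustrative data.
rng = np.random.default_rng(0)
base = np.array([
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 1],
])
X_demo = np.repeat(base, 20, axis=0)
flip = rng.random(X_demo.shape) < 0.05
X_demo = np.where(flip, 1 - X_demo, X_demo)

def silhouette_by_k(X, k_values, seed=42):
    """Fit KMeans for each k and return {k: silhouette score}."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        scores[k] = float(silhouette_score(X, labels))
    return scores

scores = silhouette_by_k(X_demo, range(2, 6))
best_k = max(scores, key=scores.get)  # k with the highest silhouette score
print(scores, "best k:", best_k)
```

For the notebook's data, the same helper could be called as `silhouette_by_k(genre_df.values, range(2, 11))`. Silhouette computation scales quadratically with the number of rows, so passing `sample_size` to `silhouette_score` may be worthwhile on the full dataset.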

In [None]:
# Trend detection: compute linear trend (slope) for top genres
# numpy is already imported as np in the setup cell
from sklearn.linear_model import LinearRegression

if 'year_added' in df.columns:
    trends = []
    for genre, _ in top_genres:
        series = df_expl[df_expl['genres']==genre].groupby('year_added').size()
        if len(series) >= 3:
            X = np.array(series.index).reshape(-1,1)
            y = series.values
            lr = LinearRegression().fit(X,y)
            slope = lr.coef_[0]
            trends.append((genre, slope))
    trends_sorted = sorted(trends, key=lambda x: x[1], reverse=True)
    # display() is required here: a bare expression inside an if-block is not rendered
    display(pd.DataFrame(trends_sorted, columns=['genre','slope']).head(15))
else:
    print('`year_added` missing; cannot compute trends.')
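
To make the slope values easier to read, here is a toy check with made-up yearly counts (the numbers are hypothetical, not taken from `netflix.csv`). The slope is the average change in titles added per year. Dividing it by the mean count gives a rough relative growth rate, which is one possible way to compare genres of different sizes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical yearly title counts for a single genre.
years = np.array([2016, 2017, 2018, 2019]).reshape(-1, 1)
counts = np.array([10, 20, 30, 40])

toy_model = LinearRegression().fit(years, counts)
toy_slope = float(toy_model.coef_[0])            # titles gained per year
relative_growth = toy_slope / counts.mean()      # slope relative to the average yearly count

print(f"slope = {toy_slope:.2f} titles per year")
print(f"relative growth = {relative_growth:.2f} per year")
```

Because the raw slope is an absolute count, large genres tend to rank first in the table above. The relative figure is an optional complement, not a replacement for it.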

In [None]:
# Save cleaned sample output (optional)
df_clean_sample = df.sample(frac=0.1, random_state=42)  # 10% sample
df_clean_sample.to_csv('netflix_clean_sample.csv', index=False)
print('Saved netflix_clean_sample.csv')

## What to write in your report (placeholders)
- **Key Findings:** (write bullet points based on the plots and tables you see)
- **Top genres:** 
- **Rising genres:** 
- **Country insights:** 
- **Movies vs TV Shows trend:** 
- **Clustering interpretation:** 
- **Strategic recommendations:** 


---  
### Next steps you can do (suggestions)
- Add popularity or rating data if available to prioritize high-impact content.  
- Build an interactive dashboard using Dash or Streamlit.  
- Run forecasting (Prophet) for top genres if you want future projections.  
- Expand clustering using title text or descriptions (NLP).  

Good luck. Run the notebook cells and paste your observations into the placeholders. If a cell errors because of a column-name mismatch, rename the column in your file or update the name in the Preprocessing cell.  
