# Netflix Dataset Analysis — Colab Notebook

This notebook analyzes content trends in the provided Netflix dataset, following the project proposal:

- Distribution: Movies vs TV Shows over the years
- Popular genres and their trends
- Country-wise contributions

Open this notebook in Google Colab and run all cells. The dataset file should be uploaded to the session (or mounted from Drive).
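
As a rough sketch of how the dataset location could be resolved before loading, the helper below returns the first candidate path that exists. The function name `first_existing` and the two `/content/...` paths are assumptions made for illustration; only `/mnt/data/Netflix Dataset.csv` comes from this notebook.

```python
from pathlib import Path


def first_existing(candidates):
    """Return the first candidate path that exists as a file, else None."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_file():
            return path
    return None


# In Colab, Drive can be mounted first with:
#   from google.colab import drive
#   drive.mount('/content/drive')
candidates = [
    "/mnt/data/Netflix Dataset.csv",               # path used by this notebook
    "/content/Netflix Dataset.csv",                # assumed: file uploaded to the session
    "/content/drive/MyDrive/Netflix Dataset.csv",  # assumed: hypothetical Drive location
]
csv_path = first_existing(candidates)
print("Using:", csv_path if csv_path else "no CSV found; upload it or mount Drive first")
```

If none of the candidates exist, `csv_path` is `None` and the upload fallback in the setup cell still applies.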

In [None]:
# Setup: imports and dataset loading
from collections import Counter  # used by the genre, country, and recommendation cells

import pandas as pd
import matplotlib.pyplot as plt

# Try a default path first; change csv_path to wherever the CSV lives in your session
csv_path = "/mnt/data/Netflix Dataset.csv"

try:
    df = pd.read_csv(csv_path)
    print("Loaded dataset from", csv_path)
except Exception as e:
    print("Could not load dataset from default path:", csv_path)
    print("Error:", e)
    print("\nIf running in Google Colab, upload the CSV using the file upload widget or mount your Drive.")
    # Colab-specific upload (will run only if user executes in Colab)
    try:
        from google.colab import files
        uploaded = files.upload()
        # pick the first uploaded file
        first = list(uploaded.keys())[0]
        df = pd.read_csv(first)
        print("Loaded uploaded file:", first)
    except Exception as e2:
        print("Upload step skipped or failed:", e2)
        df = None

# Basic info
if df is not None:
    print("Dataset shape:", df.shape)
    display(df.head())

## 1) Basic Cleaning & Preprocessing

- Derive a numeric `year` column from `release_year`, falling back to the year of `date_added` when `release_year` is absent.
- Split `listed_in` (genres) into lists for analysis.
- Strip stray whitespace from the `type` column (Movie / TV Show).
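
The steps above can be sketched on a few made-up rows. The values below are invented for illustration and are not taken from the Netflix dataset; the column names follow the ones this notebook expects.

```python
import pandas as pd

# Toy rows (invented for illustration)
toy = pd.DataFrame({
    "type": [" Movie", "TV Show ", "Movie"],
    "release_year": ["2019", "n/a", "2021"],
    "listed_in": ["Dramas, Comedies", None, "Documentaries"],
})

# Strip stray whitespace so ' Movie' and 'Movie' count as the same type
toy["type"] = toy["type"].str.strip()

# Unparseable years become NaN instead of raising
toy["year"] = pd.to_numeric(toy["release_year"], errors="coerce")

# Split the comma-separated genre string into a clean list per row
toy["genres_list"] = toy["listed_in"].fillna("").apply(
    lambda s: [g.strip() for g in s.split(",") if g.strip()]
)

print(toy[["type", "year", "genres_list"]])
```

Using `errors="coerce"` keeps malformed rows in the frame with a missing year, so they can still contribute to genre and country counts.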

In [None]:
# Basic cleaning
if df is None:
    raise ValueError("Dataset not loaded. Please upload or mount the CSV and re-run this cell.")

# Ensure consistent column names (strip)
df.columns = [c.strip() for c in df.columns]

# Create a 'year' column
if 'release_year' in df.columns:
    df['year'] = pd.to_numeric(df['release_year'], errors='coerce')
elif 'date_added' in df.columns:
    df['year'] = pd.to_datetime(df['date_added'], errors='coerce').dt.year
else:
    df['year'] = None

# Normalize 'type' column
if 'type' in df.columns:
    df['type'] = df['type'].str.strip()

# Split genres (assuming column named 'listed_in' or 'genre' or 'genres')
genre_col = None
for candidate in ['listed_in', 'genres', 'genre']:
    if candidate in df.columns:
        genre_col = candidate
        break

if genre_col is None:
    print("No genre column found. Genre-based analysis will be limited.")
else:
    # Convert to lists
    df['genres_list'] = df[genre_col].fillna('').apply(lambda x: [g.strip() for g in x.split(',')] if x else [])

# Inspect cleaned frame (only columns that are actually present, to avoid a KeyError)
preview_cols = [c for c in ['title', 'type', 'year'] if c in df.columns]
df[preview_cols].head()

## 2) Distribution: Movies vs TV Shows Over Years

Plot how the counts of Movies and TV Shows changed by year.
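
To make the shape of the per-year table concrete, here is a minimal sketch on invented rows: grouping by `year` and `type`, then unstacking, gives one row per year and one column per content type.

```python
import pandas as pd

# Toy rows (invented for illustration)
toy = pd.DataFrame({
    "year": [2019, 2019, 2020, 2020, 2020],
    "type": ["Movie", "TV Show", "Movie", "Movie", "TV Show"],
})

# Count titles per (year, type), then pivot 'type' into columns
counts = (
    toy.groupby(["year", "type"])
    .size()
    .unstack(fill_value=0)
    .sort_index()
)
print(counts)
```

`fill_value=0` matters when a type has no titles in some year; without it the table would contain NaN there.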

In [None]:
# Movies vs TV Shows trend over years
if 'type' in df.columns and 'year' in df.columns:
    trend = df.groupby(['year','type']).size().unstack(fill_value=0).sort_index()
    print(trend.tail(10))
    # Plot
    plt.figure(figsize=(10,5))
    for col in trend.columns:
        plt.plot(trend.index, trend[col], label=col)
    plt.xlabel("Year")
    plt.ylabel("Count")
    plt.title("Movies vs TV Shows by Year")
    plt.legend()
    plt.grid(True)
    plt.show()
else:
    print("Required columns 'type' and 'year' not found.")

## 3) Genre Analysis: Most Common Genres and Trends

- Expand genre lists into a long table and count frequency.
- Show top genres overall and over time.
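
One possible way to build the long table is `DataFrame.explode`, sketched below on invented rows. This is an alternative to the `Counter`-based approach in the cell that follows, not a description of it.

```python
import pandas as pd

# Toy rows (invented for illustration)
toy = pd.DataFrame({
    "year": [2019, 2020, 2020, 2021],
    "genres_list": [["Dramas", "Comedies"], ["Dramas"], [], ["Dramas"]],
})

# One row per (title, genre); rows with an empty list become NaN and are dropped
long = (
    toy.explode("genres_list")
    .dropna(subset=["genres_list"])
    .rename(columns={"genres_list": "genre"})
)

overall = long["genre"].value_counts()                                 # frequency overall
per_year = long.groupby(["year", "genre"]).size().unstack(fill_value=0)  # frequency by year

print(overall)
print(per_year)
```

A title tagged with several genres counts once for each of them, so the totals exceed the number of titles.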

In [None]:
# Genre frequency
if 'genres_list' in df.columns:
    all_genres = Counter([g for sub in df['genres_list'] for g in sub if g])
    common = pd.DataFrame(all_genres.most_common(), columns=['genre','count']).head(20)
    display(common)
    
    # Expand per-year counts for top genres
    top_genres = [g for g,c in all_genres.most_common(6)]
    # Build per-year counts
    genre_year = {}
    for g in top_genres:
        genre_year[g] = df[df['genres_list'].apply(lambda lst: g in lst) & df['year'].notna()].groupby('year').size()
    genre_year_df = pd.DataFrame(genre_year).fillna(0).sort_index()
    display(genre_year_df.tail(10))
    # Plot each top genre trend
    plt.figure(figsize=(10,6))
    for g in genre_year_df.columns:
        plt.plot(genre_year_df.index, genre_year_df[g], label=g)
    plt.xlabel('Year')
    plt.ylabel('Count')
    plt.title('Top Genres Over Time')
    plt.legend()
    plt.grid(True)
    plt.show()
else:
    print('No genre list available for analysis.')

## 4) Country-wise Contributions

- Count contributions by country and show top contributors.
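
A small sketch of the counting step, using invented `country` strings: each entry may list several countries separated by commas, and a co-production is counted once for every country it names.

```python
from collections import Counter

# Toy 'country' values (invented for illustration); None stands for a missing entry
raw = ["United States, India", "India", None, "United Kingdom,  India "]

counts = Counter(
    name.strip()
    for entry in raw if entry          # skip missing entries
    for name in entry.split(",")
    if name.strip()                    # skip empty fragments
)

print(counts.most_common(3))
```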

In [None]:
# Country analysis
if 'country' in df.columns:
    df['country_list'] = df['country'].fillna('').apply(lambda x: [c.strip() for c in x.split(',')] if x else [])
    from collections import Counter
    country_counts = Counter([c for sub in df['country_list'] for c in sub if c])
    country_df = pd.DataFrame(country_counts.most_common(20), columns=['country','count'])
    display(country_df)
    
    # Plot top 10 countries
    top10 = country_df.head(10).set_index('country')
    plt.figure(figsize=(10,5))
    plt.bar(top10.index, top10['count'])
    plt.xlabel('Country')
    plt.ylabel('Count')
    plt.title('Top 10 Countries by Content Count')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()
else:
    print('No country column found.')

## 5) Strategic Recommendations (automated hints)

- Simple heuristics to suggest underrepresented countries/genres.
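
One way such a heuristic could look is sketched below. The function name `flag_underrepresented` and the toy counts are hypothetical; the threshold of 20 and the limit of 5 mirror the values used in the cell that follows. Unlike that cell, this sketch sorts the flagged entries rarest first.

```python
from collections import Counter


def flag_underrepresented(counts, threshold=20, limit=5):
    """Return up to `limit` (name, count) pairs with count below `threshold`, rarest first."""
    low = [(name, n) for name, n in counts.items() if n < threshold]
    low.sort(key=lambda pair: (pair[1], pair[0]))  # by count, then name for stable ties
    return low[:limit]


# Toy counts (invented for illustration)
toy_counts = Counter({"A": 100, "B": 3, "C": 19, "D": 1})
print(flag_underrepresented(toy_counts))
```

A fixed cutoff is a crude signal: what counts as "underrepresented" depends on the catalog size, so the threshold is worth revisiting for a different dataset.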

In [None]:
# Simple automated recommendations
recs = []
if 'genres_list' in df.columns and 'country_list' in df.columns:
    # Count each genre and country once instead of rebuilding the counters per use
    genre_counts = Counter(g for sub in df['genres_list'] for g in sub if g)
    country_counts = Counter(c for sub in df['country_list'] for c in sub if c)
    # Countries with fewer than 20 titles (first five in descending-count order)
    low_country = [(c, cnt) for c, cnt in country_counts.most_common() if cnt < 20][:5]
    recs.append('Consider increasing content in underrepresented countries: {}'.format(low_country or 'None flagged'))
    # Five least frequent genres
    under = [g for g, _ in genre_counts.most_common()[-5:]]
    recs.append('Consider exploring or promoting genres: {}'.format(under))
else:
    recs.append('Insufficient data to generate automated recommendations.')

for r in recs:
    print('-', r)

## Save cleaned dataset and summary (optional)

The next cell saves the cleaned dataset as `netflix_cleaned.csv` in the current working directory, from where you can download it in the Colab session.
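
Note that `to_csv` writes list-valued columns such as `genres_list` as their Python string representation. A minimal sketch, on invented rows, of joining the lists into a delimited string before export so the file reads back cleanly; the column name `genres` and the in-memory buffer are choices made for this illustration.

```python
import io

import pandas as pd

# Toy rows (invented for illustration)
toy = pd.DataFrame({
    "title": ["A", "B"],
    "genres_list": [["Dramas", "Comedies"], []],
})

# Replace the list column with a plain comma-separated string
export = toy.assign(genres=toy["genres_list"].apply(", ".join)).drop(columns="genres_list")

# Write to an in-memory buffer here; a file name works the same way
buffer = io.StringIO()
export.to_csv(buffer, index=False)

# Read back, keeping empty strings as '' rather than NaN
buffer.seek(0)
restored = pd.read_csv(buffer, keep_default_na=False)
print(restored)
```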

In [None]:
# Save cleaned outputs
if df is not None:
    df.to_csv('netflix_cleaned.csv', index=False)
    print("Saved netflix_cleaned.csv in current working directory.")