
# Project Introduction: Content Category Analysis

This notebook analyzes the content categories in a dataset of articles. The goal is to show which categories dominate, how the category mix changes over time, and how publishing patterns differ between content types.

<div align="center">
  <img src="https://raw.githubusercontent.com/AshishJangra27/data-analytics-projects/refs/heads/main/Articles%20Trend%20Analysis/banner_minimalistic.png" width="100%" alt="Project Banner">
</div>

**Dataset Details:**

Our analysis is based on [this CSV file hosted on GitHub](https://github.com/AshishJangra27/datasets/raw/main/GFG%20Articles%20Latest/gfg_articles_clean_data_for_dashboarding.csv). For each article, the dataset includes `title`, `last_updated` (with the parsed parts `last_updated_date`, `last_updated_month`, and `last_updated_year`), `clean_tags` (the content category), `day_of_week`, and `no_of_images`.

**Key Questions Addressed in This Analysis:**

To construct a clear picture of the content landscape, we systematically investigate the following aspects:

1.  **How many unique content categories (`clean_tags`) exist in the dataset?** (Understanding the breadth of topics).
2.  **Which content categories contribute the highest number of articles?** (Identifying dominant content pillars).
3.  **What percentage of the overall dataset does each category represent?** (Assessing the proportional significance of categories).
4.  **How does the distribution of content categories differ across years?** (Tracking temporal shifts and emerging trends).
5.  **Which categories show consistent publishing activity over time?** (Highlighting enduring topics of interest).
6.  **Do certain content categories tend to have more images than others?** (Exploring the role of visual engagement across different content types).
7.  **How does category diversity change as the platform grows in size?** (Analyzing the evolution of topic breadth).
8.  **Which categories show recent growth based on latest update dates?** (Pinpointing current areas of increased activity).
9.  **Can content categories be grouped into broader themes based on publishing behavior?** (Revealing underlying structural relationships between topics).

Together, the answers describe how content is distributed across categories and over time, which can inform content strategy and future content development.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.graph_objects as go


df = pd.read_csv('https://github.com/AshishJangra27/datasets/raw/main/GFG%20Articles%20Latest/gfg_articles_clean_data_for_dashboarding.csv')
df.head()

| Unnamed: 0 | title | last_updated | last_updated_date | last_updated_month | last_updated_year | clean_tags | day_of_week | no_of_images |
|---|---|---|---|---|---|---|---|---|
| 0 | Capgemini Interview Experience \| On-Campus 202... | 28 October 2020 | 28 | October | 2020 | Interview Experiences | Wednesday | 0 |
| 1 | Optum(UHG) Interview Experience for Internship... | 28 October 2020 | 28 | October | 2020 | Interview Experiences | Wednesday | 0 |
| 2 | Amdocs Interview Experience (On-Campus) | 28 October 2020 | 28 | October | 2020 | Interview Experiences | Wednesday | 0 |
| 3 | Capgemini Interview Experience \| On-Campus (Vi... | 27 October 2020 | 27 | October | 2020 | Interview Experiences | Tuesday | 0 |
| 4 | SRIB Interview Experience for Internship 2020 | 23 July 2025 | 23 | July | 2025 | Interview Experiences | Wednesday | 0 |


### 1. Content Category Analysis

#### 1.1) How many unique content categories (clean_tags) exist in the dataset?

In [3]:
df['clean_tags'].nunique()

598
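The raw count treats tags that differ only in letter case or surrounding whitespace as separate categories, and the tag lists in section 1.9 contain pairs such as `'Python'`/`'python'` and `'Docker'`/`'docker'`. The following is a minimal sketch of how one might check how many categories remain after normalization. The `toy` frame and its values are made up for illustration; in the notebook, `df['clean_tags']` would take its place.

```python
import pandas as pd

# Hypothetical sample; only the column name comes from the dataset
toy = pd.DataFrame({
    'clean_tags': ['Python', 'python', 'Docker', 'docker ', 'SQL', None]
})

# nunique() ignores missing values by default
raw_unique = toy['clean_tags'].nunique()

# Normalize case and surrounding whitespace before counting
normalized = toy['clean_tags'].str.strip().str.casefold()
normalized_unique = normalized.nunique()

print(raw_unique, normalized_unique)
```

If the two numbers differ on the real data, part of the apparent breadth comes from inconsistent tag spelling rather than from distinct topics.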

#### 1.2) Which content categories contribute the highest number of articles?

In [4]:
# @title
import plotly.graph_objects as go

data = df['clean_tags'].value_counts().head(25)

# Dark futuristic color palette
bar_color = '#2F8D46'
bg_color = '#0F111A'
grid_color = '#1B1E2D'
font_color = '#FFFFFF'

# Determine maximum value to give extra headroom for text
max_value = data.max() * 1.15  # 15% extra space above the highest bar

# Create interactive bar chart
fig = go.Figure(data=[
    go.Bar(
        x=data.index,
        y=data.values,
        text=data.values,
        textposition='outside',  # Always show above bars
        textfont=dict(size=14, color=font_color),
        marker=dict(color=bar_color, line=dict(color='#00FF9F', width=1.5)),
        hovertemplate='<b>%{x}</b><br>Count: %{y}<extra></extra>'
    )
])

# Update layout for dark futuristic theme
fig.update_layout(
    title='Top 25 Clean Tags',
    title_font=dict(family='Courier New, monospace', size=24, color='#00FF9F'),
    plot_bgcolor=bg_color,
    paper_bgcolor=bg_color,
    font=dict(family='Courier New, monospace', size=14, color=font_color),
    xaxis=dict(
        tickangle=-45,
        showgrid=True,
        gridcolor=grid_color,
        zeroline=False
    ),
    yaxis=dict(
        showgrid=True,
        gridcolor=grid_color,
        zeroline=False,
        range=[0, max_value]  # Ensure text above highest bar is visible
    ),
    margin=dict(l=60, r=30, t=80, b=120),
    hoverlabel=dict(bgcolor='#1B1E2D', font_size=12, font_family='Courier New, monospace')
)

# Re-apply the bar outline width (static styling; no animation is involved)
fig.update_traces(marker_line_width=1.5)
fig.show()


* **Dominant Categories:** **'Picked'** (26,906) and **'Web Technologies'** (19,689) are the primary drivers of content volume.
* **Top Languages:** **Python** (17,744) is the most frequent language tag, with nearly twice as many articles as **JavaScript** (9,286) and more than twice as many as **Java** (7,868).
* **Career Focus:** **'Interview Experiences'** (14,100) ranks as a top-tier category, highlighting a strong user interest in career preparation.
* **Trend:** The data follows a **long-tail distribution**; activity is heavily concentrated in the top 5 tags, with a sharp drop-off into specialized topics like Machine Learning and SQL.
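One way to put a number on the long-tail observation is the cumulative share of articles covered by the k largest tags. The sketch below uses made-up counts (`counts` is hypothetical); in the notebook, `df['clean_tags'].value_counts()` would be the input.

```python
import pandas as pd

# Hypothetical tag counts, standing in for df['clean_tags'].value_counts()
counts = pd.Series({'A': 50, 'B': 25, 'C': 15, 'D': 6, 'E': 4})

# Cumulative share of all articles covered by the k largest tags
cum_share = counts.sort_values(ascending=False).cumsum() / counts.sum()

# Share covered by the three largest tags
top3_share = cum_share.iloc[2]

print(cum_share.round(2).to_dict())
```

A curve that rises steeply for the first few tags and then flattens is what a long-tail distribution looks like in this view.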

#### 1.3) What percentage of the overall dataset does each category represent?

In [5]:
# @title
total_articles = len(df)

df_ = df['clean_tags'].value_counts().reset_index()
df_.columns = ['tag', 'count']
df_['percentage'] = (df_['count'] / total_articles * 100).round(4)

# Separate top 10 and sum the rest as 'Others'
top_n = 10
top_tags = df_.head(top_n).copy()
others_count = df_['count'][top_n:].sum()
others_percentage = df_['percentage'][top_n:].sum()

others_df = pd.DataFrame([{
    'tag': 'Others',
    'count': others_count,
    'percentage': others_percentage
}])

top_tags = pd.concat([top_tags, others_df], ignore_index=True)

# Dark futuristic theme colors
bg_color = '#0F111A'
font_color = '#FFFFFF'
colors = ['#2F8D46'] * top_n + ['#1F5E30']

# Create interactive donut chart
fig = go.Figure(data=[go.Pie(
    labels=top_tags['tag'],
    values=top_tags['percentage'],
    hole=0.4,
    textinfo='label+percent',
    textposition='outside',          # ✅ FORCE OUTSIDE
    marker=dict(
        colors=colors,
        line=dict(color='#00FF9F', width=1.5)
    ),
    hovertemplate='<b>%{label}</b><br>Percentage: %{value:.2f}%<extra></extra>'
)])

# Update layout
fig.update_layout(
    title='Top 10 Tags with Others',
    title_font=dict(
        family='Courier New, monospace',
        size=24,
        color='#00FF9F'
    ),
    plot_bgcolor=bg_color,
    paper_bgcolor=bg_color,
    font=dict(
        family='Courier New, monospace',
        size=14,
        color=font_color
    ),
    showlegend=False,                # optional: cleaner since labels are outside
    margin=dict(t=80, b=80, l=80, r=80)  # ✅ extra space for labels
)

fig.show()

* **Diversity:** **'Others'** accounts for **32.5%**, indicating a high variety of niche content beyond the top 10 categories.
* **Top Individual Tag:** **'Picked'** is the most significant single category at **15.4%**.
* **Core Technical Pillars:** **Web Technologies** (11.3%) and **Python** (10.2%) are the only specific technical subjects to reach double-digit percentages.
* **Consolidation:** The top 4 specific tags (**Picked, Web Tech, Python, and Interview Experiences**) together represent roughly **45%** of the total dataset.
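The "top N plus Others" pattern recurs in several cells of this notebook, so it may be worth wrapping in a small helper. The following is a sketch under stated assumptions: `top_n_with_others` is a hypothetical name, not part of the original notebook, and the counts are made up. It sorts once, slices the same ordering for both parts, and derives percentages from counts so they add up to 100.

```python
import pandas as pd

def top_n_with_others(counts: pd.Series, n: int) -> pd.DataFrame:
    """Return the n largest categories plus one 'Others' row, with percentages."""
    ordered = counts.sort_values(ascending=False)
    top = ordered.iloc[:n]
    rest = ordered.iloc[n:].sum()
    combined = pd.concat([top, pd.Series({'Others': rest})])
    out = combined.rename_axis('tag').reset_index(name='count')
    # Percentages are computed from counts, not summed from rounded values
    out['percentage'] = (out['count'] / out['count'].sum() * 100).round(2)
    return out

# Hypothetical counts standing in for df['clean_tags'].value_counts()
toy_counts = pd.Series({'Python': 40, 'Java': 30, 'SQL': 20, 'Go': 6, 'Rust': 4})
summary = top_n_with_others(toy_counts, n=2)
print(summary)
```

The resulting frame has the same `tag`, `count`, and `percentage` columns that the donut chart above consumes.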

#### 1.4) How does the distribution of content categories differ across years?

In [28]:
# @title
import plotly.graph_objects as go

# Prepare data
df_year_tag = (
    df.groupby(['last_updated_year', 'clean_tags'])
      .size()
      .reset_index(name='count')
)

years = sorted(df_year_tag['last_updated_year'].unique())

# Default to 2025 when it is present; otherwise fall back to the latest year
default_year = 2025 if 2025 in years else years[-1]
default_index = years.index(default_year)

# Create frames
frames = []
for year in years:
    year_data = (
        df_year_tag[df_year_tag['last_updated_year'] == year]
        .sort_values('count', ascending=False)
        .head(25)
    )

    frames.append(
        go.Frame(
            name=str(year),
            data=[
                go.Bar(
                    x=year_data['clean_tags'],
                    y=year_data['count'],
                    text=year_data['count'],
                    textposition='outside',
                    cliponaxis=False,
                    marker=dict(color='#2F8D46', line=dict(color='#00FF9F', width=1.5)),
                    hovertemplate='<b>%{x}</b><br>Count: %{y}<extra></extra>'
                )
            ],
            layout=go.Layout(
                yaxis=dict(range=[0, year_data['count'].max() * 1.25])
            )
        )
    )

# Initial data (default year)
initial_data = (
    df_year_tag[df_year_tag['last_updated_year'] == default_year]
    .sort_values('count', ascending=False)
    .head(25)
)

# Create figure
fig = go.Figure(
    data=[
        go.Bar(
            x=initial_data['clean_tags'],
            y=initial_data['count'],
            text=initial_data['count'],
            textposition='outside',
            cliponaxis=False,
            marker=dict(color='#2F8D46', line=dict(color='#00FF9F', width=1.5)),
            hovertemplate='<b>%{x}</b><br>Count: %{y}<extra></extra>'
        )
    ],
    frames=frames
)

# Layout with compact top-right slider
fig.update_layout(
    title='Top 25 Clean Tags by Year',
    title_font=dict(family='Courier New, monospace', size=24, color='#00FF9F'),
    plot_bgcolor='#0F111A',
    paper_bgcolor='#0F111A',
    font=dict(family='Courier New, monospace', size=14, color='#FFFFFF'),
    xaxis=dict(tickangle=-45, showgrid=True, gridcolor='#1B1E2D', zeroline=False, automargin=True),
    yaxis=dict(showgrid=True, gridcolor='#1B1E2D', zeroline=False, autorange=True),
    margin=dict(l=80, r=40, t=90, b=160),
    height=650,
    hoverlabel=dict(bgcolor='#1B1E2D', font_size=12, font_family='Courier New, monospace'),
    sliders=[{
        'active': default_index,
        'x': 0.70,
        'y': 1.08,
        'len': 0.25,
        'pad': {'t': 0, 'b': 0},
        'currentvalue': {
            'prefix': 'Year: ',
            'font': {'size': 12}
        },
        'steps': [
            {
                'method': 'animate',
                'label': str(year),
                'args': [[str(year)], {'mode': 'immediate', 'frame': {'duration': 0}}]
            }
            for year in years
        ]
    }]
)

fig.show()


* **Top Subject**: **Python** is the most frequent tag in 2023 with **2,074** entries.
* **Career Content**: **Interview Experiences** ranks second in 2023, so career-related content remains a large part of that year's updates.
* **Shifting Priority**: The **'Picked'** tag (1,361), which leads the overall dataset, sits in third place for 2023.
* **Niche Presence**: In the same year, cloud and web tags such as **Microsoft Azure** (148) and **PHP** (90) show visible but low-volume activity.
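Raw counts are hard to compare between years because the number of articles per year differs. A possible complement, sketched here on made-up rows, is to compare each tag's share of its year using `pd.crosstab` with `normalize='index'`. The `toy` frame is hypothetical; the two column names match the dataset.

```python
import pandas as pd

# Hypothetical rows; the notebook's df has the same two columns
toy = pd.DataFrame({
    'last_updated_year': [2023, 2023, 2023, 2023, 2025, 2025],
    'clean_tags': ['Python', 'Python', 'Python', 'Picked', 'Python', 'Picked'],
})

# Rows: years, columns: tags, values: share of that year's articles
share_by_year = pd.crosstab(
    toy['last_updated_year'], toy['clean_tags'], normalize='index'
)

print(share_by_year.round(2))
```

Each row sums to 1, so a tag whose share rises from one row to the next has gained ground relative to the other tags, regardless of the year's total volume.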

#### 1.5) Which categories show consistent publishing activity over time?

In [29]:
# @title
import pandas as pd
import plotly.graph_objects as go

# Step 1: Aggregate top recurring categories across years
categories = []
for year in range(2015, 2026):
    categories += list(
        df[df['last_updated_year'] == year]['clean_tags']
        .value_counts()
        .head(5)
        .index
    )

unique_categories = list(set(categories))
data = [[cat, categories.count(cat)] for cat in unique_categories]
df_ = pd.DataFrame(data, columns=['category', 'count'])

# Step 2: Calculate percentage for donut chart
total_count = df_['count'].sum()
df_['percentage'] = (df_['count'] / total_count * 100).round(2)

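# Step 3: Separate the top 10 and group the rest as 'Others'
top_n = 10
# Sort once so the top-N slice and the 'Others' remainder share one ordering
df_ = df_.sort_values(by='count', ascending=False).reset_index(drop=True)
top_tags = df_.head(top_n).copy()
others_count = df_['count'].iloc[top_n:].sum()
others_percentage = df_['percentage'].iloc[top_n:].sum()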

others_df = pd.DataFrame([{
    'category': 'Others',
    'count': others_count,
    'percentage': others_percentage
}])

top_tags = pd.concat([top_tags, others_df], ignore_index=True)

# Step 4: Futuristic theme colors
bg_color = '#0F111A'
font_color = '#FFFFFF'
colors = ['#2F8D46'] * top_n + ['#1F5E30']

# Step 5: Create interactive donut chart (ALL LABELS OUTSIDE)
fig = go.Figure(data=[go.Pie(
    labels=top_tags['category'],
    values=top_tags['percentage'],
    hole=0.4,
    textinfo='label+percent',
    textposition='outside',   # ✅ force outside labels
    marker=dict(
        colors=colors,
        line=dict(color='#00FF9F', width=1.5)
    ),
    hovertemplate='<b>%{label}</b><br>Percentage: %{value:.2f}%<extra></extra>'
)])

# Step 6: Layout for futuristic dark theme
fig.update_layout(
    title='Top 10 Most Consistent Categories (2015–2025) with Others',
    title_font=dict(
        family='Courier New, monospace',
        size=24,
        color='#00FF9F'
    ),
    plot_bgcolor=bg_color,
    paper_bgcolor=bg_color,
    font=dict(
        family='Courier New, monospace',
        size=14,
        color=font_color
    ),
    showlegend=False,          # optional but recommended
    margin=dict(t=90, b=90, l=90, r=90)  # ✅ space for labels
)

fig.show()

* **Historical Diversity**: The **'Others'** category dominates at **52.7%**, indicating that across 2015–2025 content is highly fragmented and diverse.
* **Most Consistent Topic**: **Interview Experiences** (14.9%) is the single most consistent individual category over the last decade.
* **Steady Languages**: **Python** (10.8%) and **JavaScript** (5.4%) show high long-term retention compared to other programming languages.
* **Core Content**: **Web Technologies** (9.46%) and **Picked** (6.76%) round out the top 5 pillars of the platform’s decade-long growth.
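Consistency here means how many years a tag appears among that year's most frequent tags. The sketch below computes this with a grouped rank instead of a Python loop. The function name `years_in_top_k` and the sample rows are hypothetical. Ties are broken by row order (`method='first'`), which may differ from the tie-breaking of `value_counts().head(5)`.

```python
import pandas as pd

def years_in_top_k(frame: pd.DataFrame, k: int) -> pd.Series:
    """Count, for each tag, the number of years in which it ranks in the top k."""
    yearly = (
        frame.groupby(['last_updated_year', 'clean_tags'])
             .size()
             .reset_index(name='count')
    )
    # Rank tags within each year; rank 1 is the most frequent tag
    yearly['rank'] = yearly.groupby('last_updated_year')['count'].rank(
        method='first', ascending=False
    )
    return yearly[yearly['rank'] <= k]['clean_tags'].value_counts()

# Hypothetical sample with three years of data
toy = pd.DataFrame({
    'last_updated_year': [2023, 2023, 2023, 2024, 2024, 2024, 2025, 2025, 2025],
    'clean_tags': ['Python', 'Python', 'SQL', 'Python', 'Python', 'Java',
                   'SQL', 'SQL', 'Python'],
})
consistency = years_in_top_k(toy, k=1)
print(consistency)
```

Dividing the result by the number of years gives the fraction of years in which a tag made the top k, which is easier to interpret than a share of all top-k slots.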

#### 1.6) Do certain content categories tend to have more images than others?


In [22]:
# @title
import pandas as pd
import plotly.graph_objects as go

# Step 1: Aggregate data by tag
df_ = (
    df.groupby('clean_tags')['no_of_images']
      .agg(avg_images='mean', total_count='count')
      .reset_index()
)

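# Reads as a percentage only if no_of_images is a 0/1 flag (an assumption about the data);
# for raw image counts this is simply the mean scaled by 100
df_['avg_images_pct'] = (df_['avg_images'] * 100).round(2)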

# Theme colors
line_color = '#2F8D46'
bg_color = '#0F111A'
grid_color = '#1B1E2D'
font_color = '#FFFFFF'
fixed_line_width = 3

# Slider settings
min_count = 1000
default_count = 1000
max_count = min(3000, df_['total_count'].max())
step = 100
slider_values = list(range(min_count, max_count + 1, step))

# Step 2: Create frames (only categories that pass threshold)
frames = []
for val in slider_values:
    df_filtered = df_[df_['total_count'] >= val].sort_values(by='avg_images', ascending=False)

    frames.append(
        go.Frame(
            name=str(val),
            data=[
                go.Scatter(
                    x=df_filtered['clean_tags'],
                    y=df_filtered['avg_images_pct'],
                    mode='lines+markers',
                    marker=dict(color=line_color, size=8, line=dict(color='#00FF9F', width=1.5)),
                    line=dict(color=line_color, width=fixed_line_width),
                    customdata=df_filtered['total_count'],
                    hovertemplate=(
                        '<b>%{x}</b><br>'
                        'Avg Images: %{y:.2f}%<br>'
                        'Total Count: %{customdata}<extra></extra>'
                    )
                )
            ]
        )
    )

# Initial plot (default slider value)
initial_df = df_[df_['total_count'] >= default_count].sort_values(by='avg_images', ascending=False)

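fig = go.Figure(
    data=[
        go.Scatter(
            x=initial_df['clean_tags'],
            y=initial_df['avg_images_pct'],
            mode='lines+markers',
            marker=dict(color=line_color, size=8, line=dict(color='#00FF9F', width=1.5)),
            line=dict(color=line_color, width=fixed_line_width),
            customdata=initial_df['total_count'],
            # Match the hover text used in the slider frames
            hovertemplate=(
                '<b>%{x}</b><br>'
                'Avg Images: %{y:.2f}%<br>'
                'Total Count: %{customdata}<extra></extra>'
            )
        )
    ],
    frames=frames
)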

# Find default slider index
default_slider_index = slider_values.index(default_count)

# Layout
fig.update_layout(
    title='Average Number of Images per Tag (interactive by min total_count)',
    autosize=False,
    width=1400,
    height=700,
    plot_bgcolor=bg_color,
    paper_bgcolor=bg_color,
    font=dict(family='Courier New, monospace', size=14, color=font_color),
    xaxis=dict(
        tickangle=-45,
        showgrid=True,
        gridcolor=grid_color,
        zeroline=False,
        fixedrange=True
    ),
    yaxis=dict(
        title='Average Images (%)',
        showgrid=True,
        gridcolor=grid_color,
        zeroline=False,
        fixedrange=True
    ),
    margin=dict(l=80, r=40, t=100, b=160),
    hoverlabel=dict(bgcolor='#1B1E2D', font_size=12, font_family='Courier New, monospace'),
    sliders=[{
        'active': default_slider_index,
        'x': 0.75,
        'y': 1.08,
        'len': 0.25,
        'currentvalue': {'prefix': 'Min Total Count: ', 'font': {'size': 12}},
        'steps': [
            {
                'method': 'animate',
                'label': str(val),
                'args': [[str(val)], {'mode': 'immediate', 'frame': {'duration': 0}}]
            }
            for val in slider_values
        ]
    }]
)

fig.show()


* **High Visual Needs**: **Installation Guides** and **Web Tech** have the highest image density (over **90%**), likely due to screenshots and diagrams.
* **Low Visual Needs**: Narrative or code-heavy tags like **'Write it Up'** and **'Scala'** average near **0%** image usage.
* **Technical Trend**: Practical "How-To" content is significantly more visual than conceptual or specific language-based posts.
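"More images" can mean two things: a higher average image count per article, or a larger share of articles that contain at least one image. The sketch below computes both per tag so they can be told apart. The sample values and the helper column `has_image` are hypothetical; `clean_tags` and `no_of_images` are the dataset's column names.

```python
import pandas as pd

# Hypothetical sample; only the column names come from the dataset
toy = pd.DataFrame({
    'clean_tags': ['Installation Guide'] * 3 + ['Scala'] * 2,
    'no_of_images': [4, 0, 2, 0, 0],
})
toy['has_image'] = toy['no_of_images'] > 0

image_stats = toy.groupby('clean_tags').agg(
    avg_images=('no_of_images', 'mean'),      # mean image count per article
    pct_with_images=('has_image', 'mean'),    # fraction with at least one image
    total_count=('no_of_images', 'count'),
)
image_stats['pct_with_images'] = image_stats['pct_with_images'] * 100

print(image_stats.round(2))
```

Only `pct_with_images` is bounded by 100, so it is the safer quantity to label as a percentage when `no_of_images` holds counts.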

#### 1.7) How does category diversity change as the platform grows in size?

In [25]:
# @title
import pandas as pd
import plotly.graph_objects as go

# Step 1: Aggregate data by year
data = []
for year in range(2014, 2026):
    data.append([year, df[df['last_updated_year'] == year]['clean_tags'].nunique()])

df_ = pd.DataFrame(data, columns=['year', 'freq'])

# Step 2: Use same futuristic theme colors as before
line_color = '#2F8D46'       # Green line from previous plot
marker_line_color = '#00FF9F'  # Marker border
bg_color = '#0F111A'
grid_color = '#1B1E2D'
font_color = '#FFFFFF'

# Step 3: Create line chart
fig = go.Figure()

fig.add_trace(
    go.Scatter(
        x=df_['year'],
        y=df_['freq'],
        mode='lines+markers',
        marker=dict(color=line_color, size=8, line=dict(color=marker_line_color, width=1.5)),
        line=dict(color=line_color, width=3),
        hovertemplate=(
            '<b>Year %{x}</b><br>'
            'Unique Tags: %{y}<extra></extra>'
        )
    )
)

# Step 4: Update layout for futuristic dark theme
fig.update_layout(
    title='Unique Tags per Year',
    title_font=dict(family='Courier New, monospace', size=24, color='#00FF9F'),
    plot_bgcolor=bg_color,
    paper_bgcolor=bg_color,
    font=dict(family='Courier New, monospace', size=14, color=font_color),
    xaxis=dict(tickangle=-45, showgrid=True, gridcolor=grid_color, zeroline=False),
    yaxis=dict(title='Number of Unique Tags', showgrid=True, gridcolor=grid_color, zeroline=False),
    margin=dict(l=80, r=40, t=100, b=120),
    hoverlabel=dict(bgcolor='#1B1E2D', font_size=12, font_family='Courier New, monospace')
)

fig.show()

* **Recent Surge**: Between **2022 and 2025**, the number of unique tags jumped from approximately **167 to over 440**, marking the most rapid period of content diversification in the record.
* **Early Niche**: During the first four years (**2014–2017**), the platform remained highly specialized, maintaining fewer than **25 unique categories** before beginning a steady climb.
* **Accelerating Trend**: Growth in unique tags accelerated sharply after **2021**: the breadth of topics now expands faster each year than during the earlier, roughly linear climb.
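The yearly unique-tag counts can also be obtained with a single `groupby`, and the year-over-year growth rate gives a rough check on how fast diversity is rising: a roughly constant positive rate would point to exponential growth, while a shrinking rate would point to linear growth. This is a sketch on made-up rows; the `toy` frame is hypothetical.

```python
import pandas as pd

# Hypothetical rows; the notebook's df has the same two columns
toy = pd.DataFrame({
    'last_updated_year': [2021, 2021, 2022, 2022, 2022, 2023, 2023, 2023, 2023],
    'clean_tags': ['A', 'A', 'A', 'B', 'B', 'A', 'B', 'C', 'D'],
})

# Number of distinct tags per year
unique_per_year = toy.groupby('last_updated_year')['clean_tags'].nunique()

# Relative change from the previous year (NaN for the first year)
yoy_growth = unique_per_year.pct_change()

print(pd.DataFrame({'unique_tags': unique_per_year, 'yoy_growth': yoy_growth}))
```

Unlike the fixed `range(2014, 2026)` loop, the grouped version only returns years that occur in the data.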

#### 1.8) Which categories show recent growth based on latest update dates?


In [26]:
# @title
import pandas as pd
import plotly.graph_objects as go

# Step 1: Aggregate top recurring categories across 2020–2025
categories = []
for year in range(2020, 2026):
    categories += list(
        df[df['last_updated_year'] == year]['clean_tags']
        .value_counts()
        .head(5)
        .index
    )

unique_categories = list(set(categories))
data = [[cat, categories.count(cat)] for cat in unique_categories]
df_ = pd.DataFrame(data, columns=['category', 'count'])

# Step 2: Calculate percentage for donut chart
total_count = df_['count'].sum()
df_['percentage'] = (df_['count'] / total_count * 100).round(2)

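# Step 3: Separate the top 5 and group the rest as 'Others'
top_n = 5
# Sort once so the top-N slice and the 'Others' remainder share one ordering
df_ = df_.sort_values(by='count', ascending=False).reset_index(drop=True)
top_tags = df_.head(top_n).copy()
others_count = df_['count'].iloc[top_n:].sum()
others_percentage = df_['percentage'].iloc[top_n:].sum()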
others_df = pd.DataFrame([{
    'category': 'Others',
    'count': others_count,
    'percentage': others_percentage
}])

top_tags = pd.concat([top_tags, others_df], ignore_index=True)

# Step 4: Futuristic theme colors
bg_color = '#0F111A'
font_color = '#FFFFFF'
colors = ['#2F8D46'] * top_n + ['#1F5E30']

# Step 5: Create interactive donut chart (labels outside)
fig = go.Figure(data=[go.Pie(
    labels=top_tags['category'],
    values=top_tags['percentage'],
    hole=0.4,
    textinfo='label+percent',
    textposition='outside',  # ✅ force labels outside
    marker=dict(colors=colors, line=dict(color='#00FF9F', width=1.5)),
    hovertemplate='<b>%{label}</b><br>Percentage: %{value:.2f}%<extra></extra>'
)])

# Step 6: Layout for futuristic dark theme
fig.update_layout(
    title='Top 5 Most Consistent Categories (2020–2025) with Others',
    title_font=dict(family='Courier New, monospace', size=24, color='#00FF9F'),
    plot_bgcolor=bg_color,
    paper_bgcolor=bg_color,
    font=dict(family='Courier New, monospace', size=14, color=font_color),
    showlegend=False,  # optional for cleaner look
    margin=dict(t=90, b=90, l=90, r=90)  # space for outside labels
)

fig.show()


* **Primary Pillars**: **'Interview Experiences'** and **'Web Technologies'** stand as the most consistent individual categories, each accounting for **14.6%** of the total.
* **Programming Focus**: Technical content is heavily weighted toward **Python (12.2%)** and **JavaScript (9.75%)**, which together represent a significant portion of the core consistent output.
* **Content Diversity**: The **'Others'** category is the largest single segment at **36.6%**, suggesting that while five categories are the most consistent, a substantial share of the output is spread across a wide variety of niche topics.
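Recurrence in the yearly top 5 shows which tags stay prominent, but not which ones are growing. One possible way to look at growth directly, sketched here on made-up rows, is to compare each tag's article count in a recent window with its count before that window. The cut-off `recent_start` and the `toy` frame are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical rows; the notebook's df has the same two columns
toy = pd.DataFrame({
    'last_updated_year': [2021, 2021, 2021, 2024, 2024, 2024, 2024, 2025],
    'clean_tags': ['Python', 'Python', 'SQL', 'Python', 'SQL', 'SQL', 'SQL', 'SQL'],
})

recent_start = 2024  # hypothetical cut-off for the "recent" window
is_recent = toy['last_updated_year'] >= recent_start

# Count articles per tag before and within the recent window
window_counts = pd.crosstab(toy['clean_tags'], is_recent).reindex(
    columns=[False, True], fill_value=0
)
window_counts.columns = ['earlier', 'recent']

# Fraction of each tag's articles that fall in the recent window
window_counts['recent_share'] = window_counts['recent'] / (
    window_counts['earlier'] + window_counts['recent']
)
window_counts = window_counts.sort_values('recent_share', ascending=False)

print(window_counts)
```

Because `last_updated_year` records the latest update rather than the original publication date, a high recent share can reflect refreshed older articles as well as new ones.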

#### 1.9) Can content categories be grouped into broader themes based on publishing behavior?


In [27]:
# @title
# ================= Interview / Career =================
interviews_career = [
    'Interview Experiences', 'Interview Tips', 'Interview-Questions',
    'Experiences', 'Work Experiences', 'Campus Experiences',
    'Admission Experiences', 'Competitive Exam Experiences',
    'Contest Experiences', 'Fest Experiences', 'School Experience',
    'Career-Advices', 'Placements', 'placement preparation',
    'CS – Placements', 'Experienced', 'Off-Campus', 'On-Campus',
    'HR', 'HRM', 'Job-A-Thon', 'Exam Tips', 'Reasoning – Placements',
    'interview-preparation', 'Admission Process', 'NDA - SSB',
    'SSB', 'Govt-Exams-Experiences', 'Course Reviews',
    'TCS Digital', 'TCS NQT', 'HackWithInfy',
    'Google Summer Code', 'GSoC', 'Google Girl Hackathon',
    'Google code jam', 'Amazon-WoW', 'Code for Good',
    'HackerRank', 'ACM-ICPC'
]

# ================= Programming Languages =================
programming_languages = [
    'C++', 'Java', 'Python', 'JavaScript', 'PHP', 'C#',
    'Perl', 'R Language', 'Go Language', 'Rust', 'Ruby',
    'Julia', 'Scala', 'Swift', 'Solidity', 'Dart', 'LISP',
    'Kotlin', 'python', 'Programming Language', 'JS++'
]

# ================= Programming Practice =================
language_programs = [
    'C Programs', 'C++ Programs', 'Java Programs',
    'Python Programs', 'Python numpy-program',
    'C Language', 'cpp-advanced', 'java-basics',
    'java-advanced', 'Kotlin Basics', 'pattern-printing',
    'C++ Quiz'
]

# ================= DSA =================
dsa = [
    'DSA', 'Data Structures', 'Algorithms',
    'Arrays', 'Strings', 'Matrix', 'Stack', 'Queue',
    'Linked List', 'Tree', 'Binary Search Tree',
    'Graph', 'Heap', 'Hash',
    'Sorting', 'Searching', 'Recursion',
    'Backtracking', 'Dynamic Programming', 'Greedy',
    'Divide and Conquer', 'Bit Magic', 'Sliding Window 23',
    'Branch and Bound', 'Randomized', 'STL',
    'Advanced Data Structure', 'Pattern Searching',
    'Analysis of Algorithms',
    'Algorithms-Analysis of Algorithms (Recurrences)',
    'Quick Sort', 'Searching Quiz', 'DSA Quiz',
    'C/C++ Puzzles', 'Puzzles', 'Competitive Programming'
]

# ================= Mathematics =================
mathematics = [
    'Mathematics', 'Engineering Mathematics', 'Mathematical',
    'Combinatorial', 'Game Theory', 'Geometric',
    'Aptitude', 'Logical Puzzles',
    'Analytical Mathematical Puzzles', 'Maths',
    'Maths-Formulas', 'Maths-Calculators', 'permutation',
    'MATLAB', 'Octave-GNU', 'computer-graphics', 'Computer Graphics'
]

# ================= Core CS =================
cs_core = [
    'Computer Subject', 'Computer Science Fundamentals',
    'Operating Systems', 'Operating Systems Questions',
    'DBMS', 'dbms', 'DBMS-SQL',
    'Computer Networks', 'Computer Networks-Network Layer',
    'Computer Networks-IP Addressing',
    'Compiler Design', 'Theory of Computation',
    'Computer Organization and Architecture',
    'Computer Organization &amp; Architecture',
    'Digital Logic', 'system-programming',
    'Distributed System', 'Design Pattern',
    'System Design', 'Software Engineering',
    'Information-Security', 'secure-coding',
    'cryptography', 'Ethical Hacking',
    'Linux-Unix', 'GDSC'
]

# ================= Web Development =================
web_development = [
    'HTML', 'HTML5', 'HTML-Questions', 'HTML-Colors', 'HTML-SVG',
    'CSS', 'CSS-Properties', 'JQuery',
    'JavaScript-Questions', 'TypeScript',
    'ReactJS', 'ReactJS-Basics', 'React-Hooks',
    'React-Redux', 'React-Questions', 'react-js',
    'AngularJS', 'Next.js',
    'Web Technologies', 'Web technologies',
    'Web Tech', 'Web technologies-HTML and XML',
    'Web Technologies - Difference Between',
    'Web Templates', 'Frameworks', 'Bootstrap',
    'Material-UI', 'UI Design', 'UI UX Design', 'UX Design',
    'Wordpress', 'Web Scraping', 'Websites & Apps',
    'WebTech-Tools', 'Web-Tech Blogs'
]

# ================= Backend & Frameworks =================
backend_frameworks = [
    'Node.js', 'ExpressJS-Functions', 'ExpressJS-Middlewares',
    'Java-Spring', 'Java-Spring-Boot', 'Java-Spring-MVC',
    'Java-Spring-Security', 'Java-Spring-Cloud',
    'Java-Spring-Data-JPA', 'Java-Spring-Batch',
    'java-servlet', 'Java-JSP', 'java-JVM',
    'Java-Hibernate', 'Java-Object Oriented',
    'rest-framework', 'Maven', 'ASP-Basics',
    'ASP-Methods', 'ASP-Properties', 'VBScript',
    'Web-API', 'Audio-API', 'JSON', 'GraphQL',
    'MERN Stack', 'Mongoose', 'Android', 'Flutter',
    'Flutter UI-components', 'Kotlin Android',
    'Java-Collections', 'Java-Multithreading', 'Java 8',
    'Advance Java', 'Abstract Class and Interface',
    'NodeJS-Questions', 'Java-Sql package',
    'java-swing', 'PHP-Misc', 'Django-Projects',
    'Apache', 'Salesforce', 'selenium',
    'Mobile Computing', 'Software Testing'
]

# ================= Databases =================
databases = [
    'SQL', 'MySQL', 'mysql', 'PostgreSQL', 'postgreSQL',
    'SQL Server', 'SQLServer', 'SQLite', 'MariaDB',
    'DynamoDB', 'MongoDB', 'Firebase',
    'PL/SQL', 'SQL-PL/SQL', 'SQLmysql',
    'JDBC', 'CSV', 'Databases',
    'Elasticsearch', 'Teradata',
    'SQL-Clauses-Operators', 'Data Warehouse', 'Data Types'
]

# ================= AI / ML / DS =================
ai_ml_ds = [
    'Data Science', 'data-science', 'Data Analysis',
    'Data Analytics', 'Data Mining', 'data mining',
    'Machine Learning', 'ML-Statistics', 'ML-EDA',
    'ML-Reinforcement', 'Artificial Intelligence',
    'Artificial-intelligence', 'Deep Learning',
    'Deep-Learning', 'Neural Network',
    'NLP', 'Natural-language-processing',
    'Computer Vision', 'Image-Processing',
    'Generative AI', 'ChatGPT', 'AI Tools',
    'AI News', 'AI Blogs', 'AI Tool Blogs',
    'ChatGPT Blogs', 'ChatGPT Prompts',
    'Hugging Face', 'AI Chatbot',
    'AI-ML-DS With Python', 'Pandas AI',
    'Data Exploration', 'Data Engineering',
    'data', 'R Machine-Learning',
    'R Machine Learning', 'R-Packages',
    'Blockchain', 'Finance'
]

# ================= Python Ecosystem =================
python_ecosystem = [
    'Pandas', 'Numpy', 'Python-pandas', 'Python-numpy',
    'Python pandas-dataFrame', 'Python pandas-series',
    'Python pandas-basics', 'Python-matplotlib',
    'Python-Seaborn', 'Python-scipy',
    'Python-nltk', 'Python-PyTorch', 'Python-Tensorflow',
    'Tensorflow', 'Tensorflow.js',
    'Python-Altair', 'Python-Bokeh',
    'Python-Plotly', 'Python-Pyspark',
    'Python-OpenCV', 'OpenCV',
    'python-modules', 'python-regex',
    'python-os-module', 'Python-Library',
    'Python-Data-Analysis', 'python-utility',
    'Python Django', 'Python Flask',
    'Python Framework', 'Python-selenium',
    'Python-projects', 'Python-PyQt',
    'Python-multithreading', 'Python scikit-module'
]

# ================= Cloud & DevOps =================
cloud_devops = [
    'Cloud Computing', 'Cloud-Computing',
    'Amazon Web Services', 'AWS', 'aws-iam',
    'aws-ec2', 'aws-elastic-beanstalk',
    'Google Cloud Platform', 'Google-Cloud-Platform',
    'google-cloud-app-engine',
    'google-cloud-kubernetes-engine',
    'Microsoft Azure', 'azure',
    'Docker', 'docker', 'Docker Container',
    'Kubernetes', 'Kubernetes-Basics',
    'Hadoop', 'Apache Kafka', 'Apache-Hive',
    'MapReduce', 'BigData', 'virtualization',
    'Cloud Lending', 'DevOps'
]

# ================= Projects & Tools =================
projects_tools = [
    'Project', 'Project-Ideas', 'Web Development Projects',
    'Deep Learning Projects', 'Computer Vision Projects',
    'NLP-Projects', 'Open Source',
    'Internship', 'Installation Guide',
    'Git', 'GIT', 'GitHub', 'Postman',
    'Postman-API-Testing', 'Excel', 'excel',
    'Tableau', 'Power BI', 'Utilities',
    'Converter-Tools', 'Calculator-tools',
    'Image-Tools', 'Image-Converter',
    'PDF-Converter', 'Online-Game-Tools',
    'DSA Online Tools/Converters', 'Tools',
    'Chrome', 'Data Visualization'
]

# ================= Companies =================
companies = [
    'Amazon', 'Google', 'Microsoft', 'IBM', 'Oracle',
    'Goldman Sachs', 'Bank of America', 'Morgan Stanley',
    'JP Morgan', 'Deloitte', 'Accenture', 'TCS',
    'Infosys', 'Wipro', 'Cognizant', 'Capgemini',
    'Flipkart', 'Samsung', 'Airtel', 'Reliance',
    'Visa', 'HSBC', 'SBI', 'Zoho', 'Red Hat',
    'Cisco', 'GE', 'Siemens', 'Qualcomm',
    'Barclays', 'Deutsche Bank', 'Credit Suisse',
    'Fidelity Investments', 'Fidelity International',
    'Thoughtworks', 'Persistent Systems',
    'Tech Mahindra', 'Facebook', 'Netflix',
    'Renault-Nisaan', 'BrowserStack', 'Volkswagen IT Services',
    'Ola Cabs', 'Dell', 'Unacademy', 'Genpact',
    'ISRO', 'Virtusa', 'Finastra',
    'Pwc', 'HCL', 'CRIS', 'Nagarro', 'Hexaware Technologies',
    'Sopra Steria', 'BARC', 'GoJek', 'KPIT', 'Tata Steel',
    'Apisero', 'Perfios', 'To The New', 'Verifone',
    'Incture Technologies', 'Optum', 'FavTutor',
    'o9 Solutions', 'Robert Bosch', 'DXC Technology',
    'DRDO', 'MAQ', "Byju's", 'BIT', 'ATMECS',
    'Infinite Computer Solutions', 'TSS Consultancy',
    'Mallow Technologies', 'Aakash-Byjus', 'SalesFor'
]

# ================= Education =================
education = [
    'IIT Delhi', 'IIT Bombay', 'IIT Madras', 'IIT Kanpur',
    'IIT Kharagpur', 'IIT Hyderabad', 'IIT Jodhpur',
    'IIT Patna', 'IIT Guwahati', 'IIIT Delhi',
    'IIIT Hyderabad', 'IIIT Bhubaneswar',
    'BITS Pilani', 'IISc Bangalore',
    'Lovely Professional University',
    'Poornima College of Engineering',
    'NIT Patna', 'AKTU', 'SRM', 'VIT-AP',
    'GATE', 'GATE CS', 'GATE-GATE IT 2004',
    'IIT JEE', 'IIT- JEE', 'NEET', 'UGC-NET',
    'UPSC', 'SSC/Banking', 'GRE', 'SAT',
    'WBJEE', 'NPTEL', 'BCA',
    'Class 8', 'Class 9', 'Class 10', 'Class 11', 'Class 12',
    'CBSE - Class 11', 'CBSE - Class 12',
    'NCERT Solutions Class-10', 'NCERT Solutions Class-8',
    'Education & Exams', 'School Learning', 'Biology',
    'Social Science', 'Chemistry', 'English', 'Commerce',
    'Geography-MAQ', 'Economics-Class 10',
    'Political Science-Class 10',
    'CBSE-Answer Keys', 'AKTU-question-papers',
    'MHTCET', 'Coaching Centers', 'SATI',
    'SSC Finance and Economics',
    'SSC Geography',
    'Aptech Kolkata', 'aakash', 'Tejas Network', 'PSTakeCare'
]

# ================= Content & Events =================
content_events = [
    'GBlog', 'GBlog 2024', 'GBlog 2025',
    'Articles', 'Tutorials', 'Blogs', 'blogs',
    'Technical Scripter', 'Technical Scripter 2018',
    'Technical Scripter 2019', 'Technical Scripter 2020',
    'Technical Scripter 2022', 'Dev Scripter 2024',
    'Write it Up', 'Write It Up 2024',
    'Blogathon', 'Blogathon-2021',
    'Geeks Premier League', 'Geeks Premier League 2023',
    'Geeks-Premier-League-2022',
    'GFG Academy', 'GFG-Course',
    'GeeksforGeeks', 'GeeksforGeeks Initiatives',
    'GeeksforGeeks-Contests', 'event',
    'News', 'Spotlight', 'RSS', 'Roadmap',
    'Best Apps', 'Outlook Tips', 'Blogger',
    'Google Sites', 'Write From Home', 'TechTips',
    'Full Form', 'How To', 'Difference Between',
    'difference', 'AI-ML-DS Blogs', 'General Knowledge',
    'Game Quiz', 'Current GK'
]

# ================= Electronics =================
electronics = [
    'VLSI', 'Electronics Engineering', 'Verilog-HDL',
    'Electronics', 'Arduino-projects', 'IoT',
    'microprocessor', 'Robotics', 'Microchip',
    'fuzzy-logic'
]

# ================= Misc =================
misc = [
    'TrueGeek-2021', 'progeek', 'Algo-Geek 2021',
    'Algo Geek', 'ProGeek 2.0', 'ProGeek 2021',
    'ProGeek', 'Elite-Batch-2022', 'Coders-Journey',
    'Misc', 'misc'
]


# Map each raw tag to its broader category. Groups are applied in order,
# so a tag listed in more than one group keeps the label of the last match.
# Tags that appear in no group keep their original value.
category_groups = [
    (interviews_career, 'Interview/Career'),
    (programming_languages, 'Programming Languages'),
    (language_programs, 'Programming Practice'),
    (dsa, 'DSA'),
    (mathematics, 'Mathematics'),
    (cs_core, 'Core CS Subjects'),
    (web_development, 'Web Development'),
    (backend_frameworks, 'Backend & Frameworks'),
    (databases, 'Databases'),
    (ai_ml_ds, 'AI / ML / Data Science'),
    (python_ecosystem, 'Python Ecosystem'),
    (cloud_devops, 'Cloud & DevOps'),
    (projects_tools, 'Projects & Tools'),
    (companies, 'Companies'),
    (education, 'Education & Exams'),
    (content_events, 'Content & Events'),
    (electronics, 'Electronics'),
    (misc, 'Misc'),
]

df['broader_category'] = df['clean_tags']
for tags, label in category_groups:
    df['broader_category'] = np.where(df['clean_tags'].isin(tags), label, df['broader_category'])
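
Because the mapping above is applied group by group, a tag that appears in two lists silently ends up with the label of whichever group is applied last. The sketch below shows one way to detect such overlaps before mapping. The helper `find_overlapping_tags` and the miniature `example_groups` are hypothetical and for illustration only; they are not part of the notebook and do not reflect real overlaps in the lists above.

```python
from collections import defaultdict


def find_overlapping_tags(groups):
    """Return {tag: [group labels]} for tags listed under more than one group."""
    seen = defaultdict(list)
    for label, tags in groups.items():
        # dict.fromkeys drops duplicates inside a single list while keeping order
        for tag in dict.fromkeys(tags):
            seen[tag].append(label)
    return {tag: labels for tag, labels in seen.items() if len(labels) > 1}


# Hypothetical miniature groups (assumption: not the notebook's real lists)
example_groups = {
    'Cloud & DevOps': ['AWS', 'Docker', 'Kubernetes'],
    'Companies': ['Amazon', 'Google'],
    'Projects & Tools': ['Git', 'Docker'],
}

overlaps = find_overlapping_tags(example_groups)
print(overlaps)  # {'Docker': ['Cloud & DevOps', 'Projects & Tools']}
```

In the notebook, the same check could be run on a dictionary built from the real lists (for example `{'Cloud & DevOps': cloud_devops, ...}`); an empty result would mean the order of the mapping steps does not matter.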



import pandas as pd
import plotly.graph_objects as go

# Step 1: Get counts of broader categories
category_counts = df['broader_category'].value_counts().reset_index()
category_counts.columns = ['category', 'count']

# Step 2: Separate top categories (>= 2000) and group others
top_categories = category_counts[category_counts['count'] >= 2000].copy()
others_count = category_counts[category_counts['count'] < 2000]['count'].sum()

if others_count > 0:
    top_categories = pd.concat(
        [top_categories, pd.DataFrame([{'category': 'Others', 'count': others_count}])],
        ignore_index=True
    )

# Step 3: Colors (green for the top categories, darker green for 'Others')
if others_count > 0:
    colors = ['#2F8D46'] * (len(top_categories) - 1) + ['#1F5E30']
else:
    colors = ['#2F8D46'] * len(top_categories)

# Step 4: Create interactive donut chart with labels outside
fig = go.Figure(data=[go.Pie(
    labels=top_categories['category'],
    values=top_categories['count'],
    hole=0.4,
    textinfo='label+value',
    textposition='outside',  # place labels outside the donut
    marker=dict(colors=colors, line=dict(color='#00FF9F', width=1.5)),
    hovertemplate='<b>%{label}</b><br>Articles: %{value}<extra></extra>'
)])

# Step 5: Layout for futuristic dark theme
bg_color = '#0F111A'
font_color = '#FFFFFF'

fig.update_layout(
    title='Broader Categories with Others (<2000 Articles)',
    title_font=dict(family='Courier New, monospace', size=24, color='#00FF9F'),
    plot_bgcolor=bg_color,
    paper_bgcolor=bg_color,
    font=dict(family='Courier New, monospace', size=14, color=font_color),
    showlegend=False,  # optional for cleaner look
    margin=dict(t=90, b=90, l=90, r=90)  # space for outside labels
)

fig.show()

* **Volume Leader**: **'Programming Languages'** is the largest category with **44,133 articles**, nearly double the count of the next-highest category.
* **Content Pillars**: Together, **'Picked'**, **'Web Development'**, and **'Interview/Career'** account for over **66,000 articles**.
* **Specialized Long-Tail**: Technical niches like **'AI / ML / Data Science'** and **'Backend & Frameworks'** maintain a steady presence with over **2,500 articles** each, despite their smaller relative share.
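
Comparisons such as "nearly double" and "relative share" can be checked directly from the category counts. The following is a minimal sketch under stated assumptions: `summarize_counts` is a hypothetical helper, and `example_counts` holds placeholder numbers rather than the notebook's figures. It uses plain Python so that it runs without the dataset.

```python
def summarize_counts(counts):
    """Rank categories by count and report each one's share of the total
    and its ratio to the next-ranked category."""
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    summary = []
    for i, (category, count) in enumerate(ranked):
        # The last category has no successor, so its ratio is left as None
        next_count = ranked[i + 1][1] if i + 1 < len(ranked) else None
        summary.append({
            'category': category,
            'count': count,
            'share_pct': round(100 * count / total, 1),
            'ratio_to_next': round(count / next_count, 2) if next_count else None,
        })
    return summary


# Placeholder counts for illustration only (assumption: not real data)
example_counts = {'A': 500, 'B': 260, 'C': 140, 'D': 100}

rows = summarize_counts(example_counts)
for row in rows:
    print(row)
```

With the notebook's data, the input could be built from the `category_counts` frame, for example `dict(zip(category_counts['category'], category_counts['count']))`.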

## Project Summary: Content Category Analysis

This project analyzed the content categories (`clean_tags`) in the articles dataset, covering their volume, share of the total, change over time, image usage, and diversity.

### Key Highlights:

*   **Dominant Categories:** 'Picked' and 'Web Technologies' are the primary content drivers, with Python emerging as the most popular language, significantly surpassing JavaScript and Java.
*   **Career Focus:** 'Interview Experiences' consistently ranks high, indicating strong user interest in career preparation content.
*   **Long-Tail Distribution:** Content is heavily concentrated in the top few tags, with a rapid decline into numerous specialized topics.
*   **Category Diversity:** The dataset is highly diverse: the 'Others' bucket accounts for a substantial portion (32.5% in the overall percentage breakdown and 52.7% in the 10-year consistency analysis), indicating a wide range of niche content.
*   **Consistency Over Time:** 'Interview Experiences', Python, and Web Technologies demonstrate strong long-term consistency in publishing activity across years.
*   **Image Density:** Content categories like 'Installation Guides' and 'Web Tech' are highly visual (over 90% image density), while code-heavy topics have minimal images.
*   **Rapid Diversification:** The number of unique content tags has experienced exponential growth, particularly between 2022 and 2025, suggesting a rapid expansion in topic breadth.
*   **Broader Thematic Groupings:** When categories are grouped, 'Programming Languages' stands out as the largest theme by volume, followed by 'Picked', 'Web Development', and 'Interview/Career'. Technical niches like 'AI / ML / Data Science' and 'Backend & Frameworks' also maintain a steady presence.