# 📊 Exploratory Data Analysis: Research Publications
**Project Intern/Trainee Hiring Assessment - PAIU-OPSA, IISc Bangalore**

---
**Author:** Omkar Sharma

**Date:** November 2025

**Dashboard Link:** https://global-research-analytics.streamlit.app/

**Repository:** https://github.com/Omkar3101/iisc-eda-project

---
### 📝 Objective
The goal of this analysis is to perform an in-depth Exploratory Data Analysis (EDA) on the provided dataset to identify trends, patterns, and anomalies. The insights derived from this notebook will drive the development of an interactive dashboard.

### 📖 Table of Contents
1. [Environment Setup](#setup)
2. [Data Loading & Overview](#loading)
3. [Data Preprocessing & Cleaning](#cleaning)
4. [Exploratory Data Analysis (EDA)](#eda)
    - Univariate Analysis
    - Bivariate Analysis
    - Multivariate Analysis
5. [Key Insights & Conclusion](#insights)
6. [Final Conclusion](#conclusion)

## 1. Environment Setup <a id="setup"></a>
Importing necessary libraries for data manipulation, visualization, and statistical analysis.

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px # for interactive dashboard
import plotly.figure_factory as ff
import plotly.graph_objects as go
from plotly.subplots import make_subplots  # to show more charts
from scipy import stats


# Hide irrelevant warnings
import warnings
warnings.filterwarnings('ignore')

## 2. Data Loading & Overview <a id="loading"></a>
Loading the dataset and performing a preliminary check to understand the structure, features, and data types.

#### 1. Loading dataset with error handling

In [2]:
# Loading the dataset
try:
    df = pd.read_csv('data/publications.csv')
    print("Dataset loaded successfully!")
except FileNotFoundError:
    print("Error: The file 'data/publications.csv' was not found. Please check the file path.")

Dataset loaded successfully!
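
One caveat with the loading cell: if the file is missing, the `except` branch only prints a message and `df` is never defined, so the next cell would fail with a `NameError`. A minimal sketch of a fail-fast alternative, assuming a hypothetical helper named `load_publications` (not part of the original notebook):

```python
from pathlib import Path

import pandas as pd


def load_publications(path):
    """Load the CSV at `path`, raising a clear error if it is missing.

    `load_publications` is a hypothetical helper name used for illustration.
    """
    csv_path = Path(path)
    if not csv_path.is_file():
        # Stop here instead of continuing with an undefined DataFrame.
        raise FileNotFoundError(
            f"Dataset not found at '{csv_path}'. Please check the file path."
        )
    return pd.read_csv(csv_path)
```

It would be called as `df = load_publications('data/publications.csv')`, so a wrong path stops the notebook at the loading step rather than further down.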


#### 2. Visualizing rows of table

In [3]:
# Visualizing the first 5 rows
print("First 5 rows of the dataset:")
display(df.head())

# Visualizing the last 5 rows
print("\nLast 5 rows of the dataset:")
display(df.tail())

First 5 rows of the dataset:


Unnamed: 0,Name,Web of Science Documents,Times Cited,Collab-CNCI,Rank,% Docs Cited,Category Normalized Citation Impact,% Documents in Top 1%,% Documents in Top 10%,Documents in Top 1%,Documents in Top 10%,year
0,SWITZERLAND,24154,2705248,0.946748,8,97.93,1.024815,0.89,10.87,97,230,2023
1,CHINA,2185,157320,1.575928,44,99.6,0.900623,2.98,19.26,323,121,2014
2,CHINA,6896,744768,1.032983,42,95.23,1.679004,1.08,11.36,455,662,2013
3,UNITED KINGDOM,2399,177526,1.586585,3,99.21,1.444246,1.63,10.2,98,2463,2005
4,ITALY,10753,301084,0.812773,2,98.35,1.252122,0.81,17.43,440,134,2004



Last 5 rows of the dataset:


Unnamed: 0,Name,Web of Science Documents,Times Cited,Collab-CNCI,Rank,% Docs Cited,Category Normalized Citation Impact,% Documents in Top 1%,% Documents in Top 10%,Documents in Top 1%,Documents in Top 10%,year
995,UNITED KINGDOM,22195,2130720,1.276037,46,97.97,0.971705,2.9,20.73,274,1803,2024
996,BRAZIL,27344,1832048,1.565469,42,99.16,1.57703,1.39,22.49,143,1514,2020
997,SWITZERLAND,14360,1033920,0.853179,44,96.86,1.258788,2.95,15.25,224,830,2005
998,SWITZERLAND,5423,591107,0.838366,8,97.8,1.508564,0.87,18.58,151,707,2014
999,CHINA,23053,2996890,1.13527,36,96.31,1.458377,0.5,23.07,214,2073,2014


#### 3. Checking Shape of dataset

In [4]:
# Checking the shape of the data
rows, cols = df.shape # <-- tuple unpacking: shape is (rows, columns)
print(f"The dataset contains {rows} rows and {cols} columns.") # <-- use f-string to format output

The dataset contains 1000 rows and 12 columns.


#### 4. Checking Missing and Duplicate Values

In [5]:
# 1. Check for missing values
missing_count = df.isnull().sum().sum() # <-- first sum() counts nulls per column, second sum() totals them
print(f"Total Missing Values: {missing_count}") # <-- use f-string to format output

# 2. Check for duplicates
duplicate_count = df.duplicated().sum() # <-- duplicated() flags fully repeated rows; sum() counts them
print(f"Total Duplicates: {duplicate_count}") # <-- use f-string to format output

Total Missing Values: 0
Total Duplicates: 0


#### 5. Getting Summary of dataset

In [6]:
# Getting a summary of columns, data types, and non-null values
print("Dataset Information:")
df.info()

Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 12 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   Name                                 1000 non-null   object 
 1   Web of Science Documents             1000 non-null   int64  
 2   Times Cited                          1000 non-null   int64  
 3   Collab-CNCI                          1000 non-null   float64
 4   Rank                                 1000 non-null   int64  
 5   % Docs Cited                         1000 non-null   float64
 6   Category Normalized Citation Impact  1000 non-null   float64
 7   % Documents in Top 1%                1000 non-null   float64
 8   % Documents in Top 10%               1000 non-null   float64
 9   Documents in Top 1%                  1000 non-null   int64  
 10  Documents in Top 10%                 1000 non-null   int64  
 11  year      

#### 6. Finding unique countries in dataset

In [7]:
# Extract Unique Countries
unique_countries = df['Name'].unique()

# sort and Print Unique Countries
print('\nUnique Countries:')
display(sorted(unique_countries))


# Here, we can see that United Kingdom and England are recorded separately.


Unique Countries:


['AUSTRALIA',
 'BRAZIL',
 'CANADA',
 'CHINA',
 'ENGLAND',
 'FRANCE',
 'GERMANY',
 'INDIA',
 'ITALY',
 'JAPAN',
 'NETHERLANDS',
 'SOUTH KOREA',
 'SPAIN',
 'SWEDEN',
 'SWITZERLAND',
 'UNITED KINGDOM',
 'USA']

#### 7. Checking the count of repeated countries in each year

In [8]:
# Check count for repeating countries in each year
repeated_countries = pd.crosstab(df['year'], df['Name']) # <-- crosstab() counts the rows for each country in each year
print(repeated_countries)

Name  AUSTRALIA  BRAZIL  CANADA  CHINA  ENGLAND  FRANCE  GERMANY  INDIA  \
year                                                                      
2003          6       0       5      2        4       3        3      2   
2004          2       3       5      4        1       1        1      2   
2005          5       3       2      2        4       5        1      0   
2006          5       1       6      1        1       5        2      2   
2007          2       2       2     10        2       4        5      3   
2008          6       4       4      2        4       4        1      4   
2009          1       4       2      3        0       0        0      4   
2010          3       2       1      2        1       1        2      5   
2011          1       5       3      3        2       4        3      0   
2012          4       6       0      1        2       1        1      3   
2013          0       1       1      2        1       1        2      3   
2014          0       2  
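
To make the crosstab easier to read, here is a small sketch on made-up rows (not taken from the dataset). Any cell greater than 1 marks a country-year pair that appears more than once and will need aggregation later.

```python
import pandas as pd

# Toy records, invented for illustration only.
toy = pd.DataFrame({
    'Name': ['INDIA', 'INDIA', 'CHINA', 'INDIA'],
    'year': [2020, 2020, 2020, 2021],
})

# Rows are years, columns are countries, cells are row counts.
counts = pd.crosstab(toy['year'], toy['Name'])
print(counts)

# Number of country-year pairs that occur more than once.
repeated_pairs = int((counts > 1).to_numpy().sum())
print(f"Repeated country-year pairs: {repeated_pairs}")
```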

#### 8. Getting Statistical Summary of dataset

In [9]:
# Statistical summary of numerical columns
print("Statistical Summary:")
display(df.describe())

Statistical Summary:


Unnamed: 0,Web of Science Documents,Times Cited,Collab-CNCI,Rank,% Docs Cited,Category Normalized Citation Impact,% Documents in Top 1%,% Documents in Top 10%,Documents in Top 1%,Documents in Top 10%,year
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
mean,14861.699,1296497.0,1.214932,24.722,97.41069,1.291637,1.7676,17.58979,261.327,1497.457,2013.86
std,8390.150609,967063.3,0.230261,14.108145,1.419199,0.234461,0.71711,4.3631,136.904576,844.902713,6.748477
min,512.0,21846.0,0.800182,1.0,95.0,0.900623,0.5,10.02,12.0,111.0,2003.0
25%,7616.75,507670.0,1.029402,12.0,96.15,1.08702,1.13,13.77,142.0,736.75,2008.0
50%,14711.0,1064920.0,1.214383,25.0,97.385,1.292028,1.81,17.39,261.5,1481.0,2014.0
75%,22022.25,1899791.0,1.415986,37.0,98.6525,1.499628,2.39,21.64,382.0,2202.25,2020.0
max,29959.0,4327668.0,1.599646,49.0,99.89,1.698257,3.0,24.99,499.0,2999.0,2025.0


#### 9. Finding Outliers in dataset

In [10]:
# Select numerical columns
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()

# To store the results
outlier_report = []

# Loop over all numerical columns
for col in num_cols:

    # Skip the year column (still named 'year' at this point; it is renamed later)
    if col.lower() == 'year':
        continue

    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1

    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Count outliers
    num_outliers = ((df[col] < lower_bound) | (df[col] > upper_bound)).sum()

    # Add to the report
    outlier_report.append({
        'Column Name': col,
        'Outlier Count': num_outliers,
    })

# create dataframe and display
outlier_df = pd.DataFrame(outlier_report)
outlier_df.sort_values(by='Outlier Count', ascending=False, inplace=True)

# Display the summary table
print("Outlier Detection Summary Table:")
display(outlier_df)

Outlier Detection Summary Table:


Unnamed: 0,Column Name,Outlier Count
1,Times Cited,7
0,Web of Science Documents,0
2,Collab-CNCI,0
3,Rank,0
4,% Docs Cited,0
5,Category Normalized Citation Impact,0
6,% Documents in Top 1%,0
7,% Documents in Top 10%,0
8,Documents in Top 1%,0
9,Documents in Top 10%,0
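
The IQR rule used for this table can be sketched as a small reusable helper. The function name `iqr_outlier_mask` and the toy values are assumptions for illustration; the notebook itself computes the bounds inline.

```python
import pandas as pd


def iqr_outlier_mask(series, k=1.5):
    """Return a boolean mask for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Toy values, invented for illustration: one value sits far above the rest.
values = pd.Series([10, 12, 11, 13, 12, 100])
mask = iqr_outlier_mask(values)
print(values[mask])
```

Keeping `k` as a parameter makes it easy to try a stricter or looser rule than the conventional 1.5.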


### 📋 Data Dictionary
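The table below describes each column, using the shortened names applied during cleaning (e.g., `Country`, `Documents`, `CNCI`, `Year`).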

| Column Name | Description | Data Type |
| :--- | :--- | :--- |
| **Country** | Name of the Country | `String` |
| **Documents** | Total count of research papers published by the country | `Integer` |
| **Times Cited** | Total number of citations received by the published papers | `Integer` |
| **Collab-CNCI** | Category Normalized Citation Impact score for collaborative papers only | `Float` |
| **Rank** | Ranking position of the country | `Integer` |
| **% Docs Cited** | Percentage of documents that have received at least one citation | `Float` |
| **CNCI** | Impact score normalized by subject, year, and type (1.0 = World Average) | `Float` |
| **% Documents in Top 1%** | Percentage of papers that are in the global top 1% of most cited papers | `Float` |
| **% Documents in Top 10%** | Percentage of papers that are in the global top 10% of most cited papers | `Float` |
| **Documents in Top 1%** | Absolute count of papers in the global top 1% | `Integer` |
| **Documents in Top 10%** | Absolute count of papers in the global top 10% | `Integer` |
| **Year** | The specific year of publication for the data record | `Integer` |
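
A data dictionary like this can also be checked against the DataFrame programmatically. The sketch below is one possible approach, not something the notebook does: `check_schema` and `TYPE_CHECKS` are hypothetical names, and only the numeric labels (`Integer`, `Float`) are handled.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype

# Maps the dictionary's type labels to pandas dtype checks (numeric labels only).
TYPE_CHECKS = {'Integer': is_integer_dtype, 'Float': is_float_dtype}


def check_schema(frame, expected):
    """Return a list of readable mismatches between `frame` and `expected`."""
    problems = []
    for column, label in expected.items():
        if column not in frame.columns:
            problems.append(f"{column}: missing")
        elif not TYPE_CHECKS[label](frame[column]):
            problems.append(f"{column}: expected {label}, got {frame[column].dtype}")
    return problems


# Toy frame, invented for illustration.
toy = pd.DataFrame({'Documents': [120, 340], 'CNCI': [1.12, 0.98]})
matching = check_schema(toy, {'Documents': 'Integer', 'CNCI': 'Float'})
mismatched = check_schema(toy, {'Documents': 'Float', 'Year': 'Integer'})
print(matching)
print(mismatched)
```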

<div class="alert alert-block alert-info">
<b>üßê Initial Observations:</b>
<ul>
    <li>The dataset contains <b>1000</b> rows and <b>12</b> columns.</li>
    <li>There are <b>no missing values</b> in the dataset.</li>
    <li>All <b>Data Types</b> appear to be correct (Numerical columns are recognized properly).</li>
    <li><b>Outliers:</b> The 'Times Cited' column contains <b>7</b> values flagged as outliers by the IQR rule; no other column has any.</li>
    <li><b>Data Consistency Issue:</b>
        <ul>
            <li>The dataset lists both <i>'United Kingdom'</i> and <i>'England'</i> separately.</li>
            <li><b>Duplicate Entries per Year:</b> There are multiple rows for the same Country in the same Year, which requires aggregation.</li>
        </ul>
    </li>
</ul>
</div>

## 3. Data Preprocessing & Cleaning <a id="cleaning"></a>

Before proceeding to analysis, we conducted a rigorous data quality check. While the dataset contained no missing values, we identified inconsistencies in country naming and redundancy in annual records that required intervention.

#### 1. Renaming Columns and Decimal Precision (Up to 2 decimal places)

In [11]:
# 1. Renaming Columns for better readability
df.rename(columns={
    'Name': 'Country',
    'Category Normalized Citation Impact': 'CNCI',
    'Web of Science Documents': 'Documents',
    'year': 'Year'
}, inplace=True)

# 2. Precision Formatting
df = df.round({
    'Collab-CNCI': 2,
    'CNCI': 2,
    '% Docs Cited': 2,
    '% Documents in Top 1%': 2,
    '% Documents in Top 10%': 2
})

print("Columns Renamed and Precision Set.")
display(df.head(3))

Columns Renamed and Precision Set.


Unnamed: 0,Country,Documents,Times Cited,Collab-CNCI,Rank,% Docs Cited,CNCI,% Documents in Top 1%,% Documents in Top 10%,Documents in Top 1%,Documents in Top 10%,Year
0,SWITZERLAND,24154,2705248,0.95,8,97.93,1.02,0.89,10.87,97,230,2023
1,CHINA,2185,157320,1.58,44,99.6,0.9,2.98,19.26,323,121,2014
2,CHINA,6896,744768,1.03,42,95.23,1.68,1.08,11.36,455,662,2013


#### 2. Aggregating Common Country (UK and England)

In [12]:
# Check Specific Year data of England and UK
target_year = 2017

# Filter data for England & UK for that specific year
specific_check = df[
    (df['Country'].isin(['ENGLAND', 'UNITED KINGDOM'])) & 
    (df['Year'] == target_year)
]

print(f"Data Comparison for Year: {target_year}")
display(specific_check)

# As we can see, the England and UK rows are not duplicates of each other, so we relabel ENGLAND as UNITED KINGDOM.

Data Comparison for Year: 2017


Unnamed: 0,Country,Documents,Times Cited,Collab-CNCI,Rank,% Docs Cited,CNCI,% Documents in Top 1%,% Documents in Top 10%,Documents in Top 1%,Documents in Top 10%,Year
21,ENGLAND,19860,2919420,1.21,40,95.67,1.17,0.78,23.87,487,1734,2017
53,UNITED KINGDOM,25789,2811001,1.06,3,96.24,1.22,1.93,21.11,465,928,2017
458,ENGLAND,14562,1266894,1.25,16,99.05,0.93,1.55,12.52,246,1317,2017
469,ENGLAND,29654,3232286,1.56,44,96.54,1.1,2.24,15.46,397,1881,2017
479,ENGLAND,17013,1344027,1.33,32,99.17,1.06,2.84,20.34,138,1006,2017
664,UNITED KINGDOM,5051,136377,1.16,7,95.94,0.98,2.78,22.31,457,347,2017


In [13]:
# 1. Rename 'ENGLAND' to 'UNITED KINGDOM'
df.loc[df['Country'] == 'ENGLAND', 'Country'] = 'UNITED KINGDOM'  # <-- .loc with a boolean mask selects the matching rows

print("Successfully changed all 'ENGLAND' entries to 'UNITED KINGDOM'.")

# 2. Verification
uk_rows = df[(df['Country'] == 'UNITED KINGDOM') & (df['Year'] == target_year)]

print(f"\nTotal Rows for United Kingdom in {target_year}: {len(uk_rows)}")
display(uk_rows)

Successfully changed all 'ENGLAND' entries to 'UNITED KINGDOM'.

Total Rows for United Kingdom in 2017: 6


Unnamed: 0,Country,Documents,Times Cited,Collab-CNCI,Rank,% Docs Cited,CNCI,% Documents in Top 1%,% Documents in Top 10%,Documents in Top 1%,Documents in Top 10%,Year
21,UNITED KINGDOM,19860,2919420,1.21,40,95.67,1.17,0.78,23.87,487,1734,2017
53,UNITED KINGDOM,25789,2811001,1.06,3,96.24,1.22,1.93,21.11,465,928,2017
458,UNITED KINGDOM,14562,1266894,1.25,16,99.05,0.93,1.55,12.52,246,1317,2017
469,UNITED KINGDOM,29654,3232286,1.56,44,96.54,1.1,2.24,15.46,397,1881,2017
479,UNITED KINGDOM,17013,1344027,1.33,32,99.17,1.06,2.84,20.34,138,1006,2017
664,UNITED KINGDOM,5051,136377,1.16,7,95.94,0.98,2.78,22.31,457,347,2017
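
If more naming variants turn up later, the same relabeling can be driven by a lookup table instead of one `.loc` assignment per name. This is a minimal sketch on made-up rows; `COUNTRY_ALIASES` is a hypothetical name, and only the ENGLAND entry comes from this notebook.

```python
import pandas as pd

# Hypothetical alias table; extend it only with variants actually seen in the data.
COUNTRY_ALIASES = {'ENGLAND': 'UNITED KINGDOM'}

# Toy rows, invented for illustration.
toy = pd.DataFrame({'Country': ['ENGLAND', 'UNITED KINGDOM', 'INDIA']})

# Series.replace maps every key found in the column to its canonical name.
toy['Country'] = toy['Country'].replace(COUNTRY_ALIASES)
print(toy['Country'].tolist())
```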


#### 3. Checking Outliers in Dataset

In [14]:
#Checking Outliers using IQR in Times Cited
col_name = 'Times Cited'

Q1 = df[col_name].quantile(0.25)
Q3 = df[col_name].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Find the rows that are outliers
outliers = df[(df[col_name] < lower_bound) | (df[col_name] > upper_bound)]

outliers


# Here, we can see that the outliers are not errors; they only reflect the elite performance of those countries.

Unnamed: 0,Country,Documents,Times Cited,Collab-CNCI,Rank,% Docs Cited,CNCI,% Documents in Top 1%,% Documents in Top 10%,Documents in Top 1%,Documents in Top 10%,Year
34,SWITZERLAND,28337,4080528,1.47,22,96.23,0.93,1.26,18.06,110,2411,2025
552,JAPAN,27698,4043908,1.18,48,99.06,1.56,2.27,11.58,125,546,2018
560,JAPAN,27450,4007700,1.14,24,96.51,1.09,0.68,21.16,170,2431,2017
632,CHINA,29594,4024784,1.56,7,97.61,1.22,0.92,24.66,206,303,2007
657,CANADA,29353,4226832,0.94,39,99.89,1.11,1.2,23.96,106,1993,2004
861,GERMANY,29241,4327668,1.34,13,96.6,1.5,0.7,10.73,471,739,2015
945,USA,29430,4149630,1.21,49,99.52,1.51,2.33,20.64,136,296,2006


#### 4. Aggregating Repeated Countries

In [15]:
#Aggregating Repeated Countries.

# Logic:
# Quantitative columns -> SUM 
# Quality/Percentage columns -> MEAN 
# Rank -> MIN (because rank 1 is better than rank 10)

agg_rules = {
    'Documents': 'sum', 
    'Times Cited': 'sum',
    'Documents in Top 1%': 'sum',
    'Documents in Top 10%': 'sum',
    'CNCI': 'mean',
    'Collab-CNCI': 'mean',
    '% Docs Cited': 'mean',
    '% Documents in Top 1%': 'mean',
    '% Documents in Top 10%': 'mean',
    'Rank': 'min' # <-- to get better rank
}

# Compressing data by Groupby 
df_clean = df.groupby(['Country', 'Year'], as_index=False).agg(agg_rules) 

# Rounding off again after mean calculation to keep it clean
df_clean = df_clean.round(2)

print(f"Aggregation Complete. Dataset shape changed from {df.shape} to {df_clean.shape}") # <-- to compare the shape of tables.
display(df_clean.head(30))

# From here on, the analysis uses the aggregated dataset
df = df_clean.copy()

Aggregation Complete. Dataset shape changed from (1000, 12) to (340, 12)


Unnamed: 0,Country,Year,Documents,Times Cited,Documents in Top 1%,Documents in Top 10%,CNCI,Collab-CNCI,% Docs Cited,% Documents in Top 1%,% Documents in Top 10%,Rank
0,AUSTRALIA,2003,73479,3965411,1952,7645,1.39,1.33,96.88,1.76,13.41,1
1,AUSTRALIA,2004,34122,4026396,561,2594,1.47,1.34,97.88,2.02,11.29,1
2,AUSTRALIA,2005,76888,4458568,1177,8504,1.15,1.34,97.31,1.83,16.21,7
3,AUSTRALIA,2006,69315,5190781,1369,6318,1.24,1.2,96.39,1.47,19.8,12
4,AUSTRALIA,2007,11637,1443993,715,764,1.36,1.35,98.12,1.68,14.6,24
5,AUSTRALIA,2008,124250,9851009,1875,7333,1.37,1.1,96.71,2.48,20.33,5
6,AUSTRALIA,2009,22569,767346,87,2861,1.4,1.08,99.72,2.82,21.41,21
7,AUSTRALIA,2010,39018,2236904,579,4933,1.25,1.24,96.43,2.31,16.38,27
8,AUSTRALIA,2011,8580,909480,209,186,1.69,1.21,95.52,1.61,10.72,19
9,AUSTRALIA,2012,64809,4733799,1536,7240,1.16,1.32,97.47,1.8,20.84,8


<div class="alert alert-block alert-success">
<b>✅ Data Readiness Summary:</b>

The data preprocessing phase is complete. Here is the summary of actions taken:
<ul>
    <li><b>Standardization:</b> Column names normalized (e.g., 'Name' -> 'Country', 'Category Normalized...' -> 'CNCI').</li>
    <li><b>Precision:</b> All numerical impact metrics rounded to 2 decimal places.</li>
    <li><b>Entity Resolution:</b> 'England' entries were replaced with 'United Kingdom' to ensure consistent country-level analysis.</li>
    <li><b>Data Aggregation:</b> Multiple entries for the same Country-Year combination were aggregated. Volume metrics (e.g., Documents, Times Cited) were <b>summed</b>, quality metrics (e.g., CNCI, % Docs Cited) were <b>averaged</b>, and the best (minimum) <b>Rank</b> was kept, giving a single row per country per year.</li>
    <li><b>Outlier Decision:</b> The 'Times Cited' column shows significant outliers. These are <b>not errors</b> but represent "Elite Performance". Removing them would hide the most important insights regarding global research leadership. Hence, <b>we have retained all outliers.</b></li>
</ul>
We are now ready for <b>Exploratory Data Analysis (EDA).</b>
</div>
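
One design choice in the aggregation is worth spelling out: a plain mean gives every row the same weight, regardless of how many documents it represents. The sketch below contrasts that with a document-weighted mean on made-up rows. The weighted variant is an alternative shown for comparison, not what this notebook uses, and `weighted_cnci` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Toy rows, invented for illustration: one country-year with very different volumes.
toy = pd.DataFrame({
    'Country': ['INDIA', 'INDIA'],
    'Year': [2020, 2020],
    'Documents': [1000, 9000],
    'CNCI': [2.0, 1.0],
})


def weighted_cnci(group):
    # Weight each row's CNCI by its document count.
    return np.average(group['CNCI'], weights=group['Documents'])


grouped = toy.groupby(['Country', 'Year'])
simple = grouped['CNCI'].mean()                                # as in the notebook
weighted = grouped[['CNCI', 'Documents']].apply(weighted_cnci)  # alternative

print(f"Simple mean:   {simple.iloc[0]:.2f}")
print(f"Weighted mean: {weighted.iloc[0]:.2f}")
```

When the repeated rows have similar document counts the two values stay close, so the simple mean is a reasonable default; the gap grows as the volumes diverge.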

## 4. Exploratory Data Analysis (EDA) <a id="eda"></a>
Here, we dive deep into the data to answer specific questions and uncover patterns.

### 4.1 Univariate Analysis

#### Q1 Top Performers Analysis: Quantity vs. Quality

**Objective:**
To identify the leading Countries in the dataset based on two distinct performance metrics:

1.  **Volume (Quantity):** Who is producing the most research?
    *   *Metric used:* `Documents`
2.  **Impact (Quality):** Who is producing the most influential research relative to their field?
    *   *Metric used:* `Category Normalized Citation Impact (CNCI)`

In [16]:
# --- Step 1: Prepare the Data ---
# Quantity(Documents) : Sum of all documents across years for each Country
volume_df = df.groupby('Country')['Documents'].sum().reset_index()
top_volume = volume_df.sort_values(by='Documents', ascending=False).head(15)
# Quality(CNCI) : Average CNCI across years for each Country
quality_df = df.groupby('Country')['CNCI'].mean().reset_index()
top_quality = quality_df.sort_values(by='CNCI', ascending=False).head(10)


# --- Step 2: Visualization - Volume (Quantity) ---
fig_vol = px.bar(
    top_volume,
    x='Documents',
    y='Country',
    orientation='h',  # <-- Horizontal Bar Chart
    title='<b>Top 15 Countries by Research Volume</b><br><i>(Total Web of Science Documents)</i>',
    text_auto='.2s',  # <-- Format: 1.5k, 2M 
    color='Documents',
    color_continuous_scale='Viridis',
    labels={'Documents': 'Total Documents', 'Country': 'Country Names'}
)
# Layout Updates
fig_vol.update_layout(
    yaxis=dict(autorange="reversed"), # <-- Reverse Y-axis to show Rank #1 at top
    coloraxis_showscale=False # <-- Hides the color bar to keep it clean
)
fig_vol.show()


# --- Step 3: Visualization - Quality (Impact) ---
fig_qual = px.bar(
    top_quality,
    x='CNCI',
    y='Country',
    orientation='h',
    title='<b>Top 10 Countries by Research Quality</b><br><i>(Average Category Normalized Citation Impact)</i>',
    text_auto='.3f',  # <-- Format: 3 decimal places (e.g., 1.523)
    color='CNCI',
    color_continuous_scale='Magma',
    labels={'CNCI': 'Avg CNCI', 'Country': 'Country Names'}
)
# Layout Updates: Reverse Y-axis & Add Benchmark Line
fig_qual.update_layout(
    yaxis=dict(autorange="reversed"), # <-- Reverse Y-axis to show Rank #1 at top
    coloraxis_showscale=False # <-- Hides the color bar to keep it clean
)
# Adding the Global Average Benchmark Line (x=1.0)
fig_qual.add_vline(
    x=1.0, 
    line_dash="dash", 
    line_color="green", 
    annotation_text="Global Avg (1.0)" 
)
fig_qual.show()

<div class="alert alert-block alert-warning">
<b>üí° Insight:</b> The analysis reveals a distinct separation between <b>Quantity</b> and <b>Quality</b> leaders:
<ul>
    <li><b>Volume Leaders ("Mass Producers"):</b> Countries like <b>United Kingdom</b> and <b>Spain</b> dominate in total <code>Documents</code>, indicating massive research scale.</li>
    <li><b>Quality Leaders ("Elite Impact"):</b> However, the <code>CNCI</code> rankings are topped by nations like <b>Japan</b> and <b>Italy</b>, showing that high volume does not always guarantee high average impact.</li>
</ul>
This suggests a potential trade-off between scaling research output and maintaining elite citation performance.
</div>
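
Because the number of countries shown is set in `.head(n)` and repeated by hand in the chart title, the two can drift apart. One way to keep them in sync is to derive the title from the ranked frame itself. This is a sketch on made-up rows; `top_n` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd


def top_n(frame, metric, how, n):
    """Aggregate `metric` per country; return the top `n` rows and a matching title."""
    ranked = (
        frame.groupby('Country')[metric]
        .agg(how)
        .sort_values(ascending=False)
        .head(n)
        .reset_index()
    )
    # The title uses the actual row count, so it cannot disagree with the chart.
    title = f"Top {len(ranked)} Countries by {metric}"
    return ranked, title


# Toy rows, invented for illustration.
toy = pd.DataFrame({
    'Country': ['A', 'A', 'B', 'C'],
    'Documents': [10, 20, 50, 5],
})
ranked, title = top_n(toy, 'Documents', 'sum', n=2)
print(title)
print(ranked)
```

The returned `ranked` frame and `title` could then be passed straight to `px.bar(ranked, ..., title=title)`.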

#### Q2 Ranges & Distributions Analysis: Research Relevance

**Objective:**
To analyze the spread and consistency of research relevance to understand how effectively published papers attract attention. This helps us determine if "getting cited" is a common standard or a rare achievement in this dataset.

1.  **Spread & Range:** What are the boundaries of performance?
    *   *Metric used:* `% Docs Cited` (Minimum vs. Maximum values).
2.  **Distribution Shape:** How is the performance distributed across the dataset?
    *   *Left Skewed:* Indicates most countries achieve high citation rates (Good sign).
    *   *Right Skewed:* Indicates most countries have low citation rates (Bad sign).
3.  **Consistency:** Is the citation rate stable across different countries and years, or are there massive disparities in performance?
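
The sign convention for skewness can be checked on a few made-up series before reading the real result. In this sketch, a negative value corresponds to the left-skewed case and a positive value to the right-skewed case; the numbers are invented for illustration.

```python
import pandas as pd

# Toy series, invented to show the sign of pandas' sample skewness.
left_skewed = pd.Series([90, 97, 98, 98, 99, 99, 99])   # long tail toward low values
symmetric = pd.Series([96, 97, 98, 99, 100])            # mirror image around 98
right_skewed = pd.Series([95, 95, 95, 96, 96, 97, 99])  # long tail toward high values

for name, series in [('left', left_skewed), ('symmetric', symmetric), ('right', right_skewed)]:
    print(f"{name:>9}: skew = {series.skew():.2f}")
```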

In [17]:
# --- Step 1: Calculate Statistics ---
mean_val = df['% Docs Cited'].mean()
median_val = df['% Docs Cited'].median()

# --- Step 2: Create Histogram with Marginal Box Plot ---
fig_dist = px.histogram(
    df, 
    x='% Docs Cited', 
    nbins=50, # <-- how many bars
    title='Distribution of Research Relevance (% Docs Cited)',
    marginal='box',  # <-- Adds a box plot above the histogram
    color_discrete_sequence=['#00CC96'], 
    opacity=0.8
)

# --- Step 3: Add Reference Lines & Annotations ---
# Add vertical lines (Without text here to avoid duplication on the marginal plot)
fig_dist.add_vline(x=mean_val, line_dash="dash", line_color="red")
fig_dist.add_vline(x=median_val, line_dash="dot", line_color="blue")

# Add Mean Annotation (Placed relative to paper/layout height)
fig_dist.add_annotation(
    x=mean_val,
    y=1.15,          # <-- placed higher to avoid overlap
    yref="paper",    # <-- references the figure layout, not data points
    text=f"Mean: {mean_val:.1f}%", 
    showarrow=False,
    font=dict(color="red")
)

# Add Median Annotation
fig_dist.add_annotation(
    x=median_val,
    y=1.08,          # <-- placed slightly lower than Mean
    yref="paper",    # <-- position relative to chart layout not data points
    text=f"Median: {median_val:.1f}%",
    showarrow=False,
    font=dict(color="blue")
)

# --- Step 4: Final Layout Adjustments ---
fig_dist.update_layout(
    xaxis_title='Percentage of Documents Cited',
    yaxis_title='Count',
    bargap=0.1
)

fig_dist.show()

# --- Step 5: Print Statistics ---
print("Statistical Summary of % Docs Cited:")
print(df['% Docs Cited'].describe())
print("-" * 30)
print(f"Skewness: {df['% Docs Cited'].skew():.2f}") 


Statistical Summary of % Docs Cited:
count    340.000000
mean      97.404941
std        0.991035
min       95.060000
25%       96.780000
50%       97.390000
75%       98.025000
max       99.860000
Name: % Docs Cited, dtype: float64
------------------------------
Skewness: 0.05


<div class="alert alert-block alert-warning">
<b>üí° Insight:</b> The analysis reveals a landscape of remarkable symmetry and consistency in research relevance (<code>% Docs Cited</code>).
<ul>
    <li><b>A Symmetrically High Performance:</b> The distribution shows almost no skew (<b>0.05</b>), and the Mean (97.40%) and Median (97.39%) are nearly identical. This indicates a roughly symmetric, bell-shaped curve centered at a very high value (approx. 97.4%).</li>
    <li><b>"Getting Cited" is a Baseline, Not a Differentiator:</b> The tight spread (standard deviation of about 0.99) and the high average indicate that simply getting papers cited is a "table stakes" metric in this dataset. There is no long tail of underperformers.</li>
</ul>
This finding implies that to truly differentiate between nations, we must look beyond this metric and focus on more discerning indicators like <b>Category Normalized Citation Impact (CNCI)</b> and the conversion to <b>Top 1%</b> documents.
</div>

#### Q3. Benchmarks Analysis: Performance vs. Global Standards

**Objective:**
To evaluate national research performance against the established **Global Benchmark** to identify systemic strengths or weaknesses. This analysis helps distinguish between nations that consistently deliver "Elite" quality versus those that struggle to meet global expectations.

1.  **The Yardstick (Metric Definition):** How do we measure success?
    *   *Metric used:* `Category Normalized Citation Impact` (CNCI).
    *   **CNCI > 1.0:** Indicates research is performing **better** than the world average.
    *   **CNCI < 1.0:** Indicates research is performing **worse** than the world average.

2.  **Structural Strength (Aggregated View):** Are there any nations that are fundamentally weak?
    *   *Method:* Grouping data by `Country` to see if any nation's *average* performance falls below 1.0.

3.  **Consistency Check (Granular View):** How stable is the quality over time?
    *   *Method:* Analyzing individual year-wise data points to spot specific instances (anomalies) where a high-performing country might have had an "off-year" (CNCI < 1.0).
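
The difference between the aggregated and granular views can be sketched on made-up rows: a country can sit above the benchmark on average while still having a single year below it. The `BENCHMARK` constant and the toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

BENCHMARK = 1.0  # CNCI value that represents the world average

# Toy rows, invented for illustration.
toy = pd.DataFrame({
    'Country': ['A', 'A', 'B', 'B'],
    'Year': [2020, 2021, 2020, 2021],
    'CNCI': [1.4, 0.9, 1.2, 1.1],
})

# Aggregated view: average CNCI per country across all years.
country_mean = toy.groupby('Country')['CNCI'].mean()
below_on_average = country_mean[country_mean < BENCHMARK]

# Granular view: individual country-year rows below the benchmark.
toy['Benchmark Status'] = np.where(toy['CNCI'] < BENCHMARK, 'Below', 'At or above')
dips = toy[toy['CNCI'] < BENCHMARK]

print(f"Countries below benchmark on average: {len(below_on_average)}")
print(f"Individual country-year dips: {len(dips)}")
```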

In [18]:
# --- Step 1: The Big Picture (Grouped Analysis) ---
# Goal: Check if any country fails the benchmark when averaged over all years.
country_perf = df.groupby('Country')['CNCI'].mean()
countries_below_benchmark = country_perf[country_perf < 1.0]

print(f"--- Global Benchmark Analysis (CNCI = 1.0) ---")
print(f"Number of countries with Overall Average CNCI < 1.0: {len(countries_below_benchmark)}")
print("Observation: Overall, structurally, every country maintains a high standard (> 1.0).")
print("-" * 50)

# --- Step 2: The Reality Check (Granular Analysis) ---
# Goal: Identify specific years where performance dipped below average.
below_avg_instances = df[df['CNCI'] < 1.0]

print(f"However, examining granular data (Year-by-Year):")
print(f"We found {len(below_avg_instances)} specific instances (dots) of underperformance out of {len(df)} records.")

# --- Step 3: Visualizing the Nuance ---
# We create a temporary column to define colors: Red for 'Below', Grey for 'Above'
df['Benchmark Status'] = df['CNCI'].apply(lambda x: 'Below Average (< 1.0)' if x < 1.0 else 'Above Average (>= 1.0)')

# Create a Strip Plot (Jitter Plot)
fig_strip = px.strip(
    df, 
    y='CNCI', 
    x='Country', 
    color='Benchmark Status', # <-- this will add color mapping
    color_discrete_map={
        'Below Average (< 1.0)': 'red',  # <-- red for below average
        'Above Average (>= 1.0)': 'lightgrey' # <-- grey for above average
    },
    hover_data=['Country', 'Year', 'CNCI'], # <-- show data on hover
    title='Performance Distribution: Elite Status vs. Occasional Dips',
    template='plotly_white', # <-- use the plotly_white template (white background)
    labels={'CNCI': 'CNCI (Impact)'}
)

# Add a horizontal line representing the Global Average
fig_strip.add_hline(
    y=1.0, 
    line_dash="dash", 
    line_color="black", 
    annotation_text="Global Baseline (1.0)",
    annotation_position="bottom right"
)

fig_strip.show()

--- Global Benchmark Analysis (CNCI = 1.0) ---
Number of countries with Overall Average CNCI < 1.0: 0
Observation: Overall, structurally, every country maintains a high standard (> 1.0).
--------------------------------------------------
However, examining granular data (Year-by-Year):
We found 14 specific instances (dots) of underperformance out of 340 records.


<div class="alert alert-block alert-warning">
<b>üí° Insight:</b> The benchmark analysis highlights the <b>robustness and consistency</b> of the nations in this dataset:
<ul>
    <li><b>Structural Strength (Aggregated View):</b> Remarkably, when averaged over time, <b>every single country</b> maintains a <code>CNCI</code> above <b>1.0</b>. This indicates there are no structurally weak performers in this group; all are producing research that exceeds the global average.</li>
    <li><b>Rare Anomalies (Granular View):</b> "Underperformance" is the exception, not the rule. Out of 340 country-year records, there are only <b>14 specific instances</b> where the <code>CNCI</code> dipped below 1.0.</li>
</ul>
This suggests that while even top nations have occasional "off-years," their long-term research trajectory is consistently <b>above the global standard</b>.
</div>

#### Q4 Outlier Detection Analysis

**Objective:**
To identify data points that deviate significantly from the statistical norm, aiming to distinguish between potential data errors and genuine "Elite Performers" or "Mass Producers" that define the upper limits of research performance.

1.  **Volume Outliers ("Mass Producers"):** Which countries or years are producing a volume of documents that far exceeds the global distribution?
    *   *Metric used:* `Documents` (renamed from `Web of Science Documents`; identifying volume extremes)
2.  **Impact Outliers ("Elite Performers"):** Are there specific instances (Country or Year) where the Citation Impact (CNCI) is exceptionally high or suspiciously low?
    *   *Metric used:* `Category Normalized Citation Impact (CNCI)` (Identifying Quality Anomalies)
3.  **Trend Consistency:** Do these outliers represent a consistent trend for a specific country, or are they isolated, one-off events?
    *   *Focus:* Analyzing the persistence of outliers across the `Year` column.
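
The trend-consistency question can be sketched as a per-country summary of how often, and over which years, a country shows up among the outliers. The toy rows and the "two or more appearances" threshold are assumptions for illustration, not rules taken from the notebook.

```python
import pandas as pd

# Toy outlier rows, invented for illustration.
toy_outliers = pd.DataFrame({
    'Country': ['A', 'A', 'A', 'B'],
    'Year': [2019, 2020, 2021, 2015],
})

# Count appearances and the first/last outlier year per country.
persistence = toy_outliers.groupby('Country')['Year'].agg(['count', 'min', 'max'])

# Hypothetical threshold: two or more outlier years counts as a recurring pattern.
persistence['label'] = persistence['count'].apply(
    lambda c: 'recurring' if c >= 2 else 'isolated'
)
print(persistence)
```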

In [19]:
# Part 1: Outlier in Quantity (Documents)

# --- Step 1.1 Visualization ---
# Creating a Box Plot to visualize the spread and spot dots outside the whiskers
fig_docs = px.box(
    df, 
    y="Documents",
    points="all",  # <-- display every data dots next to box plot
    hover_data=["Country", "Year", "CNCI"], # <-- also add CNCI on hover for compare
    title="<b>Outlier Detection: Research Volume (Documents)</b>",
    color_discrete_sequence=['#EF553B'] # <-- Red color theme
)

fig_docs.update_layout(
    yaxis_title="Number of Documents",
    xaxis_title="Global Distribution",
    template="plotly_white",  # <-- white background template
    height=600 # <-- give height to the box plot
)

fig_docs.show()

# --- Step 1.2 Outlier Table Generation (IQR Method) ---
# Calculating the Interquartile Range (IQR) to mathematically identify outliers
Q1_doc = df['Documents'].quantile(0.25)
Q3_doc = df['Documents'].quantile(0.75)
IQR_doc = Q3_doc - Q1_doc   # <-- IQR Range (Q3 - Q1)
upper_bound_doc = Q3_doc + 1.5 * IQR_doc

# Filtering rows that exceed the upper bound
doc_outliers = df[df['Documents'] > upper_bound_doc]
doc_outliers = doc_outliers[['Country', 'Year', 'Documents']] # <-- only show these three columns

# Checking if outliers exist before displaying
if not doc_outliers.empty:
    print(f"\n[Table] Found {len(doc_outliers)} Outliers in Research Volume (Documents):")
    # Sorting by Documents to show the biggest outliers first
    display(doc_outliers.sort_values(by='Documents', ascending=False).head(10).style.background_gradient(cmap='Reds'))
else:
    print("\n[Result] No statistical outliers found in Documents.")

# Check consistency: Count how many times a country appears in the outlier list
if not doc_outliers.empty:
    print("\n--- Consistency Check (Is it a Trend?) ---")
    print(doc_outliers['Country'].value_counts())



# PART 2: OUTLIERS IN QUALITY (CNCI)

# --- Step 2.1 Visualization ---
# Box Plot for Quality Metrics
fig_cnci = px.box(
    df, 
    y="CNCI",
    points="all",
    hover_data=["Country", "Year", "Documents"], # <-- also show Documents on hover for comparison
    title="<b>Outlier Detection: Research Quality (CNCI)</b>",
    color_discrete_sequence=['#00CC96'] # <-- Green color theme
)

# Adding a reference line for Global Average (1.0)
fig_cnci.add_hline(y=1.0, line_dash="dash", line_color="gray", annotation_text="Global Average (1.0)")

fig_cnci.update_layout(
    yaxis_title="CNCI Score",
    xaxis_title="Global Distribution",
    template="plotly_white",
    height=600
)

fig_cnci.show()

# --- Step 2.2 Outlier Table Generation (IQR Method) ---
# Calculating IQR for CNCI
Q1_cnci = df['CNCI'].quantile(0.25)
Q3_cnci = df['CNCI'].quantile(0.75)
IQR_cnci = Q3_cnci - Q1_cnci #<-- IQR Range (Q3 - Q1)
upper_bound_cnci = Q3_cnci + 1.5 * IQR_cnci
lower_bound_cnci = Q1_cnci - 1.5 * IQR_cnci # <-- Identifying extremely low quality too

# Filtering rows outside the bounds
cnci_outliers = df[(df['CNCI'] > upper_bound_cnci) | (df['CNCI'] < lower_bound_cnci)] # <-- build the mask from the same DataFrame being filtered
cnci_outliers = cnci_outliers[['Country', 'Year', 'CNCI', 'Documents']] #<-- only show these four columns

# Checking if outliers exist before displaying
if not cnci_outliers.empty:
    print(f"\n[Table] Found {len(cnci_outliers)} Outliers in Research Quality (CNCI):")
    display(cnci_outliers.sort_values(by='CNCI', ascending=False).head(10).style.background_gradient(cmap='Greens'))
else:
    print("\n[Result] No statistical outliers found in CNCI.")


[Table] Found 6 Outliers in Research Volume (Documents):


            Country  Year  Documents
149           ITALY  2004     149432
66            CHINA  2007     148329
310  UNITED KINGDOM  2016     134985
228     SOUTH KOREA  2023     127999
5         AUSTRALIA  2008     124250
27           BRAZIL  2012     119368



--- Consistency Check (Is it a Trend?) ---
Country
AUSTRALIA         1
BRAZIL            1
CHINA             1
ITALY             1
SOUTH KOREA       1
UNITED KINGDOM    1
Name: count, dtype: int64



[Result] No statistical outliers found in CNCI.


<div class="alert alert-block alert-warning">
<b>💡 Insight:</b> The outlier analysis highlights a fundamental difference in how <b>Quantity</b> and <b>Quality</b> are distributed across the research landscape:
<ul>
    <li><b>Volume Extremes ("Mass Producers"):</b> The <code>Documents</code> distribution is highly skewed, with six records above the upper IQR fence (Q3 + 1.5 × IQR). <b>Italy</b> (2004) and <b>China</b> (2007) top the list, indicating that these specific years produced output far beyond the typical range.</li>
    <li><b>Quality Consistency ("The Quality Ceiling"):</b> In contrast, the <code>CNCI</code> distribution shows <b>no statistical outliers</b> under the same rule (despite a slightly elongated upper whisker). This suggests that while research <i>volume</i> can be scaled up sharply, <i>citation impact</i> tends to remain within a predictable range, making "abnormally high" quality scores hard to achieve.</li>
    <li>
<b>Outliers are Isolated Events, Not Trends:</b> Each of the six outlier records belongs to a different country, so extreme volume appears as a <b>one-off event</b> in a specific year rather than a persistent, year-over-year trend. These spikes may reflect temporary factors (such as policy shifts or project completions) rather than a sustained state of overproduction.
</li>
</ul>
This implies that a sharp one-year jump in publication count is more attainable than a statistically extreme <b>Impact Score</b>, which is significantly harder and rarer to achieve.
</div>
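
The IQR rule used in this outlier check can be wrapped in a small helper, which also makes it easy to see whether a point is still flagged once a skewed count variable is log-transformed. This is only a minimal sketch: the `iqr_bounds` helper and the toy values are hypothetical and are not part of the notebook or the real dataset.

```python
import numpy as np
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) Tukey fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy, right-skewed sample (illustrative values only, not the real dataset)
docs = pd.Series([10, 12, 13, 15, 16, 18, 20, 22, 25, 200])

lower, upper = iqr_bounds(docs)
outliers = docs[(docs < lower) | (docs > upper)]

# A positive skewness value indicates a long right tail
skewness = docs.skew()

# Apply the same rule to log-transformed values, a common check for skewed counts
log_docs = np.log10(docs)
log_lower, log_upper = iqr_bounds(log_docs)
log_outliers = docs[(log_docs < log_lower) | (log_docs > log_upper)]

print(f"Fences (raw scale): [{lower:.2f}, {upper:.2f}]")
print(f"Outliers (raw scale): {outliers.tolist()}")
print(f"Skewness: {skewness:.2f}")
print(f"Outliers (log scale): {log_outliers.tolist()}")
```

If a record is flagged on both scales, it is extreme even after accounting for the skew of the distribution, which is a slightly stronger statement than the raw-scale rule alone.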

#### Q5 Funnel Check: Elite Research Conversion (% Documents in Top 1%)

**Objective:**
To evaluate the "elite research conversion rate" by quantifying the percentage of documents that achieve "Top 1%" status. This analysis benchmarks performance against the theoretical global average (1%) to identify nations that are not just good, but exceptionally impactful, and to understand the overall distribution of research excellence.

1.  **Benchmark Performance:** How does the dataset's average conversion rate compare to the theoretical global baseline of 1%?
    *   *Metric used:* Mean and Median of `% Documents in Top 1%`.
2.  **Distribution Shape:** What is the overall distribution of elite research performance? Is it symmetric, or skewed towards a few high-performing instances?
    *   *Focus:* Analyzing the Histogram and Box Plot for skewness and spread.
3.  **Identifying Consistent Leaders:** Which nations consistently surpass the "Exceptional Performance" threshold (>2%) across multiple years?
    *   *Focus:* Counting the frequency of each country's appearance in the >2% elite group.
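
A raw frequency count can favour countries that simply have more years of data, so one possible refinement of point 3 is to express the count as a share of the years observed for each country. The sketch below uses the notebook's column names on made-up values; the `toy` frame and the `summary` column names are assumptions for illustration only.

```python
import pandas as pd

# Toy data (made-up values); country "B" has fewer years than country "A"
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "A", "B", "B"],
    "Year": [2003, 2004, 2005, 2006, 2003, 2004],
    "% Documents in Top 1%": [2.1, 1.9, 2.3, 2.2, 2.5, 1.5],
})

# Flag country-years above the 2% "exceptional" threshold
is_elite = toy["% Documents in Top 1%"] > 2.0

# Per country: elite years, total years observed, and the share of elite years
summary = (
    toy.assign(is_elite=is_elite)
    .groupby("Country")["is_elite"]
    .agg(years_elite="sum", years_observed="count", share_elite="mean")
)

print(summary)
```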

In [31]:
# --- Step 1: Calculate Statistics ---
# Calculate the mean, median, and max to understand the central tendency of the data
avg_elite = df['% Documents in Top 1%'].mean()
median_elite = df['% Documents in Top 1%'].median()
max_elite = df['% Documents in Top 1%'].max()

# Print the statistical summary
print(f"--- Statistics for % Documents in Top 1% ---")
print(f"Mean (Average):   {avg_elite:.2f}%")
print(f"Median:           {median_elite:.2f}%")
print(f"Max Value:        {max_elite:.2f}%")
print("-" * 40)


# --- Step 2: Identify Top Performers ---
# Filter rows where the elite research percentage exceeds 2.0%, i.e. double the theoretical global baseline of 1%
elite_performers = df[df['% Documents in Top 1%'] > 2.0]
elite_performers = elite_performers[['Country', 'Year', '% Documents in Top 1%']]
print(f"\nNumber of records with > 2% Elite Papers: {len(elite_performers)}")

# --- Consistency Check ---
# Count how frequently each country appears in the 'Elite' list to measure consistency
print("\n--- Which countries are CONSISTENTLY performing > 2%? ---")
consistency_check = elite_performers['Country'].value_counts().head(15) # <-- value_counts() shows how often each country appears
print(consistency_check)
print("-" * 40)

# Show the top 5 highest single-year performances
print("\nTop 5 specific examples (Highest Single Year Performance):")
print(elite_performers.sort_values(by='% Documents in Top 1%', ascending=False).head(5))

# --- Step 3: Visualization (Histogram) ---
# Create a histogram to visualize the distribution of the data
fig_elite = px.histogram(
    df, 
    x='% Documents in Top 1%',
    nbins=40,
    title='Distribution of Elite Research (% Documents in Top 1%)',
    marginal='box',   # <-- add box plot also to find outliers
    color_discrete_sequence=['gold'], 
    labels={'% Documents in Top 1%': 'Percentage of Documents in Top 1%'}
)

# Add a vertical reference line for the Global Baseline (Theoretical 1%)
fig_elite.add_vline(x=1.0, line_dash="dash", line_color="red")

# Add a vertical reference line for the Dataset Average
fig_elite.add_vline(x=avg_elite, line_dash="dot", line_color="blue")

# --- Annotations ---
# Add text label for Global Baseline
fig_elite.add_annotation(
    x=1.0, 
    y=1.08, 
    yref="paper",  # <-- positions text relative to the chart layout not data values.
    text="Global Baseline (1%)", 
    showarrow=False, 
    font=dict(color="red", size=10) 
)

# Add text label for Dataset Average
fig_elite.add_annotation(
    x=avg_elite, 
    y=1.08, 
    yref="paper", 
    text=f"Dataset Avg ({avg_elite:.2f}%)", 
    showarrow=False, 
    font=dict(color="blue", size=10)
)

# layout for better readability
fig_elite.update_layout(bargap=0.1) # <-- add gap between bars
fig_elite.show()

--- Statistics for % Documents in Top 1% ---
Mean (Average):   1.77%
Median:           1.76%
Max Value:        2.96%
----------------------------------------

Number of records with > 2% Elite Papers: 106

--- Which countries are CONSISTENTLY performing > 2%? ---
Country
BRAZIL            10
GERMANY            9
UNITED KINGDOM     9
CHINA              8
JAPAN              8
SWEDEN             8
ITALY              6
CANADA             6
SWITZERLAND        6
USA                6
SPAIN              6
AUSTRALIA          5
SOUTH KOREA        5
INDIA              5
NETHERLANDS        5
Name: count, dtype: int64
----------------------------------------

Top 5 specific examples (Highest Single Year Performance):
         Country  Year  % Documents in Top 1%
107      GERMANY  2005                   2.96
192  NETHERLANDS  2004                   2.95
28        BRAZIL  2013                   2.93
290  SWITZERLAND  2018                   2.92
150        ITALY  2005                   2.89


In [21]:
# --- Additional Analysis: Aggregated Country-wise Elite Performance ---

# --- Step 1 : Display Top 10 Countries ---
# To determine which countries are the best overall, we group the data by 'Country'
# and calculate the mean (average) of their '% Documents in Top 1%' across all years.
country_elite_avg = df.groupby('Country')['% Documents in Top 1%'].mean().sort_values(ascending=False).reset_index()

# Display the top 10 countries as a table for a quick overview.
print("--- Top 10 Countries by Average % in Top 1% (Overall Performance) ---")
print(country_elite_avg.head(10))


# --- Step 2 : Visualization: Horizontal Bar Chart ---
# A horizontal bar chart is used for easy comparison of the top performers.
fig_bar = px.bar(
    country_elite_avg.head(10),
    x='% Documents in Top 1%',
    y='Country',                   
    orientation='h',
    text_auto='.3s',
    title='Top 10 Countries by Average Elite Research Performance',
    color='% Documents in Top 1%',
    color_continuous_scale='Teal',  # <-- visually appealing color scale
    labels={
        '% Documents in Top 1%': 'Average % of Documents in Top 1%',
        'Country' : 'Country Name'
    }
)

fig_bar.add_vline(
    x=1.0,
    line_dash='dash',
    line_color= 'red'
)

fig_bar.add_annotation(
    x=1.0, 
    y=1.08, 
    yref="paper", 
    text="Global Avg (1%)", 
    showarrow=False, 
    font=dict(color="red", size=10)
)

# Reverse the y-axis to display the top-ranked country at the top.
fig_bar.update_layout(
    yaxis=dict(autorange="reversed"),
    coloraxis_showscale=False
    )

# Render the interactive chart.
fig_bar.show()

--- Top 10 Countries by Average % in Top 1% (Overall Performance) ---
          Country  % Documents in Top 1%
0          SWEDEN               1.933333
1          BRAZIL               1.909524
2       AUSTRALIA               1.903158
3         GERMANY               1.827273
4  UNITED KINGDOM               1.826818
5           ITALY               1.802609
6           CHINA               1.789091
7     NETHERLANDS               1.771000
8           INDIA               1.765714
9             USA               1.749000


<div class="alert alert-block alert-warning">
<b>💡 Insight:</b> The analysis of Elite Research Conversion reveals a high-performing cohort that consistently outperforms the global theoretical baseline:
<ul>
    <li><b>Systematic Overperformance:</b> The dataset's mean (<b>1.77%</b>) and median (<b>1.76%</b>) are nearly identical and well above the expected global baseline of 1%. This suggests a <b>roughly symmetric distribution</b> in which the typical country-year produces elite research at nearly <b>double</b> the expected rate.</li>
    <li><b>Consistency vs. Peak Performance:</b> While <b>Germany</b> achieved the highest single-year peak (<b>2.96%</b>), <b>Brazil</b> demonstrates remarkable consistency, crossing the "Exceptional" (>2%) threshold <b>10 times</b>, followed closely by Germany and the United Kingdom (9 times each). This signals a robust, high-impact research ecosystem in these nations.</li>
    <li><b>Sustained Excellence (Overall Average):</b> When aggregated over time, <b>Sweden</b> emerges as the overall leader with an average of <b>1.93%</b>, followed closely by <b>Brazil (1.91%)</b>. This suggests that while others may hit higher single-year peaks, Sweden maintains the highest average share of elite papers across the years.</li>
</ul>
This implies that getting into the Top 1% is not just a matter of luck for these nations; they have established systems that consistently produce world-class influential papers.
</div>
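
Whether the gap to the 1% baseline is more than noise can be checked with a one-sample t-test, and the symmetry claim with a skewness statistic. The following is a rough sketch on hypothetical values, not the real column. It also assumes the observations are independent, which country-year records are not strictly, so the p-value should be read as indicative only.

```python
import numpy as np
from scipy import stats

# Hypothetical sample of '% Documents in Top 1%' values (not the real data)
top1_pct = np.array([1.2, 1.5, 1.6, 1.7, 1.8, 1.8, 1.9, 2.0, 2.1, 2.4])

# H0: the mean equals the theoretical global baseline of 1.0
t_stat, p_value = stats.ttest_1samp(top1_pct, popmean=1.0)

# Sample skewness close to 0 is consistent with a roughly symmetric distribution
skewness = stats.skew(top1_pct)

print(f"mean={top1_pct.mean():.2f}, median={np.median(top1_pct):.2f}")
print(f"t={t_stat:.2f}, p={p_value:.4g}, skew={skewness:.2f}")
```

On the real data the same two calls would take `df['% Documents in Top 1%']` in place of the toy array.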

### 4.2 Bivariate Analysis

#### Q1 Top Performers Analysis: Publication Volume Trends

**Objective:**
To identify the **Top 3 countries** based on the *total volume of publications* in the dataset and then analyze their research output trends over the last two decades.

1.  **Identify Top Performers:** Which are the actual Top 3 countries based on the sum of `Web of Science Documents`?
2.  **Analyze Growth Trajectories:** How have these specific countries performed over time?

In [32]:
# --- Step-1 : Calculate the total documents published by each country over the entire period ---
top_countries_by_docs = df.groupby('Country')['Documents'].sum().sort_values(ascending=False).head(3)
print("--- Top 3 Countries by Total Publications ---")
print(top_countries_by_docs)
print("-------------------------------------------")

# Get the list of names of the top 3 countries
top_3_names = top_countries_by_docs.index.tolist()

# Filter the original dataframe to only include data for these top 3 countries
df_top3_trends = df[df['Country'].isin(top_3_names)].sort_values('Year')


# --- Step-2 : Create an interactive line chart ---
fig_trend_doc = px.line(
    df_top3_trends,
    x='Year',
    y='Documents',
    color='Country',      # <-- creates a different colored line for each country
    markers=True,         # <-- adds markers to each data point for clarity
    title='<b>Trend in Research Publications for Top 3 Countries (2003-2025)</b>',
    labels={
        "Year": "Year of Publication",
        "Documents": "Number of Documents Published"
    },
    template="plotly_white" # <-- use a clean white background
)

# Improve the layout and readability
fig_trend_doc.update_layout(
    title_font_size=22,
    xaxis_title_font_size=16, 
    yaxis_title_font_size=16,
    legend_title_font_size=14, # <-- give size to title of legend
    legend_title_text='<b>Top 3 Countries</b>'
)

# Display the chart
fig_trend_doc.show()

--- Top 3 Countries by Total Publications ---
Country
UNITED KINGDOM    1540219
SPAIN             1091687
BRAZIL            1086793
Name: Documents, dtype: int64
-------------------------------------------


<div class="alert alert-block alert-warning">
<b>💡 Insight: A Dynamic Three-Way Race for Publication Leadership</b>
The trend analysis for the top three countries by publication volume (the United Kingdom, Spain, and Brazil) reveals a highly dynamic and competitive landscape rather than a static hierarchy.
<ul>
<li><b>Shifting Leadership:</b> While the <b>United Kingdom</b> holds the overall lead, it wasn't always the frontrunner. <b>Spain</b> was the initial leader in the early 2000s, after which all three nations entered a period of intense, neck-and-neck competition from approximately 2005 to 2010.</li>
<li><b>The Rise of Brazil:</b> The dynamic shifted significantly after 2010, with <b>Brazil</b> emerging as a major powerhouse. In two distinct periods (2010-2015 and 2020-2025), Brazil's output surged to its peak, challenging the UK directly for the top position.</li>
<li><b>Recent Dynamics (2015-Present):</b> In the most recent decade, the competition has largely been between the UK and Brazil. While the UK consistently remains at the top, Brazil and Spain have often shown comparable output levels, solidifying a new competitive balance among the top performers.</li>
</ul>
This analysis indicates that research publication volume is not a fixed race. The consistent rise of Brazil, in particular, points to a significant shift in the global research landscape over the last decade, challenging the established dominance of other nations.
</div>
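
Statements about which country led in which period can be backed by a small table of yearly leaders rather than read off the line chart alone. The sketch below shows one way to build it with `groupby` and `idxmax`. It uses the notebook's column names, but the `toy` frame and its values are made up for illustration.

```python
import pandas as pd

# Toy data using the notebook's column names; the values are made up
toy = pd.DataFrame({
    "Country": ["A", "B", "C"] * 3,
    "Year": [2003] * 3 + [2004] * 3 + [2005] * 3,
    "Documents": [50, 70, 60, 80, 75, 65, 90, 85, 95],
})

# For each year, find the index label of the row with the largest document count
leader_idx = toy.groupby("Year")["Documents"].idxmax()
leaders = toy.loc[leader_idx, ["Year", "Country", "Documents"]].reset_index(drop=True)

# Count how many years each country held the top position
years_led = leaders["Country"].value_counts()

print(leaders)
print(years_led)
```

Applied to the filtered top-3 frame, the `years_led` counts would show directly how often the lead changed hands.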

#### Q2: Quality vs. Time Trend

**Objective:**
To identify if there is a trade-off between the quantity of research (number of documents) and its quality (CNCI) over time, especially for major research-producing countries. We want to see if an increase in the number of publications leads to a decrease in the average quality.

1.  **How has the research quality (CNCI) changed over the years for major countries like the UK, Brazil, and Spain?**
    *   **Metrics Used:** `Year`, `Category Normalized Citation Impact (CNCI)`, `Country`
2.  **Is there a visible inverse relationship when comparing the trend of `Web of Science Documents` and `CNCI` for these countries?**
    *   **Metrics Used:** `Year`, `Web of Science Documents`, `Category Normalized Citation Impact (CNCI)`, `Country`

In [35]:
# --- Step-1: Calculate the total documents published by each country over the entire period ---
top_countries_by_docs = df.groupby('Country')['Documents'].sum().sort_values(ascending=False).head(5)
print("--- Top 5 Countries by Total Publications ---")
print(top_countries_by_docs)
print("-------------------------------------------")

# Get the list of names of the top 5 countries
top_5_names = top_countries_by_docs.index.tolist()


# --- Step-2: Create Interactive Line Chart --- 
# We will create a separate chart for each country for better clarity
for country in top_5_names:
    df_country = df[df['Country'] == country].sort_values('Year') # <-- select this country's rows, sorted by year

    # Create a figure with a secondary y-axis
    fig_cncidoc = make_subplots(specs=[[{"secondary_y": True}]])

    # Add Web of Science Documents trace (Quantity) to the primary y-axis
    fig_cncidoc.add_trace(
        go.Scatter(
            x=df_country['Year'],
            y=df_country['Documents'],
            name='Documents (Quantity)',
            mode='lines+markers', # <-- add lines and markers both
            line=dict(color='royalblue', width=3)
        ),
        secondary_y=False, # <-- plot on the primary (left) y-axis
    )

    # Add CNCI trace (Quality) to the secondary y-axis
    fig_cncidoc.add_trace(
        go.Scatter(
            x=df_country['Year'],
            y=df_country['CNCI'],
            name='CNCI (Quality)',
            mode='lines+markers',
            line=dict(color='firebrick', width=3, dash='dash')
        ),
        secondary_y=True, # <-- this is secondary axis.
    )

    # Add figure title and axis labels
    fig_cncidoc.update_layout(
        title_text=f'<b>{country}: Quality (CNCI) vs. Quantity (Documents) Over Time</b>',
        xaxis_title='Year',
        legend_title='Metrics'
    )

    # Set y-axes titles
    fig_cncidoc.update_yaxes(title_text='<b>Total Documents Published</b>', secondary_y=False)
    fig_cncidoc.update_yaxes(title_text='<b>CNCI (Research Quality)</b>', secondary_y=True)
    
    # Show the plot
    fig_cncidoc.show()

--- Top 5 Countries by Total Publications ---
Country
UNITED KINGDOM    1540219
SPAIN             1091687
BRAZIL            1086793
CANADA             976080
SWITZERLAND        966760
Name: Documents, dtype: int64
-------------------------------------------


<div class="alert alert-block alert-warning">
<b>💡 Insight: Performance Analysis of Top Publishing Countries</b>
An analysis of the top 5 countries by publication volume reveals interesting patterns in their research quality (CNCI) and consistency over time.
<ul>
<li><b>Top Performers by Volume:</b> In this dataset, the United Kingdom, Spain, Brazil, Canada, and Switzerland have the highest total number of research documents.</li>
<li><b>Quality Fluctuations:</b> Despite high output, some top countries show significant dips in quality. For instance, Canada's CNCI fell below the world average (less than 1.0) in 2010 and 2024, and Switzerland showed a similar dip in 2010.</li>
<li><b>Spain's Superior Quality:</b> Among these five nations, Spain's CNCI is the most consistently high over time. The CNCI of the other four countries (UK, Brazil, Canada, Switzerland) appears closer to average relative to their publication volume.</li>
</ul>
This suggests that a high publication volume alone does not guarantee consistent, above-average research quality: among these five high-volume publishers, Spain sustains it most consistently.
</div>
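
The quantity-versus-quality trade-off asked about in the objective can also be summarised with one number per country: the correlation between yearly `Documents` and yearly `CNCI`. A negative value would be consistent with a trade-off, a positive one with volume and quality rising together. This is a minimal sketch on made-up data; the `toy` frame and the `corr_docs_cnci` name are assumptions.

```python
import pandas as pd

# Toy data (made-up): country "A" trades quality for volume, country "B" does not
toy = pd.DataFrame({
    "Country": ["A"] * 4 + ["B"] * 4,
    "Year": [2003, 2004, 2005, 2006] * 2,
    "Documents": [100, 120, 140, 160, 100, 120, 140, 160],
    "CNCI": [1.4, 1.3, 1.2, 1.1, 1.0, 1.1, 1.2, 1.3],
})

# Pearson correlation between volume and CNCI, computed separately per country
tradeoff = (
    toy.groupby("Country")[["Documents", "CNCI"]]
    .apply(lambda g: g["Documents"].corr(g["CNCI"]))
    .rename("corr_docs_cnci")
)

print(tradeoff)
```

With only about two decades of yearly points per country, such correlations are noisy and are best read alongside the dual-axis charts rather than instead of them.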

#### Q3: Collaboration vs. Overall Quality

**Objective:**
To identify the relationship between the impact of collaborative research and the overall research quality across different countries and years.

1.  **Is there a positive correlation between `Collab-CNCI` and `CNCI`?**
    *   **Metrics Used:** `Collab-CNCI`, `Category Normalized Citation Impact (CNCI)`
2.  **Does higher collaboration impact consistently lead to higher overall quality?**
    *   **Metrics Used:** Visual inspection of the scatter plot trend.

In [36]:
# --- Step-1 : Calculate the Pearson correlation coefficient ---
# This value tells us how strongly two variables are linearly related.
correlation_value = df['Collab-CNCI'].corr(df['CNCI'])
print(f"The Correlation between Collab-CNCI and CNCI is: {correlation_value:.4f}")

# --- Step-2 : Create an interactive scatter plot ---
# This will treat all data points as one group and show a single, clear trend.
fig_corr = px.scatter(
    df,
    x='Collab-CNCI',
    y='CNCI',
    hover_data=['Country', 'Year'],  # <-- show country and year on hover.
    trendline="ols",             # <-- fits a single OLS (ordinary least squares) trendline; requires statsmodels
    title='<b>Overall Quality (CNCI) vs. Collaboration Impact (Collab-CNCI)</b><br><sup>A weak correlation suggests no strong link between the two metrics.</sup>',
    labels={
        "Collab-CNCI": "Collaboration Impact (Collab-CNCI)",
        "CNCI": "Overall Research Quality (CNCI)"
    }
)

# Customize the trendline to make it more visible
fig_corr.update_traces(selector=dict(mode='lines'), line=dict(color='red', width=3))

# Display the cleaner plot
fig_corr.show()

The Correlation between Collab-CNCI and CNCI is: -0.0655


<div class="alert alert-block alert-warning">
<b>💡 Insight: Collaboration's Link to Overall Quality is Very Weak</b>
The analysis reveals a surprisingly weak and slightly negative correlation between the impact of a country's collaborative research and its overall research quality.
<ul>
<li><b>Correlation Value: </b> The calculated Pearson correlation is <b>-0.0655</b>, a value extremely close to zero. This indicates essentially no linear relationship between Collab-CNCI and CNCI in this dataset.</li>
<li><b>Visual Trend: </b> The trendline on the scatter plot is nearly flat, which is consistent with the lack of a clear positive or negative trend. A strong relationship would show points clustered tightly around a clearly sloped line.</li>
<li><b>Interpretation: </b> This finding implies that achieving high impact in collaborative papers does not automatically guarantee a high overall quality across all of a country's publications. The two metrics appear to be largely independent in this dataset.</li>
</ul>
Therefore, collaboration impact (Collab-CNCI) should not be used as a direct proxy or predictor for a country's overall research quality (CNCI) on the basis of this data.
</div>
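
A correlation coefficient on its own does not say whether it is distinguishable from zero. `scipy.stats.pearsonr` returns a p-value alongside the coefficient, and `scipy.stats.spearmanr` gives a rank-based check that does not assume linearity. The sketch below runs both on two independently generated toy samples; the arrays are synthetic stand-ins, not the real `Collab-CNCI` and `CNCI` columns.

```python
import numpy as np
from scipy import stats

# Two synthetic, independently generated samples (stand-ins for the real columns)
rng = np.random.default_rng(seed=0)  # fixed seed so the run is reproducible
collab_cnci = rng.normal(loc=1.2, scale=0.2, size=200)
cnci = rng.normal(loc=1.3, scale=0.2, size=200)

# Pearson: linear association, with a p-value for H0 "no correlation"
r, p_r = stats.pearsonr(collab_cnci, cnci)

# Spearman: monotonic association based on ranks
rho, p_rho = stats.spearmanr(collab_cnci, cnci)

print(f"Pearson  r={r:.4f}, p={p_r:.4f}")
print(f"Spearman rho={rho:.4f}, p={p_rho:.4f}")
```

A large p-value together with a coefficient near zero would support the reading that the two metrics are unrelated in the sample, while a small p-value with a tiny coefficient would point to a real but negligible association.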


#### Q4: Rank vs. Performance Metrics

**Objective:**
To determine whether a country's `Rank` in this dataset is more strongly influenced by the sheer `quantity` of its research (`Web of Science Documents`) or the overall `impact` of that research (`Times Cited`).

1.  **How does Rank correlate with the number of documents?**
    *   **Metrics used:** `Rank`, `Web of Science Documents`
2.  **How does Rank correlate with the number of times cited?**
    *   **Metrics used:** `Rank`, `Times Cited`

In [25]:
# --- Step-1 : Calculate the Pearson correlation coefficients for all metrics ---
cols_for_corr = ['Rank', 'Documents', 'Times Cited', 'CNCI'] # <-- We include CNCI as well for a broader context
correlation_matrix = df[cols_for_corr].corr()

# Print the matrix to see the exact values
print("Correlation Matrix:")
print(correlation_matrix)

# --- Step-2 : Create an interactive heatmap to visualize the correlations ---
fig_heatmap = px.imshow(
    correlation_matrix,
    text_auto=True,  # <-- this will display the correlation values on the heatmap
    aspect="auto",   # <-- adjust aspect ratio for better readability
    title='Correlation Heatmap: Is Rank Driven by Quantity or Impact?',
    color_continuous_scale="Oranges"
)

# Display the chart
fig_heatmap.show()

Correlation Matrix:
                 Rank  Documents  Times Cited      CNCI
Rank         1.000000  -0.375050    -0.359304  0.034124
Documents   -0.375050   1.000000     0.922424 -0.001763
Times Cited -0.359304   0.922424     1.000000  0.005971
CNCI         0.034124  -0.001763     0.005971  1.000000



<div class="alert alert-block alert-warning">
<b>💡 Insight: Rank is Moderately Associated with Both Quantity and Impact</b>
The analysis reveals a moderate inverse relationship between <code>Rank</code> and the two key performance metrics, suggesting the ranking criteria reflect both volume and impact.
<ul>
<li><b>Rank vs. Documents: </b> The correlation between <code>Rank</code> and <code>Web of Science Documents</code> is <b>-0.375</b>. This moderate negative correlation indicates that as the publication <i>quantity</i> (Documents) increases, the numerical rank tends to <i>improve</i> (i.e., the rank number decreases, moving towards rank 1).</li>
<li><b>Rank vs. Times Cited: </b> A very similar, moderate negative correlation of <b>-0.359</b> is observed with <code>Times Cited</code>. This shows that a higher overall <b>impact</b> (Citations) is also associated with a better numerical rank.</li>
<li><b>Quantity vs. Impact: </b> A very strong positive correlation of <b>0.922</b> exists between <code>Web of Science Documents</code> and <code>Times Cited</code>. Country-years with more publications therefore tend to have more total citations as well.</li>
</ul>
<b>Conclusion:</b> <code>Rank</code> in this dataset is associated moderately, and almost equally, with both publication quantity and total citation count. Because documents and total citations are so strongly linked, their separate effects on rank cannot be distinguished here, and a correlation alone does not show that raising publication volume would improve the rank. Note also that <code>CNCI</code> is essentially uncorrelated with <code>Rank</code> (0.034).
</div>
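
Because `Rank` is an ordinal variable, a rank-based (Spearman) correlation is often a more natural companion to the Pearson matrix. `DataFrame.corr` accepts `method="spearman"` for this. The sketch below compares both methods on a small made-up frame that reuses the notebook's column names; the values are illustrative only.

```python
import pandas as pd

# Toy data (made-up): rank 1 is best, volume falls steeply and non-linearly
toy = pd.DataFrame({
    "Rank": [1, 2, 3, 4, 5],
    "Documents": [900, 400, 350, 120, 100],
    "Times Cited": [5000, 3000, 3500, 800, 600],
})

# Pearson measures linear association; Spearman measures monotonic association
pearson = toy.corr(method="pearson")
spearman = toy.corr(method="spearman")

print("Pearson:")
print(pearson.round(3))
print("Spearman:")
print(spearman.round(3))
```

In this toy case the Spearman coefficient between `Rank` and `Documents` is exactly -1 because the ordering is perfectly monotonic, while the Pearson value is weaker because the relationship is not linear.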

#### Q5. Gap Analysis: Rank 1 vs Rank 2 (Times Cited)

**Objective:**
To measure the dominance of the market leader. Since the provided `Rank` column has weak correlation and duplicates, we will dynamically identify the **True Top 2 Performers** based purely on `Times Cited` for each year and analyze the gap between them.

**1. Who are the actual Top 2 players by Impact?**
*   **Metrics used:** `Country`, `Times Cited` (Sorted Descending per Year).

**2. How big is the dominance gap?**
*   **Metrics used:**
    *   **Absolute Gap:** (Leader Citations - Runner-up Citations).
    *   **Dominance %:** How much larger (%) is the Leader compared to the Runner-up.

In [26]:
# --- Step 1: Data Preparation (Dynamic Ranking) ---
# We are ignoring the original 'Rank' column because of its weak correlation with 'Times Cited' (about -0.36).
# Instead, we identify Rank 1 and 2 dynamically based on 'Times Cited'.

gap_data = []
years = sorted(df['Year'].unique()) # <-- get unique years from the dataset

for y in years:
    # Filter data for the specific year
    # Sort by 'Times Cited' in Descending order (Highest first)
    year_df = df[df['Year'] == y].sort_values(by='Times Cited', ascending=False)
    
    # We need at least 2 countries/entities to compare
    if len(year_df) >= 2:
        leader = year_df.iloc[0] # <-- The first row is the Actual Leader (Highest Citations)
        runner_up = year_df.iloc[1]  # <-- The second row is the Actual Runner-up
        
        # Calculate the Gap
        abs_gap = leader['Times Cited'] - runner_up['Times Cited']
        
        # Calculate Percentage Difference (Dominance)
        # Formula: ((Leader - RunnerUp) / RunnerUp) * 100
        if runner_up['Times Cited'] > 0:
            pct_gap = (abs_gap / runner_up['Times Cited']) * 100
        else:
            pct_gap = 0 # <-- Avoid division by zero
        
        gap_data.append({
            'Year': y,
            'Leader Name': leader['Country'],
            'Leader Citations': leader['Times Cited'],
            'Runner-up Name': runner_up['Country'],
            'Runner-up Citations': runner_up['Times Cited'],
            'Absolute Gap': abs_gap,
            'Dominance %': pct_gap
        })

# Create a DataFrame for the analysis
gap_df = pd.DataFrame(gap_data)

# Print a preview to verify the logic
print("Computed Top 2 Players based on Times Cited:")
print(gap_df[['Year', 'Leader Name', 'Runner-up Name', 'Absolute Gap', 'Dominance %']].sort_values(by='Dominance %', ascending=False).head())

# --- Step 2: Visualization using Plotly ---
# Chart 1: Comparison Bar Chart (Leader vs Runner-up)
fig1 = go.Figure()

# Bar for the Leader
fig1.add_trace(go.Bar(
    x=gap_df['Year'],
    y=gap_df['Leader Citations'],
    name='True Rank 1 (Leader)',
    marker_color='#1f77b4', # <-- blue theme
    text=gap_df['Leader Name'],
    textposition='auto',
    hovertemplate='<b>%{text}</b><br>Citations: %{y}<extra></extra>'
))

# Bar for the Runner-up
fig1.add_trace(go.Bar(
    x=gap_df['Year'],
    y=gap_df['Runner-up Citations'],
    name='True Rank 2 (Runner-up)',
    marker_color='#ff7f0e', # <-- orange theme
    text=gap_df['Runner-up Name'],
    textposition='auto',
    hovertemplate='<b>%{text}</b><br>Citations: %{y}<extra></extra>'
))

fig1.update_layout(
    title='<b>Top 2 Dominance: Leader vs Runner-up (By Times Cited)</b>',
    xaxis_title='Year',
    yaxis_title='Total Times Cited',
    barmode='group', # <-- side by side bar
    template='plotly_white',
    legend=dict(title="Position")
)

fig1.show()

# Chart 2: The Dominance Gap Trend
# This chart shows IF the gap is widening or closing over time.
fig2 = px.line(
    gap_df, 
    x='Year', 
    y='Dominance %',
    markers=True,
    title='<b>Dominance Trend: How Much Stronger Is the Leader Than the Runner-up?</b>',
    hover_data=['Leader Name', 'Runner-up Name', 'Absolute Gap']
)

fig2.update_traces(
    line_color='crimson',
    marker=dict(size=10)
)

fig2.update_layout(
    xaxis_title='Year',
    yaxis_title='Dominance (% exceeding Runner-up)',
    template='plotly_white'
)

# Add a reference line at 0 (No Gap)
fig2.add_hline(y=0, line_dash="dash", line_color="gray", annotation_text="Equal Performance")

fig2.show()

Computed Top 2 Players based on Times Cited:
    Year     Leader Name  Runner-up Name  Absolute Gap  Dominance %
4   2007           CHINA  UNITED KINGDOM       6336975    94.582888
16  2019           SPAIN           INDIA       4233387    78.618823
0   2003           SPAIN          CANADA       4332231    62.394319
3   2006          CANADA          FRANCE       3872085    58.604342
13  2016  UNITED KINGDOM          FRANCE       4485334    55.105755


<div class="alert alert-block alert-warning">
<b>üí° Insight: Peak Dominance Years (Rank 1 vs Rank 2)</b>
<br>
The analysis reveals specific years where the top-performing country established a massive gap over the runner-up, indicating periods of significant research monopoly.
<ul>
<li><b>2007 (China vs UK) : </b> Recorded the highest dominance in the dataset, where <b>China</b> led with a <b>94.6%</b> margin over the UK.</li>
<li><b>2019 (Spain vs India) : </b> <b>Spain</b> secured the second-highest gap, outperforming India by <b>78.6%</b> in total citations.</li>
<li><b>2003 (Spain vs Canada) : </b> Spain also held the third-largest lead, with a <b>62.4%</b> gap over Canada.</li>
</ul>
<b>Conclusion:</b> While China held the single largest one-time lead in 2007, Spain demonstrates a pattern of strong dominance, appearing twice in the top 3 peak gaps.
</div>
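
The `Dominance %` column is built in an earlier cell. As a minimal sketch of how such a table can be derived, the example below assumes the metric is the leader's gap expressed relative to the runner-up's citations, which is what the axis label "% exceeding Runner-up" suggests. The toy data and the names `toy` and `toy_gap` are hypothetical.

```python
import pandas as pd

# Hypothetical yearly citation totals, for illustration only
toy = pd.DataFrame({
    "Year": [2001, 2001, 2001, 2002, 2002, 2002],
    "Country": ["A", "B", "C", "A", "B", "C"],
    "Times Cited": [300, 200, 100, 150, 400, 250],
})

rows = []
for year, year_df in toy.groupby("Year"):
    # Two largest citation totals for this year, largest first
    top2 = year_df.nlargest(2, "Times Cited")
    leader, runner = top2.iloc[0], top2.iloc[1]
    gap = leader["Times Cited"] - runner["Times Cited"]
    rows.append({
        "Year": year,
        "Leader Name": leader["Country"],
        "Runner-up Name": runner["Country"],
        "Absolute Gap": gap,
        # Assumed definition: gap as a percentage of the runner-up's total
        "Dominance %": gap / runner["Times Cited"] * 100,
    })

toy_gap = pd.DataFrame(rows)
print(toy_gap)
```

Under this definition a value of 100% would mean the leader has exactly twice the runner-up's citations.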

### Multivariate Analysis

#### Q1. Segmentation Analysis: Quantity vs. Quality (Quadrants)

**Objective:**
To classify countries/entities into strategic performance groups based on their volume of work and the impact of that work. This helps in identifying who is driving global research quality versus who is merely increasing volume.

**1. Who are the "Elite Players"?**
*   **Definition:** Countries producing a high volume of papers with high citation impact.
*   **Metrics used:** High `Web of Science Documents` (> Median) **AND** High `CNCI` (> Median).

**2. Who are the "Mass Producers"?**
*   **Definition:** Countries publishing a lot of papers but with below-median impact (Quantity over Quality).
*   **Metrics used:** High `Web of Science Documents` (> Median) **BUT** Low `CNCI` (< Median).

**3. Who are the "Niche / High Potential" players?**
*   **Definition:** Smaller countries producing fewer papers, but of exceptional quality.
*   **Metrics used:** Low `Web of Science Documents` (< Median) **BUT** High `CNCI` (> Median).

**4. Who are the "Underperformers"?**
*   **Definition:** Countries struggling with both volume and impact.
*   **Metrics used:** Low `Web of Science Documents` (< Median) **AND** Low `CNCI` (< Median).
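
The four definitions above can be turned into an explicit label per country. The sketch below is one way to do it with `np.select`. The toy values and the `Segment` column name are hypothetical, and countries sitting exactly on a median are assigned to the "low" side here, which is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical country-level aggregates, for illustration only
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Documents": [900, 800, 100, 200],
    "CNCI": [1.5, 0.9, 1.4, 1.0],
})

median_docs = toy["Documents"].median()
median_cnci = toy["CNCI"].median()

# Strictly above the median counts as "high" (assumption for ties)
high_qty = toy["Documents"] > median_docs
high_qual = toy["CNCI"] > median_cnci

toy["Segment"] = np.select(
    [high_qty & high_qual, high_qty & ~high_qual, ~high_qty & high_qual],
    ["Elite", "Mass Producer", "Niche / High Potential"],
    default="Underperformer",
)
print(toy[["Country", "Segment"]])
```

A label column like this makes it possible to list the members of each quadrant directly instead of reading them off the scatter plot.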

In [37]:
# Step 1: Data Aggregation (Grouping by Country)
overall_df = df.groupby('Country').agg({
    'Documents': 'sum',                  # Quantity (Total Papers)
    'Times Cited': 'sum',                # Citations (Total Impact)
    'CNCI': 'mean',                      # Quality Average
    'Collab-CNCI': 'mean',               # Collab Quality Average
    '% Docs Cited': 'mean'               # Consistency check
}).reset_index()

# Step 2: Calculate New Thresholds (Medians of Overall Data)
median_docs = overall_df['Documents'].median()
median_cnci = overall_df['CNCI'].median()

print(f"Overall Median Documents: {median_docs}")
print(f"Overall Median CNCI: {median_cnci}")

# Step 3: Create the Interactive Scatter Plot
fig_model = px.scatter(
    overall_df,
    x='Documents',
    y='CNCI',
    size='Times Cited',                 # <-- Bubble size = Total Citations till date
    color='Collab-CNCI',                # <-- Color = Average Collab Quality
    hover_name='Country',
    hover_data=['% Docs Cited'],
    log_x=True,                  # <-- adjust scale to get all data points      
    title="Overall Research Performance (All Years): Quantity vs. Quality",
    color_continuous_scale='Viridis',
    height=700
)

# Step 4: Add Quadrant Lines (The Crosshairs)
fig_model.add_vline(x=median_docs, line_width=2, line_dash="dash", line_color="red")
fig_model.add_hline(y=median_cnci, line_width=2, line_dash="dash", line_color="red")

# Step 5: Quadrant Labels (positioned in paper coordinates, independent of the data range)

# 1. Elite Players (Top-Right)
fig_model.add_annotation(
    xref="paper", yref="paper",
    x=0.98, y=0.98,  
    text="<b>CONSISTENT ELITE</b><br>(High Qty / High Qual)",
    showarrow=False, font=dict(size=13, color="green"),
    xanchor="right", yanchor="top"
)

# 2. Mass Producers (Bottom-Right)
fig_model.add_annotation(
    xref="paper", yref="paper",
    x=0.98, y=0.02,  
    text="<b>MASS PRODUCERS</b><br>(High Qty / Low Qual)",
    showarrow=False, font=dict(size=12, color="orange"),
    xanchor="right", yanchor="bottom"
)

# 3. Niche Players (Top-Left)
fig_model.add_annotation(
    xref="paper", yref="paper",
    x=0.02, y=0.98,  
    text="<b>NICHE / BOUTIQUE</b><br>(Low Qty / High Qual)",
    showarrow=False, font=dict(size=12, color="blue"),
    xanchor="left", yanchor="top"
)

# 4. Underperformers (Bottom-Left)
fig_model.add_annotation(
    xref="paper", yref="paper",
    x=0.02, y=0.02,  
    text="<b>LAGGING</b><br>(Low Qty / Low Qual)",
    showarrow=False, font=dict(size=12, color="gray"),
    xanchor="left", yanchor="bottom"
)

fig_model.update_layout(
    xaxis_title="Total Web of Science Documents (Log Scale)",
    yaxis_title="Average CNCI (Quality)",
    template="plotly_white",
    coloraxis_showscale=False
)

fig_model.show()

Overall Median Documents: 934912.5
Overall Median CNCI: 1.2851196172248804


<div class="alert alert-block alert-warning">
<b>💡 Insight: Strategic Performance Segmentation (Quality vs. Quantity)</b>
<br>
Overall data analysis reveals distinct research strategies among major countries:
<ul>
<li><b>Mass Producer (UK) : </b> The UK emerges as the #1 Mass Producer with the highest volume of documents, but surprisingly records the lowest CNCI (Quality) among the comparison group.</li>
<li><b>Consistent Elite (Spain) : </b> Spain leads as the top performer in the Elite category, successfully maintaining both high publication volume and high research quality.</li>
<li><b>Lagging (Netherlands & India) : </b> The Netherlands ranks lowest in both quantity and quality. India also falls into this quadrant, with research quality slightly below the global median.</li>
<li><b>Boutique / Niche (Japan) : </b> Japan secures the #1 position in Research Quality (CNCI), despite its total publication quantity being slightly below the median.</li>
</ul>
<b>Conclusion:</b> While Spain achieves the ideal balance of scale and impact, the UK prioritizes volume over quality, whereas Japan adopts a "Quality over Quantity" approach.
</div>

#### Q2: Pareto Principle (80/20 Rule) Analysis

**Objective:**
To identify the concentration of research impact and determine if a small minority of entities (e.g., top countries) are responsible for the majority of the global citations (influence). We want to validate if the "Vital Few" drive the results.

**[Question 1]**
Do the Top 20% of countries generate approx. 80% of the Total Citations?
*   **Metrics used:** `Country`, `Times Cited` (sorted), `Cumulative Percentage of Citations`.

**[Question 2]**
Which specific countries fall into this "Vital Few" (Top contributors) category?
*   **Metrics used:** `Country`, `Times Cited`, `Rank` (based on contribution).

In [38]:
# --- Step 1: Data Preparation (Same as before) ---
pareto_df = df.groupby('Country')['Times Cited'].sum().reset_index()
pareto_df = pareto_df.sort_values(by='Times Cited', ascending=False).reset_index(drop=True)

# Cumulative Calculations
total_citations = pareto_df['Times Cited'].sum()
pareto_df['Cumulative_Citations'] = pareto_df['Times Cited'].cumsum() # <-- cumsum() - cumulative sum
pareto_df['Cumulative_Cit_Perc'] = (pareto_df['Cumulative_Citations'] / total_citations) * 100

# --- Step 2: Visualization (Bars + Line) ---
fig_pareto = make_subplots(specs=[[{"secondary_y": True}]]) # <-- to create two axis

# 1. BARS: Individual Citations (Left Axis)
fig_pareto.add_trace(
    go.Bar(
        x=pareto_df['Country'],
        y=pareto_df['Times Cited'],
        name='Citations (Volume)',
        marker_color='rgb(55, 83, 109)', # <-- dark blue bar
        hovertemplate='<b>%{x}</b><br>Citations: %{y:,.0f}<extra></extra>'
    ),
    secondary_y=False, # <-- set to the left axis
)

# 2. LINE: Cumulative Percentage (Right Axis)
fig_pareto.add_trace(
    go.Scatter(
        x=pareto_df['Country'],
        y=pareto_df['Cumulative_Cit_Perc'],
        name='Cumulative %',
        mode='lines+markers',
        marker=dict(size=6, color='rgb(26, 118, 255)'),  # <-- blue line
        line=dict(width=3),
        hovertemplate='<b>%{x}</b><br>Cumulative Impact: %{y:.1f}%<extra></extra>'
    ),
    secondary_y=True, # <-- set to the right axis
)


# --- Step 3: Add Threshold Lines (80/20 Rule) ---
fig_pareto.add_shape(
    type="line",
    x0=-0.5, 
    x1=len(pareto_df)-0.5, 
    y0=80, 
    y1=80,
    yref="y2", # <-- refer to secondary axis 
    line=dict(color="crimson", width=2, dash="dash"),
)

# Annotation for 80%
fig_pareto.add_annotation(
    x=len(pareto_df)*0.8,  # <-- put text toward the right side
    y=82,  # <-- just above the 80% line
    yref="y2", # <-- refer to secondary axis 
    text="80% Cumulative Impact",
    showarrow=False,
    font=dict(color="crimson")
)

# --- Step 4: Layout Improvements ---
fig_pareto.update_layout(
    title='<b>Classic Pareto Chart: Citations by Country</b><br><sup>Bars = Individual Count | Line = Cumulative %</sup>',
    template='plotly_white',
    hovermode='x unified', # <-- to show unified data
    legend=dict(x=0.9, y=0.8), # <-- normalized paper coordinates (y=8.0 was far outside the plot area)
    height=650
)

# Axis Labels setup
fig_pareto.update_yaxes(title_text="<b>Total Citations (Volume)</b>", secondary_y=False)
fig_pareto.update_yaxes(title_text="<b>Cumulative Percentage (%)</b>", secondary_y=True)
fig_pareto.update_xaxes(title_text="<b>Countries (Sorted by Impact)</b>")

fig_pareto.show()

# --- Step 5: Insight Print ---
cutoff_df = pareto_df[pareto_df['Cumulative_Cit_Perc'] <= 80]
count_80 = len(cutoff_df)
percent_countries = (count_80 / len(pareto_df)) * 100

print(f"Pareto Insight: Top {count_80} countries ({percent_countries:.1f}% of total) generate ~80% of all citations.")

Pareto Insight: Top 11 countries (68.8% of total) generate ~80% of all citations.


<div class="alert alert-block alert-warning">
    <b>üí° Insight: Distributed Impact & Pareto Principle Deviation</b>
    <br>
    The analysis reveals that the research impact is not concentrated among a "Vital Few" but is widely distributed across the majority of the entities.
    <ul>
        <li><b>Pareto Failure : </b> The traditional 80/20 rule does not apply here. Instead of the top 20%, the first 11 of 16 countries (68.8%) stay at or below 80% of cumulative citations, so it takes 12 countries (<b>75% of the countries</b>) to cross the 80% mark.</li>
        <li><b>High Equality : </b> This indicates a balanced competitive landscape where citation influence is shared, rather than being monopolized by a few leaders.</li>
        <li><b>Dataset Nature : </b> Such a distribution strongly suggests that the dataset likely consists of a pre-selected group of high-performing entities (e.g., Top Economies) without a "long tail" of low performers.</li>
    </ul>
    <b>Conclusion:</b> This is not a "winner-takes-all" scenario; the research power is evenly spread among the analyzed countries.
</div>
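
The `<= 80` filter in the cell above counts the countries whose running total stays at or below 80%, which can be one fewer than the number needed to actually reach 80%. The sketch below shows both counts side by side on hypothetical totals. The names `cited`, `n_at_or_below`, and `n_to_reach` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical citation totals, already sorted in descending order
cited = pd.Series([40, 25, 20, 10, 5], index=list("ABCDE"))
cum_pct = cited.cumsum() / cited.sum() * 100  # running share in percent

# Countries whose running share is still at or below 80% (the filter used above)
n_at_or_below = int((cum_pct <= 80).sum())

# Smallest number of countries whose running share reaches 80%
n_to_reach = int(np.searchsorted(cum_pct.to_numpy(), 80, side="left")) + 1

print(n_at_or_below, n_to_reach)
```

The two counts agree only when some country lands exactly on the 80% threshold. Otherwise the second count is larger by one, which is why 11 and 12 countries can both be reasonable readings of the same chart.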

#### Q3. Elite Conversion: Collaboration Impact on Top 1% Papers

**Objective:**
To identify if high-quality collaborations (`Collab-CNCI`) directly translate into elite-level research output (producing papers in the global Top 1%). We want to see if working effectively with others helps produce "Blockbuster" research.

**Key Questions & Metrics:**

1.  **Correlation Check:** Is there a strong positive relationship between Collaboration Quality and Elite Output?
    *   **Metrics Used:** `Collab-CNCI` (X-axis) vs. `% Documents in Top 1%` (Y-axis).
2.  **Identifying Super-Collaborators:** Which countries/entities maintain high collaboration standards AND produce a high volume of elite papers?
    *   **Metrics Used:** `Country` (Label), `Web of Science Documents` (Bubble Size), `Category Normalized Citation Impact` (Color).

In [39]:
# --- Step-1 : Data Aggregation (Country-Level Strategy View) ---
df_agg = df.groupby('Country').agg({
    'Collab-CNCI': 'mean',                  # <-- for X-Axis: Avg Collaboration Quality
    '% Documents in Top 1%': 'mean',        # <-- for Y-Axis: Avg Elite Output
    'Documents': 'sum',                     # <-- for Bubble Size: Total Volume
    'CNCI': 'mean'                          # <-- for Color: Overall Quality
}).reset_index()

# Calculate Correlation
correlation = df_agg['Collab-CNCI'].corr(df_agg['% Documents in Top 1%'])
print(f"Pearson Correlation Coefficient: {correlation:.4f}")

# --- Step-2 : Create Advanced Bubble Chart
fig_corr = px.scatter(
    df_agg,
    x="Collab-CNCI",
    y="% Documents in Top 1%",
    size="Documents",                       # <-- give Volume as a Bubble Size
    color="CNCI",                           # <-- give quality(CNCI) as a color
    hover_name="Country",                   
    size_max=60,                          
    color_continuous_scale="Viridis",       # <-- give continuous color
    title=f"<b>Elite Conversion Analysis: Collaboration vs. Top 1% Papers</b><br><sup>Correlation: {correlation:.2f} (Size = Volume, Color = Overall CNCI)</sup>",
    template="plotly_white",
    trendline="ols"                         # <-- ordinary Least Squares
)

# Customizing the Layout for Better Insights
fig_corr.update_traces(
    marker=dict(line=dict(width=1, color='DarkSlateGrey')) # <-- create border on bubble
)

# bold and red trendline
fig_corr.update_traces(selector=dict(mode='lines'), line=dict(color='red', width=3, dash='solid'))

avg_collab = df_agg['Collab-CNCI'].mean()
avg_elite = df_agg['% Documents in Top 1%'].mean()

fig_corr.add_hline(y=avg_elite, line_dash="dot", annotation_text=f"Avg Elite Output ({avg_elite:.2f}%)", annotation_position="bottom right")
fig_corr.add_vline(x=avg_collab, line_dash="dot", annotation_text=f"Avg Collab Quality ({avg_collab:.2f})", annotation_position="top right")

# Axis Titles
fig_corr.update_layout(
    xaxis_title="Collaboration Quality (Collab-CNCI)",
    yaxis_title="Elite Research Output (% in Top 1%)",
    height=700,
    width=1000,
    font=dict(size=12)
)

fig_corr.show()

Pearson Correlation Coefficient: -0.0186


<div class="alert alert-block alert-warning">
    <b>üí° Insight: Elite Performance Saturation - Collaboration is a Baseline</b><br>
    The analysis reveals a <b>near-zero correlation (-0.02)</b>, indicating that among these top-tier entities, higher collaboration quality does not linearly increase the volume of elite (Top 1%) papers. The data shows a "Saturation Effect."
    <ul>
        <li><b>High-Performance Cluster : </b> All analyzed entities are concentrated in the "Winner's Quadrant" with Collab-CNCI scores consistently <b>above 1.15</b> (well above Global Avg) and Elite Output <b>above 1.5%</b>.</li>
        <li><b>Threshold Effect : </b> It appears that high-quality collaboration is a <b>necessary entry requirement</b> (Hygiene Factor) to be in this elite group. However, once this threshold is met, further marginal improvements in collaboration do not yield proportional returns in "breakthrough" research.</li>
        <li><b>No Clear Leader by Metric : </b> Since the trendline is flat, no single country is gaining a massive advantage solely through better collaboration scores; they are all performing at a similarly high standard.</li>
    </ul>
    <b>Conclusion:</b> For these top players, high-quality collaboration is now a "Standard Norm" rather than a unique differentiator. To further increase the share of Top 1% papers, these entities likely need to focus on other variables (like high-risk funding or emerging topics), as they have already mastered the art of collaboration.
</div>
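
Because the correlation is computed on one aggregated point per country (16 countries, judging by the Pareto output above), a single coefficient carries a lot of uncertainty. As a hedged sketch, `scipy.stats` can report a p-value alongside Pearson's r and a rank-based Spearman coefficient as a robustness check. The arrays below are randomly generated stand-ins, not the notebook's data.

```python
import numpy as np
from scipy import stats

# Hypothetical stand-ins for the 16 country-level averages
rng = np.random.default_rng(0)
collab_cnci = rng.normal(loc=1.3, scale=0.1, size=16)
top1_share = rng.normal(loc=1.9, scale=0.3, size=16)

# Pearson measures linear association; the p-value tests r == 0
r, p_value = stats.pearsonr(collab_cnci, top1_share)

# Spearman uses ranks, so it is less sensitive to a single extreme country
rho, p_rank = stats.spearmanr(collab_cnci, top1_share)

print(f"Pearson r = {r:.3f} (p = {p_value:.3f})")
print(f"Spearman rho = {rho:.3f} (p = {p_rank:.3f})")
```

With so few points, a large p-value means the data cannot distinguish a weak relationship from none, which is a softer statement than "no relationship exists".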

In [30]:
# Save this Formatted and Cleaned Data in New File
df.to_csv('data/cleaned_publications.csv', index=False)
print("Cleaned data saved successfully!")

Cleaned data saved successfully!


## 5. Key Insights & Conclusion <a id="insights"></a>
This analysis of global research performance data reveals a landscape of elite, high-performing nations where traditional metrics for success are insufficient. The key strategic insights are as follows:


### 🔍 Insight 1: The 'Elite Club' & Decentralized Power
*   **Pareto Principle Fails:** The analysis reveals that the traditional 80/20 rule does not apply here. Research power is highly decentralized, taking approximately **75% of nations** to generate 80% of the total research impact.
*   **High Performance Norm:** High performance (CNCI > 1.0) is the standard within this dataset, not the exception. There are no systemic underperformers.

<div class="alert alert-block alert-info" style="background-color: #e3f2fd; border-left: 5px solid #2196f3; padding: 10px;">
    <b>üáÆüá≥ India Lens:</b> India is firmly established as a member of this "Elite Club." Its performance metrics consistently align with or exceed global baselines, proving it is a core contributor to the global research ecosystem rather than an outsider.
</div>

<br>

### 🔍 Insight 2: Strategic Divergence (The Four Models)
*   **Mass Producers (Volume > Median, Quality < Median):** Led by the **UK**, which produces the highest document volume but records the lowest average CNCI among peers.
*   **Boutique Specialists (Quality > Median, Volume < Median):** **Japan** leads here with the highest CNCI score, despite its volume being slightly below the median line.
*   **Elite Performers (Both > Median):** **Spain** represents the strategic ideal, successfully maintaining both volume and quality above the median benchmarks.
*   **Lagging Zone (Both < Median):** The **Netherlands** falls into this category, trailing the median in both volume and quality metrics.

<div class="alert alert-block alert-info" style="background-color: #e3f2fd; border-left: 5px solid #2196f3; padding: 10px;">
    <b>üáÆüá≥ India Lens:</b> India is currently positioned in the <b>Lagging</b> category. However, it sits <b>critically close to the median lines</b> for both Volume and CNCI. This indicates that India is not stagnating but is in a transition phase, growing simultaneously in quantity and quality to cross into the Elite quadrant.
</div>

<br>

### 🔍 Insight 3: Distribution Analysis
*   **Symmetry in Quality:** Metrics like `% Docs Cited`, `CNCI`, and `Collab-CNCI` are approximately symmetric (Mean ≈ Median), consistent with a roughly bell-shaped distribution.
*   **Consistency Champions:** **Brazil** is the most consistent elite performer, crossing the "2% Elite Threshold" **10 times**, followed by Germany (9 times) and UK/China/Japan (8 times).
*   **High Elite Peaks:** **Germany** holds the record for the single highest peak performance in `% Top 1% Documents` at **2.96%**.
*   **Overall Leader:** In overall lifetime performance (without year trends), **Sweden** takes the global #1 spot for `% Top 1% Documents`.

<div class="alert alert-block alert-info" style="background-color: #e3f2fd; border-left: 5px solid #2196f3; padding: 10px;">
    <b>üáÆüá≥ India Lens:</b> In overall lifetime performance for Elite Research (% Top 1% Documents), <b>India ranks 9th</b> globally, surprisingly outperforming the <b>USA (10th rank)</b>. This highlights India's growing capability to produce world-class influential papers.
</div>
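
The Mean ≈ Median observation in this insight can be checked numerically. A minimal sketch, using hypothetical values, compares the mean, median, and sample skewness of each column. A skewness near zero supports symmetry, although it does not by itself establish normality.

```python
import pandas as pd

# Hypothetical values: one symmetric column, one with a heavy right tail
toy = pd.DataFrame({
    "CNCI": [1.1, 1.2, 1.3, 1.4, 1.5],
    "Documents": [10, 12, 15, 20, 200],
})

summary = pd.DataFrame({
    "mean": toy.mean(),
    "median": toy.median(),
    "skew": toy.skew(),  # sample skewness; about 0 for symmetric data
})
print(summary)
```

In this toy table the symmetric column has matching mean and median, while the right-tailed column has a mean far above its median and a clearly positive skew.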

<br>

### 🔍 Insight 4: Competitive Analysis
*   **Volatile Dominance:** The dominance gap between the Leader and the Runner-up is volatile but shows a long-term shrinking trend, indicating intensifying competition.
*   **Historical Dominance:**
    *   In `Times Cited`, **China (Leader)** led the UK (Runner-up) in 2007 by **94.6%** of the runner-up's total, which corresponds to a **32.1%** margin when the gap is measured against the two countries' combined citations.
    *   **Spain (Leader)** led the Runner-up in 2019 by **78.6%** of the runner-up's total, a **28.2%** margin on the same combined basis.

<div class="alert alert-block alert-info" style="background-color: #e3f2fd; border-left: 5px solid #2196f3; padding: 10px;">
    <b>üáÆüá≥ India Lens:</b> In the second scenario mentioned above, the Runner-up was <b>India</b>. While Spain held a 28.2% lead, the fact that India emerged as the direct challenger (Runner-up) to the leader in `Times Cited` signifies its rising competitive stature.
</div>
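
The 32.1% and 28.2% margins quoted in this insight and the `Dominance %` values printed in the bivariate section (94.58 and 78.62) appear to describe the same gaps under two normalizations. If d = (L - R) / R is the gap relative to the runner-up, then the gap relative to the combined total is (L - R) / (L + R) = d / (d + 2). The helper name `share_of_combined` below is hypothetical.

```python
def share_of_combined(dominance_pct):
    """Convert a gap measured against the runner-up (in percent) into a gap
    measured against leader + runner-up combined (in percent)."""
    d = dominance_pct / 100      # (L - R) / R
    return d / (d + 2) * 100     # (L - R) / (L + R)

# Dominance % values taken from the printed table in the bivariate section
china_2007 = share_of_combined(94.582888)
spain_2019 = share_of_combined(78.618823)

print(round(china_2007, 1), round(spain_2019, 1))  # prints: 32.1 28.2
```

Quoting the normalization next to each percentage avoids the impression that the two sections disagree.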

<br>

### 🔍 Insight 5: Outlier Analysis
*   **The Quality Ceiling:** `CNCI` shows **zero statistical outliers**. This suggests that while nations can force scale, they cannot engineer "abnormally high" average quality; it appears to hit a natural ceiling.
*   **Volume Spikes:** `Documents` and `Times Cited` show **6 distinct outliers**, driven by "Hyper-production" years from **Italy (2004)** and **China (2007)**.
*   **Elite Spikes:** The `% Top 1%` metric shows **5 outliers**, with historical peaks from **Canada (2013)** and **USA (2003)**.

<div class="alert alert-block alert-info" style="background-color: #e3f2fd; border-left: 5px solid #2196f3; padding: 10px;">
    <b>üáÆüá≥ India Lens:</b> India does not appear as an extreme statistical outlier in Volume or Quality. This suggests its growth has been <b>steady and organic</b>, rather than driven by sudden, artificial policy shocks or temporary anomalies.
</div>
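
Outlier counts like the ones in this insight depend on the detection rule. The sketch below assumes Tukey's 1.5 × IQR fences (the usual box-plot rule), which may differ from the rule used in the notebook's earlier cells. The function name `iqr_outliers` and the sample values are hypothetical.

```python
import pandas as pd

def iqr_outliers(values, k=1.5):
    """Return the entries lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Hypothetical yearly document counts with one spike
docs = pd.Series([100, 110, 120, 130, 140, 900])
outliers = iqr_outliers(docs)
print(outliers)
```

Applying the same function to each metric gives comparable outlier counts, and indexing the result back into the original rows shows which country and year produced each spike.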

<br>

### 🔍 Insight 6: Collaboration Analysis & The Myth
*   **The Strong Link:** A strong, roughly linear relationship exists between `Volume` and `Times Cited`. Publishing more goes hand in hand with more total citations.
*   **The Correlation Myth:** There is **no linear relationship** between `Collaboration Quality (Collab-CNCI)` and `Elite Output (% Top 1% Docs)`. All nations score above 1.0 in collaboration, making it a "Hygiene Factor" rather than a differentiator.
*   **Sweden's Efficiency:** Sweden produces elite-level documents (`% Top 1%`) without necessarily relying on outlier collaboration scores.

<div class="alert alert-block alert-info" style="background-color: #e3f2fd; border-left: 5px solid #2196f3; padding: 10px;">
    <b>üáÆüá≥ India Lens:</b> India is the ultimate proof of this myth. <b>India ranks #1 globally in Collaboration Quality (Collab-CNCI)</b>, yet its conversion to Elite Papers (% Top 1%) is average. This proves that having the best collaboration score does not automatically guarantee the highest volume of elite research.
</div>

<br>

### 🔍 Insight 7: Performance Analysis (Global Leaderboard)
*   **Volume Leader:** **United Kingdom** ranks #1 in total document production.
*   **Quality Leader (CNCI):** **Japan** ranks #1 in average citation impact.
*   **Elite Leader (% Top 1%):** **Sweden** ranks #1 in the percentage of papers reaching the top 1%.

<div class="alert alert-block alert-info" style="background-color: #e3f2fd; border-left: 5px solid #2196f3; padding: 10px;">
    <b>üáÆüá≥ India Lens:</b> India demonstrates a balanced profile across the board:
    <ul>
        <li><b>Rank #11</b> in Volume (Documents)</li>
        <li><b>Rank #10</b> in Quality (CNCI)</li>
        <li><b>Rank #9</b> in Elite Impact (% Top 1%)</li>
    </ul>
    This consistent ranking around the top 10 mark confirms India's position as a balanced, emerging power competing directly with developed economies.
</div>
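
Leaderboard positions like these can be produced in one step by ranking every metric column at once. A minimal sketch on hypothetical data follows. The column names mirror the notebook's, but the values and the `ranks` name are illustrative.

```python
import pandas as pd

# Hypothetical country-level aggregates, for illustration only
toy = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Documents": [500, 900, 700],
    "CNCI": [1.4, 1.1, 1.3],
    "% Documents in Top 1%": [2.0, 1.6, 2.2],
}).set_index("Country")

# Rank 1 = highest value; method="min" gives tied countries the same best rank
ranks = toy.rank(ascending=False, method="min").astype(int)
print(ranks)
```

Reading one row of `ranks` gives a country's position on every metric, which is the shape of the India summary above.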

## 6. Strategic Recommendation and Final Conclusion <a id="conclusion"></a>

The analysis of this elite group of research nations reveals a clear and urgent message: the old strategies of prioritizing volume and broad collaboration are no longer sufficient for achieving top-tier status. The competitive landscape has matured, demanding a more nuanced approach.

Based on the key findings, the following strategic recommendations are proposed:

<div style="padding: 15px; border-radius: 10px; border-left: 5px solid #2196f3; background-color: #e3f2fd; color: #212121; margin-bottom: 20px; box-shadow: 2px 2px 5px rgba(0,0,0,0.1);">
    <h4>üéØ 1. Shift Focus from "Volume" to "Value": Prioritize Elite (Top 1%) Research</h4>
    <p>
        The data conclusively shows that "getting cited" is now a baseline standard, not a differentiator. The primary goal must shift from increasing the total number of publications to increasing the <b>conversion rate of papers into the Global Top 1%</b>.
    </p>
    <ul>
        <li><b>Action:</b> Re-allocate funding towards high-risk, high-reward projects that have the potential to become "Blockbuster" papers rather than incremental research.</li>
        <li><b>üáÆüá≥ India Strategy:</b> India ranks <b>9th</b> in Elite Output. To break into the Top 5, policy incentives must move away from "Number of Papers Published" to "Number of Papers in Top 1%". Quality must supersede Quantity.</li>
    </ul>
</div>

<div style="padding: 15px; border-radius: 10px; border-left: 5px solid #4caf50; background-color: #e8f5e9; color: #212121; margin-bottom: 20px; box-shadow: 2px 2px 5px rgba(0,0,0,0.1);">
    <h4>ü§ù 2. Fix the "Collaboration Efficiency Gap"</h4>
    <p>
        The "Collaboration Myth" insight proved that high collaboration scores do not automatically yield elite research. It is now a "hygiene factor."
    </p>
    <ul>
        <li><b>Action:</b> Stop signing generic MOUs. Conduct a "Collaboration Audit" to identify partnerships that actually convert to high impact and exit those that only add volume.</li>
        <li><b>üáÆüá≥ India Strategy:</b> India ranks <b>#1 globally in Collaboration Quality (Collab-CNCI)</b> but remains average in elite output. This indicates a massive <b>efficiency gap</b>. India needs to leverage these superior partnerships better to translate collaborative potential into actual elite publications.</li>
    </ul>
</div>

<div style="padding: 15px; border-radius: 10px; border-left: 5px solid #ff9800; background-color: #fff3e0; color: #212121; margin-bottom: 20px; box-shadow: 2px 2px 5px rgba(0,0,0,0.1);">
    <h4>üìà 3. Target the "Elite Quadrant" (The Spain Model)</h4>
    <p>
        Nations typically fall into "Mass Production" or "Boutique" models. The goal is to balance both, as demonstrated by Spain.
    </p>
    <ul>
        <li><b>Action:</b> Adopt a "Balanced Growth" strategy. Maintain volume scale while aggressively targeting the "Quality Ceiling" through center-of-excellence models.</li>
        <li><b>üáÆüá≥ India Strategy:</b> Currently in the "Catch-up Zone," India is critically close to the global median lines. A targeted policy push in specific STEM domains could propel India into the <b>Elite Quadrant</b> within the next 5 years, mimicking Spain's trajectory.</li>
    </ul>
</div>

<br>

### 🏁 Final Conclusion

The global research landscape is no longer a simple race for volume. It is a complex strategic arena where **Impact is currency** and **Efficiency is the competitive advantage**.

For emerging powerhouses like **India**, the data presents a promising narrative. India is not an outsider but a core competitor with superior collaboration networks and a growing elite footprint (Rank #9). By pivoting its strategy from **"Catching up in Volume"** to **"Leading in Excellence,"** India has the structural foundation to challenge the traditional hegemony of the Top 3 nations in the coming decade.