# 📊 Exploratory Data Analysis: Research Publications
**Project Intern/Trainee Hiring Assessment - PAIU-OPSA, IISc Bangalore**

---
**Author:** Omkar Sharma
**Date:** November 2025
**Dashboard Link:** [Insert your Netlify/Streamlit Cloud Link Here]
**Repository:** [Insert your GitHub Link Here]

---
### 📝 Objective
The goal of this analysis is to perform an in-depth Exploratory Data Analysis (EDA) on the provided dataset to identify trends, patterns, and anomalies. The insights derived from this notebook will drive the development of an interactive dashboard.

### 📖 Table of Contents
1. [Environment Setup](#setup)
2. [Data Loading & Overview](#loading)
3. [Data Preprocessing & Cleaning](#cleaning)
4. [Exploratory Data Analysis (EDA)](#eda)
    - Univariate Analysis
    - Bivariate Analysis
    - Multivariate Analysis
5. [Key Insights & Conclusion](#conclusion)

## 1. Environment Setup <a id="setup"></a>
Importing necessary libraries for data manipulation, visualization, and statistical analysis.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px # for interactive dashboard
import plotly.figure_factory as ff
import plotly.graph_objects as go
from plotly.subplots import make_subplots  # for dual axis
from scipy import stats

# Configuration for aesthetic charts
sns.set_theme(style='whitegrid')
plt.rcParams['figure.figsize'] = (12,6)

# Hide irrelevant warnings
import warnings
warnings.filterwarnings('ignore')

## 2. Data Loading & Overview <a id="loading"></a>
Loading the dataset and performing a preliminary check to understand the structure, features, and data types.

In [3]:
# Load the dataset
try:
    df = pd.read_csv('data/publications.csv')
    print("Dataset loaded successfully!")
except FileNotFoundError:
    print("Error: The file 'data/publications.csv' was not found. Please check the file path.")
    raise  # Stop here: every later cell needs `df` to exist

Dataset loaded successfully!


In [4]:
# Visualizing the first 5 rows
print("First 5 rows of the dataset:")
display(df.head())

# Visualizing the last 5 rows
print("\nLast 5 rows of the dataset:")
display(df.tail())

First 5 rows of the dataset:


Unnamed: 0,Name,Web of Science Documents,Times Cited,Collab-CNCI,Rank,% Docs Cited,Category Normalized Citation Impact,% Documents in Top 1%,% Documents in Top 10%,Documents in Top 1%,Documents in Top 10%,year
0,SWITZERLAND,24154,2705248,0.946748,8,97.93,1.024815,0.89,10.87,97,230,2023
1,CHINA,2185,157320,1.575928,44,99.6,0.900623,2.98,19.26,323,121,2014
2,CHINA,6896,744768,1.032983,42,95.23,1.679004,1.08,11.36,455,662,2013
3,UNITED KINGDOM,2399,177526,1.586585,3,99.21,1.444246,1.63,10.2,98,2463,2005
4,ITALY,10753,301084,0.812773,2,98.35,1.252122,0.81,17.43,440,134,2004



Last 5 rows of the dataset:


Unnamed: 0,Name,Web of Science Documents,Times Cited,Collab-CNCI,Rank,% Docs Cited,Category Normalized Citation Impact,% Documents in Top 1%,% Documents in Top 10%,Documents in Top 1%,Documents in Top 10%,year
995,UNITED KINGDOM,22195,2130720,1.276037,46,97.97,0.971705,2.9,20.73,274,1803,2024
996,BRAZIL,27344,1832048,1.565469,42,99.16,1.57703,1.39,22.49,143,1514,2020
997,SWITZERLAND,14360,1033920,0.853179,44,96.86,1.258788,2.95,15.25,224,830,2005
998,SWITZERLAND,5423,591107,0.838366,8,97.8,1.508564,0.87,18.58,151,707,2014
999,CHINA,23053,2996890,1.13527,36,96.31,1.458377,0.5,23.07,214,2073,2014


In [5]:
# Checking the shape of the data
rows, cols = df.shape
print(f"The dataset contains {rows} rows and {cols} columns.")

The dataset contains 1000 rows and 12 columns.


In [6]:
# 1. Check for missing values
missing_count = df.isnull().sum().sum()
print(f"Total Missing Values: {missing_count}")

# 2. Check for duplicates
duplicate_count = df.duplicated().sum()
print(f"Total Duplicates: {duplicate_count}")

Total Missing Values: 0
Total Duplicates: 0


In [7]:
# Getting a summary of columns, data types, and non-null values
print("Dataset Information:")
df.info()

Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 12 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   Name                                 1000 non-null   object 
 1   Web of Science Documents             1000 non-null   int64  
 2   Times Cited                          1000 non-null   int64  
 3   Collab-CNCI                          1000 non-null   float64
 4   Rank                                 1000 non-null   int64  
 5   % Docs Cited                         1000 non-null   float64
 6   Category Normalized Citation Impact  1000 non-null   float64
 7   % Documents in Top 1%                1000 non-null   float64
 8   % Documents in Top 10%               1000 non-null   float64
 9   Documents in Top 1%                  1000 non-null   int64  
 10  Documents in Top 10%                 1000 non-null   int64  
 11  year      

In [8]:
# Extract Unique Countries
unique_countries = df['Name'].unique()

# sort and Print Unique Countries
print('\nUnique Countries:')
display(sorted(unique_countries))


# Note: 'UNITED KINGDOM' and 'ENGLAND' are recorded as separate entries.


Unique Countries:


['AUSTRALIA',
 'BRAZIL',
 'CANADA',
 'CHINA',
 'ENGLAND',
 'FRANCE',
 'GERMANY',
 'INDIA',
 'ITALY',
 'JAPAN',
 'NETHERLANDS',
 'SOUTH KOREA',
 'SPAIN',
 'SWEDEN',
 'SWITZERLAND',
 'UNITED KINGDOM',
 'USA']

In [9]:
# Count rows that repeat an earlier (country, year) combination.
duplicates_count = df.duplicated(subset=['Name', 'year']).sum()
print(f"Found {duplicates_count} rows where Countries are repeated.")

Found 639 rows where Countries are repeated.
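
The count above depends on how `DataFrame.duplicated` treats the first occurrence of each pair. The following is a minimal sketch on hypothetical toy data (not the assessment dataset) showing the difference between the default `keep='first'` and `keep=False`:

```python
import pandas as pd

# Toy data with hypothetical values, used only to illustrate the counting rule
toy = pd.DataFrame({
    "Name": ["INDIA", "INDIA", "INDIA", "JAPAN", "JAPAN"],
    "year": [2020, 2020, 2020, 2020, 2021],
})

# Default keep='first': the first INDIA-2020 row is NOT flagged, the other two are
extra_rows = toy.duplicated(subset=["Name", "year"]).sum()

# keep=False: every row that belongs to a repeated pair is flagged
rows_in_repeated_pairs = toy.duplicated(subset=["Name", "year"], keep=False).sum()

# Distinct (Name, year) pairs = total rows minus the extra rows
unique_pairs = len(toy) - extra_rows

print(f"Extra rows: {extra_rows}")
print(f"Rows in repeated pairs: {rows_in_repeated_pairs}")
print(f"Distinct pairs: {unique_pairs}")
```

Because the cell above uses the default setting, 639 flagged rows out of 1000 imply 361 distinct (Name, year) pairs at this stage, before any country names are merged.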


In [10]:
# Statistical summary of numerical columns
print("Statistical Summary:")
display(df.describe())

Statistical Summary:


Unnamed: 0,Web of Science Documents,Times Cited,Collab-CNCI,Rank,% Docs Cited,Category Normalized Citation Impact,% Documents in Top 1%,% Documents in Top 10%,Documents in Top 1%,Documents in Top 10%,year
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
mean,14861.699,1296497.0,1.214932,24.722,97.41069,1.291637,1.7676,17.58979,261.327,1497.457,2013.86
std,8390.150609,967063.3,0.230261,14.108145,1.419199,0.234461,0.71711,4.3631,136.904576,844.902713,6.748477
min,512.0,21846.0,0.800182,1.0,95.0,0.900623,0.5,10.02,12.0,111.0,2003.0
25%,7616.75,507670.0,1.029402,12.0,96.15,1.08702,1.13,13.77,142.0,736.75,2008.0
50%,14711.0,1064920.0,1.214383,25.0,97.385,1.292028,1.81,17.39,261.5,1481.0,2014.0
75%,22022.25,1899791.0,1.415986,37.0,98.6525,1.499628,2.39,21.64,382.0,2202.25,2020.0
max,29959.0,4327668.0,1.599646,49.0,99.89,1.698257,3.0,24.99,499.0,2999.0,2025.0


In [11]:
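# Select numerical columns
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()

# Collect one result row per column
outlier_report = []

# Loop over all numerical columns
for col in num_cols:

    # Skip the year column (it is still named 'year' until the rename in Section 3)
    if col.lower() == 'year':
        continue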

    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1

    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # count outliers
    num_outliers = ((df[col] < lower_bound) | (df[col] > upper_bound)).sum()

    # add on report
    outlier_report.append({
        'Column Name': col,
        'Outlier Count': num_outliers,
    })

# create dataframe and display
outlier_df = pd.DataFrame(outlier_report)
outlier_df.sort_values(by='Outlier Count', ascending=False, inplace=True)

# display with style
print("Outlier Detection Summary Table:")
display(outlier_df.style.background_gradient(cmap='Reds', subset=['Outlier Count']))

Outlier Detection Summary Table:


Unnamed: 0,Column Name,Outlier Count
1,Times Cited,7
0,Web of Science Documents,0
2,Collab-CNCI,0
3,Rank,0
4,% Docs Cited,0
5,Category Normalized Citation Impact,0
6,% Documents in Top 1%,0
7,% Documents in Top 10%,0
8,Documents in Top 1%,0
9,Documents in Top 10%,0


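### 📋 Data Dictionary
Column names below are the ones used after the renaming step in Section 3 (for example, `Name` becomes `Country` and `year` becomes `Year`).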

| Column Name | Description | Data Type |
| :--- | :--- | :--- |
| **Country** | Name of the Country | `String` |
| **Documents** | Total count of research papers published by the country | `Integer` |
| **Times Cited** | Total number of citations received by the published papers | `Integer` |
| **Collab-CNCI** | Category Normalized Citation Impact score for collaborative papers only | `Float` |
| **Rank** | Ranking position of the country | `Integer` |
| **% Docs Cited** | Percentage of documents that have received at least one citation | `Float` |
| **CNCI** | Impact score normalized by subject, year, and type (1.0 = World Average) | `Float` |
| **% Documents in Top 1%** | Percentage of papers that are in the global top 1% of most cited papers | `Float` |
| **% Documents in Top 10%** | Percentage of papers that are in the global top 10% of most cited papers | `Float` |
| **Documents in Top 1%** | Absolute count of papers in the global top 1% | `Integer` |
| **Documents in Top 10%** | Absolute count of papers in the global top 10% | `Integer` |
| **Year** | The specific year of publication for the data record | `Integer` |
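
The dictionary above can double as a lightweight schema check. The sketch below is illustrative only: the `expected_kinds` mapping, the `check_schema` helper, and the toy frame are hypothetical names and values, not part of the original notebook.

```python
import pandas as pd
from pandas.api import types as ptypes

# Hypothetical subset of the data dictionary: column name -> expected kind
expected_kinds = {
    "Country": "string",
    "Documents": "integer",
    "CNCI": "float",
    "Year": "integer",
}

# Map each kind to a pandas dtype predicate
checkers = {
    "string": ptypes.is_string_dtype,
    "integer": ptypes.is_integer_dtype,
    "float": ptypes.is_float_dtype,
}

def check_schema(frame, expected):
    """Return a list of human-readable schema problems (empty if none)."""
    problems = []
    for col, kind in expected.items():
        if col not in frame.columns:
            problems.append(f"{col}: column is missing")
        elif not checkers[kind](frame[col]):
            problems.append(f"{col}: expected {kind}, got {frame[col].dtype}")
    return problems

# Toy frame in which 'Year' was (wrongly) read as text
toy = pd.DataFrame({
    "Country": ["INDIA", "JAPAN"],
    "Documents": [100, 200],
    "CNCI": [1.1, 1.2],
    "Year": ["2020", "2021"],
})

problems = check_schema(toy, expected_kinds)
print(problems)
```

An empty list means every listed column matches its expected kind; here only the deliberately broken `Year` column is reported.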

<div class="alert alert-block alert-info">
<b>üßê Initial Observations:</b>
<ul>
    <li>The dataset contains <b>1000</b> rows and <b>12</b> columns.</li>
    <li>There are <b>no missing values</b> in the dataset.</li>
    <li>All <b>Data Types</b> appear to be correct (Numerical columns are recognized properly).</li>
    <li><b>Outliers:</b> Under the 1.5 × IQR rule, only the 'Times Cited' column contains outliers (7 unusually high values).</li>
    <li><b>Data Consistency Issue:</b>
        <ul>
            <li>The dataset lists both <i>'United Kingdom'</i> and <i>'England'</i> separately.</li>
            <li><b>Duplicate Entries per Year:</b> There are multiple rows for the same Country in the same Year, which requires aggregation.</li>
        </ul>
    </li>
</ul>
</div>
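
The outlier counts in this section come from the 1.5 × IQR rule. As a minimal sketch, the helper below (`iqr_bounds` is a hypothetical name, and the toy series is made up) shows the rule on a small example and then recomputes the upper fence for 'Times Cited' from the quartiles printed by `describe()` above:

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences of the k * IQR rule for a Series."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy series with one obviously extreme value
toy = pd.Series([10, 12, 13, 14, 15, 16, 18, 95])
lower, upper = iqr_bounds(toy)
outliers = toy[(toy < lower) | (toy > upper)]
print(f"Fences: ({lower}, {upper}); outliers: {outliers.tolist()}")

# Upper fence for 'Times Cited', using the 25% and 75% values from describe()
q1_reported, q3_reported = 507_670.0, 1_899_791.0
upper_times_cited = q3_reported + 1.5 * (q3_reported - q1_reported)
print(f"Upper fence for Times Cited: {upper_times_cited:,.1f}")
```

The recomputed fence is roughly 3.99 million citations, which is consistent with the column maximum of 4,327,668 being flagged.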

## 3. Data Preprocessing & Cleaning <a id="cleaning"></a>

Before proceeding to analysis, we conducted a rigorous data quality check. While the dataset contained no missing values, we identified inconsistencies in country naming and redundancy in annual records that required intervention.

In [12]:
# 1. Renaming Columns for better readability
df.rename(columns={
    'Name': 'Country',
    'Category Normalized Citation Impact': 'CNCI',
    'Web of Science Documents': 'Documents',
    'year': 'Year'
}, inplace=True)

# 2. Precision Formatting
df = df.round({
    'Collab-CNCI': 2,
    'CNCI': 2,
    '% Docs Cited': 2,
    '% Documents in Top 1%': 2,
    '% Documents in Top 10%': 2
})

print("Columns Renamed and Precision Set.")
display(df.head(3))

Columns Renamed and Precision Set.


Unnamed: 0,Country,Documents,Times Cited,Collab-CNCI,Rank,% Docs Cited,CNCI,% Documents in Top 1%,% Documents in Top 10%,Documents in Top 1%,Documents in Top 10%,Year
0,SWITZERLAND,24154,2705248,0.95,8,97.93,1.02,0.89,10.87,97,230,2023
1,CHINA,2185,157320,1.58,44,99.6,0.9,2.98,19.26,323,121,2014
2,CHINA,6896,744768,1.03,42,95.23,1.68,1.08,11.36,455,662,2013


In [13]:
# Check Specific Year data of England and UK
target_year = 2017

# Filter data for England & UK for that specific year
specific_check = df[
    (df['Country'].isin(['ENGLAND', 'UNITED KINGDOM'])) & 
    (df['Year'] == target_year)
]

print(f"Data Comparison for Year: {target_year}")
display(specific_check)

Data Comparison for Year: 2017


Unnamed: 0,Country,Documents,Times Cited,Collab-CNCI,Rank,% Docs Cited,CNCI,% Documents in Top 1%,% Documents in Top 10%,Documents in Top 1%,Documents in Top 10%,Year
21,ENGLAND,19860,2919420,1.21,40,95.67,1.17,0.78,23.87,487,1734,2017
53,UNITED KINGDOM,25789,2811001,1.06,3,96.24,1.22,1.93,21.11,465,928,2017
458,ENGLAND,14562,1266894,1.25,16,99.05,0.93,1.55,12.52,246,1317,2017
469,ENGLAND,29654,3232286,1.56,44,96.54,1.1,2.24,15.46,397,1881,2017
479,ENGLAND,17013,1344027,1.33,32,99.17,1.06,2.84,20.34,138,1006,2017
664,UNITED KINGDOM,5051,136377,1.16,7,95.94,0.98,2.78,22.31,457,347,2017


In [14]:
# 1. Rename 'ENGLAND' to 'UNITED KINGDOM'
df.loc[df['Country'] == 'ENGLAND', 'Country'] = 'UNITED KINGDOM'

print("Successfully changed all 'ENGLAND' entries to 'UNITED KINGDOM'.")

# 2. Verification
target_year = 2017
uk_rows = df[(df['Country'] == 'UNITED KINGDOM') & (df['Year'] == target_year)]

print(f"\nTotal Rows for United Kingdom in {target_year}: {len(uk_rows)}")
display(uk_rows)

Successfully changed all 'ENGLAND' entries to 'UNITED KINGDOM'.

Total Rows for United Kingdom in 2017: 6


Unnamed: 0,Country,Documents,Times Cited,Collab-CNCI,Rank,% Docs Cited,CNCI,% Documents in Top 1%,% Documents in Top 10%,Documents in Top 1%,Documents in Top 10%,Year
21,UNITED KINGDOM,19860,2919420,1.21,40,95.67,1.17,0.78,23.87,487,1734,2017
53,UNITED KINGDOM,25789,2811001,1.06,3,96.24,1.22,1.93,21.11,465,928,2017
458,UNITED KINGDOM,14562,1266894,1.25,16,99.05,0.93,1.55,12.52,246,1317,2017
469,UNITED KINGDOM,29654,3232286,1.56,44,96.54,1.1,2.24,15.46,397,1881,2017
479,UNITED KINGDOM,17013,1344027,1.33,32,99.17,1.06,2.84,20.34,138,1006,2017
664,UNITED KINGDOM,5051,136377,1.16,7,95.94,0.98,2.78,22.31,457,347,2017


In [15]:
#Checking Outliers using IQR in Times Cited
col_name = 'Times Cited'

Q1 = df[col_name].quantile(0.25)
Q3 = df[col_name].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Find the rows that are outliers
outliers = df[(df[col_name] < lower_bound) | (df[col_name] > upper_bound)]

outliers


# Note: these outliers are not data errors; they reflect exceptionally high citation counts (elite performance).

Unnamed: 0,Country,Documents,Times Cited,Collab-CNCI,Rank,% Docs Cited,CNCI,% Documents in Top 1%,% Documents in Top 10%,Documents in Top 1%,Documents in Top 10%,Year
34,SWITZERLAND,28337,4080528,1.47,22,96.23,0.93,1.26,18.06,110,2411,2025
552,JAPAN,27698,4043908,1.18,48,99.06,1.56,2.27,11.58,125,546,2018
560,JAPAN,27450,4007700,1.14,24,96.51,1.09,0.68,21.16,170,2431,2017
632,CHINA,29594,4024784,1.56,7,97.61,1.22,0.92,24.66,206,303,2007
657,CANADA,29353,4226832,0.94,39,99.89,1.11,1.2,23.96,106,1993,2004
861,GERMANY,29241,4327668,1.34,13,96.6,1.5,0.7,10.73,471,739,2015
945,USA,29430,4149630,1.21,49,99.52,1.51,2.33,20.64,136,296,2006


In [16]:
#Aggregating Repeated Countries.

# Logic:
# Quantitative columns -> SUM 
# Quality/Percentage columns -> MEAN 
# Rank -> MIN (because rank 1 is better than rank 10)

agg_rules = {
    'Documents': 'sum',
    'Times Cited': 'sum',
    'Documents in Top 1%': 'sum',
    'Documents in Top 10%': 'sum',
    'CNCI': 'mean',
    'Collab-CNCI': 'mean',
    '% Docs Cited': 'mean',
    '% Documents in Top 1%': 'mean',
    '% Documents in Top 10%': 'mean',
    'Rank': 'min' #to get better rank
}

# Compressing data by Groupby 
df_clean = df.groupby(['Country', 'Year'], as_index=False).agg(agg_rules)

# Rounding off again after mean calculation to keep it clean
df_clean = df_clean.round(2)

print(f"Aggregation Complete. Dataset shape changed from {df.shape} to {df_clean.shape}")
display(df_clean.head(30))

# Use the aggregated dataset from here on
df = df_clean.copy()

Aggregation Complete. Dataset shape changed from (1000, 12) to (340, 12)


Unnamed: 0,Country,Year,Documents,Times Cited,Documents in Top 1%,Documents in Top 10%,CNCI,Collab-CNCI,% Docs Cited,% Documents in Top 1%,% Documents in Top 10%,Rank
0,AUSTRALIA,2003,73479,3965411,1952,7645,1.39,1.33,96.88,1.76,13.41,1
1,AUSTRALIA,2004,34122,4026396,561,2594,1.47,1.34,97.88,2.02,11.29,1
2,AUSTRALIA,2005,76888,4458568,1177,8504,1.15,1.34,97.31,1.83,16.21,7
3,AUSTRALIA,2006,69315,5190781,1369,6318,1.24,1.2,96.39,1.47,19.8,12
4,AUSTRALIA,2007,11637,1443993,715,764,1.36,1.35,98.12,1.68,14.6,24
5,AUSTRALIA,2008,124250,9851009,1875,7333,1.37,1.1,96.71,2.48,20.33,5
6,AUSTRALIA,2009,22569,767346,87,2861,1.4,1.08,99.72,2.82,21.41,21
7,AUSTRALIA,2010,39018,2236904,579,4933,1.25,1.24,96.43,2.31,16.38,27
8,AUSTRALIA,2011,8580,909480,209,186,1.69,1.21,95.52,1.61,10.72,19
9,AUSTRALIA,2012,64809,4733799,1536,7240,1.16,1.32,97.47,1.8,20.84,8


<div class="alert alert-block alert-success">
<b>✅ Data Readiness Summary:</b>

The data preprocessing phase is complete. Here is the summary of actions taken:
<ul>
    <li><b>Standardization:</b> Column names normalized (e.g., 'Name' -> 'Country', 'Category Normalized...' -> 'CNCI').</li>
    <li><b>Precision:</b> All numerical impact metrics rounded to 2 decimal places.</li>
    <li><b>Entity Resolution:</b> 'England' entries were replaced with 'United Kingdom' to ensure consistent country-level analysis.</li>
    <li><b>Data Aggregation:</b> Multiple entries for the same Country-Year combination were aggregated. Count metrics (e.g., Documents, Times Cited) were <b>summed</b>, quality and percentage metrics (e.g., CNCI) were <b>averaged</b>, and the best (minimum) <b>Rank</b> was kept, giving a single row per country per year.</li>
    <li><b>Outlier Decision:</b> The 'Times Cited' column shows significant outliers. These are <b>not errors</b> but represent "Elite Performance". Removing them would hide the most important insights regarding global research leadership. Hence, <b>we have retained all outliers.</b></li>
</ul>
We are now ready for <b>Exploratory Data Analysis (EDA).</b>
</div>
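
The entity resolution and aggregation rules summarized above can be sketched on a small hypothetical frame (the values below are made up for illustration and only three of the metrics are shown):

```python
import pandas as pd

# Toy records: three rows collapse into one UNITED KINGDOM row for 2017
toy = pd.DataFrame({
    "Country": ["UNITED KINGDOM", "ENGLAND", "ENGLAND", "JAPAN"],
    "Year": [2017, 2017, 2017, 2017],
    "Documents": [100, 200, 300, 50],
    "CNCI": [1.2, 1.0, 1.1, 1.5],
    "Rank": [3, 40, 16, 7],
})

# Entity resolution: treat ENGLAND rows as UNITED KINGDOM
toy["Country"] = toy["Country"].replace({"ENGLAND": "UNITED KINGDOM"})

# One row per (Country, Year): sum counts, average quality scores, keep best rank
aggregated = (
    toy.groupby(["Country", "Year"], as_index=False)
       .agg({"Documents": "sum", "CNCI": "mean", "Rank": "min"})
       .round(2)
)
print(aggregated)
```

Note that `mean` here is an unweighted average of the rows in each group, which is the same choice the notebook makes.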

## 4. Exploratory Data Analysis (EDA) <a id="eda"></a>
Here, we dive deep into the data to answer specific questions and uncover patterns.

### 4.1 Univariate Analysis

#### Q1 Top Performers Analysis: Quantity vs. Quality

**Objective:**
To identify the leading Countries in the dataset based on two distinct performance metrics:

1.  **Volume (Quantity):** Who is producing the most research?
    *   *Metric used:* `Documents`
2.  **Impact (Quality):** Who is producing the most influential research relative to their field?
    *   *Metric used:* `Category Normalized Citation Impact (CNCI)`
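
To make the two rankings concrete, here is a minimal sketch on hypothetical toy data showing why volume is ranked by a sum while quality is ranked by a mean. It uses `Series.nlargest` as a shorthand for sorting in descending order and taking the head:

```python
import pandas as pd

# Toy data: country A publishes a lot, country B is cited more per paper
toy = pd.DataFrame({
    "Country": ["A", "A", "B", "B", "C", "C"],
    "Documents": [900, 1100, 300, 200, 50, 70],
    "CNCI": [1.05, 1.10, 1.40, 1.50, 1.20, 1.30],
})

# Volume: total documents per country across all years
top_volume = toy.groupby("Country")["Documents"].sum().nlargest(2)

# Quality: average CNCI per country across all years
top_quality = toy.groupby("Country")["CNCI"].mean().nlargest(2)

print(top_volume)
print(top_quality)
```

In this toy case the volume leader (A) does not appear in the quality top two, which is the kind of separation the charts below are designed to reveal.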

In [26]:
# --- Step 1: Prepare the Data ---

# A. Quantity: Sum of all documents across years for each Country
volume_df = df.groupby('Country')['Documents'].sum().reset_index()
top_volume = volume_df.sort_values(by='Documents', ascending=False).head(5)

# B. Quality: Average CNCI across years for each Country
quality_df = df.groupby('Country')['CNCI'].mean().reset_index()
top_quality = quality_df.sort_values(by='CNCI', ascending=False).head(5)

# --- Step 2: Visualization - Volume (Quantity) ---

fig_vol = px.bar(
    top_volume,
    x='Documents',
    y='Country',
    orientation='h',  # Horizontal bar chart
    title='<b>Top 5 Countries by Research Volume</b><br><i>(Total Web of Science Documents)</i>',
    text_auto='.2s',  # Format: 1.5k, 2M etc. (Smart formatting)
    color='Documents',
    color_continuous_scale='Viridis',
    labels={'Documents': 'Total Documents'}
)

# Layout Updates: Reverse Y-axis to show Rank #1 at top
fig_vol.update_layout(
    yaxis=dict(autorange="reversed"),
    xaxis_title="Total Documents",
    coloraxis_showscale=False # Hides the color bar to keep it clean
)

fig_vol.show()

# --- Step 3: Visualization - Quality (Impact) ---

fig_qual = px.bar(
    top_quality,
    x='CNCI',
    y='Country',
    orientation='h',
    title='<b>Top 5 Countries by Research Quality</b><br><i>(Average Category Normalized Citation Impact)</i>',
    text_auto='.3f',  # Format: 3 decimal places (e.g., 1.523)
    color='CNCI',
    color_continuous_scale='Magma',
    labels={'CNCI': 'Avg CNCI'}
)

# Layout Updates: Reverse Y-axis & Add Benchmark Line
fig_qual.update_layout(
    yaxis=dict(autorange="reversed"),
    xaxis_title="Average CNCI",
    coloraxis_showscale=False
)

# Adding the Global Average Benchmark Line (x=1.0)
fig_qual.add_vline(
    x=1.0, 
    line_dash="dash", 
    line_color="green", 
    annotation_text="Global Avg (1.0)", 
    annotation_position="bottom right"
)

fig_qual.show()

<div class="alert alert-block alert-warning">
<b>üí° Insight:</b> The analysis reveals a distinct separation between <b>Quantity</b> and <b>Quality</b> leaders:
<ul>
    <li><b>Volume Leaders ("Mass Producers"):</b> Countries like <b>United Kingdom</b> and <b>Spain</b> dominate in total <code>Documents</code>, indicating massive research scale.</li>
    <li><b>Quality Leaders ("Elite Impact"):</b> However, the <code>CNCI</code> rankings are topped by nations like <b>Japan</b> and <b>Italy</b>, showing that high volume does not always guarantee high average impact.</li>
</ul>
This suggests a potential trade-off between scaling research output and maintaining elite citation performance.
</div>

#### Q2 Ranges & Distributions Analysis: Research Relevance

**Objective:**
To analyze the spread and consistency of research relevance to understand how effectively published papers attract attention. This helps us determine if "getting cited" is a common standard or a rare achievement in this dataset.

1.  **Spread & Range:** What are the boundaries of performance?
    *   *Metric used:* `% Docs Cited` (Minimum vs. Maximum values).
2.  **Distribution Shape:** How is the performance distributed across the dataset?
    *   *Left Skewed:* Indicates most countries achieve high citation rates (Good sign).
    *   *Right Skewed:* Indicates most countries have low citation rates (Bad sign).
3.  **Consistency:** Is the citation rate stable across different countries and years, or are there massive disparities in performance?
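
As a quick illustration of how the sign of the skewness maps to the shapes described above, the sketch below uses three small made-up series. It relies on pandas' `Series.skew`, the same method used later in this section:

```python
import pandas as pd

# Long tail to the right: a few unusually high values
right_skewed = pd.Series([95.0, 95.1, 95.2, 95.3, 99.9])
# Long tail to the left: a few unusually low values
left_skewed = pd.Series([99.9, 99.8, 99.7, 99.6, 95.0])
# Evenly spread around the centre
symmetric = pd.Series([95.0, 96.0, 97.0, 98.0, 99.0])

for name, series in [("right", right_skewed), ("left", left_skewed), ("symmetric", symmetric)]:
    print(f"{name:>9}: skew={series.skew():+.2f}, "
          f"mean={series.mean():.2f}, median={series.median():.2f}")
```

A positive skew comes with the mean above the median, a negative skew with the mean below it, and a value near zero with the two roughly equal.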

In [35]:
# --- Step 1: Calculate Statistics ---
mean_val = df['% Docs Cited'].mean()
median_val = df['% Docs Cited'].median()

# --- Step 2: Create Histogram with Marginal Box Plot ---
fig_dist = px.histogram(
    df, 
    x='% Docs Cited', 
    nbins=30, 
    title='Distribution of Research Relevance (% Docs Cited)',
    marginal='box',  # Adds a box plot above the histogram
    color_discrete_sequence=['#00CC96'], 
    opacity=0.8
)

# --- Step 3: Add Reference Lines & Annotations ---

# Add vertical lines (Without text here to avoid duplication on the marginal plot)
fig_dist.add_vline(x=mean_val, line_dash="dash", line_color="red")
fig_dist.add_vline(x=median_val, line_dash="dot", line_color="blue")

# Add Mean Annotation (Placed relative to paper/layout height)
fig_dist.add_annotation(
    x=mean_val,
    y=1.15,          # Placed higher to avoid overlap
    yref="paper",    # References the figure layout, not data points
    text=f"Mean: {mean_val:.1f}%",
    showarrow=False,
    font=dict(color="red")
)

# Add Median Annotation
fig_dist.add_annotation(
    x=median_val,
    y=1.08,          # Placed slightly lower than Mean
    yref="paper",
    text=f"Median: {median_val:.1f}%",
    showarrow=False,
    font=dict(color="blue")
)

# --- Step 4: Final Layout Adjustments ---
fig_dist.update_layout(
    xaxis_title='Percentage of Documents Cited',
    yaxis_title='Count',
    bargap=0.1
)

fig_dist.show()

# --- Step 5: Print Statistics ---
print("Statistical Summary of % Docs Cited:")
print(df['% Docs Cited'].describe())
print("-" * 30)
print(f"Skewness: {df['% Docs Cited'].skew():.2f}") 


Statistical Summary of % Docs Cited:
count    340.000000
mean      97.404941
std        0.991035
min       95.060000
25%       96.780000
50%       97.390000
75%       98.025000
max       99.860000
Name: % Docs Cited, dtype: float64
------------------------------
Skewness: 0.05


<div class="alert alert-block alert-warning">
<b>üí° Insight:</b> The analysis of <code>% Docs Cited</code> reveals a highly <b>Consistent and Balanced</b> landscape:
<ul>
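    <li><b>Approximately Symmetric Distribution:</b> With a skewness of just <b>0.05</b> and the <b>Mean</b> (97.40) approximately equal to the <b>Median</b> (97.39), the distribution is close to symmetric and roughly bell-shaped. Symmetry alone does not establish normality, so this is a descriptive observation rather than a formal test result.</li>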
    <li><b>High Consistency:</b> This indicates that ensuring papers are cited is a <b>Global Standard</b>. Unlike volume or impact (CNCI) which vary widely, the "citation rate" is stable across most countries.</li>
</ul>
This suggests that there are very few "underperformers" regarding relevance: most published research succeeds in getting noticed by the academic community.
</div>
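
A skewness near zero describes symmetry, which can be followed up with a formal normality check if needed. The sketch below is an assumption-laden illustration on synthetic samples (not the assessment data): it uses the Shapiro-Wilk test from `scipy.stats`, which is one possible choice and was not part of the original analysis.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two synthetic samples of the same size as the aggregated dataset (340 rows)
samples = {
    "uniform": rng.uniform(95.0, 100.0, size=340),  # symmetric but flat
    "normal": rng.normal(97.4, 1.0, size=340),      # symmetric and bell-shaped
}

results = {}
for name, sample in samples.items():
    skewness = stats.skew(sample)
    _, p_value = stats.shapiro(sample)  # small p-value: evidence against normality
    results[name] = (skewness, p_value)
    print(f"{name:>8}: skew={skewness:+.2f}, Shapiro-Wilk p={p_value:.4f}")
```

The uniform sample has a skewness close to zero yet is clearly rejected as normal, so the same test applied to `df['% Docs Cited']` would be a natural way to back up the bell-curve reading.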

#### Q3 Benchmarks Analysis: Performance vs. Global Standards

**Objective:**
To evaluate national research performance against the established **Global Benchmark** to identify systemic strengths or weaknesses. This analysis helps distinguish between nations that consistently deliver "Elite" quality versus those that struggle to meet global expectations.

1.  **The Yardstick (Metric Definition):** How do we measure success?
    *   *Metric used:* `Category Normalized Citation Impact` (CNCI).
    *   **CNCI > 1.0:** Indicates research is performing **better** than the world average.
    *   **CNCI < 1.0:** Indicates research is performing **worse** than the world average.

2.  **Structural Strength (Aggregated View):** Are there any nations that are fundamentally weak?
    *   *Method:* Grouping data by `Country` to see if any nation's *average* performance falls below 1.0.

3.  **Consistency Check (Granular View):** How stable is the quality over time?
    *   *Method:* Analyzing individual year-wise data points to spot specific instances (anomalies) where a high-performing country might have had an "off-year" (CNCI < 1.0).
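
The difference between the aggregated and granular views can be sketched on hypothetical toy data, where a country clears the benchmark on average but still has one year below it:

```python
import pandas as pd

# Toy data: country A has one "off-year" (2020) but a healthy average
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "B"],
    "Year": [2019, 2020, 2021, 2019, 2020, 2021],
    "CNCI": [1.20, 0.95, 1.30, 1.10, 1.05, 1.15],
})

benchmark = 1.0  # CNCI of 1.0 corresponds to the world average

# Aggregated view: average CNCI per country
country_avg = toy.groupby("Country")["CNCI"].mean()
below_on_average = country_avg[country_avg < benchmark]

# Granular view: individual country-year records below the benchmark
off_years = toy[toy["CNCI"] < benchmark]

print(f"Countries below benchmark on average: {len(below_on_average)}")
print(off_years)
```

Averaging hides the single dip, which is why both views are reported in the cell that follows.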

In [39]:
import plotly.express as px

# --- Configuration: Mapping Column Names for Readability ---
# (Adjust these if your DataFrame uses different names like 'Name' or 'Category Normalized Citation Impact')
cnci_col = 'CNCI'
country_col = 'Country'
year_col = 'Year'

# --- PART 1: The Big Picture (Grouped Analysis) ---
# Goal: Check if any country fails the benchmark when averaged over all years.
country_perf = df.groupby(country_col)[cnci_col].mean()
countries_below_benchmark = country_perf[country_perf < 1.0]

print(f"--- Global Benchmark Analysis (CNCI = 1.0) ---")
print(f"Number of countries with Overall Average CNCI < 1.0: {len(countries_below_benchmark)}")
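# Only report the "all above benchmark" observation when the data supports it
if countries_below_benchmark.empty:
    print("Observation: Overall, structurally, every country maintains a high standard (> 1.0).")
else:
    print("Countries below the benchmark on average:", ", ".join(countries_below_benchmark.index))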
print("-" * 50)

# --- PART 2: The Reality Check (Granular Analysis) ---
# Goal: Identify specific years where performance dipped below average.
below_avg_instances = df[df[cnci_col] < 1.0]

print(f"However, examining granular data (Year-by-Year):")
print(f"We found {len(below_avg_instances)} specific instances (dots) of underperformance out of {len(df)} records.")

# --- PART 3: Visualizing the Nuance ---
# Add a helper column that drives the colours: red for 'Below', light grey for 'Above'
df['Benchmark Status'] = df[cnci_col].apply(lambda x: 'Below Average (< 1.0)' if x < 1.0 else 'Above Average (>= 1.0)')

# Create a Strip Plot (Jitter Plot)
fig_strip = px.strip(
    df, 
    y=cnci_col, 
    x=country_col, # Adding X-axis makes it easier to see which country the dot belongs to
    color='Benchmark Status', # This drives the color mapping
    color_discrete_map={
        'Below Average (< 1.0)': 'red', 
        'Above Average (>= 1.0)': 'lightgrey'
    },
    hover_data=[country_col, year_col, cnci_col], 
    title='Performance Distribution: Elite Status vs. Occasional Dips',
    template='plotly_white',
    labels={cnci_col: 'CNCI (Impact)'}
)

# Add a horizontal line representing the Global Average
fig_strip.add_hline(
    y=1.0, 
    line_dash="dash", 
    line_color="black", 
    annotation_text="Global Baseline (1.0)",
    annotation_position="bottom right"
)

fig_strip.update_layout(xaxis={'categoryorder':'total descending'}) # Orders countries by the total of their plotted CNCI values
fig_strip.show()

--- Global Benchmark Analysis (CNCI = 1.0) ---
Number of countries with Overall Average CNCI < 1.0: 0
Observation: Overall, structurally, every country maintains a high standard (> 1.0).
--------------------------------------------------
However, examining granular data (Year-by-Year):
We found 14 specific instances (dots) of underperformance out of 340 records.


<div class="alert alert-block alert-warning">
<b>üí° Insight:</b> The benchmark analysis highlights the <b>robustness and consistency</b> of the nations in this dataset:
<ul>
    <li><b>Structural Strength (Aggregated View):</b> Remarkably, when averaged over time, <b>every single country</b> maintains a <code>CNCI</code> above <b>1.0</b>. This indicates there are no structurally weak performers in this group; all are producing research that exceeds the global average.</li>
    <li><b>Rare Anomalies (Granular View):</b> "Underperformance" is the exception, not the rule. Out of 340 country-year records, there are only <b>14 specific instances</b> where the <code>CNCI</code> dipped below 1.0.</li>
</ul>
This suggests that while even top nations have occasional "off-years," their long-term research trajectory is consistently <b>above the global standard</b>.
</div>

#### Q4 Outlier Detection Analysis

Identify data points that deviate significantly from the rest of the dataset. Detecting outliers is crucial because they can either represent errors in the data or, more likely in this context, "Elite Performers" or "Mass Producers" that define the upper limits of research performance.

**Key Questions to Answer**
1. Which countries/years are producing an exceptionally high volume of documents compared to the global norm?
2. Are there specific instances (Country or Year) where the Citation Impact (CNCI) is suspiciously high or low?
3. Do these outliers represent a consistent trend for a specific country, or are they one-off events?

In [41]:

# ==============================================================================
# PART 1: OUTLIERS IN QUANTITY (DOCUMENTS)
# ==============================================================================

# --- 1.1 Visualization ---
# Creating a Box Plot to visualize the spread and spot dots outside the whiskers
fig_docs = px.box(
    df_clean, 
    y="Documents",
    points="all",  # Display every data point next to the box
    hover_data=["Country", "Year"], # Essential context on hover
    title="<b>Outlier Detection: Research Volume (Documents)</b>",
    color_discrete_sequence=['#EF553B'] # Red-Orange theme
)

fig_docs.update_layout(
    yaxis_title="Number of Documents",
    xaxis_title="Global Distribution",
    template="plotly_white",
    height=600
)

fig_docs.show()

# --- 1.2 Outlier Table Generation (IQR Method) ---
# Calculating the Interquartile Range (IQR) to mathematically identify outliers
Q1_doc = df_clean['Documents'].quantile(0.25)
Q3_doc = df_clean['Documents'].quantile(0.75)
IQR_doc = Q3_doc - Q1_doc
upper_bound_doc = Q3_doc + 1.5 * IQR_doc

# Filtering rows that exceed the upper bound
doc_outliers = df_clean[df_clean['Documents'] > upper_bound_doc][['Country', 'Year', 'Documents']]

# Checking if outliers exist before displaying
if not doc_outliers.empty:
    print(f"\n[Table] Found {len(doc_outliers)} Outliers in Research Volume (Documents):")
    # Sorting by Documents to show the biggest outliers first
    display(doc_outliers.sort_values(by='Documents', ascending=False).head(10).style.background_gradient(cmap='Reds'))
else:
    print("\n[Result] No statistical outliers found in Documents.")


# ==============================================================================
# PART 2: OUTLIERS IN QUALITY (CNCI)
# ==============================================================================

# --- 2.1 Visualization ---
# Box Plot for Quality Metrics
fig_cnci = px.box(
    df_clean, 
    y="CNCI",
    points="all",
    hover_data=["Country", "Year", "Documents"], # Added docs to check volume-quality relationship
    title="<b>Outlier Detection: Research Quality (CNCI)</b>",
    color_discrete_sequence=['#00CC96'] # Green theme
)

# Adding a reference line for Global Average (1.0)
fig_cnci.add_hline(y=1.0, line_dash="dash", line_color="gray", annotation_text="Global Average (1.0)")

fig_cnci.update_layout(
    yaxis_title="CNCI Score",
    xaxis_title="Global Distribution",
    template="plotly_white",
    height=600
)

fig_cnci.show()

# --- 2.2 Outlier Table Generation (IQR Method) ---
# Calculating IQR for CNCI
Q1_cnci = df_clean['CNCI'].quantile(0.25)
Q3_cnci = df_clean['CNCI'].quantile(0.75)
IQR_cnci = Q3_cnci - Q1_cnci
upper_bound_cnci = Q3_cnci + 1.5 * IQR_cnci
lower_bound_cnci = Q1_cnci - 1.5 * IQR_cnci # Identifying extremely low quality too

# Filtering rows outside the bounds
cnci_outliers = df_clean[(df_clean['CNCI'] > upper_bound_cnci) | (df_clean['CNCI'] < lower_bound_cnci)][['Country', 'Year', 'CNCI', 'Documents']]

# Checking if outliers exist before displaying
if not cnci_outliers.empty:
    print(f"\n[Table] Found {len(cnci_outliers)} Outliers in Research Quality (CNCI):")
    display(cnci_outliers.sort_values(by='CNCI', ascending=False).head(10).style.background_gradient(cmap='Greens'))
else:
    print("\n[Result] No statistical outliers found in CNCI.")


# Check consistency: Count how many times a country appears in the outlier list
if not doc_outliers.empty:
    print("\n--- Consistency Check (Is it a Trend?) ---")
    print(doc_outliers['Country'].value_counts())


[Table] Found 6 Outliers in Research Volume (Documents):


            Country  Year  Documents
149           ITALY  2004     149432
66            CHINA  2007     148329
310  UNITED KINGDOM  2016     134985
228     SOUTH KOREA  2023     127999
5         AUSTRALIA  2008     124250
27           BRAZIL  2012     119368



[Result] No statistical outliers found in CNCI.

--- Consistency Check (Is it a Trend?) ---
Country
AUSTRALIA         1
BRAZIL            1
CHINA             1
ITALY             1
SOUTH KOREA       1
UNITED KINGDOM    1
Name: count, dtype: int64


<div class="alert alert-block alert-warning">
<b>üí° Insight:</b> The outlier analysis highlights a fundamental difference in how <b>Quantity</b> and <b>Quality</b> are distributed across the research landscape:
<ul>
    <li><b>Volume Extremes ("Mega-Producers"):</b> The <code>Documents</code> distribution is highly skewed, with six records above the IQR upper bound. <b>Italy</b> (2004) and <b>China</b> (2007) are the most extreme cases, indicating that specific years for these nations produced output far beyond the global median. Each flagged country appears only once in the outlier list, so these are isolated peak years rather than sustained trends.</li>
    <li><b>Quality Consistency ("The Quality Ceiling"):</b> In contrast, the <code>CNCI</code> distribution shows <b>no statistical outliers</b> under the same 1.5 × IQR rule (despite a slightly elongated upper whisker). This suggests that while research <i>volume</i> can be scaled up dramatically, <i>citation impact</i> tends to remain within a predictable range, making it difficult to achieve "abnormally high" quality scores.</li>
</ul>
This implies that while a country can decide to drastically increase its publication count, achieving statistically extreme <b>Impact Scores</b> is significantly harder and rarer.
</div>
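
The IQR rule applied to both metrics above can be wrapped in a small helper so the bounds are computed the same way each time. The sketch below is illustrative only: `iqr_outliers` and the toy data are hypothetical and not part of the original notebook, and unlike the `Documents` check above (upper bound only) the helper flags values on both sides.

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (series < lower) | (series > upper)

# Toy data (hypothetical): five typical values and one extreme value
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D", "E", "F"],
    "Documents": [10, 12, 11, 13, 12, 100],
})

mask = iqr_outliers(toy["Documents"])
flagged = toy[mask]
print(flagged)
```

Assuming the same column names as in the notebook, the helper could be called as `df_clean[iqr_outliers(df_clean['CNCI'])]` to reproduce the two-sided CNCI filter.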

#### Q5 Funnel Check: Elite Research Conversion (% Documents in Top 1%)

"Top 1%" refers to the papers that are in the top 1% of the most cited papers in their field for a given year.

**The Benchmark:** Theoretically, the global average should be 1%.
* **Exceptional performance:** If a country has > 2%, it is producing "Elite" research at more than twice the world-average rate.
* **Underperformance:** If a country has < 1%, it is struggling to produce highly influential papers.

**Key Questions to Answer:**
1. What is the average conversion rate across the entire dataset? Is it close to the expected 1%?
2. How is the data distributed? Is it skewed?
3. Which countries consistently achieve a high percentage (e.g., > 2% or 3%)?
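
The benchmark thresholds above can be turned into explicit labels, which makes it easy to count how many records fall into each tier. This is a minimal sketch on made-up values; the tier names and the sample series are assumptions for illustration, not columns from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values for '% Documents in Top 1%'
pct = pd.Series([0.6, 1.0, 1.4, 2.0, 2.5, 3.1], name="% Documents in Top 1%")

# Strict inequalities mirror the benchmark: < 1% underperforms, > 2% is exceptional
conditions = [(pct < 1.0).to_numpy(), (pct > 2.0).to_numpy()]
choices = ["Underperforming", "Exceptional"]
tiers = pd.Series(np.select(conditions, choices, default="Baseline"), index=pct.index)

tier_counts = tiers.value_counts()
print(tier_counts)
```

Values of exactly 1% or 2% land in the middle tier here because the benchmark uses strict inequalities.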

In [24]:
# --- Step 1: Calculate Statistics ---
avg_elite = df['% Documents in Top 1%'].mean()
median_elite = df['% Documents in Top 1%'].median()
max_elite = df['% Documents in Top 1%'].max()

print("--- Statistics for % Documents in Top 1% ---")
print(f"Mean (Average):   {avg_elite:.2f}%")
print(f"Median:           {median_elite:.2f}%")
print(f"Max Value:        {max_elite:.2f}%")
print("-" * 40)

# --- Step 2: Identification of Top Performers (Rows with > 2% Elite Docs) ---
# Keep only the identifying columns alongside the elite-share metric
elite_performers = df[df['% Documents in Top 1%'] > 2.0][['Country', 'Year', '% Documents in Top 1%']]

print(f"\nNumber of records with > 2% Elite Papers: {len(elite_performers)}")
print("Top 5 examples of High Performance:")
print(elite_performers.sort_values(by='% Documents in Top 1%', ascending=False).head(5))

# --- Step 3: Visualization (Histogram) ---
fig = px.histogram(
    df, 
    x='% Documents in Top 1%',
    nbins=40,
    title='Distribution of Elite Research (% Documents in Top 1%)',
    marginal='box', 
    color_discrete_sequence=['gold'], 
    labels={'% Documents in Top 1%': 'Percentage of Documents in Top 1%'}
)



fig.add_vline(x=1.0, line_dash="dash", line_color="red")
fig.add_vline(x=avg_elite, line_dash="dot", line_color="blue")

# Global Baseline Text
fig.add_annotation(
    x=1.0, 
    y=1.08,
    yref="paper", 
    text="Global Baseline (1%)", 
    showarrow=False, 
    font=dict(color="red", size=10)
)

# Dataset Avg Text
fig.add_annotation(
    x=avg_elite, 
    y=1.08, 
    yref="paper", 
    text=f"Dataset Avg ({avg_elite:.2f}%)", 
    showarrow=False, 
    font=dict(color="blue", size=10)
)

fig.update_layout(bargap=0.1)
fig.show()

--- Statistics for % Documents in Top 1% ---
Mean (Average):   1.77%
Median:           1.76%
Max Value:        2.96%
----------------------------------------

Number of records with > 2% Elite Papers: 106
Top 5 examples of High Performance:
         Country  Year  % Documents in Top 1%
107      GERMANY  2005                   2.96
192  NETHERLANDS  2004                   2.95
28        BRAZIL  2013                   2.93
290  SWITZERLAND  2018                   2.92
150        ITALY  2005                   2.89
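
A mean (1.77%) that sits almost on top of the median (1.76%) hints that the distribution is close to symmetric. One way to check this directly is the sample skewness from `Series.skew()`. The sketch below uses two hypothetical samples, not the real data, to show how the mean, median, and skewness behave together.

```python
import pandas as pd

# Hypothetical values: one roughly symmetric sample and one right-skewed sample
symmetric = pd.Series([1.2, 1.5, 1.8, 2.1, 2.4])
right_skewed = pd.Series([1.0, 1.1, 1.2, 1.3, 5.0])

for name, s in [("symmetric", symmetric), ("right_skewed", right_skewed)]:
    # A skewness near 0 suggests symmetry; a clearly positive value suggests a long right tail
    print(f"{name}: mean={s.mean():.2f}, median={s.median():.2f}, skew={s.skew():.2f}")
```

On the notebook's data, the equivalent call would be `df['% Documents in Top 1%'].skew()`, assuming the column name used above.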


In [25]:
# --- Additional Analysis: Country-wise Elite Performance ---

# Group by country to compute the average elite share
country_elite_avg = df.groupby('Country')['% Documents in Top 1%'].mean().sort_values(ascending=False).reset_index()

# Show the top 10 countries
print("--- Top 10 Countries by Average % in Top 1% ---")
print(country_elite_avg.head(10))

# Visualization (Bar Chart)
fig_bar = px.bar(
    country_elite_avg.head(10),
    x='% Documents in Top 1%',
    y='Country',
    orientation='h',
    title='Top 10 Countries: Who consistently produces Elite Research?',
    color='% Documents in Top 1%',
    color_continuous_scale='Teal'
)
fig_bar.update_layout(yaxis=dict(autorange="reversed")) # Rank 1 appears at the top
fig_bar.show()

--- Top 10 Countries by Average % in Top 1% ---
          Country  % Documents in Top 1%
0          SWEDEN               1.933333
1          BRAZIL               1.909524
2       AUSTRALIA               1.903158
3         GERMANY               1.827273
4  UNITED KINGDOM               1.826818
5           ITALY               1.802609
6           CHINA               1.789091
7     NETHERLANDS               1.771000
8           INDIA               1.765714
9             USA               1.749000
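
An average alone does not show whether a country is consistently above a threshold, since a single exceptional year can lift the mean. A possible complement is the share of years each country spends above 2%. The sketch below uses hypothetical records; `share_years_above_2` and the other summary column names are illustrative, not from the original notebook.

```python
import pandas as pd

col = "% Documents in Top 1%"

# Hypothetical country-year records: A is steady, B has one spike
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "B"],
    "Year": [2001, 2002, 2003, 2001, 2002, 2003],
    col: [2.1, 2.3, 2.2, 0.5, 0.6, 5.5],
})

summary = (
    toy.assign(above_2=toy[col] > 2.0)
       .groupby("Country")
       .agg(
           mean_pct=(col, "mean"),                 # average elite share
           share_years_above_2=("above_2", "mean"),  # fraction of years above 2%
           n_years=("Year", "nunique"),
       )
       .sort_values("share_years_above_2", ascending=False)
)
print(summary)
```

In this toy example both countries have the same mean, but only country A stays above 2% in every year, which is the distinction the word "consistently" calls for.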


### 4.2 Bivariate Analysis
*Analyzing relationships between two variables (e.g., Correlation, Trends over time).*