# 📊 Exploratory Data Analysis: Research Publications
**Project Intern/Trainee Hiring Assessment - PAIU-OPSA, IISc Bangalore**

---
**Author:** Omkar Sharma
**Date:** November 2025
**Dashboard Link:** [Insert your Netlify/Streamlit Cloud Link Here]
**Repository:** [Insert your GitHub Link Here]

---
### 📝 Objective
The goal of this analysis is to perform an in-depth Exploratory Data Analysis (EDA) on the provided dataset to identify trends, patterns, and anomalies. The insights derived from this notebook will drive the development of an interactive dashboard.

### 📖 Table of Contents
1. [Environment Setup](#setup)
2. [Data Loading & Overview](#loading)
3. [Data Preprocessing & Cleaning](#cleaning)
4. [Exploratory Data Analysis (EDA)](#eda)
    - Univariate Analysis
    - Bivariate Analysis
    - Multivariate Analysis
5. [Key Insights & Conclusion](#conclusion)

## 1. Environment Setup <a id="setup"></a>
Importing necessary libraries for data manipulation, visualization, and statistical analysis.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px # for interactive dashboard
import plotly.figure_factory as ff
import plotly.graph_objects as go
from plotly.subplots import make_subplots  # for dual axis
from scipy import stats

# Configuration for aesthetic charts
sns.set_theme(style='whitegrid')
plt.rcParams['figure.figsize'] = (12,6)

# Hide irrelevant warnings
import warnings
warnings.filterwarnings('ignore')
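
A blanket `warnings.filterwarnings('ignore')` also hides deprecation notices and other warnings that may matter during cleaning. A narrower filter is sketched below. It assumes `FutureWarning` is the noisy category, which the notebook does not state, so treat the category as a placeholder.

```python
import warnings

with warnings.catch_warnings(record=True) as caught:
    # Record every warning raised inside this block...
    warnings.simplefilter("always")
    # ...except FutureWarning (assumed category); this filter is inserted
    # at the front of the filter list, so it takes precedence.
    warnings.filterwarnings("ignore", category=FutureWarning)

    warnings.warn("old API", FutureWarning)         # suppressed
    warnings.warn("check this value", UserWarning)  # still recorded

shown = [w.category.__name__ for w in caught]
print(shown)
```

Outside a demo, the single `warnings.filterwarnings("ignore", category=...)` line is enough; the `catch_warnings` block is only there to show which warnings survive.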

## 2. Data Loading & Overview <a id="loading"></a>
Loading the dataset and performing a preliminary check to understand the structure, features, and data types.

In [3]:
# Load the dataset
df = pd.read_csv('data/publications.csv')

df.tail()  # last 5 rows (not rendered: a cell only displays its final expression)
df.head()  # first 5 rows (rendered below)

| | Name | Web of Science Documents | Times Cited | Collab-CNCI | Rank | % Docs Cited | Category Normalized Citation Impact | % Documents in Top 1% | % Documents in Top 10% | Documents in Top 1% | Documents in Top 10% | year |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | SWITZERLAND | 24154 | 2705248 | 0.946748 | 8 | 97.93 | 1.024815 | 0.89 | 10.87 | 97 | 230 | 2023 |
| 1 | CHINA | 2185 | 157320 | 1.575928 | 44 | 99.6 | 0.900623 | 2.98 | 19.26 | 323 | 121 | 2014 |
| 2 | CHINA | 6896 | 744768 | 1.032983 | 42 | 95.23 | 1.679004 | 1.08 | 11.36 | 455 | 662 | 2013 |
| 3 | UNITED KINGDOM | 2399 | 177526 | 1.586585 | 3 | 99.21 | 1.444246 | 1.63 | 10.2 | 98 | 2463 | 2005 |
| 4 | ITALY | 10753 | 301084 | 0.812773 | 2 | 98.35 | 1.252122 | 0.81 | 17.43 | 440 | 134 | 2004 |
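
The preview rows invite a quick internal-consistency check before any plotting: a paper in the global top 1% is also in the top 10%, so the top-1% count should not exceed the top-10% count, and neither count should exceed the total number of documents. The sketch below re-types the five preview rows by hand so that it runs on its own; `sample` and `suspect_rows` are illustrative names, and the same masks would apply to `df`.

```python
import pandas as pd

# The five preview rows, re-typed by hand (only the columns needed here).
sample = pd.DataFrame({
    "Name": ["SWITZERLAND", "CHINA", "CHINA", "UNITED KINGDOM", "ITALY"],
    "Web of Science Documents": [24154, 2185, 6896, 2399, 10753],
    "Documents in Top 1%": [97, 323, 455, 98, 440],
    "Documents in Top 10%": [230, 121, 662, 2463, 134],
})

# Top-1% papers are a subset of top-10% papers.
top1_exceeds_top10 = sample["Documents in Top 1%"] > sample["Documents in Top 10%"]

# A subset count cannot be larger than the total document count.
top10_exceeds_total = sample["Documents in Top 10%"] > sample["Web of Science Documents"]

suspect_rows = sample[top1_exceeds_top10 | top10_exceeds_total]
print(suspect_rows)
```

On these five rows the check flags rows 1, 3, and 4, which suggests the count columns deserve a closer look in the cleaning step.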


### 📋 Data Dictionary
The dataset's columns and what they mean:

| Column Name | Description | Data Type |
| :--- | :--- | :--- |
| **Name** | Name of the Country | `String` |
| **Web of Science Documents** | Total count of research papers published by the country | `Integer` |
| **Times Cited** | Total number of citations received by the published papers | `Integer` |
| **Collab-CNCI** | Category Normalized Citation Impact score for collaborative papers only | `Float` |
| **Rank** | Ranking position of the country | `Integer` |
| **% Docs Cited** | Percentage of documents that have received at least one citation | `Float` |
| **Category Normalized Citation Impact** | Impact score normalized by subject, year, and type (1.0 = World Average) | `Float` |
| **% Documents in Top 1%** | Percentage of papers that are in the global top 1% of most cited papers | `Float` |
| **% Documents in Top 10%** | Percentage of papers that are in the global top 10% of most cited papers | `Float` |
| **Documents in Top 1%** | Absolute count of papers in the global top 1% | `Integer` |
| **Documents in Top 10%** | Absolute count of papers in the global top 10% | `Integer` |
| **year** | The specific year of publication for the data record | `Integer` |
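
The data dictionary can double as a lightweight schema. The sketch below encodes the documented type of each column and reports any column whose dtype disagrees. `EXPECTED_KINDS` and `check_schema` are hypothetical names introduced for illustration, and the one-row `demo` frame is the first preview row re-typed by hand.

```python
import pandas as pd
from pandas.api import types as ptypes

# Column kinds as documented in the data dictionary.
EXPECTED_KINDS = {
    "Name": "string",
    "Web of Science Documents": "integer",
    "Times Cited": "integer",
    "Collab-CNCI": "float",
    "Rank": "integer",
    "% Docs Cited": "float",
    "Category Normalized Citation Impact": "float",
    "% Documents in Top 1%": "float",
    "% Documents in Top 10%": "float",
    "Documents in Top 1%": "integer",
    "Documents in Top 10%": "integer",
    "year": "integer",
}

def check_schema(frame, expected):
    """Return a list of human-readable mismatches (empty if all is well)."""
    problems = []
    for column, kind in expected.items():
        if column not in frame.columns:
            problems.append(f"{column}: missing")
            continue
        series = frame[column]
        matches = {
            "string": ptypes.is_object_dtype(series) or ptypes.is_string_dtype(series),
            "integer": ptypes.is_integer_dtype(series),
            "float": ptypes.is_float_dtype(series),
        }[kind]
        if not matches:
            problems.append(f"{column}: expected {kind}, found {series.dtype}")
    return problems

# First preview row, re-typed by hand.
demo = pd.DataFrame([{
    "Name": "SWITZERLAND", "Web of Science Documents": 24154, "Times Cited": 2705248,
    "Collab-CNCI": 0.946748, "Rank": 8, "% Docs Cited": 97.93,
    "Category Normalized Citation Impact": 1.024815, "% Documents in Top 1%": 0.89,
    "% Documents in Top 10%": 10.87, "Documents in Top 1%": 97,
    "Documents in Top 10%": 230, "year": 2023,
}])

print(check_schema(demo, EXPECTED_KINDS))                        # clean frame
print(check_schema(demo.astype({"Rank": str}), EXPECTED_KINDS))  # Rank stored as text
```

Calling `check_schema(df, EXPECTED_KINDS)` right after loading would turn the "data types are correct" observation into a repeatable check.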

In [4]:
df.shape       # number of rows and columns (not rendered: only the cell's last expression is displayed)
df.info()      # non-null counts and data types, printed to the output
df.describe()  # summary statistics for the numeric columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 12 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   Name                                 1000 non-null   object 
 1   Web of Science Documents             1000 non-null   int64  
 2   Times Cited                          1000 non-null   int64  
 3   Collab-CNCI                          1000 non-null   float64
 4   Rank                                 1000 non-null   int64  
 5   % Docs Cited                         1000 non-null   float64
 6   Category Normalized Citation Impact  1000 non-null   float64
 7   % Documents in Top 1%                1000 non-null   float64
 8   % Documents in Top 10%               1000 non-null   float64
 9   Documents in Top 1%                  1000 non-null   int64  
 10  Documents in Top 10%                 1000 non-null   int64  
 11  year                           

| | Web of Science Documents | Times Cited | Collab-CNCI | Rank | % Docs Cited | Category Normalized Citation Impact | % Documents in Top 1% | % Documents in Top 10% | Documents in Top 1% | Documents in Top 10% | year |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| count | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
| mean | 14861.699 | 1296497.0 | 1.214932 | 24.722 | 97.41069 | 1.291637 | 1.7676 | 17.58979 | 261.327 | 1497.457 | 2013.86 |
| std | 8390.150609 | 967063.3 | 0.230261 | 14.108145 | 1.419199 | 0.234461 | 0.71711 | 4.3631 | 136.904576 | 844.902713 | 6.748477 |
| min | 512.0 | 21846.0 | 0.800182 | 1.0 | 95.0 | 0.900623 | 0.5 | 10.02 | 12.0 | 111.0 | 2003.0 |
| 25% | 7616.75 | 507670.0 | 1.029402 | 12.0 | 96.15 | 1.08702 | 1.13 | 13.77 | 142.0 | 736.75 | 2008.0 |
| 50% | 14711.0 | 1064920.0 | 1.214383 | 25.0 | 97.385 | 1.292028 | 1.81 | 17.39 | 261.5 | 1481.0 | 2014.0 |
| 75% | 22022.25 | 1899791.0 | 1.415986 | 37.0 | 98.6525 | 1.499628 | 2.39 | 21.64 | 382.0 | 2202.25 | 2020.0 |
| max | 29959.0 | 4327668.0 | 1.599646 | 49.0 | 99.89 | 1.698257 | 3.0 | 24.99 | 499.0 | 2999.0 | 2025.0 |
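
Because a Category Normalized Citation Impact of 1.0 marks the world average, the summary already says something useful: the 25th percentile is 1.08702, so at least three quarters of the records sit above the world average. A minimal sketch of the same comparison at row level follows, using the CNCI values of the five preview rows re-typed by hand; `WORLD_AVERAGE` and `vs_world_average` are illustrative names.

```python
import pandas as pd

WORLD_AVERAGE = 1.0  # per the data dictionary, a CNCI of 1.0 is the world average

# CNCI values from the five preview rows, re-typed by hand.
preview = pd.DataFrame({
    "Name": ["SWITZERLAND", "CHINA", "CHINA", "UNITED KINGDOM", "ITALY"],
    "year": [2023, 2014, 2013, 2005, 2004],
    "Category Normalized Citation Impact": [1.024815, 0.900623, 1.679004, 1.444246, 1.252122],
})

cnci = preview["Category Normalized Citation Impact"]
above = cnci > WORLD_AVERAGE

# Hypothetical helper column labelling each record relative to the world average.
preview["vs_world_average"] = above.map({True: "above", False: "below"})

# Mean of a boolean mask is the fraction of True values.
share_above = above.mean()
print(preview[["Name", "year", "vs_world_average"]])
print(f"Share of rows above world average: {share_above:.0%}")
```

Applied to the full `df`, the same mask gives the exact share rather than the lower bound implied by the quartiles.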


<div class="alert alert-block alert-info">
<b>üßê Initial Observations:</b>
<ul>
    <li>The dataset contains <b>1000</b> rows and <b>12</b> columns.</li>
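    <li>There are no missing values: every column reports 1000 non-null entries.</li>
    <li>The data types match the data dictionary: <code>Name</code> is stored as <code>object</code> (string), counts and ranks as <code>int64</code>, and scores and percentages as <code>float64</code>.</li>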
</ul>
</div>
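
These observations can be backed by a small, repeatable report instead of a visual scan of the `df.info()` output. The sketch below is one way to do it; `quality_report` is a hypothetical helper, and the four-row `demo` frame is made-up data containing one missing value and one duplicated row so that both checks have something to find.

```python
import pandas as pd

def quality_report(frame):
    """Per-column dtype, missing-value count, and number of distinct values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing": frame.isna().sum(),
        "unique": frame.nunique(),
    })

# Made-up demo data: one missing Name and one exact duplicate row.
demo = pd.DataFrame({
    "Name": ["CHINA", "ITALY", None, "CHINA"],
    "year": [2014, 2004, 2013, 2014],
})

report = quality_report(demo)
duplicate_rows = int(demo.duplicated().sum())  # rows identical to an earlier row

print(report)
print(f"Duplicate rows: {duplicate_rows}")
```

Running `quality_report(df)` and `df.duplicated().sum()` on the real dataset would confirm the "no missing values" observation and also cover duplicates, which `df.info()` does not report.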