# 🏫 **University Rankings Analysis**


<h2 style="font-family: 'poppins'; font-weight: bold;">👨‍💻 Author: Muhammad Hassan Saboor</h2>

[![GitHub](https://img.shields.io/badge/GitHub-Profile-blue?style=for-the-badge&logo=github)](https://github.com/MuhammadHassanSaboor) 
[![Kaggle](https://img.shields.io/badge/Kaggle-Profile-blue?style=for-the-badge&logo=kaggle)](https://www.kaggle.com/mhassansaboor) 
[![LinkedIn](https://img.shields.io/badge/LinkedIn-Profile-blue?style=for-the-badge&logo=linkedin)](https://www.linkedin.com/in/muhammad-hassan-saboor/)  
[![Facebook](https://img.shields.io/badge/Facebook-Profile-blue?style=for-the-badge&logo=facebook)](https://www.facebook.com/profile.php?id=61555194218257) 
[![Twitter/X](https://img.shields.io/badge/Twitter-Profile-blue?style=for-the-badge&logo=twitter)](https://twitter.com/MUHAMMA84929767) 
[![Instagram](https://img.shields.io/badge/Instagram-Profile-blue?style=for-the-badge&logo=instagram)](https://www.instagram.com/m_hassan_saboor/) 

---

## Overview 📝

This notebook explores **university rankings** and performance metrics using a dataset of the top 300 Asian universities from the QS Rankings 2024. The analysis covers overall scores, geographic distribution, reputation, internationalization, faculty-related indicators, citations, and more. The goal is to identify patterns and correlations that help explain how universities are ranked.

---

## **Steps Covered in This EDA** 🛠️

1. **Ranking Analysis** 📊  
   Explore the top-ranked universities and how ranks are distributed across countries.

2. **Geographic Insights** 🌍  
   Analyze the geographic distribution of universities, top countries by number of universities, and university concentration by city.

3. **Overall Score Analysis** 📈  
   Investigate the correlation between overall scores and rankings, and identify universities with the highest overall scores.

4. **Reputation and Ranking** ⭐  
   Compare academic reputation with employer reputation, and analyze the role of international research centers in rankings.

5. **Internationalization Factors** 🌐  
   Examine the impact of international student percentages, exchange programs, and international faculty on rankings.

6. **Faculty-Related Insights** 👩‍🏫  
   Explore how faculty-student ratios and staff with PhDs correlate with rankings and overall performance.

7. **Citations and Research Productivity** 🔬  
   Analyze the relationship between research productivity, citations per paper, and university rankings.

8. **Country-Based Analysis** 🌎  
   Investigate which countries have the highest-ranked universities and compare the average rank by country.

9. **Advanced Visualizations** 🎨  
   Visualize correlations between key metrics, and explore clustering patterns among universities.

10. **Performance vs Faculty-Student Ratio** 📉  
    Investigate the relationship between faculty-student ratios and university performance, and identify top universities balancing high rank and low ratio.

11. **Analysis of QS Ranking Variables** 🔍  
    Explore the most influential factors in QS rankings and compare the variance of the ranking variables.

12. **Top Performing Universities in Specific Categories** 🏅  
    Rank universities based on employer reputation, international student attraction, and research productivity.

13. **Statistical Analysis** 📊  
    Provide descriptive statistics of key variables and detect any outliers in the dataset.

---

## **Key Insights** 💡

- **Correlation Analysis** 🔗: Detailed analysis of how features like academic reputation, employer reputation, and faculty-student ratio influence university rankings.
- **Geographic Trends** 🌍: Insights on the concentration of top universities across different countries and cities, identifying key educational hubs.
- **Internationalization Factors** 🌏: Analyzing the importance of international students, faculty, and exchange programs in determining university success.
- **Faculty and Research** 👩‍🏫🔬: The role of faculty-student ratios and research productivity in shaping rankings and performance.
  
---

## **Necessary Libraries Used** 📚

- **Pandas**: Data manipulation and cleaning
- **NumPy**: Numerical operations
- **Plotly**: Interactive visualizations
- **Scikit-learn**: Clustering and machine learning models for advanced analysis

---


## **Conclusion** 🎓

This exploratory data analysis highlights key insights into university rankings, revealing significant trends and relationships between various factors like reputation, internationalization, faculty quality, and research productivity. The findings aim to provide a deeper understanding of what makes a top university in today's global educational landscape.

Thank you for exploring this analysis! 🙏🚀

---



# 📚 **Importing Libraries**

In [1]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import warnings
import plotly.io as pio
pio.renderers.default = 'iframe'

# ⚙️ **Basic Important Settings**

In [2]:
warnings.filterwarnings("ignore")

# 📥 **Loading the Dataset**

In [3]:
df = pd.read_csv("/kaggle/input/top-300-asian-universities-qs-rankings-2024/topuniversities.csv")

# 📊 **Exploring the Dataset**

In [4]:
df.head()

Unnamed: 0,Rank,Ordinal Rank,University Name,Overall Score,City,Country,Citations per Paper,Papers per Faculty,Academic Reputation,Faculty Student Ratio,Staff with PhD,International Research Center,International Students,Outbound Exchange,Inbound Exchange,International Faculty,Employer Reputation
0,1,1,Peking University,100.0,Beijing,China,96.4,79.8,100.0,98.6,90.7,98.0,69.1,100.0,88.5,83.2,100.0
1,2,2,The University of Hong Kong,99.7,Pokfulam,Hong Kong,99.5,55.0,100.0,93.3,97.4,98.4,100.0,100.0,99.8,100.0,96.8
2,3,3,National University of Singapore (NUS),98.9,Singapore,Singapore,99.9,57.4,100.0,85.8,82.5,99.9,99.2,97.6,93.4,100.0,99.9
3,4,4,Nanyang Technological University,98.3,Singapore,Singapore,100.0,53.8,100.0,93.0,67.0,99.7,98.8,97.9,90.5,100.0,98.8
4,5,5,Fudan University,97.2,Shanghai,China,92.1,63.1,99.8,92.5,73.4,92.1,81.0,94.9,99.5,98.9,99.5


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 17 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Rank                           300 non-null    int64  
 1   Ordinal Rank                   300 non-null    int64  
 2   University Name                300 non-null    object 
 3   Overall Score                  300 non-null    float64
 4   City                           300 non-null    object 
 5   Country                        300 non-null    object 
 6   Citations per Paper            300 non-null    float64
 7   Papers per Faculty             300 non-null    float64
 8   Academic Reputation            300 non-null    float64
 9   Faculty Student Ratio          300 non-null    float64
 10  Staff with PhD                 294 non-null    float64
 11  International Research Center  300 non-null    float64
 12  International Students         296 non-null    flo

In [6]:
df.isnull().sum()

Rank                              0
Ordinal Rank                      0
University Name                   0
Overall Score                     0
City                              0
Country                           0
Citations per Paper               0
Papers per Faculty                0
Academic Reputation               0
Faculty Student Ratio             0
Staff with PhD                    6
International Research Center     0
International Students            4
Outbound Exchange                 0
Inbound Exchange                  0
International Faculty            17
Employer Reputation               0
dtype: int64

# ⚙️ **Data Preprocessing**

In [7]:
# Fill missing indicator scores with the column mean (truncated to an integer by int())
for column in ["Staff with PhD", "International Faculty", "International Students"]:
    df[column] = df[column].fillna(int(df[column].mean()))
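
The imputation above fills gaps with a truncated mean. As a rough illustration of how that choice compares with the exact mean and the median, here is a minimal sketch on a hypothetical toy column (the values are made up for the example and are not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy column standing in for an indicator such as "Staff with PhD".
scores = pd.Series([90.7, 97.4, np.nan, 67.0, 73.4, np.nan, 12.5], name="Staff with PhD")

mean_fill = scores.fillna(scores.mean())            # exact mean, no truncation
truncated_fill = scores.fillna(int(scores.mean()))  # int() drops the fractional part
median_fill = scores.fillna(scores.median())        # less sensitive to skewed values

comparison = pd.DataFrame(
    {"mean": mean_fill, "int(mean)": truncated_fill, "median": median_fill}
)

# Show only the rows that were originally missing.
print(comparison[scores.isna()])
```

On skewed indicators the median may be the safer default, since a few extreme scores can pull the mean away from the typical value.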

In [8]:
df.isnull().sum()

Rank                             0
Ordinal Rank                     0
University Name                  0
Overall Score                    0
City                             0
Country                          0
Citations per Paper              0
Papers per Faculty               0
Academic Reputation              0
Faculty Student Ratio            0
Staff with PhD                   0
International Research Center    0
International Students           0
Outbound Exchange                0
Inbound Exchange                 0
International Faculty            0
Employer Reputation              0
dtype: int64

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 17 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Rank                           300 non-null    int64  
 1   Ordinal Rank                   300 non-null    int64  
 2   University Name                300 non-null    object 
 3   Overall Score                  300 non-null    float64
 4   City                           300 non-null    object 
 5   Country                        300 non-null    object 
 6   Citations per Paper            300 non-null    float64
 7   Papers per Faculty             300 non-null    float64
 8   Academic Reputation            300 non-null    float64
 9   Faculty Student Ratio          300 non-null    float64
 10  Staff with PhD                 300 non-null    float64
 11  International Research Center  300 non-null    float64
 12  International Students         300 non-null    flo

# 📊 **Exploratory Data Analysis (EDA)**

# 📊 **Ranking Analysis**

In [10]:
top_universities = df.head(10)

fig_top_universities = px.bar(
    top_universities,
    x="University Name",
    y="Overall Score",
    color="University Name",
    title="Top Universities in Asia (QS Rankings 2024)",
    template="plotly_dark",
)

fig_top_universities.update_layout(
    title="Top Universities in Asia (QS Rankings 2024)",
    xaxis_title="University Name",
    yaxis_title="Overall Score",
    plot_bgcolor="black",
    paper_bgcolor="black",
)
fig_top_universities.show()

In [11]:
rank_distribution = df.groupby("Country")["Rank"].mean().reset_index()

fig_rank_distribution = px.bar(
    rank_distribution,
    x="Country",
    y="Rank",
    color="Country",
    title="Rank Distribution Across Countries in QS Asian Rankings 2024",
    template="plotly_dark",
)

fig_rank_distribution.update_layout(
    title="Rank Distribution Across Countries in QS Asian Rankings 2024",
    xaxis_title="Country",
    yaxis_title="Average Rank",
    plot_bgcolor="black",
    paper_bgcolor="black",
)
fig_rank_distribution.show()

# 🌍 **Geographic Insights**

In [12]:
country_universities_count = df["Country"].value_counts().reset_index()
country_universities_count.columns = ["Country", "Number of Universities"]

fig_country_universities = px.bar(
    country_universities_count,
    x="Country",
    y="Number of Universities",
    color="Country",
    title="Top Countries by Number of Universities in QS Asian Rankings 2024",
    template="plotly_dark",
)

fig_country_universities.update_layout(
    title="Top Countries by Number of Universities in QS Asian Rankings 2024",
    xaxis_title="Country",
    yaxis_title="Number of Universities",
    plot_bgcolor="black",
    paper_bgcolor="black",
)

fig_country_universities.show()

In [13]:
country_avg_score = df.groupby("Country")["Overall Score"].mean().reset_index()

fig_country_avg_score = px.bar(
    country_avg_score,
    x="Country",
    y="Overall Score",
    color="Country",
    title="Top Countries by Overall Score in QS Asian Rankings 2024",
    template="plotly_dark",
)

fig_country_avg_score.update_layout(
    title="Top Countries by Overall Score in QS Asian Rankings 2024",
    xaxis_title="Country",
    yaxis_title="Average Overall Score",
    plot_bgcolor="black",
    paper_bgcolor="black",
)

fig_country_avg_score.show()

In [14]:
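# Count universities per city and sort so the largest hubs appear first
city_rank_count = (
    df.groupby("City")["Rank"].count().reset_index().sort_values("Rank", ascending=False)
)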

fig_city_rank_count = px.bar(
    city_rank_count,
    x="City",
    y="Rank",
    color="City",
    title="Top Cities by Number of Top-Ranked Universities in QS Asian Rankings 2024",
    template="plotly_dark",
)

fig_city_rank_count.update_layout(
    title="Top Cities by Number of Top-Ranked Universities in QS Asian Rankings 2024",
    xaxis_title="City",
    yaxis_title="Number of Top-Ranked Universities",
    plot_bgcolor="black",
    paper_bgcolor="black",
)
fig_city_rank_count.show()

# 📈 **Overall Score Analysis**

In [15]:
fig_corr_rank_score = px.scatter(
    df,
    x="Overall Score",
    y="Rank",
    color="Country",
    hover_name="University Name",
    title="Correlation between Overall Score and Rank (QS Asian Rankings 2024)",
    template="plotly_dark",
)

fig_corr_rank_score.update_layout(
    title="Correlation between Overall Score and Rank (QS Asian Rankings 2024)",
    xaxis_title="Overall Score",
    yaxis_title="Rank",
    plot_bgcolor="black",
    paper_bgcolor="black",
)
fig_corr_rank_score.show()
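
The scatter plot shows the relationship visually; a correlation coefficient puts a number on it. Because rank is ordinal, a rank-based (Spearman) coefficient is often a better fit than Pearson. The sketch below uses a hypothetical toy frame with the same column names as the dataset; the values are illustrative only.

```python
import pandas as pd

# Hypothetical toy data shaped like the dataset's "Overall Score" and "Rank" columns.
toy = pd.DataFrame(
    {
        "Overall Score": [100.0, 99.7, 98.9, 98.3, 97.2, 60.1, 35.4],
        "Rank": [1, 2, 3, 4, 5, 80, 210],
    }
)

# DataFrame.corr returns the full matrix; pick the single off-diagonal entry.
pearson = toy.corr(method="pearson").loc["Overall Score", "Rank"]
spearman = toy.corr(method="spearman").loc["Overall Score", "Rank"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

A Spearman value close to -1 would indicate that rank falls almost monotonically as the score rises, even if the relationship is not linear.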

In [16]:
top_universities_by_score = df.nlargest(10, "Overall Score")

fig_top_universities_score = px.bar(
    top_universities_by_score,
    x="University Name",
    y="Overall Score",
    color="University Name",
    title="Top 10 Universities by Overall Score (QS Asian Rankings 2024)",
    template="plotly_dark",
)

fig_top_universities_score.update_layout(
    title="Top 10 Universities by Overall Score (QS Asian Rankings 2024)",
    xaxis_title="University Name",
    yaxis_title="Overall Score",
    plot_bgcolor="black",
    paper_bgcolor="black",
)
fig_top_universities_score.show()

In [17]:
fig_score_distribution = px.box(
    df,
    x="Country",
    y="Overall Score",
    title="Overall Score Distribution by Country (QS Asian Rankings 2024)",
    template="plotly_dark",
)

fig_score_distribution.update_layout(
    title="Overall Score Distribution by Country (QS Asian Rankings 2024)",
    xaxis_title="Country",
    yaxis_title="Overall Score",
    plot_bgcolor="black",
    paper_bgcolor="black",
)
fig_score_distribution.show()

# ⭐ **Reputation and Ranking**

In [18]:
fig_academic_vs_employer = px.scatter(
    df,
    x="Academic Reputation",
    y="Employer Reputation",
    color="Country",
    hover_name="University Name",
    title="Academic Reputation vs Employer Reputation (QS Asian Rankings 2024)",
    template="plotly_dark",
)

fig_academic_vs_employer.update_layout(
    title="Academic Reputation vs Employer Reputation (QS Asian Rankings 2024)",
    xaxis_title="Academic Reputation",
    yaxis_title="Employer Reputation",
    plot_bgcolor="black",
    paper_bgcolor="black",
)
fig_academic_vs_employer.show()

In [19]:
top_international_research = df.nlargest(10, "International Research Center")

fig_international_research = px.bar(
    top_international_research,
    x="University Name",
    y="International Research Center",
    color="University Name",
    title="Top 10 Universities by International Research Center Rating (QS Asian Rankings 2024)",
    template="plotly_dark",
)

fig_international_research.update_layout(
    title="Top 10 Universities by International Research Center Rating (QS Asian Rankings 2024)",
    xaxis_title="University Name",
    yaxis_title="International Research Center Rating",
    plot_bgcolor="black",
    paper_bgcolor="black",
)
fig_international_research.show()

In [20]:
fig_reputation_vs_internationalization = px.scatter(
    df,
    x="International Students",
    y="International Faculty",
    size="Inbound Exchange",
    color="Employer Reputation",
    hover_name="University Name",
    title="Reputation vs Internationalization (QS Asian Rankings 2024)",
    template="plotly_dark",
)

fig_reputation_vs_internationalization.update_layout(
    title="Reputation vs Internationalization (QS Asian Rankings 2024)",
    xaxis_title="International Students",
    yaxis_title="International Faculty",
    plot_bgcolor="black",
    paper_bgcolor="black",
)
fig_reputation_vs_internationalization.show()

# 🌐 **Internationalization Factors**

In [21]:
fig_international_students = px.scatter(
    df,
    x="International Students",
    y="Rank",
    color="Country",
    hover_name="University Name",
    title="International Student Percentage vs Rank (QS Asian Rankings 2024)",
    template="plotly_dark",
)

fig_international_students.update_layout(
    title="International Student Percentage vs Rank (QS Asian Rankings 2024)",
    xaxis_title="International Student Percentage",
    yaxis_title="Rank",
    plot_bgcolor="black",
    paper_bgcolor="black",
)
fig_international_students.show()

In [22]:
fig_inbound_outbound_exchange = px.scatter(
    df,
    x="Inbound Exchange",
    y="Outbound Exchange",
    color="Country",
    hover_name="University Name",
    title="Inbound vs Outbound Exchange (QS Asian Rankings 2024)",
    template="plotly_dark",
)

fig_inbound_outbound_exchange.update_layout(
    title="Inbound vs Outbound Exchange (QS Asian Rankings 2024)",
    xaxis_title="Inbound Exchange",
    yaxis_title="Outbound Exchange",
    plot_bgcolor="black",
    paper_bgcolor="black",
)
fig_inbound_outbound_exchange.show()

In [23]:
fig_international_faculty = px.scatter(
    df,
    x="International Faculty",
    y="Rank",
    color="Country",
    hover_name="University Name",
    title="International Faculty Percentage vs Rank (QS Asian Rankings 2024)",
    template="plotly_dark",
)

fig_international_faculty.update_layout(
    title="International Faculty Percentage vs Rank (QS Asian Rankings 2024)",
    xaxis_title="International Faculty Percentage",
    yaxis_title="Rank",
    plot_bgcolor="black",
    paper_bgcolor="black",
)
fig_international_faculty.show()

# 👩‍🏫 **Faculty-Related Insights**

In [24]:
fig_faculty_to_student_ratio = px.scatter(
    df,
    x="Faculty Student Ratio",
    y="Rank",
    color="Country",
    hover_name="University Name",
    title="Faculty to Student Ratio vs Rank (QS Asian Rankings 2024)",
    template="plotly_dark",
)

fig_faculty_to_student_ratio.update_layout(
    title="Faculty to Student Ratio vs Rank (QS Asian Rankings 2024)",
    xaxis_title="Faculty to Student Ratio",
    yaxis_title="Rank",
    plot_bgcolor="black",
    paper_bgcolor="black",
)
fig_faculty_to_student_ratio.show()

In [25]:
fig_staff_with_phd = px.scatter(
    df,
    x="Staff with PhD",
    y="Rank",
    color="Country",
    hover_name="University Name",
    title="Staff with PhD vs Rank (QS Asian Rankings 2024)",
    template="plotly_dark",
)

fig_staff_with_phd.update_layout(
    title="Staff with PhD vs Rank (QS Asian Rankings 2024)",
    xaxis_title="Staff with PhD Percentage",
    yaxis_title="Rank",
    plot_bgcolor="black",
    paper_bgcolor="black",
)
fig_staff_with_phd.show()

# 🔬 **Citations and Research Productivity**

In [26]:
fig_citations_per_paper = px.scatter(
    df,
    x="Citations per Paper",
    y="Rank",
    color="Country",
    hover_name="University Name",
    title="Citations per Paper vs Rank (QS Asian Rankings 2024)",
    template="plotly_dark",
)

fig_citations_per_paper.update_layout(
    title="Citations per Paper vs Rank (QS Asian Rankings 2024)",
    xaxis_title="Citations per Paper",
    yaxis_title="Rank",
    plot_bgcolor="black",
    paper_bgcolor="black",
)
fig_citations_per_paper.show()

In [27]:
fig_papers_per_faculty = px.scatter(
    df,
    x="Papers per Faculty",
    y="Rank",
    color="Country",
    hover_name="University Name",
    title="Papers per Faculty vs Rank (QS Asian Rankings 2024)",
    template="plotly_dark",
)

fig_papers_per_faculty.update_layout(
    title="Papers per Faculty vs Rank (QS Asian Rankings 2024)",
    xaxis_title="Papers per Faculty",
    yaxis_title="Rank",
    plot_bgcolor="black",
    paper_bgcolor="black",
)
fig_papers_per_faculty.show()

In [28]:
fig_research_productivity = px.scatter(
    df,
    x="Citations per Paper",
    y="Papers per Faculty",
    size="International Research Center",
    color="Country",
    hover_name="University Name",
    title="Research Productivity: Citations per Paper vs Papers per Faculty",
    template="plotly_dark",
)

fig_research_productivity.update_layout(
    title="Research Productivity: Citations per Paper vs Papers per Faculty",
    xaxis_title="Citations per Paper",
    yaxis_title="Papers per Faculty",
    plot_bgcolor="black",
    paper_bgcolor="black",
)
fig_research_productivity.show()

# 🌎 **Country-Based Analysis**

In [29]:
top_ranks = [10, 50, 100]
country_rank_count = pd.DataFrame()

# Count universities in top ranks for each country
for rank in top_ranks:
    # Filter universities that are in the top N ranks
    top_n_universities = (
        df[df["Rank"] <= rank].groupby("Country").size().reset_index(name=f"Top {rank}")
    )

    # Merge data on 'Country' column, making sure the 'Country' column exists in both DataFrames
    if country_rank_count.empty:
        country_rank_count = top_n_universities
    else:
        country_rank_count = pd.merge(
            country_rank_count, top_n_universities, on="Country", how="outer"
        )

fig_country_top_rank = px.bar(
    country_rank_count,
    x="Country",
    y=[f"Top {rank}" for rank in top_ranks],
    title="Country with Highest Ranked Universities (Top 10, 50, 100)",
    labels={"value": "Number of Universities", "variable": "Top N Ranks"},
    template="plotly_dark",
)

fig_country_top_rank.update_layout(
    barmode="stack",
    plot_bgcolor="black",
    paper_bgcolor="black",
    title="Country with Highest Ranked Universities (Top 10, 50, 100)",
    xaxis_title="Country",
    yaxis_title="Number of Universities",
)

fig_country_top_rank.show()

In [30]:
country_avg_rank = (
    df.groupby("Country")["Rank"].mean().reset_index().sort_values("Rank")
)

fig_country_improvement = px.bar(
    country_avg_rank,
    x="Country",
    y="Rank",
    title="Average Rank by Country (QS Asian Rankings 2024)",
    labels={"Rank": "Average Rank", "Country": "Country"},
    template="plotly_dark",
)

fig_country_improvement.update_layout(
    plot_bgcolor="black",
    paper_bgcolor="black",
    title="Average Rank by Country (QS Asian Rankings 2024)",
    xaxis_title="Country",
    yaxis_title="Average Rank",
)

fig_country_improvement.show()
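
An average rank can be misleading for countries with only one or two listed universities. One way to account for that, sketched here on hypothetical toy data, is to aggregate the count alongside the mean and filter on a minimum sample size. The country labels and the `min_universities` threshold are assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy data; country labels and ranks are placeholders.
toy = pd.DataFrame(
    {
        "Country": ["A", "A", "A", "B", "B", "C"],
        "Rank": [1, 5, 120, 3, 4, 60],
    }
)

# Named aggregation gives the sample size next to the average and best rank.
summary = (
    toy.groupby("Country")["Rank"]
    .agg(universities="count", average_rank="mean", best_rank="min")
    .reset_index()
    .sort_values("average_rank")
)

min_universities = 2  # assumed threshold, tune to the dataset
reliable = summary[summary["universities"] >= min_universities]
print(reliable)
```

The `reliable` frame can be passed to `px.bar` in the same way as `country_avg_rank`.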

# 🎨 **Advanced Visualizations**

In [31]:
correlation_columns = [
    "Overall Score",
    "Citations per Paper",
    "Papers per Faculty",
    "Academic Reputation",
    "Faculty Student Ratio",
    "Staff with PhD",
    "International Research Center",
    "International Students",
    "Outbound Exchange",
    "Inbound Exchange",
    "International Faculty",
    "Employer Reputation",
]
correlation_matrix = df[correlation_columns].corr()

fig_corr = go.Figure(
    data=go.Heatmap(
        z=correlation_matrix.values,
        x=correlation_matrix.columns,
        y=correlation_matrix.columns,
        colorscale="Viridis",
        colorbar=dict(title="Correlation"),
        text=correlation_matrix.values,
        texttemplate="%{text:.2f}",
    )
)

fig_corr.update_layout(
    title="Correlation Heatmap between Key Features",
    xaxis_title="Features",
    yaxis_title="Features",
    template="plotly_dark",
    plot_bgcolor="black",
    paper_bgcolor="black",
)

fig_corr.show()
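
A heatmap with many features can be hard to scan. As a complement, the strongest pairs can be listed directly by keeping only the upper triangle of the correlation matrix and sorting by absolute value. This is a minimal sketch on synthetic data; the column names mirror the dataset but the numbers are randomly generated.

```python
import numpy as np
import pandas as pd

# Synthetic data: two columns share a common signal, the third is independent noise.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame(
    {
        "Academic Reputation": base + rng.normal(scale=0.3, size=200),
        "Employer Reputation": base + rng.normal(scale=0.5, size=200),
        "International Faculty": rng.normal(size=200),
    }
)

corr = toy.corr()

# Keep each pair once: k=1 excludes the diagonal and the mirrored lower triangle.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(key=abs, ascending=False)

print(pairs)
```

With the notebook's data, `toy` would be replaced by `df[correlation_columns]`.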

In [32]:
fig_box = px.box(
    df,
    x="Country",
    y="Rank",
    title="Rank Variation Across Countries",
    labels={"Rank": "University Rank", "Country": "Country"},
    template="plotly_dark",
)

fig_box.update_layout(
    plot_bgcolor="black",
    paper_bgcolor="black",
    title="Rank Variation Across Countries",
    xaxis_title="Country",
    yaxis_title="University Rank",
)

fig_box.show()

In [33]:
fig_violin = px.violin(
    df,
    x="Country",
    y="Overall Score",
    box=True,
    points="all",
    title="Overall Score Distribution Across Countries",
    labels={"Overall Score": "Overall Score", "Country": "Country"},
    template="plotly_dark",
)

fig_violin.update_layout(
    plot_bgcolor="black",
    paper_bgcolor="black",
    title="Overall Score Distribution Across Countries",
    xaxis_title="Country",
    yaxis_title="Overall Score",
)

fig_violin.show()

In [34]:
# Missing values in "Staff with PhD", "International Faculty", and "International Students"
# were already imputed in the preprocessing step, so no further filling is needed here.

cluster_columns = [
    "Citations per Paper",
    "Papers per Faculty",
    "Faculty Student Ratio",
    "International Research Center",
    "International Faculty",
    "Employer Reputation",
]
df_cluster = df[cluster_columns]

# Standardize the data
scaler = StandardScaler()
df_cluster_scaled = scaler.fit_transform(df_cluster)

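# Perform KMeans clustering; n_init is set explicitly because its default differs across scikit-learn versions
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
df["Cluster"] = kmeans.fit_predict(df_cluster_scaled)

# Plot the clusters; labels are cast to strings on a copy so Plotly uses discrete colors
fig_cluster = px.scatter(
    df.assign(Cluster=df["Cluster"].astype(str)),
    x="Citations per Paper",
    y="Papers per Faculty",
    color="Cluster",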
    title="Cluster Analysis of Universities Based on Research Metrics",
    labels={
        "Citations per Paper": "Citations per Paper",
        "Papers per Faculty": "Papers per Faculty",
    },
    template="plotly_dark",
)

fig_cluster.update_layout(
    plot_bgcolor="black",
    paper_bgcolor="black",
    title="Cluster Analysis of Universities Based on Research Metrics",
    xaxis_title="Citations per Paper",
    yaxis_title="Papers per Faculty",
)

fig_cluster.show()
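
The clustering above uses a fixed number of clusters. One common way to sanity-check that choice is to compare silhouette scores across several values of `k`. The sketch below does this on synthetic blobs rather than on the notebook's data, so the chosen `k` here says nothing about the universities themselves.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic data: three well-separated blobs in two dimensions.
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
features = np.vstack([c + rng.normal(scale=1.0, size=(50, 2)) for c in centers])

scaled = StandardScaler().fit_transform(features)

# Fit KMeans for each candidate k and record the mean silhouette score.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(scaled)
    scores[k] = silhouette_score(scaled, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("Best k by silhouette:", best_k)
```

Applied to the notebook, `scaled` would be `df_cluster_scaled`; a higher silhouette score suggests tighter, better-separated clusters.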

# 📉 **Performance vs Faculty-Student Ratio**

In [35]:
fig_scatter = px.scatter(
    df,
    x="Faculty Student Ratio",
    y="Overall Score",
    color="Country",
    hover_name="University Name",
    title="Performance vs Faculty-Student Ratio",
    labels={
        "Faculty Student Ratio": "Faculty-Student Ratio",
        "Overall Score": "Overall Score",
    },
    template="plotly_dark",
)

fig_scatter.update_layout(plot_bgcolor="black", paper_bgcolor="black")

fig_scatter.show()

In [36]:
best_performing_universities = df.sort_values(by="Overall Score", ascending=False).head(20)
best_performing_universities_low_faculty_ratio = best_performing_universities[
    best_performing_universities["Faculty Student Ratio"]
    < best_performing_universities["Faculty Student Ratio"].median()
]

fig_best_performing = px.scatter(
    best_performing_universities_low_faculty_ratio,
    x="Faculty Student Ratio",
    y="Overall Score",
    text="University Name",
    color="Country",
    title="Best-Performing Universities with Low Faculty-Student Ratios",
    labels={
        "Faculty Student Ratio": "Faculty-Student Ratio",
        "Overall Score": "Overall Score",
    },
    template="plotly_dark",
)

fig_best_performing.update_layout(plot_bgcolor="black", paper_bgcolor="black")

fig_best_performing.show()

# 🔍 **Analysis of QS Ranking Variables**

In [37]:
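# Keep only the QS indicators: drop the text columns, "Ordinal Rank" (which mirrors "Rank"),
# and the "Cluster" labels added earlier (errors="ignore" covers the case where clustering was skipped)
numeric_df = df.drop(
    ["University Name", "City", "Country", "Ordinal Rank", "Cluster"],
    axis=1,
    errors="ignore",
)
ranking_influence = numeric_df.corr()["Rank"].drop("Rank")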

fig_ranking_influence = px.bar(
    ranking_influence,
    x=ranking_influence.index,
    y=ranking_influence.values,
    title="Factors Influencing QS Rank",
    labels={"x": "Feature", "y": "Correlation with Rank"},
    template="plotly_dark",
)

fig_ranking_influence.update_layout(plot_bgcolor="black", paper_bgcolor="black")

fig_ranking_influence.show()
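
When reading correlations with rank, keep in mind that rank 1 is the best position, so a negative correlation means higher indicator scores go with better ranks. A small sketch on hypothetical toy values shows how flipping the sign makes the ordering easier to read:

```python
import pandas as pd

# Hypothetical toy data; values are illustrative only.
toy = pd.DataFrame(
    {
        "Rank": [1, 2, 3, 4, 5, 6],
        "Academic Reputation": [100.0, 95.0, 90.0, 80.0, 70.0, 60.0],
        "Faculty Student Ratio": [50.0, 70.0, 40.0, 65.0, 45.0, 60.0],
    }
)

corr_with_rank = toy.corr()["Rank"].drop("Rank")

# Flip the sign so larger values mean "associated with a better (lower) rank".
association = (-corr_with_rank).sort_values(ascending=False)
print(association)
```

Correlation describes association only; it does not show that an indicator causes a better rank.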

In [38]:
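# Exclude identifiers, rank columns, and the "Cluster" labels so only QS indicators remain
variance_data = df.drop(
    ["Rank", "Ordinal Rank", "University Name", "City", "Country", "Cluster"],
    axis=1,
    errors="ignore",
).var()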

fig_variance = px.bar(
    variance_data,
    x=variance_data.index,
    y=variance_data.values,
    title="Variance of QS Ranking Variables",
    labels={"x": "Feature", "y": "Variance"},
    template="plotly_dark",
)

fig_variance.update_layout(plot_bgcolor="black", paper_bgcolor="black")

fig_variance.show()

# 🏅 **Top Performing Universities in Specific Categories**

In [39]:
top_employer_reputation = (
    df[["University Name", "Employer Reputation"]]
    .sort_values(by="Employer Reputation", ascending=False)
    .head(10)
)

fig_employer_reputation = px.bar(
    top_employer_reputation,
    x="University Name",
    y="Employer Reputation",
    title="Top 10 Universities by Employer Reputation",
    labels={
        "University Name": "University",
        "Employer Reputation": "Employer Reputation Score",
    },
    template="plotly_dark",
)

fig_employer_reputation.update_layout(plot_bgcolor="black", paper_bgcolor="black")

fig_employer_reputation.show()

In [40]:
top_international_students = (
    df[["University Name", "International Students"]]
    .sort_values(by="International Students", ascending=False)
    .head(10)
)

fig_international_students = px.bar(
    top_international_students,
    x="University Name",
    y="International Students",
    title="Top 10 Universities with the Highest Percentage of International Students",
    labels={
        "University Name": "University",
        "International Students": "Percentage of International Students",
    },
    template="plotly_dark",
)

fig_international_students.update_layout(plot_bgcolor="black", paper_bgcolor="black")

fig_international_students.show()

In [41]:
top_research_universities = (
    df[["University Name", "Citations per Paper"]]
    .sort_values(by="Citations per Paper", ascending=False)
    .head(10)
)

fig_research_universities = px.bar(
    top_research_universities,
    x="University Name",
    y="Citations per Paper",
    title="Top 10 Universities by Citations per Paper (Research)",
    labels={
        "University Name": "University",
        "Citations per Paper": "Citations per Paper",
    },
    template="plotly_dark",
)

fig_research_universities.update_layout(plot_bgcolor="black", paper_bgcolor="black")

fig_research_universities.show()

In [42]:
top_research_centers = (
    df[["University Name", "International Research Center"]]
    .sort_values(by="International Research Center", ascending=False)
    .head(10)
)

fig_research_centers = px.bar(
    top_research_centers,
    x="University Name",
    y="International Research Center",
    title="Top 10 Universities by International Research Centers",
    labels={
        "University Name": "University",
        "International Research Center": "International Research Centers",
    },
    template="plotly_dark",
)

fig_research_centers.update_layout(plot_bgcolor="black", paper_bgcolor="black")

fig_research_centers.show()

# 📊 **Statistical Analysis**

In [43]:
key_metrics = [
    "Overall Score",
    "Rank",
    "Employer Reputation",
    "Citations per Paper",
    "International Students",
    "Faculty Student Ratio",
]

# describe() already provides count, mean, std, min, quartiles, and max
descriptive_stats = df[key_metrics].describe().T

# Add the remaining summary columns ("Mean" and "Standard Deviation" repeat
# describe()'s "mean" and "std" under more readable names)
descriptive_stats["Mean"] = df[key_metrics].mean()
descriptive_stats["Median"] = df[key_metrics].median()
descriptive_stats["Mode"] = df[key_metrics].mode().iloc[0]
descriptive_stats["Standard Deviation"] = df[key_metrics].std()

In [44]:
fig_desc_stats = px.bar(
    descriptive_stats,
    x=descriptive_stats.index,
    y=['mean', '50%', 'std'],
    title="Descriptive Statistics for Key Metrics",
    labels={'x': 'Metric', 'value': 'Value'},
    template='plotly_dark',
    barmode='group'
)

fig_desc_stats.update_layout(
    plot_bgcolor='black',
    paper_bgcolor='black'
)

fig_desc_stats.show()

In [45]:
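# Universities scoring below 30 that sit outside the top 50. This is a fixed-threshold filter,
# not a statistical outlier test; a numerically high rank means a lower position.
outliers = df[(df['Overall Score'] < 30) & (df['Rank'] > 50)]

fig_outliers = px.scatter(
    outliers,
    x='Overall Score',
    y='Rank',
    hover_data=['University Name', 'Country'],
    title="Low-Scoring Universities Outside the Top 50 (Overall Score < 30)",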
    labels={'Overall Score': 'Overall Score', 'Rank': 'Rank'},
    template='plotly_dark'
)

fig_outliers.update_layout(
    plot_bgcolor='black',
    paper_bgcolor='black'
)

fig_outliers.show()
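
The filter above selects rows using fixed cut-offs. A common statistical alternative is the 1.5 × IQR rule, which flags values far outside the middle 50% of the data. Here is a minimal sketch on a hypothetical toy series (the numbers are invented for the example):

```python
import pandas as pd

# Hypothetical toy scores with one obviously extreme value.
scores = pd.Series([12.0, 14.5, 15.0, 15.5, 16.0, 17.0, 18.5, 95.0], name="Overall Score")

# Interquartile range and the conventional 1.5 * IQR fences.
q1, q3 = scores.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

iqr_outliers = scores[(scores < lower) | (scores > upper)]
print(f"Fences: [{lower:.3f}, {upper:.3f}]")
print(iqr_outliers)
```

The same pattern could be applied per column, for example to `df["Overall Score"]`, to see whether any university falls outside the fences.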

# **Thank You!** 🎉

Thank you for exploring this analysis of university rankings and performance metrics. 🙏

Your interest and time mean a lot! 🚀 I hope you found this project insightful and useful. 😊 If you have any questions or suggestions, feel free to reach out! 💬


---

#### "Success is the sum of small efforts, repeated day in and day out." 🌟

Thank you again, and best of luck in your data science journey! 🎓📚
