In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# World University Analysis (2023)
Welcome to another notebook. Here we use a dataset covering many universities around the world to examine universities in Russia and worldwide, and to get an idea of where it might be worth studying if you want to learn something new in a particular country or field.

The data comes from a ready-made table of the 2023 rankings published on Kaggle.com.
You can view and download the data via this link: https://www.kaggle.com/datasets/tariqbashir/world-university-ranking-2023

---
## Dataset Description
```
World University Rankings 2023 is based upon 1,799 universities across 104 countries and regions based on many (at least 13) performance indicators that measure teaching, research, knowledge transfer, and international outlook. Data was collected from over 2,500 institutions, including survey responses from 40,000 scholars and analysis of over 121 million citations in 15.5 million research publications. The US has the most institutions overall and in the top 200, but China has overtaken Australia for the fourth-highest number of institutions in the top 200. The University of Oxford is ranked first for this year, while the highest new entry is Italy's Humanitas University.
```

---

Comment: After processing all the values and normalizing the table, I end up with more than 2,000 universities (2,345 rows, to be exact), so the phrase "based upon 1,799 universities" makes me wonder where my count goes wrong.
## Import necessary Python libraries


In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
db=pd.read_csv('dataset.csv',encoding='ISO-8859-1') # Without this encoding parameter, a file reading error will occur.
db
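
The `encoding` argument matters because pandas assumes UTF-8 by default. The sketch below illustrates the failure on a tiny made-up file (the university name and the temporary path are hypothetical, not taken from the dataset): a file saved as ISO-8859-1 cannot be decoded as UTF-8, but reads back cleanly once the matching encoding is passed.

```python
import os
import tempfile

import pandas as pd

# Made-up sample with an accented character, as found in many university names.
sample = pd.DataFrame({"Name": ["Université Exemple"], "Rank": [1]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")
    # Save the file in ISO-8859-1 (Latin-1), where "é" is the single byte 0xE9.
    sample.to_csv(path, index=False, encoding="ISO-8859-1")

    try:
        pd.read_csv(path)  # pandas tries UTF-8 by default
        utf8_failed = False
    except UnicodeDecodeError:
        utf8_failed = True

    # Passing the matching encoding reads the file back intact.
    restored = pd.read_csv(path, encoding="ISO-8859-1")

print(utf8_failed)
print(restored.loc[0, "Name"])
```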

## Dataset Comment
Just look at this madness! There are NaN rows and different kinds of values mixed in one column, so everything urgently needs to be normalized.

---
## Table normalization

In [None]:
db.drop(labels=0, inplace=True)
db.reset_index(inplace=True, drop=True)

def find_irregularity(db, full=False):
    '''
    Check whether the whole table follows the same structure
    (a row with a NaN rank on every second row) or is heterogeneous.
    '''
    prev = -1
    count = 0  # renamed from `sum` so the built-in is not shadowed
    for index, row in db.iterrows():
        if pd.isna(row['Rank']):
            if index - prev != 2:
                count += 1
                # To avoid cluttering the output, print only the first two cases unless full=True.
                if count < 3 or full:
                    print('-------------------------')
                    print('Attention, there is heterogeneity in the data.')
                    # Clamp the window so it never runs past the ends of the table.
                    for i in range(max(index - 2, 0), min(index + 3, len(db))):
                        print(i, db['Rank'].iloc[i], db['Name'].iloc[i], index - prev)
            prev = index
    print(f"---------------------\nNumber of heterogeneities:{count}")
    if count == 0:
        print("Congrats, all clear!")

find_irregularity(db)
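
The same gap check can also be sketched without an explicit loop. This is only an illustration on a made-up table (the `toy` frame and its values are assumptions, not rows from the dataset): it finds the positions of the rows with a NaN rank and flags every position whose distance to the previous one is not 2.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the table: university rows carry a rank, country rows do not,
# and one extra "Explore" row breaks the alternation.
toy = pd.DataFrame({
    "Rank": ["1", np.nan, "2", np.nan, np.nan, "3", np.nan],
    "Name": ["Uni A", "Country A", "Uni B", "Country B", "Explore", "Uni C", "Country C"],
})

# Positions of the rows without a rank.
nan_pos = np.flatnonzero(toy["Rank"].isna().to_numpy())

# Distance from each such row to the previous one; -1 plays the role of `prev` before the loop.
gaps = np.diff(nan_pos, prepend=-1)

# Any gap other than 2 marks a break in the university/country alternation.
irregular = nan_pos[gaps != 2]
print(irregular)
```

On the toy table only the `Explore` row is flagged, which is the kind of break `find_irregularity` reports.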

## Conclusion:
As you can see, the data is heterogeneous. Besides the fact that one column contains both the names of universities and their countries, there are also Explore rows that break the sequence, so the table cannot be converted quickly and correctly without damaging its structure.
A careful look at the table suggests that we do not need the Explore rows at all: they carry no useful information. Let's remove them.

In [None]:
db=db[db['Name']!='Explore']
db.reset_index(inplace=True, drop=True)
find_irregularity(db, full=True)

As you can see, there is only one heterogeneity left, and it is very strange. Let's take a closer look at it.

In [None]:
db.iloc[2345:2351]

As you can see, this is essentially an empty row with no data, most likely because the university is not accredited.
> Unaccredited Universities is a list of colleges, universities, and other institutions that do not have the equivalent of regional academic accreditation. Some of these institutions may have legal authority to enroll students and grant degrees, but do not have regional academic accreditation for various reasons.

You can find unaccredited universities using this link: https://www.scholaro.com/unaccredited-universities/

In our case, we will simply remove all such rows if they exist (there seems to be only one).
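
Before removing anything, the guess that there is only one such row can be checked by counting the matches. A minimal sketch on a made-up table (the `toy` frame is an assumption; only the `Name` column and the `Not accredited` text come from the notebook):

```python
import pandas as pd

# Made-up stand-in for the table after the "Explore" rows are gone.
toy = pd.DataFrame({
    "Rank": ["1", None, "2", None, None],
    "Name": ["Uni A", "Country A", "Uni B", "Country B", "Not accredited"],
})

# Boolean mask of the rows whose Name is the placeholder text.
is_placeholder = toy["Name"].eq("Not accredited")
n_placeholders = int(is_placeholder.sum())
print(n_placeholders)

# Keep everything else and renumber the rows from 0.
cleaned = toy[~is_placeholder].reset_index(drop=True)
print(cleaned)
```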

In [None]:
db=db[db['Name']!='Not accredited']
db.reset_index(inplace=True, drop=True)
find_irregularity(db, full=True) 

In [None]:
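countries=[]
indexes=[]
# After the cleanup, rows alternate: even index = university, odd index = its country.
for index, row in db.iterrows():
    if index%2==1:
        countries.append(row['Name'])  # the country name sits in the Name column
        indexes.append(index)
# Drop the country rows; their names are kept in `countries` for the next cell.
db.drop(labels=indexes, inplace=True)
db.reset_index(inplace=True, drop=True)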

In [None]:
db.insert(2,'Country',countries)
db
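
The split into universities and countries can also be sketched with positional slicing instead of a loop. This is an alternative illustration on a made-up table (the `toy` frame and its `Score` column are assumptions), and it relies on the rows alternating strictly, which is what the checks above establish.

```python
import pandas as pd

# Made-up stand-in: even rows are universities, odd rows are their countries.
toy = pd.DataFrame({
    "Rank": ["1", None, "2", None],
    "Name": ["Uni A", "Country A", "Uni B", "Country B"],
    "Score": ["90.0", None, "85.5", None],
})

# Odd positions hold the country names.
countries = toy["Name"].iloc[1::2].tolist()

# Even positions hold the university rows; renumber them from 0.
universities = toy.iloc[::2].reset_index(drop=True)

# Place the new column at position 2, right after "Name".
universities.insert(2, "Country", countries)
print(universities)
```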

---
## Errors in the text

You may have already noticed that some values look a bit strange. Here is what is wrong with the data:

1. Values in some columns, for example No. of FTE Students, are written with commas, so they are not recognized as numbers. Let's confirm this by checking the data type of every column.

In [None]:
db.dtypes

As you can see, only the 'No. of students per staff' column has no problems (so far). Let's convert the rest of the data to a proper form!
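
A minimal sketch of such a conversion, on made-up values in the same style as the No. of FTE Students column. It assumes the comma is a thousands separator; that is an assumption about the data, and if the comma were a decimal mark it would have to be replaced with a period instead.

```python
import pandas as pd

# Made-up values written the same way as the problem column.
toy = pd.DataFrame({"No. of FTE Students": ["20,965", "2,237", "11,085"]})

# Assumption: the comma is a thousands separator, so it can simply be dropped.
cleaned = toy["No. of FTE Students"].str.replace(",", "", regex=False)

# errors="coerce" turns anything that still is not a number into NaN instead of raising.
toy["No. of FTE Students"] = pd.to_numeric(cleaned, errors="coerce")
print(toy.dtypes)
```

After this step `db.dtypes` should report a numeric type for the column instead of `object`.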

P.S. Honestly, whoever collected the data in this form made the processing needlessly hard. Running it through Power Query first would have avoided these problems.

