## Data Cleaning Process for the Scraped Data
In this notebook, we perform a series of data cleaning steps on the universities.csv dataset. The goal is to prepare the data for further analysis by addressing issues related to missing values, data types, and inconsistent formats. Below is a detailed explanation of each step in the data cleaning process.

## Import Libraries and Load Data

In [27]:
import pandas as pd

In [28]:
# Read the data into a df
df = pd.read_csv('universities.csv', encoding='latin-1')

In [29]:
# check basic summary of the df
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 69 entries, 0 to 68
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Country     69 non-null     object
 1   University  69 non-null     object
 2   Founded     69 non-null     object
 3   Type        69 non-null     object
 4   Enrollment  69 non-null     object
 5   Link        69 non-null     object
dtypes: object(6)
memory usage: 3.4+ KB


## Data exploration

In [30]:
# inspect the values in 'founded' column
df['Founded'].values

array(['1957', '1978', '1962', '1821', '1958', '1365', '1992', '1921',
       '1425', '1949', '1982', '1970', '1888', '1827', '1960', '1867',
       '1940', '1669', '1928', '1538', '1950', '1640', '1939', '1998',
       '1837', '1676', '1937', '1911', '1985', '1984', '1982', '1834',
       '1956', '1303', '1889', '1970', '1999', '1579', '2000', '1956',
       '1551', '1959', '1632', '1883', '1943', '1996', '2002', '1551',
       '1904', '1816', '1288', '1991', '1992', '1976', '1905', '1919',
       '1873', '1972', '1972', '1666', '1833[34]', '1928', '1971', '1960',
       '1958', '1969', '1876', '1949', '1721'], dtype=object)

One entry, '1833[34]', carries a trailing citation marker left over from scraping; every other value is a plain four-digit year. The next step strips the marker.
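
Scanning the printed array by eye works for 69 rows, but the same check can be done programmatically. The following is a minimal sketch, assuming a small sample copied from the output above rather than the full dataset; the variable names `founded`, `is_plain_year`, and `irregular` are illustrative only.

```python
import pandas as pd

# Toy sample copied from the printed array; not the full dataset.
founded = pd.Series(['1957', '1666', '1833[34]', '1928'])

# True where the whole string is exactly four digits.
is_plain_year = founded.str.fullmatch(r"\d{4}")

# Keep only the entries that break the expected format.
irregular = founded[~is_plain_year]
print(irregular.tolist())  # ['1833[34]']
```

Applied to `df['Founded']`, the same mask would list every value that needs cleaning before the conversion to integers.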

## Extract Year from 'Founded' Column

In [31]:
# Extract the first four-digit number from each 'Founded' value to
# standardize the founding years (e.g. '1833[34]' becomes '1833').
df['Founded'] = df['Founded'].str.extract(r"(\d{4})", expand=False)

In [32]:
df['Founded'].values

array(['1957', '1978', '1962', '1821', '1958', '1365', '1992', '1921',
       '1425', '1949', '1982', '1970', '1888', '1827', '1960', '1867',
       '1940', '1669', '1928', '1538', '1950', '1640', '1939', '1998',
       '1837', '1676', '1937', '1911', '1985', '1984', '1982', '1834',
       '1956', '1303', '1889', '1970', '1999', '1579', '2000', '1956',
       '1551', '1959', '1632', '1883', '1943', '1996', '2002', '1551',
       '1904', '1816', '1288', '1991', '1992', '1976', '1905', '1919',
       '1873', '1972', '1972', '1666', '1833', '1928', '1971', '1960',
       '1958', '1969', '1876', '1949', '1721'], dtype=object)
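
To see in isolation what the extraction does, here is a small sketch on three values taken from the array above. It assumes `expand=False` so that `str.extract` returns a Series rather than a one-column DataFrame; `sample` and `years` are hypothetical names.

```python
import pandas as pd

# Three values from the 'Founded' column, including the irregular one.
sample = pd.Series(['1666', '1833[34]', '1928'])

# Capture the first run of four digits in each string.
years = sample.str.extract(r"(\d{4})", expand=False)
print(years.tolist())  # ['1666', '1833', '1928']

# The extracted strings can then be converted to integers.
print(years.astype(int).tolist())  # [1666, 1833, 1928]
```

Note that the pattern takes the first four consecutive digits it finds, so it relies on the year appearing before any other digits in the string.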

In [33]:
# Convert the values in the 'Founded' column from strings to int
df['Founded'] = df['Founded'].astype(int)

In [34]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 69 entries, 0 to 68
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Country     69 non-null     object
 1   University  69 non-null     object
 2   Founded     69 non-null     int64 
 3   Type        69 non-null     object
 4   Enrollment  69 non-null     object
 5   Link        69 non-null     object
dtypes: int64(1), object(5)
memory usage: 3.4+ KB


## Handle Missing Values in 'Link' Column

In [35]:
# Filter the df to identify any rows with a missing link
df[df['Link'].isnull()]

Unnamed: 0,Country,University,Founded,Type,Enrollment,Link

The result is empty, so no rows are missing a link and nothing needs to be filled or dropped.
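
The same kind of check can cover every column in one call. Below is a minimal sketch on a made-up three-row frame (the rows and `example.org` URLs are placeholders, not data from universities.csv) that counts missing values per column.

```python
import pandas as pd

# Hypothetical toy frame with one deliberately missing link.
toy = pd.DataFrame({
    'University': ['A', 'B', 'C'],
    'Link': ['https://example.org/a', None, 'https://example.org/c'],
})

# Count missing entries in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column.to_dict())  # {'University': 0, 'Link': 1}
```

On the real `df`, every count would be expected to be zero, matching the non-null counts reported by `df.info()`.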


## Clean the 'Enrollment' Column

In [36]:
# Inspect the 'Enrollment' Column
df['Enrollment'].values

array(['35,000', '85,000', '29,827', '311,175', '73,807', '91,000',
       '2,097,182', '25,500', '55,484', '30,866', '32,000', '215,000',
       '18 911', '84,000', '80,000', '53,581', '42,750', '72,480',
       '43,600', '170,530', '170,000', '36,500', '350,000[8]', '130,000',
       '104,000', '124,000', '32,000', '13,782', '7,140,000',
       '1,045,665 (total) as per 2019/2020[14]\n311,028 (active) as per 4 November 2020 [15]',
       '1,000,000', '32,900', '26,023', '112,564', '70,667', '84,000',
       '41,833', '23,606', '10,373', '100,000', '349,515', '604,437',
       '31,186', '33,050', '50,000', '39,000', '165,000', '37,032',
       '68,249', '44,400', '48,100', '311,928', '200,000', '177,234',
       '38,300', '48,821', '328,179', '173,758', '260,079', '30,646',
       '26,356', '31,758', '525,000', '60,000+', '1,969,733', '253,075',
       '73,284', '144,108', '41,059'], dtype=object)

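The output reveals several formatting issues: a space used as a thousands separator ('18 911'), citation markers ('350,000[8]'), a trailing plus sign ('60,000+'), and one multi-line entry that reports two figures with explanatory text.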

In [37]:
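# Treat spaces as thousands separators so that '18 911' becomes '18,911'
df['Enrollment'] = df['Enrollment'].str.replace(' ', ',', regex=False)
# Keep the first comma-grouped number; the '+' after the group lets values with
# more than one comma (e.g. '2,097,182') match in full instead of being truncated
df['Enrollment'] = df['Enrollment'].str.extract(r"(\d{1,3}(?:,\d{3})+)", expand=False)
df['Enrollment'].values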
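
Because no output is shown for this cell, it is worth checking the extraction on the awkward values directly. The sketch below uses values copied from the array above and a pattern that allows one or more comma groups; a pattern that allows only a single group would cut '2,097,182' down to '2,097'. The names `raw`, `extracted`, `enrollment`, and `truncated` are illustrative.

```python
import pandas as pd

# Awkward values copied from the 'Enrollment' output above.
raw = pd.Series([
    '35,000', '18 911', '350,000[8]', '60,000+', '2,097,182',
    '1,045,665 (total) as per 2019/2020[14]\n311,028 (active) as per 4 November 2020 [15]',
])

# Normalize space separators, then keep the first comma-grouped number.
with_commas = raw.str.replace(' ', ',', regex=False)
extracted = with_commas.str.extract(r"(\d{1,3}(?:,\d{3})+)", expand=False)

# Drop the commas and convert to integers.
enrollment = extracted.str.replace(',', '', regex=False).astype(int)
print(enrollment.tolist())
# [35000, 18911, 350000, 60000, 2097182, 1045665]

# For comparison: a single comma group truncates seven-digit values.
truncated = pd.Series(['2,097,182']).str.extract(r"(\d{1,3}(?:,\d{1,3}))", expand=False)
print(truncated.tolist())  # ['2,097']
```

For the multi-line entry, this keeps the first figure (the 2019/2020 total) and discards the later "active" figure; whether that is the right choice depends on the analysis.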

In [40]:
# Remove commas from the Enrollment column
df['Enrollment'] = df['Enrollment'].str.replace(',', '', regex=False)

In [41]:
df.head()

Unnamed: 0,Country,University,Founded,Type,Enrollment,Link
0,Albania,University of Tirana,1957,Public,35000,https://en.wikipedia.org/wiki/University_of_Ti...
1,Algeria,Constantine University,1978,Public,85000,https://en.wikipedia.org/wiki/List_of_universi...
2,Angola,Agostinho Neto University,1962,Public,29827,https://en.wikipedia.org/wiki/Agostinho_Neto_U...
3,Argentina,University of Buenos Aires,1821,Public,311175,https://en.wikipedia.org/wiki/University_of_Bu...
4,Australia,Monash University,1958,Public,73807,https://en.wikipedia.org/wiki/Monash_University


In [42]:
# convert the cleaned enrollment numbers from strings to int.
df['Enrollment'] = df['Enrollment'].astype(int)

In [43]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 69 entries, 0 to 68
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Country     69 non-null     object
 1   University  69 non-null     object
 2   Founded     69 non-null     int64 
 3   Type        69 non-null     object
 4   Enrollment  69 non-null     int64 
 5   Link        69 non-null     object
dtypes: int64(2), object(4)
memory usage: 3.4+ KB


## Save Cleaned Data

In [44]:
df.to_csv('universities_clean.csv', index=False)
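
A quick round trip can confirm that the saved file reloads with integer columns and no extra index column. This is a sketch on a two-row toy frame written to a temporary directory (the file name `toy_clean.csv` is hypothetical), so it does not touch universities_clean.csv.

```python
import os
import tempfile

import pandas as pd

# Two rows mirroring the cleaned 'Founded' and 'Enrollment' columns.
toy = pd.DataFrame({'Founded': [1957, 1978], 'Enrollment': [35000, 85000]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_clean.csv')
    # index=False keeps the row index out of the file.
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(list(reloaded.columns))  # ['Founded', 'Enrollment']
print(reloaded['Enrollment'].tolist())  # [35000, 85000]
```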

## Conclusion
In this notebook, we cleaned the universities.csv dataset by stripping citation markers and other stray characters from the Founded and Enrollment columns, converting both to integers, and confirming that the Link column has no missing values. The cleaned dataset is saved as universities_clean.csv and is ready for further analysis.