# 2. Cleaning raw data

First, let's remove any rows we don't need.

We'll use the Complete University Guide data. For simplicity, I saved the DataFrame from the previous section as a CSV file using:

    df.to_csv("../data/CompleteUG.csv", index=False)
    
The source is a web page first and foremost; it isn't optimised for Data Engineers hacking their way through Pandas!

As a result, extra features can end up in the scraped table: adverts, columns used for navigation, and poorly named columns that don't look the way they do in the original DOM.
    
---

## Imports

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv("../data/CompleteUG.csv")
df.head()

---

## Dropping columns we don't need

Notice how the first and last columns don't seem to contain anything useful? Let's get rid of them for clarity!

Normally you might want to review more than the first 5 rows before jumping to this conclusion... On this occasion, trust me for now...
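
One way to check more than the first few rows is to test every column across the whole DataFrame. The sketch below uses a small hypothetical frame (`sample`, with made-up values) rather than the real data, and flags columns that are entirely missing or hold a single repeated value. Whether the real "Unnamed: 0" and "Next Steps" columns behave like this is an assumption you should verify on your own copy.

```python
import pandas as pd

# Hypothetical miniature of the scraped table; the values are illustrative only.
sample = pd.DataFrame({
    "Unnamed: 0": [None, None, None],
    "Rank": ["1", "2", "3"],
    "Next Steps": ["View courses", "View courses", "View courses"],
})

# Columns where every value is missing carry no information.
all_missing = [col for col in sample.columns if sample[col].isna().all()]

# Columns with at most one distinct non-missing value are also unlikely to be useful.
constant = [col for col in sample.columns if sample[col].nunique(dropna=True) <= 1]

print("Entirely missing:", all_missing)
print("Constant or empty:", constant)
```

Running the same two checks against `df` gives a firmer basis for dropping a column than eyeballing `df.head()`.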

In [None]:
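df.drop(columns=["Unnamed: 0", "Next Steps"], inplace=True)
# inplace=True modifies df directly, so we don't need to reassign the result
# (pandas may still build a copy internally, so don't rely on it to save memory)
# columns=[...] already targets columns, so a separate axis=1 argument is redundant
# axis 0 = rows
# axis 1 = columns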

df.head()
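
If you prefer not to modify the DataFrame in place, reassignment does the same job. This is a minimal sketch on a hypothetical two-row frame (`demo`), showing that `drop` returns a new DataFrame by default and leaves the original untouched.

```python
import pandas as pd

# Hypothetical two-row frame standing in for the real data.
demo = pd.DataFrame({
    "Unnamed: 0": [None, None],
    "Rank": ["1", "2"],
    "Next Steps": ["x", "y"],
})

# Without inplace=True, drop() returns a new DataFrame and demo is unchanged.
trimmed = demo.drop(columns=["Unnamed: 0", "Next Steps"])

print(list(demo.columns))
print(list(trimmed.columns))
```

Writing `df = df.drop(columns=[...])` is therefore equivalent in effect to the `inplace=True` call, and it chains more easily with other methods.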

That's better. But the column names are still rubbish, so let's rename them!

In [None]:
# Provide a map of what to rename in the form of a dict
cols = {"Rank": "Rank", 
        "Rank.1": "Change_in_Rank", 
        "University Name": "University_Name",
        "Entry Standards Entry Standards The average UCAS tariff score of new students entering the university. Read more": "Entry_Standards", 
        "Student Satisfaction Student Satisfaction A guide to how satisfied students are with the quality of teaching they receive. Maximum score: 5.00 Read more": "Student_Satisfaction", 
        "Research Quality Research Quality A measure of the quality of the research undertaken in the university. Maximum score: 4.00 Read more": "Research_Quality", 
        "Graduate Prospects Graduate Prospects A guide to the success of graduates on completion of their courses at the university. Maximum score: 100.0 Read more": "Graduate_Prospects", 
        "Overall Score Overall Score The total score calculated by our independent and trusted methodology, comprising entry standards, student satisfaction, research assessment (quality and intensity), graduate prospects, student–staff ratio, academic services spend, facilities spend, good honours, and degree completion. Maximum score: 1000 Read more": "Overall_Score",
       }
df.rename(columns=cols, inplace=True)
df.head()
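
Typing out the full scraped headings is fragile, because any change to the tooltip text on the site breaks the mapping. A possible alternative, sketched here under the assumption that each long heading still starts with its short label, is to pass a function to `rename`. The `short_labels` list and the `shorten` helper are hypothetical names introduced for this illustration.

```python
import pandas as pd

# Short labels we expect the long headings to start with (assumed, not exhaustive).
short_labels = ["Entry Standards", "Student Satisfaction"]

def shorten(column_name: str) -> str:
    """Map a long scraped heading to a short snake-style name."""
    for label in short_labels:
        if column_name.startswith(label):
            return label.replace(" ", "_")
    # Fall back to replacing spaces so names are easier to type.
    return column_name.replace(" ", "_")

# Empty hypothetical frame with two of the scraped headings.
demo = pd.DataFrame(columns=[
    "University Name",
    "Entry Standards Entry Standards The average UCAS tariff score of new students entering the university. Read more",
])

# rename() accepts a callable that is applied to every column name.
demo = demo.rename(columns=shorten)
print(list(demo.columns))
```

Names such as "Rank.1" carry no hint of their meaning, so they would still need an explicit entry in a dict like the one above.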

---

## More Cleaning...

OK, let's keep exploring...

In [None]:
# Let's check the size of the DataFrame. If it's massive, we probably don't want to pull it all back into the notebook!
df.shape

In [None]:
# look at the bottom of the DataFrame...
df.tail()

...133 rows (0-132) but only 131 ranks? Something is awry...

In [None]:
# As a starting point, let's find rows where the "Rank" is clearly weird
df.head(20)

In [None]:
# List comprehension example - loop through the first 20 values in the Rank column and check whether the first character is a digit
[str(x)[0].isdigit() for x in df["Rank"].head(20)]

In [None]:
# check the erroneous rows - note the "not" negation operator
df[[ not str(x)[0].isdigit() for x in df["Rank"]]]

In [None]:
# Rebuild the dataframe excluding the erroneous rows...
df = df[[ str(x)[0].isdigit() for x in df["Rank"]]] 

In [None]:
df.shape
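
The list comprehension works, but the same filter can also be written with pandas' vectorised string methods. The sketch below runs on a hypothetical `demo` frame containing one made-up advert row; it builds a boolean mask and also resets the index so the row labels are contiguous again after filtering.

```python
import pandas as pd

# Hypothetical Rank column with one non-numeric row, as an advert might produce.
demo = pd.DataFrame({"Rank": ["1", "2", "ADVERT", "3"]})

# Take the first character of each value and test whether it is a digit.
is_ranked = demo["Rank"].astype(str).str[0].str.isdigit()

# Keep only the ranked rows and renumber the index from zero.
clean = demo[is_ranked].reset_index(drop=True)
print(clean)
```

Keeping the mask in a named variable such as `is_ranked` makes it easy to inspect the rejected rows with `demo[~is_ranked]` before discarding them.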

In [None]:
# Onto transformations...

_**N.B.** In reality you probably want more structure in your "data" subfolder, e.g. /data/raw, /data/processed, /data/train, but this is just a **simple** demo..._