# Creating our HH!
## Step 1: Data 🗃️!
We are going to split this step into multiple sub-steps 📄:
1. ⛏️**Web Scraping**.
2. 🐼**Data Transformation**.
3. 🧹**Data cleansing**.
4. 📤**Export the Data**.
---

# 🤔 Pre-coding
Let's define and import some useful things before we start coding

In [1]:
# Important imports
import requests as req # Library for HTTP requests (allows you to send HTTP requests extremely easily): https://pypi.org/project/requests/
from bs4 import BeautifulSoup # Python Library for pulling data out of HTML files (or in this case, web page): https://www.crummy.com/software/BeautifulSoup/bs4/doc/
import pandas as pd # For Data Analysis and Manipulation in Python: https://pandas.pydata.org/
import numpy as np # For matrix and array operations
import unicodedata # For normalizing in function below

In [2]:
# Normalize text function
def normalize_text(word):
    """
    This function takes a string 'word' and returns it normalized (upper case, trimmed, no accents)
    Example: hElLó wÓrlD = HELLO WORLD
    """
    word = str(word) # Making sure this is a string
    upper_word = word.upper() # Only upper case letters
    striped_word = upper_word.strip() # No spaces at the beginning or end of the word
    # No accents -> https://docs.python.org/3/library/unicodedata.html#unicodedata.normalize
    normalized_word = ''.join([
        letter for letter in unicodedata.normalize('NFD', striped_word)
        if unicodedata.category(letter) != 'Mn'
    ])

    return normalized_word

In [3]:
normalize_text('hElLó wÓrlD')

'HELLO WORLD'
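
To see why dropping the `Mn` category removes the accents, it can help to look at what NFD decomposition does to a single letter. This is just a small sketch with the standard library; the variable names are mine and are not part of the notebook:

```python
import unicodedata

# NFD splits an accented letter into its base letter plus a combining mark.
decomposed = unicodedata.normalize('NFD', '\u00f3')  # '\u00f3' is 'ó'

for char in decomposed:
    # 'Ll' = lowercase letter, 'Mn' = nonspacing mark (the accent itself)
    print(unicodedata.category(char), unicodedata.name(char))

# Dropping every 'Mn' character leaves only the base letter.
without_accents = ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')
print(without_accents)
```

The accent becomes its own character (COMBINING ACUTE ACCENT), so filtering by category is enough to get rid of it.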

# ⛏️Web Scraping
Web scraping is the process of extracting data from a web page. In this case we are going to use my [college page](http://www.dci.ugto.mx/estudiantes/index.php/mcursos/horarios-licenciatura) 🤗

In [4]:
# Request to the page function
def request_url(url):
    res = req.get(url)
    if res.status_code not in range(200, 300):
        raise Exception("Something went wrong", res.status_code)
    else:
        print(res.status_code, "- Everything is fine 🔥")
    return res

In [5]:
# Check our request function
url = 'http://www.dci.ugto.mx/estudiantes/index.php/mcursos/horarios-licenciatura'
res = request_url(url)
content = res.content
print(content[:15])

200 - Everything is fine 🔥
b'<!DOCTYPE html>'


In [6]:
# We need a beautiful soup object, not html in a string
soup = BeautifulSoup(content, 'html.parser')
# The schedules are in a table, so let's grab it (it's the second one on the page)
all_tables = soup.find_all('table')
schedule_table = all_tables[1] # 👈 THIS IS OUR TABLE
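
If you want to try the "pick the second table" idea without hitting the real page, here is a minimal sketch on a made-up HTML string. The HTML and the names `html`, `tables`, `second_table` and `rows` are assumptions for illustration only:

```python
from bs4 import BeautifulSoup

# Hypothetical page with two tables; only the second one holds the data we want.
html = """
<table><tr><td>menu</td></tr></table>
<table>
  <tr><td>#</td><td>NAME</td></tr>
  <tr><td>1</td><td>ALGEBRA LINEAL</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
tables = soup.find_all('table')   # every <table>, in document order
second_table = tables[1]          # index 1 = second table

# Read each row as a list of cell texts
rows = [
    [td.get_text(strip=True) for td in tr.find_all('td')]
    for tr in second_table.find_all('tr')
]
print(rows)
```

Picking a table by position is fragile: if the page layout changes, index 1 may point to something else, so it is worth checking the first row after every scrape.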

# 🐼**Data Transformation**: We need to convert the data to pandas objects before cleaning

In [7]:
# We need to check how many columns there are in http://www.dci.ugto.mx/estudiantes/index.php/mcursos/horarios-licenciatura
# So we need every data cell in the first row
column_row = schedule_table.find_all('tr')[0] # 1st row
# Every data cell
td_with_column_names = column_row.find_all('td')
# Let's have a look at these column names
for td in td_with_column_names:
    print(td.text.replace('\n', ''), end = " | ")
print()
# I am going to replace every column name with a shorter, normalized one
column_names = ['index_in_page', 'NAME', 'GROUP', 'DAY/TIME/ROOM1', 'DAY/TIME/ROOM2', 'DAY/TIME/ROOM3', 'DAY/TIME/ROOM4', 'PROFESSORS'] # 👈 THIS IS OUR COLUMN NAMES LIST
for name in column_names:
    print(name, end = ' | ')

# Compare the number of columns to make sure they match
assert len(column_names) == len(td_with_column_names), "The lengths don't match! CORRECT IT!"

# | UNIDAD DE APRENDIZAJE | GRUPO | DÍA/HORA/AULA | DÍA/HORA/AULA | DÍA/HORA/AULA | DÍA/HORA/AULA | PROFESOR (A) | 
index_in_page | NAME | GROUP | DAY/TIME/ROOM1 | DAY/TIME/ROOM2 | DAY/TIME/ROOM3 | DAY/TIME/ROOM4 | PROFESSORS | 

In [8]:
# Now we need to get the actual data IN THE ROWS
all_rows = schedule_table.find_all('tr')
schedules = []
for row in all_rows[1:]: # Starts in 1 because the 0th is the row with the column names
    tds = row.find_all('td') # Need the data of each row
    # We are going to create a dictionary for each row
    d_row = {}
    for index, column in enumerate(column_names):
        d_row[column] = tds[index].text # {column: tds[index]}
        
    # And we need to add it to the schedules list
    schedules.append(d_row)
schedules[0] # 👈 OUR SCHEDULES IN A PSEUDO-JSON FORMAT

{'index_in_page': '1',
 'NAME': 'ACABADO DEL CUERO',
 'GROUP': 'A',
 'DAY/TIME/ROOM1': 'SÁBADO/9-13/LAB. DE CURTIDURÍA, EDIF. G',
 'DAY/TIME/ROOM2': '\xa0',
 'DAY/TIME/ROOM3': '\xa0',
 'DAY/TIME/ROOM4': '\xa0',
 'PROFESSORS': 'JUAN FRANCISCO RAYAS ROJAS'}
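
The inner loop that fills `d_row` pairs each column name with the cell in the same position. The same pairing can be sketched with `zip`; the short lists below are hypothetical stand-ins for `column_names` and the texts of one row:

```python
# Hypothetical column names and cell texts for a single table row
column_names = ['index_in_page', 'NAME', 'GROUP']
cell_texts = ['1', 'ACABADO DEL CUERO', 'A']

# zip pairs each column name with the cell in the same position
row = dict(zip(column_names, cell_texts))
print(row)
```

Keep in mind that `zip` silently stops at the shorter list, so a row with missing cells would not raise an error the way `tds[index]` does.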

In [9]:
# And now we can create a pandas data frame
raw_schedules_df = pd.DataFrame(schedules, columns = column_names)
raw_schedules_df.head(3)

Unnamed: 0,index_in_page,NAME,GROUP,DAY/TIME/ROOM1,DAY/TIME/ROOM2,DAY/TIME/ROOM3,DAY/TIME/ROOM4,PROFESSORS
0,1,ACABADO DEL CUERO,A,"SÁBADO/9-13/LAB. DE CURTIDURÍA, EDIF. G",,,,JUAN FRANCISCO RAYAS ROJAS
1,2,ADMINISTRACIÓN Y MANEJO DE PERSONAL,A,SÁBADO/9-12/F1,,,,ALDELMO EMMANUEL ISRAEL REYES PABLO
2,3,ÁLGEBRA LINEAL,A,MARTES/8-10/F6,JUEVES/8-10/F6,,,MIGUEL ÁNGEL VALLEJO HERNÁNDEZ


# 🧹**Data cleansing**: Let's clean the data and correct some errors
I am going to list each error as a subtitle in the following cells

## ❌ 1. Accents and upper letters
We need to normalize every field (we could have done this earlier, in the data extraction part, but I think it is better to keep the steps separate)

In [10]:
# Function for normalizing the df with help of our function in 🤔 Pre-coding part: "normalize_text"
def normalize_df(column):
    normalized_column = [] # A list for the normalized column
    for cell in column:
        normalized_column.append(normalize_text(cell))

    return normalized_column

In [11]:
normalized_data = raw_schedules_df.apply(normalize_df)
normalized_data # 👈 OUR DATA NORMALIZED

Unnamed: 0,index_in_page,NAME,GROUP,DAY/TIME/ROOM1,DAY/TIME/ROOM2,DAY/TIME/ROOM3,DAY/TIME/ROOM4,PROFESSORS
0,1,ACABADO DEL CUERO,A,"SABADO/9-13/LAB. DE CURTIDURIA, EDIF. G",,,,JUAN FRANCISCO RAYAS ROJAS
1,2,ADMINISTRACION Y MANEJO DE PERSONAL,A,SABADO/9-12/F1,,,,ALDELMO EMMANUEL ISRAEL REYES PABLO
2,3,ALGEBRA LINEAL,A,MARTES/8-10/F6,JUEVES/8-10/F6,,,MIGUEL ANGEL VALLEJO HERNANDEZ
3,4,ALGEBRA LINEAL,B,LUNES/12-14/F2,MIERCOLES/12-14F2,,,TEODORO CORDOVA FRAGA
4,5,ALGEBRA LINEAL AVANZADA,A,LUNES/15-17/G1,MIERCOLES/15-17/G1,,,AZARAEL ADONAY YEBRA PEREZ
...,...,...,...,...,...,...,...,...
236,237,TOPICOS SELECTOS DE ASTRONOMIA,A,MARTES/15-17/C2,JUEVES/15-17/C2,,,CARSTEN HOLTFORT
237,238,TOPICOS SELECTOS DE ASTRONOMIA,B,LUNES/14-16/F9,MIERCOLES/14-16/F9,,,OCTAVIO JOSE OBREGON DIAZ
238,239,TOPICOS SELECTOS DE ENERGIAS ALTERNAS,A,MARTES/17-19/F8,JUVES/17-19/F8,,,ELDER DE LA ROSA CRUZ
239,240,VARIABLE COMPLEJA,A,MIERCOLES/8-10/F7,VIERNES/8-10/F7,,,MARCO ANTONIO REYES SANTOS


## ❌ 2. Professors Together
Some subjects have more than one professor (because of labs or something else), so we need to generate a field for each professor. When this happens, the professors are separated by a '/' character, so we can use that to create the new fields

In [12]:
# Grab the professor list
professor_list = normalized_data['PROFESSORS'].to_list()
# Split all these professors into a matrix
professor_matrix = [professor.split('/') for professor in professor_list]
# We need to re-normalize this matrix because the split can leave unwanted spaces
for i in range(len(professor_matrix)):
    for j in range(len(professor_matrix[i])):
        professor_matrix[i][j] = normalize_text(professor_matrix[i][j])
# As an example, let's look at some multiple professors
professor_matrix[22:24] # One subject with 1 professor and the next one with 3 professors

[['GUSTAVO BASURTO ISLAS'],
 ['GUSTAVO BASURTO ISLAS',
  'EDGAR VAZQUEZ NUNEZ',
  'ARGELIA ROSILLO DE LA TORRE']]

In [13]:
# We need to know how many columns to create, so let's calculate it
max_professors = 0
for professors in professor_matrix:
    max_professors = max(len(professors), max_professors)
    
print("The max of professors per subject is:", max_professors)

The max of professors per subject is: 3


In [14]:
# And we need to pad the professor matrix with blanks so every row has the same length
for i in range(len(professor_matrix)):
    for missing in range(max_professors - len(professor_matrix[i])):
        professor_matrix[i].append('')
        
professor_matrix[22:24]

[['GUSTAVO BASURTO ISLAS', '', ''],
 ['GUSTAVO BASURTO ISLAS',
  'EDGAR VAZQUEZ NUNEZ',
  'ARGELIA ROSILLO DE LA TORRE']]
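
The split, trim and pad steps above can also be condensed into a few lines. This is only a sketch on placeholder names (`PROF A`, `PROF B`, ...), not the real column:

```python
# Hypothetical values standing in for the PROFESSORS column
professors_raw = ['PROF A', 'PROF B / PROF C /PROF D']

# Split on '/' and trim the spaces left around each name
split_rows = [[name.strip() for name in cell.split('/')] for cell in professors_raw]

# Pad every row with '' up to the length of the longest one
width = max(len(row) for row in split_rows)
padded = [row + [''] * (width - len(row)) for row in split_rows]
print(padded)
```

Every row ends up with `width` entries, which is what lets the matrix be assigned to a fixed set of columns later.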

In [15]:
# So new fields are:
professor_columns = [f"PROFESSOR{index + 1}" for index in range(max_professors)]
professor_columns

['PROFESSOR1', 'PROFESSOR2', 'PROFESSOR3']

In [16]:
# Let's create these fields in our df
normalized_data[professor_columns] = professor_matrix
normalized_data[20:24] # Let's check some records in the df 👈 THIS IS OUR CORRECT DF FOR CLEANSING

Unnamed: 0,index_in_page,NAME,GROUP,DAY/TIME/ROOM1,DAY/TIME/ROOM2,DAY/TIME/ROOM3,DAY/TIME/ROOM4,PROFESSORS,PROFESSOR1,PROFESSOR2,PROFESSOR3
20,21,BIOLOGIA CELULAR,A-IV,MARTES/12-15/F1,"LUNES/12-15/LAB. DE BIOLOGIA, EDIF. G",,,SILVIA ALEJANDRA LOPEZ JUAREZ,SILVIA ALEJANDRA LOPEZ JUAREZ,,
21,22,BIOLOGIA CELULAR,B-I,JUEVES/12-15/F1,"MARTES/12-15/LAB. DE BIOLOGIA, EDIF. G",,,GUSTAVO BASURTO ISLAS,GUSTAVO BASURTO ISLAS,,
22,23,BIOLOGIA CELULAR,B-II,JUEVES/12-15/F1,"VIERNES/12-15/LAB. DE BIOLOGIA, EDIF. G",,,GUSTAVO BASURTO ISLAS,GUSTAVO BASURTO ISLAS,,
23,24,BIOLOGIA CONTEMPORANEA,A,LUNES/16-18/SALA DE JUNTAS EDIF. B,JUEVES/16-18/ SALA DE JUNTAS EDIF. B,,,GUSTAVO BASURTO ISLAS / EDGAR VAZQUEZ NUNEZ /A...,GUSTAVO BASURTO ISLAS,EDGAR VAZQUEZ NUNEZ,ARGELIA ROSILLO DE LA TORRE


## ❌ 3. Empty cells
If there are empty cells in fields other than the DAY/TIME/ROOM ones (where empty cells are normal), we need to fix them, because that means something is wrong with the algorithm

In [17]:
# First of all, let's standardize empty and whitespace-only cells to np.nan
normalized_data = normalized_data.replace(r'^\s*$', np.nan, regex=True)
normalized_data.head(3) # 👈 STILL OUR DF

Unnamed: 0,index_in_page,NAME,GROUP,DAY/TIME/ROOM1,DAY/TIME/ROOM2,DAY/TIME/ROOM3,DAY/TIME/ROOM4,PROFESSORS,PROFESSOR1,PROFESSOR2,PROFESSOR3
0,1,ACABADO DEL CUERO,A,"SABADO/9-13/LAB. DE CURTIDURIA, EDIF. G",,,,JUAN FRANCISCO RAYAS ROJAS,JUAN FRANCISCO RAYAS ROJAS,,
1,2,ADMINISTRACION Y MANEJO DE PERSONAL,A,SABADO/9-12/F1,,,,ALDELMO EMMANUEL ISRAEL REYES PABLO,ALDELMO EMMANUEL ISRAEL REYES PABLO,,
2,3,ALGEBRA LINEAL,A,MARTES/8-10/F6,JUEVES/8-10/F6,,,MIGUEL ANGEL VALLEJO HERNANDEZ,MIGUEL ANGEL VALLEJO HERNANDEZ,,
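
The pattern `^\s*$` is what decides which cells count as blank. A quick sketch with the standard `re` module shows what it matches, including the `'\xa0'` (non-breaking space) that the scraped empty cells contained; the sample strings are made up:

```python
import re

blank = re.compile(r'^\s*$')  # empty, or nothing but whitespace

samples = ['', '   ', '\xa0', 'F9', ' F9 ']
for text in samples:
    print(repr(text), bool(blank.match(text)))
```

Only the first three samples match; a cell with real content surrounded by spaces is left alone.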


In [18]:
# Function to detect if there is an empty cell where it should not be.
def detect_empty_cells(df, columns):
    # Save indices in a list 👇
    empty_cells_index = []
    # Loop to find out whether a cell is empty
    for index in range(len(df)):
        for column in columns:
            if pd.isna(df.at[index, column]):
                empty_cells_index.append(index)
                
    return empty_cells_index

In [19]:
# Let's test this function
columns_to_check = ['NAME', 'GROUP', 'DAY/TIME/ROOM1', 'PROFESSORS'] # We only care if these fields are empty because that means something is wrong
empty_rows = detect_empty_cells(normalized_data, columns_to_check)

assert not empty_rows, f"There shouldn't be an empty cell here 👉 {empty_rows}!" # If there are empty rows we need to compare with the link http://www.dci.ugto.mx/estudiantes/index.php/mcursos/horarios-licenciatura
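
As an alternative to looping cell by cell, pandas can do the same check in a vectorized way. This is a sketch on a tiny made-up frame; `df`, `required` and `mask` are names I am assuming here, not part of the notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the schedule table
df = pd.DataFrame({
    'NAME': ['ALGEBRA LINEAL', np.nan, 'VARIABLE COMPLEJA'],
    'GROUP': ['A', 'B', np.nan],
    'DAY/TIME/ROOM2': [np.nan, np.nan, 'VIERNES/8-10/F7'],  # allowed to be empty
})
required = ['NAME', 'GROUP']

# isna() gives a boolean table; any(axis=1) flags rows with at least one missing required value
mask = df[required].isna().any(axis=1)
empty_rows = df.index[mask].tolist()
print(empty_rows)
```

Unlike the loop version, each row index appears only once even if several of its required fields are empty, and it does not depend on the index being 0, 1, 2, ...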

## ❌ 4. Errors in DAY/TIME/ROOM fields
We need to pick a standard format for this field. I chose:

_day/start_hour-end_hour/room_

for example:

_LUNES/14-16/F9_

Lunes is Monday in Spanish and F9 is the classroom name.
In the following code we are going to detect and fix the schedules that do not comply with the format 👇.

In [20]:
# Function to detect a bad format in dates (date by date)
def detect_wrong_dates(date):
    # We're not checking NAN values
    if date == 'NAN' or pd.isna(date):
        return False, None
    # Split for each "/" to obtain every DAY/TIME/ROOM
    date_split = date.split('/')

    # With this we can find 2 possible errors
    # 1. The field does not have exactly 3 parts separated by "/"
    # 2. The start-end time does not have exactly one "-"

    # Detect the 1st
    if len(date_split) != 3:
        return True, 'Slash'

    # Detect the 2nd
    hours = date_split[1] # Because we want to check the time
    if hours and len(hours.split('-')) != 2:
        return True, 'Hour'

    return False, None

In [21]:
# The above 👆 function only works for one date, we need to create a function that runs that 👆 function for all the df
def detect_wrong_dates_in_df(df, date_columns):
    # Iterating over rows
    for index, row in df.iterrows():
        for column in date_columns:
            detection = detect_wrong_dates(row[column])
            if detection[0]:
                print(f'{detection[1]} Error')
                print(f"Index={index}, #Pag={row['index_in_page']}, Column={column[-1]}")

In [22]:
# Test the function
date_columns = [f'DAY/TIME/ROOM{index+1}' for index in range(4)]
detect_wrong_dates_in_df(normalized_data, date_columns) # 😱 Too many errors, there is no option but to fix these by hand ✋
print("------------------------------------------------------")

Slash Error
Index=3, #Pag=4, Column=2
Slash Error
Index=58, #Pag=59, Column=3
Hour Error
Index=125, #Pag=126, Column=2
Slash Error
Index=128, #Pag=129, Column=2
Slash Error
Index=154, #Pag=155, Column=1
Slash Error
Index=164, #Pag=165, Column=1
Slash Error
Index=165, #Pag=166, Column=1
Hour Error
Index=193, #Pag=194, Column=3
------------------------------------------------------
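
The two split-based checks in `detect_wrong_dates` could also be written as a single regular expression. The pattern below is my own assumption of what a well-formed _day/start_hour-end_hour/room_ value looks like (hours may carry minutes, like 15:30); it is a sketch, not a rule taken from the page:

```python
import re

# DAY/START-END/ROOM, where each hour may carry minutes (e.g. 15:30).
# Assumption: the room name itself never contains a '/'.
SLOT = re.compile(r'^[A-Z]+/\d{1,2}(:\d{2})?-\d{1,2}(:\d{2})?/[^/]+$')

def is_well_formed(slot):
    """Return True when the text follows day/start-end/room."""
    return SLOT.match(slot) is not None

for slot in ['LUNES/14-16/F9', 'MIERCOLES/12-14F2',
             'MIERCOLES/14:15:30/C3', 'MIERCOLES/14-15:30/C3']:
    print(slot, is_well_formed(slot))
```

A regex is stricter than counting separators (it also checks that the hours are digits), but it tells you less about which part is wrong, so the 'Slash' / 'Hour' labels from the function above are still handy for debugging.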


If we check out the page http://www.dci.ugto.mx/estudiantes/index.php/mcursos/horarios-licenciatura we can see that these are human errors and are not predictable at all, so we need to fix them by hand 🥶
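
Since every fix is a one-off, one way to keep them organized is to collect the corrections in a single dictionary and apply them in a loop. This is a sketch on a two-row stand-in frame; `manual_fixes` and `df` are hypothetical names:

```python
import pandas as pd

# Hypothetical two-row stand-in for normalized_data
df = pd.DataFrame(
    {'DAY/TIME/ROOM2': ['MIERCOLES/12-14F2', 'JUEVES/8-10/F6']},
    index=[3, 4],
)

# (row index, column) -> corrected value, checked by hand against the page
manual_fixes = {
    (3, 'DAY/TIME/ROOM2'): 'MIERCOLES/12-14/F2',
}

for (index, column), value in manual_fixes.items():
    df.at[index, column] = value

print(df.at[3, 'DAY/TIME/ROOM2'])
```

Keeping all the corrections in one place makes it easier to review them when the page is scraped again and the row numbers may have shifted.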

### 💨 Fix 3

In [23]:
# 👀 what is happening?
normalized_data.at[3, 'DAY/TIME/ROOM2'] # It doesn't have a slash between the 14 and F2

'MIERCOLES/12-14F2'

In [24]:
# Let's fix it
normalized_data.at[3, 'DAY/TIME/ROOM2'] = "MIERCOLES/12-14/F2" # 'MIERCOLES/12-14/F2'
normalized_data.at[3, 'DAY/TIME/ROOM2']

'MIERCOLES/12-14/F2'

### 💨 Fix 58

In [25]:
# 👀 what is happening?
normalized_data.at[58, 'DAY/TIME/ROOM3']

'MIERCOLES10-12/LAB. DE FISICA MODERNA, EDIF. G'

In [26]:
normalized_data.at[58, 'DAY/TIME/ROOM3'] = "MIERCOLES/10-12/LAB. DE FISICA MODERNA, EDIF. G"
normalized_data.at[58, 'DAY/TIME/ROOM3']

'MIERCOLES/10-12/LAB. DE FISICA MODERNA, EDIF. G'

### 💨 Fix 125

In [27]:
# 👀 what is happening?
normalized_data.at[125, 'DAY/TIME/ROOM2']

'MIERCOLES/14:15:30/C3'

In [28]:
normalized_data.at[125, 'DAY/TIME/ROOM2'] = "MIERCOLES/14-15:30/C3"
normalized_data.at[125, 'DAY/TIME/ROOM2']

'MIERCOLES/14-15:30/C3'

## ❌ 5. There are some fields that only distract
Fields that are not useful, like index_in_page, should be dropped

In [250]:
# Drop index_in_page and give the result a shorter variable name
norm_df = normalized_data.drop(['index_in_page'], axis = 1)
norm_df.head(3)

Unnamed: 0,NAME,GROUP,DAY/TIME/ROOM1,DAY/TIME/ROOM2,DAY/TIME/ROOM3,DAY/TIME/ROOM4,PROFESSORS,PROFESSOR1,PROFESSOR2,PROFESSOR3
0,ACABADO DEL CUERO,A,"SABADO/9-13/LAB. DE CURTIDURIA, EDIF. G",,,,JUAN FRANCISCO RAYAS ROJAS,JUAN FRANCISCO RAYAS ROJAS,,
1,ADMINISTRACION Y MANEJO DE PERSONAL,A,SABADO/9-12/F1,,,,ALDELMO EMMANUEL ISRAEL REYES PABLO,ALDELMO EMMANUEL ISRAEL REYES PABLO,,
2,ALGEBRA LINEAL,A,MARTES/8-10/F6,JUEVES/8-10/F6,,,MIGUEL ANGEL VALLEJO HERNANDEZ,MIGUEL ANGEL VALLEJO HERNANDEZ,,
