# Creating our HH!
## Step 1: Data 🗃️!
We are going to split this step into multiple sub-steps 📄:
1. ⛏️**Web Scraping**.
2. 🐼**Data Transformation**.
3. 🧹**Data cleansing**.
4. 📤**Export the Data**.
---

# 🤔 Pre-coding
Let's define and import some useful things before we start coding.

In [2]:
# Important imports
import requests as req # Library for HTTP requests (allows you to send HTTP requests extremely easily): https://pypi.org/project/requests/
from bs4 import BeautifulSoup # Python Library for pulling data out of HTML files (or in this case, web page): https://www.crummy.com/software/BeautifulSoup/bs4/doc/
import pandas as pd # For Data Analysis and Manipulation in Python: https://pandas.pydata.org/
import numpy as np # For matrix and array operations
import unicodedata # For normalizing in function below

In [3]:
# Normalize text function
def normalize_text(word):
    """
    This function takes a string 'word' and returns it normalized
    (upper case, no surrounding spaces, no accents).
    Example: hElLó wÓrlD -> HELLO WORLD
    """
    word = str(word) # Making sure this is a string
    upper_word = word.upper() # Only upper case letters
    stripped_word = upper_word.strip() # No spaces at the beginning or end of the word
    # No accents -> https://docs.python.org/3/library/unicodedata.html#unicodedata.normalize
    normalized_word = ''.join([
        letter for letter in unicodedata.normalize('NFD', stripped_word)
        if unicodedata.category(letter) != 'Mn'
    ])

    return normalized_word
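
To see why filtering out the `'Mn'` category removes accents, it helps to look at what NFD decomposition produces. The snippet below is only an illustrative sketch; `show_decomposition` is a hypothetical helper that is not part of this notebook.

```python
import unicodedata

def show_decomposition(text):
    """Return (character name, Unicode category) pairs after NFD decomposition."""
    decomposed = unicodedata.normalize('NFD', text)
    return [(unicodedata.name(ch), unicodedata.category(ch)) for ch in decomposed]

# '\u00f3' is the accented letter o with an acute accent.
# NFD splits it into the base letter plus a combining mark (category 'Mn'),
# and that combining mark is what normalize_text filters out.
for name, category in show_decomposition('\u00f3'):
    print(category, name)
```

Dropping every `'Mn'` character therefore leaves only the base letters.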

# ⛏️Web Scraping
Web scraping is the process of extracting data from a web page. In this case we are going to use my [college page](http://www.dci.ugto.mx/estudiantes/index.php/mcursos/horarios-licenciatura) 🤗

In [10]:
# Request to the page function
def request_url(url):
    res = req.get(url)
    if res.status_code not in range(200, 300):
        raise Exception("Something went wrong", res.status_code)
    else:
        print(res.status_code, "- Everything is fine 🔥")
    return res

In [11]:
# Check our request function
url = 'http://www.dci.ugto.mx/estudiantes/index.php/mcursos/horarios-licenciatura'
res = request_url(url)
content = res.content
print(content[:15])

200 - Everything is fine 🔥
b'<!DOCTYPE html>'


In [12]:
# We need a BeautifulSoup object, not HTML in a string
soup = BeautifulSoup(content, 'html.parser')
# The schedules are in a table, let's grab it (it is the second one)
all_tables = soup.find_all('table')
schedule_table = all_tables[1] # 👈 THIS IS OUR TABLE

# 🐼**Data Transformation**: We need to transform the data into pandas objects to do the cleaning

In [13]:
# We need to check how many columns there are in http://www.dci.ugto.mx/estudiantes/index.php/mcursos/horarios-licenciatura
# So we need each data cell in the first row
column_row = schedule_table.find_all('tr')[0] # 1st row
# Every data cell
td_with_column_names = column_row.find_all('td')
# Let's have a look at these column names
for td in td_with_column_names:
    print(td.text.replace('\n', ''), end = " | ")
print()
# I am going to change every column name to a better, shorter, normalized column name
column_names = ['index_in_page', 'NAME', 'GROUP', 'DAY/TIME/ROOM1', 'DAY/TIME/ROOM2', 'DAY/TIME/ROOM3', 'DAY/TIME/ROOM4', 'PROFESSORS'] # 👈 THIS IS OUR COLUMN NAMES LIST
for name in column_names:
    print(name, end = ' | ')
    
# Just comparing the number of columns so there is no error
assert len(column_names) == len(td_with_column_names), "Don't have the same length! CORRECT IT!"

# | UNIDAD DE APRENDIZAJE | GRUPO | DÍA/HORA/AULA | DÍA/HORA/AULA | DÍA/HORA/AULA | DÍA/HORA/AULA | PROFESOR (A) | 
index_in_page | NAME | GROUP | DAY/TIME/ROOM1 | DAY/TIME/ROOM2 | DAY/TIME/ROOM3 | DAY/TIME/ROOM4 | PROFESSORS | 

In [14]:
# Now we need to get the actual data IN THE ROWS
all_rows = schedule_table.find_all('tr')
schedules = []
for row in all_rows[1:]: # Starts in 1 because the 0th is the row with the column names
    tds = row.find_all('td') # Need the data of each row
    # We are going to create a dictionary for each row
    d_row = {}
    for index, column in enumerate(column_names):
        d_row[column] = tds[index].text # {column: tds[index]}
        
    # And we need to add it to the schedules list
    schedules.append(d_row)
schedules[0] # 👈 OUR SCHEDULES IN A PSEUDO-JSON FORMAT

{'index_in_page': '1',
 'NAME': 'ACABADO DEL CUERO',
 'GROUP': 'A',
 'DAY/TIME/ROOM1': 'SÁBADO/9-13/LAB. DE CURTIDURÍA, EDIF. G',
 'DAY/TIME/ROOM2': '\xa0',
 'DAY/TIME/ROOM3': '\xa0',
 'DAY/TIME/ROOM4': '\xa0',
 'PROFESSORS': 'JUAN FRANCISCO RAYAS ROJAS'}
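
The row loop above pairs each column name with the cell at the same position. As a minimal sketch of the same idea, the pairing can be written with `zip`. The helper name `cells_to_record` and the length check are assumptions added for illustration, and the toy input below is not scraped data.

```python
def cells_to_record(column_names, cell_texts):
    """Pair each column name with the cell text at the same position."""
    if len(column_names) != len(cell_texts):
        raise ValueError(f"Expected {len(column_names)} cells, got {len(cell_texts)}")
    # str.strip() also removes the non-breaking space '\xa0' used for empty cells
    return {name: text.strip() for name, text in zip(column_names, cell_texts)}

columns = ['index_in_page', 'NAME', 'GROUP']
record = cells_to_record(columns, ['1', 'ACABADO DEL CUERO', '\xa0'])
print(record)
```

Stripping at this point is a design choice: it would turn the `'\xa0'` placeholders into empty strings before the data reaches pandas.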

In [15]:
# And now we can create a pandas data frame
raw_schedules_df = pd.DataFrame(schedules, columns = column_names)
# I need to save this file 
raw_schedules_df.to_csv('../data/raw_schedules.csv', index=False)
raw_schedules_df.head(3)

Unnamed: 0,index_in_page,NAME,GROUP,DAY/TIME/ROOM1,DAY/TIME/ROOM2,DAY/TIME/ROOM3,DAY/TIME/ROOM4,PROFESSORS
0,1,ACABADO DEL CUERO,A,"SÁBADO/9-13/LAB. DE CURTIDURÍA, EDIF. G",,,,JUAN FRANCISCO RAYAS ROJAS
1,2,ADMINISTRACIÓN Y MANEJO DE PERSONAL,A,SÁBADO/9-12/F1,,,,ALDELMO EMMANUEL ISRAEL REYES PABLO
2,3,ÁLGEBRA LINEAL,A,MARTES/8-10/F6,JUEVES/8-10/F6,,,MIGUEL ÁNGEL VALLEJO HERNÁNDEZ


# 🧹**Data cleansing**: Let's clean the data and correct some errors
I am going to list every error or issue as a subtitle in the following cells.

**Note:** Sometimes the school gives me the schedules in an Excel file, so there is no need to scrape the data from the web page. Here I am going to load that file instead.

In [4]:
# Reading csv document (This covers the case where school gives me an excel file before they update their page)
raw_schedules_og_df = pd.read_csv('../data/raw_schedules.csv')
raw_schedules_og_df.head(10)

Unnamed: 0,ÁREA,#,SEMESTRE,UNIDAD DE APRENDIZAJE,GRUPO,CARACTERÍSTICA,HORAS/SEMANA,DÍA/HORA/AULA,DÍA/HORA/AULA.1,DÍA/HORA/AULA.2,DÍA/HORA/AULA.3,PROFESOR\nASIGNADO,DEPARTAMENTO,NOTAS
0,ADMINISTRACIÓN,1,7,ADMINISTRACIÓN Y MANEJO DE PERSONAL,A,CURSO COMPLETO,4,SÁBADO/9-13/F6,,,,,,
1,IDIOMAS,2,OPT-G,ALEMÁN I,A,CURSO COMPLETO,3,SÁBADO/8-11/F7,,,,,,
2,MATEMÁTICAS,3,2,ÁLGEBRA LINEAL,A,CURSO COMPLETO,4,LUNES/8-10/AUDITORIO EDIF. G,MIÉRCOLES/8-10/AUDITORIO EDIF. G,,,MIGUEL ÁNGEL VALLEJO HERNÁNDEZ,INGENIERÍA FÍSICA,
3,MATEMÁTICAS,4,2,ÁLGEBRA LINEAL,B,CURSO COMPLETO,4,MARTES/8-10/F1,JUEVES/8-10/F1,,,TEODORO CÓRDOVA FRAGA,INGENIERÍA FÍSICA,
4,MATEMÁTICAS,5,2,ÁLGEBRA LINEAL,C,CURSO COMPLETO,4,MARTES/10-12/F1,JUEVES/10-12/F1,,,OCTAVIO JOSÉ OBREGÓN DÍAZ,FÍSICA,
5,MATEMÁTICAS,6,2,ÁLGEBRA LINEAL,D,CURSO COMPLETO,4,MIÉRCOLES/12-14/F1,VIERNES/12-14/F1,,,JOSÉ LUIS LÓPEZ PICÓN,FÍSICA,
6,MATEMÁTICAS,7,2,ÁLGEBRA LINEAL,E,CURSO COMPLETO,4,LUNES/15-17/G1,MIÉRCOLES/15-17/G1,,,JOSÉ DE JESÚS BERNAL ALVARADO,INGENIERÍA FÍSICA,
7,ELECTRÓNICA,8,3,ANÁLISIS DE CIRCUITOS,A,CURSO COMPLETO,6,MARTES/8-10/LAB. DE ELECTRÓNICA EDIF. G,JUEVES/8-12/LAB. DE ELECTRÓNICA EDIF. G,,,CARLOS VILLASEÑOR MORA,"INGENIERÍAS QUÍMICA, ELECTRÓNICA Y BIOMÉDICA",
8,ELECTRÓNICA,9,3,ANÁLISIS DE CIRCUITOS,B,CURSO COMPLETO,6,LUNES/12-16/LAB. DE ELECTRÓNICA EDIF.G,MIÉRCOLES/8-10/LAB. DE ELECTRÓNICA EDIF. G,,,JOSÉ MARCO BALLEZA ORDAZ,INGENIERÍA FÍSICA,
9,SOCIALES,10,OPT-G,ANÁLISIS DE LA CULTURA MEXICANA,A,CURSO COMPLETO,3,LUNES/16-19/AUDITORIO DEL EDIF. G,,,,FERNANDO AGUAS ÁNGEL,FÍSICA,


In [5]:
# Rename the columns again for the case where they give me an Excel file, remembering that I always need these column names:
# index_in_page | NAME | GROUP | DAY/TIME/ROOM1 | DAY/TIME/ROOM2 | DAY/TIME/ROOM3 | DAY/TIME/ROOM4 | PROFESSORS |    

column_names = [
    "AREA", 
    "index_in_page", 
    "SEMESTER", 
    "NAME", 
    "GROUP", 
    "CHARACTERISTICS", 
    "HOURS/WEEK", 
    "DAY/TIME/ROOM1",
    "DAY/TIME/ROOM2",
    "DAY/TIME/ROOM3",
    "DAY/TIME/ROOM4",
    "PROFESSORS",
    "DEPARTMENT",
    "NOTES"
]

# Just comparing the number of columns so there is no error
assert len(column_names) == len(raw_schedules_og_df.columns), "Don't have the same length!"

In [6]:
# Create the replace dictionary for the rename function for pandas
column_replace_dict = {og_col: new_col for og_col, new_col in zip(raw_schedules_og_df.columns, column_names)}
# Replace the column names
raw_schedules_df = raw_schedules_og_df.rename(columns = column_replace_dict)
raw_schedules_df.head(10)

Unnamed: 0,AREA,index_in_page,SEMESTER,NAME,GROUP,CHARACTERISTICS,HOURS/WEEK,DAY/TIME/ROOM1,DAY/TIME/ROOM2,DAY/TIME/ROOM3,DAY/TIME/ROOM4,PROFESSORS,DEPARTMENT,NOTES
0,ADMINISTRACIÓN,1,7,ADMINISTRACIÓN Y MANEJO DE PERSONAL,A,CURSO COMPLETO,4,SÁBADO/9-13/F6,,,,,,
1,IDIOMAS,2,OPT-G,ALEMÁN I,A,CURSO COMPLETO,3,SÁBADO/8-11/F7,,,,,,
2,MATEMÁTICAS,3,2,ÁLGEBRA LINEAL,A,CURSO COMPLETO,4,LUNES/8-10/AUDITORIO EDIF. G,MIÉRCOLES/8-10/AUDITORIO EDIF. G,,,MIGUEL ÁNGEL VALLEJO HERNÁNDEZ,INGENIERÍA FÍSICA,
3,MATEMÁTICAS,4,2,ÁLGEBRA LINEAL,B,CURSO COMPLETO,4,MARTES/8-10/F1,JUEVES/8-10/F1,,,TEODORO CÓRDOVA FRAGA,INGENIERÍA FÍSICA,
4,MATEMÁTICAS,5,2,ÁLGEBRA LINEAL,C,CURSO COMPLETO,4,MARTES/10-12/F1,JUEVES/10-12/F1,,,OCTAVIO JOSÉ OBREGÓN DÍAZ,FÍSICA,
5,MATEMÁTICAS,6,2,ÁLGEBRA LINEAL,D,CURSO COMPLETO,4,MIÉRCOLES/12-14/F1,VIERNES/12-14/F1,,,JOSÉ LUIS LÓPEZ PICÓN,FÍSICA,
6,MATEMÁTICAS,7,2,ÁLGEBRA LINEAL,E,CURSO COMPLETO,4,LUNES/15-17/G1,MIÉRCOLES/15-17/G1,,,JOSÉ DE JESÚS BERNAL ALVARADO,INGENIERÍA FÍSICA,
7,ELECTRÓNICA,8,3,ANÁLISIS DE CIRCUITOS,A,CURSO COMPLETO,6,MARTES/8-10/LAB. DE ELECTRÓNICA EDIF. G,JUEVES/8-12/LAB. DE ELECTRÓNICA EDIF. G,,,CARLOS VILLASEÑOR MORA,"INGENIERÍAS QUÍMICA, ELECTRÓNICA Y BIOMÉDICA",
8,ELECTRÓNICA,9,3,ANÁLISIS DE CIRCUITOS,B,CURSO COMPLETO,6,LUNES/12-16/LAB. DE ELECTRÓNICA EDIF.G,MIÉRCOLES/8-10/LAB. DE ELECTRÓNICA EDIF. G,,,JOSÉ MARCO BALLEZA ORDAZ,INGENIERÍA FÍSICA,
9,SOCIALES,10,OPT-G,ANÁLISIS DE LA CULTURA MEXICANA,A,CURSO COMPLETO,3,LUNES/16-19/AUDITORIO DEL EDIF. G,,,,FERNANDO AGUAS ÁNGEL,FÍSICA,


## ❌ 1. Accents and uppercase letters
We need to normalize every field (we could have done this earlier, in the data extraction part, but I think it is better to keep the steps separate).

In [7]:
# Function for normalizing the df with the help of our "normalize_text" function from the 🤔 Pre-coding part
def normalize_df(column):
    normalized_column = [] # A list for the normalized column
    for cell in column:
        normalized_column.append(normalize_text(cell))

    return normalized_column

In [8]:
normalized_data = raw_schedules_df.apply(normalize_df)
normalized_data.head(10) # 👈 OUR DATA NORMALIZED

Unnamed: 0,AREA,index_in_page,SEMESTER,NAME,GROUP,CHARACTERISTICS,HOURS/WEEK,DAY/TIME/ROOM1,DAY/TIME/ROOM2,DAY/TIME/ROOM3,DAY/TIME/ROOM4,PROFESSORS,DEPARTMENT,NOTES
0,ADMINISTRACION,1,7,ADMINISTRACION Y MANEJO DE PERSONAL,A,CURSO COMPLETO,4,SABADO/9-13/F6,NAN,NAN,NAN,NAN,NAN,NAN
1,IDIOMAS,2,OPT-G,ALEMAN I,A,CURSO COMPLETO,3,SABADO/8-11/F7,NAN,NAN,NAN,NAN,NAN,NAN
2,MATEMATICAS,3,2,ALGEBRA LINEAL,A,CURSO COMPLETO,4,LUNES/8-10/AUDITORIO EDIF. G,MIERCOLES/8-10/AUDITORIO EDIF. G,NAN,NAN,MIGUEL ANGEL VALLEJO HERNANDEZ,INGENIERIA FISICA,NAN
3,MATEMATICAS,4,2,ALGEBRA LINEAL,B,CURSO COMPLETO,4,MARTES/8-10/F1,JUEVES/8-10/F1,NAN,NAN,TEODORO CORDOVA FRAGA,INGENIERIA FISICA,NAN
4,MATEMATICAS,5,2,ALGEBRA LINEAL,C,CURSO COMPLETO,4,MARTES/10-12/F1,JUEVES/10-12/F1,NAN,NAN,OCTAVIO JOSE OBREGON DIAZ,FISICA,NAN
5,MATEMATICAS,6,2,ALGEBRA LINEAL,D,CURSO COMPLETO,4,MIERCOLES/12-14/F1,VIERNES/12-14/F1,NAN,NAN,JOSE LUIS LOPEZ PICON,FISICA,NAN
6,MATEMATICAS,7,2,ALGEBRA LINEAL,E,CURSO COMPLETO,4,LUNES/15-17/G1,MIERCOLES/15-17/G1,NAN,NAN,JOSE DE JESUS BERNAL ALVARADO,INGENIERIA FISICA,NAN
7,ELECTRONICA,8,3,ANALISIS DE CIRCUITOS,A,CURSO COMPLETO,6,MARTES/8-10/LAB. DE ELECTRONICA EDIF. G,JUEVES/8-12/LAB. DE ELECTRONICA EDIF. G,NAN,NAN,CARLOS VILLASENOR MORA,"INGENIERIAS QUIMICA, ELECTRONICA Y BIOMEDICA",NAN
8,ELECTRONICA,9,3,ANALISIS DE CIRCUITOS,B,CURSO COMPLETO,6,LUNES/12-16/LAB. DE ELECTRONICA EDIF.G,MIERCOLES/8-10/LAB. DE ELECTRONICA EDIF. G,NAN,NAN,JOSE MARCO BALLEZA ORDAZ,INGENIERIA FISICA,NAN
9,SOCIALES,10,OPT-G,ANALISIS DE LA CULTURA MEXICANA,A,CURSO COMPLETO,3,LUNES/16-19/AUDITORIO DEL EDIF. G,NAN,NAN,NAN,FERNANDO AGUAS ANGEL,FISICA,NAN


## ❌ 2. Professors Together
Some subjects have more than one professor (because of labs or something else), so we need to generate a field for each professor. When this happens, the professors are separated by a '/' character, so we can use that to create the new fields.

In [9]:
# Grab the professor list
professor_list = normalized_data['PROFESSORS'].to_list()
# Split all these professors into a matrix
professor_matrix = [professor.split('/') for professor in professor_list]
# We need to re-normalize this matrix because there may be some unwanted spaces
for i in range(len(professor_matrix)):
    for j in range(len(professor_matrix[i])):
        professor_matrix[i][j] = normalize_text(professor_matrix[i][j])
# As an example, let's look at some entries
professor_matrix[20:22] # One subject with 4 professors and the next one with 1 professor

[['MODESTO ANTONIO SOSA AQUINO',
  'ARTURO GONZALEZ VEGA',
  'TEODORO CORDOVA FRAGA',
  'FRANCISCO MIGUEL VARGAS LUNA'],
 ['TEODORO CORDOVA FRAGA']]

In [10]:
# We need to know how many fields to create, so let's calculate it
max_professors = 0
for professors in professor_matrix:
    max_professors = max(len(professors), max_professors)
    
print("The max of professors per subject is:", max_professors)

The max of professors per subject is: 4


In [11]:
# And we need to pad the professor matrix with blanks so that every row has the same length
for i in range(len(professor_matrix)):
    for missing in range(max_professors - len(professor_matrix[i])):
        professor_matrix[i].append('')
        
professor_matrix[20:22]

[['MODESTO ANTONIO SOSA AQUINO',
  'ARTURO GONZALEZ VEGA',
  'TEODORO CORDOVA FRAGA',
  'FRANCISCO MIGUEL VARGAS LUNA'],
 ['TEODORO CORDOVA FRAGA', '', '', '']]

In [12]:
# So new fields are:
professor_columns = [f"PROFESSOR{index + 1}" for index in range(max_professors)]
professor_columns

['PROFESSOR1', 'PROFESSOR2', 'PROFESSOR3', 'PROFESSOR4']

In [13]:
# Let's create these fields in our df
normalized_data[professor_columns] = professor_matrix

In [14]:
normalized_data[20:22] # Let's check some records in the df # 👈 THIS IS OUR CORRECT DF FOR CLEANSING

Unnamed: 0,AREA,index_in_page,SEMESTER,NAME,GROUP,CHARACTERISTICS,HOURS/WEEK,DAY/TIME/ROOM1,DAY/TIME/ROOM2,DAY/TIME/ROOM3,DAY/TIME/ROOM4,PROFESSORS,DEPARTMENT,NOTES,PROFESSOR1,PROFESSOR2,PROFESSOR3,PROFESSOR4
20,ING. FISICA,21,OPT-IB,BASES FISICAS PARA EL DIAGNOSTICO POR IMAGENES,A,CURSO COMPLETO,6,MARTES/15-18/C2,VIERNES/12-15/C3,NAN,NAN,MODESTO ANTONIO SOSA AQUINO/ARTURO GONZALEZ VE...,INGENIERIA FISICA,1/4,MODESTO ANTONIO SOSA AQUINO,ARTURO GONZALEZ VEGA,TEODORO CORDOVA FRAGA,FRANCISCO MIGUEL VARGAS LUNA
21,BIOMEDICA,22,6,BIOESTADISTICA,A,CURSO COMPLETO,4,LUNES/8-10/F8,VIERNES/8-10/F2,NAN,NAN,TEODORO CORDOVA FRAGA,INGENIERIA FISICA,NAN,TEODORO CORDOVA FRAGA,,,
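
The split, pad and assign steps above can also be expressed with pandas string methods. The following is a hedged sketch on toy data rather than the notebook's method: `Series.str.split(..., expand=True)` returns one column per piece and pads shorter rows with missing values (not with `''` as in the cells above).

```python
import pandas as pd

# Toy data standing in for normalized_data['PROFESSORS']
professors = pd.Series(['A/ B /C', 'D', None], name='PROFESSORS')

# expand=True gives one column per piece; short rows are padded with missing values
split = professors.str.split('/', expand=True)
# Remove stray spaces around each name
split = split.apply(lambda col: col.str.strip())
split.columns = [f"PROFESSOR{i + 1}" for i in range(split.shape[1])]
print(split)
```

Because the padding is a missing value instead of an empty string, this variant would not need the blank-to-NaN replacement for the professor columns in the next step.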


## ❌ 3. Empty cells
If there are empty fields (other than the DAY/TIME/ROOM fields, where empty cells are normal), then we need to fix them, because that means something is wrong with the algorithm.

In [15]:
# First of all, let's standardize the empty strings and blank spaces to np.nan objects
normalized_data = normalized_data.replace(r'^\s*$', np.nan, regex=True)
normalized_data = normalized_data.replace('NAN', np.nan)

In [16]:
normalized_data.head(3) # 👈 STILL OUR DF

Unnamed: 0,AREA,index_in_page,SEMESTER,NAME,GROUP,CHARACTERISTICS,HOURS/WEEK,DAY/TIME/ROOM1,DAY/TIME/ROOM2,DAY/TIME/ROOM3,DAY/TIME/ROOM4,PROFESSORS,DEPARTMENT,NOTES,PROFESSOR1,PROFESSOR2,PROFESSOR3,PROFESSOR4
0,ADMINISTRACION,1,7,ADMINISTRACION Y MANEJO DE PERSONAL,A,CURSO COMPLETO,4,SABADO/9-13/F6,,,,,,,,,,
1,IDIOMAS,2,OPT-G,ALEMAN I,A,CURSO COMPLETO,3,SABADO/8-11/F7,,,,,,,,,,
2,MATEMATICAS,3,2,ALGEBRA LINEAL,A,CURSO COMPLETO,4,LUNES/8-10/AUDITORIO EDIF. G,MIERCOLES/8-10/AUDITORIO EDIF. G,,,MIGUEL ANGEL VALLEJO HERNANDEZ,INGENIERIA FISICA,,MIGUEL ANGEL VALLEJO HERNANDEZ,,,


In [17]:
# Function to detect if there is an empty cell where it should not be.
def detect_empty_cells(df, columns):
    # Save indices in a list 👇
    empty_cells_index = []
    # Loop to find out whether each cell is empty
    for index in range(len(df)):
        for column in columns:
            # print(type(df.at[index, column]), df.at[index, column])
            if pd.isna(df.at[index, column]) or df.at[index, column] == 'NAN':
                empty_cells_index.append([index, column])
                
    return empty_cells_index
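
As an alternative to the nested loops, the same check can be sketched with a boolean mask. This is an illustration under assumptions: `find_missing_cells` is a hypothetical name, and it only looks for real missing values, so it assumes the `'NAN'` strings were already replaced with `np.nan` as in the cell above.

```python
import numpy as np
import pandas as pd

def find_missing_cells(df, columns):
    """Return (row label, column name) pairs where a required cell is missing."""
    mask = df[columns].isna().to_numpy()
    # np.nonzero returns the row and column positions of every True value
    rows, cols = np.nonzero(mask)
    return [(df.index[r], columns[c]) for r, c in zip(rows, cols)]

toy = pd.DataFrame({'NAME': ['ALGEBRA', np.nan], 'GROUP': ['A', 'B']})
print(find_missing_cells(toy, ['NAME', 'GROUP']))
```

Unlike `detect_empty_cells`, this version uses the DataFrame's own row labels, so it does not depend on the index being `0..len(df)-1`.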

In [18]:
# Let's test this function
columns_to_check = ['NAME', 'GROUP', 'DAY/TIME/ROOM1', 'PROFESSORS', 'PROFESSOR1'] # We only care if these fields are empty because that means something is wrong
empty_rows = detect_empty_cells(normalized_data, columns_to_check)

assert not empty_rows, f"There shouldn't be an empty cell here 👉 {empty_rows}!" # If there are empty rows we need to compare with the link http://www.dci.ugto.mx/estudiantes/index.php/mcursos/horarios-licenciatura
# If this empty data comes from the data source (the page or Excel file) we cannot do anything to correct it; that is the school's responsibility

AssertionError: There shouldn't be an empty cell here 👉 [[0, 'PROFESSORS'], [0, 'PROFESSOR1'], [1, 'PROFESSORS'], [1, 'PROFESSOR1'], [12, 'PROFESSORS'], [12, 'PROFESSOR1'], [13, 'PROFESSORS'], [13, 'PROFESSOR1'], [32, 'PROFESSORS'], [32, 'PROFESSOR1'], [39, 'PROFESSORS'], [39, 'PROFESSOR1'], [40, 'PROFESSORS'], [40, 'PROFESSOR1'], [41, 'PROFESSORS'], [41, 'PROFESSOR1'], [45, 'PROFESSORS'], [45, 'PROFESSOR1'], [46, 'PROFESSORS'], [46, 'PROFESSOR1'], [50, 'PROFESSORS'], [50, 'PROFESSOR1'], [59, 'PROFESSORS'], [59, 'PROFESSOR1'], [75, 'PROFESSORS'], [75, 'PROFESSOR1'], [76, 'PROFESSORS'], [76, 'PROFESSOR1'], [88, 'PROFESSORS'], [88, 'PROFESSOR1'], [92, 'PROFESSORS'], [92, 'PROFESSOR1'], [93, 'PROFESSORS'], [93, 'PROFESSOR1'], [94, 'PROFESSORS'], [94, 'PROFESSOR1'], [98, 'PROFESSORS'], [98, 'PROFESSOR1'], [99, 'PROFESSORS'], [99, 'PROFESSOR1'], [100, 'PROFESSORS'], [100, 'PROFESSOR1'], [101, 'PROFESSORS'], [101, 'PROFESSOR1'], [102, 'PROFESSORS'], [102, 'PROFESSOR1'], [103, 'PROFESSORS'], [103, 'PROFESSOR1'], [104, 'PROFESSORS'], [104, 'PROFESSOR1'], [106, 'PROFESSORS'], [106, 'PROFESSOR1'], [107, 'PROFESSORS'], [107, 'PROFESSOR1'], [108, 'PROFESSORS'], [108, 'PROFESSOR1'], [114, 'PROFESSORS'], [114, 'PROFESSOR1'], [115, 'PROFESSORS'], [115, 'PROFESSOR1'], [116, 'PROFESSORS'], [116, 'PROFESSOR1'], [118, 'PROFESSORS'], [118, 'PROFESSOR1'], [119, 'PROFESSORS'], [119, 'PROFESSOR1'], [120, 'PROFESSORS'], [120, 'PROFESSOR1'], [121, 'PROFESSORS'], [121, 'PROFESSOR1'], [122, 'PROFESSORS'], [122, 'PROFESSOR1'], [124, 'PROFESSORS'], [124, 'PROFESSOR1'], [127, 'PROFESSORS'], [127, 'PROFESSOR1'], [132, 'PROFESSORS'], [132, 'PROFESSOR1'], [133, 'PROFESSORS'], [133, 'PROFESSOR1'], [134, 'PROFESSORS'], [134, 'PROFESSOR1'], [135, 'PROFESSORS'], [135, 'PROFESSOR1'], [136, 'PROFESSORS'], [136, 'PROFESSOR1'], [138, 'PROFESSORS'], [138, 'PROFESSOR1'], [144, 'PROFESSORS'], [144, 'PROFESSOR1'], [145, 'PROFESSORS'], [145, 'PROFESSOR1'], [151, 'PROFESSORS'], [151, 'PROFESSOR1'], [157, 'PROFESSORS'], [157, 'PROFESSOR1'], [158, 'PROFESSORS'], [158, 'PROFESSOR1'], [159, 'PROFESSORS'], [159, 'PROFESSOR1'], [160, 'PROFESSORS'], [160, 'PROFESSOR1'], [161, 'PROFESSORS'], [161, 'PROFESSOR1'], [162, 'PROFESSORS'], [162, 'PROFESSOR1'], [163, 'PROFESSORS'], [163, 'PROFESSOR1'], [166, 'PROFESSORS'], [166, 'PROFESSOR1'], [168, 'PROFESSORS'], [168, 'PROFESSOR1'], [171, 'PROFESSORS'], [171, 'PROFESSOR1'], [174, 'PROFESSORS'], [174, 'PROFESSOR1'], [175, 'PROFESSORS'], [175, 'PROFESSOR1'], [183, 'PROFESSORS'], [183, 'PROFESSOR1'], [190, 'PROFESSORS'], [190, 'PROFESSOR1'], [191, 'PROFESSORS'], [191, 'PROFESSOR1'], [192, 'PROFESSORS'], [192, 'PROFESSOR1'], [195, 'PROFESSORS'], [195, 'PROFESSOR1'], [196, 'PROFESSORS'], [196, 'PROFESSOR1'], [203, 'PROFESSORS'], [203, 'PROFESSOR1'], [204, 'PROFESSORS'], [204, 'PROFESSOR1'], [205, 'PROFESSORS'], [205, 'PROFESSOR1'], [209, 'PROFESSORS'], [209, 'PROFESSOR1'], [210, 'PROFESSORS'], [210, 'PROFESSOR1'], [211, 'PROFESSORS'], [211, 'PROFESSOR1'], [212, 'PROFESSORS'], [212, 'PROFESSOR1'], [216, 'PROFESSORS'], [216, 'PROFESSOR1']]!

## ❌ 4. Errors in DAY/TIME/ROOM fields
We need a consistent format for this field. I chose:

_day/start_hour-end_hour/room_

for example:

_LUNES/14-16/F9_

Lunes is Monday in Spanish and F9 is the classroom name.
In the following code we are going to fix the schedules that do not comply with the format 👇.

In [19]:
# Function to detect a bad format in dates (date by date)
def detect_wrong_dates(date):
    # We're not checking NAN values
    if date == 'NAN' or pd.isna(date):
        return False, None
    # Split for each "/" to obtain every DAY/TIME/ROOM
    date_split = date.split('/')

    # With this we can find 2 possible errors
    # 1. The field does not split into exactly 3 parts (it does not contain exactly two "/")
    # 2. The start-end time does not contain exactly one "-"

    # Detect the 1st
    if len(date_split) != 3:
        return True, 'Slash'

    # Detect the 2nd
    hours = date_split[1] # Because we want to check the time
    if hours and len(hours.split('-')) != 2:
        return True, 'Hour'
        
        
    return False, None
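
The same format rule can also be written as a single regular expression. This is a sketch under assumptions, not the check used in this notebook: it assumes the text is already normalized (upper case, no accents), that hours are whole numbers, and that the room part may be empty. `SLOT_PATTERN` and `is_valid_slot` are hypothetical names.

```python
import re

# day / start_hour-end_hour / room (the room may be empty, as in 'LUNES/12-14/')
SLOT_PATTERN = re.compile(r'[A-Z]+/\d{1,2}-\d{1,2}/[^/]*')

def is_valid_slot(slot):
    """Return True when the whole text matches day/start_hour-end_hour/room."""
    return SLOT_PATTERN.fullmatch(slot) is not None

print(is_valid_slot('LUNES/14-16/F9'))   # well-formed
print(is_valid_slot('MARTES15-17/F8'))   # missing slash after the day
```

A regex is stricter than counting separators: it would also reject stray spaces inside the time part, which the split-based check lets through.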

In [20]:
# The above 👆 function only works for one date, so we need a function that runs it over the whole df
def detect_wrong_dates_in_df(df, date_columns):
    # Iterating over rows
    for index, row in df.iterrows():
        for column in date_columns:
            detection = detect_wrong_dates(row[column])
            if detection[0]:
                print(f'{detection[1]} Error')
                print(f"Index={index}, #Pag={row['index_in_page']}, Column={column[-1]}")

In [21]:
# Test the function
date_columns = [f'DAY/TIME/ROOM{index+1}' for index in range(4)]
detect_wrong_dates_in_df(normalized_data, date_columns) # 😱 Too many errors, there is no option but to fix these by hand ✋
print("------------------------------------------------------")

Slash Error
Index=10, #Pag=11, Column=1
Slash Error
Index=19, #Pag=20, Column=2
Slash Error
Index=28, #Pag=29, Column=4
Slash Error
Index=44, #Pag=45, Column=1
Slash Error
Index=47, #Pag=48, Column=2
Slash Error
Index=160, #Pag=161, Column=2
Slash Error
Index=197, #Pag=198, Column=1
Slash Error
Index=198, #Pag=199, Column=1
Slash Error
Index=198, #Pag=199, Column=2
------------------------------------------------------


If we check the page http://www.dci.ugto.mx/estudiantes/index.php/mcursos/horarios-licenciatura we can see that these are human errors and they are not predictable at all, so we need to fix them by hand 🥶

### 💨 Fix 10

In [22]:
# 👀 what is happening?
normalized_data.at[10, 'DAY/TIME/ROOM1'] # It doesn't have a slash between MARTES and 15

'MARTES15-17/F8'

In [23]:
# Let's fix it
normalized_data.at[10, 'DAY/TIME/ROOM1'] = "MARTES/15-17/F8" # 'MARTES/15-17/F8'
normalized_data.at[10, 'DAY/TIME/ROOM1']

'MARTES/15-17/F8'

### 💨 Fix 19

In [24]:
# 👀 what is happening?
normalized_data.at[19, 'DAY/TIME/ROOM2']

'VIERNES/15-17F7'

In [25]:
normalized_data.at[19, 'DAY/TIME/ROOM2'] = "VIERNES/15-17/F7"
normalized_data.at[19, 'DAY/TIME/ROOM2']

'VIERNES/15-17/F7'

### 💨 Fix 28

In [26]:
# 👀 what is happening?
normalized_data.at[28, 'DAY/TIME/ROOM4']

'VIERNES/8-11LAB. DE BIOLOGIA EDIF. G'

In [27]:
normalized_data.at[28, 'DAY/TIME/ROOM4'] = 'VIERNES/8-11/LAB. DE BIOLOGIA EDIF. G'
normalized_data.at[28, 'DAY/TIME/ROOM4']

'VIERNES/8-11/LAB. DE BIOLOGIA EDIF. G'

### 💨 Fix 44

In [28]:
# 👀 what is happening?
normalized_data.at[44, 'DAY/TIME/ROOM1']

'MARTES/15-17F2'

In [29]:
normalized_data.at[44, 'DAY/TIME/ROOM1'] = 'MARTES/15-17/F2'
normalized_data.at[44, 'DAY/TIME/ROOM1']

'MARTES/15-17/F2'

### 💨 Fix 47

In [30]:
# 👀 what is happening?
normalized_data.at[47, 'DAY/TIME/ROOM2']

'VIERNES/ 17-19//SALA DE JUNTAS EDIF. B'

In [31]:
normalized_data.at[47, 'DAY/TIME/ROOM2'] = 'VIERNES/17-19/SALA DE JUNTAS EDIF. B'
normalized_data.at[47, 'DAY/TIME/ROOM2']

'VIERNES/17-19/SALA DE JUNTAS EDIF. B'

### 💨 Fix 160

In [32]:
# 👀 what is happening?
normalized_data.at[160, 'DAY/TIME/ROOM2']

'MIERCOLES12-14/C2'

In [33]:
normalized_data.at[160, 'DAY/TIME/ROOM2'] = 'MIERCOLES/12-14/C2'
normalized_data.at[160, 'DAY/TIME/ROOM2']

'MIERCOLES/12-14/C2'

### 💨 Fix 197

In [34]:
# 👀 what is happening?
normalized_data.at[197, 'DAY/TIME/ROOM1']

'JUEVES15-18/AUDITORIO DE EDIF. G'

In [35]:
normalized_data.at[197, 'DAY/TIME/ROOM1'] = 'JUEVES/15-18/AUDITORIO DE EDIF. G'
normalized_data.at[197, 'DAY/TIME/ROOM1']

'JUEVES/15-18/AUDITORIO DE EDIF. G'

### 💨 Fix 198

In [36]:
# 👀 what is happening?
normalized_data.at[198, 'DAY/TIME/ROOM1']

'LUNES 12-14'

In [37]:
normalized_data.at[198, 'DAY/TIME/ROOM1'] = 'LUNES/12-14/'
normalized_data.at[198, 'DAY/TIME/ROOM1']

'LUNES/12-14/'

In [38]:
# Let's look for errors again in case we didn't fix them all
detect_wrong_dates_in_df(normalized_data, date_columns) # One error is still left: column 2 of index 198

Slash Error
Index=198, #Pag=199, Column=2


### 💨 Fix 198 Column 2

In [39]:
# 👀 what is happening?
normalized_data.at[198, 'DAY/TIME/ROOM2']

'VIERNES 12-14'

In [40]:
normalized_data.at[198, 'DAY/TIME/ROOM2'] = 'VIERNES/12-14/'
normalized_data.at[198, 'DAY/TIME/ROOM2']

'VIERNES/12-14/'

In [41]:
detect_wrong_dates_in_df(normalized_data, date_columns) # Check again if there are errors

## 🤔 5. Add useful fields
To make things easier in the algorithm part, I need to add some fields that are already in the table, but in a different form. Example: separate DAY/TIME/ROOM into a field called DAY, another field called TIME and another field called ROOM.

And, even though this is going to be awkward, I need to add the index (not the index_in_page) for tracking purposes in the database, so I am going to add the _ID column.

In [42]:
date_column = [column_name for column_name in normalized_data.columns if 'DAY/' in column_name]

for index, column in enumerate(date_column):
    normalized_data[f"DAY{index + 1}"] = list(map(lambda day: day.split('/')[0] if not pd.isna(day) and day else np.nan, normalized_data[column].tolist()))
    normalized_data[f"TIME{index + 1}"] = list(map(lambda time: time.split('/')[1] if not pd.isna(time) and time else np.nan, normalized_data[column].tolist()))

normalized_data.insert(0, '_ID', range(0, len(normalized_data)))
    
normalized_data.head(3)

Unnamed: 0,_ID,AREA,index_in_page,SEMESTER,NAME,GROUP,CHARACTERISTICS,HOURS/WEEK,DAY/TIME/ROOM1,DAY/TIME/ROOM2,...,PROFESSOR3,PROFESSOR4,DAY1,TIME1,DAY2,TIME2,DAY3,TIME3,DAY4,TIME4
0,0,ADMINISTRACION,1,7,ADMINISTRACION Y MANEJO DE PERSONAL,A,CURSO COMPLETO,4,SABADO/9-13/F6,,...,,,SABADO,9-13,,,,,,
1,1,IDIOMAS,2,OPT-G,ALEMAN I,A,CURSO COMPLETO,3,SABADO/8-11/F7,,...,,,SABADO,8-11,,,,,,
2,2,MATEMATICAS,3,2,ALGEBRA LINEAL,A,CURSO COMPLETO,4,LUNES/8-10/AUDITORIO EDIF. G,MIERCOLES/8-10/AUDITORIO EDIF. G,...,,,LUNES,8-10,MIERCOLES,8-10,,,,
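
For a single value, the DAY/TIME/ROOM split described above can be sketched as a small parser that also converts the hours to integers. This is an illustration only: `Slot` and `parse_slot` are hypothetical names, and the sketch assumes the value already follows the _day/start_hour-end_hour/room_ format with whole-number hours.

```python
from typing import NamedTuple, Optional

class Slot(NamedTuple):
    day: str
    start_hour: int
    end_hour: int
    room: str

def parse_slot(text) -> Optional[Slot]:
    """Split 'day/start-end/room' into typed parts; return None for empty input."""
    if not isinstance(text, str) or not text.strip():
        return None  # covers NaN (a float) and blank strings
    day, hours, room = text.split('/')
    start, end = hours.split('-')
    return Slot(day.strip(), int(start), int(end), room.strip())

print(parse_slot('LUNES/8-10/AUDITORIO EDIF. G'))
```

Having the start and end hours as integers could make later overlap checks between two classes a plain numeric comparison.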


## ❌ 6. Remove the NAN and replace it with a blank space
For the frontend we will need to remove all the NaN values and replace them with a '' blank string.

In [43]:
#normalized_data = normalized_data.replace(np.nan, '', regex=True)
#normalized_data.head(3)

## ❌ 7. Check the correct data type in each column
We need to be sure that, for example, every professor, name, group and Day field starts with a letter, and every TIME starts and ends with a number.

In [88]:
def check_initial_letter(df, column_name):
    for index, name in enumerate(df[column_name]):
        string_name = str(name)
        # Explicit grouping: start with a letter AND end with a letter or a digit (e.g. room "F6")
        ends_ok = string_name[-1].isalpha() or string_name[-1].isdigit()
        if not (string_name[0].isalpha() and ends_ok):
            return f"The record -> {index} in column -> {column_name} does not start or end with a letter -> {name}"
        
    return False
        
def check_initial_number(df, column_name):
    for index, name in enumerate(df[column_name]):
        string_name = str(name)
        if (string_name and string_name.upper() != 'NAN') and not (string_name[0].isdigit() and string_name[-1].isdigit()):
            return f"The record -> {index} in column -> {column_name} does not start or end with a number -> {name}"
        
    return False


In [89]:
letter_columns = [
    'AREA',
    'CHARACTERISTICS',
    'DEPARTMENT',
    'NAME', 
    'GROUP',
    'DAY/TIME/ROOM1',
    'DAY/TIME/ROOM2',
    'DAY/TIME/ROOM3',
    'DAY/TIME/ROOM4',
    'DAY1',
    'DAY2',
    'DAY3',
    'DAY4',
    'PROFESSORS',
    'PROFESSOR1',
    'PROFESSOR2',
    'PROFESSOR3',
    'PROFESSOR4'
]
number_columns = [
    "_ID",
    'HOURS/WEEK',
    "TIME1",
    "TIME2",
    "TIME3",
    "TIME4",
    "index_in_page"
]

ignored_columns= [ # We do not need to check all the data because it is probable we are not going to use all the columns
    'SEMESTER',
    'NOTES'
]

missing = [miss for miss in normalized_data.columns if miss not in letter_columns and miss not in number_columns and miss not in ignored_columns]
# Asserting we are checking all the columns
assert len(letter_columns) + len(number_columns) + len(ignored_columns) == len(normalized_data.columns), f"{len(letter_columns) + len(number_columns)} -> {len(normalized_data.columns)}. \nMissing -> {missing}"

# Check letter data
for column in letter_columns:
    error = check_initial_letter(normalized_data, column)
    if error: print(error)
    
# Check number data
for column in number_columns:
    error = check_initial_number(normalized_data, column)
    if error: print(error)

The record -> 198 in column -> DAY/TIME/ROOM1 does not start or end with a letter -> LUNES/12-14/
The record -> 198 in column -> DAY/TIME/ROOM2 does not start or end with a letter -> VIERNES/12-14/
The record -> 199 in column -> PROFESSORS does not start or end with a letter -> BIRZABITH MENDOZA NOVELO/


In [91]:
# Correct these errors (some of them are not errors, just inconsistencies in the database)
normalized_data.at[155, 'DAY1'] = 'VIERNES'
normalized_data.at[113, 'PROFESSORS'] = 'ARTURO VEGA GONZALEZ'
normalized_data.at[199, 'PROFESSORS'] = 'BIRZABITH MENDOZA NOVELO'

# 📤**Export the Data**
We are done with the data analysis and cleaning. We can now export this file to build the algorithm that creates the schedule combinations.

In [92]:
# In JSON because that is how the data is going to be loaded from the request to the API
normalized_data.to_json('../data/clean_schedules.json', orient = 'records')
# Let's also export to CSV because we are going to use Airtable and that is the format used there
normalized_data.to_csv('../data/clean_schedules.csv', index = False)

# 🤔 What is next?
Now that I have the clean data in a CSV format, this file is going to be used to create (or update) a table in Airtable, which I am going to use in the following steps for creating the algorithm for making the schedule combinations.