
# Cleaning Calgary High School Sports Data

- Run this file after `hs_sports_pull.ipynb`
- Takes input `sports.pkl` from `hs_sports_pull.ipynb`
- Cleans `YEAR WON` and `SPORT` columns
- Splits ties from `WINNER` column, and pivots taller so each `WINNER` of a tie is on a new row
- Cleaning the `WINNER` column raises two unresolved `SettingWithCopyWarning` messages, but the output is correct
- The columns `LEVEL`, `GENDER`, and `DIVISION` are not cleaned
- The `Boys` and `Girls` prefixes are stripped from `SPORT` values (e.g. soccer); for gender analysis, make sure this information is captured in the `GENDER` column at an earlier step
- Output DataFrame saved as `sports_3.pkl`
- Output is ready to fuzzy-match `WINNER` against the master list of schools

Load Packages

In [1]:
# %pip install fuzzywuzzy
# %pip install python-Levenshtein

In [2]:
import pandas as pd
import re

1. Load the combined data

In [3]:
sports = pd.read_pickle('sports.pkl')

In [4]:
sports['YEAR WON'] = sports['YEAR WON'].str.strip()

In [5]:
sports['WINNER'] = sports.WINNER.str.strip()

In [6]:
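# Note: Series.replace matches whole cell values, so this only blanks cells that are
# exactly "\n"; newlines inside longer strings stay, and split_ties() relies on them.
sports['WINNER'] = sports.WINNER.replace("\n",'')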
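
The difference between `Series.replace` and `Series.str.replace` is easy to miss, so here is a minimal sketch on toy data (the values are hypothetical, not taken from `sports.pkl`). The first call swaps whole cell values; the second edits text inside each string. Note that removing embedded newlines at this point would change what the later tie-splitting regex sees, so this is an illustration rather than a suggested change.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from sports.pkl).
winners = pd.Series(["School A\nand School B", "\n", "School C"])

# Series.replace swaps whole cell values: only the cell that is exactly "\n" changes.
whole_cell = winners.replace("\n", "")

# Series.str.replace works inside each string: every newline is removed.
substring = winners.str.replace("\n", "", regex=False)

print(whole_cell.tolist())
print(substring.tolist())
```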

In [7]:
sports = sports.reset_index(drop = True)

In [8]:
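def split_ties():
    """Split tied winners in the global `sports` frame into WINNER, WINNER2 and WINNER3."""
    sports['WINNER2'] = ''
    sports['WINNER3'] = ''
    # The index was reset above, so label i (.loc) and position i (.iloc) refer to the same row
    for i in range(len(sports)):
        # Ties are separated by ", ", a newline followed by "and ", or "/"
        winners = re.split(r", |\nand |/", sports.iloc[i]['WINNER'])
        if len(winners) == 3:
            sports.loc[i, 'WINNER'] = winners[0]
            sports.loc[i, 'WINNER2'] = winners[1]
            sports.loc[i, 'WINNER3'] = winners[2]
        elif len(winners) == 2:
            sports.loc[i, 'WINNER'] = winners[0]
            sports.loc[i, 'WINNER2'] = winners[1]
        # Rows with a single winner, or more than three pieces, are left unchanged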

In [9]:
split_ties()

In [10]:
sports_tall = sports.melt(id_vars=["YEAR WON", "SPORT", "LEVEL", "GENDER", "DIVISION"], 
        var_name="winnner_num", 
        value_name="WINNER_SCHOOL")

In [11]:
sports_tall = sports_tall.rename(columns = {'WINNER_SCHOOL' : "WINNER"})
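
The split-then-melt steps above could also be sketched with `str.split` followed by `DataFrame.explode`, which puts each tied winner on its own row in one pass. This is an alternative shown on a toy frame with hypothetical values, not the notebook's method; only the column names are taken from the notebook. One behavioural difference: `explode` handles any number of tied winners, whereas `split_ties()` leaves rows with more than three pieces unsplit.

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names come from the notebook.
toy = pd.DataFrame({
    "YEAR WON": ["2001", "2002", "2003"],
    "SPORT": ["Soccer", "Golf", "Rugby"],
    "WINNER": ["School A", "School B\nand School C", "School D, School E/School F"],
})

# Split on the same separators as split_ties(): ", ", "\nand ", or "/".
toy["WINNER"] = toy["WINNER"].str.split(r", |\nand |/")

# explode() gives each list element its own row and repeats the other columns.
toy_tall = toy.explode("WINNER").reset_index(drop=True)
print(toy_tall)
```

Because no empty `WINNER2`/`WINNER3` placeholders are created, this variant would not need the empty string in the later filter list.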

In [12]:
sports_3 = sports_tall[~sports_tall['WINNER'].isin(['did not occur', 'did not occur due to Covid19 school closures',
                                        'Not held due to labour unrest','Not Awarded', ''])]

In [13]:
sports_3['WINNER'] = sports_3['WINNER'].str.strip()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  sports_3['WINNER'] = sports_3['WINNER'].str.strip()


In [14]:
sports_3['WINNER'] = sports_3.WINNER.str.replace("TIE -- ",'')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  sports_3['WINNER'] = sports_3.WINNER.str.replace("TIE -- ",'')
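
Both warnings above come from assigning to a column of a DataFrame that was produced by boolean filtering. A common way to avoid them is to take an explicit `.copy()` when filtering. The sketch below uses a toy frame with hypothetical values and the hypothetical names `toy_tall` and `toy_3`; it is an illustration of the pattern, not a change to the notebook's cells.

```python
import pandas as pd

# Toy frame with hypothetical values.
toy_tall = pd.DataFrame({"WINNER": [" School A ", "did not occur", "TIE -- School B"]})

# .copy() makes the filtered result an independent DataFrame,
# so later column assignments are unambiguous.
toy_3 = toy_tall[~toy_tall["WINNER"].isin(["did not occur"])].copy()

toy_3["WINNER"] = toy_3["WINNER"].str.strip()
toy_3["WINNER"] = toy_3["WINNER"].str.replace("TIE -- ", "", regex=False)
print(toy_3)
```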


In [15]:
sports_3 = sports_3.drop(columns = ['winnner_num'])

In [16]:
# Title-case SPORT, drop the 'Girls ' / 'Boys ' prefixes, and abbreviate 'And' to '&'
sports_3['SPORT'] = (sports_3['SPORT'].str.title()
                     .str.replace('Girls ', '')
                     .str.replace('Boys ', '')
                     .str.replace('And', '&'))
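
Stripping the `Boys` and `Girls` prefixes discards any gender information that lives only in the sport name. If that information is needed, one option is to copy the prefix into `GENDER` before removing it. The sketch below works on toy data with hypothetical values, and it assumes that an existing `GENDER` value should take priority over the prefix; the real data may call for a different rule.

```python
import pandas as pd

# Toy data (hypothetical values); column names follow the notebook.
toy = pd.DataFrame({
    "SPORT": ["girls soccer", "Boys Soccer", "track and field"],
    "GENDER": [None, None, "Mixed"],
})

titled = toy["SPORT"].str.title()

# Capture a leading "Boys" or "Girls" before it is stripped; NaN where there is no prefix.
prefix = titled.str.extract(r"^(Boys|Girls) ", expand=False)

# Assumption: an existing GENDER value wins; the prefix only fills gaps.
toy["GENDER"] = toy["GENDER"].fillna(prefix)

# Same clean-up idea as the notebook, but anchored to the start of the string
# and matching " And " only as a separate word.
toy["SPORT"] = (
    titled.str.replace(r"^(Boys|Girls) ", "", regex=True)
          .str.replace(" And ", " & ", regex=False)
)
print(toy)
```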

In [17]:
sports_3.to_pickle('sports_3.pkl')

In [18]:
sports_pkl = pd.read_pickle('sports_3.pkl')
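
The saved output is meant to be fuzzy-matched against a master list of schools in a later step. As a rough idea of what that matching involves, here is a sketch using the standard library's `difflib` as a stand-in for `fuzzywuzzy`. The school names, the `best_match` helper, and the 0.6 cutoff are all hypothetical choices for illustration.

```python
from difflib import get_close_matches

# Hypothetical master list and cleaned winner names, for illustration only.
master_schools = ["Western Canada High School", "Crescent Heights High School"]
winners = ["Western Canada", "Crescent Heights", "Unknown Academy"]

def best_match(name, choices, cutoff=0.6):
    """Return the closest entry in choices, or None if nothing clears the cutoff."""
    matches = get_close_matches(name, choices, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Map each winner name to its closest master-list entry (or None).
matched = {name: best_match(name, master_schools) for name in winners}
for name, school in matched.items():
    print(name, "->", school)
```

Names that map to `None` would need manual review or a lower cutoff.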