# Imports

In [69]:
## imports
import pandas as pd
import re
import numpy as np

## print multiple things from same cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Load data and show examples

In [70]:
df = schools_df = pd.read_csv("../public_data/schools_df.csv")
schools_df.head()

Unnamed: 0,schoolname,individualispjune2020,participatingincepsy2021,is_elem_exercise,is_charter_exercise,is_highschool_exercise
0,stove prairie elementary school,0.0,N,True,False,False
1,stewart county elementary school,0.7603,Y,True,False,False
2,desert springs elementary school,,N,True,False,False
3,saunemin elem school,0.3893999999999999,N,True,False,False
4,fifth district elementary,0.0275,N,True,False,False


# re.sub illustration

**Task**: 

- Use the `schools_df` dataset and filter to `is_elem_exercise` == True 
- Using the `schoolname` field, replace the different varieties of elementary school in the data with `elemschool` 

## Creating a new data set only with elementary schools 

Returns incorrect results that we'll see below

In [71]:
# Create a new DataFrame containing only the elementary-school rows
elem_ex = schools_df[schools_df.is_elem_exercise].copy() # .copy() makes an independent DataFrame, so changes to elem_ex do not affect schools_df

elem_ex.head()

Unnamed: 0,schoolname,individualispjune2020,participatingincepsy2021,is_elem_exercise,is_charter_exercise,is_highschool_exercise
0,stove prairie elementary school,0.0,N,True,False,False
1,stewart county elementary school,0.7603,Y,True,False,False
2,desert springs elementary school,,N,True,False,False
3,saunemin elem school,0.3893999999999999,N,True,False,False
4,fifth district elementary,0.0275,N,True,False,False
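
To see what `.copy()` buys us, here is a minimal sketch on a made-up toy data frame (the names `toy_df` and `toy_elem` and the column values are assumptions for illustration, not part of the real data). Adding a column to the filtered copy leaves the original untouched.

```python
import pandas as pd

# Toy data invented for illustration; not the real schools data
toy_df = pd.DataFrame({
    "schoolname": ["a elem", "b high", "c elementary"],
    "is_elem": [True, False, True],
})

# Filter with a boolean column, then take an explicit copy
toy_elem = toy_df[toy_df.is_elem].copy()

# Add a column to the copy only
toy_elem["cleaned_name"] = "placeholder"

print(list(toy_df.columns))    # the original has no new column
print(list(toy_elem.columns))  # the copy does
```

Taking the explicit copy also avoids pandas warning about setting values on a slice of another data frame.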


# Regex Syntax

* `r"string"` makes the string a RAW string, meaning backslashes are kept as literal characters instead of starting an escape sequence. Example: `"\n"` in Python is a newline character, while the raw string `r"\n"` is just a backslash followed by the letter n. Regex patterns are usually written as raw strings so that sequences like `\s` and `\d` reach the regex engine unchanged.

* `re.sub(pattern, replacement, string)`. `pattern` is what we are searching for, `replacement` is what we are replacing it with, and `string` is the string we are performing the substitution on.

* `re.sub` is useful when we need to replace many values that follow a pattern. For example, the pattern `\d{1,3}` matches a run of one to three digits.
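
A small sketch of the points above, using a made-up input string (the text and variable names are assumptions for illustration only). It compares a normal and a raw string, then uses `re.sub` with `\d{1,3}`.

```python
import re

# "\n" is ONE character (a newline); r"\n" is TWO characters (backslash + n)
normal_length = len("\n")
raw_length = len(r"\n")
print(normal_length, raw_length)

# re.sub(pattern, replacement, string):
# replace every run of one to three digits with "#"
masked = re.sub(r"\d{1,3}", "#", "room 12, floor 3, building 205")
print(masked)
```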


## A correct approach

Addresses issues above with `elementary school` and `elem.`

In [72]:
# this is our pattern, written as a RAW string using the r prefix
# elem.* matches "elem" followed by any characters (zero or more), like (elem)entary
# \s+ matches one or more whitespace characters
# the ? after a group makes that group optional: it may or may not be there
# note: .* is greedy, so (elem.*) already consumes the rest of the name
elem_pattern_try2 = r"(elem.*)(\s+)?(school)?"

# Create a list of the new names for our schools:
# replace whatever matches the pattern above with "elemschool", for every school in our data frame
new_schools_try2 = [re.sub(elem_pattern_try2, "elemschool", school) 
                    for school in elem_ex.schoolname]    

# actually goes and creates a new column using the list we made above
elem_ex['cleaned_name'] = new_schools_try2

# prints out our new column and the original column
elem_ex[["schoolname", "cleaned_name"]]

Unnamed: 0,schoolname,cleaned_name
0,stove prairie elementary school,stove prairie elemschool
1,stewart county elementary school,stewart county elemschool
2,desert springs elementary school,desert springs elemschool
3,saunemin elem school,saunemin elemschool
4,fifth district elementary,fifth district elemschool
5,paint branch elementary,paint branch elemschool
6,oak hill elem.,oak hill elemschool
7,lewis and clark elem.,lewis and clark elemschool
8,linden elementary school,linden elemschool
9,winchester avenue elementary school,winchester avenue elemschool
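
One thing worth checking here is how greedy `.*` behaves. The sketch below compares the pattern from the cell above with a stricter alternative; `strict_pattern` and the name with "annex" at the end are hypothetical, made up to expose the difference. On the names in our data both patterns should give the same result.

```python
import re

original_pattern = r"(elem.*)(\s+)?(school)?"

# Hypothetical stricter pattern: "elem", optionally "entary", optionally ".",
# optionally whitespace followed by "school"
strict_pattern = r"elem(?:entary)?\.?(?:\s+school)?"

# Made-up name with extra text after "school"
name = "oak hill elementary school annex"

# Group 1 of the original pattern captures everything from "elem" to the end
greedy_group1 = re.search(original_pattern, name).group(1)
greedy_result = re.sub(original_pattern, "elemschool", name)
strict_result = re.sub(strict_pattern, "elemschool", name)

print(greedy_group1)
print(greedy_result)
print(strict_result)

# Names taken from the data above: both patterns agree on these
real_names = ["saunemin elem school", "oak hill elem.", "fifth district elementary"]
same_on_real_names = all(
    re.sub(original_pattern, "elemschool", n) == re.sub(strict_pattern, "elemschool", n)
    for n in real_names
)
print(same_on_real_names)
```

The stricter pattern is only a sketch: it would still match "elem" inside an unrelated word, so it should be checked against the actual names before being used.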


# re.findall and re.search illustrations

**Task**: 

- Filter to `is_charter_exercise` == True; note that this contains a mix of schools with charter in the name and schools without
- Construct a pattern that, for charter schools, gets the school name prior to appearance of the word charter. School names without charter will not have matches (so Hanover Charter becomes Hanover; Hanover High stays Hanover High)


## re.findall 

In [73]:
## filter to charter exercise
charter_ex = schools_df[schools_df.is_charter_exercise].copy() # new data frame with only the rows where is_charter_exercise is True
charter_ex.head(6)

Unnamed: 0,schoolname,individualispjune2020,participatingincepsy2021,is_elem_exercise,is_charter_exercise,is_highschool_exercise
10,frontier elementary school,8.51%,N,False,True,False
11,life source international charter,0.7201946472019465,Y,False,True,False
12,east valley senior high,0.45807770961145194,Y,False,True,False
13,children's community charter,0.8888888888888888,Y,False,True,False
14,south fork elementary,0.49640287769784175,Y,False,True,False
15,thomas edison charter academy ...,0.2855191256830601,N,False,True,False


# Findall method from re

* `re.findall(pattern, string, flags=0)`
* `flags` is an optional parameter that changes how matching works, e.g. `re.IGNORECASE` makes the match case-insensitive
* `pattern` is the pattern we are searching for in the string
* `string` is the actual string that we are searching in
* similar to `re.sub`, except that it does not replace anything; it only finds the matches
* `findall` returns a list of all matches as strings (or as tuples of groups when the pattern has more than one group), whereas `re.search` returns a match object for the first match only, or `None` if nothing matches
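
To make the return types concrete, here is a minimal sketch on made-up strings (the inputs and variable names are assumptions for illustration). It shows `findall` with no groups, one group, and two groups, plus the `re.IGNORECASE` flag and the match object from `re.search`.

```python
import re

# No groups: a list of the matched strings
no_groups = re.findall(r"\d+", "rooms 12 and 305")

# One group: a list of what that group captured
one_group = re.findall(r"(\w+)\s+charter", "hanover charter")

# Two or more groups: a list of tuples, one element per group
two_groups = re.findall(r"(\w+)\s+(charter)", "hanover charter")

# flags changes how matching works, here ignoring case
ignore_case = re.findall(r"charter", "Hanover CHARTER", flags=re.IGNORECASE)

# re.search returns a match object for the first match, or None
first_match = re.search(r"\d+", "rooms 12 and 305")
no_match = re.search(r"\d+", "no digits here")

print(no_groups, one_group, two_groups, ignore_case)
print(first_match.group(0), no_match)
```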


In [74]:
## charter pattern
# matches any name that contains the word charter preceded by whitespace
# because the pattern has four groups, findall returns a list of tuples (one tuple per match, one element per group)
charter_pattern = r"(.*)\s+(charter)(\s+)?(\w+)?"

## findall: searches the name of every school in our data frame and creates a list of lists
test_charter_findall = [re.findall(charter_pattern, school)
                        for school in charter_ex.schoolname]

## print result
test_charter_findall

[[],
 [('life source international', 'charter', '', '')],
 [],
 [("children's community", 'charter', '', '')],
 [],
 [('thomas edison', 'charter', ' ', 'academy')],
 [('moving everest', 'charter', ' ', 'school')],
 [],
 [],
 [('south valley academy', 'charter', ' ', 'school')],
 [('brighter choice', 'charter', ' ', 'school')],
 [('buffalo collegiate', 'charter', ' ', 'school')],
 [('neighborhood', 'charter', ' ', 'school')],
 [],
 [],
 []]

In [7]:
## show example of one
print(test_charter_findall[1][0][0])

life source international
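
Indexing one result by hand works, but an empty list (no match) would raise an error at `[0]`. A possible way to pull the first group out of every `findall` result is sketched below; the short `names` list reuses three names from the data and `precharter` is a hypothetical variable name.

```python
import re

charter_pattern = r"(.*)\s+(charter)(\s+)?(\w+)?"

names = [
    "frontier elementary school",
    "life source international charter",
    "thomas edison charter academy",
]

findall_results = [re.findall(charter_pattern, name) for name in names]

# matches[0] is the first (here: only) tuple, matches[0][0] is group 1;
# an empty list means no match, so fall back to an empty string
precharter = [matches[0][0] if matches else "" for matches in findall_results]

print(precharter)
```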


## re.search

In [75]:
# re.search(pattern, string, flags=0) is similar to re.findall, but instead of a list of strings
# it returns a match object for the first match, or None if there is no match


## get matches
# uses the same pattern as above
test_charter_search = [re.search(charter_pattern, school) 
                       for school in charter_ex.schoolname]

test_charter_search


[None,
 <re.Match object; span=(0, 33), match='life source international charter'>,
 None,
 <re.Match object; span=(0, 28), match="children's community charter">,
 None,
 <re.Match object; span=(0, 29), match='thomas edison charter academy'>,
 <re.Match object; span=(0, 29), match='moving everest charter school'>,
 None,
 None,
 <re.Match object; span=(0, 35), match='south valley academy charter school'>,
 <re.Match object; span=(0, 30), match='brighter choice charter school'>,
 <re.Match object; span=(0, 33), match='buffalo collegiate charter school'>,
 <re.Match object; span=(0, 27), match='neighborhood charter school'>,
 None,
 None,
 None]

In [76]:
## extract matches

### here, we're just focusing on the 3rd match, i.e. the 6th entry, which is index 5 (thomas edison charter academy)
### and we're getting the first group from that match
thomas_match = test_charter_search[5]
thomas_match

### example where we're just getting the first group
### (name of school before charter)
thomas_firstgroup = thomas_match.group(1) # group(i) returns the text captured by the i-th parenthesized group in the pattern;
# group(0) is just the entire match
thomas_firstgroup


<re.Match object; span=(0, 29), match='thomas edison charter academy'>

'thomas edison'

In [77]:
# Look at how grouping works

### iterate over all groups and print
for i in range(0, len(thomas_match.groups())+1):
    print("Group " + str(i) + " is: ")
    print(thomas_match.group(i))

## going beyond the actual number of groups raises an IndexError,
## e.g. thomas_match.group(5)

Group 0 is: 
thomas edison charter academy
Group 1 is: 
thomas edison
Group 2 is: 
charter
Group 3 is: 
 
Group 4 is: 
academy
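
Two details about groups are easy to miss, so here is a short sketch (the variable names are made up for illustration). Asking for a group number that does not exist raises an `IndexError`, and an optional group that did not take part in the match comes back as `None` from a match object, whereas `re.findall` shows it as an empty string.

```python
import re

charter_pattern = r"(.*)\s+(charter)(\s+)?(\w+)?"
match = re.search(charter_pattern, "life source international charter")

# Optional groups 3 and 4 did not match anything, so they are None
groups = match.groups()

# groups(default=...) swaps in a value for the groups that did not match
groups_filled = match.groups(default="")

# The pattern has four groups, so group(5) does not exist
try:
    match.group(5)
    error_message = ""
except IndexError as err:
    error_message = str(err)

print(groups)
print(groups_filled)
print(error_message)
```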


In [78]:
## can also extract the groups as a tuple
## example- want to return group 1 and group 2 and paste together
thomas_groups_all = thomas_match.groups()
thomas_groups_all

## slice the tuple
thomas_groups_all[0:2]


('thomas edison', 'charter', ' ', 'academy')

('thomas edison', 'charter')
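
The cell above slices the tuple but stops before pasting the pieces together. A minimal sketch of that last step, assuming we want the two groups separated by a single space:

```python
import re

charter_pattern = r"(.*)\s+(charter)(\s+)?(\w+)?"
match = re.search(charter_pattern, "thomas edison charter academy")

# Slice the tuple of groups, then join the pieces with a space
joined_from_slice = " ".join(match.groups()[0:2])

# group() also accepts several group numbers and returns them as a tuple
joined_from_group = " ".join(match.group(1, 2))

print(joined_from_slice)
print(joined_from_group)
```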

In [79]:
## can generalize to the full list with ifelse
def get_precharter_name(one_matchobj):
    
    if one_matchobj:
        school_name = one_matchobj.group(1)
    else:
        school_name = ""
    
    return school_name

all_charter_match = [get_precharter_name(one_search) 
                    for one_search in test_charter_search]

all_charter_match

['',
 'life source international',
 '',
 "children's community",
 '',
 'thomas edison',
 'moving everest',
 '',
 '',
 'south valley academy',
 'brighter choice',
 'buffalo collegiate',
 'neighborhood',
 '',
 '',
 '']
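
The function above returns an empty string when there is no match. The task description asks for non-charter names to stay as they are (Hanover High stays Hanover High), so one possible variant is sketched below; `get_precharter_or_original` is a hypothetical name and the two example names are made up.

```python
import re

charter_pattern = r"(.*)\s+(charter)(\s+)?(\w+)?"

def get_precharter_or_original(name):
    """Return the text before 'charter', or the unchanged name if there is no match."""
    match = re.search(charter_pattern, name)
    return match.group(1) if match else name

examples = ["hanover charter", "hanover high"]
cleaned = [get_precharter_or_original(name) for name in examples]

print(cleaned)
```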

# Group activity

## Part 1: Subsetting
Filter the data to only those rows where `is_highschool_exercise` is True.

In [80]:
# your code here to filter high school data

df2 = high_school_data_frame = df[df["is_highschool_exercise"] == True].copy()
df2 = df2.reset_index(drop = True)

df2.schoolname.unique()


array(['mount pleasant area jshs', 'huron high school',
       'thomson high school',
       'kings county office of education highland facility',
       'clovis east high', 'camden jr. high school',
       'jackson junior high', 'emmett junior high school', 'atkins high',
       'lexington senior high', 'temple hs', 'forest hill high school',
       'pittsfield high', 'matanzas high school', 'pontiac high school'],
      dtype=object)

## Part 2: Standardizing names
Try out some regex patterns to standardize the high school names (e.g., 'high school' and 'high' could both become 'highschool'). In other words, make every variant 'highschool'.

**Hint:** Look at the school names for hints on what to avoid matching--e.g., 'highland facility'. To avoid things like this, after 'high' or 'hs', have your pattern look for a space (`\s`) or the end of the string (`$`). 

In [86]:
# your code pattern here
hs_sub_pattern = r"(jshs|high(\s.*|$)|hs)"

test_high_findall = [ re.findall(hs_sub_pattern, school)
    for school in df2["schoolname"]
]

test_high_findall # this is a list of lists (one list of matches per school), not a pandas Series

# What is a Series? Each column of a DataFrame is a Series: a one-dimensional labeled array with an index


[[('jshs', '')],
 [('high school', ' school')],
 [('high school', ' school')],
 [],
 [('high', '')],
 [('high school', ' school')],
 [('high', '')],
 [('high school', ' school')],
 [('high', '')],
 [('high', '')],
 [('hs', '')],
 [('high school', ' school')],
 [('high', '')],
 [('high school', ' school')],
 [('high school', ' school')]]

In [89]:
[re.sub(hs_sub_pattern, "highschool", x) for x in df2["schoolname"]]

['mount pleasant area highschool',
 'huron highschool',
 'thomson highschool',
 'kings county office of education highland facility',
 'clovis east highschool',
 'camden jr. highschool',
 'jackson junior highschool',
 'emmett junior highschool',
 'atkins highschool',
 'lexington senior highschool',
 'temple highschool',
 'forest hill highschool',
 'pittsfield highschool',
 'matanzas highschool',
 'pontiac highschool']
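
The pattern above works on these 15 names, but the bare `hs` alternative would also match the letters "hs" inside a longer word. Below is a sketch of a possible alternative using word boundaries (`\b`); `bounded_pattern` and the name "griffiths high" are hypothetical, made up to show the difference.

```python
import re

original_pattern = r"(jshs|high(\s.*|$)|hs)"

# Hypothetical alternative: \b requires a word boundary on both sides,
# so "highland" and an "hs" inside a word are left alone
bounded_pattern = r"\b(?:jshs|hs|high(?:\s+school)?)\b"

names = [
    "temple hs",
    "camden jr. high school",
    "kings county office of education highland facility",
    "griffiths high",  # made-up name with "hs" inside a word
]

original_results = [re.sub(original_pattern, "highschool", n) for n in names]
bounded_results = [re.sub(bounded_pattern, "highschool", n) for n in names]

for before, after in zip(original_results, bounded_results):
    print(before, "|", after)
```

Both patterns agree on the three names taken from the data and differ only on the made-up one.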

## Part 3: Match schools
Using some example results, try writing a regex pattern and using `re.match` to get the name of the school that precedes the 'highschool' part of the name (e.g., 'new trier highschool' -> 'new trier')

In [None]:
# your code here to extract names of high schools
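
One possible starting point for Part 3, offered as a sketch rather than the intended solution. The pattern `prefix_pattern` and the function name `get_prehighschool_name` are assumptions, and the example names are standardized names from Part 2 plus the one from the prompt. Note that `re.match` only matches at the beginning of the string, unlike `re.search`, which scans the whole string.

```python
import re

# Capture everything before the whitespace that precedes "highschool"
prefix_pattern = r"(.*)\s+highschool"

def get_prehighschool_name(name):
    """Return the text before 'highschool', or '' if the name does not match."""
    match = re.match(prefix_pattern, name)
    return match.group(1) if match else ""

examples = [
    "new trier highschool",
    "camden jr. highschool",
    "kings county office of education highland facility",
]
prefixes = [get_prehighschool_name(name) for name in examples]
print(prefixes)

# re.match anchors at the start of the string; re.search does not
match_at_start = re.match(r"highschool", "huron highschool")
match_anywhere = re.search(r"highschool", "huron highschool")
print(match_at_start, match_anywhere.group(0))
```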