## Interpolate Data
Some schools in the EOC dataset are missing test scores, recorded as `0.0` in the `QualityScore` column. We interpolate these values to smooth the data.

In [1]:
# import packages
import pandas as pd

In [27]:
# import the final merged csv from previous notebook
df = pd.read_csv('Cleaned-Data/merged.csv')

# Extract the starting year from 'SchoolYear' (e.g. '2010-11' -> 2010)
df['StartYear'] = df['SchoolYear'].str[:4].astype(int)

print(df.head())

  SchoolYear  QualityScore                           School  GeoCoded_Y  \
0    2010-11      0.000000         Aki Kurose Middle School   47.546628   
1    2010-11      0.000000    Albert Einstein Middle School   47.769758   
2    2010-11      0.580000  Auburn Mountainview High School   47.342018   
3    2010-11      0.517067     Auburn Riverside High School   47.266955   
4    2010-11      0.462733        Auburn Senior High School   47.308599   

   GeoCoded_X Grade Category  StartYear  
0 -122.282550  Middle School       2010  
1 -122.362938  Middle School       2010  
2 -122.172016    High School       2010  
3 -122.223542    High School       2010  
4 -122.219837    High School       2010  


## Interpolation Strategy
### Case 1:
If a missing value sits at the start of a school's timeline (no observed score before it), we fill it with the school's mean score multiplied by `(1 / variance)`, where the mean and `variance` are computed over that school's non-zero scores. If the variance is zero, the plain mean is used instead.
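
As a rough illustration of Case 1, the sketch below computes the fill value for a single school. The helper name `leading_fill_value` and the sample scores are made up for illustration, and treating an undefined (NaN) variance the same way as a zero variance is an assumption of this sketch.

```python
import pandas as pd

def leading_fill_value(scores: pd.Series) -> float:
    """Fill value for a leading missing score: mean * (1 / variance) of the non-zero scores."""
    observed = scores[scores != 0.0]
    mean_score = observed.mean()
    variance = observed.var()  # sample variance (ddof=1), the pandas default
    # Assumption: fall back to the plain mean when the variance is zero or NaN
    if pd.isna(variance) or variance == 0:
        return mean_score
    return mean_score * (1 / variance)

# Hypothetical scores for one school; the leading 0.0 marks a missing value
scores = pd.Series([0.0, 0.50, 0.60, 0.70])
fill = leading_fill_value(scores)
print(fill)
```

With these sample scores the mean is 0.6 and the sample variance is about 0.01, so the fill value comes out near 60. A small variance makes `1 / variance` large, so the result can land far outside the range of the observed scores.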

### Case 2:
If a missing value has an observed score both before and after it, we fill it by linear interpolation between the nearest observed score on each side, using the start years as the x-axis.
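
A minimal sketch of the Case 2 calculation, assuming the nearest observed neighbours have already been found. The function name `linear_interpolate` and the sample numbers are hypothetical.

```python
def linear_interpolate(prev_year, prev_score, next_year, next_score, current_year):
    """Score at current_year on the straight line through the two observed points."""
    slope = (next_score - prev_score) / (next_year - prev_year)
    return prev_score + slope * (current_year - prev_year)

# Hypothetical school: observed 0.40 in 2010 and 0.70 in 2013, missing in 2011
value = linear_interpolate(2010, 0.40, 2013, 0.70, 2011)
print(value)
```

Here the score rises by 0.30 over three years, so the 2011 value sits one third of the way along, at roughly 0.50.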

In [43]:
schools_grouped = df.groupby('School', as_index=False)

# list to store updated groups
updated_groups = []

for name, group in schools_grouped:
    # Sort the group by 'StartYear'
    group = group.sort_values(by='StartYear').reset_index(drop=True)
    
    # Identify missing and observed indices
    missing_indices = group[group['QualityScore'] == 0.0].index
    observed_indices = group[group['QualityScore'] != 0.0].index
    
    # Only interpolate when there is something to fill and something to fill from
    if len(missing_indices) > 0 and len(observed_indices) > 0:
        # Mean and sample variance of the observed (non-zero) scores
        observed_scores = group.loc[observed_indices, 'QualityScore']
        mean_score = observed_scores.mean()
        variance = observed_scores.var()
        
        for idx in missing_indices:
            current_year = group.loc[idx, 'StartYear']
            
            # Find previous and next observed indices
            prev_indices = observed_indices[observed_indices < idx]
            next_indices = observed_indices[observed_indices > idx]
            
            prev_idx = prev_indices.max() if len(prev_indices) > 0 else None
            next_idx = next_indices.min() if len(next_indices) > 0 else None
            
            if prev_idx is not None and next_idx is not None:
                # Case 2: linear interpolation between the nearest observed scores
                prev_year = group.loc[prev_idx, 'StartYear']
                next_year = group.loc[next_idx, 'StartYear']
                prev_score = group.loc[prev_idx, 'QualityScore']
                next_score = group.loc[next_idx, 'QualityScore']
                
                # Calculate interpolated score
                interpolated_score = prev_score + ((next_score - prev_score) / (next_year - prev_year)) * (current_year - prev_year)
                group.loc[idx, 'QualityScore'] = interpolated_score
                
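            elif prev_idx is None:
                # Case 1: missing value at the start of the timeline.
                # var() is NaN when only one score is observed, so fall back
                # to the plain mean in that case as well as for zero variance.
                if pd.notna(variance) and variance != 0:
                    adjusted_mean = mean_score * (1 / variance)
                else:
                    adjusted_mean = mean_score

                group.loc[idx, 'QualityScore'] = adjusted_mean
            # A missing value after the last observed score matches neither case and stays 0.0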
                
    # Append the modified group to the list
    updated_groups.append(group)
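
For comparison, the Case 2 step alone could also be written with pandas' built-in `Series.interpolate`. This is an alternative sketch, not the method used above: it does not apply the Case 1 fill and leaves leading and trailing gaps as `0.0`. The function name `interpolate_interior` and the demo data are hypothetical.

```python
import numpy as np
import pandas as pd

def interpolate_interior(group: pd.DataFrame) -> pd.DataFrame:
    """Linearly fill zero scores that have observed scores on both sides."""
    group = group.sort_values('StartYear').reset_index(drop=True)
    # Treat 0.0 as missing and index the scores by year so spacing in years is respected
    scores = pd.Series(
        group['QualityScore'].replace(0.0, np.nan).to_numpy(),
        index=group['StartYear'].to_numpy(),
    )
    # limit_area='inside' fills only NaNs surrounded by valid values
    filled = scores.interpolate(method='index', limit_area='inside')
    # Restore the 0.0 marker for gaps at the start or end of the timeline
    group['QualityScore'] = filled.fillna(0.0).to_numpy()
    return group

# Made-up timeline for one school
demo = pd.DataFrame({
    'StartYear': [2010, 2011, 2012, 2013, 2014],
    'QualityScore': [0.0, 0.4, 0.0, 0.6, 0.0],
})
result = interpolate_interior(demo)
print(result)
```

Only the 2012 gap is filled here, since it is the only missing value with an observed score on each side.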

In [44]:
# Concatenate all updated groups into a single DataFrame
updated_df = pd.concat(updated_groups, ignore_index=True)
updated_df = updated_df.drop(columns=['StartYear'])

# save as csv
updated_df.to_csv('Cleaned-Data/merged-interpolated.csv', index=False)