# SAT Semantic Mapping for Classification

The SAT mapping classification process is as follows:

1. A similarity matrix is built with SAT segments in rows and the segments to be classified in columns. Each cell holds the semantic similarity of a segment-segment pair, computed from the segment encoding vectors.
2. A threshold is applied to the matrix, and the indices of columns containing at least one cell value at or above the threshold are collected.
3. The segments at those column indices are the segments classified by the SAT.
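
The three steps can be sketched on toy data as follows. This is a minimal illustration, not the notebook's implementation: the function name `classify_by_sat` and the two-dimensional vectors are hypothetical, and plain cosine similarity stands in for the similarity measure used in the cells further down.

```python
import numpy as np

def classify_by_sat(sat_vectors, candidate_vectors, threshold):
    """Return the sorted column indices of candidates similar to any SAT segment.

    Hypothetical helper: cosine similarity is assumed as the similarity measure.
    """
    sat = np.asarray(sat_vectors, dtype=float)
    cand = np.asarray(candidate_vectors, dtype=float)
    # Normalise rows so that the dot product equals the cosine similarity
    sat = sat / np.linalg.norm(sat, axis=1, keepdims=True)
    cand = cand / np.linalg.norm(cand, axis=1, keepdims=True)
    # Step 1: SAT segments in rows, candidate segments in columns
    sim_matrix = sat @ cand.T
    # Step 2: find cells at or above the threshold and keep their column indices
    _, cols = np.where(sim_matrix >= threshold)
    # Step 3: the unique column indices identify the classified segments
    return sorted(set(cols.tolist()))

# Toy encodings (made up for illustration)
sat_vectors = [[1.0, 0.0], [0.0, 1.0]]
candidate_vectors = [[1.0, 0.1], [1.0, 1.0], [-1.0, 0.0], [0.1, 1.0]]

classified = classify_by_sat(sat_vectors, candidate_vectors, threshold=0.9)
print(classified)
```

Candidates 0 and 3 each lie close to one of the SAT vectors and are classified. Candidate 1 sits midway between the two SAT vectors and candidate 2 points away from both, so neither reaches the threshold.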
    

In [None]:
__author__      = 'Roy Gardner'

%run ./_library/packages.py

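def do_load(model_path,exclusion_list=None,verbose=True):
    """Load every JSON file in model_path into a dict keyed by file name (without extension)."""
    # Use None as the default to avoid a shared mutable default argument
    if exclusion_list is None:
        exclusion_list = []
    if verbose:
        print('Loading model…')
    model_dict = {}

    # Only files directly inside model_path are considered
    _, _, files = next(os.walk(model_path))
    files = [f for f in files if f.endswith('.json') and f not in exclusion_list]
    for file in files:
        model_name = os.path.splitext(file)[0]
        # os.path.join works whether or not model_path ends with a separator
        with open(os.path.join(model_path, file), 'r', encoding='utf-8') as f:
            # The with block closes the file, so no explicit close() is needed
            model_dict[model_name] = json.load(f)
    if verbose:
        print('Finished loading model.')
    return model_dict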


## Load model

In [None]:
model_path = '../../model/'

model_dict = do_load(model_path,exclusion_list=[],verbose=True)

print('Finished')

## Build a SAT-segments similarity matrix

In [None]:
# Define the SAT topic
sat_topic = 'equalgr5'

print('Selected SAT:',sat_topic)

# Get the encoding vectors for the SAT
sat_segment_ids = model_dict['sat_segments_dict'][sat_topic]
sat_segment_indices = [model_dict['encoded_segments'].index(segment_id) for segment_id in sat_segment_ids]
sat_encodings = [model_dict['segment_encodings'][index] for index in sat_segment_indices]

print('Number of segments in SAT:',len(sat_segment_ids))
print()

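# Build a similarity matrix with SAT segments in rows and corpus segments in columns.
# ad.angular_distance is loaded by packages.py. Despite its name, its values are
# treated as similarities (higher means closer) when the matrix is thresholded.
print('Building matrix…')
sim_matrix = cdist(sat_encodings,model_dict['segment_encodings'],ad.angular_distance)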
print('Similarity matrix dimensions:',sim_matrix.shape)
print()
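
The metric `ad.angular_distance` comes from the notebook's library and its definition is not shown here. Because its values are later kept when they are at or above a threshold, it presumably returns a similarity. One common definition that fits this usage is angular similarity, one minus the angle between two vectors divided by pi. The sketch below is an assumption about that kind of measure, not the library's code, and the name `angular_similarity` is hypothetical.

```python
import numpy as np

def angular_similarity(u, v):
    """Angular similarity in [0, 1]: 1 for parallel vectors, 0 for opposite ones.

    Hypothetical stand-in; the notebook's own metric may be defined differently.
    """
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    cosine = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # Clip to guard against rounding errors just outside [-1, 1]
    cosine = np.clip(cosine, -1.0, 1.0)
    return 1.0 - np.arccos(cosine) / np.pi

same = angular_similarity([1.0, 0.0], [2.0, 0.0])        # same direction
orthogonal = angular_similarity([1.0, 0.0], [0.0, 3.0])  # right angle
opposite = angular_similarity([1.0, 0.0], [-1.0, 0.0])   # opposite direction
print(same, orthogonal, opposite)
```

A function with this two-vector signature can be passed to `cdist` as a callable metric, which is how the cell above uses `ad.angular_distance`.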



## Find above-threshold column segments in the similarity matrix

In [None]:
# Set a semantic similarity threshold
threshold = 0.74

# Threshold the matrix. np.where returns a (row indices, column indices) pair
# locating every cell whose value is at or above the threshold.
indices = np.where(sim_matrix >= threshold)

# Get the IDs of above-threshold column segments. A column can match several SAT rows,
# so we take the set of column indices to remove duplicates.
classified_segment_ids = [model_dict['encoded_segments'][index] for index in set(indices[1])]

# In this example the columns include the SAT segments themselves, so we remove them.
# This won't be the case in real-world scenarios where the segments to classify are unknown.
classified_segment_ids = set(classified_segment_ids).difference(set(sat_segment_ids))
print('Number of classified segments not in the SAT:',len(classified_segment_ids))
print()

# Display the classified segments
for segment_id in classified_segment_ids:
    segment_text = model_dict['segments_dict'][segment_id]['text']
    print(segment_id,segment_text)
    print()
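
To show why the column indices need deduplicating, the sketch below applies the same thresholding to a small hand-written matrix. The similarity values are made up for illustration.

```python
import numpy as np

# Two SAT segments (rows) against four candidate segments (columns)
sim_matrix = np.array([
    [0.80, 0.10, 0.75, 0.20],
    [0.90, 0.30, 0.40, 0.10],
])
threshold = 0.74

# np.where returns one array of row indices and one of column indices
rows, cols = np.where(sim_matrix >= threshold)
print(rows.tolist(), cols.tolist())

# Column 0 matches both SAT rows, so it appears twice and must be deduplicated
unique_cols = sorted(set(cols.tolist()))
print(unique_cols)
```

Sorting the unique indices is optional. It only makes the output order reproducible, since iteration order over a set is not guaranteed.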
    
