
Project: Standardizing Names of Sports Schools using Sentence-BERT

Objective

The goal of this project is to standardize the names of sports schools using variations provided in different datasets. We will use Sentence-BERT (SBERT) to create embeddings of these names and match variations to the standard names based on their cosine similarity.
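
The matching rule can be illustrated without loading the model. The following is a minimal sketch that uses made-up 3-dimensional vectors in place of real SBERT embeddings; the vectors and the helper names `cosine_similarity` and `best_match` are illustrative assumptions, not part of the notebook.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def best_match(query, candidates):
    """Return the key of the candidate vector most similar to query."""
    return max(candidates, key=lambda name: cosine_similarity(query, candidates[name]))

# Toy vectors standing in for embeddings of standard names (made up for illustration)
standard = {
    "авангард": [1.0, 0.1, 0.0],
    "аврора": [0.0, 1.0, 0.2],
}
# Toy vector standing in for the embedding of a variation
variation = [0.9, 0.2, 0.1]

print(best_match(variation, standard))  # авангард
```

Cosine similarity ignores vector length and compares direction only, which is why it is a common choice for comparing sentence embeddings.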

Data

Школы.csv: Contains the standard names of sports schools.
Примерное написание.csv: Contains variations of the sports schools' names.

Steps

Data Preparation
Model Training
Model Testing
Evaluation and Conclusion

In [101]:
!pip install sentence_transformers  # Install the missing library

import random
import re

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer, util
from sklearn.model_selection import train_test_split

In [None]:

from google.colab import drive
drive.mount('/content/drive')


csv_file_path1 = '/content/drive/My Drive/Colab Notebooks/Школы.csv'
schools = pd.read_csv(csv_file_path1)

csv_file_path2 = '/content/drive/My Drive/Colab Notebooks/Примерное написание.csv'
variations = pd.read_csv(csv_file_path2)


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Display the first few rows of each dataset
print("Standard Schools Data:")
print(schools.head())

print("\nVariations Data:")
print(variations.head())

Standard Schools Data:
   school_id                  name                region
0          1              Авангард    Московская область
1          2              Авангард     Ямало-Ненецкий АО
2          3               Авиатор  Республика Татарстан
3          4                Аврора       Санкт-Петербург
4          5  Ice Dream / Айс Дрим       Санкт-Петербург

Variations Data:
   school_id                                               name
0       1836                                       ООО "Триумф"
1       1836                                Москва, СК "Триумф"
2        610                             СШОР "Надежда Губернии
3        610  Саратовская область, ГБУСО "СШОР "Надежда Губе...
4        609                                     "СШ "Гвоздика"


In [None]:
schools.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306 entries, 0 to 305
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   school_id  306 non-null    int64 
 1   name       306 non-null    object
 2   region     306 non-null    object
dtypes: int64(1), object(2)
memory usage: 7.3+ KB


In [None]:
variations.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 895 entries, 0 to 894
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   school_id  895 non-null    int64 
 1   name       895 non-null    object
dtypes: int64(1), object(1)
memory usage: 14.1+ KB


The target feature is in the schools table, which serves as the basis for training the model.

In [None]:
# Check for duplicates

schools.duplicated().sum()

0

In [None]:
variations.duplicated().sum()

0
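
Called without arguments, `duplicated()` only flags rows that are identical in every column. The first rows of `schools` show the same name, Авангард, under two different `school_id` values, so a name-only check can be informative as well. Below is a minimal sketch in plain Python; the hand-typed rows are copied from the `head()` output above, and the helper `ambiguous_names` is an illustrative assumption.

```python
from collections import defaultdict

def ambiguous_names(rows):
    """Map each name that belongs to more than one school_id to its sorted ids."""
    ids_by_name = defaultdict(set)
    for school_id, name, _region in rows:
        ids_by_name[name].add(school_id)
    return {name: sorted(ids) for name, ids in ids_by_name.items() if len(ids) > 1}

# Rows typed in by hand from the schools.head() output
rows = [
    (1, "Авангард", "Московская область"),
    (2, "Авангард", "Ямало-Ненецкий АО"),
    (3, "Авиатор", "Республика Татарстан"),
    (4, "Аврора", "Санкт-Петербург"),
]

print(ambiguous_names(rows))  # {'Авангард': [1, 2]}
```

On the real DataFrame, the analogous check would be `schools.duplicated(subset='name', keep=False)`. Names shared by several schools cannot be told apart from the name alone; the region column would be needed to disambiguate them.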

In [None]:
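# The data contains line breaks and uppercase characters, so clean it

def cleaning(text):
    """Keep only Cyrillic/Latin letters, collapse whitespace, and lowercase."""
    text = re.sub(r"[\n\r]", " ", text)                # line breaks -> spaces
    text = re.sub(r"[^А-Яа-яёЁa-zA-Z\s]+", " ", text)  # drop digits and punctuation
    text = re.sub(r"\s+", " ", text).strip()           # collapse runs of whitespace
    return text.lower()

# Series.apply works on every pandas version (DataFrame.applymap is deprecated in recent ones)
for col in ['name', 'region']:
    schools[col] = schools[col].apply(cleaning)
variations['name'] = variations['name'].apply(cleaning)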
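
A quick sanity check of the cleaning rules on a few raw strings can catch surprises before the whole table is processed. The sketch below repeats the same regular-expression steps in a standalone helper (the name `clean_name` is an assumption made so the example runs on its own); the first three samples come from the data shown above, the last one is hypothetical.

```python
import re

def clean_name(text):
    """Same steps as the cleaning cell: strip non-letters, collapse spaces, lowercase."""
    text = re.sub(r"[^А-Яа-яёЁa-zA-Z\s]+", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower()

samples = [
    'Москва, СК "Триумф"',
    "Ice Dream / Айс Дрим",
    "Ямало-Ненецкий АО",
    "СШОР №1\nим. Иванова",  # hypothetical name with a digit and a line break
]
for raw in samples:
    print(repr(raw), "->", repr(clean_name(raw)))
```

The last sample becomes `сшор им иванова`: digits are removed along with punctuation, so two schools that differ only by number would collapse to the same cleaned string.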


In [None]:
print("Standard Schools Data:")
print(schools.head())

print("\nVariations Data:")
print(variations.head())

Standard Schools Data:
   school_id                name                region
0          1            авангард    московская область
1          2            авангард     ямало ненецкий ао
2          3             авиатор  республика татарстан
3          4              аврора       санкт петербург
4          5  ice dream айс дрим       санкт петербург

Variations Data:
   school_id                                             name
0       1836                                       ооо триумф
1       1836                                 москва ск триумф
2        610                            сшор надежда губернии
3        610  саратовская область гбусо сшор надежда губернии
4        609                                      сш гвоздика


In [None]:
# Function to create typographical errors
def add_typo(word):
    """Swap two adjacent characters at a random position."""
    if len(word) < 2:
        return word
    idx = random.randint(0, len(word) - 2)
    return word[:idx] + word[idx + 1] + word[idx] + word[idx + 2:]

def augment_data(name, n_variants=3):
    """Return up to n_variants distinct typo versions of name."""
    return list({add_typo(name) for _ in range(n_variants)})

# Apply augmentation to data (the id column in schools is called school_id)
augmented_data = []
for _, row in schools.iterrows():
    for augmented_name in augment_data(row['name']):
        augmented_data.append([row['school_id'], augmented_name])

# Convert to DataFrame
augmented_df = pd.DataFrame(augmented_data, columns=['school_id', 'name'])

# Combine with original data and reset the index
augmented_schools = pd.concat([schools[['school_id', 'name']], augmented_df], ignore_index=True)

print(augmented_schools.head())
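
The adjacent-swap typo can be checked in isolation. The sketch below is a standalone variant that takes a seeded `random.Random` instance, so the generated variants are reproducible; the helper name `swap_adjacent` and the seed value are assumptions made for illustration.

```python
import random

def swap_adjacent(text, rng):
    """Swap one randomly chosen pair of neighbouring characters."""
    if len(text) < 2:
        return text
    idx = rng.randint(0, len(text) - 2)
    return text[:idx] + text[idx + 1] + text[idx] + text[idx + 2:]

rng = random.Random(42)  # seeded generator so the variants are reproducible
name = "авангард"
variants = {swap_adjacent(name, rng) for _ in range(3)}

for variant in sorted(variants):
    # Each variant keeps the same letters as the original, in a slightly different order
    assert sorted(variant) == sorted(name)
    print(variant)
```

Because the variants are collected in a set, fewer than three may come back when the same position is drawn twice. In the notebook, calling `random.seed(...)` before the augmentation loop would make the run reproducible in the same way.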

In [None]:
# Load a pretrained Sentence-BERT model
# (all-distilroberta-v1 is primarily an English model; a multilingual checkpoint may suit Russian names better)
model = SentenceTransformer('all-distilroberta-v1')

# Function to get embeddings
def get_embeddings(df):
    return model.encode(df['name'].tolist(), convert_to_tensor=True)

# Get embeddings for the reference names and for the variations
embeddings_schools = get_embeddings(augmented_schools)
embeddings_variations = get_embeddings(variations)

# Function to find best match
def find_best_match(embedding, all_embeddings):
    """Return the row index of the reference name with the highest cosine similarity."""
    cos_sim = util.cos_sim(embedding, all_embeddings)  # shape (1, n_reference_names)
    return int(cos_sim.argmax())

# Apply function to data
results = []
for i, variation in variations.iterrows():
    best_match_idx = find_best_match(embeddings_variations[i], embeddings_schools)
    best_match_id = augmented_schools.iloc[best_match_idx]['school_id']
    # Keep the true school_id next to the predicted one so they can be compared later
    results.append([variation['name'], variation['school_id'], best_match_id])

# Create DataFrame with results
results_df = pd.DataFrame(results, columns=['variation_name', 'true_school_id', 'matched_school_id'])

print(results_df.head())
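
Because the reference set holds several rows per school (the standard name plus its typo variants), the index of the best-scoring row has to be mapped back to a `school_id`. The sketch below shows that step on a hand-written similarity table; the scores, the ids, and the helper `match_school_ids` are made up for illustration.

```python
def match_school_ids(similarity_rows, reference_ids):
    """For each row of scores, return the school_id of the highest-scoring reference."""
    matched = []
    for scores in similarity_rows:
        best_idx = max(range(len(scores)), key=scores.__getitem__)
        matched.append(reference_ids[best_idx])
    return matched

# school_id of each reference row: ids 1 and 3 each appear with one typo variant
reference_ids = [1, 3, 1, 3]

# Made-up cosine similarities: one row per variation, one column per reference row
similarity_rows = [
    [0.62, 0.10, 0.91, 0.15],  # closest to a typo variant of school 1
    [0.20, 0.88, 0.18, 0.70],  # closest to the standard name of school 3
]

print(match_school_ids(similarity_rows, reference_ids))  # [1, 3]
```

Whether the winning row is a standard name or one of its typo variants does not matter; both map to the same `school_id`.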

In [None]:
# Hold out 20% of the labelled variations for testing
train_data, test_data = train_test_split(variations, test_size=0.2, random_state=42)

# Function to evaluate model
def evaluate_model(test_data, augmented_schools, embeddings_schools):
    """Share of test variations whose nearest reference name has the true school_id."""
    correct_matches = 0
    for _, row in test_data.iterrows():
        variation_embedding = model.encode(row['name'], convert_to_tensor=True)
        best_match_idx = int(find_best_match(variation_embedding, embeddings_schools))
        pred_school_id = augmented_schools.iloc[best_match_idx]['school_id']

        # The true label is the school_id stored with the variation, not an earlier prediction
        if pred_school_id == row['school_id']:
            correct_matches += 1

    return correct_matches / len(test_data)

# Evaluate model
accuracy = evaluate_model(test_data, augmented_schools, embeddings_schools)
print(f"Accuracy: {accuracy * 100:.2f}%")
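
Accuracy is only meaningful when predictions are compared with labels that did not come from the model itself; in this data, the label is the `school_id` column of `variations`. The sketch below computes the metric on made-up ids; the lists and the helper name `accuracy_score_simple` are illustrative assumptions.

```python
def accuracy_score_simple(true_ids, predicted_ids):
    """Fraction of positions where the predicted id equals the true id."""
    if len(true_ids) != len(predicted_ids):
        raise ValueError("true_ids and predicted_ids must have the same length")
    if not true_ids:
        raise ValueError("cannot compute accuracy of an empty sample")
    correct = sum(t == p for t, p in zip(true_ids, predicted_ids))
    return correct / len(true_ids)

# Made-up labels for five variations
true_ids = [1836, 1836, 610, 610, 609]
predicted_ids = [1836, 1836, 610, 5, 609]

print(f"Accuracy: {accuracy_score_simple(true_ids, predicted_ids) * 100:.2f}%")  # Accuracy: 80.00%
```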

In [None]:
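# Create Streamlit app (relies on cleaning, model, find_best_match,
# embeddings_schools, augmented_schools and schools defined above)
import streamlit as st

st.title("School Name Matching")

# Input school name variation
input_name = st.text_input("Enter school name variation:")

# Only run the matching once the user has typed something
if input_name:
    # Clean input text
    cleaned_name = cleaning(input_name)

    # Get embedding for input text
    input_embedding = model.encode(cleaned_name, convert_to_tensor=True)

    # Find best match, then look up the standard name (the best row may be a typo variant)
    best_match_idx = int(find_best_match(input_embedding, embeddings_schools))
    best_match_id = augmented_schools.iloc[best_match_idx]['school_id']
    best_match_name = schools.loc[schools['school_id'] == best_match_id, 'name'].iloc[0]

    # Display result
    st.write(f"Best match: {best_match_name}")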

Readme

In [None]:
# School Name Matching

## Project Goals
The goal of this project is to match different spellings of a sports school's name to its standard name. Typo-based augmentation and a pretrained Sentence-BERT model (`all-distilroberta-v1`) are used for the matching.

## Data
1. `Школы.csv` - Dataset with standard school names.
2. `Примерное написание.csv` - Dataset with variations of school names.

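## Data Preparation
1. Clean data by removing all characters except letters and spaces, then lowercasing.
2. Use augmentation with typographical errors (adjacent-character swaps) to add misspelled versions of each standard name to the reference set.

## Model Training
1. Use the pretrained `all-distilroberta-v1` Sentence-BERT model to obtain embeddings for school names; no fine-tuning is performed.
2. Match each variation to the reference name with the highest cosine similarity.

## Testing
The model is tested on a held-out 20% of the variations and evaluated by accuracy, the share of variations matched to the correct `school_id`.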

## Conclusions and Recommendations
Accuracy on the test data is printed by the evaluation step. If it is high enough for the use case, the model can be used for automatic standardization of school names.

## Usage Instructions
1. Run the Streamlit app:
   ```bash
   streamlit run app.py
   ```

In [None]:

### Step 8: Create requirements.txt

```txt
pandas
numpy
scikit-learn
sentence-transformers
streamlit
textaugment
```