# COGS 118A - Project Checkpoint

## Predicting Minimum Hamming Distance Between Word and Vocabulary List

# Names

- Peter Barnett
- William Lutz
- Ricardo Sedano

# Abstract 
Our goal is to predict the minimum number of character substitutions required to turn a given input string of alphabetical characters into *any* English dictionary word. That is, we predict the Hamming distance between the input string and its nearest English word of the same length. This problem can be solved exactly with a brute-force nearest-neighbor search or with a BK-tree, but both take considerable time when the prediction must be made many times. The purpose of this project is therefore to approximate this shortest distance more quickly. The words come from the open-source WordSet dictionary, the training inputs are randomly generated strings, and the training labels are computed with the brute-force search. The test data is created by the same process. We will train our model on this data and measure its performance by the average absolute error |predicted distance - true distance| across the test dataset.
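The data-generation and evaluation steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: `VOCAB` is a tiny hypothetical stand-in for the WordSet dictionary, all function names are chosen here for illustration, and each string is compared only against vocabulary words of the same length.

```python
import random
import string

# Hypothetical stand-in vocabulary; the project uses the WordSet dictionary.
VOCAB = ["cat", "car", "dog", "tree", "word"]


def min_hamming_distance(s, vocab):
    """Brute-force label: smallest Hamming distance from s to any same-length word.

    Returns None when the vocabulary has no word of length len(s).
    """
    distances = [
        sum(a != b for a, b in zip(s, w))
        for w in vocab
        if len(w) == len(s)
    ]
    return min(distances) if distances else None


def random_string(length, rng):
    """Draw a random lowercase alphabetical string of the given length."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def make_dataset(n, length, vocab, rng):
    """Generate n random inputs and label each one with the brute-force search."""
    inputs = [random_string(length, rng) for _ in range(n)]
    labels = [min_hamming_distance(s, vocab) for s in inputs]
    return inputs, labels


def mean_absolute_error(predicted, true):
    """Average of |predicted distance - true distance| over a dataset."""
    return sum(abs(p - t) for p, t in zip(predicted, true)) / len(true)


rng = random.Random(0)  # fixed seed so the generated data is reproducible
X_test, y_test = make_dataset(100, 3, VOCAB, rng)
```

Any model's predictions on `X_test` can then be scored with `mean_absolute_error(predictions, y_test)`.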

# Background

Hamming distance measures the difference between two strings of equal length: it is the number of positions at which their characters differ. As a simple example, the strings 'car' and 'cat' have a Hamming distance of 1 because they differ in only one character. Richard Hamming first published the concept in his 1950 paper 'Error detecting and error correcting codes' <a name = "hamming"></a>[<sup>[1]</sup>](#hammingnote).
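Written out formally (the notation here is an assumption chosen for illustration, not taken from the cited paper), for two strings $x$ and $y$ of equal length $n$:

$$
d_H(x, y) = \sum_{i=1}^{n} \mathbb{1}[x_i \neq y_i]
$$

For $x = \text{car}$ and $y = \text{cat}$, the first two positions match and the third differs, so $d_H(x, y) = 0 + 0 + 1 = 1$.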

Hamming distance has applications to difficult problems in science, for example, comparing genomic sequences <a name = "pinheiro"></a>[<sup>[2]</sup>](#pinheironote). Yet, while it has been widely used in DNA sequencing, error detection, and image processing, finding the minimum Hamming distance between an input and a large dataset is computationally expensive when repeated for many inputs. 

For this project, our team aims to compute the minimum Hamming distance from an input string of text to dictionary entries of the same length. With this capability, developers can implement efficient algorithms for spell check, autocorrection<a name = "lalwani"></a>[<sup>[3]</sup>](#lalwaninote), text prediction, and even word puzzle construction.
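Because only dictionary entries of the same length are candidates, one simple way to organize the exact search is to group the vocabulary by word length first. The sketch below is an assumption-laden illustration rather than the project's method: the function names and the small word list are hypothetical, and entries containing non-alphabetical characters are simply skipped.

```python
from collections import defaultdict


def build_length_index(words):
    """Group lowercase alphabetical words by their length."""
    index = defaultdict(list)
    for w in words:
        w = w.strip().lower()
        if w.isalpha():  # skip entries with apostrophes, digits, spaces, etc.
            index[len(w)].append(w)
    return index


def nearest_words(s, index):
    """Return (distance, words) for the same-length words closest to s.

    Returns None when no word of length len(s) is indexed.
    """
    candidates = index.get(len(s), [])
    if not candidates:
        return None
    scored = [(sum(a != b for a, b in zip(s, w)), w) for w in candidates]
    best = min(d for d, _ in scored)
    return best, sorted(w for d, w in scored if d == best)


# Hypothetical word list standing in for the real dictionary.
index = build_length_index(["cat", "car", "dog", "tree", "don't"])
suggestion = nearest_words("cab", index)
```

In a spell-check setting, the returned words are natural correction candidates, and the returned distance is the quantity the model is meant to approximate.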

 

# Problem Statement

Clearly describe the problem that you are solving. Avoid ambiguous words. The problem described should be well defined and should have at least one ML-relevant potential solution. Additionally, describe the problem thoroughly such that it is clear that the problem is quantifiable (the problem can be expressed in mathematical or logical terms), measurable (the problem can be measured by some metric and clearly observed), and replicable (the problem can be reproduced and occurs more than once).

# Data

UPDATED FROM PROPOSAL!

You should have obtained and cleaned (if necessary) data you will use for this project.

Please give the following information for each dataset you are using:
- link/reference to obtain it
- description of the size of the dataset (# of variables, # of observations)
- what an observation consists of
- what some critical variables are, how they are represented
- any special handling, transformations, cleaning, etc you have done should be demonstrated here!


# Proposed Solution

In this section, clearly describe a solution to the problem. The solution should be applicable to the project domain and appropriate for the dataset(s) or input(s) given. Provide enough detail (e.g., algorithmic description and/or theoretical properties) to convince us that your solution is applicable. Make sure to describe how the solution will be tested.  

If you know details already, describe how (e.g., library used, function calls) you plan to implement the solution in a way that is reproducible.

If it is appropriate to the problem statement, describe a benchmark model<a name="sota"></a>[<sup>[3]</sup>](#sotanote) against which your solution will be compared. 

# Evaluation Metrics

Propose at least one evaluation metric that can be used to quantify the performance of both the benchmark model and the solution model. The evaluation metric(s) you propose should be appropriate given the context of the data, the problem statement, and the intended solution. Describe how the evaluation metric(s) are derived and provide an example of their mathematical representations (if applicable). Complex evaluation metrics should be clearly defined and quantifiable (can be expressed in mathematical or logical terms).

In [None]:
# Preliminary results

#!pip install spotipy

import numpy as np 
from sklearn.model_selection import train_test_split
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
from sklearn.neighbors import KNeighborsClassifier

def get_features_from_playlist(playlist_name):
   # authenticate with Spotipy; credentials are read from the SPOTIPY_CLIENT_ID and
   # SPOTIPY_CLIENT_SECRET environment variables rather than hard-coded in the notebook
   client_credentials_manager = SpotifyClientCredentials()
   sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

   #search for the playlist by name/can change to be by specific spotify ID
   results = sp.search(q=playlist_name, type='playlist')

   #pull top playist ID from the search results
   playlist_id = results['playlists']['items'][0]['id']

   #get the tracks from that playlist
   playlist_tracks = sp.playlist_tracks(playlist_id)

   #extract the track IDs from that playlist (skip entries with no track or no ID)
   track_ids = [item['track']['id'] for item in playlist_tracks['items']
                if item['track'] and item['track']['id']]

   # retrieve the audio features for each track in the playlist, 50 IDs per request
   audio_features = []
   for i in range(0, len(track_ids), 50):
      audio_features.extend(sp.audio_features(track_ids[i:i+50]))

   #filter out irrelevant info from data (type, id, uri, track_href, analysis_url)
   excluded_keys = {'type', 'id', 'uri', 'track_href', 'analysis_url'}
   data = []
   for feature in audio_features:
      if feature is None:   # the API can return None for a track without audio features
         continue
      data.append([value for key, value in feature.items() if key not in excluded_keys])

   return data

# credentials are read from the SPOTIPY_CLIENT_ID and SPOTIPY_CLIENT_SECRET environment variables
client_credentials_manager = SpotifyClientCredentials()

sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

# Search for 2022 tracks; sp.search() has no popularity argument, so filter the results afterwards
results = sp.search(q='year:2022', type='track', market='US', limit=50)

# Keep only tracks with a low popularity score (0-30) and print their information
tracks = [track for track in results['tracks']['items'] if track['popularity'] <= 30]
for track in tracks:
    print(track['name'], 'by', track['artists'][0]['name'], '- Popularity:', track['popularity'])


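# NOTE: both calls currently load the same playlist, so the two classes are identical;
# unpopular_data must come from a source of low-popularity tracks for the labels to be meaningful
popular_data = np.array(get_features_from_playlist('Top Songs - Global'))
unpopular_data = np.array(get_features_from_playlist('Top Songs - Global'))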

# create labels for the data
popular_labels = np.full(popular_data.shape[0], 1)
unpopular_labels = np.full(unpopular_data.shape[0], 0)

# combine the data and labels
X = np.concatenate((popular_data, unpopular_data))
y = np.concatenate((popular_labels, unpopular_labels))

# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# create the K-nearest neighbor model with k=3
knn = KNeighborsClassifier(n_neighbors=3)

# train the model on the training data
knn.fit(X_train, y_train)

# predict the labels of the testing data
y_pred = knn.predict(X_test)

# print the accuracy of the model
print("Accuracy:", knn.score(X_test, y_test))



NEW SECTION!

Please show any preliminary results you have managed to obtain.

Examples would include:
- Analyzing the suitability of a dataset or algorithm for prediction/solving your problem 
- Performing feature selection or hand-designing features from the raw data. Describe the features available/created and/or show the code for selection/creation
- Showing the performance of a base model/hyper-parameter setting.  Solve the task with one "default" algorithm and characterize the performance level of that base model.
- Learning curves or validation curves for a particular model
- Tables/graphs showing the performance of different models/hyper-parameters



# Ethics & Privacy

There are few to no obvious ethics and privacy concerns that arise from our project. As seen in the data section of this checkpoint, a public playlist created by Spotify is being used. However, if a user inputs a playlist associated with their own account, then that user's data in terms of playlist features (username, playlist creation date, audio, Spotify URI) would be seen and handled. This concern could extend to a larger scale if users import other users' playlists, whose owners may not consent to sharing data for algorithmic purposes.


# Team Expectations 

Each project team member has agreed to and is expected to: 
* Attend all-team meetings
* Remain attentive; communicate quickly and effectively
* Do work assigned to them
* Be respectful of each other's work
* Stay aware of deadlines and collaborate in favor of completing a comprehensive project

# Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 2/22  |  6 PM |  Brainstorm topics (all); exchange contact information | Discuss project topic, ideal dataset(s), and ethics; edit, finalize, and submit project proposal | 
| 3/6  |  7 PM |  Import & Wrangle Data | Discuss Wrangling and possible analytical approaches; Finalize wrangling/EDA; Begin programming for project | 
| 3/20  | 7 PM  | Continue programming for project | Discuss/edit project code; Draft results/conclusion/discussion   |
| 3/22  | Before 11:59 PM  | Discuss/edit full project; Complete project | Turn in Final Project  |

# Footnotes
<a name="hammingnote"></a>1.[^](#hamming): Hamming, R. W. (1950) Error detecting and error correcting codes. *Bell System Technical Journal, 29(2)*. https://ieeexplore.ieee.org/document/6772729<br> 
 

<a name="pinheironote"></a>2.[^](#pinheiro): Pinheiro, H.P.  Analysis of Variance for Hamming Distances Applied to Unbalanced Designs. (https://ime.unicamp.br/sites/default/files/pesquisa/relatorios/rp-2001-30.pdf).<br>

<a name="lalwaninote"></a>3.[^](#lalwani): Lalwani, M. (2014) Efficient Algorithm for Auto Correction Using n-gram Indexing *Nirma University, Ahmedabad, India* (https://www.researchgate.net/profile/Mahesh-Lalwani/publication/266886742_Efficient_Algorithm_for_Auto_Correction_Using_n-gram_Indexing/links/547ee25e0cf2d2200edeaf9f/Efficient-Algorithm-for-Auto-Correction-Using-n-gram-Indexing.pdf).<br>
