# Extension 1 - Namesake

In this extension, I integrated the current model into Namesake and compared its performance with the way Namesake currently checks for similarity.

Note: Namesake relies on an old version of Python2Vec and WordNet, so the current model should be able to outperform its results.


In [1]:
# Installing the requirements if not already installed

%pip install --upgrade pip
%pip install -r namesake/requirements.txt


Note: you may need to restart the kernel to use updated packages.
Collecting atomicwrites==1.4.1
  Using cached atomicwrites-1.4.1-py2.py3-none-any.whl
Collecting attrs==21.4.0
  Using cached attrs-21.4.0-py2.py3-none-any.whl (60 kB)
Collecting colorama==0.4.5
  Using cached colorama-0.4.5-py2.py3-none-any.whl (16 kB)
Collecting humanize==4.2.3
  Using cached humanize-4.2.3-py3-none-any.whl (102 kB)
Collecting importlib-metadata==4.12.0
  Using cached importlib_metadata-4.12.0-py3-none-any.whl (21 kB)
Collecting joblib==1.1.0
  Using cached joblib-1.1.0-py2.py3-none-any.whl (306 kB)
Collecting nltk==3.7
  Using cached nltk-3.7-py3-none-any.whl (1.5 MB)
Collecting numpy==1.21.6
  Using cached numpy-1.21.6-cp310-cp310-macosx_11_0_arm64.whl (12.4 MB)
Collecting packaging==21.3
  Using cached packaging-21.3-py3-none-any.whl (40 kB)
Collecting pandas==1.3.5
  Using cached pandas-1.3.5-cp310-cp310-macosx_11_0_arm64.whl (10.3 MB)
Collecting pytest==7.1.2
  Using cached pytest-7.1.2-py3-none-a

In [3]:
# Upgrading setuptools and wheel before running Namesake on the file test1.py
%pip install --upgrade setuptools wheel

Note: you may need to restart the kernel to use updated packages.


In [1]:
# Crawling through the Flask repository with the help of python_crawler.py
import python_crawler

path = "Python Repositories"

# Creating an object of the PythonCrawler class
crawler = python_crawler.PythonCrawler(path)

crawler.aggregate_py_files_flask()

In [3]:
import ssl

# Disabling HTTPS certificate verification so that the NLTK WordNet download does not fail on SSL errors
ssl._create_default_https_context = ssl._create_unverified_context

# Running Namesake on the file test1.py
!python3 Namesake-main/namesake.py Namesake-main/test1.py

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/chandrachudgowda/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


orthographic similarity:
	[i] on line 6 and [j] on line 7 are 0.46 similar!
	[E] on line 10 and [F] on line 11 are 0.86 similar!

phonological similarity:
	[write] on line 14 and [right] on line 15 are 1.00 similar!
	[file_E] on line 34 and [file_F] on line 35 are 0.86 similar!

semantic similarity:
	[right] on line 15 and [left] on line 16 are 0.91 similar!
	[left] on line 16 and [result] on line 20 are 1.00 similar!
	[total] on line 18 and [number] on line 24 are 1.00 similar!
	[number] on line 24 and [count] on line 25 are 1.00 similar!
	[get_count] on line 32 and [get_number] on line 33 are 1.00 similar!
	[file_E] on line 34 and [file_F] on line 35 are 0.94 similar!
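
To get a feel for what an orthographic (spelling-based) score measures, here is a minimal sketch using `difflib.SequenceMatcher` from the standard library. This is a stand-in chosen for illustration only: it is not Namesake's own metric, and its scores will not match the values reported above. The helper name `orthographic_ratio` is hypothetical.

```python
from difflib import SequenceMatcher


def orthographic_ratio(a: str, b: str) -> float:
    """Return a 0..1 spelling-similarity score.

    Assumption: difflib's ratio is used as a stand-in; Namesake's
    actual orthographic metric is not shown in this notebook.
    """
    return SequenceMatcher(None, a, b).ratio()


# Identifier pairs taken from the orthographic/phonological output above
pairs = [("i", "j"), ("E", "F"), ("file_E", "file_F")]
for first, second in pairs:
    print(f"[{first}] vs [{second}]: {orthographic_ratio(first, second):.2f}")
```

A character-matching score like this only rewards shared substrings, so single-letter names that differ get no credit at all, whereas Namesake evidently scores such pairs with a different method.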



In [4]:
# Using the python2vec model we built and comparing the results

# Importing the required libraries
import pandas as pd
import numpy as np
import gensim
import os
import string
from nltk.tokenize import RegexpTokenizer
from nltk import sent_tokenize
from gensim.utils import simple_preprocess

code = []

# Note: sent_tokenize splits on sentence boundaries, not on newlines,
# so each entry in `code` is a sentence-like chunk rather than a physical line
with open('Namesake-main/test1.py', 'r') as f:
    corpus = f.read()
    raw_sent = sent_tokenize(corpus)
    for sent in raw_sent:
        code.append(simple_preprocess(sent))

# Printing the number of chunks (reported as "lines of code") and the number of tokens (words) in the file
print(f'Number of lines of code: {len(code)}')
print(f'Number of tokens (words): {sum(len(sent) for sent in code)}')

Number of lines of code: 2
Number of tokens (words): 46
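
The count of 2 above reflects how `sent_tokenize` splits text (on sentence punctuation), not the number of physical lines in `test1.py`. One possible alternative, sketched here with only the standard library, is to split on newlines and extract identifiers with a regular expression. The function name `tokenize_by_line` and the sample source string are hypothetical; unlike `simple_preprocess`, whose default `min_len=2` discards one-character tokens, this sketch keeps short names such as `i` and `j`.

```python
import re

# Matches Python-style identifiers and keywords; numbers and punctuation are skipped
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def tokenize_by_line(source: str) -> list:
    """Return one list of lowercase tokens per non-empty source line."""
    tokenized = []
    for line in source.splitlines():
        tokens = [tok.lower() for tok in IDENTIFIER.findall(line)]
        if tokens:  # skip blank lines and lines without identifiers
            tokenized.append(tokens)
    return tokenized


# Hypothetical sample standing in for the contents of a Python file
sample = "total = 0\ncount = get_count(file_E)\n\nresult = total + count\n"
lines = tokenize_by_line(sample)

print(f"Number of lines of code: {len(lines)}")
print(f"Number of tokens (words): {sum(len(line) for line in lines)}")
```

Feeding line-level token lists to Word2Vec would give it more, shorter contexts than two long chunks, which may matter for a `window` of 10.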


In [5]:
# Creating the Gensim Word2Vec model; the vocabulary is built and the model is trained on the tokenized code in the next cells
model = gensim.models.Word2Vec(
    window=10,
    min_count=2
)
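
With `min_count=2`, Word2Vec ignores every token whose total frequency in the corpus is below 2, which matters a lot for a file with only 46 tokens. The sketch below imitates that filter with `collections.Counter` so the effect can be checked without training anything. The helper `surviving_vocab` and the toy corpus are hypothetical and only approximate the vocabulary step.

```python
from collections import Counter


def surviving_vocab(sentences, min_count=2):
    """Return {token: frequency} for tokens seen at least `min_count` times."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return {tok: n for tok, n in counts.items() if n >= min_count}


# Toy corpus (illustrative only, not the contents of test1.py)
toy = [
    ["left", "right", "result"],
    ["left", "right", "total"],
    ["count"],
]
print(surviving_vocab(toy))
```

Running the same helper on `code` would show which identifiers from `test1.py` can actually be queried through `model.wv`.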

In [6]:
model.build_vocab(code)

In [7]:
model.train(code, total_examples=model.corpus_count, epochs=model.epochs)

(17, 230)
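
The tuple returned by `model.train` is (effective words trained, raw words processed). The arithmetic below checks the second number against the token count printed earlier, assuming the model ran with gensim's default of 5 epochs (the notebook does not set `epochs` explicitly).

```python
tokens_in_corpus = 46  # token count printed earlier in the notebook
epochs = 5             # assumption: gensim's default number of epochs
effective_words = 17   # first element of the tuple returned by model.train

# Raw words processed = tokens per pass times number of passes
raw_words = tokens_in_corpus * epochs
print(f"Raw words processed: {raw_words}")

# Share of raw words that actually contributed to training
share = effective_words / raw_words
print(f"Share of raw words used for training: {share:.1%}")
```

The small effective count is consistent with rare tokens being dropped by `min_count` and frequent ones being downsampled, so the similarities further down rest on very little training signal.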

In [8]:
# Deleting the old model if it exists
if os.path.exists('python2vec_extension_2.model'):
    os.remove('python2vec_extension_2.model')

# Saving the model as python2vec_extension_2.model

model.save('python2vec_extension_2.model')

In [17]:
# Comparing the word pairs that Namesake flagged above by computing their similarity with the Python2Vec model

print('Similarity between write and right with Python2Vec model: ', model.wv.similarity('write', 'right'))

print()

print('Similarity between right and left with Python2Vec model: ', model.wv.similarity('right', 'left'))

print()

print('Similarity between left and result with Python2Vec model: ', model.wv.similarity('left', 'result'))

print()


Similarity between write and right with Python2Vec model:  0.015004866

Similarity between right and left with Python2Vec model:  0.044843625

Similarity between left and result with Python2Vec model:  0.060677346
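
For reference, `model.wv.similarity` reports the cosine similarity between the two word vectors. The sketch below spells out that computation in plain Python on small hand-made vectors; the function name `cosine_similarity` and the example vectors are illustrative and are not taken from the trained model.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hand-made vectors (illustrative only)
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Scores near 0, like the three above, mean the vectors are close to orthogonal, which is what one would expect from vectors that have barely moved from their random initialization.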

