# Quantum Vocabulary List and AutoCorrect Feature
## 622 published articles in New Atlas (https://newatlas.com)

In [1]:
# importing libraries
import pandas as pd

import re
import textdistance
import Levenshtein as lev

from collections import Counter
import string

import spacy
nlp = spacy.load('en_core_web_lg')

import nltk
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.cluster.util import cosine_distance

import networkx as nx
stop_words = list(stopwords.words('english'))

from string import punctuation
from heapq import nlargest

import wordcloud
from wordcloud import WordCloud

import matplotlib.pyplot as plt

**Project Scope: The dataset now includes 622 unique quantum articles.**
<br>The articles were published between *2003-09-15* and *2022-12-01* on https://newatlas.com/.

## Creating a Quantum Word List
A list of the words that appear in a specific subject (here, quantum) is useful for many purposes, such as reading, applying and comprehending *specialized text*, *creating a specialized vocabulary*, and *designing an autocorrect feature*.

In [2]:
df = pd.read_csv("Quantum_Articles_NewAtlas.csv")
df.head(2)

| Unnamed: 0 | Title | Author | Article | Published On | Link |
|---|---|---|---|---|---|
| 0 | IBM brings quantum computing to the masses | ['Jon Simon Feature Photo Service For Ibm', 'C... | For the first time, IBM Research has thrown op... | 2016-05-06 07:11:29.000 | https://newatlas.com/quantum-processor-qubits-... |
| 1 | Diamond-based quantum computer paired with sup... | ['Pawsey Supercomputing Research Centre', 'Mic... | Quantum computing may have just taken a major ... | 2022-06-03 06:39:18.610 | https://newatlas.com/computers/quantum-compute... |


### Quantum Vocabulary

In [3]:
def common_function(text):
    text = text.lower()

    # Removing contraction suffixes written with a typographic apostrophe (’s, n’t, ’d, ’ve)
    text = text.replace("’s", "")
    text = text.replace("n’t" , " ")
    text = text.replace("’d", "")
    text = text.replace("’ve", "")    
    #text = text.translate(str.maketrans('', '', string.punctuation))
    #text = re.sub(" \d+", " ", text)
    
    # Removing numbers
    #text = re.sub(r'[0-9]+', '', text)
    
    # Removing all characters except letters and spaces
    text=re.sub("[^A-Z a-z]", ' ', text)
    
    #Removing multiple spaces
    text = re.sub(' +', ' ', text)

    
    # Tokenizing
    Token_L= text.split()
    Token_L = [token.replace(" ","") for token in Token_L]
    
    # Removing stopwords
    Token_L = [token for token in Token_L if not token in stop_words]
    
    # Removing very short tokens (1-2 characters), e.g. formula symbols such as c, v, q
    Token_L = [token for token in Token_L if len(token)>2]
    
    # Lemmatization
    Token_L = [token.lemma_ for token in nlp(" ".join(Token_L))]    
    
    return(Token_L)
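
To make the cleaning steps of `common_function` easier to follow, here is a minimal sketch that mirrors them on a single sentence without the spaCy lemmatization step. The name `clean_tokens`, the sample sentence, and the tiny `DEMO_STOP_WORDS` set are assumptions made for illustration; the notebook itself uses NLTK's full English stop-word list.

```python
import re

# Tiny stand-in for NLTK's English stop-word list (assumption, illustration only)
DEMO_STOP_WORDS = {"the", "but", "is", "it"}

def clean_tokens(text, stop_words=DEMO_STOP_WORDS):
    """Mirror the cleaning steps of common_function, minus lemmatization."""
    text = text.lower()
    # Drop contraction suffixes written with a typographic apostrophe
    for suffix in ("’s", "’d", "’ve"):
        text = text.replace(suffix, "")
    text = text.replace("n’t", " ")
    # Keep only lowercase letters and spaces, then collapse repeated spaces
    text = re.sub("[^a-z ]", " ", text)
    text = re.sub(" +", " ", text)
    # Drop stop words and tokens of one or two characters
    return [t for t in text.split() if t not in stop_words and len(t) > 2]

sample = "IBM’s 5-qubit processor isn’t the first, but it’s cloud-based!"
print(clean_tokens(sample))
# ['ibm', 'qubit', 'processor', 'first', 'cloud', 'based']
```

Note that digits and hyphens become spaces, so "5-qubit" contributes only "qubit" and "cloud-based" is split into two tokens.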

*To prevent the following error, the text is split into 5 parts, which are processed separately and then merged into `word_L`.*
<br>ValueError: [E088] Text of length 2624523 exceeds maximum of 1000000. The parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. 

In [4]:
text1 =" ".join(df["Article"][:200]) 
word_L1 = common_function(text1)

In [5]:
text2 =" ".join(df["Article"][200:400]) 
word_L2 = common_function(text2)

In [6]:
text3 =" ".join(df["Article"][400:500]) 
word_L3 = common_function(text3)

In [7]:
text4 =" ".join(df["Article"][500:600]) 
word_L4 = common_function(text4)

In [8]:
text5 =" ".join(df["Article"][600:]) 
word_L5 = common_function(text5)

In [9]:
word_L = word_L1 + word_L2 + word_L3 + word_L4 + word_L5
len(word_L)

341709
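
The manual slicing above could also be done programmatically. The following is a sketch, not the notebook's method: a hypothetical helper `chunk_texts` groups whole articles into space-joined chunks that stay under a character limit (here spaCy's default of 1,000,000 reported in the error message).

```python
def chunk_texts(texts, max_chars=1_000_000):
    """Group texts into space-joined chunks of at most max_chars characters.

    A single text longer than max_chars still becomes its own (oversized) chunk.
    """
    chunks, current, current_len = [], [], 0
    for text in texts:
        # +1 accounts for the joining space when the chunk already has content
        added = len(text) + (1 if current else 0)
        if current and current_len + added > max_chars:
            chunks.append(" ".join(current))
            current, current_len = [text], len(text)
        else:
            current.append(text)
            current_len += added
    if current:
        chunks.append(" ".join(current))
    return chunks

demo_chunks = chunk_texts(["aaaa", "bbbb", "cc"], max_chars=9)
print(demo_chunks)  # ['aaaa bbbb', 'cc']
```

Under these assumptions, the word list could be built as `word_L = [t for chunk in chunk_texts(df["Article"]) for t in common_function(chunk)]`. Because `common_function` only shortens the text before passing it to spaCy, limiting the raw chunk length is a conservative bound. The resulting counts may differ slightly from the manual split, since lemmatization sees different chunk boundaries.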

### Quantum Vocabulary set

In [10]:
Q_vocab = set(word_L)
len(Q_vocab)

20508

In [11]:
print("Ignoring stopwords, there are roughly ({}) words in the text, including ({}) unique words.".format(len(word_L), len(Q_vocab)))

Ignoring stopwords, there are roughly (341709) words in the text, including (20508) unique words.


### AutoCorrect Feature

An autocorrect feature depends on word probabilities, so a `collections.Counter` of word frequencies (`word_frequency`) is built first.
<br>The autocorrector then offers the known words nearest to what the user typed, in real time.
<br>Articles on this kind of feature usually rely on the **Jaccard Similarity** method; here that approach is compared against another one, the **Levenshtein Distance** method.
<br>**Conclusion**: The time cost of both methods is acceptable, but the **Levenshtein Distance** method performs much better. Why?
<br>*In an autocorrect feature, what matters is how few edits separate the typed word from a suggestion, not how many characters the two share.*

In [12]:
# Counter of word frequencies
word_frequency = Counter(word_L)

print("Top 10 most common words:")
word_frequency.most_common()[0:10]

Top 10 most common words:


[('quantum', 3051),
 ('one', 2403),
 ('use', 2234),
 ('new', 1905),
 ('auction', 1809),
 ('time', 1687),
 ('make', 1653),
 ('first', 1555),
 ('year', 1358),
 ('could', 1293)]

In [13]:
word_frequency['Computing']

0
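
The count of 0 above is expected: every token was lowercased during preprocessing, and a `Counter` returns 0 for a missing key instead of raising `KeyError`. A small illustration with made-up tokens (the `demo` counter is not the notebook's data):

```python
from collections import Counter

# Toy counter; the tokens are invented for illustration
demo = Counter(["quantum", "computing", "quantum"])

print(demo["Computing"])          # 0, the capitalized key was never counted
print(demo["Computing".lower()])  # 1, lowercasing the query matches the stored token
```

This is why both autocorrect functions below lowercase their input before looking it up.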

In [14]:
# Probability of each word (its frequency divided by the total number of words)
probability = {}
total_words = sum(word_frequency.values())   # computed once, outside the loop

for w in word_frequency.keys():
    probability[w] = word_frequency[w]/total_words

In [15]:
# Probabilities of the 10 most frequent words in the articles
sort_probability = sorted(probability.items(), key=lambda x: x[1], reverse=True)
for i in sort_probability[:10]:
    print(i[0], round(i[1],3))

quantum 0.009
one 0.007
use 0.007
new 0.006
auction 0.005
time 0.005
make 0.005
first 0.005
year 0.004
could 0.004
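
As a quick sanity check, these probabilities can be reproduced by hand from the counts and the total of 341709 tokens reported earlier. The snippet below only re-does that arithmetic for three of the words; the variable names are illustrative.

```python
total_tokens = 341709  # len(word_L) reported above
top_counts = {"quantum": 3051, "one": 2403, "could": 1293}  # counts from the top-10 list

for word, count in top_counts.items():
    print(word, round(count / total_tokens, 3))
# quantum 0.009
# one 0.007
# could 0.004
```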


#### Approach 1: AutoCorrect using Jaccard Similarity

In [16]:
def autocorrector_J(word):
    # Returns the word itself if it is in the vocabulary, otherwise the 3 most similar known words
    word = word.lower()
    if word in Q_vocab:
        return(word)
    else:
        # Jaccard similarity on single characters (qval=1) against every known word
        jaccard = textdistance.Jaccard(qval=1)
        sim = [1-(jaccard.distance(w,word)) for w in word_frequency.keys()]
        df0 = pd.DataFrame({'Offered Words': list(word_frequency.keys()), 'Similarity': sim})
        output = df0.sort_values('Similarity', ascending=False).head(3)
        return(output)

#### Approach 2: AutoCorrect using Levenshtein Distance

In [17]:
def autocorrector_L(word):
    # Returns the word itself if it is in the vocabulary, otherwise the 3 closest known words
    word = word.lower()
    if word in Q_vocab:
        return(word)
    else:
        # Levenshtein (edit) distance to every known word; smaller means closer
        dist = [lev.distance(word,w) for w in word_frequency.keys()]
        df0 = pd.DataFrame({'Offered Words': list(word_frequency.keys()), 'Distance': dist})
        output = df0.sort_values('Distance').head(3)
        return(output)
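
Several candidates can share the same edit distance, and neither function above uses the word frequencies when ranking. One possible extension, sketched here as an assumption rather than as part of the notebook, is to break ties by frequency. The helper names `edit_distance` and `suggest` are hypothetical, and the distance is a plain dynamic-programming implementation so the example runs without the `Levenshtein` package.

```python
from collections import Counter

def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def suggest(word, frequencies, k=3):
    """Rank candidates by edit distance, then by descending frequency."""
    word = word.lower()
    ranked = sorted(frequencies,
                    key=lambda w: (edit_distance(word, w), -frequencies[w]))
    return ranked[:k]

# Toy counts: only the 'quantum' count (3051) comes from the notebook, the rest are made up
toy_counts = Counter({"quantop": 1, "amount": 40, "quantu": 1, "quantum": 3051})
print(suggest("quantom", toy_counts))  # ['quantum', 'quantop', 'quantu']
```

Here "quantum" and "quantop" are both one edit away from "quantom", and the far more frequent "quantum" is ranked first.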

#### Which method is better? *Levenshtein Distance* vs. *Jaccard Similarity*
To evaluate performance, three misspelled words are tested:
<br>*quantom* (a misspelling of *quantum*)
<br>*speentronic* (a misspelling of *spintronic*)
<br>*relitivity* (a misspelling of *relativity*)

In [18]:
for w in ["quantom", "speentronic", "relitivity"]:
    print("\nSugestion for ({}) Jaccard method:".format(w))
    print(autocorrector_J(w),"\n")
    print("Sugestion for ({}) LevenShtein method:".format(w))
    print(autocorrector_L(w),"\n")


Sugestion for (quantom) Jaccard method:
     Offered Words  Similarity
217         amount    0.857143
9          quantum    0.750000
5509       quantop    0.750000 

Sugestion for (quantom) LevenShtein method:
      Offered Words  Distance
9           quantum         1
5509        quantop         1
15780        quantu         2 


Sugestion for (speentronic) Jaccard method:
      Offered Words  Similarity
7219    electronsin    0.833333
19856     inspector    0.818182
6640      reception    0.818182 

Sugestion for (speentronic) LevenShtein method:
      Offered Words  Distance
2009     spintronic         2
624      electronic         4
12510     eccentric         5 


Sugestion for (relitivity) Jaccard method:
      Offered Words  Similarity
1457     relativity    0.818182
13768   intuitively    0.750000
4238    resistivity    0.750000 

Sugestion for (relitivity) LevenShtein method:
      Offered Words  Distance
1457     relativity         1
4238    resistivity         2
13158     r

**Result**: The *Levenshtein method* worked much better: it ranked the intended word first for all three misspellings.
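
The Jaccard scores above can be reproduced with a small sketch, assuming that `textdistance.Jaccard(qval=1)` compares the two words as bags (multisets) of characters, which is consistent with the printed values. The function name `bag_jaccard` is hypothetical. Because character order is ignored, a word that merely shares most letters with the typo can outrank the intended word.

```python
from collections import Counter

def bag_jaccard(a, b):
    """Jaccard similarity over character multisets (bags)."""
    ca, cb = Counter(a), Counter(b)
    intersection = sum((ca & cb).values())  # per-character minimum counts
    union = sum((ca | cb).values())         # per-character maximum counts
    return intersection / union if union else 1.0

print(round(bag_jaccard("quantom", "amount"), 6))           # 0.857143
print(round(bag_jaccard("quantom", "quantum"), 6))          # 0.75
print(round(bag_jaccard("speentronic", "electronsin"), 6))  # 0.833333
```

"amount" contains six of the seven letters of "quantom", so it scores higher than "quantum" even though its letters are in a completely different order. Edit distance penalizes that reordering, which is why it suits spelling correction better.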