# CSCI4022 Homework 2; Review

## Due Monday, February 8 at 11:59 pm to Canvas

#### Submit this file as a .ipynb with *all cells compiled and run* to the associated dropbox.

***

Your solutions to computational questions should include any specified Python code and results as well as written commentary on your conclusions.  Remember that you are encouraged to discuss the problems with your classmates, but **you must write all code and solutions on your own**.

**NOTES**: 

- Any relevant data sets should be available on Canvas. To make life easier on the graders if they need to run your code, do not change the relative path names here. Instead, move the files around on your computer.
- If you're not familiar with typesetting math directly into Markdown then by all means, do your work on paper first and then typeset it later.  Here is a [reference guide](https://math.meta.stackexchange.com/questions/5020/mathjax-basic-tutorial-and-quick-reference) linked on Canvas on writing math in Markdown. **All** of your written commentary, justifications and mathematical work should be in Markdown.  I also recommend the [wikibook](https://en.wikibooks.org/wiki/LaTeX) for LaTeX.
- Because you can technically evaluate notebook cells in a non-linear order, it's a good idea to do **Kernel $\rightarrow$ Restart & Run All** as a check before submitting your solutions.  That way if we need to run your code you will know that it will work as expected. 
- It is **bad form** to make your reader interpret numerical output from your code.  If a question asks you to compute some value from the data you should show your code output **AND** write a summary of the results in Markdown directly below your code. 
- 45 points of this assignment are in problems.  The remaining 5 are for neatness, style, and overall exposition of both code and text.
- This probably goes without saying, but... For any question that asks you to calculate something, you **must show all work and justify your answers to receive credit**. Sparse or nonexistent work will receive sparse or nonexistent credit. 
- There is *not a prescribed API* for these problems.  You may answer coding questions with whatever syntax or object typing you deem fit.  Your evaluation will primarily live in the clarity of how well you present your final results, so don't skip over any interpretations!  Your code should still be commented and readable to ensure you followed the given course algorithm.

---

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import random
import seaborn as sns

***
<a/ id='p1'></a>
[Back to top](#top)
# Problem 1 (Theory: minhashing; 10 pts)

Consider minhash values for a single column vector that contains 10 components/rows. Seven of the rows hold 0 and three hold 1. Consider taking all 10! = 3,628,800 possible distinct permutations of the ten rows. When we choose a permutation of the rows and produce a minhash value for the column, we will use the number of the row, in the permuted order, that is the first with a 1.  Use Markdown cells to demonstrate answers to the following.

#### a) For exactly how many of the 3,628,800 permutations is the minhash value for the column a 9?  What proportion is this?

Each permutation only shuffles the order of the rows, so every permuted column still contains three 1s. The minhash value is the position of the first 1. For that value to be 9, the first 1 would have to sit in position 9, which leaves only position 10 for the other two 1s. A pattern such as the one below has room for only two 1s, and we cannot get rid of a 1, so no permutation minhashes to 9.
- 000 000 0011

Hence the count is 0 and the proportion is 0/3,628,800 = 0 of permutations.

#### b) For exactly how many of the 3,628,800 permutations is the minhash value for the column an 8?

A minhash value of 8 requires the pattern [ 0000000 111 ]: the seven 0-rows fill positions 1 through 7 and the three 1-rows fill positions 8 through 10, so the first 1 is in position 8. Because a permutation distinguishes the individual rows, order matters within each group: the three 1-rows can be placed in the last three positions in $3! = 6$ ways, and the seven 0-rows can be placed in the first seven positions in $7! = 5040$ ways.
- $3! \times 7! = 6 \times 5040 = 30240$ permutations

Hence the proportion is $30240/3628800 = 1/120 \approx 0.83\%$ of permutations map to 8.

#### c) For exactly how many of the 3,628,800 permutations is the minhash value for the column a 3?

We can solve this like part b. The first 1 must be in position 3, so the permuted column must start with 001: two 0-rows, then a 1-row, followed by the remaining seven rows (two 1s and five 0s) in any order. There are $P(7,2) = 42$ ways to choose and order the two leading 0-rows, $P(3,1) = 3$ ways to choose the 1-row in position 3, and $7! = 5040$ ways to order the remaining seven rows. Hence the number of permutations that map to 3 is given below:
- $P(7,2) \times P(3,1) \times 7! = 42 \times 3 \times 5040 = 635040$ permutations

Hence the proportion is $635040/3628800 = 7/40 = 17.5\%$ of permutations map to 3.
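As a cross-check on the counting in parts a) through c), here is a minimal sketch that counts permutations by formula and compares against brute-force enumeration on a smaller column. The function names `count_minhash` and `brute_force` are illustrative, not part of the assignment, and the formula assumes the rows are distinguishable, so the 1-rows and 0-rows can each be ordered independently.

```python
from itertools import permutations
from math import comb, factorial

def count_minhash(n_rows, n_ones, value):
    """Number of row permutations whose first 1 lands in position `value` (1-indexed)."""
    n_zeros = n_rows - n_ones
    if value < 1 or value > n_zeros + 1:
        return 0  # not enough positions after `value` to hold the other 1s
    # Choose positions for the remaining 1s after `value`, then order the
    # distinct 1-rows and the distinct 0-rows independently.
    placements = comb(n_rows - value, n_ones - 1)
    return placements * factorial(n_ones) * factorial(n_zeros)

def brute_force(n_rows, n_ones, value):
    """Enumerate every permutation of a small column and count directly."""
    column = [1] * n_ones + [0] * (n_rows - n_ones)
    count = 0
    for perm in permutations(range(n_rows)):
        permuted = [column[r] for r in perm]
        if permuted.index(1) + 1 == value:
            count += 1
    return count

total = factorial(10)
for v in (9, 8, 3):
    c = count_minhash(10, 3, v)
    print(v, c, c / total)

# Sanity check of the formula on a column small enough to enumerate (6! = 720).
for v in range(1, 7):
    assert brute_force(6, 2, v) == count_minhash(6, 2, v)
```

The brute-force check uses 6 rows rather than 10 only to keep the run time short; the same function would work on the full column, just more slowly.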

***
<a/ id='p3'></a>
[Back to top](#top)
# Problem 2 (Applied Minhashing; 35 pts)

In this problem we compare similarities of 5 documents available on http://www.gutenberg.org

 1) The first approximately 10000 characters of Miguel de Unamuno's *Niebla*, written in Spanish, in the file `niebla.txt`
 
 2) The first approximately 10000 characters of Miguel de Cervantes's *The Ingenious Gentleman Don Quixote of La Mancha*, written in Spanish, in the file `DQ.txt`
 
 3) The first approximately 10000 characters of Homer's *The Odyssey*, translated into English by Samuel Butler, in the file `odyssey.txt`
 
 4) The first approximately 10000 characters of Kate Chopin's *The Awakening* in the file `awaken.txt`
 
 5) The entirety of around 12000 characters of Kate Chopin's *Beyond the Bayou* in the file `BB.txt`
 
### a) Clean the 5 documents, scrubbing all punctuation, changing all letters to lower case, and removing accent marks as appropriate.  

You should have only 27 unique characters in each book/section after cleaning, corresponding to white spaces and the 26 letters.  


**For this problem, you may import any text-based packages you desire to help wrangle the data.**  I recommend looking at some functions within `string` or the RegEx `re` packages.

You can and probably should use string methods such as `str.lower`, `str.replace`, etc.

All 5 documents have been saved in UTF-8 encoding.




In [2]:
import re
import unicodedata
def cleanDoc(fileName):
    """Return the text of fileName lowercased, de-accented, and reduced to a-z words separated by spaces."""
    clean_Doc = ""
    with open(fileName, 'r', encoding='utf-8') as reader:
        # Process the file line by line
        for line in reader:
            line = line.lower()
            # Delete these marks outright (rather than splitting on them) so "don't" becomes "dont"
            for mark in "',().":
                line = line.replace(mark, '')
            # Decompose accented characters, then drop the non-ASCII accent marks
            line = unicodedata.normalize('NFD', line).encode('ascii', 'ignore').decode("utf-8")
            # Keep only runs of a-z; anything else acts as a word separator
            line_arr = re.findall('[a-z]+', line)
            if len(line_arr) != 0:
                clean_Doc = clean_Doc + " " + ' '.join(line_arr)

    return clean_Doc
niebla = cleanDoc("niebla.txt")
dq = cleanDoc("DQ.txt")
odyssey = cleanDoc("odyssey.txt")
awaken = cleanDoc("awaken.txt")
bb = cleanDoc("BB.txt")
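One way to confirm the 27-character requirement is to clean a short string and compare its character set with the allowed alphabet. The sketch below works on an in-memory string rather than a file; `clean_text` and the sample sentence are made up for illustration and are not the notebook's `cleanDoc`.

```python
import re
import string
import unicodedata

def clean_text(text):
    """Lowercase, strip accents, and keep only a-z words separated by single spaces."""
    text = unicodedata.normalize('NFD', text.lower())
    # Dropping non-ASCII bytes removes the combining accent marks left by NFD
    text = text.encode('ascii', 'ignore').decode('ascii')
    text = text.replace("'", "")
    return " ".join(re.findall('[a-z]+', text))

sample = "¡Qué niebla! Don't stop -- Señor Augusto (1914)."
cleaned = clean_text(sample)
allowed = set(string.ascii_lowercase + " ")

print(cleaned)                  # que niebla dont stop senor augusto
print(set(cleaned) <= allowed)  # True
```

The same subset check, `set(doc) <= allowed`, could be applied to each cleaned document as a quick assertion before shingling.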



### b) Compute exact similarity scores between the documents.  Are these the expected results?

Notes:
- You may choose or explore different values of $k$ for your shingles.
- You may choose to shingle on words and create an n-gram model, but it is recommended you shingle on letters as described in class
- You may construct your characteristic matrix or characteristic sets with or without hash functions (e.g. by using `set()`).  Note that choice of hash function should change heavily with $k$!
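To make the notes above concrete, here is a minimal sketch of character shingling and exact Jaccard similarity using plain Python sets. The helper names `shingles` and `jaccard` and the two toy strings are assumptions for illustration only.

```python
def shingles(text, k):
    """Set of all length-k character substrings of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

s1 = shingles("the cat sat", 3)
s2 = shingles("the cat ran", 3)
print(len(s1), len(s2), jaccard(s1, s2))  # 9 9 0.5
```

Each string yields nine 3-shingles, six of which are shared, so the similarity is 6/12 = 0.5.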

In [3]:
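k = 4  # shingle length; 5 or 9 are other values to try
def getDoc(doc,k):
    """Return the set of character shingles of doc."""
    shingleList =[]
    # Note: the last few slices run past the end of doc, so they are shorter
    # than k; using range(len(doc) - k + 1) as start indices would avoid them.
    for i in range(1,len(doc)):
        shingleList.append(doc[i-1:i+k-1])
    return set(shingleList)
def getCharacteristic_Col(charMatrix,shingleSet):
    """Return a 0/1 list marking which shingles in charMatrix occur in shingleSet."""
    simList = []
    bigList =charMatrix.iloc[:, 0]
    for i in range(len(bigList)):
        # 1 if this row's shingle appears in the document, otherwise 0
        simList.append(int(bigList[i] in shingleSet))
    return simList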
nieble_shingles =getDoc(niebla,k)
dq_shingles =getDoc(dq,k)
odyssey_shingles =getDoc(odyssey,k)
awaken_shingles =getDoc(awaken,k)
bb_shingles =getDoc(bb,k)

bigList = nieble_shingles | dq_shingles | odyssey_shingles | awaken_shingles|bb_shingles


#create characteristic matrix
charMatrix = pd.DataFrame(list(bigList),columns=['shingle'])
#create function to go through shingle list and check with each one in df
nieble_simList = getCharacteristic_Col(charMatrix,nieble_shingles)
dq_simList = getCharacteristic_Col(charMatrix,dq_shingles)
odyssey_simList = getCharacteristic_Col(charMatrix,odyssey_shingles)
awaken_simList = getCharacteristic_Col(charMatrix,awaken_shingles)
bb_simList = getCharacteristic_Col(charMatrix,bb_shingles)
#add each to characteristic matrix
charMatrix['Nieble']= nieble_simList
charMatrix['DQ']= dq_simList
charMatrix['Odyssey']= odyssey_simList
charMatrix['Awaken']= awaken_simList
charMatrix['BB']= bb_simList

#charMatrix.head(5)

In [4]:
df_charMatrix = charMatrix.iloc[:, 1:].copy()
#df_charMatrix.head()

In [5]:
# Single case for practice: similarity of Nieble to every document
sims = {}
docName = "Nieble"
for user2 in df_charMatrix.columns:
    col1 = docName
    col2 = user2
    # Jaccard similarity: rows where both columns are 1 over rows where either is 1
    numer = np.sum((df_charMatrix[col1]==1) & (df_charMatrix[col2]==1))
    denom = np.sum((df_charMatrix[col1]==1) | (df_charMatrix[col2]==1))
    sim = numer/denom
    sims[user2] = sim
print(sims)

{'Nieble': 1.0, 'DQ': 0.2853838065194532, 'Odyssey': 0.09849624060150376, 'Awaken': 0.10057769219374907, 'BB': 0.10232886379675371}


In [6]:
sims = {}
docs =["Nieble","DQ","Odyssey","Awaken","BB"]
def simScores(docName): #based off of nb3sol
    for user in df_charMatrix.columns:
        col1 = docName
        col2 = user
        numer = np.sum((df_charMatrix[col1]==1) & (df_charMatrix[col2]==1))
        denom = np.sum((df_charMatrix[col1]==1) | (df_charMatrix[col2]==1))
        sim = numer/denom
        sims[user] = sim
    print(sims)
for i in range(len(docs)):
    print("Similarity scores for "+ docs[i])
    simScores(docs[i])
    print("")

Similarity scores for Nieble
{'Nieble': 1.0, 'DQ': 0.2853838065194532, 'Odyssey': 0.09849624060150376, 'Awaken': 0.10057769219374907, 'BB': 0.10232886379675371}

Similarity scores for DQ
{'Nieble': 0.2853838065194532, 'DQ': 1.0, 'Odyssey': 0.08160466312360706, 'Awaken': 0.08170813718897109, 'BB': 0.08693571542510767}

Similarity scores for Odyssey
{'Nieble': 0.09849624060150376, 'DQ': 0.08160466312360706, 'Odyssey': 1.0, 'Awaken': 0.2907429345066847, 'BB': 0.3098315066252249}

Similarity scores for Awaken
{'Nieble': 0.10057769219374907, 'DQ': 0.08170813718897109, 'Odyssey': 0.2907429345066847, 'Awaken': 1.0, 'BB': 0.3135196252624778}

Similarity scores for BB
{'Nieble': 0.10232886379675371, 'DQ': 0.08693571542510767, 'Odyssey': 0.3098315066252249, 'Awaken': 0.3135196252624778, 'BB': 1.0}
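The pairwise scores above can also be computed in one step with a matrix product. The following is a sketch on a toy 0/1 matrix, not the notebook's `df_charMatrix`; `jaccard_matrix` is a hypothetical helper and it assumes every column contains at least one 1, since otherwise a union would be empty.

```python
import numpy as np
import pandas as pd

def jaccard_matrix(df):
    """Pairwise Jaccard similarity between the 0/1 columns of df."""
    C = df.to_numpy(dtype=int)
    inter = C.T @ C              # inter[i, j] = rows where both columns are 1
    sizes = np.diag(inter)       # number of 1s in each column
    union = sizes[:, None] + sizes[None, :] - inter
    return pd.DataFrame(inter / union, index=df.columns, columns=df.columns)

toy = pd.DataFrame({"A": [1, 1, 0, 1, 0],
                    "B": [1, 0, 0, 1, 1],
                    "C": [0, 0, 1, 0, 1]})
result = jaccard_matrix(toy)
print(result.round(3))
```

On the toy data, A and B share two of four rows (0.5), B and C share one of four (0.25), and A and C share none (0.0). The result is symmetric with ones on the diagonal, which matches the pattern in the printed dictionaries.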



### c) Implement minhashing with 1000 hash functions on the 5 documents, checking your results against those in part b).

- You may choose your own value of $p$ as the modulus of the hash functions.  You are encouraged to use the example code from the minhashing in class notebook to start you out.
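For hash functions of the form $(ax + b) \bmod p$ to behave like row permutations, $p$ should be a prime larger than the number of rows in the characteristic matrix. One simple way to pick such a value is sketched below; `is_prime` and `next_prime` are illustrative helpers, and trial division is assumed to be fast enough for moduli of this size.

```python
def is_prime(n):
    """Trial-division primality test; adequate for small moduli."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def next_prime(n):
    """Smallest prime strictly greater than n."""
    candidate = n + 1
    while not is_prime(candidate):
        candidate += 1
    return candidate

n_rows = 10000  # stand-in for len(characteristic matrix); replace with the real row count
print(next_prime(n_rows))
```

In practice the argument would be the actual number of shingle rows, so the chosen modulus always exceeds the largest row index.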

In [7]:
def minhash1(nhash, dfC): # function taken from nb03 (with a changed Phash value)
    '''
    Takes a number of hash functions to use (nhash) and characteristic matrix (dfC)
    '''
    # use the "universal hash":  (a*x+b) mod p, where a, b are random ints and p is a prime with p > N (the number of rows of dfC)
    np.random.seed(4022)
    Ahash = np.random.choice(range(0,10000), size=nhash)
    Bhash = np.random.choice(range(0,10000), size=nhash)
    Phash =10069  # 1439

    # STEP 2:  initialize signature matrix to all infinities

    # initialize the signature matrix
    Msig = np.full([nhash, len(dfC.columns)], fill_value=np.inf)

    # fill in the signature matrix:

    # For each row of the characteristic matrix... 
    hash_vals = [0]*nhash # initialize
    for r in range(len(dfC)):
        # STEP 3:  Compute hash values (~permuted row numbers) for that row under each hash function
        for h in range(nhash):
            hash_vals[h] = (Ahash[h]*r + Bhash[h])%Phash
        # STEP 4:  For each column, if there is a 0, do nothing...
        for c in range(len(dfC.columns)):
            # ... but if there is a 1, replace signature matrix element in that column for each hash fcn 
            # with the minimum of the hash value in this row, and the current signature matrix element
            if dfC.iloc[r,c]==1:
                for h in range(nhash):
                    if hash_vals[h] < Msig[h,c]:
                        Msig[h,c] = hash_vals[h]
    return Msig
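The triple loop above follows the row-by-row algorithm directly. As an alternative, the same signature can be computed with NumPy broadcasting; the sketch below is one possible variant and not the course's reference code. The name `minhash_vectorized`, the `seed` argument, and the toy matrix are assumptions, and the function assumes every column has at least one 1 and that `p` is a prime larger than the number of rows.

```python
import numpy as np

def minhash_vectorized(C, nhash, p, seed=0):
    """Signature matrix (nhash x n_cols) for a 0/1 characteristic matrix C."""
    C = np.asarray(C, dtype=bool)
    n_rows, n_cols = C.shape
    rng = np.random.default_rng(seed)
    a = rng.integers(1, p, size=nhash)   # a in [1, p-1], so rows never collide
    b = rng.integers(0, p, size=nhash)
    rows = np.arange(n_rows)
    # hashes[h, r] = (a_h * r + b_h) mod p, the permuted position of row r
    hashes = (a[:, None] * rows[None, :] + b[:, None]) % p
    sig = np.empty((nhash, n_cols), dtype=np.int64)
    for c in range(n_cols):
        # minimum hash value over the rows where column c holds a 1
        sig[:, c] = hashes[:, C[:, c]].min(axis=1)
    return sig

def estimate(sig, i, j):
    """Fraction of hash functions on which columns i and j agree."""
    return float(np.mean(sig[:, i] == sig[:, j]))

# Toy matrix: columns 0 and 1 are identical, column 2 is disjoint from both.
C = np.array([[1, 1, 0],
              [0, 0, 1],
              [1, 1, 0],
              [0, 0, 1],
              [1, 1, 0]])
sig = minhash_vectorized(C, nhash=200, p=7)
print(estimate(sig, 0, 1), estimate(sig, 0, 2))  # 1.0 0.0
```

Identical columns always agree, and with a collision-free hash, disjoint columns never do, which makes this toy case a useful smoke test before running on the full matrix.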

In [8]:
nHash =1000
Msig = minhash1(nHash, df_charMatrix)

In [9]:
df1 = pd.DataFrame(Msig) #puts in df
#df1.head(10005)

In [10]:
[sum(Msig[:,0]==Msig[:,k])/nHash for k in range(len(df1.columns))] #single case

[1.0, 0.319, 0.138, 0.121, 0.118]

In [11]:
def helper(index,Msig,nhash):
    # fraction of hash functions on which column `index` agrees with each column
    sims2 =[sum(Msig[:,index]==Msig[:,k])/nhash for k in range(Msig.shape[1])]
    print(sims2)
print("These are similarities with minHashing")
for i in range(5):
    helper(i,Msig,nHash)

print("Compared to the exact similarities...")
print("______________________________________________")
for i in range(len(docs)):
    print("Similarity scores for "+ docs[i])
    simScores(docs[i])
    print("")    

These are similarities with minHashing
[1.0, 0.319, 0.138, 0.121, 0.118]
[0.319, 1.0, 0.104, 0.106, 0.103]
[0.138, 0.104, 1.0, 0.323, 0.331]
[0.121, 0.106, 0.323, 1.0, 0.345]
[0.118, 0.103, 0.331, 0.345, 1.0]
Compared to the exact similarities...
______________________________________________
Similarity scores for Nieble
{'Nieble': 1.0, 'DQ': 0.2853838065194532, 'Odyssey': 0.09849624060150376, 'Awaken': 0.10057769219374907, 'BB': 0.10232886379675371}

Similarity scores for DQ
{'Nieble': 0.2853838065194532, 'DQ': 1.0, 'Odyssey': 0.08160466312360706, 'Awaken': 0.08170813718897109, 'BB': 0.08693571542510767}

Similarity scores for Odyssey
{'Nieble': 0.09849624060150376, 'DQ': 0.08160466312360706, 'Odyssey': 1.0, 'Awaken': 0.2907429345066847, 'BB': 0.3098315066252249}

Similarity scores for Awaken
{'Nieble': 0.10057769219374907, 'DQ': 0.08170813718897109, 'Odyssey': 0.2907429345066847, 'Awaken': 1.0, 'BB': 0.3135196252624778}

Similarity scores for BB
{'Nieble': 0.10232886379675371, 'DQ': 0.08693571542510767, 'Odyssey': 0.3098315066252249, 'Awaken': 0.3135196252624778, 'BB': 1.0}

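I would like to note that increasing the number of hash functions would make these estimates more precise. Even so, the values above are a good approximation: every minhash estimate is within about 0.04 of the corresponding exact similarity, although each one is slightly higher than the exact value.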
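To put a rough number on how precision depends on the number of hash functions, we can treat each hash function as an independent trial that agrees with probability equal to the true Jaccard similarity. That is an idealized assumption (truly random permutations), and `minhash_std_error` is an illustrative helper, but under it the standard error of the estimate is $\sqrt{J(1-J)/n}$.

```python
from math import sqrt

def minhash_std_error(jaccard, nhash):
    """Standard error of the agreement fraction under independent Bernoulli(jaccard) trials."""
    return sqrt(jaccard * (1 - jaccard) / nhash)

for j in (0.10, 0.29):
    for n in (1000, 4000):
        print(j, n, round(minhash_std_error(j, n), 4))
```

Under this model, quadrupling the number of hash functions halves the standard error, so precision improves slowly relative to the extra computation.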



### d) Discussion:

Can we detect expected differences here?  Are the two Spanish documents most similar to each other?  Are the two documents by the same author, with the same theme, the most similar?  What kind of alternatives might have captured the structures between these texts?



Yes, we can detect the expected differences here. Both minhashing and the exact Jaccard similarity show that *Niebla* is around 30% similar to *Don Quixote* (0.285 exact, 0.319 by minhashing) and only around 10% similar to any of the English documents. So to answer the question, yes: each Spanish document is most similar to the other Spanish document. *The Awakening* and *Beyond the Bayou* share an author and are also around 30% similar to each other (0.314 exact, 0.345 by minhashing). That is the highest score among the English pairs, but only narrowly: *The Odyssey* scores about 0.29 against *The Awakening* and about 0.31 against *Beyond the Bayou* (0.32 and 0.33 by minhashing). It may be that the general theme of individual growth in *The Awakening* and *Beyond the Bayou* has a structure similar to that of spiritual growth in *The Odyssey*. Other factors could also affect the measured structure between the texts, such as when each text was written and which portion of each document we compare. For instance, comparing one document's introduction against another document's body could affect the similarity even when the excerpts are the same length.
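One alternative mentioned in the part b) notes is to shingle on words rather than characters, which might capture phrasing more than spelling. A minimal sketch of word-level shingles follows; `word_shingles` and the two sentences are invented for illustration and are not taken from the documents.

```python
def word_shingles(text, n):
    """Set of n-word tuples (word n-grams) from whitespace-separated text."""
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

a = word_shingles("she walked down to the bayou", 2)
b = word_shingles("he walked down to the shore", 2)
similarity = len(a & b) / len(a | b)
print(round(similarity, 3))  # 0.429
```

The two toy sentences share three of seven distinct word pairs. Whether word shingles would separate the real documents by author or theme is an open question that would need to be tested on the cleaned texts.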