# DATA APPENDIXER

## 1. Adding a knowledge base to the bot

First, we gather the data to work on from the site provided by the user. For this I will use the `requests` and Beautiful Soup (`bs4`) libraries.
After that we will proceed to data processing.
 

In [1]:
import requests
import bs4

base_url = 'https://en.wikipedia.org/wiki/Jupiter' #website from where we will get our data

### Now we will read the HTML page and gather all the data

After the cells below have run, all the collected text is stored in the variable `all_para`.

In [2]:
r = requests.get(base_url)    # fetch the page from the website
soup = bs4.BeautifulSoup(r.text, 'html5lib')  # parse the response HTML and keep the tree in `soup`

# collect the text of every <h3> heading on the page
headers = [h3.text for h3 in soup.find_all("h3")]

# keep only the meaningful headings: drop those whose text starts with a newline
headers = [header for header in headers if not header.startswith('\n')]


The code above gathers all the headings so that, in the next step, we can collect the paragraphs that appear under each of them.
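To make the idea concrete without a network call, here is a small offline sketch of "collect the paragraphs under each h3 heading". It uses the standard library's `html.parser` in place of `requests` and Beautiful Soup, and the sample HTML, the class name `SectionCollector`, and the variable `sample_paragraphs` are made up for illustration; it is not the code the notebook runs.

```python
from html.parser import HTMLParser

class SectionCollector(HTMLParser):
    """Collect <p> text that follows each <h3>, stopping at the next <h2>/<h3>."""

    def __init__(self):
        super().__init__()
        self.sections = {}      # heading text -> list of paragraph strings
        self.current = None     # heading whose section we are inside, if any
        self.open_tag = None    # 'h3' or 'p' while we are reading its text
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.current = None             # an <h2> ends the current section
        if tag in ("h3", "p"):
            self.open_tag = tag
            self.buffer = []

    def handle_data(self, data):
        if self.open_tag is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self.open_tag:
            return
        text = "".join(self.buffer).strip()
        if tag == "h3":
            self.current = text             # a new <h3> starts a new section
            self.sections[text] = []
        elif tag == "p" and self.current is not None:
            self.sections[self.current].append(text)
        self.open_tag = None

# made-up sample page
sample_html = """
<h2>Physical characteristics</h2>
<p>Intro paragraph that belongs to no h3.</p>
<h3>Composition</h3>
<p>Mostly hydrogen and helium.</p>
<p>Trace amounts of methane.</p>
<h3>Mass and size</h3>
<p>About 318 times as massive as Earth.</p>
<h2>Orbit</h2>
<p>Paragraph after an h2, so it is skipped.</p>
"""

collector = SectionCollector()
collector.feed(sample_html)
sample_paragraphs = "\n".join(p for paras in collector.sections.values() for p in paras)
print(sample_paragraphs)
```

Only the three paragraphs that sit directly under an h3 heading end up in `sample_paragraphs`, which is the same selection rule the notebook applies to the Wikipedia page.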

In [3]:
# `r.text` contains the raw HTML returned by the GET request.
# `'html5lib'` tells BeautifulSoup which parser to use to read that HTML.
r = requests.get(base_url)
all_para = ""
soup = bs4.BeautifulSoup(r.text, 'html5lib')
for header in headers:
    deet = soup.find('h3', string=header)   # find the <h3> tag whose text is this heading
    if deet is None:                        # skip headings that cannot be located again
        continue
    for para in deet.find_next_siblings():  # walk the elements that follow the heading
        if para.name in ("h2", "h3"):       # stop at the next heading
            break
        elif para.name == "p":              # keep only paragraph text
            all_para += para.get_text()
            all_para += '\n'
print(all_para)

Jupiter's upper atmosphere is about 90% hydrogen and 10% helium by volume. Since helium atoms are more massive than hydrogen molecules, Jupiter's atmosphere is approximately 75% hydrogen and 24% helium by mass, with the remaining one percent consisting of other elements. The atmosphere contains trace amounts of methane, water vapour, ammonia, and silicon-based compounds. There are also fractional amounts of carbon, ethane, hydrogen sulfide, neon, oxygen, phosphine, and sulfur. The outermost layer of the atmosphere contains crystals of frozen ammonia. Through infrared and ultraviolet measurements, trace amounts of benzene and other hydrocarbons have also been found.[38] The interior of Jupiter contains denser materials—by mass it is roughly 71% hydrogen, 24% helium, and 5% other elements.[39][40]

The atmospheric proportions of hydrogen and helium are close to the theoretical composition of the primordial solar nebula. Neon in the upper atmosphere only consists of 20 parts per million b

### Now we will save all this data in a .txt file so that we can start processing it

In [4]:
with open('./jupiter.txt', 'wb') as file_handler:
    file_handler.write(all_para.encode('utf8'))

# this will create a jupiter.txt file in the current working directory
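Because the file is written as UTF-8 bytes, it has to be read back as UTF-8 as well. The sketch below illustrates this with a temporary file; the sample string and the file name `sample.txt` are placeholders, and `cp1252` is used only as an example of a different default encoding (it is a common one on Windows).

```python
import os
import tempfile

text = "denser materials\u2014by mass"   # contains an em dash, like the Wikipedia text

# write the text as UTF-8 bytes, the same way the cell above does
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as file_handler:
    file_handler.write(text.encode("utf8"))

# reading with the matching encoding gives the original text back
with open(path, "r", encoding="utf8") as f:
    same = f.read()

# reading with a different encoding garbles every non-ASCII character
with open(path, "r", encoding="cp1252") as f:
    garbled = f.read()

print(same == text)      # True
print(garbled == text)   # False
```

Passing `encoding="utf8"` explicitly when opening the file for reading avoids depending on the platform default.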

### Importing libraries
Now we will begin the data processing for the data appendixer.
We will need the following libraries:

In [5]:
import nltk # to process text data
import random 
import string # to process standard python strings
from sklearn.metrics.pairwise import cosine_similarity # to decide how similar two sentences are
from sklearn.feature_extraction.text import TfidfVectorizer # This function helps to calculate tf-idf

nltk.download('punkt')    #tokenizer
nltk.download('wordnet')   #lemmatizer

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### Processing data

#### Open file

We will first open the .txt file and then read its contents into the variable `raw_data`.

In [6]:
'''
When you have collected your data, update the variable 'filepath' below with the location of your knowledge base. 
The knowledge base should consist of sentences in a text file.
'''
filepath = './jupiter.txt'
with open(filepath, 'r', encoding='utf8', errors='ignore') as corpus:  # the file was written as UTF-8, so read it as UTF-8
    raw_data = corpus.read()    # reading the file
print(raw_data)

Jupiter's upper atmosphere is about 90% hydrogen and 10% helium by volume. Since helium atoms are more massive than hydrogen molecules, Jupiter's atmosphere is approximately 75% hydrogen and 24% helium by mass, with the remaining one percent consisting of other elements. The atmosphere contains trace amounts of methane, water vapour, ammonia, and silicon-based compounds. There are also fractional amounts of carbon, ethane, hydrogen sulfide, neon, oxygen, phosphine, and sulfur. The outermost layer of the atmosphere contains crystals of frozen ammonia. Through infrared and ultraviolet measurements, trace amounts of benzene and other hydrocarbons have also been found.[38] The interior of Jupiter contains denser materials—by mass it is roughly 71% hydrogen, 24% helium, and 5% other elements.[39][40]

The atmospheric proportions of hydrogen and helium are close to the theoretical composition of the primordial solar nebula. Neon in the upper atmosphere only consists of 20 parts per million

#### Conversion to lower case

We will convert all text to lower case first, so that words such as "Jupiter" and "jupiter" are treated as the same token during pre-processing.

In [7]:
raw_data=raw_data.lower()
print(raw_data)

jupiter's upper atmosphere is about 90% hydrogen and 10% helium by volume. since helium atoms are more massive than hydrogen molecules, jupiter's atmosphere is approximately 75% hydrogen and 24% helium by mass, with the remaining one percent consisting of other elements. the atmosphere contains trace amounts of methane, water vapour, ammonia, and silicon-based compounds. there are also fractional amounts of carbon, ethane, hydrogen sulfide, neon, oxygen, phosphine, and sulfur. the outermost layer of the atmosphere contains crystals of frozen ammonia. through infrared and ultraviolet measurements, trace amounts of benzene and other hydrocarbons have also been found.[38] the interior of jupiter contains denser materials—by mass it is roughly 71% hydrogen, 24% helium, and 5% other elements.[39][40]

the atmospheric proportions of hydrogen and helium are close to the theoretical composition of the primordial solar nebula. neon in the upper atmosphere only consists of 20 parts per million

#### Making a list with the paragraphs as its elements.

In [8]:
raw_data_para=raw_data.split('\n')


#### Sentence segmentation

We will use the function `nltk.sent_tokenize` to convert the document into a list of sentences.

In [9]:
sent_tokens=nltk.sent_tokenize(raw_data)
print(sent_tokens)

["jupiter's upper atmosphere is about 90% hydrogen and 10% helium by volume.", "since helium atoms are more massive than hydrogen molecules, jupiter's atmosphere is approximately 75% hydrogen and 24% helium by mass, with the remaining one percent consisting of other elements.", 'the atmosphere contains trace amounts of methane, water vapour, ammonia, and silicon-based compounds.', 'there are also fractional amounts of carbon, ethane, hydrogen sulfide, neon, oxygen, phosphine, and sulfur.', 'the outermost layer of the atmosphere contains crystals of frozen ammonia.', 'through infrared and ultraviolet measurements, trace amounts of benzene and other hydrocarbons have also been found.', '[38] the interior of jupiter contains denser materials—by mass it is roughly 71% hydrogen, 24% helium, and 5% other elements.', '[39][40]\n\nthe atmospheric proportions of hydrogen and helium are close to the theoretical composition of the primordial solar nebula.', 'neon in the upper atmosphere only co
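One token in the output above, `'[39][40]\n\nthe atmospheric proportions ...'`, spans a paragraph break because the whole text was tokenized at once. A possible alternative, sketched below, is to tokenize each paragraph separately and remember which paragraph every sentence came from. `naive_sent_tokenize` is a hypothetical regex stand-in for `nltk.sent_tokenize` so that the example runs without NLTK data, and the sample text and the names `sentence_to_para` and `sample_sent_tokens` are illustrative only.

```python
import re

def naive_sent_tokenize(text):
    """Hypothetical stand-in for nltk.sent_tokenize: split after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

sample = (
    "the atmosphere contains trace amounts of methane. benzene has also been found.[38]\n"
    "\n"
    "the atmospheric proportions of hydrogen and helium are close to the solar nebula."
)

sentence_to_para = {}                      # sentence -> paragraph it came from
for para in sample.split('\n'):
    if not para.strip():                   # skip the empty strings between paragraphs
        continue
    for sentence in naive_sent_tokenize(para):
        sentence_to_para[sentence] = para

sample_sent_tokens = list(sentence_to_para)
print(sample_sent_tokens)
```

With this layout no sentence token can contain a newline, and the paragraph for a matched sentence is a dictionary lookup rather than a substring search.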

#### Word tokenization

We will use the function `nltk.word_tokenize` to convert the text into a list of words.

In [10]:
word_tokens=nltk.word_tokenize(raw_data)
print(word_tokens)

['jupiter', "'s", 'upper', 'atmosphere', 'is', 'about', '90', '%', 'hydrogen', 'and', '10', '%', 'helium', 'by', 'volume', '.', 'since', 'helium', 'atoms', 'are', 'more', 'massive', 'than', 'hydrogen', 'molecules', ',', 'jupiter', "'s", 'atmosphere', 'is', 'approximately', '75', '%', 'hydrogen', 'and', '24', '%', 'helium', 'by', 'mass', ',', 'with', 'the', 'remaining', 'one', 'percent', 'consisting', 'of', 'other', 'elements', '.', 'the', 'atmosphere', 'contains', 'trace', 'amounts', 'of', 'methane', ',', 'water', 'vapour', ',', 'ammonia', ',', 'and', 'silicon-based', 'compounds', '.', 'there', 'are', 'also', 'fractional', 'amounts', 'of', 'carbon', ',', 'ethane', ',', 'hydrogen', 'sulfide', ',', 'neon', ',', 'oxygen', ',', 'phosphine', ',', 'and', 'sulfur', '.', 'the', 'outermost', 'layer', 'of', 'the', 'atmosphere', 'contains', 'crystals', 'of', 'frozen', 'ammonia', '.', 'through', 'infrared', 'and', 'ultraviolet', 'measurements', ',', 'trace', 'amounts', 'of', 'benzene', 'and', 'oth

#### Now we have our tokens!

#### Tokens are used at every later step of the appendixer: sentences are the units a question is matched against, and words are the units that get lemmatized and weighted.


### Lemmatization 

We will lemmatize our word tokens using the WordNetLemmatizer that we have downloaded.

In [11]:
lemmer = nltk.stem.WordNetLemmatizer() 
def LemTokens(tokens):
    return [lemmer.lemmatize(token) for token in tokens]

#### Normalization

This will remove punctuation since that will not be useful for our knowledge base.

In [12]:
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)
def LemNormalize(text):
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict))) 
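To see what the translation table does on its own, here is a small sketch that applies it to a made-up sentence. It leaves out the lemmatizer, and `str.split` stands in for `nltk.word_tokenize` so that it runs without NLTK.

```python
import string

# same table as above, written as a dict comprehension
remove_punct_dict = {ord(punct): None for punct in string.punctuation}

sample = "Jupiter's mass: 318 Earths, roughly!"
cleaned = sample.lower().translate(remove_punct_dict)
print(cleaned)          # jupiters mass 318 earths roughly
print(cleaned.split())  # ['jupiters', 'mass', '318', 'earths', 'roughly']
```

Note that the apostrophe is punctuation too, so "jupiter's" becomes "jupiters" rather than "jupiter".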

## 2. Matching topics with cosine similarity

Now we have pre-processed our dataset.

Next we will build document vectors and tf-idf values for our tokens. Computers are good at processing numbers, so we will convert our tokens into numbers too!
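As a rough illustration of what a tf-idf value is, the sketch below computes the textbook weighting, term count times log(N / df), for three made-up documents. This is a simplification: scikit-learn's `TfidfVectorizer` by default uses a smoothed idf and normalizes each row, so its numbers will differ. The documents and the helper `tfidf_vector` are illustrative only.

```python
import math
from collections import Counter

docs = [
    "jupiter atmosphere hydrogen helium",
    "jupiter mass earth",
    "jupiter atmosphere ammonia",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# document frequency: in how many documents does each term appear?
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_vector(tokens):
    """Map each term of one document to count * log(N / df)."""
    counts = Counter(tokens)
    return {term: count * math.log(n_docs / df[term]) for term, count in counts.items()}

vectors = [tfidf_vector(tokens) for tokens in tokenized]
print(vectors[1])
```

A term that occurs in every document ("jupiter") gets weight 0, while a term that occurs in only one document ("mass") gets the highest weight, which is why rare words dominate the matching.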

### How does a computer find similarity?

To find similarity, we will use the concept of cosine similarity.
The `cosine_similarity` function allows us to compare sentences. Imagine each sentence as a vector, which is a line pointing in some direction. Cosine similarity measures the angle between two such lines: the more similar the two sentences are, the smaller the angle between them and the higher their cosine similarity.
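For two vectors a and b, the cosine of the angle between them is the dot product a·b divided by the product of their lengths. The sketch below computes it for plain word-count vectors; the helper `cosine` and the sample sentences are illustrative, and the notebook itself relies on scikit-learn's `cosine_similarity` applied to tf-idf vectors.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine of the angle between two sparse vectors stored as {term: weight}."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0          # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

question = Counter("jupiter atmosphere hydrogen".split())
sentence_1 = Counter("jupiter atmosphere hydrogen helium".split())
sentence_2 = Counter("ammonia crystals".split())

print(round(cosine(question, sentence_1), 3))  # 0.866
print(cosine(question, sentence_2))            # 0.0
```

Sentences that share no words with the question score 0, and a sentence identical to the question would score 1.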


## 3. Using a bot to make it user friendly 

### Greetings
At the start of a conversation, if the bot encounters a greeting, it will respond with a greeting of its own.

#### Create list of inputs and responses

Creating the lists of greetings your chatbot will recognize and reply with.

In [13]:
GREETING_INPUTS = ["hello", "hi", "greetings", "sup", "what's up","hey", "hey there","greetings of the day"]
GREETING_RESPONSES = ["hi", "hey", "*nods*", "hi there", "hello", "I am glad! You are talking to me"]

#### Create function to receive and return greetings

In [14]:
def greeting(sentence):
    for word in sentence.split(): # Looks at each word in your sentence
        if word.lower() in GREETING_INPUTS: # checks if the word matches a GREETING_INPUT
            return random.choice(GREETING_RESPONSES) # replies with a GREETING_RESPONSE
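Because `greeting` compares one word at a time, a multi-word entry such as "what's up" can never equal a single word from `sentence.split()`, and a word with attached punctuation such as "hi!" is missed as well. One possible variant, sketched below, matches whole words and phrases with a regular expression. The name `greeting_phrase_aware` and the pattern are assumptions made for illustration, not part of the original notebook.

```python
import random
import re

GREETING_INPUTS = ["hello", "hi", "greetings", "sup", "what's up", "hey", "hey there", "greetings of the day"]
GREETING_RESPONSES = ["hi", "hey", "*nods*", "hi there", "hello", "I am glad! You are talking to me"]

# one pattern that matches any greeting as a whole word or phrase;
# longer phrases come first so "hey there" is tried before "hey"
_alternatives = "|".join(re.escape(g) for g in sorted(GREETING_INPUTS, key=len, reverse=True))
_greeting_pattern = re.compile(r"(?<![\w'])(?:" + _alternatives + r")(?![\w'])")

def greeting_phrase_aware(sentence):
    """Hypothetical variant of greeting(): matches multi-word greetings and ignores punctuation."""
    if _greeting_pattern.search(sentence.lower()):
        return random.choice(GREETING_RESPONSES)
    return None

print(greeting_phrase_aware("What's up?") is not None)    # True
print(greeting_phrase_aware("this is high") is not None)  # False
```

The lookarounds keep "hi" from matching inside longer words such as "this" or "high".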

#### Create function to receive questions and return answers

Now we define a function that computes a response when someone asks the bot a question.

The response function:
1. Takes in a question
2. Uses cosine similarity to score every sentence against the question
3. Finds the most relevant sentence
4. Returns the related paragraph as the answer.


In [15]:
def response(user_response):
    robo_response = ''      # initialize the reply string
    sent_tokens.append(user_response)             # add the question to sent_tokens (the caller removes it afterwards)
    TfidfVec = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english')
    tfidf = TfidfVec.fit_transform(sent_tokens)                 # tf-idf matrix, one row per sentence
    vals = cosine_similarity(tfidf[-1], tfidf)                  # similarity of the question (last row) to every row
    idx = vals.argsort()[0][-2]   # index of the second-highest score; the highest is the question itself
    flat = vals.flatten()
    flat.sort()                   # sort in ascending order
    req_tfidf = flat[-2]          # the second-highest similarity score
    counterVar = sent_tokens[idx]         # the most relevant sentence
    if req_tfidf == 0:            # no sentence shares a weighted term with the question
        robo_response = robo_response + "I am sorry! I don't understand you"
        return robo_response
    else:
        printable = " "           # stays " " if no single paragraph contains the sentence
        for i in raw_data_para:               # find the paragraph that contains the sentence
            if counterVar in i:
                printable = i
        robo_response = robo_response + printable
        return robo_response
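The function takes the second-highest similarity rather than the highest because the question itself was appended to `sent_tokens`, so the best match for the question is always the question. The sketch below mirrors that selection in plain Python; the sentences and scores are made-up numbers for illustration.

```python
# similarity of the question to every entry; the question itself is the last entry
sample_sentences = ["jupiter is a gas giant", "ammonia crystals form clouds", "what is jupiter"]
sample_scores = [0.58, 0.0, 1.0]     # made-up numbers for illustration

# indices ordered by ascending score, like vals.argsort()[0]
order = sorted(range(len(sample_scores)), key=lambda i: sample_scores[i])
best_other = order[-2]               # order[-1] is the question matching itself

print(sample_sentences[best_other])     # jupiter is a gas giant
print(sample_scores[best_other] == 0)   # False, so there is an answer to return
```

If the second-highest score were 0, nothing in the knowledge base would share a weighted term with the question, which is the case the "I am sorry!" branch handles.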

#### Testing the response function


In [16]:
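user_response = 'the atmospheric proportions of hydrogen'
answer = response(user_response)
sent_tokens.remove(user_response)   # response() appended the question, so take it out again
answer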



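' '

The result is a single space because the best-matching sentence token was not found inside any one paragraph. This happens when the sentence tokenizer joins text across a paragraph break, as in the token `'[39][40]\n\nthe atmospheric proportions ...'` shown in the sentence segmentation output.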

## 4. Testing the data appendixer

In [17]:
flag = True
print("Appendixer: My name is Data Appendixer. I will answer all your questions related to Jupiter planet. If you want to exit, type Bye!")
while flag:
    user_response = input()
    user_response = user_response.lower()
    if user_response != 'bye':
        if user_response == 'thanks' or user_response == 'thank you':
            flag = False
            print("Appendixer: You are welcome..")
        else:
            greeting_reply = greeting(user_response)   # call once so the check and the reply agree
            if greeting_reply is not None:
                print("Appendixer: " + greeting_reply)
            else:
                print("Appendixer: ", end="")
                print(response(user_response))
                sent_tokens.remove(user_response)   # undo the append done inside response()
    else:
        flag = False
        print("Appendixer: Bye! take care..")

Appendixer: My name is Data Appendixer. I will answer all your questions related to Jupiter planet. If you want to exit, type Bye!
hi
Appendixer: I am glad! You are talking to me
jupiter mass
Appendixer: jupiter's mass is 2.5 times that of all the other planets in the solar system combined—this is so massive that its barycentre with the sun lies above the sun's surface at 1.068 solar radii from the sun's centre.[44] jupiter is much larger than earth and considerably less dense: its volume is that of about 1,321 earths, but it is only 318 times as massive.[7][45] jupiter's radius is about one tenth the radius of the sun,[46] and its mass is one thousandth the mass of the sun, so the densities of the two bodies are similar.[47] a "jupiter mass" (mj or mjup) is often used as a unit to describe masses of other objects, particularly extrasolar planets and brown dwarfs. for example, the extrasolar planet hd 209458 b has a mass of 0.69 mj, while kappa andromedae b has a mass of 12.8 mj.[
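The branching in the loop can also be tried without typing into `input()`. The sketch below pulls the decision for a single turn into a function that returns the reply and whether the conversation continues. The names `reply`, `fake_greet`, and `fake_answer` are hypothetical stand-ins for illustration; in the real loop, `greeting` and `response` would be passed in, and the question would still need to be removed from `sent_tokens` after `response` runs.

```python
def reply(user_input, greet, answer):
    """Return (text, keep_going) for one turn of the conversation.

    greet and answer are passed in so the function can be tried without
    the real greeting() and response() functions.
    """
    text = user_input.lower()
    if text == 'bye':
        return "Bye! take care..", False
    if text in ('thanks', 'thank you'):
        return "You are welcome..", False
    greeting_reply = greet(text)
    if greeting_reply is not None:
        return greeting_reply, True
    return answer(text), True

# stand-ins for illustration only
def fake_greet(text):
    return "hello" if text in ("hi", "hello") else None

def fake_answer(text):
    return "(paragraph about " + text + ")"

for line in ["Hi", "jupiter mass", "Bye"]:
    text, keep_going = reply(line, fake_greet, fake_answer)
    print("Appendixer: " + text)
    if not keep_going:
        break
```

Separating the decision from the input and output makes each branch of the conversation easy to check on its own.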

## That's all from my side; I hope you find my appendixer useful.