In [1]:
!pip install requests



In [2]:
import requests

URL = "https://victormatara.com/list-of-britam-insurance-branches-in-kenya/"
page = requests.get(URL)

print(page.text)

﻿<!DOCTYPE html>

<html lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
    <meta charset="utf-8" />
    <title>Temporary Page</title>
    <link rel="shortcut icon" href="https://cf.wpx.net/favicon.ico" type="image/x-icon" >
    <link href="https://fonts.googleapis.com/css?family=Montserrat" rel="stylesheet">
    <style>
        body {
            color: #fff;
            background: linear-gradient(-45deg,#fc5819, #961251);
        }

        h1,
        h6 {
            font-family: 'Montserrat', sans-serif;
            font-weight: 300;
            font-size: 1.60rem;
            text-align: center;
            position: absolute;
            right: 0;
            left: 0;
            padding-left: 50px;
            padding-right: 50px;
        }
        a {
            color: #fff;
            text-align: center;
        }

            a:hover {
                color:#961251;
            }

        .button {
            border-radius: 3rem

## Parse HTML Code With Beautiful Soup

You’ve successfully scraped some HTML from the Internet, but when you look at it, it just seems like a huge mess. There are tons of HTML elements here and there, thousands of attributes scattered around—and wasn’t there some JavaScript mixed in as well? It’s time to parse this lengthy code response with the help of Python to make it more accessible and pick out the data you want.

In [3]:
import requests
from bs4 import BeautifulSoup

URL = "https://victormatara.com/list-of-britam-insurance-branches-in-kenya/"
page = requests.get(URL)

soup = BeautifulSoup(page.content, "html.parser")

When you add the two highlighted lines of code, you create a Beautiful Soup object that takes `page.content`, which is the HTML content you scraped earlier, as its input.

* Note: You’ll want to pass `page.content` instead of `page.text` to avoid problems with character encoding. The `.content` attribute holds raw bytes, which can be decoded better than the text representation you printed earlier using the `.text` attribute.

The second argument, `"html.parser"`, makes sure that you use the appropriate parser for HTML content.

### Find Elements by ID
In an HTML web page, every element can have an id attribute assigned. As the name already suggests, that id attribute makes the element uniquely identifiable on the page. You can begin to parse your page by selecting a specific element by its ID.

Switch back to developer tools and identify the HTML object that contains all the job postings. Explore by hovering over parts of the page and using right-click to *Inspect*.

In [4]:
results = soup.find(id="inside-article")


In [5]:
print(results)

None


In [6]:
# import libraries
from newspaper import Article 
import random
import string
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import nltk
import numpy as np
import warnings
warnings.filterwarnings('ignore')

In [7]:
nltk.download('punkt', quiet=True) # Download the punkt package

True

In [8]:
#Get the article URL
article = Article('https://victormatara.com/list-of-britam-insurance-branches-in-kenya/')
article.download() # download the article
article.parse() #Parse the article
article.nlp() #Apply Natural Language Processing (NLP)
corpus = article.text # Store the article text into corpus

In [9]:
#Print the corpus

print(corpus)

This is a list of all Britam Insurance branches in Kenya. It is one of the 56 licensed insurance companies in Kenya by the Insurance Regulatory Authority. Britam began its operations in 1965 as a subsidiary of British-American Holdings, offering home service life insurance. Over the years, it has grown to become one of the leading insurance companies in the region. Actually, it commands the largest market share in the long-term insurance business according to a recent report by IRA.

Britam Insurance products cater to individuals and businesses. For personal insurance, they provide critical illness cover, education cover, funeral cover, medical cover, personal accident cover, life insurance cover (Tegemeo Term Assurance), Travel Insurance, Home Insurance, Motor Insurance, Golf Insurance, and income protection policies covers such as Akiba, Dhamana, Family income solution, and Money Back Extra Cash.

For business, they have a wide range of products such as Britam Biashara, Engineering i

In [10]:
#tokenization

text = corpus
sent_tokens = nltk.sent_tokenize(text) # text to alist of sentences

In [11]:
#Print list of sentences
print(sent_tokens)

['This is a list of all Britam Insurance branches in Kenya.', 'It is one of the 56 licensed insurance companies in Kenya by the Insurance Regulatory Authority.', 'Britam began its operations in 1965 as a subsidiary of British-American Holdings, offering home service life insurance.', 'Over the years, it has grown to become one of the leading insurance companies in the region.', 'Actually, it commands the largest market share in the long-term insurance business according to a recent report by IRA.', 'Britam Insurance products cater to individuals and businesses.', 'For personal insurance, they provide critical illness cover, education cover, funeral cover, medical cover, personal accident cover, life insurance cover (Tegemeo Term Assurance), Travel Insurance, Home Insurance, Motor Insurance, Golf Insurance, and income protection policies covers such as Akiba, Dhamana, Family income solution, and Money Back Extra Cash.', 'For business, they have a wide range of products such as Britam Bi

In [12]:
# Function to return a random greeting response to a users greeting
def greeting_response(text):
    text = text.lower()#convert all text to be lowercase
    #keyword matching
    #Greeting respnoses back to the user from the bot
    bot_greetings = ['howdy','hi','hey',"what's good",
                    'hello','hey there','sasa','mambo']
    #greeting form the user
    user_greetings = ['niaje','sasa','mambo','hi','hello','hola','greetings', 'wassup','hey']
    
    #If user's input is a greeting, return a randomly chosen greeting response
    for word in text.split():
        if word in user_greetings:
            return random.choice(bot_greetings)

In [13]:
#Return the indices of the values from an array in sorted order by the values
def index_sort(list_var):
    length = len(list_var)
    list_index = list(range(0,length))
    x = list_var
    for i in range(length):
        for j in range(length):
            if x[list_index[i]] > x[list_index[j]]:
                temp = list_index[i]
                list_index[i] = list_index[j]
                list_index[j] = temp
                return list_index

In [14]:
# Generate the response
def bot_response(user_input):
    user_input = user_input.lower() #Convert the users input to all lowercase letters
    sentence_list.append(user_input.lower()) #Append the users response to the list of sentence tokens
    bot_response='' #Create an empty response for the bot
    cm = CountVectorizer().fit_transform(sentence_list) #Create the count matrix
    similarity_scores = cosine_similarity(cm[-1], cm) #Get the similarity scores to the users input
    flatten = similarity_scores.flatten() #Reduce the dimensionality of the similarity scores
    index = index_sort(flatten) #Sort the index from 
    index = index[1:] #Get all of the similarity scores except the first (the query itself)
    response_flag=0 #Set a flag letting us know if the text contains a similarity score greater than 0.0
    
    
    #Loop the through the index list and get the 'n' number of sentences as the response
    j = 0
    for i in range(0, len(index)):
        if flatten[index[i]] > 0.0:
            bot_response = bot_response+''+sentence_list[index[i]]
            response_flag = 1
            j = j+1
            if j > 2:
                break  
    
    #if no sentence contains a similarity score greater than 0.0 then print 'I apologize, I don't understand'
        if(response_flag==0):
            bot_response = bot_response+''+"I apologize, I don't understand."
            sentence_list.remove(user_input) #Remove the users response from the sentence tokens
    return bot_response

In [15]:
# Start the chat
print("InsuBot: I am an Insurance BOT or InsuBot for short. I will answer your queries about Britam Insurance. If you want to exit, type Bye!")

exit_list = ['exit', 'see you later', 'bye', 'quit', 'break']

while(True):
    user_input = input()
    if(user_input.lower() in exit_list):
        print("InsuBot: Chat with you later !")
        break
    else:
            if(greeting_response(user_input)!=None):
                print("InsuBot: "+greeting_response(user_input))
            else:
                print("InsuBot: "+bot_response(user_input))

InsuBot: I am an Insurance BOT or InsuBot for short. I will answer your queries about Britam Insurance. If you want to exit, type Bye!
sasa
InsuBot: hey there
sasa
InsuBot: sasa
hey
InsuBot: mambo
hello
InsuBot: what's good
question


NameError: name 'sentence_list' is not defined