## **Import Libraries**

In [22]:
#!pip install selenium
#!pip install webdriver-manager
#!pip install pyyaml ua-parser user-agents fake-useragent

################ WEB SCRAPING MODULES ############
from selenium import webdriver
from webdriver_manager.firefox import GeckoDriverManager
from webdriver_manager.chrome import ChromeDriverManager
from webdriver_manager.utils import ChromeType
from selenium.webdriver.common.by import By
import bs4
from fake_useragent import UserAgent
import requests
################ TIME MODULES ####################
import time
from datetime import date 
import datetime
############## DATA MANIPULATION MODULES #########
import os
import pandas as pd
import numpy as np
import re
from nltk.corpus import stopwords

## **Define web source**

In [23]:
link = 'https://www.autism.org.sg/living-with-autism/what-is-autism'

## **Read the list of 100+ questions**

In [24]:
df = pd.read_excel('ASDquestions3r.xlsx',engine='openpyxl')

df[link]=np.nan
df.head(55)

Unnamed: 0,Question,https://birchtreecenter.org/learn/autism,https://www.myautismteam.com/resources/autism-an-overview,https://www.autism.org.uk/advice-and-guidance/what-is-autism/asperger-syndrome,https://iancommunity.org/autism-faq,https://icahn.mssm.edu/research/seaver/resources/autism-faqs,https://otsimo.com/en/frequently-asked-questions-autism/,https://rockmelon.com/about-autism/,https://www.amaze.org.au/understand-autism/about-autism/,https://autismrecovery.sg/autism/what-is-aspergers/,https://www.autism.org.sg/living-with-autism/what-is-autism
0,What are the Autism Spectrum Disorders (ASD)?,ASD refers to a wide spectrum of neurodevelopm...,,,,,,,,,
1,How common is autism?,According to a 2020 report commissioned by the...,It is estimated that in the United States 1.6 ...,,,,According to the U.S. Centers for Disease Cont...,Recent studies have found that 1 in 59 childre...,,,
2,What causes autism? Can it be cured?,The causes of this complex disorder remain unc...,,,,,You may think that something you did or ate or...,,There is no known cause of autism.Much researc...,,
3,Is autism contagious?,,Autism is not a contagious condition. Autism i...,,,,,,,,
4,Are rates of autism increasing?,,Estimates released by the Centers for Disease ...,,,,,,,,
5,Is autism a new condition?,,"It is likely that autism has always existed, b...",,,,,,,,
6,Is there a cure for autism?,,"There is no cure for autism. However, early in...",,,,,,,,
7,How Is Autism Diagnosed?,,There is no one single conclusive test for aut...,,There is no blood test to diagnose autism spec...,"While there is no medical test for autism, an ...",,There are a number of different therapies avai...,,,
8,Is autism permanent?,,There is some controversy on the topic of whet...,,,,,,,,
9,How Is Autism Treated?,,Treatment for autism depends largely on the in...,,,,,,,,


## **Scrape QA pairs from website**

In [4]:
# Open webpage in a new window for scraping
#driver = webdriver.Chrome(ChromeDriverManager(chrome_type=ChromeType.GOOGLE).install())   #cannot fix in colab

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get(link)



Current google-chrome version is 92.0.4515
Get LATEST driver version for 92.0.4515
Driver [/home/aceirus/.wdm/drivers/chromedriver/linux64/92.0.4515.107/chromedriver] found in cache


In [25]:
# Parse text in webpage
source = driver.page_source
soup = bs4.BeautifulSoup(source, 'html.parser')

In [26]:
# Search the questions mentioned in webpage
quesList = []
for ques in soup.find_all('h4'):
#for ques in soup.find_all('button',{'class':'btn btn-unstyled'}):

    if ques.text.endswith('?'):  # endswith() is safe even when the heading text is empty
        #toSplit = str(ques.text)
        #quesSplit = toSplit.split()
        #quesJoin = ' '.join(quesSplit)
        
        print(ques.text)
        quesList.append(ques.text)

What is autism?
What causes autism?
What is Asperger’s Syndrome (DSM IV)?
How are people of different degrees of autism understood?
Is there a cure for autism? Will my child grow out of autism?


In [28]:
# Search the answers for corresponding questions in quesList
ansList = []

#for ans in soup.find_all('div',{'class':'content-header'}):
for ans in soup.find_all('p'):

    #if ans.find(string=re.compile("autism")):
        #toSplit = str(ans.text)
        #ansSplit = toSplit.split("<br/>")
        #ansClean = toSplit.replace("\n","")

    print(ans.text)
    print('*'*100)
    ansList.append(ans.text)

Autism is a lifelong developmental disability that affects a person’s ability to make sense of the world and relate with others. Autism comes from ‘autos’, the Greek word for ‘self’, and a person on the autism spectrum is often referred to as someone who lives in a world of his own.
****************************************************************************************************
Although there are many theories, no one fully knows the definitive answer to this question. Research shows that autism can be caused by a variety of conditions that affect brain development, which may occur before, during or after birth. 
****************************************************************************************************
While the cause or combination of causes of autism is not fully understood, research suggests a biological correlation affecting the parts of the brain that process language and information coming in from the senses. Other research findings suggest that there may be an imba
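
The cells above collect questions and paragraphs into two independent lists, so answers later have to be picked by hand using `ansList` indices. A minimal sketch of one alternative is to group every `<p>` that follows a question heading until the next `<h4>` appears. The HTML fragment, the function name `pair_questions_with_answers`, and the assumption that answers follow their heading in document order are all illustrative; the real page layout may differ.

```python
import bs4

# Hypothetical HTML fragment standing in for driver.page_source
demo_html = """
<div>
  <h4>What is autism?</h4>
  <p>First paragraph.</p>
  <h4>What causes autism?</h4>
  <p>Second paragraph.</p>
  <p>Third paragraph.</p>
  <h4>Not a question</h4>
  <p>Ignored paragraph.</p>
</div>
"""

def pair_questions_with_answers(soup):
    """Map each <h4> ending in '?' to the <p> text that follows it."""
    pairs = {}
    for heading in soup.find_all("h4"):
        question = heading.get_text(strip=True)
        if not question.endswith("?"):
            continue
        paragraphs = []
        # Walk forward in document order and stop at the next heading
        for node in heading.find_all_next(["h4", "p"]):
            if node.name == "h4":
                break
            paragraphs.append(node.get_text(strip=True))
        pairs[question] = " ".join(paragraphs)
    return pairs

demo_soup = bs4.BeautifulSoup(demo_html, "html.parser")
qa_pairs = pair_questions_with_answers(demo_soup)
for question, answer in qa_pairs.items():
    print(question, "->", answer)
```

Joining the paragraphs with a space also avoids sentences running together when several paragraphs form one answer.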

## **Check which questions are similar**

In [29]:
stop_words = set(stopwords.words('english'))

def clean_text(sent):
    sent = sent.lower() # lowercase
    sent = re.sub(r'[^\w\s]', '', sent) # remove punctuation
    sent = re.sub(r'autism spectrum disorders?', 'asd', sent) # compress term (text is already lowercase here)
    sent = [w for w in sent.split() if w not in stop_words] # remove stopwords
    sent = " ".join(sent)
    return sent

In [30]:
def jaccard_similarity(list1, list2):
    s1 = set(list1)
    s2 = set(list2)
    union = s1.union(s2)
    if not union:  # both inputs empty: avoid ZeroDivisionError
        return 0.0
    return len(s1.intersection(s2)) / len(union)
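
As a worked example of the similarity score, the sketch below cleans two questions and computes their Jaccard similarity: the size of the token intersection divided by the size of the token union. To run without the NLTK corpus it uses `DEMO_STOP_WORDS`, a small hypothetical stand-in for the English stopword list, and the helper names `demo_clean` and `demo_jaccard` are illustrative.

```python
import re

# Hypothetical stand-in for nltk's English stopword list
DEMO_STOP_WORDS = {"what", "is", "it", "can", "be", "a", "the", "for", "there"}

def demo_clean(sent):
    """Lowercase, strip punctuation, and drop stopwords; returns a token list."""
    sent = re.sub(r"[^\w\s]", "", sent.lower())
    return [w for w in sent.split() if w not in DEMO_STOP_WORDS]

def demo_jaccard(tokens1, tokens2):
    """|intersection| / |union| of the two token sets (0.0 if both are empty)."""
    s1, s2 = set(tokens1), set(tokens2)
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0

site_q = demo_clean("What causes autism?")
list_q = demo_clean("What causes autism? Can it be cured?")
score = demo_jaccard(site_q, list_q)

print(site_q)   # ['causes', 'autism']
print(list_q)   # ['causes', 'autism', 'cured']
print(score)    # 2 shared tokens out of 3 distinct tokens
```

The two questions share two of three distinct tokens, giving a score of about 0.667.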

In [31]:
# Match each website question to the most similar question in our own 100-question list
for c1, site_q in enumerate(quesList):
    site_tokens = clean_text(site_q).split()

    best_sim = 0.0   # highest similarity seen so far
    best_q = ''      # best-matching question from our list
    best_idx = 0     # position of that question in df

    for c2, list_q in enumerate(df['Question']):
        list_tokens = clean_text(list_q).split()

        sim = jaccard_similarity(site_tokens, list_tokens)
        if sim > best_sim:
            best_sim = sim
            best_q = list_q
            best_idx = c2

    if best_sim >= 0.3:  # similarity threshold
        print('Website --> ', site_q, '(Index {})'.format(c1))
        print('100 questions list --> ', best_q, '(Index {})'.format(best_idx))
        print('similarity:', best_sim)
        print('*'*100)

Website -->  What is autism? (Index 0)
100 questions list -->  What is Autism? (Index 30)
similarity: 1.0
****************************************************************************************************
Website -->  What causes autism? (Index 1)
100 questions list -->  What causes autism? Can it be cured? (Index 2)
similarity: 0.6666666666666666
****************************************************************************************************
Website -->  What is Asperger’s Syndrome (DSM IV)? (Index 2)
100 questions list -->  What is Asperger’s Syndrome? (Index 51)
similarity: 0.5
****************************************************************************************************
Website -->  Is there a cure for autism? Will my child grow out of autism? (Index 4)
100 questions list -->  Is there a cure for autism? (Index 6)
similarity: 0.5
****************************************************************************************************
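
The matching loop can also be wrapped in a small function that returns the best match instead of printing it, which makes the result easier to reuse when writing answers back to the dataframe. This is only a sketch: `best_match` and the token lists are hypothetical, and the default threshold of 0.3 is taken from the cell above.

```python
def best_match(query_tokens, candidates, threshold=0.3):
    """Return (index, score) of the most similar candidate, or None below threshold."""
    q = set(query_tokens)
    best_idx, best_score = None, 0.0
    for idx, cand in enumerate(candidates):
        c = set(cand)
        union = q | c
        score = len(q & c) / len(union) if union else 0.0
        if score > best_score:  # strict '>' keeps the first candidate on ties
            best_idx, best_score = idx, score
    if best_idx is None or best_score < threshold:
        return None
    return best_idx, best_score

# Hypothetical, already-cleaned token lists standing in for df['Question']
candidates = [
    ["causes", "autism", "cured"],
    ["autism", "contagious"],
    ["cure", "autism"],
]

result = best_match(["causes", "autism"], candidates)
no_match = best_match(["people", "different", "degrees"], candidates)
print(result)
print(no_match)
```

Returning `None` for unmatched questions gives a direct way to collect the website questions that still need to be added as new rows.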


In [32]:
for i, ques in enumerate(quesList):
    print("Index #{0:d}: {1:s}".format(i, ques))

Index #0: What is autism?
Index #1: What causes autism?
Index #2: What is Asperger’s Syndrome (DSM IV)?
Index #3: How are people of different degrees of autism understood?
Index #4: Is there a cure for autism? Will my child grow out of autism?


In [33]:
for i, ans in enumerate(ansList):
    print("Index #{0:d}: {1:s}".format(i, ans))

Index #0: Autism is a lifelong developmental disability that affects a person’s ability to make sense of the world and relate with others. Autism comes from ‘autos’, the Greek word for ‘self’, and a person on the autism spectrum is often referred to as someone who lives in a world of his own.
Index #1: Although there are many theories, no one fully knows the definitive answer to this question. Research shows that autism can be caused by a variety of conditions that affect brain development, which may occur before, during or after birth. 
Index #2: While the cause or combination of causes of autism is not fully understood, research suggests a biological correlation affecting the parts of the brain that process language and information coming in from the senses. Other research findings suggest that there may be an imbalance in certain chemicals in the brain. Genetic factors may sometimes be involved in certain families. In reality, what we know is that autism may develop from a combinati

In [35]:
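# add websource answer to matched question in existing dataframe
# (df.loc[row, col] avoids the chained assignment df[col].loc[row], which may write to a copy)
df.loc[30, link] = ansList[0] #0
df.loc[2, link] = ansList[1]+ansList[2] #1
df.loc[51, link] = ansList[3]+ansList[4]+ansList[5] #2
df.loc[6, link] = ansList[1]+ansList[2] #4  NOTE: same paragraphs as the "causes" answer (#1); verify these indices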
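
A minimal sketch of the same write-back step on a toy dataframe, using a mapping from row label to answer indices and a single `df.loc[row, column]` assignment per row. The names `demo_df`, `demo_col`, `demo_answers`, and `demo_mapping` are hypothetical, and the column is created with `object` dtype on the assumption that it will hold strings.

```python
import numpy as np
import pandas as pd

demo_df = pd.DataFrame(
    {"Question": ["What is Autism?", "What causes autism? Can it be cured?"]}
)
demo_col = "https://example.org/faq"  # hypothetical source URL used as column name

# Empty column with object dtype so string answers can be stored later
demo_df[demo_col] = pd.Series(np.nan, index=demo_df.index, dtype="object")

demo_answers = ["Answer A.", "Answer B, part 1.", "Answer B, part 2."]

# Row label in demo_df -> indices of the scraped paragraphs forming the answer
demo_mapping = {0: [0], 1: [1, 2]}

for row, answer_ids in demo_mapping.items():
    # Single .loc call with (row, column): assigns on demo_df itself
    demo_df.loc[row, demo_col] = " ".join(demo_answers[i] for i in answer_ids)

print(demo_df)
```

Keeping the mapping in one dictionary makes it easier to review which paragraphs were assigned to which question.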


In [36]:
df[link].loc[30]

'Autism is a lifelong developmental disability that affects a person’s ability to make sense of the world and relate with others. Autism comes from ‘autos’, the Greek word for ‘self’, and a person on the autism spectrum is often referred to as someone who lives in a world of his own.'

In [50]:
# update with new valid questions list
quesListUpd = [quesList[3]]

quesListUpd

['How are people of different degrees of autism understood?']

In [51]:
# manually select answers to updated questions list
ansListUpd = [ansList[6]+ansList[7]+ansList[8]]

ansListUpd

['Autism is known as a spectrum condition as no two persons diagnosed with autism are the same. They may differ in the interaction of 2 key dimensions:Any person with autism may have differing degrees of autism as well as intellectual abilities. This helps us understand that any combination may exist and we must not make assumptions that high autism always implies low ability or vice versa.Every person with autism is to be understood so that we can find the best way to support them and help them better adapt to the community.']

In [52]:
# Create new dataframe with QA pairs
df2 = pd.DataFrame(zip(quesListUpd,ansListUpd),columns=['Question',link])
df2

Unnamed: 0,Question,https://www.autism.org.sg/living-with-autism/what-is-autism
0,How are people of different degrees of autism ...,Autism is known as a spectrum condition as no ...


In [53]:
# Concatenate existing and new dataframes
df3 = pd.concat([df,df2],axis=0)
df3 = df3.sort_values(by=list(df3.columns[1:])).reset_index(drop=True)
df3.head(55)

Unnamed: 0,Question,https://birchtreecenter.org/learn/autism,https://www.myautismteam.com/resources/autism-an-overview,https://www.autism.org.uk/advice-and-guidance/what-is-autism/asperger-syndrome,https://iancommunity.org/autism-faq,https://icahn.mssm.edu/research/seaver/resources/autism-faqs,https://otsimo.com/en/frequently-asked-questions-autism/,https://rockmelon.com/about-autism/,https://www.amaze.org.au/understand-autism/about-autism/,https://autismrecovery.sg/autism/what-is-aspergers/,https://www.autism.org.sg/living-with-autism/what-is-autism
0,What are the Autism Spectrum Disorders (ASD)?,ASD refers to a wide spectrum of neurodevelopm...,,,,,,,,,
1,How common is autism?,According to a 2020 report commissioned by the...,It is estimated that in the United States 1.6 ...,,,,According to the U.S. Centers for Disease Cont...,Recent studies have found that 1 in 59 childre...,,,
2,What causes autism? Can it be cured?,The causes of this complex disorder remain unc...,,,,,You may think that something you did or ate or...,,There is no known cause of autism.Much researc...,,"Although there are many theories, no one fully..."
3,Is autism contagious?,,Autism is not a contagious condition. Autism i...,,,,,,,,
4,Are rates of autism increasing?,,Estimates released by the Centers for Disease ...,,,,,,,,
5,Is autism a new condition?,,"It is likely that autism has always existed, b...",,,,,,,,
6,Is there a cure for autism?,,"There is no cure for autism. However, early in...",,,,,,,,"Although there are many theories, no one fully..."
7,How Is Autism Diagnosed?,,There is no one single conclusive test for aut...,,There is no blood test to diagnose autism spec...,"While there is no medical test for autism, an ...",,There are a number of different therapies avai...,,,
8,Is autism permanent?,,There is some controversy on the topic of whet...,,,,,,,,
9,How Is Autism Treated?,,Treatment for autism depends largely on the in...,,,,,,,,


## **Save Output**

In [54]:
df3.to_excel('ASDquestions10r.xlsx',index=False)