## **Import Libraries**

In [1]:
#!pip install selenium
#!pip install webdriver-manager
#!pip install pyyaml ua-parser user-agents fake-useragent

################ WEB SCRAPING MODULES ############
from selenium import webdriver
from webdriver_manager.firefox import GeckoDriverManager
from webdriver_manager.chrome import ChromeDriverManager
from webdriver_manager.utils import ChromeType
from selenium.webdriver.common.by import By
import bs4
from fake_useragent import UserAgent
import requests
################ TIME MODLULES ###################
import time
from datetime import date 
import datetime
############## DATA MANIPULATION MODULES #########
import os
import pandas as pd
import numpy as np
import re
from nltk.corpus import stopwords

## **Define web source**

In [2]:
link = 'https://rockmelon.com/about-autism/'

## **Read 100++ questions list**

In [3]:
df = pd.read_excel('ASDquestions7.xlsx',engine='openpyxl')

df[link]=np.nan
df.head(25)

Unnamed: 0,Question,https://birchtreecenter.org/learn/autism,https://www.myautismteam.com/resources/autism-an-overview,https://www.autism.org.uk/advice-and-guidance/what-is-autism/asperger-syndrome,https://iancommunity.org/autism-faq,https://icahn.mssm.edu/research/seaver/resources/autism-faqs,https://otsimo.com/en/frequently-asked-questions-autism/,https://rockmelon.com/about-autism/
0,What are the Autism Spectrum Disorders (ASD)?,ASD refers to a wide spectrum of neurodevelopm...,,,,,,
1,How common is autism?,According to a 2020 report commissioned by the...,It is estimated that in the United States 1.6 ...,,,,According to the U.S. Centers for Disease Cont...,
2,What causes autism? Can it be cured?,The causes of this complex disorder remain unc...,,,,,You may think that something you did or ate or...,
3,Is autism contagious?,,Autism is not a contagious condition. Autism i...,,,,,
4,Are rates of autism increasing?,,Estimates released by the Centers for Disease ...,,,,,
5,Is autism a new condition?,,"It is likely that autism has always existed, b...",,,,,
6,Is there a cure for autism?,,"There is no cure for autism. However, early in...",,,,,
7,How Is Autism Diagnosed?,,There is no one single conclusive test for aut...,,There is no blood test to diagnose autism spec...,"While there is no medical test for autism, an ...",,
8,Is autism permanent?,,There is some controversy on the topic of whet...,,,,,
9,How Is Autism Treated?,,Treatment for autism depends largely on the in...,,,,,


## **Scrape QA pairs from website**

In [4]:
# Open webpage in a new window for scraping
#driver = webdriver.Chrome(ChromeDriverManager(chrome_type=ChromeType.GOOGLE).install())   #cannot fix in colab

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get(link)



Current google-chrome version is 92.0.4515
Get LATEST driver version for 92.0.4515
Driver [/home/aceirus/.wdm/drivers/chromedriver/linux64/92.0.4515.107/chromedriver] found in cache


In [5]:
# Parse text in webpage
source = driver.page_source
soup = bs4.BeautifulSoup(source, 'html.parser')

In [34]:
# Search the questions mentioned in webpage
quesList = []
for ques in soup.find_all('h5'):
    #if(ques.text[-1]=='?'):
        toSplit = str(ques)
        quesSplit = toSplit.split(". ")
        quesClean = quesSplit[1].replace("</h5>","")
        
        print(quesClean)
        quesList.append(quesClean)

What is Autism?
What is ASD?
What are the causes?
What are evidence-based therapies? 
How common is autism?
I suspect my child has autism, what should I do next?
How is autism diagnosed? 
What therapies are available for ASD?
How can I help my child with autism?
What are the early signs of autism?
What age do children usually start to show signs of autism?
What age can a child be tested for autism?
Is early intervention necessary in mild cases of autism?
How early can you start treatment for autism?
Is autism a condition that affects only males?
Does autism often occur with other disorders?  
What is the difference between autism and learning disabilities?


In [35]:
# Search the answers for corresponding questions in quesList
ansList = []

#for ans in soup.find_all('div', {'class':'field-items'}):
for ans in soup.find_all('p', {'style':'font-size:15px'}):

    #if ans.find(string=re.compile("autism")):
            #toSplit = str(ans)            
            #ansSplit = toSplit.split("<br/>")
            #ansClean = ansSplit[1].replace("</p>","")

            print(ans.text)
            print('*'*100)
            ansList.append(ans.text)

Autism, also known as Autism Spectrum Disorder (ASD), is a life-long condition (Autism Awareness Australia, 2018). People with ASD often experience difficulties with communication, social interaction and restricted/repetitive interests and behaviours (Autism Awareness Australia, 2018). While there is no single known cause associated with autism, the numbers of children diagnosed has steadily increased over the years (Autism Awareness Australia, 2018).
****************************************************************************************************
However, the thing is, there is no one way that autism can affect a person. Every individual on the autism spectrum will present differently. There is a saying, “if you’ve met one person with autism, then you’ve met one person with autism.”
****************************************************************************************************
ASD is the acronym for ‘Autism Spectrum Disorder’. This umbrella term includes autistic disorder (als

## **Check which questions are similar**

In [36]:
stop_words = set(stopwords.words('english'))

def clean_text(sent):
    sent = sent.lower() # lowercase
    sent = re.sub(r'[^\w\s]', '', sent) # remove punctuations
    sent = re.sub('Autism Spectrum Disorder','ASD',sent) # Compress term
    sent = [w for w in sent.split() if not w.lower() in stop_words] # Remove stopwords
    sent = " ".join(sent)
    return sent

In [37]:
def jaccard_similarity(list1, list2):
    s1 = set(list1)
    s2 = set(list2)
    return float(len(s1.intersection(s2)) / len(s1.union(s2)))

In [38]:
# try to match websource questions with our own 100 questions list
for c1,i in enumerate(quesList):
    i2 = clean_text(i)
    list1 = i2.split()
    
    temp1 = 0.0
    temp2 = ''
    temp3 = ''
    temp4 = 0
    
    for c2,j in enumerate(df['Question']):
        j2 = clean_text(j)
        list2 = j2.split()
            
        sim = jaccard_similarity(list1, list2)
        if(sim>temp1):
            temp1 = sim
            temp2 = j
            temp3 = i
            temp4 = c2
            
    if(temp1>=0.3):  # sim threshold
        print('Website --> ',temp3,'(Index {})'.format(c1))
        print('100 questions list --> ',temp2,'(Index {})'.format(temp4))
        print('similarity:', temp1)
        print('*'*100)

Website -->  What is Autism? (Index 0)
100 questions list -->  What is Autism? (Index 30)
similarity: 1.0
****************************************************************************************************
Website -->  What are the causes? (Index 2)
100 questions list -->  What causes autism? Can it be cured? (Index 2)
similarity: 0.3333333333333333
****************************************************************************************************
Website -->  How common is autism? (Index 4)
100 questions list -->  How common is autism? (Index 1)
similarity: 1.0
****************************************************************************************************
Website -->  I suspect my child has autism, what should I do next? (Index 5)
100 questions list -->  Why My Child Has Autism? (Index 28)
similarity: 0.5
****************************************************************************************************
Website -->  How is autism diagnosed?  (Index 6)
100 questions list -->  H

In [39]:
for i in range (0,len(quesList)):
    print("Index #{0:d}: {1:s}".format(i,quesList[i]))

Index #0: What is Autism?
Index #1: What is ASD?
Index #2: What are the causes?
Index #3: What are evidence-based therapies? 
Index #4: How common is autism?
Index #5: I suspect my child has autism, what should I do next?
Index #6: How is autism diagnosed? 
Index #7: What therapies are available for ASD?
Index #8: How can I help my child with autism?
Index #9: What are the early signs of autism?
Index #10: What age do children usually start to show signs of autism?
Index #11: What age can a child be tested for autism?
Index #12: Is early intervention necessary in mild cases of autism?
Index #13: How early can you start treatment for autism?
Index #14: Is autism a condition that affects only males?
Index #15: Does autism often occur with other disorders?  
Index #16: What is the difference between autism and learning disabilities?


In [40]:
for i in range (0,len(ansList)):
    print("Index #{0:d}: {1:s}".format(i,ansList[i]))

Index #0: Autism, also known as Autism Spectrum Disorder (ASD), is a life-long condition (Autism Awareness Australia, 2018). People with ASD often experience difficulties with communication, social interaction and restricted/repetitive interests and behaviours (Autism Awareness Australia, 2018). While there is no single known cause associated with autism, the numbers of children diagnosed has steadily increased over the years (Autism Awareness Australia, 2018).
Index #1: However, the thing is, there is no one way that autism can affect a person. Every individual on the autism spectrum will present differently. There is a saying, “if you’ve met one person with autism, then you’ve met one person with autism.”
Index #2: ASD is the acronym for ‘Autism Spectrum Disorder’. This umbrella term includes autistic disorder (also known as “classic autism”), Asperger’s disorder, and pervasive developmental disorder not otherwise specified (also known as “atypical autism”) according to the Fourth Ed

In [41]:
# add websource answer to matched question in existing dataframe
df[link].loc[30]=ansList[0]+ansList[1] #0
df[link].loc[1]=ansList[7]+ansList[8] #4
df[link].loc[7]=ansList[16]+ansList[17] #6
df[link].loc[51]=ansList[27]+ansList[28]+ansList[29] #11

In [42]:
df[link].loc[51]

'There are scientifically validated tests, such as the M-CHAT, that can be used to screen for autism in children as young as 15-18 months. Screening tests, however, are not a diagnosis and will pick up some children who don’t go on to have ASD. What screening tests do is point to social problems that should be investigated by a health professional, such as a paediatrician or child psychologist. For some children with ASD, a diagnosis can be reliably made at the age of two years by an experienced professional, although a more common age for diagnosis is around 3-5 years. The earlier a child receives a diagnosis the better, as it means they can start on early intervention to help address their social and behavioural challenges.'

In [43]:
# update with new valid questions list
quesListUpd = [quesList[1],
               quesList[2],
               quesList[3],
               quesList[5],
               quesList[7],
               quesList[8],
               quesList[9],
               quesList[10],
               quesList[12],
               quesList[13],
               quesList[14],
               quesList[15],
               quesList[16]]
quesListUpd

['What is ASD?',
 'What are the causes?',
 'What are evidence-based therapies? ',
 'I suspect my child has autism, what should I do next?',
 'What therapies are available for ASD?',
 'How can I help my child with autism?',
 'What are the early signs of autism?',
 'What age do children usually start to show signs of autism?',
 'Is early intervention necessary in mild cases of autism?',
 'How early can you start treatment for autism?',
 'Is autism a condition that affects only males?',
 'Does autism often occur with other disorders? \xa0',
 'What is the difference between autism and learning disabilities?']

In [50]:
# manually select answers to updated questions list
ansListUpd = [ansList[2]+ansList[3],
              ansList[4],
              ansList[5]+ansList[6],
              ansList[9]+ansList[10],
              ansList[16]+ansList[17],
              ansList[18]+ansList[19]+ansList[20],
              ansList[21]+ansList[22],
              ansList[23]+ansList[24]+ansList[26],
              ansList[30]+ansList[31]+ansList[32],
              ansList[33]+ansList[34]+ansList[35]+ansList[36],
              ansList[37]+ansList[38],
              ansList[39]+ansList[40]+ansList[41],
              ansList[42]+ansList[43]+ansList[44]]

ansListUpd

['ASD is the acronym for ‘Autism Spectrum Disorder’. This umbrella term includes autistic disorder (also known as “classic autism”), Asperger’s disorder, and pervasive developmental disorder not otherwise specified (also known as “atypical autism”) according to the Fourth Edition of the Diagnostic and Statistical Manual of Mental Disorders (DSM) issued in 1992 (Autism Awareness Australia, 2018).The DSM was published by the American Psychiatric Association and serves as the primary manual used by clinicians in the U.S., Australia, and many other countries to provide the formal criteria for various diagnoses, including autism (Autism Awareness Australia, 2018).',
 'Currently, there is no single known cause for autism, however, recent research has identified strong genetic links (Autism Awareness Australia, 2018). We do know though that autism is NOT caused by an individual’s upbringing, their social or economic circumstances, nor is it caused by vaccines or bad parenting! ',
 'Evidence-b

In [51]:
# Create new dataframe with QA pairs
df2 = pd.DataFrame(zip(quesListUpd,ansListUpd),columns=['Question',link])
df2

Unnamed: 0,Question,https://rockmelon.com/about-autism/
0,What is ASD?,ASD is the acronym for ‘Autism Spectrum Disord...
1,What are the causes?,"Currently, there is no single known cause for ..."
2,What are evidence-based therapies?,Evidence-based therapy means that there has be...
3,"I suspect my child has autism, what should I d...",If you suspect your child has autism and you d...
4,What therapies are available for ASD?,There are a number of different therapies avai...
5,How can I help my child with autism?,If you suspect your child may have a developme...
6,What are the early signs of autism?,Some of the common traits associated with auti...
7,What age do children usually start to show sig...,The most common age for parents to first becom...
8,Is early intervention necessary in mild cases ...,Early intervention is recommended for all chil...
9,How early can you start treatment for autism?,The simple answer is as early as possible. Tre...


In [52]:
# Concatenate existing and new dataframes
df3 = pd.concat([df,df2],axis=0)
df3 = df3.sort_values(by=list(df3.columns[1:])).reset_index(drop=True)
df3.head(40)

Unnamed: 0,Question,https://birchtreecenter.org/learn/autism,https://www.myautismteam.com/resources/autism-an-overview,https://www.autism.org.uk/advice-and-guidance/what-is-autism/asperger-syndrome,https://iancommunity.org/autism-faq,https://icahn.mssm.edu/research/seaver/resources/autism-faqs,https://otsimo.com/en/frequently-asked-questions-autism/,https://rockmelon.com/about-autism/
0,What are the Autism Spectrum Disorders (ASD)?,ASD refers to a wide spectrum of neurodevelopm...,,,,,,
1,How common is autism?,According to a 2020 report commissioned by the...,It is estimated that in the United States 1.6 ...,,,,According to the U.S. Centers for Disease Cont...,Recent studies have found that 1 in 59 childre...
2,What causes autism? Can it be cured?,The causes of this complex disorder remain unc...,,,,,You may think that something you did or ate or...,
3,Is autism contagious?,,Autism is not a contagious condition. Autism i...,,,,,
4,Are rates of autism increasing?,,Estimates released by the Centers for Disease ...,,,,,
5,Is autism a new condition?,,"It is likely that autism has always existed, b...",,,,,
6,Is there a cure for autism?,,"There is no cure for autism. However, early in...",,,,,
7,How Is Autism Diagnosed?,,There is no one single conclusive test for aut...,,There is no blood test to diagnose autism spec...,"While there is no medical test for autism, an ...",,There are a number of different therapies avai...
8,Is autism permanent?,,There is some controversy on the topic of whet...,,,,,
9,How Is Autism Treated?,,Treatment for autism depends largely on the in...,,,,,


## **Save Output**

In [53]:
df3.to_excel('ASDquestions8.xlsx',index=False)