## **Import Libraries**

In [1]:
#!pip install selenium
#!pip install webdriver-manager
#!pip install pyyaml ua-parser user-agents fake-useragent

################ WEB SCRAPING MODULES ############
from selenium import webdriver
from webdriver_manager.firefox import GeckoDriverManager
from webdriver_manager.chrome import ChromeDriverManager
from webdriver_manager.utils import ChromeType
from selenium.webdriver.common.by import By
import bs4
from fake_useragent import UserAgent
import requests
################ TIME MODULES ####################
import time
from datetime import date 
import datetime
############## DATA MANIPULATION MODULES #########
import os
import pandas as pd
import numpy as np
import re
from nltk.corpus import stopwords

## **Define web source**

In [2]:
link = 'https://iancommunity.org/autism-faq'

## **Read the list of 100+ questions**

In [3]:
df = pd.read_excel('ASDquestions4.xlsx',engine='openpyxl')

df[link]=np.nan
df.head(15)

Unnamed: 0,Question,https://birchtreecenter.org/learn/autism,https://www.myautismteam.com/resources/autism-an-overview,https://www.autism.org.uk/advice-and-guidance/what-is-autism/asperger-syndrome,https://iancommunity.org/autism-faq
0,What are the Autism Spectrum Disorders (ASD)?,ASD refers to a wide spectrum of neurodevelopm...,,,
1,How common is autism?,According to a 2020 report commissioned by the...,It is estimated that in the United States 1.6 ...,,
2,What causes autism? Can it be cured?,The causes of this complex disorder remain unc...,,,
3,Is autism contagious?,,Autism is not a contagious condition. Autism i...,,
4,Are rates of autism increasing?,,Estimates released by the Centers for Disease ...,,
5,Is autism a new condition?,,"It is likely that autism has always existed, b...",,
6,Is there a cure for autism?,,"There is no cure for autism. However, early in...",,
7,How Is Autism Diagnosed?,,There is no one single conclusive test for aut...,,
8,Is autism permanent?,,There is some controversy on the topic of whet...,,
9,How Is Autism Treated?,,Treatment for autism depends largely on the in...,,


## **Scrape QA pairs from website**

In [4]:
# Open webpage in a new window for scraping
#driver = webdriver.Chrome(ChromeDriverManager(chrome_type=ChromeType.GOOGLE).install())   #cannot fix in colab

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get(link)



Current google-chrome version is 92.0.4515
Get LATEST driver version for 92.0.4515
Driver [/home/aceirus/.wdm/drivers/chromedriver/linux64/92.0.4515.107/chromedriver] found in cache


In [6]:
# Parse text in webpage
source = driver.page_source
soup = bs4.BeautifulSoup(source, 'html.parser')

In [73]:
# Collect the questions on the page: <strong> elements whose text ends with '?'
quesList = []
for ques in soup.find_all('strong'):
    if ques.text.endswith('?'):  # safe on empty text, unlike ques.text[-1]
        print(ques.text)
        quesList.append(ques.text)

I think I/my child may have autism but I'm not sure. How can I find out?
How is autism diagnosed? Is there a test for it?
How can I find out what caused my child's autism?
How can I find out if my/my child's case is genetic? Can we tell which side of the family the autism came from?
If there is no autism epidemic, why do the autism statistics just keep climbing?
How can I be sure vaccines have nothing to do with autism?
How can I find the best treatments for myself or my child?
What progress has been made so far in autism research?
How can I help in the search for better treatments?


In [68]:
# Search the answers for corresponding questions in quesList
ansList = []

#for ans in soup.find_all('div', {'class':'field-items'}):
for ans in soup.find_all('p'):

    if ans.find('br'):
        toSplit = str(ans)
        ansSplit = toSplit.split("<br/>\n")
        ansClean = ansSplit[1].replace("</p>","")

        print(ansClean)
        print('*'*100)
        ansList.append(ansClean)

The core features of autism are difficulties with social interaction and communication, and the presence of repetitive behaviors and restricted interests. However, the specific symptoms and the severity are highly variable among different individuals. If you have concerns about your child, you should speak with your pediatrician and ask that your child be screened for autism. The American Academy of Pediatrics recommends that all children be screened for autism between 18 and 24 months of age. If you or your child's pediatrician feel that further screening is required, your child should be referred to a specialized medical professional, such as a developmental pediatrician, child psychologist/psychiatrist, or neurologist, who can conduct autism-specific behavioral evaluations.
****************************************************************************************************
There is no blood test to diagnose autism spectrum disorder. A diagnosis is made based on behaviors. In order t
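
The string split on `"<br/>\n"` above depends on the exact serialized markup and collects answers separately from questions, so the two lists can drift out of step. A minimal sketch of an alternative that walks the parse tree and keeps each question with its answer is shown here. It assumes every QA pair sits in one `<p>` as `<strong>question</strong><br/>answer`; `sample_html` and `extract_qa_pairs` are hypothetical names for illustration, and the real page may be structured differently.

```python
import bs4

# Hypothetical markup mirroring the structure the notebook's parsing implies.
sample_html = """
<div>
<p><strong>How is autism diagnosed? Is there a test for it?</strong><br/>
There is no blood test to diagnose autism spectrum disorder.</p>
<p>A paragraph without a question.</p>
</div>
"""

def extract_qa_pairs(markup):
    """Return (question, answer) tuples found in <p> elements."""
    soup = bs4.BeautifulSoup(markup, "html.parser")
    pairs = []
    for p in soup.find_all("p"):
        strong = p.find("strong")
        br = p.find("br")
        if strong is None or br is None:
            continue  # not a QA paragraph
        question = strong.get_text(strip=True)
        if not question.endswith("?"):
            continue
        # Treat everything after the <br/> inside the same <p> as the answer.
        answer = "".join(
            node.get_text() if isinstance(node, bs4.Tag) else str(node)
            for node in br.next_siblings
        ).strip()
        pairs.append((question, answer))
    return pairs

for question, answer in extract_qa_pairs(sample_html):
    print(question)
    print(answer)
```

Because each answer is read from the same `<p>` as its question, a paragraph that lacks either part is skipped as a whole instead of shifting later indices.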

## **Check which questions are similar**

In [69]:
stop_words = set(stopwords.words('english'))

def clean_text(sent):
    sent = sent.lower()  # lowercase
    sent = re.sub(r'[^\w\s]', '', sent)  # remove punctuation
    # Compress the term; the pattern must be lowercase because the text already is
    sent = re.sub('autism spectrum disorder', 'asd', sent)
    sent = [w for w in sent.split() if w not in stop_words]  # remove stopwords
    return " ".join(sent)

In [70]:
def jaccard_similarity(list1, list2):
    s1 = set(list1)
    s2 = set(list2)
    union = s1.union(s2)
    if not union:  # both inputs empty: avoid ZeroDivisionError
        return 0.0
    return len(s1.intersection(s2)) / len(union)

In [71]:
# Match each website question against our own list of 100+ questions
for site_idx, site_ques in enumerate(quesList):
    site_tokens = clean_text(site_ques).split()

    best_sim = 0.0   # highest similarity seen so far
    best_ques = ''   # question from our list with that similarity
    best_idx = 0     # its position in df['Question']

    for list_idx, list_ques in enumerate(df['Question']):
        list_tokens = clean_text(list_ques).split()

        sim = jaccard_similarity(site_tokens, list_tokens)
        if sim > best_sim:
            best_sim = sim
            best_ques = list_ques
            best_idx = list_idx

    if best_sim >= 0.3:  # similarity threshold
        print('Website --> ', site_ques, '(Index {})'.format(site_idx))
        print('100 questions list --> ', best_ques, '(Index {})'.format(best_idx))
        print('similarity:', best_sim)
        print('*'*100)

Website -->  How is autism diagnosed? Is there a test for it? (Index 1)
100 questions list -->  How Is Autism Diagnosed? (Index 7)
similarity: 0.6666666666666666
****************************************************************************************************
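
The reported similarity can be checked by hand. The sketch below reproduces the computation for the matched pair using only the standard library. `STOP_WORDS` is a small hand-picked subset assumed for illustration (the notebook uses NLTK's full English list), and `tokens` and `jaccard` are hypothetical stand-ins for `clean_text` and `jaccard_similarity`.

```python
import re

# Assumption: a tiny stopword subset, just enough for these two questions.
STOP_WORDS = {"how", "is", "there", "a", "for", "it"}

def tokens(sentence):
    """Lowercase, strip punctuation, drop stopwords, return the word set."""
    lowered = sentence.lower()
    no_punct = re.sub(r"[^\w\s]", "", lowered)
    return {w for w in no_punct.split() if w not in STOP_WORDS}

def jaccard(a, b):
    """Size of the intersection over size of the union; 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

site = tokens("How is autism diagnosed? Is there a test for it?")
ours = tokens("How Is Autism Diagnosed?")

print(sorted(site))         # ['autism', 'diagnosed', 'test']
print(sorted(ours))         # ['autism', 'diagnosed']
print(jaccard(site, ours))  # 2 shared words / 3 distinct words
```

Two of the three distinct words are shared, which gives the 2/3 printed above and clears the 0.3 threshold.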


In [74]:
for i in range (0,len(quesList)):
    print("Index #{0:d}: {1:s}".format(i,quesList[i]))

Index #0: I think I/my child may have autism but I'm not sure. How can I find out?
Index #1: How is autism diagnosed? Is there a test for it?
Index #2: How can I find out what caused my child's autism?
Index #3: How can I find out if my/my child's case is genetic? Can we tell which side of the family the autism came from?
Index #4: If there is no autism epidemic, why do the autism statistics just keep climbing?
Index #5: How can I be sure vaccines have nothing to do with autism?
Index #6: How can I find the best treatments for myself or my child?
Index #7: What progress has been made so far in autism research?
Index #8: How can I help in the search for better treatments?


In [75]:
for i in range (0,len(ansList)):
    print("Index #{0:d}: {1:s}".format(i,ansList[i]))

Index #0: The core features of autism are difficulties with social interaction and communication, and the presence of repetitive behaviors and restricted interests. However, the specific symptoms and the severity are highly variable among different individuals. If you have concerns about your child, you should speak with your pediatrician and ask that your child be screened for autism. The American Academy of Pediatrics recommends that all children be screened for autism between 18 and 24 months of age. If you or your child's pediatrician feel that further screening is required, your child should be referred to a specialized medical professional, such as a developmental pediatrician, child psychologist/psychiatrist, or neurologist, who can conduct autism-specific behavioral evaluations.
Index #1: There is no blood test to diagnose autism spectrum disorder. A diagnosis is made based on behaviors. In order to be diagnosed with autism, an individual must display deficits in social communi

In [76]:
# Add the website's answer to the matched question in the existing dataframe.
# A single .loc call avoids the chained assignment df[link].loc[7] = ...,
# which raises SettingWithCopyWarning and may write to a temporary copy.
df.loc[7, link] = ansList[1]


In [78]:
df[link].loc[7]

'There is no blood test to diagnose autism spectrum disorder. A diagnosis is made based on behaviors. In order to be diagnosed with autism, an individual must display deficits in social communication and social interaction, and show restrictive and repetitive behaviors.'

In [86]:
# Keep only the website questions with no match in the existing list (drop index 1)
quesListUpd = quesList[0:1]+quesList[2:9]

quesListUpd

["I think I/my child may have autism but I'm not sure. How can I find out?",
 "How can I find out what caused my child's autism?",
 "How can I find out if my/my child's case is genetic? Can we tell which side of the family the autism came from?",
 'If there is no autism epidemic, why do the autism statistics just keep climbing?',
 'How can I be sure vaccines have nothing to do with autism?',
 'How can I find the best treatments for myself or my child?',
 'What progress has been made so far in autism research?',
 'How can I help in the search for better treatments?']

In [87]:
# manually select answers to updated questions list
ansListUpd = ansList[0:1]+ansList[2:9]

ansListUpd

["The core features of autism are difficulties with social interaction and communication, and the presence of repetitive behaviors and restricted interests. However, the specific symptoms and the severity are highly variable among different individuals. If you have concerns about your child, you should speak with your pediatrician and ask that your child be screened for autism. The American Academy of Pediatrics recommends that all children be screened for autism between 18 and 24 months of age. If you or your child's pediatrician feel that further screening is required, your child should be referred to a specialized medical professional, such as a developmental pediatrician, child psychologist/psychiatrist, or neurologist, who can conduct autism-specific behavioral evaluations.",
 'In most individuals it is currently not possible to identify the exact cause of autism. There are a few genetic syndromes associated with autism (for example, Rett syndrome and fragile X syndrome) in which 
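
The two slices above hard-code the single matched index (1). If more questions were matched, the same filtering could be driven by a set of indices so that questions and answers stay aligned. This is a minimal sketch; `drop_indices`, `matched`, and the placeholder lists are hypothetical names for illustration, not part of the notebook.

```python
def drop_indices(items, matched):
    """Return the items whose position is not in the `matched` set."""
    return [item for i, item in enumerate(items) if i not in matched]

# Placeholder data standing in for quesList and ansList.
ques = ["q0", "q1", "q2", "q3"]
ans = ["a0", "a1", "a2", "a3"]
matched = {1}  # positions already merged into the existing dataframe

ques_upd = drop_indices(ques, matched)
ans_upd = drop_indices(ans, matched)

print(ques_upd)  # ['q0', 'q2', 'q3']
print(ans_upd)   # ['a0', 'a2', 'a3']
```

Applying one index set to both lists keeps each remaining question paired with its own answer, provided the two lists were aligned to begin with.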

In [88]:
# Create new dataframe with QA pairs
df2 = pd.DataFrame(zip(quesListUpd,ansListUpd),columns=['Question',link])
df2

Unnamed: 0,Question,https://iancommunity.org/autism-faq
0,I think I/my child may have autism but I'm not...,The core features of autism are difficulties w...
1,How can I find out what caused my child's autism?,In most individuals it is currently not possib...
2,How can I find out if my/my child's case is ge...,If your child has autism and you are intereste...
3,"If there is no autism epidemic, why do the aut...","Just days after my live TED talk, the Centers ..."
4,How can I be sure vaccines have nothing to do ...,Immunizations are a cornerstone of public heal...
5,How can I find the best treatments for myself ...,Research suggests that early intensive behavio...
6,What progress has been made so far in autism r...,"In the past five years, scientists have made s..."
7,How can I help in the search for better treatm...,"In order for autism research to advance, parti..."


In [91]:
# Concatenate existing and new dataframes
df3 = pd.concat([df,df2],axis=0)
df3 = df3.sort_values(by=list(df3.columns[1:])).reset_index(drop=True)
df3.head(25)

Unnamed: 0,Question,https://birchtreecenter.org/learn/autism,https://www.myautismteam.com/resources/autism-an-overview,https://www.autism.org.uk/advice-and-guidance/what-is-autism/asperger-syndrome,https://iancommunity.org/autism-faq
0,What are the Autism Spectrum Disorders (ASD)?,ASD refers to a wide spectrum of neurodevelopm...,,,
1,How common is autism?,According to a 2020 report commissioned by the...,It is estimated that in the United States 1.6 ...,,
2,What causes autism? Can it be cured?,The causes of this complex disorder remain unc...,,,
3,Is autism contagious?,,Autism is not a contagious condition. Autism i...,,
4,Are rates of autism increasing?,,Estimates released by the Centers for Disease ...,,
5,Is autism a new condition?,,"It is likely that autism has always existed, b...",,
6,Is there a cure for autism?,,"There is no cure for autism. However, early in...",,
7,How Is Autism Diagnosed?,,There is no one single conclusive test for aut...,,There is no blood test to diagnose autism spec...
8,Is autism permanent?,,There is some controversy on the topic of whet...,,
9,How Is Autism Treated?,,Treatment for autism depends largely on the in...,,


## **Save Output**

In [92]:
df3.to_excel('ASDquestions5.xlsx',index=False)